1. Introduction
Particles, Universe, NuClei and Hadrons for the National Research Data Infrastructure (PUNCH4NFDI) is a major German consortium funded by the DFG (German Research Foundation). It represents approximately 9,000 scientists from particle, astro-, astroparticle, hadron, and nuclear physics communities. The consortium's prime goal is to establish a federated, FAIR (Findable, Accessible, Interoperable, Reusable) science data platform. This platform aims to provide seamless access to the diverse and heterogeneous computing and storage resources distributed across participating institutions, addressing common challenges of massive data volumes and complex, resource-intensive algorithms. This document focuses on the architectural concepts—Compute4PUNCH and Storage4PUNCH—developed to federate these in-kind contributed resources.
2. Federated Heterogeneous Compute Infrastructure – Compute4PUNCH
The Compute4PUNCH concept addresses the challenge of providing unified access to a wide array of existing High-Throughput Compute (HTC), High-Performance Compute (HPC), and Cloud resources contributed by various institutions. These resources differ in architecture, OS, software, and authentication. The key constraint is minimizing changes to existing, operational systems shared by multiple communities.
2.1 Core Architecture & Integration Strategy
The strategy employs a federated overlay batch system. Instead of modifying local resource managers (such as SLURM or PBS), an HTCondor-based overlay pool is created on top of them. The COBalD/TARDIS resource meta-scheduler dynamically and transparently integrates heterogeneous backends (HPC clusters, HTC farms, cloud VMs) into this unified pool. It operates as a "pilot" system: placeholder jobs are submitted to the local batch systems to claim resources; once a pilot starts, the claimed slot joins the overlay pool, and the actual user workloads are dispatched into it.
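The pilot mechanism can be made concrete with a minimal sketch. The script below is illustrative only, not the actual TARDIS implementation: it renders a SLURM batch script whose payload starts an HTCondor worker daemon, so that the claimed node joins the overlay pool. The configuration path and resource values are hypothetical.

```python
# Conceptual pilot for a SLURM backend (illustrative; COBalD/TARDIS generates and
# manages such placeholder jobs automatically in production).

PILOT_TEMPLATE = """#!/bin/bash
#SBATCH --job-name=punch-pilot
#SBATCH --ntasks=1
#SBATCH --cpus-per-task={cpus}
#SBATCH --mem={mem_mb}M
#SBATCH --time={walltime}

# Payload: start an HTCondor worker that registers with the central overlay pool;
# user jobs are then dispatched into this slot by the HTCondor scheduler.
export CONDOR_CONFIG=/etc/condor/pilot.conf   # hypothetical pilot configuration
exec condor_master -f
"""

def render_pilot(cpus: int, mem_mb: int, walltime: str) -> str:
    """Render a SLURM batch script that contributes one slot to the overlay pool."""
    return PILOT_TEMPLATE.format(cpus=cpus, mem_mb=mem_mb, walltime=walltime)

if __name__ == "__main__":
    print(render_pilot(cpus=8, mem_mb=16000, walltime="24:00:00"))
```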
2.2 User Access & Software Environment
Access is provided via traditional login nodes and a JupyterHub service, serving as central entry points. A token-based Authentication and Authorization Infrastructure (AAI) standardizes access. Software environment complexity is managed through container technologies (Docker, Singularity/Apptainer) and the CERN Virtual Machine File System (CVMFS), which delivers pre-configured, community-specific software stacks in a scalable, read-only manner.
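As an illustration of the token-based access pattern only (the endpoint URL and environment variable name are hypothetical placeholders, not the actual PUNCH AAI configuration), a client can present an AAI-issued bearer token in an HTTP Authorization header:

```python
# Illustrative sketch: authenticating to an HTTPS/WebDAV endpoint with a bearer token.
import os
import requests

def fetch_with_token(url: str) -> bytes:
    """Download a resource, authenticating with a token obtained from the AAI."""
    token = os.environ["ACCESS_TOKEN"]  # hypothetical variable populated after login
    resp = requests.get(url, headers={"Authorization": f"Bearer {token}"}, timeout=60)
    resp.raise_for_status()
    return resp.content

if __name__ == "__main__":
    data = fetch_with_token("https://storage.example.org/punch/demo/file.root")
    print(f"retrieved {len(data)} bytes")
```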
3. Federated Storage Infrastructure – Storage4PUNCH
Storage4PUNCH aims to federate community-supplied storage systems, primarily based on dCache or XRootD technologies, which are well-established in High-Energy Physics (HEP). The federation creates a common namespace and access layer. The concept also evaluates existing technologies for caching (to reduce latency and WAN traffic) and metadata handling, aiming for deeper integration to facilitate data discovery and management across the federated storage.
4. Technical Implementation & Core Components
4.1 Compute Federation: HTCondor & COBalD/TARDIS
HTCondor: Provides the job management layer, queueing, and scheduling within the federated pool. Its ClassAd mechanism allows matching complex job requirements with dynamic resource properties.
COBalD/TARDIS: Sits between HTCondor and the heterogeneous backends. TARDIS translates HTCondor "pilots" into backend-specific submission commands (e.g., a SLURM job script). COBalD implements the decision logic for when and where to spawn these pilots based on policy, cost, and queue status. The core function can be modeled as an optimization problem: $\text{Maximize } U = \sum_{r \in R} (w_r \cdot u_r(\text{alloc}_r)) \text{ subject to } \text{alloc}_r \leq \text{cap}_r, \forall r \in R$, where $U$ is total utility, $R$ is the set of resource types, $w_r$ is a weight, $u_r$ is a utility function for resource type $r$, $\text{alloc}_r$ is allocated capacity, and $\text{cap}_r$ is total capacity.
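To make the utility model tangible, here is a toy numerical sketch. The weights, capacities, and concave utility functions are invented, and an overall provisioning budget is added so the problem is non-trivial; COBalD's real decision logic is feedback-driven rather than a one-shot optimization.

```python
# Toy illustration of  maximize U = sum_r w_r * u_r(alloc_r)  subject to  alloc_r <= cap_r,
# with an additional overall budget so that not every resource can be filled to capacity.
import math

RESOURCES = {
    # name: (weight w_r, capacity cap_r, utility u_r as a function of allocation)
    "hpc_cluster": (1.0, 100, lambda x: math.log1p(x)),        # diminishing returns
    "htc_farm":    (0.8, 200, lambda x: math.log1p(0.5 * x)),
    "cloud":       (0.5,  50, lambda x: math.log1p(2.0 * x)),
}

def greedy_allocate(total_budget: int) -> dict:
    """Assign unit allocations to whichever resource offers the best weighted marginal utility."""
    alloc = {name: 0 for name in RESOURCES}
    for _ in range(total_budget):
        best, best_gain = None, 0.0
        for name, (w, cap, u) in RESOURCES.items():
            if alloc[name] >= cap:
                continue  # respect the capacity constraint alloc_r <= cap_r
            gain = w * (u(alloc[name] + 1) - u(alloc[name]))  # weighted marginal utility
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:
            break  # all resource types are at capacity
        alloc[best] += 1
    return alloc

if __name__ == "__main__":
    print(greedy_allocate(total_budget=120))
```

For concave utility functions this greedy rule reaches the optimum of the budgeted problem; in practice the weights and utilities would encode site policies, costs, and observed demand.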
4.2 Storage Federation: dCache & XRootD
dCache: A distributed storage management system, often used as a disk frontend for tape archives. It provides standard access interfaces (NFS, WebDAV) alongside HEP-specific protocols (xrootd, GridFTP).
XRootD: A protocol and suite for scalable, fault-tolerant data access. Its "redirector" component enables building federations where a client query is directed to the appropriate data server.
Federation creates a logical layer that presents multiple physical instances as a single system, crucial for data locality-aware scheduling.
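A purely conceptual sketch of the redirection idea follows; it models neither the XRootD protocol nor its client API, and the catalogue, site names, and paths are invented. The point is only that a federation endpoint resolves a logical file name to a site that currently holds a replica, preferring local copies when possible.

```python
# Conceptual model of federation redirection (illustrative only; a real XRootD
# redirector locates replicas by querying its data servers, not a static table).

# Hypothetical replica catalogue: logical file name -> sites holding a copy.
REPLICAS = {
    "/punch/astro/images/run42.fits": ["site-a.example.org", "site-c.example.org"],
    "/punch/hep/ntuples/sample1.root": ["site-b.example.org"],
}

def redirect(logical_path: str, preferred_sites: list[str] | None = None) -> str:
    """Resolve a logical path to a site-specific URL, preferring 'nearby' sites if given."""
    sites = REPLICAS.get(logical_path)
    if not sites:
        raise FileNotFoundError(logical_path)
    if preferred_sites:
        for site in preferred_sites:
            if site in sites:
                return f"root://{site}/{logical_path.lstrip('/')}"  # data-locality hit
    return f"root://{sites[0]}/{logical_path.lstrip('/')}"          # any available replica

if __name__ == "__main__":
    print(redirect("/punch/astro/images/run42.fits", preferred_sites=["site-c.example.org"]))
```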
4.3 Software & Data Delivery: Containers & CVMFS
Containers: Ensure reproducible software environments across diverse host systems. They encapsulate complex dependencies (e.g., specific versions of ROOT, Geant4).
CVMFS: A global, distributed filesystem for software distribution. It uses HTTP and aggressive caching. Its content is published once and becomes available everywhere, solving the software deployment problem at scale. The publication process involves a "stratum 0" server and replication to "stratum 1" mirrors.
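A hedged sketch of how a job wrapper might combine the two mechanisms is shown below. Only the `/cvmfs` mount point and the `apptainer exec` command are standard; the repository name, image path, and analysis command are hypothetical.

```python
# Illustrative job wrapper: run an analysis inside a container whose image is
# delivered via CVMFS. Repository and image paths below are hypothetical.
import os
import subprocess
import sys

CVMFS_REPO = "/cvmfs/software.example.org"            # hypothetical CVMFS repository
IMAGE = f"{CVMFS_REPO}/containers/analysis-env.sif"   # hypothetical Apptainer image

def run_in_container(command: list[str]) -> int:
    """Execute a command inside the community container, provided CVMFS is mounted."""
    if not os.path.isdir(CVMFS_REPO):
        sys.exit(f"CVMFS repository {CVMFS_REPO} is not mounted on this worker node")
    # Bind /cvmfs into the container so the job can also load software from it directly.
    full_cmd = ["apptainer", "exec", "--bind", "/cvmfs", IMAGE, *command]
    return subprocess.run(full_cmd, check=False).returncode

if __name__ == "__main__":
    raise SystemExit(run_in_container(["python", "analysis.py", "--input", "run42.fits"]))
```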
5. Prototype Status & Initial Experiences
The paper reports that prototypes for both Compute4PUNCH and Storage4PUNCH have been deployed. Initial scientific applications have been executed successfully on the available prototypes, demonstrating the feasibility of the concepts. Specific performance metrics or detailed case studies are not provided in the abstract, but the successful execution validates the integration approach and the chosen technology stack.
6. Key Insights & Strategic Analysis
- Federation-over-Integration: The project prioritizes lightweight federation of existing systems over deep, disruptive integration, a pragmatic choice for a consortium with strong, independent partners.
- Leveraging HEP Heritage: Heavy reliance on battle-tested HEP technologies (HTCondor, dCache, XRootD, CVMFS) reduces risk and accelerates development.
- Abstraction is Key: Success hinges on multiple abstraction layers: COBalD/TARDIS abstracts compute resources, the storage federation abstracts data location, and containers/CVMFS abstract software environments.
- User-Centric Access: Providing familiar entry points (JupyterHub, login nodes) lowers the adoption barrier for a diverse user base.
7. Original Analysis: Core Insight, Logical Flow, Strengths & Flaws, Actionable Insights
Core Insight: PUNCH4NFDI isn't building a new supercomputer; it's orchestrating a symphony of existing, disparate instruments. Its true innovation lies in the meta-layer—the "orchestra conductor" composed of COBalD/TARDIS and federation protocols—that creates a unified resource pool without demanding homogeneity from the underlying providers. This is a strategic masterstroke for politically complex, multi-institutional collaborations, reminiscent of the federated learning paradigm in AI (like Google's work on Federated Averaging), where data remains distributed but models are aggregated.
Logical Flow: The architecture follows a clean separation of concerns. 1) Access & Identity: Token-based AAI authenticates users. 2) Compute Abstraction: User submits a job to HTCondor. COBalD/TARDIS monitors queues, decides which backend (e.g., a university's HPC cluster) has capacity, and deploys a pilot job to "claim" those resources for the HTCondor pool. The actual user job then runs within this pilot. 3) Software Environment: The job pulls its specific software stack via CVMFS or from a container registry. 4) Data Access: The job reads/writes data via the federated storage layer (dCache/XRootD), which redirects requests to the actual data location.
Strengths & Flaws: The strength is undeniable pragmatism. By wrapping existing systems, it achieves rapid deployability and buy-in from resource owners. The use of a HEP-proven technology stack (validated by the success of CERN's Worldwide LHC Computing Grid) is a major risk mitigator. However, the flaws lie in the inherent complexity of the meta-scheduling layer. COBalD/TARDIS must make intelligent provisioning decisions across heterogeneous systems with different policies, costs (e.g., cloud credits), and performance profiles. A poorly tuned policy could lead to inefficient resource utilization or job starvation. Furthermore, while the storage federation provides unified access, advanced data management features like global namespace indexing, metadata catalog federation, and intelligent data placement (akin to ideas in the Lustre parallel file system or research on automated data tiering) appear to be future evaluation items, representing a current limitation.
Actionable Insights: For other consortia (e.g., in bioinformatics or climate science), the takeaway is to invest heavily in the meta-scheduler and abstraction-layer design from day one. The PUNCH approach suggests starting with a minimal viable federation using a stable technology like HTCondor, rather than attempting a greenfield build. Resource providers should be engaged with clear, minimal API-like requirements (e.g., "must support SSH or a specific batch system command"). Crucially, the project must develop robust monitoring and auditing tools for the federated layer itself—understanding cross-site utilization and diagnosing failures in this complex chain will be operationally paramount. The future roadmap should explicitly address the integration of workflow managers (like Nextflow or Apache Airflow) and the development of the evaluated caching and metadata services to move from simple federation to intelligent, performance-optimized data logistics.
8. Technical Details & Mathematical Framework
The resource allocation problem tackled by COBalD/TARDIS can be framed as an online optimization. Let $Q(t)$ be the queue of pending jobs in HTCondor at time $t$, each with estimated runtime $\hat{r}_i$ and resource request vector $\vec{c}_i$ (CPU, memory, GPU). Let $B$ be the set of backends, each with a time-varying available capacity $\vec{C}_b(t)$ and a cost function $f_b(\vec{c}, \Delta t)$ for allocating resources $\vec{c}$ for duration $\Delta t$. The meta-scheduler's goal is to minimize the average job turnaround time $T_{ta}$ while respecting backend policies and a budget constraint. A simplified heuristic decision rule for spawning a pilot on backend $b$ could be: $\text{Spawn if } \frac{|\{j \in Q(t): \vec{c}_j \preceq \vec{C}_b(t)\}|}{\text{Cost}_b} > \theta$, where $\preceq$ denotes "fits within", $\text{Cost}_b$ is a normalized cost, and $\theta$ is a threshold. This captures the trade-off between queue demand and provisioning cost.
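The heuristic translates almost directly into code. The sketch below uses invented job and backend descriptions; a production controller would of course use live queue and capacity information and a feedback loop rather than this single rule.

```python
# Sketch of the spawn rule: provision a pilot on backend b if
#   |{ j in Q(t) : c_j fits within C_b(t) }| / Cost_b  >  theta
from dataclasses import dataclass

@dataclass
class Job:
    cpus: int
    mem_gb: int
    gpus: int = 0

@dataclass
class Backend:
    name: str
    free_cpus: int
    free_mem_gb: int
    free_gpus: int
    cost: float  # normalized cost Cost_b

def fits(job: Job, b: Backend) -> bool:
    """Componentwise comparison c_j <= C_b(t)."""
    return (job.cpus <= b.free_cpus and job.mem_gb <= b.free_mem_gb
            and job.gpus <= b.free_gpus)

def should_spawn(queue: list[Job], b: Backend, theta: float) -> bool:
    """Spawn a pilot if the queued demand that fits on b, per unit cost, exceeds theta."""
    demand = sum(1 for j in queue if fits(j, b))
    return demand / b.cost > theta

if __name__ == "__main__":
    queue = [Job(4, 8), Job(2, 4), Job(16, 64, gpus=1)]
    hpc = Backend("hpc_cluster", free_cpus=32, free_mem_gb=128, free_gpus=2, cost=1.5)
    print(should_spawn(queue, hpc, theta=1.0))  # True: 3 fitting jobs / cost 1.5 = 2.0 > 1.0
```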
9. Experimental Results & Prototype Metrics
While the abstract does not include specific quantitative results, a successful prototype implies the following qualitative outcomes and candidate metrics:
- Functional Success: Demonstrated ability to submit a single job via HTCondor/JupyterHub and have it execute transparently on a remote HPC or HTC resource, with software from CVMFS and data from federated storage.
- Key Metrics to Track (Future); a small computation sketch follows this list:
  - Job Success Rate: Percentage of jobs that complete successfully across the federation.
  - Average Wait Time: Time from submission to start, compared to native backend queues.
  - Resource Utilization: Aggregate CPU-hours delivered across the federated pool.
  - Data Transfer Efficiency: Throughput and latency for jobs accessing remote storage via the federation layer.
- Diagram Description: A conceptual architecture diagram would show: Users interacting with JupyterHub/Login Nodes. These connect to a central HTCondor Central Manager. The COBalD/TARDIS component interacts with both HTCondor and multiple Resource Backends (HPC Cluster A, HTC Farm B, Cloud C). Each backend has a local batch system (SLURM, PBS, etc.). Arrows indicate job submission and pilot deployment. A separate section shows Federated Storage (dCache, XRootD instances) connected to the backends and accessible by jobs. CVMFS Stratum 1 mirrors are shown as a layer accessible by all backends.
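As referenced in the metrics list above, these quantities could be derived from per-job accounting records. The record structure and field names below are assumptions for illustration, not an existing PUNCH monitoring schema.

```python
# Sketch: computing federation-level metrics from hypothetical per-job records.
from dataclasses import dataclass

@dataclass
class JobRecord:
    submit_time: float  # seconds since epoch
    start_time: float
    end_time: float
    succeeded: bool
    cpus: int

def job_success_rate(jobs: list[JobRecord]) -> float:
    """Percentage of jobs that completed successfully across the federation."""
    return 100.0 * sum(j.succeeded for j in jobs) / len(jobs)

def average_wait_time(jobs: list[JobRecord]) -> float:
    """Mean time (seconds) from submission to job start."""
    return sum(j.start_time - j.submit_time for j in jobs) / len(jobs)

def delivered_cpu_hours(jobs: list[JobRecord]) -> float:
    """Aggregate CPU-hours delivered by the federated pool."""
    return sum(j.cpus * (j.end_time - j.start_time) for j in jobs) / 3600.0

if __name__ == "__main__":
    jobs = [JobRecord(0, 120, 3720, True, 4), JobRecord(10, 600, 4200, False, 8)]
    print(f"success rate: {job_success_rate(jobs):.0f}%")
    print(f"avg wait:     {average_wait_time(jobs):.0f} s")
    print(f"CPU-hours:    {delivered_cpu_hours(jobs):.1f}")
```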
10. Analysis Framework: Conceptual Workflow Example
Scenario: An astroparticle physicist needs to process 1,000 telescope images using a complex, custom analysis pipeline (Python/ROOT based).
- User Entry: The researcher logs into the PUNCH JupyterHub.
- Environment Setup: In a Jupyter notebook, they select a pre-defined kernel backed by a Singularity container that contains their specific software stack (published to CVMFS).
- Job Definition: They write a script that defines the analysis task and use a PUNCH helper library to create an HTCondor submit description, specifying needed CPUs, memory, and input data references (e.g., `root://fed-storage.punch.org/path/to/images_*.fits`); a hedged sketch of such a submit description follows this walkthrough.
- Submission & Scheduling: The job is submitted to the HTCondor pool. COBalD/TARDIS, seeing 1,000 short jobs, decides to spawn multiple pilot jobs on a high-throughput farm (Backend B) with a fast local storage cache for the input data.
- Execution: Pilots claim slots on Backend B. Each pilot pulls the container, fetches its assigned input files via the XRootD federation (which may redirect to a local cache), executes the analysis, and writes results back to federated storage.
- Completion: HTCondor aggregates job completion status. The researcher's notebook can now query and visualize the results from the output storage location.
This example highlights the complete abstraction: the user never needed to know about SLURM commands on Backend B, how to install ROOT there, or the physical location of the data files.
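To make the job-definition step concrete, here is a hedged sketch using the htcondor Python bindings. The container image path, federation URL pattern, and custom attributes are illustrative assumptions, and the exact binding API differs between HTCondor versions.

```python
# Illustrative sketch of the "Job Definition" step with the htcondor Python bindings.
# Paths, URLs, and attributes are hypothetical; API details vary by HTCondor version.
import htcondor

N_IMAGES = 1000  # one job instance per telescope image

submit = htcondor.Submit({
    "executable": "run_analysis.sh",  # wrapper that starts the Python/ROOT pipeline
    # Each instance processes the $(Process)-th image read via the storage federation.
    "arguments": "root://fed-storage.punch.org/path/to/images_$(Process).fits",
    "request_cpus": "2",
    "request_memory": "4GB",
    "+SingularityImage": '"/cvmfs/software.example.org/containers/astro-pipeline.sif"',
    "output": "logs/$(Cluster).$(Process).out",
    "error":  "logs/$(Cluster).$(Process).err",
    "log":    "logs/analysis.log",
})

if __name__ == "__main__":
    schedd = htcondor.Schedd()                      # entry point to the overlay pool
    result = schedd.submit(submit, count=N_IMAGES)  # queue 1,000 job instances
    print(f"submitted cluster {result.cluster()}")
```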
11. Future Applications & Development Roadmap
The PUNCH4NFDI infrastructure lays the groundwork for transformative applications:
- Multi-Messenger Astrophysics Workflows: Real-time correlation analyses between gravitational wave (LIGO/Virgo), neutrino (IceCube), and electromagnetic observatory data, requiring urgent compute across geographically distributed resources.
- AI/ML Model Training at Scale: Federated learning experiments where the training process itself is distributed across the compute federation, with models aggregated centrally—a compute parallel to the data federation.
- Digital Twins of Complex Experiments: Running massive simulation ensembles to create digital counterparts of particle detectors or telescope arrays, leveraging HPC for simulation and HTC for parameter scans.
Development Roadmap:
- Short-term (1-2 years): Solidify production-grade deployment of Compute4PUNCH and Storage4PUNCH core services. Integrate advanced monitoring (Prometheus/Grafana) and billing/accounting tools.
- Mid-term (3-4 years): Implement and integrate the evaluated caching and global metadata catalog services. Develop tighter integration with workflow management systems. Explore "bursting" to commercial clouds during peak demand.
- Long-term (5+ years): Evolve towards an "intelligent data lakehouse" for PUNCH science, incorporating data discovery, provenance tracking, and automated data lifecycle management powered by the federated metadata. Serve as a blueprint for other NFDI consortia and international collaborations.
12. References
- PUNCH4NFDI Consortium. (2024). PUNCH4NFDI White Paper. [Official Consortium Documentation].
- Thain, D., Tannenbaum, T., & Livny, M. (2005). Distributed computing in practice: the Condor experience. Concurrency and Computation: Practice and Experience, 17(2-4), 323-356. https://doi.org/10.1002/cpe.938
- Krebs, K., et al. (2022). COBalD/TARDIS – A dynamic resource provisioning framework for heterogeneous computing environments. Journal of Physics: Conference Series, 2438(1), 012045. (Reference for the meta-scheduler).
- Blomer, J., et al. (2011). The CernVM File System. Journal of Physics: Conference Series, 331(5), 052004. https://doi.org/10.1088/1742-6596/331/5/052004
- dCache Collaboration. (2023). dCache.org [Software and Documentation]. https://www.dcache.org
- XRootD Collaboration. (2023). XRootD Documentation. http://xrootd.org/docs.html
- McMahan, B., Moore, E., Ramage, D., Hampson, S., & y Arcas, B. A. (2017). Communication-Efficient Learning of Deep Networks from Decentralized Data. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS). (Cited for federated learning analogy).
- European Organization for Nuclear Research (CERN). (2023). Worldwide LHC Computing Grid (WLCG). https://wlcg.web.cern.ch (Cited as precedent for large-scale federation).