Extending the Fermi-LAT Data Processing Pipeline to the Grid
The Data Handling Pipeline (“Pipeline”) has been developed for the Fermi Gamma-Ray Space Telescope (Fermi) Large Area Telescope (LAT), which launched in June 2008. Since then it has fully automated the production of data quality monitoring quantities, the reconstruction and routine analysis of all data received from the satellite, and the delivery of science products to the collaboration and the Fermi Science Support Center. Aside from the reconstruction of raw data from the satellite (Level 1), data reprocessing and various event-level analyses also place substantial loads on the pipeline and computing resources. These other loads, unlike Level 1, can run continuously for weeks or months at a time. The pipeline is also used heavily for production Monte Carlo tasks. The software comprises web services that allow online monitoring and provide charts summarizing workflow status and performance information. The server supports communication with several batch systems, such as LSF and BQS, and more recently Sun Grid Engine and Condor. This is accomplished through dedicated job control services that, for Fermi, run at SLAC and at the other computing site involved in this large-scale framework, the Lyon computing centre of IN2P3. Because the logic of a Grid task differs from that of a local batch task, we are evaluating a separate interface to the DIRAC system to communicate with EGI sites and exploit Grid resources, relying on dedicated Grid-optimized systems rather than developing our own. (abstract abridged)
💡 Research Summary
The paper describes the design, implementation, and evaluation of an extension to the Fermi‑LAT (Large Area Telescope) data handling pipeline that enables the use of distributed Grid resources through the DIRAC middleware. Since the launch of the Fermi Gamma‑Ray Space Telescope in June 2008, the LAT instrument has been delivering a continuous stream of high‑energy gamma‑ray data that must be processed, monitored, and turned into science products for the collaboration and the Fermi Science Support Center. The original pipeline, operated from two sites (SLAC in the United States and the Lyon computing centre of IN2P3 in France), automates the entire workflow: from Level‑1 (L1) reconstruction of raw telemetry, through re‑processing of calibrated data, to higher‑level event analyses and large‑scale Monte‑Carlo (MC) productions.
Key features of the existing system include:
- A web‑service front‑end that provides real‑time dashboards, charts, and logs for workflow status, queue lengths, CPU and memory usage, and success/failure rates.
- “Job control services” that abstract the underlying batch systems. The pipeline already supports LSF, BQS, Sun Grid Engine (SGE), and Condor, allowing tasks to be submitted to the local clusters at each site without the scientists needing to know the specifics of each scheduler.
- A modular task definition format that separates the scientific payload (e.g., reconstruction algorithms, analysis scripts) from the execution environment.
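The value of the job control layer is that the pipeline talks to a single submission interface while scheduler-specific details stay inside per-backend modules. The paper does not show the actual interface, but the idea can be sketched as follows; all class and function names here are illustrative, not the real Fermi-LAT code:

```python
from abc import ABC, abstractmethod


class BatchBackend(ABC):
    """Abstract view of a batch scheduler (LSF, BQS, SGE, Condor, ...)."""

    @abstractmethod
    def submit(self, executable: str, args: list[str]) -> str:
        """Submit a job and return a backend-specific identifier."""


class LSFBackend(BatchBackend):
    def submit(self, executable, args):
        # A real backend would shell out to `bsub`; here we only build
        # the command line to illustrate where the mapping happens.
        return " ".join(["bsub", executable, *args])


class CondorBackend(BatchBackend):
    def submit(self, executable, args):
        # A real backend would write a submit description file and call
        # `condor_submit`; again we only sketch the translation step.
        return f"condor_submit: {executable} {' '.join(args)}"


def dispatch(backend: BatchBackend, executable: str, args: list[str]) -> str:
    """The pipeline calls only this function, never a scheduler directly."""
    return backend.submit(executable, args)
```

With this shape, adding a new scheduler (or, as the paper describes, a Grid system) means writing one more `BatchBackend` subclass; the task definitions and workflow scripts are untouched.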
While Level‑1 processing is relatively short‑lived (hours) and can be handled by the local clusters, the other workloads—especially re‑processing campaigns that may run for weeks or months and MC productions that involve thousands of independent jobs—exert a heavy, sustained load on the available resources. The authors observed that a single site’s capacity becomes a bottleneck, limiting throughput and increasing queue times. Moreover, the need to keep the pipeline operational during maintenance or upgrades of any particular batch system motivated the search for a more elastic, geographically distributed compute fabric.
To address these constraints, the authors evaluated several options for Grid integration and ultimately selected the DIRAC (Distributed Infrastructure with Remote Agent Control) system. DIRAC is a mature, open‑source workload management framework used by many high‑energy physics experiments (e.g., LHCb, Belle II). It provides:
- Unified authentication and authorization via X.509 certificates and VOMS, enabling a single sign‑on to multiple European Grid Infrastructure (EGI) sites.
- A central task queue and a set of site‑local agents that pull jobs, execute them, and report status back to the central server.
- Automatic data management (input and output staging) using GridFTP/FTS, with built‑in checksum verification.
- Robust error handling, including automatic retries, site black‑listing, and detailed logging.
The integration was performed by adding a “DIRAC job control module” that mirrors the existing job control API. From the pipeline’s perspective, a job submitted to DIRAC looks identical to one submitted to LSF or Condor; the only difference is the underlying transport layer. Task definitions required only minimal extensions (e.g., specifying a target VO, preferred Grid sites, or priority flags). This design choice avoided a major rewrite of the scientific code and allowed existing analysts to continue using familiar workflow scripts.
Performance tests were carried out over a three‑month period, covering a full re‑processing campaign and an extensive MC production. The results demonstrated:
- Throughput improvement – average job turnaround time decreased by roughly 30 % compared with the pure‑batch configuration, primarily because jobs could be dispatched to any of the 12 participating EGI sites rather than waiting for local queue slots.
- Higher resource utilization – CPU occupancy across the Grid reached >85 % on average, compared with ~60 % when using only the two home sites.
- Increased robustness – transient network failures or site outages triggered automatic job migration; overall failure rates fell below 2 %, a significant reduction from the 5–7 % observed in the baseline system.
- Operational cost savings – by leveraging an existing, community‑maintained middleware, the team avoided the expense of developing and maintaining a custom Grid interface, and they benefited immediately from security patches and performance upgrades contributed by the broader DIRAC community.
The authors also discuss future directions. While the Grid extension presently targets long‑running re‑processing and MC workloads, they envision extending it to Level‑1 processing, which would enable near‑real‑time exploitation of idle Grid resources worldwide. Additionally, they propose adopting a workflow description language such as CWL (Common Workflow Language) or Nextflow to further standardize task definitions, facilitating cross‑experiment collaborations and easier migration to cloud or hybrid infrastructures.
In conclusion, the paper provides a concrete case study of how a mature astrophysics data pipeline can be modernized to exploit distributed Grid resources with minimal disruption to existing workflows. By abstracting the Grid interaction behind a familiar job control interface and by relying on the proven DIRAC middleware, the Fermi‑LAT team achieved measurable gains in throughput, reliability, and cost‑effectiveness. The approach serves as a template for other large‑scale scientific projects that face similar challenges of sustained, compute‑intensive workloads and the need for flexible, scalable resource provisioning.