The demise of the filesystem and multi level service architecture

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Many astronomy data centres still work on filesystems. Industry has moved on: current practice in computing infrastructure is to achieve Big Data scalability using object stores rather than POSIX file systems. This presents opportunities for portability and reuse of the software underlying processing and archive systems, but it also causes problems for legacy implementations in current data centres.


💡 Research Summary

The paper addresses a fundamental shift that astronomy data centres must undergo: moving away from traditional POSIX‑based file systems toward object‑store‑centric architectures that dominate modern big‑data infrastructures. It begins by documenting the current state: many observatories still rely on networked file systems such as NFS, Lustre, or GPFS for ingest, processing, and archival of raw images, calibrated products, and large‑scale simulations. While these systems have served well for decades, they now exhibit critical bottlenecks. Metadata servers become contention points as the number of files climbs into the billions; file‑system limits on inode counts and directory depth impede scaling; and the cost of adding capacity grows linearly because each petabyte requires new storage nodes, power, and cooling. Moreover, the “copy‑and‑move” paradigm inherent to file‑system workflows creates massive data duplication during pipeline stages, inflating storage footprints and backup expenses.

Object stores—implemented as Amazon S3, OpenStack Swift, Google Cloud Storage, or on‑premise Ceph RGW—offer a fundamentally different model. Data are stored as immutable objects identified by a globally unique key; the underlying system handles replication, distribution, and durability without a central metadata server. The S3‑compatible API provides a uniform interface across cloud providers and on‑premise deployments, enabling portable pipelines. Because storage is billed by actual usage (capacity, request count, egress), institutions can align costs with scientific activity rather than with static hardware provisioning.
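The flat-key model can be sketched in a few lines of Python. The key-naming scheme, bucket name, and endpoint below are hypothetical illustrations, not conventions from the paper; the commented boto3 call shows how swapping `endpoint_url` moves the same code between a cloud provider and an on-premise Ceph RGW deployment.

```python
# Sketch: flat object keys replace hierarchical filesystem paths.
# The key scheme, bucket name, and endpoint are illustrative assumptions.

def path_to_key(instrument: str, obs_date: str, filename: str) -> str:
    """Build a globally unique, flat object key from observation attributes.

    Unlike a filesystem path, the '/' characters here are only a naming
    convention; the object store treats the whole key as an opaque string
    and needs no central metadata server to resolve it.
    """
    return f"{instrument}/{obs_date}/{filename}"

key = path_to_key("wfc3", "2024-01-15", "obs_00042.fits")

# With any S3-compatible store (AWS S3, Ceph RGW, Swift via S3 API, ...),
# upload is a single call; changing endpoint_url retargets the pipeline:
# import boto3
# s3 = boto3.client("s3", endpoint_url="https://rgw.example.org")
# s3.put_object(Bucket="raw-observations", Key=key, Body=fits_bytes)
```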

The authors then dissect the technical challenges of migrating legacy astronomy pipelines to an object‑store world. First, existing code bases are tightly coupled to file‑system calls (open, read, write, stat) and to hierarchical path conventions. A naïve rewrite would be costly, so the paper recommends an intermediate “file‑object gateway” (e.g., a FUSE layer) that translates POSIX operations into S3 calls, allowing legacy software to run unchanged while the backend storage is an object store. Second, astronomical data formats (FITS, HDF5, CASA Measurement Sets) embed extensive scientific metadata (observation time, instrument configuration, calibration tables). Object stores only support limited key‑value metadata per object, so a separate metadata service—often a relational database or a search engine like Elasticsearch—is required to index and query these attributes efficiently. Third, data integrity is paramount; file systems traditionally rely on fsck and hardware RAID checksums, whereas object stores expose only basic ETag checksums. The paper advocates client‑side checksum calculation (SHA‑256 or MD5) at ingest, storing the checksum as object metadata, and periodic verification jobs to detect silent corruption.
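The client-side checksum scheme advocated above can be sketched as follows. The metadata field name and the sample payload are assumptions for illustration; a real ingest job would attach the digest via `put_object(..., Metadata=...)`, under which S3-compatible stores expose it with an `x-amz-meta-` prefix.

```python
import hashlib

def checksum_metadata(data: bytes) -> dict:
    """Compute a SHA-256 digest at ingest, to be stored as object metadata.

    The 'sha256' field name is an illustrative choice; S3-compatible stores
    surface user metadata under an 'x-amz-meta-' header prefix.
    """
    return {"sha256": hashlib.sha256(data).hexdigest()}

def verify(data: bytes, metadata: dict) -> bool:
    """Periodic verification job: recompute and compare the stored digest
    to detect silent corruption in the object store."""
    return hashlib.sha256(data).hexdigest() == metadata["sha256"]

payload = b"SIMPLE  =                    T"  # stand-in for FITS header bytes
meta = checksum_metadata(payload)
assert verify(payload, meta)
assert not verify(payload + b"\x00", meta)  # a single flipped byte is caught
```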

From an architectural perspective, the paper proposes replacing the monolithic “file server → application server → user” stack with a multi‑level service model: an Object Gateway Layer that abstracts storage back‑ends and enforces IAM policies; a Microservice Layer where each pipeline stage (ingest, calibration, source extraction, catalog generation, visualization) runs as an independent container orchestrated by Kubernetes; and a Data Layer consisting of the object store plus a dedicated metadata database. This decomposition yields several benefits: independent scaling of compute‑intensive stages, fault isolation (a failure in source extraction does not affect catalog serving), and rapid deployment of new algorithms via CI/CD pipelines. Event‑driven processing is enabled by integrating a message broker (Kafka or Pulsar) that publishes “object‑created” events, automatically triggering downstream microservices for real‑time reduction or quality‑control checks.
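A minimal sketch of the event-driven pattern follows, with a plain in-process dispatcher standing in for a Kafka or Pulsar consumer so the control flow stays visible; the event type, field names, and handler are hypothetical, not drawn from the paper.

```python
# Sketch: "object-created" events trigger downstream microservices.
# An in-process dispatcher stands in for a real Kafka/Pulsar consumer;
# event fields and stage names are illustrative assumptions.

from typing import Callable

handlers: dict[str, list[Callable[[dict], None]]] = {}

def subscribe(event_type: str, handler: Callable[[dict], None]) -> None:
    """Register a pipeline stage as a consumer of one event type."""
    handlers.setdefault(event_type, []).append(handler)

def publish(event_type: str, event: dict) -> None:
    """Deliver an event to every subscribed stage (fault-isolated in
    production, since each stage runs in its own container)."""
    for handler in handlers.get(event_type, []):
        handler(event)

processed: list[str] = []

def run_calibration(event: dict) -> None:
    # A real microservice would fetch the object and run the reduction.
    processed.append(event["key"])

subscribe("object-created", run_calibration)
publish("object-created",
        {"bucket": "raw-observations", "key": "wfc3/obs_00042.fits"})
```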

Cost modeling is another focal point. The paper contrasts the fixed‑cost model of traditional file systems with the usage‑based pricing of object stores. By profiling access patterns, data can be tiered: “hot” objects remain in S3 Standard, “warm” data migrate to S3 Infrequent Access, and “cold” archives move to Glacier or Deep Archive. Lifecycle policies automate this migration, delivering up to 70 % savings on long‑term storage. Additionally, request‑cost optimization—caching frequently accessed objects in an edge CDN or an in‑cluster Redis cache—reduces egress charges and improves latency for interactive analysis tools.
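The hot/warm/cold tiering could be expressed as a simple policy function. The day thresholds below are illustrative assumptions, not figures from the paper; in practice an S3 lifecycle configuration would encode equivalent rules so migration happens automatically on the server side.

```python
def storage_class(days_since_last_access: int) -> str:
    """Pick a storage tier from access recency.

    The 30/90/365-day thresholds are illustrative; a real deployment would
    derive them from profiled access patterns and encode them in an S3
    lifecycle configuration rather than in client code.
    """
    if days_since_last_access < 30:
        return "STANDARD"        # hot: interactive analysis
    if days_since_last_access < 90:
        return "STANDARD_IA"     # warm: occasional reprocessing
    if days_since_last_access < 365:
        return "GLACIER"         # cold: rare retrieval, restore latency
    return "DEEP_ARCHIVE"        # archival: cheapest, slowest restore
```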

Recognizing that a wholesale cut‑over is rarely feasible, the authors outline a phased migration strategy. Phase 1 introduces the file‑object gateway for read‑only access, allowing scientists to retrieve legacy data from the new store without code changes. Phase 2 adds write capability for new observations, storing them directly as objects while still generating traditional directory listings for backward compatibility. Phase 3 refactors critical pipeline components to use native S3 SDKs, eliminating the gateway and gaining performance benefits (multipart upload, parallel reads). Throughout, comprehensive testing, automated checksum verification, and staff training are emphasized to mitigate risk.
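The multipart-upload mechanics that Phase 3 unlocks can be sketched in miniature: a payload is split into fixed-size parts that could be uploaded over parallel connections and reassembled in order on completion. The sizes are arbitrary examples; boto3's transfer manager performs this splitting transparently above a configurable threshold.

```python
def split_parts(data: bytes, part_size: int) -> list[bytes]:
    """Split a payload into fixed-size parts, as multipart upload does.

    Each part can be sent on its own connection; the store reassembles
    them in order when the upload completes. boto3's TransferConfig
    applies this automatically above its multipart_threshold.
    """
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]

payload = bytes(range(256)) * 40   # 10,240-byte stand-in for a FITS file
parts = split_parts(payload, 4096)

assert len(parts) == 3             # two full 4096-byte parts + remainder
assert b"".join(parts) == payload  # completion reassembles the original
```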

In conclusion, the paper argues that adopting object‑store‑backed, microservice‑oriented architectures is not merely a technology upgrade but a strategic necessity for astronomy data centres. It delivers horizontal scalability to accommodate petabyte‑scale surveys (e.g., LSST, SKA), aligns operational expenses with actual scientific usage, and positions institutions to leverage cloud‑native tools for data sharing, reproducibility, and international collaboration. By systematically addressing data integrity, metadata management, legacy compatibility, and cost optimization, the proposed roadmap provides a realistic path for the community to transition from the “demise of the filesystem” to a resilient, future‑proof data ecosystem.

