- Title: The demise of the filesystem and multi level service architecture
- ArXiv ID: 1907.13060
- Date: 2019-08-01
- Authors: William O'Mullane, Niall Gaffney, Frossie Economou, Arfon M. Smith, J. Ross Thomson, Tim Jenness
📝 Abstract
Many astronomy data centres still work on filesystems. Industry has moved on; current practice in computing infrastructure is to achieve Big Data scalability using object stores rather than POSIX file systems. This presents us with opportunities for portability and reuse of software underlying processing and archive systems but it also causes problems for legacy implementations in current data centers.
💡 Summary & Analysis
This paper argues that the astronomical community should transition from using POSIX file systems to object stores for handling large-scale data. The current reliance on traditional filesystems is limiting scalability and efficiency in data processing, especially as datasets grow larger. By adopting standardized APIs and leveraging cloud-based services, astronomers can improve interoperability and streamline data management processes.
The paper highlights the historical use of object stores in astronomy, such as with FITS tapes, but notes that modern researchers have grown accustomed to POSIX file systems due to their widespread adoption. However, the limitations of these systems are becoming more apparent as datasets expand, necessitating a shift towards more scalable solutions like those used by leading tech companies.
The authors recommend developing a common API layer and metadata transformation services to support data access across different astronomical projects. They also propose adopting federated identity services for better security and user authentication. Overall, this paper aims to guide the astronomy community toward leveraging industry-standard tools to enhance their research capabilities.
📄 Full Paper Content (ArXiv Source)
# Introduction
Executive summary: the filesystem notion limits our ability to scale
processing and has long since been dropped by industry. Astronomy needs
to move on with a new architecture.
Object Stores are nothing new to Astronomy. FITS and IRAF tapes are examples of object
stores that were used when file systems were unable to handle the
volumes of data being produced. Even then, standards for migrating from
Object Store to file systems were created, allowing for objects to be
retrieved into a predefined namespace on disk. The explosion of
individual POSIX disk capacity and POSIX-like file systems has
produced generations of researchers who have never used an Object Store.
While this growth has supported data systems until now, the size and
complexity of data being produced by surveys and even pointed telescope
archives are reaching scales where the requirements placed on file access
by the POSIX standard significantly hinder our ability to work with data.
Different parallel file systems have different strengths and weaknesses.
At large scale, data service providers such as Dropbox and AWS do
not store files in POSIX systems. Rather, they present
the illusion of a directory structure layered over large-scale object
stores. This allows for faster file access, with only CRUD-style
(create, read, update, delete) operations taking place on each object. Further, as the
pseudo-filesystem layer is simply a view of structure, typically provided
by a graph database, users can arrange the file organization themselves or
potentially have it driven by real-time queries, removing the sea of
nested symlinks through which many now organize data. Some providers offer local
POSIX caches of that user's view of the pseudo-filesystem, allowing
POSIX-style applications to access the files with standard calls such as
fopen, fscanf, and fclose. Additionally, these providers do not show
users how data are stored. One can simply request data in the format
needed (e.g., Excel, CSV, or presumably the JSON format the web apps use
for spreadsheets). At scale, applications often forgo
such POSIX layers and simply use the CRUD interfaces to the objects: they
load them into memory, act on them, and then update or delete
them, in the format they need as input and the format they natively
produce. Further, the data providers need not update their data archive
when formats change; they can simply provide a new data access format
that is fed by the legacy data formats.
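To make the CRUD model concrete, here is a minimal sketch of object access using the boto3 S3 client; the bucket and key names are invented for illustration, and any S3-compatible store could stand in for AWS.

```python
# Minimal sketch of CRUD-style object access, assuming boto3 and a
# hypothetical bucket "survey-data" with key "visit/1234/calexp.fits".
import boto3

s3 = boto3.client("s3")

# Create / update: upload the object in whatever format the producer emits.
with open("calexp.fits", "rb") as f:
    s3.put_object(Bucket="survey-data", Key="visit/1234/calexp.fits", Body=f)

# Read: pull the object straight into memory; no POSIX namespace is involved.
obj = s3.get_object(Bucket="survey-data", Key="visit/1234/calexp.fits")
data = obj["Body"].read()

# Delete.
s3.delete_object(Bucket="survey-data", Key="visit/1234/calexp.fits")
```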
It is time for astronomy data researchers to follow this curve. Just as users
have migrated from tools on their laptops to collaborative, managed services
(JupyterHub, Overleaf, Google Slides) that reduce each individual's need to
manage their own systems, so should astronomical data processing and
analysis. We propose the adoption of a common astronomical data access API
layer.
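Purely as an illustration of what such a layer could look like (the class and method names below are ours, not a proposed standard), the API surface might resemble:

```python
# Illustrative only: a possible shape for a common astronomical data access
# API layer. Names and signatures are hypothetical, not a proposed standard.
from abc import ABC, abstractmethod
from typing import Any, Iterable

class AstroDataAccess(ABC):
    """Backend-agnostic access to datasets and their metadata."""

    @abstractmethod
    def get(self, dataset_id: str) -> Any:
        """Return the dataset as an in-memory object, whatever the backing store."""

    @abstractmethod
    def put(self, dataset_id: str, data: Any, metadata: dict) -> None:
        """Store a dataset and register its metadata."""

    @abstractmethod
    def query(self, **constraints) -> Iterable[str]:
        """Return dataset identifiers whose metadata matches the constraints."""
```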
## Recommendations
- Develop a community-wide architecture supporting Science as a Service, following an industry-standard layered architecture for astronomy processing and data access as depicted in the figure below.

  Figure: Industry standard cyberinfrastructure model (left) and an astronomy instantiation of such a model (right).

- Compel all funded astronomical projects with software deliverables to leverage the architecture and APIs for data access.
- Develop data and metadata transform services to support these common API layers and aid interoperation.
- Adopt and subscribe to a common federated identity service (e.g., InCommon or Globus Auth) that is supported by most universities and research organizations.
# Commodity services and software-based community architecture
The astronomy and astrophysics community has historically relied on the
development and use of bespoke software and hardware infrastructure to
solve challenges related to managing and analyzing datasets at a
scale that was difficult to find in industry or other scientific
domains. These requirements are no longer unique, and we have access to a
wealth of open source software, commodity hardware, and managed cloud
services (offered by commercial providers and federally-funded
institutions) that are well positioned to meet the needs of astronomers
and astrophysicists. By providing documentation and reference
implementations of the “astronomy stack” using these technologies and
making it easier for researchers and missions to access cloud computing
services, we can reduce operations costs, accelerate time to science,
and increase the scientific return on federally-funded research in
astronomy and astrophysics.
The figure above shows the layers of the CI, from the interfaces for service
access exposed at multiple levels, to the common domain-wide
services, to a collection of system-level components that support the
higher levels of the CI. The lower layers of the diagram are
commodity layers based on well-established and supported components. As
one moves up from these layers, more abstraction can be applied to expose
these pieces through domain-level, or highly specific subdomain-level, interfaces.
By making these abstractions, more universal services can be developed
that can be applied across the entirety of the CI.
An example of this would be authentication, where each university or
agency may provide its own authentication method, but unifying services
like CILogon can bring those together to give a global identity space for
a wide range of users based on disparate authentication systems. By
providing this structure, along with a reference architecture of these
System Services based on well-supported software components, providers
are easily able to both deploy and support the common services that
enable cross-mission and cross-center interoperability. This structure also
reflects how the architecture allows for greater reusability as one
gets closer to the actual implementation of these services, while
supporting greater flexibility and general usability as one works
further from the core components. Alternatives such as GitHub
authentication may be more flexible, but they lack the rigor of InCommon,
which assures that the individual is a member of an academic body; we must
nonetheless also work with these industry standards.
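As a sketch of how a service might consume such a federated identity, the following assumes an OIDC-style identity provider such as CILogon, the PyJWT library, and placeholder issuer, JWKS endpoint, and client ID values:

```python
# Hedged sketch: verify an OIDC ID token from a federated identity provider.
# The issuer, JWKS URL, and client ID below are placeholders, not real config.
import jwt  # PyJWT >= 2.0, which provides PyJWKClient

OIDC_ISSUER = "https://cilogon.org"          # example issuer
JWKS_URL = OIDC_ISSUER + "/oauth2/certs"     # assumed JWKS endpoint
CLIENT_ID = "example-astro-portal"           # hypothetical registered client

def verify_id_token(id_token: str) -> dict:
    """Return the verified identity claims carried by an OIDC ID token."""
    signing_key = jwt.PyJWKClient(JWKS_URL).get_signing_key_from_jwt(id_token)
    return jwt.decode(
        id_token,
        signing_key.key,
        algorithms=["RS256"],
        audience=CLIENT_ID,
        issuer=OIDC_ISSUER,
    )
```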
This service architecture should be based on standard, reusable
software from many of the established standards developed outside of
astronomy (e.g., common authentication mechanisms such as CILogon,
standard data and metadata management systems).
Standard API interfaces should also be used
to expose these components to higher-level APIs. Data formatting and
metadata structure can be exposed at the service level, allowing for more
data and metadata reuse. An example of a
storage-agnostic data/metadata layer is the LSST data butler. Its creation has
been driven by the need to abstract data access away from algorithms.
The algorithm code deals with Python
objects and never directly with data formats or even storage. The data
butler is a lot more than a data access layer; it includes a full
registry of all data objects and how to locate them. This means it may
sit atop a filesystem or an object store: a prototype S3 plugin is now
available and we are testing it in a pipeline. The data butler is
astronomy-oriented: it has a built-in understanding of certain
relationships, such as between observations and calibrations or between
observations and regions of the sky. Since all metadata is held in a
registry, provenance queries can be answered
by the butler, and it can already act as the registry to find objects in
an object store.
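To illustrate the pattern rather than the butler's actual API, a toy storage-agnostic layer might look like the following; the class names, and the idea of keying readers on URI scheme, are our own simplifications:

```python
# Hypothetical sketch in the spirit of a storage-agnostic data access layer:
# callers ask for a dataset by type and data ID, and a registry resolves it to
# whatever backend holds the object. Not the LSST data butler API.
class Registry:
    """Maps (dataset_type, data ID) to a location; also the hook for provenance."""
    def __init__(self):
        self._entries = {}

    def register(self, dataset_type, data_id, uri):
        self._entries[(dataset_type, frozenset(data_id.items()))] = uri

    def find(self, dataset_type, data_id):
        return self._entries[(dataset_type, frozenset(data_id.items()))]


class DataAccessLayer:
    """Returns Python objects; callers never see paths, formats, or backends."""
    def __init__(self, registry, readers):
        self.registry = registry
        self.readers = readers  # maps URI scheme ("s3", "file", ...) to a loader

    def get(self, dataset_type, data_id):
        uri = self.registry.find(dataset_type, data_id)  # e.g. "s3://bucket/key"
        scheme = uri.split("://", 1)[0]
        return self.readers[scheme](uri)
```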
Part of a cyberinfrastructure model such as the one depicted in the figure is an
object-store-oriented API; this should be used for data
sharing. Such APIs already exist, for example Amazon's S3. The
API layers must expose both data and metadata in common transferable and
transformable formats (e.g., CAOM).
There must also be an authentication source for users at institutions
without such means and for citizen science efforts.
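As one concrete example of sharing data through an object-store API rather than a shared filesystem path, a time-limited link can be generated with the S3 API; the sketch below assumes boto3 and an invented bucket and key:

```python
# Minimal data-sharing sketch via the S3 API: generate a time-limited URL to a
# hypothetical object instead of handing out a path on a shared filesystem.
import boto3

s3 = boto3.client("s3")
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "survey-data", "Key": "visit/1234/calexp.fits"},
    ExpiresIn=3600,  # the link is valid for one hour
)
print(url)  # send this to a collaborator or another service
```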
# Why we should kill the filesystem
Users should not care about, and repositories should not be tied to, legacy
formats and storage representations because of legacy constraints at
other repositories. The rest of the world has already moved on: Google,
Amazon, GitHub, Netflix, etc. do not host large filesystems and can
scale because they are not limited by this antiquated formalism.
Filesystems with namespaces are very fragile at large scale. As we get
larger datasets we have to trick the filesystem into not running out of
inodes; we make countless subdirectories to cope with our thousands of
files. This in turn leads to countless hours spent fighting over how to
organize files the right way in a filesystem. Countless years have
been spent fighting over data formats (FITS vs HDF vs CSV vs Pandas). If we move code,
the filesystem at the new location may not be organized in the same manner and the
code may not work; remote access to allow caching is not always an
option.
We need to foster better remote collaboration. The laptop is the bane of
file sharing. This has changed with cloud-based pseudo-filesystems, but
these require storage within a single cloud provider's infrastructure. By creating
a Filesystem as a Service (FSAAS) federated across data and
cloud providers, we will win.
This would imply that POSIX-based file access be deprecated in software
development and only used when applications require thread-safe data
access (something that is currently not possible with FITS files).
We should, however, develop a pseudo-directory structure system to
integrate local and remote files into a dynamic namespace for each user
and potentially each user's use case (e.g., the Box sync interface or the
FUSE-based WholeTale file system).
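A toy sketch of such a dynamic, per-user namespace is shown below; the object keys, metadata, and path templates are invented, and a real system would back this with a metadata query service or a FUSE layer:

```python
# Toy sketch of a query-driven pseudo-directory view over object-store keys.
# Object keys and metadata are invented for illustration only.
objects = {
    "a1b2c3": {"visit": 1234, "band": "r"},
    "d4e5f6": {"visit": 1235, "band": "g"},
}

def namespace_view(objects, template):
    """Render each object under a user-chosen pseudo-path; nothing exists on disk."""
    return {template.format(**meta): key for key, meta in objects.items()}

# One user wants data grouped by band, another by visit; both are just views
# over the same objects and can be regenerated from a metadata query at any time.
by_band = namespace_view(objects, "/{band}/visit-{visit}.fits")
by_visit = namespace_view(objects, "/visit-{visit}/{band}.fits")
```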
This “Infrastructure as Code” approach lowers the barrier to entry and
allows for easier adoption of more standardized services that will
enable large-scale astronomical research in ways that are well
demonstrated in plant genomics (CyVerse and Galaxy), natural hazards
(DesignSafe), and surface water research (HydroShare). (See also the
decadal paper on cloud infrastructure by Arfon Smith et al.)
# The catch
Switching to an object store removes the filesystem bottleneck; however,
it also removes the filesystem index. This implies that a registry of
the objects must be maintained. We frequently do this anyway, usually
pulling metadata into a database to allow searching; with an object store
this would simply no longer be optional.
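A minimal sketch of such a registry, here as a small SQLite table mapping object keys to a few invented metadata columns, might be:

```python
# Minimal sketch of the registry that becomes mandatory once the filesystem
# index is gone: a small database mapping object-store keys to searchable
# metadata. Schema and column names are invented for illustration.
import sqlite3

con = sqlite3.connect("registry.db")
con.execute("""CREATE TABLE IF NOT EXISTS objects (
                   key TEXT PRIMARY KEY,   -- object-store key
                   dataset_type TEXT,
                   visit INTEGER,
                   band TEXT)""")

# Register each object as it is written to the store.
con.execute("INSERT OR REPLACE INTO objects VALUES (?, ?, ?, ?)",
            ("visit/1234/calexp_r.fits", "calexp", 1234, "r"))
con.commit()

# Search the registry instead of walking a directory tree.
rows = con.execute("SELECT key FROM objects WHERE dataset_type = ? AND band = ?",
                   ("calexp", "r")).fetchall()
```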
# Conclusion
We should agree on an astronomy stack of services with agreed
interfaces such that we can concentrate on building domain-specific
layers on top of industry-standard tools.
We should stop worrying about filesystems in astronomy; instead, we should agree
on a decent API for object storage and a
registry to go with it. The registry should build on existing agreements,
i.e., be based on CAOM-2.
Adoption of a limited set of services will aid ease of use and cut down
on wasted effort by all data providers.
The copyright of this content belongs to the respective researchers. We deeply appreciate their hard work and contribution to the advancement of human civilization.