USID and Pycroscopy -- Open frameworks for storing and analyzing spectroscopic and imaging data
Materials science is undergoing profound changes due to advances in characterization instrumentation that have resulted in an explosion of data in terms of volume, velocity, variety, and complexity. Harnessing these data for scientific research requires an evolution of the associated computing and data infrastructure, bridging scientific instrumentation with supercomputing and cloud computing. Here, we describe Universal Spectroscopy and Imaging Data (USID), a data model capable of representing data from most common instruments, modalities, dimensionalities, and sizes. We pair this schema with the hierarchical data file format (HDF5) to maximize compatibility, exchangeability, traceability, and reproducibility. We discuss a family of community-driven, open-source, and free Python software packages for storing, processing, and visualizing data. The first is pyUSID, which provides the tools to read and write USID HDF5 files in addition to a scalable framework for parallelizing data analysis. The second is Pycroscopy, which provides algorithms for scientific analysis of nanoscale imaging and spectroscopy modalities and is built on top of pyUSID and USID. The instrument-agnostic nature of USID facilitates the development of analysis code in Pycroscopy that is independent of instrumentation and task, which in turn can bring scientific communities together and break down barriers in the age of open science. The interested reader is encouraged to be a part of this ongoing community-driven effort to collectively accelerate materials research and discovery through the realms of big data.
💡 Research Summary
Materials science is currently experiencing a data explosion driven by rapid advances in characterization instruments such as electron microscopes, spectrometers, and X-ray diffraction systems. Instruments have traditionally written data in their own proprietary file formats, each requiring its own analysis scripts, which creates barriers to data exchange, reproducibility, and collaborative research. To address these challenges, the authors introduce the Universal Spectroscopy and Imaging Data (USID) model, a flexible, instrument-agnostic schema designed to represent spectroscopic and imaging data from most common instruments, regardless of dimensionality, modality, or size.
USID represents every measurement as a two-dimensional "main" dataset: each row holds the data recorded at one position (for example, one pixel of a scan), and each column corresponds to one step of the spectroscopic variables swept at that position (for example, bias voltage or time). Ancillary datasets accompany it: position indices and values give the coordinates of each row, and spectroscopic indices and values give the independent-variable settings of each column. Because any number of position or spectroscopic dimensions can be flattened onto these two axes, USID can encode a single I-V curve, a three-dimensional scanning probe map (two spatial dimensions plus one spectroscopic dimension), or a four-dimensional hyperspectral image within the same structural framework.
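The flattening of a multidimensional measurement into a two-dimensional matrix can be illustrated with plain NumPy. This is a minimal sketch rather than pyUSID's API: the grid size, the bias sweep, the variable names, and the ordering of the index columns are all assumptions made for the example.

```python
import numpy as np

# Hypothetical measurement: a 3 x 4 grid of locations, 5 bias steps per location.
n_rows, n_cols, n_bias = 3, 4, 5
cube = np.arange(n_rows * n_cols * n_bias, dtype=float).reshape(n_rows, n_cols, n_bias)

# Flattened matrix: one row per position, one column per spectroscopic step.
main = cube.reshape(n_rows * n_cols, n_bias)

# Position indices: the (row, col) grid location that each row of `main` came from.
# The column ordering here is an illustrative choice, not a statement of the USID convention.
flat_positions = np.arange(n_rows * n_cols)
pos_indices = np.stack(np.unravel_index(flat_positions, (n_rows, n_cols)), axis=1)

# Spectroscopic values: the physical value of each column (a made-up bias sweep, in volts).
spec_values = np.linspace(-1.0, 1.0, n_bias)

# The N-dimensional shape can be recovered from the ancillary arrays alone.
shape = tuple(pos_indices.max(axis=0) + 1) + (spec_values.size,)
restored = main.reshape(shape)
```

Nothing is lost in the flattening: the index arrays carry enough information to rebuild the original cube, which is why one matrix layout can serve data of any dimensionality.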
The model is stored using the Hierarchical Data Format version 5 (HDF5), which organizes a file into groups, datasets, and attributes. This choice supports long-term accessibility and cross-platform compatibility, and it provides built-in chunked storage and compression, both critical for handling gigabyte-to-terabyte scale files. Moreover, HDF5's ability to embed metadata alongside the data allows experimental conditions, instrument settings, and processing history to be stored in the same file as the measurements, which supports traceability and reproducibility.
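The HDF5 features mentioned here (groups, chunked and compressed datasets, and attributes for metadata) can be tried directly with the h5py library. The sketch below is a generic h5py example, not a USID-compliant writer; the group name, dataset name, attribute names, and chunk size are illustrative assumptions.

```python
import os
import tempfile

import h5py
import numpy as np

# Made-up data: 1024 positions x 256 spectral points.
main = np.random.default_rng(0).normal(size=(1024, 256)).astype("float32")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.h5")

    with h5py.File(path, "w") as f:
        grp = f.create_group("Measurement_000")      # illustrative group name
        dset = grp.create_dataset(
            "Raw_Data",                              # illustrative dataset name
            data=main,
            chunks=(64, 256),                        # 64 complete spectra per chunk
            compression="gzip",
        )
        # Metadata is stored as attributes next to the data it describes.
        dset.attrs["quantity"] = "Current"
        dset.attrs["units"] = "nA"
        grp.attrs["instrument"] = "hypothetical STM"

    with h5py.File(path, "r") as f:
        dset = f["Measurement_000/Raw_Data"]
        first_block = dset[:64]                      # slicing reads only the requested rows
        units = dset.attrs["units"]
        chunks = dset.chunks
        compression = dset.compression
```

Choosing chunks that hold whole spectra is one reasonable layout when analysis proceeds position by position; other access patterns would favor other chunk shapes.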
To operationalize USID, the authors have released two open‑source Python packages. The first, pyUSID, offers a high‑level API for reading and writing USID‑compliant HDF5 files. Its central class, USIDataset, mimics NumPy array semantics while lazily loading data from HDF5 chunks, allowing efficient manipulation of datasets that exceed available RAM. pyUSID also includes a parallel processing framework (Process and Job classes) that integrates with Dask, enabling automatic distribution of computational tasks across multiple cores or clusters. This makes it feasible to preprocess, visualize, and analyze hundreds of gigabytes of spectroscopic mapping data in a matter of seconds to minutes.
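The general idea behind this kind of scalable processing is that rows of the flattened matrix (one per position) can usually be processed independently, so the data can be split into blocks of rows that are handled in parallel. The following is a minimal sketch of that pattern using Python's standard `concurrent.futures` module. It does not use or reproduce pyUSID's own classes, and the function names, block size, and worker count are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def process_block(block):
    """Example per-spectrum computation: subtract each spectrum's own mean."""
    return block - block.mean(axis=1, keepdims=True)


def process_in_blocks(main, func, block_rows=256, max_workers=4):
    """Apply `func` to consecutive blocks of rows of `main` and stitch the results."""
    starts = range(0, main.shape[0], block_rows)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each task slices its own block of rows; map() returns results in order.
        results = pool.map(lambda s: func(main[s:s + block_rows]), starts)
        return np.vstack(list(results))


rng = np.random.default_rng(0)
main = rng.normal(size=(1000, 64))      # made-up data: 1000 positions x 64 spectral points
result = process_in_blocks(main, process_block)
```

Threads are used here only to keep the sketch short and portable. A production framework would more likely distribute blocks across processes or cluster nodes and write results back to disk incrementally instead of stacking them in memory.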
Built on top of pyUSID, the second package—Pycroscopy—delivers domain‑specific analysis tools for nanoscale imaging and spectroscopy. It provides routines for noise reduction, background subtraction, dimensionality reduction (PCA, NMF), clustering (k‑means, DBSCAN), and physics‑based model fitting (e.g., I‑V curve fitting, spectral deconvolution). Because Pycroscopy operates on the abstract USID representation rather than on instrument‑specific file structures, the same analysis pipeline can be applied to data from SEM, AFM, STM, or any future modality without modification. This “instrument‑agnostic” design dramatically lowers the software development overhead for new instruments and encourages code reuse across research groups.
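The instrument-agnostic idea can be made concrete with a small example: once data are in the positions-by-spectral-points layout, the same function can analyze datasets of different shapes and origins. The sketch below performs PCA through a NumPy singular value decomposition. It is an illustration under stated assumptions, not Pycroscopy's implementation, and the function names and random test data are hypothetical.

```python
import numpy as np


def pca_2d(main, n_components):
    """PCA of a (positions x spectral points) matrix via SVD."""
    centered = main - main.mean(axis=0, keepdims=True)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]   # one row per position
    components = vt[:n_components]                    # one row per component spectrum
    return scores, components


def analyze(data, n_components=3):
    """Flatten all leading (position) axes, run PCA, and fold the scores back."""
    pos_shape, n_spec = data.shape[:-1], data.shape[-1]
    main = data.reshape(-1, n_spec)
    scores, components = pca_2d(main, n_components)
    return scores.reshape(pos_shape + (n_components,)), components


rng = np.random.default_rng(0)
map_3d = rng.normal(size=(8, 8, 32))     # e.g. an 8 x 8 scan with 32 spectral points
line_2d = rng.normal(size=(20, 100))     # e.g. a 20-point line scan with 100 spectral points

score_maps, map_components = analyze(map_3d)
line_scores, line_components = analyze(line_2d)
```

The `analyze` function never inspects how many position dimensions the input has, which is the property that lets one pipeline serve several modalities.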
A key aspect of the project is its community‑driven development model. All source code, documentation, and issue tracking are hosted on GitHub, with continuous integration/continuous deployment (CI/CD) pipelines that automatically test compatibility across Python versions and operating systems. The open‑source nature invites contributions from both academia and industry, fostering a shared ecosystem for data standards, analysis algorithms, and best practices.
In summary, USID together with pyUSID and Pycroscopy provides a comprehensive solution that unifies data representation, storage, metadata management, scalable computation, and instrument‑independent analysis. By lowering technical barriers, the framework aims to accelerate the adoption of big‑data and machine‑learning techniques in materials research. Future extensions that couple USID‑based datasets with cloud‑native AI services could enable automated feature extraction, predictive modeling, and accelerated materials discovery, positioning the community to fully exploit the wealth of information generated by modern characterization tools.