In this paper, we present ChemRecon, a meta-database and Python interface for integrating and exploring biochemical data across multiple heterogeneous resources by consolidating compounds, reactions, enzymes, molecular structures, and atom-to-atom maps from several major databases into a single, consistent ontology. ChemRecon enables unified querying, cross-database analysis, and the construction of graph-based representations of sets of related database entries by traversing inter-database connections. This facilitates information extraction that is impossible within any single database, including deriving consensus information from conflicting sources, of which identifying the most probable molecular structure associated with a given compound is just one example. The Python interface is available via pip from the Python Package Index (https://pypi.org/project/chemrecon/). ChemRecon is open-source and the source code is hosted at GitLab (https://gitlab.com/casbjorn/chemrecon). Documentation and additional information are available at https://chemrecon.org.
Researchers increasingly rely on data in biochemical databases as core sources of knowledge. A diverse ecosystem of specialized databases has emerged, each with its own focus. Combining data from multiple such sources can, in principle, provide a far more comprehensive view of the state of biochemical knowledge than any single database alone.
In practice, however, several obstacles prevent scientists from making use of the full potential of the vast amount of available data. First, interfacing with each database typically requires a custom implementation for web access or for parsing database downloads, as data formats vary widely. This makes working with multiple data sources a time-consuming process and hinders data integration. As an example, many databases provide cross-references to related entries in other resources, but making effective use of these links requires accessing several sources, which carries the aforementioned challenges. Second, many databases encode ontological relationships, but these are not compatible across resources, limiting their usefulness in settings where several data sources are used. Third, the landscape of biochemical databases is heterogeneous not only in how the databases are formatted and accessed; the databases also frequently disagree with each other. Discrepancies include listing different tautomers for the same compound and identifying a completely different entry in another database as equivalent.
These challenges are apparent in common bioinformatics workflows. For example, genome-scale metabolic models from the BiGG database offer a view of metabolites and reactions within organisms, but do not provide chemical or structural details on the metabolites themselves. Instead, entries in BiGG are connected to other sources where this information is available (e.g., MetaNetX and ChEBI), but exploiting these connections requires nontrivial effort to reconcile and integrate the sources.
Here we present ChemRecon, a consolidated meta-database with a Python interface designed to simplify the integration and exploration of biochemical data from a range of sources. ChemRecon is built from full-database downloads of compounds, reactions, enzymes, molecular structures, and atom-to-atom maps from the following source databases: BiGG [7], BRENDA [3], ChEBI [2], ECMDB [10], M-CSA [9], MetaMDB [11], MetaNetX [8], and PubChem [6] (see Table 1). Heterogeneous data formats were standardized, and relationships within and between these databases were reconstructed in a consistent format. The resulting meta-database is freely accessible online, and is complemented by a Python interface which allows for easy integration into existing workflows. This enables unified querying of entries from all the source databases, and discovery and visualization of relationships between these entries. In contrast to existing integration resources (e.g., MetaNetX [8]) which provide fixed identifier mappings, ChemRecon preserves the original database entries and exposes their cross-references as an explicit, traversable graph. This design enables systematic cross-database exploration, complex cross-resource querying, and the derivation of consensus information from conflicting annotations across databases. In short, ChemRecon simplifies workflows by allowing researchers to focus on scientific analyses rather than database engineering and enables knowledge discovery through its ability to construct and visualize graphs of associated biochemical information.
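The idea of traversing cross-references and deriving a consensus can be illustrated with a minimal sketch in plain Python. It does not use the ChemRecon interface: the database names, identifiers, and structure strings are made-up toy data, the function names `related_entries` and `consensus_structure` are hypothetical, and the simple majority vote is an assumption chosen for illustration, not necessarily how ChemRecon resolves conflicts.

```python
from collections import Counter, deque

# Toy cross-reference graph (illustrative data only): each key is a
# (database, identifier) entry, each value lists the entries it references.
XREFS = {
    ("dbA", "c1"): [("dbB", "x9"), ("dbC", "m4")],
    ("dbB", "x9"): [("dbA", "c1")],
    ("dbC", "m4"): [("dbD", "q2")],
    ("dbD", "q2"): [],
}

# Toy structure annotations (hypothetical SMILES-like strings).
# Note that dbC disagrees with dbA and dbB, and dbD has no structure.
STRUCTURES = {
    ("dbA", "c1"): "CCO",
    ("dbB", "x9"): "CCO",
    ("dbC", "m4"): "CC=O",
}


def related_entries(start, xrefs):
    """Collect all entries reachable from `start`, treating links as undirected."""
    # Build an undirected adjacency map from the directed cross-references.
    neighbours = {}
    for src, targets in xrefs.items():
        neighbours.setdefault(src, set())
        for dst in targets:
            neighbours[src].add(dst)
            neighbours.setdefault(dst, set()).add(src)

    # Standard breadth-first search.
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def consensus_structure(entries, structures):
    """Return the most frequently annotated structure, or None if there is none."""
    votes = Counter(structures[e] for e in entries if e in structures)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


group = related_entries(("dbB", "x9"), XREFS)
best = consensus_structure(group, STRUCTURES)
```

Starting from any single entry, the traversal gathers the whole set of linked entries, and the vote then selects the structure supported by the most sources. A realistic scheme would likely also weight sources or normalize structures before comparing them.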
This paper describes the design and functionality of ChemRecon, presents practical examples of the use of the Python interface, and discusses potential applications.
ChemRecon consists of two main components: a consolidated meta-database and a corresponding Python interface which enables easy programmatic access to the database. In this section, we describe the design and construction of the meta-database, the methods enabled by ChemRecon, and the usage of the interface.
Table 1: The source databases contributing to the ChemRecon meta-database, and the number of entries sourced from each.
The ChemRecon meta-database was created from full downloads of the source databases. The contents of these downloads were then parsed and converted into a uniform format. An overview of the data collected from the source databases is provided in Table 1. The parsing and conversion routines are extensible, allowing users to expand the database capabilities by writing their own parsing scripts in case they have access to proprietary sources of data. References from the source databases to other databases, including KEGG [4] and MetaCyc [5], are also included (with no additional information), allowing workflows based on identifiers from this larger set of databases.
Each entry present in the source databases is consolidated into the ChemRecon meta-database as an entry. The meta-database contains various entry types, including Compound, Reaction, Enzyme, MolStructure, and AAM.
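To make the notion of converting source-specific downloads into uniformly typed entries concrete, the following is a minimal sketch under stated assumptions. The `EntryType` enum mirrors the entry type names listed above, but the `Entry` record, the `parse_compound_tsv` function, and the two-column TSV layout are all hypothetical and do not reflect ChemRecon's actual schema or parsing code.

```python
import csv
import io
from dataclasses import dataclass
from enum import Enum


class EntryType(Enum):
    # Names taken from the entry types listed in the text.
    COMPOUND = "Compound"
    REACTION = "Reaction"
    ENZYME = "Enzyme"
    MOL_STRUCTURE = "MolStructure"
    AAM = "AAM"


@dataclass(frozen=True)
class Entry:
    """Hypothetical uniform record for one entry of a source database."""
    source: str            # name of the source database
    identifier: str        # identifier within that database
    entry_type: EntryType
    xrefs: tuple = ()      # (database, identifier) pairs referenced by this entry


def parse_compound_tsv(text, source):
    """Parse a hypothetical TSV dump with columns `id` and `xrefs`.

    The `xrefs` column is assumed to hold semicolon-separated `db:id` pairs.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    entries = []
    for row in reader:
        xrefs = tuple(
            tuple(ref.split(":", 1))
            for ref in row["xrefs"].split(";")
            if ref  # skip empty fields
        )
        entries.append(Entry(source, row["id"], EntryType.COMPOUND, xrefs))
    return entries


# Toy input, illustrative only.
SAMPLE = "id\txrefs\nc1\tdbB:x9;dbC:m4\nc2\t\n"
entries = parse_compound_tsv(SAMPLE, source="dbA")
```

Under this kind of design, supporting an additional data source amounts to writing one more parser that emits the same record type, so downstream queries need not know which database an entry came from.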