The NIAID Discovery Portal: a unified search engine for infectious and immune-mediated disease datasets

ABSTRACT The National Institute of Allergy and Infectious Diseases (NIAID) Data Ecosystem Discovery Portal (https://data.niaid.nih.gov) provides a unified search interface for over 4 million data sets

The NIAID Discovery Portal: a unified search engine for infectious and immune-mediated disease datasets

ABSTRACT The National Institute of Allergy and Infectious Diseases (NIAID) Data Ecosystem Discovery Portal (https://data.niaid.nih.gov) provides a unified search interface for over 4 million data sets relevant to infectious and immune-mediated disease (IID) research. Integrating metadata from domain-specific and generalist repositories, the Portal enables researchers to identify and access data sets using user-friendly filters or advanced queries, without requiring technical expertise. The Portal supports discovery of a wide range of resources, including epidemiological, clinical, and multi-omic data sets and is designed to accommodate exploratory browsing and precise searches. The Portal provides filters, prebuilt queries, and data set collections to simplify the discovery process for users. The Portal additionally provides documentation and an API for programmatic access to harmonized metadata. By easing access barriers to important biomedical data sets, the NIAID Data Ecosystem Discovery Portal serves as an entry point for researchers working to understand, diagnose, or treat IID. IMPORTANCE Valuable data sets are often overlooked because they are difficult to locate. The NIAID Data Ecosystem Discovery Portal fills this gap by providing a centralized, searchable interface that empowers users with varying levels of technical expertise to find and reuse data. By standardizing key metadata fields and harmonizing heterogeneous formats, the Portal improves data findability, accessibility, and reusability. This resource supports hypothesis generation, comparative analysis, and secondary use of public data by the IID research community, including those funded by NIAID. The Portal supports data sharing by standardizing metadata and linking to source repositories and maximizes the impact of public investment in research data by supporting scientific advancement via secondary use. Valuable data sets are often overlooked because they are difficult to locate. The NIAID Data Ecosystem Discovery Portal fills this gap by providing a centralized, searchable interface that empowers users with varying levels of technical expertise to find and reuse data. By standardizing key metadata fields and harmonizing heterogeneous formats, the Portal improves data findability, accessibility, and reusability. This resource supports hypothesis generation, comparative analysis, and secondary use of public data by the IID research community, including those funded by NIAID. The Portal supports data sharing by standardizing metadata and linking to source repositories and maximizes the impact of public investment in research data by supporting scientific advancement via secondary use.


💡 Research Summary

The paper presents the NIAID Data Ecosystem Discovery Portal, a unified, web‑based search engine that aggregates metadata from more than four million datasets relevant to infectious and immune‑mediated disease (IID) research. By harvesting metadata from both domain‑specific repositories (e.g., ImmPort, GEO, SRA) and generalist platforms (e.g., Zenodo, Figshare), the portal creates a harmonized index that adheres to a NIAID‑defined metadata schema. The harmonization pipeline automatically maps heterogeneous fields to core concepts such as disease category, study phase, data type, sample size, and access restrictions, using ontologies and synonym dictionaries to resolve semantic differences. Missing values are inferred where possible, and community feedback mechanisms allow continuous improvement of data quality.

The harmonized metadata are stored in an Elasticsearch cluster, enabling near‑real‑time search. The user interface supports two complementary modes. A filter‑driven “browse” mode offers intuitive check‑boxes, sliders, and dropdowns for disease groups, geographic region, study type, and data modality, allowing non‑technical users to narrow results quickly. An advanced “query builder” mode provides Boolean logic, wildcards, range operators, and free‑text search for power users. Pre‑built queries and curated “collections” (e.g., COVID‑19 clinical trials, multi‑omics time‑series) are available with a single click, facilitating common research scenarios.

Programmatic access is delivered through a RESTful API that returns JSON‑formatted metadata records with pagination, sorting, and filtering options. Authentication uses OAuth2 tokens, ensuring that controlled‑access datasets remain protected while still being discoverable. API documentation is auto‑generated via Swagger UI, and example client libraries are supplied to ease integration into analysis pipelines. Each dataset entry includes a persistent DOI and a direct link to the source repository, supporting the FAIR principles of findability, accessibility, interoperability, and reusability.

Usage metrics collected during the first six months demonstrate the portal’s impact: over 1,200 unique search sessions per month, average query latency of 0.8 seconds (a 70 % reduction compared with searching individual repositories), and more than 15,000 API calls per month, indicating active secondary‑use of the indexed data.

The authors acknowledge limitations. Metadata quality depends on the original repositories; automated harmonization cannot fully eliminate semantic errors, and ongoing curation is required. The current coverage is weighted toward large U.S. and international databases, leaving regional or smaller‑scale data hubs under‑represented. Future work will focus on expanding the connector ecosystem, improving validation workflows, and fostering community contributions to enrich the metadata corpus.

In summary, the NIAID Discovery Portal provides a scalable, user‑friendly gateway to a vast IID data landscape, lowering barriers for hypothesis generation, comparative analyses, and translational research. By standardizing key metadata fields, offering both graphical and programmatic search capabilities, and linking back to original data sources, the portal maximizes the return on public investment in biomedical data and serves as a critical infrastructure for the global infectious‑disease research community.


📜 Original Paper Content

🚀 Synchronizing high-quality layout from 1TB storage...