INQUIRE-Search: Interactive Discovery in Large-Scale Biodiversity Databases

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Many ecological questions center on complex phenomena, such as species interactions, behaviors, phenology, and responses to disturbance, that are inherently difficult to observe and sparsely documented. Community science platforms such as iNaturalist contain hundreds of millions of biodiversity images, which often contain evidence of these complex phenomena. However, current workflows that seek to discover and analyze this evidence often rely on manual inspection, leaving this information largely inaccessible at scale. We introduce INQUIRE-Search, an open-source system that uses natural language to enable scientists to rapidly search an ecological image database like iNaturalist for specific phenomena, verify and export relevant observations, and use these outputs for downstream scientific analysis. Across five illustrative case studies, INQUIRE-Search concentrates relevant observations 3-25x more efficiently than comparable manual inspection budgets, supporting ecological inference that ranges from seasonal variation in behavior across species to forest regrowth after wildfires. Together, these examples illustrate a new paradigm for interactive, efficient, and scalable scientific discovery that can begin to unlock previously inaccessible scientific value in large-scale biodiversity datasets. Finally, we highlight how AI-enabled discovery tools for science require reframing aspects of the scientific process, including experiment design, data collection, survey effort, and uncertainty analysis.


💡 Research Summary

The paper introduces INQUIRE‑Search, an open‑source platform that enables scientists to retrieve complex ecological information from massive community‑science image repositories such as iNaturalist using natural‑language queries. The system combines a state‑of‑the‑art vision‑language model (VLM), specifically SigLIP‑So400m‑384‑14, with a high‑performance FAISS vector index to embed roughly 300 million images into a shared semantic space. At query time, a user‑provided text phrase is encoded by the VLM’s text encoder, and cosine similarity is used to rank pre‑computed image embeddings, delivering sub‑second retrieval even at this scale.
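The retrieval step described above can be sketched as follows. This is a minimal illustration, not the system's implementation: numpy stands in for the FAISS index so the example is self-contained, and the random vectors are placeholders for the real SigLIP image and text embeddings.

```python
import numpy as np

def build_index(image_embeddings):
    # L2-normalize so that inner product equals cosine similarity.
    # At iNaturalist scale, these rows would live in a FAISS index
    # (e.g. an inner-product index); plain numpy suffices for a sketch.
    norms = np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    return image_embeddings / norms

def search(index, text_embedding, k=5):
    # Encode-then-rank: normalize the query, score all images by
    # cosine similarity, and return the top-k indices and scores.
    q = text_embedding / np.linalg.norm(text_embedding)
    scores = index @ q
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Toy demo: 1,000 fake "image embeddings" in 16 dimensions.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(1000, 16))
index = build_index(image_embs)

# A "text query" engineered to lie near image 42, so that image
# should rank first.
query_emb = image_embs[42] + 0.01 * rng.normal(size=16)
top, scores = search(index, query_emb, k=3)
```

Because the embeddings are pre-computed and normalized once, each query costs only a single matrix-vector product plus a partial sort, which is what makes sub-second retrieval feasible.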

Beyond pure retrieval, INQUIRE‑Search implements a human‑in‑the‑loop workflow. Researchers inspect the top‑ranked images, label those that clearly display the target phenomenon (e.g., a bird holding a worm, young trees in a burned area), and discard irrelevant or ambiguous cases. An inspection budget (typically 200–500 images per query) is fixed to allow fair comparison across methods. The key efficiency metric is the screening yield Y = N_ret/N_insp, where N_ret is the number of verified, usable observations and N_insp is the total number inspected. Compared with a baseline that uses iNaturalist’s metadata filters and keyword matching, INQUIRE‑Search achieves a yield ratio (Y_INQUIRE/Y_baseline) of 3–25× across five diverse case studies, demonstrating that far more relevant observations can be gathered for the same human effort.
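The screening-yield metric is simple enough to state directly in code. The counts below are hypothetical, chosen only to produce a ratio inside the 3-25x range reported in the paper.

```python
def screening_yield(n_retained, n_inspected):
    """Screening yield Y = N_ret / N_insp: the fraction of inspected
    images that are verified as usable observations."""
    if n_inspected <= 0:
        raise ValueError("inspection budget must be positive")
    return n_retained / n_inspected

# Hypothetical counts for one query under a fixed 300-image budget:
y_inquire = screening_yield(180, 300)   # 0.60
y_baseline = screening_yield(24, 300)   # 0.08

# The yield ratio compares methods at equal human effort.
yield_ratio = y_inquire / y_baseline    # 7.5
```

Fixing the inspection budget is what makes the comparison fair: both methods consume the same human effort, so the ratio isolates how well each method concentrates relevant observations near the top of its ranking.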

The five case studies illustrate the system's breadth: (1) Seasonal variation in bird diets, using 35 prompts that combine species names with diet types; (2) Post‑fire forest regeneration, querying “young coniferous trees in burned forest” and “young deciduous trees in burned forest” to quantify successional trajectories; (3) Wildlife mortality patterns, retrieving “dead bird” images to map spatio‑temporal mortality hotspots; (4) Plant phenology, extracting milkweed images at germination, flowering, seed‑set, and senescence stages to build phenological curves; and (5) Individual re‑identification of humpback whales via the distinctive “white underside of humpback whale fluke”. For each study, the workflow proceeds from a natural‑language query through optional taxonomic/geographic/date filters, vector‑based retrieval, and expert verification to CSV export of observation IDs, coordinates, timestamps, taxonomy, and image URLs. Exported datasets are immediately amenable to GIS filtering, time‑series analysis, and statistical modeling, demonstrating ecological utility beyond mere data collection.
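The final export step can be sketched with the standard library's csv module. The column names and record fields here are illustrative assumptions that mirror the fields listed above; the actual export schema is defined by INQUIRE-Search itself.

```python
import csv
import io

# Illustrative column set matching the fields described in the text.
FIELDS = ["observation_id", "latitude", "longitude",
          "timestamp", "taxon", "image_url"]

def export_verified(observations, fileobj):
    """Write only expert-verified observations to CSV, ready for
    downstream GIS filtering, time-series analysis, or modeling.
    Returns the number of rows written (excluding the header)."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS,
                            extrasaction="ignore")
    writer.writeheader()
    n = 0
    for obs in observations:
        if obs.get("verified"):   # keep only human-confirmed hits
            writer.writerow(obs)
            n += 1
    return n

# Two toy records: one verified during inspection, one rejected.
records = [
    {"observation_id": 1, "latitude": 44.1, "longitude": -71.3,
     "timestamp": "2021-06-02", "taxon": "Turdus migratorius",
     "image_url": "https://example.org/1.jpg", "verified": True},
    {"observation_id": 2, "latitude": 44.2, "longitude": -71.4,
     "timestamp": "2021-06-03", "taxon": "Turdus migratorius",
     "image_url": "https://example.org/2.jpg", "verified": False},
]
buf = io.StringIO()
kept = export_verified(records, buf)
```

The verification flag is the human-in-the-loop boundary: everything upstream of it is automated retrieval, everything downstream is conventional ecological analysis.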

Technical challenges are acknowledged. VLMs inherit domain bias from their training data, potentially missing rare behaviors or understudied taxa. Visual variability of ecological phenomena (e.g., different lighting, occlusion) can reduce retrieval precision. Storing and updating embeddings for hundreds of millions of images demands substantial storage and compute resources. The authors propose mitigations: continual fine‑tuning of the VLM with expert‑labeled samples, ensemble retrieval using multiple VLMs, and incremental indexing to incorporate new observations without rebuilding the entire index.
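The incremental-indexing mitigation amounts to appending new observation embeddings without re-embedding or rebuilding the existing collection. A minimal sketch, again using numpy as a self-contained stand-in for a real vector index:

```python
import numpy as np

class IncrementalIndex:
    """Toy incremental index: new embeddings are appended in place,
    so ingesting a batch of fresh observations never requires
    rebuilding or re-normalizing the existing entries. (FAISS
    supports the same pattern via its index add operation.)"""

    def __init__(self, dim):
        self.dim = dim
        self.vecs = np.empty((0, dim), dtype=np.float32)

    def add(self, embeddings):
        e = np.asarray(embeddings, dtype=np.float32)
        # Normalize once at ingest time so later cosine-similarity
        # queries reduce to inner products.
        e = e / np.linalg.norm(e, axis=1, keepdims=True)
        self.vecs = np.vstack([self.vecs, e])

    def __len__(self):
        return self.vecs.shape[0]

rng = np.random.default_rng(0)
index = IncrementalIndex(dim=16)
index.add(rng.normal(size=(100, 16)))  # initial build
index.add(rng.normal(size=(5, 16)))    # e.g. a new week's observations
```

The cost of ingesting new data is then proportional to the size of the new batch, not the size of the full archive, which matters when the archive holds hundreds of millions of images.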

Uncertainty analysis is incorporated by reporting screening yields and yield ratios, and by assessing the spatial and temporal sampling bias introduced by the opportunistic nature of community‑science data. The authors argue that AI‑enabled tools like INQUIRE‑Search reshape scientific workflows: hypothesis formulation can be translated directly into natural‑language queries; data acquisition becomes a rapid, reproducible, and scalable process; and uncertainty quantification can be embedded in the retrieval pipeline rather than treated as an afterthought.

All code, documentation, and reproducible pipelines are released on GitHub (https://github.com/Beery-Lab/INQUIRE-Search), and the case‑study datasets are derived from publicly available iNaturalist records with full metadata and licensing information. In sum, the paper demonstrates that vision‑language models coupled with efficient vector search and expert verification can unlock previously inaccessible secondary ecological information at scale, offering a new paradigm for interactive, efficient, and scalable scientific discovery in biodiversity informatics.

