Revisiting Task-Oriented Dataset Search in the Era of Large Language Models: Challenges, Benchmark, and Solution
The search for suitable datasets is the critical “first step” in data-driven research, but it remains a significant challenge. Researchers often need to search for datasets based on high-level task descriptions, yet existing search systems struggle with this due to ambiguous user intent, gaps in task-to-dataset mappings and benchmarks, and entity ambiguity. To address these challenges, we introduce KATS, a novel end-to-end system for task-oriented dataset search over unstructured scientific literature. KATS consists of two key components: offline knowledge base construction and online query processing. The offline pipeline automatically constructs a high-quality, dynamically updatable task-dataset knowledge graph by employing a collaborative multi-agent framework for information extraction, thereby filling the task-to-dataset mapping gap. To further address entity ambiguity, a semantic-based mechanism performs task entity linking and dataset entity resolution. For online retrieval, KATS uses a hybrid query engine that combines vector search with graph-based ranking to produce highly relevant results. Additionally, we introduce CS-TDS, a tailored benchmark suite for evaluating task-oriented dataset search systems, closing the gap in standardized evaluation. Experiments on this benchmark show that KATS significantly outperforms state-of-the-art retrieval-augmented generation frameworks in both effectiveness and efficiency, offering a robust blueprint for the next generation of dataset discovery systems.
💡 Research Summary
The paper tackles the long‑standing challenge of finding appropriate datasets based on high‑level, natural‑language task descriptions—a scenario that most existing dataset search engines fail to support due to a semantic gap between user intent and keyword‑driven retrieval. The authors identify three core obstacles: ambiguous user intent, the lack of explicit task‑to‑dataset mappings together with a missing benchmark for systematic evaluation, and entity ambiguity caused by inconsistent dataset naming across scientific literature.
To address these issues, they introduce KATS (Knowledge‑graph‑Augmented Task‑oriented dataset Search), an end‑to‑end system that consists of (1) an offline knowledge‑base construction pipeline and (2) an online hybrid query engine. The offline component automatically extracts task descriptions, dataset mentions, and their relationships from a large corpus of scientific papers using a collaborative multi‑agent framework. A semantic‑based entity linking module normalizes task expressions, while a dedicated dataset entity resolution module consolidates various aliases, acronyms, and version names into unique KG nodes. The resulting task‑dataset knowledge graph is stored in a NoSQL backend that supports incremental updates as new papers become available.
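The dataset entity resolution step can be illustrated with a toy sketch. KATS uses semantic embeddings for this; in the sketch below, character-trigram Jaccard similarity stands in for a learned similarity function, and the mention strings, threshold, and helper names are illustrative assumptions, not details from the paper.

```python
from itertools import combinations

def trigrams(name):
    """Character trigrams of a normalized (lowercase, alphanumeric) name."""
    s = "".join(ch for ch in name.lower() if ch.isalnum())
    return {s[i:i + 3] for i in range(len(s) - 2)} or {s}

def similarity(a, b):
    """Jaccard overlap of trigram sets -- a stand-in for embedding similarity."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def resolve(mentions, threshold=0.5):
    """Cluster dataset mentions into KG nodes via union-find over pairwise similarity."""
    parent = {m: m for m in mentions}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(mentions, 2):
        if similarity(a, b) >= threshold:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[ra] = rb  # merge the two alias clusters

    clusters = {}
    for m in mentions:
        clusters.setdefault(find(m), []).append(m)
    return list(clusters.values())
```

Surface-form similarity alone cannot merge aliases with no lexical overlap (e.g. a dataset and its project codename), which is why the paper's semantic, embedding-based resolution is the more general choice; the union-find clustering structure, however, carries over unchanged.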
The online component first encodes a user’s natural‑language task query with a large language model (LLM) to obtain a dense embedding. This embedding is used for fast approximate nearest‑neighbor search (e.g., via FAISS) to retrieve an initial candidate set of datasets. Those candidates are then re‑ranked by a graph‑based relevance model that combines personalized PageRank scores with edge‑weight‑derived relevance signals from the KG, effectively bridging the gap between pure vector similarity and the richer relational context captured in the graph. This hybrid retrieval strategy mitigates the “semantic gap” and resolves entity ambiguity that would otherwise lead to low recall.
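The two-stage online flow can be sketched in miniature. This is a simplified model, not the paper's implementation: toy two-dimensional vectors replace LLM embeddings and FAISS, a hand-built adjacency list replaces the KG, and the linear blending of vector and PageRank scores (weight `w`) is an assumed combination scheme.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def vector_candidates(query_vec, dataset_vecs, k=3):
    """Stage 1: dense retrieval -- top-k datasets by embedding similarity."""
    return sorted(dataset_vecs,
                  key=lambda d: cosine(query_vec, dataset_vecs[d]),
                  reverse=True)[:k]

def personalized_pagerank(graph, seeds, alpha=0.85, iters=50):
    """Power-iteration PPR with restart mass concentrated on the seed nodes."""
    nodes = list(graph)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        new = {n: (1 - alpha) * restart[n] for n in nodes}
        for n in nodes:
            out = graph[n]
            if out:
                share = alpha * rank[n] / len(out)
                for m in out:
                    new[m] += share
            else:  # dangling node: return its mass to the seeds
                for m in nodes:
                    new[m] += alpha * rank[n] * restart[m]
        rank = new
    return rank

def hybrid_rank(query_vec, dataset_vecs, graph, k=3, w=0.5):
    """Stage 2: re-rank the dense candidates with a blended vector + PPR score."""
    cands = vector_candidates(query_vec, dataset_vecs, k)
    ppr = personalized_pagerank(graph, seeds=set(cands))
    def score(d):
        return w * cosine(query_vec, dataset_vecs[d]) + (1 - w) * ppr.get(d, 0.0)
    return sorted(cands, key=score, reverse=True)
```

The design point this illustrates is that the graph stage only re-ranks the dense candidate set rather than scoring every KG node, which keeps online latency bounded by the ANN search plus a PPR run over a small subgraph.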
A major contribution is the CS‑TDS benchmark suite, which the authors construct at two scales: CS‑TDS M (628 papers, 47 task queries, ~1.8k datasets) and CS‑TDS L (2,101 papers, 204 queries, ~7.5k datasets). Queries are generated by LLMs and manually verified to reflect realistic task descriptions. Crucially, the source paper that contains the ground‑truth dataset is excluded from the search corpus, forcing systems to generalize beyond simple document matching. Ground‑truth annotation accepts the exact dataset used in the source paper, its aliases, and functionally equivalent substitutes, thereby explicitly handling dataset naming ambiguity.
Experimental evaluation compares KATS against several state‑of‑the‑art Retrieval‑Augmented Generation frameworks (HippoRAG, HippoRAG2, Raptor) and LLM‑integrated dataset search systems (PNEUMA, LEDD, AUTOTUS). KATS outperforms all baselines on MAP@10 and NDCG@10 by 15–22% and achieves an average latency of 0.35 seconds, making it suitable for interactive use. Ablation studies reveal that removing the entity‑resolution module degrades performance by roughly 8–10%, underscoring the importance of handling dataset name variations.
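MAP@10 and NDCG@10 are standard ranking metrics; the self-contained functions below show how they are computed under binary relevance (an assumption here — the benchmark's acceptance of functionally equivalent substitutes could equally be modeled with graded relevance).

```python
import math

def average_precision_at_k(ranked, relevant, k=10):
    """AP@k: mean of precision values at each rank where a relevant item appears."""
    hits, score = 0, 0.0
    for i, d in enumerate(ranked[:k], start=1):
        if d in relevant:
            hits += 1
            score += hits / i
    return score / min(len(relevant), k) if relevant else 0.0

def ndcg_at_k(ranked, relevant, k=10):
    """NDCG@k with binary gains: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

MAP is then the mean of `average_precision_at_k` over all task queries; a system that places every ground-truth dataset at the top of each ranking scores 1.0 on both metrics.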
In summary, the paper delivers a comprehensive solution that (i) builds a dynamically updatable task‑dataset knowledge graph, (ii) resolves entity ambiguity through semantic linking, (iii) integrates dense retrieval with graph‑based reasoning for high‑quality ranking, and (iv) provides a reproducible benchmark for future research. The authors argue that their approach not only advances dataset discovery in computer‑science literature but also offers a blueprint that can be adapted to other domains such as biomedical or social‑science data repositories.