SciDef: Automating Definition Extraction from Academic Literature with Large Language Models
Definitions are the foundation of any scientific work, but with the rapid growth in publication numbers, gathering the definitions relevant to a given keyword has become challenging. We therefore introduce SciDef, an LLM-based pipeline for automated definition extraction. We test SciDef on DefExtra & DefSim, novel datasets of human-extracted definitions and definition-pair similarity, respectively. Evaluating 16 language models across prompting strategies, we demonstrate that multi-step and DSPy-optimized prompting improve extraction performance. To evaluate extraction, we test various metrics and show that an NLI-based method yields the most reliable results. We show that LLMs are largely able to extract definitions from scientific literature (86.4% of definitions from our test-set); yet future work should focus not just on finding definitions, but on identifying relevant ones, as models tend to over-generate them. Code & datasets are available at https://github.com/Media-Bias-Group/SciDef.
💡 Research Summary
The paper introduces SciDef, a novel pipeline that leverages large language models (LLMs) to automatically extract definitions from academic literature. Recognizing that precise definitions are essential for scientific progress yet increasingly difficult to curate manually due to the exponential growth of publications, the authors set out to fill three gaps in the existing research: (1) the lack of publicly available, reproducible benchmark datasets for definition extraction from scholarly articles, (2) limited exploration of extraction pipelines and prompting strategies, and (3) insufficient evaluation methodologies for comparing model outputs with human‑annotated ground truth.
To address these gaps, the authors create and release two new resources. DefExtra is a benchmark dataset containing 268 human‑extracted definitions drawn from 75 carefully curated papers in the media‑bias domain. Each entry includes the definition text, a “type” label (explicit quote vs. implicit paraphrase), and a three‑sentence context window (preceding, containing, and following sentence). DefSim consists of 60 definition pairs annotated on a 1‑5 similarity scale, designed to evaluate semantic similarity metrics specifically for definitions. Both datasets are constructed with rigorous annotation protocols, double‑review verification, and metadata that facilitate downstream analysis.
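The DefExtra entry structure described above can be sketched as a record. Note that the field names and all values here are illustrative assumptions for readability, not the dataset's actual schema:

```python
# Hypothetical DefExtra-style record; field names and values are
# illustrative only, not the released dataset's actual schema.
entry = {
    "keyword": "media bias",
    "definition": "Media bias is the slanted presentation of news content.",
    "type": "explicit",  # explicit quote vs. implicit paraphrase
    "context": {
        # three-sentence context window around the definition
        "preceding": "Prior work has studied news framing extensively.",
        "containing": "Media bias is the slanted presentation of news content.",
        "following": "This definition guides our annotation protocol.",
    },
}
```

Keeping the type label and the context window alongside the definition text is what enables the type-match and context-match scoring components discussed later in the paper.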
The SciDef pipeline consists of four stages: (1) PDF collection and conversion to structured text using GROBID, (2) definition extraction via LLMs, (3) prompting strategy selection and optimization, and (4) post‑processing and evaluation. The authors experiment with 16 LLMs—including GPT‑3.5‑turbo, Claude‑2, and various Llama‑2 variants—and three prompting paradigms: single‑step prompting, multi‑step prompting, and DSPy‑optimized prompting. DSPy is an open‑source framework that automatically compiles and optimizes prompts for each model, effectively tailoring the prompt to the model’s strengths. Results show that multi‑step prompting and DSPy‑optimized prompts consistently outperform single‑step baselines, yielding 4–7 percentage‑point gains in F1 score across models.
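The multi-step paradigm can be sketched as a two-stage loop: first elicit candidate definition sentences, then verify each with a follow-up query. This is a minimal sketch, not the paper's exact prompts; `call_llm` is a stub standing in for a real LLM API call:

```python
def call_llm(prompt: str) -> str:
    """Stub for an actual LLM API call; replace with a real SDK call."""
    # Deterministic canned responses so the sketch runs without an API.
    if "candidate sentences" in prompt:
        return "Media bias is the slanted presentation of news."
    return "yes"

def extract_definitions_multistep(paper_text: str, keyword: str) -> list[str]:
    # Step 1: ask the model for candidate definition sentences.
    candidates = call_llm(
        f"From the paper below, list candidate sentences that define "
        f"'{keyword}', one per line, as candidate sentences:\n{paper_text}"
    ).splitlines()
    # Step 2: verify each candidate with a yes/no follow-up prompt.
    verified = []
    for cand in candidates:
        answer = call_llm(f"Does this sentence define '{keyword}'? {cand}")
        if answer.strip().lower().startswith("yes"):
            verified.append(cand)
    return verified
```

Splitting extraction into generation and verification steps is one plausible reading of "multi-step prompting"; DSPy-optimized prompting would instead compile such stages automatically per model.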
A central contribution of the work is a systematic evaluation of similarity metrics for definition comparison. The authors assess three families: (i) cosine similarity over transformer‑based embeddings (e.g., Sentence‑BERT), (ii) bidirectional natural language inference (NLI) entailment scores, and (iii) LLM‑as‑a‑Judge prompting where the model directly rates similarity. These metrics are benchmarked on standard semantic similarity datasets (STS‑B, SICK, MSRP, Quora Question Pairs) and on the newly created DefSim. The NLI‑based metric emerges as the most reliable, exhibiting the highest correlation with human judgments, especially when augmented with a binary type‑match bonus and, for DSPy‑trained models, a context‑match component.
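The bidirectional NLI metric can be sketched as entailment scored in both directions and combined symmetrically. Averaging the two directions is an assumption here (the paper may combine them differently), and the token-overlap heuristic below merely stands in for a real NLI model such as a cross-encoder fine-tuned on MNLI:

```python
def entailment_prob(premise: str, hypothesis: str) -> float:
    """Stand-in for an NLI model's entailment probability.

    A real implementation would run a cross-encoder NLI model; this toy
    token-overlap heuristic just keeps the sketch self-contained.
    """
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    return len(p & h) / max(len(h), 1)

def nli_similarity(def_a: str, def_b: str) -> float:
    # Bidirectional: score entailment both ways so the metric is
    # symmetric in the two definitions (averaging is an assumption).
    return 0.5 * (entailment_prob(def_a, def_b) + entailment_prob(def_b, def_a))
```

The type-match bonus and context-match component mentioned above would be added on top of this base similarity score.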
Evaluation on the DefExtra test set reveals that the best LLM‑prompt combination extracts 86.4% of the ground‑truth definitions. However, the models tend to over‑generate, producing on average 1.3 extra definitions per paper, which can dilute precision. The authors therefore propose a set‑level scoring function that balances recall‑like coverage (ground‑truth match) with precision‑like penalization of over‑generation, using a best‑match alignment algorithm with a similarity threshold τ = 0.25.
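The set-level scoring idea can be sketched as follows. The greedy alignment strategy and the Jaccard similarity used here are illustrative choices; the paper's exact alignment algorithm and similarity function may differ, though the τ = 0.25 threshold is taken from the text above:

```python
TAU = 0.25  # similarity threshold reported in the paper

def jaccard(a: str, b: str) -> float:
    """Toy similarity for illustration; the paper uses NLI-based scores."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def set_score(gold, predicted, sim):
    """Greedily align each gold definition to its best unmatched prediction.

    Returns (coverage, n_extra): the fraction of gold definitions matched
    above TAU (recall-like), and the number of unmatched predictions
    (the over-generation that the set-level score penalizes).
    """
    remaining = list(predicted)
    matched = 0
    for g in gold:
        scored = [(sim(g, p), p) for p in remaining]
        if scored:
            best_sim, best_p = max(scored, key=lambda t: t[0])
            if best_sim >= TAU:
                matched += 1
                remaining.remove(best_p)  # each prediction matches at most once
    coverage = matched / len(gold) if gold else 1.0
    return coverage, len(remaining)
```

A one-to-one alignment keeps a single verbose prediction from "covering" several gold definitions, which is why the extra-prediction count directly measures over-generation.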
The paper concludes that LLMs are highly capable of extracting definitions from scholarly texts, provided that prompting is carefully engineered and evaluation employs robust NLI‑based similarity measures. Nonetheless, the authors argue that future work should shift focus from mere extraction to relevance filtering—identifying which extracted definitions are truly pertinent to a given query or research agenda. They also suggest developing automated post‑processing filters to curb hallucinated or redundant definitions. All code, models, and datasets are released under FAIR principles, with persistent identifiers to ensure long‑term reproducibility.
Overall, SciDef represents a significant step toward scalable, high‑quality definition mining in the scientific literature, offering both a practical system and a benchmark framework that the community can build upon.