Diagnosing Structural Failures in LLM-Based Evidence Extraction for Meta-Analysis

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

Systematic reviews and meta-analyses rely on converting narrative articles into structured, numerically grounded study records. Despite rapid advances in large language models (LLMs), it remains unclear whether they can meet the structural requirements of this process, which hinge on preserving roles, methods, and effect-size attribution across documents rather than on recognizing isolated entities. We propose a structural, diagnostic framework that evaluates LLM-based evidence extraction as a progression of schema-constrained queries with increasing relational and numerical complexity, enabling precise identification of failure points beyond atom-level extraction. Using a manually curated corpus spanning five scientific domains, together with a unified query suite and evaluation protocol, we evaluate two state-of-the-art LLMs under both per-document and long-context, multi-document input regimes. Across domains and models, performance remains moderate for single-property queries but degrades sharply once tasks require stable binding between variables, roles, statistical methods, and effect sizes. Full meta-analytic association tuples are extracted with near-zero reliability, and long-context inputs further exacerbate these failures. Downstream aggregation amplifies even minor upstream errors, rendering corpus-level statistics unreliable. Our analysis shows that these limitations stem not from entity recognition errors, but from systematic structural breakdowns, including role reversals, cross-analysis binding drift, instance compression in dense result sections, and numeric misattribution, indicating that current LLMs lack the structural fidelity, relational binding, and numerical grounding required for automated meta-analysis. The code and data are publicly available at GitHub (https://github.com/zhiyintan/LLM-Meta-Analysis).


💡 Research Summary

The paper presents a systematic diagnostic study of large language models (LLMs) for the highly structured task of evidence extraction required for meta‑analysis. While LLMs have shown impressive performance on isolated entity extraction, meta‑analysis demands the preservation of complex relational bindings among populations, variables, statistical methods, sample sizes, and effect sizes across multiple documents. To probe where current models break down, the authors introduce a tiered, schema‑constrained query framework that incrementally increases relational and numerical complexity. The framework defines four levels of queries: (1) single‑entity extraction, (2) variable‑role pairing (independent vs. dependent), (3) linking statistical methods to reported effect sizes, and (4) full meta‑analytic association tuples that combine all required fields. Each query is expressed as a structured prompt that forces the model to output JSON‑like records adhering to a unified study schema.
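As an illustration of what such a schema-constrained output looks like, the sketch below shows a hypothetical tier-4 record and a completeness check. The field names and values are assumptions for illustration only, not the paper's exact schema:

```python
# Illustrative tier-4 extraction record: a full meta-analytic association
# tuple binding population, variable roles, method, sample size, and
# effect size. All field names/values are hypothetical.
record = {
    "population": "adults aged 40-65",
    "independent_variable": "daily walking time",
    "dependent_variable": "systolic blood pressure",
    "statistical_method": "linear regression",
    "sample_size": 312,
    "effect_size": {"type": "beta", "value": -0.42, "ci_95": [-0.61, -0.23]},
}

REQUIRED_FIELDS = {
    "population", "independent_variable", "dependent_variable",
    "statistical_method", "sample_size", "effect_size",
}

def is_complete(rec: dict) -> bool:
    """A tier-4 answer only counts if every required field is bound."""
    return REQUIRED_FIELDS.issubset(rec) and all(
        rec[f] is not None for f in REQUIRED_FIELDS
    )

print(is_complete(record))  # True
```

The point of the tiered design is that tiers 1-3 each query a subset of these fields, while tier 4 requires all bindings to hold simultaneously in one record.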

The authors curate a multi‑domain benchmark covering five scientific fields—civil engineering, medical and health sciences, agricultural science, earth and environmental sciences, and social science. Over 1,200 full‑text articles were manually annotated to provide gold‑standard study records for each of the schema elements. Two state‑of‑the‑art LLMs, GPT‑4‑Turbo and Claude‑2, are evaluated under two input regimes: (a) per‑document processing, and (b) long‑context, multi‑document ingestion (up to 32 KB). Standard classification metrics (precision, recall, F1) and numeric error measures (mean absolute error) are reported for each tier.
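The reported metrics follow standard definitions; a minimal sketch of how set-based precision/recall/F1 (for categorical fields) and mean absolute error (for numeric fields) are computed, not the authors' evaluation code:

```python
# Set-based precision/recall/F1 over extracted vs. gold field values,
# plus mean absolute error for numeric fields. Standard definitions.
def prf1(gold: set, pred: set) -> tuple[float, float, float]:
    tp = len(gold & pred)  # true positives: values in both sets
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def mae(gold_vals: list[float], pred_vals: list[float]) -> float:
    """Mean absolute error between aligned gold and predicted numbers."""
    return sum(abs(g - p) for g, p in zip(gold_vals, pred_vals)) / len(gold_vals)

# Toy example: two of three gold variables recovered, one spurious.
p, r, f = prf1({"age", "bmi", "dose"}, {"age", "dose", "sex"})
print(round(f, 2))  # 0.67
```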

Results reveal a clear pattern. At tier 1, both models achieve moderate F1 scores (0.71–0.78), indicating reliable entity recognition. However, performance collapses as relational demands increase. Tier 2 (variable‑role pairing) sees recall drop below 0.45, with frequent role reversals (e.g., treating the independent variable as dependent). Tier 3 (method‑effect linking) exhibits substantial numeric misattribution: mean absolute error exceeds 0.32 standard deviations, and units or value ranges are often confused. Tier 4, which requires extracting the full tuple needed for meta‑analysis, yields near‑zero reliability (recall ≈ 0.07). Long‑context inputs exacerbate these failures, as models tend to ignore or compress information located in the middle of extended texts, leading to “instance compression” in dense result sections.

A downstream aggregation simulation demonstrates that even modest upstream binding errors propagate dramatically: average effect‑size estimates can be biased by more than 15 % when fed with the imperfect extractions. Importantly, the authors argue that these errors stem not from simple entity‑recognition mistakes but from systematic structural breakdowns—role reversals, binding drift across analyses, compression of dense result sections, and numeric misattribution. Consequently, current LLMs lack the structural fidelity, relational binding stability, and numerical grounding required for fully automated meta‑analysis.
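The propagation effect described above can be sketched with a deterministic toy calculation. The 8% error rate, the sign-flip corruption model (mimicking a role reversal), and the constant per-study effect size are all assumptions for illustration, not the paper's experimental setup:

```python
# Toy illustration: flipping the sign of 8% of otherwise-correct
# extracted effect sizes shifts the pooled (unweighted) mean by 16%,
# since each flipped record contributes -e instead of +e.
true_effects = [0.30] * 200  # hypothetical per-study effect sizes

def pool_with_errors(effects: list[float], error_rate: float) -> float:
    """Corrupt a fixed fraction of records with a sign flip, then pool
    the corpus with an unweighted mean."""
    n_bad = int(len(effects) * error_rate)
    corrupted = [-e for e in effects[:n_bad]] + effects[n_bad:]
    return sum(corrupted) / len(corrupted)

clean = sum(true_effects) / len(true_effects)            # 0.30
noisy = pool_with_errors(true_effects, error_rate=0.08)  # 0.252
bias_pct = abs(noisy - clean) / abs(clean) * 100
print(round(bias_pct, 1))  # 16.0
```

Even this crude model shows why corpus-level statistics become unreliable: each binding error contributes twice its magnitude to the pooled bias, so single-digit error rates translate into double-digit distortions of the aggregate estimate.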

The paper concludes that pure LLM‑based pipelines are insufficient for generating the high‑quality, machine‑actionable datasets needed in systematic reviews. It advocates for hybrid approaches that combine schema‑constrained prompting with neural‑symbolic techniques, external knowledge‑graph validation, and post‑hoc verification modules. The authors also release their code and annotated dataset on GitHub to facilitate reproducibility and further research into structurally robust scientific information extraction.

