IV Co-Scientist: Multi-Agent LLM Framework for Causal Instrumental Variable Discovery

Notice: This research summary and analysis were automatically generated using AI technology. For complete accuracy, please refer to the original arXiv source.

In the presence of confounding between an endogenous variable and the outcome, instrumental variables (IVs) are used to isolate the causal effect of the endogenous variable. Identifying valid instruments requires interdisciplinary knowledge, creativity, and contextual understanding, making it a non-trivial task. In this paper, we investigate whether large language models (LLMs) can aid in this task. We develop a two-stage evaluation framework. First, we test whether LLMs can recover well-established instruments from the literature, assessing their ability to replicate standard reasoning. Second, we evaluate whether LLMs can identify and avoid instruments that have been empirically or theoretically discredited. Building on these results, we introduce IV Co-Scientist, a multi-agent system that proposes, critiques, and refines IVs for a given treatment-outcome pair. We also introduce a statistical test to contextualize consistency in the absence of ground truth. Our results show the potential of LLMs to discover valid instrumental variables from a large observational database.


💡 Research Summary

The paper investigates whether large language models (LLMs) can assist in the notoriously difficult task of discovering valid instrumental variables (IVs) for causal inference. Recognizing that a valid IV must satisfy three core conditions—relevance (correlation with the treatment), exclusion restriction (no direct effect on the outcome except through the treatment), and independence from unobserved confounders—the authors ask whether LLMs, trained on massive corpora spanning economics, health, law, and history, possess enough contextual knowledge to generate and evaluate candidate IVs.

To answer this, the authors design a two‑stage evaluation framework. In the first stage they test whether LLMs can recover well‑established IVs from the literature for a set of benchmark treatment‑outcome pairs drawn from economics, health, and social science. Success here demonstrates that the model can replicate standard causal reasoning and retain domain‑specific knowledge. In the second stage they assess whether LLMs can avoid suggesting instruments that have been empirically or theoretically discredited (e.g., rainfall as an instrument for war‑related outcomes). This stage probes whether the model merely reproduces statistical associations or truly understands the causal constraints.

Building on these validation steps, the authors introduce IV Co‑Scientist, a multi‑agent pipeline that mirrors the human research workflow of hypothesis generation followed by expert critique. The pipeline consists of three LLM‑based agents:

  1. HypothesisGenerator – given a treatment‑outcome pair (T, Y), it proposes three candidate IVs (Z₁…Z₃) and three plausible confounders (U₁…U₃). Prompt engineering includes an “economist” persona to encourage domain‑appropriate reasoning.
  2. CriticAgent‑Exclusion – independently evaluates each proposed Z for the exclusion restriction, reasoning whether Z could affect Y through any pathway other than T.
  3. CriticAgent‑Independence – independently checks whether each Z is conditionally independent of the set of hypothesized confounders U.

Only those Z that receive a positive judgment from both critics are retained as Z_valid. This separation of creative generation and rigorous vetting allows the system to explore a broader hypothesis space while still enforcing the two untestable causal assumptions via textual reasoning.
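The generate-then-critique control flow described above can be sketched in a few lines. Everything here is illustrative: the `llm` callable, the prompt wording, and the VALID/INVALID verdict format are assumptions for exposition, not the authors' implementation.

```python
# Illustrative sketch of the IV Co-Scientist pipeline: one creative generator
# agent, two independent critic agents, and a filter keeping only instruments
# both critics accept. All prompts and return formats are assumptions.
from dataclasses import dataclass

@dataclass
class Hypotheses:
    instruments: list   # candidate IVs Z1..Z3
    confounders: list   # plausible unobserved confounders U1..U3

def hypothesis_generator(llm, treatment, outcome):
    """Persona-prompted agent proposing candidate IVs and confounders."""
    prompt = (f"You are an economist. For treatment '{treatment}' and outcome "
              f"'{outcome}', propose three candidate instrumental variables "
              f"and three plausible unobserved confounders.")
    return llm(prompt)  # assumed to return a Hypotheses object

def passes(llm, prompt):
    """Parse a critic's verdict; assumes the reply leads with VALID/INVALID."""
    return llm(prompt).strip().upper().startswith("VALID")

def critic_exclusion(llm, z, treatment, outcome):
    return passes(llm, f"Could '{z}' affect '{outcome}' through any pathway "
                       f"other than '{treatment}'? Answer VALID or INVALID.")

def critic_independence(llm, z, confounders):
    return passes(llm, f"Is '{z}' plausibly independent of the confounders "
                       f"{confounders}? Answer VALID or INVALID.")

def iv_co_scientist(llm, treatment, outcome):
    hyp = hypothesis_generator(llm, treatment, outcome)
    # Retain only instruments that pass BOTH independent critics (Z_valid).
    return [z for z in hyp.instruments
            if critic_exclusion(llm, z, treatment, outcome)
            and critic_independence(llm, z, hyp.confounders)]
```

Swapping `llm` for any chat-completion client (or a stub, for testing) is enough to exercise the whole loop.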

The experimental suite compares five LLMs: GPT‑4o, o3‑mini, QwQ (all modern generative models), and Llama 3.1 (8B and 70B parameters). For the canonical IV recovery task, GPT‑4o and o3‑mini achieve Exact Match (EM) scores between 0.74 and 1.00 and Conceptual Match (CM) scores in the same high range, indicating they can either directly name literature‑cited IVs or produce semantically equivalent substitutes. QwQ performs similarly. By contrast, Llama 3.1 8B lags considerably (EM ≈ 0.28–0.76, CM ≈ 0.32–0.65), while the larger 70B version shows intermediate performance.
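The EM and CM metrics are described only informally, but a plausible reading is: EM counts literature IVs that the model names verbatim, while CM counts those matched under some semantic-equivalence judge (an LLM grader or embedding similarity in practice). The sketch below is a hypothetical scoring scheme under that reading; the paper's exact protocol may differ.

```python
# Hypothetical scoring sketch for Exact Match (EM) and Conceptual Match (CM);
# the judge used for CM is left pluggable, as the paper's protocol is not
# fully specified in this summary.

def exact_match(candidates, reference_ivs):
    """EM: fraction of literature IVs named verbatim (case/space-insensitive)."""
    norm = lambda s: " ".join(s.lower().split())
    cand = {norm(c) for c in candidates}
    return sum(norm(r) in cand for r in reference_ivs) / len(reference_ivs)

def conceptual_match(candidates, reference_ivs, same_concept):
    """CM: fraction of literature IVs matched by ANY candidate under a
    semantic-equivalence judge (LLM or embedding similarity in practice)."""
    return sum(any(same_concept(c, r) for c in candidates)
               for r in reference_ivs) / len(reference_ivs)
```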

In the invalid‑IV avoidance experiment, the authors inject a known discredited instrument (e.g., rainfall) into the candidate list and observe whether the CriticAgents flag it. GPT‑4o and QwQ both rarely generate the discredited IV in the first place, and when it is forcibly inserted, the critics reject it with >90% accuracy. Llama 3.1 8B, however, frequently proposes the invalid IV and fails to reject it reliably, highlighting the importance of model size and instruction tuning for causal reasoning.
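The forced-insertion protocol amounts to appending a known-invalid instrument to each candidate list and measuring how often the critic rejects it. A minimal harness, with `critic` standing for any validity judge (a callable returning True for "valid") and all names chosen for illustration:

```python
# Minimal harness for the forced-insertion experiment: the rejection accuracy
# is the fraction of treatment-outcome pairs for which a known-discredited
# instrument is correctly judged invalid. Names here are illustrative.

def rejection_accuracy(critic, discredited_iv, pairs):
    """critic(z, treatment, outcome) -> bool (True means 'valid')."""
    rejections = sum(not critic(discredited_iv, t, y) for t, y in pairs)
    return rejections / len(pairs)
```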

To demonstrate real‑world utility, the authors apply IV Co‑Scientist to the Gapminder dataset, which contains over 300 socio‑economic indicators across countries and years. For novel treatment‑outcome pairs not previously studied, the system generates multiple candidate IVs. Since ground‑truth validity cannot be directly measured, the authors propose a consistency metric: they run two‑stage least squares (2SLS) using each candidate IV and compute the variance of the resulting causal effect estimates. Low variance suggests that the different IVs are converging on the same underlying causal effect, providing indirect evidence of validity. GPT‑4o‑driven pipelines achieve the lowest average variance (≈ 0.12), whereas smaller models exhibit higher dispersion (0.25–0.38).
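The consistency metric can be sketched concretely: estimate the causal effect once per candidate IV and take the variance of the estimates. The sketch below uses the just-identified 2SLS (Wald ratio) estimator with no covariates, which is a simplifying assumption; the paper's estimator may condition on additional controls.

```python
import numpy as np

# Sketch of the consistency metric: one causal-effect estimate per candidate
# IV via just-identified 2SLS (the Wald ratio; no covariates, an assumption
# for simplicity), then the variance across candidates. Low variance suggests
# the different IVs converge on the same underlying effect.

def iv_2sls(z, t, y):
    """Just-identified 2SLS estimate: Cov(Z, Y) / Cov(Z, T)."""
    z, t, y = (np.asarray(a, dtype=float) for a in (z, t, y))
    return np.cov(z, y)[0, 1] / np.cov(z, t)[0, 1]

def consistency_variance(candidate_ivs, t, y):
    """Variance of per-IV effect estimates; lower = more internally consistent."""
    estimates = [iv_2sls(z, t, y) for z in candidate_ivs]
    return float(np.var(estimates)), estimates
```

On synthetic data where two instruments are both valid, the per-IV estimates cluster around the true effect and the variance is near zero, which is exactly the signal the metric is after.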

Overall, the paper makes three substantive contributions: (1) a rigorous two‑stage benchmark that tests both recall of established IVs and avoidance of known invalid instruments; (2) the design of a multi‑agent LLM framework that separates hypothesis generation from causal critique, enabling systematic refinement of candidate IVs; and (3) an internal consistency metric for evaluating IVs when external ground truth is unavailable. The findings suggest that state‑of‑the‑art LLMs can indeed act as “thinking collaborators” in the early stages of IV discovery, offering plausible candidates and flagging obvious violations.

Nevertheless, the authors acknowledge several limitations. First, the critics’ judgments are still based on textual inference rather than empirical tests; thus, human domain experts must ultimately validate any proposed IV before policy or scientific conclusions are drawn. Second, the evaluation focuses on economics, health, and social science; extending the approach to fields like environmental policy, law, or education will require additional domain‑specific prompts and possibly fine‑tuning. Third, the consistency metric, while useful, does not replace formal over‑identification tests or external validation studies; it merely provides a heuristic signal of stability. Finally, the current pipeline assumes a fixed number of candidate IVs and confounders, which may limit exploration of richer hypothesis spaces.

In conclusion, the study demonstrates that large language models, when orchestrated through a structured multi‑agent system, can meaningfully contribute to the discovery and preliminary vetting of instrumental variables. This opens a promising avenue for augmenting traditional causal inference workflows with AI‑driven creativity, while still preserving the essential role of expert judgment for final validation. Future work should explore tighter integration with statistical over‑identification tests, broader domain coverage, and user‑in‑the‑loop interfaces that allow researchers to interactively guide and refine the LLM‑generated hypotheses.

