Linear-LLM-SCM: Benchmarking LLMs for Coefficient Elicitation in Linear-Gaussian Causal Models

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Large language models (LLMs) have shown potential in identifying qualitative causal relations, but their ability to perform quantitative causal reasoning – estimating effect sizes that parametrize functional relationships – remains underexplored in continuous domains. We introduce Linear-LLM-SCM, a plug-and-play benchmarking framework for evaluating LLMs on linear-Gaussian structural causal model (SCM) parametrization when the DAG is given. The framework decomposes a DAG into local parent-child sets and prompts an LLM to produce a regression-style structural equation per node, which is aggregated and compared against available ground-truth parameters. Our experiments reveal several challenges in this benchmarking task: strong stochasticity in the outputs of some models, and susceptibility to DAG misspecification via spurious edges in continuous domains. Across models, we observe substantial variability in coefficient estimates for some settings, as well as sensitivity to structural and semantic perturbations, highlighting current limitations of LLMs as quantitative causal parameterizers. We also open-source the benchmarking framework so that researchers can plug in their own DAGs and any off-the-shelf LLM to evaluate quantitative causal reasoning in their domains.


💡 Research Summary

The paper introduces Linear‑LLM‑SCM, a plug‑and‑play benchmarking framework designed to evaluate large language models (LLMs) on the task of directly eliciting the numerical coefficients of a linear‑Gaussian structural causal model (SCM) when the directed acyclic graph (DAG) is already known. The authors argue that while recent work has shown LLMs can identify qualitative causal relations (type‑1 tasks), their ability to perform quantitative parameter estimation (type‑3 tasks) remains largely unexplored, especially for continuous variables.

Linear‑LLM‑SCM works by decomposing a given DAG into local parent‑child sets ordered topologically. For each target node, a prompt is automatically generated that includes: a domain‑expert persona, a brief description of the phenomenon, the names, units and hard bounds of the target variable and its direct parents, a linear equation template, and a strict JSON output specification. The LLM is asked to return a concrete linear structural equation of the form
$$Y = \beta_0 + \sum_i \beta_i X_i + E_Y$$
with actual numeric values for all $\beta$ coefficients and a description of the error term $E_Y$. The response is parsed, the coefficients are stored, and the process repeats for the next node.
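The per-node elicitation step can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function names, the prompt wording, and the JSON schema (`beta_0`, `betas`, `noise`) are assumptions chosen to mirror the description above.

```python
# Hypothetical sketch of per-node prompt construction and strict-JSON
# response parsing. The DAG metadata layout and field names are
# illustrative assumptions, not the framework's real API.
import json


def build_node_prompt(target, parents, meta):
    """Assemble a regression-style elicitation prompt for one node."""
    parent_lines = "\n".join(
        f"- {p}: {meta[p]['unit']}, bounds {meta[p]['bounds']}" for p in parents
    )
    terms = " + ".join(f"beta_{i + 1} * {p}" for i, p in enumerate(parents))
    return (
        f"You are a domain expert in {meta['domain']}.\n"
        f"Target variable: {target} ({meta[target]['unit']}, "
        f"bounds {meta[target]['bounds']}).\n"
        f"Direct causes:\n{parent_lines}\n"
        f"Provide numeric values for the linear structural equation:\n"
        f"{target} = beta_0 + {terms} + E\n"
        'Respond ONLY with JSON: {"beta_0": ..., "betas": [...], "noise": "..."}'
    )


def parse_coefficients(llm_response: str):
    """Parse the strict-JSON reply into an intercept and a coefficient list."""
    obj = json.loads(llm_response)
    return obj["beta_0"], obj["betas"]
```

In the full loop, nodes would be visited in topological order, with each parsed coefficient set stored before moving to the next node.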

A key novelty is an iterative feedback loop (Algorithm 2). After each proposal the framework computes the feasible range of the target variable based on the parent ranges and the proposed coefficients (C1) and checks whether this range is fully contained within the pre‑specified hard constraints (C2). If not, the prompt is enriched with the previous proposal and the validation result, and the LLM is queried again, up to a fixed budget (typically five iterations). This mechanism enforces consistency with domain knowledge and variable units.
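The two checks can be sketched with interval arithmetic: propagate the parent bounds through the proposed coefficients to get the target's feasible range (C1), then test containment in the hard constraints (C2). The function names are illustrative; how the paper's Algorithm 2 computes the range internally is not specified here.

```python
# Minimal sketch of the range-propagation check (C1) and containment
# test (C2), assuming each parent has known [lo, hi] bounds. On failure,
# the framework would re-prompt with the previous proposal and this
# validation result, up to the iteration budget.


def implied_range(beta_0, betas, parent_bounds):
    """C1: feasible range of the target implied by coefficients and parent bounds."""
    lo = hi = beta_0
    for b, (p_lo, p_hi) in zip(betas, parent_bounds):
        # A negative coefficient flips which endpoint minimizes the term.
        term_lo, term_hi = sorted((b * p_lo, b * p_hi))
        lo += term_lo
        hi += term_hi
    return lo, hi


def satisfies_hard_bounds(beta_0, betas, parent_bounds, hard_bounds):
    """C2: is the implied range fully contained in the hard constraints?"""
    lo, hi = implied_range(beta_0, betas, parent_bounds)
    return hard_bounds[0] <= lo and hi <= hard_bounds[1]
```

For example, with an intercept of 1.0 and a single coefficient of 2.0 over a parent bounded in [0, 1], the implied range is [1.0, 3.0], which passes hard bounds of [0, 5] but fails [0, 2].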

Evaluation metrics are defined at four levels:

  • M1: Global L2 distance between the vector of all LLM‑produced coefficients and the ground‑truth vector.
  • M2: Node‑wise normalized L2 distance, which removes scale differences by normalizing each node’s coefficient vector before computing the error.
  • M3: Same as M2 but restricted to nodes with more than one parent, focusing on multi‑causal interactions.
  • M4: A binary indicator of whether the relative ordering (sign and magnitude ranking) of the LLM’s coefficients matches the ground truth for each multi‑parent node.
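The four metric levels can be sketched as follows, assuming estimated and ground-truth coefficients are collected into per-node NumPy arrays; the exact normalization and ranking conventions in the paper may differ.

```python
# Illustrative implementations of M1-M4, assuming coefficient vectors
# are stored as {node_name: np.ndarray} dicts with matching keys.
import numpy as np


def m1_global_l2(est, truth):
    """M1: L2 distance between the concatenated coefficient vectors."""
    e = np.concatenate([est[n] for n in sorted(est)])
    t = np.concatenate([truth[n] for n in sorted(truth)])
    return float(np.linalg.norm(e - t))


def m2_nodewise_normalized(est, truth):
    """M2: mean L2 distance after normalizing each node's vector to unit norm."""
    errs = []
    for n in est:
        e = est[n] / (np.linalg.norm(est[n]) or 1.0)
        t = truth[n] / (np.linalg.norm(truth[n]) or 1.0)
        errs.append(np.linalg.norm(e - t))
    return float(np.mean(errs))


def m3_multiparent(est, truth):
    """M3: M2 restricted to nodes with more than one parent (coefficient)."""
    multi = [n for n in est if len(est[n]) > 1]
    return m2_nodewise_normalized(
        {n: est[n] for n in multi}, {n: truth[n] for n in multi}
    )


def m4_ordering_match(est_node, truth_node):
    """M4: do the signs and the magnitude ranking of the coefficients agree?"""
    signs_ok = np.all(np.sign(est_node) == np.sign(truth_node))
    ranks_ok = np.all(
        np.argsort(np.abs(est_node)) == np.argsort(np.abs(truth_node))
    )
    return bool(signs_ok and ranks_ok)
```

M4 is deliberately coarse: it rewards an LLM that gets the qualitative structure of a multi-parent equation right (which cause dominates, and in which direction) even when the magnitudes are off.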

The authors benchmark several state‑of‑the‑art LLMs, including Gemini 2.5 Flash and multiple sizes of the Llama 3 family (dense and mixture‑of‑experts variants). Experiments are conducted on synthetic DAGs with known parameters, on real‑world DAGs derived from medical and financial domains, and on perturbed DAGs where spurious edges are deliberately added to test robustness.

Key findings:

  1. High stochasticity – Repeated calls to the same model with identical prompts produce noticeably different coefficient sets, indicating that the underlying generation process is highly probabilistic.
  2. Structural sensitivity – Adding even a single spurious edge dramatically worsens all metrics (M1‑M3 can increase by 30‑50 %). This shows that the local‑parent prompting strategy is vulnerable to misspecified structures.
  3. Variable performance across models – Larger models (e.g., Llama 3 70B) tend to yield lower L2 errors than smaller ones, but none achieve errors low enough for practical causal inference.
  4. Failure modes – For nodes with multiple parents, LLMs frequently flip signs, over‑ or under‑estimate magnitudes, or ignore unit constraints, whereas single‑parent nodes are handled more reliably.
  5. Feedback loop effectiveness – The iterative refinement succeeds in about 60 % of cases on the first attempt; the remaining cases converge after an average of 2–3 additional iterations, yet some still violate constraints when the DAG is misspecified.

The paper concludes that while Linear‑LLM‑SCM provides a systematic way to probe LLMs’ quantitative causal reasoning, current off‑the‑shelf LLMs are not yet trustworthy for precise coefficient estimation. The authors suggest future directions: (i) richer prompt engineering (e.g., chain‑of‑thought, few‑shot examples), (ii) pre‑training or fine‑tuning on large corpora of linear regression or SCM specifications to embed quantitative causal knowledge, (iii) incorporation of Bayesian uncertainty estimates (confidence intervals) into the output, and (iv) exploring multimodal models that can ingest the entire graph structure at once rather than relying on local prompts.

Overall, Linear‑LLM‑SCM fills a gap in the literature by moving beyond qualitative causal discovery toward quantitative effect size estimation, and it highlights concrete limitations of present LLM technology that must be addressed before they can serve as reliable quantitative causal modelers.

