Retrieving Classes of Causal Orders with Inconsistent Knowledge Bases
Traditional causal discovery methods often depend on strong, untestable assumptions, making them unreliable in real-world applications. In this context, Large Language Models (LLMs) have emerged as a promising alternative for extracting causal knowledge from text-based metadata, effectively consolidating domain expertise. However, LLMs are prone to hallucinations, necessitating strategies that account for these limitations. One effective approach is to use a consistency measure as a proxy for reliability. Moreover, LLMs do not clearly distinguish direct from indirect causal relationships, complicating the discovery of causal Directed Acyclic Graphs (DAGs), which are often sparse. This ambiguity reflects how causal statements are informally phrased across domains. For this reason, focusing on causal orders provides a more practical and direct task for LLMs. We propose a new method for deriving abstractions of causal orders that maximizes a consistency score obtained from an LLM. Our approach begins by computing pairwise consistency scores between variables, from which we construct a semi-complete partially directed graph that consolidates these scores into an abstraction. Using this structure, we identify both a maximally oriented partially directed acyclic graph and an optimal set of acyclic tournaments that maximize consistency across all configurations. We further demonstrate how both the abstraction and the class of causal orders can be used to estimate causal effects. We evaluate our method on a wide set of causal DAGs extracted from scientific literature in epidemiology and public health. Our results show that the proposed approach can effectively recover the correct causal order, providing a reliable and practical LLM-assisted causal framework.
💡 Research Summary
The paper introduces a novel framework for extracting causal orders from large language models (LLMs) by treating the LLM as an inconsistent knowledge base and using its self-consistency as a proxy for reliability. Traditional causal discovery algorithms rely on strong assumptions such as causal sufficiency and faithfulness, and often require large observational or interventional datasets. In contrast, LLMs contain vast amounts of domain-specific textual knowledge, but they are prone to hallucinations and inconsistent answers, making direct reconstruction of a causal directed acyclic graph (DAG) unreliable.
To address this, the authors propose measuring the LLM's self-consistency on pairwise causal queries. For each pair of variables \(X_i\) and \(X_j\), the model is prompted multiple times with semantically equivalent questions (e.g., "Is \(X_i\) a cause of \(X_j\)?"). The proportion of "Yes" responses defines a consistency score \(C_{X_i \succ X_j}\). This score quantifies how stable the LLM's answer is across re-phrasings, and prior work has shown that self-consistency outperforms token-probability-based uncertainty measures.
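As a minimal sketch of this scoring step, the pairwise score can be computed by polling the model over a set of paraphrased prompts. The `query_llm` callable and the paraphrase list below are hypothetical stand-ins, not the paper's actual interface:

```python
from typing import Callable, List

def consistency_score(query_llm: Callable[[str], str],
                      paraphrases: List[str]) -> float:
    """Fraction of 'Yes' answers across semantically equivalent prompts.

    `query_llm` is a hypothetical callable returning the LLM's raw
    answer ("Yes"/"No") to a single prompt.
    """
    answers = [query_llm(p) for p in paraphrases]
    yes = sum(a.strip().lower().startswith("yes") for a in answers)
    return yes / len(paraphrases)

# Toy stand-in for an LLM: answers "Yes" to 3 of 4 paraphrases.
canned = iter(["Yes", "Yes", "No", "Yes"])
score = consistency_score(
    lambda _prompt: next(canned),
    ["Does X_i cause X_j?",
     "Is X_i a cause of X_j?",
     "Can X_i causally influence X_j?",
     "Would intervening on X_i change X_j?"])
print(score)  # 0.75
```

In a real run the lambda would wrap an API call, and the paraphrases would be generated per variable pair from their textual descriptions.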
All pairwise scores are assembled into a consistency matrix \(C\). From this matrix the authors construct a Semi-Complete Partially Directed Graph (PDG) \(S\): a directed edge \(X_i \to X_j\) is added when \(C_{X_i \succ X_j} > C_{X_j \succ X_i}\); an undirected edge \(X_i - X_j\) is added when the scores are equal. This graph captures the maximal set of orientation decisions that are consistent with the LLM's evidence.
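This construction can be sketched directly from the matrix; the edge-set representation below is an illustrative choice, and the exact tie-handling in the paper may differ:

```python
def build_semicomplete_pdg(C):
    """Build a semi-complete PDG from a consistency matrix C,
    where C[i][j] is the score for 'X_i causes X_j'.

    Adds a directed edge i -> j when C[i][j] > C[j][i], and an
    undirected edge {i, j} when the two scores tie (sketch).
    """
    n = len(C)
    directed, undirected = set(), set()
    for i in range(n):
        for j in range(i + 1, n):
            if C[i][j] > C[j][i]:
                directed.add((i, j))
            elif C[j][i] > C[i][j]:
                directed.add((j, i))
            else:
                undirected.add(frozenset((i, j)))
    return directed, undirected

# Toy 3-variable matrix: 0 -> 1 and 1 -> 2 win; 0 vs 2 ties.
C = [[0.0, 0.9, 0.5],
     [0.2, 0.0, 0.7],
     [0.5, 0.3, 0.0]]
d, u = build_semicomplete_pdg(C)
print(sorted(d), [set(e) for e in u])  # [(0, 1), (1, 2)] [{0, 2}]
```

Note that every pair receives exactly one edge, which is what makes the graph semi-complete.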
Next, the authors apply the second Meek rule (R2): whenever a directed path \(X_i \to X_k \to X_j\) exists and the edge \(X_i - X_j\) is undirected, it is oriented as \(X_i \to X_j\), since the reverse orientation would create a directed cycle. Applying R2 exhaustively converts \(S\) into a Maximally Oriented Partially Directed Acyclic Graph (MPDAG). The MPDAG is a denser refinement of a CPDAG, enriched by the LLM-derived background knowledge, and it retains all conditional independencies while fixing many previously ambiguous edge directions.
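A fixpoint application of R2 on plain edge sets can be sketched as follows; this covers only R2 (as the text describes), not the other Meek rules:

```python
def apply_meek_r2(directed, undirected):
    """Orient i - j as i -> j whenever a directed path i -> k -> j
    exists, since j -> i would close a directed cycle (Meek rule R2).

    Sketch on plain edge sets, iterated to a fixpoint.
    """
    directed = set(directed)
    undirected = {tuple(sorted(e)) for e in undirected}
    nodes = {v for e in directed | undirected for v in e}
    changed = True
    while changed:
        changed = False
        for a, b in list(undirected):
            for i, j in ((a, b), (b, a)):
                # Look for a two-step directed path i -> k -> j.
                if any((i, k) in directed and (k, j) in directed
                       for k in nodes):
                    directed.add((i, j))
                    undirected.discard((a, b))
                    changed = True
                    break
    return directed, undirected

# Chain 0 -> 1 -> 2 with an undirected 0 - 2 edge: R2 orients 0 -> 2.
d, u = apply_meek_r2({(0, 1), (1, 2)}, {(0, 2)})
print(sorted(d), u)  # [(0, 1), (0, 2), (1, 2)] set()
```

The loop repeats because one new orientation can enable further two-step paths elsewhere in the graph.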
Because an MPDAG may still contain undirected edges, the authors further enumerate all possible acyclic tournaments (complete DAGs where every pair of nodes is connected by a single directed edge) that are compatible with the MPDAG. For each tournament they compute the sum of consistency scores of its directed edges; the tournament(s) with maximal total consistency are selected as the "maximally consistent causal orders." Each such tournament encodes a unique total ordering of the variables and therefore provides a concrete candidate for causal inference.
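Since each acyclic tournament corresponds to a total order of the variables, the selection step can be sketched as a brute-force search over permutations; this is an illustrative implementation feasible only for small variable counts, not the paper's algorithm:

```python
from itertools import permutations

def best_tournaments(C, directed):
    """Enumerate total orders (acyclic tournaments) compatible with the
    MPDAG's directed edges and keep those maximizing the summed
    consistency of their edges.  Brute-force sketch for small n.
    """
    n = len(C)
    best, best_score = [], float("-inf")
    for order in permutations(range(n)):
        pos = {v: k for k, v in enumerate(order)}
        if any(pos[i] > pos[j] for i, j in directed):
            continue  # violates an already-oriented edge
        # Sum consistency over all n*(n-1)/2 edges implied by the order.
        score = sum(C[order[a]][order[b]]
                    for a in range(n) for b in range(a + 1, n))
        if score > best_score:
            best, best_score = [order], score
        elif score == best_score:
            best.append(order)
    return best, best_score

# Toy consistency matrix for three variables.
C = [[0.0, 0.9, 0.5],
     [0.2, 0.0, 0.7],
     [0.5, 0.3, 0.0]]
orders, total = best_tournaments(C, {(0, 1), (1, 2)})
print(orders)  # [(0, 1, 2)]
```

Several orders can tie on total consistency, which is exactly why the method returns a class of causal orders rather than a single one.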
The paper shows how these structures enable causal effect estimation without additional parametric assumptions. Assuming causal sufficiency, the predecessors of a treatment variable in a correct tournament include all of its parents and none of its descendants in the underlying true DAG, so they form a valid adjustment set. Consequently, the back-door criterion reduces to adjusting for all predecessor variables, and the average treatment effect (ATE) can be estimated using the standard adjustment formula. The generalized back-door criterion is also applicable to MPDAGs, allowing total-effect identification even when some edges remain undirected.
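For discrete data, the adjustment formula \(\mathrm{ATE} = \sum_z \big(E[Y \mid T=1, Z=z] - E[Y \mid T=0, Z=z]\big)\,P(Z=z)\), with \(Z\) the treatment's predecessors, can be sketched as below; the list-of-dicts data format is an illustrative assumption, not the paper's code:

```python
from collections import defaultdict

def ate_backdoor(data, treatment, outcome, predecessors):
    """Back-door adjustment for discrete data:
    ATE = sum_z (E[Y|T=1,Z=z] - E[Y|T=0,Z=z]) * P(Z=z),
    adjusting for the treatment's predecessors in the causal order.
    """
    # Group records by the joint value z of the predecessor variables.
    strata = defaultdict(list)
    for row in data:
        z = tuple(row[v] for v in predecessors)
        strata[z].append(row)
    n = len(data)
    ate = 0.0
    for rows in strata.values():
        p_z = len(rows) / n
        for t, sign in ((1, +1), (0, -1)):
            ys = [r[outcome] for r in rows if r[treatment] == t]
            if not ys:
                return None  # positivity violated in this stratum
            ate += sign * p_z * (sum(ys) / len(ys))
    return ate

# Tiny example: one binary predecessor Z of treatment T.
rows = [
    {"Z": 0, "T": 1, "Y": 1}, {"Z": 0, "T": 0, "Y": 0},
    {"Z": 1, "T": 1, "Y": 1}, {"Z": 1, "T": 0, "Y": 1},
]
ate = ate_backdoor(rows, "T", "Y", ["Z"])
print(ate)  # 0.5
```

Adjusting for *all* predecessors is conservative but safe here: the set may be larger than the true parent set, yet it still blocks every back-door path under causal sufficiency.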
Empirical evaluation is performed on a large collection of causal DAGs extracted from epidemiology and public‑health literature. For each DAG, variable descriptions are fed to a GPT‑4‑style LLM, and ten re‑phrasings per pair are used to compute the consistency matrix. The proposed algorithm is compared against prior LLM‑based causal order methods (e.g., Vashishtha et al., 2025). Results demonstrate: (1) higher accuracy in recovering the true causal order (≈78% vs. 62% for baselines); (2) a strong correlation between high consistency scores and correctly oriented edges; (3) that ATE estimates derived from the maximally consistent tournaments match ground‑truth effects in simulated data, confirming unbiased recovery.
In summary, the authors present a principled way to harness the imperfect knowledge of LLMs: by quantifying self‑consistency, constructing a semi‑complete PDG, refining it to an MPDAG, and extracting the set of maximally consistent acyclic tournaments. This pipeline avoids the need for faithfulness or other strong statistical assumptions, making it attractive for domains where data are scarce, experiments are infeasible, or expert knowledge is primarily textual. The work opens a new research direction at the intersection of natural‑language processing, graph theory, and causal inference.