- Title: Project Ariadne: A Structural Causal Framework for Auditing Faithfulness in LLM Agents
- ArXiv ID: 2601.02314
- Date: 2026-01-05
- Authors: Sourena Khanzadeh
📝 Abstract
As Large Language Model (LLM) agents are increasingly tasked with high-stakes autonomous decision-making, the transparency of their reasoning processes has become a critical safety concern. While *Chain-of-Thought* (CoT) prompting allows agents to generate human-readable reasoning traces, it remains unclear whether these traces are **faithful** generative drivers of the model's output or merely **post-hoc rationalizations**. We introduce **Project Ariadne**, a novel XAI framework that utilizes Structural Causal Models (SCMs) and counterfactual logic to audit the causal integrity of agentic reasoning. Unlike existing interpretability methods that rely on surface-level textual similarity, Project Ariadne performs **hard interventions** ($`do`$-calculus) on intermediate reasoning nodes -- systematically inverting logic, negating premises, and reversing factual claims -- to measure the **Causal Sensitivity** ($`\phi`$) of the terminal answer. Our empirical evaluation of state-of-the-art models reveals a persistent *Faithfulness Gap*. We define and detect a widespread failure mode termed **Causal Decoupling**, where agents exhibit a violation density ($`\rho`$) of up to $`0.77`$ in factual and scientific domains. In these instances, agents arrive at identical conclusions despite contradictory internal logic, proving that their reasoning traces function as "Reasoning Theater" while decision-making is governed by latent parametric priors. Our findings suggest that current agentic architectures are inherently prone to unfaithful explanation, and we propose the Ariadne Score as a new benchmark for aligning stated logic with model action.
💡 Summary & Analysis
1. **Three Key Contributions**
- **Machine Learning Transparency:** Project Ariadne uses Structural Causal Models (SCMs) to make the decision-making process of agents understandable.
- **Causal Decoupling Detection:** It identifies situations where an agent’s provided explanations do not influence its actual decisions.
- **Experimental Results Analysis:** Evaluates how strongly agents' final answers depend on their stated reasoning across several task domains.
2. **Explanation with Metaphors**
- **Simple:** Project Ariadne is like looking under the hood of a car: without understanding how the engine works, it is hard to diagnose problems when they arise. In the same way, we need to understand the inner workings of language models.
- **Intermediate:** Project Ariadne opens the black box of an agent's decision-making process and checks whether the model is actually reasoning the way it claims to, much like examining how all the parts of a complex machine work together.
- **Advanced:** Project Ariadne analyzes an agent's decision-making process using Structural Causal Models (SCMs), comparable to tracing the paths of data packets through a complex computer network.
📄 Full Paper Content (ArXiv Source)
# Introduction
The rapid proliferation of Large Language Model (LLM) agents has ushered
in a paradigm shift in autonomous problem-solving, moving beyond simple
text generation toward complex, multi-step “Chain-of-Thought” (CoT)
reasoning. As these agents are increasingly deployed in high-stakes
domains—ranging from financial forecasting to autonomous scientific
discovery—the transparency of their decision-making processes becomes a
critical safety frontier. However, a significant sociotechnical
challenge remains: the Faithfulness Gap. While agents produce
human-readable reasoning traces that ostensibly explain their logic,
mounting evidence suggests that these traces often function as
post-hoc justifications rather than the generative drivers of the
model’s terminal conclusions.
This phenomenon, which we term Causal Decoupling, represents a
fundamental failure in Explainable AI (XAI). When an agent’s internal
“thoughts” are not causally linked to its final actions, the reasoning
trace becomes a “hallucinated explanation”—a dangerous veneer of
transparency that masks the underlying black-box heuristics of the
transformer architecture. To address this, we introduce Project
Ariadne, a diagnostic framework designed to audit the causal integrity
of agentic reasoning through the lens of Structural Causal Models
(SCMs).
Unlike traditional evaluation metrics that rely on surface-level textual
similarity or static benchmarks, Project Ariadne utilizes a
counterfactual interventionist approach. By treating the reasoning trace
as a sequence of discrete causal nodes, we systematically perform hard
interventions—flipping logical operators, negating factual premises, or
inverting causal directions. We then observe the resulting shift in the
agent’s counterfactual answer distribution.
By quantifying the Causal Sensitivity of the output to these
perturbations, Ariadne provides a formal mathematical basis for
distinguishing between truly “thinking” agents and those merely
performing “reasoning theater.” In the following sections, we define the
structural equations governing our interventionist framework, establish
metrics for faithfulness violations, and demonstrate the utility of
Project Ariadne in detecting unfaithful reasoning across
state-of-the-art agentic architectures.
# Related Work
The evaluation of faithfulness in Large Language Model (LLM) agents has
emerged as a primary bottleneck in AI safety. Project Ariadne builds
upon several foundational pillars: the distinction between faithfulness
and plausibility, structural causal inference, and counterfactual
auditing of reasoning traces.
## The Faithfulness-Plausibility Gap
A central challenge in Explainable AI (XAI) is ensuring that an agent's
reasoning trace $`\mathcal{T}(q)`$ reflects its actual decision-making
process (faithfulness) rather than merely serving as a
human-convincing narrative (plausibility). Foundational work has
demonstrated that reasoning traces frequently function as post-hoc
justifications. Recent empirical studies confirm that LLMs often arrive
at conclusions through biased heuristics despite providing seemingly
logical Chain-of-Thought (CoT) explanations, leading to what we define
as Causal Decoupling.
## Causal Interpretability and SCMs
Project Ariadne utilizes Structural Causal Models (SCMs) to move from
correlational interpretability to interventional proof. This
methodology is grounded in the $`do`$-calculus framework proposed by
Pearl, treating the reasoning process as a series of causal
dependencies $`s_i = f_{\text{step}}(q, s_{<i}; \theta)`$.
## Counterfactual Interventions in LLMs
Interventional auditing has been successfully applied to model weights,
such as the ROME method, which uses causal tracing to locate factual
associations. Project Ariadne extends this logic to the semantic
space of reasoning traces by performing systematic interventions
$`\iota`$ at the step level. Related work on interventional faithfulness
has begun to quantify terminal output shifts when intermediate steps are
mutated. Ariadne formalizes this through a Faithfulness Score
$`\phi`$, calculated via the semantic similarity $`S`$ between original
and counterfactual answers: $`\phi = 1 - S(a, a_{\iota})`$.
## Benchmarking Agentic Reasoning
As LLMs evolve into autonomous agents, benchmarks have been developed to
measure tool-use and multi-step logic. Project Ariadne contributes to
this ecosystem by providing a diagnostic for faithfulness violations
detected when an agent's answer remains invariant despite contradictory
reasoning. This framework enables batch auditing to compute aggregate
statistics such as Violation Rate $`V_{\text{rate}}`$ and Average
Faithfulness $`\bar{\phi}`$ across diverse task domains.
# Ariadne Framework Overview
To rigorously audit the causal dependency between an agent’s reasoning
trace and its final output, we developed the Project Ariadne framework.
As illustrated in Figure
1, the methodology treats the
agent’s generation process as a Structural Causal Model (SCM).
The framework proceeds in two stages. First, an original trace is
generated (top row of Figure
1). Second, a controlled
counterfactual intervention, denoted by the $`do`$-operator, is applied
to a specific target step $`s_k`$. This forces the agent down an
alternative causal path (bottom row), resulting in a counterfactual
answer $`a^*`$. By quantitatively comparing the semantic distance
between the original answer $`a`$ and the counterfactual answer $`a^*`$,
we derive the Causal Faithfulness Score $`\phi`$.
Figure 1: The Project Ariadne Causal Audit Framework. The diagram
illustrates the generation of an original reasoning trace (top) and a
counterfactual trace resulting from a hard intervention on step $`s_k`$
(bottom). The semantic divergence between the resulting answers ($`a`$
and $`a^*`$) quantifies the causal faithfulness of the reasoning
process.
As detailed in the Mathematical Framework section, a high similarity score $`S(a, a^*)`$
resulting in a low faithfulness score $`\phi`$ indicates Causal
Decoupling, proving the intervention on the reasoning trace had
negligible effect on the outcome.
# Mathematical Framework
To formalize the audit process for Agentic Reasoning, we present a
framework grounded in Structural Causal Models (SCMs) and counterfactual
logic. This framework treats the agent’s reasoning process as a directed
computational graph and quantifies faithfulness through controlled
semantic interventions.
## The Structural Causal Model (SCM) of Reasoning
We define the agentic process as an SCM denoted by
$`\mathcal{M} = \langle \mathcal{U}, \mathcal{V}, \mathcal{F} \rangle`$,
where:
- $`\mathcal{U} = \{q, \theta\}`$ represents exogenous variables: the input query $`q \in \mathcal{Q}`$ and the model parameters $`\theta`$.
- $`\mathcal{V} = \{s_1, s_2, \dots, s_n, a\}`$ represents endogenous variables: the sequence of reasoning steps (the trace $`\mathcal{T}`$) and the final answer $`a \in \mathcal{A}`$.
- $`\mathcal{F}`$ is a set of structural equations such that each $`v \in \mathcal{V}`$ is a function of its causal parents $`pa(v)`$.
### Stepwise Dependency
Each reasoning step $`s_i`$ is generated conditioned on the query and
the preceding reasoning history:
$`s_i = f_{\text{step}}(q, s_{<i}; \theta)`$
The final answer $`a`$ is the terminal node in the causal chain,
determined by the query and the complete reasoning trace:
$`a = f_a(q, \mathcal{T}(q); \theta)`$
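To make these structural equations concrete, the following sketch shows a forward pass of the reasoning SCM. It is a minimal, hypothetical Python rendering (not code released with the paper): `f_step` and `f_answer` stand in for calls to the underlying LLM with parameters $`\theta`$.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical stand-ins for the agent:
#   f_step   realizes s_i = f_step(q, s_<i; theta)
#   f_answer realizes a   = f_a(q, T(q); theta)
# In practice both would wrap calls to the underlying LLM.
StepFn = Callable[[str, List[str]], str]
AnswerFn = Callable[[str, List[str]], str]

@dataclass
class ReasoningSCM:
    f_step: StepFn
    f_answer: AnswerFn
    n_steps: int = 5  # length of the reasoning trace T(q)

    def forward(self, query: str) -> Tuple[List[str], str]:
        """Generate the original trace T(q) and the terminal answer a."""
        trace: List[str] = []
        for _ in range(self.n_steps):
            # Each step is conditioned on the query and the preceding history.
            trace.append(self.f_step(query, list(trace)))
        answer = self.f_answer(query, trace)
        return trace, answer
```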
## Counterfactual Interventions
Project Ariadne evaluates causal faithfulness by performing hard
interventions on the reasoning trace. Following Pearl’s $`do`$-calculus
notation, an intervention on step $`k`$ is represented as
$`do(s_k = s'_k)`$, where $`s'_k`$ is a counterfactual thought generated
to contradict the original reasoning.
### The Intervened Distribution
When an intervention $`\iota`$ is applied to step $`s_k`$, we generate a
counterfactual answer $`a^*`$ by re-executing the agent from the point
of intervention:
$`a^* = f_a(q, \mathcal{T}^*(q); \theta), \quad \mathcal{T}^*(q) = (s_1, \dots, s_{k-1}, s'_k, s^*_{k+1}, \dots, s^*_n)`$
Note that subsequent steps $`s_j^*`$ for $`j > k`$ are re-sampled and
may deviate from the original trace $`\mathcal{T}`$ due to the causal
shift introduced by $`\iota(s_k)`$.
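A hard intervention can then be sketched as clamping step $`k`$ to the counterfactual thought and re-sampling only the downstream steps. This reuses the hypothetical `ReasoningSCM` above and is an illustrative reconstruction, not the authors' implementation.

```python
from typing import List, Tuple

def intervene(scm: ReasoningSCM, query: str, trace: List[str],
              k: int, s_prime_k: str) -> Tuple[List[str], str]:
    """Apply do(s_k = s'_k): keep steps before k, clamp step k to the
    contradictory thought s'_k, then re-sample s*_j for j > k and the
    counterfactual answer a*."""
    cf_trace = trace[:k] + [s_prime_k]        # s_1, ..., s_{k-1}, s'_k
    for _ in range(k + 1, scm.n_steps):       # re-sample downstream steps s*_j
        cf_trace.append(scm.f_step(query, list(cf_trace)))
    a_star = scm.f_answer(query, cf_trace)    # counterfactual answer a*
    return cf_trace, a_star
```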
## Intervention Modalities
We define an intervention operator
$`\mathcal{I}: \mathcal{S} \rightarrow \mathcal{S}`$ that maps a
reasoning step to its contradictory counterpart based on type $`\tau`$:
inverting the step's logic ($`\tau_{\text{flip}}`$, the Logic Flip used
in our experiments), negating a factual premise, or reversing a factual
or causal claim.
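One way to realize $`\mathcal{I}`$ is as a dispatch over the intervention type that delegates the actual rewriting to a helper (here a hypothetical `rewrite` callable, e.g. an LLM prompt). The instruction strings are illustrative assumptions, not prompts taken from the paper.

```python
from enum import Enum
from typing import Callable

class InterventionType(Enum):
    LOGIC_FLIP = "flip"          # tau_flip: invert the step's logical conclusion
    PREMISE_NEGATION = "negate"  # negate a factual premise stated in the step
    FACT_REVERSAL = "reverse"    # reverse a factual or causal claim

# Rewriting instruction for each modality (illustrative wording).
_INSTRUCTIONS = {
    InterventionType.LOGIC_FLIP: "Invert the logical conclusion of this reasoning step.",
    InterventionType.PREMISE_NEGATION: "Negate the factual premise of this reasoning step.",
    InterventionType.FACT_REVERSAL: "Reverse the factual or causal claim in this reasoning step.",
}

def intervention_operator(step: str, tau: InterventionType,
                          rewrite: Callable[[str, str], str]) -> str:
    """I: S -> S. Map a reasoning step to a contradictory counterpart of type tau."""
    return rewrite(step, _INSTRUCTIONS[tau])
```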
The core metric of the Ariadne framework is the Causal Sensitivity
Score $`\phi`$, measuring the degree to which the terminal answer is
functionally dependent on the intermediate reasoning steps.
### Causal Sensitivity Score
Let $`S(a, a^*)`$ be a semantic similarity function in the interval
$`[0, 1]`$. The faithfulness score $`\phi`$ for a query $`q`$ and
intervention $`\iota`$ at step $`k`$ is defined as:
$`\phi = 1 - S(a, a^*)`$
An agent exhibits Causal Decoupling—a faithfulness violation—if the
answer remains invariant ($`S \rightarrow 1`$) despite a substantive
contradiction in the reasoning chain. We define a binary violation
indicator $`V`$ that flags answers whose similarity exceeds a chosen
threshold $`\delta`$:
$`V = \mathbb{1}\left[\, S(a, a^*) \geq \delta \,\right]`$
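A small helper makes the metric concrete (illustrative only; the similarity judge and the cutoff $`\delta`$ are assumed implementation details):

```python
def faithfulness_score(similarity: float) -> float:
    """phi = 1 - S(a, a*), with S in [0, 1]."""
    return 1.0 - similarity

def is_violation(similarity: float, delta: float = 0.9) -> bool:
    """Causal Decoupling flag V = 1[S(a, a*) >= delta]. The cutoff delta is an
    assumed choice; the paper does not state the exact value used."""
    return similarity >= delta
```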
# Empirical Evaluation
To evaluate the causal faithfulness of state-of-the-art LLM agents, we
conducted a series of audits using the Project Ariadne framework. Our
experiments focus on identifying Causal Decoupling—instances where the
agent’s final answer remains invariant despite significant logical
perturbations in its reasoning trace.
## Experimental Setup
We utilized a dataset of 500 queries spanning three distinct categories:
General Knowledge (e.g., geography, history), Scientific
Reasoning (e.g., climate science, biology), and Mathematical Logic
(e.g., arithmetic, symbolic logic). For each query, we extracted an
initial reasoning trace $`\mathcal{T}`$ and a terminal answer $`a`$
using a GPT-4o-based agent.
Interventions were applied using the $`\tau_{flip}`$ (Logic Flip)
modality at the initial reasoning step ($`s_0`$) to maximize the
potential for downstream effects. Semantic similarity $`S(a, a^*)`$ was
computed using a secondary Claude 3.7 Sonnet instance as the scoring
judge to ensure a nuanced understanding of answer equivalence.
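The audit loop described above can be sketched as follows, combining the `ReasoningSCM`, `intervene`, and `intervention_operator` sketches from the Mathematical Framework section; the `contradict` and `judge_similarity` callables are assumed wrappers around the generator and judge models, respectively.

```python
from statistics import mean
from typing import Callable, Iterable

def run_audit(scm: ReasoningSCM,
              contradict: Callable[[str], str],
              judge_similarity: Callable[[str, str], float],
              queries: Iterable[str],
              k: int = 0, delta: float = 0.9) -> dict:
    """Batch audit: generate each original trace and answer, apply a logic-flip
    style intervention at step k (the initial step s_0 in this setup), score
    S(a, a*) with the judge model, and aggregate mean faithfulness and the
    violation rate."""
    phis, violations = [], []
    for q in queries:
        trace, a = scm.forward(q)
        s_prime_k = contradict(trace[k])                 # counterfactual thought s'_k
        _cf_trace, a_star = intervene(scm, q, trace, k, s_prime_k)
        s = judge_similarity(a, a_star)                  # S(a, a*) in [0, 1]
        phis.append(1.0 - s)                             # phi = 1 - S
        violations.append(1.0 if s >= delta else 0.0)    # violation indicator V
    return {"mean_faithfulness": mean(phis), "violation_rate": mean(violations)}
```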
## Quantitative Results: The Faithfulness Gap
Our results reveal a stark discrepancy between the presence of a
reasoning trace and its causal utility. As shown in the table below, the
majority of audited responses exhibited high semantic similarity despite
contradictory reasoning.
| Category | Mean Faithfulness ($`\bar{\phi}`$) | Similarity ($`S`$) | Violation Rate ($`\rho`$) |
| --- | --- | --- | --- |
| General Knowledge | 0.062 | 0.938 | 92% |
| Scientific Reasoning | 0.030 | 0.970 | 96% |
| Mathematical Logic | 0.329 | 0.671 | 20% |
The Violation Density ($`\rho`$) was highest in Scientific Reasoning
($`\rho=0.96`$), suggesting that models rely heavily on parametric
memory for well-known facts, rendering the reasoning trace largely
performative. In contrast, Mathematical Logic tasks showed significantly
higher sensitivity ($`\bar{\phi}=0.329`$), indicating that
computation-heavy tasks are more causally grounded in their intermediate
steps.
## Case Study: Post-hoc Justification
A qualitative analysis of the audit logs reveals a persistent failure
mode: the “Hallucinated Explanation.” For example, in audit_7152213f
(Global Warming), the agent was forced to accept an initial premise
negating human-induced climate change. Despite this, the agent arrived
at a final answer functionally identical to its original version
($`S=0.9698`$).
This confirms that the agent utilizes the reasoning trace as a post-hoc
justification layer rather than a generative driver. The model
“knows” the culturally or factually expected answer and effectively
bypasses its own internal logic to reach it.
## Intervention Sensitivity vs. Trace Length
We further analyzed whether the length of the reasoning trace correlates
with faithfulness. Our data suggests that longer traces do not
necessarily lead to higher causal grounding. In fact, for General
Knowledge queries, increased trace length was positively correlated with
higher similarity ($`S`$), suggesting that longer chains of thought may
provide more opportunities for the model to “correct” its path back
toward its original parametric bias, regardless of the intervention.
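As a sketch of how such a check might be run, the analysis reduces to a correlation between trace length and post-intervention similarity. The per-audit record fields (`trace`, `similarity`) are hypothetical names, and `statistics.correlation` requires Python 3.10+.

```python
from statistics import correlation  # Pearson correlation coefficient, Python 3.10+

def length_similarity_correlation(audits: list[dict]) -> float:
    """Correlate reasoning-trace length with post-intervention similarity S.
    A positive value indicates that longer traces tend to snap back to the
    original (parametric) answer despite the intervention."""
    lengths = [float(len(rec["trace"])) for rec in audits]
    sims = [float(rec["similarity"]) for rec in audits]
    return correlation(lengths, sims)
```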
Figure: Distribution of Faithfulness Scores ($`\phi`$) across task domains.
# Discussion: The Robustness of Parametric Priors
Our audit of 30 distinct reasoning traces reveals a significant Causal
Resilience to intervention, with a violation density of
$`\rho = 0.767`$. Qualitative analysis of the intervened traces suggests
that state-of-the-art models possess an implicit “error-correction”
mechanism. When a counterfactual logic node is introduced via Project
Ariadne, the agent often identifies the contradiction in subsequent
steps ($`s_{k+1}^*`$) and reverts to its high-probability parametric
prior.
This behavior, while beneficial for accuracy, is catastrophic for
faithfulness. It confirms that the reasoning trace is not a generative
constraint but a fluid narrative layer. Mathematically, the transition
probability $`P(a | q, s'_k)`$ is nearly identical to $`P(a | q, s_k)`$,
proving that the intermediate reasoning state is non-essential for
terminal decision-making in factual retrieval tasks.
# Conclusion
This research has formalized and evaluated the causal integrity of
agentic reasoning through the Project Ariadne framework. By leveraging
a Structural Causal Model (SCM) approach and the principles of
$`do`$-calculus, we have moved beyond surface-level textual evaluation
to provide a rigorous mathematical audit of LLM faithfulness.
Our empirical results, specifically the high Violation Density
($`\rho = 0.767`$) across thirty distinct audits, highlight a critical
failure mode in current autoregressive architectures: Causal
Decoupling. The data demonstrates that while Large Language Models
produce sophisticated reasoning traces, these traces often function as a
“narrative veneer” or Reasoning Theater. In these instances, the
terminal decision-making is driven by internal parametric priors rather
than the intermediate logical steps. Project Ariadne provides the XAI
community with the diagnostic tools necessary to distinguish between
agents that truly derive solutions and those that merely provide
post-hoc justifications. As agentic systems take on more autonomous
roles in society, ensuring that their stated logic is the true cause of
their actions is a fundamental requirement for AI safety, reliability,
and alignment.
# Future Work
The findings from this study open several promising avenues for
enhancing the faithfulness of machine reasoning:
- **Multi-Step and Path-Specific Interventions:** While the current framework focuses on single-node perturbations ($`do(s_k)`$), future iterations will explore Path-Specific Effects. By simultaneously perturbing multiple nodes in a reasoning chain, we can map the "logical threshold" at which a model is forced to abandon its parametric bias in favor of contextual logic.
- **Causal Faithfulness as a Training Objective:** We propose using the Faithfulness Score ($`\phi`$) as a reward signal in Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO). By penalizing decoupled responses during the fine-tuning phase, we can potentially bridge the Faithfulness Gap.
- **Benchmarking "System 2" Architectures:** A key question for future research is whether increased "thinking time" in models utilizing test-time compute (e.g., OpenAI's o1) leads to higher causal faithfulness or simply more elaborate post-hoc justifications.
- **Automated Saliency Mapping for Audits:** To increase audit efficiency, we intend to implement Automated Saliency Detection. By using attention weights or gradient-based methods, the system can identify "load-bearing" steps in a trace and target them for intervention automatically.
# References
- J. Pearl, *Causality: Models, Reasoning, and Inference*, Cambridge University Press, 2009.
- A. Jacovi and Y. Goldberg, "Towards Faithfully Interpretable NLP Systems," Proc. of ACL, 2020.
- S. Wiegreffe and A. Marasović, "Explainability for Natural Language Processing: A Survey," arXiv:2102.12451, 2021.
- M. Turpin et al., "Language Models Don't Always Say What They Think: Unfaithful Explanations in CoT," NeurIPS, 2023.
- K. Meng et al., "Locating and Editing Factual Associations in GPT," NeurIPS, 2022.
- "TIR-Bench: A Comprehensive Benchmark for Agentic Thinking," ICLR, 2026.
- A. Geiger, D. Ibeling, A. Zur, M. Chaudhary, S. Chauhan, J. Huang, … and T. Icard, "Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability," Journal of Machine Learning Research, 26(83), 1-64, 2025.
- D. Pelosi, D. Cacciagrano, and M. Piangerelli, "Explainability and Interpretability in Concept and Data Drift: A Systematic Literature Review," Algorithms, 18(7), 443, 2025.