Exploring the Mechanisms of Arithmetic Hallucination in Large Language Models for Finance
📝 Abstract
Large Language Models (LLMs) are increasingly deployed in high-stakes financial domains, yet they suffer from specific, reproducible hallucinations when performing arithmetic operations. Current mitigation strategies often treat the model as a black box. In this work, we propose a mechanistic approach to intrinsic hallucination detection. By applying Causal Tracing to the GPT-2 XL architecture on the ConvFinQA benchmark, we identify a dual-stage mechanism for arithmetic reasoning: a distributed computational scratchpad in middle layers (L12-L30) and a decisive aggregation circuit in late layers (specifically Layer 46). We verify this mechanism via an ablation study, demonstrating that suppressing Layer 46 reduces the model’s confidence in hallucinatory outputs by 81.8%. Furthermore, we demonstrate that a linear probe trained on this layer generalizes to unseen financial topics with 98% accuracy, suggesting a universal geometry of arithmetic deception.
📄 Content
Dissecting the Ledger: Locating and Suppressing “Liar Circuits” in Financial Large Language Models

Soham Mirajkar
IIT Jodhpur
December 1, 2025

Abstract

Large Language Models (LLMs) are increasingly deployed in high-stakes financial domains, yet they suffer from specific, reproducible hallucinations when performing arithmetic operations. Current mitigation strategies often treat the model as a black box. In this work, we propose a mechanistic approach to intrinsic hallucination detection. By applying Causal Tracing to the GPT-2 XL architecture on the ConvFinQA benchmark, we identify a dual-stage mechanism for arithmetic reasoning: a distributed computational “scratchpad” in middle layers (L12-L30) and a decisive “aggregation” circuit in late layers (specifically Layer 46). We verify this mechanism via an ablation study, demonstrating that suppressing Layer 46 reduces the model’s confidence in hallucinatory outputs by 81.8%. Furthermore, we demonstrate that a linear probe trained on this layer generalizes to unseen financial topics with 98% accuracy, suggesting a universal geometry of arithmetic deception.

1 Introduction

The integration of Large Language Models (LLMs) into quantitative finance is hindered by the “Hallucination Problem.” Although hallucinations are often framed as random noise, we hypothesize that those arising in arithmetic reasoning are structural failures. Specifically, when an LLM is asked to compute “Revenue growth from 50M to 30M” and answers “50%” (instead of -40%), it is not merely guessing; it is executing a flawed computational circuit.

Recent surveys, such as Lee et al. (2024), comprehensively categorize the landscape of Financial LLMs and identify hallucination as a primary barrier, but they focus predominantly on behavioral evaluations [2]. Our work complements this by providing a mechanistic explanation for these failures, moving from symptom identification to root-cause analysis.
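For reference, the correct answer in the revenue example from the Introduction follows from the standard percent-change formula, (new - old) / old. The short sketch below works it through; the function name `percent_change` is a hypothetical helper introduced only for illustration and does not come from the paper.

```python
def percent_change(old: float, new: float) -> float:
    """Return the relative change from `old` to `new`, in percent."""
    if old == 0:
        raise ValueError("percent change is undefined when the starting value is zero")
    return (new - old) / old * 100.0


# Revenue falls from 50M to 30M, so the growth rate is negative.
growth = percent_change(50e6, 30e6)
print(f"{growth:.1f}%")  # prints -40.0%
```

The hallucinated answer of “50%” has both the wrong sign and the wrong magnitude, which is what makes this kind of error easy to label as an arithmetic failure.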
2 Methodology

2.1 Task Definition

We use the ConvFinQA dataset, filtering for numerical reasoning tasks that involve arithmetic operations. We categorize model outputs into two sets: $Y_{\text{clean}}$ (factually correct) and $Y_{\text{hallucinated}}$ (arithmetic errors).

2.2 Causal Tracing Setup

We adapt the Causal Tracing method (Meng et al., 2022). We iterate through every hidden state $h_i^{l}$ (at token $i$, layer $l$) and intervene to measure its restorative potential:

$$\mathrm{Impact}(h_i^{l}) = P_{\text{patch}}(\text{Correct Answer}) - P_{\text{corrupted}}(\text{Correct Answer}) \tag{1}$$

Here $P_{\text{patch}}$ is the probability of the correct answer when the clean state is restored at $h_i^{l}$ during the corrupted run, and $P_{\text{corrupted}}$ is the same probability in the corrupted run without intervention.

arXiv:2511.21756v1 [cs.CL] 24 Nov 2025

2.3 Implementation

We use the TransformerLens library to hook into internal activations.

    from functools import partial

    def patching_hook(resid_pre, hook, pos, clean_cache):
        # Overwrite the corrupted state with the clean state at one token position.
        resid_pre[:, pos, :] = clean_cache[hook.name][:, pos, :]
        return resid_pre

    # `model`, `corrupted_tokens`, `clean_cache`, `hook_name`, and `position`
    # come from the surrounding experiment setup (not shown).
    hook_fn = partial(patching_hook, pos=position, clean_cache=clean_cache)
    patched_logits = model.run_with_hooks(
        corrupted_tokens,
        fwd_hooks=[(hook_name, hook_fn)],
    )

Listing 1: Activation Patching Hook

3 Core Findings

3.1 The Dual-Stage Mechanism

Our analysis of GPT-2 XL (1.5B parameters) reveals that financial reasoning is not monolithic but split into two distinct sites.

Figure 1: Causal Tracing Heatmap. The x-axis represents tokens in the financial prompt. The y-axis represents Transformer layers (0-48). Note the distributed impact in the middle layers (L12-L30) at the operand tokens and the massive peak at Layer 46 at the final token.

Observation 1 (The Calculation Site): As visualized in Figure 1, we observe sustained, distributed causal impact in Layers 12 through 30, localized at the operand tokens. Interpretation: These layers function as the “computational engine,” where the model attends to and processes the numerical values.

Observation 2 (The Late-Layer Gatekeeper): The single highest causal impact (0.0073) occurs at Layer 46 on the final token position. Interpretation: This “Late Site” acts as an aggregator. It consolidates the upstream calculations before the final decoding.

3.2 Validation via Causal Suppression

To confirm the causal necessity of the identified mechanism, we performed an ablation study on the “Liar Layer” (L46). By suppressing the activation of this layer during inference (setting the residual contribution to zero), we observed an 81.8% reduction in the model’s confidence for the hallucinatory output (from 0.0522 to 0.0095). This effectively “breaks” the hallucination circuit, proving that Layer 46 is the active bottleneck for the arithmetic decision.

3.3 Robustness Verification

To ensure these findings are not artifacts of a specific sentence structure, we averaged Causal Traces across N = 5 distinct financial scenarios.
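One way to read this averaging step is sketched below: compute the Equation (1) impact for each prompt, keep the final-token value for every layer, and average those values across prompts. This is a minimal illustration, not the authors' implementation. The function names, the nested-list data layout, and the choice to align prompts on their final token (since prompts differ in length) are all assumptions, and the probabilities are synthetic placeholders rather than results from the paper.

```python
from statistics import mean


def impact_map(p_patch, p_corrupted):
    """Equation (1) applied to every (layer, token) cell of one prompt.

    p_patch[layer][token] is the probability of the correct answer after
    restoring the clean state at that position in the corrupted run.
    """
    return [[p - p_corrupted for p in row] for row in p_patch]


def mean_final_token_impact(p_patch_per_prompt, p_corrupted_per_prompt):
    """Average the final-token impact per layer across prompts.

    Only the last token column is compared, because prompts may have
    different token counts (an assumption made for this sketch).
    """
    per_prompt = [
        [row[-1] for row in impact_map(p_patch, p_corr)]
        for p_patch, p_corr in zip(p_patch_per_prompt, p_corrupted_per_prompt)
    ]
    # Transpose so each tuple holds one layer's values across all prompts.
    return [mean(layer_values) for layer_values in zip(*per_prompt)]


N_LAYERS = 48


def synthetic_prompt(n_tokens, planted_layer=46):
    """Build placeholder probabilities with a planted late-layer peak."""
    grid = [[0.0012] * n_tokens for _ in range(N_LAYERS)]
    grid[planted_layer][-1] = 0.0080  # illustrative value only
    return grid


# Five synthetic prompts of different lengths, mirroring N = 5.
p_patch = [synthetic_prompt(n) for n in (12, 15, 9, 20, 14)]
p_corrupted = [0.0010] * 5

mean_impact = mean_final_token_impact(p_patch, p_corrupted)
peak_layer = max(range(N_LAYERS), key=lambda layer: mean_impact[layer])
print(peak_layer)  # prints 46, because the peak was planted there
```

Averaging only the final-token column sidesteps the question of how to align operand tokens across prompts of different lengths; comparing the middle-layer region would need an explicit alignment rule.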
Figure 2: Robustness Analysis. Averaged Causal Impact across 5 diverse financial prompts. The high-impact region at Layer 46 remains consistent, confirming it as a structural bottleneck for arithmetic output.

As shown in Figure 2, the Late-Layer Gatekeeper at Layer 46 persists across all prompts. This confirms that the “Liar Circuit” m