Low Leakage and Strong Causal Power: New Insights into LLM Causal Reasoning Parameters

Reading time: 3 minutes
...

📝 Original Info

  • Title: Low Leakage and Strong Causal Power: New Insights into LLM Causal Reasoning Parameters
  • ArXiv ID: 2512.11909
  • Date:
  • Authors: Unknown

📝 Abstract

Agents without Markov violations (no-MV) have low leakage b (0-0.1), strong causal strengths m1, m2 (0.75-0.99), and midrange priors, while agents with MV or weak explaining-away (EA) show higher b (0.15-0.62) and weaker m_i (0.25-0.82). Outlook. Next steps include extending this framework to semantically meaningless tasks and to causal structures beyond colliders to probe reasoning robustness. It should be noted that "normative" parameter regimes (low leak, strong causes) are not universally optimal and ultimately depend on the setting: tasks that legitimately require uncertainty about unobserved causes may warrant nonzero leak. Our prompts do not control this dimension: we neither instruct models to ignore nor to include unmentioned causes. A targeted analysis of the explanations produced under CoT could provide first insights into whether and how LLMs represent and regulate such unobserved causes.
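The leak b and causal strengths m1, m2 can be read as parameters of a leaky noisy-OR collider, a standard parameterization in this literature. A minimal sketch, assuming that likelihood (the paper's exact model specification is not reproduced here):

```python
def noisy_or(b, m1, m2, c1, c2):
    """P(E=1 | C1=c1, C2=c2) under a leaky noisy-OR:
    the effect occurs unless the leak and every present cause all fail."""
    p_fail = (1 - b) * (1 - m1) ** c1 * (1 - m2) ** c2
    return 1 - p_fail

# "Normative" regime reported for no-MV agents: near-zero leak, strong causes.
p_one_cause = noisy_or(b=0.05, m1=0.9, m2=0.9, c1=1, c2=0)  # -> 0.905
```

With b near 0 the effect is almost never present without a cause, and with m_i near 1 each cause almost always produces it, which is what the "low leak, strong causes" regime expresses.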

📄 Full Content

The nature of intelligence in both humans and machines is a longstanding question. While there is no universally accepted definition, the ability to reason causally is often regarded as a pivotal aspect of intelligence (Lake et al., 2017). Evaluating causal reasoning in LLMs and humans on the same tasks hence provides a more comprehensive understanding of their respective strengths and weaknesses. The naming convention for the GPT-5 family is as follows: gpt-5 v_r.

Human-LLM alignment: SOTA models establish the ceiling; CoT helps others converge (Q1). Recent top-performing LLMs, e.g., gemini-2.5-pro, already show strong human alignment under Direct prompting (ρ ≈ 0.85), with little to no improvement from CoT. Conversely, CoT significantly boosts alignment in lighter or older models (e.g., gemini-2.5-flash-lite: +0.503 → ρ = 0.845), helping them converge to the same ceiling.

Humans are consistent reasoners; CoT improves reasoning consistency, especially for smaller and older models (Q2). On our tasks, CoT yields a small but reliable lift in median reasoning consistency, raising LOOCV R² from 0.933 to 0.941 (+0.008, +0.91%). More importantly, CoT disproportionately helps the agents that are least consistent under Direct prompting: the lower tail rises (minimum R²: 0.277 (gemini-2.5-flash-lite; numeric) → 0.692 (claude-3-haiku-20240307; CoT)) and the spread tightens markedly (IQR 0.116 → 0.060). Humans show high consistency across tasks, with LOOCV R² = 0.937, only a narrow gap to SOTA models, which achieve LOOCV R² of up to 0.99 (gemini-2.5-pro).
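Reasoning consistency here is scored by leave-one-out cross-validated R²: refit the model with each judgment held out, predict it, and compare squared errors against the mean baseline. A minimal sketch for a linear model (illustrative only; the paper's actual regression of agent judgments is an assumption):

```python
import numpy as np

def loocv_r2(X, y):
    """Leave-one-out cross-validated R^2 for a linear model with intercept:
    each point is predicted by a model fit on all the other points."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        # least-squares fit with an intercept column
        A = np.column_stack([np.ones(mask.sum()), X[mask]])
        coef, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
        preds[i] = np.concatenate([[1.0], np.atleast_1d(X[i])]) @ coef
    ss_res = np.sum((y - preds) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot
```

Unlike in-sample R², this penalizes models that only fit their own noise, which is why it serves as a consistency measure across held-out judgments.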

Explaining-away is common; CoT effects are mixed (Q3). Most LLMs (27/30) show explaining-away (EA > 0), and 24/30 exceed human EA levels (EA_human = 0.09) (see Fig. 2). CoT helps agents lacking EA (e.g., claude-3-haiku, gemini-2.5-flash-lite) but can slightly reduce EA in strong ones (e.g., gpt-4o, o3-mini). A similar pattern holds for Markov violations: while eight agents show MV under Direct prompting, CoT improves most but can worsen others (e.g., claude-3.5-haiku). High-EA-
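Explaining-away (EA > 0) means that, given the effect, learning the other cause is present lowers the posterior of the first cause. A hedged sketch under the noisy-OR collider likelihood with hypothetical priors and strengths (not the paper's fitted values):

```python
def posterior_c1(b, m1, m2, p1, p2, c2=None):
    """P(C1=1 | E=1[, C2=c2]) for a leaky noisy-OR collider C1 -> E <- C2
    with independent cause priors p1, p2 (illustrative values)."""
    def lik(c1, c2):  # P(E=1 | c1, c2)
        return 1 - (1 - b) * (1 - m1) ** c1 * (1 - m2) ** c2
    def prior(c1, c2):
        return (p1 if c1 else 1 - p1) * (p2 if c2 else 1 - p2)
    c2_vals = [0, 1] if c2 is None else [c2]  # marginalize C2 if unobserved
    num = sum(lik(1, v) * prior(1, v) for v in c2_vals)
    den = sum(lik(c, v) * prior(c, v) for c in (0, 1) for v in c2_vals)
    return num / den

# EA = drop in P(C1=1 | E=1) once C2=1 is also observed
ea = (posterior_c1(0.05, 0.9, 0.9, 0.5, 0.5)
      - posterior_c1(0.05, 0.9, 0.9, 0.5, 0.5, c2=1))
```

With strong causes and low leak, observing C2 = 1 already accounts for the effect, so the posterior on C1 falls back toward its prior and ea comes out positive.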

📸 Image Gallery

02_indep_legend.png 03_epres.png ea_levels_overlay_rw17_indep_causes_all.png

Reference

This content is AI-processed based on open access ArXiv data.
