Reading time: 3 minutes

๐Ÿ“ Original Info

  • Title:
  • ArXiv ID: 2512.11909
  • Date:
  • Authors: Unknown

๐Ÿ“ Abstract

No-MV agents have a low leak b (0-0.1), strong causal strengths m_1, m_2 (0.75-0.99), and midrange priors, while agents with MV or weak EA show a higher b (0.15-0.62) and weaker m_i (0.25-0.82). Outlook. Next steps include extending this framework to semantically meaningless tasks and to causal structures beyond colliders to probe reasoning robustness. It should be noted that "normative" parameter regimes (low leak, strong causes) are not universally optimal and ultimately depend on the use case: tasks that legitimately require uncertainty about unobserved causes may warrant a nonzero leak. Our prompts do not control this dimension: we neither instruct models to ignore nor to include unmentioned causes. A targeted analysis of the explanations produced through CoT could provide first insights into whether and how LLMs represent and regulate them.
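The parameter regimes above (leak b, causal strengths m_1, m_2, priors) can be made concrete with a noisy-OR collider, the standard parameterization for two causes C1, C2 of one effect E. The sketch below is illustrative only: the specific parameter values and function names are not from the paper, and it simply shows how a low-leak, strong-cause regime produces a visible explaining-away effect while a high-leak, weak-cause regime does not.

```python
from itertools import product

def noisy_or_posteriors(b, m1, m2, p1=0.5, p2=0.5):
    """Posterior of C1 given E=1, with and without additionally observing
    C2=1, in a noisy-OR collider C1 -> E <- C2 with leak b and
    causal strengths m1, m2; p1, p2 are the prior probabilities of the causes."""
    def p_effect(c1, c2):
        # noisy-OR: effect fails only if the leak and every active cause fail
        return 1 - (1 - b) * (1 - m1) ** c1 * (1 - m2) ** c2

    # joint weights P(C1=c1, C2=c2, E=1) over all four cause configurations
    joint = {(c1, c2): (p1 if c1 else 1 - p1)
                       * (p2 if c2 else 1 - p2)
                       * p_effect(c1, c2)
             for c1, c2 in product((0, 1), repeat=2)}

    p_c1_given_e = (joint[1, 0] + joint[1, 1]) / sum(joint.values())
    p_c1_given_e_c2 = joint[1, 1] / (joint[0, 1] + joint[1, 1])
    return p_c1_given_e, p_c1_given_e_c2

# "normative" regime: low leak, strong causes -> clear explaining away
strong = noisy_or_posteriors(b=0.05, m1=0.9, m2=0.9)
# high leak, weak causes -> explaining away nearly vanishes
weak = noisy_or_posteriors(b=0.5, m1=0.3, m2=0.3)
```

Explaining away is the drop from the first posterior to the second: once C2 is known to be present, it already accounts for E, so belief in C1 falls back toward its prior.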

📄 Full Content

The nature of intelligence in both humans and machines is a longstanding question. While there is no universally accepted definition, the ability to reason causally is often regarded as a pivotal aspect of intelligence (Lake et al., 2017). Evaluating causal reasoning in LLMs and humans on the same tasks hence provides a more comprehensive understanding of their respective strengths and weaknesses. The naming convention for the GPT-5 family is as follows: gpt-5-v_r.

Human-LLM alignment: SOTA models establish the ceiling; CoT helps others converge (Q1). Recent top-performing LLMs, e.g., gemini-2.5-pro, already show strong human alignment under Direct prompting (ρ ≈ 0.85), with little to no improvement via CoT. Conversely, CoT significantly boosts alignment in lighter or older models (e.g., gemini-2.5-flash-lite: +0.503 → ρ = 0.845), helping them converge to the same ceiling.
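The alignment score ρ above is a rank correlation between human and model judgments. A self-contained Spearman ρ sketch (assuming no tied scores; tied data would need the usual average-rank correction, omitted here for brevity):

```python
def spearman_rho(a, b):
    """Spearman rank correlation between two equal-length score vectors,
    assuming no ties: rank both vectors, then take Pearson correlation
    of the ranks."""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0] * len(v)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r

    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5
```

Because only ranks enter the computation, ρ rewards models whose probability judgments *order* the tasks the way humans do, even if the absolute numbers differ.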

Humans are consistent reasoners; CoT improves reasoning consistency, especially for smaller and older models (Q2). On our tasks, CoT yields a small but reliable lift in median reasoning consistency, raising LOOCV R² from 0.933 to 0.941 (+0.008, +0.91%). More importantly, CoT disproportionately helps the agents that are less consistent under Direct prompting: the lower tail rises (minimum R²: 0.277 (gemini-2.5-flash-lite; numeric) → 0.692 (claude-3-haiku-20240307; CoT)) and the spread tightens markedly (IQR 0.116 → 0.060). Humans show high consistency across tasks, with LOOCV R² = 0.937, only a narrow gap to SOTA models, which achieve LOOCV R² of up to 0.99 (gemini-2.5-pro).
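LOOCV R² refits the model once per observation with that observation held out, predicts it, and then scores the held-out predictions against the data. The generic sketch below is not the paper's pipeline; since the fitted model is not specified here, a simple least-squares line stands in as a placeholder `fit`/`predict` pair.

```python
def loocv_r2(x, y, fit, predict):
    """Leave-one-out cross-validated R^2: 1 - SS_res / SS_tot, where the
    residuals come from predictions made with each point held out."""
    n = len(y)
    preds = []
    for i in range(n):
        x_train = [v for j, v in enumerate(x) if j != i]
        y_train = [v for j, v in enumerate(y) if j != i]
        model = fit(x_train, y_train)
        preds.append(predict(model, x[i]))
    mean_y = sum(y) / n
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, preds))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def fit_line(x, y):
    # ordinary least squares for y = a + b*x (illustrative stand-in model)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

def predict_line(model, xi):
    a, b = model
    return a + b * xi
```

Unlike in-sample R², the LOO variant penalizes an agent whose answers fit only when the model is allowed to see them, which is why it serves as a consistency measure here.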

Explaining-away is common; CoT effects are mixed (Q3). Most LLMs (27/30) show explaining-away (EA > 0), and 24/30 exceed human EA levels (EA_human = 0.09) (see Fig. 2). CoT helps agents lacking EA (e.g., claude-3-haiku, gemini-2.5-flash-lite) but can slightly reduce EA in strong ones (e.g., gpt-4o, o3-mini). A similar pattern holds for Markov violations (MV): while eight agents show MV under Direct prompting, CoT improves most but can worsen others (e.g., claude-3.5-haiku). High-EA-

This content is AI-processed based on open access ArXiv data.
