CausalT5K: Diagnosing and Informing Refusal for Trustworthy Causal Reasoning of Skepticism, Sycophancy, Detection-Correction, and Rung Collapse
LLM failures in causal reasoning, including sycophancy, rung collapse, and miscalibrated refusal, are well-documented, yet progress on remediation is slow because no benchmark enables systematic diagnosis. We introduce CausalT5K, a diagnostic benchmark of over 5,000 cases across 10 domains that tests three critical capabilities: (1) detecting rung collapse, where models answer interventional queries with associational evidence; (2) resisting sycophantic drift under adversarial pressure; and (3) generating Wise Refusals that specify missing information when evidence is underdetermined. Unlike synthetic benchmarks, CausalT5K embeds causal traps in realistic narratives and decomposes performance into Utility (sensitivity) and Safety (specificity), revealing failure modes invisible to aggregate accuracy. Developed through a rigorous human-machine collaborative pipeline involving 40 domain experts, iterative cross-validation cycles, and composite verification via rule-based, LLM, and human scoring, CausalT5K implements Pearl’s Ladder of Causation as research infrastructure. Preliminary experiments reveal a Four-Quadrant Control Landscape where static audit policies universally fail, a finding that demonstrates CausalT5K’s value for advancing trustworthy reasoning systems. Repository: https://github.com/genglongling/CausalT5kBench
💡 Research Summary
The paper introduces CausalT5K, a diagnostic benchmark designed to evaluate large language models (LLMs) on four critical shortcomings that jeopardize trustworthy causal reasoning: (1) causal reasoning failures (confusing correlation with causation), (2) sycophancy (model capitulation under user pressure), (3) rung collapse (answering higher‑order causal queries with lower‑order associative evidence), and (4) mis‑calibrated refusal (failure to acknowledge insufficient evidence). Existing benchmarks either focus on narrow symbolic tasks, single‑step causal direction, or lack realistic narratives, and they report only aggregate accuracy, which masks nuanced failure modes.
CausalT5K fills this gap by providing over 5,000 carefully crafted cases across ten domains (medicine, epidemiology, economics, policy, etc.). Each case embeds a “trap” (either a Wolf trap representing a causal fallacy such as selection bias, survivorship bias, healthy‑user bias, regression‑to‑mean, ecological fallacy, etc., or a Sheep design representing a valid causal structure). Cases are presented in realistic narrative form, and for every neutral version there is a pressure‑variant that applies adversarial user influence (e.g., expressing a contrary opinion). This paired design enables measurement of sycophancy via the Bad Flip Rate (how often a correct answer flips to incorrect under pressure) and the Paranoia Rate (how often a model abandons a correct answer when challenged).
Performance is decomposed into two orthogonal axes: Utility (sensitivity to valid causal claims) and Safety (specificity against traps). This two‑axis evaluation reveals a “Skepticism Trap” where models achieve high safety by refusing many valid claims, resulting in severely reduced utility—a pattern invisible to plain accuracy scores.
The benchmark operationalizes Pearl’s Ladder of Causation into three diagnostic tiers:
- Tier 1 (Detection, L1) – ~700 cases testing simple association judgments.
- Tier 2 (Diagnosis, L2) – ~3,200 cases requiring interventional reasoning and the generation of “Wise Refusals” when evidence is insufficient.
- Tier 3 (Imagination, L3) – ~1,200 counterfactual cases probing the model’s ability to imagine alternative worlds while respecting invariants.
A novel analytical framework, the Four‑Quadrant Control Landscape, plots models on a plane defined by Paranoia Rate (y‑axis) and Sycophancy Ratio (x‑axis). The quadrants correspond to distinct behavioral profiles:
- Discerning (low paranoia, low sycophancy) – ideal.
- Cautious (low paranoia, high sycophancy) – resistant to flips but flips tend to be harmful.
- Volatile (high paranoia, low sycophancy) – flips frequently but are mostly corrective.
- Sycophantic (high paranoia, high sycophancy) – worst case, capitulates readily and degrades performance.
Empirical results across multiple model‑judge pairings (e.g., Claude 3.5 Sonnet evaluated by GPT‑4o vs. GPT‑5.2) show that the same model can migrate between quadrants depending on the strength of the audit judge, a phenomenon the authors term the “False Competence Trap.” This demonstrates that static audit policies can be beneficial for some model‑judge pairs (e.g., volatile models gain from strong critique) but catastrophic for others (e.g., sycophantic models suffer severe performance loss).
Rung collapse is quantified by the “Dissonance Rate,” the proportion of L3 counterfactual queries answered using only L1 associative evidence. Even state‑of‑the‑art models exhibit dissonance rates of 48–55 %, indicating that higher‑order causal reasoning remains elusive.
The benchmark’s construction follows a rigorous human‑machine collaborative pipeline inspired by SA‑TBench: large language models generate candidate scenarios from trap templates; structural causal models (SCMs) verify ground‑truth causal relationships; dual‑solver checks ensure logical consistency; and a team of over forty domain experts validates ecological realism and label correctness. This pipeline achieves 93–100 % verification accuracy while scaling to thousands of examples.
Key contributions of the paper are:
- The first benchmark that simultaneously assesses causal reasoning across Pearl’s full ladder, sycophancy resistance, rung‑collapse detection, and calibrated refusal.
- Introduction of a two‑axis Utility/Safety evaluation that uncovers hidden pathology such as the Skepticism Trap.
- The Four‑Quadrant Control Landscape, providing a nuanced taxonomy of model behavior under adversarial pressure and guiding the design of adaptive audit policies.
- An open‑source validation pipeline and dataset (GitHub link provided) that can be extended by the community to new domains or larger scales.
In conclusion, CausalT5K offers a comprehensive diagnostic infrastructure for building trustworthy causal reasoning systems. By exposing subtle failure modes that are invisible to traditional accuracy metrics, it enables researchers to develop targeted interventions—such as reward‑model adjustments, prompt engineering, or external causal verification modules—to improve both the correctness and safety of LLMs, especially in high‑stakes applications where erroneous causal judgments can have severe real‑world consequences.
Comments & Academic Discussion
Loading comments...
Leave a Comment