Causal Reasoning Favors Encoders: On The Limits of Decoder-Only Models
In-context learning (ICL) underpins recent advances in large language models (LLMs), yet its role and performance in causal reasoning remain unclear. Causal reasoning demands multi-hop composition and strict conjunctive control, and reliance on spurious lexical correlations in the input can produce misleading results. We hypothesize that, because they project the entire input into a latent space, encoder-only and encoder-decoder architectures are better suited to such multi-hop conjunctive reasoning than decoder-only models. To test this, we compare fine-tuned versions of all three architectures against zero-shot and few-shot ICL in both natural-language and non-natural-language settings. We find that ICL alone is insufficient for reliable causal reasoning and often over-focuses on irrelevant input features. In particular, decoder-only models are noticeably brittle under distributional shift, whereas fine-tuned encoder-only and encoder-decoder models generalize more robustly across our tests, including the non-natural-language split, and are matched or surpassed by decoder-only architectures only at very large scales. We conclude that for cost-effective, short-horizon, robust causal reasoning, encoder-only or encoder-decoder architectures with targeted fine-tuning are preferable.
💡 Research Summary
The paper investigates how well large language models (LLMs) can perform causal reasoning, a task that requires multi‑hop logical composition and strict conjunctive control. The authors hypothesize that encoder‑only and encoder‑decoder architectures, because they can project the entire input into a latent space, are better suited for this type of reasoning than decoder‑only models, which process tokens sequentially. To test this, they evaluate three families of models—encoder‑only (e.g., BERT), encoder‑decoder (e.g., BART), and decoder‑only (e.g., GPT‑4, Claude Opus, Qwen)—under both zero‑shot/few‑shot in‑context learning (ICL) and supervised fine‑tuning.
The evaluation uses a synthetic benchmark derived from SimpleLogic, which generates tuples of facts, rules, a query, and a label. Two test splits are created: (1) a natural‑language split with varying reasoning depth (the number of inference steps required) and (2) a non‑natural‑language split where characters are randomized to remove lexical cues. This design isolates logical structure from surface form.
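To make the benchmark design concrete, here is a minimal sketch of what a SimpleLogic-style (facts, rules, query, label) tuple and the character-randomized split might look like. All function names and the toy example are illustrative assumptions, not the paper's actual generator; labels are derived by forward chaining over conjunctive rules, and the non-natural-language split simply renames every predicate to a random string so that only logical structure survives.

```python
import random
import string

def make_example():
    """A hypothetical SimpleLogic-style tuple: (facts, rules, query, label).
    Each rule is (body, head): all body predicates must hold for the head."""
    facts = ["wet", "cold"]
    rules = [(["wet", "cold"], "shivering"),    # conjunctive rule
             (["shivering"], "uncomfortable")]  # depth-2 reasoning chain
    query, label = "uncomfortable", True
    return facts, rules, query, label

def forward_chain(facts, rules):
    """Derive all provable predicates by exhaustive forward chaining."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in known and all(b in known for b in body):
                known.add(head)
                changed = True
    return known

def randomize_lexicon(facts, rules, query):
    """Non-natural-language split: replace every predicate with a random
    string, removing lexical cues while preserving logical structure."""
    vocab = set(facts) | {query}
    for body, head in rules:
        vocab |= set(body) | {head}
    mapping = {p: "".join(random.choices(string.ascii_lowercase, k=8))
               for p in vocab}
    new_facts = [mapping[p] for p in facts]
    new_rules = [([mapping[b] for b in body], mapping[head])
                 for body, head in rules]
    return new_facts, new_rules, mapping[query]

facts, rules, query, label = make_example()
assert (query in forward_chain(facts, rules)) == label
rf, rr, rq = randomize_lexicon(facts, rules, query)
# the label is invariant under predicate renaming, by construction
assert (rq in forward_chain(rf, rr)) == label
```

Under this construction, a model that relies on the surface meaning of words like "wet" or "cold" loses that signal in the randomized split, while the ground-truth label is unchanged, which is exactly the isolation of logical structure from surface form described above.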
Results show that ICL alone is insufficient for reliable causal reasoning. Decoder‑only models, when used only with ICL, over‑focus on spurious lexical patterns and are brittle to distributional shifts, especially as reasoning depth increases. Fine‑tuned encoder‑only and encoder‑decoder models maintain higher accuracy across deeper chains and are robust to the lexical ablation, indicating they learn to attend to the underlying logical relations. The performance gap widens with depth; only a very large decoder‑only model (GPT‑5) approaches parity, but it incurs substantial latency and computational cost.
Consequently, the authors argue that for cost‑effective, short‑horizon causal reasoning, targeted fine‑tuning of encoder‑based or encoder‑decoder models is preferable to relying on ICL with decoder‑only models. They suggest future work explore hybrid symbolic‑neural approaches and scaling strategies that retain the efficiency of encoder architectures while closing the performance gap at larger scales.