Large Language Model Reasoning Failures


Large Language Models (LLMs) have exhibited remarkable reasoning capabilities, achieving impressive results across a wide range of tasks. Despite these advances, significant reasoning failures persist, occurring even in seemingly simple scenarios. To systematically understand and address these shortcomings, we present the first comprehensive survey dedicated to reasoning failures in LLMs. We introduce a novel categorization framework that distinguishes reasoning into embodied and non-embodied types, with the latter further subdivided into informal (intuitive) and formal (logical) reasoning. In parallel, we classify reasoning failures along a complementary axis into three types: fundamental failures intrinsic to LLM architectures that broadly affect downstream tasks; application-specific limitations that manifest in particular domains; and robustness issues characterized by inconsistent performance across minor variations. For each reasoning failure, we provide a clear definition, analyze existing studies, explore root causes, and present mitigation strategies. By unifying fragmented research efforts, our survey provides a structured perspective on systemic weaknesses in LLM reasoning, offering valuable insights and guiding future research towards building stronger, more reliable, and robust reasoning capabilities. We additionally release a comprehensive collection of research works on LLM reasoning failures, as a GitHub repository at https://github.com/Peiyang-Song/Awesome-LLM-Reasoning-Failures, to provide an easy entry point to this area.


💡 Research Summary

The paper “Large Language Model Reasoning Failures” presents the first comprehensive survey that systematically catalogs and analyzes the myriad ways in which large language models (LLMs) falter at reasoning tasks. Recognizing that impressive benchmark scores mask persistent errors—even on seemingly trivial problems—the authors propose a dual‑axis taxonomy to bring order to a fragmented literature.

The first axis classifies reasoning itself. It distinguishes embodied reasoning, which requires interaction with a physical environment (e.g., spatial navigation, physics simulation, tool use), from non‑embodied reasoning, which operates purely on textual inputs. Non‑embodied reasoning is further split into informal (intuitive) reasoning, driven by heuristics, biases, and everyday judgment, and formal (logical) reasoning, which involves explicit manipulation of symbols, mathematics, code, and formal logic.

The second axis categorizes failure types:

  1. Fundamental failures – intrinsic to the LLM architecture or training objective, manifesting across many downstream tasks.
  2. Application‑specific limitations – deficits that appear only in particular domains such as coding, scientific reasoning, or social interaction.
  3. Robustness issues – performance instability caused by minor prompt variations, noise, or distribution shifts.

By crossing these two axes, the authors generate a matrix that maps each failure to a concrete research area. The survey then walks through the matrix, providing for each cell: a precise definition, a curated list of representative studies, an analysis of root causes, and a set of mitigation strategies.
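The cross-product structure of the two axes can be pictured directly as a 3×3 lookup table. The sketch below is illustrative only; the enum names and example entry are my own shorthand for the survey's categories, not code from the paper:

```python
from enum import Enum
from itertools import product

class ReasoningType(Enum):
    EMBODIED = "embodied"
    INFORMAL = "non-embodied / informal (intuitive)"
    FORMAL = "non-embodied / formal (logical)"

class FailureType(Enum):
    FUNDAMENTAL = "fundamental (intrinsic to the architecture)"
    APPLICATION_SPECIFIC = "application-specific (domain-bound)"
    ROBUSTNESS = "robustness (instability under minor variations)"

# Each cell of the survey's matrix pairs a reasoning type with a failure
# type and collects the studies filed under that combination.
matrix = {(r, f): [] for r, f in product(ReasoningType, FailureType)}

matrix[(ReasoningType.FORMAL, FailureType.FUNDAMENTAL)].append(
    "multi-step logical proofs break down"
)

assert len(matrix) == 9  # 3 reasoning types x 3 failure types
```

Walking the survey then amounts to visiting each of the nine cells and filling in its definition, studies, causes, and mitigations.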

Key insights include:

  • Cognitive‑function analogues – LLMs lack human‑like working memory, inhibitory control, and cognitive flexibility. This leads to “proactive interference” when earlier context overwhelms later updates, and to rigid adherence to learned patterns despite contextual cues.
  • Cognitive biases – Confirmation bias, anchoring, order bias, framing effects, and many socially‑oriented biases (group attribution, negativity bias) are reproduced in LLM outputs. The authors trace these to three sources: (i) statistical regularities in massive pre‑training corpora, (ii) architectural predispositions such as causal masking in Transformers, and (iii) alignment procedures like RLHF that inherit human raters’ own biases.
  • Formal reasoning gaps – Even state‑of‑the‑art models struggle with multi‑step logical proofs, symbolic manipulation, and precise arithmetic, largely because next‑token prediction rewards surface‑level pattern completion rather than deep, deliberative reasoning.
  • Embodied reasoning deficits – Purely text‑based LLMs cannot directly perceive or act in the world, so tasks requiring physical dynamics, affordance reasoning, or real‑time feedback expose severe shortcomings.
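The order-bias finding above suggests a simple diagnostic: ask the same question twice with the answer options swapped and flag any disagreement. A minimal sketch, assuming a hypothetical `ask_model(prompt) -> str` callable (not from the paper):

```python
def order_bias_probe(ask_model, question: str, option_a: str, option_b: str) -> bool:
    """Return True if swapping the option order changes the model's choice,
    i.e. the answer depends on presentation order rather than content."""
    forward = ask_model(f"{question}\n(1) {option_a}\n(2) {option_b}\nAnswer with the option text.")
    swapped = ask_model(f"{question}\n(1) {option_b}\n(2) {option_a}\nAnswer with the option text.")
    return forward.strip() != swapped.strip()

# Demo with a stub "model" that always echoes option (1) -- maximally order-biased:
stub = lambda prompt: prompt.split("(1) ")[1].split("\n")[0]
assert order_bias_probe(stub, "Which is larger?", "3/4", "2/3") is True
```

A content-based responder would return the same answer in both orders and the probe would report `False`; averaging over many such swapped pairs gives a crude order-bias rate.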

Mitigation strategies are organized into three families:

  1. Data‑centric – Curating balanced, bias‑reduced datasets; augmenting training data with counter‑factual or adversarial examples; and incorporating multimodal signals (vision, proprioception) to ground language.
  2. Training‑centric – Introducing architectural modifications (e.g., external memory modules, attention mechanisms inspired by human executive control), employing chain‑of‑thought prompting, meta‑learning, and curriculum learning that explicitly reward multi‑step reasoning.
  3. Post‑processing – Advanced prompt engineering, output filtering, self‑consistency checks, and “personality” conditioning to steer models away from undesirable bias patterns.
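Of these, self-consistency checks are the easiest to sketch: sample several reasoning chains at nonzero temperature and keep the majority final answer. The snippet below is a generic majority-vote recipe, assuming a hypothetical `sample_answer()` callable that runs the model once and returns only its final answer; it is not code from the paper:

```python
from collections import Counter
from typing import Callable, List

def self_consistent_answer(sample_answer: Callable[[], str], n_samples: int = 5) -> str:
    """Sample n reasoning chains and return the most common final answer."""
    answers: List[str] = [sample_answer().strip() for _ in range(n_samples)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

# Demo with a canned sequence standing in for stochastic model calls:
canned = iter(["42", "41", "42", "42", "17"])
assert self_consistent_answer(lambda: next(canned), n_samples=5) == "42"
```

The vote filters out low-probability reasoning slips, at the cost of `n_samples` times the inference budget.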

The authors also release an extensive, continuously updated GitHub repository (https://github.com/Peiyang-Song/Awesome-LLM-Reasoning-Failures) that aggregates over 300 papers, datasets, and benchmark suites related to LLM reasoning failures. This resource is intended to serve as a one‑stop entry point for researchers seeking to diagnose, compare, or improve upon existing failure modes.

Overall, the survey not only maps the current landscape of LLM reasoning failures but also highlights common underlying mechanisms, thereby offering a clear research agenda: improve working‑memory‑like capacities, reduce inherited biases, develop robust prompting and alignment techniques, and integrate embodied, multimodal experiences. By doing so, the community can move toward LLMs that reason more reliably, transparently, and safely across a broad spectrum of real‑world applications.

