Exposing Weaknesses of Large Reasoning Models through Graph Algorithm Problems


Large Reasoning Models (LRMs) have advanced rapidly; however, existing benchmarks in mathematics, code, and common-sense reasoning remain limited. They lack long-context evaluation, offer insufficient challenge, and provide answers that are difficult to verify programmatically. We introduce GrAlgoBench, a benchmark designed to evaluate LRMs through graph algorithm problems. Such problems are particularly well suited for probing reasoning abilities: they demand long-context reasoning, allow fine-grained control of difficulty levels, and enable standardized, programmatic evaluation. Across nine tasks, our systematic experiments reveal two major weaknesses of current LRMs. First, accuracy deteriorates sharply as context length increases, falling below 50% once graphs exceed 120 nodes. This degradation is driven by frequent execution errors, weak memory, and redundant reasoning. Second, LRMs suffer from an over-thinking phenomenon, primarily caused by extensive yet largely ineffective self-verification, which inflates reasoning traces without improving correctness. By exposing these limitations, GrAlgoBench establishes graph algorithm problems as a rigorous, multidimensional, and practically relevant testbed for advancing the study of reasoning in LRMs. Code is available at https://github.com/Bklight999/GrAlgoBench.


💡 Research Summary

The paper introduces GrAlgoBench, a benchmark specifically designed to probe the reasoning capabilities of Large Reasoning Models (LRMs) using graph algorithm problems. Existing benchmarks in mathematics, code generation, and common‑sense reasoning suffer from three major drawbacks: they involve short contexts that do not test long‑range memory, they are no longer challenging for state‑of‑the‑art models, and their answers are expressed in heterogeneous formats that make automated verification difficult. Graph algorithm tasks naturally address all three issues. Describing a graph requires enumerating nodes and edges, which quickly yields inputs of several thousand tokens; scaling the number of nodes (from 8 up to 160) provides a smooth, controllable difficulty curve; and the desired outputs are typically integers, node identifiers, or edge lists, which can be checked programmatically with a single correct representation.
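As an illustration of that last point, a checker for the simplest kind of task (finding the maximum-degree node) reduces to an exact integer comparison. The sketch below is our own illustration, not code from the benchmark:

```python
from collections import defaultdict

def max_degree_node(edges):
    """Return the node with the largest degree (ties broken by smaller id)."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return min(deg, key=lambda n: (-deg[n], n))

# The prompt is just the edge list, so input length grows with graph size,
# while the ground-truth answer stays a single node id.
edges = [(0, 1), (0, 2), (0, 3), (1, 2)]
assert max_degree_node(edges) == 0  # node 0 has degree 3
```

Because the ground truth is a single node identifier, grading a model's answer needs no fuzzy string matching or LLM judging.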

GrAlgoBench comprises nine tasks organized into three algorithmic reasoning categories—Enumeration (brute‑force), Exploration (search/backtracking), and Intuition (greedy). Each category maps directly onto a classic design paradigm from CLRS (complete enumeration, graph search, greedy optimization). Within each category the authors define easy, medium, and hard variants (e.g., Maximum Degree Node, Maximum Weight Triangle, and Maximum Clique for Enumeration). The dataset contains 2,700 graphs sampled from real‑world networks (street maps, Wikipedia link graphs, DBpedia, etc.) across six size scales, ensuring diversity and minimizing data contamination. To assign a reasoning category to each problem, the authors generate 300 Erdős‑Rényi instances per problem, query LRMs for solutions, and use a strong LLM (Qwen‑2.5‑72B) as a judge; human verification confirms the automatic labels.
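Erdős‑Rényi instances of the kind used for category labeling can be generated in a few lines. The edge probability below is an assumed parameter chosen purely for illustration, not a value taken from the paper:

```python
import random

def erdos_renyi(n, p, seed=None):
    """Sample an undirected G(n, p) graph as an edge list with u < v."""
    rng = random.Random(seed)
    return [(u, v) for u in range(n) for v in range(u + 1, n) if rng.random() < p]

# Scaling n across the benchmark's size range (8 up to 160 nodes) yields a
# smooth difficulty curve, since the edge list (and hence the prompt) grows
# roughly quadratically in n at fixed p.
for n in (8, 40, 160):
    edges = erdos_renyi(n, p=0.2, seed=0)
```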

Evaluation is performed on a suite of contemporary LRMs (OpenAI‑O1, DeepSeek‑R1, Qwen‑2.5, GPT‑4‑Turbo) and on non‑reasoning baselines. Metrics include Pass@k, Consistency@k, Z‑score, and an efficiency measure that captures how many tokens are spent before reaching the correct answer. Detailed error analysis tags failures as execution errors, memory lapses, or redundant reasoning steps.
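The summary does not spell out the metric formulas, so the snippet below uses the standard unbiased Pass@k estimator and one plausible reading of Consistency@k (majority-vote agreement among sampled answers); both should be treated as assumptions about the exact definitions used in the paper:

```python
from collections import Counter
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of P(at least one of k samples is correct),
    given n total samples of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def consistency_at_k(answers):
    """Fraction of sampled answers agreeing with the majority answer
    (one plausible reading of Consistency@k; an assumption here)."""
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)
```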

Two principal weaknesses emerge. First, LRMs are highly sensitive to context length. Accuracy drops sharply once graph size exceeds about 120 nodes, falling below 50% for most models. The decline is driven by (a) step‑by‑step execution mistakes (e.g., mis‑summing edge weights, failing to detect cycles), (b) weak long‑term memory that causes models to forget previously stated node degrees or edge weights, and (c) redundant reasoning in which the model revisits already explored states, inflating token usage. Second, the authors identify an "over‑thinking" phenomenon: models engage in extensive self‑verification, inserting filler phrases such as "Wait", "But", and "So" and repeatedly re‑checking intermediate results. This self‑verification is largely ineffective: it inflates the reasoning trace (on average 1.8× more tokens) without improving the success rate of the verification step, especially in Exploration tasks that require backtracking.
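A crude way to quantify the over‑thinking pattern is to count how many sentences in a reasoning trace open with a self‑verification marker. This proxy is our own assumption for illustration, not the paper's exact methodology:

```python
import re

# Markers of the kind the summary cites as self-verification filler
# ("Wait", "But", ...); the list is illustrative, not exhaustive.
VERIFICATION_MARKERS = ("wait", "but", "let me check", "let me verify")

def verification_share(trace):
    """Fraction of sentences beginning with a self-verification marker
    (a rough proxy for over-thinking in a reasoning trace)."""
    sentences = [s.strip().lower() for s in re.split(r"[.!?\n]+", trace) if s.strip()]
    flagged = sum(any(s.startswith(m) for m in VERIFICATION_MARKERS) for s in sentences)
    return flagged / max(len(sentences), 1)
```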

The paper suggests two avenues for mitigation. Enhancing explicit memory structures or state‑update mechanisms could allow models to store intermediate results and retrieve them without recomputation. Introducing explicit termination signals for verification (e.g., “Verification complete”) could curb unnecessary repetitions. Moreover, the benchmark itself can be expanded to dynamic graphs, weighted temporal networks, or multi‑agent scenarios to further stress test future models.

In summary, GrAlgoBench provides a rigorous, scalable, and programmatically verifiable testbed that reveals critical limitations of current LRMs—namely, fragile long‑context reasoning and inefficient over‑thinking. The findings offer concrete guidance for the next generation of reasoning‑oriented language models.

