Can LLMs Reason Structurally? Benchmarking via the Lens of Data Structures

Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the original arXiv source.

Large language models (LLMs) are deployed on increasingly complex tasks that require multi-step decision-making. Understanding their algorithmic reasoning abilities is therefore crucial. However, we lack a diagnostic benchmark for evaluating this capability. We propose data structures as a principled lens: as fundamental building blocks of algorithms, they naturally probe structural reasoning, the ability to understand and manipulate relationships such as order, hierarchy, and connectivity that underpin algorithmic reasoning. We introduce DSR-Bench, spanning 20 data structures, 35 operations, and 4,140 problem instances. DSR-Bench features hierarchical task organization, fully automated generation and evaluation, and fine-grained diagnostics. Evaluating 13 state-of-the-art LLMs reveals critical limitations: the top-performing model achieves only 0.46/1 on challenging instances. Three auxiliary probes targeting more realistic usage expose further weaknesses: models perform poorly on spatial data and context-rich scenarios, and they struggle to reason over their own code.


💡 Research Summary

The paper addresses a critical gap in the evaluation of large language models (LLMs): the lack of a fine‑grained, algorithm‑centric benchmark that isolates pure reasoning ability from external tools or domain‑specific knowledge. To fill this gap, the authors propose DSR‑Bench (Data Structure Reasoning Benchmark), a synthetic, fully automated suite that tests LLMs on the manipulation of fundamental data structures, which they argue are the building blocks of algorithmic reasoning.

DSR‑Bench covers 20 distinct data structures grouped into six relational categories—Linear (arrays), Temporal (stacks, queues, priority queues), Associative (hash maps, tries, suffix trees, skip lists), Hierarchical (binary search trees, heaps, red‑black trees, B+ trees), Network (graphs, disjoint‑set union), and Hybrid (Bloom filters, directed acyclic word graphs). Across these structures, the benchmark defines 35 operations ranging from elementary (access, insert, delete) to compound sequences that require multi‑step reasoning. The total corpus contains 4,140 problem instances, each generated programmatically with uniform numeric ranges (0‑100) and random lowercase strings, avoiding training‑data contamination and guaranteeing deterministic ground truth.
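
The generation recipe can be sketched as follows. This is a hypothetical illustration in the spirit of the paper's description (uniform values in 0‑100, deterministic ground truth obtained by simulating the operations), not the authors' actual code; the function name and instance format are assumptions:

```python
import random

def generate_stack_instance(n_ops, seed=0):
    """Generate one synthetic stack-manipulation instance with a
    deterministic ground truth (illustrative sketch, not DSR-Bench code)."""
    rng = random.Random(seed)  # seeding makes the instance reproducible
    ops, stack = [], []
    for _ in range(n_ops):
        # Pop only when the stack is non-empty; otherwise push.
        if stack and rng.random() < 0.4:
            ops.append(("pop", stack.pop()))
        else:
            value = rng.randint(0, 100)  # uniform numeric range 0-100
            stack.append(value)
            ops.append(("push", value))
    # The final stack state is the ground-truth answer.
    return ops, stack

ops, answer = generate_stack_instance(10, seed=42)
```

Because the ground truth is computed by direct simulation, evaluation can be fully automated: a model's answer is simply compared against the final state.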

A key design principle is hierarchical task organization: simpler tasks serve as prerequisites for more complex ones, enabling precise failure localization. Difficulty is stratified into short (5‑10 tokens), medium (11‑20), and long (21‑30) inputs, allowing assessment of length generalization. Five evaluation components broaden the scope:

  1. Main – canonical data‑structure tasks.
  2. Challenge – high‑complexity hybrid structures and long operation chains; the best model scores only 0.46/1 here.
  3. Spatial – high‑dimensional or multi‑dimensional data, revealing performance degradation as dimensionality grows.
  4. Realistic – tasks embedded in natural‑language scenarios (e.g., clinic appointments, children lining up), exposing difficulties with ambiguity and context extraction.
  5. Code – a probe that asks models to generate and reason over code; results show minimal benefit from self‑generated code and limited gains from external interpreters on non‑standard tasks.

The benchmark’s prompts follow a four‑part template: (i) concise description of the data structure, (ii) explicit definition of allowed operations, (iii) initial state and any auxiliary inputs, and (iv) a direct question about the final state. The authors experiment with five prompting strategies (zero‑shot, few‑shot, chain‑of‑thought, self‑consistency, and mixed) to explore robustness.
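
Assembling such a prompt is straightforward; the sketch below follows the four-part template described above, though the exact wording and helper name are assumptions, not the benchmark's actual prompt text:

```python
def build_prompt(structure_desc, allowed_ops, initial_state, operations, question):
    """Assemble a prompt following the paper's four-part template
    (wording here is illustrative, not the benchmark's exact text)."""
    return "\n\n".join([
        f"Data structure: {structure_desc}",                           # (i) concise description
        f"Allowed operations: {allowed_ops}",                          # (ii) operation definitions
        f"Initial state: {initial_state}\nOperations: {operations}",   # (iii) state and inputs
        f"Question: {question}",                                       # (iv) final-state query
    ])

prompt = build_prompt(
    "A stack is a last-in-first-out (LIFO) container.",
    "push(x) adds x to the top; pop() removes and returns the top element.",
    "[]",
    "push(3), push(7), pop(), push(5)",
    "What is the final state of the stack, from bottom to top?",
)
```

Keeping the template fixed across structures isolates the reasoning difficulty of the task from prompt-format variation.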

Empirical evaluation involves 13 state‑of‑the‑art LLMs, spanning open‑source and closed‑source, instruction‑tuned and reasoning‑oriented models. Key findings include:

  • Linear and Temporal categories show relatively higher accuracy, but performance collapses on multi‑attribute or multi‑hop operations.
  • Associative, Hierarchical, and Network categories suffer from basic errors in traversal, insertion, and union‑find, indicating weak internal models of hierarchy and connectivity.
  • The Challenge subset stresses even the strongest models, confirming that current architectures lack the compositional depth needed for complex hybrid reasoning.
  • The Spatial probe demonstrates a steep accuracy drop as the number of dimensions increases, suggesting limited spatial reasoning capabilities.
  • The Realistic probe reveals that models struggle to extract structural information from ambiguous, context‑rich language, highlighting a gap between formal reasoning and real‑world deployment.
  • The Code probe indicates that self‑generated code rarely improves outcomes; external interpreters help only on familiar, well‑specified tasks and fail on novel or realistic variants.
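
For reference, the union-find (disjoint-set) primitives on which models reportedly make basic errors are simple to state. A textbook implementation with path halving and union by size looks like this (a standard sketch, not the benchmark's code):

```python
class DisjointSet:
    """Textbook disjoint-set union with path halving and union by size."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        # Path halving (a form of path compression): each visited node
        # is re-pointed at its grandparent, flattening the tree.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False  # already in the same set
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra  # attach the smaller tree under the larger
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True

ds = DisjointSet(5)
ds.union(0, 1)
ds.union(1, 2)
connected = ds.find(0) == ds.find(2)  # 0 and 2 are now transitively connected
```

That even these few lines of state tracking trip up strong models underscores the paper's point: the failures are in maintaining relational state across steps, not in recalling the algorithm.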

The authors conclude that while LLMs have made progress on isolated algorithmic primitives, they remain far from robust structural reasoning, especially when tasks require composition, handling of high‑dimensional data, or integration with natural language context. They advocate for future research directions such as incorporating explicit graph- or tree-based memory modules, training objectives that emphasize multi‑step manipulation of data structures, and curriculum learning that mirrors the hierarchical organization of DSR‑Bench.

All data, generation scripts, and evaluation code are released publicly (GitHub and HuggingFace), inviting the community to extend the benchmark, add new structures, and develop models that can truly reason about the underlying relationships that power algorithms.

