Evaluating Code Reasoning Abilities of Large Language Models Under Real-World Settings

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Code reasoning tasks are becoming prevalent in large language model (LLM) assessments. Yet, there is a dearth of studies on the impact of real-world complexities on code reasoning, e.g., inter- or intra-procedural dependencies, API calls, deeply nested constructs, and non-primitive complex types. Evaluating LLMs under such simplistic settings poses a significant threat to assumptions about their generalizability in practice. To enable a more realistic evaluation of code reasoning, we construct a dataset of 1,200 reasoning problems from two sources: existing code reasoning benchmarks and popular GitHub Python repositories. Our pipeline leverages static and dynamic program analysis to automatically serialize/deserialize the compound, complex, and custom types abundant in real-world code, going far beyond the primitive types used in prior studies. A key feature of our dataset is categorizing each reasoning problem as Lower Complexity (LC) or Higher Complexity (HC) via a principled majority-vote mechanism over nine diverse and interpretable code-complexity metrics, yielding two well-separated, semantically meaningful categories of problem difficulty suitable for precise calibration of LLM reasoning ability. This categorization shows that the problems used in existing code-reasoning evaluations mostly belong to the LC category, failing to represent real-world complexity.


💡 Research Summary

The paper addresses a critical gap in the evaluation of large language models (LLMs) for code reasoning: most existing benchmarks rely on overly simplistic, often LLM‑generated programs that do not reflect the complexity of real‑world software. To remedy this, the authors construct RE2‑Bench, a dataset of 1,200 reasoning problems drawn from two sources: established code‑reasoning benchmarks (Avatar, ClassEval, CRUXEval, HumanEval) and popular Python repositories on GitHub. Using a combination of static analysis and dynamic execution, they extract dynamic slices (the method of interest together with all directly or indirectly invoked callees) and filter out duplicates, methods without inputs/outputs, and those with circular type dependencies. This pipeline yields 3,129 unique problems, of which 1,200 are selected for the benchmark (the “lite” version contains 500 problems).
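The static half of such a slice-extraction pipeline can be approximated with Python's `ast` module. The sketch below collects only the functions called directly inside a target function; the paper's pipeline additionally follows callees transitively and uses dynamic execution, which is not shown here.

```python
import ast

def direct_callees(source: str, func_name: str) -> set:
    """Collect names of functions called directly inside `func_name`.

    A minimal static step only; transitive callees and dynamically
    dispatched calls (e.g. methods resolved at runtime) are not handled.
    """
    tree = ast.parse(source)
    callees = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == func_name:
            for call in ast.walk(node):
                # Only plain-name calls like `helper(x)`; attribute calls
                # such as `obj.method()` would need extra handling.
                if isinstance(call, ast.Call) and isinstance(call.func, ast.Name):
                    callees.add(call.func.id)
    return callees

src = """
def helper(x):
    return x * 2

def target(a):
    return helper(a) + len(str(a))
"""
print(direct_callees(src, "target"))  # {'helper', 'len', 'str'}
```

Bundling `target` together with `helper` (and, recursively, anything `helper` calls) yields the self-contained "method of interest plus callees" unit the paper describes.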

To quantify problem difficulty, the authors define nine interpretable code‑complexity metrics: (M1) Cyclomatic Complexity, (M2) Compound Predicates, (M3) Nested Constructs, (M4) Structural Complexity (list/dict/set comprehensions, generators, lambdas, decorators, etc.), (M5) Third‑party API calls, (M6) Inter‑class Dependencies, (M7) Intra‑class Dependencies, (M8) Primitive Variable Count, and (M9) Complex Variable Count. They automatically tune thresholds for each metric and apply a majority‑vote mechanism to label each problem as Lower Complexity (LC) or Higher Complexity (HC). Silhouette analysis and the Davies‑Bouldin Index confirm a clear separation between the two clusters. The analysis shows that virtually all problems from prior benchmarks fall into the LC group, whereas real‑world projects contribute the majority of HC problems.
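The majority-vote labeling can be sketched as follows. The threshold values here are illustrative placeholders only; the paper tunes thresholds automatically per metric.

```python
# Hypothetical thresholds for the nine metrics (M1-M9); the paper derives
# these automatically, so the numbers below are illustrative only.
THRESHOLDS = {
    "cyclomatic": 5, "compound_predicates": 2, "nested_constructs": 2,
    "structural": 3, "third_party_calls": 1, "inter_class_deps": 1,
    "intra_class_deps": 2, "primitive_vars": 8, "complex_vars": 2,
}

def label_problem(metrics: dict) -> str:
    """Label a problem HC if a majority of the nine metrics
    exceed their (tuned) thresholds, else LC."""
    votes = sum(metrics[name] > t for name, t in THRESHOLDS.items())
    return "HC" if votes > len(THRESHOLDS) // 2 else "LC"

simple = {"cyclomatic": 2, "compound_predicates": 0, "nested_constructs": 1,
          "structural": 0, "third_party_calls": 0, "inter_class_deps": 0,
          "intra_class_deps": 0, "primitive_vars": 3, "complex_vars": 0}
print(label_problem(simple))  # LC
```

A benchmark-style problem with low values on every metric lands in LC, which is exactly the pattern the authors observe for prior benchmarks.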

A key engineering contribution is the automated serialization/deserialization of arbitrary Python objects into JSON for prompt construction. Unlike prior work that only handles primitive types, this pipeline can represent custom classes, nested dictionaries, and other complex structures, reducing ambiguity for the model. The authors also employ adaptive few‑shot examples to guide the LLMs in predicting variable values, and they evaluate predictions by actually executing the code rather than relying on string matching, thereby detecting false negatives that would be missed by static checks.
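A minimal sketch of such object-to-JSON serialization is shown below. It recursively descends into containers and falls back to an instance's attribute dictionary for custom classes; the paper's actual pipeline also performs deserialization and excludes problems with circular type dependencies, which this sketch only crudely guards against with a depth limit.

```python
import json

def to_jsonable(obj, max_depth: int = 10):
    """Recursively convert an object, including custom class instances,
    into JSON-compatible structures. Simplified sketch only."""
    if max_depth <= 0:
        raise ValueError("nesting too deep (possible circular reference)")
    if isinstance(obj, (str, int, float, bool)) or obj is None:
        return obj
    if isinstance(obj, (list, tuple, set)):
        return [to_jsonable(v, max_depth - 1) for v in obj]
    if isinstance(obj, dict):
        return {str(k): to_jsonable(v, max_depth - 1) for k, v in obj.items()}
    # Custom class instance: record its type and serialize its attributes,
    # so the prompt can unambiguously describe non-primitive values.
    return {"__class__": type(obj).__name__,
            "attrs": to_jsonable(vars(obj), max_depth - 1)}

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

print(json.dumps(to_jsonable({"pts": [Point(1, 2)]})))
```

Representing a value as `{"__class__": ..., "attrs": ...}` lets a prompt describe custom objects precisely, rather than relying on `repr` strings that the model must parse heuristically.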

The empirical study evaluates ten LLMs—including GPT‑4‑Turbo, Claude‑2, Llama‑2‑70B, and several "non‑reasoning" or low‑reasoning variants—across four code‑reasoning tasks: input prediction, output prediction, loop‑variable prediction, and branch‑decision prediction. When moving from LC to HC problems, average performance drops sharply: by 37.36% for input prediction, 36.16% for output prediction, 20.90% for loop‑variable prediction, and 48.60% for branch‑decision prediction. Detailed analysis reveals that nested constructs and compound predicates are the most detrimental features; longer call chains correlate with poorer input prediction, while forward reasoning (output prediction) is relatively more robust. Models configured with a "high‑reasoning" budget consistently outperform low‑reasoning or general‑purpose counterparts. Correlation analysis shows a moderate to strong negative relationship between each complexity metric and model performance.

Finally, the authors perform a systematic failure analysis, categorizing errors into 18 distinct failure types (e.g., serialization errors, missing callees, misinterpreted conditionals, API misuse). For each category they provide concrete examples and suggest remediation strategies, offering a roadmap for future LLM improvements.

In summary, the paper makes six major contributions: (1) a realistic benchmark (RE2‑Bench) that includes complex objects, long call chains, and third‑party API usage; (2) a principled, metric‑driven method for separating problems into LC and HC groups; (3) automated prompt generation that handles complex variable types; (4) fine‑grained evaluation metrics beyond overall accuracy; (5) a large‑scale empirical assessment of ten LLMs under realistic conditions; and (6) a taxonomy of 18 reasoning failure modes with case studies. By exposing the substantial performance gap between simple and real‑world code, the work cautions against over‑optimistic claims based on existing benchmarks and provides the community with tools and insights necessary to develop more robust code‑reasoning LLMs.

