Decomposing Reasoning Efficiency in Large Language Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Large language models trained for reasoning trade off inference tokens against accuracy, yet standard evaluations report only final accuracy, obscuring where tokens are spent or wasted. We introduce a trace-optional framework that decomposes token efficiency into interpretable factors: completion under a fixed token budget (avoiding truncation), conditional correctness given completion, and verbosity (token usage). When benchmark metadata provides per-instance workload proxies, we further factor verbosity into two components: mean verbalization overhead (tokens per work unit) and a coupling coefficient capturing how overhead scales with task workload. When reasoning traces are available, we add deterministic trace-quality measures (grounding, repetition, prompt copying) to separate degenerate looping from verbose-but-engaged reasoning, avoiding human labeling and LLM judges. Evaluating 25 models on CogniLoad, we find that accuracy and token-efficiency rankings diverge (Spearman $ρ=0.63$), efficiency gaps are often driven by conditional correctness, and verbalization overhead varies by about 9 times (only weakly related to model scale). Our decomposition reveals distinct bottleneck profiles that suggest different efficiency interventions.

💡 Research Summary

The paper tackles a critical blind spot in the evaluation of large language models (LLMs): while most benchmarks report only final accuracy, they ignore how many inference tokens are spent to achieve that accuracy. To make token‑efficiency transparent, the authors propose a hierarchical, trace‑optional decomposition framework that isolates where tokens are useful and where they are wasted.

1. Outcome‑Level Decomposition
Only three outcome variables are required per instance: a success flag (correct answer), the total number of generated tokens, and a binary flag indicating whether the model hit a predefined token budget (truncation). From these, they define two probabilities: r_ctx (the probability of avoiding budget truncation) and r_logic (the probability of being correct given no truncation). Token efficiency is defined as E₀ = 1000·P(S)/E
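The two outcome-level probabilities can be estimated directly from the three per-instance variables. The following is a minimal sketch, not the authors' implementation: the `Outcome` record and the `decompose` function are hypothetical names, and it assumes that a truncated run is always scored as a failure, so that overall success factors as P(S) = r_ctx · r_logic. It stops at the two rates, the success rate, and mean token usage, and does not attempt the efficiency score itself.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Outcome:
    """Hypothetical per-instance record holding the three outcome variables."""
    correct: bool     # success flag (answer was correct)
    tokens: int       # total number of generated tokens
    truncated: bool   # True if the run hit the token budget


def decompose(outcomes: List[Outcome]) -> Dict[str, float]:
    """Estimate r_ctx, r_logic, P(S), and mean token usage from outcomes.

    Assumption (not stated in the summary): truncated runs count as failures.
    """
    if not outcomes:
        raise ValueError("need at least one outcome")

    n = len(outcomes)
    completed = [o for o in outcomes if not o.truncated]

    # r_ctx: empirical probability of avoiding budget truncation.
    r_ctx = len(completed) / n

    # r_logic: empirical probability of being correct given no truncation.
    r_logic = (
        sum(o.correct for o in completed) / len(completed) if completed else 0.0
    )

    # P(S): overall success rate, with truncated runs scored as failures.
    p_success = sum(o.correct and not o.truncated for o in outcomes) / n

    mean_tokens = sum(o.tokens for o in outcomes) / n

    return {
        "r_ctx": r_ctx,
        "r_logic": r_logic,
        "p_success": p_success,
        "mean_tokens": mean_tokens,
    }


# Toy data for illustration only; these are not results from the paper.
toy = [
    Outcome(correct=True, tokens=800, truncated=False),
    Outcome(correct=False, tokens=1200, truncated=False),
    Outcome(correct=True, tokens=950, truncated=False),
    Outcome(correct=False, tokens=4096, truncated=True),
    Outcome(correct=True, tokens=600, truncated=False),
]
stats = decompose(toy)
print(stats)
```

Under the stated assumption, `stats["p_success"]` matches `stats["r_ctx"] * stats["r_logic"]` up to floating-point rounding. This lets a low success rate be attributed to truncation (low r_ctx), to reasoning errors on completed runs (low r_logic), or to both.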
