X-RAY: Mapping LLM Reasoning Capability via Formalized and Calibrated Probes


Large language models (LLMs) achieve promising performance, yet their ability to reason remains poorly understood. Existing evaluations largely emphasize task-level accuracy, often conflating pattern matching with reasoning capability. We present X-RAY, an explainable reasoning analysis system that maps LLM reasoning capability using calibrated, formally verified probes. We model reasoning capability as a function of extractable *structure*, operationalized through formal properties such as constraint interaction, reasoning depth, and solution-space geometry. X-RAY generates probes via formal tools with controlled structural variations, enabling precise isolation of incremental structural information through formal calibration and verification. We evaluate state-of-the-art LLMs on problems ranging from junior-level to advanced in mathematics, physics, and chemistry. Our analysis reveals a systematic asymmetry in LLM reasoning: models are relatively robust to constraint refinement, where additional conditions shrink an existing solution space, but degrade sharply under solution-space restructuring, where modifications alter the underlying structural form of the solution manifold. Moreover, calibrated formal probes differentiate models that appear indistinguishable on standard benchmarks and reveal failure modes that are structurally interpretable rather than opaque. Beyond evaluation, our framework is contamination-free and supports the training and testing of reasoning models.


💡 Research Summary

The paper introduces X‑RAY (eXplainable Reasoning Analysis sYstem), a novel framework for measuring and interpreting the reasoning capabilities of large language models (LLMs) through formally verified and calibrated probes. The authors argue that traditional reasoning benchmarks conflate pattern‑matching with genuine structured reasoning, offering only aggregate accuracy scores that mask underlying weaknesses. To address this, they define “extractable structure” as the set of formal properties a model must uncover and manipulate to solve a problem, specifically focusing on three dimensions: constraint interaction (how multiple conditions combine), reasoning depth (the number of logical steps required), and solution‑space geometry (the shape and dimensionality of the set of valid solutions).
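These three dimensions can be pictured as coordinates of a probe. A minimal sketch in Python, where the class and field names are illustrative assumptions rather than the paper's own notation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProbeStructure:
    """Hypothetical encoding of the structural dimensions X-RAY varies.

    Field names are illustrative assumptions, not the paper's notation.
    """
    constraint_interaction: int  # how many conditions are mutually coupled
    reasoning_depth: int         # logical steps required to reach the answer
    solution_dim: int            # dimensionality of the valid-solution set

# Two probes sharing a logical skeleton but differing in one dimension only,
# which is the kind of controlled variation the framework relies on:
shallow = ProbeStructure(constraint_interaction=2, reasoning_depth=2, solution_dim=1)
deep = ProbeStructure(constraint_interaction=2, reasoning_depth=6, solution_dim=1)
```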

X‑RAY’s pipeline consists of two main stages. First, an automatic formalizer translates natural‑language problems into explicit constraint systems (e.g., SMT formulas) with clearly identified variables and constraints. Second, a calibration and verification stage systematically varies problem parameters—such as the range of target values, the number of constraints, or the dimensionality of the solution manifold—while preserving the underlying logical skeleton. Each generated instance is checked for correctness using an SMT solver (Z3), guaranteeing that the ground truth is noise‑free and that any performance differences can be attributed to the controlled structural manipulations rather than annotation errors or dataset contamination.
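The calibration idea can be sketched in plain Python. Here brute-force enumeration over a small integer grid stands in for the paper's Z3-based verification, and the linear-inequality skeleton and function names are illustrative assumptions, not the actual formalizer:

```python
import itertools
import random

def generate_probe(num_constraints, domain=range(10), seed=0):
    """Generate one probe instance: a random conjunction of linear
    inequalities a*x + b*y <= c, plus its exact solution set.

    Exhaustive enumeration stands in for the paper's SMT (Z3) check,
    so the ground truth is machine-verified rather than hand-annotated.
    """
    rng = random.Random(seed)
    constraints = [
        (rng.randint(1, 3), rng.randint(1, 3), rng.randint(5, 20))
        for _ in range(num_constraints)
    ]
    solutions = {
        (x, y)
        for x, y in itertools.product(domain, repeat=2)
        if all(a * x + b * y <= c for a, b, c in constraints)
    }
    return constraints, solutions

# Calibration sweep: the structural parameter (constraint count) varies
# while the logical skeleton (a conjunction of inequalities) stays fixed.
# With a fixed seed, each larger system extends the smaller one, so the
# solution sets are nested.
_, loose = generate_probe(1, seed=42)
_, tight = generate_probe(4, seed=42)
```

Because only one structural parameter changes per sweep, any drop in model accuracy can be attributed to that parameter, which is the point of the calibration stage.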

The authors apply this methodology across three domains—mathematics (stamp‑coverage and N‑primable numbers), physics (impulse calculation in a collision), and chemistry—covering junior to advanced difficulty levels. For each domain they produce thousands of randomized instances, evaluate state‑of‑the‑art models (GPT‑4o and o4‑mini) ten times per instance, and plot success‑rate surfaces over the calibrated structural parameters.
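The aggregation step might look like the following sketch, where the record format (`depth`, `n_constraints`, `correct`) is a hypothetical stand-in for the paper's calibrated parameters:

```python
from collections import defaultdict
from statistics import mean

def success_surface(results):
    """Aggregate per-trial outcomes into a success-rate surface over the
    calibrated structural parameters (record format is hypothetical)."""
    by_cell = defaultdict(list)
    for r in results:
        by_cell[(r["depth"], r["n_constraints"])].append(r["correct"])
    return {cell: mean(map(int, trials)) for cell, trials in by_cell.items()}

# e.g. ten trials of one instance, as in the paper's protocol:
demo = [{"depth": 2, "n_constraints": 3, "correct": i < 7} for i in range(10)]
surface = success_surface(demo)
# surface[(2, 3)] == 0.7
```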

Two key empirical findings emerge. (1) Constraint refinement robustness: when additional constraints merely shrink an existing solution space without altering its fundamental representation, models retain relatively high performance. This suggests that LLMs can effectively filter an already-known set of solutions when faced with extra conditions. (2) Solution-space restructuring fragility: when modifications require a re-organization of the solution manifold—changing the underlying geometry or representation—performance drops sharply. GPT-4o, in particular, exhibits a steep decline once the task demands non-trivial restructuring, whereas o4-mini degrades more gradually, showing lower sensitivity to restructuring overall. This qualitative gap in structural reasoning capacity is invisible to aggregate accuracy metrics.
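The asymmetry can be made concrete with a toy constraint system, again using brute-force enumeration as a stand-in for formal solving (the domain and constraints here are illustrative, not drawn from the paper):

```python
import itertools

DOMAIN = range(21)

def solve(pred):
    """Brute-force the solution set of a predicate over a 2-D integer grid
    (a stand-in for the paper's SMT-based verification)."""
    return {(x, y) for x, y in itertools.product(DOMAIN, repeat=2) if pred(x, y)}

base = solve(lambda x, y: x + y <= 12)

# Constraint refinement: an extra condition shrinks the same polytope.
# The solution set stays a single convex region, only smaller.
refined = solve(lambda x, y: x + y <= 12 and x >= 4)

# Solution-space restructuring: a modular condition shatters the region
# into disconnected stripes -- the geometry, not just the size, changes.
restructured = solve(lambda x, y: x + y <= 12 and (x - y) % 5 == 0)
```

In the paper's terms, models handle the `refined` kind of change far better than the `restructured` kind, even when both reduce the solution set by a comparable amount.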

Because the probes are generated programmatically and verified formally, X‑RAY is contamination‑free, eliminating the common pitfalls of data leakage and noisy labels. Moreover, the fine‑grained, step‑wise verification enables the creation of curricula for model fine‑tuning: intermediate results (e.g., variable assignments, sub‑expressions) can be used as supervised signals to strengthen the specific reasoning operations that fail under restructuring. This dual role—as a more discriminating benchmark and as a training substrate—offers a concrete path toward building LLMs that truly understand and manipulate underlying problem structure rather than relying on surface pattern memorization.
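As one way such step-wise supervision could be assembled (the pairing format below is a guess at what "intermediate results as supervised signals" might look like, not the paper's recipe):

```python
def step_supervision(assignments):
    """Turn a verified chain of intermediate assignments into
    (context, target) pairs for supervised fine-tuning.

    The pairing format is a hypothetical sketch, not the paper's method.
    """
    pairs = []
    context = []
    for var, value in assignments:
        prompt = "; ".join(context) or "start"
        pairs.append((prompt, f"{var} = {value}"))
        context.append(f"{var} = {value}")
    return pairs

# A verified derivation chain yields one training pair per reasoning step:
pairs = step_supervision([("x", 3), ("y", 7), ("x+y", 10)])
```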

In summary, X‑RAY provides a rigorous, extensible, and interpretable framework for mapping LLM reasoning capability along explicit structural dimensions. By exposing asymmetries between constraint refinement and solution‑space restructuring, the work reveals fundamental limitations of current models and proposes a principled methodology for both evaluating and improving structured reasoning in future LLM systems.

