Reading time: 30 minutes

📝 Original Info

  • Title:
  • ArXiv ID: 2512.20595
  • Date:
  • Authors: Unknown

📝 Abstract

We introduce Cube Bench, a Rubik's-cube benchmark for evaluating spatial and sequential reasoning in multimodal large language models (MLLMs). The benchmark decomposes performance into five skills: (i) reconstructing cube faces from images and text, (ii) choosing the optimal next move, (iii) predicting the outcome of a candidate move without applying it, (iv) executing multi-step plans while recovering from mistakes, and (v) detecting and revising one's own errors. Using a shared set of scrambled cube states, identical prompts and parsers, and a single distance-to-solved metric, we compare recent MLLMs side by side as a function of scramble depth. Across seven MLLMs, accuracy drops sharply with depth; once a trajectory stalls or diverges, models rarely recover, and high face-reconstruction accuracy does not guarantee competent action selection or multi-step execution. A pronounced closed- vs. open-source gap emerges: the strongest closed model leads on both single-step perception tasks and multi-step control tasks, while open-weight models cluster near chance on the hardest settings; yet even the best MLLM degrades at higher cube complexity. A simple self-correction step via reflective thinking yields modest gains but can also introduce overthinking. Cube Bench offers a compact, reproducible probe of sequential spatial reasoning in MLLMs.

📄 Full Content

Multimodal large language models (MLLMs) now match or surpass prior systems on static perception benchmarks, i.e., recognizing objects, transcribing text in images, and answering short questions. However, practical deployments require models that can act interactively: taking actions, observing consequences, and staying coherent across many steps (closed-loop control). In such settings, small errors compound and recovery is rare; single-shot vision tests provide little signal about whether a model can plan, act, and correct itself. Surveys of multimodal benchmarks highlight this mismatch between static leaderboards and the demands of interactive use [13], and recent web-based agent evaluations similarly report that strong snapshot performance does not directly translate to reliable multi-step behaviour [10,31]. Classical work on sequential decision making has long emphasized how compounding errors degrade long-horizon performance [5]. The code is available at https://github.com/dana-23/cube-bench.

Existing visual and spatial reasoning benchmarks mostly operate in this static regime. Synthetic datasets such as CLEVR, GQA, and related visual question answering benchmarks probe fine-grained spatial relations, counting, and logical composition from a single image, but typically stop at one-shot answers about a fixed scene. More recent multimodal leaderboards continue this pattern, emphasizing captioning, OCR, and short-form VQA on web images [13]. These settings test local spatial understanding, yet they rarely couple perception to an explicit dynamics model, track progress toward a goal, or require models to maintain a consistent internal state over multiple actions. At the other extreme, rich agentic environments on the web or in 3D scenes [10,31] introduce realistic noise and partial observability, but success metrics are often coarse, and it is difficult to disentangle failures of perception, planning, and control. We therefore lack a compact, generator-based domain that is simultaneously spatial and sequential, fully observed, exactly solvable, and regenerable on demand with strict fairness controls and oracle progress signals, making it possible to localize failures within a decision loop rather than aggregate them into a single static score.

We introduce Cube Bench, a compact and fully controllable generator-based benchmark for studying long-horizon decision making under complete state information (see Figure 1). Conceptually, Cube Bench instantiates a simple TEA loop, i.e., Transition → Evaluation → Argmin (Action): the cube dynamics define an exact transition model, oracle distance provides a scalar evaluation of each state, and the model's role is to choose actions that (ideally) minimize distance to the goal at each step. Cube Bench uses the Rubik's Cube as a testbed: it is visually simple, easy to simulate exactly, and combinatorially rich, with known optimal and near-optimal solvers that tell us precisely how many moves remain to reach the goal [11,21]. Unlike static image corpora, this lets us generate unlimited episodes, render both an image and a concise text description of each state, present a small menu of candidate moves, and compute exact feedback on how much each move improves or worsens the situation.

On top of this testbed we define seven tests that target different stages of the see → evaluate → act → reflect → recover loop: (i) reconstructing the full state from vision; (ii) checking consistency between vision and text; (iii) choosing the next move from a short list; (iv) judging whether a move helps or harms progress; (v) executing a sequence of moves to reach the goal; (vi) using self-reflection to revise an earlier choice; and (vii) trying to recover after the first error. Our framework enforces strict fairness controls, yielding measurements that are easy to replicate, difficult to game, and tailored to disentangle which stage of the TEA loop is failing.

Evaluating strong open- and closed-source MLLMs under identical rules reveals a consistent picture. First, seeing is not reasoning: models that parse cube faces well still struggle to sustain improvement over multiple steps. Second, feedback breaks coherence: when outputs are fed back as inputs, error rates rise sharply with scramble depth, and recovery after a wrong move is rare. Third, reflection helps but is fragile: structured, label-aware reflection can repair some mistakes, yet unconstrained reflection can induce overthinking and instability [16,23,30]. Together, these results pinpoint lingering weaknesses in sequential spatial reasoning, i.e., state tracking and action evaluation, that are not revealed by current single-step benchmarks or coarse web-agent success rates.

Static perception benchmarks for MLLMs. A large body of work evaluates multimodal LLMs on single-shot perception tasks such as classification, detection, captioning, VQA, OCR, and charts, e.g., ImageNet, COCO, CLEVR, GQA, TextCaps, and ChartQA [7,8,14,17,22,24]. Recent surveys synthesize these trends and note rapid leaderboard gains driven by scaling and instruction tuning [13]. Classic perception datasets (e.g., VQAv2) emphasize recognition under short contexts and have pushed models toward improved visual grounding and bias control [5]; ScienceQA further probes multimodal reasoning with thought chains [15]. However, these settings rarely require closed-loop decision making, where actions alter the state and errors can compound across steps. Our benchmark targets this gap by measuring whether models can sustain improvements over multiple actions, given precise state feedback.

Interactive / agent benchmarks. Web-scale agent environments assess language-to-action competence in long-horizon tasks (navigation, form-filling, retrieval, tool use) [10,31]. While such benchmarks capture realistic challenges (dynamic pages, delayed rewards, partial observability), they introduce confounds (non-deterministic layouts, fragile tool chains, and evaluation noise) that complicate causal analysis of failures. We instead adopt a compact, fully controllable domain with deterministic transitions and exact progress signals, allowing us to isolate where breakdowns arise (perception, evaluation, selection, or reflection) without web-induced variance.

Reasoning and control in structured domains. Structured puzzles and games offer precise state representations, small action alphabets, and well-defined optimality notions. For the Rubik's Cube, shortest-path structure and the diameter ("God's number") enable oracle feedback about optimal move distance [11,21]. Learning-to-search systems such as DeepCubeA demonstrate scalable RL+search on the cube and related combinatorial puzzles [1]. We repurpose these properties to build an evaluation, not a solver, so perception and control can be disentangled and measured reproducibly.

Rubik's-cube with VLM/LLM and embodied control. Recent cube-focused VLM/LLM efforts mix perception, planning, and tooling (e.g., CubeRobot, VisionCube, Auto-RubikAI), while earlier robotics work (Dactyl) highlights actuation confounds [3,19,28,29]. Our setting keeps the environment compact and oracle-scored to avoid attribution ambiguity.

Self-reflection and tool-augmented prompting. Interleaving reasoning and acting (or adding self-critique) can improve sample efficiency but may also induce instability or "overthinking." Representative approaches include Reflexion [23], Self-Refine [16], ReAct [30], and analyses of reflection effects [20]. Our experiments use a constrained, label-aware reflection protocol to quantify when reflection repairs errors and when it degrades coherence, under identical rules across models.

Positioning. Compared to perception-centric benchmarks [5,7,8,13,14,17,22,24] and open-web agent suites [10,31], our benchmark provides (i) deterministic generation with paired image and text state, (ii) strict I/O with a small, discrete action set, (iii) fairness controls that remove answer-position bias, and (iv) exact oracle progress signals grounded in optimal-move distance [11,21]. This design yields compact, reproducible measurements of closed-loop competence and exposes where modern MLLMs fail to maintain multi-step coherence, even when their single-step perception appears strong.

Cube Bench is a compact, fully controllable benchmark for long-horizon decision making, using the Rubik’s Cube as a testbed. The cube gives us a visually simple but combinatorially rich environment with an exact optimal solver: for any state we know how many moves remain to reach the goal. This lets us study the full see → evaluate → act → reflect → recover loop under complete state information, without web-scale noise or ambiguous evaluation metrics.

Cube Bench is simulator-based and is generated on-the-fly rather than shipping a fixed dataset. This keeps the benchmark lightweight to distribute and extend while still making results exactly reproducible across models. For a target distance d (number of optimal moves from solved) and episode index i, the VirtualCube simulator deterministically emits: (i) a rendered image, (ii) an authoritative textual state, and (iii) four candidate actions (A-D). We publish per-depth seed lists; episode i uses seed = i, fixing the scramble s_{d,i}, the teacher inverse plan π_teach, and the oracle distances shared across models. Oracle distance d(s) is computed once via IDA* with pattern databases [11]; unless stated otherwise, we use the standard face-turn / half-turn move metric (FTM/HTM), and we use the same metric for both scrambling and evaluation. To avoid answer-position bias, we enforce that the correct option is approximately uniform over A-D at each depth; the harness tracks the per-slot counts, discards duplicate items, and regenerates episodes until the distribution is near-uniform. Model outputs must follow strict one-line formats; anything else is treated as an error. Cube Bench is test-only: we release prompts, seed lists, and quality-control thresholds/methods in the Supplementary Material for reproducibility.
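A minimal sketch of this determinism contract, with all names (FACE_TURNS, invert_move, the episode dictionary) as our own illustrative stand-ins rather than the released generator's API; the real generator additionally verifies that the scramble lands at oracle distance d, which is omitted here:

```python
# Illustrative sketch of per-episode determinism: seed = episode index fixes
# the scramble and the teacher (inverse-scramble) plan.
import random

FACE_TURNS = [face + suffix for face in "URFDLB" for suffix in ("", "'", "2")]

def invert_move(move: str) -> str:
    """R -> R', R' -> R, R2 -> R2 (quarter/half face turns)."""
    if move.endswith("'"):
        return move[:-1]
    if move.endswith("2"):
        return move
    return move + "'"

def make_episode(depth: int, episode_index: int) -> dict:
    rng = random.Random(episode_index)                  # seed = episode index
    scramble = [rng.choice(FACE_TURNS) for _ in range(depth)]
    teacher_plan = [invert_move(m) for m in reversed(scramble)]  # inverse scramble
    # NOTE: the real generator also checks that the oracle distance of the
    # scrambled state equals `depth` and resamples otherwise.
    return {"seed": episode_index, "depth": depth,
            "scramble": scramble, "teacher_plan": teacher_plan}

# Re-running with the same (depth, index) reproduces the same episode exactly.
assert make_episode(3, 7) == make_episode(3, 7)
```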

We evaluate seven tests that instantiate the see → evaluate → act → reflect → recover steps. Each episode is constructed from a published seed list at depth d and exposes (i) an explicit text state (canonical cube notation), (ii) a paired image for realism, and (iii) four candidate actions A-D (moves), with exactly one correct closed-form label per test instance. Text is the source of truth; images never override the textual state. We use a single move metric d(•) (e.g., half-turn or face-turn metric) consistently for scrambling and scoring. Next, we introduce the evaluated tasks:

Cube Face Reconstruction. This test measures raw visual parsing of the cube without help from text or action selection. From a seed at depth d ∈ {1, 2, 3}, VirtualCube renders a fixed-pose cube-net image (see Supplementary for an example image); the center of the front face is labeled Front. The model must return a strict 3 × 3 grid for the front face only, using color initials in {W,Y,R,O,G,B} row-by-row; no textual state is provided and free-form prose is disallowed. Primary metrics are Element-wise Acc. (per-sticker on the front face) and Matrix Acc. (share of items with the exact 3 × 3 grid), reported per depth with Wilson 95% CIs. For the move-prediction test, primary metrics are Accuracy (Top-1) and Parse (percentage of outputs in the allowed format), reported per modality with matched item counts.

Reflection-Guided Re-Answering. This test measures whether a structured self-check can repair wrong answers without destabilizing correct ones. Using the same Optimal Move Prediction items, we take the model's initial choice and run a single reflection pass under two regimes: Guided (Unredacted), which reveals the correct option during reflection to estimate an oracle upper bound, and Guided (Redacted), which hides labels (default). Prompts, parsing, and item balancing mirror Optimal Move Prediction; decoding uses zero temperature, deterministic option shuffling, and strict A-D parsing. Primary metrics, reported for both Unredacted and Redacted settings, are Initial Acc., Final Acc., ∆ = Final − Initial, EFR (percentage of answers initially incorrect but corrected after reflection), and OTR (percentage of answers initially correct but flipped to incorrect after reflection).

Step-by-Step (act under feedback). This test measures whether a model can maintain a monotonic path when its outputs feed back into inputs. An episode starts from s_1 at optimal-move distance d from solved. At each step t ∈ {1, . . . , d} the model selects one action from {A-D}. We execute the chosen move, obtain s_{t+1}, and recompute d(s_{t+1}). Progress is recorded when a_t belongs to the optimal action set A⋆_t = {a : d(s_{t+1}) is minimal}. The rollout stops at the first non-progress step (no a_t ∈ A⋆_t) or on any parse/format violation; solved episodes terminate immediately on reaching distance 0. Primary metrics are TA% (the number of progress steps before termination divided by the d-step horizon, × 100, averaged over episodes) and Perfect-Solve % (share of episodes solved before termination).
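Under the reading of TA% given above (leading run of progress steps over the d-step horizon, × 100), the scoring can be sketched as follows; the field names are illustrative, not the benchmark's actual log format:

```python
# Sketch of Step-by-Step scoring from per-episode rollout logs.
from statistics import mean

def episode_ta(progress_flags: list[bool], depth: int) -> float:
    """progress_flags[t] is True iff a_t was in the optimal action set A*_t.
    The rollout stops at the first failure, so only the leading run counts."""
    steps = 0
    for ok in progress_flags:
        if not ok:
            break
        steps += 1
    return 100.0 * steps / depth

def aggregate(episodes: list[dict]) -> dict:
    """episodes: [{'progress_flags': [...], 'depth': d, 'solved': bool}, ...]"""
    return {
        "TA%": mean(episode_ta(e["progress_flags"], e["depth"]) for e in episodes),
        "Perfect-Solve%": 100.0 * sum(e["solved"] for e in episodes) / len(episodes),
    }
```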

Causal Move-Effect (pre-action evaluation). This test isolates the model's ability to evaluate action consequences independent of action selection. The MLLM is given the current state and a specific candidate move and must predict the causal label in {DECREASE, NO CHANGE, INCREASE} for the change in optimal distance d(•) prior to move execution. Labels are computed by applying the candidate move in the simulator and comparing d(s_{t+1}) to d(s_t) under the same metric used for scrambling. Primary metrics are Micro-Accuracy, Macro-F1 over the 3 classes, and Cohen's κ (chance-corrected agreement given empirical marginals).
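A minimal sketch of this labelling rule, with apply_move and oracle_distance as hypothetical stand-ins for the simulator and the IDA* oracle:

```python
# Ground-truth Move-Effect label: apply the candidate move and compare
# oracle distances before and after, under the same FTM/HTM metric.
def move_effect_label(state, move, apply_move, oracle_distance) -> str:
    d_before = oracle_distance(state)
    d_after = oracle_distance(apply_move(state, move))
    if d_after < d_before:
        return "DECREASE"
    if d_after > d_before:
        return "INCREASE"
    return "NO CHANGE"
```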

Learning-Curve / Recovery (post-error). This test measures whether a model can repair trajectories once it has deviated from optimality. Starting from the first Closed-Loop failure point (first non-progress step or parse error), we continue from the resulting post-error state for at most max attempts additional decisions. At each attempt the model again chooses from {A-D}; we always execute its choice (even if non-optimal) to test whether it can re-enter a monotonic regime and eventually solve the puzzle. The episode ends on solve or when the attempt budget is exhausted. Primary metrics are Solve Rate (SR) with Wilson 95% CI; early recovery P(1) and P(≤ 3) (probability of solving within 1 or 3 attempts); Med@Solved (median attempts among solved episodes); Avg@All (mean attempts over all episodes, counting unsolved as max attempts). Exact d-dependent horizons and max attempts are provided in the Supplement.

Episodes and general quality control (QC) follow what was stated earlier in this section. Task-specific knobs are: (i) Closed-Loop uses the same turn metric as scrambling, terminates at the first non-progress or parse failure (unconditional denominators), and enforces one-token A-D outputs. (ii) Move-Effect uses the label set {DECREASE, NO CHANGE, INCREASE}. (iii) Learning-Curve starts from the post-error state and grants a fixed budget of max attempts. Baselines use deterministic decoding (temperature=0); any deviations are declared in §4.

Models & Inference Setup. We evaluate a range of open-source MLLMs, including Gemma 3-27B [25], GLM-4.5V [27], Llama 4-Scout-17B:16E [18], Qwen2.5-VL-7B/32B [2], Qwen3-VL-30B-A3B [26], and one API model, Gemini 2.5 Pro [4]. For engine and decoding, we use vLLM [12] with bf16 and a temperature of 0 (i.e., greedy decoding) unless specified otherwise, with reflection variants specifying their own decoding. Request-level batching is used to optimize throughput. Our hardware setup consists of NVIDIA A100-80GB GPUs. To ensure reproducibility, we use identical prompts for each task and modality, publish seed lists for each depth, and maintain a fixed turn metric (HTM/FTM) per run. Parsing is strict, requiring one-line outputs formatted as A-D, Yes|No, or DECREASE|NO CHANGE|INCREASE, with any invalid lines counted as errors. For sampling, the per-condition N is fixed, and we reuse the same episode set across models to facilitate paired comparisons.
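For concreteness, a greedy-decoding setup along these lines might look as follows. This is a text-only sketch: the actual harness also passes the rendered cube image, and the model id and token budget here are placeholders, not the paper's configuration:

```python
# Illustrative vLLM setup for deterministic (greedy) decoding.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-VL-7B-Instruct", dtype="bfloat16")
greedy = SamplingParams(temperature=0.0, max_tokens=16)   # temperature 0 => greedy

prompts = ["<task prompt with the textual cube state and options A-D>"]
outputs = llm.generate(prompts, greedy)                   # request-level batching
answer_line = outputs[0].outputs[0].text.strip()          # must be one strict line
```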

In Table 1 (left), by d=3 Gemini 2.5 Pro retains high per-sticker fidelity (96.8%) with a moderate drop in exact 3 × 3 matches (84.0%), whereas Qwen 2.5-VL-32B declines more sharply (71.3% / 19.0%), indicating weaker global binding. The gap between element-wise and matrix accuracy widens with depth across models, consistent with small edge misplacements compounding into fullface mismatches. Complementing this, Cross-Modal Verification at fixed d=5 (Table 1, right) shows that models with near-perfect parsing still differ markedly in discrimination: Qwen3 reaches Bal. 84.00%, Qwen 2.5-VL-32B 81.00%, Qwen 2.5-VL-7B 71.30%, GLM 95.00%, and Gemini 100.00%. Despite Parse rates ≈100% for the top models, residual label bias persists (52-93%); for instance, Llama 4-Scout-17B:16E shows 93% bias, suggesting a tendency to default to Yes when uncertain.

Takeaways: Face Reconstruction establishes the vision floor: depth stresses spatial coherence beyond token recognition, and exact 3 × 3 matches are the first to fail. Verification then serves as a high-precision gate on cross-modal grounding, revealing discrimination gaps and label-selection biases even when parsing is flawless. We therefore place Verification before move selection or closed-loop tests and condition downstream analyses, Move Prediction and Closed-Loop Recovery, on passing Reconstruction, while reporting both element-wise and matrix accuracy to separate recognition from spatial arrangement.

We therefore emphasize κ (reporting Micro-Accuracy, the chance-agreement term q·π, and Macro-F1 for transparency) when comparing depths, and interpret Qwen-2.5-32B's small κ gains as prior-alignment effects rather than substantive improvements in cube reasoning.

Models degrade with depth at varying rates. From Fig. 2, Gemini 2.5 Pro starts strong in the green band at d=1 and then declines steeply, falling from 90% to 10% TA. Open-weight models cluster in the red band by d ≥ 3: Qwen-2.5-VL-7B (26% → 6%), Qwen-2.5-VL-32B (20% → 7.5%), Qwen3-VL-30B (22% → 9%), GLM-4.5V (20% → 9%), and Gemma-3-27B (20% → 7%).

Takeaways: The results highlight several critical weaknesses in current models:

(1) Sequential reasoning collapses with depth: all models, regardless of size, show a severe collapse in performance as the number of sequential steps (depth d) increases, demonstrating a fundamental fragility in tasks requiring long-horizon, closed-loop reasoning.

(2) Even frontier models are brittle: while Gemini 2.5 Pro is the clear leader, starting at 90% Teacher-Adherence (TA) at d = 1, its performance is not robust; it suffers a massive drop to 10% TA, indicating that even the most advanced models are highly sensitive to sequential task length.

(3) A significant gap separates frontier and open models: Gemini 2.5 Pro (the green band) starts with 3-4× higher adherence than the cluster of open-weight models (the red band), which only achieve 20-26% TA at d = 1 and fall to near-failure levels (6-9% TA) by d ≥ 3.

(4) Planning, not just perception, is the bottleneck: models with decent perception (like GLM-4.5V) still decline sharply, indicating the failure lies not just in perception but in the higher-level task of planning and making correct sequential selections.

All episodes start at d=3; after the first deviation we allow six attempts. As shown in Table 5, Gemini 2.5 Pro leads (SR 40%, P(≤3) = 0.40, Med@Solved 3, Avg@All 4.8). Qwen 2.5-VL-7B is modest (SR 8%, Med@Solved 4.5, Avg@All 5.88). Gemma 3-27B (SR 4%, Med@Solved 3, Avg@All 5.88) and Llama 4-Scout-17B:16E (SR 6%, Med@Solved 6, Avg@All 5.98) recover rarely; the latter typically only at the budget limit. GLM-4.5V is the lowest (SR 2%, Med@Solved 2, Avg@All 5.92). Qwen 2.5-VL-32B does not recover within budget (SR 0%, Avg@All 6.00). No system achieves immediate bounce-back (P(1) = 0 for all). CIs are reported in Table 5.

Table 5. Learning-curve leaderboard. SR = solve rate; CI = 95% Wilson interval; P(1) and P(≤3) are fractions solved within 1 and 3 attempts; Med@Solved = median attempts (solved only); Avg@All = average attempts counting failures as max attempts.

Takeaways: Recovery from d=3 is non-trivial: most error-bearing episodes drift to the attempt limit rather than re-enter a solving trajectory. Gemini 2.5 Pro is comparatively strong yet still leaves 60% of episodes unsolved within six attempts. The rest exhibit fragile local repair: GLM-4.5V is fast when it finds a fix (low Med@Solved) but almost never does (2% SR), while Llama 4-Scout-17B:16E only succeeds at the ceiling (6-attempt median) with no early recoveries. Combined with the adherence results, this suggests: (i) one-step skill does not translate into post-error control, (ii) immediate local repair is weak (no P(1) successes), and (iii) effective recovery likely requires explicit mechanisms: short rollouts or self-checks before acting (TEA "Evaluation"), selective abstention/deferral, and probing deeper attempt budgets to locate saturation points.

We briefly probe two additional axes to check that our conclusions are not artefacts of a particular protocol choice: abstention-aware control in the step-by-step test, and reflection regimes with vs. without label leakage. We also summarize fairness diagnostics for the Move-Effect sampler.

In the Step-by-Step test, we optionally allow an explicit "I don't know" (IDK) action, mapped to a no-operation policy in the environment [9] (teacher on abstain: apply the teacher move so the episode length is preserved). Primary metrics (TA% and Perfect-Solves%) remain unchanged and keep unconditional denominators: IDK receives no credit and counts as incorrect when computing adherence. Abstention can therefore prevent a bad move but cannot inflate TA%.

To understand how models use abstention, we additionally report coverage (answered / total), selective accuracy (correct / answered), and APA, an abstention-penalized accuracy with weight λ = 0.25 on IDK. Table 6 shows a small slice of this analysis at d = 5.

Gemini shows the highest precision when it chooses to act (Sel. Acc 28.6%) but at the cost of lower coverage (87.5%) driven by explicit abstention (IDK 12.5%); with λ = 0.25 this yields the best penalty-aware score (APA 0.28). Qwen2.5-VL-32B and Gemma-3 almost never abstain (IDK 0%), maintain near-total coverage (99.3-100%), but achieve lower selective accuracy (23.7% and 17.3%) and lower APA (0.24 and 0.17). Overall, calibrated abstention can modestly improve penalty-aware accuracy by avoiding low-confidence moves, but it does not change the depth-collapse pattern in TA% or Perfect-Solves%.
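The exact APA formula is not spelled out above. One reading that is consistent with the reported values, and which we state here as our own assumption rather than the paper's definition, gives each IDK answer partial credit λ:

```python
# Sketch of the abstention metrics; the APA formula below is inferred
# (assumed: full credit for correct answers, lambda credit per IDK, zero for
# wrong answers), chosen to be consistent with the reported Table 6 numbers.
def abstention_metrics(n_correct: int, n_answered: int, n_idk: int, lam: float = 0.25) -> dict:
    total = n_answered + n_idk
    return {
        "coverage": n_answered / total,
        "selective_accuracy": n_correct / n_answered if n_answered else 0.0,
        "APA": (n_correct + lam * n_idk) / total,
    }

# Toy counts matching the Gemini ratios at d = 5 (coverage 87.5%, Sel. Acc ~28.6%,
# IDK 12.5%): APA comes out to ~0.28.
print(abstention_metrics(n_correct=2, n_answered=7, n_idk=1))
```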

Reflection regimes: Unredacted vs. Redacted. Table 3 reports the label-safe "Guided (Redacted)" regime, where reflection operates without access to the gold answer. For completeness, we also evaluated a leaky "Guided (Unredacted)" regime in which the reflection prompt includes the correct option as an oracle hint (upper-bound oracle guidance). Under Unredacted, every model improves sharply: for example, GLM-4.5V and Llama 4-Scout-17B:16E climb from 31% and 18% initial accuracy to 100% (gains of +69 and +82 points), and Qwen models jump to 59-72% final accuracy with gains of +42 to +50 points. Gemma 3-27B rises from 22% to 81%.

These results frame the Redacted curves as realistic, label-free behavior and the Unredacted curves as upper bounds. In the leaky regime, reflection can make almost any model look strong on static MCQ; in the label-safe regime, gains are smaller and strongly model-dependent (Qwen models benefit, Gemma does not, GLM degrades), consistent with the EFR/OTR trade-off in §4.3. This supports our decision to headline Redacted results in the main experiments and treat Unredacted as an oracle ablation.

At fixed difficulty, pre-action causal evaluation (Move-Effect κ) tracks closed-loop control (TA%, Perfect-Solve%) and post-error recoverability (SR). At d = 1 the association is very strong (Fig. 3: r = 0.997). The only model with consistently positive κ across depths, e.g., Gemini 2.5 Pro (Table 4), also leads in Step-by-Step closed-loop metrics (TA%, Perfect-Solve%) at shallow scrambles (Fig. 2) and is the only system to show a meaningful Learning Curve (Table 5), where we grant a retry budget and continue the episode even when a non-optimal move is chosen to test recovery. However, once outputs feed back into inputs, performance degrades sharply with depth (Fig. 2); even the best model falls steeply, and early recovery (P(1), P(≤3)) is rare. Among open-source models, κ clusters near 0, so the κ → TA relation weakens at d=2, 3 (per-depth plots in the Supplementary Material).

We introduced Cube Bench, a compact, simulator-based benchmark that uses the Rubik's Cube as a controlled testbed to probe the full see → evaluate → act → reflect → recover loop in multimodal language models. Across seven tests, we showed that current systems can reconstruct and verify the state with high accuracy, yet struggle to turn local decisions into a stable sequence of decisions: performance collapses quickly as depth increases, post-error recovery is rare, and strong one-step skills often fail to translate into robust long-horizon behavior. Pre-action causal evaluation (Move-Effect κ) tracks both adherence and recovery, highlighting the value of decision-aware metrics beyond raw accuracy. Reflection can help, but only under label-safe regimes and with careful control of overthinking. Taken together, these results suggest that improving multimodal LLMs will require not only better perception, but also explicit mechanisms for pre-action evaluation, selective control, and recovery. Cube Bench offers a lightweight, reproducible platform for studying these issues in isolation and for comparing future models and intervention strategies under matched conditions.

Our work on Cube Bench has some limitations. It uses the Rubik's Cube to test spatial perception and short-horizon planning, but the results may not transfer to broader settings such as robotics or web-agent tasks. Due to the poor performance of existing MLLMs, our evaluations are limited to shallow horizons; nonetheless, the framework can produce arbitrarily long-horizon tasks for future studies. Tasks use low scramble depths and multiple-choice formats, limiting insights into complex thinking and reasoning errors.

Slice moves (M, E, S) and wide moves (u, d, f, etc.) are not permitted as atomic actions and are decomposed into their constituent face turns for distance calculation.

In the Closed-Loop Step-by-Step task, the oracle evaluates the transition from state s_t to s_{t+1} caused by a model's chosen action a_t. We determine whether a step constitutes progress by comparing the exact oracle distances:

d(s_{t+1}) = d(s_t) − 1.

Since d(s) is guaranteed optimal, any move satisfying this condition is mathematically proven to be on a shortest path to the solution.

To ensure exact reproducibility across all models and runs, we strictly enforce a deterministic generation protocol. For every episode i at depth d, the simulator is initialized with seed = i. This determines the scramble state s d,i , the teacher inverse plan, and the distractor options.

We provide the full metadata for these episodes in the supplementary code package (file: configs/seed_lists.json). This file explicitly maps every episode index to its corresponding scramble string and ground-truth attributes. Future benchmarks can strictly replicate our test set by loading these pre-generated configurations or by re-running the generator with the same seed integers.

Let I be the input image and M⋆ ∈ V^{3×3} the ground-truth face matrix over vocabulary V. The model outputs M̂ = f(I), normalized by N(•). Element-wise accuracy is

Acc_elem = (1/9) Σ_{r,c ∈ {1,2,3}} 1{N(M̂)_{r,c} = M⋆_{r,c}},

and matrix accuracy is Acc_mat = 1{N(M̂) = M⋆}. Parse-violations are counted when N(M̂) ∉ V^{3×3} (wrong shape/tokens).
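A minimal sketch of these two metrics, assuming the model's answer has already been normalized by N(•) into a 3 × 3 grid of color initials:

```python
# Face Reconstruction metrics: per-sticker accuracy, exact-grid accuracy,
# and the parse-violation check over the colour vocabulary.
VOCAB = set("WYROGB")

def face_metrics(pred_grid: list[list[str]], gold_grid: list[list[str]]) -> dict:
    # Parse violation: wrong shape or tokens outside the colour vocabulary.
    valid = (len(pred_grid) == 3 and all(len(row) == 3 for row in pred_grid)
             and all(c in VOCAB for row in pred_grid for c in row))
    if not valid:
        return {"parse_violation": True, "acc_elem": 0.0, "acc_mat": 0}
    matches = sum(pred_grid[r][c] == gold_grid[r][c] for r in range(3) for c in range(3))
    return {"parse_violation": False,
            "acc_elem": matches / 9,        # element-wise accuracy
            "acc_mat": int(matches == 9)}   # exact 3x3 match
```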

Images are generated with VirtualCube using a fixed camera pose, a standard color scheme, and uniform random cube states sampled by scramble depth d, ensuring that color-token priors do not deviate from a uniform distribution by more than 5% (maximum deviation < 0.05). We disable motion blur and heavy lighting effects and clamp minor rotation jitter to reduce nuisance variance. Parse-violations occur when the response lacks a valid Yes/No tag.

Hypotheses are auto-derived from ground-truth states produced by VirtualCube. We enforce a 50/50 Yes/No prior per depth (and per template, when applicable), include polarity reversals (e.g., “is green”/“is not green”), and randomize surface forms to avoid lexical shortcuts. Ambiguous references are excluded.

Let T(s, a) be the transition and d(•) the fixed distance metric (FTM/HTM). For a one-move scramble s = T(s⋆, a_s), the gold label y is the option whose move returns the cube to the solved state; at d=1, y corresponds to a_s^{-1}. A prediction ŷ ∈ {A, B, C, D} is correct iff ŷ = y.

We fix n_moves=1 so the gold action is the inverse of the scramble (unique-optimal under the chosen metric). Camera pose/colors match Recognition/Verification; the text serialization reuses the same normalization. Options are full-entropy shuffled; over many items this approximates uniform A-D placement.

Let I = {1, 2, . . . , N } be the set of N test items. For each item i ∈ I:

• s_i is the sample data (cube state, options A, B, C, D).

• g_i ∈ {A, B, C, D} is the ground-truth correct (gold) answer.

• ŷ_{i,1} is the initial "Draft" choice from the model.

• ŷ_{i,2} is the final "Re-answer" choice from the model, after processing reflection r_i.

• 1{•} is the indicator function.

We define the initial accuracy (score) a_{i,1} and final accuracy a_{i,2} for item i as a_{i,1} = 1{ŷ_{i,1} = g_i} and a_{i,2} = 1{ŷ_{i,2} = g_i}.

To define EFR and OTR, we partition the total set I based on initial performance:

• Initially Wrong Set: I_W = {i ∈ I : a_{i,1} = 0}.

• Initially Correct Set: I_C = {i ∈ I : a_{i,1} = 1}.

Initial Acc. The accuracy before reflection (from the "Draft" phase): (1/N) Σ_{i∈I} a_{i,1}.

Final Acc (All). The accuracy after reflection and re-answering, measured over all N items: (1/N) Σ_{i∈I} a_{i,2}.

Gain (∆). The net gain in accuracy after reflection: ∆ = Final Acc − Initial Acc.

Error-Fix Rate (EFR). The fraction of previously incorrect items (I_W) that were corrected in the "Re-answer" phase: EFR = |{i ∈ I_W : a_{i,2} = 1}| / |I_W|.

Overthink Rate (OTR). The fraction of previously correct items (I_C) that were flipped to wrong in the "Re-answer" phase: OTR = |{i ∈ I_C : a_{i,2} = 0}| / |I_C|.
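These definitions translate directly into code; a minimal sketch over the gold answers and the draft / re-answer choices:

```python
# Reflection metrics: Initial/Final accuracy, net gain, EFR, and OTR.
def reflection_metrics(gold: list[str], draft: list[str], final: list[str]) -> dict:
    n = len(gold)
    a1 = [int(d == g) for d, g in zip(draft, gold)]    # a_{i,1}
    a2 = [int(f == g) for f, g in zip(final, gold)]    # a_{i,2}
    wrong = [i for i in range(n) if a1[i] == 0]        # I_W
    correct = [i for i in range(n) if a1[i] == 1]      # I_C
    init_acc, final_acc = sum(a1) / n, sum(a2) / n
    return {
        "Initial Acc": init_acc,
        "Final Acc": final_acc,
        "Delta": final_acc - init_acc,
        "EFR": sum(a2[i] for i in wrong) / len(wrong) if wrong else 0.0,
        "OTR": sum(1 - a2[i] for i in correct) / len(correct) if correct else 0.0,
    }
```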

Items are rebuilt deterministically via per-index seeds; A-D options are fully shuffled; the FTM/HTM distance metric is fixed per run. The re-answer prompt defines the deterministic tie-break (A > B > C > D) and the strict output format. We log tokens and latency for both reflection and re-answer.

The teacher plan is the inverse of the scramble. At step t the model outputs a_t ∈ {a_A, a_B, a_C, a_D} (or IDK), the environment updates s_t = T(s_{t−1}, a⋆_t) per policy, and ∆_t = d(s_t) − d(s_{t−1}). The episode halts at the first t with ∆_t ≥ 0 or a parse violation; otherwise it runs up to d steps.

  1. Depth-aware target mix. For depth d, we estimate the feasibility of each label class c and set the per-depth target mix from these feasibility estimates.

Let S be cube states, M the standard move set (quarter/half turns in Singmaster notation), and m(s) the deterministic transition obtained by applying m ∈ M to s ∈ S. A fixed oracle distance d : S → N_0 (FTM/HTM; chosen once per run) gives the distance-to-solved. The start state s_0 is produced by a depth-d scramble with a per-episode seed; the teacher plan π_0 is the inverse scramble (shortest plan from s_0 to solved), with head m⋆.

Progress-making moves. For state s, the progress set is

G(s) = {m ∈ M : d(m(s)) = d(s) − 1}.

An option is progress-making iff it is in G(s).

MCQ construction. At attempt t, from state s_t we construct an MCQ with four distinct options: one option is sampled from G(s_t) (when G(s_t) ≠ ∅), and the remaining three are sampled from M \ G(s_t) when feasible; options are then uniformly shuffled into A-D using an RNG independent of the scramble seed. (If |M \ G(s_t)| < 3, a rare fallback samples without the progress/non-progress restriction; evaluation remains defined.)
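A minimal sketch of this construction, with progress_set and moves as hypothetical stand-ins for G(•) and M:

```python
# MCQ construction: one progress-making option plus three non-progress
# distractors when feasible, shuffled with an RNG independent of the scramble seed.
import random

def build_mcq(state, progress_set, moves, option_rng: random.Random) -> dict:
    good = sorted(progress_set(state))          # G(s_t)
    bad = sorted(set(moves) - set(good))        # M \ G(s_t)
    if good and len(bad) >= 3:
        options = [option_rng.choice(good)] + option_rng.sample(bad, 3)
    else:                                       # rare fallback: drop the restriction
        options = option_rng.sample(sorted(moves), 4)
    option_rng.shuffle(options)                 # uniform A-D placement
    return dict(zip("ABCD", options))

# The option RNG is kept independent of the scramble seed, e.g.:
# option_rng = random.SystemRandom()            # full-entropy, as in the appendix
```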

Parsing and attempts. The model outputs a single letter X_t ∈ {A, B, C, D} using either X or ANSWER: X (case-insensitive). A parse-fail is recorded if neither pattern is found. An attempt is counted for every query (including parse-fails).
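A sketch of a parser implementing this protocol (the exact regular expression is ours, not the released harness's):

```python
# Strict answer parsing: accept "X" or "ANSWER: X" (case-insensitive), X in A-D;
# anything else is a parse-fail, which still counts as an attempt.
import re

ANSWER_RE = re.compile(r"^\s*(?:ANSWER:\s*)?([ABCD])\s*$", re.IGNORECASE)

def parse_choice(line: str) -> str | None:
    m = ANSWER_RE.match(line)
    return m.group(1).upper() if m else None   # None => parse-fail
```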

Accept-progress control policy. Let m⋆ be the head of π_t (the teacher's next move), and let a_t be the concrete move mapped from the chosen option X_t (undefined on parse-fail). The environment update is

(s_{t+1}, π_{t+1}) = (a_t(s_t), tail(π_t)) if a_t = m⋆; (a_t(s_t), Solve(a_t(s_t))) if a_t ∈ G(s_t) and a_t ≠ m⋆; (a_t(s_t), ⟨a_t^{-1}⟩ ∥ π_t) otherwise.

Here, Solve(•) is the oracle shortest-plan solver (returns an empty plan at solved), ⟨a_t^{-1}⟩ is the one-move deque containing the group-theoretic inverse of a_t, and ∥ denotes deque concatenation (push-front).

Stopping rule and outcomes. An episode solves if d(s_t) = 0 for some t ≤ max attempts; otherwise it is a failure. The number of attempts used is the index of the solving attempt for solved episodes and max attempts for failures.

Uniform letter shuffling, independent RNG. Options are uniformly permuted into A-D using a full-entropy RNG (System Random) that is independent of both the scramble seed and episode index. This prevents fixed letter positions from correlating with correctness and reduces positional cueing. We do not otherwise regularize or audit slot frequencies; independence plus sample size keeps slot priors approximately flat in expectation.

Parse protocol & penalties. The model must output exactly one letter via either X or ANSWER: X (case-insensitive), X ∈ {A, B, C, D}. Any other output is a parse-fail. Parse-fails count as attempts and leave the environment unchanged, applying a consistent penalty without introducing spurious transitions.

Fixed control policy (accept-progress). The same policy is applied to every model's choice a_t: (i) if a_t matches the teacher head, apply and advance the plan; (ii) else if a_t ∈ G(s_t), apply and replan from the new state; (iii) else apply and prepend a_t^{-1} to the plan. This creates a uniform opportunity to recover from near-misses and a uniform consequence for regressions.
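A minimal sketch of this policy, with apply, solve, invert, and progress_set as hypothetical stand-ins for the simulator, the oracle shortest-plan solver, the move inverse, and G(•):

```python
# Accept-progress control policy: the chosen move is always executed; only the
# remaining plan is updated depending on how the move relates to the teacher plan.
from collections import deque

def control_step(state, plan: deque, chosen_move, apply, solve, invert, progress_set):
    next_state = apply(state, chosen_move)
    if plan and chosen_move == plan[0]:          # (i) matches teacher head: advance plan
        plan.popleft()
    elif chosen_move in progress_set(state):     # (ii) progress-making: replan from new state
        plan = deque(solve(next_state))
    else:                                        # (iii) regression: push the inverse move
        plan.appendleft(invert(chosen_move))
    return next_state, plan
```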

Attempts budget parity & stopping. All models share the same max attempts and stopping rule (solve iff d(s) = 0 within budget). Aggregate metrics (SR, P (1), P (≤ 3), Med@Solved, Avg@All) are computed over the same episode set with the unconditional denominator, so unsolved runs contribute 0 to proportion metrics and max attempts to Avg@All.
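A minimal sketch of these aggregates with unconditional denominators, where attempts is None for unsolved episodes (which count as max_attempts in Avg@All and as zeros in the proportion metrics):

```python
# Learning-curve aggregates: SR, P(1), P(<=3), Med@Solved, Avg@All.
from statistics import median

def recovery_metrics(attempts: list[int | None], max_attempts: int) -> dict:
    n = len(attempts)
    solved = [a for a in attempts if a is not None]
    return {
        "SR": len(solved) / n,
        "P(1)": sum(a <= 1 for a in solved) / n,
        "P(<=3)": sum(a <= 3 for a in solved) / n,
        "Med@Solved": median(solved) if solved else None,
        "Avg@All": sum(a if a is not None else max_attempts for a in attempts) / n,
    }
```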

Scope of randomness. We intentionally resample distractors each attempt (subject to the uniqueness and canonical/fallback rules) rather than fixing distractor identities across episodes or models. This avoids overfitting to specific move tuples while keeping evaluation comparable through shared scrambles, shared decoding, and a fixed control policy.

• X = 90° clockwise quarter-turn of face X.

• X’ = 90° counter-clockwise quarter-turn of face X.

• X2 = a single 180° half-turn of face X (not two 90s).

• Faces are identified by their centers (isomorphic color scheme).

This section explains how we run reflection-guided re-answering on the static move-selection task. Each reflection trial begins from the same Image+Text prompt we use for the move-prediction test (Prompt C.3.1). Concretely, every item shows (i) a rendered image of the cube for context and (ii) the authoritative textual cube state, along with four candidate moves.

You are an expert Rubik’s Cube solver. You previously answered the following multiple-choice question incorrectly. Your task is to reflect on your mistake, understand why it occurred, and improve your reasoning for the future. Complete the following steps clearly and concisely: 1. Explain why you answered incorrectly. 2. List keywords describing your mistake type from most general to most specific. 3. Solve the problem again step-by-step using the correct answer provided. 4. Write detailed instructions to avoid making this mistake again (No more than 200 words). Provide general advice to improve your (Large Language Model) performance on similar problems in the future.

Two patterns stand out:

• Perfect Rationalization: GLM-4.5V and Llama 4-Scout-17B achieve 100% Final Accuracy (gains of +69.0 and +82.0 points, respectively). This confirms that these models possess sufficient world knowledge to verify a solution when given the answer, even though they fail to derive it autonomously.

• Broad Improvement: Even the models that struggled most in the standard setting show strong recovery. Gemma 3-27B, which showed zero net gain in the redacted setting, jumps from 22% to 81% accuracy in the unredacted setting. Similarly, the Qwen family sees improvements ranging from +42 to +50 points.

These results highlight the "Post-Hoc Rationalization" capabilities of MLLMs. The discrepancy between the Redacted and Unredacted scores indicates that the bottleneck in spatial reasoning is not an inability to understand the move mechanics, but the lack of a robust internal verification signal. When that signal is provided externally (the oracle label), models can successfully align their reasoning; without it, they are prone to overthinking and instability.

Move Prediction (1 move from solved). Top-1 accuracy and format compliance (Parse: % of outputs matching ANSWER: [A-D] or a bare letter A-D) across prompt modalities.

GLM-4.5V shows good reconstruction and verification but low MCQ accuracy (31% Image+Text; 21% Image-only), highlighting selection/ranking issues over perception. All runs use reflection-guided re-answering on Optimal Move Prediction ("Guided (Redacted)"). ∆ is the absolute change in accuracy from applying reflection; EFR is the corrected share of initially wrong items; OTR is the share of correct items flipped to wrong. All runs use the Image+Text modality, zero temperature, deterministic option shuffling, strict A-D parsing, and the same number of items per modality for comparability.

Takeaways: (i) Static MCQ exposes a real selection gap: several models with decent reconstruction/verification still hover near chance because they mis-rank candidates or violate the A-D format; (ii) when vision is the bottleneck (e.g., Gemma-3, Qwen-7B), Image+Text performance can degrade compared to text-only (e.g., Llama 4); (iii) strict parsing matters: non-A-D outputs directly convert to errors and can mask underlying competence; (iv) only Gemini presents consistent performance across tasks, with high perceptual fidelity (reconstruction), strong cross-modal grounding (verification), and accurate selection (MCQ), while others are capped either by perception (e.g., Gemma-3) or by the Argmin step (e.g., GLM and Qwen2.5-32B).

These outcomes emphasize that label-free reflection can help but is not uniformly positive; the EFR/OTR balance is decisive. Takeaways: Reflection under no-leak prompts can help, but gains are model-dependent: Qwen models consistently benefit, whereas Gemma shows no net gain and GLM degrades.

From Table 4, the frontier model (Gemini 2.5 Pro) leads with consistently positive κ and the strongest Macro-F1 across depths, indicating genuine causal forecasting beyond priors (though performance still drops as scramble depth increases). Open-source baselines cluster near chance. For Qwen-2.5-32B, micro-accuracy edges up with depth, but κ shows what's really happening: below chance at d = 1 (≈ −0.074), barely above chance at d = 2 (≈ 0.040), and only a small positive at d = 3 (≈ 0.086). The confusion and prediction mix explain this: Qwen mostly predicts INCREASE and rarely predicts DECREASE. In practice, it treats most moves as "making things worse," so it misses many true DECREASEs (low recall), keeping Macro-F1 modest (≈ 0.25 at d = 1, ≈ 0.33 at d = 3) even when micro-accuracy improves. In short, the frontier model shows genuine, though still depth-limited, causal evaluation, while open-weight models rely on shallow heuristics (an INCREASE bias and a DECREASE blind spot).

