Where Bits Matter in World Model Planning: A Paired Mixed-Bit Study for Efficient Spatial Reasoning
Efficient spatial reasoning requires world models that remain reliable under tight precision budgets. We study whether low-bit planning behavior is determined mostly by total bitwidth or by where bits are allocated across modules. Using DINO-WM on the Wall planning task, we run a paired-goal mixed-bit evaluation across uniform, mixed, asymmetric, and layerwise variants under two planner budgets. We observe a consistent three-regime pattern: 8-bit and 6-bit settings remain close to FP16, 3-bit settings collapse, and 4-bit settings are allocation-sensitive. In that transition region, preserving encoder precision improves planning relative to uniform quantization, and near-size asymmetric variants show the same encoder-side direction. In a later strict 22-cell replication with fewer episodes per cell, the sign of the mixed-versus-uniform INT4 comparison becomes budget-conditioned, further highlighting the sensitivity of this transition regime. These findings motivate module-aware, budget-aware quantization policies as a broader research direction for efficient spatial reasoning. Code and run artifacts are available at https://github.com/suraj-ranganath/DINO-MBQuant.
💡 Research Summary
This paper investigates how the allocation of low‑bit precision across the components of a world‑model planner influences planning performance, rather than focusing solely on the total bit‑width. Using the DINO‑WM model—a visual encoder paired with a latent dynamics predictor—trained on the Wall planning benchmark, the authors conduct a systematic paired‑goal evaluation under two planner budgets (bA: horizon 9, 2 optimization steps, 2 iterations; bB: horizon 12, 3 steps, 3 iterations).
Four families of quantization variants are examined: (1) uniform quantization where all linear layers are quantized to the same bit‑width (INT8, INT6, INT4, INT3); (2) mixed quantization where the encoder remains in FP16 while the predictor is quantized; (3) asymmetric quantization where encoder and predictor receive different bit‑widths (e.g., E8/P4, E6/P4, E4/P8, E4/P6); and (4) layer‑wise encoder retention where a percentage of encoder layers stay in FP16 and the remainder are INT4. Quantization is performed post‑training, weight‑only, using per‑output‑channel symmetric scaling without activation quantization, calibration, or fine‑tuning.
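The quantization scheme described above (post-training, weight-only, per-output-channel symmetric scaling) can be sketched as follows. This is an illustrative re-implementation under stated assumptions, not the authors' code; the function name and the fake-quantize (quantize-then-dequantize) formulation are my own choices.

```python
import numpy as np

def quantize_weight_symmetric(w: np.ndarray, bits: int) -> np.ndarray:
    """Fake-quantize a linear layer's weight matrix (out_features x in_features)
    to `bits` using per-output-channel symmetric scaling, then dequantize back
    to float. Weight-only PTQ: no activation quantization, calibration, or
    fine-tuning, matching the setup described in the summary."""
    qmax = 2 ** (bits - 1) - 1                    # e.g. 7 for INT4, 127 for INT8
    # One scale per output channel (row), from that row's max magnitude.
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                              # dequantized weights

w = np.random.default_rng(0).standard_normal((8, 16)).astype(np.float32)
w4 = quantize_weight_symmetric(w, bits=4)
```

Mixed and asymmetric variants then amount to calling this routine with different `bits` per module (or skipping it entirely for layers retained in FP16).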
The central empirical finding is a three‑regime pattern. At 8‑bit and 6‑bit, planning success matches full‑precision FP16 (≈0.53 – 0.65) while model size shrinks dramatically (e.g., INT6 ≈ 78 MB vs FP16 ≈ 205 MB). At 3‑bit, success collapses to zero for both uniform and mixed variants. The 4‑bit region is a transition zone where performance is highly sensitive to where bits are placed. When the encoder is kept in high precision (mixed INT4 or encoder‑heavy asymmetric variants such as E6/P4), success rates are substantially higher than uniform INT4 (e.g., mixed INT4 achieves 0.267 vs 0.067 at budget bA, and 0.500 vs 0.200 at budget bB). Paired‑unit bootstrapped confidence intervals show mean deltas of +0.20 (bA) and +0.30 (bB) with sign‑test p≈0.11, indicating a directional but not statistically decisive effect.
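The paired statistics above (bootstrap confidence intervals on the mean per-unit delta, plus a two-sided sign test) can be sketched with a minimal stdlib-only implementation. The toy `deltas` below are illustrative, not the paper's data, and the function is an assumed re-implementation of standard procedures rather than the authors' analysis code.

```python
import random
from math import comb

def paired_bootstrap_and_sign_test(deltas, n_boot=10_000, seed=0):
    """Given per-unit success deltas (e.g. mixed minus uniform INT4, one per
    paired goal/unit), return a bootstrap 95% CI on the mean delta and a
    two-sided sign-test p-value."""
    rng = random.Random(seed)
    n = len(deltas)
    # Resample units with replacement and collect the resampled means.
    means = sorted(
        sum(rng.choice(deltas) for _ in range(n)) / n for _ in range(n_boot)
    )
    ci = (means[int(0.025 * n_boot)], means[int(0.975 * n_boot)])
    # Sign test over non-tied units: exact binomial tail under a fair coin.
    pos = sum(d > 0 for d in deltas)
    neg = sum(d < 0 for d in deltas)
    m, k = pos + neg, min(pos, neg)
    p = min(1.0, 2 * sum(comb(m, i) for i in range(k + 1)) / 2 ** m)
    return ci, p

deltas = [1, 1, 1, 0, 1, 0, -1, 1, 0, 1]    # toy per-unit deltas
ci, p = paired_bootstrap_and_sign_test(deltas)
```

With 6 positive and 1 negative non-tied units, the sign test gives p = 0.125, similar in spirit to the p≈0.11 reported above: directional evidence without decisive significance at small sample sizes.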
A stricter replication with only 22 cells (66 episodes) confirms the 8/6‑bit safety and 3‑bit collapse, but reveals that the mixed‑vs‑uniform INT4 advantage can flip depending on the budget: mixed INT4 outperforms uniform INT4 at bA, while the opposite occurs at bB. This underscores the fragility of the 4‑bit frontier.
To disentangle total‑precision effects from allocation effects, the authors compare near‑size asymmetric variants. E6/P4 (73 MB, success 0.300) and E8/P4 (78 MB, success 0.233) both beat uniform INT4 (68 MB, success 0.067) despite being smaller than the mixed INT4 model (139 MB, success 0.267). Conversely, increasing predictor bits while keeping the encoder at INT4 (E4/P8, E4/P6) does not recover performance, suggesting that encoder precision is the dominant factor in the 4‑bit regime. Layer‑wise ablations further support this: success peaks when the encoder is fully retained in FP16, and degrades non‑monotonically as more encoder layers are quantized.
Mechanistic diagnostics show strong negative correlations between planning success and two divergence metrics extracted from planning logs: mean latent state distance (Spearman ρ = ‑0.928) and visual‑embedding divergence (ρ = ‑0.708). While correlation does not prove causality, it aligns with the hypothesis that low‑bit degradation of the encoder harms the latent geometry needed for goal‑directed planning.
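Spearman's ρ, as used for these diagnostics, is the Pearson correlation of ranks. A minimal stdlib sketch, with purely illustrative toy data (the real inputs would be the per-variant success rates and divergence metrics from the planning logs):

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of ranks,
    with average ranks assigned to tied values."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1                      # extend over a tie group
            avg = (i + j) / 2 + 1           # average rank for the group
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Toy data: success falls monotonically as latent divergence grows.
success = [0.65, 0.53, 0.30, 0.27, 0.07, 0.0]
latent_divergence = [0.1, 0.2, 0.5, 0.6, 1.1, 1.5]
rho = spearman_rho(success, latent_divergence)   # -1.0 for this toy data
```

A perfectly monotone decreasing relationship yields ρ = -1; the reported ρ = -0.928 for latent state distance is close to that extreme, consistent with the negative association described above.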
The paper’s contributions are threefold: (1) identification of a three‑regime precision landscape (safe 8/6‑bit, collapse at 3‑bit, transition at 4‑bit); (2) empirical evidence that, in the transition regime, allocating bits to the encoder yields higher planning success than uniform allocation; (3) demonstration that the mixed‑vs‑uniform INT4 advantage is budget‑conditioned and can be confounded by total model size, motivating future work on size‑matched comparisons.
Limitations include evaluation on a single environment and checkpoint, modest paired sample sizes (especially in the 4‑bit regime), exclusive use of weight‑only PTQ, and evaluation restricted to a single hardware platform (an Apple M4 MacBook Pro). The authors propose future directions: extending the paired evaluation to diverse tasks and model families; incorporating activation‑aware quantization, calibration, and fine‑tuning; testing on GPUs with hardware‑accelerated kernels; performing block‑wise and attention‑head sensitivity analyses; and ultimately developing automated, budget‑aware bit‑allocation optimizers that directly maximize planning success rather than minimize reconstruction error.
In summary, the study reveals that efficient spatial reasoning with world models is not merely a matter of reducing total bits; the placement of those bits—particularly preserving encoder precision—plays a decisive role near the 4‑bit frontier. This insight opens a new design axis—module‑aware, budget‑aware quantization—for deploying world‑model planners on resource‑constrained platforms.