DISK: Dynamic Inference SKipping for World Models
We present DISK, a training-free adaptive inference method for autoregressive world models. DISK coordinates two coupled diffusion transformers for video and ego-trajectory via dual-branch controllers with cross-modal skip decisions, preserving motion-appearance consistency without retraining. We extend higher-order latent-difference skip testing to the autoregressive chain-of-forward regime and propagate controller statistics through rollout loops for long-horizon stability. When integrated into closed-loop driving rollouts on 1,500 NuPlan and NuScenes samples using an NVIDIA L40S GPU, DISK achieves a 2× speedup on trajectory diffusion and a 1.6× speedup on video diffusion while matching the baseline's L2 planning error, visual quality (FID/FVD), and NAVSIM PDMS scores, demonstrating practical long-horizon video-and-trajectory prediction at substantially reduced cost.
💡 Research Summary
The paper introduces DISK (Dynamic Inference Skipping), a training‑free, test‑time acceleration technique designed for autoregressive world models that simultaneously predict future video frames and ego‑vehicle trajectories using two coupled diffusion transformers (DiTs). In conventional diffusion‑based world models, each rollout step requires a full sequence of denoising updates for both visual and trajectory branches, leading to thousands of model evaluations for long‑horizon driving scenarios. Existing acceleration methods either reduce the number of diffusion steps globally, require costly retraining (e.g., consistency models, distillation), or focus on a single network without coordinating cross‑modal predictions. DISK addresses these gaps by attaching a lightweight “skip controller” to each branch and a safety gate that synchronizes the two branches when necessary.
Core Mechanism
For each diffusion timestep k (descending from K to 1), the controller computes a simple difficulty proxy d(b)ₖ = mean|x(b)ₖ – x(b)ₖ₋₁| for branch b ∈ {v (vision), t (trajectory)}. It also evaluates a second‑order finite‑difference residual Δ(b)ₖ = |(d(b)ₖ₊₂ + d(b)ₖ)/2 – d(b)ₖ₊₁| over the three most recent d values. If Δ(b)ₖ ≤ θ·d(b)ₖ₊₁, where θ is a tunable tolerance, the step is marked “skip”; otherwise the model recomputes the noise prediction F_b(x(b)ₖ; τₖ, c) and caches it. A skipped step simply reuses the cached prediction inside the numerical update Ψ, avoiding a forward pass through the diffusion network.
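The proxy and the residual test above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function names and the ordering convention for the difficulty history are assumptions.

```python
import numpy as np

def difficulty(x_curr, x_prev):
    """Difficulty proxy d_k: mean absolute change between consecutive latents."""
    return float(np.mean(np.abs(x_curr - x_prev)))

def should_skip(d_hist, theta):
    """Second-order finite-difference residual test on the three most
    recent difficulty values. d_hist holds [d_{k+2}, d_{k+1}, d_k],
    most recent last (the diffusion index k descends). The step is
    skippable when the residual is small relative to theta * d_{k+1}."""
    d_k2, d_k1, d_k = d_hist[-3], d_hist[-2], d_hist[-1]
    residual = abs((d_k2 + d_k) / 2.0 - d_k1)
    return residual <= theta * d_k1
```

For a smooth (near-linear) difficulty sequence the residual vanishes and the step is skipped; a sudden jump in d breaks linearity and forces a recompute.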
Safety Guards
Three safeguards prevent pathological skipping: (1) a warm‑up period of W mandatory compute steps to populate the cache, (2) a cap C_max on consecutive skips, and (3) a stall threshold ε that forces compute when d becomes numerically tiny. These guards ensure that the controller does not get stuck in a zero‑change regime or skip too aggressively during rapid dynamics.
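The three guards can be gathered into a small state object; the sketch below is illustrative, and the default values for W, C_max, and ε are assumptions (the paper's settings are not stated here).

```python
class SkipGuards:
    """Safety guards for one branch's skip controller (illustrative)."""

    def __init__(self, warmup=3, max_consec_skips=2, stall_eps=1e-6):
        self.warmup = warmup          # W: mandatory compute steps
        self.max_consec = max_consec_skips  # C_max: cap on consecutive skips
        self.eps = stall_eps          # epsilon: stall threshold on d
        self.steps_done = 0
        self.consec_skips = 0

    def force_compute(self, d_k):
        """True when any guard overrides a proposed skip."""
        if self.steps_done < self.warmup:         # (1) warm-up: populate cache
            return True
        if self.consec_skips >= self.max_consec:  # (2) consecutive-skip cap
            return True
        if d_k < self.eps:                        # (3) stall: d numerically tiny
            return True
        return False

    def record(self, skipped):
        """Update guard state after the step's final decision."""
        self.steps_done += 1
        self.consec_skips = self.consec_skips + 1 if skipped else 0
```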
Cross‑Modal Coordination
Because independent skipping could desynchronize appearance and motion, DISK introduces a unidirectional safety signal σ_t(k) derived from the trajectory branch’s guards. When σ_t(k)=1 (e.g., during a complex maneuver, after C_max skips, or when d_t is near zero), the vision controller is forced to compute, guaranteeing that visual output reflects the updated trajectory. The reverse direction is not enforced; a visual compute does not compel trajectory recomputation, preserving efficiency since the video branch is typically more expensive.
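The one-directional gate admits a very compact sketch. The function names and the exact trigger set for σ_t are assumptions consistent with the description above (warm-up, skip cap, or near-zero d_t).

```python
def trajectory_safety_signal(in_warmup, consec_skips, max_consec, d_t, eps=1e-6):
    """sigma_t(k): raised to 1 when any trajectory-branch guard fires."""
    return int(in_warmup or consec_skips >= max_consec or d_t < eps)

def gated_vision_skip(vision_skip, sigma_t):
    """Unidirectional coupling: sigma_t = 1 cancels a vision skip so the
    frames reflect the updated trajectory; a vision compute never forces
    trajectory recomputation."""
    return vision_skip and sigma_t == 0
```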
Statistics Propagation
After each autoregressive rollout step s, DISK aggregates per‑branch statistics: total compute count c(b)_s, skip count u(b)_s, compute ratio r(b)_s = c(b)_s / K, and safety‑trigger ratio ρ(b)_s (excluding gate‑forced computes). These summaries seed the next step, allowing the controller to adapt its θ dynamically—relaxing the skip threshold after a hard step and tightening it during stable intervals. This feedback loop stabilizes long‑horizon rollouts by mitigating error accumulation.
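The bookkeeping and threshold adaptation might look like the sketch below. The paper does not spell out the update rule, so the exact form of ρ, the target ratio, and the adaptation rate are all assumptions.

```python
def rollout_stats(c, u, safety_triggers, K):
    """Per-branch summary after a rollout step of K diffusion timesteps.
    Treating rho as within-branch guard triggers per timestep (with
    cross-modal gate-forced computes excluded from the count passed in)
    is an assumption about the paper's definition."""
    return {"c": c, "u": u, "r": c / K, "rho": safety_triggers / K}

def adapt_theta(theta, r_prev, target_r=0.6, rate=0.1):
    """Illustrative feedback rule: relax theta (permit more skipping)
    after a hard step with a high compute ratio, tighten it during
    stable intervals. target_r and rate are assumed hyperparameters."""
    if r_prev > target_r:
        return theta * (1 + rate)
    return theta * (1 - rate)
```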
Experimental Evaluation
Experiments were conducted on an NVIDIA L40S GPU using 1,500 closed‑loop driving rollouts from NuPlan and NuScenes. DISK achieved roughly 2× speedup on the trajectory diffusion branch (20 ms → 10 ms per step) and 1.6× speedup on the vision diffusion branch (244 ms → 155 ms). Crucially, quality metrics remained on par with the baseline (Epona): L2 planning error showed negligible change, FID stayed at 12.5, FVD at 88.1, and NAVSIM PDMS scores as well as collision rates were unchanged. Qualitative video samples (Figure 2) demonstrated indistinguishable visual fidelity despite the reduced compute budget.
Significance and Limitations
DISK offers a plug‑and‑play solution that does not alter model weights, making it immediately applicable to existing diffusion‑based world models. By adapting compute per‑step and coordinating across modalities, it delivers substantial inference savings while preserving the temporal consistency essential for safe autonomous driving. Limitations include the current one‑directional coupling (trajectory → vision only) and reliance on simple smoothness metrics; future work could explore bidirectional coordination, learned meta‑controllers, or integration with other system‑level sparsity techniques.
Conclusion
The authors present a novel, training‑free adaptive inference framework that dynamically skips diffusion denoising steps in both visual and trajectory branches of autoregressive world models. DISK achieves 1.6–2× acceleration on real‑world driving rollouts without sacrificing planning accuracy, visual fidelity, or safety, thereby advancing the practicality of diffusion‑based world models for real‑time autonomous systems.