Recurrent-Depth VLA: Implicit Test-Time Compute Scaling of Vision-Language-Action Models via Latent Iterative Reasoning
Current Vision-Language-Action (VLA) models rely on fixed computational depth, expending the same amount of compute on simple adjustments as on complex multi-step manipulation. While Chain-of-Thought (CoT) prompting enables variable computation, it scales memory linearly and is ill-suited for continuous action spaces. We introduce Recurrent-Depth VLA (RD-VLA), an architecture that achieves computational adaptivity via latent iterative refinement rather than explicit token generation. RD-VLA employs a recurrent, weight-tied action head that supports arbitrary inference depth with a constant memory footprint. The model is trained using truncated backpropagation through time (TBPTT) to efficiently supervise the refinement process. At inference, RD-VLA dynamically allocates compute using an adaptive stopping criterion based on latent convergence. Experiments on challenging manipulation tasks show that recurrent depth is critical: tasks that fail entirely (0% success) with single-iteration inference exceed 90% success with four iterations, while simpler tasks saturate rapidly. RD-VLA provides a scalable path to test-time compute in robotics, replacing token-based reasoning with latent reasoning to achieve constant memory usage and up to 80x inference speedup over prior reasoning-based VLA models. Project page: https://rd-vla.github.io/
💡 Research Summary
The paper addresses a fundamental inefficiency in current Vision‑Language‑Action (VLA) systems: they allocate a fixed amount of computation to every control step, regardless of task difficulty. While Chain‑of‑Thought (CoT) prompting can introduce variable computation, it does so by generating explicit tokens, which leads to linearly growing memory usage and is ill‑suited for the continuous action spaces typical of robotics.
To overcome these drawbacks, the authors propose Recurrent‑Depth VLA (RD‑VLA), an architecture that performs iterative refinement entirely within a latent representation space, eliminating the need for token‑level reasoning. RD‑VLA consists of three functional modules:
- Prelude – A non‑recurrent interface that introduces a set of 64 learned latent queries. These queries self‑attend and then cross‑attend to mid‑layer visual features of a frozen vision‑language backbone, producing a fixed conditioning matrix (S_{pre}).
- Recurrent Core – A weight‑tied transformer block that is unrolled for an arbitrary number of steps (r). At each iteration (k), the current scratch‑pad state (S_{k-1}) is concatenated with the static (S_{pre}) and passed through a learned adapter and RMSNorm to obtain an input (x_k). The recurrent block then performs bidirectional self‑attention over the (K) latent slots, followed by gated cross‑attention to a conditioning manifold that combines final‑layer visual tokens, the 64 latent tokens, and proprioceptive data. This "input injection" strategy prevents representation collapse over long unrolls.
- Coda – A non‑recurrent decoder that maps the converged latent state (S_r) to continuous robot actions (e.g., 6‑DoF gripper commands).
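The prelude → recurrent core → coda pipeline can be sketched in a few lines of NumPy. This is a toy stand-in, not the paper's implementation: the dimensions, the single-matrix "core" (replacing the transformer block), and the mean-pooled coda are illustrative assumptions; only the structure (noise-initialized scratch-pad, weight-tied unroll, per-step re-injection of the static prelude output) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 64, 32  # latent slots and hidden size (hypothetical values)

# Weight-tied core and coda projection (stand-ins for transformer layers).
W_core = rng.normal(scale=0.1, size=(2 * D, D))
W_coda = rng.normal(scale=0.1, size=(D, 7))  # 7-dim action, e.g. 6-DoF + gripper

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

def recurrent_core(s, s_pre):
    # "Input injection": every step re-concatenates the static prelude
    # output with the current scratch-pad state before the shared weights,
    # which is what prevents collapse over long unrolls.
    x = rmsnorm(np.concatenate([s, s_pre], axis=-1))
    return np.tanh(x @ W_core)

def rd_vla_forward(s_pre, r):
    s = rng.normal(size=(K, D))   # noise-initialized scratch-pad state S_0
    for _ in range(r):            # unroll the same weight-tied core r times
        s = recurrent_core(s, s_pre)
    return s.mean(axis=0) @ W_coda  # coda: pooled latents -> action vector

s_pre = rng.normal(size=(K, D))   # fixed conditioning matrix from the prelude
action = rd_vla_forward(s_pre, r=4)
print(action.shape)  # (7,)
```

Note that increasing `r` changes only loop count, not memory: the scratch-pad `s` is overwritten in place, which is the mechanism behind the constant-memory claim.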
Training uses truncated back‑propagation through time (TBPTT) to keep memory consumption constant while supervising the refinement process. At inference time, an adaptive stopping criterion based on latent‑space uncertainty (e.g., cosine distance or KL divergence between successive states) determines when the recurrent loop can be terminated, allowing per‑sample compute allocation.
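The adaptive stopping loop described above can be sketched as follows. The cosine-distance criterion matches one of the options mentioned in the text, but the threshold value, iteration cap, and the toy contraction used as a stand-in for the recurrent core are all illustrative assumptions.

```python
import numpy as np

def cosine_distance(a, b):
    a, b = a.ravel(), b.ravel()
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def refine_with_early_stop(step_fn, s0, max_iters=16, tol=1e-3):
    """Unroll step_fn until successive latent states stop moving
    (latent convergence), or until the iteration budget is exhausted."""
    s = s0
    for k in range(1, max_iters + 1):
        s_next = step_fn(s)
        if cosine_distance(s, s_next) < tol:  # converged: stop early
            return s_next, k
        s = s_next
    return s, max_iters

# Toy contraction standing in for the recurrent core: it converges to a
# fixed point, so the loop should terminate well before max_iters.
rng = np.random.default_rng(1)
target = np.ones(8)
step = lambda s: s + 0.5 * (target - s)

s_final, iters = refine_with_early_stop(step, rng.random(8))
```

Per-sample compute allocation falls out directly: easy inputs (states that converge quickly) exit after a few iterations, while hard inputs use the full budget.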
Empirical evaluation on the LIBERO‑10 suite shows that a fixed depth of eight iterations yields 93.0% success, while the adaptive version attains 92.5% with comparable variance. Crucially, tasks that fail completely (0% success) with a single iteration achieve over 90% success when the model is allowed four iterations, demonstrating that recurrent depth is a key factor for solving complex multi‑step manipulations. On the CALVIN benchmark, RD‑VLA reaches an average episode length of 3.39 and a task‑5 success rate of 45.3%, indicating strong long‑horizon generalization.
In terms of speed, because RD‑VLA never materializes intermediate tokens, its memory footprint remains constant irrespective of the number of iterations. This yields inference speedups of up to 80× over token‑based reasoning models such as ThinkAct and MolmoAct. Real‑world robot experiments (bread toasting, towel folding) confirm that the method transfers beyond simulation, handling noisy observations and continuous control demands.
The paper also discusses limitations: the choice of the initial noise distribution for the scratch‑pad and the maximum allowable recurrence depth influence performance and are not fully explored. Moreover, determining optimal uncertainty thresholds for early stopping in safety‑critical settings requires further study. Future work may extend RD‑VLA to multi‑robot scenarios where recurrent depth is shared, incorporate meta‑learning to automatically tune stopping criteria, and integrate larger pre‑trained foundation models to broaden the scope of embodied reasoning.
Overall, RD‑VLA introduces a principled way to achieve test‑time compute scaling in robotics by moving iterative reasoning into a latent space, offering constant memory usage, adaptive computation, and substantial speed gains while maintaining or improving task success rates.