MVP-LAM: Learning Action-Centric Latent Action via Cross-Viewpoint Reconstruction

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Learning latent actions from diverse human videos enables scaling robot learning beyond embodiment-specific robot datasets, and these latent actions have recently been used as pseudo-action labels for vision-language-action (VLA) model pretraining. To make VLA pretraining effective, latent actions should contain information about the underlying agent’s actions despite the absence of ground-truth labels. We propose the Multi-ViewPoint Latent Action Model (MVP-LAM), which learns discrete latent actions that are highly informative about ground-truth actions from time-synchronized multi-view videos. MVP-LAM trains latent actions with a cross-viewpoint reconstruction objective, so that a latent action inferred from one view must explain the future in another view, reducing reliance on viewpoint-specific cues. On Bridge V2, MVP-LAM produces more action-centric latent actions, achieving higher mutual information with ground-truth actions and improved action prediction, including under out-of-distribution evaluation. Finally, pretraining VLAs with MVP-LAM latent actions improves downstream manipulation performance on the SIMPLER and LIBERO-Long benchmarks.


💡 Research Summary

The paper tackles the problem of extracting high‑quality latent actions from unlabeled human manipulation videos, a key step for scaling robot learning without costly action annotations. Existing latent‑action models (LAMs) are trained on single‑view video streams and rely on a reconstruction loss that forces the latent token to capture any visual change between consecutive frames. In real‑world videos, however, many visual changes are caused by exogenous factors such as camera motion, viewpoint shifts, or background activity, which act as noise and dilute the information about the underlying robot action. The authors formalize the notion of an “action‑centric” latent action as one that maximizes mutual information I(Z;A) with the true action A. By deriving a lower bound I(Z;A) ≥ H(Z) – I(Z;V,V′|S,S′) – C, they show that reducing the conditional mutual information between the latent token and the viewpoint variables V, V′ (given the true state transition) directly improves action‑centricity, especially when the latent space is capacity‑limited (as in vector‑quantized models).
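The bound cited above can be laid out as a short chain of identities; the following is a reconstruction from the quantities named in the summary (Z the latent token, A the true action, (S, S′) the state transition, (V, V′) the viewpoints), not the paper's exact proof, with C standing for residual terms the training objective does not control:

```latex
% Action-centricity bound (reconstructed from the summary's notation).
% Expanding mutual information and upper-bounding the conditional entropy:
I(Z;A) \;=\; H(Z) - H(Z \mid A)
       \;\ge\; H(Z) \;-\; \underbrace{I(Z;\, V, V' \mid S, S')}_{\text{viewpoint leakage}} \;-\; C .
% With a vector-quantized latent, H(Z) <= log K (K = codebook size),
% so the only remaining lever is to drive the leakage term toward zero.
```

Read this way, the cross-viewpoint objective is a direct attack on the leakage term: a token that still reconstructs the future after being swapped to another view cannot be carrying much viewpoint-specific information.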

To achieve this, they introduce MVP-LAM (Multi-ViewPoint Latent Action Model), which leverages time-synchronized multi-camera recordings. For each synchronized view v, a visual encoder (e.g., DINOv2) extracts features oᵥₜ and oᵥₜ₊₁. An encoder Eθ produces a continuous latent eᵥₜ = Eθ(oᵥₜ, oᵥₜ₊₁), which is quantized to a discrete token zᵥₜ via a codebook. A decoder Dθ predicts the next observation from the current observation and a token. The training objective combines (i) a self-view reconstruction loss (predict oᵥₜ₊₁ from (oᵥₜ, zᵥₜ)) and (ii) a cross-view reconstruction loss (swap tokens between synchronized views and predict oᵥₜ₊₁ from (oᵥₜ, z̃ᵥₜ), where z̃ᵥₜ is the token inferred from the other view). Because the decoder does not receive the viewpoint identity of the token, any viewpoint-specific information encoded in zᵥₜ increases the cross-view loss; minimizing that loss forces the token to discard such information. Standard VQ-VAE quantization and commitment losses are also included.
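The two reconstruction terms can be sketched in a few lines of numpy. This is a minimal illustration under stated assumptions: the encoder, decoder, and codebook are stand-in random linear maps (names like `enc_W`, `dec_W`, `codebook` are illustrative, not the paper's implementation), and gradient updates, the VQ straight-through estimator, and the commitment loss are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

D, K, C = 16, 8, 4                          # feature dim, codebook size, code dim
enc_W = rng.normal(size=(2 * D, C)) * 0.1   # stand-in for encoder E_theta
dec_W = rng.normal(size=(D + C, D)) * 0.1   # stand-in for decoder D_theta
codebook = rng.normal(size=(K, C))          # VQ codebook

def encode(o_t, o_t1):
    """Continuous latent from a frame pair, snapped to the nearest code."""
    e = np.concatenate([o_t, o_t1]) @ enc_W
    idx = int(np.argmin(((codebook - e) ** 2).sum(axis=1)))
    return codebook[idx], idx

def decode(o_t, z):
    """Predict the next observation from the current observation + token."""
    return np.concatenate([o_t, z]) @ dec_W

# Two time-synchronized views (a, b) of the same transition.
o_a_t, o_a_t1 = rng.normal(size=D), rng.normal(size=D)
o_b_t, o_b_t1 = rng.normal(size=D), rng.normal(size=D)

z_a, _ = encode(o_a_t, o_a_t1)
z_b, _ = encode(o_b_t, o_b_t1)

# (i) self-view reconstruction: each token explains its own view's future.
loss_self = (np.mean((decode(o_a_t, z_a) - o_a_t1) ** 2)
             + np.mean((decode(o_b_t, z_b) - o_b_t1) ** 2))

# (ii) cross-view reconstruction: swapped tokens must still explain the
# other view's future, penalizing viewpoint-specific content in the token.
loss_cross = (np.mean((decode(o_a_t, z_b) - o_a_t1) ** 2)
              + np.mean((decode(o_b_t, z_a) - o_b_t1) ** 2))

loss = loss_self + loss_cross   # plus VQ/commitment terms in practice
```

The key design point is that `decode` is shared across views and never told which view a token came from, so any viewpoint cue the token carries shows up as cross-view reconstruction error.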

Experiments are conducted on the Bridge V2 dataset, which provides synchronized multi‑view manipulation videos, and on out‑of‑distribution (OOD) view‑perturbed test sets. MVP‑LAM is compared against strong single‑view baselines, including UniVLA (a recent VQ‑based latent‑action model) and LAPA. The evaluation focuses on (1) mutual information between latent tokens and ground‑truth robot actions, (2) action prediction accuracy using a simple linear classifier, (3) robustness under viewpoint changes, and (4) downstream manipulation performance after using the latent tokens as pseudo‑labels for Vision‑Language‑Action (VLA) pretraining.
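Evaluation (1) amounts to estimating I(Z;A) between two discrete variables. A plug-in estimator is the standard tool here; the sketch below assumes the continuous ground-truth actions have already been discretized (e.g., by clustering), which is an assumption of this illustration rather than a detail from the summary.

```python
import numpy as np
from collections import Counter

def mutual_information(tokens, actions):
    """Plug-in estimate of I(Z;A) in nats for discrete token/action labels."""
    n = len(tokens)
    pz = Counter(tokens)            # marginal counts of tokens
    pa = Counter(actions)           # marginal counts of (discretized) actions
    pza = Counter(zip(tokens, actions))  # joint counts
    mi = 0.0
    for (z, a), c in pza.items():
        p_joint = c / n
        # p_joint * log( p_joint / (p(z) * p(a)) ), with counts rearranged
        mi += p_joint * np.log(p_joint * n * n / (pz[z] * pa[a]))
    return mi

# Sanity checks: a token that copies a balanced binary action label yields
# I(Z;A) = H(A) = log 2; a token independent of the action yields zero.
actions = [0, 1, 0, 1, 0, 1, 0, 1]
assert abs(mutual_information(actions, actions) - np.log(2)) < 1e-9
assert abs(mutual_information([0, 0, 1, 1], [0, 1, 0, 1])) < 1e-9
```

On real data the plug-in estimate is biased upward for small samples, so comparisons like those reported in the paper are only meaningful when all models are evaluated on the same sample size and action discretization.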

Results show that MVP‑LAM achieves 8–12 % higher I(Z;A) than baselines, while maintaining comparable token entropy, indicating that the extra information is indeed action‑relevant rather than viewpoint noise. Linear action classifiers built on MVP‑LAM tokens reach significantly higher accuracy (up to 9 % improvement) and degrade far less on OOD views (≤2 % drop). When the tokens are used to pretrain VLA models, fine‑tuning on the SIMPLER and LIBERO‑Long benchmarks yields average success‑rate gains of 4.3 % and 5.7 % respectively, especially benefiting long‑horizon and multi‑task scenarios.

Ablation studies confirm that removing the cross‑view loss collapses the mutual information gains and harms OOD robustness, while varying codebook size shows that the method scales without losing its action‑centric property. The authors discuss practical considerations: multi‑view data are more expensive to collect than single‑view, but many existing human‑video datasets already contain synchronized views, and the method can be extended to more than two cameras or even asynchronous setups. Limitations include the reliance on discrete token representations, which may struggle with highly complex continuous motions, suggesting future work on hybrid continuous‑discrete or hierarchical token schemes.

In summary, MVP‑LAM demonstrates that explicitly penalizing viewpoint‑specific information through cross‑viewpoint reconstruction yields latent actions that are both compact and highly informative about true robot actions. These high‑quality pseudo‑labels enable effective large‑scale VLA pretraining and improve downstream manipulation performance, marking a significant step toward scalable, vision‑only robot learning from abundant human video data.

