Improving Robustness of Vision-Language-Action Models by Restoring Corrupted Visual Inputs


Vision-Language-Action (VLA) models have emerged as a dominant paradigm for generalist robotic manipulation, unifying perception and control within a single end-to-end architecture. However, despite their success in controlled environments, reliable real-world deployment is severely hindered by their fragility to visual disturbances. While existing literature extensively addresses physical occlusions caused by scene geometry, a critical failure mode remains largely unexplored: image corruptions. These sensor-level artifacts, ranging from electronic noise and dead pixels to lens contaminants, directly compromise the integrity of the visual signal prior to interpretation. In this work, we quantify this vulnerability, demonstrating that state-of-the-art VLAs such as $π_{0.5}$ and SmolVLA suffer catastrophic performance degradation, dropping from 90% success rates to as low as 2%, under common signal artifacts. To mitigate this, we introduce the Corruption Restoration Transformer (CRT), a plug-and-play, model-agnostic vision transformer designed to immunize VLA models against sensor disturbances. Leveraging an adversarial training objective, CRT restores clean observations from corrupted inputs without requiring computationally expensive fine-tuning of the underlying model. Extensive experiments across the LIBERO and Meta-World benchmarks demonstrate that CRT effectively recovers lost performance, enabling VLAs to maintain near-baseline success rates even under severe visual corruption.


💡 Research Summary

Vision‑Language‑Action (VLA) models have become a leading paradigm for generalist robotic manipulation, integrating perception, language understanding, and action prediction within a single end‑to‑end architecture. Despite impressive performance in clean, simulated environments, these models are extremely fragile when confronted with sensor‑level image corruptions such as electronic noise, dead pixels, lens contaminants, or water droplets. The authors first quantify this vulnerability, showing that state‑of‑the‑art VLAs (π₀.₅ and SmolVLA) suffer catastrophic drops in success rate—from around 90 % on clean inputs to as low as 2 % under common corruptions.
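The corruptions named above (electronic noise, dead pixels) are easy to simulate synthetically. The sketch below is illustrative only: the functions and parameter values are assumptions for demonstration, not the paper's actual corruption benchmark.

```python
import numpy as np

# Hypothetical corruption generators, loosely mirroring the sensor-level
# artifacts named in the text (electronic noise, dead pixels). The
# parameter values here are illustrative assumptions, not the paper's.

def add_gaussian_noise(img, sigma=0.1, rng=None):
    """Simulate electronic sensor noise on an image with values in [0, 1]."""
    rng = rng or np.random.default_rng(0)
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def add_dead_pixels(img, fraction=0.05, rng=None):
    """Zero out a random fraction of pixels to mimic dead sensor cells."""
    rng = rng or np.random.default_rng(0)
    mask = rng.random(img.shape[:2]) < fraction  # per-pixel dropout mask
    out = img.copy()
    out[mask] = 0.0                              # kill all channels at masked pixels
    return out

clean = np.full((64, 64, 3), 0.5)  # placeholder camera observation
noisy = add_gaussian_noise(clean)
dead = add_dead_pixels(clean)
```

Measuring a policy's success rate on corrupted versus clean observations, as the authors do, then quantifies the robustness gap.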

To address the problem without the costly retraining of the entire VLA, the paper introduces the Corruption Restoration Transformer (CRT), a plug‑and‑play Vision Transformer placed upstream of the VLA. CRT is model‑agnostic: it receives a corrupted observation $x'$ and outputs a restored image $\hat{x}$ that is fed to the unchanged policy network. The architecture builds on the image‑restoration transformer of
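The plug-and-play wiring described above can be sketched as a thin wrapper: the restorer sits upstream of a frozen policy, so the policy itself never needs fine-tuning. The `restorer` and `policy` callables below are hypothetical stand-ins, not the paper's actual CRT or VLA implementations.

```python
# Minimal sketch of the model-agnostic, plug-and-play arrangement:
# corrupted observation x' -> restorer -> x_hat -> unchanged VLA policy.

class RestoredPolicy:
    """Wrap a frozen VLA policy with a restoration front-end."""

    def __init__(self, restorer, policy):
        self.restorer = restorer   # maps corrupted image x' -> restored x_hat
        self.policy = policy       # unchanged VLA: (image, instruction) -> action

    def act(self, corrupted_obs, instruction):
        x_hat = self.restorer(corrupted_obs)    # restoration step (CRT's role)
        return self.policy(x_hat, instruction)  # frozen downstream policy

# Toy stand-ins: an identity restorer and a policy that echoes its inputs.
agent = RestoredPolicy(restorer=lambda x: x,
                       policy=lambda img, text: ("act", img, text))
```

Because the wrapper only changes what the policy sees, any VLA with an image-and-instruction interface can be protected this way without touching its weights.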

