Action-to-Action Flow Matching
Diffusion-based policies have recently achieved remarkable success in robotics by formulating action prediction as a conditional denoising process. However, the standard practice of sampling from random Gaussian noise often requires multiple iterative steps to produce clean actions, leading to high inference latency that constitutes a major bottleneck for real-time control. In this paper, we challenge the necessity of uninformed noise sampling and propose Action-to-Action flow matching (A2A), a novel policy paradigm that shifts from random sampling to initialization informed by the previous action. Unlike existing methods that treat proprioceptive action feedback as a static condition, A2A leverages historical proprioceptive sequences, embedding them into a high-dimensional latent space as the starting point for action generation. This design bypasses costly iterative denoising while effectively capturing the robot's physical dynamics and temporal continuity. Extensive experiments demonstrate that A2A exhibits high training efficiency, fast inference speed, and improved generalization. Notably, A2A enables high-quality action generation in as few as a single inference step (0.56 ms latency), and exhibits superior robustness to visual perturbations and enhanced generalization to unseen configurations. Lastly, we also extend A2A to video generation, demonstrating its broader versatility in temporal modeling. Project site: https://lorenzo-0-0.github.io/A2A_Flow_Matching.
💡 Research Summary
The paper introduces Action‑to‑Action Flow Matching (A2A), a novel policy framework that replaces the conventional random‑noise initialization used in diffusion‑based robotic policies with a history‑driven latent initialization. Instead of starting from a Gaussian distribution and performing dozens of denoising steps, A2A encodes a short sequence of past proprioceptive actions into a high‑dimensional latent vector z₀, which serves as the starting point for generating future actions. A visual encoder extracts a global conditioning vector c from recent image frames. A flow network, built from AdaLN‑MLP blocks, learns a time‑dependent vector field vₜ that transports z₀ to a target latent z₁ in a single ODE integration step. The latent z₁ is then decoded into the next action sequence.
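The generation path described above can be sketched as follows. This is a minimal illustrative stand-in, not the paper's implementation: the dimensions are invented, the learned networks (history encoder, AdaLN-MLP flow network, action decoder) are replaced by fixed random linear maps, and the function names are hypothetical. It only shows the data flow: encode the action history into z₀, then transport it toward z₁ with Euler ODE integration and decode.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): history of 8 past actions
# of dimension 7, latent dimension 64, visual conditioning dimension 32.
H, A_DIM, Z_DIM, C_DIM = 8, 7, 64, 32

# Stand-ins for the learned networks: fixed random linear maps.
W_enc = rng.normal(0, 0.1, (H * A_DIM, Z_DIM))        # action-history encoder
W_dec = rng.normal(0, 0.1, (Z_DIM, H * A_DIM))        # action decoder
W_v = rng.normal(0, 0.1, (Z_DIM + C_DIM + 1, Z_DIM))  # flow network v_t

def encode_history(actions):
    """Embed the past proprioceptive action sequence into the latent start z0."""
    return actions.reshape(-1) @ W_enc

def velocity(z, c, t):
    """Time-dependent vector field v_t(z; c); a linear stand-in for AdaLN-MLP blocks."""
    return np.concatenate([z, c, [t]]) @ W_v

def generate(actions, c, n_steps=1):
    """Transport z0 toward z1 by Euler ODE integration, then decode actions."""
    z = encode_history(actions)
    dt = 1.0 / n_steps
    t = 0.0
    for _ in range(n_steps):
        z = z + dt * velocity(z, c, t)
        t += dt
    return (z @ W_dec).reshape(H, A_DIM)  # decoded next action chunk

past_actions = rng.normal(size=(H, A_DIM))  # recent proprioceptive history
c = rng.normal(size=C_DIM)                  # global visual conditioning vector
next_actions = generate(past_actions, c, n_steps=1)
print(next_actions.shape)  # → (8, 7)
```

The key contrast with diffusion policies is visible in `generate`: because z₀ already encodes the robot's recent motion, a single Euler step can suffice, whereas a Gaussian initialization would need many denoising iterations.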
Training combines three losses: (1) Flow‑matching loss that aligns the predicted vector field with the analytically defined linear transport between z₀ and z₁; (2) Auto‑encoder reconstruction loss that forces the action encoder/decoder to preserve the structure of the action space; (3) Inference‑consistency loss that ties the ODE‑integrated latent to both the ground‑truth latent and the decoded actions, preventing latent collapse. The total loss is a weighted sum of these components.
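The three training terms can be written out as a hedged sketch. All quantities here are random placeholders standing in for network outputs, and the loss weights are illustrative, not the paper's values; what the snippet shows is the structure of each term, in particular that linear transport between z₀ and z₁ makes the flow-matching target velocity the constant z₁ − z₀.

```python
import numpy as np

rng = np.random.default_rng(1)
Z_DIM = 64

def mse(a, b):
    return float(np.mean((a - b) ** 2))

# One training pair: z0 from the action-history encoder, z1 the target
# latent of the ground-truth future actions (both random placeholders here).
z0 = rng.normal(size=Z_DIM)
z1 = rng.normal(size=Z_DIM)

# (1) Flow-matching loss: with linear transport z_t = (1 - t) z0 + t z1,
#     the target velocity is constant, z1 - z0.
t = rng.uniform()
z_t = (1 - t) * z0 + t * z1
v_pred = rng.normal(size=Z_DIM)       # placeholder for v_theta(z_t, t, c)
loss_fm = mse(v_pred, z1 - z0)

# (2) Auto-encoder reconstruction loss preserving the action space.
actions = rng.normal(size=56)
actions_rec = rng.normal(size=56)     # placeholder for decoder(encoder(actions))
loss_rec = mse(actions_rec, actions)

# (3) Inference-consistency loss: the ODE-integrated latent (one Euler
#     step from z0) should land on the ground-truth latent z1.
z1_hat = z0 + 1.0 * v_pred
loss_cons = mse(z1_hat, z1)

# Total loss as a weighted sum (weights are illustrative assumptions).
w_fm, w_rec, w_cons = 1.0, 0.5, 0.5
loss_total = w_fm * loss_fm + w_rec * loss_rec + w_cons * loss_cons
print(loss_total)
```

In training, `v_pred`, `actions_rec`, and the latents would come from the networks and losses would be backpropagated; the consistency term is what ties the integrated latent to both the target latent and, via the decoder, the target actions, discouraging latent collapse.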
Experiments span five simulated manipulation tasks (Close Box, Pick Cube, Stack Cube, Open Drawer, Pick‑Place Bowl) and real‑world robot setups. A2A is compared against nine strong baselines, including VIT‑A (6‑step), flow‑matching UNet/DiT (10 steps), and diffusion models with 40–100 steps. With as few as a single inference step (≈0.56 ms latency), A2A achieves success rates of 86–92 %, outperforming all baselines, which require far more steps yet achieve lower success rates (30–80 %). Training converges up to 20× faster than vanilla diffusion and 5× faster than other flow‑matching methods.
Robustness tests show that A2A maintains performance under visual perturbations (blur, color shifts) and when deployed on unseen robot configurations (different link lengths, payloads). The reliance on action history provides a strong physical prior that compensates for degraded visual input. The authors also demonstrate that the same flow‑matching pipeline can generate temporally coherent video frames, indicating the method’s broader applicability to sequential data beyond control.
Limitations include sensitivity to severely corrupted action histories (e.g., sensor failures). The authors mitigate this by injecting mild noise during training to introduce stochasticity, but more sophisticated fault‑tolerant mechanisms are needed. Additionally, the current implementation focuses on joint angles or end‑effector positions/velocities; extending to force, torque, or multi‑modal control signals will require further architectural and loss‑function design.
In summary, A2A fundamentally reshapes diffusion‑based robot policy design by leveraging the continuity of robot motions: a short, informed transport in latent space replaces long, stochastic denoising chains. This yields orders‑of‑magnitude gains in inference speed, training efficiency, and robustness, while preserving or improving task performance. The work opens a promising direction for real‑time, high‑fidelity robotic control and suggests that flow‑matching with action‑to‑action initialization could become a general tool for time‑series generation.