CoLA-Flow Policy: Temporally Coherent Imitation Learning via Continuous Latent Action Flow Matching for Robotic Manipulation

Notice: This research summary and analysis were automatically generated using AI technology. For definitive details, please refer to the original arXiv paper.

Learning long-horizon robotic manipulation requires jointly achieving expressive behavior modeling, real-time inference, and stable execution, which remains challenging for existing generative policies. Diffusion-based approaches provide strong modeling capacity but typically incur high inference latency, while flow matching enables fast one-step generation yet often leads to unstable execution when applied directly in the raw action space. We propose CoLA-Flow Policy, a trajectory-level imitation learning framework that performs flow matching in a continuous latent action space. By encoding action sequences into temporally regularized latent trajectories and learning an explicit latent-space flow, the proposed approach decouples global motion structure from low-level control noise, resulting in smooth and reliable long-horizon execution. CoLA-Flow Policy further incorporates geometry-aware point cloud conditioning and execution-time multimodal modulation, with visual cues evaluated as a representative modality in real-world settings. Experimental results in simulation and on physical robot platforms demonstrate that CoLA-Flow Policy achieves near single-step inference, substantially improves trajectory smoothness and task success over flow-based baselines operating in the raw action space, and remains significantly more efficient than diffusion-based policies.


💡 Research Summary

Title: CoLA‑Flow Policy: Temporally Coherent Imitation Learning via Continuous Latent Action Flow Matching for Robotic Manipulation

Abstract (English):
The paper introduces CoLA‑Flow Policy, a trajectory‑level imitation‑learning framework that addresses the three‑fold challenge of expressive behavior modeling, real‑time inference, and stable execution in long‑horizon robotic manipulation. Instead of generating raw motor commands directly, the method first encodes action sequences into a continuous, temporally coherent latent space using a recurrent encoder and variational regularization. A flow‑matching model is then trained to learn a time‑dependent velocity field in this latent space, enabling a single‑step ODE integration that produces an entire latent trajectory at inference time. Geometry‑aware point‑cloud features condition the latent flow, while execution‑time visual cues from a wrist‑mounted camera are injected into the decoder via FiLM layers, allowing multimodal adaptation without altering the latent planning. Experiments on six simulated tasks and three real‑world UR5e manipulation scenarios demonstrate that CoLA‑Flow achieves near‑single‑step inference (~5 ms), improves trajectory smoothness by up to 93.7 %, and raises task success rates by up to 25 % compared with raw‑action flow baselines, while remaining significantly faster than diffusion‑based policies.

Key Contributions:

  1. Latent‑Space Trajectory Modeling: Action chunks are encoded with a lightweight temporal convolution and a GRU, producing a smooth latent trajectory. A KL‑regularized variational objective enforces compactness and continuity.
  2. Flow Matching in Latent Space: Consistency flow matching learns a time‑dependent velocity field νθ(t, z). Because the induced flow map fθ(t, z) = z + (1−t)·νθ(t, z) sends any point directly to the trajectory endpoint, a single evaluation fθ(0, ·) at a Gaussian sample yields the full latent trajectory, eliminating iterative denoising.
  3. Geometry‑Aware Conditioning: Sparse point clouds are processed by a dual‑branch encoder (local residual conv + center MLP) to generate scene embeddings that condition the flow network, ensuring the generated trajectory respects the 3D environment.
  4. Execution‑Time Multimodal Modulation: Visual features from a wrist camera are encoded by a pretrained ResNet‑18 and injected into the action decoder via FiLM layers, allowing the policy to adapt to visual changes at execution without re‑planning.
  5. Extensive Evaluation: The method is benchmarked against raw‑action flow policies, diffusion policies, and hierarchical diffusion baselines in both simulation and on a physical robot. Results show orders‑of‑magnitude faster inference, markedly smoother trajectories, and higher success rates.
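The chunked encoder from contribution 1 can be sketched in a few lines. This is a hypothetical PyTorch illustration, not the paper's implementation: all layer sizes (`feat_dim`, `latent_dim`, the mean-pooling over chunk features) are illustrative assumptions; only the overall structure (1-D convolution per chunk, GRU over chunk features, KL-regularized variational head) comes from the summary.

```python
# Illustrative sketch: 1-D conv per action chunk, GRU over chunk features,
# and a variational head whose KL term regularizes toward N(0, I).
# All dimensions are assumptions, not the paper's actual hyperparameters.
import torch
import torch.nn as nn

class LatentActionEncoder(nn.Module):
    def __init__(self, action_dim=7, chunk_len=8, feat_dim=64, latent_dim=16):
        super().__init__()
        self.chunk_len = chunk_len
        # 1-D conv over time within a chunk, then pool to one feature per chunk
        self.chunk_conv = nn.Conv1d(action_dim, feat_dim, kernel_size=3, padding=1)
        self.gru = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.mu_head = nn.Linear(feat_dim, latent_dim)
        self.logvar_head = nn.Linear(feat_dim, latent_dim)

    def forward(self, actions):
        # actions: (B, H, action_dim) with horizon H divisible by chunk_len
        B, H, d = actions.shape
        K = H // self.chunk_len
        chunks = actions.reshape(B * K, self.chunk_len, d).transpose(1, 2)
        x = self.chunk_conv(chunks).mean(dim=-1)           # (B*K, feat_dim)
        x = x.reshape(B, K, -1)                            # chunk feature sequence
        h, _ = self.gru(x)                                 # temporally coherent features
        mu, logvar = self.mu_head(h), self.logvar_head(h)  # posterior q(z | A)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        # KL( q(z|A) || N(0, I) ) enforces a smooth, compact latent manifold
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1).mean()
        return z, kl
```

Each latent code zₖ summarizes one chunk, so the GRU's hidden state carries temporal context across chunk boundaries, which is what makes the latent trajectory smooth rather than a bag of independent codes.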
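The dual-branch point-cloud encoder of contribution 3 can likewise be sketched. Again a hypothetical illustration: the pooling choice, layer widths, and fusion layer are assumptions; the summary only specifies a local residual convolutional branch plus a centroid MLP branch producing a scene embedding.

```python
# Illustrative sketch of a dual-branch point-cloud encoder: a local branch with
# residual pointwise convolutions and a center branch (MLP on the centroid),
# fused into a scene embedding that conditions the latent flow.
# All sizes and the max-pooling choice are assumptions for illustration.
import torch
import torch.nn as nn

class DualBranchPointEncoder(nn.Module):
    def __init__(self, feat_dim=64, out_dim=32):
        super().__init__()
        # local branch: pointwise convs with a residual connection
        self.local_in = nn.Conv1d(3, feat_dim, kernel_size=1)
        self.local_res = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, kernel_size=1), nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=1),
        )
        # center branch: MLP on the cloud centroid
        self.center_mlp = nn.Sequential(
            nn.Linear(3, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim)
        )
        self.fuse = nn.Linear(2 * feat_dim, out_dim)

    def forward(self, points):
        # points: (B, N, 3) sparse point cloud
        x = self.local_in(points.transpose(1, 2))        # (B, feat_dim, N)
        x = x + self.local_res(x)                        # residual local features
        local = x.max(dim=-1).values                     # permutation-invariant pooling
        center = self.center_mlp(points.mean(dim=1))     # centroid embedding
        return self.fuse(torch.cat([local, center], dim=-1))  # scene embedding
```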

Method Overview:

  • Latent Action Encoder: Split an action trajectory Aₜ (H × dₐ) into K chunks of length c. Each chunk is transformed into a feature vector xₖ via a 1‑D convolution. A GRU processes the sequence {xₖ} to produce latent codes {zₖ} ∈ ℝ^{d_z}.
  • Variational Regularization: The encoder outputs a posterior q(z|A) that is penalized with KL divergence against N(0, I), encouraging a smooth latent manifold.
  • Latent Decoder with FiLM: For each latent code zₖ, a lightweight MLP reconstructs the corresponding action chunk Âₖ, modulated by visual features vₜ through FiLM parameters (γ(vₜ), β(vₜ)).
  • Flow Matching: A neural network νθ(t, z) predicts the velocity field. Consistency flow matching defines fθ(t, z) = z + (1−t)·νθ(t, z). Time‑dependent scaling c(t) = 1/√(t² + (1−t)²) normalizes inputs across noise levels.
  • Training: The flow network is trained with a mean‑squared error between predicted and true latent transitions, combined with the KL term.
  • Inference: Sample z̃ₖ ∼ N(0, I), compute ẑₖ = fθ(0, z̃ₖ) in a single forward pass (one evaluation of the flow map at the noise sample), then decode each ẑₖ into executable actions using the FiLM‑conditioned decoder.
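The flow-matching and inference steps above can be sketched directly from the stated formulas fθ(t, z) = z + (1−t)·νθ(t, z) and c(t) = 1/√(t² + (1−t)²). The network architecture, conditioning shapes, and hidden sizes below are illustrative assumptions; only the flow map, the input scaling, and the single-step sampling scheme come from the summary.

```python
# Illustrative sketch of latent-space consistency flow matching: v_theta predicts
# a velocity field, f_theta(t, z) = z + (1 - t) * v_theta(t, z) maps a point at
# time t to the trajectory endpoint, and one evaluation at a Gaussian sample
# (t = 0) yields the full latent trajectory. Network shapes are assumptions.
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    def __init__(self, latent_dim=16, cond_dim=32, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, t, z, cond):
        # c(t) = 1 / sqrt(t^2 + (1 - t)^2) normalizes inputs across noise levels
        c = 1.0 / torch.sqrt(t ** 2 + (1.0 - t) ** 2)
        t_feat = t.expand(z.shape[:-1]).unsqueeze(-1)
        return self.net(torch.cat([c * z, cond, t_feat], dim=-1))

def f_theta(v_net, t, z, cond):
    # Flow map: sends z at time t to the predicted endpoint of the latent flow
    return z + (1.0 - t) * v_net(t, z, cond)

@torch.no_grad()
def sample_latent_trajectory(v_net, cond, K=4, latent_dim=16):
    # Single-step generation: push Gaussian noise through f_theta at t = 0
    z0 = torch.randn(cond.shape[0], K, latent_dim)
    t = torch.zeros(1)
    return f_theta(v_net, t, z0, cond)
```

Because the flow map targets the endpoint from any time t, inference needs no ODE solver loop, which is the source of the near-single-step latency reported below.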
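The FiLM-conditioned decoder can be sketched as follows. This is a hypothetical illustration: the hidden width, activation choices, and the assumption of a 512-dimensional ResNet-18 feature vector are mine; the summary specifies only that γ(vₜ) and β(vₜ) modulate a lightweight per-chunk MLP.

```python
# Illustrative sketch of the FiLM-conditioned action decoder: wrist-camera
# features v_t produce a per-channel scale gamma(v_t) and shift beta(v_t) that
# modulate the hidden layer, so visual changes at execution time adapt the
# decoded actions without touching the latent plan. Sizes are assumptions.
import torch
import torch.nn as nn

class FiLMActionDecoder(nn.Module):
    def __init__(self, latent_dim=16, vis_dim=512, hidden=128,
                 chunk_len=8, action_dim=7):
        super().__init__()
        self.fc_in = nn.Linear(latent_dim, hidden)
        self.gamma = nn.Linear(vis_dim, hidden)  # FiLM scale from visual features
        self.beta = nn.Linear(vis_dim, hidden)   # FiLM shift from visual features
        self.fc_out = nn.Linear(hidden, chunk_len * action_dim)
        self.chunk_len, self.action_dim = chunk_len, action_dim

    def forward(self, z_k, v_t):
        # z_k: (B, latent_dim) latent code; v_t: (B, vis_dim) visual features,
        # e.g. from a pretrained ResNet-18
        h = torch.relu(self.fc_in(z_k))
        h = self.gamma(v_t) * h + self.beta(v_t)  # FiLM modulation
        out = self.fc_out(torch.relu(h))
        return out.reshape(-1, self.chunk_len, self.action_dim)  # action chunk
```

Since FiLM touches only the decoder, the latent plan produced by the flow stays fixed while execution-time vision reshapes the low-level commands, which is what the summary means by multimodal adaptation "without re-planning."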

Experimental Findings:

  • Latency: CoLA‑Flow requires ~5 ms per trajectory, compared to 48 ms for a diffusion policy and 12 ms for a raw‑action flow baseline.
  • Smoothness: Measured by jerk and a custom smoothness metric, CoLA‑Flow reduces high‑frequency noise by up to 93.7 % relative to raw‑action flow.
  • Success Rate: Across all tasks, CoLA‑Flow achieves an average success of 82 %, versus 57 % for raw‑action flow and 48 % for diffusion.
  • Ablations: Removing the GRU encoder drops success to 45 %; omitting KL regularization degrades smoothness by ~30 %; excluding 3D conditioning reduces performance on cluttered scenes by 15 %.

Discussion:
The work demonstrates that moving the generative burden from raw motor space to a well‑structured latent space can reconcile speed and stability, two properties traditionally at odds in generative robot policies. However, the approach relies on careful selection of latent dimensionality and currently only leverages visual feedback; integrating tactile or force sensing could further improve robustness. Moreover, scalability to multi‑robot or highly dynamic environments remains an open question.

Conclusion:
CoLA‑Flow Policy offers a principled solution for long‑horizon robotic manipulation by combining continuous latent action encoding, efficient flow‑matching generation, geometry‑aware scene conditioning, and execution‑time multimodal adaptation. The resulting system delivers near‑instant inference, smooth trajectories, and higher task success, marking a significant step toward practical, real‑time imitation learning for complex manipulation tasks.

