DRAW2ACT: Turning Depth-Encoded Trajectories into Robotic Demonstration Videos

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Video diffusion models provide powerful real-world simulators for embodied AI but remain limited in controllability for robotic manipulation. Recent works on trajectory-conditioned video generation address this gap but often rely on 2D trajectories or single modality conditioning, which restricts their ability to produce controllable and consistent robotic demonstrations. We present DRAW2ACT, a depth-aware trajectory-conditioned video generation framework that extracts multiple orthogonal representations from the input trajectory, capturing depth, semantics, shape and motion, and injects them into the diffusion model. Moreover, we propose to jointly generate spatially aligned RGB and depth videos, leveraging cross-modality attention mechanisms and depth supervision to enhance the spatio-temporal consistency. Finally, we introduce a multimodal policy model conditioned on the generated RGB and depth sequences to regress the robot’s joint angles. Experiments on Bridge V2, Berkeley Autolab, and simulation benchmarks show that DRAW2ACT achieves superior visual fidelity and consistency while yielding higher manipulation success rates compared to existing baselines.


💡 Research Summary

DRAW2ACT introduces a depth‑aware, trajectory‑conditioned video diffusion framework designed to generate realistic robotic manipulation demonstrations while offering fine‑grained controllability. The core idea is to transform a user‑provided 3D trajectory—represented as a sequence of pixel coordinates (x, y) together with relative depth values (d)—into three complementary conditioning signals: (1) a depth‑encoded trajectory map, (2) object‑level semantic features extracted by DINOv2, and (3) a textual prompt augmented with pixel‑level positional information.

The depth‑aware trajectory is obtained by first estimating per‑frame depth using the Video Depth Anything model, then sampling depth at the trajectory points. This yields a (x, y, d) sequence that directly encodes the robot arm’s motion in three dimensions, overcoming the occlusion and dis‑occlusion problems inherent to purely 2D guidance. DINOv2 features are extracted from the manipulated object in the initial frame, cropped to suppress background, and then propagated along the trajectory to form a spatio‑temporal feature map aligned with the latent video space. The textual prompt, encoded by a T5 model, supplies a high‑level description of the task and reinforces the positional cues.
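The depth-sampling step above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's actual interface: the function name, array shapes, and the nearest-pixel sampling are all assumptions; the paper uses Video Depth Anything to produce the per-frame depth maps.

```python
import numpy as np

def depth_encode_trajectory(trajectory_xy, depth_maps):
    """Lift a 2D pixel trajectory to (x, y, d) by sampling per-frame depth.

    trajectory_xy: (T, 2) integer pixel coordinates (x, y), one per frame.
    depth_maps:    (T, H, W) relative depth, e.g. from a monocular estimator.
    Returns:       (T, 3) depth-encoded trajectory. (Hypothetical sketch.)
    """
    T = trajectory_xy.shape[0]
    # Nearest-pixel depth lookup; note depth_maps is indexed [t, y, x].
    d = np.array([depth_maps[t, trajectory_xy[t, 1], trajectory_xy[t, 0]]
                  for t in range(T)])
    return np.concatenate([trajectory_xy.astype(np.float64), d[:, None]], axis=1)

# Toy example: 4 frames of 8x8 depth, a diagonal trajectory.
depth = np.linspace(0.0, 1.0, 4 * 8 * 8).reshape(4, 8, 8)
traj = np.stack([np.arange(4), np.arange(4)], axis=1)  # (x, y) per frame
xyz = depth_encode_trajectory(traj, depth)
print(xyz.shape)  # (4, 3)
```

In practice one would likely use bilinear interpolation for sub-pixel trajectory points, but nearest-pixel sampling keeps the idea visible.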

All three representations are injected into a Diffusion Transformer (DiT) backbone. DINOv2 features undergo a patch‑embedding and a specialized Fusion Block that applies a learnable gating mechanism, LayerNorm, and residual addition to the transformer hidden states. The depth‑encoded trajectory and the augmented text are incorporated via standard cross‑attention. This multi‑modal conditioning enables the diffusion model to respect both the geometric path of the robot and the semantic identity of the object throughout the generation process.
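The Fusion Block's gated residual injection can be sketched as follows. This is a minimal numpy approximation under stated assumptions: a sigmoid scalar gate stands in for the learnable gating mechanism, the LayerNorm has no learned affine parameters, and all names and shapes are illustrative rather than the paper's implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize over the feature dimension (no learned scale/shift here)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def fusion_block(hidden, dino_feats, gate_logit=0.0):
    """Gated residual injection of semantic features into hidden states.

    hidden, dino_feats: (tokens, dim). gate_logit would be learned in
    training; it is a fixed scalar here purely for illustration.
    """
    gate = 1.0 / (1.0 + np.exp(-gate_logit))        # sigmoid gate in [0, 1]
    return hidden + gate * layer_norm(dino_feats)   # gated residual addition

rng = np.random.default_rng(0)
h = rng.standard_normal((16, 64))   # transformer hidden states
f = rng.standard_normal((16, 64))   # patch-embedded DINOv2 features
out = fusion_block(h, f, gate_logit=-2.0)
print(out.shape)  # (16, 64)
```

Initializing the gate near zero is a common design choice for such injection blocks: the conditioning branch starts as a near-identity and its influence grows only as training finds it useful.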

A distinctive aspect of DRAW2ACT is the simultaneous generation of RGB and depth videos. The two modalities are concatenated along the temporal axis, forming a single long sequence that is processed by the same 3D causal VAE encoder and the DiT denoiser. Self‑attention across the combined sequence allows the model to learn complementary spatial cues: depth supervision enforces geometric consistency, while RGB synthesis preserves visual realism. This design avoids extra embedding layers for depth and reduces training complexity.
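The joint-sequence construction above amounts to a temporal concatenation of the two modalities' latents, so the denoiser's self-attention spans both. A hedged sketch, with hypothetical latent shapes:

```python
import numpy as np

# Illustrative latent video tensors, (T, C, H, W); shapes are assumptions.
rgb_latents = np.zeros((8, 4, 32, 32))    # RGB latent frames
depth_latents = np.ones((8, 4, 32, 32))   # depth latent frames, same shape

# Concatenate along the temporal axis: one long sequence of 2T latent frames
# that the same 3D causal VAE / DiT stack processes end to end.
joint = np.concatenate([rgb_latents, depth_latents], axis=0)
print(joint.shape)  # (16, 4, 32, 32)
```

Because both modalities share one token sequence, no depth-specific embedding layers are needed; the model distinguishes the halves by position alone.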

To turn the generated videos into actionable robot commands, the authors propose a multimodal policy model. RGB and depth latent sequences are separately passed through spatial and temporal transformers, then fused via cross‑attention. A ResNet‑based decoder finally regresses the robot’s joint angles and gripper state. By leveraging both visual and depth cues, the policy model achieves higher prediction accuracy and more stable execution compared with models that rely on RGB alone.
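The cross-attention fusion step can be illustrated with a single-head, projection-free sketch: RGB tokens act as queries over depth tokens as keys/values. This is an assumption-laden simplification of the policy model's fusion stage, not its actual architecture (which also includes spatial/temporal transformers and a ResNet-based decoder).

```python
import numpy as np

def cross_attention(q, kv):
    """Scaled dot-product cross-attention, one head, no learned projections.

    q:  (Nq, d) query tokens (here: RGB features).
    kv: (Nk, d) key/value tokens (here: depth features).
    """
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ kv                              # depth-informed RGB tokens

rng = np.random.default_rng(1)
rgb_tokens = rng.standard_normal((10, 32))
depth_tokens = rng.standard_normal((12, 32))
fused = cross_attention(rgb_tokens, depth_tokens)
print(fused.shape)  # (10, 32)
```

The fused tokens would then be pooled and decoded into joint angles and a gripper state; in a real implementation, q, k, and v each pass through learned linear projections.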

Experiments were conducted on two real‑world datasets (Bridge V2 and Berkeley Autolab) and several simulated benchmarks. Evaluation metrics included video quality (motion consistency, background consistency, subject consistency), object trajectory error, and actual manipulation success rate. DRAW2ACT consistently outperformed state‑of‑the‑art baselines such as LevIT, TORA, and MotionCtrl across all metrics. Notably, the inclusion of depth supervision reduced object‑arm collision errors and improved success rates by 10–15 percentage points. Ablation studies confirmed that each conditioning component contributes uniquely: depth‑aware trajectories improve spatial accuracy, DINOv2 features preserve object semantics, and the augmented text enhances overall task coherence. The full combination yields the best performance.

In summary, DRAW2ACT presents a comprehensive solution for controllable robotic video synthesis by integrating 3D trajectory information, high‑level semantic features, and multimodal generation. It bridges the gap between video diffusion models and precise robot control, opening avenues for data‑efficient policy learning, online simulation, and extension to multi‑robot or multi‑object scenarios.

