Vision-Proprioception Fusion with Mamba2 in End-to-End Reinforcement Learning for Motion Control
End-to-end reinforcement learning (RL) for motion control trains policies directly from sensor inputs to motor commands, enabling unified controllers for different robots and tasks. However, most existing methods are either blind (proprioception-only) or rely on fusion backbones with unfavorable compute-memory trade-offs. Recurrent controllers struggle with long-horizon credit assignment, and Transformer-based fusion incurs quadratic cost in token length, limiting temporal and spatial context. We present a vision-driven cross-modal RL framework built on SSD-Mamba2, a selective state-space backbone that applies state-space duality (SSD) to enable both recurrent and convolutional scanning with hardware-aware streaming and near-linear scaling. Proprioceptive states and exteroceptive observations (e.g., depth tokens) are encoded into compact tokens and fused by stacked SSD-Mamba2 layers. The selective state-space updates retain long-range dependencies with markedly lower latency and memory use than quadratic self-attention, enabling longer look-ahead, higher token resolution, and stable training under limited compute. Policies are trained end-to-end under curricula that randomize terrain and appearance and progressively increase scene complexity. A compact, state-centric reward balances task progress, energy efficiency, and safety. Across diverse motion-control scenarios, our approach consistently surpasses strong state-of-the-art baselines in return, safety (collisions and falls), and sample efficiency, while converging faster at the same compute budget. These results suggest that SSD-Mamba2 provides a practical fusion backbone for resource-constrained robotic and autonomous systems in engineering informatics applications.
💡 Research Summary
The paper introduces a novel end‑to‑end reinforcement‑learning (RL) framework for quadrupedal motion control that fuses visual and proprioceptive information using the SSD‑Mamba2 architecture. Traditional approaches either rely solely on proprioception (blind control) or employ vision‑proprioception fusion backbones that suffer from poor compute‑memory trade‑offs. Recurrent networks struggle with long‑horizon credit assignment, while Transformer‑based fusion incurs quadratic complexity in token length, limiting both spatial resolution and temporal context for real‑time control.
SSD‑Mamba2 (Selective State‑Space Model 2) addresses these limitations by leveraging state‑space duality (SSD), which unifies recurrent evolution and block‑wise parallel scanning under a single mathematical formulation. This yields near‑linear time and memory complexity, hardware‑aware streaming, and the ability to process variable‑length token sequences efficiently. The model’s input‑dependent state updates and exponentially decaying dynamics provide a stabilizing inductive bias that preserves long‑range dependencies without the gradient‑vanishing problems typical of vanilla RNNs.
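The duality described above can be illustrated with a toy diagonal recurrence. The sketch below (a minimal NumPy illustration, not the paper's implementation) shows that the step-by-step recurrent form h_t = a_t·h_{t-1} + b_t·x_t, y_t = c_t·h_t and an equivalent closed-form "parallel" evaluation over cumulative products of the decay terms produce identical outputs; the real SSD formulation organizes this same equivalence into hardware-efficient block-wise scans.

```python
import numpy as np

def selective_scan_recurrent(a, b, c, x):
    """Sequential form: h_t = a_t * h_{t-1} + b_t * x_t,  y_t = c_t * h_t."""
    h = 0.0
    ys = []
    for t in range(len(x)):
        h = a[t] * h + b[t] * x[t]
        ys.append(c[t] * h)
    return np.array(ys)

def selective_scan_parallel(a, b, c, x):
    """Closed form: y_t = c_t * sum_{s<=t} (prod_{r=s+1..t} a_r) * b_s * x_s.

    The input-dependent decays a_r (|a_r| < 1) give the exponentially
    decaying dynamics that stabilize long-range credit assignment.
    """
    T = len(x)
    y = np.empty(T)
    for t in range(T):
        # accumulate each past contribution, attenuated by the decay product
        y[t] = c[t] * sum(np.prod(a[s + 1 : t + 1]) * b[s] * x[s]
                          for s in range(t + 1))
    return y

a = np.array([0.9, 0.8, 0.7])   # input-dependent decay per step
b = np.array([1.0, 1.0, 1.0])   # input gates
c = np.array([1.0, 1.0, 1.0])   # output gates
x = np.array([1.0, 2.0, 3.0])   # input sequence

y_rec = selective_scan_recurrent(a, b, c, x)
y_par = selective_scan_parallel(a, b, c, x)
```

Both paths compute the same sequence, which is the property that lets SSD-Mamba2 train with parallel scans and deploy with a constant-memory recurrent loop for streaming inference.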
In the proposed architecture, proprioceptive data (a 93‑dimensional vector containing IMU readings, joint angles, and recent actions) are embedded via a lightweight multilayer perceptron (MLP). Four consecutive depth frames (64 × 64 pixels each) are processed by a compact convolutional neural network (CNN) and patchified into spatial tokens. These tokens, together with the proprioceptive embedding, form a token sequence that is fed into multiple stacked SSD‑Mamba2 layers. The fused representation is then passed to separate policy and value heads, which are optimized jointly with Proximal Policy Optimization (PPO).
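The shape bookkeeping of this tokenization can be sketched as follows. This is a minimal NumPy illustration of the fusion interface only; the single-matrix "MLP", the 16 × 16 patch size, and the 256-dimensional token width are illustrative assumptions, and the CNN and SSD-Mamba2 stack are elided.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_embed(x, w):
    """Stand-in for the lightweight proprioception MLP: (93,) -> (d_model,)."""
    return np.tanh(w @ x)

def patchify(frames, patch=16):
    """Split stacked depth frames (4, 64, 64) into non-overlapping patches.

    Each 64x64 frame yields (64/patch)^2 = 16 patches of patch*patch values,
    so 4 frames -> 64 spatial tokens of dimension 256.
    """
    c, h, w = frames.shape
    toks = frames.reshape(c, h // patch, patch, w // patch, patch)
    toks = toks.transpose(0, 1, 3, 2, 4).reshape(-1, patch * patch)
    return toks

d_model = 256                                   # assumed token width
proprio = rng.normal(size=93)                   # IMU + joints + recent actions
depth = rng.normal(size=(4, 64, 64))            # four stacked depth frames

proprio_tok = mlp_embed(proprio, 0.1 * rng.normal(size=(d_model, 93)))  # (256,)
vision_toks = patchify(depth)                                           # (64, 256)

# One proprioceptive token + 64 vision tokens -> sequence for the Mamba2 stack
tokens = np.vstack([proprio_tok[None, :], vision_toks])                 # (65, 256)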
Training incorporates three key ingredients: (1) extensive domain randomization (terrain type, lighting, texture) to broaden environmental diversity; (2) an obstacle‑density curriculum that gradually increases scene complexity; and (3) a compact state‑centric reward function that balances forward progress, energy efficiency, and safety. The forward term rewards the projection of base velocity onto the desired direction, the energy term penalizes the squared joint torques, and an alive bonus encourages the agent to stay upright. Optional sphere‑collection rewards further test task‑specific behaviors.
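The reward terms above can be sketched as a small function. The structure (velocity projection, squared-torque penalty, alive bonus) follows the summary; the coefficient values and argument names are illustrative assumptions, not the paper's tuned weights, and the optional sphere-collection term is omitted.

```python
import numpy as np

def compact_reward(base_vel, goal_dir, torques, fell,
                   w_fwd=1.0, w_energy=5e-4, alive_bonus=0.05):
    """State-centric reward: progress + energy penalty + alive bonus.

    base_vel : (3,) base linear velocity in the world frame
    goal_dir : (3,) desired heading (need not be unit length)
    torques  : (n_joints,) applied joint torques
    fell     : True once the robot has fallen
    """
    # forward term: projection of base velocity onto the desired direction
    fwd = np.dot(base_vel, goal_dir / np.linalg.norm(goal_dir))
    # energy term: penalize squared joint torques
    energy = np.sum(np.square(torques))
    # alive bonus: encourage staying upright
    alive = 0.0 if fell else alive_bonus
    return w_fwd * fwd - w_energy * energy + alive

r = compact_reward(np.array([1.0, 0.0, 0.0]),   # moving along the goal axis
                   np.array([2.0, 0.0, 0.0]),   # desired direction
                   np.zeros(12),                # idle actuators
                   fell=False)
```

Keeping the reward a function of a few state quantities, rather than of dense perceptual features, is what the authors mean by "compact and state-centric": it transfers across the terrain curriculum without per-terrain shaping.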
Empirical evaluation spans six terrain categories (rough ground, stairs, mud, snow, irregular blocks, composite obstacles) at three difficulty levels. Baselines include a proprioception‑only LSTM, a Vision‑Transformer fusion model, and earlier Mamba variants. SSD‑Mamba2 consistently outperforms these baselines: it achieves 18‑27 % higher cumulative return, reduces collisions and falls by more than 40 %, and improves sample efficiency by roughly 1.5× under the same compute budget (8 GB GPU memory, 1 M environment steps). Importantly, increasing token resolution to 128 × 128 does not cause memory overflow; SSD‑Mamba2’s near‑linear scaling keeps memory usage below 2 GB, enabling higher‑fidelity perception.
From a systems perspective, the selective scanning mechanism can be executed in a streaming fashion on embedded hardware, yielding inference latencies of about 3 ms and supporting control frequencies above 30 Hz. The input‑dependent updates also allow the model to skip unnecessary computation, reducing power consumption, which is critical for battery‑operated robots.
In summary, SSD‑Mamba2 provides a practical, efficient, and scalable backbone for cross‑modal fusion in motion‑control RL. It simultaneously offers (1) linear scaling for high‑resolution, long‑horizon sequences; (2) hardware‑friendly streaming execution; and (3) stable learning through selective state updates. The authors conclude that SSD‑Mamba2 opens the door to safe, anticipatory, and compute‑constrained robotic applications, and they outline future work on sim‑to‑real transfer, multi‑agent cooperation, and integration of additional sensors such as LiDAR and high‑rate IMU data.