DexFormer: Cross-Embodied Dexterous Manipulation via History-Conditioned Transformer

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the [Original Paper Viewer] below or the original arXiv source.

Dexterous manipulation remains one of the most challenging problems in robotics, requiring coherent control of high-DoF hands and arms under complex, contact-rich dynamics. A major barrier is embodiment variability: different dexterous hands exhibit distinct kinematics and dynamics, forcing prior methods to train separate policies or rely on shared action spaces with per-embodiment decoder heads. We present DexFormer, an end-to-end, dynamics-aware cross-embodiment policy built on a modified transformer backbone that conditions on historical observations. By using temporal context to infer morphology and dynamics on the fly, DexFormer adapts to diverse hand configurations and produces embodiment-appropriate control actions. Trained over a variety of procedurally generated dexterous-hand assets, DexFormer acquires a generalizable manipulation prior and exhibits strong zero-shot transfer to Leap Hand, Allegro Hand, and Rapid Hand. Our results show that a single policy can generalize across heterogeneous hand embodiments, establishing a scalable foundation for cross-embodiment dexterous manipulation. Project website: https://davidlxu.github.io/DexFormer-web/.


💡 Research Summary

DexFormer tackles one of the most stubborn challenges in robotic manipulation: the need to control high‑DoF hands and arms across a wide variety of embodiments without retraining a separate policy for each hand. The authors observe that different dexterous hands differ not only in kinematics (joint count, link lengths, functional roles) but also in dynamics (mass distribution, actuator response, contact behavior). Traditional solutions either perform system identification for each hand, rely on massive domain randomization, or add residual dynamics models that require real‑world data per embodiment. All of these approaches either scale poorly or fail to provide zero‑shot transfer.

The core contribution of the paper is a single, morphology‑agnostic policy built on a history‑conditioned transformer. Instead of feeding the current observation alone, the policy receives a fixed‑length window of past observation‑action pairs. The transformer encoder, equipped with causal self‑attention and positional encodings, aggregates this temporal context into a compact latent vector that implicitly encodes the hand’s morphology and dynamics. No explicit morphology identifier or per‑embodiment decoder head is needed; the network learns to infer the latent physical parameters directly from the sequence of sensor readings (joint positions, velocities, fingertip forces, object point clouds, etc.).
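To make the history-conditioning idea concrete, here is a minimal sketch of single-head causal self-attention over a window of embedded observation-action tokens, with the last token's output serving as the history latent. This is an illustrative toy in NumPy, not the paper's architecture: the window length, embedding size, and single-head form are assumptions, and the real model stacks multiple layers with positional encodings.

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention over a history window.
    x: (T, d) sequence of embedded observation-action pairs.
    Each position may attend only to itself and earlier positions."""
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[future] = -np.inf            # mask out future positions
    scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
T, d = 8, 16  # hypothetical window length and embedding dim
x = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = causal_self_attention(x, Wq, Wk, Wv)
latent = out[-1]  # the last token has seen the whole window
```

Because of the causal mask, the first token can attend only to itself, while the final token aggregates the entire window; the paper's policy uses such a window-level latent to infer morphology and dynamics implicitly.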

To enable a unified control interface, the authors define a canonical finger action space of dimension D_F. Each hand’s actuated joints are mapped to functional indices (e.g., thumb abduction, index flexion) within this space; unused dimensions are zero‑padded. This shared action space allows the same policy output to be applied to hands ranging from 16 DoF (Leap Hand) up to 20 DoF (Rapid Hand). A simple smoothing filter (parameter λ) further stabilizes the commands. The high‑level action vector also includes a delta pose for the arm, yielding a combined action a_t = [a_t^arm; a_t^finger], where a_t^arm is the arm’s end‑effector delta pose and a_t^finger ∈ ℝ^{D_F} holds the canonical finger targets.
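The canonical-action mechanism can be sketched as follows. The slot count D_F, the slot assignments, and the exponential form of the λ filter are all assumptions for illustration; the summary only states that a shared zero-padded space and a λ-parameterized smoother exist.

```python
import numpy as np

D_F = 22  # hypothetical canonical finger-action dimension

def to_hand_action(canonical, slot_indices):
    """Extract this hand's actuated-joint targets from the canonical vector.
    slot_indices: the canonical functional slots this hand uses, listed in
    the hand's own joint order. Slots the hand lacks are simply never read
    (they were zero-padded during training)."""
    return canonical[slot_indices]

def smooth(prev, new, lam=0.8):
    """Exponential smoothing -- one plausible form of the paper's λ filter."""
    return lam * prev + (1.0 - lam) * new

# Hypothetical example: a hand that uses the first 16 canonical slots.
canonical = np.zeros(D_F)
canonical[:16] = 0.5          # pretend the policy commands these slots
hand_slots = list(range(16))  # assumed slot assignment, not from the paper
a = to_hand_action(canonical, hand_slots)
a = smooth(np.zeros_like(a), a, lam=0.8)
```

The key design point is that the policy always emits a fixed-size D_F vector; per-embodiment adaptation is just index selection at the output, so no per-hand decoder head is needed.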

