Mocap Anywhere: Towards Pairwise-Distance based Motion Capture in the Wild (for the Wild)
We introduce a novel motion capture system that reconstructs full-body 3D motion using only sparse pairwise distance (PWD) measurements from body-mounted ultra-wideband (UWB) sensors. Using time-of-flight ranging between wireless nodes, our method eliminates the need for external cameras, enabling robust operation in uncontrolled and outdoor environments. Unlike traditional optical or inertial systems, our approach is shape-invariant and resilient to environmental constraints such as lighting and magnetic interference. At the core of our system is Wild-Poser (WiP for short), a compact, real-time Transformer-based architecture that directly predicts 3D joint positions from noisy or corrupted PWD measurements; these positions can later be used for joint rotation reconstruction via learned methods. WiP generalizes across subjects of varying morphologies, including non-human species, without requiring individual body measurements or shape fitting. Operating in real time, WiP achieves low joint position error and demonstrates accurate 3D motion reconstruction for both human and animal subjects in the wild. Our empirical analysis highlights its potential for scalable, low-cost, and general-purpose motion capture in real-world settings.
💡 Research Summary
The paper introduces a novel motion capture system that relies exclusively on sparse pairwise distance (PWD) measurements obtained from body‑mounted ultra‑wideband (UWB) sensors. Using time‑of‑flight ranging between six wireless nodes placed on key body joints, the system reconstructs full‑body 3D joint positions and global translation without any external cameras, markers, or prior knowledge of the subject’s kinematic model. The core of the approach is a Transformer‑based architecture called Wild‑Poser (WiP). WiP is formulated as a “Refinement‑Generative Model”: it receives a temporal window of past poses and the current noisy distance matrix, then predicts the current pose. The model uses a decoder‑only Transformer augmented with cross‑attention, where queries and keys are derived from the distance matrices and values come from the sequence of past poses. This design enables the network to fuse temporal consistency with noisy spatial cues, effectively denoising the measurements and handling non‑line‑of‑sight (NLoS) conditions typical of UWB signals.
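To make this fusion concrete, below is a minimal PyTorch sketch of one such cross-attention block. All names and dimensions (`WiPCrossAttention`, `d_model`, six nodes, 24 joints) are our own illustrative assumptions, not the paper's published layer sizes. Because attention keys and values must share a sequence length, we read the design as: the query comes from the current frame's distance matrix, the keys from the past frames' distance matrices, and the values from the corresponding past poses.

```python
import torch
import torch.nn as nn

class WiPCrossAttention(nn.Module):
    """Hypothetical sketch of the PWD/pose cross-attention fusion."""

    def __init__(self, n_nodes=6, n_joints=24, d_model=256, n_heads=8):
        super().__init__()
        d_pwd = n_nodes * n_nodes          # flattened pairwise-distance matrix
        d_pose = n_joints * 3              # flattened 3D joint positions
        self.q_proj = nn.Linear(d_pwd, d_model)   # query: current PWD matrix
        self.k_proj = nn.Linear(d_pwd, d_model)   # keys: past PWD matrices
        self.v_proj = nn.Linear(d_pose, d_model)  # values: past poses
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, d_pose)    # decode the current pose

    def forward(self, pwd_now, pwd_past, poses_past):
        # pwd_now:    (B, n_nodes^2)       current noisy distance matrix
        # pwd_past:   (B, T, n_nodes^2)    distance matrices over past window
        # poses_past: (B, T, n_joints*3)   previously predicted poses
        q = self.q_proj(pwd_now).unsqueeze(1)   # (B, 1, d_model)
        k = self.k_proj(pwd_past)               # (B, T, d_model)
        v = self.v_proj(poses_past)             # (B, T, d_model)
        # Past poses are weighted by how spatially similar their distance
        # matrices are to the current one, fusing temporal and spatial cues.
        fused, _ = self.attn(q, k, v)           # (B, 1, d_model)
        return self.head(fused.squeeze(1))      # (B, n_joints*3)
```

Stacking several such blocks, interleaved with self-attention over the past-pose tokens, would yield the decoder-only architecture the summary describes.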
Training employs a combination of losses: an L2 pose loss, a distance‑reconstruction loss that forces the predicted joints to reproduce the input distances, and a temporal smoothing loss to discourage jitter. Data augmentation adds Gaussian noise and random missing entries to simulate realistic UWB errors. The authors collect a comprehensive dataset comprising humans, dogs, and cats performing a variety of motions in indoor, outdoor, and adverse lighting conditions. Experiments show that WiP achieves an average joint position error of 55 mm for humans and around 60 mm for animals, comparable to commercial optical systems (≈30 mm) and substantially better than state‑of‑the‑art sparse IMU methods (≈80 mm). The system runs at 50 FPS on an RTX 3090 GPU, delivering real‑time performance with negligible global drift (≤5 cm per 1 m of translation).
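A minimal sketch of this objective and augmentation scheme follows. The loss weights, noise scale, and drop probability are illustrative placeholders rather than values reported in the paper, and we assume the distance‑reconstruction term is computed only over the joints that carry UWB nodes.

```python
import torch

def wip_loss(pred, target, pwd_meas, prev_pred, node_idx,
             w_dist=0.1, w_smooth=0.01):
    # pred, target, prev_pred: (B, J, 3) predicted / ground-truth / previous-frame joints
    # pwd_meas: (B, N, N) measured node-to-node distances
    # node_idx: indices of the J joints that carry the N UWB nodes
    l_pose = torch.mean((pred - target) ** 2)          # L2 pose loss
    nodes = pred[:, node_idx]                          # (B, N, 3) sensor-bearing joints
    pwd_pred = torch.cdist(nodes, nodes)               # distances implied by the prediction
    l_dist = torch.mean((pwd_pred - pwd_meas) ** 2)    # reproduce the input distances
    l_smooth = torch.mean((pred - prev_pred) ** 2)     # discourage frame-to-frame jitter
    return l_pose + w_dist * l_dist + w_smooth * l_smooth

def augment_pwd(pwd, sigma=0.05, p_drop=0.1):
    # Simulate UWB ranging errors: additive Gaussian noise plus
    # randomly zeroed (missing) entries.
    noisy = pwd + sigma * torch.randn_like(pwd)
    keep = (torch.rand_like(pwd) > p_drop).float()
    return noisy * keep
```

Zeroed entries here stand in for dropped measurements; in practice a separate validity mask could also be fed to the model.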
Ablation studies confirm the importance of the cross‑attention mechanism and the noise‑robust training regime; removing cross‑attention increases error by ~20 %, while training without noise augmentation degrades performance under NLoS by ~35 %. Compared to related works such as Ultra‑Inertial Poser (UIP) and UMotion, which fuse IMU data with UWB for stabilization, WiP is the first to rely solely on pairwise distances, making it truly shape‑invariant and applicable to any articulated creature without per‑subject calibration.
The paper’s contributions are threefold: (1) a portable, camera‑free mocap pipeline that reconstructs both pose and global motion from PWDs; (2) a robust, shape‑invariant Transformer model that denoises and refines noisy distance measurements; and (3) a comprehensive empirical validation on diverse subjects and environments, demonstrating competitive accuracy, real‑time capability, and scalability. Limitations include the requirement of at least six sensors, reduced accuracy for very small limb segments (e.g., fingers), and the current reliance on a separate learned module for joint rotation estimation. Future work will explore optimal sensor placement, ultra‑compact hardware, and end‑to‑end models that infer full 6‑DoF joint orientations directly from distance matrices, as well as extensions to multi‑subject scenarios. Overall, the work opens a promising avenue for low‑cost, outdoor‑ready motion capture that can serve animation, biomechanics, sports analytics, and animal behavior research.
Comments & Academic Discussion
Loading comments...
Leave a Comment