Legs Over Arms: On the Predictive Value of Lower-Body Pose for Human Trajectory Prediction from Egocentric Robot Perception


Predicting human trajectories is crucial for social robot navigation in crowded environments. While most existing approaches treat humans as point masses, we present a study on multi-agent trajectory prediction that leverages different human skeletal features for improved forecast accuracy. In particular, we systematically evaluate the predictive utility of 2D and 3D skeletal keypoints and derived biomechanical cues as additional inputs. Through a comprehensive study on the JRDB dataset and a new social-navigation dataset with 360-degree panoramic videos, we find that focusing on lower-body 3D keypoints yields a 13% reduction in Average Displacement Error, and that augmenting 3D keypoint inputs with corresponding biomechanical cues provides a further 1-4% improvement. Notably, the performance gain persists when using 2D keypoint inputs extracted from equirectangular panoramic images, indicating that monocular surround vision can capture informative cues for motion forecasting. Our finding that robots can forecast human movement efficiently by watching people's legs provides actionable insights for designing sensing capabilities for social robot navigation.


💡 Research Summary

The paper tackles the problem of multi‑agent human trajectory prediction for socially aware robot navigation by investigating how different skeletal cues extracted from an egocentric 360° robot camera influence forecasting performance. While most prior work treats humans as point masses on a 2‑D map, the authors systematically evaluate the predictive value of both 2‑D and 3‑D human keypoints and of derived biomechanical features such as joint articulation angles, step length, and head orientation. Their experiments are conducted on two datasets: the publicly available JRDB dataset, which provides synchronized LiDAR, 360° stereo imagery, and both 2‑D (COCO‑style) and 3‑D (33‑point MediaPipe) pose annotations; and a newly collected “panoramic” dataset captured with an Insta360 X4 camera mounted on an AgileX Scout Mini robot navigating crowded indoor spaces.

The backbone for all experiments is the Human Scene Transformer (HST), a transformer‑based trajectory predictor that can be extended with additional feature streams. Each feature configuration (denoted s) is encoded by a lightweight MLP and fused with the agents’ past positions before self‑attention across agents and timesteps. The models are trained on 2‑second histories (6 frames at 3 Hz) to predict 4‑second futures (12 frames) and are evaluated with four standard metrics: Minimum ADE, Minimum FDE, Most‑Likely ADE, and Negative Log‑Likelihood of positions.
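
For reference, the displacement metrics are straightforward to compute. The following minimal NumPy sketch (variable names and toy data are ours, not from the HST codebase) evaluates Minimum ADE and Minimum FDE over K predicted trajectory modes, using the paper's horizon of 12 future frames at 3 Hz:

```python
import numpy as np

def min_ade_fde(pred, gt):
    """Minimum ADE/FDE over K predicted modes.

    pred: (K, T, 2) array of K candidate future trajectories over T timesteps (xy, meters).
    gt:   (T, 2) array with the ground-truth future trajectory.
    Returns (minADE, minFDE) in meters.
    """
    # Per-mode, per-timestep Euclidean displacement error.
    err = np.linalg.norm(pred - gt[None], axis=-1)   # (K, T)
    ade = err.mean(axis=1)                           # average over the horizon
    fde = err[:, -1]                                 # final timestep only
    return ade.min(), fde.min()

# Toy example: 12 future frames at 3 Hz (4 s), 6 hypothetical prediction modes.
rng = np.random.default_rng(0)
gt = np.cumsum(rng.normal(0.3, 0.05, (12, 2)), axis=0)
pred = gt[None] + rng.normal(0.0, 0.2, (6, 12, 2))
print(min_ade_fde(pred, gt))
```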

Key findings from the JRDB experiments:

  1. Lower‑body 3‑D keypoints (K³ᴰᴸ, 10 joints) alone achieve the largest reduction over the baseline (no pose) – 13 % lower MinADE, 11 % lower MinFDE, 14 % lower Most‑Likely ADE, and 24 % lower NLL. This demonstrates that leg motion carries the strongest signal for short‑term trajectory forecasting.
  2. Adding biomechanical cues derived from those lower‑body joints (joint angles, step length) yields an extra 1‑4 % improvement, confirming that compact motion descriptors are beneficial but not essential when high‑quality 3‑D keypoints are available (a sketch of such cues follows this list).
  3. Upper‑body 3‑D keypoints (K³ᴰᵁ) and full‑body 3‑D (K³ᴰ) also improve over the baseline but consistently underperform the lower‑body‑only configuration.
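
As a concrete illustration of such biomechanical cues, here is a minimal sketch of how a knee articulation angle and a step-length proxy could be derived from lower-body 3D keypoints. The joint positions and the assumption that the z-axis is vertical are ours for illustration; the exact cue definitions used by the authors may differ.

```python
import numpy as np

def joint_angle(a, b, c):
    """Articulation angle (radians) at joint b, between segments b->a and b->c.

    a, b, c: (3,) arrays of 3D joint positions, e.g. hip, knee, ankle.
    """
    u, v = a - b, c - b
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def step_length(left_ankle, right_ankle):
    """Horizontal distance between the ankles (a simple step-length proxy).

    Assumes the third coordinate is the vertical axis, so only x/y are used.
    """
    return np.linalg.norm((left_ankle - right_ankle)[:2])

# Hypothetical lower-body joints in meters (x, y, height).
hip = np.array([0.0, 0.0, 0.9])
knee = np.array([0.05, 0.1, 0.5])
left_ankle = np.array([0.1, 0.35, 0.1])
right_ankle = np.array([-0.1, -0.2, 0.1])
print(np.degrees(joint_angle(hip, knee, left_ankle)))  # knee flexion angle
print(step_length(left_ankle, right_ankle))            # instantaneous step length
```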

When comparing 3‑D versus 2‑D inputs on a paired subset of JRDB where both are available, 3‑D keypoints outperform 2‑D keypoints: K³ᴰ reduces MinADE by 7 % relative to the baseline, while K²ᴰ only yields a marginal gain. The advantage is attributed to depth information that disambiguates walking direction and speed.

The panoramic dataset experiments focus on 2‑D keypoints extracted from equirectangular images without any distortion correction. Despite the inherent stretching near the image poles, lower‑body 2‑D keypoints (K²ᴰᴸ) still improve MinADE from 1.10 to 1.02 m, confirming that monocular surround‑view perception can capture useful gait cues. Full‑body 2‑D keypoints provide similar performance, indicating that the lower‑body subset is sufficient.

From a systems perspective, the study yields actionable design recommendations:

  • Equip robots with 360° cameras to obtain full surround coverage; this eliminates blind spots and enables detection of approaching pedestrians from any direction.
  • Prioritize extraction of lower‑body joint locations (either 3‑D when depth sensors are available or 2‑D when only RGB is present) because they dominate the predictive signal.
  • Augment the joint locations with lightweight biomechanical descriptors if computational budget permits; the gains are modest but consistent.
  • For low‑cost platforms that cannot run real‑time 3‑D pose estimation, a pipeline of 2‑D pose detection (e.g., HRNet) on panoramic frames followed by simple coordinate conversion suffices for accurate trajectory forecasting (a minimal sketch of this conversion follows the list).
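
The coordinate-conversion step can be as simple as mapping each detected keypoint's pixel location to a bearing relative to the robot. The sketch below assumes a standard equirectangular projection, where image x spans 360° of azimuth and image y spans 180° of elevation; the frame resolution and the example pixel are illustrative, not taken from the paper.

```python
import numpy as np

def pixel_to_bearing(u, v, width, height):
    """Map equirectangular pixel coordinates to (azimuth, elevation) in radians.

    Assumes the standard equirectangular projection: u in [0, width) covers
    360 degrees of azimuth, v in [0, height) covers 180 degrees of elevation.
    """
    azimuth = (u / width) * 2.0 * np.pi - np.pi       # [-pi, pi), 0 = straight ahead
    elevation = np.pi / 2.0 - (v / height) * np.pi    # +pi/2 (up) .. -pi/2 (down)
    return azimuth, elevation

# Example: an ankle keypoint detected at pixel (3840, 1500) in a 5760x2880 frame.
az, el = pixel_to_bearing(3840, 1500, 5760, 2880)
print(np.degrees(az), np.degrees(el))  # direction of that joint relative to the robot
```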

Overall, the paper provides the first comprehensive quantitative analysis showing that “watching the legs” is the most effective strategy for a robot to anticipate human motion. By demonstrating that even distorted panoramic 2‑D keypoints retain predictive power, the work bridges the gap between high‑fidelity motion capture and practical robot perception, offering a clear roadmap for building socially compliant navigation systems that are both affordable and robust in crowded, real‑world environments.

