Modeling 3D Pedestrian-Vehicle Interactions for Vehicle-Conditioned Pose Forecasting


Accurately predicting pedestrian motion is crucial for safe and reliable autonomous driving in complex urban environments. In this work, we present a 3D vehicle-conditioned pedestrian pose forecasting framework that explicitly incorporates surrounding vehicle information. To support this, we enhance the Waymo-3DSkelMo dataset with aligned 3D vehicle bounding boxes, enabling realistic modeling of multi-agent pedestrian-vehicle interactions. We introduce a sampling scheme to categorize scenes by pedestrian and vehicle count, facilitating training across varying interaction complexities. Our proposed network adapts the TBIFormer architecture with a dedicated vehicle encoder and pedestrian-vehicle interaction cross-attention module to fuse pedestrian and vehicle features, allowing predictions to be conditioned on both historical pedestrian motion and surrounding vehicles. Extensive experiments demonstrate substantial improvements in forecasting accuracy and validate different approaches for modeling pedestrian-vehicle interactions, highlighting the importance of vehicle-aware 3D pose prediction for autonomous driving. Code is available at: https://github.com/GuangxunZhu/VehCondPose3D


💡 Research Summary

The paper addresses a critical gap in autonomous‑driving perception: the lack of vehicle‑aware 3D pedestrian pose forecasting. While many recent works predict pedestrian trajectories or 2D/3D poses, they either ignore surrounding agents or limit interactions to other pedestrians. Moreover, existing 3D pose datasets (e.g., MuPoTS‑3D, JRDB‑GMP) are collected in non‑driving contexts, making it difficult to train models that understand realistic vehicle‑pedestrian dynamics.

To overcome this, the authors extend the Waymo‑3DSkelMo dataset, which already provides high‑quality 3D skeletal motion reconstructed from LiDAR, by aligning it with the 3D vehicle bounding boxes available in the original Waymo Open Dataset. This results in a unified 3‑D space containing over 2.4 million pedestrian poses and thousands of vehicle boxes across 837 scenes (≈ 4 h of urban driving).

Because scenes vary widely in agent density, the authors introduce a systematic sampling scheme. First, a KD-Tree is used to find groups of pedestrians whose root-joint distances stay below 18 m (the same threshold used in the original TBIFormer training set). Then, for each temporal segment, they compute the minimum distance between any pedestrian and any vehicle; only segments in which at least one vehicle lies within a configurable distance threshold (0–15 m) are kept. By varying the pedestrian and vehicle counts, they create twelve balanced experimental conditions (1–3 pedestrians × 1–4 vehicles). This stratification enables fair evaluation of models under increasing interaction complexity.
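The grouping-and-filtering step above can be sketched with a KD-tree query. The function name, array layout, and per-frame simplification below are illustrative assumptions, not the authors' code; only the 18 m grouping radius and the 0–15 m vehicle threshold come from the summary.

```python
import numpy as np
from scipy.spatial import cKDTree

def sample_interaction_segments(ped_roots, veh_centers, ped_radius=18.0, veh_radius=15.0):
    """Group pedestrians whose root joints lie within ped_radius of each other,
    then flag pedestrians that have at least one vehicle within veh_radius.

    ped_roots:   (P, 3) pedestrian root-joint positions for one frame
    veh_centers: (V, 3) vehicle bounding-box centers for the same frame
    """
    ped_tree = cKDTree(ped_roots)
    # Pairs of pedestrians closer than the 18 m grouping threshold.
    pairs = ped_tree.query_pairs(r=ped_radius)
    # Distance from each pedestrian to its nearest vehicle.
    veh_tree = cKDTree(veh_centers)
    min_dist, _ = veh_tree.query(ped_roots)
    keep = min_dist <= veh_radius      # pedestrians with a nearby vehicle
    return pairs, keep
```

In the actual pipeline this filter would run per temporal segment (keeping a segment if any frame passes), whereas the sketch handles a single frame for clarity.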

The core technical contribution is a vehicle‑conditioned pose forecasting network built on top of TBIFormer, a state‑of‑the‑art 3D pose predictor that uses a Temporal Body Partition Module (TBPM) and a Transformer encoder‑decoder. The new architecture adds:

  1. Vehicle Encoder – Each vehicle is represented by its eight 3‑D bounding‑box corners. The corners are grouped into 12 logical edges (mirroring the body‑part partitioning for pedestrians). Frame‑to‑frame displacements are computed, then a Discrete Cosine Transform (DCT) compresses the trajectory, discarding high‑frequency noise. The resulting compact representation is down‑sampled temporally.

  2. Pedestrian‑Vehicle Interaction Cross‑Attention (PVI‑CA) – Pedestrian features (from the original TBIFormer encoder) and vehicle features (from the Vehicle Encoder) are fused via a multi‑head cross‑attention mechanism. The authors extend the Trajectory‑Aware Relative Position Encoding (TRPE) to the cross‑attention, providing explicit spatial‑temporal cues about the relative positions of body parts and vehicle groups.

  3. Transformer Decoder – Receives the fused representation and predicts future pedestrian displacement sequences, which are then inverse‑DCT‑ed and cumulatively summed to reconstruct 3‑D pose trajectories.
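The displacement → DCT → truncation → inverse-DCT → cumulative-sum pipeline shared by the vehicle encoder and the decoder can be sketched numerically. The sequence length, number of tracked points, and number of retained coefficients below are illustrative choices, not values from the paper.

```python
import numpy as np
from scipy.fft import dct, idct

T, J = 16, 8                          # frames and tracked 3D points (illustrative)
rng = np.random.default_rng(0)
positions = np.cumsum(rng.normal(size=(T, J, 3)), axis=0)  # synthetic 3D tracks

disp = np.diff(positions, axis=0)         # frame-to-frame displacements, (T-1, J, 3)
coeffs = dct(disp, axis=0, norm='ortho')  # temporal DCT per point and axis
K = 8                                     # keep only low-frequency coefficients
coeffs[K:] = 0.0                          # discard high-frequency noise

disp_rec = idct(coeffs, axis=0, norm='ortho')         # decoder side: inverse DCT
pos_rec = positions[0] + np.cumsum(disp_rec, axis=0)  # cumulative sum back to poses
```

With all coefficients kept the round trip is exact (DCT-II with `norm='ortho'` is orthogonal); truncating to the first K coefficients trades reconstruction fidelity for a compact, noise-robust representation.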

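A minimal single-head version of the cross-attention fusion in PVI-CA can be sketched as follows. The real module uses learned multi-head projections and the TRPE term; both are reduced here to a plain scaled dot-product with an optional additive bias, so this is a structural sketch rather than the authors' implementation.

```python
import numpy as np

def cross_attention(ped_tokens, veh_tokens, rel_pos_bias=None):
    """Single-head scaled dot-product cross-attention sketch.

    ped_tokens:   (P, d) pedestrian body-part queries
    veh_tokens:   (V, d) vehicle edge-group keys/values
    rel_pos_bias: optional (P, V) additive bias standing in for the
                  TRPE-style relative-position term
    """
    d = ped_tokens.shape[-1]
    scores = ped_tokens @ veh_tokens.T / np.sqrt(d)       # (P, V) similarities
    if rel_pos_bias is not None:
        scores = scores + rel_pos_bias                    # spatial-temporal cue
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)        # softmax over vehicles
    return attn @ veh_tokens, attn                        # fused features, weights
```

Each pedestrian body-part token thus becomes a convex combination of vehicle features, weighted by learned (here, raw) similarity plus the relative-position bias.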
The training objective minimizes the standard MPJPE (Mean Per Joint Position Error) and its Procrustes-aligned variant; a 3-D IoU metric is additionally used to assess bounding-box consistency.
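The two pose metrics can be sketched as follows, assuming (T, J, 3) pose sequences. The rigid alignment uses the standard Kabsch/Procrustes solution over the whole sequence; scale alignment, which some PA-MPJPE variants include, is omitted for brevity.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per Joint Position Error: average Euclidean joint distance.
    pred, gt: (T, J, 3) pose sequences."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """Procrustes-aligned MPJPE: rigidly align pred to gt (rotation +
    translation, Kabsch solution) before measuring the error."""
    p = pred.reshape(-1, 3)
    g = gt.reshape(-1, 3)
    p0 = p - p.mean(axis=0)
    g0 = g - g.mean(axis=0)
    U, _, Vt = np.linalg.svd(p0.T @ g0)       # 3x3 cross-covariance SVD
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # reflection guard
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T   # optimal rotation
    aligned = p0 @ R.T + g.mean(axis=0)
    return np.linalg.norm(aligned - g, axis=-1).mean()
```

A globally rotated prediction scores poorly under MPJPE but (near) zero under the aligned variant, which is why both are reported: the first penalizes absolute placement errors, the second isolates pose-shape errors.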

Experimental results on the enhanced Waymo‑3DSkelMo show that adding vehicle information reduces MPJPE by 12‑18 % across all vehicle‑count settings, with the largest gains when four vehicles are present. Ablation studies confirm that (a) the vehicle encoder’s DCT compression is essential for stable training, (b) the 12‑edge grouping outperforms naïve whole‑box encoding, and (c) the TRPE extension significantly improves attention focus on nearby vehicles. Qualitative visualizations illustrate that the model correctly anticipates pedestrian deceleration or evasive path changes when a vehicle approaches from the side, behaviors that a vehicle‑agnostic model fails to capture.

Limitations acknowledged by the authors include: (i) the absence of explicit vehicle dynamics such as speed, acceleration, or intent labels; (ii) no ground‑truth interaction annotations, which prevents quantitative assessment of the attention maps; and (iii) dataset bias toward Waymo’s geographic region, raising questions about cross‑city generalization.

Future work directions suggested are: integrating vehicle intent prediction (e.g., stopping, turning) into a multi‑task framework, combining synthetic simulation data with real‑world recordings for domain adaptation, and extending the cross‑attention analysis to produce interpretable interaction heatmaps for safety verification.

In summary, the paper delivers a practical pipeline for vehicle‑conditioned 3‑D pedestrian pose forecasting, demonstrates its effectiveness on a large, real‑world driving dataset, and opens new research avenues for richer multi‑agent motion modeling in autonomous driving.

