Hybrid-driven Trajectory Prediction Based on Group Emotion
💡 Research Summary
The paper introduces a novel hybrid framework for predicting the future trajectories of pedestrians, cyclists, and vehicles by explicitly incorporating group-level emotional states. Traditional trajectory-prediction approaches fall into two categories: physics-based models such as the Social Force model, which encode repulsive and attractive forces, and data-driven deep-learning models (e.g., Social-LSTM, Trajectron++, PECNet) that learn motion patterns from past trajectories. While both categories can capture basic kinematic constraints, they struggle to represent the subtle influence of human affect on movement decisions, especially in crowded or high-stress environments.
To address this gap, the authors first design a multimodal group-emotion recognition pipeline. From each video frame they extract facial crops, body-pose keypoints, 3-D skeletal motion, and contextual cues (e.g., surrounding objects, lighting). A pre-trained ResNet-50 classifier predicts the probability distribution over three coarse emotion categories (positive, neutral, negative), while a 3-D CNN recognizes concurrent actions such as walking, running, or stopping. Individual emotion probabilities are aggregated across all members of a social group (defined by spatial proximity and gaze alignment) to produce a group-emotion vector. This vector is then embedded into a time-varying Graph Emotion Dynamics (GED) structure, where nodes represent agents and edges are weighted by distance, mutual gaze, and synchronized motion. The GED captures how affect propagates through the group, allowing the model to reason about "contagious anxiety" or "collective excitement."
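The aggregation and edge-weighting steps above can be sketched as follows. The paper only names the ingredients (per-member emotion distributions; distance, mutual gaze, and motion synchrony as edge cues), so the averaging rule, the Gaussian distance kernel, and the mixing coefficients here are illustrative assumptions, not the authors' exact formulation:

```python
import numpy as np

def group_emotion_vector(member_probs):
    """Aggregate per-member emotion distributions (positive, neutral,
    negative) into one group-level vector by averaging, then renormalize.
    The mean is one plausible aggregation; the paper does not specify it."""
    probs = np.asarray(member_probs, dtype=float)
    g = probs.mean(axis=0)
    return g / g.sum()

def ged_edge_weight(dist, gaze_align, motion_sync,
                    sigma=2.0, alpha=0.5, beta=0.5):
    """Combine the three cues the paper names into a single GED edge
    weight: a Gaussian proximity kernel boosted by gaze alignment and
    motion synchrony (both in [0, 1]). sigma, alpha, beta are assumed."""
    proximity = np.exp(-dist**2 / (2.0 * sigma**2))
    return proximity * (1.0 + alpha * gaze_align + beta * motion_sync)
```

For example, two nearby members with distributions `[0.8, 0.1, 0.1]` and `[0.6, 0.3, 0.1]` yield the group vector `[0.7, 0.2, 0.1]`, and a zero-distance pair with full gaze alignment and motion synchrony gets the maximal edge weight under these coefficients.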
The second component is the hybrid trajectory predictor. The physics-based sub-module extends the classic Social Force formulation with vehicle dynamics, generating a baseline motion field that respects collision avoidance, goal attraction, and road-boundary constraints. The data-driven sub-module employs a Transformer encoder-decoder architecture. Its input consists of two streams: (1) a sequence of past 2-D positions for each agent and (2) the corresponding GED embeddings for each timestep. By feeding the GED directly into the multi-head attention mechanism, the network learns to weight historical positions according to the current emotional context. The outputs of the physics and data sub-modules are fused through a learnable, context-aware weighting scheme; the weights are modulated by confidence signals such as occlusion status and emotion-recognition certainty.
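A minimal sketch of the fusion step, with a hand-written logistic gate standing in for the learnable, context-aware weighting (the gate coefficients and the way the two confidence signals enter it are assumptions, not the paper's learned parameters):

```python
import numpy as np

def fuse_predictions(phys_traj, learned_traj, emotion_conf, occlusion):
    """Blend the physics-based and data-driven trajectory proposals
    (each an array of shape (T, 2)) with a scalar gate. Following the
    paper's intuition, the learned branch is trusted more when emotion
    recognition is confident and the agent is unoccluded; the logistic
    form and the coefficients 2.0 / 3.0 are illustrative assumptions."""
    logit = 2.0 * emotion_conf - 3.0 * occlusion
    w = 1.0 / (1.0 + np.exp(-logit))  # weight on the learned branch
    return w * np.asarray(learned_traj) + (1.0 - w) * np.asarray(phys_traj)
```

With both signals at zero the gate is 0.5 and the fused output is the midpoint of the two proposals; heavy occlusion pushes the output toward the physics baseline, which is the behavior the confidence-modulated scheme is meant to produce.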
Experiments are conducted on two datasets. The first is a real-world urban traffic collection comprising synchronized RGB video, LiDAR, and GPS data for pedestrians, cyclists, and cars. The second is an expanded version of the ETH-UCY crowd datasets, augmented with manually annotated group-emotion labels. For each scenario the model receives 5 seconds (30 frames) of historical motion and predicts the next 3 seconds (18 frames). Evaluation metrics are Average Displacement Error (ADE) and Final Displacement Error (FDE). Compared with state-of-the-art baselines (Social-LSTM, Trajectron++, PECNet), the proposed hybrid model achieves an ADE of 0.42 m and an FDE of 0.71 m, representing improvements of roughly 15% and 18%, respectively. Qualitative analysis shows that in "negative-emotion" groups the model correctly anticipates sudden direction changes, while in "positive-emotion" groups it predicts smoother, more cooperative flows.
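Both metrics are standard: ADE averages the per-frame Euclidean error over the whole forecast horizon, while FDE measures only the error at the last predicted frame. For a single agent they can be computed as:

```python
import numpy as np

def ade_fde(pred, gt):
    """Average and Final Displacement Error for one agent.
    pred, gt: arrays of shape (T, 2) holding predicted and ground-truth
    2-D positions over the T forecast frames (T = 18 in the paper)."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    disp = np.linalg.norm(pred - gt, axis=1)  # per-frame Euclidean error
    return disp.mean(), disp[-1]              # ADE, FDE
```

Dataset-level scores are obtained by averaging these per-agent values over all agents and scenarios.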
Ablation studies isolate the contributions of each component. Removing the GED raises ADE to 0.58 m, indicating that emotional context is critical for handling non-linear, socially driven maneuvers. Using only the physics sub-module yields large errors in dense crowds, whereas the data-only version fails to respect hard physical constraints such as collision avoidance. The full hybrid system thus benefits from complementary strengths: the physics module enforces feasibility, while the Transformer leverages emotional cues to refine intent.
Real-time feasibility is demonstrated on an NVIDIA RTX 3080 GPU, where the end-to-end pipeline runs at over 30 frames per second, satisfying latency requirements for autonomous driving or robot navigation.
The paper's contributions can be summarized as follows:
- A multimodal group-emotion detection framework that converts affective cues into a graph-structured representation (GED).
- A hybrid trajectory-prediction architecture that fuses physics-based forces with a Transformer that attends to emotional dynamics.
- Extensive quantitative and qualitative validation on both real-world traffic and simulated crowd scenarios, showing superior accuracy and robustness.
- Proof of real-time operation, opening the door to deployment in safety-critical systems such as autonomous vehicles, crowd-management platforms, and human-robot collaboration.
Future work is suggested in three directions: (i) refining emotion granularity (e.g., fear, anger, joy) and incorporating demographic variables; (ii) exploring causal links between external stimuli (noise, lighting) and emotion shifts; and (iii) extending the approach to 3-D trajectory prediction for aerial drones or indoor service robots. In sum, by explicitly modeling group affect and integrating it with both physical and learned motion priors, the authors present a compelling step toward socially aware, emotion-sensitive trajectory prediction.