PO-GUISE+: Pose and object guided transformer token selection for efficient driver action recognition
We address the task of identifying distracted driving by analyzing in-car videos using efficient transformers. Although transformer models have achieved outstanding performance in human action recognition tasks, their high computational costs limit their deployment onboard a vehicle. We introduce PO-GUISE+, a multi-task video transformer that, given an input clip, predicts the distracted driving action, the driver’s pose, and the interacting object. Our enhanced features for token selection are specifically adapted to driver actions by leveraging information about object interaction and the driver’s pose. With PO-GUISE+, we significantly reduce the model’s computational demands while maintaining or improving baseline accuracy across various computational budgets. Additionally, to evaluate our model’s performance in real-world scenarios, we have developed benchmarks on a Jetson computing platform, demonstrating its effectiveness across different configurations and computational budgets. Our model outperforms current state-of-the-art results on the Drive&Act, 100-Driver, and 3MDAD datasets, while having superior efficiency compared to existing video transformer-based methods.
💡 Research Summary
The paper introduces PO‑GUISE+, a multi‑task video transformer designed for distracted‑driver detection from in‑car video streams while dramatically reducing computational cost. Building on state‑of‑the‑art ViT‑based backbones (VideoMAEv2 and InternVideo2), the authors augment the standard token sequence with learnable heat‑map tokens that encode temporal motion of human body landmarks and interacting objects. These heat‑map tokens are decoded by a lightweight deconvolution‑convolution head into motion heat‑maps, providing supervision for pose and object location without requiring any external detector at inference time.
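The decoding step above can be pictured with a minimal NumPy sketch. This is an illustrative stand-in, not the paper's actual head: each heat-map token is linearly projected to a coarse spatial grid and then upsampled by nearest-neighbour repetition, where the paper uses a learned deconvolution-convolution head. The names `decode_heatmaps` and `proj` are hypothetical.

```python
import numpy as np

def decode_heatmaps(heatmap_tokens, proj, grid=8, scale=4):
    """Illustrative stand-in for a deconvolution-convolution head:
    project each heat-map token to a coarse grid, then upsample by
    nearest-neighbour repetition. `proj` plays the role of learned
    decoder weights (shapes and names are assumptions)."""
    coarse = heatmap_tokens @ proj              # (n_tokens, grid*grid)
    coarse = coarse.reshape(-1, grid, grid)     # (n_tokens, grid, grid)
    # nearest-neighbour upsampling in place of a learned deconvolution
    up = coarse.repeat(scale, axis=1).repeat(scale, axis=2)
    return up                                   # (n_tokens, grid*scale, grid*scale)
```

With a 16-dim token projected to an 8x8 grid and a 4x upsample, each token yields one 32x32 motion heat-map.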
To address the quadratic complexity of transformers, PO‑GUISE+ employs a two‑stage token selection mechanism. In the first stage (pruning), each visual token’s relevance is assessed by its attention scores toward the class token and the pose and object heat‑map tokens. Tokens with low combined relevance are discarded according to a continuous token‑keep rate. In the second stage (merging), the remaining tokens are clustered and averaged, further compressing the representation while preserving salient information. This module is inserted at multiple depths of the transformer, so layers after each insertion point operate on a progressively reduced token set and focus on the most informative features.
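The two stages can be sketched as follows. This is a deliberately simplified NumPy version under stated assumptions: the combined relevance is taken as a plain average of the three attention scores, and the merging stage is a single k-means-style assignment with the most relevant tokens as seeds; the paper's exact scoring and clustering may differ, and `prune_and_merge` is a hypothetical name.

```python
import numpy as np

def prune_and_merge(tokens, attn_cls, attn_pose, attn_obj,
                    keep_rate=0.7, n_merged=4):
    """Two-stage token selection sketch: attention-guided pruning,
    then similarity-based merging (illustrative simplification)."""
    # Stage 1 (pruning): combine relevance toward the class token and
    # the pose/object heat-map tokens, keep the top fraction.
    relevance = (attn_cls + attn_pose + attn_obj) / 3.0
    k = max(1, int(round(keep_rate * tokens.shape[0])))
    keep = np.argsort(relevance)[::-1][:k]      # most relevant first
    kept = tokens[keep]

    # Stage 2 (merging): one k-means-style assignment step; seeds are
    # the most relevant tokens, and every kept token is averaged into
    # its nearest seed by cosine similarity.
    n_merged = min(n_merged, kept.shape[0])
    seeds = kept[:n_merged]
    norm = lambda x: x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)
    sim = norm(kept) @ norm(seeds).T            # (k, n_merged)
    assign = sim.argmax(axis=1)
    merged = np.stack([kept[assign == c].mean(axis=0)
                       if np.any(assign == c) else seeds[c]
                       for c in range(n_merged)])
    return merged                               # (n_merged, dim)
```

Starting from 16 tokens, a 0.7 keep rate first discards the 5 least relevant tokens, and merging then compresses the remaining 11 into 4 averaged representatives.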
Training is performed in a unified multi‑task fashion: a classification head predicts the distracted‑driving action, while separate heat‑map heads regress driver pose and object location. The loss combines classification cross‑entropy with heat‑map regression losses, encouraging the model to learn a shared representation that simultaneously supports all three tasks. Importantly, pseudo‑labels for pose and objects are generated offline using ViTPose and YOLO‑based detectors, but the final model is detector‑free.
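The combined objective described above can be written as a short sketch. Assumptions are flagged in the comments: the heat-map regression term is shown as mean-squared error with unit loss weights, which is a common choice but not confirmed by the summary, and all names are illustrative.

```python
import numpy as np

def multitask_loss(cls_logits, cls_target, pose_pred, pose_gt,
                   obj_pred, obj_gt, w_pose=1.0, w_obj=1.0):
    """Sketch of the unified objective: classification cross-entropy
    plus heat-map regression for pose and object location. MSE and the
    weights w_pose/w_obj are assumptions, not the paper's exact terms."""
    # numerically stable softmax cross-entropy on the action class
    z = cls_logits - cls_logits.max()
    ce = -(z[cls_target] - np.log(np.exp(z).sum()))
    # heat-map regression terms (pseudo-labels come from offline
    # ViTPose / YOLO detectors; the deployed model is detector-free)
    pose_mse = np.mean((pose_pred - pose_gt) ** 2)
    obj_mse = np.mean((obj_pred - obj_gt) ** 2)
    return ce + w_pose * pose_mse + w_obj * obj_mse
```

Because the three tasks share one backbone, gradients from the heat-map terms shape the same token representation that the classification head reads.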
Extensive experiments on three large driver‑action datasets—Drive&Act, 100‑Driver, and 3MDAD—show that PO‑GUISE+ outperforms current state‑of‑the‑art methods (e.g., DRVMon‑VM, TransDARC) by 2–5 % in top‑1 accuracy while using far fewer FLOPs. Ablation studies confirm that incorporating both pose and object heat‑maps yields the greatest performance gain, and that the attention‑guided pruning surpasses naïve Top‑K token selection.
Real‑world feasibility is demonstrated through benchmarks on NVIDIA Jetson platforms (Xavier NX and Nano). PO‑GUISE+ achieves up to 30 % reduction in FLOPs and latency in the range of 20–40 ms per clip, satisfying real‑time constraints for on‑vehicle driver monitoring systems.
In summary, the contributions are: (1) a novel token‑selection strategy that leverages pose and object cues, (2) a detector‑free multi‑task transformer that simultaneously outputs distraction class, driver pose, and object location, (3) state‑of‑the‑art accuracy on multiple driver‑action benchmarks, and (4) thorough evaluation of efficiency‑accuracy trade‑offs on embedded hardware, establishing PO‑GUISE+ as a practical solution for next‑generation in‑vehicle safety systems.