Efficient Mixture-of-Expert for Video-based Driver State and Physiological Multi-task Estimation in Conditional Autonomous Driving

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Road safety remains a critical challenge worldwide, with approximately 1.35 million fatalities attributed to traffic accidents annually, often due to human error. As vehicles advance toward higher levels of automation, challenges persist: automated driving can cognitively over-demand drivers who engage in non-driving-related tasks (NDRTs), or induce drowsiness when driving is the sole task. This underscores the urgent need for an effective Driver Monitoring System (DMS) that can evaluate cognitive load and drowsiness in SAE Level-2/3 autonomous driving contexts. In this study, we propose a novel multi-task DMS, termed VDMoE, which leverages RGB video input to monitor driver states non-invasively. By utilizing key facial features to minimize computational load and integrating remote Photoplethysmography (rPPG) for physiological insights, our approach enhances detection accuracy while maintaining efficiency. Additionally, we optimize the Mixture-of-Experts (MoE) framework to accommodate multi-modal inputs and improve performance across different tasks. A novel prior-inclusive regularization method is introduced to align model outputs with statistical priors, thus accelerating convergence and mitigating overfitting risks. We validate our method on a newly created dataset (MCDD), which comprises RGB video and physiological indicators from 42 participants, and on two public datasets. Our findings demonstrate the effectiveness of VDMoE in monitoring driver states, contributing to safer autonomous driving systems. The code and data will be released.


💡 Research Summary

The paper addresses the pressing need for a comprehensive, non‑intrusive driver monitoring system (DMS) suitable for SAE Level‑2 and Level‑3 autonomous driving, where drivers may experience either low cognitive load (when the vehicle handles most tasks) or excessive load (when they engage in non‑driving‑related tasks). Existing solutions either rely on contact‑based physiological sensors (ECG, EEG, etc.) that are impractical for real‑world deployment, or they focus on a single task such as drowsiness detection using raw video frames, which is computationally expensive and ignores the rich temporal dynamics of facial expressions and physiological signals.

Key Contributions

  1. VDMoE Architecture – A multi‑task model named VDMoE (Visual Driver Monitoring with Mixture‑of‑Experts) that simultaneously estimates three driver states: (a) drowsiness (binary classification), (b) cognitive load (continuous regression), and (c) physiological parameters (heart rate, HR, and respiration rate, RR).
  2. Efficient Input Representation – Instead of feeding full‑resolution video, the method extracts a compact set of facial landmarks together with cropped eye and mouth regions. This drastically reduces the input dimensionality while preserving the most informative facial cues.
  3. Remote Photoplethysmography (rPPG) Integration – Color changes in the selected facial regions are transformed from RGB to YUV, band‑pass filtered (0.7–4 Hz), and assembled into a Spatio‑Temporal Map (STMap). The STMap encodes the subtle skin‑color fluctuations caused by blood volume changes, providing a non‑contact source of HR and RR.
  4. Optimized Mixture‑of‑Experts (MoE) – The classic MoE is extended with a heterogeneous gating mechanism that routes the landmark‑based spatial features and the rPPG‑based temporal features to different expert groups. Two‑layer Multi‑Layer Perceptrons (MLPs) serve as lightweight experts, separating spatial and temporal processing while keeping the total parameter count under 4 M. This design yields a small model footprint and low inference latency (≈12 ms per frame on an RTX 3080, ~80 FPS).
  5. Prior‑Inclusive Regularization – Human‑factors literature provides statistical priors for drowsiness probability, cognitive load distribution, and typical HR/RR ranges. The authors embed these priors as a KL‑divergence regularization term, encouraging the model’s output distributions to stay close to realistic values. This accelerates convergence, mitigates over‑fitting on the limited dataset, and improves generalization.
  6. MCDD Dataset – A new multimodal dataset collected in a driving‑simulator environment with 42 participants (≈105 840 seconds of RGB video). Synchronized physiological ground truth (ECG for HR, respiration belt for RR) and subjective cognitive‑load scores are provided. The dataset fills a gap in publicly available resources that combine video, physiological signals, and cognitive‑load annotations for autonomous‑driving contexts.
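The rPPG preprocessing in contribution 3 can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it assumes per-frame mean color traces have already been extracted from each facial region, and it uses an ideal FFT band-pass in place of whatever filter design the authors actually chose. The function names (`bandpass`, `build_stmap`) and the per-row min-max normalization are illustrative assumptions.

```python
import numpy as np

def bandpass(trace, fs, low=0.7, high=4.0):
    """Ideal FFT band-pass keeping only the plausible pulse band (0.7-4 Hz).

    `trace` is a 1-D per-region color signal sampled at `fs` frames per second.
    """
    spec = np.fft.rfft(trace)
    freqs = np.fft.rfftfreq(trace.shape[-1], d=1.0 / fs)
    spec[(freqs < low) | (freqs > high)] = 0.0   # drop drift and high-freq noise
    return np.fft.irfft(spec, n=trace.shape[-1])

def build_stmap(region_traces, fs):
    """Stack filtered traces from R facial regions into an (R, T) spatio-temporal map.

    `region_traces` has shape (R, T): one mean Y/U/V value per region per frame
    (a single channel is shown here for brevity).
    """
    stmap = np.stack([bandpass(t, fs) for t in region_traces])
    # Per-row min-max normalization, a common step before feeding the map to a network.
    mn = stmap.min(axis=1, keepdims=True)
    mx = stmap.max(axis=1, keepdims=True)
    return (stmap - mn) / (mx - mn + 1e-8)
```

The resulting map lets a downstream model read heart-rate periodicity along the time axis and spatial consistency across regions.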

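The heterogeneous gating of contribution 4 can be illustrated with a small numpy sketch. This is an assumed reading of the design, not the released code: the hidden width, number of experts, random initialization, and the simple additive fusion of the two expert groups are all placeholder choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_expert(d_in, d_hidden, d_out):
    """A lightweight two-layer MLP expert: x -> ReLU(x W1) W2."""
    W1 = rng.normal(0, 0.1, (d_in, d_hidden))
    W2 = rng.normal(0, 0.1, (d_hidden, d_out))
    return lambda x: np.maximum(x @ W1, 0) @ W2

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class HeterogeneousMoE:
    """Route landmark-based spatial features and rPPG-based temporal features
    to separate expert groups, each with its own softmax gate."""

    def __init__(self, d_spatial, d_temporal, d_out, n_experts=4):
        self.spatial_experts = [mlp_expert(d_spatial, 32, d_out) for _ in range(n_experts)]
        self.temporal_experts = [mlp_expert(d_temporal, 32, d_out) for _ in range(n_experts)]
        self.gate_s = rng.normal(0, 0.1, (d_spatial, n_experts))
        self.gate_t = rng.normal(0, 0.1, (d_temporal, n_experts))

    def forward(self, x_s, x_t):
        gs = softmax(x_s @ self.gate_s)   # (B, E) mixture weights, spatial group
        gt = softmax(x_t @ self.gate_t)   # (B, E) mixture weights, temporal group
        out_s = sum(gs[:, i:i + 1] * e(x_s) for i, e in enumerate(self.spatial_experts))
        out_t = sum(gt[:, i:i + 1] * e(x_t) for i, e in enumerate(self.temporal_experts))
        return out_s + out_t              # fused representation for the task heads
```

Keeping each expert a two-layer MLP is what keeps the total parameter count small; the gates, not the experts, carry the task-specific routing.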
Experimental Findings

  • Performance: VDMoE achieves a drowsiness F1‑score of 0.92 (vs. 0.84 for the best CNN baseline), a cognitive‑load MAE of 0.18 (vs. 0.27), HR RMSE of 1.7 bpm and RR RMSE of 2.3 bpm (vs. 2.4 bpm and 3.1 bpm respectively for the strongest baselines).
  • Efficiency: The model uses roughly 3.2 M parameters, a quarter of the ResNet‑50 based multi‑task baseline, and runs in real‑time on commodity GPUs.
  • Ablation Studies: Removing the compact facial‑feature preprocessing and feeding full frames increases computational cost fivefold with only marginal accuracy gains. Excluding the prior‑inclusive regularizer slows convergence by ~30 % and leads to higher validation loss. Replacing heterogeneous gating with a single gate degrades multi‑task performance, confirming the benefit of task‑specific routing.

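The prior-inclusive regularizer whose removal is tested above can be sketched as follows. The paper does not specify the exact form, so this example assumes a categorical prior and penalizes the KL divergence from the prior to the batch-averaged prediction; the direction of the KL term and the `prior_kl` helper name are assumptions.

```python
import numpy as np

def prior_kl(pred_probs, prior_probs, eps=1e-8):
    """KL(prior || mean prediction): penalize batch-level output distributions
    that drift away from statistical priors (e.g. known base rates of drowsiness).

    `pred_probs` has shape (B, K); `prior_probs` has shape (K,).
    """
    p = np.clip(prior_probs, eps, 1.0)
    q = np.clip(pred_probs.mean(axis=0), eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

# Hypothetical use inside a training step:
#   loss = task_loss + lam * prior_kl(model_probs, drowsiness_prior)
```

Because the term is zero when the batch-averaged outputs match the prior and grows otherwise, it nudges early training toward realistic operating points, which is consistent with the reported faster convergence.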
Limitations and Future Work
The approach assumes a relatively frontal, unobstructed face; heavy illumination changes, sunglasses, or masks can degrade rPPG signal quality. Cognitive‑load labels are derived from controlled simulator tasks and may not fully capture the complexity of real‑world driving scenarios. Future research should explore robust face‑tracking under occlusions, domain adaptation to on‑road data, and the inclusion of additional modalities such as vehicle CAN‑bus signals.

Conclusion
VDMoE demonstrates that a carefully engineered, lightweight MoE framework can fuse compact facial geometry with remote photoplethysmography to deliver accurate, real‑time estimates of driver drowsiness, cognitive load, and vital signs using only a standard RGB camera. The introduction of prior‑inclusive regularization further stabilizes training on limited data. By releasing both code and the MCDD dataset, the authors provide a valuable benchmark for the community and a practical pathway toward deploying non‑intrusive, multimodal DMS in Level‑2/3 autonomous vehicles.

