Risky Action Recognition in Lane Change Video Clips using Deep Spatiotemporal Networks with Segmentation Mask Transfer
Advanced driver assistance and automated driving systems rely on risk estimation modules to predict and avoid dangerous situations. Current methods use expensive sensor setups and complex processing pipelines, limiting their availability and robustness. To address these issues, we introduce a novel deep learning based action recognition framework for classifying dangerous lane change behavior in short video clips captured by a monocular camera. We designed a deep spatiotemporal classification network that uses the pre-trained state-of-the-art instance segmentation network Mask R-CNN as its spatial feature extractor for this task. The Long Short-Term Memory (LSTM) and shallower final classification layers of the proposed method were trained on a semi-naturalistic lane change dataset with annotated risk labels. A comprehensive comparison of state-of-the-art feature extractors was carried out to find the best network layout and training strategy. The best result, with a 0.937 AUC score, was obtained with the proposed network. Our code and trained models are available open-source.
💡 Research Summary
Advanced driver assistance systems (ADAS) and automated driving systems (ADS) rely heavily on risk estimation modules to anticipate and avoid hazardous situations. Existing approaches typically depend on expensive sensors such as lidar and radar, or on complex processing pipelines that limit scalability and robustness. This paper proposes a low‑cost, camera‑only solution for detecting risky lane‑change maneuvers using deep spatiotemporal learning. The core idea is “Semantic Mask Transfer” (SMT): a pre‑trained Mask R‑CNN (ResNet‑101) generates instance segmentation masks for each frame of a short video clip (average length ≈10 s, sampled to 16 frames). These masks are overlaid on the original RGB frames, producing a masked video that emphasizes the geometry of relevant road users while suppressing background clutter and illumination variations.
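The mask-transfer preprocessing described above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the segmentation model is assumed to be run separately (here its output is a dummy binary mask array), and the blend factor `alpha` and per-instance colouring are illustrative choices the summary does not specify.

```python
import numpy as np

def sample_frames(video: np.ndarray, n_frames: int = 16) -> np.ndarray:
    """Uniformly sample n_frames from a clip of shape (T, H, W, 3)."""
    idx = np.linspace(0, len(video) - 1, n_frames).round().astype(int)
    return video[idx]

def overlay_masks(frame: np.ndarray, masks: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Blend binary instance masks (N, H, W) onto an RGB frame (H, W, 3).

    Each instance gets a flat colour; alpha controls the blend strength
    (an illustrative value -- the paper's exact overlay recipe may differ).
    """
    out = frame.astype(np.float32)
    rng = np.random.default_rng(0)          # deterministic per-instance colours
    for mask in masks:
        colour = rng.uniform(0, 255, size=3)
        m = mask.astype(bool)
        out[m] = (1 - alpha) * out[m] + alpha * colour
    return out.clip(0, 255).astype(np.uint8)

# Example: a 100-frame dummy clip sampled down to 16 frames, first frame masked.
clip = np.zeros((100, 64, 64, 3), dtype=np.uint8)
frames = sample_frames(clip, 16)
masks = np.zeros((2, 64, 64), dtype=bool)   # stand-in for Mask R-CNN output
masks[0, 10:20, 10:20] = True
masked = overlay_masks(frames[0], masks)
```

In practice the masks would come from a pre-trained Mask R-CNN (e.g. `torchvision.models.detection.maskrcnn_resnet50_fpn`), thresholded to binary per-instance maps before blending.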
The masked frames are fed into a convolutional neural network (CNN) backbone to extract high‑level spatial features. Several state‑of‑the‑art backbones (ResNet‑50/101, EfficientNet‑B0, MobileNet‑V2, etc.) are evaluated. The resulting feature sequence is processed by a Long Short‑Term Memory (LSTM) network in a many‑to‑one configuration; only the final hidden state is passed to a dense soft‑max layer that outputs a binary risk label (safe vs. risky). Two training strategies are explored: (1) training the entire CNN‑LSTM pipeline from scratch, and (2) leveraging transfer learning by freezing or fine‑tuning the pre‑trained CNN while only training the LSTM and classifier. The latter approach dramatically reduces the amount of training data required and improves generalization, which is crucial given the modest size of the lane‑change dataset (860 annotated clips).
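The many-to-one CNN-LSTM head can be sketched as below, assuming per-frame features have already been extracted by a frozen backbone. The feature dimension (2048, matching a ResNet final pooling layer) and the hidden size are illustrative assumptions, not values reported in the summary.

```python
import torch
import torch.nn as nn

class ManyToOneLSTMClassifier(nn.Module):
    """LSTM over per-frame CNN features; only the last hidden state is classified."""

    def __init__(self, feat_dim: int = 2048, hidden: int = 256, n_classes: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)  # softmax is applied in the loss

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim), e.g. 16 frames of backbone features
        _, (h_n, _) = self.lstm(feats)
        return self.head(h_n[-1])                 # many-to-one: last hidden state only

model = ManyToOneLSTMClassifier()
logits = model(torch.randn(2, 16, 2048))          # 2 clips, 16 frames each
```

Training only this head (strategy 2 in the text) keeps the gradient path short and the parameter count small relative to fine-tuning the whole backbone.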
The dataset originates from a previously released naturalistic driving study that includes front‑camera video, vehicle dynamics, and manually assigned subjective risk scores. For this work, only the video modality is used; risk labels are binary (safe = 0, risky = 1). Experiments compare three model families: (i) Frame‑by‑Frame CNN (FbF‑CNN) that classifies each frame independently, (ii) standard CNN‑LSTM that consumes raw frames, and (iii) the proposed SMT+CNN‑LSTM. Performance is measured primarily by the area under the ROC curve (AUC). The SMT+CNN‑LSTM model achieves an AUC of 0.937, substantially outperforming FbF‑CNN (≈0.71) and raw CNN‑LSTM (≈0.84). Among backbones, ResNet‑101 yields the best feature representation, while EfficientNet‑B0 offers a favorable trade‑off between accuracy and computational cost.
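The AUC metric used for all comparisons can be computed directly from the classifier scores; a small self-contained sketch using the rank-based (Mann-Whitney) formulation, which is equivalent to integrating the ROC curve:

```python
import numpy as np

def roc_auc(labels: np.ndarray, scores: np.ndarray) -> float:
    """AUC via the Mann-Whitney U statistic: the probability that a random
    risky clip is scored above a random safe clip (ties count half)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    for s in np.unique(scores):                  # average ranks for tied scores
        tied = scores == s
        ranks[tied] = ranks[tied].mean()
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Toy example: one of the four risky/safe pairs is mis-ranked -> AUC = 0.75.
y = np.array([0, 0, 1, 1])
s = np.array([0.1, 0.4, 0.35, 0.8])
auc = roc_auc(y, s)
```

In a real evaluation one would typically call `sklearn.metrics.roc_auc_score`, which implements the same quantity.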
Analysis shows that the mask overlay concentrates the network’s attention on moving vehicles and lane markings, making the temporal dynamics learned by the LSTM more discriminative for risky maneuvers (e.g., sudden lane encroachment, reduced inter‑vehicle gaps). Transfer learning proves essential: using pre‑trained weights mitigates overfitting on the limited dataset and accelerates convergence.
Limitations are acknowledged. The risk labels are subjective, so correlation with objective collision metrics remains to be validated. The mask generation step adds computational overhead, potentially challenging real‑time deployment on embedded automotive hardware. Moreover, the dataset is confined to a specific geographic and weather context, leaving the model’s robustness to diverse lighting, weather, and road layouts untested.
Future work is suggested in three directions: (1) integrating lightweight segmentation models or joint segmentation‑classification networks to reduce latency, (2) fusing additional modalities such as steering angle, speed, or radar returns to enrich the risk context, and (3) expanding the dataset across multiple regions and conditions, possibly employing domain adaptation techniques to improve generalization.
In summary, the paper introduces a novel, camera‑only risk assessment pipeline that combines semantic mask transfer with a CNN‑LSTM architecture. It demonstrates that high‑quality risk detection (AUC = 0.937) is achievable without costly sensors, and provides an extensive benchmark of backbone networks and training strategies. All code and trained models are released as open‑source, facilitating reproducibility and further research in vision‑based automotive risk perception.