A New Dataset and Framework for Robust Road Surface Classification via Camera-IMU Fusion
Road surface classification (RSC) is a key enabler for environment-aware predictive maintenance systems. However, existing RSC techniques often fail to generalize beyond narrow operational conditions due to limited sensing modalities and datasets that lack environmental diversity. This work addresses these limitations by introducing a multimodal framework that fuses images and inertial measurements using a lightweight bidirectional cross-attention module followed by an adaptive gating layer that adjusts modality contributions under domain shifts. Given the limitations of current benchmarks, especially regarding lack of variability, we introduce ROAD, a new dataset composed of three complementary subsets: (i) real-world multimodal recordings with RGB-IMU streams synchronized using a gold-standard industry datalogger, captured across diverse lighting, weather, and surface conditions; (ii) a large vision-only subset designed to assess robustness under adverse illumination and heterogeneous capture setups; and (iii) a synthetic subset generated to study out-of-distribution generalization in scenarios difficult to obtain in practice. Experiments show that our method achieves a +1.4 pp improvement over the previous state-of-the-art on the PVS benchmark and an +11.6 pp improvement on our multimodal ROAD subset, with consistently higher F1-scores on minority classes. The framework also demonstrates stable performance across challenging visual conditions, including nighttime, heavy rain, and mixed-surface transitions. These findings indicate that combining affordable camera and IMU sensors with multimodal attention mechanisms provides a scalable, robust foundation for road surface understanding, particularly relevant for regions where environmental variability and cost constraints limit the adoption of high-end sensing suites.
💡 Research Summary
This paper tackles two major shortcomings of current road surface classification (RSC) research: the lack of diverse, realistic datasets and the limited ability of existing models to fuse multimodal sensor data robustly under domain shifts. To address these gaps, the authors introduce (1) a novel multimodal dataset called ROAD (Road surface Observation and Analysis Dataset) and (2) a lightweight yet powerful fusion architecture that combines camera images and inertial measurement unit (IMU) data through bidirectional cross‑attention and an adaptive gating mechanism.
The ROAD dataset is composed of three complementary subsets. The first subset contains synchronized RGB‑IMU streams captured with an industry‑grade datalogger across twelve Brazilian cities, covering a wide range of lighting (day, night, twilight), weather (clear, rain, heavy rain, dust, snow), and surface types (asphalt, cobblestone, dirt, gravel, ice). The second subset is vision‑only but deliberately varies camera placement, exposure, and illumination to stress visual‑only models. The third subset is synthetic, generated in a physics‑based simulator to emulate rare conditions such as night‑rain and mixed‑surface transitions that are difficult to record in the field. All recordings are timestamped at 10 Hz, geo‑referenced with high‑precision GPS, and manually annotated at the frame level, providing long, continuous driving sequences suitable for temporal modeling.
The proposed fusion framework processes each modality with a dedicated encoder: EfficientNet‑B0 for RGB frames and a CNN‑BLSTM for six‑axis IMU data (accelerometer + gyroscope). Both encoders output modality‑specific token sequences after a LayerNorm + MLP tokenization step. A bidirectional cross‑attention block then lets visual tokens query inertial tokens and vice‑versa, producing refined representations V′ (vision) and A′ (IMU). These are combined by an adaptive gating layer that learns a sample‑dependent weight α; the final fused embedding F = α·V′ + (1‑α)·A′ is fed to a pooling layer and a shallow MLP classifier that predicts the road surface class.
Extensive experiments compare the new model against strong baselines: IMU‑only LSTM, vision‑only ResNet, and a simple concatenation‑fusion network. Evaluation is performed on the established Passive Vehicle Sensors (PVS) benchmark and on the ROAD multimodal subset. The proposed method achieves 94.2 % accuracy on PVS (up from 92.8 % for the previous state‑of‑the‑art) and 88.5 % accuracy on ROAD (up from 76.9 %). Gains are especially pronounced for minority classes such as gravel and cobblestone, where F1‑scores improve by 0.13–0.16. Under challenging conditions—nighttime, heavy rain, and mixed‑surface transitions—performance degradation stays below 3 %, whereas vision‑only models drop by more than 15 % in the same scenarios. Ablation studies reveal that removing cross‑attention reduces accuracy by 4.3 pp, and fixing the gating weight to 0.5 (instead of learning it) costs an additional 2.9 pp, confirming the importance of both components.
Qualitative analysis of attention maps shows that when visual input is degraded (e.g., low light or motion blur), the gating mechanism increases the IMU contribution, while in smooth, low‑vibration segments the visual stream dominates. However, the authors note a sensitivity to severe sensor mis‑alignment: calibration errors exceeding 5 cm cause the gating to over‑rely on a single modality, leading to up to a 6 % drop in accuracy. They suggest future work on automatic calibration correction.
The paper concludes that a cost‑effective combination of commodity cameras and IMUs, when coupled with bidirectional cross‑attention and adaptive gating, can deliver robust road surface perception even in highly variable real‑world environments. The ROAD dataset, with its breadth of conditions and multimodal synchronization, is positioned as a new benchmark for evaluating both unimodal and multimodal RSC approaches. The authors outline future directions including extending the fusion to additional sensors (LiDAR, acoustic), integrating continuous surface quality scoring for predictive maintenance, and deploying the model on edge hardware such as Raspberry Pi for real‑time fleet monitoring. Overall, the work advances the state of the art in both data resources and algorithmic design for intelligent vehicle systems.
Comments & Academic Discussion
Loading comments...
Leave a Comment