Intrusion Detection in Internet of Vehicles Using Machine Learning
The Internet of Vehicles (IoV) has transformed modern transportation through enhanced connectivity and intelligent systems. However, this increased connectivity introduces critical vulnerabilities, making vehicles susceptible to cyber-attacks such as Denial-of-Service (DoS) and message spoofing. This project aims to develop a machine learning-based intrusion detection system to classify malicious Controller Area Network (CAN) bus traffic using the CICIoV2024 benchmark dataset. We analyzed various attack patterns, including DoS attacks and spoofing attacks targeting critical vehicle parameters: gas pedal position (Spoofing-GAS), engine RPM (Spoofing-RPM), vehicle speed (Spoofing-Speed), and steering-wheel angle (Spoofing-Steering_Wheel). Our initial findings confirm a multi-class classification problem with a clear structural difference between attack types and benign data, providing a strong foundation for machine learning models.
💡 Research Summary
The paper presents a comprehensive study on building a machine‑learning‑based intrusion detection system (IDS) for the Internet of Vehicles (IoV), focusing on the Controller Area Network (CAN) bus traffic. Using the publicly available CICIoV2024 benchmark dataset, the authors treat the problem as a multi‑class classification task that distinguishes normal traffic from five malicious categories: Denial‑of‑Service (DoS) and four spoofing attacks (gas pedal, RPM, speed, and steering‑wheel).
The dataset comprises 1,408,219 records with 12 columns (the CAN ID, eight data bytes, and three label columns). Normal traffic dominates (86.9 %), while malicious traffic accounts for only 13.1 %, creating a severe class-imbalance scenario. A striking 99.7 % of the records are exact duplicates, reflecting the periodic nature of CAN messages. The authors meticulously clean the data: they remove duplicates, standardize label strings (trimming whitespace and normalizing case), and implement a fallback imputation strategy (median for numeric, mode for categorical) even though the raw data contains no missing values.
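The cleaning steps described above can be sketched in pandas. Note that the column names and the toy frame below are illustrative assumptions, not the actual CICIoV2024 file layout:

```python
import pandas as pd

# Toy frame mimicking the CICIoV2024 layout (column names are hypothetical).
df = pd.DataFrame({
    "ID": [291, 291, 513, 513],
    "DATA_0": [0, 0, 17, None],
    "label": [" DoS", "dos ", "BENIGN", "benign"],
})

# 1. Standardize label strings: trim whitespace, normalize case.
df["label"] = df["label"].str.strip().str.upper()

# 2. Drop exact duplicate rows (the paper reports ~99.7% duplicates,
#    a consequence of periodic CAN messages).
df = df.drop_duplicates().reset_index(drop=True)

# 3. Fallback imputation: median for numeric columns, mode for
#    categorical ones (defensive only -- the raw data has no missing values).
for col in df.columns:
    if df[col].isna().any():
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])
```

Running label normalization before deduplication matters: otherwise `" DoS"` and `"dos "` would survive as distinct rows.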
Exploratory data analysis reveals that DoS attacks are uniquely associated with CAN ID 291, making this feature a strong rule-based indicator. Spoofing attacks, however, spread across four IDs (513, 476, 128, 344), requiring more nuanced feature extraction. Correlation analysis shows high multicollinearity among several data-byte columns (e.g., DATA_3 with DATA_5 at 0.86), suggesting that these bytes encode related vehicle parameters.
To address dimensionality and noise, three feature‑selection/dimensionality‑reduction techniques are evaluated:
- Principal Component Analysis (PCA) – unsupervised, retains ~95 % variance in 7 components but sacrifices interpretability because the new axes are linear combinations of the original bytes.
- Linear Discriminant Analysis (LDA) – supervised, projects data onto 4 discriminant axes (limited by C‑1 where C is the number of classes). Visual inspection shows clear class separation with fewer dimensions, though LDA assumes normality and equal covariance, making it sensitive to outliers.
- ANOVA F‑value selection – a filter method that selects the original features most statistically different across classes. The top five features (ID, DATA_0, DATA_1, DATA_2, DATA_6) preserve semantic meaning, facilitating model explainability.
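The three techniques above can be compared side by side in scikit-learn. The snippet below uses synthetic stand-in data (the real dataset would supply `X` and `y`); the 9-feature / 6-class shape mirrors the summary, while the LDA axis count of 4 follows the figure reported above:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
# Synthetic stand-in: 9 features (CAN ID + 8 data bytes), 6 classes
# (benign, DoS, and four spoofing types).
X = rng.integers(0, 256, size=(600, 9)).astype(float)
y = rng.integers(0, 6, size=600)

# PCA: unsupervised; a float n_components keeps the smallest number of
# components explaining >= 95% of the variance (the paper reports ~7
# on the real data; random stand-in data will need more).
pca = PCA(n_components=0.95).fit(X)

# LDA: supervised; at most C-1 discriminant axes, here capped at 4.
lda = LinearDiscriminantAnalysis(n_components=4).fit(X, y)
X_lda = lda.transform(X)

# ANOVA F-test filter: keep the 5 most class-separating original features,
# preserving their semantic meaning for explainability.
selector = SelectKBest(f_classif, k=5).fit(X, y)
kept_features = selector.get_support(indices=True)
```

The filter method returns indices into the original columns, which is exactly why it retains interpretability that PCA's linear combinations lose.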
The literature review highlights a concerning trend: several recent studies report near‑perfect precision/recall (often 1.0) on the same dataset using advanced sampling (hybrid up/down‑sampling) and gradient‑boosting models (HistGradientBoosting, XGBoost). The authors argue that such results likely stem from data leakage, overly aggressive balancing, or overfitting, especially given the dataset’s known difficulty in distinguishing certain spoofing attacks (e.g., speed spoofing). Deep learning models achieving 99.8 % accuracy are similarly scrutinized; without adversarial or cross‑validation on truly unseen vehicle data, these metrics may not reflect real‑world robustness.
Practical deployment considerations are also discussed. Edge devices in vehicles have limited CPU, memory, and power budgets, rendering large DNNs or ensembles impractical for real‑time IDS. The paper suggests lightweight tree‑based classifiers, or hybrid approaches that combine unsupervised clustering (e.g., K‑Means) with supervised fine‑tuning, as more feasible. Moreover, the authors advocate for decentralized security architectures—federated learning to keep raw CAN logs on‑device and blockchain for immutable model verification—to reconcile the need for sophisticated detection with privacy and resource constraints.
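The hybrid option mentioned above can be sketched as a two-stage pipeline: unsupervised K-Means provides a cheap cluster ID that is appended as a feature before a shallow, edge-friendly tree is fit. All hyperparameters here are illustrative, not the paper's:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
# Synthetic stand-in for cleaned CAN features (9 columns, 6 classes).
X = rng.integers(0, 256, size=(500, 9)).astype(float)
y = rng.integers(0, 6, size=500)

# Stage 1 (unsupervised): cluster raw frames; the cluster assignment
# becomes one extra input feature for the supervised stage.
clusters = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)
X_aug = np.column_stack([X, clusters])

# Stage 2 (supervised): a depth-capped tree keeps the model's memory
# footprint and inference latency tiny, which matters on in-vehicle
# edge hardware with limited CPU and power budgets.
tree = DecisionTreeClassifier(max_depth=6, random_state=0).fit(X_aug, y)
```

A single shallow tree evaluates in a handful of comparisons per frame, a budget even a microcontroller-class ECU gateway can sustain at CAN message rates.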
In conclusion, the study provides an end‑to‑end pipeline: rigorous data cleaning, thoughtful handling of class imbalance, comparative feature‑selection analysis, and a critical appraisal of reported performance in the literature. It underscores that high benchmark scores alone are insufficient; future work must incorporate realistic vehicle deployments, adversarial testing, and edge‑optimized models to truly secure the IoV ecosystem.