Deep Learning-Based Object Detection for Autonomous Vehicles: A Comparative Study of One-Stage and Two-Stage Detectors on Basic Traffic Objects
Object detection is a crucial component of autonomous vehicle systems: it enables the vehicle to perceive its environment by identifying and locating the objects around it. Deep learning methods differ considerably in how accurately and efficiently they detect and classify such objects, so the choice of method directly affects detection accuracy, processing speed, robustness to environmental conditions, sensor integration, scalability, and edge-case handling. Although several generic architectures such as YOLO, SSD, and Faster R-CNN have been proposed, guidance on their suitability for specific autonomous driving applications remains limited. This study provides a comprehensive experimental comparison of two prominent object detection models: YOLOv5 (a one-stage detector) and Faster R-CNN (a two-stage detector). Their performance is evaluated on a diverse dataset combining real and synthetic images, using metrics including mean Average Precision (mAP), recall, and inference speed. The findings show that YOLOv5 achieves superior mAP, recall, and training efficiency, particularly as dataset size and image resolution increase, whereas Faster R-CNN is better at detecting small, distant objects and performs well under challenging lighting conditions. The models' behavior is also analyzed under different confidence thresholds and in various real-world scenarios, providing insight into their applicability for autonomous driving systems.
💡 Research Summary
This paper presents a systematic comparative study of two widely used deep‑learning object detectors for autonomous driving: the one‑stage YOLOv5 and the two‑stage Faster R‑CNN. The authors constructed a hybrid dataset that combines real‑world images from the BDD100K benchmark with synthetic images from the SHIFT dataset, maintaining a 1:1 ratio of real to synthetic data. Three dataset sizes (≈2 k, 3 k, and 5 k images) and two image resolutions (640 × 640 and 800 × 800) were prepared, each containing balanced instances of three critical traffic classes—cars, pedestrians, and trucks.
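The 1:1 real-to-synthetic sampling described above could be sketched roughly as follows. This is a minimal illustration, not the authors' released code; the function and variable names (`build_hybrid_split`, the path prefixes) are hypothetical.

```python
import random

def build_hybrid_split(real_paths, synth_paths, total, seed=0):
    """Sample a hybrid image set of the requested size with a 1:1
    real/synthetic ratio (illustrative sketch, not the paper's code)."""
    rng = random.Random(seed)       # fixed seed for a reproducible split
    half = total // 2
    real = rng.sample(real_paths, half)    # e.g. BDD100K frames
    synth = rng.sample(synth_paths, half)  # e.g. SHIFT frames
    mixed = real + synth
    rng.shuffle(mixed)              # interleave so batches mix both sources
    return mixed
```

The same routine would be run three times (total ≈ 2 000, 3 000, 5 000) to produce the three dataset sizes the study compares.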
Both detectors were implemented using their official repositories (Ultralytics YOLOv5‑m and Detectron2 Faster R‑CNN with a ResNet‑50‑FPN backbone) and trained under identical hyper‑parameter settings: learning rate 0.01, 300 epochs, batch size 4, and default optimizer configurations. No extensive hyper‑parameter tuning was performed, ensuring that observed performance differences stem from architectural characteristics rather than optimization tricks.
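The identical training configuration can be expressed as a single shared hyper-parameter dictionary mapped onto each toolkit. The flag and config-key names below (`--epochs`, `SOLVER.BASE_LR`, etc.) reflect the Ultralytics and Detectron2 conventions but are assumptions here, not taken from the paper:

```python
# Shared settings from the paper; everything else is left at toolkit defaults.
SHARED_HPARAMS = {
    "lr": 0.01,       # learning rate for both detectors
    "epochs": 300,    # identical training length
    "batch_size": 4,  # identical batch size
}

def yolov5_train_args(hp):
    """Map the shared settings onto (assumed) Ultralytics train.py CLI flags."""
    return ["--epochs", str(hp["epochs"]),
            "--batch-size", str(hp["batch_size"])]

def detectron2_overrides(hp):
    """Map the shared settings onto (assumed) Detectron2 config keys."""
    return {"SOLVER.BASE_LR": hp["lr"],
            "SOLVER.IMS_PER_BATCH": hp["batch_size"],
            "SOLVER.MAX_ITER": hp["epochs"]}  # epochs -> iterations in practice
```

Deriving both configurations from one dictionary makes the "no tuning" constraint explicit: any performance gap then reflects architecture, not optimization.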
Quantitative evaluation employed mean Average Precision (mAP), recall, and inference speed (frames per second, FPS). Results show that YOLOv5 consistently outperforms Faster R‑CNN in overall mAP (by roughly 2–3 %) and achieves real‑time processing speeds of 70–80 FPS on a modern GPU, making it suitable for latency‑critical applications. However, Faster R‑CNN demonstrates superior recall on small and distant objects, particularly pedestrians, thanks to its Region Proposal Network and Feature Pyramid Network, and it is more robust under low‑light or high‑contrast conditions where YOLOv5 suffers an increase in false negatives.
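As a reminder of how the headline mAP numbers are produced, the sketch below computes per-class Average Precision with all-points interpolation (Pascal-VOC style) from a toy list of scored detections. The paper does not state which AP variant it uses, so this is an illustrative assumption; mAP is then the mean of this quantity over the three classes.

```python
def average_precision(scores, is_tp, num_gt):
    """AP with all-points interpolation. `scores` are detection confidences,
    `is_tp[i]` is 1 if detection i matched a ground-truth box, `num_gt` is
    the number of ground-truth objects for this class."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    prec, rec = [], []
    for i in order:                       # sweep detections by confidence
        tp += is_tp[i]
        fp += 1 - is_tp[i]
        prec.append(tp / (tp + fp))
        rec.append(tp / num_gt)
    # Precision envelope: make precision monotone non-increasing in recall.
    for i in range(len(prec) - 2, -1, -1):
        prec[i] = max(prec[i], prec[i + 1])
    # Area under the interpolated precision-recall curve.
    ap, prev_r = 0.0, 0.0
    for p, r in zip(prec, rec):
        ap += (r - prev_r) * p
        prev_r = r
    return ap
```

Inference speed is then simply frames processed divided by wall-clock seconds, which is how FPS figures such as 70–80 for YOLOv5 are typically obtained.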
A confidence‑threshold analysis (0.3–0.7) revealed that YOLOv5’s precision‑recall curve degrades more gracefully at higher thresholds, whereas Faster R‑CNN’s false‑positive rate spikes sharply when the threshold is lowered. This suggests that threshold selection must be tailored to the specific detector in deployment.
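The threshold sweep itself is straightforward: keep only detections whose confidence exceeds the cut-off, then recompute precision and recall. The helper and toy data below are illustrative, not from the paper:

```python
def precision_recall_at(detections, num_gt, threshold):
    """detections: list of (confidence, is_true_positive) pairs.
    Returns (precision, recall) after filtering by the threshold."""
    kept = [tp for conf, tp in detections if conf >= threshold]
    if not kept:
        return 0.0, 0.0
    tp = sum(kept)
    return tp / len(kept), tp / num_gt

# Toy detections for one image set (hypothetical values).
dets = [(0.95, 1), (0.85, 1), (0.6, 0), (0.55, 1), (0.4, 0), (0.35, 1)]
for t in (0.3, 0.5, 0.7):                 # the paper sweeps 0.3-0.7
    p, r = precision_recall_at(dets, num_gt=5, threshold=t)
```

Plotting these pairs per detector over the 0.3–0.7 range reproduces the kind of curve the authors use to show YOLOv5 degrading gracefully at high thresholds while Faster R-CNN's false positives spike at low ones.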
The discussion highlights practical trade‑offs: YOLOv5 offers higher throughput and better performance on medium‑to‑large objects, making it ideal for scenarios where computational resources are limited but detection speed is paramount. Faster R‑CNN, though slower (≈12–15 FPS), provides higher fidelity for small, far‑away, or poorly illuminated objects, which is critical in dense urban environments with vulnerable road users.
Limitations include the absence of adverse weather (rain, snow) images, a restricted class set (only three object types), and the intentional omission of hyper‑parameter optimization, which may under‑represent each model’s peak capability. The authors propose future work that expands the dataset to cover diverse weather conditions, incorporates additional traffic participants (bicycles, traffic signs), evaluates newer transformer‑based detectors such as YOLOv8 and DETR, and applies automated hyper‑parameter search (e.g., Bayesian optimization) to fine‑tune models for specific autonomous‑driving contexts.
In conclusion, the study provides concrete, experimentally validated guidance for autonomous‑vehicle engineers: choose YOLOv5 when real‑time processing and overall accuracy dominate design criteria, and opt for Faster R‑CNN when the operational environment demands meticulous detection of small or low‑visibility objects. The paper thereby bridges the gap between academic performance benchmarks and the practical requirements of safety‑critical autonomous driving systems.