A Large-Scale Car Dataset for Fine-Grained Categorization and Verification
Updated on 24/09/2015: This update provides preliminary experiment results for fine-grained classification on the surveillance data of CompCars. The train/test splits are provided in the updated dataset. See details in Section 6.
Research Summary
The paper introduces CompCars, a large-scale, richly annotated car image dataset designed to stimulate research on fine-grained categorization, attribute prediction, and model verification. The dataset comprises 208,826 images covering 1,716 car models from 163 makes, collected from two distinct scenarios: (1) web-nature images sourced from forums, manufacturer sites, and search engines, and (2) surveillance-nature images captured by traffic cameras. The web portion contains 136,727 full-car images and 27,618 part images, while the surveillance portion adds 44,481 front-view images with bounding boxes, model labels, and color information.
Key annotations include:
- A three-level hierarchy (make → model → year) allowing hierarchical classification.
- Five viewpoint labels (front, rear, side, front-side, rear-side) for each full-car image.
- Eight part categories (headlight, taillight, fog light, air intake, console, steering wheel, dashboard, gear lever) with roughly aligned crops.
- Five car attributes: maximum speed, engine displacement, number of doors, number of seats, and car type (12 categories such as SUV, sedan, and hatchback). The attributes are derived from manufacturer specifications, ensuring objective ground truth.
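Taken together, a single image's annotation can be thought of as one record combining the hierarchy, viewpoint, and attribute layers. A minimal sketch of such a record (a hypothetical layout for illustration only, not the dataset's actual file format; the example values are made up):

```python
from dataclasses import dataclass

@dataclass
class CarAnnotation:
    """Hypothetical per-image record mirroring CompCars' annotation layers."""
    make: str                # hierarchy level 1
    model: str               # hierarchy level 2
    year: int                # hierarchy level 3
    viewpoint: str           # front, rear, side, front-side, or rear-side
    max_speed_kmh: float     # attribute: maximum speed
    displacement_l: float    # attribute: engine displacement
    doors: int               # attribute: number of doors
    seats: int               # attribute: number of seats
    car_type: str            # one of 12 types, e.g. SUV, sedan, hatchback

# Illustrative instance (values invented for the example)
example = CarAnnotation("Audi", "A4L", 2011, "front", 230.0, 2.0, 4, 5, "sedan")
```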
The authors argue that existing car datasets (e.g., the "Cars" dataset) are limited in scale, viewpoint diversity, part coverage, and attribute information. CompCars addresses these gaps and enables cross-modality research, because the same models appear in both the web and surveillance domains, yet under very different imaging conditions (resolution, lighting, occlusion).
Three benchmark tasks are defined and evaluated using a Convolutional Neural Network (CNN) based on the Overfeat architecture, pre-trained on ImageNet and subsequently fine-tuned on CompCars data:
- Fine-grained car classification: 431 model classes (year variations merged). Experiments compare models trained on single viewpoints (F, R, S, FS, RS) against a model trained on all viewpoints ("All-View"). All-View achieves the highest top-1 accuracy (~78%), demonstrating that the network can integrate multi-view information effectively. The FS and RS viewpoints also outperform pure front, rear, or side views, suggesting that oblique angles provide discriminative cues. A separate experiment using aligned part crops shows that exterior parts (especially headlights and taillights) are more informative than interior parts for model discrimination.
- Attribute prediction: both regression (maximum speed, displacement) and classification (door count, seat count, car type) are tackled. The network fine-tuned on full-car images predicts car type with >85% accuracy, while regression errors for speed and displacement remain within practical limits (e.g., mean absolute error <10 km/h). This indicates that visual appearance encodes functional specifications to a considerable degree.
- Car model verification: the task is to decide whether two images belong to the same model. Features extracted from the fine-tuned Overfeat network are fed into a Joint Bayesian classifier, a method popular in face verification. Despite the domain shift between web and surveillance images, verification accuracy stays above 70%, confirming that the learned representation is robust enough for cross-scenario matching.
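The top-1 accuracy figures quoted for the classification task are computed by checking whether each image's highest-scoring class matches the ground truth; a minimal numpy sketch (the logits and labels here are toy values, not CompCars outputs):

```python
import numpy as np

def topk_accuracy(logits, labels, k=1):
    """Fraction of samples whose true class is among the k highest-scoring classes."""
    topk = np.argsort(logits, axis=1)[:, -k:]   # indices of the k largest scores per row
    return float(np.mean(np.any(topk == labels[:, None], axis=1)))

# Toy check: 3 samples, 4 classes
logits = np.array([[0.1, 0.9, 0.0, 0.0],   # predicts class 1
                   [0.8, 0.1, 0.0, 0.1],   # predicts class 0
                   [0.2, 0.3, 0.4, 0.1]])  # predicts class 2
labels = np.array([1, 0, 3])
print(topk_accuracy(logits, labels, k=1))  # 2 of 3 correct -> 0.666...
```

With k=1 this is exactly the top-1 metric reported for the All-View and single-viewpoint models.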
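For the continuous attributes (maximum speed, displacement), the reported mean absolute error comes from regressing the target value on image features. A minimal sketch of that setup, substituting closed-form ridge regression on synthetic stand-in features for the paper's fine-tuned CNN:

```python
import numpy as np

def ridge_fit(X, y, lam=1e-3):
    # Closed-form ridge regression: w = (X^T X + lam*I)^(-1) X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def mean_absolute_error(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

# Toy example: recover a linear mapping from synthetic "features" to a speed-like target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))   # stand-in for CNN features
w_true = rng.normal(size=8)
y = X @ w_true                  # synthetic regression targets
w = ridge_fit(X, y)
mae = mean_absolute_error(y, X @ w)   # near zero on this noiseless toy data
```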
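The Joint Bayesian scoring step used for verification models a feature as identity plus within-class variation, both Gaussian, and scores a pair by a log-likelihood ratio. A minimal numpy sketch, assuming the between-class covariance S_mu and within-class covariance S_eps have already been estimated from training features (the original method fits them with EM); this is an illustration of the scoring rule, not the paper's implementation:

```python
import numpy as np

def gaussian_logpdf(z, cov):
    """Log-density of a zero-mean multivariate Gaussian at z."""
    d = z.shape[0]
    sign, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + z @ np.linalg.solve(cov, z))

def joint_bayesian_score(x1, x2, S_mu, S_eps):
    """Log-likelihood ratio: log P(x1, x2 | same model) - log P(x1, x2 | different models)."""
    d = len(x1)
    z = np.concatenate([x1, x2])
    zero = np.zeros((d, d))
    cov_same = np.block([[S_mu + S_eps, S_mu],
                         [S_mu, S_mu + S_eps]])   # the pair shares one identity term
    cov_diff = np.block([[S_mu + S_eps, zero],
                         [zero, S_mu + S_eps]])   # independent identities
    return gaussian_logpdf(z, cov_same) - gaussian_logpdf(z, cov_diff)

# Toy check with isotropic covariances: near-identical features score positive,
# strongly differing features score negative
S_mu, S_eps = np.eye(2), 0.1 * np.eye(2)
same = joint_bayesian_score(np.array([1.0, 0.0]), np.array([1.0, 0.0]), S_mu, S_eps)
diff = joint_bayesian_score(np.array([3.0, 0.0]), np.array([-3.0, 0.0]), S_mu, S_eps)
```

Thresholding this score at 0 gives the same/different decision; the paper feeds fine-tuned Overfeat features into this classifier.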
The paper also discusses challenges revealed by the experiments: (i) imbalanced viewpoint distribution across models, (ii) limited viewpoint diversity in the surveillance set (only front view), (iii) the need for more advanced architectures (e.g., ResNet, Vision Transformers) to push performance further, and (iv) potential label noise in year annotations.
Future research directions suggested include domain adaptation techniques to bridge the gap between web and surveillance domains, multi-task learning that jointly optimizes classification, attribute regression, and verification, 3-D reconstruction leveraging the hierarchical and part annotations, and self-supervised pre-training on massive unlabeled car video streams.
In summary, CompCars provides a comprehensive platform with unprecedented scale, hierarchical structure, multi-view and part annotations, and objective attribute labels. The baseline results establish solid reference points, and the dataset's cross-modality nature opens avenues for a wide range of computer-vision investigations beyond traditional object classification, positioning it as a cornerstone resource for the automotive vision community.