A Testbed for Cross-Dataset Analysis

Since its beginning, visual recognition research has tried to capture the huge variability of the visual world in several image collections. The number of available datasets continues to grow, together with the number of samples per object category. However, this trend does not translate directly into an increase in the generalization capabilities of the developed recognition systems. Each collection tends to have its own specific characteristics and to cover only some aspects of the visual world: these biases often narrow the reach of methods defined and tested separately on each image set. Our work makes a first step towards the analysis of the dataset bias problem on a large scale. We organize twelve existing databases into a unique corpus and present the visual community with a useful feature repository for future research.


💡 Research Summary

The paper addresses the long‑standing problem of dataset bias in visual recognition research, which hampers the true generalization ability of learned models. While the number of image collections has grown dramatically, each dataset typically reflects only a subset of the visual world, leading to biased evaluations when algorithms are tested in isolation on a single collection. To investigate this issue at scale, the authors assemble a unified corpus comprising twelve widely used public image datasets, including Caltech‑101/256, PASCAL VOC 2007/2012, ImageNet (a selected subset), SUN, Office, COIL‑100, among others.

A major contribution is the construction of a common ontology that aligns the heterogeneous class definitions across datasets. The authors automatically generate a mapping table based on hierarchical relationships and then manually verify it, resulting in a consistent set of roughly 1,200 class correspondences. This alignment enables cross‑dataset experiments without ambiguous label mismatches.
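A verified mapping table of this kind can be applied as a simple lookup from (dataset, raw label) pairs to shared ontology classes. The sketch below illustrates the idea; the dataset names follow the corpus, but every label string and the `ALIGNMENT` table itself are hypothetical examples, not the paper's actual ontology.

```python
# Minimal sketch of applying a cross-dataset class-alignment table.
# The label strings below are illustrative, not the paper's real mapping.

ALIGNMENT = {
    ("Caltech-101", "airplanes"): "airplane",
    ("PASCAL-VOC-2007", "aeroplane"): "airplane",
    ("ImageNet", "n02691156"): "airplane",       # hypothetical synset choice
    ("Caltech-101", "Motorbikes"): "motorbike",
    ("PASCAL-VOC-2007", "motorbike"): "motorbike",
}

def unify_label(dataset: str, raw_label: str) -> str:
    """Map a dataset-specific label to its shared ontology class."""
    try:
        return ALIGNMENT[(dataset, raw_label)]
    except KeyError:
        raise KeyError(f"No ontology entry for {raw_label!r} in {dataset}")

print(unify_label("PASCAL-VOC-2007", "aeroplane"))  # -> airplane
```

Keeping the table explicit (rather than matching labels by string similarity) is what makes the manual-verification step meaningful: every correspondence is a reviewable row.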

The paper also introduces a feature repository that stores a broad spectrum of visual descriptors for every image in the corpus. Traditional handcrafted features such as SIFT, HOG, and color histograms are extracted alongside deep convolutional features from several pre‑trained networks (AlexNet, VGG‑16, ResNet‑50). All images are processed through a standardized pipeline (resizing, mean subtraction, normalization) and the resulting vectors are saved in compressed HDF5 files, making large‑scale retrieval and reproducibility straightforward. The code for feature extraction is released publicly.
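The standardized pipeline (resize, mean subtraction, normalization) can be sketched in a few lines. This is an assumed, minimal reimplementation, not the released extraction code: nearest-neighbour resizing stands in for a proper image resize, and NumPy's compressed `.npz` format stands in for the HDF5 storage the paper uses.

```python
import io
import numpy as np

def preprocess(img, size=8, mean=None):
    """Resize (nearest-neighbour, for brevity), mean-subtract, L2-normalize.

    `img` is an HxWxC array; returns a flat unit-norm feature vector.
    """
    h, w = img.shape[:2]
    ys = np.arange(size) * h // size          # nearest-neighbour row indices
    xs = np.arange(size) * w // size          # nearest-neighbour col indices
    resized = img[ys][:, xs].astype(np.float32)
    if mean is not None:
        resized -= mean                       # per-channel mean subtraction
    vec = resized.ravel()
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

rng = np.random.default_rng(0)
img = rng.uniform(0, 255, size=(32, 48, 3))
vec = preprocess(img, size=8, mean=np.float32([128, 128, 128]))

# Compressed storage stand-in (the paper stores vectors in HDF5 files).
buf = io.BytesIO()
np.savez_compressed(buf, features=vec)
buf.seek(0)
restored = np.load(buf)["features"]
print(vec.shape, np.allclose(vec, restored))
```

The fixed preprocessing is the point: every descriptor in the repository is comparable across datasets because no per-dataset tuning enters the pipeline.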

Two experimental protocols are defined. The first, “In‑Dataset”, trains and tests a classifier on the same dataset, establishing an upper bound for each feature type. The second, “Cross‑Dataset”, trains on one dataset and directly evaluates on another, thereby measuring the impact of bias. Linear SVM, RBF‑SVM, Random Forest, and a shallow two‑layer MLP are employed as classifiers, with hyper‑parameters tuned via cross‑validation.
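The two protocols can be contrasted on synthetic data. In this sketch a nearest-centroid classifier stands in for the linear SVM, and the two "datasets" are Gaussian clusters where a feature shift plays the role of dataset bias; everything here is an illustrative assumption, not the paper's setup.

```python
import numpy as np

def fit_centroids(X, y):
    """Nearest-centroid classifier (a simple stand-in for the linear SVM)."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(X, classes, centroids):
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dists.argmin(axis=1)]

def accuracy(y_true, y_pred):
    return float((y_true == y_pred).mean())

rng = np.random.default_rng(0)

def make_dataset(shift, n=200):
    """Two Gaussian classes; `shift` mimics a dataset-specific bias."""
    X = np.vstack([rng.normal([0.0 + shift, 0.0], 0.5, size=(n, 2)),
                   rng.normal([4.0 + shift, 0.0], 0.5, size=(n, 2))])
    y = np.array([0] * n + [1] * n)
    return X, y

Xa, ya = make_dataset(shift=0.0)   # "dataset A"
Xb, yb = make_dataset(shift=2.0)   # "dataset B": same classes, biased features

perm = rng.permutation(len(ya))
train, test = perm[:200], perm[200:]
classes, cents = fit_centroids(Xa[train], ya[train])

acc_in = accuracy(ya[test], predict(Xa[test], classes, cents))   # In-Dataset
acc_cross = accuracy(yb, predict(Xb, classes, cents))            # Cross-Dataset
print(f"in-dataset: {acc_in:.2f}  cross-dataset: {acc_cross:.2f}")
```

Even this toy model reproduces the qualitative finding: the In-Dataset score is an upper bound, and the Cross-Dataset score falls once the feature distribution shifts.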

Results show that while deep CNN features achieve 85–92 % accuracy in the In‑Dataset setting, performance drops by 15–30 % when the same model is applied to a different dataset. Handcrafted features suffer even larger degradations, especially color‑based descriptors, which can lose more than 30 % of their accuracy. The authors also propose quantitative bias metrics: Kullback‑Leibler divergence and average Euclidean distance between feature distributions of dataset pairs. These metrics correlate strongly (r > 0.78) with the observed cross‑dataset accuracy loss, suggesting they can predict bias severity before any model training.
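The distribution-level metrics can be made concrete with one plausible instantiation: fit a diagonal Gaussian to each dataset's features and use the closed-form KL divergence, plus the Euclidean distance between mean feature vectors. The paper's exact estimators are not specified here, so treat this as an assumed sketch.

```python
import numpy as np

def diag_gauss_kl(X, Y, eps=1e-6):
    """KL( N(mu_x, diag var_x) || N(mu_y, diag var_y) ) fit to two feature sets."""
    mx, vx = X.mean(axis=0), X.var(axis=0) + eps
    my, vy = Y.mean(axis=0), Y.var(axis=0) + eps
    per_dim = 0.5 * (np.log(vy / vx) + (vx + (mx - my) ** 2) / vy - 1.0)
    return float(per_dim.sum())

def mean_feature_distance(X, Y):
    """Average-feature Euclidean gap between two datasets."""
    return float(np.linalg.norm(X.mean(axis=0) - Y.mean(axis=0)))

rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0, size=(500, 16))       # features from "dataset A"
B = rng.normal(0.5, 1.0, size=(500, 16))       # shifted: a "biased" dataset

kl_self = diag_gauss_kl(A, A)   # near 0: identical distributions
kl_ab = diag_gauss_kl(A, B)     # positive: the shift registers as bias
gap_ab = mean_feature_distance(A, B)
print(f"KL(A,A)={kl_self:.4f}  KL(A,B)={kl_ab:.2f}  mean-gap={gap_ab:.2f}")
```

Because both quantities depend only on the feature distributions, they can be computed before training any classifier, which is exactly what makes them useful as predictors of cross-dataset accuracy loss.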

The discussion emphasizes that ignoring dataset bias leads to overly optimistic claims about model robustness. The authors advocate for incorporating domain‑generalization strategies—such as domain‑adversarial training, meta‑learning, or multi‑source training—into the development pipeline. They argue that the presented testbed provides a realistic benchmark for evaluating such techniques.

Finally, the entire corpus, feature repository, and associated scripts are made available on GitHub, encouraging the community to extend the benchmark with additional datasets, new feature types, or novel bias‑mitigation algorithms. The paper positions this work as the first large‑scale, systematic effort to quantify and address dataset bias, offering a valuable infrastructure that can drive the next generation of visual recognition systems toward genuine, real‑world generalization.