Impact of Data Duplication on Deep Neural Network-Based Image Classifiers: Robust vs. Standard Models


The accuracy and robustness of machine learning models against adversarial attacks are significantly influenced by factors such as training data quality, model architecture, the training process, and the deployment environment. In recent years, duplicated data in training sets, especially in language models, has attracted considerable attention. It has been shown that deduplication enhances both training performance and model accuracy in language models. While the importance of data quality in training image classifier Deep Neural Networks (DNNs) is widely recognized, the impact of duplicated images in the training set on model generalization and performance has received little attention. In this paper, we address this gap and provide a comprehensive study on the effect of duplicates in image classification. Our analysis indicates that the presence of duplicated images in the training set not only negatively affects the efficiency of model training but also may result in lower accuracy of the image classifier. This negative impact of duplication on accuracy is particularly evident when duplicated data is non-uniform across classes or when duplication, whether uniform or non-uniform, occurs in the training set of an adversarially trained model. Even when duplicated samples are selected in a uniform way, increasing the amount of duplication does not lead to a significant improvement in accuracy.


💡 Research Summary

This paper investigates the often‑overlooked problem of duplicated training images in deep neural network (DNN) classifiers for computer vision. While prior work on large language models has shown that deduplication improves training efficiency and downstream accuracy, the impact of duplicate images on image classification models—both standard and adversarially trained—has received little systematic study. The authors fill this gap by providing a comprehensive theoretical and empirical analysis of how varying levels and patterns of duplication affect model generalization, training speed, and robustness to adversarial attacks.

The theoretical contribution frames generalization error using the classic bias‑variance decomposition. In a normal setting, adding more (independent) samples reduces variance while slightly increasing bias. However, when duplicated samples are added, the effective information content does not increase. Proposition 1 formalizes this: duplicated samples reduce variance for the duplicated class but increase variance for the non‑duplicated classes, while simultaneously inflating bias toward the duplicated class. This creates a biased decision boundary that can hurt overall performance, especially when duplication is non‑uniform across classes.
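The decomposition the summary refers to is the standard one for expected squared error; in our own notation (not necessarily the paper's), with predictor \(\hat{f}\), true function \(f\), and irreducible noise \(\sigma^2\):

```latex
\mathbb{E}\big[(\hat{f}(x) - y)^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  \;+\; \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  \;+\; \sigma^2
```

Proposition 1's claim can then be read as: duplication shrinks the variance term only for the duplicated class while growing the bias term in that class's favor, so the sum need not decrease overall.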

The paper then extends the analysis to adversarial training. By incorporating adversarial perturbations β(x) and stochastic noise γ into the mean‑squared error loss, Theorem 2 derives an additional term that captures the interaction between model gradients and adversarial perturbations (cₓ) and a term reflecting variability under attack (c′ₓ). Proposition 2 argues that duplicated data amplifies both terms, making the model more sensitive to specific perturbations and thus less robust. In other words, duplication not only harms standard generalization but also compounds the robustness‑generalization trade‑off inherent in adversarial training.
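Theorem 2 itself is not reproduced in this summary; a first-order Taylor sketch in our own notation (an assumption, not the paper's exact derivation) shows where such interaction terms come from when the input is perturbed by β(x) and γ:

```latex
\mathbb{E}\big[\big(f(x+\beta(x)+\gamma) - y\big)^2\big]
 \approx \mathbb{E}\big[(f(x)-y)^2\big]
 + 2\,\mathbb{E}\big[(f(x)-y)\,\nabla f(x)^{\top}\beta(x)\big]
 + \mathbb{E}\big[\big(\nabla f(x)^{\top}(\beta(x)+\gamma)\big)^2\big]
```

The middle term couples model gradients with the adversarial perturbation (playing the role of cₓ), and the last term captures variability under attack (c′ₓ); duplicated samples that dominate the expectation amplify both.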

Empirically, the authors conduct two sets of experiments. First, they generate synthetic two-dimensional Gaussian data for two classes and train an SVM with an RBF kernel. By varying the overall duplication rate (10%–90%) and the share of duplicates drawn from the positive class (the D-ratio of the +1 class), they show that uniform duplication has only a modest effect on the decision boundary, whereas biased duplication dramatically shifts the boundary toward the over-represented class and sharply reduces test accuracy once the positive-class D-ratio exceeds roughly 0.5.
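The boundary-shift effect can be illustrated without the paper's SVM setup. A minimal stdlib-only sketch (our own toy construction, not the authors' code): for two equal-variance 1-D Gaussian classes, the Bayes-optimal threshold depends on the class priors, so duplicating one class's samples inflates its empirical prior and drags the threshold toward the other class.

```python
import math
import random
import statistics

def bayes_threshold(x0, x1):
    """Decision threshold between two 1-D classes under an equal-variance
    Gaussian assumption: t = (m0 + m1)/2 + s^2 * ln(n0/n1) / (m1 - m0).
    Duplicating a class changes n0/n1 and hence shifts t."""
    m0, m1 = statistics.mean(x0), statistics.mean(x1)
    s2 = (statistics.variance(x0) + statistics.variance(x1)) / 2  # pooled within-class variance
    return (m0 + m1) / 2 + s2 * math.log(len(x0) / len(x1)) / (m1 - m0)

random.seed(0)
x0 = [random.gauss(0.0, 1.0) for _ in range(200)]  # class 0 centered at 0
x1 = [random.gauss(3.0, 1.0) for _ in range(200)]  # class 1 centered at 3

t_balanced = bayes_threshold(x0, x1)
t_biased = bayes_threshold(x0, x1 * 5)  # every class-1 sample duplicated 4 extra times

print(round(t_balanced, 3), round(t_biased, 3))
assert t_biased < t_balanced  # boundary moved toward the now-minority class 0
```

Duplication adds no new information about class 1 (its mean and variance estimates barely move), yet the decision region of the duplicated class grows, which is exactly the biased-boundary behavior the paper reports for non-uniform duplication.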

Second, they evaluate real-world performance on CIFAR-10 using a ResNet-18 architecture. They compare a standard model trained on the original data with one trained on data containing duplicated images, and they repeat the experiment with a model that has undergone PGD-based adversarial training. Uniform duplication (randomly selected images) yields negligible changes in top-1 accuracy for the standard model, but biased duplication (over-sampling a single class) leads to a 2–3% absolute drop in accuracy even at modest duplication rates (5–10%). In the adversarially trained setting, the same biased duplication causes a larger degradation of 4–6% absolute accuracy, and the effect is amplified when the duplicated class coincides with the target class of the adversarial attack. Moreover, duplicated data increases the number of training iterations required to converge, inflating overall training time by 20%–80% depending on the duplication rate.
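A biased-duplication training set of the kind described above can be sketched as an index list over any labeled dataset. The helper below is hypothetical (not from the paper's released implementation); `dup_class` and `dup_rate` mirror the summary's over-sampled class and duplication rate:

```python
import random

def duplicated_indices(labels, dup_class, dup_rate, seed=0):
    """Return dataset indices with extra copies of `dup_class` samples appended.
    dup_rate=0.10 duplicates a random 10% of that class's samples (biased,
    non-uniform duplication). Hypothetical helper for illustration only."""
    rng = random.Random(seed)
    base = list(range(len(labels)))
    cls = [i for i, y in enumerate(labels) if y == dup_class]
    extra = rng.sample(cls, k=int(round(dup_rate * len(cls))))
    return base + extra

labels = [i % 10 for i in range(100)]  # toy 10-class label vector, 10 per class
idx = duplicated_indices(labels, dup_class=3, dup_rate=0.5)
print(len(idx))  # 100 originals + 5 duplicates of class 3
```

With a real dataset, such an index list could be fed to something like `torch.utils.data.Subset` to build the duplicated training set; uniform duplication corresponds to sampling the extras from all classes instead of one.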

The authors also release their implementation publicly, enabling reproducibility. Their findings suggest that data duplication is not a benign way to increase dataset size; rather, it can introduce class‑specific bias, increase variance for under‑represented classes, and weaken robustness against adversarial perturbations. Consequently, practitioners should incorporate deduplication steps into data preprocessing pipelines, especially when dealing with class‑imbalanced datasets or when employing adversarial training for security‑critical applications.

In conclusion, the paper provides a solid theoretical foundation and convincing empirical evidence that duplicated training images can harm both accuracy and robustness of image classifiers. It calls for systematic deduplication as a standard practice in computer‑vision model development and opens avenues for future work on more sophisticated duplication‑aware training strategies and on extending the analysis to other vision architectures such as Vision Transformers.

