Training Convolutional Networks with Noisy Labels
The availability of large labeled datasets has allowed Convolutional Network models to achieve impressive recognition results. However, in many settings manual annotation of the data is impractical; instead our data has noisy labels, i.e. there is some freely available label for each image which may or may not be accurate. In this paper, we explore the performance of discriminatively-trained Convnets when trained on such noisy data. We introduce an extra noise layer into the network which adapts the network outputs to match the noisy label distribution. The parameters of this noise layer can be estimated as part of the training process and involve simple modifications to current training infrastructures for deep networks. We demonstrate the approaches on several datasets, including large scale experiments on the ImageNet classification benchmark.
💡 Research Summary
The paper tackles the problem of training convolutional neural networks (ConvNets) when the available training labels are noisy, a common situation when large‑scale datasets are built from web sources, user tags, or cheap crowdsourcing. The authors observe that standard ConvNets are surprisingly robust to modest amounts of label noise, but performance degrades sharply as the noise level rises. To address this, they propose a simple yet effective architectural modification: a linear “noise layer” placed directly after the softmax output. This layer is parameterized by a K × K probability transition matrix Q (expanded to (K + 1) × (K + 1) for outlier handling), where each entry q_{ji} is the probability that true label i is observed as noisy label j.
During training the network predicts the true‑label distribution \hat{p}(y*|x) as usual; the noise layer then transforms this distribution into \hat{p}(\tilde{y}|x) = Q · \hat{p}(y*|x). The loss is the standard cross‑entropy between the observed noisy label \tilde{y} and this transformed prediction. Crucially, Q is learned jointly with the base ConvNet parameters via back‑propagation. After each gradient step, Q is projected back onto the space of column‑stochastic matrices (non‑negative entries, each column summing to one, consistent with q_{ji} = p(\tilde{y} = j | y* = i)).
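The forward transform, the loss, and the projection step can be sketched in a few lines. This is a minimal numpy illustration with toy numbers, not the paper's code; `project_column_stochastic` uses a simple clip-and-renormalize, which is one possible choice of projection:

```python
import numpy as np

K = 3
Q = np.array([[0.8, 0.1, 0.1],      # q[j, i] = P(observed j | true i),
              [0.1, 0.8, 0.1],      # so each COLUMN sums to one
              [0.1, 0.1, 0.8]])
p_true = np.array([0.7, 0.2, 0.1])  # softmax output, \hat{p}(y*|x)
p_noisy = Q @ p_true                # \hat{p}(y~|x) = Q . \hat{p}(y*|x)
y_tilde = 0                         # observed noisy label
loss = -np.log(p_noisy[y_tilde])    # standard cross-entropy on the noisy label

def project_column_stochastic(Q):
    """Project an updated Q back onto non-negative, column-stochastic
    matrices by clipping and renormalizing (one simple projection)."""
    Q = np.clip(Q, 0.0, None)
    return Q / Q.sum(axis=0, keepdims=True)
```

Because Q is column-stochastic and p_true sums to one, p_noisy is again a valid distribution over the K labels.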
If the true noise transition matrix Q* were known, setting Q = Q* would force the base model’s confusion matrix C to converge to the identity, meaning the ConvNet would learn to predict clean labels despite being trained on noisy ones. Since Q* is unknown, the authors introduce a regularizer on Q that encourages diffusion (e.g., weight decay or trace minimization). This regularizer prevents the trivial solution where the base model absorbs the noise (C ≈ Q*) and Q collapses to the identity. Instead, it pushes Q toward the true noise distribution, allowing the base model to focus on the underlying clean signal.
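The degeneracy that the regularizer must break can be seen with toy numbers: the noise can live in the layer (Q = Q*, confusion C = I) or be absorbed by the base model (Q = I, C ≈ Q*), and the two factorizations yield identical noisy predictions, so the loss alone cannot tell them apart. The matrix below is invented for illustration:

```python
import numpy as np

Q_star = np.array([[0.7, 0.2, 0.1],   # hypothetical true noise transition
                   [0.2, 0.7, 0.1],   # (columns sum to one)
                   [0.1, 0.1, 0.8]])
p_clean = np.array([1.0, 0.0, 0.0])   # base model confidently predicts class 0

# Desired solution: Q = Q*, base model predicts clean labels (C = I).
noisy_a = Q_star @ p_clean
# Trivial solution: Q = I, base model itself outputs the noisy mix (C ~ Q*).
noisy_b = np.eye(3) @ (Q_star @ p_clean)

assert np.allclose(noisy_a, noisy_b)  # indistinguishable to the training loss
```

The diffusion regularizer tips the balance toward the first factorization by penalizing a Q that stays close to the identity.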
Training proceeds in two phases. Initially Q is fixed to the identity and only the base network is trained; this allows the network to learn a reasonable representation even under noise. After a few epochs, Q is unfrozen and updated together with the base parameters, with a small weight‑decay term to encourage diffusion. Empirically the authors find that a modest decay (≈ 10⁻³) works well; too strong a decay over‑diffuses Q and harms performance.
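The two-phase schedule can be sketched end to end on a toy model. The code below is a minimal numpy illustration, not the paper's implementation: it swaps the ConvNet for a linear softmax classifier, and the epoch counts, learning rate, and decay value are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def project_columns(Q):
    """Clip to non-negative entries and renormalize each column to sum to one."""
    Q = np.clip(Q, 1e-8, None)
    return Q / Q.sum(axis=0, keepdims=True)

def train(X, y_noisy, K, epochs_frozen=5, epochs_joint=20,
          lr=0.1, q_decay=1e-3, seed=0):
    """Two-phase schedule on a linear softmax model (a stand-in for the
    paper's ConvNet): phase 1 trains only the base weights with Q frozen
    at the identity; phase 2 updates Q jointly, with a weight-decay term
    on Q to encourage diffusion."""
    rng = np.random.default_rng(seed)
    W = 0.01 * rng.standard_normal((K, X.shape[1]))
    Q = np.eye(K)
    for epoch in range(epochs_frozen + epochs_joint):
        learn_Q = epoch >= epochs_frozen               # unfreeze Q in phase 2
        for x, yt in zip(X, y_noisy):
            s = softmax(W @ x)                         # base prediction p(y*|x)
            p = Q @ s                                  # noisy prediction p(y~|x)
            py = max(p[yt], 1e-12)                     # guard against log(0)
            g_s = -Q[yt] / py                          # d loss / d s, loss = -log p[yt]
            g_z = (np.diag(s) - np.outer(s, s)) @ g_s  # chain through softmax Jacobian
            W -= lr * np.outer(g_z, x)
            if learn_Q:
                g_Q = np.zeros_like(Q)
                g_Q[yt] = -s / py                      # d loss / d Q (only row yt is hit)
                Q = project_columns(Q - lr * (g_Q + q_decay * Q))
    return W, Q
```

After every step on Q, the projection restores column-stochasticity, so the weight-decay term effectively pulls Q toward a more diffuse (more uniform) matrix rather than toward zero.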
The paper also handles outlier noise, where some images do not belong to any of the target classes. An extra “outlier” class is added, expanding the output dimension to K + 1. The corresponding Q* has a block structure that maps true in‑class labels to themselves and maps outlier samples uniformly to the K in‑class labels. Because this matrix is singular, a small ε is added to make it invertible during training.
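The block structure described above can be written out explicitly. The sketch below mixes in ε times the identity to restore invertibility while keeping columns stochastic; this is one natural way to apply the ε smoothing, and may differ in detail from the paper's:

```python
import numpy as np

def outlier_noise_matrix(K, eps=0.01):
    """Block transition matrix for the outlier setting, size (K+1) x (K+1).
    Column i holds P(observed | true = i): in-class labels map to themselves,
    and the outlier class (index K) spreads uniformly over the K real labels.
    The raw matrix has an all-zero last row and is singular; mixing in
    eps * I makes it invertible while keeping columns stochastic."""
    Q = np.zeros((K + 1, K + 1))
    Q[:K, :K] = np.eye(K)   # true class i -> observed label i
    Q[:K, K] = 1.0 / K      # outlier -> uniform over the K in-class labels
    return (1.0 - eps) * Q + eps * np.eye(K + 1)
```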
Extensive experiments are conducted on CIFAR‑10, CIFAR‑100, and ImageNet. The authors synthetically inject label‑flip noise at rates ranging from 0 % to 80 % and compare three setups: (1) a vanilla ConvNet, (2) the proposed model with a learned Q, and (3) an oracle model that knows Q*. Results show that the vanilla network’s accuracy drops dramatically beyond 30‑40 % noise, while the proposed model retains much higher accuracy, often within 5‑10 % of the oracle even at 60 % noise. On ImageNet, where the baseline top‑1 accuracy falls from ~70 % to ~60 % under heavy noise, the noise‑layer model improves by roughly 3 % absolute. For outlier experiments, adding the extra class and learning Q yields a consistent boost of about 2‑3 % over the baseline.
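The synthetic label-flip protocol used in these experiments can be reproduced in a few lines. The sketch below follows one common convention, flipping a label to a different class chosen uniformly at random; the paper's exact flip distribution may differ:

```python
import numpy as np

def flip_labels(y, K, noise_rate, seed=None):
    """Inject synthetic label-flip noise: with probability noise_rate,
    replace each label with a different class drawn uniformly at random."""
    rng = np.random.default_rng(seed)
    y = y.copy()
    flip = rng.random(len(y)) < noise_rate
    for i in np.nonzero(flip)[0]:
        choices = [c for c in range(K) if c != y[i]]
        y[i] = rng.choice(choices)
    return y
```

Sweeping `noise_rate` from 0.0 to 0.8 over a clean label vector reproduces the experimental regime described above.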
The contributions of the work are threefold: (i) a clean, end‑to‑end trainable formulation that explicitly models label noise via a linear stochastic layer, (ii) a practical learning scheme that jointly estimates the noise transition matrix without any clean validation data, and (iii) thorough empirical validation on both synthetic and large‑scale real datasets, demonstrating that the method scales to millions of images with negligible computational overhead. The approach sidesteps the need for costly pre‑processing or manual label cleaning, making it attractive for real‑world applications where noisy annotations are the norm.