An Analysis of Concept Bottleneck Models: Measuring, Understanding, and Mitigating the Impact of Noisy Annotations


Concept bottleneck models (CBMs) ensure interpretability by decomposing predictions into human-interpretable concepts. Yet the annotations used to train CBMs, which enable this transparency, are often noisy, and the impact of such corruption is not well understood. In this study, we present the first systematic study of noise in CBMs and show that even moderate corruption simultaneously impairs prediction performance, interpretability, and intervention effectiveness. Our analysis identifies a susceptible subset of concepts whose accuracy declines far more than the average gap between noisy and clean supervision and whose corruption accounts for most of the performance loss. To mitigate this vulnerability, we propose a two-stage framework. During training, sharpness-aware minimization stabilizes the learning of noise-sensitive concepts. During inference, where clean labels are unavailable, we rank concepts by predictive entropy and correct only the most uncertain ones, using uncertainty as a proxy for susceptibility. Theoretical analysis and extensive ablations elucidate why sharpness-aware training confers robustness and why uncertainty reliably identifies susceptible concepts, providing a principled basis for preserving both interpretability and resilience in the presence of noise.


💡 Research Summary

Concept bottleneck models (CBMs) decompose a prediction into a set of human‑interpretable binary concepts and then use these concepts to produce the final output. While this architecture offers transparency and the ability to intervene on concepts at test time, it also makes the model highly dependent on the quality of the concept annotations used during training. This paper presents the first systematic investigation of how noisy concept labels affect CBMs, quantifies the degradation across three key dimensions—prediction accuracy, interpretability, and intervention effectiveness—and proposes a two‑stage mitigation framework that preserves both performance and interpretability.

Empirical Study of Noise Impact
The authors inject uniform label noise by flipping each binary concept with probability γ (0 %–40 %) and similarly corrupt target labels. Experiments are conducted on two benchmark datasets: CUB (bird species) and AwA2 (animal attributes). Three noise regimes are examined: concept‑only, target‑only, and combined. Results show that even modest noise (10 %) reduces CUB task accuracy by 16.6 %, while 30 % noise leads to a 51 % drop. Crucially, concept‑only noise yields almost the same performance loss as combined noise, indicating that corrupted concepts are the primary driver of degradation, whereas target‑label noise has a negligible effect. t‑SNE visualizations reveal that concept noise destroys class‑wise clustering of learned embeddings, while target noise leaves the structure largely intact.
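The uniform flip-noise model described above can be sketched in a few lines. The function name and toy array shapes here are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def flip_labels(labels: np.ndarray, gamma: float, seed: int = 0) -> np.ndarray:
    """Flip each binary label independently with probability gamma."""
    rng = np.random.default_rng(seed)
    mask = rng.random(labels.shape) < gamma
    return np.where(mask, 1 - labels, labels)

# Corrupt a toy concept matrix (samples x concepts) at 30% noise.
concepts = np.zeros((1000, 5), dtype=int)
noisy = flip_labels(concepts, gamma=0.3)
flip_rate = (noisy != concepts).mean()  # empirical flip rate, close to 0.3
```

The same routine applied to the target labels gives the target-only and combined regimes.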

Interpretability is measured with the Concept Alignment Score (CAS), which quantifies the semantic alignment between learned representations and ground‑truth concepts. CAS declines roughly 7 % for every 10 % increase in noise, dropping from 84 % (clean) to 58 % (40 % noise) on CUB. Qualitative visualizations of individual concepts (e.g., “blue upperparts”) show that noise entangles active and inactive samples, eroding the explanatory power of the concepts.

Intervention effectiveness is evaluated using three concept‑selection strategies from prior work: Random, Uncertainty of Concept Prediction (UCP), and Contribution of Concept to Target Prediction (CCTP). As noise increases, the maximum achievable accuracy after exhaustive intervention falls dramatically; at 40 % noise even perfect intervention cannot recover the performance of a clean model. Moreover, deliberately introducing incorrect interventions demonstrates that CBMs trained on noisy data are far less tolerant of human errors.
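Test-time intervention itself is simple to express: the selected concept predictions are overwritten with their ground-truth values before the downstream predictor runs. A minimal sketch, with hypothetical names and toy values:

```python
import numpy as np

def intervene(pred_probs: np.ndarray, true_concepts: np.ndarray, indices) -> np.ndarray:
    """Replace the selected concept predictions with ground-truth values."""
    corrected = pred_probs.copy()
    corrected[:, indices] = true_concepts[:, indices]
    return corrected

# Hypothetical 3-concept example: a human corrects concept 1 only.
pred = np.array([[0.9, 0.2, 0.6]])
truth = np.array([[1.0, 1.0, 0.0]])
out = intervene(pred, truth, [1])
# out -> [[0.9, 1.0, 0.6]]
```

The strategies (Random, UCP, CCTP) differ only in how `indices` is chosen; a noisy-data CBM degrades under all of them.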

Susceptible Concept Set
Although noise is applied uniformly, the degradation is highly non‑uniform across concepts. By comparing each concept’s accuracy drop to the mean drop, the authors identify a “susceptible set” comprising roughly 23 % of concepts that suffer disproportionately large accuracy losses. These concepts tend to be low‑frequency or semantically ambiguous. Kernel density estimates of concept frequencies show that noise flattens the natural long‑tail distribution, reducing the effective signal‑to‑noise ratio for rare but informative concepts. The susceptible set accounts for the majority of the overall performance decline, highlighting a localized vulnerability.
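The susceptible-set criterion (a per-concept accuracy drop exceeding the mean drop) can be sketched as follows; the accuracy values are toy numbers for illustration, not results from the paper:

```python
import numpy as np

def susceptible_set(acc_clean: np.ndarray, acc_noisy: np.ndarray):
    """Return indices of concepts whose accuracy drop exceeds the mean drop."""
    drops = acc_clean - acc_noisy
    return np.flatnonzero(drops > drops.mean()), drops

# Toy per-concept accuracies under clean vs. noisy supervision.
acc_clean = np.array([0.95, 0.93, 0.90, 0.94])
acc_noisy = np.array([0.92, 0.60, 0.88, 0.91])
idx, drops = susceptible_set(acc_clean, acc_noisy)
# idx -> [1]: concept 1's drop (0.33) far exceeds the mean drop (~0.10)
```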

Two‑Stage Mitigation Framework

  1. Training‑time Sharpness‑Aware Minimization (SAM) – SAM augments standard SGD by seeking parameters that minimize the maximum loss within a small neighborhood, effectively flattening the loss landscape. This reduces the sensitivity of the model to perturbations in the training data. Empirically, SAM dramatically curtails the accuracy drop of susceptible concepts (from ~30 % to ~12 % under 40 % noise) and improves overall task accuracy across noise levels.

  2. Inference‑time Uncertainty‑Based Intervention – Since clean concept labels are unavailable at test time, the authors compute predictive entropy for each concept and rank concepts by uncertainty. High‑entropy concepts are corrected (e.g., by a human expert) before feeding the revised concept vector to the downstream predictor. This strategy leverages the observation that susceptible concepts exhibit higher entropy. Across all noise levels, entropy‑guided correction recovers substantially more accuracy than Random, UCP, or CCTP, achieving >70 % accuracy even at 40 % noise with a modest number of interventions.
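The sharpness-aware update in stage 1 can be illustrated on a toy objective. This sketch follows the standard two-step SAM scheme (ascend to the worst-case point within a ρ-ball, then descend using the gradient computed there); the numpy formulation is an illustrative assumption, not the paper's implementation:

```python
import numpy as np

def sam_step(w: np.ndarray, grad_fn, lr: float = 0.1, rho: float = 0.05) -> np.ndarray:
    """One sharpness-aware minimization step on a parameter vector.

    grad_fn(w) returns the loss gradient at w. SAM first perturbs the
    weights toward the worst-case point w + eps inside a rho-ball, then
    applies the gradient evaluated there to the original weights.
    """
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent direction
    g_sharp = grad_fn(w + eps)                   # gradient at perturbed point
    return w - lr * g_sharp

# Toy quadratic loss L(w) = 0.5 * ||w||^2, so grad_fn(w) = w.
w = np.array([1.0, -2.0])
w_new = sam_step(w, grad_fn=lambda w: w)
```

Because the descent direction is evaluated at the perturbed point, flat minima (where the gradient changes little inside the ρ-ball) are preferred over sharp ones.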
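Stage 2's entropy-guided selection reduces to computing the Bernoulli entropy of each concept's predicted probability and picking the top-k most uncertain concepts. A minimal sketch with hypothetical probabilities:

```python
import numpy as np

def binary_entropy(p: np.ndarray) -> np.ndarray:
    """Predictive entropy (in nats) of Bernoulli concept probabilities."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def rank_by_uncertainty(pred_probs: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k most uncertain concepts, averaged over samples."""
    mean_entropy = binary_entropy(pred_probs).mean(axis=0)
    return np.argsort(mean_entropy)[::-1][:k]

# Toy batch of 2 samples x 3 concepts.
probs = np.array([[0.95, 0.52, 0.10],
                  [0.90, 0.48, 0.15]])
top = rank_by_uncertainty(probs, k=1)
# Concept 1 (probabilities near 0.5) is the most uncertain -> top == [1]
```

The returned indices are then passed to the intervention step, where a human supplies the correct values for just those concepts.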

Theoretical Insights
The paper provides a theoretical justification for both components. For SAM, the authors show that minimizing sharpness reduces the Lipschitz constant of the loss, leading to a broader basin of attraction around minima and thus robustness to label noise. For entropy‑based selection, a Bayesian analysis demonstrates that correcting the most uncertain concepts yields the greatest expected reduction in posterior predictive loss, because uncertainty directly bounds the contribution of a concept to the overall predictive variance.

Extensive Ablations
Additional experiments vary the SAM radius, test alternative uncertainty measures (e.g., variance, mutual information), and evaluate different backbone architectures. The proposed framework consistently outperforms baselines, confirming that the benefits are not tied to a specific model configuration.

Implications
The findings underscore that CBMs, while attractive for interpretability, are intrinsically vulnerable to noisy concept annotations. By pinpointing a small subset of fragile concepts and stabilizing them through sharpness‑aware training, followed by targeted uncertainty‑driven correction at inference, the authors achieve a practical balance between transparency and robustness. This has direct relevance for safety‑critical domains—such as medical imaging, autonomous driving, and legal decision support—where human‑in‑the‑loop interventions are essential and annotation noise is unavoidable. The work thus provides both a diagnostic toolkit for assessing CBM robustness and a principled mitigation strategy that can be readily incorporated into existing pipelines.

