CalibrateMix: Guided-Mixup Calibration of Image Semi-Supervised Models
📝 Abstract
Semi-supervised learning (SSL) has demonstrated high performance in image classification tasks by effectively utilizing both labeled and unlabeled data. However, existing SSL methods often suffer from poor calibration, with models yielding overconfident predictions that misrepresent actual prediction likelihoods. Recently, neural networks trained with mixup, which linearly interpolates random pairs of examples from the training set, have shown better calibration in supervised settings. However, the calibration of neural models remains under-explored in semi-supervised settings. Although effective for supervised model calibration, random mixup of pseudo-labels in SSL presents challenges due to the overconfidence and unreliability of pseudo-labels. In this work, we introduce CalibrateMix, a targeted mixup-based approach that aims to improve the calibration of SSL models while maintaining or even improving their classification accuracy. Our method leverages the training dynamics of labeled and unlabeled samples to identify "easy-to-learn" and "hard-to-learn" samples, which in turn are used in a targeted mixup of easy and hard samples. Experimental results across several benchmark image datasets show that our method achieves lower expected calibration error (ECE) and superior accuracy compared to existing SSL approaches.
📄 Content
Deep neural networks (DNNs) (LeCun, Bengio, and Hinton 2015; Mathew, Amudha, and Sivakumari 2021) have achieved remarkable success across a wide range of computer vision tasks, including image classification (Lu and Weng 2007), object detection (Papageorgiou, Oren, and Poggio 1998), and semantic segmentation (Minaee et al. 2021). However, alongside accuracy, the predictive confidence of a model plays a vital role in real-world decision-making. In critical applications such as autonomous driving, medical diagnosis, and disaster response, models must not only be accurate but also reliably indicate when they are uncertain, so that additional safety measures can be triggered. For this reason, quantifying predictive uncertainty and calibrating DNNs are pivotal steps toward building more reliable models.

Despite strong performance, DNNs often suffer from poor calibration (Guo et al. 2017), meaning that their predictive confidence tends to overestimate their true accuracy. A key reason is that modern DNNs are trained with one-hot encoded labels and the cross-entropy loss, which assumes that every training sample belongs with full certainty to a single class. This forces the model to assign the entire probability mass to a single class label, which in turn suppresses any expression of uncertainty, even for "ambiguous" samples. To counter overconfidence, label smoothing (Müller, Kornblith, and Hinton 2019) softens the target distribution during training, regularizing the model's output probabilities and encouraging it to remain uncertain where appropriate. More recently, Thulasidasan et al. (2019) explored mixup training (Zhang et al. 2018) for improving model calibration, which creates augmented samples through convex combinations of input samples and their labels.
Mixup distributes the label probability mass across two classes, which introduces entropy into the training targets, discourages overconfidence, and has proven to be an effective tool for model calibration.
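The convex combination described above can be sketched in a few lines. This is a generic illustration of mixup (Zhang et al. 2018), not the paper's CalibrateMix procedure; the Beta parameter `alpha=0.2` is a common choice, not a value taken from this work.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Convex combination of two inputs and their one-hot labels.

    lam ~ Beta(alpha, alpha) controls how much of each sample is used;
    the mixed label spreads probability mass over (at most) two classes.
    """
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)       # mixing coefficient in (0, 1)
    x = lam * x1 + (1.0 - lam) * x2    # interpolated input
    y = lam * y1 + (1.0 - lam) * y2    # soft, two-class label
    return x, y

# Toy usage: two 4-pixel "images" from classes 0 and 2 (3 classes total).
x_a, y_a = np.ones(4), np.array([1.0, 0.0, 0.0])
x_b, y_b = np.zeros(4), np.array([0.0, 0.0, 1.0])
x_mix, y_mix = mixup(x_a, y_a, x_b, y_b)
assert np.isclose(y_mix.sum(), 1.0)    # mixed label is still a distribution
```

Because the mixed target is never a one-hot vector (except in the degenerate lam = 0 or 1 cases), the cross-entropy loss no longer rewards pushing all probability onto a single class, which is the mechanism behind mixup's calibration benefit.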
However, this approach primarily targets the fully supervised setting, which requires a large amount of labeled data. As AI spreads into every domain, obtaining large amounts of annotated data for each one is impractical. Semi-supervised learning (SSL) (Chapelle, Scholkopf, and Zien 2009) addresses this by leveraging large amounts of unlabeled examples during training to boost model performance. Despite mixup's success in supervised learning, mixing unlabeled examples in an SSL setting poses challenges because pseudo-label correctness is uncertain, especially in the early iterations of training. This makes it critical to ensure the quality of pseudo-labels before incorporating them into the learning process. A popular SSL method for learning from unlabeled examples is pseudo-labeling (Lee et al. 2013), which uses a model's predictions on unlabeled examples as (pseudo) ground truth during training. To ensure that correct pseudo-labels are used for model training, modern SSL frameworks such as FixMatch (Sohn et al. 2020) and FlexMatch (Zhang et al. 2021) apply high confidence thresholds to maintain data quality and filter out potentially incorrect examples. However, the calibration of these SSL models is not well studied, and we find empirical evidence that SSL models also suffer from poor calibration, as shown in the reliability diagrams of FixMatch and FlexMatch on CIFAR-100 in Figures 1a and 1b, respectively. The diagrams, which plot accuracy as a function of confidence, show that the confidence estimates of the models are not indicative of their correctness. Notably, FixMatch and FlexMatch predictions with confidences higher than 90% have less than 65% accuracy, contradicting the assumption that high confidence thresholds lead to high pseudo-label quality.
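The confidence-threshold filtering used by FixMatch-style frameworks can be sketched as follows. This is a simplified illustration, not the paper's code; 0.95 is FixMatch's published default threshold (FlexMatch adapts it per class), and the probability arrays are made up for demonstration.

```python
import numpy as np

def pseudo_label_mask(probs, threshold=0.95):
    """FixMatch-style filtering: keep an unlabeled example only when the
    model's maximum predicted probability exceeds a fixed threshold.
    Returns (pseudo_labels, mask); masked-out examples are skipped in the
    unsupervised loss term.
    """
    probs = np.asarray(probs, dtype=float)
    confidence = probs.max(axis=1)
    pseudo_labels = probs.argmax(axis=1)
    mask = confidence >= threshold
    return pseudo_labels, mask

# Three unlabeled examples over 3 classes. The second is confident enough
# to pass the filter, yet its pseudo-label can still be wrong: with a
# miscalibrated model, high confidence does not guarantee correctness.
p = np.array([[0.50, 0.30, 0.20],   # filtered out (low confidence)
              [0.96, 0.02, 0.02],   # kept: confident, possibly incorrect
              [0.97, 0.02, 0.01]])  # kept
labels, mask = pseudo_label_mask(p)
print(labels[mask])  # pseudo-labels actually used in training
```

The filter only sees confidence, never correctness, so any overconfident misprediction above the threshold enters training as if it were ground truth.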
As shown in Figure 1c, the impurity of the unlabeled data for both FixMatch and FlexMatch on CIFAR-100 (Krizhevsky, Hinton et al. 2009) is higher than 13%, indicating that more than 13% of the unlabeled data is used with incorrect pseudo-labels, even at a late stage of training (epoch 9000). Figure 1d shows examples of incorrect pseudo-labels introduced by both FixMatch and FlexMatch at the end of training. These incorrect pseudo-labels arise from miscalibrated model predictions, manifested here as overconfidence in potentially incorrect predictions. This calibration gap in SSL cannot be fully closed by random mixup. When incorrect pseudo-labels are used in mixup, the interpolation process can propagate label noise across training samples, which not only reinforces errors but also makes them harder to correct. For deep learning models that are highly over-parameterized and capable of achieving near-zero training error, such propagated label noise can simply be memorized rather than corrected.
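The calibration gap visualized in the reliability diagrams above is commonly summarized by the expected calibration error (ECE), the metric reported in the abstract. A minimal sketch of the standard binned estimator, with an illustrative bin count of 15 (the paper's binning may differ):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: weighted average |accuracy - confidence| over confidence bins.

    `confidences`: max softmax probability per prediction, in [0, 1].
    `correct`: boolean array, True where the prediction was right.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by fraction of samples in bin
    return ece

# Overconfident toy model, mimicking the pattern in the text:
# every prediction at 90% confidence, but only 60% of them correct.
conf = np.full(10, 0.9)
hits = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0], dtype=bool)
print(expected_calibration_error(conf, hits))  # ~0.3, a large gap
```

A perfectly calibrated model (accuracy equals confidence in every bin) would score an ECE of zero; overconfident SSL models like those in Figures 1a and 1b score well above it.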
This content is AI-processed based on ArXiv data.