Learning From Labeled And Unlabeled Data: An Empirical Study Across Techniques And Domains
There has been increased interest in devising learning techniques that combine unlabeled data with labeled data, i.e., semi-supervised learning. However, to the best of our knowledge, no study has been performed across various techniques and different types and amounts of labeled and unlabeled data. Moreover, most of the published work on semi-supervised learning techniques assumes that the labeled and unlabeled data come from the same distribution. It is possible for the labeling process to be associated with a selection bias such that the distributions of data points in the labeled and unlabeled sets are different. Not correcting for such bias can result in biased function approximation with potentially poor performance. In this paper, we present an empirical study of various semi-supervised learning techniques on a variety of datasets. We attempt to answer various questions, such as the effect of independence or relevance amongst features, the effect of the size of the labeled and unlabeled sets, and the effect of noise. We also investigate the impact of sample-selection bias on the semi-supervised learning techniques under study and implement a bivariate probit technique particularly designed to correct for such bias.
💡 Research Summary
This paper presents a comprehensive empirical evaluation of several semi‑supervised learning (SSL) techniques across multiple domains, with a particular focus on the often‑overlooked issue of sample‑selection bias in the labeling process. The authors categorize bias into three statistical mechanisms: MCAR (Missing Completely At Random), MAR (Missing At Random), and MNAR (Missing Not At Random). Under MCAR the labeled and unlabeled sets are drawn from the same distribution, under MAR the probability of being labeled depends only on the observed features, and under MNAR it also depends on the true class label, creating a systematic discrepancy between the two sets.
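The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the feature `x`, the class rule `y = 1 if x > 0.5`, and the specific labeling probabilities are hypothetical choices, not values from the paper. It shows how the labeled set's class distribution stays faithful under MCAR, drifts under MAR (because labeling tracks a feature correlated with the class), and is directly distorted under MNAR.

```python
import random

random.seed(0)

# Toy population: one feature x in [0, 1]; the class y follows x, so the
# overall positive rate is about 0.5 (hypothetical setup for illustration).
data = []
for _ in range(10_000):
    x = random.random()
    data.append((x, int(x > 0.5)))

def is_labeled(x, y, mechanism):
    """One illustrative selection model per missing-label mechanism."""
    if mechanism == "MCAR":
        # Constant labeling probability: labeled and unlabeled sets
        # are drawn from the same distribution.
        return random.random() < 0.3
    if mechanism == "MAR":
        # Labeling probability depends only on the observed feature x.
        return random.random() < (0.6 if x > 0.5 else 0.1)
    if mechanism == "MNAR":
        # Labeling probability also depends on the true class y,
        # creating a systematic discrepancy between the two sets.
        return random.random() < (0.5 if y == 1 else 0.05)
    raise ValueError(mechanism)

rates = {}
for mech in ("MCAR", "MAR", "MNAR"):
    labeled = [(x, y) for x, y in data if is_labeled(x, y, mech)]
    rates[mech] = sum(y for _, y in labeled) / len(labeled)
    print(f"{mech}: labeled-set positive rate = {rates[mech]:.2f}")
```

Under MCAR the printed positive rate stays near the population's 0.5, while MAR and MNAR both inflate it, MNAR most severely, which is exactly the discrepancy the bias-correction methods below target.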
The study evaluates six methods:
1. Co‑training, which assumes two feature views that are conditionally independent given the class and iteratively labels unlabeled instances from each view;
2. Re‑weighting, which estimates class priors from the unlabeled pool and adjusts instance weights accordingly;
3. ASSEMBLE, a boosting‑style ensemble that folds pseudo‑labeled unlabeled instances into successive weak learners;
4. Common‑Component Mixture (CCM), an EM‑based mixture model that learns a latent structure shared by the labeled and unlabeled data;
5. Bivariate Probit, an econometric model that jointly estimates the selection equation and the outcome equation to correct MNAR bias;
6. Sample‑Select, a two‑stage approach that first learns a selection model and then uses its predicted probabilities as weights in the SSL algorithm.
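To make the co-training loop concrete, here is a minimal sketch under assumed conditions: two redundant "views" that are each a noisy copy of the class signal, a toy nearest-centroid learner per view, and arbitrary round/batch sizes. None of these choices come from the paper; the point is only the control flow, in which each view's classifier pseudo-labels its most confident unlabeled instances for the shared pool.

```python
import random

random.seed(1)

# Toy data with two redundant "views": each view is a noisy copy of the
# class signal, standing in for co-training's two-feature-set assumption.
truth = [i % 2 for i in range(400)]
points = [(y + random.gauss(0, 0.3), y + random.gauss(0, 0.3)) for y in truth]
truth_of = dict(zip(points, truth))

def centroid_clf(labeled, view):
    """Tiny per-view learner: predict the class whose centroid is nearest
    on the given view; confidence is the margin between the two distances."""
    cents = {c: sum(x[view] for x, y in labeled if y == c)
                / sum(1 for _, y in labeled if y == c) for c in (0, 1)}
    def predict(x):
        d0, d1 = abs(x[view] - cents[0]), abs(x[view] - cents[1])
        return (0 if d0 < d1 else 1), abs(d0 - d1)   # (label, confidence)
    return predict

labeled = list(zip(points[:20], truth[:20]))   # small labeled seed
unlabeled = list(points[20:])

for _ in range(5):                  # co-training rounds (arbitrary count)
    for view in (0, 1):             # each view teaches the shared pool
        clf = centroid_clf(labeled, view)
        ranked = sorted(unlabeled, key=lambda x: -clf(x)[1])
        for x in ranked[:10]:       # move the 10 most confident guesses
            labeled.append((x, clf(x)[0]))
            unlabeled.remove(x)

pseudo = labeled[20:]
acc = sum(y == truth_of[x] for x, y in pseudo) / len(pseudo)
print(f"{len(pseudo)} pseudo-labels, accuracy {acc:.2f}")
```

Because the confidence ranking front-loads the easy instances, the pseudo-labels stay accurate here; the fragility noted in the findings appears precisely when the two views stop being independent sources of evidence.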
Experiments were conducted on four real‑world domains—text classification, credit scoring, customer marketing, and drug design—using several publicly available datasets. For each dataset the authors varied the proportion of labeled data (5 %, 10 %, 20 %, 50 %) and introduced controlled label noise (0 %, 10 %, 20 %). Performance was measured primarily by the area under the ROC curve (AUC), complemented by accuracy and F1‑score.
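Since AUC is the headline metric throughout, it helps to recall what it measures: the probability that a uniformly chosen positive instance is scored above a uniformly chosen negative one (the Mann-Whitney U formulation, with ties counted half). A minimal pairwise implementation:

```python
def auc(scores, labels):
    """AUC as the probability that a random positive instance outranks a
    random negative one; ties contribute one half. O(P*N) pairwise form."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))   # perfect ranking -> 1.0
print(auc([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0]))   # one inversion -> 0.75
```

This rank-based view also explains why the paper favors AUC over accuracy: it is insensitive to the score threshold and to class imbalance, both of which shift under selection bias.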
Key findings include:
- MCAR scenario – Co‑training and CCM consistently outperformed pure supervised learning, achieving 4–6 % AUC gains even when only 5–10 % of the data were labeled. Their success stems from the validity of the shared‑distribution assumption.
- MAR scenario – Re‑weighting and ASSEMBLE showed the largest improvements because they exploit the dependence of labeling probability on observable features. In text data with 5 % labeled examples, AUC increased by up to 8 %.
- MNAR scenario – All four traditional SSL methods suffered substantial performance drops; in some cases they performed worse than supervised baselines. Bivariate Probit and Sample‑Select recovered most of the loss, delivering average AUC gains of 5.3 % and 4.7 % respectively. The effect was most pronounced in credit‑scoring data where labeling was heavily class‑biased, with Bivariate Probit delivering >7 % absolute AUC improvement.
- Label noise – When noise exceeded 20 %, all SSL methods exhibited over‑fitting, with Re‑weighting being the most vulnerable (AUC decline of ~3 %). This underscores the need for high‑quality labels before applying SSL.
- Feature independence – Co‑training’s reliance on truly independent feature splits proved fragile; performance degraded sharply when independence was violated. CCM’s EM‑based latent‑structure learning proved robust to feature correlation.
- Scalability of bias correction – Both Bivariate Probit and Sample‑Select required sufficient unlabeled data (>10 k samples) to estimate the selection model reliably. When this condition was met, bias correction was stable across domains.
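The two-stage idea behind Sample-Select can be sketched on a toy MAR problem. Everything below is a hypothetical stand-in: a uniform feature, an assumed labeling rule, and a binned-frequency estimate of P(labeled | x) in place of the paper's probit selection model. The sketch shows the core mechanism: weighting each labeled instance by the inverse of its estimated selection probability so that the weighted labeled sample approximates the full population.

```python
import random

random.seed(2)

# Population: x ~ Uniform(0, 1); labeling is MAR -- the chance of being
# labeled depends on x (hypothetical mechanism for illustration).
n = 5000
xs = [random.random() for _ in range(n)]
labeled_flags = [random.random() < (0.8 if x > 0.5 else 0.2) for x in xs]

# Stage 1: fit a selection model P(labeled | x). A binned frequency
# estimate stands in for the paper's probit selection equation.
BINS = 10
tot, lab = [0] * BINS, [0] * BINS
for x, f in zip(xs, labeled_flags):
    b = min(int(x * BINS), BINS - 1)
    tot[b] += 1
    lab[b] += f
p_label = [lab[b] / tot[b] for b in range(BINS)]

# Stage 2: weight each labeled instance by 1 / P(labeled | x), so the
# weighted labeled sample mimics the full population.
labeled_x = [x for x, f in zip(xs, labeled_flags) if f]
weights = [1 / p_label[min(int(x * BINS), BINS - 1)] for x in labeled_x]

naive = sum(labeled_x) / len(labeled_x)
corrected = sum(w * x for w, x in zip(weights, labeled_x)) / sum(weights)
print(f"population mean 0.50, naive {naive:.2f}, corrected {corrected:.2f}")
```

The naive mean of the labeled pool is pulled toward the over-sampled region, while the inverse-probability-weighted mean lands near the population value; this also makes the scalability finding intuitive, since the selection probabilities themselves must be estimated well before the weights help.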
The authors also discuss practical implications. In many real applications, labeling is driven by human judgment or business rules, making MNAR bias common. In such settings, incorporating an explicit selection model (as in Bivariate Probit) or a two‑stage weighting scheme (Sample‑Select) is essential for achieving the promised benefits of SSL. Conversely, when labeled data are extremely scarce and noisy, investing in better labeling rather than sophisticated SSL may be more cost‑effective.
In conclusion, the paper demonstrates that semi‑supervised learning is not a one‑size‑fits‑all solution; its success hinges on correctly diagnosing the underlying missing‑label mechanism and selecting or adapting algorithms accordingly. By systematically evaluating bias‑aware methods, the study provides clear guidance for practitioners seeking to leverage unlabeled data in realistic, bias‑prone environments.