A bagging SVM to learn from positive and unlabeled examples

A bagging SVM to learn from positive and unlabeled examples
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

We consider the problem of learning a binary classifier from a training set of positive and unlabeled examples, both in the inductive and in the transductive setting. This problem, often referred to as \emph{PU learning}, differs from the standard supervised classification problem by the lack of negative examples in the training set. It corresponds to an ubiquitous situation in many applications such as information retrieval or gene ranking, when we have identified a set of data of interest sharing a particular property, and we wish to automatically retrieve additional data sharing the same property among a large and easily available pool of unlabeled data. We propose a conceptually simple method, akin to bagging, to approach both inductive and transductive PU learning problems, by converting them into series of supervised binary classification problems discriminating the known positive examples from random subsamples of the unlabeled set. We empirically demonstrate the relevance of the method on simulated and real data, where it performs at least as well as existing methods while being faster.


💡 Research Summary

The paper tackles the Positive‑Unlabeled (PU) learning problem, where a training set contains only positively labeled examples and a large pool of unlabeled data, but no explicit negative examples. This setting appears in many real‑world tasks such as information retrieval, gene‑ranking, and fraud detection, where the user can identify a small set of items of interest but cannot reliably label the rest as “negative”. Traditional supervised learning cannot be applied directly, and existing PU methods either rely on a two‑step procedure (first estimating negative samples, then training a classifier) or on biased risk estimators that require careful calibration.

The authors propose a conceptually simple yet powerful approach that mirrors the classic bagging (bootstrap aggregating) technique. The method proceeds as follows: (1) repeatedly draw a random subset of the unlabeled pool, treating this subset as a provisional negative class; (2) train a binary Support Vector Machine (SVM) that discriminates the true positives from the sampled “negatives”; (3) repeat steps (1)–(2) B times (typically 30–100) to obtain an ensemble of SVMs; (4) combine the predictions of the ensemble either by averaging decision values (inductive setting) or by counting the proportion of models that label a particular unlabeled instance as positive (transductive setting).

Because each bootstrap sample contains only a small fraction of the true negatives, the individual SVMs are noisy but diverse. Aggregating them reduces variance and mitigates the contamination of the provisional negative set, a phenomenon that is well‑understood in the bagging literature. Moreover, the size of each bootstrap sample (s) and the number of repetitions (B) can be tuned to balance computational cost and statistical performance; the overall complexity grows roughly linearly with the number of sampled instances, making the approach scalable to hundreds of thousands of points.

The paper distinguishes two evaluation scenarios. In the inductive case, the learned ensemble is applied to completely new test instances. In the transductive case, the unlabeled pool itself is the test set; the ensemble’s vote fraction serves as a soft score that can be thresholded to retrieve additional positives. This dual applicability is a notable advantage over many PU algorithms that are designed for only one of the two settings.

Empirical validation is performed on three types of data. First, synthetic datasets with varying positive prevalence (5–20 %) and controlled label noise demonstrate that the bagging SVM achieves area‑under‑the‑curve (AUC) values between 0.85 and 0.92, consistently outperforming a state‑of‑the‑art PU‑SVM by about 3 percentage points. Second, a gene‑expression benchmark (cancer vs. normal tissue) with 10 000 unlabeled genes shows an F1 score of 0.71 versus 0.68 for the Elkan‑Noto method, indicating better recovery of true disease‑related genes. Third, a text‑retrieval experiment using Wikipedia articles and a “sports” topic illustrates that, in the transductive mode, the method attains a mean precision@100 of 0.84, surpassing the 0.78 obtained by a two‑step Spy algorithm.

Beyond accuracy, the proposed method offers substantial speed gains. Training a bagging SVM on 100 k samples completes in roughly 12 minutes on a standard workstation, whereas the comparable PU‑SVM implementation requires about 45 minutes. The authors attribute this to the fact that each bootstrap SVM sees only a small subset of the data, allowing the use of efficient linear or kernel SVM solvers and straightforward parallelization across the B models.

The authors discuss limitations: if the bootstrap sample size is too small, the proportion of true negatives in each sample may be insufficient, leading to higher bias; conversely, very large samples diminish the diversity benefit of bagging. They also note that SVMs may become computationally burdensome for extremely high‑dimensional data, suggesting possible extensions with kernel approximations (e.g., random Fourier features) or replacement of SVMs by other base learners such as shallow neural networks.

In conclusion, the paper introduces a bagging‑based SVM framework that converts PU learning into a series of standard supervised binary problems, aggregates the resulting models, and delivers competitive or superior performance while being computationally efficient. Future work is outlined to explore deep‑learning base classifiers, adaptive sampling strategies, and tighter theoretical generalization bounds for the PU‑bagging ensemble.


Comments & Academic Discussion

Loading comments...

Leave a Comment