On the Evaluation Criterions for the Active Learning Processes
In many data mining applications, collecting a sufficiently large labeled dataset is the most time-consuming and expensive step. On the other hand, industrial data collection produces huge databases that make direct application of advanced machine learning algorithms difficult. To address these problems, we consider active learning (AL), which can be highly efficient both for experimental design and for data filtering. Using the online evaluation opportunity provided by the AL Challenge, we demonstrate in this paper that quite competitive results can be produced from a small percentage of the available data. We also present several alternative criteria that may be useful for evaluating active learning processes. The author attended the special session in Barcelona where the results of the WCCI 2010 AL Challenge were discussed.
💡 Research Summary
The paper addresses the problem of evaluating active learning (AL) processes, a topic that has become increasingly important as large unlabeled data pools are readily available while labeling remains costly. Using the online evaluation framework of the WCCI 2010 Active Learning Challenge, the author demonstrates that competitive performance can be achieved with only a small fraction of the total data, but also reveals serious shortcomings in the challenge’s official metric, the Area under the Learning Curve (ALC).
The author first divides the AL lifecycle into three phases: an “initial” phase with very few labeled instances, an “actual” phase where the learner is expected to make steady improvements, and a “validation” phase where performance is finally measured. The challenge’s ALC metric is defined as a weighted sum of the AUC values obtained after each labeling request, with weights w_i = log₂(1 + 1/(i − 1)). Because w_i decays rapidly as i grows, the first few requests dominate the final score. Consequently, a strategy that makes a large “jump” early (acquiring a relatively big batch of labels at the outset) receives a disproportionately high ALC, while incremental, genuinely active learning strategies are penalized.
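Since 1 + 1/(i − 1) = i/(i − 1), the quoted weights telescope and the total weight over N requests is log₂ N, which makes the early-request dominance easy to quantify. A minimal sketch, assuming the weight formula exactly as quoted (with the index starting at the second request):

```python
import math

def alc_weights(n_requests):
    """Weights w_i = log2(1 + 1/(i - 1)) for i = 2..n_requests,
    as quoted above (the expression is undefined at i = 1)."""
    return [math.log2(1.0 + 1.0 / (i - 1)) for i in range(2, n_requests + 1)]

weights = alc_weights(10)

# The sum telescopes: sum_i log2(i/(i-1)) = log2(n_requests).
total = sum(weights)

# Fraction of the total weight carried by just the first two requests:
early_share = sum(weights[:2]) / total  # log2(3)/log2(10) ≈ 0.48
```

With ten requests, the first two alone carry almost half of the total weight, which matches the summary's claim that an early "jump" dominates the final ALC.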
To illustrate this effect, the author describes the concrete methodology used in the competition. Starting from a single positive example, a set of random negative samples (50–100 instances) is drawn under the assumption of severe class imbalance. The decision function is computed as a simple average and sorted, and the region where it declines sharply is used to select the next query points. After a few iterations the author switches from random sampling to uncertainty sampling, employing a variety of classifiers (ridge, GLM, AdaBoost, Gradient Boosting) from the CLOP and R packages, and finally builds an ensemble whose weights are derived from cross-validation performance. The experiments on six datasets (A–F) show that using only 1–9% of the available data yields AUC and ALC values comparable to the top submissions.
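The switch to uncertainty sampling mentioned above can be sketched generically: query the pool points whose predicted class probability is closest to 0.5. The `predict_proba` interface and the batch size below are illustrative assumptions, not the paper's actual CLOP/R implementation:

```python
import numpy as np

def uncertainty_sample(model, X_pool, labeled_idx, batch_size=10):
    """Return indices of the unlabeled pool points the model is least
    sure about, i.e. with P(y=1) closest to 0.5.

    `model` is any fitted binary classifier exposing predict_proba
    (a hypothetical interface; the paper used ridge/GLM/boosting
    models from the CLOP and R packages)."""
    candidates = np.setdiff1d(np.arange(len(X_pool)), labeled_idx)
    proba = model.predict_proba(X_pool[candidates])[:, 1]
    uncertainty = -np.abs(proba - 0.5)      # higher = more uncertain
    order = np.argsort(uncertainty)[::-1]   # most uncertain first
    return candidates[order[:batch_size]]
```

Each AL iteration would fit the current model on the labeled set, call `uncertainty_sample` to pick the next batch, request those labels, and repeat.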
A detailed mathematical analysis of the ALC metric follows. By expanding the definition, the author shows that the contribution of later queries is essentially negligible, and that the metric can be rewritten as a function heavily weighted toward the first AUC (AUC₁). In the binary-request case (N = 2) the formula simplifies further, confirming that maximizing AUC₁ is the most effective way to improve the final score. The paper also reports a “mistake” where two early points were identical due to a directory error, yet the ALC changed dramatically, underscoring the metric’s sensitivity.
Recognizing these flaws, the author proposes two alternative evaluation criteria. The first introduces a threshold δ (e.g., 1% of the total pool) and ignores any performance measured before this point; the ALC is then computed only on the remaining segment. The second defines a score Q = max_i …
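The first proposed criterion (discard the learning-curve points recorded before a δ fraction of the pool is labeled, then score the rest) can be sketched as follows. The restarted log weighting and the function interface are assumptions for illustration, not details given in the paper:

```python
import math

def truncated_alc(num_labels, aucs, pool_size, delta=0.01):
    """Truncated ALC: ignore measurements taken before delta * pool_size
    labels were acquired, then average the remaining AUCs with
    challenge-style log weights.

    `num_labels[i]` is the number of labels used at measurement i and
    `aucs[i]` the AUC obtained there (hypothetical interface)."""
    threshold = delta * pool_size
    kept = [(n, a) for n, a in zip(num_labels, aucs) if n >= threshold]
    if not kept:
        return 0.0
    # Same log2 decay as the original ALC, restarted at the first kept
    # point (an assumption about the proposal's details), normalized so
    # the score stays on the AUC scale.
    weights = [math.log2(1 + 1.0 / i) for i in range(1, len(kept) + 1)]
    total = sum(weights)
    return sum(w * a for w, (_, a) in zip(weights, kept)) / total
```

Because everything below the threshold is dropped, an early batch "jump" no longer inflates the score, which is the point of the proposal.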