In a previous publication we proposed discrete global optimization as a method to train a strong binary classifier constructed as a thresholded sum over weak classifiers. Our motivation was to cast the training of a classifier into a format amenable to solution by the quantum adiabatic algorithm. Applying adiabatic quantum computing (AQC) promises to yield solutions that are superior to those achievable with classical heuristic solvers. Interestingly, we found that by using heuristic solvers to obtain approximate solutions we could already gain an advantage over the standard method, AdaBoost. In this communication we generalize the baseline method to large scale classifier training. By large scale we mean that either the cardinality of the dictionary of candidate weak classifiers or the number of weak learners used in the strong classifier exceeds the number of variables that can be handled effectively in a single global optimization. For such situations we propose an iterative and piecewise approach in which a subset of weak classifiers is selected in each iteration via global optimization. The strong classifier is then constructed by concatenating the subsets of weak classifiers. We show in numerical studies that the generalized method again successfully competes with AdaBoost. We also provide theoretical arguments as to why the proposed optimization method, which not only minimizes the empirical loss but also adds L0-norm regularization, is superior to versions of boosting that only minimize the empirical loss. By conducting a quantum Monte Carlo simulation we gather evidence that the quantum adiabatic algorithm is able to handle a generic training problem efficiently.
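The iterative, piecewise training scheme sketched above amounts to an outer loop around repeated inner global optimizations. The following is a minimal sketch only; the candidate-drawing routine and the solver `solve_subset_qubo` are hypothetical placeholders, and the precise selection and stopping rules are specified later in the paper.

```python
def train_piecewise(draw_candidates, solve_subset_qubo, data, labels,
                    subset_size, num_iterations):
    """Sketch of the iterative, piecewise scheme: each iteration selects a
    subset of weak classifiers via one global optimization, and the strong
    classifier is the concatenation of the selected subsets."""
    strong_classifier = []  # weak classifiers accepted so far
    for _ in range(num_iterations):
        # Draw the next group of candidates from the (possibly huge) dictionary.
        candidates = draw_candidates(subset_size)
        # One inner global optimization: binary weights over this subset only.
        weights = solve_subset_qubo(candidates, data, labels)
        # Keep the weak classifiers that the optimizer switched on.
        strong_classifier += [h for h, w in zip(candidates, weights) if w == 1]
    return strong_classifier
```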
In [NDRM08] we study a binary classifier of the form

$$ y = H(x) = \operatorname{sign}\left( \sum_{i=1}^{N} w_i\, h_i(x) \right), \qquad (1) $$
where x ∈ ℝ^M are the input patterns to be classified, y ∈ {-1, +1} is the output of the classifier, the h_i : x → {-1, +1} are so-called weak classifiers or feature detectors, and the w_i ∈ {0, 1} are a set of weights to be optimized during training. H(x) is known as a strong classifier. Training, i.e. the process of choosing the weights w_i, proceeds by simultaneously minimizing two terms. The first term, called the loss L(w), measures the error over a set of S training examples {(x_s, y_s) | s = 1, …, S}. We choose least squares as the loss function. The second term, known as regularization R(w), ensures that the classifier does not become too complex. We employ a regularization term based on the L0-norm, ‖w‖_0. This term encourages the strong classifier to be built with as few weak classifiers as possible while maintaining a low training error. Thus, training is accomplished by solving the following discrete optimization problem:

$$ w^{\mathrm{opt}} = \arg\min_{w} \left( \sum_{s=1}^{S} \left( \frac{1}{N} \sum_{i=1}^{N} w_i\, h_i(x_s) - y_s \right)^{2} + \lambda\, \|w\|_0 \right). \qquad (2) $$
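Because the weights are binary, the objective in Eqn. (2) can be expanded into a quadratic unconstrained binary optimization (QUBO) problem, which is the form handled by the heuristic and quantum solvers discussed above. The following is a minimal sketch of that expansion, assuming the 1/N normalization written in Eqn. (2); the brute-force minimizer is included for illustration only and is feasible only for very small dictionaries.

```python
import itertools

import numpy as np


def build_qubo(H, y, lam):
    """Build the QUBO matrix for the objective in Eqn. (2).

    H   : (S, N) array with H[s, i] = h_i(x_s) in {-1, +1}
    y   : (S,) array of labels y_s in {-1, +1}
    lam : regularization strength lambda

    Returns a symmetric matrix Q such that, up to an additive constant,
    the objective equals w^T Q w for binary w (using w_i^2 = w_i).
    """
    S, N = H.shape
    C = H.T @ H                        # C[i, j] = sum_s h_i(x_s) h_j(x_s)
    Q = C.astype(float) / N**2         # couplings from the squared loss
    # Linear terms: cross term with the labels plus the L0 penalty, folded
    # into the diagonal because w_i^2 = w_i for binary weights.
    Q[np.diag_indices(N)] += -(2.0 / N) * (H.T @ y) + lam
    return Q


def brute_force_weights(Q):
    """Exhaustive minimization of w^T Q w over binary w (illustration only)."""
    N = Q.shape[0]
    best_w, best_energy = None, np.inf
    for bits in itertools.product((0, 1), repeat=N):
        w = np.asarray(bits)
        energy = w @ Q @ w
        if energy < best_energy:
            best_w, best_energy = w, energy
    return best_w, best_energy
```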
Note that in our formulation, the weights are binary and not positive real numbers as in AdaBoost.
Even though discrete optimization could be applied to any bit depth representing the weights, we found that a small bit depth is often sufficient [NDRM08]. Here we only deal with the simplest case in which the weights are chosen to be binary.
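For illustration, a weight of bit depth K can be represented within the same binary framework via a fixed-point expansion; the following encoding is one natural choice and is not necessarily the exact encoding used in [NDRM08]:

$$ w_i \;=\; \sum_{k=0}^{K-1} 2^{-k}\, b_{ik}, \qquad b_{ik} \in \{0, 1\}, $$

so that K = 1 recovers the binary weights used here, and the least-squares loss in Eqn. (2) remains a quadratic function of the K·N binary variables b_{ik}.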
In the case of a finite dictionary of weak classifiers {h_i(x) | i = 1, …, N}, AdaBoost can be seen as a greedy algorithm that minimizes the exponential loss [Zha04],

$$ L_{\exp}(\alpha) = \sum_{s=1}^{S} \exp\left( - y_s \sum_{i=1}^{N} \alpha_i\, h_i(x_s) \right), \qquad (3) $$
with α_i ∈ ℝ⁺. There are two differences between the objective of our algorithm (Eqn. 2) and the one employed by AdaBoost. The first is that we add L0-norm regularization. The second is that we employ a quadratic loss function, while AdaBoost works with the exponential loss.
It can easily be shown that including L0-norm regularization in the objective in Eqn. (2) leads to an improved generalization error as compared to using the quadratic loss only. The proof goes as follows. An upper bound for the Vapnik-Chervonenkis (VC) dimension of a strong classifier H of the form H(x) = sign(Σ_{t=1}^T h_t(x)) is given by

$$ VC_H \;\le\; 2\,\bigl(VC_{\{h_i\}} + 1\bigr)\,(T + 1)\,\log_2\!\bigl(e\,(T + 1)\bigr), \qquad (4) $$
where VC_{\{h_i\}} is the VC dimension of the dictionary of weak classifiers [FS95]. The strong classifier's generalization error Error_test therefore has an upper bound given by [VC71]

$$ \mathrm{Error}_{\mathrm{test}} \;\le\; \mathrm{Error}_{\mathrm{train}} + \sqrt{ \frac{ VC_H \left( \ln\frac{2S}{VC_H} + 1 \right) - \ln\frac{\eta}{4} }{ S } }, \qquad (5) $$

a bound that holds with probability 1 − η over the choice of the training set.
It is apparent that a more compact strong classifier, one that achieves a given training error Error_train with a smaller number T of weak classifiers (and hence with a smaller VC dimension VC_H), comes with a guarantee of a lower generalization error.
Figure 1: AdaBoost applied to a simple classification task. A shows the data, a separable set consisting of a two-dimensional cluster of positive examples (blue) surrounded by negative ones (red). B shows the random division into training (saturated colors) and test data (light colors). The dictionary of weak classifiers is constructed from axis-parallel one-dimensional hyperplanes. C shows the optimal classifier for this situation, which employs four weak classifiers to partition the input space into positive and negative areas. The lower row shows the partitions generated by AdaBoost after 10, 20, and 640 iterations. The configuration at T = 640 is the asymptotic configuration, which no longer changes in subsequent training rounds. The "breakout regions" outside the bounding box of the positive cluster occur in areas in which the training set contains no negative examples. This problem becomes more severe for higher dimensional data. Due to AdaBoost's greedy approach, the optimal configuration is not found even though the weak classifiers necessary to construct the ideal bounding box are generated. In fact, AdaBoost fails to learn higher dimensional versions of this problem altogether, with error rates approaching 50%. See section 6 for a discussion of how global optimization based learning handles this data set.
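To make the dictionary used in Figure 1 concrete: the axis-parallel weak classifiers are decision stumps that threshold a single coordinate. The following minimal sketch assumes, for illustration only, that thresholds are placed at the coordinates of the training points; the exact construction used in our experiments may differ.

```python
import numpy as np


def make_stump(dim, threshold, polarity):
    """Axis-parallel weak classifier h(x) = polarity * sign(x[dim] - threshold),
    with outputs in {-1, +1}."""
    def h(x):
        return polarity if x[dim] > threshold else -polarity
    return h


def stump_dictionary(X):
    """Enumerate stumps over all dimensions, both polarities, and thresholds
    taken from the coordinates of the training points X (shape (S, M))."""
    stumps = []
    for dim in range(X.shape[1]):
        for threshold in np.unique(X[:, dim]):
            for polarity in (+1, -1):
                stumps.append(make_stump(dim, threshold, polarity))
    return stumps
```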
Looking at the optimization problem in Eqn. (2), one can see that if the regularization strength λ is chosen weak enough, i.e. λ < 2/N + 1/N², then the effect of the regularization is merely to thin out the strong classifier. One arrives at this condition for λ by demanding that the reduction of the regularization term ΔR(w) obtained by switching one w_i to zero be smaller than the smallest associated increase in the loss term ΔL(w) that comes from incorrectly labeling a training example. This condition guarantees that weak classifiers are not eliminated at the expense of a higher training error. Therefore the regularization only keeps a minimal set of components, namely those needed to achieve the minimal training error obtained when using the loss term alone. In this regime, the VC bound of the resulting strong classifier is lower than or equal to the VC bound of a classifier trained with no regularization.
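To see where the constant 2/N + 1/N² comes from, consider as an illustration the boundary case of a training example with y_s = +1 whose score (1/N) Σ_i w_i h_i(x_s) drops from 0 to -1/N when a single weak classifier is switched off, so that the example becomes incorrectly labeled. The quadratic loss on that example then increases by

$$ \Delta L \;=\; \left( -\frac{1}{N} - 1 \right)^{2} - \left( 0 - 1 \right)^{2} \;=\; \frac{2}{N} + \frac{1}{N^{2}}, $$

while the regularization term decreases by ΔR = λ, so demanding ΔR < ΔL in this case yields λ < 2/N + 1/N².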
AdaBoost contains no explicit regularization term and it can easily happen that the classifier uses a richer set of weak classifiers than needed to