Kernels and Ensembles: Perspectives on Statistical Learning

Reading time: 6 minutes

📝 Original Info

  • Title: Kernels and Ensembles: Perspectives on Statistical Learning
  • ArXiv ID: 0712.1027
  • Date: 2008-04-15
  • Author: Mu Zhu

📝 Abstract

Since their emergence in the 1990's, the support vector machine and the AdaBoost algorithm have spawned a wave of research in statistical machine learning. Much of this new research falls into one of two broad categories: kernel methods and ensemble methods. In this expository article, I discuss the main ideas behind these two types of methods, namely how to transform linear algorithms into nonlinear ones by using kernel functions, and how to make predictions with an ensemble or a collection of models rather than a single model. I also share my personal perspectives on how these ideas have influenced and shaped my own research. In particular, I present two recent algorithms that I have invented with my collaborators: LAGO, a fast kernel algorithm for unbalanced classification and rare target detection; and Darwinian evolution in parallel universes, an ensemble method for variable selection.


📄 Full Content

The 1990's saw two major advances in machine learning: the support vector machine (SVM) and the AdaBoost algorithm. Two fundamental ideas behind these algorithms are especially far-reaching. The first one is that we can transform many classical linear algorithms into highly flexible nonlinear algorithms by using kernel functions. The second one is that we can make accurate predictions by building an ensemble of models without much fine-tuning for each, rather than carefully fine-tuning a single model.

In this expository article, I first present the main ideas behind kernel methods (Section 2) and ensemble methods (Section 3) by reviewing four prototypical algorithms: the support vector machine (SVM, e.g., Cristianini and Shawe-Taylor 2000), kernel principal component analysis (kPCA, Schölkopf et al. 1998), AdaBoost (Freund and Schapire 1996), and random forest (Breiman 2001). I then illustrate the influence of these ideas on my own research (Section 4) by highlighting two recent algorithms that I have invented with my collaborators: LAGO (Zhu et al. 2006), a fast kernel machine for rare target detection; and Darwinian evolution in parallel universes (Zhu and Chipman 2006), an ensemble method for variable selection.

To better focus on the main ideas and not be distracted by the technicalities, I shall limit myself mostly to the two-class classification problem, although the SVM, AdaBoost and random forest can all deal with multi-class classification and regression problems as well. Technical details that do not affect the understanding of the main ideas are also omitted.

I begin with kernel methods. Even though the idea of kernels is fairly old, it is the support vector machine (SVM) that ignited a new wave of research in this area over the past 10 to 15 years.

In a two-class classification problem, we have predictor vectors x_i ∈ R^d and class labels y_i ∈ {−1, +1}, i = 1, 2, …, n. The SVM seeks an optimal hyperplane to separate the two classes.

A hyperplane in R^d consists of all x ∈ R^d that satisfy the linear equation

β^T x + β_0 = 0.

Given x_i ∈ R^d and y_i ∈ {−1, +1}, a hyperplane is called a separating hyperplane if there exists c > 0 such that

y_i (β^T x_i + β_0) ≥ c for all i = 1, 2, …, n.    (1)

Clearly, a hyperplane can be reparameterized by scaling: β^T x + β_0 = 0 is equivalent to s(β^T x + β_0) = 0 for any scalar s ≠ 0. In particular, we can scale the hyperplane so that (1) becomes

y_i (β^T x_i + β_0) ≥ 1 for all i = 1, 2, …, n,    (2)

that is, scaled so that c = 1. A separating hyperplane satisfying condition (2) is called a canonical separating hyperplane (CSHP).
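As a quick numerical illustration of this canonical scaling (the data and the hyperplane here are made up for the example), a minimal NumPy sketch:

```python
import numpy as np

# Hypothetical toy data: two separable classes in R^2.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

# Some separating hyperplane: beta = (1, 1), beta0 = 0 separates the classes.
beta, beta0 = np.array([1.0, 1.0]), 0.0
c = np.min(y * (X @ beta + beta0))  # the constant c in condition (1)
assert c > 0                        # confirms it is a separating hyperplane

# Rescale by 1/c so the closest points satisfy y_i (beta^T x_i + beta_0) = 1.
beta_c, beta0_c = beta / c, beta0 / c
print(np.min(y * (X @ beta_c + beta0_c)))  # -> 1.0, i.e., condition (2) with equality
```

Any positive rescaling leaves the hyperplane itself unchanged; the canonical form simply fixes the arbitrary scale of (β, β_0).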

If the two classes are perfectly separable, then there exist infinitely many separating hyperplanes. Figure 1 shows two competing hyperplanes in such a situation. The SVM is based on the notion that the “best” canonical separating hyperplane is the one farthest away from the training points. This notion is formalized mathematically by the margin of a hyperplane: hyperplanes with larger margins are better. In particular, the margin of a hyperplane is equal to

margin = 2 × min{y_i d_i, i = 1, 2, …, n},

where d_i is the signed distance between observation x_i and the hyperplane; see Figure 1 for an illustration. Figure 1 also suggests, on an intuitive level, why large margins are good; there is also an elaborate body of theory to justify this (see, e.g., Vapnik 1995).

It can be shown (e.g., Hastie et al. 2001, Section 4.5) that d_i is equal to

d_i = (β^T x_i + β_0) / ‖β‖.    (3)

Then, equations (2) and (3) together imply that the margin of a CSHP is equal to

margin = 2 × min{y_i d_i} = 2 / ‖β‖.
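To make the relationship margin = 2/‖β‖ concrete, here is a small NumPy check; the data and the (assumed canonical) hyperplane are hypothetical:

```python
import numpy as np

# Made-up data with an assumed canonical hyperplane:
# min over i of y_i (beta^T x_i + beta_0) equals 1.
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, -2.0]])
y = np.array([1, 1, -1, -1])
beta, beta0 = np.array([0.5, 0.5]), 0.0

# Signed distances from equation (3).
d = (X @ beta + beta0) / np.linalg.norm(beta)

# Margin as defined in the text, and its closed form for a CSHP.
margin = 2 * np.min(y * d)
print(margin, 2 / np.linalg.norm(beta))  # the two values agree
```

The agreement holds precisely because the hyperplane is canonical: the closest points attain y_i (β^T x_i + β_0) = 1, so min{y_i d_i} = 1/‖β‖.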

Figure 1: Two separating hyperplanes, one with a larger margin than the other.

To find the “best” CSHP with the largest margin, we are interested in solving the following optimization problem:

min over β, β_0, ξ:  (1/2)‖β‖^2 + γ Σ_i ξ_i    (4)
subject to  y_i (β^T x_i + β_0) ≥ 1 − ξ_i  and  ξ_i ≥ 0  for all i.    (5)

The extra variables ξ_i are introduced to relax the separability condition (2) because, in general, we cannot assume the two classes are always perfectly separable. The term γ Σ_i ξ_i acts as a penalty to control the degree of such relaxation, and γ is a tuning parameter.
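One way to see what the slack variables do: for any fixed hyperplane, the smallest feasible slack for each point is ξ_i = max(0, 1 − y_i (β^T x_i + β_0)). A hypothetical sketch (data and hyperplane invented for illustration):

```python
import numpy as np

X = np.array([[1.5, 1.0], [0.2, 0.1], [-1.0, -1.5], [0.3, -0.2]])
y = np.array([1, 1, -1, -1])
beta, beta0 = np.array([1.0, 1.0]), 0.0  # a hyperplane that is not perfect here

# Smallest slack satisfying the relaxed constraint for each point:
# zero for points on the correct side of the margin, between 0 and 1 for
# points inside the margin, and greater than 1 for misclassified points.
xi = np.maximum(0.0, 1 - y * (X @ beta + beta0))
print(xi)
```

In this toy example the second point sits inside the margin (0 < ξ < 1) and the fourth is misclassified (ξ > 1), while the first and third need no slack at all.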

The main message from the brief introduction above is this: SVM tries to find the best CSHP; it is therefore a linear classifier. The usual immediate response to this message is: So what? How does this make the SVM much different from and superior to classical logistic regression?

Equivalently, the constrained optimization problem above can be written as (e.g., Hastie et al. 2001, Exercise 12.1)

min over β, β_0:  Σ_i [1 − y_i (β^T x_i + β_0)]_+ + λ‖β‖^2,    (6)

where [u]_+ = max(u, 0) denotes the positive part of u and λ is a penalty parameter corresponding to the tuning parameter γ above.

For statisticians, the objective function in (6) has the familiar form of a loss function plus a penalty term. For the SVM, the loss function is [1 − y(β^T x + β_0)]_+, and it is indeed very similar to the binomial log-likelihood used by logistic regression (e.g., Hastie et al. 2001, Figure 12.4). But the usual logistic regression model does not include the penalty term λ‖β‖^2. This is the familiar ridge penalty, which often stabilizes the solution, especially in high-dimensional problems. Indeed, this gives the SVM an advantage. However, one can’t possibly expect a li
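The loss-plus-penalty form (6) can also be minimized directly. As an illustration only (not the quadratic-programming solver normally used for SVMs), here is a simple subgradient-descent sketch on the hinge-plus-ridge objective, with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic two-class data: well-separated Gaussian clouds in R^2.
X = np.vstack([rng.normal(2.0, 0.5, (20, 2)), rng.normal(-2.0, 0.5, (20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])

lam = 0.01            # the ridge penalty weight lambda in (6)
beta = np.zeros(2)
beta0 = 0.0
for t in range(1, 2001):
    f = X @ beta + beta0
    viol = (y * f) < 1                     # points inside or beyond the margin
    # Subgradient of sum_i [1 - y_i f(x_i)]_+  +  lam * ||beta||^2
    g_beta = -(y[viol] @ X[viol]) + 2 * lam * beta
    g_beta0 = -np.sum(y[viol])
    step = 1.0 / t                         # decaying step size
    beta -= step * g_beta
    beta0 -= step * g_beta0

acc = np.mean(np.sign(X @ beta + beta0) == y)
print(acc)  # training accuracy on this easy dataset
```

Only the margin violators contribute to the hinge part of the subgradient, which is the optimization-side counterpart of the fact that the SVM solution depends only on the support vectors.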

…(Full text truncated)…


Reference

This content is AI-processed based on ArXiv data.
