arXiv:0809.4632v1 [cs.LG] 26 Sep 2008
Surrogate Learning - An Approach for
Semi-Supervised Classification
Anonymous Author(s)
Affiliation
Address
email
Abstract
We consider the task of learning a classifier from the feature space X to the set of classes Y = {0, 1}, when the features can be partitioned into class-conditionally independent feature sets X1 and X2. We show the surprising fact that the class-conditional independence can be used to represent the original learning task in terms of 1) learning a classifier from X2 to X1 and 2) learning the class-conditional distribution of the feature set X1. This fact can be exploited for semi-supervised learning because the former task can be accomplished purely from unlabeled samples. We present experimental evaluation of the idea in two real world applications.
1 Introduction
Semi-supervised learning is said to occur when the learner exploits (a presumably large quantity of) unlabeled data to supplement a relatively small labeled sample for accurate induction. The high cost of labeled data and the simultaneous abundance of unlabeled data in many application domains have led to considerable interest in semi-supervised learning in recent years.
We show a somewhat surprising consequence of class-conditional feature independence that leads to a simple semi-supervised learning algorithm. When the feature set can be partitioned into two class-conditionally independent sets, we show that the original learning problem can be reformulated in terms of the problem of learning a predictor from one of the partitions to the other. That is, the latter partition acts as a surrogate for the class variable. Since such a predictor can be learned from only unlabeled samples, an effective semi-supervised algorithm results.
In the next section we present the simple yet interesting result on which our semi-supervised learning algorithm (which we call surrogate learning) is based. We present examples to clarify the intuition behind the approach, and a special case of our approach that is used in the applications section. We then examine related ideas in previous work and situate our algorithm among previous approaches to semi-supervised learning. Finally, we present empirical evaluation on two real-world applications where the required assumptions of our algorithm are satisfied.
2 Surrogate Learning
We consider the problem of learning a classifier from the feature space X to the set of classes Y = {0, 1}. Let the features be partitioned into X = X1 × X2. The random feature vector x ∈ X will be represented correspondingly as x = (x1, x2). Since we restrict our consideration to a two-class problem, the construction of the classifier involves the estimation of the probability P(y = 0|x1, x2) at every point (x1, x2) ∈ X.
We make the following assumptions on the joint probabilities of the classes and features.

1. P(x1, x2|y) = P(x1|y)P(x2|y) for y ∈ {0, 1}. That is, the feature sets x1 and x2 are class-conditionally independent for both classes. Note that in general our assumption is less restrictive than the Naive Bayes assumption.

2. P(x1|x2) ≠ 0, P(x1|y) ≠ 0 and P(x1|y = 0) ≠ P(x1|y = 1). These assumptions are to avoid divide-by-zero problems in the algebra below. If x1 is a discrete-valued random variable that is not irrelevant to the classification task, these conditions are often satisfied.
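To make the distinction in assumption 1 concrete, the following sketch (with synthetic probabilities chosen purely for illustration) constructs a discrete distribution in which x1 = (a, b) is itself a pair of within-class dependent features: the class-conditional independence of x1 and x2 holds by construction, yet the stronger Naive Bayes factorization over all individual features fails.

```python
import numpy as np

# x1 = (a, b): two binary features that are *dependent* within each class,
# so full Naive Bayes does not hold. x2: a single binary feature.
p_x1_y = np.array([
    [[0.4, 0.1], [0.1, 0.4]],   # P(a, b | y=0); a indexes rows, b columns
    [[0.1, 0.4], [0.4, 0.1]],   # P(a, b | y=1)
])
p_x2_y = np.array([[0.7, 0.3],   # P(x2 | y=0)
                   [0.2, 0.8]])  # P(x2 | y=1)

for y in (0, 1):
    # Assumption 1 holds by construction: P(x1, x2 | y) = P(x1 | y) P(x2 | y).
    joint = p_x1_y[y][..., None] * p_x2_y[y][None, None, :]
    assert np.isclose(joint.sum(), 1.0)  # a valid distribution over (a, b, x2)

    # Naive Bayes would additionally require P(a, b | y) = P(a | y) P(b | y).
    p_a = p_x1_y[y].sum(axis=1)          # P(a | y)
    p_b = p_x1_y[y].sum(axis=0)          # P(b | y)
    naive_bayes = p_a[:, None] * p_b[None, :]
    assert not np.allclose(p_x1_y[y], naive_bayes)  # the factorization fails
```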
Under these assumptions, surprisingly, we can establish that P(y = 0|x1, x2) can be written as a function of P(x1|x2) and P(x1|y). First, considering the quantity P(y, x1|x2), we may derive the following.

    P(y, x1|x2) = P(x1|y, x2)P(y|x2)
  ⇒ P(y, x1|x2) = P(x1|y)P(y|x2)                    (from the independence assumption)
  ⇒ P(y|x1, x2)P(x1|x2) = P(x1|y)P(y|x2)
  ⇒ P(y|x1, x2)P(x1|x2) / P(x1|y) = P(y|x2)         (1)
Since P(y = 0|x2) + P(y = 1|x2) = 1, Equation 1 implies

    P(y = 0|x1, x2)P(x1|x2) / P(x1|y = 0) + P(y = 1|x1, x2)P(x1|x2) / P(x1|y = 1) = 1
  ⇒ P(y = 0|x1, x2)P(x1|x2) / P(x1|y = 0) + (1 − P(y = 0|x1, x2))P(x1|x2) / P(x1|y = 1) = 1    (2)
Solving Equation 2 for P(y = 0|x1, x2), we obtain

    P(y = 0|x1, x2) = [P(x1|y = 0) / P(x1|x2)] · [P(x1|y = 1) − P(x1|x2)] / [P(x1|y = 1) − P(x1|y = 0)]    (3)
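As a sanity check on Equation 3, the following sketch (with arbitrary synthetic probabilities, assumed purely for illustration) builds a small discrete joint distribution satisfying the class-conditional independence assumption and confirms that the surrogate formula reproduces the exact posterior P(y = 0|x1, x2).

```python
import numpy as np

# Toy model: x1 takes 3 values, x2 takes 4 values, y in {0, 1}.
# Class-conditional independence holds by construction:
# P(x1, x2 | y) = P(x1 | y) P(x2 | y).
p_y = np.array([0.6, 0.4])                    # P(y)
p_x1_y = np.array([[0.2, 0.5, 0.3],           # P(x1 | y=0)
                   [0.5, 0.1, 0.4]])          # P(x1 | y=1), differs for every x1
p_x2_y = np.array([[0.1, 0.2, 0.3, 0.4],      # P(x2 | y=0)
                   [0.4, 0.3, 0.2, 0.1]])     # P(x2 | y=1)

# Full joint P(y, x1, x2), shape (2, 3, 4).
joint = p_y[:, None, None] * p_x1_y[:, :, None] * p_x2_y[:, None, :]

# True posterior P(y=0 | x1, x2) directly from the joint, by Bayes' rule.
posterior_true = joint[0] / joint.sum(axis=0)

# Quantities used by Equation 3.
p_x1_given_x2 = joint.sum(axis=0) / joint.sum(axis=(0, 1))  # P(x1 | x2)
q0 = p_x1_y[0][:, None]                                     # P(x1 | y=0)
q1 = p_x1_y[1][:, None]                                     # P(x1 | y=1)

# Equation 3: the posterior via P(x1 | x2) and P(x1 | y) alone.
posterior_surrogate = (q0 / p_x1_given_x2) * (q1 - p_x1_given_x2) / (q1 - q0)

assert np.allclose(posterior_true, posterior_surrogate)
```

Note that the numbers are chosen so that assumption 2 holds: P(x1|y = 0) and P(x1|y = 1) differ at every value of x1, keeping the denominator of Equation 3 nonzero.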
We have succeeded in writing P(y = 0|x1, x2) as a function of P(x1|x2) and P(x1|y). This leads
to a significant simplification of the learning task when a large amount of unlabeled data is available,
especially if x1 is finite valued. The learning algorithm involves the following two steps.
• Estimate the quantity P(x1|x2) from only the unlabeled data, by building a predictor from
the feature space X2 to the space X1. There is no restriction on the learning algorithm for
this prediction task.
• Estimate the quantity P(x1|y) from a smaller labeled sample by counting.
Thus, we can decouple the prediction problem into two separate tasks, one of which involves predicting x1 from the remaining features. In other words, x1 serves as a surrogate for the class label.
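The two steps above can be sketched as follows for the special case where both x1 and x2 are discrete, so that P(x1|x2) can itself be estimated by smoothed counting (a toy instantiation; in general any probabilistic predictor from X2 to X1 can be used for the first step, and the function names and smoothing constant below are our own).

```python
import numpy as np

def fit_surrogate(x1_unlab, x2_unlab, x1_lab, y_lab, n_x1, n_x2, alpha=1.0):
    """Sketch of surrogate learning for discrete x1, x2 (Laplace-smoothed counts)."""
    # Step 1: estimate P(x1 | x2) from unlabeled data alone.
    counts = np.full((n_x1, n_x2), alpha)
    np.add.at(counts, (x1_unlab, x2_unlab), 1.0)
    p_x1_given_x2 = counts / counts.sum(axis=0, keepdims=True)

    # Step 2: estimate P(x1 | y) from the small labeled sample, by counting.
    counts_y = np.full((2, n_x1), alpha)
    np.add.at(counts_y, (y_lab, x1_lab), 1.0)
    p_x1_given_y = counts_y / counts_y.sum(axis=1, keepdims=True)

    def posterior_y0(x1, x2):
        # Equation 3: combine the two estimates into P(y=0 | x1, x2).
        p = p_x1_given_x2[x1, x2]
        q0, q1 = p_x1_given_y[0, x1], p_x1_given_y[1, x1]
        val = (q0 / p) * (q1 - p) / (q1 - q0)
        # Plug-in estimates from finite samples may fall outside [0, 1].
        return np.clip(val, 0.0, 1.0)

    return posterior_y0
```

Because the three probabilities are estimated from finite samples, the plug-in value of Equation 3 can fall outside [0, 1]; the sketch simply clips it to that range.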
Furthermore, for the two steps above there is no necessity for complete samples.