Sample Selection Bias Correction Theory

Corinna Cortes (1), Mehryar Mohri (2,1), Michael Riley (1), and Afshin Rostamizadeh (2)*

(1) Google Research, 76 Ninth Avenue, New York, NY 10011.
(2) Courant Institute of Mathematical Sciences, 251 Mercer Street, New York, NY 10012.
* Student's submission, to be considered as a candidate for the E. M. Gold Award.

Abstract. This paper presents a theoretical analysis of sample selection bias correction. The sample bias correction technique commonly used in machine learning consists of reweighting the cost of an error on each training point of a biased sample to more closely reflect the unbiased distribution. This relies on weights derived by various estimation techniques based on finite samples. We analyze the effect of an error in that estimation on the accuracy of the hypothesis returned by the learning algorithm for two estimation techniques: a cluster-based estimation technique and kernel mean matching. We also report the results of sample bias correction experiments with several data sets using these techniques. Our analysis is based on the novel concept of distributional stability, which generalizes the existing concept of point-based stability. Much of our work and proof techniques can be used to analyze other importance weighting techniques and their effect on accuracy when using a distributionally stable algorithm.

1 Introduction

In the standard formulation of machine learning problems, the learning algorithm receives training and test samples drawn according to the same distribution. However, this assumption often does not hold in practice. The training sample available is biased in some way, which may be due to a variety of practical reasons such as the cost of data labeling or acquisition. The problem occurs in many areas such as astronomy, econometrics, and species habitat modeling.

In a common instance of this problem, points are drawn according to the test distribution but not all of them are made available to the learner. This is called the sample selection bias problem. Remarkably, it is often possible to correct this bias by using large amounts of unlabeled data.

The problem of sample selection bias correction for linear regression has been extensively studied in econometrics and statistics (Heckman, 1979; Little & Rubin, 1986), with the pioneering work of Heckman (1979). Several recent machine learning publications (Elkan, 2001; Zadrozny, 2004; Zadrozny et al., 2003; Fan et al., 2005; Dudík et al., 2006) have also dealt with this problem. The main correction technique used in all of these publications consists of reweighting the cost of training point errors to more closely reflect that of the test distribution. This is in fact a technique commonly used in statistics and machine learning for a variety of problems of this type (Little & Rubin, 1986). With the exact weights, this reweighting could optimally correct the bias, but, in practice, the weights are based on an estimate of the sampling probability from finite data sets. Thus, it is important to determine to what extent the error in this estimation can affect the accuracy of the hypothesis returned by the learning algorithm. To our knowledge, this problem has not been analyzed in a general manner. This paper gives a theoretical analysis of sample selection bias correction.
Our analysis is based on the novel concept of distributional stability, which generalizes the point-based stability introduced and analyzed by previous authors (Devroye & Wagner, 1979; Kearns & Ron, 1997; Bousquet & Elisseeff, 2002). We show that large families of learning algorithms, including all kernel-based regularization algorithms such as Support Vector Regression (SVR) (Vapnik, 1998) or kernel ridge regression (Saunders et al., 1998), are distributionally stable, and we give the expression of their stability coefficient for both the $l_1$ and $l_2$ distance.

We then analyze two commonly used sample bias correction techniques: a cluster-based estimation technique and kernel mean matching (KMM) (Huang et al., 2006b). For each of these techniques, we derive bounds on the difference of the error rate of the hypothesis returned by a distributionally stable algorithm when using that estimation technique versus using perfect reweighting. We briefly discuss and compare these bounds and also report the results of experiments with both estimation techniques for several publicly available machine learning data sets. Much of our work and proof techniques can be used to analyze other importance weighting techniques and their effect on accuracy when used in combination with a distributionally stable algorithm.

The remaining sections of this paper are organized as follows. Section 2 describes in detail the sample selection bias correction technique. Section 3 introduces the concept of distributional stability and proves the distributional stability of kernel-based regularization algorithms. Section 4 analyzes the effect of estimation error using distributionally stable algorithms for both the cluster-based and the KMM estimation techniques. Section 5 reports the results of experiments with several data sets comparing these estimation techniques.

2 Sample Selection Bias Correction

2.1 Problem

Let $X$ denote the input space and $Y$ the label set, which may be $\{0, 1\}$ in classification or a subset of $\mathbb{R}$ in regression estimation problems, and let $D$ denote the true distribution over $X \times Y$ according to which test points are drawn. In the sample selection bias problem, some pairs $z = (x, y)$ drawn according to $D$ are not made available to the learning algorithm. The learning algorithm receives a training sample $S$ of $m$ labeled points $z_1, \ldots, z_m$ drawn according to a biased distribution $D'$ over $X \times Y$. This sample bias can be represented by a random variable $s$ taking values in $\{0, 1\}$: when $s = 1$ the point is sampled, otherwise it is not. Thus, by definition of the sample selection bias, the support of the biased distribution $D'$ is included in that of the true distribution $D$.

As in standard learning scenarios, the objective of the learning algorithm is to select a hypothesis $h$ out of a hypothesis set $H$ with a small generalization error $R(h)$ with respect to the true distribution $D$,

$$R(h) = \operatorname{E}_{z=(x,y) \sim D}[c(h, z)],$$

where $c(h, z)$ is the cost of the error of $h$ on point $z \in X \times Y$.

While the sample $S$ is collected in some biased manner, it is often possible to derive some information about the nature of the bias. This can be done by exploiting large amounts of unlabeled data drawn according to the true distribution $D$, which is often available in practice. Thus, in the following, let $U$ be a sample drawn according to $D$ and $S \subseteq U$ a labeled but biased sub-sample.
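To make the setup concrete, the following minimal Python sketch (ours, not part of the paper) simulates the scenario: a large pool U is drawn from the true distribution D, and a biased labeled subsample S ⊆ U is obtained by sampling each point with a label-independent probability Pr[s = 1 | x]. The logistic bias model and all constants here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw a large pool U from the true distribution D (here, 1-D Gaussian inputs
# with a noisy linear target). Only the x-values of U are assumed observable.
n = 10_000
x_U = rng.normal(size=n)
y_U = 2.0 * x_U + rng.normal(scale=0.1, size=n)

# Label-independent selection: each point enters the training set S with
# probability Pr[s=1 | x], a hypothetical logistic bias under-sampling large x.
p_select = 1.0 / (1.0 + np.exp(2.0 * x_U))
s = rng.random(n) < p_select

x_S, y_S = x_U[s], y_U[s]   # biased, labeled training sample S ⊆ U
print(f"|U| = {n}, |S| = {s.sum()}, Pr[s=1] ~ {s.mean():.3f}")
```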
2.2 Weighted Samples

A weighted sample $S_w$ is a training sample $S$ of $m$ labeled points, $z_1, \ldots, z_m$, drawn i.i.d. from $X \times Y$, that is augmented with a non-negative weight $w_i \geq 0$ for each point $z_i$. This weight is used to emphasize or de-emphasize the cost of an error on $z_i$, as in so-called importance weighting or cost-sensitive learning (Elkan, 2001; Zadrozny et al., 2003). One could use the weights $w_i$ to derive an equivalent unweighted sample $S'$ where the multiplicity of $z_i$ would reflect its weight $w_i$, but most learning algorithms, e.g., decision trees, logistic regression, AdaBoost, Support Vector Machines (SVMs), and kernel ridge regression, can directly accept a weighted sample $S_w$. We will refer to algorithms that can directly take $S_w$ as input as weight-sensitive algorithms.

The empirical error of a hypothesis $h$ on a weighted sample $S_w$ is defined as

$$\widehat{R}_w(h) = \frac{1}{m}\sum_{i=1}^m w_i\, c(h, z_i). \quad (1)$$

Proposition 1. Let $D'$ be a distribution whose support coincides with that of $D$ and let $S_w$ be a weighted sample with $w_i = \Pr_D(z_i)/\Pr_{D'}(z_i)$ for all points $z_i$ in $S$. Then,

$$\operatorname{E}_{S \sim D'}[\widehat{R}_w(h)] = R(h) = \operatorname{E}_{z \sim D}[c(h, z)]. \quad (2)$$

Proof. Since the sample points are drawn i.i.d.,

$$\operatorname{E}_{S \sim D'}[\widehat{R}_w(h)] = \frac{1}{m}\sum_{i=1}^m \operatorname{E}_{S \sim D'}[w_i\, c(h, z_i)] = \operatorname{E}_{z_1 \sim D'}[w_1\, c(h, z_1)]. \quad (3)$$

By definition of $w$ and the fact that the supports of $D$ and $D'$ coincide, the right-hand side can be rewritten as follows:

$$\sum_{D'(z_1) \neq 0} \frac{\Pr_D(z_1)}{\Pr_{D'}(z_1)}\Pr_{D'}(z_1)\, c(h, z_1) = \sum_{D(z_1) \neq 0} \Pr_D(z_1)\, c(h, z_1) = \operatorname{E}_{z_1 \sim D}[c(h, z_1)]. \quad (4)$$

This last term is the definition of the generalization error $R(h)$. ∎

2.3 Bias Correction

The probability of drawing $z = (x, y)$ according to the true but unobserved distribution $D$ can be straightforwardly related to the observed distribution $D'$. By definition of the random variable $s$, the observed biased distribution $D'$ can be expressed by $\Pr_{D'}[z] = \Pr_D[z \mid s = 1]$. We will assume that all points $z$ in the support of $D$ can be sampled with non-zero probability, so that the supports of $D$ and $D'$ coincide. Thus, for all $z \in X \times Y$, $\Pr[s = 1 \mid z] \neq 0$. Then, by the Bayes formula, for all $z$ in the support of $D$,

$$\Pr_D[z] = \frac{\Pr[z \mid s = 1]\Pr[s = 1]}{\Pr[s = 1 \mid z]} = \frac{\Pr[s = 1]}{\Pr[s = 1 \mid z]}\Pr_{D'}[z]. \quad (5)$$

Thus, if we were given the probabilities $\Pr[s = 1]$ and $\Pr[s = 1 \mid z]$, we could derive the true probability $\Pr_D$ from the biased one $\Pr_{D'}$ exactly and correct the sample selection bias. It is important to note that this correction is only needed for the training sample $S$, since it is the only source of selection bias. With a weight-sensitive algorithm, it suffices to reweight each sample point $z_i$ with the weight $w_i = \Pr[s = 1]/\Pr[s = 1 \mid z_i]$. Thus, $\Pr[s = 1 \mid z]$ need not be estimated for all points $z$ but only for those falling in the training sample $S$. By Proposition 1, the expected value of the empirical error after reweighting is the same as if we were given samples from the true distribution, and the usual generalization bounds hold for $\widehat{R}_w(h)$ and $R(h)$.

When the sampling probability is independent of the labels, as is commonly assumed in many settings (Zadrozny, 2004; Zadrozny et al., 2003), $\Pr[s = 1 \mid z] = \Pr[s = 1 \mid x]$, and Equation 5 can be rewritten as

$$\Pr_D[z] = \frac{\Pr[s = 1]}{\Pr[s = 1 \mid x]}\Pr_{D'}[z]. \quad (6)$$

In that case, the probabilities $\Pr[s = 1]$ and $\Pr[s = 1 \mid x]$ needed to reconstitute $\Pr_D$ from $\Pr_{D'}$ do not depend on the labels and thus can be estimated using the unlabeled points in $U$. Moreover, as already mentioned, for weight-sensitive algorithms, it suffices to estimate $\Pr[s = 1 \mid x_i]$ for the points $x_i$ of the training data; no generalization is needed.

A simple case is when the points are defined over a discrete set (this can be the result of a quantization or clustering technique, as discussed later). $\Pr[s = 1 \mid x]$ can then be estimated from the frequency $m_x/n_x$, where $m_x$ denotes the number of times $x$ appeared in $S \subseteq U$ and $n_x$ the number of times $x$ appeared in the full data set $U$. $\Pr[s = 1]$ can be estimated by the quantity $|S|/|U|$. However, since $\Pr[s = 1]$ is a constant independent of $x$, its estimation is not even necessary.
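For the discrete case just described, a short sketch (ours; the helper frequency_weights and the toy data are illustrative, not from the paper) estimates Pr[s = 1 | x] by the frequency m_x/n_x and forms the correction weights w_i = Pr[s = 1]/Pr[s = 1 | x_i] for the points of S:

```python
from collections import Counter
import numpy as np

def frequency_weights(S_x, U_x):
    # Pr[s=1 | x] is estimated by m_x / n_x (counts in S and U); Pr[s=1] by |S|/|U|.
    m_cnt, n_cnt = Counter(S_x), Counter(U_x)
    p_s1 = len(S_x) / len(U_x)
    # w_i = Pr[s=1] / Pr[s=1 | x_i], needed only for the points of S.
    return np.array([p_s1 * n_cnt[x] / m_cnt[x] for x in S_x])

# Toy discrete example: x = 2 is heavily under-sampled in S relative to U.
U = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
S = [0, 0, 1, 1, 2]
print(frequency_weights(S, U))   # the single x = 2 point receives the largest weight
```

On this toy input the weights are [0.75, 0.75, 0.75, 0.75, 2.0]: the under-sampled point is up-weighted, and the weights average to 1 over S.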
If the estimation of the sampling probability $\Pr[s = 1 \mid x]$ from the unlabeled data set $U$ were exact, then the reweighting just discussed could correct the sample bias optimally. Several techniques have been commonly used to estimate the reweighting quantities, but these estimated weights are not guaranteed to be exact. The next section addresses how the error in that estimation affects the error rate of the hypothesis returned by the learning algorithm.

3 Distributional Stability

Here, we examine the effect on the error of the hypothesis returned by the learning algorithm of a change in the way the training points are weighted. Since the weights are non-negative, we can assume that they are normalized and define a distribution over the training sample. This study can be viewed as a generalization of stability analysis, where a single sample point is changed (Devroye & Wagner, 1979; Kearns & Ron, 1997; Bousquet & Elisseeff, 2002), to the more general case of distributional stability, where the sample's weight distribution is changed.

Thus, in this section, the sample weight $W$ of $S_W$ defines a distribution over $S$. For a fixed learning algorithm $L$ and a fixed sample $S$, we denote by $h_W$ the hypothesis returned by $L$ for the weighted sample $S_W$. We denote by $d(W, W')$ a divergence measure for two distributions $W$ and $W'$. There are many standard measures for the divergences or distances between two distributions, including the relative entropy, the Hellinger distance, and the $l_p$ distance.

Definition 1 (Distributional β-Stability). A learning algorithm $L$ is said to be distributionally β-stable for the divergence measure $d$ if for any two weighted samples $S_W$ and $S_{W'}$,

$$\forall z \in X \times Y, \quad |c(h_W, z) - c(h_{W'}, z)| \leq \beta\, d(W, W'). \quad (7)$$

Thus, an algorithm is distributionally stable when small changes to a weighted sample's distribution, as measured by a divergence $d$, result in a small change in the cost of an error at any point.
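For intuition about the divergence measures $d(W, W')$, here is a small numerical illustration (ours): the $l_1$ and $l_2$ distances between the uniform weighting and the leave-one-out weighting, whose closed forms 2/m and 1/√(m(m−1)) appear in eq. (22) at the end of Section 3.1.

```python
import numpy as np

def l1(W, W2): return np.sum(np.abs(W - W2))
def l2(W, W2): return np.sqrt(np.sum((W - W2) ** 2))

# Uniform weighting vs. the leave-one-out weighting used later in Section 3.1:
# the closed forms there are l1 = 2/m and l2 = 1/sqrt(m(m-1)).
m = 100
W_U  = np.full(m, 1.0 / m)
W_U2 = np.concatenate(([0.0], np.full(m - 1, 1.0 / (m - 1))))
print(l1(W_U, W_U2), 2.0 / m)                       # 0.02  0.02
print(l2(W_U, W_U2), 1.0 / np.sqrt(m * (m - 1)))    # both ~ 0.01005
```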
The following proposition follows directly from the definition of distributional stability.

Proposition 2. Let $L$ be a distributionally β-stable algorithm and let $h_W$ (resp. $h_{W'}$) denote the hypothesis returned by $L$ when trained on the weighted sample $S_W$ (resp. $S_{W'}$). Let $W_T$ denote the distribution according to which test points are drawn. Then, the following holds:

$$|R(h_W) - R(h_{W'})| \leq \beta\, d(W, W'). \quad (8)$$

Proof. By the distributional stability of the algorithm,

$$\operatorname{E}_{z \sim W_T}\big[\,|c(z, h_W) - c(z, h_{W'})|\,\big] \leq \beta\, d(W, W'), \quad (9)$$

which implies the statement of the proposition. ∎

3.1 Distributional Stability of Kernel-Based Regularization Algorithms

Here, we show that kernel-based regularization algorithms are distributionally β-stable. This family of algorithms includes, among others, Support Vector Regression (SVR) and kernel ridge regression. Other algorithms, such as those based on relative entropy regularization, can be shown to be distributionally β-stable in a way similar to that for point-based stability. Our results also apply to classification algorithms such as Support Vector Machines (SVM) (Cortes & Vapnik, 1995) using a margin-based loss function $l_\gamma$ as in (Bousquet & Elisseeff, 2002).

We will assume that the cost function $c$ is σ-admissible, that is, there exists $\sigma \in \mathbb{R}_+$ such that for any two hypotheses $h, h' \in H$ and for all $z = (x, y) \in X \times Y$,

$$|c(h, z) - c(h', z)| \leq \sigma\, |h(x) - h'(x)|. \quad (10)$$

This assumption holds for the quadratic cost and most other cost functions when the hypothesis set and the set of output labels are bounded by some $M \in \mathbb{R}_+$: $\forall h \in H, \forall x \in X, |h(x)| \leq M$ and $\forall y \in Y, |y| \leq M$. We will also assume that $c$ is differentiable. This assumption is in fact not necessary, and all of our results hold without it, but it makes the presentation simpler.

Let $N: H \to \mathbb{R}_+$ be a function defined over the hypothesis set. Regularization-based algorithms minimize an objective of the form

$$F_W(h) = \widehat{R}_W(h) + \lambda N(h), \quad (11)$$

where $\lambda \geq 0$ is a trade-off parameter. We denote by $B_F$ the Bregman divergence associated to a convex function $F$, $B_F(f \,\|\, g) = F(f) - F(g) - \langle f - g, \nabla F(g) \rangle$, and define $\Delta h$ as $\Delta h = h' - h$.

Lemma 1. Let the hypothesis set $H$ be a vector space. Assume that $N$ is a proper closed convex function and that $N$ is differentiable. Assume that $F_W$ admits a minimizer $h \in H$ and $F_{W'}$ a minimizer $h' \in H$. Then, the following bound holds:

$$B_N(h' \,\|\, h) + B_N(h \,\|\, h') \leq \frac{\sigma\, l_1(W, W')}{\lambda}\, \sup_{x \in S} |\Delta h(x)|. \quad (12)$$

Proof. Since $B_{F_W} = B_{\widehat{R}_W} + \lambda B_N$ and $B_{F_{W'}} = B_{\widehat{R}_{W'}} + \lambda B_N$, and a Bregman divergence is non-negative,

$$\lambda\big[B_N(h' \| h) + B_N(h \| h')\big] \leq B_{F_W}(h' \| h) + B_{F_{W'}}(h \| h').$$

By the definition of $h$ and $h'$ as the minimizers of $F_W$ and $F_{W'}$,

$$B_{F_W}(h' \| h) + B_{F_{W'}}(h \| h') = \widehat{R}_W(h') - \widehat{R}_W(h) + \widehat{R}_{W'}(h) - \widehat{R}_{W'}(h'). \quad (13)$$

Thus, by the σ-admissibility of the cost function $c$, using the notation $W_i = W(x_i)$ and $W'_i = W'(x_i)$,

$$\begin{aligned}
\lambda\big[B_N(h' \| h) + B_N(h \| h')\big] &\leq \sum_{i=1}^m \big[c(h', z_i)W_i - c(h, z_i)W_i + c(h, z_i)W'_i - c(h', z_i)W'_i\big] \\
&= \sum_{i=1}^m \big(c(h', z_i) - c(h, z_i)\big)(W_i - W'_i) \\
&\leq \sum_{i=1}^m \sigma\, |\Delta h(x_i)|\, |W_i - W'_i| \;\leq\; \sigma\, l_1(W, W') \sup_{x \in S} |\Delta h(x)|, \quad (14)
\end{aligned}$$

which establishes the lemma. ∎

Given $x_1, \ldots, x_m \in X$ and a positive definite symmetric (PDS) kernel $K$, we denote by $\mathbf{K} \in \mathbb{R}^{m \times m}$ the kernel matrix defined by $\mathbf{K}_{ij} = K(x_i, x_j)$ and by $\lambda_{\max}(\mathbf{K}) \in \mathbb{R}_+$ the largest eigenvalue of $\mathbf{K}$.
Lemma 2. Let $H$ be a reproducing kernel Hilbert space with kernel $K$ and let the regularization function $N$ be defined by $N(\cdot) = \|\cdot\|_K^2$. Then, the following bound holds:

$$B_N(h' \,\|\, h) + B_N(h \,\|\, h') \leq \frac{\sigma\, \lambda_{\max}^{1/2}(\mathbf{K})\, l_2(W, W')}{\lambda}\, \|\Delta h\|_K. \quad (15)$$

Proof. As in the proof of Lemma 1,

$$\lambda\big[B_N(h' \| h) + B_N(h \| h')\big] \leq \sum_{i=1}^m \big(c(h', z_i) - c(h, z_i)\big)(W_i - W'_i). \quad (16)$$

By definition of a reproducing kernel Hilbert space $H$, for any hypothesis $h \in H$, $\forall x \in X$, $h(x) = \langle h, K(x, \cdot) \rangle$, and thus also, for any $\Delta h = h' - h$ with $h, h' \in H$, $\forall x \in X$, $\Delta h(x) = \langle \Delta h, K(x, \cdot) \rangle$. Let $\Delta W_i$ denote $W'_i - W_i$ and $\Delta W$ the vector whose components are the $\Delta W_i$'s. By σ-admissibility, the right-hand side of (16) is bounded by $\sigma \sum_{i=1}^m |\Delta h(x_i)\, \Delta W_i| = \sigma \sum_{i=1}^m |\langle \Delta h, \Delta W_i K(x_i, \cdot) \rangle|$. Let $\epsilon_i \in \{-1, +1\}$ denote the sign of $\langle \Delta h, \Delta W_i K(x_i, \cdot) \rangle$. Then,

$$\begin{aligned}
\lambda\big[B_N(h' \| h) + B_N(h \| h')\big] &\leq \sigma\Big\langle \Delta h, \sum_{i=1}^m \epsilon_i\, \Delta W_i\, K(x_i, \cdot)\Big\rangle \leq \sigma\, \|\Delta h\|_K \Big\|\sum_{i=1}^m \epsilon_i\, \Delta W_i\, K(x_i, \cdot)\Big\|_K \\
&= \sigma\, \|\Delta h\|_K \Big[\sum_{i,j=1}^m \epsilon_i \epsilon_j\, \Delta W_i \Delta W_j\, K(x_i, x_j)\Big]^{1/2} = \sigma\, \|\Delta h\|_K \big[(\epsilon \circ \Delta W)^\top \mathbf{K}\, (\epsilon \circ \Delta W)\big]^{1/2} \\
&\leq \sigma\, \|\Delta h\|_K\, \|\Delta W\|_2\, \lambda_{\max}^{1/2}(\mathbf{K}). \quad (17)
\end{aligned}$$

In this derivation, the second inequality follows from the Cauchy-Schwarz inequality and the last inequality from the standard property of the Rayleigh quotient for PDS matrices, noting that $\|\epsilon \circ \Delta W\|_2 = \|\Delta W\|_2$. Since $\|\Delta W\|_2 = l_2(W, W')$, this proves the lemma. ∎

Theorem 1. Let $K$ be a kernel such that $K(x, x) \leq \kappa^2 < \infty$ for all $x \in X$. Then, the regularization algorithm based on $N(\cdot) = \|\cdot\|_K^2$ is distributionally β-stable for the $l_1$ distance with $\beta \leq \frac{\sigma^2 \kappa^2}{2\lambda}$, and for the $l_2$ distance with $\beta \leq \frac{\sigma^2 \kappa\, \lambda_{\max}^{1/2}(\mathbf{K})}{2\lambda}$.

Proof. For $N(\cdot) = \|\cdot\|_K^2$, we have $B_N(h' \| h) = \|h' - h\|_K^2$, thus $B_N(h' \| h) + B_N(h \| h') = 2\|\Delta h\|_K^2$. By the reproducing property and Cauchy-Schwarz, $|\Delta h(x)| = \langle \Delta h, K(x, \cdot) \rangle \leq \kappa \|\Delta h\|_K$, so Lemma 1 gives

$$2\|\Delta h\|_K^2 \leq \frac{\sigma\, l_1(W, W')}{\lambda} \sup_{x \in S} |\Delta h(x)| \leq \frac{\sigma\, l_1(W, W')}{\lambda}\, \kappa\, \|\Delta h\|_K. \quad (18)$$

Thus, $\|\Delta h\|_K \leq \frac{\sigma \kappa\, l_1(W, W')}{2\lambda}$. By σ-admissibility of $c$,

$$\forall z \in X \times Y, \quad |c(h', z) - c(h, z)| \leq \sigma\, |\Delta h(x)| \leq \kappa \sigma\, \|\Delta h\|_K. \quad (19)$$

Therefore,

$$\forall z \in X \times Y, \quad |c(h', z) - c(h, z)| \leq \frac{\sigma^2 \kappa^2\, l_1(W, W')}{2\lambda}, \quad (20)$$

which shows the distributional stability of a kernel-based regularization algorithm for the $l_1$ distance. Using Lemma 2, a similar derivation leads to

$$\forall z \in X \times Y, \quad |c(h', z) - c(h, z)| \leq \frac{\sigma^2 \kappa\, \lambda_{\max}^{1/2}(\mathbf{K})\, l_2(W, W')}{2\lambda}, \quad (21)$$

which shows the distributional stability of a kernel-based regularization algorithm for the $l_2$ distance. ∎

Note that the standard setting of a sample with no weights is equivalent to a weighted sample with the uniform distribution $W_U$: each point is assigned the weight $1/m$. Removing a single point, say $x_1$, is equivalent to assigning weight 0 to $x_1$ and $1/(m-1)$ to the others. Let $W_{U'}$ be the corresponding distribution; then

$$l_1(W_U, W_{U'}) = \frac{1}{m} + \sum_{i=1}^{m-1}\Big(\frac{1}{m-1} - \frac{1}{m}\Big) = \frac{2}{m}. \quad (22)$$

Thus, in the case of kernel-based regularized algorithms and for the $l_1$ distance, standard uniform β-stability is a special case of distributional β-stability. It can be shown similarly that $l_2(W_U, W_{U'}) = \frac{1}{\sqrt{m(m-1)}}$.
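As an illustration of a weight-sensitive kernel-based regularization algorithm, here is a minimal numpy sketch (ours, with an assumed Gaussian kernel) of weighted kernel ridge regression: minimizing F_W(h) = Σ_i W_i (h(x_i) − y_i)² + λ‖h‖²_K yields, by the representer theorem, coefficients α solving (diag(W) K + λI) α = diag(W) y. Re-solving under two nearby weightings gives an empirical view of distributional stability.

```python
import numpy as np

def weighted_krr(X, y, W, lam, gamma):
    """Weighted kernel ridge regression with a Gaussian kernel:
    minimizes sum_i W_i (h(x_i) - y_i)^2 + lam * ||h||_K^2.
    Setting W_i = 1/m recovers (a rescaling of) standard KRR."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq)                       # kernel matrix K_ij
    D = np.diag(W)
    # Stationarity of the objective gives (D K + lam I) alpha = D y.
    alpha = np.linalg.solve(D @ K + lam * np.eye(len(y)), D @ y)
    return alpha, K

# Two weightings with small l1 distance should yield uniformly close predictions.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=50)
W = np.full(50, 1 / 50)
W2 = W.copy(); W2[0] = 0.0; W2[1:] = 1 / 49       # leave-one-out reweighting
a1, K = weighted_krr(X, y, W,  lam=0.1, gamma=0.5)
a2, _ = weighted_krr(X, y, W2, lam=0.1, gamma=0.5)
print(np.max(np.abs(K @ a1 - K @ a2)))            # small, of order l1(W, W') = 2/m
```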
4 Effect of Estimation Error for Kernel-Based Regularization Algorithms

This section analyzes the effect of an error in the estimation of the weight of a training example on the generalization error of the hypothesis $h$ returned by a weight-sensitive learning algorithm. We examine two estimation techniques: a straightforward histogram-based or cluster-based method, and kernel mean matching (KMM) (Huang et al., 2006b).

4.1 Cluster-Based Estimation

A straightforward estimate of the probability of sampling is based on the observed empirical frequencies. The ratio of the number of times a point $x$ appears in $S$ and the number of times it appears in $U$ is an empirical estimate of $\Pr[s = 1 \mid x]$. Note that generalization to unseen points $x$ is not needed, since reweighting requires only assigning weights to the seen training points. However, in general, training instances are typically unique or very infrequent, since features are real-valued numbers. Instead, features can be discretized based on a partitioning of the input space $X$. The partitioning may be based on simple histogram buckets or the result of a clustering technique. The analysis of this section assumes such a prior partitioning of $X$.

We shall analyze how fast the resulting empirical frequencies converge to the true sampling probability. For $x \in U$, let $U_x$ denote the subsample of $U$ containing exactly all the instances of $x$, and let $n = |U|$ and $n_x = |U_x|$. Furthermore, let $n'$ denote the number of unique points in the sample $U$. Similarly, we define $S_x$, $m$, $m_x$, and $m'$ for the set $S$. Additionally, denote by $p_0 = \min_{x \in U} \Pr[x] \neq 0$.

Lemma 3. Let $\delta > 0$. Then, with probability at least $1 - \delta$, the following inequality holds for all $x$ in $S$:

$$\Big|\Pr[s = 1 \mid x] - \frac{m_x}{n_x}\Big| \leq \sqrt{\frac{\log 2m' + \log\frac{1}{\delta}}{p_0\, n}}. \quad (23)$$

Proof. For a fixed $x \in U$, by Hoeffding's inequality,

$$\Pr_U\Big[\Big|\Pr[s = 1 \mid x] - \frac{m_x}{n_x}\Big| \geq \epsilon\Big] = \sum_{i=1}^n \Pr\Big[\Big|\Pr[s = 1 \mid x] - \frac{m_x}{i}\Big| \geq \epsilon \;\Big|\; n_x = i\Big]\Pr[n_x = i] \leq \sum_{i=1}^n 2e^{-2i\epsilon^2}\Pr_U[n_x = i].$$

Since $n_x$ is a binomial random variable with parameters $\Pr_U[x] = p_x$ and $n$, this last term can be expressed more explicitly and bounded as follows:

$$2\sum_{i=1}^n e^{-2i\epsilon^2}\Pr_U[n_x = i] \leq 2\sum_{i=0}^n e^{-2i\epsilon^2}\binom{n}{i}p_x^i(1-p_x)^{n-i} = 2\big(p_x e^{-2\epsilon^2} + (1 - p_x)\big)^n = 2\big(1 - p_x(1 - e^{-2\epsilon^2})\big)^n \leq 2\exp\big(-p_x n(1 - e^{-2\epsilon^2})\big).$$

Since $1 - e^{-x} \geq x/2$ for $x \in [0, 1]$, this shows that for $\epsilon \in [0, 1]$,

$$\Pr_U\Big[\Big|\Pr[s = 1 \mid x] - \frac{m_x}{n_x}\Big| \geq \epsilon\Big] \leq 2e^{-p_x n \epsilon^2}. \quad (24)$$

By the union bound and the definition of $p_0$,

$$\Pr_U\Big[\exists x \in S: \Big|\Pr[s = 1 \mid x] - \frac{m_x}{n_x}\Big| \geq \epsilon\Big] \leq 2m' e^{-p_0 n \epsilon^2}.$$

Setting $\delta$ to match the upper bound yields the statement of the lemma. ∎

The following proposition bounds the distance between the distribution $W$ corresponding to a perfectly reweighted sample ($S_W$) and the one corresponding to a sample reweighted according to the observed bias ($S_{\widehat{W}}$). For a sampled point $x_i = x$, these distributions are defined as follows:

$$W(x_i) = \frac{1}{m}\,\frac{1}{p(x_i)} \quad\text{and}\quad \widehat{W}(x_i) = \frac{1}{m}\,\frac{1}{\hat{p}(x_i)}, \quad (25)$$

where, for a distinct point $x$ equal to the sampled point $x_i$, we define $p(x_i) = \Pr[s = 1 \mid x]$ and $\hat{p}(x_i) = m_x/n_x$.

Proposition 3. Let $B = \max_{i=1,\ldots,m} \max(1/p(x_i),\, 1/\hat{p}(x_i))$. Then, the $l_1$ and $l_2$ distances of the distributions $W$ and $\widehat{W}$ can be bounded as follows:

$$l_1(W, \widehat{W}) \leq B^2\sqrt{\frac{\log 2m' + \log\frac{1}{\delta}}{p_0\, n}} \quad\text{and}\quad l_2(W, \widehat{W}) \leq B^2\sqrt{\frac{\log 2m' + \log\frac{1}{\delta}}{p_0\, n\, m}}. \quad (26)$$

Proof. By definition of the $l_2$ distance,

$$l_2^2(W, \widehat{W}) = \frac{1}{m^2}\sum_{i=1}^m\Big(\frac{1}{p(x_i)} - \frac{1}{\hat{p}(x_i)}\Big)^2 = \frac{1}{m^2}\sum_{i=1}^m\Big(\frac{p(x_i) - \hat{p}(x_i)}{p(x_i)\,\hat{p}(x_i)}\Big)^2 \leq \frac{B^4}{m}\max_i\big(p(x_i) - \hat{p}(x_i)\big)^2.$$
It can be shown similarly that $l_1(W, \widehat{W}) \leq B^2 \max_i |p(x_i) - \hat{p}(x_i)|$. The application of the uniform convergence bound of Lemma 3 directly yields the statement of the proposition. ∎

The following theorem provides a bound on the difference between the generalization error of the hypothesis returned by a kernel-based regularization algorithm when trained on the perfectly unbiased distribution and that of the one trained on the sample bias-corrected using frequency estimates.

Theorem 2. Let $K$ be a PDS kernel such that $K(x, x) \leq \kappa^2 < \infty$ for all $x \in X$. Let $h_W$ be the hypothesis returned by the regularization algorithm based on $N(\cdot) = \|\cdot\|_K^2$ using $S_W$, and $h_{\widehat{W}}$ the one returned after training the same algorithm on $S_{\widehat{W}}$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the difference in generalization error of these hypotheses is bounded as follows:

$$|R(h_W) - R(h_{\widehat{W}})| \leq \frac{\sigma^2 \kappa^2 B^2}{2\lambda}\sqrt{\frac{\log 2m' + \log\frac{1}{\delta}}{p_0\, n}} \quad\text{and}\quad |R(h_W) - R(h_{\widehat{W}})| \leq \frac{\sigma^2 \kappa\, \lambda_{\max}^{1/2}(\mathbf{K})\, B^2}{2\lambda}\sqrt{\frac{\log 2m' + \log\frac{1}{\delta}}{p_0\, n\, m}}. \quad (27)$$

Proof. The result follows from Proposition 2, the distributional stability and the bounds on the stability coefficient β for kernel-based regularization algorithms (Theorem 1), and the bounds of Proposition 3 on the $l_1$ and $l_2$ distances between the correct distribution $W$ and the estimate $\widehat{W}$. ∎

Let $n_0$ be the number of occurrences, in $U$, of the least frequent training example. For large enough $n$, $p_0 n \approx n_0$; thus, the theorem suggests that the difference of error rate between the hypothesis returned after an optimal reweighting versus the one based on frequency estimates goes to zero as $\sqrt{\log m' / n_0}$. In practice, the number $m' \leq m$ of distinct points in $S$ is small and, a fortiori, $\log m'$ is very small; thus, the convergence rate depends essentially on the rate at which $n_0$ increases. Additionally, if $\lambda_{\max}(\mathbf{K}) \leq m$ (as with Gaussian kernels), the $l_2$-based bound will provide convergence that is at least as fast.

4.2 Kernel Mean Matching

The following definitions, introduced by Steinwart (2002), will be needed for the presentation and discussion of the kernel mean matching (KMM) technique. Let $X$ be a compact metric space and let $C(X)$ denote the space of all continuous functions over $X$ equipped with the standard infinity norm $\|\cdot\|_\infty$. Let $K: X \times X \to \mathbb{R}$ be a PDS kernel. There exists a Hilbert space $F$ and a map $\Phi: X \to F$ such that for all $x, y \in X$, $K(x, y) = \langle \Phi(x), \Phi(y) \rangle$. Note that for a given kernel $K$, $F$ and $\Phi$ are not unique and that, for these definitions, $F$ does not need to be a reproducing kernel Hilbert space (RKHS).

Let $\mathcal{P}$ denote the set of all probability distributions over $X$ and let $\mu: \mathcal{P} \to F$ be the function defined by

$$\forall p \in \mathcal{P}, \quad \mu(p) = \operatorname{E}_{x \sim p}[\Phi(x)]. \quad (28)$$

A function $g: X \to \mathbb{R}$ is said to be induced by $K$ if there exists $w \in F$ such that for all $x \in X$, $g(x) = \langle w, \Phi(x) \rangle$. $K$ is said to be universal if it is continuous and if the set of functions induced by $K$ is dense in $C(X)$.

Theorem 3 (Huang et al. (2006a)). Let $F$ be a separable Hilbert space and let $K$ be a universal kernel with feature space $F$ and feature map $\Phi: X \to F$. Then, $\mu$ is injective.

Proof. The proof given by Huang et al. (2006a) does not seem to be complete; we include a complete proof in the Appendix. ∎

The KMM technique is applicable when the learning algorithm is based on a universal kernel.
The theorem shows that for a universal kernel, the expected value of the feature vectors induced uniquely determines the probability distribution. KMM uses this property to reweight training points so that the average value of the feature vectors for the training data matches that of the feature vectors for a set of unlabeled points drawn from the unbiased distribution.

Let $\gamma_i$ denote the perfect reweighting of the sample point $x_i$ and $\hat{\gamma}_i$ the estimate derived by KMM. Let $B'$ denote the largest possible reweighting coefficient $\gamma$ and let $\epsilon$ be a positive real number. We will assume that $\epsilon$ is chosen so that $\epsilon \leq 1/2$. Then, the KMM constrained optimization is

$$\min_\gamma\; G(\gamma) = \Big\|\frac{1}{m}\sum_{i=1}^m \gamma_i \Phi(x_i) - \frac{1}{n}\sum_{i=1}^n \Phi(x'_i)\Big\| \quad\text{subject to}\quad \gamma_i \in [0, B'] \;\wedge\; \Big|\frac{1}{m}\sum_{i=1}^m \gamma_i - 1\Big| \leq \epsilon. \quad (29)$$

Let $\hat{\gamma}$ be the solution of this optimization problem; then $\frac{1}{m}\sum_{i=1}^m \hat{\gamma}_i = 1 + \epsilon'$ with $-\epsilon \leq \epsilon' \leq \epsilon$. For $i \in [1, m]$, let $\hat{\gamma}'_i = \hat{\gamma}_i / (1 + \epsilon')$. The normalized weights used in KMM's reweighting of the sample are thus defined by $\hat{\gamma}'_i / m$, with $\frac{1}{m}\sum_{i=1}^m \hat{\gamma}'_i = 1$.

As in the previous section, given $x_1, \ldots, x_m \in X$ and a strictly positive definite universal kernel $K$, we denote by $\mathbf{K} \in \mathbb{R}^{m \times m}$ the kernel matrix defined by $\mathbf{K}_{ij} = K(x_i, x_j)$ and by $\lambda_{\min}(\mathbf{K}) > 0$ the smallest eigenvalue of $\mathbf{K}$. We also denote by $\mathrm{cond}(\mathbf{K})$ the condition number of the matrix $\mathbf{K}$: $\mathrm{cond}(\mathbf{K}) = \lambda_{\max}(\mathbf{K})/\lambda_{\min}(\mathbf{K})$. When $K$ is universal, it is continuous over the compact $X \times X$ and thus bounded, and there exists $\kappa < \infty$ such that $K(x, x) \leq \kappa$ for all $x \in X$.

Proposition 4. Let $K$ be a strictly positive definite universal kernel. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the $l_2$ distance of the distributions $\hat{\gamma}'/m$ and $\gamma/m$ is bounded as follows:

$$\frac{1}{m}\|\hat{\gamma}' - \gamma\|_2 \leq \frac{2\epsilon B'}{\sqrt{m}} + \frac{2\kappa^{1/2}}{\lambda_{\min}^{1/2}(\mathbf{K})}\sqrt{\frac{B'^2}{m} + \frac{1}{n}}\Big(1 + \sqrt{2\log\frac{2}{\delta}}\Big). \quad (30)$$

Proof. Since the optimal reweighting $\gamma$ verifies the constraints of the optimization, by definition of $\hat{\gamma}$ as a minimizer, $G(\hat{\gamma}) \leq G(\gamma)$. Thus, by the triangle inequality,

$$\Big\|\frac{1}{m}\sum_{i=1}^m \hat{\gamma}_i \Phi(x_i) - \frac{1}{m}\sum_{i=1}^m \gamma_i \Phi(x_i)\Big\| \leq G(\hat{\gamma}) + G(\gamma) \leq 2G(\gamma). \quad (31)$$

Let $L$ denote the left-hand side of this inequality: $L = \frac{1}{m}\|\sum_{i=1}^m (\hat{\gamma}_i - \gamma_i)\Phi(x_i)\|$. By definition of the norm in the Hilbert space, $L = \frac{1}{m}\sqrt{(\hat{\gamma} - \gamma)^\top \mathbf{K}\, (\hat{\gamma} - \gamma)}$. Then, by the standard bounds for the Rayleigh quotient of PDS matrices, $L \geq \frac{1}{m}\lambda_{\min}^{1/2}(\mathbf{K})\, \|\hat{\gamma} - \gamma\|_2$. This combined with Inequality 31 yields

$$\frac{1}{m}\|\hat{\gamma} - \gamma\|_2 \leq \frac{2G(\gamma)}{\lambda_{\min}^{1/2}(\mathbf{K})}. \quad (32)$$

Thus, by the triangle inequality,

$$\frac{1}{m}\|\hat{\gamma}' - \gamma\|_2 \leq \frac{1}{m}\|\hat{\gamma}' - \hat{\gamma}\|_2 + \frac{1}{m}\|\hat{\gamma} - \gamma\|_2 \leq \frac{|\epsilon'|/m}{1 + \epsilon'}\|\hat{\gamma}\|_2 + \frac{2G(\gamma)}{\lambda_{\min}^{1/2}(\mathbf{K})} \leq \frac{2|\epsilon'| B' \sqrt{m}}{m} + \frac{2G(\gamma)}{\lambda_{\min}^{1/2}(\mathbf{K})} \leq \frac{2\epsilon B'}{\sqrt{m}} + \frac{2G(\gamma)}{\lambda_{\min}^{1/2}(\mathbf{K})}. \quad (33)$$

It is not difficult to show, using McDiarmid's inequality, that for any $\delta > 0$, with probability at least $1 - \delta$, the following holds (Lemma 4, Huang et al. (2006a)):

$$G(\gamma) \leq \kappa^{1/2}\sqrt{\frac{B'^2}{m} + \frac{1}{n}}\Big(1 + \sqrt{2\log\frac{2}{\delta}}\Big). \quad (34)$$

This combined with Inequality 33 yields the statement of the proposition. ∎
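A possible rendering of the optimization (29) as a quadratic program (a sketch under our assumptions, not the authors' code): expanding G(γ)² gives, up to a constant and the 1/m² factor, the objective γᵀKγ − 2κᵀγ with κ_i = (m/n) Σ_j K(x_i, x'_j). Here we solve it with a generic SLSQP solver; kmm_weights and the Gaussian-kernel choice are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def kmm_weights(X_S, X_U, gamma_k, B_prime=1000.0, eps=0.0):
    """Sketch of KMM, eq. (29): minimize ||(1/m) sum_i g_i Phi(x_i) - (1/n) sum_j Phi(x'_j)||
    s.t. 0 <= g_i <= B' and |(1/m) sum_i g_i - 1| <= eps, for a Gaussian kernel."""
    m, n = len(X_S), len(X_U)
    sq = lambda A, B: np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma_k * sq(X_S, X_S))                              # m x m kernel matrix
    kappa = (m / n) * np.exp(-gamma_k * sq(X_S, X_U)).sum(axis=1)    # linear term

    # Squared objective, dropping the constant and the 1/m^2 factor:
    obj  = lambda g: g @ K @ g - 2.0 * kappa @ g
    grad = lambda g: 2.0 * (K @ g - kappa)
    cons = [{"type": "ineq", "fun": lambda g: eps * m - (g.sum() - m)},   # sum <= m(1+eps)
            {"type": "ineq", "fun": lambda g: eps * m + (g.sum() - m)}]   # sum >= m(1-eps)
    res = minimize(obj, np.ones(m), jac=grad, method="SLSQP",
                   bounds=[(0.0, B_prime)] * m, constraints=cons)
    g = res.x
    return g / g.mean()   # normalized weights gamma'_i with (1/m) sum_i gamma'_i = 1
```

With eps = 0, the two inequality constraints pin Σγ_i = m, and the final normalization matches γ̂'_i = γ̂_i/(1 + ε').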
The following theorem provides a bound on the difference between the generalization error of the hypothesis returned by a kernel-based regularization algorithm when trained on the true distribution and that of the one trained on the sample bias-corrected with KMM.

Theorem 4. Let $K$ be a strictly positive definite symmetric universal kernel. Let $h_\gamma$ be the hypothesis returned by the regularization algorithm based on $N(\cdot) = \|\cdot\|_K^2$ using $S_{\gamma/m}$, and $h_{\hat{\gamma}'}$ the one returned after training the same algorithm on $S_{\hat{\gamma}'/m}$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the difference in generalization error of these hypotheses is bounded as follows:

$$|R(h_\gamma) - R(h_{\hat{\gamma}'})| \leq \frac{\sigma^2 \kappa\, \lambda_{\max}^{1/2}(\mathbf{K})}{\lambda}\Bigg(\frac{\epsilon B'}{\sqrt{m}} + \frac{\kappa^{1/2}}{\lambda_{\min}^{1/2}(\mathbf{K})}\sqrt{\frac{B'^2}{m} + \frac{1}{n}}\Big(1 + \sqrt{2\log\frac{2}{\delta}}\Big)\Bigg).$$

For $\epsilon = 0$, the bound becomes

$$|R(h_\gamma) - R(h_{\hat{\gamma}'})| \leq \frac{\sigma^2 \kappa^{3/2}\, \mathrm{cond}^{1/2}(\mathbf{K})}{\lambda}\sqrt{\frac{B'^2}{m} + \frac{1}{n}}\Big(1 + \sqrt{2\log\frac{2}{\delta}}\Big). \quad (35)$$

Proof. The result follows from Proposition 2 and the bound of Proposition 4. ∎

Comparing this bound for $\epsilon = 0$ with the $l_2$ bound of Theorem 2, we first note that $B$ and $B'$ are essentially related modulo the constant $\Pr[s = 1]$, which is not included in the cluster-based reweighting. Thus, the cluster-based convergence is of the order $O\big(\lambda_{\max}^{1/2}(\mathbf{K})\, B^2 \sqrt{\frac{\log m'}{p_0 n m}}\big)$ and the KMM convergence of the order $O\big(\mathrm{cond}^{1/2}(\mathbf{K})\, \frac{B}{\sqrt{m}}\big)$. Taking the ratio of the former over the latter and noticing $p_0^{-1} \approx O(B)$, we obtain the expression $O\big(\sqrt{\frac{\lambda_{\min}(\mathbf{K})\, B \log m'}{n}}\big)$. Thus, for $n > \lambda_{\min}(\mathbf{K})\, B \log(m')$, the convergence of the cluster-based bound is more favorable, while for other values the KMM bound converges faster.

5 Experimental Results

In this section, we compare the performance of the cluster-based reweighting technique and the KMM technique empirically. We first discuss and analyze the properties of the clustering method and our particular implementation.

The analysis of Section 4.1 deals with discrete points, possibly resulting from the use of a quantization or clustering technique. However, due to the relatively small size of the public training sets available, clustering could leave us with few cluster representatives to train with. Instead, in our experiments, we only used the clusters to estimate sampling probabilities and applied these weights to the full set of training points. As the following proposition shows, the $l_1$ and $l_2$ distance bounds of Proposition 3 do not change significantly so long as the cluster size is roughly uniform and the sampling probability is the same for all points within a cluster. We will refer to this as the clustering assumption. In what follows, let $\Pr[s = 1 \mid C_i]$ designate the sampling probability for all $x \in C_i$. Finally, define $q(C_i) = \Pr[s = 1 \mid C_i]$ and $\hat{q}(C_i) = |C_i \cap S| / |C_i \cap U|$.

Proposition 5. Let $B = \max_{i=1,\ldots,m} \max(1/q(C_i),\, 1/\hat{q}(C_i))$. Then, the $l_1$ and $l_2$ distances of the distributions $W$ and $\widehat{W}$ can be bounded as follows:

$$l_1(W, \widehat{W}) \leq B^2\sqrt{\frac{|C_M|\, k\, (\log 2k + \log\frac{1}{\delta})}{q_0\, n\, m}} \quad\text{and}\quad l_2(W, \widehat{W}) \leq B^2\sqrt{\frac{|C_M|\, k\, (\log 2k + \log\frac{1}{\delta})}{q_0\, n\, m^2}},$$

where $q_0 = \min_i q(C_i)$ and $|C_M| = \max_i |C_i|$.

Proof. By definition of the $l_2$ distance,

$$l_2^2(W, \widehat{W}) = \frac{1}{m^2}\sum_{i=1}^k\sum_{x \in C_i}\Big(\frac{1}{p(x)} - \frac{1}{\hat{p}(x)}\Big)^2 = \frac{1}{m^2}\sum_{i=1}^k\sum_{x \in C_i}\Big(\frac{1}{q(C_i)} - \frac{1}{\hat{q}(C_i)}\Big)^2 \leq \frac{B^4\, |C_M|}{m^2}\sum_{i=1}^k \max_i\big(q(C_i) - \hat{q}(C_i)\big)^2.$$

The second equality follows from the clustering assumption, and the inequality then follows from exactly the same steps as in the proof of Proposition 3, factoring away the sum over the elements of $C_i$.
Finally, it is easy to see that the $\max_i(q(C_i) - \hat{q}(C_i))$ term can be bounded just as in Lemma 3, using a uniform convergence bound, with the union bound now taken over the clusters rather than over unique points. ∎

Note that when the cluster size is uniform, then $|C_M|\, k = m$, and the bound above leads to an expression similar to that of Proposition 3.

We used the leaves of a decision tree to define the clusters. A decision tree selects binary cuts on the coordinates of $x \in X$ that greedily minimize a node impurity measure, e.g., the MSE for regression (Breiman et al., 1984). Points with similar features and labels are clustered together in this way, with the assumption that these will also have similar sampling probabilities.

Several methods for bias correction are compared in Table 1. Each method assigns corrective weights to the training samples. The unweighted method uses weight 1 for every training instance. The ideal method uses weight $\frac{1}{\Pr[s = 1 \mid x]}$, which is optimal but requires the sampling distribution to be known. The clustered method uses weight $|C_i \cap U| / |C_i \cap S|$, where the clusters $C_i$ are regression tree leaves with a minimum count of 4 (larger cluster sizes showed similar, though declining, performance). The KMM method uses the approach of Huang et al. (2006b) with a Gaussian kernel and parameters $\sigma = \sqrt{d/2}$ for $x \in \mathbb{R}^d$, $B' = 1000$, $\epsilon = 0$. Note that we know of no principled way to do cross-validation with KMM since it cannot produce weights for a held-out set (Sugiyama et al., 2008).

Table 1. Normalized mean-squared error (NMSE) for various regression data sets using unweighted, ideal, clustered, and kernel-mean-matched training sample reweightings.

DATASET       |U|     |S|    n_test   UNWEIGHTED      IDEAL           CLUSTERED       KMM
ABALONE       2000    724    2177     0.654 ± 0.019   0.551 ± 0.032   0.623 ± 0.034   0.709 ± 0.122
BANK32NH      4500    2384   3693     0.903 ± 0.022   0.610 ± 0.044   0.635 ± 0.046   0.691 ± 0.055
BANK8FM       4499    1998   3693     0.085 ± 0.003   0.058 ± 0.001   0.068 ± 0.002   0.079 ± 0.013
CAL-HOUSING   16512   9511   4128     0.395 ± 0.010   0.360 ± 0.009   0.375 ± 0.010   0.595 ± 0.054
CPU-ACT       4000    2400   4192     0.673 ± 0.014   0.523 ± 0.080   0.568 ± 0.018   0.518 ± 0.237
CPU-SMALL     4000    2368   4192     0.682 ± 0.053   0.477 ± 0.097   0.408 ± 0.071   0.531 ± 0.280
HOUSING       300     116    206      0.509 ± 0.049   0.390 ± 0.053   0.482 ± 0.042   0.469 ± 0.148
KIN8NM        5000    2510   3192     0.594 ± 0.008   0.523 ± 0.045   0.574 ± 0.018   0.704 ± 0.068
PUMA8NH       4499    2246   3693     0.685 ± 0.013   0.674 ± 0.019   0.641 ± 0.012   0.903 ± 0.059

The regression data sets are from LIAAD (www.liaad.up.pt/~ltorgo/Regression/DataSets.html) and are sampled with

$$P[s = 1 \mid x] = \frac{e^v}{1 + e^v}, \quad\text{where}\quad v = \frac{4\, w \cdot (x - \bar{x})}{\sigma_{w \cdot (x - \bar{x})}},$$

$x \in \mathbb{R}^d$, $w \in \mathbb{R}^d$ is chosen at random from $[-1, 1]^d$, and $\sigma_{w \cdot (x - \bar{x})}$ denotes the standard deviation of $w \cdot (x - \bar{x})$. In our experiments, we chose ten random projections $w$ and reported results with the $w$, for each data set, that maximized the difference between the unweighted and ideal methods over repeated sampling trials. In this way, we selected bias samplings that are good candidates for bias correction estimation.
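A possible realization of the clustered method with scikit-learn (a sketch assuming S ⊆ U, so that every leaf containing a training point also contains pool points; the helper cluster_weights is ours):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def cluster_weights(X_S, y_S, X_U, min_leaf=4):
    # Regression-tree leaves define the clusters C_i; points with similar
    # features and labels fall in the same leaf.
    tree = DecisionTreeRegressor(min_samples_leaf=min_leaf).fit(X_S, y_S)
    leaf_S = tree.apply(X_S)                    # cluster id of each training point
    leaf_U = tree.apply(X_U)                    # cluster id of each pool point
    L = int(max(leaf_S.max(), leaf_U.max())) + 1
    count_S = np.bincount(leaf_S, minlength=L)
    count_U = np.bincount(leaf_U, minlength=L)
    return count_U[leaf_S] / count_S[leaf_S]    # w_i = |C_i ∩ U| / |C_i ∩ S|
```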
For our experiments, we used a version of SVR available from LibSVM (www.csie.ntu.edu.tw/~cjlin/libsvmtools) that can take weighted samples as input, with parameter values $C = 1$ and $\epsilon = 0.1$, combined with a Gaussian kernel with parameter $\sigma = \sqrt{d/2}$. We report results using the normalized mean-squared error (NMSE):

$$\mathrm{NMSE} = \frac{1}{n_{\mathrm{test}}}\sum_{i=1}^{n_{\mathrm{test}}}\frac{(y_i - \hat{y}_i)^2}{\sigma_y^2},$$

and provide means and standard deviations for ten-fold cross-validation.

Our results show that reweighting with more reliable counts, due to clustering, can be effective for the problem of sample bias correction. These results also confirm the dependence that our theoretical bounds exhibit on the quantity $n_0$. The results obtained using KMM seem to be consistent with those reported by the authors of this technique. (We thank Arthur Gretton for discussion and help in clarifying the choice of the parameters and design of the KMM experiments reported in (Huang et al., 2006b), and for providing the code used by the authors for comparison studies.)
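As a concrete rendering of this evaluation protocol (a sketch, ours: the paper uses a weighted-sample variant of LibSVM, which we approximate here with scikit-learn's sample_weight; the Gaussian width σ = √(d/2) translates to the RBF parameter γ = 1/(2σ²) = 1/d in scikit-learn's convention):

```python
import numpy as np
from sklearn.svm import SVR

def nmse(y_true, y_pred):
    # Normalized mean-squared error: MSE divided by the target variance.
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

def weighted_svr_nmse(X_tr, y_tr, w, X_te, y_te, C=1.0, eps=0.1):
    # Train SVR on a weighted sample and evaluate NMSE on the test set.
    d = X_tr.shape[1]
    model = SVR(C=C, epsilon=eps, kernel="rbf", gamma=1.0 / d)
    model.fit(X_tr, y_tr, sample_weight=w)     # weight-sensitive training
    return nmse(y_te, model.predict(X_te))
```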
6 Conclusion

We presented a general analysis of sample selection bias correction and gave bounds analyzing the effect of an estimation error on the accuracy of the hypotheses returned. The notion of distributional stability and the techniques presented are general and can be of independent interest for the analysis of learning algorithms in other settings. In particular, these techniques apply similarly to other importance weighting algorithms and can be used in other contexts, such as that of learning in the presence of uncertain labels. The analysis of the discriminative method of (Bickel et al., 2007) for the problem of covariate shift could perhaps also benefit from this study.

References

Bickel, S., Brückner, M., & Scheffer, T. (2007). Discriminative learning for differing training and test distributions. ICML 2007 (pp. 81–88).
Bousquet, O., & Elisseeff, A. (2002). Stability and generalization. JMLR, 2, 499–526.
Breiman, L., Friedman, J., Stone, C., & Olshen, R. (1984). Classification and regression trees. New York, NY, USA: Chapman & Hall.
Cortes, C., & Vapnik, V. N. (1995). Support-Vector Networks. Machine Learning, 20, 273–297.
Devroye, L., & Wagner, T. (1979). Distribution-free performance bounds for potential function rules. IEEE Trans. on Information Theory (pp. 601–604).
Dudík, M., Schapire, R. E., & Phillips, S. J. (2006). Correcting sample selection bias in maximum entropy density estimation. NIPS 2005.
Elkan, C. (2001). The foundations of cost-sensitive learning. IJCAI (pp. 973–978).
Fan, W., Davidson, I., Zadrozny, B., & Yu, P. S. (2005). An improved categorization of classifier's sensitivity on sample selection bias. ICDM 2005 (pp. 605–608). IEEE Computer Society.
Heckman, J. J. (1979). Sample Selection Bias as a Specification Error. Econometrica, 47, 151–161.
Huang, J., Smola, A., Gretton, A., Borgwardt, K., & Schölkopf, B. (2006a). Correcting Sample Selection Bias by Unlabeled Data (Technical Report CS-2006-44). University of Waterloo.
Huang, J., Smola, A. J., Gretton, A., Borgwardt, K. M., & Schölkopf, B. (2006b). Correcting sample selection bias by unlabeled data. NIPS 2006 (pp. 601–608).
Kearns, M., & Ron, D. (1997). Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. COLT 1997 (pp. 152–162).
Little, R. J. A., & Rubin, D. B. (1986). Statistical analysis with missing data. New York, NY, USA: John Wiley & Sons, Inc.
Saunders, C., Gammerman, A., & Vovk, V. (1998). Ridge Regression Learning Algorithm in Dual Variables. ICML 1998 (pp. 515–521).
Steinwart, I. (2002). On the influence of the kernel on the consistency of support vector machines. JMLR, 2, 67–93.
Sugiyama, M., Nakajima, S., Kashima, H., von Bünau, P., & Kawanabe, M. (2008). Direct importance estimation with model selection and its application to covariate shift adaptation. NIPS 2008.
Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley-Interscience.
Zadrozny, B. (2004). Learning and evaluating classifiers under sample selection bias. ICML 2004.
Zadrozny, B., Langford, J., & Abe, N. (2003). Cost-sensitive learning by cost-proportionate example weighting. ICDM 2003.

A Proof of Theorem 3

Proof. Assume that $\mu(p) = \mu(q)$ for two probability distributions $p$ and $q$ in $\mathcal{P}$. It is known that if $\operatorname{E}_{x \sim p}[f(x)] = \operatorname{E}_{x \sim q}[f(x)]$ for all $f \in C(X)$, then $p = q$. Let $f \in C(X)$ and fix $\epsilon > 0$. Since $K$ is universal, there exists a function $g$ induced by $K$ such that $\|f - g\|_\infty \leq \epsilon$. $\operatorname{E}_{x \sim p}[f(x)] - \operatorname{E}_{x \sim q}[f(x)]$ can be rewritten as

$$\operatorname{E}_{x \sim p}[f(x) - g(x)] + \operatorname{E}_{x \sim p}[g(x)] - \operatorname{E}_{x \sim q}[g(x)] + \operatorname{E}_{x \sim q}[g(x) - f(x)]. \quad (36)$$

Since $|\operatorname{E}_{x \sim p}[f(x) - g(x)]| \leq \operatorname{E}_{x \sim p}|f(x) - g(x)| \leq \|f - g\|_\infty \leq \epsilon$, and similarly $|\operatorname{E}_{x \sim q}[f(x) - g(x)]| \leq \epsilon$,

$$\big|\operatorname{E}_{x \sim p}[f(x)] - \operatorname{E}_{x \sim q}[f(x)]\big| \leq \big|\operatorname{E}_{x \sim p}[g(x)] - \operatorname{E}_{x \sim q}[g(x)]\big| + 2\epsilon. \quad (37)$$

Since $g$ is induced by $K$, there exists $w \in F$ such that for all $x \in X$, $g(x) = \langle w, \Phi(x) \rangle$. Since $F$ is separable, it admits a countable orthonormal basis $(e_n)_{n \in \mathbb{N}}$. For $n \in \mathbb{N}$, let $w_n = \langle w, e_n \rangle$ and $\Phi_n(x) = \langle \Phi(x), e_n \rangle$. Then, $g(x) = \sum_{n=0}^\infty w_n \Phi_n(x)$. For each $N \in \mathbb{N}$, consider the partial sum $g_N(x) = \sum_{n=0}^N w_n \Phi_n(x)$. By the Cauchy-Schwarz inequality,

$$|g_N(x)| = \Big|\Big\langle \sum_{n=0}^N w_n e_n,\, \Phi(x)\Big\rangle\Big| \leq \Big\|\sum_{n=0}^N w_n e_n\Big\|_2\, \|\Phi(x)\|_2 \leq \|w\|_2\, \|\Phi(x)\|_2. \quad (38)$$

Since $K$ is universal, it is continuous and thus $\Phi$ is also continuous (Steinwart, 2002). Thus, $x \mapsto \|\Phi(x)\|_2$ is a continuous function over the compact $X$ and admits an upper bound $B \geq 0$. Thus, $|g_N(x)| \leq \|w\|_2\, B$ for all $N$; this constant is clearly $p$-integrable, with $\int \|w\|_2\, B\, dp = \|w\|_2\, B$. Thus, by the Lebesgue dominated convergence theorem, the following holds:

$$\operatorname{E}_{x \sim p}[g(x)] = \int \sum_{n=0}^\infty w_n \Phi_n(x)\, dp(x) = \sum_{n=0}^\infty w_n \int \Phi_n(x)\, dp(x). \quad (39)$$

By definition of $\operatorname{E}_{x \sim p}[\Phi(x)]$, the last term is the inner product of $w$ and that term. Thus,

$$\operatorname{E}_{x \sim p}[g(x)] = \big\langle w, \operatorname{E}_{x \sim p}[\Phi(x)]\big\rangle = \langle w, \mu(p) \rangle. \quad (40)$$

A similar equality holds with the distribution $q$; thus, $\operatorname{E}_{x \sim p}[g(x)] - \operatorname{E}_{x \sim q}[g(x)] = \langle w, \mu(p) - \mu(q) \rangle = 0$. Thus, Inequality 37 can be rewritten as

$$\big|\operatorname{E}_{x \sim p}[f(x)] - \operatorname{E}_{x \sim q}[f(x)]\big| \leq 2\epsilon, \quad (41)$$

for all $\epsilon > 0$. This implies $\operatorname{E}_{x \sim p}[f(x)] = \operatorname{E}_{x \sim q}[f(x)]$ for all $f \in C(X)$ and the injectivity of $\mu$. ∎