Importance Weighted Active Learning


Authors: Alina Beygelzimer, Sanjoy Dasgupta, John Langford

Alina Beygelzimer (beygel@us.ibm.com), IBM Thomas J. Watson Research Center, Hawthorne, NY 10532, USA
Sanjoy Dasgupta (dasgupta@cs.ucsd.edu), University of California, San Diego, La Jolla, CA 92093, USA
John Langford (jl@yahoo-inc.com), Yahoo! Research, New York, NY 10018, USA

Abstract

We present a practical and statistically consistent scheme for actively learning binary classifiers under general loss functions. Our algorithm uses importance weighting to correct sampling bias, and by controlling the variance, we are able to give rigorous label complexity bounds for the learning process. Experiments on passively labeled data show that this approach reduces the label complexity required to achieve good predictive performance on many learning problems.

Keywords: active learning, importance weighting, sampling bias

1. Introduction

Active learning is typically defined by contrast to the passive model of supervised learning. In passive learning, all the labels for an unlabeled dataset are obtained at once, while in active learning the learner interactively chooses which data points to label. The great hope of active learning is that interaction can substantially reduce the number of labels required, making learning more practical. This hope is known to be valid in certain special cases, where the number of labels needed to learn actively has been shown to be logarithmic in the usual sample complexity of passive learning; such cases include thresholds on a line, and linear separators with a spherically uniform unlabeled data distribution (Dasgupta et al., 2005). Many earlier active learning algorithms, such as those of Cohn et al. (1994) and Dasgupta et al. (2005), have problems with data that are not perfectly separable under the given hypothesis class.
In such cases, they can exhibit a lack of statistical consistency: even with an infinite labeling budget, they might not converge to an optimal predictor (see Dasgupta and Hsu (2008) for a discussion).

This problem has recently been addressed in two threads of research. One approach (Balcan et al., 2006; Dasgupta et al., 2008; Hanneke, 2007) constructs learning algorithms that explicitly use sample complexity bounds to assess which hypotheses are still "in the running" (given the labels seen so far), thereby assessing the relative value of different unlabeled points (in terms of whether they help distinguish between the remaining hypotheses). These algorithms have the usual PAC-style convergence guarantees, and they also have rigorous label complexity bounds that are in many cases significantly better than the bounds for passive supervised learning. However, these algorithms have yet to see practical use. First, they are built explicitly for 0–1 loss and are not easily adapted to most other loss functions. This is problematic because in many applications, other loss functions are more appropriate for describing the problem, or make learning more tractable (as with convex proxy losses on linear representations). Second, these algorithms make internal use of generalization bounds that are often loose in practice, and they can thus end up requiring far more labels than are really necessary. Finally, they typically require an explicit enumeration over the hypothesis class (or an ε-cover thereof), which is generally computationally intractable.

The second approach to active learning uses importance weights to correct sampling bias (Bach, 2007; Sugiyama, 2006). This approach has only been analyzed in limited settings.
For example, Bach (2007) considers linear models and provides an analysis of consistency in cases where either (i) the model class fits the data perfectly, or (ii) the sampling strategy is non-adaptive (that is, the data point queried at time t doesn't depend on the sequence of previous queries). The analysis in these works is also asymptotic rather than yielding finite label bounds, while minimizing the actual label complexity is of paramount importance in active learning. Furthermore, the analysis does not prescribe how to choose importance weights, and a poor choice can result in high label complexity.

Importance-weighted active learning

We address the problems above with an active learning scheme that provably yields PAC-style label complexity guarantees. When presented with an unlabeled point x_t, this scheme queries its label with a carefully chosen probability p_t, taking into account the identity of the point and the history of labels seen so far. The points that end up getting labeled are then weighted according to the reciprocals of these probabilities (that is, 1/p_t), in order to remove sampling bias. We show (theorem 1) that this simple method guarantees statistical consistency: for any distribution and any hypothesis class, active learning eventually converges to the optimal hypothesis in the class.

As in any importance sampling scenario, the biggest challenge is controlling the variance of the process. This depends crucially on how the sampling probability p_t is chosen. Our strategy, roughly, is to make it proportional to the spread of values h(x_t), as h ranges over the remaining candidate hypotheses (those with good performance on the labeled points so far). For this setting of p_t, which we call IWAL(loss-weighting), we have two results.
First, we show (theorem 2) a fallback guarantee: the label complexity is never much worse than that of supervised learning. Second, we rigorously analyze the label complexity in terms of underlying parameters of the learning problem (theorem 7). Previously, label complexity bounds for active learning were only known for 0–1 loss, and were based on the disagreement coefficient of the learning problem (Hanneke, 2007). We generalize this notion to general loss functions, and analyze label complexity in terms of it. We consider settings in which these bounds turn out to be roughly the square root of the sample complexity of supervised learning.

In addition to these upper bounds, we show a general lower bound on the label complexity of active learning (theorem 9) that significantly improves the best previous such result (Kääriäinen, 2006).

We conduct practical experiments with two IWAL algorithms. The first is a specialization of IWAL(loss-weighting) to the case of linear classifiers with convex loss functions; here, the algorithm becomes tractable via convex programming (section 7). The second, IWAL(bootstrap), uses a simple bootstrapping scheme that reduces active learning to (batch) passive learning without requiring much additional computation (section 7.2). In every case, these experiments yield substantial reductions in label complexity compared to passive learning, without compromising predictive performance. They suggest that IWAL is a practical scheme that can reduce the label complexity of active learning without sacrificing the statistical guarantees (like consistency) we take for granted in passive learning.
Other related work

The active learning algorithms of Abe and Mamitsuka (1998), based on boosting and bagging, are similar in spirit to our IWAL(bootstrap) algorithm in section 7.2. But these earlier algorithms are not consistent in the presence of adversarial noise: they may never converge to the correct solution, even given an infinite label budget. In contrast, IWAL(bootstrap) is consistent and satisfies further guarantees (section 2).

The field of experimental design (Pukelsheim, 2006) emphasizes regression problems in which the conditional distribution of the response variable given the predictor variables is assumed to lie in a certain class; the goal is to synthesize query points such that the resulting least-squares estimator has low variance. In contrast, we are interested in an agnostic setting that makes no assumption that the model class is powerful enough to represent the ideal solution. Moreover, we are not allowed to synthesize queries, but merely to choose them from a stream (or pool) of candidate queries provided to us. A telling difference between the two models is that in experimental design, it is common to query the same point repeatedly, whereas in our setting this would make no sense.

2. Preliminaries

Let X be the input space and Y the output space. We consider active learning in the streaming setting, where at each step t a learner observes an unlabeled point x_t ∈ X and has to decide whether to ask for the label y_t ∈ Y. The learner works with a hypothesis space H = {h : X → Z}, where Z is a prediction space. The algorithm is evaluated with respect to a given loss function l : Z × Y → [0, ∞). The most common loss function is 0–1 loss, in which Y = Z = {−1, 1} and l(z, y) = 1(y ≠ z) = 1(yz < 0).
The following examples address the binary case Y = {−1, 1} with Z ⊂ R:

• l(z, y) = (1 − yz)_+ (hinge loss),
• l(z, y) = ln(1 + e^{−yz}) (logistic loss),
• l(z, y) = (y − z)² = (1 − yz)² (squared loss), and
• l(z, y) = |y − z| = |1 − yz| (absolute loss).

Notice that all the loss functions mentioned here are of the form l(z, y) = φ(yz) for some function φ on the reals. We specifically highlight this subclass of loss functions when proving label complexity bounds. Since these functions are bounded (if Z is), we further assume they are normalized to output a value in [0, 1].

3. The Importance Weighting Skeleton

Algorithm 1 describes the basic outline of importance-weighted active learning (IWAL). Upon seeing x_t, the learner calls a subroutine rejection-threshold (instantiated in later sections), which looks at x_t and past history to return the probability p_t of requesting y_t. The algorithm maintains a set of labeled examples seen so far, each with an importance weight: if y_t ends up being queried, its weight is set to 1/p_t.

Algorithm 1 IWAL (subroutine rejection-threshold)
Set S_0 = ∅. For t from 1, 2, . . . until the data stream runs out:
1. Receive x_t.
2. Set p_t = rejection-threshold(x_t, {x_i, y_i, p_i, Q_i : 1 ≤ i < t}).
3. Flip a coin Q_t ∈ {0, 1} with E[Q_t] = p_t. If Q_t = 1, request y_t and set S_t = S_{t−1} ∪ {(x_t, y_t, 1/p_t)}; else S_t = S_{t−1}.
4. Let h_t = arg min_{h ∈ H} Σ_{(x,y,c) ∈ S_t} c · l(h(x), y).

Let D be the underlying probability distribution on X × Y. The expected loss of h ∈ H on D is given by L(h) = E_{(x,y)∼D} l(h(x), y). Since D is always clear from context, we drop it from notation.
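To make the control flow concrete, here is a minimal Python sketch of Algorithm 1. This is illustrative only, not the authors' implementation: the `rejection_threshold` callable, the in-memory stream of (x, y) pairs, and the finite list of hypotheses are assumptions made for the sketch (in a real deployment the label y would only be fetched when the coin comes up heads).

```python
import random

def iwal(stream, hypotheses, loss, rejection_threshold, rng=random.Random(0)):
    """Sketch of the IWAL skeleton (Algorithm 1).

    stream: iterable of (x_t, y_t); conceptually y_t is revealed only if queried.
    rejection_threshold: callable returning the query probability p_t
        from x_t and the history of past rounds.
    Returns the labeled set S of (x, y, 1/p) triples and the final
    importance-weighted empirical loss minimizer over `hypotheses`.
    """
    S, history = [], []
    for x, y in stream:
        p = rejection_threshold(x, history)
        q = 1 if rng.random() < p else 0          # coin Q_t with E[Q_t] = p_t
        if q:
            S.append((x, y, 1.0 / p))             # importance weight 1/p_t
        history.append((x, y if q else None, p, q))

    def weighted_loss(h):                         # step 4 of Algorithm 1
        return sum(c * loss(h(xi), yi) for xi, yi, c in S)

    return S, min(hypotheses, key=weighted_loss)
```

With a constant threshold p_t = 1/2 this reduces to passive learning on a random half of the stream, with every queried point carrying weight 2; the interesting instantiations of `rejection_threshold` come in section 4.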
The importance weighted estimate of the loss at time T is

L_T(h) = (1/T) Σ_{t=1}^{T} (Q_t / p_t) · l(h(x_t), y_t),

where Q_t is as defined in the algorithm. It is easy to see that E[L_T(h)] = L(h), with the expectation taken over all the random variables involved. Theorem 2 gives large deviation bounds for L_T(h), provided that the probabilities p_t are chosen carefully.

3.1 A safety guarantee for IWAL

A desirable property for a learning algorithm is consistency: given an infinite budget of unlabeled and labeled examples, does it converge to the best predictor? Some early active learning algorithms (Cohn et al., 1994; Dasgupta et al., 2005) do not satisfy this baseline guarantee: they have problems if the data cannot be classified perfectly by the given hypothesis class. We prove that IWAL algorithms are consistent, as long as p_t is bounded away from 0. Further, we prove that the label complexity required is within a constant factor of supervised learning in the worst case.

Theorem 1 For all distributions D, for all finite hypothesis classes H, for any δ > 0, if there is a constant p_min > 0 such that p_t ≥ p_min for all 1 ≤ t ≤ T, then

P[ max_{h ∈ H} |L_T(h) − L(h)| > (√2 / p_min) · √((ln |H| + ln(2/δ)) / T) ] < δ.

Comparing this result to the usual sample complexity bounds in supervised learning (for example, corollary 4.2 of Langford (2005)), we see that the label complexity is at most 2/p_min² times that of a supervised algorithm. For simplicity, the bound is given in terms of ln |H| rather than the VC dimension of H. The argument, which is a martingale modification of standard results, can be extended to VC spaces.

Proof Fix the underlying distribution. For a hypothesis h ∈ H, consider the sequence of random variables U_1, . . . , U_T with

U_t = (Q_t / p_t) · l(h(x_t), y_t) − L(h).
Since p_t ≥ p_min, we have |U_t| ≤ 1/p_min. The sequence Z_t = Σ_{i=1}^{t} U_i is a martingale, letting Z_0 = 0. Indeed, for any 1 ≤ t ≤ T,

E[Z_t | Z_{t−1}, . . . , Z_0]
= Z_{t−1} + E_{Q_t, x_t, y_t, p_t}[ (Q_t / p_t) l(h(x_t), y_t) − L(h) | Z_{t−1}, . . . , Z_0 ]
= Z_{t−1} + E_{x_t, y_t}[ l(h(x_t), y_t) − L(h) | Z_{t−1}, . . . , Z_0 ]
= Z_{t−1},

where the second equality uses E[Q_t | p_t] = p_t. Observe that |Z_{t+1} − Z_t| = |U_{t+1}| ≤ 1/p_min for all 0 ≤ t < T. Using Z_T = T(L_T(h) − L(h)) and applying Azuma's inequality (Azuma, 1967), we see that for any λ > 0,

P[ |L_T(h) − L(h)| > λ / (p_min √T) ] = P[ |Z_T| > λ√T / p_min ] < 2e^{−λ²/2}.

Setting λ = √(2(ln |H| + ln(2/δ))) and taking a union bound over h ∈ H then yields the desired result. ∎

4. Setting the Rejection Threshold: Loss Weighting

Algorithm 2 gives a particular instantiation of the rejection-threshold subroutine in IWAL. The subroutine maintains an effective hypothesis class H_t, which is initially all of H and then gradually shrinks by setting H_{t+1} to the subset of H_t whose empirical loss isn't too much worse than L*_t, the smallest empirical loss in H_t:

H_{t+1} = {h ∈ H_t : L_t(h) ≤ L*_t + Δ_t}.

The allowed slack Δ_t = √((8/t) ln(2t(t+1)|H|²/δ)) comes from a standard sample complexity bound.

We will show that, with high probability, any optimal hypothesis h* is always in H_t, and thus all other hypotheses can be discarded from consideration. For each x_t, the loss-weighting scheme looks at the range of predictions on x_t made by hypotheses in H_t and sets the sampling probability p_t to the size of this range. More precisely,

p_t = max_{f,g ∈ H_t} max_y [ l(f(x_t), y) − l(g(x_t), y) ].

Since the loss values are normalized to lie in [0, 1], we can be sure that p_t is also in this interval.
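For a finite hypothesis class, one round of this loss-weighting rule can be sketched directly in Python. This is a hedged illustration, not the paper's code: the helper names (`slack`, `loss_weighting`) and the representation of the labeled set as (x_i, y_i, 1/p_i) triples are our assumptions, and the whole computation is brute force over H, which is only feasible for tiny classes.

```python
import math

def slack(t, H_size, delta=0.05):
    # Delta_t = sqrt((8/t) ln(2 t (t+1) |H|^2 / delta)), the paper's slack term
    return math.sqrt((8.0 / t) * math.log(2.0 * t * (t + 1) * H_size ** 2 / delta))

def loss_weighting(x, labeled, H_prev, H_size, loss, labels, t):
    """One call of the loss-weighting rejection threshold (sketch of
    Algorithm 2 for a finite class).  `labeled` holds (x_i, y_i, 1/p_i)
    for the queried points among the first t-1.  Returns (p_t, H_t)."""
    n = max(t - 1, 1)
    def L(h):  # importance-weighted empirical loss over the first t-1 rounds
        return sum(c * loss(h(xi), yi) for xi, yi, c in labeled) / n
    L_star = min(L(h) for h in H_prev)
    H_t = [h for h in H_prev if L(h) <= L_star + slack(n, H_size)]
    # p_t = size of the range of possible losses on x over H_t
    p_t = max(loss(f(x), y) - loss(g(x), y)
              for f in H_t for g in H_t for y in labels)
    return max(0.0, min(1.0, p_t)), H_t
```

Note how the rule behaves: points on which all surviving hypotheses agree get p_t = 0 and are never queried, while points where the surviving hypotheses disagree maximally get p_t = 1.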
The next section shows that the resulting IWAL has several desirable properties.

Algorithm 2 loss-weighting (x, {x_i, y_i, p_i, Q_i : i < t})
1. Initialize H_0 = H.
2. Update

L*_{t−1} = min_{h ∈ H_{t−1}} (1/(t−1)) Σ_{i=1}^{t−1} (Q_i / p_i) l(h(x_i), y_i),

H_t = { h ∈ H_{t−1} : (1/(t−1)) Σ_{i=1}^{t−1} (Q_i / p_i) l(h(x_i), y_i) ≤ L*_{t−1} + Δ_{t−1} }.

3. Return p_t = max_{f,g ∈ H_t, y ∈ Y} [ l(f(x), y) − l(g(x), y) ].

4.1 A generalization bound

We start with a large deviation bound for each h_t output by IWAL(loss-weighting). It is not a corollary of theorem 1 because it does not require the sampling probabilities to be bounded away from zero.

Theorem 2 Pick any data distribution D and hypothesis class H, and let h* ∈ H be a minimizer of the loss function with respect to D. Pick any δ > 0. With probability at least 1 − δ, for any T ≥ 1,
◦ h* ∈ H_T, and
◦ L(f) − L(g) ≤ 2Δ_{T−1} for any f, g ∈ H_T.
In particular, if h_T is the output of IWAL(loss-weighting), then L(h_T) − L(h*) ≤ 2Δ_{T−1}.

We need the following lemma for the proof.

Lemma 1 For all data distributions D, for all hypothesis classes H, for all δ > 0, with probability at least 1 − δ, for all T and all f, g ∈ H_T,

|L_T(f) − L_T(g) − L(f) + L(g)| ≤ Δ_T.

Proof Pick any T and f, g ∈ H_T. Define

Z_t = (Q_t / p_t) [ l(f(x_t), y_t) − l(g(x_t), y_t) ] − (L(f) − L(g)).

Then E[Z_t | Z_1, . . . , Z_{t−1}] = E_{x_t, y_t}[ l(f(x_t), y_t) − l(g(x_t), y_t) − (L(f) − L(g)) | Z_1, . . . , Z_{t−1} ] = 0. Thus Z_1, Z_2, . . . is a martingale difference sequence, and we can use Azuma's inequality to show that its sum is tightly concentrated, provided the individual Z_t are bounded. To check boundedness, observe that since f and g are in H_T, they must also be in H_1, H_2, . . . , H_{T−1}.
Thus for all t ≤ T, p_t ≥ |l(f(x_t), y_t) − l(g(x_t), y_t)|, whereupon

|Z_t| ≤ (1/p_t) |l(f(x_t), y_t) − l(g(x_t), y_t)| + |L(f) − L(g)| ≤ 2.

We allow failure probability δ/(T(T+1)) at time T. Applying Azuma's inequality, we have

P[ |L_T(f) − L_T(g) − L(f) + L(g)| ≥ Δ_T ]
= P[ | (1/T) Σ_{t=1}^{T} ( (Q_t/p_t)(l(f(x_t), y_t) − l(g(x_t), y_t)) − (L(f) − L(g)) ) | ≥ Δ_T ]
= P[ | Σ_{t=1}^{T} Z_t | ≥ T Δ_T ]
≤ 2e^{−T Δ_T² / 8} = δ / (T(T+1)|H|²).

Since H_T is a random subset of H, it suffices to take a union bound over all f, g ∈ H; a further union bound over T finishes the proof. ∎

Proof (Theorem 2) Start by assuming that the 1 − δ probability event of lemma 1 holds. We first show by induction that h* = arg min_{h ∈ H} L(h) is in H_T for all T. This holds at T = 1, since H_1 = H_0 = H. Now suppose it holds at T; we show that it is true at T + 1. Let h_T minimize L_T over H_T. By lemma 1,

L_T(h*) − L_T(h_T) ≤ L(h*) − L(h_T) + Δ_T ≤ Δ_T.

Thus L_T(h*) ≤ L*_T + Δ_T and hence h* ∈ H_{T+1}. Since H_T ⊆ H_{T−1}, lemma 1 implies that for any f, g ∈ H_T,

L(f) − L(g) ≤ L_{T−1}(f) − L_{T−1}(g) + Δ_{T−1} ≤ L*_{T−1} + Δ_{T−1} − L*_{T−1} + Δ_{T−1} = 2Δ_{T−1}.

Since h_T, h* ∈ H_T, we have L(h_T) ≤ L(h*) + 2Δ_{T−1}. ∎

5. Label Complexity

We showed that the loss of the classifier output by IWAL(loss-weighting) is similar to the loss of the classifier chosen passively after seeing all T labels. How many of those T labels does the active learner request? Dasgupta et al. (2008) studied this question for an active learning scheme under 0–1 loss.
For learning problems with bounded disagreement coefficient (Hanneke, 2007), the number of queries was found to be O(ηT + d log² T), where d is the VC dimension of the function class, and η is the best error rate achievable on the underlying distribution by that function class. We will soon see (section 6) that the term ηT is inevitable for any active learning scheme; the remaining term has just a polylogarithmic dependence on T.

We generalize the disagreement coefficient to arbitrary loss functions and show that, under conditions similar to the earlier result, the number of queries is O(ηT + √(dT) log² T), where η is now the best achievable loss. The inevitable ηT is still there, and the second term is still sublinear, though not polylogarithmic as before.

5.1 Label Complexity: Main Issues

Suppose the loss function is minimized by h* ∈ H, with L* = L(h*). Theorem 2 shows that at time t, the remaining hypotheses H_t include h* and all have losses in the range [L*, L* + 2Δ_{t−1}]. We now prove that under suitable conditions, the sampling probability p_t has expected value ≈ L* + Δ_{t−1}. Thus the expected total number of labels queried up to time T is roughly L*T + Σ_{t=1}^{T} Δ_{t−1} ≈ L*T + √(T ln |H|).

To motivate the proof, consider a loss function l(z, y) = φ(yz); all our examples are of this form. Say φ is differentiable with 0 < C_0 ≤ |φ′| ≤ C_1. Then the sampling probability for x_t is

p_t = max_{f,g ∈ H_t} max_{y ∈ {−1,+1}} [ l(f(x_t), y) − l(g(x_t), y) ]
= max_{f,g ∈ H_t} max_y [ φ(y f(x_t)) − φ(y g(x_t)) ]
≤ C_1 max_{f,g ∈ H_t} max_y |y f(x_t) − y g(x_t)|
= C_1 max_{f,g ∈ H_t} |f(x_t) − g(x_t)|
≤ 2 C_1 max_{h ∈ H_t} |h(x_t) − h*(x_t)|.

So p_t is determined by the range of predictions on x_t by hypotheses in H_t.
Can we bound the size of this range, given that any h ∈ H_t has loss at most L* + 2Δ_{t−1}?

2Δ_{t−1} ≥ L(h) − L*
≥ E_{x,y} |l(h(x), y) − l(h*(x), y)| − 2L*
≥ C_0 E_{x,y} |y(h(x) − h*(x))| − 2L*
= C_0 E_x |h(x) − h*(x)| − 2L*.

So we can upper-bound max_{h ∈ H_t} E_x |h(x) − h*(x)| (in terms of L* and Δ_{t−1}), whereas we want to upper-bound the expected value of p_t, which is proportional to E_x max_{h ∈ H_t} |h(x) − h*(x)|. The ratio between these two quantities is related to a fundamental parameter of the learning problem, a generalization of the disagreement coefficient (Hanneke, 2007).

We flesh out this intuition in the remainder of this section. First we describe a broader class of loss functions than those considered above (including 0–1 loss, which is not differentiable); a distance metric on hypotheses; and a generalized disagreement coefficient. We then prove that for this broader class, active learning performs better than passive learning when the generalized disagreement coefficient is small.

5.2 A subclass of loss functions

We give label complexity upper bounds for a class of loss functions that includes 0–1 loss and logistic loss but not hinge loss. Specifically, we require that the loss function has bounded slope asymmetry, defined below. Recall earlier notation: response space Z, classifier space H = {h : X → Z}, and loss function l : Z × Y → [0, ∞). Henceforth, the label space is Y = {−1, +1}.

Definition 3 The slope asymmetry of a loss function l : Z × Y → [0, ∞) is

K_l = sup_{z, z′ ∈ Z} [ max_{y ∈ Y} |l(z, y) − l(z′, y)| ] / [ min_{y ∈ Y} |l(z, y) − l(z′, y)| ].

The slope asymmetry is 1 for 0–1 loss, and ∞ for hinge loss. For differentiable loss functions l(z, y) = φ(yz), it is easily related to bounds on the derivative.
Lemma 2 Let l_φ(z, y) = φ(yz), where φ is a differentiable function defined on Z = [−B, B] ⊂ R. Suppose C_0 ≤ |φ′(z)| ≤ C_1 for all z ∈ Z. Then for any z, z′ ∈ Z and any y ∈ {−1, +1},

C_0 |z − z′| ≤ |l_φ(z, y) − l_φ(z′, y)| ≤ C_1 |z − z′|.

Thus l_φ has slope asymmetry at most C_1/C_0.

Proof By the mean value theorem, there is some ξ ∈ Z such that l_φ(z, y) − l_φ(z′, y) = φ(yz) − φ(yz′) = φ′(ξ)(yz − yz′). Thus |l_φ(z, y) − l_φ(z′, y)| = |φ′(ξ)| · |z − z′|, and the rest follows from the bounds on φ′. ∎

For instance, this immediately applies to logistic loss.

Corollary 4 Logistic loss l(z, y) = ln(1 + e^{−yz}), defined on label space Y = {−1, +1} and response space [−B, B], has slope asymmetry at most 1 + e^B.

5.3 Topologizing the space of classifiers

We introduce a simple distance function on the space of classifiers.

Definition 5 For any f, g ∈ H and distribution D, define

ρ(f, g) = E_{x∼D} max_y |l(f(x), y) − l(g(x), y)|.

For any r ≥ 0, let B(f, r) = {g ∈ H : ρ(f, g) ≤ r}.

Suppose L* = min_{h ∈ H} L(h) is realized at h*. We know that at time t, the remaining hypotheses have loss at most L* + 2Δ_{t−1}. Does this mean they are close to h* in ρ-distance? The ratio between the two can be expressed in terms of the slope asymmetry of the loss.

Lemma 3 For any distribution D and any loss function with slope asymmetry K_l, we have ρ(h, h*) ≤ K_l (L(h) + L*) for all h ∈ H.

Proof For any h ∈ H,

ρ(h, h*) = E_x max_y |l(h(x), y) − l(h*(x), y)|
≤ K_l E_{x,y} |l(h(x), y) − l(h*(x), y)|
≤ K_l ( E_{x,y}[l(h(x), y)] + E_{x,y}[l(h*(x), y)] )
= K_l (L(h) + L(h*)). ∎
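The slope asymmetry of Definition 3 is easy to probe numerically. The short sketch below (ours, for illustration; the grid-based `slope_asymmetry` helper is an assumption) estimates K_l for logistic loss over a finite grid of responses in [−B, B] and checks it against Corollary 4's bound of 1 + e^B.

```python
import math

def logistic(z, y):
    # logistic loss l(z, y) = ln(1 + e^{-yz})
    return math.log(1.0 + math.exp(-y * z))

def slope_asymmetry(loss, Z, ys):
    """Empirical slope asymmetry over a finite grid Z of responses:
    max over pairs z != z' of  max_y |l(z,y)-l(z',y)| / min_y |l(z,y)-l(z',y)|."""
    worst = 1.0
    for i, z in enumerate(Z):
        for zp in Z[i + 1:]:
            diffs = [abs(loss(z, y) - loss(zp, y)) for y in ys]
            if min(diffs) > 0.0:  # guard against degenerate float zeros
                worst = max(worst, max(diffs) / min(diffs))
    return worst

B = 2.0
grid = [-B + 0.05 * k for k in range(int(2 * B / 0.05) + 1)]
K = slope_asymmetry(logistic, grid, (-1, 1))
assert K <= 1.0 + math.exp(B)  # consistent with Corollary 4
```

The worst pairs sit near the edge of the response range, where a prediction difference moves the loss a lot for one label and very little for the other; this is exactly the asymmetry that makes hinge loss (whose flat region gives a zero denominator) have K_l = ∞.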
5.4 A generalized disagreement coefficient

When analyzing the A² algorithm (Balcan et al., 2006) for active learning under 0–1 loss, Hanneke (2007) found that its label complexity could be characterized in terms of what he called the disagreement coefficient of the learning problem. We now generalize this notion to arbitrary loss functions.

Definition 6 The disagreement coefficient is the infimum value of θ such that for all r,

E_{x∼D} sup_{h ∈ B(h*, r)} sup_y |l(h(x), y) − l(h*(x), y)| ≤ θr.

Here is a simple example for linear separators.

Lemma 4 Suppose H consists of linear classifiers {u ∈ R^d : ‖u‖ ≤ B} and the data distribution D is uniform over the surface of the unit sphere in R^d. Suppose the loss function is l(z, y) = φ(yz) for differentiable φ with C_0 ≤ |φ′| ≤ C_1. Then the disagreement coefficient is at most (2C_1/C_0)√d.

Proof Let h* be the optimal classifier, and h any other classifier with ρ(h, h*) ≤ r. Let u*, u be the corresponding vectors in R^d. Using lemma 2,

r ≥ E_{x∼D} sup_y |l(h(x), y) − l(h*(x), y)| ≥ C_0 E_{x∼D} |h(x) − h*(x)| = C_0 E_{x∼D} |(u − u*) · x| ≥ C_0 ‖u − u*‖ / (2√d).

Thus for any h ∈ B(h*, r), the corresponding vectors satisfy ‖u − u*‖ ≤ 2r√d / C_0. We can now bound the disagreement coefficient:

E_{x∼D} sup_{h ∈ B(h*, r)} sup_y |l(h(x), y) − l(h*(x), y)|
≤ C_1 E_{x∼D} sup_{h ∈ B(h*, r)} |h(x) − h*(x)|
≤ C_1 E_x sup { |(u − u*) · x| : ‖u − u*‖ ≤ 2r√d / C_0 }
≤ C_1 · 2r√d / C_0. ∎

5.5 Upper Bound on Label Complexity

Finally, we give a bound on label complexity for learning problems with bounded disagreement coefficient and loss functions with bounded slope asymmetry.
Theorem 7 For all learning problems D and hypothesis spaces H, if the loss function has slope asymmetry K_l, and the learning problem has disagreement coefficient θ, then for all δ > 0, with probability at least 1 − δ over the choice of data, the expected number of labels requested by IWAL(loss-weighting) during the first T iterations is at most

4θ · K_l · ( L*T + O(√(T ln(|H|T/δ))) ),

where L* is the minimum loss achievable on D by H, and the expectation is over the randomness in the selective sampling.

Proof Suppose h* ∈ H achieves loss L*. Pick any time t. By theorem 2, H_t ⊂ {h ∈ H : L(h) ≤ L* + 2Δ_{t−1}}, and by lemma 3, H_t ⊂ B(h*, r) for r = K_l(2L* + 2Δ_{t−1}). Thus, the expected value of p_t (over the choice of x at time t) is at most

E_{x∼D} sup_{f,g ∈ H_t} sup_y |l(f(x), y) − l(g(x), y)|
≤ 2 E_{x∼D} sup_{h ∈ H_t} sup_y |l(h(x), y) − l(h*(x), y)|
≤ 2 E_{x∼D} sup_{h ∈ B(h*, r)} sup_y |l(h(x), y) − l(h*(x), y)|
≤ 2θr = 4θ · K_l · (L* + Δ_{t−1}).

Summing over t = 1, . . . , T gives the theorem. ∎

5.6 Other examples of low label complexity

It is also sometimes possible to achieve substantial label complexity reductions over passive learning, even when the slope asymmetry is infinite.

Example 1 Let the space X be the ball of radius 1 in d dimensions. Let the distribution D on X be a point mass at the origin with weight 1 − β and label 1, and a point mass at (1, 0, 0, . . . , 0) with weight β and label −1 half the time and label 0 the other half of the time. Let the hypothesis space be linear with weight vectors satisfying ‖w‖ ≤ 1. Let the loss of interest be squared loss: l(h(x), y) = (h(x) − y)², which has infinite slope asymmetry.
Observation 8 For the example above, IWAL(loss-weighting) requires only an expected β fraction of the labeled samples of passive learning to achieve the same loss.

Proof Passive learning samples from the point mass at the origin a (1 − β) fraction of the time, while active learning never samples there, since all predictors have the same loss on samples at the origin. Because only samples away from the origin influence the sample complexity, active learning collects informative samples at 1/β times the rate of passive learning, implying the observation. ∎

6. A lower bound on label complexity

Kääriäinen (2006) showed that for any hypothesis class H and any η > ε > 0, there is a data distribution such that (a) the optimal error rate achievable by H is η; and (b) any active learner that finds h ∈ H with error rate ≤ η + ε (with probability > 1/2) must make η²/ε² queries. We now strengthen this lower bound to dη²/ε², where d is the VC dimension of H.

Let's see how this relates to the label complexity rates of the previous section. It is well known that if a supervised learner sees T examples (for any T > d/η), its final hypothesis has error ≤ η + √(dη/T) with high probability (Devroye et al., 1996). Think of this as η + ε for ε = √(dη/T). Our lower bound now implies that an active learner must make at least dη²/ε² = ηT queries. This explains the ηT leading term in all the label complexity bounds we have discussed.
Theorem 9 For any η, ε > 0 such that 2ε ≤ η ≤ 1/4, for any input space X and hypothesis class H (of functions mapping X into Y = {+1, −1}) of VC dimension 1 < d < ∞, there is a distribution over X × Y such that (a) the best error rate achievable by H is η; and (b) any active learner seeking a classifier of error at most η + ε must make Ω(dη²/ε²) queries to succeed with probability at least 1/2.

Proof Pick a set of d points x_0, x_1, x_2, . . . , x_{d−1} shattered by H. Here is a distribution over X × Y: point x_0 has probability 1 − β, while each of the remaining x_i has probability β/(d − 1), where β = 2(η + 2ε). At x_0, the response is always y = 1. At x_i, i ≥ 1, the response is y = 1 with probability 1/2 + γb_i, where b_i is either +1 or −1, and γ = 2ε/β = ε/(η + 2ε) < 1/4.

Nature starts by picking b_1, . . . , b_{d−1} uniformly at random. This defines the target hypothesis h*: h*(x_0) = 1 and h*(x_i) = b_i. Its error rate is β · (1/2 − γ) = η.

Any learner outputs a hypothesis in H and thus implicitly makes guesses at the underlying hidden bits b_i. Unless it correctly determines b_i for at least 3/4 of the points x_1, . . . , x_{d−1}, the error of its hypothesis will be at least η + (1/4) · β · (2γ) = η + ε.

Now, suppose the active learner makes ≤ c(d − 1)/γ² queries, where c is a small constant (c ≤ 1/125 suffices). We'll show that it fails (outputs a hypothesis with error ≥ η + ε) with probability at least 1/2. Say x_i is heavily queried if the active learner queries it at least 4c/γ² times. At most 1/4 of the x_i's are heavily queried; without loss of generality, these are x_1, . . . , x_k, for some k ≤ (d − 1)/4.
The remaining x_i get so few queries that the learner guesses each corresponding bit b_i with probability less than 2/3; this can be derived from Slud's lemma (below), which relates the tails of a binomial to those of a normal. Let F_i denote the event that the learner gets b_i wrong, so E[F_i] ≥ 1/3 for i > k. Since k ≤ (d − 1)/4, the probability that the learner fails is given by

P[learner fails] = P[F_1 + · · · + F_{d−1} ≥ (d − 1)/4]
≥ P[F_{k+1} + · · · + F_{d−1} ≥ (d − 1)/4]
≥ P[B ≥ (d − 1)/4]
≥ P[Z ≥ 0] = 1/2,

where B is a binomial((3/4)(d − 1), 1/3) random variable, Z is a standard normal, and the last inequality follows from Slud's lemma. Thus the active learner must make at least c(d − 1)/γ² = Ω(dη²/ε²) queries to succeed with probability at least 1/2. ∎

Lemma 5 (Slud (1977)) Let B be a Binomial(n, p) random variable with p ≤ 1/2, and let Z be a standard normal. For any k ∈ [np, n(1 − p)],

P[B ≥ k] ≥ P[Z ≥ (k − np)/√(np(1 − p))].

Theorem 9 uses the same example that is used for lower bounds on supervised sample complexity (section 14.4 of Devroye et al. (1996)), although in that case the lower bound is dη/ε². The bound for active learning is smaller by a factor of η because the active learner can avoid making repeated queries to the "heavy" point x_0, whose label is immediately obvious.

7. Implementing IWAL

IWAL(loss-weighting) can be efficiently implemented in the case where H is the class of bounded-length linear separators {u ∈ R^d : ‖u‖₂ ≤ B} and the loss function is convex: l(z, y) = φ(yz) for convex φ. Each iteration of Algorithm 2 involves solving two optimization problems over a restricted hypothesis set H_t.
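To make the quantity being optimized concrete, here is a crude Monte-Carlo stand-in for those two optimization problems: it samples weight vectors uniformly from the norm ball, keeps those whose importance-weighted empirical loss is within a slack of the best sampled loss, and reads off the spread of losses on the query point. This is only an illustration of what the convex programs compute; the `approx_pt` helper, the random-search strategy, and the slack argument are our assumptions, and the paper's actual implementation solves the maximization and minimization exactly by convex programming.

```python
import math
import random

def approx_pt(x, labeled, B, phi, labels, slack, n=2000, rng=random.Random(1)):
    """Monte-Carlo approximation of
        p_t = max_{f,g in H_t} max_y [ phi(y f(x)) - phi(y g(x)) ]
    over H_t ~ {u : ||u|| <= B, weighted empirical loss <= best + slack}.
    `labeled` holds (x_i, y_i, 1/p_i) triples; illustrative only."""
    d = len(x)

    def sample_u():  # uniform over the radius-B ball in R^d
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        r = B * rng.random() ** (1.0 / d)
        return [vi * r / norm for vi in v]

    def emp_loss(u):
        return sum(c * phi(yi * sum(ui * xi for ui, xi in zip(u, xv)))
                   for xv, yi, c in labeled)

    us = [sample_u() for _ in range(n)]
    losses = [emp_loss(u) for u in us]
    best = min(losses)
    # predictions on x of the near-optimal sampled hypotheses
    scores = [sum(ui * xi for ui, xi in zip(u, x))
              for u, l_ in zip(us, losses) if l_ <= best + slack]
    return max(max(phi(y * s) for s in scores) - min(phi(y * s) for s in scores)
               for y in labels)
```

Because φ is monotone, the inner maximization over f, g reduces to the extreme predictions max and min of u · x over the surviving set, which is exactly why two convex programs per label y suffice in the exact implementation.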
