Algorithmic Connections Between Active Learning and Stochastic Convex Optimization


Authors: Aaditya Ramdas, Aarti Singh

Aaditya Ramdas (Machine Learning Department, Carnegie Mellon University, aramdas@cs.cmu.edu) and Aarti Singh (Machine Learning Department, Carnegie Mellon University, aarti@cs.cmu.edu)

September 24, 2018

Abstract

Interesting theoretical associations have been established by recent papers between the fields of active learning and stochastic convex optimization due to the common role of feedback in sequential querying mechanisms. In this paper, we continue this thread in two parts by exploiting these relations for the first time to yield novel algorithms in both fields, further motivating the study of their intersection. First, inspired by a recent optimization algorithm that was adaptive to unknown uniform convexity parameters, we present a new active learning algorithm for one-dimensional thresholds that can yield minimax rates by adapting to unknown noise parameters. Next, we show that one can perform $d$-dimensional stochastic minimization of smooth uniformly convex functions when only granted oracle access to noisy gradient signs along any coordinate instead of real-valued gradients, by using a simple randomized coordinate descent procedure where each line search can be solved by one-dimensional active learning, provably achieving the same error convergence rate as having the entire real-valued gradient. Combining these two parts yields an algorithm that solves stochastic convex optimization of uniformly convex and smooth functions using only noisy gradient signs by repeatedly performing active learning, achieves optimal rates and is adaptive to all unknown convexity and smoothness parameters.

1 Introduction

The two fields of convex optimization and active learning seem to have evolved quite independently of each other.
Recently, [1] pointed out their relatedness due to the inherent sequential nature of both fields and the complex role of feedback in taking future actions. Following that, [2] made the connections more explicit by tying together the exponent used in noise conditions in active learning and the exponent used in uniform convexity (UC) in optimization. They used this to establish lower bounds (and tight upper bounds) in stochastic optimization of UC functions based on proof techniques from active learning. However, it was unclear whether there were concrete algorithmic ideas in common between the fields.

Here, we provide a positive answer by exploiting the aforementioned connections to form new and interesting algorithms that clearly demonstrate that the complexity of $d$-dimensional stochastic optimization is precisely the complexity of one-dimensional active learning. Inspired by an optimization algorithm that was adaptive to unknown uniform convexity parameters, we design an interesting one-dimensional active learner that is also adaptive to unknown noise parameters. This algorithm is simpler than the adaptive active learning algorithm proposed recently in [3], which handles the pool-based active learning setting. Given access to this active learner as a subroutine for line search, we show that a simple randomized coordinate descent procedure can minimize uniformly convex functions with a much simpler stochastic oracle that returns only a Bernoulli random variable representing a noisy sign of the gradient in a single coordinate direction, rather than a full-dimensional real-valued gradient vector. The resulting algorithm is adaptive to all unknown UC and smoothness parameters and achieves minimax optimal convergence rates.
We spend the first two sections describing the problem setup and preliminary insights, before describing our algorithms in Sections 3 and 4.

1.1 Setup of First-Order Stochastic Convex Optimization

First-order stochastic convex optimization is the task of approximately minimizing a convex function over a convex set, given oracle access to unbiased estimates of the function and gradient at any point, using as few queries as possible [4]. We will assume that we are given an arbitrary set $S \subset \mathbb{R}^d$ of known diameter bound $R = \max_{x,y \in S} \|x - y\|$. A convex function $f$ with $x^* = \arg\min_{x \in S} f(x)$ is said to be $k$-uniformly convex if, for some $\lambda > 0$, $k \geq 2$, we have for all $x, y \in S$
$$f(y) \geq f(x) + \nabla f(x)^\top (y - x) + \frac{\lambda}{2} \|x - y\|^k$$
(strong convexity arises when $k = 2$). $f$ is $L$-Lipschitz for some $L > 0$ if $\|\nabla f(x)\|_* \leq L$ (where $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$); equivalently, for all $x, y \in S$,
$$|f(x) - f(y)| \leq L \|x - y\|.$$
A differentiable $f$ is $H$-strongly smooth (or has an $H$-Lipschitz gradient) for some $H > \lambda$ if for all $x, y \in S$ we have $\|\nabla f(x) - \nabla f(y)\|_* \leq H \|x - y\|$, or equivalently
$$f(y) \leq f(x) + \nabla f(x)^\top (y - x) + \frac{H}{2} \|x - y\|^2.$$
In this paper we shall always assume $\|\cdot\| = \|\cdot\|_* = \|\cdot\|_2$ and deal with strongly smooth and uniformly convex functions with parameters $\lambda > 0$, $k \geq 2$, $L, H > 0$. A stochastic first-order oracle is a function that accepts $x \in S$ and returns $\big(\hat{f}(x), \hat{g}(x)\big) \in \mathbb{R}^{d+1}$, where $\mathbb{E}[\hat{f}(x)] = f(x)$ and $\mathbb{E}[\hat{g}(x)] = \nabla f(x)$ (these unbiased estimates also have bounded variance) and the expectation is over any internal randomness of the oracle.
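To make the oracle model concrete, here is a minimal Python sketch (not from the paper; the wrapper `make_first_order_oracle`, the Gaussian noise model and the constant `sigma` are our illustrative choices) of a stochastic first-order oracle around a known test function:

```python
import numpy as np

def make_first_order_oracle(f, grad, sigma=0.1, rng=None):
    """Wrap an exact f and its gradient into a stochastic first-order oracle:
    each query at x returns unbiased, bounded-variance estimates of f(x)
    and grad f(x), with independent Gaussian noise (our modeling choice)."""
    rng = rng or np.random.default_rng(0)
    def oracle(x):
        x = np.asarray(x, dtype=float)
        f_hat = f(x) + sigma * rng.standard_normal()         # E[f_hat] = f(x)
        g_hat = grad(x) + sigma * rng.standard_normal(x.shape)  # E[g_hat] = grad f(x)
        return f_hat, g_hat
    return oracle

# Example: f(x) = 0.5 ||x||^2 is strongly convex (k = 2) with grad f(x) = x.
oracle = make_first_order_oracle(lambda x: 0.5 * x @ x, lambda x: x)
```

Averaging repeated queries at a fixed point recovers the true function value and gradient, which is exactly the unbiasedness the definition above requires.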
An optimization algorithm is a method that sequentially queries an oracle at points in $S$ and returns $\hat{x}_T$ as an estimate of the optimum of $f$ after $T$ queries (or alternatively tries to achieve an error of $\epsilon$); its performance can be measured by either the function error $f(\hat{x}_T) - f(x^*)$ or the point error $\|\hat{x}_T - x^*\|$.

1.2 Stochastic Gradient-Sign Oracles

Define a stochastic sign oracle to be a function of $x \in S$, $j \in \{1, \dots, d\}$, that returns $\hat{s}_j(x) \in \{+, -\}$, where
$$\big|\eta(x) - 0.5\big| = \Theta\big(\big|[\nabla f(x)]_j\big|\big) \quad \text{and} \quad \eta(x) = \Pr\big(\hat{s}_j(x) = + \mid x\big),$$
where $\hat{s}_j(x)$ is a noisy sign of $[\nabla f(x)]_j$, $[\nabla f(x)]_j$ is the $j$-th coordinate of $\nabla f$, and the probability is over any internal randomness of the oracle. This behavior of $\eta(x)$ actually needs to hold only when $\big|[\nabla f(x)]_j\big|$ is small.

In this paper, we consider coordinate descent algorithms that are motivated by applications where computing the overall gradient, or even a function value, can be expensive due to high dimensionality or huge amounts of data, but computing the gradient in any one coordinate can be cheap. [5] mentions the example of $\min_x \frac{1}{2}\|Ax - b\|^2 + \frac{1}{2}\|x\|^2$ for some $n \times d$ matrix $A$ (or any other regularization that decomposes over dimensions). Computing the gradient $A^\top(Ax - b) + x$ is expensive because of the matrix-vector multiply. However, its $j$-th coordinate is $A_j^\top(Ax - b) + x_j$ and requires an expense of only $n$ if the residual vector $Ax - b$ is kept track of (this is easy to do, since on a single coordinate update of $x$, the residual change is proportional to $A_j$, an additional expense of $n$).

A sign oracle is weaker than a first-order oracle, and can actually be obtained by returning the sign of the first-order oracle's noisy gradient if the mass of the noise distribution grows linearly around its zero mean (argued in the next section).
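The residual-tracking trick in the example above can be sketched in a few lines of Python (the helper `coord_step` and the step size `alpha` are our illustrative names; the paper only makes the cost argument):

```python
import numpy as np

def coord_step(A, b, x, r, j, alpha):
    """One coordinate-descent update for f(x) = 0.5||Ax - b||^2 + 0.5||x||^2,
    keeping the residual r = Ax - b up to date.

    The j-th gradient coordinate is A[:, j] @ r + x[j]: a single O(n) dot
    product, versus O(n d) for the full gradient A.T @ r + x."""
    g_j = A[:, j] @ r + x[j]           # j-th coordinate of the gradient, cost n
    x = x.copy()
    x[j] -= alpha * g_j                # move along coordinate j only
    r = r + (-alpha * g_j) * A[:, j]   # residual change is proportional to A[:, j], cost n
    return x, r
```

After every update the invariant `r == A @ x - b` still holds, so each subsequent coordinate gradient again costs only $O(n)$.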
At the optimum along coordinate $j$, the oracle returns $\pm 1$ with equal probability, and otherwise returns the correct sign with a probability proportional to the value of the directional derivative at that point (this reflects the fact that the larger the derivative's absolute value, the easier it would be for the oracle to approximate its sign, hence the smaller the probability of error). It is not unreasonable that there may be other circumstances where even calculating the (real-valued) gradient in the $j$-th direction could be expensive, but estimating its sign could be a much easier task, as it only requires estimating whether function values are expected to increase or decrease along a coordinate (in a similar spirit to function comparison oracles [6], but with slightly more power). We will also see that the rates for optimization crucially depend on whether the gradient noise is sign-preserving or not. For instance, with rounding errors or floats stored at small precision, one can get deterministic rates as if we had the exact gradient, since the rounding or lower precision doesn't flip signs.

1.3 Setup of Active Threshold Learning

The problem of one-dimensional threshold estimation assumes we have an interval of length $R$, say $[0, R]$. Given a point $x$, it has a label $y \in \{+, -\}$ drawn from an unknown conditional distribution $\eta(x) = \Pr(Y = + \mid X = x)$, and the threshold $t$ is the unique point where $\eta(x) = 1/2$, with $\eta$ being larger than half on one side of $t$ and smaller than half on the other (hence it is more likely to draw a $+$ on one side of $t$ and a $-$ on the other side).
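The behavior just described can be simulated directly: with additive Gaussian gradient noise (our illustrative construction; `sign_oracle` and `plus_freq` are hypothetical names), the probability of a $+$ label is exactly $1/2$ at a zero derivative and moves away from $1/2$ proportionally to the derivative when it is small:

```python
import numpy as np

def sign_oracle(grad_j, sigma=1.0, rng=None):
    """A stochastic sign oracle built from a noisy gradient coordinate:
    returns sign([grad f(x)]_j + z) in {+1, -1} with z ~ N(0, sigma^2).
    Then Pr[+1] = Phi(grad_j / sigma): exactly 1/2 when grad_j = 0, and
    deviating from 1/2 proportionally to grad_j for small grad_j."""
    rng = rng or np.random.default_rng()
    return 1 if grad_j + sigma * rng.standard_normal() > 0 else -1

rng = np.random.default_rng(0)
def plus_freq(g, n=50000):
    """Empirical frequency of + labels at a given gradient value."""
    return sum(sign_oracle(g, rng=rng) == 1 for _ in range(n)) / n
```

At `g = 0` the oracle is a fair coin; at `g = 0.5` (with `sigma = 1`) roughly 69% of labels are $+$, i.e. the correct sign is returned with probability bounded away from $1/2$.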
The task of active learning of threshold classifiers allows the learner to sequentially query $T$ (possibly dependent) points, observing labels drawn from the unknown conditional distribution after each query, with the goal of returning a guess $\hat{x}_T$ as close to $t$ as possible. In the formal study of classification (cf. [7]), it is common to study minimax rates when the regression function $\eta(x)$ satisfies Tsybakov's noise or margin condition (TNC) with exponent $k$ at the threshold $t$. Different versions of this boundary noise condition are used in regression, density or level-set estimation, and lead to an improvement in minimax optimal rates (for classification, also cf. [8], [3]). Here, we present the version of TNC used in [9]:
$$M|x - t|^{k-1} \geq |\eta(x) - 1/2| \geq \mu |x - t|^{k-1} \quad \text{whenever } |\eta(x) - 1/2| \leq \epsilon_0,$$
for some constants $M > \mu > 0$, $\epsilon_0 > 0$, $k \geq 1$. (Throughout, $f = \Theta(g)$ means $f = \Omega(g)$ and $f = O(g)$. Note that $|x - t| \leq \delta_0 := (\epsilon_0/M)^{\frac{1}{k-1}} \Rightarrow |\eta(x) - 1/2| \leq \epsilon_0 \Rightarrow |x - t| \leq (\epsilon_0/\mu)^{\frac{1}{k-1}}$.)

A standard measure of how well a classifier $h$ performs is given by its risk, which is simply the probability of classification error (expectation under 0-1 loss), $R(h) = \Pr\big(h(x) \neq y\big)$. The performance of threshold learning strategies can be measured by the excess classification risk of the resulting threshold classifier at $\hat{x}_T$ compared to the Bayes optimal classifier at $t$, as given by
$$R(\hat{x}_T) - R(t) = \int_{\hat{x}_T \wedge t}^{\hat{x}_T \vee t} |2\eta(x) - 1| \, dx \qquad (1)$$
(where $a \vee b := \max(a, b)$ and $a \wedge b := \min(a, b)$). In the above expression, akin to [9], we use a uniform marginal distribution for active learning since there is no underlying distribution over $x$. Alternatively, one can simply measure the one-dimensional point error $|\hat{x}_T - t|$ in estimation of the threshold. Minimax rates for the risk and point error in active learning under the TNC were provided in [9] and are summarized in the next section.
1.4 Summary of Contributions

Now that we have introduced the notation used in our paper and some relevant previous work (more in the next section), we can clearly state our contributions.

• We generalize an idea from [10] to present a simple epoch-based active learning algorithm with a passive learning subroutine that can optimally learn one-dimensional thresholds and is adaptive to unknown noise parameters.

• We show that noisy gradient signs suffice for minimization of uniformly convex functions by proving that a random coordinate descent algorithm with an active learning line-search subroutine achieves minimax convergence rates.

• Due to the connection between the relevant exponents in the two fields, we can combine the above two methods to get an algorithm that achieves minimax optimal rates and is adaptive to unknown convexity parameters.

• As a corollary, we argue that with access to possibly noisy non-exact gradients that don't switch any signs (rounding errors or low-precision storage are sign-preserving), we can still achieve exponentially fast deterministic rates.

2 Preliminary Insights

2.1 Connections Between Exponents

Taking one point as $x^*$ in the definition of UC, we see that
$$|f(x) - f(x^*)| \geq \frac{\lambda}{2} \|x - x^*\|^k.$$
Since $\|\nabla f(x)\| \, \|x - x^*\| \geq \nabla f(x)^\top (x - x^*) \geq f(x) - f(x^*)$ (by convexity),
$$\|\nabla f(x) - 0\| \geq \frac{\lambda}{2} \|x - x^*\|^{k-1}.$$
Another relevant fact for us will be that uniformly convex functions in $d$ dimensions are uniformly convex along any one direction; in other words, for every fixed $x \in S$ and fixed unit vector $u \in \mathbb{R}^d$, the univariate function of $\alpha$ defined by $f_{x,u}(\alpha) := f(x + \alpha u)$ is also UC with the same parameters (since $f$ is UC, $f_{x,u}(\alpha) \geq f_{x,u}(0) + \alpha \nabla f_{x,u}(0) + \frac{\lambda}{2}|\alpha|^k$).
For $u = e_j$,
$$\big|[\nabla f(x)]_j - 0\big| \geq \frac{\lambda}{2} \|x - x^*_j\|^{k-1},$$
where $x^*_j = x + \alpha^*_j e_j$ and $\alpha^*_j = \arg\min_{\{\alpha \,:\, x + \alpha e_j \in S\}} f(x + \alpha e_j)$. This uncanny similarity to the TNC (since $\nabla f(x^*) = 0$) was mathematically exploited in [2], where the authors used a lower-bounding proof technique for one-dimensional active threshold learning from [9] to provide a new lower-bounding proof technique for the $d$-dimensional stochastic convex optimization of UC functions. In particular, they showed that the minimax rate for the one-dimensional active learning excess risk and the $d$-dimensional optimization function error both scale like $\tilde{\Theta}\big(T^{-\frac{k}{2k-2}}\big)$, and that the point error in both settings scales like $\tilde{\Theta}\big(T^{-\frac{1}{2k-2}}\big)$, where $k$ is either the TNC exponent or the UC exponent, depending on the setting (we use $\tilde{O}, \tilde{\Theta}$ to hide constants and polylogarithmic factors). The importance of this connection cannot be emphasized enough, and we will see it being useful throughout this paper.

As mentioned earlier, [9] requires a two-sided TNC condition (upper and lower growth conditions, to provide an exact tight rate of growth) in order to prove risk upper bounds. On a similar note, for uniformly convex functions, we will assume such a local $k$-strong-smoothness condition around directional minima.

Assumption LkSS: for all $j \in \{1, \dots, d\}$,
$$\big|[\nabla f(x)]_j - 0\big| \leq \Lambda \|x - x^*_j\|^{k-1}$$
for some constant $\Lambda > \lambda/2$, so that we can tightly characterize the rate of growth as
$$\big|[\nabla f(x)]_j - 0\big| = \Theta\big(\|x - x^*_j\|^{k-1}\big).$$
This condition is implied by strong smoothness, i.e. Lipschitz-smooth gradients, when $k = 2$ (for strongly convex and strongly smooth functions), but is a slightly stronger assumption otherwise.
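This two-sided growth can be checked numerically on a toy univariate model (our illustrative example, not from the paper: $f(x) = \frac{\lambda}{2}|x|^k$ with minimizer $x^* = 0$, for which the derivative magnitude is exactly $\frac{\lambda k}{2}|x|^{k-1}$, so an LkSS-style constant $\Lambda = \frac{\lambda k}{2} > \lambda/2$ can be read off directly):

```python
import numpy as np

# Sanity check of |f'(x)| = Theta(|x - x*|^(k-1)) for f(x) = (lam/2)|x|^k,
# whose minimizer is x* = 0 and whose derivative is
# f'(x) = (lam*k/2) |x|^(k-1) sign(x).
lam, k = 1.0, 3.0
fprime = lambda x: (lam * k / 2) * np.abs(x) ** (k - 1) * np.sign(x)

xs = np.linspace(-1, 1, 201)
lower = (lam / 2) * np.abs(xs) ** (k - 1)       # UC-style lower growth bound
upper = (lam * k / 2) * np.abs(xs) ** (k - 1)   # LkSS-style upper bound, Lambda = lam*k/2
ok = (np.all(np.abs(fprime(xs)) >= lower - 1e-12)
      and np.all(np.abs(fprime(xs)) <= upper + 1e-12))
```

Both bounds hold at every grid point, with the upper bound tight for this particular $f$.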
2.2 The One-Dimensional Argument

The basic argument for relating optimization to active learning was made in [2] in the context of stochastic first-order oracles, when the noise distribution $P(z)$ is unbiased and grows linearly around its zero mean, i.e.
$$\int_0^\infty dP(z) = \frac{1}{2} \quad \text{and} \quad \int_0^t dP(z) = \Theta(t) \ \text{ for all } 0 < t < t_0,$$
for some constant $t_0$ (and similarly for $-t_0 < t < 0$). This is satisfied by Gaussian, uniform and many other distributions. We reproduce the argument for clarity and then sketch it for stochastic sign oracles as well.

For any $x \in S$, it is clear that $f_{x,j}(\alpha) := f(x + \alpha e_j)$ is convex; its gradient $\nabla f_{x,j}(\alpha) := [\nabla f(x + \alpha e_j)]_j$ is an increasing function of $\alpha$ that switches signs at $\alpha^*_j := \arg\min_{\{\alpha \,:\, x + \alpha e_j \in S\}} f_{x,j}(\alpha)$, or equivalently at the directional minimum $x^*_j := x + \alpha^*_j e_j$. One can think of $\mathrm{sign}([\nabla f(x)]_j)$ as being the true label of $x$, $\mathrm{sign}([\nabla f(x)]_j + z)$ as being the observed label, and finding $x^*_j$ as learning the decision boundary (the point where labels switch signs). Define the regression function $\eta(x) := \Pr\big(\mathrm{sign}([\nabla f(x)]_j + z) = + \mid x\big)$ and note that minimizing $f_{x_0,j}$ corresponds to identifying the Bayes threshold classifier, because the point at which $\eta(x) = 0.5$, or $[\nabla f(x)]_j = 0$, is $x^*_j$. Consider a point $x = x^*_j + t e_j$ for $t > 0$, which has $[\nabla f(x)]_j > 0$ and hence true label $+$ (a similar argument can be made for $t < 0$). As discussed earlier, $\big|[\nabla f(x)]_j\big| = \Theta\big(\|x - x^*_j\|^{k-1}\big) = \Theta(t^{k-1})$. The probability of seeing label $+$ is the probability that we draw $z$ in $\big(-[\nabla f(x)]_j, \infty\big)$, so that the sign of $[\nabla f(x)]_j + z$ is still positive.
Hence, the regression function can be written as
$$\eta(x) = \Pr\big([\nabla f(x)]_j + z > 0\big) = \Pr(z > 0) + \Pr\big({-[\nabla f(x)]_j} < z < 0\big) = 0.5 + \Theta\big([\nabla f(x)]_j\big)$$
$$\Longrightarrow \ \Big|\eta(x) - \frac{1}{2}\Big| = \Theta\big(\big|[\nabla f(x)]_j\big|\big) = \Theta\big(t^{k-1}\big) = \Theta\big(|x - x^*_j|^{k-1}\big).$$
Hence, $\eta(x)$ satisfies the TNC with exponent $k$, and an active learning algorithm (next subsection) can be used to obtain a point $\hat{x}_T$ with small point error and excess risk. Note that the function error in convex optimization is bounded above by the excess risk of the corresponding active learner using eq. (1), because
$$f_j(\hat{x}_T) - f_j(x^*_j) = \Bigg|\int_{\hat{x}_T \wedge x^*_j}^{\hat{x}_T \vee x^*_j} [\nabla f(x)]_j \, dx\Bigg| = \Theta\Bigg(\int_{\hat{x}_T \wedge x^*_j}^{\hat{x}_T \vee x^*_j} |2\eta(x) - 1| \, dx\Bigg) = \Theta\big(R(\hat{x}_T)\big).$$
Similarly, for stochastic sign oracles (Sec. 1.2), using $\eta(x) = \Pr\big(\hat{s}_j(x) = +\big)$,
$$\Big|\eta(x) - \frac{1}{2}\Big| = \Theta\big(\big|[\nabla f(x)]_j\big|\big) = \Theta\big(\|x - x^*_j\|^{k-1}\big).$$

2.3 A Non-adaptive Active Threshold Learning Algorithm

One can use a grid-based probabilistic variant of binary search called the BZ algorithm [11] to approximately learn the threshold efficiently in the active setting, when $\eta(x)$ satisfies the TNC for known $k$, $\mu$, $M$ (it is not adaptive to the parameters of the problem; one needs to know these constants beforehand). The analysis of BZ and the proof of the following lemma are discussed in detail in Theorem 1 of [12], Theorem 2 of [9] and the Appendix of [2].

Lemma 1. Given a one-dimensional regression function that satisfies the TNC with known parameters $\mu, k$, then after $T$ queries, the BZ algorithm returns a point $\hat{t}$ such that $|\hat{t} - t| = \tilde{\Theta}\big(T^{-\frac{1}{2k-2}}\big)$ and the excess risk is $\tilde{\Theta}\big(T^{-\frac{k}{2k-2}}\big)$.

Due to the described connection between exponents, one can use BZ to approximately optimize a one-dimensional uniformly convex function $f_j$ with known uniform convexity parameters $\lambda, k$.
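For intuition, here is a simplified noisy binary search in the spirit of, but not identical to, the grid-based BZ algorithm: it takes a majority vote of repeated labels at each midpoint and recurses on the indicated side. All names, the vote-splitting schedule, and the specific label model are our illustrative choices, not the BZ procedure itself:

```python
import numpy as np

def noisy_bisection(label, lo, hi, T, rng=None):
    """Majority-vote bisection for a noisy threshold on [lo, hi].
    `label(x, rng)` returns +1/-1 with Pr[+1] > 1/2 below the threshold
    and < 1/2 above it. The budget T is split across ~log2(T) halvings."""
    rng = rng or np.random.default_rng(0)
    steps = max(1, int(np.log2(T)))    # number of interval halvings
    reps = max(1, T // steps)          # repeated queries per midpoint
    for _ in range(steps):
        mid = (lo + hi) / 2
        votes = sum(label(mid, rng) for _ in range(reps))
        if votes > 0:                  # majority +  =>  threshold lies to the right
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Labels satisfying a k = 2 style TNC around a hidden threshold t
# (hypothetical slope constant 2.0, probabilities clipped away from 0 and 1).
t = 0.3141
def label(x, rng):
    p = min(max(0.5 + 2.0 * (t - x), 0.05), 0.95)
    return 1 if rng.random() < p else -1

t_hat = noisy_bisection(label, 0.0, 1.0, T=20000)
```

Near the threshold the labels are nearly fair coins, so the achievable accuracy is limited by the vote budget per midpoint, which is the qualitative tradeoff the TNC exponent quantifies.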
Hence, the BZ algorithm can be used to find a point with low function error by searching for a point with low risk. This, when combined with Lemma 1, yields the following important result.

Lemma 2. Given a one-dimensional $k$-UC and LkSS function $f_j$, a line search to find $\hat{x}_T$ close to $x^*_j$ up to accuracy $|\hat{x}_T - x^*_j| \leq \eta$ in point error can be performed in $\tilde{\Theta}(1/\eta^{2k-2})$ steps using the BZ algorithm. Alternatively, in $T$ steps we can find $\hat{x}_T$ such that $f(\hat{x}_T) - f(x^*_j) = \tilde{\Theta}\big(T^{-\frac{k}{2k-2}}\big)$.

3 A 1-D Adaptive Active Threshold Learning Algorithm

We now describe an algorithm for active learning of one-dimensional thresholds that is adaptive, meaning it can achieve the minimax optimal rate even if the TNC parameters $M$, $\mu$, $k$ are unknown. It is quite different from the non-adaptive BZ algorithm in its flavour, though it can be regarded as a robust binary search procedure, and its design and proof are inspired by an optimization procedure from [10] that is adaptive to unknown UC parameters $\lambda, k$. Even though [10] considers a specific optimization algorithm (dual averaging), we observe that their scheme for adapting to unknown UC parameters can use any optimal convex optimization algorithm as a subroutine within each epoch. Similarly, our adaptive active learning algorithm is epoch-based and can use any optimal passive learning subroutine in each epoch. We note that [3] also developed an adaptive algorithm based on disagreement-coefficient and VC-dimension arguments, but it is in a pool-based setting where one has access to a large pool of unlabeled data, and is much more complicated.

3.1 An Optimal Passive Learning Subroutine

The excess risk of passive learning procedures for 1-d thresholds can be bounded by $O(T^{-1/2})$ (e.g.
see Alexander's inequality in [13] to avoid $\sqrt{\log T}$ factors from ERM/VC arguments), and this can be achieved while ignoring the TNC parameters.

Consider such a passive learning procedure under a uniform distribution of samples (mimicked by active learning by querying the domain uniformly) in a ball $B(x_0, R) := [x_0 - R, x_0 + R]$ around an arbitrary point $x_0$ of radius $R$ that is known to contain the true threshold $t$. Then, without knowledge of $M$, $\mu$, $k$, in $T$ steps we can get a point $\hat{x}_T$ close to the true threshold $t$ such that, with probability at least $1 - \delta$,
$$R(\hat{x}_T) - R(t) = \int_{\hat{x}_T \wedge t}^{\hat{x}_T \vee t} |2\eta(x) - 1| \, dx \leq \frac{C_\delta R}{\sqrt{T}}$$
for some constant $C_\delta$. Assuming $\hat{x}_T$ lies inside the TNC region,
$$\mu \int_{\hat{x}_T \wedge t}^{\hat{x}_T \vee t} |x - t|^{k-1} \, dx \leq \int_{\hat{x}_T \wedge t}^{\hat{x}_T \vee t} |2\eta(x) - 1| \, dx.$$
Hence $\frac{\mu |\hat{x}_T - t|^k}{k} \leq \frac{C_\delta R}{\sqrt{T}}$. Since $k^{1/k} \leq 2$, with probability at least $1 - \delta$ we get a point error
$$|\hat{x}_T - t| \leq 2\left(\frac{C_\delta R}{\mu\sqrt{T}}\right)^{1/k}. \qquad (2)$$
We may assume that $\hat{x}_T$ lies within the TNC region: since the interval where $|\eta(x) - \frac{1}{2}| \leq \epsilon_0$ has at least constant width ($|x - t| \leq \delta_0 = (\epsilon_0/M)^{\frac{1}{k-1}}$), it will only take a constant number of iterations to find a point within it. A formal way to argue this is to note that if the overall risk goes to zero like $\frac{C_\delta R}{\sqrt{T}}$, then the point cannot stay outside this constant-sized region of width $\delta_0$ where $|\eta(x) - 1/2| \leq \epsilon_0$, since it would accumulate a large constant risk of at least $\int_t^{t+\delta_0} \mu |x - t|^{k-1} \, dx = \frac{\mu \delta_0^k}{k}$. So as long as $T$ is larger than a constant $T_0 := \frac{C_\delta^2 R^2 k^2}{\mu^2 \delta_0^{2k}}$, our bound in eq. (2) holds with high probability (we can even assume we waste a constant number of queries just to get into the TNC region before using this algorithm).

3.2 Adaptive One-Dimensional Active Threshold Learner

Algorithm 1 is a generalized epoch-based binary search, in which we repeatedly perform passive learning within a halving search radius.
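One concrete instantiation of such a passive subroutine $P(x_0, R, N)$ is empirical risk minimization over threshold classifiers on uniform samples. The sketch below (hypothetical name `passive_threshold`) is one standard choice; the paper only requires some passive learner with $O(N^{-1/2})$ excess risk:

```python
import numpy as np

def passive_threshold(label, x0, R, N, rng=None):
    """Passive learner P(x0, R, N): draw N points uniformly from
    B(x0, R) = [x0 - R, x0 + R], observe noisy labels, and return the
    threshold minimizing empirical 0-1 risk among classifiers that
    predict + to the left of the threshold and - to the right."""
    rng = rng or np.random.default_rng(0)
    xs = np.sort(rng.uniform(x0 - R, x0 + R, size=N))
    ys = np.array([label(x, rng) for x in xs])
    # For a cut between xs[i-1] and xs[i]: mistakes are the -'s on the left
    # plus the +'s on the right; prefix/suffix sums give all N+1 cuts at once.
    left_minus = np.concatenate([[0], np.cumsum(ys == -1)])
    right_plus = np.concatenate([np.cumsum((ys == 1)[::-1])[::-1], [0]])
    i = int(np.argmin(left_minus + right_plus))
    cuts = np.concatenate([[x0 - R], (xs[:-1] + xs[1:]) / 2, [x0 + R]])
    return float(cuts[i])
```

Note that the subroutine never looks at $M$, $\mu$ or $k$; adaptivity in Algorithm 1 comes entirely from the epoch schedule wrapped around it.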
Let the number of epochs be $E := \log_2 \sqrt{\frac{2T}{C_{\tilde\delta}^2 \log T}} \leq \frac{\log T}{2}$ (if the constant $C_{\tilde\delta}^2 > 2$), and let $\tilde\delta := 2\delta/\log T \leq \delta/E$. Let the time budget per epoch be $N := T/E$ (the same for every epoch), and let the search radius in epoch $e \in \{1, \dots, E\}$ shrink as $R_e := 2^{-e+1} R$. Let us define the minimizer of the risk within the ball of radius $R_e$ centered around $x_{e-1}$ at epoch $e$ as
$$x^*_e = \arg\min\big\{R(x) : x \in S \cap B(x_{e-1}, R_e)\big\}.$$
Note that $x^*_e = t$ iff $t \in B(x_{e-1}, R_e)$, and it will be one end of the interval otherwise.

Algorithm 1: Adaptive Threshold Learner
Input: domain $S$ of diameter $R$, oracle budget $T$, confidence $\delta$
Black box: any optimal passive learning procedure $P(x, R, N)$ that outputs an estimated threshold in $B(x, R)$ using $N$ queries
Choose any $x_0 \in S$; set $R_1 = R$, $E = \log_2 \sqrt{\frac{2T}{C_{\tilde\delta}^2 \log T}}$, $N = \frac{T}{E}$
1: while $1 \leq e \leq E$ do
2: $\quad x_e \leftarrow P(x_{e-1}, R_e, N)$
3: $\quad R_{e+1} \leftarrow \frac{R_e}{2}$, $\ e \leftarrow e + 1$
4: end while
Output: $x_E$

Theorem 1. In the setting of one-dimensional active learning of thresholds, Algorithm 1 adaptively achieves $R(x_E) - R(t) = \tilde{O}\big(T^{-\frac{k}{2k-2}}\big)$ with probability at least $1 - \delta$ in $T$ queries when the unknown regression function $\eta(x)$ has unknown TNC parameters $\mu$, $k$.

Proof. Since we use an optimal passive learning subroutine at every epoch, we know that after each epoch $e$ we have, with probability at least $1 - \tilde\delta$,
$$R(x_e) - R(x^*_e) \leq \frac{C_{\tilde\delta} R_e}{\sqrt{T/E}} \leq C_{\tilde\delta} R_e \sqrt{\frac{\log T}{2T}} \qquad (3)$$
(by VC theory for threshold classifiers, or similar arguments in [13], $C_{\tilde\delta}^2 \sim \log(1/\tilde\delta) \sim \log\log T$ since $\tilde\delta \sim \delta/\log T$; we treat it as constant for clarity of exposition, but actually lose $\log\log T$ factors like the high-probability arguments in [14] and [2]). Since $\eta(x)$ satisfies the TNC (and is bounded above by 1), we have for all $x$
$$\mu |x - t|^{k-1} \leq |\eta(x) - 1/2| \leq 1.$$
If the set has diameter $R$, one of the endpoints must be at least $R/2$ away from $t$, and hence we get a limitation on the maximum value of $\mu$: $\mu \leq \frac{1}{(R/2)^{k-1}}$. Since $k \geq 2$ and $E \geq 2$, and $2^{-E} = C_{\tilde\delta}\sqrt{\frac{\log T}{2T}}$, using simple algebra we get
$$\mu \leq \frac{2^{(k-2)E+2}}{(R/2)^{k-1}} = 4 \,\cdot
\frac{2^{-E}\, 2^{(k-1)E}\, 2^{k-1}}{R^{k-1}} = \frac{4 \cdot 2^{-E}\, 2^{k-1}}{(2^{-E}R)^{k-1}} = \frac{4\, C_{\tilde\delta}\, 2^{k-1}}{R_{E+1}^{k-1}} \sqrt{\frac{\log T}{2T}}.$$
We prove that we will be appropriately close to $t$ after some epoch $e^*$ by doing a case analysis on $\mu$. When the true unknown $\mu$ is sufficiently small, i.e.
$$\mu \leq \frac{4\, C_{\tilde\delta}\, 2^{k-1}}{R_2^{k-1}} \sqrt{\frac{\log T}{2T}}, \qquad (4)$$
then we show that we will be done after $e^* = 1$. Otherwise, we will be done after epoch $2 \leq e^* \leq E$ if the true $\mu$ lies in the range
$$\frac{4\, C_{\tilde\delta}\, 2^{k-1}}{R_{e^*}^{k-1}} \sqrt{\frac{\log T}{2T}} \ \leq\ \mu\ \leq\ \frac{4\, C_{\tilde\delta}\, 2^{k-1}}{R_{e^*+1}^{k-1}} \sqrt{\frac{\log T}{2T}}. \qquad (5)$$
To see why we will be done, equations (4) and (5) imply
$$R_{e^*+1} \leq 2\left(\frac{8\, C_{\tilde\delta}^2 \log T}{\mu^2 T}\right)^{\frac{1}{2k-2}}$$
after epoch $e^*$, and plugging this into equation (3) with $R_{e^*} = 2 R_{e^*+1}$, we get
$$R(x_{e^*}) - R(x^*_{e^*}) \leq C_{\tilde\delta} R_{e^*} \left(\frac{\log T}{2T}\right)^{\frac{1}{2}} = O\left(\left(\frac{\log T}{T}\right)^{\frac{k}{2k-2}}\right). \qquad (6)$$
There are two issues hindering the completion of our proof. The first is that even though $x^*_1 = t$ to start off with, it might be the case that $x^*_{e^*}$ is far away from $t$, since we are chopping the radius in half at every epoch. Interestingly, in Lemma 3 we will prove that round $e^*$ is the last round up to which $x^*_e = t$. This would imply from eq. (6) that
$$R(x_{e^*}) - R(t) = \tilde{O}\big(T^{-\frac{k}{2k-2}}\big). \qquad (7)$$
Secondly, we might be concerned that after round $e^*$ we may move further away from $t$ in later epochs. However, we will show that since the radii are decreasing geometrically by half at every epoch, we cannot really wander too far away from $x_{e^*}$.
This will give us a bound (see Lemma 4) like
$$R(x_E) - R(x_{e^*}) = \tilde{O}\big(T^{-\frac{k}{2k-2}}\big). \qquad (8)$$
We will essentially prove that the final point $x_{e^*}$ of epoch $e^*$ is sufficiently close to the true optimum $t$, and that the final point $x_E$ of the algorithm is sufficiently close to $x_{e^*}$. Summing eq. (7) and eq. (8) yields our desired result.

Lemma 3. For all $e \leq e^*$, conditioned on having $x^*_{e-1} = t$, with probability $1 - \tilde\delta$ we have $x^*_e = t$. In other words, up to epoch $e^*$, the optimal classifier in the domain of each epoch is the true threshold with high probability.

Proof. $x^*_e = t$ will hold in epoch $e$ if the first point $x_{e-1}$ of the epoch is such that the ball of radius $R_e$ around it actually contains $t$, or mathematically, if $|x_{e-1} - t| \leq R_e$. This is trivially satisfied for $e = 1$, and assuming it is true for epoch $e - 1$, we will show by induction that it holds for epoch $e \leq e^*$ w.p. $1 - \tilde\delta$. Notice that using equation (2), conditioned on the induction going through in previous rounds ($t$ being within the search radius), after the completion of round $e - 1$ we have, with probability $1 - \tilde\delta$,
$$|x_{e-1} - t| \leq 2\left[\frac{C_{\tilde\delta} R_{e-1}}{\mu\sqrt{T/E}}\right]^{1/k}.$$
If this were upper bounded by $R_e$, the induction would go through. So what we would really like to show is that $2\left(\frac{C_{\tilde\delta} R_{e-1}}{\mu\sqrt{T/E}}\right)^{\frac{1}{k}} \leq R_e$. Since $R_{e-1} = 2 R_e$, we effectively want to show $\frac{2^k C_{\tilde\delta}\, 2 R_e}{\mu}\sqrt{\frac{E}{T}} \leq R_e^k$, or equivalently, that for all $e \leq e^*$ we have $\frac{4\, C_{\tilde\delta}\, 2^{k-1}}{R_e^{k-1}}\sqrt{\frac{E}{T}} \leq \mu$. Since $E \leq \frac{\log T}{2}$, we would be achieving something stronger if we showed
$$\frac{4\, C_{\tilde\delta}\, 2^{k-1}}{R_e^{k-1}}\sqrt{\frac{\log T}{2T}} \leq \mu,$$
which is known to be true for every epoch up to $e^*$ by equation (5).

Lemma 4. For all $e^* < e \leq E$,
$$R(x_e) - R(x_{e^*}) \leq \frac{C_{\tilde\delta} R_{e^*}}{\sqrt{T/E}} = \tilde{O}\big(T^{-\frac{k}{2k-2}}\big) \quad \text{w.p.}
$1 - \tilde\delta$; i.e., after epoch $e^*$, we cannot deviate much from where we ended epoch $e^*$.

Proof. For $e > e^*$, we have with probability at least $1 - \tilde\delta$
$$R(x_e) - R(x_{e-1}) \leq R(x_e) - R(x^*_e) \leq \frac{C_{\tilde\delta} R_e}{\sqrt{T/E}},$$
and hence even for the final epoch $E$, we have with probability $(1 - \tilde\delta)^{E - e^*}$
$$R(x_E) - R(x_{e^*}) = \sum_{e = e^*+1}^{E} \big[R(x_e) - R(x_{e-1})\big] \leq \sum_{e = e^*+1}^{E} \frac{C_{\tilde\delta} R_e}{\sqrt{T/E}}.$$
Since the radii are halving in size, this is upper bounded (as in equation (6)) by
$$\frac{C_{\tilde\delta} R_{e^*}}{\sqrt{T/E}}\left[\frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \cdots\right] \leq \frac{C_{\tilde\delta} R_{e^*}}{\sqrt{T/E}} = \tilde{O}\big(T^{-\frac{k}{2k-2}}\big).$$
These lemmas justify the use of equations (7) and (8), whose sum yields our desired result. Notice that the overall probability of success is at least $(1 - \tilde\delta)^E \geq 1 - \delta$, concluding the proof of the theorem.

4 Randomized Stochastic-Sign Coordinate Descent

We now describe an algorithm that can perform stochastic optimization of $k$-UC and LkSS functions in $d > 1$ dimensions when given access to a stochastic sign oracle and a black-box 1-D active learning algorithm, such as our adaptive scheme from the previous section, as a subroutine. The procedure is well known in the literature, but the idea that one only needs noisy gradient signs to perform minimization optimally, and that one can use active learning as a line-search procedure, is novel to the best of our knowledge. The idea is to simply perform random coordinate-wise descent with approximate line search, where the subroutine for line search is an optimal active threshold learning algorithm that is used to approach the minimum of the function along the chosen direction.
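Putting the pieces together, here is an end-to-end Python sketch of this scheme. It is our simplification, not the paper's exact procedure: a majority-vote bisection stands in for the optimal active-learning line search, and the epoch count is smaller than the paper's $d(\log T)^2$ to keep the demo fast:

```python
import numpy as np

def sign_coord_descent(sign_oracle, x0, R, T, d, rng=None):
    """Randomized coordinate descent driven only by noisy gradient signs:
    each epoch picks a random coordinate and runs a majority-vote bisection
    line search, querying sign_oracle(y, j, rng) in {+1, -1}."""
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    E = max(1, d * int(np.log(T)))   # epochs (paper: d (log T)^2)
    n = max(1, T // E)               # oracle budget per line search
    steps = max(1, int(np.log2(n)))  # interval halvings per line search
    reps = max(1, n // steps)        # votes per midpoint
    for _ in range(E):
        j = int(rng.integers(d))     # random coordinate direction
        lo, hi = x[j] - R, x[j] + R
        for _ in range(steps):
            mid = (lo + hi) / 2
            y = x.copy(); y[j] = mid
            votes = sum(sign_oracle(y, j, rng) for _ in range(reps))
            if votes > 0:            # derivative positive => minimum to the left
                hi = mid
            else:
                lo = mid
        x[j] = (lo + hi) / 2
    return x

# Toy problem: f(x) = 0.5 ||x - c||^2, so [grad f(x)]_j = (x - c)_j; the
# oracle only reveals a Gaussian-corrupted sign of that coordinate.
c = np.array([0.5, -0.3])
def oracle(y, j, rng):
    return 1 if (y - c)[j] + 0.3 * rng.standard_normal() > 0 else -1

x_hat = sign_coord_descent(oracle, [0.0, 0.0], R=1.0, T=100000, d=2)
```

Despite never seeing a real-valued gradient, the iterate approaches the minimizer, illustrating the claim that noisy gradient signs suffice.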
Let the gradient at epoch $e$ be called $\nabla_{e-1} = \nabla f(x_{e-1})$, let the direction of descent $d_e$ be a unit coordinate vector chosen uniformly at random from $\{e_1, \dots, e_d\}$, and let our step size from $x_{e-1}$ be $\alpha_e$ (determined by active learning), so that our next point is $x_e := x_{e-1} + \alpha_e d_e$. Assume, for analysis, that the optimum of $f_e(\alpha) := f(x_{e-1} + \alpha d_e)$ is $\alpha^*_e := \arg\min_\alpha f(x_{e-1} + \alpha d_e)$ and $x^*_e := x_{e-1} + \alpha^*_e d_e$, where (due to optimality) the derivative is
$$\nabla f_e(\alpha^*_e) = 0 = \nabla f(x^*_e)^\top d_e. \qquad (9)$$
The line search to find the $\alpha_e$ and $x_e$ that approximate the minimum $x^*_e$ can be accomplished by any optimal active learning algorithm, once we fix the number of time steps per line search.

4.1 Analysis of Algorithm 2

Algorithm 2: Randomized Stochastic-Sign Coordinate Descent
Input: set $S$ of diameter $R$, query budget $T$
Oracle: stochastic sign oracle $O_f(x, j)$ returning a noisy $\mathrm{sign}\big([\nabla f(x)]_j\big)$
Black box: algorithm $LS(x, d, n)$, a line search from $x$ in direction $d$ for $n$ steps
Choose any $x_0 \in S$; set $E = d(\log T)^2$
1: while $1 \leq e \leq E$ do
2: $\quad$ Choose a unit coordinate vector $d_e$ from $\{e_1, \dots, e_d\}$ uniformly at random
3: $\quad x_e \leftarrow LS(x_{e-1}, d_e, T/E)$ using $O_f$
4: $\quad e \leftarrow e + 1$
5: end while
Output: $x_E$

Let the number of epochs be $E = d(\log T)^2$, so the number of time steps per epoch is $T/E$. We can do a line search from $x_{e-1}$ to get an $x_e$ that approximates $x^*_e$ well in function error in $T/E = \tilde{O}(T)$ steps using an active learning subroutine; let the resulting function error be denoted by $\epsilon' = \tilde{O}\big(T^{-\frac{k}{2k-2}}\big)$:
$$f(x_e) \leq f(x^*_e) + \epsilon'.$$
Also, LkSS and UC allow us to infer (for $k^* = \frac{k}{k-1}$, i.e.
$1/k + 1/k^* = 1$)
$$f(x_{e-1}) - f(x_e^*) \ge \frac{\lambda}{2} \| x_{e-1} - x_e^* \|^k \ge \frac{\lambda}{2 \Lambda^{k^*}} \big| \nabla_{e-1}^\top d_e \big|^{k^*}.$$
Eliminating $f(x_e^*)$ from the above equations, subtracting $f(x^*)$ from both sides, denoting $\Delta_e := f(x_e) - f(x^*)$ and taking expectations,
$$\mathbb{E}[\Delta_e] \le \mathbb{E}[\Delta_{e-1}] - \frac{\lambda}{2 \Lambda^{k^*}} \mathbb{E}\Big[ \big| \nabla_{e-1}^\top d_e \big|^{k^*} \Big] + \epsilon'.$$
Since (using $k \ge 2 \Rightarrow 1 \le k^* \le 2 \Rightarrow \| \cdot \|_{k^*} \ge \| \cdot \|_2$)
$$\mathbb{E}\Big[ \big| \nabla_{e-1}^\top d_e \big|^{k^*} \,\Big|\, d_1, \dots, d_{e-1} \Big] = \frac{1}{d} \| \nabla_{e-1} \|_{k^*}^{k^*} \ge \frac{1}{d} \| \nabla_{e-1} \|^{k^*},$$
we get
$$\mathbb{E}[\Delta_e] \le \mathbb{E}[\Delta_{e-1}] - \frac{\lambda}{2 d \Lambda^{k^*}} \mathbb{E}\big[ \| \nabla_{e-1} \|^{k^*} \big] + \epsilon'.$$
By convexity, Cauchy-Schwarz and UC (since $\Delta_{e-1}^k \le [\nabla_{e-1}^\top (x_{e-1} - x^*)]^k \le \| \nabla_{e-1} \|^k \| x_{e-1} - x^* \|^k \le \| \nabla_{e-1} \|^k \frac{2}{\lambda} \Delta_{e-1}$), we have $\| \nabla_{e-1} \|^{k^*} \ge \big( \frac{\lambda}{2} \big)^{\frac{1}{k-1}} \Delta_{e-1}$, so
$$\mathbb{E}[\Delta_e] \le \mathbb{E}[\Delta_{e-1}] \left( 1 - \frac{1}{d} \Big( \frac{\lambda}{2 \Lambda} \Big)^{k^*} \right) + \epsilon'.$$
Defining $C := \frac{1}{d} \big( \frac{\lambda}{2 \Lambda} \big)^{k^*}$, which is less than 1 since $1 < k^* \le 2$ and $\Lambda > \lambda/2$, we get the recurrence
$$\mathbb{E}[\Delta_e] - \frac{\epsilon'}{C} \le (1 - C) \left( \mathbb{E}[\Delta_{e-1}] - \frac{\epsilon'}{C} \right).$$
Since $E = d (\log T)^2$ and $\Delta_0 \le L \| x_0 - x^* \| \le L R$, after the last epoch we have
$$\mathbb{E}[\Delta_E] - \frac{\epsilon'}{C} \le (1 - C)^E \left( \Delta_0 - \frac{\epsilon'}{C} \right) \le \exp\big( -C d (\log T)^2 \big) \Delta_0 \le L R \, T^{-C d \log T}.$$
As long as $T > \exp\big( (2\Lambda/\lambda)^{k^*} \big)$, a constant, we have $C d \log T \ge 1$ and
$$\mathbb{E}[\Delta_E] = O(\epsilon') + o(T^{-1}) = \tilde{O}\big( T^{-\frac{k}{2k-2}} \big),$$
which is the desired result. Notice that in this section we did not need to know $\lambda, \Lambda, k$, because we simply run randomized coordinate descent for $E = d (\log T)^2$ epochs with $T/E$ steps per subroutine, and the active learning subroutine was itself adaptive to the appropriately calculated TNC parameters. In summary:

Theorem 2. Given access to only noisy gradient sign information from a stochastic sign oracle, Randomized Stochastic-Sign Coordinate Descent can minimize UC and LkSS functions at the minimax optimal convergence rate for expected function error of $\tilde{O}\big( T^{-\frac{k}{2k-2}} \big)$, adaptive to all unknown convexity and smoothness parameters.
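The recurrence above contracts the error geometrically toward the fixed point $\epsilon'/C$. A quick numeric check of that contraction, with placeholder values for $C$, $\epsilon'$, $\Delta_0$ and $E$ (not constants from the analysis):

```python
# Iterate the worst case of E[Delta_e] <= (1 - C) E[Delta_{e-1}] + eps' and
# verify it decays geometrically to the fixed point eps'/C.
C, eps_prime = 0.05, 1e-4   # placeholder contraction factor and line-search error
delta0 = 10.0               # placeholder initial error, Delta_0 <= L R
E = 500                     # placeholder number of epochs

delta = delta0
for _ in range(E):
    delta = (1 - C) * delta + eps_prime

# Closed form of the recurrence at equality:
fixed_point = eps_prime / C
assert abs((delta - fixed_point) - (1 - C) ** E * (delta0 - fixed_point)) < 1e-9
```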
As a special case, for $k = 2$, strongly convex and strongly smooth functions can be minimized at the rate $\tilde{O}(1/T)$.

4.2 Gradient Sign-Preserving Computations

A practical concern when implementing optimization algorithms is machine precision, the number of decimals to which real numbers are stored. Finite space may limit the accuracy with which each gradient can be stored, and one may ask how much these inaccuracies affect the final convergence rate: how is the query complexity of optimization affected if the true gradients were rounded to one or two decimal points? If the gradients were randomly rounded (so as to remain unbiased), one might guess that we could easily achieve stochastic first-order optimization rates. However, our results give a surprising answer to that question, as a similar argument reveals that for UC and LkSS functions (with strongly convex and strongly smooth being a special case), our algorithm achieves exponential rates. Since rounding errors do not flip any sign in the gradient, even if the gradient were rounded or decimal points were dropped as much as possible, and we were to return only a single bit per coordinate carrying the true sign, one can still achieve the exponentially fast convergence rate observed in non-stochastic settings: our algorithm needs only a logarithmic number of epochs, and in each epoch active learning approaches the directional minimum exponentially fast with noiseless gradient signs, using a perfect binary search. In fact, our algorithm is the natural higher-dimensional generalization of binary search, in both the deterministic and stochastic settings. We can summarize this in the following theorem:

Theorem 3.
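The noiseless 1-D case discussed above is just binary search on the derivative sign. A small sketch (with our illustrative names; `sign_bit` plays the role of a maximally compressed, sign-preserving gradient): even if the oracle returns only one bit, the true sign of $f'(x)$, the bracketing interval halves per query and the error decays exponentially.

```python
def binary_search_min(dsign, lo, hi, n):
    """Find the minimizer of a convex 1-D function from derivative signs alone.
    The bracketing interval halves on every query."""
    for _ in range(n):
        mid = 0.5 * (lo + hi)
        if dsign(mid) > 0:
            hi = mid   # derivative positive: minimum is to the left of mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Oracle returns a single bit: the true sign of f'(x) for f(x) = (x - 0.3)^2,
# as if the real-valued gradient 2(x - 0.3) were compressed down to its sign.
sign_bit = lambda x: 1 if 2 * (x - 0.3) > 0 else -1
x_hat = binary_search_min(sign_bit, -1.0, 1.0, 50)
# |x_hat - 0.3| <= 2 / 2^50: exponentially small in the number of queries
```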
Given access to gradient signs in the presence of sign-preserving noise (such as deterministic or random rounding of gradients, dropping decimal places for lower precision, etc.), Randomized Stochastic-Sign Coordinate Descent can minimize UC and LkSS functions exponentially fast, with a function error convergence rate of $\tilde{O}(\exp\{-T\})$.

5 Discussion

While the assumption of smoothness is natural for strongly convex functions, our assumption of LkSS might appear strong in general. It is possible to relax this assumption and require the LkSS exponent to differ from the UC exponent, or to assume only strong smoothness; this still yields consistency for our algorithm, but the rate achieved is worse. [10] and [2] both have epoch-based algorithms that achieve the minimax rates under just Lipschitz assumptions with access to a full-gradient stochastic first-order oracle, but it is hard to prove the same rates for a coordinate descent procedure without smoothness assumptions. Given a target function accuracy $\epsilon$ instead of a query budget $T$, a randomized coordinate descent procedure similar to ours achieves the minimax rate with a similar proof, but it is non-adaptive, since we presently do not have an adaptive active learning procedure when given $\epsilon$. As of now, we know of no adaptive UC optimization procedure when given $\epsilon$. Recently, [15] analysed stochastic gradient descent with averaging and showed that, for smooth functions, it is possible for an algorithm to automatically adapt between convexity and strong convexity; in comparison, we show how to adapt to unknown uniform convexity (strong convexity being the special case $k = 2$). It may be possible to combine the ideas from this paper and [15] to get a universally adaptive algorithm from convex to all degrees of uniform convexity.
It would also be interesting to see whether these ideas extend to connections between convex optimization and learning linear threshold functions. In this paper, we exploit recently discovered theoretical connections by providing explicit algorithms that take advantage of them. We show how these could lead to cross-fertilization between the two fields in both directions, and we hope that this is just the beginning of a flourishing interaction in which these insights lead to many new algorithms as we leverage the theoretical relations in more innovative ways.

References

[1] Raginsky, M., Rakhlin, A.: Information complexity of black-box convex optimization: A new look via feedback information theory. In: 47th Annual Allerton Conference on Communication, Control, and Computing (2009)
[2] Ramdas, A., Singh, A.: Optimal rates for stochastic convex optimization under Tsybakov noise condition. Intl. Conference on Machine Learning (ICML) (2013)
[3] Hanneke, S.: Rates of convergence in active learning. The Annals of Statistics 39(1) (2011) 333–361
[4] Nemirovski, A., Yudin, D.: Problem complexity and method efficiency in optimization. John Wiley & Sons (1983)
[5] Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. CORE discussion papers 2 (2010)
[6] Jamieson, K., Nowak, R., Recht, B.: Query complexity of derivative-free optimization. Advances in Neural Information Processing Systems (NIPS) (2012)
[7] Tsybakov, A.: Optimal aggregation of classifiers in statistical learning. The Annals of Statistics 32(1) (2004) 135–166
[8] Audibert, J.Y., Tsybakov, A.B.: Fast learning rates for plug-in classifiers. Annals of Statistics 35(2) (2007) 608–633
[9] Castro, R., Nowak, R.: Minimax bounds for active learning.
In: Proceedings of the 20th Annual Conference on Learning Theory, Springer-Verlag (2007) 5–19
[10] Iouditski, A., Nesterov, Y.: Primal-dual subgradient methods for minimizing uniformly convex functions. Universite Joseph Fourier, Grenoble, France (2010)
[11] Burnashev, M., Zigangirov, K.: An interval estimation problem for controlled observations. Problemy Peredachi Informatsii 10(3) (1974) 51–61
[12] Castro, R., Nowak, R.: Active sensing and learning. Foundations and Applications of Sensor Management (2009) 177–200
[13] Devroye, L., Györfi, L., Lugosi, G.: A probabilistic theory of pattern recognition. Volume 31. Springer (1996)
[14] Hazan, E., Kale, S.: Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In: Proceedings of the 23rd Annual Conference on Learning Theory (2011)
[15] Bach, F., Moulines, E.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Advances in Neural Information Processing Systems (NIPS) (2011)
