Algorithmic Connections Between Active Learning and Stochastic Convex Optimization


Authors: Aaditya Ramdas, Aarti Singh

Aaditya Ramdas (Machine Learning Department, Carnegie Mellon University, aramdas@cs.cmu.edu) and Aarti Singh (Machine Learning Department, Carnegie Mellon University, aarti@cs.cmu.edu)

September 24, 2018

Abstract

Interesting theoretical associations have been established by recent papers between the fields of active learning and stochastic convex optimization due to the common role of feedback in sequential querying mechanisms. In this paper, we continue this thread in two parts by exploiting these relations for the first time to yield novel algorithms in both fields, further motivating the study of their intersection. First, inspired by a recent optimization algorithm that was adaptive to unknown uniform convexity parameters, we present a new active learning algorithm for one-dimensional thresholds that can yield minimax rates by adapting to unknown noise parameters. Next, we show that one can perform $d$-dimensional stochastic minimization of smooth uniformly convex functions when only granted oracle access to noisy gradient signs along any coordinate instead of real-valued gradients, by using a simple randomized coordinate descent procedure where each line search can be solved by one-dimensional active learning, provably achieving the same error convergence rate as having the entire real-valued gradient. Combining these two parts yields an algorithm that solves stochastic convex optimization of uniformly convex and smooth functions using only noisy gradient signs by repeatedly performing active learning, achieves optimal rates and is adaptive to all unknown convexity and smoothness parameters.

1 Introduction

The two fields of convex optimization and active learning seem to have evolved quite independently of each other.
Recently, [1] pointed out their relatedness due to the inherent sequential nature of both fields and the complex role of feedback in taking future actions. Following that, [2] made the connections more explicit by tying together the exponent used in noise conditions in active learning and the exponent used in uniform convexity (UC) in optimization. They used this to establish lower bounds (and tight upper bounds) in stochastic optimization of UC functions based on proof techniques from active learning. However, it was unclear whether there were concrete algorithmic ideas in common between the fields.

Here, we provide a positive answer by exploiting the aforementioned connections to form new and interesting algorithms that clearly demonstrate that the complexity of $d$-dimensional stochastic optimization is precisely the complexity of one-dimensional active learning. Inspired by an optimization algorithm that was adaptive to unknown uniform convexity parameters, we design an interesting one-dimensional active learner that is also adaptive to unknown noise parameters. This algorithm is simpler than the adaptive active learning algorithm proposed recently in [3], which handles the pool-based active learning setting. Given access to this active learner as a subroutine for line search, we show that a simple randomized coordinate descent procedure can minimize uniformly convex functions with a much simpler stochastic oracle that returns only a Bernoulli random variable representing a noisy sign of the gradient in a single coordinate direction, rather than a full-dimensional real-valued gradient vector. The resulting algorithm is adaptive to all unknown UC and smoothness parameters and achieves minimax optimal convergence rates.
We spend the first two sections describing the problem setup and preliminary insights, before describing our algorithms in Sections 3 and 4.

1.1 Setup of First-Order Stochastic Convex Optimization

First-order stochastic convex optimization is the task of approximately minimizing a convex function over a convex set, given oracle access to unbiased estimates of the function and gradient at any point, using as few queries as possible [4]. We will assume that we are given an arbitrary set $S \subset \mathbb{R}^d$ of known diameter bound $R = \max_{x,y \in S} \|x - y\|$. A convex function $f$ with $x^* = \arg\min_{x \in S} f(x)$ is said to be $k$-uniformly convex if, for some $\lambda > 0$, $k \geq 2$, we have for all $x, y \in S$
$$f(y) \geq f(x) + \nabla f(x)^\top (y - x) + \frac{\lambda}{2} \|x - y\|^k$$
(strong convexity arises when $k = 2$). $f$ is $L$-Lipschitz for some $L > 0$ if $\|\nabla f(x)\|_* \leq L$ (where $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$); equivalently, for all $x, y \in S$,
$$|f(x) - f(y)| \leq L \|x - y\|.$$
A differentiable $f$ is $H$-strongly smooth (or has an $H$-Lipschitz gradient) for some $H > \lambda$ if for all $x, y \in S$ we have $\|\nabla f(x) - \nabla f(y)\|_* \leq H \|x - y\|$, or equivalently
$$f(y) \leq f(x) + \nabla f(x)^\top (y - x) + \frac{H}{2} \|x - y\|^2.$$
In this paper we shall always assume $\|\cdot\| = \|\cdot\|_* = \|\cdot\|_2$ and deal with strongly smooth and uniformly convex functions with parameters $\lambda > 0$, $k \geq 2$, $L, H > 0$. A stochastic first-order oracle is a function that accepts $x \in S$ and returns $\big(\hat{f}(x), \hat{g}(x)\big) \in \mathbb{R}^{d+1}$, where $\mathbb{E}[\hat{f}(x)] = f(x)$ and $\mathbb{E}[\hat{g}(x)] = \nabla f(x)$ (these unbiased estimates also have bounded variance) and the expectation is over any internal randomness of the oracle.
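To make the oracle model concrete, here is a minimal Python sketch (not from the paper; the wrapper `make_first_order_oracle`, the Gaussian noise model and the constant `sigma` are our illustrative choices) of a stochastic first-order oracle around a known test function:

```python
import numpy as np

def make_first_order_oracle(f, grad, sigma=0.1, rng=None):
    """Wrap an exact f and its gradient into a stochastic first-order oracle:
    each query at x returns unbiased, bounded-variance estimates of f(x)
    and grad f(x), with independent Gaussian noise (our modeling choice)."""
    rng = rng or np.random.default_rng(0)
    def oracle(x):
        x = np.asarray(x, dtype=float)
        f_hat = f(x) + sigma * rng.standard_normal()         # E[f_hat] = f(x)
        g_hat = grad(x) + sigma * rng.standard_normal(x.shape)  # E[g_hat] = grad f(x)
        return f_hat, g_hat
    return oracle

# Example: f(x) = 0.5 ||x||^2 is strongly convex (k = 2) with grad f(x) = x.
oracle = make_first_order_oracle(lambda x: 0.5 * x @ x, lambda x: x)
```

Averaging repeated queries at a fixed point recovers the true function value and gradient, which is exactly the unbiasedness the definition above requires.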
An optimization algorithm is a method that sequentially queries an oracle at points in $S$ and returns $\hat{x}_T$ as an estimate of the optimum of $f$ after $T$ queries (or alternatively tries to achieve an error of $\epsilon$); its performance can be measured by either the function error $f(\hat{x}_T) - f(x^*)$ or the point error $\|\hat{x}_T - x^*\|$.

1.2 Stochastic Gradient-Sign Oracles

Define a stochastic sign oracle to be a function of $x \in S$, $j \in \{1, \dots, d\}$, that returns $\hat{s}_j(x) \in \{+, -\}$, where
$$\big|\eta(x) - 0.5\big| = \Theta\big(\big|[\nabla f(x)]_j\big|\big) \quad \text{and} \quad \eta(x) = \Pr\big(\hat{s}_j(x) = + \mid x\big),$$
where $\hat{s}_j(x)$ is a noisy sign of $[\nabla f(x)]_j$, $[\nabla f(x)]_j$ is the $j$-th coordinate of $\nabla f$, and the probability is over any internal randomness of the oracle. This behavior of $\eta(x)$ actually needs to hold only when $\big|[\nabla f(x)]_j\big|$ is small.

In this paper, we consider coordinate descent algorithms that are motivated by applications where computing the overall gradient, or even a function value, can be expensive due to high dimensionality or huge amounts of data, but computing the gradient in any one coordinate can be cheap. [5] mentions the example of $\min_x \frac{1}{2}\|Ax - b\|^2 + \frac{1}{2}\|x\|^2$ for some $n \times d$ matrix $A$ (or any other regularization that decomposes over dimensions). Computing the gradient $A^\top(Ax - b) + x$ is expensive because of the matrix-vector multiply. However, its $j$-th coordinate is $A_j^\top(Ax - b) + x_j$ and requires an expense of only $n$ if the residual vector $Ax - b$ is kept track of (this is easy to do, since on a single coordinate update of $x$, the residual change is proportional to $A_j$, an additional expense of $n$).

A sign oracle is weaker than a first-order oracle, and can actually be obtained by returning the sign of the first-order oracle's noisy gradient if the mass of the noise distribution grows linearly around its zero mean (argued in the next section).
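The residual-tracking trick in the example above can be sketched in a few lines of Python (the helper `coord_step` and the step size `alpha` are our illustrative names; the paper only makes the cost argument):

```python
import numpy as np

def coord_step(A, b, x, r, j, alpha):
    """One coordinate-descent update for f(x) = 0.5||Ax - b||^2 + 0.5||x||^2,
    keeping the residual r = Ax - b up to date.

    The j-th gradient coordinate is A[:, j] @ r + x[j]: a single O(n) dot
    product, versus O(n d) for the full gradient A.T @ r + x."""
    g_j = A[:, j] @ r + x[j]           # j-th coordinate of the gradient, cost n
    x = x.copy()
    x[j] -= alpha * g_j                # move along coordinate j only
    r = r + (-alpha * g_j) * A[:, j]   # residual change is proportional to A[:, j], cost n
    return x, r
```

After every update the invariant `r == A @ x - b` still holds, so each subsequent coordinate gradient again costs only $O(n)$.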
At the optimum along coordinate $j$, the oracle returns $\pm 1$ with equal probability, and otherwise returns the correct sign with a probability proportional to the value of the directional derivative at that point (this reflects the fact that the larger the derivative's absolute value, the easier it would be for the oracle to approximate its sign, hence the smaller the probability of error). It is not unreasonable that there may be other circumstances where even calculating the (real-valued) gradient in the $j$-th direction could be expensive, but estimating its sign could be a much easier task, as it only requires estimating whether function values are expected to increase or decrease along a coordinate (in a similar spirit to function comparison oracles [6], but with slightly more power). We will also see that the rates for optimization crucially depend on whether the gradient noise is sign-preserving or not. For instance, with rounding errors or floats stored at small precision, one can get deterministic rates as if we had the exact gradient, since the rounding or lower precision doesn't flip signs.

1.3 Setup of Active Threshold Learning

The problem of one-dimensional threshold estimation assumes we have an interval of length $R$, say $[0, R]$. Given a point $x$, it has a label $y \in \{+, -\}$ drawn from an unknown conditional distribution $\eta(x) = \Pr(Y = + \mid X = x)$, and the threshold $t$ is the unique point where $\eta(x) = 1/2$, with $\eta$ being larger than half on one side of $t$ and smaller than half on the other (hence it is more likely to draw a $+$ on one side of $t$ and a $-$ on the other side).
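The behavior just described can be simulated directly: with additive Gaussian gradient noise (our illustrative construction; `sign_oracle` and `plus_freq` are hypothetical names), the probability of a $+$ label is exactly $1/2$ at a zero derivative and moves away from $1/2$ proportionally to the derivative when it is small:

```python
import numpy as np

def sign_oracle(grad_j, sigma=1.0, rng=None):
    """A stochastic sign oracle built from a noisy gradient coordinate:
    returns sign([grad f(x)]_j + z) in {+1, -1} with z ~ N(0, sigma^2).
    Then Pr[+1] = Phi(grad_j / sigma): exactly 1/2 when grad_j = 0, and
    deviating from 1/2 proportionally to grad_j for small grad_j."""
    rng = rng or np.random.default_rng()
    return 1 if grad_j + sigma * rng.standard_normal() > 0 else -1

rng = np.random.default_rng(0)
def plus_freq(g, n=50000):
    """Empirical frequency of + labels at a given gradient value."""
    return sum(sign_oracle(g, rng=rng) == 1 for _ in range(n)) / n
```

At `g = 0` the oracle is a fair coin; at `g = 0.5` (with `sigma = 1`) roughly 69% of labels are $+$, i.e. the correct sign is returned with probability bounded away from $1/2$.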
The task of active learning of threshold classifiers allows the learner to sequentially query $T$ (possibly dependent) points, observing labels drawn from the unknown conditional distribution after each query, with the goal of returning a guess $\hat{x}_T$ as close to $t$ as possible. In the formal study of classification (cf. [7]), it is common to study minimax rates when the regression function $\eta(x)$ satisfies Tsybakov's noise or margin condition (TNC) with exponent $k$ at the threshold $t$. Different versions of this boundary noise condition are used in regression, density or level-set estimation, and lead to an improvement in minimax optimal rates (for classification, also cf. [8], [3]). Here, we present the version of TNC used in [9]:
$$M|x - t|^{k-1} \geq |\eta(x) - 1/2| \geq \mu |x - t|^{k-1} \quad \text{whenever } |\eta(x) - 1/2| \leq \epsilon_0,$$
for some constants $M > \mu > 0$, $\epsilon_0 > 0$, $k \geq 1$. (Throughout, $f = \Theta(g)$ means $f = \Omega(g)$ and $f = O(g)$. Note that $|x - t| \leq \delta_0 := (\epsilon_0/M)^{\frac{1}{k-1}} \Rightarrow |\eta(x) - 1/2| \leq \epsilon_0 \Rightarrow |x - t| \leq (\epsilon_0/\mu)^{\frac{1}{k-1}}$.)

A standard measure of how well a classifier $h$ performs is given by its risk, which is simply the probability of classification error (expectation under 0-1 loss), $R(h) = \Pr\big(h(x) \neq y\big)$. The performance of threshold learning strategies can be measured by the excess classification risk of the resulting threshold classifier at $\hat{x}_T$ compared to the Bayes optimal classifier at $t$, as given by
$$R(\hat{x}_T) - R(t) = \int_{\hat{x}_T \wedge t}^{\hat{x}_T \vee t} |2\eta(x) - 1| \, dx \qquad (1)$$
(where $a \vee b := \max(a, b)$ and $a \wedge b := \min(a, b)$). In the above expression, akin to [9], we use a uniform marginal distribution for active learning since there is no underlying distribution over $x$. Alternatively, one can simply measure the one-dimensional point error $|\hat{x}_T - t|$ in estimation of the threshold. Minimax rates for the risk and point error in active learning under the TNC were provided in [9] and are summarized in the next section.
1.4 Summary of Contributions

Now that we have introduced the notation used in our paper and some relevant previous work (more in the next section), we can clearly state our contributions.

• We generalize an idea from [10] to present a simple epoch-based active learning algorithm with a passive learning subroutine that can optimally learn one-dimensional thresholds and is adaptive to unknown noise parameters.

• We show that noisy gradient signs suffice for minimization of uniformly convex functions by proving that a random coordinate descent algorithm with an active learning line-search subroutine achieves minimax convergence rates.

• Due to the connection between the relevant exponents in the two fields, we can combine the above two methods to get an algorithm that achieves minimax optimal rates and is adaptive to unknown convexity parameters.

• As a corollary, we argue that with access to possibly noisy non-exact gradients that don't switch any signs (rounding errors or low-precision storage are sign-preserving), we can still achieve exponentially fast deterministic rates.

2 Preliminary Insights

2.1 Connections Between Exponents

Taking one point as $x^*$ in the definition of UC, we see that
$$|f(x) - f(x^*)| \geq \frac{\lambda}{2} \|x - x^*\|^k.$$
Since $\|\nabla f(x)\| \, \|x - x^*\| \geq \nabla f(x)^\top (x - x^*) \geq f(x) - f(x^*)$ (by convexity),
$$\|\nabla f(x) - 0\| \geq \frac{\lambda}{2} \|x - x^*\|^{k-1}.$$
Another relevant fact for us will be that uniformly convex functions in $d$ dimensions are uniformly convex along any one direction; in other words, for every fixed $x \in S$ and fixed unit vector $u \in \mathbb{R}^d$, the univariate function of $\alpha$ defined by $f_{x,u}(\alpha) := f(x + \alpha u)$ is also UC with the same parameters (since $f$ is UC, $f_{x,u}(\alpha) \geq f_{x,u}(0) + \alpha \nabla f_{x,u}(0) + \frac{\lambda}{2}|\alpha|^k$).
For $u = e_j$,
$$\big|[\nabla f(x)]_j - 0\big| \geq \frac{\lambda}{2} \|x - x^*_j\|^{k-1},$$
where $x^*_j = x + \alpha^*_j e_j$ and $\alpha^*_j = \arg\min_{\{\alpha \,:\, x + \alpha e_j \in S\}} f(x + \alpha e_j)$. This uncanny similarity to the TNC (since $\nabla f(x^*) = 0$) was mathematically exploited in [2], where the authors used a lower-bounding proof technique for one-dimensional active threshold learning from [9] to provide a new lower-bounding proof technique for the $d$-dimensional stochastic convex optimization of UC functions. In particular, they showed that the minimax rate for the one-dimensional active learning excess risk and the $d$-dimensional optimization function error both scale like $\tilde{\Theta}\big(T^{-\frac{k}{2k-2}}\big)$, and that the point error in both settings scales like $\tilde{\Theta}\big(T^{-\frac{1}{2k-2}}\big)$, where $k$ is either the TNC exponent or the UC exponent, depending on the setting (we use $\tilde{O}, \tilde{\Theta}$ to hide constants and polylogarithmic factors). The importance of this connection cannot be emphasized enough, and we will see it being useful throughout this paper.

As mentioned earlier, [9] requires a two-sided TNC condition (upper and lower growth conditions, to provide an exact tight rate of growth) in order to prove risk upper bounds. On a similar note, for uniformly convex functions, we will assume such a local $k$-strong-smoothness condition around directional minima.

Assumption LkSS: for all $j \in \{1, \dots, d\}$,
$$\big|[\nabla f(x)]_j - 0\big| \leq \Lambda \|x - x^*_j\|^{k-1}$$
for some constant $\Lambda > \lambda/2$, so that we can tightly characterize the rate of growth as
$$\big|[\nabla f(x)]_j - 0\big| = \Theta\big(\|x - x^*_j\|^{k-1}\big).$$
This condition is implied by strong smoothness, i.e. Lipschitz-smooth gradients, when $k = 2$ (for strongly convex and strongly smooth functions), but is a slightly stronger assumption otherwise.
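This two-sided growth can be checked numerically on a toy univariate model (our illustrative example, not from the paper: $f(x) = \frac{\lambda}{2}|x|^k$ with minimizer $x^* = 0$, for which the derivative magnitude is exactly $\frac{\lambda k}{2}|x|^{k-1}$, so an LkSS-style constant $\Lambda = \frac{\lambda k}{2} > \lambda/2$ can be read off directly):

```python
import numpy as np

# Sanity check of |f'(x)| = Theta(|x - x*|^(k-1)) for f(x) = (lam/2)|x|^k,
# whose minimizer is x* = 0 and whose derivative is
# f'(x) = (lam*k/2) |x|^(k-1) sign(x).
lam, k = 1.0, 3.0
fprime = lambda x: (lam * k / 2) * np.abs(x) ** (k - 1) * np.sign(x)

xs = np.linspace(-1, 1, 201)
lower = (lam / 2) * np.abs(xs) ** (k - 1)       # UC-style lower growth bound
upper = (lam * k / 2) * np.abs(xs) ** (k - 1)   # LkSS-style upper bound, Lambda = lam*k/2
ok = (np.all(np.abs(fprime(xs)) >= lower - 1e-12)
      and np.all(np.abs(fprime(xs)) <= upper + 1e-12))
```

Both bounds hold at every grid point, with the upper bound tight for this particular $f$.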
2.2 The One-Dimensional Argument

The basic argument for relating optimization to active learning was made in [2] in the context of stochastic first-order oracles, when the noise distribution $P(z)$ is unbiased and grows linearly around its zero mean, i.e.
$$\int_0^\infty dP(z) = \frac{1}{2} \quad \text{and} \quad \int_0^t dP(z) = \Theta(t) \ \text{ for all } 0 < t < t_0,$$
for some constant $t_0$ (and similarly for $-t_0 < t < 0$). This is satisfied by Gaussian, uniform and many other distributions. We reproduce the argument for clarity and then sketch it for stochastic sign oracles as well.

For any $x \in S$, it is clear that $f_{x,j}(\alpha) := f(x + \alpha e_j)$ is convex; its gradient $\nabla f_{x,j}(\alpha) := [\nabla f(x + \alpha e_j)]_j$ is an increasing function of $\alpha$ that switches signs at $\alpha^*_j := \arg\min_{\{\alpha \,:\, x + \alpha e_j \in S\}} f_{x,j}(\alpha)$, or equivalently at the directional minimum $x^*_j := x + \alpha^*_j e_j$. One can think of $\mathrm{sign}([\nabla f(x)]_j)$ as being the true label of $x$, $\mathrm{sign}([\nabla f(x)]_j + z)$ as being the observed label, and finding $x^*_j$ as learning the decision boundary (the point where labels switch signs). Define the regression function $\eta(x) := \Pr\big(\mathrm{sign}([\nabla f(x)]_j + z) = + \mid x\big)$ and note that minimizing $f_{x_0,j}$ corresponds to identifying the Bayes threshold classifier, because the point at which $\eta(x) = 0.5$, or $[\nabla f(x)]_j = 0$, is $x^*_j$. Consider a point $x = x^*_j + t e_j$ for $t > 0$, which has $[\nabla f(x)]_j > 0$ and hence true label $+$ (a similar argument can be made for $t < 0$). As discussed earlier, $\big|[\nabla f(x)]_j\big| = \Theta\big(\|x - x^*_j\|^{k-1}\big) = \Theta(t^{k-1})$. The probability of seeing label $+$ is the probability that we draw $z$ in $\big(-[\nabla f(x)]_j, \infty\big)$, so that the sign of $[\nabla f(x)]_j + z$ is still positive.
Hence, the regression function can be written as
$$\eta(x) = \Pr\big([\nabla f(x)]_j + z > 0\big) = \Pr(z > 0) + \Pr\big({-[\nabla f(x)]_j} < z < 0\big) = 0.5 + \Theta\big([\nabla f(x)]_j\big)$$
$$\Longrightarrow \ \Big|\eta(x) - \frac{1}{2}\Big| = \Theta\big(\big|[\nabla f(x)]_j\big|\big) = \Theta\big(t^{k-1}\big) = \Theta\big(|x - x^*_j|^{k-1}\big).$$
Hence, $\eta(x)$ satisfies the TNC with exponent $k$, and an active learning algorithm (next subsection) can be used to obtain a point $\hat{x}_T$ with small point error and excess risk. Note that the function error in convex optimization is bounded above by the excess risk of the corresponding active learner using eq. (1), because
$$f_j(\hat{x}_T) - f_j(x^*_j) = \Bigg|\int_{\hat{x}_T \wedge x^*_j}^{\hat{x}_T \vee x^*_j} [\nabla f(x)]_j \, dx\Bigg| = \Theta\Bigg(\int_{\hat{x}_T \wedge x^*_j}^{\hat{x}_T \vee x^*_j} |2\eta(x) - 1| \, dx\Bigg) = \Theta\big(R(\hat{x}_T)\big).$$
Similarly, for stochastic sign oracles (Sec. 1.2), using $\eta(x) = \Pr\big(\hat{s}_j(x) = +\big)$,
$$\Big|\eta(x) - \frac{1}{2}\Big| = \Theta\big(\big|[\nabla f(x)]_j\big|\big) = \Theta\big(\|x - x^*_j\|^{k-1}\big).$$

2.3 A Non-adaptive Active Threshold Learning Algorithm

One can use a grid-based probabilistic variant of binary search called the BZ algorithm [11] to approximately learn the threshold efficiently in the active setting, when $\eta(x)$ satisfies the TNC for known $k$, $\mu$, $M$ (it is not adaptive to the parameters of the problem; one needs to know these constants beforehand). The analysis of BZ and the proof of the following lemma are discussed in detail in Theorem 1 of [12], Theorem 2 of [9] and the Appendix of [2].

Lemma 1. Given a one-dimensional regression function that satisfies the TNC with known parameters $\mu, k$, then after $T$ queries, the BZ algorithm returns a point $\hat{t}$ such that $|\hat{t} - t| = \tilde{\Theta}\big(T^{-\frac{1}{2k-2}}\big)$ and the excess risk is $\tilde{\Theta}\big(T^{-\frac{k}{2k-2}}\big)$.

Due to the described connection between exponents, one can use BZ to approximately optimize a one-dimensional uniformly convex function $f_j$ with known uniform convexity parameters $\lambda, k$.
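For intuition, here is a simplified noisy binary search in the spirit of, but not identical to, the grid-based BZ algorithm: it takes a majority vote of repeated labels at each midpoint and recurses on the indicated side. All names, the vote-splitting schedule, and the specific label model are our illustrative choices, not the BZ procedure itself:

```python
import numpy as np

def noisy_bisection(label, lo, hi, T, rng=None):
    """Majority-vote bisection for a noisy threshold on [lo, hi].
    `label(x, rng)` returns +1/-1 with Pr[+1] > 1/2 below the threshold
    and < 1/2 above it. The budget T is split across ~log2(T) halvings."""
    rng = rng or np.random.default_rng(0)
    steps = max(1, int(np.log2(T)))    # number of interval halvings
    reps = max(1, T // steps)          # repeated queries per midpoint
    for _ in range(steps):
        mid = (lo + hi) / 2
        votes = sum(label(mid, rng) for _ in range(reps))
        if votes > 0:                  # majority +  =>  threshold lies to the right
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Labels satisfying a k = 2 style TNC around a hidden threshold t
# (hypothetical slope constant 2.0, probabilities clipped away from 0 and 1).
t = 0.3141
def label(x, rng):
    p = min(max(0.5 + 2.0 * (t - x), 0.05), 0.95)
    return 1 if rng.random() < p else -1

t_hat = noisy_bisection(label, 0.0, 1.0, T=20000)
```

Near the threshold the labels are nearly fair coins, so the achievable accuracy is limited by the vote budget per midpoint, which is the qualitative tradeoff the TNC exponent quantifies.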
Hence, the BZ algorithm can be used to find a point with low function error by searching for a point with low risk. This, when combined with Lemma 1, yields the following important result.

Lemma 2. Given a one-dimensional $k$-UC and LkSS function $f_j$, a line search to find $\hat{x}_T$ close to $x^*_j$ up to accuracy $|\hat{x}_T - x^*_j| \leq \eta$ in point error can be performed in $\tilde{\Theta}(1/\eta^{2k-2})$ steps using the BZ algorithm. Alternatively, in $T$ steps we can find $\hat{x}_T$ such that $f(\hat{x}_T) - f(x^*_j) = \tilde{\Theta}\big(T^{-\frac{k}{2k-2}}\big)$.

3 A 1-D Adaptive Active Threshold Learning Algorithm

We now describe an algorithm for active learning of one-dimensional thresholds that is adaptive, meaning it can achieve the minimax optimal rate even if the TNC parameters $M$, $\mu$, $k$ are unknown. It is quite different from the non-adaptive BZ algorithm in its flavour, though it can be regarded as a robust binary search procedure, and its design and proof are inspired by an optimization procedure from [10] that is adaptive to unknown UC parameters $\lambda, k$. Even though [10] considers a specific optimization algorithm (dual averaging), we observe that their scheme for adapting to unknown UC parameters can use any optimal convex optimization algorithm as a subroutine within each epoch. Similarly, our adaptive active learning algorithm is epoch-based and can use any optimal passive learning subroutine in each epoch. We note that [3] also developed an adaptive algorithm based on disagreement-coefficient and VC-dimension arguments, but it is in a pool-based setting where one has access to a large pool of unlabeled data, and is much more complicated.

3.1 An Optimal Passive Learning Subroutine

The excess risk of passive learning procedures for 1-d thresholds can be bounded by $O(T^{-1/2})$ (e.g.
see Alexander's inequality in [13] to avoid $\sqrt{\log T}$ factors from ERM/VC arguments), and this can be achieved while ignoring the TNC parameters.

Consider such a passive learning procedure under a uniform distribution of samples (mimicked by active learning by querying the domain uniformly) in a ball $B(x_0, R) := [x_0 - R, x_0 + R]$ around an arbitrary point $x_0$ of radius $R$ that is known to contain the true threshold $t$. Then, without knowledge of $M$, $\mu$, $k$, in $T$ steps we can get a point $\hat{x}_T$ close to the true threshold $t$ such that, with probability at least $1 - \delta$,
$$R(\hat{x}_T) - R(t) = \int_{\hat{x}_T \wedge t}^{\hat{x}_T \vee t} |2\eta(x) - 1| \, dx \leq \frac{C_\delta R}{\sqrt{T}}$$
for some constant $C_\delta$. Assuming $\hat{x}_T$ lies inside the TNC region,
$$\mu \int_{\hat{x}_T \wedge t}^{\hat{x}_T \vee t} |x - t|^{k-1} \, dx \leq \int_{\hat{x}_T \wedge t}^{\hat{x}_T \vee t} |2\eta(x) - 1| \, dx.$$
Hence $\frac{\mu |\hat{x}_T - t|^k}{k} \leq \frac{C_\delta R}{\sqrt{T}}$. Since $k^{1/k} \leq 2$, with probability at least $1 - \delta$ we get a point error
$$|\hat{x}_T - t| \leq 2\left(\frac{C_\delta R}{\mu\sqrt{T}}\right)^{1/k}. \qquad (2)$$
We may assume that $\hat{x}_T$ lies within the TNC region: since the interval where $|\eta(x) - \frac{1}{2}| \leq \epsilon_0$ has at least constant width ($|x - t| \leq \delta_0 = (\epsilon_0/M)^{\frac{1}{k-1}}$), it will only take a constant number of iterations to find a point within it. A formal way to argue this is to note that if the overall risk goes to zero like $\frac{C_\delta R}{\sqrt{T}}$, then the point cannot stay outside this constant-sized region of width $\delta_0$ where $|\eta(x) - 1/2| \leq \epsilon_0$, since it would accumulate a large constant risk of at least $\int_t^{t+\delta_0} \mu |x - t|^{k-1} \, dx = \frac{\mu \delta_0^k}{k}$. So as long as $T$ is larger than a constant $T_0 := \frac{C_\delta^2 R^2 k^2}{\mu^2 \delta_0^{2k}}$, our bound in eq. (2) holds with high probability (we can even assume we waste a constant number of queries just to get into the TNC region before using this algorithm).

3.2 Adaptive One-Dimensional Active Threshold Learner

Algorithm 1 is a generalized epoch-based binary search, in which we repeatedly perform passive learning within a halving search radius.
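One concrete instantiation of such a passive subroutine $P(x_0, R, N)$ is empirical risk minimization over threshold classifiers on uniform samples. The sketch below (hypothetical name `passive_threshold`) is one standard choice; the paper only requires some passive learner with $O(N^{-1/2})$ excess risk:

```python
import numpy as np

def passive_threshold(label, x0, R, N, rng=None):
    """Passive learner P(x0, R, N): draw N points uniformly from
    B(x0, R) = [x0 - R, x0 + R], observe noisy labels, and return the
    threshold minimizing empirical 0-1 risk among classifiers that
    predict + to the left of the threshold and - to the right."""
    rng = rng or np.random.default_rng(0)
    xs = np.sort(rng.uniform(x0 - R, x0 + R, size=N))
    ys = np.array([label(x, rng) for x in xs])
    # For a cut between xs[i-1] and xs[i]: mistakes are the -'s on the left
    # plus the +'s on the right; prefix/suffix sums give all N+1 cuts at once.
    left_minus = np.concatenate([[0], np.cumsum(ys == -1)])
    right_plus = np.concatenate([np.cumsum((ys == 1)[::-1])[::-1], [0]])
    i = int(np.argmin(left_minus + right_plus))
    cuts = np.concatenate([[x0 - R], (xs[:-1] + xs[1:]) / 2, [x0 + R]])
    return float(cuts[i])
```

Note that the subroutine never looks at $M$, $\mu$ or $k$; adaptivity in Algorithm 1 comes entirely from the epoch schedule wrapped around it.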
Let the number of epochs be $E := \log_2 \sqrt{\frac{2T}{C_{\tilde\delta}^2 \log T}} \leq \frac{\log T}{2}$ (if the constant $C_{\tilde\delta}^2 > 2$), and let $\tilde\delta := 2\delta/\log T \leq \delta/E$. Let the time budget per epoch be $N := T/E$ (the same for every epoch), and let the search radius in epoch $e \in \{1, \dots, E\}$ shrink as $R_e := 2^{-e+1} R$. Let us define the minimizer of the risk within the ball of radius $R_e$ centered around $x_{e-1}$ at epoch $e$ as
$$x^*_e = \arg\min\big\{R(x) : x \in S \cap B(x_{e-1}, R_e)\big\}.$$
Note that $x^*_e = t$ iff $t \in B(x_{e-1}, R_e)$, and it will be one end of the interval otherwise.

Algorithm 1: Adaptive Threshold Learner
Input: domain $S$ of diameter $R$, oracle budget $T$, confidence $\delta$
Black box: any optimal passive learning procedure $P(x, R, N)$ that outputs an estimated threshold in $B(x, R)$ using $N$ queries
Choose any $x_0 \in S$; set $R_1 = R$, $E = \log_2 \sqrt{\frac{2T}{C_{\tilde\delta}^2 \log T}}$, $N = \frac{T}{E}$
1: while $1 \leq e \leq E$ do
2: $\quad x_e \leftarrow P(x_{e-1}, R_e, N)$
3: $\quad R_{e+1} \leftarrow \frac{R_e}{2}$, $\ e \leftarrow e + 1$
4: end while
Output: $x_E$

Theorem 1. In the setting of one-dimensional active learning of thresholds, Algorithm 1 adaptively achieves $R(x_E) - R(t) = \tilde{O}\big(T^{-\frac{k}{2k-2}}\big)$ with probability at least $1 - \delta$ in $T$ queries when the unknown regression function $\eta(x)$ has unknown TNC parameters $\mu$, $k$.

Proof. Since we use an optimal passive learning subroutine at every epoch, we know that after each epoch $e$ we have, with probability at least $1 - \tilde\delta$,
$$R(x_e) - R(x^*_e) \leq \frac{C_{\tilde\delta} R_e}{\sqrt{T/E}} \leq C_{\tilde\delta} R_e \sqrt{\frac{\log T}{2T}} \qquad (3)$$
(by VC theory for threshold classifiers, or similar arguments in [13], $C_{\tilde\delta}^2 \sim \log(1/\tilde\delta) \sim \log\log T$ since $\tilde\delta \sim \delta/\log T$; we treat it as constant for clarity of exposition, but actually lose $\log\log T$ factors like the high-probability arguments in [14] and [2]). Since $\eta(x)$ satisfies the TNC (and is bounded above by 1), we have for all $x$
$$\mu |x - t|^{k-1} \leq |\eta(x) - 1/2| \leq 1.$$
If the set has diameter $R$, one of the endpoints must be at least $R/2$ away from $t$, and hence we get a limitation on the maximum value of $\mu$: $\mu \leq \frac{1}{(R/2)^{k-1}}$. Since $k \geq 2$ and $E \geq 2$, and $2^{-E} = C_{\tilde\delta}\sqrt{\frac{\log T}{2T}}$, using simple algebra we get
$$\mu \leq \frac{2^{(k-2)E+2}}{(R/2)^{k-1}} = 4 \,\cdot
\frac{2^{-E}\, 2^{(k-1)E}\, 2^{k-1}}{R^{k-1}} = \frac{4 \cdot 2^{-E}\, 2^{k-1}}{(2^{-E}R)^{k-1}} = \frac{4\, C_{\tilde\delta}\, 2^{k-1}}{R_{E+1}^{k-1}} \sqrt{\frac{\log T}{2T}}.$$
We prove that we will be appropriately close to $t$ after some epoch $e^*$ by doing a case analysis on $\mu$. When the true unknown $\mu$ is sufficiently small, i.e.
$$\mu \leq \frac{4\, C_{\tilde\delta}\, 2^{k-1}}{R_2^{k-1}} \sqrt{\frac{\log T}{2T}}, \qquad (4)$$
then we show that we will be done after $e^* = 1$. Otherwise, we will be done after epoch $2 \leq e^* \leq E$ if the true $\mu$ lies in the range
$$\frac{4\, C_{\tilde\delta}\, 2^{k-1}}{R_{e^*}^{k-1}} \sqrt{\frac{\log T}{2T}} \ \leq\ \mu\ \leq\ \frac{4\, C_{\tilde\delta}\, 2^{k-1}}{R_{e^*+1}^{k-1}} \sqrt{\frac{\log T}{2T}}. \qquad (5)$$
To see why we will be done, equations (4) and (5) imply
$$R_{e^*+1} \leq 2\left(\frac{8\, C_{\tilde\delta}^2 \log T}{\mu^2 T}\right)^{\frac{1}{2k-2}}$$
after epoch $e^*$, and plugging this into equation (3) with $R_{e^*} = 2 R_{e^*+1}$, we get
$$R(x_{e^*}) - R(x^*_{e^*}) \leq C_{\tilde\delta} R_{e^*} \left(\frac{\log T}{2T}\right)^{\frac{1}{2}} = O\left(\left(\frac{\log T}{T}\right)^{\frac{k}{2k-2}}\right). \qquad (6)$$
There are two issues hindering the completion of our proof. The first is that even though $x^*_1 = t$ to start off with, it might be the case that $x^*_{e^*}$ is far away from $t$, since we are chopping the radius in half at every epoch. Interestingly, in Lemma 3 we will prove that round $e^*$ is the last round up to which $x^*_e = t$. This would imply from eq. (6) that
$$R(x_{e^*}) - R(t) = \tilde{O}\big(T^{-\frac{k}{2k-2}}\big). \qquad (7)$$
Secondly, we might be concerned that after round $e^*$ we may move further away from $t$ in later epochs. However, we will show that since the radii are decreasing geometrically by half at every epoch, we cannot really wander too far away from $x_{e^*}$.
This will give us a bound (see Lemma 4) like
$$R(x_E) - R(x_{e^*}) = \tilde{O}\big(T^{-\frac{k}{2k-2}}\big). \qquad (8)$$
We will essentially prove that the final point $x_{e^*}$ of epoch $e^*$ is sufficiently close to the true optimum $t$, and that the final point $x_E$ of the algorithm is sufficiently close to $x_{e^*}$. Summing eq. (7) and eq. (8) yields our desired result.

Lemma 3. For all $e \leq e^*$, conditioned on having $x^*_{e-1} = t$, with probability $1 - \tilde\delta$ we have $x^*_e = t$. In other words, up to epoch $e^*$, the optimal classifier in the domain of each epoch is the true threshold with high probability.

Proof. $x^*_e = t$ will hold in epoch $e$ if the first point $x_{e-1}$ of the epoch is such that the ball of radius $R_e$ around it actually contains $t$, or mathematically, if $|x_{e-1} - t| \leq R_e$. This is trivially satisfied for $e = 1$, and assuming it is true for epoch $e - 1$, we will show by induction that it holds for epoch $e \leq e^*$ w.p. $1 - \tilde\delta$. Notice that using equation (2), conditioned on the induction going through in previous rounds ($t$ being within the search radius), after the completion of round $e - 1$ we have, with probability $1 - \tilde\delta$,
$$|x_{e-1} - t| \leq 2\left[\frac{C_{\tilde\delta} R_{e-1}}{\mu\sqrt{T/E}}\right]^{1/k}.$$
If this were upper bounded by $R_e$, the induction would go through. So what we would really like to show is that $2\left(\frac{C_{\tilde\delta} R_{e-1}}{\mu\sqrt{T/E}}\right)^{\frac{1}{k}} \leq R_e$. Since $R_{e-1} = 2 R_e$, we effectively want to show $\frac{2^k C_{\tilde\delta}\, 2 R_e}{\mu}\sqrt{\frac{E}{T}} \leq R_e^k$, or equivalently, that for all $e \leq e^*$ we have $\frac{4\, C_{\tilde\delta}\, 2^{k-1}}{R_e^{k-1}}\sqrt{\frac{E}{T}} \leq \mu$. Since $E \leq \frac{\log T}{2}$, we would be achieving something stronger if we showed
$$\frac{4\, C_{\tilde\delta}\, 2^{k-1}}{R_e^{k-1}}\sqrt{\frac{\log T}{2T}} \leq \mu,$$
which is known to be true for every epoch up to $e^*$ by equation (5).

Lemma 4. For all $e^* < e \leq E$,
$$R(x_e) - R(x_{e^*}) \leq \frac{C_{\tilde\delta} R_{e^*}}{\sqrt{T/E}} = \tilde{O}\big(T^{-\frac{k}{2k-2}}\big) \quad \text{w.p.}
$1 - \tilde\delta$; i.e., after epoch $e^*$, we cannot deviate much from where we ended epoch $e^*$.

Proof. For $e > e^*$, we have with probability at least $1 - \tilde\delta$
$$R(x_e) - R(x_{e-1}) \leq R(x_e) - R(x^*_e) \leq \frac{C_{\tilde\delta} R_e}{\sqrt{T/E}},$$
and hence even for the final epoch $E$, we have with probability $(1 - \tilde\delta)^{E - e^*}$
$$R(x_E) - R(x_{e^*}) = \sum_{e = e^*+1}^{E} \big[R(x_e) - R(x_{e-1})\big] \leq \sum_{e = e^*+1}^{E} \frac{C_{\tilde\delta} R_e}{\sqrt{T/E}}.$$
Since the radii are halving in size, this is upper bounded (as in equation (6)) by
$$\frac{C_{\tilde\delta} R_{e^*}}{\sqrt{T/E}}\left[\frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \cdots\right] \leq \frac{C_{\tilde\delta} R_{e^*}}{\sqrt{T/E}} = \tilde{O}\big(T^{-\frac{k}{2k-2}}\big).$$
These lemmas justify the use of equations (7) and (8), whose sum yields our desired result. Notice that the overall probability of success is at least $(1 - \tilde\delta)^E \geq 1 - \delta$, concluding the proof of the theorem.

4 Randomized Stochastic-Sign Coordinate Descent

We now describe an algorithm that can perform stochastic optimization of $k$-UC and LkSS functions in $d > 1$ dimensions when given access to a stochastic sign oracle and a black-box 1-D active learning algorithm, such as our adaptive scheme from the previous section, as a subroutine. The procedure is well known in the literature, but the idea that one only needs noisy gradient signs to perform minimization optimally, and that one can use active learning as a line-search procedure, is novel to the best of our knowledge. The idea is to simply perform random coordinate-wise descent with approximate line search, where the subroutine for line search is an optimal active threshold learning algorithm that is used to approach the minimum of the function along the chosen direction.
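Putting the pieces together, here is an end-to-end Python sketch of this scheme. It is our simplification, not the paper's exact procedure: a majority-vote bisection stands in for the optimal active-learning line search, and the epoch count is smaller than the paper's $d(\log T)^2$ to keep the demo fast:

```python
import numpy as np

def sign_coord_descent(sign_oracle, x0, R, T, d, rng=None):
    """Randomized coordinate descent driven only by noisy gradient signs:
    each epoch picks a random coordinate and runs a majority-vote bisection
    line search, querying sign_oracle(y, j, rng) in {+1, -1}."""
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    E = max(1, d * int(np.log(T)))   # epochs (paper: d (log T)^2)
    n = max(1, T // E)               # oracle budget per line search
    steps = max(1, int(np.log2(n)))  # interval halvings per line search
    reps = max(1, n // steps)        # votes per midpoint
    for _ in range(E):
        j = int(rng.integers(d))     # random coordinate direction
        lo, hi = x[j] - R, x[j] + R
        for _ in range(steps):
            mid = (lo + hi) / 2
            y = x.copy(); y[j] = mid
            votes = sum(sign_oracle(y, j, rng) for _ in range(reps))
            if votes > 0:            # derivative positive => minimum to the left
                hi = mid
            else:
                lo = mid
        x[j] = (lo + hi) / 2
    return x

# Toy problem: f(x) = 0.5 ||x - c||^2, so [grad f(x)]_j = (x - c)_j; the
# oracle only reveals a Gaussian-corrupted sign of that coordinate.
c = np.array([0.5, -0.3])
def oracle(y, j, rng):
    return 1 if (y - c)[j] + 0.3 * rng.standard_normal() > 0 else -1

x_hat = sign_coord_descent(oracle, [0.0, 0.0], R=1.0, T=100000, d=2)
```

Despite never seeing a real-valued gradient, the iterate approaches the minimizer, illustrating the claim that noisy gradient signs suffice.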
Let the gradient at epoch $e$ be called $\nabla_{e-1} = \nabla f(x_{e-1})$, let the direction of descent $d_e$ be a unit coordinate vector chosen uniformly at random from $\{e_1, \dots, e_d\}$, and let our step size from $x_{e-1}$ be $\alpha_e$ (determined by active learning), so that our next point is $x_e := x_{e-1} + \alpha_e d_e$. Assume, for analysis, that the optimum of $f_e(\alpha) := f(x_{e-1} + \alpha d_e)$ is $\alpha^*_e := \arg\min_\alpha f(x_{e-1} + \alpha d_e)$ and $x^*_e := x_{e-1} + \alpha^*_e d_e$, where (due to optimality) the derivative is
$$\nabla f_e(\alpha^*_e) = 0 = \nabla f(x^*_e)^\top d_e. \qquad (9)$$
The line search to find the $\alpha_e$ and $x_e$ that approximate the minimum $x^*_e$ can be accomplished by any optimal active learning algorithm, once we fix the number of time steps per line search.

4.1 Analysis of Algorithm 2

Algorithm 2: Randomized Stochastic-Sign Coordinate Descent
Input: set $S$ of diameter $R$, query budget $T$
Oracle: stochastic sign oracle $O_f(x, j)$ returning a noisy $\mathrm{sign}\big([\nabla f(x)]_j\big)$
Black box: algorithm $LS(x, d, n)$, a line search from $x$ in direction $d$ for $n$ steps
Choose any $x_0 \in S$; set $E = d(\log T)^2$
1: while $1 \leq e \leq E$ do
2: $\quad$ Choose a unit coordinate vector $d_e$ from $\{e_1, \dots, e_d\}$ uniformly at random
3: $\quad x_e \leftarrow LS(x_{e-1}, d_e, T/E)$ using $O_f$
4: $\quad e \leftarrow e + 1$
5: end while
Output: $x_E$

Let the number of epochs be $E = d(\log T)^2$, so the number of time steps per epoch is $T/E$. We can do a line search from $x_{e-1}$ to get an $x_e$ that approximates $x^*_e$ well in function error in $T/E = \tilde{O}(T)$ steps using an active learning subroutine; let the resulting function error be denoted by $\epsilon' = \tilde{O}\big(T^{-\frac{k}{2k-2}}\big)$:
$$f(x_e) \leq f(x^*_e) + \epsilon'.$$
Also, LkSS and UC allow us to infer (for $k^* = \frac{k}{k-1}$, i.e.
$1/k + 1/k^* = 1$)
$$f(x_{e-1}) - f(x_e^*) \ge \frac{\lambda}{2} \| x_{e-1} - x_e^* \|^k \ge \frac{\lambda}{2 \Lambda^{k^*}} \big| \nabla_{e-1}^\top d_e \big|^{k^*}.$$
Eliminating $f(x_e^*)$ from the above equations, subtracting $f(x^*)$ from both sides, denoting $\Delta_e := f(x_e) - f(x^*)$ and taking expectations,
$$\mathbb{E}[\Delta_e] \le \mathbb{E}[\Delta_{e-1}] - \frac{\lambda}{2 \Lambda^{k^*}} \mathbb{E}\Big[ \big| \nabla_{e-1}^\top d_e \big|^{k^*} \Big] + \epsilon'.$$
Since (using $k \ge 2 \Rightarrow 1 \le k^* \le 2 \Rightarrow \| \cdot \|_{k^*} \ge \| \cdot \|_2$)
$$\mathbb{E}\Big[ \big| \nabla_{e-1}^\top d_e \big|^{k^*} \,\Big|\, d_1, \dots, d_{e-1} \Big] = \frac{1}{d} \| \nabla_{e-1} \|_{k^*}^{k^*} \ge \frac{1}{d} \| \nabla_{e-1} \|^{k^*},$$
we get
$$\mathbb{E}[\Delta_e] \le \mathbb{E}[\Delta_{e-1}] - \frac{\lambda}{2 d \Lambda^{k^*}} \mathbb{E}\big[ \| \nabla_{e-1} \|^{k^*} \big] + \epsilon'.$$
By convexity, Cauchy-Schwarz and UC (since $\Delta_{e-1}^k \le [\nabla_{e-1}^\top (x_{e-1} - x^*)]^k \le \| \nabla_{e-1} \|^k \| x_{e-1} - x^* \|^k \le \| \nabla_{e-1} \|^k \frac{2}{\lambda} \Delta_{e-1}$), we have $\| \nabla_{e-1} \|^{k^*} \ge \big( \frac{\lambda}{2} \big)^{\frac{1}{k-1}} \Delta_{e-1}$, so
$$\mathbb{E}[\Delta_e] \le \mathbb{E}[\Delta_{e-1}] \left( 1 - \frac{1}{d} \Big( \frac{\lambda}{2 \Lambda} \Big)^{k^*} \right) + \epsilon'.$$
Defining $C := \frac{1}{d} \big( \frac{\lambda}{2 \Lambda} \big)^{k^*}$, which is less than 1 since $1 < k^* \le 2$ and $\Lambda > \lambda/2$, we get the recurrence
$$\mathbb{E}[\Delta_e] - \frac{\epsilon'}{C} \le (1 - C) \left( \mathbb{E}[\Delta_{e-1}] - \frac{\epsilon'}{C} \right).$$
Since $E = d (\log T)^2$ and $\Delta_0 \le L \| x_0 - x^* \| \le L R$, after the last epoch we have
$$\mathbb{E}[\Delta_E] - \frac{\epsilon'}{C} \le (1 - C)^E \left( \Delta_0 - \frac{\epsilon'}{C} \right) \le \exp\big( -C d (\log T)^2 \big) \Delta_0 \le L R \, T^{-C d \log T}.$$
As long as $T > \exp\big( (2\Lambda/\lambda)^{k^*} \big)$, a constant, we have $C d \log T \ge 1$ and
$$\mathbb{E}[\Delta_E] = O(\epsilon') + o(T^{-1}) = \tilde{O}\big( T^{-\frac{k}{2k-2}} \big),$$
which is the desired result. Notice that in this section we did not need to know $\lambda, \Lambda, k$, because we simply run randomized coordinate descent for $E = d (\log T)^2$ epochs with $T/E$ steps per subroutine, and the active learning subroutine was itself adaptive to the appropriately calculated TNC parameters. In summary:

Theorem 2. Given access to only noisy gradient sign information from a stochastic sign oracle, Randomized Stochastic-Sign Coordinate Descent can minimize UC and LkSS functions at the minimax optimal convergence rate for expected function error of $\tilde{O}\big( T^{-\frac{k}{2k-2}} \big)$, adaptive to all unknown convexity and smoothness parameters.
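The recurrence above contracts the error geometrically toward the fixed point $\epsilon'/C$. A quick numeric check of that contraction, with placeholder values for $C$, $\epsilon'$, $\Delta_0$ and $E$ (not constants from the analysis):

```python
# Iterate the worst case of E[Delta_e] <= (1 - C) E[Delta_{e-1}] + eps' and
# verify it decays geometrically to the fixed point eps'/C.
C, eps_prime = 0.05, 1e-4   # placeholder contraction factor and line-search error
delta0 = 10.0               # placeholder initial error, Delta_0 <= L R
E = 500                     # placeholder number of epochs

delta = delta0
for _ in range(E):
    delta = (1 - C) * delta + eps_prime

# Closed form of the recurrence at equality:
fixed_point = eps_prime / C
assert abs((delta - fixed_point) - (1 - C) ** E * (delta0 - fixed_point)) < 1e-9
```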
As a special case, for $k = 2$, strongly convex and strongly smooth functions can be minimized at the rate $\tilde{O}(1/T)$.

4.2 Gradient Sign-Preserving Computations

A practical concern when implementing optimization algorithms is machine precision, the number of decimals to which real numbers are stored. Finite space may limit the accuracy with which each gradient can be stored, and one may ask how much these inaccuracies affect the final convergence rate: how is the query complexity of optimization affected if the true gradients were rounded to one or two decimal points? If the gradients were randomly rounded (so as to remain unbiased), one might guess that we could easily achieve stochastic first-order optimization rates. However, our results give a surprising answer to that question, as a similar argument reveals that for UC and LkSS functions (with strongly convex and strongly smooth being a special case), our algorithm achieves exponential rates. Since rounding errors do not flip any sign in the gradient, even if the gradient were rounded or decimal points were dropped as much as possible, and we were to return only a single bit per coordinate carrying the true sign, one can still achieve the exponentially fast convergence rate observed in non-stochastic settings: our algorithm needs only a logarithmic number of epochs, and in each epoch active learning approaches the directional minimum exponentially fast with noiseless gradient signs, using a perfect binary search. In fact, our algorithm is the natural higher-dimensional generalization of binary search, in both the deterministic and stochastic settings. We can summarize this in the following theorem:

Theorem 3.
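The noiseless 1-D case discussed above is just binary search on the derivative sign. A small sketch (with our illustrative names; `sign_bit` plays the role of a maximally compressed, sign-preserving gradient): even if the oracle returns only one bit, the true sign of $f'(x)$, the bracketing interval halves per query and the error decays exponentially.

```python
def binary_search_min(dsign, lo, hi, n):
    """Find the minimizer of a convex 1-D function from derivative signs alone.
    The bracketing interval halves on every query."""
    for _ in range(n):
        mid = 0.5 * (lo + hi)
        if dsign(mid) > 0:
            hi = mid   # derivative positive: minimum is to the left of mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Oracle returns a single bit: the true sign of f'(x) for f(x) = (x - 0.3)^2,
# as if the real-valued gradient 2(x - 0.3) were compressed down to its sign.
sign_bit = lambda x: 1 if 2 * (x - 0.3) > 0 else -1
x_hat = binary_search_min(sign_bit, -1.0, 1.0, 50)
# |x_hat - 0.3| <= 2 / 2^50: exponentially small in the number of queries
```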
Given access to gradient signs in the presence of sign-preserving noise (such as deterministic or random rounding of gradients, dropping decimal places for lower precision, etc.), Randomized Stochastic-Sign Coordinate Descent can minimize UC and LkSS functions exponentially fast, with a function error convergence rate of $\tilde{O}(\exp\{-T\})$.

5 Discussion

While the assumption of smoothness is natural for strongly convex functions, our assumption of LkSS might appear strong in general. It is possible to relax this assumption and require the LkSS exponent to differ from the UC exponent, or to assume only strong smoothness; this still yields consistency for our algorithm, but the rate achieved is worse. [10] and [2] both have epoch-based algorithms that achieve the minimax rates under just Lipschitz assumptions with access to a full-gradient stochastic first-order oracle, but it is hard to prove the same rates for a coordinate descent procedure without smoothness assumptions. Given a target function accuracy $\epsilon$ instead of a query budget $T$, a randomized coordinate descent procedure similar to ours achieves the minimax rate with a similar proof, but it is non-adaptive, since we presently do not have an adaptive active learning procedure when given $\epsilon$. As of now, we know of no adaptive UC optimization procedure when given $\epsilon$. Recently, [15] analysed stochastic gradient descent with averaging and showed that, for smooth functions, it is possible for an algorithm to automatically adapt between convexity and strong convexity; in comparison, we show how to adapt to unknown uniform convexity (strong convexity being the special case $k = 2$). It may be possible to combine the ideas from this paper and [15] to get a universally adaptive algorithm from convex to all degrees of uniform convexity.
It would also be interesting to see whether these ideas extend to connections between convex optimization and learning linear threshold functions. In this paper, we exploit recently discovered theoretical connections by providing explicit algorithms that take advantage of them. We show how these could lead to cross-fertilization between the two fields in both directions, and we hope that this is just the beginning of a flourishing interaction in which these insights lead to many new algorithms as we leverage the theoretical relations in more innovative ways.

References

[1] Raginsky, M., Rakhlin, A.: Information complexity of black-box convex optimization: A new look via feedback information theory. In: 47th Annual Allerton Conference on Communication, Control, and Computing (2009)
[2] Ramdas, A., Singh, A.: Optimal rates for stochastic convex optimization under Tsybakov noise condition. Intl. Conference on Machine Learning (ICML) (2013)
[3] Hanneke, S.: Rates of convergence in active learning. The Annals of Statistics 39(1) (2011) 333–361
[4] Nemirovski, A., Yudin, D.: Problem complexity and method efficiency in optimization. John Wiley & Sons (1983)
[5] Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. CORE discussion papers 2 (2010)
[6] Jamieson, K., Nowak, R., Recht, B.: Query complexity of derivative-free optimization. Advances in Neural Information Processing Systems (NIPS) (2012)
[7] Tsybakov, A.: Optimal aggregation of classifiers in statistical learning. The Annals of Statistics 32(1) (2004) 135–166
[8] Audibert, J.Y., Tsybakov, A.B.: Fast learning rates for plug-in classifiers. Annals of Statistics 35(2) (2007) 608–633
[9] Castro, R., Nowak, R.: Minimax bounds for active learning.
In: Proceedings of the 20th Annual Conference on Learning Theory, Springer-Verlag (2007) 5–19
[10] Iouditski, A., Nesterov, Y.: Primal-dual subgradient methods for minimizing uniformly convex functions. Universite Joseph Fourier, Grenoble, France (2010)
[11] Burnashev, M., Zigangirov, K.: An interval estimation problem for controlled observations. Problemy Peredachi Informatsii 10(3) (1974) 51–61
[12] Castro, R., Nowak, R.: Active sensing and learning. Foundations and Applications of Sensor Management (2009) 177–200
[13] Devroye, L., Györfi, L., Lugosi, G.: A probabilistic theory of pattern recognition. Volume 31. Springer (1996)
[14] Hazan, E., Kale, S.: Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In: Proceedings of the 23rd Annual Conference on Learning Theory (2011)
[15] Bach, F., Moulines, E.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Advances in Neural Information Processing Systems (NIPS) (2011)
