Agnostically Learning Juntas from Random Walks
We prove that the class of functions g:{-1,+1}^n -> {-1,+1} that only depend on an unknown subset of k<<n variables (so-called k-juntas) is agnostically learnable from a random walk in time polynomial in n, 2^{k^2}, epsilon^{-k}, and log(1/delta).
Authors: Jan Arpe, Elchanan Mossel
Agnostically Learning Juntas from Random Walks

Jan Arpe*        Elchanan Mossel†

June 25, 2008

Abstract

We prove that the class of functions g: {−1,+1}^n → {−1,+1} that depend only on an unknown subset of k ≪ n variables (so-called k-juntas) is agnostically learnable from a random walk in time polynomial in n, 2^{k²}, ε^{−k}, and log(1/δ). In other words, there is an algorithm with the claimed running time that, given ε, δ > 0 and access to a random walk on {−1,+1}^n labeled by an arbitrary function f: {−1,+1}^n → {−1,+1}, finds with probability at least 1 − δ a k-junta that is (opt(f) + ε)-close to f, where opt(f) denotes the distance of a closest k-junta to f.

Keywords: agnostic learning, random walks, juntas

1 Introduction

1.1 Motivation

In supervised learning, the learner is provided with a training set of labeled examples (x_1, f(x_1)), (x_2, f(x_2)), ..., and the goal is to find a hypothesis h that is a good approximation to f, i.e., that gives good estimates for f(x) also on the points that are not present in the training set. In many applications, the points x correspond to particular states of a system, and the labels f(x) correspond to classifications of these states. If the underlying system evolves over time and thus (x_t, f(x_t)) corresponds to a measurement of the current state and its classification at time t, it is often reasonable to assume that state changes occur only locally, i.e., at each time t, x_t differs only "locally" from x_{t−1}. Such phenomena occur, for instance, in physics or biology: e.g., in a fixed time interval, a particle can only travel a finite distance, and the mutation of a DNA sequence can be assumed to happen in a single position at a time.
In discrete settings, such processes are often modeled as random walks on graphs, in which the nodes represent the states of the system and edges indicate possible local state changes. We are interested in studying the special case in which the underlying graph is a hypercube, i.e., the node set is {−1,1}^n and two nodes are adjacent if and only if they differ in exactly one coordinate. Furthermore, we restrict the setting to Boolean classifications. This random walk learning model has attracted a lot of attention since the nineties [1, 3, 7, 6, 15], mainly because of its interesting learning-theoretic properties. The model is weaker than the membership query model, in which the learner is allowed to ask for the classifications of specific points, and it is stronger than the uniform-distribution model, in which the learner observes points that are drawn independently of each other from the uniform distribution on {−1,1}^n. Moreover, the latter relation is known to be strict: under a standard complexity-theoretic assumption (the existence of one-way functions), there is a class that is efficiently learnable from labeled random walks but not from independent uniformly distributed examples [6, Proposition 2].

* U.C. Berkeley. Email: arpe@stat.berkeley.edu. Supported by the Postdoc-Program of the German Academic Exchange Service (DAAD) and in part by NSF Career Award DMS 0548249 and BSF 2004105.
† U.C. Berkeley. Email: mossel@stat.berkeley.edu. Supported by NSF Career Award DMS 0548249, BSF 2004105, and DOD ONR grant N0014-07-1-05-06.
The random walk learning model shares similarities with both other models mentioned above: as in the uniform-distribution model, the examples are generated at random (so the learner has no influence on the given examples), and points of the random walk that correspond to time points sufficiently far apart roughly behave like independent uniformly distributed points. On the other hand, some learning problems that appear to be infeasible in the uniform-distribution model but are known to be easy to solve in the membership query model have turned out to be easy in the random walk model as well. Among them are the problem of learning DNFs with polynomially many terms [6] (even under random classification noise) and the problem of learning parity functions in the presence of random classification noise. The former result relies on an efficient algorithm performing the Bounded Sieve [6], introduced in [5]. The latter result follows from the fact that the (noiseless) random walk model admits an efficient approximation of variable influences, and the effect of random classification noise can easily be dealt with by drawing a sufficiently larger number of examples.

Given this success of the random walk model in learning large classes in the presence of random classification noise, it is natural to ask whether it can also cope with even more severe noise models. One elegant, albeit challenging, noise model is the agnostic learning model introduced by Kearns et al. [11]. In this model, no assumption whatsoever is made about the labels. Instead of asking for a hypothesis that is close to the classification function, the goal in agnostic learning is to produce a hypothesis that agrees with the labels on nearly as many points as the best fitting function from the target class.
More formally, given a class C of Boolean functions on {−1,1}^n and an arbitrary function f: {−1,1}^n → {−1,1}, let opt_C(f) = min_{g∈C} Pr[g(x) ≠ f(x)]. The class C is agnostically learnable if there is an algorithm that, for any ε, δ > 0, produces a hypothesis h that, with probability at least 1 − δ, satisfies Pr[h(x) ≠ f(x)] ≤ opt_C(f) + ε.

Recently, Gopalan et al. [9] have shown that the class of Boolean functions that can be represented by decision trees of polynomial size (in the number of variables) can be learned agnostically from membership queries in polynomial time. Their main result combines the Kushilevitz-Mansour algorithm for finding large Fourier coefficients [12] with a gradient-descent algorithm [16] to solve an ℓ_1-regression problem for sparse polynomials. They also present a simpler algorithm (with slightly worse running time) that properly agnostically learns the class of k-juntas. These are functions f: {−1,1}^n → {−1,1} that depend on an a priori unknown subset of at most k variables. The term proper learning refers to the requirement that only hypotheses from the target class (here: k-juntas) are produced.

The investigation of the learnability of this class has both practical and theoretical motivation. Practically, the junta learning problem serves as a clean model of learning in the presence of irrelevant information, a core problem in data mining [4]. From a theoretical perspective, the problem is interesting due to its close relationship to learning DNF formulas, decision trees, and noisy parity functions [14].

1.2 Our Results and Techniques

The main result of this paper is that the class of k-juntas on n variables is properly agnostically learnable in the random walk model in time polynomial in n (times some function of k and the accuracy parameter ε).
More precisely, we show:

Theorem 1. Let C be the class of k-juntas on n variables. There is an algorithm that, given ε, δ > 0 and access to a random walk x^1, x^2, ... on {−1,1}^n that is labeled by an arbitrary function f: {−1,1}^n → {−1,1}, returns a k-junta h that, with probability at least 1 − δ, satisfies Pr[h(x) ≠ f(x)] ≤ opt_C(f) + ε. The running time of this algorithm is polynomial in n, 2^{k²}, (1/ε)^k, and log(1/δ).

We thus prove the first efficient learning result for agnostically learning juntas (even properly) in a passive learning model. Our main technical lemma (Lemma 3) shows that for an arbitrary function f and a k-junta g, there exists another k-junta g′ that is almost as correlated with f as g is and whose relevant variables can be inferred from all low-level Fourier coefficients of f of a certain size. These Fourier coefficients can in turn be detected using the Bounded Sieve algorithm of Bshouty et al. [6], given a random walk labeled by f. Once a superset R of the relevant variables of g′ is found, it is easy to derive a hypothesis that depends on at most k variables from R and that best matches the given labels: for each k-element subset J ⊆ R, the best matching function with relevant variables in J is obtained by taking majority votes on points that coincide in these coordinates. Similarly to the classical result of Angluin and Laird [2] that a (proper) hypothesis minimizing the number of disagreements with the labels is close to the target function (in the PAC learning model with random classification noise), we show that such a hypothesis is also a good candidate to satisfy the agnostic learning goal in the random walk model (see Proposition 1). A similar statement has implicitly been shown in the agnostic PAC learning model (see the proof of Theorem 1 in [11]).
1.3 Related Work

Our algorithm for agnostically learning juntas in the random walk model has some similarities to Gopalan et al.'s recent algorithm for properly agnostically learning juntas in the membership query model [9]. The main differences between the approaches are in two respects: first, we do not explicitly calculate the quantities I_i^{≤k} = Σ_{S: i∈S, |S|≤k} f̂(S)² but instead use our technical lemma mentioned above, which may be of independent interest. Second, instead of using their characterization of the best fitting junta with a fixed set of relevant variables in terms of the Fourier spectrum of f ([9, Lemma 13]), we directly construct such a best fitting hypothesis by taking majority votes in ambiguous situations. Even though we became aware of Gopalan et al.'s result only after devising our junta learning algorithm, we have decided to adopt much of their notation for the benefit of the readers.

It should also be noted that a generalization of Gopalan et al.'s decision tree learning algorithm cannot be adapted to the random walk model in a straightforward manner: the running time of the only known analogue of the Kushilevitz-Mansour subroutine for the random walk model (i.e., the Bounded Sieve) is exponential in the level up to which the large Fourier coefficients are sought. In general, however, sparse polynomials can be concentrated on high levels. It would be interesting to see whether the results in [9] can also be derived for the restriction of the class of all t-sparse polynomials to t-sparse polynomials of degree roughly log(t), since for every decision tree of size t, there is an ε-close decision tree of depth O(log(t/ε)) (cf. [5]). In this case, the same result should hold for the random walk model.

1.4 Organization of This Paper

We briefly introduce notational and technical prerequisites in Section 2.
The random walk learning model and its agnostic variant are introduced in Section 3. Section 4 contains a concentration bound for random walks and the result on disagreement minimization in the random walk model. The main result on agnostically learning juntas is presented in Section 5. The Appendix contains a formal statement and proof of a result concerning the independence of points in a random walk (Section A) and an elementary proof of the concentration bound (Section B).

2 Preliminaries

Let N = {0, 1, 2, ...}. For n ∈ N, let [n] = {1, ..., n}. For x, x′ ∈ {−1,1}^n, let x ⊙ x′ denote the vector obtained by coordinate-wise multiplication of x and x′. For i ∈ [n], let e_i denote the vector in which all entries are equal to +1 except in the ith position, where the entry is −1. For f: {−1,1}^n → {−1,1}, a variable x_i is said to be relevant to f (and f depends on x_i) if there is an x ∈ {−1,1}^n such that f(x ⊙ e_i) ≠ f(x). For i ∈ [n] and a ∈ {−1,1}, denote by f_{x_i=a}: {−1,1}^n → {−1,1} the subfunction of f obtained by letting f_{x_i=a}(x) = f(x′) with x′_j = x_j if j ≠ i and x′_i = a. Thus, x_i is relevant to f if and only if f_{x_i=1} ≠ f_{x_i=−1}. The restriction of a vector x ∈ {−1,1}^n to a subset of coordinates J ⊆ [n] is denoted by x|_J ∈ {−1,1}^{|J|}.

All probabilities and expectations in this paper are taken with respect to the uniform distribution (except when indicated otherwise). For f, g: {−1,1}^n → R, define the inner product

  ⟨f, g⟩ = E_x[f(x) g(x)] = 2^{−n} Σ_{x ∈ {−1,1}^n} f(x) g(x).

It is well known that the functions χ_S: {−1,1}^n → {−1,1}, S ⊆ [n], defined by χ_S(x) = Π_{i∈S} x_i form an orthonormal basis of the space of real-valued functions on {−1,1}^n.
Thus, every function f: {−1,1}^n → R has the unique Fourier expansion

  f = Σ_{S⊆[n]} f̂(S) χ_S,

where f̂(S) = ⟨f, χ_S⟩ are the Fourier coefficients of f. Let ||f||_2 = ⟨f, f⟩^{1/2} = E[f(x)²]^{1/2}. Plancherel's equation states that

  ⟨f, g⟩ = Σ_{S⊆[n]} f̂(S) ĝ(S),    (1)

and from this, Parseval's equation ||f||_2² = Σ_{S⊆[n]} f̂(S)² follows as the special case f = g.

For f, g: {−1,1}^n → {−1,1}, define the distance between f and g by ∆(f, g) = Pr[f(x) ≠ g(x)], and for a class C = C_n of functions from {−1,1}^n to {−1,1}, let opt_C(f) = min_{g∈C} ∆(f, g) be the distance of f to a nearest function in C. It is easily seen that ∆(f, g) = (1 − ⟨f, g⟩)/2. Furthermore, for a sample S = (x^i, y^i)_{i=1,...,m} with x^i ∈ {−1,1}^n and y^i ∈ {−1,1}, let

  ∆(f, S) = (1/m) |{i ∈ {1, ..., m} | f(x^i) ≠ y^i}|

be the fraction of examples in S for which the labels disagree with the labeling function f.

3 The Random Walk Learning Model

3.1 Learning from Noiseless Examples

Let C = ∪_{n∈N} C_n be a class of functions, where each C_n contains functions f: {−1,1}^n → {−1,1}. In the random walk learning model, a learning algorithm has access to the oracle RW(f) for some unknown function f ∈ C_n. On the first request, RW(f) generates a point x ∈ {−1,1}^n according to the uniform distribution on {−1,1}^n and returns the example (x, f(x)), where we refer to f(x) as the label or the classification of the example. On subsequent requests, it selects a random coordinate i ∈ [n] and returns (x ⊙ e_i, f(x ⊙ e_i)), where x is the point returned in the last query.
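As an illustration, the oracle RW(f) is straightforward to simulate. The sketch below is not from the paper; the helper name `random_walk_examples` and the choice of representing points as lists of ±1 are our own, with f supplied as an arbitrary Python function.

```python
import random

def random_walk_examples(f, n, m, seed=None):
    """Simulate the oracle RW(f): a labeled random walk on {-1,1}^n.

    The first point is uniform on {-1,1}^n; each subsequent point is
    obtained by flipping one uniformly random coordinate (x <- x . e_i).
    """
    rng = random.Random(seed)
    x = [rng.choice((-1, 1)) for _ in range(n)]   # uniform start point
    examples = [(tuple(x), f(x))]
    for _ in range(m - 1):
        i = rng.randrange(n)   # random coordinate i in [n]
        x[i] = -x[i]           # flip coordinate i
        examples.append((tuple(x), f(x)))
    return examples

# Example: labels given by the parity of the first two coordinates.
walk = random_walk_examples(lambda x: x[0] * x[1], n=5, m=10, seed=0)
```

Note that, as defined above, the walk flips a coordinate at every step; the "lazy" updating variant used in the Appendix keeps each coordinate unchanged with probability 1/2.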
The goal of a learning algorithm A is, given inputs δ, ε > 0, to output a hypothesis h: {−1,1}^n → {−1,1} such that with probability at least 1 − δ (taken over all possible random walks of the requested length), Pr[h(x) ≠ f(x)] ≤ ε. In this case, A is said to learn f with accuracy ε and confidence 1 − δ. The class C is learnable from random walks if there is an algorithm A that, for every n, every f ∈ C_n, every δ > 0, and every ε > 0, learns f with access to RW(f) with accuracy ε and confidence 1 − δ. The class C is said to be learnable in time equal to the running time of A, which is a function of n, ε, δ, and possibly other parameters involved in the parameterization of the class C. If a learning algorithm only outputs hypotheses h ∈ C_n, it is called a proper learning algorithm. In this case, C is properly learnable.

The random walk model is a passive learning model in the sense that a learning algorithm has no direct control over which examples it receives (as opposed to the membership query model, in which the learner is allowed to ask for the labels of specific points x). For passive learning models, we may assume without loss of generality that all examples are requested at once.

3.2 Agnostic Learning

In the model of agnostic learning from random walks, we make no assumption whatsoever about the nature of the labels. Following the model of Gopalan et al. [9], we assume that there is an arbitrary function f: {−1,1}^n → {−1,1} according to which the examples are labeled, i.e., a learner observes pairs (x, f(x)), with the points coming from a random walk. In other words, the learner has access to RW(f), but now f is no longer required to belong to C. We can think of the labels as originating from a concept g ∈ C, with an opt_C(f) fraction of labels flipped by an adversary.
The goal of a learning algorithm is to output a hypothesis h that performs nearly as well as the best function of C. Let opt_C(f) = min_{g∈C} Pr_x[g(x) ≠ f(x)], where x ∈ {−1,1}^n is drawn according to the uniform distribution. An algorithm agnostically learns C if, for any f: {−1,1}^n → {−1,1}, given δ, ε > 0, it outputs a hypothesis h: {−1,1}^n → {−1,1} such that with probability at least 1 − δ, Pr_x[h(x) ≠ f(x)] ≤ opt_C(f) + ε. Again, if the algorithm always outputs a hypothesis h ∈ C, then it is called a proper learning algorithm, and C is said to be properly agnostically learnable.

Although all learning algorithms in this paper are proper, we believe that a word is in order concerning the formulation of the learning goal in improper agnostic learning. Namely, it could well happen that we can find a hypothesis that satisfies Pr_x[h(x) ≠ f(x)] ≤ opt_C(f), but such an h could be as far as 2·opt_C(f) from all concepts in C, which can definitely not be considered a sensible solution if, say, opt_C(f) ≥ 1/4. Instead, a hypothesis should rather be required to be ε-close to some function g ∈ C that performs best (or almost best): Pr[h(x) ≠ g(x)] ≤ ε for some g ∈ C with Pr[g(x) ≠ f(x)] = opt_C(f) (or for some near-optimal g ∈ C with Pr[g(x) ≠ f(x)] ≤ opt_C(f) + ε′). Alternatively, one can require h to belong to some reasonably chosen hypothesis class H ⊇ C; e.g., the hypotheses output by the algorithm in [9] for learning decision trees of size t are t-sparse polynomials. In fact, that algorithm properly agnostically learns the latter class.

4 A Concentration Bound for Labeled Random Walks

The following lemma estimates the probability that, after drawing a random walk x^0, ..., x^ℓ, the points x^0 and x^ℓ are independent.
The proof (and a more formal statement) are deferred to the Appendix (see Lemma 4 in Section A).

Lemma 1. Let δ > 0, ℓ ≥ n ln(n/δ), and x^0, ..., x^ℓ be a random walk on {−1,1}^n. Then, with probability at least 1 − δ, x^0 and x^ℓ are independent¹ and uniformly distributed.

Lemma 2. Let g: {−1,1}^n → [−1,1] and δ, ε > 0. Let N = ⌈n ln(n/δ)⌉, m ≥ (2N/ε²) ln(2N/δ), and x^1, ..., x^m be a random walk on {−1,1}^n. Then, with probability at least 1 − δ,

  |(1/m) Σ_{i=1}^m g(x^i) − E_x[g(x)]| ≤ ε,

where the expectation is taken over a uniformly distributed x.

Although a similar result can be obtained from the more general works on concentration bounds for random walks by Gillman [8] and for finite Markov chains by Lézaud [13], we give an elementary proof of Lemma 2 in the Appendix (see Section B). As an immediate consequence, the fraction of disagreements between the labels f(x^i) and the values h(x^i) on a random walk converges quickly to the total fraction of disagreements on all of {−1,1}^n:

Corollary 1. Let C = C_n be a class of functions from {−1,1}^n to {−1,1}. Let ε, δ > 0, f: {−1,1}^n → {−1,1}, and S = (x^i, f(x^i))_{i=1,...,m} be a labeled random walk of length m ≥ (2N/ε²) ln(2N|C|/δ), where N = ⌈n ln(n|C|/δ)⌉. Then, with probability at least 1 − δ, for every h ∈ C,

  |∆(h, S) − ∆(h, f)| ≤ ε.    (2)

Proof. Let h ∈ C. Taking g(x) = (1/2)|h(x) − f(x)|, we obtain ∆(h, S) = (1/m) Σ_{i=1}^m g(x^i) and ∆(h, f) = E_x[g(x)], so that by Lemma 2, |∆(h, S) − ∆(h, f)| ≤ ε with probability at least 1 − δ/|C|. Thus, with probability at least 1 − δ, (2) holds for all h ∈ C.

The following proposition shows that, similarly to the classical result by Angluin and Laird [2] for distribution-free PAC learning and the analogue by Kearns et al.
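Lemma 2 is easy to check numerically. The following sketch is only a sanity check, not part of the proof: the parameters n, ε, δ and the test function g(x) = x_1 (which has E[g] = 0 under the uniform distribution) are illustrative choices of ours.

```python
import math
import random

def walk_average(g, n, m, rng):
    """Average of g along a length-m random walk on {-1,1}^n."""
    x = [rng.choice((-1, 1)) for _ in range(n)]
    total = g(x)
    for _ in range(m - 1):
        i = rng.randrange(n)
        x[i] = -x[i]
        total += g(x)
    return total / m

n, eps, delta = 8, 0.1, 0.1
N = math.ceil(n * math.log(n / delta))
# Sample size from Lemma 2: m >= (2N/eps^2) ln(2N/delta).
m = math.ceil((2 * N / eps**2) * math.log(2 * N / delta))

g = lambda x: x[0]          # E_x[g(x)] = 0 under the uniform distribution
rng = random.Random(1)
deviation = abs(walk_average(g, n, m, rng) - 0.0)
```

With these parameters, `deviation` should fall below ε with probability far exceeding 1 − δ, since the bound of Lemma 2 is not tight for this simple g.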
[11] for agnostic PAC learning, also in the random walk model agnostic learning is achieved by finding a hypothesis that minimizes the number of disagreements with a labeled random walk of sufficient length.

¹ More precisely, we can perform an additional experiment such that, conditional on some event that occurs with probability at least 1 − δ (taken over the draw of the random walk and the outcome of the additional experiment), x^0 and x^ℓ are independent. For more details, see Section A in the Appendix.

Proposition 1. Let C = C_n be a class of functions from {−1,1}^n to {−1,1}. Let ε, δ > 0, f: {−1,1}^n → {−1,1}, and S = (x^i, f(x^i))_{i=1,...,m} be a labeled random walk of length m ≥ (8N/ε²) ln(2N|C|/δ), where N = ⌈n ln(2n|C|/δ)⌉. Let h_opt ∈ C minimize ∆(h, S). Then, with probability at least 1 − δ, ∆(h_opt, f) ≤ opt_C(f) + ε. In particular, the required sample size is polynomial in n, log|C|, 1/ε, and log(1/δ).

Proof. By Corollary 1, |∆(h, S) − ∆(h, f)| ≤ ε/2 for all h ∈ C. In particular, all functions h ∈ C with ∆(h, f) > opt_C(f) + ε have ∆(h, S) > opt_C(f) + ε/2, whereas all functions h ∈ C with ∆(h, f) = opt_C(f) have ∆(h, S) ≤ opt_C(f) + ε/2. Consequently, ∆(h_opt, S) ≤ opt_C(f) + ε/2, and thus ∆(h_opt, f) ≤ opt_C(f) + ε.

5 Agnostically Learning Juntas

We start with our main technical lemma, which shows that whenever there is a k-junta g at distance ∆(f, g) from some function f, then there is another k-junta g′ (in fact, a subfunction of g) at distance at most ∆(f, g) + ε such that the relevant variables of g′ can be detected by finding all low-level Fourier coefficients of f that are of a certain minimum size.

Lemma 3. Let f: {−1,1}^n → {−1,1} be an arbitrary function and g: {−1,1}^n → {−1,1} be a k-junta.
Then, for every ε > 0, there exists a k-junta g′ such that ⟨f, g′⟩ ≥ ⟨f, g⟩ − ε and for all relevant variables x_i of g′, there exists S ⊆ [n] with |S| ≤ k, i ∈ S, and

  |f̂(S)| ≥ C · 2^{−(k−1)/2} · ε,    (3)

where C = 1 − 1/√2 ≈ 0.293.

Proof. The proof is by induction on k. For k = 0, there is nothing to show since there are no relevant variables. For the induction step, let k > 0. Assume that taking g′ to be g does not satisfy the conclusion, i.e., for some relevant variable x_i of g, |f̂(S)| < C · 2^{−(k−1)/2} · ε for all S ⊆ [n] with |S| ≤ k and i ∈ S. Our goal is to show that in this case, either g_{x_i=1} or g_{x_i=−1} is well correlated with f and thus asserts the existence of an appropriate (k−1)-junta g′. Let T = {S ⊆ [n] | ĝ(S) ≠ 0}. Then |T| ≤ 2^k. It follows that

  ⟨f, g⟩ = Σ_{S∈T} f̂(S) ĝ(S)
         = Σ_{S∈T: i∈S} f̂(S) ĝ(S) + Σ_{S∈T: i∉S} f̂(S) ĝ(S)
         ≤ Σ_{S∈T: i∈S} |ĝ(S)| · C · 2^{−(k−1)/2} · ε + Σ_{S∈T: i∉S} f̂(S) ĝ(S)
         ≤ 2^{(k−1)/2} · C · 2^{−(k−1)/2} · ε + Σ_{S∈T: i∉S} f̂(S) ĝ(S)
         = C · ε + Σ_{S∈T: i∉S} f̂(S) ĝ(S),

where the first equation is Plancherel's equation (1) and the second inequality follows by Cauchy-Schwarz (note that ĝ is supported on at most 2^{k−1} sets S with i ∈ S). Consequently,

  Σ_{S∈T: i∉S} f̂(S) ĝ(S) ≥ ⟨f, g⟩ − C · ε.

Since for S ⊆ [n],

  (ĝ_{x_i=1}(S) + ĝ_{x_i=−1}(S))/2 = { 0 if i ∈ S; ĝ(S) if i ∉ S },

it follows that ⟨f, g_{x_i=a}⟩ ≥ ⟨f, g⟩ − C · ε for a = 1 or for a = −1.
Now g_{x_i=a} is a (k−1)-junta, so by the induction hypothesis (applied with accuracy parameter ε/√2), there exists some (k−1)-junta g′ such that

  ⟨f, g′⟩ ≥ ⟨f, g_{x_i=a}⟩ − ε/√2 ≥ ⟨f, g⟩ − C · ε − ε/√2 = ⟨f, g⟩ − ε

and for all x_i relevant to g′, there exists S ⊆ [n] with |S| ≤ k − 1, i ∈ S, and

  |f̂(S)| ≥ C · 2^{−(k−2)/2} · ε/√2 = C · 2^{−(k−1)/2} · ε.

One might wonder whether, for f: {−1,1}^n → {−1,1} and a k-junta g: {−1,1}^n → {−1,1}, ⟨f, g⟩ ≥ ε does not already imply that for every relevant variable x_i of g, there exists S ⊆ [n] with |S| ≤ k and i ∈ S such that (3) holds. First of all, if f(x) = x_1 ∧ ... ∧ x_k (interpreting −1 as true and +1 as false), then for all S ⊆ [n] with S ≠ ∅, |f̂(S)| ≤ 2^{−k+1}. So taking g = f, the prior statement cannot hold. Still, one might at least hope for a similar statement with the right-hand side of (3) replaced by something of the form 2^{−poly(k)} · poly(ε). However, if we take f as above and g(x) = x_2 ∧ ... ∧ x_{k+1}, then ⟨f, g⟩ = 1 − 2^{−k+1}, but for all S ⊆ [n] with k + 1 ∈ S, f̂(S) = 0 (since x_{k+1} is not relevant to f).

Next, we need a tool for finding large low-degree Fourier coefficients of an arbitrary Boolean function, having access to a labeled random walk. Such an algorithm is said to perform the Bounded Sieve (see [6, Definition 3]). Bshouty et al. [6] have shown that such an algorithm exists for the random walk model. More precisely, Theorems 7 and 9 in [6] imply:

Theorem 2 (Bounded Sieve, [6]). There is an algorithm BoundedSieve(f, θ, ℓ, δ) that on input θ > 0, ℓ ∈ [n], and δ > 0, given access to RW(f) for some f: {−1,1}^n → {−1,1}, outputs a list of sets S ⊆ [n] with f̂(S)² ≥ θ/2 such that with probability at least 1 − δ, every S ⊆ [n] with |S| ≤ ℓ and f̂(S)² ≥ θ appears in it.
The algorithm runs in time poly(n, 2^ℓ, 1/θ, log(1/δ)), and the list contains at most 2/θ sets S.

For a sample S = (x^i, f(x^i))_{i=0,...,m}, a set J ⊆ [n] of size k, and an assignment α ∈ {−1,1}^{|J|}, let

  s⁺_α = |{i ∈ [m] | x^i|_J = α ∧ f(x^i) = +1}|  and  s⁻_α = |{i ∈ [m] | x^i|_J = α ∧ f(x^i) = −1}|.

Obviously, a J-junta h_J that best agrees with f on the points in S is given by h_J(x) = sgn(s⁺_{x|_J} − s⁻_{x|_J}). In other words, h_J(x) takes on the value a ∈ {−1,1} that is taken on by the majority of labels in the sub-cube that fixes the coordinates in J to α = x|_J. This function is unique except for the choice of h_J(x) at points x with s⁺_{x|_J} = s⁻_{x|_J}. The function h_J differs from the labels of S in err(J) = Σ_{α∈{−1,1}^{|J|}} err(α) points, where err(α) = min{s⁺_α, s⁻_α}. By Proposition 1, if S is sufficiently large, then with high probability, the function h_J approximately minimizes ∆(h, f) among all J-juntas h.

We are now ready to show our main result:

Theorem 3 (Restatement of Theorem 1). The class of k-juntas g: {−1,1}^n → {−1,1} is properly agnostically learnable with accuracy ε and confidence 1 − δ in the random walk model in time poly(n, 2^{k²}, (1/ε)^k, log(1/δ)).

Proof. In the following, we show that Algorithm 1 below is an agnostic learning algorithm with the desired running time bound.

Algorithm 1 LearnJuntas
1: Input: k, ε, δ
2: Access to RW(f) for some f: {−1,1}^n → {−1,1}
3: Run BoundedSieve(f, (1 − 1/√2)² · 2^{−k+1} · ε², k, δ/2) and let T be the returned list.
4: Let R = ∪{S | S ∈ T}.
5: For all J ⊆ R with |J| = k:
6:   Compute err(J).
7: Return h_{J_opt} for some J_opt that minimizes err(J).

Denote the class of n-variate k-juntas by C and let γ = opt_C(f).
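Steps 5-7 of Algorithm 1 (the majority-vote fit h_J and the disagreement count err(J)) can be sketched as follows. This is our own illustration, not the paper's code: the function names are hypothetical, the sample and the candidate variable set R are placeholders, and the Bounded Sieve of Step 3 is not implemented here.

```python
from collections import Counter
from itertools import combinations

def fit_junta(sample, J):
    """Best J-junta on the sample: majority vote per restriction x|_J.

    Returns (votes, err_J), where votes maps each observed assignment
    alpha in {-1,1}^|J| to the majority label, and
    err_J = sum over alpha of min(s+_alpha, s-_alpha).
    """
    counts = Counter()
    for x, y in sample:
        alpha = tuple(x[j] for j in J)
        counts[(alpha, y)] += 1
    votes, err = {}, 0
    for alpha in {a for (a, _) in counts}:
        s_plus, s_minus = counts[(alpha, 1)], counts[(alpha, -1)]
        votes[alpha] = 1 if s_plus >= s_minus else -1  # ties broken arbitrarily
        err += min(s_plus, s_minus)                    # err(alpha)
    return votes, err

def best_subset(sample, R, k):
    """Steps 5-7: among all k-subsets J of R, pick one minimizing err(J)."""
    return min(combinations(sorted(R), k),
               key=lambda J: fit_junta(sample, J)[1])
```

For example, on a sample labeled by the 2-junta x_1·x_2 with R = {1, 2, 3, 4}, `best_subset` recovers the two relevant coordinates and `fit_junta` attains err(J) = 0.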
We prove that, with probability at least 1 − δ, ∆(h_{J_opt}, f) ≤ γ + ε. Let g ∈ C with ∆(f, g) = γ, so that ⟨f, g⟩ = 1 − 2γ. By Lemma 3, there exists g′ ∈ C such that ⟨f, g′⟩ ≥ 1 − 2γ − ε (equivalently, ∆(f, g′) ≤ γ + ε/2) and for all relevant variables x_i of g′, there exists S ⊆ [n] with |S| ≤ k, i ∈ S, and f̂(S)² ≥ (1 − 1/√2)² · 2^{−(k−1)} · ε². Consequently, with probability at least 1 − δ/2, the list T returned in Step 3 of the algorithm contains all of these sets S, and thus R contains all relevant variables of g′. The Bounded Sieve subroutine runs in time poly(n, 2^k, 1/ε, log(1/δ)).

The set J_opt is chosen such that the corresponding J_opt-junta h_{J_opt} minimizes the number of disagreements with the labels among all k-juntas with relevant variables in R. Denote the class of these juntas by C(R). Since |T| ≤ 2 · (1 − 1/√2)^{−2} · 2^{k−1}/ε² ≤ 12 · 2^k/ε², we have |R| ≤ k|T| ≤ 12 · k · 2^k/ε². Consequently, R contains

  (|R| choose k) ≤ (e|R|/k)^k ≤ (12e · 2^k/ε²)^k = poly(2^{k²}, (1/ε)^k)

subsets of size k, and log|C(R)| ≤ log(2^{2^k} · (|R| choose k)) = poly(2^{k²}, (1/ε)^k). By Proposition 1, with probability at least 1 − δ/2,

  ∆(h_{J_opt}, f) ≤ opt_{C(R)}(f) + ε/2,

provided that poly(n, log|C(R)|, 1/ε, log(1/δ)) = poly(n, 2^{k²}, (1/ε)^k, log(1/δ)) examples are drawn. Since g′ ∈ C(R), we obtain ∆(h_{J_opt}, f) ≤ ∆(g′, f) + ε/2 ≤ γ + ε. The total running time of the algorithm is polynomial in n, 2^{k²}, (1/ε)^k, and log(1/δ).

References

[1] David Aldous and Umesh V. Vazirani. A Markovian extension of Valiant's learning model. Inf. Comput., 117(2):181–186, 1995.

[2] Dana Angluin and Philip D. Laird. Learning from noisy examples. Machine Learning, 2(4):343–370, April 1988.

[3] Peter L. Bartlett, Paul Fischer, and Klaus-Uwe Höffgen.
Exploiting random walks for learning. Inform. and Comput., 176(2):121–135, 2002.

[4] Avrim Blum and Pat Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1-2):245–271, December 1997.

[5] Nader H. Bshouty and Vitaly Feldman. On using extended statistical queries to avoid membership queries. J. Mach. Learn. Res., 2(3):359–396, August 2002.

[6] Nader H. Bshouty, Elchanan Mossel, Ryan O'Donnell, and Rocco A. Servedio. Learning DNF from random walks. J. Comput. System Sci., 71(3):250–265, October 2005.

[7] David Gamarnik. Extension of the PAC framework to finite and countable Markov chains. IEEE Trans. Inform. Theory, 49(1):338–345, 2003.

[8] David Gillman. A Chernoff bound for random walks on expander graphs. SIAM J. Comput., 27(4):1203–1220, August 1998.

[9] Parikshit Gopalan, Adam Tauman Kalai, and Adam R. Klivans. Agnostically learning decision trees. In Richard E. Ladner and Cynthia Dwork, editors, Proceedings of the 40th Annual ACM Symposium on Theory of Computing (STOC '08), Victoria, British Columbia, Canada, May 17-20, 2008, pages 527–536. ACM Press, 2008.

[10] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc., 58:13–30, 1963.

[11] Michael J. Kearns, Robert E. Schapire, and Linda Sellie. Toward efficient agnostic learning. Machine Learning, 17(2-3):115–141, 1994.

[12] Eyal Kushilevitz and Yishay Mansour. Learning decision trees using the Fourier spectrum. SIAM J. Comput., 22(6):1331–1348, December 1993.

[13] Pascal Lézaud. Chernoff-type bound for finite Markov chains. Ann. Appl. Probab., 8(3):849–867, 1998.

[14] Elchanan Mossel, Ryan W. O'Donnell, and Rocco A. Servedio. Learning functions of k relevant variables. J. Comput. System Sci., 69(3):421–434, November 2004.

[15] Sébastien Roch.
On learning thresholds of parities and unions of rectangles in random walk models. Random Structures Algorithms, 31(4):406–417, 2007.

[16] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Tom Fawcett and Nina Mishra, editors, Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21–24, 2003, Washington, DC, USA, pages 928–936. AAAI Press, 2003.

A Independence of Points in Random Walks

An updating random walk is a sequence x_0, (x_1, i_1), (x_2, i_2), ..., where x_0 is drawn uniformly at random, each i_t ∈ [n] is a coordinate drawn uniformly at random, and x_t is set to x_{t−1} or to x_{t−1} ⊙ e_{i_t}, each with probability 1/2. We say that in step t, coordinate i_t is updated.

Given an updating random walk x_0, (x_1, i_1), (x_2, i_2), ..., all variables will with high probability have been updated after ℓ = Ω(n log n) steps, in which case x_0 and x_ℓ can be regarded as independent uniformly distributed random variables. More formally, let X_ℓ be the set of all updating random walks of length ℓ, and let X_ℓ^good be the set of updating random walks in which every variable has been updated (at least once) after ℓ steps. Then, conditional on the updating random walk belonging to X_ℓ^good, x_0 and x_ℓ are independent (and uniformly distributed).

Since the updating random walk model is only a technical device, we would like to make similar statements about the "usual" random walk model, so that our analyses need not switch back and forth between the two models (although that would be a reasonable alternative). We proceed as follows. Given a (non-updating) random walk x_0, x_1, ..., we perform an additional experiment to simulate an updating random walk (see also [6]).
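As an illustration only (this code is not part of the paper), the updating random walk defined above admits a direct simulation. The sketch below assumes ±1 coordinates and uses Python's standard random module; the function name is ours.

```python
import random

def updating_random_walk(n, steps, rng=random):
    """Simulate an updating random walk on {-1,+1}^n.

    Returns [x_0, (x_1, i_1), ..., (x_steps, i_steps)]: x_0 is uniform,
    and in step t a uniformly random coordinate i_t is resampled
    uniformly from {-1,+1} (equivalently: kept or flipped, each with
    probability 1/2).
    """
    x = [rng.choice([-1, 1]) for _ in range(n)]
    walk = [list(x)]
    for _ in range(steps):
        i = rng.randrange(n)          # coordinate i_t, uniform in [n]
        x[i] = rng.choice([-1, 1])    # fresh uniform value = keep/flip w.p. 1/2
        walk.append((list(x), i))
    return walk
```

Once every coordinate index has appeared among i_1, ..., i_ℓ, the endpoint x_ℓ is a fresh uniform point independent of x_0, which is exactly the content of conditioning on X_ℓ^good above.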
We then accept the (original) random walk if the additional experiment leads to a good updating random walk. It will then follow that, conditional on the random walk being accepted, x_0 and x_ℓ are independent. Our algorithms will of course not perform this experiment. Instead, we will argue in the analyses that if we performed the additional experiment, then we would accept the given random walk with a certain (high) probability (taken over the draw of the random walk and the additional experiment), implying that certain points are independent.

Perform the following random experiment: Given a random walk X of length ℓ, draw a sequence F = (F_1, F_2, ...) of Bernoulli trials with Pr[F_j = 1] = Pr[F_j = 0] = 1/2 for each j until F contains ℓ ones. (If this has not happened after, say, L = poly(ℓ) steps, then reject X.) Otherwise, let ℓ′ denote the length of F and construct a sequence I = (i_1, ..., i_{ℓ′}) of variable indices as follows. Denote by j_1 < ... < j_ℓ the ℓ positions in F with F_j = 1. For each k ∈ [ℓ], let i_{j_k} = p_k, where p_k is the position in X that is flipped in the k-th step. For each j ∈ [ℓ′] \ {j_1, ..., j_ℓ}, independently draw an index i_j ∈ [n] with uniform probability. Accept X if {i_1, ..., i_{ℓ′}} = [n]; otherwise reject X.

Lemma 4 (Formal restatement of Lemma 1). Let X = (x_0, ..., x_ℓ) be a random walk of length ℓ ≥ n ln(2n/δ) and perform the experiment above. Then X is accepted with probability at least 1 − δ. Moreover, conditional on X being accepted, the random variables x_0 and x_ℓ are independent and uniformly distributed.

Proof. First, by choosing L appropriately, we can ensure with probability at least 1 − δ/2 that F contains at least ℓ ones. By construction, the sequence x′_0, (x′_1, i_1), (x′_2, i_2), ..., (x′_{ℓ′}, i_{ℓ′}) with x′_0 = x_0, x′_{j_k} = x_k for k ∈ [ℓ], and x′_j = x′_{j−1} for j ∈ [ℓ′] \ {j_1, ..., j_ℓ} is distributed as an updating random walk of length ℓ′. Note that unlike in the original updating random walk model, we determine the sequence F of updating outcomes before we determine the positions to be updated. Moreover, the choice of the coordinates to be updated in the positions where F_j = 1 is incorporated in the draw of the original walk. The subsequence x′_0, x′_{j_1}, x′_{j_2}, ..., x′_{j_ℓ} is equal to the original walk x_0, x_1, x_2, ..., x_ℓ.

The probability that {i_1, ..., i_{ℓ′}} ⊊ [n] is at most n · (1 − 1/n)^{ℓ′} ≤ n · (1 − 1/n)^ℓ ≤ δ/2 since ℓ ≥ n ln(2n/δ). Consequently, with total probability at least 1 − δ, the random walk is accepted. In this case, every coordinate has eventually been updated during the ℓ′ steps of the updating random walk. Thus, for each coordinate j, (x_ℓ)_j = (x′_{ℓ′})_j is independent of (x_0)_j = (x′_0)_j, i.e., x_0 and x_ℓ are independent and uniformly distributed (conditional on X being accepted).

B An Elementary Proof of Lemma 2

To estimate the convergence rate of empirical averages to their expectations, we need the following standard Chernoff–Hoeffding bound [10]: for a sequence of independent identically distributed random variables X_1, ..., X_m with E[X_i] = µ that take values in [−1, 1],

    Pr[ |(1/m) Σ_{i=1}^m X_i − µ| > ε ] ≤ 2 e^{−ε² m / 2}.    (4)

Proof of Lemma 2. For each j ∈ {0, ..., N − 1}, the points x_{iN+j}, i ∈ {0, ..., m/N − 1}, are with probability at least 1 − (m/N − 1)δ pairwise independent by Lemma 1. In this case, the values f(x_{iN+j}), 0 ≤ i ≤ m/N − 1, are independent and identically distributed samples of the random variable f(x) with x ∈ {−1, 1}^n uniformly distributed.
By the Hoeffding bound,

    Pr[ |(N/m) Σ_{i=0}^{m/N−1} f(x_{iN+j}) − E_x[f(x)]| > ε ] ≤ 2 exp(−mε²/(2N)).

Thus, the probability that |(N/m) Σ_{i=0}^{m/N−1} f(x_{iN+j}) − E_x[f(x)]| > ε for some j ∈ {0, ..., N − 1} is at most 2N exp(−mε²/(2N)). Finally, we have

    |(1/m) Σ_{i=0}^{m−1} f(x_i) − E_x[f(x)]|
        = (1/m) | Σ_{j=0}^{N−1} ( Σ_{i=0}^{m/N−1} f(x_{iN+j}) − (m/N) E_x[f(x)] ) |
        ≤ (1/m) Σ_{j=0}^{N−1} (m/N) ε = ε

with probability at least 1 − 2N exp(−mε²/(2N)) ≥ 1 − δ.
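As a small illustration of the averaging scheme in this proof (again not code from the paper), the sketch below splits m labels collected along a walk into N subsequences of spacing N, averages each residue class, and combines the N averages; the name spaced_mean is ours.

```python
def spaced_mean(labels, N):
    """Estimate E_x[f(x)] from labels f(x_0), ..., f(x_{m-1}) along a walk.

    Points N apart (x_j, x_{j+N}, ...) behave like independent uniform
    samples (Lemma 1), so each residue class j mod N yields an honest
    empirical average; the mean of the N per-class averages equals the
    overall empirical average (1/m) * sum(labels).
    """
    m = len(labels)
    assert m % N == 0, "the proof assumes N divides m"
    per_class = [
        sum(labels[i * N + j] for i in range(m // N)) * N / m
        for j in range(N)
    ]
    return sum(per_class) / N
```

If every per-class average is within ε of E_x[f(x)], the triangle inequality gives the same guarantee for the combined estimate, exactly as in the displayed chain of inequalities above.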