Decision trees are PAC-learnable from most product distributions: a smoothed analysis

Adam Tauman Kalai (Microsoft Research New England)
Shang-Hua Teng* (Boston University)

October 23, 2018

Abstract

We consider the problem of PAC-learning decision trees, i.e., learning a decision tree over the n-dimensional hypercube from independent random labeled examples. Despite significant effort, no polynomial-time algorithm is known for learning polynomial-sized decision trees (or even trees of any super-constant size), even when examples are assumed to be drawn from the uniform distribution on {0,1}^n. We give an algorithm that learns arbitrary polynomial-sized decision trees for most product distributions. In particular, consider a random product distribution where the bias of each bit is chosen independently and uniformly from, say, [.49, .51]. Then with high probability over the parameters of the product distribution and the random examples drawn from it, the algorithm will learn any tree. More generally, in the spirit of smoothed analysis, we consider an arbitrary product distribution whose parameters are specified only up to a [−c, c] accuracy (perturbation), for an arbitrarily small positive constant c.

1 Introduction

Decision trees are classifiers at the center stage of both the theory and practice of machine learning. Despite decades of research, no polynomial-time algorithm is known for PAC-learning polynomial-sized (or any super-constant-sized) Boolean decision trees over {0,1}^n, even assuming examples are drawn from the uniform distribution on inputs. The situation is no better for any other constant-bounded product distribution. In light of this, what we show is perhaps surprising: every decision tree can be learned from most product distributions.
Hence, the uniform-distribution assumption common in learning (and other fields) may not be simplifying matters as one might hope.

1.1 Related work

Learning decision trees in Valiant's PAC model [13] requires learning an arbitrary tree from polynomially many random labeled examples, drawn independently from an arbitrary distribution and labeled according to the tree. Note that the output of the learning algorithm need not be a decision tree: any function which well approximates the target tree on future examples, drawn from the same distribution as the training data, suffices. The uniform-PAC model of learning assumes that the data is drawn from the uniform distribution. In previous work, size-s trees were shown to be PAC-learnable in time n^O(log s) [3, 1]. Juntas, functions that depend on only r "relevant" bits (a special case of decision trees of size 2^r), can be uniform-PAC learned faster: in time roughly O(n^0.7r) [10].

A variety of alternatives to PAC learning have been considered to circumvent the difficulties. Random depth-O(log n) trees have been shown to be properly¹ learnable, with high probability, from uniform random examples by Jackson and Servedio [7]. Decision trees have also been shown to be learnable from data coming from a random walk, i.e., where consecutive training examples differ in a single random position [2]. A seminal result of Kushilevitz and Mansour (KM) [8], using an algorithm similar to Goldreich–Levin [4], shows that decision trees are uniform-PAC learnable from membership queries (i.e., black-box access to the function) in polynomial time.

* This work was done while the author was visiting Microsoft Research New England.
Since KM proved to be an essential ingredient in further work such as learning DNFs [6] and agnostic learning [5], as well as in applications beyond learning, the present work gives hope for a number of questions discussed in Section 6.

We consider a "smoothed learning" model inspired by smoothed analysis, which Spielman and Teng introduced to explain why the simplex method for linear programming (LP) usually runs in polynomial time [12]. Roughly speaking, they show that if each parameter of an LP is perturbed by a small amount, then the simplex method will run in polynomial time with high probability (in fact, the expected runtime will be polynomial). For LPs arising from nature or business (as opposed to reductions from other computational problems), the parameters are measurements or estimates that have some inherent inaccuracy or uncertainty. Hence, the model is reasonable for a large class of interesting LPs.

1.2 Main result

We suppose that the examples come from a product distribution P_µ, specified by µ ∈ [0,1]^n, where µ_i = E_{x∼P_µ}[x_i]. An illustrative instantiation of our main result is the following. Take any decision tree and pick a random µ ∈ [0.49, 0.51]^n. Then, with high probability (over µ and the random examples from P_µ), our algorithm will output a polynomial threshold function which is a good approximation to the tree. Since P_(.5,...,.5) is the uniform distribution, the choice of µ ∈ [0.49, 0.51]^n is close in spirit² to the uniform distribution.

More generally, fix any arbitrarily small constant c ∈ (0, 1/4). An adversary, if you will, chooses an arbitrary decision tree f and an arbitrary µ̄ ∈ [2c, 1 − 2c]^n, but the actual product distribution will have parameters µ = µ̄ + ∆, where ∆ ∈ [−c, c]^n is a uniformly random perturbation. Then, a polynomial number of examples will be drawn from P_µ.
With high probability over the perturbation ∆ and the data drawn from P_{µ̄+∆}, the algorithm will output a function which is very close to f. The main theorem we prove is the following.

Theorem 1. Let c ∈ (0, 1/4). Then there is a univariate polynomial q such that, for any integers n, s ≥ 1, reals ε, δ > 0, function f : {0,1}^n → {−1,1} computed by a size-s decision tree, and any µ̄ ∈ [2c, 1 − 2c]^n, with probability ≥ 1 − δ over ∆ chosen uniformly at random from [−c, c]^n and m ≥ q(ns/(δε)) training examples (x¹, f(x¹)), ..., (x^m, f(x^m)), where each x^j is drawn independently from P_µ (where µ = µ̄ + ∆), the output of algorithm L is h with

Pr_{x∼P_µ}[h(x) ≠ f(x)] ≤ ε.

Algorithm L runs in polynomial time, i.e., in time poly(n, m), and outputs a polynomial threshold function.

It is worth making a few remarks about this theorem. Worst-case analysis is beautiful but sometimes leads to artificial limitations, especially in domains like learning, where we do not actually believe that an adversary chooses the problem. In this sense, it is natural to slightly weaken the power of the adversary. Here, we have assumed that the adversary can only specify the product distribution up to [−c, c] accuracy, or rather that the adversary may have a trembling hand (to misuse a term of Selten [11]).

¹ The output of their algorithm is a decision-tree classifier.
² Statistically speaking, this distribution is quite different from the uniform distribution. Learning from any µ ∈ [1/2 − √(1/n), 1/2 + √(1/n)]^n would likely be as difficult as learning from the uniform distribution.
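Concretely, the data-generation process of Theorem 1 can be sketched as follows (a minimal simulation; the toy tree, biases, and constants below are illustrative choices, not values from the paper):

```python
import random

def perturbed_product_examples(mu_bar, c, f, m, rng):
    """Sample m labeled examples from P_mu, where mu = mu_bar + Delta and
    Delta is a uniformly random perturbation in [-c, c]^n (the smoothed model)."""
    n = len(mu_bar)
    mu = [mu_bar[i] + rng.uniform(-c, c) for i in range(n)]
    examples = []
    for _ in range(m):
        # Bit i of x in {0,1}^n is 1 with probability mu[i], independently.
        x = [1 if rng.random() < mu[i] else 0 for i in range(n)]
        examples.append((x, f(x)))
    return mu, examples

rng = random.Random(0)
# Hypothetical size-3 target tree: split on x_0, then on x_1.
f = lambda x: 1 if x[0] == 1 and x[1] == 1 else -1
mu_bar = [0.5] * 8                      # the adversary's stated biases
mu, data = perturbed_product_examples(mu_bar, 0.01, f, 1000, rng)
```

With mu_bar = (.5, ..., .5) and c = .01, the perturbed biases land in [.49, .51], exactly the illustrative instantiation from the abstract.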
As an example of smoothed analysis, ours is interesting because, unlike linear programming, where worst-case polynomial-time alternatives to the simplex method were already known, there are no known efficient algorithms for uniform-PAC learning decision trees.

In learning, the standard uniform-PAC model already "assumes away" any adversarial connection between the function being learned and the distribution over data. The uniform-distribution assumption is made with the hope that the resulting algorithms may be useful for learning, or at least shed light on the issues involved in the problem; it is a natural first step in designing general-distribution learning algorithms. We hope that the smoothed analysis serves a similar purpose.

1.3 The approach

The intuition behind our algorithm is quite simple. It will turn out to be notationally convenient to consider examples x ∈ {−1,1}^n. For starters, consider a decision tree that computes a log(n)-sized parity f(x) = ∏_{i∈S} x_i, for some set S ⊆ {1, 2, ..., n} with |S| = log₂(n). This can be done using a size-n tree. Under the uniform distribution on examples, each bit x_i (or any subset of ≤ log(n) − 1 bits) is uncorrelated with f. Now take a product distribution with random mean vector µ ∈ [−c, c]^n and define x′ = x − µ, so that E[x′_i] = 0. Then with probability ≥ 1 − δ, f(x) has a significant (poly(δ/n)) correlation with each x′_i for i ∈ S, and no correlation with x′_i for any i ∉ S. Hence, it is easy to find the relevant bits.

Now, a polynomial-size tree may, in general, involve all n bits, so finding the relevant bits is not sufficient. As is standard for Fourier learning under product distributions, one can write f(x) as a polynomial in x′. Each coefficient of a term ∏_{i∈S} x′_i can be estimated in a straightforward manner from random examples.
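A quick numerical illustration of this parity example (the biases, sample size, and threshold here are ad-hoc choices for the demonstration, not values from the paper): under a biased product distribution, the label correlates with the centered bits x′_i exactly on the relevant coordinates.

```python
import random

rng = random.Random(1)
n, m = 8, 20000
# Hypothetical biases: x_i in {-1,+1} with E[x_i] = mu[i].
mu = [0.2, -0.25, 0.1, 0.15, -0.1, 0.05, -0.2, 0.12]
S = {0, 1}                     # hidden parity: f(x) = x_0 * x_1

def sample():
    # Pr[x_i = 1] = (1 + mu[i]) / 2, so E[x_i] = mu[i].
    return [1 if rng.random() < (1 + mu[i]) / 2 else -1 for i in range(n)]

data = [(x, x[0] * x[1]) for x in (sample() for _ in range(m))]

# Correlation of the label with each centered bit x_i' = x_i - mu_i:
# nonzero exactly for i in S (e.g. E[f * x_0'] = mu_1 (1 - mu_0^2) ≈ -0.24).
corr = [sum(y * (x[i] - mu[i]) for x, y in data) / m for i in range(n)]
relevant = {i for i in range(n) if abs(corr[i]) > 0.1}
```

For this seed and sample size, `relevant` recovers exactly the hidden set S = {0, 1}; under the uniform distribution (mu = 0) every one of these correlations would vanish.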
However, finding the heavy coefficients (those with large magnitude) is a bit like finding a number of needles in a haystack. This is the most fascinating aspect of the problem: it requires so-called feature discovery or feature construction algorithms. These algorithms hence tie together a fundamental problem in both the theory and practice of learning: many claim that the heart of the problem of machine learning is really that of finding or creating good features [9].

The key property we prove is the following, with high probability over µ ∈ [−c, c]^n. If the coefficient in f(x′) of a term ∏_{i∈T} x′_i is large, then so is the coefficient of ∏_{i∈S} x′_i for each S ⊆ T. This makes finding all the large coefficients easy using a top-down approach. The proof of this fact relies on two properties: there is a simple relationship between corresponding coefficients under different product distributions, and a low-degree nonzero multilinear polynomial cannot be too close to 0 too often (this is a continuous generalization of the Schwartz–Zippel theorem). In our simple example, it is easy to see, by expanding f(x) = ∏_{i∈S} x_i = ∏_{i∈S}(x′_i + µ_i), that all coefficients of terms ∏_{i∈T} x′_i, for T ⊆ S, will be nonzero with probability 1.

Another perspective on the algorithm is that it gives a substitute for KM (equivalently, Goldreich–Levin) using random examples instead of adaptive queries. It is a weaker substitute in that it is only capable of finding large coefficients on terms of degree O(log n).

2 Organization

Preliminaries are given in Section 3. Before we give the smoothed algorithm for learning, we prove a property about Fourier coefficients under random product distributions in Section 4. We then give the algorithm and analysis in Section 5. Conclusions and future work are discussed in Section 6.
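Before the formal development, the top-down search of Section 1.3 can be illustrated concretely (a simplified sketch in the spirit of Algorithm L from Section 5; the biases, threshold, degree cap, and target parity are hypothetical choices for the demonstration):

```python
import math
import random

def heavy_coefficients(data, mu, max_degree, threshold):
    """Grow candidate index sets level by level, keeping a set T only if the
    empirical normalized coefficient E[y * z_T] is large in magnitude; the key
    property is that subsets of heavy terms are themselves heavy."""
    n, m = len(mu), len(data)
    # z-normalized bits: mean 0, variance 1 under the product distribution
    z = [[(x[i] - mu[i]) / math.sqrt(1 - mu[i] ** 2) for i in range(n)]
         for x, _ in data]
    ys = [y for _, y in data]

    def estimate(T):  # empirical estimate of the coefficient on the term z_T
        return sum(ys[j] * math.prod(z[j][i] for i in T) for j in range(m)) / m

    frontier = {frozenset()}
    heavy = {frozenset(): estimate(frozenset())}
    for _ in range(max_degree):
        new_frontier = set()
        for S in frontier:
            for i in range(n):
                if i in S:
                    continue
                T = frozenset(S | {i})
                if T in heavy:
                    continue
                e = estimate(T)
                if abs(e) >= threshold:  # keep only terms with a large estimate
                    heavy[T] = e
                    new_frontier.add(T)
        frontier = new_frontier
    return heavy

rng = random.Random(2)
n, m = 8, 20000
mu = [0.2, -0.25, 0.1, 0.15, -0.1, 0.05, -0.2, 0.12]  # hypothetical biases
data = []
for _ in range(m):
    x = [1 if rng.random() < (1 + mu[i]) / 2 else -1 for i in range(n)]
    data.append((x, x[0] * x[1]))  # target: the parity x_0 * x_1
heavy = heavy_coefficients(data, mu, max_degree=2, threshold=0.1)
```

Because the singleton coefficients on {0} and {1} are themselves large under this biased distribution, the search reaches the heavy pair {0, 1} without ever estimating more than O(n) terms per level.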
3 Preliminaries

Let N = {1, 2, ..., n}. As mentioned, for notational ease we consider examples (x, y) with x ∈ {−1,1}^n and y ∈ {−1,1}. For S ⊆ N and x ∈ R^n, let x_S denote ∏_{i∈S} x_i. Any function f : {−1,1}^n → R can be written uniquely as a multilinear polynomial in x,

f(x) = ∑_{S⊆N} f̂(S) x_S.

The f̂(S)'s are called the Fourier coefficients. The degree of a multilinear polynomial is deg(f) = max{|S| : f̂(S) ≠ 0}, and with a slight abuse of terminology, we say a polynomial is degree-d if deg(f) ≤ d. Henceforth we write ∑_S to denote ∑_{S⊆N}, and ∑_{|S|=d} to denote the sum over S ⊆ N such that |S| = d; similarly for ∑_{|S|>d}, and so forth. We write x ∈_U A to denote x chosen uniformly at random from the set A.

One may define an inner product between functions f, g : {−1,1}^n → R by

⟨f, g⟩ = E_{x∈_U{−1,1}^n}[f(x) g(x)].

It is easy to see that ⟨x_S, x_T⟩ is 1 if S = T and 0 otherwise. Hence, the 2^n different x_S's form an orthonormal basis for the set of real-valued functions on {−1,1}^n. We thus have ⟨f, g⟩ = ∑_{S⊆N} f̂(S) ĝ(S), and Parseval's equality,

⟨f, f⟩ = ∑_{S⊆N} f̂²(S) = E_{x∈_U{−1,1}^n}[f²(x)].

This implies that for any f : {−1,1}^n → [−1,1], ∑_S f̂²(S) ≤ 1. It is also useful for bounding E[(f(x) − g(x))²] = ∑_S (f̂(S) − ĝ(S))².

A product distribution D_µ over {−1,1}^n is parameterized by its mean vector µ ∈ [−1,1]^n, where µ_i = E_{x∼D_µ}[x_i] and the bits are independent. (We now use D to avoid confusion with the product distributions P over {0,1}^n discussed in the introduction.) The uniform distribution is D_0. We say D_µ is c-bounded if µ_i ∈ [−1 + c, 1 − c] for all i. Fix any constant c ∈ (0, 1/2).
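The orthonormality and Parseval facts above can be verified exactly by enumeration on a small cube (a self-contained check; the Boolean f below is an arbitrary illustrative choice):

```python
from itertools import product, combinations

# Exact check of the Fourier preliminaries on the cube {-1,1}^4:
# the characters x_S form an orthonormal basis, and Parseval's equality holds.
n = 4
cube = list(product([-1, 1], repeat=n))

def chi(S, x):                 # x_S = prod_{i in S} x_i (empty product = 1)
    p = 1
    for i in S:
        p *= x[i]
    return p

subsets = [S for d in range(n + 1) for S in combinations(range(n), d)]

def inner(f, g):               # <f, g> = E_{x uniform}[f(x) g(x)]
    return sum(f(x) * g(x) for x in cube) / len(cube)

# Orthonormality: <x_S, x_T> = 1 if S = T, else 0.
for S in subsets:
    for T in subsets:
        ip = inner(lambda x: chi(S, x), lambda x: chi(T, x))
        assert abs(ip - (1.0 if S == T else 0.0)) < 1e-12

# Parseval: sum_S fhat(S)^2 = E[f(x)^2], which is 1 for Boolean f.
f = lambda x: 1 if (x[0] == 1 or (x[1] == 1 and x[2] == 1)) else -1
fhat = {S: inner(f, lambda x: chi(S, x)) for S in subsets}
assert abs(sum(v * v for v in fhat.values()) - 1.0) < 1e-12
```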
We assume we have some fixed 2c-bounded product distribution with mean µ̄ ∈ [−1 + 2c, 1 − 2c]^n, and that a random perturbation ∆ ∈ [−c, c]^n is chosen uniformly at random; the resulting product distribution has µ = µ̄ + ∆. Note that D_µ is c-bounded; we call it the perturbed product distribution.

For any distribution D on {−1,1}^n, one can similarly define an inner product ⟨f, g⟩_D = E_{x∼D}[f(x) g(x)]. In the case of a product distribution D_µ, it is natural to normalize the coordinates so that they have mean 0 and variance 1. Let z(x, µ) ∈ R^n be the vector defined by

z_i(x, µ) = (x_i − µ_i)/√(1 − µ_i²).

When µ and x are understood from context, we write just z. This normalization gives E_{x∼D_µ}[z_i(x, µ)] = 0 and E_{x∼D_µ}[z_i²(x, µ)] = 1. Let z_S = z_S(x, µ) = ∏_{i∈S} z_i(x, µ). It is also easy to see that E_{x∼D_µ}[z_S z_T] is 1 if S = T and 0 otherwise. Hence, the 2^n different z_S's form an orthonormal basis for the set of real-valued functions on {−1,1}^n with respect to D_µ. We define the normalized Fourier coefficient, for any S ⊆ N,

f̂(S, µ) = E_{x∼D_µ}[f(x) z_S(x, µ)].   (1)

Note that this gives a straightforward means of estimating any such coefficient. Also observe that f̂(S, 0) = f̂(S) and that, for any µ ∈ (−1,1)^n,

f(x) = ∑_S f̂(S, µ) z_S(x, µ).

Finally, it will be convenient to define a partially normalized Fourier coefficient,

f̄(S, µ) = f̂(S, µ) / ∏_{i∈S} √(1 − µ_i²).

Note that if µ ∈ [−1 + c, 1 − c]^n, then we have

|f̂(S, µ)| ≤ |f̄(S, µ)| ≤ |f̂(S, µ)| / (1 − (1 − c)²)^{|S|/2} ≤ |f̂(S, µ)| / c^{|S|/2}.   (2)

In this notation, we also have

f(x) = ∑_S f̄(S, µ) ∏_{i∈S}(x_i − µ_i) = ∑_S f̄(S, µ)(x − µ)_S.

Hence, for any µ = µ̄ + ∆,

∑_S f̄(S, µ)(x − µ)_S = ∑_S f̄(S, µ̄)((x − µ) + ∆)_S.
Collecting terms gives a means for translating between product distributions µ = µ̄ + ∆:

f̄(S, µ) = ∑_{T⊇S} f̄(T, µ̄) ∆_{T∖S}.   (3)

3.1 Decision trees

A decision tree T over {−1,1}^n is a rooted binary tree in which each internal node is labeled with an integer i ∈ N, and each leaf is assigned a label of ±1. We consider Boolean decision trees, in which each internal node has exactly two children, and the two outgoing edges are labeled, one of them 1 and the other −1. The tree computes a function f_T : {−1,1}^n → {−1,1} defined recursively as follows. If the root is a leaf, then the value is simply the value of the leaf. Otherwise, say the root is labeled with i, and say its children are T_{−1} and T_1, following the labels −1 and +1, respectively. Then the value of the tree is defined to be the value computed by the subtree T_{x_i} on x, i.e., f_{T_{x_i}}(x). In other words,

f(x) = ((1 + x_i)/2) f_{T_1}(x) + ((1 − x_i)/2) f_{T_{−1}}(x).

We assume that no node appears more than once on any path down from the root to a leaf. Hence, the above function is a multilinear polynomial f : {−1,1}^n → {−1,1}, but in some cases it may be helpful to think of it as simply a multilinear polynomial f : R^n → R. The size of a decision tree is defined to be the number of leaves. We define the depth of the root of the tree to be 0. Thus a depth-d tree computes a degree-d multilinear polynomial.

4 Fourier properties for random product distributions

The following lemmas show that, with high probability over the perturbation, for every coefficient f̂(S) that is sufficiently large, say |f̂(S)| > b, all subterms T ⊆ S have |f̂(T)| > a, for some a < b. It turns out that this is easier to state in terms of the partially normalized coefficients f̄(S). The following simple lemma is at the heart of the analysis.

Lemma 2.
Take any c ∈ (0, 1/2), µ̄ ∈ [−1 + c, 1 − c]^n, and let µ = µ̄ + ∆, where ∆ is chosen uniformly at random from [−c, c]^n. Let f : R^n → R be any multilinear function, f(x) = ∑_S f̄(S, µ)(x − µ)_S. Then for any T ⊆ U ⊆ N and a, b > 0,

Pr_{∆∈_U[−c,c]^n}[ |f̄(T, µ)| ≤ a  |  |f̄(U, µ)| ≥ b ] ≤ √(a/b) · (4/c)^{|U∖T|/2}.

(For events A, B, we define Pr[A | B] = 0 in the case that Pr[B] = 0.)

In order to prove Lemma 2, we give a continuous variant of the Schwartz–Zippel theorem. This lemma states that a nonzero degree-d multilinear function cannot be too close to 0 too often over x ∈ [−1,1]^n.

Lemma 3. Let g : R^n → R be a degree-d multilinear polynomial, g(x) = ∑_{|S|≤d} ĝ(S) x_S. Suppose that there exists S ⊆ N with |S| = d and |ĝ(S)| ≥ 1. Then for a uniformly chosen random x ∈ [−1,1]^n, and for any ε > 0, we have

Pr_{x∈_U[−1,1]^n}[ |g(x)| ≤ ε ] ≤ 2^d √ε.

Proof. WLOG say ĝ(D) = 1 for D = {1, 2, ..., d}, for we can always permute the variables and rescale the polynomial so that this coefficient is exactly 1. We first establish that

Pr_{x∈_U[−1,1]^n}[ |g(x)| ≤ ε ] ≤ Pr_{x∈_U[−1,1]^n}[ |x_D| ≤ ε ].   (4)

In other words, the worst case is a monomial. To see this, write

g(x) = x_1 g_1(x_2, x_3, ..., x_n) + g_2(x_2, x_3, ..., x_n).

Now, by independence, imagine picking x by first picking x_2, x_3, ..., x_n (later we will pick x_1). Let γ_i = g_i(x_2, ..., x_n) for i = 1, 2. Then consider the two sets I_1 = {x_1 ∈ R : |x_1 γ_1 + γ_2| ≤ ε} and I_2 = {x_1 ∈ R : |x_1 γ_1| ≤ ε}. These are both intervals, and they are of equal width. However, I_2 is centered at the origin.
Hence, since x_1 is chosen uniformly from [−1,1], we have that for any fixed γ_1, γ_2,

Pr_{x_1∈_U[−1,1]}[x_1 ∈ I_1] ≤ Pr_{x_1∈_U[−1,1]}[x_1 ∈ I_2],

because I_2 ∩ [−1,1] is at least as wide as I_1 ∩ [−1,1]. Hence it suffices to prove the lemma for those functions where ĝ(S) = 0 for all S for which 1 ∉ S. (In fact, this is the worst case.) By symmetry, it suffices to prove the lemma for those functions where ĝ(S) = 0 for all S for which i ∉ S, for each i = 1, 2, ..., d. After removing all terms S that do not contain D, we are left with the function x_D, establishing (4).

Now, for a loose bound, one can use Markov's inequality:

Pr[|x_D| ≤ ε] = Pr[|x_D|^{−1/2} ≥ ε^{−1/2}] ≤ E[|x_D|^{−1/2}] / ε^{−1/2} = ε^{1/2} 2^d.

In the last step, E[|x_D|^{−1/2}] = E[|x_1|^{−1/2}]^d by independence and symmetry, and a simple calculation based on the fact that |x_1| is uniform on [0,1] gives E[|x_1|^{−1/2}] = 2. Although we won't use it, we mention that one can compute a tight bound,

Pr[|x_1 ⋯ x_d| ≤ ε] = ε ∑_{i=0}^{d−1} log^i(1/ε) / i!.

This is shown by induction and

Pr[|x_1 x_2 ⋯ x_{i+1}| ≤ ε] = ∫_0^1 Pr[|x_1 x_2 ⋯ x_i| ≤ ε/t] dt.

With this lemma in hand, we are now ready to prove Lemma 2.

Proof of Lemma 2. For any set S ⊆ N, let ∆ = (∆[S], ∆[N∖S]), where ∆[S] ∈ [−c, c]^S represents the coordinates of ∆ that are in S. Let V = U ∖ T. The main idea is to imagine picking ∆ by picking ∆[N∖V] first (and later picking ∆[V]). Now, we claim that once ∆[N∖V] is fixed, f̄(U, µ) is determined. This follows from (3), using the fact that S ∖ U ⊆ N ∖ V:

f̄(U, µ) = ∑_{S⊇U} f̄(S, 0) µ_{S∖U}.

On the other hand, f̄(T, µ) is not determined by ∆[N∖V] alone. Once we have fixed ∆[N∖V], it is a polynomial in ∆[V], using (3) again:

g(∆[V]) = f̄(T, µ) = ∑_{S⊇T} f̄(S, µ̄) ∆_{S∖T}.
Clearly g is a multilinear polynomial of degree at most |V|. Most importantly, the coefficient of ∆_V in g is exactly

∑_{S⊇T∪V} f̄(S, µ̄) ∆_{S∖(T∪V)} = f̄(U, µ),

since T ∪ V = U. Hence f̄(T, µ) can be viewed as a degree-|V| polynomial in the random variable ∆[V] with leading coefficient f̄(U, µ), and we can apply Lemma 3. So, suppose that |f̄(U, µ)| ≥ b. Let g′(x) = b^{−1} c^{−|V|} g(xc), so the coefficient of x_V in g′ has magnitude b^{−1} c^{−|V|} · c^{|V|} |f̄(U, µ)| ≥ 1. By Lemma 3,

Pr_{∆[V]∈_U[−c,c]^{|V|}}[ |g(∆[V])| ≤ a ] = Pr_{x∈_U[−1,1]^{|V|}}[ |g′(x)| ≤ a b^{−1} c^{−|V|} ] ≤ √(a/b) · c^{−|V|/2} 2^{|V|}. ∎

We now observe that Lemma 2 implies that, with high probability, all sub-coefficients of large coefficients f̂(S) will be pretty large.

Lemma 4. Let f : {−1,1}^n → [−1,1]. Let α, β ≥ 0 and d ∈ N. Let c ∈ (0, 1/2), µ̄ ∈ [−1 + 2c, 1 − 2c]^n, and µ = µ̄ + ∆, where ∆ ∈ [−c, c]^n is chosen uniformly at random. Then,

Pr_{∆∈_U[−c,c]^n}[ ∃ T ⊆ U ⊆ N such that |U| ≤ d ∧ |f̂(T, µ)| ≤ α ∧ |f̂(U, µ)| ≥ β ] ≤ α^{1/2} β^{−5/2} (2/c)^{2d}.

Proof. Since µ is c-bounded, for any S ⊆ N with |S| ≤ d we have |f̂(S, µ)| ≤ |f̄(S, µ)| ≤ c^{−d/2} |f̂(S, µ)| (see (2)), so it suffices to show that, for any a, b > 0,

Pr_{∆∈_U[−c,c]^n}[ ∃ T ⊆ U ⊆ N such that |U| ≤ d ∧ |f̄(T, µ)| ≤ a ∧ |f̄(U, µ)| ≥ b ] ≤ a^{1/2} b^{−5/2} 4^d c^{−3d/2}.

This is because, for a = α c^{−d/2} and b = β, |f̂(U, µ)| ≥ β implies |f̄(U, µ)| ≥ b, and |f̂(T, µ)| ≤ α implies |f̄(T, µ)| ≤ a. We can bound the above quantity by the union bound using Lemma 2.
It is at most

∑_{|U|≤d} ∑_{T⊆U} Pr[ |f̄(T, µ)| ≤ a ∧ |f̄(U, µ)| ≥ b ]
 = ∑_{|U|≤d} ∑_{T⊆U} Pr[ |f̄(T, µ)| ≤ a  |  |f̄(U, µ)| ≥ b ] · Pr[ |f̄(U, µ)| ≥ b ]
 ≤ ∑_{|U|≤d} ∑_{T⊆U} √(a/b) (4/c)^{|U∖T|/2} Pr[ |f̄(U, µ)| ≥ b ]
 ≤ 2^d √(a/b) (4/c)^{d/2} ∑_{|U|≤d} Pr[ |f̄(U, µ)| ≥ b ]
 = 2^d √(a/b) (4/c)^{d/2} E[ |{U : |U| ≤ d ∧ |f̄(U, µ)| ≥ b}| ].

All probabilities in the above are over ∆ ∈_U [−c, c]^n. Finally, there can be at most c^{−d} b^{−2} different U ⊆ N with |U| ≤ d such that |f̄(U, µ)| ≥ b, since ∑_{|S|≤d} f̄²(S, µ) ≤ c^{−d} ∑_S f̂²(S, µ) ≤ c^{−d} for all µ, by Parseval's equality. Hence, the expected number of such U is at most c^{−d} b^{−2}, and we have the lemma. ∎

5 Algorithm

For simplicity, we suppose that the algorithm has exact knowledge of µ. In general, these parameters can be estimated to any desired inverse-polynomial accuracy in polynomial time. The algorithm is below.

Algorithm L. Inputs: (x¹, y¹), ..., (x^m, y^m) ∈ R^n × {−1,1} and µ ∈ [−1 + c, 1 − c]^n.

1. Let z^j_i := (x^j_i − µ_i)/√(1 − µ_i²), for i = 1, 2, ..., n and j = 1, 2, ..., m.
2. Let S_0 := {∅}.
3. For d = 1, 2, ..., D, where D = (1 − max_{i≤n} µ_i)(log m)/12:
   (a) Let S_d := S_{d−1} ∪ { S ∪ {i} : S ∈ S_{d−1} ∧ |(1/m) ∑_{j=1}^m y^j z^j_{S∪{i}}| ≥ m^{−1/3} }.
   (b) If |S_d| > m, then abort and output FAIL.
4. Let p be the following polynomial p : {−1,1}^n → R,
   p(x) = ∑_{S∈S_D} ((1/m) ∑_{j=1}^m y^j z^j_S) z_S(x, µ).
5. Output h(x) = sgn(p(x)).

It is well known that functions computed by decision trees can be approximated by sparse polynomials, namely by the set of "heavy" coefficients, i.e., those which have large magnitudes. These heavy coefficients tend to be on terms of small degree as well. This is true for any constant-bounded product distribution.

Lemma 5. Let c ∈ (0, 1/2], let µ ∈ [−1 + c, 1 − c]^n, d ∈ N, β > 0, and let f : {−1,1}^n → {−1,1} be computed by a size-s decision tree.
Then,

∑_{S : |f̂(S,µ)| ≥ β ∧ |S| ≤ d} f̂²(S, µ) ≥ 1 − 4(1 − c/2)^d s − 2^{d+2} β.

Hence, it remains to be shown that Algorithm L identifies these heavy coefficients and estimates them well. The proof of this lemma is deferred until after the proof of the main theorem.

Proof of Theorem 1. First, note that for any g : {−1,1}^n → R and any distribution D over {−1,1}^n,

Pr_{x∼D}[ sgn(g(x)) ≠ f(x) ] ≤ E_{x∼D}[ (g(x) − f(x))² ].

The reason is that any time sgn(g(x)) ≠ f(x), we have |g(x) − f(x)| ≥ 1, since f : {−1,1}^n → {−1,1}. Hence, it suffices to show that, with probability ≥ 1 − δ,

E_{x∼D_µ}[ (p(x) − f(x))² ] = ∑_S (p̂(S, µ) − f̂(S, µ))² ≤ ε.

This is what we do. Define the estimate of f̂(S, µ) (based on the data) to be

e(S) = (1/m) ∑_{j=1}^m y^j z^j_S.

By equation (1), we have E[e(S)] = f̂(S, µ) for any fixed S, µ, where the expectation is taken over the m data points. Of course, steps (3a) and (4) only evaluate e(S) on a small number of sets, but it is helpful to define e for all S.

Let d = (2/c) log(12s/ε), D = (1 − max_{i≤n} µ_i)(log m)/12, β = (ε/(12s))^{1+2/c}, t = m^{−1/3}, and τ = t√ε/4. Note that D ≥ c(log m)/12 > d for m = poly(s/ε) sufficiently large, so the algorithm will at least attempt to estimate all coefficients up to degree d.

We define the set of gingerbread features to be

G = { S ⊆ N : |S| ≤ d ∧ |f̂(S, µ)| ≥ β }.

These are the features that we really require for a good approximation. We define the set of breadcrumb features to be

B = { T ⊆ S : S ∈ G }.

These are the features which will help us find the gingerbread features. The set of pebble features is

P = {∅} ∪ { S ⊆ N : |S| ≤ D ∧ |f̂(S, µ)| ≥ t − τ }.

These are the features that might possibly be included in S_D on a "good" run of the algorithm. Note that, by Parseval's equality, |P| ≤ 1 + (t − τ)^{−2} ≤ 1 + 2t^{−2} ≤ 3t^{−2}.
We will argue that, with high probability, G ⊆ S_D ⊆ P. In order to do this, we also consider the set of candidate features,

C = P ∪ { S ∪ {i} : S ∈ P, i ∈ N }.

These are all the features that we might possibly estimate (i.e., evaluate e(S) on) on a "good" run of the algorithm. Let us formally call a run of the algorithm "good" if (a) |f̂(S, µ) − e(S)| ≤ τ for all S ∈ C, and (b) |f̂(S, µ)| ≥ t + τ for all S ∈ B.

First, we claim that (a) implies S_D ⊆ P. This can be seen by induction, arguing that S_i ⊆ P for all i = 0, 1, ..., D. This is trivial for i = 0. If it holds for i, then for i + 1 the features estimated on iteration i + 1 will all be in C, hence will all be estimated to within τ of their correct values; so any such feature not in P will have |e(S)| < t and will not be included in S_{i+1}. Second, we claim that (a) and (b) imply B ⊆ S_D. The proof of this is similarly straightforward by induction. So (a) and (b) imply G ⊆ S_D ⊆ P, since G ⊆ B. Note that since |P| ≤ 3t^{−2} < m, the algorithm will not abort and output FAIL in this case. Now,

∑_S (p̂(S, µ) − f̂(S, µ))² ≤ ∑_{S∈S_D} (e(S) − f̂(S, µ))² + ∑_{S∉S_D} f̂²(S, µ) ≤ |P| τ² + 4(1 − c/2)^d s + 2^{d+2} β.

This follows from |S_D| ≤ |P| and Lemma 5. Hence, a good run has

∑_S (p̂(S, µ) − f̂(S, µ))² ≤ 3t^{−2} τ² + 4(1 − c/2)^d s + 2^{d+2} β ≤ ε,

for the choice of parameters above, because 3t^{−2}τ² = (3/16)ε, 4(1 − c/2)^d s ≤ ε/3, and 2^{d+2} β ≤ ε/3. This means that every good run outputs a hypothesis of error ≤ ε.

It remains to show that the probability of a good run is at least 1 − δ, which we do by the union bound over the two events (a) and (b). By Lemma 4, property (b) fails with probability at most

(t + τ)^{1/2} β^{−5/2} (2/c)^{2d} ≤ 2m^{−1/6} (12s/ε)^{c′} ≤ δ/2,

for some constant c′ and m = poly(ns/(δε)).
Finally, it remains to show that (a) fails with probability at most δ/2. First, we need to bound |z^j_S| for each S ∈ C. Let v = 1 − max_{i≤n} µ_i ∈ [c, 1], so that D = v(log m)/12. We first observe that

|z_i(x, µ)| ≤ (2 − v)/√(1 − (1 − v)²) ≤ 2/v

for any i ∈ N and x ∈ {−1,1}^n, by the definition of z. This means that

|z_S(x, µ)| ≤ (2/v)^{v(log m)/12} ≤ m^{1/12}

for all S ∈ C and x ∈ {−1,1}^n, using the fact that (2/v)^v ≤ e for all v ∈ (0, 1]. Finally, by Chernoff–Hoeffding bounds, the probability that |e(S) − f̂(S, µ)| ≥ τ for any given S ∈ C is at most 2e^{−mτ²/(2m^{1/6})}. Since |C| ≤ n|P| ≤ 3nt^{−2}, it suffices to show that this quantity is at most δ/(2|C|) ≥ δt²/(6n). In other words, to finish, we need

2e^{−m^{1/6}ε/32} ≤ δm^{−2/3}/(6n),

which is clearly true for m sufficiently large; in particular, m = poly(ns/(δε)) certainly suffices. ∎

We now prove Lemma 5.

Proof of Lemma 5. Let g : {−1,1}^n → {−1, 0, 1} be the function computed by the truncated decision tree in which each internal node at depth d has been replaced by a leaf of value 0. Then,

∑_S (f̂(S, µ) − ĝ(S, µ))² = E_{x∼D_µ}[ (f(x) − g(x))² ] = Pr_{x∼D_µ}[ f(x) ≠ g(x) ] ≤ (1 − c/2)^d s.

The last inequality follows from the fact that the probability of reaching any leaf at depth d is at most (1 − c/2)^d. Since g has degree d, ∑_{|S|>d} f̂²(S, µ) ≤ (1 − c/2)^d s. Thus, by removing all terms of degree greater than d, we throw out at most (1 − c/2)^d s of the Fourier mass. Hence, it suffices to show that

∑_{S : |f̂(S,µ)| ≤ β} f̂²(S, µ) ≤ 3(1 − c/2)^d s + 2^{d+2} β.

This can be done by breaking it into two cases:

∑_{S : |f̂(S,µ)| ≤ β} f̂²(S, µ) = ∑_{S : |f̂(S,µ)| ≤ β ∧ |ĝ(S,µ)| ≥ 2β} f̂²(S, µ) + ∑_{S : |f̂(S,µ)| ≤ β ∧ |ĝ(S,µ)| < 2β} f̂²(S, µ).
Each S occurring in the first term above contributes at least β² to ∑_S (f̂(S, µ) − ĝ(S, µ))² ≤ (1 − c/2)^d s, hence there can be at most (1 − c/2)^d s / β² sets S in the first term, and

∑_{S : |f̂(S,µ)| ≤ β ∧ |ĝ(S,µ)| ≥ 2β} f̂²(S, µ) ≤ β² · (1 − c/2)^d s / β² = (1 − c/2)^d s.

Using the fact that (a + b)² ≤ 2(a² + b²) for any reals a, b, we have

∑_{S : |f̂(S,µ)| ≤ β ∧ |ĝ(S,µ)| < 2β} f̂²(S, µ) ≤ ∑_{S : |f̂(S,µ)| ≤ β ∧ |ĝ(S,µ)| < 2β} 2( (f̂(S, µ) − ĝ(S, µ))² + ĝ²(S, µ) ).

Now we know that ∑_S (f̂(S, µ) − ĝ(S, µ))² ≤ (1 − c/2)^d s, so this gives an upper bound of 2(1 − c/2)^d s on the sum of the first terms above. It suffices to show that

∑_{S : |ĝ(S,µ)| < 2β} ĝ²(S, µ) ≤ 2^{d+1} β.

To see this, note that g, being computed by a depth-d decision tree, has at most 4^d nonzero terms. And since any vector v ∈ R^{4^d} with ∥v∥₂ ≤ 1 has ∥v∥₁ ≤ 2^d, we have ∑_S |ĝ(S, µ)| ≤ 2^d. Finally,

∑_{S : |ĝ(S,µ)| < 2β} ĝ²(S, µ) ≤ ∑_S |ĝ(S, µ)| · 2β ≤ 2^{d+1} β. ∎

6 Conclusions

In conclusion, we have shown, in a precise sense, that all decision trees are learnable from most product distributions. The main tool we have is a type of generalization of KM that uses random examples drawn from a (perturbed) product distribution, and works only for terms of degree O(log n). Learning decision trees is a clear demonstration of the power of a new model. However, the questions raised by such a tool are perhaps even more interesting. First, can one learn DNFs from most product distributions? Second, can one learn agnostically in these settings; for example, can one agnostically learn decision trees in this setting? A third and very interesting direction would be to go beyond product distributions to arbitrary perturbed distributions. To be precise, let D be an arbitrary distribution on {−1,1}^n.
Let a, b ∈_U [0, c]^n be two uniformly random perturbation vectors. Consider the distribution in which x is first chosen from D and then each bit x_i is altered as follows: if x_i = 1, then x_i is flipped with probability a_i; if x_i = −1, then x_i is flipped with probability b_i. This gives a new type of perturbed distribution on inputs which is not, in general, a product distribution. Hence, our current techniques will not work, but it is possible that others will.

Finally, we mention that the Goldreich–Levin algorithm [4], similar to KM, has a number of applications in computational complexity and other areas. It would be interesting to see if these applications could also be studied from random examples, instead of black-box access, in a smoothed-analysis setting.

Acknowledgments. We are very grateful to Ran Raz, Ryan O'Donnell, and Prasad Tetali for illuminating discussions.

References

[1] A. Blum, Rank-r decision trees are a subclass of r-decision lists, Information Processing Letters, 42 (1992), pp. 183–185.

[2] N. Bshouty, E. Mossel, R. O'Donnell, and R. Servedio, Learning DNF from random walks. To appear in Journal of Computer and System Sciences, 2005.

[3] A. Ehrenfeucht and D. Haussler, Learning decision trees from random examples, Information and Computation, 82 (1989), pp. 231–246.

[4] O. Goldreich and L. Levin, A hard-core predicate for all one-way functions, in Proceedings of the Twenty-First Annual Symposium on Theory of Computing, 1989, pp. 25–32.

[5] P. Gopalan, A. T. Kalai, and A. R. Klivans, Agnostically learning decision trees, in Proceedings of the 40th Annual ACM Symposium on Theory of Computing, New York, NY, USA, 2008, ACM, pp. 527–536.

[6] J. Jackson, An efficient membership-query algorithm for learning DNF with respect to the uniform distribution, Journal of Computer and System Sciences, 55 (1997), pp. 414–440.

[7] J. Jackson and R. Servedio, Learning random log-depth decision trees under the uniform distribution, in Proceedings of the 16th Annual Conference on Computational Learning Theory and 7th Kernel Workshop, 2003, pp. 610–624.

[8] E. Kushilevitz and Y. Mansour, Learning decision trees using the Fourier spectrum, SIAM Journal on Computing, 22 (1993), pp. 1331–1348.

[9] T. M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.

[10] E. Mossel, R. O'Donnell, and R. Servedio, Learning juntas, in Proceedings of the 35th Annual Symposium on Theory of Computing, 2003.

[11] R. Selten, Reexamination of the perfectness concept for equilibrium points in extensive games, International Journal of Game Theory.

[12] D. A. Spielman and S.-H. Teng, Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time, Journal of the ACM, 51 (2004), pp. 385–463.

[13] L. Valiant, A theory of the learnable, Communications of the ACM, 27 (1984), pp. 1134–1142.