Statistical Learning of Arbitrary Computable Classifiers


Authors: David Soloveichik

David Soloveichik∗
California Institute of Technology, MC 136-93, Pasadena, CA 91125
dsolov@caltech.edu

Abstract

Statistical learning theory chiefly studies restricted hypothesis classes, particularly those with finite Vapnik-Chervonenkis (VC) dimension. The fundamental quantity of interest is the sample complexity: the number of samples required to learn to a specified level of accuracy. Here we consider learning over the set of all computable labeling functions. Since the VC-dimension is infinite and a priori (uniform) bounds on the number of samples are impossible, we let the learning algorithm decide when it has seen sufficient samples to have learned. We first show that learning in this setting is indeed possible, and develop a learning algorithm. We then show, however, that bounding sample complexity independently of the distribution is impossible. Notably, this impossibility is entirely due to the requirement that the learning algorithm be computable, and not due to the statistical nature of the problem.

1 Introduction

Suppose we are trying to learn a difficult classification problem: for example, determining whether the given image contains a human face, or whether the MRI image shows a malignant tumor, etc. We may first try to train a simple model such as a small neural network. If that fails, we may move on to other, potentially more complex, methods of classification such as support vector machines with different kernels, techniques to apply certain transformations to the data first, etc. Conventional statistical learning theory attempts to bound the number of samples needed to learn to a specified level of accuracy for each of the above models (e.g. neural networks, support vector machines).
Specifically, it is enough to bound the VC-dimension of the learning model to determine the number of samples to use [VC71, BEHW89]. However, if we allow ourselves to change the model, then the VC-dimension of the overall learning algorithm is not finite, and much of statistical learning theory does not directly apply.

∗ I thank Erik Winfree and Matthew Cook for discussions and invaluable support.

Accepting that much of the time the complexity of the model cannot be a priori bounded, Structural Risk Minimization [Vap98] explicitly considers a hierarchy of increasingly complex models. An alternative approach, and one we follow in this paper, is simply to consider a single learning model that includes all possible classification methods. We consider the unrestricted learning model consisting of all computable classifiers. Since the VC-dimension is clearly infinite, there are no uniform bounds (independent of the distribution and the target concept) on the number of samples needed to learn accurately [BEHW89]. Yet we still want to guarantee a desired level of accuracy. Rather than deciding on the number of samples a priori, it is natural to allow the learning algorithm to decide when it has seen sufficiently many labeled samples, based on the training samples seen up to now and their labels. Since the above learning model includes any practical classification scheme, we term it universal (PAC-)learning.

We first show that there is a computable learning algorithm in our universal setting. Then, in order to obtain bounds on the number of training samples that would be needed, we consider measuring sample complexity of the learning algorithm as a function of the unknown correct labeling function (i.e. target concept).
Although the correct labeling is unknown, this sample complexity measure could be used to compare learning algorithms speculatively: “if the target labeling were such and such, learning algorithm A requires fewer samples than learning algorithm B”. By asking what is the largest sample size needed assuming the target labeling function is in a certain class, we could compare the sample complexity of the universal learner to a learner over the restricted class (e.g. with finite VC-dimension).

However, we prove that it is impossible to bound the sample complexity of any computable universal learning algorithm, even as a function of the target concept. Depending on the distribution, any such bound will be exceeded with arbitrarily high probability. The impossibility of a distribution-independent bound is entirely due to the computability requirement. Indeed, we show there is an uncomputable learning procedure for which we bound the number of samples queried as a function of the unknown target concept, independently of the distribution. Our results imply that computable learning algorithms in the universal setting must “waste samples” in the sense of requiring more samples than is necessary for statistical reasons alone.

2 Relation to Previous Work

There is comparatively little work in statistical learning theory on learning arbitrary computable classifiers compared to the volume of research on learning in more restricted settings. Computational learning theory (aka PAC-learning) requires learning algorithms to be efficient in the sense of running in polynomial time in certain parameters [Val84, KV94]. That work generally restricts learning to very limited concept/hypothesis spaces such as perceptrons, DNF expressions, limited-weight neural networks, etc. The purely statistical learning theory paradigm ignores issues of computability [VC71, Vap98].
Work on learning arbitrary computable functions is mostly in the “learning in the limit” paradigm [Gol67, Ang88], in which the goal of learning is to eventually converge to the perfectly correct hypothesis, as opposed to approximating it with an approximately correct hypothesis.

The idea of allowing the learner to ask for a varying number of training samples based on the ones previously seen was studied before in statistical learning theory [LMR88, BI94]. Linial et al. [LMR88] called this model “dynamic sampling” and showed that dynamic sampling allows learning with a hypothesis space of infinite VC-dimension if all hypotheses can be enumerated. This is essentially Theorem 4 of our paper. However, the hypothesis space of all computable functions cannot be enumerated by any algorithm, and thus these results do not directly imply the existence of a learning algorithm in our setting.

Our proof technique for establishing positive results (Theorem 2) is parallel evaluation of all hypotheses, and is based on Levin’s universal search [Lev73]. In learning theory, Levin’s universal search was previously used by Goldreich and Ron [GR97] to evaluate all learning algorithms in parallel and obtain an algorithm with asymptotically optimal computation time.

The main negative result of this paper is showing the absence of distribution-independent bounds on sample complexity for computable universal learning algorithms (Theorem 5). Recently, Ryabko [Rya05] considered learning arbitrary computable classifiers, albeit in a setting where the number of samples for the learning algorithm is externally chosen. He demonstrated a computational difficulty in determining the number of samples needed: it grows faster than any computable function of the length of the target concept.
In contrast, we prove that distribution-independent bounds do not exist altogether for computable learning algorithms in our setting.

3 Definitions

The sample space X is the universe of possible points over which learning occurs. Here we will largely suppose the sample space X is the set of all finite binary strings {0, 1}∗. A concept space C and hypothesis space H are sets of boolean-valued functions over X, which are said to label points x ∈ X as 0/1. The concept space C is the set of all possible labeling functions that our learning algorithm may be asked to learn from. In each learning scenario, there is some unknown target concept c ∈ C that represents the desired way of labeling points. There is also an unknown sample distribution D over X. The learning algorithm chooses a hypothesis h ∈ H based on iid samples drawn from D and labeled according to the target concept c. Since we cannot hope to distinguish between a hypothesis that is always correct and one that is correct most of the time, we adopt the “probably approximately correct” [Val84] goal of producing, with high probability (1 − δ), a hypothesis h such that the probability over x ∼ D that h(x) ≠ c(x) is small (ε).

Here we will mostly consider the concept space C to be the set of all total recursive functions X → {0, 1}. We say that this is a universal learning setting because C includes any practical classification scheme. We will mostly consider the hypothesis space to be the set of all partial recursive functions X → {0, 1, ⊥}, where ⊥ indicates failure to halt. From PAC learning it is known that sometimes it helps to use different concept and hypothesis classes, if one desires the learning algorithm to be efficient [PV88]. In a related way, allowing our algorithm to output a partial recursive function that may not halt on all inputs seems to permit learning (e.g. Theorem 2).
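As an illustrative aside (a hypothetical Python sketch, not part of the paper’s formalism; the names `make_oracle`, `draw`, and the parity example are our own), the labeled-example oracle just described can be modeled as a function that, on each call, draws x ∼ D and returns the pair (x, c(x)):

```python
import random

def make_oracle(c, draw):
    """EX(c, D)-style oracle: each call returns (x, c(x)) with x ~ D.

    `c` is the (unknown) target labeling; `draw` samples a point from D.
    """
    def oracle():
        x = draw()
        return (x, c(x))
    return oracle

# Toy instantiation: target = parity of 1-bits; D = uniform over 4-bit strings.
rng = random.Random(0)
parity = lambda x: x.count("1") % 2
draw = lambda: "".join(rng.choice("01") for _ in range(4))
ex = make_oracle(parity, draw)
x, label = ex()
```

The learner sees only the stream of (x, label) pairs; both c and D remain hidden behind the oracle, exactly as in Definition 1 below.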
Abusing notation, c ∈ C or h ∈ H will refer to either the function or to a representation of that function as a program. Similarly, C and H will refer to the sets of functions or to the sets of representations of the corresponding functions. We assume all programs are written in some fixed alphabet and are interpreted by some fixed universal Turing machine. If h is a partial recursive function and h(x) = ⊥, then by convention h(x) ≠ h′(x) for any partial recursive function h′ (even if h′(x) = ⊥ also).

We can now define what we mean by a learning algorithm:

Definition 1 Algorithm A is a learning algorithm over sample space X, concept space C, and hypothesis space H if:

• (syntactic requirements) A takes two inputs δ ∈ (0, 1) and ε ∈ (0, 1/2), queries an oracle for pairs in X × {0, 1}, and if A halts it outputs a hypothesis h ∈ H.

• (semantic requirements) For any δ, ε, for any concept c ∈ C, and distribution D over X, if the oracle returns pairs (x, c(x)) for x drawn iid from D, then A always halts, and with probability at least 1 − δ outputs a hypothesis h such that Pr_{x∼D}[h(x) ≠ c(x)] < ε.

The always-halting requirement seems a nice property of the learning algorithm, and indeed the learning algorithm we develop (Theorem 2) will halt for any concept and sequence of samples. However, relaxing this requirement to allow a non-zero probability that the learning algorithm queries the oracle for infinitely many samples does not change our negative results (Theorem 5), as long as a finite number of oracle calls implies halting.

The fundamental notion in statistical learning theory is that of sample complexity. Since the VC-dimension of our hypothesis space is infinite, there is no uniform bound m(δ, ε) on the number of samples needed to learn to the δ, ε level of accuracy.
We will consider the question of whether for a given learning algorithm there is a distribution-independent bound m(c, δ, ε) on the number of samples queried from the oracle, where c ∈ C is the target concept. In other words, the bound is allowed to depend on the target concept c but not on the sample distribution D. Such a bound may be satisfied with certainty, or satisfied with high probability over the learning samples.

4 Results

We first show that there is a computable learning algorithm in our setting.

Theorem 2 There is a learning algorithm over sample space X of all finite binary strings, hypothesis space H of all partial recursive functions, and concept space C of all total recursive functions.

In order to prove this theorem we need the following lemma. Results equivalent to this lemma can be found in [LMR88].

Lemma 3 Let X be any sample space and D be any distribution over X. Fix any function c : X → {0, 1}. Suppose hypothesis space H is countable, and let h_1, h_2, . . . be some ordering of H. For any δ, ε, let m(i) = ⌈(2 ln i + ln(1/δ) + ln(π²/6))/ε⌉. Suppose x_1, x_2, . . . is an infinite sequence of iid samples drawn from D. Then the probability that there exists h_i ∈ H such that Pr_{x∼D}[h_i(x) ≠ c(x)] > ε, but h_i agrees with c on x_1, x_2, . . . , x_{m(i)}, is less than δ.

Proof: The probability that a particular h_i with error probability Pr_{x∼D}[h_i(x) ≠ c(x)] > ε gets m(i) iid instances drawn from D correct is less than (1 − ε)^{m(i)} ≤ e^{−m(i)ε} ≤ (6/π²)(δ/i²). By the union bound, the probability that any h_i with error probability greater than ε gets m(i) instances correct is less than Σ_{i=1}^{∞} (6/π²)(δ/i²) = δ.

Proof of Theorem 2: Let h_1, h_2, . . . be a recursive enumeration of H (for example, in lexicographic order). For the given δ, ε, let m(i) be defined as in Lemma 3.
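As an aside, the sample schedule m(i) of Lemma 3 is easily computed, and the union-bound bookkeeping behind it can be checked numerically. A minimal Python sketch (the values of δ and ε below are illustrative only):

```python
import math

def m(i, delta, eps):
    # Sample schedule from Lemma 3: hypothesis h_i must agree with the
    # target on its first m(i) samples before it may be output.
    return math.ceil((2 * math.log(i) + math.log(1 / delta)
                      + math.log(math.pi ** 2 / 6)) / eps)

delta, eps = 0.05, 0.1

# The failure budget allotted to h_i is (6/pi^2) * delta / i^2; summed
# over all i these budgets total delta, which is the union bound used
# in the proof. A large partial sum approaches delta from below.
budget = sum((6 / math.pi ** 2) * delta / i ** 2 for i in range(1, 100000))
```

Note how m(i) grows only logarithmically in i, so hypotheses later in the ordering require modestly more samples, not dramatically more.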
The learning algorithm computes infinitely many threads 1, 2, . . . running in parallel. This can be done by a standard dovetailing technique. (For example, use the following schedule: for k = 1 to infinity, for i = 1 to k, perform step k − i + 1 of thread i.) Thread i sequentially checks whether h_i(x_1) = c(x_1), h_i(x_2) = c(x_2), . . . , h_i(x_{m(i)}) = c(x_{m(i)}), exiting if a check fails. If all m(i) checks pass, thread i terminates and outputs h_i. The learning algorithm queries the oracle as necessary for new learning samples and their labeling. The overall algorithm terminates as soon as some thread outputs an h_i, and outputs this hypothesis. By Lemma 3, with probability at least 1 − δ, this h_i has error probability less than ε. Further, since C ⊂ H, the learning algorithm will always terminate.

Note that it seems necessary to expand the hypothesis space to include all partial recursive functions because the concept space of total recursive functions does not have a recursive enumeration (it is uncomputable whether a given program is total recursive or not).

We will see in Theorem 5 that there is no bound m(c, δ, ε) on the number of samples queried by any computable learning algorithm in our setting. Let us obtain some intuition for why that is true for the above learning algorithm. Then we will contrast this to the case of an uncomputable learning algorithm.

In essence, we can make the above learning algorithm query for more samples than is necessary for statistical reasons alone. Intuitively, suppose that an h_{i∗} coming early in the ordering is always correct but takes a very long time to compute. The learning algorithm cannot wait for this h_{i∗} to finish, because it does not know that any particular h_i will ever halt. At some point it has to start testing h_i’s that come later in the ordering and that have larger m(i)’s.
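The dovetailing schedule from the proof of Theorem 2 can be sketched directly. A hypothetical Python generator (our own illustration, not the paper’s notation) that emits (thread, step) pairs in the stated order:

```python
def dovetail(max_k):
    # Schedule from the proof of Theorem 2: for k = 1, 2, ...,
    # for i = 1 to k, perform step k - i + 1 of thread i.
    # Truncated at max_k here so the schedule is finite and inspectable.
    for k in range(1, max_k + 1):
        for i in range(1, k + 1):
            yield i, k - i + 1  # (thread, step)

schedule = list(dovetail(50))
```

Each thread i executes its steps 1, 2, 3, . . . in increasing order, and every pair (i, s) with i + s − 1 ≤ max_k is scheduled exactly once, so no thread is ever starved while still-running threads make progress.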
Testing these requires more learning samples than m(i∗). If we knew which h_i’s are safe to skip over because they don’t halt, and for which h_i’s we should wait, then the above problem would be solved. Indeed, the following theorem shows that there is no statistical reason why a distribution-independent bound m(c, δ, ε) is impossible. The theorem presents a well-defined method of learning (albeit an uncomputable one) for which there exists such a bound, and this bound is satisfied with certainty. Below, the halting oracle gives 0/1 answers to questions of the form (h, x) where h ∈ H, x ∈ X, such that a 1 answer indicates that h(x) halts and a 0 answer indicates it does not; the answers are clearly uncomputable.

Theorem 4 If a learning algorithm is allowed to query the halting oracle, then there is a learning algorithm over sample space X of all finite binary strings, hypothesis space H of all partial recursive functions, and concept space C of all total recursive functions, and a function m : C × (0, 1) × (0, 1/2) → N, such that for any approximation parameters δ, ε, any target concept c ∈ C, and any distribution D over X, the learning algorithm uses at most m(c, δ, ε) training samples.

Proof: Rather than dovetailing as is done for the computable learning algorithm (Theorem 2), we can sequentially test every h_i on samples x_1, . . . , x_{m(i)}, because we can determine whether h_i halts on a given input. Since c = h_{i∗} for some h_{i∗} ∈ H, the hypothesis h_i we output will always satisfy i ≤ i∗, and therefore we will require at most m(i∗) = ⌈(2 ln(i∗) + ln(1/δ) + ln(π²/6))/ε⌉ samples.

We now show that for any computable learning algorithm, and any possible sample bound m(c, δ, ε), there is a target concept c and a sample distribution such that this sample bound is violated with high probability.
The probability of violation can be made arbitrarily close to 1 − 2(δ + (1 − δ)ε) (which approaches 1 as δ, ε → 0). In fact the theorem is stronger: it shows that, given a learning algorithm, without varying the target concept but just by varying the distribution, it is possible to make the algorithm ask for arbitrarily many learning samples with high probability.

Theorem 5 For any learning algorithm over sample space X of all finite binary strings, hypothesis space H of all partial recursive functions, and concept space C of all total recursive functions, there is a target concept c ∈ C, such that for any approximation parameters δ, ε, for any ρ < 1 − 2(δ + (1 − δ)ε), and for any sample bound m ∈ N, there is a distribution D over X, such that the learning algorithm uses more than m training samples with probability at least ρ.

The key difference between a computable and an uncomputable learning algorithm is that a concept can simulate a computable one. By simulating the learning algorithm, a concept can choose to behave in a way that is bad for the learning algorithm’s sample complexity.

To prove the above theorem, we will first need the following lemma. The lemma essentially shows a situation such that any learning algorithm according to our definition must query for more than m learning samples with high probability when the target concept is chosen adversarially. The lemma is true even without requiring the learning algorithm to be computable. Note that the lemma does not directly imply the theorem above, even in its weaker form, because in order to increase the number of learning samples that are likely queried by the learning algorithm, we have to change the target concept. Since m(c, δ, ε) is a function of c, there is no guarantee that the bound doesn’t become larger as well.
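As a quick numeric illustration (the parameter values are arbitrary), the Theorem 5 threshold and the finite-domain bound of Lemma 6 below can be evaluated side by side; the latter approaches the former as the domain grows:

```python
def violation_threshold(delta, eps):
    # 1 - 2*(delta + (1 - delta)*eps): the probability with which the
    # sample bound of Theorem 5 can be made to fail; tends to 1 as
    # delta, eps -> 0.
    return 1 - 2 * (delta + (1 - delta) * eps)

def finite_domain_bound(d, m, delta, eps):
    # The bound of Lemma 6 (below): over d uniform points with sample
    # bound m < d, some concept forces more than m queries with
    # probability at least this much.
    return 1 - 2 * d * (delta + (1 - delta) * eps) / (d - m)
```

For fixed m, δ, ε the finite-domain bound increases with d and converges to the distribution-free threshold, which is exactly how the proof of Theorem 5 drives the violation probability toward 1 − 2(δ + (1 − δ)ε).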
Lemma 6 Let X be a set of d points, and let C be the set of all labelings of X. Let D be the uniform distribution over X. Suppose A is a learning algorithm over sample space X, concept and hypothesis space C. For any accuracy parameters δ, ε and any m < d, there is a concept c ∈ C such that when the oracle draws from D labeled according to c, the probability that A samples more than m points is at least 1 − 2d(δ + (1 − δ)ε)/(d − m).

Proof: We use the probabilistic method to find a particularly bad concept c∗. Suppose we do not start with a fixed target concept c, but draw it uniformly from C. In other words, c is determined by values {c(x)}_{x∈X} drawn uniformly from {0, 1}. Given some x_1, . . . , x_m, c(x_1), . . . , c(x_m), and x ∉ {x_1, . . . , x_m}, the value of c(x) is a fair coin flip. Thus if on x_1, . . . , x_m labeled by c(x_1), . . . , c(x_m), A outputs a hypothesis without asking for more samples, then the hypothesis is incorrect on x with probability 1/2. If we now let x vary, the probability that the hypothesis is incorrect on x is at least (1/2)(d − m)/d, since there are at least d − m points not in x_1, . . . , x_m. Now suppose for any c the probability that A samples more than m points is at most ρ. Then the unconditional probability that the hypothesis output by A is incorrect on a random sample point is at least (1 − ρ)(1/2)(d − m)/d. This implies that there is a concept c∗ ∈ C such that the probability that the hypothesis output by A is incorrect on a random sample point is at least (1 − ρ)(1/2)(d − m)/d.

Since A is a learning algorithm, when we use c∗ to label the training points, and use accuracy parameters δ, ε, the probability that the hypothesis produced by A has error probability greater than ε is at most δ.
If we make the worst-case assumption that whenever the error probability of the hypothesis is larger than ε it is exactly 1, and otherwise the error probability is exactly ε, then the probability that the hypothesis output by A is incorrect on a random sample point is at most δ · 1 + (1 − δ)ε. Thus (1 − ρ)(1/2)(d − m)/d ≤ δ + (1 − δ)ε, implying that ρ ≥ 1 − 2d(δ + (1 − δ)ε)/(d − m).

Now, in order to prove Theorem 5, we essentially show that there is some fixed concept c∗ that behaves as the bad c’s in arbitrary instances of Lemma 6.

Proof of Theorem 5: Consider the following program P : {0, 1}∗ → {0, 1}. First it interprets the given string x ∈ {0, 1}∗ as a tuple ⟨δ, ε, m, d, i⟩ for δ ∈ (0, 1), ε ∈ (0, 1/2) and m, d, i ∈ N, using some fixed one-to-one encoding of such tuples as binary strings. If x cannot be decoded appropriately, or if i > d, then P returns 0. Otherwise, for these δ, ε, m, d, let X̂ ⊂ {0, 1}∗ be the set of d strings which are interpreted as {⟨δ, ε, m, d, 1⟩, . . . , ⟨δ, ε, m, d, d⟩}, and let D̂ be the distribution that is uniform over X̂ and 0 elsewhere. Let Ĉ be the set of all possible labelings of X̂. For each labeling ĉ ∈ Ĉ, program P computes the probability ρ_ĉ that A, given accuracy parameters δ, ε, queries for more than m sample points if points are drawn from D̂ labeled according to ĉ. For each ĉ, this requires simulating A for at most d^m different sequences of sample points. Let ĉ∗ = argmax_{ĉ∈Ĉ} {ρ_ĉ}, breaking ties in some fixed way. Finally, P outputs ĉ∗(x).

Observe that P is total recursive, since A spends a finite time on any finite sequence of sample points. (This is a weaker condition than the always-halting requirement of our definition of a learning algorithm.) Thus P is some c∗ ∈ C.
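The inner computation of P can be made concrete on a toy scale. In the sketch below (our own illustration; the stub `stops_within` stands in for the query behavior of A and is not a real learner), we enumerate all 2^d labelings ĉ of d points and compute each ρ_ĉ exactly by simulating the stub on all d^m equiprobable length-m sample prefixes, mirroring the “at most d^m different sequences” step of the proof:

```python
from itertools import product

d, m = 3, 2                 # toy domain size and sample bound
points = list(range(d))

def stops_within(prefix, c):
    # Stub in place of A: it stops sampling once it has seen both labels,
    # or all d points. Whether it stops within m samples depends only on
    # the first m draws, so length-m prefixes suffice.
    pts, labels = set(), set()
    for x in prefix:
        pts.add(x)
        labels.add(c[x])
        if labels == {0, 1} or len(pts) == d:
            return True
    return False

def rho(c):
    # Exact probability (uniform iid draws) that the stub asks for > m samples.
    seqs = list(product(points, repeat=m))
    return sum(not stops_within(s, c) for s in seqs) / len(seqs)

labelings = list(product([0, 1], repeat=d))
c_star = max(labelings, key=rho)   # the adversarial labeling chosen by P
```

Here a constant labeling never shows the stub both labels, so it can never stop early and ρ_ĉ∗ = 1: the maximization inside P finds exactly the labeling that forces the most queries, which is the adversarial behavior the proof exploits.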
Further, for any δ, ε, m, d, on all points ⟨δ, ε, m, d, i⟩ for i ≤ d, P finds the same ĉ∗, and thus on these points c∗ acts like this ĉ∗. By Lemma 6, if m < d then this ĉ∗ has the property that ρ_ĉ∗ ≥ 1 − 2d(δ + (1 − δ)ε)/(d − m). Therefore, if A is given accuracy parameters δ, ε, the target concept is c∗, and the distribution D is uniform over {⟨δ, ε, m, d, 1⟩, . . . , ⟨δ, ε, m, d, d⟩} for some d ∈ N such that m < d, then the probability that A requests more than m samples is at least 1 − 2d(δ + (1 − δ)ε)/(d − m). Since we can choose D such that d is large enough, we obtain the desired result.

5 Conclusion

We have shown that learning arbitrary computable classifiers is possible in the statistical learning paradigm. However, for any computable learning algorithm, the number of samples required to learn to a desired level of accuracy may become arbitrarily large depending on the sample distribution. This is in contrast to uncomputable learning methods in the same universal setting, whose sample complexity can be bounded independently of the distribution.

Our results mean that there is a big price in terms of sample complexity to be paid for the combination of universality and computability of the learner. Specifically, by tweaking the distribution we can make a computable universal learner arbitrarily worse than a restricted learning algorithm on a finite VC-dimensional hypothesis space, or even an uncomputable universal learner.

While we have presented a single computable learning algorithm in our universal setting, one would like to develop a measure that would allow different learning algorithms to be compared to each other in terms of sample complexity. We have seen that sample complexity m(c, δ, ε) is not such a measure; is there a viable alternative?

Finally, we have ignored computation time in our analysis. As such, our learning algorithm is not likely to have practical significance. Integrating running time into the theory presented would be a critical extension.

References

[Ang88] D. Angluin. Identifying languages from stochastic examples. Technical report, Yale University, Department of Computer Science, 1988.

[BEHW89] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929–965, 1989.

[BI94] G. M. Benedek and A. Itai. Nonuniform learnability. Journal of Computer and System Sciences, pages 311–323, 1994.

[Gol67] E. M. Gold. Language identification in the limit. Information and Control, 10:447–474, 1967.

[GR97] O. Goldreich and D. Ron. On universal learning algorithms. Information Processing Letters, 63(3):131–136, 1997.

[KV94] M. J. Kearns and U. V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.

[Lev73] L. A. Levin. Universal sequential search problems. Problems of Information Transmission, 9(3):265–266, 1973.

[LMR88] N. Linial, Y. Mansour, and R. L. Rivest. Results on learnability and the Vapnik-Chervonenkis dimension. 29th Annual Symposium on Foundations of Computer Science, pages 120–129, 1988.

[PV88] L. Pitt and L. G. Valiant. Computational limitations on learning from examples. Journal of the ACM, 35(4):965–984, 1988.

[Rya05] D. Ryabko. On computability of pattern recognition problems. In Proceedings of the 16th International Conference on Algorithmic Learning Theory, pages 148–156. Springer, 2005.

[Val84] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27:1134–1142, 1984.

[Vap98] V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.

[VC71] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264–280, 1971.
