Statistical Learning of Arbitrary Computable Classifiers


Authors: David Soloveichik

David Soloveichik∗
California Institute of Technology, MC 136-93, Pasadena, CA 91125
dsolov@caltech.edu

Abstract

Statistical learning theory chiefly studies restricted hypothesis classes, particularly those with finite Vapnik-Chervonenkis (VC) dimension. The fundamental quantity of interest is the sample complexity: the number of samples required to learn to a specified level of accuracy. Here we consider learning over the set of all computable labeling functions. Since the VC-dimension is infinite and a priori (uniform) bounds on the number of samples are impossible, we let the learning algorithm decide when it has seen sufficient samples to have learned. We first show that learning in this setting is indeed possible, and develop a learning algorithm. We then show, however, that bounding sample complexity independently of the distribution is impossible. Notably, this impossibility is entirely due to the requirement that the learning algorithm be computable, and not due to the statistical nature of the problem.

1 Introduction

Suppose we are trying to learn a difficult classification problem: for example, determining whether the given image contains a human face, or whether the MRI image shows a malignant tumor, etc. We may first try to train a simple model such as a small neural network. If that fails, we may move on to other, potentially more complex, methods of classification such as support vector machines with different kernels, techniques to apply certain transformations to the data first, etc. Conventional statistical learning theory attempts to bound the number of samples needed to learn to a specified level of accuracy for each of the above models (e.g. neural networks, support vector machines).
Specifically, it is enough to bound the VC-dimension of the learning model to determine the number of samples to use [VC71, BEHW89]. However, if we allow ourselves to change the model, then the VC-dimension of the overall learning algorithm is not finite, and much of statistical learning theory does not directly apply.

∗ I thank Erik Winfree and Matthew Cook for discussions and invaluable support.

Accepting that much of the time the complexity of the model cannot be a priori bounded, Structural Risk Minimization [Vap98] explicitly considers a hierarchy of increasingly complex models. An alternative approach, and one we follow in this paper, is simply to consider a single learning model that includes all possible classification methods. We consider the unrestricted learning model consisting of all computable classifiers. Since the VC-dimension is clearly infinite, there are no uniform bounds (independent of the distribution and the target concept) on the number of samples needed to learn accurately [BEHW89]. Yet we still want to guarantee a desired level of accuracy. Rather than deciding on the number of samples a priori, it is natural to allow the learning algorithm to decide when it has seen sufficiently many labeled samples, based on the training samples seen up to now and their labels. Since the above learning model includes any practical classification scheme, we term it universal (PAC-)learning.

We first show that there is a computable learning algorithm in our universal setting. Then, in order to obtain bounds on the number of training samples that would be needed, we consider measuring sample complexity of the learning algorithm as a function of the unknown correct labeling function (i.e. target concept).
Although the correct labeling is unknown, this sample complexity measure could be used to compare learning algorithms speculatively: “if the target labeling were such and such, learning algorithm A requires fewer samples than learning algorithm B”. By asking what is the largest sample size needed assuming the target labeling function is in a certain class, we could compare the sample complexity of the universal learner to a learner over the restricted class (e.g. with finite VC-dimension).

However, we prove that it is impossible to bound the sample complexity of any computable universal learning algorithm, even as a function of the target concept. Depending on the distribution, any such bound will be exceeded with arbitrarily high probability. The impossibility of a distribution-independent bound is entirely due to the computability requirement. Indeed, we show there is an uncomputable learning procedure for which we bound the number of samples queried as a function of the unknown target concept, independently of the distribution. Our results imply that computable learning algorithms in the universal setting must “waste samples” in the sense of requiring more samples than is necessary for statistical reasons alone.

2 Relation to Previous Work

There is comparatively little work in statistical learning theory on learning arbitrary computable classifiers compared to the volume of research on learning in more restricted settings. Computational learning theory (aka PAC-learning) requires learning algorithms to be efficient in the sense of running in polynomial time in certain parameters [Val84, KV94]. That work generally restricts learning to very limited concept/hypothesis spaces such as perceptrons, DNF expressions, limited-weight neural networks, etc. The purely statistical learning theory paradigm ignores issues of computability [VC71, Vap98].
Work on learning arbitrary computable functions is mostly in the “learning in the limit” paradigm [Gol67, Ang88], in which the goal of learning is to eventually converge to the perfectly correct hypothesis, as opposed to approximating it with an approximately correct hypothesis.

The idea of allowing the learner to ask for a varying number of training samples based on the ones previously seen was studied before in statistical learning theory [LMR88, BI94]. Linial et al. [LMR88] called this model “dynamic sampling” and showed that dynamic sampling allows learning with a hypothesis space of infinite VC-dimension if all hypotheses can be enumerated. This is essentially Theorem 4 of our paper. However, the hypothesis space of all computable functions cannot be enumerated by any algorithm, and thus these results do not directly imply the existence of a learning algorithm in our setting.

Our proof technique for establishing positive results (Theorem 2) is parallel evaluation of all hypotheses, and is based on Levin’s universal search [Lev73]. In learning theory, Levin’s universal search was previously used by Goldreich and Ron [GR97] to evaluate all learning algorithms in parallel and obtain an algorithm with asymptotically optimal computation time.

The main negative result of this paper is showing the absence of distribution-independent bounds on sample complexity for computable universal learning algorithms (Theorem 5). Recently, Ryabko [Rya05] considered learning arbitrary computable classifiers, albeit in a setting where the number of samples for the learning algorithm is externally chosen. He demonstrated a computational difficulty in determining the number of samples needed: it grows faster than any computable function of the length of the target concept.
In contrast, we prove that distribution-independent bounds do not exist altogether for computable learning algorithms in our setting.

3 Definitions

The sample space X is the universe of possible points over which learning occurs. Here we will largely suppose the sample space X is the set of all finite binary strings {0, 1}∗. A concept space C and hypothesis space H are sets of boolean-valued functions over X, which are said to label points x ∈ X as 0/1. The concept space C is the set of all possible labeling functions that our learning algorithm may be asked to learn from. In each learning scenario, there is some unknown target concept c ∈ C that represents the desired way of labeling points. There is also an unknown sample distribution D over X. The learning algorithm chooses a hypothesis h ∈ H based on iid samples drawn from D and labeled according to the target concept c. Since we cannot hope to distinguish between a hypothesis that is always correct and one that is correct most of the time, we adopt the “probably approximately correct” [Val84] goal of producing, with high probability (1 − δ), a hypothesis h such that the probability over x ∼ D that h(x) ≠ c(x) is small (ε).

Here we will mostly consider the concept space C to be the set of all total recursive functions X → {0, 1}. We say that this is a universal learning setting because C includes any practical classification scheme. We will mostly consider the hypothesis space to be the set of all partial recursive functions X → {0, 1, ⊥}, where ⊥ indicates failure to halt. From PAC learning it is known that sometimes it helps to use different concept and hypothesis classes, if one desires the learning algorithm to be efficient [PV88]. In a related way, allowing our algorithm to output a partial recursive function that may not halt on all inputs seems to permit learning (e.g. Theorem 2).
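As an illustrative aside (a hypothetical Python sketch, not part of the paper’s formalism; the names `make_oracle`, `draw`, and the parity example are our own), the labeled-example oracle just described can be modeled as a function that, on each call, draws x ∼ D and returns the pair (x, c(x)):

```python
import random

def make_oracle(c, draw):
    """EX(c, D)-style oracle: each call returns (x, c(x)) with x ~ D.

    `c` is the (unknown) target labeling; `draw` samples a point from D.
    """
    def oracle():
        x = draw()
        return (x, c(x))
    return oracle

# Toy instantiation: target = parity of 1-bits; D = uniform over 4-bit strings.
rng = random.Random(0)
parity = lambda x: x.count("1") % 2
draw = lambda: "".join(rng.choice("01") for _ in range(4))
ex = make_oracle(parity, draw)
x, label = ex()
```

The learner sees only the stream of (x, label) pairs; both c and D remain hidden behind the oracle, exactly as in Definition 1 below.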
Abusing notation, c ∈ C or h ∈ H will refer to either the function or to a representation of that function as a program. Similarly, C and H will refer to the sets of functions or to the sets of representations of the corresponding functions. We assume all programs are written in some fixed alphabet and are interpreted by some fixed universal Turing machine. If h is a partial recursive function and h(x) = ⊥, then by convention h(x) ≠ h′(x) for any partial recursive function h′ (even if h′(x) = ⊥ also).

We can now define what we mean by a learning algorithm:

Definition 1 Algorithm A is a learning algorithm over sample space X, concept space C, and hypothesis space H if:

• (syntactic requirements) A takes two inputs δ ∈ (0, 1) and ε ∈ (0, 1/2), queries an oracle for pairs in X × {0, 1}, and if A halts it outputs a hypothesis h ∈ H.

• (semantic requirements) For any δ, ε, for any concept c ∈ C, and distribution D over X, if the oracle returns pairs (x, c(x)) for x drawn iid from D, then A always halts, and with probability at least 1 − δ outputs a hypothesis h such that Pr_{x∼D}[h(x) ≠ c(x)] < ε.

The always-halting requirement seems a nice property of the learning algorithm, and indeed the learning algorithm we develop (Theorem 2) will halt for any concept and sequence of samples. However, relaxing this requirement to allow a non-zero probability that the learning algorithm queries the oracle for infinitely many samples does not change our negative results (Theorem 5), as long as a finite number of oracle calls implies halting.

The fundamental notion in statistical learning theory is that of sample complexity. Since the VC-dimension of our hypothesis space is infinite, there is no uniform bound m(δ, ε) on the number of samples needed to learn to the δ, ε level of accuracy.
We will consider the question of whether for a given learning algorithm there is a distribution-independent bound m(c, δ, ε) on the number of samples queried from the oracle, where c ∈ C is the target concept. In other words, the bound is allowed to depend on the target concept c but not on the sample distribution D. Such a bound may be satisfied with certainty, or satisfied with high probability over the learning samples.

4 Results

We first show that there is a computable learning algorithm in our setting.

Theorem 2 There is a learning algorithm over sample space X of all finite binary strings, hypothesis space H of all partial recursive functions, and concept space C of all total recursive functions.

In order to prove this theorem we need the following lemma. Results equivalent to this lemma can be found in [LMR88].

Lemma 3 Let X be any sample space and D be any distribution over X. Fix any function c : X → {0, 1}. Suppose hypothesis space H is countable, and let h_1, h_2, . . . be some ordering of H. For any δ, ε, let m(i) = ⌈(2 ln i + ln(1/δ) + ln(π²/6))/ε⌉. Suppose x_1, x_2, . . . is an infinite sequence of iid samples drawn from D. Then the probability that there exists h_i ∈ H such that Pr_{x∼D}[h_i(x) ≠ c(x)] > ε, but h_i agrees with c on x_1, x_2, . . . , x_{m(i)}, is less than δ.

Proof: The probability that a particular h_i with error probability Pr_{x∼D}[h_i(x) ≠ c(x)] > ε gets m(i) iid instances drawn from D correct is less than (1 − ε)^{m(i)} ≤ e^{−m(i)ε} ≤ (6/π²)(δ/i²). By the union bound, the probability that any h_i with error probability greater than ε gets m(i) instances correct is less than Σ_{i=1}^{∞} (6/π²)(δ/i²) = δ.

Proof of Theorem 2: Let h_1, h_2, . . . be a recursive enumeration of H (for example, in lexicographic order). For the given δ, ε, let m(i) be defined as in Lemma 3.
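As an aside, the sample schedule m(i) of Lemma 3 is easily computed, and the union-bound bookkeeping behind it can be checked numerically. A minimal Python sketch (the values of δ and ε below are illustrative only):

```python
import math

def m(i, delta, eps):
    # Sample schedule from Lemma 3: hypothesis h_i must agree with the
    # target on its first m(i) samples before it may be output.
    return math.ceil((2 * math.log(i) + math.log(1 / delta)
                      + math.log(math.pi ** 2 / 6)) / eps)

delta, eps = 0.05, 0.1

# The failure budget allotted to h_i is (6/pi^2) * delta / i^2; summed
# over all i these budgets total delta, which is the union bound used
# in the proof. A large partial sum approaches delta from below.
budget = sum((6 / math.pi ** 2) * delta / i ** 2 for i in range(1, 100000))
```

Note how m(i) grows only logarithmically in i, so hypotheses later in the ordering require modestly more samples, not dramatically more.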
The learning algorithm computes infinitely many threads 1, 2, . . . running in parallel. This can be done by a standard dovetailing technique. (For example, use the following schedule: for k = 1 to infinity, for i = 1 to k, perform step k − i + 1 of thread i.) Thread i sequentially checks whether h_i(x_1) = c(x_1), h_i(x_2) = c(x_2), . . . , h_i(x_{m(i)}) = c(x_{m(i)}), exiting if a check fails. If all m(i) checks pass, thread i terminates and outputs h_i. The learning algorithm queries the oracle as necessary for new learning samples and their labeling. The overall algorithm terminates as soon as some thread outputs an h_i, and outputs this hypothesis. By Lemma 3, with probability at least 1 − δ, this h_i has error probability less than ε. Further, since C ⊂ H, the learning algorithm will always terminate.

Note that it seems necessary to expand the hypothesis space to include all partial recursive functions because the concept space of total recursive functions does not have a recursive enumeration (it is uncomputable whether a given program is total recursive or not).

We will see in Theorem 5 that there is no bound m(c, δ, ε) on the number of samples queried by any computable learning algorithm in our setting. Let us obtain some intuition for why that is true for the above learning algorithm. Then we will contrast this to the case of an uncomputable learning algorithm.

In essence, we can make the above learning algorithm query for more samples than is necessary for statistical reasons alone. Intuitively, suppose that an h_{i∗} coming early in the ordering is always correct but takes a very long time to compute. The learning algorithm cannot wait for this h_{i∗} to finish, because it does not know that any particular h_i will ever halt. At some point it has to start testing h_i’s that come later in the ordering and that have larger m(i)’s.
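The dovetailing schedule from the proof of Theorem 2 can be sketched directly. A hypothetical Python generator (our own illustration, not the paper’s notation) that emits (thread, step) pairs in the stated order:

```python
def dovetail(max_k):
    # Schedule from the proof of Theorem 2: for k = 1, 2, ...,
    # for i = 1 to k, perform step k - i + 1 of thread i.
    # Truncated at max_k here so the schedule is finite and inspectable.
    for k in range(1, max_k + 1):
        for i in range(1, k + 1):
            yield i, k - i + 1  # (thread, step)

schedule = list(dovetail(50))
```

Each thread i executes its steps 1, 2, 3, . . . in increasing order, and every pair (i, s) with i + s − 1 ≤ max_k is scheduled exactly once, so no thread is ever starved while still-running threads make progress.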
Testing these requires more learning samples than m(i∗). If we knew which h_i’s are safe to skip over because they don’t halt, and for which h_i’s we should wait, then the above problem would be solved. Indeed, the following theorem shows that there is no statistical reason why a distribution-independent bound m(c, δ, ε) is impossible. The theorem presents a well-defined method of learning (albeit an uncomputable one) for which there exists such a bound, and this bound is satisfied with certainty. Below, the halting oracle gives 0/1 answers to questions of the form (h, x) where h ∈ H, x ∈ X, such that a 1 answer indicates that h(x) halts and a 0 answer indicates it does not; the answers are clearly uncomputable.

Theorem 4 If a learning algorithm is allowed to query the halting oracle, then there is a learning algorithm over sample space X of all finite binary strings, hypothesis space H of all partial recursive functions, and concept space C of all total recursive functions, and a function m : C × (0, 1) × (0, 1/2) → N, such that for any approximation parameters δ, ε, any target concept c ∈ C, and any distribution D over X, the learning algorithm uses at most m(c, δ, ε) training samples.

Proof: Rather than dovetailing as is done for the computable learning algorithm (Theorem 2), we can sequentially test every h_i on samples x_1, . . . , x_{m(i)}, because we can determine whether h_i halts on a given input. Since c = h_{i∗} for some h_{i∗} ∈ H, the hypothesis h_i we output will always satisfy i ≤ i∗, and therefore we will require at most m(i∗) = ⌈(2 ln(i∗) + ln(1/δ) + ln(π²/6))/ε⌉ samples.

We now show that for any computable learning algorithm, and any possible sample bound m(c, δ, ε), there is a target concept c and a sample distribution such that this sample bound is violated with high probability.
The probability of violation can be made arbitrarily close to 1 − 2(δ + (1 − δ)ε) (which approaches 1 as δ, ε → 0). In fact the theorem is stronger: it shows that, given a learning algorithm, without varying the target concept but just by varying the distribution, it is possible to make the algorithm ask for arbitrarily many learning samples with high probability.

Theorem 5 For any learning algorithm over sample space X of all finite binary strings, hypothesis space H of all partial recursive functions, and concept space C of all total recursive functions, there is a target concept c ∈ C, such that for any approximation parameters δ, ε, for any ρ < 1 − 2(δ + (1 − δ)ε), and for any sample bound m ∈ N, there is a distribution D over X, such that the learning algorithm uses more than m training samples with probability at least ρ.

The key difference between a computable and an uncomputable learning algorithm is that a concept can simulate a computable one. By simulating the learning algorithm, a concept can choose to behave in a way that is bad for the learning algorithm’s sample complexity.

To prove the above theorem, we will first need the following lemma. The lemma essentially shows a situation such that any learning algorithm according to our definition must query for more than m learning samples with high probability when the target concept is chosen adversarially. The lemma is true even without requiring the learning algorithm to be computable. Note that the lemma does not directly imply the theorem above, even in its weaker form, because in order to increase the number of learning samples that are likely queried by the learning algorithm, we have to change the target concept. Since m(c, δ, ε) is a function of c, there is no guarantee that the bound doesn’t become larger as well.
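As a quick numeric illustration (the parameter values are arbitrary), the Theorem 5 threshold and the finite-domain bound of Lemma 6 below can be evaluated side by side; the latter approaches the former as the domain grows:

```python
def violation_threshold(delta, eps):
    # 1 - 2*(delta + (1 - delta)*eps): the probability with which the
    # sample bound of Theorem 5 can be made to fail; tends to 1 as
    # delta, eps -> 0.
    return 1 - 2 * (delta + (1 - delta) * eps)

def finite_domain_bound(d, m, delta, eps):
    # The bound of Lemma 6 (below): over d uniform points with sample
    # bound m < d, some concept forces more than m queries with
    # probability at least this much.
    return 1 - 2 * d * (delta + (1 - delta) * eps) / (d - m)
```

For fixed m, δ, ε the finite-domain bound increases with d and converges to the distribution-free threshold, which is exactly how the proof of Theorem 5 drives the violation probability toward 1 − 2(δ + (1 − δ)ε).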
Lemma 6 Let X be a set of d points, and let C be the set of all labelings of X. Let D be the uniform distribution over X. Suppose A is a learning algorithm over sample space X, concept and hypothesis space C. For any accuracy parameters δ, ε and any m < d, there is a concept c ∈ C such that when the oracle draws from D labeled according to c, the probability that A samples more than m points is at least 1 − 2d(δ + (1 − δ)ε)/(d − m).

Proof: We use the probabilistic method to find a particularly bad concept c∗. Suppose we do not start with a fixed target concept c, but draw it uniformly from C. In other words, c is determined by values {c(x)}_{x∈X} drawn uniformly from {0, 1}. Given some x_1, . . . , x_m, c(x_1), . . . , c(x_m), and x ∉ {x_1, . . . , x_m}, the value of c(x) is a fair coin flip. Thus if on x_1, . . . , x_m labeled by c(x_1), . . . , c(x_m), A outputs a hypothesis without asking for more samples, then the hypothesis is incorrect on x with probability 1/2. If we now let x vary, the probability that the hypothesis is incorrect on x is at least (1/2)(d − m)/d, since there are at least d − m points not in x_1, . . . , x_m. Now suppose for any c the probability that A samples more than m points is at most ρ. Then the unconditional probability that the hypothesis output by A is incorrect on a random sample point is at least (1 − ρ)(1/2)(d − m)/d. This implies that there is a concept c∗ ∈ C such that the probability that the hypothesis output by A is incorrect on a random sample point is at least (1 − ρ)(1/2)(d − m)/d.

Since A is a learning algorithm, when we use c∗ to label the training points, and use accuracy parameters δ, ε, the probability that the hypothesis produced by A has error probability greater than ε is at most δ.
If we make the worst-case assumption that whenever the error probability of the hypothesis is larger than ε it is exactly 1, and otherwise the error probability is exactly ε, then the probability that the hypothesis output by A is incorrect on a random sample point is at most δ · 1 + (1 − δ)ε. Thus (1 − ρ)(1/2)(d − m)/d ≤ δ + (1 − δ)ε, implying that ρ ≥ 1 − 2d(δ + (1 − δ)ε)/(d − m).

Now, in order to prove Theorem 5, we essentially show that there is some fixed concept c∗ that behaves as the bad c’s in arbitrary instances of Lemma 6.

Proof of Theorem 5: Consider the following program P : {0, 1}∗ → {0, 1}. First it interprets the given string x ∈ {0, 1}∗ as a tuple ⟨δ, ε, m, d, i⟩ for δ ∈ (0, 1), ε ∈ (0, 1/2) and m, d, i ∈ N, using some fixed one-to-one encoding of such tuples as binary strings. If x cannot be decoded appropriately, or if i > d, then P returns 0. Otherwise, for these δ, ε, m, d, let X̂ ⊂ {0, 1}∗ be the set of d strings which are interpreted as {⟨δ, ε, m, d, 1⟩, . . . , ⟨δ, ε, m, d, d⟩}, and let D̂ be the distribution that is uniform over X̂ and 0 elsewhere. Let Ĉ be the set of all possible labelings of X̂. For each labeling ĉ ∈ Ĉ, program P computes the probability ρ_ĉ that A, given accuracy parameters δ, ε, queries for more than m sample points if points are drawn from D̂ labeled according to ĉ. For each ĉ, this requires simulating A for at most d^m different sequences of sample points. Let ĉ∗ = argmax_{ĉ∈Ĉ} {ρ_ĉ}, breaking ties in some fixed way. Finally, P outputs ĉ∗(x).

Observe that P is total recursive, since A spends a finite time on any finite sequence of sample points. (This is a weaker condition than the always-halting requirement of our definition of a learning algorithm.) Thus P is some c∗ ∈ C.
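The inner computation of P can be made concrete on a toy scale. In the sketch below (our own illustration; the stub `stops_within` stands in for the query behavior of A and is not a real learner), we enumerate all 2^d labelings ĉ of d points and compute each ρ_ĉ exactly by simulating the stub on all d^m equiprobable length-m sample prefixes, mirroring the “at most d^m different sequences” step of the proof:

```python
from itertools import product

d, m = 3, 2                 # toy domain size and sample bound
points = list(range(d))

def stops_within(prefix, c):
    # Stub in place of A: it stops sampling once it has seen both labels,
    # or all d points. Whether it stops within m samples depends only on
    # the first m draws, so length-m prefixes suffice.
    pts, labels = set(), set()
    for x in prefix:
        pts.add(x)
        labels.add(c[x])
        if labels == {0, 1} or len(pts) == d:
            return True
    return False

def rho(c):
    # Exact probability (uniform iid draws) that the stub asks for > m samples.
    seqs = list(product(points, repeat=m))
    return sum(not stops_within(s, c) for s in seqs) / len(seqs)

labelings = list(product([0, 1], repeat=d))
c_star = max(labelings, key=rho)   # the adversarial labeling chosen by P
```

Here a constant labeling never shows the stub both labels, so it can never stop early and ρ_ĉ∗ = 1: the maximization inside P finds exactly the labeling that forces the most queries, which is the adversarial behavior the proof exploits.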
Further, for any δ, ε, m, d, on all points ⟨δ, ε, m, d, i⟩ for i ≤ d, P finds the same ĉ∗, and thus on these points c∗ acts like this ĉ∗. By Lemma 6, if m < d then this ĉ∗ has the property that ρ_ĉ∗ ≥ 1 − 2d(δ + (1 − δ)ε)/(d − m). Therefore, if A is given accuracy parameters δ, ε, the target concept is c∗, and the distribution D is uniform over {⟨δ, ε, m, d, 1⟩, . . . , ⟨δ, ε, m, d, d⟩} for some d ∈ N such that m < d, then the probability that A requests more than m samples is at least 1 − 2d(δ + (1 − δ)ε)/(d − m). Since we can choose D such that d is large enough, we obtain the desired result.

5 Conclusion

We have shown that learning arbitrary computable classifiers is possible in the statistical learning paradigm. However, for any computable learning algorithm, the number of samples required to learn to a desired level of accuracy may become arbitrarily large depending on the sample distribution. This is in contrast to uncomputable learning methods in the same universal setting, whose sample complexity can be bounded independently of the distribution.

Our results mean that there is a big price in terms of sample complexity to be paid for the combination of universality and computability of the learner. Specifically, by tweaking the distribution we can make a computable universal learner arbitrarily worse than a restricted learning algorithm on a finite VC-dimensional hypothesis space, or even an uncomputable universal learner.

While we have presented a single computable learning algorithm in our universal setting, one would like to develop a measure that would allow different learning algorithms to be compared to each other in terms of sample complexity. We have seen that sample complexity m(c, δ, ε) is not such a measure; is there a viable alternative?

Finally, we have ignored computation time in our analysis. As such, our learning algorithm is not likely to have practical significance. Integrating running time into the theory presented would be a critical extension.

References

[Ang88] D. Angluin. Identifying languages from stochastic examples. Technical report, Yale University, Department of Computer Science, 1988.

[BEHW89] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929–965, 1989.

[BI94] G. M. Benedek and A. Itai. Nonuniform learnability. Journal of Computer and System Sciences, pages 311–323, 1994.

[Gol67] E. M. Gold. Language identification in the limit. Information and Control, 10:447–474, 1967.

[GR97] O. Goldreich and D. Ron. On universal learning algorithms. Information Processing Letters, 63(3):131–136, 1997.

[KV94] M. J. Kearns and U. V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.

[Lev73] L. A. Levin. Universal sequential search problems. Problems of Information Transmission, 9(3):265–266, 1973.

[LMR88] N. Linial, Y. Mansour, and R. L. Rivest. Results on learnability and the Vapnik-Chervonenkis dimension. 29th Annual Symposium on Foundations of Computer Science, pages 120–129, 1988.

[PV88] L. Pitt and L. G. Valiant. Computational limitations on learning from examples. Journal of the ACM, 35(4):965–984, 1988.

[Rya05] D. Ryabko. On computability of pattern recognition problems. In Proceedings of the 16th International Conference on Algorithmic Learning Theory, pages 148–156. Springer, 2005.

[Val84] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27:1134–1142, 1984.

[Vap98] V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.

[VC71] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264–280, 1971.
