A simple randomized algorithm for sequential prediction of ergodic time series



(Appeared in: IEEE Trans. Inform. Theory 45 (1999), no. 7, 2642–2650.)

László Györfi, Gábor Lugosi, Gusztáv Morvai

Abstract

We present a simple randomized procedure for the prediction of a binary sequence. The algorithm uses ideas from recent developments of the theory of the prediction of individual sequences. We show that if the sequence is a realization of a stationary and ergodic random process then the average number of mistakes converges, almost surely, to that of the optimum, given by the Bayes predictor. The desirable finite-sample properties of the predictor are illustrated by its performance for Markov processes. In such cases the predictor exhibits near optimal behavior even without knowing the order of the Markov process. Prediction with side information is also considered.

1 Introduction

We address the problem of sequential prediction of a binary sequence. A sequence of bits $y_1, y_2, \ldots \in \{0,1\}$ is hidden from the predictor. At each time instant $i = 1, 2, \ldots$, the predictor is asked to guess the value of the next outcome $y_i$ with knowledge of the past $y_1^{i-1} = (y_1, \ldots, y_{i-1})$ (where $y_1^0$ denotes the empty string). Thus, the predictor's decision, at time $i$, is based on the value of $y_1^{i-1}$. We also assume that the predictor has access to a sequence of independent, identically distributed (i.i.d.) random variables $U_1, U_2, \ldots$, uniformly distributed on $[0,1]$, so that the predictor can use $U_i$ in forming a randomized decision for $y_i$. Formally, the strategy of the predictor is a sequence $g = \{g_i\}_{i=1}^{\infty}$ of decision functions
$$g_i : \{0,1\}^{i-1} \times [0,1] \to \{0,1\},$$
and the randomized prediction formed at time $i$ is $g_i(y_1^{i-1}, U_i)$. The predictor pays a unit penalty each time a mistake is made.
After $n$ rounds of play, the normalized cumulative loss on the string $y_1^n$ is
$$L_1^n(g, U_1^n) = \frac{1}{n} \sum_{i=1}^n I_{\{g_i(y_1^{i-1}, U_i) \ne y_i\}},$$
where $I$ denotes the indicator function. When no confusion is caused, or when the predictor does not randomize, we will simply write $L_1^n(g) = L_1^n(g, U_1^n)$. In general, we denote the average number of mistakes between times $m$ and $n$ by
$$L_m^n(g, U_m^n) = \frac{1}{n - m + 1} \sum_{i=m}^n I_{\{g_i(y_1^{i-1}, U_i) \ne y_i\}}.$$
We also write $\hat L_1^n(g) = E\, L_1^n(g, U_1^n)$ and $\hat L_m^n(g) = E\, L_m^n(g, U_m^n)$ for the expected loss of the randomized strategy $g$. (Here the expectation is taken with respect to the randomization $U_1^n$.)

In this paper we assume that $y_1, y_2, \ldots$ are realizations of the random variables $Y_1, Y_2, \ldots$ drawn from the binary-valued stationary and ergodic process $\{Y_n\}_{-\infty}^{\infty}$. We assume that the randomizing variables $U_1, U_2, \ldots$ are independent of the process $\{Y_n\}_{-\infty}^{\infty}$. In this case there is a fundamental limit for the predictability of the sequence. This is stated in the next theorem, whose proof may be found in Algoet [2].

Theorem 1 (Algoet [2]) For any prediction strategy $g$ and stationary ergodic process $\{Y_n\}_{-\infty}^{\infty}$,
$$\liminf_{n\to\infty} L_1^n(g) \ge L^* \quad \text{almost surely},$$
where
$$L^* = E\left[ \min\left( P\{Y_0 = 1 \mid Y_{-\infty}^{-1}\},\, P\{Y_0 = 0 \mid Y_{-\infty}^{-1}\} \right) \right]$$
is the minimal (Bayes) probability of error of any decision for the value of $Y_0$ based on the infinite past $Y_{-\infty}^{-1} = (\ldots, Y_{-3}, Y_{-2}, Y_{-1})$.

Based on Theorem 1, the following definition is meaningful:

Definition 1 A prediction strategy $g$ is called universal if for all stationary and ergodic processes $\{Y_n\}_{-\infty}^{\infty}$,
$$\lim_{n\to\infty} L_1^n(g) = L^* \quad \text{almost surely}.$$

Therefore, universal strategies asymptotically achieve the best possible loss for all ergodic processes. The first question is, of course, whether such a strategy exists.
The affirmative answer follows from a more general result of Algoet [2]. Here we give an alternative proof which is based on earlier results of Ornstein and Bailey.

Theorem 2 (Algoet [2]) There exists a universal prediction scheme.

Proof. Ornstein [19] proved that there exists a sequence of functions $f_i : \{0,1\}^i \to [0,1]$, $i = 1, 2, \ldots$, such that for all ergodic processes $\{Y_n\}_{-\infty}^{\infty}$,
$$\lim_{n\to\infty} f_n(Y_{-n}^{-1}) = P\{Y_0 = 1 \mid Y_{-\infty}^{-1}\} \quad \text{almost surely}. \tag{1}$$
Bailey [4] observed that for such estimators, for all ergodic processes,
$$\lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^n \left| f_{i-1}(Y_1^{i-1}) - P\{Y_i = 1 \mid Y_{-\infty}^{i-1}\} \right| = 0 \quad \text{almost surely}. \tag{2}$$
Indeed, (1) and Breiman's generalized ergodic theorem (see Lemma 4 in the Appendix) yield (2). Once such a sequence $\{f_i\}$ of estimators is available, we may define a (nonrandomized) prediction scheme by the plug-in predictor
$$g_n(y_1^{n-1}) = \begin{cases} 1 & \text{if } f_{n-1}(y_1^{n-1}) \ge \tfrac12 \\ 0 & \text{otherwise.} \end{cases}$$
It is well known that the probability of error of such a plug-in predictor may be bounded by the $L_1$ error of the estimator it is based on. In particular, by a simple inequality appearing in the proof of [9, Theorem 2.2],
$$P\left\{ g_n(Y_1^{n-1}) \ne Y_n \mid Y_{-\infty}^{n-1} \right\} - P\left\{ g^*(Y_{-\infty}^{n-1}) \ne Y_n \mid Y_{-\infty}^{n-1} \right\} \le 2 \left| f_{n-1}(Y_1^{n-1}) - P\left\{ Y_n = 1 \mid Y_{-\infty}^{n-1} \right\} \right|.$$
Therefore,
$$\begin{aligned}
|L_1^n(g) - L^*| \le{}& \left| L_1^n(g) - \frac{1}{n} \sum_{i=1}^n P\left\{ g_i(Y_1^{i-1}) \ne Y_i \mid Y_1^{i-1} \right\} \right| \\
&+ \frac{1}{n} \sum_{i=1}^n \left| P\left\{ g_i(Y_1^{i-1}) \ne Y_i \mid Y_1^{i-1} \right\} - P\left\{ g^*(Y_{-\infty}^{i-1}) \ne Y_i \mid Y_{-\infty}^{i-1} \right\} \right| \\
&+ \left| \frac{1}{n} \sum_{i=1}^n P\left\{ g^*(Y_{-\infty}^{i-1}) \ne Y_i \mid Y_{-\infty}^{i-1} \right\} - L^* \right| \\
\le{}& \left| L_1^n(g) - \frac{1}{n} \sum_{i=1}^n P\left\{ g_i(Y_1^{i-1}) \ne Y_i \mid Y_1^{i-1} \right\} \right| \\
&+ \frac{2}{n} \sum_{i=1}^n \left| f_{i-1}(Y_1^{i-1}) - P\left\{ Y_i = 1 \mid Y_{-\infty}^{i-1} \right\} \right| \\
&+ \left| \frac{1}{n} \sum_{i=1}^n P\left\{ g^*(Y_{-\infty}^{i-1}) \ne Y_i \mid Y_{-\infty}^{i-1} \right\} - L^* \right|.
\end{aligned}$$
The first term of the right-hand side tends to zero almost surely by the Hoeffding–Azuma inequality (Lemma 5 in the Appendix) and the Borel–Cantelli lemma. The second one converges to zero almost surely by (2), and the third term tends to zero almost surely by the ergodic theorem. ✷

It was Ornstein [19] who first proved the existence of estimators satisfying (1). This was later generalized by Algoet [1]. A simpler estimator with the same convergence property was introduced by Morvai, Yakowitz, and Györfi [17]. Unfortunately, even the simpler estimator needs such large amounts of data that its practical use is unrealistic. By this we mean that even for "simple" i.i.d. or Markov processes the rate of convergence of the estimator is very slow. Motivated by the need for a practical estimator, Morvai, Yakowitz, and Algoet [18] introduced an even simpler algorithm. However, it is not known whether their estimator satisfies (1), and we do not even know whether the corresponding predictor is universal. The purpose of this paper is to introduce a new simple universal predictor whose finite-sample performance for Markov processes promises practical applicability.

2 A simple universal algorithm

In this section we present a simple prediction strategy and prove its universality. It is motivated by some recent developments from the theory of the prediction of individual sequences (see, e.g., Vovk [22], Feder, Merhav, and Gutman [10], Littlestone and Warmuth [12], Cesa-Bianchi et al. [7]). These methods predict according to a combination of several predictors, the so-called experts. The main idea in this paper is that if the sequence to predict is drawn from a stationary and ergodic process, combining the predictions of a small and simple set of appropriately chosen predictors (the so-called experts) suffices to achieve universality.
First we define an infinite sequence of experts $h^{(1)}, h^{(2)}, \ldots$ as follows. Fix a positive integer $k$, and for each $n \ge 1$, $s \in \{0,1\}^k$ and $y \in \{0,1\}$ define the function $\hat P_n^k : \{0,1\} \times \{0,1\}^{n-1} \times \{0,1\}^k \to [0,1]$ by
$$\hat P_n^k(y, y_1^{n-1}, s) = \frac{\left| \{ k < i < n : y_{i-k}^{i-1} = s,\ y_i = y \} \right|}{\left| \{ k < i < n : y_{i-k}^{i-1} = s \} \right|} \quad \text{for } n > k+1, \tag{3}$$
where $0/0$ is defined to be $1/2$. Also, for $n \le k+1$ we define $\hat P_n^k(y, y_1^{n-1}, s) = 1/2$. In other words, $\hat P_n^k(y, y_1^{n-1}, s)$ is the proportion of the appearances of the bit $y$ following the string $s$ among all appearances of $s$ in the sequence $y_1^{n-1}$. The expert $h^{(k)}$ is a sequence of functions $h_n^{(k)} : \{0,1\}^{n-1} \to \{0,1\}$, $n = 1, 2, \ldots$, defined by
$$h_n^{(k)}(y_1^{n-1}) = \begin{cases} 0 & \text{if } \hat P_n^k(0, y_1^{n-1}, y_{n-k}^{n-1}) > \tfrac12 \\ 1 & \text{otherwise,} \end{cases} \qquad n = 1, 2, \ldots.$$
That is, expert $h^{(k)}$ is a (nonrandomized) prediction strategy which looks for all appearances of the last seen string $y_{n-k}^{n-1}$ of length $k$ in the past and predicts according to the larger of the relative frequencies of 0's and 1's following the string. We may call $h^{(k)}$ a $k$-th order empirical Markov strategy.

The proposed prediction algorithm proceeds as follows. Let $m = 0, 1, 2, \ldots$ be a nonnegative integer. For $2^m \le n < 2^{m+1}$, the prediction is based upon a weighted majority of the predictions of the experts $h^{(1)}, \ldots, h^{(2^{m+1})}$ as follows:
$$g_n(y_1^{n-1}, u) = \begin{cases} 0 & \text{if } u > \dfrac{\sum_{k=1}^{2^{m+1}} h_n^{(k)}(y_1^{n-1})\, w_n(k)}{\sum_{k=1}^{2^{m+1}} w_n(k)} \\[1ex] 1 & \text{otherwise,} \end{cases} \qquad n = 1, 2, \ldots,$$
where $w_n(k)$ is the weight of expert $h^{(k)}$ defined by the past performance of $h^{(k)}$ as $w_{2^m}(k) = 1$ and
$$w_n(k) = e^{-\eta_m (n - 2^m) L_{2^m}^{n-1}(h^{(k)})} \quad \text{for } 2^m < n < 2^{m+1},$$
where $\eta_m = \sqrt{8 \ln(2^{m+1}) / 2^m}$.
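To make the construction concrete, the empirical estimate (3), the $k$-th order empirical Markov expert, and the weighted-majority rule can be sketched in Python. This is a minimal illustrative sketch, not the paper's implementation: the names `p_hat`, `expert`, and `predict` are ours, and everything is recomputed from scratch at each step, whereas a practical implementation would maintain the counts of (3) and the cumulative losses $L_{2^m}^{n-1}(h^{(k)})$ incrementally.

```python
import math

def p_hat(y, past, k):
    """Empirical estimate (3): the fraction of occurrences of bit y following
    the context past[-k:] among all occurrences of that context; 0/0 -> 1/2."""
    n = len(past) + 1            # we are predicting y_n from y_1^{n-1}
    if n <= k + 1:
        return 0.5
    s = past[-k:]
    matches = hits = 0
    for j in range(k, n - 1):    # j is the 0-based index of candidate bit y_{j+1}
        if past[j - k:j] == s:
            matches += 1
            if past[j] == y:
                hits += 1
    return hits / matches if matches else 0.5

def expert(k, past):
    """k-th order empirical Markov strategy h^(k)."""
    return 0 if p_hat(0, past, k) > 0.5 else 1

def predict(past, u):
    """Weighted-majority prediction g_n(y_1^{n-1}, u): weights are reset to 1
    at each power of two and decay exponentially in each expert's mistakes."""
    n = len(past) + 1
    m = n.bit_length() - 1       # 2^m <= n < 2^(m+1)
    start = 2 ** m
    eta = math.sqrt(8.0 * math.log(2.0 ** (m + 1)) / 2 ** m)
    num = den = 0.0
    for k in range(1, 2 ** (m + 1) + 1):
        mistakes = sum(expert(k, past[:i - 1]) != past[i - 1]
                       for i in range(start, n))   # = (n - 2^m) L_{2^m}^{n-1}(h^(k))
        w = math.exp(-eta * mistakes)
        num += expert(k, past) * w
        den += w
    return 0 if u > num / den else 1
```

Here `past` plays the role of $y_1^{n-1}$ and `u` that of the randomizing variable $U_n$; for example, with a `past` of length 4 the call `predict(past, u)` forms the weighted-majority vote of the experts $h^{(1)}, \ldots, h^{(8)}$.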
Recall that
$$L_{2^m}^{n-1}(h^{(k)}) = \frac{1}{n - 2^m} \sum_{i=2^m}^{n-1} I_{\{h_i^{(k)}(y_1^{i-1}) \ne y_i\}}$$
is the average number of mistakes made by expert $h^{(k)}$ between times $2^m$ and $n-1$. The weight of each expert is therefore exponentially decreasing in the number of its mistakes on this part of the data.

Remarks.

1. The above-mentioned estimator of Morvai, Yakowitz, and Algoet [18] selects a value of $k$ in a certain data-dependent manner, and uses the corresponding estimate $\hat P_n^k$. The new estimate, however, takes a mixture (weighted average) over all possible values of $k$, with exponential weights depending on the past performance of each component estimator. As Lemma 1 below suggests, this technique guarantees a number of errors almost as small as that of the best expert (i.e., the best value of $k$).

2. Ryabko [21] proposed an estimator somewhat similar in spirit to the predictor defined here. Ryabko used a mixture of empirical Markov predictors, and proved its universality for all stationary and ergodic processes in a sense related to the Kullback–Leibler divergence. The idea of diversifying Markov strategies also appears in Algoet [1].

3. Each time $n$ equals a power of two, all weights are reset to 1, and a simple majority vote is taken among the experts. This is necessary to make the algorithm sequential and to be able to incorporate more and more experts in the decision. If the total length of the sequence to be predicted were finite (say $n$) and known in advance, then no such resetting would be necessary; one could just use the first $n$ experts as Lemma 1 below describes. However, to achieve universality, an infinite class of experts is necessary. As the first part of the proof of Theorem 3 below shows, we do not lose much by such a resetting of the weights.

4. Related prediction schemes have been proposed by Feder, Merhav, and Gutman [10] for individual sequences. Their computationally quite simple methods are shown to predict asymptotically as well as any finite-state predictor.

The main result of this section is the universality of this simple prediction scheme:

Theorem 3 The prediction scheme $g$ defined above is universal.

In the proof we use a beautiful result of Cesa-Bianchi et al. [7]. It states that, given a set of $K$ experts and a sequence of fixed length $n$, there exists a randomized predictor whose number of mistakes is not more than that of the best predictor plus $\sqrt{(n/2)\ln K}$, for all possible sequences $y_1^n$. The simpler algorithm and statement cited below is due to Cesa-Bianchi [6]:

Lemma 1 Let $\tilde h^{(1)}, \ldots, \tilde h^{(K)}$ be a finite collection of prediction strategies (experts), and let $\eta > 0$. Then if the prediction strategy $\tilde g$ is defined by
$$\tilde g_i(y_1^{i-1}, u) = \begin{cases} 0 & \text{if } u > \dfrac{\sum_{k=1}^K P\{\tilde h^{(k)}(y_1^{i-1}, U_i) = 1\}\, \tilde w_i(k)}{\sum_{k=1}^K \tilde w_i(k)} \\[1ex] 1 & \text{otherwise,} \end{cases} \qquad i = 1, 2, \ldots,$$
where for all $k = 1, \ldots, K$, $\tilde w_1(k) = 1$ and
$$\tilde w_i(k) = e^{-\eta (i-1) \hat L_1^{i-1}(\tilde h^{(k)})}, \qquad i > 1,$$
then for every $n \ge 1$ and $y_1^n \in \{0,1\}^n$,
$$\hat L_1^n(\tilde g) \le \min_{k=1,\ldots,K} \hat L_1^n(\tilde h^{(k)}) + \frac{\ln K}{\eta n} + \frac{\eta}{8}.$$
In particular, if $N$ is a positive integer and $\eta = \sqrt{8 N^{-1} \ln K}$, then
$$\hat L_1^n(\tilde g) \le \min_{k=1,\ldots,K} \hat L_1^n(\tilde h^{(k)}) + \frac{\sqrt{N}}{n} \sqrt{\frac{\ln K}{2}}, \qquad n \le N.$$

Proof of Theorem 3. Taking $K = 2^{m+1}$ and $N = 2^m$ in Lemma 1, we have that the expected number of errors committed by $g$ on the segment $2^m, \ldots, 2^{m+1}-1$ is bounded, for any $y_{2^m}^{2^{m+1}-1} \in \{0,1\}^{2^m}$, as
$$\hat L_{2^m}^{2^{m+1}-1}(g) = E\left[ \frac{1}{2^m} \sum_{i=2^m}^{2^{m+1}-1} I_{\{g_i(y_1^{i-1}, U_i) \ne y_i\}} \right] \le \min_{k \le 2^{m+1}} L_{2^m}^{2^{m+1}-1}(h^{(k)}) + \sqrt{\frac{\ln(2^{m+1})}{2 \cdot 2^m}} = \min_{k=1,2,\ldots} L_{2^m}^{2^{m+1}-1}(h^{(k)}) + \sqrt{\frac{\ln(2^{m+1})}{2 \cdot 2^m}},$$
where the last equality follows from the fact that for all $i < 2^{m+1}$, all experts $h^{(k)}$ with $k \ge 2^{m+1}$ predict identically to $h^{(2^{m+1})}$. (Note that since the predictors $h^{(k)}$ are deterministic, for every $m$, $\hat L_{2^m}^{2^{m+1}-1}(h^{(k)}) = L_{2^m}^{2^{m+1}-1}(h^{(k)})$.) Similarly, denoting $\bar n = 2^{\lfloor \log_2 n \rfloor + 1}$, and invoking Lemma 1 with $K = \bar n$ and $N = \bar n/2$,
$$\hat L_{\bar n/2}^n(g) \le \min_{k=1,2,\ldots} L_{\bar n/2}^n(h^{(k)}) + \frac{\sqrt{\bar n/2}}{n - \bar n/2 + 1} \sqrt{\frac{\ln \bar n}{2}}.$$
Therefore, for any sequence $y_1, y_2, \ldots$,
$$\begin{aligned}
n \hat L_1^n(g) &= \sum_{m=0}^{\lfloor \log_2 n \rfloor - 1} 2^m \hat L_{2^m}^{2^{m+1}-1}(g) + (n - \bar n/2 + 1) \hat L_{\bar n/2}^n(g) \\
&\le \sum_{m=0}^{\lfloor \log_2 n \rfloor - 1} 2^m \left( \min_{k=1,2,\ldots} L_{2^m}^{2^{m+1}-1}(h^{(k)}) + \sqrt{\frac{\ln(2^{m+1})}{2 \cdot 2^m}} \right) + (n - \bar n/2 + 1) \left( \min_{k=1,2,\ldots} L_{\bar n/2}^n(h^{(k)}) + \frac{\sqrt{\bar n/2}}{n - \bar n/2 + 1} \sqrt{\frac{\ln \bar n}{2}} \right) \\
&\le n \min_{k=1,2,\ldots} L_1^n(h^{(k)}) + \sum_{m=0}^{\lfloor \log_2 n \rfloor - 1} \sqrt{\frac{2^m \ln(2^{m+1})}{2}} + \sqrt{\frac{(\bar n/2) \ln \bar n}{2}} \\
&= n \min_{k=1,2,\ldots} L_1^n(h^{(k)}) + \sum_{m=0}^{\lfloor \log_2 n \rfloor} \sqrt{\frac{2^m \ln(2^{m+1})}{2}}.
\end{aligned}$$
Denoting $\mu = \lfloor \log_2 n \rfloor$, we may write
$$\sum_{m=0}^{\mu} \sqrt{\frac{2^m \ln(2^{m+1})}{2}} \le \sqrt{\frac{\ln(2^{\mu+1})}{2}} \sum_{m=0}^{\mu} 2^{m/2} < \sqrt{\frac{\ln(2^{\mu+1})}{2}} \cdot \frac{2^{(\mu+1)/2}}{\sqrt{2} - 1} = c \sqrt{\frac{\bar n \log_2 \bar n}{2}},$$
where
$$c = \frac{\sqrt{\ln 2}}{\sqrt{2} - 1} \approx 2.01.$$
Thus, we obtain
$$\hat L_1^n(g) \le \min_{k=1,2,\ldots} L_1^n(h^{(k)}) + \frac{c}{n} \sqrt{\frac{\bar n \log_2 \bar n}{2}} \le \min_{k=1,2,\ldots} L_1^n(h^{(k)}) + c \sqrt{\frac{\log_2 n + 1}{n}}.$$
Noting that for any fixed sequence $y_1^n$, $L_1^n(g, U_1^n)$ is an average of $[0,1]$-valued independent random variables whose expectation is $\hat L_1^n(g)$, we may use Hoeffding's inequality [11] to see that for any sequence $y_1^n$ and any $\epsilon > 0$,
$$P\left\{ \left| L_1^n(g, U_1^n) - \hat L_1^n(g) \right| > \epsilon \right\} \le 2 e^{-2n\epsilon^2}. \tag{4}$$
Therefore, if $L$ is now evaluated on the random sequence $Y_1, Y_2, \ldots$, we obtain
$$\limsup_{n\to\infty} L_1^n(g, U_1^n) \le \limsup_{n\to\infty} \left( \min_{k=1,2,\ldots} L_1^n(h^{(k)}) + c \sqrt{\frac{\log_2 n + 1}{n}} \right) = \limsup_{n\to\infty} \min_{k=1,2,\ldots} L_1^n(h^{(k)}) \quad \text{almost surely}.$$
Thus, it remains to show that for any ergodic process $Y_1, Y_2, \ldots$,
$$\limsup_{n\to\infty} \min_{k=1,2,\ldots} L_1^n(h^{(k)}) \le L^* \quad \text{almost surely}. \tag{5}$$
This will follow easily from the following lemma:

Lemma 2 For any $k \ge 1$,
$$\limsup_{n\to\infty} L_1^n(h^{(k)}) \le L^* + \epsilon_k \quad \text{almost surely},$$
where $\epsilon_k > 0$ is such that $\lim_{k\to\infty} \epsilon_k = 0$.

Remark. If the process $\{Y_n\}$ happens to be $m$-th order Markov, then it is easy to see that $\epsilon_k = 0$ for all $k \ge m$. The performance of the predictor for such processes is investigated in the next section.

Proof. Introduce
$$\tilde L_1^n(h^{(k)}) = \frac{1}{n} \sum_{i=1}^n P\{Y_i \ne h_i^{(k)}(Y_1^{i-1}) \mid Y_{-\infty}^{i-1}\}.$$
By Lemma 5 in the Appendix we immediately obtain
$$\lim_{n\to\infty} \left| L_1^n(h^{(k)}) - \tilde L_1^n(h^{(k)}) \right| = 0 \quad \text{almost surely}.$$
Therefore, it suffices to show that
$$\limsup_{n\to\infty} \tilde L_1^n(h^{(k)}) \le L^* + \epsilon_k \quad \text{almost surely}.$$
To this end, first we study the asymptotic behavior of the quantity $P\{Y_0 \ne h_n^{(k)}(Y_{-n+1}^{-1}) \mid Y_{-\infty}^{-1}\}$. Notice that
$$\begin{aligned}
P\{Y_0 \ne h_n^{(k)}(Y_{-n+1}^{-1}) \mid Y_{-\infty}^{-1}\} \le{}& I_A\, P\{Y_0 \ne h_n^{(k)}(Y_{-n+1}^{-1}) \mid Y_{-\infty}^{-1}\} + I_{B_k}\, P\{Y_0 \ne h_n^{(k)}(Y_{-n+1}^{-1}) \mid Y_{-\infty}^{-1}\} \\
&+ I_{C_k}\, P\{Y_0 \ne h_n^{(k)}(Y_{-n+1}^{-1}) \mid Y_{-\infty}^{-1}\} + I_{D_k^c}\, P\{Y_0 \ne h_n^{(k)}(Y_{-n+1}^{-1}) \mid Y_{-\infty}^{-1}\},
\end{aligned} \tag{6}$$
where
$$A = \left\{ P\{Y_0 = 1 \mid Y_{-\infty}^{-1}\} = \tfrac12 \right\},$$
$$B_k = \left\{ P\{Y_0 = 1 \mid Y_{-\infty}^{-1}\} < \tfrac12 \ \text{and} \ P\{Y_0 = 1 \mid Y_{-k}^{-1}\} < \tfrac12 \right\},$$
$$C_k = \left\{ P\{Y_0 = 0 \mid Y_{-\infty}^{-1}\} < \tfrac12 \ \text{and} \ P\{Y_0 = 0 \mid Y_{-k}^{-1}\} < \tfrac12 \right\},$$
and $D_k = A \cup B_k \cup C_k$. Notice that
$$P\{Y_0 \ne h_n^{(k)}(Y_{-n+1}^{-1}) \mid Y_{-\infty}^{-1}\} = P\{Y_0 = 1 \mid Y_{-\infty}^{-1}\}\, I_{\{\hat P_n^k(1, Y_{-n+1}^{-1}, Y_{-k}^{-1}) \le 1/2\}} + P\{Y_0 = 0 \mid Y_{-\infty}^{-1}\}\, I_{\{\hat P_n^k(1, Y_{-n+1}^{-1}, Y_{-k}^{-1}) > 1/2\}}. \tag{7}$$
Now we examine the four terms on the right-hand side of (6).
For the first term, (7) yields
$$I_A\, P\{Y_0 \ne h_n^{(k)}(Y_{-n+1}^{-1}) \mid Y_{-\infty}^{-1}\} = I_A \cdot \tfrac12 = I_A \min\left( P\{Y_0 = 1 \mid Y_{-\infty}^{-1}\}, P\{Y_0 = 0 \mid Y_{-\infty}^{-1}\} \right).$$
For the second term, observe that under $B_k$, for sufficiently large $n$, $\hat P_n^k(1, Y_{-n+1}^{-1}, Y_{-k}^{-1}) < 1/2$ almost surely, and therefore by (7) we have
$$\lim_{n\to\infty} I_{B_k}\, P\{Y_0 \ne h_n^{(k)}(Y_{-n+1}^{-1}) \mid Y_{-\infty}^{-1}\} = I_{B_k} \min\left( P\{Y_0 = 1 \mid Y_{-\infty}^{-1}\}, P\{Y_0 = 0 \mid Y_{-\infty}^{-1}\} \right) \quad \text{a.s.}$$
For the third term we obtain similarly
$$\lim_{n\to\infty} I_{C_k}\, P\{Y_0 \ne h_n^{(k)}(Y_{-n+1}^{-1}) \mid Y_{-\infty}^{-1}\} = I_{C_k} \min\left( P\{Y_0 = 1 \mid Y_{-\infty}^{-1}\}, P\{Y_0 = 0 \mid Y_{-\infty}^{-1}\} \right) \quad \text{a.s.}$$
The last term is simply bounded by
$$I_{D_k^c}\, P\{Y_0 \ne h_n^{(k)}(Y_{-n+1}^{-1}) \mid Y_{-\infty}^{-1}\} \le I_{D_k^c}.$$
Combining all these bounds, we obtain
$$P\{Y_0 \ne h_n^{(k)}(Y_{-n+1}^{-1}) \mid Y_{-\infty}^{-1}\} \le I_{D_k}\, P\{Y_0 \ne h_n^{(k)}(Y_{-n+1}^{-1}) \mid Y_{-\infty}^{-1}\} + I_{D_k^c} \tag{8}$$
and
$$\lim_{n\to\infty} I_{D_k}\, P\{Y_0 \ne h_n^{(k)}(Y_{-n+1}^{-1}) \mid Y_{-\infty}^{-1}\} = I_{D_k} \min\left( P\{Y_0 = 1 \mid Y_{-\infty}^{-1}\}, P\{Y_0 = 0 \mid Y_{-\infty}^{-1}\} \right) \quad \text{a.s.} \tag{9}$$
From (8) it is immediate that
$$\limsup_{n\to\infty} \tilde L_1^n(h^{(k)}) \le \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^n I_{D_k}(T^i Y_{-\infty}^{\infty})\, P\{Y_i \ne h_i^{(k)}(Y_1^{i-1}) \mid Y_{-\infty}^{i-1}\} + \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^n I_{D_k^c}(T^i Y_{-\infty}^{\infty}),$$
where $T$ denotes the left shift operator defined on doubly infinite binary sequences $y_{-\infty}^{\infty} \in \{0,1\}_{-\infty}^{\infty}$. By this inequality and (9), Breiman's generalized ergodic theorem (see Lemma 4 in the Appendix) implies
$$\limsup_{n\to\infty} \tilde L_1^n(h^{(k)}) \le E\left[ \min\left( P\{Y_0 = 1 \mid Y_{-\infty}^{-1}\}, P\{Y_0 = 0 \mid Y_{-\infty}^{-1}\} \right) \right] + P\{D_k^c\} = L^* + P\{D_k^c\} \quad \text{almost surely}.$$
Since by the martingale convergence theorem
$$\lim_{k\to\infty} P\{Y_0 = 1 \mid Y_{-k}^{-1}\} = P\{Y_0 = 1 \mid Y_{-\infty}^{-1}\} \quad \text{almost surely},$$
we have $\lim_{k\to\infty} P\{D_k^c\} = 0$. Taking $\epsilon_k = P\{D_k^c\}$, the proof of the lemma is complete. ✷

Now we return to the proof of Theorem 3. By Lemma 2, for arbitrary $K$,
$$\limsup_{n\to\infty} \min_{k=1,2,\ldots} L_1^n(h^{(k)}) \le \limsup_{n\to\infty} L_1^n(h^{(K)}) \le L^* + \epsilon_K.$$
Since $K$ is arbitrary and $\epsilon_K \to 0$, (5) is established, and the proof of the theorem is finished. ✷

Remarks.

1. The proposed estimate is clearly easy to compute. One merely has to keep track of the expected cumulative losses $L_{2^m}^{n-1}(h^{(k)})$ for $k = 1, 2, \ldots, n$. However, for large $n$, storing the entire data history may be problematic. In such cases, more efficient tree-based data structures, such as the ones described by Feder, Merhav, and Gutman [10], may be applied. We do not investigate this issue further here.

2. We see from the analysis that for any sequence $y_1, y_2, \ldots$ and for all $n$,
$$\hat L_1^n(g) \le \min_{k=1,2,\ldots} L_1^n(h^{(k)}) + 2.01 \sqrt{\frac{\log_2 n + 1}{n}},$$
and that the difference $|L_1^n(g, U_1^n) - \hat L_1^n(g)|$ between the actual loss and the expected loss is $O_p(n^{-1/2})$. (For a sequence of random variables $\{X_n\}$ and a sequence of nonnegative numbers $\{a_n\}$ we say that $X_n = O_p(a_n)$ if for every $\epsilon > 0$ there exists a constant $c > 0$ such that $\limsup_{n\to\infty} P\{|X_n| \ge c a_n\} < \epsilon$.) The rate of convergence to $L^*$ depends on the behavior of the best expert for the time segment up to $n$. For example, in the next section we show that for $m$-th order Markov processes the $m$-th expert predicts very well, and this fact will suffice to derive performance bounds for the proposed predictor.

3. The proposed predictor is by no means the only possibility. Different sets of experts may be combined in a similar fashion, and universality only depends on the behavior of the best expert. If some additional information is known (or suspected) about the process to be predicted, this information may be built into the definition of the experts.
We chose the empirical Markov strategies as experts for convenience, and as we will see in the next section, this choice pays off whenever the process happens to be finite order Markov.

3 Markov processes

In this section we assume that the process to predict $\{Y_n\}_{-\infty}^{\infty}$ is (in addition to being stationary and ergodic) $m$-th order Markov, that is, for any binary sequence $y_{-\infty}^{-1} = (\ldots, y_{-2}, y_{-1})$,
$$P\{Y_0 = 1 \mid Y_{-\infty}^{-1} = y_{-\infty}^{-1}\} = P\{Y_0 = 1 \mid Y_{-m}^{-1} = y_{-m}^{-1}\},$$
where $m$ is a positive integer. We show that the proposed predictor achieves a nearly optimal performance for any $m$ and for any such process, even though the predictor does not use the knowledge that the process is $m$-th order Markov. The intuitive reason for such behavior is the following: we have seen in the previous section that for any sequence,
$$L_1^n(g, U_1^n) \le \min_{k=1,2,\ldots} L_1^n(h^{(k)}) + 3 \sqrt{\frac{\log_2 n + 1}{n}} + O_p\left( \sqrt{\frac{1}{n}} \right).$$
On the other hand, if the sequence is $m$-th order Markov, then there exists an expert, namely $h^{(m)}$, with very good performance. In order to simplify our analysis, we modify the experts somewhat. They are defined as before, but the probability estimates of (3) are now replaced by
$$\bar P_n^k(y, y_1^{n-1}, s) = \frac{\left| \{ k < i < n : y_{i-k}^{i-1} = s,\ y_i = y \} \right| + 1}{\left| \{ k < i < n : y_{i-k}^{i-1} = s \} \right| + 2}. \tag{10}$$
In other words, the simple empirical frequency counts are now replaced by the corresponding Laplace estimates. It is easy to see that all results of Section 2 remain valid for the modified predictor.

Remark. The reason for this modification is that this way we can appeal to a result of Rissanen [20] which simplifies our analysis. We believe that similar performance bounds are true for the original predictor of Section 2.

In the next theorem we compare the performance of our predictor to the universal lower bound $L^*$.
The statement only gives information about the expected loss, but we believe this result already illustrates the good behavior of the proposed predictor for Markov processes.

Theorem 4 If the process to be predicted is a stationary and ergodic $m$-th order Markov process, then the cumulative loss $L_1^n(g) = L_1^n(g, U_1^n)$ of the prediction strategy of Section 2 (with the modified estimates of (10)) satisfies
$$E\, L_1^n(g) \le L^* + 2 \sqrt{\frac{2^{m-1} \log n}{n}} + 3 \sqrt{\frac{\log_2 n + 1}{n}} + \sqrt{\frac{c}{n}},$$
where $c > 0$ is a universal constant.

Proof. First note that (4) implies
$$E\left[ \left| L_1^n(g, U_1^n) - \hat L_1^n(g) \right| \,\Big|\, Y_{-\infty}^{\infty} \right] \le \int_0^{\infty} 2 e^{-2n\epsilon^2}\, d\epsilon \le \sqrt{\frac{\ln(2e)}{2n}}$$
(see, e.g., [9, page 208]), and therefore it suffices to investigate $\hat L_1^n(g)$. Recall also from the proof of Theorem 3 that for any input sequence,
$$\hat L_1^n(g) \le \min_{k=1,2,\ldots} L_1^n(h^{(k)}) + 3 \sqrt{\frac{\log_2 n + 1}{n}},$$
and, in particular,
$$\hat L_1^n(g) \le L_1^n(h^{(m)}) + 3 \sqrt{\frac{\log_2 n + 1}{n}}.$$
Thus, it suffices to show that for $m$-th order Markov processes the performance of the $m$-th expert $h^{(m)}$ satisfies
$$E\, L_1^n(h^{(m)}) \le L^* + 2 \sqrt{\frac{2^{m-1} \log n}{n}} + \sqrt{\frac{c}{n}}$$
for some constant $c$. To this end, observe that, on the one hand,
$$E\, L_1^n(h^{(m)}) = E\left[ \frac{1}{n} \sum_{i=1}^n I_{\{h_i^{(m)}(Y_1^{i-1}) \ne Y_i\}} \right] = E\left[ \frac{1}{n} \sum_{i=1}^n P\{h_i^{(m)}(Y_1^{i-1}) \ne Y_i \mid Y_{-\infty}^{i-1}\} \right],$$
and on the other hand, by the Markov property,
$$L^* = E\left[ \frac{1}{n} \sum_{i=1}^n \min\left( P\{Y_i = 1 \mid Y_{i-m}^{i-1}\},\, P\{Y_i = 0 \mid Y_{i-m}^{i-1}\} \right) \right] = E\left[ \frac{1}{n} \sum_{i=1}^n P\{h^{(m,*)}(Y_{i-m}^{i-1}) \ne Y_i \mid Y_{-\infty}^{i-1}\} \right],$$
where $h^{(m,*)}$ is the Bayes decision, given, for any $s \in \{0,1\}^m$, by
$$h^{(m,*)}(s) = \begin{cases} 1 & \text{if } P\{Y_0 = 1 \mid Y_{-m}^{-1} = s\} \ge 1/2 \\ 0 & \text{otherwise.} \end{cases}$$
(Note that the optimal predictor, that is, the one which minimizes the probability of error at every step, predicts according to $h^{(m,*)}$.)
The above equalities imply that
$$\begin{aligned}
E\, L_1^n(h^{(m)}) - L^* &\le \frac{1}{n} \sum_{i=1}^n E \left| P\{h_i^{(m)}(Y_1^{i-1}) \ne Y_i \mid Y_{-\infty}^{i-1}\} - P\{h^{(m,*)}(Y_{i-m}^{i-1}) \ne Y_i \mid Y_{-\infty}^{i-1}\} \right| \\
&\le \frac{2}{n} \sum_{i=1}^n E \left| \bar P_i^m(1, Y_1^{i-1}, Y_{i-m}^{i-1}) - P\{Y_i = 1 \mid Y_1^{i-1}\} \right|,
\end{aligned}$$
where the second inequality follows by [9, Theorem 2.2]. In the rest of the proof we simply apply some known results from the theory of universal prediction. First, by applications of Jensen's and Pinsker's inequalities (see Merhav and Feder [15, eq. (20)]) we obtain
$$\frac{2}{n} \sum_{i=1}^n E \left| \bar P_i^m(1, Y_1^{i-1}, Y_{i-m}^{i-1}) - P\{Y_i = 1 \mid Y_1^{i-1}\} \right| \le 2 \sqrt{ \frac{1}{n} \sum_{i=1}^n \sum_{y_1^{i-1} \in \{0,1\}^{i-1}} P\{Y_1^{i-1} = y_1^{i-1}\} \sum_{j=0}^1 P\{Y_i = j \mid Y_1^{i-1} = y_1^{i-1}\} \log \frac{P\{Y_i = j \mid Y_1^{i-1} = y_1^{i-1}\}}{\bar P_i^m(j, y_1^{i-1}, y_{i-m}^{i-1})} }.$$
Observe that on the right-hand side, under the square root sign, we have the normalized Kullback–Leibler divergence between the probability measure of $Y_1^n$ and its estimate constructed as a product of the Laplace estimates (10). But this divergence, for $m$-th order Markov sources, is well known to be bounded by
$$\frac{2^m}{2n} \log n + O\left( \frac{1}{n} \right);$$
see Rissanen [20]. This concludes the proof. ✷

Remarks.

1. As Theorem 4 shows, by exponential weighting of the empirical Markov strategies, the predictor automatically adapts to the unknown Markov order. Similar results, though in a different setup, are achieved by Modha and Masry [13],[14] by complexity regularization.

2. Merhav, Feder, and Gutman [16] showed that if the process is $m$-th order Markov, then the randomized predictor $\tilde h^{(m)}$ defined by
$$\tilde h_i^{(m)}(y_1^{i-1}, U) = \begin{cases} 0 & \text{if } \hat P_i^m(0, y_1^{i-1}, y_{i-m}^{i-1}) > \tfrac12 \\ 1 & \text{if } \hat P_i^m(0, y_1^{i-1}, y_{i-m}^{i-1}) < \tfrac12 \\ I_{\{U \ge 1/2\}} & \text{otherwise} \end{cases}$$
achieves $E\, L_1^n(\tilde h^{(m)}) - L^* \le C/n$, where $C$ is a constant depending on the distribution of the process. However, in an interesting contrast, the best distribution-free upper bound for all $m$-th order Markov processes is of the order of $n^{-1/2}$. To illustrate this, consider the case $m = 0$, that is, when $\{Y_n\}$ is an i.i.d. process with $P\{Y_1 = 1\} = 1/2 + \theta$, and the predictor $\tilde h^{(0)}$ is based on a majority vote of the bits appearing in the past. In this case, for every $n$,
$$\sup_{\theta \in [-1/2, 1/2]} \left( E\, L_1^n(\tilde h^{(0)}) - L^* \right) \ge c_1 n^{-1/2},$$
where $c_1$ is a universal constant. (This is straightforward to see by considering $\theta = c n^{-1/2}$ for some small constant $c$, and writing
$$\begin{aligned}
E\, L_1^n(\tilde h^{(0)}) - L^* &= \frac{1}{n} \sum_{i=1}^n \left( P\{\tilde h^{(0)}(Y_1^{i-1}, U_i) \ne Y_i\} - \left( \tfrac12 - \theta \right) \right) = \frac{1}{n} \sum_{i=1}^n 2\theta\, P\{\tilde h^{(0)}(Y_1^{i-1}, U_i) = 0\} \\
&\ge \frac{1}{n} \sum_{i=1}^n 2\theta\, P\left\{ \sum_{j=1}^{i-1} Y_j < \frac{i-1}{2} \right\} = \frac{1}{n} \sum_{i=1}^n 2\theta\, P\left\{ \sum_{j=1}^{i-1} (Y_j - E\, Y_j) < -(i-1)\theta \right\}.
\end{aligned}$$
Finally, invoke the Berry–Esséen theorem (see, e.g., [8]) to deduce that there exists a universal constant $c_2$ such that $P\{\sum_{j=1}^{i-1} (Y_j - E\, Y_j) < -(i-1)\theta\} \ge c_2$ for every $2 \le i \le n$.) Thus, even though for every single value of $\theta$, $E\, L_1^n(\tilde h^{(m)}) - L^*$ converges to zero at a rate of $O(1/n)$, the minimax rate of convergence is, in fact, $O(1/\sqrt{n})$. Since the upper bound in Theorem 4 is independent of the distribution, we see that, in this sense (ignoring logarithmic factors), the order of magnitude of the bound is the best possible.
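For concreteness, the randomized predictor $\tilde h^{(m)}$ displayed above (follow the empirical majority for the current length-$m$ context, and break ties with a fair coin via the indicator $I_{\{U \ge 1/2\}}$) can be sketched in Python. The function name `tilde_h` is ours; the uniform variable `u` plays the role of $U$:

```python
def tilde_h(past, k, u):
    """Randomized k-th order Markov predictor: follow the empirical majority
    of the bits that followed the current length-k context in the past; on a
    tie (including the no-data case), output the indicator I{u >= 1/2}."""
    n = len(past) + 1
    if n <= k + 1:                   # no usable history: hat-P = 1/2, a tie
        return 1 if u >= 0.5 else 0
    s = past[-k:] if k else []
    zeros = ones = 0
    for j in range(k, n - 1):        # j is the 0-based index of bit y_{j+1}
        if past[j - k:j] == s:
            if past[j] == 0:
                zeros += 1
            else:
                ones += 1
    if zeros > ones:                 # hat-P(0) > 1/2
        return 0
    if ones > zeros:                 # hat-P(0) < 1/2
        return 1
    return 1 if u >= 0.5 else 0      # tie: I{u >= 1/2}
```

With `k = 0` this is exactly the majority vote $\tilde h^{(0)}$ of the past bits used in the lower-bound argument above.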
4 Prediction with side information

In this section we apply the same ideas to the seemingly more difficult classification (or pattern recognition) problem. The setup is the following: let $\{(X_n, Y_n)\}_{-\infty}^{\infty}$ be a stationary and ergodic sequence of pairs taking values in $\mathbb{R}^d \times \{0,1\}$. The problem is to predict the value of $Y_n$ given the data $(X_n, D_{n-1})$, where we denote $D_{n-1} = (X_1^{n-1}, Y_1^{n-1})$. The prediction problem is similar to the one studied in Section 2, with the exception that the sequence of $X_i$'s is also available to the predictor. One may think about the $X_i$'s as side information.

We may formalize the prediction problem as follows. A (randomized) prediction strategy is a sequence $g = \{g_i\}_{i=1}^{\infty}$ of decision functions
$$g_i : \{0,1\}^{i-1} \times \left( \mathbb{R}^d \right)^i \times [0,1] \to \{0,1\},$$
so that the prediction formed at time $i$ is $g_i(y_1^{i-1}, x_1^i, U_i)$. The normalized cumulative loss for any fixed pair of sequences $x_1^n, y_1^n$ is now
$$R_1^n(g, U_1^n) = \frac{1}{n} \sum_{i=1}^n I_{\{g_i(y_1^{i-1}, x_1^i, U_i) \ne y_i\}}.$$
We also use the short notation $R_1^n(g) = R_1^n(g, U_1^n)$. Denote the expected loss of the randomized strategy $g$ by $\hat R_1^n(g) = E\, R_1^n(g, U_1^n)$. We assume that the randomizing variables $U_1, U_2, \ldots$ are independent of the process $\{(X_n, Y_n)\}$. Just like in the case of prediction without side information, the fundamental limit is given by the Bayes probability of error:

Theorem 5 For any prediction strategy $g$ and stationary ergodic process $\{(X_n, Y_n)\}_{n=-\infty}^{\infty}$,
$$\liminf_{n\to\infty} R_1^n(g) \ge R^* \quad \text{almost surely},$$
where
$$R^* = E\left[ \min\left( P\{Y_0 = 1 \mid Y_{-\infty}^{-1}, X_{-\infty}^0\},\, P\{Y_0 = 0 \mid Y_{-\infty}^{-1}, X_{-\infty}^0\} \right) \right].$$

The proof of this lower bound is similar to that of Theorem 1; the details are omitted.
It follows from results of Morvai, Yakowitz, and Györfi [17] that there exists a prediction strategy $g$ such that for all ergodic processes, $R_1^n(g) \to R^*$ almost surely. (We omit the details here.) The algorithm of Morvai, Yakowitz, and Györfi, however, has a very slow rate of convergence even for i.i.d. processes. The main message of this section is a simple universal procedure with practical appeal. The idea, again, is to combine the decisions of a small number of simple experts in an appropriate way.

We define an infinite array of experts $h^{(k,\ell)}$, $k, \ell = 1, 2, \ldots$, as follows. Let $\mathcal{P}_\ell = \{A_{\ell,j},\ j = 1, 2, \ldots, m_\ell\}$ be a sequence of finite partitions of the feature space $\mathbb{R}^d$, and let $G_\ell$ be the corresponding quantizer:
$$G_\ell(x) = j, \quad \text{if } x \in A_{\ell,j}.$$
With some abuse of notation, for any $n$ and $x_1^n \in (\mathbb{R}^d)^n$, we write $G_\ell(x_1^n)$ for the sequence $G_\ell(x_1), \ldots, G_\ell(x_n)$. Fix positive integers $k, \ell$, and for each $s \in \{0,1\}^k$, $z \in \{1, 2, \ldots, m_\ell\}^{k+1}$, and $y \in \{0,1\}$ define
$$\hat P_n^{(k,\ell)}(y, y_1^{n-1}, x_1^n, s, z) = \frac{\left| \{ k < i < n : y_{i-k}^{i-1} = s,\ G_\ell(x_{i-k}^i) = z,\ y_i = y \} \right|}{\left| \{ k < i < n : y_{i-k}^{i-1} = s,\ G_\ell(x_{i-k}^i) = z \} \right|}, \quad n > k+1, \tag{11}$$
where $0/0$ is defined to be $1/2$. Also, for $n \le k+1$ we define $\hat P_n^{(k,\ell)}(y, y_1^{n-1}, x_1^n, s, z) = 1/2$. The expert $h^{(k,\ell)}$ is now defined by
$$h_n^{(k,\ell)}(y_1^{n-1}, x_1^n) = \begin{cases} 0 & \text{if } \hat P_n^{(k,\ell)}(0, y_1^{n-1}, x_1^n, y_{n-k}^{n-1}, G_\ell(x_{n-k}^n)) > \tfrac12 \\ 1 & \text{otherwise,} \end{cases} \qquad n = 1, 2, \ldots.$$
That is, expert $h^{(k,\ell)}$ quantizes the sequence $x_1^n$ according to the partition $\mathcal{P}_\ell$, and looks for all appearances of the last seen quantized strings $y_{n-k}^{n-1}$, $G_\ell(x_{n-k}^n)$ in the past. Then it predicts according to the larger of the relative frequencies of the 0's and 1's following the string.
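As an illustration, here is a minimal Python sketch of the expert $h^{(k,\ell)}$ for scalar side information taking values in $[0,1)$, with the nested dyadic partition of $[0,1)$ into $2^{\ell}$ intervals standing in for the partition sequence $\mathcal{P}_\ell$. This particular partition, and the names `g_quant` and `expert_side`, are our assumptions for the example, not the paper's; note that each quantized $x$-context has length $k+1$, since the current observation $x_n$ is also matched.

```python
def g_quant(x, lvl):
    """Dyadic quantizer G_l on [0,1): maps x to one of 2**lvl nested cells
    of width 2**(-lvl)."""
    return int(x * 2 ** lvl)

def expert_side(past_y, xs, k, lvl):
    """Expert h^(k,l) of (11): match the last k bits and the last k+1
    quantized side-information values jointly, then predict the more frequent
    bit seen after such matches (0/0 and the case n <= k+1 give 1/2 -> 1)."""
    n = len(past_y) + 1                 # xs holds x_1..x_n, past_y holds y_1..y_{n-1}
    q = [g_quant(x, lvl) for x in xs]
    if n <= k + 1:
        return 1                        # hat-P(0) = 1/2, which is not > 1/2
    s = past_y[-k:] if k else []
    z = q[-(k + 1):]
    matches = zeros = 0
    for j in range(k, n - 1):           # j is the 0-based index of bit y_{j+1}
        if past_y[j - k:j] == s and q[j - k:j + 1] == z:
            matches += 1
            if past_y[j] == 0:
                zeros += 1
    p0 = zeros / matches if matches else 0.5
    return 0 if p0 > 0.5 else 1
```

The dyadic cells are nested across levels and their diameters shrink to zero, so on $[0,1)$ this choice is in the spirit of the conditions imposed on the partitions below.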
The proposed algorithm combines the predictions of these experts similarly to that of Section 2. This way both the length of the string to be matched and the resolution of the quantizer are adjusted depending on the data. The formal definition is as follows: for any $m = 0, 1, 2, \ldots$, if $2^m \le n < 2^{m+1}$, the prediction is based upon a weighted majority of the predictions of the $(2^{m+1})^2$ experts $h^{(k,\ell)}$, $k, \ell \le 2^{m+1}$, as follows:
$$g_n(y_1^{n-1}, x_1^n, u) = \begin{cases} 0 & \text{if } u > \dfrac{\sum_{k,\ell \le 2^{m+1}} h_n^{(k,\ell)}(y_1^{n-1}, x_1^n)\, w_n(k,\ell)}{\sum_{k,\ell \le 2^{m+1}} w_n(k,\ell)}, \\ 1 & \text{otherwise,} \end{cases}$$
where $w_n(k,\ell)$ is the weight of expert $h^{(k,\ell)}$, defined by the past performance of $h^{(k,\ell)}$ as $w_{2^m}(k,\ell) = 1$ and
$$w_n(k,\ell) = e^{-\eta_m (n - 2^m) R_{2^m}^{n-1}(h^{(k,\ell)})} \quad \text{for } 2^m < n < 2^{m+1},$$
where $\eta_m = \sqrt{8 \ln (2^{m+1})^2 / 2^m}$.

To prove the universality of the method, we need some natural conditions on the sequence of partitions. We assume the following:

(a) the sequence of partitions is nested, that is, any cell of $\mathcal{P}_{\ell+1}$ is a subset of a cell of $\mathcal{P}_\ell$, $\ell = 1, 2, \ldots$;

(b) each partition $\mathcal{P}_\ell$ is finite;

(c) if $\mathrm{diam}(A) = \sup_{x,y \in A} \|x - y\|$ denotes the diameter of a set, then for each sphere $S$ centered at the origin,
$$\lim_{\ell\to\infty} \max_{j : A_{\ell,j} \cap S \neq \emptyset} \mathrm{diam}(A_{\ell,j}) = 0.$$

Remark. The next theorem states the universality of the proposed pattern recognition scheme. The definition of the algorithm is somewhat arbitrary; we just chose one of the many possibilities. In this version, at time $n$, only partitions with indices at most $n$ are taken into account. It is easy to see that the universality property remains valid if the number of partitions considered at time $n$ is an arbitrary, polynomially increasing function of $n$. The conditions for the sequence of partitions again give a lot of liberty to the user.
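The doubling-epoch weighting and the partition conditions can be sketched together. Below is a hedged Python illustration, not the paper's implementation: the feature space is restricted to $[0,1)$ so that the dyadic partitions into cells $[j 2^{-\ell}, (j+1) 2^{-\ell})$ are nested (a), finite (b), and have cell diameters $2^{-\ell} \to 0$ (c). The names `G` and `predict` are ours, and `predict` takes the experts' current votes and within-epoch average losses as plain dictionaries.

```python
import math

def G(l, x):
    """Dyadic quantizer on [0, 1): cell index j with x in [j 2^-l, (j+1) 2^-l).
    These partitions satisfy the nestedness, finiteness, and shrinking-diameter
    conditions (a)-(c)."""
    return min(int(x * 2 ** l), 2 ** l - 1)

def predict(votes, losses, n, u):
    """Randomized weighted-majority prediction g_n at time n, 2^m <= n < 2^(m+1).

    votes[(k, l)]  -- current {0, 1} prediction of expert h^{(k,l)},
    losses[(k, l)] -- average loss of h^{(k,l)} over rounds 2^m, ..., n-1,
    u              -- the uniform randomizing variable U_n.
    """
    m = int(math.log2(n))
    # eta_m = sqrt(8 ln((2^(m+1))^2) / 2^m); ln of the number of experts
    eta = math.sqrt(8 * math.log((2 ** (m + 1)) ** 2) / 2 ** m)
    num = den = 0.0
    for (k, l), h in votes.items():
        if k > 2 ** (m + 1) or l > 2 ** (m + 1):
            continue                       # only the (2^(m+1))^2 epoch experts
        # w_n(k,l) = exp(-eta_m (n - 2^m) R); equals 1 at the epoch start
        w = math.exp(-eta * (n - 2 ** m) * losses[(k, l)])
        num += h * w
        den += w
    # predict 1 with probability equal to the weighted fraction voting for 1
    return 0 if u > num / den else 1
```

Note that at the start of each epoch, $n = 2^m$, the exponent vanishes and all weights reset to 1, exactly as in the definition of $w_{2^m}(k,\ell)$ above.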
In applications, the partitions may be chosen to incorporate some prior knowledge about the process. In this paper we merely prove universality of the scheme. Performance bounds in the style of Section 3 for special types of processes may be derived, thanks to the powerful individual-sequence bounds. Here, however, the analysis may be substantially more complicated.

Theorem 6. Assume that the sequence of partitions $\mathcal{P}_\ell$ satisfies the three conditions above. Then the pattern recognition scheme $g$ defined above satisfies
$$\lim_{n\to\infty} R_1^n(g) = R^* \quad \text{almost surely}$$
for any stationary and ergodic process $\{(X_n, Y_n)\}_{n=-\infty}^{\infty}$.

Proof. As in the proof of Theorem 3, we obtain that for any stationary and ergodic process $\{(X_n, Y_n)\}_{n=-\infty}^{\infty}$,
$$\limsup_{n\to\infty} R_1^n(g, U_1^n) \le \limsup_{n\to\infty} \left( \min_{\substack{k = 1, 2, \ldots \\ \ell = 1, 2, \ldots, n-1}} R_1^n(h^{(k,\ell)}) + 2c\sqrt{\frac{\log_2 n + 1}{n}} \right) = \limsup_{n\to\infty} \min_{\substack{k = 1, 2, \ldots \\ \ell = 1, 2, \ldots, n-1}} R_1^n(h^{(k,\ell)})$$
almost surely. Thus, it remains to show that
$$\limsup_{n\to\infty} \min_{\substack{k = 1, 2, \ldots \\ \ell = 1, 2, \ldots, n-1}} R_1^n(h^{(k,\ell)}) \le R^* \quad \text{almost surely.}$$
To prove this, we use the following lemma, whose proof is easily obtained by copying that of Lemma 2:

Lemma 3. For each $k, \ell \ge 1$, there exists a positive number $\epsilon_{k,\ell}$ such that for any fixed $\ell$, $\lim_{k\to\infty} \epsilon_{k,\ell} = 0$ and
$$\limsup_{n\to\infty} R_1^n(h^{(k,\ell)}) \le R^*(\ell) + \epsilon_{k,\ell},$$
where
$$R^*(\ell) = \mathbb{E}\left[\min\left(P\{Y_0 = 1 \mid Y_{-\infty}^{-1}, G_\ell(X_{-\infty}^0)\},\; P\{Y_0 = 0 \mid Y_{-\infty}^{-1}, G_\ell(X_{-\infty}^0)\}\right)\right].$$

Now we return to the proof of Theorem 6. Since the sequence of partitions $\mathcal{P}_\ell$ is nested, and by condition (c), the sequences
$$P\{Y_0 = 1 \mid Y_{-\infty}^{-1}, G_\ell(X_{-\infty}^0)\} \quad \text{and} \quad P\{Y_0 = 0 \mid Y_{-\infty}^{-1}, G_\ell(X_{-\infty}^0)\}, \quad \ell = 1, 2, \ldots,$$
are martingales, and they converge almost surely to $P\{Y_0 = 1 \mid Y_{-\infty}^{-1}, X_{-\infty}^0\}$ and $P\{Y_0 = 0 \mid Y_{-\infty}^{-1}, X_{-\infty}^0\}$.
Thus, it follows from Lebesgue's dominated convergence theorem that
$$\lim_{\ell\to\infty} R^*(\ell) = \mathbb{E}\left[\min\left(P\{Y_0 = 1 \mid Y_{-\infty}^{-1}, X_{-\infty}^0\},\; P\{Y_0 = 0 \mid Y_{-\infty}^{-1}, X_{-\infty}^0\}\right)\right] = R^*.$$
Now it follows easily that
$$\limsup_{n\to\infty} \min_{\substack{k = 1, 2, \ldots \\ \ell = 1, 2, \ldots, n-1}} R_1^n(h^{(k,\ell)}) \le R^*$$
almost surely, and the proof of the theorem is finished. ∎

5 Appendix

Here we describe two results which are used in the analysis. The first is due to Breiman [5], and its proof may also be found in Algoet [2].

Lemma 4 (Breiman's generalized ergodic theorem [5]). Let $Z = \{Z_i\}_{-\infty}^{\infty}$ be a stationary and ergodic time series. Let $T$ denote the left shift operator. Let $f_i$ be a sequence of real-valued functions such that for some function $f$, $f_i(Z) \to f(Z)$ almost surely. Assume that $\mathbb{E} \sup_i |f_i(Z)| < \infty$. Then
$$\lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} f_i(T^i Z) = \mathbb{E} f(Z) \quad \text{almost surely.}$$

The second is the Hoeffding–Azuma inequality for sums of bounded martingale differences:

Lemma 5 (Hoeffding [11], Azuma [3]). Let $X_1, X_2, \ldots$ be a sequence of random variables, and assume that $V_1, V_2, \ldots$ is a martingale difference sequence with respect to $X_1, X_2, \ldots$. Assume furthermore that there exist random variables $Z_1, Z_2, \ldots$ and nonnegative constants $c_1, c_2, \ldots$ such that for every $i > 0$, $Z_i$ is a function of $X_1, \ldots, X_{i-1}$, and $Z_i \le V_i \le Z_i + c_i$ with probability one. Then for any $\epsilon > 0$ and $n$,
$$P\left\{\sum_{i=1}^{n} V_i \ge \epsilon\right\} \le e^{-2\epsilon^2 / \sum_{i=1}^{n} c_i^2} \quad \text{and} \quad P\left\{\sum_{i=1}^{n} V_i \le -\epsilon\right\} \le e^{-2\epsilon^2 / \sum_{i=1}^{n} c_i^2}.$$

Acknowledgement. We thank Nicolò Cesa-Bianchi for teaching us all we needed to know about prediction with expert advice. We are also grateful to Sid Yakowitz for illuminating discussions, and to the referees for a very careful reading of the manuscript and for valuable suggestions. We also thank Márta Horváth for useful conversations.

References

[1] P. Algoet.
Universal schemes for prediction, gambling, and portfolio selection. Annals of Probability, 20:901–941, 1992.

[2] P. Algoet. The strong law of large numbers for sequential decisions under uncertainty. IEEE Transactions on Information Theory, 40:609–634, 1994.

[3] K. Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, 68:357–367, 1967.

[4] D.H. Bailey. Sequential Schemes for Classifying and Predicting Ergodic Processes. PhD thesis, Stanford University, 1976.

[5] L. Breiman. The individual ergodic theorem of information theory. Annals of Mathematical Statistics, 28:809–811, 1957. Correction: Annals of Mathematical Statistics, 31:809–810, 1960.

[6] N. Cesa-Bianchi. Analysis of two gradient-based algorithms for on-line regression. In Proceedings of the 10th Annual Conference on Computational Learning Theory, pages 163–170. ACM Press, 1997.

[7] N. Cesa-Bianchi, Y. Freund, D.P. Helmbold, D. Haussler, R. Schapire, and M.K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427–485, 1997.

[8] Y.S. Chow and H. Teicher. Probability Theory: Independence, Interchangeability, Martingales (2nd edition). Springer-Verlag, New York, 1988.

[9] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York, 1996.

[10] M. Feder, N. Merhav, and M. Gutman. Universal prediction of individual sequences. IEEE Transactions on Information Theory, 38:1258–1270, 1992.

[11] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.

[12] N. Littlestone and M.K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212–261, 1994.

[13] D.S. Modha and E. Masry.
Minimum complexity regression estimation with weakly dependent observations. IEEE Transactions on Information Theory, 42:2133–2145, 1996.

[14] D.S. Modha and E. Masry. Memory-universal prediction of stationary random processes. IEEE Transactions on Information Theory, 44:117–133, 1998.

[15] N. Merhav and M. Feder. Universal prediction. IEEE Transactions on Information Theory, 44:2124–2147, 1998.

[16] N. Merhav, M. Feder, and M. Gutman. Some properties of sequential predictors for binary Markov sources. IEEE Transactions on Information Theory, 39:887–892, 1993.

[17] G. Morvai, S. Yakowitz, and L. Györfi. Nonparametric inference for ergodic, stationary time series. Annals of Statistics, 24:370–379, 1996.

[18] G. Morvai, S. Yakowitz, and P. Algoet. Weakly convergent stationary time series. IEEE Transactions on Information Theory, 43:483–498, 1997.

[19] D.S. Ornstein. Guessing the next output of a stationary process. Israel Journal of Mathematics, 30:292–296, 1978.

[20] J. Rissanen. Complexity of strings in the class of Markov sources. IEEE Transactions on Information Theory, 32:526–532, 1986.

[21] B.Ya. Ryabko. Prediction of random sequences and universal coding. Problems of Information Transmission, 24:87–96, 1988.

[22] V.G. Vovk. Aggregating strategies. In Proceedings of the Third Annual Workshop on Computational Learning Theory, pages 372–383. Association for Computing Machinery, New York, 1990.
