Nonparametric Statistical Inference for Ergodic Processes

Daniil Ryabko*, Boris Ryabko#

*SequeL, INRIA-Lille Nord Europe, France, daniil@ryabko.net
#Institute of Computational Technologies of Siberian Branch of Russian Academy of Science, Siberian State University of Telecommunications and Informatics, Novosibirsk, Russia; boris@ryabko.net

Abstract

In this work a method for statistical analysis of time series is proposed, which is used to obtain solutions to some classical problems of mathematical statistics under the only assumption that the process generating the data is stationary ergodic. Namely, three problems are considered: goodness-of-fit (or identity) testing, process classification, and the change point problem. For each of the problems a test is constructed that is asymptotically accurate for the case when the data is generated by stationary ergodic processes. The tests are based on empirical estimates of distributional distance.

1 Introduction

Overview. In this work we consider the problem of statistical analysis of time series, when nothing is known about the underlying process generating the data, except that it is stationary ergodic. There is a vast literature on time series analysis under various parametric assumptions, and also under such nonparametric assumptions as that the process has a finite memory or possesses certain mixing rates. While under these settings most of the problems of statistical analysis are clearly solvable and efficient algorithms exist, in the general setting of stationary ergodic processes it is far less clear what can be done in principle, which problems of statistical analysis admit a solution and which do not.
In this work we propose a method of statistical analysis of time series that allows us to demonstrate that some classical statistical problems indeed admit a solution under the only assumption that the data is stationary ergodic, whereas before solutions were known only for more restricted cases. The solutions are always constructive, that is, we present asymptotically accurate algorithms for each of the considered problems. All the algorithms are based on empirical estimates of distributional distance, which is at the core of the suggested approach. We suggest that the proposed approach can be applied to other problems of statistical analysis of time series, with the view of establishing principled positive results, leaving the task of finding optimal algorithms for each particular problem as a topic for further research. Here we concentrate on the following three problems: goodness-of-fit (or identity) testing, process classification, and the change point problem.

Goodness-of-fit testing. The first problem is the following problem of hypothesis testing. A stationary ergodic process distribution $\rho$ is known theoretically. Given a data sample, it is required to test whether it was generated by $\rho$, versus it was generated by any other stationary ergodic distribution that is different from $\rho$ (goodness-of-fit, or identity testing). The case of i.i.d. or finite-memory processes is widely studied (see e.g. [7]); in particular, when $\rho$ has a finite memory, [22] proposes a test against any stationary ergodic alternative: a test that can be based on an arbitrary universal code. It was noted in [27] that an asymptotically accurate test for the case of stationary ergodic processes over a finite alphabet exists (but no test was proposed).
Here we propose a concrete and simple asymptotically accurate goodness-of-fit test, which demonstrates the proposed approach: to use empirical distributional distance for hypothesis testing. By an asymptotically accurate test we mean the following. First, the Type I error of the test (or its size) is fixed and is given as a parameter to the test. That is, given any $\alpha > 0$ as an input, under $H_0$ (if the data sample was indeed generated by $\rho$) the probability that the test says "$H_1$" is not greater than $\alpha$. Second, under any hypothesis in $H_1$ (that is, if the distribution generating the data is different from $\rho$), the test will say "$H_0$" not more than a finite number of times, with probability 1. In other words, the Type I error of the test is fixed and the Type II error can be made not more than a finite number of times, as the data sample increases, with probability 1 under any stationary ergodic alternative.

A comment on this setting is in order. When the alternative $H_1$ is less general, e.g. distributions that have finite memory [11] or known mixing rates, one typically seeks a test that has optimal rates of decrease of probability of Type II error to 0. For our case, when the alternative is the set of all stationary ergodic processes, this rate is necessarily non-uniform. In this sense, the property that we establish for our test is the strongest possible. Observe that the notion of consistency that we consider is stronger than requiring that the test makes only a finite number of errors (either Type I or Type II) with probability 1, the setting considered, for example, in the cases when $H_0$ is composite, or for the process classification problem that we address in this work.

Process classification. In the next problem that we consider, we again have to decide whether a data sample was generated by a process satisfying a hypothesis $H_0$ or a hypothesis $H_1$.
However, here $H_0$ and $H_1$ are not known theoretically, but are represented by two additional data samples. More precisely, the problem is that of process classification, which can be formulated as follows. We are given three samples $X = (X_1, \dots, X_k)$, $Y = (Y_1, \dots, Y_m)$ and $Z = (Z_1, \dots, Z_n)$ generated by stationary ergodic processes with distributions $\rho_X$, $\rho_Y$ and $\rho_Z$. It is known that $\rho_X \ne \rho_Y$, while either $\rho_Z = \rho_X$ or $\rho_Z = \rho_Y$. It is required to test which one is the case. That is, we have to decide whether the sample $Z$ was generated by the same process as the sample $X$ or by the same process as the sample $Y$. This problem for the case of dependent time series was considered for example in [11], where a solution is presented under the finite-memory assumption. It is closely related to many important problems in statistics and application areas, such as pattern recognition, classification, etc. Apparently no asymptotically accurate procedure for process classification has been known so far for the general case of stationary ergodic processes. Here we propose a test that converges almost surely to the correct answer. In other words, the test makes only a finite number of errors with probability 1, with respect to any stationary ergodic processes generating the data. Unlike in the previous problem, here we do not explicitly distinguish between Type I and Type II error, since the hypotheses are by nature symmetric: $H_0$ is "$\rho_Z = \rho_X$" and $H_1$ is "$\rho_Z = \rho_Y$".

Change point estimation. Finally, we consider the change point problem. It is another classical problem, with a vast literature on both parametric (see e.g. [2]) and nonparametric (see e.g. [6]) methods for solving it.
In this work we address the case where the data is dependent, the form and structure of the dependence are unknown, and the marginal distributions before and after the change may be the same. We consider the following (off-line) setting of the problem: a (real-valued) sample $Z_1, \dots, Z_n$ is given, where $Z_1, \dots, Z_k$ are generated according to some distribution $\rho_X$ and $Z_{k+1}, \dots, Z_n$ are generated according to some distribution $\rho_Y$ which is different from $\rho_X$. It is known that the distributions $\rho_X$ and $\rho_Y$ are stationary ergodic, but nothing else is known about them.

Most literature on the change point problem for dependent time series assumes that the marginal distributions before and after the change point are different, and often also makes explicit restrictions on the dependence, such as requirements on mixing rates. Nonparametric methods used in these cases are typically based on the Kolmogorov-Smirnov statistic, the Cramér-von Mises statistic, or generalizations thereof [6, 4, 9]. The main difference of our results is that we do not assume that the single-dimensional marginals (or finite-dimensional marginals of any given fixed size) are different, and do not make any assumptions on the structure of dependence. The only assumption is that the (unknown) process distributions before and after the change point are stationary ergodic. Our result demonstrates that asymptotically accurate change point estimation is possible in this general setting.

Related problems. Let us briefly relate the three problems for which we present consistent tests to other problems of statistical analysis of stationary ergodic time series. First, a closely related problem is that of homogeneity testing. The problem is as follows: given two samples, one has to decide whether they were generated by the same process distribution or by different ones.
While solutions to this problem exist for i.i.d. data (see for example [3, 26], and references therein), for stationary ergodic processes (and even for a smaller class of B-processes) a consistent test does not exist, even in the binary-valued case, as was shown in [24]. This problem is closely related to the change point detection problem: given a single sample, one has to decide whether there was an abrupt change of distribution somewhere. If we know that there was such a change, then we can give an asymptotically consistent estimate for it, as we show here; however, if it is not known that the change point exists, no consistent change point test can be constructed, because there is no consistent test for homogeneity. In other words, we can tell where a change point is, if there is one, but we cannot (in general) tell whether there is one or not (in the case of stationary ergodic distributions). Observe that the process classification problem described above turns out to be easier than homogeneity testing: a consistent test exists for the former (constructed in this work) but not for the latter.

Other hypothesis testing problems that concern stationary time series include testing for having a certain memory (i.e. testing the hypothesis "$k$-order Markov process" versus "stationary ergodic, not $k$-order Markov"), testing for membership in parametric families, and others [12, 16, 17, 21, 22]. Some recent general results that characterize those hypotheses about finitely-valued ergodic processes that can be tested are provided in [23]. Finally, a related problem is that of prediction or forecasting [19, 14, 15, 20]. In this respect, the results of the present work clarify which problems can and which cannot be solved, when the only assumption on the data is that it is stationary ergodic.

Methodology.
All the tests that we construct are based on empirical estimates of the so-called distributional distance. For two processes $\rho_1, \rho_2$ a distributional distance is defined as $\sum_{k=1}^{\infty} w_k |\rho_1(B_k) - \rho_2(B_k)|$, where $w_k$ are positive summable real weights, e.g. $w_k = 2^{-k}$, and $B_k$ range over a countable field that generates the sigma-algebra of the underlying probability space. For example, if we are talking about finite-alphabet processes with the binary alphabet $A = \{0, 1\}$, $B_k$ would range over the set $A^* = \cup_{k \in \mathbb{N}} A^k$; that is, over all tuples $0, 1, 00, 01, 10, 11, 000, 001, \dots$ (of course, we could just as well omit, say, 1 and 11); therefore, the distributional distance in this case is the weighted sum of absolute differences of probabilities of all possible tuples. In this work we consider real-valued processes, so $B_k$ have to range through a suitable sequence of intervals, all pairs of such intervals, triples, etc. (e.g. we can use a sequence of partitions into cubes of decreasing volume, see the next section for formal definitions).

Although distributional distance is a natural concept that, for stochastic processes, has been studied for a while [10], its empirical estimates have not, to our knowledge, been used for statistical analysis of time series. We argue that this distance is rather natural for this kind of problem, first of all because it can be consistently estimated (unlike, for example, the $\bar{d}$ distance, which cannot [18] be consistently estimated for the general case of stationary ergodic processes). Secondly, it is always bounded, unlike (empirical) KL divergence, which is often used for statistical inference for time series (e.g. [7, 22, 1, 8, 13] and others). Other approaches to statistical analysis of stationary dependent time series include the use of (universal) codes [12, 22, 21].
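For the binary-alphabet example, the weighted sum can be sketched directly on samples. The following is a minimal illustration rather than the authors' implementation: the function names (`word_frequency`, `binary_words`, `truncated_distance`) and the truncation parameter `max_len` are assumptions introduced here, the tuples are enumerated in the order 0, 1, 00, 01, ..., and the infinite sum is cut off at tuples of length `max_len`.

```python
from itertools import product


def word_frequency(seq, word):
    """Fraction of length-len(word) windows of seq equal to word (0 if seq is too short)."""
    k, n = len(word), len(seq)
    if n < k:
        return 0.0
    hits = sum(1 for i in range(n - k + 1) if tuple(seq[i:i + k]) == word)
    return hits / (n - k + 1)


def binary_words(max_len):
    """Enumerate the tuples 0, 1, 00, 01, 10, 11, 000, ... up to length max_len."""
    for k in range(1, max_len + 1):
        yield from product((0, 1), repeat=k)


def truncated_distance(x, y, max_len=3):
    """Sum of 2^-i * |freq_x(B_i) - freq_y(B_i)| over the first tuples B_i (hypothetical cutoff)."""
    total = 0.0
    for i, word in enumerate(binary_words(max_len), start=1):
        total += 2.0 ** (-i) * abs(word_frequency(x, word) - word_frequency(y, word))
    return total


if __name__ == "__main__":
    alternating = [0, 1, 0, 1, 0, 1, 0, 1]
    constant = [0] * 8
    print(truncated_distance(alternating, constant))
```

Because the weights decay geometrically, the terms dropped by the cutoff contribute at most the tail sum of the weights, so a larger `max_len` only refines the value.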
Here we first show that distributional distance between stationary ergodic processes can be consistently estimated based on sampling, and then apply it to construct consistent tests for the three problems of statistical analysis described above.

Although empirical estimates of the distributional distance involve taking an infinite sum, in practice only a finite number of summands has to be calculated. This is because empirical estimates have to be compared to each other or to theoretically known probabilities, and since the (bounded) summands have (exponentially) decreasing weights, the result of the comparison is known after only finitely many evaluations. Therefore, the algorithms presented can be applied in practice. On the other hand, the main value of the results is in the demonstration of what is possible in principle; finding practically efficient procedures for each of the considered problems is an interesting problem for further research. A closely related but more practical approach is that of tests based on universal codes [21, 20].

2 Preliminaries

We consider (stationary ergodic) processes with the alphabet $A = \mathbb{R}$. The generalization to $A = \mathbb{R}^d$ is straightforward; moreover, the results can be extended to the case when $A$ is a complete separable metric space. We use the symbol $A^*$ for $\cup_{i=1}^{\infty} A^i$. Elements of $A^*$ are called words or sequences. For each $k, l \in \mathbb{N}$, let $B^{k,l}$ be a partition of the set $\mathbb{R}^k$ into $k$-dimensional cubes with volume $h_l^k$, such that $h_l^k \to 0$ when $l \to \infty$, for every $k \in \mathbb{N}$. Moreover, define $B^k = \cup_{l \in \mathbb{N}} B^{k,l}$. Let also $\mathcal{B} = \cup_{k=1}^{\infty} B^k$; since this set is countable we can introduce an enumeration $\mathcal{B} = \{B_i : i \in \mathbb{N}\}$. The set $\{B_i \times A^{\infty} : i \in \mathbb{N}\}$ generates the Borel $\sigma$-algebra on $\mathbb{R}^{\infty} = A^{\infty}$. For a set $B \in \mathcal{B}$ let $|B|$ be the index $k$ of the set $B^k$ that $B$ comes from: $|B| = k : B \in B^k$.
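As one concrete choice of the partitions $B^{k,l}$, assumed here purely for illustration, one can take half-open cubes of side $2^{-l}$ aligned to the origin, so that a cube in $\mathbb{R}^k$ has volume $2^{-lk}$, which tends to 0 as $l$ grows. The sketch below maps a point of $\mathbb{R}^k$ to the integer index of its cube; the names `cell_index` and `cell_bounds` are hypothetical and do not come from the paper.

```python
import math


def cell_index(point, level):
    """Integer index of the half-open cube of side 2**-level that contains `point`.

    `point` is a tuple of k floats; the result is a tuple of k integers.
    The dyadic, origin-aligned grid is an assumption made for this sketch.
    """
    side = 2.0 ** (-level)
    return tuple(math.floor(x / side) for x in point)


def cell_bounds(index, level):
    """Per-coordinate intervals [low, high) of the cube with the given index."""
    side = 2.0 ** (-level)
    return [(i * side, (i + 1) * side) for i in index]


if __name__ == "__main__":
    idx = cell_index((0.3, 1.7), level=1)
    print(idx, cell_bounds(idx, level=1))
```

With this choice, two points belong to the same element of the level-$l$ partition of $\mathbb{R}^k$ exactly when `cell_index` returns the same tuple for both, which makes cube membership easy to tabulate with a dictionary.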
For a sequence $X \in A^n$ and a set $B \in \mathcal{B}$ denote by $\nu(X, B)$ the frequency with which the sequence $X$ falls in the set $B$:

$$\nu(X, B) := \begin{cases} \dfrac{1}{n - |B| + 1} \sum_{i=1}^{n - |B| + 1} I_{\{(X_i, \dots, X_{i+|B|-1}) \in B\}} & \text{if } n \ge |B|, \\ 0 & \text{otherwise,} \end{cases}$$

where $X = (X_1, \dots, X_n)$. For example,

$$\nu\big((0.5, 1.5, 1.2, 1.4, 2.1),\ ([1.0, 2.0] \times [1.0, 2.0])\big) = 1/2,$$

since two of the four consecutive pairs, $(1.5, 1.2)$ and $(1.2, 1.4)$, fall in the square.

We use the symbol $\mathcal{S}$ for the set of all stationary ergodic processes on $A^{\infty}$. The ergodic theorem (see e.g. [5]) implies that for any process $\rho \in \mathcal{S}$ generating a sequence $X_1, X_2, \dots$ the frequency of observing a tuple that falls into each $B \in \mathcal{B}$ tends to its limiting (or a priori) probability a.s.: $\nu((X_1, \dots, X_n), B) \to \rho((X_1, \dots, X_{|B|}) \in B)$ as $n \to \infty$. We will often abbreviate $\rho((X_1, \dots, X_{|B|}) \in B) =: \rho(B)$.

Definition 1 (distributional distance). The distributional distance is defined for a pair of processes $\rho_1, \rho_2$ as follows [10]:

$$d(\rho_1, \rho_2) = \sum_{i=1}^{\infty} w_i |\rho_1(B_i) - \rho_2(B_i)|, \tag{1}$$

where $w_i$ are summable positive real weights (e.g. $w_i = 2^{-i}$).

It is easy to see that $d$ is a metric. The reader is referred to [10] for more information about $d$ and its properties.

Definition 2 (empirical distributional distance). For $X, Y \in A^*$, define the empirical distributional distance $\hat{d}(X, Y)$ as

$$\hat{d}(X, Y) := \sum_{i=1}^{\infty} w_i |\nu(X, B_i) - \nu(Y, B_i)|. \tag{2}$$

Similarly, we can define the empirical distance when only one of the process measures is unknown:

$$\hat{d}(X, \rho) := \sum_{i=1}^{\infty} w_i |\nu(X, B_i) - \rho(B_i)|, \tag{3}$$

where $\rho \in \mathcal{S}$ and $X \in A^*$.

The following lemma will play a key role in establishing the main results.

Lemma 1. Let two samples $X = (X_1, \dots, X_k)$ and $Y = (Y_1, \dots, Y_m)$ be generated by stationary ergodic processes $\rho_X$ and $\rho_Y$ respectively. Then

(i) $\lim_{k,m \to \infty} \hat{d}(X, Y) = d(\rho_X, \rho_Y)$ a.s.
(ii) $\lim_{k \to \infty} \hat{d}(X, \rho_Y) = d(\rho_X, \rho_Y)$ a.s.

Proof. For any $\varepsilon > 0$ we can find an index $J$ such that $\sum_{i=J}^{\infty} w_i < \varepsilon/2$. Moreover, for each $j$ we have $\nu((X_1, \dots, X_k), B_j) \to \rho_X(B_j)$ a.s., so that $|\nu((X_1, \dots, X_k), B_j) - \rho_X(B_j)| < \varepsilon/(4 J w_j)$ from some step $k$ on; define $K_j := k$. Let $K := \max_{j \le J} K_j$, and define $M$ analogously for the sample $Y$. Then for $k > K$ and $m > M$ we have

$$\begin{aligned}
|\hat{d}(X, Y) - d(\rho_X, \rho_Y)| &= \left| \sum_{i=1}^{\infty} w_i \big( |\nu(X, B_i) - \nu(Y, B_i)| - |\rho_X(B_i) - \rho_Y(B_i)| \big) \right| \\
&\le \sum_{i=1}^{\infty} w_i \big( |\nu(X, B_i) - \rho_X(B_i)| + |\nu(Y, B_i) - \rho_Y(B_i)| \big) \\
&\le \sum_{i=1}^{J} w_i \big( |\nu(X, B_i) - \rho_X(B_i)| + |\nu(Y, B_i) - \rho_Y(B_i)| \big) + \varepsilon/2 \\
&\le \sum_{i=1}^{J} w_i \big( \varepsilon/(4 J w_i) + \varepsilon/(4 J w_i) \big) + \varepsilon/2 = \varepsilon,
\end{aligned}$$

which proves the first statement. The second statement can be proven analogously.

Remark 1. While for the proofs the single-index definition of $d$ just introduced is more convenient, if the tests are to be computed the following definition should be easier to manage (all the statements below hold for this metric too):

$$d'(\rho_1, \rho_2) := \sum_{k,l} w_{k,l} \sum_{b \in B^{k,l}} |\rho_1(b) - \rho_2(b)|,$$

where again the weights $w_{k,l}$ should be summable, e.g. $w_{k,l} := 2^{-(k+l)}$.

3 Main results

3.1 Goodness-of-fit test

For a given stationary ergodic process measure $\rho$ and a sample $X = (X_1, \dots, X_n)$ we wish to test the hypothesis $H_0$ that the sample was generated by $\rho$ versus $H_1$ that it was generated by a stationary ergodic distribution that is different from $\rho$. Thus, $H_0 = \{\rho\}$ and $H_1 = \mathcal{S} \setminus H_0$.

Define the set $D_\delta^n$ as the set of all samples of length $n$ that are at least $\delta$-far from $\rho$ in empirical distributional distance:

$$D_\delta^n := \{X \in A^n : \hat{d}(X, \rho) \ge \delta\}.$$

For each $n$ and each given confidence level $\alpha$ define the critical region $C_\alpha^n$ of the test as

$$C_\alpha^n := D_\gamma^n \quad \text{where } \gamma := \inf\{\delta : \rho(D_\delta^n) \le \alpha\}. \tag{4}$$

The test rejects $H_0$ at confidence level $\alpha$ if $(X_1, \dots, X_n) \in C_\alpha^n$ and accepts it otherwise. In words, for each sequence we measure the distance between the empirical probabilities (frequencies) and the measure $\rho$ (that is, the theoretical $\rho$-probabilities); we then take the smallest ball (with respect to this distance) around $\rho$ that has $\rho$-probability not less than $1 - \alpha$. The test rejects all sequences outside this ball.

Definition 3 (Goodness-of-fit test). For each $n \in \mathbb{N}$ and $\alpha \in (0, 1)$ the goodness-of-fit test $G_n^\alpha : A^n \to \{0, 1\}$ is defined as

$$G_n^\alpha(X_1, \dots, X_n) := \begin{cases} 1 & \text{if } (X_1, \dots, X_n) \in C_\alpha^n, \\ 0 & \text{otherwise.} \end{cases}$$

Theorem 1. The test $G_n^\alpha$ has the following properties.

(i) For every $\alpha \in (0, 1)$ and every $n \in \mathbb{N}$ the Type I error of the test is not greater than $\alpha$: $\rho(G_n^\alpha = 1) \le \alpha$.

(ii) For every $\alpha \in (0, 1)$ the Type II error goes to 0 almost surely: for every $\rho' \ne \rho$ we have $\lim_{n \to \infty} G_n^\alpha = 1$ with $\rho'$-probability 1.

Proof. The first statement holds by construction. To prove the second statement, let the sample $X$ be generated by $\rho' \in \mathcal{S}$, $\rho' \ne \rho$, and define $\delta = d(\rho, \rho')/2$. By Lemma 1 we have $\rho(D_\delta^n) \to 0$, so that $\rho(D_\delta^n) < \alpha$ from some $n$ on; denote it $n_1$. Thus, for $n > n_1$ we have $D_\delta^n \subset C_\alpha^n$. At the same time, by Lemma 1 we have $\hat{d}(X, \rho) > \delta$ from some $n$ on, which we denote $n_2(X)$, with $\rho'$-probability 1. So, for $n > \max\{n_1, n_2(X)\}$ we have $X \in D_\delta^n \subset C_\alpha^n$, which proves statement (ii).

3.2 Process classification

Let there be given three samples $X = (X_1, \dots, X_k)$, $Y = (Y_1, \dots, Y_m)$ and $Z = (Z_1, \dots, Z_n)$. Each sample is generated by a stationary ergodic process $\rho_X$, $\rho_Y$ and $\rho_Z$ respectively. Moreover, it is known that either $\rho_Z = \rho_X$ or $\rho_Z = \rho_Y$, but $\rho_X \ne \rho_Y$. We wish to construct a test that, based on the finite samples $X$, $Y$ and $Z$, will tell whether $\rho_Z = \rho_X$ or $\rho_Z = \rho_Y$.
The test chooses the sample $X$ or $Y$ according to whichever is closer to $Z$ in $\hat{d}$. That is, we define the test $L(X, Y, Z)$ as follows. If $\hat{d}(X, Z) \le \hat{d}(Y, Z)$ then the test says that the sample $Z$ is generated by the same process as the sample $X$; otherwise it says that the sample $Z$ is generated by the same process as the sample $Y$.

Definition 4 (Process classifier). Define the classifier $L : A^* \times A^* \times A^* \to \{1, 2\}$ as follows:

$$L(X, Y, Z) := \begin{cases} 1 & \text{if } \hat{d}(X, Z) \le \hat{d}(Y, Z), \\ 2 & \text{otherwise,} \end{cases}$$

for $X, Y, Z \in A^*$.

Theorem 2. The test $L(X, Y, Z)$ makes only a finite number of errors when $|X|$, $|Y|$ and $|Z|$ go to infinity, with probability 1: if $\rho_X = \rho_Z$ then $L(X, Y, Z) = 1$ from some $|X|, |Y|, |Z|$ on with probability 1; otherwise $L(X, Y, Z) = 2$ from some $|X|, |Y|, |Z|$ on with probability 1.

Proof. From the fact that $d$ is a metric and from Lemma 1 we conclude that $\hat{d}(X, Z) \to 0$ (with probability 1) if and only if $\rho_X = \rho_Z$. So, if $\rho_X = \rho_Z$ then by assumption $\rho_Y \ne \rho_Z$ and $\hat{d}(X, Z) \to 0$ a.s. while $\hat{d}(Y, Z) \to d(\rho_Y, \rho_Z) \ne 0$. Thus in this case $\hat{d}(Y, Z) > \hat{d}(X, Z)$ from some $|X|, |Y|, |Z|$ on with probability 1, from which moment we have $L(X, Y, Z) = 1$. The opposite case is analogous.

3.3 Change point problem

The sample $Z = (Z_1, \dots, Z_n)$ consists of two concatenated parts $X = (X_1, \dots, X_k)$ and $Y = (Y_1, \dots, Y_m)$, where $m = n - k$, so that $Z_i = X_i$ for $1 \le i \le k$ and $Z_{k+j} = Y_j$ for $1 \le j \le m$. The samples $X$ and $Y$ are generated independently by two different stationary ergodic processes with alphabet $A = \mathbb{R}$. The distributions of the processes are unknown. The value $k$ is called the change point. It is assumed that $k$ is linear in $n$; more precisely, $\alpha n < k < \beta n$ for some $0 < \alpha \le \beta < 1$ from some $n$ on. It is required to estimate the change point $k$ based on the sample $Z$.
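To make this setting concrete, the sketch below scans candidate split points $t$ and returns the one at which the prefix and the suffix of $Z$ are furthest apart in a truncated empirical distance. It is an illustration under several assumptions that are not part of the paper: the distance is the cell-wise form of Remark 1 with weights $2^{-(k+l)}$, cut off at window length `max_k` and resolution `max_level`; the cubes are dyadic with side $2^{-l}$; the search interval is taken to be $[\alpha n, \beta n]$, matching the constraint $\alpha n < k < \beta n$; and all function names are hypothetical.

```python
import math
from collections import Counter


def cell_frequencies(sample, k, level):
    """Empirical frequencies of length-k windows of `sample` over cubes of side 2**-level."""
    n = len(sample)
    if n < k:
        return {}
    side = 2.0 ** (-level)
    counts = Counter(
        tuple(math.floor(v / side) for v in sample[i:i + k])
        for i in range(n - k + 1)
    )
    total = n - k + 1
    return {cell: c / total for cell, c in counts.items()}


def empirical_distance(x, y, max_k=2, max_level=3):
    """Truncated cell-wise empirical distance with weights 2**-(k + level)."""
    dist = 0.0
    for k in range(1, max_k + 1):
        for level in range(1, max_level + 1):
            fx = cell_frequencies(x, k, level)
            fy = cell_frequencies(y, k, level)
            cells = set(fx) | set(fy)
            dist += 2.0 ** (-(k + level)) * sum(
                abs(fx.get(c, 0.0) - fy.get(c, 0.0)) for c in cells
            )
    return dist


def estimate_change_point(z, alpha=0.1, beta=0.9):
    """Split point t in [alpha*n, beta*n] maximising the distance between z[:t] and z[t:]."""
    n = len(z)
    lo = max(1, math.ceil(alpha * n))
    hi = min(n - 1, math.floor(beta * n))
    return max(range(lo, hi + 1), key=lambda t: empirical_distance(z[:t], z[t:]))


if __name__ == "__main__":
    z = [0.1] * 60 + [0.9] * 60  # toy sample with a change after 60 observations
    print(estimate_change_point(z))
```

The toy input changes its marginal distribution, which is the easy case; including windows of length 2 and more is what lets this kind of scan react to changes that leave the single-dimensional marginals untouched. The scan recomputes all frequencies for every candidate $t$, so it is meant to show the idea rather than to be efficient.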
For each $t$, $1 \le t \le n$, denote by $U_t$ the sample $(Z_1, \dots, Z_t)$ consisting of the first $t$ elements of the sample $Z$, and denote by $V_t$ the remainder $(Z_{t+1}, \dots, Z_n)$.

Definition 5 (Change point estimator). Define the change point estimate $\hat{k} : A^* \to \mathbb{N}$ as follows:

$$\hat{k}(X_1, \dots, X_n) := \operatorname*{argmax}_{t \in [\alpha n,\, n - \beta n]} \hat{d}(U_t, V_t).$$

The following theorem establishes asymptotic consistency of this estimator.

Theorem 3. For the estimate $\hat{k}$ of the change point $k$ we have $|\hat{k} - k| = o(n)$ a.s., where $n$ is the size of the sample, and when $k, n - k \to \infty$ in such a way that $\alpha < k/n < \beta$ for some $\alpha, \beta \in (0, 1)$ from some $n$ on.

Proof. To prove the statement, we will show that for every $\gamma$, $0 < \gamma < 1$, with probability 1 the inequality $\hat{d}(U_t, V_t) < \hat{d}(X, Y)$ holds for each $t$ such that $\alpha n \le t < \gamma k$, possibly except for a finite number of times. Thus we will show that linear $\gamma$-underestimates occur only a finite number of times; for overestimates the argument is analogous. Fix some $\gamma$, $0 < \gamma < 1$, and $\varepsilon > 0$. Let $J$ be big enough to have $\sum_{i=J}^{\infty} w_i < \varepsilon/2$ and also big enough to have an index $j < J$ for which $\rho_X(B_j) \ne \rho_Y(B_j)$. Take $M_\varepsilon \in \mathbb{N}$ large enough to have $|\nu(Y, B_i) - \rho_Y(B_i)| \le \varepsilon/(2J)$ for all $m > M_\varepsilon$ and for each $i$, $1 \le i \le J$, and also to have $|B_i|/m < \varepsilon/J$ for each $i$, $1 \le i \le J$. This is possible since empirical frequencies converge to the limiting probabilities a.s. (that is, $M_\varepsilon$ depends on the realization $Y_1, Y_2, \dots$; cf. the proof of Lemma 1). Find a $K_\varepsilon$ (that depends on $X$) such that for all $k > K_\varepsilon$ and for all $i$, $1 \le i \le J$, we have

$$|\nu(U_t, B_i) - \rho_X(B_i)| \le \varepsilon/(2J) \quad \text{for each } t \in [\alpha n, \dots, k] \tag{5}$$

(this is possible simply because $\alpha n \to \infty$). Furthermore, we can select $K_\varepsilon$ large enough to have $|\nu((X_s, X_{s+1}, \dots, X_k), B_i) - \rho_X(B_i)| \le \varepsilon/(2J)$ for each $s \le \gamma k$: this follows from (5) and the identity

$$\nu((X_s, X_{s+1}, \dots, X_k), B_i) = \frac{k}{k - s}\, \nu((X_1, \dots, X_k), B_i) - \frac{s - 1}{k - s}\, \nu((X_1, \dots, X_{s-1}), B_i) + o(1).$$

So, for each $s \in [\alpha n, \gamma k]$ we have

$$\begin{aligned}
&\left| \nu(V_s, B_j) - \frac{(1 - \gamma) k\, \rho_X(B_j) + m\, \rho_Y(B_j)}{(1 - \gamma) k + m} \right| \\
&\quad \le \left| \frac{(1 - \gamma) k\, \nu((X_s, \dots, X_k), B_j) + m\, \nu(Y, B_j)}{(1 - \gamma) k + m} - \frac{(1 - \gamma) k\, \rho_X(B_j) + m\, \rho_Y(B_j)}{(1 - \gamma) k + m} \right| + \frac{|B_j|}{m + \gamma k} \le 3\varepsilon/J,
\end{aligned}$$

for $k > K_\varepsilon$ and $m > M_\varepsilon$ (from the definitions of $K_\varepsilon$ and $M_\varepsilon$). Hence

$$\begin{aligned}
&|\nu(X, B_j) - \nu(Y, B_j)| - |\nu(U_s, B_j) - \nu(V_s, B_j)| \\
&\quad \ge |\nu(X, B_j) - \nu(Y, B_j)| - \left| \nu(U_s, B_j) - \frac{(1 - \gamma) k\, \rho_X(B_j) + m\, \rho_Y(B_j)}{(1 - \gamma) k + m} \right| - 3\varepsilon/J \\
&\quad \ge |\rho_X(B_j) - \rho_Y(B_j)| - \left| \rho_X(B_j) - \frac{(1 - \gamma) k\, \rho_X(B_j) + m\, \rho_Y(B_j)}{(1 - \gamma) k + m} \right| - 4\varepsilon/J \\
&\quad = \delta_j - 4\varepsilon/J,
\end{aligned}$$

for some $\delta_j$ that depends only on $k/m$ and $\gamma$. Summing over all $B_i$, $i \in \mathbb{N}$, we get

$$\hat{d}(X, Y) - \hat{d}(U_s, V_s) \ge w_j \delta_j - 5\varepsilon,$$

for all $n$ such that $k > K_\varepsilon$ and $m > M_\varepsilon$, which is positive for small enough $\varepsilon$.

Acknowledgements

We are grateful to the anonymous reviewer who suggested a simplification of the process metric $d$ (see also Remark 1). This work has been supported by the French National Research Agency (ANR), project EXPLO-RA ANR-08-COSI-004 (Daniil Ryabko) and by the Russian Foundation for Basic Research, grant 09-07-00005-a (Boris Ryabko).

References

[1] R. Ahlswede, I. Csiszár, "Hypothesis testing with communication constraints," IEEE Trans. Information Theory, vol. 32, no. 4, pp. 533–542, 1986.
[2] M. Basseville, I. Nikiforov, Detection of Abrupt Changes: Theory and Applications. Prentice Hall, 1993.
[3] G. Biau, L. Györfi, "On the asymptotic properties of a nonparametric L1-test of homogeneity," IEEE Trans. Information Theory, vol. 51, pp. 3965–3973, 2005.
[4] E. Carlstein, S. Lele, "Nonparametric change-point estimation for data from an ergodic sequence," Teor. Veroyatnost. i Primenen., vol. 38, no. 4 (1993), pp. 910–917; translation in Theory Probab. Appl., vol. 38, no. 4, pp. 726–733, 1993.
[5] P. Billingsley, Ergodic theory and information. Wiley, New York, 1965.
[6] B. Brodsky, B. Darkhovsky, Nonparametric Methods in Change-Point Problems. Kluwer Academic Publishers, 1993.
[7] I. Csiszár, P. Shields, "Notes on Information Theory and Statistics: A tutorial," Foundations and Trends in Communications and Information Theory 1 (2004), pp. 1–111.
[8] I. Csiszár, "Information Theoretic Methods in Probability and Statistics," Information Theory Soc. Rev. articles, 1997. Available: http://www.itsoc.org/review/frrev.html
[9] L. Giraitis, R. Leipus, D. Surgailis, "The change-point problem for dependent observations," Journal of Statistical Planning and Inference, vol. 53, no. 3, pp. 297–310, 1996.
[10] R. Gray, Probability, Random Processes, and Ergodic Properties. Springer Verlag, 1988.
[11] M. Gutman, "Asymptotically Optimal Classification for Multiple Tests with Empirically Observed Statistics," IEEE Trans. Information Theory, vol. 35, no. 2, pp. 402–408, 1989.
[12] J. C. Kieffer, "Strongly consistent code-based identification and order estimation for constrained finite-state model classes," IEEE Trans. Inform. Theory, vol. 39, no. 3, pp. 893–902, 1993.
[13] L. Györfi, G. Morvai, I. Vajda, "Information-theoretic methods in testing the goodness of fit," in Proceedings of IEEE International Symposium on Information Theory, 2000.
[14] L. Györfi, G. Morvai, S. Yakowitz, "Limits to consistent on-line forecasting for ergodic time series," IEEE Trans. Information Theory, vol. 44, no. 2, pp. 886–892, 1998.
[15] G. Morvai, B. Weiss, "Limitations on intermittent forecasting," Statistics and Probability Letters, vol. 72, pp. 285–290, 2005.
[16] G. Morvai, B. Weiss, "On classifying processes," Bernoulli, vol. 11, no. 3, pp. 523–532, 2005.
[17] G. Morvai, B. Weiss, "On estimating the memory of finitarily Markovian processes," Ann. Inst. H. Poincaré Probab. Statist., vol. 43, pp. 15–30, 2007.
[18] D. S. Ornstein, B. Weiss, "How Sampling Reveals a Process," Annals of Probability, vol. 18, no. 3, pp. 905–930, 1990.
[19] B. Ryabko, "Prediction of random sequences and universal coding," Problems of Information Transmission, vol. 24, pp. 87–96, 1988.
[20] B. Ryabko, "Compression-Based Methods for Nonparametric Prediction and Estimation of Some Characteristics of Time Series," IEEE Trans. Information Theory, vol. 55, no. 9, pp. 4309–4315, 2009.
[21] B. Ryabko, J. Astola, "Universal codes as a basis for nonparametric testing of serial independence for time series," Journal of Statistical Planning and Inference, vol. 136, no. 12, pp. 4119–4128, 2006.
[22] B. Ryabko, J. Astola, A. Gammerman, "Application of Kolmogorov complexity and universal codes to identity testing and nonparametric testing of serial independence for time series," Theoretical Computer Science, vol. 359, pp. 440–448, 2006.
[23] D. Ryabko, "Testing composite hypotheses about discrete-valued stationary processes," in Proceedings of Information Theory Workshop 2010, Cairo, Egypt, pp. 291–295, 2010.
[24] D. Ryabko, "An impossibility result for process discrimination," in Proc. 2009 IEEE International Symposium on Information Theory, pp. 1734–1738, Seoul, South Korea, 2009.
[25] D. Ryabko, B. Ryabko, "On hypotheses testing for ergodic processes," in Proceedings of Information Theory Workshop 2008, Porto, Portugal, pp. 281–283.
[26] D. Ryabko, J. Schmidhuber, "Using Data Compressors to Construct Rank Tests," Applied Mathematics Letters, vol. 22, no. 7, pp. 1029–1032, 2009.
[27] P. Shields, "The Interactions Between Ergodic Theory and Information Theory," IEEE Trans. on Information Theory, vol. 44, no. 6 (1998), pp. 2079–2093.
