Nonparametric Statistical Inference for Ergodic Processes

In this work a method for statistical analysis of time series is proposed, which is used to obtain solutions to some classical problems of mathematical statistics under the only assumption that the process generating the data is stationary ergodic. N…

Authors: Daniil Ryabko (INRIA Lille - Nord Europe), Boris Ryabko (SIBSUTI, ICT SBRAS)

Daniil Ryabko (SequeL, INRIA-Lille Nord Europe, France; daniil@ryabko.net)
Boris Ryabko (Institute of Computational Technologies of Siberian Branch of Russian Academy of Science; Siberian State University of Telecommunications and Informatics, Novosibirsk, Russia; boris@ryabko.net)

Abstract

In this work a method for statistical analysis of time series is proposed, which is used to obtain solutions to some classical problems of mathematical statistics under the only assumption that the process generating the data is stationary ergodic. Namely, three problems are considered: goodness-of-fit (or identity) testing, process classification, and the change point problem. For each of the problems a test is constructed that is asymptotically accurate for the case when the data is generated by stationary ergodic processes. The tests are based on empirical estimates of distributional distance.

1 Introduction

Overview. In this work we consider the problem of statistical analysis of time series, when nothing is known about the underlying process generating the data, except that it is stationary ergodic. There is a vast literature on time series analysis under various parametric assumptions, and also under such non-parametric assumptions as that the process has a finite memory or possesses certain mixing rates. While under these settings most of the problems of statistical analysis are clearly solvable and efficient algorithms exist, in the general setting of stationary ergodic processes it is far less clear what can be done in principle, which problems of statistical analysis admit a solution and which do not.
In this work we propose a method of statistical analysis of time series that allows us to demonstrate that some classical statistical problems indeed admit a solution under the only assumption that the data is stationary ergodic, whereas before solutions only for more restricted cases were known. The solutions are always constructive, that is, we present asymptotically accurate algorithms for each of the considered problems. All the algorithms are based on empirical estimates of distributional distance, which is at the core of the suggested approach. We suggest that the proposed approach can be applied to other problems of statistical analysis of time series, with the view of establishing principled positive results, leaving the task of finding optimal algorithms for each particular problem as a topic for further research. Here we concentrate on the following three problems: goodness-of-fit (or identity) testing, process classification, and the change point problem.

Goodness-of-fit testing. The first problem is the following problem of hypothesis testing. A stationary ergodic process distribution $\rho$ is known theoretically. Given a data sample, it is required to test whether it was generated by $\rho$, versus it was generated by any other stationary ergodic distribution that is different from $\rho$ (goodness-of-fit, or identity testing). The case of i.i.d. or finite-memory processes is widely studied (see e.g. [7]); in particular, when $\rho$ has a finite memory, [22] proposes a test against any stationary ergodic alternative: a test that can be based on an arbitrary universal code. It was noted in [27] that an asymptotically accurate test for the case of stationary ergodic processes over finite alphabet exists (but no test was proposed).
Here we propose a concrete and simple asymptotically accurate goodness-of-fit test, which demonstrates the proposed approach: to use empirical distributional distance for hypotheses testing. By asymptotically accurate test we mean the following. First, the Type I error of the test (or its size) is fixed and is given as a parameter to the test. That is, given any $\alpha > 0$ as an input, under $H_0$ (if the data sample was indeed generated by $\rho$) the probability that the test says “$H_1$” is not greater than $\alpha$. Second, under any hypothesis in $H_1$ (that is, if the distribution generating the data is different from $\rho$), the test will say “$H_0$” not more than a finite number of times, with probability 1. In other words, the Type I error of the test is fixed and the Type II error can be made not more than a finite number of times, as the data sample increases, with probability 1 under any stationary ergodic alternative.

A comment on this setting is in order. When the alternative $H_1$ is less general, e.g. distributions that have finite memory [11] or known mixing rates, one typically seeks a test that has optimal rates of decrease of probability of Type II error to 0. For our case, when the alternative is the set of all stationary ergodic processes, this rate is necessarily non-uniform. In this sense, the property that we establish for our test is the strongest possible. Observe that the notion of consistency that we consider is stronger than requiring that the test makes only a finite number of errors (either Type I or Type II) with probability 1, the setting considered, for example, in the cases when $H_0$ is composite, or for the process classification problem that we address in this work.

Process classification. In the next problem that we consider, we again have to decide whether a data sample was generated by a process satisfying a hypothesis $H_0$ or a hypothesis $H_1$.
However, here $H_0$ and $H_1$ are not known theoretically, but are represented by two additional data samples. More precisely, the problem is that of process classification, which can be formulated as follows. We are given three samples $X = (X_1, \dots, X_k)$, $Y = (Y_1, \dots, Y_m)$ and $Z = (Z_1, \dots, Z_n)$ generated by stationary ergodic processes with distributions $\rho_X$, $\rho_Y$ and $\rho_Z$. It is known that $\rho_X \ne \rho_Y$, while either $\rho_Z = \rho_X$ or $\rho_Z = \rho_Y$. It is required to test which one is the case. That is, we have to decide whether the sample $Z$ was generated by the same process as the sample $X$ or by the same process as the sample $Y$. This problem for the case of dependent time series was considered for example in [11], where a solution is presented under the finite-memory assumption. It is closely related to many important problems in statistics and application areas, such as pattern recognition, classification, etc. Apparently no asymptotically accurate procedure for process classification has been known so far for the general case of stationary ergodic processes. Here we propose a test that converges almost surely to the correct answer. In other words, the test makes only a finite number of errors with probability 1, with respect to any stationary ergodic processes generating the data. Unlike in the previous problem, here we do not explicitly distinguish between Type I and Type II error, since the hypotheses are by nature symmetric: $H_0$ is “$\rho_Z = \rho_X$” and $H_1$ is “$\rho_Z = \rho_Y$”.

Change point estimation. Finally, we consider the change point problem. It is another classical problem, with vast literature on both parametric (see e.g. [2]) and non-parametric (see e.g. [6]) methods for solving it.
In this work we address the case where the data is dependent, the form and structure of the dependence are unknown, and the marginal distributions before and after the change may be the same. We consider the following (off-line) setting of the problem: a (real-valued) sample $Z_1, \dots, Z_n$ is given, where $Z_1, \dots, Z_k$ are generated according to some distribution $\rho_X$ and $Z_{k+1}, \dots, Z_n$ are generated according to some distribution $\rho_Y$ which is different from $\rho_X$. It is known that the distributions $\rho_X$ and $\rho_Y$ are stationary ergodic, but nothing else is known about them. Most literature on the change point problem for dependent time series assumes that the marginal distributions before and after the change point are different, and often also makes explicit restrictions on the dependence, such as requirements on mixing rates. Nonparametric methods used in these cases are typically based on the Kolmogorov-Smirnov statistic, the Cramer-von Mises statistic, or generalizations thereof [6, 4, 9]. The main difference of our results is that we do not assume that the single-dimensional marginals (or finite-dimensional marginals of any given fixed size) are different, and do not make any assumptions on the structure of dependence. The only assumption is that the (unknown) process distributions before and after the change point are stationary ergodic. Our result demonstrates that asymptotically accurate change point estimation is possible in this general setting.

Related problems. Let us briefly relate the three problems for which we present consistent tests to other problems of statistical analysis of stationary ergodic time series. First, a closely related problem is that of homogeneity testing. The problem is as follows: given two samples, one has to decide whether they were generated by the same process distribution or by different ones.
While solutions to this problem exist for i.i.d. data (see for example [3, 26], and references therein), for stationary ergodic processes (and even for a smaller class of B-processes) a consistent test does not exist, even in the binary-valued case, as was shown in [24]. This problem is closely related to the change point detection problem: given a single sample, one has to decide whether there was an abrupt change of distribution somewhere. If we know that there was such a change, then we can give an asymptotically consistent estimate for it, as we show here; however, if it is not known that the change point exists, nobody can construct a consistent change point test, because there is no consistent test for homogeneity. In other words, we can tell where a change point is, if there is one, but we cannot (in general) tell whether there is one or not (in the case of stationary ergodic distributions). Observe that the process classification problem described above turns out to be easier than homogeneity testing: a consistent test exists for the former (constructed in this work) but not for the latter.

Other hypothesis testing problems that concern stationary time series include testing for having a certain memory (i.e. testing the hypothesis “$k$-order Markov process” versus “stationary ergodic, not $k$-order Markov”), testing for membership to parametric families, and others [12, 16, 17, 21, 22]. Some recent general results that characterize those hypotheses about finitely-valued ergodic processes that can be tested are provided in [23]. Finally, a related problem is that of prediction or forecasting [19, 14, 15, 20]. In this respect, the results of the present work clarify which problems can and which cannot be solved, when the only assumption on the data is that it is stationary ergodic.

Methodology. All the tests that we construct are based on empirical estimates of the so-called distributional distance. For two processes $\rho_1, \rho_2$ a distributional distance is defined as $\sum_{k=1}^{\infty} w_k |\rho_1(B_k) - \rho_2(B_k)|$, where $w_k$ are positive summable real weights, e.g. $w_k = 2^{-k}$, and $B_k$ range over a countable field that generates the sigma-algebra of the underlying probability space. For example, if we are talking about finite-alphabet processes with the binary alphabet $A = \{0, 1\}$, $B_k$ would range over the set $A^* = \cup_{k \in \mathbb{N}} A^k$; that is, over all tuples $0, 1, 00, 01, 10, 11, 000, 001, \dots$ (of course, we could just as well omit, say, 1 and 11); therefore, the distributional distance in this case is the weighted sum of absolute differences of probabilities of all possible tuples. In this work we consider real-valued processes, so $B_k$ have to range through a suitable sequence of intervals, all pairs of such intervals, triples, etc. (e.g. we can use a sequence of partitions into cubes of decreasing volume, see the next section for formal definitions). Although distributional distance is a natural concept that, for stochastic processes, has been studied for a while [10], its empirical estimates have not, to our knowledge, been used for statistical analysis of time series. We argue that this distance is rather natural for this kind of problems, first of all, since it can be consistently estimated (unlike, for example, the $\bar d$ distance, which cannot [18] be consistently estimated for the general case of stationary ergodic processes). Secondly, it is always bounded, unlike (empirical) KL divergence, which is often used for statistical inference for time series (e.g. [7, 22, 1, 8, 13] and others). Other approaches to statistical analysis of stationary dependent time series include the use of (universal) codes [12, 22, 21].
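
To make the binary-alphabet example concrete, the following Python sketch (illustrative only) enumerates the words 0, 1, 00, 01, ... in the order listed above, attaches the weight $2^{-k}$ to the $k$-th word, and sums the first `terms` summands for two i.i.d. Bernoulli processes. The restriction to i.i.d. processes and all function names are assumptions made for this example and do not come from the text.

```python
from itertools import count, islice, product


def binary_words():
    """Yield all nonempty binary words ordered by length: 0, 1, 00, 01, ..."""
    for length in count(1):
        for word in product((0, 1), repeat=length):
            yield word


def iid_prob(word, p_one):
    """Probability of `word` under an i.i.d. Bernoulli(p_one) process."""
    ones = sum(word)
    return p_one ** ones * (1 - p_one) ** (len(word) - ones)


def truncated_distance(p, q, terms=30):
    """First `terms` summands of sum_k 2^{-k} |rho_1(B_k) - rho_2(B_k)|."""
    return sum(
        2.0 ** -k * abs(iid_prob(word, p) - iid_prob(word, q))
        for k, word in enumerate(islice(binary_words(), terms), start=1)
    )
```

Every summand is at most its weight, so with these weights the neglected tail is at most $2^{-\text{terms}}$.
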
Here we first show that distributional distance between stationary ergodic processes can be consistently estimated based on sampling, and then apply it to construct consistent tests for the three problems of statistical analysis described above.

Although empirical estimates of the distributional distance involve taking an infinite sum, in practice only a finite number of summands has to be calculated. This is due to the fact that empirical estimates have to be compared to each other or to theoretically known probabilities, and since the (bounded) summands have (exponentially) decreasing weights, the result of the comparison is known after only finitely many evaluations. Therefore, the algorithms presented can be applied in practice. On the other hand, the main value of the results is in the demonstration of what is possible in principle; finding practically efficient procedures for each of the considered problems is an interesting problem for further research. A closely related but more practical approach is that of tests based on universal codes [21, 20].

2 Preliminaries

We are considering (stationary ergodic) processes with the alphabet $A = \mathbb{R}$. The generalization to $A = \mathbb{R}^d$ is straightforward; moreover, the results can be extended to the case when $A$ is a complete separable metric space. We use the symbol $A^*$ for $\cup_{i=1}^{\infty} A^i$. Elements of $A^*$ are called words or sequences. For each $k, l \in \mathbb{N}$, let $B^{k,l}$ be a partition of the set $\mathbb{R}^k$ into $k$-dimensional cubes with volume $h_{k,l}$, such that $h_{k,l} \to 0$ when $l \to \infty$, for every $k \in \mathbb{N}$. Moreover, define $B^k = \cup_{l \in \mathbb{N}} B^{k,l}$. Let also $\mathcal{B} = \cup_{k=1}^{\infty} B^k$; since this set is countable we can introduce an enumeration $\mathcal{B} = \{B_i : i \in \mathbb{N}\}$. The set $\{B_i \times A^{\infty} : i \in \mathbb{N}\}$ generates the Borel $\sigma$-algebra on $\mathbb{R}^{\infty} = A^{\infty}$. For a set $B \in \mathcal{B}$ let $|B|$ be the index $k$ of the set $B^k$ that $B$ comes from: $|B| = k : B \in B^k$.
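
As one concrete choice (an illustrative assumption; the construction only requires that the cube volumes shrink to zero), the partitions can be taken to be dyadic:

$$B^{k,l} = \Big\{ \prod_{j=1}^{k} \big[\, i_j 2^{-l},\ (i_j + 1) 2^{-l} \big) : (i_1, \dots, i_k) \in \mathbb{Z}^k \Big\}, \qquad h_{k,l} = 2^{-kl} \to 0 \ \text{ as } l \to \infty.$$

With this choice, for $k = 2$ and $l = 1$ the point $(0.7, 1.2)$ lies in the cube $[0.5, 1.0) \times [1.0, 1.5)$, a set $B$ with $|B| = 2$.
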
For a sequence $X \in A^n$ and a set $B \in \mathcal{B}$ denote by $\nu(X, B)$ the frequency with which the sequence $X$ falls in the set $B$:

$$\nu(X, B) := \begin{cases} \dfrac{1}{n - |B| + 1} \sum_{i=1}^{n - |B| + 1} I_{\{(X_i, \dots, X_{i+|B|-1}) \in B\}} & \text{if } n \ge |B|, \\ 0 & \text{otherwise,} \end{cases}$$

where $X = (X_1, \dots, X_n)$. For example, $\nu\big((0.5, 1.5, 1.2, 1.4, 2.1), ([1.0, 2.0] \times [1.0, 2.0])\big) = 1/2$.

We use the symbol $\mathcal{S}$ for the set of all stationary ergodic processes on $A^{\infty}$. The ergodic theorem (see e.g. [5]) implies that for any process $\rho \in \mathcal{S}$ generating a sequence $X_1, X_2, \dots$ the frequency of observing a tuple that falls into each $B \in \mathcal{B}$ tends to its limiting (or a priori) probability a.s.: $\nu((X_1, \dots, X_n), B) \to \rho((X_1, \dots, X_{|B|}) \in B)$ as $n \to \infty$. We will often abbreviate $\rho((X_1, \dots, X_{|B|}) \in B) =: \rho(B)$.

Definition 1 (distributional distance). The distributional distance is defined for a pair of processes $\rho_1, \rho_2$ as follows [10]:

$$d(\rho_1, \rho_2) = \sum_{i=1}^{\infty} w_i |\rho_1(B_i) - \rho_2(B_i)|, \tag{1}$$

where $w_i$ are summable positive real weights (e.g. $w_i = 2^{-i}$).

It is easy to see that $d$ is a metric. The reader is referred to [10] for more information about $d$ and its properties.

Definition 2 (empirical distributional distance). For $X, Y \in A^*$, define the empirical distributional distance $\hat d(X, Y)$ as

$$\hat d(X, Y) := \sum_{i=1}^{\infty} w_i |\nu(X, B_i) - \nu(Y, B_i)|. \tag{2}$$

Similarly, we can define the empirical distance when only one of the process measures is unknown:

$$\hat d(X, \rho) := \sum_{i=1}^{\infty} w_i |\nu(X, B_i) - \rho(B_i)|, \tag{3}$$

where $\rho \in \mathcal{S}$ and $X \in A^*$.

The following lemma will play a key role in establishing the main results.

Lemma 1. Let two samples $X = (X_1, \dots, X_k)$ and $Y = (Y_1, \dots, Y_m)$ be generated by stationary ergodic processes $\rho_X$ and $\rho_Y$ respectively. Then
(i) $\lim_{k, m \to \infty} \hat d(X, Y) = d(\rho_X, \rho_Y)$ a.s.
(ii) $\lim_{k \to \infty} \hat d(X, \rho_Y) = d(\rho_X, \rho_Y)$ a.s.

Proof. For any $\varepsilon > 0$ we can find such an index $J$ that $\sum_{i=J}^{\infty} w_i < \varepsilon/2$. Moreover, for each $j$ we have $\nu((X_1, \dots, X_k), B_j) \to \rho_X(B_j)$ a.s., so that $|\nu((X_1, \dots, X_k), B_j) - \rho_X(B_j)| < \varepsilon/(4 J w_j)$ from some step $k$ on; define $K_j := k$. Let $K := \max_{j \le J} K_j$, and define $M$ analogously for the sample $Y$. Then for $k > K$ and $m > M$ we have

$$\begin{aligned}
|\hat d(X, Y) - d(\rho_X, \rho_Y)| &= \left| \sum_{i=1}^{\infty} w_i \big( |\nu(X, B_i) - \nu(Y, B_i)| - |\rho_X(B_i) - \rho_Y(B_i)| \big) \right| \\
&\le \sum_{i=1}^{\infty} w_i \big( |\nu(X, B_i) - \rho_X(B_i)| + |\nu(Y, B_i) - \rho_Y(B_i)| \big) \\
&\le \sum_{i=1}^{J} w_i \big( |\nu(X, B_i) - \rho_X(B_i)| + |\nu(Y, B_i) - \rho_Y(B_i)| \big) + \varepsilon/2 \\
&\le \sum_{i=1}^{J} w_i \big( \varepsilon/(4 J w_i) + \varepsilon/(4 J w_i) \big) + \varepsilon/2 = \varepsilon,
\end{aligned}$$

which proves the first statement. The second statement can be proven analogously.

Remark 1. While for the proofs the single-index definition of $d$ just introduced is more convenient, if the tests are to be computed the following definition should be easier to manage (all the statements below hold for this metric too):

$$d'(\rho_1, \rho_2) := \sum_{k, l} w_{k,l} \sum_{b \in B^{k,l}} |\rho_1(b) - \rho_2(b)|,$$

where again the weights $w_{k,l}$ should be summable, e.g. $w_{k,l} := 2^{-(k+l)}$.

3 Main results

3.1 Goodness-of-fit Test

For a given stationary ergodic process measure $\rho$ and a sample $X = (X_1, \dots, X_n)$ we wish to test the hypothesis $H_0$ that the sample was generated by $\rho$ versus $H_1$ that it was generated by a stationary ergodic distribution that is different from $\rho$. Thus, $H_0 = \{\rho\}$ and $H_1 = \mathcal{S} \setminus H_0$.

Define the set $D^n_\delta$ as the set of all samples of length $n$ that are at least $\delta$-far from $\rho$ in empirical distributional distance:

$$D^n_\delta := \{X \in A^n : \hat d(X, \rho) \ge \delta\}.$$

For each $n$ and each given confidence level $\alpha$ define the critical region $C^n_\alpha$ of the test as

$$C^n_\alpha := D^n_\gamma \quad \text{where } \gamma := \inf\{\delta : \rho(D^n_\delta) \le \alpha\}. \tag{4}$$

The test rejects $H_0$ at confidence level $\alpha$ if $(X_1, \dots, X_n) \in C^n_\alpha$ and accepts it otherwise. In words, for each sequence we measure the distance between the empirical probabilities (frequencies) and the measure $\rho$ (that is, the theoretical $\rho$-probabilities); we then take a largest ball (with respect to this distance) around $\rho$ that has $\rho$-probability not greater than $1 - \alpha$. The test rejects all sequences outside this ball.

Definition 3 (Goodness-of-fit test). For each $n \in \mathbb{N}$ and $\alpha \in (0, 1)$ the goodness-of-fit test $G^\alpha_n : A^n \to \{0, 1\}$ is defined as

$$G^\alpha_n(X_1, \dots, X_n) := \begin{cases} 1 & \text{if } (X_1, \dots, X_n) \in C^n_\alpha, \\ 0 & \text{otherwise.} \end{cases}$$

Theorem 1. The test $G^\alpha_n$ has the following properties.
(i) For every $\alpha \in (0, 1)$ and every $n \in \mathbb{N}$ the Type I error of the test is not greater than $\alpha$: $\rho(G^\alpha_n = 1) \le \alpha$.
(ii) For every $\alpha \in (0, 1)$ the Type II error goes to 0 almost surely: for every $\rho' \ne \rho$ we have $\lim_{n \to \infty} G^\alpha_n = 1$ with $\rho'$-probability 1.

Proof. The first statement holds by construction. To prove the second statement, let the sample $X$ be generated by $\rho' \in \mathcal{S}$, $\rho' \ne \rho$, and define $\delta = d(\rho, \rho')/2$. By Lemma 1 we have $\rho(D^n_\delta) \to 0$, so that $\rho(D^n_\delta) < \alpha$ from some $n$ on; denote it $n_1$. Thus, for $n > n_1$ we have $D^n_\delta \subset C^n_\alpha$. At the same time, by Lemma 1 we have $\hat d(X, \rho) > \delta$ from some $n$ on, which we denote $n_2(X)$, with $\rho'$-probability 1. So, for $n > \max\{n_1, n_2(X)\}$ we have $X \in D^n_\delta \subset C^n_\alpha$, which proves statement (ii).

3.2 Process classification

Let there be given three samples $X = (X_1, \dots, X_k)$, $Y = (Y_1, \dots, Y_m)$ and $Z = (Z_1, \dots, Z_n)$. Each sample is generated by a stationary ergodic process $\rho_X$, $\rho_Y$ and $\rho_Z$ respectively. Moreover, it is known that either $\rho_Z = \rho_X$ or $\rho_Z = \rho_Y$, but $\rho_X \ne \rho_Y$. We wish to construct a test that, based on the finite samples $X$, $Y$ and $Z$, will tell whether $\rho_Z = \rho_X$ or $\rho_Z = \rho_Y$.
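
All tests in this section are computed from the empirical distance $\hat d$ on finite samples. A minimal Python sketch follows, under assumptions that are not part of the text: it uses the double-index form $d'$ of Remark 1 with weights $w_{k,l} = 2^{-(k+l)}$, takes the partition cells to be half-open dyadic cubes of side $2^{-l}$, and truncates the sum at the hypothetical cut-offs `max_k` and `max_l`. The function names are illustrative.

```python
import math
from collections import Counter


def cell_frequencies(x, k, l):
    """Empirical frequencies nu(x, b) of the dyadic k-cubes b of side 2**-l hit by x."""
    n = len(x)
    if n < k:
        return {}  # nu is defined as 0 when the sample is shorter than the window
    windows = n - k + 1
    scale = 2 ** l
    counts = Counter(
        tuple(math.floor(v * scale) for v in x[i:i + k])  # integer cube index
        for i in range(windows)
    )
    return {cell: c / windows for cell, c in counts.items()}


def empirical_distance(x, y, max_k=3, max_l=3):
    """Truncated empirical version of d' with weights 2**-(k + l)."""
    total = 0.0
    for k in range(1, max_k + 1):
        for l in range(1, max_l + 1):
            fx = cell_frequencies(x, k, l)
            fy = cell_frequencies(y, k, l)
            # Cells hit by neither sample contribute 0 and can be skipped.
            gap = sum(abs(fx.get(b, 0.0) - fy.get(b, 0.0)) for b in set(fx) | set(fy))
            total += 2.0 ** -(k + l) * gap
    return total
```

Each $(k, l)$ term is at most $2 w_{k,l}$, so the error from truncation is bounded by twice the sum of the omitted weights. For instance, `cell_frequencies([0.5, 1.5, 1.2, 1.4, 2.1], 2, 0)` assigns frequency 1/2 to the unit cell `(1, 1)`, in line with the worked example for $\nu$ in Section 2.
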
The test chooses the sample $X$ or $Y$ according to whichever is closer to $Z$ in $\hat d$. That is, we define the test $L(X, Y, Z)$ as follows. If $\hat d(X, Z) \le \hat d(Y, Z)$ then the test says that the sample $Z$ is generated by the same process as the sample $X$, otherwise it says that the sample $Z$ is generated by the same process as the sample $Y$.

Definition 4 (Process classifier). Define the classifier $L : A^* \times A^* \times A^* \to \{1, 2\}$ as follows:

$$L(X, Y, Z) := \begin{cases} 1 & \text{if } \hat d(X, Z) \le \hat d(Y, Z), \\ 2 & \text{otherwise,} \end{cases}$$

for $X, Y, Z \in A^*$.

Theorem 2. The test $L(X, Y, Z)$ makes only a finite number of errors when $|X|$, $|Y|$ and $|Z|$ go to infinity, with probability 1: if $\rho_X = \rho_Z$ then $L(X, Y, Z) = 1$ from some $|X|, |Y|, |Z|$ on with probability 1; otherwise $L(X, Y, Z) = 2$ from some $|X|, |Y|, |Z|$ on with probability 1.

Proof. From the fact that $d$ is a metric and from Lemma 1 we conclude that $\hat d(X, Z) \to 0$ (with probability 1) if and only if $\rho_X = \rho_Z$. So, if $\rho_X = \rho_Z$ then by assumption $\rho_Y \ne \rho_Z$ and $\hat d(X, Z) \to 0$ a.s. while $\hat d(Y, Z) \to d(\rho_Y, \rho_Z) \ne 0$. Thus in this case $\hat d(Y, Z) > \hat d(X, Z)$ from some $|X|, |Y|, |Z|$ on with probability 1, from which moment we have $L(X, Y, Z) = 1$. The opposite case is analogous.

3.3 Change point problem

The sample $Z = (Z_1, \dots, Z_n)$ consists of two concatenated parts $X = (X_1, \dots, X_k)$ and $Y = (Y_1, \dots, Y_m)$, where $m = n - k$, so that $Z_i = X_i$ for $1 \le i \le k$ and $Z_{k+j} = Y_j$ for $1 \le j \le m$. The samples $X$ and $Y$ are generated independently by two different stationary ergodic processes with alphabet $A = \mathbb{R}$. The distributions of the processes are unknown. The value $k$ is called the change point. It is assumed that $k$ is linear in $n$; more precisely, $\alpha n < k < \beta n$ for some $0 < \alpha \le \beta < 1$ from some $n$ on. It is required to estimate the change point $k$ based on the sample $Z$.
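
One way to approach this estimation task is to scan candidate split points and keep the one at which the prefix and the suffix of $Z$ look most different. The Python sketch below is illustrative only: `dist` is a caller-supplied stand-in for an empirical distance, `lo` and `hi` are hypothetical bounds on the candidate range, and `mean_gap` is a toy distance used purely to make the example runnable. Because `mean_gap` compares sample means, it cannot detect changes that leave the mean unchanged, which is exactly the situation the distributional distance is meant to handle.

```python
from statistics import fmean


def estimate_change_point(z, lo, hi, dist):
    """Return the t in [lo, hi] that maximizes dist(z[:t], z[t:])."""
    if not (1 <= lo <= hi < len(z)):
        raise ValueError("need 1 <= lo <= hi < len(z)")
    # max() keeps the first maximizer, so ties resolve to the smallest t.
    return max(range(lo, hi + 1), key=lambda t: dist(z[:t], z[t:]))


def mean_gap(u, v):
    """Toy distance: absolute difference of sample means (not the paper's d-hat)."""
    return abs(fmean(u) - fmean(v))


z = [0.0] * 50 + [1.0] * 50  # synthetic sample whose first 50 values differ from the rest
print(estimate_change_point(z, 10, 90, mean_gap))  # prints 50
```

Passing the distance as a parameter keeps the scan independent of how the distance is computed, so any finite-sample approximation of $\hat d$ can be substituted for the toy function.
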
For each $t$, $1 \le t \le n$, denote by $U_t$ the sample $(Z_1, \dots, Z_t)$ consisting of the first $t$ elements of the sample $Z$, and denote by $V_t$ the remainder $(Z_{t+1}, \dots, Z_n)$.

Definition 5 (Change point estimator). Define the change point estimate $\hat k : A^* \to \mathbb{N}$ as follows:

$$\hat k(X_1, \dots, X_n) := \operatorname*{arg\,max}_{t \in [\alpha n,\, n - \beta n]} \hat d(U_t, V_t).$$

The following theorem establishes asymptotic consistency of this estimator.

Theorem 3. For the estimate $\hat k$ of the change point $k$ we have $|\hat k - k| = o(n)$ a.s., where $n$ is the size of the sample, and when $k, n - k \to \infty$ in such a way that $\alpha < k/n < \beta$ for some $\alpha, \beta \in (0, 1)$ from some $n$ on.

Proof. To prove the statement, we will show that for every $\gamma$, $0 < \gamma < 1$, with probability 1 the inequality $\hat d(U_t, V_t) < \hat d(X, Y)$ holds for each $t$ such that $\alpha k \le t < \gamma k$, possibly except for a finite number of times. Thus we will show that linear $\gamma$-underestimates occur only a finite number of times; for overestimates the argument is analogous. Fix some $\gamma$, $0 < \gamma < 1$, and $\varepsilon > 0$. Let $J$ be big enough to have $\sum_{i=J}^{\infty} w_i < \varepsilon/2$ and also big enough to have an index $j < J$ for which $\rho_X(B_j) \ne \rho_Y(B_j)$. Take $M_\varepsilon \in \mathbb{N}$ large enough to have $|\nu(Y, B_i) - \rho_Y(B_i)| \le \varepsilon/(2J)$ for all $m > M_\varepsilon$ and for each $i$, $1 \le i \le J$, and also to have $|B_i|/m < \varepsilon/J$ for each $i$, $1 \le i \le J$. This is possible since empirical frequencies converge to the limiting probabilities a.s. (that is, $M_\varepsilon$ depends on the realizations $Y_1, Y_2, \dots$) (cf. the proof of Lemma 1). Find a $K_\varepsilon$ (that depends on $X$) such that for all $k > K_\varepsilon$ and for all $i$, $1 \le i \le J$, we have

$$|\nu(U_t, B_i) - \rho_X(B_i)| \le \varepsilon/(2J) \quad \text{for each } t \in [\alpha n, \dots, k] \tag{5}$$

(this is possible simply because $\alpha n \to \infty$). Furthermore, we can select $K_\varepsilon$ large enough to have $|\nu((X_s, X_{s+1}, \dots, X_k), B_i) - \rho_X(B_i)| \le \varepsilon/(2J)$ for each $s \le \gamma k$: this follows from (5) and the identity

$$\nu((X_s, X_{s+1}, \dots, X_k), B_i) = \frac{k}{k - s}\, \nu((X_1, \dots, X_k), B_i) - \frac{s - 1}{k - s}\, \nu((X_1, \dots, X_{s-1}), B_i) + o(1).$$

So, for each $s \in [\alpha n, \gamma k]$ we have

$$\begin{aligned}
&\left| \nu(V_s, B_j) - \frac{(1 - \gamma) k \rho_X(B_j) + m \rho_Y(B_j)}{(1 - \gamma) k + m} \right| \\
&\quad \le \left| \frac{(1 - \gamma) k\, \nu((X_s, \dots, X_k), B_j) + m\, \nu(Y, B_j)}{(1 - \gamma) k + m} - \frac{(1 - \gamma) k \rho_X(B_j) + m \rho_Y(B_j)}{(1 - \gamma) k + m} \right| + \frac{|B_j|}{m + \gamma k} \\
&\quad \le 3\varepsilon/J,
\end{aligned}$$

for $k > K_\varepsilon$ and $m > M_\varepsilon$ (from the definitions of $K_\varepsilon$ and $M_\varepsilon$). Hence

$$\begin{aligned}
&|\nu(X, B_j) - \nu(Y, B_j)| - |\nu(U_s, B_j) - \nu(V_s, B_j)| \\
&\quad \ge |\nu(X, B_j) - \nu(Y, B_j)| - \left| \nu(U_s, B_j) - \frac{(1 - \gamma) k \rho_X(B_j) + m \rho_Y(B_j)}{(1 - \gamma) k + m} \right| - 3\varepsilon/J \\
&\quad \ge |\rho_X(B_j) - \rho_Y(B_j)| - \left| \rho_X(B_j) - \frac{(1 - \gamma) k \rho_X(B_j) + m \rho_Y(B_j)}{(1 - \gamma) k + m} \right| - 4\varepsilon/J \\
&\quad = \delta_j - 4\varepsilon/J,
\end{aligned}$$

for some $\delta_j$ that depends only on $k/m$ and $\gamma$. Summing over all $B_i$, $i \in \mathbb{N}$, we get

$$\hat d(X, Y) - \hat d(U_s, V_s) \ge w_j \delta_j - 5\varepsilon,$$

for all $n$ such that $k > K_\varepsilon$ and $m > M_\varepsilon$, which is positive for small enough $\varepsilon$.

Acknowledgements

We are grateful to the anonymous reviewer who suggested a simplification of the process metric $d$ (see also Remark 1). This work has been supported by French National Research Agency (ANR), project EXPLO-RA ANR-08-COSI-004 (Daniil Ryabko) and by Russian Foundation for Basic Research, grant 09-07-00005-a (Boris Ryabko).

References

[1] R. Ahlswede, I. Csiszar, “Hypothesis testing with communication constraints,” IEEE Trans. Information Theory, vol. 32 no. 4, pp. 533–542, 1986.
[2] M. Basseville, I. Nikiforov, Detection of Abrupt Changes: Theory and Applications. Prentice Hall, 1993.
[3] G. Biau, L. Györfi, “On the asymptotic properties of a nonparametric L1-test of homogeneity,” IEEE Trans. Information Theory, vol. 51, pp. 3965–3973, 2005.
[4] E. Carlstein, S. Lele, “Nonparametric change-point estimation for data from an ergodic sequence,” Teor. Veroyatnost. i Primenen. vol. 38, no. 4 (1993), pp. 910–917; translation in Theory Probab. Appl. vol. 38 no. 4, pp. 726–733, 1993.
[5] P. Billingsley, Ergodic theory and information. Wiley, New York, 1965.
[6] B. Brodsky, B. Darkhovsky, Nonparametric Methods in Change-Point Problems. Kluwer Academic Publishers, 1993.
[7] I. Csiszár, P. Shields, “Notes on Information Theory and Statistics: A tutorial,” Foundations and Trends in Communications and Information Theory 1 (2004), pp. 1–111.
[8] I. Csiszár, “Information Theoretic Methods in Probability and Statistics,” Information Theory Soc. Rev. articles, 1997. Available: http://www.itsoc.org/review/frrev.html
[9] L. Giraitis, R. Leipus, D. Surgailis, “The change-point problem for dependent observations,” Journal of Statistical Planning and Inference vol. 53 no. 3, pp. 297–310, 1996.
[10] R. Gray, Probability, Random Processes, and Ergodic Properties. Springer Verlag, 1988.
[11] M. Gutman, “Asymptotically Optimal Classification for Multiple Tests with Empirically Observed Statistics,” IEEE Trans. Information Theory, vol. 35 no. 2, pp. 402–408, 1989.
[12] J. C. Kieffer, “Strongly consistent code-based identification and order estimation for constrained finite-state model classes,” IEEE Trans. Inform. Theory vol. 39 no. 3, pp. 893–902, 1993.
[13] L. Györfi, G. Morvai, I. Vajda, “Information-theoretic methods in testing the goodness of fit,” in Proceedings of IEEE International Symposium on Information Theory, 2000.
[14] L. Györfi, G. Morvai, S. Yakowitz (1998), “Limits to consistent on-line forecasting for ergodic time series,” IEEE Trans. Information Theory vol. 44, no. 2, pp. 886–892.
[15] G. Morvai, B. Weiss, “Limitations on intermittent forecasting,” Statistics and Probability Letters, vol. 72, pp. 285–290, 2005.
[16] G. Morvai, B. Weiss, “On classifying processes,” Bernoulli, vol. 11, no. 3, pp. 523–532, 2005.
[17] G. Morvai, B. Weiss, “On estimating the memory of finitarily Markovian processes,” Ann. Inst. H. Poincaré Probab. Statist. vol. 43 pp. 15–30, 2007.
[18] D. S. Ornstein, B. Weiss (1990), “How Sampling Reveals a Process,” Annals of Probability vol. 18 no. 3, pp. 905–930.
[19] B. Ryabko, “Prediction of random sequences and universal coding,” Problems of Information Transmission, vol. 24, pp. 87–96, 1988.
[20] B. Ryabko, “Compression-Based Methods for Nonparametric Prediction and Estimation of Some Characteristics of Time Series,” IEEE Trans. Information Theory, vol. 55, no. 9, 2009, pp. 4309–4315.
[21] B. Ryabko, J. Astola, “Universal codes as a basis for nonparametric testing of serial independence for time series,” Journal of Statistical Planning and Inference, vol. 136 no. 12, pp. 4119–4128, 2006.
[22] B. Ryabko, J. Astola, A. Gammerman, “Application of Kolmogorov complexity and universal codes to identity testing and nonparametric testing of serial independence for time series,” Theoretical Computer Science, vol. 359, 2006, pp. 440–448.
[23] D. Ryabko, “Testing composite hypotheses about discrete-valued stationary processes,” in Proceedings of Information Theory Workshop 2010, Cairo, Egypt, pp. 291–295, 2010.
[24] D. Ryabko, “An impossibility result for process discrimination,” in Proc. 2009 IEEE International Symposium on Information Theory, pp. 1734–1738, Seoul, South Korea, 2009.
[25] D. Ryabko, B. Ryabko, “On hypotheses testing for ergodic processes,” in Proceedings of Information Theory Workshop 2008, Porto, Portugal, pp. 281–283.
[26] D. Ryabko, J. Schmidhuber, “Using Data Compressors to Construct Rank Tests,” Applied Mathematics Letters, vol. 22 no. 7, pp. 1029–1032, 2009.
[27] P. Shields, “The Interactions Between Ergodic Theory and Information Theory,” IEEE Trans. on Information Theory, vol. 44, no. 6 (1998), pp. 2079–2093.
