Estimation, inference and model selection for jump regression models


Authors: Steffen Grønneberg, Gudmund Hermansen, Nils Lid Hjort

Steffen Grønneberg (BI Norwegian Business School, Oslo), Gudmund Hermansen (Norwegian Computing Centre, Oslo, and Department of Mathematics, University of Oslo) and Nils Lid Hjort (Department of Mathematics, University of Oslo)

June 2014

Abstract. We consider regression models with data of the type $y_i = m(x_i) + \varepsilon_i$, where the $m(x)$ curve is taken locally constant, with unknown levels and jump points. We investigate the large-sample properties of the minimum least squares estimators, finding in particular that jump point parameters and level parameters are estimated with respectively $n$-rate precision and $\sqrt{n}$-rate precision, where $n$ is sample size. Bayes solutions are investigated as well and found to be superior. We then construct jump information criteria, respectively AJIC and BJIC, for selecting the right number of jump points from data. This is done by following the line of arguments that lead to the Akaike and Bayesian information criteria AIC and BIC, but which here lead to different formulae due to the different type of large-sample approximations involved.

Key words: argmin principle, break points, discontinuities, jump information criteria (JIC), large-sample approximations, step function regression, structural change

1. Introduction

Consider regression data $(x_i, y_i)$, with $x_i$ and $y_i$ one-dimensional. There is a long list of regression models of the type

  $y_i = m(x_i, \theta) + \varepsilon_i$ for $i = 1, \ldots, n$,   (1.1)

where the regression function $m(x, \theta)$ is either linear in certain basis functions, as for polynomial regression, or smooth in $x$, and where the $\varepsilon_i$ are zero-mean i.i.d. errors with standard deviation level $\sigma$. Here least squares estimators for the $\theta$ parameter can be worked with, and lead to the familiar type of $\sqrt{n}$ convergence to normality, which in its turn leads to inference methods, along with strategies for model selection of the AIC and BIC variety. This is the topic of Section 2 below, encompassing also nonlinear but smooth regression models. Though that section partly contains ostensibly well-known material on least squares estimation in smooth regression models, we emphasise that the developments there pertaining to a model robust AIC strategy and to refined versions of the BIC are new.

The main focus of this article is however estimation, inference and model selection methods for step function regression models, involving discontinuity points and levels for $m(x, \theta)$. Here the methodology pans out differently from the smooth case, and we shall in particular see that there is $n$-convergence rather than $\sqrt{n}$-convergence for estimators of break points, and that limit distributions are not normal. For convenience and without essential loss of generality we take the $x$ range to be the unit interval, and study the model where

  $m(x, \theta) = a_j$ for $\gamma_{j-1} \le x < \gamma_j$,   (1.2)

for windows $[\gamma_{j-1}, \gamma_j]$, with $j = 1, \ldots, d$ and $\gamma_0 = 0$, $\gamma_d = 1$. The unknown parameters to estimate from data are the $d - 1$ break point positions and the $d$ levels, along with the spread parameter $\sigma$. For an illustration, see Figure 1.1, with the true $m(x)$ function of the above type having five windows.
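To fix ideas, here is a minimal Python sketch that simulates a data set from the (1.2) type of model; the window boundaries, levels and noise level below are illustrative choices of ours, not the parameter values behind Figure 1.1.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative choices: five windows on [0, 1], one level per window.
gamma = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])      # gamma_0, ..., gamma_d
a = np.array([1.0, 3.0, 2.0, 3.2, 3.0])               # levels a_1, ..., a_d
sigma, n = 0.5, 250

x = np.sort(rng.uniform(size=n))                      # Assumption 1 with f uniform on [0, 1]
m = a[np.searchsorted(gamma[1:-1], x, side="right")]  # m(x_i) = a_j for gamma_{j-1} <= x_i < gamma_j
y = m + sigma * rng.normal(size=n)                    # y_i = m(x_i) + eps_i
```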
Supposing the number of windows to be known to the statistician, the task there is to estimate and carry out inference for the five level parameters, the four break point locations, and $\sigma$, based on such a data set. Often one would not know the number of windows a priori, however, in which case inferring the right number of break points is also part of the task. Our paper provides methods for handling these questions.

The simplest non-trivial version of this model has two windows, with say

  $m(x, \theta) = a$ if $x \le \gamma$, and $b$ if $x > \gamma$,   (1.3)

and with both levels $a, b$ and the break point position $\gamma$ unknown parameters. We derive the basic large-sample properties of the least squares method in Section 3; specifically, it is shown for the least squares estimator $\hat\gamma$ that $n(\hat\gamma - \gamma_0)$ has a limit distribution, associated with a certain two-sided compound Poisson process. This means a faster rate of convergence than in traditional parametric models. The situation where there is a break point, but where the level of $m(x)$ is not necessarily entirely flat throughout the two windows, is also considered, in Section 4. Understanding that situation is vital not only for constructing model robust inference methods, but for building model selection strategies. Methods and results are then generalised in Section 5 to the case of (1.2), i.e. with $d$ windows and $2d$ unknown parameters. There is again the faster $n$-rate convergence for estimators of the $d - 1$ break points.

In some types of applications the number of windows would be known to the analyst a priori, say $d$, hence limiting the statistical inference task to the $d - 1$ break points and $d$ levels. Frequently the number of windows would not be known a priori, as pointed to above. The difficulty clearly depends on the sizes of the windows and the degree of separation between their levels; in Figure 1.1 the small difference between the levels of windows four and five would be hard to identify, unless the sample size is large.

Figure 1.1. Illustration of the (1.2) model, with five windows and $n = 250$ data points. The statistical task is to estimate the discontinuity points and the levels from the data, and also to infer the right number of windows in case this is not known a priori.

In Sections 6 and 7 we develop jump information criteria (JIC), respectively the AJIC and BJIC, for selecting the right number of windows based on data. These are constructed by following the basic line of arguments which for regular models lead to respectively the AIC and the BIC formulae, cf. Section 2, but which now are in need of modification and sophistication, in view of the different type of asymptotics at work.

In Section 8 we study the Bayesian approach to estimation problems in models of the (1.2) form, and find that the Bayes estimators $\hat\gamma_{j,B}$ have the same fast convergence rate as for the least squares method, but that the limit distributions of $n(\hat\gamma_{j,B} - \gamma_{0,j})$ are different. We also show that the Bayes estimators have smaller asymptotic mean squared errors. Then in Section 9 various computational aspects are addressed.
We in particular exhibit a new version of previously suggested procedures which uses dynamic programming to rapidly compute both least squares and Bayes estimates; the algorithms work fast even for high $n$ and with many break points. Illustrations of our methods and findings are given in Section 10, and our paper ends with a list of concluding remarks and suggestions for future work in Section 11.

The themes worked with in this paper have quite a vast literature, and, to complicate matters, several asymptotic perspectives exist. Some book length treatments of the subject for a single change point are Basseville & Nikiforov (1993), Brodsky & Darkhovsky (1993), Carlstein et al. (1994) and Csörgő & Horváth (1997). Modern treatments of a single change point for jump regression models include Section 14.5.1 of Kosorok (2008) and the application in Seijo & Sen (2011). The problem of multiple change points is less studied in the pure statistics literature, but has been thoroughly investigated in the econometric literature. The asymptotic perspective in the econometric literature is that the sizes of the jumps shrink as the number of observations increases. In the one-jump case, one assumes that for some $0 < \alpha < 1/2$ the jump size $\lambda_n$ is such that $\lambda_n \to 0$ as $n \to \infty$, but that $n^{1/2-\alpha} \lambda_n \to \infty$ as $n \to \infty$. In contrast, we will follow Kosorok (2008) and Seijo & Sen (2011) and consider the covariate placements as random, as formalised in the upcoming Assumption 1, but let the jump sizes be fixed. See also Remark C in Section 11.

Two central papers on frequentist estimation from the econometric perspective are Bai & Perron (1998) and Bai & Perron (2003). In Bai & Perron (1998), the limit distributions in the presence of a prescribed number of jumps are derived using the shrinking jump perspective for the asymptotic approximations. Bai & Perron (2003) surveys the practical aspects of frequentist estimation of regression models with jumps. A large-scale simulation study of the small-sample accuracy of the limit law of the estimators in the presence of multiple jumps is found in Bai & Perron (2006), where shrinking jump sizes are assumed. They conclude that 'the coverage rates for the break dates are adequate unless the break is either too small (so small as not to be detected by the tests) or too big'. Such a large-scale simulation has to our knowledge not been conducted for the case where the jump sizes are fixed. It seems reasonable that some of the qualitative conclusions from Bai & Perron (2006) are valid for our case as well. We note that a finite-sample approach to the Gaussian case is given in Lebarbier (2005).

On the Bayesian side, there is again a large literature available. Fearnhead (2006) and Chib (1998) both develop computationally feasible estimation methods. There is even an R package available, described in Erdman & Emerson (2007), which also gives many useful references.

We note that there are several distinct models that are technically similar to the jump regression models of the present paper. Threshold regression models are a prominent example; see Chan (1993), Yu (2012) and Li & Ling (2012). The case of a continuous process (signal) with discontinuities has been worked with, both from Bayesian and frequentist perspectives, in Ibragimov & Khasminskii (1981, p. 329–338).
They work with stochastic differential equations of the form $dX(t) = S(t - \theta)\,dt + dB(t)$ for $t \in [0, n]$, where $B(t)$ is a Brownian motion process, $S(u)$ is a periodic deterministic function of period 1 with discontinuities of the first kind, and $n$ is an integer.

There are relatively speaking far fewer papers in the literature seriously addressing the model selection issues, the task of finding the right number of windows in the data. Some care is needed regarding the interpretation and assessment of such methods, as the task of pinpointing and counting genuine discontinuities in the underlying signal, which is our perspective in this article, is different from approximating a smooth function with a locally constant curve, as in e.g. Banerjee & McKeague (2007) and Hermansen & Hjort (2014a). For general perspectives on modelling with discontinuities, see the discussion of Frigessi & Hjort (2002) and the articles collected in the associated special issue of the Journal of Nonparametric Statistics. Some papers discussing model selection issues for jump regressions are Ninomiya (2005) and Frick et al. (2014). A recent work, also dealing with the selection of jumps, is Zou et al. (2014). They use a nonparametric maximum likelihood approach, while we will exclusively use models parametrised by a fixed, finite-dimensional parameter set. However, their model selection technique focuses on providing a procedure that selects the correct number of jumps in a consistent manner as $n \to \infty$. Our model selection ideas are instead based on extending the motivation behind the AIC and BIC formulae, as set up and developed in Claeskens & Hjort (2008, Chapters 2 & 3). Other papers that extend the underlying motivation of the AIC and BIC to non-classical settings include Hermansen & Hjort (2014a,b) and Grønneberg & Hjort (2014).

Applications of change-point models of the (1.2) variety abound, and include the fields of biology, geology, engineering with e.g. speech recognition, medicine, imaging, marine biology and oceanography, etc. A broad overview is offered in Frick et al. (2014), including references. Kjesbu et al. (2014) have extended a classic time series from Hjort (1914), pertaining to the liver quality of fish, from the original 1880–1912 range to 1859–2012, and models used to understand, analyse, interpret and form predictions from this and similar time series for the marine sciences include versions of (1.2). In this connection we also mention the so-called bent cable regression models of Chiu et al. (2006), where the mean curve is taken to be continuous but allowed to have discontinuities in its derivative; Kjesbu et al. (2014) discuss also such models for the Hjort time series.

2. The case of smooth regression models

It is useful to study the more traditional case of smooth regression curves first, before turning to the jump regression models. Here the theory indeed works more smoothly, regarding both estimation, inference and model selection, leading in particular to formulae for AIC and BIC close to the familiar ones; see e.g. Claeskens & Hjort (2008) for these.
We shall see that several constructions and properties rely on analysis of two special functions worked with below, the $A_n(s,t)$ process and the $B_n(s,t)$ function, associated respectively with the log-likelihood function and a Kullback-Leibler distance function. When we turn to jump regression models in later sections it becomes necessary to understand and analyse more technically demanding analogues of $A_n$ and $B_n$ defined for these non-smooth models.

Since some of our limit distribution results and associated large-sample approximations depend not only on the model formulation and the $m(x)$ function, but also on how the $x_i$ are distributed, we put up an explicit assumption, as follows.

Assumption 1. The $x_i$ are generated independently from a covariate distribution with density $f(x)$ positive on the unit interval. Hence ergodic averages $\bar h_n = n^{-1} \sum_{i=1}^n h(x_i) = \int h(x)\,dF_n(x)$ converge to limits $\int h(x) f(x)\,dx$, for each bounded $h$. Here $F_n$ is the empirical distribution of the observed $x_1, \ldots, x_n$, converging to $F$, the cumulative of $f$.

2.1. The model and its least false parameters. Consider then a setup in which

  $y_i = m(x_i) + \varepsilon_i$ for $i = 1, \ldots, n$,   (2.1)

where $m(x) = m_{\rm true}(x)$ is the true regression curve ${\rm E}(Y \mid x)$ and where the $\varepsilon_i$ are i.i.d. ${\rm N}(0, \sigma_{\rm true}^2)$. A parametric model $m(x, \theta)$ is then suggested, in this section taken to be smooth, with two continuous derivatives in both $x$ and $\theta$. Prototype cases would be the polynomial $\theta_0 + \theta_1 x + \cdots + \theta_p x^p$, or an expansion in other basis functions, but also nonlinear models like e.g. $\theta_0 + \theta_1\{1 - \exp(-\theta_2 x)\}$ fit the general framework.

We shall first clarify what the ML estimators are aiming at, when we avoid postulating that the parametric model is correct, i.e. $m(x)$ is not assumed to be $m(x, \theta)$ for any $\theta$. The log-likelihood function takes the form $\ell_n(\theta, \sigma) = -n \log \sigma - \frac{1}{2}(1/\sigma^2) Q_n(\theta)$ (ignoring the immaterial constant $-\frac{1}{2} n \log(2\pi)$ here and in what follows), where $Q_n(\theta) = \sum_{i=1}^n \{y_i - m(x_i, \theta)\}^2$. We see that the ML estimator $\hat\theta$ is the argmin of $Q_n$ and that the ML estimator of $\sigma$ is given by $\hat\sigma^2 = Q_n(\hat\theta)/n = Q_{n,\min}/n$. We have

  ${\rm E}\,Q_n(\theta)/n = \sigma_{\rm true}^2 + n^{-1} \sum_{i=1}^n \{m(x_i) - m(x_i, \theta)\}^2$,

and we define the least false regression parameter $\theta_{0,n}$ to be the parameter value minimising this expression, i.e. minimising $\int \{m(x) - m(x, \theta)\}^2\,dF_n(x)$. We similarly define the least false spread parameter $\sigma_{0,n}$ via

  $\sigma_{0,n}^2 = \sigma_{\rm true}^2 + \int \{m(x) - m(x, \theta_{0,n})\}^2\,dF_n(x)$.   (2.2)

These least false parameters are those at which the ML estimators $\hat\theta$ and $\hat\sigma$ are aiming, and correspond to the best possible approximation afforded by the parametric model to the real data generating mechanism (2.1). When $n$ increases, $\theta_{0,n} \to \theta_0$, the minimiser of $\int \{m(x) - m(x, \theta)\}^2 f(x)\,dx$, and $\sigma_{0,n} \to \sigma_0$, given by $\sigma_0^2 = \sigma_{\rm true}^2 + \int \{m(x) - m(x, \theta_0)\}^2 f(x)\,dx$. Note that a good model succeeds in $m(x, \theta_0)$ coming close to the real $m(x)$, hence also leading to a $\sigma_0$ being small and coming close to $\sigma_{\rm true}$ of (2.1), whereas a coarser model typically does not come close to $m(x)$ and implies a larger least false parameter $\sigma_0$.
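As a small numerical check of this description, the following Monte Carlo sketch (our own illustration, with a deliberately misspecified constant model fitted to a two-level truth) verifies the least false limits and formula (2.2).

```python
import numpy as np

rng = np.random.default_rng(2)

# Truth: m(x) = 1 for x < 0.5 and m(x) = 2 for x >= 0.5, sigma_true = 0.5;
# the fitted (wrong) model is the constant m(x, theta) = theta.
sigma_true, n = 0.5, 200_000
x = rng.uniform(size=n)
y = np.where(x < 0.5, 1.0, 2.0) + sigma_true * rng.normal(size=n)

theta_hat = y.mean()                       # least squares / ML estimate of theta
sigma2_hat = np.mean((y - theta_hat)**2)   # ML estimate of sigma^2

# Least false values: theta_0 = int m(x) f(x) dx = 1.5, and by (2.2)
# sigma_0^2 = sigma_true^2 + int {m(x) - theta_0}^2 f(x) dx = 0.25 + 0.25 = 0.50.
print(round(theta_hat, 3), round(sigma2_hat, 3))   # close to 1.5 and 0.50
```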
2.2. The log-likelihood function and the $A_n$ process. Consider now the random function

  $A_n(t, u) = \ell_n(\theta_{0,n} + t/\sqrt{n},\, \sigma_{0,n} + u/\sqrt{n}) - \ell_n(\theta_{0,n}, \sigma_{0,n})$.   (2.3)

It will be seen that some of the more important results concerning the distributions of the ML estimators follow from large-sample properties of the $A_n$ function, along with natural model selection criteria. To work with $A_n$ and its distributional limit, we first need

  $W_n = \sqrt{n}(\tilde\sigma^2/\sigma_{0,n}^2 - 1) \to_d W \sim {\rm N}(0, 2)$,

where $\tilde\sigma^2 = n^{-1} \sum_{i=1}^n \{y_i - m(x_i, \theta_{0,n})\}^2$. Secondly,

  $H_n(t) = Q_n(\theta_{0,n} + t/\sqrt{n}) - Q_n(\theta_{0,n}) = \sum_{i=1}^n [\{y_i - m(x_i, \theta_{0,n}) - d_i\}^2 - \{y_i - m(x_i, \theta_{0,n})\}^2] = \sum_{i=1}^n [-2\{y_i - m(x_i, \theta_{0,n})\} d_i + d_i^2]$,

in which

  $d_i = m(x_i, \theta_{0,n} + t/\sqrt{n}) - m(x_i, \theta_{0,n}) \doteq m_*(x_i, \theta_0)^{\rm t} t/\sqrt{n} + \tfrac{1}{2} t^{\rm t} m_{**}(x_i, \theta_0) t/n$,   (2.4)

writing $m_*(x, \theta)$ and $m_{**}(x, \theta)$ for respectively the $p$-vector and $p \times p$ matrix of first and second order partial derivatives with respect to $\theta$, and where $p$ is the dimension of $\theta$. We find from this and some additional efforts that $H_n(t) = -2 t^{\rm t} U_n + t^{\rm t}(\Sigma_n + \Omega_n) t + o_{\rm pr}(1)$. Here

  $\Sigma_n = n^{-1} \sum_{i=1}^n m_*(x_i, \theta_{0,n}) m_*(x_i, \theta_{0,n})^{\rm t} \to \Sigma = \int m_*(x, \theta_0) m_*(x, \theta_0)^{\rm t} f(x)\,dx$

and

  $\Omega_n = -n^{-1} \sum_{i=1}^n \{m(x_i) - m(x_i, \theta_{0,n})\} m_{**}(x_i, \theta_{0,n}) \to \Omega = \int \{m(x, \theta_0) - m(x)\} m_{**}(x, \theta_0) f(x)\,dx$,

whereas

  $U_n = n^{-1/2} \sum_{i=1}^n \{y_i - m(x_i, \theta_{0,n})\} m_*(x_i, \theta_{0,n}) = n^{-1/2} \sum_{i=1}^n \varepsilon_i m_*(x_i, \theta_{0,n})$.

We use here the fact that

  $\sum_{i=1}^n \{m(x_i) - m(x_i, \theta_{0,n})\} m_*(x_i, \theta_{0,n}) = 0$,   (2.5)

which follows from the definition of the least false parameter $\theta_{0,n}$. This then leads to $U_n \to_d U \sim {\rm N}(0, \sigma_{\rm true}^2 \Sigma)$. There is also joint convergence $(W_n, U_n) \to_d (W, U)$, with $W$ and $U$ found to be independent, again thanks to (2.5). It follows that $H_n(t) \to_d H(t) = -2 t^{\rm t} U + t^{\rm t}(\Sigma + \Omega) t$. We note that $\Omega$ is zero or small if the parametric model is correct or nearly correct, or if $m(x, \theta)$ is linear or nearly linear in $\theta$. For linear regression models, for example, $\Omega_n$ and hence $\Omega$ are zero.

Combining these ingredients one may now establish the limit process for $A_n$ of (2.3). We have

  $A_n(t, u) = -n \log(\sigma_{0,n} + u/\sqrt{n}) - \frac{1}{2}\frac{1}{\sigma_{0,n}^2}\frac{1}{\{1 + (u/\sigma_{0,n})/\sqrt{n}\}^2}\{Q_n(\theta_{0,n}) + H_n(t)\} + n \log \sigma_{0,n} + \frac{1}{2}\frac{1}{\sigma_{0,n}^2} Q_n(\theta_{0,n})$
  $\phantom{A_n(t, u)} = -n \log\{1 + (u/\sigma_{0,n})/\sqrt{n}\} - \frac{1}{2}\frac{1}{\sigma_{0,n}^2}\Big[\frac{1}{\{1 + (u/\sigma_{0,n})/\sqrt{n}\}^2} - 1\Big] Q_n(\theta_{0,n}) - \frac{1}{2}\frac{1}{\sigma_{0,n}^2}\frac{1}{\{1 + (u/\sigma_{0,n})/\sqrt{n}\}^2} H_n(t)$.

This leads with some further details and efforts to

  $A_n(t, u) = \frac{u}{\sigma_0} W_n - \frac{u^2}{\sigma_0^2} - \frac{1}{2}\frac{1}{\sigma_0^2} H_n(t) + o_{\rm pr}(1) \to_d A(t, u) = \frac{u}{\sigma_0} W - \frac{u^2}{\sigma_0^2} + \frac{1}{\sigma_0^2}\{t^{\rm t} U - \tfrac{1}{2} t^{\rm t}(\Sigma + \Omega) t\}$.   (2.6)

The limit function depends on $(W, U)$ only, and, after securing tightness (which we will not do here), the convergence in distribution here takes place not merely pointwise, i.e. for each $(t, u)$, but inside each function space $D([-r, r]^{p+1})$, for $r$ positive, equipped with the Skorokhod topology for right-continuous functions with left-hand limits; see Billingsley (1968) for mathematical details concerning convergence of probability measures in such function spaces. The point, in our connection, is that $A_n \to_d A$ automatically implies $h(A_n) \to_d h(A)$ for each $h$ continuous on the support of $A$.
Various useful corollaries hence flow from (2.6), as we now demonstrate.

(i) Using the argmax continuous mapping theorem, as described e.g. in van der Vaart & Wellner (1996, Chapter 3.2.1), we get that if $\sqrt{n}(\hat\theta - \theta_{0,n})$ is bounded in probability, then $\mathrm{argmax}(A_n) \to_d \mathrm{argmax}(A)$:

  $\hat t_n = \sqrt{n}(\hat\theta - \theta_{0,n}) \to_d \hat t = (\Sigma + \Omega)^{-1} U$, and $\hat u_n = \sqrt{n}(\hat\sigma - \sigma_{0,n}) \to_d \hat u = \tfrac{1}{2}\sigma_0 W$.   (2.7)

Also, these maximisers of the random limit function $A$ have distributions

  $\hat t \sim {\rm N}_p(0, \sigma_{\rm true}^2 (\Sigma + \Omega)^{-1} \Sigma (\Sigma + \Omega)^{-1})$ and $\hat u \sim {\rm N}(0, \tfrac{1}{2}\sigma_0^2)$.

Note that we do not provide sufficient conditions for $\sqrt{n}(\hat\theta - \theta_{0,n}) = O_P(1)$, as this is a standard problem investigated for example in van de Geer (2000, Chapters 7 & 9).

(ii) Consider

  $A_n^*(t, u) = A_n(\hat t_n + t, \hat u_n + u) - A_n(\hat t_n, \hat u_n) = \ell_n(\hat\theta + t/\sqrt{n},\, \hat\sigma + u/\sqrt{n}) - \ell_n(\hat\theta, \hat\sigma)$.

We have

  $A_n^*(t, u) \to_d A^*(t, u) = A(\hat t + t, \hat u + u) - A(\hat t, \hat u) = -\frac{u^2}{\sigma_0^2} - \frac{1}{2}\frac{1}{\sigma_0^2}\, t^{\rm t}(\Sigma + \Omega) t$.   (2.8)

This relates to the fact that there is joint convergence in distribution $(A_n, \hat t_n) \to_d (A, \hat t)$, and that the composite function taking $(A, t)$ to $A(t)$ is continuous.

(iii) We also have

  $A_n(\hat t_n, \hat u_n) = \ell_n(\hat\theta, \hat\sigma) - \ell_n(\theta_{0,n}, \sigma_{0,n}) \to_d A(\hat t, \hat u) = \tfrac{1}{4} W^2 + \tfrac{1}{2}(1/\sigma_0^2)\, U^{\rm t}(\Sigma + \Omega)^{-1} U$.

2.3. The Kullback-Leibler distance and the $B_n$ function. The AIC type model selection strategy, worked with below, is intimately connected to the Kullback-Leibler distance from the true data generating mechanism $g(y \mid x)$, here associated with (2.1), to the estimated parametric model, say $f(y \mid x, \hat\theta, \hat\sigma)$. This distance, when averaged over the cases $x_1, \ldots, x_n$, is

  ${\rm KL}_n = \int\!\!\int g(y \mid x) \log\frac{g(y \mid x)}{f(y \mid x, \hat\theta, \hat\sigma)}\,dy\,dF_n(x) = {\rm const.} - n^{-1} k_n(\hat\theta, \hat\sigma)$,

in which

  $k_n(\theta, \sigma) = \sum_{i=1}^n \int g_{\rm true}(y \mid x_i) \log f(y \mid x_i, \theta, \sigma)\,dy = -n \log \sigma - \frac{1}{2}\frac{1}{\sigma^2} \sum_{i=1}^n [\{m(x_i) - m(x_i, \theta)\}^2 + \sigma_{\rm true}^2]$.

The best approximation or least false parameters are set up such that they maximise this function, hence with partial derivatives being zero at $(\theta_{0,n}, \sigma_{0,n})$. This leads to

  $B_n(t, u) = k_n(\theta_{0,n} + t/\sqrt{n},\, \sigma_{0,n} + u/\sqrt{n}) - k_n(\theta_{0,n}, \sigma_{0,n}) = -\tfrac{1}{2} v^{\rm t} J_n v + o(1)$,

writing $v = (t, u)^{\rm t}$, and where

  $J_n = -n^{-1}\,\partial^2 k_n(\theta_{0,n}, \sigma_{0,n})/\partial\alpha\,\partial\alpha^{\rm t}$,

writing $\alpha = (\theta, \sigma)$ for the full parameter vector. Under the conditions above, we have

  $J_n \to J = \frac{1}{\sigma_0^2}\begin{pmatrix} \Sigma + \Omega & 0 \\ 0 & 2 \end{pmatrix}$, and $B_n(t, u) \to B(t, u) = -\tfrac{1}{2} v^{\rm t} J v$.

2.4. The model robust AIC. The model selection strategy underlying the original AIC, see Claeskens & Hjort (2008, Ch. 3), is in essence to estimate the Kullback-Leibler distance from the truth to each estimated model and then to choose the model with the smallest such estimated distance. In view of the above this entails estimating

  $\hat k_n = k_n(\hat\theta, \hat\sigma) = -n \log \hat\sigma - \frac{1}{2}\frac{1}{\hat\sigma^2} \sum_{i=1}^n [\{m(x_i) - m(x_i, \hat\theta)\}^2 + \sigma_{\rm true}^2]$

from data, for each candidate model, and then choosing the model with the highest such estimate. To do this we start with $\ell_{n,\max} = \ell_n(\hat\theta, \hat\sigma) = -n \log \hat\sigma - \frac{1}{2} n$ and investigate to what extent this tends to overshoot its target $k_n(\hat\theta, \hat\sigma)$.
Write

  $\ell_{n,\max} = \ell_n(\theta_{0,n}, \sigma_{0,n}) + \{\ell_n(\hat\theta, \hat\sigma) - \ell_n(\theta_{0,n}, \sigma_{0,n})\} = \ell_n(\theta_{0,n}, \sigma_{0,n}) + E_{n,1}$,
  $\hat k_n = k_n(\theta_{0,n}, \sigma_{0,n}) + \{k_n(\hat\theta, \hat\sigma) - k_n(\theta_{0,n}, \sigma_{0,n})\} = k_n(\theta_{0,n}, \sigma_{0,n}) + E_{n,2}$.

The point here is that ${\rm E}\,\ell_n(\theta_{0,n}, \sigma_{0,n}) = k_n(\theta_{0,n}, \sigma_{0,n})$, i.e. that part of $\ell_{n,\max}$ is unbiased for the corresponding part of $\hat k_n$, whereas $E_{n,1}$ is positive and $E_{n,2}$ is negative, both contributing to a certain positive bias for $\ell_{n,\max}$. We first find

  $E_{n,1} = A_n(\hat t_n, \hat u_n) \to_d E_1 = A(\hat t, \hat u) = \tfrac{1}{4} W^2 + \tfrac{1}{2}(1/\sigma_0^2)\, U^{\rm t}(\Sigma + \Omega)^{-1} U$,

with $(\hat t_n, \hat u_n)$ and $(\hat t, \hat u)$ as in (2.7). Furthermore, $E_{n,2} = B_n(\hat t_n, \hat u_n) \to_d E_2 = B(\hat t, \hat u)$, and one finds in fact that $(E_{n,1}, E_{n,2}) \to_d (E_1, E_2)$ jointly, with $E_2 = -E_1$. So

  $E_{n,1} - E_{n,2} \to_d E_1 - E_2 = \tfrac{1}{2} W^2 + (1/\sigma_0^2)\, U^{\rm t}(\Sigma + \Omega)^{-1} U$,

which has the following mean value, the overshooting bias of $\ell_{n,\max}$ for estimating $k_n(\hat\theta, \hat\sigma)$:

  $b = {\rm E}(E_1 - E_2) = 1 + (\sigma_{\rm true}^2/\sigma_0^2)\,{\rm Tr}\{(\Sigma + \Omega)^{-1}\Sigma\}$.

This leads to

  ${\rm AIC}^* = 2\ell_{n,\max} - 2\hat b = 2\ell_{n,\max} - 2\big[1 + (\hat\sigma^2/\hat\sigma_0^2)\,{\rm Tr}\{(\hat\Sigma + \hat\Omega)^{-1}\hat\Sigma\}\big]$,   (2.9)

where $\hat\sigma^2$ is an estimate of $\sigma_{\rm true}^2$. We call this the model robust AIC. We note that the ${\rm AIC}^*$ developed here is not identical to the traditional AIC, which merely counts estimated parameters and here would be equal to ${\rm AIC} = 2\ell_{n,\max} - 2(1 + p)$. Our version (2.9) is model robust in that we are not assuming the parametric model to be correct when we assess and estimate the Kullback-Leibler deviances.

For the case of comparing and assessing linear regression models the above simplifies, in that $\Omega = 0$, leading to

  ${\rm AIC}^* = 2\ell_{n,\max} - 2(1 + p\,\hat\sigma^2/\hat\sigma_0^2)$.

Again, $\hat\sigma_0$ stems from the candidate model being assessed, whereas $\hat\sigma$ estimates the underlying $\sigma_{\rm true}$. We might use $\hat\sigma$ from the biggest model under consideration, or take $n^{-1}\sum_{i=1}^n \{y_i - \tilde m(x_i)\}^2$ with a nonparametric smoother for $\tilde m(x)$. We also record that our ${\rm AIC}^*$ is equivalent to choosing the candidate model with the smallest value of $n \log \hat\sigma_0 + (\hat\sigma^2/\hat\sigma_0^2)\,{\rm Tr}\{(\hat\Sigma + \hat\Omega)^{-1}\hat\Sigma\}$. This is also related to Mallows's $C_p$, but is in fact a more robust version, also encompassing the possibility of comparing linear with nonlinear regressions.
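For the linear case the score is easy to compute; the following sketch (ours, for polynomial candidate models, with $\hat\sigma^2$ taken from the largest candidate model as suggested above) illustrates the simplified formula $2\ell_{n,\max} - 2(1 + p\,\hat\sigma^2/\hat\sigma_0^2)$.

```python
import numpy as np

def aic_star_polynomials(x, y, degrees):
    """Model robust AIC* for polynomial regressions: since Omega = 0 here,
    AIC* = 2 l_max - 2 (1 + p sigma^2 / sigma_0^2), with p the dimension of theta."""
    n = len(y)
    sigma0_hat2 = {}
    for deg in degrees:
        X = np.vander(x, deg + 1)                      # p = deg + 1 regression parameters
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        sigma0_hat2[deg] = np.mean((y - X @ beta)**2)  # estimates the least false sigma_0^2
    sigma_hat2 = sigma0_hat2[max(degrees)]             # sigma_true^2 from the biggest model
    scores = {}
    for deg in degrees:
        l_max = -0.5 * n * np.log(sigma0_hat2[deg]) - 0.5 * n
        scores[deg] = 2 * l_max - 2 * (1 + (deg + 1) * sigma_hat2 / sigma0_hat2[deg])
    return scores                                      # highest score wins
```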
2.5. The BIC. Suppose a Bayesian prior $\pi(\theta, \sigma)$ is put up for the parameters of the model. We may then use (2.8) to approximate the marginal likelihood of the data, with implications for posterior model computations, as we shall see. We find

  $\lambda_n = \int \pi(\theta, \sigma) L_n(\theta, \sigma)\,d\theta\,d\sigma = \exp(\ell_{n,\max}) \int \exp\{\ell_n(\theta, \sigma) - \ell_n(\hat\theta, \hat\sigma)\}\,\pi(\theta, \sigma)\,d\theta\,d\sigma$.

Set now $\theta = \hat\theta + t/\sqrt{n}$ and $\sigma = \hat\sigma + u/\sqrt{n}$, which is seen to lead to involving $A_n^*(t, u)$ above, and

  $\lambda_n = \exp(\ell_{n,\max}) \int \exp\{A_n^*(t, u)\}\,\pi(\hat\theta + t/\sqrt{n},\, \hat\sigma + u/\sqrt{n})\,dt\,du\,(1/\sqrt{n})^{p+1}$.

This leads to the approximation

  $\log\lambda_n \doteq \ell_{n,\max} - \tfrac{1}{2}(p+1)\log n + \log\int \exp\{A^*(t, u)\}\,dt\,du + \log\pi(\hat\theta, \hat\sigma)$.

The integral here may be evaluated exactly, in view of (2.8), giving

  $\int \exp\{A^*(t, u)\}\,dt\,du = (2\pi)^{(p+1)/2}\, 2^{-1/2}\, \sigma_0^{p+1}\, |\Sigma + \Omega|^{-1/2}$,

and in its turn

  $\log\lambda_n \doteq \ell_{n,\max} - \tfrac{1}{2}(p+1)\log n + \tfrac{1}{2}(p+1)\log(2\pi) - \tfrac{1}{2}\log 2 + (p+1)\log\sigma_0 - \tfrac{1}{2}\log|\Sigma + \Omega| + \log\pi(\hat\theta, \hat\sigma)$.   (2.10)

Assume now that a list of candidate models $M_1, \ldots, M_k$ is under consideration, along with priors $\pi(\theta_j, \sigma_j)$ for the parameters $(\theta_j, \sigma_j)$ involved in model $M_j$, and finally with prior probabilities $p(M_j)$ to make the Bayesian multi-model description complete. The posterior model probabilities are then $p(M_j \mid {\rm data}) \propto p(M_j)\lambda_{n,j}$, with $\lambda_{n,j}$ the marginal likelihood computed under $M_j$, defined and computed as above. We then have

  $p(M_j \mid {\rm data}) \doteq \frac{p(M_j)\exp\{\ell_{n,j,\max} - \frac{1}{2}(p_j+1)\log n + D_{n,j}\}}{\sum_{j'=1}^k p(M_{j'})\exp\{\ell_{n,j',\max} - \frac{1}{2}(p_{j'}+1)\log n + D_{n,j'}\}}$,

with $p_j$ the dimension of $\theta_j$ and $D_{n,j}$ the relevant extra terms, from the approximation developed above. Since these extra terms are of lower order, in fact of order $O(p)$ compared to order $O_{\rm pr}(n)$ and $O(p\log n)$ for the two leading terms, the approximation $p(M_j \mid {\rm data}) \propto p(M_j)\exp(\frac{1}{2}{\rm BIC}_j)$ is a practical one, with

  ${\rm BIC}_j = 2\ell_{n,j,\max} - (p_j + 1)\log n$.   (2.11)

This is, in essence, the rationale underlying the BIC model selection criterion; cf. Claeskens & Hjort (2008, Ch. 4). One may of course attempt to use further aspects of the above development when approximating $p(M_j \mid {\rm data})$, e.g. using $(p_j+1)\log\{n/(2\pi)\}$ instead of $(p_j+1)\log n$ as suggested by (2.10), but this would also need an assessment of the actual size of the $\log\pi(\hat\theta, \hat\sigma)$ term. The common practice stemming from this methodology is to sidestep these lower-level issues, and also to take all prior model probabilities $p(M_j)$ equal; then selecting the most likely model given the data is equivalent to choosing the candidate model with the highest BIC score.
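In code, the resulting approximation to the posterior model probabilities is a one-liner; the small sketch below (our illustration) assumes equal prior model probabilities.

```python
import numpy as np

def bic_model_weights(l_max, p, n):
    """Approximate p(M_j | data) via (2.11): proportional to exp(BIC_j / 2),
    with BIC_j = 2 l_max_j - (p_j + 1) log n and equal prior probabilities."""
    bic = 2 * np.asarray(l_max, float) - (np.asarray(p, float) + 1) * np.log(n)
    w = np.exp(0.5 * (bic - bic.max()))   # subtract the maximum for numerical stability
    return w / w.sum()
```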
3. The two-window case: a single discontinuity

In this section we focus on the case of a single break point parameter $\gamma$, and denote the corresponding level parameters $a$ and $b$, to the left and right of $\gamma$, respectively. The model hence takes the form

  $y_i = m(x_i, \theta) + \varepsilon_i = \begin{cases} a + \varepsilon_i & \text{if } x_i \le \gamma, \\ b + \varepsilon_i & \text{if } x_i > \gamma, \end{cases}$   (3.1)

writing $\theta = (\gamma, a, b)$ for the full regression function parameter. As in the previous section we take the $\varepsilon_i$ to be i.i.d. ${\rm N}(0, \sigma^2)$, and view the $x_i$ as coming from a covariate distribution with density $f$ and cumulative $F$. In this section we work under the model conditions given above, where there are underlying true parameters $\theta_0 = (\gamma_0, a_0, b_0)$ and $\sigma_0$, with $\gamma_0$ an inner point of $(0, 1)$ and $a_0 \ne b_0$. To develop model robust inference methods and for constructing model selection criteria it is however also necessary to examine properties of estimators outside these model conditions, i.e. when the regression function $m_{\rm true}(x)$ is not necessarily of the precise form implicit in (3.1). Such an extended analysis is given in Section 4 below.

As for the case examined in Section 2, the log-likelihood function is $\ell_n(\theta, \sigma) = -n \log \sigma - \frac{1}{2} Q_n(\theta)/\sigma^2$, so the ML estimators $\hat\theta = (\hat\gamma, \hat a, \hat b)$ are those that minimise

  $Q_n(\theta) = \sum_{i=1}^n \{y_i - m(x_i, \theta)\}^2 = \sum_{x_i \le \gamma} (y_i - a)^2 + \sum_{x_i > \gamma} (y_i - b)^2$,   (3.2)

i.e. identical to the least squares estimators. The numerical recipe is clear enough, since

  $\hat a(\gamma) = \bar y_L = \frac{\sum_{x_i \le \gamma} y_i}{\sum_{x_i \le \gamma} 1}$ and $\hat b(\gamma) = \bar y_R = \frac{\sum_{x_i > \gamma} y_i}{\sum_{x_i > \gamma} 1}$,

the averages of the $y_i$ values to the left and to the right, respectively, minimise the function for given $\gamma$. This reduces the problem to the computation of a one-dimensional curve

  $Q_n(\gamma, \hat a(\gamma), \hat b(\gamma)) = \sum_{x_i \le \gamma} \{y_i - \hat a(\gamma)\}^2 + \sum_{x_i > \gamma} \{y_i - \hat b(\gamma)\}^2 = V_{n,L}(\gamma) + V_{n,R}(\gamma)$,

say, splitting into a sum of the variation to the left and to the right of $\gamma$. This function is constant inside intervals spanned by consecutive $x_i$ values (i.e. inside each $(x_{i-1}, x_i)$, if the $x_i$ are ordered), and we define $\hat\gamma$ to be the midpoint of the relevant interval where $Q_n$ is smallest. The sizes of these intervals tend uniformly to zero as the sample size increases, as a consequence of Assumption 1, so in the limit there is no ambiguity. We also note that an algebraic reformulation shows that minimising $V_{n,L}(\gamma) + V_{n,R}(\gamma)$ above is equivalent to the numerically faster recipe of maximising

  $n_L\,\hat a(\gamma)^2 + n_R\,\hat b(\gamma)^2$,

with $n_L = nF_n(\gamma)$ the number of points to the left of $\gamma$ and $n_R = n - n_L$ the number falling to the right.
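The recipe just described is a one-dimensional search over the $n - 1$ candidate split intervals; a compact Python sketch (ours) of the full least squares fit follows.

```python
import numpy as np

def fit_two_windows(x, y):
    """Least squares fit of model (3.1): maximise n_L abar(gamma)^2 + n_R bbar(gamma)^2
    over splits between consecutive ordered x_i, returning the midpoint gamma_hat."""
    order = np.argsort(x)
    xs, ys = np.asarray(x)[order], np.asarray(y)[order]
    n = len(ys)
    csum = np.cumsum(ys)
    n_L = np.arange(1, n)                  # split after observation 1, ..., n - 1
    ybar_L = csum[:-1] / n_L
    ybar_R = (csum[-1] - csum[:-1]) / (n - n_L)
    crit = n_L * ybar_L**2 + (n - n_L) * ybar_R**2
    k = int(np.argmax(crit))
    gamma_hat = 0.5 * (xs[k] + xs[k + 1])  # midpoint of the optimal interval
    return gamma_hat, ybar_L[k], ybar_R[k]
```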
Write now

  $Q_n(\theta) = \sum_{i=1}^n \{m(x_i, \theta_0) - m(x_i, \theta) + \varepsilon_i\}^2 = Q_{n,0} + R_n(\theta) - 2V_n(\theta)$,

say, with $Q_{n,0} = \sum_{i=1}^n \varepsilon_i^2$ not depending on $\theta$, and with

  $R_n(\theta) = \sum_{i=1}^n \{m(x_i, \theta) - m(x_i, \theta_0)\}^2$ and $V_n(\theta) = \sum_{i=1}^n \{m(x_i, \theta) - m(x_i, \theta_0)\}\varepsilon_i$.

It will prove useful to work with the random function

  $H_n(s, t_1, t_2) = Q_n(\gamma_0 + s/n,\, a_0 + t_1/\sqrt{n},\, b_0 + t_2/\sqrt{n}) - Q_n(\gamma_0, a_0, b_0)$.   (3.3)

The reason for the different scales here is that it will turn out that $\hat\gamma$ tends to $\gamma_0$ at a faster rate (namely $n$) than do $\hat a$ and $\hat b$ to $a_0$ and $b_0$ (namely $\sqrt{n}$). An algebraic reformulation, via $Q_n = Q_{n,0} + R_n - 2V_n$ above, gives $H_n(s, t_1, t_2) = C_n(s, t_1, t_2) - 2D_n(s, t_1, t_2)$, with

  $C_n(s, t_1, t_2) = \sum_{i=1}^n \{m(x_i, \gamma_0 + s/n, a_0 + t_1/\sqrt{n}, b_0 + t_2/\sqrt{n}) - m(x_i, \gamma_0, a_0, b_0)\}^2$,
  $D_n(s, t_1, t_2) = \sum_{i=1}^n \{m(x_i, \gamma_0 + s/n, a_0 + t_1/\sqrt{n}, b_0 + t_2/\sqrt{n}) - m(x_i, \gamma_0, a_0, b_0)\}\varepsilon_i$.

The following lemma spells out the limiting behaviour of $C_n$ and $D_n$, and hence of $H_n$. For this we first need to describe a certain two-sided compound Poisson process. For an intensity parameter $\lambda$, let $N^*(\lambda, s)$ be a two-sided Poisson process, i.e. equal to a Poisson process $N(s)$ for $s \ge 0$ and to another and independent Poisson process $N'(|s|)$ for $s \le 0$, both with parameter $\lambda$. Next, for $s \ge 0$ define

  $W^*(\lambda, s) = \sum_{i=1}^{N(s)} U_i$,

with the $U_i$ independent and standard normal and independent of $N$, and similarly $W^*(\lambda, s) = \sum_{i=1}^{N'(|s|)} U_i'$ for $s \le 0$, with this second set of $U_i'$ also being independent and standard normal and independent of the variables employed for positive $s$. We note that $W^*(\lambda, s)$ has mean zero, variance $\lambda|s|$, a point mass at zero of size $\exp(-\lambda|s|)$, and a somewhat heavier kurtosis than a two-sided Wiener process with the same variance; ${\rm E}\{W^*(\lambda, s)/\sqrt{\lambda|s|}\}^4 = 3 + 3/(\lambda|s|)$.

The point is now that $N^*(\lambda, s)$ and $W^*(\lambda, s)$ as described here appear as limits of natural processes inside our framework, with $\lambda = f(\gamma_0)$. First let $s > 0$ and consider $N_n(s)$, the number of $x_i$ in $[\gamma_0, \gamma_0 + s/n]$. It is a binomial $(n, q_n(s))$, say, with $q_n(s) = F(\gamma_0, \gamma_0 + s/n)$. We see that $N_n(s) \to_d N(s)$, a Poisson process with intensity $f(\gamma_0)$, and similarly for $s$ negative. Furthermore, the process

  $W_n(s) = \begin{cases} \sum_{\gamma_0 < x_i \le \gamma_0 + s/n} \varepsilon_i & \text{if } s > 0, \\ \sum_{\gamma_0 + s/n \le x_i < \gamma_0} \varepsilon_i & \text{if } s < 0, \end{cases}$

converges in distribution to a scaled version of such a two-sided compound Poisson process, say $\sigma_0 W^*(f(\gamma_0), s)$. Proving this is not difficult, e.g. via moment-generating functions.

Lemma 1. There is process convergence $H_n(s, t_1, t_2) \to_d H(s, t_1, t_2) = C(s, t_1, t_2) - 2D(s, t_1, t_2)$, where

  $C(s, t_1, t_2) = (b_0 - a_0)^2 N^*(f(\gamma_0), s) + F(\gamma_0) t_1^2 + \{1 - F(\gamma_0)\} t_2^2$,
  $D(s, t_1, t_2) = \sigma_0 |b_0 - a_0| W^*(f(\gamma_0), s) + V_1 t_1 + V_2 t_2$,

where $V_1$ and $V_2$ are independent zero-mean normals with variances $\sigma_0^2 F(\gamma_0)$ and $\sigma_0^2\{1 - F(\gamma_0)\}$, and also independent of $N^*(f(\gamma_0), s)$ and $W^*(f(\gamma_0), s)$ introduced above. The convergence takes place in the Skorokhod function space $D([-r, r]^3)$, for each positive $r$.

Note that we only sketch the proof, and do not provide a formal derivation of the tightness required for full process convergence.

Sketch of proof. We first demonstrate that $C_n$ tends to the $C$ function. To this end, consider the $s > 0$ case first. Then

  $m(x_i, \gamma_0 + s/n, a_0 + t_1/\sqrt{n}, b_0 + t_2/\sqrt{n}) - m(x_i, \gamma_0, a_0, b_0) = \begin{cases} a_0 + t_1/\sqrt{n} - a_0 & \text{if } x_i \le \gamma_0, \\ a_0 + t_1/\sqrt{n} - b_0 & \text{if } \gamma_0 < x_i \le \gamma_0 + s/n, \\ b_0 + t_2/\sqrt{n} - b_0 & \text{if } x_i > \gamma_0 + s/n. \end{cases}$

It follows, using $F_n$ for the empirical cumulative distribution function of the $x_i$ points, and recognising $N_n(s) = nF_n(\gamma_0, \gamma_0 + s/n)$ from above, that

  $C_n(s, t_1, t_2) = nF_n(\gamma_0) t_1^2/n + (a_0 - b_0 - t_1/\sqrt{n})^2\, nF_n(\gamma_0, \gamma_0 + s/n) + n\{1 - F_n(\gamma_0 + s/n)\} t_2^2/n \to_d (b_0 - a_0)^2 N^*(\lambda, s) + F(\gamma_0) t_1^2 + \{1 - F(\gamma_0)\} t_2^2$,

writing $\lambda = f(\gamma_0)$. Similarly going through the case of $s < 0$ gives the $C_n \to_d C$ result, for all $(s, t_1, t_2)$.

Next we work with $D_n$, which for the case of $s > 0$ may be expressed as

  $D_n(s, t_1, t_2) = t_1 n^{-1/2} \sum_{x_i \le \gamma_0} \varepsilon_i + t_2 n^{-1/2} \sum_{x_i > \gamma_0 + s/n} \varepsilon_i + (a_0 - b_0 + t_1/\sqrt{n}) \sum_{\gamma_0 < x_i \le \gamma_0 + s/n} \varepsilon_i$,

and its three terms converge jointly to the announced $V_1 t_1 + V_2 t_2 + \sigma_0 |b_0 - a_0| W^*(f(\gamma_0), s)$, via the central limit theorem for the first two sums and the $W_n$ convergence above for the third. □

4. Outside model conditions

When the true regression function $m(x)$ is not necessarily of the form implicit in (3.1), the least squares estimators are instead aiming at least false parameters $(\gamma_{0,n}, a_{0,n}, b_{0,n})$, minimising $\sum_{x_i \le \gamma} \{m(x_i) - a\}^2 + \sum_{x_i > \gamma} \{m(x_i) - b\}^2$. We find that

  $a_{0,n} = \bar m_L = \frac{\sum_{x_i \le \gamma_{0,n}} m(x_i)}{\sum_{x_i \le \gamma_{0,n}} 1}$ and $b_{0,n} = \bar m_R = \frac{\sum_{x_i > \gamma_{0,n}} m(x_i)}{\sum_{x_i > \gamma_{0,n}} 1}$,

and there will be a certain interval $(x_i, x_{i+1})$ of values of $\gamma$ for which the profile function

  $R_n(\gamma) = \sum_{x_i \le \gamma} \{m(x_i) - \bar m_L\}^2 + \sum_{x_i > \gamma} \{m(x_i) - \bar m_R\}^2$

is smallest. We define $\gamma_{0,n}$ to be the midpoint of that interval. These least false parameters, for given $x_1, \ldots, x_n$, have limits

  $a_0 = \frac{\int_0^{\gamma_0} m(x) f(x)\,dx}{\int_0^{\gamma_0} f(x)\,dx}$ and $b_0 = \frac{\int_{\gamma_0}^1 m(x) f(x)\,dx}{\int_{\gamma_0}^1 f(x)\,dx}$

as $n$ grows to infinity, where we assume there is a unique $\gamma_0$ minimising

  $R(\gamma) = \int_0^{\gamma} \{m(x) - a_0(\gamma)\}^2 f(x)\,dx + \int_{\gamma}^1 \{m(x) - b_0(\gamma)\}^2 f(x)\,dx$.
Paralleling the development of the previous section, we may write

  $H_n(s, t_1, t_2) = Q_n(\gamma_{0,n} + s/n,\, a_{0,n} + t_1/\sqrt{n},\, b_{0,n} + t_2/\sqrt{n}) - Q_n(\gamma_{0,n}, a_{0,n}, b_{0,n}) = C_n(s, t_1, t_2) - 2D_n(s, t_1, t_2)$,

with

  $C_n(s, t_1, t_2) = \sum_{i=1}^n [\{m(x_i, \gamma_{0,n} + s/n, a_{0,n} + t_1/\sqrt{n}, b_{0,n} + t_2/\sqrt{n}) - m(x_i)\}^2 - \{m(x_i, \gamma_{0,n}, a_{0,n}, b_{0,n}) - m(x_i)\}^2]$,
  $D_n(s, t_1, t_2) = \sum_{i=1}^n \{m(x_i, \gamma_{0,n} + s/n, a_{0,n} + t_1/\sqrt{n}, b_{0,n} + t_2/\sqrt{n}) - m(x_i, \gamma_{0,n}, a_{0,n}, b_{0,n})\}\varepsilon_i$.

For the following result we require that the true regression function $m$ has left- and right-hand limits $m(\gamma_0-)$ and $m(\gamma_0+)$ at a genuine jump point $\gamma_0$, and furthermore that $m(\gamma_0+)$ is closer to $b_0$ than to $a_0$, and that $m(\gamma_0-)$ is closer to $a_0$ than to $b_0$. In particular, the quantities

  $c_1 = \{m(\gamma_0-) - b_0\}^2 - \{m(\gamma_0-) - a_0\}^2$ and $c_2 = \{m(\gamma_0+) - a_0\}^2 - \{m(\gamma_0+) - b_0\}^2$   (4.1)

are then both positive. Under model conditions, $m(\gamma_0+) = b_0$ and $m(\gamma_0-) = a_0$, making both $c_1$ and $c_2$ equal to $|b_0 - a_0|^2$, which is the situation of Lemma 1. The following generalises that result.

Lemma 2. Under the above conditions, there is process convergence $H_n(s, t_1, t_2) \to_d H(s, t_1, t_2) = C(s, t_1, t_2) - 2D(s, t_1, t_2)$, where

  $C(s, t_1, t_2) = F(\gamma_0) t_1^2 + \{1 - F(\gamma_0)\} t_2^2 + \begin{cases} c_1 N^*(f(\gamma_0), s) & \text{if } s < 0, \\ c_2 N^*(f(\gamma_0), s) & \text{if } s \ge 0, \end{cases}$
  $D(s, t_1, t_2) = V_1 t_1 + V_2 t_2 + \sigma_{\rm true} |b_0 - a_0| W^*(f(\gamma_0), s)$,

where $V_1$, $V_2$ and $(N^*, W^*)$ are independent, with $V_1$ and $V_2$ zero-mean normals with variances $\sigma_{\rm true}^2 F(\gamma_0)$ and $\sigma_{\rm true}^2\{1 - F(\gamma_0)\}$. The convergence takes place in the Skorokhod function space $D([-r, r]^3)$, for each positive $r$.

Again, we only provide a sketch of the proof, as we do not provide the formal tightness proof required for full process convergence.

Sketch of proof. That $D_n$ tends to the same type of limit process $D$ as for Lemma 1, with $\sigma_{\rm true}$ now replacing $\sigma_0$ there, follows essentially as in the proof of that lemma, with some more attention to details. Now $C_n$ has a more complicated limit function $C$, however.

To study this limit, let first $s > 0$. The 'left' part of $C_n$, which takes the form

  $\sum_{x_i \le \gamma_0} [\{a_{0,n} + t_1/\sqrt{n} - m(x_i)\}^2 - \{a_{0,n} - m(x_i)\}^2]$,

is equal to $F_n(\gamma_0) t_1^2$, converging again to $F(\gamma_0) t_1^2$. The 'right' part, with $x_i > \gamma_0 + s/n$ terms, is a bit more complicated, but ends up converging to $\{1 - F(\gamma_0)\} t_2^2$. The middle part of $C_n$, summing over $x_i$ in $(\gamma_0, \gamma_0 + s/n]$, of which there are $N_n(s)$ terms converging to $N^*(f(\gamma_0), s)$, consists of terms essentially equal to

  $\{m(\gamma_{0,n}+) - a_{0,n}\}^2 - \{m(\gamma_{0,n}+) - b_{0,n}\}^2$.

By the assumptions, therefore, this middle part of $C_n(s, t_1, t_2)$ converges to the announced $c_2 N^*(f(\gamma_0), s)$ when $s > 0$. The case of $s < 0$ is handled similarly, and one finds that the middle part of $C_n$ tends to $c_1 N^*(f(\gamma_0), s)$. □

A fuller analysis taking also $\sigma$ estimation into account starts with the log-likelihood function $\ell_n(\theta, \sigma) = -n \log \sigma - \frac{1}{2}(1/\sigma^2) Q_n(\theta)$, as in Section 2, but now with the more complicated and non-smooth $Q_n(\theta)$ of (3.2).
The parallel to the crucial $A_n$ function (2.3) studied earlier is now

  $A_n(s, t_1, t_2, u) = \ell_n(\gamma_0 + s/n,\, a_0 + t_1/\sqrt{n},\, b_0 + t_2/\sqrt{n},\, \sigma_0 + u/\sqrt{n}) - \ell_n(\gamma_0, a_0, b_0, \sigma_0)$.

Via the efforts above this random function is seen to converge in distribution to

  $A(s, t_1, t_2, u) = \frac{u}{\sigma_0} W - \frac{u^2}{\sigma_0^2} - \frac{1}{2}\frac{1}{\sigma_0^2} H(s, t_1, t_2) = \frac{u}{\sigma_0} W - \frac{u^2}{\sigma_0^2} + \frac{1}{\sigma_0^2}\{D(s, t_1, t_2) - \tfrac{1}{2} C(s, t_1, t_2)\} = \frac{u}{\sigma_0} W - \frac{u^2}{\sigma_0^2} + \frac{1}{\sigma_0^2}\{V_1 t_1 + V_2 t_2 - \tfrac{1}{2} F_1 t_1^2 - \tfrac{1}{2} F_2 t_2^2 + M(s)\}$,   (4.2)

say. Here we write for simplicity $F_1 = F(0, \gamma_0)$ and $F_2 = F(\gamma_0, 1)$, and

  $M(s) = \sigma_{\rm true} |b_0 - a_0| W^*(f(\gamma_0), s) - \tfrac{1}{2} c(s) N^*(f(\gamma_0), s)$,   (4.3)

with $c(s)$ being $c_1$ for $s$ negative and $c_2$ for $s$ positive, as in Lemma 2. This properly generalises the case of (3.5), where $c_1$ and $c_2$ of (4.1) are both equal to $(b_0 - a_0)^2$. We learn from this that $\hat a, \hat b, \hat\gamma, \hat\sigma$ are again asymptotically independent, with $\hat a$ and $\hat b$ having the same limit distributions as in (3.4), but that the limit distribution for $\hat\gamma$ is now more complicated; $n(\hat\gamma - \gamma_0) \to_d \mathrm{argmax}(M)$. We note that a fully rigorous proof of this result requires the simultaneous process convergence of $A$ and $A$'s pure jump process, as explained in Seijo & Sen (2011).

5. The multi-window model

We now proceed to the more general case of several break points and windows. The theory and results of the previous section generalise rather nicely, without severe complications. Consider therefore model (1.2), with the regression function flat inside each of $d$ windows. The estimators $\hat a_j$ and $\hat\gamma_j$ are again those that minimise $Q_n(\theta) = \sum_{i=1}^n \{y_i - m(x_i, \theta)\}^2$. This is minimised for given $\gamma$ by setting $a_j$ equal to $\hat a_j(\gamma)$, the average of the $y_i$ in window $G_j(\gamma) = (\gamma_{j-1}, \gamma_j]$. Hence the windows are chosen such that the combined variation

  $Q_{n,\rm prof}(\gamma) = Q_n(\hat a(\gamma), \gamma) = \sum_{j=1}^d \sum_{i \in G_j(\gamma)} \{y_i - \hat a_j(\gamma)\}^2$

is minimal. This is equivalent to maximising $\sum_{j=1}^d n_j \hat a_j(\gamma)^2$.

In this section we do assume that the (1.2) model is fully correct, with true parameter values $a_{j,0}$ (there are $d$ such) and $\gamma_{j,0}$ (there are $d - 1$ such). Define

  $H_n(s, t) = Q_n(\gamma_0 + s/n,\, a_0 + t/\sqrt{n}) - Q_n(\gamma_0, a_0) = C_n(s, t) - 2D_n(s, t)$,

where $t = (t_1, \ldots, t_d)^{\rm t}$ and $s = (s_1, \ldots, s_{d-1})^{\rm t}$. The methods and results of Sections 3–4 are not very hard to generalise, and lead to $H_n(s, t) \to_d H(s, t) = C(s, t) - 2D(s, t)$, with

  $C(s, t) = \sum_{j=1}^{d-1} c_j(s_j) N_j^*(\lambda_j, s_j) + \sum_{j=1}^d F(\gamma_{j-1,0}, \gamma_{j,0}) t_j^2$,
  $D(s, t) = \sum_{j=1}^{d-1} \sigma_{\rm true} |a_{j+1,0} - a_{j,0}| W_j^*(\lambda_j, s_j) + \sum_{j=1}^d V_j t_j$,

where $W_1^*(\lambda_1, s), \ldots, W_{d-1}^*(\lambda_{d-1}, s)$ are independent two-sided compound Poisson processes, as described in association with Lemma 1, with $\lambda_j = f(\gamma_{0,j})$, and independent of the $V_j$, which are independent zero-mean normals with variances $\sigma^2 F(\gamma_{0,j-1}, \gamma_{0,j})$. Here we write $F(c, d)$ for $F(d) - F(c)$. Finally, $c_j(s)$ above is $c_{j,1}$ for $s < 0$ and $c_{j,2}$ for $s > 0$, where

  $c_{j,1} = \{m(\gamma_{0,j}-) - a_{0,j+1}\}^2 - \{m(\gamma_{0,j}-) - a_{0,j}\}^2$ and $c_{j,2} = \{m(\gamma_{0,j}+) - a_{0,j}\}^2 - \{m(\gamma_{0,j}+) - a_{0,j+1}\}^2$.   (5.1)
When the model used is correct, then $m(\gamma_{0,j}-) = a_{0,j}$ and $m(\gamma_{0,j}+) = a_{0,j+1}$, so that both $c_{j,1}$ and $c_{j,2}$ are equal to $(a_{0,j+1} - a_{0,j})^2$, cf. Lemma 1.

The crucial corollary is that all $2d$ parameter estimators are asymptotically independent, with

  $\sqrt{n}(\hat a_j - a_{j,0}) \to_d V_j/F(\gamma_{j-1,0}, \gamma_{j,0}) \sim {\rm N}(0, \sigma^2/F(\gamma_{j-1,0}, \gamma_{j,0}))$ and $n(\hat\gamma_j - \gamma_{j,0}) \to_d \mathrm{argmax}(M_j)$,

where $c_j = \frac{1}{2}|a_{j+1,0} - a_{j,0}|/\sigma$ and $\lambda_j = f(\gamma_{0,j})$, featuring also

  $M_j(s) = \sigma_{\rm true} |a_{0,j+1} - a_{0,j}| W_j^*(\lambda_j, s) - \tfrac{1}{2} c_j(s) N_j^*(\lambda_j, s)$.   (5.2)

The $c_j, \lambda_j$ quantities may be estimated consistently, after which confidence intervals may be computed for each break point $\gamma_j$, as per Remark 1. Note that to fully formalise this argument, an extension of the theory in Seijo & Sen (2011) to the case with several piecewise constant coordinates is needed. We do not provide this extension here.
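The limit law $\mathrm{argmax}(M_j)$ has no closed form, but it is easy to simulate. The following sketch is ours, written under model conditions (so that $c_{j,1} = c_{j,2} = (a_{0,j+1} - a_{0,j})^2$), and with the time axis truncated at a horizon $r$, which is harmless when the negative drift of $M_j$ dominates.

```python
import numpy as np

def draw_argmax_M(lam, jump, sigma, r=50.0, rng=None):
    """One draw of argmax_s M(s), with
    M(s) = sigma * |jump| * W*(lam, s) - 0.5 * jump**2 * N*(lam, s),
    simulated separately on the positive and negative half-axes up to time r."""
    rng = rng or np.random.default_rng()
    best_val, best_s = 0.0, 0.0                  # M(0) = 0 is always a candidate
    for sign in (1.0, -1.0):
        t, val = 0.0, 0.0
        while True:
            t += rng.exponential(1.0 / lam)      # waiting time to the next event
            if t > r:
                break
            val += sigma * abs(jump) * rng.normal() - 0.5 * jump**2
            if val > best_val:
                best_val, best_s = val, sign * t
    return best_s

rng = np.random.default_rng(3)
draws = [draw_argmax_M(lam=1.0, jump=1.0, sigma=0.5, rng=rng) for _ in range(2000)]
# Quantiles of such draws, divided by n, give approximate confidence limits for gamma_j.
```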
6. The AJIC

To develop a proper analogue of the AIC and its model robust version ${\rm AIC}^*$, we need to follow the chain of arguments given in Section 2, but modify results and formulae along the way to take the nonstandard asymptotics into account. This is what we focus on in this section, aiming for a formula of the type ${\rm AIC}^* = 2(\ell_{n,\max} - \hat b)$, with $\hat b$ the estimate of the positive bias $b$ involved when assessing $\ell_{n,\max}/n$ as an estimator of the model specific part of the Kullback-Leibler divergence from truth to model. We begin with

  $A_n(s, t, u) = \ell_n(\gamma_{0,n} + s/n,\, a_{0,n} + t/\sqrt{n},\, \sigma_{0,n} + u/\sqrt{n}) - \ell_n(\gamma_{0,n}, a_{0,n}, \sigma_{0,n}) \to_d A(s, t, u) = \frac{u}{\sigma_0} W - \frac{u^2}{\sigma_0^2} - \frac{1}{2}\frac{1}{\sigma_0^2}\{C(s, t) - 2D(s, t)\}$.

The limit process may be written in the form

  $A(s, t, u) = \frac{u}{\sigma_0} W - \frac{u^2}{\sigma_0^2} + \frac{1}{\sigma_0^2} \sum_{j=1}^d \{V_j t_j - \tfrac{1}{2} F(\gamma_{0,j-1}, \gamma_{0,j}) t_j^2\} + \frac{1}{\sigma_0^2} \sum_{j=1}^{d-1} M_j(s_j)$,

with $M_j(s)$ as in (5.2). Here we need

  $A_n(\hat s_n, \hat t_n, \hat u_n) = \ell_n(\hat\theta, \hat\sigma) - \ell_n(\theta_{0,n}, \sigma_{0,n}) \to_d A(\hat s, \hat t, \hat u)$,

where $\hat t_j = V_j/F(\gamma_{0,j-1}, \gamma_{0,j})$ and $\hat u = \frac{1}{2}\sigma_0 W$, along with the more involved $\hat s_j = \mathrm{argmin}(H_j) = \mathrm{argmax}(M_j)$. So $\ell_{n,\max} = \ell_{n,0} + E_{n,1}$, with $\ell_{n,0} = \ell_n(\theta_{0,n}, \sigma_{0,n})$ and

  $E_{n,1} \to_d A(\hat s, \hat t, \hat u) = \tfrac{1}{4} W^2 + \frac{1}{2}\frac{1}{\sigma_0^2} \sum_{j=1}^d \frac{V_j^2}{F(\gamma_{0,j-1}, \gamma_{0,j})} + \frac{1}{\sigma_0^2} \sum_{j=1}^{d-1} M_j(\hat s_j)$.

Next up is the parallel to the $B_n$ function of Section 2, which takes the form

  $B_n(s, t, u) = k_n(\gamma_{0,n} + s/n,\, a_{0,n} + t/\sqrt{n},\, \sigma_{0,n} + u/\sqrt{n}) - k_n(\gamma_{0,n}, a_{0,n}, \sigma_{0,n})$.

We find that it tends in distribution to $B(s, t, u) = -u^2/\sigma_0^2 - \frac{1}{2}(1/\sigma_0^2) C(s, t)$, with

  $C(s, t) = \sum_{j=1}^{d-1} c_j(s_j) N_j^*(\lambda_j, s_j) + \sum_{j=1}^d F(\gamma_{j-1,0}, \gamma_{j,0}) t_j^2$,

where $c_j(s)$ is as in Section 5, i.e. $c_{j,1}$ for $s$ negative and $c_{j,2}$ for $s$ positive. With $k_n(\hat\theta, \hat\sigma) = k_n(\theta_{0,n}, \sigma_{0,n}) + E_{n,2}$, therefore, we have $E_{n,2} = B_n(\hat s_n, \hat t_n, \hat u_n) \to_d E_2 = B(\hat s, \hat t, \hat u)$, yielding

  $E_2 = -\tfrac{1}{4} W^2 - \frac{1}{2}\frac{1}{\sigma_0^2} \sum_{j=1}^d \frac{V_j^2}{F(\gamma_{0,j-1}, \gamma_{0,j})} - \frac{1}{2}\frac{1}{\sigma_0^2} \sum_{j=1}^{d-1} c_j(\hat s_j) N_j^*(\lambda_j, \hat s_j)$.

This means that the extra part, the positive random component of $\ell_{n,\max}$ as an estimator of $k_n(\hat\theta, \hat\sigma)$, tends in distribution to

  $E_1 - E_2 = \tfrac{1}{2} W^2 + \frac{1}{\sigma_0^2} \sum_{j=1}^d \frac{V_j^2}{F(\gamma_{0,j-1}, \gamma_{0,j})} + \frac{1}{\sigma_0^2} \sum_{j=1}^{d-1} \{M_j(\hat s_j) + \tfrac{1}{2} c_j(\hat s_j) N_j^*(\lambda_j, \hat s_j)\}$.

The expected bias, for large $n$, is hence

  $b = 1 + d\,\frac{\sigma_{\rm true}^2}{\sigma_0^2} + \frac{1}{\sigma_0^2} \sum_{j=1}^{d-1} \kappa_j$,

say, with

  $\kappa_j = {\rm E}\{M_j(\hat s_j) + \tfrac{1}{2} c_j(\hat s_j) N_j^*(\lambda_j, \hat s_j)\} = \sigma_{\rm true} |a_{0,j+1} - a_{0,j}|\,{\rm E}\,W_j^*(\lambda_j, \hat s_j)$.

We conclude that the model robust AIC style JIC for comparing and assessing jump regression models with different numbers of windows is

  ${\rm AJIC}^* = 2\ell_{n,\max} - 2\hat b = 2(-n \log \hat\sigma_0 - \tfrac{1}{2} n) - 2\Big(1 + d\,\frac{\hat\sigma^2}{\hat\sigma_0^2} + \frac{1}{\hat\sigma_0^2} \sum_{j=1}^{d-1} \hat\kappa_j\Big)$.   (6.1)

This score is computed for each candidate number $d$ of windows, say up to a pre-selected $d_{\max}$, and in the end the model is selected which has the largest such score. As indicated, $\ell_{n,\max}$ here is still the simple formula $-n \log \hat\sigma_0 - \frac{1}{2} n$, but with $\hat\sigma_0^2 = Q_n(\hat\gamma, \hat a)/n$ involving the estimation of break points and levels for the number $d$ of windows under consideration. Also, $\hat\sigma$ is an estimate of $\sigma_{\rm true}$, e.g. taken as the $\hat\sigma_0$ for the biggest $d_{\max}$ under consideration. The most tricky part of ${\rm AJIC}^*$ is the $\hat\kappa_j$, an estimate of $\kappa_j$. Since no closed-form expression for this expected value is available, we resort to simulation, taking $\hat\kappa_j$ to be the average of say 1000 simulated copies of $\hat\sigma |\hat a_{j+1} - \hat a_j| W^*(\hat\lambda_j, \hat s_j)$. Here $\hat\lambda_j$ is a consistent estimate of $f(\gamma_{0,j})$, e.g. using a kernel estimate for the $x_i$; in cases where the design density is known, however, as when the $x_i$ are seen as uniformly distributed on the interval in question, we simply use that known value of $f(\gamma_{0,j})$. See Section 10 for an illustration.
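The simulation recipe for $\hat\kappa_j$ just described can be implemented along the same lines as the $\mathrm{argmax}(M_j)$ sketch given after Section 5: each replicate simulates the two-sided process, locates the argmax, and records $\hat\sigma|\hat a_{j+1} - \hat a_j| W^*$ there. The following is again a hedged sketch of ours, under model conditions and with a truncated horizon $r$.

```python
import numpy as np

def kappa_hat(lam, jump, sigma, n_sim=1000, r=50.0, rng=None):
    """Monte Carlo estimate of kappa_j = sigma * |jump| * E W*(lam, s_hat), where
    s_hat = argmax M_j; each replicate tracks W* and M along the event times."""
    rng = rng or np.random.default_rng()
    total = 0.0
    for _ in range(n_sim):
        best_val, best_w = 0.0, 0.0        # at s = 0: M = 0 and W* = 0
        for sign in (1.0, -1.0):           # positive and negative half-axes
            t, w, val, k = 0.0, 0.0, 0.0, 0
            while True:
                t += rng.exponential(1.0 / lam)
                if t > r:
                    break
                k += 1
                w += rng.normal()          # increment of W*
                val = sigma * abs(jump) * w - 0.5 * jump**2 * k
                if val > best_val:
                    best_val, best_w = val, w
        total += sigma * abs(jump) * best_w
    return total / n_sim
```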
7. The BJIC

Here we aim for a selection criterion of the BIC type, but with the necessary modifications of the chain of arguments used to reach (2.11) for the smooth regression case. As we saw in the BIC subsection of Section 2, this needs approximations to marginal likelihoods, and this pans out differently in jump models, as we shall now see.

In Section 6 we worked with the random process $A_n(s, t, u)$ and its limit $A(s, t, u)$, along with their maximisers, respectively $\hat s_n, \hat t_n, \hat u_n$ and $\hat s, \hat t, \hat u$. We infer from these results that

  $A_n^*(s, t, u) = A_n(\hat s_n + s,\, \hat t_n + t,\, \hat u_n + u) - A_n(\hat s_n, \hat t_n, \hat u_n) = \ell_n(\hat\gamma + s/n,\, \hat a + t/\sqrt{n},\, \hat\sigma + u/\sqrt{n}) - \ell_n(\hat\gamma, \hat a, \hat\sigma) \to_d A^*(s, t, u) = A(\hat s + s, \hat t + t, \hat u + u) - A(\hat s, \hat t, \hat u)$,

describing the log-likelihood function behaviour around the ML values. We find

  $A^*(s, t, u) = -\frac{u^2}{\sigma_0^2} - \frac{1}{2}\frac{1}{\sigma_0^2} \sum_{j=1}^d F_j t_j^2 + \frac{1}{\sigma_0^2} \sum_{j=1}^{d-1} M_j(s_j)$,

again with $M_j(s_j)$ as in (5.2), and with $F_j$ short-hand notation for $F(\gamma_{0,j}) - F(\gamma_{0,j-1})$.

We are now in a position to approximate marginal likelihoods, which will lead to our BJIC. For a given number of windows parameter $d$, suppose a prior $\pi(\gamma, a, \sigma)$ is placed on the model parameters. The marginal likelihood is then

  $\lambda_n = \int \pi(\gamma, a, \sigma) \exp\{\ell_n(\gamma, a, \sigma)\}\,d\gamma\,da\,d\sigma = \exp(\ell_{n,\max}) \int \pi(\hat\gamma + s/n,\, \hat a + t/\sqrt{n},\, \hat\sigma + u/\sqrt{n}) \exp\{A_n^*(s, t, u)\}\,ds\,dt\,du\,\Big(\frac{1}{n}\Big)^{d-1}\Big(\frac{1}{\sqrt{n}}\Big)^d\Big(\frac{1}{\sqrt{n}}\Big)$.

This leads to the approximation

  $\log\lambda_n = \ell_{n,\max} - (d-1)\log n - \tfrac{1}{2}(d+1)\log n + \log\int \exp\{A^*(s, t, u)\}\,ds\,dt\,du + \log\pi(\hat\gamma, \hat a, \hat\sigma)$,

assuming that the prior used is continuous on a region containing the parameter estimates. We may evaluate the integral part here, in part using simulations, but it is at any rate clear that

  $\log\lambda_n = \ell_{n,\max} - \tfrac{1}{2}(3d - 1)\log n + \delta_n$,   (7.1)

say, with $\delta_n$ of lower order than the two leading terms. This leads to the modified BIC for jump regression models, with

  ${\rm BJIC}_d = 2\ell_{n,d,\max} - (3d - 1)\log n$,   (7.2)

for the case of model $M_d$, with $d$ windows. Here $\ell_{n,d,\max} = -n \log \hat\sigma_d - \frac{1}{2} n$, the log-likelihood maximum for model $M_d$, with $\hat\sigma_d^2 = n^{-1} Q_n(\hat\gamma, \hat a)$ the ML estimate under that model. Increasing $d$ leads to a decreasing $\hat\sigma_d$ and hence to an increasing log-likelihood maximum. The traditional BIC would suggest penalty $2d \log n$, but our version (7.2) is the correct one here, reflecting the different type of asymptotics. As with (2.11), the interpretation is that when candidate models $M_1, \ldots, M_{d_{\max}}$ are being considered, associated with the number of windows $d$ being equal to $1, \ldots, d_{\max}$, and with prior probabilities $p(M_1), \ldots, p(M_{d_{\max}})$, then $p(M_d \mid {\rm data}) \propto p(M_d)\exp(\frac{1}{2}{\rm BJIC}_d)$. With equal prior probabilities for the $d_{\max}$ models, the model with the highest BJIC score is selected.

It is sometimes worth the extra trouble to construct a more complete approximation to the marginal likelihoods, using (7.1) but including the remainder term $\delta_n$, for each candidate model. First,

  $\int \exp\{A^*(s, t, u)\}\,ds\,dt\,du = \frac{(2\pi)^{1/2}\sigma_0}{\sqrt{2}} \prod_{j=1}^d \frac{(2\pi)^{1/2}\sigma_0}{F_j^{1/2}} \prod_{j=1}^{d-1} \int \exp\Big\{\frac{M_j(s_j)}{\sigma_0^2}\Big\}\,ds_j$,

with logarithm say

  $r_1 = \tfrac{1}{2}(d+1)\log(2\pi) + (d+1)\log\sigma_0 - \tfrac{1}{2}\log 2 - \tfrac{1}{2}\sum_{j=1}^d \log F_j + \sum_{j=1}^{d-1} \log\mu_j$,

where $\mu_j = \int \exp(M_j/\sigma_0^2)\,ds$. Secondly, suppose we use independent priors for $\gamma$, the levels $a_1, \ldots, a_d$, and $\sigma$, which appears to be a natural type of construction. If we use the natural flat Dirichlet prior for $\gamma$ on the simplex of $\gamma_j - \gamma_{j-1}$ nonnegative and adding to one, and priors for the $a_j$ corresponding to uniform distributions on intervals of length say $\tau$, then

  $r_2 = \log\pi(\hat\gamma, \hat a, \hat\sigma) = \log\Gamma(d) - d\log\tau + \log\pi_0(\hat\sigma)$,

say, with $\pi_0(\sigma)$ denoting the prior used for that parameter. The suggestion is then to use the fine-tuned

  ${\rm BJIC} = 2\ell_{n,d,\max} - (3d - 1)\log n + 2(\hat r_1 + r_2)$,

where $\hat r_1$ inserts estimates for the $F_j$ and uses simulation to evaluate the $\mu_j$, and where $r_2$ uses some $\tau$ natural for the application at hand. Finally, we advocate omitting the $\log\pi_0(\hat\sigma)$ term in $r_2$, since that term ought to be approximately the same across models.
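In its basic (7.2) form the criterion is immediate to compute once each candidate model has been fitted; a small sketch (ours) follows.

```python
import numpy as np

def bjic(q_min, n, d):
    """BJIC_d of (7.2): 2 l_max - (3d - 1) log n, with l_max = -n log sigma_d - n/2
    and sigma_d^2 = Q_min / n the ML variance estimate of the fitted d-window model."""
    sigma2_hat = q_min / n
    l_max = -0.5 * n * np.log(sigma2_hat) - 0.5 * n
    return 2.0 * l_max - (3 * d - 1) * np.log(n)

# Usage: compute bjic(Q_min_d, n, d) for d = 1, ..., d_max and select the largest.
```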
8. Bayes estimators

Consider the Bayesian framework, which starts out with a joint prior density for the level and break point parameters along with the spread parameter, to be combined in the usual fashion with the data likelihood, which we take to be based on the normal assumption. For simplicity of presentation we focus here on the case of two windows, as in Section 3; the generalisation to the multi-window case is fairly straightforward, using observations and techniques already worked with in Section 5.

The starting point is a prior density $\pi(\gamma, a, b, \sigma)$ for the four unknown parameters; we would typically take these four to be independent. The posterior density is then of course proportional to $\pi(\gamma, a, b, \sigma)\exp\{\ell_n(\gamma, a, b, \sigma)\}$, with $\ell_n$ the log-likelihood function. The random vector

  $(S_n, T_{n,1}, T_{n,2}, U_n) = (n(\gamma - \gamma_0),\, \sqrt{n}(a - a_0),\, \sqrt{n}(b - b_0),\, \sqrt{n}(\sigma - \sigma_0))$

hence has posterior density, say $\pi_n(s, t_1, t_2, u)$, proportional to

  $\pi(\gamma_0 + s/n,\, a_0 + t_1/\sqrt{n},\, b_0 + t_2/\sqrt{n},\, \sigma_0 + u/\sqrt{n})\exp\{A_n(s, t_1, t_2, u)\}$,

where $A_n$ is the log-likelihood process function worked with in Section 4, converging in distribution to the $A(s, t_1, t_2, u)$ of (4.2). As long as the prior is positive and continuous on a region containing the true values, therefore, the posterior density converges almost surely to $c\exp\{A(s, t_1, t_2, u)\}$, with $c$ the appropriate normalising constant. Due to the additive structure of $A$, this means that $S_n, T_{n,1}, T_{n,2}, U_n$ are asymptotically independent in the Bayesian posterior framework. The most interesting aspect here is that the posterior density of $S_n = n(\gamma - \gamma_0)$ tends to $h(s) \propto \exp\{M(s)/\sigma_0^2\}$, with $M(s)$ as in (3.5). Taking the mean here, we also learn that

  $n(\hat\gamma_B - \gamma_0) \to_d K = \int s\,h(s)\,ds = \frac{\int s\exp\{M(s)/\sigma_0^2\}\,ds}{\int \exp\{M(s)/\sigma_0^2\}\,ds}$,

where $\hat\gamma_B = {\rm E}(\gamma \mid {\rm data})$ is the Bayes estimator under quadratic loss. General results of Ibragimov & Khasminskii (1981, Chapter 1.9) imply that the Bayes estimator $\hat\gamma_B$ will have smaller limiting mean squared error than that of the ML estimator, i.e. $K$ above has smaller mean squared error than $\hat s = \mathrm{argmax}(M)$. In particular, there is no Bernshteĭn–von Mises theorem here about the ML and the Bayes estimators doing equally well for large $n$. The situation is easier and more standard when it comes to the level parameters and $\sigma$, however; here the above efforts indeed lead to such Bernshteĭn–von Mises theorems. In particular, the posterior distribution of $\sqrt{n}(a - \hat a)$ has the very same limit as has $\sqrt{n}(\hat a - a_0)$, etc.

9. Algorithms for computing estimates

The standard way of maximising the likelihood of break point models is via a dynamic programming approach, described in e.g. Bai & Perron (1998). We have followed this approach, and implemented the computational algorithm in C. We also improved the algorithm in two ways. First, we calculate the likelihood contributions of placing a jump at a specific region in the inner loop of the algorithm; most implementations we have seen keep these contributions in a pre-calculated look-up table. Such a table is of size $\binom{n}{d}$ and requires vast memory if $n$ and $d$ are large. In bioinformatics applications, both $n$ and $d$ are routinely very large. The other improvement concerns the way the dynamic programming routine accumulates the likelihoods of all possible jump configurations. As the routine places one jump at a time, we can avoid calculating all further nodes if the current node already has an accumulated likelihood exceeding that of a known likelihood configuration. If this known configuration is close to the maximised likelihood, we reduce the computational burden of the full maximisation.
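For concreteness, here is a compact Python sketch (ours, not the C implementation referred to above) of the basic dynamic programming recursion for least squares segmentation, without the two speed-ups just described; segment costs are computed on the fly from cumulative sums rather than kept in a stored table.

```python
import numpy as np

def segment_least_squares(y, d):
    """Optimal split of y (assumed sorted by x) into d windows, minimising the
    total within-window sum of squares; returns the d - 1 interior break indices."""
    y = np.asarray(y, float)
    n = len(y)
    c1 = np.concatenate(([0.0], np.cumsum(y)))
    c2 = np.concatenate(([0.0], np.cumsum(y**2)))

    def sse(i, j):                          # SSE of y[i:j] around its mean
        s = c1[j] - c1[i]
        return (c2[j] - c2[i]) - s * s / (j - i)

    cost = np.full((d + 1, n + 1), np.inf)  # cost[k, j]: best SSE for y[:j], k windows
    cut = np.zeros((d + 1, n + 1), dtype=int)
    cost[0, 0] = 0.0
    for k in range(1, d + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                val = cost[k - 1, i] + sse(i, j)
                if val < cost[k, j]:
                    cost[k, j], cut[k, j] = val, i
    breaks, j = [], n
    for k in range(d, 0, -1):               # backtrack the optimal cuts
        j = cut[k, j]
        breaks.append(j)
    return sorted(breaks)[1:], cost[d, n]   # drop the leading zero index
```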
9. Algorithms for computing estimates

The standard way of maximising the likelihood of break point models is via a dynamic programming approach, described in e.g. Bai & Perron (1998). We have followed this approach, and implemented the computational algorithm in C. We also improved the algorithm in two ways. First, we calculate the likelihood contributions of placing a jump at a specific region inside the inner loop of the algorithm. Most implementations we have seen instead read these contributions from a pre-calculated look-up table. Such a table is of size $\binom{n}{d}$ and requires vast storage if $n$ and $d$ are large; in bioinformatics applications, both $n$ and $d$ are routinely very large. Secondly, the dynamic programming routine calculates the accumulated likelihoods of all possible jump configurations. As the routine places one jump at a time, we can avoid calculating all further nodes if the current node already has an accumulated likelihood exceeding that of a known configuration. If this known configuration is close to the maximised likelihood, we reduce the computational burden of the full maximisation.

We propose the following very fast linear-time algorithm to provide such an initial estimate (a code sketch follows the description). First, let $i = 1$, $u_0 = 1$, $u_2 = n$ and
\[
u_1 = \mathop{\rm argmin}_{u_1} \sum_{j=0}^{1}\sum_{k=u_j}^{u_{j+1}} (Y_k - \bar Y_{u_j:u_{j+1}})^2.
\]
Note that this is a one-dimensional optimisation. Iterate the following two steps until $i = n$:

(1) Let $u_{i+1} = n$ and compute
\[
\hat u_i = \mathop{\rm argmin}_{u_i} \sum_{j=0}^{i-1}\sum_{k=u_j}^{u_{j+1}} (Y_k - \bar Y_{u_j:u_{j+1}})^2,
\]
and set $u_i = \hat u_i$.

(2) Compute
\[
\hat v_i = \mathop{\rm argmin}_{u_i} \sum_{1\le j\le i,\ j\ne i-1}\ \sum_{k=u_j}^{u_{j+1}} (Y_k - \bar Y_{u_j:u_{j+1}})^2,
\]
and set $v_i = \hat v_i$. If $v_i = u_{i-1}$, increase $i$ and go to step (1). Otherwise, set $u_{i-1} = v_i$, decrease $i$ by 1, and repeat step (2).

Now let $(w_1,\ldots,w_d)$ be the sorted version of $(u_1,\ldots,u_d)$. This is the initial value for the full optimisation.
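For concreteness, the following Python sketch implements the greedy idea in its simplest form, placing one break at a time by least squares. The backward adjustment of step (2) is omitted, so this is a naive O(n²d) illustration of the scheme above, not the authors' linear-time implementation.

```python
import numpy as np

def greedy_breaks(y, d):
    """Greedy initial break points: place d breaks one at a time, each
    time choosing the split that most reduces the total residual sum of
    squares over the current windows.  A simplified sketch of the
    initial-value scheme above (step (2)'s adjustment is omitted)."""
    y = np.asarray(y, dtype=float)

    def rss(seg):
        return ((seg - seg.mean())**2).sum() if len(seg) else 0.0

    bounds = [0, len(y)]                      # window boundaries
    for _ in range(d):
        best = None                           # (gain, cut position)
        for a, b in zip(bounds[:-1], bounds[1:]):
            whole = rss(y[a:b])
            for c in range(a + 1, b):         # candidate new break
                gain = whole - rss(y[a:c]) - rss(y[c:b])
                if best is None or gain > best[0]:
                    best = (gain, c)
        bounds = sorted(bounds + [best[1]])
    return bounds[1:-1]                       # the d interior breaks
```

The returned break indices can then be handed to the full dynamic-programming maximisation as its starting configuration.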
To test the efficiency of this procedure, we ran a simulation with $10^4$ repetitions, each repetition having a random sample size $n$ and a random number of break points $d$ chosen uniformly between specified limits. The jumps were distributed according to ${\rm N}(0, 1.5^2)$. Table 9.1 shows the savings when using the above initial-value algorithm compared to a full ML optimisation. Here, the column ‘correct breaks’ gives the percentage of jump points shared with the full ML solution, ‘correct solution’ the percentage of times where the initial and the ML solutions were identical, and ‘computational reduction’ compares the computation time with and without the initial-value code. It seems clear that we can gain much from specifying a good starting solution for the optimisation routine.

Sample size      Number of breaks   Correct breaks   Correct solution   Computational reduction
20 ≤ n ≤ 2000    1 ≤ d ≤ √n         54%              23%                66%
20 ≤ n ≤ 2000    1 ≤ d ≤ √n/2       71%              45%                50%
20 ≤ n ≤ 4000    1 ≤ d ≤ √n/2       65%              35%                52%

Table 9.1. The effects of the initial-value computation.

10. Illustrations

The following illustration is intended as a simple ‘proof of concept’, where we demonstrate that the change point models and the AJIC machinery beat the smooth regression models and the model robust AIC of Section 2, if there are actual change points in the true model and these are decently separated. With larger sample sizes, windows with smaller differences in levels may be detected too. In Figure 10.1 we have plotted $n = 1000$ realisations from a model with three change points $\gamma = (0.234, 0.50, 0.73)$ and levels $a = (1.0, 3.1, 2.8, 1.5)$ for the four cells. The $x_i$ have been sampled from the uniform distribution on the unit interval, and the errors are independent Gaussian noise with standard deviation $\sigma_{\rm true} = 0.5$. We compare the performance of seven models: three change point models with 1–3 break points, and four smooth regression models represented by polynomial regression from constant to cubic; see Table 10.1 for the individual results.

Figure 10.1. A simulated illustration based on 1000 realisations from a true change point model with three break points. The AJIC winner is the model with three breaks, indicated by the vertical lines. The best smooth regression model, according to the model robust AIC, is a cubic polynomial regression model, shown by the dashed line.

The algorithm presented in Section 9 is used to find the break points for the different change point models. To estimate the model robust correction term $\hat b$ we need the expectation ${\rm E}\,W_j^*(\lambda, \hat s_j)$, for $j = 1,\ldots,d-1$, in each candidate model. These expected values do not have closed form expressions. A simple practical solution is to evaluate them by simulating independent realisations of $W_j^*(\lambda, \hat s_j)$, for each $j$ and for each candidate model. For this particular model the estimates seem to be quite stable, and 500 realisations give good approximations; increasing the number of realisations in the simulations yields only minor improvements, i.e. in the second and third decimal places.

Note that the AJIC* and AIC* scores, for respectively the jump models and the smooth models, are constructed to be directly comparable. In other words, our AJIC machinery is not there merely to compare and rank models with different numbers of windows, but to work in concert with the robust AIC scores for other regression models.

model       AJIC*      AIC*        AIC         ℓ_{n,max}   b̂       σ̂       d   degree   p
1           -535.957   –           -535.947    -266.062    1.917   0.791   2   –        4
2           387.025    –           384.540     197.883     4.371   0.497   3   –        6
3           423.834    –           418.901     217.651     5.734   0.487   4   –        8
constant    –          -1029.331   -1029.887   -513.387    1.278   1.013   –   0        2
linear      –          -1012.365   -1012.931   -504.616    1.566   1.004   –   1        3
quadratic   –          -293.802    -294.966    -144.155    2.746   0.701   –   2        4
cubic       –          -270.732    -271.925    -131.981    3.385   0.692   –   3        5

Table 10.1. Comparison of AJIC and AIC scores for the three change point models (with 1–3 breaks) and the four smooth polynomial regression models, from constant to cubic, used in Figure 10.1. The smooth regression models are not even close to the change point models, the difference being primarily due to the large gap in $\ell_{n,\max}$ values.

Remark 2. In the implementation there is no lower bound on the minimum size of the windows. This means that if $d_{\max}$ is large, the maximum likelihood estimator will typically isolate some single extreme points and view these as ‘thin windows’. These observations are potential outliers or artifacts of the window search procedure and would need to be examined further. For the present purpose, however, with focus on model selection, a lower bound should be included, since we will typically want to avoid interpreting single points or tiny windows with very few points as genuine windows, i.e. as the effect of real changes in the underlying signal. Moreover, the specification of a suitable minimum length is also needed in order to obtain proper estimates for $c_{1,j}$ and $c_{2,j}$ in (5.1); this is currently done by using small local averages, which implicitly assumes a certain minimum window length.
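Such a lower bound is straightforward to build into a segmentation routine. The following Python sketch is a generic least-squares dynamic program in the spirit of Bai & Perron (1998), not the authors' C implementation, with a `min_len` argument guaranteeing that every window contains at least that many observations.

```python
import numpy as np

def dp_breaks(y, d, min_len=3):
    """Least-squares segmentation of y into d+1 windows, each containing
    at least min_len points (so n >= (d+1)*min_len is required), by
    dynamic programming over prefix sums; O(n^2 d) time."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    c1 = np.cumsum(np.r_[0.0, y])             # prefix sums of y
    c2 = np.cumsum(np.r_[0.0, y**2])          # prefix sums of y^2

    def rss(a, b):                            # RSS of window y[a:b]
        s = c1[b] - c1[a]
        return c2[b] - c2[a] - s * s / (b - a)

    cost = np.full((d + 2, n + 1), np.inf)    # cost[k, t]: k windows on y[:t]
    back = np.zeros((d + 2, n + 1), dtype=int)
    cost[0, 0] = 0.0
    for k in range(1, d + 2):
        for t in range(k * min_len, n + 1):
            for s in range((k - 1) * min_len, t - min_len + 1):
                c = cost[k - 1, s] + rss(s, t)
                if c < cost[k, t]:
                    cost[k, t], back[k, t] = c, s
    breaks, t = [], n                         # trace back the d breaks
    for k in range(d + 1, 1, -1):
        t = back[k, t]
        breaks.append(t)
    return sorted(breaks)
```

The constraint enters only through the ranges of $t$ and $s$, so thin windows are ruled out at no extra computational cost.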
11. Concluding remarks

We end our paper with a list of concluding notes, some pointing to further research work of relevance.

A. Time series with break points. Our basic model has been that of $y_i = m(x_i) + \varepsilon_i$, with the error terms $\varepsilon_i$ being i.i.d. For several application areas these might easily be dependent, however, perhaps with the $x_i$ signifying either time or spatial context. One may then extend our methods and results to such models, e.g. with an autoregressive structure for the error terms. Such extensions are possible, with the required generalisations resting on the theory of empirical processes for dependent data.

B. Tests for break points. Our theory has been geared towards finding good estimates, along with precision measures and e.g. confidence intervals, for given models. The theory may also be used to develop statistical tests for hypotheses of the type ‘$m(x)$ is constant’ against the alternative that ‘$m(x)$ has a discontinuity’. This is not pursued here, however.

C. Alternative large-sample setups. We have used Assumption 1, but there are other sensible asymptotic approximations as well. Besides the asymptotic approximations used in the econometric literature, as mentioned in the introduction, there are settings where deterministic covariates such as $x_i = i/n$ are natural. We still have convergence of ergodic averages, but with different limiting processes.

References

Bai, J. & Perron, P. (1998). Estimating and testing linear models with multiple structural changes. Econometrica 66, 47–78.
Bai, J. & Perron, P. (2003). Computation and analysis of multiple structural change models. Journal of Applied Econometrics 18, 1–22.
Bai, J. & Perron, P. (2006). Multiple structural change models: a simulation analysis. In Econometric Theory and Practice: Frontiers of Analysis and Applied Research, 212–240.
Banerjee, M. & McKeague, I. (2007). Confidence sets for split points in decision trees. Annals of Statistics 35, 543–574.
Basseville, M. & Nikiforov, I. (1993). Detection of Abrupt Changes: Theory and Application. xx: Prentice-Hall.
Billingsley, P. (1968). Convergence of Probability Measures. New York: Wiley.
Brodsky, B. & Darkhovsky, B. (1993). Nonparametric Methods in Change-Point Problems. xx: Springer.
Carlstein, E., Müller, H. & Siegmund, D. (1994). Change-Point Problems. xx: Institute of Mathematical Statistics.
Chan, K.S. (1993). Consistency and limiting distribution of the least squares estimator of a threshold autoregressive model. Annals of Statistics 21, 520–533.
Chib, S. (1998). Estimation and comparison of multiple change-point models. Journal of Econometrics 86, 221–241.
Chiu, G., Lockhart, R. & Routledge, R. (2006). Bent-cable regression theory and applications. Journal of the American Statistical Association 101, 542–553.
Claeskens, G. & Hjort, N.L. (2008). Model Selection and Model Averaging. Cambridge: Cambridge University Press.
Csörgő, M. & Horváth, L. (1997). Limit Theorems in Change-Point Analysis. New York: Wiley.
Erdman, C. & Emerson, J.W. (2007). bcp: an R package for performing a Bayesian analysis of change point problems. Journal of Statistical Software 23, 1–13.
Fearnhead, P. (2006). Exact and efficient Bayesian inference for multiple change-point problems. Statistics and Computing 16, 203–213.
Frick, S., Munk, A. & Sieling, H. (2014). Multiscale change-point inference [with discussion contributions]. Journal of the Royal Statistical Society Series B 76, 495–580.
Frigessi, A. & Hjort, N.L. (2002). Statistical models and methods for discontinuous phenomena. Journal of Nonparametric Statistics 14, 1–6.
Grønneberg, S. & Hjort, N.L. (2014). The copula information criteria. Scandinavian Journal of Statistics 41, 436–459.
Hermansen, G. & Hjort, N.L. (2014a). Bernstein–von Mises theorems for nonparametric function analysis via locally constant modelling: a unified approach. Submitted for publication.
Hermansen, G. & Hjort, N.L. (2014b). Focused information criteria for time series. Submitted for publication.
Hjort, J. (1914). Fluctuations in the Great Fisheries of Northern Europe Viewed in the Light of Biological Research. Copenhagen: Rapports et Procès-Verbaux des Réunions du Conseil International pour l'Exploration de la Mer.
Ibragimov, I. & Khasminskii, R. (1981). Statistical Estimation. Berlin: Springer.
Kjesbu, O.S., Opdal, A.F., Korsbrekk, K., Devine, J.A. & Skjæraasen, J.E. (2014). Making use of Johan Hjort's ‘unknown’ legacy: reconstruction of a 150-year coastal time series on Northeast Arctic cod (Gadus morhua) liver data reveals long-term trends in energy allocation patterns. ICES Journal of Marine Science xx, xx–xx.
Kosorok, M. (2008). Introduction to Empirical Processes and Semiparametric Inference. Berlin: Springer.
Lebarbier, E. (2005). Detecting multiple change-points in the mean of a Gaussian process by model selection. Signal Processing 85, 717–736.
Li, D. & Ling, S. (2012). On the least squares estimation of multiple-regime threshold autoregressive models. Journal of Econometrics 167, 240–253.
Ninomiya, Y. (2005). Information criterion for Gaussian change-point model. Statistics and Probability Letters 72, 237–247.
Seijo, E. & Sen, B. (2011). A continuous mapping theorem for the smallest argmax functional. Electronic Journal of Statistics 39, 1633–1657.
van de Geer, S. (2000). Empirical Processes in M-estimation. Cambridge: Cambridge University Press.
van der Vaart, A. & Wellner, J. (1996). Weak Convergence and Empirical Processes. Berlin: Springer.
Yu, P. (2012). Likelihood estimation and inference in threshold regression. Journal of Econometrics 167, 274–294.
Zou, C., Yin, G., Feng, L. & Wang, Z. (2014). Nonparametric maximum likelihood approach to multiple change-point problems. Annals of Statistics, xx–xx.
