Segmentation of the mean of heteroscedastic data via cross-validation

This paper tackles the problem of detecting abrupt changes in the mean of a heteroscedastic signal by model selection, without knowledge of the variations of the noise. A new family of change-point detection procedures is proposed, showing that cross…

Authors: Sylvain Arlot (LIENS), Alain Celisse

Sylvain Arlot and Alain Celisse
October 30, 2018

Abstract

This paper tackles the problem of detecting abrupt changes in the mean of a heteroscedastic signal by model selection, without knowledge of the variations of the noise. A new family of change-point detection procedures is proposed, showing that cross-validation methods can be successful in the heteroscedastic framework, whereas most existing procedures are not robust to heteroscedasticity. The robustness to heteroscedasticity of the proposed procedures is supported by an extensive simulation study, together with recent theoretical results. An application to Comparative Genomic Hybridization (CGH) data is provided, showing that robustness to heteroscedasticity can indeed be required for their analysis.

1 Introduction

The problem tackled in the paper is the detection of abrupt changes in the mean of a signal without assuming its variance is constant. Model selection and cross-validation techniques are used for building change-point detection procedures that significantly improve on existing procedures when the variance of the signal is not constant. Before detailing the approach and the main contributions of the paper, let us motivate the problem and briefly recall some related work in the change-point detection literature.

1.1 Change-point detection

The change-point detection problem, also called one-dimensional segmentation, deals with a stochastic process whose distribution abruptly changes at some unknown instants. The purpose is to recover the location of these changes and their number. This problem is motivated by a wide range of applications, such as voice recognition, financial time-series analysis [29] and Comparative Genomic Hybridization (CGH) data analysis [35].
A large literature exists about change-point detection in many frameworks [see 12, 17, for a complete bibliography]. The first papers on change-point detection were devoted to the search for the location of a unique change-point, also named breakpoint [see 34, for instance]. Looking for multiple change-points is a harder task and was studied later. For instance, Yao [49] used the BIC criterion for detecting multiple change-points in a Gaussian signal, and Miao and Zhao [33] proposed an approach relying on rank statistics.

The setting of the paper is the following. The values Y_1, …, Y_n ∈ R of a noisy signal at points t_1, …, t_n are observed, with

Y_i = s(t_i) + σ(t_i) ε_i,  E[ε_i] = 0 and Var(ε_i) = 1.  (1)

The function s is called the regression function and is assumed to be piecewise constant, or at least well approximated by piecewise-constant functions, that is, s is smooth everywhere except at a few breakpoints. The noise terms ε_1, …, ε_n are assumed to be independent and identically distributed. No assumption is made on σ : [0,1] → [0,∞). Note that all data (t_i, Y_i)_{1≤i≤n} are observed before detecting the change-points, a setting which is called off-line.

As pointed out by Lavielle [28], multiple change-point detection procedures generally tackle one among the following three problems:

1. detecting changes in the mean s, assuming the standard deviation σ is constant;
2. detecting changes in the standard deviation σ, assuming the mean s is constant;
3. detecting changes in the whole distribution of Y, with no distinction between changes in the mean s, changes in the standard deviation σ and changes in the distribution of ε.

In applications such as CGH data analysis, changes in the mean s have an important biological meaning, since they correspond to the limits of amplified or deleted areas of chromosomes.
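To make model (1) concrete, the following sketch simulates one such signal with a jumping mean and a non-constant noise level; the particular functions s and σ are illustrative choices of ours, not those used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
t = np.arange(1, n + 1) / n

# Illustrative piecewise-constant mean s with breakpoints at 1/3 and 2/3:
s = np.where(t < 1/3, 0.0, np.where(t < 2/3, 1.0, 0.4))
# Illustrative non-constant noise level sigma (heteroscedastic data):
sigma = np.where(t < 0.5, 0.05, 0.25)

# Model (1): Y_i = s(t_i) + sigma(t_i) * eps_i, with E[eps] = 0, Var(eps) = 1
Y = s + sigma * rng.standard_normal(n)
```

Problem 4 below asks to recover the jumps of s from Y alone, without knowing σ.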
However, in the CGH setting the standard deviation σ is not always constant, as assumed in problem 1. See Section 6 for more details on CGH data, for which heteroscedasticity (that is, variations of σ) corresponds to experimental artefacts or biological nuisance that should be removed. Therefore, CGH data analysis requires solving a fourth problem, which is the purpose of the present article:

4. detecting changes in the mean s with no constraint on the standard deviation σ : [0,1] → [0,∞).

Compared to problem 1, the difference is the presence of an additional nuisance parameter σ, making problem 4 harder. To the best of our knowledge, no change-point detection procedure has ever been proposed for solving problem 4 with no prior information on σ.

1.2 Model selection

Model selection is a successful approach for multiple change-point detection, as shown by Lavielle [28] and by Lebarbier [30] for instance. Indeed, a set of change-points, called a segmentation, is naturally associated with the set of piecewise-constant functions that may only jump at these change-points. Given a set of functions (called a model), estimation can be performed by minimizing the least-squares criterion (or other criteria, see Section 3). Therefore, detecting changes in the mean of a signal, that is, choosing a segmentation, amounts to selecting such a model. More precisely, given a collection of models {S_m}_{m∈M_n} and the associated collection of least-squares estimators {ŝ_m}_{m∈M_n}, the purpose of model selection is to provide a model index m̂ such that ŝ_m̂ reaches the best performance among all estimators {ŝ_m}_{m∈M_n}.

Model selection can target two different goals. On the one hand, a model selection procedure is efficient when its quadratic risk is smaller than the smallest quadratic risk of the estimators {ŝ_m}_{m∈M_n}, up to a constant factor C_n ≥ 1.
Su h a prop ert y is alled an or ale ine quality when it holds for ev ery nite sample size. The pro edure is said to b e asymptoti eient when the previous prop ert y holds with C n → 1 as n tends to innit y . Asymptoti eieny is the goal of AIC [2, 3℄ and Mallo ws' C p [32℄, among man y others. On the other hand, assuming that s b elongs to one of the mo dels { S m } m ∈M n , a pro- edure is mo del  onsistent when it  ho oses the smallest mo del on taining s asymptotially with probabilit y one. Mo del onsisteny is the goal of BIC [ 39℄ for instane. See also the artile b y Y ang [46℄ ab out the distintion b et w een eieny and mo del onsisteny . In the presen t pap er as in [30 ℄, the qualit y of a m ultiple  hange-p oin t detetion pro- edure is assessed b y the quadrati risk; hene, a hange in the me an hidden by the noise should not b e dete te d . This  hoie is motiv ated b y appliations where the signal-to-noise r atio may b e smal l , so that exatly reo v ering ev ery true  hange-p oin t is hop eless. There- fore, eient mo del seletion pro edures will b e used in order to detet the  hange-p oin ts. Without prior information on the lo ations of the  hange-p oin ts, the natural olletion of mo dels for  hange-p oin t detetion dep ends on the sample size n . Indeed, there exist  n − 1 D − 1  dieren t partitions of the n design p oin ts in to D in terv als, ea h partition orresp ond- ing to a set of ( D − 1)  hange-p oin ts. Sine D an tak e an y v alue b et w een 1 and n , 2 n − 1 mo dels an b e onsidered. Therefore, mo del seletion pro edures used for m ultiple  hange- p oin t detetion ha v e to satisfy non-asymptoti orale inequalities: the olletion of mo dels annot b e assumed to b e xed with the sample size n tending to innit y . (See Setion 2.3 for a preise denition of the olletion { S m } m ∈M n used for  hange-p oin t detetion.) 
Most mo del seletion results onsider p olynomial olletions of mo dels { S m } m ∈M n , that is Card( M n ) ≤ C n α for some onstan ts C, α ≥ 0 . F or p olynomial olletions, pro e- dures lik e AIC or Mallo ws' C p are pro v ed to satisfy orale inequalities in v arious framew orks [9, 15, 10, 16 ℄, assuming that data are homos e dasti , that is, σ ( t i ) do es not dep end on t i . Ho w ev er as sho wn in [6℄, Mallo ws' C p is sub optimal when data are heter os e dasti , that is the v ariane is non-onstan t. Therefore, other pro edures m ust b e used. F or instane, resampling p enalization is optimal with heterosedasti data [ 5℄. Another approa h has b een explored b y Gendre [25 ℄, whi h onsists in sim ultaneously estimating the mean and the v ariane, using a partiular p olynomial olletion of mo dels. Ho w ev er in  hange-p oin t detetion, the olletion of mo dels is exp onen tial, that is Card( M n ) is of order exp( αn ) for some α > 0 . F or su h large olletions, esp eially larger than p olynomial, the ab o v e p enalization pro edures fail. Indeed, Birgé and Massart [16℄ pro v ed that the minimal amoun t of p enalization required for a pro edure to satisfy an orale inequalit y is of the form p en( m ) = c 1 σ 2 D m n + c 2 σ 2 D m n log  n D m  , (2) where c 1 and c 2 are p ositiv e onstan ts and σ 2 is the v ariane of the noise, assumed to b e onstan t. Lebarbier [30℄ prop osed c 1 = 5 and c 2 = 2 for optimizing the p enalt y (2) in the on text of  hange-p oin t detetion. P enalties similar to (2) ha v e b een in tro dued indep enden tly b y other authors [38 , 1, 11, 45 ℄ and are sho wn to pro vide satisfatory results. 3 Nev ertheless, all these results assume that data are homosedasti. A tually , the mo del seletion problem with heterosedasti data and an exp onen tial olletion of mo dels has nev er b een onsidered in the literature, up to the b est of our kno wledge. 
Furthermore, penalties of the form (2) are very close to being proportional to D_m, at least for small values of D_m. Therefore, the results of [6] lead to the conjecture that the penalty (2) is suboptimal for model selection over an exponential collection of models when data are heteroscedastic. This paper suggests using cross-validation methods instead.

1.3 Cross-validation

Cross-validation (CV) methods allow one to estimate (almost) unbiasedly the quadratic risk of any estimator, such as ŝ_m (see Section 3.2 about the heuristics underlying CV). Classical examples of CV methods are the leave-one-out [Loo, 27, 43] and V-fold cross-validation [VFCV, 23, 24]. More references on cross-validation can be found in [7, 19] for instance. CV can be used for model selection, by choosing the model S_m for which the CV estimate of the risk of ŝ_m is minimal.

The properties of CV for model selection with a polynomial collection of models and homoscedastic data have been widely studied. In short, CV is known to adapt to a wide range of statistical settings, from density estimation [42, 20] to regression [44, 48] and classification [26, 47]. In particular, Loo is asymptotically equivalent to AIC or Mallows' C_p in several frameworks where they are asymptotically optimal, and other CV methods have similar performances, provided the size of the training sample is close enough to the sample size [see for instance 31, 40, 22]. In addition, CV methods are robust to heteroscedasticity of data [5, 7], as are several other resampling methods [6]. Therefore, CV is a natural alternative to penalization procedures assuming homoscedasticity. Nevertheless, nearly nothing is known about CV for model selection with an exponential collection of models, such as in the change-point detection setting.
The literature on model selection and CV [14, 40, 16, 21] only suggests that directly minimizing the Loo estimate of the risk over 2^{n−1} models would lead to overfitting. In this paper, a remark made by Birgé and Massart [16] about penalization procedures is used for solving this issue in the context of change-point detection. Model selection is performed in two steps: first, choose a segmentation given the number of change-points; second, choose the number of change-points. CV methods can be used at each step, leading to Procedure 6 (Section 5). The paper shows that such an approach is indeed successful for detecting changes in the mean of a heteroscedastic signal.

1.4 Contributions of the paper

The main purpose of the present work is to design a CV-based model selection procedure (Procedure 6) that can be used for detecting multiple changes in the mean of a heteroscedastic signal. Such a procedure experimentally adapts to heteroscedasticity when the collection of models is exponential, which has never been obtained before. In particular, Procedure 6 is a reliable alternative to Birgé and Massart's penalization procedure [15] when data can be heteroscedastic.

Another major difficulty tackled in this paper is the computational cost of resampling methods when selecting among 2^n models. Even when the number (D−1) of change-points is given, exploring the \binom{n-1}{D-1} partitions of [0,1] into D intervals and performing a resampling algorithm for each partition is not feasible when n is large and D > 0. An implementation of Procedure 6 with a tractable computational complexity is proposed in the paper, using closed-form formulas for leave-p-out (Lpo) estimators of the risk, dynamic programming, and V-fold cross-validation.
The paper also points out that least-squares estimators are not reliable for change-point detection when the number of breakpoints is given, although they are widely used for this purpose in the literature. Indeed, experimental and theoretical results detailed in Section 3.1 show that least-squares estimators suffer from local overfitting when the variance of the signal varies over the sequence of observations. On the contrary, minimizers of the Lpo estimator of the risk do not suffer from this drawback, which emphasizes the interest of using cross-validation methods in the context of change-point detection.

The paper is organized as follows. The statistical framework is described in Section 2. First, the problem of selecting the best segmentation given the number of change-points is tackled in Section 3. Theoretical results and an extensive simulation study show that the usual minimization of the least-squares criterion can be misleading when data are heteroscedastic, whereas cross-validation-based procedures provide satisfactory results in the same framework.

Then, the problem of choosing the number of breakpoints from data is addressed in Section 4. As supported by an extensive simulation study, V-fold cross-validation (VFCV) leads to a computationally feasible and statistically efficient model selection procedure when data are heteroscedastic, contrary to procedures implicitly assuming homoscedasticity. The resampling methods of Sections 3 and 4 are combined in Section 5, leading to a family of resampling-based procedures for detecting changes in the mean of a heteroscedastic signal. A wide simulation study shows that they perform well with both homoscedastic and heteroscedastic data, significantly improving on the performance of procedures which implicitly assume homoscedasticity.
Finally , Setion 6 illustrates on a real data set the promising b eha viour of the prop osed pro edures for analyzing CGH miroarra y data, ompared to pro edures previously used in this setting. 2 Statistial framew ork In this setion, the statistial framew ork of  hange-p oin t detetion via mo del seletion is in tro dued, as w ell as some notation. 2.1 Regression on a xed design Let S ∗ denote the set of measurable funtions [0 , 1] 7→ R . Let t 1 < · · · < t n ∈ [0 , 1] b e some deterministi design p oin ts, s ∈ S ∗ and σ : [0 , 1] 7→ [0 , ∞ ) b e some funtions and dene ∀ i ∈ { 1 , . . . , n } , Y i = s ( t i ) + σ ( t i ) ǫ i , (3) where ǫ 1 , . . . , ǫ n are indep enden t and iden tially distributed random v ariables with E [ ǫ i ] = 0 and E  ǫ 2 i  = 1 . 5 As explained in Setion 1.1 , the goal is to nd from ( t i , Y i ) 1 ≤ i ≤ n a pieewise-onstan t funtion f ∈ S ∗ lose to s in terms of the quadrati loss k s − f k 2 n := 1 n n X i =1 ( f ( t i ) − s ( t i )) 2 . 2.2 Least-squares estimator A lassial estimator of s is the le ast-squar es estimator , dened as follo ws. F or ev ery f ∈ S ∗ , the least-squares riterion at f is dened b y P n γ ( f ) := 1 n n X i =1 ( Y i − f ( t i )) 2 . The notation P n γ ( f ) means that the funtion ( t, Y ) 7→ γ ( f ; ( t, Y )) := ( Y − f ( t )) 2 is in tegrated with resp et to the empirial distribution P n := n − 1 P n i =1 δ ( t i ,Y i ) . P n γ ( f ) is also alled the empiri al risk of f . Then, giv en a set S ⊂ S ∗ of funtions [0 , 1] 7→ R (alled a mo del ), the least-squares estimator on mo del S is ERM( S ; P n ) := arg min f ∈ S { P n γ ( f ) } . The notation ERM( S ; P n ) stresses that the least-squares estimator is the output of the empirial risk minimization algorithm o v er S , whi h tak es a mo del S and a data sample as inputs. When a olletion of mo dels { S m } m ∈M n is giv en, b s m ( P n ) or b s m are shortuts for ERM( S m ; P n ) . 
2.3 Colletion of mo dels Sine the goal is to detet jumps of s , ev ery mo del onsidered in this artile is the set of pieewise onstan t funtions with resp et to some partition of [0 , 1] . F or ev ery K ∈ { 1 , . . . , n − 1 } and ev ery sequene of in tegers α 0 = 1 < α 1 < α 2 < · · · < α K ≤ n (the breakp oin ts), ( I λ ) λ ∈ Λ ( α 1 ,...α K ) denotes the partition [ t α 0 ; t α 1 ) , . . . , [ t α K − 1 ; t α K ) , [ t α K ; 1] of [0 , 1] in to ( K + 1) in terv als. Then, the mo del S ( α 1 ,...α K ) is dened as the set of pieewise onstan t funtions that an only jump at t = t α j for some j ∈ { 1 , . . . , K } . F or ev ery K ∈ { 1 , . . . , n − 1 } , let f M n ( K + 1) denote the set of su h sequenes ( α 1 , . . . α K ) of length K , so that { S m } m ∈ f M n ( K + 1) is the olletion of mo dels of piee- wise onstan t funtions with K breakp oin ts. When K = 0 , f M n (1) := {∅} and the mo del S ∅ is the linear spae of onstan t funtions on [0 , 1] . Remark that for ev ery K and m ∈ f M n ( K + 1) , S m is a v etor spae of dimension D m = K + 1 . In the rest of the pa- p er, the relationship b et w een the n um b er of breakp oin ts K and the dimension D = K + 1 of the mo del S ( α 1 ,...α K ) is used rep eatedly; in partiular, estimating of the n um b er of break- p oin ts (Setion 4) is equiv alen t to  ho osing the dimension of a mo del. In addition, sine a mo del S m is uniquely dened b y m , the index m is also alled a mo del. 6 The lassial olletion of mo dels for  hange-p oin t detetion an no w b e dened as { S m } m ∈ f M n , where f M n = S D ∈D n f M n ( D ) and D n = { 1 , . . . , n } . This olletion has a ardinalit y 2 n − 1 . In this pap er, a sligh tly smaller olletion of mo dels is onsidered, that is, all m ∈ f M n su h that ea h elemen t of the partition ( I λ ) λ ∈ Λ m on tains at least t w o design p oin ts ( t j ) 1 ≤ j ≤ n . 
Indeed, when nothing is known about the noise level σ(·), one cannot hope to distinguish two consecutive change-points from a local variation of σ. For every D ∈ {1, …, n}, let M_n(D) denote the set of m ∈ M̃_n(D) satisfying this property. Then, the collection of models used in this paper is defined as {S_m}_{m∈M_n}, where M_n = ∪_{D∈D_n} M_n(D) and D_n ⊂ {1, …, n/2}. Finally, in all the experiments of the paper, D_n = {1, …, 4n/10} for reasons detailed in Section 4.2, in particular Remark 3.

2.4 Model selection

Among {S_m}_{m∈M_n}, the best model is defined as the minimizer of the quadratic loss ‖s − ŝ_m‖_n² over m ∈ M_n and called the oracle m★. Since the oracle depends on s, one can only expect to select m̂(P_n) from the data such that the quadratic loss of ŝ_m̂ is close to that of the oracle with high probability, that is,

‖s − ŝ_m̂‖_n² ≤ C inf_{m∈M_n} { ‖s − ŝ_m‖_n² } + R_n,  (4)

where C is close to 1 and R_n is a small remainder term (typically of order n^{−1}). Inequality (4) is called an oracle inequality.

3 Localization of the breakpoints

A usual strategy for multiple change-point detection [28, 30] is to dissociate the search for the best segmentation given the number of breakpoints from the choice of the number of breakpoints. In this section, the number K = D − 1 of breakpoints is fixed and the goal is to localize them. In other words, the goal is to select a model among {S_m}_{m∈M_n(D)}.

3.1 Empirical risk minimization's failure with heteroscedastic data

As explained by many authors such as Lavielle [28], minimizing the least-squares criterion over {ŝ_m}_{m∈M_n(D)} is a classical way of estimating the best segmentation with (D−1) change-points. This leads to the following procedure:

Procedure 1.
m̂_ERM(D) := argmin_{m∈M_n(D)} { P_n γ(ŝ_m) } = ERM(S̃_D; P_n),

where S̃_D := ∪_{m∈M_n(D)} S_m is the set of piecewise-constant functions with exactly (D−1) change-points, chosen among t_2, …, t_n (see Section 2.3).

Figure 1: Comparison of ŝ_{m★(D)} (dotted black line), ŝ_{m̂_ERM(D)} (dashed blue line) and ŝ_{m̂_Loo(D)} (plain magenta line, see Section 3.2.2), D being the optimal dimension (see Figure 3). Data are generated as described in Section 3.3.1 with n = 100 data points. Left: homoscedastic data (s_2, σ_c), D = 4. Right: heteroscedastic data (s_3, σ_{pc,3}), D = 6.

Remark 1. Dynamic programming [13] leads to an efficient implementation of Procedure 1 with computational complexity O(n²).

Among models corresponding to segmentations with (D−1) change-points, the oracle model can be defined as

m★(D) := argmin_{m∈M_n(D)} { ‖s − ŝ_m‖_n² }.

Figure 1 illustrates how far m̂_ERM(D) typically is from m★(D) according to variations of the standard deviation σ. On the one hand, when data are homoscedastic, empirical risk minimization yields a segmentation close to the oracle (Figure 1, left). On the other hand, when data are heteroscedastic, empirical risk minimization introduces artificial breakpoints in areas where the noise level is above average, and misses breakpoints in areas where the noise level is below average (Figure 1, right). In other words, when data are heteroscedastic, empirical risk minimization over S̃_D locally overfits in high-noise areas, and locally underfits in low-noise areas.

The failure of empirical risk minimization with heteroscedastic data observed in Figure 1 is general [21, Chapter 7] and can be explained by Lemma 1 below.
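Before turning to that analysis, the dynamic programming of Remark 1 can be sketched in a few lines. This is our own minimal version of the standard Bellman recursion (0-based breakpoint indices; `Y` is the observation vector), not the authors' code; segment costs use prefix sums, and the overall cost is O(Dmax · n²):

```python
import numpy as np

def best_segmentations(Y, Dmax):
    """For each dimension D = 1..Dmax, return (least-squares cost,
    breakpoints) of the best segmentation of Y into D intervals."""
    n = len(Y)
    c1 = np.concatenate([[0.0], np.cumsum(Y)])
    c2 = np.concatenate([[0.0], np.cumsum(np.asarray(Y) ** 2)])

    def sse(i, j):
        # sum of squared errors of segment Y[i:j] around its mean
        s1, s2, m = c1[j] - c1[i], c2[j] - c2[i], j - i
        return s2 - s1 ** 2 / m

    INF = np.inf
    # cost[D][j] = minimal SSE of Y[0:j] split into D segments
    cost = [[INF] * (n + 1) for _ in range(Dmax + 1)]
    arg = [[0] * (n + 1) for _ in range(Dmax + 1)]
    cost[0][0] = 0.0
    for D in range(1, Dmax + 1):
        for j in range(D, n + 1):
            best, best_i = INF, 0
            for i in range(D - 1, j):  # position of the last breakpoint
                v = cost[D - 1][i] + sse(i, j)
                if v < best:
                    best, best_i = v, i
            cost[D][j], arg[D][j] = best, best_i

    def backtrack(D):
        bps, j = [], n
        for d in range(D, 0, -1):
            j = arg[d][j]
            if d > 1:
                bps.append(j)
        return sorted(bps)

    return {D: (cost[D][n], backtrack(D)) for D in range(1, Dmax + 1)}
```

For instance, on Y = (0, 0, 0, 5, 5, 5), the best 2-interval segmentation has zero cost and a single breakpoint at index 3.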
Indeed, the riteria P n γ ( b s m ) and k s − b s m k 2 n , resp etiv ely minimized b y b m ERM ( D ) and m ⋆ ( D ) o v er M n ( D ) , are lose to their resp etiv e exp etations, as pro v ed b y the onen tration inequalities of [7, Prop osition 9℄ for instane. Lemma 1 enables to ompare these exp etations. Lemma 1. L et m ∈ M n and dene s m := arg min f ∈ S m k s − f k 2 n . Then, E [ P n γ ( b s m )] = k s − s m k 2 n − V ( m ) + 1 n n X i =1 σ ( t i ) 2 (5) E h k s − b s m k 2 n i = k s − s m k 2 n + V ( m ) (6) wher e V ( m ) := P λ ∈ Λ m ( σ r λ ) 2 n and ∀ λ ∈ Λ m , ( σ r λ ) 2 := P n i =1 σ ( t i ) 2 1 t i ∈ I λ Card ( { k | t k ∈ I λ } ) . (7) 8 Lemma 1 is pro v ed in [21 ℄. As it is w ell-kno wn in the mo del seletion literature, the exp etation of the quadrati loss (6 ) is the sum of t w o terms: k s − s m k 2 n is the bias of mo del S m , and V ( m ) is a v ariane term, measuring the diult y of estimating the D m parameters of mo del S m . Up to the term n − 1 P n i =1 σ ( t i ) 2 whi h do es not dep end on m , the empirial risk underestimates the quadrati risk (that is, the exp etation of the quadrati loss), as sho wn b y (5) , b eause of the sign in fron t of V ( m ) . Nev ertheless, when data are homosedasti, that is when ∀ i , σ ( t i ) = σ , V ( m ) = D m σ 2 n − 1 is the same for all m ∈ M n ( D ) . Therefore, (5 ) and (6 ) sho w that for ev ery D ≥ 1 , when data are homosedasti arg min m ∈M n ( D ) { E [ P n γ ( b s m )] } = arg min m ∈M n ( D ) n E h k s − b s m k 2 n io . Hene, b m ERM ( D ) and m ⋆ ( D ) tend to b e lose to one another, as on the left of Figure 1. On the on trary , when data are heterosedasti, the v ariane term V ( m ) an b e quite dieren t among mo dels m ∈ M n ( D ) , ev en though they ha v e the same dimension D . Indeed, V ( m ) inreases when a breakp oin t is mo v ed from an area where σ is small to an area where σ is large. 
Therefore, the empirial risk minimization algorithm rather puts breakp oin ts in noisy areas in order to minimize − V ( m ) in (5 ). This is illustrated in the righ t panel of Figure 1, where the orale segmen tation m ⋆ ( D ) has more breakp oin ts in areas where σ is small. 3.2 Cross-v alidation Cross-v alidation (CV) metho ds are natural andidates for xing the failure of empirial risk minimization when data are heterosedasti, sine CV metho ds are naturally adaptiv e to heterosedastiit y (see Setion 1.3 ). The purp ose of this setion is to prop erly dene ho w CV an b e used for seleting b m ∈ M n ( D ) (Pro edure 2), and to reall theoretial results sho wing wh y this pro edure adapts to heterosedastiit y (Prop osition 1 ). 3.2.1 Heuristis The ross-v alidation heuristis [4 , 43℄ relies on a data splitting idea: F or ea h andidate algorithmsa y ERM( S m ; · ) for some m ∈ M n ( D ) , part of the dataalled training setis used for training the algorithm. The remaining partalled v alidation setis used for estimating the risk of the algorithm. This simple strategy is alled validation or hold- out . One an also split data sev eral times and a v erage the estimated v alues of the risk o v er the splits. Su h a strategy is alled r oss-validation (CV). CV with general rep eated splits of data has b een in tro dued b y Geisser [23 , 24℄. In the xed-design setting, ( t i , Y i ) 1 ≤ i ≤ n are not iden tially distributed so that CV estimates a quan tit y sligh tly dieren t from the usual predition error. Let T b e uniformly distributed o v er { t 1 , . . . , t n } and Y = s ( T ) + σ ( T ) ǫ , where ǫ is indep enden t from ǫ 1 , . . . , ǫ n 9 with the same distribution. Then, the CV estimator of the risk of b s ( P n ) estimates E ( T ,Y ) h ( b s ( T ) − Y ) 2 i = 1 n n X i =1 E ǫ h ( s ( t i ) + σ ( t i ) ǫ i − b s ( t i )) 2 i = k s − b s k 2 n + 1 n n X i =1 σ ( t i ) 2 . 
Hene, minimizing the CV estimator of E ( T ,Y ) h ( b s m ( T ) − Y ) 2 i o v er m amoun ts to minimize k s − b s m k 2 n , up to estimation errors. Ev en though the use of CV in a xed-design setting is not usual, theoretial results detailed in Setion 3.2.4 b elo w sho w that CV atually leads to a go o d estimator of the quadrati risk k s − b s m k 2 n . This fat is onrmed b y all the exp erimen tal results of the pap er. 3.2.2 Denition Let us no w formally dene ho w CV is used for seleting some m ∈ M n ( D ) from data. A (statistial) algorithm A is dened as an y measurable funtion P n 7→ A ( P n ) ∈ S ∗ . F or an y t i ∈ [0 , 1] , A ( t i ; P n ) denotes the v alue of A ( P n ) at p oin t t i . F or an y I ( t ) ⊂ { 1 , . . . , n } , dene I ( v ) := { 1 , . . . , n } \ I ( t ) , P ( t ) n := 1 Card( I ( t ) ) X i ∈ I ( t ) δ ( t i ,Y i ) and P ( v ) n := 1 Card( I ( v ) ) X i ∈ I ( v ) δ ( t i ,Y i ) . Then, the hold-out estimator of the risk of an y algorithm A is dened as b R ho ( A , P n , I ( t ) ) := P ( v ) n γ  A  P ( t ) n  = 1 Card( I ( v ) ) X i ∈ I ( v )  A ( t i ; P ( t ) n ) − Y i  2 . The r oss-validation estimators of the risk of A are then dened as the a v erage of b R ho ( A , P n , I ( t ) j ) o v er j = 1 , . . . , B where I ( t ) 1 , . . . , I ( t ) B are  hosen in a predetermined w a y [24 ℄. Lea v e-one-out, lea v e- p -out and V -fold ross-v alidation are among the most lassial examples of CV pro edures. They dier one another b y the  hoie of I ( t ) 1 , . . . , I ( t ) B . • L e ave-one-out ( Lo o ), often alled or dinary CV [4, 43℄, onsists in training with the whole sample exept one p oin t, used for testing, and rep eating this for ea h data p oin t: I ( t ) j = { 1 , . . . , n } \ { j } for j = 1 , . . . , n . The Lo o estimator of the risk of A is dened b y b R Loo ( A , P n ) := 1 n n X j =1   Y j − A  t j ; P ( − j ) n  2  , where P ( − j ) n = ( n − 1) − 1 P i, i 6 = j δ ( t i ,Y i ) . 
• Leave-p-out (Lpo_p, with any p ∈ {1, …, n−1}) generalizes Loo. Let E_p denote the collection of all possible subsets of {1, …, n} with cardinality n−p. Then, Lpo consists in considering every I^(t) ∈ E_p as training-set indices:

R̂_{Lpo_p}(A, P_n) := \binom{n}{p}^{−1} Σ_{I^(t)∈E_p} [ (1/p) Σ_{j∈I^(v)} (Y_j − A(t_j; P_n^(t)))² ].  (8)

• V-fold cross-validation (VFCV) is a computationally efficient alternative to Lpo and Loo. The idea is to first partition the data into V blocks, to use all the data but one block as a training sample, and to repeat the process V times. In other words, VFCV is a blockwise Loo, so that its computational complexity is V times that of A. Formally, let B_1, …, B_V be a partition of {1, …, n} and P_n^(B_k) := (n − Card(B_k))^{−1} Σ_{i∉B_k} δ_{(t_i,Y_i)} for every k ∈ {1, …, V}. The VFCV estimator of the risk of A is defined by

R̂_{VF_V}(A, P_n) := (1/V) Σ_{k=1}^V [ (1/Card(B_k)) Σ_{j∈B_k} (Y_j − A(t_j; P_n^(B_k)))² ].  (9)

The interested reader will find theoretical and experimental results on VFCV and the best way to use it in [7, 21] and references therein, in particular [18].

Given the Loo estimator of the risk of each algorithm A among {ERM(S_m; ·)}_{m∈M_n(D)}, the segmentation with (D−1) breakpoints chosen by Loo is defined as follows.

Procedure 2.

m̂_Loo(D) := argmin_{m∈M_n(D)} { R̂_Loo(ERM(S_m; ·), P_n) }.

The segmentations chosen by Lpo and VFCV are defined similarly and denoted respectively by m̂_{Lpo_p}(D) and m̂_{VF_V}(D). As illustrated by Figure 1, when data are heteroscedastic, m̂_Loo(D) is often closer to the oracle segmentation m★(D) than m̂_ERM(D). This improvement will be explained by the theoretical results in Section 3.2.4 below.
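The hold-out, Loo and VFCV estimators above can be implemented naively for regressograms with fixed breakpoints. A self-contained sketch (our own function names; the naive Loo costs O(n²), which is what motivates the closed-form formulas of Section 3.2.3):

```python
import numpy as np

def regressogram_fit(Y, breaks, train_mask):
    """Regressogram with fixed breakpoints, trained on a subset of the
    data: on each segment, predict the mean of the *training* points
    falling in it (assumes each segment keeps at least one)."""
    n = len(Y)
    edges = [0] + list(breaks) + [n]
    pred = np.empty(n)
    for a, b in zip(edges[:-1], edges[1:]):
        pred[a:b] = Y[a:b][train_mask[a:b]].mean()
    return pred

def holdout_risk(Y, breaks, val_idx):
    """Hold-out estimator: train on the complement of val_idx and
    average the squared errors on val_idx."""
    mask = np.ones(len(Y), bool)
    mask[val_idx] = False
    pred = regressogram_fit(Y, breaks, mask)
    return np.mean((Y[val_idx] - pred[val_idx]) ** 2)

def loo_risk(Y, breaks):
    """Leave-one-out: average of the n hold-out risks with a single
    validation point each (naive O(n^2) implementation)."""
    return np.mean([holdout_risk(Y, breaks, [j]) for j in range(len(Y))])

def vfold_risk(Y, breaks, folds):
    """V-fold CV as in (9): folds[i] in {0, ..., V-1} assigns point i
    to its validation block."""
    V = folds.max() + 1
    return np.mean([holdout_risk(Y, breaks, np.where(folds == k)[0])
                    for k in range(V)])
```

With n = V, every block is a single point and `vfold_risk` coincides with `loo_risk`, as noted in the text.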
3.2.3 Computational tractability

The computational complexity of ERM(S_m; P_n) is O(n), since for every λ ∈ Λ_m the value of ŝ_m(P_n) on I_λ is equal to the mean of {Y_i}_{t_i∈I_λ}. Therefore, a naive implementation of Lpo_p has computational complexity O(n \binom{n}{p}), which can be intractable for large n in the context of model selection, even when p = 1. In such cases, only VFCV with a small V would work straightforwardly, since its computational complexity is O(nV). Nevertheless, closed-form formulas for the Lpo estimator of the risk have been derived in the density estimation [20, 19] and regression [21] frameworks. Some of these closed-form formulas apply to regressograms ŝ_m with m ∈ M_n. The following theorem gives a closed-form expression for R̂_{Lpo_p}(m) := R̂_{Lpo_p}(ERM(S_m; ·), P_n), which can be computed with O(n) elementary operations.

Theorem 1 (Corollary 3.3.2 in [21]). Let m ∈ M_n, S_m and ŝ_m = ERM(S_m; ·) be defined as in Section 2. For every (t_1, Y_1), …, (t_n, Y_n) ∈ R² and λ ∈ Λ_m, define

S_{λ,1} := Σ_{j=1}^n Y_j 1_{t_j∈I_λ}  and  S_{λ,2} := Σ_{j=1}^n Y_j² 1_{t_j∈I_λ}.

Then, for every p ∈ {1, …, n−1}, the Lpo_p estimator of the risk of ŝ_m defined by (8) is given by

R̂_{Lpo_p}(m) = Σ_{λ∈Λ_m} (1/(p N_λ)) [ ((A_λ − B_λ) S_{λ,2} + B_λ S_{λ,1}²) 1_{n_λ≥2} + (+∞) 1_{n_λ=1} ],

where, for every λ ∈ Λ_m,

n_λ := Card({i | t_i ∈ I_λ}),

N_λ := 1 − 1_{p≥n_λ} \binom{n−n_λ}{p−n_λ} / \binom{n}{p},

A_λ := V_λ(0) (1 − 1/n_λ) − V_λ(1)/n_λ + V_λ(−1),

B_λ := V_λ(1) (2 − 1_{n_λ≥3}) / (n_λ(n_λ−1)) + (V_λ(0)/(n_λ−1)) [ (1 + 1/n_λ) 1_{n_λ≥3} − 2 ] − V_λ(−1) 1_{n_λ≥3} / (n_λ−1),

and

∀k ∈ {−1, 0, 1},  V_λ(k) := Σ_{r=max{1, n_λ−p}}^{min{n_λ, n−p}} r^k \binom{n−p}{r} \binom{p}{n_λ−r} / \binom{n}{n_λ}.

Remark 2.
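The general formula of Theorem 1 is intricate, but its key structural point (that R̂_{Lpo_p}(m) depends on the data only through the per-segment sums S_{λ,1} and S_{λ,2}, hence O(n) evaluation) is easy to check for p = 1. The sketch below is our own direct derivation of the Loo special case, not the theorem's general expression:

```python
import numpy as np

def loo_risk_closed_form(Y, breaks):
    """Closed-form leave-one-out (p = 1) risk of a regressogram, using
    only the per-segment sums S1 and S2, hence O(n).

    Our own derivation: removing Y_i from a segment of size n_l changes
    the segment mean to (S1 - Y_i)/(n_l - 1), so the held-out error is
    ((n_l * Y_i - S1)/(n_l - 1))^2; summing over the segment gives
    n_l * (n_l * S2 - S1^2) / (n_l - 1)^2.  Segments with n_l = 1 make
    the risk infinite, as in Theorem 1."""
    n = len(Y)
    edges = [0] + list(breaks) + [n]
    total = 0.0
    for a, b in zip(edges[:-1], edges[1:]):
        n_l = b - a
        if n_l < 2:
            return np.inf
        S1 = Y[a:b].sum()
        S2 = (Y[a:b] ** 2).sum()
        total += n_l * (n_l * S2 - S1 ** 2) / (n_l - 1) ** 2
    return total / n
```

Because the result is a sum of per-segment terms, it plugs directly into the dynamic programming discussed below.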
V λ ( k ) an also b e written as E  Z k 1 Z > 0  where Z has h yp ergeometri distri- bution with parameters ( n, n − p, n λ ) . An imp ortan t pratial onsequene of Theorem 1 is that for ev ery D and p , b m Lpo p ( D ) an b e omputed with the same omputational omplexit y as b m ERM ( D ) , that is O  n 2  . Indeed, Theorem 1 sho ws that b R Lpo p ( m ) is a sum o v er λ ∈ Λ m of terms dep ending only on { Y i } t i ∈ I λ , so that dynami programming [13 ℄ an b e used for omputing the mini- mizer b m Lpo p ( D ) of b R Lpo p ( m ) o v er m ∈ M n . Therefore, Lp o and Lo o ar e  omputational ly tr atable for hange-p oint dete tion when the n um b er of breakp oin ts is giv en. Dynami programming also applies to b m VF V with a omputational omplexit y O  V n 2  , sine ea h term app earing in b R VF V ( m ) is the a v erage o v er V quan tities that m ust b e omputed, exept when V = n sine VF CV then b eomes Lo o. Sine VF CV is mostly an appro ximation to Lo o or Lp o but has a larger omputational omplexit y , b m Lpo p will b e preferred to b m VF V ( D ) in the follo wing. 3.2.4 Theoretial guaran tees In order to understand wh y CV indeed w orks for  hange-p oin t detetion with a giv en n um b er of breakp oin ts, let us reall a straigh tforw ard onsequene of Theorem 1 whi h is pro v ed in details in [21, Lemma 7.2.1 and Prop osition 7.2.3℄. Prop osition 1. Using the notation of L emma 1, for any m ∈ M n , E h b R Lpo p ( m ) i ≈ k s − s m k 2 n + 1 n − p X λ ∈ Λ m ( σ r λ ) 2 + 1 n n X i =1 σ ( t i ) 2 , (10) 12 Figure 2: Regression funtions s 1 , s 2 , s 3 ; s 1 and s 2 are pieewise onstan t with 4 jumps; s 3 is pieewise onstan t with 9 jumps. wher e the appr oximation holds as so on as min λ ∈ Λ m n λ is lar ge enough (in p artiular lar ger than p ). 
The omparison of (6 ) and (10 ) sho ws that Lp o p yields an almost un biased estimator of k s − b s m k 2 n : The only dierene is that the fator 1 /n in fron t of the v ariane term V ( m ) has b een  hanged in to 1 / ( n − p ) . Therefore, minimizing the Lp o p estimator of the risk instead of the empirial risk allo ws to automatially tak e in to aoun t heterosedastiit y of data. 3.3 Sim ulation study The goal of this setion is to exp erimen tally assess, for sev eral v alues of p , the p erformane of Lp o p for deteting a giv en n um b er of  hanges in the mean of a heterosedasti signal. This p erformane is also ompared with that of empirial risk minimization. 3.3.1 Setting The setting desrib ed in this setion is used in all the exp erimen ts of the pap er. Data are generated aording to (3 ) with n = 100 . F or ev ery i , t i = i/n and ǫ i has a standard Gaussian distribution. The regression funtion s is  hosen among three pieewise onstan t funtions s 1 , s 2 , s 3 plotted on Figure 2. The mo del olletion desrib ed in Setion 2.3 is used with D n = { 1 , . . . , 4 n/ 10 } . The noise-lev el funtion σ ( · ) is  hosen among the follo wing funtions: 1. Homosedasti noise: σ c = 0 . 25 1 [0 , 1] , 2. Heterosedasti pieewise onstan t noise: σ pc, 1 = 0 . 2 1 [0 , 1 / 3] + 0 . 05 1 [1 / 3 , 1] , σ pc, 2 = 2 σ pc, 1 or σ pc, 3 = 2 . 5 σ pc, 1 . 3. Heterosedasti sin usoidal noise: σ s = 0 . 5 s in ( tπ / 4) . All om binations b et w een the regression funtions ( s i ) i =1 , 2 , 3 and the v e noise-lev els σ · ha v e b een onsidered, ea h time with N = 10 000 indep enden t samples. Results b elo w only rep ort a small part of the en tire sim ulation study but in tend to b e represen tativ e of the main observ ed b eha viour. 
A more omplete rep ort of the results, inluding other 13 0 5 10 15 20 25 30 35 40 0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05 Dimension D ERM Loo Lpo20 Lpo50 0 5 10 15 20 25 30 35 40 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 Dimension D ERM Loo Lpo20 Lpo50 Figure 3: E     s − b s b m P ( D )    2 n  as a funtion of D for P among `ERM' (empirial risk minimization), `Lo o' (Lea v e-one-out), ` Lp o(20) ' ( Lp o p with p = 20 ) and ` Lp o(50) ' ( Lp o p with p = 50 ). Left: homosedasti ( s 2 , σ c ) . Righ t: heterosedasti ( s 3 , σ pc, 3 ) . All urv es ha v e b een estimated from N = 10 000 indep enden t samples; error bars are all negligible in fron t of visible dierenes (the larger ones are smaller than 8 . 10 − 5 on the left, and smaller than 2 . 10 − 4 on the righ t). The urv es D 7→   s − b s b m P ( D )   2 n b eha v e similarly to their exp etations. regression funtions s and noise-lev el funtions σ , is giv en in the seond authors' thesis [ 21, Chapter 7℄; see also Setion 3 of the supplemen tary material. 3.3.2 Results: Comparison of segmen tations for ea h dimension The segmen tations of ea h dimension D ∈ D n obtained b y empirial risk minimization (` ERM ', Pro edure 1 ) and Lp o p (Pro edure 2) for sev eral v alues of p are ompared on Fig- ure 3, through the exp eted v alues of the quadrati loss E     s − b s b m P ( D )    2 n  for pro edure P . On the one hand, when data are homosedasti (Figure 3, left), all pro edures yield similar p erformanes for all dimensions up to t wie the b est dimension; Lp o p p erforms signian tly b etter for larger dimensions. Therefore, unless the dimension is strongly o v er- estimated (whatev er the w a y D is  hosen), all pro edures are equiv alen t with homosedasti data. On the other hand, when data are heterosedasti (Figure 3, righ t), ERM yields signi- an tly w orse p erformane than Lp o for dimensions larger than half the true dimension. 
As explained in Setions 3.1 and 3.2.4 , b m ERM ( D ) often puts breakp oin ts inside pure noise for dimensions D smaller than the true dimension, whereas Lp o do es not ha v e this dra wba k. Therefore, whatev er the  hoie of the dimension (exept D ≤ 4 , that is for deteting the ob vious jumps), Lp o should b e prefered to empirial risk minimization as so on as data are heterosedasti. 14 s · σ · ERM Lo o Lpo 20 Lpo 50 2  2.88 ± 0.01 2.93 ± 0.01 2.93 ± 0.01 2.94 ± 0.01 p ,1 1.31 ± 0.02 1.16 ± 0.02 1.14 ± 0.02 1.11 ± 0.01 p ,3 3.09 ± 0.03 2.52 ± 0.03 2.48 ± 0.03 2.32 ± 0.03 3  3.18 ± 0.01 3.25 ± 0.01 3.29 ± 0.01 3.44 ± 0.01 p ,1 3.00 ± 0.01 2.67 ± 0.02 2.68 ± 0.02 2.77 ± 0.02 p ,3 4.41 ± 0.02 3.97 ± 0.02 4.00 ± 0.02 4.11 ± 0.02 T able 1: A v erage p erformane C or ( J P , Id K ) for  hange-p oin t detetion pro edures P among ERM , Lo o and Lp o p with p = 20 and p = 50 . Sev eral regression funtions s and noise-lev el funtions σ ha v e b een onsidered, ea h time with N = 10 000 indep enden t samples. Next to ea h v alue is indiated the orresp onding empirial standard deviation divided b y √ N , measuring the unertain t y of the estimated p erformane. 3.3.3 Results: Comparison of the b est segmen tations This setion fo uses on the segmen tation obtained with the b est p ossible  hoie of D , that is the one orresp onding to the minim um of D 7→    s − b s b m P ( D )    2 n (plotted on Figure 3) for pro edures P among ERM , Lo o , and Lp o p with p = 20 and p = 50 . Therefore, the p erformane of a pro edure P is dened b y C or ( J P , Id K ) := E  inf 1 ≤ D ≤ n     s − b s b m P ( D )    2 n  E h inf m ∈M n n k s − b s m k 2 n oi , whi h measures what is lost ompared to the orale when seleting one segmen tation b m P ( D ) p er dimension. 
Even if the choice of D is a real practical problem, which will be tackled in the next sections, C_or(⟦P, Id⟧) helps to understand which is the best procedure for selecting a segmentation of a given dimension. The notation C_or(⟦P, Id⟧) has been chosen for consistency with the notation used in the next sections (see Section 5.1).

Table 1 confirms the results of Section 3.3.2. On the one hand, when data are homoscedastic, ERM performs slightly better than Loo or Lpo_p. On the other hand, when data are heteroscedastic, Lpo_p often performs better than ERM (whatever p), and the improvement can be large (more than 20% in the setting (s_2, σ_pc,3)). Overall, when homoscedasticity of the signal is questionable, Lpo_p appears much more reliable than ERM for localizing a given number of change-points of the mean.

The question of choosing p to optimize the performance of Lpo_p remains widely open. The simulation experiment summarized in Table 1 only shows that Lpo_p improves on ERM whatever p, the optimal value of p depending on s and σ.

4 Estimation of the number of breakpoints

In this section, the number of breakpoints is no longer fixed or known a priori. The goal is precisely the estimation of this number, as often needed with real data.

Two main procedures are considered. First, a penalization procedure introduced by Birgé and Massart [15] is analyzed in Section 4.1; this procedure is successful for change-point detection when data are homoscedastic [28, 30]. On the basis of this analysis, V-fold cross-validation (VFCV) is then proposed as an alternative to Birgé and Massart's penalization procedure (BM) when data can be heteroscedastic.
In order to enable the comparison between BM and VFCV when focusing on the question of choosing the number of breakpoints, VFCV is used for choosing among the same segmentations as BM, that is, {\hat{m}_{ERM}(D)}_{D ∈ D_n}. The combination of VFCV for choosing D with the new procedures proposed in Section 3 will be studied in Section 5.

4.1 Birgé and Massart's penalization

First, let us define precisely the penalization procedure proposed by Birgé and Massart [15] and successfully used for change-point detection in [28, 30].

Procedure 3 (Birgé and Massart [15]).
1. ∀m ∈ M_n, \hat{s}_m := ERM(S_m; P_n).
2. \hat{m}_{BM} := arg min_{m ∈ M_n, D_m ∈ D_n} { P_n γ(\hat{s}_m) + pen_BM(m) }, where for every m ∈ M_n the penalty pen_BM(m) only depends on S_m through its dimension:

    pen_BM(m) = pen_BM(D_m) := \hat{C} \frac{D_m}{n} \Big( 5 + 2 \log\frac{n}{D_m} \Big),   (11)

where \hat{C} is estimated from data using Birgé and Massart's slope heuristics [16, 8], as proposed by Lebarbier [30] and by Lavielle [28]. See Section 1 of the supplementary material for a detailed discussion about \hat{C}.
3. \tilde{s}_{BM} := \hat{s}_{\hat{m}_{BM}}.

All m ∈ M_n(D) are penalized in the same way by pen_BM(m), so that Procedure 3 actually selects a segmentation among {\hat{m}_{ERM}(D)}_{D ∈ D_n}. Therefore, Procedure 3 can be reformulated as follows, as noticed in [16, Section 4.3].

Procedure 4 (Reformulation of Procedure 3).
1. ∀D ∈ D_n, \hat{s}_{\hat{m}_{ERM}(D)} := ERM(\tilde{S}_D; P_n), where \tilde{S}_D := ∪_{m ∈ M_n(D)} S_m.
2. \hat{D}_{BM} := arg min_{D ∈ D_n} { P_n γ(\hat{s}_{\hat{m}_{ERM}(D)}) + pen_BM(D) }, where pen_BM(D) is defined by (11).
3. \tilde{s}_{BM} := \hat{s}_{\hat{m}_{ERM}(\hat{D}_{BM})}.

In the following, 'BM' denotes Procedure 4 and crit_BM(D) := P_n γ(\hat{s}_{\hat{m}_{ERM}(D)}) + pen_BM(D) is called the BM criterion.

Procedure 4 clarifies the reason why pen_BM must be larger than Mallows' C_p penalty.
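Step 2 of Procedure 4 is a one-dimensional minimization once the per-dimension empirical risks are available. A minimal sketch (our code; here the constant C is simply passed in, whereas the paper calibrates \hat{C} by the slope heuristics):

```python
import math

def pen_bm(D, n, C):
    """Birgé-Massart penalty (11): C * (D/n) * (5 + 2 * log(n/D))."""
    return C * (D / n) * (5.0 + 2.0 * math.log(n / D))

def select_dimension_bm(emp_risk, n, C):
    """Pick D minimizing the BM criterion crit_BM(D) = emp_risk + pen_BM(D).

    emp_risk[D-1] stands for P_n gamma(s_hat_{m_ERM(D)}), D = 1..len(emp_risk);
    in the actual procedure C would be the slope-heuristics estimate C_hat.
    """
    crits = {D: emp_risk[D - 1] + pen_bm(D, n, C)
             for D in range(1, len(emp_risk) + 1)}
    return min(crits, key=crits.get)
```

The log(n/D) factor is what makes pen_BM larger than Mallows' C_p: it accounts for the exponential number of segmentations sharing a given dimension.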
Indeed, for every m ∈ M_n, Lemma 1 shows that when data are homoscedastic, P_n γ(\hat{s}_m) + pen(m) is an unbiased estimator of ||s − \hat{s}_m||_n^2 when pen(m) = 2σ^2 D_m n^{−1}, that is, Mallows' C_p penalty.

Figure 4: Comparison of the expectations of ||s − \hat{s}_{\hat{m}(D)}||_n^2 ('Loss'), crit_{VF_V}(D) ('VF5') and crit_BM(D) ('BM'). Data are generated as explained in Section 3.3.1. Left: homoscedastic (s_2, σ_c). Right: heteroscedastic (s_2, σ_pc,3). Expectations have been estimated from N = 10 000 independent samples; error bars are all negligible compared with the visible differences (the largest are smaller than 5 · 10^{−4} on the left, and smaller than 2 · 10^{−3} on the right). Similar behaviours are observed for every single sample, with slightly larger fluctuations for crit_{VF_V}(D) than for crit_BM(D). The curves 'BM' and 'VF5' have been shifted in order to make the comparison with 'Loss' easier, without changing the location of the minimum.

When Card(M_n) is at most polynomial in n, Mallows' C_p penalty leads to an efficient model selection procedure, as proved in several regression frameworks [41, 31, 10]. Hence, Mallows' C_p penalty is an adequate measure of the capacity of any vector space S_m of dimension D_m, at least when data are homoscedastic.

On the contrary, in the change-point detection framework, Card(M_n) grows exponentially with n. The formulation of Procedure 4 points out that pen_BM(D) has been built so that crit_BM(D) unbiasedly estimates ||s − \hat{s}_{\hat{m}_{ERM}(D)}||_n^2 for every D, where \hat{s}_{\hat{m}_{ERM}(D)} is the empirical risk minimizer over \tilde{S}_D. Hence, pen_BM(D) measures the capacity of \tilde{S}_D, which is much bigger than a vector space of dimension D.
Therefore, pen_BM should be larger than Mallows' C_p, as confirmed by the results of Birgé and Massart [16] on minimal penalties for exponential collections of models.

Simulation experiments support the fact that crit_BM(D) is an unbiased estimator of ||s − \hat{s}_{\hat{m}(D)}||_n^2 for every D (up to an additive constant) when data are homoscedastic (Figure 4, left). However, when data are heteroscedastic, the theoretical results proved by Birgé and Massart [15, 16] no longer apply, and simulations show that crit_BM(D) does not always estimate ||s − \hat{s}_{\hat{m}_{ERM}(D)}||_n^2 well (Figure 4, right). This result is consistent with Lemma 1, as well as with the suboptimality of penalties proportional to D_m for model selection among a polynomial collection of models when data are heteroscedastic [6]. Therefore, pen_BM(D) is not an adequate capacity measure of \tilde{S}_D in general when data are heteroscedastic, and another capacity measure is required.

4.2 Cross-validation

As shown in Section 3.2.2, CV can be used for estimating the quadratic loss ||s − A(P_n)||_n^2 for any algorithm A. In particular, CV was successfully used in Section 3 for estimating the quadratic risk of ERM(S_m; ·) for all segmentations m ∈ M_n(D) with a given number (D − 1) of breakpoints (Procedure 2), even when data are heteroscedastic. Therefore, CV methods are natural candidates for fixing BM's failure. The proposed procedure, with VFCV, is the following.

Procedure 5.
1. ∀D ∈ D_n, \hat{s}_{\hat{m}_{ERM}(D)} := ERM(\tilde{S}_D; P_n);
2. \hat{D}_{VF_V} := arg min_{D ∈ D_n} { crit_{VF_V}(D) }, where crit_{VF_V}(D) := \hat{R}_{VF_V}(ERM(\tilde{S}_D(·); ·), ·) and \hat{R}_{VF_V} is defined by (9).

Remark 3. In the algorithm (t_i, Y_i)_{1 ≤ i ≤ n} ↦ ERM(\tilde{S}_D; P_n), the model \tilde{S}_D depends on the design points.
When the training set is (t_i, Y_i)_{i ∉ B_k}, the model \tilde{S}_D is the union of the S_m such that ∀λ ∈ Λ_m, I_λ contains at least two elements of {t_i s.t. i ∉ B_k}. Such an m exists as soon as D ≤ (n − max_k {Card(B_k)})/2 and two consecutive design points t_i, t_{i+1} always belong to different blocks B_k, which is always assumed in this paper. Note that the dynamic programming algorithms [13] quoted in Section 3.2.3 can straightforwardly take such constraints into account when minimizing the empirical risk over \tilde{S}_D.

The dependence of \tilde{S}_D on the design explains why crit_{VF_V}(D) decreases for D close to n(V − 1)/(2V), as observed in Figure 4. Indeed, when D is close to n_t/2 (where n_t is the size of the design), only a few {S_m}_{m ∈ M_{n_t}(D)} remain in \tilde{S}_D; for instance, when D = n_t/2, \tilde{S}_D is equal to one of the {S_m}_{m ∈ M_{n_t}(D)}. Therefore, the capacity of \tilde{S}_D decreases in the neighbourhood of D = n_t/2.

Similar procedures can be defined with Loo and Lpo_p instead of VFCV. The interest of VFCV is its reasonably small computational cost (taking V ≤ 10 for instance), since no closed-form formula exists for CV estimators of the risk of ERM(\tilde{S}_D; P_n).

4.3 Simulation results

A simulation experiment was performed in the setting presented in Section 3.3.1, comparing BM and VF_V with V = 5 blocks. A representative picture of the results is given by Figure 4 and Table 2 [see 21, Chapter 7, and Section 3 of the supplementary material for additional results].

As illustrated by Figure 4, crit_{VF_V}(D) can be used for measuring the capacity of \tilde{S}_D.
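The constraint of Remark 3 (two consecutive design points never share a block) is easy to satisfy in practice. One natural choice, sketched here as an assumption of ours rather than the paper's prescription, is the "regular" partition B_k = {i : i mod V = k}:

```python
def vfold_blocks(n, V):
    """Partition indices 0..n-1 into V blocks so that two consecutive design
    points always fall into different blocks, as required in Remark 3.

    We take B_k = {i : i mod V == k}; any partition with the same
    interleaving property would do.
    """
    return [[i for i in range(n) if i % V == k] for k in range(V)]
```

With this choice, removing one block leaves at most V consecutive points missing nowhere: every training set still contains at least one of any two adjacent design points, so the regressogram over \tilde{S}_D remains well defined for all allowed D.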
Indeed, VFCV correctly estimates the risk of empirical risk minimizers over \tilde{S}_D for every D and for both homoscedastic and heteroscedastic data; crit_{VF_V}(D) only underestimates ||s − \hat{s}_{\hat{m}(D)}||_n^2 for dimensions D close to n(V − 1)/(2V), for reasons explained at the end of Remark 3. On the contrary, crit_BM(D) is a poor estimate of ||s − \hat{s}_{\hat{m}(D)}||_n^2 when data are heteroscedastic. Subsequently, VFCV yields a much smaller performance index

    C_or(⟦ERM, P⟧) := E[ ||s − \hat{s}_{\hat{m}_{ERM}(\hat{D}_P)}||_n^2 ] / E[ inf_{m ∈ M_n} ||s − \hat{s}_m(P_n)||_n^2 ]

than BM when data are heteroscedastic (Table 2); see also the supplementary material (Section 1) for details about the performance of BM and possible ways to improve it.

    s     σ         Oracle        VF_5          BM
    s_2   σ_c       2.88 ± 0.01   4.51 ± 0.03    5.27 ± 0.03
    s_2   σ_pc,2    2.88 ± 0.02   6.58 ± 0.06   19.82 ± 0.07
    s_2   σ_s       3.01 ± 0.01   5.21 ± 0.04    9.69 ± 0.40
    s_3   σ_c       3.18 ± 0.01   4.41 ± 0.02    4.39 ± 0.01
    s_3   σ_pc,2    4.06 ± 0.02   5.99 ± 0.02    7.86 ± 0.03
    s_3   σ_s       4.02 ± 0.01   5.97 ± 0.03    7.59 ± 0.03

Table 2: Performance C_or(⟦ERM, P⟧) for P = Id (that is, choosing the dimension D^⋆ := arg min_{D ∈ D_n} ||s − \hat{s}_{\hat{m}_{ERM}(D)}||_n^2), P = VF_V with V = 5, or P = BM. Several regression functions s and noise-level functions σ have been considered, each time with N = 10 000 independent samples. Next to each value is indicated the corresponding empirical standard deviation divided by √N, measuring the uncertainty of the estimated performance.

When data are homoscedastic, VFCV and BM have similar performance (maybe with a slight advantage for BM), which is not surprising since BM uses the knowledge that data are homoscedastic. Moreover, BM has been proved to be optimal in the homoscedastic setting [15, 16]. Overall, VFCV appears to be a reliable alternative to BM when no prior knowledge guarantees that data are homoscedastic.
5 New change-point detection procedures via cross-validation

Sections 3 and 4 showed that when data are heteroscedastic, CV can be used successfully instead of penalized criteria for detecting breakpoints given their number, as well as for estimating the number of breakpoints. Nevertheless, in Section 4 the segmentations compared by CV were obtained by empirical risk minimization, so that they can be suboptimal according to the results of Section 3.

The next step towards reliable change-point detection procedures for heteroscedastic data is to combine the two ideas, that is, to use CV twice. The goal of the present section is to define such procedures properly (with various kinds of CV) and to assess their performance.

5.1 Definition of a family of change-point detection procedures

The general strategy used in this article for change-point detection relies on two steps: first, detect where (D − 1) breakpoints should be located for every D ∈ D_n; second, estimate the number (D − 1) of breakpoints. This strategy can be summarized with the following procedure.

Procedure 6 (General two-step change-point detection procedure).
1. ∀D ∈ D_n, A_D(P_n) := \hat{s}_{\hat{m}(D)} = arg min_{m ∈ M_n(D)} { crit_1(S_m, P_n) }, where for every model S, crit_1(S, P_n) ∈ R estimates ||s − ERM(S; P_n)||_n^2 and \hat{s}_m = ERM(S_m; P_n) is defined as in Section 3.1.
2. \hat{D} = arg min_{D ∈ D_n} { crit_2(A_D, P_n) }, where for every algorithm A_D, crit_2(A_D, P_n) ∈ R estimates ||s − A_D(P_n)||_n^2.
3. Output: the segmentation \hat{m}(\hat{D}) and the corresponding estimator \hat{s}_{\hat{m}(\hat{D})} of s.

Let us now detail the candidate criteria crit_1 and crit_2 for use in Procedure 6. For the first step:

• The empirical risk ('ERM') is crit_{1,ERM}(S, P_n) := P_n γ(ERM(S; P_n)).
• The leave-p-out estimator of the risk ('Lpo_p') is, for every p ∈ {1, ..., n − 1}, crit_{1,Lpo}(S, P_n, p) := \hat{R}_{Lpo_p}(ERM(S; ·), P_n).
• For comparison, the ideal criterion ('Id') is defined by crit_{1,Id}(S, P_n) := ||s − ERM(S; P_n)||_n^2.

As in Section 3, Loo denotes Lpo_1. The VFCV estimator of the risk \hat{R}_{VF_V} could also be used as crit_1; it will not be considered in the following because it is computationally more expensive and more variable than Lpo (see Section 3.2).

For the second step:

• Birgé and Massart's penalization criterion ('BM') is crit_{2,BM}(A_D, P_n) := P_n γ(A_D(P_n)) + pen_BM(D), where pen_BM(D) is defined by (11) with c_1 = 5, c_2 = 2, and \hat{C} is chosen by the slope heuristics (see Section 1 of the supplementary material).
• The V-fold cross-validation estimator of the risk ('VF_V') is, for every V ∈ {1, ..., n}, crit_{2,VF_V}(A_D, P_n) := \hat{R}_{VF_V}(A_D, P_n), where \hat{R}_{VF_V} is defined by (9) and the blocks B_1, ..., B_V are chosen as in Procedure 5 (see Remark 3).
• For comparison, the ideal criterion ('Id') is defined by crit_{2,Id}(A_D, P_n) := ||s − A_D(P_n)||_n^2.

Remark 4. For crit_2, definitions using Lpo could theoretically be considered. They are not investigated here because they are computationally intractable.

In the following, the notation ⟦α, β⟧ is used as a shortcut for Procedure 6 with crit_{1,α} and crit_{2,β}, and the outputs of ⟦α, β⟧ are denoted by \hat{m}_{⟦α,β⟧} ∈ M_n and \tilde{s}_{⟦α,β⟧} ∈ S^∗. For instance, BM coincides with ⟦ERM, BM⟧; procedures ⟦α, Id⟧ are compared for several α in Section 3; procedures ⟦ERM, β⟧ are compared for β ∈ {Id, BM, VF_5} in Section 4.
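The two steps of Procedure 6 can be sketched schematically (our code, purely illustrative): step 1 is delegated to any routine returning the crit_1-minimizing segmentation of each dimension, and step 2 ranks the resulting algorithms by crit_2.

```python
def two_step_selection(dims, crit1_argmin, crit2):
    """Schematic Procedure 6.

    crit1_argmin(D) -> the segmentation minimizing crit_1 over M_n(D)
                       (e.g. ERM or Lpo_p via dynamic programming);
    crit2(m)        -> an estimate of the risk of the fitted segmentation
                       (e.g. a V-fold CV risk, or the BM criterion).
    Returns the selected dimension D_hat and segmentation m_hat(D_hat).
    """
    candidates = {D: crit1_argmin(D) for D in dims}   # step 1, one per dimension
    D_hat = min(dims, key=lambda D: crit2(candidates[D]))  # step 2
    return D_hat, candidates[D_hat]
```

Plugging in Lpo_p for crit_1 and VF_V for crit_2 gives the procedures ⟦Lpo_p, VF_V⟧ studied below; plugging in the empirical risk and the BM penalty recovers ⟦ERM, BM⟧.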
    s     σ         ⟦ERM, VF_5⟧    ⟦Loo, VF_5⟧    ⟦Lpo_20, VF_5⟧   ⟦ERM, BM⟧
    s_1   σ_c        5.40 ± 0.05    5.03 ± 0.05    5.10 ± 0.05       3.91 ± 0.03
    s_1   σ_pc,1    11.96 ± 0.03   10.25 ± 0.03   10.28 ± 0.03      12.85 ± 0.04
    s_1   σ_pc,3     4.96 ± 0.05    4.82 ± 0.04    4.79 ± 0.05      13.08 ± 0.04
    s_1   σ_s        7.33 ± 0.06    6.82 ± 0.05    6.99 ± 0.06       9.41 ± 0.04
    s_2   σ_c        4.51 ± 0.03    4.55 ± 0.03    4.50 ± 0.03       5.27 ± 0.03
    s_2   σ_pc,1    11.67 ± 0.09   10.26 ± 0.08   10.29 ± 0.08      19.36 ± 0.07
    s_2   σ_pc,3     6.66 ± 0.06    5.81 ± 0.06    5.74 ± 0.06      20.12 ± 0.06
    s_2   σ_s        5.21 ± 0.04    5.19 ± 0.03    5.17 ± 0.03       9.69 ± 0.04
    s_3   σ_c        4.41 ± 0.02    4.54 ± 0.02    4.62 ± 0.02       4.39 ± 0.01
    s_3   σ_pc,1     4.91 ± 0.02    4.40 ± 0.02    4.44 ± 0.02       6.50 ± 0.02
    s_3   σ_pc,3     6.32 ± 0.02    5.74 ± 0.02    5.81 ± 0.02       8.47 ± 0.03
    s_3   σ_s        5.97 ± 0.02    5.72 ± 0.02    5.86 ± 0.02       7.59 ± 0.03

Table 3: Performance C_or(P) for several change-point detection procedures P in several settings (s, σ). Each time, N = 10 000 independent samples have been generated. Next to each value is indicated the corresponding empirical standard deviation divided by √N.

5.2 Simulation study

A simulation experiment compares the procedures ⟦α, VF_5⟧ for several α with ⟦ERM, BM⟧, in the setting described in Section 3.3.1. A representative picture of the results is given by Table 3 [see 21, Chapter 7, for additional results]. The (statistical) performance of each competing procedure P is measured by

    C_or(P) := E[ ||s − \tilde{s}_P(P_n)||_n^2 ] / E[ inf_{m ∈ M_n} ||s − \hat{s}_m(P_n)||_n^2 ],

both expectations being evaluated by averaging over N = 10 000 independent samples.

Remark 5. Birgé and Massart's penalization procedure is the only classical change-point detection procedure considered in this experiment, for two reasons. First, change-point detection procedures looking for changes in the distribution of Y_i would clearly fail to detect changes in the mean of the signal, as soon as the noise level σ varies inside areas where the mean is constant.
Seond, among pro edures deteting  hanges in the mean of a signal in a setting omparable to the setting of the pap er (that is, frequen tist, parametri, o-line, with no information on the n um b er of  hange-p oin ts), BM app ears to b e the most reliable pro edure aording to reen t pap ers [28 , 30 ℄. The question of the alibration of b C is addressed in Setion 1 of the supplemen tary material. First, BM is onsisten tly outp erformed b y the other pro edures, exept in the ho- mosedasti settings in whi h it onrms its strength. Seond, empirial risk minimization ( ERM ) sligh tly outp erforms CV ( Lo o and Lp o 20 ) when data are homosedasti. On the on trary , when data are heterosedasti, Lo o and Lp o 20 learly outp erform ERM , often b y a margin larger than 10% (for instane, when σ = σ pc, 1 ). Therefore, the results of Setion 3 are onrmed when using VF 5 (instead of Id ) for  ho osing the dimension. 21 F ramew ork A B C J ERM , BM K 6.82 ± 0.03 7.21 ± 0.04 13.49 ± 0.07 J ERM , VF 5 K 4.78 ± 0.03 5.09 ± 0.03 7.17 ± 0.05 J Lo o , VF 5 K 4.65 ± 0.03 4.88 ± 0.03 6.61 ± 0.05 J Lp o 20 , VF 5 K 4.78 ± 0.03 4.91 ± 0.03 6.49 ± 0.05 J Lp o 50 , VF 5 K 4.97 ± 0.03 5.18 ± 0.04 6.69 ± 0.05 T able 4: P erformane C ( R ) or ( P ) of sev eral mo del seletion pro edures P in framew orks A, B, C with sample size n = 100 . In ea h framew ork, N = 10 , 000 indep enden t samples ha v e b een onsidered. Next to ea h v alue is indiated the orresp onding empirial standard deviation divided b y √ N . Third, the omparison b et w een J Lp o p , VF 5 K for sev eral v alues of p is less lear. Ev en though p = 1 (that is, Lo o ) mostly outp erforms p = 20 (as w ell as p = 50 , see the supplemen tary material), dierenes are small and often not signian t despite the large n um b er of samples generated. 
The onlusion of the sim ulation exp erimen t on this question is that all v alues of p b et w een 1 and n/ 2 all p erform almost equally w ell, with a small adv an tage to p = 1 whi h ma y not b e general. Let us men tion here that the  hoie of p for Lp o p is usually related to o v erp enalization [see for instane 5, 19 , 21℄, but it seems diult to  haraterize the settings for whi h o v erp enalization is needed for deteting  hange-p oin ts giv en their n um b er. 5.3 Random framew orks In order to assess the generalit y of the results of T able 3, the pro edures onsidered in Setion 5.2 ha v e b een ompared in three random settings. The follo wing pro ess has b een rep eated N = 10 , 0 00 times. First, pieewise onstan t funtions s and σ are randomly  hosen (see Setion 2 of the supplemen tary material for details). Then, giv en s and σ , a data sample ( t i , Y i ) 1 ≤ i ≤ n is generated as desrib ed in Setion 3.3.1 , and the same olletion of mo dels is used. Finally , ea h pro edure P is applied to the sample ( t i , Y i ) 1 ≤ i ≤ n , and its loss k s − e s P ( P n ) k 2 n is measured, as w ell as the loss of the orale inf m ∈M n n k s − b s m k 2 n o . T o summarize the results, the qualit y of ea h pro edure is measured b y the ratio C ( R ) or ( P ) = E s,σ,ǫ 1 ,...,ǫ n h k s − e s P ( P n ) k 2 n i E s,σ,ǫ 1 ,...,ǫ n h inf m ∈M n n k s − b s m k 2 n oi . The notation C ( R ) or ( P ) diers from C or ( P ) to emphasize that ea h exp etation inludes the randomness of s and σ , in addition to the one of ( ǫ i ) 1 ≤ i ≤ n . The results of this exp erimen twhi h are rep orted in T able 4 mostly onrm the results of the previous setion (exept that all the framew orks are heterosedasti here), that is, whatev er p , J Lp o p , VF 5 K outp erforms J ERM , VF 5 K , whi h strongly outp erforms J ERM , BM K . Similar resultsnot rep orted hereha v e b een obtained with a sample size n = 200 and N = 1 000 samples. 
Moreover, the difference between the performances of ⟦Lpo_p, VF_5⟧ and ⟦ERM, VF_5⟧ is largest in setting C and smallest in setting A. This fact confirms the interpretation given in Section 3 for the failure of ERM in localizing a given number of change-points. Indeed, the main differences between frameworks A, B and C, which are precisely defined in Section 2 of the supplementary material, can be sketched as follows:

A. the partitions on which s is built are often close to regular, and σ is chosen independently from s;
B. the partitions on which s is built are often irregular, and σ is chosen independently from s;
C. the partitions on which s is built are often irregular, and σ depends on s, so that the noise level is smaller where s jumps more often.

In other words, frameworks A, B and C have been built so that, for any D ∈ D_n, the largest variations over M_n(D) of V(m) (defined by (7)) occur in framework C, and the smallest variations occur in framework A. As a consequence, the variations of the performance of ⟦ERM, VF_5⟧ compared to ⟦Lpo_p, VF_5⟧ across frameworks most likely come from the local overfitting phenomenon presented in Section 3.

6 Application to CGH microarray data

In this section, the new change-point detection procedures proposed in the paper are applied to CGH microarray data.

6.1 Biological context

The purpose of Comparative Genomic Hybridization (CGH) microarray experiments is to detect and map chromosomal aberrations. For instance, a piece of chromosome can be amplified, that is, appear several more times than usual, or deleted. Such aberrations are often related to cancer disease.

Roughly, CGH profiles give the log-ratio of the DNA copy number along the chromosomes, compared to a reference DNA sequence [see 35-37, for details about the biological context of CGH data].
The goal of CGH data analysis is to detect abrupt changes in the mean of a signal (the log-ratio of copy numbers), and to estimate the mean in each segment. Hence, change-point detection procedures are needed.

Moreover, assuming that CGH data are homoscedastic is often unrealistic. Indeed, changes in the chemical composition of the sequence are known to induce changes in the variance of the observed CGH profile, possibly independently from variations of the true copy number. Therefore, procedures robust to heteroscedasticity, such as the ones proposed in Section 5, should yield better results, in terms of detecting changes of copy number, than procedures assuming homoscedasticity.

The data set considered in this section is based on the Bt474 cell lines, which denote epithelial cells obtained from human breast cancer tumors of a sixty-year-old woman [36]. A test genome of Bt474 cell lines is compared to a normal reference male genome. Even though several chromosomes are studied in these cell lines, this section focuses on chromosomes 1 and 9. Chromosome 1 exhibits a putative heterogeneous variance along the CGH profile, whereas chromosome 9 is likely to meet the homoscedasticity assumption. Log-ratios of copy numbers have been measured at 119 locations for chromosome 1 and at 93 locations for chromosome 9.

6.2 Procedures used in the CGH literature

Before applying Procedure 6 to the analysis of Bt474 CGH data, let us recall the definition of two change-point detection procedures, which were the most successful for analyzing the same data according to the literature [36]. The first procedure is a simplified version of BM proposed by Lavielle [28, Section 2] and first used on CGH data in [36]. Note that BM would give similar results on the data of Figure 5.
The second procedure (denoted by 'PML' for penalized maximum likelihood) aims at detecting changes in either the mean or the variance, that is, breakpoints for (s, σ). The selected model is defined as the minimizer over m ∈ M_n of

    crit_PML(m) := Σ_{λ ∈ Λ_m} n_λ log( (1/n_λ) Σ_{t_i ∈ I_λ} (Y_i − ŝ_m(t_i; P_n))² ) + Ĉ″ D_m ,

where n_λ = Card{t_i ∈ I_λ} and Ĉ″ is estimated from data by the slope heuristics algorithm [28, 30].

6.3 Results

Results obtained with BMsimple, PML, ⟦ERM, VF_5⟧ and ⟦Lpo_20, VF_5⟧ on the Bt474 data set are reported in Figure 5. For chromosome 9, BMsimple and PML yield (almost) the same segmentation, so that the homoscedasticity assumption is certainly not much violated. As expected, ⟦ERM, VF_5⟧ and ⟦Lpo_20, VF_5⟧ also yield very similar segmentations, which confirms the reliability of these procedures for homoscedastic signals [see 21, Section 7.6 for details].

The picture is quite different for chromosome 1. Indeed, as shown by Figure 5 (right), BMsimple selects a segmentation with 7 breakpoints, whereas PML selects a segmentation with only one breakpoint. The major difference between BMsimple and PML supports at least the idea that these data must be heteroscedastic. Nevertheless, none of the segmentations chosen by BMsimple and PML is entirely satisfactory: BMsimple relies on an assumption which is certainly violated; PML may use a change in the estimated variance for explaining several changes in the mean. The CV-based procedures ⟦ERM, VF_5⟧ and ⟦Lpo_20, VF_5⟧ yield two other segmentations, with a medium number of breakpoints, respectively 4 and 3. In view of the simulation experiments of the previous sections, the segmentation obtained via ⟦Lpo_20, VF_5⟧ should be the most reliable one since data are heteroscedastic.
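To make the PML criterion defined in Section 6.2 concrete, here is a minimal sketch (not the authors' code) of how crit_PML can be evaluated for a candidate segmentation of a one-dimensional signal. The function name `crit_pml` and the argument `c_hat` (standing in for the constant Ĉ″, which the paper estimates by the slope heuristics) are illustrative choices:

```python
import math

def crit_pml(y, breakpoints, c_hat):
    """Penalized maximum-likelihood criterion for one segmentation.

    `breakpoints` lists the indices starting each new segment, so the
    segmentation has D_m = len(breakpoints) + 1 pieces.  Each segment
    contributes n_seg * log(within-segment mean squared error).
    """
    bounds = [0] + list(breakpoints) + [len(y)]
    total = 0.0
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        seg = y[lo:hi]
        n_seg = len(seg)
        mean = sum(seg) / n_seg
        mse = sum((v - mean) ** 2 for v in seg) / n_seg
        total += n_seg * math.log(mse)
    # D_m = number of segments; c_hat plays the role of C''
    return total + c_hat * (len(bounds) - 1)
```

A segmentation whose breakpoints match the true mean changes yields small within-segment variances, hence a smaller criterion value than a misplaced breakpoint at the same dimension.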
Therefore, the right part of Figure 5 can be interpreted as follows: the noise-level is small in the first part of chromosome 1, then higher, but not as high as estimated by PML. In particular, the copy number changes twice inside the second part of chromosome 1 (as defined by the segmentation obtained with PML), indicating that two putative amplified regions of chromosome 1 have been detected. Note however that choosing among the segmentations obtained with ⟦ERM, VF_5⟧ and ⟦Lpo_20, VF_5⟧ is not an easy task without additional data. A definitive answer would need further biological experiments.

[Figure 5: Change-point locations along chromosome 9 (left: panels (a) BMsimple, (b) PML, (c) ⟦ERM, VF_5⟧, (d) ⟦Lpo_20, VF_5⟧) and chromosome 1 (right: panels (e)-(h), same procedures). The mean on each homogeneous region is indicated by plain horizontal lines.]

7 Conclusion

7.1 Results summary

Cross-validation (CV) methods have been used to build reliable procedures (Procedure 6) for detecting changes in the mean of a signal whose variance may not be constant. First, when the number of breakpoints is given, empirical risk minimization has been proved to fail for some heteroscedastic problems, from both theoretical and experimental points of view. On the contrary, the leave-p-out (Lpo_p) remains robust to heteroscedasticity while being computationally efficient thanks to closed-form formulas given in Section 3.2.3 (Theorem 1).
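The closed-form formulas of Theorem 1 cover general Lpo_p; the special case p = 1 (leave-one-out) for a piecewise-constant least-squares estimator rests on a classical identity that already shows why no refitting loop is needed. The sketch below (illustrative only, not the paper's general formula) computes the leave-one-out risk of a regressogram both by explicit refitting and in closed form; segments must contain at least two points:

```python
def loo_risk_naive(y, bounds):
    """Leave-one-out risk of a regressogram, by explicit refitting.
    `bounds` lists segment endpoints, e.g. [0, 3, 6] for two segments."""
    total = 0.0
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        for i in range(lo, hi):
            rest = y[lo:i] + y[i + 1:hi]
            mean = sum(rest) / len(rest)
            total += (y[i] - mean) ** 2
    return total / len(y)

def loo_risk_closed_form(y, bounds):
    """Same quantity without refitting: on a segment with n_seg points
    and empirical mean m, the leave-one-out residual of a point equals
    (n_seg / (n_seg - 1)) * (y_i - m)."""
    total = 0.0
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        seg = y[lo:hi]
        n_seg = len(seg)
        m = sum(seg) / n_seg
        factor = (n_seg / (n_seg - 1)) ** 2
        total += factor * sum((v - m) ** 2 for v in seg)
    return total / len(y)
```

Both functions agree up to floating-point rounding, while the closed form costs a single pass over the data.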
Seond, for  ho osing the n um b er of breakp oin ts, the ommonly used p enalization pro- edure prop osed b y Birgé and Massart in the homosedasti framew ork should not b e applied to heterosedasti data. V -fold ross-v alidation (VF CV) turns out to b e a reliable alternativ eb oth with homosedasti and heterosedasti data, leading to m u h b etter segmen tations in terms of quadrati risk when data are heterosedasti. F urthermore, un- lik e usual deterministi p enalized riteria, VF CV eien tly  ho oses among segmen tations obtained b y either Lp o or empirial risk minimization, without an y sp ei  hange in the pro edure. T o onlude, the om bination of Lp o (for  ho osing a segmen tation for ea h p ossi- ble n um b er of breakp oin ts) and VF CV yields the most reliable pro edure for deteting  hanges in the mean of a signal whi h is not a priori kno wn to b e homosedasti. The resulting pro edure is omputationally tratable for small v alues of V , sine its omputa- tional omplexit y is of order O ( V n 2 ) , whi h is similar to man y omparable  hange-p oin t detetion pro edures. The inuene of V on the statistial p erformane of the pro edure is not studied sp eially in this pap er; nev ertheless, onsidering V = 5 only w as suien t to obtain a b etter statistial p erformane than Birgé and Massart's p enalization pro edure when data are heterosedasti. When applied to real data (CGH proles in Setion 6), the prop osed pro edure turns out to b e quite useful and eetiv e, for a data set on whi h existing pro edures highly disagree b eause of heterosedastiit y . 7.2 Prosp ets The general form of Pro edure 6 ould b e used with sev eral other riteria, at b oth steps of the  hange-p oin t detetion pro edure. F or instane, resampling p enalties [5℄ ould b e used at the rst step, for lo alizing the  hange-p oin ts giv en their n um b er. 
At the second step, V-fold penalization [6] could also be used instead of VFCV, with the same computational cost and possibly an improved statistical performance. Comparing precisely these resampling-based criteria for optimizing the performance of Procedure 6 would be of great interest and deserves further work. Simultaneously, several values of V should be compared for the second step of Procedure 6, and the precise influence of p when Lpo_p is used at the first step should be further investigated. Preliminary results in this direction can already be found in [21, Chapter 7].

References

[1] F. Abramovich, Y. Benjamini, D. Donoho, and I. Johnstone. Adapting to unknown sparsity by controlling the False Discovery Rate. The Annals of Statistics, 34(2):584-653, 2006.
[2] H. Akaike. Statistical predictor identification. Ann. Inst. Statist. Math., 22:203-217, 1969.
[3] H. Akaike. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory (Tsahkadsor, 1971), pages 267-281. Akadémiai Kiadó, Budapest, 1973.
[4] D. M. Allen. The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16:125-127, 1974.
[5] S. Arlot. Model selection by resampling penalization, 2008. hal-00262478.
[6] S. Arlot. Suboptimality of penalties proportional to the dimension for model selection in heteroscedastic regression, December 2008.
[7] S. Arlot. V-fold cross-validation improved: V-fold penalization, February 2008.
[8] S. Arlot and P. Massart. Data-driven calibration of penalties for least-squares regression. J. Mach. Learn. Res., 10:245-279, 2009.
[9] Y. Baraud. Model selection for regression on a fixed design. Probab. Theory Related Fields, 117(4):467-493, 2000.
[10] Y. Baraud.
Model selection for regression on a random design. ESAIM Probab. Statist., 6:127-146, 2002.
[11] A. Barron, L. Birgé, and P. Massart. Risk bounds for model selection via penalization. Probab. Theory Related Fields, 113:301-413, 1999.
[12] M. Basseville and N. Nikiforov. The Detection of Abrupt Changes: Theory and Applications. Prentice-Hall, Information and System Sciences Series, 1993.
[13] R. E. Bellman and S. E. Dreyfus. Applied Dynamic Programming. Princeton, 1962.
[14] L. Birgé and P. Massart. From model selection to adaptive estimation. In D. Pollard, E. Torgersen, and G. Yang, editors, Festschrift for Lucien Le Cam: Research Papers in Probability and Statistics, pages 55-87. Springer-Verlag, New York, 1997.
[15] L. Birgé and P. Massart. Gaussian model selection. J. European Math. Soc., 3(3):203-268, 2001.
[16] L. Birgé and P. Massart. Minimal penalties for Gaussian model selection. Probab. Theory Related Fields, 138(1-2):33-73, 2007.
[17] B. Brodsky and B. Darkhovsky. Methods in Change-Point Problems. Kluwer Academic Publishers, Dordrecht, The Netherlands, 1993.
[18] P. Burman. A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods. Biometrika, 76(3):503-514, 1989.
[19] A. Celisse. Density estimation via cross-validation: model selection point of view. Technical report, arXiv, 2008.
[20] A. Celisse and S. Robin. Nonparametric density estimation by exact leave-p-out cross-validation. Computational Statistics and Data Analysis, 52(5):2350-2368, 2008.
[21] A. Celisse. Model selection via cross-validation in density estimation, regression and change-points detection. PhD thesis, University Paris-Sud 11, December 2008. oai:tel.archives-ouvertes.fr:tel-00346320_v1.
[22] S. Dudoit and M. van der Laan.
Asymptotics of cross-validated risk estimation in estimator selection and performance assessment. Statistical Methodology, 2(2):131-154, 2005.
[23] S. Geisser. A predictive approach to the random effect model. Biometrika, 61(1):101-107, 1974.
[24] S. Geisser. The predictive sample reuse method with applications. J. Amer. Statist. Assoc., 70:320-328, 1975.
[25] X. Gendre. Simultaneous estimation of the mean and the variance in heteroscedastic Gaussian regression, 2008.
[26] M. Kearns, Y. Mansour, A. Y. Ng, and D. Ron. An experimental and theoretical comparison of model selection methods. Machine Learning, 27:7-50, 1997.
[27] P. A. Lachenbruch and M. R. Mickey. Estimation of error rates in discriminant analysis. Technometrics, 10(1):1-11, 1968.
[28] M. Lavielle. Using penalized contrasts for the change-point problem. Signal Processing, 85:1501-1510, 2005.
[29] M. Lavielle and G. Teyssière. Detection of multiple change-points in multivariate time series. Lithuanian Mathematical Journal, 46:287-306, 2006.
[30] E. Lebarbier. Detecting multiple change-points in the mean of a Gaussian process by model selection. Signal Processing, 85:717-736, 2005.
[31] K.-C. Li. Asymptotic optimality for C_p, C_L, cross-validation and generalized cross-validation: discrete index set. The Annals of Statistics, 15(3):958-975, 1987.
[32] C. L. Mallows. Some comments on C_p. Technometrics, 15:661-675, 1973.
[33] B. Q. Miao and L. C. Zhao. On detection of change points when the number is unknown. Chinese J. Appl. Probab. Statist., 9(2):138-145, 1993.
[34] D. Picard. Testing and estimating change points in time series. J. Appl. Probab., 17:841-867, 1985.
[35] F. Picard. Process segmentation/clustering: application to the analysis of array CGH data. PhD thesis, Université Paris-Sud 11, 2005.
[36] F. Picard, S. Robin, M. Lavielle, C. Vaisse, and J.-J. Daudin.
A statistical approach for array CGH data analysis. BMC Bioinformatics, 27(6), 2005. Electronic access.
[37] F. Picard, S. Robin, É. Lebarbier, and J.-J. Daudin. A segmentation/clustering model for the analysis of array CGH data. Biometrics, 2007. To appear. doi:10.1111/j.1541-0420.2006.00729.x.
[38] J. Rissanen. A universal prior for integers and estimation by minimum description length. The Annals of Statistics, 11(2):416-431, 1983.
[39] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461-464, 1978.
[40] J. Shao. An asymptotic theory for linear model selection. Statistica Sinica, 7:221-264, 1997.
[41] R. Shibata. An optimal selection of regression variables. Biometrika, 68:45-54, 1981.
[42] C. J. Stone. An asymptotically optimal window selection rule for kernel density estimates. The Annals of Statistics, 12(4):1285-1297, 1984.
[43] M. Stone. Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist. Soc. Ser. B, 36:111-147, 1974.
[44] M. Stone. An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. J. Roy. Statist. Soc. Ser. B, 39(1):44-47, 1977.
[45] R. Tibshirani and K. Knight. The covariance inflation criterion for adaptive model selection. J. Roy. Statist. Soc. Ser. B, 61(3):529-546, 1999.
[46] Y. Yang. Regression with multiple candidate models: selection or mixing? Statistica Sinica, 13:783-809, 2003.
[47] Y. Yang. Comparing learning methods for classification. Statistica Sinica, 16:635-657, 2006.
[48] Y. Yang. Consistency of cross-validation for comparing regression procedures. The Annals of Statistics, 35(6):2450-2473, 2007.
[49] Y. Yao. Estimating the number of change-points via Schwarz criterion. Statist. Probab. Lett., 6:181-189, 1988.
Supplementary material for: Segmentation of the mean of heteroscedastic data via cross-validation
Sylvain Arlot and Alain Celisse
October 30, 2018

1 Calibration of Birgé and Massart's penalization

Birgé and Massart's penalization makes use of the penalty

    pen_BM(D) := Ĉ (D/n) (5 + 2 log(n/D)) .

In a previous version of this work [6, Chapter 7], Ĉ was defined as suggested in [7, 8], that is, Ĉ = 2 K̂_max.jump with the notation below. This yielded poor performances, which seemed related to the definition of Ĉ. Therefore, alternative definitions for Ĉ have been investigated, leading to the choice Ĉ = 2 K̂_thresh throughout the paper, where K̂_thresh is defined by (2) below. The present appendix intends to motivate this choice.

Two main approaches have been considered in the literature for defining Ĉ in the penalty pen_BM:

• Use Ĉ = σ̂², any estimate of the noise-level, for instance

    σ̂² := (1/n) Σ_{i=1}^{n/2} (Y_{2i} − Y_{2i−1})² ,    (1)

assuming n is even and t_1 < · · · < t_n.

• Use Birgé and Massart's slope heuristics, that is, compute the sequence

    D̂(K) := argmin_{D ∈ D_n} { P_n γ(ŝ_{m̂_ERM(D)}) + K (D/n) (5 + 2 log(n/D)) } ,

find the (unique) K = K̂_jump at which D̂(K) jumps from large to small values, and define Ĉ = 2 K̂_jump.

The first approach follows from theoretical and experimental results [4, 8] which show that Ĉ should be close to σ² when the noise-level is constant; (1) is a classical estimator of the variance, used for instance by Baraud [3] for model selection in a different setting. The optimality (in terms of oracle inequalities) of the second approach has been proved for regression with homoscedastic Gaussian noise and possibly exponential collections of
models [5], as well as in a heteroscedastic framework with polynomial collections of models [2]. In the context of change-point detection with homoscedastic data, Lavielle [7] and Lebarbier [8] showed that Ĉ = 2 K̂_max.jump can even perform better than Ĉ = σ² when K̂_max.jump corresponds to the highest jump of D̂(K). Alternatively, it was proposed in [2] to define Ĉ = 2 K̂_thresh where

    K̂_thresh := min{ K such that D̂(K) ≤ D_thresh := n / ln(n) } .    (2)

These three definitions of Ĉ have been compared with Ĉ = σ²_true := n⁻¹ Σ_{i=1}^{n} σ(t_i)² in the settings of the paper. A representative part of the results is reported in Table 1.

 s     σ        2 K̂_max.jump   2 K̂_thresh     σ̂²             σ²_true
 s_1   σ_c      6.85 ± 0.12    3.91 ± 0.03    1.74 ± 0.02    2.05 ± 0.02
 s_1   σ_pc,3   17.56 ± 0.15   13.08 ± 0.04   4.42 ± 0.04    10.43 ± 0.05
 s_1   σ_s      20.07 ± 0.31   9.41 ± 0.04    2.18 ± 0.03    1.66 ± 0.02
 s_2   σ_c      6.02 ± 0.03    5.27 ± 0.03    3.58 ± 0.02    3.54 ± 0.02
 s_2   σ_pc,3   17.76 ± 0.10   20.12 ± 0.07   10.58 ± 0.07   16.64 ± 0.08
 s_2   σ_s      10.17 ± 0.05   9.69 ± 0.04    5.28 ± 0.03    10.95 ± 0.02
 s_3   σ_c      4.97 ± 0.02    4.39 ± 0.01    4.62 ± 0.01    4.21 ± 0.01
 s_3   σ_pc,3   8.66 ± 0.03    8.47 ± 0.03    6.64 ± 0.02    8.00 ± 0.03
 s_3   σ_s      8.50 ± 0.04    7.59 ± 0.03    5.94 ± 0.02    15.50 ± 0.04
 A     -        7.52 ± 0.04    6.82 ± 0.03    4.86 ± 0.03    5.55 ± 0.03
 B     -        7.89 ± 0.04    7.21 ± 0.04    5.18 ± 0.03    5.77 ± 0.03
 C     -        12.81 ± 0.08   13.49 ± 0.07   8.93 ± 0.06    12.44 ± 0.07

Table 1: Performance C_or(BM) with four different definitions of Ĉ (see text), in some of the simulation settings considered in the paper. In each setting, N = 10,000 independent samples have been generated. Next to each value is indicated the corresponding empirical standard deviation divided by √N.

The main conclusions are the following.

• 2 K̂_thresh almost always beats 2 K̂_max.jump, even in homoscedastic settings. This confirms some simulation results reported in [2].

• σ²_true often beats slope heuristics-based definitions of Ĉ, but not always, as previously noticed by Lebarbier [8].
Differences of performance can be huge (in particular when σ = σ_s), but not always in favour of σ²_true (for instance, when s = s_3).

• σ̂² yields significantly better performance than σ²_true in most settings (but not all), with huge margins in some heteroscedastic settings.

The latter result actually comes from an artefact, which can be explained as follows. First,

    E[σ̂²] = (1/n) Σ_{i=1}^{n} σ(t_i)² + (1/n) Σ_{i=1}^{n/2} (s(t_{2i}) − s(t_{2i−1}))² ≥ (1/n) Σ_{i=1}^{n} σ(t_i)² = σ²_true .

The difference between these expectations is not negligible in all the settings of the paper. For instance, when n = 100, t_i = i/n and s = s_1, n⁻¹ Σ_i (s(t_{2i}) − s(t_{2i−1}))² = 0.04, whereas σ²_true varies between 0.015 (when σ = σ_pc,1) and 0.093 (when σ = σ_pc,3). Nevertheless, σ̂² would not overestimate σ²_true at all in a very close setting: shifting the jumps of s_1 by 1/100 is sufficient to make n⁻¹ Σ_i (s(t_{2i}) − s(t_{2i−1}))² equal to zero, and the performances of BM with Ĉ = σ̂² would then be very close to the performances of BM with Ĉ = σ²_true.

Second, overpenalization turns out to improve the results of BM in most of the heteroscedastic settings considered in the paper. The reason for this phenomenon is illustrated by the right panel of Figure 4. Indeed, pen_BM is a poor penalty when data are heteroscedastic, underpenalizing dimensions close to the oracle but overpenalizing the largest dimensions (remember that Ĉ = 2 K̂_thresh on Figure 4). Then, in a setting like (s_2, σ_pc,3), multiplying pen_BM by a factor C_over > 1 helps decreasing the selected dimension; the same cause has different consequences in other settings, such as (s_1, σ_s) or (s_3, σ_c).
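The artefact behind the apparent advantage of the estimator (1) can be checked exactly on a noiseless toy example: when a jump of the mean falls inside one of the pairs (Y_{2i−1}, Y_{2i}), the estimator picks up the squared jump divided by n, and shifting the jump by a single sample point makes this term vanish. The step signals below are hypothetical illustrations, not the paper's s_1:

```python
def sigma2_hat(y):
    """Difference-based variance estimator (1):
    (1/n) * sum over pairs i of (Y_{2i} - Y_{2i-1})^2."""
    n = len(y)
    return sum((y[2 * i + 1] - y[2 * i]) ** 2 for i in range(n // 2)) / n

# Noiseless step signals of length n = 100 (sigma = 0, so any nonzero
# value of sigma2_hat is pure jump artefact).  The first jump falls
# *between* two pairs, the second falls *inside* the pair (48, 49).
jump_between_pairs = [0.0] * 50 + [1.0] * 50
jump_inside_a_pair = [0.0] * 49 + [1.0] * 51
```

In the first case the estimator is exactly 0, in the second it equals 1²/100 = 0.01, matching the 1/n inflation term in the expectation formula above.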
Nevertheless, even choosing Ĉ using both P_n and s, (crit_BM(D))_{D>0} remains a poor estimate of (‖s − ŝ_{m̂_ERM(D)}‖²_n)_{D>0} in most heteroscedastic settings (even up to an additive constant).

To conclude, pen_BM with Ĉ = σ̂² is not a reliable change-point detection procedure, and the apparently good performances observed in Table 1 could be misleading. This leads to the remaining choice Ĉ = 2 K̂_thresh, which has been used throughout the paper, although this calibration method may certainly be improved. Results of Table 1 for Ĉ = σ²_true indicate how far the performances of pen_BM could be improved without overpenalization. According to Tables 4 and 5, BM with Ĉ = σ²_true only has significantly better performances than ⟦ERM, VF_5⟧ or ⟦Loo, VF_5⟧ in the three homoscedastic settings and in setting (s_1, σ_s).

Finally, overpenalization could be used to improve BM, but choosing the overpenalization factor from data is a difficult problem, especially without knowing a priori whether the signal is homoscedastic or heteroscedastic. This question deserves a specific extensive simulation experiment. To be completely fair with CV methods, such an experiment should also compare BM with overpenalization to V-fold penalization [1] with overpenalization, for choosing the number of change-points.

2 Random frameworks generation

The purpose of this appendix is to detail how the piecewise constant functions s and σ have been generated in the frameworks A, B and C of Section 5.3. In each framework, s and σ are of the form

    s(x) = Σ_{j=0}^{K_s − 1} α_j 1_{[a_j; a_{j+1})}(x) + α_{K_s} 1_{[a_{K_s}; a_{K_s+1}]}(x)   with   a_0 = 0 < a_1 < · · · < a_{K_s+1} = 1 ,

    σ(x) = Σ_{j=0}^{K_σ − 1} β_j 1_{[b_j; b_{j+1})}(x) + β_{K_σ} 1_{[b_{K_σ}; b_{K_σ+1}]}(x)   with   b_0 = 0 < b_1 < · · · < b_{K_σ+1} = 1 ,

for some positive integers K_s, K_σ and real numbers α_0, . . .
, α_{K_s} ∈ R and β_0, . . . , β_{K_σ} > 0.

Remark 1. The frameworks A, B and C depend on the sample size n, through the distribution of K_s, K_σ, and of the sizes of the intervals [a_j; a_{j+1}) and [b_j; b_{j+1}). This ensures that the signal-to-noise ratio remains rather small, so that the quadratic risk remains an adequate performance measure for change-point detection. When the signal-to-noise ratio is larger (that is, when all jumps of s are much larger than the noise-level, and the number of jumps of s is small compared to the sample size), the change-point detection problem is of a different nature. In particular, the number of change-points would be better estimated with procedures targeting identification (such as BIC, or even larger penalties) than efficiency (such as VFCV).

2.1 Framework A

In framework A, s and σ are generated as follows:

• K_s, the number of jumps of s, has uniform distribution over {3, . . . , ⌊√n⌋}.

• For 0 ≤ j ≤ K_s,

    a_{j+1} − a_j = Δ^s_min + (1 − (K_s + 1) Δ^s_min) U_j / Σ_{k=0}^{K_s} U_k

with Δ^s_min = min{5/n, 1/(K_s + 1)}, where U_0, . . . , U_{K_s} are i.i.d. with uniform distribution over [0; 1].

• α_0 = V_0 and, for 1 ≤ j ≤ K_s, α_j = α_{j−1} + V_j, where V_0, . . . , V_{K_s} are i.i.d. with uniform distribution over [−1; −0.1] ∪ [0.1; 1].

• K_σ, the number of jumps of σ, has uniform distribution over {5, . . . , ⌊√n⌋}.

• For 0 ≤ j ≤ K_σ,

    b_{j+1} − b_j = Δ^σ_min + (1 − (K_σ + 1) Δ^σ_min) U′_j / Σ_{k=0}^{K_σ} U′_k

with Δ^σ_min = min{5/n, 1/(K_σ + 1)}, where U′_0, . . . , U′_{K_σ} are i.i.d. with uniform distribution over [0; 1].

• β_0, . . . , β_{K_σ} are i.i.d. with uniform distribution over [0.05; 0.5].

Two examples of a function s and a sample (t_i, Y_i) generated in framework A are plotted in Figure 1.

2.2 Framework B

The only difference with framework A is that U_0, . . . , U_{K_s} are i.i.d.
with the same distribution as Z = |10 Z_1 + Z_2|, where Z_1 has Bernoulli distribution with parameter 1/2 and Z_2 has a standard Gaussian distribution. Two examples of a function s and a sample (t_i, Y_i) generated in framework B are plotted in Figure 2.

[Figure 1: Random framework A: two examples of a sample (t_i, Y_i)_{1 ≤ i ≤ 100} and the corresponding regression function s.]

[Figure 2: Random framework B: two examples of a sample (t_i, Y_i)_{1 ≤ i ≤ 100} and the corresponding regression function s.]

2.3 Framework C

The main difference between frameworks C and B is that [0; 1] is split into two regions: a_{K_{s,1}+1} = 1/2 and K_s = K_{s,1} + K_{s,2} + 1 for some positive integers K_{s,1}, K_{s,2}, and the bounds of the distribution of β_j are larger when b_j ≥ 1/2 and smaller when b_j < 1/2. Two examples of a function s and a sample (t_i, Y_i) generated in framework C are plotted in Figure 3.

More precisely, s and σ are generated as follows:

• K_{s,1} has uniform distribution over {2, . . . , K_max,1} with K_max,1 = ⌊√n⌋ − 1 − ⌊(⌊√n⌋ − 1)/3⌋.

• K_{s,2} has uniform distribution over {0, . . . , K_max,2} with K_max,2 = ⌊(⌊√n⌋ − 1)/3⌋.

• Let U_0, . . . , U_{K_s} be i.i.d. random variables with the same distribution as Z = |10 Z_1 + Z_2|, where Z_1 has Bernoulli distribution with parameter 1/2 and Z_2 has a standard Gaussian distribution.

• For 0 ≤ j ≤ K_{s,1},

    a_{j+1} − a_j = Δ^{s,1}_min + (1 − (K_{s,1} + 1) Δ^{s,1}_min) U_j / Σ_{k=0}^{K_{s,1}} U_k

with Δ^{s,1}_min = min{5/n, 1/(K_{s,1} + 1)}.
• For K_{s,1} + 1 ≤ j ≤ K_s,

    a_{j+1} − a_j = Δ^{s,2}_min + (1 − (K_{s,2} + 1) Δ^{s,2}_min) U_j / Σ_{k=K_{s,1}+1}^{K_s} U_k

with Δ^{s,2}_min = min{5/n, 1/(K_{s,2} + 1)}.

• α_0 = V_0 and, for 1 ≤ j ≤ K_s, α_j = α_{j−1} + V_j, where V_0, . . . , V_{K_s} are i.i.d. with uniform distribution over [−1; −0.1] ∪ [0.1; 1].

• K_σ and (b_{j+1} − b_j)_{0 ≤ j ≤ K_σ} are distributed as in frameworks A and B.

• β_0, . . . , β_{K_σ} are independent. When b_j < 1/2, β_j has uniform distribution over [0.025; 0.2]. When b_j ≥ 1/2, β_j has uniform distribution over [0.1; 0.8].

[Figure 3: Random framework C: two examples of a sample (t_i, Y_i)_{1 ≤ i ≤ 100} and the corresponding regression function s.]

3 Additional results from the simulation study

In the next pages are presented extended versions of the tables of the main paper, as well as an extended version of Table 1 (Table 7).

References

[1] S. Arlot. V-fold cross-validation improved: V-fold penalization, February 2008.
[2] S. Arlot and P. Massart. Data-driven calibration of penalties for least-squares regression. J. Mach. Learn. Res., 10:245-279, 2009.
[3] Y. Baraud. Model selection for regression on a random design. ESAIM Probab. Statist., 6:127-146, 2002.
[4] L. Birgé and P. Massart. Gaussian model selection. J. European Math. Soc., 3(3):203-268, 2001.
[5] L. Birgé and P. Massart. Minimal penalties for Gaussian model selection. Probab. Theory Related Fields, 138(1-2):33-73, 2007.
[6] A. Celisse. Model selection via cross-validation in density estimation, regression and change-points detection. PhD thesis, University Paris-Sud 11, December 2008. oai:tel.archives-ouvertes.fr:tel-00346320_v1.
[7] M. Lavielle.
Using penalized contrasts for the change-point problem. Signal Processing, 85:1501-1510, 2005.
[8] E. Lebarbier. Detecting multiple change-points in the mean of a Gaussian process by model selection. Signal Processing, 85:717-736, 2005.

 s     σ        ERM            Loo            Lpo_20         Lpo_50
 s_1   σ_c      1.59 ± 0.01    1.60 ± 0.02    1.58 ± 0.01    1.58 ± 0.01
 s_1   σ_pc,1   1.04 ± 0.01    1.06 ± 0.01    1.06 ± 0.01    1.06 ± 0.01
 s_1   σ_pc,2   1.89 ± 0.02    1.87 ± 0.02    1.87 ± 0.02    1.87 ± 0.02
 s_1   σ_pc,3   2.05 ± 0.02    2.05 ± 0.02    2.05 ± 0.02    2.07 ± 0.02
 s_1   σ_s      1.54 ± 0.02    1.52 ± 0.02    1.52 ± 0.02    1.51 ± 0.02
 s_2   σ_c      2.88 ± 0.01    2.93 ± 0.01    2.93 ± 0.01    2.94 ± 0.01
 s_2   σ_pc,1   1.31 ± 0.02    1.16 ± 0.02    1.14 ± 0.02    1.11 ± 0.01
 s_2   σ_pc,2   2.88 ± 0.02    2.24 ± 0.02    2.19 ± 0.02    2.13 ± 0.02
 s_2   σ_pc,3   3.09 ± 0.03    2.52 ± 0.03    2.48 ± 0.03    2.32 ± 0.03
 s_2   σ_s      3.01 ± 0.01    3.03 ± 0.01    3.05 ± 0.01    3.13 ± 0.01
 s_3   σ_c      3.18 ± 0.01    3.25 ± 0.01    3.29 ± 0.01    3.44 ± 0.01
 s_3   σ_pc,1   3.00 ± 0.01    2.67 ± 0.02    2.68 ± 0.02    2.77 ± 0.02
 s_3   σ_pc,2   4.06 ± 0.02    3.63 ± 0.02    3.64 ± 0.02    3.78 ± 0.02
 s_3   σ_pc,3   4.41 ± 0.02    3.97 ± 0.02    4.00 ± 0.02    4.11 ± 0.02
 s_3   σ_s      4.02 ± 0.01    3.82 ± 0.01    3.85 ± 0.01    3.98 ± 0.01

Table 2: Average performance C_or(⟦P, Id⟧) for change-point detection procedures P among ERM, Loo and Lpo_p with p = 20 and p = 50. Several regression functions s and noise-level functions σ have been considered, each time with N = 10,000 independent samples. Next to each value is indicated the corresponding empirical standard deviation divided by √N, measuring the uncertainty of the estimated performance.
 s     σ        Oracle         VF_5           BM
 s_1   σ_c      1.59 ± 0.01    5.40 ± 0.05    3.91 ± 0.03
 s_1   σ_pc,1   1.04 ± 0.01    11.96 ± 0.03   12.85 ± 0.04
 s_1   σ_pc,2   1.89 ± 0.02    6.43 ± 0.05    13.03 ± 0.04
 s_1   σ_pc,3   2.05 ± 0.02    4.96 ± 0.05    13.08 ± 0.04
 s_1   σ_s      1.54 ± 0.02    7.33 ± 0.06    9.41 ± 0.04
 s_2   σ_c      2.88 ± 0.01    4.51 ± 0.03    5.27 ± 0.03
 s_2   σ_pc,1   1.31 ± 0.02    11.67 ± 0.09   19.36 ± 0.07
 s_2   σ_pc,2   2.88 ± 0.02    6.58 ± 0.06    19.82 ± 0.07
 s_2   σ_pc,3   3.09 ± 0.03    6.66 ± 0.06    20.12 ± 0.07
 s_2   σ_s      3.01 ± 0.01    5.21 ± 0.04    9.69 ± 0.40
 s_3   σ_c      3.18 ± 0.01    4.41 ± 0.02    4.39 ± 0.01
 s_3   σ_pc,1   3.00 ± 0.01    4.91 ± 0.02    6.50 ± 0.02
 s_3   σ_pc,2   4.06 ± 0.02    5.99 ± 0.02    7.86 ± 0.03
 s_3   σ_pc,3   4.41 ± 0.02    6.32 ± 0.02    8.47 ± 0.03
 s_3   σ_s      4.02 ± 0.01    5.97 ± 0.03    7.59 ± 0.03

Table 3: Performance C_or(⟦ERM, P⟧) for P = Id (that is, choosing the dimension D★ := argmin_{D ∈ D_n} { ‖s − ŝ_{m̂_ERM(D)}‖²_n }), P = VF_V with V = 5, or P = BM. Several regression functions s and noise-level functions σ have been considered, each time with N = 10,000 independent samples. Next to each value is indicated the corresponding empirical standard deviation divided by √N, measuring the uncertainty of the estimated performance.
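For reference, the penalty pen_BM and the threshold calibration (2) used in the BM columns can be sketched as follows. This is a minimal illustration, assuming the path K ↦ D̂(K) has already been computed and is passed in as data (how D̂(K) is obtained is left abstract here):

```python
import math

def pen_bm(D, n, c_hat):
    """Birge-Massart penalty pen_BM(D) = C_hat * (D/n) * (5 + 2 log(n/D))."""
    return c_hat * (D / n) * (5.0 + 2.0 * math.log(n / D))

def k_thresh(d_of_k, n):
    """Smallest K whose selected dimension D_hat(K) falls below n/ln(n),
    following definition (2).  `d_of_k` is a list of (K, D_hat(K)) pairs
    with K increasing."""
    threshold = n / math.log(n)
    for k, d in d_of_k:
        if d <= threshold:
            return k
    raise ValueError("no K reached the threshold")
```

The penalty is increasing in D for fixed Ĉ, and the calibrated constant is then Ĉ = 2 K̂_thresh.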
 s     σ        ⟦ERM, VF_5⟧    ⟦Loo, VF_5⟧    ⟦Lpo_20, VF_5⟧   ⟦Lpo_50, VF_5⟧   ⟦ERM, BM⟧
 s_1   σ_c      5.40 ± 0.05    5.03 ± 0.05    5.10 ± 0.05      5.24 ± 0.05      3.91 ± 0.03
 s_1   σ_pc,1   11.96 ± 0.03   10.25 ± 0.03   10.28 ± 0.03     10.66 ± 0.04     12.85 ± 0.04
 s_1   σ_pc,2   6.43 ± 0.05    5.83 ± 0.05    5.99 ± 0.05      6.20 ± 0.05      13.03 ± 0.04
 s_1   σ_pc,3   4.96 ± 0.05    4.82 ± 0.04    4.79 ± 0.05      5.02 ± 0.05      13.08 ± 0.04
 s_1   σ_s      7.33 ± 0.06    6.82 ± 0.05    6.99 ± 0.06      6.91 ± 0.06      9.41 ± 0.04
 s_2   σ_c      4.51 ± 0.03    4.55 ± 0.03    4.50 ± 0.03      4.73 ± 0.03      5.27 ± 0.03
 s_2   σ_pc,1   11.67 ± 0.09   10.26 ± 0.08   10.29 ± 0.08     10.45 ± 0.09     19.36 ± 0.07
 s_2   σ_pc,2   6.58 ± 0.06    5.85 ± 0.06    5.85 ± 0.06      5.49 ± 0.06      19.82 ± 0.07
 s_2   σ_pc,3   6.66 ± 0.06    5.81 ± 0.06    5.74 ± 0.06      5.66 ± 0.06      20.12 ± 0.06
 s_2   σ_s      5.21 ± 0.04    5.19 ± 0.03    5.17 ± 0.03      5.51 ± 0.04      9.69 ± 0.04
 s_3   σ_c      4.41 ± 0.02    4.54 ± 0.02    4.62 ± 0.02      4.94 ± 0.02      4.39 ± 0.01
 s_3   σ_pc,1   4.91 ± 0.02    4.40 ± 0.02    4.44 ± 0.02      4.69 ± 0.02      6.50 ± 0.02
 s_3   σ_pc,2   5.99 ± 0.02    5.34 ± 0.02    5.42 ± 0.02      5.75 ± 0.02      7.86 ± 0.03
 s_3   σ_pc,3   6.32 ± 0.02    5.74 ± 0.02    5.81 ± 0.02      6.24 ± 0.02      8.47 ± 0.03
 s_3   σ_s      5.97 ± 0.02    5.72 ± 0.02    5.86 ± 0.02      6.07 ± 0.02      7.59 ± 0.03

Table 4: Performance C_or(P) for several change-point detection procedures P. Several regression functions s and noise-level functions σ have been considered, each time with N = 10,000 independent samples. Next to each value is indicated the corresponding empirical standard deviation.

 Framework         A              B              C
 ⟦ERM, BM⟧         6.82 ± 0.03    7.21 ± 0.04    13.49 ± 0.07
 ⟦ERM, VF_5⟧       4.78 ± 0.03    5.09 ± 0.03    7.17 ± 0.05
 ⟦Loo, VF_5⟧       4.65 ± 0.03    4.88 ± 0.03    6.61 ± 0.05
 ⟦Lpo_20, VF_5⟧    4.78 ± 0.03    4.91 ± 0.03    6.49 ± 0.05
 ⟦Lpo_50, VF_5⟧    4.97 ± 0.03    5.18 ± 0.04    6.69 ± 0.05

Table 5: Performance C_or^(R)(P) of several model selection procedures P in frameworks A, B, C with sample size n = 100. In each framework, N = 10,000 independent samples have been considered. Next to each value is indicated the corresponding empirical standard deviation divided by √N.
 Framework         A              B              C
 ⟦ERM, BM⟧         9.04 ± 0.12    11.62 ± 0.14   21.21 ± 0.31
 ⟦ERM, BM_σ̂⟧       5.34 ± 0.10    6.24 ± 0.11    11.48 ± 0.22
 ⟦ERM, VF_5⟧       5.10 ± 0.11    5.92 ± 0.11    7.31 ± 0.14
 ⟦Loo, VF_5⟧       4.90 ± 0.11    5.63 ± 0.11    6.89 ± 0.16
 ⟦Lpo_20, VF_5⟧    4.88 ± 0.10    5.55 ± 0.10    6.82 ± 0.15
 ⟦Lpo_50, VF_5⟧    5.11 ± 0.11    5.49 ± 0.10    7.14 ± 0.15

Table 6: Performance C_or^(R)(P) of several model selection procedures P in frameworks A, B, C with sample size n = 200. In each framework, N = 1,000 independent samples have been considered. Next to each value is indicated the corresponding empirical standard deviation divided by √N.

 s     σ        2 K̂_max.jump   2 K̂_thresh     σ̂²             σ²_true
 s_1   σ_c      6.85 ± 0.12    3.91 ± 0.03    1.74 ± 0.02    2.05 ± 0.02
 s_1   σ_pc,1   70.97 ± 1.18   12.85 ± 0.04   1.13 ± 0.02    10.20 ± 0.05
 s_1   σ_pc,2   23.74 ± 0.26   13.03 ± 0.04   3.55 ± 0.04    10.43 ± 0.05
 s_1   σ_pc,3   17.56 ± 0.15   13.08 ± 0.04   4.42 ± 0.04    10.43 ± 0.05
 s_1   σ_s      20.07 ± 0.31   9.41 ± 0.04    2.18 ± 0.03    1.66 ± 0.02
 s_2   σ_c      6.02 ± 0.03    5.27 ± 0.03    3.58 ± 0.02    3.54 ± 0.02
 s_2   σ_pc,1   17.83 ± 0.10   19.36 ± 0.07   8.52 ± 0.06    15.62 ± 0.08
 s_2   σ_pc,2   17.63 ± 0.10   19.82 ± 0.07   10.77 ± 0.07   16.56 ± 0.08
 s_2   σ_pc,3   17.76 ± 0.10   20.12 ± 0.07   10.58 ± 0.07   16.64 ± 0.08
 s_2   σ_s      10.17 ± 0.05   9.69 ± 0.04    5.28 ± 0.03    10.95 ± 0.02
 s_3   σ_c      4.97 ± 0.02    4.39 ± 0.01    4.62 ± 0.01    4.21 ± 0.01
 s_3   σ_pc,1   7.18 ± 0.03    6.50 ± 0.02    4.52 ± 0.02    6.70 ± 0.03
 s_3   σ_pc,2   8.14 ± 0.03    7.86 ± 0.03    6.22 ± 0.02    7.55 ± 0.03
 s_3   σ_pc,3   8.66 ± 0.03    8.47 ± 0.03    6.64 ± 0.02    8.00 ± 0.03
 s_3   σ_s      8.50 ± 0.04    7.59 ± 0.03    5.94 ± 0.02    15.50 ± 0.04
 A     -        7.52 ± 0.04    6.82 ± 0.03    4.86 ± 0.03    5.55 ± 0.03
 B     -        7.89 ± 0.04    7.21 ± 0.04    5.18 ± 0.03    5.77 ± 0.03
 C     -        12.81 ± 0.08   13.49 ± 0.07   8.93 ± 0.06    12.44 ± 0.07

Table 7: Performance C_or(BM) with four different definitions of Ĉ (see text), in some of the simulation settings considered in the paper. In each setting, N = 10,000 independent samples have been generated. Next to each value is indicated the corresponding empirical standard deviation divided by √N.
