Segmentation of the mean of heteroscedastic data via cross-validation
Authors: Sylvain Arlot (LIENS), Alain Celisse
October 30, 2018

Abstract

This paper tackles the problem of detecting abrupt changes in the mean of a heteroscedastic signal by model selection, without knowledge on the variations of the noise. A new family of change-point detection procedures is proposed, showing that cross-validation methods can be successful in the heteroscedastic framework, whereas most existing procedures are not robust to heteroscedasticity. The robustness to heteroscedasticity of the proposed procedures is supported by an extensive simulation study, together with recent theoretical results. An application to Comparative Genomic Hybridization (CGH) data is provided, showing that robustness to heteroscedasticity can indeed be required for their analysis.

1 Introduction

The problem tackled in the paper is the detection of abrupt changes in the mean of a signal without assuming its variance is constant. Model selection and cross-validation techniques are used for building change-point detection procedures that significantly improve on existing procedures when the variance of the signal is not constant. Before detailing the approach and the main contributions of the paper, let us motivate the problem and briefly recall some related works in the change-point detection literature.

1.1 Change-point detection

The change-point detection problem, also called one-dimensional segmentation, deals with a stochastic process the distribution of which abruptly changes at some unknown instants. The purpose is to recover the location of these changes and their number. This problem is motivated by a wide range of applications, such as voice recognition, financial time-series analysis [29] and Comparative Genomic Hybridization (CGH) data analysis [35]. A large literature exists about change-point detection in many frameworks [see 12, 17, for a complete bibliography].
The first papers on change-point detection were devoted to the search for the location of a unique change-point, also named breakpoint [see 34, for instance]. Looking for multiple change-points is a harder task and has been studied later. For instance, Yao [49] used the BIC criterion for detecting multiple change-points in a Gaussian signal, and Miao and Zhao [33] proposed an approach relying on rank statistics.

The setting of the paper is the following. The values $Y_1, \dots, Y_n \in \mathbb{R}$ of a noisy signal at points $t_1, \dots, t_n$ are observed, with

$$Y_i = s(t_i) + \sigma(t_i)\epsilon_i, \qquad \mathbb{E}[\epsilon_i] = 0 \quad \text{and} \quad \mathrm{Var}(\epsilon_i) = 1. \tag{1}$$

The function $s$ is called the regression function and is assumed to be piecewise-constant, or at least well approximated by piecewise-constant functions; that is, $s$ is smooth everywhere except at a few breakpoints. The noise terms $\epsilon_1, \dots, \epsilon_n$ are assumed to be independent and identically distributed. No assumption is made on $\sigma : [0,1] \to [0,\infty)$. Note that all data $(t_i, Y_i)_{1 \le i \le n}$ are observed before detecting the change-points, a setting which is called off-line.

As pointed out by Lavielle [28], multiple change-point detection procedures generally tackle one among the following three problems:

1. Detecting changes in the mean $s$, assuming the standard-deviation $\sigma$ is constant,
2. Detecting changes in the standard-deviation $\sigma$, assuming the mean $s$ is constant,
3. Detecting changes in the whole distribution of $Y$, with no distinction between changes in the mean $s$, changes in the standard-deviation $\sigma$ and changes in the distribution of $\epsilon$.

In applications such as CGH data analysis, changes in the mean $s$ have an important biological meaning, since they correspond to the limits of amplified or deleted areas of chromosomes. However, in the CGH setting, the standard-deviation $\sigma$ is not always constant, as assumed in problem 1.
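As a concrete illustration of model (1), the following minimal sketch draws one sample on a fixed design with a piecewise-constant mean and a non-constant noise level. The function name, the particular `s` and `sigma`, and the sample size are illustrative choices, not those used in the paper's experiments.

```python
import random

def generate_signal(n, s, sigma, seed=0):
    """Draw Y_i = s(t_i) + sigma(t_i) * eps_i on the fixed design
    t_i = i/n, with i.i.d. standard Gaussian noise (model (1))."""
    rng = random.Random(seed)
    t = [i / n for i in range(1, n + 1)]
    y = [s(u) + sigma(u) * rng.gauss(0.0, 1.0) for u in t]
    return t, y

# Illustrative choices: one jump in the mean, and a noise level that
# changes at the same point (heteroscedastic data).
s = lambda u: 0.0 if u < 0.5 else 1.0
sigma = lambda u: 0.05 if u < 0.5 else 0.4
t, y = generate_signal(200, s, sigma)
```

The design is deterministic and only the noise is resampled from one run to the next, which matches the fixed-design setting used throughout the paper.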
See Section 6 for more details on CGH data, for which heteroscedasticity (that is, variations of $\sigma$) corresponds to experimental artefacts or biological nuisance that should be removed. Therefore, CGH data analysis requires solving a fourth problem, which is the purpose of the present article:

4. Detecting changes in the mean $s$ with no constraint on the standard-deviation $\sigma : [0,1] \to [0,\infty)$.

Compared to problem 1, the difference is the presence of an additional nuisance parameter $\sigma$, making problem 4 harder. To the best of our knowledge, no change-point detection procedure has ever been proposed for solving problem 4 with no prior information on $\sigma$.

1.2 Model selection

Model selection is a successful approach for multiple change-point detection, as shown by Lavielle [28] and by Lebarbier [30] for instance. Indeed, a set of change-points (called a segmentation) is naturally associated with the set of piecewise-constant functions that may only jump at these change-points. Given a set of functions (called a model), estimation can be performed by minimizing the least-squares criterion (or other criteria, see Section 3). Therefore, detecting changes in the mean of a signal, that is, choosing a segmentation, amounts to selecting such a model. More precisely, given a collection of models $\{S_m\}_{m \in \mathcal{M}_n}$ and the associated collection of least-squares estimators $\{\widehat{s}_m\}_{m \in \mathcal{M}_n}$, the purpose of model selection is to provide a model index $\widehat{m}$ such that $\widehat{s}_{\widehat{m}}$ reaches the best performance among all estimators $\{\widehat{s}_m\}_{m \in \mathcal{M}_n}$.

Model selection can target two different goals. On the one hand, a model selection procedure is efficient when its quadratic risk is smaller than the smallest quadratic risk of the estimators $\{\widehat{s}_m\}_{m \in \mathcal{M}_n}$, up to a constant factor $C_n \ge 1$. Such a property is called an oracle inequality when it holds for every finite sample size. The procedure is said to be asymptotically efficient when the previous property holds with $C_n \to 1$ as $n$ tends to infinity.
Asymptotic efficiency is the goal of AIC [2, 3] and Mallows' $C_p$ [32], among many others. On the other hand, assuming that $s$ belongs to one of the models $\{S_m\}_{m \in \mathcal{M}_n}$, a procedure is model consistent when it chooses the smallest model containing $s$ asymptotically with probability one. Model consistency is the goal of BIC [39] for instance. See also the article by Yang [46] about the distinction between efficiency and model consistency.

In the present paper, as in [30], the quality of a multiple change-point detection procedure is assessed by the quadratic risk; hence, a change in the mean hidden by the noise should not be detected. This choice is motivated by applications where the signal-to-noise ratio may be small, so that exactly recovering every true change-point is hopeless. Therefore, efficient model selection procedures will be used in order to detect the change-points.

Without prior information on the locations of the change-points, the natural collection of models for change-point detection depends on the sample size $n$. Indeed, there exist $\binom{n-1}{D-1}$ different partitions of the $n$ design points into $D$ intervals, each partition corresponding to a set of $(D-1)$ change-points. Since $D$ can take any value between 1 and $n$, $2^{n-1}$ models can be considered. Therefore, model selection procedures used for multiple change-point detection have to satisfy non-asymptotic oracle inequalities: the collection of models cannot be assumed to be fixed while the sample size $n$ tends to infinity. (See Section 2.3 for a precise definition of the collection $\{S_m\}_{m \in \mathcal{M}_n}$ used for change-point detection.)

Most model selection results consider polynomial collections of models $\{S_m\}_{m \in \mathcal{M}_n}$, that is, $\mathrm{Card}(\mathcal{M}_n) \le C n^{\alpha}$ for some constants $C, \alpha \ge 0$.
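The count above is easy to sanity-check numerically: summing $\binom{n-1}{D-1}$ over $D = 1, \dots, n$ recovers the $2^{n-1}$ models (a minimal sketch using only the standard library).

```python
from math import comb

n = 20
# A segmentation into D intervals is determined by its D-1 breakpoints,
# chosen among the n-1 gaps between consecutive design points.
counts = [comb(n - 1, D - 1) for D in range(1, n + 1)]
total = sum(counts)
assert total == 2 ** (n - 1)   # exponential collection: 2^(n-1) models
print(total)  # → 524288
```

Already at $n = 20$ the collection is far too large for exhaustive search, which motivates the dynamic-programming implementations discussed later in the paper.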
For polynomial collections, procedures like AIC or Mallows' $C_p$ are proved to satisfy oracle inequalities in various frameworks [9, 15, 10, 16], assuming that data are homoscedastic, that is, $\sigma(t_i)$ does not depend on $t_i$. However, as shown in [6], Mallows' $C_p$ is suboptimal when data are heteroscedastic, that is, when the variance is non-constant. Therefore, other procedures must be used. For instance, resampling penalization is optimal with heteroscedastic data [5]. Another approach has been explored by Gendre [25], which consists in simultaneously estimating the mean and the variance, using a particular polynomial collection of models.

However, in change-point detection, the collection of models is exponential, that is, $\mathrm{Card}(\mathcal{M}_n)$ is of order $\exp(\alpha n)$ for some $\alpha > 0$. For such large collections, especially larger than polynomial, the above penalization procedures fail. Indeed, Birgé and Massart [16] proved that the minimal amount of penalization required for a procedure to satisfy an oracle inequality is of the form

$$\mathrm{pen}(m) = c_1 \sigma^2 \frac{D_m}{n} + c_2 \sigma^2 \frac{D_m}{n} \log\left(\frac{n}{D_m}\right), \tag{2}$$

where $c_1$ and $c_2$ are positive constants and $\sigma^2$ is the variance of the noise, assumed to be constant. Lebarbier [30] proposed $c_1 = 5$ and $c_2 = 2$ for optimizing the penalty (2) in the context of change-point detection. Penalties similar to (2) have been introduced independently by other authors [38, 1, 11, 45] and are shown to provide satisfactory results.

Nevertheless, all these results assume that data are homoscedastic. Actually, the model selection problem with heteroscedastic data and an exponential collection of models has never been considered in the literature, to the best of our knowledge. Furthermore, penalties of the form (2) are very close to being proportional to $D_m$, at least for small values of $D_m$.
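For intuition, the penalty (2) with Lebarbier's constants is easy to tabulate. The sketch below assumes a known constant noise variance, as in the homoscedastic theory; the function name is ours.

```python
from math import log

def pen(D, n, sigma2, c1=5.0, c2=2.0):
    """Penalty (2) with Lebarbier's constants c1 = 5, c2 = 2:
    c1*sigma^2*D/n + c2*sigma^2*(D/n)*log(n/D)."""
    return sigma2 * (D / n) * (c1 + c2 * log(n / D))

# For small D the log factor varies slowly, so the penalty is close to
# proportional to D, as noted in the text.
values = [pen(D, n=100, sigma2=1.0) for D in (1, 2, 4, 8, 16)]
```

In practice the penalized criterion selects the dimension minimizing the empirical risk plus $\mathrm{pen}(D)$; here we only evaluate the penalty's shape.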
Therefore, the results of [6] lead to the conjecture that the penalty (2) is suboptimal for model selection over an exponential collection of models when data are heteroscedastic. This paper suggests using cross-validation methods instead.

1.3 Cross-validation

Cross-validation (CV) methods allow one to estimate (almost) unbiasedly the quadratic risk of any estimator, such as $\widehat{s}_m$ (see Section 3.2 about the heuristics underlying CV). Classical examples of CV methods are the leave-one-out [Loo, 27, 43] and $V$-fold cross-validation [VFCV, 23, 24]. More references on cross-validation can be found in [7, 19] for instance. CV can be used for model selection by choosing the model $S_m$ for which the CV estimate of the risk of $\widehat{s}_m$ is minimal.

The properties of CV for model selection with a polynomial collection of models and homoscedastic data have been widely studied. In short, CV is known to adapt to a wide range of statistical settings, from density estimation [42, 20] to regression [44, 48] and classification [26, 47]. In particular, Loo is asymptotically equivalent to AIC or Mallows' $C_p$ in several frameworks where they are asymptotically optimal, and other CV methods have similar performances, provided the size of the training sample is close enough to the sample size [see for instance 31, 40, 22]. In addition, CV methods are robust to heteroscedasticity of data [5, 7], as are several other resampling methods [6]. Therefore, CV is a natural alternative to penalization procedures assuming homoscedasticity.

Nevertheless, nearly nothing is known about CV for model selection with an exponential collection of models, such as in the change-point detection setting. The literature on model selection and CV [14, 40, 16, 21] only suggests that directly minimizing the Loo estimate of the risk over $2^{n-1}$ models would lead to overfitting.
In this paper, a remark made by Birgé and Massart [16] about penalization procedures is used for solving this issue in the context of change-point detection. Model selection is performed in two steps: first, choose a segmentation given the number of change-points; second, choose the number of change-points. CV methods can be used at each step, leading to Procedure 6 (Section 5). The paper shows that such an approach is indeed successful for detecting changes in the mean of a heteroscedastic signal.

1.4 Contributions of the paper

The main purpose of the present work is to design a CV-based model selection procedure (Procedure 6) that can be used for detecting multiple changes in the mean of a heteroscedastic signal. Such a procedure experimentally adapts to heteroscedasticity when the collection of models is exponential, which has never been obtained before. In particular, Procedure 6 is a reliable alternative to Birgé and Massart's penalization procedure [15] when data can be heteroscedastic.

Another major difficulty tackled in this paper is the computational cost of resampling methods when selecting among $2^n$ models. Even when the number $(D-1)$ of change-points is given, exploring the $\binom{n-1}{D-1}$ partitions of $[0,1]$ into $D$ intervals and performing a resampling algorithm for each partition is not feasible when $n$ is large and $D > 0$. An implementation of Procedure 6 with a tractable computational complexity is proposed in the paper, using closed-form formulas for leave-$p$-out (Lpo) estimators of the risk, dynamic programming, and $V$-fold cross-validation.

The paper also points out that least-squares estimators are not reliable for change-point detection when the number of breakpoints is given, although they are widely used for this purpose in the literature.
Indeed, experimental and theoretical results detailed in Section 3.1 show that least-squares estimators suffer from local overfitting when the variance of the signal varies over the sequence of observations. On the contrary, minimizers of the Lpo estimator of the risk do not suffer from this drawback, which emphasizes the interest of using cross-validation methods in the context of change-point detection.

The paper is organized as follows. The statistical framework is described in Section 2. First, the problem of selecting the best segmentation given the number of change-points is tackled in Section 3. Theoretical results and an extensive simulation study show that the usual minimization of the least-squares criterion can be misleading when data are heteroscedastic, whereas cross-validation-based procedures provide satisfactory results in the same framework.

Then, the problem of choosing the number of breakpoints from data is addressed in Section 4. As supported by an extensive simulation study, $V$-fold cross-validation (VFCV) leads to a computationally feasible and statistically efficient model selection procedure when data are heteroscedastic, contrary to procedures implicitly assuming homoscedasticity. The resampling methods of Sections 3 and 4 are combined in Section 5, leading to a family of resampling-based procedures for detecting changes in the mean of a heteroscedastic signal. A wide simulation study shows that they perform well with both homoscedastic and heteroscedastic data, significantly improving on the performance of procedures which implicitly assume homoscedasticity. Finally, Section 6 illustrates on a real data set the promising behaviour of the proposed procedures for analyzing CGH microarray data, compared to procedures previously used in this setting.

2 Statistical framework

In this section, the statistical framework of change-point detection via model selection is introduced, as well as some notation.
2.1 Regression on a fixed design

Let $S^*$ denote the set of measurable functions $[0,1] \to \mathbb{R}$. Let $t_1 < \dots < t_n \in [0,1]$ be some deterministic design points, let $s \in S^*$ and $\sigma : [0,1] \to [0,\infty)$ be some functions, and define

$$\forall i \in \{1, \dots, n\}, \quad Y_i = s(t_i) + \sigma(t_i)\epsilon_i, \tag{3}$$

where $\epsilon_1, \dots, \epsilon_n$ are independent and identically distributed random variables with $\mathbb{E}[\epsilon_i] = 0$ and $\mathbb{E}[\epsilon_i^2] = 1$.

As explained in Section 1.1, the goal is to find from $(t_i, Y_i)_{1 \le i \le n}$ a piecewise-constant function $f \in S^*$ close to $s$ in terms of the quadratic loss

$$\|s - f\|_n^2 := \frac{1}{n} \sum_{i=1}^n (f(t_i) - s(t_i))^2.$$

2.2 Least-squares estimator

A classical estimator of $s$ is the least-squares estimator, defined as follows. For every $f \in S^*$, the least-squares criterion at $f$ is defined by

$$P_n \gamma(f) := \frac{1}{n} \sum_{i=1}^n (Y_i - f(t_i))^2.$$

The notation $P_n \gamma(f)$ means that the function $(t, Y) \mapsto \gamma(f; (t, Y)) := (Y - f(t))^2$ is integrated with respect to the empirical distribution $P_n := n^{-1} \sum_{i=1}^n \delta_{(t_i, Y_i)}$. $P_n \gamma(f)$ is also called the empirical risk of $f$. Then, given a set $S \subset S^*$ of functions $[0,1] \to \mathbb{R}$ (called a model), the least-squares estimator on model $S$ is

$$\mathrm{ERM}(S; P_n) := \arg\min_{f \in S} \{P_n \gamma(f)\}.$$

The notation $\mathrm{ERM}(S; P_n)$ stresses that the least-squares estimator is the output of the empirical risk minimization algorithm over $S$, which takes a model $S$ and a data sample as inputs. When a collection of models $\{S_m\}_{m \in \mathcal{M}_n}$ is given, $\widehat{s}_m(P_n)$ or $\widehat{s}_m$ are shortcuts for $\mathrm{ERM}(S_m; P_n)$.

2.3 Collection of models

Since the goal is to detect jumps of $s$, every model considered in this article is the set of piecewise-constant functions with respect to some partition of $[0,1]$. For every $K \in \{1, \dots, n-1\}$ and every sequence of integers $\alpha_0 = 1 < \alpha_1 < \alpha_2 < \dots < \alpha_K \le n$ (the breakpoints), $(I_\lambda)_{\lambda \in \Lambda_{(\alpha_1, \dots, \alpha_K)}}$ denotes the partition

$$[t_{\alpha_0}; t_{\alpha_1}), \dots, [t_{\alpha_{K-1}}; t_{\alpha_K}), [t_{\alpha_K}; 1]$$

of $[0,1]$ into $(K+1)$ intervals. Then, the model $S_{(\alpha_1, \dots, \alpha_K)}$ is defined as the set of piecewise-constant functions that can only jump at $t = t_{\alpha_j}$ for some $j \in \{1, \dots, K\}$.

For every $K \in \{1, \dots, n-1\}$, let $\widetilde{\mathcal{M}}_n(K+1)$ denote the set of such sequences $(\alpha_1, \dots, \alpha_K)$ of length $K$, so that $\{S_m\}_{m \in \widetilde{\mathcal{M}}_n(K+1)}$ is the collection of models of piecewise-constant functions with $K$ breakpoints. When $K = 0$, $\widetilde{\mathcal{M}}_n(1) := \{\emptyset\}$ and the model $S_\emptyset$ is the linear space of constant functions on $[0,1]$. Remark that for every $K$ and $m \in \widetilde{\mathcal{M}}_n(K+1)$, $S_m$ is a vector space of dimension $D_m = K + 1$. In the rest of the paper, the relationship between the number of breakpoints $K$ and the dimension $D = K + 1$ of the model $S_{(\alpha_1, \dots, \alpha_K)}$ is used repeatedly; in particular, estimating the number of breakpoints (Section 4) is equivalent to choosing the dimension of a model. In addition, since a model $S_m$ is uniquely defined by $m$, the index $m$ is also called a model.

The classical collection of models for change-point detection can now be defined as $\{S_m\}_{m \in \widetilde{\mathcal{M}}_n}$, where $\widetilde{\mathcal{M}}_n = \bigcup_{D \in \mathcal{D}_n} \widetilde{\mathcal{M}}_n(D)$ and $\mathcal{D}_n = \{1, \dots, n\}$. This collection has cardinality $2^{n-1}$. In this paper, a slightly smaller collection of models is considered, namely all $m \in \widetilde{\mathcal{M}}_n$ such that each element of the partition $(I_\lambda)_{\lambda \in \Lambda_m}$ contains at least two design points $(t_j)_{1 \le j \le n}$. Indeed, when nothing is known about the noise-level $\sigma(\cdot)$, one cannot hope to distinguish two consecutive change-points from a local variation of $\sigma$. For every $D \in \{1, \dots, n\}$, let $\mathcal{M}_n(D)$ denote the set of $m \in \widetilde{\mathcal{M}}_n(D)$ satisfying this property. Then, the collection of models used in this paper is defined as $\{S_m\}_{m \in \mathcal{M}_n}$ where $\mathcal{M}_n = \bigcup_{D \in \mathcal{D}_n} \mathcal{M}_n(D)$ and $\mathcal{D}_n \subset \{1, \dots, n/2\}$. Finally, in all the experiments of the paper, $\mathcal{D}_n = \{1, \dots, 4n/10\}$ for reasons detailed in Section 4.2, in particular Remark 3.

2.4 Model selection

Among $\{S_m\}_{m \in \mathcal{M}_n}$, the best model is defined as the minimizer of the quadratic loss $\|s - \widehat{s}_m\|_n^2$ over $m \in \mathcal{M}_n$ and is called the oracle $m^\star$. Since the oracle depends on $s$, one can only expect to select $\widehat{m}(P_n)$ from the data such that the quadratic loss of $\widehat{s}_{\widehat{m}}$ is close to that of the oracle with high probability, that is,

$$\|s - \widehat{s}_{\widehat{m}}\|_n^2 \le C \inf_{m \in \mathcal{M}_n} \left\{ \|s - \widehat{s}_m\|_n^2 \right\} + R_n \tag{4}$$

where $C$ is close to 1 and $R_n$ is a small remainder term (typically of order $n^{-1}$). Inequality (4) is called an oracle inequality.

3 Localization of the breakpoints

A usual strategy for multiple change-point detection [28, 30] is to dissociate the search for the best segmentation given the number of breakpoints from the choice of the number of breakpoints. In this section, the number $K = D - 1$ of breakpoints is fixed and the goal is to localize them. In other words, the goal is to select a model among $\{S_m\}_{m \in \mathcal{M}_n(D)}$.

3.1 Empirical risk minimization's failure with heteroscedastic data

As explained by many authors such as Lavielle [28], minimizing the least-squares criterion over $\{\widehat{s}_m\}_{m \in \mathcal{M}_n(D)}$ is a classical way of estimating the best segmentation with $(D-1)$ change-points. This leads to the following procedure:

Procedure 1. $\widehat{m}_{\mathrm{ERM}}(D) := \arg\min_{m \in \mathcal{M}_n(D)} \{P_n \gamma(\widehat{s}_m)\} = \mathrm{ERM}(\widetilde{S}_D; P_n)$, where $\widetilde{S}_D := \bigcup_{m \in \mathcal{M}_n(D)} S_m$ is the set of piecewise-constant functions with exactly $(D-1)$ change-points, chosen among $t_2, \dots, t_n$ (see Section 2.3).

Figure 1: Comparison of $\widehat{s}_{m^\star(D)}$ (dotted black line), $\widehat{s}_{\widehat{m}_{\mathrm{ERM}}(D)}$ (dashed blue line) and $\widehat{s}_{\widehat{m}_{\mathrm{Loo}}(D)}$ (plain magenta line, see Section 3.2.2), $D$ being the optimal dimension (see Figure 3).
Data are generated as described in Section 3.3.1 with $n = 100$ data points. Left: homoscedastic data $(s_2, \sigma_c)$, $D = 4$. Right: heteroscedastic data $(s_3, \sigma_{pc,3})$, $D = 6$.

Remark 1. Dynamic programming [13] leads to an efficient implementation of Procedure 1 with computational complexity $O(n^2)$.

Among models corresponding to segmentations with $(D-1)$ change-points, the oracle model can be defined as

$$m^\star(D) := \arg\min_{m \in \mathcal{M}_n(D)} \left\{ \|s - \widehat{s}_m\|_n^2 \right\}.$$

Figure 1 illustrates how far $\widehat{m}_{\mathrm{ERM}}(D)$ typically is from $m^\star(D)$ according to variations of the standard-deviation $\sigma$. On the one hand, when data are homoscedastic, empirical risk minimization yields a segmentation close to the oracle (Figure 1, left). On the other hand, when data are heteroscedastic, empirical risk minimization introduces artificial breakpoints in areas where the noise-level is above average, and misses breakpoints in areas where the noise-level is below average (Figure 1, right). In other words, when data are heteroscedastic, empirical risk minimization over $\widetilde{S}_D$ locally overfits in high-noise areas, and locally underfits in low-noise areas.

The failure of empirical risk minimization with heteroscedastic data observed in Figure 1 is general [21, Chapter 7] and can be explained by Lemma 1 below. Indeed, the criteria $P_n \gamma(\widehat{s}_m)$ and $\|s - \widehat{s}_m\|_n^2$, respectively minimized by $\widehat{m}_{\mathrm{ERM}}(D)$ and $m^\star(D)$ over $\mathcal{M}_n(D)$, are close to their respective expectations, as proved by the concentration inequalities of [7, Proposition 9] for instance. Lemma 1 enables these expectations to be compared.

Lemma 1. Let $m \in \mathcal{M}_n$ and define $s_m := \arg\min_{f \in S_m} \|s - f\|_n^2$. Then,

$$\mathbb{E}[P_n \gamma(\widehat{s}_m)] = \|s - s_m\|_n^2 - V(m) + \frac{1}{n} \sum_{i=1}^n \sigma(t_i)^2 \tag{5}$$

$$\mathbb{E}\left[\|s - \widehat{s}_m\|_n^2\right] = \|s - s_m\|_n^2 + V(m) \tag{6}$$

where

$$V(m) := \sum_{\lambda \in \Lambda_m} \frac{(\sigma_\lambda^r)^2}{n} \quad \text{and} \quad \forall \lambda \in \Lambda_m, \quad (\sigma_\lambda^r)^2 := \frac{\sum_{i=1}^n \sigma(t_i)^2 \mathbf{1}_{t_i \in I_\lambda}}{\mathrm{Card}(\{k \mid t_k \in I_\lambda\})}. \tag{7}$$

Lemma 1 is proved in [21]. As is well known in the model selection literature, the expectation of the quadratic loss (6) is the sum of two terms: $\|s - s_m\|_n^2$ is the bias of model $S_m$, and $V(m)$ is a variance term, measuring the difficulty of estimating the $D_m$ parameters of model $S_m$. Up to the term $n^{-1} \sum_{i=1}^n \sigma(t_i)^2$, which does not depend on $m$, the empirical risk underestimates the quadratic risk (that is, the expectation of the quadratic loss), as shown by (5), because of the minus sign in front of $V(m)$.

Nevertheless, when data are homoscedastic, that is, when $\sigma(t_i) = \sigma$ for all $i$, $V(m) = D_m \sigma^2 n^{-1}$ is the same for all $m \in \mathcal{M}_n(D)$. Therefore, (5) and (6) show that for every $D \ge 1$, when data are homoscedastic,

$$\arg\min_{m \in \mathcal{M}_n(D)} \{\mathbb{E}[P_n \gamma(\widehat{s}_m)]\} = \arg\min_{m \in \mathcal{M}_n(D)} \left\{\mathbb{E}\left[\|s - \widehat{s}_m\|_n^2\right]\right\}.$$

Hence, $\widehat{m}_{\mathrm{ERM}}(D)$ and $m^\star(D)$ tend to be close to one another, as on the left of Figure 1. On the contrary, when data are heteroscedastic, the variance term $V(m)$ can be quite different among models $m \in \mathcal{M}_n(D)$, even though they have the same dimension $D$. Indeed, $V(m)$ increases when a breakpoint is moved from an area where $\sigma$ is small to an area where $\sigma$ is large. Therefore, the empirical risk minimization algorithm rather puts breakpoints in noisy areas in order to minimize $-V(m)$ in (5). This is illustrated in the right panel of Figure 1, where the oracle segmentation $m^\star(D)$ has more breakpoints in areas where $\sigma$ is small.

3.2 Cross-validation

Cross-validation (CV) methods are natural candidates for fixing the failure of empirical risk minimization when data are heteroscedastic, since CV methods are naturally adaptive to heteroscedasticity (see Section 1.3). The purpose of this section is to properly define how CV can be used for selecting $\widehat{m} \in \mathcal{M}_n(D)$ (Procedure 2), and to recall theoretical results showing why this procedure adapts to heteroscedasticity (Proposition 1).
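Before the formal definitions, a brute-force version of the leave-one-out risk estimate for a regressogram on a fixed partition may help fix ideas: holding out one point, the fitted value on its segment is the mean of the remaining points of that segment. This is a minimal sketch; the function name and the toy data are ours.

```python
def loo_risk_regressogram(y, segments):
    """Brute-force leave-one-out risk estimate for the least-squares
    regressogram on a fixed partition (segments = lists of indices).
    Holding out point j, the fitted value on its segment is the mean
    of the remaining points; a segment with a single point gives +inf,
    matching the paper's convention for n_lambda = 1."""
    n = sum(len(seg) for seg in segments)
    total = 0.0
    for seg in segments:
        if len(seg) < 2:
            return float("inf")
        seg_sum = sum(y[j] for j in seg)
        for j in seg:
            pred = (seg_sum - y[j]) / (len(seg) - 1)   # mean without y[j]
            total += (y[j] - pred) ** 2
    return total / n

# Toy example: a length-6 signal split into two segments.
y = [0.1, -0.1, 0.0, 1.2, 0.8, 1.0]
risk = loo_risk_regressogram(y, [[0, 1, 2], [3, 4, 5]])   # ≈ 0.0375
```

This naive version costs $O(n)$ per segment only because the segment sum is reused; the closed-form formulas of Section 3.2.3 achieve the same kind of saving for general leave-$p$-out.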
3.2.1 Heuristics

The cross-validation heuristics [4, 43] relies on a data-splitting idea: for each candidate algorithm, say $\mathrm{ERM}(S_m; \cdot)$ for some $m \in \mathcal{M}_n(D)$, part of the data, called the training set, is used for training the algorithm. The remaining part, called the validation set, is used for estimating the risk of the algorithm. This simple strategy is called validation or hold-out. One can also split the data several times and average the estimated values of the risk over the splits. Such a strategy is called cross-validation (CV). CV with general repeated splits of the data was introduced by Geisser [23, 24].

In the fixed-design setting, $(t_i, Y_i)_{1 \le i \le n}$ are not identically distributed, so that CV estimates a quantity slightly different from the usual prediction error. Let $T$ be uniformly distributed over $\{t_1, \dots, t_n\}$ and $Y = s(T) + \sigma(T)\epsilon$, where $\epsilon$ is independent from $\epsilon_1, \dots, \epsilon_n$ with the same distribution. Then, the CV estimator of the risk of $\widehat{s}(P_n)$ estimates

$$\mathbb{E}_{(T,Y)}\left[(\widehat{s}(T) - Y)^2\right] = \frac{1}{n} \sum_{i=1}^n \mathbb{E}_\epsilon\left[(s(t_i) + \sigma(t_i)\epsilon - \widehat{s}(t_i))^2\right] = \|s - \widehat{s}\|_n^2 + \frac{1}{n} \sum_{i=1}^n \sigma(t_i)^2.$$

Hence, minimizing the CV estimator of $\mathbb{E}_{(T,Y)}[(\widehat{s}_m(T) - Y)^2]$ over $m$ amounts to minimizing $\|s - \widehat{s}_m\|_n^2$, up to estimation errors. Even though the use of CV in a fixed-design setting is not usual, theoretical results detailed in Section 3.2.4 below show that CV actually leads to a good estimator of the quadratic risk $\|s - \widehat{s}_m\|_n^2$. This fact is confirmed by all the experimental results of the paper.

3.2.2 Definition

Let us now formally define how CV is used for selecting some $m \in \mathcal{M}_n(D)$ from data. A (statistical) algorithm $\mathcal{A}$ is defined as any measurable function $P_n \mapsto \mathcal{A}(P_n) \in S^*$. For any $t_i \in [0,1]$, $\mathcal{A}(t_i; P_n)$ denotes the value of $\mathcal{A}(P_n)$ at point $t_i$. For any $I^{(t)} \subset \{1, \dots, n\}$, define $I^{(v)} := \{1, \dots, n\} \setminus I^{(t)}$,

$$P_n^{(t)} := \frac{1}{\mathrm{Card}(I^{(t)})} \sum_{i \in I^{(t)}} \delta_{(t_i, Y_i)} \quad \text{and} \quad P_n^{(v)} := \frac{1}{\mathrm{Card}(I^{(v)})} \sum_{i \in I^{(v)}} \delta_{(t_i, Y_i)}.$$

Then, the hold-out estimator of the risk of any algorithm $\mathcal{A}$ is defined as

$$\widehat{R}_{\mathrm{ho}}(\mathcal{A}, P_n, I^{(t)}) := P_n^{(v)} \gamma\left(\mathcal{A}(P_n^{(t)})\right) = \frac{1}{\mathrm{Card}(I^{(v)})} \sum_{i \in I^{(v)}} \left(\mathcal{A}(t_i; P_n^{(t)}) - Y_i\right)^2.$$

The cross-validation estimators of the risk of $\mathcal{A}$ are then defined as the average of $\widehat{R}_{\mathrm{ho}}(\mathcal{A}, P_n, I_j^{(t)})$ over $j = 1, \dots, B$, where $I_1^{(t)}, \dots, I_B^{(t)}$ are chosen in a predetermined way [24]. Leave-one-out, leave-$p$-out and $V$-fold cross-validation are among the most classical examples of CV procedures. They differ from one another by the choice of $I_1^{(t)}, \dots, I_B^{(t)}$.

• Leave-one-out (Loo), often called ordinary CV [4, 43], consists in training with the whole sample except one point, used for testing, and repeating this for each data point: $I_j^{(t)} = \{1, \dots, n\} \setminus \{j\}$ for $j = 1, \dots, n$. The Loo estimator of the risk of $\mathcal{A}$ is defined by

$$\widehat{R}_{\mathrm{Loo}}(\mathcal{A}, P_n) := \frac{1}{n} \sum_{j=1}^n \left(Y_j - \mathcal{A}(t_j; P_n^{(-j)})\right)^2,$$

where $P_n^{(-j)} = (n-1)^{-1} \sum_{i \ne j} \delta_{(t_i, Y_i)}$.

• Leave-$p$-out (Lpo$_p$, with any $p \in \{1, \dots, n-1\}$) generalizes Loo. Let $\mathcal{E}_p$ denote the collection of all possible subsets of $\{1, \dots, n\}$ with cardinality $n - p$. Then, Lpo consists in considering every $I^{(t)} \in \mathcal{E}_p$ as training set indices:

$$\widehat{R}_{\mathrm{Lpo}_p}(\mathcal{A}, P_n) := \binom{n}{p}^{-1} \sum_{I^{(t)} \in \mathcal{E}_p} \frac{1}{p} \sum_{j \in I^{(v)}} \left(Y_j - \mathcal{A}(t_j; P_n^{(t)})\right)^2. \tag{8}$$

• $V$-fold cross-validation (VFCV) is a computationally efficient alternative to Lpo and Loo. The idea is to first partition the data into $V$ blocks, to use all the data but one block as a training sample, and to repeat the process $V$ times. In other words, VFCV is a blockwise Loo, so that its computational complexity is $V$ times that of $\mathcal{A}$. Formally, let $B_1, \dots, B_V$ be a partition of $\{1, \dots, n\}$ and $P_n^{(B_k)} := (n - \mathrm{Card}(B_k))^{-1} \sum_{i \notin B_k} \delta_{(t_i, Y_i)}$ for every $k \in \{1, \dots, V\}$. The VFCV estimator of the risk of $\mathcal{A}$ is defined by

$$\widehat{R}_{\mathrm{VF}_V}(\mathcal{A}, P_n) := \frac{1}{V} \sum_{k=1}^V \frac{1}{\mathrm{Card}(B_k)} \sum_{j \in B_k} \left(Y_j - \mathcal{A}(t_j; P_n^{(B_k)})\right)^2. \tag{9}$$

The interested reader will find theoretical and experimental results on VFCV, and the best way to use it, in [7, 21] and references therein, in particular [18].

Given the Loo estimator of the risk of each algorithm $\mathcal{A}$ among $\{\mathrm{ERM}(S_m; \cdot)\}_{m \in \mathcal{M}_n(D)}$, the segmentation with $(D-1)$ breakpoints chosen by Loo is defined as follows.

Procedure 2. $\widehat{m}_{\mathrm{Loo}}(D) := \arg\min_{m \in \mathcal{M}_n(D)} \left\{\widehat{R}_{\mathrm{Loo}}(\mathrm{ERM}(S_m; \cdot), P_n)\right\}$.

The segmentations chosen by Lpo and VFCV are defined similarly and denoted respectively by $\widehat{m}_{\mathrm{Lpo}_p}(D)$ and $\widehat{m}_{\mathrm{VF}_V}(D)$.

As illustrated by Figure 1, when data are heteroscedastic, $\widehat{m}_{\mathrm{Loo}}(D)$ is often closer to the oracle segmentation $m^\star(D)$ than $\widehat{m}_{\mathrm{ERM}}(D)$. This improvement will be explained by theoretical results in Section 3.2.4 below.

3.2.3 Computational tractability

The computational complexity of $\mathrm{ERM}(S_m; P_n)$ is $O(n)$ since for every $\lambda \in \Lambda_m$, the value of $\widehat{s}_m(P_n)$ on $I_\lambda$ is equal to the mean of $\{Y_i\}_{t_i \in I_\lambda}$. Therefore, a naive implementation of Lpo$_p$ has computational complexity $O\left(n\binom{n}{p}\right)$, which can be intractable for large $n$ in the context of model selection, even when $p = 1$. In such cases, only VFCV with a small $V$ would work straightforwardly, since its computational complexity is $O(nV)$.

Nevertheless, closed-form formulas for the Lpo estimator of the risk have been derived in the density estimation [20, 19] and regression [21] frameworks. Some of these closed-form formulas apply to regressograms $\widehat{s}_m$ with $m \in \mathcal{M}_n$. The following theorem gives a closed-form expression for $\widehat{R}_{\mathrm{Lpo}_p}(m) := \widehat{R}_{\mathrm{Lpo}_p}(\mathrm{ERM}(S_m; \cdot), P_n)$ which can be computed with $O(n)$ elementary operations.
Theorem 1 (Corollary 3.3.2 in [21]). Let $m \in \mathcal{M}_n$, $S_m$ and $\widehat{s}_m = \mathrm{ERM}(S_m; \cdot)$ be defined as in Section 2. For every $(t_1, Y_1), \dots, (t_n, Y_n) \in \mathbb{R}^2$ and $\lambda \in \Lambda_m$, define

$$S_{\lambda,1} := \sum_{j=1}^n Y_j \mathbf{1}_{\{t_j \in I_\lambda\}} \quad \text{and} \quad S_{\lambda,2} := \sum_{j=1}^n Y_j^2 \mathbf{1}_{\{t_j \in I_\lambda\}}.$$

Then, for every $p \in \{1, \dots, n-1\}$, the Lpo$_p$ estimator of the risk of $\widehat{s}_m$ defined by (8) is given by

$$\widehat{R}_{\mathrm{Lpo}_p}(m) = \sum_{\lambda \in \Lambda_m} \left[\frac{1}{p N_\lambda} \left((A_\lambda - B_\lambda) S_{\lambda,2} + B_\lambda S_{\lambda,1}^2\right) \mathbf{1}_{\{n_\lambda \ge 2\}} + \{+\infty\} \mathbf{1}_{\{n_\lambda = 1\}}\right],$$

where for every $\lambda \in \Lambda_m$,

$$n_\lambda := \mathrm{Card}(\{i \mid t_i \in I_\lambda\}), \qquad N_\lambda := 1 - \mathbf{1}_{\{p \ge n_\lambda\}} \binom{n - n_\lambda}{p - n_\lambda} \Big/ \binom{n}{p},$$

$$A_\lambda := V_\lambda(0)\left(1 - \frac{1}{n_\lambda}\right) - \frac{V_\lambda(1)}{n_\lambda} + V_\lambda(-1),$$

$$B_\lambda := \frac{V_\lambda(1)\left(2 - \mathbf{1}_{n_\lambda \ge 3}\right)}{n_\lambda(n_\lambda - 1)} + \frac{V_\lambda(0)}{n_\lambda - 1}\left(\left(1 + \frac{1}{n_\lambda}\right)\mathbf{1}_{n_\lambda \ge 3} - 2\right) - \frac{V_\lambda(-1)\,\mathbf{1}_{n_\lambda \ge 3}}{n_\lambda - 1},$$

and

$$\forall k \in \{-1, 0, 1\}, \quad V_\lambda(k) := \sum_{r = \max\{1,\, n_\lambda - p\}}^{\min\{n_\lambda,\, n-p\}} r^k \binom{n-p}{r} \binom{p}{n_\lambda - r} \Big/ \binom{n}{n_\lambda}.$$

Remark 2. $V_\lambda(k)$ can also be written as $\mathbb{E}[Z^k \mathbf{1}_{Z > 0}]$ where $Z$ has a hypergeometric distribution with parameters $(n, n-p, n_\lambda)$.

An important practical consequence of Theorem 1 is that for every $D$ and $p$, $\widehat{m}_{\mathrm{Lpo}_p}(D)$ can be computed with the same computational complexity as $\widehat{m}_{\mathrm{ERM}}(D)$, that is, $O(n^2)$. Indeed, Theorem 1 shows that $\widehat{R}_{\mathrm{Lpo}_p}(m)$ is a sum over $\lambda \in \Lambda_m$ of terms depending only on $\{Y_i\}_{t_i \in I_\lambda}$, so that dynamic programming [13] can be used for computing the minimizer $\widehat{m}_{\mathrm{Lpo}_p}(D)$ of $\widehat{R}_{\mathrm{Lpo}_p}(m)$ over $m \in \mathcal{M}_n$. Therefore, Lpo and Loo are computationally tractable for change-point detection when the number of breakpoints is given.

Dynamic programming also applies to $\widehat{m}_{\mathrm{VF}_V}$ with computational complexity $O(Vn^2)$, since each term appearing in $\widehat{R}_{\mathrm{VF}_V}(m)$ is the average of $V$ quantities that must be computed, except when $V = n$, since VFCV then becomes Loo. Since VFCV is mostly an approximation to Loo or Lpo but has a larger computational complexity, $\widehat{m}_{\mathrm{Lpo}_p}(D)$ will be preferred to $\widehat{m}_{\mathrm{VF}_V}(D)$ in the following.
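Because both the empirical risk and the per-segment Lpo terms above are additive over the segments of a partition, the minimization over all segmentations with a given number of segments can be carried out by the classical segmentation dynamic programme. Below is a minimal sketch with a generic per-segment cost; function and variable names are ours, and the within-segment sum of squares used in the example corresponds to the empirical-risk criterion of Procedure 1.

```python
def best_segmentations(cost, n, Dmax):
    """Dynamic programming for additive segmentation criteria.

    cost(i, j) is the criterion of the single segment covering points
    i..j-1 (e.g. within-segment sum of squares, or a per-segment Lpo term).
    Returns opt[D][j] = minimal total cost of splitting points 0..j-1
    into D segments, plus backpointers arg to recover the breakpoints.
    """
    INF = float("inf")
    opt = [[INF] * (n + 1) for _ in range(Dmax + 1)]
    arg = [[0] * (n + 1) for _ in range(Dmax + 1)]
    opt[0][0] = 0.0
    for D in range(1, Dmax + 1):
        for j in range(D, n + 1):
            for i in range(D - 1, j):       # last segment is i..j-1
                c = opt[D - 1][i] + cost(i, j)
                if c < opt[D][j]:
                    opt[D][j], arg[D][j] = c, i
    return opt, arg

# Example with the within-segment sum of squares as cost.
y = [0.0, 0.1, -0.1, 5.0, 5.1, 4.9]

def sse(i, j):
    seg = y[i:j]
    mu = sum(seg) / len(seg)
    return sum((v - mu) ** 2 for v in seg)

opt, arg = best_segmentations(sse, len(y), Dmax=3)
# With D = 2 the recovered breakpoint is at index 3, as expected.
```

The double loop over $(i, j)$ gives the quadratic complexity in $n$ mentioned in Remark 1, for each value of $D$ up to `Dmax`.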
3.2.4 Theoretical guarantees

In order to understand why CV indeed works for change-point detection with a given number of breakpoints, let us recall a straightforward consequence of Theorem 1, which is proved in detail in [21, Lemma 7.2.1 and Proposition 7.2.3].

Proposition 1. Using the notation of Lemma 1, for any $m \in \mathcal{M}_n$,
\[
\mathbb{E}\bigl[ \widehat{R}_{\mathrm{Lpo}_p}(m) \bigr] \approx \| s - s_m \|_n^2 + \frac{1}{n-p} \sum_{\lambda \in \Lambda_m} (\sigma^r_\lambda)^2 + \frac{1}{n} \sum_{i=1}^{n} \sigma(t_i)^2 , \tag{10}
\]
where the approximation holds as soon as $\min_{\lambda \in \Lambda_m} n_\lambda$ is large enough (in particular larger than $p$).

[Figure 2: Regression functions $s_1$, $s_2$, $s_3$; $s_1$ and $s_2$ are piecewise constant with 4 jumps; $s_3$ is piecewise constant with 9 jumps.]

The comparison of (6) and (10) shows that $\mathrm{Lpo}_p$ yields an almost unbiased estimator of $\| s - \widehat{s}_m \|_n^2$: the only difference is that the factor $1/n$ in front of the variance term $V(m)$ has been changed into $1/(n-p)$. Therefore, minimizing the $\mathrm{Lpo}_p$ estimator of the risk instead of the empirical risk automatically takes the heteroscedasticity of the data into account.

3.3 Simulation study

The goal of this section is to experimentally assess, for several values of $p$, the performance of $\mathrm{Lpo}_p$ for detecting a given number of changes in the mean of a heteroscedastic signal. This performance is also compared with that of empirical risk minimization.

3.3.1 Setting

The setting described in this section is used in all the experiments of the paper. Data are generated according to (3) with $n = 100$. For every $i$, $t_i = i/n$ and $\epsilon_i$ has a standard Gaussian distribution. The regression function $s$ is chosen among three piecewise constant functions $s_1$, $s_2$, $s_3$ plotted on Figure 2. The model collection described in Section 2.3 is used with $\mathcal{D}_n = \{1, \ldots, 4n/10\}$. The noise-level function $\sigma(\cdot)$ is chosen among the following functions:

1. Homoscedastic noise: $\sigma_c = 0.25 \cdot \mathbf{1}_{[0,1]}$;
2. Heteroscedastic piecewise constant noise: $\sigma_{pc,1} = 0.2 \cdot \mathbf{1}_{[0,1/3]} + 0.05 \cdot \mathbf{1}_{[1/3,1]}$, $\sigma_{pc,2} = 2 \sigma_{pc,1}$ or $\sigma_{pc,3} = 2.5 \sigma_{pc,1}$;
3. Heteroscedastic sinusoidal noise: $\sigma_s = 0.5 \sin(t \pi / 4)$.

All combinations between the regression functions $(s_i)_{i=1,2,3}$ and the five noise levels $\sigma_\cdot$ have been considered, each time with $N = 10\,000$ independent samples. Results below only report a small part of the entire simulation study but intend to be representative of the main observed behaviour. A more complete report of the results, including other regression functions $s$ and noise-level functions $\sigma$, is given in the second author's thesis [21, Chapter 7]; see also Section 3 of the supplementary material.

[Figure 3: $\mathbb{E}\bigl[\| s - \widehat{s}_{\widehat{m}_P(D)} \|_n^2\bigr]$ as a function of $D$ for $P$ among 'ERM' (empirical risk minimization), 'Loo' (Leave-one-out), 'Lpo(20)' ($\mathrm{Lpo}_p$ with $p = 20$) and 'Lpo(50)' ($\mathrm{Lpo}_p$ with $p = 50$). Left: homoscedastic $(s_2, \sigma_c)$. Right: heteroscedastic $(s_3, \sigma_{pc,3})$. All curves have been estimated from $N = 10\,000$ independent samples; error bars are all negligible in front of visible differences (the larger ones are smaller than $8 \cdot 10^{-5}$ on the left, and smaller than $2 \cdot 10^{-4}$ on the right). The curves $D \mapsto \| s - \widehat{s}_{\widehat{m}_P(D)} \|_n^2$ behave similarly to their expectations.]

3.3.2 Results: Comparison of segmentations for each dimension

The segmentations of each dimension $D \in \mathcal{D}_n$ obtained by empirical risk minimization ('ERM', Procedure 1) and $\mathrm{Lpo}_p$ (Procedure 2) for several values of $p$ are compared on Figure 3, through the expected values of the quadratic loss $\mathbb{E}\bigl[\| s - \widehat{s}_{\widehat{m}_P(D)} \|_n^2\bigr]$ for procedure $P$.
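The simulation design of Section 3.3.1 can be sketched as follows. This is our own minimal illustration: the noise level is the paper's $\sigma_{pc,1}$, but the regression function `s_example` is a hypothetical stand-in, since $s_1$, $s_2$, $s_3$ are only shown as plots.

```python
import numpy as np

def sigma_pc1(t):
    """Heteroscedastic piecewise-constant noise level sigma_pc,1 of Section 3.3.1."""
    return np.where(t <= 1 / 3, 0.2, 0.05)

def generate_sample(s, sigma, n=100, seed=0):
    """Y_i = s(t_i) + sigma(t_i) * eps_i, with t_i = i/n and eps_i standard Gaussian."""
    rng = np.random.default_rng(seed)
    t = np.arange(1, n + 1) / n
    return t, s(t) + sigma(t) * rng.standard_normal(n)

def s_example(t):
    """Hypothetical piecewise-constant regression function (stand-in for s_1-s_3)."""
    return np.where(t <= 0.5, 0.0, 1.0)
```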
On the one hand, when data are homoscedastic (Figure 3, left), all procedures yield similar performances for all dimensions up to twice the best dimension; $\mathrm{Lpo}_p$ performs significantly better for larger dimensions. Therefore, unless the dimension is strongly overestimated (whatever the way $D$ is chosen), all procedures are equivalent with homoscedastic data.

On the other hand, when data are heteroscedastic (Figure 3, right), ERM yields significantly worse performance than Lpo for dimensions larger than half the true dimension. As explained in Sections 3.1 and 3.2.4, $\widehat{m}_{\mathrm{ERM}}(D)$ often puts breakpoints inside pure noise for dimensions $D$ smaller than the true dimension, whereas Lpo does not have this drawback. Therefore, whatever the choice of the dimension (except $D \leq 4$, that is, for detecting the obvious jumps), Lpo should be preferred to empirical risk minimization as soon as data are heteroscedastic.

    s      σ           ERM           Loo           Lpo_20        Lpo_50
    s_2    σ_c         2.88 ± 0.01   2.93 ± 0.01   2.93 ± 0.01   2.94 ± 0.01
    s_2    σ_pc,1      1.31 ± 0.02   1.16 ± 0.02   1.14 ± 0.02   1.11 ± 0.01
    s_2    σ_pc,3      3.09 ± 0.03   2.52 ± 0.03   2.48 ± 0.03   2.32 ± 0.03
    s_3    σ_c         3.18 ± 0.01   3.25 ± 0.01   3.29 ± 0.01   3.44 ± 0.01
    s_3    σ_pc,1      3.00 ± 0.01   2.67 ± 0.02   2.68 ± 0.02   2.77 ± 0.02
    s_3    σ_pc,3      4.41 ± 0.02   3.97 ± 0.02   4.00 ± 0.02   4.11 ± 0.02

Table 1: Average performance $C_{\mathrm{or}}(\llbracket P, \mathrm{Id} \rrbracket)$ for change-point detection procedures $P$ among ERM, Loo and $\mathrm{Lpo}_p$ with $p = 20$ and $p = 50$. Several regression functions $s$ and noise-level functions $\sigma$ have been considered, each time with $N = 10\,000$ independent samples. Next to each value is indicated the corresponding empirical standard deviation divided by $\sqrt{N}$, measuring the uncertainty of the estimated performance.
3.3.3 Results: Comparison of the best segmentations

This section focuses on the segmentation obtained with the best possible choice of $D$, that is, the one corresponding to the minimum of $D \mapsto \| s - \widehat{s}_{\widehat{m}_P(D)} \|_n^2$ (plotted on Figure 3) for procedures $P$ among ERM, Loo, and $\mathrm{Lpo}_p$ with $p = 20$ and $p = 50$. Therefore, the performance of a procedure $P$ is defined by
\[
C_{\mathrm{or}}(\llbracket P, \mathrm{Id} \rrbracket) := \frac{ \mathbb{E}\Bigl[ \inf_{1 \leq D \leq n} \| s - \widehat{s}_{\widehat{m}_P(D)} \|_n^2 \Bigr] }{ \mathbb{E}\Bigl[ \inf_{m \in \mathcal{M}_n} \| s - \widehat{s}_m \|_n^2 \Bigr] } ,
\]
which measures what is lost compared to the oracle when selecting one segmentation $\widehat{m}_P(D)$ per dimension. Even if the choice of $D$ is a real practical problem, which will be tackled in the next sections, $C_{\mathrm{or}}(\llbracket P, \mathrm{Id} \rrbracket)$ helps to understand which is the best procedure for selecting a segmentation of a given dimension. The notation $C_{\mathrm{or}}(\llbracket P, \mathrm{Id} \rrbracket)$ has been chosen for consistency with the notation used in the next sections (see Section 5.1).

Table 1 confirms the results of Section 3.3.2. On the one hand, when data are homoscedastic, ERM performs slightly better than Loo or $\mathrm{Lpo}_p$. On the other hand, when data are heteroscedastic, $\mathrm{Lpo}_p$ often performs better than ERM (whatever $p$), and the improvement can be large (more than 20% in the setting $(s_2, \sigma_{pc,3})$). Overall, when homoscedasticity of the signal is questionable, $\mathrm{Lpo}_p$ appears much more reliable than ERM for localizing a given number of change-points of the mean.

The question of choosing $p$ for optimizing the performance of $\mathrm{Lpo}_p$ remains a widely open problem. The simulation experiment summarized in Table 1 only shows that $\mathrm{Lpo}_p$ improves on ERM whatever $p$, the optimal value of $p$ depending on $s$ and $\sigma$.

4 Estimation of the number of breakpoints

In this section, the number of breakpoints is no longer fixed or known a priori. The goal is precisely the estimation of this number, as often needed with real data. Two main procedures are considered.
First, a penalization procedure introduced by Birgé and Massart [15] is analyzed in Section 4.1; this procedure is successful for change-point detection when data are homoscedastic [28, 30]. On the basis of this analysis, V-fold cross-validation (VFCV) is then proposed as an alternative to Birgé and Massart's penalization procedure (BM) when data can be heteroscedastic.

In order to enable the comparison between BM and VFCV when focusing on the question of choosing the number of breakpoints, VFCV is used for choosing among the same segmentations as BM, that is, $\{\widehat{m}_{\mathrm{ERM}}(D)\}_{D \in \mathcal{D}_n}$. The combination of VFCV for choosing $D$ with the new procedures proposed in Section 3 will be studied in Section 5.

4.1 Birgé and Massart's penalization

First, let us define precisely the penalization procedure proposed by Birgé and Massart [15], successfully used for change-point detection in [28, 30].

Procedure 3 (Birgé and Massart [15]).
1. $\forall m \in \mathcal{M}_n$, $\widehat{s}_m := \mathrm{ERM}(S_m; P_n)$.
2. $\widehat{m}_{\mathrm{BM}} := \arg\min_{m \in \mathcal{M}_n, D_m \in \mathcal{D}_n} \{ P_n \gamma(\widehat{s}_m) + \mathrm{pen}_{\mathrm{BM}}(m) \}$, where for every $m \in \mathcal{M}_n$, the penalty $\mathrm{pen}_{\mathrm{BM}}(m)$ only depends on $S_m$ through its dimension:
\[
\mathrm{pen}_{\mathrm{BM}}(m) = \mathrm{pen}_{\mathrm{BM}}(D_m) := \widehat{C} \, \frac{D_m}{n} \left( 5 + 2 \log \frac{n}{D_m} \right) , \tag{11}
\]
where $\widehat{C}$ is estimated from data using Birgé and Massart's slope heuristics [16, 8], as proposed by Lebarbier [30] and by Lavielle [28]. See Section 1 of the supplementary material for a detailed discussion about $\widehat{C}$.
3. $\widetilde{s}_{\mathrm{BM}} := \widehat{s}_{\widehat{m}_{\mathrm{BM}}}$.

All $m \in \mathcal{M}_n(D)$ are penalized in the same way by $\mathrm{pen}_{\mathrm{BM}}(m)$, so that Procedure 3 actually selects a segmentation among $\{\widehat{m}_{\mathrm{ERM}}(D)\}_{D \in \mathcal{D}_n}$. Therefore, Procedure 3 can be reformulated as follows, as noticed in [16, Section 4.3].

Procedure 4 (Reformulation of Procedure 3).
1. $\forall D \in \mathcal{D}_n$, $\widehat{s}_{\widehat{m}_{\mathrm{ERM}}(D)} := \mathrm{ERM}(\widetilde{S}_D; P_n)$, where $\widetilde{S}_D := \bigcup_{m \in \mathcal{M}_n(D)} S_m$.
2. $\widehat{D}_{\mathrm{BM}} := \arg\min_{D \in \mathcal{D}_n} \{ P_n \gamma(\widehat{s}_{\widehat{m}_{\mathrm{ERM}}(D)}) + \mathrm{pen}_{\mathrm{BM}}(D) \}$, where $\mathrm{pen}_{\mathrm{BM}}(D)$ is defined by (11).
3. $\widetilde{s}_{\mathrm{BM}} := \widehat{s}_{\widehat{m}_{\mathrm{ERM}}(\widehat{D}_{\mathrm{BM}})}$.

In the following, 'BM' denotes Procedure 4 and $\mathrm{crit}_{\mathrm{BM}}(D) := P_n \gamma(\widehat{s}_{\widehat{m}_{\mathrm{ERM}}(D)}) + \mathrm{pen}_{\mathrm{BM}}(D)$ is called the BM criterion.

Procedure 4 clarifies the reason why $\mathrm{pen}_{\mathrm{BM}}$ must be larger than Mallows' $C_p$ penalty. Indeed, for every $m \in \mathcal{M}_n$, Lemma 1 shows that when data are homoscedastic, $P_n \gamma(\widehat{s}_m) + \mathrm{pen}(m)$ is an unbiased estimator of $\| s - \widehat{s}_m \|_n^2$ when $\mathrm{pen}(m) = 2 \sigma^2 D_m n^{-1}$, that is, Mallows' $C_p$ penalty.

[Figure 4: Comparison of the expectations of $\| s - \widehat{s}_{\widehat{m}(D)} \|_n^2$ ('Loss'), $\mathrm{crit}_{\mathrm{VF}_V}(D)$ ('VF_5') and $\mathrm{crit}_{\mathrm{BM}}(D)$ ('BM'). Data are generated as explained in Section 3.3.1. Left: homoscedastic $(s_2, \sigma_c)$. Right: heteroscedastic $(s_2, \sigma_{pc,3})$. Expectations have been estimated from $N = 10\,000$ independent samples; error bars are all negligible in front of visible differences (the larger ones are smaller than $5 \cdot 10^{-4}$ on the left, and smaller than $2 \cdot 10^{-3}$ on the right). Similar behaviours are observed for every single sample, with slightly larger fluctuations for $\mathrm{crit}_{\mathrm{VF}_V}(D)$ than for $\mathrm{crit}_{\mathrm{BM}}(D)$. The curves 'BM' and 'VF_5' have been shifted in order to make comparison with 'Loss' easier, without changing the location of the minimum.]

When $\operatorname{Card}(\mathcal{M}_n)$ is at most polynomial in $n$, Mallows' $C_p$ penalty leads to an efficient model selection procedure, as proved in several regression frameworks [41, 31, 10]. Hence, Mallows' $C_p$ penalty is an adequate measure of the capacity of any vector space $S_m$ of dimension $D_m$, at least when data are homoscedastic. On the contrary, in the change-point detection framework, $\operatorname{Card}(\mathcal{M}_n)$ grows exponentially with $n$.
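The penalty (11) is straightforward to compute once $\widehat{C}$ is available; a minimal sketch (the function name is ours, and $\widehat{C}$ is assumed to have been calibrated beforehand by the slope heuristics, as discussed in the supplementary material):

```python
from math import log

def pen_BM(D, n, C_hat):
    """Birge-Massart penalty (11): C_hat * (D / n) * (5 + 2 * log(n / D)).
    C_hat is assumed already estimated by the slope heuristics."""
    return C_hat * (D / n) * (5 + 2 * log(n / D))
```

Note that for every $D < n$ the logarithmic term makes this penalty strictly larger than a Mallows-type penalty proportional to $D/n$ with the same leading constant, which is the point made above about exponential collections of models.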
The formulation of Procedure 4 points out that $\mathrm{pen}_{\mathrm{BM}}(D)$ has been built so that $\mathrm{crit}_{\mathrm{BM}}(D)$ unbiasedly estimates $\| s - \widehat{s}_{\widehat{m}_{\mathrm{ERM}}(D)} \|_n^2$ for every $D$, where $\widehat{s}_{\widehat{m}_{\mathrm{ERM}}(D)}$ is the empirical risk minimizer over $\widetilde{S}_D$. Hence, $\mathrm{pen}_{\mathrm{BM}}(D)$ measures the capacity of $\widetilde{S}_D$, which is much bigger than a vector space of dimension $D$. Therefore, $\mathrm{pen}_{\mathrm{BM}}$ should be larger than Mallows' $C_p$, as confirmed by the results of Birgé and Massart [16] on minimal penalties for exponential collections of models.

Simulation experiments support the fact that $\mathrm{crit}_{\mathrm{BM}}(D)$ is an unbiased estimator of $\| s - \widehat{s}_{\widehat{m}(D)} \|_n^2$ for every $D$ (up to an additive constant) when data are homoscedastic (Figure 4, left). However, when data are heteroscedastic, the theoretical results proved by Birgé and Massart [15, 16] no longer apply, and simulations show that $\mathrm{crit}_{\mathrm{BM}}(D)$ does not always estimate $\| s - \widehat{s}_{\widehat{m}_{\mathrm{ERM}}(D)} \|_n^2$ well (Figure 4, right). This result is consistent with Lemma 1, as well as with the suboptimality of penalties proportional to $D_m$ for model selection among a polynomial collection of models when data are heteroscedastic [6]. Therefore, $\mathrm{pen}_{\mathrm{BM}}(D)$ is not an adequate capacity measure of $\widetilde{S}_D$ in general when data are heteroscedastic, and another capacity measure is required.

4.2 Cross-validation

As shown in Section 3.2.2, CV can be used for estimating the quadratic loss $\| s - \mathcal{A}(P_n) \|_n^2$ for any algorithm $\mathcal{A}$. In particular, CV was successfully used in Section 3 for estimating the quadratic risk of $\mathrm{ERM}(S_m; \cdot)$ for all segmentations $m \in \mathcal{M}_n(D)$ with a given number $(D-1)$ of breakpoints (Procedure 2), even when data are heteroscedastic. Therefore, CV methods are natural candidates for fixing BM's failure. The proposed procedure, with VFCV, is the following.

Procedure 5.
1. $\forall D \in \mathcal{D}_n$, $\widehat{s}_{\widehat{m}_{\mathrm{ERM}}(D)} := \mathrm{ERM}(\widetilde{S}_D; P_n)$,
2. $\widehat{D}_{\mathrm{VF}_V} := \arg\min_{D \in \mathcal{D}_n} \{ \mathrm{crit}_{\mathrm{VF}_V}(D) \}$, where $\mathrm{crit}_{\mathrm{VF}_V}(D) := \widehat{R}_{\mathrm{VF}_V}(\mathrm{ERM}(\widetilde{S}_D; \cdot), P_n)$ and $\widehat{R}_{\mathrm{VF}_V}$ is defined by (9).

Remark 3. In the algorithm $(t_i, Y_i)_{1 \leq i \leq n} \mapsto \mathrm{ERM}(\widetilde{S}_D; P_n)$, the model $\widetilde{S}_D$ depends on the design points. When the training set is $(t_i, Y_i)_{i \notin B_k}$, the model $\widetilde{S}_D$ is the union of the $S_m$ such that $\forall \lambda \in \Lambda_m$, $I_\lambda$ contains at least two elements of $\{ t_i \ \text{s.t.} \ i \notin B_k \}$. Such an $m$ exists as soon as $D \leq (n - \max_k \{\operatorname{Card}(B_k)\}) / 2$ and two consecutive design points $t_i, t_{i+1}$ always belong to different blocks $B_k$, which is always assumed in this paper. Note that the dynamic programming algorithms [13] quoted in Section 3.2.3 can straightforwardly take into account such constraints when minimizing the empirical risk over $\widetilde{S}_D$.

The dependence of $\widetilde{S}_D$ on the design explains why $\mathrm{crit}_{\mathrm{VF}_V}(D)$ decreases for $D$ close to $n(V-1)/(2V)$, as observed on Figure 4. Indeed, when $D$ is close to $n_t/2$ (where $n_t$ is the size of the design), only a few $\{S_m\}_{m \in \mathcal{M}_{n_t}(D)}$ remain in $\widetilde{S}_D$; for instance, when $D = n_t/2$, $\widetilde{S}_D$ is equal to one of the $\{S_m\}_{m \in \mathcal{M}_{n_t}(D)}$. Therefore, the capacity of $\widetilde{S}_D$ decreases in the neighborhood of $D = n_t/2$.

Similar procedures can be defined with Loo and $\mathrm{Lpo}_p$ instead of VFCV. The interest of VFCV is its reasonably small computational cost (taking $V \leq 10$ for instance), since no closed-form formula exists for CV estimators of the risk of $\mathrm{ERM}(\widetilde{S}_D; P_n)$.

4.3 Simulation results

A simulation experiment was performed in the setting presented in Section 3.3.1, for comparing BM and $\mathrm{VF}_V$ with $V = 5$ blocks. A representative picture of the results is given by Figure 4 and by Table 2 [see 21, Chapter 7, and Section 3 of the supplementary material for additional results]. As illustrated by Figure 4, $\mathrm{crit}_{\mathrm{VF}_V}(D)$ can be used for measuring the capacity of $\widetilde{S}_D$.
Indeed, VFCV correctly estimates the risk of empirical risk minimizers over $\widetilde{S}_D$ for every $D$ and for both homoscedastic and heteroscedastic data; $\mathrm{crit}_{\mathrm{VF}_V}(D)$ only underestimates $\| s - \widehat{s}_{\widehat{m}(D)} \|_n^2$ for dimensions $D$ close to $n(V-1)/(2V)$, for reasons explained at the end of Remark 3. On the contrary, $\mathrm{crit}_{\mathrm{BM}}(D)$ is a poor estimate of $\| s - \widehat{s}_{\widehat{m}(D)} \|_n^2$ when data are heteroscedastic. Subsequently, VFCV yields a much smaller performance index
\[
C_{\mathrm{or}}(\llbracket \mathrm{ERM}, P \rrbracket) := \frac{ \mathbb{E}\Bigl[ \| s - \widehat{s}_{\widehat{m}_{\mathrm{ERM}}(\widehat{D}_P)} \|_n^2 \Bigr] }{ \mathbb{E}\Bigl[ \inf_{m \in \mathcal{M}_n} \| s - \widehat{s}_m(P_n) \|_n^2 \Bigr] }
\]
than BM when data are heteroscedastic (Table 2); see also the supplementary material (Section 1) for details about the performances of BM and possible ways to improve them.

    s      σ           Oracle        VF_5          BM
    s_2    σ_c         2.88 ± 0.01   4.51 ± 0.03    5.27 ± 0.03
    s_2    σ_pc,2      2.88 ± 0.02   6.58 ± 0.06   19.82 ± 0.07
    s_2    σ_s         3.01 ± 0.01   5.21 ± 0.04    9.69 ± 0.40
    s_3    σ_c         3.18 ± 0.01   4.41 ± 0.02    4.39 ± 0.01
    s_3    σ_pc,2      4.06 ± 0.02   5.99 ± 0.02    7.86 ± 0.03
    s_3    σ_s         4.02 ± 0.01   5.97 ± 0.03    7.59 ± 0.03

Table 2: Performance $C_{\mathrm{or}}(\llbracket \mathrm{ERM}, P \rrbracket)$ for $P = \mathrm{Id}$ (that is, choosing the dimension $D^\star := \arg\min_{D \in \mathcal{D}_n} \{ \| s - \widehat{s}_{\widehat{m}_{\mathrm{ERM}}(D)} \|_n^2 \}$), $P = \mathrm{VF}_V$ with $V = 5$, or $P = \mathrm{BM}$. Several regression functions $s$ and noise-level functions $\sigma$ have been considered, each time with $N = 10\,000$ independent samples. Next to each value is indicated the corresponding empirical standard deviation divided by $\sqrt{N}$, measuring the uncertainty of the estimated performance.

When data are homoscedastic, VFCV and BM have similar performances (maybe with a slight advantage for BM), which is not surprising since BM uses the knowledge that data are homoscedastic. Moreover, BM has been proved to be optimal in the homoscedastic setting [15, 16]. Overall, VFCV appears to be a reliable alternative to BM when no prior knowledge guarantees that data are homoscedastic.
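The VFCV risk estimate (9) can be sketched for a regressogram on a fixed segmentation; the blocks take consecutive indices modulo $V$, so that consecutive design points fall in different blocks as assumed in Remark 3. The function names are ours, and this is a simplified illustration: Procedure 5 additionally refits the segmentation on each training set rather than keeping it fixed.

```python
import numpy as np

def regressogram_predict(t_train, y_train, breakpoints, t_test):
    """Piecewise-constant fit: predict by the training mean of Y on each interval."""
    bins = np.concatenate(([-np.inf], np.asarray(breakpoints, dtype=float), [np.inf]))
    idx_train = np.digitize(t_train, bins)
    idx_test = np.digitize(t_test, bins)
    preds = np.empty(len(t_test), dtype=float)
    for j in np.unique(idx_test):
        mask = idx_train == j
        # Fall back to the global mean if an interval has no training point.
        preds[idx_test == j] = y_train[mask].mean() if mask.any() else y_train.mean()
    return preds

def vfcv_risk(t, y, breakpoints, V=5):
    """V-fold CV risk estimate (9), with blocks B_k = {i : i mod V == k}."""
    folds = np.arange(len(t)) % V  # t is assumed sorted
    risks = []
    for k in range(V):
        test = folds == k
        pred = regressogram_predict(t[~test], y[~test], breakpoints, t[test])
        risks.append(np.mean((y[test] - pred) ** 2))
    return float(np.mean(risks))
```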
5 New change-point detection procedures via cross-validation

Sections 3 and 4 showed that when data are heteroscedastic, CV can be used successfully instead of penalized criteria for detecting breakpoints given their number, as well as for estimating the number of breakpoints. Nevertheless, in Section 4, the segmentations compared by CV were obtained by empirical risk minimization, so that they can be suboptimal according to the results of Section 3.

The next step for obtaining reliable change-point detection procedures for heteroscedastic data is to combine the two ideas, that is, to use CV twice. The goal of the present section is to properly define such procedures (with various kinds of CV) and to assess their performances.

5.1 Definition of a family of change-point detection procedures

The general strategy used in this article for change-point detection relies on two steps: first, detect where $(D-1)$ breakpoints should be located for every $D \in \mathcal{D}_n$; second, estimate the number $(D-1)$ of breakpoints. This strategy can be summarized with the following procedure:

Procedure 6 (General two-step change-point detection procedure).
1. $\forall D \in \mathcal{D}_n$, $\mathcal{A}_D(P_n) := \widehat{s}_{\widehat{m}(D)}$ with $\widehat{m}(D) = \arg\min_{m \in \mathcal{M}_n(D)} \{ \mathrm{crit}_1(S_m, P_n) \}$, where for every model $S$, $\mathrm{crit}_1(S, P_n) \in \mathbb{R}$ estimates $\| s - \mathrm{ERM}(S; P_n) \|_n^2$ and $\widehat{s}_m = \mathrm{ERM}(S_m; P_n)$ is defined as in Section 3.1.
2. $\widehat{D} := \arg\min_{D \in \mathcal{D}_n} \{ \mathrm{crit}_2(\mathcal{A}_D, P_n) \}$, where for every algorithm $\mathcal{A}_D$, $\mathrm{crit}_2(\mathcal{A}_D, P_n) \in \mathbb{R}$ estimates $\| s - \mathcal{A}_D(P_n) \|_n^2$.
3. Output: the segmentation $\widehat{m}(\widehat{D})$ and the corresponding estimator $\widehat{s}_{\widehat{m}(\widehat{D})}$ of $s$.

Let us now detail the candidate criteria $\mathrm{crit}_1$ and $\mathrm{crit}_2$ for being used in Procedure 6. For the first step:
• The empirical risk ('ERM') is $\mathrm{crit}_{1,\mathrm{ERM}}(S, P_n) := P_n \gamma(\mathrm{ERM}(S; P_n))$.
• The Leave-$p$-out estimator of the risk ('$\mathrm{Lpo}_p$') is, for every $p \in \{1, \ldots, n-1\}$, $\mathrm{crit}_{1,\mathrm{Lpo}}(S, P_n, p) := \widehat{R}_{\mathrm{Lpo}_p}(\mathrm{ERM}(S; \cdot), P_n)$.
• For comparison, the ideal criterion ('Id') is defined by $\mathrm{crit}_{1,\mathrm{Id}}(S, P_n) := \| s - \mathrm{ERM}(S; P_n) \|_n^2$.

As in Section 3, Loo denotes $\mathrm{Lpo}_1$. The VFCV estimator of the risk $\widehat{R}_{\mathrm{VF}_V}$ could also be used as $\mathrm{crit}_1$; it will not be considered in the following because it is computationally more expensive and more variable than Lpo (see Section 3.2).

For the second step:
• Birgé and Massart's penalization criterion ('BM') is $\mathrm{crit}_{2,\mathrm{BM}}(\mathcal{A}_D, P_n) := P_n \gamma(\mathcal{A}_D(P_n)) + \mathrm{pen}_{\mathrm{BM}}(D)$, where $\mathrm{pen}_{\mathrm{BM}}(D)$ is defined by (11) with constants $c_1 = 5$, $c_2 = 2$, and $\widehat{C}$ is chosen by the slope heuristics (see Section 1 of the supplementary material).
• The $V$-fold cross-validation estimator of the risk ('$\mathrm{VF}_V$') is, for every $V \in \{1, \ldots, n\}$, $\mathrm{crit}_{2,\mathrm{VF}_V}(\mathcal{A}_D, P_n) := \widehat{R}_{\mathrm{VF}_V}(\mathcal{A}_D, P_n)$, where $\widehat{R}_{\mathrm{VF}_V}$ is defined by (9) and the blocks $B_1, \ldots, B_V$ are chosen as in Procedure 5 (see Remark 3).
• For comparison, the ideal criterion ('Id') is defined by $\mathrm{crit}_{2,\mathrm{Id}}(\mathcal{A}_D, P_n) := \| s - \mathcal{A}_D(P_n) \|_n^2$.

Remark 4. For $\mathrm{crit}_2$, definitions using Lpo could theoretically be considered. They are not investigated here because they are computationally intractable.

In the following, the notation $\llbracket \alpha, \beta \rrbracket$ is used as a shortcut for Procedure 6 with $\mathrm{crit}_{1,\alpha}$ and $\mathrm{crit}_{2,\beta}$, and the outputs of $\llbracket \alpha, \beta \rrbracket$ are denoted by $\widehat{m}_{\llbracket \alpha, \beta \rrbracket} \in \mathcal{M}_n$ and $\widetilde{s}_{\llbracket \alpha, \beta \rrbracket} \in S^*$. For instance, BM coincides with $\llbracket \mathrm{ERM}, \mathrm{BM} \rrbracket$; procedures $\llbracket \alpha, \mathrm{Id} \rrbracket$ are compared for several $\alpha$ in Section 3; procedures $\llbracket \mathrm{ERM}, \beta \rrbracket$ are compared for $\beta \in \{ \mathrm{Id}, \mathrm{BM}, \mathrm{VF}_5 \}$ in Section 4.
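Step 1 of Procedure 6 is carried out by the dynamic programming algorithm of [13]: any $\mathrm{crit}_1$ that decomposes as a sum of per-interval terms can be minimized over $\mathcal{M}_n(D)$ in $O(n^2)$. Below is a minimal sketch with $\mathrm{crit}_1$ the empirical risk, so that each interval's cost is its residual sum of squares around its mean (the function names are ours, not the paper's); the closed-form $\mathrm{Lpo}_p$ terms of Theorem 1 could be plugged in as per-interval costs in exactly the same way.

```python
import numpy as np

def interval_costs(y):
    """cost[i, j] = RSS of fitting a constant on y[i:j], for all 0 <= i < j <= n,
    computed from cumulative sums of y and y^2."""
    n = len(y)
    s1 = np.concatenate(([0.0], np.cumsum(np.asarray(y, dtype=float))))
    s2 = np.concatenate(([0.0], np.cumsum(np.asarray(y, dtype=float) ** 2)))
    cost = np.full((n + 1, n + 1), np.inf)
    for i in range(n):
        for j in range(i + 1, n + 1):
            cost[i, j] = (s2[j] - s2[i]) - (s1[j] - s1[i]) ** 2 / (j - i)
    return cost

def best_segmentation(y, D):
    """Optimal boundaries 0 = b_0 < ... < b_D = n by dynamic programming."""
    n = len(y)
    cost = interval_costs(y)
    opt = np.full((D + 1, n + 1), np.inf)  # opt[d, j]: best cost of y[:j] in d pieces
    arg = np.zeros((D + 1, n + 1), dtype=int)
    opt[0, 0] = 0.0
    for d in range(1, D + 1):
        for j in range(d, n + 1):
            cands = opt[d - 1, :j] + cost[:j, j]
            i = int(np.argmin(cands))
            opt[d, j], arg[d, j] = cands[i], i
    bounds = [n]
    for d in range(D, 0, -1):
        bounds.append(int(arg[d, bounds[-1]]))
    return bounds[::-1], float(opt[D, n])
```

For instance, a signal with one obvious jump and $D = 2$ recovers the jump location with zero residual cost.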
    s      σ           ⟦ERM, VF_5⟧    ⟦Loo, VF_5⟧    ⟦Lpo_20, VF_5⟧   ⟦ERM, BM⟧
    s_1    σ_c          5.40 ± 0.05    5.03 ± 0.05    5.10 ± 0.05      3.91 ± 0.03
    s_1    σ_pc,1      11.96 ± 0.03   10.25 ± 0.03   10.28 ± 0.03     12.85 ± 0.04
    s_1    σ_pc,3       4.96 ± 0.05    4.82 ± 0.04    4.79 ± 0.05     13.08 ± 0.04
    s_1    σ_s          7.33 ± 0.06    6.82 ± 0.05    6.99 ± 0.06      9.41 ± 0.04
    s_2    σ_c          4.51 ± 0.03    4.55 ± 0.03    4.50 ± 0.03      5.27 ± 0.03
    s_2    σ_pc,1      11.67 ± 0.09   10.26 ± 0.08   10.29 ± 0.08     19.36 ± 0.07
    s_2    σ_pc,3       6.66 ± 0.06    5.81 ± 0.06    5.74 ± 0.06     20.12 ± 0.06
    s_2    σ_s          5.21 ± 0.04    5.19 ± 0.03    5.17 ± 0.03      9.69 ± 0.04
    s_3    σ_c          4.41 ± 0.02    4.54 ± 0.02    4.62 ± 0.02      4.39 ± 0.01
    s_3    σ_pc,1       4.91 ± 0.02    4.40 ± 0.02    4.44 ± 0.02      6.50 ± 0.02
    s_3    σ_pc,3       6.32 ± 0.02    5.74 ± 0.02    5.81 ± 0.02      8.47 ± 0.03
    s_3    σ_s          5.97 ± 0.02    5.72 ± 0.02    5.86 ± 0.02      7.59 ± 0.03

Table 3: Performance $C_{\mathrm{or}}(P)$ for several change-point detection procedures $P$ in several settings $(s, \sigma)$. Each time, $N = 10\,000$ independent samples have been generated. Next to each value is indicated the corresponding empirical standard deviation divided by $\sqrt{N}$.

5.2 Simulation study

A simulation experiment compares procedures $\llbracket \alpha, \mathrm{VF}_5 \rrbracket$ for several $\alpha$ and $\llbracket \mathrm{ERM}, \mathrm{BM} \rrbracket$, in the setting described in Section 3.3.1. A representative picture of the results is given by Table 3 [see 21, Chapter 7, for additional results]. The (statistical) performance of each competing procedure $P$ is measured by
\[
C_{\mathrm{or}}(P) := \frac{ \mathbb{E}\bigl[ \| s - \widetilde{s}_P(P_n) \|_n^2 \bigr] }{ \mathbb{E}\bigl[ \inf_{m \in \mathcal{M}_n} \| s - \widehat{s}_m(P_n) \|_n^2 \bigr] } ,
\]
both expectations being evaluated by averaging over $N = 10\,000$ independent samples.

Remark 5. Birgé and Massart's penalization procedure is the only classical change-point detection procedure considered in this experiment, for two reasons. First, change-point detection procedures looking for changes in the distribution of $Y_i$ would clearly fail to detect changes in the mean of the signal, as soon as the noise level $\sigma$ varies inside areas where the mean is constant.
Second, among procedures detecting changes in the mean of a signal in a setting comparable to that of the paper (that is, frequentist, parametric, off-line, with no information on the number of change-points), BM appears to be the most reliable procedure according to recent papers [28, 30]. The question of the calibration of $\widehat{C}$ is addressed in Section 1 of the supplementary material.

First, BM is consistently outperformed by the other procedures, except in the homoscedastic settings, in which it confirms its strength. Second, empirical risk minimization (ERM) slightly outperforms CV (Loo and $\mathrm{Lpo}_{20}$) when data are homoscedastic. On the contrary, when data are heteroscedastic, Loo and $\mathrm{Lpo}_{20}$ clearly outperform ERM, often by a margin larger than 10% (for instance, when $\sigma = \sigma_{pc,1}$). Therefore, the results of Section 3 are confirmed when using $\mathrm{VF}_5$ (instead of Id) for choosing the dimension.

    Framework          A             B             C
    ⟦ERM, BM⟧      6.82 ± 0.03   7.21 ± 0.04   13.49 ± 0.07
    ⟦ERM, VF_5⟧    4.78 ± 0.03   5.09 ± 0.03    7.17 ± 0.05
    ⟦Loo, VF_5⟧    4.65 ± 0.03   4.88 ± 0.03    6.61 ± 0.05
    ⟦Lpo_20, VF_5⟧ 4.78 ± 0.03   4.91 ± 0.03    6.49 ± 0.05
    ⟦Lpo_50, VF_5⟧ 4.97 ± 0.03   5.18 ± 0.04    6.69 ± 0.05

Table 4: Performance $C^{(R)}_{\mathrm{or}}(P)$ of several model selection procedures $P$ in frameworks A, B, C with sample size $n = 100$. In each framework, $N = 10\,000$ independent samples have been considered. Next to each value is indicated the corresponding empirical standard deviation divided by $\sqrt{N}$.

Third, the comparison between $\llbracket \mathrm{Lpo}_p, \mathrm{VF}_5 \rrbracket$ for several values of $p$ is less clear. Even though $p = 1$ (that is, Loo) mostly outperforms $p = 20$ (as well as $p = 50$, see the supplementary material), differences are small and often not significant despite the large number of samples generated.
The conclusion of the simulation experiment on this question is that all values of $p$ between 1 and $n/2$ perform almost equally well, with a small advantage for $p = 1$ which may not be general. Let us mention here that the choice of $p$ for $\mathrm{Lpo}_p$ is usually related to overpenalization [see for instance 5, 19, 21], but it seems difficult to characterize the settings for which overpenalization is needed for detecting change-points given their number.

5.3 Random frameworks

In order to assess the generality of the results of Table 3, the procedures considered in Section 5.2 have been compared in three random settings. The following process has been repeated $N = 10\,000$ times. First, piecewise constant functions $s$ and $\sigma$ are randomly chosen (see Section 2 of the supplementary material for details). Then, given $s$ and $\sigma$, a data sample $(t_i, Y_i)_{1 \leq i \leq n}$ is generated as described in Section 3.3.1, and the same collection of models is used. Finally, each procedure $P$ is applied to the sample $(t_i, Y_i)_{1 \leq i \leq n}$, and its loss $\| s - \widetilde{s}_P(P_n) \|_n^2$ is measured, as well as the loss of the oracle $\inf_{m \in \mathcal{M}_n} \{ \| s - \widehat{s}_m \|_n^2 \}$. To summarize the results, the quality of each procedure is measured by the ratio
\[
C^{(R)}_{\mathrm{or}}(P) = \frac{ \mathbb{E}_{s, \sigma, \epsilon_1, \ldots, \epsilon_n} \bigl[ \| s - \widetilde{s}_P(P_n) \|_n^2 \bigr] }{ \mathbb{E}_{s, \sigma, \epsilon_1, \ldots, \epsilon_n} \bigl[ \inf_{m \in \mathcal{M}_n} \| s - \widehat{s}_m \|_n^2 \bigr] } .
\]
The notation $C^{(R)}_{\mathrm{or}}(P)$ differs from $C_{\mathrm{or}}(P)$ to emphasize that each expectation includes the randomness of $s$ and $\sigma$, in addition to that of $(\epsilon_i)_{1 \leq i \leq n}$.

The results of this experiment, which are reported in Table 4, mostly confirm the results of the previous section (except that all the frameworks are heteroscedastic here): whatever $p$, $\llbracket \mathrm{Lpo}_p, \mathrm{VF}_5 \rrbracket$ outperforms $\llbracket \mathrm{ERM}, \mathrm{VF}_5 \rrbracket$, which strongly outperforms $\llbracket \mathrm{ERM}, \mathrm{BM} \rrbracket$. Similar results, not reported here, have been obtained with a sample size $n = 200$ and $N = 1\,000$ samples.
Moreover, the difference between the performances of $\llbracket \mathrm{Lpo}_p, \mathrm{VF}_5 \rrbracket$ and $\llbracket \mathrm{ERM}, \mathrm{VF}_5 \rrbracket$ is the largest in setting C and the smallest in setting A. This fact confirms the interpretation given in Section 3 for the failure of ERM for localizing a given number of change-points. Indeed, the main differences between frameworks A, B and C, which are precisely defined in Section 2 of the supplementary material, can be sketched as follows:

A. the partitions on which $s$ is built are often close to regular, and $\sigma$ is chosen independently from $s$;
B. the partitions on which $s$ is built are often irregular, and $\sigma$ is chosen independently from $s$;
C. the partitions on which $s$ is built are often irregular, and $\sigma$ depends on $s$, so that the noise level is smaller where $s$ jumps more often.

In other words, frameworks A, B and C have been built so that for any $D \in \mathcal{D}_n$, the largest variations over $\mathcal{M}_n(D)$ of $V(m)$ (defined by (7)) occur in framework C, and the smallest variations occur in framework A. As a consequence, variations of the performance of $\llbracket \mathrm{ERM}, \mathrm{VF}_5 \rrbracket$ compared to $\llbracket \mathrm{Lpo}_p, \mathrm{VF}_5 \rrbracket$ according to the framework certainly come from the local overfitting phenomenon presented in Section 3.

6 Application to CGH microarray data

In this section, the new change-point detection procedures proposed in the paper are applied to CGH microarray data.

6.1 Biological context

The purpose of Comparative Genomic Hybridization (CGH) microarray experiments is to detect and map chromosomal aberrations. For instance, a piece of chromosome can be amplified, that is, appear several times more than usual, or deleted. Such aberrations are often related to cancer disease. Roughly, CGH profiles give the log-ratio of the DNA copy number along the chromosomes, compared to a reference DNA sequence [see 35–37, for details about the biological context of CGH data].
The goal of CGH data analysis is to detect abrupt changes in the mean of a signal (the log-ratio of copy numbers), and to estimate the mean in each segment. Hence, change-point detection procedures are needed. Moreover, assuming that CGH data are homoscedastic is often unrealistic. Indeed, changes in the chemical composition of the sequence are known to induce changes in the variance of the observed CGH profile, possibly independently from variations of the true copy number. Therefore, procedures robust to heteroscedasticity, such as the ones proposed in Section 5, should yield better results, in terms of detecting changes of copy number, than procedures assuming homoscedasticity.

The data set considered in this section is based on the Bt474 cell lines, which denote epithelial cells obtained from human breast cancer tumors of a sixty-year-old woman [36]. A test genome of Bt474 cell lines is compared to a normal reference male genome. Even though several chromosomes are studied in these cell lines, this section focuses on chromosomes 1 and 9. Chromosome 1 exhibits a putative heterogeneous variance along the CGH profile, and chromosome 9 is likely to meet the homoscedasticity assumption. Log-ratios of copy numbers have been measured at 119 locations for chromosome 1 and at 93 locations for chromosome 9.

6.2 Procedures used in the CGH literature

Before applying Procedure 6 to the analysis of Bt474 CGH data, let us recall the definition of two change-point detection procedures, which were the most successful for analyzing the same data according to the literature [36].

The first procedure is a simplified version of BM proposed by Lavielle [28, Section 2] and first used on CGH data in [36]. Note that BM would give similar results on the data of Figure 5. The second procedure, denoted by 'PML' for penalized maximum likelihood, aims at detecting changes in either the mean or the variance, that is, breakpoints for $(s, \sigma)$.
The selected model is defined as the minimizer over $m \in \mathcal{M}_n$ of
\[
\mathrm{crit}_{\mathrm{PML}}(m) := \sum_{\lambda \in \Lambda_m} n_\lambda \log\left( \frac{1}{n_\lambda} \sum_{t_i \in I_\lambda} \bigl( Y_i - \widehat{s}_m(t_i; P_n) \bigr)^2 \right) + \widehat{C}'' D_m ,
\]
where $n_\lambda = \operatorname{Card}\{ t_i \in I_\lambda \}$ and $\widehat{C}''$ is estimated from data by the slope heuristics algorithm [28, 30].

6.3 Results

Results obtained with BMsimple, PML, $\llbracket \mathrm{ERM}, \mathrm{VF}_5 \rrbracket$ and $\llbracket \mathrm{Lpo}_{20}, \mathrm{VF}_5 \rrbracket$ on the Bt474 data set are reported on Figure 5. For chromosome 9, BMsimple and PML yield (almost) the same segmentation, so that the homoscedasticity assumption is certainly not much violated. As expected, $\llbracket \mathrm{ERM}, \mathrm{VF}_5 \rrbracket$ and $\llbracket \mathrm{Lpo}_{20}, \mathrm{VF}_5 \rrbracket$ also yield very similar segmentations, which confirms the reliability of these procedures for a homoscedastic signal [see 21, Section 7.6, for details].

The picture is quite different for chromosome 1. Indeed, as shown by Figure 5 (right), BMsimple selects a segmentation with 7 breakpoints, whereas PML selects a segmentation with only one breakpoint. The major difference between BMsimple and PML supports at least the idea that these data must be heteroscedastic. Nevertheless, none of the segmentations chosen by BMsimple and PML is entirely satisfactory: BMsimple relies on an assumption which is certainly violated; PML may use a change in the estimated variance to explain away several changes in the mean. The CV-based procedures $\llbracket \mathrm{ERM}, \mathrm{VF}_5 \rrbracket$ and $\llbracket \mathrm{Lpo}_{20}, \mathrm{VF}_5 \rrbracket$ yield two other segmentations, with a medium number of breakpoints, respectively 4 and 3. In view of the simulation experiments of the previous sections, the segmentation obtained via $\llbracket \mathrm{Lpo}_{20}, \mathrm{VF}_5 \rrbracket$ should be the most reliable one since data are heteroscedastic. Therefore, the right part of Figure 5 can be interpreted as follows: the noise level is small in the first part of chromosome 1, then higher, but not as high as estimated by PML.
[Figure 5: Change-point locations along chromosome 9 (left) and chromosome 1 (right), for BMsimple, PML, $\llbracket \mathrm{ERM}, \mathrm{VF}_5 \rrbracket$ and $\llbracket \mathrm{Lpo}_{20}, \mathrm{VF}_5 \rrbracket$. The mean on each homogeneous region is indicated by plain horizontal lines.]

In particular, the copy number changes twice inside the second part of chromosome 1 (as defined by the segmentation obtained with PML), indicating that two putative amplified regions of chromosome 1 have been detected. Note however that choosing among the segmentations obtained with $\llbracket \mathrm{ERM}, \mathrm{VF}_5 \rrbracket$ and $\llbracket \mathrm{Lpo}_{20}, \mathrm{VF}_5 \rrbracket$ is not an easy task without additional data. A definitive answer would need further biological experiments.

7 Conclusion

7.1 Results summary

Cross-validation (CV) methods have been used to build reliable procedures (Procedure 6) for detecting changes in the mean of a signal whose variance may not be constant.

First, when the number of breakpoints is given, empirical risk minimization has been proved to fail for some heteroscedastic problems, from both theoretical and experimental points of view. On the contrary, the Leave-$p$-out ($\mathrm{Lpo}_p$) remains robust to heteroscedasticity while being computationally efficient thanks to the closed-form formulas given in Section 3.2.3 (Theorem 1).

Second, for choosing the number of breakpoints, the commonly used penalization procedure proposed by Birgé and Massart in the homoscedastic framework should not be applied to heteroscedastic data.
V-fold cross-validation (VFCV) turns out to be a reliable alternative with both homoscedastic and heteroscedastic data, leading to much better segmentations in terms of quadratic risk when data are heteroscedastic. Furthermore, unlike the usual deterministic penalized criteria, VFCV efficiently chooses among segmentations obtained by either Lpo or empirical risk minimization, without any specific change in the procedure.

To conclude, the combination of Lpo (for choosing a segmentation for each possible number of breakpoints) and VFCV yields the most reliable procedure for detecting changes in the mean of a signal which is not a priori known to be homoscedastic. The resulting procedure is computationally tractable for small values of V, since its computational complexity is of order O(Vn²), which is similar to many comparable change-point detection procedures. The influence of V on the statistical performance of the procedure is not studied specifically in this paper; nevertheless, taking V = 5 was already sufficient to obtain a better statistical performance than Birgé and Massart's penalization procedure when data are heteroscedastic. When applied to real data (the CGH profiles of Section 6), the proposed procedure turns out to be quite useful and effective, on a data set on which existing procedures strongly disagree because of heteroscedasticity.

7.2 Prospects

The general form of Procedure 6 could be used with several other criteria, at both steps of the change-point detection procedure. For instance, resampling penalties [5] could be used at the first step, for localizing the change-points given their number. At the second step, V-fold penalization [6] could also be used instead of VFCV, with the same computational cost and possibly an improved statistical performance. Comparing precisely these resampling-based criteria for optimizing the performance of Procedure 6 would be of great interest and deserves further work.
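The O(Vn²) complexity mentioned above relies on the classical least-squares segmentation by dynamic programming [13]. The following is a minimal illustrative sketch (our code, not the authors' implementation; function and variable names are ours): with prefix sums, every segment cost is O(1), so computing the best segmentation for each number of segments up to D_max costs O(D_max · n²).

```python
import numpy as np

def best_segmentations(y, D_max):
    """Least-squares segmentation by dynamic programming (Bellman recursion).

    Returns cost[D] = minimal sum of squared errors of y split into
    D + 1 contiguous segments, for D = 0, ..., D_max.
    """
    n = len(y)
    s1 = np.concatenate(([0.0], np.cumsum(y)))        # prefix sums of y
    s2 = np.concatenate(([0.0], np.cumsum(y ** 2)))   # prefix sums of y^2

    def sse(i, j):
        # Sum of squared errors of segment y[i:j] around its mean, in O(1).
        return (s2[j] - s2[i]) - (s1[j] - s1[i]) ** 2 / (j - i)

    cost = np.full((D_max + 1, n + 1), np.inf)
    for j in range(1, n + 1):
        cost[0, j] = sse(0, j)                        # no breakpoint
    for D in range(1, D_max + 1):
        for j in range(D + 1, n + 1):
            # Best position t of the last breakpoint before j
            cost[D, j] = min(cost[D - 1, t] + sse(t, j) for t in range(D, j))
    return cost[:, n]
```

Running this recursion once for each of the V data splits is what yields the overall O(Vn²) cost.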
Simultaneously, several values of V should be compared for the second step of Procedure 6, and the precise influence of p when Lpo_p is used at the first step should be further investigated. Preliminary results in this direction can already be found in [21, Chapter 7].

References

[1] F. Abramovich, Y. Benjamini, D. Donoho, and I. Johnstone. Adapting to Unknown Sparsity by Controlling the False Discovery Rate. The Annals of Statistics, 34(2):584–653, 2006.
[2] H. Akaike. Statistical predictor identification. Ann. Inst. Statist. Math., 22:203–217, 1969.
[3] Hirotugu Akaike. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory (Tsahkadsor, 1971), pages 267–281. Akadémiai Kiadó, Budapest, 1973.
[4] David M. Allen. The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16:125–127, 1974.
[5] Sylvain Arlot. Model selection by resampling penalization, 2008. hal-00262478.
[6] Sylvain Arlot. Suboptimality of penalties proportional to the dimension for model selection in heteroscedastic regression, December 2008.
[7] Sylvain Arlot. V-fold cross-validation improved: V-fold penalization, February 2008.
[8] Sylvain Arlot and Pascal Massart. Data-driven calibration of penalties for least-squares regression. J. Mach. Learn. Res., 10:245–279 (electronic), 2009.
[9] Yannick Baraud. Model selection for regression on a fixed design. Probab. Theory Related Fields, 117(4):467–493, 2000.
[10] Yannick Baraud. Model selection for regression on a random design. ESAIM Probab. Statist., 6:127–146 (electronic), 2002.
[11] A. Barron, L. Birgé, and P. Massart. Risk bounds for model selection via penalization. Probab. Theory Relat. Fields, 113:301–413, 1999.
[12] M. Basseville and N. Nikiforov. The Detection of Abrupt Changes: Theory and Applications. Prentice-Hall: Information and System Sciences Series, 1993.
[13] R. E. Bellman and S. E. Dreyfus. Applied Dynamic Programming. Princeton, 1962.
[14] L. Birgé and P. Massart. From model selection to adaptive estimation. In D. Pollard, E. Torgersen, and G. Yang, editors, Festschrift for Lucien Le Cam: Research Papers in Probability and Statistics, pages 55–87. Springer-Verlag, New York, 1997.
[15] L. Birgé and P. Massart. Gaussian model selection. J. European Math. Soc., 3(3):203–268, 2001.
[16] Lucien Birgé and Pascal Massart. Minimal penalties for Gaussian model selection. Probab. Theory Related Fields, 138(1-2):33–73, 2007.
[17] B. Brodsky and B. Darkhovsky. Methods in Change-Point Problems. Kluwer Academic Publishers, Dordrecht, The Netherlands, 1993.
[18] P. Burman. A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods. Biometrika, 76(3):503–514, 1989.
[19] A. Celisse. Density estimation via cross-validation: Model selection point of view. Technical report, arXiv, 2008.
[20] A. Celisse and S. Robin. Nonparametric density estimation by exact leave-p-out cross-validation. Computational Statistics and Data Analysis, 52(5):2350–2368, 2008.
[21] Alain Celisse. Model selection via cross-validation in density estimation, regression and change-points detection. PhD thesis, University Paris-Sud 11, December 2008. oai:tel.archives-ouvertes.fr:tel-00346320_v1.
[22] S. Dudoit and M. van der Laan. Asymptotics of cross-validated risk estimation in estimator selection and performance assessment. Statistical Methodology, 2(2):131–154, 2005.
[23] S. Geisser. A predictive approach to the random effect model. Biometrika, 61(1):101–107, 1974.
[24] Seymour Geisser. The predictive sample reuse method with applications. J. Amer. Statist. Assoc., 70:320–328, 1975.
[25] Xavier Gendre. Simultaneous estimation of the mean and the variance in heteroscedastic Gaussian regression, 2008.
[26] M. Kearns, Y. Mansour, A. Y.
Ng, and D. Ron. An experimental and theoretical comparison of model selection methods. Machine Learning, 27:7–50, 1997.
[27] P. A. Lachenbruch and M. R. Mickey. Estimation of error rates in discriminant analysis. Technometrics, 10(1):1–11, 1968.
[28] M. Lavielle. Using penalized contrasts for the change-point problem. Signal Processing, 85:1501–1510, 2005.
[29] M. Lavielle and G. Teyssière. Detection of multiple change-points in multivariate time series. Lithuanian Mathematical Journal, 46:287–306, 2006.
[30] É. Lebarbier. Detecting multiple change-points in the mean of a Gaussian process by model selection. Signal Processing, 85:717–736, 2005.
[31] K.-C. Li. Asymptotic optimality for C_p, C_L, cross-validation and generalized cross-validation: discrete index set. The Annals of Statistics, 15(3):958–975, 1987.
[32] C. L. Mallows. Some comments on C_p. Technometrics, 15:661–675, 1973.
[33] B. Q. Miao and L. C. Zhao. On detection of change points when the number is unknown. Chinese J. Appl. Probab. Statist., 9(2):138–145, 1993.
[34] D. Picard. Testing and estimating change-points in time series. J. Appl. Probab., 17:841–867, 1985.
[35] F. Picard. Process segmentation/clustering: Application to the analysis of array CGH data. PhD thesis, Université Paris-Sud 11, 2005.
[36] F. Picard, S. Robin, M. Lavielle, C. Vaisse, and J.-J. Daudin. A statistical approach for array CGH data analysis. BMC Bioinformatics, 27(6), 2005 (electronic access).
[37] Franck Picard, Stéphane Robin, Émilie Lebarbier, and Jean-Jacques Daudin. A segmentation/clustering model for the analysis of array CGH data. Biometrics, 2007. To appear. doi:10.1111/j.1541-0420.2006.00729.x.
[38] J. Rissanen. A universal prior for integers and estimation by minimum description length. The Annals of Statistics, 11(2):416–431, 1983.
[39] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
[40] J. Shao.
An asymptotic theory for linear model selection. Statistica Sinica, 7:221–264, 1997.
[41] R. Shibata. An optimal selection of regression variables. Biometrika, 68:45–54, 1981.
[42] C. J. Stone. An asymptotically optimal window selection rule for kernel density estimates. The Annals of Statistics, 12(4):1285–1297, 1984.
[43] M. Stone. Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist. Soc. Ser. B, 36:111–147, 1974.
[44] M. Stone. An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. J. Roy. Statist. Soc. Ser. B, 39(1):44–47, 1977.
[45] R. Tibshirani and K. Knight. The covariance inflation criterion for adaptive model selection. J. Roy. Statist. Soc. Ser. B, 61(3):529–546, 1999.
[46] Y. Yang. Regression with multiple candidate models: selecting or mixing? Statist. Sinica, 13:783–809, 2003.
[47] Y. Yang. Comparing learning methods for classification. Statistica Sinica, 16:635–657, 2006.
[48] Y. Yang. Consistency of cross-validation for comparing regression procedures. The Annals of Statistics, 35(6):2450–2473, 2007.
[49] Y. Yao. Estimating the number of change-points via Schwarz criterion. Statist. Probab. Lett., 6:181–189, 1988.

Supplementary material for "Segmentation of the mean of heteroscedastic data via cross-validation"
Sylvain Arlot and Alain Celisse
October 30, 2018

1 Calibration of Birgé and Massart's penalization

Birgé and Massart's penalization makes use of the penalty

    pen_BM(D) := Ĉ (D/n) ( 5 + 2 log(n/D) ) .

In a previous version of this work [6, Chapter 7], Ĉ was defined as suggested in [7, 8], that is, Ĉ = 2 K̂_max.jump with the notation below. This yielded poor performances, which seemed related to the definition of Ĉ. Therefore, alternative definitions of Ĉ have been investigated, leading to the choice Ĉ = 2 K̂_thresh. throughout the paper, where K̂_thresh. is defined by (2) below. The present appendix intends to motivate this choice.
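For concreteness, the threshold-based calibration retained in the paper (anticipating definition (2) below) can be sketched in a few lines of Python; this is our illustrative code, not the authors' implementation, and the names are ours. Given the minimal empirical risk cost[D] over models of each dimension D, the penalized dimension D̂(K) is computed on an increasing grid of K values, and K̂_thresh. is the first K at which D̂(K) falls below n/ln(n).

```python
import math
import numpy as np

def K_thresh(cost, n, K_grid):
    """Threshold-based slope heuristics:
    return min{ K in K_grid : D_hat(K) <= n / ln(n) },
    where D_hat(K) = argmin_D { cost[D] + K * (D/n) * (5 + 2*log(n/D)) }
    and cost[D] is the minimal empirical risk at dimension D = 1..len(cost)."""
    D = np.arange(1, len(cost) + 1)
    pen_shape = (D / n) * (5.0 + 2.0 * np.log(n / D))   # pen_BM shape, C = 1
    D_thresh = n / math.log(n)
    for K in K_grid:                                    # K_grid assumed increasing
        D_hat = D[np.argmin(cost + K * pen_shape)]
        if D_hat <= D_thresh:
            return K
    return None                                         # grid too short
```

In practice cost[D] would come from the dynamic-programming step; here any decreasing sequence illustrates the mechanism.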
Two main approaches have been considered in the literature for defining Ĉ in the penalty pen_BM:

• Use Ĉ = σ̂², any estimator of the noise level, for instance

    σ̂² := (1/n) Σ_{i=1}^{n/2} ( Y_{2i} − Y_{2i−1} )² ,   (1)

  assuming n is even and t_1 < ⋯ < t_n.

• Use Birgé and Massart's slope heuristics, that is, compute the sequence

    D̂(K) := argmin_{D ∈ D_n} { P_n γ( ŝ_{m̂_ERM(D)} ) + K (D/n) ( 5 + 2 log(n/D) ) } ,

  find the (unique) K = K̂_jump at which D̂(K) jumps from large to small values, and define Ĉ = 2 K̂_jump.

    s      σ         2K̂_max.jump     2K̂_thresh.     σ̂²             σ²_true
    s1     σ_c        6.85 ± 0.12     3.91 ± 0.03    1.74 ± 0.02    2.05 ± 0.02
    s1     σ_pc,3    17.56 ± 0.15    13.08 ± 0.04    4.42 ± 0.04   10.43 ± 0.05
    s1     σ_s       20.07 ± 0.31     9.41 ± 0.04    2.18 ± 0.03    1.66 ± 0.02
    s2     σ_c        6.02 ± 0.03     5.27 ± 0.03    3.58 ± 0.02    3.54 ± 0.02
    s2     σ_pc,3    17.76 ± 0.10    20.12 ± 0.07   10.58 ± 0.07   16.64 ± 0.08
    s2     σ_s       10.17 ± 0.05     9.69 ± 0.04    5.28 ± 0.03   10.95 ± 0.02
    s3     σ_c        4.97 ± 0.02     4.39 ± 0.01    4.62 ± 0.01    4.21 ± 0.01
    s3     σ_pc,3     8.66 ± 0.03     8.47 ± 0.03    6.64 ± 0.02    8.00 ± 0.03
    s3     σ_s        8.50 ± 0.04     7.59 ± 0.03    5.94 ± 0.02   15.50 ± 0.04
    A      –          7.52 ± 0.04     6.82 ± 0.03    4.86 ± 0.03    5.55 ± 0.03
    B      –          7.89 ± 0.04     7.21 ± 0.04    5.18 ± 0.03    5.77 ± 0.03
    C      –         12.81 ± 0.08    13.49 ± 0.07    8.93 ± 0.06   12.44 ± 0.07

Table 1: Performance C_or(BM) with four different definitions of Ĉ (see text), in some of the simulation settings considered in the paper. In each setting, N = 10000 independent samples have been generated. Next to each value is indicated the corresponding empirical standard deviation divided by √N.

The first approach follows from theoretical and experimental results [4, 8], which show that Ĉ should be close to σ² when the noise level is constant; (1) is a classical estimator of the variance, used for instance by Baraud [3] for model selection in a different setting. The optimality (in terms of oracle inequalities) of the second approach has been proved for regression with homoscedastic Gaussian noise and possibly exponential collections of
models [5], as well as in a heteroscedastic framework with polynomial collections of models [2]. In the context of change-point detection with homoscedastic data, Lavielle [7] and Lebarbier [8] showed that Ĉ = 2 K̂_max.jump can even perform better than Ĉ = σ², where K̂_max.jump corresponds to the highest jump of D̂(K). Alternatively, it was proposed in [2] to define Ĉ = 2 K̂_thresh., where

    K̂_thresh. := min{ K  s.t.  D̂(K) ≤ D_thresh. := n / ln(n) } .   (2)

These three definitions of Ĉ have been compared with Ĉ = σ²_true := n⁻¹ Σ_{i=1}^{n} σ(t_i)² in the settings of the paper. A representative part of the results is reported in Table 1. The main conclusions are the following.

• 2 K̂_thresh. almost always beats 2 K̂_max.jump, even in homoscedastic settings. This confirms some simulation results reported in [2].

• σ²_true often beats the slope heuristics-based definitions of Ĉ, but not always, as previously noticed by Lebarbier [8]. Differences in performance can be huge (in particular when σ = σ_s), but not always in favour of σ²_true (for instance, when s = s_3).

• σ̂² yields significantly better performance than σ²_true in most settings (but not all), with huge margins in some heteroscedastic settings.

The latter result actually comes from an artefact, which can be explained as follows. First,

    E[σ̂²] = (1/n) Σ_{i=1}^{n} σ(t_i)² + (1/n) Σ_{i=1}^{n/2} ( s(t_{2i}) − s(t_{2i−1}) )² ≥ (1/n) Σ_{i=1}^{n} σ(t_i)² = σ²_true .

The difference between these expectations is not negligible in all the settings of the paper. For instance, when n = 100, t_i = i/n and s = s_1, n⁻¹ Σ_i ( s(t_{2i}) − s(t_{2i−1}) )² = 0.04, whereas σ²_true varies between 0.015 (when σ = σ_pc,1) and 0.093 (when σ = σ_pc,3).
Nevertheless, σ̂² would not overestimate σ²_true at all in a very close setting: shifting the jumps of s_1 by 1/100 is sufficient to make n⁻¹ Σ_i ( s(t_{2i}) − s(t_{2i−1}) )² equal to zero, and the performance of BM with Ĉ = σ̂² would then be very close to the performance of BM with Ĉ = σ²_true.

Second, overpenalization turns out to improve the results of BM in most of the heteroscedastic settings considered in the paper. The reason for this phenomenon is illustrated by the right panel of Figure 4. Indeed, pen_BM is a poor penalty when data are heteroscedastic, underpenalizing dimensions close to the oracle but overpenalizing the largest dimensions (remember that Ĉ = 2 K̂_thresh. in Figure 4). Then, in a setting like (s_2, σ_pc,3), multiplying pen_BM by a factor C_over > 1 helps decreasing the selected dimension; the same cause has different consequences in other settings, such as (s_1, σ_s) or (s_3, σ_c). Nevertheless, even choosing Ĉ using both P_n and s, (crit_BM(D))_{D>0} remains a poor estimate of ( ‖s − ŝ_{m̂_ERM(D)}‖²_n )_{D>0} in most heteroscedastic settings (even up to an additive constant).

To conclude, pen_BM with Ĉ = σ̂² is not a reliable change-point detection procedure, and the apparently good performances observed in Table 1 could be misleading. This leads to the remaining choice Ĉ = 2 K̂_thresh., which has been used throughout the paper, although this calibration method may certainly be improved. The results of Table 1 for Ĉ = σ²_true indicate how far the performance of pen_BM could be improved without overpenalization. According to Tables 4 and 5, BM with Ĉ = σ²_true has significantly better performance than (J_ERM, K̂_VF5) or (J_Loo, K̂_VF5) only in the three homoscedastic settings and in setting (s_1, σ_s).
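The bias artefact of estimator (1) discussed above is easy to reproduce numerically. The sketch below is our illustrative code (function name ours): it implements (1) and shows that a jump of s falling inside a pair (t_{2i−1}, t_{2i}) inflates the estimate, exactly as in the expectation computation above.

```python
import numpy as np

def dyadic_variance_estimate(y):
    """Difference-based variance estimator of equation (1):
    sigma2_hat = (1/n) * sum_{i=1}^{n/2} (Y_{2i} - Y_{2i-1})^2,
    assuming n is even and the design points are sorted.
    Each squared pair difference has expectation
    sigma(t_{2i-1})^2 + sigma(t_{2i})^2 + (s(t_{2i}) - s(t_{2i-1}))^2,
    so dividing by n recovers sigma^2_true plus a jump term."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    assert n % 2 == 0, "n must be even"
    diffs = y[1::2] - y[0::2]          # (Y_2 - Y_1), (Y_4 - Y_3), ...
    return float(np.sum(diffs ** 2) / n)
```

On pure noise of variance σ² the estimate concentrates around σ², while a unit jump inside every pair shifts its expectation by 1/2.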
Finally, overpenalization could be used to improve BM, but choosing the overpenalization factor from the data is a difficult problem, especially without knowing a priori whether the signal is homoscedastic or heteroscedastic. This question deserves a specific extensive simulation experiment. To be completely fair with CV methods, such an experiment should also compare BM with overpenalization to V-fold penalization [1] with overpenalization, for choosing the number of change-points.

2 Random frameworks generation

The purpose of this appendix is to detail how the piecewise constant functions s and σ have been generated in frameworks A, B and C of Section 5.3. In each framework, s and σ are of the form

    s(x) = Σ_{j=0}^{K_s−1} α_j 1_{[a_j; a_{j+1})} + α_{K_s} 1_{[a_{K_s}; a_{K_s+1}]}   with a_0 = 0 < a_1 < ⋯ < a_{K_s+1} = 1 ,

    σ(x) = Σ_{j=0}^{K_σ−1} β_j 1_{[b_j; b_{j+1})} + β_{K_σ} 1_{[b_{K_σ}; b_{K_σ+1}]}   with b_0 = 0 < b_1 < ⋯ < b_{K_σ+1} = 1 ,

for some positive integers K_s, K_σ and real numbers α_0, …, α_{K_s} ∈ R and β_0, …, β_{K_σ} > 0.

Remark 1. Frameworks A, B and C depend on the sample size n, through the distributions of K_s, K_σ, and of the sizes of the intervals [a_j; a_{j+1}) and [b_j; b_{j+1}). This ensures that the signal-to-noise ratio remains rather small, so that the quadratic risk remains an adequate performance measure for change-point detection. When the signal-to-noise ratio is larger (that is, when all jumps of s are much larger than the noise level, and the number of jumps of s is small compared to the sample size), the change-point detection problem is of a different nature. In particular, the number of change-points would be better estimated with procedures targeting identification (such as BIC, or even larger penalties) than efficiency (such as VFCV).
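In code, the common piecewise constant form of s and σ above amounts to a step-function lookup; a minimal sketch (function name ours, not from the paper):

```python
import numpy as np

def piecewise_constant(x, bounds, levels):
    """Evaluate a step function with breakpoints bounds[0] = 0 < ... < bounds[-1] = 1
    and one level per interval; intervals are left-closed, the last one is closed."""
    x = np.asarray(x, dtype=float)
    # Index of the interval containing each x (clip handles x = 1 exactly)
    idx = np.clip(np.searchsorted(bounds, x, side="right") - 1, 0, len(levels) - 1)
    return np.asarray(levels)[idx]
```

With breakpoints a_j and levels α_j this evaluates s; with b_j and β_j it evaluates σ.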
2.1 Framework A

In framework A, s and σ are generated as follows:

• K_s, the number of jumps of s, has uniform distribution over {3, …, ⌊√n⌋}.

• For 0 ≤ j ≤ K_s,

    a_{j+1} − a_j = Δ^s_min + ( 1 − (K_s + 1) Δ^s_min ) U_j / Σ_{k=0}^{K_s} U_k ,

  with Δ^s_min = min{ 5/n, 1/(K_s + 1) }, where U_0, …, U_{K_s} are i.i.d. with uniform distribution over [0; 1].

• α_0 = V_0 and, for 1 ≤ j ≤ K_s, α_j = α_{j−1} + V_j, where V_0, …, V_{K_s} are i.i.d. with uniform distribution over [−1; −0.1] ∪ [0.1; 1].

• K_σ, the number of jumps of σ, has uniform distribution over {5, …, ⌊√n⌋}.

• For 0 ≤ j ≤ K_σ,

    b_{j+1} − b_j = Δ^σ_min + ( 1 − (K_σ + 1) Δ^σ_min ) U′_j / Σ_{k=0}^{K_σ} U′_k ,

  with Δ^σ_min = min{ 5/n, 1/(K_σ + 1) }, where U′_0, …, U′_{K_σ} are i.i.d. with uniform distribution over [0; 1].

• β_0, …, β_{K_σ} are i.i.d. with uniform distribution over [0.05; 0.5].

Two examples of a function s and a sample (t_i, Y_i) generated in framework A are plotted in Figure 1.

2.2 Framework B

The only difference with framework A is that U_0, …, U_{K_s} are i.i.d. with the same distribution as Z = |10 Z_1 + Z_2|, where Z_1 has a Bernoulli distribution with parameter 1/2 and Z_2 has a standard Gaussian distribution. Two examples of a function s and a sample (t_i, Y_i) generated in framework B are plotted in Figure 2.

[Figure 1: Random framework A: two examples of a sample (t_i, Y_i)_{1≤i≤100} and the corresponding regression function s.]

[Figure 2: Random framework B: two examples of a sample (t_i, Y_i)_{1≤i≤100} and the corresponding regression function s.]
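As an illustration, the framework-A recipe above can be sketched as follows. This is our code, not the authors' (names are ours); the uniform draw on [−1; −0.1] ∪ [0.1; 1] is implemented as a magnitude on [0.1; 1] times a random sign.

```python
import numpy as np

def generate_framework_A(n, rng):
    """Illustrative generator for framework A: returns the breakpoints and
    levels of the piecewise constant mean s and noise level sigma."""
    k_max = int(np.floor(np.sqrt(n)))
    Ks = rng.integers(3, k_max + 1)       # number of jumps of s
    Ksig = rng.integers(5, k_max + 1)     # number of jumps of sigma

    def breakpoints(K):
        # Each interval gets a minimal length plus a random share of the rest
        d_min = min(5.0 / n, 1.0 / (K + 1))
        U = rng.uniform(0.0, 1.0, K + 1)
        gaps = d_min + (1.0 - (K + 1) * d_min) * U / U.sum()
        return np.concatenate(([0.0], np.cumsum(gaps)))   # 0 = a_0 < ... < 1

    a = breakpoints(Ks)
    b = breakpoints(Ksig)

    # Jump heights of s: uniform on [-1,-0.1] U [0.1,1], then cumulated
    V = rng.uniform(0.1, 1.0, Ks + 1) * rng.choice([-1.0, 1.0], Ks + 1)
    alpha = np.cumsum(V)

    # Noise levels: uniform on [0.05, 0.5]
    beta = rng.uniform(0.05, 0.5, Ksig + 1)
    return a, alpha, b, beta
```

Frameworks B and C would only change the distribution of the interval shares U_j and of the β_j, as described in the text.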
2.3 Framework C

The main difference between frameworks C and B is that [0; 1] is split into two regions: a_{K_{s,1}+1} = 1/2 and K_s = K_{s,1} + K_{s,2} + 1 for some positive integers K_{s,1}, K_{s,2}, and the bounds of the distribution of β_j are larger when b_j ≥ 1/2 and smaller when b_j < 1/2. Two examples of a function s and a sample (t_i, Y_i) generated in framework C are plotted in Figure 3.

More precisely, s and σ are generated as follows:

• K_{s,1} has uniform distribution over {2, …, K_max,1} with K_max,1 = ⌊√n⌋ − 1 − ⌊(⌊√n⌋ − 1)/3⌋.

• K_{s,2} has uniform distribution over {0, …, K_max,2} with K_max,2 = ⌊(⌊√n⌋ − 1)/3⌋.

• Let U_0, …, U_{K_s} be i.i.d. random variables with the same distribution as Z = |10 Z_1 + Z_2|, where Z_1 has a Bernoulli distribution with parameter 1/2 and Z_2 has a standard Gaussian distribution.

• For 0 ≤ j ≤ K_{s,1},

    a_{j+1} − a_j = Δ^{s,1}_min + ( 1 − (K_{s,1} + 1) Δ^{s,1}_min ) U_j / Σ_{k=0}^{K_{s,1}} U_k ,

  with Δ^{s,1}_min = min{ 5/n, 1/(K_{s,1} + 1) }.

• For K_{s,1} + 1 ≤ j ≤ K_s,

    a_{j+1} − a_j = Δ^{s,2}_min + ( 1 − (K_{s,2} + 1) Δ^{s,2}_min ) U_j / Σ_{k=K_{s,1}+1}^{K_s} U_k ,

  with Δ^{s,2}_min = min{ 5/n, 1/(K_{s,2} + 1) }.

• α_0 = V_0 and, for 1 ≤ j ≤ K_s, α_j = α_{j−1} + V_j, where V_0, …, V_{K_s} are i.i.d. with uniform distribution over [−1; −0.1] ∪ [0.1; 1].

• K_σ and (b_{j+1} − b_j)_{0 ≤ j ≤ K_σ} are distributed as in frameworks A and B.

• β_0, …, β_{K_σ} are independent. When b_j < 1/2, β_j has uniform distribution over [0.025; 0.2]. When b_j ≥ 1/2, β_j has uniform distribution over [0.1; 0.8].

3 Additional results from the simulation study

The next pages present extended versions of the tables of the main paper, as well as an extended version of Table 1 (Table 7).
[Figure 3: Random framework C: two examples of a sample (t_i, Y_i)_{1≤i≤100} and the corresponding regression function s.]

References

[1] Sylvain Arlot. V-fold cross-validation improved: V-fold penalization, February 2008.
[2] Sylvain Arlot and Pascal Massart. Data-driven calibration of penalties for least-squares regression. J. Mach. Learn. Res., 10:245–279 (electronic), 2009.
[3] Yannick Baraud. Model selection for regression on a random design. ESAIM Probab. Statist., 6:127–146 (electronic), 2002.
[4] L. Birgé and P. Massart. Gaussian model selection. J. European Math. Soc., 3(3):203–268, 2001.
[5] Lucien Birgé and Pascal Massart. Minimal penalties for Gaussian model selection. Probab. Theory Related Fields, 138(1-2):33–73, 2007.
[6] Alain Celisse. Model selection via cross-validation in density estimation, regression and change-points detection. PhD thesis, University Paris-Sud 11, December 2008. oai:tel.archives-ouvertes.fr:tel-00346320_v1.
[7] M. Lavielle. Using penalized contrasts for the change-point problem. Signal Processing, 85:1501–1510, 2005.
[8] É. Lebarbier. Detecting multiple change-points in the mean of a Gaussian process by model selection. Signal Processing, 85:717–736, 2005.
    s      σ         ERM            Loo            Lpo_20         Lpo_50
    s1     σ_c       1.59 ± 0.01    1.60 ± 0.02    1.58 ± 0.01    1.58 ± 0.01
    s1     σ_pc,1    1.04 ± 0.01    1.06 ± 0.01    1.06 ± 0.01    1.06 ± 0.01
    s1     σ_pc,2    1.89 ± 0.02    1.87 ± 0.02    1.87 ± 0.02    1.87 ± 0.02
    s1     σ_pc,3    2.05 ± 0.02    2.05 ± 0.02    2.05 ± 0.02    2.07 ± 0.02
    s1     σ_s       1.54 ± 0.02    1.52 ± 0.02    1.52 ± 0.02    1.51 ± 0.02
    s2     σ_c       2.88 ± 0.01    2.93 ± 0.01    2.93 ± 0.01    2.94 ± 0.01
    s2     σ_pc,1    1.31 ± 0.02    1.16 ± 0.02    1.14 ± 0.02    1.11 ± 0.01
    s2     σ_pc,2    2.88 ± 0.02    2.24 ± 0.02    2.19 ± 0.02    2.13 ± 0.02
    s2     σ_pc,3    3.09 ± 0.03    2.52 ± 0.03    2.48 ± 0.03    2.32 ± 0.03
    s2     σ_s       3.01 ± 0.01    3.03 ± 0.01    3.05 ± 0.01    3.13 ± 0.01
    s3     σ_c       3.18 ± 0.01    3.25 ± 0.01    3.29 ± 0.01    3.44 ± 0.01
    s3     σ_pc,1    3.00 ± 0.01    2.67 ± 0.02    2.68 ± 0.02    2.77 ± 0.02
    s3     σ_pc,2    4.06 ± 0.02    3.63 ± 0.02    3.64 ± 0.02    3.78 ± 0.02
    s3     σ_pc,3    4.41 ± 0.02    3.97 ± 0.02    4.00 ± 0.02    4.11 ± 0.02
    s3     σ_s       4.02 ± 0.01    3.82 ± 0.01    3.85 ± 0.01    3.98 ± 0.01

Table 2: Average performance C_or(J_P, K̂_Id) for change-point detection procedures P among ERM, Loo and Lpo_p with p = 20 and p = 50. Several regression functions s and noise-level functions σ have been considered, each time with N = 10000 independent samples. Next to each value is indicated the corresponding empirical standard deviation divided by √N, measuring the uncertainty of the estimated performance.
    s      σ         Oracle         VF_5           BM
    s1     σ_c       1.59 ± 0.01    5.40 ± 0.05    3.91 ± 0.03
    s1     σ_pc,1    1.04 ± 0.01   11.96 ± 0.03   12.85 ± 0.04
    s1     σ_pc,2    1.89 ± 0.02    6.43 ± 0.05   13.03 ± 0.04
    s1     σ_pc,3    2.05 ± 0.02    4.96 ± 0.05   13.08 ± 0.04
    s1     σ_s       1.54 ± 0.02    7.33 ± 0.06    9.41 ± 0.04
    s2     σ_c       2.88 ± 0.01    4.51 ± 0.03    5.27 ± 0.03
    s2     σ_pc,1    1.31 ± 0.02   11.67 ± 0.09   19.36 ± 0.07
    s2     σ_pc,2    2.88 ± 0.02    6.58 ± 0.06   19.82 ± 0.07
    s2     σ_pc,3    3.09 ± 0.03    6.66 ± 0.06   20.12 ± 0.07
    s2     σ_s       3.01 ± 0.01    5.21 ± 0.04    9.69 ± 0.40
    s3     σ_c       3.18 ± 0.01    4.41 ± 0.02    4.39 ± 0.01
    s3     σ_pc,1    3.00 ± 0.01    4.91 ± 0.02    6.50 ± 0.02
    s3     σ_pc,2    4.06 ± 0.02    5.99 ± 0.02    7.86 ± 0.03
    s3     σ_pc,3    4.41 ± 0.02    6.32 ± 0.02    8.47 ± 0.03
    s3     σ_s       4.02 ± 0.01    5.97 ± 0.03    7.59 ± 0.03

Table 3: Performance C_or(J_ERM, K̂_P) for P = Id (that is, choosing the dimension D⋆ := argmin_{D ∈ D_n} { ‖s − ŝ_{m̂_ERM(D)}‖²_n }), P = VF_V with V = 5, or P = BM. Several regression functions s and noise-level functions σ have been considered, each time with N = 10000 independent samples. Next to each value is indicated the corresponding empirical standard deviation divided by √N, measuring the uncertainty of the estimated performance.
    s      σ         (J_ERM, K̂_VF5)   (J_Loo, K̂_VF5)   (J_Lpo20, K̂_VF5)   (J_Lpo50, K̂_VF5)   (J_ERM, K̂_BM)
    s1     σ_c        5.40 ± 0.05       5.03 ± 0.05       5.10 ± 0.05         5.24 ± 0.05         3.91 ± 0.03
    s1     σ_pc,1    11.96 ± 0.03      10.25 ± 0.03      10.28 ± 0.03        10.66 ± 0.04        12.85 ± 0.04
    s1     σ_pc,2     6.43 ± 0.05       5.83 ± 0.05       5.99 ± 0.05         6.20 ± 0.05        13.03 ± 0.04
    s1     σ_pc,3     4.96 ± 0.05       4.82 ± 0.04       4.79 ± 0.05         5.02 ± 0.05        13.08 ± 0.04
    s1     σ_s        7.33 ± 0.06       6.82 ± 0.05       6.99 ± 0.06         6.91 ± 0.06         9.41 ± 0.04
    s2     σ_c        4.51 ± 0.03       4.55 ± 0.03       4.50 ± 0.03         4.73 ± 0.03         5.27 ± 0.03
    s2     σ_pc,1    11.67 ± 0.09      10.26 ± 0.08      10.29 ± 0.08        10.45 ± 0.09        19.36 ± 0.07
    s2     σ_pc,2     6.58 ± 0.06       5.85 ± 0.06       5.85 ± 0.06         5.49 ± 0.06        19.82 ± 0.07
    s2     σ_pc,3     6.66 ± 0.06       5.81 ± 0.06       5.74 ± 0.06         5.66 ± 0.06        20.12 ± 0.06
    s2     σ_s        5.21 ± 0.04       5.19 ± 0.03       5.17 ± 0.03         5.51 ± 0.04         9.69 ± 0.04
    s3     σ_c        4.41 ± 0.02       4.54 ± 0.02       4.62 ± 0.02         4.94 ± 0.02         4.39 ± 0.01
    s3     σ_pc,1     4.91 ± 0.02       4.40 ± 0.02       4.44 ± 0.02         4.69 ± 0.02         6.50 ± 0.02
    s3     σ_pc,2     5.99 ± 0.02       5.34 ± 0.02       5.42 ± 0.02         5.75 ± 0.02         7.86 ± 0.03
    s3     σ_pc,3     6.32 ± 0.02       5.74 ± 0.02       5.81 ± 0.02         6.24 ± 0.02         8.47 ± 0.03
    s3     σ_s        5.97 ± 0.02       5.72 ± 0.02       5.86 ± 0.02         6.07 ± 0.02         7.59 ± 0.03

Table 4: Performance C_or(P) for several change-point detection procedures P. Several regression functions s and noise-level functions σ have been considered, each time with N = 10000 independent samples. Next to each value is indicated the corresponding empirical standard deviation.

    Framework             A              B              C
    (J_ERM, K̂_BM)        6.82 ± 0.03    7.21 ± 0.04   13.49 ± 0.07
    (J_ERM, K̂_VF5)       4.78 ± 0.03    5.09 ± 0.03    7.17 ± 0.05
    (J_Loo, K̂_VF5)       4.65 ± 0.03    4.88 ± 0.03    6.61 ± 0.05
    (J_Lpo20, K̂_VF5)     4.78 ± 0.03    4.91 ± 0.03    6.49 ± 0.05
    (J_Lpo50, K̂_VF5)     4.97 ± 0.03    5.18 ± 0.04    6.69 ± 0.05

Table 5: Performance C^(R)_or(P) of several model selection procedures P in frameworks A, B, C with sample size n = 100. In each framework, N = 10000 independent samples have been considered. Next to each value is indicated the corresponding empirical standard deviation divided by √N.
    Framework             A              B              C
    (J_ERM, K̂_BM)        9.04 ± 0.12   11.62 ± 0.14   21.21 ± 0.31
    (J_ERM, K̂_BM,σ̂)     5.34 ± 0.10    6.24 ± 0.11   11.48 ± 0.22
    (J_ERM, K̂_VF5)       5.10 ± 0.11    5.92 ± 0.11    7.31 ± 0.14
    (J_Loo, K̂_VF5)       4.90 ± 0.11    5.63 ± 0.11    6.89 ± 0.16
    (J_Lpo20, K̂_VF5)     4.88 ± 0.10    5.55 ± 0.10    6.82 ± 0.15
    (J_Lpo50, K̂_VF5)     5.11 ± 0.11    5.49 ± 0.10    7.14 ± 0.15

Table 6: Performance C^(R)_or(P) of several model selection procedures P in frameworks A, B, C with sample size n = 200. In each framework, N = 1000 independent samples have been considered. Next to each value is indicated the corresponding empirical standard deviation divided by √N.

    s      σ         2K̂_max.jump     2K̂_thresh.     σ̂²             σ²_true
    s1     σ_c        6.85 ± 0.12     3.91 ± 0.03    1.74 ± 0.02    2.05 ± 0.02
    s1     σ_pc,1    70.97 ± 1.18    12.85 ± 0.04    1.13 ± 0.02   10.20 ± 0.05
    s1     σ_pc,2    23.74 ± 0.26    13.03 ± 0.04    3.55 ± 0.04   10.43 ± 0.05
    s1     σ_pc,3    17.56 ± 0.15    13.08 ± 0.04    4.42 ± 0.04   10.43 ± 0.05
    s1     σ_s       20.07 ± 0.31     9.41 ± 0.04    2.18 ± 0.03    1.66 ± 0.02
    s2     σ_c        6.02 ± 0.03     5.27 ± 0.03    3.58 ± 0.02    3.54 ± 0.02
    s2     σ_pc,1    17.83 ± 0.10    19.36 ± 0.07    8.52 ± 0.06   15.62 ± 0.08
    s2     σ_pc,2    17.63 ± 0.10    19.82 ± 0.07   10.77 ± 0.07   16.56 ± 0.08
    s2     σ_pc,3    17.76 ± 0.10    20.12 ± 0.07   10.58 ± 0.07   16.64 ± 0.08
    s2     σ_s       10.17 ± 0.05     9.69 ± 0.04    5.28 ± 0.03   10.95 ± 0.02
    s3     σ_c        4.97 ± 0.02     4.39 ± 0.01    4.62 ± 0.01    4.21 ± 0.01
    s3     σ_pc,1     7.18 ± 0.03     6.50 ± 0.02    4.52 ± 0.02    6.70 ± 0.03
    s3     σ_pc,2     8.14 ± 0.03     7.86 ± 0.03    6.22 ± 0.02    7.55 ± 0.03
    s3     σ_pc,3     8.66 ± 0.03     8.47 ± 0.03    6.64 ± 0.02    8.00 ± 0.03
    s3     σ_s        8.50 ± 0.04     7.59 ± 0.03    5.94 ± 0.02   15.50 ± 0.04
    A      –          7.52 ± 0.04     6.82 ± 0.03    4.86 ± 0.03    5.55 ± 0.03
    B      –          7.89 ± 0.04     7.21 ± 0.04    5.18 ± 0.03    5.77 ± 0.03
    C      –         12.81 ± 0.08    13.49 ± 0.07    8.93 ± 0.06   12.44 ± 0.07

Table 7: Performance C_or(BM) with four different definitions of Ĉ (see text), in some of the simulation settings considered in the paper. In each setting, N = 10000 independent samples have been generated. Next to each value is indicated the corresponding empirical standard deviation divided by √N.