On penalized estimation for dynamical systems with small noise
Authors: Alessandro De Gregorio, Stefano M. Iacus
Alessandro De Gregorio∗   Stefano M. Iacus†

September 10, 2018

Abstract

We consider a dynamical system with small noise for which the drift is parametrized by a finite-dimensional parameter. For this model we consider minimum distance estimation from continuous-time observations under an $l^p$-penalty imposed on the parameters, in the spirit of the Lasso approach, with the aim of simultaneous estimation and model selection. We study the consistency and the asymptotic distribution of these Lasso-type estimators for different values of $p$. For $p = 1$ we also consider the adaptive version of the Lasso estimator and establish its oracle properties.

Keywords: dynamical systems, lasso estimation, model selection, inference for stochastic processes, diffusion-type processes, oracle properties.

1 Introduction

Ordinary differential equation models are usually the result of averaging and/or neglecting some details of an original system, rather than of modeling a complex system with a huge number of degrees of freedom or tuning parameters. Introducing noise is therefore a way to approximate more closely the reality of observable complex systems. It is then natural to think of the noise as small, for example when one is considering the dynamics of macroscopic quantities, i.e. averages of quantities of interest over a whole population, or in the case of a signal that travels through a perturbed medium, etcetera. Dynamical systems with small perturbations have indeed been widely studied in Azencott [1982] and Freidlin and Wentzell [1998].

∗ Department of Statistical Sciences, P.le Aldo Moro 5, 00185 - Rome, Italy - alessandro.degregorio@uniroma1.it
† Department of Economics, Management and Quantitative Methods, Via Conservatorio 7, 20122 - Milan, Italy - stefano.iacus@unimi.it
Applications of small diffusion processes to mathematical finance and option pricing have been considered in Yoshida [1992a], Kunitomo and Takahashi [2001], Takahashi and Yoshida [2004], Uchida and Yoshida [2004a] and references therein. Examples from biology and life sciences include Murray [2002], Bressloff [2014], Ermentrout and Terman [2010].

Model selection is an important aspect in the above applied fields, although sometimes neglected. What occurs for dynamical systems with small noise is not so different from what happens in ordinary least squares (OLS) model estimation. Indeed, linear regression models are used extensively by many practitioners but, once estimated, these models are useful only as long as the set of parameters (or covariates) is correctly specified. Therefore, the model selection step is an important part of the analysis.

To introduce the idea of Lasso-type estimation, we begin with linear models and OLS. In this framework, model selection occurs when some of the regression parameters are estimated as zero. Different models are compared in terms of information criteria like AIC/BIC or hypothesis testing. The advantage of the Lasso-type approach over AIC/BIC is that statistical models do not need to be nested; one can rather construct a single large parametric model merging two orthogonal models and let the selection method choose one of the two [Caner, 2009].

Variable selection becomes particularly important when the true underlying model has a sparse representation. Identifying the significant predictors correctly improves the prediction performance of the fitted model [for an overview of feature selection see Fan and Li, 2006]. Consider the linear regression model $Y_i = x_i^T \beta + \varepsilon_i$, where $x_i$ is a vector of covariates, $\beta$ a vector of $q > 0$ parameters and the $\varepsilon_i$ are i.i.d. Gaussian random variables.
Knight and Fu [2000] proposed the following $l^p$-penalized estimator for $\beta$:

$$\hat\beta_n := \arg\min_u \left\{ \sum_{i=1}^n (Y_i - x_i^T u)^2 + \lambda_n \sum_{j=1}^q |u_j|^p \right\} \qquad (1)$$

for some $p > 0$ and $\lambda_n \to 0$ as $n \to \infty$. The family of estimators $\hat\beta_n$ solving (1) is a generalization of the Ridge estimators, which correspond to the case $p = 2$ [Efron et al., 2004]. The original Lasso estimators [Tibshirani, 1996] are obtained by setting $p = 1$, while OLS is the case $p = 0$, not considered here. The link between Lasso-type estimation and model selection is also due to the fact that, in the limit as $p \to 0$, this procedure approximates the AIC or BIC selection methods, i.e.

$$\lim_{p \to 0} \sum_{j=1}^q |u_j|^p = \sum_{j=1}^q 1_{\{u_j \neq 0\}},$$

which amounts to the number of non-null parameters in the model. Here $1_A$ denotes the indicator function of the set $A$.

As said, the estimators solving (1) are attractive because they make it possible to perform estimation and model selection in a single step: the procedure does not need to estimate different models and compare them later with information criteria, since the dimension of the parameter space does not change; simply some of the components of the vector $\beta^*$ are estimated as zero. In nonlinear models, a preliminary simple reparametrization (e.g. $\beta \mapsto \beta' - \beta$) is needed to interpret this approach in terms of model selection.

In this work, we extend the problem in (1) to the class of diffusion processes with small noise, solutions to the stochastic differential equation $dX_t = S_t(\theta, X)\,dt + \varepsilon\,dW_t$, $t \in [0, T]$, by replacing least squares estimation with minimum distance estimation. The asymptotics are considered as $\varepsilon \to 0$ for fixed $0 < T < \infty$, with $\theta \in \Theta \subset \mathbb{R}^q$ a $q$-dimensional parameter.
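To make the penalized problem (1) concrete in the regression setting, the $p = 1$ (Lasso) case can be solved by proximal gradient descent (ISTA) with soft-thresholding. The sketch below is our own illustration and not part of the paper; the solver name and tuning constants are assumptions made for the example.

```python
import numpy as np

def soft_threshold(z, t):
    # Elementwise proximal operator of t * ||.||_1 (soft-thresholding).
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=5000):
    """Minimize sum_i (y_i - x_i^T u)^2 + lam * sum_j |u_j|, i.e. (1) with
    p = 1, by proximal gradient descent (ISTA)."""
    L = 2.0 * np.linalg.eigvalsh(X.T @ X).max()  # Lipschitz constant of the gradient
    u = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ u - y)
        u = soft_threshold(u - grad / L, lam / L)
    return u
```

On a sparse design, say $\beta^* = (2, 0, -1)$, the null coefficient is typically estimated as exactly zero, which is the one-step model-selection effect discussed above.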
Since the seminal works of Kutoyants [1984, 1991, 1994] and Yoshida [1992b], statistical inference for continuously observed small diffusion processes is well developed today [see, e.g., Kutoyants and Philibossian, 1994, Iacus, 2000, Iacus and Kutoyants, 2001, Yoshida, 2003, Uchida and Yoshida, 2004b], but the Lasso problem has not been considered so far. Although here we consider only continuous time observations, it is worth mentioning that there is also a growing literature on parametric inference for discretely observed small diffusion processes [see Genon-Catalot, 1990, Laredo, 1990, Sørensen, 1997, 2012, Sørensen and Uchida, 2003, Uchida, 2003, 2004, 2006, 2008, Gloter and Sørensen, 2009, Guy et al., 2014] to which this Lasso problem can be extended. Adaptive Lasso-type estimation for ergodic diffusion processes sampled at discrete times has been studied in De Gregorio and Iacus [2012], while shrinkage estimation for continuous time ergodic diffusion processes has been considered in Nkurunziza [2012].

This paper is organized as follows. In Section 2 we introduce the model, the assumptions and the statement of the problem. In Section 3 we study the consistency of the estimators and derive their asymptotic distribution for different values of $p$. For $p = 1$ we also consider the case of adaptive Lasso estimation, which is meant to control the asymptotic bias. For the adaptive estimator, we are also able to prove that it represents an oracle procedure.

2 The Lasso-type problem for dynamical systems with small noise

Let us assume that on the probability space $(\Omega, \mathcal{F}, P)$, with the filtration $\{\mathcal{F}_t, 0 \le t \le T\}$ (where each $\mathcal{F}_t$, $0 \le t \le T$, is augmented by the sets from $\mathcal{F}$ having zero $P$-measure), a Wiener process $\{W_t, \mathcal{F}_t, 0 \le t \le T\}$ is given.
Let $X = \{X_t, 0 \le t \le T\}$ be a real valued diffusion-type process, solution to the following stochastic differential equation

$$dX_t = S_t(\theta, X)\,dt + \varepsilon\,dW_t, \qquad \varepsilon \in (0, 1], \qquad (2)$$

with non-random initial condition $X_0 = x_0$. The parameter $\theta \in \Theta \subset \mathbb{R}^q$, where $\Theta$ is a bounded, open and convex set, is supposed to be unknown. Let $(C[0,T], \mathcal{B}[0,T])$ be the measurable space of continuous functions $x_t$ on $[0,T]$ with $\sigma$-algebra $\mathcal{B}[0,T] = \sigma\{x_t, 0 \le t \le T\}$. $P^{(\varepsilon)}_\theta$ denotes the law induced by the process $X$ on $(C[0,T], \mathcal{B}[0,T])$ when the true parameter is $\theta$. We denote by $u = (u_1, \ldots, u_q)^T$ the (transposed) vector $u \in \mathbb{R}^q$, and the true value of $\theta$ by $\theta^*$. Let $\|\cdot\| = \|\cdot\|_{L^2(\mu)}$ be the $L^2$-norm with respect to some finite measure $\mu$ on $[0,T]$, i.e.

$$\|f\|^2 = \int_0^T f^2(x)\,\mu(dx).$$

We suppose that the trend coefficient in (2) is of integral type, i.e.

$$S_t(\theta, X) = V(\theta, t, X) + \int_0^t K(\theta, t, s, X_s)\,ds,$$

where $V(\theta, t, x)$ and $K(\theta, t, s, x)$ are known measurable, non-anticipative functionals such that (2) has a unique strong solution. For example, the usual conditions (1.34) and (1.35) in Kutoyants [1994] and Theorem 4.6 in Lipster and Shiryaev [2001] about Lipschitz behaviour and linear growth are sufficient; i.e.

Assumption 1. For all $t \in [0,T]$, $\theta \in \Theta$ and $X_t, Y_t \in C[0,T]$,

$$|V(\theta, t, X_t) - V(\theta, t, Y_t)| + |K(\theta, t, s, X_t) - K(\theta, t, s, Y_t)| \le L_1 \int_0^t |X_s - Y_s|\,dK_s + L_2 |X_t - Y_t|,$$

$$|V(\theta, t, X_t)| + |K(\theta, t, s, X_t)| \le L_1 \int_0^t (1 + |X_s|)\,dK_s + L_2 (1 + |X_t|),$$

where $L_1$ and $L_2$ are positive constants and $K_s$ is a nondecreasing right-continuous function, $0 \le K_t \le K_0$, $K_0 > 0$.

Assumption 1 implies that all the probability measures $P^{(\varepsilon)}_\theta$, $\theta \in \Theta$, are equivalent (see Theorem 7.7 in Lipster and Shiryaev [2001]).
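For intuition about the model (2), the following sketch (ours, not taken from the paper) simulates one path by the Euler-Maruyama scheme for the simple assumed choice $V(\theta, t, x) = -\theta x$ and $K = 0$; as $\varepsilon \to 0$ the path collapses onto the deterministic solution $x_t(\theta) = x_0 e^{-\theta t}$.

```python
import numpy as np

def euler_maruyama(theta, eps, x0=1.0, T=1.0, n=1000, seed=0):
    """Euler scheme for dX_t = -theta * X_t dt + eps dW_t, a toy instance
    of (2) with V(theta, t, x) = -theta * x and K = 0."""
    rng = np.random.default_rng(seed)
    dt = T / n
    dW = rng.normal(scale=np.sqrt(dt), size=n)  # Brownian increments
    x = np.empty(n + 1)
    x[0] = x0
    for i in range(n):
        x[i + 1] = x[i] - theta * x[i] * dt + eps * dW[i]
    return x
```

Comparing the simulated path with $x_t(\theta) = e^{-\theta t}$ for decreasing $\varepsilon$ shows the sup-distance shrinking proportionally to $\varepsilon$, in line with the bound $\|X - x(\theta^*)\| \le C \varepsilon \sup_t |W_t|$ used later in the proof of Theorem 1.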
The asymptotics in this model are considered as $\varepsilon \to 0$, with $0 < T < \infty$ fixed. We will also write $x(\theta) = x_t(\theta)$ to denote the limiting dynamical system satisfying the integro-differential equation

$$\frac{dx_t}{dt} = V(\theta, t, x_t) + \int_0^t K(\theta, t, s, x_s)\,ds,$$

with initial condition $x_0$. We assume that, for all $0 \le t \le T$ and for each $\theta \in \Theta$, the random element $X_t$ and $x_t(\theta)$ belong to $L^2(\mu)$.

Let $x^{(1)} = \{x^{(1)}_t(\theta^*), 0 \le t \le T\}$ be the Gaussian process solution to

$$dx^{(1)}_t = \left[ V_x(\theta^*, t, x_t(\theta^*))\,x^{(1)}_t + \int_0^t K_x(\theta^*, t, s, x_s(\theta^*))\,x^{(1)}_s\,ds \right] dt + dW_t, \qquad (3)$$

$0 \le t \le T$, $x^{(1)}_0 = 0$, where $V_x(\theta, t, x) = \frac{\partial}{\partial x} V(\theta, t, x)$ and $K_x(\theta, t, s, x) = \frac{\partial}{\partial x} K(\theta, t, s, x)$. The process $x^{(1)}_t$ plays a central role in the definition of the asymptotic distribution of the estimators in the theory of dynamical systems with small noise. We need in addition the following assumptions.

Assumption 2. The stochastic process $X$ is differentiable in $\varepsilon$ at the point $\varepsilon = 0$ in the following sense: for all $\nu > 0$,

$$\lim_{\varepsilon \to 0} P^{(\varepsilon)}_{\theta^*}\left( \|\varepsilon^{-1}(X - x) - x^{(1)}\| > \nu \right) = 0,$$

where $x^{(1)} = \{x^{(1)}_t, 0 \le t \le T\}$ is from (3).

We further denote by $\dot{x}_t(\theta)$ the $q$-dimensional vector of partial derivatives of $x_t(\theta)$ with respect to $\theta_j$, $j = 1, \ldots, q$, i.e. $\dot{x}_t(\theta) = \left( \frac{\partial}{\partial \theta_1} x_t(\theta), \ldots, \frac{\partial}{\partial \theta_q} x_t(\theta) \right)^T$, and $\dot{x}_t(\theta^*)$ satisfies the system of equations

$$\frac{d\dot{x}_t(\theta^*)}{dt} = V_x(\theta^*, t, x_t(\theta^*))\,\dot{x}_t(\theta^*) + \dot{V}(\theta^*, t, x_t(\theta^*)) + \int_0^t \left( \dot{K}(\theta^*, t, s, x_s(\theta^*)) + K_x(\theta^*, t, s, x_s(\theta^*))\,\dot{x}_s(\theta^*) \right) ds,$$

$\dot{x}_0(\theta^*) = 0$, where the dot denotes differentiation with respect to $\theta$, i.e. $\dot{V}(\theta, t, x_t(\theta)) = \left( \frac{\partial}{\partial \theta_1} V(\theta, t, x_t(\theta)), \ldots, \frac{\partial}{\partial \theta_q} V(\theta, t, x_t(\theta)) \right)^T$.

Assumption 3. The deterministic dynamical system $x_t(\theta)$ is differentiable in $\theta$ at the point $\theta^*$ in $L^2(\mu)$; i.e.
$$\|x(\theta^* + h) - x(\theta^*) - h^T \dot{x}(\theta^*)\| = o(|h|),$$

where $h \in \mathbb{R}^q$.

Assumption 4. The matrix

$$I(\theta^*) = \int_0^T \dot{x}_t(\theta^*)\,\dot{x}_t^T(\theta^*)\,\mu(dt)$$

is positive definite and nonsingular.

2.1 The Lasso-type estimator

We introduce a constrained minimum distance estimator for $\theta$ for the model in (2). The asymptotic properties of unconstrained minimum distance estimators in the i.i.d. framework have been established in Millar [1983, 1984]. Later, Kutoyants [1991, 1994] and Kutoyants and Philibossian [1994] studied in detail the properties of such estimators for diffusion processes with small noise. Information criteria for this model have been studied in Uchida and Yoshida [2004b], while here we study the Lasso-type approach.

To define the Lasso-type estimator, the following penalized contrast function has to be considered:

$$Z_\varepsilon(u) = \|X - x(u)\| + \lambda_\varepsilon \sum_{j=1}^q |u_j|^p, \qquad (4)$$

with $p > 0$, $u \in \Theta$ and $\lambda_\varepsilon > 0$ a real sequence. In analogy to (1), we introduce the Lasso-type estimator $\hat\theta_\varepsilon : C[0,T] \to \bar\Theta$ for $\theta$, defined as

$$\hat\theta_\varepsilon = \arg\min_{\theta \in \bar\Theta} Z_\varepsilon(\theta), \qquad (5)$$

where $\bar\Theta$ is the closure of $\Theta$. The following example explains well the spirit of the Lasso procedure. We consider a linear small diffusion-type process $X$ given by

$$dX_t = \sum_{j=1}^q \theta_j A_j(t, X)\,dt + \varepsilon\,dW_t, \qquad 0 \le t \le T.$$

By applying the estimator (5), some parameters $\theta_j$ will be set equal to 0, and this implies a simultaneous estimation and selection of the model.

3 Asymptotic properties of the estimator

The additional $l^p$-penalization term in the contrast function (4) modifies the traditional properties of the minimum distance estimator. The analysis should be performed for the different values of $p$, which change the convexity of the penalty term.
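The linear-drift example can be worked out numerically. The sketch below is our own illustration, not taken from the paper: it discretizes the $L^2(\mu)$ norm on a uniform time grid (taking $\mu$ as Lebesgue measure, an assumption for the example), and minimizes $Z_\varepsilon(u)$ by brute-force grid search for $q = 2$ with $A_1(t, X) = 1$ and $A_2(t, X) = t$, so that $x_t(u) = u_1 t + u_2 t^2/2$.

```python
import numpy as np

def lasso_min_dist(X_path, t, basis, lam, grid):
    """Grid-search sketch of the Lasso-type estimator (5) for a linear drift
    sum_j u_j A_j(t): minimizes Z(u) = ||X - x(u)||_{L2} + lam * sum_j |u_j|,
    with the L2 norm discretized on the uniform time grid t."""
    dt = t[1] - t[0]
    best_val, best_u = np.inf, None
    for u in grid:
        x_u = sum(uj * bj for uj, bj in zip(u, basis))  # x_t(u) = sum_j u_j * int_0^t A_j
        dist = np.sqrt(np.sum((X_path - x_u) ** 2) * dt)
        val = dist + lam * sum(abs(uj) for uj in u)
        if val < best_val:
            best_val, best_u = val, u
    return np.asarray(best_u)
```

With $\theta^* = (1, 0)$ and a penalty level above the $L^2$ norm of the second basis function, the second coefficient is set to (numerically) zero, which is the simultaneous estimation and selection effect described above.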
3.1 Consistency of the estimator

Let us introduce the following functions:

$$g^\varepsilon_{\theta^*}(\nu) = \inf_{|\theta - \theta^*| \ge \nu} \left\{ \|x(\theta) - x(\theta^*)\| + \lambda_\varepsilon \sum_{j=1}^q |\theta_j|^p \right\},$$

$$h^\varepsilon_{\theta^*}(\nu) = \inf_{|\theta - \theta^*| < \nu} \left\{ \|x(\theta) - x(\theta^*)\| + \lambda_\varepsilon \sum_{j=1}^q |\theta_j|^p \right\},$$

where $|\theta - \theta^*| \ge \nu$ ($< \nu$) is to be intended componentwise, for all $\nu > 0$. We need the following identifiability-type condition.

Assumption 5. For every $\nu > 0$, we assume that $g^\varepsilon_{\theta^*}(\nu) > h^\varepsilon_{\theta^*}(\nu)$.

Theorem 1. Let Assumption 1 and Assumption 5 be fulfilled and $\lambda_\varepsilon = O(\varepsilon)$ as $\varepsilon \to 0$. Then $\hat\theta_\varepsilon$ in (5) is a uniformly consistent estimator of $\theta^*$; i.e. for any $\nu > 0$,

$$\lim_{\varepsilon \to 0} \sup_{\theta^* \in \Theta} P^{(\varepsilon)}_{\theta^*}\left( |\hat\theta_\varepsilon - \theta^*| \ge \nu \right) = 0.$$

Proof. By definition of $\hat\theta_\varepsilon$, for any $\nu > 0$, we have that

$$\left\{ \omega : |\hat\theta_\varepsilon - \theta^*| \ge \nu \right\} = \left\{ \omega : \inf_{|\theta - \theta^*| < \nu} Z_\varepsilon(\theta) > \inf_{|\theta - \theta^*| \ge \nu} Z_\varepsilon(\theta) \right\}.$$

Moreover,

$$Z_\varepsilon(\theta) \le \|X - x(\theta^*)\| + \|x(\theta) - x(\theta^*)\| + \lambda_\varepsilon \sum_{j=1}^q |\theta_j|^p,$$

$$Z_\varepsilon(\theta) \ge \|x(\theta) - x(\theta^*)\| - \|X - x(\theta^*)\| + \lambda_\varepsilon \sum_{j=1}^q |\theta_j|^p.$$

Then, from the above inequalities, we get

$$P^{(\varepsilon)}_{\theta^*}\left( |\hat\theta_\varepsilon - \theta^*| \ge \nu \right) = P^{(\varepsilon)}_{\theta^*}\left( \inf_{|\theta - \theta^*| < \nu} Z_\varepsilon(\theta) > \inf_{|\theta - \theta^*| \ge \nu} Z_\varepsilon(\theta) \right) \le P^{(\varepsilon)}_{\theta^*}\left( \|X - x(\theta^*)\| > \frac{g^\varepsilon_{\theta^*}(\nu) - h^\varepsilon_{\theta^*}(\nu)}{2} \right).$$

Since (see Lemma 1.13 in Kutoyants [1994])

$$\|X_t - x_t(\theta^*)\| \le C \varepsilon \sup_{0 \le t \le T} |W_t|, \qquad P^{(\varepsilon)}_{\theta^*}\text{-a.s.},$$

where $C = C(L_1, L_2, K_0, T)$ is a positive constant, under Assumption 5 we get

$$\sup_{\theta^* \in \Theta} P^{(\varepsilon)}_{\theta^*}\left( |\hat\theta_\varepsilon - \theta^*| \ge \nu \right) \le P^{(\varepsilon)}_{\theta^*}\left( C \varepsilon \sup_{0 \le t \le T} |W_t| > \frac{1}{2} \inf_{\theta^* \in \Theta} \left\{ g^\varepsilon_{\theta^*}(\nu) - h^\varepsilon_{\theta^*}(\nu) \right\} \right) \le 2 \exp\left( - \frac{\left( \inf_{\theta^* \in \Theta} \left\{ g^\varepsilon_{\theta^*}(\nu) - h^\varepsilon_{\theta^*}(\nu) \right\} \right)^2}{8 T C^2 \varepsilon^2} \right) \to 0.$$

In the above we made use of the following estimate, for $N > 0$:

$$P\left( \sup_{0 \le t \le T} |W_t| > N \right) \le 4 P(W_T > N) \le 2 e^{-\frac{N^2}{2T}},$$

see e.g. Kutoyants [1994], and observed that

$$g^\varepsilon_{\theta^*}(\nu) - h^\varepsilon_{\theta^*}(\nu) \to \inf_{|\theta - \theta^*| \ge \nu} \|x(\theta) - x(\theta^*)\| > 0, \qquad \varepsilon \to 0.$$
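The key probabilistic ingredient of the proof is the maximal inequality $P(\sup_{0 \le t \le T} |W_t| > N) \le 4P(W_T > N) \le 2e^{-N^2/(2T)}$. A quick Monte Carlo sanity check (ours, purely illustrative; sample sizes and step counts are arbitrary choices) is:

```python
import math
import numpy as np

def sup_abs_bm_exceed_prob(N, T=1.0, n_paths=4000, n_steps=400, seed=0):
    """Monte Carlo estimate of P(sup_{0<=t<=T} |W_t| > N) for standard
    Brownian motion, from discretized paths (which slightly underestimate
    the supremum)."""
    rng = np.random.default_rng(seed)
    dW = rng.normal(scale=math.sqrt(T / n_steps), size=(n_paths, n_steps))
    return float(np.mean(np.abs(np.cumsum(dW, axis=1)).max(axis=1) > N))
```

For $T = 1$ and $N = 1.5$ the empirical frequency comes out around $0.25$, comfortably below the bound $2e^{-N^2/(2T)} \approx 0.65$ used in the proof.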
From the proof of the consistency of the estimator $\hat\theta_\varepsilon$ it is clear that the speed of convergence depends on the speed of $\lambda_\varepsilon$. The speed of $\lambda_\varepsilon$ also affects the asymptotic distribution of the estimator.

Remark 1. It is possible to define other Lasso-type estimators by modifying the metric in (4), i.e. by considering, for instance, the sup-norm and the $L^1$-norm. Hence, if $\{X_t, 0 \le t \le T\}$ and $\{x_t(\theta), 0 \le t \le T\}$, $\theta \in \Theta$, are elements of the spaces $C[0,T]$ and $L^1(\mu)$, respectively, we can introduce the corresponding Lasso estimators

$$\check\theta_\varepsilon = \arg\min_{\theta \in \bar\Theta} \left\{ \sup_{0 \le t \le T} |X_t - x_t(\theta)| + \lambda_\varepsilon \sum_{j=1}^q |\theta_j|^p \right\}$$

and

$$\breve\theta_\varepsilon = \arg\min_{\theta \in \bar\Theta} \left\{ \int_0^T |X_t - x_t(\theta)|\,\mu(dt) + \lambda_\varepsilon \sum_{j=1}^q |\theta_j|^p \right\}.$$

The estimators $\check\theta_\varepsilon$ and $\breve\theta_\varepsilon$ are uniformly consistent, and the proof follows the same steps adopted to prove Theorem 1.

3.2 Asymptotic distribution of the estimator

In order to study the asymptotic distribution of the Lasso-type estimator we need to distinguish the different cases for $p$. We start with the case $p \ge 1$. We denote by "$\to_d$" convergence in distribution and by $\zeta$ the following Gaussian random vector:

$$\zeta = \int_0^T x^{(1)}_t(\theta^*)\,\dot{x}_t(\theta^*)\,\mu(dt); \qquad (6)$$

i.e. $\zeta \sim \mathcal{N}_q(0, \sigma^2)$, where

$$\sigma^2 := \int_0^T \int_0^T \dot{x}_t(\theta^*)\,\dot{x}_s(\theta^*)^T\,E[x^{(1)}_t(\theta^*)\,x^{(1)}_s(\theta^*)]\,\mu(dt)\,\mu(ds);$$

see also Lemma 2.13 in Kutoyants [1994]. The next two theorems are inspired by Theorem 2 and Theorem 3 in Knight and Fu [2000].

Theorem 2. Let Assumptions 1-5 hold, $\zeta$ be defined as in (6), $p \ge 1$ and $\varepsilon^{-1}\lambda_\varepsilon \to \lambda_0 \ge 0$. Then

$$\varepsilon^{-1}(\hat\theta_\varepsilon - \theta^*) \to_d \arg\min_u V(u),$$

where

$$V(u) = -2u^T\zeta + u^T I(\theta^*) u + \lambda_0 \sum_{j=1}^q u_j\,\mathrm{sgn}(\theta^*_j)\,|\theta^*_j|^{p-1}$$

for $p > 1$, and

$$V(u) = -2u^T\zeta + u^T I(\theta^*) u + \lambda_0 \sum_{j=1}^q \left[ |u_j|\,1_{\{\theta^*_j = 0\}} + u_j\,\mathrm{sgn}(\theta^*_j)\,1_{\{\theta^*_j \neq 0\}} \right]$$

if $p = 1$.

Proof.
Let $u \in \mathbb{R}^q$ and introduce the random function

$$V_\varepsilon(u) = \frac{1}{\varepsilon^2}\left[ \|X - x(\theta^* + \varepsilon u)\|^2 - \|X - x(\theta^*)\|^2 + \lambda_\varepsilon \sum_{j=1}^q \left( |\theta^*_j + \varepsilon u_j|^p - |\theta^*_j|^p \right) \right], \qquad (7)$$

which is minimized at the point $u = \varepsilon^{-1}(\hat\theta_\varepsilon - \theta^*)$ by definition of $\hat\theta_\varepsilon$. By exploiting Assumptions 2-4, we get

$$\frac{1}{\varepsilon^2}\left( \|X - x(\theta^* + \varepsilon u)\|^2 - \|X - x(\theta^*)\|^2 \right) = \frac{1}{\varepsilon^2}\left( \|X - x(\theta^*) - \varepsilon u^T \dot{x}(\theta^*)\|^2 - \|X - x(\theta^*)\|^2 \right) + o_\varepsilon(1)$$

$$= u^T \|\dot{x}(\theta^*)\|^2 u - 2 u^T \|\varepsilon^{-1}(X - x(\theta^*))\,\dot{x}(\theta^*)\| + o_\varepsilon(1) \xrightarrow{P^{(\varepsilon)}_{\theta^*}} u^T I(\theta^*) u - 2 u^T \zeta, \qquad (8)$$

where $\xrightarrow{P^{(\varepsilon)}_{\theta^*}}$ stands for convergence in probability and $\zeta$ is from (6). For the penalty term in (7),

$$\frac{\lambda_\varepsilon}{\varepsilon^2} \sum_{j=1}^q \left( |\theta^*_j + \varepsilon u_j|^p - |\theta^*_j|^p \right),$$

we have to distinguish the cases $p = 1$ and $p > 1$. Let $p > 1$; then

$$\frac{\lambda_\varepsilon}{\varepsilon^2} \sum_{j=1}^q \left( |\theta^*_j + \varepsilon u_j|^p - |\theta^*_j|^p \right) = \frac{\lambda_\varepsilon}{\varepsilon} \sum_{j=1}^q u_j\,\frac{|\theta^*_j + \varepsilon u_j|^p - |\theta^*_j|^p}{\varepsilon u_j} \xrightarrow[\varepsilon \to 0]{} \lambda_0 \sum_{j=1}^q u_j\,\mathrm{sgn}(\theta^*_j)\,|\theta^*_j|^{p-1}. \qquad (9)$$

If $p = 1$, then by similar arguments we have

$$\frac{\lambda_\varepsilon}{\varepsilon^2} \sum_{j=1}^q \left( |\theta^*_j + \varepsilon u_j| - |\theta^*_j| \right) \xrightarrow[\varepsilon \to 0]{} \lambda_0 \sum_{j=1}^q \left[ |u_j|\,1_{\{\theta^*_j = 0\}} + u_j\,\mathrm{sgn}(\theta^*_j)\,1_{\{\theta^*_j \neq 0\}} \right]. \qquad (10)$$

Notice that $V_\varepsilon(u)$ is not convex in $u$, and then we have to consider convergence in distribution in the topology induced by the uniform metric on compact sets; i.e. we deal with the convergence in distribution of $V_\varepsilon(u)$ on the space of continuous functions topologized by the distance $\rho(y_1, y_2) = \sup_{u \in K} |y_1(u) - y_2(u)|$, where $K$ is a compact subset of $\mathbb{R}^q$. From (8), (9) and (10) follows the convergence of the finite-dimensional distributions, $(V_\varepsilon(u_1), \ldots, V_\varepsilon(u_k)) \to_d (V(u_1), \ldots, V(u_k))$ for any $u_i \in \mathbb{R}^q$, $i = 1, \ldots, k$. The tightness of $V_\varepsilon(u)$ is implied by

$$\sup_{\varepsilon \in (0,1]} E\left[ \sup_{u \in K} \left| \frac{d}{du} V_\varepsilon(u) \right| \right] < \infty,$$

which follows from the regularity conditions on $\{x_t(\theta), 0 \le t \le T\}$.
Indeed, it is not hard to prove that

$$\lim_{h \to 0} \limsup_{\varepsilon \to 0} E[w(V_\varepsilon(u), h) \wedge 1] \le \lim_{h \to 0} h \sup_{\varepsilon \in (0,1]} E\left[ \sup_{u \in K} \left| \frac{d}{du} V_\varepsilon(u) \right| \right] = 0,$$

where $w(y, h) = \sup\{\rho(y(u), y(v)) : |u - v| \le h\}$, with $y$ a continuous function on compact sets and $h > 0$. Therefore, by Theorem 16.5 in Kallenberg [2001], we conclude that $V_\varepsilon(u) \to_d V(u)$ uniformly in $u$.

Since $\arg\min_u V(u)$ is unique ($P^{(\varepsilon)}_{\theta^*}$-a.s.), to prove that $\arg\min V_\varepsilon = \varepsilon^{-1}(\hat\theta_\varepsilon - \theta^*) \to_d \arg\min V$ we can use Theorem 2.7 in Kim and Pollard [1990]. Hence, it is sufficient to show that $\arg\min_u V_\varepsilon(u) = O_{P^{(\varepsilon)}_{\theta^*}}(1)$. We observe that $V_\varepsilon(u) = V^l_\varepsilon(u) + o_\varepsilon(1)$, where

$$V^l_\varepsilon(u) = u^T \|\dot{x}(\theta^*)\|^2 u - 2 u^T \|\varepsilon^{-1}(X - x(\theta^*))\,\dot{x}(\theta^*)\| + \frac{\lambda_\varepsilon}{\varepsilon^2} \sum_{j=1}^q \left( |\theta^*_j + \varepsilon u_j|^p - |\theta^*_j|^p \right)$$

is a convex function. Since for each $a \in \mathbb{R}$ and $\delta > 0$ there exists a compact set $K_{a,\delta}$ such that (see Knight [1999])

$$\limsup_{\varepsilon \to 0} P^{(\varepsilon)}_{\theta^*}\left( \inf_{u \notin K_{a,\delta}} V_\varepsilon(u) \le a \right) \le \delta,$$

it follows that $\arg\min_u V_\varepsilon(u) = O_{P^{(\varepsilon)}_{\theta^*}}(1)$.

In the case $0 < p < 1$ the convexity argument cannot be applied; moreover, some rate of convergence must be imposed on the sequence $\lambda_\varepsilon$.

Theorem 3. Let Assumptions 1-4 hold, $\zeta$ be defined as in (6), $0 < p < 1$ and $\lambda_\varepsilon/\varepsilon^{2-p} \to \lambda_0 \ge 0$. Then

$$\varepsilon^{-1}(\hat\theta_\varepsilon - \theta^*) \to_d \arg\min_u V(u),$$

where

$$V(u) = -2u^T\zeta + u^T I(\theta^*) u + \lambda_0 \sum_{j=1}^q |u_j|^p\,1_{\{\theta^*_j = 0\}}.$$

Proof. We proceed analogously to the proof of Theorem 2. As before, we start with $V_\varepsilon(u)$ from (7). The first part of the expression in $V_\varepsilon(u)$ converges in distribution to $-2u^T\zeta + u^T I(\theta^*) u$ as in Theorem 2. For the second term, we need to distinguish the two cases $\theta^*_k = 0$ and $\theta^*_k \neq 0$. By assumption $\lambda_\varepsilon/\varepsilon^{2-p} \to \lambda_0$, and hence necessarily $\lambda_\varepsilon/\varepsilon \to 0$. Consider first the case $\theta^*_k \neq 0$. We have that

$$\frac{\lambda_\varepsilon}{\varepsilon}\,u_k\,\frac{|\theta^*_k + \varepsilon u_k|^p - |\theta^*_k|^p}{\varepsilon u_k} \to 0.$$
Conversely, if $\theta^*_k = 0$ we have that

$$\frac{\lambda_\varepsilon}{\varepsilon^2} \sum_{j=1}^q \left( |\theta^*_j + \varepsilon u_j|^p - |\theta^*_j|^p \right) \to \lambda_0 \sum_{j=1}^q |u_j|^p\,1_{\{\theta^*_j = 0\}}.$$

So, by means of the same arguments adopted in the proof of Theorem 2, we can prove that $V_\varepsilon(u) \to_d V(u)$ uniformly in $u$. Following Kim and Pollard [1990], the final step consists in showing that $\arg\min V_\varepsilon = O_{P^{(\varepsilon)}_{\theta^*}}(1)$, so that $\arg\min V_\varepsilon \to_d \arg\min V$. Indeed,

$$V_\varepsilon(u) \ge \frac{1}{\varepsilon^2}\left( \|X - x(\theta^* + \varepsilon u)\|^2 - \|X - x(\theta^*)\|^2 \right) - \frac{\lambda_\varepsilon}{\varepsilon^2} \sum_{j=1}^q |\varepsilon u_j|^p,$$

and for all $u$ and $\varepsilon$ sufficiently small, $\delta > 0$, we have

$$V_\varepsilon(u) \ge \frac{1}{\varepsilon^2}\left( \|X - x(\theta^* + \varepsilon u)\|^2 - \|X - x(\theta^*)\|^2 \right) - (\lambda_0 + \delta) \sum_{j=1}^q |u_j|^p = V^\delta_\varepsilon(u).$$

The term $|u_j|^p$ grows more slowly than the first, normed terms in $V^\delta_\varepsilon(u)$, so $\arg\min_u V^\delta_\varepsilon(u) = O_{P^{(\varepsilon)}_{\theta^*}}(1)$ and, in turn, $\arg\min_u V_\varepsilon(u)$ is also $O_{P^{(\varepsilon)}_{\theta^*}}(1)$. Since $\arg\min_u V(u)$ is unique, the theorem is proved.

Remark 2. If $\lambda_0 = 0$, from the above theorems we immediately obtain that

$$\varepsilon^{-1}(\hat\theta_\varepsilon - \theta^*) \to_d \arg\min_u V(u) = I^{-1}(\theta^*)\zeta,$$

where $I^{-1}(\theta^*)\zeta \sim \mathcal{N}_q(0, I^{-1}(\theta^*)\,\sigma^2\,I^{-1}(\theta^*))$.

4 Adaptive version of the penalized estimator

Theorem 3 shows that, if $p < 1$, one can estimate the nonzero parameters $\theta^*_j \neq 0$ at the usual rate, without introducing asymptotic bias due to the penalization, and, at the same time, shrink the estimates of the null parameters $\theta^*_j = 0$ toward zero with positive probability. On the contrary, if $p \ge 1$ nonzero parameters are estimated with some asymptotic bias whenever $\lambda_0 > 0$. This is a well known result in the literature [Zou, 2006] and has indeed been considered in De Gregorio and Iacus [2012] for ergodic diffusion models with discrete observations. In this section we consider only the case $p = 1$, i.e. the proper Lasso estimator.

To state the results we need to rearrange the elements of the parameter vector $\theta$ as follows.
Suppose that $q_0 \le q$ values of $\theta^*$ are non-null; then we reorder $\theta^*$ as $\theta^* = (\theta^*_1, \ldots, \theta^*_{q_0}, \theta^*_{q_0+1}, \ldots, \theta^*_q)^T$, where $\theta^*_k = 0$, $k = q_0 + 1, \ldots, q$, denote the null parameters. We now modify the optimization function by introducing one adaptive sequence for each of the parameters $\theta_j$, i.e.

$$\tilde{Z}_\varepsilon(u) = \|X - x(u)\| + \sum_{j=1}^q \lambda_{\varepsilon,j}\,|u_j|, \qquad (11)$$

and, as above, the adaptive Lasso-type estimator is the solution to

$$\tilde\theta_\varepsilon = (\tilde\theta_{\varepsilon,1}, \ldots, \tilde\theta_{\varepsilon,q}) = \arg\min_{\theta \in \Theta} \tilde{Z}_\varepsilon(\theta). \qquad (12)$$

We also need to slightly modify the rate of convergence of the new sequences $\{\lambda_{\varepsilon,j}, j = 1, \ldots, q\}$.

Assumption 6. Let $\kappa_\varepsilon = \min_{j > q_0} \lambda_{\varepsilon,j}$ and $\gamma_\varepsilon = \max_{1 \le j \le q_0} \lambda_{\varepsilon,j}$. Then the following convergences must hold:

$$\frac{\kappa_\varepsilon}{\varepsilon} \to \infty \quad \text{and} \quad \frac{\gamma_\varepsilon}{\varepsilon} \to 0.$$

Let

$$\dot{x}^1_t(\theta) = \left( \frac{\partial}{\partial \theta_1} x_t(\theta), \ldots, \frac{\partial}{\partial \theta_{q_0}} x_t(\theta) \right)^T,$$

and

$$I_{11}(\theta) = \int_0^T \dot{x}^1_t(\theta)\,\dot{x}^1_t(\theta)^T\,\mu(dt) \qquad (q_0 \times q_0 \text{ matrix}).$$

Let $\eta$ be the Gaussian random vector defined as

$$\eta = \int_0^T x^{(1)}_t(\theta^*)\,\dot{x}^1_t(\theta^*)\,\mu(dt) \sim \mathcal{N}_{q_0}(0, \sigma^2_1), \qquad (13)$$

where

$$\sigma^2_1 = \int_0^T \int_0^T \dot{x}^1_t(\theta^*)\,\dot{x}^1_s(\theta^*)^T\,E[x^{(1)}_t(\theta^*)\,x^{(1)}_s(\theta^*)]\,\mu(dt)\,\mu(ds).$$

The estimator $\tilde\theta_\varepsilon$ asymptotically attains the oracle properties. Indeed, a good procedure should have the following (asymptotic) properties: (i) it consistently estimates null parameters as zero and vice versa, i.e. it identifies the right subset model; (ii) it has the optimal estimation rate and converges to a Gaussian random variable with the covariance matrix of the true subset model.

Theorem 4 (Oracle properties). Let Assumptions 1-6 hold. Then, as $\varepsilon \to 0$,

(i) Consistency in variable selection, i.e.

$$P^{(\varepsilon)}_{\theta^*}(\tilde\theta_{\varepsilon,k} = 0) \to 1, \qquad k = q_0 + 1, \ldots, q;$$

(ii) Asymptotic normality, i.e.
$$\varepsilon^{-1}\left( \tilde\theta_{\varepsilon,1} - \theta^*_1, \ldots, \tilde\theta_{\varepsilon,q_0} - \theta^*_{q_0} \right)^T \to_d I^{-1}_{11}(\theta^*)\,\eta,$$

where $I^{-1}_{11}(\theta^*)\,\eta \sim \mathcal{N}_{q_0}(0, I^{-1}_{11}(\theta^*)\,\sigma^2_1\,I^{-1}_{11}(\theta^*))$.

Proof. (i) We briefly outline the proof, which is by contradiction. Let us assume that for some $j = q_0 + 1, \ldots, q$ the adaptive Lasso estimator of $\theta^*_j = 0$ is $\tilde\theta_{\varepsilon,j} \neq 0$. By taking into account the Karush-Kuhn-Tucker (KKT) optimality conditions, we have

$$\frac{1}{\varepsilon} \left. \frac{\partial}{\partial u_j} \tilde{Z}_\varepsilon(u) \right|_{u = \tilde\theta_\varepsilon} = \frac{1}{\varepsilon} \left. \frac{\partial}{\partial u_j} \|X - x(u)\| \right|_{u = \tilde\theta_\varepsilon} + \frac{\lambda_{\varepsilon,j}}{\varepsilon}\,\mathrm{sgn}(\tilde\theta_{\varepsilon,j}) = 0.$$

The first term is $O_{P^{(\varepsilon)}_{\theta^*}}(1)$ by Assumption 2 and the fact that $\tilde\theta_\varepsilon$ is the solution of (12). For the second term we have that $\lambda_{\varepsilon,j}/\varepsilon \ge \kappa_\varepsilon/\varepsilon \to \infty$ by Assumption 6, which yields a contradiction.

(ii) Let

$$\tilde{V}_\varepsilon(u) = \frac{1}{\varepsilon^2}\left[ \|X - x(\theta^* + \varepsilon u)\|^2 - \|X - x(\theta^*)\|^2 + \sum_{j=1}^q \lambda_{\varepsilon,j}\left( |\theta^*_j + \varepsilon u_j| - |\theta^*_j| \right) \right]$$

$$= u^T \|\dot{x}(\theta^*)\|^2 u - 2 u^T \|\varepsilon^{-1}(X - x(\theta^*))\,\dot{x}(\theta^*)\| + o_\varepsilon(1) + \sum_{j=1}^q \frac{\lambda_{\varepsilon,j}}{\varepsilon}\,\frac{|\theta^*_j + \varepsilon u_j| - |\theta^*_j|}{\varepsilon}. \qquad (14)$$

From Assumption 6, since

$$u_j\,\frac{|\theta^*_j + \varepsilon u_j| - |\theta^*_j|}{u_j \varepsilon} \xrightarrow[\varepsilon \to 0]{} u_j\,\mathrm{sgn}(\theta^*_j), \qquad j = 1, \ldots, q_0,$$

we have that

$$\sum_{j=1}^{q_0} \frac{\lambda_{\varepsilon,j}}{\varepsilon}\,\frac{|\theta^*_j + \varepsilon u_j| - |\theta^*_j|}{\varepsilon} \le \frac{\gamma_\varepsilon}{\varepsilon} \sum_{j=1}^{q_0} u_j\,\frac{|\theta^*_j + \varepsilon u_j| - |\theta^*_j|}{u_j \varepsilon} \xrightarrow[\varepsilon \to 0]{} 0,$$

while for $\theta^*_j = 0$, $j = q_0 + 1, \ldots, q$, one has that $\sum_{j=q_0+1}^q \frac{\lambda_{\varepsilon,j}}{\varepsilon}\,|u_j| \xrightarrow[\varepsilon \to 0]{} \infty$. Therefore, it is not possible to use the topology of uniform convergence on compact sets; nevertheless, we can define the convergence of $\tilde{V}_\varepsilon$ via epi-convergence in distribution; i.e., from Lemma 4.1 in Geyer [1994] it follows that $\tilde{V}_\varepsilon(u) \to_d \tilde{V}(u)$ for every $u$, where

$$\tilde{V}(u) = \begin{cases} u_1^T I_{11}(\theta^*) u_1 - 2 u_1^T \eta, & \text{if } u_{q_0+1} = \cdots = u_q = 0, \\ \infty, & \text{otherwise}, \end{cases}$$

with $u_1 = (u_1, \ldots, u_{q_0})^T$, and the previous convergence is considered on the space of extended functions $\mathbb{R}^q \to [-\infty, +\infty]$ endowed with a suitable metric.
For more details on epi-convergence see Geyer [1994], Knight [1999] and Rockafellar and Wets [1998]. Since the unique minimum point of $\tilde{V}_\varepsilon(u)$ is given by $\varepsilon^{-1}(\tilde\theta_\varepsilon - \theta^*)$ and $\arg\min_u \tilde{V}(u) = (I^{-1}_{11}(\theta^*)\eta, 0)^T$ is $P_{\theta^*}$-unique, result (ii) follows from Theorem 4.4 in Geyer [1994].

Now let $\tilde\theta_\varepsilon$ be any consistent estimator of $\theta^*$, for example the unconstrained minimum distance estimator or the maximum likelihood estimator [Kutoyants, 1994]. Then, as suggested by Zou [2006], for any $\lambda_0 > 0$ and $\delta > 1$ it is sufficient to choose the sequences $\lambda_{\varepsilon,j}$ as

$$\lambda_{\varepsilon,j} = \frac{\lambda_0}{|\tilde\theta_{\varepsilon,j}|^\delta}. \qquad (15)$$

If $\lambda_0/\varepsilon \to 0$ and $\varepsilon^{-\delta-1}\lambda_0 \to \infty$ as $\varepsilon \to 0$, then Assumption 6 is satisfied. Values of $\delta = 1.5$ or $\delta = 2$ are common in adaptive Lasso estimation. The idea of weighting the sequences as in (15) is to exploit the ability of consistent estimators to give an initial guess of how large a parameter is, and then to use the Lasso to shrink the penalty adaptively, in order to avoid bias for truly large parameters.

References

R. Azencott. Formule de Taylor stochastique et développement asymptotique d'intégrales de Feynman. Séminaire de Probabilités XVI, Supplément: Géométrie Différentielle Stochastique. Lecture Notes in Math., 921:237-285, 1982.

M. I. Freidlin and A. D. Wentzell. Random Perturbations of Dynamical Systems, 2nd ed. Springer-Verlag, New York, 1998.

N. Yoshida. Asymptotic expansion for statistics related to small diffusions. Journal of the Japan Statistical Society, 22:139-159, 1992a.

N. Kunitomo and A. Takahashi. The asymptotic expansion approach to the valuation of interest rate contingent claims. Mathematical Finance, 11(1):117-151, 2001.

A. Takahashi and N. Yoshida. An asymptotic expansion scheme for optimal investment problems. Stat.
Inference Stoch. Process., 7:153-188, 2004.

M. Uchida and N. Yoshida. Asymptotic expansion for small diffusions applied to option pricing. Statist. Infer. Stochast. Process., 7:189-223, 2004a.

J. D. Murray. Mathematical Biology I: An Introduction. Springer, New York, 2002.

P. C. Bressloff. Stochastic Processes in Cell Biology, Interdisciplinary Applied Mathematics 41. Springer, New York, 2014.

G. B. Ermentrout and D. H. Terman. Mathematical Foundations of Neuroscience, Interdisciplinary Applied Mathematics 35. Springer, New York, 2010.

M. Caner. Lasso-type GMM estimator. Econometric Theory, 25:270-290, 2009.

J. Fan and R. Li. Statistical challenges with high dimensionality: feature selection in knowledge discovery. ArXiv Mathematics e-prints, February 2006.

K. Knight and W. Fu. Asymptotics for lasso-type estimators. The Annals of Statistics, 28(5):1356-1378, 2000.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32:407-489, 2004.

R. Tibshirani. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58:267-288, 1996.

Y. Kutoyants. Parameter Estimation for Stochastic Processes. Heldermann, Berlin, 1984.

Y. Kutoyants. Minimum distance parameter estimation for diffusion type observations. C.R. Acad. Paris, Sér. I, 312:637-642, 1991.

Y. Kutoyants. Identification of Dynamical Systems with Small Noise. Kluwer Academic Publishers, Dordrecht, The Netherlands, 1994.

N. Yoshida. Asymptotic expansion of maximum likelihood estimators for small diffusions via the theory of Malliavin-Watanabe. Probab. Theory Relat. Fields, 92:275-311, 1992b.

Y. Kutoyants and P. Philibossian. On minimum $l^1$-norm estimates of the parameter of the Ornstein-Uhlenbeck process. Statistics and Probability Letters, 20:117-123, 1994.

S. M. Iacus.
Semiparametric estimation of the state of a dynamical system with small noise. Statistical Inference for Stochastic Processes, 3:277-288, 2000.

S. M. Iacus and Yu. Kutoyants. Semiparametric hypotheses testing for dynamical systems with small noise. Statistical Inference for Stochastic Processes, 10:105-120, 2001.

N. Yoshida. Conditional expansions and their applications. Stochastic Process. Appl., 107:53-81, 2003.

M. Uchida and N. Yoshida. Information criteria for small diffusions via the theory of Malliavin-Watanabe. Statist. Infer. Stochast. Process., 7:35-67, 2004b.

V. Genon-Catalot. Maximum contrast estimation for diffusion processes from discrete observations. Statistics, 21:99-116, 1990.

C. F. Laredo. A sufficient condition for asymptotic sufficiency of incomplete observations of a diffusion process. Ann. Statist., 18:1158-1171, 1990.

M. Sørensen. Small dispersion asymptotics for diffusion martingale estimating functions. Department of Statistics and Operations Research, University of Copenhagen, Preprint No. 2000-2, 1997. URL http://www.math.ku.dk/michael/smalld.pdf.

M. Sørensen. Estimating functions for diffusion-type processes. In M. Kessler, A. Lindner, and M. Sørensen, editors, Statistical Methods for Stochastic Differential Equations, pages 1-107. CRC Press, Chapman and Hall, 2012.

M. Sørensen and M. Uchida. Small diffusion asymptotics for discretely sampled stochastic differential equations. Bernoulli, 9:1051-1069, 2003.

M. Uchida. Estimation for dynamical systems with small noise from discrete observations. J. Japan Statist. Soc., 33:157-167, 2003.

M. Uchida. Estimation for discretely observed small diffusions based on approximate martingale estimating functions. Scand. J. Statist., 31:553-566, 2004.

M. Uchida.
Martingale estimating functions based on eigenfunctions for discretely observed small diffusions. Bull. Inform. Cybernet., 38:1-13, 2006.

M. Uchida. Approximate martingale estimating functions for stochastic differential equations with small noises. Stochastic Processes and their Applications, 118:1706-1721, 2008.

A. Gloter and M. Sørensen. Estimation for stochastic differential equations with a small diffusion coefficient. Stochastic Processes and their Applications, 119:679-699, 2009.

R. Guy, C. Laredo, and E. Vergu. Parametric inference for discretely observed multidimensional diffusions with small diffusion coefficient. Stochastic Processes and their Applications, 124:51-80, 2014.

A. De Gregorio and S. M. Iacus. Adaptive lasso-type estimation for multivariate diffusion processes. Econometric Theory, 28:838-860, 2012. ISSN 1469-4360. doi: 10.1017/S0266466611000806. URL http://journals.cambridge.org/article_S0266466611000806.

S. Nkurunziza. Shrinkage strategies in some multiple multi-factor dynamical systems. ESAIM: Probability and Statistics, 16:139-150, 2012. ISSN 1262-3318. doi: 10.1051/ps/2010015. URL http://www.esaim-ps.org/article_S1292810010000157.

R. S. Lipster and A. N. Shiryaev. Statistics for Random Processes I: General Theory. Springer-Verlag, New York, 2001.

P. W. Millar. The minimax principle in asymptotic statistical theory. Lect. Notes in Math., 976:76-265, 1983.

P. W. Millar. A general approach to the optimality of the minimum distance estimators. Trans. Amer. Math. Soc., 286:377-418, 1984.

O. Kallenberg. Foundations of Modern Probability. Springer-Verlag, New York, 2001.

J. Kim and D. Pollard. Cube root asymptotics. Annals of Statistics, 18:191-219, 1990.

K. Knight. Epi-convergence in distribution and stochastic equi-semicontinuity. Unpublished manuscript, 1999.

H. Zou.
The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418-1429, 2006.

C. J. Geyer. On the asymptotics of constrained M-estimation. Annals of Statistics, 22:1993-2010, 1994.

R. T. Rockafellar and R. J. B. Wets. Variational Analysis. Springer-Verlag, New York, 1998.