The asymptotic effect of tuning parameters

Ingrid Dæhlen^{1,2}, Nils Lid Hjort^1 and Ingrid Hobæk Haff^1
^1 Department of Mathematics, University of Oslo
^2 Norwegian Computing Center, Postbox 114 Blindern, Oslo, 0314, Norway

Address for correspondence: Ingrid Dæhlen, Department of Mathematics, University of Oslo, Moltke Moes vei 35, 0851 Oslo, Norway, email: ingrdae@math.uio.no

Abstract

Tuning parameters are parameters involved in an estimating procedure for the purpose of reducing the risk of some other estimator. Examples include the degree of penalization in penalized regression and likelihood problems, as well as the balance parameter in hybrid methods. Typically, tuning parameters are set to the minimizers of some estimator of the risk, a step which introduces additional randomness and makes standard methodology inapplicable. We derive precise asymptotic theory for this situation. Our framework allows for smooth, but otherwise arbitrary, loss functions and for the risk to be estimated by cross-validation procedures. Results include consistency of the optimal estimator towards a well-defined quantity and asymptotic normality after proper scaling and centring. We give explicit forms and estimators for the limiting variance matrix, and results sharply characterizing the distance from the training error to the cross-validated estimator of the risk.

Keywords: cross validation, information criteria, large-sample theory, regularisation, two-stage estimators

1 Introduction

Suppose we have data Z_1, ..., Z_n and wish to estimate a parameter θ. Assume further that for each fixed value of some other parameter λ this can be done by maximising a function Γ_n(θ, λ) defined in terms of the data.
Examples include hybrid estimators, where Γ_n(θ, λ) = λ γ_{1n}(θ) + (1 − λ) γ_{2n}(θ), and penalized likelihood methods, where for each λ, θ is estimated by the maximizer of ℓ_n(θ) − λ R(θ) for some function R. The latter situation covers ridge regression. An example of the former is the hybrid generative-discriminative estimator, defined as the maximizer of λ ℓ_n^Gen(θ) + (1 − λ) ℓ_n^Disc(θ), where ℓ_n^Gen and ℓ_n^Disc are the log-likelihoods in generative and corresponding discriminative models (see e.g. Bouchard and Triggs (2004)). For each fixed λ, we can maximize Γ_n(θ, λ) with respect to θ to get the estimator ˆθ(λ). Furthermore, if Γ_n is sufficiently smooth in θ and a relatively standard set of regularity conditions is satisfied, ˆθ(λ) is consistent for a well-defined quantity θ_0(λ), and √n{ˆθ(λ) − θ_0(λ)} converges in law to some known limit distribution. In practice, however, λ is rarely fixed, and is instead set to minimize some estimated loss, usually approximated by means of an information criterion or cross-validation. There is therefore additional randomness involved when working with tuning parameters. Because of this, the standard limit results for ˆθ(λ) are not immediately applicable to ˆθ(ˆλ). In practice, this point is often ignored, and inference about θ is made by treating λ as fixed at ˆλ and by using pointwise results for ˆθ(λ). Such a method ignores part of the fitting procedure and the added randomness introduced by tuning λ, however. Because of this, the variance of ˆθ(ˆλ) tends to be underestimated. Figure 1 illustrates the ideas outlined above for a data set concerning female Pima Indians and prevalence of diabetes (publicly available in e.g. the R-package mlbench).
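To make the fixed-λ setup concrete, here is a minimal numerical sketch for ridge regression (a toy example of our own, not from the data analysed below): for each fixed λ, ˆθ(λ) solves the estimating equation n^{-1} Σ_i φ(Z_i, θ, λ) = 0 with the penalized score φ(z, θ, λ) = x(y − x^T θ) − λθ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
theta_true = np.array([1.0, -0.5, 0.0])
y = X @ theta_true + rng.normal(size=n)

def theta_hat(lam):
    # Solves n^{-1} sum_i phi(Z_i, theta, lam) = 0 for the penalized score
    # phi(z, theta, lam) = x (y - x^T theta) - lam * theta (ridge regression),
    # which has the closed form below.
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

lam = 0.3
th = theta_hat(lam)
phi_bar = X.T @ (y - X @ th) / n - lam * th  # average of phi at the solution
print(np.max(np.abs(phi_bar)))  # ~ 0: the estimating equation is satisfied
```

With λ = 0 this reduces to the ordinary least-squares estimator, while increasing λ shrinks ˆθ(λ) towards zero.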
In Section 6.1 we fit a penalized logistic regression model to these data, where the tuning parameter dictating the size of the regularization term is set to the minimizer of the cross-validated Brier score. For each regression coefficient, Figure 1 shows histograms based on 2000 non-parametric bootstrap iterations, together with the densities in approximate distributions. The dotted curve corresponds to the distribution we get when using the pointwise results as described in the previous paragraph. The dashed lines correspond to the alternative approximate distributions derived in this article, which take the tuning procedure into account.

[Figure 1: Histograms of draws from the bootstrap distribution of ˆβ_j(ˆλ) for j = 0, ..., 8, together with the densities in the approximate distributions ("New approximation" and "Pointwise approximation").]

Note that the curves corresponding to the pointwise estimator seem a bit too narrow, while the curves corresponding to the new estimators are less sharp. This is a consequence of including or not including the effect of the tuning step in the limiting distribution. In the present article, we fill a gap in the literature by deriving precise limit results for the estimator ˆθ(ˆλ), allowing inference about θ to be made while taking the added randomness from tuning λ into account. We will start by deriving limit results when ˆλ is estimated by minimising the relatively simpler training error (Section 2) or an information criterion (Section 3). We also derive estimators for the limiting variance matrix of ˆθ(ˆλ) (Section 2.2).
In Section 4, we show that the results of Section 2 hold true also when the cross-validated error is used to estimate λ. This will require us to take a closer look at the cross-validation procedure, and in Section 4.1 we prove a result establishing exactly how far the training error is from its cross-validated counterpart. In Section 5, we consider a particular situation where the theory simplifies greatly, and in Section 6, we go through a couple of applications. In particular, we will return to the above analysis of diabetes in Pima Indians in Section 6.1. Lastly, Section 7 contains some concluding remarks.

1.1 Related work

To investigate the joint limiting behaviour of ˆθ(ˆλ) and ˆλ, we reformulate the tuning procedure as a two-stage estimation problem and show that ˆθ(ˆλ) can be written as components of a Z-estimator (see e.g. Van der Vaart (2000, Ch. 5)). Many of our proofs are therefore based upon and inspired by results for two-stage estimators, see e.g. Joe (2005), Ko and Hjort (2019) or Dæhlen and Hjort (2025), and bear similarities with the framework of generalized estimating equations used frequently in econometrics, see e.g. Hansen (1982), Hall and Inoue (2003) or, for a more thorough introduction, Shao (2003, Ch. 5). To apply the two-stage arguments also in the context of cross-validation, we derive sharp results characterizing the difference between the cross-validated estimator of the risk and the training error in Theorem 6. This theorem can be seen as an extension of the result given in Stone (1977), where the author writes that the cross-validated likelihood is asymptotically equivalent to the AIC. Stone (1977) only writes down an informal argument, however, and such intuitive explanations are often repeated when the paper is cited, see e.g. Claeskens and Hjort (2008) and Konishi and Kitagawa (2008).
To the best of our knowledge, no formal proof of the statement has ever been given, and we therefore believe that our Theorem 6 fills this gap in the literature. That being said, there has recently been some work related to approximations to the LOOCV (see Stephenson and Broderick (2020) for an overview), and the theoretical results concerning these approximations bear resemblance to our work (see in particular (Beirami et al., 2017, Le. 3)). We still believe our theorem to be novel, as it is more general than that of Stone (1977): it allows for general Z-estimators instead of only maximum likelihood estimators, and the functions used in the cross-validation procedure and in the estimation of θ need not be the same. We also use Theorem 6 to discuss whether cross-validation is precise enough to capture the difference between different models, a topic which has been discussed by e.g. Shao (1993) and Shao (1997) in the context of linear models. Furthermore, our theory allows us to characterize the optimism of the training error, something which is studied in e.g. Efron (1983), Efron (1986), Tibshirani and Knight (1999) and Efron (2004), albeit with a different approach. Our methodology is in some sense related to the field of post-selection inference, see e.g. Berk et al. (2013), Lee et al. (2016), Bachoc et al. (2020) or indeed Kuchibhotla et al. (2022) for a recent review. In this field, the distribution of regression parameters is derived in a way that takes the model selection procedure into account. This is typically done by either conditioning on the model selection procedure or by creating confidence intervals by taking the supremum over all candidate models. We, on the other hand, include the selection of continuous tuning parameters in the fitting procedure itself. Because of this, the field of post-selection inference is related to the present article in spirit only.
Furthermore, we provide explicit forms and estimators for the limiting variance of ˆθ(ˆλ), something which is not obtained through the most popular post-selection inference techniques (see e.g. the aforementioned review article). Another topic which bears some similarities with our work is the research concerning asymptotic optimality of different parameter tuning schemes. For a given prediction depending on some tuning parameter λ, these works investigate its distance to the "best" prediction we can possibly make when λ is tuned according to some criterion. Craven and Wahba (1978), Li (1985), Speckman (1985), Li (1986) and Li (1987) all investigate this problem in linear models, and more recently Mu et al. (2018) and Mu et al. (2021) have proven asymptotic optimality in a more advanced context. See also Mu et al. (2021), which contains a more comprehensive list of relevant literature. So far, however, the main concern of this field of research has been whether the tuning procedure ensures that the obtained and optimal predictions are close, rather than the asymptotic effect tuning parameters have on the limit distribution of ˆθ(ˆλ). We believe this sets our results apart from the literature on asymptotic optimality. To our knowledge, no results concerning the limiting distribution of ˆθ(ˆλ) when ˆλ is set to the minimizer of some general risk estimator exist in the literature. Arcones (2005) studies a similar situation, deriving limit results for ˆθ(ˆλ) when ˆλ is set to the minimizer of some estimator of the limiting variance of ˆθ(λ). This result is closely related to our theorems, but Arcones' framework only works for one-dimensional parameters θ and λ. Furthermore, ˆλ cannot minimize a general loss function or a cross-validated estimator of the risk. Our results differ from those of Arcones (2005) in yet another way. Arcones (2005), as well as many others (e.g.
Dodge and Jurečková (2000), Kato (2009), Germain and Roueff (2010) and (Spokoiny, 2025, Se. 10.2)), show that S_n(λ) = √n{ˆθ(λ) − θ_0(λ)} converges as a stochastic process to a limiting Gaussian S(λ). Such results ensure that ˆθ(λ) behaves as expected for all values of λ, and could in theory be used to prove our results when the asymptotic behaviour of ˆλ is known. This is pointed out in (Dodge and Jurečková, 2000, Se. 7.2 and 10.2). In practice, however, the behaviour of the tuning parameter is not known, and very few attempts have been made at investigating this. Homrighausen and McDonald (2014) and Chetverikov et al. (2021) have done some work on the Lasso, (Dodge and Jurečková, 2000, Se. 7.2) studies the behaviour of ˆθ(ˆλ) when ˆθ(λ) is a trimmed mean in the context of one explicitly defined ˆλ, and Arcones (2005) considers a couple of specific tuning schemes, but apart from these papers, virtually no theoretical discussion of how ˆλ behaves asymptotically, and how this in turn impacts the limiting distribution of S_n(ˆλ), seems to be present in the literature. This is especially true when cross-validation is used to tune ˆλ. We therefore believe the current paper provides a significant contribution to statistical estimation theory. Consult Section S.2 of the supplementary material for a discussion utilizing similar arguments as Arcones (2005) when no natural limit for ˆλ exists.

1.2 Notation and definitions

Before we begin with the theory, we introduce notation which will be used throughout this article. For a given function f(θ), we will let ∇_{θ=θ_0} f(θ) and H_{θ=θ_0} f(θ) be the gradient and Hessian matrix of f at θ_0, respectively, while ∂_{θ=θ_0} f(θ) = f′(θ_0) = ∂f/∂θ(θ_0) will denote the Jacobian matrix of f at θ_0.
When there is no room for misunderstanding, we will simplify notation and write ∇_{θ_0} f, H_{θ_0} f and ∂_{θ_0} f rather than the above. Furthermore, we will often identify matrices in R^{p×q} with vectors in R^{pq} in the following way: A = (a_1, a_2, ..., a_q) ∈ R^{p×q} ⇝ (a_1^T, a_2^T, ..., a_q^T)^T ∈ R^{pq}, and hence, if A is a matrix, ∥A∥ = (Σ_j Σ_k a_{jk}^2)^{1/2}. We will also let →_Pr and →_d denote convergence in probability and in distribution/law, respectively. We will let Z_1, ..., Z_n ∈ R^d be independent and identically distributed (i.i.d.) variables from a distribution F, and assume that for each fixed value of λ ∈ R^q, θ ∈ R^p is estimated by solving the equation n^{-1} Σ_{i=1}^n φ(Z_i, θ, λ) = 0 for some function φ: R^{d+p+q} → R^p. We will let ˆθ(λ) denote the solution. If ˆθ(λ) is a maximizer of an expression of the form n^{-1} Σ_{i=1}^n ϕ(Z_i, θ, λ) for some function ϕ, this is satisfied with φ(z, θ, λ) = ∇_θ ϕ(z, θ, λ) under general conditions. In particular, if ˆθ(λ) maximises a penalized likelihood of the form n^{-1} Σ_{i=1}^n log f(Y_i, θ) − λP(θ), then φ(z, θ, λ) = u(y, θ) − λP′(θ), where u is the score function in the model. We also comment that regression and classification models fit into this framework by considering the covariates as random and letting Z = (Y, X^T)^T, where Y is the response and X the predictor. Lastly, fix a loss function ψ: R^{d+p} → R. The training error, which will be denoted by TE(λ), is equal to n^{-1} Σ_{i=1}^n ψ{Z_i, ˆθ(λ)}. The leave-one-out cross-validation (LOOCV) estimator of the risk is n^{-1} Σ_{i=1}^n ψ{Z_i, ˆθ^{(−i)}(λ)}, where ˆθ^{(−i)}(λ) is computed without observation i, and will be referenced by CV(λ).
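The two risk estimates just defined can be computed directly. The sketch below (our own toy example: a ridge fit ˆθ(λ) with squared-error loss ψ) evaluates TE(λ) and the brute-force LOOCV estimator CV(λ), refitting without observation i before scoring it.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 2
X = rng.normal(size=(n, p))
y = X @ np.array([0.8, -0.3]) + rng.normal(size=n)

def theta_hat(X, y, lam):
    # ridge estimator for a fixed tuning parameter lam
    m = len(y)
    return np.linalg.solve(X.T @ X / m + lam * np.eye(X.shape[1]), X.T @ y / m)

def psi(x, yv, th):
    # loss psi(z, theta): squared prediction error
    return (yv - x @ th) ** 2

def TE(lam):
    # training error: n^{-1} sum_i psi{Z_i, theta_hat(lam)}
    th = theta_hat(X, y, lam)
    return np.mean([psi(X[i], y[i], th) for i in range(n)])

def CV(lam):
    # LOOCV: n^{-1} sum_i psi{Z_i, theta_hat^{(-i)}(lam)}
    out = []
    for i in range(n):
        mask = np.arange(n) != i
        out.append(psi(X[i], y[i], theta_hat(X[mask], y[mask], lam)))
    return np.mean(out)

print(TE(0.5), CV(0.5))  # CV is typically a bit above TE (optimism of TE)
```

Note that TE(λ) is minimized at λ = 0 here, since the unpenalized fit minimizes in-sample squared error; this is one reason CV, not TE, is used for tuning in practice.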
We will assume that ˆλ is set to the minimizer of some error estimator, and refer to this process as "tuning" λ.

2 Two-stage tuning

We will start by investigating the asymptotic properties of ˆθ(ˆλ) when ˆλ is set by tuning with respect to the training error. If this function is convex in λ and the minimizer lies in the interior of the parameter space for λ, this is equivalent to solving TE′(λ) = 0. Hence, under these conditions, ˆλ satisfies TE′(ˆλ) = n^{-1} Σ_{i=1}^n ˆθ′(ˆλ)^T ∇_θ ψ{Z_i, ˆθ(ˆλ)} = 0. If n^{-1} Σ_{i=1}^n ∂_{ˆθ(ˆλ)} φ(Z_i, θ, ˆλ) is invertible, the quantity ˆθ′(ˆλ) exists by the implicit function theorem. Furthermore, it satisfies n^{-1} Σ_{i=1}^n [∂_{ˆθ(ˆλ)} φ(Z_i, θ, ˆλ) ˆθ′(ˆλ) + ∂_{ˆλ} φ{Z_i, ˆθ(ˆλ), λ}] = 0 by the same result. Because of these relations, as well as the definition of ˆθ(λ), we can express the vector ˆα = (ˆθ(ˆλ)^T, ˆλ^T, ˆθ′(ˆλ)^T)^T ∈ R^{p+q+pq} as the solution to a set of estimating equations. This implies that the full vector ˆα is what is called a Z-estimator; such estimators are typically consistent for well-defined quantities and asymptotically normal after proper scaling and centring, see e.g. Van der Vaart (2000, Ch. 5). We would therefore expect the same to be true for ˆα, and in turn also for ˆθ(ˆλ). The following theorem shows that this is indeed correct.

Theorem 1. For each fixed λ, let ˆθ(λ) be the unique solution to n^{-1} Σ_{i=1}^n φ(Z_i, θ, λ) = 0. Furthermore, assume that ˆλ is the unique solution to TE′(λ) = o_p(1/√n). Let α = (θ^T, λ^T, D^T)^T ∈ R^{p+q+pq} and ˆα = (ˆθ(ˆλ)^T, ˆλ^T, ˆθ′(ˆλ)^T)^T.
Define the function η: R^{d+p+q+pq} → R^{p+q+pq} as η(z, α) = (η_1(z, α)^T, η_2(z, α)^T, η_3(z, α)^T)^T, where η_1(z, α) = φ(z, θ, λ), η_2(z, α) = D^T ∇_θ ψ(z, θ) and η_3(z, α) = ∂_θ φ(z, θ, λ) D + ∂_λ φ(z, θ, λ). Assume the solution to 0 = Ψ(α) = E{η(Z, α)} is unique and equal to α_0. Then ˆα is consistent for α_0, and furthermore,

ˆα = α_0 − n^{-1} Ψ′(α_0)^{-1} Σ_{i=1}^n η(Z_i, α_0) + o_p(1/√n),   (1)

provided the function Ψ is continuously differentiable in a neighbourhood of α_0, that Ψ′(α_0) is non-singular, and that in a neighbourhood of α_0 the norm of η(z, α) and all first order partial derivatives of η with respect to α are bounded by functions m_0(z) and m_1(z), respectively, with E m_0(Z)^2, E m_1(Z)^2 < ∞. In particular, √n(ˆα − α_0) converges in law to a N(0, V_α) distribution, where V_α = Ψ′(α_0)^{-1} Var η(Z, α_0) {Ψ′(α_0)^{-1}}^T.

Remark. By Leibniz' integral formula, the function Ψ is smooth in α as long as the same holds true for η. Furthermore, α_0 is the unique solution to Ψ(α) = 0 as long as θ(λ), defined as the zero of E φ(Z, θ, λ), is well defined for all λ, and there is only a single θ ∈ {θ(λ) | λ ∈ Λ} minimizing the limiting risk E ψ{Z, θ(λ)}. For a discussion of what happens when this condition fails to hold, see Section S2 in the supplementary material.

Proof. By the arguments preceding the theorem, Σ_{i=1}^n η(Z_i, ˆα) = 0. Hence, consistency follows from the assumptions and Lemma S1.1 and Corollary S1.2 in the supplementary material. Equation (1) follows from Van der Vaart (2000, Th. 5.21). ■

In practice, we are rarely interested in the asymptotic behaviour of the full vector ˆα, but rather in the limiting properties of ˆθ(ˆλ) alone.

Corollary 1.
Assume the conditions of Theorem 1 hold true, and for each λ let θ(λ) be the solution to E φ(Z, θ, λ) = 0. Then ˆθ(ˆλ) is consistent for θ_0 = θ(λ_0), where λ_0 is the solution to ∇_λ E ψ{Z, θ(λ)} = 0. Let D = θ′(λ_0), Z_1 = H_{λ_0} E ψ{Z, θ(λ)}, Z_2 = H_{θ_0} E ψ(Z, θ), J = −∂_{θ_0} E φ(Z, θ, λ_0) and b = ∇_{θ_0} E ψ(Z, θ). Then

ˆθ(ˆλ) = θ_0 + A_1 n^{-1} Σ_{i=1}^n φ(Z_i, θ_0, λ_0) + A_2 n^{-1} Σ_{i=1}^n ∇_{θ_0} ψ(Z_i, θ) + A_3 n^{-1} Σ_{i=1}^n {∂_{θ_0} φ(Z_i, θ, λ_0) D + ∂_{λ_0} φ(Z_i, θ_0, λ)} + o_p(1/√n),   (2)

where A_1 = J^{-1} − D Z_1^{-1} {D^T Z_2 + W} J^{-1}, A_2 = −D Z_1^{-1} D^T and A_3 = −D Z_1^{-1} M, where M is a q × pq block-diagonal matrix whose diagonal blocks are each equal to b^T J^{-1}, and W is a q × p matrix which can be written as W = ((b^T J^{-1} W_1)^T, ..., (b^T J^{-1} W_q)^T)^T with

W_j = (D_j^T H_{θ_0} E φ_1(Z, θ, λ_0); D_j^T H_{θ_0} E φ_2(Z, θ, λ_0); ...; D_j^T H_{θ_0} E φ_p(Z, θ, λ_0)) + ∂_{λ_{j,0}} ∂_{θ_0} E φ(Z, θ, λ),

where the first term stacks the rows D_j^T H_{θ_0} E φ_k(Z, θ, λ_0) for k = 1, ..., p, and D_j denotes the j-th column of D. In particular, √n{ˆθ(ˆλ) − θ_0(λ_0)} →_d N{0, A^* K^* (A^*)^T}, where A^* = (A_1, A_2, A_3) and K^* = Var η(Z, θ_0, λ_0, D), with η defined in Theorem 1.

Proof. This follows from Theorem 1 and formulas for the inverse of block matrices. ■

If q = 1, we have M = b^T J^{-1} and W = b^T J^{-1} W_1, making all formulas considerably more readable. Furthermore, the expressions above often simplify greatly, see e.g. Section 5.

2.1 Truncated estimators

In Corollary 1 we assume that ˆλ is found by solving TE′(λ) = 0. This is, however, not how λ is usually tuned in practice. Typically, ˆλ is set to the minimizer of TE(λ) in some compact set Λ. We will now discuss how Corollary 1 can be used to make inference also in this situation.
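In the toy ridge example (our own illustration, with φ(z, θ, λ) = x(y − x^T θ) − λθ and squared-error loss), the training error is non-decreasing in λ, so its minimizer over a compact set such as Λ = [0, 1] sits on the boundary ˆλ = 0; this is exactly the kind of truncation situation treated next:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 80, 2
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -1.0]) + rng.normal(size=n)

def TE(lam):
    # training error of the ridge fit at a fixed tuning parameter lam
    th = np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)
    return np.mean((y - X @ th) ** 2)

grid = np.linspace(0.0, 1.0, 101)
errs = np.array([TE(l) for l in grid])
lam_hat = grid[np.argmin(errs)]
print(lam_hat)  # 0.0: TE is minimized on the boundary of [0, 1]
```

Here TE′(λ) > 0 throughout [0, 1], so the first-order condition TE′(ˆλ) = 0 is never met and the minimizer is determined by the boundary.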
For simplicity, we will take q = 1 and Λ = [0, 1], but all results should be generalizable to higher dimensions and other parameter sets. As long as TE is differentiable and ˆλ lies in the interior of [0, 1], minimising TE in Λ is equivalent to solving TE′(λ) = 0. Oftentimes, however, TE(λ) might be increasing or decreasing, with TE′(λ) ≠ 0 for all values of λ. In such cases, ˆλ is set to 0 or 1 depending on which of the boundary points achieves the lowest value of TE(λ), and the condition TE′(ˆλ) = 0 might not be satisfied. To study this more complicated estimator, we will work with a modified version of ˆθ(λ) defined in the following way:

ˆθ_T(λ) = ˆθ(0) I(λ ≤ 0) + ˆθ(λ) I(0 < λ < 1) + ˆθ(1) I(λ ≥ 1),   (3)

where I is the indicator function. This estimator agrees with ˆθ(λ) for all λ ∈ [0, 1]. Outside of the unit interval, however, the estimator is "truncated" to either ˆθ(0) or ˆθ(1), depending on which side of [0, 1] λ lies. Hence, as long as TE(λ) is convex and has a global minimizer ˆλ_G ∈ R, ˆθ_T(ˆλ_G) will be equal to ˆθ(ˆλ), where ˆλ is the minimizer of TE(λ) in [0, 1]. Furthermore, when the global minimizer of TE(λ) lies in [0, 1], ˆλ is the global minimizer, ensuring ˆθ_T(ˆλ_G) = ˆθ(ˆλ).

Theorem 2. Assume the conditions of Corollary 1 hold true, and let ˆθ_T(λ) be defined as in (3). Furthermore, let ˆλ_G be the global minimizer of TE(λ). The following hold:

(a) Assume ˆλ_G →_Pr λ_0 < 0 and √n{ˆθ(0) − θ_0} →_d N, where N is some zero-mean normal distribution. Then ˆθ_T(ˆλ_G) →_Pr θ_0 and √n{ˆθ_T(ˆλ_G) − θ_0} →_d N.

(b) Assume ˆλ_G →_Pr λ_0 > 1 and √n{ˆθ(1) − θ_0} →_d N. Then ˆθ_T(ˆλ_G) →_Pr θ_0 and √n{ˆθ_T(ˆλ_G) − θ_0} →_d N.
(c) Assume ˆλ_G →_Pr λ_0 = 0 and that √n{ˆθ(ˆλ_G)^T − θ_0^T, ˆθ(0)^T − θ_0^T, ˆλ_G}^T →_d (N_1^T, N_2^T, N_3)^T ∼ N(0, V) for some matrix V. Then ˆθ_T(ˆλ_G) →_Pr θ_0 and √n{ˆθ_T(ˆλ_G) − θ_0} →_d I(N_3 ≥ 0) N_1 + I(N_3 < 0) N_2.

(d) Assume ˆλ_G →_Pr λ_0 = 1 and that √n{ˆθ(ˆλ_G)^T − θ_0^T, ˆθ(1)^T − θ_0^T, ˆλ_G − 1}^T →_d (N_1^T, N_2^T, N_3)^T ∼ N(0, V) for some matrix V. Then ˆθ_T(ˆλ_G) →_Pr θ_0 and √n{ˆθ_T(ˆλ_G) − θ_0} →_d I(N_3 ≤ 0) N_1 + I(N_3 > 0) N_2.

Remark. From Z-estimation theory, √n{ˆθ(λ) − θ(λ)} = J(λ)^{-1} n^{-1/2} Σ_{i=1}^n φ{Z_i, θ(λ), λ} + o_p(1), where θ(λ) is the root of E φ(Z, θ, λ) and J(λ) = −∂_{θ(λ)} E φ(Z, θ, λ), for λ = 0, 1. Combining this with (2) and the central limit theorem shows the joint convergence required for cases (c) and (d).

Proof. We will only prove (a) and (c); the remaining cases are similar. Since ˆλ_G →_Pr λ_0 < 0, the probability of ˆλ_G < 0 tends to 1. Furthermore, conditioned on this event, ˆθ_T(ˆλ_G) = ˆθ(0). Hence, lim_{n→∞} Pr{√n{ˆθ_T(ˆλ_G) − θ_0} ≤ x} = lim_{n→∞} [Pr{√n{ˆθ(0) − θ_0} ≤ x} Pr(ˆλ_G < 0) + O{Pr(ˆλ_G ≥ 0)}]. The right-hand side of this equation converges to Pr(N ≤ x). This proves (a). For (c), note that √n{ˆθ_T(ˆλ_G) − θ_0(0)} = I{√n(ˆλ_G − 0) ≥ 0} √n{ˆθ(ˆλ_G) − θ_0(0)} + I{√n(ˆλ_G − 0) < 0} √n{ˆθ(0) − θ_0(0)}. Now, since the function g(x, y, z) = I(z ≥ 0) x + I(z < 0) y has the set of discontinuity points D_g = {(x, y, 0) | x, y ∈ R}, and this set has measure zero under the multivariate normal distribution, the continuous mapping theorem guarantees that √n{ˆθ_T(ˆλ_G) − θ_0(0)} converges in law towards g(N_1, N_2, N_3) = I(N_3 ≥ 0) N_1 + I(N_3 < 0) N_2. This concludes the proof.
■

2.2 Estimators of the variance

Now that we know how ˆθ(ˆλ) behaves asymptotically, we will define estimators of its limiting variance. We start by giving approximations to the quantities in Corollary 1.

Theorem 3. Let ˆλ be the minimizer of TE(λ) and ˆθ(ˆλ) the solution to 0 = Φ_n(θ, ˆλ), where Φ_n(θ, λ) = n^{-1} Σ_{i=1}^n φ(Z_i, θ, λ). Define the following estimators: ˆZ_1 = n^{-1} Σ_{i=1}^n H_{ˆλ} ψ{Z_i, ˆθ(λ)}, ˆZ_2 = H_{ˆθ(ˆλ)} Ψ_n(θ) and ˆb = ∇_{ˆθ(ˆλ)} Ψ_n(θ), where Ψ_n(θ) = n^{-1} Σ_{i=1}^n ψ(Z_i, θ). Furthermore, let ˆJ = −∂_{ˆθ(ˆλ)} Φ_n(θ, ˆλ), ˆD = ˆJ^{-1} ∂_{ˆλ} Φ_n{ˆθ(ˆλ), λ} and ˆK^* = n^{-1} Σ_{i=1}^n ˆη_i ˆη_i^T, where ˆη_i = η{Z_i, ˆθ(ˆλ), ˆλ, ˆθ′(ˆλ)}, and

ˆW_j = n^{-1} Σ_{i=1}^n [(ˆD_j^T H_{ˆθ} φ_1(Z_i, θ, ˆλ); ...; ˆD_j^T H_{ˆθ} φ_p(Z_i, θ, ˆλ)) + ∂_{ˆλ_j} ∂_{ˆθ} φ(Z_i, θ, λ)]

for j = 1, ..., q, where the first term stacks the rows ˆD_j^T H_{ˆθ} φ_k(Z_i, θ, ˆλ) for k = 1, ..., p, and with η as defined in Theorem 1. These estimators are consistent for the corresponding quantities if there exist functions, depending only on z, that bound the norms of all partial derivatives of ∂_θ φ, ∂_λ φ, H_λ ψ, H_θ ψ, ∇_θ ψ, (z, θ, λ) ↦ η(z, θ, λ) η(z, θ, λ)^T, H_θ φ_j and ∂_{λ_j} ∂_θ φ for j = 1, ..., p, with respect to (θ, λ, θ′) in a neighbourhood of (θ_0^T, λ_0^T, D^T)^T.

Proof. This follows from standard theory, see e.g. Dæhlen et al. (2024, Le. 4, Ap. B). ■

We are now ready to define estimators of the limiting variance of √n{ˆθ(ˆλ) − θ_0(λ_0)}.

Theorem 4. Assume we are minimising TE(λ) in [0, 1] and that the conditions of Theorem 3 and Corollary 1 hold true.
Then √n{ˆθ(ˆλ) − θ_0} is asymptotically normal with a variance V which can be estimated consistently by ˆV_1 = ˆA^* ˆK^* (ˆA^*)^T if ˆλ lies in the interior of [0, 1], or by ˆV_2 = ˆJ^{-1} ˆK (ˆJ^{-1})^T when TE is minimised at a λ with TE′(λ) ≠ 0 for λ equal to 0 or 1. Here ˆK = n^{-1} Σ_{i=1}^n φ{Z_i, ˆθ(ˆλ), ˆλ} φ{Z_i, ˆθ(ˆλ), ˆλ}^T and ˆA^* = (ˆA_1, ˆA_2, ˆA_3), with ˆA_1 = ˆJ^{-1} − ˆD ˆZ_1^{-1} {ˆD^T ˆZ_2 + ˆW} ˆJ^{-1}, ˆA_2 = −ˆD ˆZ_1^{-1} ˆD^T and ˆA_3 = −ˆD ˆZ_1^{-1} ˆM, where ˆM is a block-diagonal matrix with ˆb^T ˆJ^{-1} in each of its diagonal blocks and ˆW = ((ˆb^T ˆJ^{-1} ˆW_1)^T, ..., (ˆb^T ˆJ^{-1} ˆW_q)^T)^T.

Proof. This follows directly from Corollary 1 and Theorem 2, in conjunction with Theorem 3 and the continuous mapping theorem. ■

It could of course happen that λ_0 is exactly equal to either of the endpoints of Λ. Then case (c) or (d) of Theorem 2 applies, and √n{ˆθ(ˆλ) − θ_0} is not guaranteed to be asymptotically normal, so that approximating it by N(0, ˆV_1) or N(0, ˆV_2) will not be correct. In Section 5 we will, however, see that the distribution in cases (c) and (d) often simplifies to a central multivariate normal distribution also in this situation, and that the variance can be estimated by either ˆV_1 or ˆV_2, defined in the preceding theorem.

3 Information criteria

The theory presented in the previous sections is not limited to cases where ˆλ is set to the minimizer of the training error. The results also hold true when the tuning parameter minimises quantities which are "almost" equal to TE. The most obvious example of this is perhaps when ˆλ is set to the minimizer of an information criterion. Let f_λ(z, θ) be some parametric model and ℓ_{λ,n}(θ) the corresponding log-likelihood. Furthermore, for each λ, let ˆθ(λ) be the maximizer of ℓ_{λ,n}(θ).
Then, by definition, Akaike's information criterion (AIC), the Bayesian information criterion (BIC) and Takeuchi's information criterion (TIC) take the following forms for each fixed λ: AIC(λ) = −n^{-1} ℓ_{λ,n}{ˆθ(λ)} + n^{-1} p, BIC(λ) = −n^{-1} ℓ_{λ,n}{ˆθ(λ)} + n^{-1} p log n and TIC(λ) = −n^{-1} ℓ_{λ,n}{ˆθ(λ)} + n^{-1} Tr{ˆJ(ˆθ, λ)^{-1} ˆK(ˆθ, λ)}, where p is the dimension of θ, the number of parameters in the model f_λ(z, θ) at a fixed λ. Hence, provided ˆJ and ˆK are sufficiently regular, the above information criteria are all only o_p(1/√n) away from −n^{-1} ℓ_{λ,n}{ˆθ(λ)}, allowing Theorem 1 to be applied after replacing φ by ∇_θ log f_λ(z, θ) and η_2(z, α) by (z, α) ↦ ∂_θ log f_λ(z, θ) D + ∂_λ log f_λ(z, θ). Corollary 1 and Theorem 2 also hold after making similar modifications. The fact that the results of this article can be applied also when λ is tuned by minimising the AIC, BIC or TIC follows more or less directly from the forms of the criteria. We will now discuss the focused information criterion (FIC), a newer criterion for which the result is less immediate. The FIC was introduced in Claeskens and Hjort (2003), and is an information criterion with a different aim than the classic criteria discussed above. Rather than ranking models according to how well they fit the data overall, the FIC evaluates models by how well they estimate some pre-specified parameter of interest. This parameter of interest is called the focus parameter, and should be chosen to reflect the main goal of an analysis. If, for instance, the goal is to estimate the median in a population, the FIC prefers models for which the estimator of the median is good. The quality of an estimator is evaluated by its mean squared error (MSE), and hence the FIC of a model is the estimated MSE of that model's estimator of the focus parameter.
We will work with the relatively newer version of the criterion introduced in Jullum and Hjort (2017) and modified in Dæhlen et al. (2024). As before, let f_λ(z, θ) be some parametric model and ˆθ(λ) the maximizer of θ ↦ ℓ_{λ,n}(θ) = Σ_{i=1}^n log f_λ(Z_i, θ) for each λ. Let µ be the focus parameter with true value µ_0, and assume µ takes the value µ_λ(θ) in the model f_λ(z, θ). The FIC of the model f_λ(z, θ) is, for each λ, the MSE of µ_λ{ˆθ(λ)}. Dæhlen et al. (2024) give an estimator of this quantity: FIC(λ) = ˆb_λ^2 + n^{-1}(2 ˆb_λ ˆc_λ − ˆκ_λ + ˆτ_λ), where ˆc_λ, ˆκ_λ and ˆτ_λ are consistent estimators of certain population quantities defined in that paper, and ˆb_λ = µ_λ{ˆθ(λ)} − ˆµ_0 for some estimator ˆµ_0, which is assumed to be consistent for µ_0. Assume now that ˆµ_0 is the solution to an equation of the form 0 = n^{-1} Σ_{i=1}^n ξ(Z_i, µ). We then have FIC(λ) = n^{-1} Σ_{i=1}^n [µ_λ{ˆθ(λ)} − ˆµ_0]^2 + O_p(1/n), and hence, provided sufficient regularity, Theorem 1 can be applied by replacing θ(λ) by (θ(λ)^T, µ)^T, φ(z, θ, λ) by (φ(z, θ, λ)^T, ξ(z, µ_0))^T and η_2(z, α) by (θ, µ, λ, D) ↦ 2{µ_λ(θ) − µ}(D^T ∇_θ µ_λ(θ)^T, −1)^T. In Section 6.2, we illustrate this in an example.

Remark. It is worth noting that we require the criterion used for fitting λ to be differentiable as a function of the tuning parameter. Because of this, our theorems do not cover situations where λ is discrete. In particular, our theorems cannot be applied when an information criterion is used to e.g. choose between a finite number of models.

4 Cross-validation

When λ is tuned by minimising the training error, ˆθ(ˆλ) can be written as components of a Z-estimator. Because of this, the results of the previous section follow more or less directly from standard theory.
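The implicit-function relation defining ˆθ′(λ) — that D = ˆθ′(λ) solves n^{-1} Σ_i [∂_θ φ(Z_i, θ, λ) D + ∂_λ φ(Z_i, θ, λ)] = 0 — can be checked numerically. Below is a small sketch using the toy ridge score φ(z, θ, λ) = x(y − x^T θ) − λθ (our own example, not from the paper), comparing the implicit derivative with a finite-difference approximation:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 150, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)

def theta_hat(lam):
    # ridge estimator: solves n^{-1} sum_i phi(Z_i, theta, lam) = 0
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

lam = 0.4
th = theta_hat(lam)
# For phi(z, theta, lam) = x (y - x^T theta) - lam * theta:
#   d/dtheta Phi_n = -(X^T X / n + lam I),  d/dlam Phi_n = -theta.
# Solving (d/dtheta Phi_n) D + (d/dlam Phi_n) = 0 for D gives:
D = np.linalg.solve(X.T @ X / n + lam * np.eye(p), -th)
# Finite-difference check against a direct perturbation of lam:
eps = 1e-6
D_fd = (theta_hat(lam + eps) - theta_hat(lam - eps)) / (2 * eps)
print(np.max(np.abs(D - D_fd)))  # agrees up to discretization error
```

This is the same relation that the component ξ_2 encodes for general φ in the cross-validation analysis of this section.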
In practice, however, λ is seldom tuned with respect to TE, but is instead set to the minimizer of CV(λ), the estimate of the risk function obtained by cross-validation. In this section, we show that the difference between tuning with respect to TE and CV is negligible asymptotically and that all of the previously derived results can be applied also when λ̂ is the minimizer of CV. First, note that for each λ the vector (θ̂(λ)ᵀ, θ̂′(λ)ᵀ)ᵀ is the solution to the equation n⁻¹ Σ_{i=1}^n ξ(Z_i, θ, λ, D) = 0 with ξ(z, θ, λ, D) = (ξ₁(z, θ, λ)ᵀ, ξ₂(z, θ, λ, D)ᵀ)ᵀ, where ξ₁ = φ and ξ₂(z, θ, λ, D) = ∂_θφ(z, θ, λ) D + ∂_λφ(z, θ, λ). This follows from the definition of θ̂(λ) and the implicit function theorem. Because of this, the vector (θ̂(λ)ᵀ, θ̂′(λ)ᵀ)ᵀ is a Z-estimator, and under weak conditions it is consistent for some vector (θ₀(λ)ᵀ, D₀(λ)ᵀ)ᵀ and satisfies

(θ̂(λ)ᵀ, θ̂′(λ)ᵀ)ᵀ ≈ (θ₀(λ)ᵀ, D₀(λ)ᵀ)ᵀ + V(λ)⁻¹ n⁻¹ Σ_{i=1}^n ξ{Z_i, θ₀(λ), D₀(λ), λ},

where V(λ) = −∂_{θ₀(λ),D₀(λ)} E ξ(Z, θ, θ′, λ). See e.g. Chapter 5 of Van der Vaart (2000) for proofs and sufficient conditions. Because of this,

(θ̂(λ)ᵀ − θ̂^(−i)(λ)ᵀ, θ̂′(λ)ᵀ − θ̂′^(−i)(λ)ᵀ) ≈ n⁻¹ V(λ)⁻¹ ξ{Z_i, θ₀(λ), θ′₀(λ), λ} = O_p(1/n).

We formalize this idea in the following lemma.

Lemma 1. Let the parameter sets Θ for θ, D for θ̂′(λ) and Λ for λ be compact, and let ξ(z, θ, D, λ) = (φ(z, θ, λ)ᵀ, ξ₂(z, θ, λ, D)ᵀ)ᵀ, where ξ₂(z, θ, λ, D) = ∂_θφ(z, θ, λ) D + ∂_λφ(z, θ, λ). Assume the conditions of Lemma S1.3 in the supplementary material hold true for this ξ, and in particular, that m₀(z) bounds ξ(z, θ, D, λ) for all θ, D and λ.
Then ∥θ̂^(−i)(λ) − θ̂(λ)∥, ∥θ̂′^(−i)(λ) − θ̂′(λ)∥ ≤ a_{0,n} m₀(Z_i), where a_{0,n} = O_p(1/n) uniformly in λ and does not depend on i.

The proof is given in the supplementary material. With Lemma 1, we are ready to quantify the difference between CV′(λ) and TE′(λ).

Lemma 2. Assume all conditions of Lemma 1 hold true with E m₀(Z)⁴ < ∞. Then

(1/n) Σ_{i=1}^n θ̂′^(−i)(λ)ᵀ ∇_{θ̂^(−i)(λ)} ψ(Z_i, θ) = (1/n) Σ_{i=1}^n θ̂′(λ)ᵀ ∇_{θ̂(λ)} ψ(Z_i, θ) + δ_n(λ),   (4)

where ∥δ_n(λ)∥ = O_p(1/n) uniformly in λ, provided there exist square integrable functions p₁ and p₂ not depending on θ that bound respectively all first and second order partial derivatives of ψ with respect to θ.

Proof. By Lemma 1, θ̂^(−i)(λ) = θ̂(λ) + Rⁱ_n(λ) and θ̂′^(−i)(λ) = θ̂′(λ) + Sⁱ_n(λ) with ∥Rⁱ_n(λ)∥, ∥Sⁱ_n(λ)∥ ≤ a_{0,n} m₀(Z_i). A Taylor expansion reveals that the left hand side of (4) is equal to n⁻¹ Σ_{i=1}^n {θ̂′(λ) + Sⁱ_n(λ)}ᵀ {∇_{θ̂(λ)}ψ(Z_i, θ) + H_{θ*_i}ψ(Z_i, θ) Rⁱ_n(λ)} for some θ*_i on the line segments between θ̂(λ) and θ̂^(−i)(λ). Utilizing the existence of p₁ and p₂, we get that ∥CV′(λ) − TE′(λ)∥ is bounded by

a_{0,n} ∥θ̂′(λ)∥ n⁻¹ Σ_{i=1}^n p₂(Z_i)m₀(Z_i) + a_{0,n} n⁻¹ Σ_{i=1}^n p₁(Z_i)m₀(Z_i) + a²_{0,n} n⁻¹ Σ_{i=1}^n p₂(Z_i)m₀(Z_i)².

Since E p₁(Z)², E p₂(Z)² and E m₀(Z)⁴ exist, the means converge. This ensures ∥CV′(λ) − TE′(λ)∥ = O_p(1/n) uniformly in λ, as D is compact. ■

Since CV′(λ) and TE′(λ) are uniformly close as functions of λ by the above lemma, we would expect θ̂(λ̂) to behave similarly asymptotically when tuning with respect to CV and TE. The following theorem confirms this intuition.

Theorem 5.
Let λ̂_TE be the minimizer of TE and λ̂_CV the minimizer of CV, and assume that the conditions of Theorem 2 and Lemma 2 hold true. Then √n{θ̂(λ̂_TE) − θ₀} and √n{θ̂(λ̂_CV) − θ₀} are asymptotically equivalent. In particular, Theorem 1, Corollary 1 and Theorem 2 hold true for θ̂(λ̂_CV), and the limiting distribution of √n{θ̂(λ̂_CV) − θ₀} can be estimated as in Theorem 4.

Proof. Let α̂_CV = (θ̂(λ̂_CV)ᵀ, λ̂_CVᵀ, θ̂′(λ̂_CV)ᵀ)ᵀ. Since TE′(λ) = CV′(λ) + O_p(1/n) uniformly in λ, TE′(λ̂_CV) = O_p(1/n). Hence, n⁻¹ Σ_{i=1}^n η(Z_i, α̂_CV) = O_p(1/n). Since n⁻¹ Σ_{i=1}^n η(Z_i, α̂_CV) = o_p(1/√n) is all that is needed for Van der Vaart (2000, Th. 5.21), the conclusion of Theorem 1 follows for α̂_CV as well. This ensures that √n(α̂ − α₀) and √n(α̂_CV − α₀) are asymptotically equivalent. Theorem 5 follows from this fact. ■

4.1 A more precise approximation

Theorem 5 is a somewhat surprising result, as it tells us that, asymptotically speaking, it makes no difference whether we tune with respect to TE or CV. Yet, in most applications, minimising TE and CV gives very different results. We will now take a closer look at what the difference between CV and TE really is. To make the derivations more readable, we only show pointwise results for θ̂(λ) and drop λ from the notation. Arguing similarly as in the paragraph preceding Lemma 1, we can show θ̂^(−i) − θ̂ ≈ −n⁻¹ J⁻¹ φ(Z_i, θ₀), where θ₀ is the solution to E φ(Z, θ) = 0 and J = −∂_{θ₀} E φ(Z, θ). This approximation is not particularly surprising. The function z ↦ J⁻¹φ(z, θ₀) is called the influence function of θ̂, as it in some sense tells how much a single data point affects the estimation of θ. See e.g. Huber (2009) for more details on influence functions.
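The approximation θ̂^(−i) − θ̂ ≈ −n⁻¹J⁻¹φ(Z_i, θ₀) can be checked numerically. The sketch below does so for a deliberately simple hypothetical case, the sample mean, where φ(z, θ) = z − θ and J = 1, so that both the leave-one-out estimators and the influence-function approximation are available in closed form.

```python
import numpy as np

# Leave-one-out differences versus the influence-function approximation
# for the sample mean: phi(z, theta) = z - theta and J = 1, so
# theta_hat^(-i) - theta_hat should be close to -(Z_i - theta_hat)/n.
rng = np.random.default_rng(1)
n = 200
z = rng.exponential(size=n)
theta_hat = z.mean()
loo = (z.sum() - z) / (n - 1)          # all n leave-one-out estimators at once
exact_diff = loo - theta_hat           # exact theta_hat^(-i) - theta_hat
approx_diff = -(z - theta_hat) / n     # -(1/n) J^{-1} phi(Z_i, theta_hat)
err = np.max(np.abs(exact_diff - approx_diff))
# err is O(1/n^2): an order of magnitude below the differences themselves
```

Here the approximation error is exactly (θ̂ − Z_i)/{n(n − 1)}, in line with Lemma 1's bound a_{0,n} m₀(Z_i) with a_{0,n} = O_p(1/n).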
Since θ̂^(−i) is the estimator one gets instead of θ̂ if the i-th data point is removed, it is natural that the difference between θ̂^(−i) and θ̂ is precisely the "influence" of Z_i. Assuming ψ is sufficiently smooth in θ, a Taylor expansion shows

CV ≈ n⁻¹ Σ_{i=1}^n ψ(Z_i, θ̂) + n⁻¹ Σ_{i=1}^n (θ̂^(−i) − θ̂)ᵀ ∇_{θ̂}ψ(Z_i, θ).

A further Taylor expansion of ∇_{θ̂}ψ(Z_i, θ) around θ₀ reveals

CV ≈ n⁻¹ Σ_{i=1}^n ψ(Z_i, θ̂) + n⁻¹ Σ_{i=1}^n (θ̂^(−i) − θ̂)ᵀ ∇_{θ₀}ψ(Z_i, θ) ≈ n⁻¹ Σ_{i=1}^n ψ(Z_i, θ̂) − n⁻² Σ_{i=1}^n φ(Z_i, θ₀)ᵀ (J⁻¹)ᵀ ∇_{θ₀}ψ(Z_i, θ).

By the law of large numbers and properties of the trace operator, we hence get CV ≈ TE − n⁻¹ Tr(J⁻¹C), where C = E φ(Z, θ₀)∇_{θ₀}ψ(Z, θ)ᵀ. The following theorem confirms these informal arguments. A full proof is given in the supplementary material.

Theorem 6. Let θ̂ and θ̂^(−i) be the solutions to Σ_{j=1}^n φ(Z_j, θ) = 0 and Σ_{j≠i} φ(Z_j, θ) = 0, respectively. Assume further that θ̂, θ̂^(−i) ∈ Θ ⊆ R^p, where Θ is compact, and that there exist functions m₀, m₁, p₁ and p₂ bounding the norms of φ, all partial derivatives of φ with respect to θ, ψ and all partial derivatives of ψ with respect to θ, respectively. Then, for a given function ψ(z, θ), we have

n⁻¹ Σ_{i=1}^n ψ(Z_i, θ̂^(−i)) = n⁻¹ Σ_{i=1}^n ψ(Z_i, θ̂) − n⁻¹ Tr(J⁻¹C) + o_p(1/n),   (5)

where J = −∂_{θ₀} E φ(Z, θ) and C = E φ(Z, θ₀)∇_{θ₀}ψ(Z, θ)ᵀ, with θ₀ being the solution to E φ(Z, θ) = 0, provided the eigenvalues of {∂_θ E φ(Z, θ)}⁻¹ are bounded in Θ and E m₀(Z)⁴, E m₁(Z)⁴, E p₁(Z)⁴, E p₂(Z)² < ∞.

When φ(z, θ) = ∇_θ log f(z, θ) for some density or probability mass function f, J is the Fisher information matrix in the model. Furthermore, C = K if ψ = log f, where K = Var φ(Z, θ₀).
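A quick numerical sanity check of Theorem 6 is possible in a hypothetical toy setting where everything is explicit: θ̂ is the sample mean (φ(z, θ) = z − θ, J = 1) and ψ(z, θ) = (z − θ)², for which C = E[(Z − θ₀)·(−2)(Z − θ₀)] = −2 Var(Z), so the theorem predicts CV = TE + 2 Var(Z)/n + o_p(1/n).

```python
import numpy as np

# Toy check of Theorem 6: theta_hat = sample mean, psi(z, theta) = (z - theta)^2.
# Here J = 1 and C = -2 Var(Z), so CV should equal TE + 2 Var(Z)/n up to o_p(1/n).
rng = np.random.default_rng(2)
n = 500
z = rng.normal(loc=1.0, scale=2.0, size=n)
theta_hat = z.mean()
te = np.mean((z - theta_hat) ** 2)           # training error
loo = (z.sum() - z) / (n - 1)                # leave-one-out means
cv = np.mean((z - loo) ** 2)                 # leave-one-out cross-validation
correction = 2 * z.var() / n                 # -(1/n) Tr(J^{-1} C)
gap = cv - (te + correction)                 # should be o(1/n)
```

The residual gap is of order 1/n², far below both the CV-TE difference and 1/n, matching the o_p(1/n) remainder in (5).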
Hence, Theorem 6 simplifies to Σ_{i=1}^n ψ(Z_i, θ̂^(−i)) = TIC + o_p(1), where TIC is Takeuchi's information criterion, defined in Section 3. Additionally, if the model f is specified correctly, K = J and Σ_{i=1}^n ψ(Z_i, θ̂^(−i)) = ℓ_n(θ̂) − p + o_p(1). The right hand side of this equation is easily recognized as Akaike's information criterion (AIC), leading Theorem 6 to simplify to the results of Stone (1977) when φ(z, θ) = ∇_θ log f(z, θ) and the parametric model is specified correctly. Theorem 6 shows that CV and TE are more similar than one might initially think. In fact, the only thing separating the two quantities is a term of order O_p(1/n), which is dominated by the size of TE itself, asymptotically speaking. In smaller samples, however, n⁻¹ Tr(J⁻¹C) might be quite large and will ensure that tuning with respect to TE and CV gives different results. That being said, there are cases where n⁻¹ Tr(J⁻¹C) is not negligible and the difference between TE and CV matters also in the limit. We will now turn our attention towards such a case.

4.2 When more precision is needed

For each parameter θ, let R(θ) be some risk function, and let θ̂ be some estimator of θ based on data Z₁, ..., Z_n. Assume further that θ̂ →_Pr θ₀ and √n(θ̂ − θ₀) →_d N(0, V) for some matrix V. By Taylor expanding R around θ₀ and using the law of total expectation, one can show that, provided sufficient regularity,

E{R(θ̂)} = R(θ₀) + n⁻¹[∇_{θ₀}R(θ)ᵀ c + (1/2) Tr{H_{θ₀}R(θ) V}] + o(1/n),   (6)

where c is equal to the limit of n E(θ̂ − θ₀); see Dæhlen et al. (2024) for more on this parameter. If θ̂ ∈ R and R(θ) = (θ₀ − θ)², (6) reduces to the classical bias-variance decomposition of the MSE. With R(θ) equal to the error rate of certain classification procedures, (6) reduces to the expressions given in O'Neill (1980).
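For the mean-squared-error case R(θ) = (θ₀ − θ)² with θ̂ a sample mean, the expansion (6) has R(θ₀) = 0, a vanishing gradient term, and (1/2)Tr{H_{θ₀}R(θ)V}/n = Var(Z)/n, which a Monte Carlo experiment can confirm; the sample size, replication count and distribution below are arbitrary illustrative choices.

```python
import numpy as np

# Monte Carlo check of (6) for R(theta) = (theta - theta_0)^2 and
# theta_hat the mean of n standard normal draws: E R(theta_hat) ~ Var(Z)/n.
rng = np.random.default_rng(3)
n, B = 50, 20000
theta0, sigma2 = 0.0, 1.0
samples = rng.normal(theta0, np.sqrt(sigma2), size=(B, n))
theta_hats = samples.mean(axis=1)            # B independent copies of theta_hat
risk = np.mean((theta_hats - theta0) ** 2)   # Monte Carlo estimate of E R(theta_hat)
predicted = sigma2 / n                       # the O(1/n) term of (6)
```

With the gradient term absent, the whole expected risk lives in the O(1/n) term, which is exactly the regime discussed next.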
Note that (6) consists of three parts: a constant part, R(θ₀), a term of size O(1/n) and a remainder term of smaller order. For two estimators θ̂₁ and θ̂₂ consistent for different values θ₀¹ and θ₀² respectively, the difference between their expected risks will asymptotically be dominated by how much R(θ₀¹) and R(θ₀²) differ, making the O(1/n) terms in (6) negligible asymptotically. If, on the other hand, θ₀¹ and θ₀² are equal, the difference in expected risk for θ̂₁ and θ̂₂ will be dominated by the difference between the two corresponding terms of size O(1/n) in (6). In this latter situation, we therefore need to estimate the O(1/n)-effect from data in order to properly distinguish the risk of θ̂₁ from that of θ̂₂. Both TE and CV attempt to estimate E R(θ̂), but, as we will see, neither estimator is precise enough to capture the O(1/n) effects. Let R_n(θ) = n⁻¹ Σ_{i=1}^n ψ(Z_i, θ) and let θ̂ be some estimator consistent for θ₀. Assuming ψ is sufficiently smooth, a Taylor expansion shows

TE = R_n(θ₀) + (θ̂ − θ₀)ᵀ ∇_{θ₀}R_n(θ) + (1/2)(θ̂ − θ₀)ᵀ H_{θ₀}R_n(θ)(θ̂ − θ₀) + o_p(1/n).   (7)

Since R_n(θ₀) is a mean and θ̂ →_Pr θ₀, we have TE →_Pr R(θ₀), which means that TE does a good job as far as estimating the constant part of (6) goes. Approximating the term of order O(1/n) is, however, another story. The most obvious reason is perhaps that the remaining terms are nowhere near being consistent for n⁻¹[∇_{θ₀}R(θ)ᵀ c + (1/2) Tr{H_{θ₀}R(θ) V}], but even if they were, TE would not be precise enough to properly catch effects of size O(1/n), as R_n(θ₀) = R(θ₀) + O_p(1/√n) by the central limit theorem.
Since a term of order O_p(1/√n) will dominate O(1/n) effects, the O(1/n)-terms in (6) are too small to be captured by TE and will be washed out by the error of R_n(θ₀). As a result, TE is not well suited for evaluating estimators whose limiting risks differ only in the O(1/n)-term. One might hope that this is a defect of TE only, and that CV will be precise enough to capture the O(1/n) effects. Sadly, Theorem 6 stops this idea dead in its tracks. Since the difference between CV and TE is a term of order O_p(1/n), the added precision of CV is not sufficient to correct for the O_p(1/√n) error TE makes. This is pointed out in the context of model selection for linear regression models in Shao (1993). The above might lead one to believe that the difference between CV and TE is negligible and that there is no reason to use CV rather than TE in an analysis. This conclusion is, however, slightly too negative, as there is an important difference between CV and TE in expected value. To see this, assume that E ψ(Z, θ₀) = 0 and that θ̂ = θ₀ + J⁻¹ n⁻¹ Σ_{i=1}^n φ(Z_i, θ₀) + o_p(1/√n). Then, by the central limit theorem, √n(θ̂ − θ₀) and √n{R_n(θ₀) − R(θ₀)} converge jointly in law to a central normal distribution with variance matrix

Σ = ( J⁻¹K(J⁻¹)ᵀ   J⁻¹Cᵀ
      C(J⁻¹)ᵀ      Var ψ(Z, θ₀) ),

where J and C are as in Theorem 6 and K = Var φ(Z, θ₀). Hence, provided all necessary quantities are uniformly integrable (see e.g. Billingsley (1999, p. 31)),

E(TE) = R(θ₀) + n⁻¹ Tr(CJ⁻¹) + (1/2) n⁻¹ Tr{H_{θ₀}R(θ) J⁻¹K(J⁻¹)ᵀ} + o(1/n),

which misses E R(θ̂) by a term n⁻¹ Tr(J⁻¹C) + o(1/n). Because of this, TE is a biased estimator of E R(θ̂). CV, on the other hand, satisfies CV = TE − n⁻¹ Tr(J⁻¹C) + o_p(1/n) by Theorem 6.
Hence, arguing similarly as for TE, we get E CV = E R(θ̂) + o(1/n), ensuring that, in expected value, CV is able to separate estimators whose risks differ only in the O(1/n)-term. We stress that this holds in expected value only, and that the above shows nothing more than the fact that the bias of CV is of order o(1/n). As noted before, the error of CV will be of order O_p(1/√n) for any single sample, a term which is too large to properly capture effects of size O(1/n). The discussion in the previous paragraphs has consequences for our theorems concerning the limiting behaviour of θ̂(λ̂). By the arguments above, neither TE nor CV is able to separate estimators whose risks differ only in the O(1/n)-term. Hence, when tuning λ with respect to CV or TE, we should not expect the criteria to be able to identify the optimal value of λ in this O(1/n)-regime. This also explains why tuning with respect to TE is asymptotically equivalent to tuning with respect to CV. The fact that the theory breaks down when θ₀(λ), the solution to E φ(Z, θ, λ) = 0, is constant as a function of λ can also be seen from Theorem 1. When θ₀(λ) is constant as a function of λ, the equation Ψ(α) = 0 has multiple solutions and Ψ′(α₀) will be singular. Because of this, the conditions of Theorem 1 do not hold true when θ₀(λ) is constant, making the theorem inapplicable in this case. This does not pose a problem for regularized estimators like in ridge regression, as θ₀(λ) will rarely be constant in that setting, but for hybrid estimators, this can indeed be quite problematic.
Take for instance the hybrid generative-discriminative classification model where θ̂(λ) is set to the maximizer of n⁻¹ Σ_{i=1}^n {λ log f_{Y|X}(Y_i | X_i, θ) + (1 − λ) log f_{(Y,X)}(Y_i, X_i, θ)} for some parametric model f_{(Y,X)} for (Y, X), and where f_{Y|X} is derived from f_{(Y,X)}. In this case, θ̂(λ) aims for the maximizer of Γ_λ(θ) = λ E log f_{Y|X}(Y | X, θ) + (1 − λ) E log f_{(Y,X)}(Y, X, θ). If the model is misspecified, θ₀(λ) will rarely be constant as a function of λ. If, on the other hand, the model does hold and the true underlying distribution is f_{(Y,X)}(·, θ₀) for some θ₀ ∈ R^p, θ₀(λ) will be constantly equal to θ₀. Because of this, the difference in expected limiting risk will be of order O(1/n) when the model is specified correctly and O(1) otherwise. Hence, Theorem 2 and Theorem 5 can be applied in this case only when the model is misspecified. See Section S2 in the supplementary material for more discussion.

5 When the effect of tuning disappears

We will now go through a situation in which the expressions in Corollary 1 simplify greatly and the effect of the tuning procedure becomes negligible asymptotically. Assume there exists a λ₁ such that φ(y, θ, λ₁) = ∇_θψ(y, θ) for all θ and F-almost all y. Then, θ₀(λ₁) solves E φ(Z, θ, λ₁) = ∇_θ E ψ(Z, θ) = 0. Because of this, λ₀, b and Z₁ defined in Corollary 1 are equal to λ₁, 0 and J, respectively. Inserting this into the expressions of the corollary shows that the right hand side of (2) is equal to n⁻¹ J⁻¹ φ(z, θ₀, λ₁). This is the influence function of θ̂(λ₁). Hence, √n{θ̂(λ̂) − θ₀} = √n{θ̂(λ₁) − θ₀} + o_p(1) in this case, and the additional randomness introduced by tuning λ vanishes asymptotically.
In practice, this means that one can ignore the effect of tuning λ and use the classic approximation √n{θ̂(λ̂) − θ₀} →_d N{0, J⁻¹K(J⁻¹)ᵀ}, where J = −∂_{θ₀} E φ(Z, θ, λ₁) and K = E φ(Z, θ₀, λ₁)φ(Z, θ₀, λ₁)ᵀ. A λ₁ such that ∇_θψ(z, θ) = φ(z, θ, λ₁) actually exists in quite a few situations. In, for instance, ridge regression where λ is set to the minimizer of estimated MSE, we have φ(z, β, λ) = ∇_β{(y − β₁:ᵀx − β₀)² + λ∥β₁:∥²}, where β = (β₀, β₁:ᵀ)ᵀ, ensuring ∇_βψ(z, β) = φ(z, β, 0). More generally, for any model Y_i = g(X_i, θ) + ε_i fitted by minimising Σ_{i=1}^n {Y_i − g(X_i, θ)}² + λ∥θ∥² and tuned according to the MSE, ∇_θψ(z, θ) = φ(z, θ, 0). Even more generally, for a model fitted by minimising Σ_{i=1}^n h{Y_i, g(X_i, θ)} + λ∥θ∥² for some function h, and where λ is tuned by minimising the CV or TE estimate of E h[Y, g{X, θ̂(λ)}], ∇_θψ(z, θ) = φ(z, θ, 0), ensuring that √n{θ̂(λ̂) − θ₀} and √n{θ̂(0) − θ₀} are asymptotically equivalent. This includes many regularized likelihood models when the cross-validated Kullback-Leibler divergence is used to tune λ. The above arguments might lead one to believe that there is rarely a need to use Theorem 2 or Theorem 5 in practice. This is, however, not correct. To illustrate the potentially large effect tuning of λ can have on the limiting distribution of θ̂(λ̂), we performed a small simulation experiment. For a selection of λ-values, we fitted a logistic regression in two variables with a ridge penalty term. This corresponds to setting φ equal to the gradient of y(β₀ + β₁x₁ + β₂x₂) − log{1 + exp(β₀ + β₁x₁ + β₂x₂)} − λ(β₁² + β₂²) with respect to β = (β₀, β₁, β₂). We chose λ̂ by minimising CV(λ) with ψ(z, θ) = [exp(β₀ + β₁x₁ + β₂x₂)/{1 + exp(β₀ + β₁x₁ + β₂x₂)} − y]².
For the true underlying distribution of the data, we chose the following. The binary variable Y was set to be Bernoulli distributed with equal probability for zero and one. Furthermore, for a selection of C-values, we used X | Y = 0 ∼ N(μ, Σ) and X | Y = 1 ∼ N(−μ, 2Σ) with μ = (1/2, 1/2)ᵀ and Σ equal to the matrix with diagonal elements 2 and off-diagonal entries −√(C/2). For each selected C-value, we simulated n = 100 data points and computed λ̂_C and β̂_C(λ̂_C) based on the data set. We used cross-validation to tune λ. This procedure was repeated B = 1000 times for each C, and based on these B samples, we computed the empirical variance of √n β̂_C(λ̂_C) for each C. We also estimated the variance using both the classic method and Theorem 5. The results are summarized in Figure 2, which shows the errors of the two estimators as functions of C. From Figure 2 it is clear that the classic estimator is insufficient in this case. The variance of Theorem 5 is closer to the observed variability in almost all cases. Furthermore, we see that the accuracy of the classic variance estimate deteriorates as C → 0. This is particularly clear for the estimate of the variance of the last component of √n β̂_C(λ̂_C), where the variance estimator provided by the classic method is a lot worse than the one proposed in this article, and its error seems to increase even more as C → 0.
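A condensed sketch of a simulation of this kind is given below. It is not the code behind Figure 2: to stay short and fast it uses a coarse λ-grid, 5-fold rather than leave-one-out cross-validation, a single simulated data set instead of B = 1000 replications, and a plain Newton solver for the penalised logistic likelihood; all of these choices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def fit_ridge_logistic(X, y, lam, iters=25):
    """Newton steps for the ridge-penalised logistic log-likelihood;
    the intercept (first column of X) is left unpenalised."""
    n, p = X.shape
    pen = 2 * n * lam * np.r_[0.0, np.ones(p - 1)]   # from the n*lam*||beta_1:||^2 term
    beta = np.zeros(p)
    for _ in range(iters):
        mu = 1 / (1 + np.exp(-np.clip(X @ beta, -30, 30)))
        grad = X.T @ (y - mu) - pen * beta
        H = (X * (mu * (1 - mu))[:, None]).T @ X + np.diag(pen)
        beta = beta + np.linalg.solve(H, grad)
    return beta

def simulate(n, C):
    """Y ~ Bernoulli(1/2); X|Y=0 ~ N(mu, Sig), X|Y=1 ~ N(-mu, 2*Sig)."""
    y = rng.integers(0, 2, n)
    mu = np.array([0.5, 0.5])
    Sig = np.array([[2.0, -np.sqrt(C / 2)], [-np.sqrt(C / 2), 2.0]])
    X = rng.standard_normal((n, 2)) @ np.linalg.cholesky(Sig).T
    X[y == 0] += mu
    X[y == 1] = np.sqrt(2) * X[y == 1] - mu
    return np.c_[np.ones(n), X], y.astype(float)

def cv_brier(X, y, lam, folds=5):
    """K-fold cross-validated Brier-type score (the text uses leave-one-out)."""
    idx = np.arange(len(y)) % folds
    err = 0.0
    for k in range(folds):
        b = fit_ridge_logistic(X[idx != k], y[idx != k], lam)
        p_hat = 1 / (1 + np.exp(-X[idx == k] @ b))
        err += np.sum((y[idx == k] - p_hat) ** 2)
    return err / len(y)

lams = [0.0, 0.01, 0.1, 1.0]
X, y = simulate(n=100, C=1.0)
lam_hat = min(lams, key=lambda l: cv_brier(X, y, l))
beta_hat = fit_ridge_logistic(X, y, lam_hat)
```

Wrapping the last four lines in a replication loop and recording √n β̂_C(λ̂_C) across replications would give the empirical variances that Figure 2 compares against the two variance estimators.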
This simulation is of course only a toy example, but Figure 2 indicates that the error introduced by using the classic estimator of the variance can potentially be very large.

Figure 2: The plot shows the absolute error of the classic and new estimated covariances of √n β̂_C(λ̂_C) compared to the observed variance, based on a thousand simulated data sets with n = 100 data points. The variables V_{jk} refer to the covariance between the j-th and k-th components of √n β̂_C(λ̂_C) for j, k = 0, 1, 2.

6 Examples

We will now go through a couple of applications illustrating our theoretical framework.

6.1 Diabetes in Pima Indians

The data set pima.indians.diabetes, publicly available in the R package mlbench, contains information about female Pima Indians and the prevalence of diabetes. Denote the data set by z₁, ..., z_n, where z_i = (y_i, x_iᵀ)ᵀ with x_i being a vector of covariates and y_i a binary variable encoding whether or not individual i has diabetes, for i = 1, ..., n. We fit a penalized logistic regression model to these data. For some pre-specified value of λ, this corresponds to maximising the following expression:

ℓ_n(β, λ) = Σ_{i=1}^n [y_i log p(x_i, β) + (1 − y_i) log{1 − p(x_i, β)}] − nλ∥β₁:∥²,   (8)

where p(x, β) = [1 + exp{−β₀ − β₁:ᵀx}]⁻¹ when β = (β₀, β₁:ᵀ)ᵀ. To fit this model, we need to decide upon a value for the tuning parameter λ. One option is to choose the λ minimising the Brier score, defined as BS = n⁻¹ Σ_{i=1}^n [y_i − p{x_i, β̂(λ)}]², where β̂(λ) is the maximizer of (8). This ensures that the model has good predictive ability.
In most cases, however, BS will underestimate the true error. To mitigate this, we instead choose the λ minimising the cross-validated estimate of the Brier score, defined as n⁻¹ Σ_{i=1}^n [y_i − p{x_i, β̂^(−i)(λ)}]². With the procedure described above, we find that the minimizer of the cross-validated Brier score is λ̂ = 0.0085. Fixing λ at this value, we can now fit a penalized logistic regression model to the data by maximising (8) evaluated at λ = λ̂. Furthermore, by classic asymptotic likelihood theory, we know that for each fixed λ, β̂(λ) is consistent for some well-defined quantity β(λ) and that √n{β̂(λ) − β(λ)} converges in law to a N{0, J(λ)⁻¹K(λ)J(λ)⁻¹} distribution, where J(λ) and K(λ) are well-defined matrices; see e.g. Van der Vaart (2000, Ch. 5). By considering λ as being fixed at λ̂, we can therefore approximate the distribution of β̂(λ̂) by a N{β(λ̂), J(λ̂)⁻¹K(λ̂)J(λ̂)⁻¹/n} distribution and use it to make inference about β(λ̂). The resulting estimated marginal densities of each component of β̂(λ̂) are shown in Figure 1. We remark that the very small value of λ̂ is a consequence of the penalization term being scaled by n, leading the "actual" size of the penalization to be nλ̂ ≈ 3.342. The above approximate distribution is based on pointwise results which do not take the randomness introduced by tuning λ into account. This can, however, be done with the results of this paper. Assuming all regularity conditions are in place, we can apply the theory of this article to β̂(λ̂) by setting φ(z, β, λ) = ∇_β[y log p(x, β) + (1 − y) log{1 − p(x, β)} − λ∥β₁:∥²] and ψ(z, β) = [y − p(x, β)]².
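The classic, fixed-λ sandwich approximation J(λ̂)⁻¹K(λ̂)J(λ̂)⁻¹/n described above can be sketched as follows. Since the Pima data ship with the R package mlbench rather than with Python, the sketch uses a synthetic stand-in with an intercept and two covariates; the data-generating mechanism, the Newton solver and the number of observations are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400
X = np.c_[np.ones(n), rng.standard_normal((n, 2))]     # intercept + 2 covariates
beta_true = np.array([-0.5, 1.0, -1.0])
y = (rng.random(n) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)
lam = 0.0085
mask = np.array([0.0, 1.0, 1.0])                        # intercept unpenalised

beta = np.zeros(3)
for _ in range(30):   # Newton steps maximising the penalised likelihood (8)
    mu = 1 / (1 + np.exp(-np.clip(X @ beta, -30, 30)))
    grad = X.T @ (y - mu) - 2 * n * lam * mask * beta
    H = (X * (mu * (1 - mu))[:, None]).T @ X + 2 * n * lam * np.diag(mask)
    beta = beta + np.linalg.solve(H, grad)

mu = 1 / (1 + np.exp(-X @ beta))
phi = X * (y - mu)[:, None] - 2 * lam * mask * beta     # per-observation score
K_hat = phi.T @ phi / n
J_hat = ((X * (mu * (1 - mu))[:, None]).T @ X + 2 * n * lam * np.diag(mask)) / n
var_hat = np.linalg.solve(J_hat, K_hat) @ np.linalg.inv(J_hat) / n
se = np.sqrt(np.diag(var_hat))   # classic, fixed-lambda standard errors
```

As the remainder of this section argues, these fixed-λ standard errors ignore the randomness in λ̂ and can understate the variability of β̂(λ̂).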
By Theorem 5, √n{β̂(λ̂) − β₀} converges in distribution to a central normal distribution with variance as described in Corollary 1, which can be estimated using the approximations of Theorem 4. For each j = 0, ..., 8, densities of the marginal approximate distributions of β̂(λ̂)_j, together with the densities obtained using the classic, pointwise results, as well as histograms based on B = 2000 bootstrap samples of β̂(λ̂)_j, can be found in Figure 1, which was discussed briefly in the introduction. From Figure 1 we note that when taking the added randomness of the tuning procedure into account, we get larger estimated variances than when using the classic pointwise results. This is promising, as the estimated variance of β̂(λ̂)_j was closer to the empirical variance in the bootstrap sample for all but one value of j. For β̂(λ̂)₂, the classic variance estimator performed better than what we get from Theorem 4. A quick glance at Figure 1, however, reveals that neither of the approximate distributions seems to fit well with the bootstrap distribution in this case. In fact, the histogram of this component is quite skewed, giving reason to believe that convergence has not been achieved for this particular parameter. Although the results above are slightly in favour of the theorems derived in the present article, the limiting distributions we get using our method and the classic method are not astoundingly different. It is, however, important to remember that the classic result is a pointwise result concerning the limiting behaviour of β̂(λ) when λ is a fixed number. Because of this, the classic result does not really apply to β̂(λ̂) and cannot be used to say anything about the limiting behaviour when λ̂ is set by tuning, as neither asymptotic normality nor consistency of β̂(λ̂) follows from standard theory.
By using the classic pointwise method regardless of this fact, we are therefore really taking a gamble, hoping that the added randomness of the tuning procedure will be negligible. Theorem 5 and Theorem 4, on the other hand, do take the tuning procedure into account and can safely be applied to make valid asymptotic inference.

6.2 Hybrid likelihood

In the book The better angels of our nature, Pinker (2011) argues that the world is changing for the better and that violence is on the decline. This claim quickly stirred up heated debate in the academic community, and since its publication a large number of articles have either attempted to support or refute Pinker's claim; see e.g. Cirillo and Taleb (2016a,b) or Cunen et al. (2020). Recently, Dæhlen and Hjort (2025) investigated whether conflicts have become less violent over time. We will now use the theory derived in the present article to improve the analysis done in that paper. Dæhlen and Hjort (2025) use data from the Correlates of War database (Sarkees and Wayman, 2010) concerning the number of battle deaths in the 95 most recent and concluded inter-state wars. The analysis is done by first splitting the data into "older" and "newer" wars, with the Korean war as the cut-off point. To each of these data sets, they fit shifted log-normal models and use the result to make inference about the difference in the median number of battle deaths for older and newer wars. They also repeat the analysis for the difference in upper quartiles. Further, to fit the parametric models, the framework of hybrid likelihood is used. We will not go into details concerning this method, but roughly speaking, they fit parametric models f_θ by maximising a convex combination of the log-likelihood function and a nonparametric counterpart called the log-empirical likelihood function; see e.g. Owen (2001).
How much weight is put on each of these components is decided by a parameter called the balance parameter, a, taking values in [0, 1). Moreover, the FIC is used to select values for the balance parameters a₁ and a₂. When investigating the change in the median number of battle deaths, the median is used as focus parameter and the empirical median as the consistent estimator μ̂₀. When studying the difference in upper quartiles, the 0.75-quantile and the empirical 0.75-quantile are used in place of the median and empirical median. By standard results, the empirical p-quantile satisfies o_p(1/√n) = n⁻¹ Σ_{i=1}^n ξ(Y_i, μ), where ξ(y, μ) = μ − μ_p − {p − I(y ≤ μ_p)}/f(μ_p) and μ_p is the p-quantile of the underlying distribution of the data. Furthermore, by the results of Dæhlen and Hjort (2025), the estimators θ̂₁(a₁) and θ̂₂(a₂) can both be expressed as components of Z-estimators for each fixed value of a₁ and a₂. Here θ̂₁(a₁) and θ̂₂(a₂) are the maximum hybrid likelihood estimators for the older and newer wars, respectively, using the balance parameter given in the parenthesis. Hence, by making modifications as explained in Section 3, we can use Corollary 1 to derive the limiting distributions of θ̂₁(â₁) and θ̂₂(â₂). By combining the above with the estimators given in Theorem 4, we can approximate the variance of g{θ̂₁(â₁)} − g{θ̂₂(â₂)}, where g(θ) is the median of a log-normal distribution parametrized by θ, in a way that takes the added randomness introduced by tuning a₁ and a₂ into account. Using this method, we get an estimated standard deviation of 18431. That is a lot higher than 3815, which is obtained using the pointwise results in Dæhlen and Hjort (2025).
This is a consequence of Dæhlen and Hjort (2025) not incorporating the added randomness introduced by tuning a₁ and a₂, leading to a much too small estimated standard deviation. For the difference in upper quartiles, â₁ = 0 and â₂ is chosen as the largest accepted value of a₂ (in this case 0.99). Because of this, FIC′_j(â_j) ≠ 0 for j = 1, 2, and Theorem 2 leads us to approximate the limiting distribution of θ̂_j(â_j) by the pointwise limit of θ̂_j(â_j) for j = 1, 2. This leads to the same approximate distribution as is used in Dæhlen and Hjort (2025), and hence, all estimated standard deviations agree. As a result, ignoring the effect of tuning a₁ and a₂ is unproblematic in this case.

7 Concluding remarks

We have studied the effect tuning procedures have on the limiting distribution of θ̂(λ̂). We covered multiple ways of tuning, including minimization of information criteria, of the training error and of the cross-validated estimator of the risk. In addition, we have defined consistent estimators for the limiting variance of √n θ̂(λ̂) and proved a result sharply characterizing the distance from CV to TE. Lastly, we went through a simulation setting and applied the theory to two real data sets. Our work does have some limitations. Perhaps the most pressing issue with our current theory is its inability to handle non-smooth functions φ and ψ. This excludes multiple interesting settings. Since ψ must be smooth, tuning with respect to the absolute error or the error rate is not covered by our theory. Furthermore, smoothness of φ excludes many interesting models, like the Lasso and other models penalized by an L₁-norm. This does, of course, make our theory less applicable, but the smoothness assumptions are crucial for our proofs, and we believe that extending our theory to cover non-smooth settings is far from trivial.
Further, we have assumed that for each λ, ˆθ(λ) is a Z-estimator. This assumption is necessary for the proofs of Theorem 5 and Theorem 6, but might be stricter than what is really needed, at least for the latter theorem. By the informal arguments preceding Theorem 6, one would expect Theorem 5 and Theorem 6 to hold true as long as ˆθ(λ) has an influence function IF such that ˆθ(λ) = θ_0(λ) + n^{-1} Σ_{i=1}^n IF(Y_i, θ, λ) + ε_n(θ, λ) with a "regular enough" remainder term. Formalizing this intuition in a proper way would certainly extend our framework, but will require proving a more general version of Lemma S1.3 in the supplementary material. Finally, we have focused on Theorem 2 and Theorem 5, as our goal was to study the limiting behaviour of ˆθ(ˆλ) when ˆλ is set by some tuning procedure. These results are, however, corollaries of Theorem 1, which gives the limiting distribution of the full vector (ˆθ(ˆλ)^T, ˆλ^T, ˆθ′(ˆλ)^T)^T. In particular, Theorem 1 can be used to study the limiting behaviour of ˆλ. This can, in turn, be applied to make inference about λ and create approximate confidence intervals and hypothesis tests for the tuning parameter. For instance, it might be used to formally test whether a regularization is beneficial, or whether either of the two components in a hybrid model is significantly preferable.

Supplementary Materials

Proofs, lemmas and further discussions can be found in the article's supplementary material.

Acknowledgments

This work is partially funded by the Norwegian Research Council through the BigInsight centre for innovation-driven research (237718). Partial support was also given by the Centre for Advanced Study at the Academy of Science and Letters, Oslo, in connection with the 2022-2023 project Stability and Change, led by H. Hegre and N.L. Hjort.

References

Arcones, M. A. (2005).
Convergence of the optimal M-estimator over a parametric family of M-estimators. Test 14, 281–315.

Bachoc, F., D. Preinerstorfer, and L. Steinberger (2020). Uniformly valid confidence intervals post-model-selection. The Annals of Statistics 48(1), 440–463.

Beirami, A., M. Razaviyayn, S. Shahrampour, and V. Tarokh (2017). On optimal generalizability in parametric learning. In 31st Conference on Neural Information Processing Systems.

Berk, R., L. Brown, A. Buja, K. Zhang, and L. Zhao (2013). Valid post-selection inference. The Annals of Statistics 41, 802–837.

Billingsley, P. (1999). Convergence of Probability Measures. New York: John Wiley & Sons.

Bouchard, G. and B. Triggs (2004). The tradeoff between generative and discriminative classifiers. In 16th IASC International Symposium on Computational Statistics (COMPSTAT'04), Prague, Czech Republic, pp. 721–728.

Chetverikov, D., Z. Liao, and V. Chernozhukov (2021). On cross-validated Lasso in high dimensions. The Annals of Statistics 49(3), 1300–1317.

Cirillo, P. and N. N. Taleb (2016a). The decline of violent conflicts: What do the data really say? Nobel Foundation Symposium 161: The Causes of Peace, 1–26.

Cirillo, P. and N. N. Taleb (2016b). On the statistical properties and tail risk of violent conflicts. Physica A: Statistical Mechanics and its Applications 452, 29–45.

Claeskens, G. and N. L. Hjort (2003). The focused information criterion. Journal of the American Statistical Association 98(464), 900–916.

Claeskens, G. and N. L. Hjort (2008). Model Selection and Model Averaging. UK: Cambridge University Press.

Craven, P. and G. Wahba (1978). Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik 31(4), 377–403.

Cunen, C., N. L. Hjort, and H. M. Nygård (2020).
Statistical sightings of better angels: Analysing the distribution of battle-deaths in interstate conflict over time. Journal of Peace Research 57(2), 221–234.

Dæhlen, I. and N. L. Hjort (2025). Model robust hybrid likelihood. Journal of Statistical Planning and Inference 241, 106327.

Dæhlen, I., N. L. Hjort, and I. Hobæk Haff (2024). Accurate bias estimation with applications to focused model selection. Scandinavian Journal of Statistics 5.

Dodge, Y. and J. Jurečková (2000). Adaptive Regression. Springer.

Efron, B. (1983). Estimating the error rate of a prediction rule: improvement on cross-validation. Journal of the American Statistical Association 78(382), 316–331.

Efron, B. (1986). How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association 81(394), 461–470.

Efron, B. (2004). The estimation of prediction error: covariance penalties and cross-validation. Journal of the American Statistical Association 99(467), 619–632.

Germain, J.-F. and F. Roueff (2010). Weak convergence of the regularization path in penalized M-estimation. Scandinavian Journal of Statistics 37, 477–495.

Hall, A. R. and A. Inoue (2003). The large sample behaviour of the generalized method of moments estimator in misspecified models. Journal of Econometrics 114(2), 361–394.

Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica 50, 1029–1054.

Homrighausen, D. and D. J. McDonald (2014). Leave-one-out cross-validation is risk consistent for lasso. Machine Learning 97, 65–78.

Huber, P. J. (2009). Robust Statistics (2nd ed.). Hoboken: Wiley.

Joe, H. (2005). Asymptotic efficiency of the two-stage estimation method for copula-based models. Journal of Multivariate Analysis 94, 401–419.

Jullum, M. and N. L. Hjort (2017). Parametric or nonparametric: The FIC approach.
Statistica Sinica 27(3), 951–981.

Kato, K. (2009). Asymptotics for argmin processes: Convexity arguments. Journal of Multivariate Analysis 100(8), 1816–1829.

Ko, V. and N. L. Hjort (2019). Model robust inference with two-stage maximum likelihood estimation for copulas. Journal of Multivariate Analysis 171, 362–381.

Konishi, S. and G. Kitagawa (2008). Information Criteria and Statistical Modeling. New York: Springer Science & Business Media.

Kuchibhotla, A. K., J. E. Kolassa, and T. A. Kuffner (2022). Post-selection inference. Annual Review of Statistics and Its Application 9, 505–527.

Lee, J. D., D. L. Sun, Y. Sun, and J. E. Taylor (2016). Exact post-selection inference, with application to the lasso. The Annals of Statistics 44, 907–927.

Li, K.-C. (1985). From Stein's unbiased risk estimates to the method of generalized cross-validation. The Annals of Statistics 13, 1352–1377.

Li, K.-C. (1986). Asymptotic optimality of CL and generalized cross-validation in ridge regression with application to spline smoothing. The Annals of Statistics 14, 1101–1112.

Li, K.-C. (1987). Asymptotic optimality for Cp, CL, cross-validation and generalized cross-validation: discrete index set. The Annals of Statistics 15, 958–975.

Mu, B., T. Chen, and L. Ljung (2018). On asymptotic properties of hyperparameter estimators for kernel-based regularization methods. Automatica 94, 381–395.

Mu, B., T. Chen, and L. Ljung (2021). On the asymptotic optimality of cross-validation based hyper-parameter estimators for regularized least squares regression problems. arXiv preprint.

O'Neill, T. J. (1980). The general distribution of the error rate of a classification procedure with application to logistic regression discrimination. Journal of the American Statistical Association 75(369), 154–160.

Owen, A. B. (2001). Empirical Likelihood. Boca Raton, FL: CRC Press.

Pinker, S. (2011).
The Better Angels of Our Nature: Why Violence Has Declined. Toronto: Viking.

Sarkees, M. R. and F. Wayman (2010). Resort to War: 1816–2007. Washington DC: CQ Press.

Shao, J. (1993). Linear model selection by cross-validation. Journal of the American Statistical Association 88(422), 486–494.

Shao, J. (1997). An asymptotic theory for linear model selection. Statistica Sinica 7, 221–242.

Shao, J. (2003). Mathematical Statistics. New York: Springer Science & Business Media.

Speckman, P. (1985). Spline smoothing and optimal rates of convergence in nonparametric regression models. The Annals of Statistics 13, 970–983.

Spokoiny, V. (2025). Sharp bounds in perturbed smooth optimization. arXiv preprint arXiv:2505.02002.

Stephenson, W. and T. Broderick (2020). Approximate cross-validation in high dimensions with guarantees. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, Volume 108. PMLR.

Stone, M. (1977). An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. Journal of the Royal Statistical Society: Series B (Methodological) 39(1), 44–47.

Tibshirani, R. and K. Knight (1999). The covariance inflation criterion for adaptive model selection. Journal of the Royal Statistical Society Series B: Statistical Methodology 61(3), 529–546.

Van der Vaart, A. W. (2000). Asymptotic Statistics. Cambridge, England: Cambridge University Press.
The asymptotic effect of tuning parameters - Supplementary material

Ingrid Dæhlen 1,2, Nils Lid Hjort 1 and Ingrid Hobæk Haff 1
1 Department of Mathematics, University of Oslo
2 Norwegian Computing Center, Postbox 114 Blindern, Oslo, 0314, Norway
Address for correspondence: Ingrid Dæhlen, Department of Mathematics, University of Oslo, Moltke Moes vei 35, 0851 Oslo, Norway, email: ingrdae@math.uio.no

S1 Proof of Lemma 1 and Theorem 6

We start by showing a lemma regarding uniform consistency.

Lemma S1.1. Let Y ∈ R^d, θ ∈ R^p and λ ∈ Λ ⊆ R^q, where Λ is a compact set. With Ψ_{λ,n} : R^{d+p} → R^p and Ψ_λ : R^p → R^p for each λ ∈ Λ, let θ_0(λ) be the unique zero of Ψ_λ(θ) for each fixed λ and assume the following.

(C1) There exists ε > 0 such that Ψ_{λ,n} →Pr Ψ_λ uniformly on S_ε = {(θ, λ) | λ ∈ Λ, ∥θ_0(λ) − θ∥ ≤ ε}.

(C2) The function (θ, λ) ↦ Ψ_λ(θ) is continuously differentiable on S_ε with Ψ′_λ{θ_0(λ)} non-singular for each λ ∈ Λ.

Then, if Ψ_{λ,n}(θ) has at most a single zero ˆθ(λ) for each n and each λ,

sup_λ ∥ˆθ(λ) − θ_0(λ)∥ →Pr 0.   (S1.1)

Proof. The following is a modified version of the proof of Theorem 5.42 in Van der Vaart (2000). By (C2) and the implicit function theorem, θ_0(λ) is continuous as a function of λ in Λ. Hence, {θ_0(λ) | λ ∈ Λ} is a compact set. This implies that S_ε is compact. Because of this, the extreme value theorem and the fact that Ψ′_λ{θ_0(λ)} is non-singular for all λ ∈ Λ, we can choose ε small enough that Ψ′_λ(θ) is non-singular for all (θ, λ) ∈ S_ε, ensuring that (θ, λ) ↦ Ψ′_λ(θ)^{-1} is continuous on S_ε. By the extreme value theorem, the norm of each component of Ψ′_λ(θ)^{-1} is then bounded by some C < ∞ in S_ε. Now fix some δ > 0 and choose δ_1 smaller than both ε and δ/(2C).
Arguing as in Van der Vaart (2000), we see that there exist closed neighbourhoods G^λ_{δ_1} of θ_0(λ) such that Ψ_λ is a homeomorphism from G^λ_{δ_1} onto the closed ball around zero with radius δ_1, B_{δ_1}(0), for each λ. By construction of δ_1 and the inverse function theorem, the partial derivatives of the inverse of Ψ_λ are bounded by C for each λ. Hence, the diameter of G^λ_{δ_1} is bounded by 2Cδ_1 < δ for each λ, implying G^λ_{δ_1} ⊆ B_δ{θ_0(λ)} for each λ. We will show that Ψ_{λ,n}(θ) has a root in G^λ_{δ_1} for every λ with probability tending to one. Since G^λ_{δ_1} ⊆ B_δ{θ_0(λ)} for each λ, this would guarantee (S1.1). Since δ_1 < ε, we know

sup_{λ ∈ Λ} sup_{θ ∈ G^λ_{δ_1}} ∥Ψ_{λ,n}(θ) − Ψ_λ(θ)∥ →Pr 0.

Let K_{δ_1,n} be the event that ∥Ψ_{λ,n}(θ) − Ψ_λ(θ)∥ < δ_1 for all θ ∈ G^λ_{δ_1} and λ ∈ Λ. Then the above implies that Pr(K_{δ_1,n}) → 1. Now let y ∈ B_{δ_1}(0). Since Ψ_λ is a homeomorphism from G^λ_{δ_1} onto B_{δ_1}(0) for each λ, Ψ^{-1}_λ(y) ∈ G^λ_{δ_1} for every λ ∈ Λ. Hence, on the event K_{δ_1,n}, we have ∥Ψ_{λ,n}{Ψ^{-1}_λ(y)} − y∥ ≤ δ_1 for all λ ∈ Λ, which implies that y ↦ y − Ψ_{λ,n} ∘ Ψ^{-1}_λ(y) maps B_{δ_1}(0) into itself for every λ ∈ Λ. Since these maps are continuous, Brouwer's fixed point theorem ensures that y ↦ y − Ψ_{λ,n} ∘ Ψ^{-1}_λ(y) has a fixed point in B_{δ_1}(0) for every λ ∈ Λ. This is equivalent to Ψ_{λ,n} having a zero in G^λ_{δ_1} ⊆ B_δ{θ_0(λ)} for each λ. Because of this, the probability of all the Ψ_{λ,n} having a zero within δ of θ_0(λ) is bounded below by Pr(K_{δ_1,n}). As this latter probability tends to one by the previous arguments, (S1.1) follows. This concludes the proof. ■

When Ψ_λ(θ) = Eφ(Z, θ, λ) and Ψ_{λ,n}(θ) = n^{-1} Σ_{i=1}^n φ(Z_i, θ, λ) for some function φ : R^{d+p+q} → R^p, condition (C1) can easily be verified using the uniform law of large numbers. We state this in the following corollary.

Corollary S1.2.
Let Ψ_λ(θ) = Eφ(Z, θ, λ) and Ψ_{λ,n}(θ) = n^{-1} Σ_{i=1}^n φ(Z_i, θ, λ) for some function φ : R^{d+p+q} → R^p. Then (C1) follows from

(C3) There exists an F-integrable function m_0 : R^d → R and an ε > 0 such that ∥φ(z, θ, λ)∥ ≤ m_0(z) for all (θ, λ) ∈ S_ε and F-almost all z.

We will now prove a result characterizing the distance from ˆθ(λ) to ˆθ^{(−i)}(λ).

Lemma S1.3. For each λ in Λ, a compact subset of R^q, and 1 ≤ i ≤ n, let ˆθ(λ) and ˆθ^{(−i)}(λ) be the solutions to Σ_{j=1}^n φ(Z_j, θ, λ) = 0 and Σ_{j≠i} φ(Z_j, θ, λ) = 0 in Θ, a compact subset of R^p. Assume there exist functions m_0 and m_1 such that ∥φ(z, θ, λ)∥ ≤ m_0(z) and ∥∂_θ φ(z, θ, λ)∥ ≤ m_1(z) for all θ, λ and almost all z. Then, if θ ↦ Eφ(Z, θ, λ) satisfies (C2) of Lemma S1.1 and the eigenvalues of J(θ, λ)^{-1} = −{E∂_θ φ(Z, θ, λ)}^{-1} are bounded in Θ × Λ, the following holds for all λ:

∥ˆθ^{(−i)}(λ) − ˆθ(λ)∥ ≤ a_{0,n} m_0(Z_i),   (S1.2)

where a_{0,n} = O_p(1/n) and does not depend on i or λ. If, furthermore, the eigenvalues of ∂_θ J(θ, λ) are bounded in Θ × Λ, we have

ˆθ^{(−i)}(λ) − ˆθ(λ) = −n^{-1} J{θ_0(λ), λ}^{-1} φ{Z_i, θ_0(λ), λ} + E^i_n(λ),   (S1.3)

where E^i_n(λ) satisfies

∥E^i_n(λ)∥ ≤ n^{-1} {a_{1,n} + a_{2,n} m_0(Z_i) + a_{3,n} m_0(Z_i)^2 + a_{4,n} m_0(Z_i)^2 m_1(Z_i)},   (S1.4)

with a_{j,n} = o_p(1) uniformly in λ, not depending on i, for j = 1, 2, 3, 4.

Proof. By definition we have

0 = Σ_{j=1}^n φ{Z_j, ˆθ(λ), λ}   and   0 = Σ_{j≠i} φ{Z_j, ˆθ^{(−i)}(λ), λ}.   (S1.5)
Hence, a Taylor expansion of Σ_{j=1}^n φ{Z_j, ˆθ^{(−i)}(λ), λ} around ˆθ(λ) reveals

φ{Z_i, ˆθ^{(−i)}(λ), λ} = Σ_{j=1}^n φ{Z_j, ˆθ^{(−i)}(λ), λ}
= Σ_{j=1}^n φ{Z_j, ˆθ(λ), λ} + [Σ_{j=1}^n ∂_θ φ(Z_j, θ, λ)|_{θ = θ*_i(λ)}] {ˆθ^{(−i)}(λ) − ˆθ(λ)}
= −n J_n{θ*_i(λ), λ} {ˆθ^{(−i)}(λ) − ˆθ(λ)},

for some θ*_i(λ) on the line segment between ˆθ^{(−i)}(λ) and ˆθ(λ), and with J_n(θ, λ) = −n^{-1} Σ_{j=1}^n ∂_θ φ(Z_j, θ, λ). Because of the above, we can write

ˆθ^{(−i)}(λ) − ˆθ(λ) = −n^{-1} J_n{θ*_i(λ), λ}^{-1} φ{Z_i, ˆθ^{(−i)}(λ), λ}.

Since there exists a function m_1(z) bounding all partial derivatives of φ(z, θ, λ) with respect to θ, J_n(θ, λ) converges uniformly to J(θ, λ) = −E∂_θ φ(Z, θ, λ) over the compact parameter space Θ × Λ. Furthermore, matrix inversion is uniformly continuous over compact sets not containing non-invertible matrices, ensuring that also J_n(θ, λ)^{-1} converges to J(θ, λ)^{-1} = −{E∂_θ φ(Z, θ, λ)}^{-1} uniformly in λ and θ. Then J_n{θ*_i(λ), λ}^{-1} = J{θ*_i(λ), λ}^{-1} + ε^i_n, where ε^i_n = o_p(1) uniformly in λ and i. By assumption, the eigenvalues of J(θ, λ)^{-1} are bounded in absolute value by a number C_1. Then

∥ˆθ^{(−i)}(λ) − ˆθ(λ)∥ ≤ n^{-1} (C_1 + δ_n) ∥φ{Z_i, ˆθ^{(−i)}(λ), λ}∥,

where δ_n = o_p(1) does not depend on λ or i. Lastly, by assumption ∥φ(z, θ, λ)∥ ≤ m_0(z). The above then guarantees

∥ˆθ^{(−i)}(λ) − ˆθ(λ)∥ ≤ n^{-1} (C_1 + δ_n) m_0(Z_i)   (S1.6)

for all λ. This proves (S1.2).

By the previous arguments,

ˆθ^{(−i)}(λ) − ˆθ(λ) = −n^{-1} [J{θ*_i(λ), λ}^{-1} + ε^i_n] φ{Z_i, ˆθ^{(−i)}(λ), λ},   (S1.7)

where ε^i_n = o_p(1) uniformly in i and λ. We will again use Taylor expansions to arrive at a result.
We start with the second factor on the right-hand side:

J{θ*_i(λ), λ}^{-1} = J{θ_0(λ), λ}^{-1} + [∂_θ J(θ, λ)^{-1}|_{θ = θ**_i(λ)}] {θ*_i(λ) − θ_0(λ)},

where θ**_i(λ) is some vector on the line segment between θ_0(λ) and θ*_i(λ). Now, since θ*_i(λ) lies on the line segment between ˆθ(λ) and ˆθ^{(−i)}(λ),

∥θ*_i(λ) − θ_0(λ)∥ ≤ ∥ˆθ(λ) − θ_0(λ)∥ + ∥ˆθ^{(−i)}(λ) − ˆθ(λ)∥.

Combining this with Lemma S1.1 and (S1.6) ensures

∥θ*_i(λ) − θ_0(λ)∥ ≤ R_n + n^{-1} (C_1 + δ_n) m_0(Z_i),

where R_n = o_p(1) uniformly in λ and does not depend on i. In addition, the eigenvalues of ∂_θ J(θ, λ)^{-1} are bounded by C_2. Hence, J{θ*_i(λ), λ}^{-1} = J{θ_0(λ), λ}^{-1} + r^i_n(λ), where ∥r^i_n(λ)∥ ≤ n^{-1} (C_1 + δ_n) C_2 m_0(Z_i) uniformly in λ. Secondly, we will work with the third factor on the right-hand side of (S1.7). Taylor expanding the function φ{Z_i, ˆθ^{(−i)}(λ), λ} around θ_0(λ) reveals

φ{Z_i, ˆθ^{(−i)}(λ), λ} = φ{Z_i, θ_0(λ), λ} + [∂_θ φ(Z_i, θ, λ)|_{θ = θ***_i(λ)}] {ˆθ^{(−i)}(λ) − θ_0(λ)},

for some θ***_i(λ) on the line segment between ˆθ^{(−i)}(λ) and θ_0(λ). Since there exists m_1(z) bounding the norm of each component of ∂_θ φ(z, θ, λ) and depending only on z, we can use arguments similar to those above to show φ{Z_i, ˆθ^{(−i)}(λ), λ} = φ{Z_i, θ_0(λ), λ} + s^i_n(λ), where ∥s^i_n(λ)∥ ≤ n^{-1} (C_1 + δ_n) m_1(Z_i) m_0(Z_i) uniformly in λ and for all i. We are now ready to complete the proof.
Combining all of the previous arguments shows

ˆθ^{(−i)}(λ) − ˆθ(λ) = −n^{-1} [J{θ_0(λ), λ}^{-1} + r^i_n(λ) + ε^i_n] [φ{Z_i, θ_0(λ), λ} + s^i_n(λ)],

which implies

ˆθ^{(−i)}(λ) − ˆθ(λ) = −n^{-1} J{θ_0(λ), λ}^{-1} φ{Z_i, θ_0(λ), λ} + E^i_n(λ),

where ∥E^i_n(λ)∥ ≤ n^{-1} {a_{1,n} + a_{2,n} m_0(Z_i) + a_{3,n} m_0(Z_i)^2 + a_{4,n} m_0(Z_i)^2 m_1(Z_i)} and a_{j,n} = o_p(1) uniformly in λ, not depending on i, for j = 1, 2, 3, 4. This concludes the proof. ■

The first part of Lemma S1.3 shows Lemma 1 in the article. We will now use the second part to show Theorem 6.

S1.1 Proof of Theorem 6

Taylor's theorem ensures

Σ_{i=1}^n ψ(Z_i, ˆθ^{(−i)}) = Σ_{i=1}^n ψ(Z_i, ˆθ) + Σ_{i=1}^n [∇_θ ψ(Z_i, θ)|_{θ = θ*_i}]^T (ˆθ^{(−i)} − ˆθ),

for some θ*_i on the line segments between ˆθ and ˆθ^{(−i)}. By assumption, all conditions of Lemma S1.3 hold true, and hence,

Σ_{i=1}^n [∇_θ ψ(Z_i, θ)|_{θ = θ*_i}]^T (ˆθ^{(−i)} − ˆθ) = −n^{-1} Σ_{i=1}^n [∇_θ ψ(Z_i, θ)|_{θ = θ*_i}]^T J^{-1} φ(Z_i, θ_0) + Σ_{i=1}^n [∇_θ ψ(Z_i, θ)|_{θ = θ*_i}]^T E^i_n.   (S1.8)

By the same lemma and the existence of p_1, we have

∥Σ_{i=1}^n [∇_θ ψ(Z_i, θ)|_{θ = θ*_i}]^T E^i_n∥ ≤ a_{1,n} n^{-1} Σ_{i=1}^n p_1(Z_i) + a_{2,n} n^{-1} Σ_{i=1}^n m_0(Z_i) p_1(Z_i) + a_{3,n} n^{-1} Σ_{i=1}^n m_0(Z_i)^2 p_1(Z_i) + a_{4,n} n^{-1} Σ_{i=1}^n m_0(Z_i)^2 m_1(Z_i) p_1(Z_i).

Since the moments of all terms in the above equation exist by assumption and a_{j,n} = o_p(1) for j = 1, 2, 3, 4, the second term on the right-hand side of (S1.8) is o_p(1). For the first term of (S1.8), note that by Taylor's theorem, ∇_θ ψ(Z_i, θ)|_{θ = θ*_i} = ∇_θ ψ(Z_i, θ)|_{θ = θ_0} + H_ψ(Z_i, θ**_i)(θ*_i − θ_0) for some θ**_i on the line segment between θ*_i and θ_0, where H_ψ denotes the Hessian of ψ with respect to θ. By the triangle inequality and the fact that θ*_i lies on the line segment between ˆθ^{(−i)} and ˆθ,

∥θ*_i − θ_0∥ ≤ ∥ˆθ − θ_0∥ + ∥ˆθ − ˆθ^{(−i)}∥.
Hence, by Lemma S1.3, we have

∥n^{-1} Σ_{i=1}^n {∇_θ ψ(Z_i, θ)|_{θ = θ*_i} − ∇_θ ψ(Z_i, θ)|_{θ = θ_0}}^T J(θ_0)^{-1} φ(Z_i, θ_0)∥ ≤ ∥ˆθ − θ_0∥ n^{-1} C Σ_{i=1}^n p_2(Z_i) m_0(Z_i) + a_{0,n} n^{-1} C Σ_{i=1}^n p_2(Z_i) m_0(Z_i)^2,

for some C. This expression is o_p(1), since the expected values of both means are finite and ∥ˆθ − θ_0∥, a_{0,n} = o_p(1) by Lemma S1.3. This concludes the proof. ■

S2 Appendix - When the limit function is flat

In Theorem 1, we assumed that λ_0 was the unique minimizer of the limiting risk R(λ) = Eψ{Z, θ_0(λ)}. In certain cases, however, θ_0(λ), the solution to Eφ(Z, θ, λ) = 0, is constant as a function of λ, and as a result, R(λ) is flat. This is the case for hybrid models of the form φ(z, θ, λ) = λφ_1(z, θ) + (1 − λ)φ_2(z, θ) when Eφ_1(Z, θ) and Eφ_2(Z, θ) share roots, e.g. in the hybrid generative-discriminative model when the generative model is specified correctly, or for hybrid combinations of parametric and empirical likelihood functions (see Dæhlen and Hjort (2025)) under model conditions. In fact, when R(λ) is flat, the limits of both TE(λ) and CV(λ) are constant, and there is no obvious "best" value for λ, making it unclear what ˆλ even should be aiming for were it consistent. We will now use informal mathematical arguments and simulations to illustrate why a clear and informative limit result for ˆθ(ˆλ) is hard to find when θ_0(λ) is constant. Assume that θ_0(λ) is constant and equal to θ_0. To investigate the behaviour of √n{ˆθ(ˆλ) − θ_0}, we will first take a closer look at the function CV(λ). By Theorem 6, the following holds pointwise:

CV(λ) = n^{-1} Σ_{i=1}^n ψ{Z_i, ˆθ(λ), λ} − n^{-1} Tr{J(θ_0, λ)^{-1} V(θ_0, λ)} + o_p(1/n).
Taylor expanding the first term on the right-hand side of the above equation reveals

n^{-1} Σ_{i=1}^n ψ{Z_i, ˆθ(λ), λ} = n^{-1} Σ_{i=1}^n ψ(Z_i, θ_0, λ) + {ˆθ(λ) − θ_0}^T n^{-1} Σ_{i=1}^n ∇_θ ψ(Z_i, θ, λ)|_{θ = θ_0} + (1/2) {ˆθ(λ) − θ_0}^T [n^{-1} Σ_{i=1}^n H_ψ(Z_i, θ*, λ)] {ˆθ(λ) − θ_0},

for some θ* on the line segment between ˆθ(λ) and θ_0. Assume θ_0 is the true minimizer of Eψ(Z, θ), i.e. that E∇_θ ψ(Z, θ)|_{θ = θ_0} = 0. Then arguments similar to those given in the proofs of Lemma 4.2 and Theorem 6 show

S_n(λ) = n (CV(λ) − n^{-1} Σ_{i=1}^n ψ(Z_i, θ_0)) = U_n(λ)^T V_n − Tr{J(θ_0, λ)^{-1} V(θ_0)} + (1/2) U_n(λ)^T W_n U_n(λ) + o_p(1),

where U_n(λ) = √n{ˆθ(λ) − θ_0}, V_n = n^{-1/2} Σ_{i=1}^n ∇_θ ψ(Z_i, θ)|_{θ = θ_0}, and W_n = n^{-1} Σ_{i=1}^n H_ψ(Z_i, θ_0). Since θ_0 is constant with respect to λ, we have ˆλ = argmin_λ S_n(λ). Note that W_n is consistent for W = EH_ψ(Z, θ_0) by the assumptions of Theorem 6. Furthermore, it does not depend on λ. If, in addition, the conditions of Van der Vaart (2000, Ex. 19.7) are fulfilled for (z, λ) ↦ (J(θ_0, λ)^{-1} φ(z, θ_0, λ), ψ(z, θ_0)), the process λ ↦ (U_n(λ), V_n) converges to a Gaussian process (U(λ), V) with covariance structure as described at the beginning of Van der Vaart (2000, Sec. 19.2). This, in combination with Slutsky's lemma and the continuous mapping theorem, is sufficient for S_n to converge in distribution in Skorokhod space to the limit process

S(λ) = U(λ)^T V − T(λ) + (1/2) U(λ)^T W U(λ),   (S2.9)

where T(λ) = Tr{J(θ_0, λ)^{-1} V(θ_0)}. Direct computations show ES(λ) = (1/2) Tr{W J(λ)^{-1} K(λ) J(λ)^{-1}}. Hence, if there exists a λ_0 such that ˆθ(λ_0) is equal to the maximum likelihood estimator, the Cramér-Rao theorem guarantees that λ_0 minimizes ES(λ).
Since such a λ_0 often exists in the case where θ_0(λ) is constant, it is tempting to think that ˆλ converges in probability to this λ_0. Sadly, this conclusion cannot be drawn, as S is a random process and there is no guarantee that minimizing the expected value of S minimizes the process itself. Furthermore, the convergence of S_n towards S is in distribution. Because of this, convergence in probability of ˆλ towards λ_L, the minimizer of S, does not follow. We can only deduce ˆλ →d λ_L, and since S is a stochastic process and not a nonrandom function, its minimizer is not guaranteed to be a constant. Because of this, λ_L is a random variable, and ˆλ →d λ_L does not imply convergence in probability. The above therefore does not show consistency of ˆλ towards λ_0, or any fixed value for that matter, but rather that the distribution of ˆλ tends towards that of the minimizer of S. To illustrate the above, we performed a small simulation experiment for a hybrid LDA-logistic regression model with one covariate and binary response variable. We simulated from an LDA model, under which the true parameter minimizes both the limit of the generative and the logistic likelihood. Hence, θ_0(λ) is constantly equal to the true parameter in this simulation. We used ψ(y, x, θ) = (y − exp{β(θ)^T x}/[1 + exp{β(θ)^T x}])^2, a smoothed version of the 0-1 loss function. With n = 2000, we computed B = 1000 different realizations of S_n. A selection of sample paths is displayed in Figure 1, together with a histogram of the corresponding minima. From the figure we see that the sample paths are indeed not constant and that, although the minimum is most often one, other values are repeatedly chosen, implying that λ_L might really not be constant in this case.
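The flat-risk phenomenon can be reproduced with a far simpler toy example than the LDA-logistic model (our own construction, for illustration only): combine the sample mean and sample median of symmetric data, so that θ_0(λ) is constant in λ, and select λ by cross-validation. Across replications, the minimizer keeps varying instead of settling on one value:

```python
import numpy as np

# Toy flat-risk example (hypothetical, not the paper's LDA-logistic model):
# theta_hat(lam) = lam * mean + (1 - lam) * median. Both terms estimate the
# centre of a symmetric law, so theta_0(lam) is constant and R(lam) is flat.
rng = np.random.default_rng(2)
lam_grid = np.linspace(0.0, 1.0, 21)

def cv_risk(z, lam, folds):
    """5-fold cross-validated squared-error risk of theta_hat(lam)."""
    risks = []
    for test in folds:
        train = np.setdiff1d(np.arange(len(z)), test)
        est = lam * z[train].mean() + (1 - lam) * np.median(z[train])
        risks.append(np.mean((z[test] - est) ** 2))
    return np.mean(risks)

minima = []
for _ in range(200):
    z = rng.normal(size=200)
    folds = np.array_split(rng.permutation(len(z)), 5)  # same folds for every lam
    risks = [cv_risk(z, lam, folds) for lam in lam_grid]
    minima.append(lam_grid[int(np.argmin(risks))])

# The selected lambda_hat does not concentrate on a single value.
print(len(set(minima)))
```

Mirroring the paper's Figure 1, a histogram of `minima` spreads over several grid points rather than piling up at one λ.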
Although the above arguments and simulations seem to indicate that there is reasonable doubt that ˆλ is consistent for any one value, this does not necessarily rule out all hope that √n{ˆθ(ˆλ) − θ_0} has a normal limiting distribution. In the simulation experiment described above, we also computed √n{ˆθ(ˆλ) − θ_0} in each iteration. Histograms of each component of this vector are displayed in Figure 2. From the plots we note that the distributions at least look quite normal.

Figure 1: The figure shows two plots. To the left, B = 1000 realizations of S_n with n = 2000 are displayed. On the right-hand side, a histogram of the corresponding minima is shown.

This intuition is supported by the QQ-plots shown in Figure 3. Looking at these plots, we see no obvious deviations from the corresponding normal distributions. One possible solution is to work with A_n(ˆλ), where A_n(λ) = √n K(θ_0, λ)^{-1/2} J(θ_0, λ){ˆθ(λ) − θ_0}, rather than with √n{ˆθ(ˆλ) − θ_0}. Pointwise, we have A_n(λ) →d N(0, I_p), and hence, we would expect A_n(ˆλ) →d N(0, I_p) as well. To check whether this idea seems reasonable, we computed A_n(ˆλ) in each iteration of the simulation setting described above. A histogram of each component of A_n(ˆλ) is shown in Figure 4, together with the graph of the density function for the standard normal distribution. The plots seem to indicate that A_n(ˆλ) behaves more or less like a standard normally distributed variable. One could argue that the failure of Corollary 1 and Theorem 5 to cover situations where the limiting risk function is flat makes the results and the estimators of Theorem 4 less useful in practical situations.
Say that we, for instance, have fitted a hybrid generative-discriminative model to some data and want to apply the theory in the main part of this paper to construct confidence intervals and hypothesis tests for the parameters in the model. If the generative model fitted to the data is wrongly specified, this can be done by applying Corollary 1 or Theorem 5 in conjunction with Theorem 4.

Figure 2: The figure shows histograms of each component of √n{ˆθ(ˆλ) − θ_0}.

Figure 3: The figure shows QQ-plots for each component of √n{ˆθ(ˆλ) − θ_0}. The empirical quantiles are compared to those of fitted normal distributions.

Figure 4: The figure shows histograms of each component of √n K(θ_0, ˆλ)^{-1/2} J(θ_0, ˆλ){ˆθ(ˆλ) − θ_0}.

If the model conditions do hold true, however, the arguments made in this section show that the results cannot be applied. Hence, we cannot say that Corollary 1, Theorem 5 or the estimators in Theorem 4 are truly model-agnostic, since they fail to apply when the limiting risk function is flat. In theory, this is of course quite problematic. In practice, however, it is less of a problem. When θ′_0(λ) = 0 for all values of λ, Eφ(Z, θ_0, λ) = 0 for all λ.
Because of this, ˆθ′(ˆλ) of Theorem 3 is consistent for 0, and ˆA* in Theorem 4 is asymptotically equivalent to (J^{-1}(ˆλ), 0, 0). Hence, ˆA* ˆK*(ˆλ) (ˆA*)^T is asymptotically equivalent to J^{-1}(ˆλ) K(ˆλ) J^{-1}(ˆλ), which is what the simulations of this section seem to indicate is the limiting variance of √n{ˆθ(ˆλ) − θ_0}. Lastly, we would like to point out that there are situations where ˆθ(ˆλ) is consistent even if θ_0(λ) is a flat function; see e.g. Arcones (2005), where the limit of the risk function used to select λ is non-flat in λ even when θ_0(λ) is constant. Cases where the limiting risk function is non-constant in λ even though θ_0(λ) is flat are not discussed in this paper, but we point out that such estimators do exist, and that for such cases the behaviour discussed and illustrated in this section is not guaranteed.

S3 The effect of validation sets

In this section we give an informal discussion of how an alternative tuning scheme, based on setting aside a validation set, affects the asymptotic distribution of ˆθ(ˆλ). This problem is essentially equivalent to situations with auxiliary information, see Qin (2000), which have inspired our proofs. Assume we have i.i.d. data Z_1, ..., Z_n from some distribution F. In this section, we will assume that the first n_1 data points are used for tuning λ and that the remaining n_2 = n − n_1 data points are used for estimating θ with the chosen value of λ, i.e.

ˆθ(λ) = zero of Σ_{i=n_1+1}^n φ(Z_i, θ, λ)   and   ˆλ = argmin_λ Σ_{i=1}^{n_1} ψ[Z_i, ˆθ(λ)].
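The split-sample scheme just defined can be sketched in a few lines. The estimator below is a hypothetical mean-median hybrid standing in for the paper's Z-estimator, chosen only to make the two-stage structure concrete:

```python
import numpy as np

# Split-sample tuning sketch (toy estimator, assumed for illustration):
# the first n1 observations tune lambda, the remaining n2 estimate theta.
rng = np.random.default_rng(3)
z = rng.normal(loc=1.0, size=500)
n1 = 150
z_tune, z_est = z[:n1], z[n1:]          # tuning set and estimation set

def theta_hat(sample, lam):
    # hybrid of mean and median, both estimating the same centre
    return lam * sample.mean() + (1 - lam) * np.median(sample)

# lambda_hat minimizes the tuning-set risk of the estimator fitted on z_est
lam_grid = np.linspace(0.0, 1.0, 21)
risk = [np.mean((z_tune - theta_hat(z_est, lam)) ** 2) for lam in lam_grid]
lam_hat = lam_grid[int(np.argmin(risk))]
theta_final = theta_hat(z_est, lam_hat)
print(lam_hat, theta_final)
```

Here n_1 = 150 of n = 500 observations are spent on tuning, corresponding to the fraction p = n_1/n in the text.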
Arguing as in the paragraph preceding Theorem 1 in the main part of the paper, we get

ˆα = (ˆθ(ˆλ)^T, ˆλ^T, ˆθ′(ˆλ)^T)^T = zero of the stacked system (Σ_{i=n_1+1}^n η_1(Z_i, θ, λ, D), Σ_{i=1}^{n_1} η_2(Z_i, θ, λ, D), Σ_{i=n_1+1}^n η_3(Z_i, θ, λ, D)),

where η_1, η_2 and η_3 are as defined in Theorem 1 in the main part of the paper. By Lemma S1.1, ˆα converges in probability towards α_0, the zero of

α ↦ ((1 − p) Eη_1(Z, α), p Eη_2(Z, α), (1 − p) Eη_3(Z, α)),

provided n_1/n → p as n → ∞ and the regularity conditions of Lemma S1.1 hold true. Since p is constant, α_0 equals the zero of α ↦ E[(η_1(Z, α)^T, η_2(Z, α)^T, η_3(Z, α)^T)]^T. Hence ˆα, and in particular ˆθ(ˆλ), is consistent also when tuning with respect to the error on a validation set. For asymptotic normality, notice that Taylor's theorem ensures

0 = (Σ_{i=n_1+1}^n η_1(Z_i, ˆα), Σ_{i=1}^{n_1} η_2(Z_i, ˆα), Σ_{i=n_1+1}^n η_3(Z_i, ˆα))
= (Σ_{i=n_1+1}^n η_1(Z_i, α_0), Σ_{i=1}^{n_1} η_2(Z_i, α_0), Σ_{i=n_1+1}^n η_3(Z_i, α_0)) + (Σ_{i=n_1+1}^n Jη_1(Z_i, α*), Σ_{i=1}^{n_1} Jη_2(Z_i, α*), Σ_{i=n_1+1}^n Jη_3(Z_i, α*)) (ˆα − α_0),

for some α* on the line segment between ˆα and α_0. Hence,

√n(ˆα − α_0) = −(n^{-1} Σ_{i=n_1+1}^n Jη_1(Z_i, α*), n^{-1} Σ_{i=1}^{n_1} Jη_2(Z_i, α*), n^{-1} Σ_{i=n_1+1}^n Jη_3(Z_i, α*))^{-1} n^{-1/2} (Σ_{i=n_1+1}^n η_1(Z_i, α_0), Σ_{i=1}^{n_1} η_2(Z_i, α_0), Σ_{i=n_1+1}^n η_3(Z_i, α_0)).
Assuming the Jη_j for j = 1, 2, 3 are sufficiently regular, we therefore have

√n(ˆα − α_0) = −((1 − p) EJη_1(Z, α_0), p EJη_2(Z, α_0), (1 − p) EJη_3(Z, α_0))^{-1} ((1 − p)^{1/2} n_2^{-1/2} Σ_{i=n_1+1}^n η_1(Z_i, α_0), p^{1/2} n_1^{-1/2} Σ_{i=1}^{n_1} η_2(Z_i, α_0), (1 − p)^{1/2} n_2^{-1/2} Σ_{i=n_1+1}^n η_3(Z_i, α_0)) + o_p(1).

The above shows that √n(ˆα − α_0) is a linear combination of terms which converge towards normally distributed variables at the speed O(n_1^{-1/2}) = O((pn)^{-1/2}) or O(n_2^{-1/2}) = O({(1 − p)n}^{-1/2}). Assuming 0 < p < 1, the convergence rate is therefore reduced from n^{-1/2} to (pn)^{-1/2} or {(1 − p)n}^{-1/2} when using a validation set to tune λ rather than cross-validation. This is a relatively natural result, as n(1 − p) and np are approximately the amounts of data set aside for fitting the model and for tuning λ, respectively.

References

Arcones, M. A. (2005). Convergence of the optimal M-estimator over a parametric family of M-estimators. Test 14, 281–315.

Dæhlen, I. and N. L. Hjort (2025). Model robust hybrid likelihood. Journal of Statistical Planning and Inference 241, 106327.

Qin, J. (2000). Miscellanea. Combining parametric and empirical likelihoods. Biometrika 87(2), 484–490.

Van der Vaart, A. W. (2000). Asymptotic Statistics. Cambridge, England: Cambridge University Press.
