Beyond Consistency: Inference for the Relative Risk Functional in Deep Nonparametric Cox Models
Authors: Sattwik Ghosal, Xuran Meng, and Yi Li (Department of Biostatistics, University of Michigan; e-mail: ghosal@umich.edu, xuranm@umich.edu, yili@umich.edu)
Abstract

There remain theoretical gaps in deep neural network estimators for the nonparametric Cox proportional hazards model. In particular, it is unclear how gradient-based optimization error propagates to population risk under partial likelihood, how pointwise bias can be controlled to permit valid inference, and how ensemble-based uncertainty quantification behaves under realistic variance decay regimes. We develop an asymptotic distribution theory for deep Cox estimators that addresses these issues. First, we establish nonasymptotic oracle inequalities for general trained networks that link in-sample optimization error to population risk without requiring the exact empirical risk optimizer. We then construct a structured neural parameterization that achieves infinity-norm approximation rates compatible with the oracle bound, yielding control of the pointwise bias. Under these conditions and using the Hájek–Hoeffding projection, we prove pointwise and multivariate asymptotic normality for subsampled ensemble estimators. We derive a range of subsample sizes that balances bias correction with the requirement that the Hájek–Hoeffding projection remain dominant. This range accommodates decay conditions on the single-overlap covariance, which measures how strongly a single shared observation influences the estimator, and is weaker than those imposed in the subsampling literature. An infinitesimal jackknife representation provides analytic covariance estimation and valid Wald-type inference for relative risk contrasts such as log-hazard ratios. Finally, we illustrate the finite-sample implications of the theory through simulations and a real data application.

1 Introduction

High-complexity estimators based on deep neural networks (DNNs) have achieved success across a range of statistical tasks. Although progress has been made in establishing population risk consistency and approximation guarantees in some settings (Schmidt-Hieber, 2020; Farrell et al., 2021), statistical inference remains underdeveloped. In particular, it is not well understood how optimization error arising from gradient-based training is carried into population risk (Allen-Zhu et al., 2019; Arora et al., 2019), how to enforce the pointwise bias control required for valid confidence intervals (Hall, 1992; Cattaneo et al., 2020; Farrell et al., 2021), and how to construct tractable and reliable uncertainty quantification for scientifically relevant targets such as relative-risk contrasts (Lakshminarayanan et al., 2017; Chernozhukov et al., 2018; Wager and Athey, 2018). Moreover, the inferential theory for DNN estimators in the presence of censoring is limited.

We introduce the nonparametric Cox proportional hazards model (Cox, 1972) under independent censoring. Let $X \in [0,1]^{p_0}$ denote a vector of (normalized) baseline covariates with dimension $p_0$ and $T^U$ denote the event time. The nonparametric Cox model (Sleeper and Harrington, 1990; O'Sullivan, 1993; Chen and Zhou, 2007; Chen et al., 2010) specifies the conditional hazard of $T^U$ given $X$ as
$$\lambda(t \mid X) = \lambda_0(t)\exp\{g_0(X)\}, \tag{1.1}$$
where $\lambda_0(\cdot)$ is an unspecified baseline hazard function and $g_0: [0,1]^{p_0} \to \mathbb{R}$ is an unknown function with $g_0(\mathbf{0}) = 0$ for identifiability. We estimate $g_0$ by $\hat{g}$ over a DNN function class and also consider relative risk comparisons. For two covariate vectors $x^{(1)}_*$ and $x^{(2)}_*$, the log-hazard contrast under model (1.1) is
$$\log\frac{\lambda\big(t \mid X = x^{(1)}_*\big)}{\lambda\big(t \mid X = x^{(2)}_*\big)} := \psi\big(x^{(1)}_*, x^{(2)}_*\big) = g_0\big(x^{(1)}_*\big) - g_0\big(x^{(2)}_*\big), \tag{1.2}$$
which provides an interpretable measure of relative risk between the two covariate settings.

Recent semiparametric theory focuses on finite dimensional parameters, treating $g_0$ as a machine-learned nuisance (Chernozhukov et al., 2018). In contrast, inference for $g_0$ itself, particularly for DNN estimators, remains largely unresolved. Existing methods establish risk consistency but do not control pointwise bias sufficiently to obtain distributional approximations for high-complexity DNN estimators (Zhong et al., 2022; Zhou et al., 2023; Sun et al., 2024; Wu et al., 2024).

We develop inference for DNN estimators with censored survival data (Katzman et al., 2018; Wiegrebe et al., 2024). This setting poses several challenges. First, censoring leads to incomplete observations and complicates the objective function. Second, practical training algorithms yield only approximate empirical risk optimizers, and it remains unclear how the resulting optimization gap carries over to the population risk in Cox models (Schmidt-Hieber, 2020; Kohler and Langer, 2021; Zhong et al., 2022; Meng and Li, 2026). Third, the nonparametric component $g_0(\cdot)$ is estimable at rates slower than $n^{-1/2}$, so classical semiparametric theory (Andersen and Gill, 1982; Bickel et al., 1993; van der Vaart, 1998) does not provide distributional approximations. Valid inference requires linking optimization error to population risk, controlling pointwise bias, and characterizing the covariance of nonlinear functionals. In particular, inference for the contrast $\psi(x^{(1)}_*, x^{(2)}_*)$ depends on the joint distribution of $\hat{g}(x^{(1)}_*)$ and $\hat{g}(x^{(2)}_*)$. Their dependence, induced by the shared training sample, must be accounted for.

We develop a three-part inferential framework to address these challenges.

1. Optimization-to-population risk bridge under partial likelihood. We establish an oracle inequality for DNN estimators trained via gradient-based optimization under the Cox partial likelihood. Existing analyses typically assume access to the exact empirical optimizer (Zhong et al., 2022), which permits a two-step argument: first, uniform convergence establishes consistency, and second, a localized empirical process analysis around the true function yields convergence rates. In contrast, gradient-based training does not guarantee a global optimizer of the empirical objective, preventing a direct application of such localization arguments. Instead, we derive a global bound on the population risk of the form
$$\text{population risk} \lesssim \text{optimization gap} + \text{approximation error} + \text{statistical error},$$
where the optimization gap captures the deviation from exact empirical optimality. The resulting bound is self-referential, in that the population risk appears on both sides of the inequality.
We resolve this by combining empirical process techniques for Cox models (Huang, 1999; Zhong et al., 2022) with Bernstein-type concentration inequalities over suitable coverings of the DNN class.

2. Bias calibration linking population risk and pointwise inference. We control pointwise bias via sharp $L_\infty$ approximation bounds. By scaling the network architecture (e.g., width and depth), the approximation error is asymptotically dominated by the stochastic fluctuation of the estimator, analogous to undersmoothing in classical nonparametric inference. This renders the approximation term in the oracle inequality negligible for pointwise inference. We further impose a mild alignment condition requiring that the trained network achieve the approximation rate, avoiding underutilization of model capacity during optimization.

3. Subsample ensemble inference under relaxed single-overlap covariance decay. We construct a subsampling-based ensemble estimator and analyze its distribution using the Hájek–Hoeffding projection. Its variance is characterized by the single-overlap covariance introduced in Section 4, which captures dependence between estimators trained on subsamples sharing one observation (Wager and Athey, 2018). Existing ensemble subsampling (ESM) theory for generalized nonparametric regression (Meng and Li, 2026) typically assumes near-minimal decay of this covariance at rate $1/r$, where $r$ is the subsample size. We relax this requirement and establish asymptotic normality over a broader range of decay rates. A key condition is that $r$ is large enough so that approximation bias is negligible while remaining $o(n)$, so that higher-order U-statistic terms are negligible and the Hájek projection dominates. Under these conditions, we establish multivariate asymptotic normality for finite sets of evaluation points and prove consistency of an infinitesimal jackknife covariance estimator, enabling Wald-type inference for nonlinear contrasts (1.2).

Section 2 reviews DNNs and the nonparametric Cox model. Section 3 introduces the subsample-based ensemble inferential framework. Section 4 presents the theoretical results, including global consistency and asymptotic distributional theory for the ensemble estimator. Sections 5 and 6 illustrate the finite-sample performance through simulations and a real data application. Section 7 concludes. Full proofs and auxiliary lemmas are given in the appendix.

1.1 Notation

Vectors are boldfaced: uppercase bold letters (e.g., $X$) represent random vectors, and lowercase bold letters (e.g., $x$ or $x_*$) denote fixed points or realizations. For random $X$, $P_X$ denotes its distribution and $E_X$ the corresponding expectation. For a vector $x = (x_1, \ldots, x_d)^\top$, define its $\ell_p$-, $\ell_\infty$-, and $\ell_0$-norms by
$$|x|_p = \Big(\sum_{j=1}^d |x_j|^p\Big)^{1/p}, \quad |x|_\infty = \max_{1 \le j \le d} |x_j|, \quad |x|_0 = \sum_{j=1}^d I(x_j \neq 0),$$
where $I(\cdot)$ is the indicator function. For a function $f: \mathcal{X} \to \mathbb{R}$, define its $L_p$ norm by
$$\|f\|_p = \Big(\int_{\mathcal{X}} |f(x)|^p \, dx\Big)^{1/p}, \quad 1 \le p < \infty, \qquad \|f\|_\infty = \sup_{x \in \mathcal{X}} |f(x)|.$$
For a matrix $A = (a_{ij})$, define $\|A\|_0 = \sum_{i,j} I(a_{ij} \neq 0)$ and $\|A\|_\infty = \max_i \sum_j |a_{ij}|$. For two deterministic sequences $x_n$ and $a_n$, write $x_n = O(a_n)$ if $|x_n/a_n| \le C$ for some $C > 0$ when $n$ is sufficiently large, and $x_n = o(a_n)$ if $x_n/a_n \to 0$.
For random variables $X_n$, $X_n = O_p(a_n)$ if for any $\epsilon > 0$ there exists $C > 0$ such that $P(|X_n/a_n| > C) \le \epsilon$ when $n$ is sufficiently large, and $X_n = o_p(a_n)$ if $X_n/a_n \overset{p}{\to} 0$. For sequences $X_n$ and $Y_n$, write $X_n \lesssim Y_n$ if $X_n \le C Y_n$ for some $C > 0$ when $n$ is sufficiently large, and $X_n \asymp Y_n$ if $X_n \lesssim Y_n$ and $Y_n \lesssim X_n$. Finally, $a \wedge b = \min\{a, b\}$ and $a \vee b = \max\{a, b\}$.

2 Preliminaries

2.1 Multilayer Neural Networks

Let $L \ge 1$ denote the number of hidden layers and $p = (p_0, \ldots, p_{L+1})$ the corresponding layer widths. With ReLU activation $\sigma(x) = \max\{x, 0\}$ applied componentwise, define the shifted activation $\sigma_v(x) = \sigma(x - v)$. We focus on ReLU because of its strong approximation properties and favorable optimization behavior; in particular, its derivative equals one for positive inputs, preventing gradient attenuation during training (Schmidt-Hieber, 2020; Zhong et al., 2022). A network with architecture $(L, p)$ is the mapping $f: \mathbb{R}^{p_0} \to \mathbb{R}^{p_{L+1}}$ defined by
$$f(x) = W_L \, \sigma_{v_L} \big( W_{L-1} \, \sigma_{v_{L-1}} \big( \cdots W_1 \, \sigma_{v_1} ( W_0 x ) \cdots \big) \big), \tag{2.1}$$
where $W_l \in \mathbb{R}^{p_{l+1} \times p_l}$ and $v_l \in \mathbb{R}^{p_{l+1}}$. Here $p_0$ denotes the input dimension, $p_1, \ldots, p_L$ are the widths of the $L$ hidden layers, and $p_{L+1}$ is the output dimension; for example, in Figure 1, $L = 3$ with $p_0 = 5$, $p_1 = p_2 = 4$, $p_3 = 3$, and $p_4 = 1$.

To mitigate overfitting, we restrict attention to sparsely connected networks. Define the $s$-sparse network class with envelope $F$ by
$$\mathcal{F}(L, p, s, F) = \Big\{ g \in \mathcal{F}(L, p) : \sum_{l=0}^{L} \|W_l\|_0 + \sum_{l=1}^{L} |v_l|_0 \le s, \ \|g\|_\infty \le F \Big\}, \tag{2.2}$$
where
$$\mathcal{F}(L, p) = \Big\{ g(x) = f(x) - f(\mathbf{0}) : f \text{ is a DNN of the form (2.1)}, \ \|W_l\|_\infty \le 1 \text{ for } l = 0, \ldots, L, \ |v_l|_\infty \le 1 \text{ for } l = 1, \ldots, L \Big\}.$$
By construction, $g(\mathbf{0}) = 0$ for all $g \in \mathcal{F}(L, p)$, which ensures identifiability.

Figure 1: A fully connected neural network with $L = 3$ and $p = (5, 4, 4, 3, 1)$.
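To make the construction concrete, the following is a minimal PyTorch sketch (not the authors' implementation) of a fully connected ReLU network with the output centering $g(x) = f(x) - f(\mathbf{0})$ used by the class $\mathcal{F}(L, p)$; the sparsity and sup-norm constraints of (2.2) are omitted for brevity.

```python
import torch
import torch.nn as nn

class CenteredReLUNet(nn.Module):
    """ReLU network f with output centered as g(x) = f(x) - f(0),
    so that g(0) = 0, as required for identifiability (Section 2.1)."""

    def __init__(self, widths=(10, 128, 64, 1)):
        super().__init__()
        layers = []
        for w_in, w_out in zip(widths[:-1], widths[1:]):
            layers += [nn.Linear(w_in, w_out), nn.ReLU()]
        layers = layers[:-1]  # no activation after the output layer
        self.f = nn.Sequential(*layers)

    def forward(self, x):
        # enforce g(0) = 0 by subtracting the network value at the origin
        zero = torch.zeros(1, x.shape[-1], device=x.device)
        return self.f(x) - self.f(zero)
```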
2.2 Nonparametric Cox Models and Hölder Functional Classes

We consider right-censored survival data, where $T^U$ and $T^C$ denote the event time and censoring time, respectively, and $X \in [0,1]^{p_0}$ is a $p_0$-dimensional covariate vector. We assume that $T^U$ and $T^C$ are conditionally independent given $X$ (Fleming and Harrington, 2005). We observe $n$ independent and identically distributed (i.i.d.) copies $D_i = (T_i, \Delta_i, X_i)$ of $D = (T, \Delta, X)$, where $T_i = T_{i,U} \wedge T_{i,C}$ and $\Delta_i = I(T_{i,U} \le T_{i,C})$. Let $\mathcal{D}_n = \{D_i\}_{i=1}^n$ denote the sample of size $n$.

We study the nonparametric Cox model (1.1), which extends the classical Cox model (Cox, 1972) by allowing the log-risk function $g_0(\cdot)$ to be an unknown function of $X$. To ensure identifiability, we impose $g_0(\mathbf{0}) = 0$. While existing work on nonparametric and DNN-based estimators mainly establishes consistency (Zhong et al., 2022), we develop inference for $g_0(\cdot)$, including uncertainty quantification for pointwise evaluations and contrasts.

We assume that $g_0$ belongs to a Hölder class $\mathcal{G}^{\gamma}_{p_0}([0,1]^{p_0}, K)$ with smoothness $\gamma > 0$ and radius $K > 0$ (Schmidt-Hieber, 2020; Zhong et al., 2022), defined as
$$\mathcal{G}^{\gamma}_{p_0}([0,1]^{p_0}, K) = \bigg\{ g: [0,1]^{p_0} \to \mathbb{R} : \sum_{\beta: |\beta|_1 < \gamma} \|\partial^{\beta} g\|_\infty + \sum_{\beta: |\beta|_1 = \lfloor \gamma \rfloor} \sup_{x \neq y \in [0,1]^{p_0}} \frac{|\partial^{\beta} g(x) - \partial^{\beta} g(y)|}{\|x - y\|_\infty^{\gamma - \lfloor \gamma \rfloor}} \le K \bigg\}.$$
Here $\lfloor \gamma \rfloor$ denotes the integer part of $\gamma$, and for a multi-index $\beta = (\beta_1, \ldots, \beta_{p_0})^\top \in \mathbb{N}^{p_0}$, $\partial^{\beta} = \partial^{\beta_1} \cdots \partial^{\beta_{p_0}}$.

We further assume that $g_0$ admits a compositional representation with Hölder-smooth components. For some integer $q \ge 0$ and vectors $d = (d_0, \ldots, d_{q+1}) \in \mathbb{N}^{q+2}_+$, $t = (t_0, \ldots, t_q) \in \mathbb{N}^{q+1}_+$, and $\gamma = (\gamma_0, \ldots, \gamma_q) \in \mathbb{R}^{q+1}_+$, we assume
$$g_0 \in \mathcal{G}(q, d, t, \gamma, K) = \Big\{ g = g^{(q)} \circ \cdots \circ g^{(0)} : g^{(i)} = (g^{(i)}_j)_j : [a_i, b_i]^{d_i} \to [a_{i+1}, b_{i+1}]^{d_{i+1}}, \ g^{(i)}_j \in \mathcal{G}^{\gamma_i}_{t_i}([a_i, b_i]^{t_i}), \ |a_i|, |b_i| \le K \Big\}. \tag{2.3}$$
The class $\mathcal{G}$ combines Hölder smoothness with a compositional structure, with each $g^{(i)}$ acting on the output of the preceding map, reflecting the layered structure of DNNs (Yarotsky, 2017). The vectors $d$ and $t$ specify layer-wise dimensions, and $\gamma$ encodes smoothness at each level. In a $p_0$-variate Cox model (1.1), $g_0: [0,1]^{p_0} \to \mathbb{R}$, so $d_0 = p_0$, $d_{q+1} = 1$, $a_0 = 0$, and $b_0 = 1$. We provide two examples with $q = 2$.

Example 1. $g_0(x) = g^{(2)}_1 \big( g^{(1)}_1 \big( g^{(0)}_1(x_1, x_2), g^{(0)}_2(x_3, x_4, x_5) \big), g^{(1)}_2 \big( g^{(0)}_3(x_6, x_7), g^{(0)}_4(x_8, x_9, x_{10}) \big) \big)$, $x \in [0,1]^{10}$. If all component functions $g^{(i)}_j$ are twice continuously differentiable, then $\gamma = (2, 2, 2)$, $d = (10, 4, 2, 1)^\top$, and $t = (3, 2, 2)^\top$.

Example 2. $g_0(x) = g^{(2)}_1 \big( g^{(1)}_1 \big( g^{(0)}_1(x_1, x_2), g^{(0)}_2(x_3) \big), g^{(1)}_2 \big( g^{(0)}_3(x_4, x_5), g^{(0)}_4(x_6) \big) \big)$, $x \in [0,1]^6$, where we suppose each $g^{(i)}_j$ has Hölder smoothness $\gamma_i = \tfrac{3}{2}$. Hence, $\gamma = (\tfrac{3}{2}, \tfrac{3}{2}, \tfrac{3}{2})$, $d = (6, 4, 2, 1)^\top$, and $t = (2, 2, 2)^\top$.

The composite-smoothness class $\mathcal{G}(q, d, t, \gamma, K)$ subsumes standard nonparametric models, including additive structures (Stone, 1985) and nested compositional regressions (Horowitz and Mammen, 2007). Related work extends this framework to broader hierarchical compositions beyond the classical Hölder setting (Kohler and Langer, 2021).

Define the effective smoothness at level $i$ by
$$\gamma^*_i = \gamma_i \prod_{l=i+1}^{q} (\gamma_l \wedge 1),$$
which represents the Hölder smoothness associated with level $i$ after attenuation through subsequent compositions. With a training sample of size $n$, define
$$\phi_n = \max_{i=0,\ldots,q} n^{-2\gamma^*_i/(2\gamma^*_i + t_i)}, \tag{2.4}$$
which is the minimax-optimal estimation rate, up to logarithmic factors, for deep ReLU estimators (Schmidt-Hieber, 2020). Writing $\phi_n = n^{-c_{\mathrm{eff}}}$ gives
$$c_{\mathrm{eff}} = \min_{i=0,\ldots,q} \frac{2\gamma^*_i}{2\gamma^*_i + t_i}, \qquad i_{\mathrm{eff}} = \arg\min_{0 \le i \le q} \frac{2\gamma^*_i}{2\gamma^*_i + t_i}. \tag{2.5}$$
Here, $i_{\mathrm{eff}}$ identifies the bottleneck level in the compositional structure, and $c_{\mathrm{eff}}$ determines the overall rate $\phi_n = n^{-c_{\mathrm{eff}}}$. Thus, the slowest level in the Hölder hierarchy governs this overall rate. It also sets the scale of the optimal approximation bias as stated in Theorem 1. A small numerical illustration of these quantities follows.
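As a check of the definitions in (2.4)–(2.5), the snippet below computes $\gamma^*_i$, $c_{\mathrm{eff}}$, and $i_{\mathrm{eff}}$ for Example 2; the function name effective_rate is ours, not from the paper.

```python
import numpy as np

def effective_rate(gamma, t):
    """Effective smoothness gamma*_i and the rate exponents of (2.4)-(2.5)
    for a compositional class G(q, d, t, gamma, K)."""
    gamma, t = np.asarray(gamma, float), np.asarray(t, float)
    q = len(gamma) - 1
    # gamma*_i = gamma_i * prod_{l > i} min(gamma_l, 1)
    gamma_star = np.array([
        gamma[i] * np.prod(np.minimum(gamma[i + 1:], 1.0)) for i in range(q + 1)
    ])
    rates = 2 * gamma_star / (2 * gamma_star + t)
    return gamma_star, rates.min(), int(rates.argmin())

# Example 2: gamma = (3/2, 3/2, 3/2), t = (2, 2, 2)
g_star, c_eff, i_eff = effective_rate([1.5, 1.5, 1.5], [2, 2, 2])
print(g_star, c_eff, i_eff)  # [1.5 1.5 1.5], c_eff = 0.6, i_eff = 0
# hence phi_n = n^{-0.6}, e.g. phi_n ~ 0.018 at n = 800
```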
3 Inference on DNN Estimates via an Ensemble Subsampling Learner

We estimate $g_0(\cdot)$ in model (1.1) by maximizing the Cox partial likelihood over the DNN function class $\mathcal{F}(L, p, s, F)$ defined in (2.2). For $g: [0,1]^{p_0} \to \mathbb{R}$, the log partial likelihood is
$$L_n(g) = n^{-1} \sum_{i=1}^{n} \Delta_i \Big[ g(X_i) - \log \Big\{ \sum_{j: T_j \ge T_i} \exp\{g(X_j)\} \Big\} \Big]. \tag{3.1}$$
The empirical optimizer is
$$\hat{g}_{\mathrm{opt}} = \arg\max_{g \in \mathcal{F}(L, p, s, F)} L_n(g). \tag{3.2}$$
In practice, DNNs are trained by stochastic gradient methods and may not attain $\hat{g}_{\mathrm{opt}}$. To quantify the resulting optimization error, we define the expected optimization gap
$$Q_n(\hat{g}, \hat{g}_{\mathrm{opt}}) := E_{\mathcal{D}_n} \{ L_n(\hat{g}_{\mathrm{opt}}) - L_n(\hat{g}) \},$$
so that $Q_n(\hat{g}, \hat{g}_{\mathrm{opt}}) \ge 0$, with equality if $\hat{g} = \hat{g}_{\mathrm{opt}}$. When the context is clear, we write $Q_n(\hat{g}) := Q_n(\hat{g}, \hat{g}_{\mathrm{opt}})$.

Unlike separable losses, (3.1) does not decompose into a sum of independent contributions due to risk-set coupling, and standard empirical process arguments do not apply. To isolate this dependence, we rewrite the partial likelihood as
$$L_n(g) = n^{-1} \sum_{i=1}^{n} \Delta_i \, \ell_n(T_i, X_i; g) - n^{-1} \sum_{i=1}^{n} \Delta_i, \tag{3.3}$$
where
$$\ell_n(t, X; g) = g(X) - \log S^{(0)}_n(t; g), \qquad S^{(0)}_n(t; g) = n^{-1} \sum_{j=1}^{n} Y_j(t) \exp\{g(X_j)\},$$
and $Y_j(t) = I(T_j \ge t)$. Since the second term in (3.3) does not depend on $g$, we work with
$$L_n(g) = n^{-1} \sum_{i=1}^{n} \Delta_i \, \ell_n(T_i, X_i; g),$$
which yields the same optimizer as (3.2). We define the population counterpart of $L_n(\cdot)$ as
$$L_0(g) = E_D \{ \Delta \, \ell_0(T, X; g) \}, \qquad \ell_0(t, X; g) = g(X) - \log S^{(0)}(t; g),$$
for $0 \le t \le \tau$, where $\tau$ is the study end time (Condition 1), and
$$S^{(0)}(t; g) = E_D \{ Y(t) \exp\{g(X)\} \}, \qquad Y(t) = I(T \ge t).$$
The population risk of any fixed function $g$ is
$$R(g, g_0) = L_0(g_0) - L_0(g),$$
which is nonnegative (Lemma 1 in the appendix). For a data-dependent estimator $\hat{g}$, the risk $R(\hat{g}, g_0)$ is random, and we consider its expectation $E_{\mathcal{D}_n}[R(\hat{g}, g_0)]$, which we refer to as the population risk of $\hat{g}$. Uniform convergence of $S^{(0)}_n(t; g)$ to $S^{(0)}(t; g)$ over $t \in [0, \tau]$ allows the data-dependent risk-set average to be replaced by its deterministic limit, so that $L_n(g)$ can be treated as an empirical process with a remainder term, whose magnitude we quantify for DNN estimators.
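For concreteness, here is a minimal PyTorch sketch of the negative log partial likelihood $-L_n(g)$ in (3.1), assuming continuous event times; tied times are handled only approximately (Breslow-style, up to the sort order). This is an illustration, not the authors' training code.

```python
import torch

def neg_log_partial_likelihood(g, time, event):
    """Negative Cox log partial likelihood -L_n(g) of (3.1).
    g:     (n,) network outputs g(X_i)
    time:  (n,) observed times T_i
    event: (n,) indicators Delta_i (1 = event, 0 = censored)"""
    order = torch.argsort(time, descending=True)
    g, event = g[order], event[order].float()
    # After sorting by descending time, the risk set {j : T_j >= T_i}
    # is the prefix up to i, so log-sum-exp can be taken cumulatively.
    log_risk = torch.logcumsumexp(g, dim=0)
    return -torch.sum(event * (g - log_risk)) / len(g)
```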
Inference based on $\hat{g}$ is challenging because it lacks a tractable influence-function representation and its sampling variability is difficult to characterize. By averaging DNN estimators trained on subsamples, ESM stabilizes the estimator and yields an approximate linear representation via its first-order Hájek projection, enabling statistical inference.

3.1 Estimation and Inference via Ensemble Subsampling

For a fixed covariate point $x_* \in [0,1]^{p_0}$ and a pair $x^{(1)}_*, x^{(2)}_* \in [0,1]^{p_0}$, we introduce an ensemble subsampling method (ESM; Algorithm 1) to estimate $g_0(x_*)$ and conduct inference for both $g_0(x_*)$ and the log-hazard ratio $\psi(x^{(1)}_*, x^{(2)}_*) = g_0(x^{(1)}_*) - g_0(x^{(2)}_*)$. The ensemble estimator $\hat{g}_B$ aggregates subsample estimators $\{\hat{g}_b\}$ trained on overlapping subsets of $\mathcal{D}_n$, and thus forms a sum of dependent terms. However, $\hat{g}_B$ has the structure of an incomplete U-statistic over subsamples. This enables a Hájek projection (Wager and Athey, 2018), which decomposes $\hat{g}_B$ into a first-order term, a sum of independent influence contributions, and an asymptotically negligible higher-order remainder. The linear representation characterizes the asymptotic distribution.

Algorithm 1: Ensemble Subsampling

1: Input: Observed data $\mathcal{D}_n = \{(T_i, \Delta_i, X_i) : i = 1, \ldots, n\}$, subsample size $r$, number of subsamples $B$.

2: Subsampling: Let $\mathcal{I} = \{1, \ldots, n\}$ index observations. Using subset sampling, randomly generate $B$ subsets $\mathcal{I}_b \subset \mathcal{I}$ with $b = 1, \ldots, B$.

3: Model Training: For each $1 \le b \le B$, maximize the subsample partial likelihood $L_r(g; \mathcal{I}_b)$, defined from (3.1) on the observations indexed by $\mathcal{I}_b$. The optimization is performed over $\mathcal{F}(L, p, s, F)$ using a gradient-based algorithm such as SGD. Let $\hat{g}_b$ denote the resulting (approximate) solution,
$$\hat{g}_b \approx \hat{g}^b_{\mathrm{opt}} := \arg\max_{g \in \mathcal{F}(L, p, s, F)} L_r(g; \mathcal{I}_b), \tag{3.4}$$
so that
$$Q_r(\hat{g}_b, \hat{g}^b_{\mathrm{opt}}) \le Q^{\mathrm{opt}}_b, \tag{3.5}$$
where $Q^{\mathrm{opt}}_b$ quantifies the optimization error.

4: Ensemble Estimator: Aggregate the subsample estimators to form the ensemble estimate for any fixed $x_* \in [0,1]^{p_0}$,
$$\hat{g}_B(x_*) = \frac{1}{B} \sum_{b=1}^{B} \hat{g}_b(x_*). \tag{3.6}$$
The corresponding log-hazard ratio (contrast) estimator for two profiles $x^{(1)}_*$ and $x^{(2)}_*$ is
$$\hat{\psi}_B(x^{(1)}_*, x^{(2)}_*) = \hat{g}_B(x^{(1)}_*) - \hat{g}_B(x^{(2)}_*). \tag{3.7}$$

5: Variance Estimation: Estimate the asymptotic pointwise variance $\hat{\sigma}^2(x_*)$ and the contrast variance $\hat{\sigma}^2_{1,2}$ using the infinitesimal jackknife (IJ), which leverages the ESM subsampling structure to provide uncertainty quantification.

6: Confidence Interval (CI) Construction: Compute the approximate $(1 - \alpha) \times 100\%$ CIs:
$$\mathrm{CI}_{g_0(x_*)} = \big[ \hat{g}_B(x_*) - c_\alpha \hat{\sigma}(x_*), \ \hat{g}_B(x_*) + c_\alpha \hat{\sigma}(x_*) \big],$$
$$\mathrm{CI}_{\psi(x^{(1)}_*, x^{(2)}_*)} = \big[ \hat{\psi}_B(x^{(1)}_*, x^{(2)}_*) - c_\alpha \hat{\sigma}_{1,2}, \ \hat{\psi}_B(x^{(1)}_*, x^{(2)}_*) + c_\alpha \hat{\sigma}_{1,2} \big],$$
where $0 < \alpha < 1$ and $c_\alpha$ denotes the $(1 - \alpha/2)$ quantile of the standard normal. With $\lambda(t \mid x^{(1)}_*)/\lambda(t \mid x^{(2)}_*) = \exp\{\psi(x^{(1)}_*, x^{(2)}_*)\}$, the corresponding CI for the relative risk is $(\exp(L), \exp(U))$, where $L$ and $U$ are the lower and upper bounds of $\mathrm{CI}_{\psi(x^{(1)}_*, x^{(2)}_*)}$, respectively.

Enumerating all $B^* = \binom{n}{r}$ subsets is infeasible, so we draw $B$ random subsets $\mathcal{I}_{b_1}, \ldots, \mathcal{I}_{b_B} \subset \mathcal{I} = \{1, \ldots, n\}$ with $|\mathcal{I}_b| = r$. The ESM estimator is defined as the average of the corresponding subsample estimators in (3.6).

We estimate the asymptotic variance of the ensemble estimate $\hat{g}_B(x_*)$ using the bias-corrected infinitesimal jackknife (IJ) variance estimator, denoted by $\hat{\sigma}^2(x_*)$. Define
$$Z^j_i(x_*) = \big( J^j_i - J^{\cdot}_i \big) \big( \hat{g}_{b_j}(x_*) - \hat{g}_B(x_*) \big), \qquad \hat{V}_i(x_*) = \frac{1}{B} \sum_{j=1}^{B} Z^j_i(x_*),$$
where $J^j_i = I(i \in \mathcal{I}_{b_j})$ is the subsample inclusion indicator for replicate $j$, and $J^{\cdot}_i = B^{-1} \sum_{j=1}^{B} J^j_i$. The pointwise IJ variance is then given by
$$\hat{\sigma}^2(x_*) = \frac{n(n-1)}{(n-r)^2} \sum_{i=1}^{n} \hat{V}^2_i(x_*) - \frac{1}{B(B-1)} \sum_{i=1}^{n} \sum_{j=1}^{B} \big( Z^j_i(x_*) - \hat{V}_i(x_*) \big)^2. \tag{3.8}$$
Consequently, to evaluate the relative risk between two covariate profiles, we estimate the variance of the contrast $\hat{\psi}_B(x^{(1)}_*, x^{(2)}_*) = \hat{g}_B(x^{(1)}_*) - \hat{g}_B(x^{(2)}_*)$ as
$$\hat{\sigma}^2_{1,2} = \hat{\sigma}^2(x^{(1)}_*) + \hat{\sigma}^2(x^{(2)}_*) - 2\hat{\tau}(x^{(1)}_*, x^{(2)}_*), \tag{3.9}$$
where
$$\hat{\tau}(x^{(1)}_*, x^{(2)}_*) = \frac{n(n-1)}{(n-r)^2} \sum_{i=1}^{n} \hat{V}_i(x^{(1)}_*) \hat{V}_i(x^{(2)}_*) - \frac{1}{B(B-1)} \sum_{i=1}^{n} \sum_{j=1}^{B} \big( Z^j_i(x^{(1)}_*) - \hat{V}_i(x^{(1)}_*) \big) \big( Z^j_i(x^{(2)}_*) - \hat{V}_i(x^{(2)}_*) \big). \tag{3.10}$$
The variance framework in (3.8)–(3.10) extends Meng and Li (2026) to account for joint dependence across multiple evaluations. The scaling factors correct Monte Carlo and finite-sample bias, enabling valid inference when $B \gtrsim n$. Consistency of this covariance estimator is established later. A sketch of the IJ computation is given below.
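A minimal NumPy sketch of the pointwise IJ estimator (3.8); the cross-covariance (3.10) replaces the squared terms with cross-products (a sketch appears after Corollary 2). Variable names are ours.

```python
import numpy as np

def ij_variance(preds, inclusion, n, r):
    """Bias-corrected infinitesimal jackknife variance (3.8).
    preds:     (B,) subsample predictions ghat_b(x*)
    inclusion: (B, n) 0/1 matrix, inclusion[j, i] = I(i in I_{b_j})
    n, r:      full-sample and subsample sizes."""
    B = len(preds)
    centered_pred = preds - preds.mean()                  # ghat_b - ghat_B
    centered_incl = inclusion - inclusion.mean(axis=0)    # J_i^j - J_i^.
    Z = centered_incl * centered_pred[:, None]            # Z_i^j, shape (B, n)
    V = Z.mean(axis=0)                                    # Vhat_i, shape (n,)
    raw = n * (n - 1) / (n - r) ** 2 * np.sum(V ** 2)
    mc_correction = np.sum((Z - V) ** 2) / (B * (B - 1))  # Monte Carlo bias term
    return raw - mc_correction
```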
4 Theoretical Results

We establish asymptotic results for the ESM estimators (3.6) and (3.7). To characterize their dependence structure, we introduce covariance measures that quantify the influence of individual observations. Let $D_i = (T_i, \Delta_i, X_i)$, $i = 1, \ldots, n$, be i.i.d. observations. Define the single-overlap covariance measures
$$\zeta^{(1)}_r(x_*) = \mathrm{Cov}\big( \hat{g}(x_*; D_1, \ldots, D_r), \ \hat{g}(x_*; D_1, D'_2, \ldots, D'_r) \big),$$
$$\zeta^{(1)}_r(x_*, x'_*) = \mathrm{Cov}\big( \hat{g}(x_*; D_1, \ldots, D_r), \ \hat{g}(x'_*; D_1, D'_2, \ldots, D'_r) \big), \tag{4.1}$$
where $x_*, x'_* \in [0,1]^{p_0}$ are fixed evaluation points, and the $D'_i$ are independent copies of $D_i$. Here $\hat{g}(x_*; D_1, \ldots, D_r)$ denotes the estimator trained on $\{D_1, \ldots, D_r\}$ and evaluated at $x_*$. These quantities measure the covariance between estimators trained on subsamples that share one observation and differ in all remaining entries. They capture the leading-order dependence induced by overlap in the subsampling procedure and are central to the Hájek expansion of the resulting U-statistic (Wager and Athey, 2018).

Condition 1. Assume the study ends at time $\tau$ such that there exists $\varepsilon_0 > 0$ satisfying $P(\Delta = 1 \mid X) \ge \varepsilon_0$ and $P(T^U \ge \tau \mid X) \ge \varepsilon_0$, almost surely with respect to $P_X$; we also impose the condition $g_0(\mathbf{0}) = 0$.

Condition 1 imposes a standard nondegeneracy condition (Andersen and Gill, 1982), ensuring that the probabilities of failure and of remaining at risk up to time $\tau$ are uniformly bounded away from zero, thereby preventing degeneration of the risk-set and partial-likelihood processes. The constraint $g_0(\mathbf{0}) = 0$ ensures identifiability of model (1.1).

To characterize the approximation capacity of the network, we impose architectural constraints on the DNN class in the following Condition 2. In particular, we introduce a hyperparameter $\delta \ge 0$. The case $\delta = 0$ corresponds to the baseline regime achieving the minimax-optimal rate (Schmidt-Hieber, 2020; Zhong et al., 2022), whereas $\delta > 0$ yields an overparameterized architecture required for the subsequent analysis.

Condition 2. With $g_0$ specified in (2.3), the network class $\mathcal{F}(L, p, s, F)$ is required to satisfy the following structural conditions for some $\delta \ge 0$:
- $\log^2 n \lesssim L \lesssim \log^{\mu}(n)$ for some $\mu > 1$;
- $n^{1+\delta}\phi_n \lesssim \min_{1 \le i \le L} p_i \le \max_{1 \le i \le L} p_i \lesssim n$;
- $s \asymp n^{1+\delta}\phi_n \log n$;
- $F \ge \max\{K, 1\}$.

We define the optimal network in $\mathcal{F}(L, p, s, F)$ that best approximates $g_0$:
$$g_{\mathrm{apx}} \in \arg\min_{g \in \mathcal{F}(L, p, s, F)} \|g - g_0\|_\infty,$$
and use the optimal approximation error, $\|g_{\mathrm{apx}} - g_0\|_\infty$, to measure the approximation capability of $\mathcal{F}(L, p, s, F)$. We establish a bound for $\|g_{\mathrm{apx}} - g_0\|_\infty$ and show that increasing $\delta$ sharpens this bound.

Theorem 1 (Bound on the Optimal Approximation Error). Let the true function $g_0$ be specified in (2.3) with $g_0(\mathbf{0}) = 0$. Under Condition 2, it holds that
$$\|g_{\mathrm{apx}} - g_0\|^2_\infty \lesssim \phi_n^{1+\eta}, \qquad \text{where } \eta = \frac{2\delta\gamma^*_{i_{\mathrm{eff}}}}{t_{i_{\mathrm{eff}}} c_{\mathrm{eff}}},$$
with $c_{\mathrm{eff}}$ and $i_{\mathrm{eff}}$ defined in (2.5).

Theorem 1 characterizes the approximation capability of $\mathcal{F}(L, p, s, F)$. At the minimax sparsity level ($\delta = 0$), we have $\eta = 0$, so that $\|g_{\mathrm{apx}} - g_0\|_\infty = O(\phi_n^{1/2})$. For $\delta > 0$, overparameterization sharpens this bound to $O(\phi_n^{(1+\eta)/2})$.

We next establish a non-asymptotic population risk bound for any estimator $\hat{g} \in \mathcal{F}(L, p, s, F)$.
Theorem 2. Under Conditions 1 and 2 with $0 \le \delta < c_{\mathrm{eff}}$, for any estimator $\hat{g} \in \mathcal{F}(L, p, s, F)$, there exist constants $c^*, C^*_1, C^*_2 > 0$, depending only on the network bound $F$ and the structural constants defined in Condition 2, such that the population risk satisfies
$$\max\Big\{ 0, \ \underbrace{\tfrac{1}{2} Q_n(\hat{g}, \hat{g}_{\mathrm{opt}})}_{\text{optimization gap}} - \underbrace{c^* \phi_n n^{\delta} L \log^2(n)}_{\text{statistical error}} \Big\} \le E_{\mathcal{D}_n}[R(\hat{g}, g_0)] \le \underbrace{2 Q_n(\hat{g}, \hat{g}_{\mathrm{opt}})}_{\text{optimization gap}} + \underbrace{C^*_1 \|g_{\mathrm{apx}} - g_0\|^2_\infty}_{\text{approximation error}} + \underbrace{C^*_2 \phi_n n^{\delta} L \log^2(n)}_{\text{statistical error}}. \tag{4.2}$$

Theorem 2 establishes that the population risk is bounded by the optimization gap together with approximation and statistical error terms, and clarifies the role of $\delta$. When $\delta = 0$ and the expected optimization error $Q_n$ is small, the bound attains the minimax rate $O(\phi_n L \log^2 n)$ (Zhong et al., 2022). For $0 < \delta < c_{\mathrm{eff}}$, the rate increases to $O(\phi_n n^{\delta} L \log^2 n)$ but still vanishes since $\phi_n n^{\delta} L \log^2 n = o(1)$. While the oracle inequality provides global control of the population risk, its local Kullback–Leibler structure induces quadratic curvature (Lemma 3 in the appendix), yielding pointwise error bounds in the following corollary and enabling valid inference after bias calibration.

Corollary 1 (Pointwise Mean Squared Error Bound). Suppose the conditions of Theorem 2 hold with $0 \le \delta < c_{\mathrm{eff}}$, and let $\hat{g} \in \mathcal{F}(L, p, s, F)$ satisfy $Q_n(\hat{g}, \hat{g}_{\mathrm{opt}}) \le C^*_1 \phi_n n^{\delta} L \log^2 n$ for some constant $C^*_1 > 0$. For any slowly diverging sequence $M_n \to \infty$, if $x_* \sim P_X$ and is independent of $\mathcal{D}_n$, then the pointwise mean squared error satisfies
$$E_{\mathcal{D}_n}\big[ (\hat{g}(x_*) - g_0(x_*))^2 \big] \le M_n \phi_n n^{\delta} L \log^2 n$$
with probability at least $1 - M_n^{-1}$ with respect to $P_X$.

Let
$$\mathcal{B}_{n,\delta} = \Big\{ x_* \in [0,1]^{p_0} : E_{\mathcal{D}_n}\big[ (\hat{g}(x_*) - g_0(x_*))^2 \big] \le M_n \phi_n n^{\delta} L \log^2 n \Big\} \tag{4.3}$$
denote the subset characterized by Corollary 1, which satisfies $P_X(\mathcal{B}_{n,\delta}) \ge 1 - M_n^{-1}$. For $x_* \in \mathcal{B}_{n,\delta}$, the pointwise MSE vanishes when $0 \le \delta < c_{\mathrm{eff}}$. To establish asymptotic normality at $x_*$, however, the pointwise bias must also decay at a suitable rate. Because the estimator $\hat{g}_b$ in (3.6) aggregates base learners trained on subsamples via gradient-based optimization, both the statistical estimation bias and the algorithmic optimization bias need to be controlled at the subsample level.

Condition 3. For each subsample $\mathcal{D}_r$ (indexed by $\mathcal{I}_b$) with size $r$, we assume the pointwise estimation bias of the trained estimator $\hat{g}_b \in \mathcal{F}(L, p, s, F)$ defined in (3.4) is governed by the optimal network $g_{\mathrm{apx}}$. For any target point $x_* \in [0,1]^{p_0}$,
$$\big| E_{\mathcal{D}_r}[\hat{g}_b(x_*)] - g_{\mathrm{apx}}(x_*) \big| \lesssim \{ A_r(g_{\mathrm{apx}}, x_*) \}^{1 - \xi_r}, \quad \text{for a small } 0 < \xi_r < 1,$$
where $A_r(g_{\mathrm{apx}}, x_*) = |g_{\mathrm{apx}}(x_*) - g_0(x_*)|$ is the pointwise approximation error of $g_{\mathrm{apx}}$ at the subsample level $r$.

Condition 4. For each subsample $\mathcal{D}_r$ (indexed by $\mathcal{I}_b$) with size $r$, the maximum optimization error $Q^{\mathrm{opt}}_r$ defined in (3.5) satisfies $Q^{\mathrm{opt}}_r \lesssim \phi_r r^{\delta} L \log^2 r$.

Condition 3 is anchored at the approximation benchmark $g_{\mathrm{apx}}$, the best achievable element in $\mathcal{F}(L, p, s, F)$. It requires that, on average, the trained estimator is close to $g_{\mathrm{apx}}$, thereby separating approximation error from training error. This formulation is intrinsic to the function class and does not impose an artificial centering at $g_0$.
Any additional deviation reflects optimization deficiency rather than the approximation limit and cannot be absorbed into the bias term, so the condition is necessary to control bias at the scale required for asymptotic normality. As in classical nonparametric inference (Wasserman, 2006), asymptotic normality requires undersmoothing so that the squared bias is negligible relative to the variance. Under the minimax regime ($\delta = 0$), $\|g_{\mathrm{apx}} - g_0\|_\infty = O(\phi_r^{1/2})$, yielding pointwise bias $O(\phi_r^{(1-\xi_r)/2})$, which is not negligible relative to the variance. Thus, an undersmoothed regime ($\delta > 0$) is required, reflecting the same bias–variance tradeoff as in classical settings.

Condition 4 controls the optimization gap by requiring that the trained estimator attains the statistical scale of the oracle bound. This condition is stated in terms of the empirical objective actually optimized in practice and does not assume exact minimization. Without such control, the optimization error would enter the leading term of the risk and prevent valid bias calibration, making it necessary for inference.

Condition 5. Assume that $B \gtrsim n$, $r = n^{\alpha}$ for $0 < \alpha < 1$, and the single-overlap covariance satisfies $\inf_{x_* \in [0,1]^{p_0}} \zeta^{(1)}_r(x_*) \gtrsim r^{-\nu}$ for some $\nu \ge 1$.

Condition 5 imposes standard U-statistic scaling so that the first-order Hájek projection dominates the asymptotic expansion, and Monte Carlo variability from ensembling is asymptotically negligible. Because the DNN class admits a bounded envelope, the Hájek decomposition implies that the leading covariance is $O(1/r)$, yielding the natural lower bound $\nu \ge 1$.

Theorem 3. Suppose $g_0$ belongs to the class (2.3) with $g_0(\mathbf{0}) = 0$. Let $\hat{g}_B$ be the ensemble estimator defined in (3.6) based on $B$ subsamples of size $r$, and let $\mathcal{B}_{r,\delta}$ be defined in (4.3) with subsample size $r$. Assume Condition 2 holds with hyperparameter $\delta \in \big( \frac{c_{\mathrm{eff}}(1 - c_{\mathrm{eff}})}{2 - c_{\mathrm{eff}}}, \ c_{\mathrm{eff}} \big)$, and Conditions 3–5 hold at subsample size $r$. Let $m_0 = c_{\mathrm{eff}}(1 + \eta)(1 - \xi_r)$. Then, with $\nu \in \big[ 1, \ \frac{1 + m_0}{2} \big)$, for any $\alpha \in (\alpha_{\mathrm{lower}}, \alpha_{\mathrm{upper}})$ and any fixed $x_* \in \mathcal{B}_{r,\delta}$, it holds that
$$\sqrt{\frac{n}{r^2 \zeta^{(1)}_r(x_*)}} \, \big( \hat{g}_B(x_*) - g_0(x_*) \big) \overset{d}{\longrightarrow} N(0, 1),$$
where
$$\alpha_{\mathrm{lower}} = \frac{1}{2 - \nu + m_0}, \qquad \alpha_{\mathrm{upper}} = \frac{1}{\nu + c_{\mathrm{eff}} - \delta}.$$

This range reflects two competing requirements in ESM. The lower bound $\alpha > \alpha_{\mathrm{lower}}$ arises from bias control: the subsample size must be sufficiently large so that the bias is asymptotically negligible. In contrast, the upper bound $\alpha < \alpha_{\mathrm{upper}}$ is imposed by the need for Hájek domination, requiring the variance of the remainder to vanish. A similar tradeoff appears in random forests, where the single-overlap covariance decays at rate $1/\{r(\log r)^{p_0}\}$ (i.e., $\nu \approx 1$) (Wager and Athey, 2018). In our setting, a nonempty admissible interval $(\alpha_{\mathrm{lower}}, \alpha_{\mathrm{upper}})$ is ensured when $\nu < \frac{1 + m_0}{2}$, which guarantees separation between the bias and variance constraints. The parameter $\delta$ governs this tradeoff through its effect on approximation: when $\delta$ is small, the approximation bias decays more slowly, increasing $\alpha_{\mathrm{lower}}$ and requiring larger subsamples; as $\delta$ increases, bias decays faster due to undersmoothing, but this enlarges the projection remainder variance, which decreases $\alpha_{\mathrm{upper}}$. Consequently, more aggressive undersmoothing narrows the admissible range for $\alpha$; a numerical sketch of this admissible range follows.
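The snippet below evaluates the admissible interval of Theorem 3 under the definitions of $\eta$ (Theorem 1) and $m_0$; the input values are hypothetical, chosen only to satisfy the stated constraints.

```python
def alpha_range(c_eff, delta, nu, xi_r, gamma_star, t_eff):
    """Admissible subsampling exponents (alpha_lower, alpha_upper)
    from Theorem 3, using eta from Theorem 1 and m0 = c_eff(1+eta)(1-xi_r)."""
    eta = 2 * delta * gamma_star / (t_eff * c_eff)  # Theorem 1
    m0 = c_eff * (1 + eta) * (1 - xi_r)
    assert 1 <= nu < (1 + m0) / 2, "admissible interval may be empty"
    return 1 / (2 - nu + m0), 1 / (nu + c_eff - delta)

# Illustration with Example 2 (c_eff = 0.6, gamma* = 1.5, t_eff = 2),
# delta = 0.5 (inside the required interval for delta), nu = 1, xi_r = 0.05:
lo, hi = alpha_range(c_eff=0.6, delta=0.5, nu=1.0, xi_r=0.05,
                     gamma_star=1.5, t_eff=2)
print(round(lo, 3), round(hi, 3))  # approximately (0.438, 0.909)
```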
At fixed evaluation points $x^{(1)}_*, \ldots, x^{(p)}_* \in [0,1]^{p_0}$, define the vectors of ensemble estimators and true function values as
$$\hat{\mathbf{g}}_B := \big( \hat{g}_B(x^{(1)}_*), \ldots, \hat{g}_B(x^{(p)}_*) \big)^\top, \qquad \mathbf{g}_0 := \big( g_0(x^{(1)}_*), \ldots, g_0(x^{(p)}_*) \big)^\top. \tag{4.4}$$

Condition 6. Let $\Psi_r$ denote the $p \times p$ single-overlap covariance matrix evaluated at $x^{(1)}_*, \ldots, x^{(p)}_*$, with entries $(\Psi_r)_{ij} = \zeta^{(1)}_r\big( x^{(i)}_*, x^{(j)}_* \big)$ for $1 \le i, j \le p$. With $\zeta^{(1)}_r(x_*) = \zeta^{(1)}_r(x_*, x_*)$, define $D_r = \mathrm{diag}\big\{ \zeta^{(1)}_r(x^{(1)}_*), \ldots, \zeta^{(1)}_r(x^{(p)}_*) \big\}$. Let $R_r = D_r^{-1/2} \Psi_r D_r^{-1/2}$ denote the normalized single-overlap covariance matrix. We assume $R_r$ is uniformly well-conditioned, i.e., $\lambda_{\min}(R_r) \ge c_0 > 0$ for all sufficiently large $r$, where $\lambda_{\min}(R_r)$ denotes the minimum eigenvalue of $R_r$.

Condition 6 is a standard nondegeneracy condition ensuring that the estimates at the $p$ evaluation points are not perfectly correlated, thereby yielding a well-defined limiting multivariate normal distribution (Jacot et al., 2018). We extend Theorem 3 to the multivariate setting by establishing asymptotic normality of $\hat{\mathbf{g}}_B$ in (4.4).

Theorem 4. Suppose that $g_0$ belongs to the class (2.3) with $g_0(\mathbf{0}) = 0$. Let $\hat{\mathbf{g}}_B$ and $\mathbf{g}_0$ be defined as in (4.4) based on $B$ subsamples of size $r$, and let $\mathcal{B}_{r,\delta}$ be as defined in (4.3) with sample size $r$ and hyperparameter $\delta \in \big( \frac{c_{\mathrm{eff}}(1 - c_{\mathrm{eff}})}{2 - c_{\mathrm{eff}}}, \ c_{\mathrm{eff}} \big)$. Assume Conditions 3–6 hold. Then, for any $\nu \in \big[ 1, \ \frac{1 + m_0}{2} \big)$, there exist $\alpha_{\mathrm{lower}}$ and $\alpha_{\mathrm{upper}}$ such that for all $\alpha \in (\alpha_{\mathrm{lower}}, \alpha_{\mathrm{upper}})$ and any fixed points $x^{(1)}_*, \ldots, x^{(p)}_* \in \mathcal{B}_{r,\delta}$,
$$\sqrt{\frac{n}{r^2}} \, \Psi_r^{-1/2} \big( \hat{\mathbf{g}}_B - \mathbf{g}_0 \big) \overset{d}{\longrightarrow} N_p(\mathbf{0}, I_p), \tag{4.5}$$
where $\alpha_{\mathrm{lower}}$ and $\alpha_{\mathrm{upper}}$ are the same as defined in Theorem 3.

For pairwise contrasts, consider the special case $p = 2$ evaluated at $x^{(1)}_*$ and $x^{(2)}_*$. Taking $v = (1, -1)^\top$, the corresponding pairwise variance is
$$\sigma^2_r\big( x^{(1)}_*, x^{(2)}_* \big) := v^\top \Psi_r v = \zeta^{(1)}_r(x^{(1)}_*) + \zeta^{(1)}_r(x^{(2)}_*) - 2\zeta^{(1)}_r\big( x^{(1)}_*, x^{(2)}_* \big).$$
Because Condition 6 for $p = 2$ ensures the $2 \times 2$ covariance matrix is uniformly well-conditioned, the contrast variance is non-degenerate. We define the corresponding scalar normalizing factor as
$$b_{n,r}\big( x^{(1)}_*, x^{(2)}_* \big) := \sqrt{\frac{n}{r^2 \sigma^2_r\big( x^{(1)}_*, x^{(2)}_* \big)}}.$$

Corollary 2. Under the conditions of Theorem 4 with $p = 2$, for any fixed pair of covariate points $x^{(1)}_*, x^{(2)}_*$ and any subsample scaling exponent satisfying $\alpha_{\mathrm{lower}} < \alpha < \alpha_{\mathrm{upper}}$,
$$b_{n,r}\big( x^{(1)}_*, x^{(2)}_* \big) \Big\{ \hat{\psi}_B(x^{(1)}_*, x^{(2)}_*) - \psi(x^{(1)}_*, x^{(2)}_*) \Big\} \overset{d}{\longrightarrow} N(0, 1).$$
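Combining Corollary 2 with the IJ quantities of Section 3.1, a Wald-type interval for $\psi$ can be sketched as follows; ij_cov generalizes (3.8) and (3.10) to a single cross-product form, and the helper names are ours, not the authors'.

```python
import numpy as np
from scipy.stats import norm

def ij_cov(p1, p2, inclusion, n, r):
    """Bias-corrected IJ covariance (3.10); with p1 = p2 it reduces to (3.8)."""
    B = len(p1)
    ci = inclusion - inclusion.mean(axis=0)      # J_i^j - J_i^., shape (B, n)
    Z1 = ci * (p1 - p1.mean())[:, None]
    Z2 = ci * (p2 - p2.mean())[:, None]
    V1, V2 = Z1.mean(axis=0), Z2.mean(axis=0)
    raw = n * (n - 1) / (n - r) ** 2 * np.sum(V1 * V2)
    mc = np.sum((Z1 - V1) * (Z2 - V2)) / (B * (B - 1))
    return raw - mc

def contrast_wald_ci(p1, p2, inclusion, n, r, level=0.95):
    """Wald CI for the log-hazard contrast psi, via (3.9) and Corollary 2.
    p1, p2: (B,) subsample predictions at x1* and x2* (shared subsamples)."""
    psi_hat = p1.mean() - p2.mean()                            # (3.7)
    var = (ij_cov(p1, p1, inclusion, n, r) + ij_cov(p2, p2, inclusion, n, r)
           - 2 * ij_cov(p1, p2, inclusion, n, r))              # (3.9)
    se = np.sqrt(max(var, 0.0))
    c = norm.ppf(1 - (1 - level) / 2)
    return psi_hat - c * se, psi_hat + c * se  # exponentiate ends for relative risk
```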
Since analytical forms for $\zeta^{(1)}_r(x_*)$ and $\zeta^{(1)}_r(x^{(1)}_*, x^{(2)}_*)$ are unavailable, the next theorem provides their consistent estimators.

Theorem 5. Let $\mathcal{B}_{r,\delta}$ be as defined in (4.3) with sample size $r$. For any two fixed evaluation points $x^{(1)}_*, x^{(2)}_* \in \mathcal{B}_{r,\delta}$, the following results hold.
(i) Under Condition 5, let $\hat{\sigma}^2(x^{(k)}_*)$ denote the IJ variance estimator for the individual ensemble estimator $\hat{g}_B(x^{(k)}_*)$, as defined in (3.8). Then, for any subsample scaling exponent satisfying $1/2 < \alpha < \alpha_{\mathrm{upper}}$, and for $k = 1, 2$, it holds that
$$\frac{n \, \hat{\sigma}^2(x^{(k)}_*)}{r^2 \, \zeta^{(1)}_r(x^{(k)}_*)} \overset{p}{\longrightarrow} 1. \tag{4.6}$$
(ii) Suppose, in addition, that Condition 6 holds. Let $\hat{\sigma}^2_{1,2}$ denote the IJ variance estimator for the contrast $\hat{g}_B(x^{(1)}_*) - \hat{g}_B(x^{(2)}_*)$, as defined in (3.9). Then, for any subsample scaling exponent satisfying $1/2 < \alpha < \alpha_{\mathrm{upper}}$,
$$\frac{n \, \hat{\sigma}^2_{1,2}}{r^2 \, \sigma^2_r\big( x^{(1)}_*, x^{(2)}_* \big)} \overset{p}{\longrightarrow} 1. \tag{4.7}$$

Theorem 5 shows that when $B \gtrsim n$ and $1/2 < \alpha < \alpha_{\mathrm{upper}}$, the IJ estimator consistently estimates the covariance of the first-order Hájek projection after correcting for Monte Carlo variability from subsampling.

5 Simulation Studies

Simulation studies assess the finite-sample performance of the proposed ESM estimator for the nonparametric risk function $g_0(x_*)$ and the log-hazard ratio $g_0(x^{(1)}_*) - g_0(x^{(2)}_*)$. For each training dataset of size $n$, covariates $X_i$ are generated i.i.d. from $N(\mathbf{0}, I_{10})$ and truncated componentwise to $[-3, 3]$, for $i = 1, \ldots, n$. The function $g_0(\cdot)$ depends on the first three components of $X_i$, with the remaining seven serving as noise variables. Conditional on $X_i$, event times $T_{i,U}$ are generated from the Cox model
$$\lambda(t \mid X_i) = 0.1 t \exp\{g_0(X_i)\},$$
while censoring times $T_{i,C}$ are generated independently from an exponential distribution with the rate chosen to yield approximately 30% censoring. For various levels of functional complexity, we consider three forms of $g_0$.

Case 1 (Linear). $g_{0,1}(x) = x_1 - 1.2 x_2 + 0.8 x_3$.

Case 2 (Smooth additive nonlinear). $g_{0,2}(x) = 0.7 x_1 - 0.5 x_3 + 0.4 x_2^2 + 0.3 x_1 x_2$.

Case 3 (Compositional nonlinear). $g_{0,3}(x) = 0.7 x_1 - 0.8 x_3 + 0.5 x_2^2 + \sin(0.5 x_1 x_2) + 1.2 \exp\{-0.25 (x_1 - 1)^2 - 0.25 (x_2 + 1)^2\} - 1.2 \exp(-0.5)$.

An independent test set $\{x_{\mathrm{test},i}\}_{i=1}^{80}$ is generated i.i.d. from $N(\mathbf{0}, I_{10})$, truncated componentwise to $[-3, 3]$, and held fixed across replications. For each training dataset, $\hat{g}_B(x_{\mathrm{test},i})$ and its bias-corrected IJ variance estimate are computed. The settings are $n \in \{800, 1000\}$, $r = \lfloor n^{\alpha} \rfloor$, $\alpha \in \{0.85, 0.90, 0.95\}$, with 200 Monte Carlo replications per configuration. The log partial likelihood in (3.2) is optimized in PyTorch (Paszke et al., 2019). Each ensemble uses $B = 1000$ subsamples, and each base learner is a DNN with architecture $(p_0, 128, 64, 1)$, weight decay 0.02, learning rate 0.1, 500 epochs, and dropout 0.1, with $p_0 = 10$. A sketch of the event-time generation follows.
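A sketch of the event-time mechanism via the inverse-transform identity: since $\Lambda(t \mid X) = \int_0^t 0.1 s \, ds \cdot e^{g_0(X)} = 0.05 t^2 e^{g_0(X)}$, setting $T_U = \Lambda^{-1}(E)$ with $E \sim \mathrm{Exp}(1)$ gives $T_U = \{20 E e^{-g_0(X)}\}^{1/2}$. The censoring rate below is a placeholder; the paper tunes the exponential rate to achieve approximately 30% censoring.

```python
import numpy as np

rng = np.random.default_rng(0)

def g0_case1(X):
    # Case 1 (Linear): g_{0,1}(x) = x1 - 1.2*x2 + 0.8*x3
    return X[:, 0] - 1.2 * X[:, 1] + 0.8 * X[:, 2]

def simulate(n, censor_rate=0.25):
    """Draw (T, Delta, X) under lambda(t|X) = 0.1*t*exp{g0(X)}.
    Cumulative hazard: Lambda(t|X) = 0.05*t^2*exp{g0(X)}, so with
    E ~ Exp(1), T_U = Lambda^{-1}(E) = sqrt(20*E*exp{-g0(X)})."""
    X = np.clip(rng.standard_normal((n, 10)), -3, 3)
    E = rng.exponential(size=n)
    t_event = np.sqrt(20.0 * E * np.exp(-g0_case1(X)))
    t_cens = rng.exponential(scale=1.0 / censor_rate, size=n)  # rate is illustrative
    T = np.minimum(t_event, t_cens)
    delta = (t_event <= t_cens).astype(int)
    return T, delta, X
```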
For pointwise inference on $g_0(x_{\mathrm{test},i})$, performance is summarized by bias, mean absolute error (MAE), empirical standard deviation (EmpSD), corrected IJ standard error (SE), coverage probability (CP), and corrected average interval length (AIL), averaged over the test points; see Table 1.

Table 1: Simulation summary for pointwise inference on $g_0(x)$ across test points. EmpSD is the empirical standard deviation of $\hat{g}_B(x)$ over replications. SE is the mean estimated standard error. CP is the empirical coverage probability of nominal 95% intervals, and AIL is the mean interval length.

Case  Setting                   Bias   MAE   EmpSD  SE    CP    AIL
1     n = 800,  r = ⌊n^0.85⌋   -0.02  0.36  0.48   0.50  0.87  1.98
      n = 800,  r = ⌊n^0.90⌋   -0.02  0.21  0.45   0.48  0.92  1.86
      n = 800,  r = ⌊n^0.95⌋   -0.03  0.11  0.40   0.44  0.95  1.72
2     n = 800,  r = ⌊n^0.85⌋    0.11  0.26  0.47   0.49  0.89  1.91
      n = 800,  r = ⌊n^0.90⌋    0.05  0.14  0.42   0.44  0.93  1.74
      n = 800,  r = ⌊n^0.95⌋    0.04  0.11  0.36   0.40  0.95  1.59
3     n = 800,  r = ⌊n^0.85⌋    0.01  0.28  0.48   0.49  0.89  1.93
      n = 800,  r = ⌊n^0.90⌋    0.05  0.17  0.44   0.46  0.92  1.81
      n = 800,  r = ⌊n^0.95⌋    0.02  0.14  0.40   0.43  0.94  1.68
1     n = 1000, r = ⌊n^0.85⌋   -0.03  0.26  0.41   0.42  0.89  1.66
      n = 1000, r = ⌊n^0.90⌋   -0.05  0.15  0.37   0.39  0.93  1.52
      n = 1000, r = ⌊n^0.95⌋   -0.03  0.06  0.34   0.36  0.95  1.40
2     n = 1000, r = ⌊n^0.85⌋    0.09  0.19  0.39   0.40  0.91  1.58
      n = 1000, r = ⌊n^0.90⌋    0.04  0.10  0.35   0.36  0.94  1.41
      n = 1000, r = ⌊n^0.95⌋    0.06  0.12  0.31   0.33  0.93  1.29
3     n = 1000, r = ⌊n^0.85⌋    0.02  0.23  0.41   0.42  0.89  1.64
      n = 1000, r = ⌊n^0.90⌋   -0.00  0.12  0.37   0.38  0.92  1.49
      n = 1000, r = ⌊n^0.95⌋    0.03  0.15  0.33   0.35  0.92  1.40

For contrast inference, we consider $\psi(x_{\mathrm{test},i}, x_{\mathrm{test},j}) = g_0(x_{\mathrm{test},i}) - g_0(x_{\mathrm{test},j})$, where $(i, j)$ index two distinct test points selected from the same test set. We focus on pairs satisfying $g_0(x_{\mathrm{test},i}), g_0(x_{\mathrm{test},j}) \in [-1.5, 2.0]$ and $|g_0(x_{\mathrm{test},i}) - g_0(x_{\mathrm{test},j})| \approx 1.0$, which represent moderate signal levels and contrasts commonly encountered in the design. We report the same metrics for $\hat{\psi}$. Figure 2 shows a representative visualization for Case 3. Table 1 summarizes pointwise inference, and Table 2 summarizes contrast inference.

Figure 2: Estimation and inference summary for Model 3 with $n = 800$, $r = \lfloor n^{0.90} \rfloor$, $B = 1000$, DNN architecture $(p_0, 128, 64, 1)$. Panel (a) plots $\hat{g}_B(x)$ versus $g_0(x)$ across test points. Panel (b) compares EmpSD with the estimated SE, highlighting corrected and uncorrected versions when available.

Table 2: Simulation summary for inference on the log-hazard ratio contrast $\psi(x_{\mathrm{test},i}, x_{\mathrm{test},j})$.

Case  Setting                   Bias   MAE   EmpSD  SE    CP    AIL
1     n = 800,  r = ⌊n^0.85⌋   -0.45  0.65  0.69   0.72  0.94  2.83
      n = 800,  r = ⌊n^0.90⌋   -0.23  0.52  0.62   0.67  0.95  2.61
      n = 800,  r = ⌊n^0.95⌋   -0.22  0.50  0.55   0.63  0.97  2.46
2     n = 800,  r = ⌊n^0.85⌋   -0.17  0.59  0.71   0.69  0.94  2.72
      n = 800,  r = ⌊n^0.90⌋   -0.02  0.45  0.55   0.64  0.97  2.51
      n = 800,  r = ⌊n^0.95⌋    0.06  0.43  0.53   0.56  0.96  2.21
3     n = 800,  r = ⌊n^0.85⌋   -0.22  0.41  0.47   0.49  0.92  1.92
      n = 800,  r = ⌊n^0.90⌋   -0.21  0.36  0.41   0.47  0.93  1.85
      n = 800,  r = ⌊n^0.95⌋   -0.10  0.33  0.40   0.46  0.95  1.80
1     n = 1000, r = ⌊n^0.85⌋   -0.22  0.53  0.58   0.60  0.92  2.35
      n = 1000, r = ⌊n^0.90⌋   -0.28  0.51  0.56   0.55  0.91  2.16
      n = 1000, r = ⌊n^0.95⌋   -0.18  0.41  0.48   0.51  0.96  1.97
2     n = 1000, r = ⌊n^0.85⌋   -0.06  0.41  0.50   0.57  0.97  2.24
      n = 1000, r = ⌊n^0.90⌋    0.05  0.40  0.50   0.50  0.95  1.98
      n = 1000, r = ⌊n^0.95⌋    0.15  0.38  0.44   0.46  0.95  1.80
3     n = 1000, r = ⌊n^0.85⌋   -0.20  0.37  0.41   0.41  0.92  1.61
      n = 1000, r = ⌊n^0.90⌋   -0.04  0.31  0.38   0.40  0.94  1.55
      n = 1000, r = ⌊n^0.95⌋    0.02  0.28  0.35   0.38  0.96  1.48

Overall, the pointwise bias is small relative to stochastic variation. For moderately large $\alpha$ (e.g., $\alpha \in [0.90, 0.95]$), the corrected IJ standard error closely matches the empirical standard deviation, yielding near-nominal coverage. In this regime, $\alpha$ is large enough to suppress approximation bias ($\alpha > \alpha_{\mathrm{lower}}$) while remaining below the threshold required for Hájek domination and asymptotic normality ($\alpha < \alpha_{\mathrm{upper}}$).

Figure 3 illustrates the full empirical spectrum. For smaller values ($\alpha \le 0.8$), coverage deteriorates due to elevated approximation bias, consistent with the theoretical lower bound. In contrast, as $\alpha$ approaches 1, coverage collapses despite relatively stable MAE. This breakdown reflects the failure of the asymptotic Gaussian approximation as subsamples become highly overlapping. Correspondingly, the corrected standard error deviates from the empirical standard deviation in this regime, providing empirical evidence for the necessity of the upper bound $\alpha_{\mathrm{upper}}$.

Figure 3: Empirical performance across the subsampling index $\alpha$: (a) coverage probability; (b) mean absolute error; (c) SE vs. EmpSD (Model 1); (d) SE vs. EmpSD (Model 3). At smaller values ($\alpha \le 0.8$), high MAE degrades the CP, validating the theoretical lower bound $\alpha_{\mathrm{lower}}$. Conversely, as $\alpha \to 1$, the CP drops significantly despite stable MAE, and variance estimates diverge, empirically validating the upper bound $\alpha_{\mathrm{upper}}$.

Finally, increasing the sample size from $n = 800$ to $n = 1000$ improves estimation efficiency, as reflected by reductions in the empirical standard deviation (EmpSD) and the average interval length. In addition, contrast inference achieves coverage closer to the nominal level than pointwise inference. This improvement is attributable to partial bias cancellation in the contrast, and remains evident even under the strongly nonlinear Model 3.

5.1 Comparison with Competing Methods

ESM (with $r = \lfloor n^{0.95} \rfloor$) is compared with the Cox proportional hazards model (CoxPH), Random Survival Forest (RSF), Gradient Boosting Regression (GBR), and DeepSurv. Table 3 summarizes
T able 3 summarizes 16 0.7 0.8 0.85 0.9 0.95 0.97 0.99 0.999 S u b s a m p l i n g i n d e x 0.60 0.70 0.80 0.90 1.00 Coverage P r obability Coverage vs Subsampling Inde x Model 1 (Linear) Model 3 (Nonlinear) Nominal 95% (a) Co verage Probability 0.7 0.8 0.85 0.9 0.95 0.97 0.99 0.999 S u b s a m p l i n g i n d e x 0.000 0.200 0.400 0.600 0.800 Mean Absolute Er r or MAE vs Subsampling Inde x Model 1 (Linear) Model 3 (Nonlinear) (b) Mean Absolute Error 0.7 0.8 0.85 0.9 0.95 0.97 0.99 0.999 S u b s a m p l i n g i n d e x 0.300 0.350 0.400 0.450 0.500 Standar d Deviation SE vs EmpSD Model 1 SE Empirical SD (c) SE vs EmpSD (Mo del 1) 0.7 0.8 0.85 0.9 0.95 0.97 0.99 0.999 S u b s a m p l i n g i n d e x 0.300 0.350 0.400 0.450 0.500 Standar d Deviation SE vs EmpSD Model 3 SE Empirical SD (d) SE vs EmpSD (Mo del 3) Figure 3: Empirical p erformance across the subsampling index α . At smaller v alues ( α ≤ 0 . 8), high MAE degrades the CP , v alidating the theoretical lo wer b ound α low er . Conv ersely , as α → 1, the CP drops significantly despite stable MAE, and v ariance estimates diverge, empirically v alidating the upp er b ound α upper . 17 T able 2: Sim ulation summary for inference on the log-hazard ratio con trast ψ ( x test ,i , x test ,j ). Case Setting Bias MAE EmpSD SE CP AIL 1 n = 800 , r = ⌊ n 0 . 85 ⌋ -0.45 0.65 0.69 0.72 0.94 2.83 n = 800 , r = ⌊ n 0 . 90 ⌋ -0.23 0.52 0.62 0.67 0.95 2.61 n = 800 , r = ⌊ n 0 . 95 ⌋ -0.22 0.50 0.55 0.63 0.97 2.46 2 n = 800 , r = ⌊ n 0 . 85 ⌋ -0.17 0.59 0.71 0.69 0.94 2.72 n = 800 , r = ⌊ n 0 . 90 ⌋ -0.02 0.45 0.55 0.64 0.97 2.51 n = 800 , r = ⌊ n 0 . 95 ⌋ 0.06 0.43 0.53 0.56 0.96 2.21 3 n = 800 , r = ⌊ n 0 . 85 ⌋ -0.22 0.41 0.47 0.49 0.92 1.92 n = 800 , r = ⌊ n 0 . 90 ⌋ -0.21 0.36 0.41 0.47 0.93 1.85 n = 800 , r = ⌊ n 0 . 95 ⌋ -0.10 0.33 0.40 0.46 0.95 1.80 1 n = 1000 , r = ⌊ n 0 . 85 ⌋ -0.22 0.53 0.58 0.60 0.92 2.35 n = 1000 , r = ⌊ n 0 . 90 ⌋ -0.28 0.51 0.56 0.55 0.91 2.16 n = 1000 , r = ⌊ n 0 . 95 ⌋ -0.18 0.41 0.48 0.51 0.96 1.97 2 n = 1000 , r = ⌊ n 0 . 85 ⌋ -0.06 0.41 0.50 0.57 0.97 2.24 n = 1000 , r = ⌊ n 0 . 90 ⌋ 0.05 0.40 0.50 0.50 0.95 1.98 n = 1000 , r = ⌊ n 0 . 95 ⌋ 0.15 0.38 0.44 0.46 0.95 1.80 3 n = 1000 , r = ⌊ n 0 . 85 ⌋ -0.20 0.37 0.41 0.41 0.92 1.61 n = 1000 , r = ⌊ n 0 . 90 ⌋ -0.04 0.31 0.38 0.40 0.94 1.55 n = 1000 , r = ⌊ n 0 . 95 ⌋ 0.02 0.28 0.35 0.38 0.96 1.48 p oin t wise MAE and the asso ciated uncertaint y quan tification. Under the linear setting ( Case 1 ), Co xPH p erforms well, consisten t with correct mo del sp ecification. ESM remains comp etitiv e, ac hieving cov erage close to the nominal 95% lev el with relatively short interv als. Under the non- linear setting ( Case 3 ), where the linear structure is missp ecified, the adv antage of ESM b ecomes pronounced. CoxPH main tains short in terv als but exhibits reduced cov erage, indicating unreliable uncertain ty quantification. RSF and GBR attain low er MAE but tend to undercov er, suggesting unstable v ariance estimation. DeepSurv achiev es high cov erage at the exp ense of wider interv als. In contrast, ESM attains lo wer MAE than comp eting metho ds while maintaining cov erage near the nominal level with shorter interv als than DeepSurv, yielding a fav orable accuracy–uncertain ty tradeoff, particularly in the nonlinear setting. 6 Real Data Analysis W e analyze data from the Boston Lung Cancer Surviv al Cohort, restricted to patients with non- small cell lung cancer ( W ang et al. , 2023 ). 
The outcome is overall survival, defined as the time from diagnosis to death, with right-censoring at the last follow-up; the event indicator is recorded accordingly. Covariates include sex, age at diagnosis, smoking history (pack-years), tumor size (cm), chemotherapy status, cancer stage, and BMI. The cohort is randomly split into a 70% training set ($n = 1{,}454$) and a 30% inference set ($n = 623$). The proposed ESM is compared with the Cox proportional hazards model (CoxPH), Random Forest (RF), Gradient Boosting Regression (GBR), and a standalone DeepSurv network. The ensemble uses $B = 1000$ subsamples with subsample size $r = \lfloor n^{0.90} \rfloor$, a dropout rate of 10%, and a DNN architecture $(p_0, 128, 64, 1)$.

As reported in Table 4, the ensemble achieves the highest mean C-index and AUC, while maintaining the smallest standard errors among all methods. It also yields the shortest average interval lengths (AIL) for both metrics, indicating more efficient uncertainty quantification.

Table 4: Comparative performance of survival models on the BLCS inference set.

Method    Mean C-index  C-index SE  C-index AIL  Mean AUC  AUC SE  AUC AIL
ESM       0.7316        0.0018      0.0074       0.7987    0.0039  0.0171
DeepSurv  0.7284        0.0019      0.0076       0.7928    0.0043  0.0186
CoxPH     0.7223        0.0020      0.0080       0.7944    0.0050  0.0214
RF        0.7221        0.0020      0.0080       0.7928    0.0058  0.0247
GBR       0.7209        0.0019      0.0076       0.7865    0.0043  0.0184
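For reference, a minimal $O(n^2)$ sketch of Harrell's concordance index underlying Table 4; in practice one would use an established implementation (e.g., lifelines.utils.concordance_index), and this simplified version is ours.

```python
def c_index(risk, time, event):
    """Harrell's concordance index: among comparable pairs (the earlier
    time is an observed event), the fraction where the higher predicted
    risk has the earlier event; ties in risk count as 1/2."""
    n, num, den = len(time), 0.0, 0.0
    for i in range(n):
        if event[i] == 0:
            continue  # a pair is comparable only if the earlier time is an event
        for j in range(n):
            if time[i] < time[j]:
                den += 1
                if risk[i] > risk[j]:
                    num += 1
                elif risk[i] == risk[j]:
                    num += 0.5
    return num / den
```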
Stage-specific hazard ratios (HRs) as functions of patient characteristics are shown in Figure 5. Each curve is defined relative to a reference patient (male, received chemotherapy, age 65, BMI 26, 35 pack-years), with one covariate varied at a time and others fixed at their reference values. Evaluation is performed over 50 evenly spaced values within clinically relevant ranges: age at diagnosis (45–80 years), tumor size (2–6 cm), smoking history (0–80 pack-years), and BMI (15–45 kg/m²). The HR at each value compares the risk of death to that of the reference profile; values above 1 indicate increased risk, values below 1 indicate reduced risk, and 1 indicates no difference. Uncertainty is quantified by pointwise 95% confidence bands based on the infinitesimal jackknife (IJ) covariance estimator. Confidence bands widen near the boundaries of the covariate ranges with reduced data support in these regions. The stage-stratified contrast analysis characterizes variation in covariate effects across stages while providing valid uncertainty quantification via the IJ-based framework.

Figure 4: Comparative evaluation of model performance: (a) C-index comparison; (b) time-varying AUC comparison (mean AUC: ESM 0.799, CoxPH 0.794, DeepSurv 0.793, Random Forest 0.793, GBR 0.786). The ensemble shows strong ranking stability and stable estimation accuracy over time.

Figure 5: Stage-specific contrast HR studies: (a) age at diagnosis (45 to 80 years); (b) pack-years smoked (0 to 80); (c) tumor size (2 to 6 cm); (d) BMI (15 to 45 kg/m²). In each panel, curves correspond to Stages 2, 3, and 4, with shaded regions indicating IJ-based 95% confidence bands.

7 Conclusion

We develop a unified theoretical and inferential framework for fully nonparametric Cox models with DNNs, linking optimization-based risk bounds to ensemble-based asymptotic inference. The resulting uncertainty quantification targets contrast functionals that are often more clinically relevant than absolute risk, such as hazard ratios comparing two patients who differ in a single feature (e.g., age 70 versus 50). Our methods yield stable variance estimation and reliable finite-sample coverage, addressing limitations of standard bootstrap methods in high-dimensional settings.

Future work follows naturally from the flexibility of the ensemble framework. First, we will extend inference from the risk function $g_0(\cdot)$ to the survival function $S(t \mid x_*)$. Since survival probabilities are nonlinear functionals of the log-hazard, the Gaussian limits for $\hat{g}(\cdot)$ enable pointwise intervals and simultaneous bands via an infinite-dimensional delta method, accounting for uncertainty in both the DNN and the Breslow-type baseline hazard estimator. Second, the framework extends naturally to treatment effect estimation in time-to-event settings. Building on ensemble-based estimation ideas (Meng and Li, 2026), we define the survival Conditional Average Treatment Effect (CATE) as
$$\tau(t \mid x_*) = S_1(t \mid x_*) - S_0(t \mid x_*),$$
where $S_j(t \mid x_*)$ denotes the conditional survival probability under treatment $j \in \{0, 1\}$. Under standard unconfoundedness conditions, this can be estimated by applying ESM separately to the treated and control cohorts. Because the groups are independent, the variance of $\hat{\tau}(t \mid x_*)$ is the sum of the group-specific IJ variances, yielding valid Wald-type confidence intervals. A full theoretical treatment of this causal extension remains for future work.

Finally, while our theory assumes controlled optimization error, further work is needed to understand how implicit regularization from practical training strategies, such as early stopping and dropout, interacts with the Cox loss and affects inference.
References

Allen-Zhu, Z., Li, Y. and Liang, Y. (2019). Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in Neural Information Processing Systems, vol. 32.

Andersen, P. K. and Gill, R. D. (1982). Cox's regression model for counting processes. The Annals of Statistics 10 1100–1120.

Arora, S., Du, S. S., Hu, W., Li, Z. and Wang, R. (2019). Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584.

Bickel, P. J., Klaassen, C. A. J., Ritov, Y. and Wellner, J. A. (1993). Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press.

Cattaneo, M. D., Farrell, M. H. and Feng, Y. (2020). Large sample properties of partitioning-based series estimators. The Annals of Statistics 48 1718–1741.

Chen, K., Guo, S., Sun, L. and Wang, J.-L. (2010). Global partial likelihood for nonparametric proportional hazards models. Journal of the American Statistical Association 105 750–760.

Chen, S. and Zhou, L. (2007). Local partial likelihood estimation in proportional hazards regression. The Annals of Statistics 35 888–916.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal 21 C1–C68.

Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 34 187–220.

Farrell, M. H., Liang, T. and Misra, S. (2021). Deep neural networks for estimation and inference. Econometrica 89 181–213.

Fleming, T. R. and Harrington, D. P. (2005). Counting Processes and Survival Analysis. John Wiley & Sons.

Hall, P. (1992). The Bootstrap and Edgeworth Expansion. Springer Series in Statistics, Springer-Verlag.

Horowitz, J. L. and Mammen, E. (2007). Rate-optimal estimation for a general class of nonparametric regression models with unknown link functions. The Annals of Statistics 35 2589–2619.

Huang, J. (1999). Efficient estimation of the partly linear additive Cox model. The Annals of Statistics 27 1536–1563.

Jacot, A., Gabriel, F. and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, vol. 31.

Katzman, J. L., Shaham, U., Cloninger, A., Bates, J., Jiang, T. and Kluger, Y. (2018). DeepSurv: Personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Medical Research Methodology 18 24.

Kohler, M. and Langer, S. (2021). On the rate of convergence of fully connected deep neural network regression estimates. The Annals of Statistics 49 2231–2250.

Lakshminarayanan, B., Pritzel, A. and Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems 30 6401–6413.

Meng, X. and Li, Y. (2026). Inference for deep neural network estimators in generalized nonparametric models.
Journal of the American Statistical Association, in press.

O'Sullivan, F. (1993). Nonparametric estimation in the Cox model. The Annals of Statistics 21 124–145.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J. and Chintala, S. (2019). PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, vol. 32.

Schmidt-Hieber, J. (2020). Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics 48 1875–1897.

Sleeper, L. A. and Harrington, D. P. (1990). Regression splines in the Cox model with application to covariate effects in liver disease. Journal of the American Statistical Association 85 941–949.

Stone, C. J. (1985). Additive regression and other nonparametric models. The Annals of Statistics 13 689–705.

Sun, Y., Kang, J., Haridas, C., Mayne, N., Potter, A., Yang, C.-F., Christiani, D. C. and Li, Y. (2024). Penalized deep partially linear Cox models with application to CT scans of lung cancer patients. Biometrics 80 ujae022.

van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press.

Wager, S. and Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association 113 1228–1242.

Wang, X., Romero-Gutierrez, C. W., Kothari, J., Shafer, A., Li, Y. and Christiani, D. C. (2023). Prediagnosis smoking cessation and overall survival among patients with non-small cell lung cancer. JAMA Network Open 6 e2311966.

Wasserman, L. (2006). All of Nonparametric Statistics. Springer Texts in Statistics, Springer Science & Business Media.

Wiegrebe, S., Kopper, P., Sonabend, R., Bischl, B. and Bender, A. (2024). Deep learning for survival analysis: A review. Artificial Intelligence Review 57 56.

Wu, Q., Tong, X. and Zhao, X. (2024). Deep partially linear Cox model for current status data. Biometrics 80 ujae032.

Yarotsky, D. (2017). Error bounds for approximations with deep ReLU networks. Neural Networks 94 103–114.

Zhong, Q., Mueller, J. and Wang, J.-L. (2022). Deep learning for the partially linear Cox model. The Annals of Statistics 50 1348–1376.

Zhou, J., Zhang, Y. and Yu, Z. (2023). Partial linear Cox model with deep ReLU networks for interval-censored failure time data. arXiv preprint arXiv:2307.00195.