A Theory of Nonparametric Covariance Function Estimation for Discretely Observed Data
Authors: Yoshikazu Terada (1,2) and Atsutomo Yara (1)
(1) Graduate School of Engineering Science, The University of Osaka (terada.yoshikazu.es@osaka-u.ac.jp)
(2) Center for Advanced Integrated Intelligence Research, RIKEN

Abstract. We study nonparametric covariance function estimation for functional data observed with noise at discrete locations on a $d$-dimensional domain. Estimating the covariance function from discretely observed data is a challenging nonparametric problem, particularly in multidimensional settings, since the covariance function is defined on a product domain and thus suffers from the curse of dimensionality. This motivates the use of adaptive estimators, such as deep learning estimators. However, existing theoretical results are largely limited to estimators with explicit analytic representations, and the properties of general learning-based estimators remain poorly understood. We establish an oracle inequality for a broad class of learning-based estimators that applies to both sparse and dense observation regimes in a unified manner, and derive convergence rates for deep learning estimators over several classes of covariance functions. The resulting rates suggest that structural adaptation can mitigate the curse of dimensionality, similarly to classical nonparametric regression. We further compare the convergence rates of learning-based estimators with several existing procedures. For a one-dimensional smoothness class, deep learning estimators are suboptimal, whereas local linear smoothing estimators achieve a faster rate. For a structured function class, however, deep learning estimators attain the minimax rate up to polylogarithmic factors, whereas local linear smoothing estimators are suboptimal. These results reveal a distinctive adaptivity-variance trade-off in covariance function estimation.

MSC2020 subject classifications: Primary 62R10, 62G05; secondary 62G08.
Keywords and phrases: Functional data, ReLU neural networks, Phase transition.

1. Introduction. Recent advances in measurement technology have led to the increasing availability of data in which each subject is observed over a domain, either continuously or at a finite set of locations, as commonly encountered in longitudinal and spatio-temporal studies. In functional data analysis (FDA), such data are viewed as realizations of random elements in a function space, or equivalently as sample paths of stochastic processes (Ramsay and Silverman, 2005; Horváth and Kokoszka, 2012; Hsing and Eubank, 2015; Wang, Chiou and Müller, 2016). Estimation of the mean and covariance functions is a fundamental problem in FDA. It plays a central role in many subsequent methods, most notably functional principal component analysis (Hall and Hosseini-Nasab, 2006; Hall, Müller and Wang, 2006; Zhou, Wei and Yao, 2025) and functional linear regression (Hall and Horowitz, 2007; Yuan and Cai, 2010; Zhou, Yao and Zhang, 2023), and is also essential for reconstructing latent trajectories from discretely observed data (Yao, Müller and Wang, 2005).

When functional data are observed continuously over the domain, covariance estimation is essentially no more difficult than in the multidimensional setting, as the empirical covariance function is directly available. By contrast, when measurements are available only at discrete locations, covariance function estimation becomes a challenging nonparametric estimation problem.
In this paper, we focus on this standard and practically important discretely observed setting.

Consider a centered, square-integrable stochastic process $X(\cdot)$ on a $d$-dimensional domain $\mathcal{T} \subset \mathbb{R}^d$ with covariance function $K^{\circ}(s,t) = \mathbb{E}\{X(s)X(t)\}$ $(s,t \in \mathcal{T})$. We observe i.i.d. replicates $X_1,\dots,X_n$, but not the trajectories/surfaces themselves. Instead, for subject $i$ we observe noisy measurements
$$Y_{ij} = X_i(T_{ij}) + \varepsilon_{ij}, \qquad j = 1,\dots,m_i,$$
where the observation points $T_{ij} \in \mathcal{T}$ are random and the $\varepsilon_{ij}$ are mean-zero noises. For simplicity, the number of measurements is the same across subjects (i.e., $m = m_1 = \cdots = m_n$). Following Zhang and Wang (2016), we refer to the dense regime as the case where $m \to \infty$ as $n \to \infty$, and to the sparse regime as the case where $m$ is uniformly bounded.

Existing estimators are typically based on smoothing splines (Cai and Yuan, 2010) or local linear smoothing (Yao, Müller and Wang, 2005; Hall, Müller and Wang, 2006; Li and Hsing, 2010; Zhang and Wang, 2016), and their statistical properties have been investigated in considerable detail. Local linear smoothing, for example, has the attractive feature that uniform convergence can often be ensured (Li and Hsing, 2010; Zhang and Wang, 2016). Moreover, Cai and Yuan (2010) derive convergence rates for smoothing spline estimators and prove their minimax optimality. Most of the existing literature focuses on functions defined on one-dimensional domains. In many modern applications, however, functions are defined on multidimensional domains, and the covariance function is therefore defined on the corresponding product domain, whose dimension is doubled, thereby exacerbating the curse of dimensionality.

Recent results in nonparametric regression suggest that adaptive deep learning estimators can, for suitably structured function classes, mitigate the curse of dimensionality (Schmidt-Hieber, 2019, 2020; Nakada and Imaizumi, 2020; Suzuki and Nitanda, 2021). This fact motivates the use of learning-based estimators for covariance function estimation. For instance, Yan, Yao and Zhou (2025) established theoretical guarantees for deep learning estimators of the mean function, and Sarkar and Panaretos (2022) proposed a deep learning approach to covariance function estimation.

In this paper, we study a general nonparametric least-squares (LS) approach to estimating $K^{\circ}$. More precisely, we consider a learning-based estimator defined as a minimizer of the least-squares loss
$$\frac{1}{nm(m-1)} \sum_{i=1}^{n} \sum_{1 \le j \neq k \le m} \{Y_{ij}Y_{ik} - K(T_{ij}, T_{ik})\}^2. \tag{1}$$
This loss function is widely used (e.g., Yao, Müller and Wang, 2005; Hall, Müller and Wang, 2006; Li and Hsing, 2010; Cai and Yuan, 2010; Zhang and Wang, 2016).
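To make the observation scheme and the loss (1) concrete, here is a minimal Python sketch that simulates data from the repeated measurement model and evaluates the pairwise least-squares criterion at a candidate covariance. The exponential kernel, the Gaussian process law, and the noise level $0.1$ are illustrative assumptions made for the demo, not choices made in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 5                                     # subjects and measurements per subject

# Illustrative true covariance on [0,1] (demo assumption): K°(s,t) = exp(-|s-t|).
def K_true(s, t):
    return np.exp(-np.abs(s - t))

# Repeated measurement model: Y_ij = X_i(T_ij) + eps_ij, with X_i a centered
# Gaussian process evaluated at subject i's own random design points.
T = rng.uniform(size=(n, m))                      # T_ij ~ Uniform[0,1], i.i.d.
Y = np.empty((n, m))
for i in range(n):
    C = K_true(T[i][:, None], T[i][None, :])      # m x m covariance at subject i's points
    Y[i] = rng.multivariate_normal(np.zeros(m), C) + 0.1 * rng.standard_normal(m)

def pairwise_ls_loss(K, T, Y):
    """Loss (1): average of {Y_ij Y_ik - K(T_ij, T_ik)}^2 over ordered pairs j != k."""
    n, m = T.shape
    total = 0.0
    for i in range(n):
        R = np.outer(Y[i], Y[i]) - K(T[i][:, None], T[i][None, :])
        np.fill_diagonal(R, 0.0)                  # exclude j = k: Y_ij^2 is biased by the noise
        total += (R ** 2).sum()
    return total / (n * m * (m - 1))

print(pairwise_ls_loss(K_true, T, Y))             # the loss evaluated at K = K°
```

Minimizing this criterion over a parametrized class $\mathcal{K}_{nm}$ (for instance, the network class of Section 4) gives the learning-based estimator studied below.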
For mean function estimation, Yan, Yao and Zhou (2025) adapted classical localization analysis (Bartlett, Bousquet and Mendelson, 2005; Koltchinskii, 2006; Blanchard, Bousquet and Massart, 2008) and thereby provided a general framework for analyzing learning-based estimators. For covariance function estimation, by contrast, an analogous framework is not yet available. Sarkar and Panaretos (2022) provided a theoretical analysis of deep learning estimators for discretely observed data on regular grids. However, their analysis requires the number of grid points in each coordinate direction to diverge for consistency, which appears overly restrictive. In addition, Paul and Peng (2009) derived theoretical properties of a restricted maximum likelihood estimator under very restrictive conditions, but their results do not attain optimal rates when the number of measurements $m$ diverges. Moreover, their approach cannot accommodate both sparse and dense regimes in a unified manner.

At first sight, one might expect the classical localization analysis for nonparametric regression to extend directly to covariance function estimation, as in the case of mean function estimation. A naive application of the existing theory, however, yields loose bounds when the number of measurements $m$ diverges. The main difficulty arises from the dependence structure induced by the least-squares loss (1), which is a sum of within-subject $U$-statistics. Unless this within-subject $U$-statistic structure is handled explicitly, sharp bounds cannot be obtained. We discuss this issue in detail in Section 2.2.

In this paper, we develop a framework for sharp localization analysis of covariance function estimation. Inspired by prior work on empirical risk minimization of $U$-statistics (Clémençon, Lugosi and Vayatis, 2008), we extend Talagrand's inequality to the least-squares loss (1) for covariance function estimation. Combining this extension with localization analysis, we derive a sharp oracle inequality that treats sparse and dense regimes in a unified manner. As an application, we use this oracle inequality to establish, to the best of our knowledge, the first convergence-rate results for deep learning estimators over several classes of covariance functions. We observe a phase transition phenomenon similar to those reported in Cai and Yuan (2010, 2011) and Yan, Yao and Zhou (2025).

Our results suggest that, under highly anisotropic smoothness, deep learning estimators can mitigate the curse of dimensionality in covariance function estimation, as also demonstrated in other estimation problems (Suzuki and Nitanda, 2021; Yan, Yao and Zhou, 2025). In the sparse regime, the convergence rate of deep learning estimators matches, up to polylogarithmic factors, the minimax rate for nonparametric regression over $2d$-dimensional domains, as established in Suzuki and Nitanda (2021).

Interestingly, under the dense regime in the one-dimensional setting, for the class of smooth functions considered by Zhang and Wang (2016), deep learning estimators are suboptimal, whereas local linear smoothing attains a faster rate. In contrast, for the structured function classes studied by Cai and Yuan (2010), deep learning estimators attain the same minimax rate as smoothing spline estimators up to polylogarithmic factors, while local linear smoothing becomes suboptimal. These results thus reflect an inherent limitation of learning-based estimators for covariance function estimation. Such behavior appears to be uncommon in classical nonparametric estimation and may be specific to covariance function estimation.

The remainder of the paper is organized as follows. Section 2 introduces the model setting and the estimators. Section 3 establishes a unified oracle inequality for general learning-based estimators.
In Section 4, we consider the deep learning estimator, derive its convergence rates for several classes of covariance functions, and compare the results with existing work. Section 5 concludes the paper. All proofs are deferred to the Appendix.

2. Preliminaries. In this section, we describe the repeated measurement model and the least-squares estimator and fix notation. We then discuss the challenges in the theoretical analysis of covariance function estimation.

2.1. Model and least squares estimation. In this paper, we employ the repeated measurement model, which is commonly assumed in the existing literature (e.g., Zhang and Wang (2016) and Yan, Yao and Zhou (2025)). Let $\mathcal{T} := [0,1]^d$. Let $X(\cdot)$ be a square-integrable and centered stochastic process on $\mathcal{T}$ (i.e., $\mathbb{E}[X(t)] = 0$ for all $t \in \mathcal{T}$), and assume that $X$ is jointly measurable. The covariance function of $X$ will be denoted by $K^{\circ}$, that is,
$$K^{\circ}(t,t') := \mathbb{E}[X(t)X(t')], \qquad t, t' \in \mathcal{T}.$$
Let $X_1,\dots,X_n$ be independent and identically distributed copies of $X$. The processes $X_i$ $(i=1,\dots,n)$ are not observed directly. Instead, for the $i$th subject, noisy measurements are observed at $m_i$ time points $T_{ij}$ $(j=1,\dots,m_i)$. We assume that the $T_{ij}$ are i.i.d. from a distribution $P_T$ on $\mathcal{T}$ and are independent of the $X_i$ $(i=1,\dots,n)$. That is, we assume the following repeated measurement model:
$$Y_{ij} = X_i(T_{ij}) + \epsilon_{ij} \qquad (i=1,\dots,n;\ j=1,\dots,m_i), \tag{2}$$
where the $\epsilon_{ij}$ are independent noises with $\mathbb{E}[\epsilon_{ij}] = 0$. Thus, we observe $\{(T_{ij}, Y_{ij}) : i=1,\dots,n,\ j=1,\dots,m_i\}$. For simplicity, we assume that the number of measurements is the same across subjects, that is, $m = m_1 = \cdots = m_n$, and that $P_T$ is the uniform distribution on $[0,1]^d$.

To estimate the true covariance function $K^{\circ}$, we consider the following general nonparametric least-squares (LS) estimator:
$$\widehat{K}_n := \operatorname*{arg\,min}_{K \in \mathcal{K}_{nm}} \frac{1}{nm(m-1)} \sum_{i=1}^{n} \sum_{1 \le j \neq k \le m} \{Y_{ij}Y_{ik} - K(T_{ij}, T_{ik})\}^2, \tag{3}$$
where $\mathcal{K}_{nm}$ is a given class of symmetric functions. To evaluate the accuracy of the LS estimator $\widehat{K}_n$, the following mean squared error is considered:
$$\mathbb{E}\big[\|\widehat{K}_n - K^{\circ}\|_2^2\big], \qquad \text{where } \|K\|_2^2 := \int_{\mathcal{T}\times\mathcal{T}} K^2(s,t)\, dP_T^{\otimes 2}(s,t).$$
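Since $P_T$ is the uniform distribution, the evaluation criterion $\|K - K^{\circ}\|_2^2$ can be approximated by plain Monte Carlo integration over $\mathcal{T} \times \mathcal{T}$. A minimal sketch, with two illustrative kernels standing in for $\widehat{K}_n$ and $K^{\circ}$ (both kernels are assumptions for the demo):

```python
import numpy as np

rng = np.random.default_rng(1)

def l2_sq_error(K_hat, K_true, d=1, n_mc=200_000):
    """Monte Carlo approximation of ||K_hat - K_true||_2^2, i.e. the integral of
    (K_hat - K_true)^2 over T x T with respect to P_T (x) P_T, P_T = Uniform[0,1]^d."""
    s = rng.uniform(size=(n_mc, d))
    t = rng.uniform(size=(n_mc, d))
    return float(np.mean((K_hat(s, t) - K_true(s, t)) ** 2))

# Illustrative stand-ins for an estimate and the truth (demo assumptions).
K_true = lambda s, t: np.exp(-np.abs(s - t)).prod(axis=-1)               # exponential kernel
K_hat  = lambda s, t: (1.0 - np.abs(s - t)).clip(min=0.0).prod(axis=-1)  # triangular kernel

print(l2_sq_error(K_hat, K_true))
```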
2.2. Difficulty of covariance function estimation. We explain why classical localization analysis does not yield tight bounds for covariance function estimation, in contrast to mean function estimation.

In the theory of mean function estimation (Yan, Yao and Zhou, 2025), bounding the following quantity for $r > 0$ is essential:
$$\sup_{f \in \mathcal{F}} \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{\|f - f^{\circ}\|_2^2 - \{f(T_{ij}) - f^{\circ}(T_{ij})\}^2}{\|f - f^{\circ}\|_2^2 + r},$$
where $f^{\circ}$ is the true mean function of $X$, and $\mathcal{F}$ is a given class of mean functions. Since the $T_{ij}$ are independent, each term in the sum is independent. Therefore, the classical Talagrand inequality (e.g., Theorem 3.3.16 of Giné and Nickl (2016)) can be applied to bound this quantity, and standard localization analysis yields a sharp oracle inequality.

In contrast, for covariance function estimation, bounding the following quantity for $r > 0$ is essential:
$$\sup_{K \in \mathcal{K}_{nm}} \frac{1}{nm(m-1)} \sum_{i=1}^{n} \sum_{1 \le j \neq k \le m} \frac{\|K - K^{\circ}\|_2^2 - \{K(T_{ij}, T_{ik}) - K^{\circ}(T_{ij}, T_{ik})\}^2}{\|K - K^{\circ}\|_2^2 + r}.$$
In the sparse regime, where the number of measurements $m$ is bounded, the following within-subject $U$-statistic can be viewed as a single random variable $Z_i$:
$$Z_i := \frac{1}{m(m-1)} \sum_{1 \le j \neq k \le m} \frac{\|K - K^{\circ}\|_2^2 - \{K(T_{ij}, T_{ik}) - K^{\circ}(T_{ij}, T_{ik})\}^2}{\|K - K^{\circ}\|_2^2 + r}.$$
Thus, as in Yan, Yao and Zhou (2025), standard localization analysis combined with Talagrand's inequality yields an oracle inequality. However, in the dense regime, where the number of measurements $m$ diverges, this naive approach yields a looser bound because it does not exploit the fact that the observations approach the continuous regime. We address this gap by developing a unified nonasymptotic analysis of the learning-based estimator (3) that applies to both sparse and dense regimes.

3. A theory of nonparametric covariance function estimation. In this section, we establish an oracle inequality to derive the convergence rates of the mean squared error for a general LS estimator. We assume the following two general assumptions.

ASSUMPTION 1 (Boundedness). There exists a positive number $B_K > 0$ such that $\|K^{\circ}\|_{\infty} < B_K$ and $\|K\|_{\infty} < B_K$ for all $K \in \mathcal{K}_{nm}$.

ASSUMPTION 2 (Sub-Gaussian). There exist positive numbers $B_1 > 0$ and $B_2 > 0$ such that, for all $\lambda \in \mathbb{R}$,
$$\mathbb{E}\big[e^{\lambda X_i(T_{ij})}\big] \le e^{\lambda^2 B_1^2/2} \qquad \text{and} \qquad \mathbb{E}\big[e^{\lambda \epsilon_{ij}}\big] \le e^{\lambda^2 B_2^2/2}.$$

To overcome the difficulty described in Section 2.2, we establish the following modified Talagrand inequality for covariance function estimation. The lemma builds on a key representation of $U$-statistics due to Clémençon, Lugosi and Vayatis (2008).

LEMMA 3.1 (Modified Talagrand inequality). Let $\{X_{ij}\}_{n \times m}$ be i.i.d. $S$-valued random variables. Let $\mathcal{F}$ be a countable class of measurable functions $f = (f_1,\dots,f_n) : S \times S \to \mathbb{R}^n$ such that $\|f_i\|_{\infty} \le U < \infty$. Assume that $\mathbb{E} f_i(X_{i1}, X_{i2}) = 0$ for any $f \in \mathcal{F}$ and any $i = 1,\dots,n$. Set
$$S_n := \sup_{f \in \mathcal{F}} \sum_{i=1}^{n} \sum_{j \neq k} f_i(X_{ij}, X_{ik}) \qquad \text{and} \qquad \tilde{S}_n := \sup_{f \in \mathcal{F}} \sum_{i=1}^{n} \sum_{j=1}^{\lfloor m/2 \rfloor} f_i\big(X_{ij}, X_{i(\lfloor m/2 \rfloor + j)}\big).$$
Moreover, assume that
$$U^2 \ge \sigma^2 \ge \frac{1}{n \lfloor m/2 \rfloor} \sum_{i=1}^{n} \sum_{j=1}^{\lfloor m/2 \rfloor} \mathbb{E} f_i^2\big(X_{ij}, X_{i(\lfloor m/2 \rfloor + j)}\big),$$
and define $v_n := 2U \mathbb{E}_X[\tilde{S}_n] + n \lfloor m/2 \rfloor \sigma^2$. Then, for any $x \ge 0$ and any $\alpha > 0$, with probability at least $1 - e^{-x}$,
$$\frac{1}{nm(m-1)} S_n \le (1+\alpha)\, \mathbb{E}\bigg[\frac{\tilde{S}_n}{n \lfloor m/2 \rfloor}\bigg] + \sqrt{\frac{2\sigma^2 x}{n \lfloor m/2 \rfloor}} + \bigg(\frac{3}{2} + \frac{1}{\alpha}\bigg) \frac{U x}{n \lfloor m/2 \rfloor}.$$

Combined with a conditioning argument, this lemma enables the localization analysis by controlling the empirical process associated with a sum of $U$-statistics. Now, we introduce several definitions for localization analysis (Bartlett, Bousquet and Mendelson, 2005; Koltchinskii, 2006).

DEFINITION 3.2 (Sub-root function). Let $r^* > 0$. A nondecreasing function $\phi : [0,\infty) \to [0,\infty)$ is called a sub-root function on $[r^*, \infty)$ if $r \mapsto \phi(r)/\sqrt{r}$ is nonincreasing for $r \ge r^*$.

DEFINITION 3.3 (Rademacher fixed point). Let $r^*_{nm} > 0$ and let $\phi_{nm} : [0,\infty) \to [0,\infty)$ be a sub-root function on $[r^*_{nm}, \infty)$. Let $\{\sigma_{ij}\}$ be i.i.d. Rademacher variables. Define $\mathcal{K}_{nm}(r) := \{K \in \mathcal{K}_{nm} \mid \|K - K^{\circ}\|_2^2 \le r\}$. If $\phi_{nm}(r^*_{nm}) \le r^*_{nm}$ and
$$\phi_{nm}(r) \ge \mathbb{E}\,\sup_{K \in \mathcal{K}_{nm}(r)} \frac{1}{n \lfloor m/2 \rfloor} \sum_{i=1}^{n} \sum_{j=1}^{\lfloor m/2 \rfloor} \sigma_{ij} \Big\{ K\big(T_{ij}, T_{i(\lfloor m/2 \rfloor + j)}\big) - K^{\circ}\big(T_{ij}, T_{i(\lfloor m/2 \rfloor + j)}\big) \Big\},$$
then $r^*_{nm}$ is called the Rademacher fixed point of $\mathcal{K}_{nm}$.
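The localized Rademacher average in Definition 3.3 involves only the $\lfloor m/2 \rfloor$ disjoint within-subject pairs $(T_{ij}, T_{i(\lfloor m/2 \rfloor + j)})$, which are independent across $(i,j)$. For a finite candidate class it can be estimated by direct Monte Carlo; the following sketch does this for a small illustrative family of exponential kernels (the class, the true kernel, and all sizes are assumptions made for the demo, not objects from the paper).

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 100, 6
half = m // 2
T = rng.uniform(size=(n, m))                        # design points, d = 1 for simplicity

K_true = lambda s, t: np.exp(-np.abs(s - t))
# Small illustrative candidate class indexed by a rate parameter (demo assumption).
candidates = [lambda s, t, a=a: np.exp(-a * np.abs(s - t)) for a in (0.5, 1.0, 2.0, 4.0)]

def local_rademacher(r, n_rep=500):
    """Monte Carlo estimate of the localized Rademacher average in Definition 3.3 over
    the candidates with ||K - K°||_2^2 <= r (the norm itself estimated by Monte Carlo)."""
    u, v = rng.uniform(size=100_000), rng.uniform(size=100_000)
    local = [K for K in candidates if np.mean((K(u, v) - K_true(u, v)) ** 2) <= r]
    if not local:
        return 0.0
    S, Sp = T[:, :half], T[:, half:2 * half]        # the disjoint pairs (T_ij, T_i(half+j))
    diffs = np.stack([K(S, Sp) - K_true(S, Sp) for K in local])     # (|local|, n, half)
    est = 0.0
    for _ in range(n_rep):
        sigma = rng.choice([-1.0, 1.0], size=(n, half))             # Rademacher signs
        est += np.max(np.tensordot(diffs, sigma, axes=([1, 2], [0, 1]))) / (n * half)
    return est / n_rep

for r in (0.001, 0.01, 0.1):
    print(r, local_rademacher(r))                   # grows with the localization radius r
```

Any $r^*_{nm}$ dominating such averages in the sense of Definition 3.3 enters the oracle inequality established next.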
Localization analysis based on the modified Talagrand inequality yields the following oracle inequality.

THEOREM 3.4. Under Assumption 1 and Assumption 2, for $m \ge 2$, there exists a global constant $c > 0$ such that
$$\mathbb{E}\big[\|\widehat{K}_n - K^{\circ}\|_2^2\big] \le c \left\{ \inf_{K \in \mathcal{K}_{nm}} \mathbb{E}\|K - K^{\circ}\|_2^2 + \frac{1}{n} + \log^2(n \lfloor m/2 \rfloor)\,\frac{1}{n \lfloor m/2 \rfloor} + r^*_{nm} \right\},$$
where $r^*_{nm}$ is the Rademacher fixed point of $\mathcal{K}_{nm}$.

This theorem can be seen as an extension of Theorem 2.1 in Yan, Yao and Zhou (2025), which establishes a sharp oracle inequality for mean function estimation, to covariance function estimation. It also extends the theory of Cai and Yuan (2010) to general least-squares (LS) estimators that may not admit an explicit form. Interestingly, our inequality matches the oracle inequality of Yan, Yao and Zhou (2025). For mean function estimation, a total of $nm$ observations $\{Y_{ij}\}$ are used, whereas covariance estimation involves $nm^2$ pairwise products $\{Y_{ij}Y_{ik}\}$. However, these terms are highly redundant, as they are constructed from only $nm$ underlying observations $\{Y_{ij}\}$. This point is also mentioned in Section 5 of Cai and Yuan (2010). In contrast, when the estimator admits an explicit representation, such as local linear smoothing estimators (e.g., Zhang and Wang (2016)), a more refined analysis that exploits its structure may improve the $1/(nm)$ rate to $1/(nm^2)$. This improvement, however, requires imposing additional structural constraints on the estimator, thereby sacrificing adaptivity. We discuss this direction in Section 4.3.

REMARK. In Theorem 3.4, Assumption 2 is used to control tail probabilities in the concentration arguments. Hence, it can be relaxed to a sub-Weibull assumption. Under this weaker condition, our results remain valid with the same polynomial rates; only the polylogarithmic factors deteriorate (i.e., $\log^2(n \lfloor m/2 \rfloor)$ is replaced by $\log^c(n \lfloor m/2 \rfloor)$ for some larger $c > 2$).

Theorem 3.4 is not directly applicable in practice, since the fixed point $r^*_{nm}$ depends on the population distribution, which is typically unknown. To address this issue, we extend Theorem 3.4 by replacing the Rademacher fixed point $r^*_{nm}$ with a quantity depending only on the complexity of $\mathcal{K}_{nm}$.

COROLLARY 3.5. Under Assumption 1 and Assumption 2, for $m \ge 2$, we have
$$\mathbb{E}\big[\|\widehat{K}_n - K^{\circ}\|_2^2\big] \le c \left\{ \inf_{K \in \mathcal{K}_{nm}} \mathbb{E}\big[\|K - K^{\circ}\|_2^2\big] + \frac{1}{n} + \frac{\mathrm{VCdim}(\mathcal{K}_{nm}) \log(n \lfloor m/2 \rfloor)}{n \lfloor m/2 \rfloor} \right\},$$
where $c$ is a global constant and $\mathrm{VCdim}(\mathcal{K}_{nm})$ denotes the Vapnik–Chervonenkis (VC) dimension of $\mathcal{K}_{nm}$.

4. Covariance function estimation via deep learning. In this section, we derive convergence rates for learning-based estimators built on deep neural networks (DNNs). First, we consider a class that illustrates the potential to mitigate the curse of dimensionality. Second, we consider a class that facilitates comparison with the results of Cai and Yuan (2010). Finally, we compare the performance of the deep learning estimator with existing estimators in one-dimensional settings.

4.1. Convergence rates for anisotropic Besov spaces. To illustrate the potential of deep learning estimators to mitigate the curse of dimensionality, we consider the anisotropic Besov space introduced by Suzuki and Nitanda (2021).
This space forms a rich function class that includes a wide range of functions and is particularly suitable for capturing heterogeneous smoothness in real-world data. For example, in spatio-temporal functional data, the covariance function may be very smooth in the temporal direction but much less smooth in the spatial direction, or vice versa.

We define the anisotropic Besov space following the notation of Suzuki and Nitanda (2021). Let $\mathbb{R}_+ := \{x \in \mathbb{R} \mid x > 0\}$. For $\beta = (\beta_1,\dots,\beta_d) \in \mathbb{R}_+^d$, we define
$$\tilde{\beta} := \bigg( \sum_{j=1}^{d} \beta_j^{-1} \bigg)^{-1}, \qquad \bar{\beta} := \max_{j=1,\dots,d} \beta_j.$$
For a function $f : \mathbb{R}^d \to \mathbb{R}$, the $r$-th difference of $f$ in the direction $h \in \mathbb{R}^d$ is recursively defined by
$$\Delta_h^r(f)(x) := \Delta_h^{r-1}(f)(x+h) - \Delta_h^{r-1}(f)(x), \qquad \Delta_h^0(f)(x) := f(x),$$
for $x \in [0,1]^d$ with $x + h \in [0,1]^d$; otherwise, we set $\Delta_h^r(f)(x) := 0$. Moreover, for $f \in L^p([0,1]^d)$ with $p \in (0,\infty]$, the $r$-th modulus of smoothness of $f$ is defined as
$$w_{r,p}(f,t) = \sup_{h \in \mathbb{R}^d :\, |h_i| \le t_i} \|\Delta_h^r(f)\|_p, \qquad t = (t_1,\dots,t_d) \in \mathbb{R}_+^d.$$
Using this modulus of smoothness, we define the anisotropic Besov space $B^{\beta}_{p,q}([0,1]^d)$ for $\beta = (\beta_1,\dots,\beta_d) \in \mathbb{R}_+^d$.

DEFINITION 4.1 (Anisotropic Besov space). For $0 < p, q \le \infty$, $\beta = (\beta_1,\dots,\beta_d) \in \mathbb{R}_+^d$, and $r := \lfloor \max_i \beta_i \rfloor + 1$, define the seminorm $|\cdot|_{B^{\beta}_{p,q}}$ by
$$|f|_{B^{\beta}_{p,q}} := \begin{cases} \Big( \sum_{k=0}^{\infty} \big[ 2^{k} w_{r,p}\big(f, (2^{-k/\beta_1},\dots,2^{-k/\beta_d})\big) \big]^q \Big)^{1/q}, & q < \infty, \\ \sup_{k \ge 0}\, 2^{k} w_{r,p}\big(f, (2^{-k/\beta_1},\dots,2^{-k/\beta_d})\big), & q = \infty. \end{cases}$$
The norm of the anisotropic Besov space $B^{\beta}_{p,q}([0,1]^d)$ is defined by $\|f\|_{B^{\beta}_{p,q}} := \|f\|_p + |f|_{B^{\beta}_{p,q}}$, and
$$B^{\beta}_{p,q}([0,1]^d) = \big\{ f \in L^p([0,1]^d) \mid \|f\|_{B^{\beta}_{p,q}} < \infty \big\}.$$
Intuitively, $\beta$ characterizes the smoothness in each coordinate direction. If $\beta_i$ is large, functions in $B^{\beta}_{p,q}$ are smooth along the $i$-th direction; otherwise, they may be less regular in that direction.

We consider the following class of covariance functions as a subset of the anisotropic Besov space:
$$\mathcal{K}^{\beta}_{p,q} := \big\{ K \in B^{(\beta,\beta)}_{p,q}(\mathcal{T}^2) \mid \|K\|_{B^{(\beta,\beta)}_{p,q}} \le 1, \text{ and } K \text{ is symmetric and positive definite} \big\}.$$
Here, the choice of $(\beta,\beta)$ for the smoothness parameters is natural, reflecting the symmetry of covariance functions.

We introduce a model for covariance functions based on deep feedforward neural networks. Let $\sigma : \mathbb{R} \to \mathbb{R}$ denote an activation function; we adopt the ReLU activation $\sigma(x) = \max\{x, 0\}$. For $b = (b_1,\dots,b_r)^{\top} \in \mathbb{R}^r$, define the biased activation function $\sigma_b : \mathbb{R}^r \to \mathbb{R}^r$ by
$$\sigma_b(y_1,\dots,y_r)^{\top} := \big(\sigma(y_1 - b_1),\dots,\sigma(y_r - b_r)\big)^{\top}.$$
The architecture of a feedforward neural network is specified by its depth $L$ (the number of hidden layers) and a common width $W \in \mathbb{N}$. A deep neural network with depth $L$ and width $W$ is defined as a function of the form
$$F(x) = W_L\, \sigma_{b_L} W_{L-1}\, \sigma_{b_{L-1}} \cdots W_1\, \sigma_{b_1} W_0\, x, \qquad x \in \mathbb{R}^{2d}, \tag{4}$$
where $W_0 \in \mathbb{R}^{W \times 2d}$, $W_L \in \mathbb{R}^{1 \times W}$, $W_l \in \mathbb{R}^{W \times W}$ for $1 \le l \le L-1$, and $b_1,\dots,b_L \in \mathbb{R}^{W}$ are bias parameters. Since we consider estimating the covariance function, the input dimension is taken to be $2d$. We now define the class of sparse networks by
$$\mathcal{F}(L, W, S) = \bigg\{ F \text{ of the form (4)} \;\Big|\; \sum_{j=0}^{L} \big( \|W_j\|_0 + |b_j|_0 \big) \le S \bigg\},$$
where $b_0$ denotes the zero vector. Here, $\|W_j\|_0$ and $|b_j|_0$ represent the numbers of nonzero entries of $W_j$ and $b_j$, respectively.
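A minimal numpy sketch of the forward pass (4) with input dimension $2d$, together with the sparsity count defining $\mathcal{F}(L, W, S)$ and the symmetrization used for the covariance model in the next display. The random dense weights are placeholders (so $S$ here simply equals the total parameter count); all concrete sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
d, L, W = 1, 3, 16                       # domain dimension, depth, width (illustrative)

relu = lambda z: np.maximum(z, 0.0)

# Random placeholder parameters with the shapes required by (4):
# W_0 in R^{W x 2d}, W_1..W_{L-1} in R^{W x W}, W_L in R^{1 x W}, biases b_1, ..., b_L.
Ws = [rng.normal(size=(W, 2 * d))] \
   + [rng.normal(size=(W, W)) for _ in range(L - 1)] + [rng.normal(size=(1, W))]
bs = [rng.normal(size=W) for _ in range(L)]

def network(x):
    """Forward pass of (4): F(x) = W_L sigma_{b_L} W_{L-1} ... sigma_{b_1} W_0 x."""
    z = Ws[0] @ x
    for l in range(1, L + 1):
        z = Ws[l] @ relu(z - bs[l - 1])  # biased ReLU activation sigma_b
    return z.item()

def K_net(s, t):
    """Symmetrized covariance model as in (5) below: K(s,t) = {h(s,t) + h(t,s)} / 2."""
    return 0.5 * (network(np.concatenate([s, t])) + network(np.concatenate([t, s])))

# Sparsity count: with dense random weights, S equals the total number of parameters.
S = sum(int((M != 0).sum()) for M in Ws) + sum(int((b != 0).sum()) for b in bs)
print(K_net(np.array([0.2]), np.array([0.7])), "S =", S)
```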
To ensure symmetry of the covariance function estimator, we consider the following class of covariance functions:
$$\mathcal{K}(L, W, S) := \Big\{ K(\cdot, \star) = \tfrac{1}{2}\{h(\cdot, \star) + h(\star, \cdot)\} \;\Big|\; h \in \mathcal{F}(L, W, S) \Big\}. \tag{5}$$
This DNN model (5) is much simpler than the CovNet model (Sarkar and Panaretos, 2022), leading to a simpler convergence analysis. It may seem that the estimator should be positive definite, as in CovNet. However, post-estimation projection always ensures that
$$\big\| \tilde{K}_n - K^{\circ} \big\|_2 \le \big\| \widehat{K}_n - K^{\circ} \big\|_2,$$
where $\tilde{K}_n$ is the projection of $\widehat{K}_n$ onto the class of positive definite functions. Thus, positive definiteness need not be imposed at the estimation stage, as is also the case in existing approaches.

The following theorem establishes convergence rates of the deep learning estimator over the class $\mathcal{K}^{\beta}_{p,q}$.

THEOREM 4.2. Suppose $0 < p, q \le \infty$ and $\tilde{\beta}/2 > (1/p - 1/2)_+$ for $\beta \in \mathbb{R}_+^d$. Let $\delta = 2(1/p - 1/2)_+$ and $\nu = (\tilde{\beta}/2 - \delta)/(2\delta)$. Consider the deep neural network model $\mathcal{K}(L, W, S)$ satisfying (i) $L \asymp \log(n \lfloor m/2 \rfloor)$, (ii) $W \asymp (n \lfloor m/2 \rfloor)^{1/(\tilde{\beta}+1)}$, and (iii) $S \asymp (n \lfloor m/2 \rfloor)^{1/(\tilde{\beta}+1)} \log^2(n \lfloor m/2 \rfloor)$. If $K^{\circ} \in \mathcal{K}^{\beta}_{p,q}$ and $\|K^{\circ}\|_{\infty} \le B_K$, then under Assumption 2 we obtain
$$\mathbb{E}\big[\|\widehat{K}_n - K^{\circ}\|_2^2\big] \le c \left\{ \frac{1}{n} + \frac{1}{(n \lfloor m/2 \rfloor)^{\tilde{\beta}/(\tilde{\beta}+1)}} \log^3(n \lfloor m/2 \rfloor) \right\},$$
where $c$ is a universal constant that does not depend on $n$ and $m$.

In the sparse regime, the convergence rate of deep learning estimators coincides with that of deep nonparametric regression in $2d$ dimensions, as established in Suzuki and Nitanda (2021). The rates do not directly depend on the dimension $d$, and thus suggest that deep learning estimators can mitigate the curse of dimensionality. For example, set $\beta_1 = \alpha$ and $\beta_2 = \cdots = \beta_d = (d-1)\alpha$ for some $\alpha > 0$. This represents an anisotropic setting in which the function is non-smooth in the first coordinate, while being considerably smoother in the other coordinates. Then $\tilde{\beta} = \alpha/2$, and the resulting rate does not depend on the dimension $d$.

The bound in Theorem 4.2 exhibits a phase transition with respect to the number of measurements $m$. When $m \ll n^{1/\tilde{\beta}}$, the second term dominates and the convergence rate is determined by $(nm)^{-\tilde{\beta}/(\tilde{\beta}+1)}$ up to polylogarithmic factors. In contrast, when $m \gg n^{1/\tilde{\beta}}$, the first term $n^{-1}$ dominates, and the estimator achieves the parametric rate $1/n$. The transition occurs at the critical scaling $m \asymp n^{1/\tilde{\beta}}$, where the two terms are of the same order.
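To see why the boundary sits at $m \asymp n^{1/\tilde{\beta}}$, one can equate the two terms of the bound directly, ignoring polylogarithmic factors and the rounding in $\lfloor m/2 \rfloor$:
$$(nm)^{-\tilde{\beta}/(\tilde{\beta}+1)} \asymp n^{-1} \iff nm \asymp n^{(\tilde{\beta}+1)/\tilde{\beta}} \iff m \asymp n^{(\tilde{\beta}+1)/\tilde{\beta}\, -\, 1} = n^{1/\tilde{\beta}},$$
so the nonparametric term dominates when $m \ll n^{1/\tilde{\beta}}$, and the parametric term $n^{-1}$ dominates when $m \gg n^{1/\tilde{\beta}}$.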
REMARK. Theorem 4.2 extends readily to the affine composition model and the deep composition model of Suzuki and Nitanda (2021). For example, the affine composition model for covariance functions can be defined as follows:
$$\mathcal{K}^{\beta}_{\mathrm{aff},p,q} := \big\{ H(As + b, At + b) \mid H \in \mathcal{K}^{\beta}_{p,q},\ A \in \mathbb{R}^{\tilde{d} \times d},\ b \in \mathbb{R}^{\tilde{d}} \big\}, \qquad \beta \in \mathbb{R}_+^{\tilde{d}}.$$
Thus, the directions of smoothness need not be aligned with the coordinate directions and can be arbitrarily oriented through affine transformations.

4.2. Convergence rates for the tensor product space. We consider the setting of Cai and Yuan (2010) and thus set $\mathcal{T} = [0,1]$. Let $\mathcal{H}(\kappa)$ be a reproducing kernel Hilbert space (RKHS) with kernel $\kappa : \mathcal{T} \times \mathcal{T} \to \mathbb{R}$. We assume that $X$ has sample paths in $\mathcal{H}(\kappa)$ almost surely and that $\mathbb{E}[\|X\|^2_{\mathcal{H}(\kappa)}] < \infty$, where $\|\cdot\|_{\mathcal{H}(\kappa)}$ denotes the RKHS norm. Then Theorem 1 of Cai and Yuan (2010) implies that the true covariance function $K^{\circ}$ lies in the tensor product space $\mathcal{H}(\kappa \otimes \kappa)$.

As a canonical example, we take $\mathcal{H}(\kappa)$ to be the periodic Sobolev space $W^{\alpha,2}_{\mathrm{per}}(\mathcal{T}) \subset L^2(\mathcal{T})$ of order $\alpha \in \mathbb{N}$ defined on $\mathcal{T}$. This setting is also discussed as a representative example in Cai and Yuan (2010). For further details on periodic Sobolev spaces, we refer to Appendix 2.4 of Berlinet and Thomas-Agnan (2004). In this setting, it is well known that the eigenfunctions of $\kappa$ are given by the Fourier basis functions (e.g., Section 2 of Wahba (1975)):
$$\kappa(s,t) = 1 + \sum_{j=1}^{\infty} \frac{1}{(2\pi j)^{2\alpha}} \big\{ \psi^{(c)}_j(s)\psi^{(c)}_j(t) + \psi^{(s)}_j(s)\psi^{(s)}_j(t) \big\},$$
where $\psi^{(c)}_j(t) := \sqrt{2}\cos(2\pi j t)$ and $\psi^{(s)}_j(t) := \sqrt{2}\sin(2\pi j t)$. For notational convenience, we rearrange the eigenfunctions into a single orthonormal sequence $\{\psi_j\}_{j \ge 1}$ in $L^2(\mathcal{T})$ as follows:
$$\psi_1(t) := 1, \qquad \psi_{2j}(t) := \psi^{(c)}_j(t), \qquad \psi_{2j+1}(t) := \psi^{(s)}_j(t), \qquad j \ge 1.$$
Let $\{\rho_j\}_{j \ge 1}$ be the eigenvalues of $\kappa$ associated with $\{\psi_j\}_{j \ge 1}$, arranged in non-increasing order. Since $X$ has sample paths in $\mathcal{H}(\kappa) \subset L^2(\mathcal{T})$, the sample path $X$ and its covariance function $K^{\circ}$ admit the following expansions:
$$X(\cdot) = \sum_{j=1}^{\infty} Z_j \psi_j(\cdot), \qquad Z_j := \int_{\mathcal{T}} X(t)\psi_j(t)\, dt,$$
and
$$K^{\circ}(s,t) = \sum_{j,k \ge 1} c_{jk}\, \psi_j(s)\psi_k(t), \qquad \text{where } c_{jk} := \mathrm{Cov}(Z_j, Z_k). \tag{6}$$
We consider the following tensor product class of covariance functions, defined as the ball of radius $R$ in the RKHS $\mathcal{H}(\kappa \otimes \kappa)$:
$$\mathcal{K}^{\alpha}_{\mathrm{TP}} := \{ K \in \mathcal{H}(\kappa \otimes \kappa) \mid \|K\|_{\mathcal{H}(\kappa \otimes \kappa)} \le R \}, \qquad \|K\|^2_{\mathcal{H}(\kappa \otimes \kappa)} := \sum_{j,k \ge 1} \frac{c^2_{jk}}{\rho_j \rho_k}.$$
The following theorem establishes the convergence rates of the deep learning estimator for the class $\mathcal{K}^{\alpha}_{\mathrm{TP}}$.

THEOREM 4.3. Consider the deep neural network model $\mathcal{K}(L, W, S)$ satisfying (i) $L \asymp \log^3(n \lfloor m/2 \rfloor)$, (ii) $W \asymp (n \lfloor m/2 \rfloor)^{1/(2\alpha+1)} \log^5(n \lfloor m/2 \rfloor)$, and (iii) $S \asymp (n \lfloor m/2 \rfloor)^{1/(2\alpha+1)} \log^{12}(n \lfloor m/2 \rfloor)$. If $K^{\circ} \in \mathcal{K}^{\alpha}_{\mathrm{TP}}$ and $\|K^{\circ}\|_{\infty} \le B_K$, then under Assumption 2 we obtain
$$\mathbb{E}\big[\|\widehat{K}_n - K^{\circ}\|_2^2\big] \le c \left\{ \frac{1}{n} + \frac{1}{(n \lfloor m/2 \rfloor)^{2\alpha/(2\alpha+1)}} \log^{17}(n \lfloor m/2 \rfloor) \right\},$$
where $c$ is a universal constant independent of $n$ and $m$.

For the tensor product space $\mathcal{H}(\kappa \otimes \kappa)$, the convergence rate of the deep learning estimator matches that of one-dimensional nonparametric regression, even though the covariance function is defined on the two-dimensional domain $[0,1]^2$. The same holds for the smoothing spline estimator studied by Cai and Yuan (2010). Moreover, combined with Theorem 6 of Cai and Yuan (2010), our result shows that the deep learning estimator achieves the minimax rate up to a polylogarithmic factor. (The polylogarithmic factor $\log^{17}(n \lfloor m/2 \rfloor)$ is due to our simple approximation scheme; the exponent 17 is not intrinsic and can likely be reduced.)

The two approaches, however, rely on fundamentally different mechanisms. The smoothing spline estimator achieves the optimal rate by explicitly specifying the reproducing kernel associated with the underlying RKHS a priori, whereas the deep learning estimator attains the same rate in an adaptive manner, without explicit knowledge of the kernel.
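As a numerical companion to the eigen-expansion above, the following sketch evaluates a truncated version of the periodic Sobolev kernel $\kappa$ and a simple member of $\mathcal{K}^{\alpha}_{\mathrm{TP}}$ with diagonal coefficients $c_{jj} = \rho_j^2$, so that $\sum_j c_{jj}^2/\rho_j^2 = \sum_j \rho_j^2 < \infty$ (finite RKHS norm). The truncation level and the coefficient choice are illustrative assumptions, not constructions from the paper.

```python
import numpy as np

alpha, J = 2, 50                  # Sobolev order and truncation level (illustrative)

def psi(j, t):
    """Fourier basis: psi_1 = 1, psi_{2j} = sqrt(2) cos(2 pi j t), psi_{2j+1} = sqrt(2) sin(2 pi j t)."""
    if j == 1:
        return np.ones_like(t)
    freq = j // 2
    return np.sqrt(2.0) * (np.cos if j % 2 == 0 else np.sin)(2 * np.pi * freq * t)

def rho(j):
    """Eigenvalues of kappa: rho_1 = 1 and rho_j = (2 pi floor(j/2))^(-2 alpha) for j >= 2."""
    return 1.0 if j == 1 else (2 * np.pi * (j // 2)) ** (-2 * alpha)

def kappa(s, t):
    """Truncated periodic Sobolev kernel: kappa(s,t) ~ sum_j rho_j psi_j(s) psi_j(t)."""
    return sum(rho(j) * psi(j, s) * psi(j, t) for j in range(1, 2 * J + 2))

def K_TP(s, t):
    """A member of the tensor product ball with c_jj = rho_j^2 (demo assumption)."""
    return sum(rho(j) ** 2 * psi(j, s) * psi(j, t) for j in range(1, 2 * J + 2))

grid = np.linspace(0.0, 1.0, 5)
print(np.round(kappa(grid[:, None], grid[None, :]), 4))    # symmetric kernel matrix
print(np.isclose(K_TP(0.2, 0.7), K_TP(0.7, 0.2)))          # K is symmetric
```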
4.3. Comparison with other estimators. We compare the deep learning estimator with existing estimators in one-dimensional settings. Here, we ignore polylogarithmic factors in the convergence rates.

For the local linear smoothing estimator, Zhang and Wang (2016) consider a class of covariance functions satisfying the following condition:
$$\frac{\partial^2 K(s,t)}{\partial s^2}, \quad \frac{\partial^2 K(s,t)}{\partial s\, \partial t}, \quad \text{and} \quad \frac{\partial^2 K(s,t)}{\partial t^2} \quad \text{are bounded on } [0,1]^2. \tag{7}$$
This class is contained in the anisotropic Besov space $B^{(2,2)}_{\infty,\infty}([0,1]^2)$, which corresponds to isotropic smoothness in each coordinate. We note that the LS loss (1) coincides with the equal-weight-per-observation (OBS) scheme of Zhang and Wang (2016).

For the sparse regime, the convergence rates of the local linear smoothing estimator and the deep learning estimator coincide. For the dense regime, Corollary 4.4 of Zhang and Wang (2016) shows that the local linear smoothing estimator $\widehat{K}^{(\mathrm{LLS})}_n$ with a suitable choice of tuning parameter satisfies
• $\|\widehat{K}^{(\mathrm{LLS})}_n - K^{\circ}\|_2 = O_p\big((nm^2)^{-1/3}\big)$ when $m^2/\sqrt{n} \to 0$, and
• $\|\widehat{K}^{(\mathrm{LLS})}_n - K^{\circ}\|_2 = O_p\big(n^{-1/2}\big)$ when $\sqrt{n} = O(m^2)$.
In contrast, Theorem 4.2 shows that the deep learning estimator $\widehat{K}^{(\mathrm{DL})}_n$ satisfies
• $\|\widehat{K}^{(\mathrm{DL})}_n - K^{\circ}\|_2 = O_p\big((nm)^{-1/3}\big)$ when $m/\sqrt{n} \to 0$, and
• $\|\widehat{K}^{(\mathrm{DL})}_n - K^{\circ}\|_2 = O_p\big(n^{-1/2}\big)$ when $\sqrt{n} = O(m)$.
Thus, surprisingly, in the one-dimensional smoothness setting under the dense regime, the deep learning estimator is suboptimal, and the local linear smoothing estimator is therefore the suitable choice.

However, the situation can be reversed under a structured setting. Indeed, even when the true covariance function $K^{\circ}$ belongs to $\mathcal{H}(\kappa \otimes \kappa)$ with $\alpha = 2$ and also satisfies condition (7), the local linear smoothing estimator yields only the same rates as above: it does not exploit the tensor product structure and therefore cannot achieve the optimal rate. By contrast, in this setting, the deep learning estimator, due to its adaptivity, satisfies
• $\|\widehat{K}^{(\mathrm{DL})}_n - K^{\circ}\|_2 = O_p\big((nm)^{-2/5}\big)$ when $m^2/\sqrt{n} \to 0$, and
• $\|\widehat{K}^{(\mathrm{DL})}_n - K^{\circ}\|_2 = O_p\big(n^{-1/2}\big)$ when $\sqrt{n} = O(m^2)$.
Therefore, the deep learning estimator is the suitable choice in this setting. Moreover, while linear estimators such as the local linear smoothing estimator and the smoothing spline estimator lack adaptivity, deep learning estimators can adapt to anisotropic smoothness over multidimensional domains. For example, the suboptimality of linear estimators in classical nonparametric regression is demonstrated in Suzuki and Nitanda (2021) and Hayakawa and Suzuki (2020). Similar arguments apply to covariance function estimation.

5. Conclusion. In this paper, we studied nonparametric covariance function estimation for discretely observed functional data on a multidimensional domain. We established an oracle inequality for a broad class of learning-based estimators that applies to both sparse and dense observation regimes in a unified framework. As an important application, we derived convergence rates for deep learning estimators over several classes of covariance functions. Our results reveal a phase transition with respect to the number of measurements $m$. This transition highlights the interplay between sampling density and intrinsic smoothness in covariance function estimation.
We further demonstrated that the performance of estimators crucially depends on the underlying function class. In the one-dimensional setting, for the class of smooth functions considered by Zhang and Wang (2016), deep learning estimators are suboptimal, whereas local linear smoothing estimators achieve faster rates. In contrast, for a structured function class, such as the tensor product RKHS studied in Cai and Yuan (2010), deep learning estimators attain the minimax rate up to polylogarithmic factors, while local linear smoothing estimators are suboptimal. These findings indicate that learning-based estimators are not always optimal, but can effectively exploit structural properties such as anisotropy. This leads to a distinctive adaptivity-variance trade-off that appears to be specific to covariance function estimation and is not commonly observed in classical nonparametric regression.

As an important direction for future work, by combining our framework with the refined analysis of FPCA based on the local linear smoothing estimator by Zhou, Wei and Yao (2025), we may obtain convergence rates for deep learning estimators of the eigenfunctions.

APPENDIX A: PROOF OF THE KEY LEMMA (LEMMA 3.1)

Let $\mathfrak{S}_m$ denote the symmetric group on $\{1,\dots,m\}$. Let $\Pi$ be a random permutation taking values in $\mathfrak{S}_m$, uniformly distributed over all $m!$ permutations. For a realization $\pi \in \mathfrak{S}_m$, we write $\pi(j)$ for the image of $j \in \{1,\dots,m\}$ under $\pi$.

First, following the argument of Clémençon, Lugosi and Vayatis (2008), note that
$$\frac{1}{m(m-1)} \sum_{j \neq k} f_i(X_{ij}, X_{ik}) = \frac{1}{m!} \sum_{\pi \in \mathfrak{S}_m} \frac{1}{\lfloor m/2 \rfloor} \sum_{j=1}^{\lfloor m/2 \rfloor} f_i\big(X_{i\pi(j)}, X_{i\pi(\lfloor m/2 \rfloor + j)}\big) = \mathbb{E}_{\Pi}\Bigg[\frac{1}{\lfloor m/2 \rfloor} \sum_{j=1}^{\lfloor m/2 \rfloor} f_i\big(X_{i\Pi(j)}, X_{i\Pi(\lfloor m/2 \rfloor + j)}\big)\Bigg].$$
By Jensen's inequality,
$$S_n = \sup_{f \in \mathcal{F}} \frac{m(m-1)}{\lfloor m/2 \rfloor}\, \mathbb{E}_{\Pi}\Bigg[\sum_{i=1}^{n} \sum_{j=1}^{\lfloor m/2 \rfloor} f_i\big(X_{i\Pi(j)}, X_{i\Pi(\lfloor m/2 \rfloor + j)}\big)\Bigg] \le \frac{m(m-1)}{\lfloor m/2 \rfloor}\, \mathbb{E}_{\Pi}\Bigg[\sup_{f \in \mathcal{F}} \sum_{i=1}^{n} \sum_{j=1}^{\lfloor m/2 \rfloor} f_i\big(X_{i\Pi(j)}, X_{i\Pi(\lfloor m/2 \rfloor + j)}\big)\Bigg].$$
Define
$$M_m := \frac{m(m-1)}{\lfloor m/2 \rfloor}, \qquad \bar{S}_n := \mathbb{E}_{\Pi}\Bigg[\sup_{f \in \mathcal{F}} \sum_{i=1}^{n} \sum_{j=1}^{\lfloor m/2 \rfloor} f_i\big(X_{i\Pi(j)}, X_{i\Pi(\lfloor m/2 \rfloor + j)}\big)\Bigg], \qquad S_{n,\pi} := \sup_{f \in \mathcal{F}} \sum_{i=1}^{n} \sum_{j=1}^{\lfloor m/2 \rfloor} f_i\big(X_{i\pi(j)}, X_{i\pi(\lfloor m/2 \rfloor + j)}\big).$$
Then $S_n \le M_m \bar{S}_n$, and it suffices to derive an inequality for $\bar{S}_n$. We observe that
$$\mathbb{E}_X\big[e^{\lambda(\bar{S}_n - \mathbb{E}_X[\bar{S}_n])}\big] = \mathbb{E}_X\big[e^{\lambda\, \mathbb{E}_{\Pi}[S_{n,\Pi} - \mathbb{E}_X S_{n,\Pi}]}\big] \le \mathbb{E}_{\Pi,X}\big[e^{\lambda(S_{n,\Pi} - \mathbb{E}_X S_{n,\Pi})}\big].$$
By Theorem 3.3.16 of Giné and Nickl (2016), for any permutation $\pi$ and any $\lambda \in [0, 2/3]$,
$$\mathbb{E}\big[e^{\lambda(S_{n,\pi}/U - \mathbb{E}[S_{n,\pi}/U])}\big] \le \exp\bigg(\frac{\lambda^2}{2 - 3\lambda} \cdot \frac{v_n}{U^2}\bigg).$$
Then, we obtain
$$\mathbb{E}_X\big[e^{\lambda(\bar{S}_n/U - \mathbb{E}_X[\bar{S}_n/U])}\big] \le \exp\bigg(\frac{\lambda^2}{2 - 3\lambda} \cdot \frac{v_n}{U^2}\bigg).$$
Consequently, for any $x \ge 0$, with probability at least $1 - e^{-x}$,
$$\bar{S}_n \le \mathbb{E}[\bar{S}_n] + \sqrt{2 v_n x} + \frac{3U}{2}\,x.$$
Since the $X_{i1},\dots,X_{im}$ are i.i.d. within each subject, $S_{n,\pi}$ has the same distribution as $\tilde{S}_n$ for every fixed $\pi$, and hence $\mathbb{E}_X[\bar{S}_n] = \mathbb{E}[\tilde{S}_n]$. Using $\sqrt{2 v_n x} \le 2\sqrt{U\, \mathbb{E}[\tilde{S}_n]\, x} + \sqrt{2 n \lfloor m/2 \rfloor \sigma^2 x}$ and Young's inequality $2\sqrt{ab} \le \alpha a + b/\alpha$, for any $x \ge 0$ and any $\alpha > 0$, with probability at least $1 - e^{-x}$,
$$\frac{1}{nm(m-1)} S_n \le \frac{1}{n \lfloor m/2 \rfloor}\bigg\{ \mathbb{E}[\bar{S}_n] + 2\sqrt{U\, \mathbb{E}[\tilde{S}_n]\, x} + \sqrt{2 n \lfloor m/2 \rfloor \sigma^2 x} + \frac{3U}{2}\,x \bigg\} \le (1+\alpha)\, \mathbb{E}\bigg[\frac{\tilde{S}_n}{n \lfloor m/2 \rfloor}\bigg] + \sqrt{\frac{2\sigma^2 x}{n \lfloor m/2 \rfloor}} + \bigg(\frac{3}{2} + \frac{1}{\alpha}\bigg)\frac{U x}{n \lfloor m/2 \rfloor}.$$
This completes the proof.
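The permutation representation in the first display of the proof can be verified numerically for small $m$ by enumerating all $m!$ permutations. A minimal sketch, with an arbitrary non-symmetric kernel $f$ chosen purely for the demo:

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(4)
m = 5
half = m // 2                                 # floor(m/2)
X = rng.normal(size=m)                        # one subject's observations X_1, ..., X_m
f = lambda a, b: a * b + np.sin(a - b)        # arbitrary non-symmetric kernel (demo choice)

# Left-hand side: the within-subject U-statistic over ordered pairs j != k.
lhs = sum(f(X[j], X[k]) for j in range(m) for k in range(m) if j != k) / (m * (m - 1))

# Right-hand side: average over all m! permutations of the split-pair block sum.
rhs = sum(
    sum(f(X[p[j]], X[p[half + j]]) for j in range(half)) / half
    for p in itertools.permutations(range(m))
) / math.factorial(m)

print(lhs, rhs, np.isclose(lhs, rhs))         # the two representations agree exactly
```

Each ordered pair $(j,k)$, $j \neq k$, appears in exactly $\lfloor m/2 \rfloor (m-2)!$ of the permutation-position combinations, which is precisely why the average over permutations reproduces the $U$-statistic.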
APPENDIX B: PROOF OF THEOREM 3.4

The proof strategy follows a standard localization analysis (e.g., Bartlett, Bousquet and Mendelson (2005); Koltchinskii (2006); Blanchard, Bousquet and Massart (2008)) combined with a truncation argument (e.g., Bagirov, Clausen and Kohler (2009)), as also employed in Yan, Yao and Zhou (2025). The technical lemmas for the proof are given in Appendix F.

Step 1. By Lemma F.1 with $C = 2$, for every $x \ge 0$ we have, with probability at least $1 - e^{-x}$,
$$\|\widehat{K}_n - K^{\circ}\|_2^2 \le \frac{2}{nm(m-1)} \sum_{i=1}^{n} \sum_{j \neq k} \big\{\widehat{K}_n(T_{ij}, T_{ik}) - K^{\circ}(T_{ij}, T_{ik})\big\}^2 + 6400 B_K^2 r^* + \frac{96 B_K^2 x}{n \lfloor m/2 \rfloor}.$$
This yields
$$\mathbb{E}\|\widehat{K}_n - K^{\circ}\|_2^2 \le \frac{2}{nm(m-1)}\, \mathbb{E}\Bigg[\sum_{i=1}^{n} \sum_{j \neq k} \big\{\widehat{K}_n(T_{ij}, T_{ik}) - K^{\circ}(T_{ij}, T_{ik})\big\}^2\Bigg] + 6400 B_K^2 r^* + \frac{96 B_K^2}{n \lfloor m/2 \rfloor}. \tag{8}$$

Step 2. We start from the identity
$$\sum_{i=1}^{n} \sum_{j \neq k} \big\{Y_{ij}Y_{ik} - \widehat{K}_n(T_{ij}, T_{ik})\big\}^2 = \sum_{i=1}^{n} \sum_{j \neq k} \{Y_{ij}Y_{ik} - K^{\circ}(T_{ij}, T_{ik})\}^2 + \sum_{i=1}^{n} \sum_{j \neq k} \big\{K^{\circ}(T_{ij}, T_{ik}) - \widehat{K}_n(T_{ij}, T_{ik})\big\}^2 + 2\sum_{i=1}^{n} \sum_{j \neq k} \{Y_{ij}Y_{ik} - K^{\circ}(T_{ij}, T_{ik})\}\big\{K^{\circ}(T_{ij}, T_{ik}) - \widehat{K}_n(T_{ij}, T_{ik})\big\}.$$
In addition, for any $K \in \mathcal{K}_{nm}$,
$$\mathbb{E}\Bigg[\sum_{i=1}^{n} \sum_{j \neq k} \{Y_{ij}Y_{ik} - K(T_{ij}, T_{ik})\}^2\Bigg] = \mathbb{E}\Bigg[\sum_{i=1}^{n} \sum_{j \neq k} \{Y_{ij}Y_{ik} - K^{\circ}(T_{ij}, T_{ik})\}^2\Bigg] + \mathbb{E}\Bigg[\sum_{i=1}^{n} \sum_{j \neq k} \{K^{\circ}(T_{ij}, T_{ik}) - K(T_{ij}, T_{ik})\}^2\Bigg].$$
By the optimality of $\widehat{K}_n$ over $\mathcal{K}_{nm}$, we therefore obtain
$$\mathbb{E}\Bigg[\frac{1}{nm(m-1)} \sum_{i=1}^{n} \sum_{j \neq k} \big\{\widehat{K}_n(T_{ij}, T_{ik}) - K^{\circ}(T_{ij}, T_{ik})\big\}^2\Bigg] \le \mathbb{E}\Bigg[\frac{1}{nm(m-1)} \sum_{i=1}^{n} \sum_{j \neq k} \{K(T_{ij}, T_{ik}) - K^{\circ}(T_{ij}, T_{ik})\}^2\Bigg] + 2\,\mathbb{E}\Bigg[\frac{1}{nm(m-1)} \sum_{i=1}^{n} \sum_{j \neq k} \{Y_{ij}Y_{ik} - K^{\circ}(T_{ij}, T_{ik})\}\big\{\widehat{K}_n(T_{ij}, T_{ik}) - K^{\circ}(T_{ij}, T_{ik})\big\}\Bigg].$$
Since $Y_{ij} = X_i(T_{ij}) + \epsilon_{ij}$, we can expand
$$Y_{ij}Y_{ik} = \{X_i(T_{ij}) + \epsilon_{ij}\}\{X_i(T_{ik}) + \epsilon_{ik}\} = X_i(T_{ij})X_i(T_{ik}) + X_i(T_{ij})\epsilon_{ik} + X_i(T_{ik})\epsilon_{ij} + \epsilon_{ij}\epsilon_{ik}.$$
Thus it suffices to control the three terms
$$U_{nm} := \frac{1}{nm(m-1)} \sum_{i=1}^{n} \sum_{j \neq k} \{X_i(T_{ij})X_i(T_{ik}) - K^{\circ}(T_{ij}, T_{ik})\}\big\{\widehat{K}_n(T_{ij}, T_{ik}) - K^{\circ}(T_{ij}, T_{ik})\big\},$$
$$V_{nm} := \frac{1}{nm(m-1)} \sum_{i=1}^{n} \sum_{j \neq k} X_i(T_{ij})\,\epsilon_{ik}\,\big\{\widehat{K}_n(T_{ij}, T_{ik}) - K^{\circ}(T_{ij}, T_{ik})\big\},$$
$$W_{nm} := \frac{1}{nm(m-1)} \sum_{i=1}^{n} \sum_{j \neq k} \epsilon_{ij}\,\epsilon_{ik}\,\big\{\widehat{K}_n(T_{ij}, T_{ik}) - K^{\circ}(T_{ij}, T_{ik})\big\}.$$
Once $\mathbb{E}[U_{nm}]$, $\mathbb{E}[V_{nm}]$, and $\mathbb{E}[W_{nm}]$ are bounded, we obtain
$$\frac{2}{nm(m-1)}\, \mathbb{E}\Bigg[\sum_{i=1}^{n} \sum_{j \neq k} \big\{\widehat{K}_n(T_{ij}, T_{ik}) - K^{\circ}(T_{ij}, T_{ik})\big\}^2\Bigg] \le 2\,\mathbb{E}\|K - K^{\circ}\|_2^2 + 4\{\mathbb{E}[U_{nm}] + \mathbb{E}[V_{nm}] + \mathbb{E}[W_{nm}]\}.$$
Combining this with (8) yields
$$\mathbb{E}\|\widehat{K}_n - K^{\circ}\|_2^2 \le 2\,\mathbb{E}\|K - K^{\circ}\|_2^2 + 4\{\mathbb{E}[U_{nm}] + \mathbb{E}[V_{nm}] + \mathbb{E}[W_{nm}]\} + 6400 B_K^2 r^* + \frac{96 B_K^2}{n \lfloor m/2 \rfloor}. \tag{9}$$

Step 3 (Control of $U_{nm}$). Let $e_i(T_{ij}, T_{ik}) := X_i(T_{ij})X_i(T_{ik}) - K^{\circ}(T_{ij}, T_{ik})$. Introduce a clipping level $\beta_U > 0$ (throughout, $\mathcal{C}_{\beta}$ denotes clipping, i.e., truncation to $[-\beta, \beta]$) and write
$$U_{nm} \le \frac{1}{n}\sum_{i=1}^{n}\Bigg[\frac{1}{m(m-1)}\sum_{j \neq k} \{\mathcal{C}_{\beta_U} e_i(T_{ij}, T_{ik})\}\big\{\widehat{K}_n(T_{ij}, T_{ik}) - K^{\circ}(T_{ij}, T_{ik})\big\} - \int \{\mathcal{C}_{\beta_U} e_i(t,t')\}\big\{\widehat{K}_n(t,t') - K^{\circ}(t,t')\big\}\, dP_T^{\otimes 2}(t,t')\Bigg]$$
$$+ \frac{1}{n}\sum_{i=1}^{n}\Bigg[\frac{1}{m(m-1)}\sum_{j \neq k} \{e_i(T_{ij}, T_{ik}) - \mathcal{C}_{\beta_U} e_i(T_{ij}, T_{ik})\}\big\{\widehat{K}_n(T_{ij}, T_{ik}) - K^{\circ}(T_{ij}, T_{ik})\big\} - \int \{e_i(t,t') - \mathcal{C}_{\beta_U} e_i(t,t')\}\big\{\widehat{K}_n(t,t') - K^{\circ}(t,t')\big\}\, dP_T^{\otimes 2}(t,t')\Bigg]$$
$$+ \frac{1}{n}\sum_{i=1}^{n} \int e_i(t,t')\big\{\widehat{K}_n(t,t') - K^{\circ}(t,t')\big\}\, dP_T^{\otimes 2}(t,t') \;=:\; U^{(\mathrm{I})}_{nm} + U^{(\mathrm{II})}_{nm} + U^{(\mathrm{III})}_{nm}.$$
(i) Control of $U^{(\mathrm{I})}_{nm}$. By Lemma F.2, for any $C_U > 0$,
$$\mathbb{E}\big[U^{(\mathrm{I})}_{nm}\big] \le \frac{1}{C_U}\mathbb{E}\big[\|\widehat{K}_n - K^{\circ}\|_2^2\big] + 800 C_U^2 \beta_U^2 r^* + \frac{\beta_U}{n \lfloor m/2 \rfloor}\big(C_U \beta_U + 44 B_K\big). \tag{10}$$
(ii) Control of $U^{(\mathrm{II})}_{nm}$. Using the pointwise bound $|\widehat{K}_n - K^{\circ}| \le 2 B_K$, we have
$$\mathbb{E}\big[U^{(\mathrm{II})}_{nm}\big] \le 2\,\mathbb{E}\big[\{e_1(T,T') - \mathcal{C}_{\beta_U} e_1(T,T')\}\{\widehat{K}_n(T,T') - K^{\circ}(T,T')\}\big] \le 4 B_K\, \mathbb{E}\big[|e_1(T,T')|\, \mathbb{1}\{|e_1(T,T')| > \beta_U\}\big].$$
Moreover, we have
$$|e_1(T,T')| \le 8 B_1^2 \exp\bigg(\frac{|e_1(T,T')|}{8 B_1^2}\bigg) \qquad \text{and} \qquad \mathbb{1}\{|e_1(T,T')| > \beta_U\} \le \exp\bigg(\frac{|e_1(T,T')| - \beta_U}{8 B_1^2}\bigg).$$
By the sub-Gaussian assumption,
$$\mathbb{E}\big[U^{(\mathrm{II})}_{nm}\big] \le 32 B_K B_1^2\, \mathbb{E}\bigg[\exp\bigg(\frac{|e_1(T,T')|}{4 B_1^2}\bigg)\bigg] \exp\bigg(-\frac{\beta_U}{8 B_1^2}\bigg) \le 64 B_K B_1^2 \exp\bigg(-\frac{\beta_U}{8 B_1^2}\bigg), \tag{11}$$
where we used $e^{1/4}\sqrt{2} < 2$ in the last step.

(iii) Control of $U^{(\mathrm{III})}_{nm}$. By Young's inequality, for any $\alpha > 0$,
$$\mathbb{E}\big[U^{(\mathrm{III})}_{nm}\big] = \mathbb{E}\Bigg[\int \bigg\{\frac{1}{n}\sum_{i=1}^{n} e_i(t,t')\bigg\}\big\{\widehat{K}_n(t,t') - K^{\circ}(t,t')\big\}\, dP_T^{\otimes 2}(t,t')\Bigg] \le \frac{\alpha}{2}\,\mathbb{E}\Bigg[\int \bigg\{\frac{1}{n}\sum_{i=1}^{n} e_i(t,t')\bigg\}^2 dP_T^{\otimes 2}(t,t')\Bigg] + \frac{1}{2\alpha}\,\mathbb{E}\big[\|\widehat{K}_n - K^{\circ}\|_2^2\big].$$
By sub-Gaussianity, $\mathbb{E}[X^4(t)] \le 16 B_1^4$, and hence
$$\mathbb{E}\Bigg[\int \bigg\{\frac{1}{n}\sum_{i=1}^{n} e_i(t,t')\bigg\}^2 dP_T^{\otimes 2}(t,t')\Bigg] \le \frac{16 B_1^4}{n}.$$
Therefore,
$$\mathbb{E}\big[U^{(\mathrm{III})}_{nm}\big] \le \frac{8 \alpha B_1^4}{n} + \frac{1}{2\alpha}\,\mathbb{E}\big[\|\widehat{K}_n - K^{\circ}\|_2^2\big]. \tag{12}$$
Taking $\beta_U = 8 B_1^2 \log n$ and combining (10)–(12), we obtain
$$\mathbb{E}[U_{nm}] \le \bigg(\frac{1}{C_U} + \frac{1}{2\alpha}\bigg)\mathbb{E}\big[\|\widehat{K}_n - K^{\circ}\|_2^2\big] + 800 C_U^2 \big(8 B_1^2 \log n\big)^2 r^* + \frac{8 B_1^2 \log n}{n \lfloor m/2 \rfloor}\big(8 C_U B_1^2 \log n + 44 B_K\big) + \frac{8\big(8 B_K B_1^2 + \alpha B_1^4\big)}{n}$$
$$=: c_{U,1}\,\mathbb{E}\big[\|\widehat{K}_n - K^{\circ}\|_2^2\big] + c_{U,2}\,\frac{B_K B_1^2 + B_1^4}{n} + c_{U,3}\,\frac{B_1^4 \log^2 n + B_K B_1^2 \log n}{n \lfloor m/2 \rfloor} + c_{U,4}\, B_1^4 \log^2 n \cdot r^*, \tag{13}$$
where $c_{U,1},\dots,c_{U,4}$ are universal constants.

Step 4 (Control of $V_{nm}$). As in Step 3, we decompose $V_{nm}$ using a clipping level $\beta_V > 0$:
$$V_{nm} \le \frac{1}{nm(m-1)}\sum_{i=1}^{n}\sum_{j \neq k} \big\{\mathcal{C}_{\beta_V}(\epsilon_{ik} X_i(T_{ij})) - \mathbb{E}[\mathcal{C}_{\beta_V}(\epsilon_{ik} X_i(T_{ij})) \mid X_i, T_{ij}]\big\}\big\{\widehat{K}_n(T_{ij}, T_{ik}) - K^{\circ}(T_{ij}, T_{ik})\big\}$$
$$+ \frac{1}{nm(m-1)}\sum_{i=1}^{n}\sum_{j \neq k} \big\{\epsilon_{ik} X_i(T_{ij}) - \mathcal{C}_{\beta_V}(\epsilon_{ik} X_i(T_{ij})) + \mathbb{E}[\mathcal{C}_{\beta_V}(\epsilon_{ik} X_i(T_{ij})) \mid X_i, T_{ij}]\big\}\big\{\widehat{K}_n(T_{ij}, T_{ik}) - K^{\circ}(T_{ij}, T_{ik})\big\} \;=:\; V^{(\mathrm{I})}_{nm} + V^{(\mathrm{II})}_{nm}.$$
(i) By Lemma F.3, for any $C_V > 0$,
$$\mathbb{E}\big[V^{(\mathrm{I})}_{nm}\big] \le \frac{1}{C_V}\mathbb{E}\big[\|\widehat{K}_n - K^{\circ}\|_2^2\big] + 3200 C_V^2 \beta_V^2 r^* + \frac{2\beta_V}{n \lfloor m/2 \rfloor}\big(C_V \beta_V + 26 B_K\big). \tag{14}$$
(ii) Since $\mathbb{E}[\epsilon_{ik} X_i(T_{ij}) \mid X_i, T_{ij}] = 0$, we have
$$\mathbb{E}[\mathcal{C}_{\beta_V}(\epsilon_{ik} X_i(T_{ij})) \mid X_i, T_{ij}] = -\mathbb{E}[\epsilon_{ik} X_i(T_{ij}) - \mathcal{C}_{\beta_V}(\epsilon_{ik} X_i(T_{ij})) \mid X_i, T_{ij}].$$
Therefore, by Jensen's inequality and the triangle inequality,
$$\mathbb{E}\big[\big|\epsilon_{ik} X_i(T_{ij}) - \mathcal{C}_{\beta_V}(\epsilon_{ik} X_i(T_{ij})) + \mathbb{E}[\mathcal{C}_{\beta_V}(\epsilon_{ik} X_i(T_{ij})) \mid X_i, T_{ij}]\big|\big] \le 2\,\mathbb{E}\big[|\epsilon_{ik} X_i(T_{ij})|\, \mathbb{1}\{|\epsilon_{ik} X_i(T_{ij})| > \beta_V\}\big].$$
Consequently,
$$\mathbb{E}\big[V^{(\mathrm{II})}_{nm}\big] \le 4 B_K\, \mathbb{E}\big[|\epsilon_{ik} X_i(T_{ij})|\, \mathbb{1}\{|\epsilon_{ik} X_i(T_{ij})| > \beta_V\}\big] \le 32 B_K B_1 B_2 \exp\bigg(-\frac{\beta_V}{2 B_1 B_2}\bigg), \tag{15}$$
where the last inequality follows from the sub-Gaussian assumption. Taking $\beta_V = 2 B_1 B_2 \log(n \lfloor m/2 \rfloor)$ and combining (14)–(15) yields
$$\mathbb{E}[V_{nm}] \le \frac{1}{C_V}\mathbb{E}\big[\|\widehat{K}_n - K^{\circ}\|_2^2\big] + 3200 C_V^2 \beta_V^2 r^* + \frac{2\beta_V}{n \lfloor m/2 \rfloor}\big(C_V \beta_V + 26 B_K\big) + \frac{32 B_K B_1 B_2}{n \lfloor m/2 \rfloor}$$
$$=: c_{V,1}\,\mathbb{E}\big[\|\widehat{K}_n - K^{\circ}\|_2^2\big] + c_{V,2}\, B_1 B_2\, \frac{B_1 B_2 \log^2(n \lfloor m/2 \rfloor) + B_K \log(n \lfloor m/2 \rfloor) + B_K}{n \lfloor m/2 \rfloor} + c_{V,3}\, B_1^2 B_2^2 \log^2(n \lfloor m/2 \rfloor)\, r^*, \tag{16}$$
where $c_{V,1}, c_{V,2}, c_{V,3}$ are universal constants.

Step 5 (Control of $W_{nm}$). With a clipping level $\beta_W > 0$, decompose
$$W_{nm} \le \frac{1}{nm(m-1)}\sum_{i=1}^{n}\sum_{j \neq k} \big\{\mathcal{C}_{\beta_W}(\epsilon_{ij}\epsilon_{ik}) - \mathbb{E}[\mathcal{C}_{\beta_W}(\epsilon_{ij}\epsilon_{ik})]\big\}\big\{\widehat{K}_n(T_{ij}, T_{ik}) - K^{\circ}(T_{ij}, T_{ik})\big\}$$
$$+ \frac{1}{nm(m-1)}\sum_{i=1}^{n}\sum_{j \neq k} \big\{\epsilon_{ij}\epsilon_{ik} - \mathcal{C}_{\beta_W}(\epsilon_{ij}\epsilon_{ik}) + \mathbb{E}[\mathcal{C}_{\beta_W}(\epsilon_{ij}\epsilon_{ik})]\big\}\big\{\widehat{K}_n(T_{ij}, T_{ik}) - K^{\circ}(T_{ij}, T_{ik})\big\} \;=:\; W^{(\mathrm{I})}_{nm} + W^{(\mathrm{II})}_{nm}.$$
(i) By Lemma F.4, for any $C_W > 0$,
$$\mathbb{E}\big[W^{(\mathrm{I})}_{nm}\big] \le \frac{1}{C_W}\mathbb{E}\big[\|\widehat{K}_n - K^{\circ}\|_2^2\big] + 3200 C_W^2 \beta_W^2 r^* + \frac{2\beta_W}{n \lfloor m/2 \rfloor}\big(C_W \beta_W + 26 B_K\big). \tag{17}$$
(ii) Since $\mathbb{E}[\epsilon_{ij}\epsilon_{ik}] = 0$, the same argument as in Step 4 gives
$$\mathbb{E}\big[\big|\epsilon_{ij}\epsilon_{ik} - \mathcal{C}_{\beta_W}(\epsilon_{ij}\epsilon_{ik}) + \mathbb{E}[\mathcal{C}_{\beta_W}(\epsilon_{ij}\epsilon_{ik})]\big|\big] \le 2\,\mathbb{E}\big[|\epsilon_{ij}\epsilon_{ik}|\, \mathbb{1}\{|\epsilon_{ij}\epsilon_{ik}| > \beta_W\}\big],$$
and hence
$$\mathbb{E}\big[W^{(\mathrm{II})}_{nm}\big] \le 4 B_K\, \mathbb{E}\big[|\epsilon_{ij}\epsilon_{ik}|\, \mathbb{1}\{|\epsilon_{ij}\epsilon_{ik}| > \beta_W\}\big] \le 32 B_K B_2^2 \exp\bigg(-\frac{\beta_W}{2 B_2^2}\bigg). \tag{18}$$
Taking $\beta_W = 2 B_2^2 \log(n \lfloor m/2 \rfloor)$ and combining (17)–(18) yields
$$\mathbb{E}[W_{nm}] \le \frac{1}{C_W}\mathbb{E}\big[\|\widehat{K}_n - K^{\circ}\|_2^2\big] + 3200 C_W^2 \beta_W^2 r^* + \frac{2\beta_W}{n \lfloor m/2 \rfloor}\big(C_W \beta_W + 26 B_K\big) + \frac{32 B_K B_2^2}{n \lfloor m/2 \rfloor}$$
$$=: c_{W,1}\,\mathbb{E}\big[\|\widehat{K}_n - K^{\circ}\|_2^2\big] + c_{W,2}\, B_2^2\, \frac{B_2^2 \log^2(n \lfloor m/2 \rfloor) + B_K \log(n \lfloor m/2 \rfloor) + B_K}{n \lfloor m/2 \rfloor} + c_{W,3}\, B_2^4 \log^2(n \lfloor m/2 \rfloor)\, r^*, \tag{19}$$
where $c_{W,1}, c_{W,2}, c_{W,3}$ are universal constants.

Step 6 (Aggregation). Substituting (13), (16), and (19) into (9) and collecting terms, we arrive at
$$\mathbb{E}\|\widehat{K}_n - K^{\circ}\|_2^2 \le \mathrm{Const.} \times \Bigg[ \mathbb{E}\|K - K^{\circ}\|_2^2 + \frac{B_K B_1^2 + B_1^4}{n} + \frac{1}{n \lfloor m/2 \rfloor}\Big\{ (B_1^2 B_2^2 + B_2^4)\log^2(n \lfloor m/2 \rfloor) + (B_1 B_2 B_K + B_K B_2^2)\log(n \lfloor m/2 \rfloor) + B_1^4 \log^2 n + B_K B_1^2 \log n + B_1 B_2 B_K + B_K B_2^2 + B_K^2 \Big\} + r^*\Big\{ B_1^4 \log^2 n + (B_1^2 B_2^2 + B_2^4)\log^2(n \lfloor m/2 \rfloor) + B_K^2 \Big\} \Bigg].$$
In particular, absorbing the fixed parameters $B_1$, $B_2$, $B_K$ into the overall constant, we obtain the oracle inequality
$$\mathbb{E}\|\widehat{K}_n - K^{\circ}\|_2^2 \le \mathrm{Const.} \times \bigg\{ \inf_{K \in \mathcal{K}_{nm}} \mathbb{E}\|K - K^{\circ}\|_2^2 + \frac{1}{n} + \log^2(n \lfloor m/2 \rfloor)\,\frac{1}{n \lfloor m/2 \rfloor} + r^* \bigg\}.$$

APPENDIX C: PROOF OF COROLLARY 3.5

LEMMA C.1 (Dudley's entropy integral; Theorem 2 in Srebro and Sridharan (2010)). Let $\mathcal{F}$ be a function class and suppose that $\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} (f(X_i) - f^{\circ}(X_i))^2 \le c_1^2$ for given $X = \{X_i,\, 1 \le i \le n\}$. Then we have
$$\mathbb{E}\Bigg[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \sigma_i\big(f(X_i) - f^{\circ}(X_i)\big)\Bigg] \le \inf_{0 \le t \le c_1}\Bigg\{ 4t + \frac{12}{\sqrt{n}}\int_t^{c_1} \sqrt{\log N(\varepsilon, \mathcal{F}|_X, L_2)}\, d\varepsilon \Bigg\},$$
where the expectation is taken with respect to the $\sigma_i$.

LEMMA C.2 (Corollary 2.2 in Bartlett, Bousquet and Mendelson (2005)). Let $\mathcal{F}$ be a function class and suppose that $\|f\|_{\infty} \le 1$ for any $f \in \mathcal{F}$. Suppose that positive numbers $r$ and $t$ satisfy
$$r \ge 10\,\mathbb{E}\Bigg[\sup_{f \in \mathcal{F}(r)} \frac{1}{n}\sum_{i=1}^{n} \sigma_i\big(f(X_i) - f^{\circ}(X_i)\big)\Bigg] + \frac{11 t}{n},$$
where $\{X_i,\, 1 \le i \le n\}$ are i.i.d. random variables. Then, with probability at least $1 - e^{-t}$, the following holds:
$$\mathcal{F}(r) \subset \Bigg\{ f \in \mathcal{F} : \frac{1}{n}\sum_{i=1}^{n} \big(f(X_i) - f^{\circ}(X_i)\big)^2 \le 2r \Bigg\}.$$

PROOF (Proof of Corollary 3.5). Without loss of generality, we may assume that $\|K^{\circ}\|_{\infty} \le 1$ and $\|K\|_{\infty} \le 1$ for all $K \in \mathcal{K}_{nm}$. By Theorem 3.4, we have
$$\mathbb{E}\big[\|\widehat{K}_n - K^{\circ}\|_2^2\big] \le c \times \bigg\{ \inf_{K \in \mathcal{K}_{nm}} \mathbb{E}\|K - K^{\circ}\|_2^2 + \frac{1}{n} + \log^2(n \lfloor m/2 \rfloor)\,\frac{1}{n \lfloor m/2 \rfloor} + r^* \bigg\}. \tag{20}$$
For notational simplicity, write $T_{ij1} := T_{ij}$ and $T_{ij2} := T_{i(\lfloor m/2 \rfloor + j)}$ for the disjoint within-subject pairs of Definition 3.3. Define
$$r_1 := \sup\Bigg\{ r > 0 : 10\,\mathbb{E}\Bigg[\sup_{K \in \mathcal{K}(r)} \frac{1}{n \lfloor m/2 \rfloor}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2 \rfloor} \sigma_{ij}\{K(T_{ij1}, T_{ij2}) - K^{\circ}(T_{ij1}, T_{ij2})\}\Bigg] + \frac{11 \log(n \lfloor m/2 \rfloor)}{n \lfloor m/2 \rfloor} \le r \Bigg\}.$$
Since $\mathbb{E}\big[\sup_{K \in \mathcal{K}(r)} \frac{1}{n \lfloor m/2 \rfloor}\sum_{i}\sum_{j} \sigma_{ij}\{K(T_{ij1}, T_{ij2}) - K^{\circ}(T_{ij1}, T_{ij2})\}\big]$ is monotone nondecreasing in $r$, it follows that
$$r_1 = 10\,\mathbb{E}\Bigg[\sup_{K \in \mathcal{K}(r_1)} \frac{1}{n \lfloor m/2 \rfloor}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2 \rfloor} \sigma_{ij}\{K(T_{ij1}, T_{ij2}) - K^{\circ}(T_{ij1}, T_{ij2})\}\Bigg] + \frac{11 \log(n \lfloor m/2 \rfloor)}{n \lfloor m/2 \rfloor}. \tag{21}$$
Moreover, we have $r^* \le r_1$. For any $r \ge r_1$, let
$$\widehat{\mathcal{K}}(2r) := \Bigg\{ K \in \mathcal{K}_{nm} : \frac{1}{n \lfloor m/2 \rfloor}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2 \rfloor} \{K(T_{ij1}, T_{ij2}) - K^{\circ}(T_{ij1}, T_{ij2})\}^2 \le 2r \Bigg\}.$$
Then the local Rademacher complexity $R_{nm}(\mathcal{K}(r))$ satisfies
$$R_{nm}(\mathcal{K}(r)) := \mathbb{E}\Bigg[\sup_{K \in \mathcal{K}(r)} \frac{1}{n \lfloor m/2 \rfloor}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2 \rfloor} \sigma_{ij}\{K(T_{ij1}, T_{ij2}) - K^{\circ}(T_{ij1}, T_{ij2})\}\Bigg] \le \mathbb{E}\Bigg[\sup_{K \in \widehat{\mathcal{K}}(2r)} \frac{1}{n \lfloor m/2 \rfloor}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2 \rfloor} \sigma_{ij}\{K(T_{ij1}, T_{ij2}) - K^{\circ}(T_{ij1}, T_{ij2})\}\Bigg] + \mathbb{E}\Bigg[\sup_{K \in \mathcal{K}(r) \setminus \widehat{\mathcal{K}}(2r)} \frac{1}{n \lfloor m/2 \rfloor}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2 \rfloor} \sigma_{ij}\{K(T_{ij1}, T_{ij2}) - K^{\circ}(T_{ij1}, T_{ij2})\}\Bigg].$$
By Lemma C.1, the first term is bounded by
$$\mathbb{E}\Bigg[\sup_{K \in \widehat{\mathcal{K}}(2r)} \frac{1}{n \lfloor m/2 \rfloor}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2 \rfloor} \sigma_{ij}\{K(T_{ij1}, T_{ij2}) - K^{\circ}(T_{ij1}, T_{ij2})\}\Bigg] \le \frac{4}{n \lfloor m/2 \rfloor} + \frac{12}{\sqrt{n \lfloor m/2 \rfloor}}\int_{1/(n \lfloor m/2 \rfloor)}^{\sqrt{2r}} \sqrt{\log N(\varepsilon, \mathcal{K}|_{D_n}, L_2)}\, d\varepsilon \le \frac{4}{n \lfloor m/2 \rfloor} + \frac{12\sqrt{2r}}{\sqrt{n \lfloor m/2 \rfloor}}\sqrt{\log N\bigg(\frac{1}{n \lfloor m/2 \rfloor}, \mathcal{K}|_{D_n}, L_2\bigg)}.$$
For the second term, we have
$$\mathbb{E}\Bigg[\sup_{K \in \mathcal{K}(r) \setminus \widehat{\mathcal{K}}(2r)} \frac{1}{n \lfloor m/2 \rfloor}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2 \rfloor} \sigma_{ij}\{K(T_{ij1}, T_{ij2}) - K^{\circ}(T_{ij1}, T_{ij2})\}\Bigg] \le 2\,\mathbb{P}\Bigg(\exists K \in \mathcal{K}(r) : \frac{1}{n \lfloor m/2 \rfloor}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2 \rfloor} \{K(T_{ij1}, T_{ij2}) - K^{\circ}(T_{ij1}, T_{ij2})\}^2 > 2r \Bigg) \le \frac{2}{n \lfloor m/2 \rfloor},$$
where the last inequality follows from Lemma C.2 together with (21). Hence,
$$R_{nm}(\mathcal{K}(r)) \le \frac{6}{n \lfloor m/2 \rfloor} + \frac{12\sqrt{2r}}{\sqrt{n \lfloor m/2 \rfloor}}\sqrt{\log N\bigg(\frac{1}{n \lfloor m/2 \rfloor}, \mathcal{K}|_{D_n}, L_2\bigg)}.$$
Moreover, we obtain
$$r_1 \le \frac{60}{n \lfloor m/2 \rfloor} + \frac{120\sqrt{2 r_1}}{\sqrt{n \lfloor m/2 \rfloor}}\,\mathbb{E}\Bigg[\sqrt{\log N\bigg(\frac{1}{n \lfloor m/2 \rfloor}, \mathcal{K}|_{D_n}, L_2\bigg)}\Bigg] + \frac{11 \log(n \lfloor m/2 \rfloor)}{n \lfloor m/2 \rfloor} \le \frac{60}{n \lfloor m/2 \rfloor} + \frac{r_1}{2} + \frac{120^2 \cdot 2}{n \lfloor m/2 \rfloor}\,\mathbb{E}\Bigg[\log N\bigg(\frac{1}{n \lfloor m/2 \rfloor}, \mathcal{K}|_{D_n}, L_2\bigg)\Bigg] + \frac{11 \log(n \lfloor m/2 \rfloor)}{n \lfloor m/2 \rfloor},$$
where the second inequality follows from the arithmetic–geometric mean inequality. Therefore, there exists a constant $C > 0$ such that
$$r_1 \le \frac{C}{n \lfloor m/2 \rfloor}\bigg\{ \mathbb{E}\bigg[\log N\bigg(\frac{1}{n \lfloor m/2 \rfloor}, \mathcal{K}|_{D_n}, L_2\bigg)\bigg] + \log(n \lfloor m/2 \rfloor) \bigg\}.$$
By Theorem 9.4 of Györfi et al. (2002) and the relation between covering numbers and packing numbers,
$$\log N\bigg(\frac{1}{n \lfloor m/2 \rfloor}, \mathcal{K}|_{D_n}, L_2\bigg) \le \mathrm{VCdim}(\mathcal{K}_{nm})\,\log\Big( 2e\,(n \lfloor m/2 \rfloor)^2 \log\big(3e\,(n \lfloor m/2 \rfloor)^2\big) \Big) + \log 3.$$
Since $n \lfloor m/2 \rfloor \ge 1$, the statement of the corollary follows.

APPENDIX D: PROOF OF THEOREM 4.2

LEMMA D.1 (Proposition 2 in Suzuki and Nitanda (2021), the case $r = 2$). Let $0 < p, q \le \infty$ and $\beta \in \mathbb{R}_+^{2d}$ satisfy $\tilde{\beta}/2 > (1/p - 1/2)_+$. Assume further that $l \in \mathbb{N}$ satisfies $0 < \bar{\beta} < \min(l,\, l - 1 + 1/p)$. Set
$$\delta = 2(1/p - 1/2)_+, \qquad \nu = \frac{\tilde{\beta}/2 - \delta}{2\delta}, \qquad W_0 := 12\, d\, l\, (l+2) + 4d.$$
Then, for any $f \in B^{\beta}_{p,q}([0,1]^{2d})$ with $\|f\|_{B^{\beta}_{p,q}} \le 1$ and for any $N \in \mathbb{N}$, there exists a neural network $\tilde{f} \in \mathcal{F}(L, W, S)$ satisfying
(i) $L = 3 + 2\log_2\big(\frac{3(2d \vee l)}{\epsilon\, c(d,l)}\big) + 5\log_2(2d \vee l)$, (ii) $W = N W_0$, (iii) $S = \big((L-1) W_0^2 + 1\big)N$,
and $\|\tilde{f} - f\|_2 \le N^{-\tilde{\beta}}$, where $\epsilon = N^{-\tilde{\beta}}\log(N)^{-1}$, and $c(d,l)$ is a constant depending only on $d$ and $l$.

PROOF OF THEOREM 4.2. In this proof, $\mathrm{Const.}$ denotes a positive constant whose value may change from line to line. By Theorem 6 in Bartlett et al. (2019), the VC dimension of the class of DNNs $\mathcal{F}(L, W, S)$ is bounded as
$$\mathrm{VCdim} \le \mathrm{Const.}\; L S \log(S),$$
where $\mathrm{Const.}$ is independent of $L$ and $S$. Substituting this bound into the result of Corollary 3.5, we obtain
$$\mathbb{E}\big[\|\widehat{K}_n - K^{\circ}\|_2^2\big] \le \mathrm{Const.}\bigg\{ \inf_{K \in \mathcal{K}(L,W,S)} \mathbb{E}\big[\|K - K^{\circ}\|_2^2\big] + \frac{1}{n} + \frac{L S \log(S)\log(n \lfloor m/2 \rfloor)}{n \lfloor m/2 \rfloor} \bigg\}, \tag{22}$$
where $\mathrm{Const.}$ is independent of $n$ and $m$.
By Lemma D.1, taking $N = (n \lfloor m/2 \rfloor)^{1/(\tilde{\beta}+1)}$ and the corresponding $L$, $W$, and $S$, we have
$$\inf_{K \in \mathcal{K}(L,W,S)} \|K - K^{\circ}\|_2^2 = \inf_{h \in \mathcal{F}(L,W,S)} \Big\|\tfrac{1}{2}\{h(\cdot,\star) + h(\star,\cdot)\} - K^{\circ}\Big\|_2^2 \le \frac{1}{2}\inf_{h \in \mathcal{F}(L,W,S)} \|h(\cdot,\star) - K^{\circ}\|_2^2 + \frac{1}{2}\inf_{h \in \mathcal{F}(L,W,S)} \|h(\star,\cdot) - K^{\circ}\|_2^2 = \inf_{h \in \mathcal{F}(L,W,S)} \|h(\cdot,\star) - K^{\circ}\|_2^2 \le (n \lfloor m/2 \rfloor)^{-\tilde{\beta}/(\tilde{\beta}+1)},$$
where the last equality uses the symmetry of $K^{\circ}$. By substituting this bound and the values of $L$ and $S$ corresponding to $N = (n \lfloor m/2 \rfloor)^{1/(\tilde{\beta}+1)}$ into (22), we obtain the desired bound.

APPENDIX E: PROOF OF THEOREM 4.3

LEMMA E.1 (Lemma 1.2 in Chen and Liu (2022)). Let $D$ be any bounded subset of $\mathbb{R}^d$ and $C_D := \sup_{x \in [-1,1]^d} \sup_{y \in D} |x^{\top} y|$. For $\epsilon \in (0, 1/2)$, there exists a deep ReLU network $\tilde{X}_{\epsilon}$ with $O(\log(C_D M)\log^2(1/\epsilon))$ layers and $O(\log(C_D M)\log^3(1/\epsilon))$ nodes such that
$$\big|\tilde{X}_{\epsilon}(x) - \cos(2\pi w^{\top} x)\big| \le \epsilon, \qquad \text{for } w \in [-M, M]^d,\ x \in D.$$

LEMMA E.2. Let $\{\psi_j\}_{j \ge 1}$ be the Fourier basis of $L^2([0,1])$ given by
$$\psi_1(t) := 1, \qquad \psi_{2j}(t) := \sqrt{2}\cos(2\pi j t), \qquad \psi_{2j+1}(t) := \sqrt{2}\sin(2\pi j t), \qquad j \in \mathbb{N}.$$
For any $j, k \in \mathbb{N}$, let $\Lambda_{j,k} = \log(2(j \vee k))$. There exists $c_{L,W} > 0$, independent of $j$ and $k$, such that the following holds. For any $\epsilon$ and integers $L$ and $W$ satisfying $L \le c_{L,W} \Lambda_{j,k} \log^2(1/\epsilon)$ and $W \le c_{L,W} \Lambda_{j,k} \log^3(1/\epsilon)$, one can construct a deep ReLU network $\tilde{\Psi}_{jk} : [0,1]^2 \to \mathbb{R}$ with depth $L$ and width $W$ satisfying
$$\big\|\psi_j(s)\psi_k(t) - \tilde{\Psi}_{jk}(s,t)\big\|_{\infty} \le \epsilon.$$

PROOF. We first note that, by definition of the Fourier basis, each $\psi_j$ is either the constant function $1$, or a constant multiple of $\cos(2\pi l t)$, or a constant multiple of $\sin(2\pi l t)$ for some integer $l \ge 1$. Moreover, by construction, $\psi_1$ has frequency $0$, and both $\psi_{2j}$ and $\psi_{2j+1}$ have frequency $j$. Thus, the frequency associated with $\psi_j$ is given by $l_j := \lfloor j/2 \rfloor$. Hence, for any $j, k \ge 1$,
$$l_j \vee l_k = \Big\lfloor \frac{j}{2} \Big\rfloor \vee \Big\lfloor \frac{k}{2} \Big\rfloor \le \frac{j}{2} \vee \frac{k}{2} = \frac{1}{2}(j \vee k) \le j \vee k.$$
If either $\psi_j$ or $\psi_k$ is constant, then $\psi_j(s)\psi_k(t)$ reduces to a one-dimensional sine or cosine function with frequency at most $j \vee k$, and the result follows from Lemma E.1. We therefore consider the nonconstant case. In this case, up to a multiplicative constant, $\psi_j(s)\psi_k(t)$ is a product of sine and/or cosine functions with frequencies $l_j$ and $l_k$. By standard trigonometric identities, it can be written as a linear combination of at most two functions of the form $\sin(2\pi(\pm l_j s \pm l_k t))$ or $\cos(2\pi(\pm l_j s \pm l_k t))$.

Let $u := (s,t) \in [0,1]^2$. Each function above is of the form $\sin(2\pi \xi^{\top} u)$ or $\cos(2\pi \xi^{\top} u)$, where $\xi \in \{\pm l_j\} \times \{\pm l_k\}$. Since $l_j \vee l_k \le j \vee k$, we have $\|\xi\|_{\infty} \le j \vee k$. We first consider $\cos(2\pi \xi^{\top} u)$. For a deep ReLU network $F$, let $L(F)$ and $W(F)$ denote its depth and width, respectively. Applying Lemma E.1 with $d = 2$, $D = [0,1]^2$, and $M = j \vee k$, it follows that, for any $\epsilon \in (0, 1/2)$, there exists a deep ReLU network $F^{(\cos)}_{\epsilon}$ such that $\|F^{(\cos)}_{\epsilon}(\cdot) - \cos(2\pi \xi^{\top} \cdot)\|_{\infty} \le \epsilon$ and
$$L\big(F^{(\cos)}_{\epsilon}\big) \le c_L \Lambda_{j,k} \log^2(1/\epsilon), \qquad W\big(F^{(\cos)}_{\epsilon}\big) \le c_W \Lambda_{j,k} \log^3(1/\epsilon),$$
where $c_L, c_W > 0$ are constants independent of $\epsilon$, $j$, and $k$. The same conclusion holds for $\sin(2\pi \xi^{\top} u)$, since $\sin(2\pi \xi^{\top} u) = \cos(2\pi \xi^{\top} u - \pi/2)$. Thus, the statement of the lemma is proved with $c_{L,W} := c_L \vee c_W$.
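For completeness, the "standard trigonometric identities" invoked above are the product-to-sum formulas:
$$\cos a \cos b = \tfrac{1}{2}\{\cos(a-b) + \cos(a+b)\}, \qquad \sin a \sin b = \tfrac{1}{2}\{\cos(a-b) - \cos(a+b)\}, \qquad \sin a \cos b = \tfrac{1}{2}\{\sin(a+b) + \sin(a-b)\}.$$
Applying these with $a = 2\pi l_j s$ and $b = 2\pi l_k t$ shows that each product $\psi_j(s)\psi_k(t)$ is indeed a linear combination of at most two terms of the form $\cos(2\pi(\pm l_j s \pm l_k t))$ or $\sin(2\pi(\pm l_j s \pm l_k t))$, each of which Lemma E.1 can approximate.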
PROOF OF THEOREM 4.3. In this proof, $\mathrm{Const.}$ denotes a positive constant whose value may change from line to line. For $M \in \mathbb{N}$, define $I_M := \{(j,k) \in \mathbb{N}^2 : jk \le M\}$, and define the truncation of $K^{\circ}$ at level $M$ by
$$K^{\circ}_M(s,t) := \sum_{(j,k) \in I_M} c_{jk}\,\psi_j(s)\psi_k(t).$$
Since $\{\psi_j\psi_k\}$ is an orthonormal basis of $L^2(\mathcal{T}^2)$, we have
$$\|K^{\circ}_M - K^{\circ}\|_2^2 = \sum_{(j,k) \notin I_M} c^2_{jk}.$$
By the definition of $\mathcal{K}^{\alpha}_{\mathrm{TP}}$ and $\rho_j = (2\pi\lfloor j/2 \rfloor)^{-2\alpha}$ for $j \ge 2$,
$$\sum_{(j,k) \notin I_M} c^2_{jk} \le \bigg(\sup_{jk > M} \rho_j \rho_k\bigg) \sum_{j,k \ge 1} \frac{c^2_{jk}}{\rho_j \rho_k} \le R^2 \sup_{jk > M} \rho_j \rho_k \le \mathrm{Const.} \times \frac{R^2}{M^{2\alpha}},$$
where $\mathrm{Const.}$ is independent of $M$. Thus, we obtain
$$\|K^{\circ}_M - K^{\circ}\|_2^2 \lesssim \frac{R^2}{M^{2\alpha}}. \tag{23}$$
By Lemma E.2, for any $\epsilon \in (0, 1/2)$ and each $(j,k) \in I_M$, there exists a deep ReLU network $\tilde{\Psi}_{jk}$ satisfying
$$L(\tilde{\Psi}_{jk}) \lesssim \log(2M)\log^2(1/\epsilon), \qquad W(\tilde{\Psi}_{jk}) \lesssim \log(2M)\log^3(1/\epsilon), \qquad \text{and} \qquad \|\psi_j\psi_k - \tilde{\Psi}_{jk}\|_{\infty} \le \epsilon.$$
Now, we define an approximator of $K^{\circ}_M$ as
$$\tilde{K}_{M,\epsilon}(s,t) := \sum_{(j,k) \in I_M} c_{jk}\,\tilde{\Psi}_{jk}(s,t).$$
By parallelization and a final affine output layer, $\tilde{K}_{M,\epsilon}$ can be represented by a deep ReLU network. Here, we note that
$$\sum_{(j,k) \in I_M} |c_{jk}| \le \bigg(\sum_{(j,k) \in I_M} \frac{c^2_{jk}}{\rho_j \rho_k}\bigg)^{1/2}\bigg(\sum_{(j,k) \in I_M} \rho_j \rho_k\bigg)^{1/2} \le \mathrm{Const.} \times R,$$
where $\mathrm{Const.}$ is a global constant. Thus, we obtain
$$\|K^{\circ}_M - \tilde{K}_{M,\epsilon}\|_2 = \bigg\|\sum_{(j,k) \in I_M} c_{jk}\big(\psi_j\psi_k - \tilde{\Psi}_{jk}\big)\bigg\|_2 \le \sum_{(j,k) \in I_M} |c_{jk}|\, \big\|\psi_j\psi_k - \tilde{\Psi}_{jk}\big\|_2 \le \mathrm{Const.} \times R \max_{(j,k) \in I_M}\big\|\psi_j\psi_k - \tilde{\Psi}_{jk}\big\|_{\infty} \le \mathrm{Const.} \times R\epsilon. \tag{24}$$
Combining (23) and (24), we obtain
$$\|K^{\circ} - \tilde{K}_{M,\epsilon}\|_2^2 \le \mathrm{Const.} \times R^2\big(M^{-2\alpha} + \epsilon^2\big).$$
Thus, we take $M^{-\alpha} \asymp \epsilon$. Now, we bound the number of nonzero parameters of $\tilde{K}_{M,\epsilon}$. By construction, for each $(j,k)$, the network $\tilde{\Psi}_{jk}$ satisfies
$$L(\tilde{\Psi}_{jk}) \lesssim \log(2M)\log^2(1/\epsilon) \lesssim \log^3(M), \qquad W(\tilde{\Psi}_{jk}) \lesssim \log(2M)\log^3(1/\epsilon) \lesssim \log^4(M).$$
Hence, the number of nonzero parameters satisfies
$$S(\tilde{\Psi}_{jk}) \lesssim L(\tilde{\Psi}_{jk})\, W(\tilde{\Psi}_{jk})^2 \lesssim \log^{11}(M).$$
Since the network $\tilde{K}_{M,\epsilon}$ is obtained by combining $|I_M|$ such networks in parallel,
$$S(\tilde{K}_{M,\epsilon}) \lesssim |I_M| \log^{11}(M) \lesssim M \log^{12}(M).$$
Here, we used
$$|I_M| = \#\{(j,k) \in \mathbb{N}^2 \mid jk \le M\} = \sum_{j=1}^{M} \lfloor M/j \rfloor \lesssim M \log(M).$$
From Theorem 6 in Bartlett et al. (2019), the VC dimension of the network class $\mathcal{F}(L, W, S)$ is bounded by $\mathrm{VCdim}(\mathcal{F}(L, W, S)) \lesssim L S \log(S)$. Thus, we consider the network model $\mathcal{F}(L, W, S)$ with $L \asymp \log^3(M)$, $W \asymp M \log^5(M)$, and $S \asymp M \log^{12}(M)$. With this choice of $L$, $W$, and $S$, and taking $M \asymp (n \lfloor m/2 \rfloor)^{1/(2\alpha+1)}$, Corollary 3.5 yields
$$\mathbb{E}\big[\|\widehat{K}_n - K^{\circ}\|_2^2\big] \le \mathrm{Const.} \times \bigg\{ M^{-2\alpha} + \frac{1}{n} + \frac{L S \log(S)\log(n \lfloor m/2 \rfloor)}{n \lfloor m/2 \rfloor} \bigg\} \le \mathrm{Const.} \times \bigg\{ \frac{1}{n} + \frac{1}{(n \lfloor m/2 \rfloor)^{2\alpha/(2\alpha+1)}}\log^{17}(n \lfloor m/2 \rfloor) \bigg\},$$
where $\mathrm{Const.}$ is independent of $n$ and $m$.
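The counting bound $|I_M| = \sum_{j=1}^{M}\lfloor M/j \rfloor \lesssim M \log M$ used above is the size of a hyperbolic cross and is easy to check numerically:

```python
import math

for M in (10, 100, 1000, 10_000):
    size = sum(M // j for j in range(1, M + 1))   # |I_M| = sum_{j <= M} floor(M/j)
    print(M, size, size / (M * math.log(M)))      # ratio stays bounded (in fact tends to 1)
```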
APPENDIX F: TECHNICAL LEMMAS FOR THEOREM 3.4

Here, we state several technical lemmas used in the proof of Theorem 3.4. The following lemmas are obtained via a peeling device (van de Geer, 2000) and a conditioning argument, combined with Lemma 3.1. Proofs of these technical lemmas are given in Appendix G.

LEMMA F.1. Let $C > 1/(40\sqrt{2}B_K)$. Then, for any $x\ge0$, with probability at least $1-e^{-x}$, the following inequality holds:
\[
\|\widehat K_n - K^\circ\|_2^2 - \frac{1}{nm(m-1)}\sum_{i=1}^{n}\sum_{j\ne k}\big\{\widehat K_n(T_{ij},T_{ik}) - K^\circ(T_{ij},T_{ik})\big\}^2 \le \frac{1}{C}\|\widehat K_n - K^\circ\|_2^2 + 3200\,CB_K^2\,r^* + \frac{2B_K^2x}{n\lfloor m/2\rfloor}(C+22).
\]

LEMMA F.2. Let $\tilde e_i(T_{ij},T_{ik}) := C_{\beta_U}(e_i(T_{ij},T_{ik}))$. For any $C_U > 1/(20\sqrt{2}\beta_U)$, the following inequality holds:
\[
\mathbb{E}\Big[\frac{1}{nm(m-1)}\sum_{i=1}^{n}\sum_{j\ne k}\Big(\tilde e_i(T_{ij},T_{ik})\big\{\widehat K_n(T_{ij},T_{ik}) - K^\circ(T_{ij},T_{ik})\big\} - \langle\tilde e_i, \widehat K_n - K^\circ\rangle\Big)\Big] \le \frac{1}{C_U}\mathbb{E}\big[\|\widehat K_n - K^\circ\|_2^2\big] + 800\,C_U^2\beta_U^2\,r^* + \frac{\beta_U}{n\lfloor m/2\rfloor}(C_U\beta_U + 44B_K),
\]
where $\langle\tilde e_i, K\rangle := \int_{\mathcal{T}\times\mathcal{T}}\tilde e_i(s,t)K(s,t)\,dP_T^{\otimes2}(s,t)$.

LEMMA F.3. Let $e^{(V)}_i(T_{ij},\epsilon_{ik}) := C_{\beta_V}(\epsilon_{ik}X_i(T_{ij})) - \mathbb{E}\big[C_{\beta_V}(\epsilon_{ik}X_i(T_{ij}))\mid X_i, T_{ij}\big]$. For any $C_V > 1/(40\sqrt{2}\beta_V)$, the following inequality holds:
\[
\mathbb{E}\Big[\frac{1}{nm(m-1)}\sum_{i=1}^{n}\sum_{j\ne k} e^{(V)}_i(T_{ij},\epsilon_{ik})\big\{\widehat K_n(T_{ij},T_{ik}) - K^\circ(T_{ij},T_{ik})\big\}\Big] \le \frac{1}{C_V}\mathbb{E}\big[\|\widehat K_n - K^\circ\|_2^2\big] + 3200\,C_V^2\beta_V^2\,r^* + \frac{2\beta_V}{n\lfloor m/2\rfloor}(C_V\beta_V + 26B_K).
\]

LEMMA F.4. Let $e^{(W)}_i(\epsilon_{ij},\epsilon_{ik}) := C_{\beta_W}(\epsilon_{ij}\epsilon_{ik}) - \mathbb{E}\big[C_{\beta_W}(\epsilon_{ij}\epsilon_{ik})\big]$. For any $C_W > 1/(40\sqrt{2}\beta_W)$, the following inequality holds:
\[
\mathbb{E}\Big[\frac{1}{nm(m-1)}\sum_{i=1}^{n}\sum_{j\ne k} e^{(W)}_i(\epsilon_{ij},\epsilon_{ik})\big\{\widehat K_n(T_{ij},T_{ik}) - K^\circ(T_{ij},T_{ik})\big\}\Big] \le \frac{1}{C_W}\mathbb{E}\big[\|\widehat K_n - K^\circ\|_2^2\big] + 3200\,C_W^2\beta_W^2\,r^* + \frac{2\beta_W}{n\lfloor m/2\rfloor}(C_W\beta_W + 26B_K).
\]

PROOF. The proof of this lemma is omitted, as it is essentially the same as that of Lemma F.3.

APPENDIX G: PROOFS OF TECHNICAL LEMMAS

G.1. Proof of Lemma F.1.

PROOF OF LEMMA F.1. The proof follows the localization analysis (Bartlett, Bousquet and Mendelson, 2005; Blanchard, Bousquet and Massart, 2008) combined with Lemma 3.1. First, we apply the peeling device (e.g., van de Geer (2000)) together with the usual symmetrization technique (e.g., van der Vaart and Wellner, 1996) to derive a tight inequality. Let $\{T'_{ij}\}$ be an independent copy of $\{T_{ij}\}$. For simplicity of notation, let $j_1 := j$ and $j_2 := \lfloor m/2\rfloor + j$.
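The index convention $j_1 = j$, $j_2 = \lfloor m/2\rfloor + j$ splits the $m$ within-subject locations into $\lfloor m/2\rfloor$ disjoint pairs, so that distinct summands share no observation location. A minimal sketch of this pairing (illustrative only; the array name T_i and the sizes are hypothetical):

    import numpy as np

    def disjoint_pairs(T_i):
        # Split the m locations of one subject into floor(m/2) disjoint pairs
        # (T_{i,j}, T_{i,floor(m/2)+j}); if m is odd, the last location is unused.
        m = len(T_i)
        half = m // 2
        return [(T_i[j], T_i[half + j]) for j in range(half)]

    T_i = np.random.default_rng(1).uniform(size=7)  # m = 7 locations for subject i
    print(disjoint_pairs(T_i))                      # floor(7/2) = 3 disjoint pairs

With this pairing in hand, the proof proceeds as follows.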
We have
\[
\int_{\mathcal{T}\times\mathcal{T}}(K-K^\circ)^2\,d(P_T\otimes P_T) - \big\{K(T_{ij_1},T_{ij_2}) - K^\circ(T_{ij_1},T_{ij_2})\big\}^2 = \mathbb{E}_{T'}\Big[\big\{K(T'_{ij_1},T'_{ij_2}) - K^\circ(T'_{ij_1},T'_{ij_2})\big\}^2\Big] - \big\{K(T_{ij_1},T_{ij_2}) - K^\circ(T_{ij_1},T_{ij_2})\big\}^2.
\]
Hence, we have
\[
\frac{\int_{\mathcal{T}\times\mathcal{T}}(K-K^\circ)^2\,d(P_T\otimes P_T) - \big\{K(T_{ij_1},T_{ij_2}) - K^\circ(T_{ij_1},T_{ij_2})\big\}^2}{\int_{\mathcal{T}\times\mathcal{T}}(K-K^\circ)^2\,d(P_T\otimes P_T) + r} = \frac{\mathbb{E}_{T'}\big[\big\{K(T'_{ij_1},T'_{ij_2}) - K^\circ(T'_{ij_1},T'_{ij_2})\big\}^2\big] - \big\{K(T_{ij_1},T_{ij_2}) - K^\circ(T_{ij_1},T_{ij_2})\big\}^2}{\int_{\mathcal{T}\times\mathcal{T}}(K-K^\circ)^2\,d(P_T\otimes P_T) + r}.
\]
Therefore, taking the supremum over $K\in\mathcal{K}$ and moving the supremum inside the expectation $\mathbb{E}_{T'}$, and then applying the usual symmetrization, we obtain
\[
\mathbb{E}_T\Big[\sup_{K\in\mathcal{K}}\frac{1}{n\lfloor m/2\rfloor}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2\rfloor}\frac{\int(K-K^\circ)^2\,d(P_T\otimes P_T) - \big\{K(T_{ij_1},T_{ij_2}) - K^\circ(T_{ij_1},T_{ij_2})\big\}^2}{\int(K-K^\circ)^2\,d(P_T\otimes P_T) + r}\Big] \le 2\,\mathbb{E}_{T,T',\sigma}\Big[\sup_{K\in\mathcal{K}}\frac{1}{n\lfloor m/2\rfloor}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2\rfloor}\frac{\sigma_{ij}\big\{K(T_{ij_1},T_{ij_2}) - K^\circ(T_{ij_1},T_{ij_2})\big\}^2}{\|K-K^\circ\|_2^2 + r}\Big],
\]
where $\{\sigma_{ij}\}$ is a Rademacher sequence independent of $\{T_{ij}\}$. Let $r$ be a real number not less than $r^*$ (i.e., $r\ge r^*$). Then, for $s>1$, since
\[
\mathcal{K} = \mathcal{K}(r)\cup\Big(\bigcup_{l=1}^{\infty}\big\{\mathcal{K}(s^lr)\setminus\mathcal{K}(s^{l-1}r)\big\}\Big),
\]
we have
\[
\mathbb{E}_{T,T',\sigma}\Big[\sup_{K\in\mathcal{K}}\frac{1}{n\lfloor m/2\rfloor}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2\rfloor}\frac{\sigma_{ij}\{K(T_{ij_1},T_{ij_2}) - K^\circ(T_{ij_1},T_{ij_2})\}^2}{\|K-K^\circ\|_2^2+r}\Big] \le \frac{1}{r}\,\mathbb{E}\Big[\sup_{K\in\mathcal{K}(r)}\frac{1}{n\lfloor m/2\rfloor}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2\rfloor}\sigma_{ij}\{K(T_{ij_1},T_{ij_2}) - K^\circ(T_{ij_1},T_{ij_2})\}^2\Big] + \sum_{l=1}^{\infty}\frac{1}{(s^{l-1}+1)r}\,\mathbb{E}\Big[\sup_{K\in\mathcal{K}(s^lr)\setminus\mathcal{K}(s^{l-1}r)}\frac{1}{n\lfloor m/2\rfloor}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2\rfloor}\sigma_{ij}\{K(T_{ij_1},T_{ij_2}) - K^\circ(T_{ij_1},T_{ij_2})\}^2\Big].
\]
Moreover, by the comparison theorem (Theorem 4.2 in ), for all $r\ge r^*$,
\[
\mathbb{E}\Big[\sup_{K\in\mathcal{K}(r)}\frac{1}{n\lfloor m/2\rfloor}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2\rfloor}\sigma_{ij}\{K(T_{ij_1},T_{ij_2}) - K^\circ(T_{ij_1},T_{ij_2})\}^2\Big] \le 4B_K\,\mathbb{E}\Big[\sup_{K\in\mathcal{K}(r)}\frac{1}{n\lfloor m/2\rfloor}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2\rfloor}\sigma_{ij}\{K(T_{ij_1},T_{ij_2}) - K^\circ(T_{ij_1},T_{ij_2})\}\Big] \le 4B_K\phi(r)
\]
holds. Combining these bounds, we obtain
\[
\frac{4B_K\phi(r)}{r} + \sum_{l=1}^{\infty}\frac{4B_K\phi(s^lr)}{(s^{l-1}+1)r} = \frac{4B_K}{r}\Big\{\phi(r) + \sum_{l=1}^{\infty}\frac{\phi(s^lr)}{s^{l-1}+1}\Big\} \le \frac{4B_K}{r}\Big\{\phi(r) + \sum_{l=1}^{\infty}\frac{s^{l/2}\phi(r)}{s^{l-1}+1}\Big\} = \frac{4B_K\phi(r)}{r}\Big\{1 + \sum_{l=0}^{\infty}\frac{s^{(l+1)/2}}{s^l+1}\Big\} \le \frac{4B_K\phi(r)}{r}\Big\{1 + s^{1/2}\Big(\frac{1}{2} + \sum_{l=1}^{\infty}s^{-l/2}\Big)\Big\} = \frac{4B_K\phi(r)}{r}\Big\{1 + s^{1/2}\Big(\frac{1}{2} + \frac{1}{s^{1/2}-1}\Big)\Big\},
\]
where the first inequality uses the sub-root property $\phi(s^lr)\le s^{l/2}\phi(r)$. Here, the right-hand side is minimized at $s = 3+2\sqrt{2}$, and thus the upper bound satisfies
\[
\frac{4B_K\phi(r)}{r}\cdot\frac{5+2\sqrt{2}}{2} \le \frac{4B_K\phi(r)}{r}\cdot\frac{8}{2} = \frac{16B_K\phi(r)}{r}.
\]
Therefore,
\[
\mathbb{E}_T\Big[\sup_{K\in\mathcal{K}}\frac{1}{n\lfloor m/2\rfloor}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2\rfloor}\frac{\|K-K^\circ\|_2^2 - \{K(T_{ij_1},T_{ij_2}) - K^\circ(T_{ij_1},T_{ij_2})\}^2}{\|K-K^\circ\|_2^2+r}\Big] \le \frac{32B_K\phi(r)}{r}
\]
holds.
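The claim that $s = 3+2\sqrt{2}$ minimizes the series multiplier is a one-variable calculus fact ($g'(s)=0$ at $\sqrt{s} = 1+\sqrt{2}$, with minimum value $(5+2\sqrt{2})/2 \approx 3.914 < 4$). A standalone numerical check (illustrative only; the grid is arbitrary):

    import numpy as np

    def g(s):
        # series multiplier: 1 + sqrt(s) * (1/2 + 1/(sqrt(s) - 1))
        return 1 + np.sqrt(s) * (0.5 + 1.0 / (np.sqrt(s) - 1))

    grid = np.linspace(1.01, 20.0, 200001)
    vals = g(grid)
    print(grid[np.argmin(vals)], vals.min())  # ~5.828 = 3 + 2*sqrt(2), ~3.914 = (5 + 2*sqrt(2))/2
    assert vals.min() >= (5 + 2 * np.sqrt(2)) / 2 - 1e-6  # the grid never beats the claimed minimum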
Next, we verify the conditions needed to apply the modified Talagrand inequality (Lemma 3.1). Define
\[
f_K(\cdot,\star) := \frac{\|K-K^\circ\|_2^2 - \{K(\cdot,\star) - K^\circ(\cdot,\star)\}^2}{\|K-K^\circ\|_2^2 + r},
\]
and set $\mathcal{F} := \{f_K \mid K\in\mathcal{K}\}$. First, it is clear that $\mathbb{E}[f_K(T_{i1},T_{i2})] = 0$. By the arithmetic–geometric mean inequality, for $j\ne k$,
\[
\mathbb{E}\big[f_K^2(T_{ij},T_{ik})\big] = \mathbb{E}\Big[\Big(\frac{\|K-K^\circ\|_2^2 - \{K(T_{ij},T_{ik}) - K^\circ(T_{ij},T_{ik})\}^2}{\|K-K^\circ\|_2^2 + r}\Big)^2\Big] \le \frac{\mathbb{E}\big[\{K(T_{ij},T_{ik}) - K^\circ(T_{ij},T_{ik})\}^4\big]}{4r\|K-K^\circ\|_2^2} \le \frac{(2B_K)^2\,\mathbb{E}\big[\{K(T_{ij},T_{ik}) - K^\circ(T_{ij},T_{ik})\}^2\big]}{4r\|K-K^\circ\|_2^2} \le \frac{B_K^2}{r},
\]
and hence we can take $\sigma^2 := B_K^2/r$. Also, we have
\[
|f_K(T_{ij},T_{ik})| = \Big|\frac{\|K-K^\circ\|_2^2 - \{K(T_{ij},T_{ik}) - K^\circ(T_{ij},T_{ik})\}^2}{\|K-K^\circ\|_2^2 + r}\Big| \le \frac{4B_K^2}{r} =: U.
\]
Therefore, for all $x\ge0$, with probability at least $1-e^{-x}$, the following holds:
\[
\sup_{K\in\mathcal{K}}\frac{1}{nm(m-1)}\sum_{i=1}^{n}\sum_{j\ne k}f_K(T_{ij},T_{ik}) \le (1+\alpha)\,\mathbb{E}\Big[\sup_{K\in\mathcal{K}}\frac{1}{n\lfloor m/2\rfloor}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2\rfloor}f_K\big(T_{i\pi(j)}, T_{i\pi(\lfloor m/2\rfloor+j)}\big)\Big] + \sqrt{\frac{2B_K^2x}{n\lfloor m/2\rfloor\,r}} + \Big(\frac{1}{3}+\frac{1}{\alpha}\Big)\frac{4B_K^2}{r}\frac{x}{n\lfloor m/2\rfloor} \le (1+\alpha)\frac{32B_K\phi(r)}{r} + \sqrt{\frac{2B_K^2x}{n\lfloor m/2\rfloor\,r}} + \Big(\frac{3}{2}+\frac{1}{\alpha}\Big)\frac{4B_K^2x}{r\,n\lfloor m/2\rfloor}.
\]
That is, for all $x\ge0$, with probability at least $1-e^{-x}$, the following holds: for all $K\in\mathcal{K}$,
\[
\frac{\|K-K^\circ\|_2^2 - \frac{1}{nm(m-1)}\sum_{i=1}^{n}\sum_{j\ne k}\{K(T_{ij},T_{ik}) - K^\circ(T_{ij},T_{ik})\}^2}{\|K-K^\circ\|_2^2 + r} \le \inf_{\alpha>0}\Big\{32(1+\alpha)B_K\sqrt{\frac{r^*}{r}} + \sqrt{\frac{2B_K^2x}{n\lfloor m/2\rfloor\,r}} + \Big(\frac{3}{2}+\frac{1}{\alpha}\Big)\frac{4B_K^2x}{r\,n\lfloor m/2\rfloor}\Big\}.
\]
Let
\[
A_1 := 32(1+\alpha)B_K\sqrt{r^*} + \sqrt{\frac{2B_K^2x}{n\lfloor m/2\rfloor}}, \qquad A_2 := \Big(\frac{3}{2}+\frac{1}{\alpha}\Big)\frac{4B_K^2x}{n\lfloor m/2\rfloor}.
\]
Then it suffices to find $r$ satisfying $A_1/\sqrt{r} + A_2/r \le 1/C$, and it is enough to find an $r$ such that $r \ge C^2A_1^2 + 2CA_2$. Since we have
\[
C^2A_1^2 + 2CA_2 \le 2(32)^2(1+\alpha)^2C^2B_K^2r^* + \frac{4B_K^2x}{n\lfloor m/2\rfloor}\Big\{C^2 + 2C\Big(\frac{3}{2}+\frac{1}{\alpha}\Big)\Big\},
\]
taking $\alpha = 1/4$ yields that the above condition is satisfied for any $r$ such that
\[
2B_K^2\Big\{(32)^2(5/4)^2C^2r^* + \frac{x}{n\lfloor m/2\rfloor}(C^2 + 22C)\Big\} \le r.
\]
Note that if $C > 1/(40\sqrt{2}B_K)$, then $r > r^*$ holds. Therefore, for any $x>0$ and any $C > 1/(40\sqrt{2}B_K)$, with probability at least $1-e^{-x}$, the following holds: for all $K\in\mathcal{K}$,
\[
\|K-K^\circ\|_2^2 - \frac{1}{nm(m-1)}\sum_{i=1}^{n}\sum_{j\ne k}\{K(T_{ij},T_{ik}) - K^\circ(T_{ij},T_{ik})\}^2 \le \frac{1}{C}\|K-K^\circ\|_2^2 + 3200\,CB_K^2r^* + \frac{2B_K^2x}{n\lfloor m/2\rfloor}(C+22).
\]
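The reduction to the condition $r \ge C^2A_1^2 + 2CA_2$ comes from solving the quadratic $u^2 - CA_1u - CA_2 \ge 0$ in $u = \sqrt{r}$ and bounding the positive root via $(a+b)^2 \le 2a^2 + 2b^2$. A quick numerical confirmation (illustrative only; the inputs are random and not quantities from the paper):

    import numpy as np

    rng = np.random.default_rng(2)
    for _ in range(100_000):
        A1, A2, C = rng.uniform(0.01, 10.0, size=3)
        r = C**2 * A1**2 + 2 * C * A2  # the claimed sufficient choice of r
        assert A1 / np.sqrt(r) + A2 / r <= 1 / C + 1e-12  # then A1/sqrt(r) + A2/r <= 1/C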
G.2. Proof of Lemma F.2.

PROOF OF LEMMA F.2. First, we work conditionally on the stochastic processes $X_1,\ldots,X_n$. Under this conditioning, $\tilde e_i$ can be regarded as a fixed function. Let
\[
f_i(\cdot,\star) := \frac{\tilde e_i(\cdot,\star)\{K(\cdot,\star) - K^\circ(\cdot,\star)\} - \langle\tilde e_i, K-K^\circ\rangle}{\|K-K^\circ\|_2^2 + r}.
\]
We verify the conditions needed to apply the modified Talagrand inequality (Lemma 3.1). By definition, for $j\ne k$, we have $\mathbb{E}_T[f_i(T_{ij},T_{ik})] = 0$. Moreover,
\[
|f_i(T_{ij},T_{ik})| = \Big|\frac{\tilde e_i(T_{ij},T_{ik})\{K(T_{ij},T_{ik}) - K^\circ(T_{ij},T_{ik})\} - \langle\tilde e_i, K-K^\circ\rangle}{\|K-K^\circ\|_2^2 + r}\Big| \le \frac{4\beta_UB_K}{r}, \qquad \mathbb{E}_T\big[f_i^2(T_{ij},T_{ik})\big] \le \frac{\beta_U^2}{4r\|K-K^\circ\|_2^2}\,\mathbb{E}_T\big[|K(T_{ij},T_{ik}) - K^\circ(T_{ij},T_{ik})|^2\big] = \frac{\beta_U^2}{4r}.
\]
By Lemma 3.1, for any $x\ge0$ and any $\alpha>0$, with probability at least $1-e^{-x}$,
\[
\sup_{f\in\mathcal{F}}\frac{1}{nm(m-1)}\sum_{i=1}^{n}\sum_{j\ne k}f_i(T_{ij},T_{ik}) \le (1+\alpha)\,\mathbb{E}_{T|X}\Big[\sup_{f\in\mathcal{F}}\frac{1}{n\lfloor m/2\rfloor}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2\rfloor}f_i\big(T_{ij}, T_{i(\lfloor m/2\rfloor+j)}\big)\Big] + \sqrt{\frac{1}{n\lfloor m/2\rfloor}\frac{\beta_U^2}{2r}x} + \Big(\frac{3}{2}+\frac{1}{\alpha}\Big)\frac{4\beta_UB_K}{r}\frac{x}{n\lfloor m/2\rfloor}
\]
holds. By the symmetrization argument, we have
\[
\mathbb{E}_{T|X}\Big[\sup_{K\in\mathcal{K}}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2\rfloor}\frac{\tilde e_i(T_{ij_1},T_{ij_2})\{K(T_{ij_1},T_{ij_2}) - K^\circ(T_{ij_1},T_{ij_2})\} - \langle\tilde e_i, K-K^\circ\rangle}{\|K-K^\circ\|_2^2 + r}\Big] \le 2\,\mathbb{E}_{T|X}\Big[\sup_{K\in\mathcal{K}}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2\rfloor}\frac{\sigma_{ij}\,\tilde e_i(T_{ij_1},T_{ij_2})\{K(T_{ij_1},T_{ij_2}) - K^\circ(T_{ij_1},T_{ij_2})\}}{\|K-K^\circ\|_2^2 + r}\Big].
\]
By the comparison theorem, it follows that
\[
\mathbb{E}_{T|X}\Big[\sup_{K\in\mathcal{K}(r)}\frac{1}{n\lfloor m/2\rfloor}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2\rfloor}\sigma_{ij}\,\tilde e_i(T_{ij_1},T_{ij_2})\{K(T_{ij_1},T_{ij_2}) - K^\circ(T_{ij_1},T_{ij_2})\}\Big] \le 2\beta_U\,\mathbb{E}\Big[\sup_{K\in\mathcal{K}(r)}\frac{1}{n\lfloor m/2\rfloor}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2\rfloor}\sigma_{ij}\{K(T_{ij_1},T_{ij_2}) - K^\circ(T_{ij_1},T_{ij_2})\}\Big] = 2\beta_U\phi(r).
\]
As in Lemma F.1, we obtain
\[
\mathbb{E}_{T|X}\Big[\sup_{K\in\mathcal{K}}\frac{1}{n\lfloor m/2\rfloor}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2\rfloor}\frac{\tilde e_i(T_{ij_1},T_{ij_2})\{K(T_{ij_1},T_{ij_2}) - K^\circ(T_{ij_1},T_{ij_2})\} - \langle\tilde e_i, K-K^\circ\rangle}{\|K-K^\circ\|_2^2 + r}\Big] \le \frac{16\beta_U\phi(r)}{r}.
\]
Therefore, the upper bound is given by
\[
16(1+\alpha)\beta_U\sqrt{\frac{r^*}{r}} + \sqrt{\frac{1}{n\lfloor m/2\rfloor}\frac{\beta_U^2}{2r}x} + \Big(\frac{3}{2}+\frac{1}{\alpha}\Big)\frac{4\beta_UB_K}{r}\frac{x}{n\lfloor m/2\rfloor}.
\]
Consequently, as in Lemma F.1,
\[
C^2A_1^2 + 2CA_2 \le 2(16)^2(1+\alpha)^2C^2\beta_U^2r^* + \frac{\beta_U^2C^2x}{n\lfloor m/2\rfloor} + 2C\Big(\frac{3}{2}+\frac{1}{\alpha}\Big)\frac{4\beta_UB_Kx}{n\lfloor m/2\rfloor} = 2(16)^2(1+\alpha)^2C^2\beta_U^2r^* + \frac{\beta_Ux}{n\lfloor m/2\rfloor}\Big\{C^2\beta_U + 8CB_K\Big(\frac{3}{2}+\frac{1}{\alpha}\Big)\Big\},
\]
and hence, taking $\alpha = 1/4$, the upper bound becomes at most $1/C$ for any $r$ satisfying
\[
2(20)^2C^2\beta_U^2r^* + \frac{\beta_Ux}{n\lfloor m/2\rfloor}\big(C^2\beta_U + 44CB_K\big) \le r.
\]
If $C > 1/(20\sqrt{2}\beta_U)$, then $r > r^*$ holds. Therefore, for any $x>0$ and any $C > 1/(20\sqrt{2}\beta_U)$, with probability at least $1-e^{-x}$, for all $K\in\mathcal{K}$,
\[
\frac{1}{n\lfloor m/2\rfloor}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2\rfloor}\Big(\tilde e_i(T_{ij_1},T_{ij_2})\{K(T_{ij_1},T_{ij_2}) - K^\circ(T_{ij_1},T_{ij_2})\} - \langle\tilde e_i, K-K^\circ\rangle\Big) \le \frac{1}{C}\|K-K^\circ\|_2^2 + 800\,C^2\beta_U^2r^* + \frac{\beta_Ux}{n\lfloor m/2\rfloor}(C\beta_U + 44B_K)
\]
holds.
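The comparison step above removes the bounded multiplier $\tilde e_i$ at the cost of the factor $2\beta_U$. The toy Monte Carlo below is illustrative only (the function class, sizes, and multiplier are arbitrary stand-ins, not objects from the paper); it estimates both Rademacher averages and compares them with the predicted factor:

    import numpy as np

    rng = np.random.default_rng(3)
    n_pts, n_funcs, n_mc, beta_U = 200, 50, 2000, 1.0

    G = rng.uniform(-1, 1, size=(n_funcs, n_pts))  # toy class: values g(T_l), |g| <= 1
    e = rng.uniform(-beta_U, beta_U, size=n_pts)   # fixed bounded multiplier, |e| <= beta_U

    def rademacher_avg(vals):
        # Monte Carlo estimate of E_sigma sup_g |(1/n) sum_l sigma_l vals[g, l]|
        sig = rng.choice([-1.0, 1.0], size=(n_mc, n_pts))
        return np.abs(sig @ vals.T / n_pts).max(axis=1).mean()

    R_mult = rademacher_avg(G * e)   # with the multiplier e
    R_plain = rademacher_avg(G)      # without it
    print(R_mult, 2 * beta_U * R_plain)  # contraction predicts R_mult <= 2 * beta_U * R_plain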
G.3. Proof of Lemma F.3.

PROOF OF LEMMA F.3. We first condition on the stochastic processes $X_1,\ldots,X_n$. Also, let $Z_{ij} := (T_{ij},\epsilon_{ij})$, and define
\[
f_i(Z_{ij},Z_{ik}) := \frac{e^{(V)}_i(T_{ij},\epsilon_{ik})\{K(T_{ij},T_{ik}) - K^\circ(T_{ij},T_{ik})\}}{\|K-K^\circ\|_2^2 + r}.
\]
For $j\ne k$, we have $\mathbb{E}_{Z|X}[f_i(Z_{ij},Z_{ik})] = 0$ and
\[
|f_i(Z_{ij},Z_{ik})| = \Big|\frac{e^{(V)}_i(T_{ij},\epsilon_{ik})\{K(T_{ij},T_{ik}) - K^\circ(T_{ij},T_{ik})\}}{\|K-K^\circ\|_2^2 + r}\Big| \le \frac{4B_K\beta_V}{r}, \qquad \mathbb{E}_{Z|X}\big[|f_i(Z_{ij},Z_{ik})|^2\big] \le \frac{4\beta_V^2}{4r\|K-K^\circ\|_2^2}\,\mathbb{E}\big[|K(T_{ij},T_{ik}) - K^\circ(T_{ij},T_{ik})|^2\big] = \frac{\beta_V^2}{r}.
\]
Further, noting that $\mathbb{E}\big[e^{(V)}_i(T_{ij},\epsilon_{ik}) \mid X_i, T_{ij}\big] = 0$, the symmetrization argument yields
\[
\mathbb{E}_{Z|X}\Big[\sup_{K\in\mathcal{K}}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2\rfloor}\frac{e^{(V)}_i(T_{ij_1},\epsilon_{ij_2})\{K(T_{ij_1},T_{ij_2}) - K^\circ(T_{ij_1},T_{ij_2})\}}{\|K-K^\circ\|_2^2 + r}\Big] \le 2\,\mathbb{E}_{Z|X}\Big[\sup_{K\in\mathcal{K}}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2\rfloor}\frac{\sigma_{ij}\,e^{(V)}_i(T_{ij_1},\epsilon_{ij_2})\{K(T_{ij_1},T_{ij_2}) - K^\circ(T_{ij_1},T_{ij_2})\}}{\|K-K^\circ\|_2^2 + r}\Big].
\]
By the comparison theorem,
\[
\mathbb{E}\Big[\sup_{K\in\mathcal{K}(r)}\frac{1}{n\lfloor m/2\rfloor}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2\rfloor}\sigma_{ij}\,e^{(V)}_i(T_{ij_1},\epsilon_{ij_2})\{K(T_{ij_1},T_{ij_2}) - K^\circ(T_{ij_1},T_{ij_2})\}\Big] \le 4\beta_V\,\mathbb{E}\Big[\sup_{K\in\mathcal{K}(r)}\frac{1}{n\lfloor m/2\rfloor}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2\rfloor}\sigma_{ij}\{K(T_{ij_1},T_{ij_2}) - K^\circ(T_{ij_1},T_{ij_2})\}\Big] = 4\beta_V\phi(r).
\]
Following the proof of Lemma F.1, we obtain
\[
\mathbb{E}\Big[\sup_{K\in\mathcal{K}}\frac{1}{n\lfloor m/2\rfloor}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2\rfloor}\frac{e^{(V)}_i(T_{ij_1},\epsilon_{ij_2})\{K(T_{ij_1},T_{ij_2}) - K^\circ(T_{ij_1},T_{ij_2})\}}{\|K-K^\circ\|_2^2 + r}\Big] \le \frac{32\beta_V\phi(r)}{r}.
\]
By Lemma 3.1, for any $x\ge0$ and any $\alpha>0$, with probability at least $1-e^{-x}$,
\[
\sup_{f\in\mathcal{F}}\frac{1}{n\lfloor m/2\rfloor}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2\rfloor}f_i\big(Z_{ij_1},Z_{ij_2}\big) \le (1+\alpha)\,\mathbb{E}_{Z|X}\Big[\sup_{f\in\mathcal{F}}\frac{1}{n\lfloor m/2\rfloor}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2\rfloor}f_i\big(Z_{ij_1},Z_{ij_2}\big)\Big] + \sqrt{\frac{1}{n\lfloor m/2\rfloor}\frac{2\beta_V^2}{r}x} + \Big(\frac{3}{2}+\frac{1}{\alpha}\Big)\frac{4\beta_VB_K}{r}\frac{x}{n\lfloor m/2\rfloor}
\]
holds, and the upper bound is given by
\[
32(1+\alpha)\beta_V\sqrt{\frac{r^*}{r}} + \sqrt{\frac{1}{n\lfloor m/2\rfloor}\frac{2\beta_V^2}{r}x} + \Big(\frac{3}{2}+\frac{1}{\alpha}\Big)\frac{4\beta_VB_K}{r}\frac{x}{n\lfloor m/2\rfloor}.
\]
Therefore, as in Lemma F.1,
\[
C^2A_1^2 + 2CA_2 \le 2(32)^2(1+\alpha)^2C^2\beta_V^2r^* + \frac{2\beta_V^2C^2x}{n\lfloor m/2\rfloor} + 2C\Big(\frac{3}{2}+\frac{1}{\alpha}\Big)\frac{4\beta_VB_Kx}{n\lfloor m/2\rfloor} = 2(32)^2(1+\alpha)^2C^2\beta_V^2r^* + \frac{2\beta_Vx}{n\lfloor m/2\rfloor}\Big\{C^2\beta_V + 4CB_K\Big(\frac{3}{2}+\frac{1}{\alpha}\Big)\Big\}.
\]
Hence, taking $\alpha = 1/4$, the above upper bound is at most $1/C$ for any $r$ satisfying
\[
2(40)^2C^2\beta_V^2r^* + \frac{\beta_Vx}{n\lfloor m/2\rfloor}\big(C^2\beta_V + 26CB_K\big) \le r.
\]
Here, if $C > 1/(40\sqrt{2}\beta_V)$, then $r > r^*$ holds. Therefore, for any $x>0$ and any $C > 1/(40\sqrt{2}\beta_V)$, with probability at least $1-e^{-x}$, we obtain: for all $K\in\mathcal{K}$,
\[
\frac{1}{n\lfloor m/2\rfloor}\sum_{i=1}^{n}\sum_{j=1}^{\lfloor m/2\rfloor}e^{(V)}_i(T_{ij_1},\epsilon_{ij_2})\{K(T_{ij_1},T_{ij_2}) - K^\circ(T_{ij_1},T_{ij_2})\} \le \frac{1}{C}\|K-K^\circ\|_2^2 + 3200\,C^2\beta_V^2r^* + \frac{2\beta_Vx}{n\lfloor m/2\rfloor}(C\beta_V + 26B_K).
\]
This completes the proof.

Funding. The first author was supported by JSPS KAKENHI Grants (JP24K14855, JP25K15032, JP25K21806). The second author was supported in part by JST BOOST (JPMJBS2402).

REFERENCES

BAGIROV, A. M., CLAUSEN, C. and KOHLER, M. (2009). Estimation of a Regression Function by Maxima of Minima of Linear Functions. IEEE Transactions on Information Theory 55 833–845.
BARTLETT, P. L., BOUSQUET, O. and MENDELSON, S. (2005). Local Rademacher complexities. The Annals of Statistics 33 1497–1537.
BARTLETT, P. L., HARVEY, N., LIAW, C. and MEHRABIAN, A. (2019). Nearly-tight VC-dimension and Pseudodimension Bounds for Piecewise Linear Neural Networks. Journal of Machine Learning Research 20 1–17.
BERLINET, A. and THOMAS-AGNAN, C. (2004). Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer, New York.
BLANCHARD, G., BOUSQUET, O. and MASSART, P. (2008). Statistical performance of support vector machines. The Annals of Statistics 36 489–531.
CAI, T. T. and YUAN, M. (2010). Nonparametric Covariance Function Estimation for Functional and Longitudinal Data. Technical Report, University of Pennsylvania, Philadelphia, PA.
CAI, T. T. and YUAN, M. (2011). Optimal Estimation of the Mean Function Based on Discretely Sampled Functional Data: Phase Transition. The Annals of Statistics 39 2330–2355.
CHEN, L. and LIU, W. (2022). On the uniform approximation estimation of deep ReLU networks via frequency decomposition. AIMS Mathematics 7 19018–19025.
CLÉMENÇON, S., LUGOSI, G. and VAYATIS, N. (2008). Ranking and Empirical Minimization of U-Statistics. The Annals of Statistics 36 844–874.
GINÉ, E. and NICKL, R. (2016). Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge Series in Statistical and Probabilistic Mathematics 40. Cambridge University Press, Cambridge, UK.
GYÖRFI, L., KOHLER, M., KRZYŻAK, A. and WALK, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer.
HALL, P. and HOROWITZ, J. L. (2007).
Methodology and Convergence Rates for Functional Linear Regression. The Annals of Statistics 35 70–91.
HALL, P. and HOSSEINI-NASAB, M. (2006). On properties of functional principal components analysis. Journal of the Royal Statistical Society: Series B 68 109–126.
HALL, P., MÜLLER, H.-G. and WANG, J.-L. (2006). Properties of Principal Component Methods for Functional and Longitudinal Data Analysis. The Annals of Statistics 34 1493–1517.
HAYAKAWA, S. and SUZUKI, T. (2020). On the Minimax Optimality and Superiority of Deep Neural Network Learning over Sparse Parameter Spaces. Neural Networks 123 343–361.
HORVÁTH, L. and KOKOSZKA, P. (2012). Inference for Functional Data with Applications. Springer, New York.
HSING, T. and EUBANK, R. (2015). Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators. Wiley.
KOLTCHINSKII, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics 34 2593–2656.
LI, Y. and HSING, T. (2010). Uniform Convergence Rates for Nonparametric Regression and Principal Component Analysis in Functional/Longitudinal Data. The Annals of Statistics 38 3321–3351.
NAKADA, R. and IMAIZUMI, M. (2020). Adaptive Approximation and Generalization of Deep Neural Network with Intrinsic Dimensionality. Journal of Machine Learning Research 21 1–39.
PAUL, D. and PENG, J. (2009). Consistency of restricted maximum likelihood estimators of principal components. The Annals of Statistics 37 1229–1271.
RAMSAY, J. O. and SILVERMAN, B. W. (2005). Functional Data Analysis. Springer, New York.
SARKAR, S. and PANARETOS, V. M. (2022). CovNet: Covariance Networks for Functional Data on Multidimensional Domains. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 84 1785–1820.
SCHMIDT-HIEBER, J. (2019). Deep ReLU network approximation of functions on a manifold. arXiv:1908.00695.
SCHMIDT-HIEBER, J. (2020). Nonparametric Regression Using Deep Neural Networks with ReLU Activation Function. The Annals of Statistics 48 1875–1897.
SREBRO, N. and SRIDHARAN, K. (2010). Note on refined Dudley integral covering number bound. https://www.cs.cornell.edu/~sridharan/dudley.pdf.
SUZUKI, T. and NITANDA, A. (2021). Deep learning is adaptive to intrinsic dimensionality of model smoothness in anisotropic Besov space. Advances in Neural Information Processing Systems 34 3609–3621.
VAN DE GEER, S. A. (2000). Empirical Processes in M-Estimation. Cambridge University Press, Cambridge.
VAN DER VAART, A. W. and WELLNER, J. A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer, New York.
WAHBA, G. (1975). Smoothing noisy data with spline functions. Numerische Mathematik 24 383–393.
WANG, J.-L., CHIOU, J.-M. and MÜLLER, H.-G. (2016). Functional Data Analysis. Annual Review of Statistics and Its Application 3 257–295.
YAN, S., YAO, F. and ZHOU, H. (2025). Deep Regression for Repeated Measurements. Journal of the American Statistical Association 120 2461–2472.
YAO, F., MÜLLER, H.-G. and WANG, J.-L. (2005). Functional Data Analysis for Sparse Longitudinal Data.
Journal of the American Statistical Association 100 577–590.
YUAN, M. and CAI, T. T. (2010). A Reproducing Kernel Hilbert Space Approach to Functional Linear Regression. The Annals of Statistics 38 3412–3444.
ZHANG, X. and WANG, J.-L. (2016). From Sparse to Dense Functional Data and Beyond. The Annals of Statistics 44 2281–2321.
ZHOU, H., WEI, D. and YAO, F. (2025). Theory of Functional Principal Component Analysis for Discretely Observed Data. The Annals of Statistics 53 2103–2127.
ZHOU, H., YAO, F. and ZHANG, H. (2023). Functional Linear Regression for Discretely Observed Data: From Ideal to Reality. Biometrika 110 381–393.