Minimax optimal adaptive structured transfer learning through semi-parametric domain-varying coefficient model

Hanxiao Chen and Debarghya Mukherjee
Department of Mathematics and Statistics, Boston University

February 23, 2026

Abstract

Transfer learning aims to improve inference in a target domain by leveraging information from related source domains, but its effectiveness critically depends on how cross-domain heterogeneity is modeled and controlled. When the conditional mechanism linking covariates and responses varies across domains, indiscriminate information pooling can lead to negative transfer, degrading performance relative to target-only estimation. We study a multi-source, single-target transfer learning problem under conditional distributional drift and propose a semiparametric domain-varying coefficient model (DVCM), in which domain-relatedness is encoded through an observable domain identifier. This framework generalizes classical varying-coefficient models to structured transfer learning and interpolates between invariant and fully heterogeneous regimes. Building on this model, we develop an adaptive transfer learning estimator that selectively borrows strength from informative source domains while provably safeguarding against negative transfer. Our estimator is computationally efficient and easy to implement; we also show that it is minimax rate-optimal and derive its asymptotic distribution, enabling valid uncertainty quantification and hypothesis testing despite data-adaptive pooling and shrinkage. Our results precisely characterize the interplay among domain heterogeneity, the smoothness of the underlying mean function, and the number of source domains, and are corroborated by comprehensive numerical experiments and two real-data applications.
1 Introduction

Transfer learning, domain adaptation, and multi-task learning are modern machine-learning methodologies that aim to improve prediction or estimation in a target domain by leveraging data from related source domains, especially when labeled target data are scarce or costly to obtain. These methods have been successfully applied across a wide range of areas [1, 2, 3]. The key challenge is that domains typically differ, via shifts in covariate distributions (covariate shift), response distributions (label shift), or the conditional mechanism linking them (concept/posterior drift). Therefore, any gain hinges on how "relatedness" between source and target is quantified and enforced. Most existing approaches in transfer learning align distributions or representations, share parameters across tasks, or reweight instances to emphasize target-relevant information. However, if information from source domains is incorporated blindly, without adequate preventive measures against uninformative or mismatched sources, it may degrade the estimator's performance relative to what could be achieved using only the target data, a phenomenon commonly referred to as negative transfer in the literature [4]. Therefore, the goal of transfer learning is to construct an adaptive estimator that efficiently borrows information from related sources while ensuring that performance never degrades relative to a target-only estimator. A further challenge, often underemphasized in theoretical work, is to provide valid uncertainty quantification (e.g., in terms of asymptotic distributions) for these estimators so as to draw valid statistical inference.

Modern transfer learning problems increasingly involve data collected across multiple heterogeneous environments or domains, where the relationship between covariates and responses is not strictly invariant but instead exhibits systematic variation.
Such heterogeneity may arise from differences in population characteristics, experimental conditions, data-collection protocols, or other contextual factors. A central challenge in these settings is to leverage information from related environments to improve inference in a target environment, while avoiding degradation in performance when the source environments are only weakly informative.

To formalize this setting, we consider an environment-indexed framework. Each environment is associated with a domain identifier $U_e \in \mathcal{U}$, where $\mathcal{U}$ is a compact set, generated from some (unknown) distribution $P_U$. Within environment $e$, the covariate-response pairs are generated as
$$ X_{ie} \sim P_{X|U}(\cdot \mid U_e), \qquad Y_{ie} \mid (X_{ie}, U_e) \sim P_{U_e}(\cdot \mid X_{ie}), $$
so that the conditional mechanism relating $Y$ to $X$ may vary across environments. The primary goal of transfer learning in this setting is to improve inference for the target environment $u_0$ by borrowing information from related environments, while safeguarding against the risk of negative transfer. However, to efficiently transfer information from the source domains, it is essential to model how the conditional distribution varies smoothly with respect to the domain identifier, so that information can be shared across nearby environments.

Motivated by this framework, we model the cross-environment heterogeneity through a Domain-Varying Coefficient Model (DVCM), inspired by the classical varying-coefficient model (VCM; see, e.g., [5, 6]). Specifically, we assume that
$$ E[Y \mid X, U] = g^{-1}\big(X^\top \theta(U)\big), $$
where $g$ is a known link function and $\theta(\cdot)$ is an unknown coefficient function of the domain index $U$. This specification represents a structured and interpretable restriction of the general conditional distribution of $Y$ given $(X, U)$, allowing it to vary smoothly across environments.
The key distinction from a standard generalized linear model (GLM) is that GLMs assume fixed coefficients across observations, whereas DVCMs allow coefficients to vary with $U$, thereby providing a flexible yet parsimonious mechanism for capturing systematic heterogeneity across domains.

In our generalized linear DVCM framework, we assume access to data from $K+1$ domains, indexed by $k \in \{0, 1, \ldots, K\}$, where $k = 0$ denotes the target domain and $k \in \{1, \ldots, K\}$ denote the source domains. From domain $k$, we observe $n_k$ response-predictor pairs $D_k = \{(Y_{ki}, X_{ki})\}_{i=1}^{n_k}$. Each domain is additionally associated with a domain identifier $U_k$, which is constant within a domain (i.e., $U_{ki} \equiv U_k$ for all $i$) but varies across domains. Under this model, the conditional mean of the response satisfies
$$ E[Y_{ki} \mid X_{ki}, U_k] = g^{-1}\big(X_{ki}^\top \theta(U_k)\big). \qquad (1.1) $$
In practice, the domain identifier $U_k$ may represent calendar time (e.g., year or quarter of data collection) in temporal processes, geographic or institutional indicators (e.g., city, region, hospital) in spatial or multi-site studies, or cohort characteristics such as tenure or exposure duration. In biomedical and sensing applications, $U_k$ may encode instrument-specific or batch effects (e.g., scanner model, assay batch, or sensor platform), which are typically constant within a domain but vary across domains. Similarly, under stratified sampling designs, $U_k$ may consist of the indicators defining the $k$th stratum.

Under this generalized linear DVCM, our goal is to estimate the target-domain coefficient $\theta(u_0)$ by borrowing information from the source domains while guarding against negative transfer. As mentioned previously, the potential gains from transfer hinge on (i) the smoothness of the coefficient function $\theta(\cdot)$ and (ii) the similarity between $U_k$ and $u_0$.
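As a concrete illustration, the following sketch simulates data from a linear DVCM of the form (1.1) with the identity link: each of $K+1$ domains draws an identifier $U_k$, and responses within a domain follow $Y = X^\top \theta(U_k) + \varepsilon$ for a smooth coefficient function. The specific choices of $\theta(\cdot)$, the distributions, and the sample sizes below are our own illustrative assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def theta(u):
    """A smooth (hence Hölder) coefficient function; illustrative choice."""
    return np.array([1.0 + np.sin(u), 0.5 * u, np.cos(2 * u)])

K, n_k, p = 10, 50, 3
U = rng.uniform(0.0, 1.0, size=K + 1)   # U[0] plays the role of the target identifier u0
domains = []
for k in range(K + 1):
    X = rng.normal(size=(n_k, p))
    # Rescale so that ||X||_2 <= 1, matching the boundedness assumption on covariates.
    X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))
    eps = rng.normal(scale=0.1, size=n_k)
    Y = X @ theta(U[k]) + eps           # identity link: E[Y | X, U] = X^T theta(U)
    domains.append((U[k], X, Y))
```

Each tuple in `domains` holds one domain's identifier together with its covariates and responses, mirroring the data $\{(U_k, \{X_{ki}, Y_{ki}\})\}$ described above.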
Even when $\theta(\cdot)$ is not very smooth, transfer can be beneficial if some source identifier $U_k$ lies sufficiently close to $u_0$; conversely, when no source is particularly close to $u_0$, a high degree of smoothness of $\theta(\cdot)$ can still enable effective transfer by ensuring that $\theta(U_k)$ remains close to $\theta(u_0)$. Building on this insight, we propose a minimax-optimal, computationally efficient, and adaptive estimator $\hat\theta_{TL}(u_0)$ for $\theta(u_0)$ that exploits source information when relevant and remains robust to negative transfer, i.e., its risk is never worse than that of a target-only estimator of $\theta(u_0)$; it is also conceptually simple and easy to implement. As a baseline, in the absence of source data, a target-only least-squares (or GLM) estimator attains the optimal rate for $\theta(u_0)$, but it leverages neither the smoothness of $\theta$ nor cross-domain similarities. To incorporate both, we first form a nonparametric pilot $\hat\theta_{DVCM}(u_0)$ (e.g., via local polynomial regression) by pooling sources (and a split of the target) whose $U_k$ are near $u_0$. We then fit a GLM on the target domain with an adaptive ridge penalty that shrinks the GLM estimator toward $\hat\theta_{DVCM}(u_0)$. The key challenge, therefore, lies in designing the penalty in a careful, data-driven manner so as to guard against potential negative transfer. To this end, we construct a penalty based on inverse-variance reweighting, which effectively serves this purpose. In Section 2, we provide a practical recipe for constructing such a penalty and establish sharp theoretical guarantees for its performance. Although our methodology is applicable for vector-valued $U$, we conduct our theoretical study under a univariate ($\dim(U) = 1$) domain indicator for simplicity of presentation.
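The ridge-penalized fine-tuning step has a simple closed form in the linear case: the resulting estimator is a matrix-weighted average of the target-only least-squares solution and the pilot. The toy check below (our own illustrative sketch, with a stand-in pilot rather than an actual DVCM fit) verifies the two limiting regimes: as $Q \to 0$ the estimator reduces to the target-only fit, and as $Q$ grows it collapses onto the pilot.

```python
import numpy as np

rng = np.random.default_rng(1)
n0, p = 200, 3
X = rng.normal(size=(n0, p))
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(scale=0.5, size=n0)

theta_pilot = np.array([0.9, -0.4, 1.8])          # stand-in for a DVCM pilot estimate
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)      # target-only least squares

def fine_tune(Q):
    """Ridge-regularized fit shrinking toward the pilot (closed form of the penalized problem)."""
    A = X.T @ X / n0 + Q
    b = X.T @ y / n0 + Q @ theta_pilot
    return np.linalg.solve(A, b)

weak = fine_tune(1e-8 * np.eye(p))    # Q ~ 0: essentially the target-only estimator
strong = fine_tune(1e8 * np.eye(p))   # Q large: essentially the pilot
```

The adaptive procedure developed later chooses $Q$ between these extremes, with strong shrinkage only when the pilot is demonstrably precise.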
One of our key theoretical contributions is to rigorously establish that $\hat\theta_{TL}(u_0)$ achieves the following minimax-optimal rate (Theorems 3.6 and 3.9):
$$ \inf_{\hat\theta}\; \sup_{\theta,\, \{P_k\}_{k=0}^{K}} E\big[\|\hat\theta - \theta(u_0)\|_2^2\big] \;\asymp\; n_0^{-1} \wedge \max\Big\{ (K/\gamma)^{-2\beta},\; (n/\gamma)^{-\frac{2\beta}{2\beta+1}},\; n^{-1} \Big\}, $$
where $K$ is the number of source domains, $n = \sum_{k=0}^{K} n_k$ is the total sample size, $\beta$ denotes the smoothness of the coefficient function $\theta$, and $\gamma$ quantifies the proximity of the source and target identifiers (e.g., the variability among the $U_i$'s; smaller $\gamma$ means closer domain indices; see Section 3.1 for details). Two implications are immediate from the rate: (i) it is never worse than $n_0^{-1}$, so the estimator is immune to negative transfer; and (ii) it improves when either $\gamma$ is small (i.e., $U_k$ is close to $U_0$) or $\beta$ is large (i.e., $\theta$ is smoother). Furthermore, the term $(K/\gamma)^{-2\beta}$ is unavoidable: when $K$ is small and $\gamma$ is large, only limited information can be effectively pooled from the source domains, even in the presence of infinite source data. A large $\gamma$ indicates that $\theta(u_0)$ and $\theta(u_k)$ are not sufficiently close, while a small $K$ implies inadequate local information to reliably infer $\theta(u_0)$ from the $\theta(u_k)$'s. This limitation is analogous to the behavior of the bias term encountered in standard nonparametric regression. Our analysis can be extended to multivariate $U$ in a straightforward manner, with no additional insight beyond routine bookkeeping.

While minimax optimality characterizes the fundamental estimation difficulty, rates alone do not provide uncertainty quantification. In our setting, deriving valid inference is particularly delicate: the proposed estimator combines nonparametric pooling with a data-adaptive shrinkage matrix $Q$, so both the pilot estimator and the penalty are random and depend on the full sample. Consequently, the estimator is not a simple linear functional of the data, and standard asymptotic arguments do not apply directly.
We show (Theorem 3.10 and the corollaries that follow) that, under appropriate undersmoothing conditions, the adaptive estimator nevertheless admits a centered asymptotically normal distribution with a feasible variance estimator:
$$ \hat\Sigma_{TL}^{-1/2}\big(\hat\theta_{TL} - \theta(U_0)\big) \Rightarrow N(0, I_p), $$
where $\hat\Sigma_{TL}$ is an estimator of the variance of $\hat\theta_{TL}$, which depends on $(\beta, \gamma, K, n, n_0)$ (precisely quantified in Section 3.2). This enables confidence intervals and Wald-type tests for $\theta(u_0)$ that properly account for the data-adaptive pooling and shrinkage mechanism.

We summarize our contributions below:

1. Methodological contribution: We propose a domain-varying coefficient model, bridging the standard GLM and VCM, that relates source and target domains via observable domain identifiers. We develop a computationally efficient methodology to construct a minimax-optimal estimator $\hat\theta_{TL}$, which is provably safe (no negative transfer) and adaptively borrows strength from related source domains.

2. Theoretical contribution: On the theoretical front, we rigorously establish that $\hat\theta_{TL}$ is minimax rate-optimal. Furthermore, we establish the asymptotic normality of $\hat\theta_{TL}$, which aids in inference and in constructing asymptotically valid confidence intervals. Details can be found in Section 3.

3. Application: Last but not least, we demonstrate the efficiency of our estimator through extensive simulations, as well as on two real socio-economic datasets: (i) the US adult-income dataset, and (ii) the SLID-Ontario dataset (details can be found in Sections 4 and 5).

The organization of this paper is as follows. We conclude the Introduction with a brief discussion of the related literature and introduce the notation used throughout the rest of the paper. In Section 2, we present our two-step methodology for constructing $\hat\theta_{TL}$. In Section 3, we establish theoretical guarantees for our proposed estimator.
In Section 4, we present extensive numerical experiments demonstrating the efficacy of our methodology. In Section 5, we apply our method to two real datasets. Finally, we conclude by outlining promising directions for future research in Section 6. (Code implementing our methodology and reproducing all experiments is available at github.com/hanxiao-chen/Transfer_Learning_DVCM.)

Positioning in the existing literature: Transfer learning with parametric regression has gained significant attention recently. For a single source domain, [7] proposes data-enriched linear regression with a penalized difference between source and target coefficients, while [8] analyzes a fine-tuning approach with a significance test for positive transfer. With multiple sources, selecting informative domains is key. [9] proposes a minimax-efficient strategy for high-dimensional linear regression, followed by de-biasing with target data, extended by [10] and [11] to high-dimensional generalized linear models. [12] incorporate dependence among observations, and [13] introduce an importance-weighted method using residuals for transfer learning. Beyond parametric settings, nonparametric transfer learning has also seen advances in both classification [14, 15, 16, 17, 18] and regression [19, 20], as well as other extensions [21, 22, 23]. Recent extensions further expand the scope to reinforcement learning, functional data analysis, matrix estimation, outlier detection, heavy-tailed data, and the bootstrap [24, 25, 26, 27, 28, 29, 30, 31, 32, 33]. Notably, [34] proposed a Bayesian varying-coefficient model with Gaussian process priors for geospatial transfer learning, though without providing convergence rates. To our knowledge, this is the only work that employs a VCM for transfer learning. Our approach differs by explicitly specifying the functional class of $\theta(\cdot)$, which enables the derivation of matching minimax lower and upper bounds.
It is worth clarifying how our framework differs from several existing transfer-learning paradigms. First, high-dimensional parametric transfer methods typically assume a fixed coefficient vector across domains and exploit sparsity or shared support structure (e.g., Lasso-based transfer), focusing on variable selection and parameter shrinkage. In contrast, we explicitly model domain heterogeneity through a smooth coefficient function $\theta(U)$, allowing systematic variation across environments rather than enforcing invariance. Second, many modern approaches rely on representation alignment or feature adaptation, seeking a domain-invariant representation of $X$ through deep architectures. While powerful in practice, such methods are often algorithmic and do not yield transparent statistical characterizations of the bias-variance tradeoff under domain drift. Our DVCM framework instead imposes a structured, interpretable restriction: cross-domain similarity is encoded through smoothness in the domain index, leading to precise minimax characterizations and adaptive guarantees. Thus, rather than performing penalized fine-tuning in an abstract parameter space, our approach leverages an explicit geometric structure on domains, which enables both negative-transfer control and rigorous inference.

Notations. For a vector $x = [x_1, \ldots, x_p]^\top$, define $\|x\|_q = (\sum_{i=1}^p |x_i|^q)^{1/q}$, $x^{\otimes 2} = x x^\top$, and $\|x\|_A^2 = x^\top A x$, where $A$ is positive semidefinite. The spectral norm of $A$ is $\|A\|_2$. For $x \in \mathbb{R}^p$ and $y \in \mathbb{R}^q$, $x \otimes y = [x_1 y^\top, \ldots, x_p y^\top]^\top$. Let $\hat\theta$ estimate $\theta$; define $\mathrm{MSE}_A(\hat\theta) = E\|\hat\theta - \theta\|_A^2$ and $M(\hat\theta) = E[(\hat\theta - \theta)^{\otimes 2}]$. Denote by $e_{i,j}$ the length-$j$ vector with a 1 in the $i$th position. For matrices $A, B$, $A \succeq B$ means $A - B$ is positive semidefinite; $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ are its smallest and largest eigenvalues.
For sequences $a_n, b_n > 0$, write $a_n \lesssim b_n$ if $a_n \le C b_n$ for some $C > 0$, and $a_n \asymp b_n$ if both $a_n \lesssim b_n$ and $a_n \gtrsim b_n$. Also, $a_n = o(b_n)$ means $|a_n / b_n| \to 0$, and $a_n = O(b_n)$ means $\sup_n |a_n / b_n| < \infty$. For real numbers $x, y$, let $x \wedge y = \min(x, y)$ and $x \vee y = \max(x, y)$; for an integer $n$, $[n] = \{1, \ldots, n\}$. For the $k$th domain, define the index set $I_k = [n_k]$ and the pooled source data $D_S = \bigcup_{k=1}^K D_k$. Let $\Gamma = \{U_k : 0 \le k \le K\} \cup \{X_{ki} : 0 \le k \le K, i \in I_k\}$ denote all covariates. We use $A_l = [I_p, 0_{p \times lp}]$ when defining nonparametric estimators, where $I_p$ is the identity matrix and $0_{p \times lp}$ is a zero matrix.

2 Methodology

In this section, we present our methodology for estimating $\theta(U_0)$, the coefficient of the target domain. Our proposed methodologies for the linear DVCM and the generalized linear DVCM are presented in Sections 2.1 and 2.2, respectively.

2.1 Transfer learning for linear DVCM

Recall that we use the index 0 for the target domain and $1 \le k \le K$ for the source domains. The observed data from the $k$th domain are denoted by $\{(X_{ki}, Y_{ki})\}_{i \in I_k}$, with domain identifier $U_k$, for $k \in \{0, 1, \ldots, K\}$. As per our data-generating model, Equation (1.1), the observed sample from the $k$th domain, under the linearity assumption, is assumed to follow
$$ Y_{ki} = X_{ki}^\top \theta(U_k) + \varepsilon_{ki}, $$
where the noise terms $\varepsilon_{ki}$ are independent within and across the domains. Our parameter of interest is $\theta(U_0) = \theta(u_0)$ ($u_0$ being the realization of $U_0$), the coefficient of the target domain. A simple estimator for $\theta(u_0)$ is the ordinary least squares estimator computed using only the target-domain data $D_0$:
$$ \hat\theta_{LR}(u_0) = \arg\min_{\alpha} \sum_{i \in I_0} \big(Y_{0i} - X_{0i}^\top \alpha\big)^2 = \big(X_0^\top X_0\big)^{-1} X_0^\top y_0. \qquad (2.1) $$
Although this target-only estimator is rate-optimal (and efficient under Gaussian errors), it ignores potentially useful information from the source domains.
Nevertheless, it serves as our target-only baseline, and any transfer-learning-based estimator should not underperform relative to it. We now describe our adaptive transfer learning procedure, which is summarized in Algorithm 1.

Step I: Nonparametric initialization. To borrow information from the source domains, we first construct a nonparametric point estimator of $\theta(u_0)$, denoted by $\hat\theta_{DVCM}(u_0)$, using all available data via local polynomial regression. Specifically, let $W$ be a smoothing kernel (typically a symmetric probability density function; see Section 3 for precise assumptions) with bandwidth parameter $h > 0$. The estimator $\hat\theta_{DVCM}(u_0)$ is defined as
$$ \hat\theta_{DVCM}(u_0) = A_l \cdot \arg\min_{\alpha \in \mathbb{R}^{(l+1)p}} \sum_{k=0}^{K} \sum_{i \in I_k} \Big( Y_{ki} - \Big( \Phi_l\Big(\tfrac{U_k - u_0}{h}\Big) \otimes X_{ki} \Big)^\top \alpha \Big)^2 W\Big(\tfrac{U_k - u_0}{h}\Big), \qquad (2.2) $$
where $\Phi_l(x) = \big(1, x, x^2/2!, \ldots, x^l/l!\big)^\top$ is the $l$th-order polynomial feature map, $\otimes$ denotes the Kronecker product, and $A_l = [I_p, 0_{p \times lp}]$ selects the first $p$ coordinates of the minimizer. This estimator admits the closed-form expression
$$ \hat\theta_{DVCM}(u_0) = A_l (Z^\top W Z)^{-1} Z^\top W y, \qquad (2.3) $$
where
$$ Z_{ki} = \Phi_l\Big(\tfrac{U_k - u_0}{h}\Big) \otimes X_{ki}, \qquad W = S_h^{-1}\, \mathrm{diag}\Big\{ W\Big(\tfrac{U_k - u_0}{h}\Big) \Big\}_{k,i}, \qquad S_h = \sum_{k,i} W\Big(\tfrac{U_k - u_0}{h}\Big). \qquad (2.4) $$
The key advantage of $\hat\theta_{DVCM}(u_0)$ is that it aggregates source information using local weights that are larger for domains with identifiers $U_k$ close to $u_0$ (i.e., the weight decreases as $|U_k - u_0|$ grows). Consequently, it effectively borrows information from relevant source domains. As will be shown in Section 3 (Equation (3.1)), the optimal bandwidth $h$ automatically balances two factors: (i) the smoothness of $\theta(\cdot)$ and (ii) the proximity and spread of the $U_k$'s. If $\theta(\cdot)$ is sufficiently smooth or the $U_k$'s are clustered near $u_0$, then a substantial amount of information is borrowed from the sources.

Step II: Fine-tuning.
Although $\hat\theta_{DVCM}(u_0)$ borrows information adaptively, it may perform worse than the baseline $\hat\theta_{LR}(u_0)$ when $\theta(\cdot)$ is not sufficiently smooth and the $U_k$'s are far from $u_0$, or when the bandwidth $h$ is misspecified. Such situations can lead to negative transfer, as illustrated in our numerical studies. To guard against this phenomenon, we fine-tune $\hat\theta_{DVCM}(u_0)$ using the target data through a ridge-regularized regression:
$$ \hat\theta_{TL}(u_0) = \arg\min_{\alpha \in \mathbb{R}^p} \frac{1}{2n_0} \sum_{i \in I_0} \big(Y_{0i} - X_{0i}^\top \alpha\big)^2 + \frac{1}{2} \|\alpha - \hat\theta_{DVCM}(u_0)\|_Q^2 = \Big( \frac{1}{n_0} X_0^\top X_0 + Q \Big)^{-1} \Big( \frac{1}{n_0} X_0^\top y_0 + Q\, \hat\theta_{DVCM}(u_0) \Big), \qquad (2.5) $$
where $Q$ is a symmetric positive definite matrix. The choice of $Q$ is crucial: only a proper selection ensures adaptivity and protection against negative transfer. One oracle choice is
$$ Q = \delta\, \frac{\sigma^2(u_0)}{n_0}\, M\big(\hat\theta_{DVCM}(u_0)\big)^{-1}, \qquad (2.6) $$
for some constant $\delta \in (1/2, 2)$ (see Theorem 3.6). Here $\sigma^2(u_0)$ denotes the noise variance in the target domain, and $n_0$ is the sample size of the target domain. We assume that the mean squared error matrix $M(\hat\theta_{DVCM}(u_0))$ is invertible, which holds whenever the covariance matrix of $\hat\theta_{DVCM}(u_0)$ is positive definite. This choice of $Q$ can be interpreted as the ratio of uncertainties between the two estimators: $\sigma^2(u_0)/n_0$ is proportional to the variance of the target-only estimator $\hat\theta_{LR}(u_0)$, whereas $M(\hat\theta_{DVCM}(u_0))$ represents the MSE of $\hat\theta_{DVCM}(u_0)$. With this construction, the second step interpolates between the target-only and pooled estimators: when the DVCM pilot is relatively precise, strong shrinkage occurs; when it is noisy, the estimator automatically reverts toward the target-only solution. Next, we describe a data-driven way to obtain $\hat Q$, which can be used in Equation (2.5) to compute $\hat\theta_{TL}(u_0)$.
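The two steps above can be sketched end-to-end for the linear DVCM. The snippet below implements a local linear pilot ($l = 1$, uniform kernel, Eqs. (2.2)-(2.4); the normalization $S_h$ cancels in the closed form and is omitted) followed by the ridge fine-tuning of Eq. (2.5). For simplicity it uses a fixed $Q$ rather than the data-driven $\hat Q$ of Section 2.3, and all data-generating choices are illustrative assumptions of ours.

```python
import math
import numpy as np

rng = np.random.default_rng(2)
theta_fn = lambda u: np.array([1.0 + np.sin(u), 0.5 * u])  # illustrative theta(.)

# Generate K source domains plus a target domain (k = 0).
K, n_k, p, u0, h, l = 20, 40, 2, 0.5, 0.3, 1
U = np.concatenate([[u0], rng.uniform(0.2, 0.8, size=K)])
Xs, Ys = [], []
for k in range(K + 1):
    X = rng.normal(size=(n_k, p))
    Y = X @ theta_fn(U[k]) + rng.normal(scale=0.3, size=n_k)
    Xs.append(X); Ys.append(Y)

def phi(t, l):
    """Polynomial feature map Phi_l(t) = (1, t, ..., t^l / l!)."""
    return np.array([t ** j / math.factorial(j) for j in range(l + 1)])

# Step I: local polynomial pilot with uniform kernel weights.
rows, ys, ws = [], [], []
for k in range(K + 1):
    t = (U[k] - u0) / h
    w = 0.5 * (abs(t) <= 1)              # uniform kernel W((U_k - u0)/h)
    if w == 0:
        continue                          # domains outside the bandwidth get zero weight
    for i in range(n_k):
        rows.append(np.kron(phi(t, l), Xs[k][i]))  # Z_ki = Phi_l(t_k) ⊗ X_ki
        ys.append(Ys[k][i]); ws.append(w)
Z, yv, wv = np.array(rows), np.array(ys), np.array(ws)
G = Z.T @ (Z * wv[:, None])                         # Z' W Z (up to S_h scaling)
theta_pilot = np.linalg.solve(G, Z.T @ (wv * yv))[:p]  # A_l keeps the first p coords

# Step II: ridge fine-tuning toward the pilot (Eq. (2.5)); Q fixed here for illustration.
X0, y0, n0 = Xs[0], Ys[0], n_k
Q = 5.0 * np.eye(p)
theta_tl = np.linalg.solve(X0.T @ X0 / n0 + Q, X0.T @ y0 / n0 + Q @ theta_pilot)
```

With a smooth $\theta(\cdot)$ and tightly clustered identifiers, `theta_tl` tracks $\theta(u_0)$ more closely than the target-only fit alone would suggest; replacing the fixed `Q` with the inverse-variance construction of Section 2.3 is what makes the procedure adaptive.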
Estimation of Q: We now present a fully data-driven choice of $Q$, which also provably yields an optimal estimator $\hat\theta_{TL}(u_0)$, as will be established in Section 3. Our construction closely mimics the oracle choice in Equation (2.6), with the unknown MSEs of $\hat\theta_{DVCM}(u_0)$ and $\hat\theta_{LR}(u_0)$ replaced by consistent estimators. To this end, recall that the mean squared error matrix of $\hat\theta_{DVCM}(u_0)$ can be decomposed into bias and variance components:
$$ M\big(\hat\theta_{DVCM}(u_0)\big) = \underbrace{\big(E[\hat\theta_{DVCM}(u_0)] - \theta(u_0)\big)\big(E[\hat\theta_{DVCM}(u_0)] - \theta(u_0)\big)^\top}_{:=\, \mathrm{Bias}^{\otimes 2}(\hat\theta_{DVCM}(u_0))} + \mathrm{Var}\big(\hat\theta_{DVCM}(u_0)\big). \qquad (2.7) $$
We estimate these two components separately using the procedures described in Section 2.3. Specifically, we employ a plug-in estimator for the bias term and a sandwich-type estimator for the variance term. Furthermore, we estimate $\sigma^2(u_0)$ using the sample mean of the squared residuals. Combining these estimators yields the following data-driven penalty matrix $\hat Q$:
$$ \hat Q = \delta\, \frac{\hat\sigma^2(u_0)}{n_0} \Big( \widehat{\mathrm{Bias}}^{\otimes 2}\big(\hat\theta_{DVCM}(u_0)\big) + \widehat{\mathrm{Var}}\big(\hat\theta_{DVCM}(u_0)\big) \Big)^{-1}. \qquad (2.8) $$
Finally, the adaptive transfer-learning estimator $\hat\theta_{TL}(u_0)$ is obtained by substituting this data-driven matrix $\hat Q$ into Equation (2.5). Our entire procedure is summarized in Algorithm 1. In the next subsection, we extend our algorithm to the generalized DVCM.

Algorithm 1: Transfer Learning for Linear DVCM
Input: data $\{(U_k, \{X_{ki}, Y_{ki}\}_{i \in I_k})\}_{k=0}^{K}$, order $l$, bandwidth $h$
1: Split the target-domain sample into two halves of equal size, indexed by $I_0$ and $I_0^*$
2: Compute $\hat\theta_{DVCM}(u_0)$ as in (2.2)
3: Estimate $\hat Q$ via Section 2.3
4: Form $\hat\theta_{TL}(u_0)$ using (2.5) with $\hat Q$
Output: $\hat\theta_{TL}(u_0)$

2.2 Transfer learning for generalized linear DVCM

In this section, we extend our methodology to the setting where the conditional distribution of $Y$ given $(X, U)$ follows a generalized linear model with a general link function $g$ (cf.
Equation (1.1)). More specifically, we assume that the conditional distribution belongs to a canonical exponential family:
$$ f(Y_{ki} = y \mid X_{ki} = x, U_k = u) = \exp\Big( \nu(u)^{-1} \big( y\, x^\top \theta(u) - b\big(x^\top \theta(u)\big) \big) + c\big(y, \nu(u)\big) \Big), \qquad (2.9) $$
where $\nu(u)$ is a scale parameter. This formulation implies
$$ E[Y_{ki} \mid X_{ki}, U_k] = b'\big(X_{ki}^\top \theta(U_k)\big) \;\Longrightarrow\; g\big(E[Y_{ki} \mid X_{ki}, U_k]\big) = X_{ki}^\top \theta(U_k), $$
where $g(\mu) = (b')^{-1}(\mu)$ is the canonical link function. The key ideas parallel those in Section 2.1, with appropriate modifications to accommodate a general link. As before, we begin with a target-only estimator computed using only target data, which serves as the no-transfer baseline. However, instead of minimizing a squared-error loss, we minimize the negative log-likelihood:
$$ \hat\theta_{GLR}(u_0) = \arg\min_{\alpha} \sum_{i \in I_0} \big( b(X_{0i}^\top \alpha) - Y_{0i}\, X_{0i}^\top \alpha \big) := \arg\min_{\alpha} \sum_{i \in I_0} \ell(X_{0i}^\top \alpha, Y_{0i}). \qquad (2.10) $$
By classical GLM asymptotics [35], the target-only MLE $\hat\theta_{GLR}(u_0)$ is $\sqrt{n_0}$-consistent and asymptotically normal under standard regularity conditions. Nevertheless, as in the linear case, it ignores potentially informative source data. To exploit cross-domain similarity, we again proceed in two steps:

Step I: Nonparametric initialization. We first construct a nonparametric estimator of $\theta(u_0)$ using local polynomial regression, pooling all observations together:
$$ \hat\theta_{GDVCM}(u_0) = A_l \cdot \arg\min_{\alpha \in \mathbb{R}^{(l+1)p}} \sum_{k=0}^{K} \sum_{i \in I_k} \ell(Z_{ki}^\top \alpha, Y_{ki})\, W\Big(\tfrac{U_k - u_0}{h}\Big), \qquad (2.11) $$
where $Z_{ki} = \Phi_l\big(\tfrac{U_k - u_0}{h}\big) \otimes X_{ki}$. This estimator mirrors (2.2), with the squared-error loss replaced by the negative log-likelihood.

Step II: Fine-tuning. In this step, we refine the pilot $\hat\theta_{GDVCM}(u_0)$ using the target data via a ridge-regularized GLM:
$$ \hat\theta_{TL}(u_0) = \arg\min_{\alpha} \frac{1}{n_0} \sum_{i \in I_0} \ell(X_{0i}^\top \alpha, Y_{0i}) + \frac{1}{2} \|\alpha - \hat\theta_{GDVCM}(u_0)\|_Q^2. $$
(2.12)

Algorithm 2: Transfer Learning for Generalized DVCM
Input: data $\{(U_k, \{X_{ki}, Y_{ki}\}_{i \in I_k})\}_{k=0}^{K}$, order $l$, bandwidth $h$, loss function $\ell$
1: Randomly split the target-domain data into two equal parts, indexed by $I_0$ and $I_0^*$
2: Compute $\hat\theta_{GDVCM}(u_0)$ as in Equation (2.11)
3: Estimate $\hat Q$ using the procedure in Section 2.3
4: Construct the transfer learning estimator $\hat\theta_{TL}(u_0)$ using Equation (2.12) with $\hat Q$
Output: $\hat\theta_{TL}(u_0)$

In Section 3, we show that the following oracle choice of $Q$ makes $\hat\theta_{TL}(u_0)$ adaptive and immune to negative transfer:
$$ Q = \delta\, \frac{\nu(u_0)}{n_0}\, M\big(\hat\theta_{GDVCM}(u_0)\big)^{-1} \qquad (2.13) $$
for $\delta \in (1/2, 2)$. This choice parallels (2.6), with the Gaussian noise variance $\sigma^2(u_0)$ replaced by the GLM scale parameter $\nu(u_0)$. As in Section 2.1, the matrix $Q$ captures the ratio of uncertainties between the target-only estimator $\hat\theta_{GLR}(u_0)$ and the pilot $\hat\theta_{GDVCM}(u_0)$. Consequently, shrinkage is strong when $\hat\theta_{GDVCM}(u_0)$ is relatively more precise, and it relaxes toward the target-only estimator $\hat\theta_{GLR}(u_0)$ otherwise. The complete procedure is summarized in Algorithm 2, and all theoretical guarantees are established in Section 3. In the next section, we develop a fully data-driven estimator $\hat Q$ and show that, when substituted into (2.12), the resulting estimator $\hat\theta_{TL}(u_0)$ achieves minimax-optimal rates while remaining robust against potential negative transfer.

Remark 2.1 In the procedural description above, we used the entire dataset in Step I to construct the nonparametric estimator $\hat\theta_{DVCM}(u_0)$ and then reused the target-domain data in Step II to compute the fine-tuned estimator $\hat\theta_{TL}(u_0)$. Consequently, the target data are involved in both steps, which induces statistical dependence between $\hat\theta_{DVCM}(u_0)$ and the second-stage objective used to define $\hat\theta_{TL}(u_0)$.
While this dependence has a negligible impact on empirical performance, it complicates the theoretical analysis. To simplify the exposition and proofs in Section 3, we therefore adopt a data-splitting scheme. In particular, we assume that $2n_0$ target-domain observations are available: the first $n_0$ samples are used in Step I to construct $\hat\theta_{DVCM}(u_0)$, and the remaining $n_0$ samples are reserved for Step II to compute $\hat\theta_{TL}(u_0)$. Under this scheme, $\hat\theta_{DVCM}(u_0)$ is independent of the second half of the target data used in the fine-tuning step, which substantially simplifies the theoretical arguments. Although it may be possible to establish the asymptotic properties of $\hat\theta_{TL}(u_0)$ without data-splitting, we do not pursue that direction in this paper.

2.3 Estimating Q

In this subsection, we propose a consistent estimator $\hat Q$ of the oracle penalty matrix $Q$. We formulate the procedure in the generalized linear model setting, since the linear model is recovered as a special case by taking $\ell(\eta, y) = (\eta - y)^2/2$. Let $s_j(\eta, y) = \frac{\partial^j}{\partial \eta^j} \ell(\eta, y)$ denote the $j$th derivative of the loss function with respect to the linear predictor $\eta$. Using the bias-variance decomposition in Equation (2.7), we define
$$ \hat Q = \delta\, \frac{\hat\nu(u_0)}{n_0}\, \hat M\big(\hat\theta_{GDVCM}(u_0)\big)^{-1} = \delta\, \frac{\hat\nu(u_0)}{n_0} \Big\{ \widehat{\mathrm{Bias}}\big(\hat\theta_{GDVCM}(u_0)\big)^{\otimes 2} + \widehat{\mathrm{Var}}\big(\hat\theta_{GDVCM}(u_0)\big) \Big\}^{-1}. \qquad (2.14) $$
We now describe how to estimate each component.

Estimation of $\nu(u_0)$. The scale parameter $\nu(u_0)$ can be estimated using standard GLM methodology. A common approach is based on Pearson residuals:
$$ r_i^2 = \Bigg( \frac{Y_{0i} - g^{-1}\big(X_{0i}^\top \hat\theta_{GLR}(u_0)\big)}{\sqrt{b''\big(X_{0i}^\top \hat\theta_{GLR}(u_0)\big)}} \Bigg)^2, \qquad \hat\nu(u_0) = \frac{1}{n_0} \sum_{i=1}^{n_0} r_i^2. \qquad (2.15) $$
Under standard regularity conditions, $\hat\nu(u_0)$ is a consistent estimator of $\nu(u_0)$ [35].

Estimation of the bias term. The leading bias of $\hat\theta_{GDVCM}(u_0)$ is of order $h^\beta$.
We therefore estimate $\widehat{\mathrm{Bias}}(\hat\theta_{GDVCM}(u_0)) = c_0 h^\beta$, where $c_0 > 0$ is determined by the local polynomial approximation. For integer $\beta$, the plug-in bias estimator is
$$ \widehat{\mathrm{Bias}}\big(\hat\theta_{GDVCM}(u_0)\big) = \hat\zeta_{0,1}^{-1}\, \hat\zeta_{\beta,1}\, \frac{1}{\beta!}\, \hat\theta^{(\beta)}(u_0)\, h^\beta, \qquad (2.16) $$
where $\hat\zeta_{r,s} = (nh)^{-1} \sum_{k=0}^{K} n_k\, t_k^r\, W(t_k)^s$ and $t_k = (U_k - u_0)/h$. A consistent estimator of $\theta^{(\beta)}(u_0)$, namely $\hat\theta^{(\beta)}(u_0)$, can be obtained via another application of local polynomial regression [36], i.e.,
$$ \hat\theta^{(\beta)}(u_0) = \tilde A_\beta\, \hat\alpha_{DVCM}, \qquad \tilde A_\beta = h^{-\beta} [0_{p \times \beta p}, I_p], $$
where $\hat\alpha_{DVCM}$ minimizes the objective of (2.11). The consistency of the resulting estimator follows from classical kernel regression theory [37, 38].

Estimation of the variance term. Let $t_k = (U_k - u_0)/h$. Following [39], we estimate the variance component using a sandwich-type estimator:
$$ \widehat{\mathrm{Var}}\big(\hat\theta_{GDVCM}(u_0)\big) = A_l\, \hat\Lambda^{-1} \hat\Delta\, \hat\Lambda^{-1} A_l^\top, \qquad (2.17) $$
where
$$ \hat\Delta = \frac{1}{(nh)^2} \sum_{k=0}^{K} \sum_{i \in I_k} s_1^2\big(Z_{ki}^\top \hat\alpha_{GDVCM}, Y_{ki}\big)\, Z_{ki} Z_{ki}^\top\, W(t_k)^2, \qquad \hat\Lambda = \frac{1}{nh} \sum_{k=0}^{K} \sum_{i \in I_k} s_2\big(Z_{ki}^\top \hat\alpha_{GDVCM}, Y_{ki}\big)\, Z_{ki} Z_{ki}^\top\, W(t_k). $$
Here, $\hat\alpha_{GDVCM}$ denotes the minimizer of the objective function in Equation (2.11). By a standard application of the weak law of large numbers, $\widehat{\mathrm{Var}}$ consistently estimates the variance component of $M(\hat\theta_{GDVCM}(u_0))$. Combining the above components, $\hat Q$ consistently mimics the oracle choices in Equation (2.6) for the linear DVCM and Equation (2.13) for the generalized linear DVCM. Consequently, when substituted into the fine-tuning step, it yields a provably optimal adaptive estimator, as established in the next section.

3 Theoretical Analysis

In this section, we establish the theoretical properties of the estimator $\hat\theta_{TL}(u_0)$ under both linear and generalized linear model settings. Our analysis characterizes the minimax-optimal rates of estimation and derives the limiting distributions necessary for statistical inference.
Section 3.1 develops a non-asymptotic theory for ˆθ_TL(u_0) under the linear model assumption (Equation (2.5)). We show that the proposed estimator attains the minimax-optimal rate of convergence. Based on these results, Section 3.2 establishes asymptotic normality and presents valid inference procedures. Finally, Section 3.3 extends the analysis to the generalized linear model framework (Equation (2.12)), demonstrating that the minimax optimality and inferential guarantees continue to hold in this more general setting.

3.1 Non-Asymptotic Results for Linear DVCMs

Before presenting our main results, we briefly recall the notation and data-generating assumptions. For each domain k ∈ {0, 1, ..., K}, the observations follow the linear model

\[
Y_{ki} = X_{ki}^\top\,\theta(U_k) + \varepsilon_{ki},
\]

where ε_{ki} is independent of X_{ki} conditional on U_k, with E[ε_{ki} | U_k] = 0 and Var(ε_{ki} | U_k) = σ²(U_k). Thus, the noise variance is allowed to vary across domains through its dependence on U_k. For any u ∈ U, we define d_k(u) = ‖u − U_k‖₂, which measures the distance between the k-th domain identifier U_k and u. In particular, at u = u_0, the collection {d_k(u_0)}_{1≤k≤K} quantifies the similarity between each source domain and the target domain. Let {d_(k)(u_0)}_{1≤k≤K} denote the corresponding order statistics. These ordered distances induce a ranking of the source domains according to their proximity to the target domain. We now state the assumptions required for our theoretical analysis.

Assumption 3.1 (Functional coefficient) The coefficient function θ(u) = (θ_1(u), ..., θ_p(u))^⊤ is a collection of p functions that belong to a Hölder class H(β, L) with β, L > 0.

Assumption 3.2 (Assumptions on the data distribution) The data distribution (U, X, Y) is assumed to satisfy the following:

(a) The variables U_k are i.i.d.
from a location–scale family:

\[
f_U(u) = \frac{1}{\gamma}\, f\Big(\frac{u - u^*}{\gamma}\Big),
\]

where f is compactly supported on B ⊂ ℝ and satisfies a'_0 ≤ f(u) ≤ a_0 for all u ∈ B. The constants u* ∈ B and 0 < γ < ∞ are the location and scale parameters. Furthermore, the covariate X_{ki} is assumed to be compactly supported; without loss of generality, we assume ‖X_{ki}‖₂ ≤ 1 almost surely.

(b) The conditional fourth moment E[|Y|⁴ | U = u] is assumed to be continuous, and consequently uniformly upper bounded on the support of U (as it is compact).

(c) For h ≥ d_(1)(u_0), it holds almost surely that

\[
\big\| A_l\,(Z^\top W Z)^{-1} A_l^\top \big\|_2 \;\le\; \lambda_0^{-1}
\]

for some constant λ_0 > 0, where (A_l, Z, W) are as defined in Equation (2.4).

Assumption 3.3 (Balanced domain sizes) The source samples are assumed to be balanced, i.e., there exist positive constants b'_0, b_0 such that n_k/n̄ ∈ [b'_0, b_0] for k ∈ {1, ..., K}, where n̄ = (∑_{k=0}^K n_k)/(K + 1) denotes the average sample size. For the target domain, we only assume an upper bound, i.e., n_0/n̄ ≤ b_0.

Assumption 3.4 (Uniform kernel) The kernel function W is the uniform pdf W(u) = (1/2)·1{|u| ≤ 1}.

Discussion of the assumptions: Assumption 3.1 is standard in the nonparametric estimation literature and imposes smoothness conditions on the coefficient function θ(·). Assumption 3.2(a) models the domain indices {U_k} as i.i.d. draws from a well-behaved location–scale family with compactly supported density f. The scale parameter γ controls the dispersion of the domain indices: smaller values of γ correspond to closely related domains (in which case transfer learning is beneficial), whereas larger values of γ reflect more heterogeneous and widely dispersed domains. As established in Theorem 3.6, the minimax-optimal estimation rate is governed by the interplay between (γ, β). Smaller γ and/or larger β correspond to more informative source domains and hence faster convergence rates.
Conversely, larger γ and smaller β indicate weaker cross-domain similarity, in which case the performance of our estimator approaches that of the target-only estimator. The compactness assumption on the support of X is made for technical convenience and can be relaxed using standard truncation arguments. Assumption 3.2(b) ensures that the error distribution is well behaved. In contrast to much of the existing literature, which assumes sub-Gaussian errors, we only require bounded fourth moments. Assumption 3.2(c) guarantees the well-posedness of the linear system in Equation (2.3), thereby ensuring the existence of the estimator. A closely related assumption appears in Section 1.6.1 of [40]. In Appendix D, we present a general result showing that, under mild conditions, the assumption holds almost surely.

Assumption 3.3 requires that the sample sizes across source domains grow at comparable rates, preventing any single source domain from becoming asymptotically negligible or overly dominant. If this were violated, the analysis could be restricted to the domains with asymptotically non-negligible sample proportions. In contrast, we impose no lower bound on the ratio n_0/n̄ for the target domain and allow n_0/n̄ → 0, thereby accommodating practically relevant scenarios in which the target sample size is of smaller order than the aggregate source sample size.

Finally, although our theoretical development is presented under the uniform kernel assumption (Assumption 3.4), the analysis extends to any kernel W that is compactly supported and bounded away from 0 and ∞ on its support.

Our first result characterizes the rate of convergence of the target-only baseline ˆθ_LR(u_0) (defined in Equation (2.1)) and the nonparametric DVCM estimator ˆθ_DVCM(u_0) (defined in Equation (2.3)). By default, the order of the polynomial used in constructing ˆθ_DVCM(u_0) is chosen as l = ⌊β⌋ throughout this section.
Proposition 3.5 Let A be any matrix satisfying 0 ⪯ A ⪯ C·I for some constant C > 0, and define MSE_A(ˆθ) = E‖ˆθ − θ‖²_A, where ‖x‖²_A = x^⊤Ax. Under Assumption 3.2, the target-only estimator ˆθ_LR(u_0) satisfies, for some constant C_1 > 0,

\[
\mathrm{MSE}_A\big(\hat\theta_{\mathrm{LR}}(u_0)\big) \;\le\; C_1\, n_0^{-1}.
\]

Furthermore, under Assumptions 3.1–3.4, the DVCM estimator ˆθ_DVCM(u_0) computed with bandwidth

\[
h^* = \mathrm{med}\Big( e_0\,(n/\gamma)^{-\frac{1}{2\beta+1}},\; d_{(1)}(u_0),\; d_{(K)}(u_0) \Big), \qquad e_0 > 0 \text{ a constant}, \tag{3.1}
\]

satisfies

\[
\mathrm{MSE}_A\big(\hat\theta_{\mathrm{DVCM}}(u_0)\big) \;\le\; C_2\,\max\Big\{ (K/\gamma)^{-2\beta},\; (n/\gamma)^{-\frac{2\beta}{2\beta+1}},\; n^{-1} \Big\}
\]

for some constant C_2 > 0, where n = n_0 + ∑_{k=1}^K n_k.

The proof is deferred to Appendix B.1. The rate for ˆθ_LR(u_0) follows directly from standard linear regression theory. We now provide intuition for the rate of ˆθ_DVCM(u_0) and the associated bandwidth choice. By the standard bias–variance trade-off in nonparametric regression, for any bandwidth h,

\[
\mathrm{MSE}\big(\hat\theta_{\mathrm{DVCM}}(u_0)\big) \;\asymp\; h^{2\beta} + \frac{\gamma}{nh}.
\]

Here, the effective sample size is scaled by γ to reflect the dispersion of the domain indices U_k. Minimizing the right-hand side over h without constraints yields h_opt ∝ (n/γ)^{−1/(2β+1)}, which leads to the classical rate (n/γ)^{−2β/(2β+1)}. However, in our setting the bandwidth must lie in the interval [d_(1)(u_0), d_(K)(u_0)]. The lower bound arises because h must exceed d_(1)(u_0); otherwise, no source domain would fall inside the bandwidth window, leading to degeneracy. The upper bound reflects that if h > d_(K)(u_0), then all domains are automatically included, and further enlargement has no additional effect.
Therefore, the bandwidth selection amounts to minimizing the MSE subject to the constraint h ∈ [d_(1)(u_0), d_(K)(u_0)], which yields the truncated (median-based) choice of optimal bandwidth in Equation (3.1); the resulting rate is precisely the one stated in Proposition 3.5.

We next present our main result regarding the rate of convergence of ˆθ_TL(u_0), which shows that, for a range of choices of the adaptive penalty Q, our proposed estimator does not suffer from negative transfer:

Theorem 3.6 Consider the choice of h as in (3.1), and let the chosen Q satisfy

\[
\frac{1}{2}\,\frac{\sigma^2(u_0)}{n_0}\, M\big(\hat\theta_{\mathrm{DVCM}}(u_0)\big)^{-1} \;\preceq\; Q \;\preceq\; 2\,\frac{\sigma^2(u_0)}{n_0}\, M\big(\hat\theta_{\mathrm{DVCM}}(u_0)\big)^{-1}. \tag{3.2}
\]

Then, under Assumptions 3.1–3.4, for any u_0 ∈ U and A ⪰ 0, the following holds:

\[
\sup_{\forall j\in[p],\;\theta_j\in\mathcal H(\beta,L)} \mathbb E\big[\|\hat\theta_{\mathrm{TL}}(u_0) - \theta(u_0)\|_A^2\big] \;\le\; C\,\Big( n_0^{-1} \wedge \max\Big\{ (K/\gamma)^{-2\beta},\; (n/\gamma)^{-\frac{2\beta}{2\beta+1}},\; n^{-1} \Big\} \Big).
\]

Remark 3.7 Although we use a local polynomial regression–based estimator for θ(·) in constructing ˆθ_DVCM(u_0), the conclusion of the above theorem remains valid for a broad class of alternative nonparametric estimators of θ(·), including spline-based and neural-network–based methods.

The proof of Theorem 3.6 is deferred to Appendix B.2. As shown in Proposition 3.5, the n_0^{−1} part is the rate of the parametric estimator ˆθ_LR(u_0), while the max{(K/γ)^{−2β}, (n/γ)^{−2β/(2β+1)}, n^{−1}} part is the rate of the nonparametric estimator ˆθ_DVCM(u_0). Therefore, the MSE of ˆθ_TL(u_0) is always smaller than or equal to the minimum of the MSEs of ˆθ_LR(u_0) and ˆθ_DVCM(u_0); that is, the adaptive estimator consistently outperforms or matches the target-only estimator, making it robust to negative transfer.
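Concretely, both the truncated bandwidth of Equation (3.1) and the adaptive rate bound of Theorem 3.6 are easy to evaluate numerically. The sketch below is our own illustration: the constant e0 = 1 and all sample sizes and parameters are hypothetical values chosen only to show the two regimes.

```python
import numpy as np

def bandwidth_star(n, gamma, beta, d1, dK, e0=1.0):
    """Median-of-three bandwidth of Eq. (3.1): the unconstrained optimum
    e0 * (n/gamma)^(-1/(2*beta+1)), truncated to the interval [d1, dK]."""
    h_unc = e0 * (n / gamma) ** (-1.0 / (2 * beta + 1))
    return float(np.median([h_unc, d1, dK]))   # median of three = clipped value

def adaptive_rate(n0, n, K, gamma, beta):
    """Upper bound of Theorem 3.6: minimum of the target-only rate n0^{-1} and
    the DVCM rate max{(K/g)^{-2b}, (n/g)^{-2b/(2b+1)}, n^{-1}}."""
    dvcm = max((K / gamma) ** (-2.0 * beta),
               (n / gamma) ** (-2.0 * beta / (2 * beta + 1)),
               1.0 / n)
    return min(1.0 / n0, dvcm)

h = bandwidth_star(n=5000, gamma=1.0, beta=2, d1=0.02, dK=0.4)
# informative sources (small gamma): pooling beats the target-only rate
r_informative = adaptive_rate(n0=50, n=5000, K=50, gamma=0.1, beta=2)
# heterogeneous sources (large gamma): the bound falls back to n0^{-1}
r_heterogeneous = adaptive_rate(n0=50, n=5000, K=5, gamma=10.0, beta=2)
```

With these hypothetical inputs, `r_informative` is strictly below 1/n0 while `r_heterogeneous` equals 1/n0, mirroring the two branches of the minimum in the theorem.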
In the following corollary we show that the proposed data-driven estimator Q̂ satisfies Equation (3.2) with probability tending to one; consequently, the estimator ˆθ_TL(u_0) constructed using Q̂ also satisfies the conclusion of the theorem.

Corollary 3.8 Let M̂(ˆθ_DVCM(u_0)) and ˆσ²(u_0) be the consistent estimators obtained via the procedure in Section 2.3. Define the estimator Q̂ as

\[
\hat Q = \frac{\delta\,\hat\sigma^2(u_0)}{n_0}\,\hat M\big(\hat\theta_{\mathrm{DVCM}}(u_0)\big)^{-1}, \qquad \delta \in (1/2,\, 2).
\]

The estimator ˆθ_TL(u_0), obtained by substituting Q̂ into Equation (2.5), achieves the same rate of convergence as in Theorem 3.6 with probability tending to 1.

The proof of Corollary 3.8 can be found in Appendix B.3. We conclude this section with a theorem establishing the minimax lower bound for estimating θ(u_0), thereby confirming that our proposed estimator achieves the minimax-optimal rate.

Theorem 3.9 Under Assumptions 3.1–3.3, for any u_0 ∈ U, it holds that

\[
\inf_{\hat\theta(u_0)}\;\sup_{\forall j\in[p],\;\theta_j\in\mathcal H(\beta,L)} \mathbb E\big[\|\hat\theta(u_0) - \theta(u_0)\|_A^2\big] \;\ge\; C'\,\Big( n_0^{-1} \wedge \max\Big\{ (K/\gamma)^{-2\beta},\; (n/\gamma)^{-\frac{2\beta}{2\beta+1}},\; n^{-1} \Big\} \Big).
\]

The proof of the above theorem is provided in Appendix B.4. Together, Theorems 3.6 and 3.9 establish the minimax optimality of our proposed estimator.

3.2 Inference with linear DVCM

In the previous subsection, we established the rate of convergence of ˆθ_TL(u_0). However, convergence rates alone are insufficient for inferential tasks, such as testing hypotheses of the form H_0: θ(u_0) = 0 versus H_1: θ(u_0) ≠ 0. To this end, we now establish the asymptotic normality of the proposed estimator. We begin by introducing the modifications of the earlier assumptions required to derive the asymptotic normality result.

Assumption 2′ (Modification of Assumption 3.2) The distribution of (U, X, Y) is assumed to satisfy the conditions of Assumption 3.2.
Furthermore, the conditional second-moment matrix Ψ(u) = E[XX^⊤ | U = u] is positive definite and continuous almost everywhere on U. Moreover, there exist constants 0 < c'_0 < c_0 < ∞ such that, almost surely for all u ∈ U,

\[
c_0' \;\le\; \lambda_{\min}\big(\Psi(u)\big) \;\le\; \lambda_{\max}\big(\Psi(u)\big) \;\le\; c_0.
\]

Compared to Assumption 3.2, Assumption 2′ imposes additional uniform lower and upper bounds on the conditional second-moment matrix of X given U, which facilitate Lindeberg-type central limit theorem arguments. Toward presenting our main result, let r_LR and r_DVCM denote the convergence rates of ˆθ_LR(u_0) and ˆθ_DVCM(u_0), respectively, i.e., ˆθ_LR(u_0) − θ(u_0) = O_p(r_LR) and ˆθ_DVCM(u_0) − θ(u_0) = O_p(r_DVCM). We define their relative rate by ρ_n := r_LR/r_DVCM. Note that we allow r_DVCM to be bounded away from 0, corresponding to non-informative sources. The following theorem characterizes the asymptotic distribution of the proposed transfer learning estimator in terms of ρ_n.

Theorem 3.10 Suppose Assumptions 3.1, 2′, 3.3, and 3.4 hold. Let ˆθ_TL(u_0) be constructed as in (2.2), with the shrinkage matrix Q̂ defined in Section 2.3 and the bandwidth parameter h chosen to satisfy

\[
\gamma/K \;\ll\; h \;\ll\; (\gamma/n)^{\frac{1}{2\beta+1}}. \tag{3.3}
\]

Then the adaptive transfer learning estimator ˆθ_TL(u_0) satisfies the following asymptotic results. If ρ_n → 0,

\[
\sqrt{n_0}\,\big(\hat\theta_{\mathrm{TL}}(u_0) - \theta(u_0)\big) \;\xrightarrow{d}\; N\big(0,\;\Omega_{\mathrm{LR}}(u_0)\big);
\]

if ρ_n → ∞,

\[
\sqrt{\frac{nh}{\gamma}}\,\big(\hat\theta_{\mathrm{TL}}(u_0) - \theta(u_0)\big) \;\xrightarrow{d}\; N\big(0,\;\Omega_{\mathrm{DVCM}}(u_0)\big).
\]

Here Ω_LR(u_0) and Ω_DVCM(u_0) denote the asymptotic covariance matrices of ˆθ_LR(u_0) and ˆθ_DVCM(u_0), respectively.

The proof of this theorem is deferred to Appendix B.5. The theorem highlights the adaptivity of the transfer learning estimator.
When ρ_n → 0, the target-only estimator converges at a faster rate than the DVCM estimator, and consequently the transfer learning estimator attains the same convergence rate as the target-only estimator. In contrast, when ρ_n → ∞, the DVCM estimator converges faster than the target-only estimator, and the transfer learning estimator correspondingly achieves a convergence rate comparable to that of the DVCM estimator.

The asymptotic normality results, particularly in the regime where ρ_n → ∞, require additional conditions on the choice of the bandwidth h (see Equation (3.3)). To motivate the condition, let us briefly recall the classical bandwidth condition required for establishing the asymptotic normality of a pointwise nonparametric regression estimator. Suppose we observe Y_i = f*(X_i) + ε_i, where f* belongs to a β-Hölder class, and we wish to estimate f*(x_0) at a fixed point x_0. In this setting, the bandwidth that achieves the minimax-optimal rate is h* ∝ n^{−1/(2β+1)}, which yields the optimal convergence rate n^{−β/(2β+1)}. However, this bandwidth choice does not generally yield a centered asymptotic normal distribution, because the bias term is of the same order as the stochastic fluctuations. To establish asymptotic normality, one should either correct for the bias or undersmooth [41, 42], i.e., choose h such that nh^{2β+1} → 0, which yields

\[
\sqrt{nh}\,\big(\hat f(x_0) - f^*(x_0)\big) \;\Longrightarrow\; N(0, \sigma^2),
\]

where σ² is the asymptotic variance. The undersmoothing condition ensures that the squared bias term h^{2β} is asymptotically negligible relative to the stochastic error 1/(nh), thereby yielding a centered normal limit. This centering is essential for constructing valid (1 − α)-level confidence intervals. The trade-off is a slightly slower rate of convergence, since √(nh) grows strictly slower than the minimax-optimal scaling n^{β/(2β+1)} whenever nh^{2β+1} → 0.
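The effect of undersmoothing is easy to see numerically: shrinking the bandwidth below the optimal order by a logarithmic factor drives the bias-to-variance ratio n·h^{2β+1} to zero. The following is our own small illustration (β = 2 and the particular log-factor choice are ours, not part of the formal development):

```python
import numpy as np

# Undersmoothing demo: take h_n a log factor below the optimal n^{-1/(2*beta+1)},
# so that n * h^{2*beta+1} -> 0 and the squared bias h^{2*beta} becomes
# negligible relative to the variance 1/(n*h).
beta = 2
ratios = []
for n in [10 ** 3, 10 ** 5, 10 ** 7]:
    h = n ** (-1.0 / (2 * beta + 1)) / np.log(n)  # undersmoothed bandwidth
    bias_sq = h ** (2 * beta)                     # squared bias, order h^{2*beta}
    var = 1.0 / (n * h)                           # variance, order 1/(n*h)
    ratios.append(bias_sq / var)                  # equals n*h^{2*beta+1} = (log n)^{-(2*beta+1)}
```

The ratio decreases as n grows, so the bias is eventually dominated by the stochastic fluctuations, which is exactly what a centered Gaussian limit requires.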
In this paper, we take this undersmoothing approach. The bandwidth condition in Equation (3.3) reflects precisely the undersmoothing phenomenon discussed above. In particular, the requirement nh^{2β+1}/γ → 0 ensures that the squared bias of ˆθ_DVCM(u_0), which is of order h^{2β}, is asymptotically negligible compared to its variance term, which is of order γ/(nh). Consequently, the stochastic fluctuations dominate the bias, leading to an asymptotically normal distribution centered at zero. The additional condition Kh/γ → ∞ guarantees that the effective number of source domains satisfying |U_k − u_0| ≤ h diverges (recall that γ/K is the order of the minimum distance between the target and the source identifiers). In other words, the local polynomial estimator underlying ˆθ_DVCM(u_0) is constructed from an increasing amount of source-domain information. Without this condition, the number of contributing domains would remain bounded, precluding the application of central limit theorem arguments and hence preventing asymptotic normality. Together, these conditions ensure that the estimator is both bias-negligible and supported by a sufficiently large effective sample size, thereby yielding a valid Gaussian limit suitable for inference.

It is apparent from Theorem 3.10 that the convergence rate and limiting variance of ˆθ_TL(u_0) depend on whether ρ_n → 0 or ρ_n → ∞. In practice, however, the true regime is typically unknown. Therefore, a unified representation of the limiting variance is necessary for drawing valid inferences across all regimes.
The following corollary serves this purpose:

Corollary 3.11 Under the conditions of Theorem 3.10, the estimator ˆθ_TL(u_0) satisfies

\[
\hat\Sigma_{\mathrm{TL}}^{-1/2}\,\big(\hat\theta_{\mathrm{TL}}(u_0) - \theta(u_0)\big) \;\xrightarrow{d}\; N(0, I),
\]

where the unified covariance estimator is given by

\[
\hat\Sigma_{\mathrm{TL}} = B_Q^{-1}\,\hat Q\,\hat V_{\mathrm{DVCM}}(u_0)\,\hat Q\,B_Q^{-1} \;+\; B_Q^{-1}\,\hat\Psi(u_0)\,\hat V_{\mathrm{LR}}(u_0)\,\hat\Psi(u_0)\,B_Q^{-1}, \qquad B_Q = \hat\Psi(u_0) + \hat Q.
\]

The proof of this corollary can be found in Appendix B.8. This corollary gives the practitioner a concrete form of the standard error of ˆθ_TL(u_0), which relies on ˆΨ(u_0), V̂_LR(u_0), and V̂_DVCM(u_0), consistent estimators of Ψ(u_0), Var(ˆθ_LR(u_0)), and Var(ˆθ_DVCM(u_0)), respectively. One can easily construct such consistent estimators by taking their sample analogues, as prescribed below:

1. ˆΨ(u_0) = (1/n_0) ∑_{i∈I_0} X_{0i} X_{0i}^⊤;

2. V̂_LR(u_0) = ˆΨ(u_0)^{−1} [ (1/n_0) ∑_{i∈I_0} X_{0i} X_{0i}^⊤ ˆε_{0i}(u_0)² ] ˆΨ(u_0)^{−1}, with ˆε_{0i}(u_0) = Y_{0i} − X_{0i}^⊤ ˆθ_LR(u_0);

3. V̂_DVCM(u_0) is defined in the same way as in Equation (2.17).

The result of the above corollary can readily be used for various types of inference problems. For instance, to test the null hypothesis θ(u_0) = w for a given vector w, one may consider the test statistic T_n = ‖Σ̂_TL^{−1/2}(ˆθ_TL(u_0) − w)‖²₂, which, under the null hypothesis, asymptotically follows a χ² distribution with p degrees of freedom, where p denotes the dimension of X. Furthermore, to test a linear contrast of the form v^⊤θ(u_0) = ζ for a given scalar ζ, one may use the statistic T_n = (v^⊤ˆθ_TL(u_0) − ζ)/√(v^⊤Σ̂_TL v), which converges in distribution to N(0, 1) under the null hypothesis.

Remark 3.12 A key technical ingredient in the proof of Theorem 3.10 is the derivation of the limiting distribution of ˆθ_DVCM(u_0) under the regime ρ_n → ∞. This result is formalized in Proposition A.8, stated in Appendix A.
Briefly, the proposition establishes that if the bandwidth h is chosen to satisfy condition (3.3), then, under the stated assumptions,

\[
\sqrt{nh/\gamma}\,\big(\hat\theta_{\mathrm{DVCM}}(u_0) - \theta(u_0) - b_{\mathrm{DVCM}}(u_0)\big) \;\xrightarrow{d}\; N\big(0,\;\Omega_{\mathrm{DVCM}}(u_0)\big),
\]

where

\[
\Omega_{\mathrm{DVCM}}(u_0) = \sigma^2(u_0)\,\big[\zeta_{0,1}^{-1}\,\zeta_{0,2}\,\zeta_{0,1}^{-1}\big]_{1,1}\,\Psi(u_0)^{-1}, \qquad
\zeta_{r,s} = \int \Phi_l(t)^{\otimes 2}\, t^r\, W^s(t)\, f\Big(\frac{u_0 - u^* + ht}{\gamma}\Big)\, dt,
\]

and b_DVCM(u_0) ≍ h^β denotes the bias term. Here, u* is as defined in Assumption 3.2. This result naturally generalizes the classical asymptotic theory for varying-coefficient models (VCMs), as developed in [43], by allowing multiple observations per domain value u. In particular, when the per-domain sample size satisfies n_k = 1 for all k, Proposition A.8 reduces to the standard asymptotic normality result for the classical VCM estimator of [43].

Does our choice of h make ˆθ_TL(u_0) adaptive? A natural question is whether the bandwidth choice in Equation (3.3) renders ˆθ_TL(u_0) adaptive, or whether it could lead to negative transfer. A closer inspection of our arguments (see Appendix B.2) shows that the following conclusion holds regardless of the bandwidth choice:

\[
\mathrm{MSE}_A\big(\hat\theta_{\mathrm{TL}}(u_0)\big) \;\le\; \min\Big\{ \mathrm{MSE}_A\big(\hat\theta_{\mathrm{DVCM}}(u_0)\big),\; \mathrm{MSE}_A\big(\hat\theta_{\mathrm{LR}}(u_0)\big) \Big\},
\]

as long as Q satisfies Equation (3.2) (which Q̂ satisfies with probability tending to 1, as established in Corollary 3.8). In particular, any bandwidth h satisfying Equation (3.3) still guarantees that ˆθ_TL(u_0) is free from negative transfer, since its risk never exceeds that of the target-only baseline.

3.3 Extension to generalized DVCM

In this section, we establish the theoretical properties of ˆθ_TL(u_0) under the assumption that the data are generated from a generalized linear model (see Equation (2.9)). As discussed in Section 2.2, the proposed estimation procedure is closely related to that of the linear model.
The key difference lies in replacing the squared-error loss with a more general negative log-likelihood loss, which is appropriate for the GLM framework. The assumptions are similar to those for the linear DVCM model, except that we modify Assumption 2′ as follows:

Assumption 2′′ (Modification of Assumption 2′) The distribution of (U, X, Y) is assumed to satisfy the conditions of Assumption 2′, but with the conditional second-moment matrix defined as

\[
\Psi(u) = \mathbb E\big[\, b''\big(X^\top\theta(U)\big)\, X X^\top \,\big|\, U = u \big].
\]

Furthermore, it is assumed that \(\sup_{u\in\mathcal U}\,\mathbb E\big[\,|b^{(3)}(X^\top\theta(u))|^4 \,\big|\, U = u\big]\) is uniformly bounded.

Discussion of the augmented assumptions: In Assumption 2′′, we generalize the definition of Ψ(·) (defined in Assumption 2′) by incorporating the second derivative of the mean function b(·). Note that, for the linear model, b(x) = x²/2 and hence b''(x) = 1, in which case the new definition of Ψ(·) reduces to the form given in Assumption 2′ for the linear DVCM model. Moreover, a mild bounded-moment condition on b^{(3)}(X^⊤θ(u)) is imposed, which is required to establish a central limit theorem via higher-order Taylor expansions. Such a regularity condition is standard in the literature and is commonly assumed when establishing weak convergence of the proposed estimator.

We are now ready to present our main theoretical results. As in the linear model setting, we present two main theorems: one characterizing the rate of convergence and the other describing the asymptotic normality of the proposed estimator. Our first result concerns the convergence rate of ˆθ_TL(u_0) under the generalized DVCM framework and serves as the counterpart to Theorem 3.6 in the linear case:

Theorem 3.13 Suppose Assumptions 3.1, 2′′, 3.3, and 3.4 hold. Let ˆθ_TL(u_0) be constructed as in (2.12), with the shrinkage matrix Q̂ defined in Section 2.3.
If the bandwidth is chosen as

\[
h = \mathrm{med}\Big( e_0\,(n/\gamma)^{-\frac{1}{2\beta+1}},\; d_{(1)}(u_0),\; d_{(K)}(u_0) \Big)
\]

for some constant e_0 > 0, then the adaptive transfer learning estimator ˆθ_TL(u_0) satisfies

\[
\|\hat\theta_{\mathrm{TL}}(u_0) - \theta(u_0)\|_2^2 \;=\; O_p\Big( n_0^{-1} \wedge \max\Big\{ (K/\gamma)^{-2\beta},\; (n/\gamma)^{-\frac{2\beta}{2\beta+1}},\; n^{-1} \Big\} \Big).
\]

The proof of Theorem 3.13 is deferred to Appendix B.6. This result demonstrates that, under the proposed choices of the bandwidth h and the weighting matrix Q̂, the rate of convergence of ˆθ_TL(u_0) in the GLM setting coincides with that obtained in the linear case (Theorem 3.6). The adaptivity of ˆθ_TL(u_0) is also evident: its convergence rate is never worse than n_0^{−1/2}, thereby precluding negative transfer. Moreover, when the number of sources K or the smoothness parameter β is large, or the heterogeneity parameter γ is small, the convergence rate is strictly faster than that of the target-only estimator, reflecting the ability of the method to efficiently leverage information from the relevant source domains.

Having established the rate, we next present a result on the asymptotic normality of the proposed estimator. Let ˆΨ(u_0), V̂_GLR(u_0), and V̂_GDVCM(u_0) denote consistent estimators of Ψ(u_0), Var(ˆθ_GLR(u_0)), and Var(ˆθ_GDVCM(u_0)), respectively. Then the following asymptotic normality result holds for ˆθ_TL(u_0):

Theorem 3.14 Let Q̂ be a pre-specified positive semidefinite matrix. Suppose Assumptions 3.1, 2′′, 3.3, and 3.4 hold and the bandwidth h satisfies Equation (3.3). The transfer learning estimator ˆθ_TL(u_0) is asymptotically normal:

\[
\hat\Sigma_{\mathrm{TL}}^{-1/2}\,\big(\hat\theta_{\mathrm{TL}}(u_0) - \theta(u_0)\big) \;\xrightarrow{d}\; N(0, I),
\]

where the asymptotic covariance estimator Σ̂_TL is given by

\[
\hat\Sigma_{\mathrm{TL}} = B_Q^{-1}\,\hat Q\,\hat V_{\mathrm{GDVCM}}(u_0)\,\hat Q\,B_Q^{-1} \;+\; B_Q^{-1}\,\hat\Psi(u_0)\,\hat V_{\mathrm{GLR}}(u_0)\,\hat\Psi(u_0)\,B_Q^{-1}, \qquad B_Q = \hat\Psi(u_0) + \hat Q.
\]

The proof of the above theorem can be found in Appendix B.7.
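Given a fitted vector and a covariance estimate of the form Σ̂_TL, the Wald-type tests described in Section 3.2 take only a few lines. The sketch below is ours: the numerical inputs are hypothetical placeholders, and scipy is used only for the reference χ² and normal distributions.

```python
import numpy as np
from scipy.stats import chi2, norm

def wald_test(theta_hat, Sigma_hat, w):
    """Chi-squared test of H0: theta(u0) = w with p degrees of freedom,
    based on T_n = ||Sigma^{-1/2} (theta_hat - w)||_2^2."""
    L = np.linalg.cholesky(Sigma_hat)
    z = np.linalg.solve(L, theta_hat - w)   # z'z = (th-w)' Sigma^{-1} (th-w)
    T = float(z @ z)
    return T, float(chi2.sf(T, df=len(theta_hat)))

def contrast_test(theta_hat, Sigma_hat, v, zeta):
    """Gaussian test of the linear contrast H0: v' theta(u0) = zeta."""
    T = float((v @ theta_hat - zeta) / np.sqrt(v @ Sigma_hat @ v))
    return T, float(2 * norm.sf(abs(T)))    # two-sided p-value

theta_hat = np.array([0.4, -0.1])           # hypothetical fitted coefficients
Sigma_hat = 0.01 * np.eye(2)                # hypothetical covariance estimate
T_chi, p_chi = wald_test(theta_hat, Sigma_hat, w=np.zeros(2))
T_z, p_z = contrast_test(theta_hat, Sigma_hat, v=np.array([1.0, 0.0]), zeta=0.0)
```

The Cholesky-based whitening gives the same quadratic form as multiplying by Σ̂_TL^{−1/2}, which avoids forming a matrix square root explicitly.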
This result extends Corollary 3.11 to the GDVCM framework. Since the inference procedure relies on Σ̂_TL, which in turn depends on consistent estimation of Ψ(u_0), Var(ˆθ_GLR(u_0)), and Var(ˆθ_GDVCM(u_0)), we next describe consistent estimators for these quantities, obtained as empirical analogues of their population definitions:

1. ˆΨ(u_0) = (1/n_0) ∑_{i∈I_0} b''(X_{0i}^⊤ ˆθ_GLR(u_0)) X_{0i} X_{0i}^⊤;

2. V̂_GLR(u_0) = ˆΨ(u_0)^{−1} [ (1/n_0) ∑_{i∈I_0} X_{0i} X_{0i}^⊤ (Y_{0i} − μ̂_{0i})² ] ˆΨ(u_0)^{−1}, with μ̂_{0i} = b'(X_{0i}^⊤ ˆθ_GLR(u_0));

3. V̂_GDVCM(u_0) is defined in the same way as in Equation (2.17).

By a standard application of the law of large numbers, the proposed estimators are consistent, which in turn guarantees the validity of the resulting inferential procedures. As illustrated in the linear model setting in Section 3.2, the above asymptotic normality result can be directly employed to conduct inference in a variety of testing problems. These include, for example, testing pointwise hypotheses of the form θ(u_0) = w, as well as more general linear constraints on θ(u_0), such as Rθ(u_0) = r for a given matrix R and vector r (e.g., testing whether a particular coordinate or linear combination equals zero).

4 Simulation experiments

In this section, we present various numerical experiments to support and illustrate our theoretical results. We investigate three distinct models: linear regression, logistic regression, and Poisson regression. Across these settings, we examine several key properties of our estimator (e.g., its rate of convergence, sensitivity to bandwidth selection, robustness under varying levels of similarity between the source and target domains, and asymptotic normality) by varying factors such as the sample size (n and n̄), the number of domains (K), and the heterogeneity among domain identifiers (γ).
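For reference, the three response families used in our simulations correspond to the following negative log-likelihood losses ℓ(η, y), written up to additive constants in y; this is a minimal sketch (the function names are ours), with η = x^⊤θ the linear predictor.

```python
import numpy as np

# Negative log-likelihood losses l(eta, y) for the three response families,
# up to additive constants in y; eta = x' theta is the linear predictor.
def loss_linear(eta, y):
    return 0.5 * (eta - y) ** 2              # Gaussian: (eta - y)^2 / 2

def loss_logistic(eta, y):
    return np.logaddexp(0.0, eta) - y * eta  # Bernoulli: log(1 + e^eta) - y*eta

def loss_poisson(eta, y):
    return np.exp(eta) - y * eta             # Poisson: e^eta - y*eta
```

Each loss is convex in η, so the (generalized) DVCM objective below can be minimized by standard convex solvers.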
The (generalized) DVCM estimators considered in this section are local linear estimators, i.e., we set l = 1 in Equation (2.11).

Data generation: We use the following data-generating setup for our simulation studies:

1. We generate U_1, ..., U_K ∼ Unif(−γ/2, γ/2), i.e., a centered uniform distribution of length γ. We vary the value of γ to control the degree of heterogeneity among the domain identifiers. The target u_0 is fixed to be 0.

2. We set X_{ki} ∈ ℝ^p, where the first coordinate is the intercept and the other coordinates are generated from N(0, Σ) for all 1 ≤ k ≤ K, 1 ≤ i ≤ n_k, with Σ_{ij} = 0.7^{|i−j|}. (The choice of p will be specified later.)

3. The true parameter vector is specified as θ(u) = (θ_0(u), ..., θ_{p−1}(u)), where θ_0(u) = −tanh(16(u − 0.2)) + g(u), θ_1(u) = exp(5u + 2.5)/100 − 0.5 + g(u), and θ_j(u) = (−0.5)^{j−1} exp(2u) for j ≥ 2. The additional term g(u) = u³·sign(u) is included to ensure that θ(·) possesses a continuous second derivative but a discontinuous third derivative. This construction also guarantees that the linear predictor X^⊤θ(u) remains in a reasonable range and, in the binary response setting (defined below), that the success probability P(Y = 1 | X, U) is not too close to 0 or 1.

4. The response variable Y is generated as:
   • For linear regression: Y_{ki} = X_{ki}^⊤ θ(U_k) + 0.5 × ε_{ki}, with ε_{ki} ∼ N(0, 1).
   • For logistic regression: Y_{ki} ∼ Ber(σ(X_{ki}^⊤ θ(U_k))), with σ(x) = (1 + e^{−x})^{−1}.
   • For Poisson regression: Y_{ki} ∼ Poi(exp(X_{ki}^⊤ θ(U_k))).

Estimation procedure: For each of the three response-generating mechanisms, we obtain the maximum likelihood estimator by minimizing the negative log-likelihood, which serves as our loss function.
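Steps 1–4 above can be reproduced in a few lines. The sketch below implements the linear-response branch only; the function names, default arguments, and seed are ours, chosen for illustration (note that g(u) = u³·sign(u) = |u|³).

```python
import numpy as np

def theta_of_u(u, p=4):
    """Coefficient functions from step 3; g(u) = u^3 * sign(u) = |u|^3."""
    g = abs(u) ** 3
    th = [-np.tanh(16 * (u - 0.2)) + g,
          np.exp(5 * u + 2.5) / 100 - 0.5 + g]
    th += [(-0.5) ** (j - 1) * np.exp(2 * u) for j in range(2, p)]
    return np.array(th)

def simulate_linear(K=5, n_k=120, gamma=1.0, p=4, seed=0):
    """One draw of the source domains under the linear-response design (steps 1-4)."""
    rng = np.random.default_rng(seed)
    U = rng.uniform(-gamma / 2, gamma / 2, size=K)        # domain identifiers, step 1
    idx = np.arange(p - 1)
    Sigma = 0.7 ** np.abs(np.subtract.outer(idx, idx))    # Sigma_ij = 0.7^{|i-j|}, step 2
    data = []
    for Uk in U:
        X = np.column_stack([np.ones(n_k),                # intercept column
                             rng.multivariate_normal(np.zeros(p - 1), Sigma, size=n_k)])
        Y = X @ theta_of_u(Uk, p) + 0.5 * rng.normal(size=n_k)  # step 4, linear case
        data.append((Uk, X, Y))
    return data

data = simulate_linear()
```

The logistic and Poisson branches differ only in the final line, replacing the Gaussian noise with `rng.binomial(1, sigmoid(X @ theta))` or `rng.poisson(np.exp(X @ theta))`, respectively.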
We compute three different estimators in our simulation studies: (i) the target-only estimator ˆθ_GLR(u_0) defined in (2.10); (ii) the DVCM estimator ˆθ_DVCM(u_0) defined in (2.11); and (iii) the transfer learning estimator ˆθ_TL(u_0) in (2.12). We perform a data split on the target domain to make ˆθ_DVCM(u_0) and ˆθ_GLR(u_0) independent. For ˆθ_TL(u_0), we compute Q̂ via the method in Section 2.3.

4.1 Bandwidth sensitivity analysis

Performance across γ. First, we vary γ while keeping all other parameters fixed. Recall that γ controls the dispersion of the domain identifiers U_k around the target u_0. When γ is small, the domains are concentrated near u_0, so the source domains are informative for the target. When γ is large, the domains are more dispersed and become increasingly irrelevant. Consequently, the DVCM estimator is expected to perform better when γ is small, whereas the GLR estimator should dominate when γ is large. The transfer learning estimator ˆθ_TL(u_0) is designed to adapt between these regimes. We consider the setting p = 4 (dimension of X), n_S = 600 (total number of source samples), n_0 = 50 (target samples), and K = 5 (number of source domains). Figure 1 reports the MSE E‖ˆθ(u_0) − θ(u_0)‖²₂ of the three estimators ˆθ_DVCM, ˆθ_GLR, and ˆθ_TL, based on 200 simulations, for γ ∈ {0.5, 1, 1.5} across a range of bandwidths h. The left, middle, and right columns correspond to the linear, logistic, and Poisson regression models, respectively, while the upper, middle, and lower rows correspond to γ = 0.5, 1, and 1.5, respectively. Since ˆθ_GLR(u_0) is a target-only estimator independent of the bandwidth h, its MSE remains constant as h varies and therefore appears as a flat line. For the linear model (left column), when γ = 0.5, the domains are highly related and ˆθ_DVCM(u_0) achieves the smallest MSE.
In contrast, the target-only estimator ˆθ_GLR(u_0) underperforms in this regime because it ignores the informative source data. The transfer learning estimator ˆθ_TL(u_0) closely tracks ˆθ_DVCM(u_0) and inherits its advantage. When γ = 1, the source domains are moderately close to the target. The estimator ˆθ_DVCM(u_0) performs well for smaller bandwidths but deteriorates as h increases, while ˆθ_GLR(u_0) is stable as it does not depend on the choice of bandwidth. The adaptive estimator ˆθ_TL(u_0) adapts between the best of the two and remains near-optimal across bandwidth choices. Finally, when γ = 1.5, the source domains are less relevant, and ˆθ_DVCM(u_0) suffers from substantial bias. In this regime, ˆθ_GLR(u_0) outperforms the pooled estimator. The transfer learning estimator aligns with ˆθ_GLR(u_0) and again achieves comparable performance across bandwidths. The logistic and Poisson models (middle and right columns) exhibit the same qualitative pattern: ˆθ_DVCM(u_0) dominates when γ is small, ˆθ_GLR(u_0) dominates when γ is large, and ˆθ_TL(u_0) adaptively tracks the better estimator in each regime. Overall, the figure demonstrates that ˆθ_TL(u_0) consistently achieves the lowest MSE across bandwidths, values of γ, and model families by adaptively combining the strengths of ˆθ_GLR(u_0) and ˆθ_DVCM(u_0).

Performance across K. We next vary K ∈ {5, 10, 15} and compare the performance of the three estimators as before. The results are summarized in Figure 2, where we plot the MSE as a function of the bandwidth h under the linear, logistic, and Poisson models. Throughout these simulations, we fix p = 4, n̄ = 120 (i.e., 120 observations per source domain), n_0 = 50, and γ = 1. The qualitative conclusions are similar to those in the previous setup. As before, ˆθ_GLR(u_0) is independent of the bandwidth and therefore appears as a flat line.
In contrast, ˆθ_DVCM(u_0) relies heavily on the bandwidth choice; selecting h either too small or too large leads to a suboptimal MSE. The proposed estimator ˆθ_TL(u_0) adaptively combines the strengths of these two approaches. When ˆθ_DVCM(u_0) achieves a smaller MSE than ˆθ_GLR(u_0), the estimator ˆθ_TL(u_0) attains an MSE that is very close to, and occasionally even smaller than, that of ˆθ_DVCM(u_0). Conversely, when the MSE of ˆθ_DVCM(u_0) exceeds that of ˆθ_GLR(u_0) due to a suboptimal bandwidth choice, the performance of ˆθ_TL(u_0) automatically aligns with that of ˆθ_GLR(u_0). These experiments clearly demonstrate the adaptive nature of the proposed method.

4.2 Asymptotic normality

In this section, we present simulation results illustrating the asymptotic normality of ˆθ_TL(u_0), as established in Theorem 3.10. For simplicity, we focus exclusively on the linear data-generating model. Recall that ρ_n denotes the relative efficiency ratio between ˆθ_LR(u_0) and ˆθ_DVCM(u_0), and that the proposed estimator ˆθ_TL(u_0) adapts to the better of the two procedures, achieving asymptotic normality in both regimes (ρ_n ↓ 0 or ρ_n ↑ ∞). Here, we numerically demonstrate the asymptotic normality of ˆθ_TL(u_0) in both of these regimes. To simulate the case ρ_n → 0, we set (K, γ) = (5, 5) with n̄ = 100 and n_0 = 50, and for ρ_n → ∞, we set (K, γ) = (30, 0.1) while keeping n̄ and n_0 unchanged.
Figure 1: MSE of the estimators across different h and γ, with (n, n_0, K) fixed. The left, middle, and right panels show the MSE of the linear, logistic, and Poisson-based estimators, while the upper, middle, and lower panels correspond to the cases γ = 0.5, 1, and 1.5, respectively.
Figure 2: MSE of the estimators across different h and K, with (n̄, n_0, γ) fixed. The left, middle, and right panels show the MSE of the linear, logistic, and Poisson-based estimators, while the upper, middle, and lower panels correspond to the cases K = 5, 10, and 15, respectively.

Figure 3: Histograms of the normalized estimators ˇθ_j(u_0) = (ˆθ_TL,j(u_0) − θ_j(u_0)) / ŜE(ˆθ_TL,j(u_0)) in the regimes ρ_n → 0 and ρ_n → ∞, overlaid with the standard normal density. The panel legends report the Kolmogorov–Smirnov p-values (ρ_n → 0: 0.81, 0.19, 0.05, 0.52 for ˇθ_0 through ˇθ_3; ρ_n → ∞: 0.23, 0.45, 0.28, 0.86).

As predicted by Theorem 3.10, the normalized estimator ˆθ_TL(u_0) should converge in distribution to the standard normal in both of these regimes. The standard error ŜE(ˆθ_TL,j(u_0)) is computed as the square root of the variance estimator proposed in Corollary 3.11. Each coordinate is standardized as
$$\check\theta_j(u_0) = \frac{\hat\theta_{TL,j}(u_0) - \theta_j(u_0)}{\widehat{\mathrm{SE}}\big(\hat\theta_{TL,j}(u_0)\big)}, \qquad j = 0, 1, 2, 3.$$
Figure 3 displays the histograms of the standardized estimators based on 200 Monte Carlo replications under the linear model. The four columns correspond to ˇθ_0 through ˇθ_3, while the upper and lower rows represent the regimes ρ_n → 0 and ρ_n → ∞, respectively.
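The standardization-and-KS workflow can be sketched as follows. This is a minimal stand-in, not the paper's code: the 200 draws below are synthetic standard normals playing the role of the standardized replicates ˇθ_j(u_0), and the KS statistic is computed directly rather than with a library routine.

```python
import numpy as np
from math import erf, sqrt

def ks_stat_vs_normal(z):
    """One-sample Kolmogorov-Smirnov statistic of z against N(0, 1)."""
    z = np.sort(np.asarray(z, dtype=float))
    n = len(z)
    # standard normal CDF evaluated at the order statistics
    phi = np.array([0.5 * (1.0 + erf(v / sqrt(2.0))) for v in z])
    d_plus = np.max(np.arange(1, n + 1) / n - phi)
    d_minus = np.max(phi - np.arange(0, n) / n)
    return max(d_plus, d_minus)

# synthetic stand-in for 200 standardized replicates (theta_hat - theta) / SE_hat
rng = np.random.default_rng(0)
z = rng.standard_normal(200)
d_n = ks_stat_vs_normal(z)
# under normality, d_n should fall below the 5% critical value, roughly 1.36 / sqrt(n)
```

Rejecting when d_n exceeds the critical value corresponds to a p-value below 0.05 in the figure legends.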
We further statistically test the normality using the Kolmogorov–Smirnov test, and the corresponding p-values are reported in the histogram legends. Across all panels, the empirical distributions closely resemble the standard normal law, providing strong visual support for the theoretical results. Moreover, all reported p-values exceed 0.05, offering additional empirical evidence for the asymptotic normality of ˇθ_j(u_0).

4.3 Phase transition in the rate of estimation

In this subsection, we demonstrate the phase transition in the rate of convergence of ˆθ_TL(u_0). Recall that we established in Theorem 3.6 that
$$\mathrm{MSE}\big(\hat\theta_{TL}(u_0)\big) \lesssim n_0^{-1} \wedge \max\Big\{ (K/\gamma)^{-2\beta},\; (n/\gamma)^{-\frac{2\beta}{2\beta+1}},\; n^{-1} \Big\}. \tag{4.1}$$
It follows immediately that, depending on the choice of (K, n, n_0, γ, β), the rate of ˆθ_TL(u_0) transitions between different regimes. The goal of this subsection is to numerically illustrate this phase transition by varying these parameters. We divide our presentation into three parts, depending on whether we vary K, γ, or n. To visualize the convergence rates, we present log–log plots with the logarithm of the varying parameter (i.e., K, γ, or n) on the X-axis and the logarithm of the MSE on the Y-axis. The slope of each segment in these plots can be interpreted as the convergence rate. Throughout the simulations, the shrinkage matrix Q is set to its oracle value, and we fix the number of covariates at p = 2.

Figure 4: Log–log plot of the MSE of ˆθ_TL as a function of K, while keeping (n̄_S, γ) fixed. The fitted segment slopes are −4.06 and −0.89 (linear), −3.92 and −0.88 (logistic), and −4.62 and −0.89 (Poisson).
Vertical dotted lines indicate empirical breakpoints that mark phase transitions in the convergence behavior. The left, middle, and right panels correspond to the linear, logistic, and Poisson models, respectively.

Phase transition by varying K. In this part, we vary K to highlight its effect on the phase transition in the convergence rate of ˆθ_TL(u_0). We assume γ ≫ n^{−1/(2β)} in this setting. Under this regime, the MSE of ˆθ_TL(u_0) satisfies
$$\mathrm{MSE}\big(\hat\theta_{TL}(u_0)\big) \lesssim \begin{cases} n_0^{-1}, & K \lesssim \gamma\, n_0^{1/(2\beta)}, \\[2pt] (\gamma/K)^{2\beta}, & \gamma\, n_0^{1/(2\beta)} \lesssim K \lesssim \gamma^{\frac{2\beta}{2\beta+1}}\, n^{1/(2\beta+1)}, \\[2pt] (\gamma/n)^{\frac{2\beta}{2\beta+1}}, & K \gtrsim \gamma^{\frac{2\beta}{2\beta+1}}\, n^{1/(2\beta+1)}. \end{cases}$$
Since γ ≫ n^{−1/(2β)} implies (γ/n)^{2β/(2β+1)} ≫ n^{−1}, the rate exhibits three distinct phases on a log–log scale with respect to K: (i) a flat region at level n_0^{−1} for small K (as the rate does not depend on K), (ii) a linear region with slope −2β, and (iii) another linear region with slope −2β/(2β+1) for large K. As a numerical validation, we conduct simulations under the same data-generating process across the linear, logistic, and Poisson models, this time varying K while fixing n̄_S = 1500, γ = 0.1, and n_0 = 30. For computing ˆθ_DVCM(u_0), the bandwidth is chosen as described in Equation (3.1). Figure 4 presents the resulting log–log plot of log MSE against log K. As expected, we observe three distinct phases in the plot, and the corresponding empirical slopes align closely with the theoretical predictions of −2β = −4 and −2β/(2β+1) = −0.8.

Phase transition by varying γ. We now illustrate the phase transition in the convergence rate of ˆθ_TL(u_0) by varying γ.
Based on Theorem 3.6, the rate can be decomposed as follows:
$$\mathrm{MSE}\big(\hat\theta_{TL}(u_0)\big) \lesssim \begin{cases} n^{-1}, & \gamma \lesssim n^{-1/(2\beta)}, \\[2pt] n^{-\frac{2\beta}{2\beta+1}} \gamma^{\frac{2\beta}{2\beta+1}}, & n^{-1/(2\beta)} \lesssim \gamma \lesssim \min\big\{ K^{\frac{2\beta+1}{2\beta}} n^{-1/(2\beta)},\; n\, n_0^{-\frac{2\beta+1}{2\beta}} \big\}, \\[2pt] n_0^{-1}, & \min\big\{ K^{\frac{2\beta+1}{2\beta}} n^{-1/(2\beta)},\; n\, n_0^{-\frac{2\beta+1}{2\beta}} \big\} \lesssim \gamma \lesssim K^{\frac{2\beta+1}{2\beta}} n^{-1/(2\beta)}, \\[2pt] K^{-2\beta} \gamma^{2\beta}, & K^{\frac{2\beta+1}{2\beta}} n^{-1/(2\beta)} \lesssim \gamma \lesssim K\, n_0^{-1/(2\beta)}, \\[2pt] n_0^{-1}, & \gamma \gtrsim K\, n_0^{-1/(2\beta)}. \end{cases} \tag{4.2}$$
In this simulation study, we assume n̄ ≫ n_0, i.e., the target sample size is much smaller than the average source sample size. Under this regime, we have K^{(2β+1)/(2β)} n^{−1/(2β)} ≪ n n_0^{−(2β+1)/(2β)}, which eliminates the third phase in Equation (4.2). Consequently, the log–log plot (with log γ on the X-axis and log MSE(ˆθ_TL(u_0)) on the Y-axis) exhibits four distinct phases: (i) a flat region at level n^{−1} for small γ; (ii) a linear growth with slope 2β/(2β+1); (iii) a second linear growth with slope 2β; and (iv) a flat region at level n_0^{−1} for large γ. Figure 5 illustrates this behavior. We set the average source sample size to n̄_S = 600, the target sample size to n_0 = 30, and use the oracle choice of Q. The number of domains is set to K = 12 for the linear model and K = 10 for the other settings.

Figure 5: Log–log plot of the MSE of ˆθ_TL as a function of γ, while keeping (n, K) fixed. Vertical dotted lines indicate empirical breakpoints that mark phase transitions in the convergence behavior. The left, middle, and right panels correspond to the linear, logistic, and Poisson models, respectively; the fitted slopes in the two linear phases are 0.77 and 3.84 (linear), 0.79 and 3.88 (logistic), and 0.81 and 4.04 (Poisson).
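For intuition about these slopes, the unified bound (4.1) can be evaluated numerically. The sketch below uses illustrative parameter values (K = 10, n = 6000, n_0 = 30, β = 2), not the paper's exact simulation settings, and recovers the theoretical slopes 2β/(2β+1) = 0.8 and 2β = 4 by finite differences on the log–log curve.

```python
import numpy as np

def mse_bound(gamma, K, n, n0, beta):
    """Unified upper bound (4.1): n0^{-1} AND max of (gamma/K)^{2b}, (gamma/n)^{2b/(2b+1)}, n^{-1}."""
    return min(1.0 / n0,
               max((gamma / K) ** (2 * beta),
                   (gamma / n) ** (2 * beta / (2 * beta + 1)),
                   1.0 / n))

def local_slope(gamma, **kw):
    """Finite-difference slope of log(bound) versus log(gamma)."""
    b1, b2 = mse_bound(gamma, **kw), mse_bound(1.1 * gamma, **kw)
    return (np.log(b2) - np.log(b1)) / np.log(1.1)

pars = dict(K=10, n=6000, n0=30, beta=2)
s_mid = local_slope(1.0, **pars)   # (gamma/n)-dominated regime: slope 2b/(2b+1) = 0.8
s_fast = local_slope(3.0, **pars)  # (gamma/K)-dominated regime: slope 2b = 4
```

At γ = 1 the pooled-variance term dominates, while at γ = 3 the bias term (γ/K)^{2β} dominates, reproducing the two linear growth regimes seen in Figure 5.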
The functional coefficients are specified as θ_0(u) = θ_1(u) = tanh(8(u − 0.2)). As predicted by the theory, the figure displays clear transitions across the four regimes, with the empirical slopes in the two linear phases closely matching the theoretical values 2β/(2β+1) = 0.8 and 2β = 4 (for β = 2).

Phase transition by varying n̄_S. We now fix (n_0, K, γ) and vary the average source sample size n̄_S, so that the total source sample size is n_S = K n̄_S and the overall sample size is n = n_0 + n_S. Under the same condition γ ≫ n^{−1/(2β)}, the MSE of ˆθ_TL(u_0) exhibits the following phase transition in its convergence rate:
$$\mathrm{MSE}\big(\hat\theta_{TL}(u_0)\big) \lesssim \begin{cases} n_0^{-1}, & n \lesssim \gamma\, n_0^{\frac{2\beta+1}{2\beta}}, \\[2pt] (\gamma/n)^{\frac{2\beta}{2\beta+1}}, & \gamma\, n_0^{\frac{2\beta+1}{2\beta}} \lesssim n \lesssim \gamma^{-2\beta} K^{2\beta+1}, \\[2pt] n_0^{-1} \wedge (\gamma/K)^{2\beta}, & n \gtrsim \gamma^{-2\beta} K^{2\beta+1}. \end{cases}$$
Consequently, on a log–log scale (with log n on the X-axis and log MSE(ˆθ_TL(u_0)) on the Y-axis), the curve exhibits three distinct phases: (i) a flat region at level n_0^{−1} for small n; (ii) a linear regime with slope −2β/(2β+1); and (iii) a second plateau at level n_0^{−1} ∧ (γ/K)^{2β} for sufficiently large n. Figure 6 illustrates this behavior with γ = 0.1 fixed while varying n̄_S. We observe a short flat region followed by a linear regime whose empirical slope is close to −2β/(2β+1) = −0.8 (for β = 2). The third plateau emerges only when n becomes extremely large, which is consistent with the transition scale γ^{−2β} K^{2β+1}. Across the three models, we use K = 10 (linear), K = 3 (logistic), and K = 2 (Poisson), reflecting differing saturation behaviors in the third phase.
Figure 6: Log–log plot of the MSE as a function of the sample size n. The vertical dotted lines represent the breakpoints for the phase transition. The left, middle, and right panels correspond to the linear, logistic, and Poisson models, respectively; the fitted slopes in the linear regime are −0.83 (linear), −0.86 (logistic), and −0.83 (Poisson).

5 Real data analysis

In this section, we apply our methods to two real datasets to illustrate the performance of our proposed estimator. The first dataset is the SLID-Ontario dataset (Subsection 5.1); it is an economic dataset in which we predict a person's composite hourly wage from their demographic attributes. The second dataset is the US Adult Income dataset (Subsection 5.2), where the goal is to predict whether a person's yearly wage exceeds $50,000 based on various attributes. In both studies, we take U (the domain identifier) to be the years of employment, as this typically determines an individual's base salary. For simplicity of implementation, we include two covariates in both studies: i) X_1 is gender, a binary variable (1 if female, 0 if male), and ii) X_2 is years of education, as these two variables are known to affect income significantly. Mathematically speaking, we fit the following (generalized) linear model:
$$g\big( E(Y \mid X) \big) = \theta_0(U) + \theta_1(U) X_1 + \theta_2(U) X_2, \tag{5.1}$$
where g is a link function and Y is the response variable (the composite hourly wage in the SLID-Ontario dataset, and the indicator of whether annual income exceeds $50,000 in the Adult Income dataset). Since some values of U are unrealistic outliers, such as negative years of employment, we remove all data points where U falls outside the "3σ" region. The retention rates are 99.97% (3996/3997) for SLID-Ontario and 99.53% (48615/48842) for US Adult Income.
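The "3σ" screening above admits a one-line implementation. The following sketch is our own (the array name u and the use of the plain sample mean and standard deviation are assumptions, since the paper does not spell out the rule): it keeps observations within three standard deviations of the mean.

```python
import numpy as np

def three_sigma_filter(u):
    """Boolean mask keeping observations with |u - mean(u)| <= 3 * std(u)."""
    u = np.asarray(u, dtype=float)
    mu, sd = u.mean(), u.std()
    return np.abs(u - mu) <= 3 * sd
```

Applied to the raw years-of-employment variable, the mask then indexes the retained rows of the dataset.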
5.1 Application 1: Survey of Labour and Income Dynamics in Ontario

The first dataset we analyze is a public-use sample from the 1994 Survey of Labour and Income Dynamics in Ontario (SLID-Ontario) [44]. The dataset contains information on four attributes for 3997 individuals: age, gender, years of education, and composite hourly wage. In this application, we study the predictive relationship between the logarithm of the composite hourly wage (the response variable Y) and the other covariates. Following [44], we use the log-transformed wage to mitigate non-normality. We employ the following domain-varying-coefficient model:
$$E[Y \mid X, U] = \theta_0(U) + \theta_1(U) X_1 + \theta_2(U) X_2, \tag{5.2}$$
where X_1 is a gender indicator and X_2 denotes years of education. We approximate the years of employment U by U = age − years of education − 6, assuming individuals begin schooling at age six and enter the workforce immediately after graduation. To ensure scale invariance and comparability across domains, we normalize U via the min–max transformation
$$U \leftarrow \frac{U - \min_{1 \le i \le n} U_i}{\max_{1 \le i \le n} U_i - \min_{1 \le i \le n} U_i}, \tag{5.3}$$
so that the domain identifiers lie in [0, 1]. To construct source and target domains, we discretize U into ten bins [0, 0.1], (0.1, 0.2], ..., (0.9, 1], and map each U_i to the midpoint of its bin. Let U* = {0.05, 0.15, ..., 0.95} denote the set of bin midpoints. Each midpoint defines a domain identifier. Specifically, domain j consists of all observations satisfying U_i ∈ ((j − 1)/10, j/10], and its associated identifier is U*_j = (j − 0.5)/10. For each u_0 ∈ U*, we designate the corresponding domain as the target domain D_0 and treat the remaining observations as the source domain D_S. Our objective is to evaluate predictive performance on the target domain and assess whether borrowing information from nearby domains improves accuracy without inducing negative transfer.
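The min–max scaling (5.3) and midpoint binning can be sketched as follows. The bin convention below (first bin closed at 0, remaining bins left-open) mirrors the partition [0, 0.1], (0.1, 0.2], ..., (0.9, 1]; all function and variable names are ours.

```python
import numpy as np

def normalize_and_bin(u, n_bins=10):
    """Min-max scale u to [0, 1], then map each value to its bin midpoint."""
    u = np.asarray(u, dtype=float)
    u01 = (u - u.min()) / (u.max() - u.min())
    # bin j = ((j-1)/n_bins, j/n_bins] has midpoint (j - 0.5)/n_bins;
    # clipping sends the boundary value 0 into the first bin, as in [0, 0.1]
    idx = np.clip(np.ceil(u01 * n_bins).astype(int), 1, n_bins)
    midpoints = (idx - 0.5) / n_bins
    return u01, midpoints
```

The midpoints returned here play the role of the domain identifiers U*_j in the subsequent analysis.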
Towards that goal, we randomly split the target domain (of size m) into three equal parts {D_0^j}_{j=1}^3, where D_0^1 ∪ D_0^2 is used for training and D_0^3 is reserved for testing. In our experiment, we compare the performance of three estimators:

1. The target-only baseline ˆθ_LR(u_0), fitted using D_0^1 ∪ D_0^2.
2. The nonparametric DVCM estimator ˆθ_DVCM(u_0), computed from the pooled data D_S ∪ D_0^1 ∪ D_0^2 via local polynomial regression.
3. The adaptive transfer-learning estimator ˆθ_TL(u_0).

To construct ˆθ_TL(u_0), we first compute a pilot estimator ˜θ_DVCM(u_0) using D_S ∪ D_0^1. We then fine-tune this estimate using D_0^2 by solving
$$\hat\theta_{TL}(u_0) = \arg\min_{\alpha} \frac{1}{m/3} \sum_{(X_{ki}, Y_{ki}) \in D_0^2} \big( Y_{ki} - X_{ki}^\top \alpha \big)^2 + \big\| \alpha - \tilde\theta_{DVCM}(u_0) \big\|_{\hat Q}^2.$$
The data-splitting ensures independence between the pilot estimator and the refinement step, as discussed in Section 3. The penalty matrix ˆQ is estimated as in (2.14). We then evaluate the predictive performance on the test set D_0^3 using
$$\mathrm{MSE}(u_0, \hat\theta) = \frac{1}{m/3} \sum_{(X_i, Y_i) \in D_0^3} \big( X_i^\top \hat\theta(u_0) - Y_i \big)^2.$$
To reduce variability due to random splitting, we repeat the procedure ten times and report the average MSE. Each value in U* is treated in turn as the target domain, yielding a trajectory of MSE(u_0, ˆθ) across u_0. The three estimators {ˆθ_LR, ˆθ_DVCM, ˆθ_TL} produce the trajectories shown in Figure 7 (left panel), while the right panel displays the distribution of the unbinned U_i values. The exact numerical MSEs can be found in Appendix E. Several patterns emerge from Figure 7. First, the target-only estimator ˆθ_LR(u_0) exhibits a large MSE near the right boundary (u_0 ≈ 1), reflecting data scarcity in that region, as seen in the histogram. Second, ˆθ_DVCM(u_0) performs worse near the left boundary, where the target domain itself contains abundant data and therefore the target-only baseline is already a strong predictor.
Last but not least, the adaptive estimator ˆθ_TL(u_0) automatically tracks the better of the two estimators: when u_0 is small, it behaves similarly to ˆθ_LR(u_0); when u_0 is large, it aligns more closely with ˆθ_DVCM(u_0). Across all target domains, ˆθ_TL(u_0) achieves the most stable and favorable performance, corroborating both our theoretical results and simulation findings.

5.2 Application 2: US Adult Income

The US Adult Income dataset (also known as the "Census Income" or "Adult" dataset) contains demographic attributes and income levels for 48,842 individuals from the 1994 US Census. It is widely used in the machine learning literature, particularly in studies of classification performance and algorithmic fairness (e.g., see [45, 46, 47]), where the goal is to predict whether an individual earns more than $50,000 per year. To maintain consistency with the previous subsection, we select three covariates (age, gender, and years of education) to predict the binary response variable Y. We model the response using the generalized linear DVCM
$$P(Y = 1 \mid X_1, X_2, U) = \frac{1}{1 + \exp\big( -\{ \theta_0(U) + \theta_1(U) X_1 + \theta_2(U) X_2 \} \big)}. \tag{5.4}$$
Here, Y = 1 indicates an annual income above $50,000, and Y = 0 otherwise. As in Section 5.1, we approximate the years of employment U and construct domain identifiers using the same binning and scaling procedure, resulting in 10 domains indexed by U* = {0.05, 0.15, ..., 0.95}. For a given u_0 ∈ U*, the observations with U*_k = u_0 form the target domain D_0 (of size m), while the remaining observations constitute the source domain D_S. The target domain is randomly split into training subsets D_0^1 and D_0^2 and a test subset D_0^3, each of size m/3.
Here also, we compare three estimators: i) the target-only baseline ˆθ_GLR(u_0) (constructed using D_0^1 ∪ D_0^2), ii) the nonparametric generalized DVCM estimator ˆθ_DVCM(u_0) (constructed using D_S ∪ D_0^1 ∪ D_0^2), and iii) our proposed transfer-learning estimator ˆθ_TL(u_0), which, as in the previous subsection, is constructed in two steps: first, we compute a pilot nonparametric DVCM estimator ˜θ_DVCM(u_0) using D_S ∪ D_0^1; we then refine this estimate on D_0^2 by solving
$$\hat\theta_{TL}(u_0) = \arg\min_{\alpha} \frac{1}{m/3} \sum_{(X_{ki}, Y_{ki}) \in D_0^2} \ell\big( X_{ki}^\top \alpha, Y_{ki} \big) + \frac{1}{2} \big\| \alpha - \tilde\theta_{DVCM}(u_0) \big\|_{\hat Q}^2,$$
where ℓ(·, ·) denotes the cross-entropy loss and the penalty matrix ˆQ is chosen as in Equation (2.14). To reduce variability due to random splitting, we repeat the procedure ten times and report the average cross-entropy loss on the test set D_0^3. Each value in U* is treated in turn as the target domain, yielding trajectories of predictive error across u_0. The three estimators {ˆθ_GLR, ˆθ_DVCM, ˆθ_TL} produce the curves shown in Figure 8 (left panel), while the right panel displays the distribution of the unbinned U_i values. Exact numerical results are provided in Appendix E. The qualitative behavior mirrors that observed in Section 5.1. The logistic regression estimator ˆθ_GLR(u_0) performs particularly well near the left endpoint (u_0 ≈ 0), whereas in other regions the performance of the nonparametric estimator ˆθ_DVCM(u_0) is on par. The adaptive estimator ˆθ_TL(u_0) consistently aligns with the better-performing method across domains, highlighting its ability to automatically balance between pooling and target-only learning.

6 Conclusion and future work

We study multi-source transfer learning under posterior drift, where the conditional relationship between response and covariates varies across environments indexed by a domain identifier U.
Figure 7: Model evaluation on the SLID-Ontario dataset. The MSEs of the linear regression, varying-coefficient model, and transfer learning estimator at different values of u_0, based on a bandwidth of 0.2, are shown on the left. The histogram of the scaled years of employment, U, is shown on the right.

Figure 8: Model evaluation on the US Adult Income dataset. The cross-entropy of the logistic regression, logistic-based varying-coefficient, and transfer-learning estimators at different values of u_0, based on a bandwidth of 0.2, is shown on the left. The histogram of the scaled years of employment, U, is shown on the right.

To capture this structured heterogeneity, we introduce a domain-varying coefficient model (DVCM) and propose a two-step estimator that combines nonparametric pooling across source domains with a ridge-type fine-tuning step on the target domain. Our main contribution is a data-adaptive choice of the shrinkage matrix Q that provably prevents negative transfer: the resulting estimator automatically interpolates between target-only and pooled estimators and never incurs higher risk than the target-only baseline. We establish matching minimax upper and lower bounds for estimating θ(u_0), revealing a phase transition governed by smoothness, domain dispersion, and the number of source environments.
We further derive asymptotic normality with feasible variance estimation, enabling valid confidence intervals and hypothesis tests. Simulations and real-data experiments confirm that the procedure adaptively tracks the better of target-only and pooled learning across regimes. However, several interesting directions remain open for future investigation:

1. Beyond linear models in X. In this paper, we focused on models that are linear in X. A natural extension is to consider more flexible nonlinear structures, such as single-index models E[Y | X, U] = g(X⊤θ(U)) for an unknown link g, or additive models of the form E[Y | X, U] = ∑_{j=1}^d θ_j(U) f_j(X), where both θ_j(·) and f_j(·) are unknown. An important theoretical question is whether the negative-transfer robustness and adaptive minimax optimality established here continue to hold under suitable reformulations of the estimation procedure for such broader nonparametric classes.

2. Modern machine learning estimators for the U-varying component. We estimated the nonparametric components via local polynomial regression. A promising direction is to investigate neural network or transformer-based estimators for learning θ(U), especially when U is multi-dimensional. As observed in recent work (e.g., see [48, 49]), neural network estimators can adapt to compositional structures and mitigate the curse of dimensionality. Understanding whether similar adaptivity and phase-transition phenomena persist under modern deep-learning architectures remains an important open problem.

3. High-dimensional covariates and structured sparsity. Our analysis assumes that the dimension of X is fixed. In growing/high-dimensional settings, variable selection becomes essential, particularly when only a subset of covariates is informative, and the sparsity pattern may vary across domains.
One natural extension is to incorporate sparsity-inducing penalties (e.g., ℓ_1-regularization or structured group penalties) into the domain-adaptive framework. Alternatively, domain heterogeneity may be captured through a low-rank latent factor structure. Developing adaptive procedures that combine transfer learning with sparsity or low-rank structure on the covariates would substantially broaden the scope of the model.

4. Bayesian formulations and adaptive borrowing. A Bayesian perspective offers another appealing direction. One may model the domain identifiers U_i as draws from a prior distribution (where the prior variance encodes cross-domain similarity), while the coefficient function θ(·) is generated from a nonparametric prior (e.g., a Gaussian process, spline-based prior, or Bayesian neural network prior) that models its smoothness. An important theoretical question is how to modify the likelihood equation appropriately so that the resulting posterior achieves adaptive contraction rates matching the minimax frequentist rates derived here. If so, one may use this approach for uncertainty quantification from a Bayesian perspective.

Appendix

Throughout the theoretical analysis in the Appendix, we assume a target sample-splitting procedure. Specifically, we observe a total of 2n_0 target samples, which are partitioned into two equal subsets satisfying |I_0| = |I*_0| = n_0. The subset I_0 is used to construct the initial nonparametric estimator, while I*_0 is reserved for the subsequent fine-tuning step. Under this setup, we first analyze the estimator in the linear response setting. The nonparametric estimator ˆθ_DVCM(u_0) is defined as
$$\hat\theta_{DVCM}(u_0) = A_l \cdot \arg\min_{\alpha \in \mathbb{R}^{(l+1)p}} \sum_{k=0}^{K} \sum_{i \in I_k} \Big( Y_{ki} - \Big( \Phi_l\Big(\tfrac{U_k - u_0}{h}\Big)^\top \otimes X_{ki}^\top \Big) \alpha \Big)^2 W\Big(\tfrac{U_k - u_0}{h}\Big), \tag{.1}$$
where Φ_l(x) = (1, x, x²/2!, ..., x^l/l!)⊤ denotes the l-th order polynomial feature map, ⊗ is the Kronecker product, and A_l = [I_p, 0_{p×lp}] extracts the first p coordinates of the minimizer. The fine-tuning estimator constructed from I*_0 is then given by
$$\hat\theta_{TL}(u_0) = \arg\min_{\alpha \in \mathbb{R}^p} \frac{1}{2 n_0} \sum_{i \in I^*_0} \big( Y_{0i} - X_{0i}^\top \alpha \big)^2 + \frac{1}{2} \big\| \alpha - \hat\theta_{DVCM}(u_0) \big\|_Q^2 = \Big( \tfrac{1}{n_0} X_0^\top X_0 + Q \Big)^{-1} \Big( \tfrac{1}{n_0} X_0^\top y_0 + Q\, \hat\theta_{DVCM}(u_0) \Big). \tag{.2}$$
Similarly, for the generalized linear response model, the Step I estimator is defined as
$$\hat\theta_{GDVCM}(u_0) = A_l \cdot \arg\min_{\alpha \in \mathbb{R}^{(l+1)p}} \sum_{k=0}^{K} \sum_{i \in I_k} \ell\big( Z_{ki}^\top \alpha, Y_{ki} \big)\, W\Big(\tfrac{U_k - u_0}{h}\Big), \tag{.3}$$
while the Step II fine-tuning estimator based on I*_0 is
$$\hat\theta_{TL}(u_0) = \arg\min_{\alpha} \frac{1}{n_0} \sum_{i \in I^*_0} \ell\big( X_{0i}^\top \alpha, Y_{0i} \big) + \frac{1}{2} \big\| \alpha - \hat\theta_{GDVCM}(u_0) \big\|_Q^2. \tag{.4}$$
This appendix, centered on the estimators defined above, is organized as follows. Section A collects auxiliary lemmas used throughout the paper. Section B contains proofs of the main theorems, propositions, and lemmas from the main body. Section C provides the proofs of the auxiliary lemmas.

A Auxiliary lemmas

In this section we collect auxiliary lemmas that support the main results. Section A.1 gathers tools for the nonasymptotic analysis: Lemmas A.1–A.4 serve as building blocks for Proposition 3.5. Section A.2 contains asymptotic tools: Lemma A.5 underpins Theorem 3.13 and Proposition A.9; Lemma A.6 is utilized to prove Proposition A.8; Lemma A.10 is used in the proof of Theorem 3.13; and Lemma A.11 aids Theorem 3.14.

A.1 Auxiliary lemmas for nonasymptotic analysis

Recall that the linear DVCM estimator is
$$\hat\theta_{DVCM}(u_0) = A_l \big( Z^\top W Z \big)^{-1} Z^\top W y \in \mathbb{R}^p. \tag{A.1}$$
Let us first recall some basic notation:
$$Z^\top = \big( Z_{01}, Z_{02}, \ldots, Z_{0 n_0}, \ldots, Z_{K1}, Z_{K2}, \ldots, Z_{K n_K} \big),$$
$$W = S_h^{-1} \operatorname{diag}\Big( \underbrace{W\Big(\tfrac{U_0 - u_0}{h}\Big), \ldots, W\Big(\tfrac{U_0 - u_0}{h}\Big)}_{n_0 \text{ identical terms}}, \ldots, \underbrace{W\Big(\tfrac{U_K - u_0}{h}\Big), \ldots, W\Big(\tfrac{U_K - u_0}{h}\Big)}_{n_K \text{ identical terms}} \Big),$$
$$y^\top = \big( y_0^\top, \ldots, y_K^\top \big),$$
where the normalizing constant S_h = ∑_{k=0}^K ∑_{i∈I_k} W((U_k − u_0)/h), and W is a uniform kernel (Assumption 3.4). The following lemma provides a finite-sample concentration inequality on the distance of u_0 from its nearest and furthest neighbors, i.e., d_(1)(u_0) and d_(K)(u_0). In particular, it shows that d_(1)(u_0) and d_(K)(u_0) are of order γ/K and γ, respectively.

Lemma A.1 Under Assumption 3.2(a), the following bounds hold:
(1) $$\Big(1 - \tfrac{2 a_0 t}{K}\Big)^K \mathbb{1}\{ t \le K/(2 a_0) \} \le P\big( K\, d_{(1)}(u_0) > \gamma t \big) \le \Big(1 - \tfrac{2 a'_0 t}{K}\Big)^K \mathbb{1}\{ t \le K/(2 a'_0) \},$$
$$C_1 (K/\gamma)^{-2\beta} \le E\big[ d_{(1)}^{2\beta}(u_0) \big] \le C_2 (K/\gamma)^{-2\beta}.$$
(2) $$\Big(1 - \tfrac{2 a_0 t}{K}\Big)^K \mathbb{1}\{ t \le K/(2 a_0) \} \le P\big( K\, d_{(K)}(u_0) > \gamma t \big) \le \Big(1 - \tfrac{2 a'_0 t}{K}\Big)^K \mathbb{1}\{ t \le K/(2 a'_0) \},$$
$$C_3\, \gamma^{2\beta} \le E\big[ d_{(K)}^{2\beta}(u_0) \big] \le C_4\, \gamma^{2\beta}.$$
The proof of part (1) is in Appendix C.1, and the proof of part (2) is in Appendix C.2. The following lemma shows that, under our assumptions, the random variables Z_ki are upper bounded almost surely under certain constraints.

Lemma A.2 Under Assumptions 3.2 and 3.4, it holds that
$$\mathbb{1}\{ |U_k - u_0| \le h \}\, \| Z_{ki} \|_2 \le 2.$$
The proof is in Appendix C.3.

Lemma A.3 Let Assumptions 3.2–3.4 hold and let h be such that h ≤ |U|. Then there exists a constant C > 0 such that
$$E\big[ S_h^{-1} \,\big|\, d_{(1)}(u_0) \le h \big] \le \frac{C \gamma}{n h}.$$
The proof is in Appendix C.4. Recall that Γ = {U_k : 0 ≤ k ≤ K} ∪ {X_ki : 0 ≤ k ≤ K, i ∈ I_k} is the set of all covariates. The following lemma shows that the MSE_A (conditioning on Γ) of the DVCM estimator is of order O(h^{2β} + S_h^{−1}).

Lemma A.4 Under Assumptions 3.1–3.4, the following upper bound holds:
$$E\big[ \| \hat\theta_{DVCM}(u_0) - \theta(u_0) \|_A^2 \,\big|\, \Gamma \big] \le q_1^2 h^{2\beta} + q_2 S_h^{-1}, \tag{A.2}$$
where q_1, q_2 > 0 are constants and S_h = (1/2) ∑_{k=0}^K n_k 1{|U_k − u_0| ≤ h}. The proof of this lemma is in Appendix C.5.
A.2 Auxiliary lemmas for asymptotic analysis

The next lemma provides asymptotic expressions for the conditional mean and variance of the quantities $\Delta_k$ and $\Lambda_k$ in the GDVCM setting. These expansions serve as essential building blocks for the subsequent asymptotic analysis.

Lemma A.5 Under Assumptions 3.1, 3.2′′, 3.3, and 3.4, define, for $t_k = (u_k-u_0)/h$,

$$\bar\theta(u_0)^\top = \big(\theta(u_0)^\top,\ h\theta'(u_0)^\top,\ \dots,\ h^l\theta^{(l)}(u_0)^\top\big),$$
$$\Delta_k = n_k^{-1/2}\sum_{i\in I_k}s_1\big(Z_{ki}^\top\bar\theta(u_0), Y_{ki}\big)Z_{ki}W(t_k), \qquad \Lambda_k = n_k^{-1}\sum_{i\in I_k}s_2\big(Z_{ki}^\top\bar\theta(u_0), Y_{ki}\big)Z_{ki}Z_{ki}^\top W(t_k),$$

where $Z_{ki} = \Phi_l(t_k)\otimes X_{ki}$ and $s_j(\eta, y) = \partial^j\ell(\eta, y)/\partial\eta^j$. Then, for any $k\in\{0\}\cup[K]$, sufficiently small $h>0$, and some $\tilde u_k$ satisfying $|\tilde u_k-u_0|\le h$,

$$n_k^{-1/2}E[\Delta_k\mid U_k = u_k] = \big(\Phi_l(t_k)^{\otimes 2}\otimes\Psi(u_k)\big)A_l^\top\frac{\theta^{(l)}(u_0)-\theta^{(l)}(\tilde u_k)}{l!}(t_kh)^l\,W(t_k)\{1+o(1)\}\ \lesssim\ h^\beta,$$
$$\mathrm{Var}[\Delta_k\mid U_k = u_k] = \nu(u_k)\big(\Phi_l(t_k)^{\otimes 2}\otimes\Psi(u_k)\big)W(t_k)^2+O(h^\beta),$$
$$E[\Lambda_k\mid U_k = u_k] = \big(\Phi_l(t_k)^{\otimes 2}\otimes\Psi(u_k)\big)W(t_k)+O(h^\beta).$$

The proof of this lemma is found in Appendix C.6. The next lemma (on the linear DVCM) is an immediate specialization of Lemma A.5 (on the generalized DVCM) obtained by choosing the quadratic loss $\ell(\eta, y) = \frac12(\eta-y)^2$. In this case $s_2\equiv 1$ and $\Psi(u) = E[XX^\top\mid U = u]$; moreover, the GLR scale function $\nu(u)$ is replaced by the noise variance $\sigma^2(u)$. Hence all conclusions of Lemma A.5 remain the same, with the single substitution $\sigma^2(\cdot) = \nu(\cdot)$.

Lemma A.6 Under Assumptions 3.1, 3.2′, 3.3, and 3.4, define, for $t_k = (u_k-u_0)/h$,

$$\bar\theta(u_0)^\top = \big(\theta(u_0)^\top,\ h\theta'(u_0)^\top,\ \dots,\ h^l\theta^{(l)}(u_0)^\top\big),$$
$$\Delta_k = n_k^{-1/2}\sum_{i\in I_k}s_1\big(Z_{ki}^\top\bar\theta(u_0), Y_{ki}\big)Z_{ki}W(t_k), \qquad \Lambda_k = n_k^{-1}\sum_{i\in I_k}s_2\big(Z_{ki}^\top\bar\theta(u_0), Y_{ki}\big)Z_{ki}Z_{ki}^\top W(t_k),$$

where $Z_{ki} = \Phi_l(t_k)\otimes X_{ki}$ and $s_j(\eta, y) = \partial^j\ell(\eta, y)/\partial\eta^j$.
Then, for any $k\in\{0\}\cup[K]$, sufficiently small $h>0$, and some $\tilde u_k$ satisfying $|\tilde u_k-u_0|\le h$,

$$n_k^{-1/2}E[\Delta_k\mid U_k = u_k] = \big(\Phi_l(t_k)^{\otimes 2}\otimes\Psi(u_k)\big)A_l^\top\frac{\theta^{(l)}(u_0)-\theta^{(l)}(\tilde u_k)}{l!}(t_kh)^l\,W(t_k)\{1+o(1)\}\ \lesssim\ h^\beta,$$
$$\mathrm{Var}[\Delta_k\mid U_k = u_k] = \sigma^2(u_k)\big(\Phi_l(t_k)^{\otimes 2}\otimes\Psi(u_k)\big)W(t_k)^2+O(h^\beta),$$
$$E[\Lambda_k\mid U_k = u_k] = \big(\Phi_l(t_k)^{\otimes 2}\otimes\Psi(u_k)\big)W(t_k)+O(h^\beta).$$

Proof A.7 This corollary on the linear DVCM is a special case of Lemma A.5 (on the GDVCM). Therefore, Lemma A.5 may be applied directly, the only difference being that the variance function $\nu(\cdot)$ is replaced by $\sigma^2(\cdot)$.

The next proposition establishes the asymptotic distributions of the two base estimators $\hat\theta_{LR}(u_0)$ and $\hat\theta_{DVCM}(u_0)$.

Proposition A.8 Under Assumptions 3.1, 3.2′, 3.3, and 3.4, the target-only estimator $\hat\theta_{LR}(u_0)$ satisfies

$$\sqrt{n_0}\big(\hat\theta_{LR}(u_0)-\theta(u_0)\big)\xrightarrow{d}N\big(0, \Omega_{LR}(u_0)\big), \qquad \Omega_{LR}(u_0) = \sigma^2(u_0)\Psi(u_0)^{-1}.$$

Furthermore, if $h$ is such that $d_{(1)}(u_0)\le h\le d_{(K)}(u_0)$ and $\gamma/K\ll h\lesssim(\gamma/n)^{\frac{1}{2\beta+1}}$, then $\hat\theta_{DVCM}(u_0)$ satisfies

$$\sqrt{\frac{nh}{\gamma}}\big(\hat\theta_{DVCM}(u_0)-\theta(u_0)-b_{DVCM}(u_0)\big)\xrightarrow{d}N\big(0, \Omega_{DVCM}(u_0)\big),$$

where $b_{DVCM}(u_0)\lesssim h^\beta$,

$$\Omega_{DVCM}(u_0) = \sigma^2(u_0)\big[\zeta_{0,1}^{-1}\zeta_{0,2}\zeta_{0,1}^{-1}\big]_{1,1}\Psi(u_0)^{-1}, \qquad \zeta_{r,s} = \int\Phi_l(t)^{\otimes 2}\,t^r\,W^s(t)\,f\Big(\frac{u_0-u^*+ht}{\gamma}\Big)\,dt,$$

and $u^*$ is the same as defined in Assumption 3.2.

Proposition A.9 Under Assumptions 3.1, 3.2′′, 3.3, and 3.4, the target-only estimator $\hat\theta_{GLR}(u_0)$ satisfies

$$\sqrt{n_0}\big(\hat\theta_{GLR}(u_0)-\theta(u_0)\big)\xrightarrow{d}N\big(0, \Omega_{GLR}(u_0)\big), \qquad \Omega_{GLR}(u_0) = \nu(u_0)\Psi(u_0)^{-1}.$$
Furthermore, if $h$ is such that $d_{(1)}(u_0)\le h\le d_{(K)}(u_0)$ and $\gamma/K\ll h\lesssim(\gamma/n)^{\frac{1}{2\beta+1}}$, then $\hat\theta_{GDVCM}(u_0)$ satisfies

$$\sqrt{\frac{nh}{\gamma}}\big(\hat\theta_{GDVCM}(u_0)-\theta(u_0)-b_{GDVCM}(u_0)\big)\xrightarrow{d}N\big(0, \Omega_{GDVCM}(u_0)\big),$$

where $b_{GDVCM}(u_0)\lesssim h^\beta$,

$$\Omega_{GDVCM}(u_0) = \nu(u_0)\big[\zeta_{0,1}^{-1}\zeta_{0,2}\zeta_{0,1}^{-1}\big]_{1,1}\Psi(u_0)^{-1}, \qquad \zeta_{r,s} = \int\Phi_l(t)^{\otimes 2}\,t^r\,W^s(t)\,f\Big(\frac{u_0-u^*+ht}{\gamma}\Big)\,dt,$$

and $u^*$ is the same as defined in Assumption 3.2. Note that this result extends Proposition A.8, with $\sigma^2(\cdot)$ replaced by $\nu(\cdot)$ and with $\Psi(\cdot)$ redefined accordingly. The proof is given in Appendix C.7.

The next lemma establishes, within the GDVCM framework, that if the shrinkage matrix $Q$ is chosen to be of the same order as $\rho^2$ (i.e., $Q\asymp_p\rho^2I$), where $\rho := r_{GLR}/r_{GDVCM}$ is the ratio of the convergence rates of the target-only GLR estimator and the GDVCM estimator, then the TL estimator $\hat\theta_{TL}(u_0)$ adapts to the faster procedure: specifically,

$$\big\|\hat\theta_{TL}(u_0)-\theta(u_0)\big\| = O_p(r_{TL}), \qquad r_{TL} := r_{GLR}\wedge r_{GDVCM}.$$

Moreover, if the bandwidth $h$ satisfies the standard localization conditions, then $\hat\theta_{TL}(u_0)$ attains the same asymptotic distribution as the faster estimator in Proposition A.9: if $\rho\to 0$, then $\sqrt{n_0}\big(\hat\theta_{TL}(u_0)-\theta(u_0)\big)\xrightarrow{d}N\big(0, \Omega_{GLR}(u_0)\big)$; if $\rho\to\infty$, then $\sqrt{nh/\gamma}\big(\hat\theta_{TL}(u_0)-\theta(u_0)-b_{GDVCM}(u_0)\big)\xrightarrow{d}N\big(0, \Omega_{GDVCM}(u_0)\big)$.

Lemma A.10 Let Assumptions 3.1, 3.2′′, 3.3, and 3.4 hold. Let the rates $r_{GLR}$ and $r_{GDVCM}$ be such that

$$\hat\theta_{GLR}(u_0)-\theta(u_0) = O_p(r_{GLR}), \qquad \hat\theta_{GDVCM}(u_0)-\theta(u_0) = O_p(r_{GDVCM}).$$

Moreover, let $Q$ be a positive-definite matrix satisfying, for $\rho = r_{GLR}/r_{GDVCM}$,

$$c\rho^2\le\lambda_{\min}(Q)\le\lambda_{\max}(Q)\le C\rho^2$$

for constants $C>c>0$, with probability tending to one. Then

$$\big\|\hat\theta_{TL}(u_0)-\theta(u_0)\big\|_2 = O_p(r_{GLR}\wedge r_{GDVCM}).$$
Furthermore, suppose $h$ satisfies the same conditions as in Proposition A.9. Then:

If $\rho\to 0$, $\quad\sqrt{n_0}\big(\hat\theta_{TL}(u_0)-\theta(u_0)\big)\xrightarrow{d}N\big(0, \Omega_{GLR}(u_0)\big)$;

If $\rho\to\infty$, $\quad\sqrt{\frac{nh}{\gamma}}\big(\hat\theta_{TL}(u_0)-\theta(u_0)-b_{GDVCM}(u_0)\big)\xrightarrow{d}N\big(0, \Omega_{GDVCM}(u_0)\big)$,

where $\Omega_{GLR}(u_0)$, $b_{GDVCM}(u_0)$, and $\Omega_{GDVCM}(u_0)$ are defined in Proposition A.9. The proof is given in Appendix C.9.

The next lemma, also under the GDVCM framework, states that if we choose $h$ small enough, then the TL estimator achieves asymptotic normality with mean $0$.

Lemma A.11 Under the same conditions as Lemma A.10, suppose that the bandwidth additionally satisfies $nh^{2\beta+1}/\gamma\to 0$. Then:

If $\rho\to 0$, $\quad\sqrt{n_0}\big(\hat\theta_{TL}(u_0)-\theta(u_0)\big)\xrightarrow{d}N\big(0, \Omega_{GLR}(u_0)\big)$;

If $\rho\to\infty$, $\quad\sqrt{\frac{nh}{\gamma}}\big(\hat\theta_{TL}(u_0)-\theta(u_0)\big)\xrightarrow{d}N\big(0, \Omega_{GDVCM}(u_0)\big)$,

where $\Omega_{GLR}(u_0)$ and $\Omega_{GDVCM}(u_0)$ are defined in Proposition A.9.

Proof A.12 This lemma follows directly from Lemma A.10 by choosing $h$ such that $\sqrt{nh/\gamma}\,b_{GDVCM}\to 0$. Note that by Proposition A.9 the bias $b_{GDVCM}$ is of order $O(h^\beta)$, so this is equivalent to requiring $nh^{2\beta+1}/\gamma\to 0$.

B Proofs of main theorems

B.1 Proof of Proposition 3.5

Proof B.1 Recall that we choose $h$ according to Equation (3.1):

$$h = \mathrm{med}\Big\{e_0(n/\gamma)^{-\frac{1}{2\beta+1}},\ d_{(1)}(u_0),\ d_{(K)}(u_0)\Big\}.$$

Depending on the distribution of $\{u_0,\dots,u_K\}$, any one of the three elements can be chosen as $h$. Based on this, we divide our analysis into three disjoint events:

$$E_1 = \Big\{e_0(n/\gamma)^{-\frac{1}{2\beta+1}}<d_{(1)}(u_0)\Big\}, \quad E_2 = \Big\{e_0(n/\gamma)^{-\frac{1}{2\beta+1}}>d_{(K)}(u_0)\Big\}, \quad E_3 = \Big\{d_{(1)}(u_0)\le e_0(n/\gamma)^{-\frac{1}{2\beta+1}}\le d_{(K)}(u_0)\Big\}.$$

As $d_{(1)}(u_0)\le d_{(K)}(u_0)$, it is immediate that we choose $h = d_{(1)}(u_0)$ under $E_1$, $h = d_{(K)}(u_0)$ under $E_2$, and $h = e_0(n/\gamma)^{-1/(2\beta+1)}$ under $E_3$.
Now, from Lemma A.4, we know that for any choice of $h$ we have

$$E\big[\|\hat\theta_{DVCM}(u_0)-\theta(u_0)\|_A^2\,\big|\,\Gamma\big]\le q_1^2h^{2\beta}+q_2S_h^{-1}. \tag{B.1}$$

Therefore, by the law of total expectation,

$$E\big[\|\hat\theta_{DVCM}(u_0)-\theta(u_0)\|_A^2\big]\le\sum_{j=1}^3E\big[q_1^2h^{2\beta}+q_2S_h^{-1}\,\big|\,E_j\big]P(E_j) := \sum_{j=1}^3r_jp_j,$$

where, for notational simplicity, we define $r_j = E[q_1^2h^{2\beta}+q_2S_h^{-1}\mid E_j]$ and $p_j = P(E_j)$. We next provide a bound on each $r_j$ and $p_j$ on a case-by-case basis.

First, consider $E_1$: we choose $h = d_{(1)}(u_0)$, so only the nearest source (in terms of $u$) is selected. As a consequence, by Assumption 3.3 the deterministic bound

$$S_h = \frac12\sum_{k=0}^Kn_k\mathbf 1\{|U_k-u_0|\le h\} = \frac12\big(n_0+n_{(1)}\big)\lesssim\bar n$$

holds, where $n_{(1)}$ is the number of samples in the nearest source domain. Using this bandwidth, the bounds in Lemma A.1 yield

$$E\big[d_{(1)}(u_0)^{2\beta}\,\big|\,E_1\big]\lesssim(K/\gamma)^{-2\beta}, \qquad p_1\asymp\Big[1-(n\gamma^{2\beta})^{-\frac{1}{2\beta+1}}\Big]_+^K,$$

which implies

$$r_1 = E\big[q_1^2h^{2\beta}+q_2S_h^{-1}\,\big|\,E_1\big]\lesssim(K/\gamma)^{-2\beta}+\frac{1}{\bar n}.$$

Next, consider $E_2$: we choose $h = d_{(K)}(u_0)$ in this case. This means all the domains are selected, and consequently we have the deterministic bound

$$S_h = \frac12\sum_{k=0}^Kn_k\mathbf 1\{|U_k-u_0|\le h\} = \frac n2.$$

Another application of Lemma A.1 yields

$$E\big[d_{(K)}(u_0)^{2\beta}\,\big|\,E_2\big]\lesssim\gamma^{2\beta}, \qquad p_2 := P(E_2)\asymp 1-\Big[1-(n\gamma^{2\beta})^{-\frac{K}{2\beta+1}}\Big]_+.$$

Thus, it follows that

$$r_2 = E\big[q_1^2h^{2\beta}+q_2S_h^{-1}\,\big|\,E_2\big]\lesssim\gamma^{2\beta}+\frac1n.$$

Finally, consider $E_3$: in this case, $h = e_0(n/\gamma)^{-1/(2\beta+1)}$. Hence, the bias part is deterministically bounded by

$$E\big[h^{2\beta}\,\big|\,E_3\big]\lesssim(n/\gamma)^{-\frac{2\beta}{2\beta+1}}.$$

By Lemma A.3, the variance part is bounded by

$$E\big[S_h^{-1}\,\big|\,E_3\big]\lesssim\frac{\gamma}{nh}\asymp(n/\gamma)^{-\frac{2\beta}{2\beta+1}}.$$

As a consequence,

$$r_3 = E\big[q_1^2h^{2\beta}+q_2S_h^{-1}\,\big|\,E_3\big]\lesssim(n/\gamma)^{-\frac{2\beta}{2\beta+1}}, \qquad p_3 = 1-p_1-p_2.$$

Now that we have established bounds on $\{r_j\}$ and $\{p_j\}$, we will bound the MSE using them.
However, it is apparent from the definition of the events that one of these three events will dominate the others, depending on the growth of $(n, \gamma, K)$. We discuss the bounds for the MSE under three circumstances: i) $n\gamma^{2\beta}\gg K^{2\beta+1}$; ii) $n\gamma^{2\beta}\ll 1$; iii) $1\lesssim n\gamma^{2\beta}\lesssim K^{2\beta+1}$.

Case 1: First we consider the case $n\gamma^{2\beta}\gg K^{2\beta+1}$. Note that this implies

$$n\gamma^{2\beta}\gg K^{2\beta+1}\ \Longrightarrow\ \bar n^{-1}\asymp(n/K)^{-1}\ll(\gamma/K)^{2\beta} \quad\text{and}\quad (n/\gamma)^{-2\beta/(2\beta+1)}\ll(\gamma/K)^{2\beta}.$$

Therefore, we have the upper bound on $r_1$:

$$r_1\lesssim(K/\gamma)^{-2\beta}+\frac1{\bar n}\lesssim(K/\gamma)^{-2\beta},$$

and on $r_3$:

$$r_3\lesssim(n/\gamma)^{-\frac{2\beta}{2\beta+1}}\ll(K/\gamma)^{-2\beta}.$$

Furthermore, as $K\ge 1$, the assumption $n\gamma^{2\beta}\gg K^{2\beta+1}$ immediately implies $n\gamma^{2\beta}\gg 1$, which in turn implies $n^{-1}\ll\gamma^{2\beta}$. Therefore,

$$r_2\lesssim\gamma^{2\beta}+\frac1n\lesssim\gamma^{2\beta}.$$

Moreover, the same condition $n\gamma^{2\beta}\gg 1$ also implies

$$p_2\asymp 1-\Big[1-(n\gamma^{2\beta})^{-K/(2\beta+1)}\Big]_+\asymp(n\gamma^{2\beta})^{-K/(2\beta+1)}.$$

Combining all the bounds yields the following upper bound on the MSE of $\hat\theta_{DVCM}$:

$$E\big[\|\hat\theta_{DVCM}(u_0)-\theta(u_0)\|_A^2\big]\le r_1p_1+r_2p_2+r_3(1-p_1-p_2)\le r_1\vee r_3+r_2p_2\lesssim(K/\gamma)^{-2\beta}+\gamma^{2\beta}(n\gamma^{2\beta})^{-\frac{K}{2\beta+1}}\lesssim(K/\gamma)^{-2\beta}.$$

Here, the last inequality follows from the fact that $(K/\gamma)^{-2\beta}\gg\gamma^{2\beta}(n\gamma^{2\beta})^{-K/(2\beta+1)}$ as $n\gamma^{2\beta}\gg K^{2\beta+1}$. This completes the bound under Case 1.

Case 2: In this case, we assume that $n\gamma^{2\beta}\ll 1\lesssim K^{2\beta+1}$, i.e., $\gamma^{2\beta}\ll n^{-1}$. This immediately implies

$$r_2\lesssim\gamma^{2\beta}+\frac1n\lesssim\frac1n,$$

and

$$p_2\asymp 1-\Big[1-(n\gamma^{2\beta})^{-K/(2\beta+1)}\Big]_+\to 1 \quad\text{since }n\gamma^{2\beta}\downarrow 0,$$

and consequently $p_1 = p_3 = 0$. Therefore,

$$E\big[\|\hat\theta_{DVCM}(u_0)-\theta(u_0)\|_A^2\big]\lesssim r_2p_2\lesssim\frac1n.$$

Case 3: Finally, we consider the last case, $1\lesssim n\gamma^{2\beta}\lesssim K^{2\beta+1}$. We argue that in this case, the MSE is upper bounded by the rate $(n/\gamma)^{-2\beta/(2\beta+1)}$.
To establish this, it is enough to show $r_1p_1\vee r_2p_2\lesssim(n/\gamma)^{-2\beta/(2\beta+1)}$, as $r_3p_3\lesssim(n/\gamma)^{-2\beta/(2\beta+1)}$ by the bound on $r_3$. With the definitions of $p_1$ and $p_2$, it follows that

$$p_1\asymp\Big[1-(n\gamma^{2\beta})^{-1/(2\beta+1)}\Big]_+^K\lesssim\exp\Big\{-K(n\gamma^{2\beta})^{-1/(2\beta+1)}\Big\} \quad[\because(1-x)^K\lesssim e^{-Kx}\text{ for }x\in(0,1)],$$
$$p_2\asymp 1-\Big[1-(n\gamma^{2\beta})^{-K/(2\beta+1)}\Big]_+\lesssim(n\gamma^{2\beta})^{-K/(2\beta+1)} \quad[\because 1\lesssim n\gamma^{2\beta}].$$

As we are considering the scenario $1\lesssim n\gamma^{2\beta}\lesssim K^{2\beta+1}$, we have

$$n\gamma^{2\beta}\lesssim K^{2\beta+1}\ \Longrightarrow\ (K/\gamma)^{-2\beta}\lesssim\frac1{\bar n}, \qquad 1\lesssim n\gamma^{2\beta}\ \Longrightarrow\ \frac1n\lesssim\gamma^{2\beta}.$$

Thus we have the upper bounds for $r_1$ and $r_2$:

$$r_1\lesssim(K/\gamma)^{-2\beta}+\frac1{\bar n}\lesssim\frac1{\bar n}, \qquad r_2\lesssim\gamma^{2\beta}+\frac1n\lesssim\gamma^{2\beta}.$$

These bounds, along with the upper bounds on $(p_1, p_2)$, yield

$$\frac{r_1p_1}{(n/\gamma)^{-\frac{2\beta}{2\beta+1}}} = \frac{\bar n^{-1}\exp\big\{-K(n\gamma^{2\beta})^{-1/(2\beta+1)}\big\}}{(n/\gamma)^{-\frac{2\beta}{2\beta+1}}}\asymp\frac{(K/n)\exp\big\{-K(n\gamma^{2\beta})^{-1/(2\beta+1)}\big\}}{(n/\gamma)^{-\frac{2\beta}{2\beta+1}}} = K(n\gamma^{2\beta})^{-1/(2\beta+1)}\exp\Big\{-K(n\gamma^{2\beta})^{-1/(2\beta+1)}\Big\}\lesssim 1$$

$[\because xe^{-x}$ is always upper bounded by a constant for $x>0]$, and

$$\frac{r_2p_2}{(n/\gamma)^{-\frac{2\beta}{2\beta+1}}} = \frac{\gamma^{2\beta}(n\gamma^{2\beta})^{-K/(2\beta+1)}}{(n/\gamma)^{-\frac{2\beta}{2\beta+1}}} = (n\gamma^{2\beta})^{\frac{2\beta-K}{2\beta+1}}\lesssim 1 \quad[\because n\gamma^{2\beta}\gtrsim 1,\text{ and assume }K\ge 2\beta\text{ w.l.o.g.}].$$

Hence, we have shown in this case that

$$E\big[\|\hat\theta_{DVCM}(u_0)-\theta(u_0)\|_A^2\big]\lesssim(n/\gamma)^{-\frac{2\beta}{2\beta+1}}. \tag{B.2}$$

Justification of maximal upper bounds. As a short summary, we have established the following regime-specific upper bounds on the mean squared error:

$$E\big[\|\hat\theta_{DVCM}(u_0)-\theta(u_0)\|_A^2\big]\lesssim\begin{cases}(K/\gamma)^{-2\beta} & \text{if }n\gamma^{2\beta}\gg K^{2\beta+1}\gtrsim 1,\\[2pt](n/\gamma)^{-\frac{2\beta}{2\beta+1}} & \text{if }1\lesssim n\gamma^{2\beta}\lesssim K^{2\beta+1},\\[2pt]n^{-1} & \text{if }n\gamma^{2\beta}\ll 1\lesssim K^{2\beta+1}.\end{cases}$$

We next argue that, within each regime, the overall upper bound is simply the maximum among the three individual upper bounds corresponding to that regime, which follows from simple algebra.
First observe that when $n\gamma^{2\beta}\gg K^{2\beta+1}$,

$$n\gamma^{2\beta}\gg K^{2\beta+1}\ \Longrightarrow\ \frac\gamma n\ll\Big(\frac\gamma K\Big)^{2\beta+1}\ \Longrightarrow\ (n/\gamma)^{-\frac{2\beta}{2\beta+1}}\ll(K/\gamma)^{-2\beta},$$

and multiplying the middle inequality by $K/\gamma$ yields

$$\frac\gamma n\ll\Big(\frac\gamma K\Big)^{2\beta+1}\ \Longrightarrow\ \frac1{\bar n}\ll(K/\gamma)^{-2\beta}\ \Longrightarrow\ \frac1n\ll(K/\gamma)^{-2\beta}.$$

Hence, we conclude

$$n\gamma^{2\beta}\gg K^{2\beta+1}\ \Longrightarrow\ (K/\gamma)^{-2\beta}\asymp\max\Big\{(K/\gamma)^{-2\beta},\ (n/\gamma)^{-\frac{2\beta}{2\beta+1}},\ \frac1n\Big\}.$$

Secondly, consider the case $n\gamma^{2\beta}\ll 1$. In this case, we have

$$\gamma^{2\beta}\ll n^{-1}\ \Longrightarrow\ (\gamma/n)^{2\beta}\ll n^{-(2\beta+1)}\ \Longrightarrow\ (n/\gamma)^{-\frac{2\beta}{2\beta+1}}\ll n^{-1},$$

and

$$\gamma^{2\beta}\ll n^{-1}\ \Longrightarrow\ (\gamma/K)^{2\beta}\ll n^{-1}.$$

Hence,

$$n\gamma^{2\beta}\ll 1\ \Longrightarrow\ \frac1n\asymp\max\Big\{(K/\gamma)^{-2\beta},\ (n/\gamma)^{-\frac{2\beta}{2\beta+1}},\ \frac1n\Big\}.$$

Finally, consider the third case $1\lesssim n\gamma^{2\beta}\lesssim K^{2\beta+1}$. In this case, we have

$$n\gamma^{2\beta}\lesssim K^{2\beta+1}\ \Longrightarrow\ (\gamma/K)^{2\beta+1}\lesssim\gamma/n\ \Longrightarrow\ (K/\gamma)^{-2\beta}\lesssim(n/\gamma)^{-\frac{2\beta}{2\beta+1}},$$

and

$$1\lesssim n\gamma^{2\beta}\ \Longrightarrow\ n^{-1}\lesssim\gamma^{2\beta}\ \Longrightarrow\ n^{-(2\beta+1)}\lesssim(\gamma/n)^{2\beta}\ \Longrightarrow\ n^{-1}\lesssim(n/\gamma)^{-\frac{2\beta}{2\beta+1}}.$$

Therefore, in this case, we have

$$1\lesssim n\gamma^{2\beta}\lesssim K^{2\beta+1}\ \Longrightarrow\ (n/\gamma)^{-\frac{2\beta}{2\beta+1}}\asymp\max\Big\{(K/\gamma)^{-2\beta},\ (n/\gamma)^{-\frac{2\beta}{2\beta+1}},\ \frac1n\Big\}.$$

This completes the proof.

B.2 Proof of Theorem 3.6

Proof B.2 We apply Theorem 1 of [50], which roughly says that for any two estimators $\hat\theta_1$ and $\hat\theta_2$,

$$M(\hat\theta_1)\succeq M(\hat\theta_2)\ \Longrightarrow\ \mathrm{MSE}_A(\hat\theta_1)\ge\mathrm{MSE}_A(\hat\theta_2) \quad\text{for any }A\succeq 0.$$

Our goal is to show that there exists a matrix $Q$ that makes the differences between the squared-error matrices positive semidefinite, i.e.,

$$M\big(\hat\theta_{DVCM}(u_0)\big)-M\big(\hat\theta_{TL}(u_0)\big)\succeq 0 \quad\text{and}\quad M\big(\hat\theta_{LR}(u_0)\big)-M\big(\hat\theta_{TL}(u_0)\big)\succeq 0.$$

Define $S_Q = \hat\Sigma_0+Q$ with $\hat\Sigma_0 = (X_0^\top X_0)/n_0$, where $X_0$ is the matrix concatenating all the $\{X_{0i}\}_{i\in I_0^*}$.
From the first-order condition, we have

$$\hat\theta_{TL}(u_0) = \Big(\frac1{n_0}X_0^\top X_0+Q\Big)^{-1}\Big(\frac1{n_0}X_0^\top y_0+Q\hat\theta_{DVCM}(u_0)\Big) = (\hat\Sigma_0+Q)^{-1}\Big(\frac1{n_0}X_0^\top y_0+Q\hat\theta_{DVCM}(u_0)\Big) = S_Q^{-1}\Big(\frac1{n_0}X_0^\top y_0+Q\hat\theta_{DVCM}(u_0)\Big).$$

Using $y_0 = X_0\theta(u_0)+\epsilon_0$, this becomes

$$\hat\theta_{TL}(u_0) = S_Q^{-1}\Big(\hat\Sigma_0\theta(u_0)+\frac1{n_0}X_0^\top\epsilon_0+Q\hat\theta_{DVCM}(u_0)\Big) = S_Q^{-1}\Big((S_Q-Q)\theta(u_0)+\frac1{n_0}X_0^\top\epsilon_0+Q\hat\theta_{DVCM}(u_0)\Big) = \theta(u_0)+S_Q^{-1}\frac1{n_0}X_0^\top\epsilon_0+S_Q^{-1}Q\big(\hat\theta_{DVCM}(u_0)-\theta(u_0)\big).$$

Therefore, the conditional bias of $\hat\theta_{TL}(u_0)$ is

$$\mathrm{Bias}\big(\hat\theta_{TL}(u_0)\mid X_0\big) = S_Q^{-1}Q\,\mathrm{Bias}\big(\hat\theta_{DVCM}(u_0)\mid X_0\big) = S_Q^{-1}Q\,\mathrm{Bias}\big(\hat\theta_{DVCM}(u_0)\big),$$

because $\hat\theta_{DVCM}(u_0)$ is independent of $X_0$ due to sample splitting and $E[\epsilon_0\mid X_0] = 0$ on the target domain. For the conditional variance, we utilize this independence again and obtain

$$\mathrm{Var}\big(\hat\theta_{TL}(u_0)\mid X_0\big) = \mathrm{Var}\Big(S_Q^{-1}\Big\{\frac1{n_0}X_0^\top\epsilon_0+Q\big(\hat\theta_{DVCM}(u_0)-\theta(u_0)\big)\Big\}\ \Big|\ X_0\Big) = S_Q^{-1}\Big[\frac{\sigma^2(u_0)}{n_0}\hat\Sigma_0+Q\,\mathrm{Var}\big(\hat\theta_{DVCM}(u_0)\big)\,Q\Big]S_Q^{-1}.$$

Hence,

$$M\big(\hat\theta_{TL}(u_0)\mid X_0\big) = \mathrm{Bias}\big(\hat\theta_{TL}(u_0)\mid X_0\big)^{\otimes 2}+\mathrm{Var}\big(\hat\theta_{TL}(u_0)\mid X_0\big) = S_Q^{-1}\Big[\frac{\sigma^2(u_0)}{n_0}\hat\Sigma_0+Q\,M\big(\hat\theta_{DVCM}(u_0)\big)\,Q\Big]S_Q^{-1}.$$

Using the fact that

$$M\big(\hat\theta_{LR}(u_0)\mid X_0\big) = \frac{\sigma^2(u_0)}{n_0}\hat\Sigma_0^{-1} = \frac{\sigma^2(u_0)}{n_0}S_Q^{-1}S_Q\hat\Sigma_0^{-1}S_QS_Q^{-1} = \frac{\sigma^2(u_0)}{n_0}S_Q^{-1}\Big[\hat\Sigma_0+2Q+Q\hat\Sigma_0^{-1}Q\Big]S_Q^{-1} \quad[\because S_Q = \hat\Sigma_0+Q],$$

we obtain the differences between the second-order moments:

$$M\big(\hat\theta_{LR}(u_0)\big)-M\big(\hat\theta_{TL}(u_0)\big) = E\Big[M\big(\hat\theta_{LR}(u_0)\mid X_0\big)-M\big(\hat\theta_{TL}(u_0)\mid X_0\big)\Big] = E\Big[S_Q^{-1}Q\Big\{\frac{\sigma^2(u_0)}{n_0}\hat\Sigma_0^{-1}-M\big(\hat\theta_{DVCM}(u_0)\big)+\frac{2\sigma^2(u_0)}{n_0}Q^{-1}\Big\}QS_Q^{-1}\Big]$$

and

$$M\big(\hat\theta_{DVCM}(u_0)\big)-M\big(\hat\theta_{TL}(u_0)\big) = E\Big[M\big(\hat\theta_{DVCM}(u_0)\big)-S_Q^{-1}\Big\{\frac{\sigma^2(u_0)}{n_0}\hat\Sigma_0+Q\,M\big(\hat\theta_{DVCM}(u_0)\big)\,Q\Big\}S_Q^{-1}\Big] = E\Big[S_Q^{-1}\Big\{S_Q\,M\big(\hat\theta_{DVCM}(u_0)\big)\,S_Q-\frac{\sigma^2(u_0)}{n_0}\hat\Sigma_0-Q\,M\big(\hat\theta_{DVCM}(u_0)\big)\,Q\Big\}S_Q^{-1}\Big].$$
Replacing $S_Q$ with $\hat\Sigma_0+Q$, we have

$$M\big(\hat\theta_{DVCM}(u_0)\big)-M\big(\hat\theta_{TL}(u_0)\big) = E\Big[S_Q^{-1}\hat\Sigma_0\Big\{M\big(\hat\theta_{DVCM}(u_0)\big)-\frac{\sigma^2(u_0)}{n_0}\hat\Sigma_0^{-1}+\hat\Sigma_0^{-1}Q\,M\big(\hat\theta_{DVCM}(u_0)\big)+M\big(\hat\theta_{DVCM}(u_0)\big)\,Q\hat\Sigma_0^{-1}\Big\}\hat\Sigma_0S_Q^{-1}\Big].$$

Notice that if we set $Q$ such that

$$\frac12\,\frac{\sigma^2(u_0)}{n_0}\,M\big(\hat\theta_{DVCM}(u_0)\big)^{-1}\ \preceq\ Q\ \preceq\ 2\,\frac{\sigma^2(u_0)}{n_0}\,M\big(\hat\theta_{DVCM}(u_0)\big)^{-1},$$

then, due to the positive semidefiniteness of $M\big(\hat\theta_{DVCM}(u_0)\big)$, $\hat\Sigma_0$, and $S_Q^{-1}$, the following holds:

$$M\big(\hat\theta_{LR}(u_0)\big)-M\big(\hat\theta_{TL}(u_0)\big)\succeq 0, \qquad M\big(\hat\theta_{DVCM}(u_0)\big)-M\big(\hat\theta_{TL}(u_0)\big)\succeq 0.$$

By Theorem 1 of [50], this implies that

$$\mathrm{MSE}_A\big(\hat\theta_{TL}(u_0)\big)\le\min\Big\{\mathrm{MSE}_A\big(\hat\theta_{DVCM}(u_0)\big),\ \mathrm{MSE}_A\big(\hat\theta_{LR}(u_0)\big)\Big\}$$

for any positive semidefinite $A$. The proof of the theorem then follows from Proposition 3.5: we have already argued that $\mathrm{MSE}_A\big(\hat\theta_{LR}(u_0)\big)\le Cn_0^{-1}$, which follows from basic properties of linear regression. Furthermore, in Proposition 3.5 we have established that

$$\mathrm{MSE}_A\big(\hat\theta_{DVCM}(u_0)\big)\le C\max\Big\{(K/\gamma)^{-2\beta},\ (n/\gamma)^{-\frac{2\beta}{2\beta+1}},\ n^{-1}\Big\}.$$

Hence, combining these, we obtain

$$\mathrm{MSE}_A\big(\hat\theta_{TL}(u_0)\big)\le\min\Big\{\mathrm{MSE}_A\big(\hat\theta_{LR}(u_0)\big),\ \mathrm{MSE}_A\big(\hat\theta_{DVCM}(u_0)\big)\Big\}\le C\Big(n_0^{-1}\wedge\max\Big\{(K/\gamma)^{-2\beta},\ (n/\gamma)^{-\frac{2\beta}{2\beta+1}},\ n^{-1}\Big\}\Big).$$

This completes the proof.

B.3 Proof of Corollary 3.8

Proof B.3 Define the event

$$E = \Big\{\frac12\,\frac{\sigma^2(u_0)}{n_0}\,M\big(\hat\theta_{DVCM}(u_0)\big)^{-1}\preceq\hat Q\preceq 2\,\frac{\sigma^2(u_0)}{n_0}\,M\big(\hat\theta_{DVCM}(u_0)\big)^{-1}\Big\},$$

and the desired rate

$$r_{n,K,\gamma} = \sqrt{n_0\vee\min\big\{(K/\gamma)^{2\beta},\ (n/\gamma)^{2\beta/(2\beta+1)},\ n\big\}}.$$

By Markov's inequality, it follows that

$$P\Big(r_{n,K,\gamma}\big\|\hat\theta_{TL,\hat Q}(u_0)-\theta(u_0)\big\|\ge t\Big) = P\Big(r_{n,K,\gamma}\big\|\hat\theta_{TL,\hat Q}(u_0)-\theta(u_0)\big\|\ge t,\ E\Big)+P\Big(r_{n,K,\gamma}\big\|\hat\theta_{TL,\hat Q}(u_0)-\theta(u_0)\big\|\ge t,\ E^c\Big)\le P\Big(r_{n,K,\gamma}\big\|\hat\theta_{TL,\hat Q}(u_0)-\theta(u_0)\big\|\ge t\ \Big|\ E\Big)+P(E^c)\le\frac{r_{n,K,\gamma}^2}{t^2}\,E\Big[\big\|\hat\theta_{TL,\hat Q}(u_0)-\theta(u_0)\big\|^2\ \Big|\ E\Big]+P(E^c).$$

Consider the expectation term.
By Theorem 3.6, it holds that

$$r_{n,K,\gamma}^2\,E\Big[\big\|\hat\theta_{TL,\hat Q}(u_0)-\theta(u_0)\big\|^2\ \Big|\ E\Big]\le C$$

for some constant $C>0$, and thus we have

$$P\Big(r_{n,K,\gamma}\big\|\hat\theta_{TL,\hat Q}(u_0)-\theta(u_0)\big\|\ge t\Big)\le\frac C{t^2}+P(E^c).$$

Taking $\limsup_{n\to\infty}$ on both sides, it follows that

$$\limsup_{n\to\infty}P\Big(r_{n,K,\gamma}\big\|\hat\theta_{TL,\hat Q}(u_0)-\theta(u_0)\big\|\ge t\Big)\le\frac C{t^2}+\limsup_{n\to\infty}P(E^c).$$

Now let us deal with the term $P(E^c)$. By the definition of $E$, we obtain

$$P(E) = P\Big(\frac12\,\frac{\sigma^2(u_0)}{n_0}M\big(\hat\theta_{DVCM}(u_0)\big)^{-1}\preceq\delta\,\frac{\hat\sigma^2(u_0)}{n_0}\hat M\big(\hat\theta_{DVCM}(u_0)\big)^{-1}\preceq 2\,\frac{\sigma^2(u_0)}{n_0}M\big(\hat\theta_{DVCM}(u_0)\big)^{-1}\Big) = P\Big(\frac12I\preceq\delta\,\frac{\hat\sigma^2(u_0)}{\sigma^2(u_0)}\,M\big(\hat\theta_{DVCM}(u_0)\big)\hat M\big(\hat\theta_{DVCM}(u_0)\big)^{-1}\preceq 2I\Big).$$

By their definitions,

$$\frac{\hat\sigma^2(u_0)}{\sigma^2(u_0)} = 1+o_p(1), \qquad M\big(\hat\theta_{DVCM}(u_0)\big)\hat M\big(\hat\theta_{DVCM}(u_0)\big)^{-1} = I+o_p(1),$$

and thus the middle term satisfies

$$\delta\,\frac{\hat\sigma^2(u_0)}{\sigma^2(u_0)}\,M\big(\hat\theta_{DVCM}(u_0)\big)\hat M\big(\hat\theta_{DVCM}(u_0)\big)^{-1}\ \xrightarrow{p}\ \delta I \quad\text{as }n\to\infty.$$

The coefficient $\delta\in(\frac12, 2)$ by its definition, and this implies that $P(E^c)\to 0$ as $n\to\infty$. Hence, we get

$$\limsup_{n\to\infty}P\Big(r_{n,K,\gamma}\big\|\hat\theta_{TL,\hat Q}(u_0)-\theta(u_0)\big\|\ge t\Big)\le\frac C{t^2},$$

which means that $r_{n,K,\gamma}\big(\hat\theta_{TL,\hat Q}(u_0)-\theta(u_0)\big) = O_p(1)$ as $n\to\infty$.

B.4 Proof of Theorem 3.9

Proof B.4 We want to establish a lower bound for

$$\inf_{\hat\theta(u_0)}\ \sup_{\forall j\in[p],\ \theta_j\in\mathcal H(\beta, L)}\ E\big\|\hat\theta(u_0)-\theta(u_0)\big\|_A^2.$$

We will use Le Cam's approach [40] to find the minimax lower bound. We consider the data generating processes

$$Y_{ki}\mid U_k, X_{ki}\sim N\big(X_{ki}^\top\theta_0(U_k), 1\big),\quad\theta_0 = (\theta_{01},\dots,\theta_{0p})^\top; \qquad Y_{ki}\mid U_k, X_{ki}\sim N\big(X_{ki}^\top\theta_1(U_k), 1\big),\quad\theta_1 = (\theta_{11},\dots,\theta_{1p})^\top.$$

Recall that $A$ is a positive semidefinite matrix whose $j$-th largest eigenvalue is $\lambda_{A,j}$ (with $\lambda_{A,1}>0$) and whose corresponding eigenvector is $v_{A,j} = [v_{A,j1},\dots,v_{A,jp}]^\top$, with which each coordinate of the functional coefficients is defined as

$$\theta_{0j}(u)\equiv 0 \quad\text{for }j = 1,\dots,p,$$
$$\theta_{1j}(u) = v_{A,1j}\,L\,h^\beta\,W\Big(\frac{u-u_0}{h}\Big) \quad\text{for }j = 1,\dots
, p, where the bump function is

$$W(u) = c_0\exp\Big(-\frac{1}{1-u^2}\Big)\mathbf 1(|u|\le 1),$$

and $c_0>0$ is a small constant. There exists a threshold $C_0>0$ such that if $c_0\le C_0$, then $W$ belongs to $\mathcal H(\beta, 1/2)$. Using the definition $l = \lfloor\beta\rfloor$ and the fact that $W\in\mathcal H(\beta, 1/2)$, one shows that

$$\big|\theta_{1j}^{(l)}(u)-\theta_{1j}^{(l)}(u')\big| = |v_{A,1j}|\,L\,h^{\beta-l}\Big|W^{(l)}\Big(\frac{u-u_0}{h}\Big)-W^{(l)}\Big(\frac{u'-u_0}{h}\Big)\Big|\le|v_{A,1j}|\,\frac L2\,|u-u'|^{\beta-l};$$

thus the constructed hypotheses lie in the parameter space, that is, $\theta_{1j}\in\mathcal H(\beta, L)$ for all $j\in[p]$. Moreover, we select the domain sizes to be equal:

$$n_k\equiv\bar n \quad\text{for }k = 0, 1,\dots,K, \tag{B.3}$$

let the $U_k$'s be i.i.d. from the pdf $\frac1\gamma f\big(\frac{u-u^*}{\gamma}\big)$ with $f(\cdot)\le a_0$ (B.4), and make $X_{ki}$ bounded: $\|X_{ki}\|_2\le 1$ (B.5). Such choices of $n_k$, $U_k$, and $X_{ki}$ also lie in the presumed DGP; see Assumptions 3.2 and 3.3. Our metric of interest is lower bounded by

$$\|\theta_1(u_0)-\theta_0(u_0)\|_A = \sqrt{\theta_1(u_0)^\top A\,\theta_1(u_0)} = \sqrt{\sum_{j=1}^p\lambda_{A,j}\big(v_{A,j}^\top\theta_1(u_0)\big)^2}\ \ge\ \sqrt{\lambda_{A,1}\big(v_{A,1}^\top\theta_1(u_0)\big)^2} = \sqrt{\lambda_{A,1}}\,L\,h^\beta\,W(0).$$

We set $r_{n,K,\gamma} = \sqrt{\lambda_{A,1}}\,L\,h^\beta\,W(0)/3$, with $h$ to be chosen later. Let $P_0$ and $P_1$ be the joint distributions of $\{(U_k, X_{ki}, Y_{ki}) : k\in\{0\}\cup[K],\ i\in[n_k]\}$ under the hypotheses $\theta_0$ and $\theta_1$. Then the following holds:

$$\inf_{\hat\theta(u_0)}\sup_{\forall j\in[p],\ \theta_j\in\mathcal H(\beta,L)}E\big\|\hat\theta(u_0)-\theta(u_0)\big\|_A^2\ \ge\ \inf_{\hat\theta(u_0)}\max_{\theta\in\{\theta_0,\theta_1\}}E\big\|\hat\theta(u_0)-\theta(u_0)\big\|_A^2\ \ge\ r_{n,K,\gamma}^2\inf_{\hat\theta(u_0)}\max_{\theta\in\{\theta_0,\theta_1\}}P\Big(\big\|\hat\theta(u_0)-\theta(u_0)\big\|_A\ge r_{n,K,\gamma}\Big)\ \ge\ r_{n,K,\gamma}^2\inf_\psi\max_{j\in\{0,1\}}P_j(\psi\neq j). \tag{B.6}$$

Now we justify the last inequality in (B.6). By the definition of $r_{n,K,\gamma}$, it holds that $\|\theta_1(u_0)-\theta_0(u_0)\|_A>2r_{n,K,\gamma}$. Then, for any estimator $\hat\theta(u_0)$, it is impossible for both $\|\hat\theta(u_0)-\theta_0(u_0)\|_A<r_{n,K,\gamma}$ and $\|\hat\theta(u_0)-\theta_1(u_0)\|_A<r_{n,K,\gamma}$ to hold simultaneously.
Indeed, by the triangle inequality, this would imply

$$\|\theta_1(u_0)-\theta_0(u_0)\|_A\le\|\theta_1(u_0)-\hat\theta(u_0)\|_A+\|\hat\theta(u_0)-\theta_0(u_0)\|_A<2r_{n,K,\gamma},$$

which contradicts the assumption. Therefore, for any estimator $\hat\theta(u_0)$, at least one of the following events must occur:

$$\|\hat\theta(u_0)-\theta_0(u_0)\|_A\ge r_{n,K,\gamma} \quad\text{or}\quad \|\hat\theta(u_0)-\theta_1(u_0)\|_A\ge r_{n,K,\gamma}.$$

Now define a test function $\psi\in\{0,1\}$ based on the estimator by

$$\psi = \arg\min_{j\in\{0,1\}}\|\hat\theta(u_0)-\theta_j(u_0)\|_A,$$

which selects the hypothesis closest to the estimator. Then

$$\psi\neq j\ \Longrightarrow\ \|\hat\theta(u_0)-\theta_j(u_0)\|_A\ge r_{n,K,\gamma},$$

so that

$$\max_{j\in\{0,1\}}P_j\Big(\|\hat\theta(u_0)-\theta_j(u_0)\|_A\ge r_{n,K,\gamma}\Big)\ge\max_{j\in\{0,1\}}P_j(\psi\neq j),$$

which proves the last line of (B.6). Moreover, by the definition of the total variation distance and Pinsker's inequality, we have

$$P_0(\psi\neq 0)+P_1(\psi\neq 1) = 1-\big(P_1(\psi = 1)-P_0(\psi = 1)\big)\ \ge\ 1-\mathrm{TV}(P_0, P_1)\ \ge\ 1-\sqrt{\tfrac12\mathrm{KL}(P_0\|P_1)}, \tag{B.7}$$

and thus

$$\inf_{\hat\theta(u_0)}\sup_{\forall j\in[p],\ \theta_j\in\mathcal H(\beta,L)}E\big\|\hat\theta(u_0)-\theta(u_0)\big\|_A^2\ \ge\ \frac{r_{n,K,\gamma}^2}{2}\Big(1-\sqrt{\tfrac12\mathrm{KL}(P_0\|P_1)}\Big). \tag{B.8}$$

This completes the reduction part of Le Cam's approach. Now let $p_0, p_1$ be the density functions associated with $P_0, P_1$; then

$$\mathrm{KL}(P_0\|P_1) = E_{P_0}\Big[\log\frac{p_0\big(U_k, X_{ki} : k\in\{0\}\cup[K], i\in I_k\big)\prod_{k=0}^K\prod_{i\in I_k}p_0(Y_{ki}\mid U_k, X_{ki})}{p_1\big(U_k, X_{ki} : k\in\{0\}\cup[K], i\in I_k\big)\prod_{k=0}^K\prod_{i\in I_k}p_1(Y_{ki}\mid U_k, X_{ki})}\Big],$$

where $p_j(U_k, X_{ki} : k\in\{0\}\cup[K], i\in I_k)$ is the joint density of all the $U_k$ and $X_{ki}$, and $p_j(Y_{ki}\mid U_k, X_{ki})$ is the conditional density of $Y_{ki}$ given $(U_k, X_{ki})$.
The marginal distributions of $(U_k, X_{ki})$ are the same under the two hypotheses, which implies

$$p_0\big(U_k, X_{ki} : k\in\{0\}\cup[K], i\in I_k\big) = p_1\big(U_k, X_{ki} : k\in\{0\}\cup[K], i\in I_k\big),$$

so it suffices to consider the conditional distributions $Y_{ki}\mid U_k, X_{ki}$ when calculating $\mathrm{KL}(P_0\|P_1)$. Utilizing the normality of $Y_{ki}\mid U_k, X_{ki}$ and letting $\varphi$ be the pdf of the standard normal distribution, we obtain

$$\mathrm{KL}(P_0\|P_1) = E_{P_0}\Big[\log\frac{\prod_{k=0}^K\prod_{i\in I_k}p_0(Y_{ki}\mid U_k, X_{ki})}{\prod_{k=0}^K\prod_{i\in I_k}p_1(Y_{ki}\mid U_k, X_{ki})}\Big] = \sum_{k=0}^K\sum_{i\in I_k}E_{U_k,X_{ki}}\Big[\int\log\frac{\varphi(t)}{\varphi\big(t-\theta_1(U_k)^\top X_{ki}\big)}\,\varphi(t)\,dt\Big] = \frac12\sum_{k=0}^K\sum_{i\in I_k}E_{U_k,X_{ki}}\big[\big(\theta_1(U_k)^\top X_{ki}\big)^2\big]\le\sum_{k=0}^K\sum_{i\in I_k}E_{U_k}\big[\|\theta_1(U_k)\|_2^2\big] \quad[\because\text{Equation (B.5)}]. \tag{B.9}$$

It holds by the definition of $\theta_1$ that

$$\|\theta_1(U_k)\|_2^2 = L^2h^{2\beta}W^2\Big(\frac{u_0-U_k}{h}\Big)\le L^2h^{2\beta}W^2(0)\,\mathbf 1(|u_0-U_k|\le h),$$

so we get the upper bound

$$\mathrm{KL}(P_0\|P_1)\le\sum_{k=0}^K\sum_{i\in I_k}E_{U_k}\big[\|\theta_1(U_k)\|_2^2\big]\le L^2W^2(0)h^{2\beta}\sum_{k=0}^K\sum_{i\in I_k}E_{U_k}\big[\mathbf 1(|u_0-U_k|\le h)\big]\le L^2c_0^2h^{2\beta}\sum_{k=0}^Kn_k\,E_{U_k}\big[\mathbf 1(|u_0-U_k|\le h)\big] \quad[\because W(0)\le c_0\text{ by its definition}]. \tag{B.10}$$

Due to Equation (B.4) on the upper bound of the density of $U_k$, for any $k\in[K]$ the expectation is upper bounded by

$$E_{U_k}\big[\mathbf 1(|u_0-U_k|\le h)\big]\le\frac{2a_0h}{\gamma},$$

while for $k = 0$, $E_{U_k}[\mathbf 1(|u_0-U_k|\le h)] = 1$, and thus

$$\mathrm{KL}(P_0\|P_1)\le L^2c_0^2h^{2\beta}\Big(n_0+\frac{2a_0h}{\gamma}\sum_{k=1}^Kn_k\Big)\le(1\vee 2a_0)L^2c_0^2h^{2\beta}\Big(n_0+\frac h\gamma\sum_{k=1}^Kn_k\Big).$$

Because of the inequality $x+y\le 2(x\vee y)$ for $x, y>0$, we have

$$\mathrm{KL}(P_0\|P_1)\le(2\vee 4a_0)L^2c_0^2h^{2\beta}\Big(n_0\vee\frac h\gamma\sum_{k=1}^Kn_k\Big). \tag{B.11}$$

We are going to discuss three possible choices of $h\in\{h_1, h_2, h_3\}$. All the choices of $h$ lead to the same upper bound on the KL divergence, namely $\mathrm{KL}(P_0\|P_1)\le 2\alpha$ for some constant $\alpha\in(0, 1/2)$.
Then we apply Equation (B.8) and use the definition $r_{n,K,\gamma} = \sqrt{\lambda_{A,1}}\,L\,h^\beta\,W(0)/3$ to obtain

$$\inf_{\hat\theta(u_0)}\sup_{\forall j\in[p],\ \theta_j\in\mathcal H(\beta,L)}E\big\|\hat\theta(u_0)-\theta(u_0)\big\|_A^2\ \ge\ \frac{r_{n,K,\gamma}^2}{2}\Big(1-\sqrt{\tfrac12\mathrm{KL}(P_0\|P_1)}\Big)\ \ge\ \frac{\big(\sqrt{\lambda_{A,1}}\,L\,h_j^\beta\,W(0)/3\big)^2}{2}\big(1-\sqrt\alpha\big) = \frac{\lambda_{A,1}L^2W^2(0)h_j^{2\beta}}{18}\big(1-\sqrt\alpha\big) \quad\text{for }j = 1, 2, 3. \tag{B.12}$$

Hence, we are able to take the maximum across the three $h_j$'s to find the tightest bound, and this yields the lower bound

$$\inf_{\hat\theta(u_0)}\sup_{\forall j\in[p],\ \theta_j\in\mathcal H(\beta,L)}E\big\|\hat\theta(u_0)-\theta(u_0)\big\|_A^2\ \ge\ \max_{j\in\{1,2,3\}}\frac{\lambda_{A,1}L^2W^2(0)h_j^{2\beta}}{18}\big(1-\sqrt\alpha\big).$$

Now we discuss the three choices of $h\in\{h_1, h_2, h_3\}$ and their corresponding minimax lower bounds.

Case 1: We take the first choice of $h$:

$$h = h_1 = n_0^{-\frac1{2\beta}}\wedge(\gamma/K),$$

and use it to find the corresponding $\alpha$ and upper bound on the KL divergence. Equation (B.3) says $n_k\equiv\bar n$ for $k = 0, 1,\dots,K$. Following the intermediate result in Equation (B.11), we then obtain the upper bound

$$\mathrm{KL}(P_0\|P_1)\le(2\vee 4a_0)L^2c_0^2h^{2\beta}\,\bar n\Big(1\vee\frac{hK}{\gamma}\Big).$$

With the definition of $h = h_1$, it follows that

$$1\vee\frac{hK}{\gamma}\le 1\vee\frac K\gamma\Big(n_0^{-\frac1{2\beta}}\wedge(\gamma/K)\Big)\le 1.$$

Also, by Equation (B.3), $n_0 = \bar n$, and we obtain

$$h^{2\beta} = n_0^{-1}\wedge(K/\gamma)^{-2\beta}\le n_0^{-1} = \bar n^{-1}.$$

Hence, we can plug the upper bounds for $1\vee hK/\gamma$ and $h^{2\beta}$ into the bound on $\mathrm{KL}(P_0\|P_1)$. This yields

$$\mathrm{KL}(P_0\|P_1)\le(2\vee 4a_0)L^2c_0^2.$$

Note that the rate of $r_{n,K,\gamma} = \sqrt{\lambda_{A,1}}\,L\,h^\beta\,W(0)/3$ is immediately obtained from the selected $h$ in each case.
Therefore, by Equation (B.8) we have

$$\inf_{\hat\theta(u_0)}\sup_{\forall j\in[p],\ \theta_j\in\mathcal H(\beta,L)}E\big\|\hat\theta(u_0)-\theta(u_0)\big\|_A^2\ \ge\ \frac{r_{n,K,\gamma}^2}{2}\Big(1-\sqrt{(1\vee 2a_0)L^2c_0^2}\Big) = \frac{\lambda_{A,1}L^2W^2(0)\big(n_0^{-1}\wedge(K/\gamma)^{-2\beta}\big)}{18}\Big(1-\sqrt{(1\vee 2a_0)L^2c_0^2}\Big) = \frac{\lambda_{A,1}L^2c_0^2\big(n_0^{-1}\wedge(K/\gamma)^{-2\beta}\big)}{18e^2}\Big(1-\sqrt{(1\vee 2a_0)L^2c_0^2}\Big), \tag{B.13}$$

where the last equality is by the definition

$$W(u) = c_0\exp\Big(-\frac{1}{1-u^2}\Big)\mathbf 1(|u|\le 1)\ \Longrightarrow\ W(0) = \frac{c_0}{e}.$$

We want to find $c_0$ such that

$$1-\sqrt{(1\vee 2a_0)L^2c_0^2}\ge\frac12\ \Longrightarrow\ c_0\le\frac{1}{2\sqrt{(1\vee 2a_0)L^2}}.$$

Recall that $W\in\mathcal H(\beta, 1/2)$ only if $c_0\le C_0$, so we can set

$$c_0 = C_0\wedge\frac{1}{2\sqrt{(1\vee 2a_0)L^2}}.$$

This yields the lower bound

$$\inf_{\hat\theta(u_0)}\sup_{\forall j\in[p],\ \theta_j\in\mathcal H(\beta,L)}E\big\|\hat\theta(u_0)-\theta(u_0)\big\|_A^2\ \ge\ \frac{\lambda_{A,1}L^2c_0^2}{36e^2}\Big(n_0^{-1}\wedge(K/\gamma)^{-2\beta}\Big).$$

Case 2: In this case, we define

$$h = h_2 = n_0^{-\frac1{2\beta}}\wedge n^{-\frac1{2\beta}}\ \Longrightarrow\ h^{2\beta} = n_0^{-1}\wedge n^{-1}.$$

Plugging this bandwidth into Equation (B.10) yields

$$\mathrm{KL}(P_0\|P_1)\le L^2c_0^2h^{2\beta}\,E\Big[\sum_{k=0}^Kn_k\mathbf 1(|u_0-U_k|\le h)\Big].$$

Notice that the following upper bound always holds:

$$\sum_{k=0}^Kn_k\mathbf 1(|u_0-U_k|\le h)\le n.$$

Hence,

$$\mathrm{KL}(P_0\|P_1)\le L^2c_0^2h^{2\beta}n = L^2c_0^2\Big(n_0^{-\frac1{2\beta}}\wedge n^{-\frac1{2\beta}}\Big)^{2\beta}n\le L^2c_0^2\le(2\vee 4a_0)L^2c_0^2.$$

Again, by Equation (B.8) we have

$$\inf_{\hat\theta(u_0)}\sup_{\forall j\in[p],\ \theta_j\in\mathcal H(\beta,L)}E\big\|\hat\theta(u_0)-\theta(u_0)\big\|_A^2\ \ge\ \frac{r_{n,K,\gamma}^2}{2}\Big(1-\sqrt{(1\vee 2a_0)L^2c_0^2}\Big) = \frac{\lambda_{A,1}L^2c_0^2\big(n_0^{-1}\wedge n^{-1}\big)}{18e^2}\Big(1-\sqrt{(1\vee 2a_0)L^2c_0^2}\Big). \tag{B.14}$$

As in Case 1, we can set $c_0 = C_0\wedge\frac{1}{2\sqrt{(1\vee 2a_0)L^2}}$ to find the lower bound

$$\ge\frac{\lambda_{A,1}L^2c_0^2}{36e^2}\big(n_0^{-1}\wedge n^{-1}\big).$$

Case 3: In this case, we utilize the third choice of $h$:

$$h = h_3 = (n/\gamma)^{-\frac1{2\beta+1}}\wedge n_0^{-\frac1{2\beta}}.$$

First, it follows from Equation (B.11) that

$$\mathrm{KL}(P_0\|P_1)\le(2\vee 4a_0)L^2c_0^2h^{2\beta}\Big(n_0\vee\frac h\gamma\sum_{k=1}^Kn_k\Big).$$
By the construction of $h = h_3$ we have

$$n_0\vee\frac h\gamma\sum_{k=1}^Kn_k = n_0\vee\frac{(n/\gamma)^{-\frac1{2\beta+1}}\wedge n_0^{-\frac1{2\beta}}}{\gamma}\sum_{k=1}^Kn_k\le n_0\vee\frac{(n/\gamma)^{-\frac1{2\beta+1}}}{\gamma}\sum_{k=0}^Kn_k.$$

Recall that $n = \sum_{k=0}^Kn_k$, and thus

$$n_0\vee\frac h\gamma\sum_{k=1}^Kn_k\le n_0\vee(n/\gamma)^{\frac{2\beta}{2\beta+1}}.$$

Moreover, the definition of $h$ also implies

$$h^{2\beta} = n_0^{-1}\wedge(n/\gamma)^{-\frac{2\beta}{2\beta+1}} = \Big(n_0\vee(n/\gamma)^{\frac{2\beta}{2\beta+1}}\Big)^{-1}.$$

Now we can plug the values of $n_0\vee\frac h\gamma\sum_{k=1}^Kn_k$ and $h^{2\beta}$ into $\mathrm{KL}(P_0\|P_1)$ and get

$$\mathrm{KL}(P_0\|P_1)\le(2\vee 4a_0)L^2c_0^2h^{2\beta}\Big(n_0\vee\frac h\gamma\sum_{k=1}^Kn_k\Big)\le(2\vee 4a_0)L^2c_0^2\Big(n_0\vee(n/\gamma)^{\frac{2\beta}{2\beta+1}}\Big)^{-1}\Big(n_0\vee(n/\gamma)^{\frac{2\beta}{2\beta+1}}\Big) = (2\vee 4a_0)L^2c_0^2.$$

Again, using the lower bound in Equation (B.8), we have

$$\inf_{\hat\theta(u_0)}\sup_{\forall j\in[p],\ \theta_j\in\mathcal H(\beta,L)}E\big\|\hat\theta(u_0)-\theta(u_0)\big\|_A^2\ \ge\ \frac{r_{n,K,\gamma}^2}{2}\Big(1-\sqrt{(1\vee 2a_0)L^2c_0^2}\Big) = \frac{\lambda_{A,1}L^2c_0^2\big(n_0^{-1}\wedge(n/\gamma)^{-\frac{2\beta}{2\beta+1}}\big)}{18e^2}\Big(1-\sqrt{(1\vee 2a_0)L^2c_0^2}\Big). \tag{B.15}$$

As in Case 1, we can set $c_0 = C_0\wedge\frac{1}{2\sqrt{(1\vee 2a_0)L^2}}$ to find the lower bound

$$\ge\frac{\lambda_{A,1}L^2c_0^2}{36e^2}\Big(n_0^{-1}\wedge(n/\gamma)^{-\frac{2\beta}{2\beta+1}}\Big).$$

Summarizing Cases 1–3: by Equation (B.12), we take the maximum of the lower bounds in Equations (B.13)–(B.15), and it follows that

$$\inf_{\hat\theta(u_0)}\sup_{\forall j\in[p],\ \theta_j\in\mathcal H(\beta,L)}E\big\|\hat\theta(u_0)-\theta(u_0)\big\|_A^2\ \ge\ C\cdot\Big(\frac1{n_0}\wedge\max\Big\{(K/\gamma)^{-2\beta},\ (n/\gamma)^{-\frac{2\beta}{2\beta+1}},\ \frac1n\Big\}\Big),$$

where $C = \frac{\lambda_{A,1}L^2c_0^2}{36e^2}$ for $c_0 = C_0\wedge\frac{1}{2\sqrt{(1\vee 2a_0)L^2}}$.

B.5 Proof of Theorem 3.10

Proof B.5 Linear regression is a special GLM with squared loss $\ell(\eta, y) = \frac12(y-\eta)^2$. Accordingly, we focus on the major changes relative to the proof of Lemma A.10; readers may refer to part 2 of Appendix C.9 for the proof under a more general setup. The norm $\|\cdot\|$ used in the proof denotes the $\ell_2$-norm when applied to a vector and the spectral norm when applied to a matrix. Write $\Psi(u_0) = E[XX^\top\mid U = u_0]$.
Let $\tau_{TL} := \hat\theta_{TL}(u_0)-\theta(u_0)$, $\tau_{DVCM} := \hat\theta_{DVCM}(u_0)-\theta(u_0)$, and $\tau_{LR} := \hat\theta_{LR}(u_0)-\theta(u_0)$. Recall the rates of $\hat\theta_{LR}(u_0)$ and $\hat\theta_{DVCM}(u_0)$ and their ratio:

$$\tau_{LR} = O_p(r_{LR}), \qquad \tau_{DVCM} = O_p(r_{DVCM}), \qquad \rho := \frac{r_{LR}}{r_{DVCM}},$$

and by construction $Q = \hat Q\asymp_p\rho^2I$, i.e., $c\rho^2\le\lambda_{\min}(Q)\le\lambda_{\max}(Q)\le C\rho^2$ with probability tending to one. Consider the TL objective with the linear DVCM:

$$L_n(\alpha) = \frac1{2n_0}\sum_{i\in I_0^*}\big(Y_{0i}-X_{0i}^\top\alpha\big)^2+\frac12\big\|\alpha-\hat\theta_{DVCM}(u_0)\big\|_Q^2.$$

Taking the gradient and setting it to zero at $\alpha = \hat\theta_{TL}(u_0)$ gives

$$\frac1{n_0}X_0^\top\big(X_0\hat\theta_{TL}(u_0)-Y_0\big)+Q\big(\hat\theta_{TL}(u_0)-\hat\theta_{DVCM}(u_0)\big) = 0,$$

where $X_0\in\mathbb R^{n_0\times p}$ has $X_{0i}$ as its $i$-th row and $Y_0\in\mathbb R^{n_0}$ has entries $Y_{0i}$. Let $\epsilon_0\in\mathbb R^{n_0}$ be the vector with $\varepsilon_{0i}$ as its $i$-th entry. Substituting $Y_0 = X_0\theta(u_0)+\epsilon_0$ and $\hat\theta_{TL}(u_0) = \theta(u_0)+\tau_{TL}$:

$$\frac1{n_0}X_0^\top\big(X_0\tau_{TL}-\epsilon_0\big)+Q\big(\tau_{TL}-\tau_{DVCM}\big) = 0\quad\Longleftrightarrow\quad\tau_{TL} = \Big(\frac1{n_0}X_0^\top X_0+Q\Big)^{-1}\Big(Q\tau_{DVCM}+\frac1{n_0}X_0^\top\epsilon_0\Big).$$

Moreover, the involved components have the following limits:

$$\frac1{n_0}X_0^\top X_0 = \Psi(u_0)+o_p(1), \qquad \frac1{n_0}X_0^\top\epsilon_0 = \frac1{n_0}X_0^\top X_0\,\tau_{LR}\ \Longrightarrow\ \frac1{n_0}X_0^\top\epsilon_0 = \big(\Psi(u_0)+o_p(1)\big)\tau_{LR} \quad(\because\text{OLS solution}),$$

so the linear representation is written as

$$\tau_{TL} = \big(\Psi(u_0)+o_p(1)+Q\big)^{-1}\Big\{Q\tau_{DVCM}+\big(\Psi(u_0)+o_p(1)\big)\tau_{LR}\Big\} = \big(\Psi(u_0)+Q\big)^{-1}\Big\{Q\tau_{DVCM}+\Psi(u_0)\tau_{LR}\Big\}\big(1+o_p(1)\big). \tag{B.16}$$

By sample splitting, $\tau_{LR}$ is independent of $\tau_{DVCM}$. Now we discuss the two cases $\rho\to 0$ and $\rho\to\infty$.

Regime $\rho\to 0$ (LR-dominated). The properties $\rho := r_{LR}/r_{DVCM}\to 0$ and $Q = O_p(\rho^2)$ jointly imply that the inverse factor in (B.16) satisfies $\big(\Psi(u_0)+Q\big)^{-1} = \Psi(u_0)^{-1}+o_p(1)$. Moreover,

$$\big\|\Psi^{-1}(u_0)\,Q\,\tau_{DVCM}\big\| = O_p\big(\rho^2r_{DVCM}\big) = o_p(r_{LR}).$$

Thus, from the representation in (B.16), $\tau_{TL} = \tau_{LR}+o_p(r_{LR})$. By Slutsky's theorem and part 1 of Proposition A.8, it follows that

$$r_{LR}^{-1}\big(\hat\theta_{TL}(u_0)-\theta(u_0)\big)\xrightarrow{d}N\big(0, \Omega_{LR}(u_0)\big).$$
Regime $\rho \to \infty$ (DVCM-dominated). Start from the linearization
\[
\tau_{\mathrm{TL}} = ( \Psi(u_0) + Q )^{-1} \{ Q \tau_{\mathrm{DVCM}} + \Psi(u_0) \tau_{\mathrm{LR}} \} (1 + o_p(1)).
\]
Use the identity
\[
( \Psi(u_0) + Q )^{-1} Q = ( \Psi(u_0) + Q )^{-1} ( \Psi(u_0) + Q - \Psi(u_0) ) = I - ( \Psi(u_0) + Q )^{-1} \Psi(u_0),
\]
to rewrite
\[
\tau_{\mathrm{TL}} = \big\{ \tau_{\mathrm{DVCM}} - ( \Psi(u_0) + Q )^{-1} \Psi(u_0) \tau_{\mathrm{DVCM}} + ( \Psi(u_0) + Q )^{-1} \Psi(u_0) \tau_{\mathrm{LR}} \big\} (1 + o_p(1)).
\tag{B.17}
\]
Since $\lambda_{\min}(Q) \asymp_p \rho^2$ and $\Psi(u_0) \succeq c_0' I$, we have $\| ( \Psi(u_0) + Q )^{-1} \| \le \| Q^{-1} \| = O_p(\rho^{-2})$ and $\| \Psi(u_0) \| = O(1)$. Hence it follows that
\[
\| ( \Psi(u_0) + Q )^{-1} \Psi(u_0) \tau_{\mathrm{DVCM}} \| \le \| ( \Psi(u_0) + Q )^{-1} \| \, \| \Psi(u_0) \| \, \| \tau_{\mathrm{DVCM}} \| = O_p(\rho^{-2} r_{\mathrm{DVCM}}) = o_p(r_{\mathrm{DVCM}}),
\]
and, using $r_{\mathrm{LR}} = \rho\, r_{\mathrm{DVCM}}$,
\[
\| ( \Psi(u_0) + Q )^{-1} \Psi(u_0) \tau_{\mathrm{LR}} \| \le \| ( \Psi(u_0) + Q )^{-1} \| \, \| \Psi(u_0) \| \, \| \tau_{\mathrm{LR}} \| = O_p(\rho^{-2} r_{\mathrm{LR}}) = O_p(r_{\mathrm{DVCM}} / \rho) = o_p(r_{\mathrm{DVCM}}).
\]
Hence, going back to Equation (B.17), this yields $\tau_{\mathrm{TL}} = \tau_{\mathrm{DVCM}} + o_p(r_{\mathrm{DVCM}})$. Therefore, by Slutsky's theorem and part 2 of Proposition A.8,
\[
r_{\mathrm{DVCM}}^{-1} \big( \hat\theta_{\mathrm{TL}}(u_0) - \theta(u_0) - b_{\mathrm{DVCM}}(u_0) \big) \xrightarrow{d} \mathcal{N}\big( 0, \Omega_{\mathrm{DVCM}}(u_0) \big).
\]
Proposition A.8 also implies that $r_{\mathrm{DVCM}}^2 \asymp h^{2\beta} + \gamma/(nh)$ and $b_{\mathrm{DVCM}}(u_0) \lesssim h^\beta$, and thus the condition $n h^{2\beta+1} / \gamma \to 0$ implies $r_{\mathrm{DVCM}}^{-1} b_{\mathrm{DVCM}}(u_0) = o_p(1)$, and this proves the theorem.

B.6 Proof of Theorem 3.13

Proof B.6 It is shown in Proposition A.9 that
\[
\big\| \hat\theta_{\mathrm{GDVCM}}(u_0) - \theta(u_0) \big\|_2^2 = O_p\big( h^{2\beta} \big) + O_p\big( \gamma / (nh) \big),
\]
and $h$ is set to be
\[
h = \mathrm{med}\Big\{ e_0 (n/\gamma)^{-\frac{1}{2\beta+1}},\, d_{(1)}(u_0),\, d_{(K)}(u_0) \Big\}
= \mathrm{med}\Big\{ O\big( (n/\gamma)^{-\frac{1}{2\beta+1}} \big),\, O_p(\gamma/K),\, O_p(\gamma) \Big\},
\]
where the orders of $d_{(1)}(u_0)$ and $d_{(K)}(u_0)$ are discussed in Lemma A.1. Then it follows that the bias-squared and variance of DVCM are of order
\[
O_p\big( h^{2\beta} \big) = \mathrm{med}\Big\{ O_p\big( (n/\gamma)^{-\frac{2\beta}{2\beta+1}} \big),\, O_p\big( (\gamma/K)^{2\beta} \big),\, O_p\big( \gamma^{2\beta} \big) \Big\},
\]
\[
O_p\big( \gamma/(nh) \big) = \mathrm{med}\Big\{ O_p\big( (n/\gamma)^{-\frac{2\beta}{2\beta+1}} \big),\, O_p( K/n ),\, O_p\big( n^{-1} \big) \Big\}.
\]
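Which entry of each $\mathrm{med}\{\cdot\}$ display dominates depends on how $n\gamma^{2\beta}$ compares with $1$ and with $K^{2\beta+1}$, which is exactly the case split carried out next. A quick numerical illustration (the parameter values below are arbitrary assumptions chosen to land in each regime):

```python
def rate_terms(n, K, gamma, beta):
    # the three candidate squared-error rates appearing in Theorem 3.13
    return {
        "(gamma/K)^(2*beta)": (gamma / K) ** (2 * beta),
        "(n/gamma)^(-2*beta/(2*beta+1))": (n / gamma) ** (-2 * beta / (2 * beta + 1)),
        "1/n": 1.0 / n,
    }

def dominant(n, K, gamma, beta):
    terms = rate_terms(n, K, gamma, beta)
    return max(terms, key=terms.get)

beta = 1.0
# Case 1: n*gamma^(2b) >> K^(2b+1)       -> (gamma/K)^(2b) dominates
print(dominant(n=10**6, K=10, gamma=0.5, beta=beta))
# Case 3: 1 <~ n*gamma^(2b) <~ K^(2b+1)  -> (n/gamma)^(-2b/(2b+1)) dominates
print(dominant(n=10**4, K=10**3, gamma=0.05, beta=beta))
# Case 2: n*gamma^(2b) << 1              -> 1/n dominates
print(dominant(n=10**4, K=10**3, gamma=1e-4, beta=beta))
```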
We discuss the rates of O p h 2 β and O p ( γ / n h ) under three cir cumstances: i ) n γ 2 β ≫ K 2 β + 1 , i i ) n γ 2 β ≪ 1, i i i ) 1 ≲ n γ 2 β ≲ K 2 β + 1 . Case 1: At first we consider the case n γ 2 β ≫ K 2 β + 1 . This implies: n γ 2 β ≫ K 2 β + 1 = ⇒ ( n / γ ) − 2 β / ( 2 β + 1 ) ≪ ( γ / K ) 2 β . Also, K ≥ 1 immediately imply ( γ / K ) 2 β ≲ γ 2 β . Thus, it follows that O p h 2 β = O p ( γ / K ) 2 β by its definition. Moreover , the same condition n γ 2 β ≫ K 2 β + 1 also implies that n γ 2 β ≫ K 2 β + 1 = ⇒ K / n ≪ ( γ / K ) 2 β where the K / n term has lower bound n − 1 ≲ K / n . Thus, we have shown that ( γ / K ) 2 β ≳ K / n ∨ n − 1 ∨ ( n / γ ) − 2 β / ( 2 β + 1 ) , and this means O p ( γ / n h ) ≲ O p h 2 β = O p ( γ / K ) 2 β , so we conclude in this case ˆ θ GDVCM ( u 0 ) − θ ( u 0 ) 2 2 = O p ( γ / K ) 2 β . Case 2: In this case, we assume that n γ 2 β ≪ 1 ≲ K 2 β + 1 . The first part of inequality implies that n γ 2 β ≪ 1 = ⇒ ( n / γ ) − 2 β 2 β + 1 ≪ n − 1 while K ≥ 1 immediately yields K / n ≳ n − 1 . Therefor e, by its definition, O p ( γ / n h ) = O p n − 1 . The condition n γ 2 β ≪ 1 also implies that n γ 2 β ≪ 1 = ⇒ n γ 2 β ≪ K 2 β = ⇒ ( γ / K ) 2 β ≪ n − 1 , n γ 2 β ≪ 1 = ⇒ γ 2 β ≪ n − 1 . 53 Hence, we have shown that O p ( γ / n h ) = O p n − 1 ≫ O p h 2 β = med O p ( n / γ ) − 2 β 2 β + 1 , O p ( γ / K ) 2 β , O p γ 2 β . and therefor e ˆ θ GDVCM ( u 0 ) − θ ( u 0 ) 2 2 = O p n − 1 . Case 3: In this case, we consider the last case, 1 ≲ n γ 2 β ≲ K 2 β + 1 . This condition implies n γ 2 β ≲ K 2 β + 1 = ⇒ ( n / γ ) − 2 β 2 β + 1 ≳ ( γ / K ) 2 β , 1 ≲ n γ 2 β = ⇒ γ 2 β ≳ ( n / γ ) − 2 β 2 β + 1 . Thus by definition O p h 2 β = O p ( n / γ ) − 2 β 2 β + 1 . Moreover , the same condition also implies n γ 2 β ≲ K 2 β + 1 = ⇒ ( n / γ ) − 2 β 2 β + 1 ≲ K / n , 1 ≲ n γ 2 β = ⇒ n − 1 ≲ ( n / γ ) − 2 β 2 β + 1 . Thus by definition O p ( γ / n h ) = O p ( n / γ ) − 2 β 2 β + 1 . 
Therefore, in this case
\[
\big\| \hat\theta_{\mathrm{GDVCM}}(u_0) - \theta(u_0) \big\|_2^2 = O_p\big( (n/\gamma)^{-\frac{2\beta}{2\beta+1}} \big).
\]
Collecting Case 1 to Case 3. Collecting the results from Case 1 to Case 3, we have established that
\[
\big\| \hat\theta_{\mathrm{GDVCM}}(u_0) - \theta(u_0) \big\|_2^2 =
\begin{cases}
O_p\big( (K/\gamma)^{-2\beta} \big) & \text{if } n\gamma^{2\beta} \gg K^{2\beta+1} \gtrsim 1, \\
O_p\big( (n/\gamma)^{-\frac{2\beta}{2\beta+1}} \big) & \text{if } 1 \lesssim n\gamma^{2\beta} \lesssim K^{2\beta+1}, \\
O_p\big( n^{-1} \big) & \text{if } n\gamma^{2\beta} \ll 1 \lesssim K^{2\beta+1}.
\end{cases}
\]
Now we can apply the same argument as in the last part of Appendix B.1 ("Justification of maximal upper bounds") to conclude that
\[
\big\| \hat\theta_{\mathrm{GDVCM}}(u_0) - \theta(u_0) \big\|_2^2 = O_p\Big( \max\big\{ (K/\gamma)^{-2\beta},\, (n/\gamma)^{-\frac{2\beta}{2\beta+1}},\, n^{-1} \big\} \Big).
\]
Now we have established the rate of $\hat\theta_{\mathrm{GDVCM}}(u_0)$ and we will move forward to $\hat\theta_{\mathrm{TL}}(u_0)$. It has been shown in Lemma A.10 that
\[
\big\| \hat\theta_{\mathrm{TL}}(u_0) - \theta(u_0) \big\|_2^2 = O_p\big( r_{\mathrm{GLR}}^2 \wedge r_{\mathrm{GDVCM}}^2 \big).
\]
The rate $r_{\mathrm{GLR}}^2 = n_0^{-1}$, so we conclude that
\[
\big\| \hat\theta_{\mathrm{TL}}(u_0) - \theta(u_0) \big\|_2^2 = O_p\Big( n_0^{-1} \wedge \max\big\{ (K/\gamma)^{-2\beta},\, (n/\gamma)^{-\frac{2\beta}{2\beta+1}},\, n^{-1} \big\} \Big).
\]

B.7 Proof of Theorem 3.14

Proof B.7 Let $\tau_{\mathrm{TL}} := \hat\theta_{\mathrm{TL}}(u_0) - \theta(u_0)$, $\tau_{\mathrm{GDVCM}} := \hat\theta_{\mathrm{GDVCM}}(u_0) - \theta(u_0)$, and $\tau_{\mathrm{GLR}} := \hat\theta_{\mathrm{GLR}}(u_0) - \theta(u_0)$. Recall the expansion in Equation (B.16):
\[
\tau_{\mathrm{TL}} = \big( \Psi(u_0) + \hat Q \big)^{-1} \big\{ \hat Q \tau_{\mathrm{GDVCM}} + \Psi(u_0) \tau_{\mathrm{GLR}} \big\} \{ 1 + o_p(1) \},
\tag{B.18}
\]
and set $B_Q := \Psi(u_0) + \hat Q$. It is shown in Proposition A.9 that the asymptotic bias and variance for DVCM are of orders $O(h^\beta)$ and $O(\gamma/(nh))$. Therefore the condition $n h^{2\beta+1}/\gamma \to 0$ leads to the asymptotic unbiasedness of DVCM:
\[
\sqrt{\frac{nh}{\gamma}} \big( \hat\theta_{\mathrm{GDVCM}}(u_0) - \theta(u_0) \big) \xrightarrow{d} \mathcal{N}\big( 0, \Omega_{\mathrm{GDVCM}}(u_0) \big).
\]
Therefore, by Proposition A.9, and with the definitions of the variances for GLR and GDVCM,
\[
V_{\mathrm{GLR}}(u_0) := \frac{1}{n_0} \Omega_{\mathrm{GLR}}(u_0), \qquad V_{\mathrm{GDVCM}}(u_0) := \frac{\gamma}{nh} \Omega_{\mathrm{GDVCM}}(u_0),
\]
it follows that
\[
V_{\mathrm{GLR}}^{-1/2}(u_0)\, \tau_{\mathrm{GLR}} \xrightarrow{d} \mathcal{N}(0, I), \qquad V_{\mathrm{GDVCM}}^{-1/2}(u_0)\, \tau_{\mathrm{GDVCM}} \xrightarrow{d} \mathcal{N}(0, I).
\]
Under sample splitting, the two terms $\tau_{\mathrm{GLR}}$, $\tau_{\mathrm{GDVCM}}$ are independent of each other.
Since $\Psi(u_0)$ is positive definite and $\hat Q$ is positive semidefinite, $B_Q$ is invertible w.p. $\to 1$. By the continuous mapping theorem,
\[
\Sigma_1^{-1/2} \big[ B_Q^{-1} \Psi(u_0) \tau_{\mathrm{GLR}} \big] \xrightarrow{d} \mathcal{N}(0, I), \qquad \Sigma_1 = B_Q^{-1} \Psi(u_0) V_{\mathrm{GLR}}(u_0) \Psi(u_0) B_Q^{-1},
\]
and
\[
\Sigma_2^{-1/2} \big[ B_Q^{-1} \hat Q \tau_{\mathrm{GDVCM}} \big] \xrightarrow{d} \mathcal{N}(0, I), \qquad \Sigma_2 = B_Q^{-1} \hat Q V_{\mathrm{GDVCM}}(u_0) \hat Q B_Q^{-1}.
\]
Independence implies their covariances are additive. Using (B.18) and by Slutsky's theorem,
\[
\Sigma_{\mathrm{TL}}^{-1/2} \tau_{\mathrm{TL}} \xrightarrow{d} \mathcal{N}(0, I), \quad \text{where} \quad
\Sigma_{\mathrm{TL}} = \Sigma_1 + \Sigma_2 = B_Q^{-1} \Psi(u_0) V_{\mathrm{GLR}}(u_0) \Psi(u_0) B_Q^{-1} + B_Q^{-1} \hat Q V_{\mathrm{GDVCM}}(u_0) \hat Q B_Q^{-1}.
\]
For feasible inference, let $\hat\Psi(u_0)$, $\hat V_{\mathrm{GLR}}(u_0)$, and $\hat V_{\mathrm{GDVCM}}(u_0)$ be consistent estimators. Define $\hat\Sigma_{\mathrm{TL}}$ by replacing the population quantities in $\Sigma_{\mathrm{TL}}$ with their estimators. Then $\hat\Sigma_{\mathrm{TL}} = \Sigma_{\mathrm{TL}} \{ 1 + o_p(1) \}$, and another application of Slutsky's theorem yields
\[
\hat\Sigma_{\mathrm{TL}}^{-1/2} \tau_{\mathrm{TL}} \xrightarrow{d} \mathcal{N}(0, I).
\]

B.8 Proof of Corollary 3.11

Proof B.8 This proof is the linear DVCM specialization of the generalized DVCM derivation in Appendix B.7. It follows the same steps, but with $\Psi(u) = \mathbb{E}[ X X^\top \mid U = u ]$ and the variance component $\sigma(\cdot)$ specialized to the linear model. Let $\tau_{\mathrm{TL}} := \hat\theta_{\mathrm{TL}}(u_0) - \theta(u_0)$, $\tau_{\mathrm{DVCM}} := \hat\theta_{\mathrm{DVCM}}(u_0) - \theta(u_0)$, and $\tau_{\mathrm{LR}} := \hat\theta_{\mathrm{LR}}(u_0) - \theta(u_0)$. In the linear case (quadratic loss), the same linearization as in (B.16) holds:
\[
\tau_{\mathrm{TL}} = \big( \Psi(u_0) + \hat Q \big)^{-1} \big\{ \hat Q \tau_{\mathrm{DVCM}} + \Psi(u_0) \tau_{\mathrm{LR}} \big\} \{ 1 + o_p(1) \},
\tag{B.19}
\]
and we write $B_Q := \Psi(u_0) + \hat Q$. By Proposition A.8, the DVCM bias and variance are of orders $O(h^\beta)$ and $O(\gamma/(nh))$. Hence, if $n h^{2\beta+1}/\gamma \to 0$, the bias is negligible and
\[
\sqrt{\frac{nh}{\gamma}} \big( \hat\theta_{\mathrm{DVCM}}(u_0) - \theta(u_0) \big) \xrightarrow{d} \mathcal{N}\big( 0, \Omega_{\mathrm{DVCM}}(u_0) \big).
\]
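The combined covariance $\Sigma_{\mathrm{TL}} = \Sigma_1 + \Sigma_2$ appearing in the proof of Theorem 3.14 above can be assembled directly once its ingredients are available; a minimal numerical sketch with assumed (not estimated) population quantities:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 3
# Illustrative (assumed) population quantities
M = rng.normal(size=(p, p))
Psi = M @ M.T + p * np.eye(p)      # positive definite Psi(u0)
Qhat = 0.4 * np.eye(p)             # positive semidefinite penalty matrix
V_glr = 0.02 * np.eye(p)           # variance of the target-only estimator
V_gdvcm = 0.05 * np.eye(p)         # variance of the pooled DVCM estimator

B_Q = Psi + Qhat                   # invertible since Psi > 0 and Qhat >= 0
B_inv = np.linalg.inv(B_Q)
Sigma1 = B_inv @ Psi @ V_glr @ Psi @ B_inv
Sigma2 = B_inv @ Qhat @ V_gdvcm @ Qhat @ B_inv
Sigma_TL = Sigma1 + Sigma2         # additive because the two pieces are independent

# Sigma_TL is symmetric positive definite, so Sigma_TL^{-1/2} tau_TL is well defined
print(np.linalg.eigvalsh(Sigma_TL).min() > 0)  # True
```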
Moreover, by the same proposition, and with the definitions of the variances for LR and DVCM,
\[
V_{\mathrm{LR}}(u_0) := \frac{1}{n_0} \Omega_{\mathrm{LR}}(u_0), \qquad V_{\mathrm{DVCM}}(u_0) := \frac{\gamma}{nh} \Omega_{\mathrm{DVCM}}(u_0),
\]
it follows that
\[
V_{\mathrm{LR}}^{-1/2}(u_0)\, \tau_{\mathrm{LR}} \xrightarrow{d} \mathcal{N}(0, I), \qquad V_{\mathrm{DVCM}}^{-1/2}(u_0)\, \tau_{\mathrm{DVCM}} \xrightarrow{d} \mathcal{N}(0, I),
\]
and (by sample splitting) these limits are independent. Since $\Psi(u_0) \succ 0$ and $\hat Q \succeq 0$, $B_Q$ is invertible w.p. $\to 1$. By the continuous mapping theorem,
\[
\Sigma_1^{-1/2} \big[ B_Q^{-1} \Psi(u_0) \tau_{\mathrm{LR}} \big] \xrightarrow{d} \mathcal{N}(0, I), \qquad \Sigma_1 := B_Q^{-1} \Psi(u_0) V_{\mathrm{LR}}(u_0) \Psi(u_0) B_Q^{-1},
\]
and
\[
\Sigma_2^{-1/2} \big[ B_Q^{-1} \hat Q \tau_{\mathrm{DVCM}} \big] \xrightarrow{d} \mathcal{N}(0, I), \qquad \Sigma_2 := B_Q^{-1} \hat Q V_{\mathrm{DVCM}}(u_0) \hat Q B_Q^{-1}.
\]
Independence implies additivity of covariances. From (B.19),
\[
\tau_{\mathrm{TL}} = B_Q^{-1} \Psi(u_0) \tau_{\mathrm{LR}} + B_Q^{-1} \hat Q \tau_{\mathrm{DVCM}} + o_p(\tau_{\mathrm{TL}}),
\]
so by Slutsky's theorem,
\[
\Sigma_{\mathrm{TL}}^{-1/2} \tau_{\mathrm{TL}} \xrightarrow{d} \mathcal{N}(0, I), \qquad
\Sigma_{\mathrm{TL}} = \Sigma_1 + \Sigma_2 = B_Q^{-1} \Psi V_{\mathrm{LR}} \Psi B_Q^{-1} + B_Q^{-1} \hat Q V_{\mathrm{DVCM}} \hat Q B_Q^{-1}.
\]
For feasible inference, take consistent estimators $\hat\Psi(u_0)$, $\hat V_{\mathrm{LR}}(u_0)$, $\hat V_{\mathrm{DVCM}}(u_0)$, form $\hat\Sigma_{\mathrm{TL}}$ by plug-in, and conclude $\hat\Sigma_{\mathrm{TL}}^{-1/2} \tau_{\mathrm{TL}} \xrightarrow{d} \mathcal{N}(0, I)$.

C Proof of auxiliary lemmas

C.1 Proof of part (1) of Lemma A.1

Before going into the details, let us lay down some notation. Assume that $f$ is supported on $[a, b]$ with $f \in [1/a_0, a_0]$ for some large $a_0 > 0$. We further assume without loss of generality that $(b - a) \le a_0/2$. As the density of $U$ is $f((u - u^*)/\gamma)/\gamma$, it is immediate that:
\[
F_U(u) := P(U \le u) = F\Big( \frac{u - u^*}{\gamma} \Big).
\]
Hence $U$ is supported on $[a\gamma + u^*, b\gamma + u^*]$. From the definition of nearest neighbour, we have:
\[
P\big( d_{(1)}(u_0) \ge \gamma t / K \big) = P\big( |U - u_0| \ge \gamma t / K \big)^K = \big( 1 - P( |U - u_0| \le \gamma t / K ) \big)^K
= \bigg( 1 - \Big[ F\Big( \frac{u_0 - u^*}{\gamma} + \frac{t}{K} \Big) - F\Big( \frac{u_0 - u^*}{\gamma} - \frac{t}{K} \Big) \Big] \bigg)^K .
\]
Observe that this probability is nonzero only for a restricted range of $t$: in fact, $t$ must satisfy the upper bound
\[
\gamma t / K \le \max\{ b\gamma + u^* - u_0,\, u_0 - u^* - a\gamma \} := \gamma t_0 \implies t \le K t_0 .
\]
Now let us concentrate on the upper bound.
As $f \ge 1/a_0$, we have:
\[
F\Big( \frac{u_0 - u^*}{\gamma} + \frac{t}{K} \Big) - F\Big( \frac{u_0 - u^*}{\gamma} - \frac{t}{K} \Big) \ge \frac{2t}{a_0 K},
\]
and consequently:
\[
1 - \Big[ F\Big( \frac{u_0 - u^*}{\gamma} + \frac{t}{K} \Big) - F\Big( \frac{u_0 - u^*}{\gamma} - \frac{t}{K} \Big) \Big] \le 1 - \frac{2t}{a_0 K}.
\]
This bound is meaningful only if $t \le (a_0 K)/2$. We now prove the lower bound on the probability, where we use the upper bound on $f$, i.e., $f \le a_0$. This implies:
\[
1 - \Big[ F\Big( \frac{u_0 - u^*}{\gamma} + \frac{t}{K} \Big) - F\Big( \frac{u_0 - u^*}{\gamma} - \frac{t}{K} \Big) \Big] \ge 1 - \frac{2 a_0 t}{K}.
\]
As before, this bound is meaningful only when $t \le K/(2a_0)$. Therefore, it is concluded that
\[
\Big( 1 - \frac{2 a_0 t}{K} \Big)^K \mathbf{1}\big( t \le K/(2a_0) \big)
\le P\big( d_{(1)}(u_0) \ge \gamma t / K \big)
\le \Big( 1 - \frac{2t}{a_0 K} \Big)^K \mathbf{1}\big( t \le (a_0 K)/2 \big).
\tag{C.1}
\]
Now consider the expectation:
\[
\mathbb{E}\Big[ (K/\gamma)^\beta d_{(1)}(u_0)^\beta \Big]
= \int_0^\infty P\Big( (K/\gamma)^\beta d_{(1)}(u_0)^\beta \ge t \Big)\, dt
= \int_0^\infty P\Big( d_{(1)}(u_0) \ge \frac{\gamma}{K} t^{\frac{1}{\beta}} \Big)\, dt .
\]
By the upper bound in Equation (C.1),
\[
P\Big( d_{(1)}(u_0) \ge \frac{\gamma}{K} t^{\frac{1}{\beta}} \Big) \le \Big( 1 - \frac{2 t^{1/\beta}}{a_0 K} \Big)^K \mathbf{1}\big( t^{1/\beta} \le (a_0 K)/2 \big),
\]
we obtain:
\[
\mathbb{E}\Big[ (K/\gamma)^\beta d_{(1)}(u_0)^\beta \Big]
\le \int_0^\infty \Big( 1 - \frac{2 t^{1/\beta}}{a_0 K} \Big)^K dt
\le \int_0^\infty \exp\Big( - \frac{2 t^{1/\beta}}{a_0} \Big) dt = C_1 .
\]
We now prove the lower bound on the expectation, where we use the lower bound in Equation (C.1),
\[
P\Big( d_{(1)}(u_0) \ge \frac{\gamma}{K} t^{\frac{1}{\beta}} \Big) \ge \Big( 1 - \frac{2 a_0 t^{1/\beta}}{K} \Big)^K \mathbf{1}\big( t^{1/\beta} \le K/(2a_0) \big).
\]
Define $t_1 = \min\{ t_0, 1/(2a_0) \}$. We now have:
\[
\begin{aligned}
\mathbb{E}\Big[ (K/\gamma)^\beta d_{(1)}(u_0)^\beta \Big]
&= \int_0^{(K t_0)^\beta} P\Big( d_{(1)}(u_0) \ge \frac{\gamma}{K} t^{\frac{1}{\beta}} \Big)\, dt
\ge \int_0^{(K t_1)^\beta} P\Big( d_{(1)}(u_0) \ge \frac{\gamma}{K} t^{\frac{1}{\beta}} \Big)\, dt \\
&\ge \int_0^{(K t_1)^\beta} \Big( 1 - \frac{2 a_0 t^{1/\beta}}{K} \Big)^K dt
\ge \int_0^{(K t_1)^\beta} e^{- 2 a_0 t^{1/\beta}} \Big( 1 - \frac{(2 a_0 t^{1/\beta})^2}{K} \Big)\, dt
\quad [\because (1 - x/n)^n \ge e^{-x}( 1 - x^2/n )] \\
&\ge \int_0^{K^{\beta/2} t_1^\beta / C^\beta} e^{- 2 a_0 t^{1/\beta}} \Big( 1 - \frac{(2 a_0 t^{1/\beta})^2}{K} \Big)\, dt
\quad [\because K \ge 1,\ C > 1 \text{ to be chosen later}] \\
&\ge \Big( 1 - \frac{(2 a_0 t_1)^2}{C^2} \Big) \int_0^{K^{\beta/2} t_1^\beta / C^\beta} e^{- 2 a_0 t^{1/\beta}}\, dt
\ge \Big( 1 - \frac{(2 a_0 t_1)^2}{C^2} \Big) \int_0^{t_1^\beta / C^\beta} e^{- 2 a_0 t^{1/\beta}}\, dt
\quad [\because K \ge 1] \\
&\ge C_2 .
\end{aligned}
\]
Now it is immediate that any $C > \max\{ 1, 2 a_0 t_0 \}$ is a valid choice.

C.2 Proof of part (2) of Lemma A.1

We use the same notation as before.
Recall from Appendix C.1 that $f$ is supported on $[a, b]$ with $f \in [1/a_0, a_0]$, that $(b - a) \le a_0/2$, that $F_U(u) = F\big( (u - u^*)/\gamma \big)$, and that $U$ is supported on $[a\gamma + u^*, b\gamma + u^*]$. Recall that $d_{(K)}(u_0) = \max_{k \in [K]} |U_k - u_0|$; it follows that
\[
P\big( d_{(K)}(u_0) \ge \gamma t / K \big) = 1 - P\big( |U - u_0| \le \gamma t / K \big)^K
= 1 - \Big[ F\Big( \frac{u_0 - u^*}{\gamma} + \frac{t}{K} \Big) - F\Big( \frac{u_0 - u^*}{\gamma} - \frac{t}{K} \Big) \Big]^K .
\]
As before, this probability is nonzero only for $t \le K t_0$, where $\gamma t_0 := \max\{ b\gamma + u^* - u_0,\, u_0 - u^* - a\gamma \}$. The bounds $1/a_0 \le f \le a_0$ again give
\[
\frac{2t}{a_0 K} \le F\Big( \frac{u_0 - u^*}{\gamma} + \frac{t}{K} \Big) - F\Big( \frac{u_0 - u^*}{\gamma} - \frac{t}{K} \Big) \le \frac{2 a_0 t}{K},
\]
where the lower bound is meaningful only if $t \le (a_0 K)/2$ and the upper bound only if $t \le K/(2a_0)$. Therefore, it is concluded that
\[
\bigg[ 1 - \Big( \frac{2 a_0 t}{K} \Big)^K \bigg] \mathbf{1}\big( t \le K/(2a_0) \big)
\le P\big( d_{(K)}(u_0) \ge \gamma t / K \big)
\le \bigg[ 1 - \Big( \frac{2t}{a_0 K} \Big)^K \bigg] \mathbf{1}\big( t \le (a_0 K)/2 \big).
\tag{C.2}
\]
Now consider the expectation $\mathbb{E}\big[ d_{(K)}(u_0)^\beta \big]$. Since $U$ is supported on $[a\gamma + u^*, b\gamma + u^*]$, whose Lebesgue measure equals $(b - a)\gamma$, it is immediate that
\[
\mathbb{E}\big[ d_{(K)}(u_0)^\beta \big] \le \big( (b - a)\gamma \big)^\beta .
\]
Now consider the expectation:
\[
\mathbb{E}\Big[ (K/\gamma)^\beta d_{(K)}(u_0)^\beta \Big]
= \int_0^\infty P\Big( (K/\gamma)^\beta d_{(K)}(u_0)^\beta \ge t \Big)\, dt
= \int_0^\infty P\Big( d_{(K)}(u_0) \ge \frac{\gamma}{K} t^{\frac{1}{\beta}} \Big)\, dt .
\]
By the lower bound in Equation (C.2),
\[
P\Big( d_{(K)}(u_0) \ge \frac{\gamma}{K} t^{\frac{1}{\beta}} \Big) \ge \bigg[ 1 - \Big( \frac{2 a_0 t^{1/\beta}}{K} \Big)^K \bigg] \mathbf{1}\big( t^{1/\beta} \le K/(2a_0) \big).
\]
Define $t_1 = \min\{ t_0, 1/(4a_0) \}$.
We now have:
\[
\begin{aligned}
\mathbb{E}\Big[ (K/\gamma)^\beta d_{(K)}(u_0)^\beta \Big]
&= \int_0^{(K t_0)^\beta} P\Big( d_{(K)}(u_0) \ge \frac{\gamma}{K} t^{\frac{1}{\beta}} \Big)\, dt
\ge \int_0^{(K t_1)^\beta} P\Big( d_{(K)}(u_0) \ge \frac{\gamma}{K} t^{\frac{1}{\beta}} \Big)\, dt \\
&\ge \int_0^{(K t_1)^\beta} \bigg[ 1 - \Big( \frac{2 a_0 t^{1/\beta}}{K} \Big)^K \bigg]\, dt
= \int_0^{K t_1} \bigg[ 1 - \Big( \frac{2 a_0 s}{K} \Big)^K \bigg] \beta s^{\beta - 1}\, ds
\quad [\because \text{substitution } s = t^{1/\beta}] \\
&\ge \bigg[ 1 - \Big( \frac{2 a_0 s}{K} \Big)^K \bigg]_{s = K/(4 a_0)} \int_0^{K t_1} \beta s^{\beta - 1}\, ds
\ge \frac{1}{2} \int_0^{K t_1} \beta s^{\beta - 1}\, ds = C_2 K^\beta .
\end{aligned}
\]
This means that $\mathbb{E}\big[ d_{(K)}(u_0)^\beta \big] \ge C_2 \gamma^\beta$ for some constant $C_2$.

C.3 Proof of Lemma A.2

Let $\times$ denote the Cartesian product of two sets. Recall the definition of $Z_{ki}$:
\[
Z_{ki} = \Phi_l\Big( \frac{U_k - u_0}{h} \Big) \otimes X_{ki}
= \bigg( X_{kij}\, \frac{1}{m!} \Big( \frac{U_k - u_0}{h} \Big)^m \bigg)_{(j, m) \in \{1, \ldots, p\} \times \{0, 1, \ldots, l\}} \in \mathbb{R}^{p(l+1)} .
\]
Therefore,
\[
\begin{aligned}
\| Z_{ki} \|_2^2\, \mathbf{1}\{ |U_k - u_0| \le h \}
&= \sum_{j=1}^p \sum_{m=0}^l X_{kij}^2 \bigg( \frac{1}{m!} \Big( \frac{U_k - u_0}{h} \Big)^m \bigg)^2 \mathbf{1}\{ |U_k - u_0| \le h \} \\
&\le \sum_{m=0}^l \bigg( \frac{1}{m!} \Big( \frac{U_k - u_0}{h} \Big)^m \bigg)^2 \mathbf{1}\{ |U_k - u_0| \le h \} \quad [\because \| X_{ki} \|_2 \le 1] \\
&\le \sum_{m=0}^l \Big( \frac{1}{m!} \Big)^2 \le \sum_{m=0}^l \frac{1}{m!} \le e .
\end{aligned}
\]
Hence,
\[
\| Z_{ki} \|_2^2\, \mathbf{1}\{ |U_k - u_0| \le h \} \le e \implies \| Z_{ki} \|_2\, \mathbf{1}\{ |U_k - u_0| \le h \} \le \sqrt{e} < 2 .
\]

C.4 Proof of Lemma A.3

By Assumption 3.3, $n_k \ge b_0' \bar n$ for all $k = 1, \ldots, K$. Hence
\[
S_h = \frac{1}{2} \sum_{k=0}^K n_k \mathbf{1}\{ |U_k - u_0| \le h \} \ge \frac{b_0' \bar n}{2} \sum_{k=1}^K \mathbf{1}\{ |U_k - u_0| \le h \} .
\]
Let $E := \{ d_{(1)}(u_0) \le h \}$ and pick an index $k_0$ such that $|U_{k_0} - u_0| = d_{(1)}(u_0)$. Then on $E$,
\[
S_h \ge \frac{b_0' \bar n}{2} \Big( 1 + \sum_{k \in [K] \setminus \{k_0\}} \mathbf{1}\{ |U_k - u_0| \le h \} \Big) = \frac{b_0' \bar n}{2} ( S + 1 ),
\]
where $S := \sum_{k \in [K] \setminus \{k_0\}} \mathbf{1}\{ |U_k - u_0| \le h \}$. Therefore,
\[
\mathbb{E}\big[ S_h^{-1} \mid E \big] \le \frac{2}{b_0' \bar n}\, \mathbb{E}\Big[ \frac{1}{S + 1} \,\Big|\, E \Big] .
\]
We now upper bound $\mathbb{E}[ (S+1)^{-1} \mid E ]$, using the inequality $\mathbb{E}[ (S+1)^{-1} \mid E ] \le \mathbb{E}[ (S+1)^{-1} ] / P(E)$. Unconditionally, $S \sim \mathrm{Bin}(K - 1, p)$ with
\[
p = P( |U - u_0| \le h ) \ge \frac{a_0' h}{\gamma} \quad [\text{by Assumption 3.2 (a) and the condition } h \le |\mathcal{U}|].
\]
Since $S \sim \mathrm{Bin}(K - 1, p)$,
\[
\mathbb{E}\Big[ \frac{1}{S + 1} \Big] = \sum_{t=0}^{K-1} \frac{1}{t + 1} \binom{K - 1}{t} p^t (1 - p)^{K - 1 - t} .
\]
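This binomial sum has the closed form $\mathbb{E}[1/(S+1)] = \big(1 - (1-p)^K\big)/(pK)$, derived in the next display; a direct numerical check with illustrative values of $K$ and $p$:

```python
from math import comb

def inv_splus1_expect(K, p):
    # E[1/(S+1)] for S ~ Bin(K-1, p), summed term by term
    return sum(comb(K - 1, t) * p**t * (1 - p)**(K - 1 - t) / (t + 1)
               for t in range(K))

K, p = 20, 0.15
lhs = inv_splus1_expect(K, p)
rhs = (1 - (1 - p)**K) / (p * K)   # closed form used in the proof of Lemma A.3
print(abs(lhs - rhs) < 1e-12)      # True
```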
Use the identity
\[
\frac{1}{t + 1} \binom{K - 1}{t} = \frac{1}{K} \binom{K}{t + 1},
\]
to obtain
\[
\mathbb{E}\Big[ \frac{1}{S + 1} \Big] = \frac{1}{K} \sum_{t=0}^{K-1} \binom{K}{t + 1} p^t (1 - p)^{K - 1 - t} .
\]
Reindex with $s = t + 1$ (so $s = 1, \ldots, K$) and factor out one $p$:
\[
\mathbb{E}\Big[ \frac{1}{S + 1} \Big] = \frac{1}{p K} \sum_{s=1}^K \binom{K}{s} p^s (1 - p)^{K - s}
= \frac{1}{p K} \bigg( \sum_{s=0}^K \binom{K}{s} p^s (1 - p)^{K - s} - (1 - p)^K \bigg).
\]
Notice that the bracket equals $1 - (1 - p)^K$, hence
\[
\mathbb{E}\Big[ \frac{1}{S + 1} \Big] = \frac{1 - (1 - p)^K}{p K} .
\]
Moreover, $P(E) = 1 - (1 - p)^K$. Hence
\[
\mathbb{E}\Big[ \frac{1}{S + 1} \,\Big|\, E \Big] \le \frac{\mathbb{E}[ 1/(S + 1) ]}{P(E)} = \frac{1}{p K} \le \frac{\gamma}{a_0' K h} .
\]
Combining the bounds and using $n = (K + 1) \bar n$,
\[
\mathbb{E}\big[ S_h^{-1} \mid E \big] \le \frac{2}{b_0' \bar n} \cdot \frac{\gamma}{a_0' K h} \le \frac{C \gamma}{n h},
\]
for a constant $C > 0$ depending only on $a_0', b_0'$.

C.5 Proof of Lemma A.4

Bias. Recall $\hat\theta_{\mathrm{DVCM}}(u_0) = A_l ( Z^\top W Z )^{-1} Z^\top W y$. Conditioning on $\Gamma := \{ (U_k, X_{ki}) \}$, write
\[
y = Z \alpha(u_0) + r + \epsilon, \qquad \mathbb{E}[ \epsilon \mid \Gamma ] = 0,
\]
where $\alpha(u_0) \in \mathbb{R}^{(l+1)p}$ collects the local polynomial coefficients at $u_0$ (with $A_l \alpha(u_0) = \theta(u_0)$), and $r \in \mathbb{R}^n$ is the stacked approximation error with entries $r_{ki} = X_{ki}^\top \big\{ \theta(U_k) - \sum_{v=0}^l \theta^{(v)}(u_0) (U_k - u_0)^v / v! \big\}$. Then
\[
\mathbb{E}\big[ \hat\theta_{\mathrm{DVCM}}(u_0) - \theta(u_0) \mid \Gamma \big] = A_l ( Z^\top W Z )^{-1} Z^\top W r .
\tag{C.3}
\]
We now bound the right-hand side in operator norm. Using Assumption 3.2,
\[
\big\| A_l ( Z^\top W Z )^{-1} A_l^\top \big\|_2 \le \lambda_0^{-1},
\]
we obtain
\[
\big\| \mathbb{E}\big[ \hat\theta_{\mathrm{DVCM}}(u_0) - \theta(u_0) \mid \Gamma \big] \big\|_2
\le \big\| A_l ( Z^\top W Z )^{-1} A_l^\top \big\|_2 \, \big\| A_l Z^\top W r \big\|_2
\le \lambda_0^{-1} \big\| A_l Z^\top W r \big\|_2 .
\tag{C.4}
\]
Next, note that $A_l Z^\top$ extracts the first $p$ rows of $Z^\top$, which correspond to the part containing $X_{ki}$ only. Equivalently,
\[
A_l Z^\top W r = \sum_{k=0}^K \sum_{i \in I_k} w_k X_{ki} r_{ki}
= \sum_{k=0}^K \sum_{i \in I_k} w_k X_{ki} X_{ki}^\top \Big\{ \theta(U_k) - \sum_{v=0}^l \theta^{(v)}(u_0) (U_k - u_0)^v / v! \Big\},
\]
with $w_k := W\big( \frac{U_k - u_0}{h} \big) / S_h$. Using $\| X_{ki} \|_2 \le 1$,
\[
\big\| A_l Z^\top W r \big\|_2 \le \sum_{k=0}^K \sum_{i \in I_k} w_k \Big\| \theta(U_k) - \sum_{v=0}^l \theta^{(v)}(u_0) (U_k - u_0)^v / v! \Big\|_2 .
\]
Finally, by Assumption 3.1(a) ($\theta \in \mathcal{H}(\beta, L)$) and the standard local-polynomial remainder, on the support of $w_k$ we have $|U_k - u_0| \le h$ and hence
\[
\Big\| \theta(U_k) - \sum_{v=0}^l \theta^{(v)}(u_0) (U_k - u_0)^v / v! \Big\|_2 \le C h^\beta .
\]
Since $\sum_{k, i} w_k = 1$ (because the weights are normalized by $S_h$), it follows that $\| A_l Z^\top W r \|_2 \le C h^\beta$. Plugging into (C.4) yields
\[
\big\| \mathbb{E}\big[ \hat\theta_{\mathrm{DVCM}}(u_0) - \theta(u_0) \mid \Gamma \big] \big\|_2 \le q_1' h^\beta,
\]
for some constant $q_1' > 0$, and therefore
\[
\mathbb{E}\big[ \hat\theta_{\mathrm{DVCM}}(u_0) - \theta(u_0) \mid \Gamma \big]^\top A\, \mathbb{E}\big[ \hat\theta_{\mathrm{DVCM}}(u_0) - \theta(u_0) \mid \Gamma \big]
\le \lambda_{A,1} ( q_1' )^2 h^{2\beta} =: q_1 h^{2\beta},
\]
where $\lambda_{A,1}$ is the largest eigenvalue of $A$.

Variance. Let $\epsilon$ be the stacked noise vector with $\mathbb{E}[ \epsilon \mid \Gamma ] = 0$ and $\mathrm{Var}( \epsilon \mid \Gamma ) = \Sigma_\epsilon := \mathrm{diag}( \sigma^2(U_k) ) \preceq \sigma_{\max}^2 I$. Then
\[
\hat\theta_{\mathrm{DVCM}}(u_0) - \mathbb{E}[ \hat\theta_{\mathrm{DVCM}}(u_0) \mid \Gamma ] = A_l ( Z^\top W Z )^{-1} Z^\top W \epsilon,
\]
so
\[
\mathrm{Cov}\big( \hat\theta_{\mathrm{DVCM}}(u_0) \mid \Gamma \big)
= A_l ( Z^\top W Z )^{-1} Z^\top W \Sigma_\epsilon W Z ( Z^\top W Z )^{-1} A_l^\top
\preceq \sigma_{\max}^2 A_l ( Z^\top W Z )^{-1} Z^\top W^2 Z ( Z^\top W Z )^{-1} A_l^\top .
\]
Because $W(u) = \frac{1}{2} \mathbf{1}( |u| \le 1/2 )$ and $W$ is normalized by $S_h$, we have $W^2 = (2 S_h)^{-1} W$, hence
\[
( Z^\top W Z )^{-1} Z^\top W^2 Z ( Z^\top W Z )^{-1} = \frac{1}{2 S_h} ( Z^\top W Z )^{-1} .
\]
Therefore,
\[
\mathrm{Cov}\big( \hat\theta_{\mathrm{DVCM}}(u_0) \mid \Gamma \big) \preceq \frac{\sigma_{\max}^2}{2 S_h} A_l ( Z^\top W Z )^{-1} A_l^\top,
\]
and
\[
\begin{aligned}
\mathbb{E}\Big[ \big\| \hat\theta_{\mathrm{DVCM}}(u_0) - \mathbb{E}[ \hat\theta_{\mathrm{DVCM}}(u_0) \mid \Gamma ] \big\|_A^2 \,\Big|\, \Gamma \Big]
&= \mathrm{tr}\Big( A\, \mathrm{Cov}\big( \hat\theta_{\mathrm{DVCM}}(u_0) \mid \Gamma \big) \Big)
\le \frac{\sigma_{\max}^2}{2 S_h} \mathrm{tr}\Big( A A_l ( Z^\top W Z )^{-1} A_l^\top \Big) \\
&\le \frac{\sigma_{\max}^2 \lambda_{A,1}}{2 S_h} \mathrm{tr}\Big( A_l ( Z^\top W Z )^{-1} A_l^\top \Big)
\le \frac{\sigma_{\max}^2 \lambda_{A,1}\, p}{2 S_h} \big\| A_l ( Z^\top W Z )^{-1} A_l^\top \big\|_2
\le \frac{\sigma_{\max}^2 \lambda_{A,1}\, p}{2 \lambda_0} S_h^{-1} =: q_2 S_h^{-1} .
\end{aligned}
\]
Conclusion. Combining the conditional squared bias and conditional variance bounds,
\[
\mathbb{E}\Big[ \| \hat\theta_{\mathrm{DVCM}}(u_0) - \theta(u_0) \|_A^2 \,\Big|\, \Gamma \Big] \le q_1 h^{2\beta} + q_2 S_h^{-1} .
\]

C.6 Proof of Lemma A.5

Fix $k \in \{0\} \cup [K]$ and condition on $U_k = u_k$. Let
\[
t_k := \frac{u_k - u_0}{h}, \qquad Z_{ki} := \Phi_l(t_k) \otimes X_{ki}, \qquad \eta^*_{ki} := X_{ki}^\top \theta(u_k),
\]
and define
\[
\delta_{ki} := s_1\big( Z_{ki}^\top \bar\theta(u_0), Y_{ki} \big) Z_{ki} W(t_k), \qquad \Delta_k := n_k^{-1/2} \sum_{i \in I_k} \delta_{ki} .
\]
Throughout, $s_j(\eta, y) = \partial^j \ell(\eta, y) / \partial \eta^j$, and Assumptions 3.1 - 3.4 are assumed. We work with a natural exponential family and canonical link, so that
\[
\mathbb{E}[ s_1( \eta^*_{ki}, Y_{ki} ) \mid X_{ki}, U_k ] = 0, \qquad
\mathbb{E}[ s_1^2( \eta^*_{ki}, Y_{ki} ) \mid X_{ki}, U_k ] = \nu(u_k) b''( \eta^*_{ki} ), \qquad
s_j( \eta, Y_{ki} ) = b^{(j)}( \eta ) \text{ for } j \ge 2,
\tag{C.5}
\]
i.e. higher derivatives are independent of $Y$, with variance function $\nu(\cdot)$ bounded and continuous.

Conditional mean. By Taylor expansion of $s_1(\cdot, Y_{ki})$ in $\eta$ around $\eta^*_{ki}$, write
\[
s_1\big( Z_{ki}^\top \bar\theta(u_0), Y_{ki} \big)
= s_1( \eta^*_{ki}, Y_{ki} )
+ s_2( \eta^*_{ki}, Y_{ki} ) \big( Z_{ki}^\top \bar\theta(u_0) - \eta^*_{ki} \big)
+ \frac{1}{2} s_3( \tilde\eta_{ki}, Y_{ki} ) \big( Z_{ki}^\top \bar\theta(u_0) - \eta^*_{ki} \big)^2,
\tag{C.6}
\]
where $\eta^*_{ki} = X_{ki}^\top \theta(u_k)$ and $\tilde\eta_{ki}$ lies between $\eta^*_{ki}$ and $Z_{ki}^\top \bar\theta(u_0)$. Using the identity $X_{ki}^\top = Z_{ki}^\top A_l^\top$, the local-polynomial bias satisfies
\[
Z_{ki}^\top \bar\theta(u_0) - \eta^*_{ki}
= X_{ki}^\top \frac{\theta^{(l)}(u_0) - \theta^{(l)}(\tilde u_k)}{l!} ( u_k - u_0 )^l
= Z_{ki}^\top A_l^\top \frac{\theta^{(l)}(u_0) - \theta^{(l)}(\tilde u_k)}{l!} ( t_k h )^l,
\tag{C.7}
\]
uniformly for $|t_k|$ in the support of $W$ (hence $|u_0 - u_k| \le h$), and $\tilde u_k$ is between $u_k$ and $u_0$. Moreover, when $|u_0 - u_k| \le h$, by Hölder continuity ($\theta \in \mathcal{H}(\beta, L)$) and boundedness of $\| X_{ki} \|_2$, its absolute value is such that
\[
\big| Z_{ki}^\top \bar\theta(u_0) - \eta^*_{ki} \big|
\le \| X_{ki} \|_2 \frac{ | \theta^{(l)}(u_0) - \theta^{(l)}(\tilde u_k) | }{l!} | u_k - u_0 |^l = O( h^\beta ).
\tag{C.8}
\]
Taking $\mathbb{E}[ \cdot \mid X_{ki}, U_k = u_k ]$ in (C.6) and using the identity $s_2( \eta^*_{ki}, Y_{ki} ) = b''( \eta^*_{ki} )$ gives
\[
\mathbb{E}\Big[ s_1\big( Z_{ki}^\top \bar\theta(u_0), Y_{ki} \big) \,\Big|\, X_{ki}, U_k = u_k \Big]
= b''( \eta^*_{ki} ) \big( Z_{ki}^\top \bar\theta(u_0) - \eta^*_{ki} \big) + O\big( | Z_{ki}^\top \bar\theta(u_0) - \eta^*_{ki} |^2 \big).
\tag{C.9}
\]
Multiplying (C.9) by $Z_{ki} W(t_k)$ and taking $\mathbb{E}[ \cdot \mid U_k = u_k ]$, we use $\mathbb{E}[ b''( \eta^*_{ki} ) X_{ki} X_{ki}^\top \mid U_k = u_k ] = \Psi(u_k)$ and (C.7) to obtain
\[
\begin{aligned}
\mathbb{E}[ \delta_{ki} \mid U_k = u_k ]
&= \mathbb{E}\Big[ b''( \eta^*_{ki} ) Z_{ki} \big( Z_{ki}^\top \bar\theta(u_0) - \eta^*_{ki} \big) W(t_k) \,\Big|\, U_k = u_k \Big] + R_k \\
&= \mathbb{E}\bigg[ b''( \eta^*_{ki} ) Z_{ki} Z_{ki}^\top A_l^\top \frac{\theta^{(l)}(u_0) - \theta^{(l)}(\tilde u_k)}{l!} ( t_k h )^l W(t_k) \,\bigg|\, U_k = u_k \bigg] + R_k \\
&= \big( \Phi_l(t_k)^{\otimes 2} \otimes \Psi(u_k) \big) A_l^\top \frac{\theta^{(l)}(u_0) - \theta^{(l)}(\tilde u_k)}{l!} ( t_k h )^l W(t_k) + R_k,
\end{aligned}
\]
where the remainder satisfies
\[
R_k = \mathbb{E}\Big[ O\big( | Z_{ki}^\top \bar\theta(u_0) - \eta^*_{ki} |^2 \big) \| Z_{ki} \| \,\Big|\, U_k = u_k \Big] W(t_k) = O( h^{2\beta} ) = o( h^\beta ),
\]
by boundedness of $\| X_{ki} \|_2$, the result in (C.8), and the bounded-moment assumption on $b^{(3)}$ (so that $\mathbb{E}[ | s_3( \tilde\eta_{ki}, Y_{ki} ) |^4 \mid U_k ]$ is $O(1)$). Therefore, it follows that
\[
n_k^{-1/2}\, \mathbb{E}[ \Delta_k \mid U_k = u_k ]
= \big( \Phi_l(t_k)^{\otimes 2} \otimes \Psi(u_k) \big) A_l^\top \frac{\theta^{(l)}(u_0) - \theta^{(l)}(\tilde u_k)}{l!} ( t_k h )^l W(t_k) \{ 1 + o(1) \}.
\]
Moreover, by the bounded spectral norms of $\Phi_l(t_k)^{\otimes 2}$, $\Psi(u_k)$, $A_l$, and the same argument as in (C.8), its norm is such that
\[
\big\| n_k^{-1/2}\, \mathbb{E}[ \Delta_k \mid U_k = u_k ] \big\|
\le \Big\| \big( \Phi_l(t_k)^{\otimes 2} \otimes \Psi(u_k) \big) A_l^\top \frac{\theta^{(l)}(u_0) - \theta^{(l)}(\tilde u_k)}{l!} ( t_k h )^l W(t_k) \Big\|_2 \{ 1 + o(1) \} = O( h^\beta ).
\]
Conditional variance. By the same expansion as in Equation (C.6), squaring both sides, we obtain
\[
s_1^2\big( Z_{ki}^\top \bar\theta(u_0), Y_{ki} \big) = s_1^2( \eta^*_{ki}, Y_{ki} ) + O( h^\beta ).
\]
The second identity in (C.5) further implies
\[
\mathbb{E}\Big[ s_1^2\big( Z_{ki}^\top \bar\theta(u_0), Y_{ki} \big) \,\Big|\, X_{ki}, U_k = u_k \Big] = \nu(u_k) b''( \eta^*_{ki} ) + O( h^\beta ),
\]
where the $O(h^\beta)$ term collects all the higher-order terms. Hence
\[
\mathbb{E}\big[ \delta_{ki} \delta_{ki}^\top \mid U_k = u_k \big]
= \mathbb{E}\Big[ s_1^2\big( Z_{ki}^\top \bar\theta(u_0), Y_{ki} \big) Z_{ki} Z_{ki}^\top \,\Big|\, U_k = u_k \Big] W(t_k)^2
= \nu(u_k) \big( \Phi_l(t_k)^{\otimes 2} \otimes \Psi(u_k) \big) W(t_k)^2 + O( h^\beta ).
\]
Since $\mathbb{E}[ \delta_{ki} \mid U_k = u_k ] = O( h^\beta )$, the subtraction of $\mathbb{E}[ \delta_{ki} \mid U_k = u_k ]\, \mathbb{E}[ \delta_{ki} \mid U_k = u_k ]^\top$ affects the variance only at order $O( h^{2\beta} )$, so
\[
\mathrm{Var}[ \delta_{ki} \mid U_k = u_k ] = \nu(u_k) \big( \Phi_l(t_k)^{\otimes 2} \otimes \Psi(u_k) \big) W(t_k)^2 + O( h^\beta ).
\]
Because $\{ \delta_{ki} \}_{i \in I_k}$ are conditionally i.i.d.,
\[
\mathrm{Var}[ \Delta_k \mid U_k = u_k ] = \mathrm{Var}[ \delta_{ki} \mid U_k = u_k ] = \nu(u_k) \big( \Phi_l(t_k)^{\otimes 2} \otimes \Psi(u_k) \big) W(t_k)^2 + O( h^\beta ).
\]
The matrix term $\Lambda_k$. Recall
\[
\Lambda_k = n_k^{-1} \sum_{i \in I_k} s_2\big( Z_{ki}^\top \bar\theta(u_0), Y_{ki} \big) Z_{ki} Z_{ki}^\top W(t_k).
\]
By a first-order Taylor expansion of $s_2(\cdot, Y_{ki})$ in $\eta$ around $\eta^*_{ki}$,
\[
s_2\big( Z_{ki}^\top \bar\theta(u_0), Y_{ki} \big) = s_2( \eta^*_{ki}, Y_{ki} ) + s_3( \tilde\eta_{ki}, Y_{ki} ) \big( Z_{ki}^\top \bar\theta(u_0) - \eta^*_{ki} \big),
\]
where $\tilde\eta_{ki}$ lies between $\eta^*_{ki}$ and $Z_{ki}^\top \bar\theta(u_0)$. Taking $\mathbb{E}[ \cdot \mid X_{ki}, U_k = u_k ]$ and using the third identity in (C.5) gives
\[
\mathbb{E}\big[ s_2\big( Z_{ki}^\top \bar\theta(u_0), Y_{ki} \big) \mid X_{ki}, U_k = u_k \big]
= b''( \eta^*_{ki} ) + \mathbb{E}\big[ s_3( \tilde\eta_{ki}, Y_{ki} ) \mid X_{ki}, U_k = u_k \big] \big( Z_{ki}^\top \bar\theta(u_0) - \eta^*_{ki} \big).
\]
By the same local polynomial approximation in (C.7), $Z_{ki}^\top \bar\theta(u_0) - \eta^*_{ki} = O( h^\beta )$ uniformly for $|t_k|$ in the support of $W$. Now we marginalize $X_{ki}$. By the moment condition in Assumption 2'',
\[
\mathbb{E}\big[ | s_3( \tilde\eta_{ki}, Y_{ki} ) | \mid U_k = u_k \big] = \mathbb{E}\big[ | b^{(3)}( \tilde\eta_{ki} ) | \mid U_k = u_k \big] = O(1).
\]
Hence
\[
\mathbb{E}[ \Lambda_k \mid U_k = u_k ]
= \mathbb{E}\Big[ \big( b''( \eta^*_{ki} ) + O( h^\beta ) \big) Z_{ki} Z_{ki}^\top \,\Big|\, U_k = u_k \Big] W(t_k)
= \big( \Phi_l(t_k)^{\otimes 2} \otimes \mathbb{E}[ b''( \eta^*_{ki} ) X_{ki} X_{ki}^\top \mid U_k = u_k ] \big) W(t_k) + O( h^\beta )
= \big( \Phi_l(t_k)^{\otimes 2} \otimes \Psi(u_k) \big) W(t_k) + O( h^\beta ).
\]

C.7 Proof of Proposition A.9

Proof C.1 We first note that the condition $\gamma/K \ll h \lesssim ( \gamma/n )^{\frac{1}{2\beta+1}}$ implies a relationship we will repeatedly apply:
\[
\frac{h^{2\beta - 1}}{K} \ll \frac{h^{2\beta}}{\gamma} \lesssim \frac{1}{nh}.
\tag{C.10}
\]
Let $s_j(\eta, y) = \partial^j \ell(\eta, y) / \partial \eta^j$ denote derivatives w.r.t. the first argument of $\ell$. Recall
\[
\hat\theta_{\mathrm{GDVCM}}(u_0) = A_l \hat\alpha_{\mathrm{GDVCM}}, \qquad
\hat\alpha_{\mathrm{GDVCM}} = \arg\min_{\alpha \in \mathbb{R}^{(l+1)p}} \sum_{k=0}^K \sum_{i \in I_k} \ell\big( Z_{ki}^\top \alpha, Y_{ki} \big) W\Big( \frac{U_k - u_0}{h} \Big),
\]
with $A_l = [ I_p, 0, \ldots, 0 ]$, $Z_{ki} = \Phi_l(t_k) \otimes X_{ki}$ and $t_k = ( u_k - u_0 )/h$. Write
\[
\bar\theta(u_0)^\top = \big( \theta(u_0)^\top,\, h \theta'(u_0)^\top,\, \ldots,\, h^l \theta^{(l)}(u_0)^\top \big), \qquad r_n := \Big( \frac{nh}{\gamma} \Big)^{-1/2}.
\]
We also use shorthand notation for the density of $U_k$: $f_\gamma(u) := \frac{1}{\gamma} f\big( \frac{u - u^*}{\gamma} \big)$. Due to Assumption 3.3, we assume w.l.o.g. that $p_k := \frac{n_k}{n} \asymp \frac{1}{K}$. The norm $\| \cdot \|$ used in the proof denotes the $\ell_2$-norm when applied to a vector, and the spectral norm when applied to a matrix.

Taylor Expansion.
Define
\[
Q_n(\alpha) := \sum_{k=0}^K \sum_{i \in I_k} \ell\big( Z_{ki}^\top \alpha, Y_{ki} \big) W(t_k), \qquad
D_n(\delta) := Q_n\big( \bar\theta(u_0) + r_n \delta \big) - Q_n\big( \bar\theta(u_0) \big),
\]
where $r_n = ( nh/\gamma )^{-1/2}$. Thus the minimizer $\hat\delta$ of $D_n(\delta)$ satisfies $\hat\delta = r_n^{-1}( \hat\alpha_{\mathrm{GDVCM}} - \bar\theta(u_0) )$. A second-order Taylor expansion of each summand at $\eta^*_{ki} := Z_{ki}^\top \bar\theta(u_0)$ yields
\[
\ell( \eta^*_{ki} + r_n Z_{ki}^\top \delta, Y_{ki} )
= \ell( \eta^*_{ki}, Y_{ki} ) + r_n s_1( \eta^*_{ki}, Y_{ki} ) Z_{ki}^\top \delta
+ \frac{1}{2} r_n^2 s_2( \eta^*_{ki}, Y_{ki} ) ( Z_{ki}^\top \delta )^2
+ \frac{1}{6} r_n^3 s_3( \tilde\eta_{ki}, Y_{ki} ) ( Z_{ki}^\top \delta )^3,
\tag{C.11}
\]
with $\tilde\eta_{ki}$ between $\eta^*_{ki}$ and $\eta^*_{ki} + r_n Z_{ki}^\top \delta$. Define the following quantities
\[
\bar\Delta := - \frac{1}{nh} \sum_{k=0}^K n_k^{1/2} \Delta_k, \qquad
\bar\Lambda := \frac{1}{nh} \sum_{k=0}^K n_k \Lambda_k, \qquad
R_n(\delta) = \frac{1}{6} r_n^3 \sum_{k=0}^K \sum_{i \in I_k} s_3( \tilde\eta_{ki}, Y_{ki} ) ( Z_{ki}^\top \delta )^3 W(t_k),
\]
with (as in Lemma A.5)
\[
\Delta_k := n_k^{-1/2} \sum_{i \in I_k} s_1\big( Z_{ki}^\top \bar\theta(u_0), Y_{ki} \big) Z_{ki} W(t_k), \qquad
\Lambda_k := n_k^{-1} \sum_{i \in I_k} s_2\big( Z_{ki}^\top \bar\theta(u_0), Y_{ki} \big) Z_{ki} Z_{ki}^\top W(t_k).
\]
Summing over $(k, i)$ in Equation (C.11) and collecting terms gives
\[
D_n(\delta) = - (nh) r_n\, \delta^\top \bar\Delta + \frac{1}{2} (nh) r_n^2\, \delta^\top \bar\Lambda \delta + R_n(\delta).
\tag{C.12}
\]
Marginalizing $U_k$. From Lemma A.5, for each $k$ and small enough $h$,
\[
n_k^{-1/2}\, \mathbb{E}[ \Delta_k \mid U_k = u_k ]
= \big( \Phi_l(t_k)^{\otimes 2} \otimes \Psi(u_k) \big) A_l^\top \frac{\theta^{(l)}(u_0) - \theta^{(l)}(\tilde u_k)}{l!} ( t_k h )^l W(t_k) \{ 1 + o(1) \} \lesssim h^\beta,
\tag{C.13}
\]
\[
\mathrm{Var}[ \Delta_k \mid U_k = u_k ] = \nu(u_k) \big( \Phi_l(t_k)^{\otimes 2} \otimes \Psi(u_k) \big) W(t_k)^2 + O( h^\beta ),
\tag{C.14}
\]
\[
\mathbb{E}[ \Lambda_k \mid U_k = u_k ] = \big( \Phi_l(t_k)^{\otimes 2} \otimes \Psi(u_k) \big) W(t_k) + O( h^\beta ).
\tag{C.15}
\]
1) Mean of $\bar\Delta$. Since $\bar\Delta = - (nh)^{-1} \sum_{k=0}^K n_k^{1/2} \Delta_k$,
\[
\mathbb{E}\,\bar\Delta = - \sum_{k=0}^K p_k \Big\{ h^{-1} n_k^{-1/2}\, \mathbb{E}\,\Delta_k \Big\}.
\]
For $k \in [K]$, integrating (C.13) w.r.t. the density $f_\gamma(u) = \gamma^{-1} f\big( \frac{u - u^*}{\gamma} \big)$ of $U_k$ and writing $t = ( u - u_0 )/h$,
\[
\begin{aligned}
h^{-1} n_k^{-1/2}\, \mathbb{E}\,\Delta_k
&= h^{-1} \int \big( \Phi_l(t)^{\otimes 2} \otimes \Psi(u) \big) A_l^\top \frac{\theta^{(l)}(u_0) - \theta^{(l)}(\tilde u_k)}{l!} ( t h )^l W(t)\, f_\gamma(u)\, du\, ( 1 + o(1) ) \\
&= \int \Phi_l(t)^{\otimes 2} t^l W(t) \otimes \Psi(u_0)\, A_l^\top \frac{\theta^{(l)}(u_0) - \theta^{(l)}(\tilde u_k)}{l!}\, h^l\, \gamma^{-1} f\Big( \frac{u_0 - u^* + h t}{\gamma} \Big)\, dt\, ( 1 + o(1) ).
\end{aligned}
\]
Notice that due to Hölder continuity and the fact that $| \tilde u_k - u_0 | \le h$, it holds that $\| \theta^{(l)}(u_0) - \theta^{(l)}(\tilde u_k) \| \lesssim h^{\beta - l}$, and thus
\[
h^{-1} n_k^{-1/2}\, \mathbb{E}\,\Delta_k = O( \gamma^{-1} h^\beta ) \implies \mathbb{E}\,\bar\Delta = O( \gamma^{-1} h^\beta ).
\tag{C.16}
\]
2) Variance of $\bar\Delta$. By the law of total variance,
\[
\mathrm{Var}( \bar\Delta ) = (nh)^{-2} \sum_{k=0}^K n_k\, \mathrm{Var}( \Delta_k )
= (nh)^{-2} \sum_{k=0}^K n_k \Big\{ \mathbb{E}\big[ \mathrm{Var}( \Delta_k \mid U_k ) \big] + \mathrm{Var}\big( \mathbb{E}[ \Delta_k \mid U_k ] \big) \Big\}.
\]
For the leading term, using (C.14) and the change of variables $t = ( u - u_0 )/h$,
\[
\begin{aligned}
h^{-1}\, \mathbb{E}\big[ \mathrm{Var}( \Delta_k \mid U_k ) \big]
&= h^{-1} \int \nu(u) \big( \Phi_l(t)^{\otimes 2} \otimes \Psi(u) \big) W(t)^2 f_\gamma(u)\, du\, ( 1 + o(1) ) \\
&= \gamma^{-1} \Big[ \int \Phi_l(t)^{\otimes 2} W(t)^2 f\Big( \frac{u_0 - u^* + h t}{\gamma} \Big) dt \Big] \otimes \Psi(u_0)\, \nu(u_0)\, ( 1 + o(1) )
= \gamma^{-1} \zeta_{0,2} \otimes \Psi(u_0)\, \nu(u_0)\, ( 1 + o(1) ).
\end{aligned}
\]
Hence
\[
(nh)^{-2} \sum_{k=0}^K n_k\, \mathbb{E}\big[ \mathrm{Var}( \Delta_k \mid U_k ) \big] = ( \gamma n h )^{-1} \nu(u_0)\, \zeta_{0,2} \otimes \Psi(u_0)\, ( 1 + o(1) ).
\]
For the second term, by the bound $\| \mathrm{Var}(Z) \| \le \mathbb{E}\| Z \|^2$ and (C.13), together with the same change of variables $t = ( u - u_0 )/h$ and the upper-boundedness of $\| \theta^{(l)}(u_0) - \theta^{(l)}(\tilde u_k) \|$, $\lambda_{\max}\{ \Psi(u) \}$, and $f(u)$,
\[
\mathrm{Var}\big( \mathbb{E}[ \Delta_k \mid U_k ] \big)
\lesssim n_k h^{2\beta} \int \| \Phi_l(t)^{\otimes 2} \|^2 t^{2l} W(t)^2 f_\gamma(u)\, du
= \gamma^{-1} n_k h^{2\beta + 1} \int \| \Phi_l(t)^{\otimes 2} \|^2 t^{2l} W(t)^2 f\Big( \frac{u_0 - u^* + h t}{\gamma} \Big)\, dt .
\]
Since $\Phi_l(t) = ( 1, t, \ldots, t^l / l! )^\top$ implies $\| \Phi_l(t) \|_2^2 \mathbf{1}( |t| \le 1 ) \lesssim 1$ and $W$ is the uniform kernel (Assumption 3.4) with finite moments, we obtain
\[
\mathrm{Var}\big( \mathbb{E}[ \Delta_k \mid U_k ] \big) \lesssim \frac{n_k h^{2\beta + 1}}{\gamma}.
\]
Therefore,
\[
(nh)^{-2} \sum_{k=0}^K n_k\, \mathrm{Var}\big( \mathbb{E}[ \Delta_k \mid U_k ] \big)
\lesssim (nh)^{-2} \sum_{k=0}^K n_k^2 \frac{h^{2\beta + 1}}{\gamma}
= \frac{h^{2\beta - 1}}{\gamma} \sum_{k=0}^K \Big( \frac{n_k}{n} \Big)^2
\lesssim \frac{h^{2\beta - 1}}{\gamma K},
\]
using $n_k / n \lesssim 1/K$.
Thus, combining the two components in the law of total variance and applying Equation (C.10) yield:
\[
\mathrm{Var}( \bar\Delta ) = ( \gamma n h )^{-1} \nu(u_0)\, \zeta_{0,2} \otimes \Psi(u_0)\, ( 1 + o(1) ).
\tag{C.17}
\]
3) Mean of $\bar\Lambda$. Since $\bar\Lambda = (nh)^{-1} \sum_{k=0}^K n_k \Lambda_k$,
\[
\mathbb{E}\,\bar\Lambda = \sum_{k=0}^K p_k \big\{ h^{-1}\, \mathbb{E}\,\Lambda_k \big\}.
\]
Integrating (C.15) as above gives
\[
h^{-1}\, \mathbb{E}\,\Lambda_k
= h^{-1} \int \big( \Phi_l(t)^{\otimes 2} \otimes \Psi(u) \big) W(t) f_\gamma(u)\, du\, ( 1 + o(1) )
= \gamma^{-1} \Big[ \int \Phi_l(t)^{\otimes 2} W(t) f\Big( \frac{u_0 - u^* + h t}{\gamma} \Big) dt \Big] \otimes \Psi(u_0)\, ( 1 + o(1) )
= \gamma^{-1} \zeta_{0,1} \otimes \Psi(u_0)\, ( 1 + o(1) ).
\]
Therefore,
\[
\mathbb{E}\,\bar\Lambda = \gamma^{-1} \zeta_{0,1} \otimes \Psi(u_0)\, ( 1 + o(1) ).
\tag{C.18}
\]
Rate of minimizer. Fix $M < \infty$ and consider the sphere $\{ \| \delta \| = M \}$. Assumption 2'' implies bounded $\mathbb{E} | s_3( \cdot, Y ) |^4 = \mathbb{E} | b^{(3)}( \cdot ) |^4$ and $\mathbb{E} \| X \|^4$, while Assumption 3.4 (compactly supported bounded $W$) implies $\mathbb{E}[ W(t_k) ] \asymp \int \mathbf{1}( |u - u_0| \le h ) f_\gamma(u)\, du = O( h/\gamma )$. Using the Taylor remainder above and $| Z_{ki}^\top \delta | \le \| Z_{ki} \| \| \delta \|$, we get
\[
\sup_{\| \delta \| \le M} | R_n(\delta) |
= O_p(1) \cdot M^3 r_n^3 \sum_{k=0}^K \sum_{i \in I_k} \| Z_{ki} \|^3 W(t_k)
= O_p\Big( M^3 r_n^3 \frac{nh}{\gamma} \Big)
= O_p\Big( M^3 \sqrt{ \gamma / (nh) } \Big) = o_p(1).
\tag{C.19}
\]
Equation (C.18) implies $\gamma \bar\Lambda \xrightarrow{p} \zeta_{0,1} \otimes \Psi(u_0)$, so for some $c > 0$, $\lambda_{\min}( \bar\Lambda ) \ge c/\gamma$ w.p. $\to 1$. Moreover, from Equations (C.16) and (C.17), as well as the condition $n h^{2\beta+1} / \gamma \lesssim 1$, $\| \bar\Delta \| = O_p\big( ( n h \gamma )^{-1/2} \big)$. Recall the decomposition in (C.12):
\[
D_n(\delta) = - (nh) r_n\, \delta^\top \bar\Delta + \frac{1}{2} (nh) r_n^2\, \delta^\top \bar\Lambda \delta + R_n(\delta).
\]
Hence, on $\{ \| \delta \| = M \}$,
\[
\inf_{\| \delta \| = M} D_n(\delta)
\ge - (nh) r_n \| \bar\Delta \| M + \frac{1}{2} (nh) r_n^2 \lambda_{\min}( \bar\Lambda ) M^2 - \sup_{\| \delta \| \le M} | R_n(\delta) |
\ge - M \cdot O_p(1) + \frac{1}{2}\, \gamma \cdot \frac{c}{\gamma}\, M^2 - o_p(1)
= \frac{1}{2} c M^2 - O_p(M) - o_p(1).
\]
Choosing $M$ as a sufficiently large constant makes the RHS positive w.p. $\to 1$. Since $D_n(0) = 0$ and $D_n$ is convex, the (unique) minimizer $\hat\delta := \arg\min_\delta D_n(\delta)$ lies in the ball $\{ \| \delta \| \le M \}$ w.p. $\to 1$, which justifies the rate $r_n$ and consistency.

Linearization.
On $\{ \| \delta \| \le M \}$ we have the uniform expansion, hence
\[
\nabla D_n(\delta) = - (nh) r_n \bar\Delta + (nh) r_n^2 \bar\Lambda \delta + \nabla R_n(\delta).
\]
$\nabla D_n( \hat\delta ) = 0$, so
\[
\hat\delta = r_n^{-1} \bar\Lambda^{-1} \bar\Delta - \gamma^{-1} \bar\Lambda^{-1} \nabla R_n( \hat\delta ) = r_n^{-1} \bar\Lambda^{-1} \bar\Delta + o_p(1),
\]
because $\sup_{\| \delta \| \le M} \| \nabla R_n(\delta) \| = o_p(1)$ and $\| \bar\Lambda^{-1} \| = O_p(\gamma)$ by Equations (C.19) and (C.18). Finally, since $\hat\alpha_{\mathrm{GDVCM}} = \bar\theta(u_0) + r_n \hat\delta$ and $\hat\theta_{\mathrm{GDVCM}}(u_0) = A_l \hat\alpha_{\mathrm{GDVCM}}$,
\[
A_l \hat\delta = r_n^{-1} \big( \hat\theta_{\mathrm{GDVCM}}(u_0) - \theta(u_0) \big) = r_n^{-1} A_l \bar\Lambda^{-1} \bar\Delta + o_p(1).
\tag{C.20}
\]
CLT for $\bar\Delta$. Fix a unit vector $v_0 \in \mathbb{R}^{p(l+1)}$ and define
\[
T_K(v_0) := v_0^\top ( \bar\Delta - \mathbb{E}\,\bar\Delta ) = - \frac{1}{nh} \sum_{k=0}^K n_k^{1/2} \eta_k(v_0), \qquad
\eta_k(v_0) := v_0^\top \big( \Delta_k - \mathbb{E}\,\Delta_k \big).
\]
By construction, $\{ \eta_k(v_0) \}_{k=0}^K$ are independent, centered random variables. From (C.14) and Lemma A.5,
\[
\mathrm{Var}\big( \eta_k(v_0) \mid U_k = u_k \big) = v_0^\top \nu(u_k) \big( \Phi_l(t_k)^{\otimes 2} \otimes \Psi(u_k) \big) v_0\, W(t_k)^2 + O( h^\beta ).
\]
Thus, taking expectation w.r.t. $U_k$ over the density $f_\gamma(u) = \gamma^{-1} f( \gamma^{-1}( u - u^* ) )$ yields
\[
\mathrm{Var}\big( T_K(v_0) \big) = (nh)^{-2} \sum_{k=0}^K n_k\, \mathrm{Var}\big( \eta_k(v_0) \big)
= ( \gamma n h )^{-1} \nu(u_0)\, v_0^\top \big( \zeta_{0,2} \otimes \Psi(u_0) \big) v_0\, ( 1 + o(1) ).
\]
Moreover, since $n_k \asymp n/K$, it follows that
\[
\mathbb{E}\Big[ \sum_k n_k^{3/2} W(t_k) \Big] \asymp ( K h / \gamma ) ( n/K )^{3/2}.
\]
Assumption 2'' (bounded moments of $Y$) implies the term $\mathbb{E} | \eta_k(v_0) |^3$ is upper bounded. Hence the Lyapunov condition is verified:
\[
\frac{ \sum_{k=0}^K \mathbb{E}\Big[ \Big| \frac{n_k^{1/2}}{nh} \eta_k(v_0) \Big|^3 \Big] }{ \mathrm{Var}( T_K(v_0) )^{3/2} }
\lesssim \frac{ ( K h / \gamma ) ( n/K )^{3/2} / ( n h )^3 }{ \big( ( n h )^{-1} \gamma^{-1} \big)^{3/2} }
\asymp \sqrt{ \frac{\gamma}{K h} } \longrightarrow 0.
\]
Therefore, by Lyapunov's CLT,
\[
\frac{ T_K(v_0) }{ \sqrt{ \mathrm{Var}( T_K(v_0) ) } } \xrightarrow{d} \mathcal{N}(0, 1).
\]
By the Cramér-Wold device,
\[
\sqrt{ \gamma n h }\, \big( \bar\Delta - \mathbb{E}\,\bar\Delta \big) \xrightarrow{d} \mathcal{N}\big( 0, \zeta_{0,2} \otimes \Psi(u_0)\, \nu(u_0) \big).
\tag{C.21}
\]
Collecting results. Combine Equations (C.16), (C.18), and (C.21).
The bias term is
\[
b_{\mathrm{GDVCM}}(u_0) = A_l\big(\mathbb{E}\bar{\Lambda}\big)^{-1}\mathbb{E}\bar{\Delta} = O(h^\beta),
\]
and the asymptotic variance of the centered estimator is
\[
\Omega_{\mathrm{GDVCM}}(u_0) = \nu(u_0)\,\big(e_{1,l+1}^\top\,\zeta_{0,1}^{-1}\zeta_{0,2}\zeta_{0,1}^{-1}\,e_{1,l+1}\big) \otimes \Psi(u_0)^{-1}.
\]
Therefore,
\[
\sqrt{\frac{nh}{\gamma}}\,\big\{\hat{\theta}_{\mathrm{GDVCM}}(u_0) - \theta(u_0) - b_{\mathrm{GDVCM}}(u_0)\big\} \xrightarrow{d} N\big(0,\ \Omega_{\mathrm{GDVCM}}(u_0)\big).
\]

C.8 Proof of Proposition A.8

Proof C.2 The asymptotic distribution for the linear DVCM is a special case of the generalized DVCM result (see Appendix C.7 for the proof under a more general setup). Below, we highlight only the differences from the general proof. Recall that the model is
\[
Y_{ki} = X_{ki}^\top\theta(U_k) + \varepsilon_{ki}, \qquad \mathbb{E}[\varepsilon_{ki}\mid X_{ki}, U_k] = 0, \qquad \mathrm{Var}(\varepsilon_{ki}\mid X_{ki}, U_k) = \sigma^2(U_k).
\]
For linear regression, our loss function $\ell(\eta, y)$ is the squared-error loss, i.e. $\ell(\eta, y) = (y - \eta)^2/2$, where $\eta = X^\top\theta$. As a consequence, the derivatives of the loss function (w.r.t. $\eta$) satisfy
\[
s_1(\eta, y) = \eta - y, \qquad s_2(\eta, y) = 1, \qquad s_3(\eta, y) = 0.
\]
Furthermore, we have $b(x) = x^2/2$, which implies $b'(x) = x$, $b''(x) = 1$, and $b'''(x) = 0$.

Main differences between the two proofs. In the case of linear regression, the term with $s_3$ in the proof of Proposition A.9 vanishes, as $s_3 \equiv 0$. Furthermore, the definition of the conditional variance $\Psi(u)$ simplifies to
\[
\Psi(u) = \mathbb{E}\big[XX^\top b''\big(X^\top\theta(u)\big)\mid U = u\big] = \mathbb{E}[XX^\top\mid U = u],
\]
and the variance term $\nu(u)$ becomes $\sigma^2(u)$. With the same idea as in the proof of Proposition A.9, we construct the function $D_n(\delta)$ in a way such that the minimizer $\hat{\delta}$ of $D_n(\delta)$ satisfies
\[
A_l\hat{\delta} = r_n^{-1}\big\{\hat{\theta}_{\mathrm{DVCM}}(u_0) - \theta(u_0)\big\}, \qquad r_n = (nh/\gamma)^{-1/2}.
\]
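The squared-error loss derivatives stated above can be sanity-checked numerically. The snippet below is an illustrative check of ours (not part of the paper), using central finite differences; since the loss is exactly quadratic, the differences are exact up to floating-point rounding and the third derivative is identically zero.

```python
# Illustrative numerical check (ours, not from the paper): for the
# squared-error loss l(eta, y) = (y - eta)^2 / 2 used in the linear DVCM,
# the derivatives in eta are s1 = eta - y, s2 = 1, and s3 = 0.

def loss(eta, y):
    return (y - eta) ** 2 / 2.0

def central_diff(f, x, h=1e-5):
    # first-order central finite difference
    return (f(x + h) - f(x - h)) / (2.0 * h)

eta, y = 1.3, 0.7
f = lambda e: loss(e, y)
s1 = central_diff(f, eta)                              # should equal eta - y
s2 = central_diff(lambda e: central_diff(f, e), eta)   # should equal 1
assert abs(s1 - (eta - y)) < 1e-6
assert abs(s2 - 1.0) < 1e-4
```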
Because $s_3 \equiv 0$, the Taylor remainder $R_n(\delta)$ in the generalized DVCM proof (Equation (C.12)) vanishes, and consequently
\[
D_n(\delta) = -(nh)\, r_n\, \delta^\top\bar{\Delta} + \tfrac{1}{2}(nh)\, r_n^2\, \delta^\top\bar{\Lambda}\,\delta,
\]
where
\[
\bar{\Lambda} := \frac{1}{nh}\sum_{k=0}^{K}\sum_{i \in I_k} Z_{ki}Z_{ki}^\top\, W(t_k), \qquad
\bar{\Delta} := \frac{1}{nh}\sum_{k=0}^{K}\sum_{i \in I_k}\big(Y_{ki} - Z_{ki}^\top\bar{\theta}(u_0)\big) Z_{ki}\, W(t_k).
\]
Using the fact that the optimal $\hat{\delta}$ satisfies $\nabla D_n(\hat{\delta}) = 0$, we have
\[
\hat{\delta} = \bar{\Lambda}^{-1}\bar{\Delta} \;\Longrightarrow\; r_n^{-1}\big\{\hat{\theta}_{\mathrm{DVCM}}(u_0) - \theta(u_0)\big\} = r_n^{-1} A_l \bar{\Lambda}^{-1}\bar{\Delta}, \tag{C.22}
\]
where the right-hand side follows from the definition of $\delta$. The asymptotic limits of $\mathbb{E}[\Delta_k\mid U_k = u_k]$, $\mathrm{Var}[\Delta_k\mid U_k = u_k]$, and $\mathbb{E}[\Lambda_k\mid U_k = u_k]$ have been shown in Lemma A.6. Integrating the above quantities w.r.t. $U_k$ with density $f_\gamma(u) = (1/\gamma)\, f\big((u - u^*)/\gamma\big)$ and applying Equation (C.10), the values of $\mathbb{E}[\bar{\Delta}]$, $\mathrm{Var}(\bar{\Delta})$, and $\mathbb{E}[\bar{\Lambda}]$ become
\[
\mathbb{E}[\bar{\Delta}] = O(\gamma^{-1}h^\beta), \qquad
\mathrm{Var}(\bar{\Delta}) = (\gamma nh)^{-1}\sigma^2(u_0)\,\zeta_{0,2} \otimes \Psi(u_0)\,(1 + o(1)), \qquad
\mathbb{E}\bar{\Lambda} = \gamma^{-1}\zeta_{0,1} \otimes \Psi(u_0)\,(1 + o(1)).
\]
In particular, the change-of-variables and integration steps are worked out in detail in Appendix C.7 for the generalized DVCM case, and the same calculations apply here exactly. Note that the quantities $\zeta_{r,s}$ do not depend on the loss function and consequently remain fixed. Since the $\Delta_k$ are sums of $\varepsilon_{ki} Z_{ki} W(t_k)$ with mean zero conditional on $(X_{ki}, U_k)$, the Lyapunov (or Lindeberg–Feller) CLT applies, yielding
\[
\sqrt{\gamma nh}\,\big\{\bar{\Delta} - \mathbb{E}\bar{\Delta}\big\} \xrightarrow{d} N\big(0,\ \zeta_{0,2} \otimes \Psi(u_0)\,\sigma^2(u_0)\big).
\]
Having established the asymptotic normality of $\bar{\Delta}$ and going back to Equation (C.22), it can be easily verified that the linear transformation $r_n^{-1} A_l \bar{\Lambda}^{-1}\bar{\Delta}$ satisfies
\[
r_n^{-1}\big\{\hat{\theta}_{\mathrm{DVCM}}(u_0) - \theta(u_0) - b_{\mathrm{DVCM}}(u_0)\big\} \xrightarrow{d} N\big(0,\ \Omega_{\mathrm{DVCM}}(u_0)\big),
\]
where
\[
b_{\mathrm{DVCM}}(u_0) = O(h^\beta), \qquad
\Omega_{\mathrm{DVCM}}(u_0) = \sigma^2(u_0)\,\big(e_{1,l+1}^\top\,\zeta_{0,1}^{-1}\zeta_{0,2}\zeta_{0,1}^{-1}\,e_{1,l+1}\big) \otimes \Psi(u_0)^{-1}.
\]
This completes the proof.
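The closed-form minimizer behind (C.22) is the solution of an unconstrained quadratic program. As a quick numerical illustration (a toy example of ours, with randomly generated stand-ins for $\bar{\Lambda}$ and $\bar{\Delta}$, not the paper's data), one can verify that the gradient vanishes at $\bar{\Lambda}^{-1}\bar{\Delta}$ and that any perturbation increases the objective:

```python
import numpy as np

# Toy numerical illustration (ours): the quadratic objective
# D(delta) = -delta' Delta + (1/2) delta' Lam delta is minimized at
# delta_hat = Lam^{-1} Delta, the normal-equations solution behind (C.22).
rng = np.random.default_rng(0)
p = 4
A = rng.standard_normal((p, p))
Lam = A @ A.T + p * np.eye(p)        # positive-definite stand-in for Lambda-bar
Delta = rng.standard_normal(p)       # stand-in for Delta-bar

delta_hat = np.linalg.solve(Lam, Delta)

# the gradient -Delta + Lam @ delta vanishes at the closed-form minimizer
grad = -Delta + Lam @ delta_hat
assert np.allclose(grad, 0.0, atol=1e-10)

# and random perturbations can only increase the objective (convexity)
D = lambda d: -d @ Delta + 0.5 * d @ Lam @ d
for _ in range(100):
    assert D(delta_hat) <= D(delta_hat + 0.1 * rng.standard_normal(p)) + 1e-12
```

Because $\Lambda$ is positive definite, $D(\hat{\delta} + v) - D(\hat{\delta}) = \tfrac12 v^\top \Lambda v \ge 0$, which is what the perturbation loop checks.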
C.9 Proof of Lemma A.10

Recall the rates of $\hat{\theta}_{\mathrm{GLR}}(u_0)$, $\hat{\theta}_{\mathrm{GDVCM}}(u_0)$, and their ratio:
\[
\hat{\theta}_{\mathrm{GLR}}(u_0) - \theta(u_0) = O_p(r_{\mathrm{GLR}}), \qquad
\hat{\theta}_{\mathrm{GDVCM}}(u_0) - \theta(u_0) = O_p(r_{\mathrm{GDVCM}}), \qquad
\rho := \frac{r_{\mathrm{GLR}}}{r_{\mathrm{GDVCM}}},
\]
and assume $Q \asymp_p \rho^2 I$, i.e. $c\rho^2 \le \lambda_{\min}(Q) \le \lambda_{\max}(Q) \le C\rho^2$ w.p. $\to 1$. Set $r_{\mathrm{TL}} := r_{\mathrm{GLR}} \wedge r_{\mathrm{GDVCM}}$. Define
\[
D_n(\delta) := L_n\big(\theta(u_0) + r_{\mathrm{TL}}\delta\big) - L_n\big(\theta(u_0)\big), \qquad
L_n(\alpha) := \frac{1}{n_0}\sum_{i \in I_0^*}\ell\big(X_{0i}^\top\alpha, Y_{0i}\big) + \frac{1}{2}\big\|\alpha - \hat{\theta}_{\mathrm{GDVCM}}(u_0)\big\|_Q^2.
\]
In this proof, we use $\|\cdot\|$ to denote the $\ell_2$-norm of a vector and the spectral norm of a matrix.

Part 1: Rate and consistency of TL. The key step is to show that for any $\varepsilon > 0$ there exists a large constant $M = M(\varepsilon)$ such that
\[
P\Big(\inf_{\|\delta\| = M} D_n(\delta) > 0\Big) = P\Big(\inf_{\|\delta\| = M} L_n\big(\theta(u_0) + r_{\mathrm{TL}}\delta\big) > L_n\big(\theta(u_0)\big)\Big) \ge 1 - \varepsilon. \tag{C.23}
\]
Since $D_n(0) = 0$ and $D_n(\cdot)$ is convex, this implies that with probability at least $1 - \varepsilon$ a global minimizer lies inside the ball $\{\theta(u_0) + r_{\mathrm{TL}}\delta : \|\delta\| \le M\}$. Consequently, $\|\hat{\theta}_{\mathrm{TL}}(u_0) - \theta(u_0)\| = O_p(r_{\mathrm{TL}})$.

Expanding $D_n(\delta)$. A Taylor expansion of $L_n(\theta(u_0) + r_{\mathrm{TL}}\delta)$ around $\theta(u_0)$ gives
\[
D_n(\delta) = r_{\mathrm{TL}}\,\delta^\top G_n + \tfrac{1}{2}\, r_{\mathrm{TL}}^2\,\delta^\top H_n\,\delta + R_n(\delta), \tag{C.24}
\]
with
\[
G_n = \frac{1}{n_0}\sum_{i \in I_0^*} s_1\big(X_{0i}^\top\theta(u_0), Y_{0i}\big) X_{0i} + Q\big(\theta(u_0) - \hat{\theta}_{\mathrm{GDVCM}}(u_0)\big),
\]
\[
H_n = \frac{1}{n_0}\sum_{i \in I_0^*} s_2\big(X_{0i}^\top\theta(u_0), Y_{0i}\big) X_{0i}X_{0i}^\top + Q, \qquad
R_n(\delta) = \frac{r_{\mathrm{TL}}^3}{6 n_0}\sum_{i \in I_0^*} s_3(\tilde{\eta}_{0i}, Y_{0i})\big(X_{0i}^\top\delta\big)^3,
\]
with $\tilde{\eta}_{0i}$ being some intermediate point between $X_{0i}^\top\theta(u_0)$ and $X_{0i}^\top(\theta(u_0) + r_{\mathrm{TL}}\delta)$.

Order of $G_n$, $H_n$, and $R_n$. It is immediate from the definition of the remainder term $R_n$ that for any $M > 0$,
\[
\sup_{\|\delta\| \le M} |R_n(\delta)| \le \frac{r_{\mathrm{TL}}^3 M^3}{6 n_0}\sum_{i \in I_0^*} \big|s_3(\tilde{\eta}_{0i}, Y_{0i})\big|\,\|X_{0i}\|^3 = O_p\big(r_{\mathrm{TL}}^3 M^3\big), \tag{C.25}
\]
as $\mathbb{E}|s_3(\cdot, Y)|^4 < \infty$ and $\|X\| < \infty$ by Assumption 2′′. Our next goal is to obtain the order of $G_n$ and $H_n$.
Note that $G_n$ involves the term $Q(\theta(u_0) - \hat{\theta}_{\mathrm{GDVCM}}(u_0))$. From our definition of $r_{\mathrm{GDVCM}}$, we have $\theta(u_0) - \hat{\theta}_{\mathrm{GDVCM}}(u_0) = O_p(r_{\mathrm{GDVCM}})$, and by our assumption on $Q$, $Q \asymp_p \rho^2 I$. Furthermore, from the standard property of the (centered) score function on the target domain,
\[
\frac{1}{n_0}\sum_{i \in I_0^*} s_1\big(X_{0i}^\top\theta(u_0), Y_{0i}\big) X_{0i} = O_p\big(n_0^{-1/2}\big) = O_p(r_{\mathrm{GLR}}).
\]
Therefore, combining these orders,
\[
G_n = O_p\big(r_{\mathrm{GLR}} + \rho^2 r_{\mathrm{GDVCM}}\big). \tag{C.26}
\]
Furthermore, $\lambda_{\min}(\Psi(u_0)) \ge c_0'$ by Assumption 2′′. Therefore, by the law of large numbers,
\[
\lambda_{\min}(H_n) \ge \lambda_{\min}\big(\Psi(u_0) + Q\big) + o_p(1) \ge \lambda_{\min}\big(\Psi(u_0)\big) + \lambda_{\min}(Q) + o_p(1) = \Omega_p\big(1 + \rho^2\big). \tag{C.27}
\]

Equivalent characterizations of $r_{\mathrm{TL}}$. Observe that
\[
\frac{r_{\mathrm{GLR}} + \rho^2 r_{\mathrm{GDVCM}}}{1 + \rho^2} =
\begin{cases}
r_{\mathrm{GLR}}\,\dfrac{1 + \rho}{1 + \rho^2} \le 2\, r_{\mathrm{GLR}}, & \rho \le 1,\\[6pt]
r_{\mathrm{GDVCM}}\,\dfrac{\rho + \rho^2}{1 + \rho^2} \le 2\, r_{\mathrm{GDVCM}}, & \rho \ge 1,
\end{cases}
\]
and thus we obtain a lower bound on $r_{\mathrm{TL}}$:
\[
\frac{r_{\mathrm{GLR}} + \rho^2 r_{\mathrm{GDVCM}}}{1 + \rho^2} \le 2\big(r_{\mathrm{GLR}} \wedge r_{\mathrm{GDVCM}}\big) = 2\, r_{\mathrm{TL}}
\;\Longrightarrow\;
r_{\mathrm{TL}} \ge \frac{r_{\mathrm{GLR}} + \rho^2 r_{\mathrm{GDVCM}}}{2(1 + \rho^2)}. \tag{C.28}
\]
Furthermore, we also have
\[
r_{\mathrm{TL}} = \min\{r_{\mathrm{GLR}}, r_{\mathrm{GDVCM}}\} \le \frac{r_{\mathrm{GLR}} + \rho^2 r_{\mathrm{GDVCM}}}{1 + \rho^2}.
\]
Combining this with (C.28) yields the bound
\[
\frac{r_{\mathrm{GLR}} + \rho^2 r_{\mathrm{GDVCM}}}{2(1 + \rho^2)} \le r_{\mathrm{TL}} \le \frac{r_{\mathrm{GLR}} + \rho^2 r_{\mathrm{GDVCM}}}{1 + \rho^2}
\;\Longrightarrow\;
r_{\mathrm{TL}} \asymp \frac{r_{\mathrm{GLR}} + \rho^2 r_{\mathrm{GDVCM}}}{1 + \rho^2}. \tag{C.29}
\]

Verifying Objective (C.23). Recall the definition of $D_n(\delta)$ from Equation (C.24). On the sphere $\{\|\delta\| = M\}$,
\[
\inf_{\|\delta\| = M} D_n(\delta)
\ge -\underbrace{r_{\mathrm{TL}}\|G_n\|}_{=:\,A_n \ge 0} M + \underbrace{\tfrac{1}{2}\, r_{\mathrm{TL}}^2\,\lambda_{\min}(H_n)}_{=:\,B_n \ge 0} M^2 - \sup_{\|\delta\| \le M}|R_n(\delta)|
= -A_n M + B_n M^2 - \sup_{\|\delta\| \le M}|R_n(\delta)|. \tag{C.30}
\]
The orders of $A_n$ and $B_n$ can be derived from the orders of $G_n$ and $\lambda_{\min}(H_n)$, respectively. In particular, using the order of $G_n$ established in Equation (C.26),
\[
A_n = r_{\mathrm{TL}}\|G_n\| = O_p\big(r_{\mathrm{TL}}\,(r_{\mathrm{GLR}} + \rho^2 r_{\mathrm{GDVCM}})\big).
\]
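The two-sided bound in (C.28) and (C.29) is elementary; the following numeric check of ours (arbitrary toy rates, not the paper's) confirms that the combination $(r_{\mathrm{GLR}} + \rho^2 r_{\mathrm{GDVCM}})/(1+\rho^2)$ is always within a factor of two of the minimum:

```python
import numpy as np

# Numeric check (ours, with arbitrary toy rates): with rho = r_GLR / r_GDVCM,
# the combination (r_GLR + rho^2 * r_GDVCM) / (1 + rho^2) is sandwiched
# between r_TL = min(r_GLR, r_GDVCM) and 2 * r_TL, so the two quantities
# are of the same order, as claimed in (C.29).
rng = np.random.default_rng(1)
for _ in range(10_000):
    r_glr, r_dvcm = rng.uniform(1e-3, 10.0, size=2)
    rho = r_glr / r_dvcm
    combo = (r_glr + rho ** 2 * r_dvcm) / (1.0 + rho ** 2)
    r_tl = min(r_glr, r_dvcm)
    slack = 1e-9 * combo
    assert r_tl <= combo + slack          # upper characterization of r_TL
    assert combo <= 2.0 * r_tl + slack    # lower characterization (C.28)
```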
Moreover, the lower bound on the minimum eigenvalue of $H_n$ (Equation (C.27)) and the characterization of $r_{\mathrm{TL}}$ in Equation (C.29) yield the following order for $B_n$:
\[
B_n = \tfrac{1}{2}\, r_{\mathrm{TL}}^2\,\lambda_{\min}(H_n)
\ge \tfrac{1}{2}\, r_{\mathrm{TL}}\cdot\frac{r_{\mathrm{GLR}} + \rho^2 r_{\mathrm{GDVCM}}}{2(1 + \rho^2)}\cdot\Omega_p\big(1 + \rho^2\big)
= \Omega_p\big(r_{\mathrm{TL}}\,(r_{\mathrm{GLR}} + \rho^2 r_{\mathrm{GDVCM}})\big).
\]
Treating $M$ as a constant, the upper bound on the remainder term (Equation (C.25)) yields $\sup_{\|\delta\| \le M}|R_n(\delta)| = O_p(r_{\mathrm{TL}}^3)$, and by Equation (C.29), together with the immediate relation $1 + \rho^2 \gtrsim 1$,
\[
O_p\big(r_{\mathrm{TL}}^3\big) = o_p\big(r_{\mathrm{TL}}^2\big)
= o_p\Big(r_{\mathrm{TL}}\,\frac{r_{\mathrm{GLR}} + \rho^2 r_{\mathrm{GDVCM}}}{1 + \rho^2}\Big)
= o_p\big(r_{\mathrm{TL}}\,(r_{\mathrm{GLR}} + \rho^2 r_{\mathrm{GDVCM}})\big).
\]
Hence the remainder $\sup_{\|\delta\| \le M}|R_n(\delta)|$ is dominated by both $A_n$ and $B_n$, so Equation (C.30) can be written as
\[
\inf_{\|\delta\| = M} D_n(\delta) \ge -A_n M + B_n M^2 + o(A_n \wedge B_n).
\]
Both $A_n$ and $B_n$ are of order $r_{\mathrm{TL}}(r_{\mathrm{GLR}} + \rho^2 r_{\mathrm{GDVCM}})$, so by choosing $M$ large enough, $B_n M^2$ dominates $A_n M$. Therefore, with $M$ set large enough, the condition in (C.23) is verified, which further implies $\hat{\theta}_{\mathrm{TL}}(u_0) - \theta(u_0) = O_p(r_{\mathrm{TL}})$.

Part 2: Asymptotic distribution. Let $s_j(\eta, y) = \partial^j\ell(\eta, y)/\partial\eta^j$, $\tau_{\mathrm{TL}} := \hat{\theta}_{\mathrm{TL}}(u_0) - \theta(u_0)$, and $\tau_{\mathrm{GDVCM}} := \hat{\theta}_{\mathrm{GDVCM}}(u_0) - \theta(u_0)$. The first-order condition is
\[
\frac{1}{n_0}\sum_{i \in I_0^*} s_1\big(X_{0i}^\top\hat{\theta}_{\mathrm{TL}}(u_0), Y_{0i}\big) X_{0i} + Q\big(\hat{\theta}_{\mathrm{TL}}(u_0) - \hat{\theta}_{\mathrm{GDVCM}}(u_0)\big) = 0. \tag{C.31}
\]
A second-order Taylor expansion of $s_1$ in its first argument around $\theta(u_0)$ gives
\[
s_1\big(X_{0i}^\top\hat{\theta}_{\mathrm{TL}}(u_0), Y_{0i}\big)
= s_1\big(X_{0i}^\top\theta(u_0), Y_{0i}\big)
+ s_2\big(X_{0i}^\top\theta(u_0), Y_{0i}\big) X_{0i}^\top\tau_{\mathrm{TL}}
+ \tfrac{1}{2}\, s_3(\bar{\eta}_{0i}, Y_{0i})\big(X_{0i}^\top\tau_{\mathrm{TL}}\big)^2,
\]
where $\bar{\eta}_{0i}$ lies between $X_{0i}^\top\theta(u_0)$ and $X_{0i}^\top\hat{\theta}_{\mathrm{TL}}(u_0)$.
Plugging into (C.31) and collecting terms, we obtain
\[
\frac{1}{n_0}X_0^\top S_1 + \Big(\frac{1}{n_0}X_0^\top S_2 X_0 + Q\Big)\tau_{\mathrm{TL}} + \frac{1}{2 n_0}\sum_{i \in I_0^*} s_3(\bar{\eta}_{0i}, Y_{0i})\, X_{0i}\big(X_{0i}^\top\tau_{\mathrm{TL}}\big)^2 = Q\,\tau_{\mathrm{GDVCM}}, \tag{C.32}
\]
where $X_0 \in \mathbb{R}^{n_0 \times p}$ with $X_{0i}$ being its $i$th row, $S_1 \in \mathbb{R}^{n_0}$ with entries $s_1(X_{0i}^\top\theta(u_0), Y_{0i})$, and $S_2 \in \mathbb{R}^{n_0 \times n_0}$ a diagonal matrix with entries $s_2(X_{0i}^\top\theta(u_0), Y_{0i})$. By $\mathbb{E}|s_3(\cdot, Y)|^4 < \infty$ and $\|X\| < \infty$,
\[
\frac{1}{n_0}\sum_{i \in I_0^*}\big|s_3(\bar{\eta}_{0i}, Y_{0i})\big|\,\|X_{0i}\|^3 = O_p(1),
\]
and thus
\[
\Big\|\frac{1}{2 n_0}\sum_i s_3(\bar{\eta}_{0i}, Y_{0i})\, X_{0i}\big(X_{0i}^\top\tau_{\mathrm{TL}}\big)^2\Big\|
\le \frac{1}{n_0}\sum_i \big|s_3(\bar{\eta}_{0i}, Y_{0i})\big|\,\|X_{0i}\|^3\,\|\tau_{\mathrm{TL}}\|^2
= O_p\big(\|\tau_{\mathrm{TL}}\|^2\big). \tag{C.33}
\]
By the law of large numbers and Assumption 2′′,
\[
\frac{1}{n_0}X_0^\top S_2 X_0 = \Psi(u_0) + o_p(1). \tag{C.34}
\]
Recall $\tau_{\mathrm{GLR}} := \hat{\theta}_{\mathrm{GLR}}(u_0) - \theta(u_0)$. The score equation of the generalized linear model gives
\[
0 = \frac{1}{n_0}\sum_{i \in I_0^*} s_1\big(X_{0i}^\top\hat{\theta}_{\mathrm{GLR}}, Y_{0i}\big) X_{0i}
= \frac{1}{n_0}X_0^\top S_1 + \frac{1}{n_0}X_0^\top S_2 X_0\,\tau_{\mathrm{GLR}} + R_{n_0}.
\]
Reorganizing terms yields
\[
-\frac{1}{n_0}X_0^\top S_1 = \frac{1}{n_0}X_0^\top S_2 X_0\,\tau_{\mathrm{GLR}} + R_{n_0}.
\]
By standard theory of the generalized linear model, it follows that $\frac{1}{n_0}X_0^\top S_2 X_0 = \Psi(u_0) + o_p(1)$ and $\|R_{n_0}\| = o_p(r_{\mathrm{GLR}})$. Therefore,
\[
-\frac{1}{n_0}X_0^\top S_1 = \Psi(u_0)\,\tau_{\mathrm{GLR}} + o_p(r_{\mathrm{GLR}}). \tag{C.35}
\]
Rearranging (C.32) yields
\[
r_{\mathrm{TL}}^{-1}\tau_{\mathrm{TL}} = r_{\mathrm{TL}}^{-1}\Big(\frac{1}{n_0}X_0^\top S_2 X_0 + Q\Big)^{-1}\Big(Q\,\tau_{\mathrm{GDVCM}} - \frac{1}{n_0}X_0^\top S_1\Big) + \tilde{R}_n,
\]
with
\[
\tilde{R}_n = -r_{\mathrm{TL}}^{-1}\Big(\frac{1}{n_0}X_0^\top S_2 X_0 + Q\Big)^{-1}\frac{1}{2 n_0}\sum_{i \in I_0^*} s_3(\bar{\eta}_{0i}, Y_{0i})\, X_{0i}\big(X_{0i}^\top\tau_{\mathrm{TL}}\big)^2.
\]
Substituting Equations (C.33) and (C.34), it follows from the consistency of $\hat{\theta}_{\mathrm{TL}}(u_0)$ that
\[
\tilde{R}_n = r_{\mathrm{TL}}^{-1}\big(\Psi(u_0) + Q\big)^{-1} O_p\big(\|\tau_{\mathrm{TL}}\|^2\big) = O_p\big(\|\tau_{\mathrm{TL}}\|\big) = o_p(1),
\]
and thus
\[
r_{\mathrm{TL}}^{-1}\tau_{\mathrm{TL}} = r_{\mathrm{TL}}^{-1}\Big(\frac{1}{n_0}X_0^\top S_2 X_0 + Q\Big)^{-1}\Big(Q\,\tau_{\mathrm{GDVCM}} - \frac{1}{n_0}X_0^\top S_1\Big) + o_p(1).
\]
Now we substitute Equations (C.34) and (C.35) and obtain
\[
r_{\mathrm{TL}}^{-1}\tau_{\mathrm{TL}} = r_{\mathrm{TL}}^{-1}\big(\Psi(u_0) + Q\big)^{-1}\big\{Q\,\tau_{\mathrm{GDVCM}} + \Psi(u_0)\,\tau_{\mathrm{GLR}}\big\} + o_p(1).
\]
(C.36)

By sample splitting, $\tau_{\mathrm{GLR}}$ is independent of $\tau_{\mathrm{GDVCM}}$; based on this independence we derive the regime-specific asymptotic normality.

Regime $\rho \to 0$ (GLR-dominated). From $Q = O_p(\rho^2)$, we have $(\Psi(u_0) + Q)^{-1} = \Psi^{-1}(u_0) + o_p(1)$ and
\[
\big\|\Psi^{-1}(u_0)\, Q\,\tau_{\mathrm{GDVCM}}\big\| = O_p\big(\rho^2\|\tau_{\mathrm{GDVCM}}\|\big) = O_p\big(\rho^2 r_{\mathrm{GDVCM}}\big) = O_p\big(r_{\mathrm{GLR}}\,\rho\big) = o_p(r_{\mathrm{GLR}}).
\]
Hence $\tau_{\mathrm{TL}} = \tau_{\mathrm{GLR}} + o_p(r_{\mathrm{GLR}})$. By Slutsky's theorem and part 1 of Proposition A.9,
\[
r_{\mathrm{GLR}}^{-1}\big\{\hat{\theta}_{\mathrm{TL}}(u_0) - \theta(u_0)\big\} \xrightarrow{d} N\big(0,\ \Omega_{\mathrm{GLR}}(u_0)\big).
\]

Regime $\rho \to \infty$ (DVCM-dominated). Starting from
\[
\tau_{\mathrm{TL}} = \big(\Psi(u_0) + Q\big)^{-1}\big\{Q\,\tau_{\mathrm{GDVCM}} + \Psi(u_0)\,\tau_{\mathrm{GLR}}\big\} + o_p(r_{\mathrm{TL}}),
\]
use the identity $(\Psi(u_0) + Q)^{-1} - Q^{-1} = -(\Psi(u_0) + Q)^{-1}\Psi(u_0)\, Q^{-1}$. Since $\lambda_{\min}(Q) \asymp_p \rho^2$ and $\Psi(u_0) \succeq c_0' I$,
\[
\big\|(\Psi(u_0) + Q)^{-1}\big\| \le \|Q^{-1}\| = O_p(\rho^{-2}), \qquad \|\Psi(u_0)\| = O(1),
\]
hence
\[
\big\|(\Psi(u_0) + Q)^{-1} - Q^{-1}\big\| \le \big\|(\Psi(u_0) + Q)^{-1}\big\|\,\|\Psi(u_0)\|\,\|Q^{-1}\| = O_p(\rho^{-4}) = o_p\big(\|Q^{-1}\|\big). \tag{C.37}
\]
Next, apply the identity $(\Psi(u_0) + Q)^{-1} Q = I - (\Psi(u_0) + Q)^{-1}\Psi(u_0)$ to get
\[
\tau_{\mathrm{TL}} = \tau_{\mathrm{GDVCM}} - \big(\Psi(u_0) + Q\big)^{-1}\Psi(u_0)\,\tau_{\mathrm{GDVCM}} + \big(\Psi(u_0) + Q\big)^{-1}\Psi(u_0)\,\tau_{\mathrm{GLR}} + o_p(r_{\mathrm{GDVCM}}).
\]
The second and third terms are negligible by Equation (C.37):
\[
\big\|\big(\Psi(u_0) + Q\big)^{-1}\Psi(u_0)\,\tau_{\mathrm{GDVCM}}\big\| \le \|Q^{-1}\|\,\|\Psi(u_0)\|\,\|\tau_{\mathrm{GDVCM}}\|\,\{1 + o_p(1)\} = O_p\big(\rho^{-2} r_{\mathrm{GDVCM}}\big) = o_p(r_{\mathrm{GDVCM}}),
\]
and, using $r_{\mathrm{GLR}} = \rho\, r_{\mathrm{GDVCM}}$ and Equation (C.37),
\[
\big\|\big(\Psi(u_0) + Q\big)^{-1}\Psi(u_0)\,\tau_{\mathrm{GLR}}\big\| \le \|Q^{-1}\|\,\|\Psi(u_0)\|\,\|\tau_{\mathrm{GLR}}\|\,\{1 + o_p(1)\} = O_p\big(\rho^{-2} r_{\mathrm{GLR}}\big) = O_p\big(r_{\mathrm{GDVCM}}/\rho\big) = o_p(r_{\mathrm{GDVCM}}).
\]
Therefore $\tau_{\mathrm{TL}} = \tau_{\mathrm{GDVCM}} + o_p(r_{\mathrm{GDVCM}})$, and by Slutsky's theorem and part 2 of Proposition A.9,
\[
r_{\mathrm{GDVCM}}^{-1}\big\{\hat{\theta}_{\mathrm{TL}}(u_0) - \theta(u_0) - b_{\mathrm{GDVCM}}(u_0)\big\} \xrightarrow{d} N\big(0,\ \Omega_{\mathrm{GDVCM}}(u_0)\big).
\]

D Discussion of Assumption 3.2

In this section, we examine the plausibility and technical strength of Assumption 3.2, which requires boundedness of $\|A_l(Z^\top WZ)^{-1}A_l^\top\|_2$.
Our analysis considers two asymptotic regimes based on the number of tasks $K$: (i) a fixed-$K$ regime with $n_0 \to \infty$, and (ii) a growing-task regime with $K \to \infty$. The corresponding results are established in Proposition D.1 and Proposition D.3, respectively. In both regimes, we show that Assumption 3.2 is satisfied in the sense of almost sure convergence under mild conditions, thereby providing theoretical justification for its use in our framework.

Proposition D.1 Assume that $\lambda_{\min}\big(\mathbb{E}[X_{0i}X_{0i}^\top]\big) \ge \kappa$ and $n_0/\sum_{k=0}^{K} n_k \ge c_0$ for some constants $\kappa, c_0 > 0$. Then it holds that
\[
P\Big(\limsup_{n_0 \to \infty}\big\|A_l(Z^\top WZ)^{-1}A_l^\top\big\|_2 \le C\Big) = 1
\]
for some constant $C > 0$.

Proof D.2 Recall
\[
Z^\top WZ = w_0\sum_{i \in I_0} Z_{0i}Z_{0i}^\top + \sum_{k=1}^{K} w_k\sum_{i \in I_k} Z_{ki}Z_{ki}^\top, \qquad
w_k = \frac{W\big((U_k - u_0)/h\big)}{S_h}, \qquad
S_h = \sum_{k=0}^{K} n_k\, W\Big(\frac{U_k - u_0}{h}\Big).
\]
Using $Z_{0i} = A_l^\top X_{0i}$, define
\[
B := w_0\sum_{i \in I_0} X_{0i}X_{0i}^\top, \qquad M := \sum_{k=1}^{K} w_k\sum_{i \in I_k} Z_{ki}Z_{ki}^\top,
\]
so that $Z^\top WZ = A_l^\top B A_l + M$. Let $T := A_l M^{-1} A_l^\top \succeq 0$. The Woodbury identity yields
\[
A_l(Z^\top WZ)^{-1}A_l^\top = T - T\big(B^{-1} + T\big)^{-1} T \preceq B^{-1},
\]
hence
\[
\big\|A_l(Z^\top WZ)^{-1}A_l^\top\big\|_2 \le \|B^{-1}\|_2 = \lambda_{\min}(B)^{-1}. \tag{D.1}
\]
Since $w_0 = W(0)/S_h$ and $S_h \le W(0)\sum_{k=0}^{K} n_k$, we have
\[
w_0 = \frac{W(0)}{S_h} \ge \frac{1}{\sum_{k=0}^{K} n_k}.
\]
Therefore,
\[
\lambda_{\min}(B) = w_0\,\lambda_{\min}\Big(\sum_{i \in I_0} X_{0i}X_{0i}^\top\Big) \ge \frac{1}{\sum_{k=0}^{K} n_k}\,\lambda_{\min}\Big(\sum_{i \in I_0} X_{0i}X_{0i}^\top\Big).
\]
Using $\sum_{k=0}^{K} n_k \le n_0/c_0$, this becomes
\[
\lambda_{\min}(B) \ge c_0\,\lambda_{\min}\Big(\frac{1}{n_0}\sum_{i \in I_0} X_{0i}X_{0i}^\top\Big). \tag{D.2}
\]
Let $\Sigma_0 := \mathbb{E}[X_{0i}X_{0i}^\top]$ with $\lambda_{\min}(\Sigma_0) \ge \kappa$. Since the $X_{0i}X_{0i}^\top$ are i.i.d. PSD matrices with $\lambda_{\max}(X_{0i}X_{0i}^\top) = \|X_{0i}\|_2^2 \le 1$, the matrix Chernoff bound [51] gives that for any $\eta \in (0, 1)$,
\[
P\Big\{\lambda_{\min}\Big(\sum_{i \in I_0} X_{0i}X_{0i}^\top\Big) \le (1 - \eta)\, n_0\kappa\Big\} \le p\exp\Big(-\frac{\eta^2 n_0\kappa}{2}\Big).
\]
Taking $\eta = 1/2$ yields
\[
P\Big\{\lambda_{\min}\Big(\frac{1}{n_0}\sum_{i \in I_0} X_{0i}X_{0i}^\top\Big) \le \frac{\kappa}{2}\Big\} \le p\exp(-c\, n_0\kappa) \tag{D.3}
\]
for a constant $c > 0$.
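The concentration claimed in (D.3) is easy to see in simulation. The snippet below is an illustrative experiment of ours (a toy bounded design, not the paper's setting): with $n_0$ in the hundreds, the smallest eigenvalue of the sample second-moment matrix essentially never drops below half its population value.

```python
import numpy as np

# Illustrative simulation (ours): for i.i.d. bounded covariates with
# ||x||_2 <= 1, the smallest eigenvalue of (1/n0) sum x_i x_i' concentrates
# around its population counterpart, as the matrix Chernoff bound behind
# (D.3) predicts; the tail event {lambda_min <= kappa/2} becomes rare.
rng = np.random.default_rng(2)
p, n0, reps = 3, 500, 200
kappa = 1.0 / (3.0 * p)   # lambda_min of E[xx'] = I/(3p) for this design
failures = 0
for _ in range(reps):
    # uniform on [-1, 1]^p, scaled so that ||x||_2 <= 1
    X = rng.uniform(-1.0, 1.0, size=(n0, p)) / np.sqrt(p)
    lam_min = np.linalg.eigvalsh(X.T @ X / n0)[0]
    failures += lam_min <= kappa / 2
assert failures == 0   # at n0 = 500 the bad event essentially never occurs
```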
On the complement of the event in (D.3), we have $\lambda_{\min}\big(\frac{1}{n_0}\sum X_{0i}X_{0i}^\top\big) \ge \kappa/2$ and thus, by (D.2), $\lambda_{\min}(B) \ge c_0\kappa/2$. Plugging into (D.1) gives
\[
\big\|A_l(Z^\top WZ)^{-1}A_l^\top\big\|_2 \le \frac{2}{c_0\kappa} =: C.
\]
Therefore, defining $\mathcal{B}_{n_0} = \big\{\|A_l(Z^\top WZ)^{-1}A_l^\top\|_2 > C\big\}$, we have $P(\mathcal{B}_{n_0}) \le p\exp(-C' n_0\kappa)$, which implies that for some $k_0 > 0$,
\[
\sum_{n_0 = k_0}^{\infty} P(\mathcal{B}_{n_0}) \le \sum_{n_0 = k_0}^{\infty} p\exp(-C' n_0\kappa) < \infty.
\]
Applying the Borel–Cantelli lemma proves the result.

Proposition D.3 Assume that $\lambda_{\min}\big(\mathbb{E}[XX^\top\mid U = u]\big) \ge \lambda_0$ for all $u \in \mathcal{U}$ and $Kh/\gamma \gtrsim \log(K)$. Then it holds that
\[
P\Big(\limsup_{K \to \infty}\big\|A_l(Z^\top WZ)^{-1}A_l^\top\big\|_2 \le C\Big) = 1
\]
for some constant $C > 0$.

Proof D.4 For a fixed $K$, define the quantities
\[
\mathcal{Y}_k := \sum_{i=1}^{n_k}\mathcal{Y}_{ki} = \frac{W_k}{S_h}\Big(\sum_{i=1}^{n_k} X_{ki}X_{ki}^\top\Big) \otimes \big(r_k r_k^\top\big), \qquad
S^{(K)} := Z^\top WZ = \sum_{k=1}^{K}\mathcal{Y}_k,
\]
where $W_k := W\big((U_k - u_0)/h\big)$ is the uniform kernel indicator, $S_h := \sum_{k=1}^{K} n_k W_k$, and $r_k = \Phi_l\big((U_k - u_0)/h\big)$.

1. Spectral bound. Condition on $U_{1:K}$. Then $W_k$, $S_h$, and $r_k$ are deterministic and $\{\mathcal{Y}_k\}_{k=1}^{K}$ are conditionally independent. Moreover, by the boundedness of $\|X_{ki}\|_2^2$ in Assumption 3.2 and of $\|r_k\|_2^2$ on the kernel window, there exists a constant $R > 0$ such that
\[
\Big\|\sum_{i=1}^{n_k} X_{ki}X_{ki}^\top\Big\|_2 \le R\, n_k, \qquad \|r_k r_k^\top\|_2 = \|r_k\|_2^2 \le R.
\]
Therefore, applying $n_k/\bar{n} \le b_0$ from Assumption 3.2,
\[
\|\mathcal{Y}_k\|_2 \le \frac{W_k}{S_h}\Big\|\sum_{i=1}^{n_k} X_{ki}X_{ki}^\top\Big\|_2\,\|r_k r_k^\top\|_2
\le \frac{R^2 n_k W_k}{S_h}
\le \frac{R^2 n_k}{\sum_{s=1}^{K} n_s}
= \frac{R^2}{K}\cdot\frac{n_k}{\bar{n}}
\le \frac{R^2 b_0}{K}.
\]
Define a quantity (to be used later)
\[
R_0 := \frac{R^2 b_0}{K}. \tag{D.4}
\]

2. Conditional matrix Chernoff. Let $M(U) := \mathbb{E}[S^{(K)}\mid U_{1:K}]$ and $\mu(U) := \lambda_{\min}\big(M(U)\big)$. Applying the matrix multiplicative Chernoff bound [51] conditional on $U_{1:K}$ yields, for any $\varepsilon \in (0, 1)$,
\[
P\Big(\lambda_{\min}\big(S^{(K)}\big) \le (1 - \varepsilon)\,\mu(U)\ \Big|\ U_{1:K}\Big) \le d_0\exp\Big(-\frac{\varepsilon^2\mu(U)}{2 R_0}\Big). \tag{D.5}
\]
We now lower bound $\mu(U)$. Using $\mathbb{E}[XX^\top\mid U = u] \succeq \lambda_0 I_p$,
\[
\mathbb{E}\Big[\sum_{i=1}^{n_k} X_{ki}X_{ki}^\top\ \Big|\ U_k\Big] = n_k\,\mathbb{E}[XX^\top\mid U_k] \succeq n_k\lambda_0 I_p.
\]
Hence,
\[
M(U) = \mathbb{E}\big[S^{(K)}\mid U_{1:K}\big] = \sum_{k=1}^{K}\frac{W_k}{S_h}\,\mathbb{E}\Big[\sum_{i=1}^{n_k} X_{ki}X_{ki}^\top\ \Big|\ U_k\Big] \otimes \big(r_k r_k^\top\big)
\succeq \lambda_0 I_p \otimes \sum_{k=1}^{K}\frac{n_k W_k}{S_h}\, r_k r_k^\top.
\]
Therefore,
\[
\mu(U) \ge \lambda_0\,\lambda_{\min}\Big(\sum_{k=1}^{K}\alpha_k\, r_k r_k^\top\Big), \qquad \alpha_k := \frac{n_k W_k}{S_h}. \tag{D.6}
\]

3. Concentration of the moment matrix. Define the unnormalized $(l+1)\times(l+1)$ matrix
\[
G := \sum_{k=1}^{K} n_k W_k\, r_k r_k^\top, \qquad\text{so that}\qquad \sum_{k=1}^{K}\alpha_k\, r_k r_k^\top = \frac{G}{S_h}.
\]
Hence
\[
\lambda_{\min}\Big(\sum_{k=1}^{K}\alpha_k\, r_k r_k^\top\Big) = \frac{\lambda_{\min}(G)}{S_h}. \tag{D.7}
\]
We first lower bound $\lambda_{\min}(G)$. Let $T_k := n_k W_k\, r_k r_k^\top \succeq 0$, so $G = \sum_{k=1}^{K} T_k$. Because $W_k$ and $r_k$ depend only on $U_k$, the matrices $\{T_k\}$ are independent. Moreover, on the window, $\|r_k\|_2^2 \le R$ and $n_k \le b_0\bar{n}$, so
\[
\lambda_{\max}(T_k) \le n_k\|r_k\|_2^2 \le b_0\bar{n} R =: R_A.
\]
Let $M_G := \mathbb{E}[G] = \sum_{k=1}^{K} n_k\,\mathbb{E}[W_k\, r_k r_k^\top]$ and $\mu_G := \lambda_{\min}(M_G)$. Using the density lower bound on $[u_0 - h, u_0 + h]$ and the change of variables $t = (u - u_0)/h$,
\[
\mathbb{E}[W_k\, r_k r_k^\top] = \int_{u_0 - h}^{u_0 + h}\Phi_l\Big(\frac{u - u_0}{h}\Big)\Phi_l\Big(\frac{u - u_0}{h}\Big)^\top f_U(u)\,du
\succeq \frac{a_0' h}{\gamma}\int_{-1}^{1}\Phi_l(t)\Phi_l(t)^\top dt.
\]
There exists $\kappa_0 > 0$ such that $\lambda_{\min}\big(\int_{-1}^{1}\Phi_l(t)\Phi_l(t)^\top dt\big) \ge \kappa_0$. Then
\[
\mu_G \ge \frac{a_0' h}{\gamma}\,\kappa_0\sum_{k=1}^{K} n_k = \frac{a_0' h}{\gamma}\,\kappa_0\, K\bar{n}. \tag{D.8}
\]
Applying the matrix Chernoff bound to $G = \sum_{k=1}^{K} T_k$ yields, for any $\eta \in (0, 1)$,
\[
P\big\{\lambda_{\min}(G) \le (1 - \eta)\,\mu_G\big\} \le (l+1)\exp\Big(-\frac{\eta^2\mu_G}{2 R_A}\Big) \le (l+1)\exp\Big(-\frac{c_1 Kh}{\gamma}\Big) \tag{D.9}
\]
for some constant $c_1 > 0$.

4. Upper bound on $S_h$. We also control $S_h = \sum_{k=1}^{K} n_k W_k$ from above. Note that $0 \le n_k W_k \le n_k \le b_0\bar{n}$ and the $\{W_k\}$ are independent. Moreover, by the density upper bound in Assumption 3.2,
\[
\mathbb{E}[W_k] = P\big(|U_k - u_0| \le h\big) = \int_{u_0 - h}^{u_0 + h} f_U(u)\,du \le \frac{2 a_0 h}{\gamma}.
\]
Thus
\[
\mathbb{E}[S_h] = \sum_{k=1}^{K} n_k\,\mathbb{E}[W_k] \le \frac{2 a_0 h}{\gamma}\sum_{k=1}^{K} n_k = \frac{2 a_0 h}{\gamma}\, K\bar{n}.
\]
A Chernoff bound gives that for any $t \in (0, 1)$,
\[
P\big\{S_h \ge (1 + t)\,\mathbb{E}[S_h]\big\} \le \exp\Big(-\frac{c_2 Kh}{\gamma}\Big) \tag{D.10}
\]
for some constant $c_2 > 0$.

5. Probability of the "good event".
Let
\[
\mathcal{E} := \big\{\lambda_{\min}(G) \ge (1 - \eta)\,\mu_G\big\} \cap \big\{S_h \le (1 + t)\,\mathbb{E}[S_h]\big\}.
\]
On $\mathcal{E}$, using (D.7), (D.8), and $\mathbb{E}[S_h] \le (2 a_0 h/\gamma)\, K\bar{n}$,
\[
\lambda_{\min}\Big(\sum_{k=1}^{K}\alpha_k\, r_k r_k^\top\Big) = \frac{\lambda_{\min}(G)}{S_h} \ge \frac{(1 - \eta)\,\mu_G}{(1 + t)\,\mathbb{E}[S_h]} \ge \frac{1 - \eta}{1 + t}\cdot\frac{a_0'}{2 a_0}\,\kappa_0 =: \tilde{\kappa}_0.
\]
Combining with (D.6) gives
\[
\mu(U) \ge \lambda_0\tilde{\kappa}_0 \quad\text{on }\mathcal{E}. \tag{D.11}
\]
Moreover, by the union bound and (D.9)–(D.10),
\[
P(\mathcal{E}^c) \le C_0\exp\Big(-\frac{c_0 Kh}{\gamma}\Big). \tag{D.12}
\]

6. Conclusion. Let $\mathcal{A} := \big\{\lambda_{\min}(S^{(K)}) \le (1 - \varepsilon)\,\lambda_0\tilde{\kappa}_0\big\}$. By the tower property,
\[
P(\mathcal{A}) = \mathbb{E}\big[P(\mathcal{A}\mid U_{1:K})\big] \le \mathbb{E}\big[P(\mathcal{A}\mid U_{1:K})\,\mathbf{1}(\mathcal{E})\big] + P(\mathcal{E}^c).
\]
On $\mathcal{E}$, $\mu(U) \ge \lambda_0\tilde{\kappa}_0$ by (D.11); hence, by (D.5),
\[
P(\mathcal{A}\mid U_{1:K})\,\mathbf{1}(\mathcal{E}) \le d_0\exp\Big(-\frac{\varepsilon^2\mu(U)}{2 R_0}\Big)\mathbf{1}(\mathcal{E}) \le d_0\exp\Big(-\frac{\varepsilon^2\lambda_0\tilde{\kappa}_0}{2 R_0}\Big)\mathbf{1}(\mathcal{E}).
\]
Recalling $R_0 = R^2 b_0/K$ from (D.4), the exponent equals $-c_3 K$ for some $c_3 > 0$. Therefore,
\[
P(\mathcal{A}) \le d_0 e^{-c_3 K} + P(\mathcal{E}^c) \le d_0 e^{-c_3 K} + C_0\exp\Big(-\frac{c_0 Kh}{\gamma}\Big) \le \exp\Big(-\frac{C' Kh}{\gamma}\Big),
\]
and we have shown that there exist constants $C, C' > 0$ such that
\[
P\big\{\lambda_{\min}\big(S^{(K)}\big) \le C\big\} \le \exp\Big(-\frac{C' Kh}{\gamma}\Big).
\]
If $Kh/\gamma \ge (M_0/C')\log(K)$ for some constant $M_0 > 1$, then, for some constant $k_0$,
\[
\sum_{K = k_0}^{\infty} P\big\{\lambda_{\min}\big(S^{(K)}\big) \le C\big\} \le \sum_{K = k_0}^{\infty}\exp\Big(-\frac{C' Kh}{\gamma}\Big) \le \sum_{K = k_0}^{\infty}\exp\big(-M_0\log(K)\big) = \sum_{K = k_0}^{\infty} K^{-M_0} < \infty.
\]
Applying the Borel–Cantelli lemma proves the almost sure result $\lambda_{\min}(Z^\top WZ) \ge C$, which further implies $\|A_l(Z^\top WZ)^{-1}A_l^\top\|_2 \le \tilde{C}$.

E Additional Simulation Results for Section 4

This section presents supplementary simulation results corresponding to Section 4. The MSE ± standard deviation for the analysis in Section 5.1, computed across different random seeds, is reported in Table 1. The cross-entropy loss ± standard deviation for the analysis in Section 5.2, also computed across different random seeds, is reported in Table 2.
We observe that the transfer learning (TL) estimator exhibits adaptivity, effectively selecting the better-performing estimator between the two baselines across different settings.

Table 1: Estimation performance across U values for experiments in Section 5.1. Entries are mean ± std.

U    | DVCM            | GLR             | TL
0.05 | 0.1173 ± 0.0146 | 0.1084 ± 0.0158 | 0.1092 ± 0.0151
0.15 | 0.1441 ± 0.0102 | 0.1438 ± 0.0116 | 0.1450 ± 0.0103
0.25 | 0.1710 ± 0.0085 | 0.1599 ± 0.0091 | 0.1599 ± 0.0095
0.35 | 0.1630 ± 0.0130 | 0.1561 ± 0.0142 | 0.1581 ± 0.0132
0.45 | 0.1771 ± 0.0176 | 0.1765 ± 0.0200 | 0.1757 ± 0.0196
0.55 | 0.1371 ± 0.0086 | 0.1390 ± 0.0092 | 0.1392 ± 0.0093
0.65 | 0.1516 ± 0.0224 | 0.1538 ± 0.0199 | 0.1537 ± 0.0232
0.75 | 0.1827 ± 0.0347 | 0.1840 ± 0.0357 | 0.1864 ± 0.0351
0.85 | 0.1810 ± 0.0234 | 0.1835 ± 0.0237 | 0.1807 ± 0.0220
0.95 | 0.0910 ± 0.0448 | 0.1258 ± 0.0615 | 0.0961 ± 0.0535

Table 2: Estimation performance across U values for experiments in Section 5.2. Entries are mean ± std.

U    | DVCM            | GLR             | TL
0.05 | 0.1302 ± 0.0033 | 0.0246 ± 0.0046 | 0.0258 ± 0.0050
0.15 | 0.2811 ± 0.0052 | 0.2222 ± 0.0077 | 0.2237 ± 0.0076
0.25 | 0.4453 ± 0.0069 | 0.4444 ± 0.0089 | 0.4439 ± 0.0080
0.35 | 0.5184 ± 0.0073 | 0.5120 ± 0.0071 | 0.5123 ± 0.0074
0.45 | 0.5486 ± 0.0089 | 0.5405 ± 0.0093 | 0.5404 ± 0.0091
0.55 | 0.5549 ± 0.0062 | 0.5406 ± 0.0066 | 0.5402 ± 0.0066
0.65 | 0.5355 ± 0.0136 | 0.5331 ± 0.0114 | 0.5333 ± 0.0121
0.75 | 0.4708 ± 0.0214 | 0.4645 ± 0.0231 | 0.4662 ± 0.0237
0.85 | 0.4124 ± 0.0217 | 0.3934 ± 0.0246 | 0.3967 ± 0.0266
0.95 | 0.3462 ± 0.0229 | 0.3340 ± 0.0290 | 0.3308 ± 0.0351

References

[1] Q. Yang, Y. Zhang, W. Dai, and S. J. Pan, Transfer Learning. Cambridge University Press, 2020.
[2] L. Shao, F. Zhu, and X. Li, "Transfer learning for visual categorization: A survey," IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 5, pp. 1019–1034, 2014.
[3] C. Cai, S. Wang, Y. Xu, W. Zhang, K. Tang, Q. Ouyang, L. Lai, and J. Pei, "Transfer learning for drug discovery," Journal of Medicinal Chemistry, vol. 63, no. 16, pp. 8683–8694, 2020.
[4] S. J.
Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2009.
[5] T. Hastie and R. Tibshirani, "Varying-coefficient models," Journal of the Royal Statistical Society Series B: Statistical Methodology, vol. 55, no. 4, pp. 757–779, 1993.
[6] Z. Cai, J. Fan, and R. Li, "Efficient estimation and inferences for varying-coefficient models," Journal of the American Statistical Association, vol. 95, no. 451, pp. 888–902, 2000.
[7] A. Chen, A. B. Owen, and M. Shi, "Data enriched linear regression," Electronic Journal of Statistics, vol. 9, no. 1, pp. 1078–1112, 2015. [Online]. Available: https://doi.org/10.1214/15-EJS1027
[8] D. Obst, B. Ghattas, S. Claudel, J. Cugliari, Y. Goude, and G. Oppenheim, "Improved linear regression prediction by transfer learning," Computational Statistics & Data Analysis, vol. 174, p. 107499, 2022.
[9] S. Li, T. T. Cai, and H. Li, "Transfer learning for high-dimensional linear regression: Prediction, estimation and minimax optimality," Journal of the Royal Statistical Society Series B: Statistical Methodology, vol. 84, no. 1, pp. 149–173, 2022.
[10] Y. Tian and Y. Feng, "Transfer learning under high-dimensional generalized linear models," Journal of the American Statistical Association, pp. 1–14, 2022.
[11] S. Li, T. T. Cai, and H. Li, "Transfer learning in large-scale gaussian graphical models with false discovery rate control," Journal of the American Statistical Association, vol. 118, no. 543, pp. 2171–2183, 2023.
[12] J. Chen, D. Huang, L. Wang, K. L. Lunetta, D. Mukherjee, and H. Cheng, "Transfer learning under high-dimensional graph convolutional regression model for node classification," arXiv preprint arXiv:2405.16672, 2024.
[13] J. Zhao, S. Zheng, and C. Leng, "Residual importance weighted transfer learning for high-dimensional linear regression," arXiv preprint, 2023.
[14] T. T. Cai and H.
Wei, "Transfer learning for nonparametric classification: Minimax rate and adaptive classifier," The Annals of Statistics, vol. 49, no. 1, 2021.
[15] H. W. Reeve, T. I. Cannings, and R. J. Samworth, "Adaptive transfer learning," The Annals of Statistics, vol. 49, no. 6, pp. 3618–3649, 2021.
[16] S. Maity, D. Dutta, J. Terhorst, Y. Sun, and M. Banerjee, "A linear adjustment-based approach to posterior drift in transfer learning," Biometrika, p. asad029, 07 2023. [Online]. Available: https://doi.org/10.1093/biomet/asad029
[17] J. Fan, C. Gao, and J. M. Klusowski, "Robust transfer learning with unreliable source data," arXiv preprint arXiv:2310.04606, 2023.
[18] C. Scott, "A generalized neyman-pearson criterion for optimal domain adaptation," in Algorithmic Learning Theory. PMLR, 2019, pp. 738–761.
[19] T. T. Cai and H. Pu, "Transfer learning for nonparametric regression: Non-asymptotic minimax analysis and adaptive procedure," arXiv preprint, 2022.
[20] C. Wang, C. Wang, X. He, and X. Feng, "Minimax optimal transfer learning for kernel-based nonparametric regression," arXiv preprint, 2023.
[21] T. T. Cai, D. Kim, and H. Pu, "Transfer learning for functional mean estimation: Phase transition and adaptive algorithms," The Annals of Statistics, vol. 52, no. 2, pp. 654–678, 2024.
[22] C. Cai, T. T. Cai, and H. Li, "Transfer learning for contextual multi-armed bandits," The Annals of Statistics, vol. 52, no. 1, pp. 207–232, 2024.
[23] A. Auddy, T. T. Cai, and A. Chakraborty, "Minimax and adaptive transfer learning for nonparametric classification under distributed differential privacy constraints," arXiv preprint arXiv:2406.20088, 2024.
[24] E. Chen, X. Chen, and W. Jing, "Data-driven knowledge transfer in batch q* learning," Journal of the American Statistical Association, pp. 1–25, 2025.
[25] E. Chen, S. Li, and M. I.
Jordan, "Transfer q-learning for finite-horizon markov decision processes," Electronic Journal of Statistics, vol. 19, no. 2, pp. 5289–5312, 2025.
[26] C. Qin, J. Xie, T. Li, and Y. Bai, "An adaptive transfer learning framework for functional classification," Journal of the American Statistical Association, vol. 120, no. 550, pp. 1201–1213, 2025.
[27] B. Zhao, C. Ma, and M. Kolar, "Trans-glasso: A transfer learning approach to precision matrix estimation," Journal of the American Statistical Association, pp. 1–21, 2025.
[28] A. Jalan, Y. Jedra, A. Mazumdar, S. S. Mukherjee, and P. Sarkar, "Optimal transfer learning for missing not-at-random matrix completion," arXiv preprint, 2025.
[29] M. M. Kalan, E. J. Neugut, and S. Kpotufe, "Transfer neyman-pearson algorithm for outlier detection," arXiv preprint, 2025.
[30] Y. Yan, Q. Ma, R. Zhang, and X. Wang, "Transfer learning for high-dimensional data with heavy-tailed noise: A sparse convoluted rank regression method," Statistics and Computing, vol. 36, no. 1, p. 45, 2026.
[31] Z. Shang, P. Sang, and C. Jin, "Bootstrap nonparametric inference under data integration," arXiv preprint arXiv:2501.01610, 2025.
[32] J. Chai, E. Chen, and J. Fan, "Deep transfer q-learning for offline non-stationary reinforcement learning," arXiv preprint, 2025.
[33] J. Chai, E. Chen, and L. Yang, "Transition transfer q-learning for composite markov decision processes," arXiv preprint arXiv:2502.00534, 2025.
[34] M. Bussas, C. Sawade, N. Kühn, T. Scheffer, and N. Landwehr, "Varying-coefficient models for geospatial transfer learning," Machine Learning, vol. 106, pp. 1419–1440, 2017.
[35] P. McCullagh, Generalized Linear Models. Routledge, 2019.
[36] D. Ruppert and M. P. Wand, "Multivariate locally weighted least squares regression," The Annals of Statistics, pp. 1346–1370, 1994.
[37] E. Nadaraya, "On estimating regression," Theory of Probability & Its Applications, vol. 9, no. 1, pp.
141–142, 1964.
[38] T. Gasser and H.-G. Müller, "Estimating regression functions and their derivatives by the kernel method," Scandinavian Journal of Statistics, pp. 171–185, 1984.
[39] J. Fan, M. Farmen, and I. Gijbels, "Local maximum likelihood estimation and inference," Journal of the Royal Statistical Society Series B: Statistical Methodology, vol. 60, no. 3, pp. 591–608, 1998.
[40] A. Tsybakov, Introduction to Nonparametric Estimation, ser. Springer Series in Statistics. Springer New York, 2008. [Online]. Available: https://books.google.com/books?id=mwB8rUBsbqoC
[41] P. Hall, "Effect of bias estimation on coverage accuracy of bootstrap confidence intervals for a probability density," The Annals of Statistics, pp. 675–694, 1992.
[42] S. Calonico, M. D. Cattaneo, and M. H. Farrell, "On the effect of bias estimation on coverage accuracy in nonparametric inference," Journal of the American Statistical Association, vol. 113, no. 522, pp. 767–779, 2018.
[43] J. Fan and W. Zhang, "Statistical estimation in varying coefficient models," The Annals of Statistics, vol. 27, no. 5, pp. 1491–1518, 1999.
[44] J. Fox, Applied Regression Analysis and Generalized Linear Models. Sage Publications, 2015.
[45] M. Yurochkin, A. Bower, and Y. Sun, "Training individually fair ml models with sensitive subspace robustness," in International Conference on Learning Representations, 2020.
[46] D. Mukherjee, M. Yurochkin, M. Banerjee, and Y. Sun, "Two simple ways to learn individual fairness metrics from data," in International Conference on Machine Learning. PMLR, 2020, pp. 7097–7107.
[47] M. Yurochkin and Y. Sun, "Sensei: Sensitive set invariance for enforcing individual fairness," in International Conference on Learning Representations, 2021.
[48] J. Schmidt-Hieber, "Nonparametric regression using deep neural networks with relu activation function," The Annals of Statistics, vol. 48, no. 4, p. 1875, 2020.
[49] M. Kohler and S.
Langer, "On the rate of convergence of fully connected deep neural network regression estimates," The Annals of Statistics, vol. 49, no. 4, pp. 2231–2249, 2021.
[50] C. M. Theobald, "Generalizations of mean square error applied to ridge regression," Journal of the Royal Statistical Society Series B: Statistical Methodology, vol. 36, no. 1, pp. 103–106, 1974.
[51] J. A. Tropp, "An introduction to matrix concentration inequalities," Foundations and Trends® in Machine Learning, vol. 8, no. 1-2, pp. 1–230, 2015. [Online]. Available: http://dx.doi.org/10.1561/2200000048