Convergence Analysis of Two-Layer Neural Networks under Gaussian Input Masking


Authors: Afroditi Kolomvaki, Fangshuo Liao, Evan Dramko, Ziyun Guang, Anastasios Kyrillidis

Afroditi Kolomvaki (ak203@rice.edu), Fangshuo Liao (fangshuo.liao@rice.edu), Evan Dramko (ed55@rice.edu), Ziyun Guang (cg105@rice.edu), Anastasios Kyrillidis (anastasios@rice.edu)
Computer Science Dept., Rice University, Houston, TX, USA

Abstract

We investigate the convergence guarantee of two-layer neural network training with Gaussian randomly masked inputs. This scenario corresponds to Gaussian dropout at the input level, or to the noisy-input training common in sensor networks, privacy-preserving training, and federated learning, where each user may have access to only partial or corrupted features. Using a Neural Tangent Kernel (NTK) analysis, we demonstrate that training a two-layer ReLU network with Gaussian randomly masked inputs achieves linear convergence up to an error region proportional to the mask's variance. A key technical contribution is resolving the randomness within the non-linear activation, a problem of independent interest.

1 Introduction

Neural networks (NNs) have revolutionized AI applications, where their success largely stems from their ability to learn complex patterns when trained on well-curated datasets (Schuhmann et al., 2022; Li et al., 2023b; Gunasekar et al., 2023; Edwards, 2024). A key component of this success is the ability of NNs to model a broad range of tasks and data distributions under various scenarios. Empirical evidence suggests that neural networks can learn even under noisy inputs (Kariotakis et al., 2024), gradient noise (Ruder, 2017), as well as modifications to the internal representations during training (Srivastava et al., 2014; Yuan et al., 2022).
Leveraging this ability of neural networks, many real-world deployments adopt a modification of the data representations during training to achieve particular goals such as robustness, privacy, or efficiency. Among these methods, perturbing the representations with additive noise has been studied by a number of prior works (Gao et al., 2019; Li et al., 2025; 2023a; Madry et al., 2018; Loo et al., 2022; Tsilivis and Kempe, 2022; Ilyas et al., 2019), showcasing both the benefit of such perturbation and the stable convergence of training under this setting. Compared with additive noise, perturbing the representations by multiplying them with a mask has rarely been studied theoretically. Multiplicative noise on the representations appears in many real-world settings, either by design or unintentionally. For instance, in federated learning (FL) settings (McMahan et al., 2017; Kairouz et al., 2021), particularly vertical FL (Cheng et al., 2020; Liu et al., 2021; Romanini et al., 2021; He et al., 2020; Liu et al., 2022; 2024), different features of the input data may be available to different parties, effectively creating a form of sparsity-inducing multiplicative masking on the input space. Moreover, the dropout family (Srivastava et al., 2014; Rey and Mnih, 2021) is a class of methods that prevent overfitting and improve the generalization ability of neural networks during training. Lastly, training models under a data-parallel protocol over a wireless channel incurs a channel effect that blurs the data passed to the workers through a multiplication (Tse and Viswanath, 2005). Theoretically analyzing the training dynamics of neural networks under these settings is difficult, especially when the introduced randomness is intertwined with the nonlinearity of the activation function.
While previous work has studied the convergence of neural network training under dropout (Liao and Kyrillidis, 2022; Mianjy and Arora, 2020), it typically assumes that the dropout happens after the nonlinear activations are applied. From a technical perspective, the statistics of the neural network outputs are then easier to handle, since the randomness is not affected by the nonlinearity. In this paper, we take a step further toward understanding multiplicative perturbations in neural network training by considering noise applied before the nonlinear activation. In particular, the setting we consider is the training of a two-layer MLP where the input bears a multiplicative Gaussian mask. This prototype provides a simplified scenario for studying the noise-inside-activation difficulty, while generalizing various training scenarios ranging from input masking (Kariotakis et al., 2024) to Gaussian dropout (Rey and Mnih, 2021), if one views the input in our setting as fixed embeddings from previous layers of a deep neural network. Under this setting, we aim to answer the following question:

How do multiplicative perturbations at the input level propagate through the network and affect the training dynamics?

Our Contributions. Analyzing the training dynamics under Gaussian masks over the input means that we have to study the statistical properties of random variables inside a non-linear function. Our work takes a step towards resolving this technical difficulty. Moreover, we utilize an NTK-based analysis (Du et al., 2018; Song and Yang, 2020; Oymak and Soltanolkotabi, 2019; Liao and Kyrillidis, 2022) to study the training convergence of the two-layer MLP under sufficient overparameterization. To our knowledge, this work provides the first convergence analysis for neural network training under Gaussian multiplicative input masking.
Specifically, for inputs $x$ masked as $x \odot c$ where $c \sim \mathcal{N}(\mathbf{1}, \kappa^2 I)$, we prove that: i) the expected loss decomposes into a smoothed neural network loss plus an adaptive regularization term; ii) training achieves linear convergence to an error ball of radius $O(\kappa)$. Empirical results showcase and support our theory. Our main contributions are summarized as follows:

• Theoretical Analysis of Input Masking. We provide a rigorous characterization of the training dynamics of two-layer ReLU networks where noise is injected before the non-linear activation. We overcome the technical challenge of resolving the expectation of non-linear functions of random variables, proving that the expected loss decomposes into a smoothed objective plus an adaptive, data-dependent regularizer.

• General Stochastic Training Framework. We develop a general convergence theorem for overparameterized neural networks trained with biased stochastic gradient estimators. This result, which establishes linear convergence to a noise-dependent error ball, is of independent interest beyond the specific setting of Gaussian masking.

• Explicit Convergence Guarantees. We derive constructive bounds for the convergence rate and the final error radius. We show explicitly how these quantities depend on the mask variance $\kappa^2$, the network width $m$, and the initialization scale, demonstrating that training converges linearly up to a floor determined by the noise level.

• Empirical Validation and Privacy Utility. We confirm our theoretical predictions regarding the expected gradient and loss landscape through simulations.
Furthermore, we demonstrate the practical utility of this training regime as a defense against Membership Inference Attacks (MIA), highlighting a favorable trade-off between privacy and utility.

2 Related Work

Neural Network Robustness. The study of neural network robustness has a rich history, with early work focusing primarily on additive perturbations. Results such as (Bartlett et al., 2017) and (Miyato et al., 2018) established generalization bounds for neural networks under adversarial perturbations, showing that the network's Lipschitz constant plays a crucial role in determining robustness. Subsequent work by (Cohen et al., 2019) introduced randomized smoothing techniques for certified robustness against $\ell_2$ perturbations, while (Wong et al., 2018) developed methods for training provably robust deep neural networks. Regularization techniques have emerged as powerful tools for enhancing network robustness. Dropout (Srivastava et al., 2014) pioneered the idea of randomly masking internal neurons during training, effectively creating an implicit ensemble of subnetworks (Yuan et al., 2022; Hu et al., 2023; Kariotakis et al., 2024; Wolfe et al., 2023; Liao and Kyrillidis, 2022; Dun et al., 2023; 2022). The connection between feature masking and regularization was further explored in (Ghorbani et al., 2021), who showed that dropout can be interpreted as a form of data-dependent regularization. Note that sparsity-inducing norms, based on the continuous Laplacian distribution, have a long history in sparse recovery problems (Bach et al., 2011; Jenatton et al., 2011; Bach et al., 2012; Kyrillidis et al., 2015). Empirical studies on the effect of sparsity, represented by multiplicative Bernoulli distributions, can be found in (Kariotakis et al., 2024).

Neural Tangent Kernel (NTK). Jacot et al. (2018) discovered that an infinite-width neural network evolves as a Gaussian process with a stable kernel computed from the outer product of the tangent features of the network. Later works adopted finite-width corrections and applied the framework to the analysis of neural network convergence (Du et al., 2018; 2019b; Oymak and Soltanolkotabi, 2019). The NTK framework is one of the few theoretical tools focused on a theoretical understanding of neural network training. Later works extended the proofs to classification tasks, where the tangent features are viewed as a feature mapping onto a space where the training data are separable (Ji and Telgarsky, 2020). Although the NTK regime has been characterized as "lazy training" that prevents useful features from being learned, it enables exact analysis of neural network training dynamics under various scenarios and for different architectures (Nguyen, 2021; Du et al., 2019a; Truong, 2025; Wu et al., 2023). Based on the NTK framework, several papers study the convergence of training shallow neural networks under random neuron masking (e.g., dropout (Srivastava et al., 2014)) (Liao and Kyrillidis, 2022; Mianjy and Arora, 2020). However, these works usually consider the masking applied after the nonlinearity, which allows direct computation of the statistics of the output and gradient under the randomness.

3 Problem Setup

Given a dataset $\{(x_i, y_i)\}_{i=1}^n$, we are interested in training a neural network $f(\theta, \cdot)$ that maps each input $x_i \in \mathbb{R}^d$ to an output $f(\theta, x_i)$ that fits the label $y_i \in \mathbb{R}$. We consider $f(\theta, \cdot)$ to be a two-layer ReLU-activated multi-layer perceptron (MLP) under the NTK scaling:

$$f(\theta, x) = \frac{1}{\sqrt{m}} \sum_{r=1}^m a_r \, \sigma\!\left(w_r^\top x\right),$$

where $\theta = (\{w_r\}_{r=1}^m, \{a_r\}_{r=1}^m)$ denotes the neural network parameters, and $\sigma(\cdot) = \max\{0, \cdot\}$ denotes the ReLU activation function.
We assume that the second-layer weights $a_r \in \{\pm 1\}$ are fixed, and only the first-layer weights $w_r$ are trainable. Thus, we write $f(W, x) \equiv f(\theta, x)$ with $W \in \mathbb{R}^{m \times d}$, unless otherwise stated. This neural network setup has been studied widely in previous works (Du et al., 2018). We consider training the neural network by minimizing the MSE loss $L(W)$ over the dataset $\{(x_i, y_i)\}_{i=1}^n$:

$$L(W) = \frac{1}{2} \sum_{i=1}^n \left(f(W, x_i) - y_i\right)^2.$$

Due to the ReLU activation, the loss is both non-convex and non-smooth. However, a line of previous works (Du et al., 2018; Song and Yang, 2020; Oymak and Soltanolkotabi, 2019) proves a linear convergence rate of the loss under the assumption that the number of hidden neurons is sufficiently large, by adopting an NTK-based analysis (Jacot et al., 2018). While there have been theoretical approaches and assumptions that go beyond the NTK regime, our focus is on a generalized scenario where the input data may be corrupted in each iteration by multiplicative Gaussian noise: letting $c \sim \mathcal{N}(\mathbf{1}_d, \kappa^2 I_d)$ be an isotropic Gaussian random vector centered at the all-ones vector $\mathbf{1}_d$, the neural network output is given by $f(W, x \odot c)$, where $\odot$ denotes the Hadamard (element-wise) product between two vectors. Under the multiplicative noise, the neural network is trained with gradient descent, where each gradient is computed from the surrogate loss $L_C(W)$ defined over the network with masked inputs:

$$L_C(W) = \frac{1}{2} \sum_{i=1}^n \left(f(W, x_i \odot c_i) - y_i\right)^2.$$

Here $C = \{c_i\}_{i=1}^n$ denotes the collection of masks for all inputs $x_i$, and we assume the $c_i$ are independent. In real-world applications, this scheme can be viewed as training on imprecise hardware, where each input data point is read in with noise.
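In code, the setup so far can be sketched as follows (a minimal NumPy illustration; the function names and toy data are ours, not from the paper):

```python
import numpy as np

def forward(W, a, x):
    """Two-layer ReLU MLP under NTK scaling: f(W, x) = (1/sqrt(m)) * sum_r a_r * relu(w_r^T x)."""
    return float(a @ np.maximum(W @ x, 0.0)) / np.sqrt(W.shape[0])

def mse_loss(W, a, X, y):
    """Original loss L(W) = (1/2) * sum_i (f(W, x_i) - y_i)^2."""
    preds = np.array([forward(W, a, x) for x in X])
    return 0.5 * float(np.sum((preds - y) ** 2))

def surrogate_loss(W, a, X, y, kappa, rng):
    """Surrogate loss L_C(W): each input is masked element-wise by c_i ~ N(1, kappa^2 I)."""
    C = 1.0 + kappa * rng.standard_normal(X.shape)  # one independent mask per input
    preds = np.array([forward(W, a, x * c) for x, c in zip(X, C)])
    return 0.5 * float(np.sum((preds - y) ** 2))

rng = np.random.default_rng(0)
n, d, m = 8, 5, 64
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # enforce ||x_i||_2 <= 1
y = rng.standard_normal(n)
W = rng.standard_normal((m, d))
a = rng.choice([-1.0, 1.0], size=m)
print(mse_loss(W, a, X, y), surrogate_loss(W, a, X, y, kappa=0.1, rng=rng))
```

Note that setting $\kappa = 0$ makes every mask the all-ones vector, so the surrogate loss then coincides with the original loss.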
Alternatively, one could view each $x_i$ as the output of a pre-trained large model, in which case our training scheme can be viewed as fine-tuning the last two layers with Gaussian dropout (Wang and Manning, 2013; Kingma et al., 2015; Rey and Mnih, 2021) in the intermediate layer. Let $\{W_k\}_{k=1}^K$ be generated by the stochastic gradient descent

$$W_{k+1} = W_k - \eta \nabla_W L_{C_k}(W_k), \qquad (1)$$

where $C_k$ is sampled independently in every iteration. Our goal is to study the convergence of the loss sequence $\{L(W_k)\}_{k=1}^\infty$. Notice that the loss involved in the weight update is the surrogate loss $L_{C_k}(W_k)$, but the loss for which we aim to show convergence is the original loss $L(W)$. Our setup differs from previous works in two ways. First, it is distinct from the unbiased estimators in the current literature: our gradient estimator does not enjoy such a favorable property, since the randomness is applied at the input level of the network. Second, although a line of work analyzes the convergence of vanilla dropout training on two-layer neural networks (Liao and Kyrillidis, 2022; Mianjy and Arora, 2020), in their analysis the mask is applied to the hidden neurons after the activation function. In contrast, our mask is applied directly to the input, which sits inside the non-linear function. Therefore, any analysis of the mask randomness must go through the ReLU function, which introduces technical difficulty. We assume the following property of the training data.

Assumption 3.1. The training dataset $\{(x_i, y_i)\}_{i=1}^n$ satisfies $\|x_i\|_2 \le 1$, $|y_i| \le O(1)$, and for any pair $i \ne j$, there exists no real number $q$ such that $x_i = q \cdot x_j$.

This assumption guarantees the boundedness of the dataset and that the input data are non-degenerate, which is a standard assumption in Du et al. (2018); Song and Yang (2020); Liao and Kyrillidis (2022).
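Assumption 3.1 is easy to check for a given dataset; a small sketch (ours), testing the norm bound and pairwise non-collinearity:

```python
import numpy as np

def check_assumption_3_1(X, tol=1e-12):
    """Check the input conditions of Assumption 3.1: ||x_i||_2 <= 1 and no pair x_i = q * x_j."""
    norms = np.linalg.norm(X, axis=1)
    if np.any(norms > 1.0 + tol) or np.any(norms == 0.0):
        return False
    Xn = X / norms[:, None]
    cos = np.abs(Xn @ Xn.T)
    np.fill_diagonal(cos, 0.0)
    return bool(cos.max() < 1.0 - tol)  # |cos| = 1 would mean two collinear inputs

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)            # normalize so ||x_i||_2 = 1
print(check_assumption_3_1(X))                           # True for generic random data
print(check_assumption_3_1(np.vstack([X, 0.5 * X[0]])))  # a rescaled copy of x_0 violates it
```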
4 Expectation of the Loss and Gradient under the Gaussian Mask

A formal mathematical characterization of the expected loss and gradient is essential not only in the prior literature on neural network training convergence (Liao and Kyrillidis, 2022; Mianjy and Arora, 2020) but also in the classical analysis of SGD, even in the convex domain (Shamir and Zhang, 2013; Garrigos and Gower, 2023; Tang et al., 2013). In this section, we focus on deriving the explicit form of the expected surrogate loss $\mathbb{E}_C[L_C(W)]$ and the expected surrogate gradient $\mathbb{E}_C[\nabla_W L_C(W)]$. Starting with gradient calculations, the surrogate gradient with respect to the $r$-th neuron can be written as:

$$\nabla_{w_r} L_C(W) = \frac{a_r}{\sqrt{m}} \sum_{i=1}^n \left(f(W, x_i \odot c_i) - y_i\right)(x_i \odot c_i)\, \mathbb{I}\!\left\{w_r^\top (x_i \odot c_i) \ge 0\right\}. \qquad (2)$$

Setting $c_i = \mathbf{1}$ for all $i \in [n]$ gives the gradient of the original loss $\nabla_{w_r} L(W)$. Let $\Phi_1(\cdot)$ denote the CDF of the standard (one-dimensional) Gaussian random variable, and let $\phi, \psi : \mathbb{R} \to \mathbb{R}$ be defined as:

$$\phi(x) = \exp\left(-x^2\right); \qquad \psi(x) = |x| \cdot \phi(x). \qquad (3)$$

Figure 1: (a) Effect of the noise standard deviation $\kappa$ on the shape of the smoothed activation function $\hat{\sigma}(z; \kappa) = z \cdot \Phi_1(z / (\kappa \|w \odot x\|_2))$, where $z = w^\top x$. For this visualization, $\|w \odot x\|_2$ is held constant at $1.0$. As $\kappa$ increases, the activation becomes progressively smoother compared to the standard ReLU (dotted black line); for small $\kappa$ (e.g., $\kappa = 0.01$), $\hat{\sigma}$ closely approximates the standard ReLU. (b) Theoretical smoothed activation $\hat{\sigma}(w, x)$ versus its empirical estimate $\mathbb{E}_c[\sigma(w^\top (x \odot c))]$ for a fixed pre-activation value $w^\top x \approx 0.77$ (the actual value depends on the fixed $w, x$) as the noise standard deviation $\kappa$ varies. The close match across a range of relatively small $\kappa$ values validates the theoretical model for $\hat{\sigma}$; this behavior holds consistently for different $w, x$ values.
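The surrogate gradient (2) can be checked numerically: with $c_i = \mathbf{1}$ it must reduce to the gradient of the original loss $L(W)$. A sketch (ours) comparing one coordinate against a finite difference:

```python
import numpy as np

def loss(W, a, X, y):
    """L(W) = (1/2) * sum_i (f(W, x_i) - y_i)^2 for the two-layer ReLU MLP."""
    m = W.shape[0]
    return 0.5 * float(np.sum((np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m) - y) ** 2))

def surrogate_grad(W, a, X, y, C):
    """Row-stack of the per-neuron surrogate gradients (2):
    grad_{w_r} = (a_r/sqrt(m)) * sum_i (f(W, x_i*c_i) - y_i) * (x_i*c_i) * 1{w_r^T(x_i*c_i) >= 0}."""
    m = W.shape[0]
    Xm = X * C                                     # masked inputs x_i ⊙ c_i
    pre = Xm @ W.T                                 # (n, m) pre-activations
    res = np.maximum(pre, 0.0) @ a / np.sqrt(m) - y
    return (res[:, None] * (pre > 0) * a[None, :] / np.sqrt(m)).T @ Xm

rng = np.random.default_rng(0)
n, d, m = 6, 4, 16
X = rng.standard_normal((n, d)); y = rng.standard_normal(n)
W = rng.standard_normal((m, d)); a = rng.choice([-1.0, 1.0], size=m)

# With c_i = 1, (2) is the gradient of the original loss; compare one coordinate
# against a forward finite difference of L.
G = surrogate_grad(W, a, X, y, np.ones((n, d)))
eps = 1e-6
Wp = W.copy(); Wp[0, 0] += eps
fd = (loss(Wp, a, X, y) - loss(W, a, X, y)) / eps
print(G[0, 0], fd)  # the two values should agree up to finite-difference error
```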
Observe that $\phi(x) \in (0, 1]$ and $\psi(x) \in (0, 1/\sqrt{2e}]$. Before stating the results in this section, we define the following quantities.

Definition 4.1. Fix a first-layer weight $W \in \mathbb{R}^{m \times d}$ and training data $\{(x_i, y_i)\}_{i=1}^n$. We define:

• Data-related quantities: $B_x = \max_{i \in [n]} \|x_i\|_\infty$, $B_y := \max_{i \in [n]} |y_i|$.
• Weight-related quantity (row-wise): $R_w = \max_{r \in [m]} \|w_r\|_2$.
• Mixed quantities: $R_u := \max_{r \in [m], i \in [n]} \|w_r \odot x_i\|_2$ and

$$\psi_{\max} = \max_{r \in [m], i \in [n]} \psi\!\left(\frac{w_r^\top x_i}{2\kappa \|w_r \odot x_i\|_2}\right), \qquad \phi_{\max} = \max_{r \in [m], i \in [n]} \phi\!\left(\frac{w_r^\top x_i}{2\kappa \|w_r \odot x_i\|_2}\right).$$

Expected Surrogate Loss. To start, we focus on the expected loss under the Gaussian input mask. For a fixed neural network $f(W, \cdot)$, we have the following result.

Theorem 4.2. Let $u_{i,r} = w_r \odot x_i$. Define the smoothed activation and neural network as:

$$\hat{\sigma}_\kappa(w, x) = w^\top x \cdot \Phi_1\!\left(\frac{w^\top x}{\kappa \|w \odot x\|_2}\right), \qquad \hat{f}(W, x) = \frac{1}{\sqrt{m}} \sum_{r=1}^m a_r \hat{\sigma}_\kappa(w_r, x).$$

Let $\phi_{\max}$, $\psi_{\max}$, $B_y$, $R_u$, and $R_w$ be defined as in Definition 4.1. If $B_y \le 3\sqrt{m} R_w$, then we have that:

$$\mathbb{E}_C[L_C(W)] = \mathcal{E} + \underbrace{\frac{1}{2} \sum_{i=1}^n \left(\hat{f}(W, x_i) - y_i\right)^2}_{T_1} + \underbrace{\frac{\kappa^2}{2m} \sum_{i=1}^n \left\| \sum_{r=1}^m a_r u_{i,r} \Phi_1\!\left(\frac{w_r^\top x_i}{\kappa \|u_{i,r}\|_2}\right) \right\|_2^2}_{T_2},$$

with the magnitude of $\mathcal{E}$ bounded by:

$$|\mathcal{E}| \le mn\left(\kappa^2 R_u^2 \psi_{\max}^2 + \left(\kappa^2 R_u^2 + \kappa R_w\right)\phi_{\max}^2\right). \qquad (4)$$

Remark 4.3. The core of our analysis of the expected loss involves understanding how the ReLU activation behaves under the multiplicative Gaussian input mask. Lemma D.14 in the appendix provides the analytical form for the expectation of a truncated Gaussian random variable. This leads to the definition of the smoothed activation function presented in Theorem 4.2.

Figure 2: Smoothed ReLU under multiplicative Gaussian input masking for fixed $\kappa = 0.2$, where $z = w^\top x$ and $\sigma = \kappa \|w \odot x\|_2$. (a) The exact closed-form expectation $\tilde{\sigma}(w, x) = \mathbb{E}_c[\sigma(w^\top (x \odot c))] = z\Phi(z/\sigma) + \sigma\varphi(z/\sigma)$ (as shown in (19)) matches the Monte Carlo estimate. (b) The proxy smoothed activation $\hat{\sigma}(w, x) = z\Phi(z/\sigma)$ (used in Theorem 4.2) differs mainly near $z \approx 0$ due to the missing $\sigma\varphi(z/\sigma)$ term.

Figure 1b demonstrates the correspondence between the theoretical and empirical values of this smoothed activation across a range of noise levels $\kappa$ for a fixed input $w^\top x$: being an approximation of ReLU, the two curves are expected to deviate as $\kappa$ increases, yet for small enough $\kappa$ (here, $\kappa \lesssim 0.2$) the two curves coincide. Figure 2b visually compares this theoretical smoothed activation $\hat{\sigma}$ with its empirical estimate $\mathbb{E}_c[\sigma(w^\top (x \odot c))]$ for a fixed $\kappa = 0.2$ as the input $w^\top x$ varies. The close agreement validates our analytical derivation of $\hat{\sigma}$ and illustrates its smoothing effect compared to the standard ReLU. Thus, the term $\Phi_1\!\left(\frac{w^\top x}{\kappa \|w \odot x\|_2}\right)$ can be interpreted as a smoothed version of the indicator function $\mathbb{I}\{w^\top x \ge 0\}$. To visualize the impact of the noise variance $\kappa^2$ on the shape of this smoothed activation, see Figure 1a for various values of $w, x$ (for this illustration, we fix $\|w \odot x\|_2 = 1.0$ to isolate the effect of $z = w^\top x$ and $\kappa$). As $\kappa$ increases, the transition of $\hat{\sigma}$ around the origin becomes progressively gentler compared to the sharp kink of the standard ReLU activation.

Remark 4.4. Theorem 4.2 shows that the expected loss can be approximated by the combination of the terms $T_1$ and $T_2$, with an additive error term $\mathcal{E}$.
Notice that the smoothed activation $\hat{\sigma}_\kappa(w, x)$ satisfies (see (19)):

$$\hat{\sigma}_\kappa(w, x) = w^\top x \cdot \Phi_1\!\left(\frac{w^\top x}{\kappa \|w \odot x\|_2}\right) = \mathbb{E}_{c \sim \mathcal{N}(\mathbf{1}, \kappa^2 I)}\left[\sigma\!\left(w^\top (x \odot c)\right)\right] \pm O\!\left(\kappa \|w \odot x\|_2\, \phi\!\left(\frac{w^\top x}{\kappa \|w \odot x\|_2}\right)\right).$$

Therefore, $T_1$ can be seen as the loss defined on the smoothed neural network $\hat{f}(W, \cdot)$ with the same weights and dataset.

Remark 4.5. One may notice that the form of $T_2$ is similar to the $\ell_2$ regularization in ridge regression. To understand $T_2$, first notice that:

$$\nabla_{w_r} \hat{f}(W, x_i) \approx \frac{1}{\sqrt{m}}\, a_r x_i\, \Phi_1\!\left(\frac{w_r^\top x_i}{\kappa \|u_{i,r}\|_2}\right).$$

Therefore, $T_2$ can approximately be written as:

$$T_2 \approx \frac{\kappa^2}{2} \sum_{i=1}^n \left\| \sum_{r=1}^m \nabla_{w_r} \hat{f}(W, x_i) \odot w_r \right\|_2^2 = \frac{\kappa^2}{2} \sum_{i=1}^n \sum_{j=1}^d \left( \nabla_{\hat{w}_j} f(W, x_i)^\top \hat{w}_j \right)^2 = \mathrm{vec}(W)^\top \hat{H}\, \mathrm{vec}(W).$$

Here $\hat{w}_j$ is the $j$-th row of the matrix $[w_1, \ldots, w_m] \in \mathbb{R}^{d \times m}$, $\mathrm{vec}(W) = \mathrm{concat}(\hat{w}_1, \ldots, \hat{w}_d)$ is the concatenation of the $\hat{w}_j$'s, and $\hat{H} \in \mathbb{R}^{md \times md}$ is the block-diagonal matrix whose $j$-th diagonal block is $\hat{H}_j := \sum_{i=1}^n \nabla_{\hat{w}_j} f(W, x_i) \nabla_{\hat{w}_j} f(W, x_i)^\top \in \mathbb{R}^{m \times m}$ for $j \in [d]$. Intuitively, $\hat{H}$ can be seen as a matrix consisting of outer products of the tangent features (Baratin et al., 2021; LeJeune and Alemohammad, 2024). As a result, $T_2$ can be seen as a regularization term on $W$ in the norm defined by the tangent-feature outer-product matrix $\hat{H}$.

Remark 4.6. The magnitude of $\mathcal{E}$ is given in (4). At first glance, one sees that the term decreases monotonically as $\kappa$ decreases, implying a smaller error when $\kappa$ is small. As discussed at the beginning of this section, $\phi_{\max}$ and $\psi_{\max}$ are upper bounded by constants. Therefore, in the worst case we have $|\mathcal{E}| \le O\left(mn\left(\kappa^2 R_u + \kappa R_w\right)\right)$, which scales linearly with $\kappa$. Below, we sketch the proof of Theorem 4.2; the full proof is deferred to Appendix B.2.

Proof sketch.
Our proof starts with the decomposition of the expected loss as:

$$\mathbb{E}_C[L_C(W)] = \frac{1}{2} \sum_{i=1}^n \mathbb{E}_C\left[\left(f(W, x_i \odot c_i)\right)^2\right] + \frac{1}{2} \sum_{i=1}^n y_i^2 - \sum_{i=1}^n y_i\, \mathbb{E}_C\left[f(W, x_i \odot c_i)\right].$$

It boils down to analyzing the terms $\mathbb{E}_C[(f(W, x_i \odot c_i))^2]$ and $\mathbb{E}_C[f(W, x_i \odot c_i)]$. Plugging in $f(W, x_i \odot c_i)$, it suffices to analyze the following expectations:

$$E_1 = \mathbb{E}_c\left[\sigma\!\left(w_r^\top (x \odot c)\right) \sigma\!\left(w_{r'}^\top (x \odot c)\right)\right]; \qquad E_2 = \mathbb{E}_c\left[\sigma\!\left(w_r^\top (x \odot c)\right)\right].$$

The trick in evaluating $E_1$ and $E_2$ is to notice that $w_r^\top (x \odot c) = c^\top (w_r \odot x)$. Since $c \sim \mathcal{N}(\mathbf{1}, \kappa^2 I)$, we must have $c^\top (w_r \odot x) \sim \mathcal{N}\!\left(w_r^\top x, \kappa^2 \|w_r \odot x\|_2^2\right)$. Therefore, defining $z_1 = c^\top (w_r \odot x)$ and $z_2 = c^\top (w_{r'} \odot x)$, the problem of evaluating $E_1$ and $E_2$ becomes computing:

$$E_1 = \mathbb{E}_{z_1, z_2}\left[z_1 z_2 \mathbb{I}\{z_1 \ge 0; z_2 \ge 0\}\right]; \qquad E_2 = \mathbb{E}_{z_1}\left[z_1 \mathbb{I}\{z_1 \ge 0\}\right].$$

Here $\mathrm{Cov}(z_1, z_2) = \kappa^2 (w_r \odot x_i)^\top (w_{r'} \odot x_i)$. To complete the proof, we prove the following two lemmas.

Lemma 4.7. Let $z_1 \sim \mathcal{N}(\mu_1, \kappa_1^2)$. Then, we have that:

$$\mathbb{E}\left[z_1 \mathbb{I}\{z_1 \ge 0\}\right] = \frac{\kappa_1}{\sqrt{2\pi}} \exp\left(-\frac{\mu_1^2}{2\kappa_1^2}\right) + \mu_1 \Phi_1\!\left(\frac{\mu_1}{\kappa_1}\right).$$

Lemma 4.8. Let $z_1 \sim \mathcal{N}(\mu_1, \kappa_1^2)$ and $z_2 \sim \mathcal{N}(\mu_2, \kappa_2^2)$, with $\mathrm{Cov}(z_1, z_2) = \kappa_1 \kappa_2 \rho$. Let $\Phi_2(a, b, \rho)$ denote the joint CDF of standard Gaussian random variables $\hat{z}_1, \hat{z}_2$ with covariance $\rho$, evaluated at $\hat{z}_1 = a, \hat{z}_2 = b$. Then, we have:

$$\mathbb{E}\left[z_1 z_2 \mathbb{I}\{z_1 \ge 0; z_2 \ge 0\}\right] = (\mu_1 \mu_2 + \kappa_1 \kappa_2 \rho)\, \Phi_2\!\left(\frac{\mu_1}{\kappa_1}, \frac{\mu_2}{\kappa_2}, \rho\right) + \frac{1}{\sqrt{2\pi}}\left(\kappa_1 \mu_2 T_1 + \kappa_2 \mu_1 T_2\right) + \frac{\kappa_1 \kappa_2 \sqrt{1 - \rho^2}}{2\pi} \exp\!\left(-\frac{1}{2(1 - \rho^2)}\left(\frac{\mu_1^2}{\kappa_1^2} - \frac{2\rho \mu_1 \mu_2}{\kappa_1 \kappa_2} + \frac{\mu_2^2}{\kappa_2^2}\right)\right).$$

Here, $T_1, T_2$ are defined as:

$$T_1 = \exp\left(-\frac{\mu_1^2}{2\kappa_1^2}\right) \Phi_1\!\left(\frac{1}{\sqrt{1 - \rho^2}}\left(\frac{\mu_2}{\kappa_2} - \frac{\rho \mu_1}{\kappa_1}\right)\right); \qquad T_2 = \exp\left(-\frac{\mu_2^2}{2\kappa_2^2}\right) \Phi_1\!\left(\frac{1}{\sqrt{1 - \rho^2}}\left(\frac{\mu_1}{\kappa_1} - \frac{\rho \mu_2}{\kappa_2}\right)\right).$$
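Lemma 4.7 is the mean of a rectified Gaussian; applied to $z = c^\top(w \odot x)$ with $\mu_1 = w^\top x$ and $\kappa_1 = \kappa\|w \odot x\|_2$, it recovers the exact smoothed activation $\tilde{\sigma}$ of Figure 2(a). A quick Monte Carlo sanity check (a sketch; names ours):

```python
import numpy as np
from math import erf, exp, pi, sqrt

def Phi1(t):
    """Standard Gaussian CDF."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

def lemma_4_7(mu, k):
    """E[z * 1{z >= 0}] for z ~ N(mu, k^2), per Lemma 4.7."""
    return (k / sqrt(2.0 * pi)) * exp(-mu**2 / (2.0 * k**2)) + mu * Phi1(mu / k)

rng = np.random.default_rng(0)
mu, k = 0.5, 1.3
z = mu + k * rng.standard_normal(2_000_000)
mc = float(np.mean(z * (z >= 0)))
print(lemma_4_7(mu, k), mc)  # the two values agree up to Monte Carlo error

# Sanity check of a known special case: for mu = 0, k = 1 the expectation is 1/sqrt(2*pi).
print(lemma_4_7(0.0, 1.0))
```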
Plugging $z_1 = c^\top (w_r \odot x)$ and $z_2 = c^\top (w_{r'} \odot x)$ back into $\mathbb{E}_C[(f(W, x_i \odot c_i))^2]$ and $\mathbb{E}_C[f(W, x_i \odot c_i)]$, and bounding the emerging error terms, gives the desired result. Details are deferred to the appendix.

Expected Surrogate Gradient. In the following, we study the expectation of the surrogate gradient.

Theorem 4.9. Assume that $c_i \sim \mathcal{N}(\mathbf{1}, \kappa^2 I)$ for some $\kappa \le 1$. Let $\phi_{\max}$, $\psi_{\max}$, $B_y$, $R_u$, and $R_w$ be defined as in Definition 4.1. Then, we have that:

$$\mathbb{E}_C[\nabla_{w_r} L_C(W)] = \nabla_{w_r} L(W) + g_r + \frac{3\kappa^2}{m} \sum_{r'=1}^m a_{r'}\, \mathrm{Diag}(x_i)^2 w_{r'} \cdot \mathbb{I}\!\left\{w_r^\top x_i \ge 0;\, w_{r'}^\top x_i \ge 0\right\}, \qquad (5)$$

where the magnitude of $g_r$ can be bounded as:

$$\|g_r\|_2 \le \left(6n\kappa^2 B_x^2 R_w + 5n\kappa R_u \sqrt{d}\right)\phi_{\max} + \frac{\sigma_{\max}(X)}{\sqrt{m}}\, \phi_{\max}\, L(W)^{\frac{1}{2}} + 6n\kappa R_u \psi_{\max}. \qquad (6)$$

Remark 4.10. Theorem 4.9 shows that the expected gradient can be written as the sum of the vanilla loss gradient $\nabla_{w_r} L(W)$, the third term above, and a gradient error $g_r$. The magnitude of $g_r$ is controlled in (6). As discussed previously, when $|w_r^\top x_i| > 0$, both $\phi_{\max}$ and $\psi_{\max}$ decrease exponentially as $\kappa$ decreases. Note that, although the second term scales with $L(W)$, as $L(W)$ decreases during training this term also contributes less to the overall gradient error.

Remark 4.11. One can observe that the third term on the right-hand side of (5) is the gradient with respect to $w_r$ of the function $R(W)$ given by:

$$R(W) = \frac{3\kappa^2}{m} \left\| \sum_{r'=1}^m a_{r'} (w_{r'} \odot x_i)\, \mathbb{I}\!\left\{w_{r'}^\top x_i \ge 0\right\} \right\|^2.$$

Therefore, $R(W)$ can be seen as a scaled version of the $T_2$ term in Theorem 4.2. This again verifies the regularization effect of the Gaussian random mask. The proof of Theorem 4.9 is provided in Appendix B.3; we give a proof sketch below.

Proof sketch.
By the form of the surrogate gradient in (2), we focus on the following two terms:

$$T_1 = f(W, x_i \odot c_i)(x_i \odot c_i)\, \mathbb{I}\!\left\{w_r^\top (x_i \odot c_i) \ge 0\right\}; \qquad T_2 = y_i (x_i \odot c_i)\, \mathbb{I}\!\left\{w_r^\top (x_i \odot c_i) \ge 0\right\}.$$

Plugging in the form of $f(W, x_i \odot c_i)$, we can write $T_1$ as:

$$T_1 = \frac{1}{\sqrt{m}} \sum_{r'=1}^m a_{r'} \sigma\!\left(w_{r'}^\top (x_i \odot c_i)\right) \cdot (x_i \odot c_i)\, \mathbb{I}\!\left\{w_r^\top (x_i \odot c_i) \ge 0\right\} = \frac{1}{\sqrt{m}} \sum_{r'=1}^m a_{r'}\, \mathrm{Diag}(x_i)\, c_i c_i^\top\, \mathbb{I}\!\left\{c_i^\top (w_r \odot x_i) \ge 0\right\} \cdot \mathbb{I}\!\left\{c_i^\top (w_{r'} \odot x_i) \ge 0\right\} (w_{r'} \odot x_i),$$

while $T_2$ can be written as:

$$T_2 = y_i x_i \odot \left(c_i\, \mathbb{I}\!\left\{c_i^\top (w_r \odot x_i) \ge 0\right\}\right).$$

This allows us to focus instead on the quantities

$$\mathbb{E}\left[cc^\top \mathbb{I}\{c^\top u \ge 0; c^\top v \ge 0\}\right]; \qquad \mathbb{E}\left[c\, \mathbb{I}\{c^\top u \ge 0\}\right]$$

using multivariate Gaussian analysis. The rest of the proof proceeds similarly to the proof of Theorem 4.2.

5 Convergence Guarantee of Training with a Gaussian Mask

Here, we study the convergence properties of a general framework of stochastic neural network training, which gives a theoretical result that can be of independent interest. Recall the setting in Section 3: consider $f(W, \cdot)$ a two-layer ReLU-activated MLP, as described above. Let $\xi$ denote the randomness in one step of (stochastic) gradient descent, and let $\hat{L}(W, \xi)$ and $\nabla_{w_r} \hat{L}(W, \xi)$ denote the stochastic loss and the stochastic gradient induced by $\xi$, respectively. We consider the sequence $\{W_k\}_{k=1}^K$ generated by the following updates:

$$W_{k+1} = W_k - \eta \nabla_W \hat{L}(W_k, \xi_k). \qquad (7)$$

We do not require the stochastic gradient to be unbiased; instead, the connection between $\nabla_{w_r} \hat{L}(W, \xi)$ and $\hat{L}(W, \xi)$ with respect to $w_r$, along with other requirements, is stated in the assumption below.

Assumption 5.1. For all $\xi, W$, we assume that the following properties hold:

$$\mathbb{E}_\xi\left[\hat{L}(W, \xi)\right] \le 2L(W) + \varepsilon_1, \qquad (8)$$

$$\left\| \mathbb{E}_\xi\left[\nabla_{w_r} \hat{L}(W, \xi)\right] - \nabla_{w_r} L(W) \right\|_2 \le \varepsilon_3 L(W)^{\frac{1}{2}} + \varepsilon_2, \qquad (9)$$

$$\left\| \nabla_{w_r} \hat{L}(W, \xi) \right\|_2^2 \le \gamma \hat{L}(W, \xi).$$
(10)

Here, (8) and (9) provide upper bounds on the expected loss and on the error of the expected gradient, while (10) can be seen as a relaxed form of smoothness. Our analysis is based on the standard NTK-type argument as in (Du et al., 2018; Song and Yang, 2020; Liao and Kyrillidis, 2022), which considers the infinite-width NTK $H^\infty$ given by:

$$H^\infty_{ij} = x_i^\top x_j\, \mathbb{E}_{w \sim \mathcal{N}(0, I)}\left[\mathbb{I}\!\left\{w^\top x_i \ge 0; w^\top x_j \ge 0\right\}\right]. \qquad (11)$$

It is shown in Du et al. (2018) that $H^\infty$ is positive definite. We define $\lambda_0 := \lambda_{\min}(H^\infty) > 0$.

Theorem 5.2. Assume that the first-layer weights of a neural network are initialized according to $w_{0,r} \sim \mathcal{N}(0, \tau^2 I)$ for some $\tau > 0$, and the second-layer weights are initialized according to $a_r \sim \mathrm{Unif}\{\pm 1\}$. Let the number of hidden neurons satisfy $m = \Omega\left(\frac{n^4 K^2}{\lambda_0^4 \delta^2 \tau^2}\right)$ and the step size satisfy $\eta = O\left(\frac{\lambda_0}{n^2}\right)$. Assume that Assumptions 3.1 and 5.1 hold for some $\gamma = C_1 \cdot \frac{n}{m}$ with small enough $\varepsilon_1, \varepsilon_2, \varepsilon_3$ satisfying:

$$\varepsilon_1 \le O\left(\frac{\delta m}{nK^4}\right), \qquad \varepsilon_2 \le O\left(\frac{\delta \lambda_0}{nK^2}\right), \qquad \varepsilon_3 \le O\left(\frac{\lambda_0}{\sqrt{mn}}\right). \qquad (12)$$

Then, with probability at least $1 - 2\delta - n^2 \exp\left(-\frac{n^3 \delta^2 \tau^2}{\lambda_0^3}\right)$, for all $k \in [K]$, the sequence $\{W_k\}_{k=1}^K$ generated by (7) satisfies:

$$\mathbb{E}_{\xi_0, \ldots, \xi_{k-1}}[L(W_k)] \le \left(1 - \frac{\eta \lambda_0}{2}\right)^k L(W_0) + O\left(\frac{mn}{\lambda_0^2} \cdot \varepsilon_2^2 + \varepsilon_1\right). \qquad (13)$$

Furthermore, we can guarantee that $\|w_{k,r} - w_{0,r}\|_2 \le O\left(\frac{\tau \lambda_0}{n}\right)$ for all $r \in [m]$ and $k \in [K]$.

In short, Theorem 5.2 shows that, under small enough $\varepsilon_1, \varepsilon_2, \varepsilon_3$ and $\gamma$, if the neural network is sufficiently overparameterized then, with a small enough step size $\eta$, we can guarantee convergence of the training given by (7), and the change in each $w_r$ is bounded by $O\left(\frac{\tau \lambda_0}{n}\right)$. As shown in (13), the expected loss converges linearly up to a ball around the global minimum with radius $O\left(\frac{mn}{\lambda_0^2} \cdot \varepsilon_2^2 + \varepsilon_1\right)$.
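For $w \sim \mathcal{N}(0, I)$, the probability in (11) has a well-known closed form, $\mathbb{P}\{w^\top x_i \ge 0, w^\top x_j \ge 0\} = (\pi - \theta_{ij})/(2\pi)$ with $\theta_{ij}$ the angle between $x_i$ and $x_j$ (the zeroth-order arc-cosine kernel). A sketch (ours) computing $H^\infty$ and $\lambda_0$ numerically and checking the closed form against direct sampling of $w$:

```python
import numpy as np

def ntk_infinite(X):
    """Infinite-width NTK (11): H_ij = x_i^T x_j * P_w{w^T x_i >= 0, w^T x_j >= 0},
    w ~ N(0, I); the probability equals (pi - theta_ij) / (2*pi)."""
    G = X @ X.T
    norms = np.linalg.norm(X, axis=1)
    cos = np.clip(G / np.outer(norms, norms), -1.0, 1.0)
    return G * (np.pi - np.arccos(cos)) / (2.0 * np.pi)

def ntk_monte_carlo(X, num_w=200_000, seed=0):
    """Estimate the same kernel by sampling w ~ N(0, I) directly."""
    rng = np.random.default_rng(seed)
    A = (rng.standard_normal((num_w, X.shape[1])) @ X.T >= 0).astype(float)
    return (X @ X.T) * (A.T @ A) / num_w   # empirical joint activation probabilities

rng = np.random.default_rng(3)
X = rng.standard_normal((6, 4))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm, non-collinear inputs
H = ntk_infinite(X)
lam0 = float(np.linalg.eigvalsh(H).min())       # lambda_0 = lambda_min(H_infty)
print(np.max(np.abs(H - ntk_monte_carlo(X))), lam0)
```

With unit-norm, pairwise non-collinear inputs (Assumption 3.1), the computed $\lambda_0$ is strictly positive, in line with Du et al. (2018).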
This error region decreases monotonically as the errors in the expected loss and gradient, namely $\varepsilon_1$ and $\varepsilon_2$, decrease.

Training Convergence with a Gaussian Input Mask. We now apply the general result of Theorem 5.2 to the scenario of Gaussian input masking, as given by (1). To apply Theorem 5.2, one needs to ensure that the requirements in Assumption 5.1 are satisfied. Here, we present two corollaries as extensions of Theorem 4.2 and Theorem 4.9, with the goal of establishing (8) and (9).

Corollary 5.3. Let $B_y$, $\phi_{\max}$, $\psi_{\max}$, $R_u$, and $R_w$ be defined as in Definition 4.1. If $B_y \le 3\sqrt{m}R_w$, then we have:

$$\mathbb{E}_C[L_C(W)] \le 2L(W) + 2mn\kappa^2 R_u^2 + mn\left(\kappa^2 R_u^2 + \kappa R_w\right)\phi_{\max}^2 + mn\kappa^2\left(R_u^2 + 1\right)\psi_{\max}^2.$$

Corollary 5.3 follows from Theorem 4.2 by upper bounding the difference between the smoothed neural network function $\hat{f}(W, \cdot)$ and the vanilla neural network function $f(W, \cdot)$, and by upper bounding the regularization term. In particular, Corollary 5.3 implies the bound $\varepsilon_1 = 2mn\kappa^2 R_u^2 + mn\left(\kappa^2 R_u^2 + \kappa R_w\right)\phi_{\max}^2 + mn\kappa^2\left(R_u^2 + 1\right)\psi_{\max}^2$.

Corollary 5.4. Let $B_y$, $\phi_{\max}$, $\psi_{\max}$, $R_u$, and $R_w$ be defined as in Definition 4.1. If $B_y \le 3\sqrt{m}R_w$, then we have:

$$\left\|\mathbb{E}_C[\nabla_{w_r} L_C(W)] - \nabla_{w_r} L(W)\right\|_2 \le O\left(\left(n\kappa^2 B_x^2 R_w + n\kappa R_u \sqrt{d}\right)\phi_{\max}\right) + O\left(\frac{\sigma_{\max}(X)\phi_{\max}}{\sqrt{m}}\right) L(W)^{\frac{1}{2}} + O\left(n\kappa R_u \psi_{\max} + \kappa^2 \sqrt{m} B_x^2 R_w\right).$$

Similar to Corollary 5.3, Corollary 5.4 follows from upper bounding the regularization term in Theorem 4.9. By Corollary 5.4, we can take $\varepsilon_2 = O\left(\left(n\kappa^2 B_x^2 R_w + n\kappa R_u \sqrt{d}\right)\phi_{\max}\right) + O\left(n\kappa R_u \psi_{\max} + \kappa^2 \sqrt{m} B_x^2 R_w\right)$ and likewise $\varepsilon_3 = O\left(\frac{\sigma_{\max}(X)\phi_{\max}}{\sqrt{m}}\right)$ in Assumption 5.1. The proofs of Corollary 5.3 and Corollary 5.4 are deferred to Appendix C.2. To complete the requirements of Assumption 5.1, we show the following lemma for (10).

Lemma 5.5. Assume that Assumption 3.1 holds.
Then, we have: ∥∇_{w_r} L_C(W)∥₂² ≤ C₁ (n/m) L_C(W).

The proof can be found in the appendix (Lemma D.21). With the help of Corollaries 5.3 and 5.4 and Lemma 5.5, we can derive the convergence guarantee for training the two-layer ReLU neural network under a Gaussian input mask.

Theorem 5.6. Assume that the first-layer weights are initialized according to w_{0,r} ∼ N(0, τ²I) for some τ > 0, and that the second-layer weights are initialized according to a_r ∼ Unif{±1}. Let the number of hidden neurons satisfy m = Ω(n⁴K²/(λ0⁴δ²τ²)) and the step size satisfy η = O(λ0/n²). Assume that for all W ∈ {W_k}_{k=1}^K, the following hold:

$$\kappa = O\!\left(\frac{\sqrt{\delta}\,\lambda_0}{\tau^2 K^2\big(m^{1/4}\sqrt{d} + nd\big)\big(\hat\phi_{\max} + \hat\psi_{\max}\big)}\right) \tag{14}$$

$$\sigma_{\max}(X)\,\hat\phi_{\max} \le O\!\left(\frac{\lambda_0}{\sqrt{n}}\right). \tag{15}$$

Then, with probability at least 1 − 2δ − n² exp(−n³δ²τ²/λ0³), for all k ∈ [K], the sequence {W_k}_{k=1}^K generated by (7) satisfies:

$$\mathbb{E}_{C_0,\ldots,C_{k-1}}[L(W_k)] \le \Big(1 - \frac{\eta\lambda_0}{2}\Big)^k L(W_0) + O\!\left(\kappa\tau^2 mn^2 d^2\big(\hat\phi_{\max}^2 + \hat\psi_{\max}^2\big)\right) \tag{16}$$
$$\qquad + O\!\left(\kappa^2\tau^2 m^2 nd\right) + O\!\left(\kappa\tau mn\sqrt{d}\,\hat\phi_{\max}^2\right) \tag{17}$$

where ϕ̂_max = max_{k∈[K]} ϕ_max(W_k) and ψ̂_max = max_{k∈[K]} ψ_max(W_k).

In short, Theorem 5.6 guarantees the convergence, in the form of (16), of training a two-layer ReLU neural network under a Gaussian input mask, given sufficient overparameterization, a proper step size, and the requirements in (14) and (15). In particular, (14) requires a sufficiently small Gaussian noise scale κ. The condition in (15) requires either a small maximum singular value of the input data matrix X or a small ϕ_max. Lastly, (16) shows linear convergence of the expected loss up to an error region. Notice that the first part of the error region depends on κ and τ as well as on ϕ̂_max and ψ̂_max, while the second part depends solely on κ and τ.
This means that one can guarantee an arbitrarily small error region when the Gaussian noise scale κ and the initialization scale τ are sufficiently small.

Remark 5.7. Both the requirements and the error region in Theorem 5.6 depend on the quantities ϕ̂_max and ψ̂_max. Recall that:

$$\hat\phi_{\max} = \max_{k,r,i}\left\{\exp\!\left(-\frac{(w_r^\top x_i)^2}{4\kappa^2\|w_r \odot x_i\|_2^2}\right)\right\}; \qquad \hat\psi_{\max} = \max_{k,r,i}\left\{\big|w_r^\top x_i\big| \cdot \exp\!\left(-\frac{(w_r^\top x_i)^2}{4\kappa^2\|w_r \odot x_i\|_2^2}\right)\right\}.$$

Both quantities decay exponentially fast as κ decays, provided w_r^⊤ x_i ≠ 0 for all k, r, i. Since the sequence {W_k}_{k=1}^K is generated under the randomness of the C_k's, intuitively it is almost never the case that w_{k,r}^⊤ x_i = 0. Therefore, in most cases Theorem 5.6 should require only a logarithmic dependency of κ on the other parameters for (14) and (15) to be satisfied. However, κ still needs to decay polynomially if one wants to sufficiently shrink the second part of the error region.

Figure 3: (a) Training loss L(W_k) (log scale) versus training iteration for a two-layer ReLU network (n = 500, d = 20, m = 100) trained with full-batch gradient descent under different levels of multiplicative Gaussian input noise with standard deviation κ. (b) Distributed training with a Gaussian mask for different κ and numbers of local steps.

6 Experiments

6.1 Empirical Validation of Training Convergence with Gaussian Mask

Theorem 5.6 asserts that training a sufficiently overparameterized two-layer ReLU network with multiplicative Gaussian input noise results in linear convergence of the expected loss to an error ball whose radius is proportional to the noise variance (controlled by κ) and other network- and data-dependent terms. We empirically verify this convergence behavior.

Simulation Setup.
We train a two-layer ReLU MLP. As a toy example, the network has d = 20 input features and m = 100 hidden units. The training dataset comprises n = 500 synthetic samples, with input features x_i normalized such that ∥x_i∥₂ ≤ 1, and target values y_i generated from a non-linear function of x_i with small added noise. First-layer weights W were initialized with Kaiming uniform initialization, and second-layer weights a_r ∈ {±1} were fixed. The network was trained for 2000 iterations using full-batch gradient descent with a learning rate of 0.005. We performed separate training runs for the noise levels κ ∈ {0.0, 0.05, 0.2, 0.4, 0.6, 1.0, 2.0}. For each run, we tracked the evolution of the clean training loss L(W_k).

Results and Discussion. The training trajectories, plotted in Figure 3a, illustrate the theoretical predictions. For clean training (κ = 0.0), the loss exhibits an initial linear convergence phase. When multiplicative Gaussian noise is introduced, this initial linear convergence trend is preserved. As training progresses, however, the loss converges not to the same minimal value but to a distinct error ball, plateauing at a value higher than in the clean case. As expected, the size of this error ball, indicated by the final converged loss value, systematically increases with the noise level κ. This direct relationship between κ and the size of the error ball provides strong empirical support for the convergence guarantees of Theorem 5.6.

6.2 Impact of Multiplicative Gaussian Noise on Model Accuracy

We trained a 1-hidden-layer MLP (4096 hidden units, GELU activation, dropout = 0.2) on CIFAR-10 for 80 epochs using AdamW optimization with cosine annealing, label smoothing, and standard data augmentation. During training, we injected multiplicative Gaussian noise x ← x · (1 + κZ), where Z ∼ N(0, 1).
We then evaluated the impact of the noise strength κ on clean test accuracy. The results reveal that small amounts of MG noise (κ ≈ 0.2) improve generalization, acting as an effective regularizer beyond the existing random cropping and flipping. This suggests that modest input perturbations help the model learn more robust features that transfer better to the test set. However, accuracy degrades monotonically beyond this point, dropping to 49.88% at κ = 1.8.

Next, to assess the impact of multiplicative Gaussian noise on a well-regularized convolutional architecture, we trained a CNN on CIFAR-10 with varying noise strengths κ.

Figure 4: Test accuracy versus multiplicative Gaussian noise strength κ for (a) a 1-hidden-layer MLP and (b) a CNN, trained on CIFAR-10. Small noise levels (κ ≈ 0.2) can improve generalization for the MLP, likely due to regularization effects. In contrast, the CNN exhibits robustness by maintaining its baseline accuracy. Beyond this point, accuracy degrades monotonically for both architectures as the noise corrupts the training signal.

The architecture consists of four convolutional layers (32 → 32 → 64 → 64 filters with 3 × 3 kernels), batch normalization after each convolutional layer, three dropout layers with rates 0.2, 0.3, and 0.5 respectively, and a fully connected layer with ReLU activation. We trained the CNN for 80 epochs using the Adam optimizer with lr = 10⁻³, weight decay = 10⁻⁵, and batch size 128. During training, we applied multiplicative Gaussian (MG) noise x ← x · (1 + κZ), where Z ∼ N(0, 1), while all accuracy measurements were performed on the clean (noiseless) test set. As shown in Figure 4b, the model exhibits robustness to low-magnitude MG noise, maintaining its baseline accuracy (≈ 71%) at κ = 0.2. Beyond that point, however, accuracy degrades monotonically as excessive noise corrupts the training signal.
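To make the masked-training protocol concrete, here is a minimal NumPy sketch in the spirit of Section 6.1: a two-layer ReLU network trained by full-batch gradient descent on inputs multiplied by a fresh Gaussian mask x · (1 + κZ) at every step, while the clean loss is tracked. All sizes, the synthetic target, and the hyperparameters are illustrative stand-ins, not the exact experimental configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (smaller than the paper's n = 500, d = 20, m = 100).
n, d, m, kappa, lr = 64, 5, 50, 0.2, 0.05

X = rng.normal(size=(n, d))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)  # enforce ||x_i||_2 <= 1
y = np.sin(X @ rng.normal(size=d))                              # non-linear synthetic target

W = rng.normal(scale=0.5, size=(m, d))   # first-layer weights
a = rng.choice([-1.0, 1.0], size=m)      # fixed second-layer signs a_r

def forward(W, X):
    """Two-layer ReLU network f(W, x) = (1/sqrt(m)) * a^T ReLU(W x)."""
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

def clean_loss(W):
    return 0.5 * np.mean((forward(W, X) - y) ** 2)

loss0 = clean_loss(W)
for step in range(500):
    C = 1.0 + kappa * rng.normal(size=X.shape)  # fresh multiplicative Gaussian mask
    Xm = X * C
    err = (forward(W, Xm) - y) / n              # dL/df for the mean-squared loss
    act = (Xm @ W.T > 0).astype(float)          # ReLU derivative at the masked inputs
    grad_W = (act * np.outer(err, a)).T @ Xm / np.sqrt(m)
    W -= lr * grad_W

final = clean_loss(W)  # plateaus at an error level that grows with kappa
```

Rerunning the loop with larger kappa raises the plateau of the clean loss, mirroring the error-ball behavior in Figure 3a.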
6.3 Application: Distributed Training over Wireless Channels

Communicating signals over wireless channels subjects them to fading phenomena (Tse and Viswanath, 2005). Specifically, a time-varying signal x(t) transmitted over a channel h(t) with additive noise z(t) results in y(t) = x(t)h(t) + z(t). For data-parallel distributed training over wireless channels, each input datum is passed through the channel to the workers, which then perform local training. We consider the Gaussian masked input training scheme studied in this paper as a simplified model of channel fading in wireless communication. In particular, we let x be the transmitted signal x(t) and c ∼ N(1, κ²I_d) be the channel effect h(t). For simplicity, we set the additive noise to 0. The time-dependent behavior of x(t) and h(t) is captured by the masking scheme: in each iteration, a new mask is applied to the sample.

Under this setup, we train a two-layer MLP with 128 hidden neurons on the MNIST dataset using FedAvg. That is, the total training process is partitioned into multiple global iterations; in each global iteration, the central server passes the updated model parameters, together with the current copy of the training data, through a wireless channel. Each worker receives the training data with channel fading (in our case, modeled as multiplicative Gaussian noise) and updates its local copy of the model using gradient descent, starting from the parameters shared by the server, for some number of local steps. After the local update, the workers send the updated parameters to the central server, which aggregates them by averaging the workers' weights. Under this setup, we train an aggregated model with 5 workers and {1, 20, 40} local steps, using a batch size of 128.
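The FedAvg-with-channel-fading loop described above can be sketched compactly. For brevity this sketch uses a linear model and synthetic data in place of the paper's two-layer MLP on MNIST, so every name and number here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

n_workers, local_steps, rounds, kappa, lr, d = 5, 20, 30, 0.1, 0.1, 8
w_true = rng.normal(size=d)

# Each worker holds a shard of the synthetic dataset.
shards = []
for _ in range(n_workers):
    Xi = rng.normal(size=(40, d)) / np.sqrt(d)
    shards.append((Xi, Xi @ w_true))

w_global = np.zeros(d)
for t in range(rounds):
    local_models = []
    for X, y in shards:
        w = w_global.copy()
        for _ in range(local_steps):
            # Channel fading: a fresh multiplicative Gaussian mask per local step.
            Xm = X * (1.0 + kappa * rng.normal(size=X.shape))
            grad = Xm.T @ (Xm @ w - y) / len(y)
            w -= lr * grad
        local_models.append(w)
    # FedAvg aggregation: average the workers' weights.
    w_global = np.mean(local_models, axis=0)

err = np.linalg.norm(w_global - w_true)
```

Increasing kappa while holding local_steps large reproduces, qualitatively, the noise-fitting effect discussed below: more local steps let each worker drift toward its own noisy view of the data before averaging.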
We also vary the variance of the Gaussian mask to study the relationship between the number of local steps and the noise scale. For each combination of local steps and Gaussian variance, we run 5 trials and record the mean and standard deviation of the resulting accuracy. We plot the results in Figure 3b. In general, for all choices of the number of local steps, we observe a decay in test accuracy as the input masking variance grows, indicating the negative influence of the noise on the overall training performance. We also observe that, in the low-noise regime (κ ≈ 0), the final accuracy is not much influenced by the number of local iterations. However, as the noise scale grows (larger κ), the final accuracy decays drastically as we increase the number of local iterations. We hypothesize that this behavior arises because more local iterations allow the workers to fit the noise rather than the original signal in the data.

6.4 Efficacy of Multiplicative Gaussian Noise Against Membership Inference Attacks

In this section, we empirically evaluate the effectiveness of training with input multiplicative Gaussian (MG) noise as a defense against membership inference attacks (MIAs). In simple terms, the primary goal of an MIA is to determine whether a specific data point x was part of the training set of a target model f.

Threat Model and Attack Methodology. We adopt the black-box shadow model attack methodology (Adversary 1) from the ML-Leaks framework (Salem et al., 2019). In this setup, the adversary aims to determine whether a specific data point was part of a target model's training set using only the model's output posteriors. Because the adversary lacks the target's training labels, they employ a shadow model trained on a proxy dataset to mimic the target's behavior.
By observing how the shadow model treats its own members versus non-members, the adversary generates labeled data to train an attack model. This binary classifier learns to identify the statistical "signatures" of membership, such as increased confidence or reduced entropy, enabling it to perform membership inference on the original target model. A detailed breakdown of the data partitioning and the five-stage attack pipeline is provided in Appendix A.2.

Experimental Setup. We evaluate the effectiveness of multiplicative Gaussian (MG) noise as a privacy defense using the CIFAR-10 dataset. The data is partitioned into four disjoint sets to train and audit both the target and shadow models independently. Our evaluation covers two architectures, a fully-connected MLP and a multi-block CNN, to ensure the defense generalizes across different model complexities. We measure privacy leakage by training a logistic regression attack model against target models subjected to varying noise intensities (κ ∈ {0.0, 0.5, 1.2, 1.8}) and training durations (20–120 epochs). The defense is quantified via precision, recall, and AUC, where an AUC of 0.5 indicates perfect privacy (random guessing). Detailed hyperparameters, partition sizes, and architectural specifications are provided in Appendix A.2.

6.4.1 Results and Discussion

Our experiments confirm that training with multiplicative Gaussian noise systematically enhances a model's resilience to membership inference attacks. The results for the MLP and CNN models are presented in Figures 6 and 7 respectively, with the AUC values of our experiments illustrated in Figure 5.

Figure 5: Attack AUC on the target model when it is (a) an MLP and (b) a CNN. Higher values indicate greater privacy leakage. Training with MG noise (larger κ) consistently reduces attack success.

Figure 6: MLP target.
Multiplicative Gaussian noise provides resilience against the attacks as κ increases.

Figure 7: CNN target. We observe a similar trend as for the MLP.

Figure 7 demonstrates the attack success for κ = 0.0, in which case, as the number of training epochs increases from 20 to 120, the AUC rises from 0.578 to a significant 0.782. This validates that our attack implementation correctly captures privacy leakage. The central finding is the consistent defensive benefit of multiplicative Gaussian noise. As shown in Figure 6 for the MLP, for any given number of training epochs, applying MG noise (increasing κ) keeps both the precision and recall of the attack (relatively) constant. For example, after 120 epochs of training, the standard model is highly vulnerable (attack AUC = 0.782). In contrast, the model trained with κ = 0.5 reduces this leakage (AUC = 0.692), and models with stronger noise achieve even better privacy (AUC = 0.585 for κ = 1.2 and AUC = 0.543 for κ = 1.8). This demonstrates a clear dose-response relationship: greater noise variance leads to stronger privacy protection against MIAs. Similar trends were observed for the CNN architecture (see Figure 7). Figure 4 illustrates the classic privacy-utility trade-off. In essence, by sacrificing some model utility, training with multiplicative Gaussian noise effectively obfuscates the statistical signature of data membership, thereby mitigating privacy risks.

7 Conclusion

This work investigates the fundamental question of how independent multiplicative Gaussian masking affects training dynamics. Focusing on a two-layer ReLU network in the NTK regime, we demonstrate that the masked objective admits a closed-form decomposition into a smoothed square loss plus an explicit, data-dependent regularizer.
This structure allows training with the gradient of the masked objective to achieve linear convergence toward a noise-controlled error ball, given a small step size and sufficient over-parameterization. Beyond our theory, we provide experimental results validating convergence to a small error ball, and present applications to distributed training under channel fading and to defending against membership inference attacks with masking.

Limitations and Future Work. Our theoretical analysis is currently constrained to the small-noise regime and the extreme over-parameterization typical of NTK models. Furthermore, our proofs rely on the independence of masks across iterations and do not yet incorporate a formal privacy accounting pipeline, such as subsampling or composition. Despite these constraints, multiplicative Gaussian masking represents a provable and practically viable method for injecting input-level uncertainty. These results provide a principled foundation for future exploration of deep networks and more complex noise settings in feature-wise training.

References

Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Convex optimization with sparsity-inducing norms. Optimization for Machine Learning, page 19, 2011.

Francis Bach, Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, et al. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.

Aristide Baratin, Thomas George, César Laurent, R Devon Hjelm, Guillaume Lajoie, Pascal Vincent, and Simon Lacoste-Julien. Implicit regularization via neural feature alignment. In International Conference on Artificial Intelligence and Statistics, pages 2269–2277. PMLR, 2021.

Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. Advances in Neural Information Processing Systems, 30, 2017.
Yong Cheng, Yang Liu, Tianjian Chen, and Qiang Yang. Federated learning for privacy-preserving AI. Communications of the ACM, 63(12):33–36, 2020.

Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pages 1310–1320. PMLR, 2019.

Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.

Simon S. Du, Kangcheng Hou, Barnabás Póczos, Ruslan Salakhutdinov, Ruosong Wang, and Keyulu Xu. Graph neural tangent kernel: Fusing graph neural networks with graph kernels, 2019a. URL https://arxiv.org/abs/1905.13192.

Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks, 2019b.

Chen Dun, Cameron R Wolfe, Christopher M Jermaine, and Anastasios Kyrillidis. ResIST: Layer-wise decomposition of resnets for distributed training. In Uncertainty in Artificial Intelligence, pages 610–620. PMLR, 2022.

Chen Dun, Mirian Hipolito, Chris Jermaine, Dimitrios Dimitriadis, and Anastasios Kyrillidis. Efficient and light-weight federated learning via asynchronous distributed dropout. In International Conference on Artificial Intelligence and Statistics, pages 6630–6660. PMLR, 2023.

Chris Edwards. Data quality may be all you need, 2024.

Ruiqi Gao, Tianle Cai, Haochuan Li, Liwei Wang, Cho-Jui Hsieh, and Jason D. Lee. Convergence of adversarial training in overparametrized neural networks, 2019.

Guillaume Garrigos and Robert M. Gower. Handbook of convergence theorems for (stochastic) gradient methods. arXiv preprint arXiv:2301.11235, 2023.

Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Linearized two-layer neural networks in high dimension. The Annals of Statistics, 49(2):1029–1054, 2021.
Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. Textbooks are all you need. arXiv preprint arXiv:2306.11644, 2023.

Chaoyang He, Songze Li, Jinhyun So, Xiao Zeng, Mi Zhang, Hongyi Wang, Xiaoyang Wang, Praneeth Vepakomma, Abhishek Singh, Hang Qiu, et al. FedML: A research library and benchmark for federated machine learning. arXiv preprint arXiv:2007.13518, 2020.

Erdong Hu, Yuxin Tang, Anastasios Kyrillidis, and Chris Jermaine. Federated learning over images: Vertical decompositions and pre-trained backbones are difficult to beat. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19385–19396, 2023.

Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial examples are not bugs, they are features. Advances in Neural Information Processing Systems, 32, 2019.

Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31, 2018.

Rodolphe Jenatton, Jean-Yves Audibert, and Francis Bach. Structured variable selection with sparsity-inducing norms. The Journal of Machine Learning Research, 12:2777–2824, 2011.

Ziwei Ji and Matus Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks, 2020.

Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1–2):1–210, 2021.

Emmanouil Kariotakis, Grigorios Tsagkatakis, Panagiotis Tsakalides, and Anastasios Kyrillidis.
Leveraging sparse input and sparse models: Efficient distributed learning in resource-constrained environments. In Conference on Parsimony and Learning, pages 554–569. PMLR, 2024.

Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, volume 28, pages 2575–2583, 2015. URL https://papers.nips.cc/paper/5666-variational-dropout-and-the-local-reparameterization-trick.

Anastasios Kyrillidis, Luca Baldassarre, Marwa El Halabi, Quoc Tran-Dinh, and Volkan Cevher. Structured sparsity: Discrete and convex approaches. In Compressed Sensing and its Applications: MATHEON Workshop 2013, pages 341–387. Springer, 2015.

Daniel LeJeune and Sina Alemohammad. An adaptive tangent feature perspective of neural networks. In Yuejie Chi, Gintare Karolina Dziugaite, Qing Qu, Atlas Wang, and Zhihui Zhu, editors, Conference on Parsimony and Learning, volume 234 of Proceedings of Machine Learning Research, pages 379–394. PMLR, 03–06 Jan 2024. URL https://proceedings.mlr.press/v234/lejeune24a.html.

Guanlin Li, Han Qiu, Shangwei Guo, Jiwei Li, and Tianwei Zhang. Rethinking adversarial training with neural tangent kernel. arXiv preprint arXiv:2312.02236, 2023a.

Shuyao Li, Ilias Diakonikolas, and Jelena Diakonikolas. Distributionally robust optimization with adversarial data contamination, 2025.

Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need II: phi-1.5 technical report. arXiv preprint arXiv:2309.05463, 2023b.

Fangshuo Liao and Anastasios Kyrillidis. On the convergence of shallow neural network training with randomly masked neurons. Transactions on Machine Learning Research, 2022. URL https://openreview.net/forum?id=ebZ0gGRJwQx.

Yang Liu, Tao Fan, Tianjian Chen, Qian Xu, and Qiang Yang.
FATE: An industrial grade platform for collaborative learning with data protection. Journal of Machine Learning Research, 22(226):1–6, 2021.

Yang Liu, Xinwei Zhang, Yan Kang, Liping Li, Tianjian Chen, Mingyi Hong, and Qiang Yang. FedBCD: A communication-efficient collaborative learning framework for distributed features. IEEE Transactions on Signal Processing, 70:4277–4290, 2022.

Yang Liu, Yan Kang, Tianyuan Zou, Yanhong Pu, Yuanqin He, Xiaozhou Ye, Ye Ouyang, Ya-Qin Zhang, and Qiang Yang. Vertical federated learning: Concepts, advances, and challenges. IEEE Transactions on Knowledge and Data Engineering, 2024.

Noel Loo, Ramin Hasani, Alexander Amini, and Daniela Rus. Evolution of neural tangent kernels under benign and adversarial training. Advances in Neural Information Processing Systems, 35:11642–11657, 2022.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.

Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017.

Poorya Mianjy and Raman Arora. On convergence and generalization of dropout training. In Advances in Neural Information Processing Systems, volume 33, pages 14124–14134, 2020. URL https://proceedings.neurips.cc/paper/2020/file/f1de5100906f31712aaa5166689bfdf4-Paper.pdf.

Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1979–1993, 2018.

Quynh Nguyen.
On the proof of global convergence of gradient descent for deep relu networks with linear widths, 2021.

Samet Oymak and Mahdi Soltanolkotabi. Towards moderate overparameterization: Global convergence guarantees for training shallow neural networks, 2019.

Mélanie Rey and Andriy Mnih. Gaussian dropout as an information bottleneck layer. In Bayesian Deep Learning Workshop, NeurIPS, 2021. URL https://bayesiandeeplearning.org/2021/papers/40.pdf.

Daniele Romanini, Adam James Hall, Pavlos Papadopoulos, Tom Titcombe, Abbas Ismail, Tudor Cebere, Robert Sandmann, Robin Roehm, and Michael A Hoeh. PyVertical: A vertical federated learning framework for multi-headed SplitNN. arXiv preprint arXiv:2104.00489, 2021.

Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2017.

Ahmed Salem, Yang Zhang, Mathias Humbert, Pascal Berrang, Mario Fritz, and Michael Backes. ML-Leaks: Model and data independent membership inference attacks and defenses on machine learning models. In Proceedings of the 2019 Network and Distributed System Security Symposium (NDSS), 2019. ISBN 1-891562-55-X. URL https://www.ndss-symposium.org/ndss2019/papers/ndss2019_03A-1_Salem_paper.pdf.

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.

Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Proceedings of the 30th International Conference on Machine Learning, pages 71–79, 2013. URL https://proceedings.mlr.press/v28/shamir13.html.

Zhao Song and Xin Yang.
Quadratic suffices for over-parametrization via matrix chernoff bound, 2020. URL https://arxiv.org/abs/1906.03593.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Cheng Tang et al. Convergence analysis of stochastic gradient descent on strongly convex functions. In Proceedings of the 2013 European Signal Processing Conference, pages 1568–1572, 2013. URL https://www.esat.kuleuven.be/sista/ROKS2013/files/abstracts/ChengTang.pdf.

Lan V. Truong. Global convergence rate of deep equilibrium models with general activations, 2025. URL https://arxiv.org/abs/2302.05797.

David Tse and Pramod Viswanath. Fundamentals of Wireless Communication. Cambridge University Press, Cambridge, United Kingdom, 2005. ISBN 9780521845274. doi: 10.1017/CBO9780511807213.

Nikolaos Tsilivis and Julia Kempe. What can the neural tangent kernel tell us about adversarial robustness? Advances in Neural Information Processing Systems, 35:18116–18130, 2022.

Sida Wang and Christopher D Manning. Fast dropout training. In Proceedings of the 30th International Conference on Machine Learning, pages 118–126, 2013. URL https://proceedings.mlr.press/v28/wang13a.html.

Cameron R Wolfe, Jingkang Yang, Fangshuo Liao, Arindam Chowdhury, Chen Dun, Artun Bayer, Santiago Segarra, and Anastasios Kyrillidis. GIST: Distributed training for large-scale graph convolutional networks. Journal of Applied and Computational Topology, pages 1–53, 2023.

Eric Wong, Frank Schmidt, Jan Hendrik Metzen, and J Zico Kolter. Scaling provable adversarial defenses. Advances in Neural Information Processing Systems, 31, 2018.

Yongtao Wu, Fanghui Liu, Grigorios G Chrysos, and Volkan Cevher.
On the convergence of encoder-only shallow transformers, 2023.

Ashkan Yousefpour, Igor Shilov, Alexandre Sablayrolles, Davide Testuggine, Karthik Prasad, Mani Malek, John Nguyen, Sayan Ghosh, Akash Bharadwaj, Jessica Zhao, Graham Cormode, and Ilya Mironov. Opacus: User-friendly differential privacy library in pytorch, 2021.

Binhang Yuan, Cameron R Wolfe, Chen Dun, Yuxin Tang, Anastasios Kyrillidis, and Chris Jermaine. Distributed learning of fully connected neural networks using independent subnet training. Proceedings of the VLDB Endowment, 2022.

A Additional Experimental Results and Related Details

A.1 Empirical Validation of Expected Gradient Properties (Theorem 4.9)

Theorem 4.9 characterizes the expected gradient under Gaussian input masking as E_C[∇_{w_r} L_C(W)] = ∇_{w_r} L(W) + T_{3,r} + g_r. Here, ∇_{w_r} L(W) is the clean-input gradient, T_{3,r} is a systematic deviation term proportional to κ², and g_r is a residual error bounded by Eq. (6).

Simulation Setup. We used a two-layer ReLU MLP with d = 20 input features and m = 100 hidden units, on n = 500 synthetic samples (∥x_i∥₂ ≤ 1, y_i ∼ N(0, 0.5²)). First-layer weights W are drawn from N(0, 0.1²); second-layer weights a_r ∈ {±1} are fixed. We analyze ∇_{w_r} L_C(W) by averaging N = 2000 Monte Carlo samples for κ ∈ [0.001, 1.0], for a representative neuron r. For this setup, the clean loss L(W) ≈ 71.52 and ∥∇_{w_r} L(W)∥₂ ≈ 0.981.

Results and Discussion. Our simulations validate the decomposition in Theorem 4.9. Figure 8a displays the ℓ₂-norms of the gradient components versus κ. The clean gradient norm is constant. The norm ∥T_{3,r}∥₂ scales with κ² (e.g., from ≈ 8.1 × 10⁻⁷ at κ = 0.001 to ≈ 0.81 at κ = 1.0), confirming its theoretical dependence.
The norm of the empirically estimated expected masked gradient, ∥E_C[∇_{w_r} L_C(W)]∥₂, follows the clean gradient norm for small κ and, for larger κ, reflects the vector sum with the growing T_{3,r} term in Eq. (5).

Figure 8: ℓ₂-norms of the gradient components (left) and residual-error bound check (right). (a) ℓ₂-norms of key components of the expected gradient E_C[∇_{w_r} L_C(W)]: the clean gradient norm ∥∇_{w_r} L(W)∥₂, the norm ∥T_{3,r}∥₂, and the total expected masked gradient norm; T_{3,r} scales with κ². (b) Comparison of the ℓ₂-norm of the empirically estimated gradient error term, ∥g_r∥₂, against its theoretical upper bound from Eq. (6). The empirical error (solid line) remains below the derived bound (dashed line) across all tested κ (log-log scale).

Figure 8b examines the residual error term g_r. It compares the ℓ₂-norm of the empirically estimated g_r with its theoretical upper bound from Eq. (6). The estimated error norm ∥g_r∥_est increases with κ (from ≈ 1.65 × 10⁻² at κ = 0.001 to ≈ 0.53 at κ = 1.0). Importantly, the theoretical bound on ∥g_r∥₂ consistently upper-bounds the empirical error across the entire range of κ. For instance, at κ = 0.001, ∥g_r∥_est ≈ 0.0165 while its bound is ≈ 7.85. In summary, the simulations confirm that the expected gradient under Gaussian input masking, for sufficiently small κ, is well-approximated by the sum of the clean gradient and the κ²-dependent term, with a residual error that is effectively bounded by our theoretical derivation.

A.2 Experimental Details for Section 6.4

We adopt the black-box threat model and the shadow model attack methodology (Adversary 1) proposed in the ML-Leaks framework (Salem et al., 2019).

Threat Model. The adversary has black-box access to a trained target model f.
This means the adversary can query the model with any input x and observe its output posterior probability vector p = f(x) (i.e., the softmax output over the classes), but has no access to the model's parameters, gradients, or original training data. The adversary's goal is to train an attack model A that, given the posterior p_f(x) from the target model for a point x, predicts whether x was a member of the target's training set.

Shadow Model Attack Pipeline. Since the attacker does not have access to the target model's training set, they cannot directly generate labeled data (member vs. non-member posteriors) to train their attack model. The shadow model technique circumvents this by creating a proxy environment in which the attacker controls data membership. The pipeline is as follows:

1. Data Partitioning: The attacker possesses a dataset D_Shadow, disjoint from the target's training set but drawn from the same data distribution. This set is split into D^Train_Shadow and D^Out_Shadow.

2. Shadow Model Training: A shadow model S, which mimics the target model's architecture and training process, is trained on D^Train_Shadow.

3. Attack Dataset Generation: The trained shadow model S is queried on its own training data (members, D^Train_Shadow) and its hold-out data (non-members, D^Out_Shadow). The resulting posterior vectors p_S(x) are collected. Following (Salem et al., 2019), the top-3 sorted probabilities of each posterior are used as features: φ(p_S(x)) = (p_(1), p_(2), p_(3)). These feature vectors are labeled "1" if x ∈ D^Train_Shadow and "0" otherwise.

4. Attack Model Training: A binary classifier, the attack model A, is trained on this generated dataset of (feature, label) pairs. It learns to distinguish the statistical "signature" of a member's posterior from a non-member's.
This signature often manifests as higher confidence (larger p_(1)) and lower entropy for members, a result of the target/shadow model overfitting to its training data.

5. Inference on Target Model: To attack the original target model f, the adversary queries it with a point of interest x, extracts the features φ(p_f(x)), and feeds them to the trained attack model A to obtain a membership prediction.

Dataset and Partitioning. We use the CIFAR-10 dataset, consisting of 60,000 images. The full pool is shuffled and divided into four disjoint sets of 10,520 images each: target_train (training MG-protected models), target_test (non-member audit data), shadow_train (training shadow models), and shadow_test (shadow non-member data). All data is normalized using the mean and standard deviation of the respective training sets. For protected models, inputs x are modified via element-wise multiplication with a random mask m, where m_i ∼ N(1, κ²).

Model Architectures.

• MLP ("nn"): A fully-connected network with one hidden layer of 100 neurons (Tanh activation). Input layer: 3,072 features.

• CNN ("cnn"): Two Conv-ReLU-MaxPool blocks, followed by a Tanh-activated fully-connected layer with 100 hidden units.

Training and Hyperparameters. Both models are trained using the Adam optimizer (learning rate 10⁻³, ℓ₂ regularization 10⁻⁷) for intervals between 20 and 120 epochs. The attack model A is a LogisticRegression classifier (scikit-learn), trained on a balanced dataset of member and non-member posteriors.

A.2.1 Multiplicative Gaussian Noise vs. Differential Privacy

We evaluate the privacy-utility tradeoff of Multiplicative Gaussian (MG) noise against Differential Privacy (DP-SGD) on the CIFAR-10 image classification dataset.
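The feature extraction and attack-model training steps of the pipeline above can be sketched as follows. This is a self-contained illustration: the "posteriors" are synthetic Dirichlet samples standing in for shadow-model outputs (peaked ones mimic overconfident member posteriors), and the attack classifier is a tiny NumPy logistic regression standing in for the scikit-learn LogisticRegression used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def top3_features(posteriors):
    """ML-Leaks-style features: the three largest softmax probabilities, sorted."""
    return np.sort(posteriors, axis=1)[:, ::-1][:, :3]

# Hypothetical stand-in for shadow-model posteriors over 10 classes:
# a small Dirichlet concentration gives peaked, member-like outputs.
members = rng.dirichlet(np.full(10, 0.1), size=1000)       # peaked
non_members = rng.dirichlet(np.full(10, 1.0), size=1000)   # flatter

Xattack = np.vstack([top3_features(members), top3_features(non_members)])
yattack = np.concatenate([np.ones(1000), np.zeros(1000)])

# Tiny logistic-regression attack model trained by gradient descent.
w = np.zeros(Xattack.shape[1]); b = 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(Xattack @ w + b)))
    g = p - yattack
    w -= 0.1 * (Xattack.T @ g) / len(yattack)
    b -= 0.1 * g.mean()

acc = ((p > 0.5) == yattack).mean()
print(acc)
```

Because member-like posteriors concentrate mass on the top probability, even this linear attack separates the two populations well, which is exactly the signal the shadow-model attack exploits.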
Following the standard ML-Leaks evaluation protocol (Shokri et al., 2017), we use the dataset partitioning described in the previous section and employ a standard CNN architecture, i.e., a shallow baseline consisting of two convolutional layers (5 × 5 kernels, 32 filters) followed by max-pooling and a fully connected layer. For the multiplicative Gaussian defense, we sweep the noise parameter κ ∈ {0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2}, where κ = 0.0 represents the undefended baseline. For each training batch, we apply element-wise multiplicative noise to the input features, x̃ = x ⊙ (1 + κZ), where Z ∼ N(0, I) is standard Gaussian noise. For the Differential Privacy (DP-SGD) part, we sweep the noise multiplier σ ∈ {0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 3.0} using the Opacus library (Yousefpour et al., 2021). We set the per-sample gradient clipping norm C = 1.0 and target δ = 10⁻⁵.

The empirical findings for this experiment are illustrated in Figure 9. The plots map the privacy leakage (attack precision and recall) against the model's utility (target accuracy) across the swept noise parameters.

Figure 9: Privacy-utility tradeoff for the standard CNN. The left panel shows attack precision vs. accuracy, and the right panel shows attack recall vs. accuracy.

Evaluation on High-Capacity Architecture: We repeated the evaluation using an improved CNN architecture. This model features a deeper convolutional structure (convolutional blocks with increasing filter counts (32 → 64 → 128) using 3 × 3 kernels) and also incorporates Batch Normalization and Dropout to achieve higher baseline utility. Figure 10 below illustrates the results for the improved CNN architecture.
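The multiplicative mask x̃ = x ⊙ (1 + κZ) used by the MG defense is unbiased and carries per-feature noise scaled by |x_i|, which a quick simulation can confirm. This is an illustrative sketch; the input vector and κ = 0.4 (one point of the sweep) are stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
kappa = 0.4
x = rng.uniform(-1.0, 1.0, size=32)   # a single hypothetical input vector

# Apply the MG defense N times: x_masked = x ⊙ (1 + kappa * Z), Z ~ N(0, I)
N = 100_000
Z = rng.normal(size=(N, x.size))
x_masked = x * (1.0 + kappa * Z)

emp_mean = x_masked.mean(axis=0)          # should be ≈ x (mask is mean-one)
emp_var = x_masked.var(axis=0)            # should be ≈ kappa^2 * x^2
print(np.max(np.abs(emp_mean - x)))
print(np.max(np.abs(emp_var - kappa**2 * x**2)))
```

The mean-one property means no bias correction is needed at inference time, while the κ²x² variance is the quantity the convergence analysis tracks.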
We observe that the performance gap between the two methods narrows, and the MG noise curve remains much closer to the attacker's near-random-guess baseline over a longer stretch of the accuracy spectrum.

Figure 10: Privacy-utility tradeoff for the improved CNN. Unlike for the standard CNN, the MG noise curve stays much closer to the DP-SGD curve across the accuracy range.

B Proofs in Section 4

In this section, we first prove an exact form of the expected surrogate loss function.

B.1 General Form of Expected Loss

Lemma B.1. Let u_{i,r} = w_r ⊙ x_i, let σ̂_κ(w, x) = w^⊤x · Φ₁( w^⊤x / (κ‖w ⊙ x‖₂) ), and let f̂(θ, x) = (1/√m) Σ_{r=1}^m a_r σ̂_κ(w_r, x). Then we have

E_C[L_C(θ)] = (1/2) Σ_{i=1}^n ( f̂(θ, x_i) − y_i )²
  + (κ²/(2m)) Σ_{i=1}^n ‖ Σ_{r=1}^m a_r u_{i,r} Φ₁( w_r^⊤x_i / (κ‖u_{i,r}‖₂) ) ‖₂²
  + (1/m) Σ_{i=1}^n Σ_{r,r′=1}^m a_r a_{r′} ( C_{i,r,r′} + (κ²/(2π)) E_{i,r,r′} )
  + (2κ/(√(2π) m)) Σ_{i=1}^n Σ_{r=1}^m a_r G_{i,r} ( (1/√m) Σ_{r′=1}^m a_{r′} T_{i,r,r′} − y_i ),

where C_{i,r,r′}, E_{i,r,r′}, T_{i,r,r′}, and G_{i,r} are defined as

C_{i,r,r′} = ( (w_r^⊤x_i)(w_{r′}^⊤x_i) + κ² u_{i,r}^⊤u_{i,r′} ) · C( w_r^⊤x_i / (κ‖u_{i,r}‖₂), w_{r′}^⊤x_i / (κ‖u_{i,r′}‖₂), u_{i,r}^⊤u_{i,r′} / (‖u_{i,r}‖₂‖u_{i,r′}‖₂) )

E_{i,r,r′} = ‖u_{i,r}‖₂‖u_{i,r′}‖₂ exp( −[ ‖u_{i,r′}‖₂²(w_r^⊤x_i)² − 2(u_{i,r}^⊤u_{i,r′})(w_r^⊤x_i)(w_{r′}^⊤x_i) + ‖u_{i,r}‖₂²(w_{r′}^⊤x_i)² ] / [ 2κ²( ‖u_{i,r}‖₂²‖u_{i,r′}‖₂² − (u_{i,r}^⊤u_{i,r′})² ) ] )

T_{i,r,r′} = w_{r′}^⊤x_i · Φ₁( ( ‖u_{i,r}‖₂² · w_{r′}^⊤x_i − u_{i,r}^⊤u_{i,r′} · w_r^⊤x_i ) / ( κ‖u_{i,r}‖₂ ( ‖u_{i,r}‖₂²‖u_{i,r′}‖₂² − (u_{i,r}^⊤u_{i,r′})² )^{1/2} ) )

G_{i,r} = ‖u_{i,r}‖₂ exp( −(w_r^⊤x_i)² / (2κ²‖u_{i,r}‖₂²) )

Proof. By definition, we have

E_C[L_C(θ)] = (1/2) Σ_{i=1}^n E_{c_i}[ ( f(θ, x_i ⊙ c_i) − y_i )² ]

For simplicity, we fix i ∈ [n] and study E_c[ ( f(θ, x ⊙ c) − y )² ]. In the analysis below, we let u_r = w_r ⊙ x.
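The proof below repeatedly uses the closed form for the mean of a ReLU applied to a Gaussian (Lemma D.18, appearing as Eq. (19)): for z ∼ N(μ, s²), E[σ(z)] = (s/√(2π)) exp(−μ²/(2s²)) + μΦ₁(μ/s). A quick Monte Carlo check of this identity, with μ and s as arbitrary stand-ins for w_r^⊤x and κ‖w_r ⊙ x‖₂:

```python
import math
import random

def relu_mean(mu, s):
    """Closed form E[ReLU(z)] for z ~ N(mu, s^2), as in Eq. (19)."""
    Phi = 0.5 * (1.0 + math.erf(mu / (s * math.sqrt(2.0))))
    return s / math.sqrt(2.0 * math.pi) * math.exp(-mu**2 / (2.0 * s**2)) + mu * Phi

random.seed(0)
mu, s = 0.7, 0.3
N = 200_000
mc = sum(max(random.gauss(mu, s), 0.0) for _ in range(N)) / N
print(abs(mc - relu_mean(mu, s)))   # Monte Carlo agrees with the closed form
```

The same identity, applied to the correlated pair (c^⊤u_r, c^⊤u_{r′}) via its bivariate analogue (Lemma D.9), produces the C, E, T, G terms of Lemma B.1.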
In particular, we hav e E c h ( f ( θ , x ⊙ c ) − y ) 2 i = E c h f ( θ , x ⊙ c ) 2 i − 2 y E c [ f ( θ , x ⊙ c )] + y 2 Here w e shall ev aluate the tw o exp ectations separately . T o start, for the first-order term, we hav e E c [ f ( θ , x ⊙ c )] = 1 √ m m X r =1 a r E  σ  w ⊤ r ( c ⊙ x )  = 1 √ m m X r =1 a r E  σ  c ⊤ ( w r ⊙ x )  (18) Notice that since c ∼ N  1 , κ 2 I  . By Lemma D.13 w e hav e that c ⊤ ( w r ⊙ x ) ∼ N  w ⊤ r x , κ 2 ∥ w r ⊙ x ∥ 2 2  , since 1 ⊤ ( w r ⊙ x ) = w ⊤ r x . Applying Lemma D.18 with z = c ⊤ ( w r ⊙ x ) , mean w ⊤ r x , and standard deviation κ ∥ w r ⊙ x ∥ , w e ha ve that E  σ  c ⊤ ( w r ⊙ x )  = κ ∥ w r ⊙ x ∥ 2 √ 2 π exp −  w ⊤ r x  2 2 κ 2 ∥ w r ⊙ x ∥ 2 2 ! + w ⊤ r xΦ 1  w ⊤ r x κ ∥ w r ⊙ x ∥ 2  (19) Plugging in to the form of (18) gives E c [ f ( θ , x ⊙ c )] = 1 √ m m X r =1 a r ˆ σ κ ( w r , x ) + κ √ 2 π m m X r =1 a r ∥ u r ∥ 2 exp −  w ⊤ r x  2 2 κ 2 ∥ u r ∥ 2 2 ! (20) 22 Next, w e fo cus on the second-order term. Since w ⊤ r ( c ⊙ x ) = c ⊤ ( w r ⊙ x ) , w e ha ve E c h f ( θ , x ⊙ c ) 2 i = 1 m m X r,r ′ =1 a r a r ′ E  σ  c ⊤ u r  σ  c ⊤ u r ′  Let z 1 = c ⊤ u r and z 2 = c ⊤ u r ′ , w e ha v e that z 1 ∼ N  w ⊤ r x , κ 2 ∥ u r ∥ 2 2  , z 2 ∼ N  w ⊤ r ′ x , κ 2 ∥ u r ′ ∥ 2 2  , and Co v ( z 1 , z 2 ) = κ 2 u ⊤ r u r ′ . Applying Lemma D.9 with a = b = 0 gives E  σ  c ⊤ u r  σ  c ⊤ u r ′  =  w ⊤ r x   w ⊤ r ′ x  + κ 2 u ⊤ r u r ′  Φ 2  w ⊤ r x κ ∥ u r ∥ 2 , w ⊤ r ′ x κ ∥ u r ′ ∥ 2 , u ⊤ r u r ′ ∥ u r ∥ 2 ∥ u r ′ ∥ 2  + κ 2 2 π ∥ u r ∥ 2 ∥ u r ′ ∥ 2 exp   − ∥ u r ′ ∥ 2 2  w ⊤ r x  2 − 2  u ⊤ r u r ′   w ⊤ r x   w ⊤ r ′ x  + ∥ u r ∥ 2 2  w ⊤ r ′ x  2 2 κ 2  ∥ u r ∥ 2 2 ∥ u r ′ ∥ 2 2 − ( u ⊤ r u r ′ ) 2    + κ √ 2 π  ∥ u r ∥ 2 · w ⊤ r ′ x · ˆ T 1 ,r,r ′ + ∥ u r ′ ∥ 2 · w ⊤ r x · ˆ T 2 ,r,r ′  where ˆ T 1 ,r,r ′ , ˆ T 2 ,r,r ′ are defined as ˆ T 1 ,r,r ′ = exp −  w ⊤ r x  2 2 κ 2 ∥ u r ∥ 2 2 ! 
Φ 1    ∥ u r ∥ 2 2 · w ⊤ r ′ x − u ⊤ r u r ′ · w ⊤ r x κ ∥ u r ∥ 2  ∥ u r ∥ 2 2 ∥ u r ′ ∥ 2 2 − ( u ⊤ r u r ′ ) 2  1 2    ˆ T 2 ,r,r ′ = exp −  w ⊤ r ′ x  2 2 κ 2 ∥ u r ′ ∥ 2 2 ! Φ 1    ∥ u r ′ ∥ 2 2 · w ⊤ r x − u ⊤ r u r ′ · w ⊤ r ′ x κ ∥ u r ′ ∥ 2  ∥ u r ∥ 2 2 ∥ u r ′ ∥ 2 2 − ( u ⊤ r u r ′ ) 2  1 2    F or the simplicity of notations, we define T 1 ,r,r ′ = ∥ u r ∥ 2 · w ⊤ r ′ x · ˆ T 1 ,r,r ′ and T 2 ,r,r ′ = ∥ u r ′ ∥ 2 · w ⊤ r x · ˆ T 2 ,r,r ′ . Moreo ver, w e define E r,r ′ = ∥ u r ∥ 2 ∥ u r ′ ∥ 2 exp   − ∥ u r ′ ∥ 2 2  w ⊤ r x  2 − 2  u ⊤ r u r ′   w ⊤ r x   w ⊤ r ′ x  + ∥ u r ∥ 2 2  w ⊤ r ′ x  2 2 κ 2  ∥ u r ∥ 2 2 ∥ u r ′ ∥ 2 2 − ( u ⊤ r u r ′ ) 2    Lastly , we use the definition of the Gaussian Copula function C ( a, b, ρ ) = Φ 2 ( a, b, ρ ) − Φ 1 ( a ) Φ 1 ( b ) and define C r,r ′ =  w ⊤ r x   w ⊤ r ′ x  + κ 2 u ⊤ r u r ′  C  w ⊤ r x κ ∥ u r ∥ 2 , w ⊤ r ′ x κ ∥ u r ′ ∥ 2 , u ⊤ r u r ′ ∥ u r ∥ 2 ∥ u r ′ ∥ 2  Under these definitions, we hav e that E  σ  c ⊤ u r  σ  c ⊤ u r ′  =  w ⊤ r x   w ⊤ r ′ x  + κ 2 u ⊤ r u r ′  Φ 1  w ⊤ r x κ ∥ u r ∥ 2  Φ 1  w ⊤ r ′ x κ ∥ u r ′ ∥ 2  + C r,r ′ + κ 2 2 π E r,r ′ + κ √ 2 π ( T 1 ,r,r ′ + T 2 ,r,r ′ ) = ˆ σ κ ( w r , x ) ˆ σ κ ( w r ′ , x ) + κ 2 u ⊤ r u r ′ Φ 1  w ⊤ r x κ ∥ u r ∥ 2  Φ 1  w ⊤ r ′ x κ ∥ u r ′ ∥ 2  + C r,r ′ + κ 2 2 π E r,r ′ + κ √ 2 π ( T 1 ,r,r ′ + T 2 ,r,r ′ ) 23 Plugging bac k into the expression of E c h f ( θ , x ⊙ c ) 2 i giv es E c h f ( θ , x ⊙ c ) 2 i = 1 m m X r,r ′ =1 a r a r ′ ˆ σ κ ( w r , x ) ˆ σ κ ( w r ′ , x ) + 1 m m X r,r ′ =1 a r a r ′ κ 2 u ⊤ r u r ′ Φ 1  w ⊤ r x κ ∥ u r ∥ 2  Φ 1  w ⊤ r ′ x κ ∥ u r ′ ∥ 2  + 1 m m X r,r ′ =1 a r a r ′  C r,r ′ + κ 2 2 π E r,r ′ + κ √ 2 π ( T 1 ,r,r ′ + T 2 ,r,r ′ )  = 1 √ m m X r =1 a r ˆ σ κ ( w r , x ) ! 
2 + κ 2      1 √ m m X r =1 a r u r Φ  w ⊤ r x κ ∥ u r ∥ 2       2 2 + 1 m m X r,r ′ =1 a r a r ′  C r,r ′ + κ 2 2 π E r,r ′ + κ √ 2 π ( T 1 ,r,r ′ + T 2 ,r,r ′ )  Com bining the expression of E c h f ( θ , x ⊙ c ) 2 i and E c [ f ( θ , x ⊙ c )] , and noticing T 1 ,r,r ′ = T 2 ,r ′ ,r , we hav e E c h ( f ( θ , x ⊙ c ) − y ) 2 i = 1 √ m m X r =1 a r ˆ σ κ ( w r , x ) ! 2 + κ 2      1 √ m m X r =1 a r u r Φ  w ⊤ r x κ ∥ u r ∥ 2       2 2 + 1 m m X r,r ′ =1 a r a r ′  C r,r ′ + κ 2 2 π E r,r ′ + κ √ 2 π ( T 1 ,r,r ′ + T 2 ,r,r ′ )  − 2 y √ m m X r =1 a r ˆ σ κ ( w r , x ) − 2 κy √ 2 π m m X r =1 a r ∥ u r ∥ 2 exp −  w ⊤ r x  2 ∥ u r ∥ 2 2 ! + y 2 = 1 √ m m X r =1 a r ˆ σ κ ( w r , x ) − y ! 2 + κ 2 m      m X r =1 a r u r Φ  w ⊤ r x κ ∥ u r ∥ 2       2 2 + 1 m m X r,r ′ =1 a r a r ′  C r,r ′ + κ 2 2 π E r,r ′ + 2 κ √ 2 π T 1 ,r,r ′  − 2 κy √ 2 π m m X r =1 a r ∥ u r ∥ 2 exp −  w ⊤ r x  2 2 κ 2 ∥ u r ∥ 2 2 ! T o extend to the case of x i , y i , we need to re-define C i,r,r ′ =  w ⊤ r x i   w ⊤ r ′ x i  + κ 2 u ⊤ i,r u i,r ′  C w ⊤ r x i κ ∥ u i,r ∥ 2 , w ⊤ r ′ x i κ ∥ u i,r ′ ∥ 2 , u ⊤ i,r u i,r ′ ∥ u i,r ∥ 2 ∥ u i,r ′ ∥ 2 ! E i,r,r ′ = ∥ u i,r ∥ 2 ∥ u i,r ′ ∥ 2 exp   − ∥ u i,r ′ ∥ 2 2  w ⊤ r x i  2 − 2  u ⊤ i,r u i,r ′   w ⊤ r x i   w ⊤ r ′ x i  + ∥ u i,r ∥ 2 2  w ⊤ r ′ x i  2 2 κ 2  ∥ u i,r ∥ 2 2 ∥ u i,r ′ ∥ 2 2 −  u ⊤ i,r u i,r ′  2    T i,r,r ′ = w ⊤ r ′ x i · Φ 1    ∥ u i,r ∥ 2 2 · w ⊤ r ′ x i − u ⊤ i,r u i,r ′ · w ⊤ r x i κ ∥ u i,r ∥ 2  ∥ u i,r ∥ 2 2 ∥ u i,r ′ ∥ 2 2 −  u ⊤ i,r u i,r ′  2  1 2    G i,r = ∥ u i,r ∥ 2 exp −  w ⊤ r x i  2 2 κ 2 ∥ u r ∥ 2 2 ! 24 Moreo ver, let ˆ f ( θ , x ) = 1 √ m P m r =1 a r ˆ σ κ ( w r , x ) . 
Then we hav e that E C [ L C ( θ )] = 1 2 n X i =1  ˆ f ( θ , x i ) − y i  2 2 + κ 2 2 m n X i =1      m X r =1 a r u i,r Φ  w ⊤ r x i κ ∥ u i,r ∥ 2       2 2 + 1 m n X i =1 m X r,r ′ =1 a r a r ′ C i,r,r ′ + κ 2 2 π E i,r,r ′ + κ r 2 π T i,r,r ′ G i,r ! − κy r 2 π m n X i =1 m X r =1 a r G i,r = 1 2 n X i =1  ˆ f ( θ , x i ) − y i  2 2 + κ 2 2 m n X i =1      m X r =1 a r u i,r Φ  w ⊤ r x i κ ∥ u i,r ∥ 2       2 2 + 1 m n X i =1 m X r,r ′ =1 a r a r ′  C i,r,r ′ + κ 2 2 π E i,r,r ′  + 2 κ √ 2 π m n X i =1 m X r =1 a r G i,r 1 √ m m X r ′ =1 a ′ r T i,r,r ′ − y ! B.2 Pro of of Theorem 4.2 Pr o of. Let u i,r = w r ⊙ x i . By Lemma B.1, we hav e that E C [ L C ( θ )] = 1 2 n X i =1  ˆ f ( θ , x i ) − y i  2 2 + κ 2 2 m n X i =1      m X r =1 a r u i,r Φ  w ⊤ r x i κ ∥ u i,r ∥ 2       2 2 + 1 m n X i =1 m X r,r ′ =1 a r a r ′  C i,r,r ′ + κ 2 2 π E i,r,r ′  + 2 κ √ 2 π m n X i =1 m X r =1 a r G i,r 1 √ m m X r ′ =1 a ′ r T i,r,r ′ − y ! where C i,r,r ′ , E i,r,r ′ , T i,r,r ′ and G i,r are defined as C i,r,r ′ =  w ⊤ r x i   w ⊤ r ′ x i  + κ 2 u ⊤ i,r u i,r ′  C w ⊤ r x i κ ∥ u i,r ∥ 2 , w ⊤ r ′ x i κ ∥ u i,r ′ ∥ 2 , u ⊤ i,r u i,r ′ ∥ u i,r ∥ 2 ∥ u i,r ′ ∥ 2 ! E i,r,r ′ = ∥ u i,r ∥ 2 ∥ u i,r ′ ∥ 2 exp   − ∥ u i,r ′ ∥ 2 2  w ⊤ r x i  2 − 2  u ⊤ i,r u i,r ′   w ⊤ r x i   w ⊤ r ′ x i  + ∥ u i,r ∥ 2 2  w ⊤ r ′ x i  2 2 κ 2  ∥ u i,r ∥ 2 2 ∥ u i,r ′ ∥ 2 2 −  u ⊤ i,r u i,r ′  2    T i,r,r ′ = w ⊤ r ′ x i · Φ 1    ∥ u i,r ∥ 2 2 · w ⊤ r ′ x i − u ⊤ i,r u i,r ′ · w ⊤ r x i κ ∥ u i,r ∥ 2  ∥ u i,r ∥ 2 2 ∥ u i,r ′ ∥ 2 2 −  u ⊤ i,r u i,r ′  2  1 2    G i,r = ∥ u i,r ∥ 2 exp −  w ⊤ r x i  2 2 κ 2 ∥ u r ∥ 2 2 ! Therefore, the pro of of the theorem relies on the upp er b ound of C i,r,r ′ , E i,r,r ′ , T i,r,r ′ and G i,r . 
T o upp er-b ound C i,r,r ′ , w e utilize the result in that | C ( a, b, ρ ) | ≤ | arcsin ρ | 2 π exp  − a 2 − 2 ρab + b 2 2 (1 − ρ 2 )  ≤ | ρ | 4 exp  − a 2 + b 2 4  25 where we used Lemma D.19 that | arcsin x | ≤ π 2 · | x | . Plugging in a = w ⊤ r x i κ ∥ u i,r ∥ 2 , b = w ⊤ r ′ x i κ ∥ u i,r ′ ∥ 2 and ρ = u ⊤ i,r u i,r ′ ∥ u i,r ∥ 2 ∥ u i,r ′ ∥ 2 giv es | C i,r,r ′ | ≤    w ⊤ r x i   w ⊤ r ′ x i  + κ 2 u ⊤ i,r u i,r ′   ·   u ⊤ i,r u i,r ′   4 ∥ u i,r ∥ 2 ∥ u i,r ′ ∥ 2 exp − 1 4 κ 2  w ⊤ r x i  2 ∥ u i,r ∥ 2 2 +  w ⊤ r ′ x i  2 ∥ u i,r ′ ∥ 2 2 !! ≤ 1 4     w ⊤ r x i   w ⊤ r ′ x i    + κ 2 ∥ u i,r ∥ 2 ∥ u i,r ′ ∥ 2  ϕ  w ⊤ r x i 2 κ ∥ u i,r ∥ 2  ϕ  w ⊤ r ′ x i 2 κ ∥ u i,r ′ ∥ 2  = κ 2 4 ∥ u i,r ∥ 2 ∥ u i,r ′ ∥ 2   w ⊤ r x i   κ ∥ u i,r ∥ 2 ·   w ⊤ r ′ x i   κ ∥ u i,r ∥ 2 + 1 ! ϕ  w ⊤ r x i 2 κ ∥ u i,r ∥ 2  ϕ  w ⊤ r ′ x i 2 κ ∥ u i,r ′ ∥ 2  = κ 2 4 ∥ u i,r ∥ 2 ∥ u i,r ′ ∥ 2  ψ  w ⊤ r x i 2 κ ∥ u i,r ∥ 2  ψ  w ⊤ r ′ x i 2 κ ∥ u i,r ′ ∥ 2  + ϕ  w ⊤ r x i 2 κ ∥ u i,r ∥ 2  ϕ  w ⊤ r ′ x i 2 κ ∥ u i,r ′ ∥ 2  where w e use the definition P i,r =   w ⊤ r x i   · exp  − ( w ⊤ r x i ) 2 4 κ 2 ∥ u i,r ∥ 2 2  . F or the term E i,r,r ′ , w e notice that b y letting a = w ⊤ r x i κ ∥ u i,r ∥ 2 , b = w ⊤ r ′ x i κ ∥ u i,r ′ ∥ 2 and ρ = u ⊤ i,r u i,r ′ ∥ u i,r ∥ 2 ∥ u i,r ′ ∥ , we hav e exp   − ∥ u i,r ′ ∥ 2 2  w ⊤ r x i  2 − 2  u ⊤ i,r u i,r ′   w ⊤ r x i   w ⊤ r ′ x i  + ∥ u i,r ∥ 2 2  w ⊤ r ′ x i  2 2 κ 2  ∥ u i,r ∥ 2 2 ∥ u i,r ′ ∥ 2 2 −  u ⊤ i,r u i,r ′  2    = exp  − a 2 − 2 ρab + b 2 2 (1 − ρ 2 )  Using exp  − a 2 − 2 ρab + b 2 2(1 − ρ 2 )  ≤ exp  − a 2 + b 2 4  , we hav e that | E i,r,r ′ | ≤ ∥ u i,r ∥ 2 ∥ u i,r ′ ∥ 2 exp − 1 4 κ 2  w ⊤ r x i  2 ∥ u i,r ∥ 2 2 +  w ⊤ r ′ x i  2 ∥ u i,r ′ ∥ 2 2 !! 
= ∥ u i,r ∥ 2 ∥ u i,r ′ ∥ 2 ϕ  w ⊤ r x i 2 κ ∥ u i,r ∥ 2  ϕ  w ⊤ r ′ x i 2 κ ∥ u i,r ′ ∥ 2  Therefore, w e hav e     C i,r,r ′ + κ 2 2 π E i,r,r ′     ≤ κ 2 4 ∥ u i,r ∥ 2 ∥ u i,r ′ ∥ 2  ψ  w ⊤ r x i 2 κ ∥ u i,r ∥ 2  ψ  w ⊤ r ′ x i 2 κ ∥ u i,r ′ ∥ 2  + ϕ  w ⊤ r x i 2 κ ∥ u i,r ∥ 2  ϕ  w ⊤ r ′ x i 2 κ ∥ u i,r ′ ∥ 2  This giv es that       m X r,r ′ =1 a r a r ′  C i,r,r ′ + κ 2 2 π E i,r,r ′        ≤ κ 2 4   m X r =1 ∥ u i,r ∥ 2 ψ  w ⊤ r x i 2 κ ∥ u i,r ∥ 2  ! 2 + m X r =1 ∥ u i,r ∥ 2 ϕ  w ⊤ r x i 2 κ ∥ u i,r ∥ 2  ! 2   (21) By definition, we hav e ∥ u i,r ∥ 2 ≤ R u . Therefore       m X r,r ′ =1 a r a r ′  C i,r,r ′ + κ 2 2 π E i,r,r ′        ≤ 1 4 κ 2 R 2 u   m X r =1 ψ  w ⊤ r x i 2 κ ∥ u i,r ∥ 2  ! 2 + m X r =1 ϕ  w ⊤ r x i 2 κ ∥ u i,r ∥ 2  ! 2   26 Next, we fo cus on the term T i,r,r ′ and G i,r . By the property of CDF, we hav e that | T i,r,r ′ | ≤   w ⊤ r ′ x i   ≤ ∥ w r ∥ 2 . Therefore      1 √ m m X r ′ =1 a r ′ T i,r,r ′ − y i      ≤ 1 √ m m X r ′ =1 ∥ w r ′ ∥ 2 + | y i | ≤ √ mR w + B y ≤ 2 √ mR w where w e applied ∥ w r ∥ 2 ≤ R w and B y ≤ 3 √ mR w . Thus      m X r =1 a r G i,r 1 √ m m X r ′ =1 a r ′ T i,r,r ′ − y i !      ≤ m X r =1 G i,r · 2 √ mR w ≤ √ mR w m X r =1 ϕ  w ⊤ r x i 2 κ ∥ u i,r ∥ 2  where w e used ∥ u i,r ∥ 2 ≤ R u . Combining the inequality ab ov e and (21), we ha ve |E | ≤ nκ 2 R 2 u 4 m   m X r =1 ψ  w ⊤ r x i 2 κ ∥ u i,r ∥ 2  ! 2 + m X r =1 ϕ  w ⊤ r x i 2 κ ∥ u i,r ∥ 2  ! 2   + nκR w 2 m X r =1 ϕ  w ⊤ r x i 2 κ ∥ u i,r ∥ 2  Applying the definition of ψ max and ϕ max giv es the desired results. B.3 Pro of of Theorem 4.9 Pr o of. 
By the form of the gradient, we hav e E C k [ ∇ w r L C ( θ )] = a r √ m n X i =1 E C [( f ( θ , x i ⊙ c i ) − y i ) x i ⊙ c i I {⟨ w r , x i ⊙ c i ⟩ ≥ 0 } ] = a r √ m n X i =1 E c i [ f ( θ , x i ⊙ c i ) x i ⊙ c i I {⟨ w r , x i ⊙ c i ⟩ ≥ 0 } ] | {z } T 1 ,i − a r √ m n X i =1 y i E c i [ x i ⊙ c i I {⟨ w r , x i ⊙ c i ⟩ ≥ 0 } ] | {z } T 2 ,i (22) Let u r,i = w r ⊙ x i . F or T 1 ,i , we further hav e T 1 ,i = 1 √ m m X r ′ =1 a r ′ E c i  σ  w ⊤ r ′ ( x i ⊙ c i )  x i ⊙ c i I  w ⊤ r ( x i ⊙ c i ) ≥ 0  = 1 √ m m X r ′ =1 a r ′ E c i h ( x i ⊙ c i ) ( x i ⊙ c i ) ⊤ w r ′ I  w ⊤ r ( x i ⊙ c i ) ≥ 0; w ⊤ r ′ ( x i ⊙ c i ) ≥ 0  i = 1 √ m m X r ′ =1 a r ′  E c i  c i c ⊤ i I  w ⊤ r ( x i ⊙ c i ) ≥ 0; w ⊤ r ′ ( x i ⊙ c i ) ≥ 0  ⊙  x i x ⊤ i  w r ′ = 1 √ m m X r ′ =1 a r ′  E c i  c i c ⊤ i I  u ⊤ r,i c i ≥ 0; u ⊤ r ′ ,i c i ≥ 0  ⊙  x i x ⊤ i  w r ′ = 1 √ m m X r =1 a r ′ Diag ( x ) i E c i  c i c ⊤ i I  u ⊤ r,i c i ≥ 0; u ⊤ r ′ ,i c i ≥ 0  u r ′ ,i F or T 2 ,i , we can easily obtain T 2 ,i = E c i  c i I  u ⊤ r,i c i ≥ 0  ⊙ x i Abstractly , we are thus in terested in the following quantit y: E c  cc ⊤ I  c ⊤ u ≥ 0; c ⊤ v ≥ 0  ; E c  c I  c ⊤ u ≥ 0  where c ∼ N  µ , κ 2 I  , and u , v are fixed vectors. Let z 1 = c ⊤ u and z 2 = c ⊤ v . Then we hav e z 1 ∼ N  c ⊤ u , κ 2 ∥ u ∥ 2 2  ; z 2 ∼ N  c ⊤ v , κ 2 ∥ v ∥ 2 2  27 A ccording to Lemma D.11 and Lemma D.12, and by defining µ = 1 , we hav e that ∆ (1) k,r,i ∈ R d and ∆ (2) k,r,r ′ ,i ∈ R d × d defined b elo w ∆ (1) k,r,i := E c i  c i I  c ⊤ i u r,i ≥ 0  − 1 · Φ 1  w ⊤ r x i κ ∥ w r ⊙ x i ∥ 2  ∆ k,r,r ′ ,i := E c  cc ⊤ I  u ⊤ r,i c i ≥ 0; u ⊤ r ′ ,i c i ≥ 0  u r ′ ,i −  11 ⊤ u r ′ ,i + 3 κ 2 u r ′ ,i  Φ 1  w ⊤ r x i κ ∥ w r ⊙ x i ∥ 2  Φ 1  w ⊤ r ′ x i κ ∥ w r ′ ⊙ x i ∥ 2  satisfies    ∆ (1) r,i    ∞ ≤ κR u ϕ max    ∆ (2) r,r ′ ,i    ∞ ≤ 4 κ ∥ v ∥ 2  √ dϕ max + ψ max  Here w e used ∥ µ ∥ ∞ = 1 and µ ⊤ ( w r ⊙ x i ) = w ⊤ k,r x i when µ = 1 . 
Therefore, for T 1 ,i , we hav e T 1 ,i = 1 √ m m X r ′ =1 a r ′ Diag ( x i )   w ⊤ r ′ x i · 1  Φ 1  w ⊤ r ′ x i κ ∥ w r ′ s ⊙ x i ∥ 2  Φ 1  w ⊤ r ′ x i κ ∥ w r ′ ⊙ x i ∥ 2  + ∆ (2) r,r ′ ,i  + 3 κ 2 √ m m X r ′ =1 a r ′ Diag ( x i ) ( w r ⊙ x i ) Φ 1  w ⊤ r x i κ ∥ w r ⊙ x i ∥ 2  Φ 1  w ⊤ r ′ x i κ ∥ w r ′ ⊙ x i ∥ 2  = 1 √ m m X r ′ =1 a r ′ w ⊤ r ′ x i · x i Φ 1  w ⊤ r ′ x i κ ∥ w r ′ s ⊙ x i ∥ 2  Φ 1  w ⊤ r ′ x i κ ∥ w r ′ ⊙ x i ∥ 2  + 1 √ m m X r ′ =1 a r ′  x i ⊙ ∆ (2) r,r ′ ,i  + 3 κ 2 √ m m X r ′ =1 a r ′ Diag ( x i ) 2 w r ′ Φ 1  w ⊤ r ′ x i κ ∥ w r ′ s ⊙ x i ∥ 2  Φ 1  w ⊤ r ′ x i κ ∥ w r ′ ⊙ x i ∥ 2  = f ( θ , x i ) x i I  w ⊤ r x i ≥ 0  + 3 κ 2 √ m m X r ′ =1 a r ′ Diag ( x i ) 2 w r ′ I  w ⊤ r x i ≥ 0; w ⊤ r ′ x i ≥ 0  + 1 √ m m X r ′ =1 a r ′  x i ⊙ ∆ (2) r,r ′ ,i  + g 1 ,i + g 2 ,i where g 1 ,i = 1 √ m m X r ′ =1 a r ′ w ⊤ r ′ x i · x i  Φ 1  w ⊤ r ′ x i κ ∥ w r ′ s ⊙ x i ∥ 2  Φ 1  w ⊤ r ′ x i κ ∥ w r ′ ⊙ x i ∥ 2  − I  w ⊤ r x i ≥ 0; w ⊤ r ′ x i ≥ 0   g 2 ,i = 3 κ 2 √ m m X r ′ =1 a r ′ Diag ( x i ) 2 w r ′  Φ 1  w ⊤ r ′ x i κ ∥ w r ′ s ⊙ x i ∥ 2  Φ 1  w ⊤ r ′ x i κ ∥ w r ′ ⊙ x i ∥ 2  − I  w ⊤ r x i ≥ 0; w ⊤ r ′ x i ≥ 0   Lik ely , for T 2 ,i w e ha ve T 2 ,i =  1Φ 1  w ⊤ r x i κ ∥ w r ⊙ x i ∥ 2  + ∆ (1) r,i  ⊙ x i = x i Φ 1  w ⊤ r x i κ ∥ w r ⊙ x i ∥ 2  + ∆ (1) r,i ⊙ x i = x i I  w ⊤ r x i ≥ 0  + ∆ (1) r,i ⊙ x i + g 3 ,i 28 where g 3 ,i = x i ·  Φ 1  w ⊤ r x i κ ∥ w r ⊙ x i ∥ 2  − I  w ⊤ r x i ≥ 0   . Therefore, the final gradient is giv en b y E C [ ∇ w r L C ( θ )] = a r √ m n X i =1 ( f ( θ , x i ) − y i ) x i I  w ⊤ r x i  + a r √ m n X i =1 1 √ m m X r ′ =1 a r ′  x i ⊙ ∆ (2) r,r ′ ,i  + y i ∆ (1) r,i ⊙ x i ! 
+ 3 κ 2 √ m m X r ′ =1 a r ′ Diag ( x i ) 2 w r ′ I  w ⊤ r x i ≥ 0; w ⊤ r ′ x i ≥ 0  + a r √ m n X i =1 ( g 1 ,i + g 2 ,i − y i · g 3 ,i ) = ∇ w r L ( θ ) + 3 κ 2 √ m m X r ′ =1 a r ′ Diag ( x i ) 2 w r ′ I  w ⊤ r x i ≥ 0; w ⊤ r ′ x i ≥ 0  + a r √ m n X i =1 1 √ m m X r ′ =1 a r ′  x i ⊙ ∆ (2) r,r ′ ,i  + y i ∆ (1) r,i ⊙ x i ! | {z } g 4 + a r √ m n X i =1 ( g 1 ,i + g 2 ,i − y i · g 3 ,i ) Notice that we can re-write g 1 ,i as g 1 ,i = x i  Φ 1  w ⊤ r x i κ ∥ w r ⊙ x i ∥ 2  − I  w ⊤ r x i ≥ 0   · 1 √ m m X r ′ =1 a r ′ w ⊤ r ′ x i I  w ⊤ r ′ x i ≥ 0  + x i Φ 1  w ⊤ r x i κ ∥ w r ⊙ x i ∥ 2  · 1 √ m m X r ′ =1 a r ′ w ⊤ r ′ x i  Φ 1  w ⊤ r ′ x i κ ∥ w r ′ ⊙ x i ∥ 2  − I  w ⊤ r ′ x i ≥ 0   = x i Φ 1  w ⊤ r x i κ ∥ w r ⊙ x i ∥ 2  · 1 √ m m X r ′ =1 a r ′ w ⊤ r ′ x i  Φ 1  w ⊤ r ′ x i κ ∥ w r ′ ⊙ x i ∥ 2  − I  w ⊤ r ′ x i ≥ 0   + x i  Φ 1  w ⊤ r x i κ ∥ w r ⊙ x i ∥ 2  − I  w ⊤ r x i ≥ 0   · f ( θ , x i ) Then, b y the definition of g 3 ,i , we hav e that g 1 ,i − y i · g 3 ,i = x i Φ 1  w ⊤ r x i κ ∥ w r ⊙ x i ∥ 2  · 1 √ m m X r ′ =1 a r ′ w ⊤ r ′ x i  Φ 1  w ⊤ r ′ x i κ ∥ w r ′ ⊙ x i ∥ 2  − I  w ⊤ r ′ x i ≥ 0   + ( f ( θ , x i ) − y i ) x i  Φ 1  w ⊤ r x i κ ∥ w r ⊙ x i ∥ 2  − I  w ⊤ r x i ≥ 0   Using Lemma D.4, we hav e that | Φ 1 ( a ) − I { a ≥ 0 }| ≤ exp  − a 2 2  ≤ ϕ  a 2  29 Therefore, w e hav e that      n X i =1 ( g 1 ,i − y i · g 3 ,i )      2 ≤ n √ m      m X r ′ =1 a r ′ w ⊤ r ′ x i  Φ 1  w ⊤ r ′ x i κ ∥ w r ′ ⊙ x i ∥ 2  − I  w ⊤ r ′ x i ≥ 0        2 +      n X i =1 ( f ( θ , x i ) − y i ) · x i  Φ 1  w ⊤ r x i κ ∥ w r ⊙ x i ∥ 2  − I  w ⊤ r x i ≥ 0        2 ≤ n √ m m X r ′ =1   w ⊤ r ′ x i   ϕ  w ⊤ r ′ x i 2 κ ∥ w r ′ ⊙ x i ∥ 2  + ∥ Diag ( ∆ ) X ( f ( θ − y )) ∥ 2 = κ √ mR u ψ max + σ max ( X ) ϕ max L ( θ ) 1 2 Moreo ver, w e can b ound g 2 ,i as ∥ g 2 ,i ∥ ≤ 3 κ 2 √ m m X r ′ =1 ∥ x ∥ 2 ∞ ∥ w r ∥ · 2 ϕ max ≤ 6 κ 2 √ mB 2 x R w ϕ max Lastly , we can b 
ound g 3 as ∥ g 3 ∥ 2 ≤ 1 m n X i =1 m X r ′ =1    x i ⊙ ∆ (2) r,r ′ ,i    + 1 √ m n X i =1 | y i |    ∆ (1) r,i ⊙ x i    2 ≤ 1 m n X i =1 m X r ′ =1    ∆ (2) r,r ′ ,i    ∞ ∥ x i ∥ 2 + 1 √ m n X i =1 | y i |    ∆ (1) r,r ′ ,i    ∞ ∥ x i ∥ 2 ≤ 1 m n X i =1 m X r ′ =1    ∆ (2) r,r ′ ,i    ∞ + B y √ m n X i =1    ∆ (1) r,r ′ ,i    ∞ ≤ 4 nκR u  √ dϕ max + ψ max  + B y √ m · nκR u ϕ max ≤ 5 nκR u  √ dϕ max + ψ max  when B y ≤ √ md . Therefore, we hav e that      E C [ ∇ w r L C ( θ )] − ∇ w r L ( θ ) + 3 κ 2 √ m m X r ′ =1 a r ′ Diag ( x i ) 2 w r ′ I  w ⊤ r x i ≥ 0; w ⊤ r ′ x i ≥ 0  !      2 ≤ nκR u ψ max + σ max ( X ) ϕ max √ m L ( θ ) 1 2 + 6 nκ 2 B 2 x R w ϕ max + 5 nκR u  √ dϕ max + ψ max  ≤  σ max ( X ) √ m L ( θ ) 1 2 + 6 nκ 2 B 2 x R w + 5 nκR u √ d  ϕ max + 6 nκR u ψ max C Pro ofs in Section 5 C.1 Pro of of Theorem 5.2 Pr o of. T o start the pro of, w e define the following quantit y in the standard NTK-based analysis of tw o-lay er ReLU neural netw ork. Let R = C 1 · τ λ 0 n for some C 1 > 0 , we define even t A i,r and set S i , S ⊥ i as A i,r =  ∃ w ∈ B ( w 0 ,r , R ) : I  w ⊤ 0 ,r x i ≥ 0   = I  w ⊤ x i ≥ 0  (23) S i = { r ∈ [ m ] : ¬ A i,r } ; S ⊥ i = [ m ] \ S i (24) 30 Lemma 16 from shows that with probability at least 1 − n exp  − mR τ  , we ha v e that   S ⊥ i   ≤ 4 mR τ . In the follo wing of the pro of, w e assume that such even t holds. Define K ′ = min  k ∈ N : ∃ r ∈ [ m ] s.t. ∥ w k,r − w 0 ,r ∥ 2 > R  . Then for all k < K ′ , w e hav e that w k,r ∈ B ( w 0 ,r , R ) . Fix an y k < K ′ − 1 . 
Consider the expansion of L ( θ k +1 ) as the following L ( θ k +1 ) = 1 2 n X i =1 ( f ( θ k +1 , x i ) − y i ) 2 = 1 2 n X i =1 (( f ( θ k +1 , x i ) − f ( θ k , x i )) + ( f ( θ k , x i ) − y i )) 2 = 1 2 n X i =1 ( f ( θ k +1 , x i ) − f ( θ k , x i )) 2 + n X i =1 ( f ( θ k +1 , x i ) − f ( θ k , x i )) ( f ( θ k , x i ) − y i ) + 1 2 n X i =1 ( f ( θ k , x i ) − y i ) 2 (25) W e will analyze the three terms separately . T o start, notice that 1 2 n X i =1 ( f ( θ k , x i ) − y i ) 2 = L ( θ k ) (26) F or the first term, by the definition of f ( θ , x ) , we ha ve | f ( θ k +1 , x i ) − f ( θ k , x i ) | =      1 √ m m X r =1 a r  σ  w ⊤ k +1 ,r x i  − σ  w ⊤ k,r x i       ≤ 1 √ m m X r =1   σ  w ⊤ k +1 ,r x i  − σ  w ⊤ k,r x i    ≤ 1 √ m m X r =1    ( w k +1 − w k ) ⊤ x i    ≤ 1 √ m m X r =1 ∥ w k +1 − w k ∥ = η √ m m X r =1    ∇ w r ˆ L ( θ k , ξ k )    2 where in the first inequality w e use the fact that a = ± 1 , and in the second inequality w e use the 1 -Lipsc hitzness of ReLU. Applying Assumption 5.1, we hav e that n X i =1 ( f ( θ k +1 , x i ) − f ( θ k , x i )) 2 ≤ η 2 m n X i =1 m X r =1    ∇ w r ˆ L ( θ k , ξ k )    2 ! ≤ η 2 n m ·  m · q γ ˆ L ( θ k , ξ k )  2 = η 2 mnγ ˆ L ( θ k , ξ k ) (27) Lastly , to analyze the second term, we use the following definition of I i,k and I ⊥ i,k I i,k = 1 √ m X r ∈ S i a r σ  w ⊤ k,r x i  ; I ⊥ i,k = 1 √ m X r ∈ S ⊥ i a r σ  w ⊤ k,r x i  Then w e hav e that f ( θ k , x i ) = I i,k + I ⊥ i,k . 
Therefore f ( θ k +1 , x i ) − f ( θ k , x i ) = ( I i,k +1 − I i,k ) +  I ⊥ i,k +1 − I ⊥ i,k  31 By the 1 -Lipschitzness of ReLU, we hav e that   I ⊥ i,k +1 − I ⊥ i,k   =       1 √ m X r ∈ S ⊥ i a r  σ  w ⊤ k +1 ,r x i  − σ  w ⊤ k,r x i        ≤ 1 √ m X r ∈ S ⊥ i   σ  w ⊤ k +1 ,r x i  − σ  w ⊤ k,r x i    ≤ 1 √ m X r ∈ S ⊥ i    ( w k +1 ,r − w k,r ) ⊤ x i    ≤ η √ m X r ∈ S ⊥ i    ∇ w r ˆ L ( θ k , ξ k )    2 ≤ η √ γ √ m   S ⊥ i   ˆ L ( θ k , ξ k ) 1 2 Applying   S ⊥ i   ≤ 4 mR τ giv es    I ⊥ i,k +1 − I ⊥ i,k    ≤ 4 η R τ √ γ m ˆ L ( θ k , ξ k ) 1 2 .This giv es that n X i =1 ( f ( θ k +1 , x i ) − f ( θ k , x i )) ( f ( θ k , x i ) − y i ) = n X i =1 ( I i,k +1 − I i,k ) ( f ( θ k , x i ) − y i ) + n X i =1  I ⊥ i,k +1 − I ⊥ i,k  ( f ( θ k , x i ) − y i ) ≤ n X i =1 ( I i,k +1 − I i,k ) ( f ( θ k , x i ) − y i ) + n X i =1  I ⊥ i,k +1 − I ⊥ i,k  2 ! 1 2 n X i =1 ( f ( θ k , x i ) − y i ) 2 ! 1 2 ≤ n X i =1 ( I i,k +1 − I i,k ) ( f ( θ k , x i ) − y i ) + 4 η R τ √ γ mn ˆ L ( θ k , ξ k ) 1 2 L ( θ k ) 1 2 (28) Plugging (26), (27), and (28) into (25) gives L ( θ k +1 ) ≤ L ( θ k ) + η 2 mnγ ˆ L ( θ k , ξ k ) + 4 η R τ √ γ mn ˆ L ( θ k , ξ k ) 1 2 L ( θ k ) 1 2 + n X i =1 ( I i,k +1 − I i,k ) ( f ( θ k , x i ) − y i ) Under Jensen’s inequality , w e ha ve that E ξ k h ˆ L ( θ k , ξ k ) 1 2 i ≤ E ξ k h ˆ L ( θ k , ξ k ) i 1 2 . 
Using the prop erty that E ξ k h ˆ L ( θ k , ξ k ) i ≤ 2 L ( θ k ) + ε 1 from Assumption 5.1, we can also obtain that E ξ k h ˆ L ( θ k , ξ k ) 1 2 i ≤ (2 L ( θ k ) + ε 1 ) 1 2 32 Therefore, taking the exp ectation of L ( θ k +1 ) giv es E ξ k [ L ( θ k +1 )] ≤ L ( θ k ) + η 2 mnγ (2 L ( θ k ) + ε ) + 4 η R τ √ γ mn (2 L ( θ k ) + ε 1 ) 1 2 L ( θ k ) 1 2 + n X i =1 E ξ k [ I i,k +1 − I i,k ] ( f ( θ k , x i ) − y i ) ≤ L ( θ k ) + η 2 mnγ (2 L ( θ k ) + ε 1 ) + 10 η R τ √ γ mn L ( θ k ) + 4 η R τ √ γ mn · ε 1 + n X i =1 E ξ k [ I i,k +1 − I i,k ] ( f ( θ k , x i ) − y i ) =  1 + 2 η 2 mnγ + 10 C η λ 0 r γ m n  L ( θ k ) +  η 2 mnγ + 4 C η λ 0 r γ m n  ε 1 + n X i =1 E ξ k [ I i,k +1 − I i,k ] ( f ( θ k , x i ) − y i ) (29) where in the last inequality w e use the prop erty that p a ( a + b ) ≤ 5 4 a + b . Recall that w k +1 ,r , w k,r ∈ B ( w 0 ,r , R ) . Therefore, for r ∈ S i , w e m ust hav e that I n w ⊤ k +1 ,r x i ≥ 0 o = I  w ⊤ 0 ,r x i ≥ 0  = I n w ⊤ k,r x i ≥ 0 o . Th us, w e ha ve E ξ k [ I i,k +1 − I i,k ] = 1 √ m X r ∈ S i a r E ξ k [ w k +1 ,r − w k,r ] ⊤ x i I  w ⊤ k,r x i ≥ 0  = − η √ m X r ∈ S i a r E ξ k h ∇ w r ˆ L ( θ k , ξ k ) i ⊤ x i I  w ⊤ k,r x i ≥ 0  (30) Let g k,r = E ξ k h ∇ w r ˆ L ( θ k , ξ k ) i − ∇ w r L ( θ k ) . Then b y Assumption 5.1 we hav e that ∥ g k,r ∥ 2 ≤ ε 3 L ( θ ) 1 2 + ε 2 . 
Using g k,r , we can write (30) as E ξ k [ I i,k +1 − I i,k ] = − η X r ∈ S i a r √ m ∇ w r L ( θ k ) ⊤ x i I  w ⊤ k,r x i ≥ 0  − η √ m X r ∈ S i a r g ⊤ k,r x i I  w ⊤ k,r x i ≥ 0  (31) By definition, we hav e ∇ w r L ( θ k ) = a r √ m n X j =1 ( f ( θ k , x i ) − y i ) x j I  w ⊤ k,r x j ≥ 0  Therefore, w e hav e that a r √ m ∇ w r L ( θ k ) ⊤ x i I  w ⊤ k,r x i ≥ 0  = 1 m n X j =1 ( f ( θ k , x i ) − y i ) x ⊤ i x j I  w ⊤ k,r x i ≥ 0 w ⊤ k,r x j ≥ 0  33 Com bining with (31), we hav e that n X i =1 E ξ k [ I i,k +1 − I i,k ] ( f ( θ k , x i ) − y i ) = − η m n X i,j =1 X r ∈ S i ( f ( θ k , x i ) − y i ) ( f ( θ k , x j ) − y j ) x ⊤ i x j I  w ⊤ k,r x i ≥ 0; w ⊤ k,r x j ≥ 0  − η √ m n X i =1 X r ∈ S i ( f ( θ k , x i ) − y i ) a r g ⊤ k,r x i I  w ⊤ k,r x i ≥ 0  ≤ − η n X i,j =1 ( f ( θ k , x i ) − y i ) x ⊤ i x j m X r ∈ S i I  w ⊤ k,r x i ≥ 0; w ⊤ k,r x j ≥ 0  ! | {z } H k,ij ( f ( θ k , x j ) − y j ) + η √ m m X i =1 n X r =1 | f ( θ k , x i ) − y i | ∥ g k,r ∥ 2 ≤ − η λ min ( H k ) n X i =1 ( f ( θ k , x i ) − y i ) 2 + η ε 2 √ mn n X r =1 ( f ( θ k , x i ) − y i ) 2 ! 1 2 + η ε 3 √ mn L ( θ k ) = −  2 η λ min ( H k ) + η ε 3 √ mn  L ( θ k ) + 2 η ε 2 √ mn L ( θ k ) 1 2 Using the prop ert y that ab ≤ a 2 2 + b 2 2 , we hav e that for any C ′ > 0 , 2 η ε 2 √ mn L ( θ k ) 1 2 ≤ C ′ η λ 0 r γ m n + η ε 2 2 n C ′ λ 0 r mn γ Moreo ver, b y Lemma C.1, we hav e that when m = Ω  n 2 λ 2 0 log n δ  , with probabilit y at least 1 − δ − n 2 exp − mR τ , it holds that ∥ H k − H ∞ ∥ F ≤ λ 0 6 + 1 m   n X i,j =1   S ⊥ i   2   1 2 + 2 nR τ Plugging in   S ⊥ i   ≤ 4 mR τ and R ≤ C 1 · τ λ 0 n , we hav e that ∥ H k − H ∞ ∥ F ≤ λ 0 2 for small enough C 1 . Thus, we ha ve that λ min ( H k ) ≥ λ 0 2 . 
Therefore, we hav e that n X i =1 E ξ k [ I i,k +1 − I i,k ] ( f ( θ k , x i ) − y i ) ≤  C ′ η λ 0 r γ m n + η ε 3 √ mn − η λ 0  L ( θ k ) + η ε 2 2 n C ′ λ 0 r mn γ Plugging this back into (31) gives E ξ k [ L ( θ k +1 )] ≤  1 + 2 η 2 mnγ + 10 C η λ 0 r γ m n  L ( θ k ) +  η 2 mnγ + 4 C η λ 0 r γ m n  ε 1 +  C ′ η λ 0 r γ m n + η ε 3 √ mn − η λ 0  L ( θ k ) + η ε 2 2 n C ′ λ 0 r mn γ =  1 − η λ 0 + η ε 3 √ mn + 2 η 2 mnγ + (10 C + C ′ ) η λ 0 r γ m n  L ( θ k ) +  η 2 mnγ + 4 C η λ 0 r γ m n  ε 1 + η ε 2 2 n C ′ λ 0 r mn γ 34 Apply ε 3 ≤ C ε · λ 0 √ mn , γ = C 1 · n m and η = C 2 · λ 0 n 2 giv es E ξ k [ L ( θ k +1 )] ≤  1 −  1 − 2 C 1 C 2 − (10 C + C ′ ) p C 1  η λ 0  L ( θ k ) + η λ 0  mnε 2 2 C ′ √ C 1 λ 2 0 +  C 1 C 2 + 4 C p C 1  ε 1  Cho osing a small enough C 1 , C 2 , C, C ′ giv es E ξ k [ L ( θ k +1 )] ≤  1 − η λ 0 2  L ( θ k ) + 1 2 ˆ C η λ 0  mn λ 2 0 · ε 2 2 + ε 1  (32) for a large enough ˆ C . Thus, unrolling the iterations gives E ξ 0 ,..., ξ k − 1 [ L ( W k )] ≤  1 − η λ 0 2  k L ( W 0 ) + ˆ C  mn λ 2 0 · ε 2 2 + ε 1  (33) for all k < K ′ . Next, we shall low er b ound K ′ . F or all k ≤ K ′ , we hav e that ∥ w k,r − w 0 ,r ∥ 2 ≤ k − 1 X t =0 ∥ w t +1 ,r − w t,r ∥ 2 = η k − 1 X t =0    ∇ w r ˆ L ( W t , ξ t )    2 ≤ η √ γ k − 1 X t =0 ˆ L ( W t , ξ t ) 1 2 By (33), we hav e E ξ 0 ,..., ξ t − 1 h ˆ L ( W t , ξ t ) 1 2 i ≤ ( L ( W t ) + ε 1 ) 1 2 ≤ 2  1 − η λ 0 2  t L ( W 0 ) +  ˆ C + 1   mn λ 2 0 · ε 2 2 + ε 1  ! 1 2 ≤ 2  1 − η λ 0 4  t L ( W 0 ) 1 2 + q ˆ C + 1  ε 2 λ 0 √ mn + √ ε 1  Therefore, w e hav e E ξ 0 ,..., ξ k − 1  ∥ w k,r − w 0 ,r ∥ 2  ≤ η √ γ k − 1 X t =0 E ξ 0 ,..., ξ t − 1 h ˆ L ( W t , ξ t ) 1 2 i ≤ 2 η √ γ L ( W 0 ) 1 2 ∞ X t =0  1 − η λ 0 4  t + k r  ˆ C + 1  γ  ε 2 λ 0 √ mn + √ ε 1  = 8 √ γ λ 0 L ( W 0 ) 1 2 + k r  ˆ C + 1  γ  ε 2 λ 0 √ mn + √ ε 1  By Lemma 26 in , we hav e that E W 0 , a h L ( W 0 ) 2 i = O ( n ) . 
With $\gamma = C_1 \cdot \frac{n}{m}$, we have that
\[
\mathbb{E}_{W_0,a,\xi_0,\dots,\xi_{k-1}}\left[\|w_{k,r} - w_{0,r}\|_2\right] \le O\left(\frac{n}{\lambda_0\sqrt{m}}\right) + O\left(k\left(\frac{\varepsilon_2 n}{\lambda_0} + \sqrt{\frac{\varepsilon_1 n}{m}}\right)\right)
\]
Thus, by Markov's inequality, we have that with probability at least $1 - \frac{\delta}{3K}$,
\[
\|w_{k,r} - w_{0,r}\|_2 \le \underbrace{O\left(\frac{nK}{\lambda_0\delta\sqrt{m}}\right)}_{T_1} + \underbrace{O\left(\frac{K^2}{\delta}\left(\frac{\varepsilon_2 n}{\lambda_0} + \sqrt{\frac{\varepsilon_1 n}{m}}\right)\right)}_{T_2}
\]
Setting $m = \Omega\left(\frac{n^4}{\lambda_0^4\delta^2\tau^2}\right)$ guarantees that $T_1 \le \frac{C_1}{2}\cdot\frac{\tau\lambda_0}{n} = \frac{R}{2}$, and setting $\varepsilon_2 \le O\left(\frac{\delta\lambda_0}{nK^2}\right)$, $\varepsilon_1 \le O\left(\frac{\delta m}{K^4 n}\right)$ gives that $T_2 \le \frac{C_1}{2}\cdot\frac{\tau\lambda_0}{n} = \frac{R}{2}$. Combining the bounds on $T_1$ and $T_2$ and taking a union bound gives that, with probability at least $1 - \frac{\delta}{3}$, it holds that
\[
\|w_{k,r} - w_{0,r}\|_2 \le R \quad \forall k \in [K]
\]
This shows that we must have $K' > K$, which completes the proof.

Lemma C.1. Let $H^\infty$ be defined in (11), and let $H_k$ be defined as
\[
H_{k,ij} = \frac{x_i^\top x_j}{m}\sum_{r \in S_i} \mathbb{I}\{w_{k,r}^\top x_i \ge 0;\ w_{k,r}^\top x_j \ge 0\}
\]
Fix any $R$. Assume that $w_{k,r} \in B(w_{0,r}, R)$ for all $r \in [m]$. If $w_{0,r} \sim \mathcal{N}(0, \tau^2 I)$ and $m = \Omega\left(\frac{n^2}{\lambda_0^2}\log\frac{n}{\delta}\right)$, then with probability at least $1 - \delta - n^2\exp\left(-\frac{mR}{\tau}\right)$ we have that
\[
\|H_k - H^\infty\|_F \le \frac{\lambda_0}{6} + \frac{1}{m}\left(\sum_{i,j=1}^n \left|S_i^\perp\right|^2\right)^{1/2} + \frac{2nR}{\tau}
\]
Proof. We define $\hat{H}_k$ as follows:
\[
\hat{H}_{k,ij} = \frac{x_i^\top x_j}{m}\sum_{r=1}^m \mathbb{I}\{w_{k,r}^\top x_i \ge 0;\ w_{k,r}^\top x_j \ge 0\}
\]
Then we have that
\[
\|H_k - H^\infty\|_F \le \left\|H_k - \hat{H}_k\right\|_F + \left\|\hat{H}_k - \hat{H}_0\right\|_F + \left\|\hat{H}_0 - H^\infty\right\|_F
\]
By Lemma 3.1 in Du et al. (2018), with probability at least $1 - \delta$ we have that $\left\|\hat{H}_0 - H^\infty\right\|_F \le \frac{\lambda_0}{6}$ when $m = \Omega\left(\frac{n^2}{\lambda_0^2}\log\frac{n}{\delta}\right)$. By Lemma 3.2 in Song and Yang (2020), with probability at least $1 - n^2\exp\left(-\frac{mR}{\tau}\right)$, it holds that $\left\|\hat{H}_k - \hat{H}_0\right\| \le \frac{2nR}{\tau}$. Lastly, for the first term, we have
\[
\left\|H_k - \hat{H}_k\right\|_F^2 \le \sum_{i,j=1}^n \left(H_{k,ij} - \hat{H}_{k,ij}\right)^2 = \frac{1}{m^2}\sum_{i,j=1}^n \left(x_i^\top x_j \sum_{r \in S_i^\perp} \mathbb{I}\{w_{k,r}^\top x_i \ge 0;\ w_{k,r}^\top x_j \ge 0\}\right)^2 \le \frac{1}{m^2}\sum_{i,j=1}^n \left|S_i^\perp\right|^2
\]
Combining the three bounds gives the desired result.
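Lemma C.2 below controls the initialization scale $\|w_{0,r}\|_2$ through the Laurent–Massart chi-square tail $\Pr[\|z\|_2^2 \ge 5d] \le e^{-d}$. A small Monte Carlo sanity check of that step; $d$, $m$, and $\tau$ are illustrative values, not the regime of the theorem:

```python
# For z ~ N(0, I_d), Pr[||z||^2 >= 5d] <= e^{-d} (Laurent-Massart with t = d),
# so with m rows w_{0,r} ~ N(0, tau^2 I_d) the max row norm stays below
# sqrt(5d)*tau with high probability.
import numpy as np

rng = np.random.default_rng(0)
d, m, tau = 20, 1000, 0.3                     # illustrative sizes

W0 = tau * rng.standard_normal((m, d))        # rows w_{0,r} ~ N(0, tau^2 I_d)
row_norms = np.linalg.norm(W0, axis=1)

# empirical tail probability vs the e^{-d} bound (here e^{-20} ~ 2e-9)
tail = np.mean(row_norms**2 >= 5 * d * tau**2)
assert tail <= np.exp(-d) + 1e-3              # slack for finite sampling

# the bound C_0 * tau * sqrt(d) with C_0 = sqrt(5) holds for every row here
assert row_norms.max() <= np.sqrt(5 * d) * tau
print("max row norm:", row_norms.max(), "bound:", np.sqrt(5 * d) * tau)
```

The union bound over the $m$ rows costs only a factor $m e^{-d}$, which is why the weights stay on the scale $\tau\sqrt{d}$ of their initialization.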
C.2 Proof of Theorem 5.6

We view Gaussian input masking as a special case of the general stochastic training framework in Section 5. Recall that in that framework, the randomness at iteration $k$ is denoted by $\xi_k$, and the update rule is
\[
W_{k+1} = W_k - \eta \nabla_W \hat{L}(W_k, \xi_k). \quad (7)
\]
In the Gaussian-masked setting we take $\xi_k \equiv C_k$, $\hat{L}(W, \xi_k) \equiv L_{C_k}(W)$, where $C_k$ is the multiplicative Gaussian mask at iteration $k$ and $L_{C_k}$ is the masked loss. Thus $\nabla_{w_r}\hat{L}(W, \xi_k) \equiv \nabla_{w_r} L_{C_k}(W)$, and the update (7) coincides with the masked gradient descent rule (1). Therefore, to apply Theorem 5.2 to training with Gaussian input masks, it suffices to verify that Assumption 5.1 holds with suitable $\varepsilon_1, \varepsilon_2, \varepsilon_3, \gamma$, and that these parameters satisfy the smallness conditions of Theorem 5.2 under the constraints (14)–(15). Throughout the proof we condition on the high-probability NTK event of Theorem 5.2, on which
• the minimum eigenvalue of the empirical NTK satisfies $\lambda_{\min}(H_k) \ge \lambda_0/2$ for all $k \in [K]$,
• all first-layer weights remain in a ball of radius $R = C_1\tau\lambda_0/n$ around their initialization, i.e., $\|w_{k,r} - w_{0,r}\|_2 \le R$ for all $k \in [K]$, $r \in [m]$,
• the data are bounded as in Assumption 3.1.
The probability of this event is at least $1 - 2\delta - n^2\exp(-n^3/(\delta^2\tau^2\lambda_0^3))$, as in Theorem 5.2. All inequalities below hold on this event. Comparing Corollary 5.3 with Assumption 5.1(8), we identify
\[
\varepsilon_1(W) = 2mn\kappa^2 R_u^2 + mn\left(\kappa^2 R_u^2 + \kappa R_w\right)\phi_{\max}(W)^2 + mn\kappa^2\left(R_u^2 + 1\right)\psi_{\max}(W)^2. \quad (34)
\]
On the NTK event, the weights stay close to initialization, hence their norms are uniformly bounded; using Assumption 3.1 and the definitions of $R_w$ and $R_u$, we obtain:
\begin{align*}
R_w(W_k) &:= \max_{r \in [m]} \|w_{k,r}\|_2 \le C_w \tau\sqrt{d} \quad (35) \\
R_u(W_k) &:= \max_{r \in [m], i \in [n]} \|w_{k,r} \odot x_i\|_2 \le C_u \tau\sqrt{d} \quad (36)
\end{align*}
for constants $C_w, C_u > 0$.

Lemma C.2 (Bound on $R_w$).
On the NTK event of Theorem 5.2, there exists an absolute constant $C_w > 0$ such that for all iterations $k \le K$:
\[
R_w(W_k) := \max_{r \in [m]} \|w_{k,r}\|_2 \le C_w \tau\sqrt{d}. \quad (37)
\]
Proof. Recall that the first-layer weights are initialized as $w_{0,r} \sim \mathcal{N}(0, \tau^2 I_d)$ for $r = 1, \dots, m$. During training, the NTK event of Theorem 5.2 ensures that each row stays in a small ball around its initialization:
\[
\|w_{k,r} - w_{0,r}\|_2 \le R, \quad R := \frac{C_1\tau\lambda_0}{n}, \quad \forall k \le K,\ r \in [m]. \quad (38)
\]
We define $z_r := \frac{1}{\tau} w_{0,r}$. Each coordinate satisfies $(z_r)_j \sim \mathcal{N}(0,1)$, making $z_r$ a standard Gaussian vector $\mathcal{N}(0, I_d)$, and its squared norm follows a chi-square distribution: $\|z_r\|_2^2 \sim \chi_d^2$. Using the Laurent–Massart concentration inequality with $t = d$ (which bounds the tail at $d + 2\sqrt{d \cdot d} + 2d = 5d$):
\[
\Pr\left[\|z_r\|_2^2 \ge 5d\right] \le e^{-d}.
\]
Thus, with high probability, $\|z_r\|_2 \le \sqrt{5d}$. Defining $C_0 = \sqrt{5}$, we obtain the initialization bound
\[
\|w_{0,r}\|_2 = \tau\|z_r\|_2 \le C_0\tau\sqrt{d}. \quad (39)
\]
By a union bound over $r \in [m]$, this holds for all rows with probability at least $1 - me^{-d}$. Combining the triangle inequality with (38) and (39), we find:
\[
\|w_{k,r}\|_2 \le \|w_{0,r}\|_2 + \|w_{k,r} - w_{0,r}\|_2 \le C_0\tau\sqrt{d} + \frac{C_1\tau\lambda_0}{n} = \tau\sqrt{d}\left(C_0 + \frac{C_1\lambda_0}{n\sqrt{d}}\right).
\]
Since $\frac{\lambda_0}{n\sqrt{d}} = O(1)$ (recall $d \ge 1$), we may take an absolute constant $C_w$ with $C_0 + \frac{C_1\lambda_0}{n\sqrt{d}} \le C_w$. Taking the maximum over $r \in [m]$ yields:
\[
R_w(W_k) = \max_{r \in [m]} \|w_{k,r}\|_2 \le C_w\tau\sqrt{d}. \quad (40)
\]
Intuitively, since the movement term $\frac{C_1\lambda_0}{n\sqrt{d}}$ vanishes as $d \to \infty$, the weights remain on the same scale as their initialization throughout training.

Lemma C.3 (Bound on $R_u$). Suppose Assumption 3.1 holds, so that the input data is bounded in $\ell_\infty$-norm by $B_x := \max_{i \in [n]} \|x_i\|_\infty$. On the NTK event of Theorem 5.2, there exists an absolute constant $C_u > 0$ such that, for all iterations $k \le K$:
\[
R_u(W_k) := \max_{r \in [m], i \in [n]} \|w_{k,r} \odot x_i\|_2 \le C_u\tau\sqrt{d}. \quad (41)
\]
Proof.
Consider any iteration $k \le K$, neuron $r \in [m]$, and sample index $i \in [n]$. We analyze the squared $\ell_2$-norm of the Hadamard product by pulling out the maximum coordinate of the input vector:
\[
\|w_{k,r} \odot x_i\|_2^2 = \sum_{j=1}^d w_{k,r,j}^2 x_{i,j}^2 \le \left(\max_{1 \le j \le d} x_{i,j}^2\right)\sum_{j=1}^d w_{k,r,j}^2 = \|x_i\|_\infty^2 \|w_{k,r}\|_2^2.
\]
Taking the square root of both sides, we obtain the inequality
\[
\|w_{k,r} \odot x_i\|_2 \le \|x_i\|_\infty \|w_{k,r}\|_2 \le B_x \|w_{k,r}\|_2.
\]
Taking the maximum over all $r \in [m]$ and $i \in [n]$ yields:
\[
R_u(W_k) \le B_x R_w(W_k). \quad (42)
\]
From the weight-stability bound established in the proof of Lemma C.2, we know that on the NTK event the weights are bounded by $R_w(W_k) \le C_w\tau\sqrt{d}$, where $C_w$ is an absolute constant. Substituting this into (42): $R_u(W_k) \le B_x\left(C_w\tau\sqrt{d}\right)$. Defining the absolute constant $C_u := B_x C_w$ completes the proof. Note that since $B_x$ is a fixed property of the dataset and $C_w$ is independent of $k$, $C_u$ is a valid absolute constant for the problem.

We fix an iteration $k \in [K]$ and denote, for brevity, $R_w := R_w(W_k)$, $R_u := R_u(W_k)$, $\phi_k := \phi_{\max}(W_k)$, $\psi_k := \psi_{\max}(W_k)$. Then (34) becomes
\[
\varepsilon_1(W_k) = 2mn\kappa^2 R_u^2 + mn\left(\kappa^2 R_u^2 + \kappa R_w\right)\phi_k^2 + mn\kappa^2\left(R_u^2 + 1\right)\psi_k^2. \quad (43)
\]
We now bound each of the three terms on the right-hand side using (35), (36).

(i) First term. Using $R_u^2 \le C_u^2\tau^2 d$ from (36), we get
\[
2mn\kappa^2 R_u^2 \le 2mn\kappa^2 \cdot C_u^2\tau^2 d = (2C_u^2)\kappa^2\tau^2 mnd = O\left(\kappa^2\tau^2 mnd\right). \quad (44)
\]
(ii) Middle term. $mn\left(\kappa^2 R_u^2 + \kappa R_w\right)\phi_k^2 = mn\kappa^2 R_u^2\phi_k^2 + mn\kappa R_w\phi_k^2$. For the $\kappa^2 R_u^2\phi_k^2$ component, we use $R_u^2 \le C_u^2\tau^2 d$ from (36):
\[
mn\kappa^2 R_u^2\phi_k^2 \le mn\kappa^2\left(C_u^2\tau^2 d\right)\phi_k^2 = C_u^2\kappa^2\tau^2 mnd\,\phi_k^2 = O\left(\kappa^2\tau^2 mnd\,\phi_k^2\right). \quad (45)
\]
For the $\kappa R_w\phi_k^2$ component, we use $R_w \le C_w\tau\sqrt{d}$ from (35):
\[
mn\kappa R_w\phi_k^2 \le mn\kappa\left(C_w\tau\sqrt{d}\right)\phi_k^2 = C_w\kappa\tau mn\sqrt{d}\,\phi_k^2 = O\left(\kappa\tau mn\sqrt{d}\,\phi_k^2\right). \quad (46)
\]
(iii) Last term.
For the last term, using (36), it is true that $R_u^2 + 1 \le C_u^2\tau^2 d + 1$, so there exists a constant $C_u' > 0$ such that $R_u^2 + 1 \le C_u'\tau^2 d$ for big enough $d$. Hence,
\[
mn\kappa^2\left(R_u^2 + 1\right)\psi_k^2 \le mn\kappa^2\left(C_u'\tau^2 d\right)\psi_k^2 = C_u'\kappa^2\tau^2 mnd\,\psi_k^2 = O\left(\kappa^2\tau^2 mnd\,\psi_k^2\right). \quad (47)
\]
Combining (43) with (44), (45), (46), and (47), we obtain
\[
\varepsilon_1(W_k) \le O\left(\kappa^2\tau^2 mnd\right) + O\left(\kappa^2\tau^2 mnd\,\phi_k^2\right) + O\left(\kappa\tau mn\sqrt{d}\,\phi_k^2\right) + O\left(\kappa^2\tau^2 mnd\,\psi_k^2\right) = O\left(\kappa^2\tau^2 mnd\right) + O\left(\kappa^2\tau^2 mnd\left(\phi_k^2 + \psi_k^2\right)\right) + O\left(\kappa\tau mn\sqrt{d}\,\phi_k^2\right). \quad (48)
\]
We denote $\hat{\phi}_{\max} := \max_{k \in [K]} \phi_{\max}(W_k)$, $\hat{\psi}_{\max} := \max_{k \in [K]} \psi_{\max}(W_k)$, so that for each $k$, $\phi_k^2 \le \hat{\phi}_{\max}^2$, $\psi_k^2 \le \hat{\psi}_{\max}^2$. Substituting these into (48) yields
\[
\varepsilon_1(W_k) \le O\left(\kappa^2\tau^2 mnd\right) + O\left(\kappa^2\tau^2 mnd\left(\hat{\phi}_{\max}^2 + \hat{\psi}_{\max}^2\right)\right) + O\left(\kappa\tau mn\sqrt{d}\,\hat{\phi}_{\max}^2\right), \quad \forall k \in [K]. \quad (49)
\]
Taking the maximum over $k \in [K]$ does not change the right-hand side, so
\[
\varepsilon_1 := \max_{k \in [K]} \varepsilon_1(W_k) \le O\left(\kappa^2\tau^2 mnd\right) + O\left(\kappa^2\tau^2 mnd\left(\hat{\phi}_{\max}^2 + \hat{\psi}_{\max}^2\right)\right) + O\left(\kappa\tau mn\sqrt{d}\,\hat{\phi}_{\max}^2\right), \quad (50)
\]
or equivalently:
\[
\varepsilon_1 \le O\left(\kappa^2\tau^2 mnd\left(\hat{\phi}_{\max}^2 + \hat{\psi}_{\max}^2 + 1\right)\right) + O\left(\kappa\tau mn\sqrt{d}\,\hat{\phi}_{\max}^2\right) \quad (51)
\]
Matching Corollary 5.4 with Assumption 5.1(9), we read off
\[
\varepsilon_3(W) = O\left(\frac{\sigma_{\max}(X)\,\phi_{\max}(W)}{\sqrt{m}}\right)
\]
and
\[
\varepsilon_2(W) = O\left(\left(n\kappa^2 B_x^2 R_w + n\kappa R_u\sqrt{d}\right)\phi_{\max}(W)\right) + O\left(n\kappa R_u\psi_{\max}(W) + \kappa^2\sqrt{m}B_x^2 R_w\right). \quad (52)
\]
Using $\|x_i\|_2 \le 1$ from Assumption 3.1, we have $B_x \le 1$. On the NTK event, for all $k \in [K]$ we have the uniform bounds $R_w(W_k) \le C_w\tau\sqrt{d}$, $R_u(W_k) \le C_u\tau\sqrt{d}$, for some absolute constants $C_w, C_u > 0$ (cf. the bounds proved earlier for $R_w$ and $R_u$). Substituting these into (52) yields:
\[
\varepsilon_2(W_k) \le O\left(\left(n\kappa^2 C_w\tau\sqrt{d} + n\kappa C_u\tau\sqrt{d}\cdot\sqrt{d}\right)\phi_{\max}(W_k)\right) + O\left(n\kappa C_u\tau\sqrt{d}\,\psi_{\max}(W_k) + \kappa^2\sqrt{m}C_w\tau\sqrt{d}\right) = O\left(n\kappa^2\tau\sqrt{d}\,\phi_{\max}(W_k)\right) + O\left(n\kappa\tau d\,\phi_{\max}(W_k)\right) + O\left(n\kappa\tau\sqrt{d}\,\psi_{\max}(W_k)\right) + O\left(\kappa^2\tau\sqrt{md}\right).
\]
(53)
Since $d \ge 1$, we have $\sqrt{d} \le d$, and hence each term containing $\sqrt{d}$ can be upper bounded by the corresponding expression with $d$ in place of $\sqrt{d}$. Therefore,
\[
\varepsilon_2(W_k) \le O\left(n\kappa^2\tau d\,\phi_{\max}(W_k)\right) + O\left(n\kappa\tau d\,\phi_{\max}(W_k)\right) + O\left(n\kappa\tau d\,\psi_{\max}(W_k)\right) + O\left(\kappa^2\tau d\sqrt{m}\right) \le O\left(\kappa\tau nd\left(\phi_{\max}(W_k) + \psi_{\max}(W_k)\right)\right) + O\left(\kappa^2\tau d\left(n\phi_{\max}(W_k) + \sqrt{m}\right)\right). \quad (54)
\]
To obtain a uniform bound over the whole training trajectory, recall $\hat{\phi}_{\max} := \max_{k \in [K]}\phi_{\max}(W_k)$, $\hat{\psi}_{\max} := \max_{k \in [K]}\psi_{\max}(W_k)$. Taking the maximum over $k$ in (54) yields
\[
\varepsilon_2 := \max_{k \in [K]} \varepsilon_2(W_k) \le O\left(\kappa\tau nd\left(\hat{\phi}_{\max} + \hat{\psi}_{\max}\right)\right) + O\left(\kappa^2\tau d\left(n\hat{\phi}_{\max} + \sqrt{m}\right)\right).
\]
We can simplify $\varepsilon_2$ further (using $\kappa^2 \le \kappa$ for $\kappa \le 1$):
\[
O\left(\kappa\tau nd\left(\hat{\phi}_{\max} + \hat{\psi}_{\max}\right) + \kappa^2\tau dn\hat{\phi}_{\max} + \kappa^2\tau d\sqrt{m}\right) \le O\left(\kappa\tau nd\left(\hat{\phi}_{\max} + \hat{\psi}_{\max}\right) + \kappa\tau dn\left(\hat{\phi}_{\max} + \hat{\psi}_{\max}\right) + \kappa^2\tau d\sqrt{m}\right) = O\left(2\kappa\tau nd\left(\hat{\phi}_{\max} + \hat{\psi}_{\max}\right) + \kappa^2\tau d\sqrt{m}\right)
\]
Thus,
\[
\varepsilon_2 \le O\left(\kappa\tau nd\left(\hat{\phi}_{\max} + \hat{\psi}_{\max}\right)\right) + O\left(\kappa^2\tau d\sqrt{m}\right) \quad (55)
\]
Theorem 5.2 requires $\varepsilon_2$ to satisfy, for some absolute constant $c_0 > 0$,
\[
\varepsilon_2 \le \frac{c_0\delta\lambda_0}{nK^2} \quad (56)
\]
Combining (55) and (56), we get that the following inequality must hold:
\[
C_1\kappa\tau nd\left(\hat{\phi}_{\max} + \hat{\psi}_{\max}\right) + C_2\kappa^2\tau d\sqrt{m} \le \frac{c_0\delta\lambda_0}{nK^2}. \quad (57)
\]
Let us denote $a := C_1\kappa\tau nd\left(\hat{\phi}_{\max} + \hat{\psi}_{\max}\right)$, $b := C_2\kappa^2\tau d\sqrt{m}$, $R := \frac{c_0\delta\lambda_0}{nK^2}$. Then (57) can be written as $a + b \le R$. A sufficient way to enforce this inequality is to require that each term $a$ and $b$ is at most $R/2$:
\[
a \le \frac{R}{2}, \quad b \le \frac{R}{2} \quad (58)
\]
Indeed, if (58) holds, then $a + b \le \frac{R}{2} + \frac{R}{2} = R$, so (57) is automatically satisfied. Imposing $a \le R/2$ yields $C_1\kappa\tau nd\left(\hat{\phi}_{\max} + \hat{\psi}_{\max}\right) \le \frac{c_0}{2}\cdot\frac{\delta\lambda_0}{nK^2}$. Thus,
\[
\kappa \le \kappa_2^{\mathrm{lin}} := \frac{c_0}{2C_1}\cdot\frac{\delta\lambda_0}{\tau n^2 dK^2\left(\hat{\phi}_{\max} + \hat{\psi}_{\max}\right)}.
\]
(59)
Similarly, imposing $b \le R/2$ gives
\[
C_2\kappa^2\tau d\sqrt{m} \le \frac{c_0}{2}\cdot\frac{\delta\lambda_0}{nK^2} \;\Rightarrow\; \kappa^2 \le \frac{c_0}{2C_2}\cdot\frac{\delta\lambda_0}{\tau d\sqrt{m}\,nK^2}
\]
and therefore
\[
\kappa \le \kappa_2^{\mathrm{quad}} := \sqrt{\frac{c_0}{2C_2}}\sqrt{\frac{\delta\lambda_0}{\tau d\sqrt{m}\,nK^2}}. \quad (60)
\]
To ensure (56) holds, it is sufficient that (59) and (60) both hold. Equivalently, $\kappa \le \min\{\kappa_2^{\mathrm{lin}}, \kappa_2^{\mathrm{quad}}\}$. In big-$O$ notation we may write this as
\[
\kappa = O\left(\frac{\delta\lambda_0}{\tau n^2 dK^2\left(\hat{\phi}_{\max} + \hat{\psi}_{\max}\right)}\right) \quad \text{and} \quad \kappa = O\left(\sqrt{\frac{\delta\lambda_0}{\tau d\sqrt{m}\,nK^2}}\right). \quad (61)
\]
Similarly, for $\varepsilon_1$ the general stochastic convergence theorem requires that
\[
\varepsilon_1 \le \frac{c_1\delta m}{nK^4}, \quad (14)
\]
for some absolute constant $c_1 > 0$. Combining (51) and (14), we get that the following must hold:
\[
O\left(\kappa^2\tau^2 mnd\left(\hat{\phi}_{\max}^2 + \hat{\psi}_{\max}^2 + 1\right)\right) + O\left(\kappa\tau mn\sqrt{d}\,\hat{\phi}_{\max}^2\right) \le \frac{c_1\delta m}{nK^4} \;\Rightarrow\; C_a\kappa^2\tau^2 nd\left(\hat{\phi}_{\max}^2 + \hat{\psi}_{\max}^2 + 1\right) + C_b\kappa\tau n\sqrt{d}\,\hat{\phi}_{\max}^2 \le \frac{c_1\delta}{nK^4}.
\]
Similarly, we impose
\[
C_a\kappa^2\tau^2 nd\left(1 + \hat{\phi}_{\max}^2 + \hat{\psi}_{\max}^2\right) \le \frac{c_1}{2}\cdot\frac{\delta}{nK^4} \;\Rightarrow\; \kappa^2 \le \frac{c_1}{2C_a}\cdot\frac{\delta}{\tau^2 n^2 dK^4\left(\hat{\phi}_{\max}^2 + \hat{\psi}_{\max}^2 + 1\right)} \;\Rightarrow\; \kappa \le \kappa_1^{\mathrm{quad}} := \sqrt{\frac{c_1}{2C_a}}\cdot\frac{\sqrt{\delta}}{\tau n\sqrt{d}K^2\sqrt{\hat{\phi}_{\max}^2 + \hat{\psi}_{\max}^2 + 1}}. \quad (62)
\]
and
\[
C_b\kappa\tau n\sqrt{d}\,\hat{\phi}_{\max}^2 \le \frac{c_1}{2}\cdot\frac{\delta}{nK^4} \;\Rightarrow\; \kappa \le \kappa_1^{\mathrm{lin}} := \frac{c_1}{2C_b}\cdot\frac{\delta}{\tau n^2\sqrt{d}K^4\hat{\phi}_{\max}^2}. \quad (63)
\]
Thus, the $\varepsilon_1$ requirement (14) is guaranteed whenever $\kappa \le \min\left\{\kappa_1^{\mathrm{quad}}, \kappa_1^{\mathrm{lin}}\right\}$. Combining all constraints from $\varepsilon_1$ and $\varepsilon_2$, we see that a sufficient set of conditions is $\kappa \le \min\left\{\kappa_1^{\mathrm{lin}}, \kappa_1^{\mathrm{quad}}, \kappa_2^{\mathrm{lin}}, \kappa_2^{\mathrm{quad}}\right\}$. Equivalently, in big-$O$ notation,
\[
\kappa = O\left(\min\left\{\frac{\delta\lambda_0}{\tau n^2 dK^2\left(\hat{\phi}_{\max} + \hat{\psi}_{\max}\right)},\ \frac{\delta}{\tau n^2\sqrt{d}K^4\hat{\phi}_{\max}^2},\ \sqrt{\frac{\delta\lambda_0}{\tau d\sqrt{m}\,nK^2}},\ \sqrt{\frac{\delta}{\tau^2 n^2 dK^4\left(\hat{\phi}_{\max}^2 + \hat{\psi}_{\max}^2 + 1\right)}}\right\}\right). \quad (64)
\]
Instead of carrying this minimum in the theorem statement, we define a slightly more restrictive but cleaner condition that implies all of the above bounds:
\[
\kappa = O\left(\frac{\sqrt{\delta}\,\lambda_0}{\tau^2 K^2\left(m^{1/4}\sqrt{d} + nd\right)\left(\hat{\phi}_{\max} + \hat{\psi}_{\max}\right)}\right).
\]
(65)
For $\varepsilon_3$, we combine (9) with Corollary 5.4, which yields
\[
\varepsilon_3 = O\left(\frac{\sigma_{\max}(X)\,\phi_{\max}}{\sqrt{m}}\right)
\]
We assume that $\sigma_{\max}(X)\,\hat{\phi}_{\max} \le C\lambda_0/\sqrt{n}$ for some constant $C > 0$ (assumption (15)), and therefore
\[
\varepsilon_3 \le O\left(\frac{\lambda_0}{\sqrt{mn}}\right), \quad (66)
\]
which is exactly the form required in Theorem 5.2. Theorem 5.2 includes the factor $O\left(\frac{mn}{\lambda_0^2}\cdot\varepsilon_2^2 + \varepsilon_1\right)$, so, using (55) and (50), we study the quantity $\frac{mn}{\lambda_0^2}\cdot\varepsilon_2^2$:
\[
\frac{mn}{\lambda_0^2}\cdot\varepsilon_2^2 = \frac{mn}{\lambda_0^2}\cdot\left(\kappa^2\tau^2 n^2 d^2\left(\hat{\phi}_{\max} + \hat{\psi}_{\max}\right)^2 + \kappa^4\tau^2 d^2 m\right) \le \frac{2}{\lambda_0^2}\kappa^2\tau^2 n^3 d^2 m\left(\hat{\phi}_{\max}^2 + \hat{\psi}_{\max}^2\right) + \frac{1}{\lambda_0^2}\kappa^4\tau^2 d^2 m^2 n = O\left(\kappa\tau^2 mn^3 d^2\left(\hat{\phi}_{\max}^2 + \hat{\psi}_{\max}^2\right)\right) + O\left(\kappa^2\tau^2 m^2 nd^2\right) \quad (67)
\]
because $(a+b)^2 \le 2(a^2 + b^2)$ and $\kappa^4 \le \kappa^2 \le \kappa$ for $\kappa \le 1$. Furthermore,
\[
\varepsilon_1 = O\left(\kappa^2\tau^2 mnd\right) + O\left(\kappa^2\tau^2 mnd\left(\hat{\phi}_{\max}^2 + \hat{\psi}_{\max}^2\right)\right) + O\left(\kappa\tau mn\sqrt{d}\,\hat{\phi}_{\max}^2\right) \le O\left(\kappa^2\tau^2 m^2 nd^2\right) + O\left(\kappa\tau^2 mn^3 d^2\left(\hat{\phi}_{\max}^2 + \hat{\psi}_{\max}^2\right)\right) + O\left(\kappa\tau mn\sqrt{d}\,\hat{\phi}_{\max}^2\right) \quad (68)
\]
Therefore, by (67) and (68),
\[
O\left(\frac{mn}{\lambda_0^2}\cdot\varepsilon_2^2 + \varepsilon_1\right) = O\left(\kappa^2\tau^2 m^2 nd^2\right) + O\left(\kappa\tau^2 mn^3 d^2\left(\hat{\phi}_{\max}^2 + \hat{\psi}_{\max}^2\right)\right) + O\left(\kappa\tau mn\sqrt{d}\,\hat{\phi}_{\max}^2\right)
\]
Thus, the expected loss is bounded by:
\[
\mathbb{E}\left[L(W_K)\right] \le \left(1 - \frac{\eta\lambda_0}{2}\right)^K L(W_0) + O\left(\kappa^2\tau^2 m^2 nd^2\right) + O\left(\kappa\tau^2 mn^3 d^2\left(\hat{\phi}_{\max}^2 + \hat{\psi}_{\max}^2\right)\right) + O\left(\kappa\tau mn\sqrt{d}\,\hat{\phi}_{\max}^2\right)
\]

C.2.1 Proof of Corollary 5.3

Proof. By Theorem 4.2, we have that
\[
\left|\mathbb{E}_C\left[L_C(\theta)\right] - L(\theta)\right| \le \underbrace{\left|\frac{1}{2}\sum_{i=1}^n \left(\left(\hat{f}(\theta,x_i) - y_i\right)^2 - \left(f(W,x_i) - y_i\right)^2\right)\right|}_{T_1} + \underbrace{\frac{\kappa^2}{2m}\sum_{i=1}^n \left\|\sum_{r=1}^m a_r\left(w_r \odot x_i\right)\Phi_1\left(\frac{w_r^\top x_i}{\kappa\|w_r \odot x_i\|_2}\right)\right\|_2^2}_{T_2} + mn\left(\kappa^2 R_u^2\psi_{\max}^2 + \left(\kappa^2 R_u^2 + \kappa R_w\right)\phi_{\max}^2\right)
\]
Now, we can bound $T_1$ and $T_2$ separately.
For $T_1$, we have
\[
\left|\frac{1}{2}\sum_{i=1}^n \left(\left(\hat{f}(\theta,x_i) - y_i\right)^2 - \left(f(W,x_i) - y_i\right)^2\right)\right| \le \frac{1}{2}\sum_{i=1}^n \left(\left(\hat{f}(\theta,x_i) - f(W,x_i)\right)^2 + 2\left|\hat{f}(\theta,x_i) - f(W,x_i)\right|\left|f(W,x_i) - y_i\right|\right) \le \sum_{i=1}^n \left(\hat{f}(\theta,x_i) - f(W,x_i)\right)^2 + L(\theta)
\]
where the last step uses $2\left|\hat{f} - f\right|\left|f - y_i\right| \le \left(\hat{f} - f\right)^2 + (f - y_i)^2$. Here, we can bound $\hat{f}(W,x_i) - f(W,x_i)$ as
\begin{align*}
\left|\hat{f}(W,x_i) - f(W,x_i)\right| &\le \frac{1}{\sqrt{m}}\sum_{r=1}^m \left|\hat{\sigma}(w_r, x_i) - \sigma\left(w_r^\top x_i\right)\right| \\
&\le \frac{1}{\sqrt{m}}\sum_{r=1}^m \left|w_r^\top x_i\right|\left|\mathbb{I}\left\{w_r^\top x_i \ge 0\right\} - \Phi_1\left(\frac{w_r^\top x_i}{\kappa\|w_r \odot x_i\|_2}\right)\right| \\
&\le \frac{1}{\sqrt{m}}\sum_{r=1}^m \left|w_r^\top x_i\right|\exp\left(-\frac{\left(w_r^\top x_i\right)^2}{2\kappa^2\|w_r \odot x_i\|_2^2}\right) \le \frac{1}{\sqrt{m}}\sum_{r=1}^m \psi_{\max} = \sqrt{m}\,\psi_{\max}
\end{align*}
where in the third inequality we used Lemma D.4, and each summand is at most $\psi_{\max}$ by definition. Therefore, $T_1$ can be bounded as
\[
T_1 \le L(\theta) + mn\psi_{\max}^2
\]
For $T_2$, we can bound it, using $\Phi_1 \le 1$ and the Cauchy–Schwarz inequality, as
\[
T_2 \le \frac{\kappa^2}{2m}\sum_{i=1}^n \left(\sum_{r=1}^m \|w_r \odot x_i\|_2\right)^2 \le 2\kappa^2 mnR_u^2
\]
Plugging in $T_1$ and $T_2$ gives the desired result.

C.2.2 Proof of Corollary 5.4

Proof. By Theorem 4.9, we have that
\begin{align*}
\left\|\mathbb{E}_C\left[\nabla_{w_r} L(\theta)\right] - \nabla_{w_r} L(\theta)\right\|_2 &= \left\|g_r + \frac{3\kappa^2}{\sqrt{m}}\sum_{r'=1}^m a_{r'}\,\mathrm{Diag}(x_i)^2 w_{r'}\,\mathbb{I}\left\{w_r^\top x_i \ge 0;\ w_{r'}^\top x_i \ge 0\right\}\right\|_2 \\
&\le \left\|g_r\right\|_2 + \frac{3\kappa^2}{\sqrt{m}}\sum_{r'=1}^m \left\|a_{r'}\,\mathrm{Diag}(x_i)^2 w_{r'}\,\mathbb{I}\left\{w_r^\top x_i \ge 0;\ w_{r'}^\top x_i \ge 0\right\}\right\|_2 \le \left\|g_r\right\|_2 + \frac{3\kappa^2}{\sqrt{m}}\sum_{r'=1}^m \|x_i\|_\infty^2\|w_{r'}\|_2
\end{align*}
Theorem 4.9 states that the norm $\|g_r\|_2$ satisfies
\[
\|g_r\|_2 \le \left(6n\kappa^2 B_x^2 R_w + 5n\kappa R_u\sqrt{d}\right)\phi_{\max} + \frac{\sigma_{\max}(X)}{\sqrt{m}}\phi_{\max} L(\theta)^{1/2} + 6n\kappa R_u\psi_{\max}
\]
Moreover, by Assumption 3.1, we have that $\|x\|_\infty \le B_x$. Therefore, we obtain that
\[
\left\|\mathbb{E}_C\left[\nabla_{w_r} L(\theta)\right] - \nabla_{w_r} L(\theta)\right\|_2 \le \left(\frac{\sigma_{\max}(X)}{\sqrt{m}} L(\theta)^{1/2} + 6n\kappa^2 B_x^2 R_w + 5n\kappa R_u\sqrt{d}\right)\phi_{\max} + 6n\kappa R_u\psi_{\max} + 3\kappa^2\sqrt{m}B_x^2 R_w
\]

C.2.3 Proof of Lemma D.21

Proof.
By the form of $\nabla_{w_r} L_C(W)$ in (2), we have:
\begin{align*}
\left\|\nabla_{w_r} L_C(W)\right\|_2 &= \Bigg\|\frac{a_r}{\sqrt{m}}\sum_{i=1}^n \left(f(W, x_i \odot c_i) - y_i\right)\left(x_i \odot c_i\right)\underbrace{\mathbb{I}\left\{w_r^\top (x_i \odot c_i) \ge 0\right\}}_{\le 1}\Bigg\|_2 \\
&\le \frac{|a_r|}{\sqrt{m}}\sum_{i=1}^n \left|f(W, x_i \odot c_i) - y_i\right|\cdot\|x_i \odot c_i\|_2 \\
&\le \frac{\sqrt{n}}{\sqrt{m}}\left(\sum_{i=1}^n \left(f(W, x_i \odot c_i) - y_i\right)^2\right)^{1/2}\cdot\max_i \|c_i\|_\infty\|x_i\|_2 && \text{using Lemma D.20} \\
&\le C\frac{\sqrt{n}}{\sqrt{m}}\left(2L_C(W)\right)^{1/2} && \text{using } \|c_i\|_\infty \le C \text{ and } \|x_i\|_2 \le 1 \\
&\le C\sqrt{2}\,\frac{\sqrt{n}}{\sqrt{m}} L_C(W)^{1/2}
\end{align*}
where, in the first inequality, we use the fact that the indicator function is upper-bounded by 1 and, in the second inequality, the fact that $a_r = \pm 1$. And so,
\[
\left\|\nabla_{w_r} L_C(W)\right\|_2^2 \le 2C^2\cdot\frac{n}{m} L_C(W)
\]

D Auxiliary Results

D.1 Gaussian Random Variables

D.1.1 Conditional Expectation and Covariance

Lemma D.1. Let $c \sim \mathcal{N}(\mu, \kappa^2 I)$, and let $z = c^\top u$, $z' = c^\top v$. Then we have that
\[
\mathbb{E}_c[c \mid z] = \mu + \frac{u}{\|u\|_2^2}\left(z - \mu^\top u\right); \quad \mathbb{E}_c[c \mid z, z'] = \mu + s_1\left(z - \mu^\top u\right) + s_2\left(z' - \mu^\top v\right)
\]
where the vectors $s_1, s_2$ are defined as
\[
s_1 = \frac{\|v\|_2^2 u - (u^\top v)\,v}{\|u\|_2^2\|v\|_2^2 - (u^\top v)^2}; \quad s_2 = \frac{\|u\|_2^2 v - (u^\top v)\,u}{\|u\|_2^2\|v\|_2^2 - (u^\top v)^2}
\]
Proof. By the formula for the conditional expectation of jointly Gaussian variables, we have
\[
\mathbb{E}_c[c \mid z] = \mathbb{E}[c] + \mathrm{Cov}(c, z)\,\mathrm{Var}(z)^{-1}\left(z - \mathbb{E}[z]\right)
\]
By the definition of $c$, we have $\mathbb{E}[c] = \mu$. Moreover, since $z = c^\top u$, by Lemma D.13 we have $\mathbb{E}[z] = \mu^\top u$ and $\mathrm{Var}(z) = \kappa^2\|u\|_2^2$. Therefore, the covariance between $c$ and $z$ is given by
\[
\mathrm{Cov}(c, z) = \mathbb{E}\left[(c - \mu)\left(z - \mu^\top u\right)\right] = \mathbb{E}[zc] - \left(\mu^\top u\right)\mu
\]
where
\[
\mathbb{E}[zc]_j = \sum_{j'=1}^d \mathbb{E}\left[c_j c_{j'} u_{j'}\right] = \sum_{j'=1}^d \left(\mu_j\mu_{j'} + \mathbb{I}\{j = j'\}\kappa^2\right)u_{j'} = \mu_j\,\mu^\top u + \kappa^2 u_j
\]
This gives $\mathbb{E}[zc] = \left(\mu^\top u\right)\mu + \kappa^2 u$, hence $\mathrm{Cov}(c, z) = \kappa^2 u$. Therefore
\[
\mathbb{E}_c[c \mid z] = \mu + \frac{u}{\|u\|_2^2}\left(z - \mu^\top u\right)
\]
Similarly, since $z' = c^\top v$.
Then
\[
\mathbb{E}_c[c \mid z, z'] = \mathbb{E}[c] + \mathrm{Cov}(c, [z, z'])\,\mathrm{Cov}(z, z')^{-1}\left([z, z']^\top - \mathbb{E}[[z, z']]^\top\right)
\]
where
\[
\mathrm{Cov}(c, [z, z']) = \kappa^2\begin{pmatrix} u & v \end{pmatrix}; \quad \mathrm{Cov}(z, z') = \kappa^2\begin{pmatrix} \|u\|_2^2 & u^\top v \\ u^\top v & \|v\|_2^2 \end{pmatrix}
\]
This gives
\begin{align*}
\mathbb{E}_c[c \mid z, z'] &= \mu + \begin{pmatrix} u & v \end{pmatrix}\begin{pmatrix} \|u\|_2^2 & u^\top v \\ u^\top v & \|v\|_2^2 \end{pmatrix}^{-1}\begin{pmatrix} z - \mu^\top u \\ z' - \mu^\top v \end{pmatrix} \\
&= \mu + \frac{1}{\|u\|_2^2\|v\|_2^2 - (u^\top v)^2}\begin{pmatrix} u & v \end{pmatrix}\begin{pmatrix} \|v\|_2^2 & -u^\top v \\ -u^\top v & \|u\|_2^2 \end{pmatrix}\begin{pmatrix} z - \mu^\top u \\ z' - \mu^\top v \end{pmatrix} \\
&= \mu + \frac{\left(\|v\|_2^2 u - (u^\top v)\,v\right)\left(z - \mu^\top u\right) + \left(\|u\|_2^2 v - (u^\top v)\,u\right)\left(z' - \mu^\top v\right)}{\|u\|_2^2\|v\|_2^2 - (u^\top v)^2} = \mu + s_1\left(z - \mu^\top u\right) + s_2\left(z' - \mu^\top v\right)
\end{align*}
where the vectors $s_1, s_2$ are defined as
\[
s_1 = \frac{\|v\|_2^2 u - (u^\top v)\,v}{\|u\|_2^2\|v\|_2^2 - (u^\top v)^2}; \quad s_2 = \frac{\|u\|_2^2 v - (u^\top v)\,u}{\|u\|_2^2\|v\|_2^2 - (u^\top v)^2}
\]

Lemma D.2. Let $c \sim \mathcal{N}(\mu, \kappa^2 I)$, and let $z = c^\top u$, $z' = c^\top v$. Then we have
\[
\mathrm{Cov}(c \mid z, z') = \kappa^2 I + \frac{\kappa^2\left(uv^\top - vu^\top\right)^2}{\|u\|_2^2\|v\|_2^2 - (u^\top v)^2}
\]
(Note that $uv^\top - vu^\top$ is antisymmetric, so its square is negative semidefinite; the second term therefore reduces the covariance.)

Proof. The conditional covariance of Gaussian random variables is given by
\[
\mathrm{Cov}(c \mid z, z') = \mathrm{Cov}(c) - \mathrm{Cov}(c, [z, z'])\,\mathrm{Cov}(z, z')^{-1}\,\mathrm{Cov}(c, [z, z'])^\top
\]
Recall that in the previous lemma we have computed
\[
\mathrm{Cov}(c, [z, z']) = \kappa^2\begin{pmatrix} u & v \end{pmatrix}; \quad \mathrm{Cov}(z, z') = \kappa^2\begin{pmatrix} \|u\|_2^2 & u^\top v \\ u^\top v & \|v\|_2^2 \end{pmatrix}
\]
Thus, we have
\begin{align*}
\mathrm{Cov}(c \mid z, z') &= \kappa^2 I - \kappa^2\begin{pmatrix} u & v \end{pmatrix}\begin{pmatrix} \|u\|_2^2 & u^\top v \\ u^\top v & \|v\|_2^2 \end{pmatrix}^{-1}\begin{pmatrix} u & v \end{pmatrix}^\top \\
&= \kappa^2 I - \frac{\kappa^2}{\|u\|_2^2\|v\|_2^2 - (u^\top v)^2}\begin{pmatrix} u & v \end{pmatrix}\begin{pmatrix} \|v\|_2^2 & -u^\top v \\ -u^\top v & \|u\|_2^2 \end{pmatrix}\begin{pmatrix} u & v \end{pmatrix}^\top = \kappa^2 I + \frac{\kappa^2\left(uv^\top - vu^\top\right)^2}{\|u\|_2^2\|v\|_2^2 - (u^\top v)^2}
\end{align*}
where the last step uses $\|v\|_2^2 uu^\top - (u^\top v)\left(uv^\top + vu^\top\right) + \|u\|_2^2 vv^\top = -\left(uv^\top - vu^\top\right)^2$.

D.1.2 Approximation of CDF

In approximating the Gaussian CDF, we use the following property.

Lemma D.3 (?). Let $x \ge 0$ be given; then the following inequality holds:
\[
\frac{1}{2}\left(1 - \exp\left(-\frac{x^2}{2}\right)\right)^{1/2} \le \frac{1}{\sqrt{2\pi}}\int_0^x \exp\left(-\frac{t^2}{2}\right)dt \le \frac{1}{2}\left(1 - \exp\left(-\frac{2x^2}{\pi}\right)\right)^{1/2}
\]
To start, we analyze the CDF of a single-variable Gaussian random variable.

Lemma D.4.
Let $\Phi_1$ be the CDF of a standard Gaussian random variable. Then we have
\[
\left|\Phi_1(\alpha) - \mathbb{I}\{\alpha \ge 0\}\right| \le \exp\left(-\frac{\alpha^2}{2}\right)
\]
Proof. Let $\nu(x) = \frac{1}{\sqrt{2\pi}}\int_0^x \exp\left(-\frac{t^2}{2}\right)dt$ for $x \ge 0$. We study the cases $\alpha \ge 0$ and $\alpha < 0$ separately. For $\alpha \ge 0$, we have
\[
\Phi_1(\alpha) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\alpha}\exp\left(-\frac{t^2}{2}\right)dt = \frac{1}{2} + \nu(\alpha) \ge \frac{1}{2} + \frac{1}{2}\left(1 - \exp\left(-\frac{\alpha^2}{2}\right)\right)^{1/2} \ge 1 - \exp\left(-\frac{\alpha^2}{2}\right) = \mathbb{I}\{\alpha \ge 0\} - \exp\left(-\frac{\alpha^2}{2}\right)
\]
Since $\Phi_1(\alpha) \le 1 = \mathbb{I}\{\alpha \ge 0\}$ when $\alpha \ge 0$, we must have that, for $\alpha \ge 0$, $\left|\Phi_1(\alpha) - \mathbb{I}\{\alpha \ge 0\}\right| \le \exp\left(-\frac{\alpha^2}{2}\right)$. Similarly, for $\alpha < 0$, we have
\[
\Phi_1(\alpha) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\alpha}\exp\left(-\frac{t^2}{2}\right)dt = \frac{1}{2} - \nu(-\alpha) \le \frac{1}{2} - \frac{1}{2}\left(1 - \exp\left(-\frac{\alpha^2}{2}\right)\right)^{1/2} \le \exp\left(-\frac{\alpha^2}{2}\right) = \mathbb{I}\{\alpha \ge 0\} + \exp\left(-\frac{\alpha^2}{2}\right)
\]
Since $\Phi_1(\alpha) \ge 0 = \mathbb{I}\{\alpha \ge 0\}$ when $\alpha < 0$, we must have that, for $\alpha < 0$, $\left|\Phi_1(\alpha) - \mathbb{I}\{\alpha \ge 0\}\right| \le \exp\left(-\frac{\alpha^2}{2}\right)$.

Lemma D.5. Let $\Phi_2(\alpha_1, \alpha_2, \rho)$ be the joint CDF of two standard Gaussian random variables with covariance $\rho$ at $\alpha_1, \alpha_2$. Then we have
\[
\left|\Phi_2(\alpha_1, \alpha_2, \rho) - \mathbb{I}\{\alpha_1 \ge 0;\ \alpha_2 \ge 0\}\right| \le 2\exp\left(-\frac{\min\{\alpha_1, \alpha_2\}^2}{2}\right)
\]
Proof. Since $z_1, z_2$ are standard Gaussian random variables with covariance $\rho$, we have $z_1 \mid z_2 = \zeta \sim \mathcal{N}\left(\rho\zeta, 1 - \rho^2\right)$. According to the definition of the CDF,
\[
\Phi_2(\alpha_1, \alpha_2, \rho) = \int_{-\infty}^{\alpha_2}\int_{-\infty}^{\alpha_1} f_{z_1,z_2}(\zeta_1, \zeta_2, \rho)\,d\zeta_1\,d\zeta_2 = \int_{-\infty}^{\alpha_2}\int_{-\infty}^{\alpha_1} f_{z_1 \mid z_2 = \zeta_2}(\zeta_1)\,d\zeta_1\, f_{z_2}(\zeta_2)\,d\zeta_2
\]
Focusing on the inner integral, we substitute $\zeta' = \frac{\zeta_1 - \rho\zeta_2}{\sqrt{1 - \rho^2}}$, so that $d\zeta_1 = \sqrt{1 - \rho^2}\,d\zeta'$. Thus
\begin{align*}
\int_{-\infty}^{\alpha_1} f_{z_1 \mid z_2 = \zeta_2}(\zeta_1)\,d\zeta_1 &= \frac{1}{\sqrt{2\pi(1 - \rho^2)}}\int_{-\infty}^{\alpha_1}\exp\left(-\frac{(\zeta_1 - \rho\zeta_2)^2}{2(1 - \rho^2)}\right)d\zeta_1 = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\frac{\alpha_1 - \rho\zeta_2}{\sqrt{1 - \rho^2}}}\exp\left(-\frac{\zeta'^2}{2}\right)d\zeta' \\
&= \Phi_1\left(\frac{\alpha_1 - \rho\zeta_2}{\sqrt{1 - \rho^2}}\right) = \mathbb{I}\left\{\frac{\alpha_1 - \rho\zeta_2}{\sqrt{1 - \rho^2}} \ge 0\right\} + \varepsilon\left(\frac{\alpha_1 - \rho\zeta_2}{\sqrt{1 - \rho^2}}\right) = \mathbb{I}\{\alpha_1 \ge \rho\zeta_2\} + \varepsilon\left(\frac{\alpha_1 - \rho\zeta_2}{\sqrt{1 - \rho^2}}\right)
\end{align*}
where $\left|\varepsilon\left(\frac{\alpha_1 - \rho\zeta_2}{\sqrt{1 - \rho^2}}\right)\right| \le \exp\left(-\frac{(\alpha_1 - \rho\zeta_2)^2}{2(1 - \rho^2)}\right)$.
Therefore
\[
\Phi_2(\alpha_1, \alpha_2, \rho) = \int_{-\infty}^{\alpha_2}\left(\mathbb{I}\{\alpha_1 \ge \rho\zeta_2\} + \varepsilon\left(\frac{\alpha_1 - \rho\zeta_2}{\sqrt{1 - \rho^2}}\right)\right)f_{z_2}(\zeta_2)\,d\zeta_2 = \underbrace{\int_{-\infty}^{\alpha_2}\mathbb{I}\{\alpha_1 \ge \rho\zeta_2\}f_{z_2}(\zeta_2)\,d\zeta_2}_{\mathcal{I}_1} + \underbrace{\int_{-\infty}^{\alpha_2}\varepsilon\left(\frac{\alpha_1 - \rho\zeta_2}{\sqrt{1 - \rho^2}}\right)f_{z_2}(\zeta_2)\,d\zeta_2}_{\mathcal{I}_2}
\]
For $\mathcal{I}_1$, we have
\[
\mathcal{I}_1 = \int_{-\infty}^{\alpha_2}\mathbb{I}\{\alpha_1 \ge \rho\zeta_2\}f_{z_2}(\zeta_2)\,d\zeta_2 = \int_{-\infty}^{\min\left\{\frac{\alpha_1}{\rho},\,\alpha_2\right\}}f_{z_2}(\zeta_2)\,d\zeta_2 = \Phi_1\left(\min\left\{\frac{\alpha_1}{\rho}, \alpha_2\right\}\right) = \mathbb{I}\left\{\min\left\{\frac{\alpha_1}{\rho}, \alpha_2\right\} \ge 0\right\} + \varepsilon\left(\min\left\{\frac{\alpha_1}{\rho}, \alpha_2\right\}\right)
\]
Notice that, when $\alpha_1 \ge 0$ and $\alpha_2 \ge 0$, we must have $\min\left\{\frac{\alpha_1}{\rho}, \alpha_2\right\} \ge 0$. Conversely, when $\min\left\{\frac{\alpha_1}{\rho}, \alpha_2\right\} \ge 0$, we must have that $\alpha_1 \ge 0$ and $\alpha_2 \ge 0$. Therefore $\mathbb{I}\left\{\min\left\{\frac{\alpha_1}{\rho}, \alpha_2\right\} \ge 0\right\} = \mathbb{I}\{\alpha_1 \ge 0;\ \alpha_2 \ge 0\}$. Thus, we have
\[
\Phi_2(\alpha_1, \alpha_2, \rho) = \mathbb{I}\{\alpha_1 \ge 0;\ \alpha_2 \ge 0\} + \varepsilon\left(\min\left\{\frac{\alpha_1}{\rho}, \alpha_2\right\}\right) + \mathcal{I}_2
\]
For $\mathcal{I}_2$, we have
\[
|\mathcal{I}_2| \le \int_{-\infty}^{\alpha_2}\left|\varepsilon\left(\frac{\alpha_1 - \rho\zeta_2}{\sqrt{1 - \rho^2}}\right)\right|f_{z_2}(\zeta_2)\,d\zeta_2 \le \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\alpha_2}\exp\left(-\frac{(\alpha_1 - \rho\zeta_2)^2}{2(1 - \rho^2)} - \frac{\zeta_2^2}{2}\right)d\zeta_2 = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{\alpha_1^2}{2}\right)\int_{-\infty}^{\alpha_2}\exp\left(-\frac{(\zeta_2 - \rho\alpha_1)^2}{2(1 - \rho^2)}\right)d\zeta_2 \le \exp\left(-\frac{\alpha_1^2}{2}\right)
\]
Moreover, since $\left|\varepsilon\left(\min\left\{\frac{\alpha_1}{\rho}, \alpha_2\right\}\right)\right| \le \exp\left(-\frac{1}{2}\min\left\{\frac{\alpha_1}{\rho}, \alpha_2\right\}^2\right)$, we have that
\[
\left|\Phi_2(\alpha_1, \alpha_2, \rho) - \mathbb{I}\{\alpha_1 \ge 0;\ \alpha_2 \ge 0\}\right| \le \exp\left(-\frac{1}{2}\min\left\{\frac{\alpha_1}{\rho}, \alpha_2\right\}^2\right) + \exp\left(-\frac{\alpha_1^2}{2}\right) \le \exp\left(-\frac{\min\{\alpha_1, \alpha_2\}^2}{2}\right) + \exp\left(-\frac{\alpha_1^2}{2}\right)
\]
Exchanging $\alpha_1$ and $\alpha_2$, we likewise have
\[
\left|\Phi_2(\alpha_1, \alpha_2, \rho) - \mathbb{I}\{\alpha_1 \ge 0;\ \alpha_2 \ge 0\}\right| \le \exp\left(-\frac{\min\{\alpha_1, \alpha_2\}^2}{2}\right) + \exp\left(-\frac{\alpha_2^2}{2}\right)
\]
Thus, we have
\[
\left|\Phi_2(\alpha_1, \alpha_2, \rho) - \mathbb{I}\{\alpha_1 \ge 0;\ \alpha_2 \ge 0\}\right| \le 2\exp\left(-\frac{\min\{\alpha_1, \alpha_2\}^2}{2}\right)
\]

D.1.3 Uni-variate Coupled Expectation

Lemma D.6. Let $z_1, z_2 \sim \mathcal{N}(0, 1)$ with $\mathrm{Cov}(z_1, z_2) = \rho$. Let $a, b \in \mathbb{R}$. Define
\[
T_1 = \exp\left(-\frac{a^2}{2}\right)\Phi_1\left(\frac{\rho a - b}{\sqrt{1 - \rho^2}}\right); \quad T_2 = \exp\left(-\frac{b^2}{2}\right)\Phi_1\left(\frac{\rho b - a}{\sqrt{1 - \rho^2}}\right)
\]
Then we have that
\[
\mathbb{E}\left[z_1\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] = \frac{1}{\sqrt{2\pi}}\left(T_1 + \rho T_2\right); \quad \mathbb{E}\left[z_2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] = \frac{1}{\sqrt{2\pi}}\left(T_2 + \rho T_1\right)
\]
Proof. Writing the expectation in integral form, we have that
\[
\mathbb{E}\left[z_1\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] = \int_b^\infty\int_a^\infty z_1 f(z_1, z_2)\,dz_1\,dz_2 = \int_b^\infty\left(\int_a^\infty z_1 f(z_1 \mid z_2)\,dz_1\right)f(z_2)\,dz_2
\]
Since $z_1, z_2 \sim \mathcal{N}(0,1)$ with covariance $\rho$, we have that $z_1 \mid z_2 \sim \mathcal{N}\left(\rho z_2, 1 - \rho^2\right)$. Therefore, let $\rho' = 1 - \rho^2$; we have
\[
f(z_1 \mid z_2) = \frac{1}{\sqrt{2\pi\rho'}}\exp\left(-\frac{(z_1 - \rho z_2)^2}{2\rho'}\right)
\]
Thus, by Lemma D.14 (where we set $\kappa = \sqrt{\rho'}$ and $\mu = \rho z_2$), we have that
\[
\int_a^\infty z_1 f(z_1 \mid z_2)\,dz_1 = \frac{1}{\sqrt{2\pi\rho'}}\left(\rho'\exp\left(-\frac{(a - \rho z_2)^2}{2\rho'}\right) + \rho z_2\sqrt{2\pi\rho'}\,\Phi_1\left(\frac{\rho z_2 - a}{\sqrt{\rho'}}\right)\right) = \sqrt{\frac{\rho'}{2\pi}}\exp\left(-\frac{(a - \rho z_2)^2}{2\rho'}\right) + \rho z_2\Phi_1\left(\frac{\rho z_2 - a}{\sqrt{\rho'}}\right)
\]
Plugging into the original integral gives
\[
\mathbb{E}\left[z_1\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] = \sqrt{\frac{\rho'}{2\pi}}\int_b^\infty \exp\left(-\frac{(a - \rho z_2)^2}{2\rho'}\right)f(z_2)\,dz_2 + \rho\int_b^\infty z_2\Phi_1\left(\frac{\rho z_2 - a}{\sqrt{\rho'}}\right)f(z_2)\,dz_2
\]
Notice that
\[
\exp\left(-\frac{(a - \rho z_2)^2}{2\rho'} - \frac{z_2^2}{2}\right) = \exp\left(-\frac{a^2 - 2\rho a z_2 + z_2^2}{2\rho'}\right) = \exp\left(-\frac{(z_2 - \rho a)^2}{2\rho'}\right)\cdot\exp\left(-\frac{a^2}{2}\right)
\]
From a previous result, we have an identity for the conditional probability, which expresses the CDF term as an integral:
\[
\Phi_1\left(\frac{\rho z_2 - a}{\sqrt{1 - \rho^2}}\right) = \int_a^\infty f(z_1 \mid z_2)\,dz_1 \quad (69)
\]
and so:
\[
\rho\int_b^\infty z_2\Phi_1\left(\frac{\rho z_2 - a}{\sqrt{1 - \rho^2}}\right)f(z_2)\,dz_2 = \rho\int_b^\infty z_2\int_a^\infty f(z_1 \mid z_2)\,dz_1\, f(z_2)\,dz_2
\]
Moreover, by applying Lemma D.16, we have
\[
\mathbb{E}\left[z_1\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] = \frac{\sqrt{\rho'}}{2\pi}\exp\left(-\frac{a^2}{2}\right)\int_b^\infty\exp\left(-\frac{(z_2 - \rho a)^2}{2\rho'}\right)
\]
\[
dz_2 + \rho\int_b^\infty z_2\int_a^\infty f(z_1 \mid z_2)\,dz_1\, f(z_2)\,dz_2 = \frac{\rho'}{\sqrt{2\pi}}\exp\left(-\frac{a^2}{2}\right)\int_b^\infty f(z_2 \mid z_1 = a)\,dz_2 + \rho\int_b^\infty\int_a^\infty z_2\underbrace{f(z_1 \mid z_2)f(z_2)}_{f(z_1, z_2)}\,dz_1\,dz_2 = \frac{\rho'}{\sqrt{2\pi}}\exp\left(-\frac{a^2}{2}\right)\Phi_1\left(\frac{\rho a - b}{\sqrt{1 - \rho^2}}\right) + \rho\,\mathbb{E}\left[z_2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right]
\]
Therefore, we can conclude that
\[
\mathbb{E}\left[z_1\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] - \rho\,\mathbb{E}\left[z_2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] = \frac{\rho'}{\sqrt{2\pi}}\exp\left(-\frac{a^2}{2}\right)\Phi_1\left(\frac{\rho a - b}{\sqrt{1 - \rho^2}}\right) = \frac{\rho'}{\sqrt{2\pi}}T_1 \quad (70)
\]
Switching $z_1, z_2$ and $a, b$ gives
\[
\mathbb{E}\left[z_2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] - \rho\,\mathbb{E}\left[z_1\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] = \frac{\rho'}{\sqrt{2\pi}}\exp\left(-\frac{b^2}{2}\right)\Phi_1\left(\frac{\rho b - a}{\sqrt{1 - \rho^2}}\right) = \frac{\rho'}{\sqrt{2\pi}}T_2 \quad (71)
\]
Solving for $\mathbb{E}\left[z_1\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right]$ and $\mathbb{E}\left[z_2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right]$ from (70) and (71) gives
\[
\mathbb{E}\left[z_1\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] = \frac{1}{\sqrt{2\pi}}\left(T_1 + \rho T_2\right); \quad \mathbb{E}\left[z_2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] = \frac{1}{\sqrt{2\pi}}\left(T_2 + \rho T_1\right)
\]

Lemma D.7. Let $z_1, z_2 \sim \mathcal{N}(0, 1)$ with $\mathrm{Cov}(z_1, z_2) = \rho$. Let $a, b \in \mathbb{R}$. Define
\[
T_1 = \exp\left(-\frac{a^2}{2}\right)\Phi_1\left(\frac{\rho a - b}{\sqrt{1 - \rho^2}}\right); \quad T_2 = \exp\left(-\frac{b^2}{2}\right)\Phi_1\left(\frac{\rho b - a}{\sqrt{1 - \rho^2}}\right)
\]
Then we have that
\begin{align*}
\mathbb{E}\left[z_1^2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] &= \frac{\rho\sqrt{1 - \rho^2}}{2\pi}\exp\left(-\frac{a^2 - 2\rho ab + b^2}{2(1 - \rho^2)}\right) + \Phi_2(-a, -b, \rho) + \frac{1}{\sqrt{2\pi}}\left(aT_1 + \rho^2 bT_2\right) \\
\mathbb{E}\left[z_2^2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] &= \frac{\rho\sqrt{1 - \rho^2}}{2\pi}\exp\left(-\frac{a^2 - 2\rho ab + b^2}{2(1 - \rho^2)}\right) + \Phi_2(-a, -b, \rho) + \frac{1}{\sqrt{2\pi}}\left(bT_2 + \rho^2 aT_1\right)
\end{align*}
Proof. Again, we write the expectation in integral form to get that
\[
\mathbb{E}\left[z_1^2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] = \int_b^\infty\int_a^\infty z_1^2 f(z_1, z_2)\,dz_1\,dz_2 = \int_b^\infty\left(\int_a^\infty z_1^2 f(z_1 \mid z_2)\,dz_1\right)f(z_2)\,dz_2
\]
Since $z_1, z_2 \sim \mathcal{N}(0,1)$ with $\mathrm{Cov}(z_1, z_2) = \rho$, we have that $z_1 \mid z_2 \sim \mathcal{N}\left(\rho z_2, 1 - \rho^2\right)$. Define $\rho' = 1 - \rho^2$. By Lemma D.15 (where we set $\kappa = \sqrt{\rho'}$ and $\mu = \rho z_2$), we have that
\[
\int_a^\infty z_1^2 f(z_1 \mid z_2)\,dz_1 = \frac{1}{\sqrt{2\pi\rho'}}\left(\rho'(a + \rho z_2)\exp\left(-\frac{(a - \rho z_2)^2}{2\rho'}\right) + \sqrt{2\pi\rho'}\left(\rho' + \rho^2 z_2^2\right)\Phi_1\left(\frac{\rho z_2 - a}{\sqrt{\rho'}}\right)\right)
\]
\[
= \rho^2 z_2^2\Phi_1\left(\frac{\rho z_2 - a}{\sqrt{\rho'}}\right) + \rho z_2\sqrt{\frac{\rho'}{2\pi}}\exp\left(-\frac{(a - \rho z_2)^2}{2\rho'}\right) + a\sqrt{\frac{\rho'}{2\pi}}\exp\left(-\frac{(a - \rho z_2)^2}{2\rho'}\right) + \rho'\Phi_1\left(\frac{\rho z_2 - a}{\sqrt{\rho'}}\right)
\]
Therefore, we have
\[
\mathbb{E}\left[z_1^2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] = \rho^2\int_b^\infty z_2^2\Phi_1\left(\frac{\rho z_2 - a}{\sqrt{\rho'}}\right)f(z_2)\,dz_2 + \rho\sqrt{\frac{\rho'}{2\pi}}\int_b^\infty z_2\exp\left(-\frac{(a - \rho z_2)^2}{2\rho'}\right)f(z_2)\,dz_2 + a\sqrt{\frac{\rho'}{2\pi}}\int_b^\infty \exp\left(-\frac{(a - \rho z_2)^2}{2\rho'}\right)f(z_2)\,dz_2 + \rho'\int_b^\infty \Phi_1\left(\frac{\rho z_2 - a}{\sqrt{\rho'}}\right)f(z_2)\,dz_2
\]
To start, by Lemma D.16, we have
\[
\Phi_1\left(\frac{\rho z_2 - a}{\sqrt{\rho'}}\right)f(z_2) = \int_a^\infty f(z_1 \mid z_2)f(z_2)\,dz_1 = \int_a^\infty f(z_1, z_2)\,dz_1
\]
Therefore, for the first term, we have
\[
\int_b^\infty z_2^2\Phi_1\left(\frac{\rho z_2 - a}{\sqrt{\rho'}}\right)f(z_2)\,dz_2 = \int_b^\infty\int_a^\infty z_2^2 f(z_1, z_2)\,dz_1\,dz_2 = \mathbb{E}\left[z_2^2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right]
\]
For the last term, we have
\[
\int_b^\infty \Phi_1\left(\frac{\rho z_2 - a}{\sqrt{\rho'}}\right)f(z_2)\,dz_2 = \int_b^\infty\int_a^\infty f(z_1, z_2)\,dz_1\,dz_2 = \Phi_2(-a, -b, \rho)
\]
Next, we notice that
\[
\exp\left(-\frac{(a - \rho z_2)^2}{2\rho'}\right)f(z_2) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{a^2 - 2\rho a z_2 + z_2^2}{2\rho'}\right) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{a^2}{2}\right)\exp\left(-\frac{(z_2 - \rho a)^2}{2\rho'}\right)
\]
Therefore, for the second term, we apply Lemma D.14 (where we set $\kappa = \sqrt{\rho'}$ and $\mu = \rho a$) to get that
\[
\int_b^\infty z_2\exp\left(-\frac{(a - \rho z_2)^2}{2\rho'}\right)f(z_2)\,dz_2 = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{a^2}{2}\right)\left(\rho'\exp\left(-\frac{(\rho a - b)^2}{2\rho'}\right) + \sqrt{2\pi\rho'}\,\rho a\,\Phi_1\left(\frac{\rho a - b}{\sqrt{\rho'}}\right)\right) = \frac{\rho'}{\sqrt{2\pi}}\exp\left(-\frac{a^2 - 2\rho ab + b^2}{2\rho'}\right) + \rho a\sqrt{\rho'}\exp\left(-\frac{a^2}{2}\right)\Phi_1\left(\frac{\rho a - b}{\sqrt{\rho'}}\right)
\]
For the third term, we have
\[
\int_b^\infty \exp\left(-\frac{(a - \rho z_2)^2}{2\rho'}\right)f(z_2)\,dz_2 = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{a^2}{2}\right)\int_b^\infty \exp\left(-\frac{(z_2 - \rho a)^2}{2\rho'}\right)
\]
\[
dz_2 = \sqrt{\rho'}\exp\left(-\frac{a^2}{2}\right)\int_b^\infty f(z_2 \mid z_1 = a)\,dz_2 = \sqrt{\rho'}\exp\left(-\frac{a^2}{2}\right)\Phi_1\left(\frac{\rho a - b}{\sqrt{\rho'}}\right)
\]
Combining all four terms gives
\begin{align*}
\mathbb{E}\left[z_1^2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] &= \rho\sqrt{\frac{\rho'}{2\pi}}\left(\frac{\rho'}{\sqrt{2\pi}}\exp\left(-\frac{a^2 - 2\rho ab + b^2}{2\rho'}\right) + \rho a\sqrt{\rho'}\exp\left(-\frac{a^2}{2}\right)\Phi_1\left(\frac{\rho a - b}{\sqrt{\rho'}}\right)\right) + a\sqrt{\frac{\rho'}{2\pi}}\cdot\sqrt{\rho'}\exp\left(-\frac{a^2}{2}\right)\Phi_1\left(\frac{\rho a - b}{\sqrt{\rho'}}\right) \\
&\quad + \rho^2\,\mathbb{E}\left[z_2^2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] + \rho'\Phi_2(-a, -b, \rho) \\
&= \frac{\rho'^{3/2}\rho}{2\pi}\exp\left(-\frac{a^2 - 2\rho ab + b^2}{2\rho'}\right) + \frac{\rho' a}{\sqrt{2\pi}}\left(\rho^2 + 1\right)\exp\left(-\frac{a^2}{2}\right)\Phi_1\left(\frac{\rho a - b}{\sqrt{\rho'}}\right) + \rho^2\,\mathbb{E}\left[z_2^2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] + \rho'\Phi_2(-a, -b, \rho)
\end{align*}
Therefore, we can conclude that
\begin{align*}
\mathbb{E}\left[z_1^2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] - \rho^2\,\mathbb{E}\left[z_2^2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] &= \frac{\rho'^{3/2}\rho}{2\pi}\exp\left(-\frac{a^2 - 2\rho ab + b^2}{2\rho'}\right) + \frac{\rho' a}{\sqrt{2\pi}}\left(\rho^2 + 1\right)T_1 + \rho'\Phi_2(-a, -b, \rho) \\
\mathbb{E}\left[z_2^2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] - \rho^2\,\mathbb{E}\left[z_1^2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] &= \frac{\rho'^{3/2}\rho}{2\pi}\exp\left(-\frac{a^2 - 2\rho ab + b^2}{2\rho'}\right) + \frac{\rho' b}{\sqrt{2\pi}}\left(\rho^2 + 1\right)T_2 + \rho'\Phi_2(-a, -b, \rho)
\end{align*}
Solving for $\mathbb{E}\left[z_1^2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right]$ and $\mathbb{E}\left[z_2^2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right]$ gives
\begin{align*}
\mathbb{E}\left[z_1^2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] &= \frac{\rho\sqrt{\rho'}}{2\pi}\exp\left(-\frac{a^2 - 2\rho ab + b^2}{2\rho'}\right) + \Phi_2(-a, -b, \rho) + \frac{1}{\sqrt{2\pi}}\left(aT_1 + \rho^2 bT_2\right) \\
\mathbb{E}\left[z_2^2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] &= \frac{\rho\sqrt{\rho'}}{2\pi}\exp\left(-\frac{a^2 - 2\rho ab + b^2}{2\rho'}\right) + \Phi_2(-a, -b, \rho) + \frac{1}{\sqrt{2\pi}}\left(bT_2 + \rho^2 aT_1\right)
\end{align*}
Writing $\rho' = 1 - \rho^2$ gives the desired result.

Lemma D.8. Let $z_1, z_2 \sim \mathcal{N}(0, 1)$ with $\mathrm{Cov}(z_1, z_2) = \rho$. Let $a, b \in \mathbb{R}$. Define
\[
T_1 = \exp\left(-\frac{a^2}{2}\right)\Phi_1\left(\frac{\rho a - b}{\sqrt{1 - \rho^2}}\right); \quad T_2 = \exp\left(-\frac{b^2}{2}\right)\Phi_1\left(\frac{\rho b - a}{\sqrt{1 - \rho^2}}\right)
\]
Then we have that
\[
\mathbb{E}\left[z_1 z_2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] = \frac{\sqrt{1 - \rho^2}}{2\pi}\exp\left(-\frac{a^2 - 2\rho ab + b^2}{2(1 - \rho^2)}\right) + \rho\,\Phi_2(-a, -b, \rho) + \frac{\rho}{\sqrt{2\pi}}\left(aT_1 + bT_2\right)
\]
Proof.
T o start, we write out the integral form of the exp ectation as E [ z 1 z 2 I { z 1 ≥ a ; z 2 ≥ b } ] = Z ∞ b Z ∞ a z 1 z 2 f ( z 1 , z 2 ) dz 1 dz 2 = Z ∞ b  Z ∞ a z 1 f ( z 1 | z 2 ) dz 1  z 2 f ( z 2 ) dz 2 52 Since z 1 , z 2 ∼ N (0 , 1) with co v ariance ρ , w e hav e that z 1 | z 2 ∼ N  ρz 2 , 1 − ρ 2  . Therefore, let ρ ′ = 1 − ρ 2 , w e ha ve f ( z 1 | z 2 ) = 1 √ 2 π ρ ′ exp − ( z 1 − ρz 2 ) 2 2 ρ ′ ! Th us, b y Lemma D.14, we hav e that Z ∞ a z 1 f ( z 1 | z 2 ) dz 1 = 1 √ 2 π ρ ′ Z ∞ a z 1 exp − ( z 1 − ρz 2 ) 2 2 ρ ′ ! dz 1 = 1 √ 2 π ρ ′ ρ ′ exp − ( a − ρz 2 ) 2 2 ρ ′ ! + ρz 2 p 2 π ρ ′ Φ 1  ρz 2 − a √ ρ ′  ! = r ρ ′ 2 π exp − ( a − ρz 2 ) 2 2 ρ ′ ! + ρz 2 Φ 1  ρz 2 − a √ ρ ′  where w e set κ = √ ρ ′ and µ = ρz 2 . Therefore E [ z 1 z 2 I { z 1 ≥ a ; z 2 ≥ b } ] = Z ∞ b r ρ ′ 2 π exp − ( a − ρz 2 ) 2 2 ρ ′ ! + ρz 2 Φ 1  ρz 2 − a √ ρ ′  ! z 2 f ( z 2 ) dz 2 = r ρ ′ 2 π Z ∞ b z 2 exp − ( a − ρz 2 ) 2 2 ρ ′ ! f ( z 2 ) dz 2 + ρ Z ∞ b z 2 2 Φ 1  ρz 2 − a √ ρ ′  f ( z 2 ) dz 2 T o start, by Lemma D.16, w e hav e Φ 1  ρz 2 − a √ ρ ′  f ( z 2 ) = Z ∞ a f ( z 1 | z 2 ) dz 1 f ( z 2 ) = Z ∞ a f ( z 1 , z 2 ) dz 1 Therefore, for the second term, we hav e Z ∞ b z 2 2 Φ 1  ρz 2 − a √ ρ ′  f ( z 2 ) dz 2 = Z ∞ b Z ∞ a z 2 2 f ( z 1 , z 2 ) dz 1 dz 2 = E  z 2 2 I { z 1 ≥ a ; z 2 ≥ b }  F or the first term, we ha ve exp − ( a − ρz 2 ) 2 2 ρ ′ ! f ( z 2 ) = 1 √ 2 π exp  − z 2 2 − 2 ρaz 2 + a 2 2 ρ ′  = 1 √ 2 π exp  − a 2 2  exp − ( z 2 − ρa ) 2 2 ρ ′ ! Therefore, b y Lemma D.14, the first term can b e written as Z ∞ b z 2 exp − ( a − ρz 2 ) 2 2 ρ ′ ! f ( z 2 ) dz 2 = 1 √ 2 π exp  − a 2 2  Z ∞ b z 2 exp − ( z 2 − ρa ) 2 2 ρ ′ ! dz 2 = 1 √ 2 π exp  − a 2 2  ρ ′ exp − ( ρa − b ) 2 2 ρ ′ ! + p 2 π ρ ′ ρa Φ 1  ρa − b √ ρ ′  ! = ρ ′ √ 2 π exp  − a 2 − 2 ρab + b 2 2 ρ ′  + ρa p ρ ′ exp  − a 2 2  Φ 1  ρa − b √ ρ ′  where w e set κ = √ ρ ′ and µ = ρa in Lemma D.14. 
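As an aside (not part of the proof), the closed form of Lemma D.14, which is used repeatedly in these derivations, can be sanity-checked numerically. The sketch below is ours, stdlib-only, and the helper names are illustrative:

```python
import math

def Phi(x: float) -> float:
    # Standard normal CDF, Phi_1 in the paper's notation.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def trunc_first_moment_closed(a: float, mu: float, kappa: float) -> float:
    # Closed form of Lemma D.14: int_a^inf z * exp(-(z - mu)^2 / (2 kappa^2)) dz.
    return (kappa ** 2 * math.exp(-((mu - a) ** 2) / (2.0 * kappa ** 2))
            + kappa * mu * math.sqrt(2.0 * math.pi) * Phi((mu - a) / kappa))

def trunc_first_moment_numeric(a: float, mu: float, kappa: float, n: int = 200_000) -> float:
    # Trapezoid rule; the integrand is negligible beyond mu + 12*kappa.
    hi = max(a, mu) + 12.0 * kappa
    h = (hi - a) / n
    total = 0.0
    for i in range(n + 1):
        z = a + i * h
        w = 0.5 if i in (0, n) else 1.0
        total += w * z * math.exp(-((z - mu) ** 2) / (2.0 * kappa ** 2))
    return total * h

for a, mu, kappa in [(0.3, 0.5, 0.8), (-1.0, 0.0, 1.0), (0.0, -0.7, 1.3)]:
    closed = trunc_first_moment_closed(a, mu, kappa)
    numeric = trunc_first_moment_numeric(a, mu, kappa)
    assert abs(closed - numeric) < 1e-5, (a, mu, kappa, closed, numeric)
```

For instance, with a = −1, µ = 0, κ = 1 the formula reduces to exp(−1/2), matching the direct antiderivative of z·exp(−z²/2).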
Com bining the tw o terms, we hav e E [ z 1 z 2 I { z 1 ≥ a ; z 2 ≥ b } ] = r ρ ′ 2 π  ρ ′ √ 2 π exp  − a 2 − 2 ρab + b 2 2 ρ ′  + ρa p ρ ′ exp  − a 2 2  Φ 1  ρa − b √ ρ ′  + ρ E  z 2 2 I { z 1 ≥ a ; z 2 ≥ b }  = ρ ′ 3 2 2 π exp  − a 2 − 2 ρab + b 2 2 ρ ′  + aρρ ′ √ 2 π T 1 + ρ E  z 2 2 I { z 1 ≥ a ; z 2 ≥ b }  53 F rom Lemma D.7, we ha ve that E  z 2 2 I { z 1 ≥ a ; z 2 ≥ b }  = ρ √ ρ ′ 2 π exp  − a 2 − 2 ρab + b 2 2 ρ ′  + Φ 2 ( − a, − b, ρ ) + 1 √ 2 π  bT 2 + ρ 2 aT 1  Therefore E [ z 1 z 2 I { z 1 ≥ a ; z 2 ≥ b } ] = √ ρ ′ 2 π exp  − a 2 − 2 ρab + b 2 2 ρ ′  + ρ Φ 2 ( − a, − b, ρ ) + ρ √ 2 π ( aT 1 + bT 2 ) W rite ρ ′ = 1 − ρ 2 giv es the desired result. Lemma D.9. L et z 1 ∼ N  µ 1 , κ 2 1  and z 2 ∼ N  µ 2 , κ 2 2  , with Cov ( z 1 , z 2 ) = κ 1 κ 2 ρ . L et a, b ∈ R . Then we have E [ z 1 z 2 I { z 1 ≥ a ; z 2 ≥ b } ] = ( µ 1 µ 2 + κ 1 κ 2 ρ ) Φ  µ 1 − a κ 1 , µ 2 − b κ 2 , ρ  + κ 1 κ 2 2 π exp − 1 2 (1 − ρ 2 ) ( µ 1 − a ) 2 κ 2 1 − 2 ρ κ 1 κ 2 ( µ 1 − a ) ( µ 2 − b ) + ( µ 2 − b ) 2 κ 2 2 !! + 1 √ 2 π (( κ 2 ρa + κ 1 µ 2 ) T 1 + ( κ 1 ρb + κ 2 µ 1 ) T 2 ) Her e T 1 , T 2 ar e define d as T 1 = exp − ( a − µ 1 ) 2 2 κ 2 1 ! Φ 1 1 p 1 − ρ 2  ρ ( a − µ 1 ) κ 1 − b − µ 2 κ 2  ! T 2 = exp − ( b − µ 2 ) 2 2 κ 2 2 ! Φ 1 1 p 1 − ρ 2  ρ ( b − µ 2 ) κ 2 − a − µ 1 κ 1  ! Pr o of. Let ˆ z 1 = z 1 − µ 1 κ 1 and ˆ z 1 = z 2 − µ 2 κ 2 . Then we hav e ˆ z 1 , ˆ z 2 ∼ N (0 , 1) . 
Moreo v er, Co v ( ˆ z 1 , ˆ z 2 ) = E [ ˆ z 1 ˆ z 2 ] = 1 κ 1 κ 2 E [( z 1 − µ 1 ) ( z 2 − µ 2 )] = ρ Since z 1 = κ 1 ˆ z 1 + µ 1 and z 2 = κ 2 ˆ z 2 + µ 2 , we hav e E [ z 1 z 2 I { z 1 ≥ a ; z 2 ≥ b } ] = E  ( κ 1 ˆ z 1 + µ 1 ) ( κ 2 ˆ z 2 + µ 2 ) I  ˆ z 1 ≥ a − µ 1 κ 1 ; ˆ z 2 ≥ b − µ 2 κ 2  = κ 1 κ 2 E h ˆ z 1 ˆ z 2 I n ˆ z 1 ≥ ˆ a ; ˆ z 2 ≥ ˆ b oi + µ 1 µ 2 E h I n ˆ z 1 ≥ ˆ a ; ˆ z 2 ≥ ˆ b oi + κ 1 µ 2 E h ˆ z 1 I n ˆ z 1 ≥ ˆ a ; ˆ z 2 ≥ ˆ b oi + κ 2 µ 1 E h ˆ z 2 I n ˆ z 1 ≥ ˆ a ; ˆ z 2 ≥ ˆ b oi where w e re-defined ˆ a = a − µ 1 κ 1 and ˆ b = b − µ 2 κ 2 . By Lemma D.16, Lemma D.6, and Lemma D.8, we hav e E h ˆ z 1 I n ˆ z 1 ≥ ˆ a ; ˆ z 2 ≥ ˆ b oi = 1 √ 2 π ( T 1 + ρT 2 ) ; E h ˆ z 2 I n ˆ z 1 ≥ ˆ a ; ˆ z 2 ≥ ˆ b oi = 1 √ 2 π ( T 2 + ρT 1 ) E h ˆ z 1 ˆ z 2 I n ˆ z 1 ≥ ˆ a ; ˆ z 2 ≥ ˆ b oi = p 1 − ρ 2 2 π exp − ˆ a 2 − 2 ρ ˆ a ˆ b + ˆ b 2 2 (1 − ρ 2 ) ! + ρ Φ 2  − ˆ a, − ˆ b, ρ  + ρ √ 2 π  ˆ aT 1 + ˆ bT 2  and E h I n ˆ z 1 ≥ ˆ a ; ˆ z 2 ≥ ˆ b oi = Φ 2  − ˆ a, − ˆ b, ρ  . Here, T 1 , T 2 are defined as T 1 = exp  − ˆ a 2 2  Φ ρ ˆ a − ˆ b p 1 − ρ 2 ! ; T 2 = exp − ˆ b 2 2 ! Φ ρ ˆ b − ˆ a p 1 − ρ 2 ! Plugging in the v alue of ˆ a and ˆ b give s the desired result. 54 D.1.4 Multi-va riate Coupled Exp ectation Lemma D.10. L et c ∼ N  µ , κ 2 I  , and let u ∈ R d . Define z = c ⊤ u . Then we have E [ c I { z ≥ 0 } ] = µ Φ 1  − µ ⊤ u κ ∥ u ∥ 2  + κ √ 2 π exp −  µ ⊤ u  2 2 κ 2 ∥ u ∥ 2 2 ! · u ∥ u ∥ 2 Pr o of. A ccording to the law of total exp ectation, E  c I  c ⊤ u ≥ 0  = E z [ E c [ c I { z ≥ 0 } | z ]] = E z [ E c [ c | z ] I { z ≥ 0 } ] By Lemma D.1, we hav e that E c [ c | z ] = µ + u ∥ u ∥ 2 2  z − µ ⊤ u  . Therefore E  c I  c ⊤ u ≥ 0  = u ∥ u ∥ 2 2 E z [ z I { z ≥ 0 } ] + µ − µ ⊤ u · u ∥ u ∥ 2 2 ! E z [ I { z ≥ 0 } ] By definition, E z [ I { z ≥ 0 } ] = Pr ( z ≥ 0) = 1 − Pr ( z ≤ 0) . 
Since, by Lemma D.13, z ∼ N  µ ⊤ u , κ 2 ∥ u ∥ 2 2  , w e ha ve that Pr ( z ≤ 0) = Pr  z − µ ⊤ u κ ∥ u ∥ 2 ≤ − µ ⊤ u ∥ u ∥ 2  = Φ 1  − µ ⊤ u κ ∥ u ∥ 2  Moreo ver, let ˆ z = z − µ ⊤ u κ ∥ u ∥ 2 , then we hav e E z [ z I { z ≥ 0 } ] = E ˆ z   κ ∥ u ∥ 2 ˆ z + µ ⊤ u  I  ˆ z ≥ − µ ⊤ u κ ∥ u ∥ 2  = κ ∥ u ∥ 2 E ˆ z  ˆ z I  ˆ z ≥ − µ ⊤ u κ ∥ u ∥ 2  + µ ⊤ u E z [ z ≥ 0] By the PDF of ˆ z , w e hav e E ˆ z [ ˆ z I { z ≥ 0 } ] = 1 √ 2 π Z ∞ a z exp  − z 2 2  dz = − 1 √ 2 π exp  − z 2 2  | ∞ a = 1 √ 2 π exp  − a 2 2  Therefore E z [ z I { z ≥ 0 } ] = κ ∥ u ∥ 2 √ 2 π exp −  µ ⊤ u  2 2 κ 2 ∥ u ∥ 2 2 ! + µ ⊤ u E z [ z ≥ 0] Plugging in gives E  c I  c ⊤ u ≥ 0  = µ Φ 1  − µ ⊤ u κ ∥ u ∥ 2  + κ √ 2 π exp −  µ ⊤ u  2 2 κ 2 ∥ u ∥ 2 2 ! · u ∥ u ∥ 2 Lemma D.11. L et c ∼ N  µ , κ 2 I  with κ ≤ 1 . L et u , v b e given, and let z 1 = c ⊤ u , z 2 = c ⊤ v . Then we have that     E c [ c I { z 1 ≥ 0 } ] − µ Φ 1  µ ⊤ u κ ∥ u ∥ 2      ∞ ≤ κ ∥ u ∥ 2 exp −  µ ⊤ u  2 2 κ 2 ∥ u ∥ 2 ! Pr o of. Giv en the form of the conditional exp ectation, we hav e E c [ c I { z 1 ≥ 0 } ] = Z ∞ 0 E c [ c | z 1 ] f 1 ( z 1 ) dz 1 = Z ∞ 0 µ + u ∥ u ∥ 2 2  z 1 − µ ⊤ u  ! f 1 ( z 1 ) dz 1 55 Since z 1 = c ⊤ u , we must hav e that z 1 ∼ N  µ ⊤ u , κ 2 ∥ u ∥ 2 2  . Define z ′ = z 1 − µ ⊤ u κ ∥ u ∥ 2 , then we hav e that z 1 = κ ∥ u ∥ 2 z ′ + µ ⊤ u E c [ c I { z 1 ≥ 0 } ] = Z ∞ − µ ⊤ u κ ∥ u ∥ 2 ( µ + κ u · z ′ ) f ( z ′ ) dz ′ = µ · Z µ ⊤ u κ ∥ u ∥ 2 −∞ f ( z ′ ) dz ′ + κ u · Z µ ⊤ u κ ∥ u ∥ 2 −∞ z ′ f ( z ′ ) dz ′ = µ Φ 1  µ ⊤ u κ ∥ u ∥ 2  + κ u exp −  µ ⊤ u  2 2 κ 2 ∥ u ∥ 2 ! where in the last equality we applied Lemma D.14 with a = 0 and κ = 1 in the lemma. Therefore, we hav e that     E c [ c I { z 1 ≥ 0 } ] − µ Φ 1  µ ⊤ u κ ∥ u ∥ 2      ∞ ≤ κ ∥ u ∥ ∞ exp −  µ ⊤ u  2 2 κ 2 ∥ u ∥ 2 ! No w, w e shall div e into E c  cc ⊤ I { z 1 ≥ 0; z 2 ≥ 0 }  . Lemma D.12. L et c ∼ N  µ , κ 2 I  with ∥ µ ∥ 2 ≥ 2 and κ ≤ 1 . 
L et u , v b e given, and let z 1 = µ ⊤ u , z 2 = µ ⊤ v . Then we have that     E c  cc ⊤ I { z 1 ≥ 0; z 2 ≥ 0 }  v −  µµ ⊤ v + 3 κ 2 v  Φ 1  µ ⊤ u 2 κ ∥ v ∥ 2  Φ 1  µ ⊤ v 2 κ ∥ v ∥ 2      ≤ ∆ wher e ∆ is given by ∆ = 2 κ ∥ v ∥ 2  ∥ µ ∥ 2  ϕ  µ ⊤ u 2 κ ∥ u ∥ 2  + ϕ  µ ⊤ u 2 κ ∥ v ∥ 2  + ∥ µ ∥ ∞  ψ  µ ⊤ u 2 κ ∥ u ∥ 2  + ψ  µ ⊤ u 2 κ ∥ u ∥ 2  Pr o of. W e hav e E c  cc ⊤ I { z 1 ≥ 0; z 2 ≥ 0 }  = Z z 1 ,z 2 ≥ 0 E c  cc ⊤ | z 1 , z 2  f ( z 1 , z 2 ) dz 1 dz 2 Notice that E c  cc ⊤ | z 1 , z 2  = Co v ( c | z 1 , z 2 ) + E [ c | z 1 , z 2 ] E [ c | z 1 , z 2 ] ⊤ By the form of Cov ( c | z 1 , z 2 ) , we hav e Co v ( c | z 1 , z 2 ) = κ 2 I − κ 2  uv ⊤ − vu ⊤  2 ∥ v ∥ 2 2 ∥ u ∥ 2 2 − ⟨ v , u ⟩ 2 := M By the form of E [ c | z 1 , z 2 ] w e hav e E [ c | z 1 , z 2 ] = z 1 · s 1 + z 2 · s 2 + s 3 where s 1 = ∥ v ∥ 2 2 u − ⟨ u , v ⟩ v ∥ v ∥ 2 2 ∥ u ∥ 2 2 − ⟨ v , u ⟩ 2 ; s 2 = ∥ u ∥ 2 2 v − ⟨ u , v ⟩ u ∥ v ∥ 2 2 ∥ u ∥ 2 2 − ⟨ v , u ⟩ 2 s 3 = µ −  ∥ v ∥ 2 2 u − ⟨ u , v ⟩ v  ⟨ µ , u ⟩ +  ∥ u ∥ 2 2 v − ⟨ u , v ⟩ u  ⟨ µ , v ⟩ ∥ v ∥ 2 2 ∥ u ∥ 2 2 − ⟨ v , u ⟩ 2 56 Therefore E [ c | z 1 , z 2 ] E [ c | z 1 , z 2 ] ⊤ = ( z 1 · s 1 + z 2 · s 2 + s 3 ) ( z 1 · s 1 + z 2 · s 2 + s 3 ) ⊤ Recall that we are interested in E c  cc ⊤ I { z 1 ≥ 0; z 2 ≥ 0 }  = Z z 1 ,z 2 ≥ 0  Co v ( c | z 1 , z 2 ) + E [ c | z 1 , z 2 ] E [ c | z 1 , z 2 ] ⊤  f ( z 1 , z 2 ) dz 1 dz 2 Let ˆ z 1 = z 1 − µ ⊤ u κ ∥ u ∥ 2 ; ˆ z 2 = z 1 − µ ⊤ v κ ∥ v ∥ 2 Then w e hav e ˆ z 1 , ˆ z 2 ∼ N (0 , 1); Co v ( ˆ z 1 , ˆ z 2 ) = u ⊤ v ∥ u ∥ 2 ∥ v ∥ 2 := ρ Th us, ˆ z 1 | ˆ z 2 = γ 2 ∼ N  ργ 2 , 1 − ρ 2  ; ˆ z 2 | ˆ z 1 = γ 1 ∼ N  ργ 1 , 1 − ρ 2  Moreo ver, E [ c | z 1 , z 2 ] = z 1 · s 1 + z 2 · s 2 + s 3 =  κ ∥ u ∥ 2 ˆ z 1 + µ ⊤ u  s 1 +  κ ∥ v ∥ 2 ˆ z 2 + µ ⊤ v  s 2 + s 3 Redefine ˆ s 1 = κ ∥ u ∥ 2 s 1 ; ˆ s 2 = κ ∥ v ∥ 2 s 2 ; ˆ s 3 = µ ⊤ u · s 1 + µ ⊤ v · s 2 + s 3 = µ Then w e hav e E [ c | z 1 , z 2 ] = ˆ s 1 ˆ z 1 + ˆ s 2 ˆ z 2 + ˆ s 3 In this case, let ˆ f b e 
the joint PDF of ˆ z 1 and ˆ z 2 , then we hav e ˆ f ( ˆ z 1 , ˆ z 2 ) = 1 2 π p 1 − ρ 2 exp  − 1 2(1 − ρ 2 )  z 2 1 − 2 ρz 1 z 2 + z 2 2   = κ 2 ∥ u ∥ 2 ∥ v ∥ 2 2 π κ 2 ∥ u ∥ 2 ∥ v ∥ 2 p 1 − ρ 2 exp  − 1 2(1 − ρ 2 )  z 2 1 − 2 ρz 1 z 2 + z 2 2   = κ 2 ∥ u ∥ 2 ∥ v ∥ 2 f ( ˆ z 1 , ˆ z 2 ) 57 Therefore, since dz 1 = κ ∥ u ∥ 2 d ˆ z 1 , and dz 2 = κ ∥ v ∥ 2 d ˆ z 2 , we hav e E c  cc ⊤ I { z 1 ≥ 0; z 2 ≥ 0 }  = Z z 1 ,z 2 ≥ 0  Co v ( c | z 1 , z 2 ) + E [ c | z 1 , z 2 ] E [ c | z 1 , z 2 ] ⊤  f ( z 1 , z 2 ) dz 1 dz 2 = Z ∞ − µ ⊤ v κ ∥ v ∥ 2 Z ∞ − µ ⊤ u κ ∥ u ∥ 2  Co v ( c | z 1 , z 2 ) + E [ c | z 1 , z 2 ] E [ c | z 1 , z 2 ] ⊤  ˆ f ( ˆ z 1 , ˆ z 2 ) d ˆ z 1 d ˆ z 2 = ˆ s 1 ˆ s ⊤ 1 Z ∞ − µ ⊤ v κ ∥ v ∥ 2 Z ∞ − µ ⊤ u κ ∥ u ∥ 2 ˆ z 2 1 ˆ f ( ˆ z 1 , ˆ z 2 ) d ˆ z 1 d ˆ z 2 | {z } I 1 + ˆ s 2 ˆ s ⊤ 2 Z ∞ − µ ⊤ v κ ∥ v ∥ 2 Z ∞ − µ ⊤ u κ ∥ u ∥ 2 ˆ z 2 2 ˆ f ( ˆ z 1 , ˆ z 2 ) d ˆ z 1 d ˆ z 2 | {z } I 2 +  ˆ s 1 ˆ s ⊤ 2 + ˆ s 2 ˆ s ⊤ 1  Z ∞ − µ ⊤ v κ ∥ v ∥ 2 Z ∞ − µ ⊤ u κ ∥ u ∥ 2 ˆ z 1 ˆ z 2 ˆ f ( ˆ z 1 , ˆ z 2 ) d ˆ z 1 d ˆ z 2 | {z } I 3 +  ˆ s 1 ˆ s 3 + ˆ s 3 ˆ s ⊤ 1  Z ∞ − µ ⊤ v κ ∥ v ∥ 2 Z ∞ − µ ⊤ u κ ∥ u ∥ 2 ˆ z 1 ˆ f ( ˆ z 1 , ˆ z 2 ) d ˆ z 1 d ˆ z 2 | {z } I 4 +  ˆ s 2 ˆ s ⊤ 3 + ˆ s 3 ˆ s ⊤ 2  Z ∞ − µ ⊤ v κ ∥ v ∥ 2 Z ∞ − µ ⊤ u κ ∥ u ∥ 2 ˆ z 2 ˆ f ( ˆ z 1 , ˆ z 2 ) d ˆ z 1 d ˆ z 2 | {z } I 5 +  ˆ s 3 ˆ s ⊤ 3 + M  Z ∞ − µ ⊤ v κ ∥ v ∥ 2 Z ∞ − µ ⊤ u κ ∥ u ∥ 2 ˆ f ( ˆ z 1 , ˆ z 2 ) d ˆ z 1 d ˆ z 2 | {z } I 6 Since our goal is to study the term E c  cc ⊤ I { z 1 ≥ 0; z 2 ≥ 0 }  v , w e need to understand the terms I 1 to I 6 , as w ell as understanding the matrix-v ector pro duct in front of these terms. 
T o start, under some standard computation, w e hav e ˆ s ⊤ 1 u = κ ∥ u ∥ 2 · ∥ v ∥ 2 2 u ⊤ u − v ⊤ u · v ⊤ u ∥ v ∥ 2 2 ∥ u ∥ 2 2 − ( v ⊤ u ) 2 = κ ∥ u ∥ 2 ˆ s ⊤ 1 v = κ ∥ u ∥ 2 · ∥ v ∥ 2 2 u ⊤ v − v ⊤ u · v ⊤ v ∥ v ∥ 2 2 ∥ u ∥ 2 2 − ( v ⊤ u ) 2 = 0 ˆ s ⊤ 2 u = κ ∥ v ∥ 2 · ∥ u ∥ 2 2 v ⊤ u − v ⊤ u · u ⊤ u ∥ v ∥ 2 2 ∥ u ∥ 2 2 − ( v ⊤ u ) 2 = 0 ˆ s ⊤ 2 v = κ ∥ v ∥ 2 · ∥ u ∥ 2 2 v ⊤ v − v ⊤ u · u ⊤ v ∥ v ∥ 2 2 ∥ u ∥ 2 2 − ( v ⊤ u ) 2 = κ ∥ v ∥ 2 Therefore, the following must holds ˆ s 1 ˆ s ⊤ 1 v = 0 ; ˆ s 2 ˆ s ⊤ 2 v = κ ∥ v ∥ 2 ˆ s 2 ;  ˆ s 1 ˆ s ⊤ 2 + ˆ s 2 ˆ s ⊤ 1  v = κ ∥ v ∥ 2 ˆ s 1 ;  ˆ s 1 ˆ s ⊤ 3 + ˆ s 3 ˆ s ⊤ 1  v = µ ⊤ v · ˆ s 1 ;  ˆ s 2 ˆ s ⊤ 3 + ˆ s 3 ˆ s ⊤ 2  v = µ ⊤ v · ˆ s 2 + κ ∥ v ∥ 2 µ 58 Lastly , we hav e  ˆ s 3 ˆ s ⊤ 3 + M  v = µ ⊤ v · µ + κ 2 v − κ 2 ∥ v ∥ 2 2 ∥ u ∥ 2 2 − ( v ⊤ u ) 2  uv ⊤ − vu ⊤  2 v = µ ⊤ v · µ + κ 2 v − κ 2 ∥ v ∥ 2 2 ∥ u ∥ 2 2 − ( v ⊤ u ) 2  uv ⊤ − vu ⊤   u ⊤ v · v − ∥ v ∥ 2 2 u  = µ ⊤ v · µ + κ 2 v − κ 2 ∥ v ∥ 2 2 ∥ u ∥ 2 2 − ( v ⊤ u ) 2  u ⊤ v · ∥ v ∥ 2 2 u −  u ⊤ v  2 v − u ⊤ v · ∥ v ∥ 2 2 u + ∥ v ∥ 2 2 ∥ u ∥ 2 2 v  = µ ⊤ v · µ + κ 2 v + κ 2 v = µ ⊤ v · µ + 2 κ 2 v Therefore, w e can write E c  cc ⊤ I { z 1 ≥ 0; z 2 ≥ 0 }  v as E c  cc ⊤ I { z 1 ≥ 0; z 2 ≥ 0 }  v = κ ∥ v ∥ 2 ˆ s 2 · I 2 + κ ∥ v ∥ 2 ˆ s 1 · I 3 + µ ⊤ v · ˆ s 1 · I 4 +  µ ⊤ v · ˆ s 2 + κ ∥ v ∥ 2 µ  I 5 +  µ ⊤ v · µ + 2 κ 2 v  I 6 = κ ∥ v ∥ 2 ( I 2 · ˆ s 2 + I 3 · ˆ s 1 ) + µ ⊤ v ( I 4 · ˆ s 1 + I 5 · ˆ s 2 ) +  κ ∥ v ∥ 2 · I 5 + µ ⊤ v · I 6  µ + 2 κ 2 I 6 v (72) By the definition of I 2 to I 6 , we first notice that I 6 = Z ∞ − µ ⊤ u κ ∥ u ∥ 2 Z ∞ − µ ⊤ u κ ∥ u ∥ 2 f ( ˆ z 1 , ˆ z 2 ) d ˆ z 1 d ˆ z 2 = P  ˆ z 1 ≥ − µ ⊤ u κ ∥ u ∥ 2 ; ˆ z 2 ≥ − µ ⊤ v κ ∥ v ∥ 2  = P  ˆ z 1 ≤ µ ⊤ u κ ∥ u ∥ 2 ; ˆ z 2 ≤ µ ⊤ v κ ∥ v ∥ 2  = Φ 2  µ ⊤ u κ ∥ u ∥ 2 , µ ⊤ v κ ∥ v ∥ 2 , ρ  Moreo ver, w e can in v oke Lemma D.6, Lemma D.7, and Lemma D.8 to get that I 4 = 1 √ 2 π ( T 1 + ρT 2 ) ; I 5 = 1 √ 2 π ( T 2 + ρT 1 ) I 2 = ρ p 1 − ρ 2 2 π exp  − a 2 
− 2 ρab + b 2 2 (1 − ρ 2 )  + Φ 2 ( − a, − b, ρ ) + 1 √ 2 π  bT 2 + ρ 2 aT 1  I 3 = p 1 − ρ 2 2 π exp  − a 2 − 2 ρab + b 2 2 (1 − ρ 2 )  + ρ Φ 2 ( − a, − b, ρ ) + ρ √ 2 π ( aT 1 + bT 2 ) where T 1 , T 2 and a, b are defined as T 1 = exp  − a 2 2  Φ 1 ρa − b p 1 − ρ 2 ! ; T 2 = exp  − b 2 2  Φ 1 ρb − a p 1 − ρ 2 ! a = − µ ⊤ u κ ∥ u ∥ 2 ; b = − µ ⊤ v κ ∥ v ∥ 2 ; ρ = u ⊤ v ∥ u ∥ 2 ∥ v ∥ 2 T o ease our computation, we define E = 1 − ρ 2 2 π exp  − a 2 − 2 ρab + b 2 2 (1 − ρ 2 )  ; F = Φ 2 ( − a, − b, ρ ) 59 Then the terms I 2 to I 6 can b e written as I 2 = ρE + F + 1 √ 2 π  bT 2 + ρ 2 aT 1  ; I 3 = E + ρF + 1 √ 2 π ( bT 2 + aT 1 ) I 4 = 1 √ 2 π ( T 1 + ρT 2 ) ; I 5 = 1 √ 2 π ( T 2 + ρT 1 ) ; I 6 = F (73) No w, the trick of ev aluating (72) is to re-write ˆ s 1 and ˆ s 2 as b elo w ˆ s 1 = κ ∥ u ∥ 2 ∥ u ∥ 2 2 ∥ v ∥ 2 2 − ( u ⊤ v ) 2 ·  ∥ v ∥ 2 2 u − u ⊤ v · v  = κ  ∥ v ∥ 2 2 u − u ⊤ v · v  ∥ u ∥ 2 ∥ v ∥ 2 2 (1 − ρ 2 ) = κ 1 − ρ 2 · u ∥ u ∥ 2 − κρ 1 − ρ 2 · v ∥ v ∥ 2 = κ 1 − ρ 2  u ∥ u ∥ 2 − ρ · v ∥ v ∥ 2  ˆ s 2 = κ ∥ v ∥ 2 ∥ u ∥ 2 2 ∥ v ∥ 2 2 − ( u ⊤ v ) 2 ·  ∥ u ∥ 2 2 v − u ⊤ v · u  = κ  ∥ u ∥ 2 2 v − u ⊤ v · u  ∥ u ∥ 2 ∥ v ∥ 2 2 (1 − ρ 2 ) = κ 1 − ρ 2 · v ∥ v ∥ 2 − κρ 1 − ρ 2 · u ∥ u ∥ 2 = κ 1 − ρ 2  v ∥ v ∥ 2 − ρ · u ∥ u ∥ 2  (74) No w, w e can simplify (72) with (73) and (74). 
T o start, for the terms I 2 · ˆ s 2 + I 3 · ˆ s 1 w e ha ve I 2 · ˆ s 2 + I 3 · ˆ s 1 = κ 1 − ρ 2  ρE + F + 1 √ 2 π  bT 2 + ρ 2 aT 1    v ∥ v ∥ 2 − ρ · u ∥ u ∥ 2  + κ 1 − ρ 2  E + ρF + ρ √ 2 π ( bT 2 + aT 1 )   u ∥ u ∥ 2 − ρ · v ∥ v ∥ 2  = κE 1 − ρ 2  ρ  v ∥ v ∥ 2 − ρ · u ∥ u ∥ 2  +  u ∥ u ∥ 2 − ρ · v ∥ v ∥ 2  + κF 1 − ρ 2  v ∥ v ∥ 2 − ρ · u ∥ u ∥ 2  + ρ  u ∥ u ∥ 2 − ρ · v ∥ v ∥ 2  + κρaT 1 √ 2 π (1 − ρ 2 )  ρ  v ∥ v ∥ 2 − ρ · u ∥ u ∥ 2  +  u ∥ u ∥ 2 − ρ · v ∥ v ∥ 2  + κbT 2 √ 2 π (1 − ρ 2 )  v ∥ v ∥ 2 − ρ · u ∥ u ∥ 2  + ρ  u ∥ u ∥ 2 − ρ · v ∥ v ∥ 2  = κE · u ∥ u ∥ 2 + κF · v ∥ v ∥ 2 + κρaT 1 √ 2 π · u ∥ u ∥ 2 + κbT 2 √ 2 π · v ∥ v ∥ 2 = κ  E + ρaT 1 √ 2 π  u ∥ u ∥ 2 +  F + bT 2 √ 2 π  v ∥ v ∥ 2  60 Similarly , for the term I 4 ˆ s 1 + I 4 ˆ s 2 , we hav e I 4 ˆ s 1 + I 4 ˆ s 2 = κ √ 2 π (1 − ρ 2 )  ( T 1 + ρT 2 )  v ∥ v ∥ 2 − ρ · u ∥ u ∥ 2  + ( T 2 + ρT 1 )  u ∥ u ∥ 2 − ρ · v ∥ v ∥ 2  = κT 1 √ 2 π (1 − ρ 2 )  v ∥ v ∥ 2 − ρ · u ∥ u ∥ 2  + ρ  u ∥ u ∥ 2 − ρ · v ∥ v ∥ 2  + κT 2 √ 2 π (1 − ρ 2 )  ρ  v ∥ v ∥ 2 − ρ · u ∥ u ∥ 2  +  u ∥ u ∥ 2 − ρ · v ∥ v ∥ 2  = κT 1 √ 2 π · v ∥ v ∥ 2 + κT 2 √ 2 π · u ∥ u ∥ 2 = κ √ 2 π  T 1 · v ∥ v ∥ 2 + T 2 · u ∥ u ∥ 2  Applying these ev aluations, (72) b ecomes E c  cc ⊤ I { z 1 ≥ 0; z 2 ≥ 0 }  v = κ 2 ∥ v ∥ 2  E + ρaT 1 √ 2 π  u ∥ u ∥ 2 +  F + bT 2 √ 2 π  v ∥ v ∥ 2  + κ µ ⊤ v √ 2 π  T 1 · v ∥ v ∥ 2 + T 2 · u ∥ u ∥ 2  + κ ∥ v ∥ 2 √ 2 π ( T 2 + ρT 1 ) µ + µ ⊤ v · F · µ + 2 κ 2 F v = κ 2 ∥ v ∥ 2  E · u ∥ u ∥ 2 + ρaT 1 √ 2 π · u ∥ u ∥ 2 + bT 2 √ 2 π · v ∥ v ∥ 2  | {z } g 1 + κ √ 2 π  µ ⊤ v  T 1 · v ∥ v ∥ 2 + T 2 · u ∥ u ∥ 2  + ∥ v ∥ 2 ( T 2 + ρT 1 ) µ  | {z } g 2 + F  µµ ⊤ v + 3 κ 2 v  (75) Then w e hav e that E c  cc ⊤ I { z 1 ≥ 0; z 2 ≥ 0 }  v −  µµ ⊤ v + 3 κ 2 v  Φ 1  µ ⊤ u κ ∥ u ∥ 2  Φ 1  µ ⊤ v κ ∥ v ∥ 2  = g 1 + g 2 + C  µ ⊤ u κ ∥ u ∥ 2 , µ ⊤ v κ ∥ v ∥ 2 , u ⊤ v ∥ u ∥ 2 ∥ v ∥ 2  (76) The pro of then pro ceed by estimating the magnitude of the three 
terms. T o start, we need to b ound T 1 and T 2 . In particular, since Φ 1 is the CDF, its magnitude must b e b ounded by 1 . Therefore 0 ≤ T 1 ≤ exp  − a 2 2  ; 0 ≤ T 2 ≤ exp  − b 2 2  Therefore, the ℓ ∞ norm of g 2 is b ounded b y ∥ g 2 ∥ ∞ ≤ κ √ 2 π  µ ⊤ v  exp  − a 2 2  + exp  − b 2 2  + ∥ v ∥ 2 ∥ µ ∥ 2  exp  − a 2 2  + ρ exp  − b 2 2  ≤ 2 κ √ 2 π ∥ v ∥ 2 ∥ µ ∥ 2  exp  − a 2 2  + exp  − b 2 2  ≤ κ ∥ v ∥ 2 ∥ µ ∥ 2  ϕ  a 2  + ϕ  b 2  (77) 61 Next, for E , we hav e E = 1 − ρ 2 2 π exp  − a 2 − 2 ρab + b 2 2 (1 − ρ 2 )  = 1 − ρ 2 4 π  exp  − a 2 − 2 ρab + ρ 2 b 2 2 (1 − ρ 2 ) − b 2 2  + exp  − ρ 2 a 2 − 2 ρab + b 2 2 (1 − ρ 2 ) − a 2 2  ≤ 1 4 π  exp  − a 2 2  + exp  − b 2 2  ≤ 1 4 π  ϕ  a 2  + ϕ  b 2  Therefore, the magnitude of g 1 can b e b ounded b y ∥ g 1 ∥ 2 ≤ κ 2 ∥ v ∥ 2  | E | + | a | T 1 √ 2 π + | b | T 2 √ 2 π  ≤ κ 2 ∥ v ∥ 2  1 4 π  ϕ  a 2  + ϕ  b 2  + | a | √ 2 π exp  − a 2 2  + | b | √ 2 π exp  − b 2 2  ≤ κ 2 ∥ v ∥ 2  1 4 π  ϕ  a 2  + ϕ  b 2  + ψ  a 2  + ψ  b 2  (78) Moreo ver, b y the b ound of the Gaussian Copula function, we hav e that |C ( a, b, ρ ) | ≤ 1 4 exp  − a 2 + b 2 4  Therefore, w e hav e that C  µ ⊤ u κ ∥ u ∥ 2 , µ ⊤ v κ ∥ v ∥ 2 , u ⊤ v ∥ u ∥ 2 ∥ v ∥ 2  ≤ 1 4 exp  − a 2 4  exp  − b 2 4  = 1 4 ϕ  a 2  ϕ  b 2  Com bining the results gives     E c  cc ⊤ I { z 1 ≥ 0; z 2 ≥ 0 }  v −  µµ ⊤ v + 3 κ 2 v  Φ 1  µ ⊤ u κ ∥ u ∥ 2  Φ 1  µ ⊤ v κ ∥ v ∥ 2 ,      2 ≤ κ 2 ∥ v ∥ 2  1 4 π  ϕ  a 2  + ϕ  b 2  + ψ  a 2  + ψ  b 2  + κ ∥ v ∥ 2 ∥ µ ∥ 2  ϕ  a 2  + ϕ  b 2  + 1 4    µ ⊤ v   ∥ µ ∥ ∞ + 3 κ 2 ∥ v ∥ 2  ϕ  a 2  ϕ  b 2  = ∥ v ∥ 2   κ 2 + κ ∥ µ ∥ 2   ϕ  a 2  + ϕ  b 2  +  κ 2 + κ ∥ µ ∥ ∞   ψ  a 2  + ψ  b 2  ≤ 2 κ ∥ v ∥ 2  ∥ µ ∥ 2  ϕ  a 2  + ϕ  b 2  + ∥ µ ∥ ∞  ψ  a 2  + ψ  b 2  D.1.5 Other Results Lemma D.13. L et c ∼ N  µ , κ 2 I  , and let u ∈ R d b e a ve ctor. Define z = c ⊤ u . 
Then we have that z ∼ N(µ⊤u, κ²‖u‖₂²).

Proof. Since z = c⊤u with c ∼ N(µ, κ²I), and the coordinates of c are independent, the moment generating function of z is

  M_z(t) = E[exp(zt)] = E[∏_{j=1}^d exp(c_j u_j t)] = ∏_{j=1}^d E[exp(c_j u_j t)]
         = ∏_{j=1}^d exp(u_j µ_j t + ½ u_j² κ² t²)
         = exp((∑_{j=1}^d u_j µ_j) t + ½ (∑_{j=1}^d u_j²) κ² t²)
         = exp((µ⊤u)·t + ½ ‖u‖₂² κ² t²).

Therefore z ∼ N(µ⊤u, κ²‖u‖₂²).

Lemma D.14. Let κ, µ, a ∈ ℝ be given such that κ > 0. Then we have that

  ∫_a^∞ z exp(−(z−µ)²/(2κ²)) dz = κ² exp(−(µ−a)²/(2κ²)) + κµ√(2π) Φ₁((µ−a)/κ).

Proof. We use a change of variable z′ = (z−µ)/κ. Then z = κz′ + µ and dz = κ dz′. Therefore

  ∫_a^∞ z exp(−(z−µ)²/(2κ²)) dz = ∫_{(a−µ)/κ}^∞ (κz′ + µ) exp(−z′²/2) κ dz′
    = κ² ∫_{(a−µ)/κ}^∞ z′ exp(−z′²/2) dz′ + κµ ∫_{(a−µ)/κ}^∞ exp(−z′²/2) dz′
    = κ² [−exp(−z′²/2)]_{(a−µ)/κ}^∞ + κµ√(2π) (1 − Φ₁((a−µ)/κ))
    = κ² exp(−(µ−a)²/(2κ²)) + κµ√(2π) Φ₁((µ−a)/κ).

Lemma D.15. Let κ, µ, a ∈ ℝ be given such that κ > 0. Then we have that

  ∫_a^∞ z² exp(−(z−µ)²/(2κ²)) dz = κ²(a+µ) exp(−(µ−a)²/(2κ²)) + √(2π) κ (κ² + µ²) Φ₁((µ−a)/κ).

Proof. To start, let z′ = (z−µ)/κ. Then z = κz′ + µ and dz = κ dz′. Therefore

  ∫_a^∞ z² exp(−(z−µ)²/(2κ²)) dz = κ ∫_{(a−µ)/κ}^∞ (κz′ + µ)² exp(−z′²/2) dz′
    = κ³ ∫_{(a−µ)/κ}^∞ z′² exp(−z′²/2) dz′ + 2κ²µ ∫_{(a−µ)/κ}^∞ z′ exp(−z′²/2) dz′ + κµ² ∫_{(a−µ)/κ}^∞ exp(−z′²/2) dz′.

Notice that for the third term, we have

  ∫_{(a−µ)/κ}^∞ exp(−z′²/2) dz′ = √(2π) (1 − Φ₁((a−µ)/κ)) = √(2π) Φ₁((µ−a)/κ).

For the second term, we can directly apply Lemma D.14 with κ = 1, µ = 0 to get

  ∫_{(a−µ)/κ}^∞ z′ exp(−z′²/2) dz′ = exp(−(a−µ)²/(2κ²)).

For the first term, we apply integration by parts with u(z′) = −z′ and v(z′) = exp(−z′²/2).
In particular, notice that v′(z′) = −z′ exp(−z′²/2) and u′(z′) = −1, so u(z′) dv(z′) = z′² exp(−z′²/2) dz′. Therefore

  ∫_{(a−µ)/κ}^∞ z′² exp(−z′²/2) dz′ = [u(z′)v(z′)]_{(a−µ)/κ}^∞ − ∫_{(a−µ)/κ}^∞ v(z′) du(z′)
    = [−z′ exp(−z′²/2)]_{(a−µ)/κ}^∞ + ∫_{(a−µ)/κ}^∞ exp(−z′²/2) dz′
    = ((a−µ)/κ) exp(−(µ−a)²/(2κ²)) + √(2π) (1 − Φ₁((a−µ)/κ))
    = ((a−µ)/κ) exp(−(µ−a)²/(2κ²)) + √(2π) Φ₁((µ−a)/κ).

Putting things together, we have that

  ∫_a^∞ z² exp(−(z−µ)²/(2κ²)) dz
    = κ³ (((a−µ)/κ) exp(−(µ−a)²/(2κ²)) + √(2π) Φ₁((µ−a)/κ)) + 2κ²µ exp(−(a−µ)²/(2κ²)) + κµ²√(2π) Φ₁((µ−a)/κ)
    = (κ²(a−µ) + 2κ²µ) exp(−(µ−a)²/(2κ²)) + √(2π)(κ³ + κµ²) Φ₁((µ−a)/κ)
    = κ²(a+µ) exp(−(µ−a)²/(2κ²)) + √(2π) κ (κ² + µ²) Φ₁((µ−a)/κ).

Lemma D.16. Let z₁, z₂ ∼ N(0, 1) with Cov(z₁, z₂) = ρ. Then we have that

  ∫_a^∞ f(z₁ | z₂) dz₁ = Φ₁((ρz₂ − a)/√(1−ρ²)).

Proof. Since z₁, z₂ ∼ N(0, 1) with Cov(z₁, z₂) = ρ, we have that z₁ | z₂ ∼ N(ρz₂, 1−ρ²). Therefore, using a change of variable z′ = (z₁ − ρz₂)/√(1−ρ²), we have

  ∫_a^∞ f(z₁ | z₂) dz₁ = (1/√(2π(1−ρ²))) ∫_a^∞ exp(−(z₁ − ρz₂)²/(2(1−ρ²))) dz₁
    = (1/√(2π)) ∫_{(a−ρz₂)/√(1−ρ²)}^∞ exp(−z′²/2) dz′
    = 1 − Φ₁((a − ρz₂)/√(1−ρ²)) = Φ₁((ρz₂ − a)/√(1−ρ²)).

Lemma D.17. Let z₁, z₂ ∼ N(0, 1) with Cov(z₁, z₂) = ρ. Then we have that

  Φ₂(−a, −b, ρ) = ∫_a^∞ ∫_b^∞ f(z₁, z₂) dz₂ dz₁ = ∫_b^∞ Φ₁((ρz₂ − a)/√(1−ρ²)) f(z₂) dz₂.

Proof. Let z′₁, z′₂ ∼ N(0, 1) with Cov(z′₁, z′₂) = ρ, and define z₁ = −z′₁, z₂ = −z′₂. Then we have that z₁, z₂ ∼ N(0, 1) with Cov(z₁, z₂) = ρ. By symmetry, f(z₁, z₂) = f(−z₁, −z₂) = f(z′₁, z′₂). Moreover, dz′₂ dz′₁ = (−dz₂)(−dz₁) = dz₂ dz₁.
Thus

  Φ₂(−a, −b, ρ) = ∫_{−∞}^{−a} ∫_{−∞}^{−b} f(z′₁, z′₂) dz′₂ dz′₁ = ∫_a^∞ ∫_b^∞ f(z₁, z₂) dz₂ dz₁.

Recall that f(z₁, z₂) = f(z₁ | z₂) f(z₂). Then we can apply Lemma D.16 to get that

  ∫_a^∞ ∫_b^∞ f(z₁, z₂) dz₂ dz₁ = ∫_b^∞ (∫_a^∞ f(z₁ | z₂) dz₁) f(z₂) dz₂ = ∫_b^∞ Φ₁((ρz₂ − a)/√(1−ρ²)) f(z₂) dz₂.

Lemma D.18. Let z ∼ N(µ, κ²), and let a ∈ ℝ. Then we have

  E[z I{z ≥ a}] = (κ/√(2π)) exp(−(µ−a)²/(2κ²)) + µ Φ₁((µ−a)/κ).

Proof. Define ẑ = (z−µ)/κ. Then we have that ẑ ∼ N(0, 1). Since z = κẑ + µ, we have

  E[z I{z ≥ a}] = E[(κẑ + µ) I{ẑ ≥ (a−µ)/κ}] = κ E[ẑ I{ẑ ≥ (a−µ)/κ}] + µ E[I{ẑ ≥ (a−µ)/κ}].

Notice that E[I{ẑ ≥ (a−µ)/κ}] = Pr(ẑ ≥ (a−µ)/κ) = Φ₁((µ−a)/κ). Moreover,

  E[ẑ I{ẑ ≥ (a−µ)/κ}] = (1/√(2π)) ∫_{(a−µ)/κ}^∞ z exp(−z²/2) dz = (1/√(2π)) exp(−(a−µ)²/(2κ²)).

Therefore

  E[z I{z ≥ a}] = (κ/√(2π)) exp(−(a−µ)²/(2κ²)) + µ Φ₁((µ−a)/κ).

Lemma D.19. Let x ∈ [−1, 1]. Then we have that |arcsin x| ≤ (π/2) · |x|.

Proof. The second derivative of arcsin is x/(1−x²)^{3/2} ≥ 0 for x ∈ [0, 1), so arcsin is convex on [0, 1) and, by continuity, on [0, 1]. Writing x = (1−x) · 0 + x · 1 and using convexity together with arcsin 0 = 0 and arcsin 1 = π/2 gives, for all x ∈ [0, 1],

  arcsin x ≤ (1−x) arcsin 0 + x arcsin 1 = (π/2) · x.

Since arcsin is odd, |arcsin x| = arcsin |x| ≤ (π/2) · |x| for all x ∈ [−1, 1]. This completes the proof.

Lemma D.20. Let e ∈ ℝⁿ. Then ∑_{i=1}^n |e_i| ≤ √n (∑_{i=1}^n e_i²)^{1/2}.

Proof. Let u := (|e₁|, ..., |e_n|) ∈ ℝⁿ and 1 := (1, ..., 1) ∈ ℝⁿ, so that ∑_{i=1}^n |e_i| = u⊤1. By the Cauchy–Schwarz inequality, u⊤1 ≤ ‖u‖₂ ‖1‖₂. Moreover,

  ‖u‖₂ = (∑_{i=1}^n |e_i|²)^{1/2} = (∑_{i=1}^n e_i²)^{1/2},   ‖1‖₂ = (∑_{i=1}^n 1²)^{1/2} = √n.

Combining the above gives the claim.

Lemma D.21. Assume that Assumption 3.1 holds and that the bound of Lemma D.22 holds for every mask cᵢ. Then we have:

  ‖∇_{w_r} L_C(W)‖₂² ≤ 2C² · (n/m) · L_C(W),   where C = 1 + κ√(2 log(2d/δ)).

Proof. By the form of ∇_{w_r} L_C(W) in (2), we have:

  ‖∇_{w_r} L_C(W)‖₂ = ‖(a_r/√m) ∑_{i=1}^n (f(W, xᵢ ⊙ cᵢ) − yᵢ)(xᵢ ⊙ cᵢ) I{w_r⊤(xᵢ ⊙ cᵢ) ≥ 0}‖₂
    ≤ (|a_r|/√m) ∑_{i=1}^n |f(W, xᵢ ⊙ cᵢ) − yᵢ| · ‖xᵢ ⊙ cᵢ‖₂   (triangle inequality; the indicator is at most 1)
    ≤ (1/√m) ∑_{i=1}^n |f(W, xᵢ ⊙ cᵢ) − yᵢ| · ‖cᵢ‖∞ ‖xᵢ‖₂   (a_r = ±1)
    ≤ (C/√m) ∑_{i=1}^n |f(W, xᵢ ⊙ cᵢ) − yᵢ|   (Lemma D.22 gives ‖cᵢ‖∞ ≤ C, and ‖xᵢ‖₂ ≤ 1)
    ≤ (C√n/√m) (∑_{i=1}^n (f(W, xᵢ ⊙ cᵢ) − yᵢ)²)^{1/2}   (Lemma D.20)
    = (C√n/√m) (2 L_C(W))^{1/2}.

Squaring both sides gives ‖∇_{w_r} L_C(W)‖₂² ≤ 2C² · (n/m) · L_C(W).

Lemma D.22 (High-probability ℓ∞ bound for Gaussian masks). Fix δ ∈ (0, 1). Let cᵢ ∈ ℝᵈ be a Gaussian mask with independent coordinates, cᵢ ∼ N(1, κ²I_d); that is, c_{i,j} = 1 + κg_{i,j}, where g_{i,j} i.i.d. ∼ N(0, 1).
Then, with probability at least 1 − δ, we have

  ‖cᵢ‖∞ ≤ 1 + κ√(2 log(2d/δ)).

Proof. Since cᵢ ∼ N(1, κ²I_d) with independent coordinates, each coordinate can be written as c_{i,j} = 1 + κg_{i,j}, where g_{i,j} ∼ N(0, 1) i.i.d. Hence

  ‖cᵢ‖∞ = max_{j ∈ [d]} |c_{i,j}| = max_{j ∈ [d]} |1 + κg_{i,j}|,

and by the triangle inequality, |c_{i,j}| = |1 + (c_{i,j} − 1)| ≤ 1 + |c_{i,j} − 1| = 1 + |κg_{i,j}|. We will first bound max_j |c_{i,j} − 1| = κ max_j |g_{i,j}|, and then convert this into a bound on ‖cᵢ‖∞.

For brevity, write g ∼ N(0, 1). We are going to show that

  Pr(|g| ≥ t) ≤ 2e^{−t²/2}.  (79)

By symmetry of the standard normal distribution, Pr(|g| ≥ t) = Pr(g ≥ t) + Pr(g ≤ −t) = 2 Pr(g ≥ t), so it suffices to upper bound Pr(g ≥ t). For any λ > 0, since the exponential is monotone increasing, g ≥ t implies e^{λg} ≥ e^{λt}, and by Markov's inequality,

  Pr(g ≥ t) = Pr(e^{λg} ≥ e^{λt}) ≤ E[e^{λg}] e^{−λt}.  (80)

To compute the moment generating function E[e^{λg}], use the standard normal density φ(x) = (1/√(2π)) e^{−x²/2} and complete the square in the exponent:

  λx − x²/2 = λ²/2 − (x−λ)²/2,

so that

  E[e^{λg}] = (1/√(2π)) ∫_{−∞}^∞ exp(λx − x²/2) dx = e^{λ²/2} · (1/√(2π)) ∫_{−∞}^∞ exp(−(x−λ)²/2) dx = e^{λ²/2},

where the last step uses the change of variable u = x − λ and ∫_{−∞}^∞ e^{−u²/2} du = √(2π). Substituting into (80) gives

  Pr(g ≥ t) ≤ exp(λ²/2 − λt)   for all λ > 0.

The right-hand side is a valid bound for every λ > 0, so we choose λ to make it as small as possible. Define f(λ) := λ²/2 − λt. Then f′(λ) = λ − t, so the unique minimizer is λ = t (and f″(λ) = 1 > 0 confirms it is a minimum). Plugging in λ = t gives Pr(g ≥ t) ≤ e^{−t²/2}, and Pr(|g| ≥ t) = 2 Pr(g ≥ t) proves (79).

Now, fix a coordinate j ∈ [d]. Since c_{i,j} − 1 = κg_{i,j} with g_{i,j} ∼ N(0, 1), applying (79) with t = u/κ gives, for any u ≥ 0,

  Pr(|c_{i,j} − 1| ≥ u) ≤ 2 exp(−u²/(2κ²)).

Define the event Eᵢ(u) := {max_{j ∈ [d]} |c_{i,j} − 1| ≤ u}. Its complement is the event that at least one coordinate deviates by more than u, so by the union bound,

  Pr(Eᵢ(u)ᶜ) ≤ ∑_{j=1}^d Pr(|c_{i,j} − 1| > u) ≤ 2d exp(−u²/(2κ²)).

Choosing u so that the right-hand side is at most δ,

  2d exp(−u²/(2κ²)) ≤ δ  ⟺  u ≥ κ√(2 log(2d/δ)),

and setting u := κ√(2 log(2d/δ)) gives Pr(Eᵢ(u)) ≥ 1 − δ. On Eᵢ(u), for each coordinate j, |c_{i,j}| ≤ 1 + |c_{i,j} − 1| ≤ 1 + u. Taking the maximum over j yields, with probability at least 1 − δ,

  ‖cᵢ‖∞ ≤ 1 + κ√(2 log(2d/δ)).
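The closed forms above lend themselves to quick numerical sanity checks. The sketch below, which is ours and not part of the paper's analysis, verifies Lemma D.18 by Monte Carlo and confirms empirically that the ℓ∞ bound of Lemma D.22 fails with frequency at most δ; the parameter values are illustrative:

```python
import math
import random

def Phi(x: float) -> float:
    # Standard normal CDF (Phi_1 in the paper's notation).
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

rng = random.Random(0)  # fixed seed for reproducibility

# Lemma D.18: E[z * 1{z >= a}] for z ~ N(mu, kappa^2), checked by Monte Carlo.
mu, kappa, a = 0.4, 0.9, 0.2
closed = (kappa / math.sqrt(2.0 * math.pi)
          * math.exp(-((mu - a) ** 2) / (2.0 * kappa ** 2))
          + mu * Phi((mu - a) / kappa))
n_samples = 400_000
total = 0.0
for _ in range(n_samples):
    z = mu + kappa * rng.gauss(0.0, 1.0)
    if z >= a:
        total += z
est = total / n_samples
assert abs(est - closed) < 0.02  # Monte Carlo standard error is ~0.002 here

# Lemma D.22: P(||c_i||_inf > 1 + kappa * sqrt(2 log(2d/delta))) <= delta.
d, kap, delta = 50, 0.3, 0.05
u = kap * math.sqrt(2.0 * math.log(2.0 * d / delta))
trials = 2000
violations = sum(
    max(abs(1.0 + kap * rng.gauss(0.0, 1.0)) for _ in range(d)) > 1.0 + u
    for _ in range(trials)
)
assert violations / trials <= delta
```

Because the tail bound in Lemma D.22 combines a two-sided Chernoff bound with a union bound, the observed violation rate is typically far below δ.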
