Convergence Analysis of Two-Layer Neural Networks under Gaussian Input Masking


Authors: Afroditi Kolomvaki, Fangshuo Liao, Evan Dramko, Ziyun Guang, Anastasios Kyrillidis

Afroditi Kolomvaki (ak203@rice.edu), Fangshuo Liao (fangshuo.liao@rice.edu), Evan Dramko (ed55@rice.edu), Ziyun Guang (cg105@rice.edu), Anastasios Kyrillidis (anastasios@rice.edu)
Computer Science Dept., Rice University, Houston, TX, USA

Abstract

We investigate the convergence guarantee of two-layer neural network training with Gaussian randomly masked inputs. This scenario corresponds to Gaussian dropout at the input level, or to the noisy-input training common in sensor networks, privacy-preserving training, and federated learning, where each user may have access to only partial or corrupted features. Using a Neural Tangent Kernel (NTK) analysis, we demonstrate that training a two-layer ReLU network with Gaussian randomly masked inputs achieves linear convergence up to an error region proportional to the mask's variance. A key technical contribution is resolving the randomness within the non-linear activation, a problem of independent interest.

1 Introduction

Neural networks (NNs) have revolutionized AI applications, where their success largely stems from their ability to learn complex patterns when trained on well-curated datasets (Schuhmann et al., 2022; Li et al., 2023b; Gunasekar et al., 2023; Edwards, 2024). A key component of this success is the ability of NNs to model a broad range of tasks and data distributions under various scenarios. Empirical evidence suggests that neural networks can learn even under noisy inputs (Kariotakis et al., 2024), gradient noise (Ruder, 2017), as well as modifications to the internal representations during training (Srivastava et al., 2014; Yuan et al., 2022).
Leveraging this ability of neural networks, many real-world deployments adopt a modification of the data representations during training to achieve particular goals such as robustness, privacy, or efficiency. Among these methods, perturbing the representations with additive noise has been studied by a number of prior works (Gao et al., 2019; Li et al., 2025; 2023a; Madry et al., 2018; Loo et al., 2022; Tsilivis and Kempe, 2022; Ilyas et al., 2019), showcasing both the benefit of such perturbation and the stable convergence of training under this setting. Compared with additive noise, perturbing the representations by multiplying them with a mask has rarely been studied theoretically. Multiplicative noise on the representations appears in many real-world settings, either by design or unintentionally. For instance, in federated learning (FL) settings (McMahan et al., 2017; Kairouz et al., 2021), particularly vertical FL (Cheng et al., 2020; Liu et al., 2021; Romanini et al., 2021; He et al., 2020; Liu et al., 2022; 2024), different features of the input data may be available to different parties, effectively creating a form of sparsity-inducing multiplicative masking on the input space. Moreover, the dropout family (Srivastava et al., 2014; Rey and Mnih, 2021) is a class of methods that prevent overfitting and improve the generalization ability of neural networks during training. Lastly, training models under a data-parallel protocol over a wireless channel incurs a channel effect that blurs the data passed to the workers through a multiplication (Tse and Viswanath, 2005). Theoretically analyzing the training dynamics of neural networks under these settings is difficult, especially when the introduced randomness is intertwined with the nonlinearity of the activation function.
While previous work has studied the convergence of neural network training under dropout (Liao and Kyrillidis, 2022; Mianjy and Arora, 2020), it typically assumes that the dropout happens after the nonlinear activations are applied. From a technical perspective, the statistics of the neural network outputs are then easier to handle, since the randomness is not affected by the nonlinearity. In this paper, we take a step further toward understanding multiplicative perturbations in neural network training by considering noise applied before the nonlinear activation. In particular, the setting we consider is the training of a two-layer MLP where the input bears a multiplicative Gaussian mask. This prototype provides a simplified scenario for studying the noise-inside-activation difficulty, while generalizing various training scenarios ranging from input masking (Kariotakis et al., 2024) to Gaussian dropout (Rey and Mnih, 2021), if one views the input in our setting as fixed embeddings from previous layers of a deep neural network. Under this setting, we aim to answer the following question:

How do multiplicative perturbations at the input level propagate through the network and affect the training dynamics?

Our Contributions. Analyzing the training dynamics under Gaussian masks over the input means that we have to study the statistical properties of random variables inside a non-linear function. Our work takes a step towards resolving this technical difficulty. Moreover, we utilize an NTK-based analysis (Du et al., 2018; Song and Yang, 2020; Oymak and Soltanolkotabi, 2019; Liao and Kyrillidis, 2022) to study the training convergence of the two-layer MLP under sufficient overparameterization. To our knowledge, this work provides the first convergence analysis for neural network training under Gaussian multiplicative input masking.
Specifically, for inputs $x$ masked as $x \odot c$ where $c \sim \mathcal{N}(\mathbf{1}, \kappa^2 I)$, we prove that: i) the expected loss decomposes into a smoothed neural network loss plus an adaptive regularization term; ii) training achieves linear convergence to an error ball of radius $O(\kappa)$. Empirical results showcase and support our theory. Our main contributions are summarized as follows:

• Theoretical Analysis of Input Masking. We provide a rigorous characterization of the training dynamics of two-layer ReLU networks where noise is injected before the non-linear activation. We overcome the technical challenge of resolving the expectation of non-linear functions of random variables, proving that the expected loss decomposes into a smoothed objective plus an adaptive, data-dependent regularizer.

• General Stochastic Training Framework. We develop a general convergence theorem for overparameterized neural networks trained with biased stochastic gradient estimators. This result, which establishes linear convergence to a noise-dependent error ball, is of independent interest beyond the specific setting of Gaussian masking.

• Explicit Convergence Guarantees. We derive constructive bounds for the convergence rate and the final error radius. We show explicitly how these quantities depend on the mask variance $\kappa^2$, the network width $m$, and the initialization scale, demonstrating that training converges linearly up to a floor determined by the noise level.

• Empirical Validation and Privacy Utility. We confirm our theoretical predictions regarding the expected gradient and loss landscape through simulations.
Furthermore, we demonstrate the practical utility of this training regime as a defense against Membership Inference Attacks (MIA), highlighting a favorable trade-off between privacy and utility.

2 Related Work

Neural Network Robustness. The study of neural network robustness has a rich history, with early work focusing primarily on additive perturbations. Results such as (Bartlett et al., 2017) and (Miyato et al., 2018) established generalization bounds for neural networks under adversarial perturbations, showing that the network's Lipschitz constant plays a crucial role in determining robustness. Subsequent work by (Cohen et al., 2019) introduced randomized smoothing techniques for certified robustness against $\ell_2$ perturbations, while (Wong et al., 2018) developed methods for training provably robust deep neural networks. Regularization techniques have emerged as powerful tools for enhancing network robustness. Dropout (Srivastava et al., 2014) pioneered the idea of randomly masking internal neurons during training, effectively creating an implicit ensemble of subnetworks (Yuan et al., 2022; Hu et al., 2023; Kariotakis et al., 2024; Wolfe et al., 2023; Liao and Kyrillidis, 2022; Dun et al., 2023; 2022). The connection between feature masking and regularization was further explored in (Ghorbani et al., 2021), who showed that dropout can be interpreted as a form of data-dependent regularization. Note that sparsity-inducing norms, based on the continuous Laplacian distribution, have a long history in sparse recovery problems (Bach et al., 2011; Jenatton et al., 2011; Bach et al., 2012; Kyrillidis et al., 2015). Empirical studies on the effect of sparsity, represented by multiplicative Bernoulli distributions, can be found in (Kariotakis et al., 2024).

Neural Tangent Kernel (NTK). Jacot et al. (2018) discovered that an infinite-width neural network evolves as a Gaussian process with a stable kernel computed from the outer product of the tangent features of the network. Later works adopted finite-width corrections and applied the framework to the analysis of neural network convergence (Du et al., 2018; 2019b; Oymak and Soltanolkotabi, 2019). The NTK framework is one of the few theoretical tools focused on a theoretical understanding of neural network training. Later works extended the proofs to classification tasks, where the tangent features are viewed as a feature mapping onto a space where the training data are separable (Ji and Telgarsky, 2020). Although the NTK regime has been characterized as "lazy training" that prevents useful features from being learned, it enables exact analysis of neural network training dynamics under various scenarios and for different architectures (Nguyen, 2021; Du et al., 2019a; Truong, 2025; Wu et al., 2023). Based on the NTK framework, several papers study the convergence of training shallow neural networks under random neuron masking (e.g., dropout (Srivastava et al., 2014)) (Liao and Kyrillidis, 2022; Mianjy and Arora, 2020). However, these works usually consider the masking applied after the nonlinearity, which allows direct computation of the statistics of the output and gradient under the randomness.

3 Problem Setup

Given a dataset $\{(x_i, y_i)\}_{i=1}^n$, we are interested in training a neural network $f(\theta, \cdot)$ that maps each input $x_i \in \mathbb{R}^d$ to an output $f(\theta, x_i)$ that fits the label $y_i \in \mathbb{R}$. We consider $f(\theta, \cdot)$ to be a two-layer ReLU-activated multi-layer perceptron (MLP) under the NTK scaling:

$$f(\theta, x) = \frac{1}{\sqrt{m}} \sum_{r=1}^m a_r \, \sigma\!\left(w_r^\top x\right),$$

where $\theta = (\{w_r\}_{r=1}^m, \{a_r\}_{r=1}^m)$ denotes the neural network parameters, and $\sigma(\cdot) = \max\{0, \cdot\}$ denotes the ReLU activation function.
We assume that the second-layer weights $a_r \in \{\pm 1\}$ are fixed, and only the first-layer weights $w_r$ are trainable. Thus, we write $f(W, x) \equiv f(\theta, x)$ with $W \in \mathbb{R}^{m \times d}$, unless otherwise stated. This neural network setup has been studied widely in previous works (Du et al., 2018). We consider training the neural network by minimizing the MSE loss $L(W)$ over the dataset $\{(x_i, y_i)\}_{i=1}^n$:

$$L(W) = \frac{1}{2} \sum_{i=1}^n \left(f(W, x_i) - y_i\right)^2.$$

Due to the ReLU activation, the loss is both non-convex and non-smooth. However, a line of previous works (Du et al., 2018; Song and Yang, 2020; Oymak and Soltanolkotabi, 2019) proves a linear convergence rate of the loss under the assumption that the number of hidden neurons is sufficiently large, by adopting an NTK-based analysis (Jacot et al., 2018). While there have been theoretical approaches and assumptions that go beyond the NTK regime, our focus is on a generalized scenario where the input data may be corrupted in each iteration by multiplicative Gaussian noise: letting $c \sim \mathcal{N}(\mathbf{1}_d, \kappa^2 I_d)$ be an isotropic Gaussian random vector centered at the all-ones vector $\mathbf{1}_d$, the neural network output is given by $f(W, x \odot c)$, where $\odot$ denotes the Hadamard (element-wise) product between two vectors. Under the multiplicative noise, the neural network is trained with gradient descent, where each gradient is computed from the surrogate loss $L_C(W)$ defined over the network with masked inputs:

$$L_C(W) = \frac{1}{2} \sum_{i=1}^n \left(f(W, x_i \odot c_i) - y_i\right)^2.$$

Here $C = \{c_i\}_{i=1}^n$ denotes the collection of masks for all inputs $x_i$, and we assume the $c_i$ are independent. In real-world applications, this scheme can be viewed as training on imprecise hardware, where each input data point is read in with noise.
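In code, the setup so far can be sketched as follows (a minimal NumPy illustration; the function names and toy data are ours, not from the paper):

```python
import numpy as np

def forward(W, a, x):
    """Two-layer ReLU MLP under NTK scaling: f(W, x) = (1/sqrt(m)) * sum_r a_r * relu(w_r^T x)."""
    return float(a @ np.maximum(W @ x, 0.0)) / np.sqrt(W.shape[0])

def mse_loss(W, a, X, y):
    """Original loss L(W) = (1/2) * sum_i (f(W, x_i) - y_i)^2."""
    preds = np.array([forward(W, a, x) for x in X])
    return 0.5 * float(np.sum((preds - y) ** 2))

def surrogate_loss(W, a, X, y, kappa, rng):
    """Surrogate loss L_C(W): each input is masked element-wise by c_i ~ N(1, kappa^2 I)."""
    C = 1.0 + kappa * rng.standard_normal(X.shape)  # one independent mask per input
    preds = np.array([forward(W, a, x * c) for x, c in zip(X, C)])
    return 0.5 * float(np.sum((preds - y) ** 2))

rng = np.random.default_rng(0)
n, d, m = 8, 5, 64
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # enforce ||x_i||_2 <= 1
y = rng.standard_normal(n)
W = rng.standard_normal((m, d))
a = rng.choice([-1.0, 1.0], size=m)
print(mse_loss(W, a, X, y), surrogate_loss(W, a, X, y, kappa=0.1, rng=rng))
```

Note that setting $\kappa = 0$ makes every mask the all-ones vector, so the surrogate loss then coincides with the original loss.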
Alternatively, one could view each $x_i$ as the output of a pre-trained large model, in which case our training scheme can be viewed as fine-tuning the last two layers with Gaussian dropout (Wang and Manning, 2013; Kingma et al., 2015; Rey and Mnih, 2021) in the intermediate layer. Let $\{W_k\}_{k=1}^K$ be generated by the stochastic gradient descent

$$W_{k+1} = W_k - \eta \nabla_W L_{C_k}(W_k), \qquad (1)$$

where $C_k$ is sampled independently in every iteration. Our goal is to study the convergence of the loss sequence $\{L(W_k)\}_{k=1}^\infty$. Notice that the loss involved in the weight update is the surrogate loss $L_{C_k}(W_k)$, but the loss for which we aim to show convergence is the original loss $L(W)$. Our setup differs from previous works in two ways. First, it is distinct from the unbiased estimators in the current literature: our gradient estimator does not enjoy such a favorable property, since the randomness is applied at the input level of the network. Second, although a line of work analyzes the convergence of vanilla dropout training on two-layer neural networks (Liao and Kyrillidis, 2022; Mianjy and Arora, 2020), in their analysis the mask is applied to the hidden neurons after the activation function. In contrast, our mask is applied directly to the input, which sits inside the non-linear function. Therefore, any analysis of the mask randomness must go through the ReLU function, which introduces technical difficulty. We assume the following property of the training data.

Assumption 3.1. The training dataset $\{(x_i, y_i)\}_{i=1}^n$ satisfies $\|x_i\|_2 \le 1$, $|y_i| \le O(1)$, and for any pair $i \ne j$, there exists no real number $q$ such that $x_i = q \cdot x_j$.

This assumption guarantees the boundedness of the dataset and that the input data are non-degenerate, which is a standard assumption in Du et al. (2018); Song and Yang (2020); Liao and Kyrillidis (2022).
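Assumption 3.1 is easy to check for a given dataset; a small sketch (ours), testing the norm bound and pairwise non-collinearity:

```python
import numpy as np

def check_assumption_3_1(X, tol=1e-12):
    """Check the input conditions of Assumption 3.1: ||x_i||_2 <= 1 and no pair x_i = q * x_j."""
    norms = np.linalg.norm(X, axis=1)
    if np.any(norms > 1.0 + tol) or np.any(norms == 0.0):
        return False
    Xn = X / norms[:, None]
    cos = np.abs(Xn @ Xn.T)
    np.fill_diagonal(cos, 0.0)
    return bool(cos.max() < 1.0 - tol)  # |cos| = 1 would mean two collinear inputs

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)            # normalize so ||x_i||_2 = 1
print(check_assumption_3_1(X))                           # True for generic random data
print(check_assumption_3_1(np.vstack([X, 0.5 * X[0]])))  # a rescaled copy of x_0 violates it
```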
4 Expectation of the Loss and Gradient under the Gaussian Mask

A formal mathematical characterization of the expected loss and gradient is essential not only in the prior literature on neural network training convergence (Liao and Kyrillidis, 2022; Mianjy and Arora, 2020) but also in the classical analysis of SGD, even in the convex domain (Shamir and Zhang, 2013; Garrigos and Gower, 2023; Tang et al., 2013). In this section, we focus on deriving the explicit form of the expected surrogate loss $\mathbb{E}_C[L_C(W)]$ and the expected surrogate gradient $\mathbb{E}_C[\nabla_W L_C(W)]$. Starting with gradient calculations, the surrogate gradient with respect to the $r$-th neuron can be written as:

$$\nabla_{w_r} L_C(W) = \frac{a_r}{\sqrt{m}} \sum_{i=1}^n \left(f(W, x_i \odot c_i) - y_i\right)(x_i \odot c_i)\, \mathbb{I}\!\left\{w_r^\top (x_i \odot c_i) \ge 0\right\}. \qquad (2)$$

Setting $c_i = \mathbf{1}$ for all $i \in [n]$ gives the gradient of the original loss $\nabla_{w_r} L(W)$. Let $\Phi_1(\cdot)$ denote the CDF of the standard (one-dimensional) Gaussian random variable, and let $\phi, \psi : \mathbb{R} \to \mathbb{R}$ be defined as:

$$\phi(x) = \exp\left(-x^2\right); \qquad \psi(x) = |x| \cdot \phi(x). \qquad (3)$$

Figure 1: (a) Effect of the noise standard deviation $\kappa$ on the shape of the smoothed activation function $\hat{\sigma}(z; \kappa) = z \cdot \Phi_1(z / (\kappa \|w \odot x\|_2))$, where $z = w^\top x$. For this visualization, $\|w \odot x\|_2$ is held constant at $1.0$. As $\kappa$ increases, the activation becomes progressively smoother compared to the standard ReLU (dotted black line); for small $\kappa$ (e.g., $\kappa = 0.01$), $\hat{\sigma}$ closely approximates the standard ReLU. (b) Theoretical smoothed activation $\hat{\sigma}(w, x)$ versus its empirical estimate $\mathbb{E}_c[\sigma(w^\top (x \odot c))]$ for a fixed pre-activation value $w^\top x \approx 0.77$ (the actual value depends on the fixed $w, x$) as the noise standard deviation $\kappa$ varies. The close match across a range of relatively small $\kappa$ values validates the theoretical model for $\hat{\sigma}$; this behavior holds consistently for different $w, x$ values.
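The surrogate gradient (2) can be checked numerically: with $c_i = \mathbf{1}$ it must reduce to the gradient of the original loss $L(W)$. A sketch (ours) comparing one coordinate against a finite difference:

```python
import numpy as np

def loss(W, a, X, y):
    """L(W) = (1/2) * sum_i (f(W, x_i) - y_i)^2 for the two-layer ReLU MLP."""
    m = W.shape[0]
    return 0.5 * float(np.sum((np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m) - y) ** 2))

def surrogate_grad(W, a, X, y, C):
    """Row-stack of the per-neuron surrogate gradients (2):
    grad_{w_r} = (a_r/sqrt(m)) * sum_i (f(W, x_i*c_i) - y_i) * (x_i*c_i) * 1{w_r^T(x_i*c_i) >= 0}."""
    m = W.shape[0]
    Xm = X * C                                     # masked inputs x_i ⊙ c_i
    pre = Xm @ W.T                                 # (n, m) pre-activations
    res = np.maximum(pre, 0.0) @ a / np.sqrt(m) - y
    return (res[:, None] * (pre > 0) * a[None, :] / np.sqrt(m)).T @ Xm

rng = np.random.default_rng(0)
n, d, m = 6, 4, 16
X = rng.standard_normal((n, d)); y = rng.standard_normal(n)
W = rng.standard_normal((m, d)); a = rng.choice([-1.0, 1.0], size=m)

# With c_i = 1, (2) is the gradient of the original loss; compare one coordinate
# against a forward finite difference of L.
G = surrogate_grad(W, a, X, y, np.ones((n, d)))
eps = 1e-6
Wp = W.copy(); Wp[0, 0] += eps
fd = (loss(Wp, a, X, y) - loss(W, a, X, y)) / eps
print(G[0, 0], fd)  # the two values should agree up to finite-difference error
```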
Observe that $\phi(x) \in (0, 1]$ and $\psi(x) \in (0, 1/\sqrt{2e}]$. Before stating the results in this section, we define the following quantities.

Definition 4.1. Fix a first-layer weight $W \in \mathbb{R}^{m \times d}$ and training data $\{(x_i, y_i)\}_{i=1}^n$. We define:

• Data-related quantities: $B_x = \max_{i \in [n]} \|x_i\|_\infty$, $B_y := \max_{i \in [n]} |y_i|$.
• Weight-related quantity (row-wise): $R_w = \max_{r \in [m]} \|w_r\|_2$.
• Mixed quantities: $R_u := \max_{r \in [m], i \in [n]} \|w_r \odot x_i\|_2$ and

$$\psi_{\max} = \max_{r \in [m], i \in [n]} \psi\!\left(\frac{w_r^\top x_i}{2\kappa \|w_r \odot x_i\|_2}\right), \qquad \phi_{\max} = \max_{r \in [m], i \in [n]} \phi\!\left(\frac{w_r^\top x_i}{2\kappa \|w_r \odot x_i\|_2}\right).$$

Expected Surrogate Loss. To start, we focus on the expected loss under the Gaussian input mask. For a fixed neural network $f(W, \cdot)$, we have the following result.

Theorem 4.2. Let $u_{i,r} = w_r \odot x_i$. Define the smoothed activation and neural network as:

$$\hat{\sigma}_\kappa(w, x) = w^\top x \cdot \Phi_1\!\left(\frac{w^\top x}{\kappa \|w \odot x\|_2}\right), \qquad \hat{f}(W, x) = \frac{1}{\sqrt{m}} \sum_{r=1}^m a_r \hat{\sigma}_\kappa(w_r, x).$$

Let $\phi_{\max}$, $\psi_{\max}$, $B_y$, $R_u$, and $R_w$ be defined as in Definition 4.1. If $B_y \le 3\sqrt{m} R_w$, then we have that:

$$\mathbb{E}_C[L_C(W)] = \mathcal{E} + \underbrace{\frac{1}{2} \sum_{i=1}^n \left(\hat{f}(W, x_i) - y_i\right)^2}_{T_1} + \underbrace{\frac{\kappa^2}{2m} \sum_{i=1}^n \left\| \sum_{r=1}^m a_r u_{i,r} \Phi_1\!\left(\frac{w_r^\top x_i}{\kappa \|u_{i,r}\|_2}\right) \right\|_2^2}_{T_2},$$

with the magnitude of $\mathcal{E}$ bounded by:

$$|\mathcal{E}| \le mn\left(\kappa^2 R_u^2 \psi_{\max}^2 + \left(\kappa^2 R_u^2 + \kappa R_w\right)\phi_{\max}^2\right). \qquad (4)$$

Remark 4.3. The core of our analysis of the expected loss involves understanding how the ReLU activation behaves under the multiplicative Gaussian input mask. Lemma D.14 in the appendix provides the analytical form for the expectation of a truncated Gaussian random variable. This leads to the definition of the smoothed activation function presented in Theorem 4.2.

Figure 2: Smoothed ReLU under multiplicative Gaussian input masking for fixed $\kappa = 0.2$, where $z = w^\top x$ and $\sigma = \kappa \|w \odot x\|_2$. (a) The exact closed-form expectation $\tilde{\sigma}(w, x) = \mathbb{E}_c[\sigma(w^\top (x \odot c))] = z\Phi(z/\sigma) + \sigma\varphi(z/\sigma)$ (as shown in (19)) matches the Monte Carlo estimate. (b) The proxy smoothed activation $\hat{\sigma}(w, x) = z\Phi(z/\sigma)$ (used in Theorem 4.2) differs mainly near $z \approx 0$ due to the missing $\sigma\varphi(z/\sigma)$ term.

Figure 1b demonstrates the correspondence between the theoretical and empirical values of this smoothed activation across a range of noise levels $\kappa$ for a fixed input $w^\top x$: being an approximation of ReLU, the two curves are expected to deviate as $\kappa$ increases, yet for small enough $\kappa$ (here, $\kappa \lesssim 0.2$) the two curves coincide. Figure 2b visually compares this theoretical smoothed activation $\hat{\sigma}$ with its empirical estimate $\mathbb{E}_c[\sigma(w^\top (x \odot c))]$ for a fixed $\kappa = 0.2$ as the input $w^\top x$ varies. The close agreement validates our analytical derivation of $\hat{\sigma}$ and illustrates its smoothing effect compared to the standard ReLU. Thus, the term $\Phi_1\!\left(\frac{w^\top x}{\kappa \|w \odot x\|_2}\right)$ can be interpreted as a smoothed version of the indicator function $\mathbb{I}\{w^\top x \ge 0\}$. To visualize the impact of the noise variance $\kappa^2$ on the shape of this smoothed activation, see Figure 1a for various values of $w, x$ (for this illustration, we fix $\|w \odot x\|_2 = 1.0$ to isolate the effect of $z = w^\top x$ and $\kappa$). As $\kappa$ increases, the transition of $\hat{\sigma}$ around the origin becomes progressively gentler compared to the sharp kink of the standard ReLU activation.

Remark 4.4. Theorem 4.2 shows that the expected loss can be approximated by the combination of the terms $T_1$ and $T_2$, with an additive error term $\mathcal{E}$.
Notice that the smoothed activation $\hat{\sigma}_\kappa(w, x)$ satisfies (see (19)):

$$\hat{\sigma}_\kappa(w, x) = w^\top x \cdot \Phi_1\!\left(\frac{w^\top x}{\kappa \|w \odot x\|_2}\right) = \mathbb{E}_{c \sim \mathcal{N}(\mathbf{1}, \kappa^2 I)}\left[\sigma\!\left(w^\top (x \odot c)\right)\right] \pm O\!\left(\kappa \|w \odot x\|_2\, \phi\!\left(\frac{w^\top x}{\kappa \|w \odot x\|_2}\right)\right).$$

Therefore, $T_1$ can be seen as the loss defined on the smoothed neural network $\hat{f}(W, \cdot)$ with the same weights and dataset.

Remark 4.5. One may notice that the form of $T_2$ is similar to the $\ell_2$ regularization in ridge regression. To understand $T_2$, first notice that:

$$\nabla_{w_r} \hat{f}(W, x_i) \approx \frac{1}{\sqrt{m}}\, a_r x_i\, \Phi_1\!\left(\frac{w_r^\top x_i}{\kappa \|u_{i,r}\|_2}\right).$$

Therefore, $T_2$ can approximately be written as:

$$T_2 \approx \frac{\kappa^2}{2} \sum_{i=1}^n \left\| \sum_{r=1}^m \nabla_{w_r} \hat{f}(W, x_i) \odot w_r \right\|_2^2 = \frac{\kappa^2}{2} \sum_{i=1}^n \sum_{j=1}^d \left( \nabla_{\hat{w}_j} f(W, x_i)^\top \hat{w}_j \right)^2 = \mathrm{vec}(W)^\top \hat{H}\, \mathrm{vec}(W).$$

Here $\hat{w}_j$ is the $j$-th row of the matrix $[w_1, \ldots, w_m] \in \mathbb{R}^{d \times m}$, $\mathrm{vec}(W) = \mathrm{concat}(\hat{w}_1, \ldots, \hat{w}_d)$ is the concatenation of the $\hat{w}_j$'s, and $\hat{H} \in \mathbb{R}^{md \times md}$ is the block-diagonal matrix whose $j$-th diagonal block is $\hat{H}_j := \sum_{i=1}^n \nabla_{\hat{w}_j} f(W, x_i) \nabla_{\hat{w}_j} f(W, x_i)^\top \in \mathbb{R}^{m \times m}$ for $j \in [d]$. Intuitively, $\hat{H}$ can be seen as a matrix consisting of outer products of the tangent features (Baratin et al., 2021; LeJeune and Alemohammad, 2024). As a result, $T_2$ can be seen as a regularization term on $W$ in the norm defined by the tangent-feature outer-product matrix $\hat{H}$.

Remark 4.6. The magnitude of $\mathcal{E}$ is given in (4). At first glance, one sees that the term decreases monotonically as $\kappa$ decreases, implying a smaller error when $\kappa$ is small. As discussed at the beginning of this section, $\phi_{\max}$ and $\psi_{\max}$ are upper bounded by constants. Therefore, in the worst case we have $|\mathcal{E}| \le O\left(mn\left(\kappa^2 R_u + \kappa R_w\right)\right)$, which scales linearly with $\kappa$. Below, we sketch the proof of Theorem 4.2; the full proof is deferred to Appendix B.2.

Proof sketch.
Our proof starts with the decomposition of the expected loss as:

$$\mathbb{E}_C[L_C(W)] = \frac{1}{2} \sum_{i=1}^n \mathbb{E}_C\left[\left(f(W, x_i \odot c_i)\right)^2\right] + \frac{1}{2} \sum_{i=1}^n y_i^2 - \sum_{i=1}^n y_i\, \mathbb{E}_C\left[f(W, x_i \odot c_i)\right].$$

It boils down to analyzing the terms $\mathbb{E}_C[(f(W, x_i \odot c_i))^2]$ and $\mathbb{E}_C[f(W, x_i \odot c_i)]$. Plugging in $f(W, x_i \odot c_i)$, it suffices to analyze the following expectations:

$$E_1 = \mathbb{E}_c\left[\sigma\!\left(w_r^\top (x \odot c)\right) \sigma\!\left(w_{r'}^\top (x \odot c)\right)\right]; \qquad E_2 = \mathbb{E}_c\left[\sigma\!\left(w_r^\top (x \odot c)\right)\right].$$

The trick in evaluating $E_1$ and $E_2$ is to notice that $w_r^\top (x \odot c) = c^\top (w_r \odot x)$. Since $c \sim \mathcal{N}(\mathbf{1}, \kappa^2 I)$, we must have $c^\top (w_r \odot x) \sim \mathcal{N}\!\left(w_r^\top x, \kappa^2 \|w_r \odot x\|_2^2\right)$. Therefore, defining $z_1 = c^\top (w_r \odot x)$ and $z_2 = c^\top (w_{r'} \odot x)$, the problem of evaluating $E_1$ and $E_2$ becomes computing:

$$E_1 = \mathbb{E}_{z_1, z_2}\left[z_1 z_2 \mathbb{I}\{z_1 \ge 0; z_2 \ge 0\}\right]; \qquad E_2 = \mathbb{E}_{z_1}\left[z_1 \mathbb{I}\{z_1 \ge 0\}\right].$$

Here $\mathrm{Cov}(z_1, z_2) = \kappa^2 (w_r \odot x_i)^\top (w_{r'} \odot x_i)$. To complete the proof, we prove the following two lemmas.

Lemma 4.7. Let $z_1 \sim \mathcal{N}(\mu_1, \kappa_1^2)$. Then, we have that:

$$\mathbb{E}\left[z_1 \mathbb{I}\{z_1 \ge 0\}\right] = \frac{\kappa_1}{\sqrt{2\pi}} \exp\left(-\frac{\mu_1^2}{2\kappa_1^2}\right) + \mu_1 \Phi_1\!\left(\frac{\mu_1}{\kappa_1}\right).$$

Lemma 4.8. Let $z_1 \sim \mathcal{N}(\mu_1, \kappa_1^2)$ and $z_2 \sim \mathcal{N}(\mu_2, \kappa_2^2)$, with $\mathrm{Cov}(z_1, z_2) = \kappa_1 \kappa_2 \rho$. Let $\Phi_2(a, b, \rho)$ denote the joint CDF of standard Gaussian random variables $\hat{z}_1, \hat{z}_2$ with covariance $\rho$, evaluated at $\hat{z}_1 = a, \hat{z}_2 = b$. Then, we have:

$$\mathbb{E}\left[z_1 z_2 \mathbb{I}\{z_1 \ge 0; z_2 \ge 0\}\right] = (\mu_1 \mu_2 + \kappa_1 \kappa_2 \rho)\, \Phi_2\!\left(\frac{\mu_1}{\kappa_1}, \frac{\mu_2}{\kappa_2}, \rho\right) + \frac{1}{\sqrt{2\pi}}\left(\kappa_1 \mu_2 T_1 + \kappa_2 \mu_1 T_2\right) + \frac{\kappa_1 \kappa_2 \sqrt{1 - \rho^2}}{2\pi} \exp\!\left(-\frac{1}{2(1 - \rho^2)}\left(\frac{\mu_1^2}{\kappa_1^2} - \frac{2\rho \mu_1 \mu_2}{\kappa_1 \kappa_2} + \frac{\mu_2^2}{\kappa_2^2}\right)\right).$$

Here, $T_1, T_2$ are defined as:

$$T_1 = \exp\left(-\frac{\mu_1^2}{2\kappa_1^2}\right) \Phi_1\!\left(\frac{1}{\sqrt{1 - \rho^2}}\left(\frac{\mu_2}{\kappa_2} - \frac{\rho \mu_1}{\kappa_1}\right)\right); \qquad T_2 = \exp\left(-\frac{\mu_2^2}{2\kappa_2^2}\right) \Phi_1\!\left(\frac{1}{\sqrt{1 - \rho^2}}\left(\frac{\mu_1}{\kappa_1} - \frac{\rho \mu_2}{\kappa_2}\right)\right).$$
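Lemma 4.7 is the mean of a rectified Gaussian; applied to $z = c^\top(w \odot x)$ with $\mu_1 = w^\top x$ and $\kappa_1 = \kappa\|w \odot x\|_2$, it recovers the exact smoothed activation $\tilde{\sigma}$ of Figure 2(a). A quick Monte Carlo sanity check (a sketch; names ours):

```python
import numpy as np
from math import erf, exp, pi, sqrt

def Phi1(t):
    """Standard Gaussian CDF."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

def lemma_4_7(mu, k):
    """E[z * 1{z >= 0}] for z ~ N(mu, k^2), per Lemma 4.7."""
    return (k / sqrt(2.0 * pi)) * exp(-mu**2 / (2.0 * k**2)) + mu * Phi1(mu / k)

rng = np.random.default_rng(0)
mu, k = 0.5, 1.3
z = mu + k * rng.standard_normal(2_000_000)
mc = float(np.mean(z * (z >= 0)))
print(lemma_4_7(mu, k), mc)  # the two values agree up to Monte Carlo error

# Sanity check of a known special case: for mu = 0, k = 1 the expectation is 1/sqrt(2*pi).
print(lemma_4_7(0.0, 1.0))
```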
Plugging $z_1 = c^\top (w_r \odot x)$ and $z_2 = c^\top (w_{r'} \odot x)$ back into $\mathbb{E}_C[(f(W, x_i \odot c_i))^2]$ and $\mathbb{E}_C[f(W, x_i \odot c_i)]$, and bounding the emerging error terms, gives the desired result. Details are deferred to the appendix.

Expected Surrogate Gradient. In the following, we study the expectation of the surrogate gradient.

Theorem 4.9. Assume that $c_i \sim \mathcal{N}(\mathbf{1}, \kappa^2 I)$ for some $\kappa \le 1$. Let $\phi_{\max}$, $\psi_{\max}$, $B_y$, $R_u$, and $R_w$ be defined as in Definition 4.1. Then, we have that:

$$\mathbb{E}_C[\nabla_{w_r} L_C(W)] = \nabla_{w_r} L(W) + g_r + \frac{3\kappa^2}{m} \sum_{r'=1}^m a_{r'}\, \mathrm{Diag}(x_i)^2 w_{r'} \cdot \mathbb{I}\!\left\{w_r^\top x_i \ge 0;\, w_{r'}^\top x_i \ge 0\right\}, \qquad (5)$$

where the magnitude of $g_r$ can be bounded as:

$$\|g_r\|_2 \le \left(6n\kappa^2 B_x^2 R_w + 5n\kappa R_u \sqrt{d}\right)\phi_{\max} + \frac{\sigma_{\max}(X)}{\sqrt{m}}\, \phi_{\max}\, L(W)^{\frac{1}{2}} + 6n\kappa R_u \psi_{\max}. \qquad (6)$$

Remark 4.10. Theorem 4.9 shows that the expected gradient can be written as the sum of the vanilla loss gradient $\nabla_{w_r} L(W)$, the third term above, and a gradient error $g_r$. The magnitude of $g_r$ is controlled in (6). As discussed previously, when $|w_r^\top x_i| > 0$, both $\phi_{\max}$ and $\psi_{\max}$ decrease exponentially as $\kappa$ decreases. Note that, although the second term scales with $L(W)$, as $L(W)$ decreases during training this term also contributes less to the overall gradient error.

Remark 4.11. One can observe that the third term on the right-hand side of (5) is the gradient with respect to $w_r$ of the function $R(W)$ given by:

$$R(W) = \frac{3\kappa^2}{m} \left\| \sum_{r'=1}^m a_{r'} (w_{r'} \odot x_i)\, \mathbb{I}\!\left\{w_{r'}^\top x_i \ge 0\right\} \right\|^2.$$

Therefore, $R(W)$ can be seen as a scaled version of the $T_2$ term in Theorem 4.2. This again verifies the regularization effect of the Gaussian random mask. The proof of Theorem 4.9 is provided in Appendix B.3; we give a proof sketch below.

Proof sketch.
By the form of the surrogate gradient in (2), we focus on the following two terms:

$$T_1 = f(W, x_i \odot c_i)(x_i \odot c_i)\, \mathbb{I}\!\left\{w_r^\top (x_i \odot c_i) \ge 0\right\}; \qquad T_2 = y_i (x_i \odot c_i)\, \mathbb{I}\!\left\{w_r^\top (x_i \odot c_i) \ge 0\right\}.$$

Plugging in the form of $f(W, x_i \odot c_i)$, we can write $T_1$ as:

$$T_1 = \frac{1}{\sqrt{m}} \sum_{r'=1}^m a_{r'} \sigma\!\left(w_{r'}^\top (x_i \odot c_i)\right) \cdot (x_i \odot c_i)\, \mathbb{I}\!\left\{w_r^\top (x_i \odot c_i) \ge 0\right\} = \frac{1}{\sqrt{m}} \sum_{r'=1}^m a_{r'}\, \mathrm{Diag}(x_i)\, c_i c_i^\top\, \mathbb{I}\!\left\{c_i^\top (w_r \odot x_i) \ge 0\right\} \cdot \mathbb{I}\!\left\{c_i^\top (w_{r'} \odot x_i) \ge 0\right\} (w_{r'} \odot x_i),$$

while $T_2$ can be written as:

$$T_2 = y_i x_i \odot \left(c_i\, \mathbb{I}\!\left\{c_i^\top (w_r \odot x_i) \ge 0\right\}\right).$$

This allows us to focus instead on the quantities

$$\mathbb{E}\left[cc^\top \mathbb{I}\{c^\top u \ge 0; c^\top v \ge 0\}\right]; \qquad \mathbb{E}\left[c\, \mathbb{I}\{c^\top u \ge 0\}\right]$$

using multivariate Gaussian analysis. The rest of the proof proceeds similarly to the proof of Theorem 4.2.

5 Convergence Guarantee of Training with a Gaussian Mask

Here, we study the convergence properties of a general framework of stochastic neural network training, which gives a theoretical result that can be of independent interest. Recall the setting in Section 3: consider $f(W, \cdot)$ a two-layer ReLU-activated MLP, as described above. Let $\xi$ denote the randomness in one step of (stochastic) gradient descent, and let $\hat{L}(W, \xi)$ and $\nabla_{w_r} \hat{L}(W, \xi)$ denote the stochastic loss and the stochastic gradient induced by $\xi$, respectively. We consider the sequence $\{W_k\}_{k=1}^K$ generated by the following updates:

$$W_{k+1} = W_k - \eta \nabla_W \hat{L}(W_k, \xi_k). \qquad (7)$$

We do not require the stochastic gradient to be unbiased; instead, the connection between $\nabla_{w_r} \hat{L}(W, \xi)$ and $\hat{L}(W, \xi)$ with respect to $w_r$, along with other requirements, is stated in the assumption below.

Assumption 5.1. For all $\xi, W$, we assume that the following properties hold:

$$\mathbb{E}_\xi\left[\hat{L}(W, \xi)\right] \le 2L(W) + \varepsilon_1, \qquad (8)$$

$$\left\| \mathbb{E}_\xi\left[\nabla_{w_r} \hat{L}(W, \xi)\right] - \nabla_{w_r} L(W) \right\|_2 \le \varepsilon_3 L(W)^{\frac{1}{2}} + \varepsilon_2, \qquad (9)$$

$$\left\| \nabla_{w_r} \hat{L}(W, \xi) \right\|_2^2 \le \gamma \hat{L}(W, \xi).$$
(10)

Here, (8) and (9) provide upper bounds on the expected loss and on the error of the expected gradient, while (10) can be seen as a relaxed form of smoothness. Our analysis is based on the standard NTK-type argument as in (Du et al., 2018; Song and Yang, 2020; Liao and Kyrillidis, 2022), which considers the infinite-width NTK $H^\infty$ given by:

$$H^\infty_{ij} = x_i^\top x_j\, \mathbb{E}_{w \sim \mathcal{N}(0, I)}\left[\mathbb{I}\!\left\{w^\top x_i \ge 0; w^\top x_j \ge 0\right\}\right]. \qquad (11)$$

It is shown in Du et al. (2018) that $H^\infty$ is positive definite. We define $\lambda_0 := \lambda_{\min}(H^\infty) > 0$.

Theorem 5.2. Assume that the first-layer weights of a neural network are initialized according to $w_{0,r} \sim \mathcal{N}(0, \tau^2 I)$ for some $\tau > 0$, and the second-layer weights are initialized according to $a_r \sim \mathrm{Unif}\{\pm 1\}$. Let the number of hidden neurons satisfy $m = \Omega\left(\frac{n^4 K^2}{\lambda_0^4 \delta^2 \tau^2}\right)$ and the step size satisfy $\eta = O\left(\frac{\lambda_0}{n^2}\right)$. Assume that Assumptions 3.1 and 5.1 hold for some $\gamma = C_1 \cdot \frac{n}{m}$ with small enough $\varepsilon_1, \varepsilon_2, \varepsilon_3$ satisfying:

$$\varepsilon_1 \le O\left(\frac{\delta m}{nK^4}\right), \qquad \varepsilon_2 \le O\left(\frac{\delta \lambda_0}{nK^2}\right), \qquad \varepsilon_3 \le O\left(\frac{\lambda_0}{\sqrt{mn}}\right). \qquad (12)$$

Then, with probability at least $1 - 2\delta - n^2 \exp\left(-\frac{n^3 \delta^2 \tau^2}{\lambda_0^3}\right)$, for all $k \in [K]$, the sequence $\{W_k\}_{k=1}^K$ generated by (7) satisfies:

$$\mathbb{E}_{\xi_0, \ldots, \xi_{k-1}}[L(W_k)] \le \left(1 - \frac{\eta \lambda_0}{2}\right)^k L(W_0) + O\left(\frac{mn}{\lambda_0^2} \cdot \varepsilon_2^2 + \varepsilon_1\right). \qquad (13)$$

Furthermore, we can guarantee that $\|w_{k,r} - w_{0,r}\|_2 \le O\left(\frac{\tau \lambda_0}{n}\right)$ for all $r \in [m]$ and $k \in [K]$.

In short, Theorem 5.2 shows that, under small enough $\varepsilon_1, \varepsilon_2, \varepsilon_3$ and $\gamma$, if the neural network is sufficiently overparameterized then, with a small enough step size $\eta$, we can guarantee convergence of the training given by (7), and the change in each $w_r$ is bounded by $O\left(\frac{\tau \lambda_0}{n}\right)$. As shown in (13), the expected loss converges linearly up to a ball around the global minimum with radius $O\left(\frac{mn}{\lambda_0^2} \cdot \varepsilon_2^2 + \varepsilon_1\right)$.
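For $w \sim \mathcal{N}(0, I)$, the probability in (11) has a well-known closed form, $\mathbb{P}\{w^\top x_i \ge 0, w^\top x_j \ge 0\} = (\pi - \theta_{ij})/(2\pi)$ with $\theta_{ij}$ the angle between $x_i$ and $x_j$ (the zeroth-order arc-cosine kernel). A sketch (ours) computing $H^\infty$ and $\lambda_0$ numerically and checking the closed form against direct sampling of $w$:

```python
import numpy as np

def ntk_infinite(X):
    """Infinite-width NTK (11): H_ij = x_i^T x_j * P_w{w^T x_i >= 0, w^T x_j >= 0},
    w ~ N(0, I); the probability equals (pi - theta_ij) / (2*pi)."""
    G = X @ X.T
    norms = np.linalg.norm(X, axis=1)
    cos = np.clip(G / np.outer(norms, norms), -1.0, 1.0)
    return G * (np.pi - np.arccos(cos)) / (2.0 * np.pi)

def ntk_monte_carlo(X, num_w=200_000, seed=0):
    """Estimate the same kernel by sampling w ~ N(0, I) directly."""
    rng = np.random.default_rng(seed)
    A = (rng.standard_normal((num_w, X.shape[1])) @ X.T >= 0).astype(float)
    return (X @ X.T) * (A.T @ A) / num_w   # empirical joint activation probabilities

rng = np.random.default_rng(3)
X = rng.standard_normal((6, 4))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm, non-collinear inputs
H = ntk_infinite(X)
lam0 = float(np.linalg.eigvalsh(H).min())       # lambda_0 = lambda_min(H_infty)
print(np.max(np.abs(H - ntk_monte_carlo(X))), lam0)
```

With unit-norm, pairwise non-collinear inputs (Assumption 3.1), the computed $\lambda_0$ is strictly positive, in line with Du et al. (2018).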
This error region decreases monotonically as the errors in the expected loss and gradient, namely $\varepsilon_1$ and $\varepsilon_2$, decrease.

Training Convergence with a Gaussian Input Mask. We now apply the general result of Theorem 5.2 to the scenario of Gaussian input masking, as given by (1). To apply Theorem 5.2, one needs to ensure that the requirements in Assumption 5.1 are satisfied. Here, we present two corollaries as extensions of Theorem 4.2 and Theorem 4.9, with the goal of establishing (8) and (9).

Corollary 5.3. Let $B_y$, $\phi_{\max}$, $\psi_{\max}$, $R_u$, and $R_w$ be defined as in Definition 4.1. If $B_y \le 3\sqrt{m}R_w$, then we have:

$$\mathbb{E}_C[L_C(W)] \le 2L(W) + 2mn\kappa^2 R_u^2 + mn\left(\kappa^2 R_u^2 + \kappa R_w\right)\phi_{\max}^2 + mn\kappa^2\left(R_u^2 + 1\right)\psi_{\max}^2.$$

Corollary 5.3 follows from Theorem 4.2 by upper bounding the difference between the smoothed neural network function $\hat{f}(W, \cdot)$ and the vanilla neural network function $f(W, \cdot)$, and by upper bounding the regularization term. In particular, Corollary 5.3 implies the bound $\varepsilon_1 = 2mn\kappa^2 R_u^2 + mn\left(\kappa^2 R_u^2 + \kappa R_w\right)\phi_{\max}^2 + mn\kappa^2\left(R_u^2 + 1\right)\psi_{\max}^2$.

Corollary 5.4. Let $B_y$, $\phi_{\max}$, $\psi_{\max}$, $R_u$, and $R_w$ be defined as in Definition 4.1. If $B_y \le 3\sqrt{m}R_w$, then we have:

$$\left\|\mathbb{E}_C[\nabla_{w_r} L_C(W)] - \nabla_{w_r} L(W)\right\|_2 \le O\left(\left(n\kappa^2 B_x^2 R_w + n\kappa R_u \sqrt{d}\right)\phi_{\max}\right) + O\left(\frac{\sigma_{\max}(X)\phi_{\max}}{\sqrt{m}}\right) L(W)^{\frac{1}{2}} + O\left(n\kappa R_u \psi_{\max} + \kappa^2 \sqrt{m} B_x^2 R_w\right).$$

Similar to Corollary 5.3, Corollary 5.4 follows from upper bounding the regularization term in Theorem 4.9. By Corollary 5.4, we can take $\varepsilon_2 = O\left(\left(n\kappa^2 B_x^2 R_w + n\kappa R_u \sqrt{d}\right)\phi_{\max}\right) + O\left(n\kappa R_u \psi_{\max} + \kappa^2 \sqrt{m} B_x^2 R_w\right)$ and likewise $\varepsilon_3 = O\left(\frac{\sigma_{\max}(X)\phi_{\max}}{\sqrt{m}}\right)$ in Assumption 5.1. The proofs of Corollary 5.3 and Corollary 5.4 are deferred to Appendix C.2. To complete the requirements of Assumption 5.1, we show the following lemma for (10).

Lemma 5.5. Assume that Assumption 3.1 holds.
Then, we have: ∥∇_{w_r} L_C(W)∥₂² ≤ C₁ (n/m) L_C(W).

The proof can be found in the appendix (Lemma D.21). With the help of Corollaries 5.3 and 5.4 and Lemma 5.5, we can derive the convergence guarantee for training the two-layer ReLU neural network under a Gaussian input mask.

Theorem 5.6. Assume that the first-layer weights are initialized according to w_{0,r} ∼ N(0, τ²I) for some τ > 0, and that the second-layer weights are initialized according to a_r ∼ Unif{±1}. Let the number of hidden neurons satisfy m = Ω(n⁴K²/(λ0⁴δ²τ²)) and the step size satisfy η = O(λ0/n²). Assume that for all W ∈ {W_k}_{k=1}^K, the following hold:

$$\kappa = O\!\left(\frac{\sqrt{\delta}\,\lambda_0}{\tau^2 K^2\big(m^{1/4}\sqrt{d} + nd\big)\big(\hat\phi_{\max} + \hat\psi_{\max}\big)}\right) \tag{14}$$

$$\sigma_{\max}(X)\,\hat\phi_{\max} \le O\!\left(\frac{\lambda_0}{\sqrt{n}}\right). \tag{15}$$

Then, with probability at least 1 − 2δ − n² exp(−n³δ²τ²/λ0³), for all k ∈ [K], the sequence {W_k}_{k=1}^K generated by (7) satisfies:

$$\mathbb{E}_{C_0,\ldots,C_{k-1}}[L(W_k)] \le \Big(1 - \frac{\eta\lambda_0}{2}\Big)^k L(W_0) + O\!\left(\kappa\tau^2 mn^2 d^2\big(\hat\phi_{\max}^2 + \hat\psi_{\max}^2\big)\right) \tag{16}$$
$$\qquad + O\!\left(\kappa^2\tau^2 m^2 nd\right) + O\!\left(\kappa\tau mn\sqrt{d}\,\hat\phi_{\max}^2\right) \tag{17}$$

where ϕ̂_max = max_{k∈[K]} ϕ_max(W_k) and ψ̂_max = max_{k∈[K]} ψ_max(W_k).

In short, Theorem 5.6 guarantees the convergence, in the form of (16), of training a two-layer ReLU neural network under a Gaussian input mask, given sufficient overparameterization, a proper step size, and the requirements in (14) and (15). In particular, (14) requires a sufficiently small Gaussian noise scale κ. The condition in (15) requires either a small maximum singular value of the input data matrix X or a small ϕ_max. Lastly, (16) shows linear convergence of the expected loss up to an error region. Notice that the first part of the error region depends on κ and τ as well as on ϕ̂_max and ψ̂_max, while the second part depends solely on κ and τ.
This means that one can guarantee an arbitrarily small error region when the Gaussian noise scale κ and the initialization scale τ are sufficiently small.

Remark 5.7. Both the requirements and the error region in Theorem 5.6 depend on the quantities ϕ̂_max and ψ̂_max. Recall that:

$$\hat\phi_{\max} = \max_{k,r,i}\left\{\exp\!\left(-\frac{(w_r^\top x_i)^2}{4\kappa^2\|w_r \odot x_i\|_2^2}\right)\right\}; \qquad \hat\psi_{\max} = \max_{k,r,i}\left\{\big|w_r^\top x_i\big| \cdot \exp\!\left(-\frac{(w_r^\top x_i)^2}{4\kappa^2\|w_r \odot x_i\|_2^2}\right)\right\}.$$

Both quantities decay exponentially fast as κ decays, provided w_r^⊤ x_i ≠ 0 for all k, r, i. Since the sequence {W_k}_{k=1}^K is generated under the randomness of the C_k's, intuitively it is almost never the case that w_{k,r}^⊤ x_i = 0. Therefore, in most cases Theorem 5.6 should require only a logarithmic dependency of κ on the other parameters for (14) and (15) to be satisfied. However, κ still needs to decay polynomially if one wants to sufficiently shrink the second part of the error region.

Figure 3: (a) Training loss L(W_k) (log scale) versus training iteration for a two-layer ReLU network (n = 500, d = 20, m = 100) trained with full-batch gradient descent under different levels of multiplicative Gaussian input noise with standard deviation κ. (b) Distributed training with a Gaussian mask for different κ and numbers of local steps.

6 Experiments

6.1 Empirical Validation of Training Convergence with Gaussian Mask

Theorem 5.6 asserts that training a sufficiently overparameterized two-layer ReLU network with multiplicative Gaussian input noise results in linear convergence of the expected loss to an error ball whose radius is proportional to the noise variance (controlled by κ) and other network- and data-dependent terms. We empirically verify this convergence behavior.

Simulation Setup.
We train a two-layer ReLU MLP. As a toy example, the network has d = 20 input features and m = 100 hidden units. The training dataset comprises n = 500 synthetic samples, with input features x_i normalized such that ∥x_i∥₂ ≤ 1, and target values y_i generated from a non-linear function of x_i with small added noise. First-layer weights W were initialized with Kaiming uniform initialization, and second-layer weights a_r ∈ {±1} were fixed. The network was trained for 2000 iterations using full-batch gradient descent with a learning rate of 0.005. We performed separate training runs for the noise levels κ ∈ {0.0, 0.05, 0.2, 0.4, 0.6, 1.0, 2.0}. For each run, we tracked the evolution of the clean training loss L(W_k).

Results and Discussion. The training trajectories, plotted in Figure 3a, illustrate the theoretical predictions. For clean training (κ = 0.0), the loss exhibits an initial linear convergence phase. When multiplicative Gaussian noise is introduced, this initial linear convergence trend is preserved. As training progresses, however, the loss converges not to the same minimal value but to a distinct error ball, plateauing at a value higher than in the clean case. As expected, the size of this error ball, indicated by the final converged loss value, systematically increases with the noise level κ. This direct relationship between κ and the size of the error ball provides strong empirical support for the convergence guarantees of Theorem 5.6.

6.2 Impact of Multiplicative Gaussian Noise on Model Accuracy

We trained a 1-hidden-layer MLP (4096 hidden units, GELU activation, dropout = 0.2) on CIFAR-10 for 80 epochs using AdamW optimization with cosine annealing, label smoothing, and standard data augmentation. During training, we injected multiplicative Gaussian noise x ← x · (1 + κZ), where Z ∼ N(0, 1).
We then evaluated the impact of the noise strength κ on clean test accuracy. The results reveal that small amounts of MG noise (κ ≈ 0.2) improve generalization, acting as an effective regularizer beyond the existing random cropping and flipping. This suggests that modest input perturbations help the model learn more robust features that transfer better to the test set. However, accuracy degrades monotonically beyond this point, dropping to 49.88% at κ = 1.8.

Next, to assess the impact of multiplicative Gaussian noise on a well-regularized convolutional architecture, we trained a CNN on CIFAR-10 with varying noise strengths κ.

Figure 4: Test accuracy versus multiplicative Gaussian noise strength κ for (a) a 1-hidden-layer MLP and (b) a CNN, trained on CIFAR-10. Small noise levels (κ ≈ 0.2) can improve generalization for the MLP, likely due to regularization effects. In contrast, the CNN exhibits robustness by maintaining its baseline accuracy. Beyond this point, accuracy degrades monotonically for both architectures as the noise corrupts the training signal.

The architecture consists of four convolutional layers (32 → 32 → 64 → 64 filters with 3 × 3 kernels), batch normalization after each convolutional layer, three dropout layers with rates 0.2, 0.3, and 0.5 respectively, and a fully connected layer with ReLU activation. We trained the CNN for 80 epochs using the Adam optimizer with lr = 10⁻³, weight decay = 10⁻⁵, and batch size 128. During training, we applied multiplicative Gaussian (MG) noise x ← x · (1 + κZ), where Z ∼ N(0, 1), while all accuracy measurements were performed on the clean (noiseless) test set. As shown in Figure 4b, the model exhibits robustness to low-magnitude MG noise, maintaining its baseline accuracy (≈ 71%) at κ = 0.2. Beyond that point, however, accuracy degrades monotonically as excessive noise corrupts the training signal.
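To make the masked-training protocol concrete, here is a minimal NumPy sketch in the spirit of Section 6.1: a two-layer ReLU network trained by full-batch gradient descent on inputs multiplied by a fresh Gaussian mask x · (1 + κZ) at every step, while the clean loss is tracked. All sizes, the synthetic target, and the hyperparameters are illustrative stand-ins, not the exact experimental configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (smaller than the paper's n = 500, d = 20, m = 100).
n, d, m, kappa, lr = 64, 5, 50, 0.2, 0.05

X = rng.normal(size=(n, d))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)  # enforce ||x_i||_2 <= 1
y = np.sin(X @ rng.normal(size=d))                              # non-linear synthetic target

W = rng.normal(scale=0.5, size=(m, d))   # first-layer weights
a = rng.choice([-1.0, 1.0], size=m)      # fixed second-layer signs a_r

def forward(W, X):
    """Two-layer ReLU network f(W, x) = (1/sqrt(m)) * a^T ReLU(W x)."""
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

def clean_loss(W):
    return 0.5 * np.mean((forward(W, X) - y) ** 2)

loss0 = clean_loss(W)
for step in range(500):
    C = 1.0 + kappa * rng.normal(size=X.shape)  # fresh multiplicative Gaussian mask
    Xm = X * C
    err = (forward(W, Xm) - y) / n              # dL/df for the mean-squared loss
    act = (Xm @ W.T > 0).astype(float)          # ReLU derivative at the masked inputs
    grad_W = (act * np.outer(err, a)).T @ Xm / np.sqrt(m)
    W -= lr * grad_W

final = clean_loss(W)  # plateaus at an error level that grows with kappa
```

Rerunning the loop with larger kappa raises the plateau of the clean loss, mirroring the error-ball behavior in Figure 3a.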
6.3 Application: Distributed Training over Wireless Channels

Communicating signals over wireless channels subjects them to fading phenomena (Tse and Viswanath, 2005). Specifically, a time-varying signal x(t) transmitted over a channel h(t) with additive noise z(t) results in y(t) = x(t)h(t) + z(t). For data-parallel distributed training over wireless channels, each input datum is passed through the channel to the workers, which then perform local training. We consider the Gaussian masked input training scheme studied in this paper as a simplified model of channel fading in wireless communication. In particular, we let x be the transmitted signal x(t) and c ∼ N(1, κ²I_d) be the channel effect h(t). For simplicity, we set the additive noise to 0. The time-dependent behavior of x(t) and h(t) is captured by the masking scheme: in each iteration, a new mask is applied to the sample.

Under this setup, we train a two-layer MLP with 128 hidden neurons on the MNIST dataset using FedAvg. That is, the total training process is partitioned into multiple global iterations; in each global iteration, the central server passes the updated model parameters, together with the current copy of the training data, through a wireless channel. Each worker receives the training data with channel fading (in our case, modeled as multiplicative Gaussian noise) and updates its local copy of the model using gradient descent, starting from the parameters shared by the server, for some number of local steps. After the local update, the workers send the updated parameters to the central server, which aggregates them by averaging the workers' weights. Under this setup, we train an aggregated model with 5 workers and {1, 20, 40} local steps, using a batch size of 128.
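The FedAvg-with-channel-fading loop described above can be sketched compactly. For brevity this sketch uses a linear model and synthetic data in place of the paper's two-layer MLP on MNIST, so every name and number here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

n_workers, local_steps, rounds, kappa, lr, d = 5, 20, 30, 0.1, 0.1, 8
w_true = rng.normal(size=d)

# Each worker holds a shard of the synthetic dataset.
shards = []
for _ in range(n_workers):
    Xi = rng.normal(size=(40, d)) / np.sqrt(d)
    shards.append((Xi, Xi @ w_true))

w_global = np.zeros(d)
for t in range(rounds):
    local_models = []
    for X, y in shards:
        w = w_global.copy()
        for _ in range(local_steps):
            # Channel fading: a fresh multiplicative Gaussian mask per local step.
            Xm = X * (1.0 + kappa * rng.normal(size=X.shape))
            grad = Xm.T @ (Xm @ w - y) / len(y)
            w -= lr * grad
        local_models.append(w)
    # FedAvg aggregation: average the workers' weights.
    w_global = np.mean(local_models, axis=0)

err = np.linalg.norm(w_global - w_true)
```

Increasing kappa while holding local_steps large reproduces, qualitatively, the noise-fitting effect discussed below: more local steps let each worker drift toward its own noisy view of the data before averaging.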
We also vary the variance of the Gaussian mask to study the relationship between the number of local steps and the noise scale. For each combination of local steps and Gaussian variance, we run 5 trials and record the mean and standard deviation of the resulting accuracy. We plot the results in Figure 3b. In general, for all choices of the number of local steps, we observe a decay in test accuracy as the input masking variance grows, indicating the negative influence of the noise on the overall training performance. We also observe that, in the low-noise regime (κ ≈ 0), the final accuracy is not much influenced by the number of local iterations. However, as the noise scale grows (larger κ), the final accuracy decays drastically as we increase the number of local iterations. We hypothesize that this behavior arises because more local iterations allow the workers to fit the noise rather than the original signal in the data.

6.4 Efficacy of Multiplicative Gaussian Noise Against Membership Inference Attacks

In this section, we empirically evaluate the effectiveness of training with input multiplicative Gaussian (MG) noise as a defense against membership inference attacks (MIAs). In simple terms, the primary goal of an MIA is to determine whether a specific data point x was part of the training set of a target model f.

Threat Model and Attack Methodology. We adopt the black-box shadow model attack methodology (Adversary 1) from the ML-Leaks framework (Salem et al., 2019). In this setup, the adversary aims to determine whether a specific data point was part of a target model's training set using only the model's output posteriors. Because the adversary lacks the target's training labels, they employ a shadow model trained on a proxy dataset to mimic the target's behavior.
By observing how the shadow model treats its own members versus non-members, the adversary generates labeled data to train an attack model. This binary classifier learns to identify the statistical "signatures" of membership, such as increased confidence or reduced entropy, enabling it to perform membership inference on the original target model. A detailed breakdown of the data partitioning and the five-stage attack pipeline is provided in Appendix A.2.

Experimental Setup. We evaluate the effectiveness of multiplicative Gaussian (MG) noise as a privacy defense using the CIFAR-10 dataset. The data is partitioned into four disjoint sets to train and audit both the target and shadow models independently. Our evaluation covers two architectures, a fully-connected MLP and a multi-block CNN, to ensure the defense generalizes across different model complexities. We measure privacy leakage by training a logistic regression attack model against target models subjected to varying noise intensities (κ ∈ {0.0, 0.5, 1.2, 1.8}) and training durations (20–120 epochs). The defense is quantified via precision, recall, and AUC, where an AUC of 0.5 indicates perfect privacy (random guessing). Detailed hyperparameters, partition sizes, and architectural specifications are provided in Appendix A.2.

6.4.1 Results and Discussion

Our experiments confirm that training with multiplicative Gaussian noise systematically enhances a model's resilience to membership inference attacks. The results for the MLP and CNN models are presented in Figures 6 and 7 respectively, with the AUC values of our experiments illustrated in Figure 5.

Figure 5: Attack AUC on the target model when it is (a) an MLP and (b) a CNN. Higher values indicate greater privacy leakage. Training with MG noise (larger κ) consistently reduces attack success.

Figure 6: MLP target.
Multiplicative Gaussian noise provides resilience against the attacks as κ increases.

Figure 7: CNN target. We observe a similar trend as for the MLP.

Figure 7 demonstrates the attack success for κ = 0.0, in which case, as the number of training epochs increases from 20 to 120, the AUC rises from 0.578 to a significant 0.782. This validates that our attack implementation correctly captures privacy leakage. The central finding is the consistent defensive benefit of multiplicative Gaussian noise. As shown in Figure 6 for the MLP, for any given number of training epochs, applying MG noise (increasing κ) keeps both the precision and recall of the attack (relatively) constant. For example, after 120 epochs of training, the standard model is highly vulnerable (attack AUC = 0.782). In contrast, the model trained with κ = 0.5 reduces this leakage (AUC = 0.692), and models with stronger noise achieve even better privacy (AUC = 0.585 for κ = 1.2 and AUC = 0.543 for κ = 1.8). This demonstrates a clear dose-response relationship: greater noise variance leads to stronger privacy protection against MIAs. Similar trends were observed for the CNN architecture (see Figure 7). Figure 4 illustrates the classic privacy-utility trade-off. In essence, by sacrificing some model utility, training with multiplicative Gaussian noise effectively obfuscates the statistical signature of data membership, thereby mitigating privacy risks.

7 Conclusion

This work investigates the fundamental question of how independent multiplicative Gaussian masking affects training dynamics. Focusing on a two-layer ReLU network in the NTK regime, we demonstrate that the masked objective admits a closed-form decomposition into a smoothed square loss plus an explicit, data-dependent regularizer.
This structure allows training with the gradient of the masked objective to achieve linear convergence toward a noise-controlled error ball, given a small step size and sufficient over-parameterization. Beyond our theory, we provide experimental results validating convergence to a small error ball, and present applications to distributed training under channel fading and to defending against membership inference attacks with masking.

Limitations and Future Work. Our theoretical analysis is currently constrained to the small-noise regime and the extreme over-parameterization typical of NTK models. Furthermore, our proofs rely on the independence of masks across iterations and do not yet incorporate a formal privacy accounting pipeline, such as subsampling or composition. Despite these constraints, multiplicative Gaussian masking represents a provable and practically viable method for injecting input-level uncertainty. These results provide a principled foundation for future exploration of deep networks and more complex noise settings in feature-wise training.

References

Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Convex optimization with sparsity-inducing norms. Optimization for Machine Learning, page 19, 2011.

Francis Bach, Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, et al. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.

Aristide Baratin, Thomas George, César Laurent, R Devon Hjelm, Guillaume Lajoie, Pascal Vincent, and Simon Lacoste-Julien. Implicit regularization via neural feature alignment. In International Conference on Artificial Intelligence and Statistics, pages 2269–2277. PMLR, 2021.

Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. Advances in Neural Information Processing Systems, 30, 2017.
Yong Cheng, Yang Liu, Tianjian Chen, and Qiang Yang. Federated learning for privacy-preserving AI. Communications of the ACM, 63(12):33–36, 2020.

Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pages 1310–1320. PMLR, 2019.

Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.

Simon S. Du, Kangcheng Hou, Barnabás Póczos, Ruslan Salakhutdinov, Ruosong Wang, and Keyulu Xu. Graph neural tangent kernel: Fusing graph neural networks with graph kernels, 2019a. URL https://arxiv.org/abs/1905.13192.

Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks, 2019b.

Chen Dun, Cameron R Wolfe, Christopher M Jermaine, and Anastasios Kyrillidis. ResIST: Layer-wise decomposition of resnets for distributed training. In Uncertainty in Artificial Intelligence, pages 610–620. PMLR, 2022.

Chen Dun, Mirian Hipolito, Chris Jermaine, Dimitrios Dimitriadis, and Anastasios Kyrillidis. Efficient and light-weight federated learning via asynchronous distributed dropout. In International Conference on Artificial Intelligence and Statistics, pages 6630–6660. PMLR, 2023.

Chris Edwards. Data quality may be all you need, 2024.

Ruiqi Gao, Tianle Cai, Haochuan Li, Liwei Wang, Cho-Jui Hsieh, and Jason D. Lee. Convergence of adversarial training in overparametrized neural networks, 2019.

Guillaume Garrigos and Robert M. Gower. Handbook of convergence theorems for (stochastic) gradient methods. arXiv preprint arXiv:2301.11235, 2023.

Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Linearized two-layer neural networks in high dimension. The Annals of Statistics, 49(2):1029–1054, 2021.
Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. Textbooks are all you need. arXiv preprint arXiv:2306.11644, 2023.

Chaoyang He, Songze Li, Jinhyun So, Xiao Zeng, Mi Zhang, Hongyi Wang, Xiaoyang Wang, Praneeth Vepakomma, Abhishek Singh, Hang Qiu, et al. FedML: A research library and benchmark for federated machine learning. arXiv preprint arXiv:2007.13518, 2020.

Erdong Hu, Yuxin Tang, Anastasios Kyrillidis, and Chris Jermaine. Federated learning over images: Vertical decompositions and pre-trained backbones are difficult to beat. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19385–19396, 2023.

Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial examples are not bugs, they are features. Advances in Neural Information Processing Systems, 32, 2019.

Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31, 2018.

Rodolphe Jenatton, Jean-Yves Audibert, and Francis Bach. Structured variable selection with sparsity-inducing norms. The Journal of Machine Learning Research, 12:2777–2824, 2011.

Ziwei Ji and Matus Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks, 2020.

Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1–2):1–210, 2021.

Emmanouil Kariotakis, Grigorios Tsagkatakis, Panagiotis Tsakalides, and Anastasios Kyrillidis.
Leveraging sparse input and sparse models: Efficient distributed learning in resource-constrained environments. In Conference on Parsimony and Learning, pages 554–569. PMLR, 2024.

Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, volume 28, pages 2575–2583, 2015. URL https://papers.nips.cc/paper/5666-variational-dropout-and-the-local-reparameterization-trick.

Anastasios Kyrillidis, Luca Baldassarre, Marwa El Halabi, Quoc Tran-Dinh, and Volkan Cevher. Structured sparsity: Discrete and convex approaches. In Compressed Sensing and its Applications: MATHEON Workshop 2013, pages 341–387. Springer, 2015.

Daniel LeJeune and Sina Alemohammad. An adaptive tangent feature perspective of neural networks. In Yuejie Chi, Gintare Karolina Dziugaite, Qing Qu, Atlas Wang, and Zhihui Zhu, editors, Conference on Parsimony and Learning, volume 234 of Proceedings of Machine Learning Research, pages 379–394. PMLR, 03–06 Jan 2024. URL https://proceedings.mlr.press/v234/lejeune24a.html.

Guanlin Li, Han Qiu, Shangwei Guo, Jiwei Li, and Tianwei Zhang. Rethinking adversarial training with neural tangent kernel. arXiv preprint arXiv:2312.02236, 2023a.

Shuyao Li, Ilias Diakonikolas, and Jelena Diakonikolas. Distributionally robust optimization with adversarial data contamination, 2025.

Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need II: phi-1.5 technical report. arXiv preprint arXiv:2309.05463, 2023b.

Fangshuo Liao and Anastasios Kyrillidis. On the convergence of shallow neural network training with randomly masked neurons. Transactions on Machine Learning Research, 2022. URL https://openreview.net/forum?id=ebZ0gGRJwQx.

Yang Liu, Tao Fan, Tianjian Chen, Qian Xu, and Qiang Yang.
FATE: An industrial grade platform for collaborative learning with data protection. Journal of Machine Learning Research, 22(226):1–6, 2021.

Yang Liu, Xinwei Zhang, Yan Kang, Liping Li, Tianjian Chen, Mingyi Hong, and Qiang Yang. FedBCD: A communication-efficient collaborative learning framework for distributed features. IEEE Transactions on Signal Processing, 70:4277–4290, 2022.

Yang Liu, Yan Kang, Tianyuan Zou, Yanhong Pu, Yuanqin He, Xiaozhou Ye, Ye Ouyang, Ya-Qin Zhang, and Qiang Yang. Vertical federated learning: Concepts, advances, and challenges. IEEE Transactions on Knowledge and Data Engineering, 2024.

Noel Loo, Ramin Hasani, Alexander Amini, and Daniela Rus. Evolution of neural tangent kernels under benign and adversarial training. Advances in Neural Information Processing Systems, 35:11642–11657, 2022.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.

Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017.

Poorya Mianjy and Raman Arora. On convergence and generalization of dropout training. In Advances in Neural Information Processing Systems, volume 33, pages 14124–14134, 2020. URL https://proceedings.neurips.cc/paper/2020/file/f1de5100906f31712aaa5166689bfdf4-Paper.pdf.

Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1979–1993, 2018.

Quynh Nguyen.
On the proof of global convergence of gradient descent for deep relu networks with linear widths, 2021.

Samet Oymak and Mahdi Soltanolkotabi. Towards moderate overparameterization: Global convergence guarantees for training shallow neural networks, 2019.

Mélanie Rey and Andriy Mnih. Gaussian dropout as an information bottleneck layer. In Bayesian Deep Learning Workshop, NeurIPS, 2021. URL https://bayesiandeeplearning.org/2021/papers/40.pdf.

Daniele Romanini, Adam James Hall, Pavlos Papadopoulos, Tom Titcombe, Abbas Ismail, Tudor Cebere, Robert Sandmann, Robin Roehm, and Michael A Hoeh. PyVertical: A vertical federated learning framework for multi-headed SplitNN. arXiv preprint arXiv:2104.00489, 2021.

Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2017.

Ahmed Salem, Yang Zhang, Mathias Humbert, Pascal Berrang, Mario Fritz, and Michael Backes. ML-Leaks: Model and data independent membership inference attacks and defenses on machine learning models. In Proceedings of the 2019 Network and Distributed System Security Symposium (NDSS), 2019. ISBN 1-891562-55-X. URL https://www.ndss-symposium.org/ndss2019/papers/ndss2019_03A-1_Salem_paper.pdf.

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.

Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Proceedings of the 30th International Conference on Machine Learning, pages 71–79, 2013. URL https://proceedings.mlr.press/v28/shamir13.html.

Zhao Song and Xin Yang.
Quadratic suffices for over-parametrization via matrix chernoff bound, 2020. URL https://arxiv.org/abs/1906.03593.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Cheng Tang et al. Convergence analysis of stochastic gradient descent on strongly convex functions. In Proceedings of the 2013 European Signal Processing Conference, pages 1568–1572, 2013. URL https://www.esat.kuleuven.be/sista/ROKS2013/files/abstracts/ChengTang.pdf.

Lan V. Truong. Global convergence rate of deep equilibrium models with general activations, 2025. URL https://arxiv.org/abs/2302.05797.

David Tse and Pramod Viswanath. Fundamentals of Wireless Communication. Cambridge University Press, Cambridge, United Kingdom, 2005. ISBN 9780521845274. doi: 10.1017/CBO9780511807213.

Nikolaos Tsilivis and Julia Kempe. What can the neural tangent kernel tell us about adversarial robustness? Advances in Neural Information Processing Systems, 35:18116–18130, 2022.

Sida Wang and Christopher D Manning. Fast dropout training. In Proceedings of the 30th International Conference on Machine Learning, pages 118–126, 2013. URL https://proceedings.mlr.press/v28/wang13a.html.

Cameron R Wolfe, Jingkang Yang, Fangshuo Liao, Arindam Chowdhury, Chen Dun, Artun Bayer, Santiago Segarra, and Anastasios Kyrillidis. GIST: Distributed training for large-scale graph convolutional networks. Journal of Applied and Computational Topology, pages 1–53, 2023.

Eric Wong, Frank Schmidt, Jan Hendrik Metzen, and J Zico Kolter. Scaling provable adversarial defenses. Advances in Neural Information Processing Systems, 31, 2018.

Yongtao Wu, Fanghui Liu, Grigorios G Chrysos, and Volkan Cevher.
On the convergence of encoder-only shallow transformers, 2023.

Ashkan Yousefpour, Igor Shilov, Alexandre Sablayrolles, Davide Testuggine, Karthik Prasad, Mani Malek, John Nguyen, Sayan Ghosh, Akash Bharadwaj, Jessica Zhao, Graham Cormode, and Ilya Mironov. Opacus: User-friendly differential privacy library in pytorch, 2021.

Binhang Yuan, Cameron R Wolfe, Chen Dun, Yuxin Tang, Anastasios Kyrillidis, and Chris Jermaine. Distributed learning of fully connected neural networks using independent subnet training. Proceedings of the VLDB Endowment, 2022.

A Additional Experimental Results and Related Details

A.1 Empirical Validation of Expected Gradient Properties (Theorem 4.9)

Theorem 4.9 characterizes the expected gradient under Gaussian input masking as E_C[∇_{w_r} L_C(W)] = ∇_{w_r} L(W) + T_{3,r} + g_r. Here, ∇_{w_r} L(W) is the clean-input gradient, T_{3,r} is a systematic deviation term proportional to κ², and g_r is a residual error bounded by Eq. (6).

Simulation Setup. We used a two-layer ReLU MLP with d = 20 input features and m = 100 hidden units, on n = 500 synthetic samples (∥x_i∥₂ ≤ 1, y_i ∼ N(0, 0.5²)). First-layer weights W are drawn from N(0, 0.1²); second-layer weights a_r ∈ {±1} are fixed. We analyze ∇_{w_r} L_C(W) by averaging N = 2000 Monte Carlo samples for κ ∈ [0.001, 1.0], for a representative neuron r. For this setup, the clean loss L(W) ≈ 71.52 and ∥∇_{w_r} L(W)∥₂ ≈ 0.981.

Results and Discussion. Our simulations validate the decomposition in Theorem 4.9. Figure 8a displays the ℓ₂-norms of the gradient components versus κ. The clean gradient norm is constant. The norm ∥T_{3,r}∥₂ scales with κ² (e.g., from ≈ 8.1 × 10⁻⁷ at κ = 0.001 to ≈ 0.81 at κ = 1.0), confirming its theoretical dependence.
The norm of the empirically estimated expected masked gradient, ∥E_C[∇_{w_r} L_C(W)]∥₂, follows the clean gradient norm for small κ and, for larger κ, reflects the vector sum with the growing T_{3,r} term in Eq. (5).

Figure 8: ℓ₂-norms of the gradient components (left) and residual-error bound check (right). (a) ℓ₂-norms of key components of the expected gradient E_C[∇_{w_r} L_C(W)]: the clean gradient norm ∥∇_{w_r} L(W)∥₂, the norm ∥T_{3,r}∥₂, and the total expected masked gradient norm; T_{3,r} scales with κ². (b) Comparison of the ℓ₂-norm of the empirically estimated gradient error term, ∥g_r∥₂, against its theoretical upper bound from Eq. (6). The empirical error (solid line) remains below the derived bound (dashed line) across all tested κ (log-log scale).

Figure 8b examines the residual error term g_r. It compares the ℓ₂-norm of the empirically estimated g_r with its theoretical upper bound from Eq. (6). The estimated error norm ∥g_r∥_est increases with κ (from ≈ 1.65 × 10⁻² at κ = 0.001 to ≈ 0.53 at κ = 1.0). Importantly, the theoretical bound on ∥g_r∥₂ consistently upper-bounds the empirical error across the entire range of κ. For instance, at κ = 0.001, ∥g_r∥_est ≈ 0.0165 while its bound is ≈ 7.85. In summary, the simulations confirm that the expected gradient under Gaussian input masking, for sufficiently small κ, is well-approximated by the sum of the clean gradient and the κ²-dependent term, with a residual error that is effectively bounded by our theoretical derivation.

A.2 Experimental Details for Section 6.4

We adopt the black-box threat model and the shadow model attack methodology (Adversary 1) proposed in the ML-Leaks framework (Salem et al., 2019).

Threat Model. The adversary has black-box access to a trained target model f.
This means the adversary can query the model with any input x and observe its output posterior probability vector p = f(x) (i.e., the softmax output over the classes), but has no access to the model's parameters, gradients, or original training data. The adversary's goal is to train an attack model A that, given the posterior p_f(x) from the target model for a point x, predicts whether x was a member of the target's training set.

Shadow Model Attack Pipeline. Since the attacker does not have access to the target model's training set, they cannot directly generate labeled data (member vs. non-member posteriors) to train their attack model. The shadow model technique circumvents this by creating a proxy environment in which the attacker controls data membership. The pipeline is as follows:

1. Data Partitioning: The attacker possesses a dataset D_Shadow, disjoint from the target's training set but drawn from the same data distribution. This set is split into D^Train_Shadow and D^Out_Shadow.

2. Shadow Model Training: A shadow model S, which mimics the target model's architecture and training process, is trained on D^Train_Shadow.

3. Attack Dataset Generation: The trained shadow model S is queried on its own training data (members, D^Train_Shadow) and its hold-out data (non-members, D^Out_Shadow). The resulting posterior vectors p_S(x) are collected. Following (Salem et al., 2019), the top-3 sorted probabilities of each posterior are used as features: φ(p_S(x)) = (p_(1), p_(2), p_(3)). These feature vectors are labeled "1" if x ∈ D^Train_Shadow and "0" otherwise.

4. Attack Model Training: A binary classifier, the attack model A, is trained on this generated dataset of (feature, label) pairs. It learns to distinguish the statistical "signature" of a member's posterior from a non-member's.
This signature often manifests as higher confidence (larger p_(1)) and lower entropy for members, a result of the target/shadow model overfitting to its training data.

5. Inference on Target Model: To attack the original target model f, the adversary queries it with a point of interest x, extracts the features φ(p_f(x)), and feeds them to the trained attack model A to obtain a membership prediction.

Dataset and Partitioning. We use the CIFAR-10 dataset, consisting of 60,000 images. The full pool is shuffled and divided into four disjoint sets of 10,520 images each: target_train (training MG-protected models), target_test (non-member audit data), shadow_train (training shadow models), and shadow_test (shadow non-member data). All data is normalized using the mean and standard deviation of the respective training sets. For protected models, inputs x are modified via element-wise multiplication with a random mask m, where m_i ∼ N(1, κ²).

Model Architectures.

• MLP ("nn"): A fully-connected network with one hidden layer of 100 neurons (Tanh activation). Input layer: 3,072 features.

• CNN ("cnn"): Two Conv-ReLU-MaxPool blocks, followed by a Tanh-activated fully-connected layer with 100 hidden units.

Training and Hyperparameters. Both models are trained using the Adam optimizer (learning rate 10⁻³, ℓ₂ regularization 10⁻⁷) for intervals between 20 and 120 epochs. The attack model A is a LogisticRegression classifier (scikit-learn), trained on a balanced dataset of member and non-member posteriors.

A.2.1 Multiplicative Gaussian Noise vs. Differential Privacy

We evaluate the privacy-utility tradeoff of Multiplicative Gaussian (MG) noise against Differential Privacy (DP-SGD) on the CIFAR-10 image classification dataset.
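The feature extraction and attack-model training steps of the pipeline above can be sketched as follows. This is a self-contained illustration: the "posteriors" are synthetic Dirichlet samples standing in for shadow-model outputs (peaked ones mimic overconfident member posteriors), and the attack classifier is a tiny NumPy logistic regression standing in for the scikit-learn LogisticRegression used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def top3_features(posteriors):
    """ML-Leaks-style features: the three largest softmax probabilities, sorted."""
    return np.sort(posteriors, axis=1)[:, ::-1][:, :3]

# Hypothetical stand-in for shadow-model posteriors over 10 classes:
# a small Dirichlet concentration gives peaked, member-like outputs.
members = rng.dirichlet(np.full(10, 0.1), size=1000)       # peaked
non_members = rng.dirichlet(np.full(10, 1.0), size=1000)   # flatter

Xattack = np.vstack([top3_features(members), top3_features(non_members)])
yattack = np.concatenate([np.ones(1000), np.zeros(1000)])

# Tiny logistic-regression attack model trained by gradient descent.
w = np.zeros(Xattack.shape[1]); b = 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(Xattack @ w + b)))
    g = p - yattack
    w -= 0.1 * (Xattack.T @ g) / len(yattack)
    b -= 0.1 * g.mean()

acc = ((p > 0.5) == yattack).mean()
print(acc)
```

Because member-like posteriors concentrate mass on the top probability, even this linear attack separates the two populations well, which is exactly the signal the shadow-model attack exploits.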
Following the standard ML-Leaks evaluation protocol (Shokri et al., 2017), we use the dataset partitioning described in the previous section and employ a standard CNN architecture, i.e., a shallow baseline consisting of two convolutional layers (5 × 5 kernels, 32 filters) followed by max-pooling and a fully connected layer. For the multiplicative Gaussian defense, we sweep the noise parameter κ ∈ {0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2}, where κ = 0.0 represents the undefended baseline. For each training batch, we apply element-wise multiplicative noise to the input features, x̃ = x ⊙ (1 + κZ), where Z ∼ N(0, I) is standard Gaussian noise. For the Differential Privacy (DP-SGD) part, we sweep the noise multiplier σ ∈ {0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 3.0} using the Opacus library (Yousefpour et al., 2021). We set the per-sample gradient clipping norm C = 1.0 and target δ = 10⁻⁵.

The empirical findings for this experiment are illustrated in Figure 9. The plots map the privacy leakage (attack precision and recall) against the model's utility (target accuracy) across the swept noise parameters.

Figure 9: Privacy-utility tradeoff for the standard CNN. The left panel shows attack precision vs. accuracy, and the right panel shows attack recall vs. accuracy.

Evaluation on High-Capacity Architecture: We repeated the evaluation using an improved CNN architecture. This model features a deeper convolutional structure (convolutional blocks with increasing filter counts (32 → 64 → 128) using 3 × 3 kernels) and also incorporates Batch Normalization and Dropout to achieve higher baseline utility. Figure 10 below illustrates the results for the improved CNN architecture.
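The multiplicative mask x̃ = x ⊙ (1 + κZ) used by the MG defense is unbiased and carries per-feature noise scaled by |x_i|, which a quick simulation can confirm. This is an illustrative sketch; the input vector and κ = 0.4 (one point of the sweep) are stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
kappa = 0.4
x = rng.uniform(-1.0, 1.0, size=32)   # a single hypothetical input vector

# Apply the MG defense N times: x_masked = x ⊙ (1 + kappa * Z), Z ~ N(0, I)
N = 100_000
Z = rng.normal(size=(N, x.size))
x_masked = x * (1.0 + kappa * Z)

emp_mean = x_masked.mean(axis=0)          # should be ≈ x (mask is mean-one)
emp_var = x_masked.var(axis=0)            # should be ≈ kappa^2 * x^2
print(np.max(np.abs(emp_mean - x)))
print(np.max(np.abs(emp_var - kappa**2 * x**2)))
```

The mean-one property means no bias correction is needed at inference time, while the κ²x² variance is the quantity the convergence analysis tracks.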
We observe that the performance gap between the two methods narrows, and the MG noise curve remains much closer to the attacker's near-random-guess baseline over a longer stretch of the accuracy spectrum.

Figure 10: Privacy-utility tradeoff for the improved CNN. Unlike for the standard CNN, the MG noise curve stays much closer to the DP-SGD curve across the accuracy range.

B Proofs in Section 4

In this section, we first prove an exact form of the expected surrogate loss function.

B.1 General Form of Expected Loss

Lemma B.1. Let u_{i,r} = w_r ⊙ x_i, let σ̂_κ(w, x) = w^⊤x · Φ₁( w^⊤x / (κ‖w ⊙ x‖₂) ), and let f̂(θ, x) = (1/√m) Σ_{r=1}^m a_r σ̂_κ(w_r, x). Then we have

E_C[L_C(θ)] = (1/2) Σ_{i=1}^n ( f̂(θ, x_i) − y_i )²
  + (κ²/(2m)) Σ_{i=1}^n ‖ Σ_{r=1}^m a_r u_{i,r} Φ₁( w_r^⊤x_i / (κ‖u_{i,r}‖₂) ) ‖₂²
  + (1/m) Σ_{i=1}^n Σ_{r,r′=1}^m a_r a_{r′} ( C_{i,r,r′} + (κ²/(2π)) E_{i,r,r′} )
  + (2κ/(√(2π) m)) Σ_{i=1}^n Σ_{r=1}^m a_r G_{i,r} ( (1/√m) Σ_{r′=1}^m a_{r′} T_{i,r,r′} − y_i ),

where C_{i,r,r′}, E_{i,r,r′}, T_{i,r,r′}, and G_{i,r} are defined as

C_{i,r,r′} = ( (w_r^⊤x_i)(w_{r′}^⊤x_i) + κ² u_{i,r}^⊤u_{i,r′} ) · C( w_r^⊤x_i / (κ‖u_{i,r}‖₂), w_{r′}^⊤x_i / (κ‖u_{i,r′}‖₂), u_{i,r}^⊤u_{i,r′} / (‖u_{i,r}‖₂‖u_{i,r′}‖₂) )

E_{i,r,r′} = ‖u_{i,r}‖₂‖u_{i,r′}‖₂ exp( −[ ‖u_{i,r′}‖₂²(w_r^⊤x_i)² − 2(u_{i,r}^⊤u_{i,r′})(w_r^⊤x_i)(w_{r′}^⊤x_i) + ‖u_{i,r}‖₂²(w_{r′}^⊤x_i)² ] / [ 2κ²( ‖u_{i,r}‖₂²‖u_{i,r′}‖₂² − (u_{i,r}^⊤u_{i,r′})² ) ] )

T_{i,r,r′} = w_{r′}^⊤x_i · Φ₁( ( ‖u_{i,r}‖₂² · w_{r′}^⊤x_i − u_{i,r}^⊤u_{i,r′} · w_r^⊤x_i ) / ( κ‖u_{i,r}‖₂ ( ‖u_{i,r}‖₂²‖u_{i,r′}‖₂² − (u_{i,r}^⊤u_{i,r′})² )^{1/2} ) )

G_{i,r} = ‖u_{i,r}‖₂ exp( −(w_r^⊤x_i)² / (2κ²‖u_{i,r}‖₂²) )

Proof. By definition, we have

E_C[L_C(θ)] = (1/2) Σ_{i=1}^n E_{c_i}[ ( f(θ, x_i ⊙ c_i) − y_i )² ]

For simplicity, we fix i ∈ [n] and study E_c[ ( f(θ, x ⊙ c) − y )² ]. In the analysis below, we let u_r = w_r ⊙ x.
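The proof below repeatedly uses the closed form for the mean of a ReLU applied to a Gaussian (Lemma D.18, appearing as Eq. (19)): for z ∼ N(μ, s²), E[σ(z)] = (s/√(2π)) exp(−μ²/(2s²)) + μΦ₁(μ/s). A quick Monte Carlo check of this identity, with μ and s as arbitrary stand-ins for w_r^⊤x and κ‖w_r ⊙ x‖₂:

```python
import math
import random

def relu_mean(mu, s):
    """Closed form E[ReLU(z)] for z ~ N(mu, s^2), as in Eq. (19)."""
    Phi = 0.5 * (1.0 + math.erf(mu / (s * math.sqrt(2.0))))
    return s / math.sqrt(2.0 * math.pi) * math.exp(-mu**2 / (2.0 * s**2)) + mu * Phi

random.seed(0)
mu, s = 0.7, 0.3
N = 200_000
mc = sum(max(random.gauss(mu, s), 0.0) for _ in range(N)) / N
print(abs(mc - relu_mean(mu, s)))   # Monte Carlo agrees with the closed form
```

The same identity, applied to the correlated pair (c^⊤u_r, c^⊤u_{r′}) via its bivariate analogue (Lemma D.9), produces the C, E, T, G terms of Lemma B.1.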
In particular, we hav e E c h ( f ( θ , x ⊙ c ) − y ) 2 i = E c h f ( θ , x ⊙ c ) 2 i − 2 y E c [ f ( θ , x ⊙ c )] + y 2 Here w e shall ev aluate the tw o exp ectations separately . T o start, for the first-order term, we hav e E c [ f ( θ , x ⊙ c )] = 1 √ m m X r =1 a r E  σ  w ⊤ r ( c ⊙ x )  = 1 √ m m X r =1 a r E  σ  c ⊤ ( w r ⊙ x )  (18) Notice that since c ∼ N  1 , κ 2 I  . By Lemma D.13 w e hav e that c ⊤ ( w r ⊙ x ) ∼ N  w ⊤ r x , κ 2 ∥ w r ⊙ x ∥ 2 2  , since 1 ⊤ ( w r ⊙ x ) = w ⊤ r x . Applying Lemma D.18 with z = c ⊤ ( w r ⊙ x ) , mean w ⊤ r x , and standard deviation κ ∥ w r ⊙ x ∥ , w e ha ve that E  σ  c ⊤ ( w r ⊙ x )  = κ ∥ w r ⊙ x ∥ 2 √ 2 π exp −  w ⊤ r x  2 2 κ 2 ∥ w r ⊙ x ∥ 2 2 ! + w ⊤ r xΦ 1  w ⊤ r x κ ∥ w r ⊙ x ∥ 2  (19) Plugging in to the form of (18) gives E c [ f ( θ , x ⊙ c )] = 1 √ m m X r =1 a r ˆ σ κ ( w r , x ) + κ √ 2 π m m X r =1 a r ∥ u r ∥ 2 exp −  w ⊤ r x  2 2 κ 2 ∥ u r ∥ 2 2 ! (20) 22 Next, w e fo cus on the second-order term. Since w ⊤ r ( c ⊙ x ) = c ⊤ ( w r ⊙ x ) , w e ha ve E c h f ( θ , x ⊙ c ) 2 i = 1 m m X r,r ′ =1 a r a r ′ E  σ  c ⊤ u r  σ  c ⊤ u r ′  Let z 1 = c ⊤ u r and z 2 = c ⊤ u r ′ , w e ha v e that z 1 ∼ N  w ⊤ r x , κ 2 ∥ u r ∥ 2 2  , z 2 ∼ N  w ⊤ r ′ x , κ 2 ∥ u r ′ ∥ 2 2  , and Co v ( z 1 , z 2 ) = κ 2 u ⊤ r u r ′ . Applying Lemma D.9 with a = b = 0 gives E  σ  c ⊤ u r  σ  c ⊤ u r ′  =  w ⊤ r x   w ⊤ r ′ x  + κ 2 u ⊤ r u r ′  Φ 2  w ⊤ r x κ ∥ u r ∥ 2 , w ⊤ r ′ x κ ∥ u r ′ ∥ 2 , u ⊤ r u r ′ ∥ u r ∥ 2 ∥ u r ′ ∥ 2  + κ 2 2 π ∥ u r ∥ 2 ∥ u r ′ ∥ 2 exp   − ∥ u r ′ ∥ 2 2  w ⊤ r x  2 − 2  u ⊤ r u r ′   w ⊤ r x   w ⊤ r ′ x  + ∥ u r ∥ 2 2  w ⊤ r ′ x  2 2 κ 2  ∥ u r ∥ 2 2 ∥ u r ′ ∥ 2 2 − ( u ⊤ r u r ′ ) 2    + κ √ 2 π  ∥ u r ∥ 2 · w ⊤ r ′ x · ˆ T 1 ,r,r ′ + ∥ u r ′ ∥ 2 · w ⊤ r x · ˆ T 2 ,r,r ′  where ˆ T 1 ,r,r ′ , ˆ T 2 ,r,r ′ are defined as ˆ T 1 ,r,r ′ = exp −  w ⊤ r x  2 2 κ 2 ∥ u r ∥ 2 2 ! 
Φ 1    ∥ u r ∥ 2 2 · w ⊤ r ′ x − u ⊤ r u r ′ · w ⊤ r x κ ∥ u r ∥ 2  ∥ u r ∥ 2 2 ∥ u r ′ ∥ 2 2 − ( u ⊤ r u r ′ ) 2  1 2    ˆ T 2 ,r,r ′ = exp −  w ⊤ r ′ x  2 2 κ 2 ∥ u r ′ ∥ 2 2 ! Φ 1    ∥ u r ′ ∥ 2 2 · w ⊤ r x − u ⊤ r u r ′ · w ⊤ r ′ x κ ∥ u r ′ ∥ 2  ∥ u r ∥ 2 2 ∥ u r ′ ∥ 2 2 − ( u ⊤ r u r ′ ) 2  1 2    F or the simplicity of notations, we define T 1 ,r,r ′ = ∥ u r ∥ 2 · w ⊤ r ′ x · ˆ T 1 ,r,r ′ and T 2 ,r,r ′ = ∥ u r ′ ∥ 2 · w ⊤ r x · ˆ T 2 ,r,r ′ . Moreo ver, w e define E r,r ′ = ∥ u r ∥ 2 ∥ u r ′ ∥ 2 exp   − ∥ u r ′ ∥ 2 2  w ⊤ r x  2 − 2  u ⊤ r u r ′   w ⊤ r x   w ⊤ r ′ x  + ∥ u r ∥ 2 2  w ⊤ r ′ x  2 2 κ 2  ∥ u r ∥ 2 2 ∥ u r ′ ∥ 2 2 − ( u ⊤ r u r ′ ) 2    Lastly , we use the definition of the Gaussian Copula function C ( a, b, ρ ) = Φ 2 ( a, b, ρ ) − Φ 1 ( a ) Φ 1 ( b ) and define C r,r ′ =  w ⊤ r x   w ⊤ r ′ x  + κ 2 u ⊤ r u r ′  C  w ⊤ r x κ ∥ u r ∥ 2 , w ⊤ r ′ x κ ∥ u r ′ ∥ 2 , u ⊤ r u r ′ ∥ u r ∥ 2 ∥ u r ′ ∥ 2  Under these definitions, we hav e that E  σ  c ⊤ u r  σ  c ⊤ u r ′  =  w ⊤ r x   w ⊤ r ′ x  + κ 2 u ⊤ r u r ′  Φ 1  w ⊤ r x κ ∥ u r ∥ 2  Φ 1  w ⊤ r ′ x κ ∥ u r ′ ∥ 2  + C r,r ′ + κ 2 2 π E r,r ′ + κ √ 2 π ( T 1 ,r,r ′ + T 2 ,r,r ′ ) = ˆ σ κ ( w r , x ) ˆ σ κ ( w r ′ , x ) + κ 2 u ⊤ r u r ′ Φ 1  w ⊤ r x κ ∥ u r ∥ 2  Φ 1  w ⊤ r ′ x κ ∥ u r ′ ∥ 2  + C r,r ′ + κ 2 2 π E r,r ′ + κ √ 2 π ( T 1 ,r,r ′ + T 2 ,r,r ′ ) 23 Plugging bac k into the expression of E c h f ( θ , x ⊙ c ) 2 i giv es E c h f ( θ , x ⊙ c ) 2 i = 1 m m X r,r ′ =1 a r a r ′ ˆ σ κ ( w r , x ) ˆ σ κ ( w r ′ , x ) + 1 m m X r,r ′ =1 a r a r ′ κ 2 u ⊤ r u r ′ Φ 1  w ⊤ r x κ ∥ u r ∥ 2  Φ 1  w ⊤ r ′ x κ ∥ u r ′ ∥ 2  + 1 m m X r,r ′ =1 a r a r ′  C r,r ′ + κ 2 2 π E r,r ′ + κ √ 2 π ( T 1 ,r,r ′ + T 2 ,r,r ′ )  = 1 √ m m X r =1 a r ˆ σ κ ( w r , x ) ! 
2 + κ 2      1 √ m m X r =1 a r u r Φ  w ⊤ r x κ ∥ u r ∥ 2       2 2 + 1 m m X r,r ′ =1 a r a r ′  C r,r ′ + κ 2 2 π E r,r ′ + κ √ 2 π ( T 1 ,r,r ′ + T 2 ,r,r ′ )  Com bining the expression of E c h f ( θ , x ⊙ c ) 2 i and E c [ f ( θ , x ⊙ c )] , and noticing T 1 ,r,r ′ = T 2 ,r ′ ,r , we hav e E c h ( f ( θ , x ⊙ c ) − y ) 2 i = 1 √ m m X r =1 a r ˆ σ κ ( w r , x ) ! 2 + κ 2      1 √ m m X r =1 a r u r Φ  w ⊤ r x κ ∥ u r ∥ 2       2 2 + 1 m m X r,r ′ =1 a r a r ′  C r,r ′ + κ 2 2 π E r,r ′ + κ √ 2 π ( T 1 ,r,r ′ + T 2 ,r,r ′ )  − 2 y √ m m X r =1 a r ˆ σ κ ( w r , x ) − 2 κy √ 2 π m m X r =1 a r ∥ u r ∥ 2 exp −  w ⊤ r x  2 ∥ u r ∥ 2 2 ! + y 2 = 1 √ m m X r =1 a r ˆ σ κ ( w r , x ) − y ! 2 + κ 2 m      m X r =1 a r u r Φ  w ⊤ r x κ ∥ u r ∥ 2       2 2 + 1 m m X r,r ′ =1 a r a r ′  C r,r ′ + κ 2 2 π E r,r ′ + 2 κ √ 2 π T 1 ,r,r ′  − 2 κy √ 2 π m m X r =1 a r ∥ u r ∥ 2 exp −  w ⊤ r x  2 2 κ 2 ∥ u r ∥ 2 2 ! T o extend to the case of x i , y i , we need to re-define C i,r,r ′ =  w ⊤ r x i   w ⊤ r ′ x i  + κ 2 u ⊤ i,r u i,r ′  C w ⊤ r x i κ ∥ u i,r ∥ 2 , w ⊤ r ′ x i κ ∥ u i,r ′ ∥ 2 , u ⊤ i,r u i,r ′ ∥ u i,r ∥ 2 ∥ u i,r ′ ∥ 2 ! E i,r,r ′ = ∥ u i,r ∥ 2 ∥ u i,r ′ ∥ 2 exp   − ∥ u i,r ′ ∥ 2 2  w ⊤ r x i  2 − 2  u ⊤ i,r u i,r ′   w ⊤ r x i   w ⊤ r ′ x i  + ∥ u i,r ∥ 2 2  w ⊤ r ′ x i  2 2 κ 2  ∥ u i,r ∥ 2 2 ∥ u i,r ′ ∥ 2 2 −  u ⊤ i,r u i,r ′  2    T i,r,r ′ = w ⊤ r ′ x i · Φ 1    ∥ u i,r ∥ 2 2 · w ⊤ r ′ x i − u ⊤ i,r u i,r ′ · w ⊤ r x i κ ∥ u i,r ∥ 2  ∥ u i,r ∥ 2 2 ∥ u i,r ′ ∥ 2 2 −  u ⊤ i,r u i,r ′  2  1 2    G i,r = ∥ u i,r ∥ 2 exp −  w ⊤ r x i  2 2 κ 2 ∥ u r ∥ 2 2 ! 24 Moreo ver, let ˆ f ( θ , x ) = 1 √ m P m r =1 a r ˆ σ κ ( w r , x ) . 
Then we hav e that E C [ L C ( θ )] = 1 2 n X i =1  ˆ f ( θ , x i ) − y i  2 2 + κ 2 2 m n X i =1      m X r =1 a r u i,r Φ  w ⊤ r x i κ ∥ u i,r ∥ 2       2 2 + 1 m n X i =1 m X r,r ′ =1 a r a r ′ C i,r,r ′ + κ 2 2 π E i,r,r ′ + κ r 2 π T i,r,r ′ G i,r ! − κy r 2 π m n X i =1 m X r =1 a r G i,r = 1 2 n X i =1  ˆ f ( θ , x i ) − y i  2 2 + κ 2 2 m n X i =1      m X r =1 a r u i,r Φ  w ⊤ r x i κ ∥ u i,r ∥ 2       2 2 + 1 m n X i =1 m X r,r ′ =1 a r a r ′  C i,r,r ′ + κ 2 2 π E i,r,r ′  + 2 κ √ 2 π m n X i =1 m X r =1 a r G i,r 1 √ m m X r ′ =1 a ′ r T i,r,r ′ − y ! B.2 Pro of of Theorem 4.2 Pr o of. Let u i,r = w r ⊙ x i . By Lemma B.1, we hav e that E C [ L C ( θ )] = 1 2 n X i =1  ˆ f ( θ , x i ) − y i  2 2 + κ 2 2 m n X i =1      m X r =1 a r u i,r Φ  w ⊤ r x i κ ∥ u i,r ∥ 2       2 2 + 1 m n X i =1 m X r,r ′ =1 a r a r ′  C i,r,r ′ + κ 2 2 π E i,r,r ′  + 2 κ √ 2 π m n X i =1 m X r =1 a r G i,r 1 √ m m X r ′ =1 a ′ r T i,r,r ′ − y ! where C i,r,r ′ , E i,r,r ′ , T i,r,r ′ and G i,r are defined as C i,r,r ′ =  w ⊤ r x i   w ⊤ r ′ x i  + κ 2 u ⊤ i,r u i,r ′  C w ⊤ r x i κ ∥ u i,r ∥ 2 , w ⊤ r ′ x i κ ∥ u i,r ′ ∥ 2 , u ⊤ i,r u i,r ′ ∥ u i,r ∥ 2 ∥ u i,r ′ ∥ 2 ! E i,r,r ′ = ∥ u i,r ∥ 2 ∥ u i,r ′ ∥ 2 exp   − ∥ u i,r ′ ∥ 2 2  w ⊤ r x i  2 − 2  u ⊤ i,r u i,r ′   w ⊤ r x i   w ⊤ r ′ x i  + ∥ u i,r ∥ 2 2  w ⊤ r ′ x i  2 2 κ 2  ∥ u i,r ∥ 2 2 ∥ u i,r ′ ∥ 2 2 −  u ⊤ i,r u i,r ′  2    T i,r,r ′ = w ⊤ r ′ x i · Φ 1    ∥ u i,r ∥ 2 2 · w ⊤ r ′ x i − u ⊤ i,r u i,r ′ · w ⊤ r x i κ ∥ u i,r ∥ 2  ∥ u i,r ∥ 2 2 ∥ u i,r ′ ∥ 2 2 −  u ⊤ i,r u i,r ′  2  1 2    G i,r = ∥ u i,r ∥ 2 exp −  w ⊤ r x i  2 2 κ 2 ∥ u r ∥ 2 2 ! Therefore, the pro of of the theorem relies on the upp er b ound of C i,r,r ′ , E i,r,r ′ , T i,r,r ′ and G i,r . 
T o upp er-b ound C i,r,r ′ , w e utilize the result in that | C ( a, b, ρ ) | ≤ | arcsin ρ | 2 π exp  − a 2 − 2 ρab + b 2 2 (1 − ρ 2 )  ≤ | ρ | 4 exp  − a 2 + b 2 4  25 where we used Lemma D.19 that | arcsin x | ≤ π 2 · | x | . Plugging in a = w ⊤ r x i κ ∥ u i,r ∥ 2 , b = w ⊤ r ′ x i κ ∥ u i,r ′ ∥ 2 and ρ = u ⊤ i,r u i,r ′ ∥ u i,r ∥ 2 ∥ u i,r ′ ∥ 2 giv es | C i,r,r ′ | ≤    w ⊤ r x i   w ⊤ r ′ x i  + κ 2 u ⊤ i,r u i,r ′   ·   u ⊤ i,r u i,r ′   4 ∥ u i,r ∥ 2 ∥ u i,r ′ ∥ 2 exp − 1 4 κ 2  w ⊤ r x i  2 ∥ u i,r ∥ 2 2 +  w ⊤ r ′ x i  2 ∥ u i,r ′ ∥ 2 2 !! ≤ 1 4     w ⊤ r x i   w ⊤ r ′ x i    + κ 2 ∥ u i,r ∥ 2 ∥ u i,r ′ ∥ 2  ϕ  w ⊤ r x i 2 κ ∥ u i,r ∥ 2  ϕ  w ⊤ r ′ x i 2 κ ∥ u i,r ′ ∥ 2  = κ 2 4 ∥ u i,r ∥ 2 ∥ u i,r ′ ∥ 2   w ⊤ r x i   κ ∥ u i,r ∥ 2 ·   w ⊤ r ′ x i   κ ∥ u i,r ∥ 2 + 1 ! ϕ  w ⊤ r x i 2 κ ∥ u i,r ∥ 2  ϕ  w ⊤ r ′ x i 2 κ ∥ u i,r ′ ∥ 2  = κ 2 4 ∥ u i,r ∥ 2 ∥ u i,r ′ ∥ 2  ψ  w ⊤ r x i 2 κ ∥ u i,r ∥ 2  ψ  w ⊤ r ′ x i 2 κ ∥ u i,r ′ ∥ 2  + ϕ  w ⊤ r x i 2 κ ∥ u i,r ∥ 2  ϕ  w ⊤ r ′ x i 2 κ ∥ u i,r ′ ∥ 2  where w e use the definition P i,r =   w ⊤ r x i   · exp  − ( w ⊤ r x i ) 2 4 κ 2 ∥ u i,r ∥ 2 2  . F or the term E i,r,r ′ , w e notice that b y letting a = w ⊤ r x i κ ∥ u i,r ∥ 2 , b = w ⊤ r ′ x i κ ∥ u i,r ′ ∥ 2 and ρ = u ⊤ i,r u i,r ′ ∥ u i,r ∥ 2 ∥ u i,r ′ ∥ , we hav e exp   − ∥ u i,r ′ ∥ 2 2  w ⊤ r x i  2 − 2  u ⊤ i,r u i,r ′   w ⊤ r x i   w ⊤ r ′ x i  + ∥ u i,r ∥ 2 2  w ⊤ r ′ x i  2 2 κ 2  ∥ u i,r ∥ 2 2 ∥ u i,r ′ ∥ 2 2 −  u ⊤ i,r u i,r ′  2    = exp  − a 2 − 2 ρab + b 2 2 (1 − ρ 2 )  Using exp  − a 2 − 2 ρab + b 2 2(1 − ρ 2 )  ≤ exp  − a 2 + b 2 4  , we hav e that | E i,r,r ′ | ≤ ∥ u i,r ∥ 2 ∥ u i,r ′ ∥ 2 exp − 1 4 κ 2  w ⊤ r x i  2 ∥ u i,r ∥ 2 2 +  w ⊤ r ′ x i  2 ∥ u i,r ′ ∥ 2 2 !! 
= ∥ u i,r ∥ 2 ∥ u i,r ′ ∥ 2 ϕ  w ⊤ r x i 2 κ ∥ u i,r ∥ 2  ϕ  w ⊤ r ′ x i 2 κ ∥ u i,r ′ ∥ 2  Therefore, w e hav e     C i,r,r ′ + κ 2 2 π E i,r,r ′     ≤ κ 2 4 ∥ u i,r ∥ 2 ∥ u i,r ′ ∥ 2  ψ  w ⊤ r x i 2 κ ∥ u i,r ∥ 2  ψ  w ⊤ r ′ x i 2 κ ∥ u i,r ′ ∥ 2  + ϕ  w ⊤ r x i 2 κ ∥ u i,r ∥ 2  ϕ  w ⊤ r ′ x i 2 κ ∥ u i,r ′ ∥ 2  This giv es that       m X r,r ′ =1 a r a r ′  C i,r,r ′ + κ 2 2 π E i,r,r ′        ≤ κ 2 4   m X r =1 ∥ u i,r ∥ 2 ψ  w ⊤ r x i 2 κ ∥ u i,r ∥ 2  ! 2 + m X r =1 ∥ u i,r ∥ 2 ϕ  w ⊤ r x i 2 κ ∥ u i,r ∥ 2  ! 2   (21) By definition, we hav e ∥ u i,r ∥ 2 ≤ R u . Therefore       m X r,r ′ =1 a r a r ′  C i,r,r ′ + κ 2 2 π E i,r,r ′        ≤ 1 4 κ 2 R 2 u   m X r =1 ψ  w ⊤ r x i 2 κ ∥ u i,r ∥ 2  ! 2 + m X r =1 ϕ  w ⊤ r x i 2 κ ∥ u i,r ∥ 2  ! 2   26 Next, we fo cus on the term T i,r,r ′ and G i,r . By the property of CDF, we hav e that | T i,r,r ′ | ≤   w ⊤ r ′ x i   ≤ ∥ w r ∥ 2 . Therefore      1 √ m m X r ′ =1 a r ′ T i,r,r ′ − y i      ≤ 1 √ m m X r ′ =1 ∥ w r ′ ∥ 2 + | y i | ≤ √ mR w + B y ≤ 2 √ mR w where w e applied ∥ w r ∥ 2 ≤ R w and B y ≤ 3 √ mR w . Thus      m X r =1 a r G i,r 1 √ m m X r ′ =1 a r ′ T i,r,r ′ − y i !      ≤ m X r =1 G i,r · 2 √ mR w ≤ √ mR w m X r =1 ϕ  w ⊤ r x i 2 κ ∥ u i,r ∥ 2  where w e used ∥ u i,r ∥ 2 ≤ R u . Combining the inequality ab ov e and (21), we ha ve |E | ≤ nκ 2 R 2 u 4 m   m X r =1 ψ  w ⊤ r x i 2 κ ∥ u i,r ∥ 2  ! 2 + m X r =1 ϕ  w ⊤ r x i 2 κ ∥ u i,r ∥ 2  ! 2   + nκR w 2 m X r =1 ϕ  w ⊤ r x i 2 κ ∥ u i,r ∥ 2  Applying the definition of ψ max and ϕ max giv es the desired results. B.3 Pro of of Theorem 4.9 Pr o of. 
By the form of the gradient, we hav e E C k [ ∇ w r L C ( θ )] = a r √ m n X i =1 E C [( f ( θ , x i ⊙ c i ) − y i ) x i ⊙ c i I {⟨ w r , x i ⊙ c i ⟩ ≥ 0 } ] = a r √ m n X i =1 E c i [ f ( θ , x i ⊙ c i ) x i ⊙ c i I {⟨ w r , x i ⊙ c i ⟩ ≥ 0 } ] | {z } T 1 ,i − a r √ m n X i =1 y i E c i [ x i ⊙ c i I {⟨ w r , x i ⊙ c i ⟩ ≥ 0 } ] | {z } T 2 ,i (22) Let u r,i = w r ⊙ x i . F or T 1 ,i , we further hav e T 1 ,i = 1 √ m m X r ′ =1 a r ′ E c i  σ  w ⊤ r ′ ( x i ⊙ c i )  x i ⊙ c i I  w ⊤ r ( x i ⊙ c i ) ≥ 0  = 1 √ m m X r ′ =1 a r ′ E c i h ( x i ⊙ c i ) ( x i ⊙ c i ) ⊤ w r ′ I  w ⊤ r ( x i ⊙ c i ) ≥ 0; w ⊤ r ′ ( x i ⊙ c i ) ≥ 0  i = 1 √ m m X r ′ =1 a r ′  E c i  c i c ⊤ i I  w ⊤ r ( x i ⊙ c i ) ≥ 0; w ⊤ r ′ ( x i ⊙ c i ) ≥ 0  ⊙  x i x ⊤ i  w r ′ = 1 √ m m X r ′ =1 a r ′  E c i  c i c ⊤ i I  u ⊤ r,i c i ≥ 0; u ⊤ r ′ ,i c i ≥ 0  ⊙  x i x ⊤ i  w r ′ = 1 √ m m X r =1 a r ′ Diag ( x ) i E c i  c i c ⊤ i I  u ⊤ r,i c i ≥ 0; u ⊤ r ′ ,i c i ≥ 0  u r ′ ,i F or T 2 ,i , we can easily obtain T 2 ,i = E c i  c i I  u ⊤ r,i c i ≥ 0  ⊙ x i Abstractly , we are thus in terested in the following quantit y: E c  cc ⊤ I  c ⊤ u ≥ 0; c ⊤ v ≥ 0  ; E c  c I  c ⊤ u ≥ 0  where c ∼ N  µ , κ 2 I  , and u , v are fixed vectors. Let z 1 = c ⊤ u and z 2 = c ⊤ v . Then we hav e z 1 ∼ N  c ⊤ u , κ 2 ∥ u ∥ 2 2  ; z 2 ∼ N  c ⊤ v , κ 2 ∥ v ∥ 2 2  27 A ccording to Lemma D.11 and Lemma D.12, and by defining µ = 1 , we hav e that ∆ (1) k,r,i ∈ R d and ∆ (2) k,r,r ′ ,i ∈ R d × d defined b elo w ∆ (1) k,r,i := E c i  c i I  c ⊤ i u r,i ≥ 0  − 1 · Φ 1  w ⊤ r x i κ ∥ w r ⊙ x i ∥ 2  ∆ k,r,r ′ ,i := E c  cc ⊤ I  u ⊤ r,i c i ≥ 0; u ⊤ r ′ ,i c i ≥ 0  u r ′ ,i −  11 ⊤ u r ′ ,i + 3 κ 2 u r ′ ,i  Φ 1  w ⊤ r x i κ ∥ w r ⊙ x i ∥ 2  Φ 1  w ⊤ r ′ x i κ ∥ w r ′ ⊙ x i ∥ 2  satisfies    ∆ (1) r,i    ∞ ≤ κR u ϕ max    ∆ (2) r,r ′ ,i    ∞ ≤ 4 κ ∥ v ∥ 2  √ dϕ max + ψ max  Here w e used ∥ µ ∥ ∞ = 1 and µ ⊤ ( w r ⊙ x i ) = w ⊤ k,r x i when µ = 1 . 
Therefore, for T 1 ,i , we hav e T 1 ,i = 1 √ m m X r ′ =1 a r ′ Diag ( x i )   w ⊤ r ′ x i · 1  Φ 1  w ⊤ r ′ x i κ ∥ w r ′ s ⊙ x i ∥ 2  Φ 1  w ⊤ r ′ x i κ ∥ w r ′ ⊙ x i ∥ 2  + ∆ (2) r,r ′ ,i  + 3 κ 2 √ m m X r ′ =1 a r ′ Diag ( x i ) ( w r ⊙ x i ) Φ 1  w ⊤ r x i κ ∥ w r ⊙ x i ∥ 2  Φ 1  w ⊤ r ′ x i κ ∥ w r ′ ⊙ x i ∥ 2  = 1 √ m m X r ′ =1 a r ′ w ⊤ r ′ x i · x i Φ 1  w ⊤ r ′ x i κ ∥ w r ′ s ⊙ x i ∥ 2  Φ 1  w ⊤ r ′ x i κ ∥ w r ′ ⊙ x i ∥ 2  + 1 √ m m X r ′ =1 a r ′  x i ⊙ ∆ (2) r,r ′ ,i  + 3 κ 2 √ m m X r ′ =1 a r ′ Diag ( x i ) 2 w r ′ Φ 1  w ⊤ r ′ x i κ ∥ w r ′ s ⊙ x i ∥ 2  Φ 1  w ⊤ r ′ x i κ ∥ w r ′ ⊙ x i ∥ 2  = f ( θ , x i ) x i I  w ⊤ r x i ≥ 0  + 3 κ 2 √ m m X r ′ =1 a r ′ Diag ( x i ) 2 w r ′ I  w ⊤ r x i ≥ 0; w ⊤ r ′ x i ≥ 0  + 1 √ m m X r ′ =1 a r ′  x i ⊙ ∆ (2) r,r ′ ,i  + g 1 ,i + g 2 ,i where g 1 ,i = 1 √ m m X r ′ =1 a r ′ w ⊤ r ′ x i · x i  Φ 1  w ⊤ r ′ x i κ ∥ w r ′ s ⊙ x i ∥ 2  Φ 1  w ⊤ r ′ x i κ ∥ w r ′ ⊙ x i ∥ 2  − I  w ⊤ r x i ≥ 0; w ⊤ r ′ x i ≥ 0   g 2 ,i = 3 κ 2 √ m m X r ′ =1 a r ′ Diag ( x i ) 2 w r ′  Φ 1  w ⊤ r ′ x i κ ∥ w r ′ s ⊙ x i ∥ 2  Φ 1  w ⊤ r ′ x i κ ∥ w r ′ ⊙ x i ∥ 2  − I  w ⊤ r x i ≥ 0; w ⊤ r ′ x i ≥ 0   Lik ely , for T 2 ,i w e ha ve T 2 ,i =  1Φ 1  w ⊤ r x i κ ∥ w r ⊙ x i ∥ 2  + ∆ (1) r,i  ⊙ x i = x i Φ 1  w ⊤ r x i κ ∥ w r ⊙ x i ∥ 2  + ∆ (1) r,i ⊙ x i = x i I  w ⊤ r x i ≥ 0  + ∆ (1) r,i ⊙ x i + g 3 ,i 28 where g 3 ,i = x i ·  Φ 1  w ⊤ r x i κ ∥ w r ⊙ x i ∥ 2  − I  w ⊤ r x i ≥ 0   . Therefore, the final gradient is giv en b y E C [ ∇ w r L C ( θ )] = a r √ m n X i =1 ( f ( θ , x i ) − y i ) x i I  w ⊤ r x i  + a r √ m n X i =1 1 √ m m X r ′ =1 a r ′  x i ⊙ ∆ (2) r,r ′ ,i  + y i ∆ (1) r,i ⊙ x i ! 
+ 3 κ 2 √ m m X r ′ =1 a r ′ Diag ( x i ) 2 w r ′ I  w ⊤ r x i ≥ 0; w ⊤ r ′ x i ≥ 0  + a r √ m n X i =1 ( g 1 ,i + g 2 ,i − y i · g 3 ,i ) = ∇ w r L ( θ ) + 3 κ 2 √ m m X r ′ =1 a r ′ Diag ( x i ) 2 w r ′ I  w ⊤ r x i ≥ 0; w ⊤ r ′ x i ≥ 0  + a r √ m n X i =1 1 √ m m X r ′ =1 a r ′  x i ⊙ ∆ (2) r,r ′ ,i  + y i ∆ (1) r,i ⊙ x i ! | {z } g 4 + a r √ m n X i =1 ( g 1 ,i + g 2 ,i − y i · g 3 ,i ) Notice that we can re-write g 1 ,i as g 1 ,i = x i  Φ 1  w ⊤ r x i κ ∥ w r ⊙ x i ∥ 2  − I  w ⊤ r x i ≥ 0   · 1 √ m m X r ′ =1 a r ′ w ⊤ r ′ x i I  w ⊤ r ′ x i ≥ 0  + x i Φ 1  w ⊤ r x i κ ∥ w r ⊙ x i ∥ 2  · 1 √ m m X r ′ =1 a r ′ w ⊤ r ′ x i  Φ 1  w ⊤ r ′ x i κ ∥ w r ′ ⊙ x i ∥ 2  − I  w ⊤ r ′ x i ≥ 0   = x i Φ 1  w ⊤ r x i κ ∥ w r ⊙ x i ∥ 2  · 1 √ m m X r ′ =1 a r ′ w ⊤ r ′ x i  Φ 1  w ⊤ r ′ x i κ ∥ w r ′ ⊙ x i ∥ 2  − I  w ⊤ r ′ x i ≥ 0   + x i  Φ 1  w ⊤ r x i κ ∥ w r ⊙ x i ∥ 2  − I  w ⊤ r x i ≥ 0   · f ( θ , x i ) Then, b y the definition of g 3 ,i , we hav e that g 1 ,i − y i · g 3 ,i = x i Φ 1  w ⊤ r x i κ ∥ w r ⊙ x i ∥ 2  · 1 √ m m X r ′ =1 a r ′ w ⊤ r ′ x i  Φ 1  w ⊤ r ′ x i κ ∥ w r ′ ⊙ x i ∥ 2  − I  w ⊤ r ′ x i ≥ 0   + ( f ( θ , x i ) − y i ) x i  Φ 1  w ⊤ r x i κ ∥ w r ⊙ x i ∥ 2  − I  w ⊤ r x i ≥ 0   Using Lemma D.4, we hav e that | Φ 1 ( a ) − I { a ≥ 0 }| ≤ exp  − a 2 2  ≤ ϕ  a 2  29 Therefore, w e hav e that      n X i =1 ( g 1 ,i − y i · g 3 ,i )      2 ≤ n √ m      m X r ′ =1 a r ′ w ⊤ r ′ x i  Φ 1  w ⊤ r ′ x i κ ∥ w r ′ ⊙ x i ∥ 2  − I  w ⊤ r ′ x i ≥ 0        2 +      n X i =1 ( f ( θ , x i ) − y i ) · x i  Φ 1  w ⊤ r x i κ ∥ w r ⊙ x i ∥ 2  − I  w ⊤ r x i ≥ 0        2 ≤ n √ m m X r ′ =1   w ⊤ r ′ x i   ϕ  w ⊤ r ′ x i 2 κ ∥ w r ′ ⊙ x i ∥ 2  + ∥ Diag ( ∆ ) X ( f ( θ − y )) ∥ 2 = κ √ mR u ψ max + σ max ( X ) ϕ max L ( θ ) 1 2 Moreo ver, w e can b ound g 2 ,i as ∥ g 2 ,i ∥ ≤ 3 κ 2 √ m m X r ′ =1 ∥ x ∥ 2 ∞ ∥ w r ∥ · 2 ϕ max ≤ 6 κ 2 √ mB 2 x R w ϕ max Lastly , we can b 
ound g 3 as ∥ g 3 ∥ 2 ≤ 1 m n X i =1 m X r ′ =1    x i ⊙ ∆ (2) r,r ′ ,i    + 1 √ m n X i =1 | y i |    ∆ (1) r,i ⊙ x i    2 ≤ 1 m n X i =1 m X r ′ =1    ∆ (2) r,r ′ ,i    ∞ ∥ x i ∥ 2 + 1 √ m n X i =1 | y i |    ∆ (1) r,r ′ ,i    ∞ ∥ x i ∥ 2 ≤ 1 m n X i =1 m X r ′ =1    ∆ (2) r,r ′ ,i    ∞ + B y √ m n X i =1    ∆ (1) r,r ′ ,i    ∞ ≤ 4 nκR u  √ dϕ max + ψ max  + B y √ m · nκR u ϕ max ≤ 5 nκR u  √ dϕ max + ψ max  when B y ≤ √ md . Therefore, we hav e that      E C [ ∇ w r L C ( θ )] − ∇ w r L ( θ ) + 3 κ 2 √ m m X r ′ =1 a r ′ Diag ( x i ) 2 w r ′ I  w ⊤ r x i ≥ 0; w ⊤ r ′ x i ≥ 0  !      2 ≤ nκR u ψ max + σ max ( X ) ϕ max √ m L ( θ ) 1 2 + 6 nκ 2 B 2 x R w ϕ max + 5 nκR u  √ dϕ max + ψ max  ≤  σ max ( X ) √ m L ( θ ) 1 2 + 6 nκ 2 B 2 x R w + 5 nκR u √ d  ϕ max + 6 nκR u ψ max C Pro ofs in Section 5 C.1 Pro of of Theorem 5.2 Pr o of. T o start the pro of, w e define the following quantit y in the standard NTK-based analysis of tw o-lay er ReLU neural netw ork. Let R = C 1 · τ λ 0 n for some C 1 > 0 , we define even t A i,r and set S i , S ⊥ i as A i,r =  ∃ w ∈ B ( w 0 ,r , R ) : I  w ⊤ 0 ,r x i ≥ 0   = I  w ⊤ x i ≥ 0  (23) S i = { r ∈ [ m ] : ¬ A i,r } ; S ⊥ i = [ m ] \ S i (24) 30 Lemma 16 from shows that with probability at least 1 − n exp  − mR τ  , we ha v e that   S ⊥ i   ≤ 4 mR τ . In the follo wing of the pro of, w e assume that such even t holds. Define K ′ = min  k ∈ N : ∃ r ∈ [ m ] s.t. ∥ w k,r − w 0 ,r ∥ 2 > R  . Then for all k < K ′ , w e hav e that w k,r ∈ B ( w 0 ,r , R ) . Fix an y k < K ′ − 1 . 
Consider the expansion of L ( θ k +1 ) as the following L ( θ k +1 ) = 1 2 n X i =1 ( f ( θ k +1 , x i ) − y i ) 2 = 1 2 n X i =1 (( f ( θ k +1 , x i ) − f ( θ k , x i )) + ( f ( θ k , x i ) − y i )) 2 = 1 2 n X i =1 ( f ( θ k +1 , x i ) − f ( θ k , x i )) 2 + n X i =1 ( f ( θ k +1 , x i ) − f ( θ k , x i )) ( f ( θ k , x i ) − y i ) + 1 2 n X i =1 ( f ( θ k , x i ) − y i ) 2 (25) W e will analyze the three terms separately . T o start, notice that 1 2 n X i =1 ( f ( θ k , x i ) − y i ) 2 = L ( θ k ) (26) F or the first term, by the definition of f ( θ , x ) , we ha ve | f ( θ k +1 , x i ) − f ( θ k , x i ) | =      1 √ m m X r =1 a r  σ  w ⊤ k +1 ,r x i  − σ  w ⊤ k,r x i       ≤ 1 √ m m X r =1   σ  w ⊤ k +1 ,r x i  − σ  w ⊤ k,r x i    ≤ 1 √ m m X r =1    ( w k +1 − w k ) ⊤ x i    ≤ 1 √ m m X r =1 ∥ w k +1 − w k ∥ = η √ m m X r =1    ∇ w r ˆ L ( θ k , ξ k )    2 where in the first inequality w e use the fact that a = ± 1 , and in the second inequality w e use the 1 -Lipsc hitzness of ReLU. Applying Assumption 5.1, we hav e that n X i =1 ( f ( θ k +1 , x i ) − f ( θ k , x i )) 2 ≤ η 2 m n X i =1 m X r =1    ∇ w r ˆ L ( θ k , ξ k )    2 ! ≤ η 2 n m ·  m · q γ ˆ L ( θ k , ξ k )  2 = η 2 mnγ ˆ L ( θ k , ξ k ) (27) Lastly , to analyze the second term, we use the following definition of I i,k and I ⊥ i,k I i,k = 1 √ m X r ∈ S i a r σ  w ⊤ k,r x i  ; I ⊥ i,k = 1 √ m X r ∈ S ⊥ i a r σ  w ⊤ k,r x i  Then w e hav e that f ( θ k , x i ) = I i,k + I ⊥ i,k . 
Therefore f ( θ k +1 , x i ) − f ( θ k , x i ) = ( I i,k +1 − I i,k ) +  I ⊥ i,k +1 − I ⊥ i,k  31 By the 1 -Lipschitzness of ReLU, we hav e that   I ⊥ i,k +1 − I ⊥ i,k   =       1 √ m X r ∈ S ⊥ i a r  σ  w ⊤ k +1 ,r x i  − σ  w ⊤ k,r x i        ≤ 1 √ m X r ∈ S ⊥ i   σ  w ⊤ k +1 ,r x i  − σ  w ⊤ k,r x i    ≤ 1 √ m X r ∈ S ⊥ i    ( w k +1 ,r − w k,r ) ⊤ x i    ≤ η √ m X r ∈ S ⊥ i    ∇ w r ˆ L ( θ k , ξ k )    2 ≤ η √ γ √ m   S ⊥ i   ˆ L ( θ k , ξ k ) 1 2 Applying   S ⊥ i   ≤ 4 mR τ giv es    I ⊥ i,k +1 − I ⊥ i,k    ≤ 4 η R τ √ γ m ˆ L ( θ k , ξ k ) 1 2 .This giv es that n X i =1 ( f ( θ k +1 , x i ) − f ( θ k , x i )) ( f ( θ k , x i ) − y i ) = n X i =1 ( I i,k +1 − I i,k ) ( f ( θ k , x i ) − y i ) + n X i =1  I ⊥ i,k +1 − I ⊥ i,k  ( f ( θ k , x i ) − y i ) ≤ n X i =1 ( I i,k +1 − I i,k ) ( f ( θ k , x i ) − y i ) + n X i =1  I ⊥ i,k +1 − I ⊥ i,k  2 ! 1 2 n X i =1 ( f ( θ k , x i ) − y i ) 2 ! 1 2 ≤ n X i =1 ( I i,k +1 − I i,k ) ( f ( θ k , x i ) − y i ) + 4 η R τ √ γ mn ˆ L ( θ k , ξ k ) 1 2 L ( θ k ) 1 2 (28) Plugging (26), (27), and (28) into (25) gives L ( θ k +1 ) ≤ L ( θ k ) + η 2 mnγ ˆ L ( θ k , ξ k ) + 4 η R τ √ γ mn ˆ L ( θ k , ξ k ) 1 2 L ( θ k ) 1 2 + n X i =1 ( I i,k +1 − I i,k ) ( f ( θ k , x i ) − y i ) Under Jensen’s inequality , w e ha ve that E ξ k h ˆ L ( θ k , ξ k ) 1 2 i ≤ E ξ k h ˆ L ( θ k , ξ k ) i 1 2 . 
Using the prop erty that E ξ k h ˆ L ( θ k , ξ k ) i ≤ 2 L ( θ k ) + ε 1 from Assumption 5.1, we can also obtain that E ξ k h ˆ L ( θ k , ξ k ) 1 2 i ≤ (2 L ( θ k ) + ε 1 ) 1 2 32 Therefore, taking the exp ectation of L ( θ k +1 ) giv es E ξ k [ L ( θ k +1 )] ≤ L ( θ k ) + η 2 mnγ (2 L ( θ k ) + ε ) + 4 η R τ √ γ mn (2 L ( θ k ) + ε 1 ) 1 2 L ( θ k ) 1 2 + n X i =1 E ξ k [ I i,k +1 − I i,k ] ( f ( θ k , x i ) − y i ) ≤ L ( θ k ) + η 2 mnγ (2 L ( θ k ) + ε 1 ) + 10 η R τ √ γ mn L ( θ k ) + 4 η R τ √ γ mn · ε 1 + n X i =1 E ξ k [ I i,k +1 − I i,k ] ( f ( θ k , x i ) − y i ) =  1 + 2 η 2 mnγ + 10 C η λ 0 r γ m n  L ( θ k ) +  η 2 mnγ + 4 C η λ 0 r γ m n  ε 1 + n X i =1 E ξ k [ I i,k +1 − I i,k ] ( f ( θ k , x i ) − y i ) (29) where in the last inequality w e use the prop erty that p a ( a + b ) ≤ 5 4 a + b . Recall that w k +1 ,r , w k,r ∈ B ( w 0 ,r , R ) . Therefore, for r ∈ S i , w e m ust hav e that I n w ⊤ k +1 ,r x i ≥ 0 o = I  w ⊤ 0 ,r x i ≥ 0  = I n w ⊤ k,r x i ≥ 0 o . Th us, w e ha ve E ξ k [ I i,k +1 − I i,k ] = 1 √ m X r ∈ S i a r E ξ k [ w k +1 ,r − w k,r ] ⊤ x i I  w ⊤ k,r x i ≥ 0  = − η √ m X r ∈ S i a r E ξ k h ∇ w r ˆ L ( θ k , ξ k ) i ⊤ x i I  w ⊤ k,r x i ≥ 0  (30) Let g k,r = E ξ k h ∇ w r ˆ L ( θ k , ξ k ) i − ∇ w r L ( θ k ) . Then b y Assumption 5.1 we hav e that ∥ g k,r ∥ 2 ≤ ε 3 L ( θ ) 1 2 + ε 2 . 
Using g k,r , we can write (30) as E ξ k [ I i,k +1 − I i,k ] = − η X r ∈ S i a r √ m ∇ w r L ( θ k ) ⊤ x i I  w ⊤ k,r x i ≥ 0  − η √ m X r ∈ S i a r g ⊤ k,r x i I  w ⊤ k,r x i ≥ 0  (31) By definition, we hav e ∇ w r L ( θ k ) = a r √ m n X j =1 ( f ( θ k , x i ) − y i ) x j I  w ⊤ k,r x j ≥ 0  Therefore, w e hav e that a r √ m ∇ w r L ( θ k ) ⊤ x i I  w ⊤ k,r x i ≥ 0  = 1 m n X j =1 ( f ( θ k , x i ) − y i ) x ⊤ i x j I  w ⊤ k,r x i ≥ 0 w ⊤ k,r x j ≥ 0  33 Com bining with (31), we hav e that n X i =1 E ξ k [ I i,k +1 − I i,k ] ( f ( θ k , x i ) − y i ) = − η m n X i,j =1 X r ∈ S i ( f ( θ k , x i ) − y i ) ( f ( θ k , x j ) − y j ) x ⊤ i x j I  w ⊤ k,r x i ≥ 0; w ⊤ k,r x j ≥ 0  − η √ m n X i =1 X r ∈ S i ( f ( θ k , x i ) − y i ) a r g ⊤ k,r x i I  w ⊤ k,r x i ≥ 0  ≤ − η n X i,j =1 ( f ( θ k , x i ) − y i ) x ⊤ i x j m X r ∈ S i I  w ⊤ k,r x i ≥ 0; w ⊤ k,r x j ≥ 0  ! | {z } H k,ij ( f ( θ k , x j ) − y j ) + η √ m m X i =1 n X r =1 | f ( θ k , x i ) − y i | ∥ g k,r ∥ 2 ≤ − η λ min ( H k ) n X i =1 ( f ( θ k , x i ) − y i ) 2 + η ε 2 √ mn n X r =1 ( f ( θ k , x i ) − y i ) 2 ! 1 2 + η ε 3 √ mn L ( θ k ) = −  2 η λ min ( H k ) + η ε 3 √ mn  L ( θ k ) + 2 η ε 2 √ mn L ( θ k ) 1 2 Using the prop ert y that ab ≤ a 2 2 + b 2 2 , we hav e that for any C ′ > 0 , 2 η ε 2 √ mn L ( θ k ) 1 2 ≤ C ′ η λ 0 r γ m n + η ε 2 2 n C ′ λ 0 r mn γ Moreo ver, b y Lemma C.1, we hav e that when m = Ω  n 2 λ 2 0 log n δ  , with probabilit y at least 1 − δ − n 2 exp − mR τ , it holds that ∥ H k − H ∞ ∥ F ≤ λ 0 6 + 1 m   n X i,j =1   S ⊥ i   2   1 2 + 2 nR τ Plugging in   S ⊥ i   ≤ 4 mR τ and R ≤ C 1 · τ λ 0 n , we hav e that ∥ H k − H ∞ ∥ F ≤ λ 0 2 for small enough C 1 . Thus, we ha ve that λ min ( H k ) ≥ λ 0 2 . 
Therefore, we hav e that n X i =1 E ξ k [ I i,k +1 − I i,k ] ( f ( θ k , x i ) − y i ) ≤  C ′ η λ 0 r γ m n + η ε 3 √ mn − η λ 0  L ( θ k ) + η ε 2 2 n C ′ λ 0 r mn γ Plugging this back into (31) gives E ξ k [ L ( θ k +1 )] ≤  1 + 2 η 2 mnγ + 10 C η λ 0 r γ m n  L ( θ k ) +  η 2 mnγ + 4 C η λ 0 r γ m n  ε 1 +  C ′ η λ 0 r γ m n + η ε 3 √ mn − η λ 0  L ( θ k ) + η ε 2 2 n C ′ λ 0 r mn γ =  1 − η λ 0 + η ε 3 √ mn + 2 η 2 mnγ + (10 C + C ′ ) η λ 0 r γ m n  L ( θ k ) +  η 2 mnγ + 4 C η λ 0 r γ m n  ε 1 + η ε 2 2 n C ′ λ 0 r mn γ 34 Apply ε 3 ≤ C ε · λ 0 √ mn , γ = C 1 · n m and η = C 2 · λ 0 n 2 giv es E ξ k [ L ( θ k +1 )] ≤  1 −  1 − 2 C 1 C 2 − (10 C + C ′ ) p C 1  η λ 0  L ( θ k ) + η λ 0  mnε 2 2 C ′ √ C 1 λ 2 0 +  C 1 C 2 + 4 C p C 1  ε 1  Cho osing a small enough C 1 , C 2 , C, C ′ giv es E ξ k [ L ( θ k +1 )] ≤  1 − η λ 0 2  L ( θ k ) + 1 2 ˆ C η λ 0  mn λ 2 0 · ε 2 2 + ε 1  (32) for a large enough ˆ C . Thus, unrolling the iterations gives E ξ 0 ,..., ξ k − 1 [ L ( W k )] ≤  1 − η λ 0 2  k L ( W 0 ) + ˆ C  mn λ 2 0 · ε 2 2 + ε 1  (33) for all k < K ′ . Next, we shall low er b ound K ′ . F or all k ≤ K ′ , we hav e that ∥ w k,r − w 0 ,r ∥ 2 ≤ k − 1 X t =0 ∥ w t +1 ,r − w t,r ∥ 2 = η k − 1 X t =0    ∇ w r ˆ L ( W t , ξ t )    2 ≤ η √ γ k − 1 X t =0 ˆ L ( W t , ξ t ) 1 2 By (33), we hav e E ξ 0 ,..., ξ t − 1 h ˆ L ( W t , ξ t ) 1 2 i ≤ ( L ( W t ) + ε 1 ) 1 2 ≤ 2  1 − η λ 0 2  t L ( W 0 ) +  ˆ C + 1   mn λ 2 0 · ε 2 2 + ε 1  ! 1 2 ≤ 2  1 − η λ 0 4  t L ( W 0 ) 1 2 + q ˆ C + 1  ε 2 λ 0 √ mn + √ ε 1  Therefore, w e hav e E ξ 0 ,..., ξ k − 1  ∥ w k,r − w 0 ,r ∥ 2  ≤ η √ γ k − 1 X t =0 E ξ 0 ,..., ξ t − 1 h ˆ L ( W t , ξ t ) 1 2 i ≤ 2 η √ γ L ( W 0 ) 1 2 ∞ X t =0  1 − η λ 0 4  t + k r  ˆ C + 1  γ  ε 2 λ 0 √ mn + √ ε 1  = 8 √ γ λ 0 L ( W 0 ) 1 2 + k r  ˆ C + 1  γ  ε 2 λ 0 √ mn + √ ε 1  By Lemma 26 in , we hav e that E W 0 , a h L ( W 0 ) 2 i = O ( n ) . 
With $\gamma = C_1 \cdot \frac{n}{m}$, we have that
\[
\mathbb{E}_{W_0,a,\xi_0,\dots,\xi_{k-1}}\left[\|w_{k,r} - w_{0,r}\|_2\right] \le O\left(\frac{n}{\lambda_0\sqrt{m}}\right) + O\left(k\left(\frac{\varepsilon_2 n}{\lambda_0} + \sqrt{\frac{\varepsilon_1 n}{m}}\right)\right)
\]
Thus, by Markov's inequality, we have that with probability at least $1 - \frac{\delta}{3K}$,
\[
\|w_{k,r} - w_{0,r}\|_2 \le \underbrace{O\left(\frac{nK}{\lambda_0\delta\sqrt{m}}\right)}_{T_1} + \underbrace{O\left(\frac{K^2}{\delta}\left(\frac{\varepsilon_2 n}{\lambda_0} + \sqrt{\frac{\varepsilon_1 n}{m}}\right)\right)}_{T_2}
\]
Setting $m = \Omega\left(\frac{n^4}{\lambda_0^4\delta^2\tau^2}\right)$ guarantees that $T_1 \le \frac{C_1}{2}\cdot\frac{\tau\lambda_0}{n} = \frac{R}{2}$, and setting $\varepsilon_2 \le O\left(\frac{\delta\lambda_0}{nK^2}\right)$, $\varepsilon_1 \le O\left(\frac{\delta m}{K^4 n}\right)$ gives that $T_2 \le \frac{C_1}{2}\cdot\frac{\tau\lambda_0}{n} = \frac{R}{2}$. Combining the bounds on $T_1$ and $T_2$ and taking a union bound gives that, with probability at least $1 - \frac{\delta}{3}$, it holds that
\[
\|w_{k,r} - w_{0,r}\|_2 \le R \quad \forall k \in [K]
\]
This shows that we must have $K' > K$, which completes the proof.

Lemma C.1. Let $H^\infty$ be defined in (11), and let $H_k$ be defined as
\[
H_{k,ij} = \frac{x_i^\top x_j}{m}\sum_{r \in S_i} \mathbb{I}\{w_{k,r}^\top x_i \ge 0;\ w_{k,r}^\top x_j \ge 0\}
\]
Fix any $R$. Assume that $w_{k,r} \in B(w_{0,r}, R)$ for all $r \in [m]$. If $w_{0,r} \sim \mathcal{N}(0, \tau^2 I)$ and $m = \Omega\left(\frac{n^2}{\lambda_0^2}\log\frac{n}{\delta}\right)$, then with probability at least $1 - \delta - n^2\exp\left(-\frac{mR}{\tau}\right)$ we have that
\[
\|H_k - H^\infty\|_F \le \frac{\lambda_0}{6} + \frac{1}{m}\left(\sum_{i,j=1}^n \left|S_i^\perp\right|^2\right)^{1/2} + \frac{2nR}{\tau}
\]
Proof. We define $\hat{H}_k$ as follows:
\[
\hat{H}_{k,ij} = \frac{x_i^\top x_j}{m}\sum_{r=1}^m \mathbb{I}\{w_{k,r}^\top x_i \ge 0;\ w_{k,r}^\top x_j \ge 0\}
\]
Then we have that
\[
\|H_k - H^\infty\|_F \le \left\|H_k - \hat{H}_k\right\|_F + \left\|\hat{H}_k - \hat{H}_0\right\|_F + \left\|\hat{H}_0 - H^\infty\right\|_F
\]
By Lemma 3.1 in Du et al. (2018), with probability at least $1 - \delta$ we have that $\left\|\hat{H}_0 - H^\infty\right\|_F \le \frac{\lambda_0}{6}$ when $m = \Omega\left(\frac{n^2}{\lambda_0^2}\log\frac{n}{\delta}\right)$. By Lemma 3.2 in Song and Yang (2020), with probability at least $1 - n^2\exp\left(-\frac{mR}{\tau}\right)$, it holds that $\left\|\hat{H}_k - \hat{H}_0\right\| \le \frac{2nR}{\tau}$. Lastly, for the first term, we have
\[
\left\|H_k - \hat{H}_k\right\|_F^2 \le \sum_{i,j=1}^n \left(H_{k,ij} - \hat{H}_{k,ij}\right)^2 = \frac{1}{m^2}\sum_{i,j=1}^n \left(x_i^\top x_j \sum_{r \in S_i^\perp} \mathbb{I}\{w_{k,r}^\top x_i \ge 0;\ w_{k,r}^\top x_j \ge 0\}\right)^2 \le \frac{1}{m^2}\sum_{i,j=1}^n \left|S_i^\perp\right|^2
\]
Combining the three bounds gives the desired result.
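Lemma C.2 below controls the initialization scale $\|w_{0,r}\|_2$ through the Laurent–Massart chi-square tail $\Pr[\|z\|_2^2 \ge 5d] \le e^{-d}$. A small Monte Carlo sanity check of that step; $d$, $m$, and $\tau$ are illustrative values, not the regime of the theorem:

```python
# For z ~ N(0, I_d), Pr[||z||^2 >= 5d] <= e^{-d} (Laurent-Massart with t = d),
# so with m rows w_{0,r} ~ N(0, tau^2 I_d) the max row norm stays below
# sqrt(5d)*tau with high probability.
import numpy as np

rng = np.random.default_rng(0)
d, m, tau = 20, 1000, 0.3                     # illustrative sizes

W0 = tau * rng.standard_normal((m, d))        # rows w_{0,r} ~ N(0, tau^2 I_d)
row_norms = np.linalg.norm(W0, axis=1)

# empirical tail probability vs the e^{-d} bound (here e^{-20} ~ 2e-9)
tail = np.mean(row_norms**2 >= 5 * d * tau**2)
assert tail <= np.exp(-d) + 1e-3              # slack for finite sampling

# the bound C_0 * tau * sqrt(d) with C_0 = sqrt(5) holds for every row here
assert row_norms.max() <= np.sqrt(5 * d) * tau
print("max row norm:", row_norms.max(), "bound:", np.sqrt(5 * d) * tau)
```

The union bound over the $m$ rows costs only a factor $m e^{-d}$, which is why the weights stay on the scale $\tau\sqrt{d}$ of their initialization.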
C.2 Proof of Theorem 5.6

We view Gaussian input masking as a special case of the general stochastic training framework in Section 5. Recall that in that framework, the randomness at iteration $k$ is denoted by $\xi_k$, and the update rule is
\[
W_{k+1} = W_k - \eta \nabla_W \hat{L}(W_k, \xi_k). \quad (7)
\]
In the Gaussian-masked setting we take $\xi_k \equiv C_k$, $\hat{L}(W, \xi_k) \equiv L_{C_k}(W)$, where $C_k$ is the multiplicative Gaussian mask at iteration $k$ and $L_{C_k}$ is the masked loss. Thus $\nabla_{w_r}\hat{L}(W, \xi_k) \equiv \nabla_{w_r} L_{C_k}(W)$, and the update (7) coincides with the masked gradient descent rule (1). Therefore, to apply Theorem 5.2 to training with Gaussian input masks, it suffices to verify that Assumption 5.1 holds with suitable $\varepsilon_1, \varepsilon_2, \varepsilon_3, \gamma$, and that these parameters satisfy the smallness conditions of Theorem 5.2 under the constraints (14)–(15). Throughout the proof we condition on the high-probability NTK event of Theorem 5.2, on which
• the minimum eigenvalue of the empirical NTK satisfies $\lambda_{\min}(H_k) \ge \lambda_0/2$ for all $k \in [K]$,
• all first-layer weights remain in a ball of radius $R = C_1\tau\lambda_0/n$ around their initialization, i.e., $\|w_{k,r} - w_{0,r}\|_2 \le R$ for all $k \in [K]$, $r \in [m]$,
• the data are bounded as in Assumption 3.1.
The probability of this event is at least $1 - 2\delta - n^2\exp(-n^3/(\delta^2\tau^2\lambda_0^3))$, as in Theorem 5.2. All inequalities below hold on this event. Comparing Corollary 5.3 with Assumption 5.1(8), we identify
\[
\varepsilon_1(W) = 2mn\kappa^2 R_u^2 + mn\left(\kappa^2 R_u^2 + \kappa R_w\right)\phi_{\max}(W)^2 + mn\kappa^2\left(R_u^2 + 1\right)\psi_{\max}(W)^2. \quad (34)
\]
On the NTK event, the weights stay close to initialization, hence their norms are uniformly bounded; using Assumption 3.1 and the definitions of $R_w$ and $R_u$, we obtain:
\begin{align*}
R_w(W_k) &:= \max_{r \in [m]} \|w_{k,r}\|_2 \le C_w \tau\sqrt{d} \quad (35) \\
R_u(W_k) &:= \max_{r \in [m], i \in [n]} \|w_{k,r} \odot x_i\|_2 \le C_u \tau\sqrt{d} \quad (36)
\end{align*}
for constants $C_w, C_u > 0$.

Lemma C.2 (Bound on $R_w$).
On the NTK event of Theorem 5.2, there exists an absolute constant $C_w > 0$ such that for all iterations $k \le K$:
\[
R_w(W_k) := \max_{r \in [m]} \|w_{k,r}\|_2 \le C_w \tau\sqrt{d}. \quad (37)
\]
Proof. Recall that the first-layer weights are initialized as $w_{0,r} \sim \mathcal{N}(0, \tau^2 I_d)$ for $r = 1, \dots, m$. During training, the NTK event of Theorem 5.2 ensures that each row stays in a small ball around its initialization:
\[
\|w_{k,r} - w_{0,r}\|_2 \le R, \quad R := \frac{C_1\tau\lambda_0}{n}, \quad \forall k \le K,\ r \in [m]. \quad (38)
\]
We define $z_r := \frac{1}{\tau} w_{0,r}$. Each coordinate satisfies $(z_r)_j \sim \mathcal{N}(0,1)$, making $z_r$ a standard Gaussian vector $\mathcal{N}(0, I_d)$, and its squared norm follows a chi-square distribution: $\|z_r\|_2^2 \sim \chi_d^2$. Using the Laurent–Massart concentration inequality with $t = d$ (which bounds the tail at $d + 2\sqrt{d \cdot d} + 2d = 5d$):
\[
\Pr\left[\|z_r\|_2^2 \ge 5d\right] \le e^{-d}.
\]
Thus, with high probability, $\|z_r\|_2 \le \sqrt{5d}$. Defining $C_0 = \sqrt{5}$, we obtain the initialization bound
\[
\|w_{0,r}\|_2 = \tau\|z_r\|_2 \le C_0\tau\sqrt{d}. \quad (39)
\]
By a union bound over $r \in [m]$, this holds for all rows with probability at least $1 - me^{-d}$. Combining the triangle inequality with (38) and (39), we find:
\[
\|w_{k,r}\|_2 \le \|w_{0,r}\|_2 + \|w_{k,r} - w_{0,r}\|_2 \le C_0\tau\sqrt{d} + \frac{C_1\tau\lambda_0}{n} = \tau\sqrt{d}\left(C_0 + \frac{C_1\lambda_0}{n\sqrt{d}}\right).
\]
Since $\frac{\lambda_0}{n\sqrt{d}} = O(1)$ (recall $d \ge 1$), we may take an absolute constant $C_w$ with $C_0 + \frac{C_1\lambda_0}{n\sqrt{d}} \le C_w$. Taking the maximum over $r \in [m]$ yields:
\[
R_w(W_k) = \max_{r \in [m]} \|w_{k,r}\|_2 \le C_w\tau\sqrt{d}. \quad (40)
\]
Intuitively, since the movement term $\frac{C_1\lambda_0}{n\sqrt{d}}$ vanishes as $d \to \infty$, the weights remain on the same scale as their initialization throughout training.

Lemma C.3 (Bound on $R_u$). Suppose Assumption 3.1 holds, so that the input data is bounded in $\ell_\infty$-norm by $B_x := \max_{i \in [n]} \|x_i\|_\infty$. On the NTK event of Theorem 5.2, there exists an absolute constant $C_u > 0$ such that, for all iterations $k \le K$:
\[
R_u(W_k) := \max_{r \in [m], i \in [n]} \|w_{k,r} \odot x_i\|_2 \le C_u\tau\sqrt{d}. \quad (41)
\]
Proof.
Consider any iteration $k \le K$, neuron $r \in [m]$, and sample index $i \in [n]$. We analyze the squared $\ell_2$-norm of the Hadamard product by pulling out the maximum coordinate of the input vector:
\[
\|w_{k,r} \odot x_i\|_2^2 = \sum_{j=1}^d w_{k,r,j}^2 x_{i,j}^2 \le \left(\max_{1 \le j \le d} x_{i,j}^2\right)\sum_{j=1}^d w_{k,r,j}^2 = \|x_i\|_\infty^2 \|w_{k,r}\|_2^2.
\]
Taking the square root of both sides, we obtain the inequality
\[
\|w_{k,r} \odot x_i\|_2 \le \|x_i\|_\infty \|w_{k,r}\|_2 \le B_x \|w_{k,r}\|_2.
\]
Taking the maximum over all $r \in [m]$ and $i \in [n]$ yields:
\[
R_u(W_k) \le B_x R_w(W_k). \quad (42)
\]
From the weight-stability bound established in the proof of Lemma C.2, we know that on the NTK event the weights are bounded by $R_w(W_k) \le C_w\tau\sqrt{d}$, where $C_w$ is an absolute constant. Substituting this into (42): $R_u(W_k) \le B_x\left(C_w\tau\sqrt{d}\right)$. Defining the absolute constant $C_u := B_x C_w$ completes the proof. Note that since $B_x$ is a fixed property of the dataset and $C_w$ is independent of $k$, $C_u$ is a valid absolute constant for the problem.

We fix an iteration $k \in [K]$ and denote, for brevity, $R_w := R_w(W_k)$, $R_u := R_u(W_k)$, $\phi_k := \phi_{\max}(W_k)$, $\psi_k := \psi_{\max}(W_k)$. Then (34) becomes
\[
\varepsilon_1(W_k) = 2mn\kappa^2 R_u^2 + mn\left(\kappa^2 R_u^2 + \kappa R_w\right)\phi_k^2 + mn\kappa^2\left(R_u^2 + 1\right)\psi_k^2. \quad (43)
\]
We now bound each of the three terms on the right-hand side using (35), (36).

(i) First term. Using $R_u^2 \le C_u^2\tau^2 d$ from (36), we get
\[
2mn\kappa^2 R_u^2 \le 2mn\kappa^2 \cdot C_u^2\tau^2 d = (2C_u^2)\kappa^2\tau^2 mnd = O\left(\kappa^2\tau^2 mnd\right). \quad (44)
\]
(ii) Middle term. $mn\left(\kappa^2 R_u^2 + \kappa R_w\right)\phi_k^2 = mn\kappa^2 R_u^2\phi_k^2 + mn\kappa R_w\phi_k^2$. For the $\kappa^2 R_u^2\phi_k^2$ component, we use $R_u^2 \le C_u^2\tau^2 d$ from (36):
\[
mn\kappa^2 R_u^2\phi_k^2 \le mn\kappa^2\left(C_u^2\tau^2 d\right)\phi_k^2 = C_u^2\kappa^2\tau^2 mnd\,\phi_k^2 = O\left(\kappa^2\tau^2 mnd\,\phi_k^2\right). \quad (45)
\]
For the $\kappa R_w\phi_k^2$ component, we use $R_w \le C_w\tau\sqrt{d}$ from (35):
\[
mn\kappa R_w\phi_k^2 \le mn\kappa\left(C_w\tau\sqrt{d}\right)\phi_k^2 = C_w\kappa\tau mn\sqrt{d}\,\phi_k^2 = O\left(\kappa\tau mn\sqrt{d}\,\phi_k^2\right). \quad (46)
\]
(iii) Last term.
For the last term, using (36), it is true that $R_u^2 + 1 \le C_u^2\tau^2 d + 1$, so there exists a constant $C_u' > 0$ such that $R_u^2 + 1 \le C_u'\tau^2 d$ for big enough $d$. Hence,
\[
mn\kappa^2\left(R_u^2 + 1\right)\psi_k^2 \le mn\kappa^2\left(C_u'\tau^2 d\right)\psi_k^2 = C_u'\kappa^2\tau^2 mnd\,\psi_k^2 = O\left(\kappa^2\tau^2 mnd\,\psi_k^2\right). \quad (47)
\]
Combining (43) with (44), (45), (46), and (47), we obtain
\[
\varepsilon_1(W_k) \le O\left(\kappa^2\tau^2 mnd\right) + O\left(\kappa^2\tau^2 mnd\,\phi_k^2\right) + O\left(\kappa\tau mn\sqrt{d}\,\phi_k^2\right) + O\left(\kappa^2\tau^2 mnd\,\psi_k^2\right) = O\left(\kappa^2\tau^2 mnd\right) + O\left(\kappa^2\tau^2 mnd\left(\phi_k^2 + \psi_k^2\right)\right) + O\left(\kappa\tau mn\sqrt{d}\,\phi_k^2\right). \quad (48)
\]
We denote $\hat{\phi}_{\max} := \max_{k \in [K]} \phi_{\max}(W_k)$, $\hat{\psi}_{\max} := \max_{k \in [K]} \psi_{\max}(W_k)$, so that for each $k$, $\phi_k^2 \le \hat{\phi}_{\max}^2$, $\psi_k^2 \le \hat{\psi}_{\max}^2$. Substituting these into (48) yields
\[
\varepsilon_1(W_k) \le O\left(\kappa^2\tau^2 mnd\right) + O\left(\kappa^2\tau^2 mnd\left(\hat{\phi}_{\max}^2 + \hat{\psi}_{\max}^2\right)\right) + O\left(\kappa\tau mn\sqrt{d}\,\hat{\phi}_{\max}^2\right), \quad \forall k \in [K]. \quad (49)
\]
Taking the maximum over $k \in [K]$ does not change the right-hand side, so
\[
\varepsilon_1 := \max_{k \in [K]} \varepsilon_1(W_k) \le O\left(\kappa^2\tau^2 mnd\right) + O\left(\kappa^2\tau^2 mnd\left(\hat{\phi}_{\max}^2 + \hat{\psi}_{\max}^2\right)\right) + O\left(\kappa\tau mn\sqrt{d}\,\hat{\phi}_{\max}^2\right), \quad (50)
\]
or equivalently:
\[
\varepsilon_1 \le O\left(\kappa^2\tau^2 mnd\left(\hat{\phi}_{\max}^2 + \hat{\psi}_{\max}^2 + 1\right)\right) + O\left(\kappa\tau mn\sqrt{d}\,\hat{\phi}_{\max}^2\right) \quad (51)
\]
Matching Corollary 5.4 with Assumption 5.1(9), we read off
\[
\varepsilon_3(W) = O\left(\frac{\sigma_{\max}(X)\,\phi_{\max}(W)}{\sqrt{m}}\right)
\]
and
\[
\varepsilon_2(W) = O\left(\left(n\kappa^2 B_x^2 R_w + n\kappa R_u\sqrt{d}\right)\phi_{\max}(W)\right) + O\left(n\kappa R_u\psi_{\max}(W) + \kappa^2\sqrt{m}B_x^2 R_w\right). \quad (52)
\]
Using $\|x_i\|_2 \le 1$ from Assumption 3.1, we have $B_x \le 1$. On the NTK event, for all $k \in [K]$ we have the uniform bounds $R_w(W_k) \le C_w\tau\sqrt{d}$, $R_u(W_k) \le C_u\tau\sqrt{d}$, for some absolute constants $C_w, C_u > 0$ (cf. the bounds proved earlier for $R_w$ and $R_u$). Substituting these into (52) yields:
\[
\varepsilon_2(W_k) \le O\left(\left(n\kappa^2 C_w\tau\sqrt{d} + n\kappa C_u\tau\sqrt{d}\cdot\sqrt{d}\right)\phi_{\max}(W_k)\right) + O\left(n\kappa C_u\tau\sqrt{d}\,\psi_{\max}(W_k) + \kappa^2\sqrt{m}C_w\tau\sqrt{d}\right) = O\left(n\kappa^2\tau\sqrt{d}\,\phi_{\max}(W_k)\right) + O\left(n\kappa\tau d\,\phi_{\max}(W_k)\right) + O\left(n\kappa\tau\sqrt{d}\,\psi_{\max}(W_k)\right) + O\left(\kappa^2\tau\sqrt{md}\right).
\]
(53)
Since $d \ge 1$, we have $\sqrt{d} \le d$, and hence each term containing $\sqrt{d}$ can be upper bounded by the corresponding expression with $d$ in place of $\sqrt{d}$. Therefore,
\[
\varepsilon_2(W_k) \le O\left(n\kappa^2\tau d\,\phi_{\max}(W_k)\right) + O\left(n\kappa\tau d\,\phi_{\max}(W_k)\right) + O\left(n\kappa\tau d\,\psi_{\max}(W_k)\right) + O\left(\kappa^2\tau d\sqrt{m}\right) \le O\left(\kappa\tau nd\left(\phi_{\max}(W_k) + \psi_{\max}(W_k)\right)\right) + O\left(\kappa^2\tau d\left(n\phi_{\max}(W_k) + \sqrt{m}\right)\right). \quad (54)
\]
To obtain a uniform bound over the whole training trajectory, recall $\hat{\phi}_{\max} := \max_{k \in [K]}\phi_{\max}(W_k)$, $\hat{\psi}_{\max} := \max_{k \in [K]}\psi_{\max}(W_k)$. Taking the maximum over $k$ in (54) yields
\[
\varepsilon_2 := \max_{k \in [K]} \varepsilon_2(W_k) \le O\left(\kappa\tau nd\left(\hat{\phi}_{\max} + \hat{\psi}_{\max}\right)\right) + O\left(\kappa^2\tau d\left(n\hat{\phi}_{\max} + \sqrt{m}\right)\right).
\]
We can simplify $\varepsilon_2$ further (using $\kappa^2 \le \kappa$ for $\kappa \le 1$):
\[
O\left(\kappa\tau nd\left(\hat{\phi}_{\max} + \hat{\psi}_{\max}\right) + \kappa^2\tau dn\hat{\phi}_{\max} + \kappa^2\tau d\sqrt{m}\right) \le O\left(\kappa\tau nd\left(\hat{\phi}_{\max} + \hat{\psi}_{\max}\right) + \kappa\tau dn\left(\hat{\phi}_{\max} + \hat{\psi}_{\max}\right) + \kappa^2\tau d\sqrt{m}\right) = O\left(2\kappa\tau nd\left(\hat{\phi}_{\max} + \hat{\psi}_{\max}\right) + \kappa^2\tau d\sqrt{m}\right)
\]
Thus,
\[
\varepsilon_2 \le O\left(\kappa\tau nd\left(\hat{\phi}_{\max} + \hat{\psi}_{\max}\right)\right) + O\left(\kappa^2\tau d\sqrt{m}\right) \quad (55)
\]
Theorem 5.2 requires $\varepsilon_2$ to satisfy, for some absolute constant $c_0 > 0$,
\[
\varepsilon_2 \le \frac{c_0\delta\lambda_0}{nK^2} \quad (56)
\]
Combining (55) and (56), we get that the following inequality must hold:
\[
C_1\kappa\tau nd\left(\hat{\phi}_{\max} + \hat{\psi}_{\max}\right) + C_2\kappa^2\tau d\sqrt{m} \le \frac{c_0\delta\lambda_0}{nK^2}. \quad (57)
\]
Let us denote $a := C_1\kappa\tau nd\left(\hat{\phi}_{\max} + \hat{\psi}_{\max}\right)$, $b := C_2\kappa^2\tau d\sqrt{m}$, $R := \frac{c_0\delta\lambda_0}{nK^2}$. Then (57) can be written as $a + b \le R$. A sufficient way to enforce this inequality is to require that each term $a$ and $b$ is at most $R/2$:
\[
a \le \frac{R}{2}, \quad b \le \frac{R}{2} \quad (58)
\]
Indeed, if (58) holds, then $a + b \le \frac{R}{2} + \frac{R}{2} = R$, so (57) is automatically satisfied. Imposing $a \le R/2$ yields $C_1\kappa\tau nd\left(\hat{\phi}_{\max} + \hat{\psi}_{\max}\right) \le \frac{c_0}{2}\cdot\frac{\delta\lambda_0}{nK^2}$. Thus,
\[
\kappa \le \kappa_2^{\mathrm{lin}} := \frac{c_0}{2C_1}\cdot\frac{\delta\lambda_0}{\tau n^2 dK^2\left(\hat{\phi}_{\max} + \hat{\psi}_{\max}\right)}.
\]
(59)
Similarly, imposing $b \le R/2$ gives
\[
C_2\kappa^2\tau d\sqrt{m} \le \frac{c_0}{2}\cdot\frac{\delta\lambda_0}{nK^2} \;\Rightarrow\; \kappa^2 \le \frac{c_0}{2C_2}\cdot\frac{\delta\lambda_0}{\tau d\sqrt{m}\,nK^2}
\]
and therefore
\[
\kappa \le \kappa_2^{\mathrm{quad}} := \sqrt{\frac{c_0}{2C_2}}\sqrt{\frac{\delta\lambda_0}{\tau d\sqrt{m}\,nK^2}}. \quad (60)
\]
To ensure (56) holds, it is sufficient that (59) and (60) both hold. Equivalently, $\kappa \le \min\{\kappa_2^{\mathrm{lin}}, \kappa_2^{\mathrm{quad}}\}$. In big-$O$ notation we may write this as
\[
\kappa = O\left(\frac{\delta\lambda_0}{\tau n^2 dK^2\left(\hat{\phi}_{\max} + \hat{\psi}_{\max}\right)}\right) \quad \text{and} \quad \kappa = O\left(\sqrt{\frac{\delta\lambda_0}{\tau d\sqrt{m}\,nK^2}}\right). \quad (61)
\]
Similarly, for $\varepsilon_1$ the general stochastic convergence theorem requires that
\[
\varepsilon_1 \le \frac{c_1\delta m}{nK^4}, \quad (14)
\]
for some absolute constant $c_1 > 0$. Combining (51) and (14), we get that the following must hold:
\[
O\left(\kappa^2\tau^2 mnd\left(\hat{\phi}_{\max}^2 + \hat{\psi}_{\max}^2 + 1\right)\right) + O\left(\kappa\tau mn\sqrt{d}\,\hat{\phi}_{\max}^2\right) \le \frac{c_1\delta m}{nK^4} \;\Rightarrow\; C_a\kappa^2\tau^2 nd\left(\hat{\phi}_{\max}^2 + \hat{\psi}_{\max}^2 + 1\right) + C_b\kappa\tau n\sqrt{d}\,\hat{\phi}_{\max}^2 \le \frac{c_1\delta}{nK^4}.
\]
Similarly, we impose
\[
C_a\kappa^2\tau^2 nd\left(1 + \hat{\phi}_{\max}^2 + \hat{\psi}_{\max}^2\right) \le \frac{c_1}{2}\cdot\frac{\delta}{nK^4} \;\Rightarrow\; \kappa^2 \le \frac{c_1}{2C_a}\cdot\frac{\delta}{\tau^2 n^2 dK^4\left(\hat{\phi}_{\max}^2 + \hat{\psi}_{\max}^2 + 1\right)} \;\Rightarrow\; \kappa \le \kappa_1^{\mathrm{quad}} := \sqrt{\frac{c_1}{2C_a}}\cdot\frac{\sqrt{\delta}}{\tau n\sqrt{d}K^2\sqrt{\hat{\phi}_{\max}^2 + \hat{\psi}_{\max}^2 + 1}}. \quad (62)
\]
and
\[
C_b\kappa\tau n\sqrt{d}\,\hat{\phi}_{\max}^2 \le \frac{c_1}{2}\cdot\frac{\delta}{nK^4} \;\Rightarrow\; \kappa \le \kappa_1^{\mathrm{lin}} := \frac{c_1}{2C_b}\cdot\frac{\delta}{\tau n^2\sqrt{d}K^4\hat{\phi}_{\max}^2}. \quad (63)
\]
Thus, the $\varepsilon_1$ requirement (14) is guaranteed whenever $\kappa \le \min\left\{\kappa_1^{\mathrm{quad}}, \kappa_1^{\mathrm{lin}}\right\}$. Combining all constraints from $\varepsilon_1$ and $\varepsilon_2$, we see that a sufficient set of conditions is $\kappa \le \min\left\{\kappa_1^{\mathrm{lin}}, \kappa_1^{\mathrm{quad}}, \kappa_2^{\mathrm{lin}}, \kappa_2^{\mathrm{quad}}\right\}$. Equivalently, in big-$O$ notation,
\[
\kappa = O\left(\min\left\{\frac{\delta\lambda_0}{\tau n^2 dK^2\left(\hat{\phi}_{\max} + \hat{\psi}_{\max}\right)},\ \frac{\delta}{\tau n^2\sqrt{d}K^4\hat{\phi}_{\max}^2},\ \sqrt{\frac{\delta\lambda_0}{\tau d\sqrt{m}\,nK^2}},\ \sqrt{\frac{\delta}{\tau^2 n^2 dK^4\left(\hat{\phi}_{\max}^2 + \hat{\psi}_{\max}^2 + 1\right)}}\right\}\right). \quad (64)
\]
Instead of carrying this minimum in the theorem statement, we define a slightly more restrictive but cleaner condition that implies all of the above bounds:
\[
\kappa = O\left(\frac{\sqrt{\delta}\,\lambda_0}{\tau^2 K^2\left(m^{1/4}\sqrt{d} + nd\right)\left(\hat{\phi}_{\max} + \hat{\psi}_{\max}\right)}\right).
\]
(65)
For $\varepsilon_3$, we combine (9) with Corollary 5.4, which yields
\[
\varepsilon_3 = O\left(\frac{\sigma_{\max}(X)\,\phi_{\max}}{\sqrt{m}}\right)
\]
We assume that $\sigma_{\max}(X)\,\hat{\phi}_{\max} \le C\lambda_0/\sqrt{n}$ for some constant $C > 0$ (assumption (15)), and therefore
\[
\varepsilon_3 \le O\left(\frac{\lambda_0}{\sqrt{mn}}\right), \quad (66)
\]
which is exactly the form required in Theorem 5.2. Theorem 5.2 includes the factor $O\left(\frac{mn}{\lambda_0^2}\cdot\varepsilon_2^2 + \varepsilon_1\right)$, so, using (55) and (50), we study the quantity $\frac{mn}{\lambda_0^2}\cdot\varepsilon_2^2$:
\[
\frac{mn}{\lambda_0^2}\cdot\varepsilon_2^2 = \frac{mn}{\lambda_0^2}\cdot\left(\kappa^2\tau^2 n^2 d^2\left(\hat{\phi}_{\max} + \hat{\psi}_{\max}\right)^2 + \kappa^4\tau^2 d^2 m\right) \le \frac{2}{\lambda_0^2}\kappa^2\tau^2 n^3 d^2 m\left(\hat{\phi}_{\max}^2 + \hat{\psi}_{\max}^2\right) + \frac{1}{\lambda_0^2}\kappa^4\tau^2 d^2 m^2 n = O\left(\kappa\tau^2 mn^3 d^2\left(\hat{\phi}_{\max}^2 + \hat{\psi}_{\max}^2\right)\right) + O\left(\kappa^2\tau^2 m^2 nd^2\right) \quad (67)
\]
because $(a+b)^2 \le 2(a^2 + b^2)$ and $\kappa^4 \le \kappa^2 \le \kappa$ for $\kappa \le 1$. Furthermore,
\[
\varepsilon_1 = O\left(\kappa^2\tau^2 mnd\right) + O\left(\kappa^2\tau^2 mnd\left(\hat{\phi}_{\max}^2 + \hat{\psi}_{\max}^2\right)\right) + O\left(\kappa\tau mn\sqrt{d}\,\hat{\phi}_{\max}^2\right) \le O\left(\kappa^2\tau^2 m^2 nd^2\right) + O\left(\kappa\tau^2 mn^3 d^2\left(\hat{\phi}_{\max}^2 + \hat{\psi}_{\max}^2\right)\right) + O\left(\kappa\tau mn\sqrt{d}\,\hat{\phi}_{\max}^2\right) \quad (68)
\]
Therefore, by (67) and (68),
\[
O\left(\frac{mn}{\lambda_0^2}\cdot\varepsilon_2^2 + \varepsilon_1\right) = O\left(\kappa^2\tau^2 m^2 nd^2\right) + O\left(\kappa\tau^2 mn^3 d^2\left(\hat{\phi}_{\max}^2 + \hat{\psi}_{\max}^2\right)\right) + O\left(\kappa\tau mn\sqrt{d}\,\hat{\phi}_{\max}^2\right)
\]
Thus, the expected loss is bounded by:
\[
\mathbb{E}\left[L(W_K)\right] \le \left(1 - \frac{\eta\lambda_0}{2}\right)^K L(W_0) + O\left(\kappa^2\tau^2 m^2 nd^2\right) + O\left(\kappa\tau^2 mn^3 d^2\left(\hat{\phi}_{\max}^2 + \hat{\psi}_{\max}^2\right)\right) + O\left(\kappa\tau mn\sqrt{d}\,\hat{\phi}_{\max}^2\right)
\]

C.2.1 Proof of Corollary 5.3

Proof. By Theorem 4.2, we have that
\[
\left|\mathbb{E}_C\left[L_C(\theta)\right] - L(\theta)\right| \le \underbrace{\left|\frac{1}{2}\sum_{i=1}^n \left(\left(\hat{f}(\theta,x_i) - y_i\right)^2 - \left(f(W,x_i) - y_i\right)^2\right)\right|}_{T_1} + \underbrace{\frac{\kappa^2}{2m}\sum_{i=1}^n \left\|\sum_{r=1}^m a_r\left(w_r \odot x_i\right)\Phi_1\left(\frac{w_r^\top x_i}{\kappa\|w_r \odot x_i\|_2}\right)\right\|_2^2}_{T_2} + mn\left(\kappa^2 R_u^2\psi_{\max}^2 + \left(\kappa^2 R_u^2 + \kappa R_w\right)\phi_{\max}^2\right)
\]
Now, we can bound $T_1$ and $T_2$ separately.
For $T_1$, we have
\[
\left|\frac{1}{2}\sum_{i=1}^n \left(\left(\hat{f}(\theta,x_i) - y_i\right)^2 - \left(f(W,x_i) - y_i\right)^2\right)\right| \le \frac{1}{2}\sum_{i=1}^n \left(\left(\hat{f}(\theta,x_i) - f(W,x_i)\right)^2 + 2\left|\hat{f}(\theta,x_i) - f(W,x_i)\right|\left|f(W,x_i) - y_i\right|\right) \le \sum_{i=1}^n \left(\hat{f}(\theta,x_i) - f(W,x_i)\right)^2 + L(\theta)
\]
where the last step uses $2\left|\hat{f} - f\right|\left|f - y_i\right| \le \left(\hat{f} - f\right)^2 + (f - y_i)^2$. Here, we can bound $\hat{f}(W,x_i) - f(W,x_i)$ as
\begin{align*}
\left|\hat{f}(W,x_i) - f(W,x_i)\right| &\le \frac{1}{\sqrt{m}}\sum_{r=1}^m \left|\hat{\sigma}(w_r, x_i) - \sigma\left(w_r^\top x_i\right)\right| \\
&\le \frac{1}{\sqrt{m}}\sum_{r=1}^m \left|w_r^\top x_i\right|\left|\mathbb{I}\left\{w_r^\top x_i \ge 0\right\} - \Phi_1\left(\frac{w_r^\top x_i}{\kappa\|w_r \odot x_i\|_2}\right)\right| \\
&\le \frac{1}{\sqrt{m}}\sum_{r=1}^m \left|w_r^\top x_i\right|\exp\left(-\frac{\left(w_r^\top x_i\right)^2}{2\kappa^2\|w_r \odot x_i\|_2^2}\right) \le \frac{1}{\sqrt{m}}\sum_{r=1}^m \psi_{\max} = \sqrt{m}\,\psi_{\max}
\end{align*}
where in the third inequality we used Lemma D.4, and each summand is at most $\psi_{\max}$ by definition. Therefore, $T_1$ can be bounded as
\[
T_1 \le L(\theta) + mn\psi_{\max}^2
\]
For $T_2$, we can bound it, using $\Phi_1 \le 1$ and the Cauchy–Schwarz inequality, as
\[
T_2 \le \frac{\kappa^2}{2m}\sum_{i=1}^n \left(\sum_{r=1}^m \|w_r \odot x_i\|_2\right)^2 \le 2\kappa^2 mnR_u^2
\]
Plugging in $T_1$ and $T_2$ gives the desired result.

C.2.2 Proof of Corollary 5.4

Proof. By Theorem 4.9, we have that
\begin{align*}
\left\|\mathbb{E}_C\left[\nabla_{w_r} L(\theta)\right] - \nabla_{w_r} L(\theta)\right\|_2 &= \left\|g_r + \frac{3\kappa^2}{\sqrt{m}}\sum_{r'=1}^m a_{r'}\,\mathrm{Diag}(x_i)^2 w_{r'}\,\mathbb{I}\left\{w_r^\top x_i \ge 0;\ w_{r'}^\top x_i \ge 0\right\}\right\|_2 \\
&\le \left\|g_r\right\|_2 + \frac{3\kappa^2}{\sqrt{m}}\sum_{r'=1}^m \left\|a_{r'}\,\mathrm{Diag}(x_i)^2 w_{r'}\,\mathbb{I}\left\{w_r^\top x_i \ge 0;\ w_{r'}^\top x_i \ge 0\right\}\right\|_2 \le \left\|g_r\right\|_2 + \frac{3\kappa^2}{\sqrt{m}}\sum_{r'=1}^m \|x_i\|_\infty^2\|w_{r'}\|_2
\end{align*}
Theorem 4.9 states that the norm $\|g_r\|_2$ satisfies
\[
\|g_r\|_2 \le \left(6n\kappa^2 B_x^2 R_w + 5n\kappa R_u\sqrt{d}\right)\phi_{\max} + \frac{\sigma_{\max}(X)}{\sqrt{m}}\phi_{\max} L(\theta)^{1/2} + 6n\kappa R_u\psi_{\max}
\]
Moreover, by Assumption 3.1, we have that $\|x\|_\infty \le B_x$. Therefore, we obtain that
\[
\left\|\mathbb{E}_C\left[\nabla_{w_r} L(\theta)\right] - \nabla_{w_r} L(\theta)\right\|_2 \le \left(\frac{\sigma_{\max}(X)}{\sqrt{m}} L(\theta)^{1/2} + 6n\kappa^2 B_x^2 R_w + 5n\kappa R_u\sqrt{d}\right)\phi_{\max} + 6n\kappa R_u\psi_{\max} + 3\kappa^2\sqrt{m}B_x^2 R_w
\]

C.2.3 Proof of Lemma D.21

Proof.
By the form of $\nabla_{w_r} L_C(W)$ in (2), we have:
\begin{align*}
\left\|\nabla_{w_r} L_C(W)\right\|_2 &= \Bigg\|\frac{a_r}{\sqrt{m}}\sum_{i=1}^n \left(f(W, x_i \odot c_i) - y_i\right)\left(x_i \odot c_i\right)\underbrace{\mathbb{I}\left\{w_r^\top (x_i \odot c_i) \ge 0\right\}}_{\le 1}\Bigg\|_2 \\
&\le \frac{|a_r|}{\sqrt{m}}\sum_{i=1}^n \left|f(W, x_i \odot c_i) - y_i\right|\cdot\|x_i \odot c_i\|_2 \\
&\le \frac{\sqrt{n}}{\sqrt{m}}\left(\sum_{i=1}^n \left(f(W, x_i \odot c_i) - y_i\right)^2\right)^{1/2}\cdot\max_i \|c_i\|_\infty\|x_i\|_2 && \text{using Lemma D.20} \\
&\le C\frac{\sqrt{n}}{\sqrt{m}}\left(2L_C(W)\right)^{1/2} && \text{using } \|c_i\|_\infty \le C \text{ and } \|x_i\|_2 \le 1 \\
&\le C\sqrt{2}\,\frac{\sqrt{n}}{\sqrt{m}} L_C(W)^{1/2}
\end{align*}
where, in the first inequality, we use the fact that the indicator function is upper-bounded by 1 and, in the second inequality, the fact that $a_r = \pm 1$. And so,
\[
\left\|\nabla_{w_r} L_C(W)\right\|_2^2 \le 2C^2\cdot\frac{n}{m} L_C(W)
\]

D Auxiliary Results

D.1 Gaussian Random Variables

D.1.1 Conditional Expectation and Covariance

Lemma D.1. Let $c \sim \mathcal{N}(\mu, \kappa^2 I)$, and let $z = c^\top u$, $z' = c^\top v$. Then we have that
\[
\mathbb{E}_c[c \mid z] = \mu + \frac{u}{\|u\|_2^2}\left(z - \mu^\top u\right); \quad \mathbb{E}_c[c \mid z, z'] = \mu + s_1\left(z - \mu^\top u\right) + s_2\left(z' - \mu^\top v\right)
\]
where the vectors $s_1, s_2$ are defined as
\[
s_1 = \frac{\|v\|_2^2 u - (u^\top v)\,v}{\|u\|_2^2\|v\|_2^2 - (u^\top v)^2}; \quad s_2 = \frac{\|u\|_2^2 v - (u^\top v)\,u}{\|u\|_2^2\|v\|_2^2 - (u^\top v)^2}
\]
Proof. By the formula for the conditional expectation of jointly Gaussian variables, we have
\[
\mathbb{E}_c[c \mid z] = \mathbb{E}[c] + \mathrm{Cov}(c, z)\,\mathrm{Var}(z)^{-1}\left(z - \mathbb{E}[z]\right)
\]
By the definition of $c$, we have $\mathbb{E}[c] = \mu$. Moreover, since $z = c^\top u$, by Lemma D.13 we have $\mathbb{E}[z] = \mu^\top u$ and $\mathrm{Var}(z) = \kappa^2\|u\|_2^2$. Therefore, the covariance between $c$ and $z$ is given by
\[
\mathrm{Cov}(c, z) = \mathbb{E}\left[(c - \mu)\left(z - \mu^\top u\right)\right] = \mathbb{E}[zc] - \left(\mu^\top u\right)\mu
\]
where
\[
\mathbb{E}[zc]_j = \sum_{j'=1}^d \mathbb{E}\left[c_j c_{j'} u_{j'}\right] = \sum_{j'=1}^d \left(\mu_j\mu_{j'} + \mathbb{I}\{j = j'\}\kappa^2\right)u_{j'} = \mu_j\,\mu^\top u + \kappa^2 u_j
\]
This gives $\mathbb{E}[zc] = \left(\mu^\top u\right)\mu + \kappa^2 u$, hence $\mathrm{Cov}(c, z) = \kappa^2 u$. Therefore
\[
\mathbb{E}_c[c \mid z] = \mu + \frac{u}{\|u\|_2^2}\left(z - \mu^\top u\right)
\]
Similarly, since $z' = c^\top v$.
Then
\[
\mathbb{E}_c[c \mid z, z'] = \mathbb{E}[c] + \mathrm{Cov}(c, [z, z'])\,\mathrm{Cov}(z, z')^{-1}\left([z, z']^\top - \mathbb{E}[[z, z']]^\top\right)
\]
where
\[
\mathrm{Cov}(c, [z, z']) = \kappa^2\begin{pmatrix} u & v \end{pmatrix}; \quad \mathrm{Cov}(z, z') = \kappa^2\begin{pmatrix} \|u\|_2^2 & u^\top v \\ u^\top v & \|v\|_2^2 \end{pmatrix}
\]
This gives
\begin{align*}
\mathbb{E}_c[c \mid z, z'] &= \mu + \begin{pmatrix} u & v \end{pmatrix}\begin{pmatrix} \|u\|_2^2 & u^\top v \\ u^\top v & \|v\|_2^2 \end{pmatrix}^{-1}\begin{pmatrix} z - \mu^\top u \\ z' - \mu^\top v \end{pmatrix} \\
&= \mu + \frac{1}{\|u\|_2^2\|v\|_2^2 - (u^\top v)^2}\begin{pmatrix} u & v \end{pmatrix}\begin{pmatrix} \|v\|_2^2 & -u^\top v \\ -u^\top v & \|u\|_2^2 \end{pmatrix}\begin{pmatrix} z - \mu^\top u \\ z' - \mu^\top v \end{pmatrix} \\
&= \mu + \frac{\left(\|v\|_2^2 u - (u^\top v)\,v\right)\left(z - \mu^\top u\right) + \left(\|u\|_2^2 v - (u^\top v)\,u\right)\left(z' - \mu^\top v\right)}{\|u\|_2^2\|v\|_2^2 - (u^\top v)^2} = \mu + s_1\left(z - \mu^\top u\right) + s_2\left(z' - \mu^\top v\right)
\end{align*}
where the vectors $s_1, s_2$ are defined as
\[
s_1 = \frac{\|v\|_2^2 u - (u^\top v)\,v}{\|u\|_2^2\|v\|_2^2 - (u^\top v)^2}; \quad s_2 = \frac{\|u\|_2^2 v - (u^\top v)\,u}{\|u\|_2^2\|v\|_2^2 - (u^\top v)^2}
\]

Lemma D.2. Let $c \sim \mathcal{N}(\mu, \kappa^2 I)$, and let $z = c^\top u$, $z' = c^\top v$. Then we have
\[
\mathrm{Cov}(c \mid z, z') = \kappa^2 I + \frac{\kappa^2\left(uv^\top - vu^\top\right)^2}{\|u\|_2^2\|v\|_2^2 - (u^\top v)^2}
\]
(Note that $uv^\top - vu^\top$ is antisymmetric, so its square is negative semidefinite; the second term therefore reduces the covariance.)

Proof. The conditional covariance of Gaussian random variables is given by
\[
\mathrm{Cov}(c \mid z, z') = \mathrm{Cov}(c) - \mathrm{Cov}(c, [z, z'])\,\mathrm{Cov}(z, z')^{-1}\,\mathrm{Cov}(c, [z, z'])^\top
\]
Recall that in the previous lemma we have computed
\[
\mathrm{Cov}(c, [z, z']) = \kappa^2\begin{pmatrix} u & v \end{pmatrix}; \quad \mathrm{Cov}(z, z') = \kappa^2\begin{pmatrix} \|u\|_2^2 & u^\top v \\ u^\top v & \|v\|_2^2 \end{pmatrix}
\]
Thus, we have
\begin{align*}
\mathrm{Cov}(c \mid z, z') &= \kappa^2 I - \kappa^2\begin{pmatrix} u & v \end{pmatrix}\begin{pmatrix} \|u\|_2^2 & u^\top v \\ u^\top v & \|v\|_2^2 \end{pmatrix}^{-1}\begin{pmatrix} u & v \end{pmatrix}^\top \\
&= \kappa^2 I - \frac{\kappa^2}{\|u\|_2^2\|v\|_2^2 - (u^\top v)^2}\begin{pmatrix} u & v \end{pmatrix}\begin{pmatrix} \|v\|_2^2 & -u^\top v \\ -u^\top v & \|u\|_2^2 \end{pmatrix}\begin{pmatrix} u & v \end{pmatrix}^\top = \kappa^2 I + \frac{\kappa^2\left(uv^\top - vu^\top\right)^2}{\|u\|_2^2\|v\|_2^2 - (u^\top v)^2}
\end{align*}
where the last step uses $\|v\|_2^2 uu^\top - (u^\top v)\left(uv^\top + vu^\top\right) + \|u\|_2^2 vv^\top = -\left(uv^\top - vu^\top\right)^2$.

D.1.2 Approximation of CDF

In approximating the Gaussian CDF, we use the following property.

Lemma D.3 (?). Let $x \ge 0$ be given; then the following inequality holds:
\[
\frac{1}{2}\left(1 - \exp\left(-\frac{x^2}{2}\right)\right)^{1/2} \le \frac{1}{\sqrt{2\pi}}\int_0^x \exp\left(-\frac{t^2}{2}\right)dt \le \frac{1}{2}\left(1 - \exp\left(-\frac{2x^2}{\pi}\right)\right)^{1/2}
\]
To start, we analyze the CDF of a single-variable Gaussian random variable.

Lemma D.4.
Let $\Phi_1$ be the CDF of a standard Gaussian random variable. Then we have
\[
\left|\Phi_1(\alpha) - \mathbb{I}\{\alpha \ge 0\}\right| \le \exp\left(-\frac{\alpha^2}{2}\right)
\]
Proof. Let $\nu(x) = \frac{1}{\sqrt{2\pi}}\int_0^x \exp\left(-\frac{t^2}{2}\right)dt$ for $x \ge 0$. We study the cases $\alpha \ge 0$ and $\alpha < 0$ separately. For $\alpha \ge 0$, we have
\[
\Phi_1(\alpha) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\alpha}\exp\left(-\frac{t^2}{2}\right)dt = \frac{1}{2} + \nu(\alpha) \ge \frac{1}{2} + \frac{1}{2}\left(1 - \exp\left(-\frac{\alpha^2}{2}\right)\right)^{1/2} \ge 1 - \exp\left(-\frac{\alpha^2}{2}\right) = \mathbb{I}\{\alpha \ge 0\} - \exp\left(-\frac{\alpha^2}{2}\right)
\]
Since $\Phi_1(\alpha) \le 1 = \mathbb{I}\{\alpha \ge 0\}$ when $\alpha \ge 0$, we must have that, for $\alpha \ge 0$, $\left|\Phi_1(\alpha) - \mathbb{I}\{\alpha \ge 0\}\right| \le \exp\left(-\frac{\alpha^2}{2}\right)$. Similarly, for $\alpha < 0$, we have
\[
\Phi_1(\alpha) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\alpha}\exp\left(-\frac{t^2}{2}\right)dt = \frac{1}{2} - \nu(-\alpha) \le \frac{1}{2} - \frac{1}{2}\left(1 - \exp\left(-\frac{\alpha^2}{2}\right)\right)^{1/2} \le \exp\left(-\frac{\alpha^2}{2}\right) = \mathbb{I}\{\alpha \ge 0\} + \exp\left(-\frac{\alpha^2}{2}\right)
\]
Since $\Phi_1(\alpha) \ge 0 = \mathbb{I}\{\alpha \ge 0\}$ when $\alpha < 0$, we must have that, for $\alpha < 0$, $\left|\Phi_1(\alpha) - \mathbb{I}\{\alpha \ge 0\}\right| \le \exp\left(-\frac{\alpha^2}{2}\right)$.

Lemma D.5. Let $\Phi_2(\alpha_1, \alpha_2, \rho)$ be the joint CDF of two standard Gaussian random variables with covariance $\rho$ at $\alpha_1, \alpha_2$. Then we have
\[
\left|\Phi_2(\alpha_1, \alpha_2, \rho) - \mathbb{I}\{\alpha_1 \ge 0;\ \alpha_2 \ge 0\}\right| \le 2\exp\left(-\frac{\min\{\alpha_1, \alpha_2\}^2}{2}\right)
\]
Proof. Since $z_1, z_2$ are standard Gaussian random variables with covariance $\rho$, we have $z_1 \mid z_2 = \zeta \sim \mathcal{N}\left(\rho\zeta, 1 - \rho^2\right)$. According to the definition of the CDF,
\[
\Phi_2(\alpha_1, \alpha_2, \rho) = \int_{-\infty}^{\alpha_2}\int_{-\infty}^{\alpha_1} f_{z_1,z_2}(\zeta_1, \zeta_2, \rho)\,d\zeta_1\,d\zeta_2 = \int_{-\infty}^{\alpha_2}\int_{-\infty}^{\alpha_1} f_{z_1 \mid z_2 = \zeta_2}(\zeta_1)\,d\zeta_1\, f_{z_2}(\zeta_2)\,d\zeta_2
\]
Focusing on the inner integral, we substitute $\zeta' = \frac{\zeta_1 - \rho\zeta_2}{\sqrt{1 - \rho^2}}$, so that $d\zeta_1 = \sqrt{1 - \rho^2}\,d\zeta'$. Thus
\begin{align*}
\int_{-\infty}^{\alpha_1} f_{z_1 \mid z_2 = \zeta_2}(\zeta_1)\,d\zeta_1 &= \frac{1}{\sqrt{2\pi(1 - \rho^2)}}\int_{-\infty}^{\alpha_1}\exp\left(-\frac{(\zeta_1 - \rho\zeta_2)^2}{2(1 - \rho^2)}\right)d\zeta_1 = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\frac{\alpha_1 - \rho\zeta_2}{\sqrt{1 - \rho^2}}}\exp\left(-\frac{\zeta'^2}{2}\right)d\zeta' \\
&= \Phi_1\left(\frac{\alpha_1 - \rho\zeta_2}{\sqrt{1 - \rho^2}}\right) = \mathbb{I}\left\{\frac{\alpha_1 - \rho\zeta_2}{\sqrt{1 - \rho^2}} \ge 0\right\} + \varepsilon\left(\frac{\alpha_1 - \rho\zeta_2}{\sqrt{1 - \rho^2}}\right) = \mathbb{I}\{\alpha_1 \ge \rho\zeta_2\} + \varepsilon\left(\frac{\alpha_1 - \rho\zeta_2}{\sqrt{1 - \rho^2}}\right)
\end{align*}
where $\left|\varepsilon\left(\frac{\alpha_1 - \rho\zeta_2}{\sqrt{1 - \rho^2}}\right)\right| \le \exp\left(-\frac{(\alpha_1 - \rho\zeta_2)^2}{2(1 - \rho^2)}\right)$.
Therefore
\[
\Phi_2(\alpha_1, \alpha_2, \rho) = \int_{-\infty}^{\alpha_2}\left(\mathbb{I}\{\alpha_1 \ge \rho\zeta_2\} + \varepsilon\left(\frac{\alpha_1 - \rho\zeta_2}{\sqrt{1 - \rho^2}}\right)\right)f_{z_2}(\zeta_2)\,d\zeta_2 = \underbrace{\int_{-\infty}^{\alpha_2}\mathbb{I}\{\alpha_1 \ge \rho\zeta_2\}f_{z_2}(\zeta_2)\,d\zeta_2}_{\mathcal{I}_1} + \underbrace{\int_{-\infty}^{\alpha_2}\varepsilon\left(\frac{\alpha_1 - \rho\zeta_2}{\sqrt{1 - \rho^2}}\right)f_{z_2}(\zeta_2)\,d\zeta_2}_{\mathcal{I}_2}
\]
For $\mathcal{I}_1$, we have
\[
\mathcal{I}_1 = \int_{-\infty}^{\alpha_2}\mathbb{I}\{\alpha_1 \ge \rho\zeta_2\}f_{z_2}(\zeta_2)\,d\zeta_2 = \int_{-\infty}^{\min\left\{\frac{\alpha_1}{\rho},\,\alpha_2\right\}}f_{z_2}(\zeta_2)\,d\zeta_2 = \Phi_1\left(\min\left\{\frac{\alpha_1}{\rho}, \alpha_2\right\}\right) = \mathbb{I}\left\{\min\left\{\frac{\alpha_1}{\rho}, \alpha_2\right\} \ge 0\right\} + \varepsilon\left(\min\left\{\frac{\alpha_1}{\rho}, \alpha_2\right\}\right)
\]
Notice that, when $\alpha_1 \ge 0$ and $\alpha_2 \ge 0$, we must have $\min\left\{\frac{\alpha_1}{\rho}, \alpha_2\right\} \ge 0$. Conversely, when $\min\left\{\frac{\alpha_1}{\rho}, \alpha_2\right\} \ge 0$, we must have that $\alpha_1 \ge 0$ and $\alpha_2 \ge 0$. Therefore $\mathbb{I}\left\{\min\left\{\frac{\alpha_1}{\rho}, \alpha_2\right\} \ge 0\right\} = \mathbb{I}\{\alpha_1 \ge 0;\ \alpha_2 \ge 0\}$. Thus, we have
\[
\Phi_2(\alpha_1, \alpha_2, \rho) = \mathbb{I}\{\alpha_1 \ge 0;\ \alpha_2 \ge 0\} + \varepsilon\left(\min\left\{\frac{\alpha_1}{\rho}, \alpha_2\right\}\right) + \mathcal{I}_2
\]
For $\mathcal{I}_2$, we have
\[
|\mathcal{I}_2| \le \int_{-\infty}^{\alpha_2}\left|\varepsilon\left(\frac{\alpha_1 - \rho\zeta_2}{\sqrt{1 - \rho^2}}\right)\right|f_{z_2}(\zeta_2)\,d\zeta_2 \le \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\alpha_2}\exp\left(-\frac{(\alpha_1 - \rho\zeta_2)^2}{2(1 - \rho^2)} - \frac{\zeta_2^2}{2}\right)d\zeta_2 = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{\alpha_1^2}{2}\right)\int_{-\infty}^{\alpha_2}\exp\left(-\frac{(\zeta_2 - \rho\alpha_1)^2}{2(1 - \rho^2)}\right)d\zeta_2 \le \exp\left(-\frac{\alpha_1^2}{2}\right)
\]
Moreover, since $\left|\varepsilon\left(\min\left\{\frac{\alpha_1}{\rho}, \alpha_2\right\}\right)\right| \le \exp\left(-\frac{1}{2}\min\left\{\frac{\alpha_1}{\rho}, \alpha_2\right\}^2\right)$, we have that
\[
\left|\Phi_2(\alpha_1, \alpha_2, \rho) - \mathbb{I}\{\alpha_1 \ge 0;\ \alpha_2 \ge 0\}\right| \le \exp\left(-\frac{1}{2}\min\left\{\frac{\alpha_1}{\rho}, \alpha_2\right\}^2\right) + \exp\left(-\frac{\alpha_1^2}{2}\right) \le \exp\left(-\frac{\min\{\alpha_1, \alpha_2\}^2}{2}\right) + \exp\left(-\frac{\alpha_1^2}{2}\right)
\]
Exchanging $\alpha_1$ and $\alpha_2$, we likewise have
\[
\left|\Phi_2(\alpha_1, \alpha_2, \rho) - \mathbb{I}\{\alpha_1 \ge 0;\ \alpha_2 \ge 0\}\right| \le \exp\left(-\frac{\min\{\alpha_1, \alpha_2\}^2}{2}\right) + \exp\left(-\frac{\alpha_2^2}{2}\right)
\]
Thus, we have
\[
\left|\Phi_2(\alpha_1, \alpha_2, \rho) - \mathbb{I}\{\alpha_1 \ge 0;\ \alpha_2 \ge 0\}\right| \le 2\exp\left(-\frac{\min\{\alpha_1, \alpha_2\}^2}{2}\right)
\]

D.1.3 Uni-variate Coupled Expectation

Lemma D.6. Let $z_1, z_2 \sim \mathcal{N}(0, 1)$ with $\mathrm{Cov}(z_1, z_2) = \rho$. Let $a, b \in \mathbb{R}$. Define
\[
T_1 = \exp\left(-\frac{a^2}{2}\right)\Phi_1\left(\frac{\rho a - b}{\sqrt{1 - \rho^2}}\right); \quad T_2 = \exp\left(-\frac{b^2}{2}\right)\Phi_1\left(\frac{\rho b - a}{\sqrt{1 - \rho^2}}\right)
\]
Then we have that
\[
\mathbb{E}\left[z_1\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] = \frac{1}{\sqrt{2\pi}}\left(T_1 + \rho T_2\right); \quad \mathbb{E}\left[z_2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] = \frac{1}{\sqrt{2\pi}}\left(T_2 + \rho T_1\right)
\]
Proof. Writing the expectation in integral form, we have that
\[
\mathbb{E}\left[z_1\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] = \int_b^\infty\int_a^\infty z_1 f(z_1, z_2)\,dz_1\,dz_2 = \int_b^\infty\left(\int_a^\infty z_1 f(z_1 \mid z_2)\,dz_1\right)f(z_2)\,dz_2
\]
Since $z_1, z_2 \sim \mathcal{N}(0,1)$ with covariance $\rho$, we have that $z_1 \mid z_2 \sim \mathcal{N}\left(\rho z_2, 1 - \rho^2\right)$. Therefore, let $\rho' = 1 - \rho^2$; we have
\[
f(z_1 \mid z_2) = \frac{1}{\sqrt{2\pi\rho'}}\exp\left(-\frac{(z_1 - \rho z_2)^2}{2\rho'}\right)
\]
Thus, by Lemma D.14 (where we set $\kappa = \sqrt{\rho'}$ and $\mu = \rho z_2$), we have that
\[
\int_a^\infty z_1 f(z_1 \mid z_2)\,dz_1 = \frac{1}{\sqrt{2\pi\rho'}}\left(\rho'\exp\left(-\frac{(a - \rho z_2)^2}{2\rho'}\right) + \rho z_2\sqrt{2\pi\rho'}\,\Phi_1\left(\frac{\rho z_2 - a}{\sqrt{\rho'}}\right)\right) = \sqrt{\frac{\rho'}{2\pi}}\exp\left(-\frac{(a - \rho z_2)^2}{2\rho'}\right) + \rho z_2\Phi_1\left(\frac{\rho z_2 - a}{\sqrt{\rho'}}\right)
\]
Plugging into the original integral gives
\[
\mathbb{E}\left[z_1\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] = \sqrt{\frac{\rho'}{2\pi}}\int_b^\infty \exp\left(-\frac{(a - \rho z_2)^2}{2\rho'}\right)f(z_2)\,dz_2 + \rho\int_b^\infty z_2\Phi_1\left(\frac{\rho z_2 - a}{\sqrt{\rho'}}\right)f(z_2)\,dz_2
\]
Notice that
\[
\exp\left(-\frac{(a - \rho z_2)^2}{2\rho'} - \frac{z_2^2}{2}\right) = \exp\left(-\frac{a^2 - 2\rho a z_2 + z_2^2}{2\rho'}\right) = \exp\left(-\frac{(z_2 - \rho a)^2}{2\rho'}\right)\cdot\exp\left(-\frac{a^2}{2}\right)
\]
From a previous result, we have an identity for the conditional probability, which expresses the CDF term as an integral:
\[
\Phi_1\left(\frac{\rho z_2 - a}{\sqrt{1 - \rho^2}}\right) = \int_a^\infty f(z_1 \mid z_2)\,dz_1 \quad (69)
\]
and so:
\[
\rho\int_b^\infty z_2\Phi_1\left(\frac{\rho z_2 - a}{\sqrt{1 - \rho^2}}\right)f(z_2)\,dz_2 = \rho\int_b^\infty z_2\int_a^\infty f(z_1 \mid z_2)\,dz_1\, f(z_2)\,dz_2
\]
Moreover, by applying Lemma D.16, we have
\[
\mathbb{E}\left[z_1\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] = \frac{\sqrt{\rho'}}{2\pi}\exp\left(-\frac{a^2}{2}\right)\int_b^\infty\exp\left(-\frac{(z_2 - \rho a)^2}{2\rho'}\right)
\]
\[
dz_2 + \rho\int_b^\infty z_2\int_a^\infty f(z_1 \mid z_2)\,dz_1\, f(z_2)\,dz_2 = \frac{\rho'}{\sqrt{2\pi}}\exp\left(-\frac{a^2}{2}\right)\int_b^\infty f(z_2 \mid z_1 = a)\,dz_2 + \rho\int_b^\infty\int_a^\infty z_2\underbrace{f(z_1 \mid z_2)f(z_2)}_{f(z_1, z_2)}\,dz_1\,dz_2 = \frac{\rho'}{\sqrt{2\pi}}\exp\left(-\frac{a^2}{2}\right)\Phi_1\left(\frac{\rho a - b}{\sqrt{1 - \rho^2}}\right) + \rho\,\mathbb{E}\left[z_2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right]
\]
Therefore, we can conclude that
\[
\mathbb{E}\left[z_1\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] - \rho\,\mathbb{E}\left[z_2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] = \frac{\rho'}{\sqrt{2\pi}}\exp\left(-\frac{a^2}{2}\right)\Phi_1\left(\frac{\rho a - b}{\sqrt{1 - \rho^2}}\right) = \frac{\rho'}{\sqrt{2\pi}}T_1 \quad (70)
\]
Switching $z_1, z_2$ and $a, b$ gives
\[
\mathbb{E}\left[z_2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] - \rho\,\mathbb{E}\left[z_1\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] = \frac{\rho'}{\sqrt{2\pi}}\exp\left(-\frac{b^2}{2}\right)\Phi_1\left(\frac{\rho b - a}{\sqrt{1 - \rho^2}}\right) = \frac{\rho'}{\sqrt{2\pi}}T_2 \quad (71)
\]
Solving for $\mathbb{E}\left[z_1\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right]$ and $\mathbb{E}\left[z_2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right]$ from (70) and (71) gives
\[
\mathbb{E}\left[z_1\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] = \frac{1}{\sqrt{2\pi}}\left(T_1 + \rho T_2\right); \quad \mathbb{E}\left[z_2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] = \frac{1}{\sqrt{2\pi}}\left(T_2 + \rho T_1\right)
\]

Lemma D.7. Let $z_1, z_2 \sim \mathcal{N}(0, 1)$ with $\mathrm{Cov}(z_1, z_2) = \rho$. Let $a, b \in \mathbb{R}$. Define
\[
T_1 = \exp\left(-\frac{a^2}{2}\right)\Phi_1\left(\frac{\rho a - b}{\sqrt{1 - \rho^2}}\right); \quad T_2 = \exp\left(-\frac{b^2}{2}\right)\Phi_1\left(\frac{\rho b - a}{\sqrt{1 - \rho^2}}\right)
\]
Then we have that
\begin{align*}
\mathbb{E}\left[z_1^2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] &= \frac{\rho\sqrt{1 - \rho^2}}{2\pi}\exp\left(-\frac{a^2 - 2\rho ab + b^2}{2(1 - \rho^2)}\right) + \Phi_2(-a, -b, \rho) + \frac{1}{\sqrt{2\pi}}\left(aT_1 + \rho^2 bT_2\right) \\
\mathbb{E}\left[z_2^2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] &= \frac{\rho\sqrt{1 - \rho^2}}{2\pi}\exp\left(-\frac{a^2 - 2\rho ab + b^2}{2(1 - \rho^2)}\right) + \Phi_2(-a, -b, \rho) + \frac{1}{\sqrt{2\pi}}\left(bT_2 + \rho^2 aT_1\right)
\end{align*}
Proof. Again, we write the expectation in integral form to get that
\[
\mathbb{E}\left[z_1^2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] = \int_b^\infty\int_a^\infty z_1^2 f(z_1, z_2)\,dz_1\,dz_2 = \int_b^\infty\left(\int_a^\infty z_1^2 f(z_1 \mid z_2)\,dz_1\right)f(z_2)\,dz_2
\]
Since $z_1, z_2 \sim \mathcal{N}(0,1)$ with $\mathrm{Cov}(z_1, z_2) = \rho$, we have that $z_1 \mid z_2 \sim \mathcal{N}\left(\rho z_2, 1 - \rho^2\right)$. Define $\rho' = 1 - \rho^2$. By Lemma D.15 (where we set $\kappa = \sqrt{\rho'}$ and $\mu = \rho z_2$), we have that
\[
\int_a^\infty z_1^2 f(z_1 \mid z_2)\,dz_1 = \frac{1}{\sqrt{2\pi\rho'}}\left(\rho'(a + \rho z_2)\exp\left(-\frac{(a - \rho z_2)^2}{2\rho'}\right) + \sqrt{2\pi\rho'}\left(\rho' + \rho^2 z_2^2\right)\Phi_1\left(\frac{\rho z_2 - a}{\sqrt{\rho'}}\right)\right)
\]
\[
= \rho^2 z_2^2\Phi_1\left(\frac{\rho z_2 - a}{\sqrt{\rho'}}\right) + \rho z_2\sqrt{\frac{\rho'}{2\pi}}\exp\left(-\frac{(a - \rho z_2)^2}{2\rho'}\right) + a\sqrt{\frac{\rho'}{2\pi}}\exp\left(-\frac{(a - \rho z_2)^2}{2\rho'}\right) + \rho'\Phi_1\left(\frac{\rho z_2 - a}{\sqrt{\rho'}}\right)
\]
Therefore, we have
\[
\mathbb{E}\left[z_1^2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] = \rho^2\int_b^\infty z_2^2\Phi_1\left(\frac{\rho z_2 - a}{\sqrt{\rho'}}\right)f(z_2)\,dz_2 + \rho\sqrt{\frac{\rho'}{2\pi}}\int_b^\infty z_2\exp\left(-\frac{(a - \rho z_2)^2}{2\rho'}\right)f(z_2)\,dz_2 + a\sqrt{\frac{\rho'}{2\pi}}\int_b^\infty \exp\left(-\frac{(a - \rho z_2)^2}{2\rho'}\right)f(z_2)\,dz_2 + \rho'\int_b^\infty \Phi_1\left(\frac{\rho z_2 - a}{\sqrt{\rho'}}\right)f(z_2)\,dz_2
\]
To start, by Lemma D.16, we have
\[
\Phi_1\left(\frac{\rho z_2 - a}{\sqrt{\rho'}}\right)f(z_2) = \int_a^\infty f(z_1 \mid z_2)f(z_2)\,dz_1 = \int_a^\infty f(z_1, z_2)\,dz_1
\]
Therefore, for the first term, we have
\[
\int_b^\infty z_2^2\Phi_1\left(\frac{\rho z_2 - a}{\sqrt{\rho'}}\right)f(z_2)\,dz_2 = \int_b^\infty\int_a^\infty z_2^2 f(z_1, z_2)\,dz_1\,dz_2 = \mathbb{E}\left[z_2^2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right]
\]
For the last term, we have
\[
\int_b^\infty \Phi_1\left(\frac{\rho z_2 - a}{\sqrt{\rho'}}\right)f(z_2)\,dz_2 = \int_b^\infty\int_a^\infty f(z_1, z_2)\,dz_1\,dz_2 = \Phi_2(-a, -b, \rho)
\]
Next, we notice that
\[
\exp\left(-\frac{(a - \rho z_2)^2}{2\rho'}\right)f(z_2) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{a^2 - 2\rho a z_2 + z_2^2}{2\rho'}\right) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{a^2}{2}\right)\exp\left(-\frac{(z_2 - \rho a)^2}{2\rho'}\right)
\]
Therefore, for the second term, we apply Lemma D.14 (where we set $\kappa = \sqrt{\rho'}$ and $\mu = \rho a$) to get that
\[
\int_b^\infty z_2\exp\left(-\frac{(a - \rho z_2)^2}{2\rho'}\right)f(z_2)\,dz_2 = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{a^2}{2}\right)\left(\rho'\exp\left(-\frac{(\rho a - b)^2}{2\rho'}\right) + \sqrt{2\pi\rho'}\,\rho a\,\Phi_1\left(\frac{\rho a - b}{\sqrt{\rho'}}\right)\right) = \frac{\rho'}{\sqrt{2\pi}}\exp\left(-\frac{a^2 - 2\rho ab + b^2}{2\rho'}\right) + \rho a\sqrt{\rho'}\exp\left(-\frac{a^2}{2}\right)\Phi_1\left(\frac{\rho a - b}{\sqrt{\rho'}}\right)
\]
For the third term, we have
\[
\int_b^\infty \exp\left(-\frac{(a - \rho z_2)^2}{2\rho'}\right)f(z_2)\,dz_2 = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{a^2}{2}\right)\int_b^\infty \exp\left(-\frac{(z_2 - \rho a)^2}{2\rho'}\right)
\]
\[
dz_2 = \sqrt{\rho'}\exp\left(-\frac{a^2}{2}\right)\int_b^\infty f(z_2 \mid z_1 = a)\,dz_2 = \sqrt{\rho'}\exp\left(-\frac{a^2}{2}\right)\Phi_1\left(\frac{\rho a - b}{\sqrt{\rho'}}\right)
\]
Combining all four terms gives
\begin{align*}
\mathbb{E}\left[z_1^2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] &= \rho\sqrt{\frac{\rho'}{2\pi}}\left(\frac{\rho'}{\sqrt{2\pi}}\exp\left(-\frac{a^2 - 2\rho ab + b^2}{2\rho'}\right) + \rho a\sqrt{\rho'}\exp\left(-\frac{a^2}{2}\right)\Phi_1\left(\frac{\rho a - b}{\sqrt{\rho'}}\right)\right) + a\sqrt{\frac{\rho'}{2\pi}}\cdot\sqrt{\rho'}\exp\left(-\frac{a^2}{2}\right)\Phi_1\left(\frac{\rho a - b}{\sqrt{\rho'}}\right) \\
&\quad + \rho^2\,\mathbb{E}\left[z_2^2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] + \rho'\Phi_2(-a, -b, \rho) \\
&= \frac{\rho'^{3/2}\rho}{2\pi}\exp\left(-\frac{a^2 - 2\rho ab + b^2}{2\rho'}\right) + \frac{\rho' a}{\sqrt{2\pi}}\left(\rho^2 + 1\right)\exp\left(-\frac{a^2}{2}\right)\Phi_1\left(\frac{\rho a - b}{\sqrt{\rho'}}\right) + \rho^2\,\mathbb{E}\left[z_2^2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] + \rho'\Phi_2(-a, -b, \rho)
\end{align*}
Therefore, we can conclude that
\begin{align*}
\mathbb{E}\left[z_1^2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] - \rho^2\,\mathbb{E}\left[z_2^2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] &= \frac{\rho'^{3/2}\rho}{2\pi}\exp\left(-\frac{a^2 - 2\rho ab + b^2}{2\rho'}\right) + \frac{\rho' a}{\sqrt{2\pi}}\left(\rho^2 + 1\right)T_1 + \rho'\Phi_2(-a, -b, \rho) \\
\mathbb{E}\left[z_2^2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] - \rho^2\,\mathbb{E}\left[z_1^2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] &= \frac{\rho'^{3/2}\rho}{2\pi}\exp\left(-\frac{a^2 - 2\rho ab + b^2}{2\rho'}\right) + \frac{\rho' b}{\sqrt{2\pi}}\left(\rho^2 + 1\right)T_2 + \rho'\Phi_2(-a, -b, \rho)
\end{align*}
Solving for $\mathbb{E}\left[z_1^2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right]$ and $\mathbb{E}\left[z_2^2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right]$ gives
\begin{align*}
\mathbb{E}\left[z_1^2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] &= \frac{\rho\sqrt{\rho'}}{2\pi}\exp\left(-\frac{a^2 - 2\rho ab + b^2}{2\rho'}\right) + \Phi_2(-a, -b, \rho) + \frac{1}{\sqrt{2\pi}}\left(aT_1 + \rho^2 bT_2\right) \\
\mathbb{E}\left[z_2^2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] &= \frac{\rho\sqrt{\rho'}}{2\pi}\exp\left(-\frac{a^2 - 2\rho ab + b^2}{2\rho'}\right) + \Phi_2(-a, -b, \rho) + \frac{1}{\sqrt{2\pi}}\left(bT_2 + \rho^2 aT_1\right)
\end{align*}
Writing $\rho' = 1 - \rho^2$ gives the desired result.

Lemma D.8. Let $z_1, z_2 \sim \mathcal{N}(0, 1)$ with $\mathrm{Cov}(z_1, z_2) = \rho$. Let $a, b \in \mathbb{R}$. Define
\[
T_1 = \exp\left(-\frac{a^2}{2}\right)\Phi_1\left(\frac{\rho a - b}{\sqrt{1 - \rho^2}}\right); \quad T_2 = \exp\left(-\frac{b^2}{2}\right)\Phi_1\left(\frac{\rho b - a}{\sqrt{1 - \rho^2}}\right)
\]
Then we have that
\[
\mathbb{E}\left[z_1 z_2\mathbb{I}\{z_1 \ge a;\ z_2 \ge b\}\right] = \frac{\sqrt{1 - \rho^2}}{2\pi}\exp\left(-\frac{a^2 - 2\rho ab + b^2}{2(1 - \rho^2)}\right) + \rho\,\Phi_2(-a, -b, \rho) + \frac{\rho}{\sqrt{2\pi}}\left(aT_1 + bT_2\right)
\]
Proof.
T o start, we write out the integral form of the exp ectation as E [ z 1 z 2 I { z 1 ≥ a ; z 2 ≥ b } ] = Z ∞ b Z ∞ a z 1 z 2 f ( z 1 , z 2 ) dz 1 dz 2 = Z ∞ b  Z ∞ a z 1 f ( z 1 | z 2 ) dz 1  z 2 f ( z 2 ) dz 2 52 Since z 1 , z 2 ∼ N (0 , 1) with co v ariance ρ , w e hav e that z 1 | z 2 ∼ N  ρz 2 , 1 − ρ 2  . Therefore, let ρ ′ = 1 − ρ 2 , w e ha ve f ( z 1 | z 2 ) = 1 √ 2 π ρ ′ exp − ( z 1 − ρz 2 ) 2 2 ρ ′ ! Th us, b y Lemma D.14, we hav e that Z ∞ a z 1 f ( z 1 | z 2 ) dz 1 = 1 √ 2 π ρ ′ Z ∞ a z 1 exp − ( z 1 − ρz 2 ) 2 2 ρ ′ ! dz 1 = 1 √ 2 π ρ ′ ρ ′ exp − ( a − ρz 2 ) 2 2 ρ ′ ! + ρz 2 p 2 π ρ ′ Φ 1  ρz 2 − a √ ρ ′  ! = r ρ ′ 2 π exp − ( a − ρz 2 ) 2 2 ρ ′ ! + ρz 2 Φ 1  ρz 2 − a √ ρ ′  where w e set κ = √ ρ ′ and µ = ρz 2 . Therefore E [ z 1 z 2 I { z 1 ≥ a ; z 2 ≥ b } ] = Z ∞ b r ρ ′ 2 π exp − ( a − ρz 2 ) 2 2 ρ ′ ! + ρz 2 Φ 1  ρz 2 − a √ ρ ′  ! z 2 f ( z 2 ) dz 2 = r ρ ′ 2 π Z ∞ b z 2 exp − ( a − ρz 2 ) 2 2 ρ ′ ! f ( z 2 ) dz 2 + ρ Z ∞ b z 2 2 Φ 1  ρz 2 − a √ ρ ′  f ( z 2 ) dz 2 T o start, by Lemma D.16, w e hav e Φ 1  ρz 2 − a √ ρ ′  f ( z 2 ) = Z ∞ a f ( z 1 | z 2 ) dz 1 f ( z 2 ) = Z ∞ a f ( z 1 , z 2 ) dz 1 Therefore, for the second term, we hav e Z ∞ b z 2 2 Φ 1  ρz 2 − a √ ρ ′  f ( z 2 ) dz 2 = Z ∞ b Z ∞ a z 2 2 f ( z 1 , z 2 ) dz 1 dz 2 = E  z 2 2 I { z 1 ≥ a ; z 2 ≥ b }  F or the first term, we ha ve exp − ( a − ρz 2 ) 2 2 ρ ′ ! f ( z 2 ) = 1 √ 2 π exp  − z 2 2 − 2 ρaz 2 + a 2 2 ρ ′  = 1 √ 2 π exp  − a 2 2  exp − ( z 2 − ρa ) 2 2 ρ ′ ! Therefore, b y Lemma D.14, the first term can b e written as Z ∞ b z 2 exp − ( a − ρz 2 ) 2 2 ρ ′ ! f ( z 2 ) dz 2 = 1 √ 2 π exp  − a 2 2  Z ∞ b z 2 exp − ( z 2 − ρa ) 2 2 ρ ′ ! dz 2 = 1 √ 2 π exp  − a 2 2  ρ ′ exp − ( ρa − b ) 2 2 ρ ′ ! + p 2 π ρ ′ ρa Φ 1  ρa − b √ ρ ′  ! = ρ ′ √ 2 π exp  − a 2 − 2 ρab + b 2 2 ρ ′  + ρa p ρ ′ exp  − a 2 2  Φ 1  ρa − b √ ρ ′  where w e set κ = √ ρ ′ and µ = ρa in Lemma D.14. 
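As an aside (not part of the proof), the closed form of Lemma D.14, which is used repeatedly in these derivations, can be sanity-checked numerically. The sketch below is ours, stdlib-only, and the helper names are illustrative:

```python
import math

def Phi(x: float) -> float:
    # Standard normal CDF, Phi_1 in the paper's notation.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def trunc_first_moment_closed(a: float, mu: float, kappa: float) -> float:
    # Closed form of Lemma D.14: int_a^inf z * exp(-(z - mu)^2 / (2 kappa^2)) dz.
    return (kappa ** 2 * math.exp(-((mu - a) ** 2) / (2.0 * kappa ** 2))
            + kappa * mu * math.sqrt(2.0 * math.pi) * Phi((mu - a) / kappa))

def trunc_first_moment_numeric(a: float, mu: float, kappa: float, n: int = 200_000) -> float:
    # Trapezoid rule; the integrand is negligible beyond mu + 12*kappa.
    hi = max(a, mu) + 12.0 * kappa
    h = (hi - a) / n
    total = 0.0
    for i in range(n + 1):
        z = a + i * h
        w = 0.5 if i in (0, n) else 1.0
        total += w * z * math.exp(-((z - mu) ** 2) / (2.0 * kappa ** 2))
    return total * h

for a, mu, kappa in [(0.3, 0.5, 0.8), (-1.0, 0.0, 1.0), (0.0, -0.7, 1.3)]:
    closed = trunc_first_moment_closed(a, mu, kappa)
    numeric = trunc_first_moment_numeric(a, mu, kappa)
    assert abs(closed - numeric) < 1e-5, (a, mu, kappa, closed, numeric)
```

For instance, with a = −1, µ = 0, κ = 1 the formula reduces to exp(−1/2), matching the direct antiderivative of z·exp(−z²/2).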
Com bining the tw o terms, we hav e E [ z 1 z 2 I { z 1 ≥ a ; z 2 ≥ b } ] = r ρ ′ 2 π  ρ ′ √ 2 π exp  − a 2 − 2 ρab + b 2 2 ρ ′  + ρa p ρ ′ exp  − a 2 2  Φ 1  ρa − b √ ρ ′  + ρ E  z 2 2 I { z 1 ≥ a ; z 2 ≥ b }  = ρ ′ 3 2 2 π exp  − a 2 − 2 ρab + b 2 2 ρ ′  + aρρ ′ √ 2 π T 1 + ρ E  z 2 2 I { z 1 ≥ a ; z 2 ≥ b }  53 F rom Lemma D.7, we ha ve that E  z 2 2 I { z 1 ≥ a ; z 2 ≥ b }  = ρ √ ρ ′ 2 π exp  − a 2 − 2 ρab + b 2 2 ρ ′  + Φ 2 ( − a, − b, ρ ) + 1 √ 2 π  bT 2 + ρ 2 aT 1  Therefore E [ z 1 z 2 I { z 1 ≥ a ; z 2 ≥ b } ] = √ ρ ′ 2 π exp  − a 2 − 2 ρab + b 2 2 ρ ′  + ρ Φ 2 ( − a, − b, ρ ) + ρ √ 2 π ( aT 1 + bT 2 ) W rite ρ ′ = 1 − ρ 2 giv es the desired result. Lemma D.9. L et z 1 ∼ N  µ 1 , κ 2 1  and z 2 ∼ N  µ 2 , κ 2 2  , with Cov ( z 1 , z 2 ) = κ 1 κ 2 ρ . L et a, b ∈ R . Then we have E [ z 1 z 2 I { z 1 ≥ a ; z 2 ≥ b } ] = ( µ 1 µ 2 + κ 1 κ 2 ρ ) Φ  µ 1 − a κ 1 , µ 2 − b κ 2 , ρ  + κ 1 κ 2 2 π exp − 1 2 (1 − ρ 2 ) ( µ 1 − a ) 2 κ 2 1 − 2 ρ κ 1 κ 2 ( µ 1 − a ) ( µ 2 − b ) + ( µ 2 − b ) 2 κ 2 2 !! + 1 √ 2 π (( κ 2 ρa + κ 1 µ 2 ) T 1 + ( κ 1 ρb + κ 2 µ 1 ) T 2 ) Her e T 1 , T 2 ar e define d as T 1 = exp − ( a − µ 1 ) 2 2 κ 2 1 ! Φ 1 1 p 1 − ρ 2  ρ ( a − µ 1 ) κ 1 − b − µ 2 κ 2  ! T 2 = exp − ( b − µ 2 ) 2 2 κ 2 2 ! Φ 1 1 p 1 − ρ 2  ρ ( b − µ 2 ) κ 2 − a − µ 1 κ 1  ! Pr o of. Let ˆ z 1 = z 1 − µ 1 κ 1 and ˆ z 1 = z 2 − µ 2 κ 2 . Then we hav e ˆ z 1 , ˆ z 2 ∼ N (0 , 1) . 
Moreo v er, Co v ( ˆ z 1 , ˆ z 2 ) = E [ ˆ z 1 ˆ z 2 ] = 1 κ 1 κ 2 E [( z 1 − µ 1 ) ( z 2 − µ 2 )] = ρ Since z 1 = κ 1 ˆ z 1 + µ 1 and z 2 = κ 2 ˆ z 2 + µ 2 , we hav e E [ z 1 z 2 I { z 1 ≥ a ; z 2 ≥ b } ] = E  ( κ 1 ˆ z 1 + µ 1 ) ( κ 2 ˆ z 2 + µ 2 ) I  ˆ z 1 ≥ a − µ 1 κ 1 ; ˆ z 2 ≥ b − µ 2 κ 2  = κ 1 κ 2 E h ˆ z 1 ˆ z 2 I n ˆ z 1 ≥ ˆ a ; ˆ z 2 ≥ ˆ b oi + µ 1 µ 2 E h I n ˆ z 1 ≥ ˆ a ; ˆ z 2 ≥ ˆ b oi + κ 1 µ 2 E h ˆ z 1 I n ˆ z 1 ≥ ˆ a ; ˆ z 2 ≥ ˆ b oi + κ 2 µ 1 E h ˆ z 2 I n ˆ z 1 ≥ ˆ a ; ˆ z 2 ≥ ˆ b oi where w e re-defined ˆ a = a − µ 1 κ 1 and ˆ b = b − µ 2 κ 2 . By Lemma D.16, Lemma D.6, and Lemma D.8, we hav e E h ˆ z 1 I n ˆ z 1 ≥ ˆ a ; ˆ z 2 ≥ ˆ b oi = 1 √ 2 π ( T 1 + ρT 2 ) ; E h ˆ z 2 I n ˆ z 1 ≥ ˆ a ; ˆ z 2 ≥ ˆ b oi = 1 √ 2 π ( T 2 + ρT 1 ) E h ˆ z 1 ˆ z 2 I n ˆ z 1 ≥ ˆ a ; ˆ z 2 ≥ ˆ b oi = p 1 − ρ 2 2 π exp − ˆ a 2 − 2 ρ ˆ a ˆ b + ˆ b 2 2 (1 − ρ 2 ) ! + ρ Φ 2  − ˆ a, − ˆ b, ρ  + ρ √ 2 π  ˆ aT 1 + ˆ bT 2  and E h I n ˆ z 1 ≥ ˆ a ; ˆ z 2 ≥ ˆ b oi = Φ 2  − ˆ a, − ˆ b, ρ  . Here, T 1 , T 2 are defined as T 1 = exp  − ˆ a 2 2  Φ ρ ˆ a − ˆ b p 1 − ρ 2 ! ; T 2 = exp − ˆ b 2 2 ! Φ ρ ˆ b − ˆ a p 1 − ρ 2 ! Plugging in the v alue of ˆ a and ˆ b give s the desired result. 54 D.1.4 Multi-va riate Coupled Exp ectation Lemma D.10. L et c ∼ N  µ , κ 2 I  , and let u ∈ R d . Define z = c ⊤ u . Then we have E [ c I { z ≥ 0 } ] = µ Φ 1  − µ ⊤ u κ ∥ u ∥ 2  + κ √ 2 π exp −  µ ⊤ u  2 2 κ 2 ∥ u ∥ 2 2 ! · u ∥ u ∥ 2 Pr o of. A ccording to the law of total exp ectation, E  c I  c ⊤ u ≥ 0  = E z [ E c [ c I { z ≥ 0 } | z ]] = E z [ E c [ c | z ] I { z ≥ 0 } ] By Lemma D.1, we hav e that E c [ c | z ] = µ + u ∥ u ∥ 2 2  z − µ ⊤ u  . Therefore E  c I  c ⊤ u ≥ 0  = u ∥ u ∥ 2 2 E z [ z I { z ≥ 0 } ] + µ − µ ⊤ u · u ∥ u ∥ 2 2 ! E z [ I { z ≥ 0 } ] By definition, E z [ I { z ≥ 0 } ] = Pr ( z ≥ 0) = 1 − Pr ( z ≤ 0) . 
Since, by Lemma D.13, z ∼ N  µ ⊤ u , κ 2 ∥ u ∥ 2 2  , w e ha ve that Pr ( z ≤ 0) = Pr  z − µ ⊤ u κ ∥ u ∥ 2 ≤ − µ ⊤ u ∥ u ∥ 2  = Φ 1  − µ ⊤ u κ ∥ u ∥ 2  Moreo ver, let ˆ z = z − µ ⊤ u κ ∥ u ∥ 2 , then we hav e E z [ z I { z ≥ 0 } ] = E ˆ z   κ ∥ u ∥ 2 ˆ z + µ ⊤ u  I  ˆ z ≥ − µ ⊤ u κ ∥ u ∥ 2  = κ ∥ u ∥ 2 E ˆ z  ˆ z I  ˆ z ≥ − µ ⊤ u κ ∥ u ∥ 2  + µ ⊤ u E z [ z ≥ 0] By the PDF of ˆ z , w e hav e E ˆ z [ ˆ z I { z ≥ 0 } ] = 1 √ 2 π Z ∞ a z exp  − z 2 2  dz = − 1 √ 2 π exp  − z 2 2  | ∞ a = 1 √ 2 π exp  − a 2 2  Therefore E z [ z I { z ≥ 0 } ] = κ ∥ u ∥ 2 √ 2 π exp −  µ ⊤ u  2 2 κ 2 ∥ u ∥ 2 2 ! + µ ⊤ u E z [ z ≥ 0] Plugging in gives E  c I  c ⊤ u ≥ 0  = µ Φ 1  − µ ⊤ u κ ∥ u ∥ 2  + κ √ 2 π exp −  µ ⊤ u  2 2 κ 2 ∥ u ∥ 2 2 ! · u ∥ u ∥ 2 Lemma D.11. L et c ∼ N  µ , κ 2 I  with κ ≤ 1 . L et u , v b e given, and let z 1 = c ⊤ u , z 2 = c ⊤ v . Then we have that     E c [ c I { z 1 ≥ 0 } ] − µ Φ 1  µ ⊤ u κ ∥ u ∥ 2      ∞ ≤ κ ∥ u ∥ 2 exp −  µ ⊤ u  2 2 κ 2 ∥ u ∥ 2 ! Pr o of. Giv en the form of the conditional exp ectation, we hav e E c [ c I { z 1 ≥ 0 } ] = Z ∞ 0 E c [ c | z 1 ] f 1 ( z 1 ) dz 1 = Z ∞ 0 µ + u ∥ u ∥ 2 2  z 1 − µ ⊤ u  ! f 1 ( z 1 ) dz 1 55 Since z 1 = c ⊤ u , we must hav e that z 1 ∼ N  µ ⊤ u , κ 2 ∥ u ∥ 2 2  . Define z ′ = z 1 − µ ⊤ u κ ∥ u ∥ 2 , then we hav e that z 1 = κ ∥ u ∥ 2 z ′ + µ ⊤ u E c [ c I { z 1 ≥ 0 } ] = Z ∞ − µ ⊤ u κ ∥ u ∥ 2 ( µ + κ u · z ′ ) f ( z ′ ) dz ′ = µ · Z µ ⊤ u κ ∥ u ∥ 2 −∞ f ( z ′ ) dz ′ + κ u · Z µ ⊤ u κ ∥ u ∥ 2 −∞ z ′ f ( z ′ ) dz ′ = µ Φ 1  µ ⊤ u κ ∥ u ∥ 2  + κ u exp −  µ ⊤ u  2 2 κ 2 ∥ u ∥ 2 ! where in the last equality we applied Lemma D.14 with a = 0 and κ = 1 in the lemma. Therefore, we hav e that     E c [ c I { z 1 ≥ 0 } ] − µ Φ 1  µ ⊤ u κ ∥ u ∥ 2      ∞ ≤ κ ∥ u ∥ ∞ exp −  µ ⊤ u  2 2 κ 2 ∥ u ∥ 2 ! No w, w e shall div e into E c  cc ⊤ I { z 1 ≥ 0; z 2 ≥ 0 }  . Lemma D.12. L et c ∼ N  µ , κ 2 I  with ∥ µ ∥ 2 ≥ 2 and κ ≤ 1 . 
L et u , v b e given, and let z 1 = µ ⊤ u , z 2 = µ ⊤ v . Then we have that     E c  cc ⊤ I { z 1 ≥ 0; z 2 ≥ 0 }  v −  µµ ⊤ v + 3 κ 2 v  Φ 1  µ ⊤ u 2 κ ∥ v ∥ 2  Φ 1  µ ⊤ v 2 κ ∥ v ∥ 2      ≤ ∆ wher e ∆ is given by ∆ = 2 κ ∥ v ∥ 2  ∥ µ ∥ 2  ϕ  µ ⊤ u 2 κ ∥ u ∥ 2  + ϕ  µ ⊤ u 2 κ ∥ v ∥ 2  + ∥ µ ∥ ∞  ψ  µ ⊤ u 2 κ ∥ u ∥ 2  + ψ  µ ⊤ u 2 κ ∥ u ∥ 2  Pr o of. W e hav e E c  cc ⊤ I { z 1 ≥ 0; z 2 ≥ 0 }  = Z z 1 ,z 2 ≥ 0 E c  cc ⊤ | z 1 , z 2  f ( z 1 , z 2 ) dz 1 dz 2 Notice that E c  cc ⊤ | z 1 , z 2  = Co v ( c | z 1 , z 2 ) + E [ c | z 1 , z 2 ] E [ c | z 1 , z 2 ] ⊤ By the form of Cov ( c | z 1 , z 2 ) , we hav e Co v ( c | z 1 , z 2 ) = κ 2 I − κ 2  uv ⊤ − vu ⊤  2 ∥ v ∥ 2 2 ∥ u ∥ 2 2 − ⟨ v , u ⟩ 2 := M By the form of E [ c | z 1 , z 2 ] w e hav e E [ c | z 1 , z 2 ] = z 1 · s 1 + z 2 · s 2 + s 3 where s 1 = ∥ v ∥ 2 2 u − ⟨ u , v ⟩ v ∥ v ∥ 2 2 ∥ u ∥ 2 2 − ⟨ v , u ⟩ 2 ; s 2 = ∥ u ∥ 2 2 v − ⟨ u , v ⟩ u ∥ v ∥ 2 2 ∥ u ∥ 2 2 − ⟨ v , u ⟩ 2 s 3 = µ −  ∥ v ∥ 2 2 u − ⟨ u , v ⟩ v  ⟨ µ , u ⟩ +  ∥ u ∥ 2 2 v − ⟨ u , v ⟩ u  ⟨ µ , v ⟩ ∥ v ∥ 2 2 ∥ u ∥ 2 2 − ⟨ v , u ⟩ 2 56 Therefore E [ c | z 1 , z 2 ] E [ c | z 1 , z 2 ] ⊤ = ( z 1 · s 1 + z 2 · s 2 + s 3 ) ( z 1 · s 1 + z 2 · s 2 + s 3 ) ⊤ Recall that we are interested in E c  cc ⊤ I { z 1 ≥ 0; z 2 ≥ 0 }  = Z z 1 ,z 2 ≥ 0  Co v ( c | z 1 , z 2 ) + E [ c | z 1 , z 2 ] E [ c | z 1 , z 2 ] ⊤  f ( z 1 , z 2 ) dz 1 dz 2 Let ˆ z 1 = z 1 − µ ⊤ u κ ∥ u ∥ 2 ; ˆ z 2 = z 1 − µ ⊤ v κ ∥ v ∥ 2 Then w e hav e ˆ z 1 , ˆ z 2 ∼ N (0 , 1); Co v ( ˆ z 1 , ˆ z 2 ) = u ⊤ v ∥ u ∥ 2 ∥ v ∥ 2 := ρ Th us, ˆ z 1 | ˆ z 2 = γ 2 ∼ N  ργ 2 , 1 − ρ 2  ; ˆ z 2 | ˆ z 1 = γ 1 ∼ N  ργ 1 , 1 − ρ 2  Moreo ver, E [ c | z 1 , z 2 ] = z 1 · s 1 + z 2 · s 2 + s 3 =  κ ∥ u ∥ 2 ˆ z 1 + µ ⊤ u  s 1 +  κ ∥ v ∥ 2 ˆ z 2 + µ ⊤ v  s 2 + s 3 Redefine ˆ s 1 = κ ∥ u ∥ 2 s 1 ; ˆ s 2 = κ ∥ v ∥ 2 s 2 ; ˆ s 3 = µ ⊤ u · s 1 + µ ⊤ v · s 2 + s 3 = µ Then w e hav e E [ c | z 1 , z 2 ] = ˆ s 1 ˆ z 1 + ˆ s 2 ˆ z 2 + ˆ s 3 In this case, let ˆ f b e 
the joint PDF of ˆ z 1 and ˆ z 2 , then we hav e ˆ f ( ˆ z 1 , ˆ z 2 ) = 1 2 π p 1 − ρ 2 exp  − 1 2(1 − ρ 2 )  z 2 1 − 2 ρz 1 z 2 + z 2 2   = κ 2 ∥ u ∥ 2 ∥ v ∥ 2 2 π κ 2 ∥ u ∥ 2 ∥ v ∥ 2 p 1 − ρ 2 exp  − 1 2(1 − ρ 2 )  z 2 1 − 2 ρz 1 z 2 + z 2 2   = κ 2 ∥ u ∥ 2 ∥ v ∥ 2 f ( ˆ z 1 , ˆ z 2 ) 57 Therefore, since dz 1 = κ ∥ u ∥ 2 d ˆ z 1 , and dz 2 = κ ∥ v ∥ 2 d ˆ z 2 , we hav e E c  cc ⊤ I { z 1 ≥ 0; z 2 ≥ 0 }  = Z z 1 ,z 2 ≥ 0  Co v ( c | z 1 , z 2 ) + E [ c | z 1 , z 2 ] E [ c | z 1 , z 2 ] ⊤  f ( z 1 , z 2 ) dz 1 dz 2 = Z ∞ − µ ⊤ v κ ∥ v ∥ 2 Z ∞ − µ ⊤ u κ ∥ u ∥ 2  Co v ( c | z 1 , z 2 ) + E [ c | z 1 , z 2 ] E [ c | z 1 , z 2 ] ⊤  ˆ f ( ˆ z 1 , ˆ z 2 ) d ˆ z 1 d ˆ z 2 = ˆ s 1 ˆ s ⊤ 1 Z ∞ − µ ⊤ v κ ∥ v ∥ 2 Z ∞ − µ ⊤ u κ ∥ u ∥ 2 ˆ z 2 1 ˆ f ( ˆ z 1 , ˆ z 2 ) d ˆ z 1 d ˆ z 2 | {z } I 1 + ˆ s 2 ˆ s ⊤ 2 Z ∞ − µ ⊤ v κ ∥ v ∥ 2 Z ∞ − µ ⊤ u κ ∥ u ∥ 2 ˆ z 2 2 ˆ f ( ˆ z 1 , ˆ z 2 ) d ˆ z 1 d ˆ z 2 | {z } I 2 +  ˆ s 1 ˆ s ⊤ 2 + ˆ s 2 ˆ s ⊤ 1  Z ∞ − µ ⊤ v κ ∥ v ∥ 2 Z ∞ − µ ⊤ u κ ∥ u ∥ 2 ˆ z 1 ˆ z 2 ˆ f ( ˆ z 1 , ˆ z 2 ) d ˆ z 1 d ˆ z 2 | {z } I 3 +  ˆ s 1 ˆ s 3 + ˆ s 3 ˆ s ⊤ 1  Z ∞ − µ ⊤ v κ ∥ v ∥ 2 Z ∞ − µ ⊤ u κ ∥ u ∥ 2 ˆ z 1 ˆ f ( ˆ z 1 , ˆ z 2 ) d ˆ z 1 d ˆ z 2 | {z } I 4 +  ˆ s 2 ˆ s ⊤ 3 + ˆ s 3 ˆ s ⊤ 2  Z ∞ − µ ⊤ v κ ∥ v ∥ 2 Z ∞ − µ ⊤ u κ ∥ u ∥ 2 ˆ z 2 ˆ f ( ˆ z 1 , ˆ z 2 ) d ˆ z 1 d ˆ z 2 | {z } I 5 +  ˆ s 3 ˆ s ⊤ 3 + M  Z ∞ − µ ⊤ v κ ∥ v ∥ 2 Z ∞ − µ ⊤ u κ ∥ u ∥ 2 ˆ f ( ˆ z 1 , ˆ z 2 ) d ˆ z 1 d ˆ z 2 | {z } I 6 Since our goal is to study the term E c  cc ⊤ I { z 1 ≥ 0; z 2 ≥ 0 }  v , w e need to understand the terms I 1 to I 6 , as w ell as understanding the matrix-v ector pro duct in front of these terms. 
T o start, under some standard computation, w e hav e ˆ s ⊤ 1 u = κ ∥ u ∥ 2 · ∥ v ∥ 2 2 u ⊤ u − v ⊤ u · v ⊤ u ∥ v ∥ 2 2 ∥ u ∥ 2 2 − ( v ⊤ u ) 2 = κ ∥ u ∥ 2 ˆ s ⊤ 1 v = κ ∥ u ∥ 2 · ∥ v ∥ 2 2 u ⊤ v − v ⊤ u · v ⊤ v ∥ v ∥ 2 2 ∥ u ∥ 2 2 − ( v ⊤ u ) 2 = 0 ˆ s ⊤ 2 u = κ ∥ v ∥ 2 · ∥ u ∥ 2 2 v ⊤ u − v ⊤ u · u ⊤ u ∥ v ∥ 2 2 ∥ u ∥ 2 2 − ( v ⊤ u ) 2 = 0 ˆ s ⊤ 2 v = κ ∥ v ∥ 2 · ∥ u ∥ 2 2 v ⊤ v − v ⊤ u · u ⊤ v ∥ v ∥ 2 2 ∥ u ∥ 2 2 − ( v ⊤ u ) 2 = κ ∥ v ∥ 2 Therefore, the following must holds ˆ s 1 ˆ s ⊤ 1 v = 0 ; ˆ s 2 ˆ s ⊤ 2 v = κ ∥ v ∥ 2 ˆ s 2 ;  ˆ s 1 ˆ s ⊤ 2 + ˆ s 2 ˆ s ⊤ 1  v = κ ∥ v ∥ 2 ˆ s 1 ;  ˆ s 1 ˆ s ⊤ 3 + ˆ s 3 ˆ s ⊤ 1  v = µ ⊤ v · ˆ s 1 ;  ˆ s 2 ˆ s ⊤ 3 + ˆ s 3 ˆ s ⊤ 2  v = µ ⊤ v · ˆ s 2 + κ ∥ v ∥ 2 µ 58 Lastly , we hav e  ˆ s 3 ˆ s ⊤ 3 + M  v = µ ⊤ v · µ + κ 2 v − κ 2 ∥ v ∥ 2 2 ∥ u ∥ 2 2 − ( v ⊤ u ) 2  uv ⊤ − vu ⊤  2 v = µ ⊤ v · µ + κ 2 v − κ 2 ∥ v ∥ 2 2 ∥ u ∥ 2 2 − ( v ⊤ u ) 2  uv ⊤ − vu ⊤   u ⊤ v · v − ∥ v ∥ 2 2 u  = µ ⊤ v · µ + κ 2 v − κ 2 ∥ v ∥ 2 2 ∥ u ∥ 2 2 − ( v ⊤ u ) 2  u ⊤ v · ∥ v ∥ 2 2 u −  u ⊤ v  2 v − u ⊤ v · ∥ v ∥ 2 2 u + ∥ v ∥ 2 2 ∥ u ∥ 2 2 v  = µ ⊤ v · µ + κ 2 v + κ 2 v = µ ⊤ v · µ + 2 κ 2 v Therefore, w e can write E c  cc ⊤ I { z 1 ≥ 0; z 2 ≥ 0 }  v as E c  cc ⊤ I { z 1 ≥ 0; z 2 ≥ 0 }  v = κ ∥ v ∥ 2 ˆ s 2 · I 2 + κ ∥ v ∥ 2 ˆ s 1 · I 3 + µ ⊤ v · ˆ s 1 · I 4 +  µ ⊤ v · ˆ s 2 + κ ∥ v ∥ 2 µ  I 5 +  µ ⊤ v · µ + 2 κ 2 v  I 6 = κ ∥ v ∥ 2 ( I 2 · ˆ s 2 + I 3 · ˆ s 1 ) + µ ⊤ v ( I 4 · ˆ s 1 + I 5 · ˆ s 2 ) +  κ ∥ v ∥ 2 · I 5 + µ ⊤ v · I 6  µ + 2 κ 2 I 6 v (72) By the definition of I 2 to I 6 , we first notice that I 6 = Z ∞ − µ ⊤ u κ ∥ u ∥ 2 Z ∞ − µ ⊤ u κ ∥ u ∥ 2 f ( ˆ z 1 , ˆ z 2 ) d ˆ z 1 d ˆ z 2 = P  ˆ z 1 ≥ − µ ⊤ u κ ∥ u ∥ 2 ; ˆ z 2 ≥ − µ ⊤ v κ ∥ v ∥ 2  = P  ˆ z 1 ≤ µ ⊤ u κ ∥ u ∥ 2 ; ˆ z 2 ≤ µ ⊤ v κ ∥ v ∥ 2  = Φ 2  µ ⊤ u κ ∥ u ∥ 2 , µ ⊤ v κ ∥ v ∥ 2 , ρ  Moreo ver, w e can in v oke Lemma D.6, Lemma D.7, and Lemma D.8 to get that I 4 = 1 √ 2 π ( T 1 + ρT 2 ) ; I 5 = 1 √ 2 π ( T 2 + ρT 1 ) I 2 = ρ p 1 − ρ 2 2 π exp  − a 2 
− 2 ρab + b 2 2 (1 − ρ 2 )  + Φ 2 ( − a, − b, ρ ) + 1 √ 2 π  bT 2 + ρ 2 aT 1  I 3 = p 1 − ρ 2 2 π exp  − a 2 − 2 ρab + b 2 2 (1 − ρ 2 )  + ρ Φ 2 ( − a, − b, ρ ) + ρ √ 2 π ( aT 1 + bT 2 ) where T 1 , T 2 and a, b are defined as T 1 = exp  − a 2 2  Φ 1 ρa − b p 1 − ρ 2 ! ; T 2 = exp  − b 2 2  Φ 1 ρb − a p 1 − ρ 2 ! a = − µ ⊤ u κ ∥ u ∥ 2 ; b = − µ ⊤ v κ ∥ v ∥ 2 ; ρ = u ⊤ v ∥ u ∥ 2 ∥ v ∥ 2 T o ease our computation, we define E = 1 − ρ 2 2 π exp  − a 2 − 2 ρab + b 2 2 (1 − ρ 2 )  ; F = Φ 2 ( − a, − b, ρ ) 59 Then the terms I 2 to I 6 can b e written as I 2 = ρE + F + 1 √ 2 π  bT 2 + ρ 2 aT 1  ; I 3 = E + ρF + 1 √ 2 π ( bT 2 + aT 1 ) I 4 = 1 √ 2 π ( T 1 + ρT 2 ) ; I 5 = 1 √ 2 π ( T 2 + ρT 1 ) ; I 6 = F (73) No w, the trick of ev aluating (72) is to re-write ˆ s 1 and ˆ s 2 as b elo w ˆ s 1 = κ ∥ u ∥ 2 ∥ u ∥ 2 2 ∥ v ∥ 2 2 − ( u ⊤ v ) 2 ·  ∥ v ∥ 2 2 u − u ⊤ v · v  = κ  ∥ v ∥ 2 2 u − u ⊤ v · v  ∥ u ∥ 2 ∥ v ∥ 2 2 (1 − ρ 2 ) = κ 1 − ρ 2 · u ∥ u ∥ 2 − κρ 1 − ρ 2 · v ∥ v ∥ 2 = κ 1 − ρ 2  u ∥ u ∥ 2 − ρ · v ∥ v ∥ 2  ˆ s 2 = κ ∥ v ∥ 2 ∥ u ∥ 2 2 ∥ v ∥ 2 2 − ( u ⊤ v ) 2 ·  ∥ u ∥ 2 2 v − u ⊤ v · u  = κ  ∥ u ∥ 2 2 v − u ⊤ v · u  ∥ u ∥ 2 ∥ v ∥ 2 2 (1 − ρ 2 ) = κ 1 − ρ 2 · v ∥ v ∥ 2 − κρ 1 − ρ 2 · u ∥ u ∥ 2 = κ 1 − ρ 2  v ∥ v ∥ 2 − ρ · u ∥ u ∥ 2  (74) No w, w e can simplify (72) with (73) and (74). 
T o start, for the terms I 2 · ˆ s 2 + I 3 · ˆ s 1 w e ha ve I 2 · ˆ s 2 + I 3 · ˆ s 1 = κ 1 − ρ 2  ρE + F + 1 √ 2 π  bT 2 + ρ 2 aT 1    v ∥ v ∥ 2 − ρ · u ∥ u ∥ 2  + κ 1 − ρ 2  E + ρF + ρ √ 2 π ( bT 2 + aT 1 )   u ∥ u ∥ 2 − ρ · v ∥ v ∥ 2  = κE 1 − ρ 2  ρ  v ∥ v ∥ 2 − ρ · u ∥ u ∥ 2  +  u ∥ u ∥ 2 − ρ · v ∥ v ∥ 2  + κF 1 − ρ 2  v ∥ v ∥ 2 − ρ · u ∥ u ∥ 2  + ρ  u ∥ u ∥ 2 − ρ · v ∥ v ∥ 2  + κρaT 1 √ 2 π (1 − ρ 2 )  ρ  v ∥ v ∥ 2 − ρ · u ∥ u ∥ 2  +  u ∥ u ∥ 2 − ρ · v ∥ v ∥ 2  + κbT 2 √ 2 π (1 − ρ 2 )  v ∥ v ∥ 2 − ρ · u ∥ u ∥ 2  + ρ  u ∥ u ∥ 2 − ρ · v ∥ v ∥ 2  = κE · u ∥ u ∥ 2 + κF · v ∥ v ∥ 2 + κρaT 1 √ 2 π · u ∥ u ∥ 2 + κbT 2 √ 2 π · v ∥ v ∥ 2 = κ  E + ρaT 1 √ 2 π  u ∥ u ∥ 2 +  F + bT 2 √ 2 π  v ∥ v ∥ 2  60 Similarly , for the term I 4 ˆ s 1 + I 4 ˆ s 2 , we hav e I 4 ˆ s 1 + I 4 ˆ s 2 = κ √ 2 π (1 − ρ 2 )  ( T 1 + ρT 2 )  v ∥ v ∥ 2 − ρ · u ∥ u ∥ 2  + ( T 2 + ρT 1 )  u ∥ u ∥ 2 − ρ · v ∥ v ∥ 2  = κT 1 √ 2 π (1 − ρ 2 )  v ∥ v ∥ 2 − ρ · u ∥ u ∥ 2  + ρ  u ∥ u ∥ 2 − ρ · v ∥ v ∥ 2  + κT 2 √ 2 π (1 − ρ 2 )  ρ  v ∥ v ∥ 2 − ρ · u ∥ u ∥ 2  +  u ∥ u ∥ 2 − ρ · v ∥ v ∥ 2  = κT 1 √ 2 π · v ∥ v ∥ 2 + κT 2 √ 2 π · u ∥ u ∥ 2 = κ √ 2 π  T 1 · v ∥ v ∥ 2 + T 2 · u ∥ u ∥ 2  Applying these ev aluations, (72) b ecomes E c  cc ⊤ I { z 1 ≥ 0; z 2 ≥ 0 }  v = κ 2 ∥ v ∥ 2  E + ρaT 1 √ 2 π  u ∥ u ∥ 2 +  F + bT 2 √ 2 π  v ∥ v ∥ 2  + κ µ ⊤ v √ 2 π  T 1 · v ∥ v ∥ 2 + T 2 · u ∥ u ∥ 2  + κ ∥ v ∥ 2 √ 2 π ( T 2 + ρT 1 ) µ + µ ⊤ v · F · µ + 2 κ 2 F v = κ 2 ∥ v ∥ 2  E · u ∥ u ∥ 2 + ρaT 1 √ 2 π · u ∥ u ∥ 2 + bT 2 √ 2 π · v ∥ v ∥ 2  | {z } g 1 + κ √ 2 π  µ ⊤ v  T 1 · v ∥ v ∥ 2 + T 2 · u ∥ u ∥ 2  + ∥ v ∥ 2 ( T 2 + ρT 1 ) µ  | {z } g 2 + F  µµ ⊤ v + 3 κ 2 v  (75) Then w e hav e that E c  cc ⊤ I { z 1 ≥ 0; z 2 ≥ 0 }  v −  µµ ⊤ v + 3 κ 2 v  Φ 1  µ ⊤ u κ ∥ u ∥ 2  Φ 1  µ ⊤ v κ ∥ v ∥ 2  = g 1 + g 2 + C  µ ⊤ u κ ∥ u ∥ 2 , µ ⊤ v κ ∥ v ∥ 2 , u ⊤ v ∥ u ∥ 2 ∥ v ∥ 2  (76) The pro of then pro ceed by estimating the magnitude of the three 
terms. T o start, we need to b ound T 1 and T 2 . In particular, since Φ 1 is the CDF, its magnitude must b e b ounded by 1 . Therefore 0 ≤ T 1 ≤ exp  − a 2 2  ; 0 ≤ T 2 ≤ exp  − b 2 2  Therefore, the ℓ ∞ norm of g 2 is b ounded b y ∥ g 2 ∥ ∞ ≤ κ √ 2 π  µ ⊤ v  exp  − a 2 2  + exp  − b 2 2  + ∥ v ∥ 2 ∥ µ ∥ 2  exp  − a 2 2  + ρ exp  − b 2 2  ≤ 2 κ √ 2 π ∥ v ∥ 2 ∥ µ ∥ 2  exp  − a 2 2  + exp  − b 2 2  ≤ κ ∥ v ∥ 2 ∥ µ ∥ 2  ϕ  a 2  + ϕ  b 2  (77) 61 Next, for E , we hav e E = 1 − ρ 2 2 π exp  − a 2 − 2 ρab + b 2 2 (1 − ρ 2 )  = 1 − ρ 2 4 π  exp  − a 2 − 2 ρab + ρ 2 b 2 2 (1 − ρ 2 ) − b 2 2  + exp  − ρ 2 a 2 − 2 ρab + b 2 2 (1 − ρ 2 ) − a 2 2  ≤ 1 4 π  exp  − a 2 2  + exp  − b 2 2  ≤ 1 4 π  ϕ  a 2  + ϕ  b 2  Therefore, the magnitude of g 1 can b e b ounded b y ∥ g 1 ∥ 2 ≤ κ 2 ∥ v ∥ 2  | E | + | a | T 1 √ 2 π + | b | T 2 √ 2 π  ≤ κ 2 ∥ v ∥ 2  1 4 π  ϕ  a 2  + ϕ  b 2  + | a | √ 2 π exp  − a 2 2  + | b | √ 2 π exp  − b 2 2  ≤ κ 2 ∥ v ∥ 2  1 4 π  ϕ  a 2  + ϕ  b 2  + ψ  a 2  + ψ  b 2  (78) Moreo ver, b y the b ound of the Gaussian Copula function, we hav e that |C ( a, b, ρ ) | ≤ 1 4 exp  − a 2 + b 2 4  Therefore, w e hav e that C  µ ⊤ u κ ∥ u ∥ 2 , µ ⊤ v κ ∥ v ∥ 2 , u ⊤ v ∥ u ∥ 2 ∥ v ∥ 2  ≤ 1 4 exp  − a 2 4  exp  − b 2 4  = 1 4 ϕ  a 2  ϕ  b 2  Com bining the results gives     E c  cc ⊤ I { z 1 ≥ 0; z 2 ≥ 0 }  v −  µµ ⊤ v + 3 κ 2 v  Φ 1  µ ⊤ u κ ∥ u ∥ 2  Φ 1  µ ⊤ v κ ∥ v ∥ 2 ,      2 ≤ κ 2 ∥ v ∥ 2  1 4 π  ϕ  a 2  + ϕ  b 2  + ψ  a 2  + ψ  b 2  + κ ∥ v ∥ 2 ∥ µ ∥ 2  ϕ  a 2  + ϕ  b 2  + 1 4    µ ⊤ v   ∥ µ ∥ ∞ + 3 κ 2 ∥ v ∥ 2  ϕ  a 2  ϕ  b 2  = ∥ v ∥ 2   κ 2 + κ ∥ µ ∥ 2   ϕ  a 2  + ϕ  b 2  +  κ 2 + κ ∥ µ ∥ ∞   ψ  a 2  + ψ  b 2  ≤ 2 κ ∥ v ∥ 2  ∥ µ ∥ 2  ϕ  a 2  + ϕ  b 2  + ∥ µ ∥ ∞  ψ  a 2  + ψ  b 2  D.1.5 Other Results Lemma D.13. L et c ∼ N  µ , κ 2 I  , and let u ∈ R d b e a ve ctor. Define z = c ⊤ u . 
Then we have that z ∼ N(µ⊤u, κ²‖u‖₂²).

Proof. Since z = c⊤u with c ∼ N(µ, κ²I), and the coordinates of c are independent, the moment generating function of z is

  M_z(t) = E[exp(zt)] = E[∏_{j=1}^d exp(c_j u_j t)] = ∏_{j=1}^d E[exp(c_j u_j t)]
         = ∏_{j=1}^d exp(u_j µ_j t + ½ u_j² κ² t²)
         = exp((∑_{j=1}^d u_j µ_j) t + ½ (∑_{j=1}^d u_j²) κ² t²)
         = exp((µ⊤u)·t + ½ ‖u‖₂² κ² t²).

Therefore z ∼ N(µ⊤u, κ²‖u‖₂²).

Lemma D.14. Let κ, µ, a ∈ ℝ be given such that κ > 0. Then we have that

  ∫_a^∞ z exp(−(z−µ)²/(2κ²)) dz = κ² exp(−(µ−a)²/(2κ²)) + κµ√(2π) Φ₁((µ−a)/κ).

Proof. We use a change of variable z′ = (z−µ)/κ. Then z = κz′ + µ and dz = κ dz′. Therefore

  ∫_a^∞ z exp(−(z−µ)²/(2κ²)) dz = ∫_{(a−µ)/κ}^∞ (κz′ + µ) exp(−z′²/2) κ dz′
    = κ² ∫_{(a−µ)/κ}^∞ z′ exp(−z′²/2) dz′ + κµ ∫_{(a−µ)/κ}^∞ exp(−z′²/2) dz′
    = κ² [−exp(−z′²/2)]_{(a−µ)/κ}^∞ + κµ√(2π) (1 − Φ₁((a−µ)/κ))
    = κ² exp(−(µ−a)²/(2κ²)) + κµ√(2π) Φ₁((µ−a)/κ).

Lemma D.15. Let κ, µ, a ∈ ℝ be given such that κ > 0. Then we have that

  ∫_a^∞ z² exp(−(z−µ)²/(2κ²)) dz = κ²(a+µ) exp(−(µ−a)²/(2κ²)) + √(2π) κ (κ² + µ²) Φ₁((µ−a)/κ).

Proof. To start, let z′ = (z−µ)/κ. Then z = κz′ + µ and dz = κ dz′. Therefore

  ∫_a^∞ z² exp(−(z−µ)²/(2κ²)) dz = κ ∫_{(a−µ)/κ}^∞ (κz′ + µ)² exp(−z′²/2) dz′
    = κ³ ∫_{(a−µ)/κ}^∞ z′² exp(−z′²/2) dz′ + 2κ²µ ∫_{(a−µ)/κ}^∞ z′ exp(−z′²/2) dz′ + κµ² ∫_{(a−µ)/κ}^∞ exp(−z′²/2) dz′.

Notice that for the third term, we have

  ∫_{(a−µ)/κ}^∞ exp(−z′²/2) dz′ = √(2π) (1 − Φ₁((a−µ)/κ)) = √(2π) Φ₁((µ−a)/κ).

For the second term, we can directly apply Lemma D.14 with κ = 1, µ = 0 to get

  ∫_{(a−µ)/κ}^∞ z′ exp(−z′²/2) dz′ = exp(−(a−µ)²/(2κ²)).

For the first term, we apply integration by parts with u(z′) = −z′ and v(z′) = exp(−z′²/2).
In particular, notice that v′(z′) = −z′ exp(−z′²/2) and u′(z′) = −1, so u(z′) dv(z′) = z′² exp(−z′²/2) dz′. Therefore

  ∫_{(a−µ)/κ}^∞ z′² exp(−z′²/2) dz′ = [u(z′)v(z′)]_{(a−µ)/κ}^∞ − ∫_{(a−µ)/κ}^∞ v(z′) du(z′)
    = [−z′ exp(−z′²/2)]_{(a−µ)/κ}^∞ + ∫_{(a−µ)/κ}^∞ exp(−z′²/2) dz′
    = ((a−µ)/κ) exp(−(µ−a)²/(2κ²)) + √(2π) (1 − Φ₁((a−µ)/κ))
    = ((a−µ)/κ) exp(−(µ−a)²/(2κ²)) + √(2π) Φ₁((µ−a)/κ).

Putting things together, we have that

  ∫_a^∞ z² exp(−(z−µ)²/(2κ²)) dz
    = κ³ (((a−µ)/κ) exp(−(µ−a)²/(2κ²)) + √(2π) Φ₁((µ−a)/κ)) + 2κ²µ exp(−(a−µ)²/(2κ²)) + κµ²√(2π) Φ₁((µ−a)/κ)
    = (κ²(a−µ) + 2κ²µ) exp(−(µ−a)²/(2κ²)) + √(2π)(κ³ + κµ²) Φ₁((µ−a)/κ)
    = κ²(a+µ) exp(−(µ−a)²/(2κ²)) + √(2π) κ (κ² + µ²) Φ₁((µ−a)/κ).

Lemma D.16. Let z₁, z₂ ∼ N(0, 1) with Cov(z₁, z₂) = ρ. Then we have that

  ∫_a^∞ f(z₁ | z₂) dz₁ = Φ₁((ρz₂ − a)/√(1−ρ²)).

Proof. Since z₁, z₂ ∼ N(0, 1) with Cov(z₁, z₂) = ρ, we have that z₁ | z₂ ∼ N(ρz₂, 1−ρ²). Therefore, using a change of variable z′ = (z₁ − ρz₂)/√(1−ρ²), we have

  ∫_a^∞ f(z₁ | z₂) dz₁ = (1/√(2π(1−ρ²))) ∫_a^∞ exp(−(z₁ − ρz₂)²/(2(1−ρ²))) dz₁
    = (1/√(2π)) ∫_{(a−ρz₂)/√(1−ρ²)}^∞ exp(−z′²/2) dz′
    = 1 − Φ₁((a − ρz₂)/√(1−ρ²)) = Φ₁((ρz₂ − a)/√(1−ρ²)).

Lemma D.17. Let z₁, z₂ ∼ N(0, 1) with Cov(z₁, z₂) = ρ. Then we have that

  Φ₂(−a, −b, ρ) = ∫_a^∞ ∫_b^∞ f(z₁, z₂) dz₂ dz₁ = ∫_b^∞ Φ₁((ρz₂ − a)/√(1−ρ²)) f(z₂) dz₂.

Proof. Let z′₁, z′₂ ∼ N(0, 1) with Cov(z′₁, z′₂) = ρ, and define z₁ = −z′₁, z₂ = −z′₂. Then we have that z₁, z₂ ∼ N(0, 1) with Cov(z₁, z₂) = ρ. By symmetry, f(z₁, z₂) = f(−z₁, −z₂) = f(z′₁, z′₂). Moreover, dz′₂ dz′₁ = (−dz₂)(−dz₁) = dz₂ dz₁.
Thus

  Φ₂(−a, −b, ρ) = ∫_{−∞}^{−a} ∫_{−∞}^{−b} f(z′₁, z′₂) dz′₂ dz′₁ = ∫_a^∞ ∫_b^∞ f(z₁, z₂) dz₂ dz₁.

Recall that f(z₁, z₂) = f(z₁ | z₂) f(z₂). Then we can apply Lemma D.16 to get that

  ∫_a^∞ ∫_b^∞ f(z₁, z₂) dz₂ dz₁ = ∫_b^∞ (∫_a^∞ f(z₁ | z₂) dz₁) f(z₂) dz₂ = ∫_b^∞ Φ₁((ρz₂ − a)/√(1−ρ²)) f(z₂) dz₂.

Lemma D.18. Let z ∼ N(µ, κ²), and let a ∈ ℝ. Then we have

  E[z I{z ≥ a}] = (κ/√(2π)) exp(−(µ−a)²/(2κ²)) + µ Φ₁((µ−a)/κ).

Proof. Define ẑ = (z−µ)/κ. Then we have that ẑ ∼ N(0, 1). Since z = κẑ + µ, we have

  E[z I{z ≥ a}] = E[(κẑ + µ) I{ẑ ≥ (a−µ)/κ}] = κ E[ẑ I{ẑ ≥ (a−µ)/κ}] + µ E[I{ẑ ≥ (a−µ)/κ}].

Notice that E[I{ẑ ≥ (a−µ)/κ}] = Pr(ẑ ≥ (a−µ)/κ) = Φ₁((µ−a)/κ). Moreover,

  E[ẑ I{ẑ ≥ (a−µ)/κ}] = (1/√(2π)) ∫_{(a−µ)/κ}^∞ z exp(−z²/2) dz = (1/√(2π)) exp(−(a−µ)²/(2κ²)).

Therefore

  E[z I{z ≥ a}] = (κ/√(2π)) exp(−(a−µ)²/(2κ²)) + µ Φ₁((µ−a)/κ).

Lemma D.19. Let x ∈ [−1, 1]. Then we have that |arcsin x| ≤ (π/2) · |x|.

Proof. The second derivative of arcsin is x/(1−x²)^{3/2} ≥ 0 for x ∈ [0, 1), so arcsin is convex on [0, 1) and, by continuity, on [0, 1]. Writing x = (1−x) · 0 + x · 1 and using convexity together with arcsin 0 = 0 and arcsin 1 = π/2 gives, for all x ∈ [0, 1],

  arcsin x ≤ (1−x) arcsin 0 + x arcsin 1 = (π/2) · x.

Since arcsin is odd, |arcsin x| = arcsin |x| ≤ (π/2) · |x| for all x ∈ [−1, 1]. This completes the proof.

Lemma D.20. Let e ∈ ℝⁿ. Then ∑_{i=1}^n |e_i| ≤ √n (∑_{i=1}^n e_i²)^{1/2}.

Proof. Let u := (|e₁|, ..., |e_n|) ∈ ℝⁿ and 1 := (1, ..., 1) ∈ ℝⁿ, so that ∑_{i=1}^n |e_i| = u⊤1. By the Cauchy–Schwarz inequality, u⊤1 ≤ ‖u‖₂ ‖1‖₂. Moreover,

  ‖u‖₂ = (∑_{i=1}^n |e_i|²)^{1/2} = (∑_{i=1}^n e_i²)^{1/2},   ‖1‖₂ = (∑_{i=1}^n 1²)^{1/2} = √n.

Combining the above gives the claim.

Lemma D.21. Assume that Assumption 3.1 holds and that the bound of Lemma D.22 holds for every mask cᵢ. Then we have:

  ‖∇_{w_r} L_C(W)‖₂² ≤ 2C² · (n/m) · L_C(W),   where C = 1 + κ√(2 log(2d/δ)).

Proof. By the form of ∇_{w_r} L_C(W) in (2), we have:

  ‖∇_{w_r} L_C(W)‖₂ = ‖(a_r/√m) ∑_{i=1}^n (f(W, xᵢ ⊙ cᵢ) − yᵢ)(xᵢ ⊙ cᵢ) I{w_r⊤(xᵢ ⊙ cᵢ) ≥ 0}‖₂
    ≤ (|a_r|/√m) ∑_{i=1}^n |f(W, xᵢ ⊙ cᵢ) − yᵢ| · ‖xᵢ ⊙ cᵢ‖₂   (triangle inequality; the indicator is at most 1)
    ≤ (1/√m) ∑_{i=1}^n |f(W, xᵢ ⊙ cᵢ) − yᵢ| · ‖cᵢ‖∞ ‖xᵢ‖₂   (a_r = ±1)
    ≤ (C/√m) ∑_{i=1}^n |f(W, xᵢ ⊙ cᵢ) − yᵢ|   (Lemma D.22 gives ‖cᵢ‖∞ ≤ C, and ‖xᵢ‖₂ ≤ 1)
    ≤ (C√n/√m) (∑_{i=1}^n (f(W, xᵢ ⊙ cᵢ) − yᵢ)²)^{1/2}   (Lemma D.20)
    = (C√n/√m) (2 L_C(W))^{1/2}.

Squaring both sides gives ‖∇_{w_r} L_C(W)‖₂² ≤ 2C² · (n/m) · L_C(W).

Lemma D.22 (High-probability ℓ∞ bound for Gaussian masks). Fix δ ∈ (0, 1). Let cᵢ ∈ ℝᵈ be a Gaussian mask with independent coordinates, cᵢ ∼ N(1, κ²I_d); that is, c_{i,j} = 1 + κg_{i,j}, where g_{i,j} i.i.d. ∼ N(0, 1).
Then, with probability at least 1 − δ, we have

  ‖cᵢ‖∞ ≤ 1 + κ√(2 log(2d/δ)).

Proof. Since cᵢ ∼ N(1, κ²I_d) with independent coordinates, each coordinate can be written as c_{i,j} = 1 + κg_{i,j}, where g_{i,j} ∼ N(0, 1) i.i.d. Hence

  ‖cᵢ‖∞ = max_{j ∈ [d]} |c_{i,j}| = max_{j ∈ [d]} |1 + κg_{i,j}|,

and by the triangle inequality, |c_{i,j}| = |1 + (c_{i,j} − 1)| ≤ 1 + |c_{i,j} − 1| = 1 + |κg_{i,j}|. We will first bound max_j |c_{i,j} − 1| = κ max_j |g_{i,j}|, and then convert this into a bound on ‖cᵢ‖∞.

For brevity, write g ∼ N(0, 1). We are going to show that

  Pr(|g| ≥ t) ≤ 2e^{−t²/2}.  (79)

By symmetry of the standard normal distribution, Pr(|g| ≥ t) = Pr(g ≥ t) + Pr(g ≤ −t) = 2 Pr(g ≥ t), so it suffices to upper bound Pr(g ≥ t). For any λ > 0, since the exponential is monotone increasing, g ≥ t implies e^{λg} ≥ e^{λt}, and by Markov's inequality,

  Pr(g ≥ t) = Pr(e^{λg} ≥ e^{λt}) ≤ E[e^{λg}] e^{−λt}.  (80)

To compute the moment generating function E[e^{λg}], use the standard normal density φ(x) = (1/√(2π)) e^{−x²/2} and complete the square in the exponent:

  λx − x²/2 = λ²/2 − (x−λ)²/2,

so that

  E[e^{λg}] = (1/√(2π)) ∫_{−∞}^∞ exp(λx − x²/2) dx = e^{λ²/2} · (1/√(2π)) ∫_{−∞}^∞ exp(−(x−λ)²/2) dx = e^{λ²/2},

where the last step uses the change of variable u = x − λ and ∫_{−∞}^∞ e^{−u²/2} du = √(2π). Substituting into (80) gives

  Pr(g ≥ t) ≤ exp(λ²/2 − λt)   for all λ > 0.

The right-hand side is a valid bound for every λ > 0, so we choose λ to make it as small as possible. Define f(λ) := λ²/2 − λt. Then f′(λ) = λ − t, so the unique minimizer is λ = t (and f″(λ) = 1 > 0 confirms it is a minimum). Plugging in λ = t gives Pr(g ≥ t) ≤ e^{−t²/2}, and Pr(|g| ≥ t) = 2 Pr(g ≥ t) proves (79).

Now, fix a coordinate j ∈ [d]. Since c_{i,j} − 1 = κg_{i,j} with g_{i,j} ∼ N(0, 1), applying (79) with t = u/κ gives, for any u ≥ 0,

  Pr(|c_{i,j} − 1| ≥ u) ≤ 2 exp(−u²/(2κ²)).

Define the event Eᵢ(u) := {max_{j ∈ [d]} |c_{i,j} − 1| ≤ u}. Its complement is the event that at least one coordinate deviates by more than u, so by the union bound,

  Pr(Eᵢ(u)ᶜ) ≤ ∑_{j=1}^d Pr(|c_{i,j} − 1| > u) ≤ 2d exp(−u²/(2κ²)).

Choosing u so that the right-hand side is at most δ,

  2d exp(−u²/(2κ²)) ≤ δ  ⟺  u ≥ κ√(2 log(2d/δ)),

and setting u := κ√(2 log(2d/δ)) gives Pr(Eᵢ(u)) ≥ 1 − δ. On Eᵢ(u), for each coordinate j, |c_{i,j}| ≤ 1 + |c_{i,j} − 1| ≤ 1 + u. Taking the maximum over j yields, with probability at least 1 − δ,

  ‖cᵢ‖∞ ≤ 1 + κ√(2 log(2d/δ)).
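The closed forms above lend themselves to quick numerical sanity checks. The sketch below, which is ours and not part of the paper's analysis, verifies Lemma D.18 by Monte Carlo and confirms empirically that the ℓ∞ bound of Lemma D.22 fails with frequency at most δ; the parameter values are illustrative:

```python
import math
import random

def Phi(x: float) -> float:
    # Standard normal CDF (Phi_1 in the paper's notation).
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

rng = random.Random(0)  # fixed seed for reproducibility

# Lemma D.18: E[z * 1{z >= a}] for z ~ N(mu, kappa^2), checked by Monte Carlo.
mu, kappa, a = 0.4, 0.9, 0.2
closed = (kappa / math.sqrt(2.0 * math.pi)
          * math.exp(-((mu - a) ** 2) / (2.0 * kappa ** 2))
          + mu * Phi((mu - a) / kappa))
n_samples = 400_000
total = 0.0
for _ in range(n_samples):
    z = mu + kappa * rng.gauss(0.0, 1.0)
    if z >= a:
        total += z
est = total / n_samples
assert abs(est - closed) < 0.02  # Monte Carlo standard error is ~0.002 here

# Lemma D.22: P(||c_i||_inf > 1 + kappa * sqrt(2 log(2d/delta))) <= delta.
d, kap, delta = 50, 0.3, 0.05
u = kap * math.sqrt(2.0 * math.log(2.0 * d / delta))
trials = 2000
violations = sum(
    max(abs(1.0 + kap * rng.gauss(0.0, 1.0)) for _ in range(d)) > 1.0 + u
    for _ in range(trials)
)
assert violations / trials <= delta
```

Because the tail bound in Lemma D.22 combines a two-sided Chernoff bound with a union bound, the observed violation rate is typically far below δ.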
