Rate-Optimal Denoising with Deep Neural Networks

Reinhard Heckel (Department of Electrical and Computer Engineering, Rice University), Wen Huang (School of Mathematical Sciences, Xiamen University, Xiamen, Fujian, P.R. China), Paul Hand (Department of Mathematics and College of Computer and Information Science, Northeastern University), and Vladislav Voroninski (Helm.ai)

April 9, 2019

Abstract

Deep neural networks provide state-of-the-art performance for image denoising, where the goal is to recover a near noise-free image from a noisy observation. The underlying principle is that neural networks trained on large datasets have empirically been shown to generate natural images well from a low-dimensional latent representation of the image. Given such a generator network, a noisy image can be denoised by i) finding the closest image in the range of the generator, or ii) passing it through an encoder-generator architecture (known as an autoencoder). However, there is little theory to justify this success, let alone to predict the denoising performance as a function of the network parameters. In this paper we consider the problem of denoising an image corrupted by additive Gaussian noise using the two generator-based approaches. In both cases, we assume the image is well described by a deep neural network with ReLU activation functions, mapping a $k$-dimensional code to an $n$-dimensional image. In the case of the autoencoder, we show that the feedforward network reduces the noise energy by a factor of $O(k/n)$. In the case of optimizing over the range of a generative model, we state and analyze a simple gradient algorithm that minimizes a non-convex loss function and provably reduces the noise energy by a factor of $O(k/n)$. We also demonstrate in numerical experiments that this denoising performance is indeed achieved by generative priors learned from data.
1 Introduction

We consider the denoising problem, where the goal is to remove noise from an unknown image or signal. In more detail, our goal is to obtain an estimate of an image or signal $y^* \in \mathbb{R}^n$ from a noisy measurement $y = y^* + \eta$. Here, $\eta$ is unknown noise, which we model as a zero-mean white Gaussian random variable with covariance matrix $(\sigma^2/n) I$. Image denoising relies on generative or prior assumptions on the image $y^*$, such as self-similarity within images [Dab+07], sparsity in fixed [Don95] and learned bases [EA06], and, most recently, the assumption that the image can be generated by a pre-trained deep neural network [Bur+12; Zha+17]. Deep-network based approaches typically yield the best denoising performance. This success can be attributed to their ability to efficiently represent and learn realistic image priors, for example via autodecoders [HS06] and generative adversarial models [Goo+14].

Motivated by this success story, we assume that the image $y^*$ lies in the range of an image-generating network. In this paper, we propose the first algorithm for solving denoising with deep generative priors that provably finds an approximation of the underlying image. As the influence of deep networks in denoising and inverse problems grows, it becomes increasingly important to understand their performance at a theoretical level. Given that most optimization approaches for deep learning are first-order gradient methods, a justification is needed for why they do not get stuck in local minima. The most closely related work that establishes theoretical reasons for why gradient methods might succeed when using deep generative priors for solving inverse problems is [HV18]. In it, the authors establish global favorability for optimization of an $\ell_2$-loss function under a random neural network model.
Specifically, they show the existence of a descent direction outside a ball around the global optimizer and a ball around a negative multiple of it in the latent space of the generative model. That work does not justify why the one spurious point is avoided by gradient descent, nor does it provide a specific algorithm which provably estimates the global minimizer, nor does it provide an analysis of the robustness of the problem with respect to noise. It was subsequently extended to the case of generative convolutional neural networks by [Ma+18], but that work too does not prove convergence of a specific algorithm.

Contributions: The goal of this paper is to analytically quantify the denoising performance of deep-prior based denoisers. Specifically, we characterize the performance of two simple and efficient algorithms for denoising based on a $d$-layer generative neural network $G: \mathbb{R}^k \to \mathbb{R}^n$, with $k < n$. We first provide a simple result for an encoder-generator network $G(E(y))$, where $E: \mathbb{R}^n \to \mathbb{R}^k$ is an encoder network. We show that if we pass noise through an encoder-decoder network $G(E(y))$ that acts as the identity on a class of images of interest, then the network reduces the random noise energy by a factor of $O(k/n)$. This result requires no assumptions on the weights of the network. The second and main result of our paper pertains to denoising by optimizing over the latent code of a generator network with random weights. We propose a gradient method that attempts to minimize the least-squares loss $f(x) = \frac{1}{2}\|G(x) - y\|^2$ between the noisy image $y$ and an image $G(x)$ in the range of the generator.
Even though $f$ is non-convex, we show that a gradient method yields an estimate $\hat{x}$ obeying

$$\|G(\hat{x}) - y^*\|^2 \lesssim \sigma^2 \frac{k}{n},$$

with high probability, where the notation $\lesssim$ absorbs a constant factor depending on the number of layers of the network and on its expansivity, as discussed in more detail later. Our result shows that the denoising rate of a deep generator based denoiser is optimal in terms of the dimension of the latent representation. We also show in numerical experiments that this rate, shown to be analytically achieved for random priors, is also experimentally achieved for priors learned from real imaging data.

Related work: We hasten to add that a theoretically close work to the question considered in this paper is [Bor+17], which solves a noisy compressive sensing problem with generative priors by minimizing an $\ell_2$-loss. Under the assumption that the network is Lipschitz, they show that if the global optimizer can be found, which is in principle NP-hard, then a signal estimate is recovered to within the noise level. While the Lipschitzness assumption is quite mild, the resulting theory does not provide justification for why global optimality can be reached.

2 Background on denoising with classical and deep-learning based methods

As mentioned before, image denoising relies on modeling or prior assumptions on the image $y^*$. For example, suppose that the image $y^*$ lies in a $k$-dimensional subspace of $\mathbb{R}^n$ denoted by $\mathcal{Y}$. Then we can estimate the original image by finding the closest point in $\ell_2$-distance to the noisy observation $y$ on the subspace $\mathcal{Y}$. The corresponding estimate, denoted by $\hat{y}$, obeys

$$\|\hat{y} - y^*\|^2 \lesssim \sigma^2 \frac{k}{n}, \qquad (1)$$

with high probability (throughout, $\|\cdot\|$ denotes the $\ell_2$-norm). Thus, the noise energy is reduced by a factor of $k/n$ over the trivial estimate $\hat{y} = y$, which does not use any prior knowledge of the signal.
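To make the subspace rate (1) concrete, here is a minimal NumPy sketch of the projection estimator (the dimensions, seed, and noise level are arbitrary illustrative choices, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, sigma = 2000, 20, 1.0

# Orthonormal basis U of a random k-dimensional subspace of R^n.
U, _ = np.linalg.qr(rng.standard_normal((n, k)))
y_star = U @ rng.standard_normal(k)                   # signal y* lying in the subspace
eta = (sigma / np.sqrt(n)) * rng.standard_normal(n)   # noise with covariance (sigma^2 / n) I
y = y_star + eta                                      # noisy observation

# Closest point on the subspace in l2-distance: orthogonal projection.
y_hat = U @ (U.T @ y)

err = np.sum((y_hat - y_star) ** 2)
print(err, sigma**2 * k / n)  # err concentrates around sigma^2 * k / n
```

The projection retains only the $k$ noise coordinates that lie in the subspace, which is why the residual error scales as $\sigma^2 k/n$ rather than $\sigma^2$.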
The denoising rate (1) shows that the more concise the image prior or image representation (i.e., the smaller $k$), the more noise can be removed. If, on the other hand, the image model (the subspace, in this example) does not include the original image $y^*$, then the error bound (1) increases, as we would remove a significant part of the signal along with the noise when projecting onto the range of the image prior. Thus a concise and accurate model is crucial for denoising.

Real-world signals rarely lie in a priori known subspaces, and the last few decades of image denoising research have developed sophisticated algorithms based on accurate image models. Examples include algorithms based on sparse representations in overcomplete dictionaries such as wavelets [Don95] and curvelets [Sta+02], and algorithms based on exploiting self-similarity within images [Dab+07]. A prominent example of the latter class of algorithms is the (state-of-the-art) BM3D [Dab+07] algorithm. However, the nuances of real-world images are difficult to describe with handcrafted models. Thus, starting with the paper [EA06], which proposes to learn sparse representations from training data, it has become common to learn concise representations for denoising (and other inverse problems) from a set of training images. Burger et al. [Bur+12] applied deep networks to the denoising problem by training a plain neural network on a large set of images. Since then, deep learning based denoisers [Zha+17] have set the standard for denoising. The success of deep network priors can be attributed to their ability to efficiently represent and learn realistic image priors, for example via autodecoders [HS06] and generative adversarial models [Goo+14]. Over the last few years, the quality of deep priors has significantly improved [Kar+17; Uly+18; HH19].
As this field matures, priors will be developed with even smaller latent code dimensionality and more accurate approximation of natural signal manifolds. Consequently, the representation error of deep priors will decrease, thereby enabling even more powerful denoisers.

3 Denoising with a neural network with an hourglass structure

Perhaps the most straightforward and classical approach to using deep networks for denoising is to train a deep network with an autoencoder or hourglass structure end-to-end to perform denoising. An autoencoder compresses data from the input layer into a low-dimensional code and then generates an output from that code. In this section, we analyze such networks from the perspective of denoising. Specifically, we show mathematically that a simple model for neural networks with an hourglass structure achieves optimal denoising rates, as given by the dimensionality of the low-dimensional code.

An autoencoder $H(x) = G(E(x))$ consists of an encoder network $E: \mathbb{R}^n \to \mathbb{R}^k$, mapping an image to a low-dimensional latent representation, and a decoder or generator network $G: \mathbb{R}^k \to \mathbb{R}^n$, mapping the latent representation to an image. To see that the size of the low-dimensional code, $k$, determines the denoising rate, consider a simple one-layer encoder and multilayer decoder of the form

$$E(y) = \mathrm{relu}(W_1 y), \qquad G(x) = \mathrm{relu}(W_d \cdots \mathrm{relu}(W_2\, \mathrm{relu}(W_1 x)) \cdots), \qquad (2)$$

where $\mathrm{relu}(x) = \max(x, 0)$ applies entrywise, $W_1$ are the weights of the encoder, $W_i \in \mathbb{R}^{n_i \times n_{i-1}}$ are the weights in the $i$-th layer of the decoder, and $n_i$ is the number of neurons in the $i$-th layer.

Typically, the autoencoder network $H$ is trained such that $H(y) \approx y$ for some class of signals of interest (say, natural images). The following proposition shows that the structure of the network alone guarantees that an autoencoder "filters out" most of the noise.

Proposition 1.
Let $H = G \circ E$ be an autoencoder of the form (2), and note that it is piecewise linear, i.e., $H(y) = Uy$ in a region around $y$. Suppose that $\|U\|_2 \le 2$ for all such regions. Let $\eta$ be Gaussian noise with covariance matrix $\sigma I$, $\sigma > 0$. Then, provided that $32\, k \log(2 n_1 n_2 \cdots n_d) \le n$, we have, with probability at least $1 - 2 e^{-k \log(2 n_1 n_2 \cdots n_d)}$, that

$$\|H(\eta)\|_2^2 \le 5 \frac{k}{n} \log(2 n_1 n_2 \cdots n_d)\, \|\eta\|_2^2.$$

Note that the assumption $\|U\|_2 \le 2$ implies that $\|H(y)\|_2^2 / \|y\|_2^2 \le 4$ for all $y$, which in turn guarantees that the autoencoder does not "amplify" a signal too much. We envision an autoencoder that is trained such that it obeys $H(y) \approx y$ for $y$ in a class of images; the proposition then justifies why the feedforward network reduces the noise energy by a factor of $O(k/n)$. The proof of Proposition 1, contained in the appendix, shows that the output $H(\eta)$ lies in the range of a union of $k$-dimensional subspaces, and then uses a standard concentration argument showing that a union of such subspaces can represent no more than a fraction $O(k/n)$ of the noise. In the remainder of the paper we show that denoising via enforcing a generative prior gives an analogous denoising rate.

4 Denoising via enforcing a generative model

We consider the problem of estimating a vector $y^* \in \mathbb{R}^n$ from a noisy observation $y = y^* + \eta$. We assume that the vector $y^*$ belongs to the range of a $d$-layer generative neural network $G: \mathbb{R}^k \to \mathbb{R}^n$, with $k < n$. That is, $y^* = G(x^*)$ for some $x^* \in \mathbb{R}^k$. We consider a generative network of the form

$$G(x) = \mathrm{relu}(W_d \cdots \mathrm{relu}(W_2\, \mathrm{relu}(W_1 x)) \cdots),$$

where $\mathrm{relu}(x) = \max(x, 0)$ applies entrywise, $W_i \in \mathbb{R}^{n_i \times n_{i-1}}$ are the weights in the $i$-th layer, $n_i$ is the number of neurons in the $i$-th layer, and the network is expansive in the sense that $k = n_0 < n_1 < \cdots < n_d = n$.
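The following sketch illustrates the noise-attenuation mechanism behind Proposition 1 with a randomly weighted encoder-decoder of the ReLU form above. It is illustrative only: the layer sizes and the entrywise $\mathcal{N}(0, 1/\mathrm{rows})$ weight scaling (chosen so that the piecewise-linear maps $U$ have operator norm near 1) are our assumptions, and a trained autoencoder would additionally satisfy $H(y) \approx y$ on the image class; here we only check how much noise energy survives the hourglass structure.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
n, n1, k = 1000, 200, 10

# One-layer encoder and two-layer decoder; entries N(0, 1/rows) keep each
# layer's operator norm close to 1, so ||U||_2 <= 2 as Proposition 1 assumes
# (an illustrative scaling, not the N(0, 2/n_i) scaling used in Section 5).
W_enc = rng.standard_normal((k, n)) / np.sqrt(n)
W1 = rng.standard_normal((n1, k)) / np.sqrt(n1)
W2 = rng.standard_normal((n, n1)) / np.sqrt(n)

def H(y):
    # H = G o E with a one-layer encoder and two-layer decoder.
    return relu(W2 @ relu(W1 @ relu(W_enc @ y)))

eta = rng.standard_normal(n)  # pure noise input
ratio = np.sum(H(eta) ** 2) / np.sum(eta ** 2)
print(ratio, k / n)  # the surviving noise energy is on the order of k/n
```

The bottleneck of width $k$ is what caps the surviving noise fraction, matching the $O(k/n)$ rate of the proposition.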
The problem at hand is: given the weights of the network $W_1, \ldots, W_d$ and a noisy observation $y$, obtain an estimate $\hat{y}$ of the original image $y^*$ such that $\|\hat{y} - y^*\|$ is small and $\hat{y}$ is in the range of $G$.

4.1 Enforcing a generative model

As a way to solve the above problem, we first obtain an estimate of $x^*$, denoted by $\hat{x}$, and then estimate $y^*$ as $G(\hat{x})$. In order to estimate $x^*$, we minimize the loss

$$f(x) := \frac{1}{2}\|G(x) - y\|^2.$$

Since this objective is non-convex, there is no a priori guarantee of efficiently finding the global minimum. Approaches such as gradient methods could in principle get stuck in local minima instead of finding a global minimizer that is close to $x^*$. However, as we show in this paper, under appropriate conditions, a gradient method, introduced next, finds a point that is very close to the original latent parameter $x^*$, with the distance to $x^*$ controlled by the noise.

Figure 1: Loss surface $f(x) = \|G(x) - G(x^*)\|$, $x^* = [1, 0]$, of an expansive network $G$ with ReLU activation functions, with $k = 2$ nodes in the input layer, $n_2 = 300$ and $n_3 = 784$ nodes in the hidden and output layers, respectively, and random Gaussian weights in each layer. The surface has a critical point near $-x^*$, a global minimum at $x^*$, and a local maximum at $0$.

In order to state the algorithm, we first introduce a useful quantity. For analyzing which rows of a matrix $W$ are active when computing $\mathrm{relu}(Wx)$, we let

$$W_{+,x} = \mathrm{diag}(Wx > 0)\, W.$$

For a fixed weight matrix $W$, the matrix $W_{+,x}$ zeros out the rows of $W$ that do not have a positive dot product with $x$. Alternatively put, $W_{+,x}$ contains weights from only the neurons that are active for the input $x$.
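In code, $W_{+,x}$ is simply a row mask (a minimal sketch with an arbitrary matrix and input):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))
x = rng.standard_normal(3)

# W_{+,x} = diag(Wx > 0) W : zero out the rows whose neuron is inactive at x.
W_plus = np.diag((W @ x > 0).astype(float)) @ W

# Equivalent row-masking form of the same definition.
assert np.allclose(W_plus, W * ((W @ x) > 0)[:, None])

# On its defining input, W_{+,x} reproduces the ReLU layer: relu(Wx) = W_{+,x} x.
print(np.allclose(np.maximum(W @ x, 0.0), W_plus @ x))  # True
```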
We also define $W_{1,+,x} = (W_1)_{+,x} = \mathrm{diag}(W_1 x > 0)\, W_1$ and

$$W_{i,+,x} = \mathrm{diag}(W_i W_{i-1,+,x} \cdots W_{2,+,x} W_{1,+,x}\, x > 0)\, W_i.$$

The matrix $W_{i,+,x}$ consists only of the weights of the neurons in the $i$-th layer that are active if the input to the first layer is $x$.

We are now ready to state our algorithm: a gradient method with a tweak informed by the loss surface of the function to be minimized. Given a noisy observation $y$, the algorithm starts with an arbitrary initial point $x_0 \ne 0$. At each iteration $i = 0, 1, \ldots$, the algorithm computes the step direction

$$\tilde{v}_{x_i} = \Big(\prod_{j=d}^{1} W_{j,+,x_i}\Big)^T (G(x_i) - y),$$

which is equal to the gradient of $f$ if $f$ is differentiable at $x_i$. It then takes a small step opposite to $\tilde{v}_{x_i}$. The tweak is that before each iteration, the algorithm checks whether $f(-x_i)$ is smaller than $f(x_i)$, and if so, negates the sign of the current iterate $x_i$.

This tweak is informed by the loss surface. To understand this step, it is instructive to examine the loss surface for the noiseless case in Figure 1. It can be seen that while the loss function has a global minimum at $x^*$, it is relatively flat close to $-x^*$. In expectation, there is a critical point that is a negative multiple of $x^*$, with the property that the curvature in the $\pm x^*$ direction is positive and the curvature in the orthogonal directions is zero. Further, around approximately $-x^*$, the loss function is larger than around the optimum $x^*$. As a simple gradient descent method (without the tweak) could potentially get stuck in this region, the negation check provides a way to avoid converging to it. Our algorithm is formally summarized as Algorithm 1 below.

Algorithm 1 Gradient method
Require: Weights of the network $W_i$, noisy observation $y$, and step size $\alpha > 0$
1: Choose an arbitrary initial point $x_0 \in \mathbb{R}^k \setminus \{0\}$
2: for $i = 0, 1, \ldots$ do
3:   if $f(-x_i) < f(x_i)$ then
4:     $\tilde{x}_i \leftarrow -x_i$
5:   else
6:     $\tilde{x}_i \leftarrow x_i$
7:   end if
8:   Compute $v_{\tilde{x}_i} \in \partial f(\tilde{x}_i)$; in particular, if $G$ is differentiable at $\tilde{x}_i$, then set $v_{\tilde{x}_i} = \tilde{v}_{\tilde{x}_i}$, where $\tilde{v}_{\tilde{x}_i} := \big(\prod_{j=d}^{1} W_{j,+,\tilde{x}_i}\big)^T (G(\tilde{x}_i) - y)$
9:   $x_{i+1} \leftarrow \tilde{x}_i - \alpha v_{\tilde{x}_i}$
10: end for

Other variations of the tweak are also possible. For example, the negation check in Step 3 could be performed only after a convergence criterion is satisfied; if a lower objective is achieved by negating the latent code, gradient descent can then be continued until a convergence criterion is again satisfied.

5 Main results

For our analysis, we consider a fully-connected generative network $G: \mathbb{R}^k \to \mathbb{R}^n$ with Gaussian weights and no bias terms. Specifically, we assume that the entries of each $W_i$ are independently and identically distributed as $\mathcal{N}(0, 2/n_i)$, but we do not require the weights to be independent across layers. Moreover, we assume that the network is sufficiently expansive:

Expansivity condition. We say that the expansivity condition with constant $\epsilon > 0$ holds if $n_i \ge c\, \epsilon^{-2} \log(1/\epsilon)\, n_{i-1} \log n_{i-1}$ for all $i$, where $c$ is a particular numerical constant.

In a real-world generative network the weights are learned from training data and are not drawn from a Gaussian distribution. Nonetheless, the motivation for selecting Gaussian weights for our analysis is as follows:

1. The empirical distributions of weights from deep neural networks often have statistics consistent with Gaussians; AlexNet is a concrete example [Aro+15].

2. The field of theoretical analysis of recovery guarantees for deep learning is nascent, and Gaussian networks permit theoretical results because of well-developed theories for random matrices.

3. It is not clear which non-Gaussian distribution for the weights is superior from the joint perspective of realism and analytical tractability.
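The following NumPy sketch implements Algorithm 1 on a small expansive network with the $\mathcal{N}(0, 2/n_i)$ weights just described. It is illustrative only: the network sizes, step size, and iteration count are arbitrary choices, and we run the noiseless case for simplicity.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(Ws, x):
    # G(x) = relu(W_d ... relu(W_1 x) ...)
    for W in Ws:
        x = relu(W @ x)
    return x

def loss(Ws, x, y):
    r = forward(Ws, x) - y
    return 0.5 * r @ r

def step_direction(Ws, x, y):
    # v_x = (prod_{j=d}^{1} W_{j,+,x})^T (G(x) - y), built layer by layer.
    h, J = x, np.eye(len(x))
    for W in Ws:
        pre = W @ h
        J = (W * (pre > 0)[:, None]) @ J  # multiply by the active-weight matrix W_{j,+,x}
        h = relu(pre)
    return J.T @ (h - y)

def denoise(Ws, y, x0, alpha=0.05, iters=3000):
    x = x0.copy()
    for _ in range(iters):
        if loss(Ws, -x, y) < loss(Ws, x, y):  # negation check (Steps 3-7)
            x = -x
        x = x - alpha * step_direction(Ws, x, y)  # gradient step (Steps 8-9)
    return x

# Small expansive network with N(0, 2/n_i) weights, as in the analysis.
rng = np.random.default_rng(1)
dims = [2, 300, 784]  # k = 2, n_1 = 300, n_2 = 784
Ws = [np.sqrt(2.0 / dims[i + 1]) * rng.standard_normal((dims[i + 1], dims[i]))
      for i in range(len(dims) - 1)]

x_star = np.array([1.0, 0.0])
y = forward(Ws, x_star)  # noiseless observation for illustration
x_hat = denoise(Ws, y, x0=0.1 * rng.standard_normal(2))
print(np.linalg.norm(x_hat - x_star))  # close to 0
```

The accumulated matrix `J` equals the product of active-weight matrices at the current iterate, so `J.T @ (h - y)` is exactly the step direction of Step 8 wherever $G$ is differentiable.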
The network model we consider is fully connected. We anticipate that the analysis of this paper can be extended to the case of generative convolutional neural networks; this extension has already happened for theoretical results concerning the favorability of the optimization landscape for compressive sensing under generative priors [Ma+18], as mentioned previously. We are now ready to state our main result.

Theorem 2. Consider a network with the weights in the $i$-th layer, $W_i \in \mathbb{R}^{n_i \times n_{i-1}}$, distributed i.i.d. $\mathcal{N}(0, 2/n_i)$, and suppose that the network satisfies the expansivity condition for $\epsilon = K/d^{90}$. Also, suppose that the noise variance $\omega$, defined for notational convenience as

$$\omega := \sqrt{18\, \sigma^2 \frac{k}{n} \log\big(n_1^d n_2^{d-1} \cdots n_d\big)},$$

obeys $\omega \le \|x^*\|\, K_1/d^{16}$. Consider the iterates of Algorithm 1 with step size $\alpha = K_4/d^2$. Then there exists a number of steps $N$, upper bounded by $N \le K_2 d^{86} f(x_0)/\|x^*\|$, such that after $N$ steps the iterates of Algorithm 1 obey

$$\|x_i - x^*\| \le K_5 d^{36} \epsilon\, \|x^*\| + K_6 d^6 \omega \quad \text{for all } i \ge N, \qquad (3)$$

with probability at least $1 - 2 e^{-2k \log n} - \sum_{i=2}^{d} 8 n_i e^{-K_7 \epsilon^2 n_{i-1}} - 8 n_1 e^{-K_7 k \log d / d^{180}}$. In addition, for all $i > N$, we have

$$\|x_i - x^*\| \le \big(1 - \tfrac{7\alpha}{8}\big)^{i-N} \|x_N - x^*\| + K_8 \omega \qquad (4)$$

and

$$\|G(x_i) - G(x^*)\| \le 1.2 \big(1 - \tfrac{7\alpha}{8}\big)^{i-N} \|x_N - x^*\| + 1.2\, K_8 \omega, \qquad (5)$$

where $\alpha < 1$ is the step size of the algorithm. Here, $K_1, K_2, \ldots$ are numerical constants, and $x_0$ is the initial point of the optimization.

Our result guarantees that after polynomially many iterations (with respect to $d$), the algorithm converges linearly to a region satisfying $\|x_i - x^*\|^2 \lesssim \sigma^2 \frac{k}{n}$, where the notation $\lesssim$ absorbs a factor logarithmic in $n$ and polynomial in $d$. One can show that $G$ is Lipschitz in a region around $x^*$,¹ so that also $\|G(x_i) - G(x^*)\|^2 \lesssim \sigma^2 \frac{k}{n}$.
Thus, the theorem guarantees that our algorithm achieves the denoising rate $\sigma^2 k/n$; as a consequence, denoising based on a deep generative prior provably reduces the energy of the noise in the original image by a factor of $k/n$. In the case $\sigma = 0$, the theorem guarantees convergence to the global minimizer $x^*$.

We note that the intention of this paper is to show rate-optimality of recovery with respect to the noise power, the latent code dimensionality, and the signal dimensionality. As a result, no attempt was made to establish optimal bounds with respect to the scaling of the constants or the powers of $d$. The bounds provided in the theorem are highly conservative in the constants and in the dependency on the number of layers, $d$, in order to keep the proof as simple as possible. Numerical experiments shown later reveal that the parameter range for successful denoising is much broader than the constants suggest. As this result is the first of its kind for the rigorous analysis of the denoising performance of deep generative networks, we anticipate that the results can be improved in future research, as has happened for other problems such as sparsity-based compressed sensing and phase retrieval.

Finally, we remark that Theorem 2 can be generalized to the case where $y^*$ only approximately lies in the range of the generator, i.e., $G(x^*) \approx y^*$. Specifically, if $\|G(x^*) - y^*\|$ is sufficiently small, the error induced by this perturbation is proportional to $\|G(x^*) - y^*\|$.

¹ The proof of Lipschitzness follows from applying the Weight Distribution Condition in Section 5.1.

5.1 The Weight Distribution Condition (WDC)

To prove our main result, we make use of a deterministic condition on $G$, called the Weight Distribution Condition (WDC), and then show that Gaussian $W_i$, as given by the statement of Theorem 2, are such that $W_i/\sqrt{2}$ satisfies the WDC with the appropriate probability for all $i$, provided the expansivity condition holds. Our main result, Theorem 2, continues to hold for any weight matrices such that $W_i/\sqrt{2}$ satisfy the WDC. The condition concerns the spatial arrangement of the network weights within each layer. We say that the matrix $W \in \mathbb{R}^{n \times k}$ satisfies the Weight Distribution Condition with constant $\epsilon$ if, for all nonzero $x, y \in \mathbb{R}^k$,

$$\Big\| \sum_{i=1}^{n} 1_{\langle w_i, x\rangle > 0}\, 1_{\langle w_i, y\rangle > 0}\, w_i w_i^T - Q_{x,y} \Big\| \le \epsilon, \quad \text{with} \quad Q_{x,y} = \frac{\pi - \theta_0}{2\pi} I_k + \frac{\sin\theta_0}{2\pi} M_{\hat{x}\leftrightarrow\hat{y}}, \qquad (6)$$

where $w_i \in \mathbb{R}^k$ is the $i$-th row of $W$; $M_{\hat{x}\leftrightarrow\hat{y}} \in \mathbb{R}^{k \times k}$ is the matrix² such that $\hat{x} \mapsto \hat{y}$, $\hat{y} \mapsto \hat{x}$, and $z \mapsto 0$ for all $z \in \mathrm{span}(\{x, y\})^{\perp}$; $\hat{x} = x/\|x\|_2$ and $\hat{y} = y/\|y\|_2$; $\theta_0 = \angle(x, y)$; and $1_S$ is the indicator function on $S$. The norm on the left-hand side of (6) is the spectral norm. Note that an elementary calculation³ gives $Q_{x,y} = \mathbb{E}\big[\sum_{i=1}^{n} 1_{\langle w_i, x\rangle > 0}\, 1_{\langle w_i, y\rangle > 0}\, w_i w_i^T\big]$ for $w_i \sim \mathcal{N}(0, I_k/n)$. As the rows $w_i$ correspond to the weights of the $i$-th neuron in a layer given by $W$, the WDC provides a deterministic property under which the neuron weights within the layer are distributed approximately like Gaussians. The WDC could also be interpreted as a deterministic property under which the neuron weights are distributed approximately like high-dimensional vectors chosen uniformly from a sphere of a particular radius. Note that if $x = y$, then $Q_{x,y}$ is an isometry up to a factor of $1/2$.

5.2 Sketch of proof of Theorem 2

The proof relies on a characterization of the loss surface. We show that outside of two balls around $x = x^*$ and $x = -\rho_d x^*$, with $\rho_d$ a constant defined in the proof, the direction chosen by the algorithm is a descent direction, with high probability. We show that the step direction $\tilde{v}_x$ concentrates around a particular $h_x \in \mathbb{R}^k$ that is a continuous function of nonzero $x, x^*$ and is zero only at $x = x^*$, $x = -\rho_d x^*$, and $0$, using a concentration argument similar to [HV18]. Around $x = x^*$ the loss function has a global minimum, close to $0$ it has a saddle point, and close to $x = -\rho_d x^*$ potentially a local minimum. In a nutshell, we show that i) the algorithm moves away from the saddle point at $0$, and ii) the algorithm escapes the local minimum close to $x = -\rho_d x^*$ via the negation check in Steps 3-7 of the algorithm. Finally, the iterates end up close to $x^*$.

Figure 2: Logic of the proof, explained in the text. (The figure depicts $x^*$, $-\rho_d x^*$, the origin, and the regions $S_\beta^+$ and $S_\beta^-$.)

The proof is organized as described next and as illustrated in Figure 2. The algorithm is initialized at an arbitrary point, for example close to $0$. Algorithm 1 moves away from $0$, at least until its iterates are outside the gray ring, as $0$ is a local maximum; and once an iterate $x_i$ leaves the gray ring around $0$, all subsequent iterates will never be in the white circle around $0$ again (see Lemma 11 in the supplement). Then the algorithm might move towards $-\rho_d x^*$, but once it enters the dashed ball around $-\rho_d x^*$, it enters a region where the function value is strictly larger than that of the dashed ball around $x^*$, by Lemma 13 in the supplement.

² A formula for $M_{\hat{x}\leftrightarrow\hat{y}}$ is as follows. If $\theta_0 = \angle(\hat{x}, \hat{y}) \in (0, \pi)$ and $R$ is a rotation matrix such that $\hat{x}$ and $\hat{y}$ map to $e_1$ and $\cos\theta_0 \cdot e_1 + \sin\theta_0 \cdot e_2$, respectively, then
$$M_{\hat{x}\leftrightarrow\hat{y}} = R^T \begin{pmatrix} \cos\theta_0 & \sin\theta_0 & 0 \\ \sin\theta_0 & -\cos\theta_0 & 0 \\ 0 & 0 & 0_{k-2} \end{pmatrix} R,$$
where $0_{k-2}$ is a $(k-2) \times (k-2)$ matrix of zeros. If $\theta_0 = 0$ or $\pi$, then $M_{\hat{x}\leftrightarrow\hat{y}} = \hat{x}\hat{x}^T$ or $-\hat{x}\hat{x}^T$, respectively.

³ To do this calculation, take $x = e_1$ and $y = \cos\theta_0 \cdot e_1 + \sin\theta_0 \cdot e_2$ without loss of generality. Then each entry of the matrix can be determined analytically by an integral that factors in polar coordinates.
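As a numerical aside, the expectation identity for $Q_{x,y}$ and the footnote formula for $M_{\hat{x}\leftrightarrow\hat{y}}$ can be sanity-checked by Monte Carlo (the dimension $k$, the angle, and the sample size are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, theta = 4, 100_000, 0.9

# Unit vectors at angle theta_0; w.l.o.g. x = e1 and y in the e1-e2 plane (so R = I).
x = np.array([1.0, 0.0, 0.0, 0.0])
y = np.array([np.cos(theta), np.sin(theta), 0.0, 0.0])

# Rows w_i ~ N(0, I_k / n); form  sum_i 1{<w_i,x> > 0} 1{<w_i,y> > 0} w_i w_i^T.
W = rng.standard_normal((n, k)) / np.sqrt(n)
active = ((W @ x > 0) & (W @ y > 0)).astype(float)
S = (W * active[:, None]).T @ W

# M swaps x-hat and y-hat and kills the orthogonal complement (footnote formula, R = I).
M = np.zeros((k, k))
M[:2, :2] = [[np.cos(theta), np.sin(theta)],
             [np.sin(theta), -np.cos(theta)]]
Q = (np.pi - theta) / (2 * np.pi) * np.eye(k) + np.sin(theta) / (2 * np.pi) * M

print(np.linalg.norm(S - Q, 2))  # spectral-norm deviation; shrinks as n grows
```

With $n$ in the hundreds of thousands the empirical sum concentrates tightly around $Q_{x,y}$, which is exactly the behavior the WDC abstracts into a deterministic condition.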
Thus Steps 3-7 of the algorithm ensure that the next iterate $x_i$ is in the dashed ball around $x^*$. From there, the iterates move into the region $S_\beta^+$, since outside of $S_\beta^+ \cup S_\beta^-$ the algorithm chooses a descent direction in each step (see the argument around equation (22) in the supplement). The region $S_\beta^+$ is covered by a ball of radius $r$, by Lemma 12 in the supplement, determined by the noise and $\epsilon$. This establishes the bound (3) in the theorem. The proof proceeds by showing that within a ball around $x^*$, the algorithm converges linearly, which establishes equations (4) and (5).

6 Applications to Compressed Sensing

In this section we briefly discuss another important scenario to which our results apply, namely regularizing inverse problems using deep generative priors. Approaches that regularize inverse problems using deep generative models [Bor+17] have empirically been shown to improve over sparsity-based approaches; see [Luc+18] for a review of applications in imaging, and [Mar+17] for an application in magnetic resonance imaging showing a significant performance improvement over conventional methods.

Consider an inverse problem where the goal is to reconstruct an unknown vector $y^* \in \mathbb{R}^n$ from $m < n$ noisy linear measurements:

$$z = A y^* + \eta \in \mathbb{R}^m,$$

where $A \in \mathbb{R}^{m \times n}$ is called the measurement matrix and $\eta$ is zero-mean Gaussian noise with covariance matrix $(\sigma^2/n) I$, as before. As before, assume that $y^*$ lies in the range of a generative prior $G$, i.e., $y^* = G(x^*)$ for some $x^*$. As a way to recover $x^*$, consider minimizing the empirical risk objective

$$f(x) = \frac{1}{2}\|A G(x) - z\|^2,$$

using Algorithm 1, with the step direction in Step 8 substituted by

$$\tilde{v}_{x_i} = \Big(A \prod_{j=d}^{1} W_{j,+,x_i}\Big)^T (A\, G(x_i) - z),$$

to account for the fact that the measurements were taken with the matrix $A$. Suppose that $A$ is a random projection matrix; for concreteness, assume that $A$ has i.i.d.
Gaussian entries with variance $1/m$. One could prove a result analogous to Theorem 2, but with

$$\omega = \sqrt{18\, \sigma^2 \frac{k}{m} \log\big(n_1^d n_2^{d-1} \cdots n_d\big)}$$

(note that $n$ has been replaced by $m$). This extension shows that, provided $\epsilon$ is chosen sufficiently small, our algorithm yields an iterate $x_i$ obeying

$$\|G(x_i) - G(x^*)\|^2 \lesssim \sigma^2 \frac{k}{m},$$

where again $\lesssim$ absorbs factors logarithmic in the $n_i$ and polynomial in $d$. Proving this result would be analogous to the proof of Theorem 2, with the additional assumption that the sensing matrix $A$ acts like an isometry on the union of the ranges of $\prod_{j=d}^{1} W_{j,+,x}$, analogous to the proof in [HV18]. This extension of our result shows that Algorithm 1 enables solving inverse problems under noise efficiently, and quantifies the effect of the noise.

We hasten to add that the paper [Bor+17] also derived an error bound for minimizing the empirical loss. However, the corresponding result (for example, Lemma 4.3) differs in two important aspects from ours. First, the result in [Bor+17] only makes a statement about the minimizer of the empirical loss and does not provide justification that an algorithm can efficiently find a point near the global minimizer. As the program is non-convex, and as non-convex optimization is NP-hard in general, the empirical loss could have local minima at which algorithms get stuck. In contrast, the present paper presents a specific practical algorithm and proves that it finds a solution near the global optimizer regardless of initialization. Second, the result in [Bor+17] considers arbitrary noise $\eta$ and thus cannot assert denoising performance. In contrast, we consider a random model for the noise and show the denoising behavior: the resulting error is no more than $O(k/n)$, as opposed to $\|\eta\|^2 \approx O(1)$, which is what we would get from a direct application of the result in [Bor+17].
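The modified step direction can be sketched as follows (illustrative dimensions; `cs_step_direction` is a hypothetical helper mirroring the step-direction computation of Algorithm 1 with the extra factor $A$, run here in the noiseless case):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def cs_step_direction(Ws, A, x, z):
    # (A prod_{j=d}^{1} W_{j,+,x})^T (A G(x) - z)
    h, J = x, np.eye(len(x))
    for W in Ws:
        pre = W @ h
        J = (W * (pre > 0)[:, None]) @ J  # accumulate active-weight matrices
        h = relu(pre)
    return (A @ J).T @ (A @ h - z)

rng = np.random.default_rng(0)
dims, m = [2, 100, 300], 60
Ws = [np.sqrt(2.0 / dims[i + 1]) * rng.standard_normal((dims[i + 1], dims[i]))
      for i in range(len(dims) - 1)]
A = rng.standard_normal((m, dims[-1])) / np.sqrt(m)  # i.i.d. entries of variance 1/m

def G(x):
    h = x
    for W in Ws:
        h = relu(W @ h)
    return h

x_star = np.array([1.0, -0.5])
z = A @ G(x_star)  # noiseless measurements for illustration

# At the truth the residual, and hence the step direction, vanishes.
print(np.linalg.norm(cs_step_direction(Ws, A, x_star, z)))  # exactly 0.0

# A few gradient steps from a perturbed point shrink the measurement residual.
x = x_star + np.array([0.2, 0.2])
for _ in range(300):
    x = x - 0.05 * cs_step_direction(Ws, A, x, z)
print(np.linalg.norm(A @ G(x) - z))
```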
7 Experimental results

In this section we provide experimental evidence corroborating our theoretical claim that denoising with deep priors achieves a denoising rate proportional to $\sigma^2 k/n$. We focus on denoising by enforcing a generative prior, and consider both a synthetic, random prior, as studied theoretically in this paper, and a prior learned from data. All our results are reproducible with the code provided in the supplement.

7.1 Denoising with a synthetic prior

We start with a synthetic generative network prior with ReLU activation functions and draw its weights independently from a Gaussian distribution. We consider a two-layer network with $n = 1500$ neurons in the output layer and $500$ in the middle layer, and vary the number of input neurons, $k$, and the noise level, $\sigma$. We next present simulations showing that, if $k$ is sufficiently small, our algorithm achieves a denoising rate proportional to $\sigma^2 k/n$, as guaranteed by our theory. Towards this goal, we generate Gaussian inputs $x^*$ to the network and observe the noisy image $y = G(x^*) + \eta$, $\eta \sim \mathcal{N}(0, (\sigma^2/n) I)$. From the noisy image, we first obtain an estimate $\hat{x}$ of the latent representation by running Algorithm 1 until convergence, and second obtain an estimate of the image as $\hat{y} = G(\hat{x})$. In the left and middle panels of Figure 4, we depict the normalized mean squared error of the latent representation, $\mathrm{MSE}(\hat{x}, x^*)$, and the mean squared error in the image domain, $\mathrm{MSE}(G(\hat{x}), G(x^*))$, where we define $\mathrm{MSE}(z, z') = \|z - z'\|^2$. For the left panel, we fix the noise variance to $\sigma^2 = 0.25$ and vary $k$; for the middle panel, we fix $k = 50$ and vary the noise variance.

Figure 3: Denoising with a learned generative prior (noisy and denoised images for noise variances from 0 to 2.7): even when the number is barely visible, the denoiser recovers a sharp image.
The results show that if the network is sufficiently expansive, which is guaranteed by $k$ being sufficiently small, then in the noiseless case ($\sigma^2 = 0$) the latent representation and the image are perfectly recovered. In the noisy case, we achieve an MSE proportional to $\sigma^2 k/n$, both in the representation and in the image domain. We also observed that for the problem instances considered here, the negation trick in steps 3-4 of Algorithm 1 is often not necessary, in that even without that step the algorithm typically converges to the global minimum. Having said this, in general the negation step is necessary, since there exist problem instances that have a local minimum opposite of $x^*$.

7.2 Denoising with a learned prior

We next consider a prior learned from data. Technically, our theory does not apply to such a prior, since we assume the weights to be chosen at random. However, the numerical results presented in this section show that even for the learned prior we achieve the rate predicted by our theory pertaining to a random prior. Towards this goal, we consider a fully-connected autoencoder parameterized by $k$, consisting of a decoder and an encoder with ReLU activation functions and fully connected layers. We choose the number of neurons in the three layers of the encoder as $784, 400, k$, and those of the decoder as $k, 400, 784$. We set $k = 10$ and $k = 20$ to obtain two different autoencoders. We train both autoencoders on the MNIST [Lec+98] training set. We then take an image $y^*$ from the MNIST test set, add Gaussian noise to it, and denoise it using our method based on the learned decoder network $G$ for $k = 10$ and $k = 20$. Specifically, we estimate the latent representation $\hat{x}$ by running Algorithm 1, and then set $\hat{y} = G(\hat{x})$. See Figure 3 for a few examples demonstrating the performance of our approach for different noise levels.
We next show that this achieves a mean squared error (MSE) proportional to $\sigma^2 k/n$, as suggested by our theory, which applies to decoders with random weights. We add noise to the images with noise variance ranging from $\sigma^2 = 0$ to $\sigma^2 = 6$. In the right panel of Figure 4 we show the MSE in the image domain, $\mathrm{MSE}(G(\hat{x}), G(x^*))$, averaged over a number of images for the learned decoders with $k = 10$ and $k = 20$. We observe an interesting tradeoff: The decoder with $k = 10$ has fewer parameters, and thus does not represent the digits as well; therefore its MSE in the noiseless case (i.e., for $\sigma = 0$) is larger than that for $k = 20$. On the other hand, the smaller number of parameters results in a better denoising rate (by about a factor of two), corresponding to the steeper slope of the MSE as a function of the noise variance, $\sigma^2$. Conversely, the learned decoder with $k = 20$ has more parameters and thus represents the images with a smaller error, so its MSE at $\sigma = 0$ is smaller; however, the denoising rate for the decoder with $k = 20$, which is the slope of the curve, is larger as well, as suggested by our theory.

Figure 4: Mean squared error in the image domain, $\mathrm{MSE}(G(\hat{x}), G(x^*))$, and in the latent representation, $\mathrm{MSE}(\hat{x}, x^*)$, as a function of the dimension of the latent representation, $k$, with $\sigma^2 = 0.25$ (left panel), and of the noise variance, $\sigma^2$, with $k = 50$ (middle panel). As suggested by the theory pertaining to decoders with random weights, if $k$ is sufficiently small, and thus the network is sufficiently expansive, the denoising rate is proportional to $\sigma^2 k/n$. Right panel: Denoising of handwritten digits based on a learned decoder with $k = 10$ and $k = 20$, along with the least-squares fits as dotted lines.

Acknowledgements

RH is partially supported by NSF Grant IIS-1816986, and PH is partially supported by NSF CAREER Grant DMS-1848087 as well as NSF Grant DMS-1464525. The authors would like to thank Tan Nguyen for helpful discussions.

References

[Aro+15] S. Arora, Y. Liang, and T. Ma. "Why are deep nets reversible: A simple theory, with implications for training". In: arXiv:1511.05653 (2015).
[Bor+17] A. Bora, A. Jalal, E. Price, and A. G. Dimakis. "Compressed sensing using generative models". In: arXiv:1703.03208 (2017).
[Bur+12] H. C. Burger, C. J. Schuler, and S. Harmeling. "Image denoising: Can plain neural networks compete with BM3D?" In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. 2012, pp. 2392–2399.
[Cla17] C. Clason. "Nonsmooth Analysis and Optimization". In: arXiv:1708.04180 (2017).
[Dab+07] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. "Image denoising by sparse 3-D transform-domain collaborative filtering". In: IEEE Transactions on Image Processing 16.8 (2007), pp. 2080–2095.
[Don95] D. L. Donoho. "De-noising by soft-thresholding". In: IEEE Transactions on Information Theory 41.3 (1995), pp. 613–627.
[EA06] M. Elad and M. Aharon. "Image denoising via sparse and redundant representations over learned dictionaries". In: IEEE Transactions on Image Processing 15.12 (2006), pp. 3736–3745.
[Goo+14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. "Generative adversarial nets". In: Advances in Neural Information Processing Systems 27. 2014, pp. 2672–2680.
[HV18] P. Hand and V. Voroninski. "Global guarantees for enforcing deep generative priors by empirical risk". In: Conference on Learning Theory. arXiv:1705.07576. 2018.
[HH19] R. Heckel and P. Hand.
"Deep Decoder: Concise Image Representations from Untrained Non-convolutional Networks". In: International Conference on Learning Representations. 2019.
[HS06] G. E. Hinton and R. R. Salakhutdinov. "Reducing the dimensionality of data with neural networks". In: Science 313.5786 (2006), pp. 504–507.
[Kar+17] T. Karras, T. Aila, S. Laine, and J. Lehtinen. "Progressive growing of GANs for improved quality, stability, and variation". In: arXiv:1710.10196 (2017).
[LM00] B. Laurent and P. Massart. "Adaptive estimation of a quadratic functional by model selection". In: The Annals of Statistics 28.5 (2000), pp. 1302–1338.
[Lec+98] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document recognition". In: Proceedings of the IEEE 86.11 (1998), pp. 2278–2324.
[Luc+18] A. Lucas, M. Iliadis, R. Molina, and A. K. Katsaggelos. "Using deep neural networks for inverse problems in imaging: Beyond analytical methods". In: IEEE Signal Processing Magazine 35.1 (2018), pp. 20–36.
[Lug+13] G. Lugosi, P. Massart, and S. Boucheron. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
[Ma+18] F. Ma, U. Ayaz, and S. Karaman. "Invertibility of convolutional generative networks from partial measurements". In: Advances in Neural Information Processing Systems. 2018, pp. 9651–9660.
[Mar+17] M. Mardani, H. Monajemi, V. Papyan, S. Vasanawala, D. Donoho, and J. Pauly. "Recurrent generative adversarial networks for proximal learning and automated compressive image recovery". In: arXiv:1711.10046 (2017).
[Sta+02] J.-L. Starck, E. J. Candes, and D. L. Donoho. "The curvelet transform for image denoising". In: IEEE Transactions on Image Processing 11.6 (2002), pp. 670–684.
[Uly+18] D. Ulyanov, A. Vedaldi, and V. Lempitsky. "Deep Image Prior". In: IEEE Conference on Computer Vision and Pattern Recognition. 2018.
[Zha+17] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. "Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising". In: IEEE Transactions on Image Processing 26.7 (2017), pp. 3142–3155.

A Proof of Proposition 1

We first show that $H(y)$ lies in the range of a union of $k$-dimensional subspaces, and upper-bound the number of these subspaces. Towards this goal, first note that the effect of the ReLU operation $\mathrm{relu}(z)$ can be described with a diagonal matrix $D$ which contains a one on its diagonal if the respective entry of $z$ is larger than zero, and a zero otherwise, so that $Dz = \mathrm{relu}(z)$. With this notation, we can write
\[
H(y) = \underbrace{D_d W_d D_{d-1} \cdots D_2 W_2 D_1 W_1 D'_1 W'_1}_{U}\, y.
\]
The matrix $U \in \mathbb{R}^{n \times n}$ has rank at most $k$; thus $H(y)$ lies in the range of a union of at most $k$-dimensional subspaces, where each subspace is determined by the matrices $D_d, \ldots, D_1, D'_1$. We next bound the number of subspaces. First note that since $W'_1 \in \mathbb{R}^{k \times n}$, there are only $2^k$ different choices for $D'_1$, corresponding to all the sign patterns. Next note that, by the lemma below, for a fixed $D'_1 W'_1$, the number of different matrices $D_1$ can be bounded by $n_1^k$. Likewise, for fixed $W_2 D_1 W_1 D'_1 W'_1$, the number of different matrices $D_2$ can be bounded by $n_2^k$, and so forth. Thus, the total number of different choices of the matrices $D_d, \ldots, D_1, D'_1$ is upper bounded by $2^k n_1^k \cdots n_d^k$.

Lemma 3. For any $U \in \mathbb{R}^{n \times k}$ and $k \geq 5$,
\[
\big|\{ \mathrm{diag}(Uv > 0)\, U \mid v \in \mathbb{R}^k \}\big| \leq n^k.
\]

Next note that by assumption we have that
\[
\|U\eta\|_2^2 / \|\eta\|_2^2 \leq 2, \tag{7}
\]
for all vectors $\eta$ and for all $U$ defined by the matrices $D_d, \ldots, D_1, D'_1$. For fixed $U$, let $S$ be the span of the right singular vectors of $U$, and note that $S$ has dimension at most $k$. Let $P_S$ be the orthogonal projector onto the subspace $S$.
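The two counting devices above are easy to check on a small instance. The sketch below is illustrative only (the sizes are arbitrary and the decoder is truncated to a single layer, which is an assumption of this sketch rather than the paper's setting): it verifies that the ReLU acts as a diagonal mask, that the resulting $U$ has rank at most $k$, and that the number of distinct masks $\mathrm{diag}(Wv > 0)$ is far below the $n^k$ bound of Lemma 3.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, n1 = 60, 5, 40

# ReLU as a mask: relu(z) = D z with D = diag(1{z > 0}).  On a fixed activation
# region, H(y) = U y for U = D_1 W_1 D'_1 W'_1, which factors through R^k,
# so rank(U) <= k.  (One decoder layer only, for brevity.)
W1p = rng.normal(size=(k, n))            # encoder weights W'_1
W1 = rng.normal(size=(n1, k))            # first decoder layer W_1

y = rng.normal(size=n)
z = W1p @ y
D1p = np.diag((z > 0).astype(float))     # encoder activation mask
z2 = W1 @ D1p @ z
D1 = np.diag((z2 > 0).astype(float))     # decoder activation mask
U = D1 @ W1 @ D1p @ W1p
assert np.allclose(U @ y, np.maximum(W1 @ np.maximum(W1p @ y, 0.0), 0.0))
print(np.linalg.matrix_rank(U))          # at most k = 5

# Lemma 3 flavor: the number of distinct masks diag(W v > 0) over v in R^k is
# at most n^k; for k = 2, the n rows of W cut the plane into only 2n cones.
W = rng.normal(size=(12, 2))
patterns = {tuple(W @ v > 0) for v in rng.normal(size=(20000, 2))}
print(len(patterns))                     # <= 2 * 12 = 24, far below 12^2
```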
We have that
\[
\frac{\|U\eta\|_2^2}{\|\eta\|_2^2} = \frac{\|U P_S \eta\|_2^2}{\|\eta\|_2^2} \leq 2\, \frac{\|P_S \eta\|_2^2}{\|\eta\|_2^2},
\]
again for all $\eta$. Now we make use of the following bound on the projection of the noise $\eta$ onto a subspace, which follows from standard Gaussian concentration inequalities [LM00, Lem. 1].

Lemma 4. Let $S \subset \mathbb{R}^n$ be a subspace with dimension $k$. Let $\eta \sim \mathcal{N}(0, I_n)$ and $\beta \geq 1$. Then,
\[
P\!\left[ \frac{\|P_S \eta\|_2^2}{\|\eta\|_2^2} \leq 10 \beta \frac{k}{n} \right] \geq 1 - e^{-\beta k} - e^{-n/16}.
\]

Taking a union bound over all subspaces, we obtain with the lemma above that
\[
P\!\left[ \frac{\|H(\eta)\|_2^2}{\|\eta\|_2^2} \leq 20 \beta \frac{k}{n} \right] \geq 1 - \big(2^k n_1^k \cdots n_d^k\big)\big(e^{-\beta k} + e^{-n/16}\big).
\]
Choosing $\beta = 2\log(2 n_1 n_2 \cdots n_d)$ concludes the proof.

B Proof of Theorem 2

In this section we prove our main result, Theorem 2. Instead of proving Theorem 2 as stated, we will prove the following equivalent rescaled statement for when the $W_i$ have i.i.d. $\mathcal{N}(0, 1/n_i)$ entries. Because of this rescaling, $G(x)$ scales like $2^{-d/2}\|x\|$, the noise level $\omega$ is assumed to scale like $2^{-d/2}$, $\nabla f$ scales like $2^{-d}$, and $\alpha$ scales like $2^{d}$. Theorem 2 is the $\epsilon = K/d^{90}$ case of what follows.

Theorem 5. Consider a network with the weights in the $i$-th layer, $W_i \in \mathbb{R}^{n_i \times n_{i-1}}$, i.i.d. $\mathcal{N}(0, 1/n_i)$ distributed, and suppose that the network satisfies the expansivity condition for some $\epsilon \leq K/d^{90}$. Also, suppose that the noise level obeys
\[
\omega \leq K_1 \frac{2^{-d/2}}{d^{16}}\|x^*\|, \qquad \omega := \sqrt{18 \sigma^2 \frac{k}{n} \log\!\big(n_1^d n_2^{d-1} \cdots n_d\big)}.
\]
Consider the iterates of Algorithm 1 with stepsize $\alpha = K_4 \frac{2^d}{d^2}$.

A. Then there exists a number of steps $N$ upper bounded by
\[
N \leq K_2 \frac{2^d f(x^0)}{d^4 \epsilon \|x^*\|^2}
\]
such that after $N$ steps, the iterates of Algorithm 1 obey
\[
\|x_i - x^*\| \leq K_5 d^9 \sqrt{\epsilon}\, \|x^*\| + K_6 d^6 2^{d/2}\omega, \quad \text{for all } i \geq N, \tag{8}
\]
with probability at least $1 - 2e^{-2k\log n} - \sum_{i=2}^{d} 8 n_i e^{-K_7 \epsilon^2 n_{i-1}} - 8 n_1 e^{-K_7 \epsilon^2 \log(1/\epsilon) k}$.

B.
In addition, for all $i \geq N$, we have
\[
\|x_{i+1} - x^*\| \leq \left(1 - \tfrac{7}{8}\alpha 2^{-d}\right)^{i+1-N}\|x_N - x^*\| + K_8 2^{d/2}\omega, \quad \text{and} \tag{9}
\]
\[
\|G(x_{i+1}) - G(x^*)\| \leq 1.2\, 2^{-d/2}\left(1 - \tfrac{7}{8}\alpha 2^{-d}\right)^{i+1-N}\|x_N - x^*\| + 1.2 K_8 \omega, \tag{10}
\]
where $\alpha$ is the stepsize of the algorithm. Here, $K_1, K_2, \ldots$ are numerical constants, and $x^0$ is the initial point of the optimization.

As mentioned in Section 5.1, our proof makes use of a deterministic condition, called the Weight Distribution Condition (WDC), formally defined in Section 5.1. The following lemma establishes that the expansivity condition ensures that the WDC holds:

Lemma 6 (Lemma 9 in [HV18]). Fix $\epsilon \in (0, 1)$. If the entries of $W_i \in \mathbb{R}^{n_i \times n_{i-1}}$ are i.i.d. $\mathcal{N}(0, 1/n_i)$ and the expansivity condition $n_i > c\, \epsilon^{-2}\log(1/\epsilon)\, n_{i-1}\log n_{i-1}$ holds, then $W_i$ satisfies the WDC with constant $\epsilon$ with probability at least $1 - 8 n_i e^{-K\epsilon^2 n_{i-1}}$. Here, $c$ and $K$ are numerical constants.

It follows from Lemma 6 that the WDC holds for all $W_i$ simultaneously with probability at least $1 - \sum_{i=2}^{d} 8 n_i e^{-K_7 \epsilon^2 n_{i-1}} - 8 n_1 e^{-K_7 \epsilon^2 \log(1/\epsilon) k}$. In the remainder of the proof we work on the event that the WDC holds for all $W_i$.

B.1 Preliminaries

Recall that the goal of our algorithm is to minimize the empirical risk objective $f(x) = \frac{1}{2}\|G(x) - y\|^2$, where $y := G(x^*) + \eta$, with $\eta \sim \mathcal{N}(0, (\sigma^2/n) I)$. Our results rely on the fact that outside of two balls around $x = x^*$ and $x = -\rho_d x^*$, with $\rho_d$ a constant defined below, the direction chosen by the algorithm is a descent direction, with high probability. In order to prove this, we use a concentration argument, similar to the arguments used in [HV18].
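As a quick numerical illustration of the kind of concentration driving these arguments (this check is ours, not from the paper), the energy a Gaussian vector places in any fixed $k$-dimensional subspace is about a $k/n$ fraction of its total energy, in line with Lemma 4 above; the sizes below mirror the synthetic experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, trials = 1500, 10, 200

# For a fixed k-dimensional subspace S and eta ~ N(0, I_n), the ratio
# ||P_S eta||^2 / ||eta||^2 concentrates around k/n.
Q, _ = np.linalg.qr(rng.normal(size=(n, k)))   # orthonormal basis of S
ratios = []
for _ in range(trials):
    eta = rng.normal(size=n)
    ratios.append(np.sum((Q.T @ eta) ** 2) / np.sum(eta ** 2))

print(np.mean(ratios), k / n)                  # mean ratio is close to k/n
```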
First, for notational convenience, define $\Lambda_x := \Pi_{i=d}^{1} W_{i,+,x}$, with $W_{i,+,x}$ as defined in Section 4, and note that the step direction of our algorithm can be written as $\tilde{v}_x = v_x + \bar{q}_x$, with
\[
v_x := \Lambda_x^t \Lambda_x x - \Lambda_x^t \Lambda_{x^*} x^*, \quad \text{and} \quad \bar{q}_x := \Lambda_x^t \eta. \tag{11}
\]
Note that at points $x$ where $G$ (and hence $f$) is differentiable, we have $\tilde{v}_x = \nabla f(x)$. The proof is based on showing that $\tilde{v}_x$ concentrates around a particular $h_x \in \mathbb{R}^k$, defined below, that is a continuous function of nonzero $x, x^*$ and is zero only at $x = x^*$ and $x = -\rho_d x^*$. The definition of $h_x$ depends on a function that is helpful for controlling how the operator $x \mapsto W_{+,x}x$ distorts angles, defined as
\[
g(\theta) := \cos^{-1}\!\left( \frac{(\pi - \theta)\cos\theta + \sin\theta}{\pi} \right). \tag{12}
\]
With this notation, we define
\[
h_x := -\frac{1}{2^d}\left( \prod_{i=0}^{d-1}\frac{\pi - \theta_i}{\pi} \right) x^* + \frac{1}{2^d}\left[ x - \sum_{i=0}^{d-1}\frac{\sin\theta_i}{\pi}\left( \prod_{j=i+1}^{d-1}\frac{\pi - \theta_j}{\pi} \right)\frac{\|x^*\|_2}{\|x\|_2}\, x \right],
\]
where $\theta_0 = \angle(x, x^*)$ and $\theta_i = g(\theta_{i-1})$. Note that $h_x$ is deterministic and depends only on $x$, $x^*$, and the number of layers, $d$. In order to bound the deviation of $\tilde{v}_x$ from $h_x$ we use the following two lemmas, bounding the deviation controlled by the WDC and the deviation caused by the noise:

Lemma 7 (Lemma 6 in [HV18]). Suppose that the WDC holds with $\epsilon < 1/(16\pi d^2)^2$. Then, for all nonzero $x, x^* \in \mathbb{R}^k$,
\[
\|v_x - h_x\|_2 \leq K\frac{d^3\sqrt{\epsilon}}{2^d}\max(\|x\|_2, \|x^*\|_2), \tag{13}
\]
\[
\big\langle \Lambda_x x, \Lambda_{x^*}x^* \big\rangle \geq \frac{1}{4\pi}\frac{1}{2^d}\|x\|_2\|x^*\|_2, \quad \text{and} \tag{14}
\]
\[
\|\Lambda_x\|^2 \leq \frac{1}{2^d}(1 + 2\epsilon)^d \leq \frac{13}{12}2^{-d}. \tag{15}
\]

Proof. Equations (13) and (14) are Lemma 6 in [HV18]. Regarding (15), note that the WDC implies that $\|W_{i,+,x}\|^2 \leq 1/2 + \epsilon$. It follows that
\[
\|\Lambda_x\|^2 = \left\|\Pi_{i=d}^{1}W_{i,+,x}\right\|^2 \leq \frac{1}{2^d}(1 + 2\epsilon)^d = \frac{1}{2^d}e^{d\log(1+2\epsilon)} \leq \frac{1 + 4\epsilon d}{2^d} \leq \frac{13}{12}2^{-d},
\]
where the last inequalities follow by our assumption on $\epsilon$ (i.e., $\epsilon < 1/(16\pi d^2)^2$).

Lemma 8.
Suppose the WDC holds with $\epsilon < 1/(16\pi d^2)^2$, that any subset of $n_{i-1}$ rows of $W_i$ is linearly independent for each $i$, and that $\eta \sim \mathcal{N}(0, (\sigma^2/n) I)$. Then the event
\[
E_{\mathrm{noise}} := \left\{ \left\|\Lambda_x^t \eta\right\| \leq \omega 2^{-d/2}, \text{ for all } x \right\}, \qquad \omega := \sqrt{16\sigma^2\frac{k}{n}\log\!\big(n_1^d n_2^{d-1}\cdots n_d\big)} \tag{16}
\]
holds with probability at least $1 - 2e^{-2k\log n}$.

As the cost function $f$ is not differentiable everywhere, we make use of the generalized subdifferential in order to reference the subgradients at nondifferentiable points. For a Lipschitz function $\tilde{f}$ defined from a Hilbert space $X$ to $\mathbb{R}$, the Clarke generalized directional derivative of $\tilde{f}$ at the point $x \in X$ in the direction $u$, denoted by $\tilde{f}^{o}(x; u)$, is defined by
\[
\tilde{f}^{o}(x; u) = \limsup_{y \to x,\ t \downarrow 0}\frac{\tilde{f}(y + tu) - \tilde{f}(y)}{t},
\]
and the generalized subdifferential of $\tilde{f}$ at $x$, denoted by $\partial\tilde{f}(x)$, is defined by
\[
\partial\tilde{f}(x) = \{ v \in \mathbb{R}^k \mid \langle v, u \rangle \leq \tilde{f}^{o}(x; u), \text{ for all } u \in X \}.
\]
Since $f(x)$ is a piecewise quadratic function, we have
\[
\partial f(x) = \mathrm{conv}(v_1, v_2, \ldots, v_t), \tag{17}
\]
where $\mathrm{conv}$ denotes the convex hull of the vectors $v_1, \ldots, v_t$, $t$ is the number of quadratic functions adjacent to $x$, and $v_i$ is the gradient of the $i$-th quadratic function at $x$.

Lemma 9. Under the assumptions of Lemma 8, and assuming that $E_{\mathrm{noise}}$ holds, we have that, for any $x \neq 0$ and any $v_x \in \partial f(x)$,
\[
\|v_x - h_x\| \leq K\frac{d^3\sqrt{\epsilon}}{2^d}\max(\|x\|_2, \|x^*\|_2) + \omega 2^{-d/2}.
\]

Proof. By (17), $\partial f(x) = \mathrm{conv}(v_1, \ldots, v_t)$ for some finite $t$, and thus $v_x = a_1 v_1 + \cdots + a_t v_t$ for some $a_1, \ldots, a_t \geq 0$ with $\sum_i a_i = 1$. For each $v_i$, there exists a $w$ such that $v_i = \lim_{t\downarrow 0}\tilde{v}_{x+tw}$. On the event $E_{\mathrm{noise}}$, we have for any $x \neq 0$ and any $\tilde{v}_x \in \partial f(x)$,
\[
\|\tilde{v}_x - h_x\| = \|v_x + \bar{q}_x - h_x\| \leq \|v_x - h_x\| + \|\bar{q}_x\| \leq K\frac{d^3\sqrt{\epsilon}}{2^d}\max(\|x\|_2, \|x^*\|_2) + \omega 2^{-d/2},
\]
where the last inequality follows from Lemmas 7 and 8 above.
The proof is concluded by appealing to the continuity of $h_x$ with respect to nonzero $x$, and by noting that
\[
\|v_x - h_x\| \leq \sum_i a_i\|v_i - h_x\| \leq K\frac{d^3\sqrt{\epsilon}}{2^d}\max(\|x\|_2, \|x^*\|_2) + \omega 2^{-d/2},
\]
where we used the inequality above and $\sum_i a_i = 1$.

We will also need an upper bound on the norm of the step direction of our algorithm:

Lemma 10. Suppose that the WDC holds with $\epsilon < 1/(16\pi d^2)^2$ and that the event $E_{\mathrm{noise}}$ holds with $\omega \leq \frac{2^{-d/2}\|x^*\|}{8\pi}$. Then, for all $x$ and all $v_x \in \partial f(x)$,
\[
\|v_x\| \leq \frac{dK}{2^d}\max(\|x\|, \|x^*\|), \tag{18}
\]
where $K$ is a numerical constant.

Proof. Define for convenience $\zeta_j = \prod_{i=j}^{d-1}\frac{\pi - \bar{\theta}_{i,x,x^*}}{\pi}$. We have
\[
\|v_x\| \leq \|h_x\| + \|h_x - v_x\| \leq \left\| \frac{1}{2^d}x - \frac{1}{2^d}\zeta_0 x^* - \frac{1}{2^d}\sum_{i=0}^{d-1}\frac{\sin\bar{\theta}_{i,x,x^*}}{\pi}\zeta_{i+1}\frac{\|x^*\|}{\|x\|}x \right\| + K_1\frac{d^3\sqrt{\epsilon}}{2^d}\max(\|x\|_2, \|x^*\|_2) + \omega 2^{-d/2}
\]
\[
\leq \frac{1}{2^d}\|x\| + \left(\frac{1}{2^d} + \frac{d}{\pi 2^d}\right)\|x^*\| + K_1\frac{d^3\sqrt{\epsilon}}{2^d}\max(\|x\|, \|x^*\|) + \omega 2^{-d/2} \leq \frac{dK}{2^d}\max(\|x\|, \|x^*\|),
\]
where the second inequality follows from the definition of $h_x$ and Lemma 9, the third inequality uses $|\zeta_j| \leq 1$, and the last inequality uses the assumption $\omega \leq \frac{2^{-d/2}\|x^*\|}{8\pi}$.

B.2 Proof of Theorem 5 A

We are now ready to prove Theorem 5 A. The logic of the proof is illustrated in Figure 2. Recall that $x_i$ is the $i$-th iterate of $x$ as per Algorithm 1. We first ensure that we can assume throughout that $x_i$ is bounded away from zero:

Lemma 11. Suppose that the WDC holds with $\epsilon < 1/(16\pi d^2)^2$ and that $E_{\mathrm{noise}}$ holds with $\omega$ in (16) obeying $\omega \leq \frac{2^{-d/2}\|x^*\|}{8\pi}$. Moreover, suppose that the stepsize in Algorithm 1 satisfies $0 < \alpha < K\frac{2^d}{d^2}$, where $K$ is a numerical constant. Then, after at most $N = \left(38\pi K_0 \frac{2^d}{\alpha}\right)^2$ steps, we have for all $i > N$ and all $t \in [0, 1]$ that
\[
t\tilde{x}_i + (1 - t)x_{i+1} \notin B\!\left(0, \tfrac{1}{32\pi}\|x^*\|\right).
\]
In particular, if $\alpha = K 2^d/d^2$, then $N$ is bounded by a constant times $d^4$.
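The deterministic vector $h_x$ and the angle map $g$ from (12) are easy to evaluate numerically. The following small check (an illustration of ours, not part of the proof) confirms that $h_x$ vanishes at $x = x^*$, as claimed; the dimension and depth are arbitrary choices.

```python
import numpy as np

def g(theta):
    # Angle map (12): g(theta) = arccos(((pi - theta) cos(theta) + sin(theta)) / pi)
    return np.arccos(((np.pi - theta) * np.cos(theta) + np.sin(theta)) / np.pi)

def h(x, x_star, d):
    # The vector h_x defined above, with theta_0 = angle(x, x*), theta_i = g(theta_{i-1}).
    nx, ns = np.linalg.norm(x), np.linalg.norm(x_star)
    theta = [np.arccos(np.clip(x @ x_star / (nx * ns), -1.0, 1.0))]
    for _ in range(1, d):
        theta.append(g(theta[-1]))
    xi = np.prod([(np.pi - t) / np.pi for t in theta])
    zeta = sum(
        np.sin(theta[i]) / np.pi
        * np.prod([(np.pi - theta[j]) / np.pi for j in range(i + 1, d)])
        for i in range(d)
    )
    return -xi / 2 ** d * x_star + (x - zeta * ns / nx * x) / 2 ** d

x_star = np.array([1.0, 0.0])
print(np.linalg.norm(h(x_star, x_star, 4)))   # ~0: h_x vanishes at x = x*
```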
We can therefore assume throughout this proof that $x_i \notin B(0, K_0\|x^*\|)$, with $K_0 = \frac{1}{32\pi}$. We prove Theorem 5 by showing that if $\|h_x\|$ is sufficiently large, i.e., if the iterate $x_i$ is outside of the set
\[
S_\beta = \left\{ x \in \mathbb{R}^k \,\middle|\, \|h_x\| \leq \frac{1}{2^d}\beta\max(\|x\|, \|x^*\|) \right\}, \quad \text{with } \beta = 4Kd^3\sqrt{\epsilon} + 13\,\frac{\omega 2^{d/2}}{\|x^*\|}, \tag{19}
\]
then the algorithm makes progress, in the sense that $f(x_{i+1}) - f(x_i)$ is smaller than a certain negative value. The set $S_\beta$ is contained in two balls around $x^*$ and $-\rho_d x^*$, whose radii are controlled by $\beta$:

Lemma 12. For any $\beta \leq \frac{1}{64^2 d^{12}}$,
\[
S_\beta \subset B\big(x^*, 5000\, d^6\beta\|x^*\|_2\big) \cup B\big(-\rho_d x^*, 500\, d^{11}\sqrt{\beta}\,\|x^*\|_2\big). \tag{20}
\]
Here, $\rho_d > 0$ is defined in the proof and obeys $\rho_d \to 1$ as $d \to \infty$.

Note that by the assumption $\omega \leq K_1\frac{2^{-d/2}}{d^{16}}\|x^*\|$ and $Kd^{45}\sqrt{\epsilon} \leq 1$, our choice of $\beta$ in (19) obeys $\beta \leq \frac{1}{64^2 d^{12}}$ for sufficiently small $K_1, K$, and thus Lemma 12 yields
\[
S_\beta \subset B(x^*, r) \cup B\big(-\rho_d x^*, \sqrt{r\|x^*\|}\, d^8\big),
\]
where we define the radius $r = K_2 d^9\sqrt{\epsilon}\|x^*\| + K_3 d^6\omega 2^{d/2}$, and $K_2, K_3$ are numerical constants. Note that the radius $r$ is equal to the right hand side of the error bound (8) in our theorem.

In order to guarantee that the algorithm converges to the ball around $x^*$, and not to the one around $-\rho_d x^*$, we use the following lemma:

Lemma 13. Suppose that the WDC holds with $\epsilon < 1/(16\pi d^2)^2$. Moreover, suppose that $E_{\mathrm{noise}}$ holds, and that $\omega$ in the event $E_{\mathrm{noise}}$ obeys $\frac{\omega}{2^{-d/2}\|x^*\|_2} \leq K_9/d^2$, where $K_9 < 1$ is a universal constant. Then for any $\phi_d \in [\rho_d, 1]$, it holds that
\[
f(x) < f(y) \tag{21}
\]
for all $x \in B(\phi_d x^*, K_3 d^{-10}\|x^*\|)$ and $y \in B(-\phi_d x^*, K_3 d^{-10}\|x^*\|)$, where $K_3 < 1$ is a universal constant.

In order to apply Lemma 13, define for convenience the two sets
\[
S_\beta^+ := S_\beta \cap B(x^*, r), \quad \text{and} \quad S_\beta^- := S_\beta \cap B\big(-\rho_d x^*, \sqrt{r\|x^*\|}\, d^8\big).
\]
By the assumptions $Kd^{45}\sqrt{\epsilon} \leq 1$ and $\omega \leq K_1 d^{-16}2^{-d/2}\|x^*\|$, we have for sufficiently small $K_1, K$ that $S_\beta^+ \subseteq B(x^*, K_3 d^{-10}\|x^*\|)$ and $S_\beta^- \subseteq B(-\rho_d x^*, K_3 d^{-10}\|x^*\|)$. Thus, the assumptions of Lemma 13 are met, and the lemma implies that for any $x \in S_\beta^-$ and $y \in S_\beta^+$, it holds that $f(x) > f(y)$.

We now show that the algorithm converges to a point in $S_\beta^+$. This fact and the negation step in our algorithm (lines 3-5) establish that the algorithm converges to a point in $S_\beta^+$, provided we prove that the objective is nonincreasing with the iteration number, which will form the remainder of this proof.

Consider $i$ such that $x_i \notin S_\beta$. By the mean value theorem [Cla17, Theorem 8.13], there is a $t \in [0, 1]$ such that for $\hat{x}_i = x_i - t\alpha\tilde{v}_{x_i}$ there is a $v_{\hat{x}_i} \in \partial f(\hat{x}_i)$, where $\partial f$ is the generalized subdifferential of $f$, obeying
\[
f(x_i - \alpha\tilde{v}_{x_i}) - f(x_i) = \langle v_{\hat{x}_i}, -\alpha\tilde{v}_{x_i} \rangle = \langle \tilde{v}_{x_i}, -\alpha\tilde{v}_{x_i} \rangle + \langle v_{\hat{x}_i} - \tilde{v}_{x_i}, -\alpha\tilde{v}_{x_i} \rangle
\leq -\alpha\|\tilde{v}_{x_i}\|^2 + \alpha\|v_{\hat{x}_i} - \tilde{v}_{x_i}\|\,\|\tilde{v}_{x_i}\| = -\alpha\|\tilde{v}_{x_i}\|\big(\|\tilde{v}_{x_i}\| - \|v_{\hat{x}_i} - \tilde{v}_{x_i}\|\big). \tag{22}
\]
In the next subsection, we guarantee that for any $t \in [0, 1]$, $v_{\hat{x}_i}$ with $\hat{x}_i = x_i - t\alpha\tilde{v}_{x_i}$ is close to $\tilde{v}_{x_i}$:
\[
\|v_{\hat{x}_i} - \tilde{v}_{x_i}\| \leq \left(\frac{5}{6} + \alpha K_7\frac{d^2}{2^d}\right)\|\tilde{v}_{x_i}\|, \quad \text{for all } v_{\hat{x}_i} \in \partial f(\hat{x}_i). \tag{23}
\]
Applying (23) to (22) yields
\[
f(x_i - \alpha\tilde{v}_{x_i}) - f(x_i) \leq -\frac{1}{12}\alpha\|\tilde{v}_{x_i}\|^2,
\]
where we used that $\alpha K_7 d^2 2^{-d} \leq \frac{1}{12}$, by our assumption that the stepsize $\alpha$ is sufficiently small. Thus, the maximum number of iterations for which $x_i \notin S_\beta$ is $12 f(x^0)/(\alpha\min_i\|\tilde{v}_{x_i}\|^2)$.

We next lower-bound $\|\tilde{v}_{x_i}\|$. On $E_{\mathrm{noise}}$, for all $x \notin S_\beta$, with $\beta$ given by (19), we have
\[
\|\tilde{v}_x\| \geq \|h_x\| - \|h_x - \tilde{v}_x\| \geq 2^{-d}\max(\|x\|, \|x^*\|)\left(\beta - K_1 d^3\sqrt{\epsilon} - \frac{\omega 2^{d/2}}{\|x^*\|}\right)
\geq 2^{-d}\max(\|x\|, \|x^*\|)\left(3Kd^3\sqrt{\epsilon} + 12\frac{\omega 2^{d/2}}{\|x^*\|}\right) \geq 2^{-d}\|x^*\|\, 3Kd^3\sqrt{\epsilon}, \tag{24}
\]
where the second inequality follows from the definition of $S_\beta$ and Lemma 9, and the third inequality follows from our definition of $\beta$ in equation (19). Thus,
\[
f(x_i - \alpha\tilde{v}_{x_i}) - f(x_i) \leq -\alpha K_5 2^{-2d} d^6\epsilon\|x^*\|^2 \leq -2^{-d}d^4 K_6\epsilon\|x^*\|^2,
\]
where we used $\alpha = K_4\frac{2^d}{d^2}$. Hence, there can be at most
\[
\frac{f(x^0)\, 2^d}{K_6 d^4\epsilon\|x^*\|^2}
\]
iterations for which $x_i \notin S_\beta$. In order to conclude our proof, we remark that once $x_i$ is inside a ball of radius $r$ around $x^*$, the iterates do not leave a ball of radius $2r$ around $x^*$. To see this, note that by the bound on $\|v_x\|$ given in equation (18) and our choice of stepsize, $\alpha\|\tilde{v}_{x_i}\| \leq \frac{K}{d}\max(\|x_i\|, \|x^*\|)$. This concludes the proof of Theorem 5 A.

B.3 Proof of Theorem 5 B

Theorem 5 A establishes that after $N$ iterations the iterates $x_i$ are inside a ball of radius $2r$ around $x^*$. With the assumption $\epsilon \leq K_1/d^{90}$ for sufficiently small $K_1$ and the definition of $r$, this implies that the iterates lie in a ball around $x^*$ of radius at most $K_3 d^{-10}\|x^*\|$, where $K_3$ is defined in Lemma 13. In this proof of Theorem 5 B, we prove convergence within this ball. Specifically, we show that for all $i \geq N$, it holds that $x_i \in B(x^*, K_3 d^{-10}\|x^*\|)$, $\tilde{x}_i = x_i$, and
\[
\|x_{i+1} - x^*\| \leq b_2^{\,i+1-N}\|x_N - x^*\| + b_4 2^{d/2}\omega,
\]
where $b_2 = 1 - \frac{\alpha}{2^d}\frac{7}{8}$ and $b_4$ is a universal constant.

We need Lemma 14, which guarantees that the search directions of the subsequent iterates point towards $x^*$, up to an error controlled by the noise level $\omega$:

Lemma 14. Suppose the WDC holds with $200 d\sqrt{d\sqrt{\epsilon}} < 1$ and $x \in B(x^*, d\sqrt{\epsilon}\|x^*\|)$. Then for all $x \neq 0$ and for all $v_x \in \partial f(x)$,
\[
\left\| v_x - \frac{1}{2^d}(x - x^*) \right\| \leq \frac{1}{2^d}\frac{1}{8}\|x - x^*\| + \frac{1}{2^{d/2}}\omega.
\]

Suppose $\tilde{x}_i \in B(x^*, K_3 d^{-10}\|x^*\|)$. By the assumption $\epsilon \leq K_1/d^{90}$ for sufficiently small $K_1$, the assumptions of Lemma 14 are met.
Therefore,
\[
\|x_{i+1} - x^*\| = \|\tilde{x}_i - \alpha v_{\tilde{x}_i} - x^*\| = \left\| \tilde{x}_i - x^* - \frac{\alpha}{2^d}(\tilde{x}_i - x^*) - \alpha v_{\tilde{x}_i} + \frac{\alpha}{2^d}(\tilde{x}_i - x^*) \right\|
\]
\[
\leq \left(1 - \frac{\alpha}{2^d}\right)\|\tilde{x}_i - x^*\| + \alpha\left\| v_{\tilde{x}_i} - \frac{1}{2^d}(\tilde{x}_i - x^*) \right\|
\leq \left(1 - \frac{\alpha}{2^d}\right)\|\tilde{x}_i - x^*\| + \alpha\left( \frac{1}{8}\frac{1}{2^d}\|\tilde{x}_i - x^*\| + \frac{1}{2^{d/2}}\omega \right)
\]
\[
= \left(1 - \frac{\alpha}{2^d}\frac{7}{8}\right)\|\tilde{x}_i - x^*\| + \alpha\frac{1}{2^{d/2}}\omega, \tag{25}
\]
where the second inequality holds by Lemma 14. By the assumptions $\tilde{x}_i \in B(x^*, K_3 d^{-10}\|x^*\|)$ and $\omega \leq K_1\frac{\|x^*\|}{d^{16}2^{d/2}}$, and using (25), we have $x_{i+1} \in B(x^*, K_3 d^{-10}\|x^*\|)$. In addition, using Lemma 13 yields $\tilde{x}_{i+1} = x_{i+1}$. Repeating the above steps yields that $x_i \in B(x^*, K_3 d^{-10}\|x^*\|)$ and $\tilde{x}_i = x_i$ for all $i \geq N$. Using (25) and $\alpha = K_4\frac{2^d}{d^2}$, we have
\[
\|x_{i+1} - x^*\| \leq b_2\|x_i - x^*\| + b_3\frac{2^{d/2}}{d^2}\omega, \tag{26}
\]
where $b_2 = 1 - 7K_4/(8d^2)$ and $b_3$ is a universal constant. Repeatedly applying (26) yields
\[
\|x_{i+1} - x^*\| \leq b_2^{\,i+1-N}\|x_N - x^*\| + \big(b_2^{\,i-N} + b_2^{\,i-N-1} + \cdots + 1\big)\, b_3\frac{2^{d/2}}{d^2}\omega
\leq b_2^{\,i+1-N}\|x_N - x^*\| + \frac{b_3 2^{d/2}}{(1 - b_2)d^2}\omega \leq b_2^{\,i+1-N}\|x_N - x^*\| + b_4 2^{d/2}\omega,
\]
where the last inequality follows from the definition of $b_2$, and $b_4$ is a universal constant. This finishes the proof of (9). Inequality (10) follows from Lemma 21. This concludes the proof.

The remainder of the proof is devoted to proving the lemmas used in this section.

B.4 Proof of Equation (23)

Our proof relies on $h_x$ being Lipschitz, as formalized by the following lemma, which is proven in Section B.10:

Lemma 15. For any $x, y \notin B(0, K_0\|x^*\|)$, where $K_0$ and $K_4$ are numerical constants,
\[
\|h_x - h_y\| \leq K_4\frac{d^2}{2^d}\|x - y\|.
\]

By Lemma 15, for all $t \in [0, 1]$ and $i > N$ (recall that by Lemma 11, after at most $N$ steps, $x_i \notin B(0, K_0\|x^*\|)$):
\[
\|h_{\hat{x}_i} - h_{x_i}\| \leq K_4\frac{d^2}{2^d}\|\hat{x}_i - x_i\|, \tag{27}
\]
where $\hat{x}_i = x_i - t\alpha\tilde{v}_{x_i}$.
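The geometric-series step applied to the contraction (26) can be illustrated numerically with hypothetical constants (the values below are arbitrary, not those of the theorem): iterating $e_{i+1} \le b_2 e_i + c$ with $b_2 < 1$ drives the error down geometrically until it settles at the noise floor $c/(1 - b_2)$.

```python
# Iterate the contraction e_{i+1} <= b2 * e_i + c and compare the final error
# with the closed-form noise floor c / (1 - b2).  Constants are illustrative.
b2, c = 0.9, 1e-3
e = 1.0
for _ in range(200):
    e = b2 * e + c

floor = c / (1 - b2)
print(e, floor)   # after 200 steps the iterate has essentially reached the floor
```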
Thus, on $E_{\mathrm{noise}}$, for any $v_{\hat{x}_i} \in \partial f(\hat{x}_i)$, by Lemma 9,
\[
\|v_{\hat{x}_i} - \tilde{v}_{x_i}\| \leq \|v_{\hat{x}_i} - h_{\hat{x}_i}\| + \|h_{\hat{x}_i} - h_{x_i}\| + \|h_{x_i} - \tilde{v}_{x_i}\|
\]
\[
\leq K_1\frac{d^3\sqrt{\epsilon}}{2^d}\max(\|\hat{x}_i\|, \|x^*\|) + \frac{\omega}{2^{d/2}} + K_4\frac{d^2}{2^d}\|\hat{x}_i - x_i\| + K_1\frac{d^3\sqrt{\epsilon}}{2^d}\max(\|x_i\|, \|x^*\|) + \frac{\omega}{2^{d/2}}
\]
\[
\leq K_1\frac{d^3\sqrt{\epsilon}}{2^d}\max(\|x_i\| + \alpha\|\tilde{v}_{x_i}\|, \|x^*\|) + K_4\frac{d^2}{2^d}\alpha\|\tilde{v}_{x_i}\| + K_1\frac{d^3\sqrt{\epsilon}}{2^d}\max(\|x_i\|, \|x^*\|) + 2\frac{\omega}{2^{d/2}}
\]
\[
\leq K_1\frac{d^3\sqrt{\epsilon}}{2^d}\left(2 + \frac{\alpha dK}{2^d}\right)\max(\|x_i\|, \|x^*\|) + K_4\frac{d^2}{2^d}\alpha\|\tilde{v}_{x_i}\| + \frac{2K_9/d^2}{2^d}\|x^*\|, \tag{28}
\]
where the second inequality follows from Lemma 9 and Equation (27), and the fourth inequality follows from (18) and the assumption $\frac{\omega}{2^{-d/2}\|x^*\|_2} \leq K_9/d^2$. Combining (28) and (24), we get
\[
\|v_{\hat{x}_i} - \tilde{v}_{x_i}\| \leq \left(\frac{5}{6} + \alpha K_7\frac{d^2}{2^d}\right)\|\tilde{v}_{x_i}\|,
\]
with the appropriate constants chosen sufficiently small. This concludes the proof of Equation (23).

B.5 Proof of Lemma 11

First suppose that $x_i \in B(0, 2K_0\|x^*\|)$. We show that after a polynomial number of iterations $N$, we have $x_{i+N} \notin B(0, 2K_0\|x^*\|)$. Below, we prove that
\[
\langle x, v_x \rangle < 0 \quad \text{and} \quad \|v_x\| \geq \frac{1}{2^d 16\pi}\|x^*\| \quad \text{for all } x \in B(0, 2K_0\|x^*\|) \text{ and } v_x \in \partial f(x). \tag{29}
\]
It follows that for any $\tilde{x}_i \in B(0, 2K_0\|x^*\|)$, $\tilde{x}_i$ and the next iterate produced by the algorithm, $x_{i+1} = \tilde{x}_i - \alpha v_{\tilde{x}_i}$, form an obtuse triangle. As a consequence,
\[
\|\tilde{x}_{i+1}\|^2 = \|x_{i+1}\|^2 \geq \|\tilde{x}_i\|^2 + \alpha^2\|v_{\tilde{x}_i}\|^2 \geq \|\tilde{x}_i\|^2 + \alpha^2\frac{1}{(2^d 16\pi)^2}\|x^*\|^2,
\]
where the last inequality follows from (29). Thus, the norm of the iterates $\tilde{x}_i$ will increase until, after $\left(\frac{2K_0 2^d 16\pi}{\alpha}\right)^2$ iterations, we have $\tilde{x}_{i+N} \notin B(0, 2K_0\|x^*\|)$.
Now consider $\tilde{x}_i \notin B(0, 2K_0\|x^*\|)$, and note that
\[
\alpha\|v_{\tilde{x}_i}\| \leq \alpha\frac{dK}{2^d}\max(\|\tilde{x}_i\|, \|x^*\|) \leq \alpha\frac{16\pi K d}{2^d}\|\tilde{x}_i\| \leq \frac{1}{2}\|\tilde{x}_i\|,
\]
where the first inequality follows from (18), the second inequality from $\|\tilde{x}_i\| \geq 2K_0\|x^*\|$, and the last inequality from our assumption that the stepsize $\alpha$ is sufficiently small. Therefore, from $x_{i+1} = \tilde{x}_i - \alpha v_{\tilde{x}_i}$, we have that $t\tilde{x}_i + (1-t)x_{i+1} \notin B(0, K_0\|x^*\|)$ for all $t \in [0, 1]$, which completes the proof.

Proof of (29): It remains to prove (29). We start with proving $\langle x, \tilde{v}_x \rangle < 0$. For brevity of notation, let $\Lambda_z = \prod_{i=d}^{1}W_{i,+,z}$. We have
\[
x^T\tilde{v}_x = \left\langle \Lambda_x^T\Lambda_x x - \Lambda_x^T\Lambda_{x^*}x^* + \Lambda_x^T\eta,\ x \right\rangle
\leq \frac{13}{12}2^{-d}\|x\|^2 - \frac{1}{4\pi}\frac{1}{2^d}\|x\|\|x^*\| + \|x\|\frac{\omega}{2^{d/2}}
\]
\[
\leq \|x\|\left( \frac{13}{12}2^{-d}\|x\| + \frac{1/(16\pi)}{2^d}\|x^*\| - \frac{1}{4\pi}\frac{1}{2^d}\|x^*\| \right)
\leq \|x\|\frac{1}{2^d}\left( 2\|x\| - \frac{3}{16\pi}\|x^*\| \right).
\]
The first inequality follows from (14) and (15), and the second inequality follows from our assumption on $\omega$. Therefore, for any $x \in B(0, \frac{1}{16\pi}\|x^*\|)$,
\[
\langle x, \tilde{v}_x \rangle < -\frac{1}{16\pi 2^d}\|x\|\|x^*\| \leq 0,
\]
as desired. If $G$ is differentiable at $x$, then $v_x = \tilde{v}_x$ and $\langle x, v_x \rangle < 0$. If $G$ is not differentiable at $x$, then by equation (17) we have
\[
x^T v_x = x^T(c_1 v_1 + c_2 v_2 + \cdots + c_t v_t) \leq (c_1 + c_2 + \cdots + c_t)\|x\|\frac{1}{2^d}\left(2\|x\| - \frac{3}{16\pi}\|x^*\|\right)
= \|x\|\frac{1}{2^d}\left(2\|x\| - \frac{3}{16\pi}\|x^*\|\right) < -\frac{1}{16\pi 2^d}\|x\|\|x^*\| \leq 0, \tag{30}
\]
for all $v_x \in \partial f(x)$. Using (30) yields
\[
\|v_x\| = \max_{\|u\|=1}\langle v_x, u \rangle \geq \left\langle v_x, -\frac{x}{\|x\|} \right\rangle = \frac{-x^T v_x}{\|x\|} > \frac{1}{2^d 16\pi}\|x^*\|,
\]
which concludes the proof of (29).

B.6 Proof of Lemma 8

Let $\Lambda_x = \Pi_{i=d}^{1}W_{i,+,x}$. We have that
\[
\|\bar{q}_x\|_2 = \left\|\Lambda_x^t\eta\right\|_2 \leq \|\Lambda_x\|\,\|P_{\Lambda_x}\eta\|_2,
\]
where $P_{\Lambda_x}$ is the projector onto the range of $\Lambda_x$. As a consequence, $\|P_{\Lambda_x}\eta\|_2^2$ is a $\chi^2$-distributed random variable with $k$ degrees of freedom, scaled by $\sigma^2/n$.
A standard tail bound (see [Lug+13, p. 43]) yields that, for any $\beta \geq k$,
\[
P\!\left[ \|P_{\Lambda_x}\eta\|^2 \geq 4\beta\frac{\sigma^2}{n} \right] \leq 2e^{-\beta}.
\]
Next, we note that by applying Lemmas 13-14 from [HV18, Proof of Lem. 15],\footnote{The proof of that argument only uses the assumption of independence of subsets of rows of the weight matrices.} with probability one, the number of different matrices $\Lambda_x$ can be bounded as
\[
|\{\Lambda_x \mid x \neq 0\}| = \left|\left\{\Pi_{i=d}^{1}W_{i,+,x} \mid x \neq 0\right\}\right| \leq 10^{d^2}\big(n_1^d n_2^{d-1}\cdots n_d\big)^k \leq \big(n_1^d n_2^{d-1}\cdots n_d\big)^{2k},
\]
where the second inequality holds for $\log(10) \leq \frac{k}{4}\log(n_1)$. To see this, note that $\big(n_1^d n_2^{d-1}\cdots n_d\big)^k \geq 10^{d^2}$ is implied by
\[
k\big(d\log(n_1) + (d-1)\log(n_2) + \cdots + \log(n_d)\big) \geq k\frac{d^2}{4}\log(n_1) \geq d^2\log(10).
\]
Thus, by the union bound,
\[
P\!\left[ \|P_{\Lambda_x}\eta\|^2 \leq 16\frac{\sigma^2}{n}k\log\!\big(n_1^d n_2^{d-1}\cdots n_d\big), \text{ for all } x \right] \geq 1 - 2e^{-2k\log(n)},
\]
where $n = n_d$. Recall from (15) that $\|\Lambda_x\|^2 \leq \frac{13}{12}2^{-d}$. Combining this inequality with $\|\bar{q}_x\|_2 \leq \|\Lambda_x\|\,\|P_{\Lambda_x}\eta\|_2$ concludes the proof.

B.7 Proof of Lemma 12

We now show that $h_x$ is bounded away from zero outside of a neighborhood of $x^*$ and $-\rho_d x^*$. We prove Lemma 12 by establishing the following:

Lemma 16. Suppose $64 d^6\sqrt{\beta} \leq 1$. Define
\[
\rho_d := \sum_{i=0}^{d-1}\frac{\sin\check{\theta}_i}{\pi}\left( \prod_{j=i+1}^{d-1}\frac{\pi - \check{\theta}_j}{\pi} \right),
\]
where $\check{\theta}_0 = \pi$ and $\check{\theta}_i = g(\check{\theta}_{i-1})$. If $x \in S_\beta$, then we have that either
\[
|\theta_0| \leq 32 d^4\beta \quad \text{and} \quad \big|\|x\|_2 - \|x^*\|_2\big| \leq 132 d^6\beta\|x^*\|_2,
\]
or
\[
|\theta_0 - \pi| \leq 8\pi d^4\sqrt{\beta} \quad \text{and} \quad \big|\|x\|_2 - \|x^*\|_2\rho_d\big| \leq 200 d^7\sqrt{\beta}\,\|x^*\|_2.
\]
In particular, we have
\[
S_\beta \subset B\big(x^*, 5000 d^6\beta\|x^*\|_2\big) \cup B\big(-\rho_d x^*, 500 d^{11}\sqrt{\beta}\,\|x^*\|_2\big). \tag{31}
\]
Additionally, $\rho_d \to 1$ as $d \to \infty$.

Proof. Without loss of generality, let $\|x^*\| = 1$, $x^* = e_1$, and $x = r\cos\theta_0 \cdot e_1 + r\sin\theta_0 \cdot e_2$ for $\theta_0 \in [0, \pi]$. Let $x \in S_\beta$. First we introduce some notation for convenience.
Let
$$\xi = \prod_{i=0}^{d-1}\frac{\pi - \theta_i}{\pi}, \qquad \zeta = \sum_{i=0}^{d-1}\frac{\sin\theta_i}{\pi}\prod_{j=i+1}^{d-1}\frac{\pi - \theta_j}{\pi}, \qquad r = \|x\|_2, \qquad M = \max(r, 1).$$
Thus,
$$h_x = -\frac{1}{2^d}\xi\,\hat{x}_0 + \frac{1}{2^d}(r - \zeta)\hat{x}.$$
By inspecting the components of $h_x$, we have that $x \in S_\beta$ implies
$$|-\xi + \cos\theta_0\,(r - \zeta)| \le \beta M \quad (32)$$
$$|\sin\theta_0\,(r - \zeta)| \le \beta M \quad (33)$$
Now, we record several properties. We have:
$$\theta_i \in [0, \pi/2] \ \text{ for } i \ge 1, \qquad \theta_i \le \theta_{i-1} \ \text{ for } i \ge 1,$$
$$|\xi| \le 1 \quad (34)$$
$$|\zeta| \le \frac{d}{\pi}\sin\theta_0 \quad (35)$$
$$\check{\theta}_i \le \frac{3\pi}{i+3} \ \text{ for } i \ge 0 \quad (36)$$
$$\check{\theta}_i \ge \frac{\pi}{i+1} \ \text{ for } i \ge 0 \quad (37)$$
$$\xi = \prod_{i=0}^{d-1}\frac{\pi - \theta_i}{\pi} \ge \frac{\pi - \theta_0}{\pi}\,d^{-3} \quad (38)$$
$$\theta_0 = \pi + O_1(\delta) \ \Rightarrow\ \theta_i = \check{\theta}_i + O_1(i\delta) \quad (39)$$
$$\theta_0 = \pi + O_1(\delta) \ \Rightarrow\ |\xi| \le \frac{\delta}{\pi} \quad (40)$$
$$\theta_0 = \pi + O_1(\delta) \ \Rightarrow\ \zeta = \rho_d + O_1(3d^3\delta) \ \text{ if } \frac{d^2\delta}{\pi} \le 1 \quad (41)$$
We now establish (36). Observe $0 < g(\theta) \le \left(\frac{1}{3\pi} + \frac{1}{\theta}\right)^{-1} =: \tilde{g}(\theta)$ for $\theta \in (0, \pi]$. As $g$ and $\tilde{g}$ are monotonically increasing, we have $\check{\theta}_i = g^{\circ i}(\check{\theta}_0) = g^{\circ i}(\pi) \le \tilde{g}^{\circ i}(\pi) = \left(\frac{i}{3\pi} + \frac{1}{\pi}\right)^{-1} = \frac{3\pi}{i+3}$. Similarly, $g(\theta) \ge \left(\frac{1}{\pi} + \frac{1}{\theta}\right)^{-1}$ implies that $\check{\theta}_i \ge \frac{\pi}{i+1}$, establishing (37).

We now establish (38). Using (36) and $\theta_i \le \check{\theta}_i$, we have
$$\prod_{i=1}^{d-1}\left(1 - \frac{\theta_i}{\pi}\right) \ge \prod_{i=1}^{d-1}\left(1 - \frac{3}{i+3}\right) \ge d^{-3},$$
where the last inequality can be established by showing that the ratio of consecutive terms with respect to $d$ is greater for the product in the middle expression than for $d^{-3}$.

We establish (39) by using the fact that $|g'(\theta)| \le 1$ for all $\theta \in [0, \pi]$ and using the same logic as for [HV18, Eq. 17].

We now establish (41). As $\theta_0 = \pi + O_1(\delta)$, we have $\theta_i = \check{\theta}_i + O_1(i\delta)$. Thus, if $\frac{d^2\delta}{\pi} \le 1$,
$$\prod_{j=i+1}^{d-1}\frac{\pi - \theta_j}{\pi} = \prod_{j=i+1}^{d-1}\left(\frac{\pi - \check{\theta}_j}{\pi} + O_1\left(\frac{j\delta}{\pi}\right)\right) = \left(\prod_{j=i+1}^{d-1}\frac{\pi - \check{\theta}_j}{\pi}\right) + O_1(d^2\delta).$$
So
$$\zeta = \sum_{i=0}^{d-1}\left(\frac{\sin\check{\theta}_i}{\pi} + O_1\left(\frac{i\delta}{\pi}\right)\right)\left[\left(\prod_{j=i+1}^{d-1}\frac{\pi - \check{\theta}_j}{\pi}\right) + O_1(d^2\delta)\right] \quad (42)$$
$$= \rho_d + O_1\left(\frac{d^2\delta}{\pi} + \frac{d^3\delta}{\pi} + \frac{d^4\delta^2}{\pi}\right) \quad (43)$$
$$= \rho_d + O_1(3d^3\delta). \quad (44)$$
Thus (41) holds.
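The sandwich bounds (36) and (37) on the sequence $\check{\theta}_i$ can be spot-checked numerically. A minimal sketch, assuming the angle map $g(\theta) = \arccos\big(((\pi - \theta)\cos\theta + \sin\theta)/\pi\big)$ used in [HV18] (its exact form is defined earlier in the paper):

```python
import math

def g(theta):
    # assumed form of the angle recursion map from [HV18]
    return math.acos(((math.pi - theta) * math.cos(theta) + math.sin(theta)) / math.pi)

thetas = [math.pi]  # \check{theta}_0 = pi
for _ in range(80):
    thetas.append(g(thetas[-1]))

# check pi/(i+1) <= \check{theta}_i <= 3 pi/(i+3), with a tiny tolerance
# (equality holds at i = 0 and i = 1)
ok = all(
    math.pi / (i + 1) - 1e-9 <= t <= 3 * math.pi / (i + 3) + 1e-9
    for i, t in enumerate(thetas)
)
print(ok)
```

The check should report that both bounds hold along the whole computed trajectory.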
Next, we establish that $x \in S_\beta \Rightarrow r \le 4d$, and thus $M \le 4d$. Suppose $r > 1$. At least one of the following holds: $|\sin\theta_0| \ge 1/\sqrt{2}$ or $|\cos\theta_0| \ge 1/\sqrt{2}$. If $|\sin\theta_0| \ge 1/\sqrt{2}$, then (33) implies that $|r - \zeta| \le \sqrt{2}\beta r$. Using (35), we get $r \le \frac{d/\pi}{1 - \sqrt{2}\beta} \le d/2$ if $\beta < 1/4$. If $|\cos\theta_0| \ge 1/\sqrt{2}$, then (32) implies that $|r - \zeta| \le \sqrt{2}(\beta r + |\xi|)$. Using (34), (35), and $\beta < 1/4$, we get
$$r \le \frac{\sqrt{2}|\xi| + \zeta}{1 - \sqrt{2}\beta} \le \frac{d + \sqrt{2}}{1 - \sqrt{2}\beta} \le 4d.$$
Thus, we have $x \in S_\beta \Rightarrow r \le 4d \Rightarrow M \le 4d$.

Next, we establish that we only need to consider the small angle case ($\theta_0 \approx 0$) and the large angle case ($\theta_0 \approx \pi$), by considering the following three cases:

(Case I) $\sin\theta_0 \le 16d^4\beta$: We have $\theta_0 = O_1(32d^4\beta)$ or $\theta_0 = \pi + O_1(32d^4\beta)$, as $32d^4\beta < 1$.

(Case II) $|r - \zeta| < \sqrt{\beta}M$: Applying Case II to inequality (32) yields $|\xi| \le 2\sqrt{\beta}M$. Using (38), we get $\theta_0 = \pi + O_1(2\pi d^3\sqrt{\beta}M)$.

(Case III) $\sin\theta_0 > 16d^4\beta$ and $|r - \zeta| \ge \sqrt{\beta}M$: Finally, consider Case III. By (33), we have $|r - \zeta| \le \frac{\beta M}{\sin\theta_0}$. Using this inequality in (32), we have
$$|\xi| \le \beta M + \frac{\beta M}{\sin\theta_0} \le \frac{2\beta M}{\sin\theta_0} \le \frac{1}{8}d^{-4}M \le \frac{1}{2}d^{-3},$$
where the second-to-last inequality uses $\sin\theta_0 > 16d^4\beta$ and the last inequality uses $M \le 4d$. By (38), we have
$$\frac{\pi - \theta_0}{\pi}d^{-3} \le \xi \le \frac{1}{2}d^{-3},$$
which implies that $\theta_0 \ge \pi/2$. Now, as $|r - \zeta| \ge \sqrt{\beta}M$, by (33) we have $|\sin\theta_0| \le \sqrt{\beta}$. Hence $\theta_0 = \pi + O_1(2\sqrt{\beta})$, as $\theta_0 \ge \pi/2$ and $\beta < 1$.

At least one of Cases I, II, or III holds. Thus, we see that it suffices to consider the small angle case $\theta_0 = O_1(32d^4\beta)$ or the large angle case $\theta_0 = \pi + O_1(8\pi d^4\sqrt{\beta})$.

Small Angle Case. Assume $\theta_0 = O_1(\delta)$ with $\delta = 32d^4\beta$. As $\theta_i \le \theta_0 \le \delta$ for all $i$, we have
$$1 \ge \xi \ge \left(1 - \frac{\delta}{\pi}\right)^d = 1 + O_1\left(\frac{2\delta d}{\pi}\right),$$
provided $\delta d/\pi \le 1/2$ (which holds for our choice $\delta = 32d^4\beta$ by the assumption $64d^6\sqrt{\beta} \le 1$).
By (35), we also have $\zeta = O_1(\frac{d}{\pi}\delta)$. By (32), we have $|-\xi + \cos\theta_0(r - \zeta)| \le \beta M$. Thus, as $\cos\theta_0 = 1 + O_1(\theta_0^2/2) = 1 + O_1(\delta^2/2)$,
$$-\left(1 + O_1\left(\frac{2\delta d}{\pi}\right)\right) + \left(1 + O_1\left(\frac{2\delta d}{\pi}\right)\right)\left(r + O_1\left(\frac{\delta d}{\pi}\right)\right) = O_1(4d\beta),$$
and $r \le M \le 4d$ (shown above) provides
$$r - 1 = O_1\left(4d\beta + \frac{2\delta d}{\pi} + \frac{\delta d}{\pi} + \frac{2\delta d}{\pi}\cdot 4d + \frac{2\delta^2 d^2}{\pi^2}\right) \quad (45)$$
$$= O_1\left(4\beta d + 4\delta d^2\right). \quad (46)$$
By plugging in $\delta = 32d^4\beta$, we have that $r - 1 = O_1(132d^6\beta)$, where we have used that $\frac{32d^5\beta}{\pi} \le \frac{1}{2}$.

Large Angle Case. Assume $\theta_0 = \pi + O_1(\delta)$ where $\delta = 8\pi d^4\sqrt{\beta}$. By (40) and (41), we have $\xi = O_1(\delta/\pi)$, and we have $\zeta = \rho_d + O_1(3d^3\delta)$ if $8d^6\sqrt{\beta} \le 1$. By (32), we have $|-\xi + \cos\theta_0(r - \zeta)| \le \beta M$, so, as $\cos\theta_0 = -1 + O_1(\delta^2/2)$,
$$O_1(\delta/\pi) + \left(1 + O_1(\delta^2/2)\right)\left(r - \rho_d + O_1(3d^3\delta)\right) = O_1(\beta M),$$
and thus, using $r \le 4d$, $\rho_d \le d$, and $\delta = 8\pi d^4\sqrt{\beta} \le 1$,
$$r - \rho_d = O_1\left(\beta M + \frac{\delta}{\pi} + 3d^3\delta + \frac{5}{2}\delta^2 d + \frac{3}{2}d^3\delta^3\right) \quad (47)$$
$$= O_1\left(4\beta d + \delta\left(\frac{1}{\pi} + 3d^3 + \frac{5}{2}d + \frac{3}{2}d^3\right)\right) \quad (48)$$
$$= O_1\left(200d^7\sqrt{\beta}\right). \quad (49)$$
To conclude the proof of (31), we use the fact that
$$\|x - x^*\|_2 \le \big|\|x\|_2 - \|x^*\|_2\big| + \left(\|x^*\|_2 + \big|\|x\|_2 - \|x^*\|_2\big|\right)\theta_0.$$
This fact simply says that if a 2D point is known to have magnitude within $\Delta r$ of some $r$ and is known to be within angle $\Delta\theta$ from $0$, then its Euclidean distance to the point with polar coordinates $(r, 0)$ is no more than $\Delta r + (r + \Delta r)\Delta\theta$.

Finally, we establish that $\rho_d \to 1$ as $d \to \infty$. Note that $\rho_{d+1} = \left(1 - \frac{\check{\theta}_d}{\pi}\right)\rho_d + \frac{\sin\check{\theta}_d}{\pi}$ and $\rho_0 = 0$. It suffices to show $\tilde{\rho}_d \to 0$, where $\tilde{\rho}_d := 1 - \rho_d$. The following recurrence relation holds:
$$\tilde{\rho}_d = \left(1 - \frac{\check{\theta}_{d-1}}{\pi}\right)\tilde{\rho}_{d-1} + \frac{\check{\theta}_{d-1} - \sin\check{\theta}_{d-1}}{\pi}, \quad \text{with } \tilde{\rho}_0 = 1.$$
Using the recurrence formula [HV18, Eq.
(15)] and the fact that $\check{\theta}_0 = \pi$, we get that
$$\tilde{\rho}_d = \sum_{i=1}^{d}\frac{\check{\theta}_{i-1} - \sin\check{\theta}_{i-1}}{\pi}\prod_{j=i+1}^{d}\left(1 - \frac{\check{\theta}_{j-1}}{\pi}\right). \quad (50)$$
Using (37), we have that
$$\prod_{j=i+1}^{d}\left(1 - \frac{\check{\theta}_{j-1}}{\pi}\right) \le \prod_{j=i+1}^{d}\left(1 - \frac{1}{j}\right) \le \exp\left(-\sum_{j=i+1}^{d}\frac{1}{j}\right) \le \exp\left(-\int_{i+1}^{d+1}\frac{1}{s}\,ds\right) = \frac{i+1}{d+1}.$$
Using (36) and the fact that $\check{\theta}_{i-1} - \sin\check{\theta}_{i-1} \le \check{\theta}_{i-1}^3/6$, we have that
$$\tilde{\rho}_d \le \sum_{i=1}^{d}\frac{\check{\theta}_{i-1}^3}{6\pi}\cdot\frac{i+1}{d+1} \to 0 \quad \text{as } d \to \infty.$$

B.8 Proof of Lemma 13

Consider the function $f_\eta(x) = f_0(x) - \langle G(x) - G(x^*), \eta\rangle$, and note that $f(x) = f_\eta(x) + \|\eta\|^2$. Consider $x \in B(\phi_d x^*, \varphi\|x^*\|)$, for a $\varphi$ that will be specified later. Note that
$$|\langle G(x) - G(x^*), \eta\rangle| \le \left|\left\langle \prod_{i=d}^{1}W_{i,+,x}x,\ \eta\right\rangle\right| + \left|\left\langle \prod_{i=d}^{1}W_{i,+,x^*}x^*,\ \eta\right\rangle\right| = \left|\left\langle x, \left(\prod_{i=d}^{1}W_{i,+,x}\right)^T\eta\right\rangle\right| + \left|\left\langle x^*, \left(\prod_{i=d}^{1}W_{i,+,x^*}\right)^T\eta\right\rangle\right| \le (\|x\| + \|x^*\|)\frac{\omega}{2^{d/2}} \le (\varphi\|x^*\| + \|x^*\|)\frac{\omega}{2^{d/2}},$$
where the second inequality holds on the event $E_{\mathrm{noise}}$, by Lemma 8, and the last inequality holds by our assumption on $x$. Thus, for $x \in B(\phi_d x^*, \varphi\|x^*\|)$,
$$f_\eta(x) \le \mathbb{E}[f_0(x)] + |f_0(x) - \mathbb{E}[f_0(x)]| + |\langle G(x) - G(x^*), \eta\rangle|$$
$$\le \frac{1}{2^{d+1}}\left(\phi_d^2 - 2\phi_d + 10K_{3/2}d\varphi\right)\|x^*\|^2 + \frac{1}{2^{d+1}}\|x^*\|^2 + \frac{\epsilon(1 + 4\epsilon d)}{2^d}\|x\|^2 + \frac{\epsilon(1 + 4\epsilon d) + 48d^3\sqrt{\epsilon}}{2^{d+1}}\|x\|\|x^*\| + \frac{\epsilon(1 + 4\epsilon d)}{2^d}\|x^*\|^2 + (\varphi\|x^*\| + \|x^*\|)\frac{\omega}{2^{d/2}}$$
$$\le \frac{1}{2^{d+1}}\left(\phi_d^2 - 2\phi_d + 10K_{3/2}d\varphi\right)\|x^*\|^2 + \frac{1}{2^{d+1}}\|x^*\|^2 + \frac{\epsilon(1 + 4\epsilon d)}{2^d}(\phi_d + \varphi)^2\|x^*\|^2 + \frac{\epsilon(1 + 4\epsilon d) + 48d^3\sqrt{\epsilon}}{2^{d+1}}(\phi_d + \varphi)\|x^*\|^2 + \frac{\epsilon(1 + 4\epsilon d)}{2^d}\|x^*\|^2 + (\varphi\|x^*\| + \|x^*\|)\frac{\omega}{2^{d/2}}$$
$$\le \frac{\|x^*\|^2}{2^{d+1}}\left(1 + \phi_d^2 - 2\phi_d + 10K_{3/2}d\varphi + 68d^2\sqrt{\epsilon}\right) + (\varphi\|x^*\| + \|x^*\|)\frac{\omega}{2^{d/2}}, \quad (51)$$
where the last inequality follows from $\epsilon < \sqrt{\epsilon}$, $\rho_d \le 1$, $4\epsilon d < 1$, $\varphi < 1$, and assuming $\varphi = \epsilon$.
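As an aside, the convergence $\rho_d \to 1$ established in Section B.7 above, and the rate asserted by Lemma 17 below, can be checked by iterating the recurrence $\rho_{d+1} = (1 - \check{\theta}_d/\pi)\rho_d + \sin(\check{\theta}_d)/\pi$ numerically. This is only a sketch, again assuming the angle map $g(\theta) = \arccos\big(((\pi - \theta)\cos\theta + \sin\theta)/\pi\big)$ from [HV18]:

```python
import math

def g(theta):
    # assumed form of the angle recursion map from [HV18]
    return math.acos(((math.pi - theta) * math.cos(theta) + math.sin(theta)) / math.pi)

rho, theta = 0.0, math.pi  # rho_0 = 0, \check{theta}_0 = pi
rhos = [rho]
for _ in range(300):
    rho = (1 - theta / math.pi) * rho + math.sin(theta) / math.pi
    theta = g(theta)
    rhos.append(rho)

# Lemma 17 asserts 1/(K_1 (d+2)^2) <= 1 - rho_d <= 250/(d+1); we try K_1 = 1.
upper_ok = all(1 - rhos[d] <= 250 / (d + 1) for d in range(1, 301))
lower_ok = all((d + 2) ** 2 * (1 - rhos[d]) >= 1 for d in range(2, 301))
print(rhos[300], upper_ok, lower_ok)
```

Along the computed range, $\rho_d$ approaches 1 and $1 - \rho_d$ stays within the two-sided bound.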
Similarly, we have that for any $y \in B(-\phi_d x^*, \varphi\|x^*\|)$,
$$f_\eta(y) \ge \mathbb{E}[f(y)] - |f(y) - \mathbb{E}[f(y)]| - |\langle G(y) - G(x^*), \eta\rangle|$$
$$\ge \frac{1}{2^{d+1}}\left(\phi_d^2 - 2\phi_d\rho_d - 10d^3\varphi\right)\|x^*\|^2 + \frac{1}{2^{d+1}}\|x^*\|^2 - \left(\frac{\epsilon(1 + 4\epsilon d)}{2^d}\|y\|^2 + \frac{\epsilon(1 + 4\epsilon d) + 48d^3\sqrt{\epsilon}}{2^{d+1}}\|y\|\|x^*\| + \frac{\epsilon(1 + 4\epsilon d)}{2^d}\|x^*\|^2\right) - (\varphi\|x^*\| + \|x^*\|)\frac{\omega}{2^{d/2}}$$
$$\ge \frac{\|x^*\|^2}{2^{d+1}}\left(1 + \phi_d^2 - 2\phi_d\rho_d - 10d^3\varphi - 68d^2\sqrt{\epsilon}\right) - (\varphi\|x^*\| + \|x^*\|)\frac{\omega}{2^{d/2}}. \quad (52)$$
Using $\epsilon < \sqrt{\epsilon}$, $\rho_d \le 1$, $4\epsilon d < 1$, $\varphi < 1$, and assuming $\varphi = \epsilon$, the right side of (51) is smaller than the right side of (52) if
$$\varphi = \epsilon \le \left(\frac{\phi_d - \rho_d\phi_d - 13\|\eta\|_2}{\left(125 + 5K_{3/2}\right)d^3}\right)^2. \quad (53)$$
We can establish that:

Lemma 17. For all $d \ge 2$, it holds that $\frac{1}{K_1(d+2)^2} \le 1 - \rho_d \le \frac{250}{d+1}$.

Thus, it suffices to have $\varphi = \epsilon = \frac{K_3}{d^{10}}$ and $13\|\eta\|_2 \le \frac{K_9}{d^2} \le \frac{1}{2K_2K_1(d+2)^2}$, for appropriate universal constants $K_9$ and $K_3$.

B.9 Proof of Lemma 17

It holds that
$$\|x - y\| \ge 2\sin(\theta_{x,y}/2)\min(\|x\|, \|y\|) \quad \forall x, y \quad (54)$$
$$\sin(\theta/2) \ge \theta/4 \quad \forall\theta \in [0, \pi] \quad (55)$$
$$\frac{d}{d\theta}g(\theta) \in [0, 1] \quad \forall\theta \in [0, \pi] \quad (56)$$
$$\log(1 + x) \le x \quad \forall x \in [-0.5, 1] \quad (57)$$
$$\log(1 - x) \ge -2x \quad \forall x \in [0, 0.75] \quad (58)$$
where $\theta_{x,y} = \angle(x, y)$. We recall the results (36), (37), and (50) in [HV18]:
$$\check{\theta}_i \le \frac{3\pi}{i+3} \ \text{ and } \ \check{\theta}_i \ge \frac{\pi}{i+1} \quad \forall i \ge 0,$$
$$1 - \rho_d = \prod_{i=1}^{d-1}\left(1 - \frac{\check{\theta}_i}{\pi}\right) + \sum_{i=1}^{d-1}\frac{\check{\theta}_i - \sin\check{\theta}_i}{\pi}\prod_{j=i+1}^{d-1}\left(1 - \frac{\check{\theta}_j}{\pi}\right).$$
Therefore, we have for all $0 \le i \le d - 2$,
$$\prod_{j=i+1}^{d-1}\left(1 - \frac{\check{\theta}_j}{\pi}\right) \le \prod_{j=i+1}^{d-1}\left(1 - \frac{1}{j+1}\right) = e^{\sum_{j=i+1}^{d-1}\log\left(1 - \frac{1}{j+1}\right)} \le e^{-\sum_{j=i+1}^{d-1}\frac{1}{j+1}} \le e^{-\int_{i+1}^{d}\frac{1}{s+1}ds} = \frac{i+2}{d+1},$$
$$\prod_{j=i+1}^{d-1}\left(1 - \frac{\check{\theta}_j}{\pi}\right) \ge \prod_{j=i+1}^{d-1}\left(1 - \frac{3}{j+3}\right) = e^{\sum_{j=i+1}^{d-1}\log\left(1 - \frac{3}{j+3}\right)} \ge e^{-\sum_{j=i+1}^{d-1}\frac{6}{j+3}} \ge e^{-\int_{i}^{d-1}\frac{6}{s+3}ds} = \left(\frac{i+3}{d+2}\right)^6,$$
where the second and the fifth inequalities follow from (57) and (58), respectively. Since $\frac{\pi^3}{12(i+1)^3} \le \frac{\check{\theta}_i^3}{12} \le \check{\theta}_i - \sin\check{\theta}_i \le \frac{\check{\theta}_i^3}{6} \le \frac{27\pi^3}{6(i+3)^3}$, we have that for all $d \ge 3$,
$$1 - \rho_d \le \frac{2}{d+1} + \sum_{i=1}^{d-1}\frac{27\pi^3}{6(i+3)^3}\cdot\frac{i+2}{d+1} \le \frac{2}{d+1} + \frac{3\pi^5}{4(d+1)} \le \frac{250}{d+1}$$
and
$$1 - \rho_d \ge \left(\frac{3}{d+2}\right)^6 + \sum_{i=1}^{d-1}\frac{\pi^3}{12(i+3)^3}\left(\frac{i+3}{d+2}\right)^6 \ge \frac{1}{K_1(d+2)^2},$$
where we use $\sum_{i=4}^{\infty}\frac{1}{i^2} \le \frac{\pi^2}{6}$ and $\sum_{i=1}^{n}i^3 = O(n^4)$.

B.10 Proof of Lemma 15

To establish Lemma 15, we prove the following:

Lemma 18. For all $x, y \neq 0$,
$$\|h_x - h_y\| \le \left(\frac{1}{2^d} + \frac{6d + 4d^2}{\pi 2^d}\max\left(\frac{1}{\|x\|}, \frac{1}{\|y\|}\right)\|x^*\|\right)\|x - y\|.$$
Lemma 15 follows by noting that if $x, y \notin B(0, r\|x^*\|)$, then
$$\|h_x - h_y\| \le \left(\frac{1}{2^d} + \frac{6d + 4d^2}{\pi r 2^d}\right)\|x - y\|.$$
Proof of Lemma 18. For brevity of notation, let $\zeta_{j,z} = \prod_{i=j}^{d-1}\frac{\pi - \bar{\theta}_{i,z}}{\pi}$. Combining (54) and (55) gives
$$|\bar{\theta}_{0,x} - \bar{\theta}_{0,y}| \le 4\max\left(\frac{1}{\|x\|}, \frac{1}{\|y\|}\right)\|x - y\|.$$
Inequality (56) implies $|\bar{\theta}_{i,x} - \bar{\theta}_{i,y}| \le |\bar{\theta}_{j,x} - \bar{\theta}_{j,y}|$ for all $i \ge j$. It follows that
$$\|h_x - h_y\| \le \frac{1}{2^d}\|x - y\| + \frac{1}{2^d}\underbrace{|\zeta_{0,x} - \zeta_{0,y}|}_{T_1}\|x^*\| + \frac{1}{2^d}\underbrace{\left\|\sum_{i=0}^{d-1}\frac{\sin\bar{\theta}_{i,x}}{\pi}\zeta_{i+1,x}\hat{x} - \sum_{i=0}^{d-1}\frac{\sin\bar{\theta}_{i,y}}{\pi}\zeta_{i+1,y}\hat{y}\right\|}_{T_2}\|x^*\|. \quad (59)$$
By Lemma 19, we have
$$T_1 \le \frac{d}{\pi}|\bar{\theta}_{0,x} - \bar{\theta}_{0,y}| \le \frac{4d}{\pi}\max\left(\frac{1}{\|x\|}, \frac{1}{\|y\|}\right)\|x - y\|. \quad (60)$$
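The product-difference bound just invoked (Lemma 19 below) can be spot-checked numerically on random inputs satisfying its monotone-gap hypothesis. A minimal sketch:

```python
import math
import random

random.seed(1)
violations = 0
for _ in range(1000):
    k = random.randint(1, 12)
    # nonincreasing gaps |a_i - b_i|, as required by Lemma 19
    gaps = sorted((random.uniform(0.0, 1.0) for _ in range(k)), reverse=True)
    a = [random.uniform(gap, math.pi) for gap in gaps]  # a_i in [gap, pi]
    b = [ai - gap for ai, gap in zip(a, gaps)]          # b_i in [0, pi]
    pa = math.prod((math.pi - v) / math.pi for v in a)
    pb = math.prod((math.pi - v) / math.pi for v in b)
    if abs(pa - pb) > k / math.pi * abs(a[0] - b[0]) + 1e-12:
        violations += 1
print(violations)
```

No trial should violate the claimed bound $(k/\pi)|a_1 - b_1|$.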
Additionally, it holds that
$$T_2 = \left\|\sum_{i=0}^{d-1}\left(\frac{\sin\bar{\theta}_{i,x}}{\pi}\zeta_{i+1,x}\hat{x} - \frac{\sin\bar{\theta}_{i,x}}{\pi}\zeta_{i+1,x}\hat{y} + \frac{\sin\bar{\theta}_{i,x}}{\pi}\zeta_{i+1,x}\hat{y}\right) - \sum_{i=0}^{d-1}\frac{\sin\bar{\theta}_{i,y}}{\pi}\zeta_{i+1,y}\hat{y}\right\| \le \frac{d}{\pi}\|\hat{x} - \hat{y}\| + \underbrace{\left|\sum_{i=0}^{d-1}\frac{\sin\bar{\theta}_{i,x}}{\pi}\zeta_{i+1,x} - \sum_{i=0}^{d-1}\frac{\sin\bar{\theta}_{i,y}}{\pi}\zeta_{i+1,y}\right|}_{T_3}. \quad (61)$$
We have
$$T_3 \le \sum_{i=0}^{d-1}\left[\left|\frac{\sin\bar{\theta}_{i,x}}{\pi}\zeta_{i+1,x} - \frac{\sin\bar{\theta}_{i,x}}{\pi}\zeta_{i+1,y}\right| + \left|\frac{\sin\bar{\theta}_{i,x}}{\pi}\zeta_{i+1,y} - \frac{\sin\bar{\theta}_{i,y}}{\pi}\zeta_{i+1,y}\right|\right]$$
$$\le \sum_{i=0}^{d-1}\left[\frac{1}{\pi}\left(\frac{d - i - 1}{\pi}\left|\bar{\theta}_{i+1,x} - \bar{\theta}_{i+1,y}\right|\right) + \frac{1}{\pi}\left|\sin\bar{\theta}_{i,x} - \sin\bar{\theta}_{i,y}\right|\right] \le \frac{d^2}{\pi}|\bar{\theta}_{0,x} - \bar{\theta}_{0,y}| \le \frac{4d^2}{\pi}\max\left(\frac{1}{\|x\|}, \frac{1}{\|y\|}\right)\|x - y\|. \quad (62)$$
Using (54) and (55), and noting $\|\hat{x} - \hat{y}\| \le \theta_{x,y}$, yields
$$\|\hat{x} - \hat{y}\| \le \theta_{x,y} \le 2\max\left(\frac{1}{\|x\|}, \frac{1}{\|y\|}\right)\|x - y\|. \quad (63)$$
Finally, combining (59), (60), (61), (62), and (63) yields the result.

Lemma 19. Suppose $a_i, b_i \in [0, \pi]$ for $i = 1, \ldots, k$, and $|a_i - b_i| \le |a_j - b_j|$ for all $i \ge j$. Then it holds that
$$\left|\prod_{i=1}^{k}\frac{\pi - a_i}{\pi} - \prod_{i=1}^{k}\frac{\pi - b_i}{\pi}\right| \le \frac{k}{\pi}|a_1 - b_1|.$$
Proof. We prove the claim by induction. It is easy to verify that the inequality holds if $k = 1$. Suppose the inequality holds with $k = t - 1$. Then
$$\left|\prod_{i=1}^{t}\frac{\pi - a_i}{\pi} - \prod_{i=1}^{t}\frac{\pi - b_i}{\pi}\right| \le \left|\prod_{i=1}^{t}\frac{\pi - a_i}{\pi} - \frac{\pi - a_t}{\pi}\prod_{i=1}^{t-1}\frac{\pi - b_i}{\pi}\right| + \left|\frac{\pi - a_t}{\pi}\prod_{i=1}^{t-1}\frac{\pi - b_i}{\pi} - \prod_{i=1}^{t}\frac{\pi - b_i}{\pi}\right| \le \frac{t-1}{\pi}|a_1 - b_1| + \frac{1}{\pi}|a_t - b_t| \le \frac{t}{\pi}|a_1 - b_1|.$$

B.11 Proof of Lemma 14

We first need Lemmas 20, 21, and 22.

Lemma 20. Suppose $W \in \mathbb{R}^{n\times k}$ satisfies the WDC with constant $\epsilon$. Then for any $x, y \in \mathbb{R}^k$, it holds that
$$\|W_{+,x}x - W_{+,y}y\| \le \left(\sqrt{\frac{1}{2} + \epsilon} + \sqrt{2(2\epsilon + \theta)}\right)\|x - y\|,$$
where $\theta = \angle(x, y)$.

Proof.
We have
$$\|W_{+,x}x - W_{+,y}y\| \le \|W_{+,x}x - W_{+,x}y\| + \|W_{+,x}y - W_{+,y}y\| = \|W_{+,x}(x - y)\| + \|(W_{+,x} - W_{+,y})y\| \le \|W_{+,x}\|\|x - y\| + \|(W_{+,x} - W_{+,y})y\|. \quad (64)$$
By the WDC assumption, we have
$$\left\|W_{+,x}^T(W_{+,x} - W_{+,y})\right\| \le \left\|W_{+,x}^TW_{+,x} - I/2\right\| + \left\|W_{+,x}^TW_{+,y} - Q_{x,y}\right\| + \left\|Q_{x,y} - I/2\right\| \le 2\epsilon + \theta. \quad (65)$$
We also have
$$\|(W_{+,x} - W_{+,y})y\|^2 = \sum_{i=1}^{n}\left(1_{w_i\cdot x > 0} - 1_{w_i\cdot y > 0}\right)^2(w_i\cdot y)^2 \le \sum_{i=1}^{n}\left(1_{w_i\cdot x > 0} - 1_{w_i\cdot y > 0}\right)^2\left((w_i\cdot x)^2 + (w_i\cdot y)^2 - 2(w_i\cdot x)(w_i\cdot y)\right)$$
$$= \sum_{i=1}^{n}\left(1_{w_i\cdot x > 0} - 1_{w_i\cdot y > 0}\right)^2\left(w_i\cdot(x - y)\right)^2 = \sum_{i=1}^{n}1_{w_i\cdot x > 0}1_{w_i\cdot y \le 0}\left(w_i\cdot(x - y)\right)^2 + \sum_{i=1}^{n}1_{w_i\cdot x \le 0}1_{w_i\cdot y > 0}\left(w_i\cdot(x - y)\right)^2$$
$$= (x - y)^TW_{+,x}^T(W_{+,x} - W_{+,y})(x - y) + (x - y)^TW_{+,y}^T(W_{+,y} - W_{+,x})(x - y) \le 2(2\epsilon + \theta)\|x - y\|^2, \quad \text{(by (65))} \quad (66)$$
where the first inequality uses that $-2(w_i\cdot x)(w_i\cdot y) \ge 0$ on the rows where the indicators disagree. Combining (64), (66), and $\|W_{+,x}\|^2 \le \frac{1}{2} + \epsilon$, given in [HV18, (10)], yields the result.

Lemma 21. Suppose $x \in B(x^*, d\sqrt{\epsilon}\|x^*\|)$, and the WDC holds with $\epsilon < \frac{1}{200^4 d^6}$. Then it holds that
$$\left\|\prod_{i=j}^{1}W_{i,+,x}x - \prod_{i=j}^{1}W_{i,+,x^*}x^*\right\| \le \frac{1.2}{2^{j/2}}\|x - x^*\|.$$
Proof. In this proof, we denote $\theta_{i,x,x^*}$ and $\bar{\theta}_{i,x,x^*}$ by $\theta_i$ and $\bar{\theta}_i$, respectively. Since $x \in B(x^*, d\sqrt{\epsilon}\|x^*\|)$, we have
$$\bar{\theta}_i \le \bar{\theta}_0 \le 2d\sqrt{\epsilon}. \quad (67)$$
By [HV18, (14)], we also have $|\theta_i - \bar{\theta}_i| \le 4i\sqrt{\epsilon} \le 4d\sqrt{\epsilon}$. It follows that
$$2\sqrt{\theta_i + 2\epsilon} \le 2\sqrt{\bar{\theta}_i + 4d\sqrt{\epsilon} + 2\epsilon} \le 2\sqrt{2d\sqrt{\epsilon} + 4d\sqrt{\epsilon} + 2\epsilon} \le 2\sqrt{8d\sqrt{\epsilon}} \le \frac{1}{30d}, \quad (68)$$
where the last inequality is by the assumption on $\epsilon$. Note that $\sqrt{1 + 2\epsilon} \le 1 + \epsilon \le 1 + \sqrt{d\sqrt{\epsilon}}$. We have
$$\prod_{i=d-1}^{0}\left(\sqrt{1 + 2\epsilon} + 2\sqrt{\theta_i + 2\epsilon}\right) \le \left(1 + 7\sqrt{d\sqrt{\epsilon}}\right)^d \le 1 + 14d\sqrt{d\sqrt{\epsilon}} \le \frac{107}{100} < 1.2,$$
where the second inequality follows from $(1 + x)^d \le 1 + 2dx$ for $0 < xd < 1$.
Combining the above inequality with Lemma 20 yields
$$\left\|\prod_{i=j}^{1}W_{i,+,x}x - \prod_{i=j}^{1}W_{i,+,x^*}x^*\right\| \le \prod_{i=j-1}^{0}\left(\sqrt{\frac{1}{2} + \epsilon} + \sqrt{2}\sqrt{\theta_i + 2\epsilon}\right)\|x - x^*\| \le \frac{1.2}{2^{j/2}}\|x - x^*\|.$$

Lemma 22. Suppose $x \in B(x^*, d\sqrt{\epsilon}\|x^*\|)$, and the WDC holds with $\epsilon < \frac{1}{200^4 d^6}$. Then it holds that
$$\left(\prod_{i=d}^{1}W_{i,+,x}\right)^T\left[\left(\prod_{i=d}^{1}W_{i,+,x}\right)x - \left(\prod_{i=d}^{1}W_{i,+,x^*}\right)x^*\right] = \frac{1}{2^d}(x - x^*) + \frac{1}{2^d}\cdot\frac{1}{16}\|x - x^*\|\,O_1(1).$$
Proof. For brevity of notation, let $\Lambda_{j,k,z} = \prod_{i=j}^{k}W_{i,+,z}$. We have
$$\Lambda_{d,1,x}^T\left(\Lambda_{d,1,x}x - \Lambda_{d,1,x^*}x^*\right) = \Lambda_{d,1,x}^T\left[\Lambda_{d,1,x}x - \sum_{j=1}^{d}\Lambda_{d,j,x}\Lambda_{j-1,1,x^*}x^* + \sum_{j=1}^{d}\Lambda_{d,j,x}\Lambda_{j-1,1,x^*}x^* - \Lambda_{d,1,x^*}x^*\right]$$
$$= \underbrace{\Lambda_{d,1,x}^T\Lambda_{d,1,x}(x - x^*)}_{T_1} + \underbrace{\Lambda_{d,1,x}^T\sum_{j=1}^{d}\Lambda_{d,j+1,x}\left(W_{j,+,x} - W_{j,+,x^*}\right)\Lambda_{j-1,1,x^*}x^*}_{T_2}. \quad (69)$$
For $T_1$, we have
$$T_1 = \frac{1}{2^d}(x - x^*) + \frac{4d}{2^d}\|x - x^*\|\,O_1(\epsilon). \quad \text{[HV18, (10)]} \quad (70)$$
For $T_2$, we have
$$T_2 = O_1(1)\sum_{j=1}^{d}\left(\frac{1}{2^{d - j/2}} + \frac{(4d - 2j)\epsilon}{2^{d - j/2}}\right)\left\|\left(W_{j,+,x} - W_{j,+,x^*}\right)\Lambda_{j-1,1,x^*}x^*\right\|$$
$$= O_1(1)\sum_{j=1}^{d}\left(\frac{1}{2^{d - j/2}} + \frac{(4d - 2j)\epsilon}{2^{d - j/2}}\right)\left\|\Lambda_{j-1,1,x}x - \Lambda_{j-1,1,x^*}x^*\right\|\sqrt{2\left(\theta_{j-1,x,x^*} + 2\epsilon\right)}$$
$$= O_1(1)\sum_{j=1}^{d}\left(\frac{1}{2^{d - j/2}} + \frac{(4d - 2j)\epsilon}{2^{d - j/2}}\right)\frac{1.2}{2^{j/2}}\|x - x^*\|\cdot\frac{1}{30\sqrt{2}\,d} = \frac{1}{16}\cdot\frac{1}{2^d}\|x - x^*\|\,O_1(1), \quad (71)$$
where the first equality is by [HV18, (10)], the second equality is by (66), and the third equality is by Lemma 21 and (68). The result follows from (69), (70), and (71).

Now we are ready to prove Lemma 14. For brevity of notation, let $\Lambda_{j,z} = \prod_{i=j}^{1}W_{i,+,z}$. Using Lemma 22 yields
$$\left\|\bar{v}_x - \frac{1}{2^d}(x - x^*)\right\| \le \frac{1}{2^d}\cdot\frac{1}{16}\|x - x^*\|.$$
It follows that
$$\left\|\tilde{v}_x - \frac{1}{2^d}(x - x^*)\right\| = \left\|\bar{v}_x + \bar{q}_x - \frac{1}{2^d}(x - x^*)\right\| \le \frac{1}{2^d}\cdot\frac{1}{16}\|x - x^*\| + \frac{\omega}{2^{d/2}}.$$
For any $x \neq 0$ and for any $v \in \partial f(x)$, by (17), there exist
$c_1, c_2, \ldots, c_t \ge 0$ such that $c_1 + c_2 + \cdots + c_t = 1$ and $v = c_1v_1 + c_2v_2 + \cdots + c_tv_t$. It follows that
$$\left\|v - \frac{1}{2^d}(x - x^*)\right\| \le \sum_{j=1}^{t}c_j\left\|v_j - \frac{1}{2^d}(x - x^*)\right\| \le \frac{1}{2^d}\cdot\frac{1}{16}\|x - x^*\| + \frac{\omega}{2^{d/2}}.$$
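Finally, the key step in (66) — on rows where the ReLU activation patterns of $x$ and $y$ disagree, $(w_i\cdot y)^2 \le (w_i\cdot(x - y))^2$, so the mismatch energy is controlled by the two quadratic forms — can be verified numerically. This sketch uses random Gaussian weights as a stand-in for matrices satisfying the WDC:

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 40, 6
violations = 0
for _ in range(200):
    W = rng.standard_normal((n, k)) / np.sqrt(n)
    x, y = rng.standard_normal(k), rng.standard_normal(k)
    Wx = (W @ x > 0)[:, None] * W  # W_{+,x}: rows with w_i . x <= 0 zeroed out
    Wy = (W @ y > 0)[:, None] * W
    lhs = np.sum(((Wx - Wy) @ y) ** 2)
    d = x - y
    # sum of mismatched-row energies (w_i . (x - y))^2, written as quadratic forms
    rhs = d @ Wx.T @ ((Wx - Wy) @ d) + d @ Wy.T @ ((Wy - Wx) @ d)
    if lhs > rhs + 1e-9:
        violations += 1
print(violations)
```

The inequality is deterministic given the activation patterns, so no trial should violate it.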
