Accepted as a workshop contribution at ICLR 2015

ON THE STABILITY OF DEEP NETWORKS

Raja Giryes & Guillermo Sapiro
Duke University
{raja.giryes, guillermo.sapiro}@duke.edu

Alex M. Bronstein
Tel Aviv University
bron@eng.tau.ac.il

ABSTRACT

In this work we study the properties of deep neural networks (DNN) with random weights. We formally prove that these networks perform a distance-preserving embedding of the data. Based on this we then draw conclusions on the size of the training data and the networks' structure. A longer version of this paper with more results and details can be found in (Giryes et al., 2015). In particular, we formally prove in (Giryes et al., 2015) that DNN with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data.

1 INTRODUCTION

Deep neural nets (DNN) have led to a revolution in the areas of machine learning, audio analysis, and computer vision. Many state-of-the-art results have been achieved using these architectures. In this work we study the properties of these architectures with random weights. We prove that DNN preserve the distances in the data along their layers and that this property allows stably recovering the original data from the features calculated by the network. Our results provide insights into the outstanding empirically observed performance of DNN and the size of the training data.

Our motivation for studying networks with random weights is threefold. First, one of the differences between the networks used two decades ago and state-of-the-art training strategies is the use of random initialization of the weights. Second, a series of works (Pinto et al., 2009; Saxe et al., 2011; Cox & Pinto, 2011) empirically showed successful DNN learning techniques based on randomization.
Third, recent works that studied the optimization aspect of training deep networks have also done so via randomization (Saxe et al., 2014; Dauphin et al., 2014; Choromanska et al., 2015).

Bruna et al. (2013) show that the pooling stage in DNN causes a shift-invariance property. Bruna et al. (2014) interpret this step as the removal of phase from a complex signal and show how the signal may be recovered after a pooling stage using phase retrieval methods. In this short note, and for presentation purposes, we do not consider the previously studied pooling step, assuming the data to be properly aligned. We focus on the role of the layers consisting of a linear operation followed by an element-wise non-linear activation function.

2 STABLE EMBEDDING OF A SINGLE LAYER

We assume the input data to belong to a manifold K with Gaussian mean width

    ω(K) := E[ sup_{x,y ∈ K} ⟨g, x − y⟩ ],    (1)

where the expectation is taken over a vector g with i.i.d. normal entries. In Section 3 we will illustrate this concept and exemplify the results with Gaussian mixture models (GMM).

We say that f : R → R is a semi-truncated linear function if it is linear on some (possibly semi-infinite) interval and constant outside of it, with f(0) = 0, 0 < f(x) ≤ x for all x > 0, and 0 ≥ f(x) ≥ x for all x < 0. The popular rectified linear unit (ReLU), f(x) = max(0, x), is an example of such a function, while the sigmoid functions satisfy this property approximately. The following theorem shows that each standard DNN layer performs a stable embedding of the data in the Gromov-Hausdorff sense.

Theorem 1. Let M be the linear operator applied at the i-th layer, f the non-linear activation function, and K ⊂ S^{n−1} the manifold of the input data for the i-th layer.
If √m·M ∈ R^{m×n} is a random matrix with i.i.d. normally distributed entries, where m = O(ω(K)²) is the output dimension, and f is a semi-truncated linear function, then with high probability

    ‖x − y‖₂ ≃ d(f(Mx), f(My)),  ∀ x, y ∈ K,    (2)

where d(·,·) is a variant of the Hamming distance that treats the positive values in the vectors as ones. This result implies that the metric of the input data is preserved. The proof follows from (Plan & Vershynin, 2014) and Klartag & Mendelson (2005).

Mahendran & Vedaldi (2014) demonstrate that it is possible to recover the input of DNN from their output. The next result provides a theoretical justification for their observation by showing that it is possible to recover the input of each layer from its output:

Theorem 2. Under the assumptions of Theorem 1 there exists a program A such that

    ‖x − A(f(Mx))‖₂ ≤ ε,    (3)

where ε = O(ω(K)/√m). The proof follows from Plan & Vershynin (2014).

3 STABLE EMBEDDING OF THE ENTIRE NETWORK

In order to show that the entire network produces a stable embedding of its input, we need to show that the Gaussian mean width does not grow significantly as the data propagate through the layers of the network. Instead of bounding the variation of the Gaussian mean width throughout the network, we bound the change in the covering number N(K, ε), i.e., the smallest number of ℓ₂-balls of radius ε that cover K. Having the bound on the covering number, we use Dudley's inequality (Ledoux & Talagrand, 1991),

    ω(K) ≤ C ∫₀^∞ √(log N(K, ε)) dε,

to bound the variation of the Gaussian mean width, where C is a constant.

Theorem 3. Under the assumptions of Theorem 1,

    N(f(MK), ε) ≤ N(K, ε / (1 + ω(K)/√m)).    (4)

Proof: We now present a sketch of the proof, deferring the full proof, which also treats the Gaussian mean width directly, to a longer version of the paper.
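The single-layer embedding of Theorem 1 is easy to probe numerically. The sketch below is a Monte Carlo illustration, not the formal proof; the dimensions n, m, the eight sample points, and the 0.05 tolerance are arbitrary choices. It draws a random Gaussian layer, applies a ReLU, and compares the Hamming-type distance d(·,·) of equation (2) (positive entries treated as ones) with the angular distance between unit-norm inputs; for Gaussian weights the expected Hamming distance equals the angle divided by π, a metric equivalent to ‖x − y‖₂ on the sphere (Plan & Vershynin, 2014).

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 4000  # input and output dimensions (illustrative choices)

# Random Gaussian layer followed by a ReLU (a semi-truncated linear f).
M = rng.standard_normal((m, n))
relu = lambda z: np.maximum(z, 0.0)

# Random unit-norm inputs, i.e., points on the sphere S^{n-1}.
X = rng.standard_normal((8, n))
X /= np.linalg.norm(X, axis=1, keepdims=True)

for i in range(len(X)):
    for j in range(i + 1, len(X)):
        x, y = X[i], X[j]
        # Variant Hamming distance: positive entries count as ones.
        bx, by = relu(M @ x) > 0, relu(M @ y) > 0
        d_hamming = np.mean(bx != by)
        # For Gaussian M, E[d_hamming] = angle(x, y) / pi.
        theta = np.arccos(np.clip(x @ y, -1.0, 1.0))
        assert abs(d_hamming - theta / np.pi) < 0.05
```

Increasing m tightens the concentration of d_hamming around its mean, in line with the requirement m = O(ω(K)²) in the theorem.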
It is not hard to see that since a non-linear activation function shrinks the data, it cannot increase the size of the covering; therefore we focus on the linear part. Following (Klartag & Mendelson, 2005, Theorem 1.4), we have that the distances in MK are the same as those in K up to a factor of 1 + ω(K)/√m. This is sufficient to complete the proof.

We demonstrate the implication of the above theorem for a GMM, i.e., K consisting of L Gaussians of dimension k in the ℓ₂-ball. For this model N(K, ε) = L(1 + 2/ε)^k for ε < 1, and 1 otherwise (see Mendelson et al. (2008)). Therefore we have that ω(K) ≤ C′√(k + log L) and that at each layer the Gaussian mean width grows at most by a factor of order 1 + √(k + log L)/√m. Similar results can be shown for other models such as unions of subspaces and low-dimensional manifolds.

4 HOW MANY MEASUREMENTS ARE NEEDED TO TRAIN THE NETWORK

An important question in deep learning is the number of labeled samples needed at training. Using Sudakov minoration (Ledoux & Talagrand, 1991), one may get an upper bound on the size of an ε-net in K. We have demonstrated that networks with random Gaussian weights realize a stable embedding; consequently, if a network is trained using the screening technique of selecting the best among many networks generated with random weights, as suggested in Pinto et al. (2009); Saxe et al. (2011); Cox & Pinto (2011), then the number of data points needed in order to guarantee that the network represents all the data is O(exp(ω(K)²/ε²)). Since ω(K)² is a proxy for the data dimension (see Plan & Vershynin (2014)), we conclude that the number of training points grows exponentially with the intrinsic dimension of the data.
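The sample-complexity estimate above can be made concrete numerically. The following sketch is a Monte Carlo illustration with all absolute constants set to 1 and arbitrary choices of n, k, and ε; it takes K to be a k-dimensional unit ball embedded in R^n, for which the supremum in equation (1) evaluates in closed form to 2‖g restricted to the first k coordinates‖₂, estimates ω(K), and plugs it into the exp(ω(K)²/ε²) training-set count.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, trials = 100, 10, 4000

# Monte Carlo estimate of omega(K) for K a k-dimensional unit ball
# embedded in R^n: sup_{x,y in K} <g, x - y> = 2 * ||g_{1..k}||_2.
g = rng.standard_normal((trials, n))
omega = np.mean(2 * np.linalg.norm(g[:, :k], axis=1))

# Theory: omega(K) is of order sqrt(k); a GMM with L components
# would only raise this to order sqrt(k + log L).
assert abs(omega - 2 * np.sqrt(k)) < 0.3

# Training-set size needed for the embedding to represent all the
# data, with the hidden constant set to 1 for illustration.
eps = 2.0
samples = np.exp(omega**2 / eps**2)
```

The exponential dependence on ω(K)² is the point of the section: doubling the intrinsic dimension k roughly squares the required number of training points.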
5 DISCUSSION AND CONCLUSION

We have shown that DNN with random Gaussian weights perform a distance-preserving embedding of the data. This result provides a relationship between the complexity of the input data and the size of the required training set. In addition, it draws a connection between the dimension of the features produced by the network, which still keep the metric information of the original manifold, and the complexity of the data.

Though we have focused here on the case of DNN with linear filters with random Gaussian entries, it is possible to extend our analysis to distributions such as sub-Gaussian, and to random convolutional filters, using proof techniques from (Haupt et al., 2010; Saligrama, 2012; Rauhut et al., 2012; Ai et al., 2014). This, and the extension to learned DNN, will be presented in an extended version of this note.

Acknowledgments: This work is supported by NSF, DoD and ERC StG 335491.

REFERENCES

Ai, A., Lapanowski, A., Plan, Y., and Vershynin, R. One-bit compressed sensing with non-Gaussian measurements. To appear in Linear Algebra and its Applications, 2014.

Bruna, J., LeCun, Y., and Szlam, A. Learning stable group invariant representations with convolutional networks. In ICLR Workshop, Jan. 2013.

Bruna, J., Szlam, A., and LeCun, Y. Signal recovery from lp pooling representations. In Int. Conf. on Machine Learning (ICML), 2014.

Choromanska, A., Henaff, M. B., Mathieu, M., Arous, G. Ben, and LeCun, Y. The loss surfaces of multilayer networks. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.

Cox, D. and Pinto, N. Beyond simple features: A large-scale feature search approach to unconstrained face recognition. In IEEE International Conference on Automatic Face Gesture Recognition and Workshops (FG), pp. 8–15, March 2011.
Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems (NIPS), 2014.

Giryes, R., Sapiro, G., and Bronstein, A. M. Deep neural networks with random Gaussian weights: A universal classification strategy? ArXiv, 2015. URL http://arxiv.org/abs/1504.08291.

Haupt, J., Bajwa, W. U., Raz, G., and Nowak, R. Toeplitz compressed sensing matrices with applications to sparse channel estimation. IEEE Trans. Inf. Theory, 56(11):5862–5875, Nov. 2010.

Klartag, B. and Mendelson, S. Empirical processes and random projections. Journal of Functional Analysis, 225(1):229–245, Aug. 2005.

Ledoux, M. and Talagrand, M. Probability in Banach Spaces. Springer-Verlag, 1991.

Mahendran, A. and Vedaldi, A. Understanding deep image representations by inverting them. ArXiv, 2014. URL http://arxiv.org/abs/1412.0035.

Mendelson, S., Pajor, A., and Tomczak-Jaegermann, N. Uniform uncertainty principle for Bernoulli and sub-Gaussian ensembles. Constructive Approximation, 28:277–289, 2008.

Pinto, N., Doukhan, D., DiCarlo, J. J., and Cox, D. D. A high-throughput screening approach to discovering good forms of biologically inspired visual representation. PLoS Comput Biol, 5(11):e1000579, 11 2009.

Plan, Y. and Vershynin, R. Dimension reduction by random hyperplane tessellations. Discrete and Computational Geometry, 51(2):438–461, 2014.

Rauhut, H., Romberg, J., and Tropp, J. A. Restricted isometries for partial random circulant matrices. Appl. Comput. Harmon. Anal., 32(2):242–254, Mar. 2012.

Saligrama, V. Aperiodic sequences with uniformly decaying correlations with applications to compressed sensing and system identification. IEEE Trans. Inf. Theory, 58(9):6023–6036, Sept. 2012.

Saxe, A., Koh, P. W., Chen, Z., Bhand, M., Suresh, B., and Ng, A. Y. On random weights and unsupervised feature learning. In Int. Conf. on Machine Learning (ICML), pp. 1089–1096, 2011.

Saxe, A., McClelland, J., and Ganguli, S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations (ICLR), 2014.