Associated Learning: Decomposing End-to-end Backpropagation based on Auto-encoders and Target Propagation


Authors: Yu-Wei Kao, Hung-Hsuan Chen

Department of Computer Science and Information Engineering, National Central University

Keywords: backpropagation, pipelined training, parallel training, backward locking, associated learning

Abstract

Backpropagation (BP) is the cornerstone of today's deep learning algorithms, but it is inefficient partially because of backward locking, which means updating the weights of one layer locks the weight updates in the other layers. Consequently, it is challenging to apply parallel computing or a pipeline structure to update the weights in different layers simultaneously. In this paper, we introduce a novel learning structure called associated learning (AL), which modularizes the network into smaller components, each of which has a local objective. Because the objectives are mutually independent, AL can learn the parameters in different layers independently and simultaneously, so it is feasible to apply a pipeline structure to improve the training throughput. Specifically, this pipeline structure improves the complexity of the training time from O(nℓ), which is the time complexity when using BP and stochastic gradient descent (SGD) for training, to O(n + ℓ), where n is the number of training instances and ℓ is the number of hidden layers. Surprisingly, even though most of the parameters in AL do not directly interact with the target variable, training deep models by this method yields accuracies comparable to those from models trained using typical BP methods, in which all parameters are used to predict the target variable. Consequently, because of the scalability and the predictive power demonstrated in the experiments, AL deserves further study to determine the better hyperparameter settings, such as activation function selection, learning rate scheduling, and weight initialization, to accumulate experience, as we have done over the years with the typical BP method. Additionally, perhaps our design can also inspire new network designs for deep learning. Our implementation is available at https://github.com/SamYWK/Associated_Learning.

(Note: if you are looking for the preprint of the paper published in MIT Neural Computation 33(1) 2021, please see version 3 on arXiv (https://arxiv.org/abs/1906.05560v3). The version you are currently reading includes a few more references.)

1 Introduction

Deep neural networks are usually trained using backpropagation (BP) (Rumelhart et al., 1986), which, although common, increases the training difficulty for several reasons, among which backward locking highly limits the training speed. Essentially, the end-to-end training method propagates the error-correcting signals layer by layer; consequently, it cannot update the network parameters of the different layers in parallel. This backward locking problem is discussed in (Jaderberg et al., 2016). Backward locking becomes a severe performance bottleneck when the network has many layers. Beyond these computational weaknesses, BP-based learning seems biologically implausible. For example, it is unlikely that all the weights would be adjusted sequentially and in small increments based on a single objective (Crick, 1989).
Additionally, some components essential for BP to work correctly have not been observed in the cortex (Balduzzi et al., 2015). Therefore, many works have proposed methods that more closely resemble the operations of biological neurons (Lillicrap et al., 2016; Nøkland, 2016; Bartunov et al., 2018; Nøkland and Eidnes, 2019). However, empirical studies show that the predictions of these methods are still unsatisfactory compared to those using BP (Bartunov et al., 2018).

In this paper, we propose associated learning (AL), a method that can be used to replace end-to-end BP when training a deep neural network. AL decomposes the network into small components such that each component has a local objective function independent of the local objective functions of the other components. Consequently, the parameters in different components can be updated simultaneously, meaning that we can leverage parallel computing or pipelining to improve the training throughput. We conducted experiments on different datasets to show that AL gives test accuracies comparable to those obtained by end-to-end BP training, even though most components in AL do not directly receive the residual signal from the output layer.

The remainder of this paper is organized as follows. In Section 2, we review the related works regarding the computational issues of training deep neural networks. Section 3 gives a toy example to compare end-to-end BP with our proposed AL method. Section 4 explains the details of AL. We conducted extensive experiments to compare AL and BP-based end-to-end learning using different types of neural networks and different datasets, and the results are shown in Section 5. Finally, we discuss the discoveries and suggest future work in Section 6.

2 Related Work

BP (Rumelhart et al., 1986) is an essential algorithm for training deep neural networks and is the foundation of the success of many models in recent decades (Hochreiter and Schmidhuber, 1997; LeCun et al., 1998; He et al., 2016). However, because of "backward locking" (i.e., the weights must be updated layer by layer), training a deep neural network can be extremely inefficient (Jaderberg et al., 2016). Additionally, empirical evidence shows that BP is biologically implausible (Crick, 1989; Balduzzi et al., 2015; Bengio et al., 2015). Thus, many studies have suggested replacing BP with a more biologically plausible method or with a gradient-free method (Taylor et al., 2016; Ororbia and Mali, 2019; Ororbia et al., 2018) in the hope of decreasing the computational time and memory consumption and better resembling biological neural networks (Bengio et al., 2015; Huo et al., 2018a,b).

To address the backward locking problem, the authors of (Jaderberg et al., 2016) proposed using a synthetic gradient, which is an estimation of the real gradient generated by a separate neural network for each layer. By adopting the synthetic gradient as the actual gradient, the parameters of every layer can be updated simultaneously and independently. This approach eliminates the backward locking problem. However, the experimental results have shown that this approach tends to result in underfitting, probably because the gradients are difficult to predict.

It is also possible to eliminate backward locking by computing local errors for the different components of a network. In (Belilovsky et al., 2018), the authors showed that using an auxiliary classifier for each layer can yield good results. However, this paper added one layer to the network at a time, so it was challenging for the network to learn the parameters of different layers in parallel. In (Mostafa et al., 2018), every layer in a deep neural network is trained by a local classifier. However, experimental results have shown that this type of model is not comparable with BP. The authors of (Belilovsky et al., 2019) and the authors of (Nøkland and Eidnes, 2019) also proposed to update parameters based on (or partially based on) local errors. These models indeed allow the simultaneous updating of parameters of different layers, and experimental results showed that these techniques improved testing accuracy. However, these designs require each local component to receive signals directly from the target variable for loss computation. Biologically, it is unlikely that neurons far away from the target would be able to access the target signal directly. Therefore, even though these methods do not require global BP, they may still be biologically implausible.

Feedback alignment (Lillicrap et al., 2016) suggests propagating error signals in a similar manner as BP, but the error signals are propagated with fixed random weights in every layer. Later, the authors of (Nøkland, 2016) suggested delivering error signals directly from the output layer using fixed weights. The result is that the gradients are propagated by weights, while the signals remain local to each layer. The problem with this approach is similar to the issue discussed in the preceding paragraph: biologically, distant neurons are unlikely to be able to obtain signals directly from the target variable.

Another biologically motivated algorithm is target propagation (Bengio, 2014; Lee et al., 2015; Bartunov et al., 2018). Rather than computing the gradient for every layer, target propagation computes the target that each layer should learn. This approach relies on an autoencoder (Baldi, 2012) to calculate the inverse mapping of the forward pass and then passes the ground-truth information to every layer. Each training step includes two losses that must be minimized for each layer: the loss of the inverse mapping and the loss between activations and targets. This learning method alleviates the need for symmetric weights and is both biologically plausible and more robust than BP when applied to stochastic networks. Nonetheless, the targets are still generated layer by layer.

Overviews of the biologically plausible (or at least partially plausible) methods are presented in (Bengio et al., 2015; Bartunov et al., 2018). Although most of these methods perform worse than conventional BP, optimization beyond BP is still an important research area, mainly for computational efficiency and biological compatibility reasons.

Most studies on parallelizing deep learning distribute different data instances into different computing units. Each of these computing units computes the gradient based on the allocated instances, and the final gradient is determined by an aggregation of the gradients computed by all the computing units (Shallue et al., 2018; Zinkevich et al., 2010).
Although this indeed increases the training throughput via parallelization, it differs from our approach because our method parallelizes the computation in different layers of a deep network. Our AL technique and the technique of parallelizing data instances can complement each other and further improve the throughput given enough computational resources. A recent work, GPipe, utilizes pipeline training to improve the training throughput (Huang et al., 2019). However, all the parameters in GPipe are still influenced in a layerwise fashion. Our method is different because the parameters in the different layers are independent.

Our work is highly motivated by target propagation, but we create intermediate mappings instead of directly transforming features into targets. As a result, the local signals in each layer are independent of the signals in the other layers, and most of these signals are not obtained directly from the output label.

3 A Toy Example to Compare the Training Throughput of End-to-end Backpropagation and Associated Learning

Figure 1 gives a typical structure of a deep neural network with 6 hidden layers. The input feature vector x goes through a series of transformations (x → s_1 → s_2 → s_3 → t_3 → t_2 → t_1 → y, through the functions f_1, f_2, f_3, b_3, h_3, h_2, h_1 in turn) to approximate the corresponding output y.

Figure 1: An example of a deep neural network with 6 hidden layers. We denote each forward function (f_1, f_2, f_3, b_3, h_3, h_2, h_1) and the output of each function (s_1, s_2, s_3, t_3, t_2, t_1, y) by different symbols for ease of later explanation. Let θ(f) denote the parameters of a function f; then, the backward path requires computing the local gradient ∂f/∂θ(f) for each function f.

We denote the functions (f_1, f_2, f_3, b_3, h_3, h_2, h_1) and the outputs of these functions (s_1, s_2, s_3, t_3, t_2, t_1, y) by different symbols for ease of the later explanation of AL. If stochastic gradient descent (SGD) and BP are applied to search for the proper parameter values, we need to compute the local gradient ∂f/∂θ(f) as the backward function for every forward function f (whose parameters are denoted by θ(f)). As a result, each training epoch requires a time complexity of O(n × ((ℓ + 1) + (ℓ + 1))) ≈ O(nℓ), in which n is the number of training instances and ℓ is the number of hidden layers (i.e., ℓ = 6 in our example). Since both the forward pass and the backward pass require ℓ + 1 transformations, we have two ℓ + 1 terms. Consequently, the training time increases linearly with the number of hidden layers ℓ.

Figure 2 shows a simplified structure of the AL technique, which "folds" the network and decomposes it into 3 components such that each component has a local objective function that is independent of the local objectives in the other components. As a result, for i ≠ j, we may update the parameters in component i (θ_i^(f), θ_i^(h)) and the parameters in component j (θ_j^(f), θ_j^(h)) independently and simultaneously, since the loss of component i is determined only by the parameters of component i (θ_i^(f), θ_i^(h)) and is independent of the loss of component j, which is determined by the parameters of component j (θ_j^(f), θ_j^(h)).

Figure 2: A simplified structure of the AL technique, which decomposes 6 hidden layers into 3 components such that each component has a local objective function that is independent of the objective functions of the other components. Consequently, we may update the parameters in component i (θ_i^(f), θ_i^(h)) and the parameters in component j (θ_j^(f), θ_j^(h)) simultaneously for i ≠ j.
!"# $"%& %'( ) !"# $"%& %'( * % " & " !"# $"%& %'( + % # & # ' # Figure 2: A simplified structure of the AL t echnique, which decompos es 6 hidd en layers into 3 compo n ent s such that each compon ent has a local objectiv e function th at is independent of the objective functions of the other components. Consequentl y , we may update the parameters in compo n ent i ( θ ( f ) i , θ ( h ) i ) and the parameters in com ponent j ( θ ( f ) j , θ ( h ) j ) simult aneously for i 6 = j . the parameters of component i ( ( θ ( f ) i , θ ( h ) i ) ) determin e the loss o f component i , which is i n dependent of the lo s s of com ponent j , which is d etermined by the parameters of component j ( ( θ ( f ) j , θ ( h ) j ) ). T able 1 gives an example o f applying pipelini ng for p arameter updating to imp rove the training throughput u sing AL. Let T ask i be the task of updating t he parameters in Component i . At th e 1 st time unit, the network performs T ask 1 (updating θ ( f ) 1 and θ ( h ) 1 ) based on the 1 st training instance (or the i nstances in the 1 st mini-batch). At the 2 nd time unit, t he network performs T ask 1 (updating θ ( f ) 1 and θ ( h ) 1 ) b ased on the 2 nd training instance (or the training instances in the 2n d mini-batch) and performs T ask 2 (updating θ ( f ) 2 and θ ( h ) 2 ) based on t h e 1 st instance (or the 1 st mini-batch). As shown in the table, s tarting from t he 3 rd time unit, th e parameters in all the different components can be updated simultaneously . Consequently , the first instance requires O ( ℓ/ 2) uni t s of 9 T able 1: An example of si multaneously up d ating the parameters b y pipelining T ime unit 1 2 3 4 5 6 7 ... 1 st mini-batch T ask 1 T ask 2 T ask 3 2 nd mini-batch T ask 1 T ask 2 T ask 3 3 rd mini-batch T ask 1 T ask 2 T ask 3 4 th mini-batch T ask 1 T ask 2 T ask 3 5 th mini-batch T ask 1 T ask 2 T ask 3 ... computational ti m e, and, because of the pipeline, each of the following n − 1 inst ances requires only O (1) units of computatio nal tim e. Therefore, the t ime complexity of each training epoch becomes O ( ℓ/ 2 + ( n − 1)) ≈ O ( n + ℓ ) . Compared to end-to-end BP during which the time complexity grows linearly t o the num ber of hidden layers, the t ime com plexity of the propo s ed AL with pipelining technique grows to only a constant time as the number of hidden l ayers increases. 4 Methodology A ty p ical deep network training process requires features to pass through m ultiple non- linear layers, all owing the outpu t to approach the ground-trut h labels. T h erefore, there is only one objectiv e. W ith AL, howev er , we modul arize the training path by splitting it i n to sm aller compo nents and assign independent local objectives to each sm all com - ponent. Consequently , the AL t echnique divides the original long gradient flow i nto many ind epend ent short gradi ent flows and ef fecti vely elim inates the backward locki n g 10 ! ! ! " " ! " " " # # ! # " # # $ ! $ " % ! % " % ! & % " & ! " # !" # Figure 3: Adding a “bridge” to the structure. The bridge includes nonl inear layers to transform s i into s ′ i such that s ′ i ≈ t i . The black arrows indicate the forward path. problem. In this section, we introduce three types of fun cti ons (asso ci at ed function, encoding and decoding functions, and bridge function ) that t ogether compose the AL network. 4.1 Associated Function and Associated Loss Referring to Figure 2, l et x and y be the input features and the output tar get, respec- tiv ely , of a trainin g sample. 
4 Methodology

A typical deep network training process requires features to pass through multiple nonlinear layers, allowing the output to approach the ground-truth labels. Therefore, there is only one objective. With AL, however, we modularize the training path by splitting it into smaller components and assign an independent local objective to each small component. Consequently, the AL technique divides the original long gradient flow into many independent short gradient flows and effectively eliminates the backward locking problem. In this section, we introduce three types of functions (the associated function, the encoding and decoding functions, and the bridge function) that together compose the AL network.

Figure 3: Adding a "bridge" to the structure. The bridge includes nonlinear layers to transform s_i into s'_i such that s'_i ≈ t_i. The black arrows indicate the forward path.

4.1 Associated Function and Associated Loss

Referring to Figure 2, let x and y be the input features and the output target, respectively, of a training sample. We split a network with ℓ hidden layers into ℓ/2 components (assuming ℓ is an even number). The details of each component are illustrated in Figure 4. Each component i consists of two local forward functions, f_i and g_i (f_i and g_i will be called the associated function and the encoding function, respectively, for better differentiation; we will further explain the encoding function in Section 4.3), and a local objective function independent of the objective functions of the other components. A local associated function can be a simple single-layer perceptron, a convolutional layer, or another function. We compute s_i using Equation 1:

s_i = f_i(s_{i-1}), i = 1, ..., ℓ/2,  (1)

where s_0 equals x.

We define the associated loss function for each pair (s_i, t_i) by Equation 2. This concept is similar to target propagation (Bengio, 2014; Lee et al., 2015; Bartunov et al., 2018), in which the goal is to minimize the distance between s_i and t_i for every component i:

L_i(s_i, t_i) = ||s_i − t_i||^2, i = 1, ..., ℓ/2.  (2)

The optimizer in the i-th component updates the parameters in f_i to reduce the associated loss function (Equation 2). Referring to Figure 2, Equation 2 attempts to make s_i ≈ t_i for all i.

This design may look strange for several reasons. First, if we can obtain an f_1 such that s_1 ≈ t_1, all the other f_i s (i > 1) seem unnecessary. Second, since s_1 and t_1 are far apart, fitting these two terms seems counterintuitive. For the first question, one can regard each component as one layer in a deep neural network. As we add more components, the corresponding s_i and t_i may become closer. For the second question, indeed, it seems more reasonable to fit the values of neighboring cells. However, our design breaks the gradient flow among different components so that it is possible to perform a parallel parameter update for each component.

4.2 Bridge Function

Our early experiments showed that s_i has difficulty fitting the corresponding target t_i, especially for a convolutional neural network (CNN) and its variants. Thus, we insert nonlinear layers to improve the fitting between s_i and t_i. As shown in Figure 3, we create a bridge function, b_i, to perform a nonlinear transform on s_i such that b_i(s_i) = s'_i ≈ t_i. As a result, the associated loss is reformulated to the following equation, which replaces the original Equation 2:

L_i(s_i, t_i) = ||b_i(s_i) − t_i||^2, i = 1, ..., ℓ/2,  (3)

where the function b_i(.) serves as the bridge.

Although this approach greatly increases the number of parameters and the nonlinear layers to decrease the forward loss, except for the last bridge, these parameters do not affect the inference function, as we will explain in Section 4.5, so the bridges only slightly increase the hypothesis space. For a fair comparison, we also increase the number of parameters when the models are trained by BP so that the models trained by AL and the models trained by BP have the same number of parameters. The details will be explained in Section 5.
4.3 Encoding/Decoding Functions and Autoencoder Loss

Referring to Figure 2, in addition to the parameters of the f_i s and b_i s, we also need to obtain the parameters in the h_i s to have the mapping t_i → t_{i-1} at the inference phase. This mapping is achieved by the following two functions, which together can be regarded as an autoencoder:

t_i = g_i(t_{i-1}), i = 1, ..., ℓ/2,  (4)

t'_{i-1} = h_i(t_i), i = 1, ..., ℓ/2.  (5)

Referring to Figure 4, the above two equations form an autoencoder because we want t_{i-1} → t_i → t'_{i-1} ≈ t_{i-1} (via g_i and then h_i), so g_i and h_i are called the encoding function and the decoding function, respectively. Note that t_0 equals the ground truth y, so the targets t_i are derived from the label. The autoencoder loss L'_i for layer i is defined by Equation 6:

L'_i(h_i(g_i(t_{i-1})), t_{i-1}) = ||t'_{i-1} − t_{i-1}||^2, i = 1, ..., ℓ/2.  (6)

4.4 Putting Everything Together

Figure 4 shows the entire training process of AL based on our earlier example. We group each component by a dashed line. The parameters in each component are independent of the parameters in the other components. For each component i, the local objective function is defined by Equation 7:

local-obj_i = MSE_i^(1) + MSE_i^(2) = ||b_i(s_i) − t_i||^2 + ||t'_{i-1} − t_{i-1}||^2,  (7)

where ||b_i(s_i) − t_i||^2 is the associated loss shown in Equation 3 and ||t'_{i-1} − t_{i-1}||^2 is the autoencoder loss defined in Equation 6.

As shown in Figure 4, the associated loss in each component creates the gradient flow ℓ_i^(1), which guides the updates of the parameters of f_i and b_i. The autoencoder loss in each component leads to the second gradient flow ℓ_i^(2), which determines the updates of g_i and h_i. A gradient flow travels only within a component, so the parameters in different components can be updated simultaneously. Additionally, since each gradient flow is short, the vanishing gradient and exploding gradient problems are less likely to occur. Since each component incrementally refines the associated loss of the component immediately below it, the input x approaches the output y.
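As a concrete illustration of Equations 1 through 7, the following minimal PyTorch sketch (our own rendition; the authors' released implementation is in TensorFlow, and all class and variable names here are ours) wires up one AL component. The detach() calls cut the gradient flows at the component boundary, so the associated loss updates only f_i and b_i while the autoencoder loss updates only g_i and h_i, matching the two local gradient flows ℓ_i^(1) and ℓ_i^(2) described above.

```python
# Illustrative PyTorch sketch of one AL component (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ALComponent(nn.Module):
    """f_i/b_i are trained by the associated loss (Eq. 3);
    g_i/h_i are trained by the autoencoder loss (Eq. 6)."""
    def __init__(self, s_dim, t_dim, bridge_dim):
        super().__init__()
        self.f = nn.Linear(s_dim, s_dim)                # associated function f_i
        self.b = nn.Sequential(                          # bridge b_i
            nn.Linear(s_dim, bridge_dim), nn.Sigmoid(),
            nn.Linear(bridge_dim, t_dim), nn.Sigmoid())
        self.g = nn.Linear(t_dim, t_dim)                 # encoding function g_i
        self.h = nn.Linear(t_dim, t_dim)                 # decoding function h_i

    def forward(self, s_prev, t_prev):
        # detach(): no gradient may cross the component boundary (Figure 4).
        s_prev, t_prev = s_prev.detach(), t_prev.detach()
        s = F.elu(self.f(s_prev))                        # Eq. 1: s_i = f_i(s_{i-1})
        t = torch.sigmoid(self.g(t_prev))                # Eq. 4: t_i = g_i(t_{i-1})
        t_rec = torch.sigmoid(self.h(t))                 # Eq. 5: t'_{i-1} = h_i(t_i)
        assoc = F.mse_loss(self.b(s), t.detach())        # Eq. 3; t is detached so this
                                                         # flow updates only f_i and b_i
        auto = F.mse_loss(t_rec, t_prev)                 # Eq. 6; updates only g_i, h_i
        return s, t, assoc + auto                        # Eq. 7: local-obj_i
```

In a full model, component i would consume the pair (s_{i-1}, t_{i-1}) produced by component i − 1 (with s_0 = x and t_0 = y), and each component can own a separate optimizer; because no gradient crosses a component boundary, the optimizers can step independently, which is what makes the pipelined schedule of Table 1 possible.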
4.5 Inference Function, Effective Parameters, and Hypothesis Space

We can categorize the abovementioned parameters into two types: effective parameters and affiliated parameters. The affiliated parameters help the model determine the values of the effective parameters, which in turn determine the hypothesis space of the final inference function. Therefore, while increasing the number of affiliated parameters may help to obtain better values for the effective parameters, it will not increase the hypothesis space of the prediction model. Such a setting may be relevant to the overparameterization technique, which introduces redundant parameters to accelerate the training speed (Allen-Zhu et al., 2018; Arora et al., 2018; Chen, 2017; Chen and Chen, 2020), but here, the purpose is to obtain better values of the effective parameters rather than faster convergence.

Specifically, in the training phase, we search for the parameters of the f_i s and b_i s that minimize the associated loss and search for the parameters of the g_i s and h_i s to minimize the autoencoder loss. However, in the inference phase, we make predictions based only on Equation 1, Equation 5, and b_{ℓ/2}(s_{ℓ/2}). Therefore, the effective parameters include only the parameters in the f_i s, the h_i s (i = 1, ..., ℓ/2), and b_{ℓ/2} (i.e., the last bridge). The parameters in the other functions (i.e., the g_i s (i = 1, ..., ℓ/2) and the b_j s (j = 1, ..., ℓ/2 − 1)) are affiliated parameters; they do not increase the expressiveness of the model but only help determine the values of the effective parameters.

The prediction process can be represented by Figure 2. Equation 8 shows the prediction function:

ŷ = (h_1 ∘ h_2 ∘ ... ∘ h_{ℓ/2} ∘ b_{ℓ/2} ∘ f_{ℓ/2} ∘ ... ∘ f_2 ∘ f_1)(x),  (8)

where ∘ denotes the function composition operation and ℓ = 6 in the example illustrated by Figure 2 and Figure 4. Only the parameters involved in Equation 8 are the effective parameters that determine the hypothesis space.
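The inference pass of Equation 8 can be sketched as follows (again illustrative, building on the hypothetical ALComponent above): climb through the f_i s, apply the last bridge, and descend through the h_i s. The g_i s and all bridges except the last never appear here, which is why their parameters are only affiliated.

```python
# Inference per Equation 8 (illustrative; reuses the ALComponent sketch above).
import torch
import torch.nn.functional as F

def al_predict(components, x):
    s = x
    for comp in components:                   # upward pass: s_i = f_i(s_{i-1})
        s = F.elu(comp.f(s))
    t = components[-1].b(s)                   # last bridge: b_{l/2}(s_{l/2})
    for comp in reversed(components):         # downward pass: t_{i-1} = h_i(t_i)
        t = torch.sigmoid(comp.h(t))
    return t                                  # y-hat, an estimate of the target y
```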
5 Experiments

In this section, we introduce the experimental settings and implementation details and show the results of the performance comparisons between BP and AL.

5.1 Experimental Settings

We conducted experiments by applying AL and BP to different deep neural network structures (a multilayer perceptron (MLP), a vanilla CNN, a Visual Geometry Group (VGG) network (Simonyan and Zisserman, 2015), a 20-layer residual neural network (ResNet-20), and a 32-layer ResNet (ResNet-32) (He et al., 2016)) and different datasets (the Modified National Institute of Standards and Technology (MNIST) (LeCun et al., 1998), the 10-class Canadian Institute for Advanced Research (CIFAR-10), and the 100-class CIFAR (CIFAR-100) (Krizhevsky and Hinton, 2009) datasets). Surprisingly, although the AL approach aims at minimizing the local losses, its prediction accuracy is comparable to, and sometimes even better than, that of BP-based learning, whose goal is directly minimizing the prediction error.

In each experiment, we used the settings that were reported in recent papers. We spent a reasonable amount of time searching for the hyperparameters not stated in previous papers based on random search (Bergstra and Bengio, 2012). Eventually, we initialized all the weights based on the He normal initializer and used Adam as the optimizer. We experimented with different activation functions and adopted the exponential linear unit (ELU) for all the local forward functions (i.e., f_i) and a sigmoid function for the functions related to the autoencoders and bridges (i.e., g_i, h_i, and b_i). The models trained by BP yielded test accuracies close to the state-of-the-art (SOTA) results under the same or similar network structures (He et al., 2016; Carranza-Rojas et al., 2019). In addition, because AL includes extra parameters in the function b_{ℓ/2} (the last bridge), as explained in Section 4.5, we increased the number of layers in the corresponding baseline models when training by BP so that the models trained by AL and those trained by BP have identical numbers of parameters, making the comparisons fair. The implementations are freely available at https://github.com/SamYWK/Associated_Learning.

5.2 Test Accuracy

To test the capability of AL, we compared AL and BP on different network structures (MLP, vanilla CNN, ResNet, and VGG) and different datasets (MNIST, CIFAR-10, and CIFAR-100). When converting a network with an odd number of layers into the "folded" architecture used by AL, the middle layer is simply absorbed by the bridge layer at the top component shown in Figure 4. We also experimented with difference target propagation (DTP) (Lee et al., 2015) on the MLP network based on the MNIST dataset. We tried only the MLP network, as the original paper applied DTP only to the MLP structure, and applying DTP to other network structures requires different designs.

On the MNIST dataset, we conducted experiments with only two network structures, MLP and vanilla CNN, because using even these simple structures yielded decent test accuracies. Their detailed settings are described in the following paragraphs. The results are shown in Table 2. For both the MLP and the vanilla CNN structure, AL performs slightly better than BP, which performs better than DTP on the MLP network.

Table 2: Test accuracy comparison on the MNIST dataset. We highlight the winner in bold font. We applied only the DTP algorithm on the MLP because this is the setting used in the original paper. Applying DTP on other networks might require different designs.

              BP            AL            DTP
MLP           98.5 ± 0.0%   98.6 ± 0.0%   96.43 ± 0.04%
Vanilla CNN   99.4 ± 0.0%   99.5 ± 0.0%   -

The MLP contains 5 hidden layers and 1 output layer; there are 1024, 1024, 5120, 1024, and 1024 neurons in the hidden layers and 10 neurons in the output layer. Referring to Figure 4, this network corresponds to the following structure under the AL framework: the network has two components; both the s_i and t_i in a component i (i = 1, 2) have 1024 neurons, and the output of the top bridge function b_2 contains 5120 neurons.

The vanilla CNN contains 13 hidden layers and 1 output layer. The first 4 layers are convolutional layers with a size of 3 × 3 × 32 (i.e., a width of 3, a height of 3, and 32 kernels) in each layer, followed by 4 convolutional layers with a size of 3 × 3 × 64 in each layer, followed by a fully connected layer with 1280 neurons, followed by 4 fully connected layers with 256 neurons in each layer, and ending with a fully connected layer with 10 neurons. When training by AL, this structure corresponds to the following: the first five layers (layers 1 to 5) and the last five layers (layers 9 to 13) form five components, where layer i and layer 14 − i (i = 1, ..., 5) belong to component i, and the 6th, 7th, and 8th layers construct component 6 (the layer-to-component pairing is sketched below). The initial learning rate is 10^-4, which is reduced after 80, 120, 160, and 180 epochs.
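To make the "folding" explicit, the small helper below (our own illustrative code with hypothetical naming, not from the paper's repository) maps the 13 hidden layers of this vanilla CNN to AL components: layer i pairs with layer 14 − i to form component i, and the middle layers 6 to 8 form the sixth, top component.

```python
# Illustrative layer-to-component assignment for the 13-hidden-layer CNN above.
def fold_layers(n_hidden=13, n_components=5):
    assignment = {}
    for i in range(1, n_components + 1):
        # component i pairs layer i (f_i side) with layer n_hidden + 1 - i (h_i side)
        assignment[i] = [i, n_hidden + 1 - i]
    # the remaining middle layers form the top component (absorbed by its bridge)
    middle = list(range(n_components + 1, n_hidden + 1 - n_components))
    assignment[n_components + 1] = middle                # layers 6, 7, 8
    return assignment

if __name__ == "__main__":
    for comp, layers in fold_layers().items():
        print(f"component {comp}: layers {layers}")
```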
The CIFAR-10 dataset is more challenging than the MNIST dataset. The input image size is 32 × 32 × 3 (Krizhevsky and Hinton, 2009); i.e., the images have a higher resolution, and each pixel includes red, green, and blue (RGB) information. To make good use of these abundant features, we included not only the MLP and vanilla CNN in this experiment but also VGG and the ResNets. The input images are augmented by 2-pixel jittering (Sabour et al., 2017). We applied the L2 norm using 5 × 10^-4 and 1 × 10^-4 as the regularization weights for VGG and the ResNet models, respectively. Because ResNet uses batch normalization and the shortcut trick, we set its learning rate to 10^-3, which is slightly larger than that of the other models. In addition, to ensure that the models trained by BP and AL have identical numbers of parameters for a fair comparison, we added extra layers to ResNet-20, ResNet-32, and VGG when using BP for learning.

Table 3 shows the results on the CIFAR-10 dataset. AL performs marginally better than BP on the MLP, vanilla CNN, and VGG structures. With the ResNet structures, AL performs slightly worse than BP.

Table 3: Test accuracy comparison on the CIFAR-10 dataset. We highlight the winner in bold font. We applied only the DTP algorithm on the MLP because this is the setting used in the original paper. Applying DTP on other networks might require different designs.

              BP            AL            DTP
MLP           60.6 ± 0.3%   62.8 ± 0.2%   58.2 ± 0.2%
Vanilla CNN   85.2 ± 0.4%   85.8 ± 0.1%   -
ResNet-20     91.2 ± 0.4%   89.1 ± 0.5%   -
ResNet-32     92.0 ± 0.2%   88.7 ± 0.4%   -
VGG           92.3 ± 0.2%   92.6 ± 0.1%   -

The CIFAR-100 dataset includes 100 classes. We used model settings that were nearly identical to the settings used on the CIFAR-10 dataset but increased the number of neurons in the bridge. Table 4 shows the results. As with CIFAR-10, AL performs better than BP on the MLP, vanilla CNN, and VGG structures but slightly worse on the ResNet structures.

Table 4: Test accuracy comparison on the CIFAR-100 dataset. We highlight the winner in bold font.

              BP            AL
MLP           26.5 ± 0.4%   29.7 ± 0.2%
Vanilla CNN   51.1 ± 0.2%   52.2 ± 0.5%
ResNet-20     63.7 ± 0.2%   61.0 ± 0.6%
ResNet-32     63.7 ± 0.3%   59.0 ± 1.6%
VGG           65.8 ± 0.3%   67.1 ± 0.3%

Currently, the theoretical aspects of the AL method are weak, so we are unsure of the fundamental reasons why AL outperforms BP on the MLP, vanilla CNN, and VGG but BP outperforms AL on ResNet. Our speculations are below. First, since BP aims to fit the target directly, and most of the layers in AL can leverage only indirect clues to update the parameters, AL is less likely to outperform BP. However, this reason does not explain why AL performs better than BP on the other networks. Second, perhaps the bridges can be implicitly regarded as the shortcut connections of ResNet, so applying AL to ResNet amounts to refining residuals of residuals, which could be noisy. Finally, years of study on BP have given us experience with hyperparameter settings for BP; a similar hyperparameter setting may not necessarily be the best setting for AL.

As reported in (Bartunov et al., 2018), earlier alternatives to BP, such as target propagation (TP) and feedback alignment (FA), performed worse than BP on non-fully connected networks (e.g., a locally connected network such as a CNN) and more complex datasets (e.g., CIFAR). Recent studies, such as those on decoupled greedy learning (DGL) and the Predsim model (Belilovsky et al., 2019; Nøkland and Eidnes, 2019), showed a performance similar to BP on more complex networks, e.g., VGG, but these models require each layer to access the target label y directly, which could be biologically implausible because distant neurons are unlikely to obtain the signals directly from the target. As far as we know, our proposed AL technique is the first work to show that an alternative to BP works on various network structures without directly revealing the target y to each hidden layer, with results comparable to, and sometimes even better than, those of the networks trained by BP.
5.3 Number of Layers vs. the Associated Loss and vs. the Accuracy

This section presents the results of experiments with different numbers of component layers on the MNIST dataset. For each component layer i, both the corresponding s_i and t_i have 1024 neurons, and s'_ℓ (i.e., the output of the bridge at the top layer) contains 5120 neurons.

First, we show that each component indeed incrementally refines the associated loss of the one immediately below it. Specifically, we applied AL to the MLP and experimented with different numbers of component layers. As shown in Table 5, adding more layers truly decreases the associated loss, and the associated loss at an upper layer is smaller than that at a lower layer.

Table 5: The associated loss at different layers on the MNIST dataset after 200 epochs. Referring to Figure 4, for each layer, its corresponding s_i and t_i both contain 1024 neurons.

Number of component layers   1 layer          2 layers         3 layers
||s'_1 − t_1||_2^2           1.2488 × 10^-5   1.5469 × 10^-5   1.2219 × 10^-5
||s'_2 − t_2||_2^2           -                3.5818 × 10^-7   3.8033 × 10^-7
||s'_3 − t_3||_2^2           -                -                6.7192 × 10^-10

Second, we show that adding more layers helps transform x into y. As shown in Table 6, adding more layers increases the test accuracy.

Table 6: Number of layers vs. the training accuracy and vs. the test accuracy on the MNIST dataset after 200 epochs. Referring to Figure 4, for each layer, the corresponding s_i and t_i both contain 1024 neurons. The bridge layer in the top layer includes 5120 neurons.

Number of component layers   1 layer   2 layers   3 layers
Training accuracy            1.0       1.0        1.0
Test accuracy                0.9849    0.9860     0.9871

5.4 Metafeature Visualization and Quantification

To determine whether the hidden layers truly learn useful metafeatures when using AL, we used t-SNE (Maaten and Hinton, 2008) to visualize the 2nd and 4th hidden layers and the output layer in the 6-layer MLP model and the 4th, 8th, and 12th hidden layers in the 14-layer vanilla CNN model on the CIFAR-10 dataset. For comparison purposes, we also visualize the corresponding hidden layers trained using BP. As shown in Figure 5 and Figure 6, the initial layers seem to extract less useful metafeatures than the later layers because the labels are difficult to distinguish in the corresponding figures. However, a comparison of the last few layers shows that AL groups the data points of the same label more accurately than BP, which suggests that AL likely learns better metafeatures.

To assess the quality of the learned metafeatures, we calculated the intra- and interclass distances of the data points based on the metafeatures. We computed the intraclass distance d_intra^k as the average distance between any two data points in class k for each class. The interclass distance is the average distance between the centroids of the classes. We also computed the ratio between the inter- and intraclass distances to determine the quality of the metafeatures generated by AL and BP (Michael and Lin, 1973; Luo et al., 2019); a short sketch of this computation is given after Table 7. As shown in Table 7, AL performs better than BP on both the CIFAR-10 and CIFAR-100 datasets because AL generates metafeatures with a larger ratio between the inter- and intraclass distances.

Table 7: A comparison of the inter- and intraclass distances and the ratio of the two. We highlight the winner in bold font.

Dataset     Network       Method   Interclass distance   Intraclass distance   Inter:Intra ratio
CIFAR-10    MLP           BP       39.36                 67.97                 0.58
                          AL       0.73                  0.66                  1.11
            Vanilla CNN   BP       41.82                 26.87                 1.56
                          AL       1.17                  0.36                  3.25
CIFAR-100   MLP           BP       114.42                342.65                0.33
                          AL       0.23                  0.28                  0.82
            Vanilla CNN   BP       114.71                163.43                0.70
                          AL       0.55                  0.51                  1.08
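The distance measures in Table 7 can be computed with a sketch like the following (our own illustrative NumPy code; intra_inter_ratio is a hypothetical helper): the intraclass distance averages pairwise distances within each class, and the interclass distance averages distances between class centroids.

```python
# Illustrative computation of the inter/intraclass distance ratio (Table 7).
import numpy as np
from itertools import combinations

def intra_inter_ratio(features, labels):
    classes = np.unique(labels)
    # Intraclass: average pairwise Euclidean distance within each class,
    # then averaged over classes.
    intra = []
    for k in classes:
        pts = features[labels == k]
        dists = [np.linalg.norm(a - b) for a, b in combinations(pts, 2)]
        intra.append(np.mean(dists))
    intra = float(np.mean(intra))
    # Interclass: average distance between class centroids.
    centroids = [features[labels == k].mean(axis=0) for k in classes]
    inter = float(np.mean([np.linalg.norm(a - b)
                           for a, b in combinations(centroids, 2)]))
    return inter, intra, inter / intra

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(60, 8))        # toy metafeatures
    labs = rng.integers(0, 3, size=60)      # toy labels for 3 classes
    print(intra_inter_ratio(feats, labs))
```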
6 Discussion and Future Work

Although BP is the cornerstone of today's deep learning algorithms, it is far from ideal, and therefore, improving BP or searching for alternatives is an important research direction. This paper discusses AL, a novel process for training deep neural networks without end-to-end BP. Rather than calculating gradients in a layerwise fashion based on BP, AL removes the dependencies between the parameters of different subnetworks, thus allowing each subnetwork to be trained simultaneously and independently. Consequently, we may utilize pipelines to increase the training throughput. Our method is biologically plausible because the targets are local and the gradients are not obtained from the output layer. Although AL does not directly minimize the prediction error, its test accuracy is comparable to, and sometimes better than, that of BP, which does directly attempt to minimize the prediction error.

Although recent studies have begun to use local losses instead of backpropagating the global loss (Nøkland and Eidnes, 2019), these local losses are computed mainly based on (or are at least partially based on) the difference between the target variable and the predicted results. Our method is unique because in AL, most of the layers do not interact with the target variable.

Current strategies to parallelize the training of a deep learning model usually distribute the training data into different computing units and aggregate (e.g., by averaging) the gradients computed by each computing unit. Our work, on the other hand, parallelizes the training step by computing the parameters of the different layers simultaneously. Therefore, AL is not an alternative to most of the other parallel training approaches but can be integrated with them to further improve the training throughput.

Years of research have allowed us to gradually understand the proper hyperparameter settings (e.g., network structure, weight initialization, and activation function) when training a neural network based on BP. However, these settings may not be appropriate when training by AL. Therefore, one possible research direction is to search for the right settings for this new approach.

We implemented AL in TensorFlow. However, we were unable to implement the "pipelined" AL shown in Table 1 within a reasonable period because of the technical challenges of task scheduling and parallelization in TensorFlow, so we leave this part as future work. Nevertheless, we ensure that the gradients propagate only within each component, so, theoretically, a pipelined AL should be implementable.

Another possible future work is validating AL on other datasets (e.g., ImageNet, Microsoft Common Objects in Context (MS COCO), and Google's Open Images) and even on datasets unrelated to computer vision, such as those used in signal processing, natural language processing, and recommender systems. Yet another future direction is theoretical work on AL, as this may help us understand why AL outperforms BP under certain network structures. In the longer term, we are highly interested in investigating optimization algorithms beyond BP and gradients.

Acknowledgments

We acknowledge partial support by the Ministry of Science and Technology under grant no. MOST 107-2221-E-008-077-MY3.
We thank the reviewers for their informative feedback.

References

Allen-Zhu, Z., Li, Y., and Song, Z. (2018). A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962.

Arora, S., Cohen, N., and Hazan, E. (2018). On the optimization of deep networks: implicit acceleration by overparameterization. arXiv preprint arXiv:1802.06509.

Baldi, P. (2012). Autoencoders, unsupervised learning, and deep architectures. In Proceedings of ICML Workshop on Unsupervised and Transfer Learning, pages 37–49.

Balduzzi, D., Vanchinathan, H., and Buhmann, J. M. (2015). Kickback cuts backprop's red-tape: biologically plausible credit assignment in neural networks. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 485–491.

Bartunov, S., Santoro, A., Richards, B., Marris, L., Hinton, G. E., and Lillicrap, T. (2018). Assessing the scalability of biologically-motivated deep learning algorithms and architectures. In Advances in Neural Information Processing Systems, pages 9390–9400.

Belilovsky, E., Eickenberg, M., and Oyallon, E. (2018). Greedy layerwise learning can scale to ImageNet. arXiv preprint arXiv:1812.11446.

Belilovsky, E., Eickenberg, M., and Oyallon, E. (2019). Decoupled greedy learning of CNNs. arXiv preprint arXiv:1901.08164.

Bengio, Y. (2014). How auto-encoders could provide credit assignment in deep networks via target propagation. arXiv preprint arXiv:1407.7906.

Bengio, Y., Lee, D.-H., Bornschein, J., Mesnard, T., and Lin, Z. (2015). Towards biologically plausible deep learning. arXiv preprint arXiv:1502.04156.

Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305.

Carranza-Rojas, J., Calderon-Ramirez, S., Mora-Fallas, A., Granados-Menani, M., and Torrents-Barrena, J. (2019). Unsharp masking layer: injecting prior knowledge in convolutional networks for image classification. In International Conference on Artificial Neural Networks, pages 3–16. Springer.

Chen, H.-H. (2017). Weighted-SVD: matrix factorization with weights on the latent factors. arXiv preprint arXiv:1710.00482.

Chen, P. and Chen, H.-H. (2020). Accelerating matrix factorization by overparameterization. In International Conference on Deep Learning Theory and Applications, pages 89–97.

Crick, F. (1989). The recent excitement about neural networks. Nature, 337(6203):129–132.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.

Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, D., Chen, M., Lee, H., Ngiam, J., Le, Q. V., Wu, Y., et al. (2019). GPipe: efficient training of giant neural networks using pipeline parallelism. In Advances in Neural Information Processing Systems, pages 103–112.

Huo, Z., Gu, B., and Huang, H. (2018a). Training neural networks using features replay. In Advances in Neural Information Processing Systems, pages 6659–6668.

Huo, Z., Gu, B., Yang, Q., and Huang, H. (2018b). Decoupled parallel backpropagation with convergence guarantee. arXiv preprint arXiv:1804.10574.

Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O., Graves, A., Silver, D., and Kavukcuoglu, K. (2016). Decoupled neural interfaces using synthetic gradients. arXiv preprint arXiv:1608.05343.

Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.

Lee, D.-H., Zhang, S., Fischer, A., and Bengio, Y. (2015). Difference target propagation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 498–515. Springer.

Lillicrap, T. P., Cownden, D., Tweed, D. B., and Akerman, C. J. (2016). Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications, 7:13276.

Luo, Y., Wong, Y., Kankanhalli, M., and Zhao, Q. (2019). G-softmax: improving intraclass compactness and interclass separability of features. IEEE Transactions on Neural Networks and Learning Systems.

Maaten, L. v. d. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605.

Michael, M. and Lin, W.-C. (1973). Experimental study of information measure and inter-intra class distance ratios on feature selection and orderings. IEEE Transactions on Systems, Man, and Cybernetics, pages 172–181.

Mostafa, H., Ramesh, V., and Cauwenberghs, G. (2018). Deep supervised learning using local errors. Frontiers in Neuroscience, 12:608.

Nøkland, A. (2016). Direct feedback alignment provides learning in deep neural networks. In Advances in Neural Information Processing Systems, pages 1037–1045.

Nøkland, A. and Eidnes, L. H. (2019). Training neural networks with local error signals. arXiv preprint arXiv:1901.06656.

Ororbia, A. G. and Mali, A. (2019). Biologically motivated algorithms for propagating local target representations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4651–4658.

Ororbia, A. G., Mali, A., Kifer, D., and Giles, C. L. (2018). Conducting credit assignment by aligning local representations. arXiv preprint arXiv:1803.01834.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088):533.

Sabour, S., Frosst, N., and Hinton, G. E. (2017). Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pages 3856–3866.

Shallue, C. J., Lee, J., Antognini, J., Sohl-Dickstein, J., Frostig, R., and Dahl, G. E. (2018). Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600.

Simonyan, K. and Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.

Taylor, G., Burmeister, R., Xu, Z., Singh, B., Patel, A., and Goldstein, T. (2016). Training neural networks without gradients: a scalable ADMM approach. In International Conference on Machine Learning, pages 2722–2731.

Zinkevich, M., Weimer, M., Li, L., and Smola, A. J. (2010). Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 2595–2603.

Figure 4: A training example using associated learning. The black arrows indicate the forward paths that involve learnable parameters; the green arrows connect the variables that should be compared to minimize their associated distance; the red arrows denote the backward gradient flows. We group each component by dashed lines. The parameters of the different components are independent so that they can be updated simultaneously. The variable ℓ_u^(v) denotes the v-th gradient flow of the u-th component. MSE_u^(v) denotes the v-th mean-squared error of the u-th component. Consequently, the first gradient flow of each component, ℓ_u^(1), determines the updates of the parameters of f_u and b_u; the second gradient flow of each component, ℓ_u^(2), determines the updates of g_u and h_u.

Figure 5: t-SNE visualization of the MLP on the CIFAR-10 dataset. The different colors represent different labels. The figures in the first row are the results of the raw data, the 2nd layer, the 4th layer, and the output layer when using BP. The second row shows the corresponding results for AL.

Figure 6: t-SNE visualization of the vanilla CNN on the CIFAR-10 dataset. The different colors represent different labels. The figures in the first row are the results of the raw data, the 4th layer, the 8th layer, and the 12th layer when using BP. The second row shows the corresponding results for AL.
