A Novel Representation of Neural Networks

Anthony L. Caterini and Dong Eui Chang
Department of Applied Mathematics
University of Waterloo
Waterloo, ON, Canada, N2L 3G1
{alcaterini, dechang}@uwaterloo.ca

Abstract

Deep Neural Networks (DNNs) have become very popular for prediction in many areas. Their strength is in representation with a high number of parameters that are commonly learned via gradient descent or similar optimization methods. However, the representation is non-standardized, and the gradient calculation methods are often performed using component-based approaches that break parameters down into scalar units, instead of considering the parameters as whole entities. In this work, these problems are addressed. Standard notation is used to represent DNNs in a compact framework. Gradients of DNN loss functions are calculated directly over the inner product space on which the parameters are defined. This framework is general and is applied to two common network types: the Multilayer Perceptron and the Deep Autoencoder.

Keywords: Deep Learning, Neural Networks, Multilayer Perceptron, Deep Autoencoder, Backpropagation.

1 Introduction

Deep Neural Networks (DNNs) have grown increasingly popular over the last few years because of their astounding results in a variety of tasks. Their strength derives from their expressiveness, and this grows with network depth. However, the traditional approaches to representing DNNs suffer as the number of network layers increases. These often rely on confusing diagrams that provide an incomplete description of the mechanics of the network, which leads to complexity as the number of layers increases. Furthermore, DNNs are inconsistently formulated as a mathematical problem throughout research in the field, especially notationally, which impedes the efficiency with which results can be combined or expanded upon.
A clear and concise framework underpinning DNNs must be developed, and this work endeavours to address that issue. In this work, a novel mathematical framework for DNNs is created. It is formed by employing carefully selected standard notions and notation to represent a general DNN. Common mathematical tools such as the inner product, the adjoint operation, and maps defined over generic inner product spaces are utilized throughout this work. Well-established mathematical objects are treated as-is in this framework; it is no longer necessary to convert a matrix into a column vector or decompose it into a collection of components, for example, for the purposes of derivative calculation. This work presents a comprehensive mathematical standard upon which DNNs can be formulated.

The specific layout of this paper is as follows. After some mathematical preliminaries, a generic DNN is formulated over an abstract inner product space. The chain rule is used to demonstrate a concise coordinate-free approach to backpropagation. Two standard loss functions are explicitly considered, and it is shown how to handle some variations on those within the learning algorithm. Then, this framework is applied to the multilayer perceptron (MLP). The specifics of the previous approach become clear, and it is shown how to create a gradient descent algorithm to learn the parameters of the MLP. Some of the theory developed in the section on the MLP is then applied to a deep autoencoder (AE), which demonstrates the flexibility of the approach. This type of framework can be extended to other types of networks, including convolutional neural networks (CNNs) and recurrent neural networks (RNNs), but these are omitted for the sake of brevity.

2 Mathematical Preliminaries

In this section, we set notation and review some elementary but essential mathematical facts.
These facts will be used to cast neural networks into a novel framework in the following sections.

2.1 Linear Maps, Bilinear Maps, and Adjoints

Consider three inner product spaces E_1, E_2, and E_3, i.e. each vector space is equipped with an inner product denoted by ⟨·, ·⟩. The space of linear maps from E_1 to E_2 will be denoted L(E_1; E_2). Note that for L ∈ L(E_1; E_2) and u ∈ E_1, L · u ∈ E_2 denotes L operating on u, i.e. L(u), or more simply Lu. Similarly, the space of bilinear maps from E_1 × E_2 into E_3 will be denoted L(E_1, E_2; E_3). For B ∈ L(E_1, E_2; E_3) and u_1 ∈ E_1, u_2 ∈ E_2, B · (u_1, u_2) ∈ E_3 denotes B operating on u_1 and u_2, i.e. B(u_1, u_2).

For any bilinear map B ∈ L(E_1, E_2; E_3) and any e_1 ∈ E_1, a linear map e_1 ⌟ B ∈ L(E_2; E_3) is defined as follows:

(e_1 ⌟ B) · e_2 = B(e_1, e_2) for all e_2 ∈ E_2.

Similarly, for any e_2 ∈ E_2, a linear map B ⌞ e_2 ∈ L(E_1; E_3) is defined as follows:

(B ⌞ e_2) · e_1 = B(e_1, e_2) for all e_1 ∈ E_1.

These operators ⌟ and ⌞ will be referred to as the left hook and right hook operators, respectively.

The adjoint L* of a linear map L ∈ L(E_1; E_2) is a linear map in L(E_2; E_1) defined by

⟨L* e_2, e_1⟩ = ⟨e_2, L e_1⟩ for all e_1 ∈ E_1 and e_2 ∈ E_2.

The adjoint operator satisfies the direction-reversing property: (L_2 L_1)* = L_1* L_2* for all L_1 ∈ L(E_1; E_2) and L_2 ∈ L(E_2; E_3).

2.2 Derivatives

In this section, notation for derivatives in accordance with [1] is presented.

2.2.1 First Derivatives

Consider a map f : E_1 → E_2, where E_1 and E_2 are inner product spaces. The (first) derivative map of f, denoted Df, is a map from E_1 to L(E_1; E_2) that operates as x ↦ Df(x) for any x ∈ E_1. The linear map Df(x) operates in the following manner for any v ∈ E_1:

Df(x) · v = (d/dt) f(x + tv) |_{t=0}.
(1)

For each x ∈ E_1, the adjoint of the derivative Df(x) ∈ L(E_1; E_2) is well defined with respect to the inner products on E_1 and E_2, and it is denoted D*f(x) instead of Df(x)* for the sake of notational convenience. Then, D*f : E_1 → L(E_2; E_1) denotes the adjoint map that maps each point x ∈ E_1 to D*f(x) ∈ L(E_2; E_1).

Now consider two maps f_1 : E_1 → E_2 and f_2 : E_2 → E_3, where E_3 is another inner product space. The derivative of their composition, D(f_2 ∘ f_1)(x) ∈ L(E_1; E_3) for x ∈ E_1, is calculated using the well-known chain rule.

Lemma 2.1 (Chain Rule). For any x ∈ E_1,

D(f_2 ∘ f_1)(x) = Df_2(f_1(x)) · Df_1(x),

where f_1 : E_1 → E_2 and f_2 : E_2 → E_3 are C^1, i.e. continuously differentiable, and E_1, E_2, and E_3 are vector spaces.

2.2.2 Second Derivatives

Every map in here is assumed to be (piecewise) C^2, i.e. (piecewise) twice continuously differentiable, unless stated otherwise. The second derivative map of f, denoted D^2 f, is a map from E_1 to L(E_1, E_1; E_2), which operates as x ↦ D^2 f(x) for any x ∈ E_1. The bilinear map D^2 f(x) operates as follows: for any v_1, v_2 ∈ E_1,

D^2 f(x) · (v_1, v_2) = (d/dt) (Df(x + tv_1) · v_2) |_{t=0}. (2)

It is not hard to show that D^2 f(x) is symmetric, i.e. D^2 f(x) · (v_1, v_2) = D^2 f(x) · (v_2, v_1) for all v_1, v_2 ∈ E_1. Furthermore, it can be shown that

D^2 f(x) · (v_1, v_2) = (∂^2/∂t ∂s) f(x + tv_1 + sv_2) |_{t=s=0}.

The hook notation from Section 2.1 can be used to turn the second derivative into a linear map. In particular, (v ⌟ D^2 f(x)) and (D^2 f(x) ⌞ v) ∈ L(E_1; E_2) for any x, v ∈ E_1. An important identity exists for the second derivative of the composition of two functions.

Lemma 2.2.
For any x, v_1, v_2 ∈ E_1,

D^2(f_2 ∘ f_1)(x) · (v_1, v_2) = D^2 f_2(f_1(x)) · (Df_1(x) · v_1, Df_1(x) · v_2) + Df_2(f_1(x)) · D^2 f_1(x) · (v_1, v_2),

where f_1 : E_1 → E_2 and f_2 : E_2 → E_3 are C^2 for vector spaces E_1, E_2, and E_3. This can be seen as the chain rule for second derivatives.

2.2.3 Parameter-Dependent Maps

Now suppose f is a map from E_1 × H_1 → E_2, i.e. f(x; θ) ∈ E_2 for any x ∈ E_1 and θ ∈ H_1, where H_1 is also an inner product space. The variable x ∈ E_1 is said to be the state variable for f, whereas θ ∈ H_1 is a parameter. The notation presented in (1) is used to denote the derivative of f with respect to the state variable, i.e. for all v ∈ E_1,

Df(x; θ) · v = (d/dt) f(x + tv; θ) |_{t=0}.

Also, D^2 f(x; θ) · (v_1, v_2) = D(Df(x; θ) · v_2) · v_1 as before. New notation is used to denote the derivative of f with respect to the parameters, as follows:

∇f(x; θ) · u = (d/dt) f(x; θ + tu) |_{t=0}

for any u ∈ H_1. Note that ∇f(x; θ) ∈ L(H_1; E_2). In the case where f depends on two parameters as f(x; θ_1, θ_2), the notation ∇_{θ_1} f(x; θ_1, θ_2) will be used to explicitly denote differentiation with respect to the parameter θ_1 when the distinction is necessary.

The mixed partial derivative maps, ∇Df(x; θ) ∈ L(H_1, E_1; E_2) and D∇f(x; θ) ∈ L(E_1, H_1; E_2), are defined as:

∇Df(x; θ) · (u, e) = (d/dt) (Df(x; θ + tu) · e) |_{t=0},
D∇f(x; θ) · (e, u) = (d/dt) (∇f(x + te; θ) · u) |_{t=0},

for any e ∈ E_1, u ∈ H_1. Note that if f ∈ C^2, then D∇f(x; θ) · (e, u) = ∇Df(x; θ) · (u, e), i.e. the mixed partial derivatives are equal.

2.3 Elementwise Functions

Consider an inner product space E of dimension n with the inner product denoted by ⟨·, ·⟩. Let {e_k}_{k=1}^n be an orthonormal basis of E.
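Before turning to elementwise functions, the state- and parameter-derivative conventions of Section 2.2 can be illustrated numerically on a concrete map. The sketch below is illustrative only and not part of the original development; the map f(x; W, b) = tanh(Wx + b) and all variable names are assumptions made for the example.

```python
import numpy as np

# For f(x; W, b) = tanh(W x + b), check the state derivative D f(x; θ) · v and
# the parameter derivative ∇_b f(x; θ) · u against central finite differences.
def f(x, W, b):
    return np.tanh(W @ x + b)

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))
b = rng.standard_normal(3)
x, v = rng.standard_normal(2), rng.standard_normal(2)
u = rng.standard_normal(3)
h = 1e-6
z = W @ x + b

# D f(x; θ) · v = (d/dt) f(x + t v; θ) at t = 0; analytically (1 - tanh^2(z)) ⊙ (W v)
Df_v = (f(x + h * v, W, b) - f(x - h * v, W, b)) / (2 * h)
assert np.allclose(Df_v, (1 - np.tanh(z) ** 2) * (W @ v), atol=1e-6)

# ∇_b f(x; θ) · u = (d/dt) f(x; b + t u) at t = 0; analytically (1 - tanh^2(z)) ⊙ u
grad_b_u = (f(x, W, b + h * u, ) - f(x, W, b - h * u)) / (2 * h)
assert np.allclose(grad_b_u, (1 - np.tanh(z) ** 2) * u, atol=1e-6)
```

Both directional derivatives are computed exactly as in the defining limits above, with t replaced by a small finite step.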
An elementwise function is defined to be a function Ψ : E → E of the form

Ψ(v) = ∑_{k=1}^n ψ(⟨v, e_k⟩) e_k, (3)

where ψ : R → R — known as the elementwise operation associated with Ψ — defines the operation of the elementwise function over the components {⟨v, e_k⟩}_k of the vector v ∈ E. The operator Ψ is basis-dependent, but {e_k}_{k=1}^n can be any orthonormal basis of E.

Also define the elementwise first derivative of an elementwise function Ψ, Ψ′ : E → E, as

Ψ′(v) = ∑_{k=1}^n ψ′(⟨v, e_k⟩) e_k, (4)

where ψ′ is the first derivative of ψ. Note that ψ′ can be referred to as the associated elementwise operation for Ψ′. Similarly, define the elementwise second derivative function Ψ′′ : E → E as

Ψ′′(v) = ∑_{k=1}^n ψ′′(⟨v, e_k⟩) e_k, (5)

where ψ′′ is the second derivative of ψ.

2.3.1 Hadamard Product

Now define a symmetric bilinear operator ⊙ ∈ L(E, E; E) over the basis vectors {e_k}_{k=1}^n as

e_k ⊙ e_{k′} := δ_{k,k′} e_k, (6)

where δ_{k,k′} is the Kronecker delta. This is the standard Hadamard product when E = R^n and {e_k}_{k=1}^n is the standard basis of R^n. However, when E ≠ R^n or {e_k}_{k=1}^n is not the standard basis, ⊙ can be seen as a generalization of the Hadamard product, and it will be referred to as such in this paper.

For illustrative purposes, consider the (generalized) Hadamard product of two vectors v, v′ ∈ E. These vectors can be written as v = ∑_{k=1}^n ⟨v, e_k⟩ e_k and v′ = ∑_{k=1}^n ⟨v′, e_k⟩ e_k. Then,

v ⊙ v′ = (∑_{k=1}^n ⟨v, e_k⟩ e_k) ⊙ (∑_{k′=1}^n ⟨v′, e_{k′}⟩ e_{k′}) = ∑_{k,k′=1}^n ⟨v, e_k⟩ ⟨v′, e_{k′}⟩ (e_k ⊙ e_{k′}) = ∑_{k=1}^n ⟨v, e_k⟩ ⟨v′, e_k⟩ e_k.

It is easy to show that the Hadamard product satisfies the following properties:

v ⊙ v′ = v′ ⊙ v, (v ⊙ v′) ⊙ y = v ⊙ (v′ ⊙ y), ⟨y, v ⊙ v′⟩ = ⟨v ⊙ y, v′⟩ = ⟨y ⊙ v′, v⟩,

for all y, v, v′ ∈ E.
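Concretely, when E = R^n with the standard basis and Euclidean inner product, ⊙ reduces to the componentwise product, and the three properties above can be checked numerically. The following NumPy sketch is illustrative and not part of the original text:

```python
import numpy as np

# In E = R^n with the standard basis, v ⊙ v' is the componentwise product.
# Verify the Hadamard-product properties listed above on random vectors.
rng = np.random.default_rng(1)
y, v, w = rng.standard_normal((3, 5))

assert np.allclose(v * w, w * v)              # v ⊙ v' = v' ⊙ v
assert np.allclose((v * w) * y, v * (w * y))  # (v ⊙ v') ⊙ y = v ⊙ (v' ⊙ y)
lhs = np.dot(y, v * w)                        # <y, v ⊙ v'>
assert np.isclose(lhs, np.dot(v * y, w))      # = <v ⊙ y, v'>
assert np.isclose(lhs, np.dot(y * w, v))      # = <y ⊙ v', v>
```

The last identity is the one used repeatedly below to move factors across the inner product when computing adjoints.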
2.3.2 Derivatives of Elementwise Functions

Some results regarding the derivative maps for a generic elementwise function Ψ, i.e. DΨ and D^2Ψ, are presented now.

Proposition 2.3. Let Ψ : E → E be an elementwise function as defined in (3), for an inner product space E of dimension n with a basis {e_k}_{k=1}^n and inner product ⟨·, ·⟩. Then, for any v, z ∈ E,

DΨ(z) · v = Ψ′(z) ⊙ v,

where the Hadamard product ⊙ is defined in (6) and Ψ′ is the elementwise first derivative defined in (4). Furthermore, DΨ(z) is self-adjoint, i.e. D*Ψ(z) = DΨ(z) for all z ∈ E.

Proof. Let ψ be the elementwise operation associated with Ψ. Then,

DΨ(z) · v = (d/dt) Ψ(z + tv) |_{t=0} = (d/dt) ∑_{k=1}^n ψ(⟨z + tv, e_k⟩) e_k |_{t=0} = ∑_{k=1}^n ψ′(⟨z, e_k⟩) ⟨v, e_k⟩ e_k = Ψ′(z) ⊙ v,

where the third equality follows from the chain rule and linearity of the derivative. Furthermore, let y ∈ E. Then,

⟨y, DΨ(z) · v⟩ = ⟨y, Ψ′(z) ⊙ v⟩ = ⟨Ψ′(z) ⊙ y, v⟩ = ⟨DΨ(z) · y, v⟩.

Since ⟨y, DΨ(z) · v⟩ = ⟨DΨ(z) · y, v⟩ for any v, y, z ∈ E, DΨ(z) is self-adjoint.

Proposition 2.4. Let Ψ : E → E be an elementwise function as defined in (3), for an inner product space E of dimension n with a basis {e_k}_{k=1}^n and inner product ⟨·, ·⟩. Then, for any v_1, v_2, z ∈ E,

D^2Ψ(z) · (v_1, v_2) = Ψ′′(z) ⊙ v_1 ⊙ v_2, (7)

where the Hadamard product ⊙ is defined in (6) and Ψ′′ is the elementwise second derivative defined in (5). Furthermore, v_1 ⌟ D^2Ψ(z) and D^2Ψ(z) ⌞ v_2 are both self-adjoint linear maps for any v_1, v_2, z ∈ E.

Proof. Prove (7) directly:

D^2Ψ(z) · (v_1, v_2) = D(DΨ(z) · v_2) · v_1 = D(Ψ′(z) ⊙ v_2) · v_1 = (Ψ′′(z) ⊙ v_1) ⊙ v_2,

where the third equality follows since Ψ′(z) ⊙ v_2 is an elementwise function in z.
Also, for any y ∈ E,

⟨y, (v_1 ⌟ D^2Ψ(z)) · v_2⟩ = ⟨y, D^2Ψ(z) · (v_1, v_2)⟩ = ⟨y, Ψ′′(z) ⊙ v_1 ⊙ v_2⟩ = ⟨Ψ′′(z) ⊙ v_1 ⊙ y, v_2⟩ = ⟨(v_1 ⌟ D^2Ψ(z)) · y, v_2⟩.

This implies that v_1 ⌟ D^2Ψ(z) is self-adjoint for any v_1, z ∈ E. Since D^2Ψ(z) is a symmetric bilinear map, this also implies that D^2Ψ(z) ⌞ v_1 is self-adjoint for any v_1, z ∈ E.

3 Coordinate-Free Representation of Neural Networks

In this section, coordinate-free backpropagation is derived for a generic layered neural network. The network is formulated and then a gradient descent algorithm is given for two types of loss functions.

3.1 Neural Network Formulation

Neural networks are layered models, with the actions of layer i denoted by f_i : E_i × H_i → E_{i+1}, where E_i, H_i, and E_{i+1} are inner product spaces. In other words, f_i(x_i, θ_i) ∈ E_{i+1} for x_i ∈ E_i and θ_i ∈ H_i. For a neural network with L layers, i ∈ {1, …, L}. The state variable x_i ∈ E_i is an abstract representation of the input data x_1 = x at layer i. The parameters θ_i ∈ H_i at layer i must be learned, often by some form of gradient descent.

Note that the explicit dependence of f_i on the parameter θ_i will be suppressed in the notation throughout this section. In this way, f_i : E_i → E_{i+1} is defined by x_{i+1} = f_i(x_i), where f_i depends on θ_i. Then, the network prediction can be written as a composition of functions

F(x; θ) = (f_L ∘ ⋯ ∘ f_1)(x), (8)

where each f_i : E_i → E_{i+1} has a suppressed dependence on the parameter θ_i ∈ H_i, and θ represents the parameter set {θ_1, …, θ_L}. Each parameter θ_i is independent of the other parameters {θ_j}_{j≠i} in this formulation.

Some maps will be introduced to assist in derivative calculation. Let the head map at level i, α_i : E_1 → E_{i+1}, be defined by

α_i = f_i ∘ ⋯ ∘ f_1 (9)

for each i ∈ {1, …, L}.
Note that α_i implicitly depends on the parameters {θ_1, …, θ_i}. For convenience, set α_0 to be the identity map on E_1. Similarly, define the tail map at level i, ω_i : E_i → E_{L+1}, as

ω_i = f_L ∘ ⋯ ∘ f_i (10)

for each i ∈ {1, …, L}. The map ω_i implicitly depends on {θ_i, …, θ_L}. Again for convenience, set ω_{L+1} to be the identity map on E_{L+1}. It is easy to show that the following hold for all i ∈ {1, …, L}:

F = ω_{i+1} ∘ α_i, ω_i = ω_{i+1} ∘ f_i, α_i = f_i ∘ α_{i-1}. (11)

The equations in (11) imply that the prediction F can be decomposed into F = ω_{i+1} ∘ f_i ∘ α_{i-1} for all i ∈ {1, …, L}, where α_{i-1} does not depend on the parameter θ_i.

3.2 Loss Function and Backpropagation

While training a neural network, the goal is to optimize some loss function J with respect to the parameters θ. For example, consider

J(x; θ) := (1/2) ∥y − F(x; θ)∥^2 = (1/2) ⟨y − F(x; θ), y − F(x; θ)⟩, (12)

where y ∈ E_{L+1} is the known response data. Gradient descent is used to optimize the loss function, thus the gradient of J with respect to each of the parameters must be calculated. Before that can be done, some preliminary results will be introduced. In this section, it is always assumed that x_i = α_{i-1}(x) is the state variable at level i for a given data point x.

Theorem 3.1. Let J be defined as in (12). Then, for any x ∈ E_1 and i ∈ {1, …, L},

∇_{θ_i} J(x; θ) = ∇*_{θ_i} F(x; θ) · (F(x; θ) − y). (13)

Proof. By the product rule, for any U_i ∈ H_i,

∇_{θ_i} J(x; θ) · U_i = ⟨F(x; θ) − y, ∇_{θ_i} F(x; θ) · U_i⟩ = ⟨∇*_{θ_i} F(x; θ) · (F(x; θ) − y), U_i⟩.

Since this holds for any U_i ∈ H_i, (13) follows.

The following two theorems show how to compute the derivative ∇_{θ_i} J(x; θ) given in (13) recursively.

Theorem 3.2.
With F defined as in (8) and ω_i defined as in (10),

∇*_{θ_i} F(x; θ) = ∇*_{θ_i} f_i(x_i) · D*ω_{i+1}(x_{i+1}) (14)

with x_i = α_{i-1}(x) and x_{i+1} = f_i(x_i), for all i ∈ {1, …, L}.

Proof. Apply the chain rule to F = ω_{i+1} ∘ f_i ∘ α_{i-1} and then take the adjoint of it to get the result.

Theorem 3.3. With ω_i defined as in (10), then for all x_i ∈ E_i,

Dω_i(x_i) = Dω_{i+1}(x_{i+1}) · Df_i(x_i) (15)

and

D*ω_i(x_i) = D*f_i(x_i) · D*ω_{i+1}(x_{i+1}), (16)

where x_{i+1} = f_i(x_i), for all i ∈ {1, …, L}.

Proof. Apply the chain rule to ω_i(x_i) = (ω_{i+1} ∘ f_i)(x_i) to get (15). Then, take the adjoint of (15) to get (16). This holds for any i ∈ {1, …, L}.

Algorithm 3.1 One iteration of gradient descent for a general NN

function DESCENTITERATION(x, y, θ_1, …, θ_L, η)
    x_1 ← x
    for i ∈ {1, …, L} do                      ▷ x_{L+1} = F(x; θ)
        x_{i+1} ← f_i(x_i)
    end for
    for i ∈ {L, …, 1} do
        θ̃_i ← θ_i                            ▷ Store old θ_i for updating θ_{i−1}
        if i = L then                         ▷ e = D*ω_{i+1}(x_{i+1}) · (x_{L+1} − y)
            e ← x_{L+1} − y                   ▷ ω_{L+1} = identity
        else
            e ← D*f_{i+1}(x_{i+1}) · e        ▷ (16), update with θ̃_{i+1}
        end if
        ∇_{θ_i} J(x; θ) ← ∇*_{θ_i} f_i(x_i) · e   ▷ Thms. 3.1 and 3.2
        θ_i ← θ_i − η ∇_{θ_i} J(x; θ)
    end for
end function

Algorithm 3.1 provides a method to perform one iteration of gradient descent to minimize J over the parameter set θ = {θ_1, …, θ_L} for a single data point x. The algorithm extends linearly to a batch of updates over multiple data points. Notice that gradient descent is performed directly over the inner product space H_i at each layer i, which contrasts with the standard approach of performing the descent over each individual component of θ_i. This can be seen as a coordinate-free gradient descent algorithm.

Remark 3.4.
It is not difficult to incorporate a standard ℓ_2-regularizing term into this framework. Construct a new objective function J_T(x; θ) = J(x; θ) + λT(θ), where λ ∈ R_{≥0} is the regularization parameter and

T(θ) = (1/2) ∥θ∥^2 = (1/2) ∑_{i=1}^L ∥θ_i∥^2 = (1/2) ∑_{i=1}^L ⟨θ_i, θ_i⟩

is the regularization term. It follows that ∇_{θ_i} J_T(x; θ) = ∇_{θ_i} J(x; θ) + λθ_i, since ∇_{θ_i} T(θ) = θ_i. This implies that gradient descent can be updated to include the regularizing term, i.e. the last line in Algorithm 3.1 can be altered as follows:

θ_i ← θ_i − η (∇_{θ_i} J(x; θ) + λθ_i).

Remark 3.5. The loss function considered so far was J(x; θ) = (1/2) ∥y − F(x; θ)∥^2. However, another standard loss function is the cross-entropy loss,

J̃(x; θ) = −⟨y, L(F(x; θ))⟩ − ⟨1 − y, L(1 − F(x; θ))⟩,

where 1 is a vector of ones of appropriate length and L is an elementwise function with elementwise operation log. The gradient of J̃ with respect to a parameter θ_i, in the direction of U_i, is

∇_{θ_i} J̃(x; θ) · U_i = −⟨y, DL(F(x; θ)) · ∇_{θ_i} F(x; θ) · U_i⟩ + ⟨1 − y, DL(1 − F(x; θ)) · ∇_{θ_i} F(x; θ) · U_i⟩
= ⟨∇*_{θ_i} F(x; θ) · [−DL(F(x; θ)) · y + DL(1 − F(x; θ)) · (1 − y)], U_i⟩.

Thus, ∇_{θ_i} J̃(x; θ) = ∇*_{θ_i} F(x; θ) · [−DL(F(x; θ)) · y + DL(1 − F(x; θ)) · (1 − y)]. Algorithm 3.1 can then be modified to minimize J̃ instead of J by changing the initialization of the error e from e ← x_{L+1} − y to

e ← −DL(F(x; θ)) · y + DL(1 − F(x; θ)) · (1 − y).

3.3 Higher-Order Loss Function

Suppose that another term is added to the loss function to penalize the first-order derivative of F(x; θ), as in [3] or [4] for example. This can be represented using

R(x; θ) := (1/2) ∥DF(x; θ) · v_x − β_x∥^2, (17)

for some v_x ∈ E_1 and β_x ∈ E_{L+1}.
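Returning briefly to Remark 3.5: since L has elementwise operation log and log′(s) = 1/s, DL(v) acts as componentwise division by v, so the modified error initialization can be computed in closed form. A minimal illustrative sketch (the function name is hypothetical, and the outputs are assumed to lie strictly in (0, 1)):

```python
import numpy as np

# Error initialization for the cross-entropy loss of Remark 3.5.
# D L(v) · u = u / v componentwise, since L's elementwise operation is log, so
#   e = -D L(F(x)) · y + D L(1 - F(x)) · (1 - y) = -y/F(x) + (1 - y)/(1 - F(x)).
def cross_entropy_error(Fx, y):
    return -y / Fx + (1.0 - y) / (1.0 - Fx)

Fx = np.array([0.8, 0.3])   # network outputs, assumed in (0, 1)
y = np.array([1.0, 0.0])    # binary targets
e = cross_entropy_error(Fx, y)
assert np.allclose(e, [-1.0 / 0.8, 1.0 / 0.7])
```

The rest of Algorithm 3.1 is unchanged; only this first backward-pass assignment of e differs.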
When β_x = 0, minimizing R(x; θ) promotes invariance of the network in the direction of v_x. Similarly to Remark 3.4, R can be added to J to create a new loss function

J_R(x; θ) = J(x; θ) + µR(x; θ), (18)

where µ ∈ R_{≥0} determines the amount that the higher-order term contributes to the loss function. Note that R can be extended additively to contain multiple terms:

R(x; θ) = ∑_{(v_x, β_x) ∈ B_x} (1/2) ∥DF(x; θ) · v_x − β_x∥^2, (19)

where B_x is a finite set of pairs (v_x, β_x) for each data point x.

Theorem 3.6. For all i ∈ {1, …, L}, with R defined as in (17),

∇_{θ_i} R(x; θ) = (∇_{θ_i} DF(x; θ) ⌞ v_x)* · (DF(x; θ) · v_x − β_x). (20)

Proof. From (17),

∇_{θ_i} R(x; θ) · U_i = ⟨DF(x; θ) · v_x − β_x, ∇_{θ_i} DF(x; θ) · (U_i, v_x)⟩
= ⟨DF(x; θ) · v_x − β_x, (∇_{θ_i} DF(x; θ) ⌞ v_x) · U_i⟩
= ⟨(∇_{θ_i} DF(x; θ) ⌞ v_x)* · (DF(x; θ) · v_x − β_x), U_i⟩

for all U_i ∈ H_i, i ∈ {1, …, L}. Thus, (20) follows.

Some preliminary results will be given before (20) can be recursively computed. Note again in this section that x_i = α_{i-1}(x) is the state variable at layer i for input data x, where the map α_i is defined in (9).

Lemma 3.7. For any x ∈ E_1 and i ∈ {1, …, L},

Dα_i(x) = Df_i(x_i) · Dα_{i-1}(x).

Proof. This is proven using the chain rule, since α_i = f_i ∘ α_{i-1} for all i ∈ {1, …, L}.

Lemma 3.7 defines forward propagation through the tangent network, in the spirit of [4]. Note that since α_L = F, Dα_L = DF. This implies that Lemma 3.7 is needed for calculating DF(x; θ) · v_x. Now, tangent backpropagation will be described.

Theorem 3.8 (Tangent Backpropagation). For any x, v ∈ E_1 and i ∈ {1, . . .
, L},

((Dα_{i-1}(x) · v) ⌟ D^2ω_i(x_i))* = D*f_i(x_i) · ((Dα_i(x) · v) ⌟ D^2ω_{i+1}(x_{i+1}))* + ((Dα_{i-1}(x) · v) ⌟ D^2f_i(x_i))* · D*ω_{i+1}(x_{i+1}),

where α_i is defined in (9) and ω_i is defined in (10).

Proof. Let v_1, v_2, and z ∈ E_i. Then,

(v_1 ⌟ D^2ω_i(z)) · v_2 = D^2ω_i(z) · (v_1, v_2)
= D^2(ω_{i+1} ∘ f_i)(z) · (v_1, v_2)
= D^2ω_{i+1}(f_i(z)) · (Df_i(z) · v_1, Df_i(z) · v_2) + Dω_{i+1}(f_i(z)) · D^2f_i(z) · (v_1, v_2)
= ((Df_i(z) · v_1) ⌟ D^2ω_{i+1}(f_i(z))) · Df_i(z) · v_2 + Dω_{i+1}(f_i(z)) · (v_1 ⌟ D^2f_i(z)) · v_2,

where the third equality comes from Lemma 2.2. The operator v_1 ⌟ D^2ω_i(z) can thus be written as

v_1 ⌟ D^2ω_i(z) = ((Df_i(z) · v_1) ⌟ D^2ω_{i+1}(f_i(z))) · Df_i(z) + Dω_{i+1}(f_i(z)) · (v_1 ⌟ D^2f_i(z)).

By taking the adjoint,

(v_1 ⌟ D^2ω_i(z))* = D*f_i(z) · ((Df_i(z) · v_1) ⌟ D^2ω_{i+1}(f_i(z)))* + (v_1 ⌟ D^2f_i(z))* · D*ω_{i+1}(f_i(z)).

Set v_1 = Dα_{i-1}(x) · v and z = x_i to obtain the final result:

((Dα_{i-1}(x) · v) ⌟ D^2ω_i(x_i))* = D*f_i(x_i) · ((Dα_i(x) · v) ⌟ D^2ω_{i+1}(x_{i+1}))* + ((Dα_{i-1}(x) · v) ⌟ D^2f_i(x_i))* · D*ω_{i+1}(x_{i+1}),

where Dα_i(x) · v = Df_i(x_i) · Dα_{i-1}(x) · v from Lemma 3.7 and x_{i+1} = f_i(x_i).

Theorem 3.8 provides a recursive update formula for ((Dα_{i-1}(x) · v) ⌟ D^2ω_i(x_i))*, which backpropagates the error through the tangent network via multiplication by D*f_i(x_i) and the addition of another term. Recall that the map D*ω_{i+1}(x_{i+1}) is calculated recursively using Theorem 3.3. Now, the main result for calculating ∇_{θ_i} R(x; θ) is presented.

Theorem 3.9. For any x, v ∈ E_1 and i ∈ {1, . . .
, L},

(∇_{θ_i} DF(x; θ) ⌞ v)* = ∇*_{θ_i} f_i(x_i) · ((Dα_i(x) · v) ⌟ D^2ω_{i+1}(x_{i+1}))* + ((Dα_{i-1}(x) · v) ⌟ D∇_{θ_i} f_i(x_i))* · D*ω_{i+1}(x_{i+1}),

where F(x; θ) = (f_L ∘ ⋯ ∘ f_1)(x), α_i is defined as in (9), and ω_i is defined as in (10).

Proof. For any i ∈ {1, …, L} and U_i ∈ H_i,

(∇_{θ_i} DF(x; θ) ⌞ v) · U_i = ∇_{θ_i} DF(x; θ) · (U_i, v)
= D(∇_{θ_i} F(x; θ) · U_i) · v
= D(Dω_{i+1}(α_i(x)) · ∇_{θ_i} f_i(α_{i-1}(x)) · U_i) · v
= D^2ω_{i+1}(x_{i+1}) · (Dα_i(x) · v, ∇_{θ_i} f_i(x_i) · U_i) + Dω_{i+1}(x_{i+1}) · D∇_{θ_i} f_i(x_i) · (Dα_{i-1}(x) · v, U_i)
= ((Dα_i(x) · v) ⌟ D^2ω_{i+1}(x_{i+1})) · ∇_{θ_i} f_i(x_i) · U_i + Dω_{i+1}(x_{i+1}) · ((Dα_{i-1}(x) · v) ⌟ D∇_{θ_i} f_i(x_i)) · U_i.

Since this holds for all U_i ∈ H_i,

∇_{θ_i} DF(x; θ) ⌞ v = ((Dα_i(x) · v) ⌟ D^2ω_{i+1}(x_{i+1})) · ∇_{θ_i} f_i(x_i) + Dω_{i+1}(x_{i+1}) · ((Dα_{i-1}(x) · v) ⌟ D∇_{θ_i} f_i(x_i)).

Taking the adjoint of this, using the direction-reversing property of the adjoint, proves the theorem.

Algorithm 3.2 presents a single iteration of a gradient descent algorithm to minimize J_R directly over the parameter set θ = {θ_1, …, θ_L}. This formula extends linearly to a batch of updates over several data points. To extend this to R defined with multiple (v_x, β_x) pairs as in (19), there must be a set V_j = {v_{j,1}, …, v_{j,L+1}} calculated for each pair; in Algorithm 3.2, only the one set {v_1, …, v_{L+1}} is calculated.

4 Application 1: Standard Multilayer Perceptron

The first network considered is a standard multilayer perceptron (MLP). The input data here is x ∈ R^{n_1}, and the output is F ∈ R^{n_{L+1}} when the MLP is assumed to have L layers.
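Before detailing the layer maps, the generic descent iteration of Algorithm 3.1 can be previewed numerically for exactly this kind of network. The sketch below assumes tanh layers x_{i+1} = tanh(W_i x_i + b_i), the form developed in the rest of this section; the names forward and descent_iteration are illustrative, not from the paper.

```python
import numpy as np

# One coordinate-free descent iteration (in the spirit of Algorithm 3.1) for an
# MLP with layers f_i(x) = tanh(W_i x + b_i). Illustrative sketch only.

def forward(x, params):
    xs = [x]                              # xs[i] holds x_{i+1} of the paper
    for W, b in params:
        xs.append(np.tanh(W @ xs[-1] + b))
    return xs

def descent_iteration(x, y, params, eta):
    xs = forward(x, params)
    e = xs[-1] - y                        # e = x_{L+1} - y (omega_{L+1} = identity)
    for i in range(len(params) - 1, -1, -1):
        W, b = params[i]
        z = W @ xs[i] + b
        s = (1.0 - np.tanh(z) ** 2) * e   # S_i'(z_i) ⊙ e
        grad_W = np.outer(s, xs[i])       # (S_i'(z_i) ⊙ e) x_i^T
        grad_b = s
        e = W.T @ s                       # e <- D* f_i(x_i) · e, as in (16)
        params[i] = (W - eta * grad_W, b - eta * grad_b)
    return params

# Usage: repeated iterations on a single data point should reduce the loss J.
rng = np.random.default_rng(3)
params = [(0.5 * rng.standard_normal((4, 3)), np.zeros(4)),
          (0.5 * rng.standard_normal((2, 4)), np.zeros(2))]
x, y = rng.standard_normal(3), np.array([0.3, -0.2])
loss = lambda p: 0.5 * np.sum((forward(x, p)[-1] - y) ** 2)
l0 = loss(params)
for _ in range(50):
    params = descent_iteration(x, y, params, 0.1)
assert loss(params) < l0
```

The backward loop computes the error with the stored pre-update weights before overwriting them, mirroring the role of θ̃_i in Algorithm 3.1.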
The single-layer function f_i : R^{n_i} × (R^{n_{i+1} × n_i} × R^{n_{i+1}}) → R^{n_{i+1}} takes in the data at the i-th layer — x_i ∈ R^{n_i} — along with parameters W_i ∈ R^{n_{i+1} × n_i} and b_i ∈ R^{n_{i+1}}, and outputs the data at the (i+1)-th layer, i.e. x_{i+1} = f_i(x_i; W_i, b_i) ∈ R^{n_{i+1}}. The dependence of f_i on its parameters (W_i, b_i) will often be suppressed throughout this section, i.e. f_i(x_i; W_i, b_i) ≡ f_i(x_i), for convenience when composing functions. It is assumed that every vector space used here is equipped with the usual Euclidean inner product. Thus, the inner product of two matrices or vectors A and B of equal size is computed as ⟨A, B⟩ = tr(A^T B).

Algorithm 3.2 One iteration of gradient descent for a higher-order loss function

function DESCENTITERATION(x, v_x, β_x, y, θ_1, …, θ_L, η, µ)
    x_1 ← x
    v_1 ← v_x                                ▷ v_i = Dα_{i−1}(x) · v_x and Dα_0(x) = identity
    for i ∈ {1, …, L} do                     ▷ x_{L+1} = F(x; θ) and v_{L+1} = DF(x; θ) · v_x
        x_{i+1} ← f_i(x_i)
        v_{i+1} ← Df_i(x_i) · v_i            ▷ Lemma 3.7
    end for
    for i ∈ {L, …, 1} do
        θ̃_i ← θ_i                           ▷ Store θ_i for updating θ_{i−1}
        if i = L then
            e_t ← 0                          ▷ e_t = (v_{i+1} ⌟ D^2ω_{i+1}(x_{i+1}))* · (v_{L+1} − β_x)
            e_v ← v_{L+1} − β_x              ▷ e_v = D*ω_{i+1}(x_{i+1}) · (v_{L+1} − β_x)
            e_y ← x_{L+1} − y                ▷ e_y = D*ω_{i+1}(x_{i+1}) · (x_{L+1} − y)
        else                                 ▷ Calculate D*f_{i+1}(x_{i+1}) with θ̃_{i+1} in this block
            e_t ← D*f_{i+1}(x_{i+1}) · e_t + (v_{i+1} ⌟ D^2f_{i+1}(x_{i+1}))* · e_v   ▷ Thm. 3.8; use old e_v
            e_v ← D*f_{i+1}(x_{i+1}) · e_v   ▷ Thm. 3.3
            e_y ← D*f_{i+1}(x_{i+1}) · e_y   ▷ Thm. 3.3
        end if
        ∇_{θ_i} J(x; θ) ← ∇*_{θ_i} f_i(x_i) · e_y                                     ▷ Thms. 3.1 and 3.2
        ∇_{θ_i} R(x; θ) ← ∇*_{θ_i} f_i(x_i) · e_t + (v_i ⌟ D∇_{θ_i} f_i(x_i))* · e_v  ▷ Thms. 3.6 and 3.9
        θ_i ← θ_i − η (∇_{θ_i} J(x; θ) + µ ∇_{θ_i} R(x; θ))
    end for
end function
As a corollary, ⟨A, BC⟩ = ⟨B^T A, C⟩ = ⟨AC^T, B⟩ for any matrices or vectors A, B, and C such that the inner product ⟨A, BC⟩ is valid. Every vector in each R^{n_i} is treated as an n_i × 1 matrix by default.

The explicit action of the layer-wise function f_i can be described via an elementwise function S_i : R^{n_{i+1}} → R^{n_{i+1}}, with associated elementwise operation σ_i : R → R, as

f_i(x_i) = S_i(W_i · x_i + b_i) (21)

for any x_i ∈ R^{n_i}, where · denotes matrix-vector multiplication. The elementwise function S_i : R^{n_{i+1}} → R^{n_{i+1}} is defined as in (3). The operation σ_i is nonlinear, so S_i is known as an elementwise nonlinear function, or elementwise nonlinearity. The derivative maps DS_i and D^2S_i can be calculated using Propositions 2.3 and 2.4, respectively.

Remark 4.1. The maps DS_i and D^2S_i clearly depend on the choice of nonlinearity σ_i. Some common choices and their derivatives are given in Table 1. Note that H is the Heaviside step function, and sinh and cosh are the hyperbolic sine and cosine functions, respectively. Table 1 is not a complete description of all possible nonlinearities.

Table 1: Common nonlinearities, along with their first and second derivatives

Name      | Definition                 | First Derivative                       | Second Derivative
tanh      | σ_i(x) := sinh(x)/cosh(x)  | σ_i′(x) = 4cosh^2(x)/(cosh(2x) + 1)^2  | σ_i′′(x) = −8 sinh(2x) cosh^2(x)/(cosh(2x) + 1)^3
Sigmoidal | σ_i(x) := 1/(1 + exp(−x))  | σ_i′(x) = σ_i(x)(1 − σ_i(x))           | σ_i′′(x) = σ_i′(x)(1 − 2σ_i(x))
Ramp      | σ_i(x) := max(0, x)        | σ_i′(x) = H(x)                         | σ_i′′(x) = 0

4.1 Gradient Descent for Standard Loss Function

Consider the loss function J given in (12). Its gradient with respect to the parameters W_i and b_i can now be calculated separately at each layer i ∈ {1, . . .
, L}, since W_i and b_i are both independent of each other and independent of other layers j ≠ i. First, the derivatives of f_i and their adjoints are computed.

Lemma 4.2. Consider the function f_i defined in (21). Then, for any x_i ∈ R^{n_i} and any U_i ∈ R^{n_{i+1} × n_i},

∇_{W_i} f_i(x_i) · U_i = DS_i(z_i) · U_i · x_i, (22)
∇_{b_i} f_i(x_i) = DS_i(z_i), (23)

where z_i = W_i · x_i + b_i, and

Df_i(x_i) = DS_i(z_i) · W_i. (24)

This holds for any i ∈ {1, …, L}.

Proof. For any U_i ∈ R^{n_{i+1} × n_i},

∇_{W_i} f_i(x_i) · U_i = (d/dt) (S_i((W_i + tU_i) · x_i + b_i)) |_{t=0} = DS_i(W_i · x_i + b_i) · U_i · x_i,

which proves (22). The other equations can be proven similarly.

Lemma 4.3. For any i ∈ {1, …, L}, x_i ∈ R^{n_i}, and u ∈ R^{n_{i+1}},

∇*_{W_i} f_i(x_i) · u = (S_i′(z_i) ⊙ u) x_i^T, (25)
∇*_{b_i} f_i(x_i) = DS_i(z_i), (26)

where z_i = W_i · x_i + b_i, and

D*f_i(x_i) = W_i^T · DS_i(z_i). (27)

Proof. By Lemma 4.2, for any u ∈ R^{n_{i+1}} and any U_i ∈ R^{n_{i+1} × n_i},

⟨u, ∇_{W_i} f_i(x_i) · U_i⟩ = ⟨u, DS_i(z_i) · U_i · x_i⟩ = ⟨DS_i(z_i) · u, U_i · x_i⟩ = ⟨(DS_i(z_i) · u) x_i^T, U_i⟩,

which implies

∇*_{W_i} f_i(x_i) · u = (DS_i(z_i) · u) x_i^T = (S_i′(z_i) ⊙ u) x_i^T, (28)

which proves (25). Equations (26) and (27) follow from taking the adjoints of (23) and (24) and using the self-adjointness of DS_i(z_i).

The next result demonstrates how to backpropagate the error in the network.

Theorem 4.4 (Backpropagation in MLP). For f_i defined as in (21) and ω_i as defined in (10),

Dω_i(x_i) = Dω_{i+1}(x_{i+1}) · DS_i(z_i) · W_i, (29)

where x_{i+1} = f_i(x_i) for all i ∈ {1, …, L}, z_i = W_i · x_i + b_i, and ω_{L+1} is the identity. Furthermore, for any u ∈ R^{n_{L+1}},

D*ω_i(x_i) · u = W_i^T · (S_i′(z_i) ⊙ (D*ω_{i+1}(x_{i+1}) · u)). (30)

Proof. Pick any v ∈ R^{n_i}.
By Theorem 3.3 and Lemma 4.2,
$$D\omega_i(v) = D\omega_{i+1}(f_i(v)) \cdot Df_i(v) = D\omega_{i+1}(f_i(v)) \cdot DS_i(W_i \cdot v + b_i) \cdot W_i.$$
Then, setting $v = x_i$, equation (29) is proven since $x_{i+1} = f_i(x_i)$. Now, taking the adjoint of the above equation gives
$$D^*\omega_i(x_i) = W_i^* \cdot D^*S_i(z_i) \cdot D^*\omega_{i+1}(x_{i+1}),$$
where $W_i^* = W_i^T$. Also, $D^*S_i(z_i) = DS_i(z_i)$ from Proposition 2.3. Thus, applying $D^*\omega_i(x_i)$ to any $v \in \mathbb{R}^{n_{L+1}}$ gives
$$D^*\omega_i(x_i) \cdot v = W_i^T \cdot DS_i(z_i) \cdot D^*\omega_{i+1}(x_{i+1}) \cdot v = W_i^T \cdot \big( S_i'(z_i) \odot (D^*\omega_{i+1}(x_{i+1}) \cdot v) \big).$$
This is true for any $i \in \{1, \dots, L\}$, so the proof is complete. ∎

The above theorem demonstrates how to calculate $D^*\omega_i(x_i)$ recursively, which is needed to backpropagate the error throughout the network. This will be necessary to compute the main MLP result presented in the next theorem.

Theorem 4.5. Let $J$ be defined as in (12), $\theta = \{W_1, \dots, W_L, b_1, \dots, b_L\}$ represent the parameters, $x \in \mathbb{R}^{n_1}$ be an input with associated known output $y \in \mathbb{R}^{n_{L+1}}$, and $F(x;\theta)$ be defined as in (8). Then, the following equations hold for any $i \in \{1, \dots, L\}$:
$$\nabla_{W_i} J(x;\theta) = \big( S_i'(z_i) \odot (D^*\omega_{i+1}(x_{i+1}) \cdot e) \big)\, x_i^T, \tag{31}$$
$$\nabla_{b_i} J(x;\theta) = S_i'(z_i) \odot (D^*\omega_{i+1}(x_{i+1}) \cdot e), \tag{32}$$
where $x_i = \alpha_{i-1}(x)$, $z_i = W_i \cdot x_i + b_i$, and the prediction error $e$ is given by $e = F(x;\theta) - y \in \mathbb{R}^{n_{L+1}}$.

Proof. By Theorem 3.1,
$$\nabla_{W_i} J(x;\theta) = \nabla^*_{W_i} F(x;\theta) \cdot e, \tag{33}$$
$$\nabla_{b_i} J(x;\theta) = \nabla^*_{b_i} F(x;\theta) \cdot e. \tag{34}$$
From Theorem 3.2,
$$\nabla^*_{W_i} F(x;\theta) \cdot e = \nabla^*_{W_i} f_i(x_i) \cdot D^*\omega_{i+1}(x_{i+1}) \cdot e, \tag{35}$$
$$\nabla^*_{b_i} F(x;\theta) \cdot e = \nabla^*_{b_i} f_i(x_i) \cdot D^*\omega_{i+1}(x_{i+1}) \cdot e. \tag{36}$$
Recall that $D^*\omega_{i+1}(x_{i+1}) \cdot e$ is calculated recursively via Theorem 4.4. Then, (31) follows from (33), (35) and (25), i.e.
$$\nabla_{W_i} J(x;\theta) = \nabla^*_{W_i} F(x;\theta) \cdot e = \nabla^*_{W_i} f_i(x_i) \cdot D^*\omega_{i+1}(x_{i+1}) \cdot e = \big( S_i'(z_i) \odot (D^*\omega_{i+1}(x_{i+1}) \cdot e) \big)\, x_i^T.$$
Similarly, (32) follows from (34), (36) and (26), i.e.
$$\nabla_{b_i} J(x;\theta) = \nabla^*_{b_i} F(x;\theta) \cdot e = \nabla^*_{b_i} f_i(x_i) \cdot D^*\omega_{i+1}(x_{i+1}) \cdot e = S_i'(z_i) \odot (D^*\omega_{i+1}(x_{i+1}) \cdot e).$$
This completes the proof, which is valid for all $i \in \{1, \dots, L\}$. ∎

Given the above results, a gradient descent algorithm can be developed to minimize $J$ with respect to each $W_i$ and $b_i$, for a given data point $x$ and learning rate $\eta$. One iteration of this is given in Algorithm 4.1. The output of the algorithm is an updated version of each $W_i$ and $b_i$. This process can be extended additively to a batch of updates by summing the individual contributions of each $x$ to the gradient of $J(x;\theta)$.

4.2 Gradient Descent for Higher-Order Loss Function

The goal now is to perform a gradient descent iteration for a higher-order loss function of the form (18). Since the gradients of $J$ are already understood, it is only necessary to compute the gradients of $R$, defined in (17), with respect to $W_i$ and $b_i$ for all $i \in \{1, \dots, L\}$. This will involve forward and backward propagation through the tangent network, and then the calculation of $(\nabla\, DF(x;\theta)\, v)^*$, as in Theorem 3.9. First, relevant single-layer derivatives will be presented as in the previous section.

Algorithm 4.1 One iteration of gradient descent in MLP
function DESCENTITERATION($x, y, W_1, \dots, W_L, b_1, \dots, b_L, \eta$)
  $x_1 \leftarrow x$
  for $i \in \{1, \dots, L\}$ do  ▷ $x_{L+1} = F(x;\theta)$
    $z_i \leftarrow W_i \cdot x_i + b_i$
    $x_{i+1} \leftarrow S_i(z_i)$
  end for
  for $i \in \{L, \dots, 1\}$ do
    $\tilde W_i \leftarrow W_i$  ▷ store old $W_i$ for updating $W_{i-1}$
    if $i = L$ then  ▷ $e = D^*\omega_{i+1}(x_{i+1}) \cdot (x_{L+1} - y)$
      $e \leftarrow x_{L+1} - y$  ▷ $\omega_{L+1}$ = identity
    else
      $e \leftarrow \tilde W_{i+1}^T \cdot (S_{i+1}'(z_{i+1}) \odot e)$  ▷ (30)
    end if
    $\nabla_{b_i} J(x;\theta) \leftarrow S_i'(z_i) \odot e$  ▷ (32)
    $\nabla_{W_i} J(x;\theta) \leftarrow (S_i'(z_i) \odot e)\, x_i^T$  ▷ (31)
    $b_i \leftarrow b_i - \eta\, \nabla_{b_i} J(x;\theta)$
    $W_i \leftarrow W_i - \eta\, \nabla_{W_i} J(x;\theta)$
  end for
end function

Lemma 4.6. Consider the function $f_i$ defined in (21). Then, for any $x_i, v \in \mathbb{R}^{n_i}$ and $U_i \in \mathbb{R}^{n_{i+1} \times n_i}$,
$$(v \lrcorner D\nabla_{W_i} f_i(x_i)) \cdot U_i = D^2 S_i(z_i) \cdot (W_i \cdot v,\, U_i \cdot x_i) + DS_i(z_i) \cdot U_i \cdot v, \tag{37}$$
$$(v \lrcorner D\nabla_{b_i} f_i(x_i)) = (W_i \cdot v) \lrcorner D^2 S_i(z_i), \tag{38}$$
$$v \lrcorner D^2 f_i(x_i) = \big( (W_i \cdot v) \lrcorner D^2 S_i(z_i) \big) \cdot W_i, \tag{39}$$
where $z_i = W_i \cdot x_i + b_i$. Furthermore, for any $y \in \mathbb{R}^{n_{i+1}}$,
$$(v \lrcorner D\nabla_{W_i} f_i(x_i))^* \cdot y = [S_i''(z_i) \odot (W_i \cdot v) \odot y]\, x_i^T + [S_i'(z_i) \odot y]\, v^T, \tag{40}$$
$$(v \lrcorner D\nabla_{b_i} f_i(x_i))^* = (W_i \cdot v) \lrcorner D^2 S_i(z_i), \tag{41}$$
$$(v \lrcorner D^2 f_i(x_i))^* = W_i^T \cdot \big( (W_i \cdot v) \lrcorner D^2 S_i(z_i) \big). \tag{42}$$

Proof. First, equation (37) is proven directly:
$$(v \lrcorner D\nabla_{W_i} f_i(x_i)) \cdot U_i = D(\nabla_{W_i} f_i(x_i) \cdot U_i) \cdot v = D(DS_i(z_i) \cdot U_i \cdot x_i) \cdot v = D^2 S_i(z_i) \cdot (W_i \cdot v,\, U_i \cdot x_i) + DS_i(z_i) \cdot U_i \cdot v,$$
where the second equality comes from (22) and the last follows from Lemma 2.2. Equations (38) and (39) can be proven similarly. Next, equation (40) is proven directly. For any $y \in \mathbb{R}^{n_{i+1}}$,
$$\begin{aligned} \langle y,\, (v \lrcorner D\nabla_{W_i} f_i(x_i)) \cdot U_i \rangle &= \langle y,\, ((W_i \cdot v) \lrcorner D^2 S_i(z_i)) \cdot U_i \cdot x_i + DS_i(z_i) \cdot U_i \cdot v \rangle \\ &= \langle \big( ((W_i \cdot v) \lrcorner D^2 S_i(z_i)) \cdot y \big)\, x_i^T + [DS_i(z_i) \cdot y]\, v^T,\, U_i \rangle, \end{aligned}$$
since $DS_i(z_i)$ and $(W_i \cdot v) \lrcorner D^2 S_i(z_i)$ are both self-adjoint. Since this is true for any $U_i$,
$$(v \lrcorner D\nabla_{W_i} f_i(x_i))^* \cdot y = \big( ((W_i \cdot v) \lrcorner D^2 S_i(z_i)) \cdot y \big)\, x_i^T + [DS_i(z_i) \cdot y]\, v^T, \tag{43}$$
which is equation (40) once the definitions of $DS_i$ and $D^2 S_i$ are substituted in.
Equations (41) and (42) are direct consequences of (38) and (39), respectively, using the reversing property of the adjoint and the self-adjointness of $DS_i(z_i)$ and $(W_i \cdot v) \lrcorner D^2 S_i(z_i)$. ∎

Theorem 4.7. For $f_i$ defined as in (21), $\alpha_i$ defined as in (9), and $x, v \in \mathbb{R}^{n_1}$,
$$D\alpha_i(x) \cdot v = S_i'(z_i) \odot (W_i \cdot D\alpha_{i-1}(x) \cdot v),$$
where $x_i = \alpha_{i-1}(x)$, $z_i = W_i \cdot x_i + b_i$, and $i \in \{1, \dots, L\}$. Also, $D\alpha_0(x) \cdot v = v$.

Proof. For all $i \in \{1, \dots, L\}$, by Lemma 3.7, Proposition 2.3 and equation (24),
$$D\alpha_i(x) \cdot v = Df_i(\alpha_{i-1}(x)) \cdot D\alpha_{i-1}(x) \cdot v = S_i'(z_i) \odot (W_i \cdot D\alpha_{i-1}(x) \cdot v),$$
where $\alpha_0(x) = x$ and $x_i = \alpha_{i-1}(x)$. Furthermore, $D\alpha_0(x)$ is the identity since $\alpha_0$ is the identity. ∎

This is an explicit representation of the forward propagation through the tangent network. The next theorem describes the backpropagation through the tangent network.

Theorem 4.8 (Tangent Backpropagation in MLP). Let $\alpha_i$ and $\omega_i$ be defined as in (9) and (10), respectively. Let $f_i$ be defined as in (21). Then, for any $i \in \{1, \dots, L\}$, $x, v_1 \in \mathbb{R}^{n_1}$, and $v_2 \in \mathbb{R}^{n_{L+1}}$,
$$\begin{aligned} ((D\alpha_{i-1}(x) \cdot v_1) \lrcorner D^2\omega_i(x_i))^* \cdot v_2 ={}& W_i^T \cdot \big[ S_i'(z_i) \odot \big( ((D\alpha_i(x) \cdot v_1) \lrcorner D^2\omega_{i+1}(x_{i+1}))^* \cdot v_2 \big) \big] \\ &+ W_i^T \cdot \big\{ S_i''(z_i) \odot (W_i \cdot D\alpha_{i-1}(x) \cdot v_1) \odot (D^*\omega_{i+1}(x_{i+1}) \cdot v_2) \big\}, \end{aligned}$$
where $x_i = \alpha_{i-1}(x)$ and $z_i = W_i \cdot x_i + b_i$. Also, $((D\alpha_L(x) \cdot v_1) \lrcorner D^2\omega_{L+1}(x_{L+1}))^* \cdot v_2 = 0$.

Proof. Theorem 3.8 states that for any $i \in \{1, \dots, L\}$,
$$\begin{aligned} ((D\alpha_{i-1}(x) \cdot v_1) \lrcorner D^2\omega_i(x_i))^* \cdot v_2 ={}& D^*f_i(x_i) \cdot \big( ((D\alpha_i(x) \cdot v_1) \lrcorner D^2\omega_{i+1}(x_{i+1}))^* \cdot v_2 \big) \\ &+ ((D\alpha_{i-1}(x) \cdot v_1) \lrcorner D^2 f_i(x_i))^* \cdot D^*\omega_{i+1}(x_{i+1}) \cdot v_2. \end{aligned} \tag{44}$$
By (27), $D^*f_i(x_i) = W_i^T \cdot DS_i(z_i)$. Furthermore, by (42),
$$((D\alpha_{i-1}(x) \cdot v_1) \lrcorner D^2 f_i(x_i))^* = W_i^T \cdot \big( (W_i \cdot D\alpha_{i-1}(x) \cdot v_1) \lrcorner D^2 S_i(z_i) \big).$$
These results can be substituted into equation (44) to obtain the final result:
$$\begin{aligned} ((D\alpha_{i-1}(x) \cdot v_1) \lrcorner D^2\omega_i(x_i))^* \cdot v_2 ={}& W_i^T \cdot DS_i(z_i) \cdot \big( ((D\alpha_i(x) \cdot v_1) \lrcorner D^2\omega_{i+1}(x_{i+1}))^* \cdot v_2 \big) \\ &+ W_i^T \cdot \big( (W_i \cdot D\alpha_{i-1}(x) \cdot v_1) \lrcorner D^2 S_i(z_i) \big) \cdot D^*\omega_{i+1}(x_{i+1}) \cdot v_2 \\ ={}& W_i^T \cdot \big[ S_i'(z_i) \odot \big( ((D\alpha_i(x) \cdot v_1) \lrcorner D^2\omega_{i+1}(x_{i+1}))^* \cdot v_2 \big) \big] \\ &+ W_i^T \cdot \big\{ S_i''(z_i) \odot (W_i \cdot D\alpha_{i-1}(x) \cdot v_1) \odot (D^*\omega_{i+1}(x_{i+1}) \cdot v_2) \big\}. \end{aligned}$$
This is true even for $i = 1$ since $\alpha_0$ is the identity. For $i = L + 1$, $\omega_{L+1}$ is also the identity, so $((D\alpha_L(x) \cdot v_1) \lrcorner D^2\omega_{L+1}(x_{L+1}))^*$ is the zero operator. Thus, the result is proven. ∎

Note that the first term in the tangent backpropagation expression in Theorem 4.8 is the recursive part, and the second term can be calculated at each stage once $D^*\omega_{i+1}(x_{i+1})$ is calculated. The maps $(\nabla_{W_i} DF(x;\theta)\, v)^*$ and $(\nabla_{b_i} DF(x;\theta)\, v)^*$ are calculated in the next theorem as the final step in the gradient descent puzzle.

Theorem 4.9. Let $v \in \mathbb{R}^{n_1}$ and $e \in \mathbb{R}^{n_{L+1}}$. Then, with $\alpha_i$ and $\omega_i$ defined as in (9) and (10), respectively, and $x_i = \alpha_{i-1}(x)$,
$$\begin{aligned} (\nabla_{W_i} DF(x;\theta)\, v)^* \cdot e ={}& \big[ S_i'(z_i) \odot \big( ((D\alpha_i(x) \cdot v) \lrcorner D^2\omega_{i+1}(x_{i+1}))^* \cdot e \big) \big]\, x_i^T \\ &+ \big( S_i''(z_i) \odot (W_i \cdot D\alpha_{i-1}(x) \cdot v) \odot (D^*\omega_{i+1}(x_{i+1}) \cdot e) \big)\, x_i^T \\ &+ \big( S_i'(z_i) \odot (D^*\omega_{i+1}(x_{i+1}) \cdot e) \big)\, (D\alpha_{i-1}(x) \cdot v)^T, \end{aligned} \tag{45}$$
$$\begin{aligned} (\nabla_{b_i} DF(x;\theta)\, v)^* \cdot e ={}& S_i'(z_i) \odot \big( ((D\alpha_i(x) \cdot v) \lrcorner D^2\omega_{i+1}(x_{i+1}))^* \cdot e \big) \\ &+ S_i''(z_i) \odot (W_i \cdot D\alpha_{i-1}(x) \cdot v) \odot [D^*\omega_{i+1}(x_{i+1}) \cdot e], \end{aligned} \tag{46}$$
for any $i \in \{1, \dots, L\}$, where $z_i = W_i \cdot x_i + b_i$.

Proof.
Theorem 3.9 says that for any $e \in \mathbb{R}^{n_{L+1}}$,
$$\begin{aligned} (\nabla_{\theta_i} DF(x;\theta)\, v)^* \cdot e ={}& \nabla^*_{\theta_i} f_i(x_i) \cdot \big( ((D\alpha_i(x) \cdot v) \lrcorner D^2\omega_{i+1}(x_{i+1}))^* \cdot e \big) \\ &+ ((D\alpha_{i-1}(x) \cdot v) \lrcorner D\nabla_{\theta_i} f_i(x_i))^* \cdot D^*\omega_{i+1}(x_{i+1}) \cdot e, \end{aligned} \tag{47}$$
where $\theta_i$ is a generic parameter at layer $i$, for $i \in \{1, \dots, L\}$. When $\theta_i = W_i$, equations (28) and (43) can be substituted into (47) to obtain
$$\begin{aligned} (\nabla_{W_i} DF(x;\theta)\, v)^* \cdot e ={}& \big[ DS_i(z_i) \cdot \big( ((D\alpha_i(x) \cdot v) \lrcorner D^2\omega_{i+1}(x_{i+1}))^* \cdot e \big) \big]\, x_i^T \\ &+ \big[ ((W_i \cdot D\alpha_{i-1}(x) \cdot v) \lrcorner D^2 S_i(z_i)) \cdot D^*\omega_{i+1}(x_{i+1}) \cdot e \big]\, x_i^T \\ &+ \big( DS_i(z_i) \cdot D^*\omega_{i+1}(x_{i+1}) \cdot e \big)\, (D\alpha_{i-1}(x) \cdot v)^T. \end{aligned} \tag{48}$$
Equation (45) is then obtained upon substituting the expressions for $DS_i(z_i)$ and $D^2 S_i(z_i)$ into (48). Similarly, when $\theta_i = b_i$, equations (26) and (41) can be substituted into (47) to obtain
$$\begin{aligned} (\nabla_{b_i} DF(x;\theta)\, v)^* \cdot e ={}& DS_i(z_i) \cdot \big( ((D\alpha_i(x) \cdot v) \lrcorner D^2\omega_{i+1}(x_{i+1}))^* \cdot e \big) \\ &+ ((W_i \cdot D\alpha_{i-1}(x) \cdot v) \lrcorner D^2 S_i(z_i)) \cdot D^*\omega_{i+1}(x_{i+1}) \cdot e. \end{aligned} \tag{49}$$
As before, equation (46) is obtained by substituting the expressions for $DS_i(z_i)$ and $D^2 S_i(z_i)$ into (49). ∎

From Theorem 3.6, for $v_x \in \mathbb{R}^{n_1}$ and $\beta_x \in \mathbb{R}^{n_{L+1}}$,
$$\nabla_{\theta_i} R(x;\theta) = (\nabla_{\theta_i} DF(x;\theta)\, v_x)^* \cdot (DF(x;\theta) \cdot v_x - \beta_x),$$
for $\theta_i$ equal to one of $W_i$ or $b_i$. Substituting $v = v_x$ and $e = DF(x;\theta) \cdot v_x - \beta_x$ into the formulas in Theorem 4.9 yields $\nabla_{W_i} R(x;\theta)$ and $\nabla_{b_i} R(x;\theta)$. Thus, one iteration of a gradient descent algorithm to minimize $J_R = J + \mu R$ can now be given, since $\nabla_{\theta_i} J(x;\theta)$ and $\nabla_{\theta_i} R(x;\theta)$ can both be calculated. This is described in Algorithm 4.2.

5 Application 2: Deep Autoencoder

Now, a $2L$-layer autoencoder (AE) of the form given in Murphy, Chapter 28 [2] is described in the framework of Section 3.
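Before developing the autoencoder, the MLP results just derived can be verified numerically: the gradient formulas of Theorem 4.5, implemented as in Algorithm 4.1, should match a finite-difference approximation of $J$. The sketch below is a minimal NumPy illustration, not from the paper; the sigmoid nonlinearity, the layer sizes, and the finite-difference tolerance are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Ws, bs):
    """Forward pass: z_i = W_i x_i + b_i, x_{i+1} = S_i(z_i)."""
    xs, zs = [x], []
    for W, b in zip(Ws, bs):
        zs.append(W @ xs[-1] + b)
        xs.append(sigmoid(zs[-1]))
    return xs, zs

def gradients(x, y, Ws, bs):
    """One backward pass in the spirit of Algorithm 4.1."""
    xs, zs = forward(x, Ws, bs)
    e = xs[-1] - y                      # prediction error e = F(x; theta) - y
    gW, gb = [], []
    for i in reversed(range(len(Ws))):
        s = sigmoid(zs[i]) * (1 - sigmoid(zs[i]))  # S'_i(z_i) for the sigmoid
        delta = s * e                              # S'_i(z_i) ⊙ (D*omega_{i+1} · e)
        gW.insert(0, np.outer(delta, xs[i]))       # eq. (31)
        gb.insert(0, delta)                        # eq. (32)
        e = Ws[i].T @ delta                        # eq. (30): backpropagate the error
    return gW, gb

rng = np.random.default_rng(0)
sizes = [4, 5, 3]                       # hypothetical n_1, n_2, n_3
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal(m) for m in sizes[1:]]
x, y = rng.standard_normal(4), rng.standard_normal(3)

def J(Ws_):
    xs, _ = forward(x, Ws_, bs)
    return 0.5 * np.sum((xs[-1] - y) ** 2)

# Central-difference check of dJ/dW_1[0, 0] against eq. (31)
gW, gb = gradients(x, y, Ws, bs)
eps = 1e-6
Wp = [W.copy() for W in Ws]; Wp[0][0, 0] += eps
Wm = [W.copy() for W in Ws]; Wm[0][0, 0] -= eps
fd = (J(Wp) - J(Wm)) / (2 * eps)
assert abs(gW[0][0, 0] - fd) < 1e-5
```

The same check can be run entry by entry over every $W_i$ and $b_i$; agreement with finite differences is the standard sanity test for a hand-derived backpropagation.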
The layerwise function $f_i$ is slightly more complicated in this case because there is weight-sharing between different layers of the network. Introduce a function $\xi \colon \{1, \dots, 2L\} \to \{1, \dots, 2L\}$ to aid in network representation, defined as follows:
$$\xi(i) = 2L - i + 1, \quad \forall i \in \{1, \dots, 2L\}. \tag{50}$$
This function has the property that $(\xi \circ \xi)(i) = i$ for all $i$. Then, the layerwise function $f_i \colon \mathbb{R}^{n_i} \times (\mathbb{R}^{n_{i+1} \times n_i} \times \mathbb{R}^{n_{i+1}}) \to \mathbb{R}^{n_{i+1}}$ can be represented in the following manner:
$$f_i(x_i; W_i, b_i) = S_i(W_i \cdot x_i + b_i), \quad i \in \{1, \dots, L\},$$
$$f_i(x_i; W_{\xi(i)}, b_i) = S_i\big( \tau_i(W_{\xi(i)}) \cdot x_i + b_i \big), \quad i \in \{L+1, \dots, 2L\},$$
where $x_i \in \mathbb{R}^{n_i}$ is the input to the $i$th layer, $W_i \in \mathbb{R}^{n_{i+1} \times n_i}$ is the weight matrix, $b_i \in \mathbb{R}^{n_{i+1}}$ is the bias vector at layer $i$, $S_i \colon \mathbb{R}^{n_{i+1}} \to \mathbb{R}^{n_{i+1}}$ is the elementwise nonlinearity with corresponding elementwise operation $\sigma_i$, and $\tau_i \in L(\mathbb{R}^{n_{\xi(i)+1} \times n_{\xi(i)}}; \mathbb{R}^{n_{\xi(i)} \times n_{\xi(i)+1}})$ governs how the weights are shared between layers $i$ and $\xi(i)$. The structure of the autoencoder is to encode for the first $L$ layers and decode for the next $L$ layers, with the dimensions being preserved according to
$$n_{L+j} = n_{L-j+2}, \quad \forall j \in \{2, \dots, L+1\}.$$
In [2] and other similar examples, $\tau_i$ is the matrix transpose operator at each layer, although it is kept general in this paper. For that particular case, however, the adjoint is calculated according to the following lemma.

Lemma 5.1. Let $\tau \in L(\mathbb{R}^{n \times m}; \mathbb{R}^{m \times n})$ be defined as $\tau(U) = U^T$ for all $U \in \mathbb{R}^{n \times m}$. Then, $\tau^*(W) = W^T$ for all $W \in \mathbb{R}^{m \times n}$.

Algorithm 4.2 One iteration of gradient descent for higher-order loss in MLP
function DESCENTITERATION($x, v_x, \beta_x, y, W_1, \dots, W_L, b_1, \dots, b_L, \eta, \mu$)
  $x_1 \leftarrow x$
  $v_1 \leftarrow v_x$  ▷ $v_i = D\alpha_{i-1}(x) \cdot v_x$ and $D\alpha_0(x)$ = identity
  for $i \in \{1, \dots, L\}$ do
▷ $x_{L+1} = F(x;\theta)$ and $v_{L+1} = DF(x;\theta) \cdot v_x$
    $z_i \leftarrow W_i \cdot x_i + b_i$
    $x_{i+1} \leftarrow S_i(z_i)$
    $v_{i+1} \leftarrow S_i'(z_i) \odot (W_i \cdot v_i)$  ▷ Theorem 4.7
  end for
  for $i \in \{L, \dots, 1\}$ do
    $\tilde W_i \leftarrow W_i$  ▷ store $W_i$ for updating $W_{i-1}$
    if $i = L$ then
      $e_t \leftarrow 0$  ▷ $e_t = (v_{i+1} \lrcorner D^2\omega_{i+1}(x_{i+1}))^* \cdot (v_{L+1} - \beta_x)$
      $e_v \leftarrow v_{L+1} - \beta_x$  ▷ $e_v = D^*\omega_{i+1}(x_{i+1}) \cdot (v_{L+1} - \beta_x)$
      $e_y \leftarrow x_{L+1} - y$  ▷ $e_y = D^*\omega_{i+1}(x_{i+1}) \cdot (x_{L+1} - y)$
    else
      $e_t \leftarrow \tilde W_{i+1}^T \cdot \big( S_{i+1}'(z_{i+1}) \odot e_t + S_{i+1}''(z_{i+1}) \odot (\tilde W_{i+1} \cdot v_{i+1}) \odot e_v \big)$  ▷ Theorem 4.8
      $e_v \leftarrow \tilde W_{i+1}^T \cdot (S_{i+1}'(z_{i+1}) \odot e_v)$  ▷ (30); update $e_v$ after the update of $e_t$
      $e_y \leftarrow \tilde W_{i+1}^T \cdot (S_{i+1}'(z_{i+1}) \odot e_y)$  ▷ (30)
    end if
    $\nabla_{b_i} J(x;\theta) \leftarrow S_i'(z_i) \odot e_y$  ▷ (32)
    $\nabla_{W_i} J(x;\theta) \leftarrow (S_i'(z_i) \odot e_y)\, x_i^T$  ▷ (31)
    $\nabla_{b_i} R(x;\theta) \leftarrow S_i'(z_i) \odot e_t + S_i''(z_i) \odot (W_i \cdot v_i) \odot e_v$  ▷ Thm. 4.9 for this and the next line
    $\nabla_{W_i} R(x;\theta) \leftarrow \big( S_i'(z_i) \odot e_t + S_i''(z_i) \odot (W_i \cdot v_i) \odot e_v \big)\, x_i^T + (S_i'(z_i) \odot e_v)\, v_i^T$
    $W_i \leftarrow W_i - \eta\, (\nabla_{W_i} J(x;\theta) + \mu\, \nabla_{W_i} R(x;\theta))$
    $b_i \leftarrow b_i - \eta\, (\nabla_{b_i} J(x;\theta) + \mu\, \nabla_{b_i} R(x;\theta))$
  end for
end function

Proof. For any $U \in \mathbb{R}^{n \times m}$ and $W \in \mathbb{R}^{m \times n}$,
$$\langle W, \tau(U) \rangle = \langle W, U^T \rangle = \mathrm{tr}(WU) = \mathrm{tr}(UW) = \langle U, W^T \rangle,$$
which proves the result by the symmetry of $\langle \cdot, \cdot \rangle$. ∎

Now, introduce the following notation to represent the $f_i$ in a more compact manner:
$$K_i = \begin{cases} W_i, & 1 \le i \le L, \\ \tau_i(W_{\xi(i)}), & L+1 \le i \le 2L. \end{cases}$$
Then, the action of layer $i$, namely $f_i$, can be simply represented as
$$f_i(x_i) = S_i(K_i \cdot x_i + b_i), \tag{51}$$
where the explicit dependence on the parameters $K_i$ and $b_i$ is suppressed and implied when discussing $f_i$. The network prediction is given by
$$F(x;\theta) = f_{2L} \circ \cdots \circ f_1(x), \tag{52}$$
where $\theta = \{W_1, \dots, W_L, b_1, \dots, b_{2L}\}$ and $x \in \mathbb{R}^{n_1}$.
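The index bookkeeping behind $\xi$ and $K_i$ is easy to misread, so a small sketch may help. The following NumPy fragment is illustrative only: the layer sizes are assumptions, and it fixes the special case $\tau_i(W) = W^T$ discussed in the text.

```python
import numpy as np

L = 2
sizes = [6, 4, 2, 4, 6]          # n_1 .. n_{2L+1}, satisfying n_{L+j} = n_{L-j+2}

def xi(i):
    """xi(i) = 2L - i + 1, eq. (50); an involution pairing encode/decode layers."""
    return 2 * L - i + 1

rng = np.random.default_rng(1)
# Free weights are W_1..W_L only; decoder weights are tied via tau (transpose here).
Ws = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(L)]
bs = [rng.standard_normal(sizes[i + 1]) for i in range(2 * L)]

def K(i):
    """K_i = W_i for i <= L, tau_i(W_{xi(i)}) = W_{xi(i)}^T for i > L (1-based i)."""
    return Ws[i - 1] if i <= L else Ws[xi(i) - 1].T

def F(x):
    """F(x; theta) = f_{2L} o ... o f_1(x), eq. (52), with f_i(x) = tanh(K_i x + b_i)."""
    for i in range(1, 2 * L + 1):
        x = np.tanh(K(i) @ x + bs[i - 1])
    return x

x = rng.standard_normal(sizes[0])
assert all(xi(xi(i)) == i for i in range(1, 2 * L + 1))  # (xi o xi)(i) = i
assert F(x).shape == x.shape                             # decode restores the input dim
```

Note that only $W_1, \dots, W_L$ are stored; the decoding layers reuse them through $K_i$, which is exactly why the gradient with respect to $W_i$ later picks up two terms, one from layer $i$ and one from layer $\xi(i)$.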
Notice that layers $i$ and $\xi(i)$ both explicitly depend on the parameter $W_i$, for any $i \in \{1, \dots, L\}$, and their impact on $F$ can be shown by writing $F$ as follows:
$$F(x;\theta) = f_{2L} \circ \cdots \circ f_{\xi(i)} \circ \cdots \circ f_i \circ \cdots \circ f_1(x). \tag{53}$$
In this section, $\alpha_i$ and $\omega_i$ are defined analogously to (9) and (10) respectively, i.e.
$$\alpha_i(x) = f_i \circ \cdots \circ f_1(x) \quad \text{and} \quad \omega_i(y) = f_{2L} \circ \cdots \circ f_i(y) \tag{54}$$
for all $x \in \mathbb{R}^{n_1}$, $y \in \mathbb{R}^{n_i}$, and $i \in \{1, \dots, 2L\}$. Note again that $\alpha_0$ and $\omega_{2L+1}$ are identity maps.

5.1 Gradient Descent for Standard Loss Function

For the deep autoencoder, the standard loss function is different. It is of the form
$$J(x;\theta) = \tfrac{1}{2}\, \langle x - F(x;\theta),\, x - F(x;\theta) \rangle. \tag{55}$$
Notice that the $y$ from (12) has been replaced by $x$ in (55). This forces the output, which is the decoding of the encoded input, to be as similar to the original input as possible. The equation for $\nabla_{\theta_i} J(x;\theta)$ is then updated from the form in (13) to
$$\nabla_{\theta_i} J(x;\theta) = \nabla^*_{\theta_i} F(x;\theta) \cdot (F(x;\theta) - x), \tag{56}$$
for any parameter $\theta_i$. Note that calculating $\nabla^*_{W_i} F(x;\theta)$ for $i \in \{1, \dots, L\}$ is more difficult in this case than in (35), since layers $i$ and $\xi(i)$ both depend on $W_i$. This will be shown towards the end of this section, after single-layer derivatives and backpropagation are presented. There is a very strong correspondence between this section and Section 4.1 because of the similarity in the layerwise-defining function $f_i$, and this will be exploited whenever possible. Before proceeding into gradient calculation, however, a very particular instance of the chain rule will be introduced for parameter-dependent maps.

Theorem 5.2. Let $E$, $\tilde E$, $H_1$, and $H_2$ be generic inner product spaces. Consider a linear map $\tau \in L(H_1; H_2)$, and two parameter-dependent maps $g \colon E \times H_1 \to \tilde E$ and $h \colon E \times H_2 \to \tilde E$ such that $g(x;\theta) = h(x; \tau(\theta))$ for all $x \in E$ and $\theta \in H_1$.
Then, the following two results hold for all $U \in H_1$ and $y \in \tilde E$:
$$\nabla g(x;\theta) \cdot U = \nabla h(x;\tau(\theta)) \cdot \tau(U),$$
$$\nabla^* g(x;\theta) \cdot y = \tau^*\big( \nabla^* h(x;\tau(\theta)) \cdot y \big).$$

Proof. This is a consequence of the chain rule, the linearity of $\tau$, and the reversing property of the adjoint. ∎

Then, single-layer derivatives for a generic function $f$ are presented as corollaries to Theorem 5.2.

Corollary 5.3. Consider a function $f$ of the form $f(x;W) = S(\tau(W) \cdot x + b)$, where $x \in \mathbb{R}^n$, $b \in \mathbb{R}^m$, $W \in \mathbb{R}^{n \times m}$, $\tau \in L(\mathbb{R}^{n \times m}; \mathbb{R}^{m \times n})$, and $S \colon \mathbb{R}^m \to \mathbb{R}^m$ is an elementwise function. Then, the following hold for any $U \in \mathbb{R}^{n \times m}$:
$$\nabla_W f(x;W) \cdot U = DS(z) \cdot \tau(U) \cdot x, \tag{57}$$
$$\nabla_b f(x;W) = DS(z), \tag{58}$$
$$Df(x;W) = DS(z) \cdot \tau(W), \tag{59}$$
where $z = \tau(W) \cdot x + b$. Furthermore, the following hold for any $y \in \mathbb{R}^m$:
$$\nabla^*_W f(x;W) \cdot y = \tau^*\big( (S'(z) \odot y)\, x^T \big), \tag{60}$$
$$\nabla^*_b f(x;W) = DS(z), \tag{61}$$
$$D^* f(x;W) = \tau(W)^T \cdot DS(z). \tag{62}$$

Proof. In Lemmas 4.2 and 4.3, the derivatives and corresponding adjoints of $\tilde f(x; \tilde W) = S(\tilde W \cdot x + b)$ were calculated, where $\tilde W \in \mathbb{R}^{m \times n}$. Then, equations (57) and (60) are consequences of Theorem 5.2. Equations (58) and (59) also follow from derivatives calculated in Lemmas 4.2 and 4.3, along with the chain rule. Equations (61) and (62) follow from the reversing property of the adjoint and the self-adjointness of $DS(z)$. ∎

Since the single-layer derivatives can be calculated, it is now shown that backpropagation in a deep autoencoder is of the same form as backpropagation in an MLP.

Theorem 5.4 (Backpropagation in Deep AE). With $f_i$ defined as in (51) and $\omega_i$ given as in (54), then for any $x_i \in \mathbb{R}^{n_i}$ and $i \in \{1, \dots, 2L\}$,
$$D\omega_i(x_i) = D\omega_{i+1}(x_{i+1}) \cdot DS_i(z_i) \cdot K_i,$$
where $z_i = K_i \cdot x_i + b_i$ and $\omega_{2L+1}$ is the identity.
Furthermore, for any $v \in \mathbb{R}^{n_{2L+1}}$,
$$D^*\omega_i(x_i) \cdot v = K_i^T \cdot \big( S_i'(z_i) \odot (D^*\omega_{i+1}(x_{i+1}) \cdot v) \big).$$

Proof. Since $f_i(x_i) = S_i(K_i \cdot x_i + b_i)$, where $K_i$ is independent of $x_i$, this result can be proven in the same way as Theorem 4.4, replacing $W_i$ with $K_i$. ∎

The derivatives of the entire loss function can now be computed with respect to $W_i$ for any $i \in \{1, \dots, L\}$, and with respect to $b_i$ for any $i \in \{1, \dots, 2L\}$.

Theorem 5.5. Let $J$ be defined as in (55), $F$ be defined as in (52), and $\omega_i$ be defined as in (54). Then, for all $i \in \{1, \dots, L\}$ and $x \in \mathbb{R}^{n_1}$,
$$\nabla_{W_i} J(x;\theta) = \big( S_i'(z_i) \odot (D^*\omega_{i+1}(x_{i+1}) \cdot e) \big)\, x_i^T + \tau^*_{\xi(i)}\Big( \big( S_{\xi(i)}'(z_{\xi(i)}) \odot (D^*\omega_{\xi(i)+1}(x_{\xi(i)+1}) \cdot e) \big)\, x_{\xi(i)}^T \Big), \tag{63}$$
where $e = F(x;\theta) - x$ and $z_j = K_j \cdot x_j + b_j$ for all $1 \le j \le 2L$. Furthermore, for all $i \in \{1, \dots, 2L\}$,
$$\nabla_{b_i} J(x;\theta) = S_i'(z_i) \odot (D^*\omega_{i+1}(x_{i+1}) \cdot e). \tag{64}$$

Proof. Proving equation (64) for any $i \in \{1, \dots, 2L\}$ is the same as proving (32), and is omitted. As for equation (63), recall that only two of the functions comprising $F$ in (53) depend on $W_i$: $f_i$ and $f_{\xi(i)}$. Hence, by the product rule of differentiation,
$$\nabla_{W_i} F(x;\theta) = D\omega_{\xi(i)+1}(x_{\xi(i)+1}) \cdot \nabla_{W_i} f_{\xi(i)}(x_{\xi(i)}) + D\omega_{i+1}(x_{i+1}) \cdot \nabla_{W_i} f_i(x_i).$$
Taking the adjoint of this implies
$$\nabla^*_{W_i} F(x;\theta) \cdot e = \nabla^*_{W_i} f_{\xi(i)}(x_{\xi(i)}) \cdot D^*\omega_{\xi(i)+1}(x_{\xi(i)+1}) \cdot e + \nabla^*_{W_i} f_i(x_i) \cdot D^*\omega_{i+1}(x_{i+1}) \cdot e. \tag{65}$$
Equation (25) gives
$$\nabla^*_{W_i} f_i(x_i) \cdot u = (S_i'(z_i) \odot u)\, x_i^T \tag{66}$$
for any $u \in \mathbb{R}^{n_{i+1}}$ and any $i \in \{1, \dots, L\}$. Since $i \in \{1, \dots, L\}$ implies $\xi(i) \in \{L+1, \dots, 2L\}$, equation (60) implies
$$\nabla^*_{W_i} f_{\xi(i)}(x_{\xi(i)}) \cdot v = \tau^*_{\xi(i)}\big( (S_{\xi(i)}'(z_{\xi(i)}) \odot v)\, x_{\xi(i)}^T \big) \tag{67}$$
for any $v \in \mathbb{R}^{n_{\xi(i)+1}}$ and any $i \in \{1, \dots, L\}$,
where $z_{\xi(i)} = \tau_{\xi(i)}(W_i) \cdot x_{\xi(i)} + b_{\xi(i)}$. Hence, (63) follows from (56) and (65)–(67). ∎

One iteration of a gradient descent algorithm to minimize $J$ with respect to the parameters is given in Algorithm 5.1. As before, the output of this algorithm is a new parameter set $\theta = \{W_1, \dots, W_L, b_1, \dots, b_{2L}\}$ that has taken one step in the direction of the negative gradient of $J$ with respect to each parameter.

5.2 Gradient Descent for Higher-Order Loss Function

Now, as in previous sections, a loss function $J_R = J + \mu R$ is considered, with $R(x;\theta)$ defined as in (17) or (19). To perform gradient descent to minimize $J_R$, it is only necessary to determine the gradient of $R$ with respect to the parameters, since the gradient of $J$ can already be calculated. Again, forward and backward propagation through the tangent network must be computed in the spirit of [4], as well as $(\nabla\, DF(x;\theta)\, v)^*$.

Algorithm 5.1 One iteration of gradient descent in an autoencoder
function DESCENTITERATION($x, W_1, \dots, W_L, b_1, \dots, b_{2L}, \eta$)
  $x_1 \leftarrow x$
  for $i \in \{1, \dots, 2L\}$ do  ▷ $x_{2L+1} = F(x;\theta)$
    if $i \le L$ then
      $K_i \leftarrow W_i$
    else
      $K_i \leftarrow \tau_i(W_{\xi(i)})$
    end if
    $z_i \leftarrow K_i \cdot x_i + b_i$
    $x_{i+1} \leftarrow S_i(z_i)$
  end for
  for $i \in \{2L, \dots, 1\}$ do
    if $i = 2L$ then  ▷ $\omega_{2L+1}$ = identity
      $e_x \leftarrow x_{2L+1} - x$  ▷ $e_x = D^*\omega_{i+1}(x_{i+1}) \cdot (x_{2L+1} - x)$
    else
      $e_x \leftarrow K_{i+1}^T \cdot (S_{i+1}'(z_{i+1}) \odot e_x)$  ▷ Thm. 5.4
    end if
    $\nabla_{b_i} J(x;\theta) \leftarrow S_i'(z_i) \odot e_x$  ▷ (64)
    $b_i \leftarrow b_i - \eta\, \nabla_{b_i} J(x;\theta)$
    if $i > L$ then
      $\nabla_{W_{\xi(i)}} J(x;\theta) \leftarrow \tau_i^*\big( (S_i'(z_i) \odot e_x)\, x_i^T \big)$  ▷ second term in (63)
    else
      $\nabla_{W_i} J(x;\theta) \leftarrow \nabla_{W_i} J(x;\theta) + (S_i'(z_i) \odot e_x)\, x_i^T$  ▷ add first term in (63)
      $W_i \leftarrow W_i - \eta\, \nabla_{W_i} J(x;\theta)$
    end if
  end for
end function

Lemma 5.6.
For $f_i$ defined as in (51), $\alpha_i$ defined as in (54), and any $x, v \in \mathbb{R}^{n_1}$,
$$D\alpha_i(x) \cdot v = S_i'(z_i) \odot (K_i \cdot D\alpha_{i-1}(x) \cdot v),$$
where $x_i = \alpha_{i-1}(x)$ and $z_i = K_i \cdot x_i + b_i$, for all $i \in \{1, \dots, 2L\}$.

Proof. This result is proven similarly to Theorem 4.7 since $f_i(x_i) = S_i(K_i \cdot x_i + b_i)$. ∎

Now, tangent backpropagation must be computed.

Theorem 5.7 (Tangent Backpropagation in Deep AE). Let $\alpha_i$ and $\omega_i$ be defined as in (54). Let $f_i$ be defined as in (51). Then, for any $i \in \{1, \dots, 2L\}$, $x, v_1 \in \mathbb{R}^{n_1}$, and $v_2 \in \mathbb{R}^{n_{2L+1}}$,
$$\begin{aligned} ((D\alpha_{i-1}(x) \cdot v_1) \lrcorner D^2\omega_i(x_i))^* \cdot v_2 ={}& K_i^T \cdot \big[ S_i'(z_i) \odot \big( ((D\alpha_i(x) \cdot v_1) \lrcorner D^2\omega_{i+1}(x_{i+1}))^* \cdot v_2 \big) \big] \\ &+ K_i^T \cdot \big\{ S_i''(z_i) \odot (K_i \cdot D\alpha_{i-1}(x) \cdot v_1) \odot (D^*\omega_{i+1}(x_{i+1}) \cdot v_2) \big\}, \end{aligned}$$
where $x_i = \alpha_{i-1}(x)$ and $z_i = K_i \cdot x_i + b_i$. Also, $((D\alpha_{2L}(x) \cdot v_1) \lrcorner D^2\omega_{2L+1}(x_{2L+1}))^* \cdot v_2 = 0$.

Proof. Since $f_i(x_i) = S_i(K_i \cdot x_i + b_i)$ and $K_i$ is independent of $x_i$, this result can be proven in the same way as Theorem 4.8. ∎

Since the tangents can be backpropagated, the final step in calculating the gradients of $R$ is to calculate $(\nabla_{\theta_i} DF(x;\theta)\, v)^*$, where $\theta_i$ is a generic parameter.

Theorem 5.8. Let $\alpha_i$ and $\omega_i$ be defined in (54), and $F$ be defined as in (52). Then, for any $e \in \mathbb{R}^{n_{2L+1}}$, $x \in \mathbb{R}^{n_1}$, and $i \in \{1, \dots, L\}$,
$$\begin{aligned} (\nabla_{W_i} DF(x;\theta)\, v)^* \cdot e ={}& \big[ S_i'(z_i) \odot \big( ((D\alpha_i(x) \cdot v) \lrcorner D^2\omega_{i+1}(x_{i+1}))^* \cdot e \big) \big]\, x_i^T \\ &+ \big( S_i''(z_i) \odot (K_i \cdot D\alpha_{i-1}(x) \cdot v) \odot (D^*\omega_{i+1}(x_{i+1}) \cdot e) \big)\, x_i^T \\ &+ \big( S_i'(z_i) \odot (D^*\omega_{i+1}(x_{i+1}) \cdot e) \big)\, (D\alpha_{i-1}(x) \cdot v)^T \\ &+ \tau^*_{\xi(i)}\Big( \big[ S_{\xi(i)}'(z_{\xi(i)}) \odot \big( ((D\alpha_{\xi(i)}(x) \cdot v) \lrcorner D^2\omega_{\xi(i)+1}(x_{\xi(i)+1}))^* \cdot e \big) \big]\, x_{\xi(i)}^T \Big) \\ &+ \tau^*_{\xi(i)}\Big( \big[ S_{\xi(i)}''(z_{\xi(i)}) \odot (K_{\xi(i)} \cdot D\alpha_{\xi(i)-1}(x) \cdot v) \odot (D^*\omega_{\xi(i)+1}(x_{\xi(i)+1}) \cdot e) \big]\, x_{\xi(i)}^T \Big) \\ &+ \tau^*_{\xi(i)}\Big( \big[ S_{\xi(i)}'(z_{\xi(i)}) \odot (D^*\omega_{\xi(i)+1}(x_{\xi(i)+1}) \cdot e) \big]\, (D\alpha_{\xi(i)-1}(x) \cdot v)^T \Big), \end{aligned} \tag{68}$$
where $x_i = \alpha_{i-1}(x)$ and $z_i = K_i \cdot x_i + b_i$. Furthermore, for any $i \in \{1, \dots, 2L\}$,
$$(\nabla_{b_i} DF(x;\theta)\, v)^* \cdot e = S_i'(z_i) \odot \big( ((D\alpha_i(x) \cdot v) \lrcorner D^2\omega_{i+1}(x_{i+1}))^* \cdot e \big) + S_i''(z_i) \odot (K_i \cdot D\alpha_{i-1}(x) \cdot v) \odot (D^*\omega_{i+1}(x_{i+1}) \cdot e). \tag{69}$$

Proof. Equation (69) is proven similarly to (46) and is omitted. Equation (68) is now derived. Consider the case when $i \in \{1, \dots, L\}$. Recall from the proof of Theorem 5.5 that
$$\nabla_{W_i} F(x;\theta) = D\omega_{i+1}(x_{i+1}) \cdot \nabla_{W_i} f_i(x_i) + D\omega_{\xi(i)+1}(x_{\xi(i)+1}) \cdot \nabla_{W_i} f_{\xi(i)}(x_{\xi(i)}).$$
Then, as in the proof of Theorem 3.9,
$$\begin{aligned} (\nabla_{W_i} DF(x;\theta)\, v) ={}& ((D\alpha_i(x) \cdot v) \lrcorner D^2\omega_{i+1}(x_{i+1})) \cdot \nabla_{W_i} f_i(x_i) \\ &+ D\omega_{i+1}(x_{i+1}) \cdot ((D\alpha_{i-1}(x) \cdot v) \lrcorner D\nabla_{W_i} f_i(x_i)) \\ &+ ((D\alpha_{\xi(i)}(x) \cdot v) \lrcorner D^2\omega_{\xi(i)+1}(x_{\xi(i)+1})) \cdot \nabla_{W_i} f_{\xi(i)}(x_{\xi(i)}) \\ &+ D\omega_{\xi(i)+1}(x_{\xi(i)+1}) \cdot ((D\alpha_{\xi(i)-1}(x) \cdot v) \lrcorner D\nabla_{W_i} f_{\xi(i)}(x_{\xi(i)})), \end{aligned} \tag{70}$$
where the third and fourth terms come from the second term in $\nabla_{W_i} F(x;\theta)$. Then, taking the adjoint of the first two terms of (70) works as in (45), replacing $W_i$ with $K_i$. Taking the adjoint of the final two terms of (70) can be done using Theorem 5.2 and (45), which completes the proof. ∎

Corollary 5.9.
Let $\alpha_i$ and $\omega_i$ be defined in (54), and $F$ be defined as in (52). Then, for any $e \in \mathbb{R}^{n_{2L+1}}$, $x \in \mathbb{R}^{n_1}$, and $i \in \{1, \dots, L\}$,
$$\begin{aligned} (\nabla_{W_i} DF(x;\theta)\, v)^* \cdot e ={}& \big( (\nabla_{b_i} DF(x;\theta)\, v)^* \cdot e \big)\, x_i^T + \big( S_i'(z_i) \odot (D^*\omega_{i+1}(x_{i+1}) \cdot e) \big)\, (D\alpha_{i-1}(x) \cdot v)^T \\ &+ \tau^*_{\xi(i)}\Big( \big( (\nabla_{b_{\xi(i)}} DF(x;\theta)\, v)^* \cdot e \big)\, x_{\xi(i)}^T \Big) \\ &+ \tau^*_{\xi(i)}\Big( \big[ S_{\xi(i)}'(z_{\xi(i)}) \odot (D^*\omega_{\xi(i)+1}(x_{\xi(i)+1}) \cdot e) \big]\, (D\alpha_{\xi(i)-1}(x) \cdot v)^T \Big), \end{aligned} \tag{71}$$
where $x_i = \alpha_{i-1}(x)$ and $z_i = K_i \cdot x_i + b_i$.

Proof. This result can easily be obtained by substituting (69) into (68). ∎

Recall the following for $v_x, \beta_x \in \mathbb{R}^{n_{2L+1}}$:
$$\nabla_{\theta_i} R(x;\theta) = (\nabla_{\theta_i} DF(x;\theta)\, v_x)^* \cdot (DF(x;\theta) \cdot v_x - \beta_x),$$
for a generic parameter $\theta_i$. Now, gradient descent can be performed to minimize $J_R = J + \mu R$ since the gradient of $R$ is known. One iteration of this is given in Algorithm 5.2.

Algorithm 5.2 One iteration of gradient descent for higher-order loss in an autoencoder
function DESCENTITERATION($x, v_x, \beta_x, W_1, \dots, W_L, b_1, \dots, b_{2L}, \eta, \mu$)
  $x_1 \leftarrow x$
  $v_1 \leftarrow v_x$  ▷ $v_i = D\alpha_{i-1}(x) \cdot v_x$ and $D\alpha_0(x)$ = identity
  for $i \in \{1, \dots, 2L\}$ do  ▷ $x_{2L+1} = F(x;\theta)$ and $v_{2L+1} = DF(x;\theta) \cdot v_x$
    if $i \le L$ then
      $K_i \leftarrow W_i$
    else
      $K_i \leftarrow \tau_i(W_{\xi(i)})$
    end if
    $z_i \leftarrow K_i \cdot x_i + b_i$
    $x_{i+1} \leftarrow S_i(z_i)$
    $v_{i+1} \leftarrow S_i'(z_i) \odot (K_i \cdot v_i)$  ▷ Lemma 5.6
  end for
  for $i \in \{2L, \dots, 1\}$ do
    if $i = 2L$ then  ▷ $\omega_{2L+1}$ = identity
      $e_x \leftarrow x_{2L+1} - x$  ▷ $e_x = D^*\omega_{i+1}(x_{i+1}) \cdot (x_{2L+1} - x)$
      $e_t \leftarrow 0$  ▷ $e_t = (v_{i+1} \lrcorner D^2\omega_{i+1}(x_{i+1}))^* \cdot (v_{2L+1} - \beta_x)$
      $e_v \leftarrow v_{2L+1} - \beta_x$  ▷ $e_v = D^*\omega_{i+1}(x_{i+1}) \cdot (v_{2L+1} - \beta_x)$
    else
      $e_x \leftarrow K_{i+1}^T \cdot (S_{i+1}'(z_{i+1}) \odot e_x)$  ▷ Thm. 5.4
      $e_t \leftarrow K_{i+1}^T \cdot \big( S_{i+1}'(z_{i+1}) \odot e_t + S_{i+1}''(z_{i+1}) \odot (K_{i+1} \cdot v_{i+1}) \odot e_v \big)$  ▷ Thm.
5.7; uses the old $e_v$
      $e_v \leftarrow K_{i+1}^T \cdot (S_{i+1}'(z_{i+1}) \odot e_v)$  ▷ Thm. 5.4
    end if
    $\nabla_{b_i} J(x;\theta) \leftarrow S_i'(z_i) \odot e_x$  ▷ (64)
    $\nabla_{b_i} R(x;\theta) \leftarrow S_i'(z_i) \odot e_t + S_i''(z_i) \odot (K_i \cdot v_i) \odot e_v$  ▷ (69)
    $b_i \leftarrow b_i - \eta\, (\nabla_{b_i} J(x;\theta) + \mu\, \nabla_{b_i} R(x;\theta))$
    if $i > L$ then
      $\nabla_{W_{\xi(i)}} J(x;\theta) \leftarrow \tau_i^*\big( (S_i'(z_i) \odot e_x)\, x_i^T \big)$  ▷ second term in (63)
      $\nabla_{W_{\xi(i)}} R(x;\theta) \leftarrow \tau_i^*\big( (\nabla_{b_i} R(x;\theta))\, x_i^T + (S_i'(z_i) \odot e_v)\, v_i^T \big)$  ▷ terms 3 & 4 in (71)
    else
      $\nabla_{W_i} J(x;\theta) \leftarrow \nabla_{W_i} J(x;\theta) + (S_i'(z_i) \odot e_x)\, x_i^T$  ▷ add first term in (63)
      $\nabla_{W_i} R(x;\theta) \leftarrow \nabla_{W_i} R(x;\theta) + (\nabla_{b_i} R(x;\theta))\, x_i^T + (S_i'(z_i) \odot e_v)\, v_i^T$  ▷ terms 1 & 2 in (71), added to the previously computed result
      $W_i \leftarrow W_i - \eta\, (\nabla_{W_i} J(x;\theta) + \mu\, \nabla_{W_i} R(x;\theta))$
    end if
  end for
end function

6 Conclusion and Future Work

In this work, a concise and complete mathematical framework for DNNs was formulated. Generic multivariate functions defined the operation of the network at each layer, and their composition defined the overall mechanics of the network. A coordinate-free gradient descent algorithm, which relied heavily on derivatives of vector-valued functions, was presented and applied to two specific examples. It was shown how to calculate gradients of network loss functions over the inner product space in which the parameters reside, as opposed to individually with respect to each component. A simple loss function and a higher-order loss function were considered, and it was also shown how to extend this framework to other types of loss functions. The approach considered in this paper is generic and flexible, and can be extended to other types of networks besides the ones considered here. The most immediate direction of future work would be to represent the parameters of a DNN in some sort of lower-dimensional subspace to promote sparsity in the network.
Finding meaningful basis representations of parameters could help limit the amount of overfitting, while still maintaining the predictive power of the model. Also, more sophisticated optimization methods become tractable once the number of dimensions is sufficiently reduced, and it would be interesting to apply these to neural networks. Another direction for future work is to exploit the discrete-time dynamical system structure presented for the layerwise network, and to consider how to use control and dynamical systems theory to improve network training or output.

References

[1] R. Abraham, J. Marsden, and T. Ratiu. Manifolds, Tensor Analysis, and Applications (2nd edition). Springer, 1988.
[2] K. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[3] S. Rifai, Y. Dauphin, P. Vincent, Y. Bengio, and X. Muller. The manifold tangent classifier. In Advances in Neural Information Processing Systems, pages 2294-2302, 2011.
[4] P. Simard, B. Victorri, Y. LeCun, and J. Denker. Tangent Prop: a formalism for specifying selected invariances in an adaptive network. In Advances in Neural Information Processing Systems, pages 895-903, 1992.