SENNS: Sparse Extraction Neural NetworkS for Feature Extraction
Authors: Abdulrahman Oladipupo Ibraheem
rahmanoladi@yahoo.com
Computing and Intelligent Systems Research Group
Department of Computer Science and Engineering
Obafemi Awolowo University, Ile-Ife, Nigeria.
September 18, 2018

Abstract

The feature extraction problem occupies a central position in pattern recognition and machine learning. In this concept paper, drawing on ideas from optimisation theory, artificial neural networks (ANN), graph embeddings and sparse representations, I develop a novel technique, termed SENNS (Sparse Extraction Neural NetworkS), aimed at addressing the feature extraction problem. The proposed method uses (preferably deep) ANNs for projecting input attribute vectors to an output space wherein pairwise distances are maximized for vectors belonging to different classes, but minimized for those belonging to the same class, while simultaneously enforcing sparsity on the ANN outputs. The vectors that result from the projection can then be used as features in any classifier of choice. Mathematically, I formulate the proposed method as the minimisation of an objective function which can be interpreted, in the ANN output space, as a negative factor of the sum of the squares of the pairwise distances between output vectors belonging to different classes, added to a positive factor of the sum of squares of the pairwise distances between output vectors belonging to the same classes, plus sparsity and weight decay terms. To derive an algorithm for minimizing the objective function via gradient descent, I use the multivariate version of the chain rule to obtain the partial derivatives of the function with respect to ANN weights and biases, and find that each of the required partial derivatives can be expressed as a sum of six terms.
As it turns out, four of those six terms can be computed using the standard backpropagation algorithm; the fifth can be computed via a slight modification of the standard backpropagation algorithm; while the sixth can be computed via simple arithmetic. Finally, I propose experiments on the ARABASE Arabic corpora of digits and letters, the CMU PIE database of faces, the MNIST digits database, and other standard machine learning databases.

1 Introduction

Most pattern recognition systems comprise three key stages: pre-processing, feature extraction and classification. Of these three, researchers believe that the feature extraction stage is the most critical. For example, in their review paper, the authors of [1] unequivocally wrote: 'Selection of a feature extraction method is probably the single most important factor in achieving high recognition performance...' Indeed, we agree with the authors of [1], since our own view is that feature extraction is a commitment which, once made, might be irreversible by any classifier, however sophisticated. Hence, it becomes paramount to study carefully and rigorously how these 'commitments' should be made: how should features be extracted for optimal accuracies at the classification stage? While researchers have proposed a plethora of methods (e.g. [2], [3], [4], [5], [6], [7], [8], [9], [10], [12], [13]) aimed at answering this question, it appears that a philosophy that should be followed by any good feature extraction method is the one articulated by Devijver and Kittler in [14].
They said that feature extraction is the problem of 'extracting from the raw data the information which is most relevant for classification purposes, in the sense of minimizing the within-class pattern variability, while enhancing the between-class pattern variability.' Two of the more popular feature extraction techniques that follow this philosophy are Linear Discriminant Analysis (LDA) [3] and Marginal Fisher Analysis (MFA) [12], both of which can be viewed as specific examples of a unifying concept called graph embeddings [12]. Inspired by the above philosophy of [14], we herein propose a technique, termed SENNS (Sparse Extraction Neural NetworkS) and pronounced 'SENSE,' for addressing the feature extraction problem. Like LDA and MFA, SENNS can also be viewed, at least partially, through the lens of the graph embeddings concept. Unlike MFA and LDA, however, SENNS incorporates a mechanism for seeking sparse features, and employs the apparatus of (preferably deep) non-linear artificial neural networks, rather than linear (or kernel, or tensor) projections, for effecting the transformations that result in the sought features. Mathematically, we formulate our method as the minimisation of an objective function. Via rigorous mathematical analysis, we then derive a gradient descent algorithm for minimizing the objective function. Fortunately, it turns out that the algorithm can be expressed in terms of the standard backpropagation procedure, except, as we shall see, for a little tweaking to accommodate $L_1$ norms. Finally, we plan to test SENNS on standard machine learning datasets such as the ARABASE Arabic corpora of digits and letters [15], the CMU PIE database of faces [16], the MNIST digits database [17], and other standard machine learning databases.

2 Notation

In describing neural networks, I almost entirely follow the notation of Prof.
Andrew Ng as in [18]. In Ng's notation, for supervised training mode, the neural network learns from training data denoted $(x^{(i)}, y^{(i)})$, $i = 1, 2, \ldots, m$. The neural network proper consists of $n_l$ layers, $L_1, L_2, \ldots, L_l, \ldots, L_{n_l}$, and the number of neurons in the $l$-th layer is denoted $s_l$. A weight, denoted $W^{(l)}_{ij}$, connects the $j$-th neuron of layer $l$ with the $i$-th neuron of layer $l + 1$, while a bias, denoted $b^{(l)}_i$, emanates from layer $l$ and enters the $i$-th neuron of layer $l + 1$. The overall function of the $i$-th neuron in layer $l + 1$ is to compute an activation, denoted $a^{(l+1)}_i$, which is the result of passing the quantity $z^{(l+1)}_i = \sum_{j=1}^{s_l} W^{(l)}_{ij} a^{(l)}_j + b^{(l)}_i$ through a transfer function such as the sigmoid or tanh function. As a matter of notational expedience, the definition $a^{(1)} = x$ is employed, which means that the ANN input is viewed as an 'activation' for the first layer. Furthermore, $a^{(l)}$ is used to denote the column vector given by $a^{(l)} = (a^{(l)}_1, a^{(l)}_2, \ldots, a^{(l)}_{s_l})^T$; a similar notation applies to $b^{(l)}$ and $z^{(l)}$ as well. Similarly, $W^{(l)}$ denotes the column vector obtained via an ordered concatenation of all the weights linking layer $l$ with layer $l + 1$ in the network. Finally, $W$ represents a concatenation of all the weights in the ANN, while $b$ represents a concatenation of all the biases. Herein, I will follow the above notation, except for the following modifications. Firstly, to avoid confusion, I will use the index $t$, placed within square brackets, to label training data: $(x^{[t]}, y^{[t]})$, $t = 1, 2, \ldots, m$.
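To make the notation concrete, here is a minimal sketch of the feedforward computation just described, in Python with NumPy; the layer sizes, the random weights and the choice of a sigmoid transfer function are illustrative assumptions, not values from the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x, weights, biases):
    """Compute activations a^(1), ..., a^(n_l) for one input vector.

    weights[l] has shape (s_{l+1}, s_l), so that W @ a implements the
    per-neuron formula z^(l+1)_i = sum_j W^(l)_ij a^(l)_j + b^(l)_i.
    """
    activations = [x]            # a^(1) is the input itself
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = sigmoid(z)
        activations.append(a)
    return activations

# Tiny illustrative network: 3 inputs -> 4 hidden units -> 2 outputs.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
biases = [rng.standard_normal(4), rng.standard_normal(2)]
acts = feedforward(np.array([0.5, -1.0, 2.0]), weights, biases)
print(len(acts), acts[-1].shape)  # three layers of activations; output has 2 units
```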
Based on this, I will write $a^{(l)[t]}_i$ to denote the activation from the $i$-th neuron of the $l$-th layer when $x^{[t]}$ is applied as input to the ANN, and I will use $a^{(l)[t]}$ to denote the column vector of activations associated with the $l$-th layer and $t$-th input vector: $a^{(l)[t]} = (a^{(l)[t]}_1, a^{(l)[t]}_2, \ldots, a^{(l)[t]}_{s_l})^T$. However, when the layer in question is clear from context, I will simply write $a^{[t]}_i$ instead of $a^{(l)[t]}_i$, and $a^{[t]}$ instead of $a^{(l)[t]}$, in order to achieve a less clumsy notation. Finally, in dealing with partial derivatives, I shall herein write $\frac{\partial f}{\partial a^{[t]}}\big|_{\hat{a}^{[t]}}$ as a shorthand for $\frac{\partial f}{\partial a^{[t]}}\big|_{a^{[t]}=\hat{a}^{[t]}}$. Likewise, we shall write $\frac{\partial f}{\partial a^{[t]}}\big|_{(\hat{a}^{[t]},\hat{a}^{[u]})}$ instead of $\frac{\partial f}{\partial a^{[t]}}\big|_{(a^{[t]},a^{[u]})=(\hat{a}^{[t]},\hat{a}^{[u]})}$.

3 Formulation of Objective Function

Towards formulating the required objective function, I will begin by introducing two 'functions', $C$ and $D$, defined over the Cartesian product of input vectors according to:

$$C(x^{[t]}, x^{[u]}) = \begin{cases} 1 & \text{if } x^{[t]} \text{ and } x^{[u]} \text{ belong to the same class} \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

and

$$D(x^{[t]}, x^{[u]}) = \begin{cases} 1 & \text{if } x^{[t]} \text{ and } x^{[u]} \text{ belong to different classes} \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

Further, if $X = \{x^{[1]}, x^{[2]}, \ldots, x^{[m]}\}$ denotes the set of all input vectors, and $X^2$ is the set of all pairs of the form $(x^{[t]}, x^{[u]})$, $t \in \{1, 2, \ldots, m\}$ and $u \in \{1, 2, \ldots, m\}$, which can be drawn from $X$, then we shall use $M_C(X)$ to denote the number of times that function $C$ outputs 1 when all the pairs in $X^2$ are passed through it. A similar definition applies to $M_D(X)$ as well. When the training set is clear from context, we will simply write $M_C$ and $M_D$ instead of $M_C(X)$ and $M_D(X)$ respectively.
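As a quick illustration (an editor's sketch, not code from the paper), the indicator functions $C$ and $D$ and the counts $M_C$ and $M_D$ can be computed directly from class labels. Following the double sums over $t$ and $u$, ordered pairs are counted, including pairs with $t = u$:

```python
def C(label_t, label_u):
    """1 if the two inputs share a class, else 0 (Equation 1)."""
    return 1 if label_t == label_u else 0

def D(label_t, label_u):
    """1 if the two inputs come from different classes, else 0 (Equation 2)."""
    return 1 if label_t != label_u else 0

def pair_counts(labels):
    """M_C and M_D: how often C (resp. D) outputs 1 over all ordered pairs in X^2."""
    m = len(labels)
    M_C = sum(C(labels[t], labels[u]) for t in range(m) for u in range(m))
    M_D = m * m - M_C   # every ordered pair is counted by exactly one of C, D
    return M_C, M_D

labels = [0, 0, 1, 2]
M_C, M_D = pair_counts(labels)
print(M_C, M_D)  # 6 same-class ordered pairs (including t = u), 10 different-class pairs
```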
Next, for some non-negative regularisation constants $\lambda_1, \lambda_2 \in [0, 1]$, we define a function $S(x^{[t]}, x^{[u]})$ as follows:

$$S(x^{[t]}, x^{[u]}) = \begin{cases} \frac{\lambda_1}{M_C} & \text{if } x^{[t]} \text{ and } x^{[u]} \text{ belong to the same class} \\ -\frac{\lambda_2}{M_D} & \text{otherwise} \end{cases} \qquad (3)$$

The connection between function $S$ and the graph embedding framework [12] should at once be clear. Specifically, $S$ plays the role of the weights on the edges of the graphs underlying graph embeddings. We see that, similar to the Marginal Fisher Analysis (MFA) described in [12], $S$ connects data points belonging to the same class with a positive weight, but connects those belonging to different classes with a negative weight. However, as we shall soon see, the method proposed herein differs from that in [12] in three key ways. First, herein, we effect our projections via ANNs, unlike [12], which employed either linearisation, kernelisation or tensorisation. Second, herein, we impose a sparsity requirement on the sought features, thereby seeking to take advantage of the well-known benefits of sparse features; see [19] and [20] for instance. Indeed, we would like to see the effect of the sparsity term on the ability of our gradient descent algorithm to locate a global minimum for our non-convex objective function. Third, the overall structure of our objective function herein is different from that in [12], since ours is a regularised sum of terms, whereas theirs is a quotient of terms. The above three things also distinguish the method proposed herein from Linear Discriminant Analysis (LDA) [3].
We can now spell out the objective function we wish to minimize:

$$J(W, b) = \frac{1}{2} \sum_{t=1}^{m} \sum_{u=1}^{m} S(x^{[t]}, x^{[u]}) \|a^{[t]} - a^{[u]}\|^2 + \frac{\lambda_3}{m} \sum_{t=1}^{m} \|a^{[t]}\|_1 + \frac{\lambda_4}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_{l+1}} \sum_{j=1}^{s_l} \big(W^{(l)}_{ij}\big)^2 \qquad (4)$$

To achieve a slightly less clumsy notation in the above equation, we have written the activations associated with the ANN's output layer as $a^{[t]}$ instead of $a^{(n_l)[t]}$. We shall carry on this practice henceforth, unless otherwise stated. Further, $\|\cdot\|_1$ denotes the $L_1$ norm. Also, notice that the above objective function implicitly incorporates the regularizers $\lambda_1, \lambda_2$ via the inclusion of $S(x^{[t]}, x^{[u]})$. In addition, $\lambda_3$ and $\lambda_4$ are also regularizers such that $\lambda_3, \lambda_4 \in [0, 1]$ and $\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 = 1$. By simply substituting $S(x^{[t]}, x^{[u]})$ from Equation 3 into Equation 4, we have:

$$J(W, b) = \frac{\lambda_1}{2 M_C} \sum_{t=1}^{m} \sum_{u=1}^{m} C(x^{[t]}, x^{[u]}) \|a^{[t]} - a^{[u]}\|^2 - \frac{\lambda_2}{2 M_D} \sum_{t=1}^{m} \sum_{u=1}^{m} D(x^{[t]}, x^{[u]}) \|a^{[t]} - a^{[u]}\|^2 + \frac{\lambda_3}{m} \sum_{t=1}^{m} \|a^{[t]}\|_1 + \frac{\lambda_4}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_{l+1}} \sum_{j=1}^{s_l} \big(W^{(l)}_{ij}\big)^2 \qquad (5)$$

The form in Equation 5 highlights the 'graph embedding' aspect of our formulation. The first term gives a measure, in the output space of the ANN, of how widely separated output vectors belonging to the same class are. Clearly, we wish to minimize this non-negative quantity. On the contrary, the second term, excluding its negative sign, gives a measure, in the output space of the ANN, of how widely separated output vectors belonging to different classes are. We wish to maximize this non-negative measure, by minimizing the negative quantity that results when the negative sign is prefixed to it. The third term is a sparsity term by which we wish to make the extracted features sparse.
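For concreteness, a direct (un-optimised) sketch of Equation 5 follows, assuming the output-layer activations $a^{[t]}$ have already been computed by the network; the function and variable names are the editor's own:

```python
import numpy as np

def senns_objective(outputs, labels, weights, lam1, lam2, lam3, lam4):
    """Sketch of Equation 5: pairwise pull/push terms + L1 sparsity + weight decay.

    outputs: array of shape (m, d) holding the ANN output-layer activations a^[t];
    weights: list of weight matrices W^(l) (may be empty to ignore weight decay).
    """
    m = len(labels)
    labels = np.asarray(labels)
    same = (labels[:, None] == labels[None, :]).astype(float)  # C(x^[t], x^[u])
    diff = 1.0 - same                                          # D(x^[t], x^[u])
    M_C, M_D = same.sum(), diff.sum()
    # Squared pairwise distances ||a^[t] - a^[u]||^2 in the output space.
    sq = ((outputs[:, None, :] - outputs[None, :, :]) ** 2).sum(axis=2)
    pull = (lam1 / (2 * M_C)) * (same * sq).sum()   # shrink same-class spread
    push = (lam2 / (2 * M_D)) * (diff * sq).sum()   # grow between-class spread
    sparsity = (lam3 / m) * np.abs(outputs).sum()   # L1 term
    decay = (lam4 / 2) * sum((W ** 2).sum() for W in weights)
    return pull - push + sparsity + decay

# Toy demonstration: well-separated classes yield a lower objective than
# collapsed, overlapping outputs (weight decay omitted via an empty list).
labels = [0, 0, 1, 1]
separated = np.array([[0., 0.], [0., 0.], [5., 5.], [5., 5.]])
collapsed = np.zeros((4, 2))
print(senns_objective(separated, labels, [], .25, .25, .25, .25) <
      senns_objective(collapsed, labels, [], .25, .25, .25, .25))  # True
```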
The fourth term is a weight decay term, which prevents the weights from becoming too large and helps prevent overfitting. Finally, we see that the parameters $\lambda_1, \lambda_2, \lambda_3, \lambda_4$ allow us to control the relative significance that the objective function attaches to the four objectives it is trying to achieve. A question naturally arises pertaining to the computational feasibility of the sums appearing in Equation 5. In particular, as we shall see, the sums involving the index variables $t$ and $u$ will carry over directly to the algorithm for minimizing the objective function in Equation 5. This means that the term involving $C(x^{[t]}, x^{[u]})$ (as well as the term involving $D(x^{[t]}, x^{[u]})$) in the objective function would require $O(m^2)$ time, which becomes undesirable as $m$ grows. To ameliorate this, we propose two heuristics for the two sums, as follows. We first consider the case of the term involving $D(x^{[t]}, x^{[u]})$. Upfront, we point out that the heuristic leads to a maximisation, in the ANN output space, of the sum of the distances between each input vector from a given class and its nearest neighbour from each of the other classes, thereby giving rise to a maximisation-of-the-minimum-distance formulation, reminiscent of support vector machines [?]. To proceed, let $T = \{x^{[1]}, x^{[2]}, \ldots, x^{[m]}\}$, and let there be $N$ classes, denoted $\Omega_1, \Omega_2, \ldots, \Omega_N$, in the classification problem for which we are extracting features. Now, for each $x^{[t]} \in \Omega_p$, we define a set $D_{x^{[t]}}$ containing $N - 1$ elements as follows: $D_{x^{[t]}} = \{d_1, d_2, \ldots, d_{p-1}, d_{p+1}, \ldots, d_N\}$, such that each $d_q$ belongs to class $\Omega_q$, $q \neq p$, and each $d_q$ is the member of $\Omega_q$ nearest to $x^{[t]}$.
Our heuristic is to replace the quantity $\sum_{t=1}^{m} \sum_{u=1}^{m} D(x^{[t]}, x^{[u]}) \|a^{[t]} - a^{[u]}\|^2$ by $\sum_{x^{[t]} \in T} \sum_{d_q \in D_{x^{[t]}}} \|a^{[t]} - a^{[q]}\|^2$, where $a^{[q]}$ denotes the ANN output-layer vector of activations associated with input vector $d_q \in D_{x^{[t]}}$. It should be clear that this new sum is $O(\gamma m)$, where $\gamma = N - 1$, compared to the original sum, which is $O(m^2)$. Since most problems usually have $m \gg N - 1$, we expect this heuristic to lead to significant gains in computational feasibility in most cases. For the case of the term involving $C(x^{[t]}, x^{[u]})$, on the other hand, we propose a heuristic that leads to a situation wherein the sum of the distances between each input vector and its $k$ farthest neighbours, all in the same class as the input vector, is minimised. Formally, for each $x^{[t]} \in \Omega_p$, we simply define the set $C_{x^{[t]}} = \tilde{N}_k(x^{[t]})$, where $\tilde{N}_k(x^{[t]})$ is the set of the $k$ farthest elements from $x^{[t]}$ in $\Omega_p$. In Equation 5, we then replace the quantity $\sum_{t=1}^{m} \sum_{u=1}^{m} C(x^{[t]}, x^{[u]}) \|a^{[t]} - a^{[u]}\|^2$ by $\sum_{x^{[t]} \in T} \sum_{d_q \in C_{x^{[t]}}} \|a^{[t]} - a^{[q]}\|^2$. Again, in this case, we see that, for most cases, the heuristic can lead to improved computational feasibility, since the new sum is $O(km)$, and $k$ can be chosen to be far smaller than $m$. Since the above heuristics constitute just one example of a host of possible heuristics that can be applied to alleviate the computational feasibility issue in Equation 5, we think it better to develop our technique for the general case formulated in Equation 5, especially considering that it should be clear how to adapt the developed technique to any particular heuristic of interest. So now, let us go back to Equation 4 (from which Equation 5 derives).
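The two neighbour sets driving these heuristics can be sketched as follows, using Euclidean distance in the input space; the function names and the return format (index lists keyed by $t$) are the editor's assumptions:

```python
import numpy as np

def other_class_nearest(X, labels):
    """For each x^[t], the indices of its nearest neighbour in every other class
    (the set D_{x^[t]} of the text)."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    result = {}
    for t in range(len(X)):
        picks = []
        for cls in np.unique(labels):
            if cls == labels[t]:
                continue
            idx = np.flatnonzero(labels == cls)
            d = np.linalg.norm(X[idx] - X[t], axis=1)
            picks.append(int(idx[np.argmin(d)]))
        result[t] = picks
    return result

def same_class_k_farthest(X, labels, k):
    """For each x^[t], the indices of its k farthest same-class elements
    (the set C_{x^[t]} = N~_k(x^[t]) of the text)."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    result = {}
    for t in range(len(X)):
        idx = np.flatnonzero(labels == labels[t])
        idx = idx[idx != t]
        d = np.linalg.norm(X[idx] - X[t], axis=1)
        order = np.argsort(d)[::-1]          # farthest first
        result[t] = [int(v) for v in idx[order[:k]]]
    return result
```

With $m$ points and $N$ classes, these sets shrink the pairwise sums from $O(m^2)$ terms to $O((N-1)m)$ and $O(km)$ terms respectively, as described above.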
It is expedient to denote the first term in that equation by $J_1(W, b)$, the second term by $J_2(W, b)$, and the third term by $J_3(W)$. Hence, Equation 4 can be rewritten in the form:

$$J(W, b) = J_1(W, b) + J_2(W, b) + J_3(W)$$

We now consider how to minimize $J(W, b)$ via gradient descent. A key step is the computation of $\nabla_{W^{(l)}} J(W, b)$ and $\nabla_{b^{(l)}} J(W, b)$, and this distills to the computation of $\frac{\partial J(W,b)}{\partial W^{(l)}_{ij}}$ and $\frac{\partial J(W,b)}{\partial b^{(l)}_i}$, for all $l = 1, 2, \ldots, n_l - 1$, for all $j = 1, 2, \ldots, s_l$, and for all $i = 1, 2, \ldots, s_{l+1}$. I shall illustrate my overall approach by showing how to compute $\frac{\partial J(W,b)}{\partial W^{(l)}_{ij}}$, since the computation of $\frac{\partial J(W,b)}{\partial b^{(l)}_i}$ is analogous. To this end, I begin with a rather trivial step and write:

$$\frac{\partial J(W,b)}{\partial W^{(l)}_{ij}} = \frac{\partial J_1(W,b)}{\partial W^{(l)}_{ij}} + \frac{\partial J_2(W,b)}{\partial W^{(l)}_{ij}} + \frac{\partial J_3(W)}{\partial W^{(l)}_{ij}} \qquad (6)$$

But the third partial derivative, $\frac{\partial J_3(W)}{\partial W^{(l)}_{ij}}$, is particularly straightforward to compute: $\frac{\partial J_3(W)}{\partial W^{(l)}_{ij}} = \lambda_4 W^{(l)}_{ij}$. Plugging this into Equation 6, we readily obtain:

$$\frac{\partial J(W,b)}{\partial W^{(l)}_{ij}} = \frac{\partial J_1(W,b)}{\partial W^{(l)}_{ij}} + \frac{\partial J_2(W,b)}{\partial W^{(l)}_{ij}} + \lambda_4 W^{(l)}_{ij} \qquad (7)$$

To proceed, let us introduce two definitions: $\hat{J}_1(W, b) = \frac{1}{2}\|a^{[t]} - a^{[u]}\|^2$, and $\hat{J}_2(W, b) = \|a^{[t]}\|_1$. By comparing Equation 4 with the expression $J(W, b) = J_1(W, b) + J_2(W, b) + J_3(W)$, observe that the first definition above allows us to write $J_1(W, b) = \sum_{t=1}^{m} \sum_{u=1}^{m} S(x^{[t]}, x^{[u]}) \hat{J}_1(W, b)$, so that:

$$\frac{\partial J_1(W,b)}{\partial W^{(l)}_{ij}} = \sum_{t=1}^{m} \sum_{u=1}^{m} S(x^{[t]}, x^{[u]}) \frac{\partial \hat{J}_1(W,b)}{\partial W^{(l)}_{ij}} \qquad (8)$$

In the same vein, notice that the second definition above permits us to write $J_2(W, b) = \frac{\lambda_3}{m} \sum_{t=1}^{m} \hat{J}_2(W, b)$, so that:

$$\frac{\partial J_2(W,b)}{\partial W^{(l)}_{ij}} = \frac{\lambda_3}{m} \sum_{t=1}^{m} \frac{\partial \hat{J}_2(W,b)}{\partial W^{(l)}_{ij}} \qquad (9)$$
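The weight decay piece is the simplest of the three contributions to the gradient: since $J_3 = \frac{\lambda_4}{2}$ times the sum of squared weights, its derivative carries the factor $\lambda_4$. A small editor's sketch, checked against a finite-difference quotient:

```python
import numpy as np

def weight_decay_grads(weights, lam4):
    """dJ3/dW^(l)_ij = lam4 * W^(l)_ij, since J3 = (lam4 / 2) * sum of squared weights."""
    return [lam4 * W for W in weights]

# Finite-difference check on a single weight (illustrative values).
W = [np.array([[0.5, -2.0], [1.5, 0.25]])]
lam4, eps = 0.1, 1e-6
J3 = lambda Ws: 0.5 * lam4 * sum((M ** 2).sum() for M in Ws)
Wp = [W[0].copy()]; Wp[0][0, 1] += eps
Wm = [W[0].copy()]; Wm[0][0, 1] -= eps
numeric = (J3(Wp) - J3(Wm)) / (2 * eps)
print(abs(numeric - weight_decay_grads(W, lam4)[0][0, 1]) < 1e-8)  # True
```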
In rounding off this section, we should point out that the expression for $\frac{\partial J(W,b)}{\partial W^{(l)}_{ij}}$ given in Equation 7 will be particularly useful in the gradient descent algorithm we shall be proposing in Section 5 for minimizing $J(W, b)$. The algorithm will first use Equations 8 and 9 to compute $\frac{\partial J_1(W,b)}{\partial W^{(l)}_{ij}}$ and $\frac{\partial J_2(W,b)}{\partial W^{(l)}_{ij}}$ respectively; it will then use Equation 7 to compute $\frac{\partial J(W,b)}{\partial W^{(l)}_{ij}}$, before proceeding to update the ANN weights in a gradient-descent style. However, it is clear from Equations 8 and 9 that if the algorithm is to be effective, then we must find ways of computing $\frac{\partial \hat{J}_1(W,b)}{\partial W^{(l)}_{ij}}$ and $\frac{\partial \hat{J}_2(W,b)}{\partial W^{(l)}_{ij}}$. The next two sections are devoted to these two tasks.

4 Deriving the Partial Derivatives of $\hat{J}_1(W, b)$, and an Algorithm for Computing Them

In this section, we focus on the computation of $\frac{\partial \hat{J}_1(W,b)}{\partial W^{(l)}_{ij}}$. From the preceding section, we know:

$$\hat{J}_1(W, b) = \frac{1}{2}\|a^{[t]} - a^{[u]}\|^2 \qquad (10)$$

Next, I define vector $A$ to be the column vector formed by stacking the column vector $a^{[t]}$ atop the column vector $a^{[u]}$. That is, $A = (a^{[t]T}, a^{[u]T})^T$, where $a^{[t]T}$ denotes the transpose of $a^{[t]}$. Notice that $\hat{J}_1(W, b)$ is a function of $A$, so that one may speak of computing $\nabla_A \hat{J}_1(W, b)$, $\nabla_{a^{[t]}} \hat{J}_1(W, b)$ and $\nabla_{a^{[u]}} \hat{J}_1(W, b)$. Also, observe that, since each of $a^{[t]}$ and $a^{[u]}$ is a function of the ANN weights, $W$, and biases, $b$, it follows that $A$ is also a function of weights and biases, so that one can as well speak of computing $\frac{\partial A}{\partial W^{(l)}_{ij}}$. Now, going back to Equation 10, we can use the chain rule to write an expression for $\frac{\partial \hat{J}_1(W,b)}{\partial W^{(l)}_{ij}}$:

$$\frac{\partial \hat{J}_1(W,b)}{\partial W^{(l)}_{ij}} = \Big(\frac{\partial A}{\partial W^{(l)}_{ij}}\Big)^T \nabla_A \hat{J}_1(W, b) \qquad (11)$$

Being a dot product of two vectors, the expression above is a sum of terms.
We can break the sum into two parts, one associated with the column vector $a^{[t]}$ and the other associated with the column vector $a^{[u]}$:

$$\frac{\partial \hat{J}_1(W,b)}{\partial W^{(l)}_{ij}} = \Big(\frac{\partial a^{[t]}}{\partial W^{(l)}_{ij}}\Big)^T \nabla_{a^{[t]}} \hat{J}_1(W, b) + \Big(\frac{\partial a^{[u]}}{\partial W^{(l)}_{ij}}\Big)^T \nabla_{a^{[u]}} \hat{J}_1(W, b) \qquad (12)$$

I will now argue that each of the terms in Equation 12 can be computed using the standard backpropagation algorithm. To this end, I consider a function that has a very similar form to $\hat{J}_1(W, b)$, except for a very small difference. In particular, I consider the function $\tilde{B}(a^{[t]}; c^{[t]})$, which I herein term a back-propagatable function, and which I define according to:

$$\tilde{B}(a^{[t]}; c^{[t]}) = \frac{1}{2}\|a^{[t]} - c^{[t]}\|^2 \qquad (13)$$

where $c^{[t]}$ is a constant that is independent of the ANN's weights and biases. We ask the reader to compare the function $\tilde{B}(a^{[t]}; c^{[t]}) = \frac{1}{2}\|a^{[t]} - c^{[t]}\|^2$ with the function $\hat{J}_1(W, b) = \frac{1}{2}\|a^{[t]} - a^{[u]}\|^2$ defined in Equation 10. In particular, the reader should note that $\tilde{B}(a^{[t]}; c^{[t]})$ can be obtained from $\hat{J}_1(W, b)$ simply by replacing the function (of weights and biases) $a^{[u]}$ in the latter by the constant $c^{[t]}$. However, what is more important is that the back-propagatable function in Equation 13 plays a central role in the expression of the sum-of-squares error that an ANN aimed at classification must try to minimize in supervised learning mode. Given training data $T = (x^{[t]}, y^{[t]})$, $t = 1, 2, 3, \ldots, m$, recall that that sum-of-squares error can be written as:

$$B(W, b) = \frac{1}{2m} \sum_{t=1}^{m} \|a^{[t]} - y^{[t]}\|^2 \qquad (14)$$

Clearly, in the above equation, $y^{[t]}$ is a constant independent of the ANN weights and biases.
Moreover, it is also clear that the expression inside the summation on the right-hand side of the equation perfectly fits into:

$$\tilde{B}(a^{[t]}; y^{[t]}) = \frac{1}{2}\|a^{[t]} - y^{[t]}\|^2 \qquad (15)$$

To proceed, we put Equation 15 into Equation 14 and obtain:

$$B(W, b) = \frac{1}{m} \sum_{t=1}^{m} \tilde{B}(a^{[t]}; y^{[t]}) \qquad (16)$$

In supervised learning mode, the objective of the ANN is to minimize the total error, $B(W, b)$. One of the most frequently employed techniques for minimizing $B(W, b)$ is the gradient descent algorithm, and this requires the partial derivatives of $B(W, b)$ with respect to weights and biases. Now, the partial derivative of $B(W, b)$ with respect to an arbitrary weight $W^{(l)}_{ij}$ can be expressed as:

$$\frac{\partial B(W,b)}{\partial W^{(l)}_{ij}} = \frac{1}{m} \sum_{t=1}^{m} \frac{\partial \tilde{B}(a^{[t]}; y^{[t]})}{\partial W^{(l)}_{ij}} \qquad (17)$$

So, in essence, the task of computing $\frac{\partial B(W,b)}{\partial W^{(l)}_{ij}}$ boils down to the computation of $\frac{\partial \tilde{B}(a^{[t]}; y^{[t]})}{\partial W^{(l)}_{ij}}$, followed by summing over all $t \in \{1, 2, \ldots, m\}$ (i.e. summing over the training data); the backpropagation algorithm is normally employed for the computation of $\frac{\partial \tilde{B}(a^{[t]}; y^{[t]})}{\partial W^{(l)}_{ij}}$. Generalizing, we see that the backpropagation algorithm can always be used to compute the partial derivatives, $\frac{\partial \tilde{B}(a^{[t]}; c^{[t]})}{\partial W^{(l)}_{ij}}$, of any function of the form $\tilde{B}(a^{[t]}; c^{[t]}) = \frac{1}{2}\|a^{[t]} - c^{[t]}\|^2$, such that $c^{[t]}$ is a constant which plays the role that the target output value, $y^{[t]}$, plays in the supervised learning mode of ANNs. To proceed from here, we shall let $\hat{a}^{[u]}$ denote the specific constant value that the function $a^{[u]}$ ($a^{[u]}$ is a function of the ANN weights and biases) evaluates to for a given value of $u$ and a given set of weights and biases.
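Since $\tilde{B}(a; c)$ has exactly the squared-error form, its partial derivatives can be produced by an ordinary backpropagation pass with $c$ in the target slot. Below is a minimal editor's sketch, assuming sigmoid units throughout (the paper allows other transfer functions) and illustrative layer sizes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sbp(x, target, weights, biases):
    """Standard backpropagation for B~(a; c) = 0.5 * ||a - c||^2, where 'target'
    is any constant c playing the role of y^[t]. Returns the gradients
    dB~/dW^(l) and dB~/db^(l) for every layer l."""
    # Feedforward, caching all activations.
    acts, a = [x], x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
        acts.append(a)
    # Output-layer delta for squared error with sigmoid units.
    delta = (acts[-1] - target) * acts[-1] * (1 - acts[-1])
    grads_W, grads_b = [], []
    for l in range(len(weights) - 1, -1, -1):
        grads_W.insert(0, np.outer(delta, acts[l]))
        grads_b.insert(0, delta.copy())
        if l > 0:
            delta = (weights[l].T @ delta) * acts[l] * (1 - acts[l])
    return grads_W, grads_b

# Illustrative call on a tiny 2-3-2 network.
rng = np.random.default_rng(1)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((2, 3))]
biases = [rng.standard_normal(3), rng.standard_normal(2)]
x, c = np.array([0.3, -0.7]), np.array([0.1, 0.9])
grads_W, grads_b = sbp(x, c, weights, biases)
```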
We then consider the function $\tilde{B}(a^{[t]}; \hat{a}^{[u]})$, in which $\hat{a}^{[u]}$ plays the role that the target output value, $y^{[t]}$, plays in the function $\tilde{B}(a^{[t]}; y^{[t]})$. With the foregoing in mind, let us now write an expression for $\frac{\partial \tilde{B}(a^{[t]}; \hat{a}^{[u]})}{\partial W^{(l)}_{ij}}$. With the aid of the chain rule, we have:

$$\frac{\partial \tilde{B}(a^{[t]}; \hat{a}^{[u]})}{\partial W^{(l)}_{ij}} = \Big(\frac{\partial a^{[t]}}{\partial W^{(l)}_{ij}}\Big)^T \nabla_{a^{[t]}} \tilde{B}(a^{[t]}; \hat{a}^{[u]}) \qquad (18)$$

We now compare the quantity $\big(\frac{\partial a^{[t]}}{\partial W^{(l)}_{ij}}\big)^T \nabla_{a^{[t]}} \tilde{B}(a^{[t]}; \hat{a}^{[u]})$ on the right-hand side of Equation 18 with the quantity $\big(\frac{\partial a^{[t]}}{\partial W^{(l)}_{ij}}\big)^T \nabla_{a^{[t]}} \hat{J}_1(W, b)$, which occurs as the first term on the right-hand side of Equation 12. We claim that, for a given value $\hat{a}^{[u]}$, both are equal, in the sense that $\big[\big(\frac{\partial a^{[t]}}{\partial W^{(l)}_{ij}}\big)^T \nabla_{a^{[t]}} \hat{J}_1(W, b)\big]\big|_{\hat{a}^{[u]}} = \big(\frac{\partial a^{[t]}}{\partial W^{(l)}_{ij}}\big)^T \nabla_{a^{[t]}} \tilde{B}(a^{[t]}; \hat{a}^{[u]})$, where, as explained in the section on notation, we have written $\big[\big(\frac{\partial a^{[t]}}{\partial W^{(l)}_{ij}}\big)^T \nabla_{a^{[t]}} \hat{J}_1(W, b)\big]\big|_{\hat{a}^{[u]}}$ instead of $\big[\big(\frac{\partial a^{[t]}}{\partial W^{(l)}_{ij}}\big)^T \nabla_{a^{[t]}} \hat{J}_1(W, b)\big]\big|_{a^{[u]}=\hat{a}^{[u]}}$. Before we show the equality, let us first highlight what we stand to gain if the two quantities are truly equal. In particular, the equality clearly implies $\big[\big(\frac{\partial a^{[t]}}{\partial W^{(l)}_{ij}}\big)^T \nabla_{a^{[t]}} \hat{J}_1(W, b)\big]\big|_{\hat{a}^{[u]}} = \frac{\partial \tilde{B}(a^{[t]}; \hat{a}^{[u]})}{\partial W^{(l)}_{ij}}$. But, according to our discussion in the previous paragraph, we know that $\frac{\partial \tilde{B}(a^{[t]}; \hat{a}^{[u]})}{\partial W^{(l)}_{ij}}$ can be computed via backpropagation. Thus, if the equality really holds, we would in essence have found a means of computing $\big[\big(\frac{\partial a^{[t]}}{\partial W^{(l)}_{ij}}\big)^T \nabla_{a^{[t]}} \hat{J}_1(W, b)\big]\big|_{\hat{a}^{[u]}}$.
Now, this is important to us because, as given in Equation 12, $\big(\frac{\partial a^{[t]}}{\partial W^{(l)}_{ij}}\big)^T \nabla_{a^{[t]}} \hat{J}_1(W, b)$ is one of the two terms involved in our expression for $\frac{\partial \hat{J}_1(W,b)}{\partial W^{(l)}_{ij}}$, which in turn is a partial derivative we need for minimizing our objective function via gradient descent. Indeed, it is very simple to show the required equality. Specifically, since $\big(\frac{\partial a^{[t]}}{\partial W^{(l)}_{ij}}\big)^T\big|_{\hat{a}^{[u]}} = \big(\frac{\partial a^{[t]}}{\partial W^{(l)}_{ij}}\big)^T$, all we need show is $\nabla_{a^{[t]}} \hat{J}_1(W, b)\big|_{\hat{a}^{[u]}} = \nabla_{a^{[t]}} \tilde{B}(a^{[t]}; \hat{a}^{[u]})$. A simple 'proof' follows.

'Proof' of $\nabla_{a^{[t]}} \hat{J}_1(W, b)\big|_{\hat{a}^{[u]}} = \nabla_{a^{[t]}} \tilde{B}(a^{[t]}; \hat{a}^{[u]})$: We begin with $\hat{J}_1(W, b) = \frac{1}{2}\|a^{[t]} - a^{[u]}\|^2 = \frac{1}{2} \sum_{i=1}^{s_{n_l}} (a^{[t]}_i - a^{[u]}_i)^2$, where $a^{[t]}_i$ denotes the activation of the $i$-th neuron in the output layer when the training example $x^{[t]}$ is applied as input to the ANN, and $s_{n_l}$ is the number of neurons in the output layer. We then find $\nabla_{a^{[t]}} \hat{J}_1(W, b) = (a^{[t]}_1 - a^{[u]}_1, a^{[t]}_2 - a^{[u]}_2, \ldots, a^{[t]}_{s_{n_l}} - a^{[u]}_{s_{n_l}})^T$. Hence, for a given output value $\hat{a}^{[u]}$ of the function $a^{[u]}$, we obtain $\nabla_{a^{[t]}} \hat{J}_1(W, b)\big|_{\hat{a}^{[u]}} = (a^{[t]}_1 - \hat{a}^{[u]}_1, a^{[t]}_2 - \hat{a}^{[u]}_2, \ldots, a^{[t]}_{s_{n_l}} - \hat{a}^{[u]}_{s_{n_l}})^T$. On the other hand, we have $\tilde{B}(a^{[t]}; \hat{a}^{[u]}) = \frac{1}{2}\|a^{[t]} - \hat{a}^{[u]}\|^2 = \sum_{i=1}^{s_{n_l}} \frac{1}{2}(a^{[t]}_i - \hat{a}^{[u]}_i)^2$, from which we may compute $\nabla_{a^{[t]}} \tilde{B}(a^{[t]}; \hat{a}^{[u]}) = (a^{[t]}_1 - \hat{a}^{[u]}_1, a^{[t]}_2 - \hat{a}^{[u]}_2, \ldots, a^{[t]}_{s_{n_l}} - \hat{a}^{[u]}_{s_{n_l}})^T$. Hence, $\nabla_{a^{[t]}} \hat{J}_1(W, b)\big|_{\hat{a}^{[u]}} = \nabla_{a^{[t]}} \tilde{B}(a^{[t]}; \hat{a}^{[u]})$, as required.
Before moving on, let us quickly 'save' a key implication of the above result for future reference:

$$\Big[\Big(\frac{\partial a^{[t]}}{\partial W^{(l)}_{ij}}\Big)^T \nabla_{a^{[t]}} \hat{J}_1(W, b)\Big]\Big|_{\hat{a}^{[u]}} = \frac{\partial \tilde{B}(a^{[t]}; \hat{a}^{[u]})}{\partial W^{(l)}_{ij}} \qquad (19)$$

At this juncture, we pause to take stock of what we have achieved so far. Essentially, we now have a means for computing the first of the two terms required for the computation of $\frac{\partial \hat{J}_1(W,b)}{\partial W^{(l)}_{ij}}$ as given in Equation 12. Hence, my natural next line of action is to consider how to compute the second of the two terms, which is the quantity $\big(\frac{\partial a^{[u]}}{\partial W^{(l)}_{ij}}\big)^T \nabla_{a^{[u]}} \hat{J}_1(W, b)$. As it turns out, the approach to the computation of $\big(\frac{\partial a^{[u]}}{\partial W^{(l)}_{ij}}\big)^T \nabla_{a^{[u]}} \hat{J}_1(W, b)\big|_{\hat{a}^{[t]}}$ is perfectly analogous to that of $\big(\frac{\partial a^{[t]}}{\partial W^{(l)}_{ij}}\big)^T \nabla_{a^{[t]}} \hat{J}_1(W, b)\big|_{\hat{a}^{[u]}}$, which we have already dealt with, so I will not go into all the details. Only two things are required. First, one needs to take advantage of the fact that $(-1)^2 = 1$, which allows one to rewrite $\hat{J}_1(W, b)$ in the form $\hat{J}_1(W, b) = \frac{1}{2}\|a^{[u]} - a^{[t]}\|^2$. Second, one considers a back-propagatable function of the form $\tilde{B}(a^{[u]}; \hat{a}^{[t]}) = \frac{1}{2}\|a^{[u]} - \hat{a}^{[t]}\|^2$ (compare this with the form $\tilde{B}(a^{[t]}; \hat{a}^{[u]}) = \frac{1}{2}\|a^{[t]} - \hat{a}^{[u]}\|^2$ used earlier on). Based on this, the current scenario becomes perfectly analogous to the one we dealt with earlier while showing the claim $\big[\big(\frac{\partial a^{[t]}}{\partial W^{(l)}_{ij}}\big)^T \nabla_{a^{[t]}} \hat{J}_1(W, b)\big]\big|_{\hat{a}^{[u]}} = \big(\frac{\partial a^{[t]}}{\partial W^{(l)}_{ij}}\big)^T \nabla_{a^{[t]}} \tilde{B}(a^{[t]}; \hat{a}^{[u]})$. Hence, by the analogy, it should be clear that, for the present scenario, we must have $\big[\big(\frac{\partial a^{[u]}}{\partial W^{(l)}_{ij}}\big)^T \nabla_{a^{[u]}} \hat{J}_1(W, b)\big]\big|_{\hat{a}^{[t]}} = \big(\frac{\partial a^{[u]}}{\partial W^{(l)}_{ij}}\big)^T \nabla_{a^{[u]}} \tilde{B}(a^{[u]}; \hat{a}^{[t]})$.
Furthermore, for the present situation, we have an analogue of Equation 19 as follows:

$$\Big[\Big(\frac{\partial a^{[u]}}{\partial W^{(l)}_{ij}}\Big)^T \nabla_{a^{[u]}} \hat{J}_1(W, b)\Big]\Big|_{\hat{a}^{[t]}} = \frac{\partial \tilde{B}(a^{[u]}; \hat{a}^{[t]})}{\partial W^{(l)}_{ij}} \qquad (20)$$

Now, when we combine Equations 12, 19 and 20, it should not be hard to see that, for a given pair of constants $(\hat{a}^{[t]}, \hat{a}^{[u]})$, we must have:

$$\frac{\partial \hat{J}_1(W,b)}{\partial W^{(l)}_{ij}}\Big|_{(\hat{a}^{[t]}, \hat{a}^{[u]})} = \frac{\partial \tilde{B}(a^{[t]}; \hat{a}^{[u]})}{\partial W^{(l)}_{ij}}\Big|_{\hat{a}^{[t]}} + \frac{\partial \tilde{B}(a^{[u]}; \hat{a}^{[t]})}{\partial W^{(l)}_{ij}}\Big|_{\hat{a}^{[u]}} \qquad (21)$$

A key essence of Equation 21 is that it provides an avenue for computing $\frac{\partial \hat{J}_1(W,b)}{\partial W^{(l)}_{ij}}$, for a given pair of constants $(\hat{a}^{[t]}, \hat{a}^{[u]})$, using the backpropagation algorithm. Let us illustrate the 'avenue' by considering how to compute the quantity $\frac{\partial \tilde{B}(a^{[t]}; \hat{a}^{[u]})}{\partial W^{(l)}_{ij}}\big|_{\hat{a}^{[t]}}$, which occurs in the equation. To compute it, we simply call the standard backpropagation procedure, passing the current training example $x^{[t]}$ together with $\hat{a}^{[u]}$. In the procedure, $\hat{a}^{[u]}$ plays the role of the target output vector, while $x^{[t]}$ is used to calculate $\hat{a}^{[t]}$, which then plays the role of the current vector of output-layer activations from the ANN. On the flip side, to compute $\frac{\partial \tilde{B}(a^{[u]}; \hat{a}^{[t]})}{\partial W^{(l)}_{ij}}\big|_{\hat{a}^{[u]}}$, we pass the training example $x^{[u]}$ (not $x^{[t]}$), along with $\hat{a}^{[t]}$ (not $\hat{a}^{[u]}$), as arguments to the standard backpropagation procedure. This time, $\hat{a}^{[u]}$, which can be computed from $x^{[u]}$, plays the role of the current vector of activations produced by the ANN, while $\hat{a}^{[t]}$ plays the role of the target output vector.
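Putting Equations 19 through 21 together, the whole of $\frac{\partial \hat{J}_1}{\partial W^{(l)}_{ij}}$ reduces to two calls of a standard squared-error backpropagation routine, one per member of the pair. The sketch below is the editor's illustration, assuming sigmoid units and an illustrative network size; Equation 21's output is checked against a finite-difference quotient of $\hat{J}_1$ itself:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Feedforward pass, returning all layer activations."""
    acts, a = [x], x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
        acts.append(a)
    return acts

def sbp_grads(x, target, weights, biases):
    """dB~(a; target)/dW^(l) for B~ = 0.5||a - target||^2, via standard backprop."""
    acts = forward(x, weights, biases)
    delta = (acts[-1] - target) * acts[-1] * (1 - acts[-1])
    grads = []
    for l in range(len(weights) - 1, -1, -1):
        grads.insert(0, np.outer(delta, acts[l]))
        if l > 0:
            delta = (weights[l].T @ delta) * acts[l] * (1 - acts[l])
    return grads

def j1hat_grads(x_t, x_u, weights, biases):
    """Equation 21: dJ1^/dW^(l) as the sum of two standard backprop results."""
    a_t = forward(x_t, weights, biases)[-1]    # \hat{a}^[t]
    a_u = forward(x_u, weights, biases)[-1]    # \hat{a}^[u]
    g1 = sbp_grads(x_t, a_u, weights, biases)  # \hat{a}^[u] in the target slot
    g2 = sbp_grads(x_u, a_t, weights, biases)  # \hat{a}^[t] in the target slot
    return [g + h for g, h in zip(g1, g2)]

# Illustrative 2-3-2 network and input pair.
rng = np.random.default_rng(2)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((2, 3))]
biases = [rng.standard_normal(3), rng.standard_normal(2)]
xt, xu = np.array([0.2, 0.8]), np.array([-0.5, 0.4])
g = j1hat_grads(xt, xu, weights, biases)
```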
Based on the foregoing, we naturally arrive at the following algorithm for computing $\frac{\partial \hat{J}_1(\mathbf{W},\mathbf{b})}{\partial W^{(l)}_{ij}}$:

Algorithm 1: Harnessing the Standard Back-Propagation (SBP) Procedure for Computing $\frac{\partial \hat{J}_1(\mathbf{W},\mathbf{b})}{\partial W^{(l)}_{ij}}$

Inputs: A set of current ANN weights, $W^{(l)}_{ij}$, and biases, $b^{(l)}_i$, $l = 1, 2, \ldots, n_l$, $i = 1, 2, \ldots, s_l$, and $j = 1, 2, \ldots, s_{l-1}$, along with a pair of input vectors $(x^{[t]}, x^{[u]})$.

1). Use the input vector $x^{[t]}$ to perform an initialisation by filling the ANN 'input activations' with $x^{[t]}$. That is, in the input layer, FOR each $i = 1, 2, \ldots, s_1$, set: $a^{(1)[t]}_i := x^{[t]}_i$ END FOR

2). Perform a feedforward pass through the ANN, computing all activations throughout the network. That is, FOR each $l = 1, 2, \ldots, n_l - 1$, FOR each $i = 1, 2, \ldots, s_{l+1}$, set:
2a). $z^{(l+1)[t]}_i := \sum_{j=1}^{s_l} W^{(l)}_{ij} a^{(l)[t]}_j + b^{(l)}_i$
2b). $a^{(l+1)[t]}_i := f(z^{(l+1)[t]}_i)$
END FOR, END FOR

3). Obtain $\hat{a}^{[t]}$ as the vector of output-layer activations computed in Step 2b above. That is, FOR each $i = 1, 2, \ldots, s_{n_l}$, set: $\hat{a}^{[t]}_i := a^{(n_l)[t]}_i$ END FOR

4). Repeat Steps 1 to 3, but this time using $x^{[u]}$ rather than $x^{[t]}$ (and using variable $u$ rather than $t$), thereby obtaining the vector $\hat{a}^{[u]}$.

5). Use the SBP procedure to compute $\frac{\partial \tilde{B}(a^{[t]}; \hat{a}^{[u]})}{\partial W^{(l)}_{ij}}\big|_{\hat{a}^{[t]}}$, by passing $x^{[t]}$ and $\hat{a}^{[u]}$ to the procedure; this first time, $x^{[t]}$ plays the role of the current ANN input, while $\hat{a}^{[u]}$ plays the role of the target output vector: $\Delta\tilde{B}_1 := \frac{\partial \tilde{B}(a^{[t]}; \hat{a}^{[u]})}{\partial W^{(l)}_{ij}}\big|_{\hat{a}^{[t]}}$

6).
Use the SBP procedure to compute $\frac{\partial \tilde{B}(a^{[u]}; \hat{a}^{[t]})}{\partial W^{(l)}_{ij}}\big|_{\hat{a}^{[u]}}$, by passing $x^{[u]}$ and $\hat{a}^{[t]}$ to the procedure; this second time, $x^{[u]}$ plays the role of the current ANN input, while $\hat{a}^{[t]}$ plays the role of the target output vector: $\Delta\tilde{B}_2 := \frac{\partial \tilde{B}(a^{[u]}; \hat{a}^{[t]})}{\partial W^{(l)}_{ij}}\big|_{\hat{a}^{[u]}}$

7). Compute the final result: $\frac{\partial \hat{J}_1(\mathbf{W},\mathbf{b})}{\partial W^{(l)}_{ij}} := \Delta\tilde{B}_1 + \Delta\tilde{B}_2$

5 Deriving the Partial Derivatives $\frac{\partial \hat{J}_2(\mathbf{W},\mathbf{b})}{\partial W^{(l)}_{ij}}$ and an Algorithm for Computing Them

In the preceding section, we derived an expression, and a corresponding algorithm, for computing $\frac{\partial \hat{J}_1(\mathbf{W},\mathbf{b})}{\partial W^{(l)}_{ij}}$. In this section, we derive an expression for $\frac{\partial \hat{J}_2(\mathbf{W},\mathbf{b})}{\partial W^{(l)}_{ij}}$ and furnish an algorithm for its computation. To proceed, let us recall from Section 3 that:

\[ \hat{J}_2(\mathbf{W}, \mathbf{b}) = \| a^{(n_l)[t]} \|_1 \tag{22} \]

Now, keeping the notation of Section 2 in mind, the above can also be written as:

\[ \hat{J}_2(\mathbf{W}, \mathbf{b}) = \sum_{i=1}^{s_{n_l}} \left| a^{(n_l)[t]}_i \right| \tag{23} \]

In what follows, we will simply write $|a^{(n_l)}_i|$ instead of $|a^{(n_l)[t]}_i|$, for a less clumsy notation; the reader is asked to keep this in mind. The ultimate plan is to derive a variant of the backpropagation algorithm for the computation of $\frac{\partial \hat{J}_2(\mathbf{W},\mathbf{b})}{\partial W^{(l)}_{ij}}$ for all $l$, $i$ and $j$. To this end, using the notation of Section 2, we first need to spell out the pertinent feed-forward equations:

\[ z^{(l+1)}_i = \sum_{j=1}^{s_l} W^{(l)}_{ij} a^{(l)}_j + b^{(l)}_i \tag{24} \]

\[ a^{(l)}_i = f(z^{(l)}_i) \tag{25} \]

where $f(\cdot)$ denotes the relevant transfer function, which is typically the sigmoid, tan-sigmoid or linear transfer function, depending on the layer in question.
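As a concrete illustration, the feed-forward recursion of Equations 24 and 25, together with the $L_1$ objective of Equation 23, can be written out index-by-index as follows (a minimal NumPy sketch; the layer sizes, the tanh transfer function, and the random weights are hypothetical choices):

```python
import numpy as np

def f(z):
    return np.tanh(z)   # one choice of transfer function; sigmoid/linear also fit

# Hypothetical layer sizes s_1, s_2, s_3 and small random parameters
sizes = [3, 4, 2]
rng = np.random.default_rng(1)
W = [rng.normal(scale=0.5, size=(sizes[l + 1], sizes[l])) for l in range(2)]
b = [rng.normal(scale=0.5, size=sizes[l + 1]) for l in range(2)]

x_t = np.array([0.2, -0.7, 1.1])

# Equations 24-25, written with the same explicit indices as the text
a = x_t
for l in range(2):
    z = np.empty(sizes[l + 1])
    for i in range(sizes[l + 1]):
        z[i] = sum(W[l][i, j] * a[j] for j in range(sizes[l])) + b[l][i]
    a = f(z)

# Equation 23: J2_hat is the L1 norm of the output-layer activations
J2_hat = np.sum(np.abs(a))
```

The explicit loops mirror the summation over $j$ in Equation 24; in practice one would use the equivalent matrix-vector form `f(W[l] @ a + b[l])`.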
Also, we define a 'sign function' according to:

\[ \mathrm{sign}(a_i) = \begin{cases} +1 & \text{if } a_i > 0 \\ -1 & \text{if } a_i < 0 \\ 0 & \text{if } a_i = 0 \end{cases} \tag{26} \]

Based on the above, one can show, using the chain rule, that $\frac{\partial \hat{J}_2(\mathbf{W},\mathbf{b})}{\partial W^{(n_l-1)}_{ij}}$, which is the partial derivative with respect to an arbitrary weight that projects to the output layer, $l = n_l$, from the penultimate layer, $l = n_l - 1$, is given by:

\[ \frac{\partial \hat{J}_2(\mathbf{W},\mathbf{b})}{\partial W^{(n_l-1)}_{ij}} = \mathrm{sign}(a^{(n_l)}_i) f'(z^{(n_l)}_i) a^{(n_l-1)}_j \tag{27} \]

where $f'(\cdot)$ denotes the derivative of the transfer function $f(\cdot)$. In anticipation of what is to come next, we define a 'signed derivative', denoted $\beta^{(n_l)}_i$, as follows:

\[ \beta^{(n_l)}_i = \mathrm{sign}(a^{(n_l)}_i) f'(z^{(n_l)}_i) \tag{28} \]

Let us note the similarity between the form of $\beta^{(n_l)}_i$ and the so-called scaled error, $\delta^{(n_l)}_i = (a^{(n_l)}_i - y_i) f'(z^{(n_l)}_i)$ (where $y_i$ denotes the $i$-th component of the current target output training vector), which plays a key role in the standard backpropagation algorithm. In particular, let us make the key observation that $\mathrm{sign}(a^{(n_l)}_i)$ in $\beta^{(n_l)}_i$ plays the role which $(a^{(n_l)}_i - y_i)$ plays in $\delta^{(n_l)}_i$. Indeed, we can easily express $\frac{\partial \hat{J}_2(\mathbf{W},\mathbf{b})}{\partial W^{(n_l-1)}_{ij}}$ in terms of $\beta^{(n_l)}_i$ by putting Equation 28 into Equation 27:

\[ \frac{\partial \hat{J}_2(\mathbf{W},\mathbf{b})}{\partial W^{(n_l-1)}_{ij}} = \beta^{(n_l)}_i a^{(n_l-1)}_j \tag{29} \]

To proceed, let us now move one layer back through the ANN, and attempt to write an expression for $\frac{\partial \hat{J}_2(\mathbf{W},\mathbf{b})}{\partial W^{(n_l-2)}_{ij}}$. Again, using the chain rule, one can show that:

\[ \frac{\partial \hat{J}_2(\mathbf{W},\mathbf{b})}{\partial W^{(n_l-2)}_{ij}} = \sum_{p=1}^{s_{n_l}} \mathrm{sign}(a^{(n_l)}_p) f'(z^{(n_l)}_p) W^{(n_l-1)}_{pi} f'(z^{(n_l-1)}_i) a^{(n_l-2)}_j \tag{30} \]

But, from Equation 28, it is clear that for any index variable $p$, we can write $\beta^{(n_l)}_p = \mathrm{sign}(a^{(n_l)}_p) f'(z^{(n_l)}_p)$.
Putting this into Equation 30, we readily obtain:

\[ \frac{\partial \hat{J}_2(\mathbf{W},\mathbf{b})}{\partial W^{(n_l-2)}_{ij}} = \sum_{p=1}^{s_{n_l}} \beta^{(n_l)}_p W^{(n_l-1)}_{pi} f'(z^{(n_l-1)}_i) a^{(n_l-2)}_j \tag{31} \]

At this juncture, we point out that if $\hat{J}_2(\mathbf{W},\mathbf{b})$ had been defined by $\hat{J}_2(\mathbf{W},\mathbf{b}) = \frac{1}{2}\sum_{i=1}^{s_{n_l}} (a^{(n_l)[t]}_i - y_i)^2$ rather than $\hat{J}_2(\mathbf{W},\mathbf{b}) = \sum_{i=1}^{s_{n_l}} |a^{(n_l)[t]}_i|$, then we would have had $\frac{\partial \hat{J}_2(\mathbf{W},\mathbf{b})}{\partial W^{(n_l-1)}_{ij}} = \delta^{(n_l)}_i a^{(n_l-1)}_j$ and $\frac{\partial \hat{J}_2(\mathbf{W},\mathbf{b})}{\partial W^{(n_l-2)}_{ij}} = \sum_{p=1}^{s_{n_l}} \delta^{(n_l)}_p W^{(n_l-1)}_{pi} f'(z^{(n_l-1)}_i) a^{(n_l-2)}_j$. This observation indicates that, for the problem at hand, we can obtain a backpropagation algorithm for computing the partial derivatives of $\hat{J}_2(\mathbf{W},\mathbf{b})$ simply by letting $\beta^{(l)}_i$ play the role which $\delta^{(l)}_i$ normally plays in the standard backpropagation algorithm. As a specific implication of the above observation, one could define $\beta^{(n_l-1)}_i = \sum_{p=1}^{s_{n_l}} \beta^{(n_l)}_p W^{(n_l-1)}_{pi} f'(z^{(n_l-1)}_i)$, in analogy with what is usually done in the standard backpropagation setting, and then write $\frac{\partial \hat{J}_2(\mathbf{W},\mathbf{b})}{\partial W^{(n_l-2)}_{ij}} = \beta^{(n_l-1)}_i a^{(n_l-2)}_j$. Now, generalizing, for any $l \in \{1, 2, \ldots, n_l - 1\}$, we can define $\beta^{(l)}_i$ according to:

\[ \beta^{(l)}_i = \sum_{p=1}^{s_{l+1}} \beta^{(l+1)}_p W^{(l)}_{pi} f'(z^{(l)}_i) \tag{32} \]

where the computation of $\beta^{(n_l)}_i$ (i.e., the base step) is already given in Equation 28. Hence, we are led to propose the algorithm below for computing the partial derivatives of $\hat{J}_2(\mathbf{W},\mathbf{b})$ with respect to ANN weights (and biases as well):

Algorithm 2: A Back-Propagation Algorithm for Calculating $\frac{\partial \hat{J}_2(\mathbf{W},\mathbf{b})}{\partial W^{(l)}_{ij}}$ and $\frac{\partial \hat{J}_2(\mathbf{W},\mathbf{b})}{\partial b^{(l)}_i}$

Inputs: The set of current ANN weights, $W^{(l)}_{ij}$, and biases, $b^{(l)}_i$, $l = 1, 2, \ldots, n_l$, $i = 1, 2, \ldots, s_l$, and $j = 1, 2, \ldots$
, $s_{l-1}$, along with a specific input vector, $x^{[t]} = (x^{[t]}_1, x^{[t]}_2, \ldots, x^{[t]}_{s_1})$.

1). Perform initialisation by filling the ANN 'input activations' with the input vector $x^{[t]}$. That is, in the input layer, FOR each $i = 1, 2, \ldots, s_1$, set: $a^{(1)}_i := x^{[t]}_i$ END FOR

2). Perform a feedforward pass through the ANN, computing all activations throughout the network. That is, FOR each $l = 1, 2, \ldots, n_l - 1$, FOR each $i = 1, 2, \ldots, s_{l+1}$, set:
2a). $z^{(l+1)}_i := \sum_{j=1}^{s_l} W^{(l)}_{ij} a^{(l)}_j + b^{(l)}_i$
2b). $a^{(l+1)}_i := f(z^{(l+1)}_i)$
END FOR, END FOR

3). At the output layer (i.e., layer $n_l$) of the ANN, FOR each $i = 1, 2, \ldots, s_{n_l}$, compute: $\beta^{(n_l)}_i := \mathrm{sign}(a^{(n_l)}_i) f'(z^{(n_l)}_i)$ END FOR

4). FOR each $l = n_l - 1, n_l - 2, \ldots, 1$ (i.e., working backwards through the network, since each $\beta^{(l)}$ depends on $\beta^{(l+1)}$):
4a). FOR each $i = 1, 2, \ldots, s_{l+1}$ and each $j = 1, 2, \ldots, s_l$, set $\frac{\partial \hat{J}_2(\mathbf{W},\mathbf{b})}{\partial W^{(l)}_{ij}} := \beta^{(l+1)}_i a^{(l)}_j$ END FOR
4b). FOR each $i = 1, 2, \ldots, s_{l+1}$, set $\frac{\partial \hat{J}_2(\mathbf{W},\mathbf{b})}{\partial b^{(l)}_i} := \beta^{(l+1)}_i$ END FOR
4c). FOR each $i = 1, 2, \ldots, s_l$, set $\beta^{(l)}_i := \sum_{j=1}^{s_{l+1}} \beta^{(l+1)}_j W^{(l)}_{ji} f'(z^{(l)}_i)$ END FOR
END FOR

6 A Gradient Descent Algorithm for Minimizing SENNS's Objective Function

In this section, we describe a gradient descent algorithm for the minimisation of our objective function, which we gave in Equation 4. The algorithm is labelled Algorithm 3 below. It relies on Algorithms 1 and 2 for computing $\frac{\partial \hat{J}_1(\mathbf{W},\mathbf{b})}{\partial W^{(l)}_{ij}}$ and $\frac{\partial \hat{J}_2(\mathbf{W},\mathbf{b})}{\partial W^{(l)}_{ij}}$ respectively. It then implicitly utilizes Equations 8 and 9 of Section 3 for the computation of $\frac{\partial J_1(\mathbf{W},\mathbf{b})}{\partial W^{(l)}_{ij}}$ and $\frac{\partial J_2(\mathbf{W},\mathbf{b})}{\partial W^{(l)}_{ij}}$ respectively, before using Equation 7 of that same section to compute $\frac{\partial J(\mathbf{W},\mathbf{b})}{\partial W^{(l)}_{ij}}$.
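Algorithm 2 admits a compact implementation. The sketch below (NumPy; the tanh transfer function, layer sizes, and random weights are hypothetical choices for illustration) feeds forward per Steps 1-2, seeds $\beta^{(n_l)}$ per Step 3, and then walks backwards through the layers per Step 4, emitting the weight and bias gradients of $\hat{J}_2 = \|a^{(n_l)}\|_1$:

```python
import numpy as np

def f(z):
    return np.tanh(z)

def fprime(z):
    return 1.0 - np.tanh(z) ** 2

def grad_J2(Ws, bs, x):
    """Algorithm 2 sketch: back-propagate beta (Eqs. 28 and 32) to obtain
    dJ2_hat/dW^(l) and dJ2_hat/db^(l), where J2_hat = ||a^(n_l)||_1."""
    # Steps 1-2: feed-forward, caching z and a for every layer
    a, zs = [x], []
    for W, b in zip(Ws, bs):
        zs.append(W @ a[-1] + b)
        a.append(f(zs[-1]))
    # Step 3: beta at the output layer (Eq. 28)
    beta = np.sign(a[-1]) * fprime(zs[-1])
    # Step 4: walk backwards, emitting gradients and recursing on beta (Eq. 32)
    gW, gb = [None] * len(Ws), [None] * len(bs)
    for l in range(len(Ws) - 1, -1, -1):
        gW[l] = np.outer(beta, a[l])                      # Step 4a
        gb[l] = beta.copy()                               # Step 4b
        if l > 0:
            beta = (Ws[l].T @ beta) * fprime(zs[l - 1])   # Step 4c
    return gW, gb

rng = np.random.default_rng(2)
sizes = [4, 5, 3]
Ws = [rng.normal(scale=0.5, size=(sizes[k + 1], sizes[k])) for k in range(2)]
bs = [rng.normal(scale=0.5, size=sizes[k + 1]) for k in range(2)]
x = rng.normal(size=4)
gW, gb = grad_J2(Ws, bs, x)
```

Away from the kink of the absolute value at zero, these analytic gradients can be sanity-checked against finite differences of $\|a^{(n_l)}\|_1$.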
Finally, using a learning rate, $\alpha$, the algorithm proceeds to update the ANN weights in a gradient-descent fashion according to:

\[ W^{(l)\,\mathrm{new}}_{ij} = W^{(l)\,\mathrm{old}}_{ij} - \alpha \frac{\partial J(\mathbf{W},\mathbf{b})}{\partial W^{(l)}_{ij}} \tag{33} \]

where $W^{(l)\,\mathrm{old}}_{ij}$ and $W^{(l)\,\mathrm{new}}_{ij}$ denote the old and new weights respectively. Without any further ado, here is the algorithm we set out to derive:

Algorithm 3: Minimisation of SENNS's Objective Function via Gradient Descent: Weights Version

Inputs: Regularisation parameters, $\lambda_1, \lambda_2, \lambda_3, \lambda_4$, a learning rate, $\alpha$, a training set, $x^{[t]}$, $t = 1, 2, \ldots, m$, a parameter $M_C$, which is the number of training-set pairs belonging to the same class, a parameter $M_D$, which is the number of training-set pairs belonging to different classes, and a set of initial randomized small weights, $W^{(l)}_{ij}$, $l = 1, 2, \ldots, n_l$, $i = 1, 2, \ldots, s_l$, and $j = 1, 2, \ldots, s_{l-1}$.

1). Perform the pair of initialisations:
1a). $\frac{\partial J_1(\mathbf{W},\mathbf{b})}{\partial W^{(l)}_{ij}} := 0$
1b). $\Delta J_2 := 0$

2). FOR each pair $(x^{[t]}, x^{[u]})$, $t = 1, 2, \ldots, m$, $u = 1, 2, \ldots, m$:
2a). Using Equation 3, set the pair weight $S(x^{[t]}, x^{[u]})$ (e.g., $S(x^{[t]}, x^{[u]}) := \frac{\lambda_1}{M_C}$ for a same-class pair)
2b). Using the current pair, $(x^{[t]}, x^{[u]})$, compute $\frac{\partial \hat{J}_1(\mathbf{W},\mathbf{b})}{\partial W^{(l)}_{ij}}$ via Algorithm 1
2c). Increment $\frac{\partial J_1(\mathbf{W},\mathbf{b})}{\partial W^{(l)}_{ij}} := \frac{\partial J_1(\mathbf{W},\mathbf{b})}{\partial W^{(l)}_{ij}} + S(x^{[t]}, x^{[u]}) \frac{\partial \hat{J}_1(\mathbf{W},\mathbf{b})}{\partial W^{(l)}_{ij}}$
END FOR

3). FOR each $x^{[t]}$, $t = 1, 2, \ldots, m$:
3a). Compute $\frac{\partial \hat{J}_2(\mathbf{W},\mathbf{b})}{\partial W^{(l)}_{ij}}$ via Algorithm 2
3b). Increment $\Delta J_2 := \Delta J_2 + \frac{\partial \hat{J}_2(\mathbf{W},\mathbf{b})}{\partial W^{(l)}_{ij}}$
END FOR

4). Set $\frac{\partial J_2(\mathbf{W},\mathbf{b})}{\partial W^{(l)}_{ij}} := \frac{\lambda_3}{m} \Delta J_2$

5). Set $\frac{\partial J(\mathbf{W},\mathbf{b})}{\partial W^{(l)}_{ij}} := \frac{\partial J_1(\mathbf{W},\mathbf{b})}{\partial W^{(l)}_{ij}} + \frac{\partial J_2(\mathbf{W},\mathbf{b})}{\partial W^{(l)}_{ij}} + \lambda_4 W^{(l)}_{ij}$ (the factor $\frac{\lambda_3}{m}$ having already been applied in Step 4)

6). Finally, update the ANN weights via gradient descent using learning rate $\alpha$: $W^{(l)}_{ij} := W^{(l)}_{ij} - \alpha \frac{\partial J(\mathbf{W},\mathbf{b})}{\partial W^{(l)}_{ij}}$

7).
Repeat Steps 1 to 6 until convergence, or until a maximum number of iterations is reached.

7 Conclusion

In this concept paper, we have proposed a technique called SENNS (Sparse Extraction Neural NetworkS) for the feature extraction problem, a problem at the heart of pattern recognition and machine learning. Philosophically, our proposed method draws on the idea of extracting features that maximise inter-class variances while minimising intra-class variances. As a result, the method fits immediately into the framework of graph embeddings. However, unlike two of the most representative members of the class of methods within the graph embedding school of thought, our proposed SENNS enforces sparsity on the extracted features, and utilises powerful non-linear projections, rather than linear, kernel or tensor transformations, to effect the feature extraction process. We formulated SENNS as the minimisation of a regularised sum of four terms, and derived an effective gradient descent algorithm for the resulting minimisation problem. Via rigorous mathematical analysis, we showed how our algorithm can be specified as a set of tasks involving the standard back-propagation procedure, up to a modification for $L_1$ norms. Finally, a natural next line of action is to test SENNS on some standard machine learning datasets, such as the ARABASE database of Arabic characters, the CMU PIE database of faces, and the MNIST database of digits.

References

[1] Trier, O.D., Jain, A.K. and Taxt, T. Feature Extraction Methods for Character Recognition: A Survey. Pattern Recognition, vol. 29, no. 4, pp. 641-662, 1996.

[2] Belkin, M. and Niyogi, P. Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering. Advances in Neural Information Processing Systems, vol. 14, pp. 585-591, 2001.

[3] Etemad, K. and Chellappa, R.
Discriminant Analysis for Recognition of Human Face Images. Journal of the Optical Society of America A, vol. 14, no. 8, pp. 1724-1733, 1997.

[4] Turk, M. and Pentland, A. Face Recognition Using Eigenfaces. Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 586-591, 1991.

[5] Hu, M.K. Visual Pattern Recognition by Moment Invariants. IRE Transactions on Information Theory, vol. 8, pp. 179-187, 1962.

[6] Chao, K. and Srinath, M.D. Invariant Character Recognition with Zernike and Orthogonal Fourier-Mellin Moments. Pattern Recognition, vol. 35, pp. 143-154, 2002.

[7] Lowe, D. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.

[8] Belongie, S., Malik, J. and Puzicha, J. Shape Matching and Object Recognition using Shape Contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002.

[9] Blum, H. A Transformation for Extracting New Descriptors of Shape. In: Models for the Perception of Speech and Visual Form, W. Wathen-Dunn, ed., MIT Press, Cambridge, MA, pp. 362-380, 1967.

[10] Kimia, B.B., Tannenbaum, A.R. and Zucker, S.W. Shape, Shocks and Deformations I: The Components of 2D Shape and the Reaction-Diffusion Space. International Journal of Computer Vision, vol. 15, pp. 189-224, 1995.

[11] Macrini, D., Dickinson, S., Fleet, D. and Siddiqi, K. Bone Graphs: Medial Shape Parsing and Abstraction. Computer Vision and Image Understanding, vol. 115, pp. 1044-1061, 2011.

[12] Yan, S., Xu, D., Zhang, B., Zhang, H.-J., Yang, Q. and Lin, S. Graph Embedding and Extensions: A General Framework for Dimensionality Reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 1, pp. 40-51, 2007.

[13] Sundaramoorthi, G. and Yang, Y. Matching Through Features and Features Through Matching. Technical Report, KAUST.

[14] Devijver, P.A.
and Kittler, J. Pattern Recognition: A Statistical Approach. Prentice Hall, London, 1982.

[15] Ben Amara, N., Mazhoud, O., Bouzrara, N. and Ellouze, N. ARABASE: A Relational Database for Arabic OCR Systems. International Arab Journal of Information Technology, 2005.

[16] Sim, T., Baker, S. and Bsat, M. The CMU Pose, Illumination, and Expression Database. IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 12, pp. 1615-1618, Dec. 2003.

[17] LeCun, Y., Bottou, L., Bengio, Y. and Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

[18] Ng, A., Ngiam, J., Foo, C.Y., Mai, Y. and Suen, C. UFLDL Tutorial on Neural Networks. Web, August 2014. <http://ufldl.stanford.edu/wiki/index.php/Neural_Networks>

[19] Yang, J., Yu, K., Gong, Y. and Huang, T. Linear Spatial Pyramid Matching using Sparse Coding for Image Classification. In IEEE International Conference on Computer Vision and Pattern Recognition, 2009.

[20] Gao, S., Tsang, I., Chia, L. and Zhao, P. Local Features are not Lonely - Laplacian Sparse Coding for Image Classification. In IEEE International Conference on Computer Vision and Pattern Recognition, 2010.