Training Restricted Boltzmann Machine by Perturbation


Authors: Siamak Ravanbakhsh, Russell Greiner, Brendan Frey

Siamak Ravanbakhsh, Russell Greiner
Department of Computing Science, University of Alberta
{mravanba, rgreiner}@ualberta.ca

Brendan J. Frey
Prob. and Stat. Inf. Group, University of Toronto
frey@psi.toronto.edu

Abstract

A new approach to maximum likelihood learning of discrete graphical models, and RBMs in particular, is introduced. Our method, Perturb and Descend (PD), is inspired by two ideas: (I) the perturb-and-MAP method for sampling, and (II) learning by Contrastive Divergence minimization. In contrast to perturb and MAP, PD leverages training data to learn models that do not allow efficient MAP estimation. During learning, to produce a sample from the current model, we start from a training point and descend in the energy landscape of the "perturbed model" for a fixed number of steps, or until a local optimum is reached. For an RBM, this involves linear calculations and thresholding, which can be very fast. Furthermore, we show that the amount of perturbation is closely related to the temperature parameter, and that it can regularize the model by producing robust features, resulting in sparse hidden-layer activations.

1 Introduction

The common procedure for learning a Probabilistic Graphical Model (PGM) is to maximize the likelihood of observed data by updating the model parameters along the gradient of the likelihood. This gradient step requires inference on the current model, which may be performed using a deterministic or Markov Chain Monte Carlo (MCMC) procedure [1]. Intuitively, the gradient step attempts to update the parameters to increase the unnormalized probability of the observations while decreasing the sum of unnormalized probabilities over all states, i.e., the partition function. The first part of the update is known as the positive phase and the second part is referred to as the negative phase.
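The two phases can be made explicit by differentiating the log-likelihood of a generic Gibbs distribution P(x | \theta) = e^{-E(x,\theta)} / Z(\theta); this short derivation is standard and simply restates the intuition above:

```latex
\frac{\partial}{\partial \theta} \log P(x \mid \theta)
  = \frac{\partial}{\partial \theta}\Big[ -E(x, \theta) - \log Z(\theta) \Big]
  = \underbrace{-\frac{\partial E(x, \theta)}{\partial \theta}}_{\text{positive phase}}
  \;+\; \underbrace{\mathbb{E}_{P(y \mid \theta)}\!\Big[ \frac{\partial E(y, \theta)}{\partial \theta} \Big]}_{\text{negative phase}}
```

using \partial_\theta \log Z(\theta) = -\mathbb{E}_{P(y \mid \theta)}[\partial_\theta E(y, \theta)]. The negative phase is the expensive part, since it requires an expectation under the current model.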
An efficient alternative is Contrastive Divergence (CD) training [2], in which the negative phase only decreases the probability of configurations in the vicinity of the training data. In practice these neighboring states are sampled by taking a few steps on Markov chains initialized at the training data. Recently, perturbation methods combined with efficient Maximum A Posteriori (MAP) solvers have been used to efficiently sample PGMs [3, 4, 5]. Here a basic idea from extreme value theory is used, which states that MAP assignments for particular perturbations of any Gibbs distribution can replace unbiased samples from the unperturbed model [6]. In practice, however, models are not perturbed in the ideal form and approximations are used [4]. Hazan et al. show that lower-order approximations provide an upper bound on the partition function [7]. This suggests that the perturb-and-MAP sampling procedure can be used in the negative phase to maximize a lower bound on the log-likelihood of the data. However, this is feasible only if efficient MAP estimation is possible (e.g., PGMs with submodular potentials [8]), and even so, repeated MAP estimation at each step of learning could be prohibitively expensive.

Here we propose a new approach, closely related to CD and perturb and MAP, to sample the PGM in the negative phase of learning. The basic idea is to perturb the model and, starting at training data, find lower perturbed-energy configurations. These configurations are then used as fantasy particles in the negative phase of learning. Although this scheme may be used for arbitrary discrete PGMs with and without hidden variables, here we consider its application to the task of training Restricted Boltzmann Machines (RBMs) [9].
2 Background

2.1 Restricted Boltzmann Machine

An RBM is a bipartite Markov Random Field whose variables x = \{v, h\} are partitioned into visible units v = [v_1, \ldots, v_n] and hidden units h = [h_1, \ldots, h_m]. Because of its representational power [10] and relative ease of training [11], the RBM is increasingly used in various applications, for example as a generative model for movie ratings [12], speech [13] and topic modeling [14]. Most importantly, it is used in the construction of deep neural architectures [15, 16]. The RBM models the joint distribution over hidden and visible units by

P(h, v \mid \theta) = \frac{1}{Z(\theta)} e^{-E(h, v, \theta)}

where Z(\theta) = \sum_{h, v} e^{-E(h, v, \theta)} is the normalization constant (a.k.a. partition function) and E is the energy function. Due to its bipartite form, conditioned on the visible (hidden) variables, the hidden (visible) variables of an RBM are independent of each other:

P(h \mid v, \theta) = \prod_{1 \le j \le m} P(h_j \mid v, \theta) \quad\text{and}\quad P(v \mid h, \theta) = \prod_{1 \le i \le n} P(v_i \mid h, \theta)    (1)

Here we consider the energy function of a binary RBM, where h_j, v_i \in \{0, 1\}:

E(v, h, \theta) = -\Big( \sum_{1 \le i \le n,\, 1 \le j \le m} v_i W_{i,j} h_j + \sum_{1 \le i \le n} a_i v_i + \sum_{1 \le j \le m} b_j h_j \Big) = -\big( v^T W h + a^T v + b^T h \big)

The model parameter \theta = (W, a, b) consists of the n \times m matrix of real-valued pairwise interactions W and the local fields (a.k.a. bias terms) a and b. The marginal over visible units is

P(v \mid \theta) = \sum_h P(v, h \mid \theta) = \frac{1}{Z(\theta)} \sum_h e^{-E(v, h, \theta)}

Given a data-set D = \{v^{(1)}, \ldots
, v^{(N)}\}, maximum-likelihood learning seeks the maximum of the averaged log-likelihood:

\ell(\theta) = \frac{1}{N} \sum_{v^{(k)} \in D} \log P(v^{(k)} \mid \theta)    (2)
            = \frac{1}{N} \sum_{v^{(k)} \in D} \Big[ a^T v^{(k)} + \sum_{1 \le j \le m} \log\Big( 1 + \exp\big( b_j + \sum_{1 \le i \le n} v^{(k)}_i W_{i,j} \big) \Big) \Big] - \log Z(\theta)    (3)

Differentiating this objective with respect to \theta gives

\partial \ell(\theta) / \partial W_{i,j} = \frac{1}{N} \sum_{v^{(k)} \in D} \mathbb{E}_{P(h_j \mid v^{(k)}, \theta)}\big[ v^{(k)}_i h_j \big] - \mathbb{E}_{P(v_i, h_j \mid \theta)}[ v_i h_j ]
\partial \ell(\theta) / \partial a_i = \frac{1}{N} \sum_{v^{(k)} \in D} v^{(k)}_i - \mathbb{E}_{P(v_i \mid \theta)}[ v_i ]
\partial \ell(\theta) / \partial b_j = \frac{1}{N} \sum_{v^{(k)} \in D} \mathbb{E}_{P(h_j \mid v^{(k)}, \theta)}[ h_j ] - \mathbb{E}_{P(h_j \mid \theta)}[ h_j ]

where the first and second terms in each line correspond to the positive and negative phases respectively. P(h_j \mid v^{(k)}, \theta), required in the positive phase, is easy to calculate. The negative phase, however, requires unconditioned samples from the current model, which may require long mixing of the Markov chain. Note that the same form of update appears when learning any Markov Random Field, regardless of the form of the graph and the presence of hidden variables. In general the gradient update has the following form:

\nabla_{\theta_I} \ell(\theta) = \mathbb{E}_{D, \theta}[ \phi_I(x_I) ] - \mathbb{E}_{\theta}[ \phi_I(x_I) ]    (4)

where \phi_I(x_I) is the sufficient statistic corresponding to parameter \theta_I. For example, the sufficient statistic for a variable interaction W_{i,j} in an RBM is \phi_{i,j}(v_i, h_j) = v_i h_j. Note that \theta appears in the expectation of the first term only if hidden variables are present.

2.2 Contrastive Divergence Training

In estimating the second term in the update of eq (4), we can sample the model with the training data in mind. To this end, CD samples the model by initializing the Markov chain at the data points and running it for K steps. This is repeated each time we calculate the gradient.
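A minimal numpy sketch (not the authors' code) of one CD-k gradient estimate for a binary RBM, using the block Gibbs chain initialized at the data; the function and variable names are ours, but the parametrization \theta = (W, a, b) and the update terms follow the equations above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_gradient(v_data, W, a, b, k=1, rng=None):
    """One CD-k gradient estimate for a binary RBM (illustrative sketch).

    v_data: (N, n) batch of binary visible vectors; W: (n, m); a: (n,); b: (m,).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Positive phase: exact conditional P(h_j = 1 | v, theta) from eq (1).
    ph_pos = sigmoid(b + v_data @ W)
    # Negative phase: k steps of block Gibbs initialized at the data.
    v = v_data
    for _ in range(k):
        h = (rng.random(ph_pos.shape) < sigmoid(b + v @ W)).astype(float)
        v = (rng.random(v_data.shape) < sigmoid(a + h @ W.T)).astype(float)
    ph_neg = sigmoid(b + v @ W)
    # Averaged positive-phase minus negative-phase sufficient statistics.
    N = v_data.shape[0]
    dW = (v_data.T @ ph_pos - v.T @ ph_neg) / N
    da = (v_data - v).mean(axis=0)
    db = (ph_pos - ph_neg).mean(axis=0)
    return dW, da, db
```

One then takes a small step \theta \leftarrow \theta + \eta \nabla_\theta \ell(\theta) using these estimates.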
In the limit of K \to \infty this gives unbiased samples from the current model; however, using only a few steps, CD performs very well in practice [2]. For an RBM this Markov chain is simply a block Gibbs sampler, in which the visible and hidden units are sampled alternately using eq (1). It is also possible to initialize the chain at the training data at the beginning of learning and, during each calculation of the gradient, run the chain from its previous state. This is known as persistent CD [17] or stochastic maximum likelihood [18].

2.3 Sampling by Perturb and MAP

Assuming that it is possible to efficiently obtain the MAP assignment in an MRF, perturbation methods can be used to produce unbiased samples. These samples may then be used in the negative phase of learning. Let \tilde{E}(x) = E(x) - \epsilon(x) denote the perturbed energy function, where the perturbation for each x is a sample from the standard Gumbel distribution, \epsilon(x) \sim \gamma(\varepsilon) = \exp(-\varepsilon - \exp(-\varepsilon)). Also let \tilde{P}(x) \propto \exp(-\tilde{E}(x)) denote the perturbed distribution. Then the MAP assignment \arg\max_x \tilde{P}(x) is an unbiased sample from P(x). This means we can sample P(x) by repeatedly perturbing it and finding the MAP assignment. To obtain samples from a Gumbel distribution we transform samples from the uniform distribution u \sim U(0, 1) by \epsilon \leftarrow -\log(-\log(u)). The following lemma clarifies the connection between the Gibbs distribution and the MAP assignment of the perturbed model.

Lemma 1 ([6]). Let \{E(x)\}_{x \in X} with E(x) \in \mathbb{R}. Define the perturbed values as \tilde{E}(x) = E(x) - \epsilon(x), where \epsilon(x) \sim \gamma(\varepsilon) \;\forall x \in X are i.i.d. samples from the standard Gumbel distribution. Then

\Pr\big( \arg\max_{x \in X} \{ -\tilde{E}(x) \} = \hat{x} \big) = \frac{ \exp(-E(\hat{x})) }{ \sum_{y \in X} \exp(-E(y)) }    (5)

Since the domain X of joint assignments grows exponentially with the number of variables, we cannot find the MAP assignment efficiently.
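Lemma 1 can be checked numerically on a toy state space; the sketch below (with arbitrary, assumed energy values) perturbs each energy with i.i.d. standard Gumbel noise and compares the empirical distribution of the MAP assignment with the Gibbs distribution:

```python
import numpy as np

# Numerical check of Lemma 1 on a tiny state space X = {0, 1, 2, 3}:
# argmax_x { -E(x) + eps(x) } with i.i.d. standard Gumbel eps(x)
# should reproduce the Gibbs distribution exp(-E(x)) / Z.
rng = np.random.default_rng(0)
E = np.array([0.5, 1.7, -0.3, 0.0])        # arbitrary energies (assumed values)
gibbs = np.exp(-E) / np.exp(-E).sum()      # target distribution

n_samples = 200_000
u = rng.random((n_samples, E.size))
eps = -np.log(-np.log(u))                  # standard Gumbel via inverse CDF
empirical = np.bincount(np.argmax(-E + eps, axis=1), minlength=E.size) / n_samples
```

With enough samples, `empirical` matches `gibbs` to within sampling error, which is exactly the content of eq (5).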
As an approximation we may use fully decomposed noise \epsilon(x) = \sum_i \epsilon(x_i) [4]. This corresponds to adding Gumbel noise to each assignment of the unary potentials. For the RBM's parametrization, this amounts to adding the difference of two random samples from a standard Gumbel distribution (which is simply a sample from a logistic distribution) to the biases, e.g., \tilde{a}_i = a_i + \epsilon(v_i = 1) - \epsilon(v_i = 0). Alternatively, a second-order approximation may perturb a combination of binary and unary potentials such that each variable is included once (Section 3.2).

3 Perturb and Descend Learning

The feasibility of sampling using perturb and MAP depends on the availability of efficient optimization procedures. However, MAP estimation is NP-hard in general [19], and only a limited class of MRFs allows efficient energy minimization [8]. We propose an alternative to perturb and MAP that is suitable when inference is employed within the context of learning. Since first- and second-order perturbations in perturb and MAP upper-bound the partition function [7], likelihood optimization using this method is desirable (e.g., [20]). On the other hand, since the model is trained on a data-set, we may leverage the training data in sampling the model. Similar to CD, at each gradient step we start from the training data. To produce the fantasy particles of the negative phase, we perturb the current model and take several steps towards lower-energy configurations. We may take enough steps to reach a local optimum, or stop midway. Let \tilde{\theta} = (\tilde{W}, \tilde{a}, \tilde{b}) denote the perturbed model. For an RBM, each step of this block coordinate descent takes the following form:

v \leftarrow \mathbb{1}\big[\, \tilde{a} + \tilde{W} h > 0 \,\big]    (6)
h \leftarrow \mathbb{1}\big[\, \tilde{b} + \tilde{W}^T v > 0 \,\big]    (7)

where, starting from v = v^{(k)} \in D, h and v are repeatedly updated for K steps or until the updates above have no effect (i.e., a local optimum is reached).
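The descent of eqs (6)-(7) can be sketched as follows; this is an illustrative batch implementation (names and conventions are ours), where each block update sets one layer to its energy-minimizing configuration given the other, so the perturbed energy never increases:

```python
import numpy as np

def perturb_and_descend(v0, W_t, a_t, b_t, K=25):
    """Block coordinate descent of eqs (6)-(7) on a perturbed binary RBM.

    W_t (n, m), a_t (n,), b_t (m,) are the perturbed parameters (theta tilde);
    v0 is an (N, n) batch of training vectors. Sketch only.
    """
    v = v0.astype(float)
    h = (b_t + v @ W_t > 0).astype(float)              # eq (7): init h from data
    for _ in range(K):
        v_new = (a_t + h @ W_t.T > 0).astype(float)    # eq (6)
        h_new = (b_t + v_new @ W_t > 0).astype(float)  # eq (7)
        if np.array_equal(v_new, v) and np.array_equal(h_new, h):
            break                                      # local optimum reached
        v, h = v_new, h_new
    return v, h
```

Each step is a matrix product followed by thresholding, which is why PD's negative phase is so cheap.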
The final configuration is then used as the fantasy particle in the negative phase of learning.

3.1 Amount of Perturbation

To study the effect of the amount of perturbation, we simply multiply the noise \epsilon by a constant \beta; i.e., \beta > 1 means we perturb the model with larger noise values. Going back to Lemma 1, any multiplicative scaling of the noise can be absorbed into the temperature of the energy function: \arg\min_x \{ E(x) - \beta \epsilon(x) \} = \arg\min_x \{ \frac{1}{\beta} E(x) - \epsilon(x) \}, so scaling the noise by \beta is equivalent to a MAP perturbation of the model at temperature T = \beta. Here, however, we change only the noise, without changing the energy itself.

Here we provide some intuition about the potential effect of increasing the perturbation; experimental results appear to confirm this view. For \beta > 1, in the negative phase of learning, we are lowering the probability of configurations that are at a "larger distance" from the training data, compared to training with \beta = 1. This can make the model more robust, as it puts more effort into removing false valleys that are distant from the training data, and less effort into removing (false) valleys that are closer to the training data.

3.2 Second Order Perturbations for RBM

As discussed in Section 2.3, a first-order perturbation of \theta injects noise only into the local potentials:

\tilde{a}_i = a_i + \epsilon(v_i = 1) - \epsilon(v_i = 0) \quad\text{and}\quad \tilde{b}_j = b_j + \epsilon(h_j = 1) - \epsilon(h_j = 0)

In a second-order perturbation we may perturb a subset of non-overlapping pairwise potentials, as well as the unary potentials of the remaining variables. In doing so it is desirable to select the pairwise potentials with higher influence, i.e., larger |W_{i,j}| values. With n visible and m hidden variables, we can use the Hungarian maximum bipartite matching algorithm to find the \min(m, n) most influential interactions [21].
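As an illustration of this selection step, here is a small numpy sketch. The paper prescribes exact maximum-weight bipartite matching (the Hungarian algorithm [21]) on |W|; the greedy pass below is a simpler stand-in of our own that likewise returns \min(n, m) disjoint (visible, hidden) pairs:

```python
import numpy as np

def select_influential_pairs(W):
    """Greedily pick min(n, m) non-overlapping (visible, hidden) pairs with
    large |W_ij|; a stand-in for the exact Hungarian matching in the paper."""
    n, m = W.shape
    order = np.argsort(np.abs(W), axis=None)[::-1]  # entries by decreasing |W_ij|
    used_v, used_h, pairs = set(), set(), []
    for flat in order:
        i, j = divmod(int(flat), m)
        if i not in used_v and j not in used_h:      # keep pairs disjoint
            pairs.append((i, j))
            used_v.add(i)
            used_h.add(j)
            if len(pairs) == min(n, m):
                break
    return pairs
```

The greedy choice is not guaranteed to maximize the total matched |W_{i,j}|, but it respects the non-overlap constraint that the second-order perturbation requires.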
Once the influential interactions are selected, we perturb the corresponding 2 \times 2 factors with Gumbel noise, as well as the bias terms of all variables that are not covered. A simple calculation shows that perturbing a 2 \times 2 potential of the RBM corresponds to perturbing W_{i,j}, a_i and b_j as follows:

\tilde{W}_{i,j} = W_{i,j} + \epsilon(1, 1) - \epsilon(0, 1) - \epsilon(1, 0) + \epsilon(0, 0)
\tilde{a}_i = a_i - \epsilon(0, 0) + \epsilon(0, 1)
\tilde{b}_j = b_j - \epsilon(0, 0) + \epsilon(1, 0)
\epsilon(0, 0), \epsilon(0, 1), \epsilon(1, 0), \epsilon(1, 1) \sim \gamma(\varepsilon)

where \epsilon(y, z) is the noise injected into the pairwise potential assignment for v_i = y and h_j = z.

References

[1] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques. The MIT Press, 2009.
[2] G. E. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, no. 8, pp. 1771-1800, 2002.
[3] G. Papandreou and A. L. Yuille, "Gaussian sampling by local perturbations," in Advances in Neural Information Processing Systems, 2010, pp. 1858-1866.
[4] G. Papandreou and A. L. Yuille, "Perturb-and-MAP random fields: Using discrete optimization to learn and sample from energy models," in ICCV. IEEE, 2011, pp. 193-200.
[5] T. Hazan, S. Maji, and T. Jaakkola, "On sampling from the Gibbs distribution with random maximum a-posteriori perturbations," arXiv preprint arXiv:1309.7598, 2013.
[6] E. J. Gumbel, Statistical Theory of Extreme Values and Some Practical Applications: A Series of Lectures. US Government Printing Office, Washington, 1954, vol. 33.
[7] T. Hazan and T. Jaakkola, "On the partition function and random maximum a-posteriori perturbations," arXiv preprint arXiv:1206.6410, 2012.
[8] V. Kolmogorov and R. Zabih, "What energy functions can be minimized via graph cuts?" IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 2, pp.
147-159, 2004.
[9] P. Smolensky, "Information processing in dynamical systems: Foundations of harmony theory," 1986.
[10] N. Le Roux and Y. Bengio, "Representational power of restricted Boltzmann machines and deep belief networks," Neural Computation, vol. 20, no. 6, pp. 1631-1649, 2008.
[11] G. Hinton, "A practical guide to training restricted Boltzmann machines," Momentum, vol. 9, no. 1, 2010.
[12] R. Salakhutdinov, A. Mnih, and G. Hinton, "Restricted Boltzmann machines for collaborative filtering," in Proceedings of the 24th International Conference on Machine Learning. ACM, 2007, pp. 791-798.
[13] A.-R. Mohamed and G. Hinton, "Phone recognition using restricted Boltzmann machines," in Acoustics, Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE, 2010, pp. 4354-4357.
[14] G. E. Hinton and R. Salakhutdinov, "Replicated softmax: an undirected topic model," in Advances in Neural Information Processing Systems, 2009, pp. 1607-1614.
[15] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527-1554, 2006.
[16] Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1-127, 2009.
[17] T. Tieleman, "Training restricted Boltzmann machines using approximations to the likelihood gradient," in Proceedings of the 25th International Conference on Machine Learning. ACM, 2008, pp. 1064-1071.
[18] L. Younes, "Parametric inference for imperfectly observed Gibbsian fields," Probability Theory and Related Fields, vol. 82, no. 4, pp. 625-645, 1989.
[19] S. E. Shimony, "Finding MAPs for belief networks is NP-hard," Artificial Intelligence, vol. 68, no. 2, pp. 399-410, 1994.
[20] M. J. Wainwright and M. I.
Jordan, "Graphical models, exponential families, and variational inference," Foundations and Trends in Machine Learning, vol. 1, no. 1-2, pp. 1-305, 2008.
[21] J. Munkres, "Algorithms for the assignment and transportation problems," Journal of the Society for Industrial & Applied Mathematics, vol. 5, no. 1, pp. 32-38, 1957.
