An exact mapping between the Variational Renormalization Group and Deep Learning
Pankaj Mehta, Dept. of Physics, Boston University, Boston, MA
David J. Schwab, Dept. of Physics, Northwestern University, Evanston, IL

Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.

A central goal of modern machine learning research is to learn and extract important features directly from data. Among the most promising and successful techniques for accomplishing this goal are those associated with the emerging sub-discipline of deep learning. Deep learning uses multiple layers of representation to learn descriptive features directly from training data [1, 2] and has been successfully utilized, often achieving record-breaking results, in difficult machine learning tasks including object labeling [3], speech recognition [4], and natural language processing [5]. In this work, we will focus on a set of deep learning algorithms known as deep neural networks (DNNs) [6]. DNNs are biologically-inspired graphical statistical models that consist of multiple layers of "neurons", with units in one layer receiving inputs from units in the layer below them. Despite their enormous success, it is still unclear what advantages these deep, multi-layer architectures possess over shallower architectures with a similar number of parameters. In particular, it is still not well understood theoretically why DNNs are so successful at uncovering features in structured data. (But see [7-9].)

One possible explanation for the success of DNN architectures is that they can be viewed as an iterative coarse-graining scheme, where each new high-level layer of the neural network learns increasingly abstract higher-level features from the data [1, 10]. The initial layers of the DNN can be thought of as low-level feature detectors which are then fed into higher layers in the DNN that combine these low-level features into more abstract higher-level features, providing a useful, and at times reduced, representation of the data.
By successively applying feature extraction, DNNs learn to deemphasize irrelevant features in the data while simultaneously learning relevant ones. (Note that in a supervised setting, such as classification, relevant and irrelevant are ultimately determined by the problem at hand. Here we are concerned solely with the unsupervised aspect of training DNNs, and the use of DNNs for compression [6].) In what follows, we make this explanation precise.

This successive coarse-graining procedure is reminiscent of one of the most successful and important tools in theoretical physics, the renormalization group (RG) [11, 12]. RG is an iterative coarse-graining procedure designed to tackle difficult physics problems involving many length scales. The central goal of RG is to extract relevant features of a physical system for describing phenomena at large length scales by integrating out (i.e. marginalizing over) short distance degrees of freedom. In any RG sequence, fluctuations are integrated out starting at the microscopic scale and then moving iteratively on to fluctuations at larger scales. Under this procedure, certain features, called relevant operators, become increasingly important while other features, dubbed irrelevant operators, have a diminishing effect on the physical properties of the system at large scales.

In general, it is impossible to carry out the renormalization procedure exactly. Therefore, a number of approximate RG procedures have been developed in the theoretical physics community [12-15]. One such approximate method is a class of variational "real-space" renormalization schemes introduced by Kadanoff for performing RG on spin systems [14, 16, 17]. Kadanoff's variational RG scheme introduces coarse-grained auxiliary, or "hidden", spins that are coupled to the physical spin system through some unknown coupling parameters. A parameter-dependent free energy is calculated for the coarse-grained spin system from the coupled system by integrating out the physical spins. The coupling parameters are chosen through a variational procedure that minimizes the difference between the free energies of the physical and hidden spin systems. This ensures that the coarse-grained system preserves the long-distance information present in the physical system. Carrying out this procedure results in an RG transformation that maps the physical spin system into a coarse-grained description in terms of hidden spins. The hidden spins then serve as the input for the next round of renormalization.

The introduction of layers of hidden spins is also a central component of DNNs based on Restricted Boltzmann Machines (RBMs). In RBMs, hidden spins (often called units or neurons) are coupled to "visible" spins describing the data of interest. (Here we restrict ourselves to binary data.) The coupling parameters between the visible and hidden layers are chosen using a variational procedure that minimizes the Kullback-Leibler divergence (i.e. relative entropy) between the "true" probability distribution of the data and the variational distribution obtained by marginalizing over the hidden spins. Like in variational RG, RBMs can be used to map a state of the visible spins in a data sample into a description in terms of hidden spins.
If the number of hidden units is less than the number of visible units, such a mapping can be thought of as a compression. (Note, however, that dimensional expansions are common [18].) In deep learning, individual RBMs are stacked on top of each other into a DNN [6, 19], with the output of one RBM serving as the input to the next. Moreover, the variational procedure is often performed iteratively, layer by layer.

The preceding paragraphs suggest an intimate connection between RG and deep learning. Indeed, here we construct an exact mapping from the variational RG scheme of Kadanoff to DNNs based on RBMs [6, 19]. Our mapping suggests that DNNs implement a generalized RG-like procedure to extract relevant features from structured data.

The paper is organized as follows. We begin by reviewing Kadanoff's variational renormalization scheme in the context of the Ising Model. We then introduce RBMs and deep learning architectures of stacked RBMs. We then show how to map the procedure of variational RG to unsupervised training of a DNN. We illustrate these ideas using the one- and two-dimensional nearest-neighbor Ising models. We end by discussing the implication of our mapping for physics and machine learning.

I. OVERVIEW OF VARIATIONAL RG

In statistical physics, one often considers an ensemble of N binary spins {v_i} that can take the values ±1. The index i labels the position of spin v_i in some lattice. In thermal equilibrium, the probability of a spin configuration is given by the Boltzmann distribution

P(\{v_i\}) = \frac{e^{-H(\{v_i\})}}{Z},   (1)

where we have defined the Hamiltonian H({v_i}), and the partition function

Z = \mathrm{Tr}_{v_i}\, e^{-H(\{v_i\})} \equiv \sum_{v_1,\dots,v_N = \pm 1} e^{-H(\{v_i\})}.   (2)

Note that throughout the paper we set the temperature equal to one, without loss of generality. Typically, the Hamiltonian depends on a set of couplings or parameters, K = {K_s}, that parameterizes the set of all possible Hamiltonians. For example, with binary spins, the K could be the couplings describing the spin interactions of various orders:

H[\{v_i\}] = -\sum_i K_i v_i - \sum_{ij} K_{ij} v_i v_j - \sum_{ijk} K_{ijk} v_i v_j v_k + \dots   (3)

Finally, we can define the free energy of the spin system in the standard way:

F^v = -\log Z = -\log \mathrm{Tr}_{v_i}\, e^{-H(\{v_i\})}.   (4)

The idea behind RG is to find a new coarse-grained description of the spin system where one has "integrated out" short distance fluctuations. To this end, let us introduce M < N new binary spins, {h_j}. Each of these spins h_j will serve as a coarse-grained degree of freedom where fluctuations on small scales have been averaged out. Typically, such a coarse-graining procedure increases some characteristic length scale describing the system such as the lattice spacing. For example, in the block spin renormalization picture introduced by Kadanoff, each h_i represents the state of a local block of physical spins, v_i. Figure 1 shows such a block-spin procedure for a two-dimensional spin system on a square lattice, where each h_i represents a 2x2 block of visible spins. The result of such a coarse-graining procedure is that the lattice spacing is doubled at each step of the renormalization procedure.

FIG. 1. Block spin renormalization. In block spin renormalization [14], a physical system is coarse grained by introducing new "block" variables which describe some "effective" behavior of a block of spins. For example, in the figure, four adjacent spins are grouped into 2x2 blocks. The system is then described in terms of these new block variables. This scheme is then iterated to create new block variables that average over an even larger set of the original spins. Notice the lattice spacing doubles after each iteration.
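To make Eqs. (1)-(4) concrete, the following minimal Python sketch (our own illustration, not part of the original analysis; the 8-spin open chain and J = 1 are arbitrary choices) evaluates the Boltzmann distribution, the partition function, and the free energy by brute-force enumeration of all 2^N configurations:

```python
import itertools
import numpy as np

def hamiltonian(spins, J=1.0):
    """Nearest-neighbor Ising energy H = -J sum_i v_i v_{i+1} (open chain, T = 1)."""
    spins = np.asarray(spins)
    return -J * np.sum(spins[:-1] * spins[1:])

def partition_function_and_free_energy(N=8, J=1.0):
    """Brute-force evaluation of Eq. (2) and Eq. (4) over all 2^N configurations."""
    energies = np.array([hamiltonian(cfg, J)
                         for cfg in itertools.product([-1, 1], repeat=N)])
    Z = np.sum(np.exp(-energies))   # Eq. (2)
    F = -np.log(Z)                  # Eq. (4)
    return Z, F

def boltzmann_probability(spins, Z, J=1.0):
    """Eq. (1): probability of a single configuration."""
    return np.exp(-hamiltonian(spins, J)) / Z

if __name__ == "__main__":
    Z, F = partition_function_and_free_energy(N=8, J=1.0)
    print(f"Z = {Z:.3f}, F = {F:.3f}")
    print("P(all spins up) =", boltzmann_probability([1] * 8, Z))
```

Enumeration is of course only feasible for very small N; it is used here purely to pin down the definitions.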
In general, the interactions (statistical correlations) between the {v_i} induce interactions (statistical correlations) between the coarse-grained spins, {h_j}. In particular, the coarse-grained system can be described by a new coarse-grained Hamiltonian of the form

H^{RG}[\{h_j\}] = -\sum_i \tilde{K}_i h_i - \sum_{ij} \tilde{K}_{ij} h_i h_j - \sum_{ijk} \tilde{K}_{ijk} h_i h_j h_k + \dots,   (5)

where {\tilde{K}} describe interactions between the hidden spins, {h_j}. In the physics literature, such a renormalization transformation is often represented as a mapping between couplings, {K} -> {\tilde{K}}. Of course, the exact mapping depends on the details of the RG scheme used.

In the variational RG scheme proposed by Kadanoff, the coarse-graining procedure is implemented by constructing a function, T_λ({v_i},{h_j}), that depends on a set of variational parameters {λ} and encodes (typically pairwise) interactions between the physical and coarse-grained degrees of freedom. After coupling the auxiliary spins {h_j} to the physical spins {v_i}, one can then integrate out (marginalize over) the visible spins to arrive at a coarse-grained description of the physical system entirely in terms of the {h_j}. The function T_λ({v_i},{h_j}) then naturally defines a Hamiltonian for the {h_j} through the expression

e^{-H^{RG}_{\lambda}[\{h_j\}]} \equiv \mathrm{Tr}_{v_i}\, e^{T_{\lambda}(\{v_i\},\{h_j\}) - H(\{v_i\})}.   (6)

We can also define a free energy for the coarse-grained system in the usual way,

F^h_{\lambda} = -\log \mathrm{Tr}_{h_i}\, e^{-H^{RG}_{\lambda}(\{h_i\})}.   (7)

Thus far we have ignored the problem of choosing the variational parameters λ that define our RG transformation T_λ({v_i},{h_j}). Intuitively, it is clear we should choose λ to ensure that the long-distance physical observables of the system are invariant to this coarse-graining procedure. This is done by choosing the parameters λ to minimize the free energy difference, ΔF = F^h_λ - F^v, between the physical and coarse-grained systems. Notice that

\Delta F = 0 \iff \mathrm{Tr}_{h_j}\, e^{T_{\lambda}(\{v_i\},\{h_j\})} = 1.   (8)

Thus, for any exact RG transformation, we know that

\mathrm{Tr}_{h_j}\, e^{T_{\lambda}(\{v_i\},\{h_j\})} = 1.   (9)

In general, it is not possible to choose the parameters λ to satisfy the condition above, and various variational schemes (e.g. bond moving) have been proposed to choose λ to minimize this ΔF.
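The variational construction can be illustrated numerically on a very small system. In the sketch below (again our own illustration, not from the paper; the block-pairing form of T_λ, in which each hidden spin couples with strength λ to two adjacent visible spins, is an assumption made purely for demonstration), we enumerate all configurations to evaluate the coarse-grained Hamiltonian of Eq. (6), the free-energy difference ΔF, and the deviation from the exactness condition of Eq. (9):

```python
import itertools
import numpy as np

def H(v, J=1.0):
    """Physical Hamiltonian: nearest-neighbor 1D Ising chain (open boundary)."""
    v = np.asarray(v)
    return -J * np.sum(v[:-1] * v[1:])

def T(v, h, lam):
    """Illustrative pairwise operator T_lambda(v, h): each hidden spin h_j
    couples with strength lambda to a block of two adjacent visible spins."""
    v, h = np.asarray(v), np.asarray(h)
    blocks = v.reshape(len(h), 2).sum(axis=1)
    return lam * np.sum(h * blocks)

def free_energy_difference(N=6, J=1.0, lam=1.0):
    """Delta F = F_h - F_v (Eqs. 4, 6, 7) by brute-force enumeration."""
    M = N // 2
    visibles = list(itertools.product([-1, 1], repeat=N))
    hiddens = list(itertools.product([-1, 1], repeat=M))

    # Eq. (6): exp(-H_RG(h)) = Tr_v exp(T(v, h) - H(v))
    exp_minus_HRG = {h: sum(np.exp(T(v, h, lam) - H(v, J)) for v in visibles)
                     for h in hiddens}
    F_h = -np.log(sum(exp_minus_HRG.values()))              # Eq. (7)
    F_v = -np.log(sum(np.exp(-H(v, J)) for v in visibles))  # Eq. (4)

    # Exactness check, Eq. (9): Tr_h exp(T(v, h)) should equal 1 for every v.
    worst = max(abs(sum(np.exp(T(v, h, lam)) for h in hiddens) - 1.0)
                for v in visibles)
    return F_h - F_v, worst

if __name__ == "__main__":
    for lam in [0.25, 0.5, 1.0]:
        dF, err = free_energy_difference(lam=lam)
        print(f"lambda = {lam:.2f}: Delta F = {dF:+.3f}, max |Tr_h e^T - 1| = {err:.3f}")
```

For this simple choice of T_λ the exactness condition is badly violated, which is precisely why schemes such as bond moving are needed in practice.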
II. RBMS AND DEEP NEURAL NETWORKS

We will show below that this variational RG procedure has a natural interpretation as a deep learning scheme based on a powerful class of energy-based models called Restricted Boltzmann Machines (RBMs) [6, 20-23]. We will restrict our discussion to RBMs acting on binary data [6] drawn from some probability distribution, P({v_i}), with {v_i} binary spins labeled by an index i = 1 ... N. For example, for black and white images each spin v_i encodes whether a given pixel is on or off and the distribution P({v_i}) encodes the statistical properties of the ensemble of images (e.g. the set of all handwritten digits in the MNIST dataset).

To model the data distribution, RBMs introduce new hidden spin variables, {h_j} (j = 1 ... M), that couple to the visible units. The interactions between visible and hidden units are modeled using an energy function of the form

E(\{v_i\},\{h_j\}) = \sum_j b_j h_j + \sum_{ij} v_i w_{ij} h_j + \sum_i c_i v_i,   (10)

where λ = {b_j, w_{ij}, c_i} are variational parameters of the model. In terms of this energy function, the joint probability of observing a configuration of hidden and visible spins can be written as

p_{\lambda}(\{v_i\},\{h_j\}) = \frac{e^{-E(\{v_i\},\{h_j\})}}{Z}.   (11)

This joint distribution also defines a variational distribution for the visible spins,

p_{\lambda}(\{v_i\}) = \sum_{\{h_j\}} p_{\lambda}(\{v_i\},\{h_j\}) = \mathrm{Tr}_{h_j}\, p_{\lambda}(\{v_i\},\{h_j\}),   (12)

as well as a marginal distribution for the hidden spins themselves:

p_{\lambda}(\{h_j\}) = \sum_{\{v_i\}} p_{\lambda}(\{v_i\},\{h_j\}) = \mathrm{Tr}_{v_i}\, p_{\lambda}(\{v_i\},\{h_j\}).   (13)

Finally, for future reference it will be helpful to define a "variational" RBM Hamiltonian for the visible units,

p_{\lambda}(\{v_i\}) \equiv \frac{e^{-H^{RBM}_{\lambda}[\{v_i\}]}}{Z},   (14)

and an RBM Hamiltonian for the hidden units,

p_{\lambda}(\{h_j\}) \equiv \frac{e^{-H^{RBM}_{\lambda}[\{h_j\}]}}{Z}.   (15)

Since the objective of the RBM for our purposes is unsupervised learning, the parameters in the RBM are chosen to minimize the Kullback-Leibler divergence between the true distribution of the data P({v_i}) and the variational distribution p_λ({v_i}):

D_{KL}(P(\{v_i\})\,\|\,p_{\lambda}(\{v_i\})) = \sum_{\{v_i\}} P(\{v_i\}) \log \frac{P(\{v_i\})}{p_{\lambda}(\{v_i\})}.   (16)

Furthermore, notice that when the RBM exactly reproduces the visible data distribution,

D_{KL}(P(\{v_i\})\,\|\,p_{\lambda}(\{v_i\})) = 0.   (17)

In general it is not possible to explicitly minimize D_KL(P({v_i}) || p_λ({v_i})), and this minimization is usually performed using approximate numerical methods such as contrastive divergence [24]. Note that if the number of hidden units is restricted (i.e. less than 2^N), the RBM cannot be made to match an arbitrary distribution exactly [9].

In a DNN, RBMs are stacked on top of each other so that, once trained, the hidden layer of one RBM serves as the visible layer of the next RBM. In particular, one can map a configuration of visible spins to a configuration in the hidden layer via the conditional probability distribution, p_λ({h_j}|{v_i}). Thus, after training an RBM, we can treat the activities of the hidden layer in response to each visible data sample as data for learning a second layer of hidden spins, and so on.
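A minimal RBM implementing Eqs. (10)-(13), together with a single contrastive-divergence update, can be sketched as follows (illustrative code, not the authors' implementation; the ±1 conditional probabilities and the signs of the updates follow from the p ∝ e^{-E} convention of Eq. (10), and the learning rate and toy data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

class BinaryRBM:
    """Minimal RBM on +/-1 spins, following the sign convention of Eq. (10):
    p(v, h) ~ exp(-E) with E = sum_j b_j h_j + sum_ij v_i w_ij h_j + sum_i c_i v_i."""

    def __init__(self, n_visible, n_hidden):
        self.w = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b = np.zeros(n_hidden)    # hidden biases b_j
        self.c = np.zeros(n_visible)   # visible biases c_i

    def energy(self, v, h):
        return self.b @ h + v @ self.w @ h + self.c @ v   # Eq. (10)

    def p_h_given_v(self, v):
        # With p ~ e^{-E}, P(h_j = +1 | v) = sigmoid(-2 (b_j + sum_i v_i w_ij)).
        return sigmoid(-2.0 * (self.b + v @ self.w))

    def p_v_given_h(self, h):
        return sigmoid(-2.0 * (self.c + self.w @ h))

    def sample(self, p):
        return np.where(rng.random(p.shape) < p, 1.0, -1.0)

    def cd1_update(self, v_data, lr=0.01):
        """One contrastive divergence (CD-1) step on a single +/-1 sample."""
        h_data = self.sample(self.p_h_given_v(v_data))
        v_model = self.sample(self.p_v_given_h(h_data))
        h_model_p = 2.0 * self.p_h_given_v(v_model) - 1.0   # <h_j> given v_model
        # Approximate log-likelihood gradient; signs follow the e^{-E} convention.
        self.w += lr * (v_model[:, None] * h_model_p[None, :]
                        - v_data[:, None] * h_data[None, :])
        self.b += lr * (h_model_p - h_data)
        self.c += lr * (v_model - v_data)

# Toy example: fit a tiny RBM to samples in which all visible spins are aligned.
rbm = BinaryRBM(n_visible=4, n_hidden=2)
for _ in range(2000):
    v = np.full(4, rng.choice([-1.0, 1.0]))
    rbm.cd1_update(v)
```

In practice one trains on mini-batches and stacks such RBMs layer by layer, as described above.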
III. MAPPING VARIATIONAL RG TO DEEP LEARNING

In variational RG, the couplings between the hidden and visible spins are encoded by the operators T_λ({v_i},{h_j}). In RBMs, an analogous role is played by the joint energy function E({v_i},{h_j}). In fact, as we will show below, these objects are related through the equation

T(\{v_i\},\{h_j\}) = -E(\{v_i\},\{h_j\}) + H[\{v_i\}],   (18)

where H[{v_i}] is the Hamiltonian defined in Eq. 3 that encodes the data probability distribution P({v_i}). This equation defines a one-to-one mapping between the variational RG scheme and RBM-based DNNs.

Using this definition, it is easy to show that the Hamiltonian H^{RG}_λ[{h_j}], originally defined in Eq. 6 as the Hamiltonian of the coarse-grained degrees of freedom after performing RG, also describes the hidden spins in the RBM. This is equivalent to the statement that the marginal distribution p_λ({h_j}) describing the hidden spins of the RBM is of the Boltzmann form with a Hamiltonian H^{RG}_λ[{h_j}]. To prove this, we divide both sides of Eq. 6 by Z to get

\frac{e^{-H^{RG}_{\lambda}[\{h_j\}]}}{Z} = \frac{\mathrm{Tr}_{v_i}\, e^{T_{\lambda}(\{v_i\},\{h_j\}) - H(\{v_i\})}}{Z}.   (19)

Substituting Eq. 18 into this equation yields

\frac{e^{-H^{RG}_{\lambda}[\{h_j\}]}}{Z} = \frac{\mathrm{Tr}_{v_i}\, e^{-E(\{v_i\},\{h_j\})}}{Z} = p_{\lambda}(\{h_j\}).   (20)

Substituting Eq. 15 into the right-hand side yields the desired result,

H^{RG}_{\lambda}[\{h_j\}] = H^{RBM}_{\lambda}[\{h_j\}].   (21)

These results also provide a natural interpretation for variational RG entirely in the language of probability theory. The operator T_λ({v_i},{h_j}) can be viewed as a variational approximation for the conditional probability of the hidden spins given the visible spins. To see this, notice that

e^{T(\{v_i\},\{h_j\})} = e^{-E(\{v_i\},\{h_j\}) + H[\{v_i\}]} = \frac{p_{\lambda}(\{v_i\},\{h_j\})}{p_{\lambda}(\{v_i\})}\, e^{H[\{v_i\}] - H^{RBM}_{\lambda}[\{v_i\}]} = p_{\lambda}(\{h_j\}|\{v_i\})\, e^{H[\{v_i\}] - H^{RBM}_{\lambda}[\{v_i\}]},   (22)

where in going from the first line to the second line we have used Eqs. 11 and 14. This implies that when the RG can be performed exactly (i.e. the RG transformation satisfies the equality Tr_{h_j} e^{T_λ({v_i},{h_j})} = 1), the variational Hamiltonian is identical to the true Hamiltonian describing the data, H[{v_i}] = H^{RBM}_λ[{v_i}], and T({v_i},{h_j}) is exactly the conditional probability. In the language of probability theory, this means that the variational distribution p_λ({v_i}) exactly reproduces the true data distribution P({v_i}) and D_KL(P({v_i}) || p_λ({v_i})) = 0.

In general, it is not possible to perform the variational RG transformation exactly. Instead, one constructs a family of variational approximations for the exact RG transform [14, 16, 17]. The discussion above makes it clear that these variational approximations work at the level of the Hamiltonians and free energies. In contrast, in the machine learning literature, these variational approximations are usually made by minimizing the KL divergence D_KL(P({v_i}) || p_λ({v_i})). Thus, the two approaches employ distinct variational approximation schemes for coarse graining. Finally, notice that the correspondence does not rely on the explicit form of the energy E({h_j},{v_i}) and hence holds for any Boltzmann Machine.
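The bookkeeping behind Eqs. (18)-(22) can be checked by brute force on a tiny system. The sketch below (our own consistency check, not from the paper; the 4-spin chain data Hamiltonian and the random RBM parameters are arbitrary) verifies Eq. (22) numerically, i.e. that e^{T(v,h)} = p_λ(h|v) e^{H(v) - H^RBM_λ(v)} for every configuration:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
N, M = 4, 2
w = rng.standard_normal((N, M))
b, c = rng.standard_normal(M), rng.standard_normal(N)
J = 1.0

def H_data(v):
    """'True' data Hamiltonian, Eq. (3): a nearest-neighbor Ising chain."""
    v = np.asarray(v)
    return -J * np.sum(v[:-1] * v[1:])

def E(v, h):
    """RBM energy, Eq. (10)."""
    v, h = np.asarray(v), np.asarray(h)
    return b @ h + v @ w @ h + c @ v

visibles = list(itertools.product([-1, 1], repeat=N))
hiddens = list(itertools.product([-1, 1], repeat=M))
Z = sum(np.exp(-E(v, h)) for v in visibles for h in hiddens)

def p_joint(v, h):    # Eq. (11)
    return np.exp(-E(v, h)) / Z

def p_v(v):           # Eq. (12)
    return sum(p_joint(v, h) for h in hiddens)

def H_RBM_v(v):       # Eq. (14): H_RBM(v) = -log(Z * p(v))
    return -np.log(Z * p_v(v))

def T(v, h):          # Eq. (18): relation between the RG operator and the RBM energy
    return -E(v, h) + H_data(v)

# Verify Eq. (22) for every (v, h): e^{T} = p(h|v) * exp(H_data(v) - H_RBM(v)).
max_err = max(abs(np.exp(T(v, h))
                  - (p_joint(v, h) / p_v(v)) * np.exp(H_data(v) - H_RBM_v(v)))
              for v in visibles for h in hiddens)
print("max deviation from Eq. (22):", max_err)   # ~1e-12, i.e. numerical noise
```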
FIG. 2. RG and deep learning in the one-dimensional Ising Model. (A) A decimation-based renormalization transformation for the ferromagnetic 1-D Ising model. At each step, half the spins are decimated, doubling the effective lattice spacing. After n successive decimations, the spins can be described using a new 1-D Ising model with a coupling J^(n) between spins. Couplings at a given layer are related to couplings at the previous layer through the square of the hyperbolic tangent function. (B) Decimation-based renormalization transformations can also be realized using the deep architecture shown, where the weights between the (n+1)-th and n-th hidden layers are given by J^(n). (C) Visualization of the renormalization group flow of the couplings for the 1-D ferromagnetic Ising model. Under four successive decimations, or equivalently as we move up four layers in the deep architecture, the couplings (marked by red dots) get smaller. Eventually, the couplings are attracted to the stable fixed point J = 0.

IV. EXAMPLES

To gain intuition about the mapping between RG and deep learning, it is helpful to consider some simple examples in detail. We begin by examining the one-dimensional nearest-neighbor Ising model where the RG transformation can be carried out exactly. We then numerically explore the two-dimensional nearest-neighbor Ising model using an RBM-based deep learning architecture.

A. One dimensional Ising Model

The one-dimensional Ising model describes a collection of binary spins {v_i} organized along a one-dimensional lattice with lattice spacing a. Such a system is described by a Hamiltonian of the form

H = -J \sum_i v_i v_{i+1},   (23)

where J is a ferromagnetic coupling that energetically favors configurations where neighboring spins align. To perform an RG transformation, we decimate (marginalize over) every other spin. This doubles the lattice spacing a -> 2a and results in a new effective interaction J^(1) between spins (see Figure 2). If we denote the coupling after performing n successive RG transformations by J^(n), then a standard calculation shows that these coefficients satisfy the RG equations

\tanh[J^{(n+1)}] = \tanh^2[J^{(n)}],   (24)

where we have defined J^(0) = J [14]. This recursion relationship can be visualized as a one-dimensional flow in the coupling space J from J = ∞ to J = 0. Thus, after performing RG the interactions become weaker and weaker and J -> 0 as n -> ∞.

This RG transformation also naturally gives rise to the deep learning architecture shown in Figure 2. The spins at a given layer of the DNN have a natural interpretation as the decimated spins when performing the RG transformation in the layer below. Notice that the coupled spins in the bottom two layers of the DNN in Fig. 2B form an "effective" one-dimensional chain isomorphic to the original spin chain. Thus, marginalizing over spins in the bottom layer of the DNN is identical to decimating every other spin in the original spin system. This implies that the "hidden" spins in the second layer of the DNN are also described by the RG-transformed Hamiltonian with a coupling J^(1) between neighboring spins. Repeating this argument for spins coupled between the second and third layers, and so on, one obtains the deep learning architecture shown in Fig. 2B, which implements decimation.

The advantage of the simple deep architecture presented here is that it is easy to interpret and requires no calculations to construct. However, an important shortcoming is that it contains no information about half of the visible spins, namely the spins that do not couple to the hidden layer.
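The decimation flow of Eq. (24) is easy to reproduce numerically. The short sketch below (illustrative only) iterates the recursion and cross-checks it against the coupling obtained by explicitly summing out the middle spin of a three-spin chain, J' = (1/2) log cosh(2J):

```python
import numpy as np

def decimation_step(J):
    """One decimation (Eq. 24): tanh(J_next) = tanh(J)^2."""
    return np.arctanh(np.tanh(J) ** 2)

def decimation_step_direct(J):
    """Same step obtained by explicitly summing out the middle spin of a
    three-spin chain: sum_{v2} exp(J v1 v2 + J v2 v3) = A exp(J' v1 v3),
    which gives J' = 0.5 * log(cosh(2J))."""
    return 0.5 * np.log(np.cosh(2.0 * J))

J = 1.0   # initial ferromagnetic coupling (temperature absorbed, T = 1)
for n in range(5):
    print(f"layer {n}: J = {J:.4f}")
    assert np.isclose(decimation_step(J), decimation_step_direct(J))
    J = decimation_step(J)
# The couplings flow toward the stable fixed point J = 0, as in Fig. 2C.
```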
FIG. 3. Deep learning the 2D Ising model. (A) A Deep Neural Network with four layers of size 1600, 400, 100, and 25 spins was trained using samples drawn from a 2D Ising model slightly above the critical temperature, J/(k_B T) = 0.408. (B) Visualization of the effective receptive fields for the top layer of spins. Each 40 by 40 pixel image depicts the effective receptive field of one of the 25 spins in the top layer (see Materials and Methods). (C) Visualization of effective receptive fields for each of the 100 spins in the middle layer, calculated as in B. (D) The effective receptive fields get larger as one moves up the Deep Neural Network. This is consistent with what is expected from the successive application of block renormalization. (E) Three representative samples drawn from the 2D Ising model at J = 0.408 and their reconstruction from the trained DNN. Samples were reconstructed from DNNs as in [6].

B. Two dimensional Ising Model

We next applied deep learning techniques to numerically coarse-grain the two-dimensional nearest-neighbor Ising model on a square lattice. This model is described by a Hamiltonian of the form

H[\{v_i\}] = -J \sum_{\langle ij \rangle} v_i v_j,   (25)

where ⟨ij⟩ indicates that i and j are nearest neighbors and J is a ferromagnetic coupling that favors configurations where neighboring spins align. Unlike the one-dimensional Ising model, the two-dimensional Ising model has a phase transition that occurs when J/(k_B T) = 0.4352 (recall we have set β = T^{-1} = 1). At the phase transition, the characteristic length scale of the system, the correlation length, diverges. For this reason, near a critical point the system can be productively coarse-grained using a procedure similar to Kadanoff's block spin renormalization (see Fig. 1) [14].

Inspired by our mapping between variational RG and DNNs, we applied standard deep learning techniques to samples generated from the 2D Ising model for J = 0.408, just above the critical temperature. 20,000 samples were generated from a periodic 40x40 2D Ising model using standard equilibrium Monte Carlo techniques and served as input to an RBM-based deep neural network of four layers with 1600, 400, 100, and 25 spins, respectively (see Fig. 3A). We furthermore imposed an L1 penalty on the weights between layers in the RBM and trained the network using contrastive divergence [24] (see Materials and Methods). The L1 penalty serves as a sparsity-promoting regularizer that encourages weights in the RBM to be zero and prevents overfitting due to the finite number of samples. In practice, it ensures that visible and hidden spins interact with only a small subset of all the spins in an RBM. (Note that we did not use a convolutional network that explicitly builds in spatial locality or translational invariance.)
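For completeness, the Monte Carlo sampling step can be sketched as follows (a minimal single-spin-flip Metropolis sampler of our own, not the authors' code; the equilibration and decorrelation settings are arbitrary and would need to be much longer this close to criticality):

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_sweep(spins, J):
    """One Metropolis sweep of a periodic 2D Ising lattice at temperature T = 1."""
    L = spins.shape[0]
    for _ in range(L * L):
        i, j = rng.integers(L, size=2)
        # Sum of the four nearest neighbors with periodic boundary conditions.
        nn = (spins[(i + 1) % L, j] + spins[(i - 1) % L, j]
              + spins[i, (j + 1) % L] + spins[i, (j - 1) % L])
        dE = 2.0 * J * spins[i, j] * nn      # energy change if spin (i, j) flips
        if dE <= 0 or rng.random() < np.exp(-dE):
            spins[i, j] *= -1
    return spins

def generate_samples(n_samples=100, L=40, J=0.408, equil=200, stride=10):
    """Draw approximately independent configurations for RBM training."""
    spins = rng.choice([-1, 1], size=(L, L))
    for _ in range(equil):                   # equilibration
        metropolis_sweep(spins, J)
    samples = []
    for _ in range(n_samples):
        for _ in range(stride):              # decorrelation between samples
            metropolis_sweep(spins, J)
        samples.append(spins.flatten().copy())
    return np.array(samples)                 # shape (n_samples, 1600) for L = 40

if __name__ == "__main__":
    data = generate_samples(n_samples=10)    # small demo; the paper used 20,000
    print(data.shape, "mean magnetization:", data.mean())
```

Each flattened 1600-spin configuration then serves as a visible-layer sample for the stacked-RBM training described above.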
The architecture of the resulting DNN suggests that it is implementing a coarse-graining scheme similar to block spin renormalization (see Fig. 3). Each spin in a hidden layer couples to a local block of spins in the layer below. This iterative blocking is consistent with Kadanoff's intuitive picture of how coarse-graining should be implemented near the critical point. Moreover, the blocks coupling to each hidden unit in a layer are of approximately the same size (Fig. 3B,C), and the characteristic size increases with layer (Fig. 3D). Surprisingly, this local block spin structure emerges from the training process, suggesting the DNN is self-organizing to implement block spin renormalization. Furthermore, as shown in Fig. 3E, reconstructions from the coarse-grained DNN can qualitatively reproduce the macroscopic features of individual samples despite having only 25 spins in the top layer, a compression ratio of 64.

V. DISCUSSION

Deep learning is one of the most successful paradigms for unsupervised learning to emerge over the last ten years. The enormous success of deep learning techniques at a variety of practical machine learning tasks ranging from voice recognition to image classification raises natural questions about its theoretical underpinnings. Here, we have demonstrated that there is a one-to-one mapping between RBM-based Deep Neural Networks and the variational renormalization group. We illustrated this mapping by analytically constructing a DNN for the 1D Ising model and numerically examining the 2D Ising model. Surprisingly, we found that these DNNs self-organize to implement a coarse-graining procedure reminiscent of Kadanoff block renormalization. This suggests that deep learning may be implementing a generalized RG-like scheme to learn important features from data.

RG plays a central role in our modern understanding of statistical physics and quantum field theory. A central finding of RG is that the long distance physics of many disparate physical systems is dominated by the same long distance fixed points. This gives rise to the idea of universality: many microscopically dissimilar systems exhibit macroscopically similar properties at long distances. Physicists have developed elaborate technical machinery for exploiting fixed points and universality to identify the salient long distance features of physical systems. It will be interesting to see what, if any, of this more complex machinery can be imported to deep learning. A potential obstacle for importing ideas from physics into the deep learning framework is that RG is commonly applied to physical systems with many symmetries. This is in contrast to deep learning, which is often applied to data with limited structure.

Recently, it was suggested that modern RG techniques developed in the context of quantum systems, such as matrix product states and tensor networks, have a natural interpretation in terms of variational RG [17]. These new techniques exploit ideas such as entanglement entropy and disentanglers which create features with a minimum amount of redundancy. It is an open question whether these ideas can be imported into deep learning algorithms. Our mapping also suggests a route for applying real-space renormalization techniques to complicated physical systems. Real-space renormalization techniques such as variational RG have often been limited by their inability to make good approximations. Techniques from deep learning may represent a possible route for overcoming these problems.
Appendix A: Learning Deep Architecture for the Two-dimensional Ising Model

Details are given in the SI Materials and Methods. Stacked RBMs were trained with a variant of the code from [6]. This code is available at https://code.google.com/p/matrbm/. In particular, only the unsupervised learning phase was performed. Individual RBMs were trained with contrastive divergence for 200 epochs, with momentum 0.5, using mini-batches of size 100 on 40,000 total samples from the 2D Ising model with J = 0.408. Additionally, L1 regularization was implemented, with strength 2x10^{-4}, instead of weight decay. This L1 regularization strength was chosen to ensure that one could not have all-to-all couplings between layers in the DNN. Reconstructions were performed as in [6]. See Supplementary files for a Matlab variable containing the learned model.

Appendix B: Visualizing Effective Receptive Fields

The effective receptive field is a way to visualize which spins in the visible layer couple to a given spin in one of the hidden layers. We denote the effective receptive field matrix of layer l by r^(l) and the number of spins in layer l by n^(l), with the visible layer corresponding to l = 0. Each column in r^(l) is a vector that encodes the receptive field of a single spin in hidden layer l. It can be computed by convolving the weight matrices W^(l) encoding the weights w_ij between the spins in layers l-1 and l. To compute r^(l), we first set r^(1) = W^(1) and use the recursion relationship r^(l) = r^(l-1) W^(l) for l > 1. Thus, the effective receptive field of a spin is a measure of how much that hidden spin influences the spins in the visible layer.
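The recursion above amounts to a chain of matrix products. A short numpy sketch follows (our own illustration; random weight matrices stand in for the trained model available in the Supplementary files):

```python
import numpy as np

def effective_receptive_fields(weight_matrices):
    """Compute the effective receptive field matrices r^(l) from the weight
    matrices W^(l) via the recursion described in Appendix B:
        r^(1) = W^(1),   r^(l) = r^(l-1) W^(l)  for l > 1.
    weight_matrices[l-1] has shape (n^(l-1), n^(l)); column j of r^(l) shows how
    strongly hidden spin j of layer l is tied to each visible spin."""
    r = [weight_matrices[0]]
    for W in weight_matrices[1:]:
        r.append(r[-1] @ W)
    return r

# Example with the layer sizes used in the paper (1600 -> 400 -> 100 -> 25),
# using random weights as stand-ins for the trained ones.
rng = np.random.default_rng(0)
sizes = [1600, 400, 100, 25]
Ws = [rng.standard_normal((a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
fields = effective_receptive_fields(Ws)
top_layer_field = fields[-1][:, 0].reshape(40, 40)   # one 40x40 receptive field image
print([r.shape for r in fields])                     # [(1600, 400), (1600, 100), (1600, 25)]
```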
ACKNOWLEDGMENTS

PM is grateful to Charles K. Fisher for useful conversations. We are also grateful to Javad Noorbakhsh and Alex Lang for comments on the manuscript. This work was partially supported by a Simons Foundation Investigator Award in the Mathematical Modeling of Living Systems and a Sloan Research Fellowship (to P.M.). DJS was partially supported by NIH Grant K25 GM098875.

[1] Y. Bengio, Foundations and Trends in Machine Learning 2, 1 (2009).
[2] Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, and A. Ng, International Conference on Machine Learning (2012).
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, Advances in Neural Information Processing Systems 25, 1097 (2012).
[4] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, IEEE Signal Processing Magazine 29, 82 (2012).
[5] R. Sarikaya, G. Hinton, and A. Deoras, IEEE Transactions on Audio, Speech and Language Processing (2014).
[6] G. E. Hinton and R. R. Salakhutdinov, Science 313, 504 (2006).
[7] Y. Bengio and Y. LeCun, Large-Scale Kernel Machines 34, 1 (2007).
[8] N. Le Roux and Y. Bengio, Neural Computation 22, 2192 (2010).
[9] N. Le Roux and Y. Bengio, Neural Computation 20, 1631 (2008).
[10] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, Advances in Neural Information Processing Systems 19, 153 (2007).
[11] K. G. Wilson and J. Kogut, Physics Reports 12, 75 (1974).
[12] K. G. Wilson, Reviews of Modern Physics 55, 583 (1983).
[13] J. Cardy, Scaling and Renormalization in Statistical Physics, Vol. 5 (Cambridge University Press, 1996).
[14] L. P. Kadanoff, Statics, Dynamics and Renormalization (World Scientific, 2000).
[15] N. Goldenfeld (1992).
[16] L. P. Kadanoff, A. Houghton, and M. C. Yalabik, Journal of Statistical Physics 14, 171 (1976).
[17] E. Efrati, Z. Wang, A. Kolan, and L. P. Kadanoff, Reviews of Modern Physics 86, 647 (2014).
[18] Y. Bengio, A. Courville, and P. Vincent, IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 1798 (2013).
[19] G. Hinton, S. Osindero, and Y.-W. Teh, Neural Computation 18, 1527 (2006).
[20] R. Salakhutdinov, A. Mnih, and G. Hinton, in Proceedings of the 24th International Conference on Machine Learning (ACM, 2007), pp. 791-798.
[21] H. Larochelle and Y. Bengio, in Proceedings of the 25th International Conference on Machine Learning (ACM, 2008), pp. 536-543.
[22] P. Smolensky (1986).
[23] Y. W. Teh and G. E. Hinton, Advances in Neural Information Processing Systems, 908 (2001).
[24] G. E. Hinton, Neural Computation 14, 1771 (2002).