An exact mapping between the Variational Renormalization Group and Deep Learning
Pankaj Mehta, Dept. of Physics, Boston University, Boston, MA
David J. Schwab, Dept. of Physics, Northwestern University, Evanston, IL

Deep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.

A central goal of modern machine learning research is to learn and extract important features directly from data. Among the most promising and successful techniques for accomplishing this goal are those associated with the emerging sub-discipline of deep learning. Deep learning uses multiple layers of representation to learn descriptive features directly from training data [1, 2] and has been successfully utilized, often achieving record-breaking results, in difficult machine learning tasks including object labeling [3], speech recognition [4], and natural language processing [5]. In this work, we will focus on a set of deep learning algorithms known as deep neural networks (DNNs) [6]. DNNs are biologically-inspired graphical statistical models that consist of multiple layers of "neurons", with units in one layer receiving inputs from units in the layer below them. Despite their enormous success, it is still unclear what advantages these deep, multi-layer architectures possess over shallower architectures with a similar number of parameters. In particular, it is still not well understood theoretically why DNNs are so successful at uncovering features in structured data. (But see [7-9].)

One possible explanation for the success of DNN architectures is that they can be viewed as an iterative coarse-graining scheme, where each new high-level layer of the neural network learns increasingly abstract higher-level features from the data [1, 10]. The initial layers of the DNN can be thought of as low-level feature detectors which are then fed into higher layers in the DNN that combine these low-level features into more abstract higher-level features, providing a useful, and at times reduced, representation of the data.
By successively applying feature extraction, DNNs learn to deemphasize irrelevant features in the data while simultaneously learning relevant ones. (Note that in a supervised setting, such as classification, relevant and irrelevant are ultimately determined by the problem at hand. Here we are concerned solely with the unsupervised aspect of training DNNs, and the use of DNNs for compression [6].) In what follows, we make this explanation precise.

This successive coarse-graining procedure is reminiscent of one of the most successful and important tools in theoretical physics, the renormalization group (RG) [11, 12]. RG is an iterative coarse-graining procedure designed to tackle difficult physics problems involving many length scales. The central goal of RG is to extract relevant features of a physical system for describing phenomena at large length scales by integrating out (i.e. marginalizing over) short distance degrees of freedom. In any RG sequence, fluctuations are integrated out starting at the microscopic scale and then moving iteratively on to fluctuations at larger scales. Under this procedure, certain features, called relevant operators, become increasingly important while other features, dubbed irrelevant operators, have a diminishing effect on the physical properties of the system at large scales.

In general, it is impossible to carry out the renormalization procedure exactly. Therefore, a number of approximate RG procedures have been developed in the theoretical physics community [12-15]. One such approximate method is a class of variational "real-space" renormalization schemes introduced by Kadanoff for performing RG on spin systems [14, 16, 17]. Kadanoff's variational RG scheme introduces coarse-grained auxiliary, or "hidden", spins that are coupled to the physical spin system through some unknown coupling parameters. A parameter-dependent free energy is calculated for the coarse-grained spin system from the coupled system by integrating out the physical spins. The coupling parameters are chosen through a variational procedure that minimizes the difference between the free energies of the physical and hidden spin systems. This ensures that the coarse-grained system preserves the long-distance information present in the physical system. Carrying out this procedure results in an RG transformation that maps the physical spin system into a coarse-grained description in terms of hidden spins. The hidden spins then serve as the input for the next round of renormalization.

The introduction of layers of hidden spins is also a central component of DNNs based on Restricted Boltzmann Machines (RBMs). In RBMs, hidden spins (often called units or neurons) are coupled to "visible" spins describing the data of interest. (Here we restrict ourselves to binary data.) The coupling parameters between the visible and hidden layers are chosen using a variational procedure that minimizes the Kullback-Leibler divergence (i.e. relative entropy) between the "true" probability distribution of the data and the variational distribution obtained by marginalizing over the hidden spins. Like in variational RG, RBMs can be used to map a state of the visible spins in a data sample into a description in terms of hidden spins.
If the number of hidden units is less than the number of visible units, such a mapping can be thought of as a compression. (Note, however, that dimensional expansions are common [18].) In deep learning, individual RBMs are stacked on top of each other into a DNN [6, 19], with the output of one RBM serving as the input to the next. Moreover, the variational procedure is often performed iteratively, layer by layer.

The preceding paragraphs suggest an intimate connection between RG and deep learning. Indeed, here we construct an exact mapping from the variational RG scheme of Kadanoff to DNNs based on RBMs [6, 19]. Our mapping suggests that DNNs implement a generalized RG-like procedure to extract relevant features from structured data.

The paper is organized as follows. We begin by reviewing Kadanoff's variational renormalization scheme in the context of the Ising Model. We then introduce RBMs and deep learning architectures of stacked RBMs. We then show how to map the procedure of variational RG to unsupervised training of a DNN. We illustrate these ideas using the one- and two-dimensional nearest-neighbor Ising models. We end by discussing the implication of our mapping for physics and machine learning.

I. OVERVIEW OF VARIATIONAL RG

In statistical physics, one often considers an ensemble of N binary spins {v_i} that can take the values ±1. The index i labels the position of spin v_i in some lattice. In thermal equilibrium, the probability of a spin configuration is given by the Boltzmann distribution

P(\{v_i\}) = \frac{e^{-H(\{v_i\})}}{Z},   (1)

where we have defined the Hamiltonian H({v_i}), and the partition function

Z = \mathrm{Tr}_{v_i}\, e^{-H(\{v_i\})} \equiv \sum_{v_1,\dots,v_N = \pm 1} e^{-H(\{v_i\})}.   (2)

Note that throughout the paper we set the temperature equal to one, without loss of generality. Typically, the Hamiltonian depends on a set of couplings or parameters, K = {K_s}, that parameterizes the set of all possible Hamiltonians. For example, with binary spins, the K could be the couplings describing the spin interactions of various orders:

H[\{v_i\}] = -\sum_i K_i v_i - \sum_{ij} K_{ij} v_i v_j - \sum_{ijk} K_{ijk} v_i v_j v_k + \dots   (3)

Finally, we can define the free energy of the spin system in the standard way:

F^v = -\log Z = -\log \mathrm{Tr}_{v_i}\, e^{-H(\{v_i\})}.   (4)

The idea behind RG is to find a new coarse-grained description of the spin system where one has "integrated out" short distance fluctuations. To this end, let us introduce M < N new binary spins, {h_j}. Each of these spins h_j will serve as a coarse-grained degree of freedom where fluctuations on small scales have been averaged out. Typically, such a coarse-graining procedure increases some characteristic length scale describing the system such as the lattice spacing. For example, in the block spin renormalization picture introduced by Kadanoff, each h_i represents the state of a local block of physical spins, v_i. Figure 1 shows such a block-spin procedure for a two-dimensional spin system on a square lattice, where each h_i represents a 2x2 block of visible spins. The result of such a coarse-graining procedure is that the lattice spacing is doubled at each step of the renormalization procedure.

FIG. 1. Block spin renormalization. In block spin renormalization [14], a physical system is coarse grained by introducing new "block" variables which describe some "effective" behavior of a block of spins. For example, in the figure, four adjacent spins are grouped into 2x2 blocks. The system is then described in terms of these new block variables. This scheme is then iterated to create new block variables that average over an even larger set of the original spins. Notice the lattice spacing doubles after each iteration.
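To make Eqs. (1)-(4) concrete, the following minimal Python sketch (our own illustration, not part of the original analysis; the 8-spin open chain and J = 1 are arbitrary choices) evaluates the Boltzmann distribution, the partition function, and the free energy by brute-force enumeration of all 2^N configurations:

```python
import itertools
import numpy as np

def hamiltonian(spins, J=1.0):
    """Nearest-neighbor Ising energy H = -J sum_i v_i v_{i+1} (open chain, T = 1)."""
    spins = np.asarray(spins)
    return -J * np.sum(spins[:-1] * spins[1:])

def partition_function_and_free_energy(N=8, J=1.0):
    """Brute-force evaluation of Eq. (2) and Eq. (4) over all 2^N configurations."""
    energies = np.array([hamiltonian(cfg, J)
                         for cfg in itertools.product([-1, 1], repeat=N)])
    Z = np.sum(np.exp(-energies))   # Eq. (2)
    F = -np.log(Z)                  # Eq. (4)
    return Z, F

def boltzmann_probability(spins, Z, J=1.0):
    """Eq. (1): probability of a single configuration."""
    return np.exp(-hamiltonian(spins, J)) / Z

if __name__ == "__main__":
    Z, F = partition_function_and_free_energy(N=8, J=1.0)
    print(f"Z = {Z:.3f}, F = {F:.3f}")
    print("P(all spins up) =", boltzmann_probability([1] * 8, Z))
```

Enumeration is of course only feasible for very small N; it is used here purely to pin down the definitions.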
In general, the interactions (statistical correlations) between the {v_i} induce interactions (statistical correlations) between the coarse-grained spins, {h_j}. In particular, the coarse-grained system can be described by a new coarse-grained Hamiltonian of the form

H^{RG}[\{h_j\}] = -\sum_i \tilde{K}_i h_i - \sum_{ij} \tilde{K}_{ij} h_i h_j - \sum_{ijk} \tilde{K}_{ijk} h_i h_j h_k + \dots,   (5)

where {\tilde{K}} describe interactions between the hidden spins, {h_j}. In the physics literature, such a renormalization transformation is often represented as a mapping between couplings, {K} -> {\tilde{K}}. Of course, the exact mapping depends on the details of the RG scheme used.

In the variational RG scheme proposed by Kadanoff, the coarse-graining procedure is implemented by constructing a function, T_λ({v_i},{h_j}), that depends on a set of variational parameters {λ} and encodes (typically pairwise) interactions between the physical and coarse-grained degrees of freedom. After coupling the auxiliary spins {h_j} to the physical spins {v_i}, one can then integrate out (marginalize over) the visible spins to arrive at a coarse-grained description of the physical system entirely in terms of the {h_j}. The function T_λ({v_i},{h_j}) then naturally defines a Hamiltonian for the {h_j} through the expression

e^{-H^{RG}_{\lambda}[\{h_j\}]} \equiv \mathrm{Tr}_{v_i}\, e^{T_{\lambda}(\{v_i\},\{h_j\}) - H(\{v_i\})}.   (6)

We can also define a free energy for the coarse-grained system in the usual way,

F^h_{\lambda} = -\log \mathrm{Tr}_{h_i}\, e^{-H^{RG}_{\lambda}(\{h_i\})}.   (7)

Thus far we have ignored the problem of choosing the variational parameters λ that define our RG transformation T_λ({v_i},{h_j}). Intuitively, it is clear we should choose λ to ensure that the long-distance physical observables of the system are invariant to this coarse-graining procedure. This is done by choosing the parameters λ to minimize the free energy difference, ΔF = F^h_λ - F^v, between the physical and coarse-grained systems. Notice that

\Delta F = 0 \iff \mathrm{Tr}_{h_j}\, e^{T_{\lambda}(\{v_i\},\{h_j\})} = 1.   (8)

Thus, for any exact RG transformation, we know that

\mathrm{Tr}_{h_j}\, e^{T_{\lambda}(\{v_i\},\{h_j\})} = 1.   (9)

In general, it is not possible to choose the parameters λ to satisfy the condition above, and various variational schemes (e.g. bond moving) have been proposed to choose λ to minimize this ΔF.
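The variational construction can be illustrated numerically on a very small system. In the sketch below (again our own illustration, not from the paper; the block-pairing form of T_λ, in which each hidden spin couples with strength λ to two adjacent visible spins, is an assumption made purely for demonstration), we enumerate all configurations to evaluate the coarse-grained Hamiltonian of Eq. (6), the free-energy difference ΔF, and the deviation from the exactness condition of Eq. (9):

```python
import itertools
import numpy as np

def H(v, J=1.0):
    """Physical Hamiltonian: nearest-neighbor 1D Ising chain (open boundary)."""
    v = np.asarray(v)
    return -J * np.sum(v[:-1] * v[1:])

def T(v, h, lam):
    """Illustrative pairwise operator T_lambda(v, h): each hidden spin h_j
    couples with strength lambda to a block of two adjacent visible spins."""
    v, h = np.asarray(v), np.asarray(h)
    blocks = v.reshape(len(h), 2).sum(axis=1)
    return lam * np.sum(h * blocks)

def free_energy_difference(N=6, J=1.0, lam=1.0):
    """Delta F = F_h - F_v (Eqs. 4, 6, 7) by brute-force enumeration."""
    M = N // 2
    visibles = list(itertools.product([-1, 1], repeat=N))
    hiddens = list(itertools.product([-1, 1], repeat=M))

    # Eq. (6): exp(-H_RG(h)) = Tr_v exp(T(v, h) - H(v))
    exp_minus_HRG = {h: sum(np.exp(T(v, h, lam) - H(v, J)) for v in visibles)
                     for h in hiddens}
    F_h = -np.log(sum(exp_minus_HRG.values()))              # Eq. (7)
    F_v = -np.log(sum(np.exp(-H(v, J)) for v in visibles))  # Eq. (4)

    # Exactness check, Eq. (9): Tr_h exp(T(v, h)) should equal 1 for every v.
    worst = max(abs(sum(np.exp(T(v, h, lam)) for h in hiddens) - 1.0)
                for v in visibles)
    return F_h - F_v, worst

if __name__ == "__main__":
    for lam in [0.25, 0.5, 1.0]:
        dF, err = free_energy_difference(lam=lam)
        print(f"lambda = {lam:.2f}: Delta F = {dF:+.3f}, max |Tr_h e^T - 1| = {err:.3f}")
```

For this simple choice of T_λ the exactness condition is badly violated, which is precisely why schemes such as bond moving are needed in practice.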
II. RBMS AND DEEP NEURAL NETWORKS

We will show below that this variational RG procedure has a natural interpretation as a deep learning scheme based on a powerful class of energy-based models called Restricted Boltzmann Machines (RBMs) [6, 20-23]. We will restrict our discussion to RBMs acting on binary data [6] drawn from some probability distribution, P({v_i}), with {v_i} binary spins labeled by an index i = 1 ... N. For example, for black and white images each spin v_i encodes whether a given pixel is on or off and the distribution P({v_i}) encodes the statistical properties of the ensemble of images (e.g. the set of all handwritten digits in the MNIST dataset).

To model the data distribution, RBMs introduce new hidden spin variables, {h_j} (j = 1 ... M), that couple to the visible units. The interactions between visible and hidden units are modeled using an energy function of the form

E(\{v_i\},\{h_j\}) = \sum_j b_j h_j + \sum_{ij} v_i w_{ij} h_j + \sum_i c_i v_i,   (10)

where λ = {b_j, w_{ij}, c_i} are variational parameters of the model. In terms of this energy function, the joint probability of observing a configuration of hidden and visible spins can be written as

p_{\lambda}(\{v_i\},\{h_j\}) = \frac{e^{-E(\{v_i\},\{h_j\})}}{Z}.   (11)

This joint distribution also defines a variational distribution for the visible spins,

p_{\lambda}(\{v_i\}) = \sum_{\{h_j\}} p_{\lambda}(\{v_i\},\{h_j\}) = \mathrm{Tr}_{h_j}\, p_{\lambda}(\{v_i\},\{h_j\}),   (12)

as well as a marginal distribution for the hidden spins themselves:

p_{\lambda}(\{h_j\}) = \sum_{\{v_i\}} p_{\lambda}(\{v_i\},\{h_j\}) = \mathrm{Tr}_{v_i}\, p_{\lambda}(\{v_i\},\{h_j\}).   (13)

Finally, for future reference it will be helpful to define a "variational" RBM Hamiltonian for the visible units,

p_{\lambda}(\{v_i\}) \equiv \frac{e^{-H^{RBM}_{\lambda}[\{v_i\}]}}{Z},   (14)

and an RBM Hamiltonian for the hidden units,

p_{\lambda}(\{h_j\}) \equiv \frac{e^{-H^{RBM}_{\lambda}[\{h_j\}]}}{Z}.   (15)

Since the objective of the RBM for our purposes is unsupervised learning, the parameters in the RBM are chosen to minimize the Kullback-Leibler divergence between the true distribution of the data P({v_i}) and the variational distribution p_λ({v_i}):

D_{KL}(P(\{v_i\})\,\|\,p_{\lambda}(\{v_i\})) = \sum_{\{v_i\}} P(\{v_i\}) \log \frac{P(\{v_i\})}{p_{\lambda}(\{v_i\})}.   (16)

Furthermore, notice that when the RBM exactly reproduces the visible data distribution,

D_{KL}(P(\{v_i\})\,\|\,p_{\lambda}(\{v_i\})) = 0.   (17)

In general it is not possible to explicitly minimize D_KL(P({v_i}) || p_λ({v_i})), and this minimization is usually performed using approximate numerical methods such as contrastive divergence [24]. Note that if the number of hidden units is restricted (i.e. less than 2^N), the RBM cannot be made to match an arbitrary distribution exactly [9].

In a DNN, RBMs are stacked on top of each other so that, once trained, the hidden layer of one RBM serves as the visible layer of the next RBM. In particular, one can map a configuration of visible spins to a configuration in the hidden layer via the conditional probability distribution, p_λ({h_j}|{v_i}). Thus, after training an RBM, we can treat the activities of the hidden layer in response to each visible data sample as data for learning a second layer of hidden spins, and so on.
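A minimal RBM implementing Eqs. (10)-(13), together with a single contrastive-divergence update, can be sketched as follows (illustrative code, not the authors' implementation; the ±1 conditional probabilities and the signs of the updates follow from the p ∝ e^{-E} convention of Eq. (10), and the learning rate and toy data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

class BinaryRBM:
    """Minimal RBM on +/-1 spins, following the sign convention of Eq. (10):
    p(v, h) ~ exp(-E) with E = sum_j b_j h_j + sum_ij v_i w_ij h_j + sum_i c_i v_i."""

    def __init__(self, n_visible, n_hidden):
        self.w = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b = np.zeros(n_hidden)    # hidden biases b_j
        self.c = np.zeros(n_visible)   # visible biases c_i

    def energy(self, v, h):
        return self.b @ h + v @ self.w @ h + self.c @ v   # Eq. (10)

    def p_h_given_v(self, v):
        # With p ~ e^{-E}, P(h_j = +1 | v) = sigmoid(-2 (b_j + sum_i v_i w_ij)).
        return sigmoid(-2.0 * (self.b + v @ self.w))

    def p_v_given_h(self, h):
        return sigmoid(-2.0 * (self.c + self.w @ h))

    def sample(self, p):
        return np.where(rng.random(p.shape) < p, 1.0, -1.0)

    def cd1_update(self, v_data, lr=0.01):
        """One contrastive divergence (CD-1) step on a single +/-1 sample."""
        h_data = self.sample(self.p_h_given_v(v_data))
        v_model = self.sample(self.p_v_given_h(h_data))
        h_model_p = 2.0 * self.p_h_given_v(v_model) - 1.0   # <h_j> given v_model
        # Approximate log-likelihood gradient; signs follow the e^{-E} convention.
        self.w += lr * (v_model[:, None] * h_model_p[None, :]
                        - v_data[:, None] * h_data[None, :])
        self.b += lr * (h_model_p - h_data)
        self.c += lr * (v_model - v_data)

# Toy example: fit a tiny RBM to samples in which all visible spins are aligned.
rbm = BinaryRBM(n_visible=4, n_hidden=2)
for _ in range(2000):
    v = np.full(4, rng.choice([-1.0, 1.0]))
    rbm.cd1_update(v)
```

In practice one trains on mini-batches and stacks such RBMs layer by layer, as described above.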
III. MAPPING VARIATIONAL RG TO DEEP LEARNING

In variational RG, the couplings between the hidden and visible spins are encoded by the operators T_λ({v_i},{h_j}). In RBMs, an analogous role is played by the joint energy function E({v_i},{h_j}). In fact, as we will show below, these objects are related through the equation

T(\{v_i\},\{h_j\}) = -E(\{v_i\},\{h_j\}) + H[\{v_i\}],   (18)

where H[{v_i}] is the Hamiltonian defined in Eq. 3 that encodes the data probability distribution P({v_i}). This equation defines a one-to-one mapping between the variational RG scheme and RBM-based DNNs.

Using this definition, it is easy to show that the Hamiltonian H^{RG}_λ[{h_j}], originally defined in Eq. 6 as the Hamiltonian of the coarse-grained degrees of freedom after performing RG, also describes the hidden spins in the RBM. This is equivalent to the statement that the marginal distribution p_λ({h_j}) describing the hidden spins of the RBM is of the Boltzmann form with a Hamiltonian H^{RG}_λ[{h_j}]. To prove this, we divide both sides of Eq. 6 by Z to get

\frac{e^{-H^{RG}_{\lambda}[\{h_j\}]}}{Z} = \frac{\mathrm{Tr}_{v_i}\, e^{T_{\lambda}(\{v_i\},\{h_j\}) - H(\{v_i\})}}{Z}.   (19)

Substituting Eq. 18 into this equation yields

\frac{e^{-H^{RG}_{\lambda}[\{h_j\}]}}{Z} = \frac{\mathrm{Tr}_{v_i}\, e^{-E(\{v_i\},\{h_j\})}}{Z} = p_{\lambda}(\{h_j\}).   (20)

Substituting Eq. 15 into the right-hand side yields the desired result,

H^{RG}_{\lambda}[\{h_j\}] = H^{RBM}_{\lambda}[\{h_j\}].   (21)

These results also provide a natural interpretation for variational RG entirely in the language of probability theory. The operator T_λ({v_i},{h_j}) can be viewed as a variational approximation for the conditional probability of the hidden spins given the visible spins. To see this, notice that

e^{T(\{v_i\},\{h_j\})} = e^{-E(\{v_i\},\{h_j\}) + H[\{v_i\}]} = \frac{p_{\lambda}(\{v_i\},\{h_j\})}{p_{\lambda}(\{v_i\})}\, e^{H[\{v_i\}] - H^{RBM}_{\lambda}[\{v_i\}]} = p_{\lambda}(\{h_j\}|\{v_i\})\, e^{H[\{v_i\}] - H^{RBM}_{\lambda}[\{v_i\}]},   (22)

where in going from the first line to the second line we have used Eqs. 11 and 14. This implies that when the RG can be performed exactly (i.e. the RG transformation satisfies the equality Tr_{h_j} e^{T_λ({v_i},{h_j})} = 1), the variational Hamiltonian is identical to the true Hamiltonian describing the data, H[{v_i}] = H^{RBM}_λ[{v_i}], and T({v_i},{h_j}) is exactly the conditional probability. In the language of probability theory, this means that the variational distribution p_λ({v_i}) exactly reproduces the true data distribution P({v_i}) and D_KL(P({v_i}) || p_λ({v_i})) = 0.

In general, it is not possible to perform the variational RG transformation exactly. Instead, one constructs a family of variational approximations for the exact RG transform [14, 16, 17]. The discussion above makes it clear that these variational approximations work at the level of the Hamiltonians and free energies. In contrast, in the machine learning literature, these variational approximations are usually made by minimizing the KL divergence D_KL(P({v_i}) || p_λ({v_i})). Thus, the two approaches employ distinct variational approximation schemes for coarse graining. Finally, notice that the correspondence does not rely on the explicit form of the energy E({h_j},{v_i}) and hence holds for any Boltzmann Machine.
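The bookkeeping behind Eqs. (18)-(22) can be checked by brute force on a tiny system. The sketch below (our own consistency check, not from the paper; the 4-spin chain data Hamiltonian and the random RBM parameters are arbitrary) verifies Eq. (22) numerically, i.e. that e^{T(v,h)} = p_λ(h|v) e^{H(v) - H^RBM_λ(v)} for every configuration:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
N, M = 4, 2
w = rng.standard_normal((N, M))
b, c = rng.standard_normal(M), rng.standard_normal(N)
J = 1.0

def H_data(v):
    """'True' data Hamiltonian, Eq. (3): a nearest-neighbor Ising chain."""
    v = np.asarray(v)
    return -J * np.sum(v[:-1] * v[1:])

def E(v, h):
    """RBM energy, Eq. (10)."""
    v, h = np.asarray(v), np.asarray(h)
    return b @ h + v @ w @ h + c @ v

visibles = list(itertools.product([-1, 1], repeat=N))
hiddens = list(itertools.product([-1, 1], repeat=M))
Z = sum(np.exp(-E(v, h)) for v in visibles for h in hiddens)

def p_joint(v, h):    # Eq. (11)
    return np.exp(-E(v, h)) / Z

def p_v(v):           # Eq. (12)
    return sum(p_joint(v, h) for h in hiddens)

def H_RBM_v(v):       # Eq. (14): H_RBM(v) = -log(Z * p(v))
    return -np.log(Z * p_v(v))

def T(v, h):          # Eq. (18): relation between the RG operator and the RBM energy
    return -E(v, h) + H_data(v)

# Verify Eq. (22) for every (v, h): e^{T} = p(h|v) * exp(H_data(v) - H_RBM(v)).
max_err = max(abs(np.exp(T(v, h))
                  - (p_joint(v, h) / p_v(v)) * np.exp(H_data(v) - H_RBM_v(v)))
              for v in visibles for h in hiddens)
print("max deviation from Eq. (22):", max_err)   # ~1e-12, i.e. numerical noise
```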
FIG. 2. RG and deep learning in the one-dimensional Ising Model. (A) A decimation-based renormalization transformation for the ferromagnetic 1-D Ising model. At each step, half the spins are decimated, doubling the effective lattice spacing. After n successive decimations, the spins can be described using a new 1-D Ising model with a coupling J^(n) between spins. Couplings at a given layer are related to couplings at the previous layer through the square of the hyperbolic tangent function. (B) Decimation-based renormalization transformations can also be realized using the deep architecture shown, where the weights between the (n+1)-th and n-th hidden layers are given by J^(n). (C) Visualization of the renormalization group flow of the couplings for the 1-D ferromagnetic Ising model. Under four successive decimations, or equivalently as we move up four layers in the deep architecture, the couplings (marked by red dots) get smaller. Eventually, the couplings are attracted to the stable fixed point J = 0.

IV. EXAMPLES

To gain intuition about the mapping between RG and deep learning, it is helpful to consider some simple examples in detail. We begin by examining the one-dimensional nearest-neighbor Ising model where the RG transformation can be carried out exactly. We then numerically explore the two-dimensional nearest-neighbor Ising model using an RBM-based deep learning architecture.

A. One dimensional Ising Model

The one-dimensional Ising model describes a collection of binary spins {v_i} organized along a one-dimensional lattice with lattice spacing a. Such a system is described by a Hamiltonian of the form

H = -J \sum_i v_i v_{i+1},   (23)

where J is a ferromagnetic coupling that energetically favors configurations where neighboring spins align. To perform an RG transformation, we decimate (marginalize over) every other spin. This doubles the lattice spacing a -> 2a and results in a new effective interaction J^(1) between spins (see Figure 2). If we denote the coupling after performing n successive RG transformations by J^(n), then a standard calculation shows that these coefficients satisfy the RG equations

\tanh[J^{(n+1)}] = \tanh^2[J^{(n)}],   (24)

where we have defined J^(0) = J [14]. This recursion relationship can be visualized as a one-dimensional flow in the coupling space J from J = ∞ to J = 0. Thus, after performing RG the interactions become weaker and weaker and J -> 0 as n -> ∞.

This RG transformation also naturally gives rise to the deep learning architecture shown in Figure 2. The spins at a given layer of the DNN have a natural interpretation as the decimated spins when performing the RG transformation in the layer below. Notice that the coupled spins in the bottom two layers of the DNN in Fig. 2B form an "effective" one-dimensional chain isomorphic to the original spin chain. Thus, marginalizing over spins in the bottom layer of the DNN is identical to decimating every other spin in the original spin system. This implies that the "hidden" spins in the second layer of the DNN are also described by the RG-transformed Hamiltonian with a coupling J^(1) between neighboring spins. Repeating this argument for spins coupled between the second and third layers, and so on, one obtains the deep learning architecture shown in Fig. 2B, which implements decimation.

The advantage of the simple deep architecture presented here is that it is easy to interpret and requires no calculations to construct. However, an important shortcoming is that it contains no information about half of the visible spins, namely the spins that do not couple to the hidden layer.
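The decimation flow of Eq. (24) is easy to reproduce numerically. The short sketch below (illustrative only) iterates the recursion and cross-checks it against the coupling obtained by explicitly summing out the middle spin of a three-spin chain, J' = (1/2) log cosh(2J):

```python
import numpy as np

def decimation_step(J):
    """One decimation (Eq. 24): tanh(J_next) = tanh(J)^2."""
    return np.arctanh(np.tanh(J) ** 2)

def decimation_step_direct(J):
    """Same step obtained by explicitly summing out the middle spin of a
    three-spin chain: sum_{v2} exp(J v1 v2 + J v2 v3) = A exp(J' v1 v3),
    which gives J' = 0.5 * log(cosh(2J))."""
    return 0.5 * np.log(np.cosh(2.0 * J))

J = 1.0   # initial ferromagnetic coupling (temperature absorbed, T = 1)
for n in range(5):
    print(f"layer {n}: J = {J:.4f}")
    assert np.isclose(decimation_step(J), decimation_step_direct(J))
    J = decimation_step(J)
# The couplings flow toward the stable fixed point J = 0, as in Fig. 2C.
```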
FIG. 3. Deep learning the 2D Ising model. (A) A Deep Neural Network with four layers of size 1600, 400, 100, and 25 spins was trained using samples drawn from a 2D Ising model slightly above the critical temperature, J/(k_B T) = 0.408. (B) Visualization of the effective receptive fields for the top layer of spins. Each 40 by 40 pixel image depicts the effective receptive field of one of the 25 spins in the top layer (see Materials and Methods). (C) Visualization of effective receptive fields for each of the 100 spins in the middle layer, calculated as in B. (D) The effective receptive fields get larger as one moves up the Deep Neural Network. This is consistent with what is expected from the successive application of block renormalization. (E) Three representative samples drawn from the 2D Ising model at J = 0.408 and their reconstruction from the trained DNN. Samples were reconstructed from DNNs as in [6].

B. Two dimensional Ising Model

We next applied deep learning techniques to numerically coarse-grain the two-dimensional nearest-neighbor Ising model on a square lattice. This model is described by a Hamiltonian of the form

H[\{v_i\}] = -J \sum_{\langle ij \rangle} v_i v_j,   (25)

where ⟨ij⟩ indicates that i and j are nearest neighbors and J is a ferromagnetic coupling that favors configurations where neighboring spins align. Unlike the one-dimensional Ising model, the two-dimensional Ising model has a phase transition that occurs when J/(k_B T) = 0.4352 (recall we have set β = T^{-1} = 1). At the phase transition, the characteristic length scale of the system, the correlation length, diverges. For this reason, near a critical point the system can be productively coarse-grained using a procedure similar to Kadanoff's block spin renormalization (see Fig. 1) [14].

Inspired by our mapping between variational RG and DNNs, we applied standard deep learning techniques to samples generated from the 2D Ising model for J = 0.408, just above the critical temperature. 20,000 samples were generated from a periodic 40x40 2D Ising model using standard equilibrium Monte Carlo techniques and served as input to an RBM-based deep neural network of four layers with 1600, 400, 100, and 25 spins, respectively (see Fig. 3A). We furthermore imposed an L1 penalty on the weights between layers in the RBM and trained the network using contrastive divergence [24] (see Materials and Methods). The L1 penalty serves as a sparsity-promoting regularizer that encourages weights in the RBM to be zero and prevents overfitting due to the finite number of samples. In practice, it ensures that visible and hidden spins interact with only a small subset of all the spins in an RBM. (Note that we did not use a convolutional network that explicitly builds in spatial locality or translational invariance.)
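For completeness, the Monte Carlo sampling step can be sketched as follows (a minimal single-spin-flip Metropolis sampler of our own, not the authors' code; the equilibration and decorrelation settings are arbitrary and would need to be much longer this close to criticality):

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_sweep(spins, J):
    """One Metropolis sweep of a periodic 2D Ising lattice at temperature T = 1."""
    L = spins.shape[0]
    for _ in range(L * L):
        i, j = rng.integers(L, size=2)
        # Sum of the four nearest neighbors with periodic boundary conditions.
        nn = (spins[(i + 1) % L, j] + spins[(i - 1) % L, j]
              + spins[i, (j + 1) % L] + spins[i, (j - 1) % L])
        dE = 2.0 * J * spins[i, j] * nn      # energy change if spin (i, j) flips
        if dE <= 0 or rng.random() < np.exp(-dE):
            spins[i, j] *= -1
    return spins

def generate_samples(n_samples=100, L=40, J=0.408, equil=200, stride=10):
    """Draw approximately independent configurations for RBM training."""
    spins = rng.choice([-1, 1], size=(L, L))
    for _ in range(equil):                   # equilibration
        metropolis_sweep(spins, J)
    samples = []
    for _ in range(n_samples):
        for _ in range(stride):              # decorrelation between samples
            metropolis_sweep(spins, J)
        samples.append(spins.flatten().copy())
    return np.array(samples)                 # shape (n_samples, 1600) for L = 40

if __name__ == "__main__":
    data = generate_samples(n_samples=10)    # small demo; the paper used 20,000
    print(data.shape, "mean magnetization:", data.mean())
```

Each flattened 1600-spin configuration then serves as a visible-layer sample for the stacked-RBM training described above.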
The architecture of the resulting DNN suggests that it is implementing a coarse-graining scheme similar to block spin renormalization (see Fig. 3). Each spin in a hidden layer couples to a local block of spins in the layer below. This iterative blocking is consistent with Kadanoff's intuitive picture of how coarse-graining should be implemented near the critical point. Moreover, the blocks coupling to each hidden unit in a layer are of approximately the same size (Fig. 3B,C), and the characteristic size increases with layer (Fig. 3D). Surprisingly, this local block spin structure emerges from the training process, suggesting the DNN is self-organizing to implement block spin renormalization. Furthermore, as shown in Fig. 3E, reconstructions from the coarse-grained DNN can qualitatively reproduce the macroscopic features of individual samples despite having only 25 spins in the top layer, a compression ratio of 64.

V. DISCUSSION

Deep learning is one of the most successful paradigms for unsupervised learning to emerge over the last ten years. The enormous success of deep learning techniques at a variety of practical machine learning tasks ranging from voice recognition to image classification raises natural questions about its theoretical underpinnings. Here, we have demonstrated that there is a one-to-one mapping between RBM-based Deep Neural Networks and the variational renormalization group. We illustrated this mapping by analytically constructing a DNN for the 1D Ising model and numerically examining the 2D Ising model. Surprisingly, we found that these DNNs self-organize to implement a coarse-graining procedure reminiscent of Kadanoff block renormalization. This suggests that deep learning may be implementing a generalized RG-like scheme to learn important features from data.

RG plays a central role in our modern understanding of statistical physics and quantum field theory. A central finding of RG is that the long distance physics of many disparate physical systems is dominated by the same long distance fixed points. This gives rise to the idea of universality: many microscopically dissimilar systems exhibit macroscopically similar properties at long distances. Physicists have developed elaborate technical machinery for exploiting fixed points and universality to identify the salient long distance features of physical systems. It will be interesting to see what, if any, of this more complex machinery can be imported to deep learning. A potential obstacle for importing ideas from physics into the deep learning framework is that RG is commonly applied to physical systems with many symmetries. This is in contrast to deep learning, which is often applied to data with limited structure.

Recently, it was suggested that modern RG techniques developed in the context of quantum systems, such as matrix product states and tensor networks, have a natural interpretation in terms of variational RG [17]. These new techniques exploit ideas such as entanglement entropy and disentanglers which create features with a minimum amount of redundancy. It is an open question whether these ideas can be imported into deep learning algorithms. Our mapping also suggests a route for applying real-space renormalization techniques to complicated physical systems. Real-space renormalization techniques such as variational RG have often been limited by their inability to make good approximations. Techniques from deep learning may represent a possible route for overcoming these problems.
Appendix A: Learning Deep Architecture for the Two-dimensional Ising Model

Details are given in the SI Materials and Methods. Stacked RBMs were trained with a variant of the code from [6]. This code is available at https://code.google.com/p/matrbm/. In particular, only the unsupervised learning phase was performed. Individual RBMs were trained with contrastive divergence for 200 epochs, with momentum 0.5, using mini-batches of size 100 on 40,000 total samples from the 2D Ising model with J = 0.408. Additionally, L1 regularization was implemented, with strength 2x10^{-4}, instead of weight decay. This L1 regularization strength was chosen to ensure that one could not have all-to-all couplings between layers in the DNN. Reconstructions were performed as in [6]. See Supplementary files for a Matlab variable containing the learned model.

Appendix B: Visualizing Effective Receptive Fields

The effective receptive field is a way to visualize which spins in the visible layer couple to a given spin in one of the hidden layers. We denote the effective receptive field matrix of layer l by r^(l) and the number of spins in layer l by n^(l), with the visible layer corresponding to l = 0. Each column in r^(l) is a vector that encodes the receptive field of a single spin in hidden layer l. It can be computed by convolving the weight matrices W^(l) encoding the weights w_ij between the spins in layers l-1 and l. To compute r^(l), we first set r^(1) = W^(1) and use the recursion relationship r^(l) = r^(l-1) W^(l) for l > 1. Thus, the effective receptive field of a spin is a measure of how much that hidden spin influences the spins in the visible layer.
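The recursion above amounts to a chain of matrix products. A short numpy sketch follows (our own illustration; random weight matrices stand in for the trained model available in the Supplementary files):

```python
import numpy as np

def effective_receptive_fields(weight_matrices):
    """Compute the effective receptive field matrices r^(l) from the weight
    matrices W^(l) via the recursion described in Appendix B:
        r^(1) = W^(1),   r^(l) = r^(l-1) W^(l)  for l > 1.
    weight_matrices[l-1] has shape (n^(l-1), n^(l)); column j of r^(l) shows how
    strongly hidden spin j of layer l is tied to each visible spin."""
    r = [weight_matrices[0]]
    for W in weight_matrices[1:]:
        r.append(r[-1] @ W)
    return r

# Example with the layer sizes used in the paper (1600 -> 400 -> 100 -> 25),
# using random weights as stand-ins for the trained ones.
rng = np.random.default_rng(0)
sizes = [1600, 400, 100, 25]
Ws = [rng.standard_normal((a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
fields = effective_receptive_fields(Ws)
top_layer_field = fields[-1][:, 0].reshape(40, 40)   # one 40x40 receptive field image
print([r.shape for r in fields])                     # [(1600, 400), (1600, 100), (1600, 25)]
```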
ACKNOWLEDGMENTS

PM is grateful to Charles K. Fisher for useful conversations. We are also grateful to Javad Noorbakhsh and Alex Lang for comments on the manuscript. This work was partially supported by a Simons Foundation Investigator Award in the Mathematical Modeling of Living Systems and a Sloan Research Fellowship (to P.M.). DJS was partially supported by NIH Grant K25 GM098875.

[1] Y. Bengio, Foundations and Trends in Machine Learning 2, 1 (2009).
[2] Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, and A. Ng, International Conference on Machine Learning (2012).
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, Advances in Neural Information Processing Systems 25, 1097 (2012).
[4] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, IEEE Signal Processing Magazine 29, 82 (2012).
[5] R. Sarikaya, G. Hinton, and A. Deoras, IEEE Transactions on Audio, Speech and Language Processing (2014).
[6] G. E. Hinton and R. R. Salakhutdinov, Science 313, 504 (2006).
[7] Y. Bengio and Y. LeCun, Large-Scale Kernel Machines 34, 1 (2007).
[8] N. Le Roux and Y. Bengio, Neural Computation 22, 2192 (2010).
[9] N. Le Roux and Y. Bengio, Neural Computation 20, 1631 (2008).
[10] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, Advances in Neural Information Processing Systems 19, 153 (2007).
[11] K. G. Wilson and J. Kogut, Physics Reports 12, 75 (1974).
[12] K. G. Wilson, Reviews of Modern Physics 55, 583 (1983).
[13] J. Cardy, Scaling and Renormalization in Statistical Physics, Vol. 5 (Cambridge University Press, 1996).
[14] L. P. Kadanoff, Statics, Dynamics and Renormalization (World Scientific, 2000).
[15] N. Goldenfeld (1992).
[16] L. P. Kadanoff, A. Houghton, and M. C. Yalabik, Journal of Statistical Physics 14, 171 (1976).
[17] E. Efrati, Z. Wang, A. Kolan, and L. P. Kadanoff, Reviews of Modern Physics 86, 647 (2014).
[18] Y. Bengio, A. Courville, and P. Vincent, IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 1798 (2013).
[19] G. Hinton, S. Osindero, and Y.-W. Teh, Neural Computation 18, 1527 (2006).
[20] R. Salakhutdinov, A. Mnih, and G. Hinton, in Proceedings of the 24th International Conference on Machine Learning (ACM, 2007), pp. 791-798.
[21] H. Larochelle and Y. Bengio, in Proceedings of the 25th International Conference on Machine Learning (ACM, 2008), pp. 536-543.
[22] P. Smolensky (1986).
[23] Y. W. Teh and G. E. Hinton, Advances in Neural Information Processing Systems, 908 (2001).
[24] G. E. Hinton, Neural Computation 14, 1771 (2002).