To appear in Neural Computation, 2020

Resonator Networks outperform optimization methods at solving high-dimensional vector factorization

Spencer J. Kent (1,2), E. Paxon Frady (1,3,5), Friedrich T. Sommer (1,3,5), Bruno A. Olshausen (1,3,4)

(1) Redwood Center for Theoretical Neuroscience, (2) Electrical Engineering and Computer Sciences, (3) Helen Wills Neuroscience Institute, (4) School of Optometry, University of California, Berkeley, Berkeley, CA 94702
(5) Intel Laboratories, Neuromorphic Computing Lab, San Francisco, CA 94111

Abstract

We develop theoretical foundations of Resonator Networks, a new type of recurrent neural network introduced in Frady et al. (2020) to solve a high-dimensional vector factorization problem arising in Vector Symbolic Architectures. Given a composite vector formed by the Hadamard product between a discrete set of high-dimensional vectors, a Resonator Network can efficiently decompose the composite into these factors. We compare the performance of Resonator Networks against optimization-based methods, including Alternating Least Squares and several gradient-based algorithms, showing that Resonator Networks are superior in several important ways. This advantage is achieved by leveraging a combination of nonlinear dynamics and "searching in superposition," by which estimates of the correct solution are formed from a weighted superposition of all possible solutions. While the alternative methods also search in superposition, the dynamics of Resonator Networks allow them to strike a more effective balance between exploring the solution space and exploiting local information to drive the network toward probable solutions. Resonator Networks are not guaranteed to converge, but within a particular regime they almost always do.
In exchange for relaxing the guarantee of global convergence, Resonator Networks are dramatically more effective at finding factorizations than all alternative approaches considered.

1 Introduction

This article is the second in a two-part series on Resonator Networks. Part one (Frady et al., 2020) showed how distributed representations of data structures may be formed using the algebra of Vector Symbolic Architectures, and furthermore that decoding these representations often requires solving a vector factorization problem. Resonator Networks were introduced as a neural solution to this problem and demonstrated on two examples. Here, our primary objective is to establish the theoretical foundations of Resonator Networks and to perform a more comprehensive analysis of their convergence and capacity properties in comparison to optimization-based methods. We limit our analysis to a particular definition of the factorization problem, which may seem somewhat abstract but in fact applies to practical usage of Vector Symbolic Architectures (VSAs). We consider "bipolar" vectors, whose elements are ±1, used in the popular "Multiply, Add, Permute (MAP)" VSA (Gayler, 1998, 2004). These ideas extend to other VSAs, although we leave a detailed analysis to future work. Part one included commentary on the historical context and representational power of VSAs, which we will not cover here. For the purposes of this paper, it is sufficient to stipulate that wherever VSAs are used to encode complex hierarchical data structures, a factorization problem must be solved. By solving this problem, Resonator Networks make the VSA framework scalable to a larger range of problems. The core challenge of factorization is that inferring the factors of a composite object amounts to searching through an enormous space of possible solutions.
Resonator Networks do this, in part, by "searching in superposition," a notion that we make precise in Section 3. There are in fact many ways to search in superposition, and we introduce a number of them in Section 5 as benchmarks for our model and to understand what makes our approach different. A Resonator Network is simply a nonlinear dynamical system designed to solve a particular factorization problem. It is defined by equations (7) and (8), which specify two separate variants of the network. The system is named for the way in which correct factorizations seemingly 'resonate' out of what is initially an uninformative network state. The size of the factorization problem that can be reliably solved, as well as the speed with which solutions are found, characterizes the performance of all the approaches we introduce; in these terms, Resonator Networks are by far the most effective. The main results are as follows:

1. We characterize stability at the correct solution, showing that one variant of Resonator Networks is always stable, while the other has stability properties related to classical Hopfield Networks. We show that Resonator Networks are less stable than Hopfield Networks because of a phenomenon we refer to as percolated noise (Section 6.1).

2. We define "operational capacity" as a metric of factorization performance and use it to compare Resonator Networks against six benchmark algorithms. We find that Resonator Networks have dramatically higher operational capacity (Section 6.2).

3. Through simulation, we determine that operational capacity scales as a quadratic function of vector dimensionality. This quantity is proportional to the number of idealized neurons in a Resonator Network (also Section 6.2).

4. We propose a theory for why Resonator Networks perform well on this problem (Section 6.6).
2 Statement of the problem

We formalize the factorization problem in the following way: X_1, X_2, ..., X_F are sets of vectors called 'codebooks'. The f-th codebook contains D_f 'codevectors' x^(1), ..., x^(D_f):

    X_f := { x^(f)_1, x^(f)_2, ..., x^(f)_{D_f} }   ∀ f = 1, 2, ..., F

and these vectors all live in {−1, 1}^N. A composite vector c is generated by computing the Hadamard product of F vectors, one drawn from each of the codebooks X_1, X_2, ..., X_F:

    c = x^(1)_⋆ ⊙ x^(2)_⋆ ⊙ ⋯ ⊙ x^(F)_⋆,   x^(1)_⋆ ∈ X_1, x^(2)_⋆ ∈ X_2, ..., x^(F)_⋆ ∈ X_F

The factorization problem we wish to study is:

    given c, X_1, X_2, ..., X_F, find x^(1)_⋆ ∈ X_1, x^(2)_⋆ ∈ X_2, ..., x^(F)_⋆ ∈ X_F
    such that c = x^(1)_⋆ ⊙ x^(2)_⋆ ⊙ ⋯ ⊙ x^(F)_⋆    (1)

Our assumption in this paper is that the factorization of c into F codevectors, one from each codebook, is unique. Then, the total number of composite vectors that can be generated by the codebooks is M:

    M := ∏_{f=1}^{F} D_f

The problem involves searching among M possible factorizations to find the one that generates c. We will refer to M as the search space size, and at some level it captures the difficulty of the problem. The problem size is also influenced by N, the dimensionality of each vector. Suppose we were to solve (1) using a brute-force strategy. We might form all possible composite vectors from the sets X_1, X_2, ..., X_F, one at a time, until we generate the vector c, which would indicate the appropriate factorization. Assuming no additional information is available, the number of trials taken to find the correct factorization is a uniform random variable K ~ U{1, M} and thus E[K] = (M + 1)/2.
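To make the setup concrete, here is a minimal NumPy sketch of a problem instance; the variable names and the small values of N, F, and D are our own choices for illustration, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
N, F, D = 1000, 3, 20   # vector dimension, number of factors, codevectors per codebook

# Codebooks: each an N x D matrix whose columns are random bipolar codevectors.
codebooks = [rng.choice([-1, 1], size=(N, D)) for _ in range(F)]

# Pick one codevector per codebook and form the composite via the Hadamard product.
true_idx = [4, 11, 7]
c = np.ones(N, dtype=int)
for X, d in zip(codebooks, true_idx):
    c = c * X[:, d]

M = D ** F                              # search-space size: M = prod_f D_f
mean_brute_force_trials = (M + 1) / 2   # E[K] for uniformly guessing over {1, ..., M}
```

Even at this toy scale, M = 8,000 candidate factorizations hide behind a single N-dimensional vector c.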
If instead we could store all of the composite vectors ahead of time, we could compare them to any new composite vector via a single matrix-vector inner product, which, given our uniqueness assumption, will yield a value of N for the correct factorization and values strictly less than N for all other factorizations. The matrix containing all possible composite vectors requires MN bits to store. The core issue is that M scales very poorly with the number of factors and the number of possible codevectors to be entertained. If F = 4 (4 factors) and D_f = 100 ∀f (100 possible codevectors for each factor), then M = 100,000,000. In the context of Vector Symbolic Architectures, it is common to have N = 10,000. Therefore, the matrix with all possible composite vectors would require ≈ 125 GB to store. We aspire to solve problems of this size (and much larger), which are clearly out of reach for brute-force approaches. Fortunately, they are solvable using Resonator Networks.

3 Factoring by search in superposition

In our problem formulation (1) the factors interact multiplicatively to form c, and this lies at the heart of what makes the problem hard to solve. One way to attempt a solution is to produce an estimate for each factor in turn, alternating between updates to a single factor on its own, with the others held fixed. In addition, it may make sense to simultaneously entertain all of the vectors in each X_f, in some proportion that reflects our current confidence in each one being part of the correct solution. We call this searching in superposition, and it is the general approach we take throughout the paper. What we mean by 'superposition' is that the estimate for the f-th factor, x̂^(f), is given by

    x̂^(f) = g(X_f a_f)

where X_f is a matrix with each column a vector from X_f.
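The brute-force table lookup described above can be sketched for a toy two-factor problem, together with the storage arithmetic for the VSA-scale example (F = 4, D_f = 100, N = 10,000); the code is illustrative and the names are ours:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
N, D = 256, 8                                   # toy sizes; two factors here
codebooks = [rng.choice([-1, 1], size=(N, D)) for _ in range(2)]
c = codebooks[0][:, 3] * codebooks[1][:, 5]     # true factorization: (3, 5)

# Table of all M = D**2 composites; one matrix-vector product scores every candidate.
composites = np.stack([codebooks[0][:, i] * codebooks[1][:, j]
                       for i, j in product(range(D), range(D))])
sims = composites @ c
best = int(np.argmax(sims))
best_pair = (best // D, best % D)   # the correct factorization scores exactly N

# Storage cost of this table at VSA scale: M * N bits.
M_big, N_big = 100 ** 4, 10_000
table_gigabytes = M_big * N_big / 8 / 1e9       # 125.0 GB
```

The single matrix-vector product finds the factorization instantly, but only because the toy table fits in memory; at M = 10^8 composites it does not.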
The vector a_f contains the coefficients that define a linear combination of the elements of X_f, and g(·) is a function from R^N to R^N, which we will call the activation function. In this paper we consider the identity g: x ↦ x, the sign function g: x ↦ sgn(x), and nothing else. Other activation functions are appropriate for other variants of Resonator Networks (for instance, where the vectors are complex-valued), but we leave a discussion of this to future work. 'Search' refers to the method by which we adapt a_f over time. The estimate for each factor leads to an estimate for c, denoted by ĉ:

    ĉ := x̂^(1) ⊙ x̂^(2) ⊙ ⋯ ⊙ x̂^(F) = g(X_1 a_1) ⊙ g(X_2 a_2) ⊙ ⋯ ⊙ g(X_F a_F)    (2)

Suppose g(·) is the identity. Then ĉ becomes a multilinear function of the coefficients a_1, a_2, ..., a_F:

    ĉ = x̂^(1) ⊙ x̂^(2) ⊙ ⋯ ⊙ x̂^(F) = X_1 a_1 ⊙ X_2 a_2 ⊙ ⋯ ⊙ X_F a_F    (3)

While this is a 'nice' relationship in the sense that it is linear in each of the coefficients a_f separately (with the others held fixed), it is unfortunately not convex with respect to the coefficients taken all at once. We can rewrite it as a sum of M different terms, one for each of the possible factorizations of c:

    ĉ = Σ_{d_1, d_2, ..., d_F} (a_1)_{d_1} (a_2)_{d_2} ⋯ (a_F)_{d_F} · x^(1)_{d_1} ⊙ x^(2)_{d_2} ⊙ ⋯ ⊙ x^(F)_{d_F}    (4)

where d_1 ranges from 1 to D_1, d_2 ranges from 1 to D_2, and so on. The term in parentheses is a scalar that weights each of the possible Hadamard products. Our estimate ĉ is, at any given time, purely a superposition of all the possible factorizations. Moreover, the superposition weights (a_1)_{d_1} (a_2)_{d_2} ⋯ (a_F)_{d_F} can be approximately recovered from ĉ alone by computing the cosine similarity between ĉ and the vector x^(1)_{d_1} ⊙ x^(2)_{d_2} ⊙ ⋯ ⊙ x^(F)_{d_F}. The source of 'noise' in this approximation is the fact that x^(1)_{m_1} ⊙ x^(2)_{m_2} ⊙ ⋯ ⊙ x^(F)_{m_F} will have a nonzero inner product with the other vectors in the sum. When the codevectors are uncorrelated and high-dimensional, this noise is quite small: ĉ transparently reflects the proportion with which it contains each of the possible factorizations. When g(·) is the sign function, this property is retained. The vector ĉ is no longer an exact superposition, but the scalar (a_1)_{m_1} (a_2)_{m_2} ⋯ (a_F)_{m_F} can still be decoded from ĉ in the same way: the vector ĉ is still an approximate superposition of all the possible factorizations, with the weight for each determined by the coefficients a_f. This property, that thresholded superpositions retain relative similarity to each of their superimposed components, is heavily relied on throughout Kanerva's and Gayler's work on Vector Symbolic Architectures (Kanerva, 1996; Gayler, 1998).

One last point of notation before introducing our solution to the factorization problem: we define the vector ô^(f) to be the product of the estimates for the other factors:

    ô^(f) := x̂^(1) ⊙ ⋯ ⊙ x̂^(f−1) ⊙ x̂^(f+1) ⊙ ⋯ ⊙ x̂^(F)    (5)

This will come up in each of the algorithms under consideration and simplifies our notation. The notation will often include an explicit dependence on time t, like so: x̂^(f)[t] = g(X_f a_f[t]). Each of the algorithms considered in this paper updates one factor at a time, with the others held fixed; at a given time t, we update the factors in order 1 to F, although this is a somewhat arbitrary choice. Including time dependence in ô^(f), we have

    ô^(f)[t] := x̂^(1)[t+1] ⊙ ⋯ ⊙ x̂^(f−1)[t+1] ⊙ x̂^(f+1)[t] ⊙ ⋯ ⊙ x̂^(F)[t]    (6)

which makes explicit that at the time of updating x̂^(f), the factors 1 to (f−1) have already been updated for this 'iteration' t, while the factors (f+1) to F have yet to be updated.
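The claim that superposition weights survive readout can be checked numerically. Below is a sketch under the identity activation for two factors; the simplex-normalized coefficients, dimensions, and names are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 10_000, 5                                # high dimension keeps crosstalk noise small
codebooks = [rng.choice([-1.0, 1.0], size=(N, D)) for _ in range(2)]

# Simplex-normalized coefficients: a genuine weighted superposition per factor.
coeffs = [rng.random(D) for _ in range(2)]
coeffs = [a / a.sum() for a in coeffs]

# With g = identity, the estimate is multilinear in the coefficients (eq. (3)).
c_hat = (codebooks[0] @ coeffs[0]) * (codebooks[1] @ coeffs[1])

# Read off the superposition weight of the factorization (d1, d2) = (1, 4)
# from c_hat alone, via a normalized inner product with that composite.
d1, d2 = 1, 4
v = codebooks[0][:, d1] * codebooks[1][:, d2]
recovered = float(c_hat @ v) / N
target = coeffs[0][d1] * coeffs[1][d2]          # recovered ~ target, up to crosstalk noise
```

The residual between `recovered` and `target` is exactly the crosstalk from the other M − 1 terms in the sum (4), and it shrinks as N grows.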
4 Resonator Networks

A Resonator Network is a nonlinear dynamical system designed to solve the factorization problem (1), and it can be interpreted as a neural network in which idealized neurons are connected in a very particular way. We define two separate variants of this system, which differ in terms of this pattern of connectivity. A Resonator Network with outer product (OP) weights is defined by

    x̂^(f)[t+1] = sgn( X_f X_f^⊤ ( ô^(f)[t] ⊙ c ) )    (7)

Suppose x̂^(f)[t+1] indicates the state of a population of neurons at time t+1. Each neuron receives an input ô^(f)[t] ⊙ c, modified by synapses modeled as a row of a "weight matrix" X_f X_f^⊤. This "synaptic current" is passed through the activation function sgn(·) in order to determine the output, which is either +1 or −1. Most readers will be familiar with the weight matrix X_f X_f^⊤ as the so-called "outer product" learning rule of classical Hopfield Networks (Hopfield, 1982). This has the nice interpretation of Hebbian learning (Hebb, 1949), in which the strength of synapses between any two neurons (represented by this weight matrix) depends solely on their pairwise statistics over some dataset, in this case the codevectors. Prior to thresholding in (7), the matrix-vector product X_f^⊤ ( ô^(f)[t] ⊙ c ) produces coefficients a_f[t] which, when premultiplied by X_f, generate a vector in the linear subspace spanned by the codevectors (the columns of X_f). This projection does not minimize the squared distance between ô^(f)[t] ⊙ c and the resultant vector. Instead, the matrix (X_f^⊤ X_f)^{−1} X_f^⊤ produces such a projection, the so-called Ordinary Least Squares projection onto R(X_f).
This motivates the second variant of our model, Resonator Networks with Ordinary Least Squares (OLS) weights:

    x̂^(f)[t+1] = sgn( X_f (X_f^⊤ X_f)^{−1} X_f^⊤ ( ô^(f)[t] ⊙ c ) ) =: sgn( X_f X_f^† ( ô^(f)[t] ⊙ c ) )    (8)

where we have used the notation X_f^† to indicate the Moore-Penrose pseudoinverse of the matrix X_f. Hopfield Networks with this type of synapse were first proposed by Personnaz, Guyon, and Dreyfus (Personnaz et al., 1986), who called this the "projection" rule.

If, contrary to what we have defined in (7) and (8), the input to each subpopulation of neurons was x̂^(f)[t], its own previous state, then one would in fact have a ("bipolar") Hopfield Network. In our case, however, rather than being autoassociative, in which x̂^(f)[t+1] is a direct function of x̂^(f)[t], our dynamics are heteroassociative, basing updates on the states of the other factors. This change has a dramatic effect on the network's convergence properties and is also, in some sense, what makes Resonator Networks useful in solving the factorization problem, a fact that we will elaborate on in the following sections. We imagine F separate subpopulations of neurons which evolve together in time, each one responsible for estimating a different factor of c. For now we have just specified this as a discrete-time network in which updates are made one at a time, but it can be extended as a continuous-valued, continuous-time dynamical system along the same lines as was done for Hopfield Networks (Hopfield, 1984). In that case, we can think about these F subpopulations of neurons evolving in a truly parallel way. In discrete time, one has the choice of making 'asynchronous' or 'synchronous' updates to the factors, in a sense analogous to Hopfield Networks. Our formulation of ô^(f)[t] in (6) follows the asynchronous convention, which we find to converge faster.
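A compact NumPy sketch of the dynamics in equations (7) and (8) might look as follows. This is not the authors' released implementation: the function name, the tie-breaking at sgn(0), the stopping rule, and the problem sizes are our own choices:

```python
import numpy as np

def resonator_factor(c, codebooks, weights="OLS", max_iters=200):
    """Sequentially iterate eqs. (7)/(8): x_f <- sgn(W_f (o_f * c)).

    weights="OP" uses W_f = X_f X_f^T; weights="OLS" uses W_f = X_f X_f^+.
    Returns the index of the nearest codevector for each factor.
    """
    F = len(codebooks)
    if weights == "OP":
        Ws = [X @ X.T for X in codebooks]
    else:
        Ws = [X @ np.linalg.pinv(X) for X in codebooks]
    # Initialize with no knowledge of the answer: the bipolarized codebook sum.
    x_hat = [np.where(X.sum(axis=1) >= 0, 1.0, -1.0) for X in codebooks]
    for _ in range(max_iters):
        prev = [x.copy() for x in x_hat]
        for f in range(F):
            o = np.ones_like(c, dtype=float)     # Hadamard product of the other estimates
            for k in range(F):
                if k != f:
                    o = o * x_hat[k]
            u = Ws[f] @ (o * c)
            x_hat[f] = np.where(u >= 0, 1.0, -1.0)   # sgn, with ties sent to +1
        if all((p == x).all() for p, x in zip(prev, x_hat)):
            break                                 # fixed point reached
    return [int(np.argmax(X.T @ x)) for X, x in zip(codebooks, x_hat)]

# Toy problem, well inside the regime where the network almost always converges.
rng = np.random.default_rng(3)
N, F, D = 1500, 3, 8
codebooks = [rng.choice([-1.0, 1.0], size=(N, D)) for _ in range(F)]
c = codebooks[0][:, 2] * codebooks[1][:, 7] * codebooks[2][:, 5]
```

At this size (M = 512, N = 1500) both variants typically settle on the correct factorization within a handful of iterations.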
The formulation given in part one of this series (Frady et al., 2020) employed the synchronous convention for pedagogical reasons, but the distinction between the two vanishes in continuous time, where updates are instantaneous. In practice, we will have to choose an initial state x̂^(f)[0] using no knowledge of the correct codevector x^(f)_⋆ other than the fact that it is one of the elements of the codebook X_f. Therefore, we set x̂^(f)[0] = sgn( Σ_j x^(f)_j ), which, as we have said above, has approximately equal cosine similarity to each term in the sum.

4.1 Difference between OP weights and OLS weights

The difference between outer product weights and Ordinary Least Squares weights lies in (X_f^⊤ X_f)^{−1}, the inverse of the so-called Gram matrix for X_f, which contains the inner products between each pair of codevectors. If the codevectors are orthogonal, the Gram matrix is N I, with I the identity matrix. When N is large (roughly speaking, > 5,000) and the codevectors are chosen randomly i.i.d. from {−1, 1}^N, they will be very nearly orthogonal, making N I a close approximation. Clearly, in this setting, the two variants of Resonator Networks produce nearly the same dynamics. In Section 6.2, we define and measure a performance metric called operational capacity in a setting that does not particularly highlight the difference between the two dynamics, namely the setting where codevectors are nearly orthogonal. In general, however, the dynamics are clearly different. In our experience, applications that contain correlations between codevectors may enjoy higher operational capacity under Ordinary Least Squares weights, but it is hard to say whether this applies in every setting.
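The near-orthogonality argument is easy to check numerically; a small sketch (the dimensions are our choice):

```python
import numpy as np

rng = np.random.default_rng(4)
N, D = 5_000, 32
X = rng.choice([-1.0, 1.0], size=(N, D))

G = X.T @ X                           # Gram matrix of the codebook
off_diag = G - np.diag(np.diag(G))

# Diagonal entries equal N exactly; each off-diagonal entry is a sum of N
# Rademacher variables (standard deviation ~ sqrt(N) ~ 71 here), which is
# tiny relative to N = 5000 -- so G is close to N * I.
max_rel_off_diag = np.abs(off_diag).max() / N
```

Since G ≈ N I, the OLS weight matrix X (X^⊤X)^{−1} X^⊤ ≈ (1/N) X X^⊤, i.e., a rescaled OP weight matrix, and rescaling is invisible to sgn(·).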
One application-relevant consideration is that, because X_f consists of entries that are +1 and −1, the outer product variant of a Resonator Network has an integer-valued weight matrix and can be implemented without any floating-point computation; hardware with large binary and integer arithmetic circuits can simulate this model very quickly. Coupled with the noise tolerance properties we will establish in Section 6.5, this makes Resonator Networks (and, more generally, VSAs) a good fit for emerging device nanotechnologies (Rahimi et al., 2017).

5 The optimization approach

An alternative strategy for solving the factorization problem is to define a loss function which compares the current estimate ĉ := x̂^(1) ⊙ x̂^(2) ⊙ ⋯ ⊙ x̂^(F) with the composite that is to be factored, c, choosing the loss function and a corresponding constraint set so that the global minimizer of this loss over the constraints yields the correct solution to (1). One can then design an algorithm that finds the solution by minimizing this loss. This is the approach taken by optimization theory. Here we consider algorithms that search in superposition, setting x̂^(f) = g(X_f a_f) just as Resonator Networks do, but that instead take the optimization approach. Let the loss function be L(c, ĉ) and the feasible set for each a_f be C_f. We write this as a fairly generic optimization problem:

    minimize over a_1, a_2, ..., a_F:   L( c, g(X_1 a_1) ⊙ g(X_2 a_2) ⊙ ⋯ ⊙ g(X_F a_F) )
    subject to  a_1 ∈ C_1, a_2 ∈ C_2, ..., a_F ∈ C_F    (9)

The character of a particular instance of this problem depends on our choices for L(·,·), g(·), C_1, C_2, ..., C_F, and the structure of the vectors in each codebook. Different algorithms may be appropriate for this problem, depending on these details, and we propose six candidate algorithms in this paper, which we refer to as the "benchmarks".
It is in contrast to the benchmark algorithms that we can more fully understand the performance of Resonator Networks; our argument, which we will develop in the Results section, is that Resonator Networks strike a more natural balance between exploring the high-dimensional state space and using local information to move toward the solution. The benchmark algorithms are briefly introduced in Section 5.1, but they are each discussed at some length in the Appendix, including Table 2, which compiles the dynamics specified by each. We provide implementations of each algorithm in the small software library that accompanies this paper [1].

5.1 Benchmark algorithms

A common thread among the benchmark algorithms is that they take the activation function g(·) to be the identity g: x ↦ x, making ĉ a multilinear function of the coefficients, as we discussed in Section 3. We experimented with other activation functions but found none for which the optimization approach performed better. We consider two straightforward loss functions for comparing c and ĉ. The first is one half the squared Euclidean norm of the error, L: x, y ↦ (1/2) ‖x − y‖²₂, which we will call the squared error for short, and the second is the negative inner product L: x, y ↦ −⟨x, y⟩. The squared error is minimized by ĉ = c, which is also true for the negative inner product when ĉ is constrained to [−1, 1]^N. Both of these loss functions are convex, meaning that L(c, ĉ) is a convex function of each a_f separately [2]. Some of the benchmark algorithms constrain a_f directly, and when that is the case, our focus is on three different convex sets, namely the simplex Δ_{D_f} := { x ∈ R^{D_f} | Σ_i x_i = 1, x_i ≥ 0 ∀i }, the unit ℓ1 ball B_{‖·‖₁}[1] := { x ∈ R^{D_f} | ‖x‖₁ ≤ 1 }, and the closed zero-one hypercube [0, 1]^{D_f}.
Therefore, solving (9) with respect to each a_f separately is a convex optimization problem. In the case of the negative inner product loss L: x, y ↦ −⟨x, y⟩ and simplex constraints C_f = Δ_{D_f}, it is a bona fide linear program. The correct factorization is given by a⋆_1, a⋆_2, ..., a⋆_F such that x̂^(f) = X_f a⋆_f = x^(f)_⋆ ∀f, which we know to be vectors with a single entry 1 and the rest 0; these are the standard basis vectors e_i (where (e_i)_j = 1 if j = i and 0 otherwise). The initial states a_1[0], a_2[0], ..., a_F[0] must be set with no prior knowledge of the correct factorization so, similar to what we do for Resonator Networks, we set each element of a_f[0] to the same value (which in general depends on the constraint set).

[1] https://github.com/spencerkent/resonator-networks
[2] Through the composition of an affine function with a convex function.

5.1.1 Alternating Least Squares

Alternating Least Squares (ALS) locally minimizes the squared error loss in a fairly straightforward way: for each factor, one at a time, it solves a least squares problem for a_f and updates the current state of the estimate ĉ to reflect this new value, then moves on to the next factor and repeats. Formally, the updates given by Alternating Least Squares are:

    a_f[t+1] = argmin over a_f of (1/2) ‖ c − ô^(f)[t] ⊙ (X_f a_f) ‖²₂ = (ξ^⊤ ξ)^{−1} ξ^⊤ c,   where ξ := diag( ô^(f)[t] ) X_f    (10)

Alternating Least Squares is an algorithm that features prominently in the tensor decomposition literature (Kolda and Bader, 2009), but while ALS has been successful for a particular type of tensor decomposition, there are a few details which make our problem different from what is normally studied (see Appendix D). The updates in ALS are quite greedy; they exactly solve each least squares subproblem. It may make sense to more gradually modify the coefficients, a strategy that we turn to next.
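One ALS update (equation (10)) reduces to a standard least squares solve. The sketch below, with a random stand-in for ô^(f)[t] and dimensions of our choosing, checks the closed form against a generic solver:

```python
import numpy as np

rng = np.random.default_rng(5)
N, D = 400, 10
X_f = rng.choice([-1.0, 1.0], size=(N, D))
o_hat = rng.choice([-1.0, 1.0], size=N)     # stand-in for o_hat^(f)[t]
c = rng.choice([-1.0, 1.0], size=N)

# Eq. (10): xi = diag(o_hat) X_f, and a_f = (xi^T xi)^{-1} xi^T c.
xi = o_hat[:, None] * X_f
a_closed = np.linalg.solve(xi.T @ xi, xi.T @ c)

# The same update via a generic least squares solver.
a_lstsq = np.linalg.lstsq(xi, c, rcond=None)[0]
```

Both routes give the same coefficients; the closed form just makes explicit the normal-equations structure that the bipolarized simplification in Section 5.2.2 will exploit.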
5.1.2 Gradient-following algorithms

Another natural strategy for solving (9) is to make updates that incorporate the gradient of L with respect to the coefficients; each of the next five algorithms does this in a particular way (we write out the gradients for both loss functions in Appendix E). The squared error loss is globally minimized by ĉ = c, so one might be tempted to start from some initial values for the coefficients and make gradient updates a_f[t+1] = a_f[t] − η ∇_{a_f} L. In Appendix E.1 we discuss why this does not work well; the difficulty is in being able to guarantee that the loss function is smooth enough that gradient descent iterates with a fixed step size will converge. Instead, the algorithms we apply to the squared error loss utilize a dynamic step size.

Iterative Soft Thresholding: The global minimizers of (9) are maximally sparse, ‖a⋆_f‖₀ = 1. If one aims to minimize the squared error loss while loosely constrained to sparse solutions, it may make sense to solve the problem with Iterative Soft Thresholding (ISTA). The dynamics for ISTA are given by equation (28) in Table 2.

Fast Iterative Soft Thresholding: We also considered Fast Iterative Soft Thresholding (FISTA), an enhancement due to Beck and Teboulle (2009), which utilizes Nesterov's momentum for accelerating first-order methods in order to alleviate the sometimes slow convergence of ISTA (Bredies and Lorenz, 2008). Dynamics for FISTA are given in equation (29).

Projected Gradient Descent: Another benchmark algorithm we considered was Projected Gradient Descent on the negative inner product loss, where updates were projected onto either the simplex or the unit ℓ1 ball (30). A detailed discussion of this approach can be found in Appendix G.

Multiplicative Weights: This is an algorithm that can be applied to either loss function, although we found it worked best on the negative inner product.
It very elegantly enforces a simplex constraint on a_f by maintaining a set of auxiliary variables, the 'weights', which are used to set a_f at each iteration. See equation (31) for the dynamics of Multiplicative Weights, as well as Appendix H.

Map Seeking Circuits: The final algorithm that we considered is called Map Seeking Circuits. Map Seeking Circuits are neural networks designed to solve invariant pattern recognition problems using the principle of superposition. Their dynamics are based on the gradient but differ from what we have introduced so far; see equation (32) and Appendix I.

5.2 Contrasting Resonator Networks with the benchmarks

5.2.1 Convergence of the benchmarks

A remarkable fact about the benchmark algorithms is that each one converges for all initial conditions, which we directly prove, or refer to results proving, in the Appendix. That is, given any starting coefficients a_f[0], their dynamics reach fixed points which are local minimizers of the loss function. In some sense, this property is an immediate consequence of treating factorization as an optimization problem; the algorithms we chose as benchmarks were designed this way. Convergence to a local minimizer is a desirable property, but unfortunately the fundamental non-convexity of the optimization problem implies that this may not guarantee good local minima in practice. In Section 6 we establish a standardized setting in which we measure how likely it is that these local minima are actually global minima. We find that as long as M, the size of the search space, is small enough, each of these algorithms can find the global minimizers reliably. The point at which the problem becomes too large to reliably solve is what we call the operational capacity of the algorithm, and it is a main point of comparison with Resonator Networks.
5.2.2 An algorithmic interpretation of Resonator Networks

The benchmark algorithms generate estimates for the factors, x̂^(f)[t], that move through the interior of the [−1, 1] hypercube. Resonator Networks, on the other hand, do not. The sgn(·) function 'bipolarizes' inputs to the nearest vertex of the hypercube, and this highly nonlinear function, which changes not only the length but also the angle of an input vector, is key. We know the solutions x^(f)_⋆ exist at vertices of the hypercube, and these points are very special geometrically in the sense that in high dimensions, most of the mass of [−1, 1]^N is concentrated relatively far from the vertices; this is a fact we will not prove here, but it is based on standard results from the study of concentration inequalities (Boucheron et al., 2013). Our motivation for using the sgn(·) activation function is that moving through the interior of the hypercube while searching for a factorization is unwise, a conjecture for which we will provide some empirical support in the Results section.

One useful interpretation of OLS Resonator Network dynamics is that the network computes a bipolarized version of Alternating Least Squares. Suppose we were to take the dynamics specified in (10) for making ALS updates to a_f[t+1], but we also bipolarize the vector x̂^(f)[t+1] at the end of each step. When each x̂^(f)[t+1] is bipolar, the vector ô^(f)[t] is bipolar, and we can simplify (ξ^⊤ ξ)^{−1} ξ^⊤:

    ô^(f)[t] ∈ {−1, 1}^N  ⟹  (ξ^⊤ ξ)^{−1} ξ^⊤ = ( X_f^⊤ diag( ô^(f)[t] )² X_f )^{−1} X_f^⊤ diag( ô^(f)[t] )
                                             = ( X_f^⊤ X_f )^{−1} X_f^⊤ diag( ô^(f)[t] )
                                             = X_f^† diag( ô^(f)[t] )    (11)

Now a_f[t+1] = X_f^† ( ô^(f)[t] ⊙ c ), which one can see from equation (8) is precisely the update used by Resonator Networks with OLS weights.
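The simplification in equation (11) can be verified numerically: when ô^(f)[t] is bipolar, diag(ô)² = I and the least squares operator collapses to X_f^† diag(ô). A sketch with random data (sizes are ours):

```python
import numpy as np

rng = np.random.default_rng(7)
N, D = 300, 6
X_f = rng.choice([-1.0, 1.0], size=(N, D))
o_hat = rng.choice([-1.0, 1.0], size=N)     # bipolar, so diag(o_hat)^2 = I

xi = o_hat[:, None] * X_f                   # xi = diag(o_hat) X_f
lhs = np.linalg.solve(xi.T @ xi, xi.T)      # (xi^T xi)^{-1} xi^T
rhs = np.linalg.pinv(X_f) * o_hat[None, :]  # X_f^+ diag(o_hat)
agree = np.allclose(lhs, rhs)
```

The identity fails as soon as ô^(f)[t] leaves the hypercube vertices, which is why this reading of the dynamics only holds with the sgn(·) activation in place.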
An important word of caution on this observation: it is somewhat of a misnomer to call this algorithm Bipolarized Alternating Least Squares, because at each iteration it is not solving a least squares problem, and this conceals a profound difference. To set a_f[t+1] = X_f^† ( ô^(f)[t] ⊙ c ) is to take the term g(X_f a_f[t]) present in the loss function and treat the activation function g(·) as if it were linear, which it clearly is not. These updates are not computing a least squares solution at each step. We actually lose the guarantee of global convergence that comes with Alternating Least Squares, but this is an exchange well worth making, as we will show in the Results section.

Unlike Hopfield Networks, which have a Lyapunov function certifying their global asymptotic stability, no such function (that we know of) exists for a Resonator Network. While ĉ = c is always a fixed point of the OLS dynamics, a network initialized to a random state is not guaranteed to converge. We have observed trajectories that collapse to limit cycles, and seemingly chaotic trajectories that do not converge in any reasonable time. One a priori indication that this is the case comes from a simple rewriting of two-factor Resonator Network dynamics that concatenates the states for each factor into a single state space. To make the transformation exact, we appeal to the continuous-time version of Resonator Networks, which, just like Hopfield Networks, defines dynamics in terms of time derivatives of the pre-activation state: u̇^(f)(t) = X_f X_f^† ( ô^(f)(t) ⊙ c ), with x̂^(f)(t) = g( u^(f)(t) ). We write down the continuous-time dynamics à la autoassociative Hopfield Networks:

    [ u̇^(1)(t) ]   [ 0                    X_1 X_1^† diag(c) ] [ x̂^(1)(t) ]
    [ u̇^(2)(t) ] = [ X_2 X_2^† diag(c)    0                 ] [ x̂^(2)(t) ]
One can see that the weight matrix is non-symmetric, which has a simple but important consequence: autoassociative networks with non-symmetric weights cannot be guaranteed to converge in general. This result, first established by Cohen and Grossberg (Cohen and Grossberg, 1983) and then studied throughout the Hopfield Network literature, is not quite as strong as it may sound, in the sense that symmetry is a sufficient, but not necessary, condition for convergence. One can design a globally-convergent autoassociative network with asymmetric weights (Xu et al., 1996), and moreover, adding a degree of asymmetry has been advocated as a technique to reduce the influence of spurious fixed points (Hertz et al., 1986; Singh et al., 1995; Chengxiang et al., 2000). Resonator Networks have a large and practical regime of operation, where $M$ (the problem size) is small enough, in which non-converging trajectories are extremely rare. It is simple to deal with these events, making the model still useful in practice despite the lack of a convergence guarantee. It has also been argued in several places (see Van Vreeswijk and Sompolinsky (1996), for example) that cyclic or chaotic trajectories may be useful to a neural system, including in cases where there are multiple plausible states to entertain. This is just to say that we feel the lack of a convergence guarantee is not a critical weakness of our model, but rather an interesting and potentially useful characteristic. We attempted many different modifications to the model's dynamics which would provably cause it to converge, but these changes always hindered its ability to solve the factorization problem. We emphasize that unlike all of the models in Section 5.1, a Resonator Network is not descending a loss function.
Rather, it makes use of the fact that:

• Each iteration is a bipolarized ALS update; it approximately moves the state towards the least squares solution for each factor.

• The correct solution is a fixed point (guaranteed for OLS weights, highly likely for OP weights).

• There may be a sizeable 'basin of attraction' around this fixed point, into which the iterates help us descend.

• The number of spurious fixed points (which do not give the correct factorization) is relatively small.

This last point is really what distinguishes Resonator Networks from the benchmarks, which we will establish in Section 6.6.

6 Results

We present a characterization of Resonator Networks along three main directions. The first direction is the stability of the solutions $x^{(f)}_\star$, which we relate to the stability of classical Hopfield networks. The second is a fundamental measure of factorization capability we call the "operational capacity". The third is the speed with which factorizations are found. We argue that the marked difference in factorization performance between our model and the benchmark algorithms lies in the relative scarcity of spurious fixed points enjoyed by Resonator Network dynamics. We summarize the main results in bold throughout this section. In each of the simulations we choose codevectors randomly i.i.d. from the discrete uniform distribution over the vertices of the hypercube: each element of each codevector is a Rademacher random variable (assuming the value $-1$ with probability $0.5$ and $+1$ with probability $0.5$). We generate $c$ by choosing one vector at random from each of the $F$ codebooks and then computing the Hadamard product among these vectors.
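The simulation setup above can be sketched in a few lines; the function name, seeding convention, and $(N, D)$ codebook layout are our own.

```python
import numpy as np

def make_problem(N, D, F, seed=None):
    """Sample F random codebooks and a composite vector c. Each codevector
    entry is a Rademacher variable (+1 or -1 with probability 1/2); c is the
    Hadamard product of one randomly chosen codevector per codebook."""
    rng = np.random.default_rng(seed)
    codebooks = [rng.choice([-1, 1], size=(N, D)) for _ in range(F)]
    true_idx = [int(rng.integers(D)) for _ in range(F)]   # ground-truth choices
    c = np.prod([X[:, i] for X, i in zip(codebooks, true_idx)], axis=0)
    return codebooks, true_idx, c
```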
The reason we choose vectors randomly is that it makes the analysis of performance somewhat easier and more standardized, and it is the setting in which most of the well-known results on Hopfield Network capacity apply; we will make a few connections to these results. It is also the setting in which we typically use the Multiply, Add, Permute VSA architecture (Gayler, 2004), and therefore these results on random vectors are immediately applicable to a variety of existing works.

6.1 Stable-solution capacity with outer product weights

Suppose $\hat{x}^{(f)}[0] = x^{(f)}_\star$ for all $f$ (we initialize it to the correct factorization; this will also apply to any $t$ at which the algorithm comes upon $x^{(f)}_\star$ on its own). What is the probability that the state stays there, i.e. that the correct factorization is a fixed point of the dynamics? This is the basis of what researchers have called the "capacity" of Hopfield Networks, where $x^{(f)}_\star$ are patterns that the network has been trained to store. We choose to call it the "stable-solution capacity" in order to distinguish it from operational capacity, which we define in Section 6.2. We first note that this analysis is necessary only for Resonator Networks with outer product weights; Ordinary Least Squares weights guarantee that the solutions are stable, and this is one of that variant's desirable properties. If $\hat{x}^{(f)}[0] = x^{(f)}_\star$ for all $f$, then factor 1 in a Resonator Network "sees" an input $x^{(1)}_\star$ at time $t = 1$. For OLS weights, the vector $X_1 X_1^\dagger x^{(1)}_\star$ is exactly $x^{(1)}_\star$ by the definition of orthogonal projection. The same being true for all subsequent factors, this means that for OLS weights, $x^{(f)}_\star$ is always a fixed point.
For a Resonator Network with outer product weights, we must examine the vector $\Gamma := X_f X_f^\top \big(\hat{o}^{(f)}[0] \odot c\big)$ at each $f$, and changing from the pseudoinverse $X_f^\dagger$ to the transpose $X_f^\top$ makes the situation significantly more complicated. At issue is the probability that $\Gamma_i$ has a sign different from $x^{(f)}_{\star i}$, i.e. that there is a bitflip in any particular component of the updated state. In general one may not care whether the state is completely stable; it may be tolerable that the dynamics flip some small fraction of the bits of $x^{(f)}_\star$, as long as they do not move the state too far away from $x^{(f)}_\star$. Amit, Gutfreund, and Sompolinsky (Amit et al., 1985, 1987) established that in Hopfield Networks, an avalanche phenomenon occurs where bitflips accumulate and the network becomes essentially useless for values of $D_f > 0.138N$, at which point the approximate bitflip probability is $0.0036$. While we do not attempt any of this complicated analysis on Resonator Networks, we do derive an expression for the bitflip probability of any particular factor that accounts for bitflips which "percolate" from factor to factor through the vector $\hat{o}^{(f)}[0] \odot c$. We start by noting that for factor 1, this bitflip probability is the same as in a Hopfield network. Readers familiar with the literature on Hopfield Networks will know that with $N$ and $D_f$ reasonably large (approximately $N \geq 1{,}000$ and $D_f \geq 50$), $\Gamma_i$ can be well-approximated by a Gaussian with mean $x^{(f)}_{\star i}(N + D_f - 1)$ and variance $(N-1)(D_f-1)$; see Appendix J for a simple derivation. This is summarized as the Hopfield bitflip probability $h_f$:

$$h_f := \Pr\left[\hat{x}^{(f)}[1]_i \neq x^{(f)}_{\star i}\right] = \Phi\left(\frac{-N - D_f + 1}{\sqrt{(N-1)(D_f-1)}}\right) \qquad (12)$$

where $\Phi$ is the cumulative distribution function of the standard Normal distribution.
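Equation (12) is easy to evaluate numerically; the sketch below (the function name is ours) uses the standard Normal CDF from the Python standard library.

```python
from math import sqrt
from statistics import NormalDist

def hopfield_bitflip_probability(N, D_f):
    """Gaussian approximation of equation (12):
    h_f = Phi((-N - D_f + 1) / sqrt((N - 1) * (D_f - 1))).
    The approximation is good for roughly N >= 1000 and D_f >= 50."""
    return NormalDist().cdf((-N - D_f + 1) / sqrt((N - 1) * (D_f - 1)))
```

For example, at $D_f = N$ the argument approaches $-2$, recovering the maximum bitflip probability of about $0.02275$ for a Hopfield network with self-connections.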
Hopfield Networks are often specified with the diagonal of $X_f X_f^\top$ set to all zeros (having "no self-connections"), in which case the bitflip probability is $\Phi\big(\frac{-N}{\sqrt{(N-1)(D_f-1)}}\big)$. For large $N$ and $D_f$ this is often simplified to $\Phi\big(-\sqrt{N/D_f}\big)$, which may be the expression most familiar to readers. Keeping the diagonal of $X_f X_f^\top$ makes the codevectors more stable (see Appendix J), and while there are some arguments in favor of eliminating it, we have found Resonator Networks to exhibit better performance by keeping these terms. In Appendix J we derive the bitflip probability for an arbitrary factor in a Resonator Network with outer product weights. This probability depends on whether a component of the state has already been flipped by the previous $f - 1$ factors, which is what we call percolated noise passed between the factors, and which increases the bitflip probability. The four relevant probabilities are:

$$r_f := \Pr\left[\hat{x}^{(f)}[1]_i \neq x^{(f)}_{\star i}\right] \qquad (13)$$

$$n_f := \Pr\left[\big(\hat{o}^{(f+1)}[0] \odot c\big)_i \neq x^{(f+1)}_{\star i}\right] \qquad (14)$$

$$r'_f := \Pr\left[\hat{x}^{(f)}[1]_i \neq x^{(f)}_{\star i} \,\middle|\, \big(\hat{o}^{(f)}[0] \odot c\big)_i = x^{(f)}_{\star i}\right] \qquad (15)$$

$$r''_f := \Pr\left[\hat{x}^{(f)}[1]_i \neq x^{(f)}_{\star i} \,\middle|\, \big(\hat{o}^{(f)}[0] \odot c\big)_i \neq x^{(f)}_{\star i}\right] \qquad (16)$$

Equation (13) is the probability of a bitflip compared to the correct value, the Resonator bitflip probability. Equation (14) gives the probability that the next factor will see a net bitflip, a bitflip which has percolated through the previous factors. Equations (15) and (16) give the probability of a bitflip conditioned on whether or not this factor sees a net bitflip, and they are different.
From a straightforward conditioning argument,

$$r_f = r'_f (1 - n_{f-1}) + r''_f\, n_{f-1} \qquad (17)$$

and also

$$n_f = r'_f (1 - n_{f-1}) + (1 - r''_f)\, n_{f-1} \qquad (18)$$

We show via straightforward algebra in Appendix J that the conditional probabilities $r'_f$ and $r''_f$ can be written recursively in terms of $n_f$:

$$r'_f = \Phi\left(\frac{-N(1 - 2n_{f-1}) - (D_f - 1)}{\sqrt{(N-1)(D_f-1)}}\right) \qquad (19)$$

$$r''_f = \Phi\left(\frac{-N(1 - 2n_{f-1}) + (D_f - 1)}{\sqrt{(N-1)(D_f-1)}}\right) \qquad (20)$$

The Resonator bitflip probability $r_f$ has to be computed recursively using these expressions. The base case is $n_0 = 0$, and this is sufficient to compute all the other probabilities; in particular, it implies that $r_1 = h_1 = \Phi\big(\frac{-N - D_1 + 1}{\sqrt{(N-1)(D_1-1)}}\big)$, as we previously indicated. We can verify these equations in simulation, and the agreement is very good; see Figure 14 in the Appendix, which measures $r_f$. The main analytical result in this section is the sequence of equations (17)-(20), which allow one to compute the bitflip probabilities for each factor in an outer product Resonator Network. The fact that $r_f$ in general must be split between the two conditional probabilities, and that there is a dependence on $n_{f-1}$, is what makes it different, for all but the first factor, from the bitflip probability of a Hopfield Network (compare eqs. (19) and (20) against eq. (12)). But how much different? We are interested in the quantity $r_f - h_f$. Here is a simple intuition for what this is capturing: suppose there are $F$ Hopfield Networks all evolving under their own dynamics; they are running simultaneously but not interacting in any way. At time $t = 0$, the bitflip probabilities $h_1, h_2, \ldots, h_F$ for the networks are all the same; there is nothing special about any particular one. A Resonator Network, however, is like a set of $F$ Hopfield networks that have been wired up to receive input $\hat{o}^{(f)}[t] \odot c$, which reflects the state of the other factors.
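The recursion in equations (17)-(20) can be iterated directly; the sketch below assumes, for simplicity, that all codebooks share the same size $D$ (that simplification and the function name are ours).

```python
from math import sqrt
from statistics import NormalDist

def resonator_bitflip_probabilities(N, D, F):
    """Iterate equations (17)-(20) with base case n_0 = 0, assuming all
    codebooks have the same size D. Returns [r_1, ..., r_F]."""
    Phi = NormalDist().cdf
    s = sqrt((N - 1) * (D - 1))
    n_prev = 0.0                 # n_0 = 0: factor 1 sees no percolated noise
    r = []
    for _ in range(F):
        r_p = Phi((-N * (1 - 2 * n_prev) - (D - 1)) / s)   # eq. (19)
        r_pp = Phi((-N * (1 - 2 * n_prev) + (D - 1)) / s)  # eq. (20)
        r.append(r_p * (1 - n_prev) + r_pp * n_prev)       # eq. (17)
        n_prev = r_p * (1 - n_prev) + (1 - r_pp) * n_prev  # eq. (18)
    return r
```

With $n_0 = 0$ the first entry reduces to the Hopfield bitflip probability $h_1$ of equation (12), and later factors accumulate percolated noise.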
The networks are no longer independent. In particular, a bitflip in factor $f$ gets passed on to factors $f+1$, $f+2$, and so on. This affects the bitflip probability of these other factors, and the magnitude of this effect, which we call percolated noise, is measured by $r_f - h_f$.

Figure 1: Extra bitflip probability $r_f - h_f$ due to percolated noise. In the limit of large $F$, there appears to be a phase change at $D_f/N = 0.056$. Below this value Resonator Networks are just as stable as Hopfield Networks, but above this value they are strictly less stable (by the amount $r_f - h_f$).

Let us first note that for a Hopfield network with self-connections, the maximum bitflip probability is $0.02275$, which occurs at $D_f = N$. The ratio $D_f/N$ is what determines the bitflip probability; please see Appendix J for an explanation. Percolated noise is measured by the difference $r_f - h_f$, which we plot in Figure 1. Part (a) shows just five factors, illustrating that $r_1 = h_1$, but that $r_f \geq h_f$ in general. To see if there is some limiting behavior, we simulated $100$ and $10{,}000$ factors; the differences $r_f - h_f$ are also shown in Figure 1. In the limit of large $F$ there appears to be a phase change in residual bitflip probability that occurs at $D_f/N = 0.056$. In the Hopfield Network literature this is a very important number: it gives the point at which the codevectors transition away from being global minimizers of the Hopfield Network energy function. When $D_f/N$ falls in between $0.056$ and $0.138$, the codevectors are only local minimizers, and there exist spin-glass states that have lower energy. We do not further explore this phase-change phenomenon, but leave the (in all likelihood, highly technical) analysis to future work. In conclusion, the second major result of this section is that we have shown, via simulation, that for $D_f/N \leq 0.056$, the stability of a Resonator Network with outer product weights is the same as the stability of a Hopfield Network. For $D_f/N > 0.056$, percolated noise between the factors causes the Resonator Network to be strictly less stable than a Hopfield Network.

6.2 Operational capacity

We now define a new notion of capacity that is more appropriate to the factorization problem. This performance measure, called the operational capacity, gives an expression for the maximum size of factorization problem that can be solved with high probability. This maximum problem size, which we denote by $M_{\max}$, varies as a function of the number of elements in each vector, $N$, and the number of factors, $F$. It gives a very practical characterization of performance, and will form the basis of our comparison between Resonator Networks and the benchmark algorithms we introduced in Section 5.1. When the problem size $M$ is below the operational capacity of the algorithm, one can be quite sure that the correct factorization will be efficiently found.

Definition 1. The $\{p, k\}$ operational capacity of a factorization algorithm that solves (1) is the largest search space size $M_{\max}$ such that the algorithm, when limited to a maximum number of iterations $k$, gives a total accuracy $\geq p$.

We now define what we mean by total accuracy. Each algorithm we have introduced attempts to solve the factorization problem (1) by initializing the state $\hat{x}^{(f)}[0]$ and letting the dynamics evolve until some termination criterion is met. It is possible that the final state $\hat{x}^{(f)}[\infty]$ may not equal the correct factors $x^{(f)}_\star$ at each and every component, but we can 'decode' each $\hat{x}^{(f)}[\infty]$ by looking for its nearest neighbor (with respect to Hamming distance or cosine similarity) among the vectors in its respective codebook $X_f$.
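This nearest-neighbor clean-up can be sketched as follows; the function name is our own, and we use cosine similarity (which, for bipolar vectors, selects the same codevector as minimizing Hamming distance).

```python
import numpy as np

def clean_up(x_hat, codebook):
    """Nearest-neighbor decoding of a final state: return the index of the
    codevector most similar to x_hat. codebook has shape (N, D_f)."""
    sims = codebook.T @ x_hat
    sims = sims / (np.linalg.norm(codebook, axis=0) * np.linalg.norm(x_hat))
    return int(np.argmax(sims))
```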
This distance computation involves only $D_f$ vectors, rather than $M$, which was what we encountered in one of the brute-force strategies of Section 2. Compared to the other computations involved in finding the correct factorization out of $M$ total possibilities, this last step of decoding has a very small cost, and we always 'clean up' the final state $\hat{x}^{(f)}[\infty]$ using its nearest neighbor in the codebook. We define the total accuracy to be the sum of accuracies for inferring each factor, which is $1/F$ if the nearest neighbor to $\hat{x}^{(f)}$ is $x^{(f)}_\star$ and $0$ otherwise. For instance, correctly inferring one of three total factors gives a total accuracy of $1/3$, two of three gives $2/3$, and three of three gives $1$. Analytically deriving the expected total accuracy appears to be quite challenging, especially for a Resonator Network, because it requires that we essentially predict how the nonlinear dynamics will evolve over time. There may be a region around each $x^{(f)}_\star$ such that states in this region rapidly converge to $x^{(f)}_\star$, the so-called basin of attraction, but our initial estimate $\hat{x}^{(f)}[0]$ is likely not in the basin of attraction, and it is hard to predict when, if ever, the dynamics will enter this region. Even for Hopfield Networks, which obey much simpler dynamics than a Resonator Network, it is known that so-called "frozen noise" builds up in the network state, making the shapes of the basins highly anisotropic and difficult to analyze (Amari and Maginu, 1988). Essentially all of the analytical results on Hopfield Networks consider only the stability of $x^{(f)}_\star$ as a (very poor) proxy for how the model behaves when it is initialized to other states. This less useful notion of capacity, the stable-solution capacity, was what we examined in the previous section.

Figure 2: Accuracy as a function of $M$ for a Resonator Network with outer product weights.
Three factors ($F = 3$), average over $5{,}000$ random trials.

We can, however, estimate the total accuracy by simulating many factorization problems, recording the fraction of factors that were correctly inferred over many, many trials. We remind the reader that our results in this paper pertain to factorization of randomly-drawn vectors which bear no particular correlational structure, but the notions of total accuracy and operational capacity would be relevant, and specific, to factorization of non-random vectors as well. We first note that for fixed vector dimensionality $N$, the empirical mean of the total accuracy depends strongly on $M$, the search space size. We can see this clearly in Figure 2. We show this phenomenon for a Resonator Network with outer product weights, but this general behavior holds for all of the algorithms under consideration: one can always make the search space large enough that expected total accuracy goes to zero. Our notion of operational capacity is concerned with the $M$ that causes expected total accuracy to drop below some value $p$. We see here that there is a range of values of $M$ for which the expected total accuracy is $1.0$, beyond which this ceases to be the case. For all values of $M$ within this range, the algorithm essentially always solves the factorization problem. In this paper we estimate operational capacity with $p = 0.99$ ($\geq 99\%$ of factors inferred correctly) and $k = 0.001M$ (the model can search over at most $1/1{,}000$ of the entire search space). These choices are largely practical: $\geq 99\%$ accuracy makes the model very reliable in practice, and this operating point can be estimated from a reasonable number ($3{,}000$ to $5{,}000$) of random trials. Setting $k = 0.001M$ allows the number of iterations to scale with the size of the problem, but restricts the algorithm to only consider a small fraction of the possible factorizations.
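The total accuracy measure used throughout the simulations can be stated in a few lines (the helper name is ours):

```python
def total_accuracy(decoded_idx, true_idx):
    """Total accuracy as defined above: each of the F factors contributes
    1/F if its decoded codebook index matches the ground truth, else 0."""
    matches = sum(1 for d, t in zip(decoded_idx, true_idx) if d == t)
    return matches / len(true_idx)
```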
While a Resonator Network has no guarantee of convergence, it almost always converges in far fewer than $0.001M$ iterations, so long as we stay in this high-accuracy regime. Operational capacity is in general a function of $N$ and $F$, which we will discuss shortly.

Figure 3: Operational capacity is dramatically higher for Resonator Networks (blue and red above) than for any of the benchmark algorithms. These points represent the size of factorization problem that can be solved reliably. Shown is operational capacity for $F = 3$ factors. The gap is similarly large for other $F$; see the plot for $F = 4$ in the Appendix.

6.2.1 Resonator Networks have superior operational capacity

We estimated the operational capacity of the benchmark algorithms in addition to the two variants of Resonator Networks. Figure 3 shows the operational capacity estimated from several thousand random trials, where we display $M_{\max}$ as a function of $N$ for problems with three factors. One can see that the operational capacity of Resonator Networks is roughly two orders of magnitude greater than the operational capacity of the other algorithms. Each of the benchmark algorithms has a slightly different operational capacity (due to the fact that they each have different dynamics and will, in general, find different solutions), but they are all similarly poor compared to the two variants of Resonator Networks. See a similar plot for $F = 4$ in Appendix B. As $N$ increases, the performance difference between the two variants of Resonator Networks starts to disappear, ostensibly due to the fact that $X_f X_f^\dagger \approx X_f X_f^\top$. The two variants are different in general, but the simulations in this paper do not particularly highlight the difference between them.
Except for Alternating Least Squares, each of the benchmark algorithms has at least one hyperparameter that must be chosen; we simulated many thousands of random trials with a variety of hyperparameter settings for each algorithm and chose the hyperparameter values that performed best on average. We list these values for each of the algorithms in the Appendix. All of the benchmark algorithms converge on their own, and the tunable stepsizes make a comparison of the number of iterations non-standardized, so we did not impose a maximum number of iterations on these algorithms: the points shown represent the best the benchmark algorithms can do, even when not restricted to a maximum number of iterations.

6.2.2 Operational capacity scales quadratically in N

We carefully measured the operational capacity of Resonator Networks in search of a relationship between $M_{\max}$ and $N$. We focused on Resonator Networks with outer product weights; for $N \approx 5000$ and larger, randomly-chosen codevectors are nearly orthogonal and capacity is approximately the same for OLS weights. We reiterate that operational capacity is specific to the parameters $p$ and $k$: $p$ is the threshold for total accuracy and $k$ is the maximum number of iterations the algorithm is allowed to take (refer to Definition 1). Here we report operational capacity for $p = 0.99$ and $k = 0.001M$ on randomly-sampled codevectors. The operational capacity is specific to these choices, which are practical for Vector Symbolic Architectures. Our simulations revealed that, empirically, Resonator Network operational capacity $M_{\max}$ scales as a quadratic function of $N$, which we illustrate in Figure 4. The points in this figure are estimated from many thousands of random trials, over a range of values for $F$ and $N$.
In part (a) we show operational capacity separately for each $F$ from $2$ to $7$, with the drawn curves indicating the least-squares quadratic fit to the measured points. In part (b) we put these points on the same plot, following a logarithmic transformation of each axis, in order to illustrate that capacity also varies as a function of $F$. Appendix B provides some additional commentary on this topic, including some speculation on a scaling law that combines $F$ and $N$. The parameters of this particular combined scaling are estimated from simulation and not derived analytically; therefore they may deserve additional scrutiny, and we do not focus on them here. The main message of this section is that capacity scales quadratically in $N$, regardless of how many factors are used.

Figure 4: Operational capacity of Resonator Networks with OP weights. (a) $M_{\max}$ scales quadratically in $N$. Red points are measured from simulation; black curves are the least-squares quadratic fits. Parameters of the fits are included in Appendix B. (b) $M_{\max}$ varies as a function of both $F$ and $N$. Over the measured range of $N$, capacity is highest for $F = 3$ and $F = 4$. Data for $F = 2$ was omitted to better convey the trend for $F = 3$ and higher, but see Appendix B for the full picture.

The curves in Figure 4 are constructive in the following sense: given a fixed $N$, they indicate the largest factorization problem that can be solved reliably. Conversely, and this is often the case in VSAs, the problem size $M$ is predetermined, while $N$ is variable; in this case we know how large one must make $N$. We include in the official software implementation that accompanies this paper (https://github.com/spencerkent/resonator-networks) a text file with all of the measured operational capacities. Quadratic scaling means that one can aspire to solve very large factorization problems, so long as one can build a Resonator Network with big enough $N$. We attempted to estimate capacity for even larger values of $N$ than we report in Figure 4, but this was beyond the capability of our current computational resources. A useful contribution of follow-on work would be to leverage high-performance computing to measure some of these values. Applications of Vector Symbolic Architectures typically use $N \leq 10{,}000$, but there are other reasons one might attempt to push Resonator Networks further. Early work on Hopfield Networks proposed a technique for storing solutions to the Travelling Salesman Problem as fixed points of the model's dynamics (Hopfield and Tank, 1985), and this became part of a larger approach of using nonlinear dynamical systems to solve hard search problems. We do not claim that any particular search problem, other than the factorization we have defined in (1), can be solved by Resonator Networks. Supposing, however, that some other hard problem can be cast in the form of (1), the quadratic scaling of operational capacity makes this a potentially powerful tool.

Capacity is highest when the codebooks $X_f$ each have the same number of codevectors ($D_1 = D_2 = \ldots = D_F = \sqrt[F]{M}$), and this was the case for the operational capacity results we have shown so far. We chose this in order to have a simple standard for comparison among the different algorithms, but in general it is possible that the codebooks are unbalanced, so that we have the same $M = \prod_f D_f$ but $D_1 \neq D_2 \neq \ldots \neq D_F$. In this case, capacity is lower than for balanced codebooks. We found that the most meaningful way to measure the degree of balance between codebooks was by the ratio of the smallest codebook to the largest codebook:

$$\xi := \min_f D_f \,/\, \max_f D_f \qquad (21)$$

For $\xi \geq 0.2$ we found that the effect on $M_{\max}$ was simply an additive factor which can be absorbed into a (slightly smaller) y-intercept $a$ for the quadratic fit. For extreme values of $\xi$, where one codebook is, for instance, $10$ or $20$ times larger than another, all three parameters $a$, $b$, and $c$ are affected, sometimes significantly. Scaling is still quadratic, but the actual capacity values may be significantly reduced.

Our result, a measured operational capacity indicating an approximately quadratic relationship between $M_{\max}$ and $N$, is an important characterization of Resonator Networks. It suggests that our framework scales to very large factorization problems and serves as a guideline for implementation. Our attempts to analytically derive this result were stymied by the limits of the toolbox of nonlinear dynamical systems theory. Operational capacity involves the probability that this system, when initialized to an effectively random state, converges to a particular set of fixed points. No results from the study of nonlinear dynamical systems, that we are aware of, allow us to derive such a strong statement about Resonator Networks. Still, the scaling in Figure 4 is fairly suggestive of some underlying law, and we are hopeful that a theoretical explanation exists, waiting to be discovered.

6.3 Search speed

If a Resonator Network is not consistently descending an energy function, is it just aimlessly wandering around the space, trying every possible factorization until it finds the correct one? Figure 5 shows that it is not. We plot the mean number of iterations over $5{,}000$ random trials, as a fraction of $M$, the search space size. This particular plot is based on a Resonator Network with outer product weights and $F = 3$. In the high-performance regime where $M$ is below operational capacity, the number of iterations is far less than the $0.001M$ cutoff we used in the simulations of Section 6.2: the algorithm only ever considers a tiny fraction of the possible factorizations before it finds the solution.

Figure 5: Iterations until convergence for a Resonator Network with outer product weights and $F = 3$. The number of iterations is very small compared to the size of the search space.

Section 6.2.1 compared the operational capacity of different algorithms and showed that, compared to the benchmarks, Resonator Networks can solve much larger factorization problems. This is in the sense that their dynamics eventually converge (with high probability) on the correct factorization while the dynamics of the other algorithms converge on spurious factorizations. This result, however, does not directly demonstrate the relative speed with which factorizations are found, in terms of either the number of iterations or the amount of time to convergence. We set up a benchmark to determine the relative speed of Resonator Networks, and our main finding is depicted in Figure 6. Measured in number of iterations, Resonator Networks are comparable to the benchmark algorithms. We noted that Alternating Least Squares is the most greedy of the benchmarks, and one can see from Figure 6 that it is the fastest in this sense. We are considering only trials that ultimately found the correct factorization, which in this simulation was roughly $70\%$ of trials for each of the benchmarks. In contrast, Resonator Networks always eventually found the correct factorization. Measured in terms of wall-clock time, Resonator Networks are significantly faster than the benchmarks. This can be attributed to their nearly $5\times$ lower per-iteration cost. Resonator Networks with outer product weights utilize very simple arithmetic operations, and this explains the difference between Figures 6b and 6c.
Figure 6: Our benchmark of factorization speed. Implementation in Python with NumPy, run on a machine with an Intel Core i7-6850k processor and 32GB RAM. We generated $5{,}000$ random instantiations of the factorization problem with $N = 1500$, $F = 3$, and $D_f = 40$, running each of the four algorithms in turn. (a) Convergence traces for $100$ randomly-drawn factorization problems; each line is the cosine similarity between $c$ and $\hat{c}$ over iterations of the algorithm. Each of the four algorithms is run on the same $100$ factorization problems. All of the instances are solved by the Resonator Network, whereas a sizeable fraction (around $30\%$) of the instances are not solved by the benchmark algorithms, at least within $100$ iterations. (b) Cosine similarity vs. iteration number (only trials with accuracy $1.0$). (c) Cosine similarity vs. wall-clock time (only trials with accuracy $1.0$). Panels (b) and (c) show average performance conditioned on the algorithms finding the correct factorization.

6.4 Dynamics that do not converge

One must be prepared for the possibility that the dynamics of a Resonator Network will not converge. Fortunately, for $M$ below the $p = 0.99$ operational capacity, such events are exceedingly rare. From simulation, we identified three major regimes of convergence behavior, which are depicted in Figure 7:

• For $M$ small enough, almost all trajectories converge, and moreover they converge to a state that yields the correct factorization. Limit cycles are possible but rare, and often still yield the correct factorization. There appear to be few if any spurious fixed points (those yielding an incorrect factorization); if the trajectory converges to a point attractor or limit cycle, one can be confident this state indicates the correct factorization.

• As $M$ increases, non-converging trajectories appear in greater proportion and yield incorrect factorizations. Any trajectories which converge on their own continue to yield the correct factorization, but these become less common.

• Beyond some saturation value $M_{\mathrm{sat}}$ (roughly depicted as the transition from red to blue in the figure), both limit cycles and point attractors re-emerge, and they yield incorrect factorizations.

Figure 7: Regimes of different convergence behavior. Curves show measurements from simulation of an outer product Resonator Network with $3$ factors and $N = 400$. This is also meant as a diagram of convergence behavior for Resonator Networks in general. Shown in black is the average decoding accuracy, and shown in gray is the median number of iterations taken by the network. For low enough $M$, the network always finds a fixed point yielding $100\%$ accuracy; the network will not converge to spurious fixed points in this regime (green). As $M$ is increased, more trajectories wander, not converging in any reasonable time (red); those that are forcibly terminated yield incorrect factorizations. For large enough $M$, the network is completely saturated and most states are fixed points, regardless of whether they yield the correct factorization (blue). Resonator Networks with OLS weights are always stable when $D_f = N$, but OP weights give a bitflip probability that is zero only asymptotically in $M$ (see Section 6.1 and Appendix J).

In theory, limit cycles of any length may appear, although in practice they tend to be skewed towards small cycle lengths. Networks with two factors are the most likely to find limit cycles, and this likelihood appears to decrease with increasing numbers of factors.

Figure 8: Factoring a corrupted $c$. For $M$ well below capacity (lighter curves above) one can sustain heavy corruption to $c$ and still find the correct factorization.
Our intuition about what happens in the middle section of Figure 7 is that the basins of attraction become very narrow and hard to find for the Resonator Network dynamics. The algorithm will wander, since it has so few spurious fixed points (see Section 6.6), but not be able to find any basin of attraction.

6.5 Factoring a 'noisy' composite vector

Our assumption has been that one combination of codevectors from our codebooks X_f generates c exactly. What if this is not the case? Perhaps the vector we are given for factorization has had some proportion ζ of its components flipped; that is, we are given c̃, where c̃ differs from c in exactly ⌊ζN⌋ places. The vector c has a factorization based on our codebooks but c̃ does not. We should hope that a Resonator Network will return the factors of c so long as the corruption is not too severe. This is an especially important capability in the context of Vector Symbolic Architectures, where c̃ will often be the result of some algebraic manipulations that generate noise and corrupt the original c to some degree. We show in Figure 8 that a Resonator Network can still produce the correct factorization even after a significant number of bits have been flipped. This robustness is more pronounced when the number of factorizations is well below operational capacity, at which point the model can often still recover the correct factorization even when 30% of the bits have been flipped.

6.6 A theory for differences in operational capacity

The failure mode of each benchmark algorithm is getting stuck at a spurious fixed point of the dynamics. This section develops a simple comparison between the spurious fixed points of Resonator Networks and the benchmarks as an explanation for why Resonator Networks enjoy relatively higher operational capacity.
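The corruption model of Section 6.5 is straightforward to set up in code. The sketch below (function names are ours, not from the paper) builds a composite c from one codevector per factor and flips exactly ⌊ζN⌋ of its components:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_problem(N, D, F):
    """Random bipolar codebooks and a composite c built from one
    codevector per factor (Hadamard product across factors)."""
    codebooks = [rng.choice([-1, 1], size=(N, D)) for _ in range(F)]
    true_idx = [int(rng.integers(D)) for _ in range(F)]
    c = np.prod([X[:, j] for X, j in zip(codebooks, true_idx)], axis=0)
    return codebooks, true_idx, c

def corrupt(c, zeta):
    """Flip exactly floor(zeta * N) randomly chosen components of c."""
    c_tilde = c.copy()
    n_flip = int(np.floor(zeta * len(c)))
    flip = rng.choice(len(c), size=n_flip, replace=False)
    c_tilde[flip] *= -1
    return c_tilde

codebooks, true_idx, c = make_problem(N=1500, D=40, F=3)
c_tilde = corrupt(c, zeta=0.3)
print(np.mean(c_tilde != c))  # fraction of disagreeing components, ≈ 0.3
```

Feeding c̃ rather than c to the network is the experiment summarized in Figure 8.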
From among the benchmarks we focus on Projected Gradient Descent (applied to the negative inner product with the simplex constraint) to illustrate this point. We will show that the correct factorization is always stable under Projected Gradient Descent (as it is with the OLS variant of Resonator Networks), but that incorrect factorizations are much more likely to be fixed points under Projected Gradient Descent. The definition of Projected Gradient Descent can be found in Table 2, with some comments in Appendix G.

6.6.1 Stability of the correct factorization

The vector of coefficients a_f is a fixed point of Projected Gradient Descent dynamics when the gradient at this point is exactly 0 or when it is in the nullspace of the projection operator. We write

N_{P_{C_f}}[x] := { z | P_{C_f}(x + z) = P_{C_f}(x) }    (22)

to denote this set of points. The nullspace of the projection operator is relatively small on the faces and edges of the simplex, but it becomes somewhat large at the vertices. We denote a vertex by e_i (where (e_i)_j = 1 if j = i and 0 otherwise). The nullspace of the projection operator at a vertex of the simplex is an intersection of halfspaces (each halfspace given by an edge of the simplex). We can compactly represent it with the following expression:

N_{P_{∆^{D_f}}}[e_i] = ⋂_{j≠i} { z | (e_i − e_j)^⊤ z ≥ 1 }    (23)

An equivalent way to express the nullspace is

N_{P_{∆^{D_f}}}[e_i] = { z | z_j ≤ z_i − 1  ∀ j ≠ i }    (24)

In other words, for a vector to be in the nullspace at e_i, the ith element of the vector must be the largest by a margin of 1 or more. This condition is met for the vector −∇_{a_f} L at the correct factorization, since −∇_{a_f} L = X_f^⊤ (ô^(f)[0] ⊙ c) = X_f^⊤ x_⋆^(f). This vector has a value N for the component corresponding to x_⋆^(f) and values that are ≤ N − 1 for all the other components.
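Condition (24) is easy to check numerically. In this sketch (our own construction, not from the paper), we verify that the negative gradient X_f^⊤ x_⋆^(f) at a correct codevector satisfies the vertex-stability margin z_j ≤ z_i − 1 for a random bipolar codebook:

```python
import numpy as np

rng = np.random.default_rng(1)

def stable_at_vertex(z, i):
    """Fixed-point condition (24): the i-th component of z must exceed
    every other component by a margin of at least 1."""
    others = np.delete(z, i)
    return bool(np.all(others <= z[i] - 1))

N, D = 256, 20
X = rng.choice([-1, 1], size=(N, D))  # one bipolar codebook
i = 3                                  # index of the "correct" codevector
z = X.T @ X[:, i]                      # negative gradient at vertex e_i

print(z[i])                   # 256: the self inner product equals N
print(stable_at_vertex(z, i)) # True: cross inner products are O(sqrt(N))
```

The off-vertex components are inner products of distinct random bipolar vectors, which concentrate around 0, so the margin holds with overwhelming probability.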
Thus, the correct factorization (the solution to (1) and global minimizer of (9)) is always a fixed point under the dynamics of Projected Gradient Descent (PGD). This matches the stability of OLS Resonator Networks, which are, by construction, always stable at the correct factorization. We showed in Section 6.1 that OP weights induce instability and that percolated noise makes the model marginally less stable than Hopfield Networks, but there is still a large range of factorization problem sizes where the network is stable with overwhelming probability. What distinguishes the benchmarks from Resonator Networks is what we cover next: the stability of incorrect factorizations.

6.6.2 Stability of incorrect factorizations

Suppose initialization is done with a random combination of codevectors that do not produce c. The vector ô^(f)[0] ⊙ c will be a completely random bipolar vector. So long as D_f is significantly smaller than N, which it always is in our applications, ô^(f)[0] ⊙ c will be nearly orthogonal to every vector in X_f, and its projection onto R(X_f) will be small, with each component equally likely to be positive or negative. Therefore, under the dynamics of a Resonator Network with OLS weights, each component will flip its sign compared to the initial state with probability 1/2, and the state for this factor will remain unchanged with the minuscule probability 1/2^N. The total probability of this incorrect factorization being stable, accounting for each factor, is therefore (1/2^N)^F. Suboptimal factorizations are very unlikely to be fixed points. The same is true for a Resonator Network with OP weights, because each element of the vector X_f X_f^⊤ (ô^(f)[0] ⊙ c) is approximately Gaussian with mean zero (see Section 6.1 and Appendix J). Contrast this against Projected Gradient Descent.
We recall from (24) that the requirement for e_i to be a fixed point is that the ith component of the gradient at this point be largest by a margin of 1 or more. This is a much looser stability condition than we had for Resonator Networks – such a scenario will actually occur with probability 1/D_f for each factor, and the total probability is 1/M. While still a relatively small probability, in typical VSA settings 1/M is much larger than (1/2^N)^F, meaning that compared to Resonator Networks, Projected Gradient Descent is much more stable at incorrect factorizations. Empirically, the failure mode of Projected Gradient Descent involves it settling on one of these spurious fixed points.

6.6.3 Stability in general

The cases of correct and incorrect factorizations drawn from the codebooks are two extremes along a continuum of possible states the algorithm can be in. For Projected Gradient Descent any state will be stable with probability in the interval [1/M, 1], while for Resonator Networks (with OLS weights) the interval is [1/2^{FN}, 1]. In practical settings for VSAs, the interval [1/2^{FN}, 1] is, in a relative sense, much larger than [1/M, 1]. Vectors drawn uniformly from either {−1, 1}^N or [−1, 1]^N concentrate near the lower end of these intervals, suggesting that on average, Projected Gradient Descent has many more spurious fixed points. This statement is not fully complete in the sense that dynamics steer the state along specific trajectories, visiting states in a potentially non-uniform way, but it does suggest that Projected Gradient Descent is much more susceptible to spurious fixed points. The next section shows that these trajectories do in fact converge on spurious fixed points as the factorization problem size grows.
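The gap between these two stability probabilities can be made concrete. Working in base-2 log space avoids underflow for 2^{−NF}; the problem sizes below are chosen to match the simulations in this section:

```python
import math

N, F, Df = 1500, 3, 50
M = Df ** F  # 125,000 candidate factorizations

# Probability that a random incorrect factorization is stable, as base-2 logs:
log2_p_pgd = -math.log2(M)  # Projected Gradient Descent: 1/M
log2_p_resonator = -N * F   # Resonator Network (OLS):   (1/2^N)^F

print(log2_p_pgd)        # ≈ -16.9, i.e. 1/M ≈ 8e-6
print(log2_p_resonator)  # -4500: astronomically smaller
```

The roughly 4,500-bit gap between the two exponents is why spurious fixed points dominate the benchmark's behavior long before they trouble a Resonator Network.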
6.6.4 Basins of attraction for benchmark algorithms

It may be that while there are sizable basins of attraction around the correct factorization, moving through the interior of the hypercube causes state trajectories to fall into the basin corresponding to a spurious fixed point. In a normal setting for several of the optimization-based approaches, we initialize a_f to be at the center of the simplex, indicating that each of the factorizations is equally likely. Suppose we were to initialize a_f so that it is just slightly nudged toward one of the simplex vertices. We might nudge it toward the correct vertex (the one given by a_f^⋆) or we might nudge it toward any of the other vertices, away from a_f^⋆. We can parameterize this with a single scalar θ and e_i chosen uniformly among the possible vertices:

a_f[0] = θ e_i + (1 − θ) (1/D_f) 1,    θ ∈ [0, 1],  i ∼ U{1, D_f}    (25)

We ran a simulation with N = 1500 and D_1 = D_2 = D_3 = 50, at which Projected Gradient Descent and Multiplicative Weights have a total accuracy of 0.625 and 0.525, respectively. We created 5,000 random factorization problems, initializing the state according to (25) and allowing the dynamics to run until convergence. We did this first with a nudge toward the correct factorization a_f^⋆ (squares in Figure 9) and then with a nudge away from a_f^⋆, toward a randomly-chosen spurious factorization (triangles in Figure 9). What Figure 9 shows is that by moving just a small distance toward the correct vertex, we very quickly fall into its basin of attraction. However, moving toward any of the other vertices is actually somewhat likely to take us into a spurious basin of attraction (where the converged state is decoded into an incorrect factorization). The space is full of these bad directions.
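Initialization (25) can be written out directly; variable names below are ours:

```python
import numpy as np

rng = np.random.default_rng(2)

def nudged_init(Df, theta, i):
    """Equation (25): convex combination of vertex e_i and the simplex
    center (1/Df) * ones. theta = 0 gives the center, theta = 1 the vertex."""
    e_i = np.zeros(Df)
    e_i[i] = 1.0
    return theta * e_i + (1 - theta) * np.ones(Df) / Df

Df = 50
a0 = nudged_init(Df, theta=0.1, i=int(rng.integers(Df)))

print(np.isclose(a0.sum(), 1.0))  # True: still on the simplex
print(bool(a0.min() >= 0))        # True: still non-negative
```

Because the result is a convex combination of two simplex points, it stays on the simplex for every θ ∈ [0, 1], so the nudged state is always a valid initialization for the simplex-constrained benchmarks.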
It would be very lucky indeed to start from the center of the simplex and move immediately toward the solution – it is far more likely that initial updates take us somewhere else in the space, toward one of the other vertices, and this plot shows that those trajectories often get pulled towards a spurious fixed point. What we are demonstrating here is that, empirically, the interior of the hypercube is somewhat treacherous from an optimization perspective, and this lies at the heart of why the benchmark algorithms fail.

Figure 9: States in the hypercube interior get pulled into spurious basins of attraction. Projected Gradient Descent is in green and Multiplicative Weights is in orange. The network is initialized at a distance θ from the center of the simplex (see equation (25)) and allowed to converge. The y-axis is the accuracy of the factorization implied by the converged state. Triangles indicate initialization slightly away from a_f^⋆, toward any of the other simplex vertices, which is most directions in the space. These initial states get quickly pulled into a spurious basin of attraction.

From among the benchmarks, we restricted our analysis of spurious fixed points to Projected Gradient Descent and, in Figure 9, Multiplicative Weights. This choice was made for clarity, and similar arguments apply for all of the benchmarks. While the details may differ slightly (for instance, spurious fixed points of ALS appear near the simplex center, not at a vertex), the failure mode of the benchmarks is strikingly consistent. They all become overwhelmed by spurious fixed points, long before this effect is felt by Resonator Networks. We have shown that, in expectation, Projected Gradient Descent has many more spurious fixed points than Resonator Networks.
We have also shown that trajectories moving through the interior of the hypercube are easily pulled into these spurious basins of attraction.

7 Discussion

We studied a vector factorization problem which arises in the use of Vector Symbolic Architectures (as introduced in part one of this series (Frady et al., 2020)), showing that Resonator Networks solve this problem remarkably well. Their performance comes from a particular form of nonlinear dynamics, coupled with the idea of searching in superposition. Solutions to the factorization problem lie in a small sliver of R^N – i.e., the corners of the bipolar hypercube {−1, 1}^N – and the highly nonlinear activation function of Resonator Networks serves to constrain the search to this subspace. We drew connections between Resonator Networks and a number of benchmark algorithms which cast factorization as a problem of optimization. This intuitively-satisfying formulation appears to come at a steep cost. None of the benchmarks were competitive with Resonator Networks in terms of the key metrics that characterize factorization performance. One explanation for this is that the benchmarks have comparatively many more spurious fixed points of their dynamics, and that the loss function landscape in the interior of the hypercube induces trajectories that approach these spurious fixed points. Unlike the benchmarks, Resonator Networks do not have a global convergence guarantee, and in some respects we see this as a beneficial characteristic of the model. Requiring global convergence appears to unnecessarily constrain the search for factorizations, leading to lower capacity. Besides, operational capacity (defined in this paper) specifies a regime where the lack of a convergence guarantee can be practically ignored. Resonator Networks almost always converge in this setting, and the fixed points yield the correct solution.
The benchmarks are, by steadfastly descending a loss function, in some sense more greedy than Resonator Networks. It appears that Resonator Networks strike a more natural balance between 1) making updates based on the best-available local information and 2) still exploring the solution space without getting stuck. Our approach follows a kind of "Goldilocks principle" on this tradeoff – not too much, not too little, but just right. We are not the first to consider eschewing convergence guarantees to better solve hard search problems. For instance, randomized search algorithms utilize some explicit form of randomness to find better solutions, typically only converging if this randomness is reduced over time (Spall, 2005). In contrast, our model is completely deterministic, and the searching behavior comes from nonlinear heteroassociative dynamics. Another example is the proposal to add small amounts of random asymmetry to the (symmetric) weight matrix of Hopfield Networks (Hertz et al., 1986). This modification removes the guaranteed absence of cyclic and chaotic trajectories that holds for the traditional Hopfield model. But, at the same time, and without significantly harming the attraction of memory states, adding asymmetry to the weights can improve associative memory recall by shrinking the basins of attraction associated with spurious fixed points (Singh et al., 1995; Chengxiang et al., 2000). We emphasize that, while Resonator Networks appear to be better than the alternatives for the particular vector factorization problem (1), this is not a claim that they are appropriate for other hard search problems. Rather, Resonator Networks are specifically designed for the vector factorization problem at hand.
There exist several prior works involving some aspect of factorization that we would like to mention here, but we emphasize that each one of them deals with a problem or approach that is distinct from what we have introduced in this paper. Tensor decomposition is a family of problems that bear some resemblance to the factorization problem we have introduced (1). Key differences include the object to be factored, which is a higher-order tensor, not a vector, and constraints on the allowable factors. We explain in Appendix D how our factorization problem is different from traditional tensor decompositions. Our benchmarks actually included the standard tensor decomposition algorithm, Alternating Least Squares, re-expressed for (1), and we found that it is not well-matched to this factorization problem. Bidirectional Associative Memory, proposed by Kosko (1988), is an extension of Hopfield Networks that stores pairs of factors in a matrix using the outer product learning rule. The composite object is a matrix, rather than a vector, and is much closer to a particular type of tensor decomposition called the 'CP decomposition', which we elaborate on in Appendix D. Besides the fact that this model applies only to two-factor problems, its dynamics are different from ours and its capacity is relatively low (Kobayashi et al., 2002). Subsequent efforts to extend this model to factorizations with 3 or more factors (Huang and Hagiwara, 1999; Kobayashi et al., 2002) have had very limited success and still rely on matrices that connect pairs of factors, rather than the single multilinear product we have in our model. Bilinear Models of Style and Content (Tenenbaum and Freeman, 2000) was an inspiration for us in deciding to work on factorization problems.
That paper applies a different type of tensor decomposition, a 'Tucker decomposition' (again see Appendix D), to a variety of different real-valued datasets, using what appears to be in one case a closed-form solution based on the Singular Value Decomposition, and in the other case a variant of Alternating Least Squares. In that sense, their method is different from ours, the factorization problem is itself different, and they consider only pairs of factors. Memisevic and Hinton (2010) revisit the Tucker decomposition problem, but factor the 'core' tensor representing interactions between factors in order to make estimation more tractable. They propose a Boltzmann Machine that computes the factorization and show some results on modeling image transformations. Finally, there is a large body of work on matrix factorization of the form V ≈ WH, the most well-known example of which is probably Non-negative Matrix Factorization (Lee and Seung, 2001). The matrix V can be thought of as a sum of outer products, so this is really a type of CP decomposition with an additional constraint on the sign of the factors. Different still is the fact that W is often interpreted as a basis for the columns of V, with H containing the coefficients of each column with respect to this basis. In this sense, vectors are being added to explain V, rather than combined multiplicatively – Non-negative Matrix Factorization is much closer to Sparse Coding (Hoyer, 2004). The first paper in this series (Frady et al., 2020) illustrated how distributed representations of data structures can be built with the algebra of Vector Symbolic Architectures, as well as how Resonator Networks can decompose these data structures. VSAs are a powerful way to think about structured connectionist representations, and Resonator Networks make the framework much more scalable.
Extending the examples found in that paper to more realistic data (for example, complex 3-dimensional visual scenes) could be a useful application of Resonator Networks. This will likely require learning a transform from pixels into the space of high-dimensional symbolic vectors, and this learning should ideally occur in the context of the factorization dynamics – we feel that this is an exciting avenue for future study. Here we have not shown Resonator Circuits for anything other than bipolar vectors. However, a version of the model wherein vector elements are unit-magnitude complex phasors is a natural next extension, and relevant to Holographic Reduced Representations, a VSA developed by Tony Plate (Plate, 2003). A recent theory of sparse phasor associative memories (Frady and Sommer, 2019) may allow one to perform this factorization with a network of spiking neurons. Resonator Networks are an abstract neural model of factorization, introduced for the first time in this two-part series. We believe that as the theory and applications of Resonator Networks are further developed, they may help us understand factorization in the brain, which still remains an important mystery.

8 Acknowledgements

We would like to thank members of the Redwood Center for Theoretical Neuroscience for helpful discussions, in particular Pentti Kanerva, whose work on Vector Symbolic Architectures originally motivated this project. This work was generously supported by the National Science Foundation, under graduate research fellowship DGE1752814 and research grant IIS1718991, by the National Institutes of Health, grant 1R01EB026955-01, the Semiconductor Research Corporation under E2CDA-NRI, and DARPA's Virtual Intelligence Processing (VIP) program.

References

Amari, S.-I. and Maginu, K. (1988). Statistical neurodynamics of associative memory. Neural Networks, 1(1):63–73.

Amit, D.
J., Gutfreund, H., and Sompolinsky, H. (1985). Storing infinite numbers of patterns in a spin-glass model of neural networks. Physical Review Letters, 55(14):1530.

Amit, D. J., Gutfreund, H., and Sompolinsky, H. (1987). Information storage in neural networks with low levels of activity. Physical Review A, 35(5):2293.

Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M., and Telgarsky, M. (2014). Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15:2773–2832.

Arathorn, D. W. (2001). Recognition under transformation using superposition ordering property. Electronics Letters, 37(3):164–166.

Arathorn, D. W. (2002). Map-seeking circuits in visual cognition: A computational mechanism for biological and machine vision. Stanford University Press, Stanford, CA.

Arora, S., Hazan, E., and Kale, S. (2012). The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164.

Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202.

Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, Oxford. With a foreword by Michel Ledoux.

Bredies, K. and Lorenz, D. A. (2008). Linear convergence of iterative soft-thresholding. Journal of Fourier Analysis and Applications, 14(5-6):813–837.

Cannon, L. E. (1969). A cellular computer to implement the Kalman filter algorithm. PhD thesis, Montana State University-Bozeman, College of Engineering.

Carroll, J. D. and Chang, J.-J. (1970). Analysis of individual differences in multidimensional scaling via an n-way generalization of "Eckart-Young" decomposition. Psychometrika, 35(3):283–319.

Carroll, J. D., Pruzansky, S., and Kruskal, J. B. (1980).
Candelinc: A general approach to multidimensional analysis of many-way arrays with linear constraints on parameters. Psychometrika, 45(1):3–24.

Chengxiang, Z., Dasgupta, C., and Singh, M. P. (2000). Retrieval properties of a Hopfield model with random asymmetric interactions. Neural Computation, 12(4):865–880.

Cohen, M. A. and Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man, and Cybernetics, (5):815–826.

Condat, L. (2016). Fast projection onto the simplex and the l1 ball. Mathematical Programming, 158(1-2):575–585.

Daubechies, I., Defrise, M., and De Mol, C. (2004). An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413–1457.

De Lathauwer, L., De Moor, B., and Vandewalle, J. (2000a). A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 21(4):1253–1278.

De Lathauwer, L., De Moor, B., and Vandewalle, J. (2000b). On the best rank-1 and rank-(r1, r2, ..., rn) approximation of higher-order tensors. SIAM Journal on Matrix Analysis and Applications, 21(4):1324–1342.

Duchi, J., Shalev-Shwartz, S., Singer, Y., and Chandra, T. (2008). Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272–279. ACM.

Frady, E. P., Kent, S. J., Olshausen, B. A., and Sommer, F. T. (2020). Resonator networks for factoring distributed representations of data structures. arXiv preprint arXiv:2007.03748.

Frady, E. P. and Sommer, F. T. (2019). Robust computation with rhythmic spike patterns. Proceedings of the National Academy of Sciences, 116(36):18050–18059.

Gayler, R. W. (1998).
Multiplicative binding, representation operators & analogy (workshop poster). In Holyoak, K., Gentner, D., and Kokinov, B., editors, Advances in Analogy Research: Integration of Theory and Data from the Cognitive, Computational, and Neural Sciences.

Gayler, R. W. (2004). Vector symbolic architectures answer Jackendoff's challenges for cognitive neuroscience. arXiv preprint arXiv:cs/0412059.

Gedeon, T. and Arathorn, D. (2007). Convergence of map seeking circuits. Journal of Mathematical Imaging and Vision, 29(2-3):235–248.

Harker, S., Vogel, C. R., and Gedeon, T. (2007). Analysis of constrained optimization variants of the map-seeking circuit algorithm. Journal of Mathematical Imaging and Vision, 29(1):49–62.

Harshman, R. A. (1970). Foundations of the PARAFAC procedure: Models and conditions for an 'explanatory' multimodal factor analysis. Technical Report, UCLA Working Papers in Phonetics, 16, 1-84, University of California, Los Angeles.

Hazan, E. et al. (2016). Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325.

Hebb, D. O. (1949). The Organization of Behavior; A Neuropsychological Theory. John Wiley & Sons, New York, NY.

Held, M., Wolfe, P., and Crowder, H. P. (1974). Validation of subgradient optimization. Mathematical Programming, 6(1):62–88.

Hertz, J., Grinstein, G., and Solla, S. (1986). Memory networks with asymmetric bonds. In AIP Conference Proceedings, volume 151, pages 212–218. American Institute of Physics.

Hillar, C. J. and Lim, L.-H. (2013). Most tensor problems are NP-hard. Journal of the ACM (JACM), 60(6):1–39.

Hitchcock, F. L. (1927). The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics, 6(1-4):164–189.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities.
Proceedings of the National Academy of Sciences, 79(8):2554–2558.

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81(10):3088–3092.

Hopfield, J. J. and Tank, D. W. (1985). "Neural" computation of decisions in optimization problems. Biological Cybernetics, 52(3):141–152.

Hoyer, P. O. (2004). Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5(Nov):1457–1469.

Huang, J. and Hagiwara, M. (1999). A new multidimensional associative memory based on distributed representation and its applications. In IEEE International Conference on Systems, Man, and Cybernetics, volume 1, pages 194–199.

Kanerva, P. (1996). Binary spatter-coding of ordered k-tuples. In International Conference on Artificial Neural Networks, pages 869–873.

Kobayashi, M., Hattori, M., and Yamazaki, H. (2002). Multidirectional associative memory with a hidden layer. Systems and Computers in Japan, 33(3):1494–1502.

Kolda, T. G. and Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review, 51(3):455–500.

Kosko, B. (1988). Bidirectional associative memories. IEEE Transactions on Systems, Man, and Cybernetics, 18(1):49–60.

Kruskal, J. B. (1977). Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra and its Applications, 18(2):95–138.

Lee, D. D. and Seung, H. S. (2001). Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pages 556–562.

Memisevic, R. and Hinton, G. E. (2010). Learning to represent spatial transformations with factored higher-order Boltzmann machines. Neural Computation, 22(6):1473–1492.

Personnaz, L., Guyon, I., and Dreyfus, G. (1986).
Collective computational properties of neural networks: New learning mechanisms. Physical Review A, 34(5):4217.

Plate, T. A. (2003). Holographic Reduced Representation: Distributed Representation of Cognitive Structure. CSLI Publications, Stanford, CA.

Rahimi, A., Datta, S., Kleyko, D., Frady, E. P., Olshausen, B., Kanerva, P., and Rabaey, J. M. (2017). High-dimensional computing as a nanoscalable paradigm. IEEE Transactions on Circuits and Systems I: Regular Papers, 64(9):2508–2521.

Sidiropoulos, N. D. and Bro, R. (2000). On the uniqueness of multilinear decomposition of n-way arrays. Journal of Chemometrics, 14(3):229–239.

Sidiropoulos, N. D., De Lathauwer, L., Fu, X., Huang, K., Papalexakis, E. E., and Faloutsos, C. (2017). Tensor decomposition for signal processing and machine learning. IEEE Transactions on Signal Processing, 65(13):3551–3582.

Singh, M. P., Chengxiang, Z., and Dasgupta, C. (1995). Fixed points in a Hopfield model with random asymmetric interactions. Physical Review E, 52(5):5261.

Spall, J. C. (2005). Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control, volume 65. John Wiley & Sons.

Tenenbaum, J. B. and Freeman, W. T. (2000). Separating style and content with bilinear models. Neural Computation, 12(6):1247–1283.

Tucker, L. R. (1963). Implications of factor analysis of three-way matrices for measurement of change. In Harris, C. W., editor, Problems in Measuring Change, pages 122–137. University of Wisconsin Press, Madison, WI.

Tucker, L. R. (1966). Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311.

Van Vreeswijk, C. and Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274(5293):1724–1726.

Vasilescu, M. A. O. and Terzopoulos, D. (2002).
Multilinear analysis of image ensembles: TensorFaces. In European Conference on Computer Vision, pages 447–460. Springer.

Xu, Z.-B., Hu, G.-Q., and Kwong, C.-P. (1996). Asymmetric Hopfield-type networks: theory and applications. Neural Networks, 9(3):483–501.

Appendices

A Implementation details

This section includes a few comments relevant to the implementation of Resonator Networks. Algorithm 1 gives pseudocode for Ordinary Least Squares weights – the only change for outer product weights is to use X^⊤ instead of X^†. So long as D_f < N/2, computing X_f X_f^† (ô ⊙ c) has lower computational complexity than actually forming a single "synaptic matrix" T_f := X_f X_f^† and then computing T_f (ô ⊙ c) in each iteration – it is faster to keep the matrices X_f and X_f^† separate. This, of course, assumes that the implementation is on a conventional computer. If one can use specialized analog computation, such as large mesh circuits that directly implement matrix-vector multiplication in linear time (Cannon, 1969), then it would be preferable to store the synaptic matrix directly. Lines 11-13 in Algorithm 1 "clean up" x̂^(f) using the nearest neighbor in the codebook, and also resolve a sign ambiguity inherent to the factorization problem. The sign ambiguity is simply this: while c = x_⋆^(1) ⊙ x_⋆^(2) ⊙ · · · ⊙ x_⋆^(F) is the factorization we are searching for, we also have c = (−x_⋆^(1)) ⊙ (−x_⋆^(2)) ⊙ · · · ⊙ x_⋆^(F), and, more generally, any even number of factors can have their signs flipped and still generate the correct c. Resonator Networks will sometimes find these solutions. We clean up using the codevector with the largest unsigned similarity to the converged x̂^(f), which remedies this issue. One will notice that we have written Algorithm 1 to update factors in order from 1 to F. This is completely arbitrary, and any ordering is fine.
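The description above translates almost line-for-line into NumPy. The sketch below is our own minimal rendering of Algorithm 1, with one detail the pseudocode leaves unspecified: ties in sgn are broken to +1 so that the state stays bipolar.

```python
import numpy as np

def bipolar_sign(v):
    """sgn with ties broken to +1, keeping every state in {-1, +1}."""
    return np.where(v >= 0, 1.0, -1.0)

def resonator_ols(c, codebooks, max_iters=100):
    """Resonator Network with OLS weights, following Algorithm 1.
    c: composite bipolar vector; codebooks: list of (N, D_f) matrices."""
    F = len(codebooks)
    pinvs = [np.linalg.pinv(X) for X in codebooks]            # line 2
    x_hat = [bipolar_sign(X.sum(axis=1)) for X in codebooks]  # line 1
    for _ in range(max_iters):
        previous = [x.copy() for x in x_hat]
        for f in range(F):
            # line 6: Hadamard product of all *other* current estimates
            o_hat = np.prod([x_hat[g] for g in range(F) if g != f], axis=0)
            # line 7: pass o_hat ⊙ c through X_f X_f† and binarize
            x_hat[f] = bipolar_sign(codebooks[f] @ (pinvs[f] @ (o_hat * c)))
        if all(np.array_equal(p, x) for p, x in zip(previous, x_hat)):
            break  # reached a fixed point
    # lines 11-13: unsigned nearest-neighbor cleanup, per factor
    return [X[:, np.argmax(np.abs(X.T @ x))] for X, x in zip(codebooks, x_hat)]

# Usage: recover the factors of an exact composite, well below capacity
rng = np.random.default_rng(3)
N, D, F = 256, 6, 3
codebooks = [rng.choice([-1.0, 1.0], size=(N, D)) for _ in range(F)]
idx = [int(rng.integers(D)) for _ in range(F)]
c = np.prod([X[:, j] for X, j in zip(codebooks, idx)], axis=0)
factors = resonator_ols(c, codebooks)
print(all(np.array_equal(x, X[:, j]) for x, X, j in zip(factors, codebooks, idx)))
```

Keeping `codebooks[f]` and `pinvs[f]` as separate matrices, rather than forming T_f once, reflects the complexity argument above for D_f < N/2.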
We have experimented with choosing a random update order during each iteration, but this did not seem to significantly affect performance. Computing $\hat{\mathbf{o}}$ with the most recently updated values for factors $1$ to $f-1$ (see equation (6)) is a convention we call 'asynchronous' updates, in rough analogy to the same term used in the context of Hopfield Networks. An alternative convention is, when computing $\hat{\mathbf{o}}$, not to use freshly updated values for factors $1$ to $f-1$, but rather their values from before the update. This treats each factor as if it were updated simultaneously, a convention we call 'synchronous' updates. This distinction is an idiosyncrasy of modeling Resonator Networks in discrete time, and the difference between the two disappears in continuous time, where updates happen instantaneously. Throughout this paper, our analysis and simulations have used 'asynchronous' updates, which we find to converge significantly faster. Not shown in Algorithm 1 is the fact that, in practice, we record a buffer of past states, allowing us to detect when the dynamics fall into a limit cycle and to terminate early.

Algorithm 1: Resonator Network with Ordinary Least Squares weights

Require: c (composite vector to be factored)
Require: X_1, X_2, …, X_F (codebook matrices, x^(f)_j = X_f[:, j])
Require: k (maximum allowed iterations)
 1: x̂^(f) ← sgn(Σ_j x^(f)_j)  for all f = 1, …, F
 2: X_f† ← pinv(X_f)  for all f = 1, …, F
 3: i ← 0
 4: while not converged and i < k do
 5:   for f = 1 to F do
 6:     ô ← x̂^(1) ⊙ ⋯ ⊙ x̂^(f−1) ⊙ x̂^(f+1) ⊙ ⋯ ⊙ x̂^(F)
 7:     x̂^(f) ← sgn(X_f X_f† (ô ⊙ c))
 8:   end for
 9:   i ← i + 1
10: end while
11: for f = 1 to F do                    ▷ Nearest-neighbor decoding
12:   u ← argmax_j |sim(x̂^(f), x^(f)_j)|  ▷ unsigned NN w.r.t. cosine similarity
13:   x̂^(f) ← x^(f)_u
14: end for
15: return x̂^(f) for all f = 1, …
, F

B Operational Capacity

The main text introduced our definition of operational capacity and highlighted our two main results: that Resonator Networks have superior operational capacity compared to the benchmark algorithms, and that Resonator Network capacity scales as a quadratic function of $N$. This appendix provides some additional support for and commentary on these findings.

Figure 10 compares operational capacity among all of the considered algorithms when $F$, the number of factors, is 4. We previously showed this type of plot for $F = 3$, which was Figure 3 in the main text. Resonator Networks have an advantage of between two and three orders of magnitude compared to all of our benchmarks; the general size of this gap was consistent in all of our simulations.

Figure 10: Comparing operational capacity against the benchmarks for $F = 4$ (4 factors).

We concluded in Section 6.2 that the operational capacity of Resonator Networks scales quadratically in $N$, which was shown in Figure 4. In Table 1 we provide the parameters of the least-squares quadratic fits shown in that plot. One can see from Figure 4b that capacity differs depending on the number of factors involved, and in the limit of large $N$ this difference is determined by the parameter $c$. The value of $c$ first rises from 2 to 3 factors, and then falls with increasing $F$. This implies that factorization is easiest for Resonator Networks when the decomposition is into 3 factors, an interesting phenomenon for which we do not have an explanation at this time.

Table 1: Parameters of the quadratic fit M_max = a + bN + cN².

F        a              b            c
2     1.677 × 10⁵   −3.253 × 10²   0.293
3     1.230 × 10⁶   −3.549 × 10³   2.002
4    −5.663 × 10⁶    9.961 × 10²   1.404
5     1.140 × 10⁶   −2.404 × 10³   1.024
6     5.789 × 10⁶   −4.351 × 10³   0.874
7    −1.503 × 10⁷   −1.551 × 10³   0.669

Figure 11 visualizes $c$ as a function of $F$. The data indicate that for $F \geq 3$, $c$ may follow an inverse power law: $c = \alpha_1 F^{-\alpha_2}$. The indicated linear fit, following a logarithmic transformation of each axis, suggests the following values for the parameters of this power law: $\alpha_1 \approx 2^{3.014} = 8.078$ and $\alpha_2 \approx 1.268$.

Figure 11: Parameter $c$ of the quadratic scaling depends on $F$. We find that it may follow an inverse power law for $F \geq 3$.

It is with some reservation that we give these specific values for $\alpha_1$ and $\alpha_2$. Our estimates of operational capacity, while well fit by quadratics, undoubtedly contain small amounts of noise. This noise can have a big enough impact on the fitted values for $c$ that fitting the fit may not be fully justified. However, we do note, for the sake of completeness, that this scaling, if it holds for larger values of $F$, would allow us to write operational capacity in terms of both parameters $N$ and $F$ in the limit of large $N$:

$$M_{\max} \approx \frac{8.078\, N^2}{F^{1.268}} \qquad \forall F \geq 3 \tag{26}$$

C Table of benchmark algorithms

Alternating Least Squares (equation (10)):
$$\mathbf{a}_f[t+1] = (\xi^\top \xi)^{-1} \xi^\top \mathbf{c}, \qquad \xi := \mathrm{diag}\big(\hat{\mathbf{o}}^{(f)}[t]\big)\, X_f$$

Iterative Soft Thresholding (equation (28)):
$$\mathbf{a}_f[t+1] = S\big[\mathbf{a}_f[t] - \eta \nabla_{\mathbf{a}_f}\mathcal{L};\ \lambda\eta\big], \qquad S[\mathbf{x}; \gamma]_i := \mathrm{sgn}(x_i)\max(|x_i| - \gamma,\ 0)$$

Fast Iterative Soft Thresholding (equation (29)):
$$\alpha_t = \frac{1 + \sqrt{1 + 4\alpha_{t-1}^2}}{2}, \quad \beta_t = \frac{\alpha_{t-1} - 1}{\alpha_t}, \quad \mathbf{p}_f[t+1] = \mathbf{a}_f[t] + \beta_t\big(\mathbf{a}_f[t] - \mathbf{a}_f[t-1]\big), \quad \mathbf{a}_f[t+1] = S\big[\mathbf{p}_f[t+1] - \eta \nabla_{\mathbf{p}_f}\mathcal{L};\ \lambda\eta\big]$$

Projected Gradient Descent (equation (30)):
$$\mathbf{a}_f[t+1] = P_{\mathcal{C}_f}\big[\mathbf{a}_f[t] - \eta \nabla_{\mathbf{a}_f}\mathcal{L}\big], \qquad P_{\mathcal{C}_f}[\mathbf{x}] := \arg\min_{\mathbf{z} \in \mathcal{C}_f} \tfrac{1}{2}\|\mathbf{x} - \mathbf{z}\|_2^2$$

Multiplicative Weights (equation (31)):
$$\mathbf{w}_f[t+1] = \mathbf{w}_f[t] \odot \Big(1 - \tfrac{\eta}{\rho}\nabla_{\mathbf{a}_f}\mathcal{L}\Big), \qquad \mathbf{a}_f[t+1] = \frac{\mathbf{w}_f[t+1]}{\sum_i w_{f_i}[t+1]}, \qquad \rho := \max_i \big|(\nabla_{\mathbf{a}_f}\mathcal{L})_i\big|$$

Map Seeking Circuits (equation (32)):
$$\mathbf{a}_f[t+1] = T\Big[\mathbf{a}_f[t] - \eta\Big(1 - \tfrac{1}{\rho}\nabla_{\mathbf{a}_f}\mathcal{L}\Big);\ \epsilon\Big], \qquad T[\mathbf{x}; \epsilon]_i := \begin{cases} x_i & \text{if } x_i \geq \epsilon \\ 0 & \text{otherwise} \end{cases}, \qquad \rho := \min_i (\nabla_{\mathbf{a}_f}\mathcal{L})_i$$

Table 2: Dynamics for $\mathbf{a}_f$ under each of the benchmark algorithms.
(See Appendices D–I for discussion of each algorithm, including the hyperparameters $\eta$, $\lambda$, and $\epsilon$, as well as initial conditions.)

D Tensor Decompositions and Alternating Least Squares

Tensors are multidimensional arrays that generalize vectors and matrices. An $F$th-order tensor has elements that can be indexed by $F$ separate indices: a vector is a tensor of order 1 and a matrix is a tensor of order 2. As devices for measuring multivariate time series have become more prevalent, the fact that such data can be expressed as a tensor has made the study of tensor decomposition a very popular subfield of applied mathematics. Hitchcock (1927) is often credited with originally formulating tensor decompositions, but modern tensor decomposition was popularized in the field of psychometrics by the work of Tucker (1966), Carroll and Chang (1970), and Harshman (1970). This section highlights the substantial difference between tensor decomposition and the factorization problem solved by Resonator Networks.

The type of tensor decomposition most closely related to our factorization problem (given in equation (1)) decomposes an $F$th-order tensor $\mathcal{C}$ into a sum of tensors, each generated by the outer product $\circ$:

$$\mathcal{C} = \sum_{r=1}^{R} \mathbf{x}^{(1)}_r \circ \mathbf{x}^{(2)}_r \circ \cdots \circ \mathbf{x}^{(F)}_r \tag{33}$$

The outer product contains all pairs of components from its two arguments, so $(\mathbf{w} \circ \mathbf{x} \circ \mathbf{y} \circ \mathbf{z})_{ijkl} = w_i x_j y_k z_l$. The interpretation is that each term in the sum is a "rank-one" tensor of order $F$ and that $\mathcal{C}$ can be generated from the sum of $R$ of these rank-one tensors. We say that $\mathcal{C}$ is "rank-$R$". This particular decomposition has at least three different names in the literature: Canonical Polyadic Decomposition, coined by Hitchcock; CANonical DECOMPosition (CANDECOMP), coined by Carroll and Chang; and PARAllel FACtor analysis (PARAFAC), coined by Harshman.
We will simply call this the CP decomposition, in accordance with the convention used by Kolda and Bader (2009) and many others. CP decomposition makes no mention of a codebook of vectors, such as we have in (1). In CP decomposition, the search is apparently over all of the vectors in a real-valued vector space. One very useful fact about CP decomposition is that under relatively mild conditions, if the decomposition exists, it is unique up to a scaling and permutation indeterminacy. Without going into the details, a result in Kruskal (1977), extended by Sidiropoulos and Bro (2000), gives a sufficient condition for uniqueness of the CP decomposition based on what is known as the Kruskal rank $k_{X_f}$ of the matrix $X_f := [\mathbf{x}^{(f)}_1, \mathbf{x}^{(f)}_2, \ldots, \mathbf{x}^{(f)}_R]$:

$$\sum_{f=1}^{F} k_{X_f} \geq 2R + (F - 1) \tag{34}$$

This fact of decomposition uniqueness illustrates one way that basic results for matrices fail to generalize to higher-order tensors (by higher-order we simply mean order $\geq 3$). Low-rank CP decomposition for matrices (tensors of order 2) may be computed with the truncated Singular Value Decomposition (SVD). However, if $\mathcal{C}$ is a matrix and its truncated SVD is $U\Sigma V^\top := X_1 X_2^\top$, then any non-singular matrix $M$ generates an equally good CP decomposition $(U\Sigma M)(V M^{-\top})^\top$. The decomposition is highly non-unique. All matrices have an SVD, whereas generic higher-order tensors are not guaranteed to have a CP decomposition. And yet, if a CP decomposition exists, under the mild condition of equation (34) it is unique. This is a somewhat miraculous fact, suggesting that in this sense, CP decomposition of higher-order tensors is easier than it is for matrices. The higher order of the composite object imposes many more constraints, which make the decomposition unique.
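Equation (33) and the outer product identity above are easy to make concrete. The following NumPy fragment (our own illustration, not code from the paper) builds a rank-$R$ order-3 tensor and checks the elementwise formula:

```python
import numpy as np

# Build a rank-R order-3 tensor C = sum_r x_r ∘ y_r ∘ z_r, as in equation (33).
rng = np.random.default_rng(0)
R, I, J, K = 2, 4, 5, 6
X = rng.normal(size=(I, R))   # columns are the x_r
Y = rng.normal(size=(J, R))   # columns are the y_r
Z = rng.normal(size=(K, R))   # columns are the z_r
C = np.einsum('ir,jr,kr->ijk', X, Y, Z)  # sum of R rank-one outer products

# The outer product pairs every combination of components:
# C_ijk = sum_r (x_r)_i (y_r)_j (z_r)_k.
i, j, k = 1, 2, 3
assert np.isclose(C[i, j, k], sum(X[i, r] * Y[j, r] * Z[k, r] for r in range(R)))
```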
Another interesting way that higher-order tensors differ from matrices is that computing matrix rank is easy, whereas in general computing tensor rank is NP-hard, along with many other important tensor problems (Hillar and Lim, 2013). Our intuition about matrices largely fails us when dealing with higher-order tensors. In some ways the problems are easier and in some ways they are harder. Please see Sidiropoulos et al. (2017) for a more comprehensive comparison.

The vector factorization problem defined by (1) differs from CP decomposition in three key ways:

1. The composite object to be factored is a vector, not a higher-order tensor. This is an even more extreme difference than the one between matrices and higher-order tensors. In CP decomposition, the arrangement and numerosity of tensor elements constitute many constraints on what the factorization can be, so much so that they resolve the uniqueness issue we outlined above. In this sense, tensors contain much more information about the valid factorization, making the problem significantly easier. The size and form of these tensors may make finding CP decompositions a computational challenge, but CP decomposition is analytically easier than our vector factorization problem.

2. Search is conducted over a discrete set of possible factors. This differs from the standard formulation of CP decomposition, which makes no restriction to a discrete set of factors. It is, however, worth noting that a specialization of CP decomposition called CANonical DEcomposition with LINear Constraints (CANDELINC) (Carroll et al., 1980) does in fact impose the additional requirement that factors are formed from a linear combination of some basis factors. In our setup the solutions are 'one-hot' linear combinations.

3. The factors are constrained to $\{-1, 1\}^N$, a small sliver of $\mathbb{R}^N$. This difference should not be underestimated.
We have shown in Section 6.6 that the interior of this hypercube is treacherous from an optimization perspective, and Resonator Networks avoid it by using a highly nonlinear activation function. This would not make sense in the context of standard CP decomposition.

Perhaps the most convincing demonstration that (1) is not CP decomposition comes from the fact that we applied Alternating Least Squares to it and found that its performance was relatively poor. Alternating Least Squares is in fact the 'workhorse' algorithm of CP decomposition (Kolda and Bader, 2009), but it cannot compete with Resonator Networks on our different factorization problem (1). The excellent review of Kolda and Bader (2009) covers CP decomposition and Alternating Least Squares in significant depth, including the fact that ALS always converges to a local minimum of the squared-error reconstruction loss. See, in particular, Section 3.4 of their paper for more detail.

One special case of CP decomposition involves rank-one components that are symmetric and orthogonal. For this problem, a special case of ALS called the tensor power method can be used to iteratively find the best low-rank CP decomposition through what is known as 'deflation', which is identical to the explaining away we introduced in part one of this series (Frady et al., 2020). The tensor power method directly generalizes the matrix power method, and in this special case of symmetric, orthogonal tensors it is effective at finding the CP decomposition. A good initial reference for the tensor power method is De Lathauwer et al. (2000b). A discussion of applying tensor decompositions to statistical learning problems is given by Anandkumar et al. (2014), which develops a robust version of the tensor power method and contains several important probabilistic results for applying tensor decompositions to noisy data.
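For intuition, the tensor power iteration for a symmetric, orthogonally decomposable order-3 tensor can be sketched as follows. This is our own minimal illustration: the update is $\mathbf{v} \leftarrow T(I, \mathbf{v}, \mathbf{v})$, normalized, which converges to one of the orthonormal components.

```python
import numpy as np

rng = np.random.default_rng(0)
n, R = 6, 3
# Orthonormal components v_r and positive weights lambda_r give a symmetric,
# orthogonally decomposable tensor T = sum_r lambda_r v_r ∘ v_r ∘ v_r.
V = np.linalg.qr(rng.normal(size=(n, n)))[0][:, :R]
lam = np.array([3.0, 2.0, 1.0])
T = np.einsum('r,ir,jr,kr->ijk', lam, V, V, V)

# Power iteration: v <- T(I, v, v) / ||T(I, v, v)||.
v = rng.normal(size=n)
v /= np.linalg.norm(v)
for _ in range(50):
    v = np.einsum('ijk,j,k->i', T, v, v)
    v /= np.linalg.norm(v)

# The iterate converges to one of the components v_r.
assert np.max(np.abs(V.T @ v)) > 0.999
```

Deflation then subtracts the recovered rank-one term from $T$ and repeats, which is the 'explaining away' analogy mentioned above.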
The tensor power method differs from Resonator Networks in the same key ways as ALS: composite objects are higher-order tensors, not vectors; search is not necessarily over a discrete set; the vectors are not constrained to $\{-1, 1\}^N$; and the dynamics make linear least-squares updates for each factor.

Another popular tensor decomposition is known as the Tucker Decomposition (Tucker, 1963, 1966). It adds to CP decomposition an order-$F$ "core tensor" $\mathcal{G}$ that modifies the interaction between each of the factors:

$$\mathcal{C} = \sum_{p=1}^{P} \sum_{q=1}^{Q} \cdots \sum_{r=1}^{R} g_{pq\ldots r}\; \mathbf{x}^{(1)}_p \circ \mathbf{x}^{(2)}_q \circ \cdots \circ \mathbf{x}^{(F)}_r \tag{35}$$

Figure 12: Tucker decomposition with 3 factors.

This adds many more parameters compared to CP decomposition, which is the special case of Tucker decomposition in which $\mathcal{G}$ is the identity. For the purpose of illustration, we reprint in Figure 12 (with a slight relabeling) a figure from Kolda and Bader (2009) that depicts an order-3 Tucker decomposition. This decomposition goes by many other names, most popularly the Higher-Order SVD, coined in De Lathauwer et al. (2000a). The Tucker decomposition can also be found via Alternating Least Squares (see Kolda and Bader (2009), Section 4.2, for a tutorial), although the problem is somewhat harder than CP decomposition, both by being computationally more expensive and by being non-unique. Despite this fact, the applications of Tucker decomposition are wide-ranging: it has been used in psychometrics, signal processing, and computer vision. One well-known application of Tucker decomposition in computer vision was TensorFaces (Vasilescu and Terzopoulos, 2002). This model was able to factorize identity, illumination, viewpoint, and facial expression in a dataset consisting of face images.

The summary of this section is that the vector factorization problem (1) is not tensor decomposition. In some sense it is more challenging.
Perhaps not surprisingly, the standard algorithm for tensor decompositions, Alternating Least Squares, is not particularly competitive on this problem when compared to Resonator Networks. It is interesting to consider whether tensor decomposition might be cast into a form amenable to solution by Resonator Networks. Given the importance of tensor decomposition as a tool of data analysis, we believe this warrants a closer look.

E General notes on gradient-based algorithms

When $\mathcal{L}$ is the negative inner product, the gradient with respect to $\mathbf{a}_f$ is:

$$\nabla_{\mathbf{a}_f} \mathcal{L} = -X_f^\top \big(\mathbf{c} \odot \hat{\mathbf{x}}^{(1)} \odot \cdots \odot \hat{\mathbf{x}}^{(f-1)} \odot \hat{\mathbf{x}}^{(f+1)} \odot \cdots \odot \hat{\mathbf{x}}^{(F)}\big) = -X_f^\top \big(\mathbf{c} \odot \hat{\mathbf{o}}^{(f)}\big) \tag{36}$$

The term $\mathbf{c} \odot \hat{\mathbf{o}}^{(f)}$ can be interpreted as an estimate for what $\hat{\mathbf{x}}^{(f)}$ should be, based on the current estimates for the other factors. Multiplying by $X_f^\top$ compares the similarity of this vector to each of the candidate codevectors we are entertaining, with the smallest element of $\nabla_{\mathbf{a}_f}\mathcal{L}$ (its value is likely to be negative with large absolute value) indicating the codevector which matches best. Following the negative gradient will cause this coefficient to increase more than the coefficients corresponding to the other codevectors.

When $\mathcal{L}$ is the squared error, the gradient with respect to $\mathbf{a}_f$ is:

$$\nabla_{\mathbf{a}_f} \mathcal{L} = X_f^\top \Big( \big(\mathbf{c} - \hat{\mathbf{x}}^{(1)} \odot \cdots \odot \hat{\mathbf{x}}^{(F)}\big) \odot \big({-\hat{\mathbf{x}}^{(1)}} \odot \cdots \odot \hat{\mathbf{x}}^{(f-1)} \odot \hat{\mathbf{x}}^{(f+1)} \odot \cdots \odot \hat{\mathbf{x}}^{(F)}\big) \Big) := X_f^\top \Big( \hat{\mathbf{x}}^{(f)} \odot \big(\hat{\mathbf{o}}^{(f)}\big)^2 - \mathbf{c} \odot \hat{\mathbf{o}}^{(f)} \Big) \tag{37}$$

This looks somewhat similar to the gradient for the negative inner product: they differ by an additive term given by $X_f^\top \big(\hat{\mathbf{x}}^{(f)} \odot (\hat{\mathbf{o}}^{(f)})^2\big)$. At the vertices of the hypercube, all elements of the factor estimates are $1$ or $-1$, so $(\hat{\mathbf{o}}^{(f)})^2$ becomes the all-ones vector and disappears, making the difference between the two gradients just $X_f^\top \hat{\mathbf{x}}^{(f)}$. Among other things, this makes the gradient of the squared error equal to zero at the global minimizer $\mathbf{x}^{(1)}_\star, \ldots, \mathbf{x}^{(F)}_\star$, which is not the case with the negative inner product.
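Equation (36) is easy to verify numerically. The sketch below (our own check, with arbitrary small dimensions) compares the analytic gradient of the negative inner product loss to central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, F = 32, 5, 3
Xs = [rng.choice([-1.0, 1.0], size=(N, D)) for _ in range(F)]
a = [rng.normal(size=D) for _ in range(F)]
c = rng.choice([-1.0, 1.0], size=N)

def loss(a_f, f):
    """Negative inner product loss as a function of factor f's coefficients."""
    factors = [Xs[g] @ (a_f if g == f else a[g]) for g in range(F)]
    return -c @ np.prod(np.stack(factors), axis=0)

f = 1
o_f = np.prod(np.stack([Xs[g] @ a[g] for g in range(F) if g != f]), axis=0)
grad = -Xs[f].T @ (c * o_f)  # equation (36)

# Central differences; the loss is linear in a_f, so agreement is essentially exact.
eps = 1e-5
numeric = np.array([(loss(a[f] + eps * e, f) - loss(a[f] - eps * e, f)) / (2 * eps)
                    for e in np.eye(D)])
assert np.allclose(grad, numeric, rtol=1e-5, atol=1e-4)
```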
To be clear, (36) is the gradient when the loss function is the negative inner product, while (37) is the gradient when the loss function is the squared error.

E.1 Fixed-stepsize gradient descent on the squared error

In fixed-stepsize gradient descent for unconstrained convex optimization problems, one must often add a restriction on the stepsize, related to the smoothness of the loss function, in order to ensure that the iterates converge to a fixed point. We say that a function $\mathcal{L}$ is $L$-smooth when its gradient is Lipschitz continuous with constant $L$:

$$\|\nabla\mathcal{L}(\mathbf{x}) - \nabla\mathcal{L}(\mathbf{y})\|_2 \leq L \|\mathbf{x} - \mathbf{y}\|_2 \quad \forall\, \mathbf{x}, \mathbf{y} \tag{38}$$

For a function that is twice-differentiable, this is equivalent to the condition

$$\mathbf{0} \preceq \nabla^2\mathcal{L}(\mathbf{x}) \preceq L I \quad \forall\, \mathbf{x} \tag{39}$$

where $\mathbf{0}$ is the matrix of all zeros and $I$ is the identity. Absent some procedure for adjusting the stepsize $\eta$ at each iteration to account for the degree of local smoothness, or some additional assumption placed on the loss to make sure that it is sufficiently smooth, we should be wary that convergence may not be guaranteed. On our factorization problem we find this to be an issue. Unconstrained gradient descent on the squared error works for the simplest problems, where $M$ is small and the factorization can be easily found by any of the algorithms in this paper. However, as $M$ increases, the exceedingly "jagged" landscape of the squared-error loss makes the iterates very sensitive to the stepsize $\eta$, and the components of $\mathbf{a}_f[t]$ can become very large. When this happens, the term $\hat{\mathbf{o}}^{(f)}[t]$ amplifies the problem (it multiplies together all but one of the factor estimates) and causes numerical instability. The smoothness of the squared-error loss is very poor: we found that fixed-stepsize gradient descent on the squared error was so sensitive to $\eta$ that it made the method practically useless for solving the factorization problem.
Iterative Soft Thresholding and Fast Iterative Soft Thresholding use a dynamic stepsize to avoid this issue (see equation (40)). In contrast, the negative inner product loss is, with respect to each factor, in some sense perfectly smooth (it is linear), so the stepsize does not factor into convergence proofs.

F Iterative Soft Thresholding (ISTA) and Fast Iterative Soft Thresholding (FISTA)

Iterative Soft Thresholding is a type of proximal gradient descent. The proximal operator for any convex function $h(\cdot)$ is defined as

$$\mathrm{prox}_h(\mathbf{x}) := \arg\min_{\mathbf{z}} \frac{1}{2}\|\mathbf{z} - \mathbf{x}\|_2^2 + h(\mathbf{z})$$

When $h(\mathbf{z})$ is $\lambda\|\mathbf{z}\|_1$, the proximal operator is the so-called "soft-thresholding" function, which we denote by $S$:

$$S[\mathbf{x}; \gamma]_i := \mathrm{sgn}(x_i)\max(|x_i| - \gamma,\ 0)$$

Consider taking the squared-error loss and adding to it $\lambda\|\mathbf{a}_f\|_1$:

$$\mathcal{L}(\mathbf{c}, \hat{\mathbf{c}}) + \lambda\|\mathbf{a}_f\|_1 = \frac{1}{2}\|\mathbf{c} - \hat{\mathbf{c}}\|_2^2 + \lambda\|\mathbf{a}_f\|_1$$

Applying soft thresholding clearly minimizes this augmented loss function. The strategy is to take gradient steps with respect to the squared-error loss and then pass those updates through the soft-thresholding function $S$. This flavor of proximal gradient descent, where $\hat{\mathbf{c}}$ is a linear function of $\mathbf{a}_f$ and $h(\cdot)$ is the $\ell_1$ norm, is called the Iterative Soft Thresholding Algorithm (Daubechies et al., 2004), and is a somewhat old and popular approach for finding sparse solutions to large-scale linear inverse problems.

The dynamics of ISTA are given in equation (28), and there are a few parameters worth discussing. First, the dynamic stepsize $\eta$ can be set via backtracking line search or, as we did, by computing the Lipschitz constant of the loss-function gradient:

$$\eta = \frac{1}{L}, \qquad \|\nabla_{\mathbf{a}}\mathcal{L}(\mathbf{x}) - \nabla_{\mathbf{a}}\mathcal{L}(\mathbf{y})\|_2 \leq L\|\mathbf{x} - \mathbf{y}\|_2 \quad \forall\, \mathbf{x}, \mathbf{y} \tag{40}$$

The scalar $\lambda$ is a hyperparameter that effectively sets the sparsity of the solutions considered; its value should be tuned in order to get good performance in practice.
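As a concrete, deliberately simplified illustration, here is the soft-thresholding operator and one ISTA step on a plain linear least-squares problem (our own sketch; the paper applies the same scheme to its multilinear loss, with the stepsize set from the Lipschitz constant as in (40)):

```python
import numpy as np

def soft_threshold(x, gamma):
    # S[x; gamma]_i = sgn(x_i) * max(|x_i| - gamma, 0)
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def ista_step(a, X, c, lam):
    """One ISTA step on min_a 0.5*||c - X a||^2 + lam*||a||_1."""
    L = np.linalg.norm(X, 2) ** 2      # Lipschitz constant of the gradient
    eta = 1.0 / L                      # stepsize, as in equation (40)
    grad = X.T @ (X @ a - c)
    return soft_threshold(a - eta * grad, lam * eta)
```

Iterating `ista_step` drives the augmented objective down monotonically; FISTA adds the Nesterov momentum sequence of equation (29) on top of exactly this step.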
In the experiments we show in this paper, $\lambda$ was $0.01$. The initial state $\mathbf{a}_f[0]$ is set to $\mathbf{1}$. Convergence analysis of ISTA is beyond the scope of this paper, but it has been shown in various places (Bredies and Lorenz (2008), for instance) that ISTA converges at a rate of $O(1/t)$. Iterative Soft Thresholding works well in practice, although for 4 or more factors we find that it is not quite as effective as the algorithms that do constrained descent on the negative inner product loss. By virtue of not directly constraining the coefficients, ISTA allows them to grow outside of $[0, 1]^{D_f}$. This may make it easier to find the minimizers $\mathbf{a}^\star_1, \mathbf{a}^\star_2, \ldots, \mathbf{a}^\star_F$, but it may also lead the method to encounter more suboptimal local minimizers, which we found to be the case in practice. One common criticism of ISTA is that it can get trapped in shallow parts of the loss surface and thus suffers from slow convergence (Bredies and Lorenz, 2008). A straightforward improvement, based on Nesterov's momentum for accelerating first-order methods, was proposed by Beck and Teboulle (2009), which they call Fast Iterative Soft Thresholding (FISTA). The dynamics of FISTA are written in equation (29), and converge at the significantly better rate of $O(1/t^2)$, a result proven in Beck and Teboulle (2009). Despite this difference in worst-case convergence rate, we find that the average-case convergence rate on our particular factorization problem does not significantly differ. The initial coefficients $\mathbf{a}_f[0]$ are set to $\mathbf{1}$ and the auxiliary variable $\alpha_t$ is initialized to $1$. For all experiments, $\lambda$ was set the same as for ISTA, to $0.01$.

G Projected Gradient Descent

Starting from the general optimization form of the factorization problem (9), what kind of constraint might it be reasonable to enforce on $\mathbf{a}_f$?
The most obvious is that $\mathbf{a}_f$ lie on the simplex $\Delta_{D_f} := \{\mathbf{x} \in \mathbb{R}^{D_f} \mid \sum_i x_i = 1,\ x_i \geq 0\ \forall i\}$. Enforcing this constraint means that $\hat{\mathbf{x}}^{(f)}$ stays within the $[-1, 1]$ hypercube at all times and, as we noted, the optimal values $\mathbf{a}^\star_1, \mathbf{a}^\star_2, \ldots, \mathbf{a}^\star_F$ happen to lie at vertices of the simplex, the standard basis vectors $\mathbf{e}_i$. Another constraint set worth considering is the $\ell_1$ ball $B_{\|\cdot\|_1}[1] := \{\mathbf{x} \in \mathbb{R}^{D_f} \mid \|\mathbf{x}\|_1 \leq 1\}$. This set contains the simplex, but it encompasses much more of $\mathbb{R}^{D_f}$. One reason to consider the $\ell_1$ ball is that it dramatically increases the number of feasible global optimizers of (9), from which we can easily recover the specific solution to (1). This is due to the fact that

$$\mathbf{c} = X_1\mathbf{a}^\star_1 \odot X_2\mathbf{a}^\star_2 \odot \cdots \odot X_F\mathbf{a}^\star_F \iff \mathbf{c} = X_1(-\mathbf{a}^\star_1) \odot X_2(-\mathbf{a}^\star_2) \odot \cdots \odot X_F\mathbf{a}^\star_F$$

and, moreover, any number of distinct pairs of factor coefficients can be made negative, since the sign changes cancel out. The result is that while the simplex constraint only allows the solution $\mathbf{a}^\star_1, \mathbf{a}^\star_2, \ldots, \mathbf{a}^\star_F$, the $\ell_1$ ball constraint also allows the solutions $-\mathbf{a}^\star_1, -\mathbf{a}^\star_2, \mathbf{a}^\star_3, \ldots, \mathbf{a}^\star_F$, and $\mathbf{a}^\star_1, \mathbf{a}^\star_2, -\mathbf{a}^\star_3, \ldots, -\mathbf{a}^\star_F$, and $-\mathbf{a}^\star_1, -\mathbf{a}^\star_2, -\mathbf{a}^\star_3, \ldots, -\mathbf{a}^\star_F$, etc. These spurious global minimizers can easily be detected by checking the sign of the largest-magnitude component of $\mathbf{a}_f$. If it is negative, we can multiply by $-1$ to recover $\mathbf{a}^\star_f$. Choosing the $\ell_1$ ball over the simplex is purely motivated by the perspective that increasing the size of the constraint set may make finding the global optimizers easier. However, we found that in practice it did not significantly matter whether $\Delta_{D_f}$ or $B_{\|\cdot\|_1}[1]$ was used to constrain $\mathbf{a}_f$.

There exist algorithms for efficiently computing projections onto both the simplex and the $\ell_1$ ball (see Held et al. (1974), Duchi et al. (2008), and Condat (2016)). We use a variant summarized in Duchi et al.
(2008) that has computational complexity $O(D_f \log D_f)$; recall that $\mathbf{a}_f$ has $D_f$ components, so this is the dimensionality of the simplex or $\ell_1$ ball being projected onto. When constraining to the simplex, we set the initial coefficients $\mathbf{a}_f[0]$ to $\frac{1}{D_f}\mathbf{1}$, the center of the simplex. When constraining to the unit $\ell_1$ ball, we set $\mathbf{a}_f[0]$ to $\frac{1}{2D_f}\mathbf{1}$, so that all coefficients are equal but the vector is in the interior of the ball. The only hyperparameter is $\eta$, which in all experiments was set to $0.01$. We remind the reader that we defined the nullspace of the projection operation with equation (22) in Section 6.6, and the special case for the simplex constraint in (23) and (24).

Taking projected gradient steps on the negative inner product loss works well and is guaranteed to converge, whether we use the simplex or the $\ell_1$ ball constraint. Convergence is guaranteed due to this intuitive fact: any part of $-\eta\nabla_{\mathbf{a}_f}\mathcal{L}$ not in $N_{P_{\mathcal{C}_f}}[\mathbf{x}]$ induces a change in $\mathbf{a}_f$, denoted by $\Delta\mathbf{a}_f[t]$, which must make an acute angle with $-\nabla_{\mathbf{a}_f}\mathcal{L}$. This is by the definition of orthogonal projection, and it is a sufficient condition for showing that $\Delta\mathbf{a}_f[t]$ decreases the value of the loss function. Projected Gradient Descent iterates always reduce the value of the negative inner product loss or leave it unchanged; the function is bounded below on the simplex and the $\ell_1$ ball, so this algorithm is guaranteed to converge.

Applying projected gradient descent to the squared error did not work, which is related to the smoothness issue we discussed in Appendix E.1, although the behavior was not as dramatic as with unconstrained gradient descent. We observed in practice that projected gradient descent on the squared-error loss easily falls into limit cycles of the dynamics. It was for this reason that we restricted our attention with projected gradient descent to the negative inner product loss.
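The simplex projection we use can be sketched directly from its description in the literature; the following is our own transcription of the $O(D \log D)$ sort-based variant summarized in Duchi et al. (2008):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex
    {x : sum_i x_i = 1, x_i >= 0}, via the O(D log D) sort-based
    algorithm summarized in Duchi et al. (2008)."""
    u = np.sort(v)[::-1]                        # sort descending
    css = np.cumsum(u)
    j = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * j > css - 1.0)[0][-1]  # last index kept active
    theta = (css[rho] - 1.0) / (rho + 1.0)      # shared shift
    return np.maximum(v - theta, 0.0)
```

Projection onto the $\ell_1$ ball reduces to this routine: if $\|\mathbf{v}\|_1 \leq 1$, return $\mathbf{v}$; otherwise project $|\mathbf{v}|$ onto the simplex and restore the signs.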
H Multiplicative Weights

When we have simplex constraints $\mathcal{C}_f = \Delta_{D_f}$, the Multiplicative Weights algorithm is an elegant way to perform the superposition search. It naturally enforces the simplex constraint by maintaining a set of auxiliary variables, the 'weights', which define the choice of $\mathbf{a}_f$ at each iteration. See equation (31) for the dynamics of Multiplicative Weights. We choose a fixed stepsize $\eta \leq 0.5$ and initial values for the weights all one: $\mathbf{w}_f[0] = \mathbf{1}$. In the experiments in this paper we set $\eta = 0.3$. The variable $\rho$ exists to normalize the term $\frac{1}{\rho}\nabla_{\mathbf{a}_f}\mathcal{L}$ so that each element lies in the interval $[-1, 1]$.

Multiplicative Weights is an algorithm primarily associated with game theory and online optimization, although it has been independently discovered in a wide variety of fields (Arora et al., 2012). Please see Arora's excellent review of Multiplicative Weights for a discussion of the fascinating historical and analytical details of this algorithm. Multiplicative Weights is often presented as a decision policy for discrete-time games. However, through a straightforward generalization of the discrete actions into directions in a continuous vector space, one can apply Multiplicative Weights to problems of online convex optimization, which is discussed at length in Arora et al. (2012) and Hazan et al. (2016). We can think of solving our problem (9) as if it were an online convex optimization problem in which we update each factor $\hat{\mathbf{x}}^{(f)}$ according to its own Multiplicative Weights update, one at a time. The function $\mathcal{L}$ is convex with respect to $\mathbf{a}_f$, but is changing at each iteration due to the updates for the other factors; it is in this sense that we are treating (9) as an online convex optimization problem.
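A single Multiplicative Weights update from equation (31) can be sketched as follows (our own fragment; $\rho$ is taken as the largest gradient magnitude, per the normalization described above):

```python
import numpy as np

def mw_step(w, grad, eta=0.3):
    """One Multiplicative Weights update (equation (31)) for one factor."""
    rho = np.max(np.abs(grad))       # normalizer: grad/rho lies in [-1, 1]
    w = w * (1.0 - (eta / rho) * grad)
    a = w / w.sum()                  # coefficients stay on the simplex
    return w, a
```

With $\eta \leq 0.5$, the weights stay strictly positive, which is what the descent argument in Appendix H.1 relies on.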
H.1 Multiplicative Weights is a descent method

A descent method on $\mathcal{L}$ is any algorithm that iterates $\mathbf{a}_f[t+1] = \mathbf{a}_f[t] + \eta[t]\,\Delta\mathbf{a}_f[t]$, where the update $\Delta\mathbf{a}_f[t]$ makes an acute angle with $-\nabla_{\mathbf{a}_f}\mathcal{L}$: that is, $\nabla_{\mathbf{a}_f}\mathcal{L}^\top \Delta\mathbf{a}_f[t] < 0$. In the case of Multiplicative Weights, we can equivalently define a descent method based on $\nabla_{\mathbf{w}_f}\tilde{\mathcal{L}}^\top \Delta\mathbf{w}_f[t] < 0$, where $\tilde{\mathcal{L}}(\mathbf{w}_f)$ is the loss as a function of the weights and $\nabla_{\mathbf{w}_f}\tilde{\mathcal{L}}$ is its gradient with respect to those weights. The loss as a function of the weights comes via the substitution $\mathbf{a}_f = \mathbf{w}_f / \sum_i w_{f_i} := \mathbf{w}_f / \Phi_f$. We now prove that $\nabla_{\mathbf{w}_f}\tilde{\mathcal{L}}^\top \Delta\mathbf{w}_f[t] < 0$:

$$\nabla_{\mathbf{w}_f}\tilde{\mathcal{L}} = \left(\frac{\partial \mathbf{a}_f}{\partial \mathbf{w}_f}\right)^{\!\top} \frac{\partial \mathcal{L}}{\partial \mathbf{a}_f} = \begin{pmatrix} \frac{\Phi_f - w_{f_1}}{\Phi_f^2} & -\frac{w_{f_2}}{\Phi_f^2} & \cdots & -\frac{w_{f_k}}{\Phi_f^2} \\ -\frac{w_{f_1}}{\Phi_f^2} & \frac{\Phi_f - w_{f_2}}{\Phi_f^2} & \cdots & -\frac{w_{f_k}}{\Phi_f^2} \\ \vdots & \vdots & \ddots & \vdots \\ -\frac{w_{f_1}}{\Phi_f^2} & -\frac{w_{f_2}}{\Phi_f^2} & \cdots & \frac{\Phi_f - w_{f_k}}{\Phi_f^2} \end{pmatrix} \nabla_{\mathbf{a}_f}\mathcal{L} = \left(\frac{1}{\Phi_f} I - \frac{1}{\Phi_f^2}\mathbf{1}\mathbf{w}^\top\right) \nabla_{\mathbf{a}_f}\mathcal{L} = \frac{1}{\Phi_f}\nabla_{\mathbf{a}_f}\mathcal{L} - \frac{\mathcal{L}(\mathbf{a}_f)}{\Phi_f}\mathbf{1} \tag{41}$$

(The last equality uses the fact that $\mathcal{L}$ is linear in $\mathbf{a}_f$, so $\mathbf{w}_f^\top \nabla_{\mathbf{a}_f}\mathcal{L} = \Phi_f\, \mathcal{L}(\mathbf{a}_f)$.) This allows us to write $\Delta\mathbf{w}_f[t]$ in terms of $\nabla_{\mathbf{w}_f}\tilde{\mathcal{L}}$:

$$\Delta\mathbf{w}_f[t] = -\frac{1}{\rho}\,\mathbf{w}_f[t] \odot \nabla_{\mathbf{a}_f}\mathcal{L} = -\frac{1}{\rho}\,\mathbf{w}_f[t] \odot \Big(\Phi_f \nabla_{\mathbf{w}_f}\tilde{\mathcal{L}} + \mathcal{L}(\mathbf{a}_f[t])\,\mathbf{1}\Big) = -\frac{\Phi_f}{\rho}\,\mathrm{diag}(\mathbf{w}_f[t])\,\nabla_{\mathbf{w}_f}\tilde{\mathcal{L}} - \frac{\mathcal{L}(\mathbf{a}_f[t])}{\rho}\,\mathbf{w}_f[t] \tag{42}$$

And then we can easily show the desired result:

$$\begin{aligned}
\nabla_{\mathbf{w}_f}\tilde{\mathcal{L}}^\top \Delta\mathbf{w}_f[t] &= -\frac{\Phi_f}{\rho}\,\nabla_{\mathbf{w}_f}\tilde{\mathcal{L}}^\top \mathrm{diag}(\mathbf{w}_f[t])\,\nabla_{\mathbf{w}_f}\tilde{\mathcal{L}} - \frac{\mathcal{L}(\mathbf{a}_f[t])}{\rho}\,\nabla_{\mathbf{w}_f}\tilde{\mathcal{L}}^\top \mathbf{w}_f[t] \\
&= -\frac{\Phi_f}{\rho}\,\nabla_{\mathbf{w}_f}\tilde{\mathcal{L}}^\top \mathrm{diag}(\mathbf{w}_f[t])\,\nabla_{\mathbf{w}_f}\tilde{\mathcal{L}} - \frac{\mathcal{L}(\mathbf{a}_f[t])}{\rho}\left(\frac{1}{\Phi_f}\nabla_{\mathbf{a}_f}\mathcal{L} - \frac{\mathcal{L}(\mathbf{a}_f[t])}{\Phi_f}\mathbf{1}\right)^{\!\top}\mathbf{w}_f[t] \\
&= -\frac{\Phi_f}{\rho}\,\nabla_{\mathbf{w}_f}\tilde{\mathcal{L}}^\top \mathrm{diag}(\mathbf{w}_f[t])\,\nabla_{\mathbf{w}_f}\tilde{\mathcal{L}} - \frac{\mathcal{L}(\mathbf{a}_f[t])}{\rho}\Big(\mathcal{L}(\mathbf{a}_f[t]) - \mathcal{L}(\mathbf{a}_f[t])\Big) \\
&= -\frac{\Phi_f}{\rho}\,\nabla_{\mathbf{w}_f}\tilde{\mathcal{L}}^\top \mathrm{diag}(\mathbf{w}_f[t])\,\nabla_{\mathbf{w}_f}\tilde{\mathcal{L}} \;<\; 0
\end{aligned} \tag{43}$$

The last line follows directly from the fact that the $\mathbf{w}_f$ are always positive by construction in Multiplicative Weights: the matrix $\mathrm{diag}(\mathbf{w}_f[t])$ is therefore positive definite, and the term $\frac{\Phi_f}{\rho}$ is strictly greater than $0$. We have shown that the iterates of Multiplicative Weights always make steps in descent directions.
When the loss $\mathcal{L}$ is the negative inner product, it is therefore guaranteed to decrease at each iteration. Empirically, Multiplicative Weights applied to the squared-error loss also always decreases the loss function. We said in Appendix E.1 that descent on the squared error with a fixed stepsize is not in general guaranteed to converge. However, the behavior we observe with Multiplicative Weights descent on the squared error might be explained by the fact that the stepsize is normalized by $\rho$ at each iteration in this algorithm. Both functions are bounded below over the constraint set $\Delta_{D_f}$, and therefore Multiplicative Weights must converge to a fixed point. In practice, we pick a stepsize $\eta$ between $0.1$ and $0.5$ and run the algorithm until the normalized magnitude of the change in the coefficients is below some small threshold $\epsilon$:

$$\frac{\big\|\mathbf{a}_f[t+1] - \mathbf{a}_f[t]\big\|}{\eta} < \epsilon$$

The simulations we showed in the Results section used $\eta = 0.3$ and $\epsilon = 10^{-5}$.

I Map Seeking Circuits

Map Seeking Circuits (MSCs) are neural networks designed to solve invariant pattern recognition problems. Their theory and applications have been gradually developed by Arathorn and colleagues over the past 18 years (see, for example, Arathorn (2001, 2002), Gedeon and Arathorn (2007), and Harker et al. (2007)), but remain largely unknown outside of a small community of vision researchers. In their original conception, they solve a "correspondence maximization" or "transformation discovery" problem in which the network is given a visually transformed instance of some template object and has to recover the identity of the object as well as a set of transformations that explain its current appearance. The approach taken in Map Seeking Circuits is to superimpose the possible transformations, in the same spirit as we have outlined for solving the factorization problem.
We cannot give the topic a full treatment here, but we simply note that the original formulation of Map Seeking Circuits can be directly translated to our factorization problem, wherein each type of transformation (e.g., translation, rotation, scale) is one of the $F$ factors, and the particular values of the transformation are vectors in the codebooks $X_1, X_2, \ldots, X_F$. The loss function is $L : x, y \mapsto -\langle x, y \rangle$ and the constraint set is $[0,1]^{D_f}$ (both by convention in Map Seeking Circuits). The dynamics of Map Seeking Circuits are given in equation (32), with initial values $a_f[0] = \mathbf{1}$ for each factor. The small threshold is a hyperparameter, which we set to $10^{-5}$ in experiments, along with the step size $\eta = 0.1$. Gedeon and Arathorn (2007) and Harker et al. (2007) proved (with some minor technicalities we will not detail here) that Map Seeking Circuits always converge to either a scalar multiple of a canonical basis vector or the zero vector. That is, $a_f[\infty] = \beta_f e_i$ or $\mathbf{0}$ (where $(e_i)_j = 1$ if $j = i$ and $0$ otherwise, and $\beta_f$ is a positive scalar). Due to the normalizing term $\rho$, the updates to $a_f$ can never be positive. Among the components of $\nabla_{a_f} L$ which are negative, the one with the largest magnitude corresponds to a component of $a_f$ which sees an update of $0$. All other components are decreased by an amount which is proportional to the gradient. We noted in our comments on (36) that the smallest element of $\nabla_{a_f} L$ corresponds to the codevector which best matches $c \odot \hat{o}^{(f)}$, a "suggestion" for $\hat{x}^{(f)}$ based on the current states of the other factors. The dynamics of Map Seeking Circuits thus preserve the weight of the codevector which matches best and decrease the weights of the other codevectors, each by an amount proportional to its own match. Once the weight on a codevector drops below the threshold, it is set to zero and no longer participates in the search.
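In code, one step of these dynamics looks roughly as follows. This is only a qualitative sketch of the behavior just described, not the paper's equation (32); the function name and the exact form of the nonpositive shift are illustrative:

```python
import numpy as np

def msc_step(a, grad, rho, threshold=1e-5):
    """One Map-Seeking-Circuit-style update of the superposition weights a_f.
    Qualitative sketch only: updates are never positive, the component whose
    gradient entry is most negative (the best-matching codevector) receives an
    update of zero, the others are decreased, and any weight that falls below
    the threshold is zeroed and leaves the search."""
    delta = (grad - grad.min()) / rho     # nonnegative; zero at the best match
    a = np.clip(a - delta, 0.0, 1.0)      # constraint set [0, 1]^{D_f}
    a[a < threshold] = 0.0
    return a

# With a fixed gradient, the weights collapse onto the best-matching codevector,
# consistent with the convergence result a_f[inf] = beta_f * e_i.
a = np.ones(3)
grad = np.array([-0.9, -0.2, 0.3])
for _ in range(10):
    a = msc_step(a, grad, rho=1.0)
```

After a few steps `a` has collapsed onto the coordinate of the most negative gradient entry, with all other weights driven to zero.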
The phenomenon wherein the correct coefficient $a_{f_{i^\star}}$ drops out of the search is called "sustained collusion" by Arathorn (Arathorn, 2002) and is a failure mode of Map Seeking Circuits.

J Percolated noise in Outer Product Resonator Networks

A Resonator Network with outer product weights $X_f X_f^\top$ that is initialized to the correct factorization is not guaranteed to remain there, just as a Hopfield Network with outer product weights initialized to one of the 'memories' is not guaranteed to remain there. This is in contrast to a Resonator Network (and a Hopfield Network) with Ordinary Least Squares weights $X_f \left(X_f^\top X_f\right)^{-1} X_f^\top$, for which each of the codevectors is always a fixed point. In this section, when we refer simply to a Resonator Network or a Hopfield Network, we are referring to the variants of these models that use outer product weights. The bitflip probability for the $f$th factor of a Resonator Network is denoted $r_f$ and defined in (13). Section J.1 derives $r_1$, which is equal to the bitflip probability for a Hopfield Network, first introduced in (12) in the main text. Section J.2 derives $r_2$, and then Section J.3 collects all of the ingredients to express the general $r_f$.

J.1 First factor

The stability of the first factor in a Resonator Network is the same as the stability of the state of a Hopfield Network: at issue is the distribution of $\hat{x}^{(1)}[1]$:

$$\hat{x}^{(1)}[1] = \mathrm{sgn}\left(X_1 X_1^\top x^{(1)}_\star\right) := \mathrm{sgn}(\Gamma)$$

Assuming each codevector (each column of $X_1$, including the vector $x^{(1)}_\star$) is a random bipolar vector, each component of $\Gamma$ is a random variable. Its distribution can be deduced by writing it out in terms of constant and random components:

$$\Gamma_i = \sum_m^{D_1}\sum_j^{N} x^{(1)}_{m_i} x^{(1)}_{m_j} x^{(1)}_{\star_j}
= N x^{(1)}_{\star_i} + \sum_{m\neq\star}^{D_1}\sum_j^{N} x^{(1)}_{m_i} x^{(1)}_{m_j} x^{(1)}_{\star_j}
= N x^{(1)}_{\star_i} + (D_1 - 1)\, x^{(1)}_{\star_i} + \sum_{m\neq\star}^{D_1}\sum_{j\neq i}^{N} x^{(1)}_{m_i} x^{(1)}_{m_j} x^{(1)}_{\star_j} \qquad (44)$$

The third term is a sum of $(N-1)(D_1-1)$ i.i.d. Rademacher random variables, which in the limit of large $N D_1$ can be well approximated by a Gaussian random variable with mean zero and variance $(N-1)(D_1-1)$. Therefore, $\Gamma_i$ is approximately Gaussian with mean $(N + D_1 - 1)\, x^{(1)}_{\star_i}$ and variance $(N-1)(D_1-1)$. The probability that $\hat{x}^{(1)}[1]_i \neq x^{(1)}_{\star_i}$ is given by the cumulative density function of the Normal distribution:

$$h_1 := \Pr\left(\hat{x}^{(1)}[1]_i \neq x^{(1)}_{\star_i}\right) = \Phi\left(\frac{-N - D_1 + 1}{\sqrt{(N-1)(D_1-1)}}\right) \qquad (45)$$

We care about the ratio $D_1/N$ and how the bitflip probability $h_1$ scales with this number. We have called this $h_1$ to denote the Hopfield bitflip probability, but it is also $r_1$, the bitflip probability for the first factor of a Resonator Network. We will see that for the second, third, fourth, and subsequent factors, $h_f$ will not equal $r_f$, which is what we mean by percolated noise, the focus of Section 6.1 in the main text. If we eliminate all "self-connection" terms from $X_1 X_1^\top$ by setting each element on the diagonal to zero, then the second term in (44) is eliminated and the bitflip probability is $\Phi\left(\frac{-N}{\sqrt{(N-1)(D_1-1)}}\right)$. This is actually significantly different from (45), which we can see in Figure 13. With self-connections, the bitflip probability is maximized when $D_1 = N$ (the reader can verify this via simple algebra), and its maximum value is $\approx 0.023$. Without self-connections, the bitflip probability asymptotes at $0.5$. The actual useful operating regime of both of these networks is where $D_1$ is significantly less than $N$, which we zoom in on in Figure 13b. A "mean-field" analysis of Hopfield Networks developed by Amit, Gutfreund, and Sompolinsky (Amit et al., 1985, 1987) showed that when $D_1/N > 0.138$, a phase-change phenomenon occurs in which a small number of initial bitflips (when the probability is $0.0036$ according to the above approximation) build up over subsequent iterations and the network almost always moves far away from $x^{(1)}_\star$, making it essentially useless.

Figure 13: Effect of self-connections on bitflip probability. (a) Bitflip probability for $D_1/N \in (0, 2]$. (b) Bitflip probability for $D_1/N \in (0, 0.25]$.

We can see that the same bitflip probability is suffered at a significantly higher value of $D_1/N$ when we have self-connections: the vector $x^{(1)}_\star$ is significantly more stable in this sense. We also found that a Resonator Network has higher operational capacity (see Section 6.2) when we leave in the self-connections. As a third point of interest, computing $X_f X_f^\top x^{(1)}_\star$ is often much faster when we keep each codebook matrix separate (instead of forming the synaptic matrix $X_f X_f^\top$ directly), in which case removing the self-connection terms involves extra computation in each iteration of the algorithm. For all of these reasons, we choose to keep the self-connection terms in the Resonator Network.

J.2 Second factor

When we update the second factor, we have

$$\hat{x}^{(2)}[1] = \mathrm{sgn}\left(X_2 X_2^\top \left(\hat{o}^{(2)}[1] \odot c\right)\right) := \mathrm{sgn}(\Gamma)$$

Here we are just repurposing the notation $\Gamma$ to indicate the vector which gets thresholded to $-1$ and $+1$ by the sign function to generate the new state $\hat{x}^{(2)}[1]$. Some of the components of the vector $\hat{o}^{(2)}[1] \odot c$ will be the same as $x^{(2)}_\star$, and some (hopefully small) number of the components will have been flipped relative to $x^{(2)}_\star$ by the update to factor $1$. Let us denote the set of components which flipped as $Q$; the set of components that did not flip is $Q^c$. The number of bits that did or did not flip is the size of these sets, denoted by $|Q|$ and $|Q^c|$, respectively. We have to keep track of these two sets separately because it affects the probability that a component of $\hat{x}^{(2)}[1]$ is flipped relative to $x^{(2)}_\star$.
We can write out the constant and random parts of $\Gamma_i$ along the same lines as what we did in (44):

$$\begin{aligned}
\Gamma_i &= \sum_m^{D_2}\sum_j^{N} x^{(2)}_{m_i} x^{(2)}_{m_j} \left(\hat{o}^{(2)}[1] \odot c\right)_j \\
&= \sum_m^{D_2}\sum_{j\in Q^c} x^{(2)}_{m_i} x^{(2)}_{m_j} x^{(2)}_{\star_j} - \sum_m^{D_2}\sum_{j\in Q} x^{(2)}_{m_i} x^{(2)}_{m_j} x^{(2)}_{\star_j} \\
&= |Q^c|\, x^{(2)}_{\star_i} + \sum_{m\neq\star}^{D_2}\sum_{j\in Q^c} x^{(2)}_{m_i} x^{(2)}_{m_j} x^{(2)}_{\star_j} - |Q|\, x^{(2)}_{\star_i} - \sum_{m\neq\star}^{D_2}\sum_{j\in Q} x^{(2)}_{m_i} x^{(2)}_{m_j} x^{(2)}_{\star_j} \\
&= (N - 2|Q|)\, x^{(2)}_{\star_i} + \sum_{m\neq\star}^{D_2}\sum_{j\in Q^c} x^{(2)}_{m_i} x^{(2)}_{m_j} x^{(2)}_{\star_j} - \sum_{m\neq\star}^{D_2}\sum_{j\in Q} x^{(2)}_{m_i} x^{(2)}_{m_j} x^{(2)}_{\star_j}
\end{aligned} \qquad (46)$$

If $i$ is in the set of bits which did not flip previously, then a constant $(D_2 - 1)\, x^{(2)}_{\star_i}$ comes out of the second term above. If $i$ is in the set of bits which did flip previously, then a constant $-(D_2 - 1)\, x^{(2)}_{\star_i}$ comes out of the third term above. The remaining contribution to $\Gamma_i$ is, in either case, a sum of $(N-1)(D_2-1)$ i.i.d. Rademacher random variables, analogously to what we had in (44). Technically $|Q|$ is a random variable, but when $N$ is of any moderate size it will be close to $r_1 N$, where $r_1$ is the bitflip probability for the first factor. Therefore, $\Gamma_i$ is approximately Gaussian with mean either $\left(N(1 - 2r_1) + (D_2 - 1)\right) x^{(2)}_{\star_i}$ or $\left(N(1 - 2r_1) - (D_2 - 1)\right) x^{(2)}_{\star_i}$, depending on whether $i \in Q^c$ or $i \in Q$. We call the conditional bitflip probabilities that result from these two cases $r_2'$ and $r_2''$:

$$r_2' := \Pr\left(\hat{x}^{(2)}[1]_i \neq x^{(2)}_{\star_i} \,\middle|\, \left(\hat{o}^{(2)}[1] \odot c\right)_i = x^{(2)}_{\star_i}\right) = \Phi\left(\frac{-N(1 - 2r_1) - (D_2 - 1)}{\sqrt{(N-1)(D_2-1)}}\right) \qquad (47)$$

$$r_2'' := \Pr\left(\hat{x}^{(2)}[1]_i \neq x^{(2)}_{\star_i} \,\middle|\, \left(\hat{o}^{(2)}[1] \odot c\right)_i \neq x^{(2)}_{\star_i}\right) = \Phi\left(\frac{-N(1 - 2r_1) + (D_2 - 1)}{\sqrt{(N-1)(D_2-1)}}\right) \qquad (48)$$

The total bitflip probability for updating the second factor, $r_2$, is then $r_2'(1 - h_1) + r_2'' h_1$.
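Equations (47) and (48) can be checked with a small Monte Carlo experiment, sketched below with illustrative sizes (the percolated noise from factor 1 is imposed by hand at rate $r_1$ rather than generated by an actual first-factor update; `norm_cdf` is built from `erfc` to avoid an external dependency):

```python
import numpy as np
from math import erfc, sqrt

def norm_cdf(z):
    """Standard Normal CDF, Phi(z)."""
    return 0.5 * erfc(-z / sqrt(2.0))

rng = np.random.default_rng(0)
N, D2, r1, trials = 1000, 200, 0.05, 100   # illustrative sizes

# Predicted total bitflip probability r_2 from equations (47) and (48)
s = sqrt((N - 1) * (D2 - 1))
r2p = norm_cdf((-N * (1 - 2 * r1) - (D2 - 1)) / s)    # eq. (47)
r2pp = norm_cdf((-N * (1 - 2 * r1) + (D2 - 1)) / s)   # eq. (48)
r2 = r2p * (1 - r1) + r2pp * r1

# Empirical bitflip rate: corrupt x_star at rate r1, then apply the update
flips = 0
for _ in range(trials):
    X2 = rng.choice([-1.0, 1.0], size=(N, D2))
    x_star = X2[:, 0]
    corrupted = x_star * np.where(rng.random(N) < r1, -1.0, 1.0)
    flips += np.sum(np.sign(X2 @ (X2.T @ corrupted)) != x_star)
empirical = flips / (trials * N)
```

With these settings the empirical rate lands within a fraction of a percent of the predicted $r_2$.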
J.3 All other factors

It should hopefully be clear that the general development above for the bitflip probability of the second factor applies to all subsequent factors; we just need to make one modification to the notation. We saw that the bitflip probability was different depending on whether the component had flipped in the previous factor (the difference between (47) and (48)). In the general case, what really matters is whether the factor sees a net bitflip from the other factors. It might be the case that the component had initially flipped but was flipped back by subsequent factors; all that matters is whether an odd number of previous factors flipped the component. To capture this indirectly, we define the quantity $n_f$ to be the net bitflip probability that is passed on to the next factor (this is equation (14) in the main text):

$$n_f := \Pr\left(\left(\hat{o}^{(f+1)}[t] \odot c\right)_i \neq x^{(f+1)}_{\star_i}\right)$$

For the first factor, $r_1 = n_1$, but in the general case it should be clear that

$$r_f = r_f'(1 - n_{f-1}) + r_f''\, n_{f-1},$$

which is equation (17) in the main text. This expression is just marginalizing over the probability that a net bitflip was not seen (first term) and the probability that a net bitflip was seen (second term). The expression for the general $n_f$ is slightly different:

$$n_f = r_f'(1 - n_{f-1}) + (1 - r_f'')\, n_{f-1},$$

which is equation (18) in the main text. The base case of the recursion is $n_0 = 0$, which makes intuitive sense because factor $1$ sees no percolated noise. In (47) and (48) above we had $r_1$, but what really belongs there in the general case is $n_{f-1}$. This brings us to our general statement for the conditional bitflip probabilities $r_f'$ and $r_f''$, which are equations (19) and (20) in the main text:

$$r_f' = \Phi\left(\frac{-N(1 - 2n_{f-1}) - (D_f - 1)}{\sqrt{(N-1)(D_f-1)}}\right)$$

$$r_f'' = \Phi\left(\frac{-N(1 - 2n_{f-1}) + (D_f - 1)}{\sqrt{(N-1)(D_f-1)}}\right)$$

What we have derived here in Appendix J are equations (12)–(20).
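The full recursion is compact enough to state in a few lines of code. The sketch below (illustrative parameter values, not the paper's experimental settings) iterates equations (17) through (20) and confirms that $r_1$ reduces to the Hopfield bitflip probability $h_1$ of equation (45):

```python
from math import erfc, sqrt

def norm_cdf(z):
    """Standard Normal CDF, Phi(z)."""
    return 0.5 * erfc(-z / sqrt(2.0))

def bitflip_probs(N, D):
    """Per-factor bitflip probabilities r_f for codebook sizes D = [D_1, ..., D_F],
    via the percolated-noise recursion, equations (17)-(20)."""
    r, n = [], 0.0    # base case n_0 = 0: factor 1 sees no percolated noise
    for Df in D:
        s = sqrt((N - 1) * (Df - 1))
        r_p = norm_cdf((-N * (1 - 2 * n) - (Df - 1)) / s)    # eq. (19)
        r_pp = norm_cdf((-N * (1 - 2 * n) + (Df - 1)) / s)   # eq. (20)
        r.append(r_p * (1 - n) + r_pp * n)                   # eq. (17)
        n = r_p * (1 - n) + (1 - r_pp) * n                   # eq. (18)
    return r

probs = bitflip_probs(N=500, D=[100] * 5)
h1 = norm_cdf((-500 - 100 + 1) / sqrt(499 * 99))   # equation (45)
```

Since noise accumulates across factors, the returned probabilities increase monotonically in $f$.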
This result agrees very well with data generated in experiments where one actually counts the bitflips in a randomly instantiated Resonator Network. In Figure 14 we show the sampling distribution of $r_f$ from these experiments compared to the analytical expression for $r_f$. Dots indicate the mean value of $r_f$, and the shaded region indicates one standard deviation about the mean, the standard error of this sampling distribution. We generated this plot with $250$ i.i.d. random trials for each point. Solid lines are simply the analytical values for $r_f$, which one can see are in very close agreement with the sampling distribution.

Figure 14: Agreement between simulation and theory for $r_f$. Shades indicate factors $1$–$5$ (light to dark).
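For completeness, here is a minimal version of the experiment behind Figure 14 (with illustrative sizes $N$, $D_f$, $F$ and trial counts, not the settings used for the figure): initialize every factor at its correct codevector, sweep through the factors once, and count how often each updated factor flips a bit.

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(0)
N, D, F, trials = 800, 120, 3, 30   # illustrative sizes

flip_counts = np.zeros(F)
for _ in range(trials):
    # Random bipolar codebooks; the composite c binds the first codevector of each
    X = [rng.choice([-1.0, 1.0], size=(N, D)) for _ in range(F)]
    x_star = [Xf[:, 0] for Xf in X]
    c = np.prod(x_star, axis=0)
    est = [x.copy() for x in x_star]   # initialize at the correct factorization
    for f in range(F):
        # Unbind the other factors' current estimates (the elementwise product
        # inverts binding because all entries are +-1), then update factor f
        o_hat = c * np.prod([est[g] for g in range(F) if g != f], axis=0)
        est[f] = np.sign(X[f] @ (X[f].T @ o_hat))
        flip_counts[f] += np.sum(est[f] != x_star[f])
empirical_r = flip_counts / (trials * N)

# Factor 1 sees no percolated noise, so its rate should match h_1 from (45)
h1 = 0.5 * erfc((N + D - 1) / sqrt(2 * (N - 1) * (D - 1)))
```

The first factor's empirical bitflip rate matches $h_1$, and later factors inherit the percolated noise described above.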