To appear in Neural Computation, 2020

Resonator Networks outperform optimization methods at solving high-dimensional vector factorization

Spencer J. Kent (1,2), E. Paxon Frady (1,3,5), Friedrich T. Sommer (1,3,5), Bruno A. Olshausen (1,3,4)

(1) Redwood Center for Theoretical Neuroscience, (2) Electrical Engineering and Computer Sciences, (3) Helen Wills Neuroscience Institute, (4) School of Optometry, University of California, Berkeley, Berkeley, CA 94702
(5) Intel Laboratories, Neuromorphic Computing Lab, San Francisco, CA 94111

Abstract

We develop theoretical foundations of Resonator Networks, a new type of recurrent neural network introduced in Frady et al. (2020) to solve a high-dimensional vector factorization problem arising in Vector Symbolic Architectures. Given a composite vector formed by the Hadamard product between a discrete set of high-dimensional vectors, a Resonator Network can efficiently decompose the composite into these factors. We compare the performance of Resonator Networks against optimization-based methods, including Alternating Least Squares and several gradient-based algorithms, showing that Resonator Networks are superior in several important ways. This advantage is achieved by leveraging a combination of nonlinear dynamics and "searching in superposition," by which estimates of the correct solution are formed from a weighted superposition of all possible solutions. While the alternative methods also search in superposition, the dynamics of Resonator Networks allow them to strike a more effective balance between exploring the solution space and exploiting local information to drive the network toward probable solutions. Resonator Networks are not guaranteed to converge, but within a particular regime they almost always do.
In exchange for relaxing the guarantee of global convergence, Resonator Networks are dramatically more effective at finding factorizations than all alternative approaches considered.

1 Introduction

This article is the second in a two-part series on Resonator Networks. Part one (Frady et al., 2020) showed how distributed representations of data structures may be formed using the algebra of Vector Symbolic Architectures, and furthermore that decoding these representations often requires solving a vector factorization problem. Resonator Networks were introduced as a neural solution to this problem and demonstrated on two examples. Here, our primary objective is to establish the theoretical foundations of Resonator Networks and to perform a more comprehensive analysis of their convergence and capacity properties in comparison to optimization-based methods. We limit our analysis to a particular definition of the factorization problem, which may seem somewhat abstract but in fact applies to practical usage of Vector Symbolic Architectures (VSAs). We consider "bipolar" vectors, whose elements are ±1, used in the popular "Multiply, Add, Permute (MAP)" VSA (Gayler, 1998, 2004). These ideas extend to other VSAs, although we leave a detailed analysis to future work. Part one included commentary on the historical context and representational power of VSAs, which we will not cover here. For the purposes of this paper, it is sufficient to stipulate that wherever VSAs are used to encode complex hierarchical data structures, a factorization problem must be solved. By solving this problem, Resonator Networks make the VSA framework scalable to a larger range of problems. The core challenge of factorization is that inferring the factors of a composite object amounts to searching through an enormous space of possible solutions.
Resonator Networks do this, in part, by "searching in superposition," a notion that we make precise in Section 3. There are in fact many ways to search in superposition, and we introduce a number of them in Section 5 as benchmarks for our model and to understand what makes our approach different. A Resonator Network is simply a nonlinear dynamical system designed to solve a particular factorization problem. It is defined by equations (7) and (8), which specify two separate variants of the network. The system is named for the way in which correct factorizations seemingly 'resonate' out of what is initially an uninformative network state. The size of the factorization problem that can be reliably solved, as well as the speed with which solutions are found, characterizes the performance of all the approaches we introduce; in these terms, Resonator Networks are by far the most effective. The main results are as follows:

1. We characterize stability at the correct solution, showing that one variant of Resonator Networks is always stable, while the other has stability properties related to classical Hopfield Networks. We show that Resonator Networks are less stable than Hopfield Networks because of a phenomenon we refer to as percolated noise (Section 6.1).

2. We define "operational capacity" as a metric of factorization performance and use it to compare Resonator Networks against six benchmark algorithms. We find that Resonator Networks have dramatically higher operational capacity (Section 6.2).

3. Through simulation, we determine that operational capacity scales as a quadratic function of vector dimensionality. This quantity is proportional to the number of idealized neurons in a Resonator Network (also Section 6.2).

4. We propose a theory for why Resonator Networks perform well on this problem (Section 6.6).
2 Statement of the problem

We formalize the factorization problem in the following way: X_1, X_2, ..., X_F are sets of vectors called 'codebooks'. The f-th codebook contains D_f 'codevectors' x^(1), ..., x^(D_f):

    X_f := { x^(f)_1, x^(f)_2, ..., x^(f)_{D_f} }   ∀ f = 1, 2, ..., F

and these vectors all live in {−1, 1}^N. A composite vector c is generated by computing the Hadamard product of F vectors, one drawn from each of the codebooks X_1, X_2, ..., X_F:

    c = x^(1)_⋆ ⊙ x^(2)_⋆ ⊙ ⋯ ⊙ x^(F)_⋆,   x^(1)_⋆ ∈ X_1, x^(2)_⋆ ∈ X_2, ..., x^(F)_⋆ ∈ X_F

The factorization problem we wish to study is:

    given c, X_1, X_2, ..., X_F, find x^(1)_⋆ ∈ X_1, x^(2)_⋆ ∈ X_2, ..., x^(F)_⋆ ∈ X_F
    such that c = x^(1)_⋆ ⊙ x^(2)_⋆ ⊙ ⋯ ⊙ x^(F)_⋆    (1)

Our assumption in this paper is that the factorization of c into F codevectors, one from each codebook, is unique. Then, the total number of composite vectors that can be generated by the codebooks is M:

    M := ∏_{f=1}^{F} D_f

The problem involves searching among M possible factorizations to find the one that generates c. We will refer to M as the search space size, and at some level it captures the difficulty of the problem. The problem size is also influenced by N, the dimensionality of each vector. Suppose we were to solve (1) using a brute-force strategy. We might form all possible composite vectors from the sets X_1, X_2, ..., X_F, one at a time, until we generate the vector c, which would indicate the appropriate factorization. Assuming no additional information is available, the number of trials taken to find the correct factorization is a uniform random variable K ~ U{1, M} and thus E[K] = (M + 1)/2.
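To make the setup concrete, here is a minimal NumPy sketch of a problem instance; the variable names and the small values of N, F, and D are our own choices for illustration, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
N, F, D = 1000, 3, 20   # vector dimension, number of factors, codevectors per codebook

# Codebooks: each an N x D matrix whose columns are random bipolar codevectors.
codebooks = [rng.choice([-1, 1], size=(N, D)) for _ in range(F)]

# Pick one codevector per codebook and form the composite via the Hadamard product.
true_idx = [4, 11, 7]
c = np.ones(N, dtype=int)
for X, d in zip(codebooks, true_idx):
    c = c * X[:, d]

M = D ** F                              # search-space size: M = prod_f D_f
mean_brute_force_trials = (M + 1) / 2   # E[K] for uniformly guessing over {1, ..., M}
```

Even at this toy scale, M = 8,000 candidate factorizations hide behind a single N-dimensional vector c.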
If instead we could store all of the composite vectors ahead of time, we could compare them to any new composite vector via a single matrix-vector inner product, which, given our uniqueness assumption, will yield a value of N for the correct factorization and values strictly less than N for all other factorizations. The matrix containing all possible composite vectors requires MN bits to store. The core issue is that M scales very poorly with the number of factors and the number of possible codevectors to be entertained. If F = 4 (4 factors) and D_f = 100 ∀f (100 possible codevectors for each factor), then M = 100,000,000. In the context of Vector Symbolic Architectures, it is common to have N = 10,000. Therefore, the matrix with all possible composite vectors would require ≈ 125 GB to store. We aspire to solve problems of this size (and much larger), which are clearly out of reach for brute-force approaches. Fortunately, they are solvable using Resonator Networks.

3 Factoring by search in superposition

In our problem formulation (1) the factors interact multiplicatively to form c, and this lies at the heart of what makes the problem hard to solve. One way to attempt a solution is to produce an estimate for each factor in turn, alternating between updates to a single factor on its own, with the others held fixed. In addition, it may make sense to simultaneously entertain all of the vectors in each X_f, in some proportion that reflects our current confidence in each one being part of the correct solution. We call this searching in superposition, and it is the general approach we take throughout the paper. What we mean by 'superposition' is that the estimate for the f-th factor, x̂^(f), is given by

    x̂^(f) = g(X_f a_f)

where X_f is a matrix with each column a vector from X_f.
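The brute-force table lookup described above can be sketched for a toy two-factor problem, together with the storage arithmetic for the VSA-scale example (F = 4, D_f = 100, N = 10,000); the code is illustrative and the names are ours:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
N, D = 256, 8                                   # toy sizes; two factors here
codebooks = [rng.choice([-1, 1], size=(N, D)) for _ in range(2)]
c = codebooks[0][:, 3] * codebooks[1][:, 5]     # true factorization: (3, 5)

# Table of all M = D**2 composites; one matrix-vector product scores every candidate.
composites = np.stack([codebooks[0][:, i] * codebooks[1][:, j]
                       for i, j in product(range(D), range(D))])
sims = composites @ c
best = int(np.argmax(sims))
best_pair = (best // D, best % D)   # the correct factorization scores exactly N

# Storage cost of this table at VSA scale: M * N bits.
M_big, N_big = 100 ** 4, 10_000
table_gigabytes = M_big * N_big / 8 / 1e9       # 125.0 GB
```

The single matrix-vector product finds the factorization instantly, but only because the toy table fits in memory; at M = 10^8 composites it does not.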
The vector a_f contains the coefficients that define a linear combination of the elements of X_f, and g(·) is a function from R^N to R^N, which we will call the activation function. In this paper we consider the identity g: x ↦ x, the sign function g: x ↦ sgn(x), and nothing else. Other activation functions are appropriate for other variants of Resonator Networks (for instance, where the vectors are complex-valued), but we leave a discussion of this to future work. 'Search' refers to the method by which we adapt a_f over time. The estimate for each factor leads to an estimate for c, denoted by ĉ:

    ĉ := x̂^(1) ⊙ x̂^(2) ⊙ ⋯ ⊙ x̂^(F) = g(X_1 a_1) ⊙ g(X_2 a_2) ⊙ ⋯ ⊙ g(X_F a_F)    (2)

Suppose g(·) is the identity. Then ĉ becomes a multilinear function of the coefficients a_1, a_2, ..., a_F:

    ĉ = x̂^(1) ⊙ x̂^(2) ⊙ ⋯ ⊙ x̂^(F) = X_1 a_1 ⊙ X_2 a_2 ⊙ ⋯ ⊙ X_F a_F    (3)

While this is a 'nice' relationship in the sense that it is linear in each of the coefficients a_f separately (with the others held fixed), it is unfortunately not convex with respect to the coefficients taken all at once. We can rewrite it as a sum of M different terms, one for each of the possible factorizations of c:

    ĉ = Σ_{d_1, d_2, ..., d_F} (a_1)_{d_1} (a_2)_{d_2} ⋯ (a_F)_{d_F} · x^(1)_{d_1} ⊙ x^(2)_{d_2} ⊙ ⋯ ⊙ x^(F)_{d_F}    (4)

where d_1 ranges from 1 to D_1, d_2 ranges from 1 to D_2, and so on. The term in parentheses is a scalar that weights each of the possible Hadamard products. Our estimate ĉ is, at any given time, purely a superposition of all the possible factorizations. Moreover, the superposition weights (a_1)_{d_1} (a_2)_{d_2} ⋯ (a_F)_{d_F} can be approximately recovered from ĉ alone by computing the cosine similarity between ĉ and the vector x^(1)_{d_1} ⊙ x^(2)_{d_2} ⊙ ⋯ ⊙ x^(F)_{d_F}. The source of 'noise' in this approximation is the fact that x^(1)_{m_1} ⊙ x^(2)_{m_2} ⊙ ⋯ ⊙ x^(F)_{m_F} will have a nonzero inner product with the other vectors in the sum. When the codevectors are uncorrelated and high-dimensional, this noise is quite small: ĉ transparently reflects the proportion with which it contains each of the possible factorizations. When g(·) is the sign function, this property is retained. The vector ĉ is no longer an exact superposition, but the scalar (a_1)_{m_1} (a_2)_{m_2} ⋯ (a_F)_{m_F} can still be decoded from ĉ in the same way: the vector ĉ is still an approximate superposition of all the possible factorizations, with the weight for each determined by the coefficients a_f. This property, that thresholded superpositions retain relative similarity to each of their superimposed components, is heavily relied on throughout Kanerva's and Gayler's work on Vector Symbolic Architectures (Kanerva, 1996; Gayler, 1998).

One last point of notation before introducing our solution to the factorization problem: we define the vector ô^(f) to be the product of the estimates for the other factors:

    ô^(f) := x̂^(1) ⊙ ⋯ ⊙ x̂^(f−1) ⊙ x̂^(f+1) ⊙ ⋯ ⊙ x̂^(F)    (5)

This will come up in each of the algorithms under consideration and simplifies our notation. The notation will often include an explicit dependence on time t, like so: x̂^(f)[t] = g(X_f a_f[t]). Each of the algorithms considered in this paper updates one factor at a time, with the others held fixed; at a given time t, we update the factors in order 1 to F, although this is a somewhat arbitrary choice. Including time dependence in ô^(f), we have

    ô^(f)[t] := x̂^(1)[t+1] ⊙ ⋯ ⊙ x̂^(f−1)[t+1] ⊙ x̂^(f+1)[t] ⊙ ⋯ ⊙ x̂^(F)[t]    (6)

which makes explicit that at the time of updating x̂^(f), the factors 1 to (f−1) have already been updated for this 'iteration' t, while the factors (f+1) to F have yet to be updated.
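The claim that superposition weights survive readout can be checked numerically. Below is a sketch under the identity activation for two factors; the simplex-normalized coefficients, dimensions, and names are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 10_000, 5                                # high dimension keeps crosstalk noise small
codebooks = [rng.choice([-1.0, 1.0], size=(N, D)) for _ in range(2)]

# Simplex-normalized coefficients: a genuine weighted superposition per factor.
coeffs = [rng.random(D) for _ in range(2)]
coeffs = [a / a.sum() for a in coeffs]

# With g = identity, the estimate is multilinear in the coefficients (eq. (3)).
c_hat = (codebooks[0] @ coeffs[0]) * (codebooks[1] @ coeffs[1])

# Read off the superposition weight of the factorization (d1, d2) = (1, 4)
# from c_hat alone, via a normalized inner product with that composite.
d1, d2 = 1, 4
v = codebooks[0][:, d1] * codebooks[1][:, d2]
recovered = float(c_hat @ v) / N
target = coeffs[0][d1] * coeffs[1][d2]          # recovered ~ target, up to crosstalk noise
```

The residual between `recovered` and `target` is exactly the crosstalk from the other M − 1 terms in the sum (4), and it shrinks as N grows.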
4 Resonator Networks

A Resonator Network is a nonlinear dynamical system designed to solve the factorization problem (1), and it can be interpreted as a neural network in which idealized neurons are connected in a very particular way. We define two separate variants of this system, which differ in terms of this pattern of connectivity. A Resonator Network with outer product (OP) weights is defined by

    x̂^(f)[t+1] = sgn( X_f X_f^⊤ ( ô^(f)[t] ⊙ c ) )    (7)

Suppose x̂^(f)[t+1] indicates the state of a population of neurons at time t+1. Each neuron receives an input ô^(f)[t] ⊙ c, modified by synapses modeled as a row of a "weight matrix" X_f X_f^⊤. This "synaptic current" is passed through the activation function sgn(·) in order to determine the output, which is either +1 or −1. Most readers will be familiar with the weight matrix X_f X_f^⊤ as the so-called "outer product" learning rule of classical Hopfield Networks (Hopfield, 1982). This has the nice interpretation of Hebbian learning (Hebb, 1949), in which the strength of synapses between any two neurons (represented by this weight matrix) depends solely on their pairwise statistics over some dataset, in this case the codevectors. Prior to thresholding in (7), the matrix-vector product X_f^⊤ ( ô^(f)[t] ⊙ c ) produces coefficients a_f[t] which, when premultiplied by X_f, generate a vector in the linear subspace spanned by the codevectors (the columns of X_f). This projection does not minimize the squared distance between ô^(f)[t] ⊙ c and the resultant vector. Instead, the matrix (X_f^⊤ X_f)^{−1} X_f^⊤ produces such a projection, the so-called Ordinary Least Squares projection onto R(X_f).
This motivates the second variant of our model, Resonator Networks with Ordinary Least Squares (OLS) weights:

    x̂^(f)[t+1] = sgn( X_f (X_f^⊤ X_f)^{−1} X_f^⊤ ( ô^(f)[t] ⊙ c ) ) =: sgn( X_f X_f^† ( ô^(f)[t] ⊙ c ) )    (8)

where we have used the notation X_f^† to indicate the Moore-Penrose pseudoinverse of the matrix X_f. Hopfield Networks with this type of synapse were first proposed by Personnaz, Guyon, and Dreyfus (Personnaz et al., 1986), who called this the "projection" rule.

If, contrary to what we have defined in (7) and (8), the input to each subpopulation of neurons was x̂^(f)[t], its own previous state, then one would in fact have a ("bipolar") Hopfield Network. In our case, however, rather than being autoassociative, in which x̂^(f)[t+1] is a direct function of x̂^(f)[t], our dynamics are heteroassociative, basing updates on the states of the other factors. This change has a dramatic effect on the network's convergence properties and is also, in some sense, what makes Resonator Networks useful in solving the factorization problem, a fact that we will elaborate on in the following sections. We imagine F separate subpopulations of neurons which evolve together in time, each one responsible for estimating a different factor of c. For now we have just specified this as a discrete-time network in which updates are made one at a time, but it can be extended as a continuous-valued, continuous-time dynamical system along the same lines as was done for Hopfield Networks (Hopfield, 1984). In that case, we can think about these F subpopulations of neurons evolving in a truly parallel way. In discrete time, one has the choice of making 'asynchronous' or 'synchronous' updates to the factors, in a sense analogous to Hopfield Networks. Our formulation of ô^(f)[t] in (6) follows the asynchronous convention, which we find to converge faster.
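A compact NumPy sketch of the dynamics in equations (7) and (8) might look as follows. This is not the authors' released implementation: the function name, the tie-breaking at sgn(0), the stopping rule, and the problem sizes are our own choices:

```python
import numpy as np

def resonator_factor(c, codebooks, weights="OLS", max_iters=200):
    """Sequentially iterate eqs. (7)/(8): x_f <- sgn(W_f (o_f * c)).

    weights="OP" uses W_f = X_f X_f^T; weights="OLS" uses W_f = X_f X_f^+.
    Returns the index of the nearest codevector for each factor.
    """
    F = len(codebooks)
    if weights == "OP":
        Ws = [X @ X.T for X in codebooks]
    else:
        Ws = [X @ np.linalg.pinv(X) for X in codebooks]
    # Initialize with no knowledge of the answer: the bipolarized codebook sum.
    x_hat = [np.where(X.sum(axis=1) >= 0, 1.0, -1.0) for X in codebooks]
    for _ in range(max_iters):
        prev = [x.copy() for x in x_hat]
        for f in range(F):
            o = np.ones_like(c, dtype=float)     # Hadamard product of the other estimates
            for k in range(F):
                if k != f:
                    o = o * x_hat[k]
            u = Ws[f] @ (o * c)
            x_hat[f] = np.where(u >= 0, 1.0, -1.0)   # sgn, with ties sent to +1
        if all((p == x).all() for p, x in zip(prev, x_hat)):
            break                                 # fixed point reached
    return [int(np.argmax(X.T @ x)) for X, x in zip(codebooks, x_hat)]

# Toy problem, well inside the regime where the network almost always converges.
rng = np.random.default_rng(3)
N, F, D = 1500, 3, 8
codebooks = [rng.choice([-1.0, 1.0], size=(N, D)) for _ in range(F)]
c = codebooks[0][:, 2] * codebooks[1][:, 7] * codebooks[2][:, 5]
```

At this size (M = 512, N = 1500) both variants typically settle on the correct factorization within a handful of iterations.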
The formulation given in part one of this series (Frady et al., 2020) employed the synchronous convention for pedagogical reasons, but the distinction between the two vanishes in continuous time, where updates are instantaneous. In practice, we will have to choose an initial state x̂^(f)[0] using no knowledge of the correct codevector x^(f)_⋆ other than the fact that it is one of the elements of the codebook X_f. Therefore, we set x̂^(f)[0] = sgn( Σ_j x^(f)_j ), which, as we have said above, has approximately equal cosine similarity to each term in the sum.

4.1 Difference between OP weights and OLS weights

The difference between outer product weights and Ordinary Least Squares weights lies in (X_f^⊤ X_f)^{−1}, the inverse of the so-called Gram matrix for X_f, which contains the inner products between each pair of codevectors. If the codevectors are orthogonal, the Gram matrix is N I, with I the identity matrix. When N is large (roughly speaking, > 5,000) and the codevectors are chosen randomly i.i.d. from {−1, 1}^N, they will be very nearly orthogonal, making N I a close approximation. Clearly, in this setting, the two variants of Resonator Networks produce nearly the same dynamics. In Section 6.2, we define and measure a performance metric called operational capacity in a setting that does not particularly highlight the difference between the two dynamics, namely the setting where codevectors are nearly orthogonal. In general, however, the dynamics are clearly different. In our experience, applications that contain correlations between codevectors may enjoy higher operational capacity under Ordinary Least Squares weights, but it is hard to say whether this applies in every setting.
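The near-orthogonality argument is easy to check numerically; a small sketch (the dimensions are our choice):

```python
import numpy as np

rng = np.random.default_rng(4)
N, D = 5_000, 32
X = rng.choice([-1.0, 1.0], size=(N, D))

G = X.T @ X                           # Gram matrix of the codebook
off_diag = G - np.diag(np.diag(G))

# Diagonal entries equal N exactly; each off-diagonal entry is a sum of N
# Rademacher variables (standard deviation ~ sqrt(N) ~ 71 here), which is
# tiny relative to N = 5000 -- so G is close to N * I.
max_rel_off_diag = np.abs(off_diag).max() / N
```

Since G ≈ N I, the OLS weight matrix X (X^⊤X)^{−1} X^⊤ ≈ (1/N) X X^⊤, i.e., a rescaled OP weight matrix, and rescaling is invisible to sgn(·).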
One application-relevant consideration is that, because X_f consists of entries that are +1 and −1, the outer product variant of a Resonator Network has an integer-valued weight matrix and can be implemented without any floating-point computation; hardware with large binary and integer arithmetic circuits can simulate this model very quickly. Coupled with the noise tolerance properties we will establish in Section 6.5, this makes Resonator Networks (and, more generally, VSAs) a good fit for emerging device nanotechnologies (Rahimi et al., 2017).

5 The optimization approach

An alternative strategy for solving the factorization problem is to define a loss function which compares the current estimate ĉ := x̂^(1) ⊙ x̂^(2) ⊙ ⋯ ⊙ x̂^(F) with the composite that is to be factored, c, choosing the loss function and a corresponding constraint set so that the global minimizer of this loss over the constraints yields the correct solution to (1). One can then design an algorithm that finds the solution by minimizing this loss. This is the approach taken by optimization theory. Here we consider algorithms that search in superposition, setting x̂^(f) = g(X_f a_f) just as Resonator Networks do, but that instead take the optimization approach. Let the loss function be L(c, ĉ) and the feasible set for each a_f be C_f. We write this as a fairly generic optimization problem:

    minimize over a_1, a_2, ..., a_F:   L( c, g(X_1 a_1) ⊙ g(X_2 a_2) ⊙ ⋯ ⊙ g(X_F a_F) )
    subject to  a_1 ∈ C_1, a_2 ∈ C_2, ..., a_F ∈ C_F    (9)

The character of a particular instance of this problem depends on our choices for L(·,·), g(·), C_1, C_2, ..., C_F, and the structure of the vectors in each codebook. Different algorithms may be appropriate for this problem, depending on these details, and we propose six candidate algorithms in this paper, which we refer to as the "benchmarks".
It is in contrast to the benchmark algorithms that we can more fully understand the performance of Resonator Networks; our argument, which we will develop in the Results section, is that Resonator Networks strike a more natural balance between exploring the high-dimensional state space and using local information to move toward the solution. The benchmark algorithms are briefly introduced in Section 5.1, but they are each discussed at some length in the Appendix, including Table 2, which compiles the dynamics specified by each. We provide implementations of each algorithm in the small software library that accompanies this paper [1].

5.1 Benchmark algorithms

A common thread among the benchmark algorithms is that they take the activation function g(·) to be the identity g: x ↦ x, making ĉ a multilinear function of the coefficients, as we discussed in Section 3. We experimented with other activation functions but found none for which the optimization approach performed better. We consider two straightforward loss functions for comparing c and ĉ. The first is one half the squared Euclidean norm of the error, L: x, y ↦ (1/2) ‖x − y‖²₂, which we will call the squared error for short, and the second is the negative inner product L: x, y ↦ −⟨x, y⟩. The squared error is minimized by ĉ = c, which is also true for the negative inner product when ĉ is constrained to [−1, 1]^N. Both of these loss functions are convex, meaning that L(c, ĉ) is a convex function of each a_f separately [2]. Some of the benchmark algorithms constrain a_f directly, and when that is the case, our focus is on three different convex sets, namely the simplex Δ_{D_f} := { x ∈ R^{D_f} | Σ_i x_i = 1, x_i ≥ 0 ∀i }, the unit ℓ1 ball B_{‖·‖₁}[1] := { x ∈ R^{D_f} | ‖x‖₁ ≤ 1 }, and the closed zero-one hypercube [0, 1]^{D_f}.
Therefore, solving (9) with respect to each a_f separately is a convex optimization problem. In the case of the negative inner product loss L: x, y ↦ −⟨x, y⟩ and simplex constraints C_f = Δ_{D_f}, it is a bona fide linear program. The correct factorization is given by a⋆_1, a⋆_2, ..., a⋆_F such that x̂^(f) = X_f a⋆_f = x^(f)_⋆ ∀f, which we know to be vectors with a single entry 1 and the rest 0; these are the standard basis vectors e_i (where (e_i)_j = 1 if j = i and 0 otherwise). The initial states a_1[0], a_2[0], ..., a_F[0] must be set with no prior knowledge of the correct factorization so, similar to what we do for Resonator Networks, we set each element of a_f[0] to the same value (which in general depends on the constraint set).

[1] https://github.com/spencerkent/resonator-networks
[2] Through the composition of an affine function with a convex function.

5.1.1 Alternating Least Squares

Alternating Least Squares (ALS) locally minimizes the squared error loss in a fairly straightforward way: for each factor, one at a time, it solves a least squares problem for a_f and updates the current state of the estimate ĉ to reflect this new value, then moves on to the next factor and repeats. Formally, the updates given by Alternating Least Squares are:

    a_f[t+1] = argmin over a_f of (1/2) ‖ c − ô^(f)[t] ⊙ (X_f a_f) ‖²₂ = (ξ^⊤ ξ)^{−1} ξ^⊤ c,   where ξ := diag( ô^(f)[t] ) X_f    (10)

Alternating Least Squares is an algorithm that features prominently in the tensor decomposition literature (Kolda and Bader, 2009), but while ALS has been successful for a particular type of tensor decomposition, there are a few details which make our problem different from what is normally studied (see Appendix D). The updates in ALS are quite greedy; they exactly solve each least squares subproblem. It may make sense to more gradually modify the coefficients, a strategy that we turn to next.
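One ALS update (equation (10)) reduces to a standard least squares solve. The sketch below, with a random stand-in for ô^(f)[t] and dimensions of our choosing, checks the closed form against a generic solver:

```python
import numpy as np

rng = np.random.default_rng(5)
N, D = 400, 10
X_f = rng.choice([-1.0, 1.0], size=(N, D))
o_hat = rng.choice([-1.0, 1.0], size=N)     # stand-in for o_hat^(f)[t]
c = rng.choice([-1.0, 1.0], size=N)

# Eq. (10): xi = diag(o_hat) X_f, and a_f = (xi^T xi)^{-1} xi^T c.
xi = o_hat[:, None] * X_f
a_closed = np.linalg.solve(xi.T @ xi, xi.T @ c)

# The same update via a generic least squares solver.
a_lstsq = np.linalg.lstsq(xi, c, rcond=None)[0]
```

Both routes give the same coefficients; the closed form just makes explicit the normal-equations structure that the bipolarized simplification in Section 5.2.2 will exploit.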
5.1.2 Gradient-following algorithms

Another natural strategy for solving (9) is to make updates that incorporate the gradient of L with respect to the coefficients; each of the next five algorithms does this in a particular way (we write out the gradients for both loss functions in Appendix E). The squared error loss is globally minimized by ĉ = c, so one might be tempted to start from some initial values for the coefficients and make gradient updates a_f[t+1] = a_f[t] − η ∇_{a_f} L. In Appendix E.1 we discuss why this does not work well; the difficulty is in being able to guarantee that the loss function is smooth enough that gradient descent iterates with a fixed step size will converge. Instead, the algorithms we apply to the squared error loss utilize a dynamic step size.

Iterative Soft Thresholding: The global minimizers of (9) are maximally sparse, ‖a⋆_f‖₀ = 1. If one aims to minimize the squared error loss while loosely constrained to sparse solutions, it may make sense to solve the problem with Iterative Soft Thresholding (ISTA). The dynamics for ISTA are given by equation (28) in Table 2.

Fast Iterative Soft Thresholding: We also considered Fast Iterative Soft Thresholding (FISTA), an enhancement due to Beck and Teboulle (2009), which utilizes Nesterov's momentum for accelerating first-order methods in order to alleviate the sometimes slow convergence of ISTA (Bredies and Lorenz, 2008). Dynamics for FISTA are given in equation (29).

Projected Gradient Descent: Another benchmark algorithm we considered was Projected Gradient Descent on the negative inner product loss, where updates were projected onto either the simplex or the unit ℓ1 ball (30). A detailed discussion of this approach can be found in Appendix G.

Multiplicative Weights: This is an algorithm that can be applied to either loss function, although we found it worked best on the negative inner product.
It very elegantly enforces a simplex constraint on a_f by maintaining a set of auxiliary variables, the 'weights', which are used to set a_f at each iteration. See equation (31) for the dynamics of Multiplicative Weights, as well as Appendix H.

Map Seeking Circuits: The final algorithm that we considered is called Map Seeking Circuits. Map Seeking Circuits are neural networks designed to solve invariant pattern recognition problems using the principle of superposition. Their dynamics are based on the gradient but differ from what we have introduced so far; see equation (32) and Appendix I.

5.2 Contrasting Resonator Networks with the benchmarks

5.2.1 Convergence of the benchmarks

A remarkable fact about the benchmark algorithms is that each one converges for all initial conditions, which we directly prove, or refer to results proving, in the Appendix. That is, given any starting coefficients a_f[0], their dynamics reach fixed points which are local minimizers of the loss function. In some sense, this property is an immediate consequence of treating factorization as an optimization problem; the algorithms we chose as benchmarks were designed this way. Convergence to a local minimizer is a desirable property, but unfortunately the fundamental non-convexity of the optimization problem implies that this may not guarantee good local minima in practice. In Section 6 we establish a standardized setting in which we measure how likely it is that these local minima are actually global minima. We find that as long as M, the size of the search space, is small enough, each of these algorithms can find the global minimizers reliably. The point at which the problem becomes too large to reliably solve is what we call the operational capacity of the algorithm, and it is a main point of comparison with Resonator Networks.
5.2.2 An algorithmic interpretation of Resonator Networks

The benchmark algorithms generate estimates for the factors, x̂^(f)[t], that move through the interior of the [−1, 1] hypercube. Resonator Networks, on the other hand, do not. The sgn(·) function 'bipolarizes' inputs to the nearest vertex of the hypercube, and this highly nonlinear function, which changes not only the length but also the angle of an input vector, is key. We know the solutions x^(f)_⋆ exist at vertices of the hypercube, and these points are very special geometrically in the sense that in high dimensions, most of the mass of [−1, 1]^N is concentrated relatively far from the vertices; this is a fact we will not prove here, but it is based on standard results from the study of concentration inequalities (Boucheron et al., 2013). Our motivation for using the sgn(·) activation function is that moving through the interior of the hypercube while searching for a factorization is unwise, a conjecture for which we will provide some empirical support in the Results section.

One useful interpretation of OLS Resonator Network dynamics is that the network computes a bipolarized version of Alternating Least Squares. Suppose we were to take the dynamics specified in (10) for making ALS updates to a_f[t+1], but we also bipolarize the vector x̂^(f)[t+1] at the end of each step. When each x̂^(f)[t+1] is bipolar, the vector ô^(f)[t] is bipolar, and we can simplify (ξ^⊤ ξ)^{−1} ξ^⊤:

    ô^(f)[t] ∈ {−1, 1}^N  ⟹  (ξ^⊤ ξ)^{−1} ξ^⊤ = ( X_f^⊤ diag( ô^(f)[t] )² X_f )^{−1} X_f^⊤ diag( ô^(f)[t] )
                                             = ( X_f^⊤ X_f )^{−1} X_f^⊤ diag( ô^(f)[t] )
                                             = X_f^† diag( ô^(f)[t] )    (11)

Now a_f[t+1] = X_f^† ( ô^(f)[t] ⊙ c ), which one can see from equation (8) is precisely the update used by Resonator Networks with OLS weights.
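The simplification in equation (11) can be verified numerically: when ô^(f)[t] is bipolar, diag(ô)² = I and the least squares operator collapses to X_f^† diag(ô). A sketch with random data (sizes are ours):

```python
import numpy as np

rng = np.random.default_rng(7)
N, D = 300, 6
X_f = rng.choice([-1.0, 1.0], size=(N, D))
o_hat = rng.choice([-1.0, 1.0], size=N)     # bipolar, so diag(o_hat)^2 = I

xi = o_hat[:, None] * X_f                   # xi = diag(o_hat) X_f
lhs = np.linalg.solve(xi.T @ xi, xi.T)      # (xi^T xi)^{-1} xi^T
rhs = np.linalg.pinv(X_f) * o_hat[None, :]  # X_f^+ diag(o_hat)
agree = np.allclose(lhs, rhs)
```

The identity fails as soon as ô^(f)[t] leaves the hypercube vertices, which is why this reading of the dynamics only holds with the sgn(·) activation in place.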
An important word of caution on this observation: it is somewhat of a misnomer to call this algorithm Bipolarized Alternating Least Squares, because at each iteration it is not solving a least squares problem, and this conceals a profound difference. To set a_f[t+1] = X_f^† ( ô^(f)[t] ⊙ c ) is to take the term g(X_f a_f[t]) present in the loss function and treat the activation function g(·) as if it were linear, which it clearly is not. These updates are not computing a least squares solution at each step. We actually lose the guarantee of global convergence that comes with Alternating Least Squares, but this is an exchange well worth making, as we will show in the Results section.

Unlike Hopfield Networks, which have a Lyapunov function certifying their global asymptotic stability, no such function (that we know of) exists for a Resonator Network. While ĉ = c is always a fixed point of the OLS dynamics, a network initialized to a random state is not guaranteed to converge. We have observed trajectories that collapse to limit cycles, and seemingly chaotic trajectories that do not converge in any reasonable time. One a priori indication that this is the case comes from a simple rewriting of two-factor Resonator Network dynamics that concatenates the states for each factor into a single state space. To make the transformation exact, we appeal to the continuous-time version of Resonator Networks, which, just like Hopfield Networks, defines dynamics in terms of time derivatives of the pre-activation state: u̇^(f)(t) = X_f X_f^† ( ô^(f)(t) ⊙ c ), with x̂^(f)(t) = g( u^(f)(t) ). We write down the continuous-time dynamics à la autoassociative Hopfield Networks:

    [ u̇^(1)(t) ]   [ 0                    X_1 X_1^† diag(c) ] [ x̂^(1)(t) ]
    [ u̇^(2)(t) ] = [ X_2 X_2^† diag(c)    0                 ] [ x̂^(2)(t) ]
One can see that the weight matrix is non-symmetric, which has a simple but important consequence: autoassociative networks with non-symmetric weights cannot be guaranteed to converge in general. This result, first established by Cohen and Grossberg (Cohen and Grossberg, 1983) and then studied throughout the Hopfield Network literature, is not quite as strong as it may sound, in the sense that symmetry is a sufficient, but not necessary, condition for convergence. One can design a globally-convergent autoassociative network with asymmetric weights (Xu et al., 1996), and moreover, adding a degree of asymmetry has been advocated as a technique to reduce the influence of spurious fixed points (Hertz et al., 1986; Singh et al., 1995; Chengxiang et al., 2000). Resonator Networks have a large and practical regime of operation, where $M$ (the problem size) is small enough, in which non-converging trajectories are extremely rare. It is simple to deal with these events, making the model still useful in practice despite the lack of a convergence guarantee. It has also been argued in several places (see Van Vreeswijk and Sompolinsky (1996), for example) that cyclic or chaotic trajectories may be useful to a neural system, including in cases where there are multiple plausible states to entertain. This is just to say that we feel the lack of a convergence guarantee is not a critical weakness of our model, but rather an interesting and potentially useful characteristic. We attempted many different modifications to the model's dynamics which would provably cause it to converge, but these changes always hindered its ability to solve the factorization problem. We emphasize that unlike all of the models in Section 5.1, a Resonator Network is not descending a loss function.
Rather, it makes use of the fact that:

• Each iteration is a bipolarized ALS update; it approximately moves the state towards the least squares solution for each factor.

• The correct solution is a fixed point (guaranteed for OLS weights, highly likely for OP weights).

• There may be a sizeable 'basin of attraction' around this fixed point, into which the iterates help us descend.

• The number of spurious fixed points (which do not give the correct factorization) is relatively small.

This last point is really what distinguishes Resonator Networks from the benchmarks, which we will establish in Section 6.6.

6 Results

We present a characterization of Resonator Networks along three main directions. The first direction is the stability of the solutions $x^{(f)}_\star$, which we relate to the stability of classical Hopfield networks. The second is a fundamental measure of factorization capability we call the "operational capacity". The third is the speed with which factorizations are found. We argue that the marked difference in factorization performance between our model and the benchmark algorithms lies in the relative scarcity of spurious fixed points enjoyed by Resonator Network dynamics. We summarize the main results in bold throughout this section. In each of the simulations we choose codevectors randomly i.i.d. from the discrete uniform distribution over the vertices of the hypercube: each element of each codevector is a Rademacher random variable (assuming the value $-1$ with probability $0.5$ and $+1$ with probability $0.5$). We generate $c$ by choosing one vector at random from each of the $F$ codebooks and then computing the Hadamard product among these vectors.
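The simulation setup above can be sketched in a few lines; the function name, seeding convention, and $(N, D)$ codebook layout are our own.

```python
import numpy as np

def make_problem(N, D, F, seed=None):
    """Sample F random codebooks and a composite vector c. Each codevector
    entry is a Rademacher variable (+1 or -1 with probability 1/2); c is the
    Hadamard product of one randomly chosen codevector per codebook."""
    rng = np.random.default_rng(seed)
    codebooks = [rng.choice([-1, 1], size=(N, D)) for _ in range(F)]
    true_idx = [int(rng.integers(D)) for _ in range(F)]   # ground-truth choices
    c = np.prod([X[:, i] for X, i in zip(codebooks, true_idx)], axis=0)
    return codebooks, true_idx, c
```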
The reason we choose vectors randomly is that it makes the analysis of performance somewhat easier and more standardized, and it is the setting in which most of the well-known results on Hopfield Network capacity apply; we will make a few connections to these results. It is also the setting in which we typically use the Multiply, Add, Permute VSA architecture (Gayler, 2004), and therefore these results on random vectors are immediately applicable to a variety of existing works.

6.1 Stable-solution capacity with outer product weights

Suppose $\hat{x}^{(f)}[0] = x^{(f)}_\star$ for all $f$ (we initialize it to the correct factorization; this will also apply to any $t$ at which the algorithm comes upon $x^{(f)}_\star$ on its own). What is the probability that the state stays there, i.e. that the correct factorization is a fixed point of the dynamics? This is the basis of what researchers have called the "capacity" of Hopfield Networks, where $x^{(f)}_\star$ are patterns that the network has been trained to store. We choose to call it the "stable-solution capacity" in order to distinguish it from operational capacity, which we define in Section 6.2. We first note that this analysis is necessary only for Resonator Networks with outer product weights; Ordinary Least Squares weights guarantee that the solutions are stable, and this is one of that variant's desirable properties. If $\hat{x}^{(f)}[0] = x^{(f)}_\star$ for all $f$, then factor 1 in a Resonator Network "sees" an input $x^{(1)}_\star$ at time $t = 1$. For OLS weights, the vector $X_1 X_1^\dagger x^{(1)}_\star$ is exactly $x^{(1)}_\star$ by the definition of orthogonal projection. The same being true for all subsequent factors, this means that for OLS weights, $x^{(f)}_\star$ is always a fixed point.
For a Resonator Network with outer product weights, we must examine the vector $\Gamma := X_f X_f^\top \big(\hat{o}^{(f)}[0] \odot c\big)$ at each $f$, and changing from the pseudoinverse $X_f^\dagger$ to the transpose $X_f^\top$ makes the situation significantly more complicated. At issue is the probability that $\Gamma_i$ has a sign different from $x^{(f)}_{\star i}$, i.e. that there is a bitflip in any particular component of the updated state. In general one may not care whether the state is completely stable; it may be tolerable that the dynamics flip some small fraction of the bits of $x^{(f)}_\star$, as long as they do not move the state too far away from $x^{(f)}_\star$. Amit, Gutfreund, and Sompolinsky (Amit et al., 1985, 1987) established that in Hopfield Networks, an avalanche phenomenon occurs where bitflips accumulate and the network becomes essentially useless for values of $D_f > 0.138N$, at which point the approximate bitflip probability is $0.0036$. While we do not attempt any of this complicated analysis on Resonator Networks, we do derive an expression for the bitflip probability of any particular factor that accounts for bitflips which "percolate" from factor to factor through the vector $\hat{o}^{(f)}[0] \odot c$. We start by noting that for factor 1, this bitflip probability is the same as in a Hopfield network. Readers familiar with the literature on Hopfield Networks will know that with $N$ and $D_f$ reasonably large (approximately $N \geq 1{,}000$ and $D_f \geq 50$), $\Gamma_i$ can be well-approximated by a Gaussian with mean $x^{(f)}_{\star i}(N + D_f - 1)$ and variance $(N-1)(D_f-1)$; see Appendix J for a simple derivation. This is summarized as the Hopfield bitflip probability $h_f$:

$$h_f := \Pr\left[\hat{x}^{(f)}[1]_i \neq x^{(f)}_{\star i}\right] = \Phi\left(\frac{-N - D_f + 1}{\sqrt{(N-1)(D_f-1)}}\right) \qquad (12)$$

where $\Phi$ is the cumulative distribution function of the standard Normal distribution.
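Equation (12) is easy to evaluate numerically; the sketch below (the function name is ours) uses the standard Normal CDF from the Python standard library.

```python
from math import sqrt
from statistics import NormalDist

def hopfield_bitflip_probability(N, D_f):
    """Gaussian approximation of equation (12):
    h_f = Phi((-N - D_f + 1) / sqrt((N - 1) * (D_f - 1))).
    The approximation is good for roughly N >= 1000 and D_f >= 50."""
    return NormalDist().cdf((-N - D_f + 1) / sqrt((N - 1) * (D_f - 1)))
```

For example, at $D_f = N$ the argument approaches $-2$, recovering the maximum bitflip probability of about $0.02275$ for a Hopfield network with self-connections.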
Hopfield Networks are often specified with the diagonal of $X_f X_f^\top$ set to all zeros (having "no self-connections"), in which case the bitflip probability is $\Phi\big(\frac{-N}{\sqrt{(N-1)(D_f-1)}}\big)$. For large $N$ and $D_f$ this is often simplified to $\Phi\big(-\sqrt{N/D_f}\big)$, which may be the expression most familiar to readers. Keeping the diagonal of $X_f X_f^\top$ makes the codevectors more stable (see Appendix J), and while there are some arguments in favor of eliminating it, we have found Resonator Networks to exhibit better performance by keeping these terms. In Appendix J we derive the bitflip probability for an arbitrary factor in a Resonator Network with outer product weights. This probability depends on whether a component of the state has already been flipped by the previous $f - 1$ factors, which is what we call percolated noise passed between the factors, and which increases the bitflip probability. The four relevant probabilities are:

$$r_f := \Pr\left[\hat{x}^{(f)}[1]_i \neq x^{(f)}_{\star i}\right] \qquad (13)$$

$$n_f := \Pr\left[\big(\hat{o}^{(f+1)}[0] \odot c\big)_i \neq x^{(f+1)}_{\star i}\right] \qquad (14)$$

$$r'_f := \Pr\left[\hat{x}^{(f)}[1]_i \neq x^{(f)}_{\star i} \,\middle|\, \big(\hat{o}^{(f)}[0] \odot c\big)_i = x^{(f)}_{\star i}\right] \qquad (15)$$

$$r''_f := \Pr\left[\hat{x}^{(f)}[1]_i \neq x^{(f)}_{\star i} \,\middle|\, \big(\hat{o}^{(f)}[0] \odot c\big)_i \neq x^{(f)}_{\star i}\right] \qquad (16)$$

Equation (13) is the probability of a bitflip compared to the correct value, the Resonator bitflip probability. Equation (14) gives the probability that the next factor will see a net bitflip, a bitflip which has percolated through the previous factors. Equations (15) and (16) give the probability of a bitflip conditioned on whether or not this factor sees a net bitflip, and they are different.
From a straightforward conditioning argument,

$$r_f = r'_f (1 - n_{f-1}) + r''_f\, n_{f-1} \qquad (17)$$

and also

$$n_f = r'_f (1 - n_{f-1}) + (1 - r''_f)\, n_{f-1} \qquad (18)$$

We show via straightforward algebra in Appendix J that the conditional probabilities $r'_f$ and $r''_f$ can be written recursively in terms of $n_f$:

$$r'_f = \Phi\left(\frac{-N(1 - 2n_{f-1}) - (D_f - 1)}{\sqrt{(N-1)(D_f-1)}}\right) \qquad (19)$$

$$r''_f = \Phi\left(\frac{-N(1 - 2n_{f-1}) + (D_f - 1)}{\sqrt{(N-1)(D_f-1)}}\right) \qquad (20)$$

The Resonator bitflip probability $r_f$ has to be computed recursively using these expressions. The base case is $n_0 = 0$, and this is sufficient to compute all the other probabilities; in particular, it implies that $r_1 = h_1 = \Phi\big(\frac{-N - D_1 + 1}{\sqrt{(N-1)(D_1-1)}}\big)$, as we previously indicated. We can verify these equations in simulation, and the agreement is very good; see Figure 14 in the Appendix, which measures $r_f$. The main analytical result in this section is the sequence of equations (17)-(20), which allow one to compute the bitflip probabilities for each factor in an outer product Resonator Network. The fact that $r_f$ in general must be split between the two conditional probabilities, and that there is a dependence on $n_{f-1}$, is what makes it different, for all but the first factor, from the bitflip probability of a Hopfield Network (compare eqs. (19) and (20) against eq. (12)). But how much different? We are interested in the quantity $r_f - h_f$. Here is a simple intuition for what this is capturing: suppose there are $F$ Hopfield Networks all evolving under their own dynamics; they are running simultaneously but not interacting in any way. At time $t = 0$, the bitflip probabilities $h_1, h_2, \ldots, h_F$ for the networks are all the same; there is nothing special about any particular one. A Resonator Network, however, is like a set of $F$ Hopfield networks that have been wired up to receive input $\hat{o}^{(f)}[t] \odot c$, which reflects the state of the other factors.
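The recursion in equations (17)-(20) can be iterated directly; the sketch below assumes, for simplicity, that all codebooks share the same size $D$ (that simplification and the function name are ours).

```python
from math import sqrt
from statistics import NormalDist

def resonator_bitflip_probabilities(N, D, F):
    """Iterate equations (17)-(20) with base case n_0 = 0, assuming all
    codebooks have the same size D. Returns [r_1, ..., r_F]."""
    Phi = NormalDist().cdf
    s = sqrt((N - 1) * (D - 1))
    n_prev = 0.0                 # n_0 = 0: factor 1 sees no percolated noise
    r = []
    for _ in range(F):
        r_p = Phi((-N * (1 - 2 * n_prev) - (D - 1)) / s)   # eq. (19)
        r_pp = Phi((-N * (1 - 2 * n_prev) + (D - 1)) / s)  # eq. (20)
        r.append(r_p * (1 - n_prev) + r_pp * n_prev)       # eq. (17)
        n_prev = r_p * (1 - n_prev) + (1 - r_pp) * n_prev  # eq. (18)
    return r
```

With $n_0 = 0$ the first entry reduces to the Hopfield bitflip probability $h_1$ of equation (12), and later factors accumulate percolated noise.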
The networks are no longer independent. In particular, a bitflip in factor $f$ gets passed on to factors $f+1$, $f+2$, and so on. This affects the bitflip probability of these other factors, and the magnitude of this effect, which we call percolated noise, is measured by $r_f - h_f$.

Figure 1: Extra bitflip probability $r_f - h_f$ due to percolated noise. In the limit of large $F$, there appears to be a phase change at $D_f/N = 0.056$. Below this value Resonator Networks are just as stable as Hopfield Networks, but above this value they are strictly less stable (by the amount $r_f - h_f$).

Let us first note that for a Hopfield network with self-connections, the maximum bitflip probability is $0.02275$, which occurs at $D_f = N$. The ratio $D_f/N$ is what determines the bitflip probability; please see Appendix J for an explanation. Percolated noise is measured by the difference $r_f - h_f$, which we plot in Figure 1. Part (a) shows just five factors, illustrating that $r_1 = h_1$, but that $r_f \geq h_f$ in general. To see if there is some limiting behavior, we simulated $100$ and $10{,}000$ factors; the differences $r_f - h_f$ are also shown in Figure 1. In the limit of large $F$ there appears to be a phase change in residual bitflip probability that occurs at $D_f/N = 0.056$. In the Hopfield Network literature this is a very important number: it gives the point at which the codevectors transition away from being global minimizers of the Hopfield Network energy function. When $D_f/N$ falls in between $0.056$ and $0.138$, the codevectors are only local minimizers, and there exist spin-glass states that have lower energy. We do not further explore this phase-change phenomenon, but leave the (in all likelihood, highly technical) analysis to future work. In conclusion, the second major result of this section is that we have shown, via simulation, that for $D_f/N \leq 0.056$, the stability of a Resonator Network with outer product weights is the same as the stability of a Hopfield Network. For $D_f/N > 0.056$, percolated noise between the factors causes the Resonator Network to be strictly less stable than a Hopfield Network.

6.2 Operational capacity

We now define a new notion of capacity that is more appropriate to the factorization problem. This performance measure, called the operational capacity, gives an expression for the maximum size of factorization problem that can be solved with high probability. This maximum problem size, which we denote by $M_{\max}$, varies as a function of the number of elements in each vector, $N$, and the number of factors, $F$. It gives a very practical characterization of performance, and will form the basis of our comparison between Resonator Networks and the benchmark algorithms we introduced in Section 5.1. When the problem size $M$ is below the operational capacity of the algorithm, one can be quite sure that the correct factorization will be efficiently found.

Definition 1. The $\{p, k\}$ operational capacity of a factorization algorithm that solves (1) is the largest search space size $M_{\max}$ such that the algorithm, when limited to a maximum number of iterations $k$, gives a total accuracy $\geq p$.

We now define what we mean by total accuracy. Each algorithm we have introduced attempts to solve the factorization problem (1) by initializing the state $\hat{x}^{(f)}[0]$ and letting the dynamics evolve until some termination criterion is met. It is possible that the final state $\hat{x}^{(f)}[\infty]$ may not equal the correct factors $x^{(f)}_\star$ at each and every component, but we can 'decode' each $\hat{x}^{(f)}[\infty]$ by looking for its nearest neighbor (with respect to Hamming distance or cosine similarity) among the vectors in its respective codebook $X_f$.
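This nearest-neighbor clean-up can be sketched as follows; the function name is our own, and we use cosine similarity (which, for bipolar vectors, selects the same codevector as minimizing Hamming distance).

```python
import numpy as np

def clean_up(x_hat, codebook):
    """Nearest-neighbor decoding of a final state: return the index of the
    codevector most similar to x_hat. codebook has shape (N, D_f)."""
    sims = codebook.T @ x_hat
    sims = sims / (np.linalg.norm(codebook, axis=0) * np.linalg.norm(x_hat))
    return int(np.argmax(sims))
```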
This distance computation involves only $D_f$ vectors, rather than $M$, which was what we encountered in one of the brute-force strategies of Section 2. Compared to the other computations involved in finding the correct factorization out of $M$ total possibilities, this last step of decoding has a very small cost, and we always 'clean up' the final state $\hat{x}^{(f)}[\infty]$ using its nearest neighbor in the codebook. We define the total accuracy to be the sum of accuracies for inferring each factor, which is $1/F$ if the nearest neighbor to $\hat{x}^{(f)}$ is $x^{(f)}_\star$ and $0$ otherwise. For instance, correctly inferring one of three total factors gives a total accuracy of $1/3$, two of three gives $2/3$, and three of three gives $1$. Analytically deriving the expected total accuracy appears to be quite challenging, especially for a Resonator Network, because it requires that we essentially predict how the nonlinear dynamics will evolve over time. There may be a region around each $x^{(f)}_\star$ such that states in this region rapidly converge to $x^{(f)}_\star$, the so-called basin of attraction, but our initial estimate $\hat{x}^{(f)}[0]$ is likely not in the basin of attraction, and it is hard to predict when, if ever, the dynamics will enter this region. Even for Hopfield Networks, which obey much simpler dynamics than a Resonator Network, it is known that so-called "frozen noise" builds up in the network state, making the shapes of the basins highly anisotropic and difficult to analyze (Amari and Maginu, 1988). Essentially all of the analytical results on Hopfield Networks consider only the stability of $x^{(f)}_\star$ as a (very poor) proxy for how the model behaves when it is initialized to other states. This less useful notion of capacity, the stable-solution capacity, was what we examined in the previous section.

Figure 2: Accuracy as a function of $M$ for a Resonator Network with outer product weights.
Three factors ($F = 3$), average over $5{,}000$ random trials.

We can, however, estimate the total accuracy by simulating many factorization problems, recording the fraction of factors that were correctly inferred over many, many trials. We remind the reader that our results in this paper pertain to factorization of randomly-drawn vectors which bear no particular correlational structure, but the notions of total accuracy and operational capacity would be relevant, and specific, to factorization of non-random vectors as well. We first note that for fixed vector dimensionality $N$, the empirical mean of the total accuracy depends strongly on $M$, the search space size. We can see this clearly in Figure 2. We show this phenomenon for a Resonator Network with outer product weights, but this general behavior holds for all of the algorithms under consideration: one can always make the search space large enough that expected total accuracy goes to zero. Our notion of operational capacity is concerned with the $M$ that causes expected total accuracy to drop below some value $p$. We see here that there is a range of values of $M$ for which the expected total accuracy is $1.0$, beyond which this ceases to be the case. For all values of $M$ within this range, the algorithm essentially always solves the factorization problem. In this paper we estimate operational capacity with $p = 0.99$ ($\geq 99\%$ of factors inferred correctly) and $k = 0.001M$ (the model can search over at most $1/1{,}000$ of the entire search space). These choices are largely practical: $\geq 99\%$ accuracy makes the model very reliable in practice, and this operating point can be estimated from a reasonable number ($3{,}000$ to $5{,}000$) of random trials. Setting $k = 0.001M$ allows the number of iterations to scale with the size of the problem, but restricts the algorithm to only consider a small fraction of the possible factorizations.
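The total accuracy measure used throughout the simulations can be stated in a few lines (the helper name is ours):

```python
def total_accuracy(decoded_idx, true_idx):
    """Total accuracy as defined above: each of the F factors contributes
    1/F if its decoded codebook index matches the ground truth, else 0."""
    matches = sum(1 for d, t in zip(decoded_idx, true_idx) if d == t)
    return matches / len(true_idx)
```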
While a Resonator Network has no guarantee of convergence, it almost always converges in far fewer than $0.001M$ iterations, so long as we stay in this high-accuracy regime. Operational capacity is in general a function of $N$ and $F$, which we will discuss shortly.

Figure 3: Operational capacity is dramatically higher for Resonator Networks (blue and red above) than for any of the benchmark algorithms. These points represent the size of factorization problem that can be solved reliably. Shown is operational capacity for $F = 3$ factors. The gap is similarly large for other $F$; see the plot for $F = 4$ in the Appendix.

6.2.1 Resonator Networks have superior operational capacity

We estimated the operational capacity of the benchmark algorithms in addition to the two variants of Resonator Networks. Figure 3 shows the operational capacity estimated from several thousand random trials, where we display $M_{\max}$ as a function of $N$ for problems with three factors. One can see that the operational capacity of Resonator Networks is roughly two orders of magnitude greater than the operational capacity of the other algorithms. Each of the benchmark algorithms has a slightly different operational capacity (due to the fact that they each have different dynamics and will, in general, find different solutions), but they are all similarly poor compared to the two variants of Resonator Networks. See a similar plot for $F = 4$ in Appendix B. As $N$ increases, the performance difference between the two variants of Resonator Networks starts to disappear, ostensibly due to the fact that $X_f X_f^\dagger \approx X_f X_f^\top$. The two variants are different in general, but the simulations in this paper do not particularly highlight the difference between them.
Except for Alternating Least Squares, each of the benchmark algorithms has at least one hyperparameter that must be chosen; we simulated many thousands of random trials with a variety of hyperparameter settings for each algorithm and chose the hyperparameter values that performed best on average. We list these values for each of the algorithms in the Appendix. All of the benchmark algorithms converge on their own, and the tunable stepsizes make a comparison of the number of iterations non-standardized, so we did not impose a maximum number of iterations on these algorithms: the points shown represent the best the benchmark algorithms can do, even when not restricted to a maximum number of iterations.

6.2.2 Operational capacity scales quadratically in N

We carefully measured the operational capacity of Resonator Networks in search of a relationship between $M_{\max}$ and $N$. We focused on Resonator Networks with outer product weights; for $N \approx 5000$ and larger, randomly-chosen codevectors are nearly orthogonal and capacity is approximately the same for OLS weights. We reiterate that operational capacity is specific to the parameters $p$ and $k$: $p$ is the threshold for total accuracy and $k$ is the maximum number of iterations the algorithm is allowed to take (refer to Definition 1). Here we report operational capacity for $p = 0.99$ and $k = 0.001M$ on randomly-sampled codevectors. The operational capacity is specific to these choices, which are practical for Vector Symbolic Architectures. Our simulations revealed that, empirically, Resonator Network operational capacity $M_{\max}$ scales as a quadratic function of $N$, which we illustrate in Figure 4. The points in this figure are estimated from many thousands of random trials, over a range of values for $F$ and $N$.
In part (a) we show operational capacity separately for each $F$ from $2$ to $7$, with the drawn curves indicating the least-squares quadratic fit to the measured points. In part (b) we put these points on the same plot, following a logarithmic transformation of each axis, in order to illustrate that capacity also varies as a function of $F$. Appendix B provides some additional commentary on this topic, including some speculation on a scaling law that combines $F$ and $N$. The parameters of this particular combined scaling are estimated from simulation and not derived analytically; therefore they may deserve additional scrutiny, and we do not focus on them here. The main message of this section is that capacity scales quadratically in $N$, regardless of how many factors are used.

Figure 4: Operational capacity of Resonator Networks with OP weights. (a) $M_{\max}$ scales quadratically in $N$. Red points are measured from simulation; black curves are the least-squares quadratic fits. Parameters of the fits are included in Appendix B. (b) $M_{\max}$ varies as a function of both $F$ and $N$. Over the measured range of $N$, capacity is highest for $F = 3$ and $F = 4$. Data for $F = 2$ was omitted to better convey the trend for $F = 3$ and higher, but see Appendix B for the full picture.

The curves in Figure 4 are constructive in the following sense: given a fixed $N$, they indicate the largest factorization problem that can be solved reliably. Conversely, and this is often the case in VSAs, the problem size $M$ is predetermined, while $N$ is variable; in this case we know how large one must make $N$. We include in the official software implementation that accompanies this paper (https://github.com/spencerkent/resonator-networks) a text file with all of the measured operational capacities. Quadratic scaling means that one can aspire to solve very large factorization problems, so long as one can build a Resonator Network with big enough $N$. We attempted to estimate capacity for even larger values of $N$ than we report in Figure 4, but this was beyond the capability of our current computational resources. A useful contribution of follow-on work would be to leverage high-performance computing to measure some of these values. Applications of Vector Symbolic Architectures typically use $N \leq 10{,}000$, but there are other reasons one might attempt to push Resonator Networks further. Early work on Hopfield Networks proposed a technique for storing solutions to the Travelling Salesman Problem as fixed points of the model's dynamics (Hopfield and Tank, 1985), and this became part of a larger approach of using nonlinear dynamical systems to solve hard search problems. We do not claim that any particular search problem, other than the factorization we have defined in (1), can be solved by Resonator Networks. Supposing, however, that some other hard problem can be cast in the form of (1), the quadratic scaling of operational capacity makes this a potentially powerful tool.

Capacity is highest when the codebooks $X_f$ each have the same number of codevectors ($D_1 = D_2 = \ldots = D_F = \sqrt[F]{M}$), and this was the case for the operational capacity results we have shown so far. We chose this in order to have a simple standard for comparison among the different algorithms, but in general it is possible that the codebooks are unbalanced, so that we have the same $M = \prod_f D_f$ but $D_1 \neq D_2 \neq \ldots \neq D_F$. In this case, capacity is lower than for balanced codebooks. We found that the most meaningful way to measure the degree of balance between codebooks was by the ratio of the smallest codebook to the largest codebook:

$$\xi := \min_f D_f \,/\, \max_f D_f \qquad (21)$$

For $\xi \geq 0.2$ we found that the effect on $M_{\max}$ was simply an additive factor which can be absorbed into a (slightly smaller) y-intercept $a$ for the quadratic fit. For extreme values of $\xi$, where one codebook is, for instance, $10$ or $20$ times larger than another, all three parameters $a$, $b$, and $c$ are affected, sometimes significantly. Scaling is still quadratic, but the actual capacity values may be significantly reduced.

Our result, a measured operational capacity indicating an approximately quadratic relationship between $M_{\max}$ and $N$, is an important characterization of Resonator Networks. It suggests that our framework scales to very large factorization problems and serves as a guideline for implementation. Our attempts to analytically derive this result were stymied by the limits of the toolbox of nonlinear dynamical systems theory. Operational capacity involves the probability that this system, when initialized to an effectively random state, converges to a particular set of fixed points. No results from the study of nonlinear dynamical systems, that we are aware of, allow us to derive such a strong statement about Resonator Networks. Still, the scaling in Figure 4 is fairly suggestive of some underlying law, and we are hopeful that a theoretical explanation exists, waiting to be discovered.

6.3 Search speed

If a Resonator Network is not consistently descending an energy function, is it just aimlessly wandering around the space, trying every possible factorization until it finds the correct one? Figure 5 shows that it is not. We plot the mean number of iterations over $5{,}000$ random trials, as a fraction of $M$, the search space size. This particular plot is based on a Resonator Network with outer product weights and $F = 3$. In the high-performance regime where $M$ is below operational capacity, the number of iterations is far less than the $0.001M$ cutoff we used in the simulations of Section 6.2: the algorithm only ever considers a tiny fraction of the possible factorizations before it finds the solution.

Figure 5: Iterations until convergence for a Resonator Network with outer product weights and $F = 3$. The number of iterations is very small compared to the size of the search space.

Section 6.2.1 compared the operational capacity of different algorithms and showed that, compared to the benchmarks, Resonator Networks can solve much larger factorization problems. This is in the sense that their dynamics eventually converge (with high probability) on the correct factorization while the dynamics of the other algorithms converge on spurious factorizations. This result, however, does not directly demonstrate the relative speed with which factorizations are found, in terms of either the number of iterations or the amount of time to convergence. We set up a benchmark to determine the relative speed of Resonator Networks, and our main finding is depicted in Figure 6. Measured in number of iterations, Resonator Networks are comparable to the benchmark algorithms. We noted that Alternating Least Squares is the most greedy of the benchmarks, and one can see from Figure 6 that it is the fastest in this sense. We are considering only trials that ultimately found the correct factorization, which in this simulation was roughly $70\%$ of trials for each of the benchmarks. In contrast, Resonator Networks always eventually found the correct factorization. Measured in terms of wall-clock time, Resonator Networks are significantly faster than the benchmarks. This can be attributed to their nearly $5\times$ lower per-iteration cost. Resonator Networks with outer product weights utilize very simple arithmetic operations, and this explains the difference between Figures 6b and 6c.
Figure 6: Our benchmark of factorization speed. Implementation in Python with NumPy, run on a machine with an Intel Core i7-6850k processor and 32GB RAM. We generated $5{,}000$ random instantiations of the factorization problem with $N = 1500$, $F = 3$, and $D_f = 40$, running each of the four algorithms in turn. (a) Convergence traces for $100$ randomly-drawn factorization problems; each line is the cosine similarity between $c$ and $\hat{c}$ over iterations of the algorithm. Each of the four algorithms is run on the same $100$ factorization problems. All of the instances are solved by the Resonator Network, whereas a sizeable fraction (around $30\%$) of the instances are not solved by the benchmark algorithms, at least within $100$ iterations. (b) Cosine similarity vs. iteration number (only trials with accuracy $1.0$). (c) Cosine similarity vs. wall-clock time (only trials with accuracy $1.0$). Panels (b) and (c) show average performance conditioned on the algorithms finding the correct factorization.

6.4 Dynamics that do not converge

One must be prepared for the possibility that the dynamics of a Resonator Network will not converge. Fortunately, for $M$ below the $p = 0.99$ operational capacity, such events are exceedingly rare. From simulation, we identified three major regimes of convergence behavior, which are depicted in Figure 7:

• For $M$ small enough, almost all trajectories converge, and moreover they converge to a state that yields the correct factorization. Limit cycles are possible but rare, and often still yield the correct factorization. There appear to be few if any spurious fixed points (those yielding an incorrect factorization); if the trajectory converges to a point attractor or limit cycle, one can be confident this state indicates the correct factorization.

• As $M$ increases, non-converging trajectories appear in greater proportion and yield incorrect factorizations. Any trajectories which converge on their own continue to yield the correct factorization, but these become less common.

• Beyond some saturation value $M_{\mathrm{sat}}$ (roughly depicted as the transition from red to blue in the figure), both limit cycles and point attractors re-emerge, and they yield incorrect factorizations.

Figure 7: Regimes of different convergence behavior. Curves show measurements from simulation of an outer product Resonator Network with $3$ factors and $N = 400$. This is also meant as a diagram of convergence behavior for Resonator Networks in general. Shown in black is the average decoding accuracy, and shown in gray is the median number of iterations taken by the network. For low enough $M$, the network always finds a fixed point yielding $100\%$ accuracy; the network will not converge to spurious fixed points in this regime (green). As $M$ is increased, more trajectories wander, not converging in any reasonable time (red); those that are forcibly terminated yield incorrect factorizations. For large enough $M$, the network is completely saturated and most states are fixed points, regardless of whether they yield the correct factorization (blue). Resonator Networks with OLS weights are always stable when $D_f = N$, but OP weights give a bitflip probability that is zero only asymptotically in $M$ (see Section 6.1 and Appendix J).

In theory, limit cycles of any length may appear, although in practice they tend to be skewed towards small cycle lengths. Networks with two factors are the most likely to find limit cycles, and this likelihood appears to decrease with increasing numbers of factors.

Figure 8: Factoring a corrupted $c$. For $M$ well below capacity (lighter curves above) one can sustain heavy corruption to $c$ and still find the correct factorization.
Our intuition about what happens in the middle section of Figure 7 is that the basins of attraction become very narrow and hard to find for the Resonator Network dynamics. The algorithm will wander, since it has so few spurious fixed points (see Section 6.6), but not be able to find any basin of attraction.

6.5 Factoring a 'noisy' composite vector

Our assumption has been that one combination of codevectors from our codebooks X_f generates c exactly. What if this is not the case? Perhaps the vector we are given for factorization has had some proportion ζ of its components flipped; that is, we are given c̃, where c̃ differs from c in exactly ⌊ζN⌋ places. The vector c has a factorization based on our codebooks but c̃ does not. We should hope that a Resonator Network will return the factors of c so long as the corruption is not too severe. This is an especially important capability in the context of Vector Symbolic Architectures, where c̃ will often be the result of some algebraic manipulations that generate noise and corrupt the original c to some degree. We show in Figure 8 that a Resonator Network can still produce the correct factorization even after a significant number of bits have been flipped. This robustness is more pronounced when the number of factorizations is well below operational capacity, at which point the model can often still recover the correct factorization even when 30% of the bits have been flipped.

6.6 A theory for differences in operational capacity

The failure mode of each benchmark algorithm is getting stuck at a spurious fixed point of the dynamics. This section develops a simple comparison between the spurious fixed points of Resonator Networks and the benchmarks as an explanation for why Resonator Networks enjoy relatively higher operational capacity.
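The corruption model of Section 6.5 is straightforward to set up in code. The sketch below (function names are ours, not from the paper) builds a composite c from one codevector per factor and flips exactly ⌊ζN⌋ of its components:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_problem(N, D, F):
    """Random bipolar codebooks and a composite c built from one
    codevector per factor (Hadamard product across factors)."""
    codebooks = [rng.choice([-1, 1], size=(N, D)) for _ in range(F)]
    true_idx = [int(rng.integers(D)) for _ in range(F)]
    c = np.prod([X[:, j] for X, j in zip(codebooks, true_idx)], axis=0)
    return codebooks, true_idx, c

def corrupt(c, zeta):
    """Flip exactly floor(zeta * N) randomly chosen components of c."""
    c_tilde = c.copy()
    n_flip = int(np.floor(zeta * len(c)))
    flip = rng.choice(len(c), size=n_flip, replace=False)
    c_tilde[flip] *= -1
    return c_tilde

codebooks, true_idx, c = make_problem(N=1500, D=40, F=3)
c_tilde = corrupt(c, zeta=0.3)
print(np.mean(c_tilde != c))  # fraction of disagreeing components, ≈ 0.3
```

Feeding c̃ rather than c to the network is the experiment summarized in Figure 8.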
From among the benchmarks we focus on Projected Gradient Descent (applied to the negative inner product with the simplex constraint) to illustrate this point. We will show that the correct factorization is always stable under Projected Gradient Descent (as it is with the OLS variant of Resonator Networks), but that incorrect factorizations are much more likely to be fixed points under Projected Gradient Descent. The definition of Projected Gradient Descent can be found in Table 2, with some comments in Appendix G.

6.6.1 Stability of the correct factorization

The vector of coefficients a_f is a fixed point of Projected Gradient Descent dynamics when the gradient at this point is exactly 0 or when it is in the nullspace of the projection operator. We write

N_{P_{C_f}}[x] := { z | P_{C_f}(x + z) = P_{C_f}(x) }    (22)

to denote this set of points. The nullspace of the projection operator is relatively small on the faces and edges of the simplex, but it becomes somewhat large at the vertices. We denote a vertex by e_i (where (e_i)_j = 1 if j = i and 0 otherwise). The nullspace of the projection operator at a vertex of the simplex is an intersection of halfspaces (each halfspace given by an edge of the simplex). We can compactly represent it with the following expression:

N_{P_{∆^{D_f}}}[e_i] = ⋂_{j≠i} { z | (e_i − e_j)^⊤ z ≥ 1 }    (23)

An equivalent way to express the nullspace is

N_{P_{∆^{D_f}}}[e_i] = { z | z_j ≤ z_i − 1  ∀ j ≠ i }    (24)

In other words, for a vector to be in the nullspace at e_i, the ith element of the vector must be the largest by a margin of 1 or more. This condition is met for the vector −∇_{a_f} L at the correct factorization, since −∇_{a_f} L = X_f^⊤ (ô^(f)[0] ⊙ c) = X_f^⊤ x_⋆^(f). This vector has a value N for the component corresponding to x_⋆^(f) and values that are ≤ N − 1 for all the other components.
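Condition (24) is easy to check numerically. In this sketch (our own construction, not from the paper), we verify that the negative gradient X_f^⊤ x_⋆^(f) at a correct codevector satisfies the vertex-stability margin z_j ≤ z_i − 1 for a random bipolar codebook:

```python
import numpy as np

rng = np.random.default_rng(1)

def stable_at_vertex(z, i):
    """Fixed-point condition (24): the i-th component of z must exceed
    every other component by a margin of at least 1."""
    others = np.delete(z, i)
    return bool(np.all(others <= z[i] - 1))

N, D = 256, 20
X = rng.choice([-1, 1], size=(N, D))  # one bipolar codebook
i = 3                                  # index of the "correct" codevector
z = X.T @ X[:, i]                      # negative gradient at vertex e_i

print(z[i])                   # 256: the self inner product equals N
print(stable_at_vertex(z, i)) # True: cross inner products are O(sqrt(N))
```

The off-vertex components are inner products of distinct random bipolar vectors, which concentrate around 0, so the margin holds with overwhelming probability.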
Thus, the correct factorization (the solution to (1) and global minimizer of (9)) is always a fixed point under the dynamics of Projected Gradient Descent (PGD). This matches the stability of OLS Resonator Networks, which are, by construction, always stable at the correct factorization. We showed in Section 6.1 that OP weights induce instability and that percolated noise makes the model marginally less stable than Hopfield Networks, but there is still a large range of factorization problem sizes where the network is stable with overwhelming probability. What distinguishes the benchmarks from Resonator Networks is what we cover next: the stability of incorrect factorizations.

6.6.2 Stability of incorrect factorizations

Suppose initialization is done with a random combination of codevectors that do not produce c. The vector ô^(f)[0] ⊙ c will be a completely random bipolar vector. So long as D_f is significantly smaller than N, which it always is in our applications, ô^(f)[0] ⊙ c will be nearly orthogonal to every vector in X_f, and its projection onto R(X_f) will be small, with each component equally likely to be positive or negative. Therefore, under the dynamics of a Resonator Network with OLS weights, each component will flip its sign compared to the initial state with probability 1/2, and the state for this factor will remain unchanged with the minuscule probability 1/2^N. The total probability of this incorrect factorization being stable, accounting for each factor, is therefore (1/2^N)^F. Suboptimal factorizations are very unlikely to be fixed points. The same is true for a Resonator Network with OP weights, because each element of the vector X_f X_f^⊤ (ô^(f)[0] ⊙ c) is approximately Gaussian with mean zero (see Section 6.1 and Appendix J). Contrast this against Projected Gradient Descent.
We recall from (24) that the requirement for e_i to be a fixed point is that the ith component of the gradient at this point be largest by a margin of 1 or more. This is a much looser stability condition than we had for Resonator Networks – such a scenario will actually occur with probability 1/D_f for each factor, and the total probability is 1/M. While still a relatively small probability, in typical VSA settings 1/M is much larger than (1/2^N)^F, meaning that compared to Resonator Networks, Projected Gradient Descent is much more stable at incorrect factorizations. Empirically, the failure mode of Projected Gradient Descent involves it settling on one of these spurious fixed points.

6.6.3 Stability in general

The cases of correct and incorrect factorizations drawn from the codebooks are two extremes along a continuum of possible states the algorithm can be in. For Projected Gradient Descent any state will be stable with probability in the interval [1/M, 1], while for Resonator Networks (with OLS weights) the interval is [1/2^{FN}, 1]. In practical settings for VSAs, the interval [1/2^{FN}, 1] is, in a relative sense, much larger than [1/M, 1]. Vectors drawn uniformly from either {−1, 1}^N or [−1, 1]^N concentrate near the lower end of these intervals, suggesting that on average, Projected Gradient Descent has many more spurious fixed points. This statement is not fully complete in the sense that dynamics steer the state along specific trajectories, visiting states in a potentially non-uniform way, but it does suggest that Projected Gradient Descent is much more susceptible to spurious fixed points. The next section shows that these trajectories do in fact converge on spurious fixed points as the factorization problem size grows.
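The gap between these two stability probabilities can be made concrete. Working in base-2 log space avoids underflow for 2^{−NF}; the problem sizes below are chosen to match the simulations in this section:

```python
import math

N, F, Df = 1500, 3, 50
M = Df ** F  # 125,000 candidate factorizations

# Probability that a random incorrect factorization is stable, as base-2 logs:
log2_p_pgd = -math.log2(M)  # Projected Gradient Descent: 1/M
log2_p_resonator = -N * F   # Resonator Network (OLS):   (1/2^N)^F

print(log2_p_pgd)        # ≈ -16.9, i.e. 1/M ≈ 8e-6
print(log2_p_resonator)  # -4500: astronomically smaller
```

The roughly 4,500-bit gap between the two exponents is why spurious fixed points dominate the benchmark's behavior long before they trouble a Resonator Network.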
6.6.4 Basins of attraction for benchmark algorithms

It may be that while there are sizable basins of attraction around the correct factorization, moving through the interior of the hypercube causes state trajectories to fall into the basin corresponding to a spurious fixed point. In a normal setting for several of the optimization-based approaches, we initialize a_f to be at the center of the simplex, indicating that each of the factorizations is equally likely. Suppose we were to initialize a_f so that it is just slightly nudged toward one of the simplex vertices. We might nudge it toward the correct vertex (the one given by a_f^⋆) or we might nudge it toward any of the other vertices, away from a_f^⋆. We can parameterize this with a single scalar θ and e_i chosen uniformly among the possible vertices:

a_f[0] = θ e_i + (1 − θ) (1/D_f) 1,    θ ∈ [0, 1],  i ∼ U{1, D_f}    (25)

We ran a simulation with N = 1500 and D_1 = D_2 = D_3 = 50, at which Projected Gradient Descent and Multiplicative Weights have a total accuracy of 0.625 and 0.525, respectively. We created 5,000 random factorization problems, initializing the state according to (25) and allowing the dynamics to run until convergence. We did this first with a nudge toward the correct factorization a_f^⋆ (squares in Figure 9) and then with a nudge away from a_f^⋆, toward a randomly-chosen spurious factorization (triangles in Figure 9). What Figure 9 shows is that by moving just a small distance toward the correct vertex, we very quickly fall into its basin of attraction. However, moving toward any of the other vertices is actually somewhat likely to take us into a spurious basin of attraction (where the converged state is decoded into an incorrect factorization). The space is full of these bad directions.
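Initialization (25) can be written out directly; variable names below are ours:

```python
import numpy as np

rng = np.random.default_rng(2)

def nudged_init(Df, theta, i):
    """Equation (25): convex combination of vertex e_i and the simplex
    center (1/Df) * ones. theta = 0 gives the center, theta = 1 the vertex."""
    e_i = np.zeros(Df)
    e_i[i] = 1.0
    return theta * e_i + (1 - theta) * np.ones(Df) / Df

Df = 50
a0 = nudged_init(Df, theta=0.1, i=int(rng.integers(Df)))

print(np.isclose(a0.sum(), 1.0))  # True: still on the simplex
print(bool(a0.min() >= 0))        # True: still non-negative
```

Because the result is a convex combination of two simplex points, it stays on the simplex for every θ ∈ [0, 1], so the nudged state is always a valid initialization for the simplex-constrained benchmarks.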
It would be very lucky indeed to start from the center of the simplex and move immediately toward the solution – it is far more likely that initial updates take us somewhere else in the space, toward one of the other vertices, and this plot shows that those trajectories often get pulled towards a spurious fixed point. What we are demonstrating here is that, empirically, the interior of the hypercube is somewhat treacherous from an optimization perspective, and this lies at the heart of why the benchmark algorithms fail.

Figure 9: States in the hypercube interior get pulled into spurious basins of attraction. Projected Gradient Descent is in green and Multiplicative Weights is in orange. The network is initialized at a distance θ from the center of the simplex (see equation (25)) and allowed to converge. The y-axis is the accuracy of the factorization implied by the converged state. Triangles indicate initialization slightly away from a_f^⋆, toward any of the other simplex vertices, which is most directions in the space. These initial states get quickly pulled into a spurious basin of attraction.

From among the benchmarks, we restricted our analysis of spurious fixed points to Projected Gradient Descent and, in Figure 9, Multiplicative Weights. This choice was made for clarity, and similar arguments apply for all of the benchmarks. While the details may differ slightly (for instance, spurious fixed points of ALS appear near the simplex center, not at a vertex), the failure mode of the benchmarks is strikingly consistent. They all become overwhelmed by spurious fixed points, long before this effect is felt by Resonator Networks. We have shown that, in expectation, Projected Gradient Descent has many more spurious fixed points than Resonator Networks.
We have also shown that trajectories moving through the interior of the hypercube are easily pulled into these spurious basins of attraction.

7 Discussion

We studied a vector factorization problem which arises in the use of Vector Symbolic Architectures (as introduced in part one of this series (Frady et al., 2020)), showing that Resonator Networks solve this problem remarkably well. Their performance comes from a particular form of nonlinear dynamics, coupled with the idea of searching in superposition. Solutions to the factorization problem lie in a small sliver of R^N – i.e., the corners of the bipolar hypercube {−1, 1}^N – and the highly nonlinear activation function of Resonator Networks serves to constrain the search to this subspace. We drew connections between Resonator Networks and a number of benchmark algorithms which cast factorization as a problem of optimization. This intuitively-satisfying formulation appears to come at a steep cost. None of the benchmarks were competitive with Resonator Networks in terms of the key metrics that characterize factorization performance. One explanation for this is that the benchmarks have comparatively many more spurious fixed points of their dynamics, and that the loss function landscape in the interior of the hypercube induces trajectories that approach these spurious fixed points. Unlike the benchmarks, Resonator Networks do not have a global convergence guarantee, and in some respects we see this as a beneficial characteristic of the model. Requiring global convergence appears to unnecessarily constrain the search for factorizations, leading to lower capacity. Besides, operational capacity (defined in this paper) specifies a regime where the lack of a convergence guarantee can be practically ignored. Resonator Networks almost always converge in this setting, and the fixed points yield the correct solution.
The benchmarks are, by steadfastly descending a loss function, in some sense more greedy than Resonator Networks. It appears that Resonator Networks strike a more natural balance between 1) making updates based on the best-available local information and 2) still exploring the solution space without getting stuck. Our approach follows a kind of "Goldilocks principle" on this tradeoff – not too much, not too little, but just right. We are not the first to consider eschewing convergence guarantees to better solve hard search problems. For instance, randomized search algorithms utilize some explicit form of randomness to find better solutions, typically only converging if this randomness is reduced over time (Spall, 2005). In contrast, our model is completely deterministic, and the searching behavior comes from nonlinear heteroassociative dynamics. Another example is the proposal to add small amounts of random asymmetry to the (symmetric) weight matrix of Hopfield Networks (Hertz et al., 1986). This modification removes the guaranteed absence of cyclic and chaotic trajectories that holds for the traditional Hopfield model. But, at the same time, and without significantly harming the attraction of memory states, adding asymmetry to the weights can improve associative memory recall by shrinking the basins of attraction associated with spurious fixed points (Singh et al., 1995; Chengxiang et al., 2000). We emphasize that, while Resonator Networks appear to be better than the alternatives for the particular vector factorization problem (1), this is not a claim that they are appropriate for other hard search problems. Rather, Resonator Networks are specifically designed for the vector factorization problem at hand.
There exist several prior works involving some aspect of factorization that we would like to mention here, but we emphasize that each one of them deals with a problem or approach that is distinct from what we have introduced in this paper. Tensor decomposition is a family of problems that bear some resemblance to the factorization problem we have introduced (1). Key differences include the object to be factored, which is a higher-order tensor, not a vector, and constraints on the allowable factors. We explain in Appendix D how our factorization problem is different from traditional tensor decompositions. Our benchmarks actually included the standard tensor decomposition algorithm, Alternating Least Squares, re-expressed for (1), and we found that it is not well-matched to this factorization problem. Bidirectional Associative Memory, proposed by Kosko (1988), is an extension of Hopfield Networks that stores pairs of factors in a matrix using the outer product learning rule. The composite object is a matrix, rather than a vector, and is much closer to a particular type of tensor decomposition called the 'CP decomposition', which we elaborate on in Appendix D. Besides the fact that this model applies only to two-factor problems, its dynamics are different from ours and its capacity is relatively low (Kobayashi et al., 2002). Subsequent efforts to extend this model to factorizations with 3 or more factors (Huang and Hagiwara, 1999; Kobayashi et al., 2002) have had very limited success and still rely on matrices that connect pairs of factors, rather than the single multilinear product we have in our model. Bilinear Models of Style and Content (Tenenbaum and Freeman, 2000) was an inspiration for us in deciding to work on factorization problems.
That paper applies a different type of tensor decomposition, a 'Tucker decomposition' (again see Appendix D), to a variety of different real-valued datasets, using what appears to be in one case a closed-form solution based on the Singular Value Decomposition, and in the other case a variant of Alternating Least Squares. In that sense, their method is different from ours, the factorization problem is itself different, and they consider only pairs of factors. Memisevic and Hinton (2010) revisit the Tucker decomposition problem, but factor the 'core' tensor representing interactions between factors in order to make estimation more tractable. They propose a Boltzmann Machine that computes the factorization and show some results on modeling image transformations. Finally, there is a large body of work on matrix factorization of the form V ≈ WH, the most well-known example of which is probably Non-negative Matrix Factorization (Lee and Seung, 2001). The matrix V can be thought of as a sum of outer products, so this is really a type of CP decomposition with an additional constraint on the sign of the factors. Different still is the fact that W is often interpreted as a basis for the columns of V, with H containing the coefficients of each column with respect to this basis. In this sense, vectors are being added to explain V, rather than combined multiplicatively – Non-negative Matrix Factorization is much closer to Sparse Coding (Hoyer, 2004). The first paper in this series (Frady et al., 2020) illustrated how distributed representations of data structures can be built with the algebra of Vector Symbolic Architectures, as well as how Resonator Networks can decompose these data structures. VSAs are a powerful way to think about structured connectionist representations, and Resonator Networks make the framework much more scalable.
Extending the examples found in that paper to more realistic data (for example, complex 3-dimensional visual scenes) could be a useful application of Resonator Networks. This will likely require learning a transform from pixels into the space of high-dimensional symbolic vectors, and this learning should ideally occur in the context of the factorization dynamics – we feel that this is an exciting avenue for future study. Here we have not shown Resonator Circuits for anything other than bipolar vectors. However, a version of the model wherein vector elements are unit-magnitude complex phasors is a natural next extension, and relevant to Holographic Reduced Representations, a VSA developed by Tony Plate (Plate, 2003). A recent theory of sparse phasor associative memories (Frady and Sommer, 2019) may allow one to perform this factorization with a network of spiking neurons. Resonator Networks are an abstract neural model of factorization, introduced for the first time in this two-part series. We believe that as the theory and applications of Resonator Networks are further developed, they may help us understand factorization in the brain, which still remains an important mystery.

8 Acknowledgements

We would like to thank members of the Redwood Center for Theoretical Neuroscience for helpful discussions, in particular Pentti Kanerva, whose work on Vector Symbolic Architectures originally motivated this project. This work was generously supported by the National Science Foundation, under graduate research fellowship DGE1752814 and research grant IIS1718991, by the National Institutes of Health, grant 1R01EB026955-01, the Semiconductor Research Corporation under E2CDA-NRI, and DARPA's Virtual Intelligence Processing (VIP) program.

References

Amari, S.-I. and Maginu, K. (1988). Statistical neurodynamics of associative memory. Neural Networks, 1(1):63–73.

Amit, D.
J., Gutfreund, H., and Sompolinsky, H. (1985). Storing infinite numbers of patterns in a spin-glass model of neural networks. Physical Review Letters, 55(14):1530.

Amit, D. J., Gutfreund, H., and Sompolinsky, H. (1987). Information storage in neural networks with low levels of activity. Physical Review A, 35(5):2293.

Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M., and Telgarsky, M. (2014). Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15:2773–2832.

Arathorn, D. W. (2001). Recognition under transformation using superposition ordering property. Electronics Letters, 37(3):164–166.

Arathorn, D. W. (2002). Map-seeking circuits in visual cognition: A computational mechanism for biological and machine vision. Stanford University Press, Stanford, CA.

Arora, S., Hazan, E., and Kale, S. (2012). The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164.

Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202.

Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, Oxford. With a foreword by Michel Ledoux.

Bredies, K. and Lorenz, D. A. (2008). Linear convergence of iterative soft-thresholding. Journal of Fourier Analysis and Applications, 14(5-6):813–837.

Cannon, L. E. (1969). A cellular computer to implement the Kalman filter algorithm. PhD thesis, Montana State University-Bozeman, College of Engineering.

Carroll, J. D. and Chang, J.-J. (1970). Analysis of individual differences in multidimensional scaling via an n-way generalization of "Eckart-Young" decomposition. Psychometrika, 35(3):283–319.

Carroll, J. D., Pruzansky, S., and Kruskal, J. B. (1980).
Candelinc: A general approach to multidimensional analysis of many-way arrays with linear constraints on parameters. Psychometrika, 45(1):3–24.

Chengxiang, Z., Dasgupta, C., and Singh, M. P. (2000). Retrieval properties of a Hopfield model with random asymmetric interactions. Neural Computation, 12(4):865–880.

Cohen, M. A. and Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man, and Cybernetics, (5):815–826.

Condat, L. (2016). Fast projection onto the simplex and the l1 ball. Mathematical Programming, 158(1-2):575–585.

Daubechies, I., Defrise, M., and De Mol, C. (2004). An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413–1457.

De Lathauwer, L., De Moor, B., and Vandewalle, J. (2000a). A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 21(4):1253–1278.

De Lathauwer, L., De Moor, B., and Vandewalle, J. (2000b). On the best rank-1 and rank-(r1, r2, ..., rn) approximation of higher-order tensors. SIAM Journal on Matrix Analysis and Applications, 21(4):1324–1342.

Duchi, J., Shalev-Shwartz, S., Singer, Y., and Chandra, T. (2008). Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272–279. ACM.

Frady, E. P., Kent, S. J., Olshausen, B. A., and Sommer, F. T. (2020). Resonator networks for factoring distributed representations of data structures. arXiv preprint arXiv:2007.03748.

Frady, E. P. and Sommer, F. T. (2019). Robust computation with rhythmic spike patterns. Proceedings of the National Academy of Sciences, 116(36):18050–18059.

Gayler, R. W. (1998).
Multiplicative binding, representation operators & analogy (workshop poster). In Holyoak, K., Gentner, D., and Kokinov, B., editors, Advances in Analogy Research: Integration of Theory and Data from the Cognitive, Computational, and Neural Sciences.

Gayler, R. W. (2004). Vector symbolic architectures answer Jackendoff's challenges for cognitive neuroscience. arXiv preprint arXiv:cs/0412059.

Gedeon, T. and Arathorn, D. (2007). Convergence of map seeking circuits. Journal of Mathematical Imaging and Vision, 29(2-3):235–248.

Harker, S., Vogel, C. R., and Gedeon, T. (2007). Analysis of constrained optimization variants of the map-seeking circuit algorithm. Journal of Mathematical Imaging and Vision, 29(1):49–62.

Harshman, R. A. (1970). Foundations of the PARAFAC procedure: Models and conditions for an 'explanatory' multimodal factor analysis. Technical Report, UCLA Working Papers in Phonetics, 16, 1-84, University of California, Los Angeles.

Hazan, E. et al. (2016). Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325.

Hebb, D. O. (1949). The Organization of Behavior; A Neuropsychological Theory. John Wiley & Sons, New York, NY.

Held, M., Wolfe, P., and Crowder, H. P. (1974). Validation of subgradient optimization. Mathematical Programming, 6(1):62–88.

Hertz, J., Grinstein, G., and Solla, S. (1986). Memory networks with asymmetric bonds. In AIP Conference Proceedings, volume 151, pages 212–218. American Institute of Physics.

Hillar, C. J. and Lim, L.-H. (2013). Most tensor problems are NP-hard. Journal of the ACM (JACM), 60(6):1–39.

Hitchcock, F. L. (1927). The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics, 6(1-4):164–189.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities.
Proceedings of the National Academy of Sciences, 79(8):2554–2558.

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81(10):3088–3092.

Hopfield, J. J. and Tank, D. W. (1985). "Neural" computation of decisions in optimization problems. Biological Cybernetics, 52(3):141–152.

Hoyer, P. O. (2004). Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5(Nov):1457–1469.

Huang, J. and Hagiwara, M. (1999). A new multidimensional associative memory based on distributed representation and its applications. In IEEE International Conference on Systems, Man, and Cybernetics, volume 1, pages 194–199.

Kanerva, P. (1996). Binary spatter-coding of ordered k-tuples. In International Conference on Artificial Neural Networks, pages 869–873.

Kobayashi, M., Hattori, M., and Yamazaki, H. (2002). Multidirectional associative memory with a hidden layer. Systems and Computers in Japan, 33(3):1494–1502.

Kolda, T. G. and Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review, 51(3):455–500.

Kosko, B. (1988). Bidirectional associative memories. IEEE Transactions on Systems, Man, and Cybernetics, 18(1):49–60.

Kruskal, J. B. (1977). Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra and its Applications, 18(2):95–138.

Lee, D. D. and Seung, H. S. (2001). Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pages 556–562.

Memisevic, R. and Hinton, G. E. (2010). Learning to represent spatial transformations with factored higher-order Boltzmann machines. Neural Computation, 22(6):1473–1492.

Personnaz, L., Guyon, I., and Dreyfus, G. (1986).
Collective computational properties of neural networks: New learning mechanisms. Physical Review A, 34(5):4217.

Plate, T. A. (2003). Holographic Reduced Representation: Distributed Representation of Cognitive Structure. CSLI Publications, Stanford, CA.

Rahimi, A., Datta, S., Kleyko, D., Frady, E. P., Olshausen, B., Kanerva, P., and Rabaey, J. M. (2017). High-dimensional computing as a nanoscalable paradigm. IEEE Transactions on Circuits and Systems I: Regular Papers, 64(9):2508–2521.

Sidiropoulos, N. D. and Bro, R. (2000). On the uniqueness of multilinear decomposition of n-way arrays. Journal of Chemometrics, 14(3):229–239.

Sidiropoulos, N. D., De Lathauwer, L., Fu, X., Huang, K., Papalexakis, E. E., and Faloutsos, C. (2017). Tensor decomposition for signal processing and machine learning. IEEE Transactions on Signal Processing, 65(13):3551–3582.

Singh, M. P., Chengxiang, Z., and Dasgupta, C. (1995). Fixed points in a Hopfield model with random asymmetric interactions. Physical Review E, 52(5):5261.

Spall, J. C. (2005). Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control, volume 65. John Wiley & Sons.

Tenenbaum, J. B. and Freeman, W. T. (2000). Separating style and content with bilinear models. Neural Computation, 12(6):1247–1283.

Tucker, L. R. (1963). Implications of factor analysis of three-way matrices for measurement of change. In Harris, C. W., editor, Problems in Measuring Change, pages 122–137. University of Wisconsin Press, Madison, WI.

Tucker, L. R. (1966). Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311.

Van Vreeswijk, C. and Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274(5293):1724–1726.

Vasilescu, M. A. O. and Terzopoulos, D. (2002).
Multilinear analysis of image ensembles: TensorFaces. In European Conference on Computer Vision, pages 447–460. Springer.

Xu, Z.-B., Hu, G.-Q., and Kwong, C.-P. (1996). Asymmetric Hopfield-type networks: theory and applications. Neural Networks, 9(3):483–501.

Appendices

A Implementation details

This section includes a few comments relevant to the implementation of Resonator Networks. Algorithm 1 gives pseudocode for Ordinary Least Squares weights – the only change for outer product weights is to use X^⊤ instead of X^†. So long as D_f < N/2, computing X_f X_f^† (ô ⊙ c) has lower computational complexity than actually forming a single "synaptic matrix" T_f := X_f X_f^† and then computing T_f (ô ⊙ c) in each iteration – it is faster to keep the matrices X_f and X_f^† separate. This, of course, assumes that the implementation is on a conventional computer. If one can use specialized analog computation, such as large mesh circuits that directly implement matrix-vector multiplication in linear time (Cannon, 1969), then it would be preferable to store the synaptic matrix directly. Lines 11-13 in Algorithm 1 "clean up" x̂^(f) using the nearest neighbor in the codebook, and also resolve a sign ambiguity inherent to the factorization problem. The sign ambiguity is simply this: while c = x_⋆^(1) ⊙ x_⋆^(2) ⊙ · · · ⊙ x_⋆^(F) is the factorization we are searching for, we also have c = (−x_⋆^(1)) ⊙ (−x_⋆^(2)) ⊙ · · · ⊙ x_⋆^(F), and, more generally, any even number of factors can have their signs flipped and still generate the correct c. Resonator Networks will sometimes find these solutions. We clean up using the codevector with the largest unsigned similarity to the converged x̂^(f), which remedies this issue. One will notice that we have written Algorithm 1 to update factors in order from 1 to F. This is completely arbitrary, and any ordering is fine.
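The description above translates almost line-for-line into NumPy. The sketch below is our own minimal rendering of Algorithm 1, with one detail the pseudocode leaves unspecified: ties in sgn are broken to +1 so that the state stays bipolar.

```python
import numpy as np

def bipolar_sign(v):
    """sgn with ties broken to +1, keeping every state in {-1, +1}."""
    return np.where(v >= 0, 1.0, -1.0)

def resonator_ols(c, codebooks, max_iters=100):
    """Resonator Network with OLS weights, following Algorithm 1.
    c: composite bipolar vector; codebooks: list of (N, D_f) matrices."""
    F = len(codebooks)
    pinvs = [np.linalg.pinv(X) for X in codebooks]            # line 2
    x_hat = [bipolar_sign(X.sum(axis=1)) for X in codebooks]  # line 1
    for _ in range(max_iters):
        previous = [x.copy() for x in x_hat]
        for f in range(F):
            # line 6: Hadamard product of all *other* current estimates
            o_hat = np.prod([x_hat[g] for g in range(F) if g != f], axis=0)
            # line 7: pass o_hat ⊙ c through X_f X_f† and binarize
            x_hat[f] = bipolar_sign(codebooks[f] @ (pinvs[f] @ (o_hat * c)))
        if all(np.array_equal(p, x) for p, x in zip(previous, x_hat)):
            break  # reached a fixed point
    # lines 11-13: unsigned nearest-neighbor cleanup, per factor
    return [X[:, np.argmax(np.abs(X.T @ x))] for X, x in zip(codebooks, x_hat)]

# Usage: recover the factors of an exact composite, well below capacity
rng = np.random.default_rng(3)
N, D, F = 256, 6, 3
codebooks = [rng.choice([-1.0, 1.0], size=(N, D)) for _ in range(F)]
idx = [int(rng.integers(D)) for _ in range(F)]
c = np.prod([X[:, j] for X, j in zip(codebooks, idx)], axis=0)
factors = resonator_ols(c, codebooks)
print(all(np.array_equal(x, X[:, j]) for x, X, j in zip(factors, codebooks, idx)))
```

Keeping `codebooks[f]` and `pinvs[f]` as separate matrices, rather than forming T_f once, reflects the complexity argument above for D_f < N/2.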
We have experimented with choosing a random update order during each iteration, but this did not seem to significantly affect performance. Computing $\hat{\mathbf{o}}$ with the most recently updated values for factors $1$ to $f-1$ (see equation (6)) is a convention we call 'asynchronous' updates, in rough analogy to the same term used in the context of Hopfield Networks. An alternative convention is, when computing $\hat{\mathbf{o}}$, not to use freshly updated values for factors $1$ to $f-1$, but rather their values from before the update. This treats each factor as if it were updated simultaneously, a convention we call 'synchronous' updates. This distinction is an idiosyncrasy of modeling Resonator Networks in discrete time, and the difference between the two disappears in continuous time, where updates happen instantaneously. Throughout this paper, our analysis and simulations have used 'asynchronous' updates, which we find to converge significantly faster. Not shown in Algorithm 1 is the fact that, in practice, we record a buffer of past states, allowing us to detect when the dynamics fall into a limit cycle and to terminate early.

Algorithm 1: Resonator Network with Ordinary Least Squares weights

Require: c (composite vector to be factored)
Require: X_1, X_2, …, X_F (codebook matrices, x^(f)_j = X_f[:, j])
Require: k (maximum allowed iterations)
 1: x̂^(f) ← sgn(Σ_j x^(f)_j)  for all f = 1, …, F
 2: X_f† ← pinv(X_f)  for all f = 1, …, F
 3: i ← 0
 4: while not converged and i < k do
 5:   for f = 1 to F do
 6:     ô ← x̂^(1) ⊙ ⋯ ⊙ x̂^(f−1) ⊙ x̂^(f+1) ⊙ ⋯ ⊙ x̂^(F)
 7:     x̂^(f) ← sgn(X_f X_f† (ô ⊙ c))
 8:   end for
 9:   i ← i + 1
10: end while
11: for f = 1 to F do                    ▷ Nearest-neighbor decoding
12:   u ← argmax_j |sim(x̂^(f), x^(f)_j)|  ▷ unsigned NN w.r.t. cosine similarity
13:   x̂^(f) ← x^(f)_u
14: end for
15: return x̂^(f) for all f = 1, …
, F

B Operational Capacity

The main text introduced our definition of operational capacity and highlighted our two main results: that Resonator Networks have superior operational capacity compared to the benchmark algorithms, and that Resonator Network capacity scales as a quadratic function of $N$. This appendix provides some additional support for and commentary on these findings.

Figure 10 compares operational capacity among all of the considered algorithms when $F$, the number of factors, is 4. We previously showed this type of plot for $F = 3$, which was Figure 3 in the main text. Resonator Networks have an advantage of between two and three orders of magnitude compared to all of our benchmarks; the general size of this gap was consistent in all of our simulations.

Figure 10: Comparing operational capacity against the benchmarks for $F = 4$ (4 factors).

We concluded in Section 6.2 that the operational capacity of Resonator Networks scales quadratically in $N$, which was shown in Figure 4. In Table 1 we provide the parameters of the least-squares quadratic fits shown in that plot. One can see from Figure 4b that capacity differs depending on the number of factors involved, and in the limit of large $N$ this difference is determined by the parameter $c$. The value of $c$ first rises from 2 to 3 factors, and then falls with increasing $F$. This implies that factorization is easiest for Resonator Networks when the decomposition is into 3 factors, an interesting phenomenon for which we do not have an explanation at this time.

Table 1: Parameters of the quadratic fit M_max = a + bN + cN².

F        a              b            c
2     1.677 × 10⁵   −3.253 × 10²   0.293
3     1.230 × 10⁶   −3.549 × 10³   2.002
4    −5.663 × 10⁶    9.961 × 10²   1.404
5     1.140 × 10⁶   −2.404 × 10³   1.024
6     5.789 × 10⁶   −4.351 × 10³   0.874
7    −1.503 × 10⁷   −1.551 × 10³   0.669

Figure 11 visualizes $c$ as a function of $F$. The data indicate that for $F \geq 3$, $c$ may follow an inverse power law: $c = \alpha_1 F^{-\alpha_2}$. The indicated linear fit, following a logarithmic transformation of each axis, suggests the following values for the parameters of this power law: $\alpha_1 \approx 2^{3.014} = 8.078$ and $\alpha_2 \approx 1.268$.

Figure 11: Parameter $c$ of the quadratic scaling depends on $F$. We find that it may follow an inverse power law for $F \geq 3$.

It is with some reservation that we give these specific values for $\alpha_1$ and $\alpha_2$. Our estimates of operational capacity, while well fit by quadratics, undoubtedly contain small amounts of noise. This noise can have a big enough impact on the fitted values for $c$ that fitting the fit may not be fully justified. However, we do note, for the sake of completeness, that this scaling, if it holds for larger values of $F$, would allow us to write operational capacity in terms of both parameters $N$ and $F$ in the limit of large $N$:

$$M_{\max} \approx \frac{8.078\, N^2}{F^{1.268}} \qquad \forall F \geq 3 \tag{26}$$

C Table of benchmark algorithms

Alternating Least Squares (equation (10)):
$$\mathbf{a}_f[t+1] = (\xi^\top \xi)^{-1} \xi^\top \mathbf{c}, \qquad \xi := \mathrm{diag}\big(\hat{\mathbf{o}}^{(f)}[t]\big)\, X_f$$

Iterative Soft Thresholding (equation (28)):
$$\mathbf{a}_f[t+1] = S\big[\mathbf{a}_f[t] - \eta \nabla_{\mathbf{a}_f}\mathcal{L};\ \lambda\eta\big], \qquad S[\mathbf{x}; \gamma]_i := \mathrm{sgn}(x_i)\max(|x_i| - \gamma,\ 0)$$

Fast Iterative Soft Thresholding (equation (29)):
$$\alpha_t = \frac{1 + \sqrt{1 + 4\alpha_{t-1}^2}}{2}, \quad \beta_t = \frac{\alpha_{t-1} - 1}{\alpha_t}, \quad \mathbf{p}_f[t+1] = \mathbf{a}_f[t] + \beta_t\big(\mathbf{a}_f[t] - \mathbf{a}_f[t-1]\big), \quad \mathbf{a}_f[t+1] = S\big[\mathbf{p}_f[t+1] - \eta \nabla_{\mathbf{p}_f}\mathcal{L};\ \lambda\eta\big]$$

Projected Gradient Descent (equation (30)):
$$\mathbf{a}_f[t+1] = P_{\mathcal{C}_f}\big[\mathbf{a}_f[t] - \eta \nabla_{\mathbf{a}_f}\mathcal{L}\big], \qquad P_{\mathcal{C}_f}[\mathbf{x}] := \arg\min_{\mathbf{z} \in \mathcal{C}_f} \tfrac{1}{2}\|\mathbf{x} - \mathbf{z}\|_2^2$$

Multiplicative Weights (equation (31)):
$$\mathbf{w}_f[t+1] = \mathbf{w}_f[t] \odot \Big(1 - \tfrac{\eta}{\rho}\nabla_{\mathbf{a}_f}\mathcal{L}\Big), \qquad \mathbf{a}_f[t+1] = \frac{\mathbf{w}_f[t+1]}{\sum_i w_{f_i}[t+1]}, \qquad \rho := \max_i \big|(\nabla_{\mathbf{a}_f}\mathcal{L})_i\big|$$

Map Seeking Circuits (equation (32)):
$$\mathbf{a}_f[t+1] = T\Big[\mathbf{a}_f[t] - \eta\Big(1 - \tfrac{1}{\rho}\nabla_{\mathbf{a}_f}\mathcal{L}\Big);\ \epsilon\Big], \qquad T[\mathbf{x}; \epsilon]_i := \begin{cases} x_i & \text{if } x_i \geq \epsilon \\ 0 & \text{otherwise} \end{cases}, \qquad \rho := \min_i (\nabla_{\mathbf{a}_f}\mathcal{L})_i$$

Table 2: Dynamics for $\mathbf{a}_f$ under each of the benchmark algorithms.
(See Appendices D–I for discussion of each algorithm, including the hyperparameters $\eta$, $\lambda$, and $\epsilon$, as well as initial conditions.)

D Tensor Decompositions and Alternating Least Squares

Tensors are multidimensional arrays that generalize vectors and matrices. An $F$th-order tensor has elements that can be indexed by $F$ separate indices: a vector is a tensor of order 1 and a matrix is a tensor of order 2. As devices for measuring multivariate time series have become more prevalent, the fact that such data can be expressed as a tensor has made the study of tensor decomposition a very popular subfield of applied mathematics. Hitchcock (1927) is often credited with originally formulating tensor decompositions, but modern tensor decomposition was popularized in the field of psychometrics by the work of Tucker (1966), Carroll and Chang (1970), and Harshman (1970). This section highlights the substantial difference between tensor decomposition and the factorization problem solved by Resonator Networks.

The type of tensor decomposition most closely related to our factorization problem (given in equation (1)) decomposes an $F$th-order tensor $\mathcal{C}$ into a sum of tensors, each generated by the outer product $\circ$:

$$\mathcal{C} = \sum_{r=1}^{R} \mathbf{x}^{(1)}_r \circ \mathbf{x}^{(2)}_r \circ \cdots \circ \mathbf{x}^{(F)}_r \tag{33}$$

The outer product contains all pairs of components from its two arguments, so $(\mathbf{w} \circ \mathbf{x} \circ \mathbf{y} \circ \mathbf{z})_{ijkl} = w_i x_j y_k z_l$. The interpretation is that each term in the sum is a "rank-one" tensor of order $F$ and that $\mathcal{C}$ can be generated from the sum of $R$ of these rank-one tensors. We say that $\mathcal{C}$ is "rank-$R$". This particular decomposition has at least three different names in the literature: Canonical Polyadic Decomposition, coined by Hitchcock; CANonical DECOMPosition (CANDECOMP), coined by Carroll and Chang; and PARAllel FACtor analysis (PARAFAC), coined by Harshman.
We will simply call this the CP decomposition, in accordance with the convention used by Kolda and Bader (2009) and many others. CP decomposition makes no mention of a codebook of vectors, such as we have in (1). In CP decomposition, the search is apparently over all of the vectors in a real-valued vector space. One very useful fact about CP decomposition is that under relatively mild conditions, if the decomposition exists, it is unique up to a scaling and permutation indeterminacy. Without going into the details, a result in Kruskal (1977), extended by Sidiropoulos and Bro (2000), gives a sufficient condition for uniqueness of the CP decomposition based on what is known as the Kruskal rank $k_{X_f}$ of the matrix $X_f := [\mathbf{x}^{(f)}_1, \mathbf{x}^{(f)}_2, \ldots, \mathbf{x}^{(f)}_R]$:

$$\sum_{f=1}^{F} k_{X_f} \geq 2R + (F - 1) \tag{34}$$

This fact of decomposition uniqueness illustrates one way that basic results for matrices fail to generalize to higher-order tensors (by higher-order we simply mean order $\geq 3$). Low-rank CP decomposition for matrices (tensors of order 2) may be computed with the truncated Singular Value Decomposition (SVD). However, if $\mathcal{C}$ is a matrix and its truncated SVD is $U\Sigma V^\top := X_1 X_2^\top$, then any non-singular matrix $M$ generates an equally good CP decomposition $(U\Sigma M)(V M^{-\top})^\top$. The decomposition is highly non-unique. All matrices have an SVD, whereas generic higher-order tensors are not guaranteed to have a CP decomposition. And yet, if a CP decomposition exists, under the mild condition of equation (34) it is unique. This is a somewhat miraculous fact, suggesting that in this sense, CP decomposition of higher-order tensors is easier than it is for matrices. The higher order of the composite object imposes many more constraints, which make the decomposition unique.
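Equation (33) and the outer product identity above are easy to make concrete. The following NumPy fragment (our own illustration, not code from the paper) builds a rank-$R$ order-3 tensor and checks the elementwise formula:

```python
import numpy as np

# Build a rank-R order-3 tensor C = sum_r x_r ∘ y_r ∘ z_r, as in equation (33).
rng = np.random.default_rng(0)
R, I, J, K = 2, 4, 5, 6
X = rng.normal(size=(I, R))   # columns are the x_r
Y = rng.normal(size=(J, R))   # columns are the y_r
Z = rng.normal(size=(K, R))   # columns are the z_r
C = np.einsum('ir,jr,kr->ijk', X, Y, Z)  # sum of R rank-one outer products

# The outer product pairs every combination of components:
# C_ijk = sum_r (x_r)_i (y_r)_j (z_r)_k.
i, j, k = 1, 2, 3
assert np.isclose(C[i, j, k], sum(X[i, r] * Y[j, r] * Z[k, r] for r in range(R)))
```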
Another interesting way that higher-order tensors differ from matrices is that computing matrix rank is easy, whereas in general computing tensor rank is NP-hard, along with many other important tensor problems (Hillar and Lim, 2013). Our intuition about matrices largely fails us when dealing with higher-order tensors. In some ways the problems are easier and in some ways they are harder. Please see Sidiropoulos et al. (2017) for a more comprehensive comparison.

The vector factorization problem defined by (1) differs from CP decomposition in three key ways:

1. The composite object to be factored is a vector, not a higher-order tensor. This is an even more extreme difference than the one between matrices and higher-order tensors. In CP decomposition, the arrangement and numerosity of tensor elements constitute many constraints on what the factorization can be, so much so that they resolve the uniqueness issue we outlined above. In this sense, tensors contain much more information about the valid factorization, making the problem significantly easier. The size and form of these tensors may make finding CP decompositions a computational challenge, but CP decomposition is analytically easier than our vector factorization problem.

2. Search is conducted over a discrete set of possible factors. This differs from the standard formulation of CP decomposition, which makes no restriction to a discrete set of factors. It is, however, worth noting that a specialization of CP decomposition called CANonical DEcomposition with LINear Constraints (CANDELINC) (Carroll et al., 1980) does in fact impose the additional requirement that factors are formed from a linear combination of some basis factors. In our setup the solutions are 'one-hot' linear combinations.

3. The factors are constrained to $\{-1, 1\}^N$, a small sliver of $\mathbb{R}^N$. This difference should not be underestimated.
We have shown in Section 6.6 that the interior of this hypercube is treacherous from an optimization perspective, and Resonator Networks avoid it by using a highly nonlinear activation function. This would not make sense in the context of standard CP decomposition.

Perhaps the most convincing demonstration that (1) is not CP decomposition comes from the fact that we applied Alternating Least Squares to it and found that its performance was relatively poor. Alternating Least Squares is in fact the 'workhorse' algorithm of CP decomposition (Kolda and Bader, 2009), but it cannot compete with Resonator Networks on our different factorization problem (1). The excellent review of Kolda and Bader (2009) covers CP decomposition and Alternating Least Squares in significant depth, including the fact that ALS always converges to a local minimum of the squared-error reconstruction loss. See, in particular, Section 3.4 of their paper for more detail.

One special case of CP decomposition involves rank-one components that are symmetric and orthogonal. For this problem, a special case of ALS called the tensor power method can be used to iteratively find the best low-rank CP decomposition through what is known as 'deflation', which is identical to the explaining away we introduced in part one of this series (Frady et al., 2020). The tensor power method directly generalizes the matrix power method, and in this special case of symmetric, orthogonal tensors it is effective at finding the CP decomposition. A good initial reference for the tensor power method is De Lathauwer et al. (2000b). A discussion of applying tensor decompositions to statistical learning problems is given by Anandkumar et al. (2014), which develops a robust version of the tensor power method and contains several important probabilistic results for applying tensor decompositions to noisy data.
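For intuition, the tensor power iteration for a symmetric, orthogonally decomposable order-3 tensor can be sketched as follows. This is our own minimal illustration: the update is $\mathbf{v} \leftarrow T(I, \mathbf{v}, \mathbf{v})$, normalized, which converges to one of the orthonormal components.

```python
import numpy as np

rng = np.random.default_rng(0)
n, R = 6, 3
# Orthonormal components v_r and positive weights lambda_r give a symmetric,
# orthogonally decomposable tensor T = sum_r lambda_r v_r ∘ v_r ∘ v_r.
V = np.linalg.qr(rng.normal(size=(n, n)))[0][:, :R]
lam = np.array([3.0, 2.0, 1.0])
T = np.einsum('r,ir,jr,kr->ijk', lam, V, V, V)

# Power iteration: v <- T(I, v, v) / ||T(I, v, v)||.
v = rng.normal(size=n)
v /= np.linalg.norm(v)
for _ in range(50):
    v = np.einsum('ijk,j,k->i', T, v, v)
    v /= np.linalg.norm(v)

# The iterate converges to one of the components v_r.
assert np.max(np.abs(V.T @ v)) > 0.999
```

Deflation then subtracts the recovered rank-one term from $T$ and repeats, which is the 'explaining away' analogy mentioned above.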
The tensor power method differs from Resonator Networks in the same key ways as ALS: composite objects are higher-order tensors, not vectors; search is not necessarily over a discrete set; the vectors are not constrained to $\{-1, 1\}^N$; and the dynamics make linear least-squares updates for each factor.

Another popular tensor decomposition is known as the Tucker Decomposition (Tucker, 1963, 1966). It adds to CP decomposition an order-$F$ "core tensor" $\mathcal{G}$ that modifies the interaction between each of the factors:

$$\mathcal{C} = \sum_{p=1}^{P} \sum_{q=1}^{Q} \cdots \sum_{r=1}^{R} g_{pq\ldots r}\; \mathbf{x}^{(1)}_p \circ \mathbf{x}^{(2)}_q \circ \cdots \circ \mathbf{x}^{(F)}_r \tag{35}$$

Figure 12: Tucker decomposition with 3 factors.

This adds many more parameters compared to CP decomposition, which is the special case of Tucker decomposition in which $\mathcal{G}$ is the identity. For the purpose of illustration, we reprint in Figure 12 (with a slight relabeling) a figure from Kolda and Bader (2009) that depicts an order-3 Tucker decomposition. This decomposition goes by many other names, most popularly the Higher-Order SVD, coined in De Lathauwer et al. (2000a). The Tucker decomposition can also be found via Alternating Least Squares (see Kolda and Bader (2009), Section 4.2, for a tutorial), although the problem is somewhat harder than CP decomposition, both by being computationally more expensive and by being non-unique. Despite this fact, the applications of Tucker decomposition are wide-ranging: it has been used in psychometrics, signal processing, and computer vision. One well-known application of Tucker decomposition in computer vision was TensorFaces (Vasilescu and Terzopoulos, 2002). This model was able to factorize identity, illumination, viewpoint, and facial expression in a dataset consisting of face images.

The summary of this section is that the vector factorization problem (1) is not tensor decomposition. In some sense it is more challenging.
Perhaps not surprisingly, the standard algorithm for tensor decompositions, Alternating Least Squares, is not particularly competitive on this problem when compared to Resonator Networks. It is interesting to consider whether tensor decomposition might be cast into a form amenable to solution by Resonator Networks. Given the importance of tensor decomposition as a tool of data analysis, we believe this warrants a closer look.

E General notes on gradient-based algorithms

When $\mathcal{L}$ is the negative inner product, the gradient with respect to $\mathbf{a}_f$ is:

$$\nabla_{\mathbf{a}_f} \mathcal{L} = -X_f^\top \big(\mathbf{c} \odot \hat{\mathbf{x}}^{(1)} \odot \cdots \odot \hat{\mathbf{x}}^{(f-1)} \odot \hat{\mathbf{x}}^{(f+1)} \odot \cdots \odot \hat{\mathbf{x}}^{(F)}\big) = -X_f^\top \big(\mathbf{c} \odot \hat{\mathbf{o}}^{(f)}\big) \tag{36}$$

The term $\mathbf{c} \odot \hat{\mathbf{o}}^{(f)}$ can be interpreted as an estimate for what $\hat{\mathbf{x}}^{(f)}$ should be, based on the current estimates for the other factors. Multiplying by $X_f^\top$ compares the similarity of this vector to each of the candidate codevectors we are entertaining, with the smallest element of $\nabla_{\mathbf{a}_f}\mathcal{L}$ (its value is likely to be negative with large absolute value) indicating the codevector which matches best. Following the negative gradient will cause this coefficient to increase more than the coefficients corresponding to the other codevectors.

When $\mathcal{L}$ is the squared error, the gradient with respect to $\mathbf{a}_f$ is:

$$\nabla_{\mathbf{a}_f} \mathcal{L} = X_f^\top \Big( \big(\mathbf{c} - \hat{\mathbf{x}}^{(1)} \odot \cdots \odot \hat{\mathbf{x}}^{(F)}\big) \odot \big({-\hat{\mathbf{x}}^{(1)}} \odot \cdots \odot \hat{\mathbf{x}}^{(f-1)} \odot \hat{\mathbf{x}}^{(f+1)} \odot \cdots \odot \hat{\mathbf{x}}^{(F)}\big) \Big) := X_f^\top \Big( \hat{\mathbf{x}}^{(f)} \odot \big(\hat{\mathbf{o}}^{(f)}\big)^2 - \mathbf{c} \odot \hat{\mathbf{o}}^{(f)} \Big) \tag{37}$$

This looks somewhat similar to the gradient for the negative inner product: they differ by an additive term given by $X_f^\top \big(\hat{\mathbf{x}}^{(f)} \odot (\hat{\mathbf{o}}^{(f)})^2\big)$. At the vertices of the hypercube, all elements of the factor estimates are $1$ or $-1$, so $(\hat{\mathbf{o}}^{(f)})^2$ becomes the all-ones vector and disappears, making the difference between the two gradients just $X_f^\top \hat{\mathbf{x}}^{(f)}$. Among other things, this makes the gradient of the squared error equal to zero at the global minimizer $\mathbf{x}^{(1)}_\star, \ldots, \mathbf{x}^{(F)}_\star$, which is not the case with the negative inner product.
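Equation (36) is easy to verify numerically. The sketch below (our own check, with arbitrary small dimensions) compares the analytic gradient of the negative inner product loss to central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, F = 32, 5, 3
Xs = [rng.choice([-1.0, 1.0], size=(N, D)) for _ in range(F)]
a = [rng.normal(size=D) for _ in range(F)]
c = rng.choice([-1.0, 1.0], size=N)

def loss(a_f, f):
    """Negative inner product loss as a function of factor f's coefficients."""
    factors = [Xs[g] @ (a_f if g == f else a[g]) for g in range(F)]
    return -c @ np.prod(np.stack(factors), axis=0)

f = 1
o_f = np.prod(np.stack([Xs[g] @ a[g] for g in range(F) if g != f]), axis=0)
grad = -Xs[f].T @ (c * o_f)  # equation (36)

# Central differences; the loss is linear in a_f, so agreement is essentially exact.
eps = 1e-5
numeric = np.array([(loss(a[f] + eps * e, f) - loss(a[f] - eps * e, f)) / (2 * eps)
                    for e in np.eye(D)])
assert np.allclose(grad, numeric, rtol=1e-5, atol=1e-4)
```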
To be clear, (36) is the gradient when the loss function is the negative inner product, while (37) is the gradient when the loss function is the squared error.

E.1 Fixed-stepsize gradient descent on the squared error

In fixed-stepsize gradient descent for unconstrained convex optimization problems, one must often add a restriction on the stepsize, related to the smoothness of the loss function, in order to ensure that the iterates converge to a fixed point. We say that a function $\mathcal{L}$ is $L$-smooth when its gradient is Lipschitz continuous with constant $L$:

$$\|\nabla\mathcal{L}(\mathbf{x}) - \nabla\mathcal{L}(\mathbf{y})\|_2 \leq L \|\mathbf{x} - \mathbf{y}\|_2 \quad \forall\, \mathbf{x}, \mathbf{y} \tag{38}$$

For a function that is twice-differentiable, this is equivalent to the condition

$$\mathbf{0} \preceq \nabla^2\mathcal{L}(\mathbf{x}) \preceq L I \quad \forall\, \mathbf{x} \tag{39}$$

where $\mathbf{0}$ is the matrix of all zeros and $I$ is the identity. Absent some procedure for adjusting the stepsize $\eta$ at each iteration to account for the degree of local smoothness, or some additional assumption placed on the loss to make sure that it is sufficiently smooth, we should be wary that convergence may not be guaranteed. On our factorization problem we find this to be an issue. Unconstrained gradient descent on the squared error works for the simplest problems, where $M$ is small and the factorization can be easily found by any of the algorithms in this paper. However, as $M$ increases, the exceedingly "jagged" landscape of the squared-error loss makes the iterates very sensitive to the stepsize $\eta$, and the components of $\mathbf{a}_f[t]$ can become very large. When this happens, the term $\hat{\mathbf{o}}^{(f)}[t]$ amplifies the problem (it multiplies together all but one of the factor estimates) and causes numerical instability. The smoothness of the squared-error loss is very poor: we found that fixed-stepsize gradient descent on the squared error was so sensitive to $\eta$ that it made the method practically useless for solving the factorization problem.
Iterative Soft Thresholding and Fast Iterative Soft Thresholding use a dynamic stepsize to avoid this issue (see equation (40)). In contrast, the negative inner product loss is, with respect to each factor, in some sense perfectly smooth (it is linear), so the stepsize does not factor into convergence proofs.

F Iterative Soft Thresholding (ISTA) and Fast Iterative Soft Thresholding (FISTA)

Iterative Soft Thresholding is a type of proximal gradient descent. The proximal operator for any convex function $h(\cdot)$ is defined as

$$\mathrm{prox}_h(\mathbf{x}) := \arg\min_{\mathbf{z}} \frac{1}{2}\|\mathbf{z} - \mathbf{x}\|_2^2 + h(\mathbf{z})$$

When $h(\mathbf{z})$ is $\lambda\|\mathbf{z}\|_1$, the proximal operator is the so-called "soft-thresholding" function, which we denote by $S$:

$$S[\mathbf{x}; \gamma]_i := \mathrm{sgn}(x_i)\max(|x_i| - \gamma,\ 0)$$

Consider taking the squared-error loss and adding to it $\lambda\|\mathbf{a}_f\|_1$:

$$\mathcal{L}(\mathbf{c}, \hat{\mathbf{c}}) + \lambda\|\mathbf{a}_f\|_1 = \frac{1}{2}\|\mathbf{c} - \hat{\mathbf{c}}\|_2^2 + \lambda\|\mathbf{a}_f\|_1$$

Applying soft thresholding clearly minimizes this augmented loss function. The strategy is to take gradient steps with respect to the squared-error loss and then pass those updates through the soft-thresholding function $S$. This flavor of proximal gradient descent, where $\hat{\mathbf{c}}$ is a linear function of $\mathbf{a}_f$ and $h(\cdot)$ is the $\ell_1$ norm, is called the Iterative Soft Thresholding Algorithm (Daubechies et al., 2004), and is a somewhat old and popular approach for finding sparse solutions to large-scale linear inverse problems.

The dynamics of ISTA are given in equation (28), and there are a few parameters worth discussing. First, the dynamic stepsize $\eta$ can be set via backtracking line search or, as we did, by computing the Lipschitz constant of the loss-function gradient:

$$\eta = \frac{1}{L}, \qquad \|\nabla_{\mathbf{a}}\mathcal{L}(\mathbf{x}) - \nabla_{\mathbf{a}}\mathcal{L}(\mathbf{y})\|_2 \leq L\|\mathbf{x} - \mathbf{y}\|_2 \quad \forall\, \mathbf{x}, \mathbf{y} \tag{40}$$

The scalar $\lambda$ is a hyperparameter that effectively sets the sparsity of the solutions considered; its value should be tuned in order to get good performance in practice.
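As a concrete, deliberately simplified illustration, here is the soft-thresholding operator and one ISTA step on a plain linear least-squares problem (our own sketch; the paper applies the same scheme to its multilinear loss, with the stepsize set from the Lipschitz constant as in (40)):

```python
import numpy as np

def soft_threshold(x, gamma):
    # S[x; gamma]_i = sgn(x_i) * max(|x_i| - gamma, 0)
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def ista_step(a, X, c, lam):
    """One ISTA step on min_a 0.5*||c - X a||^2 + lam*||a||_1."""
    L = np.linalg.norm(X, 2) ** 2      # Lipschitz constant of the gradient
    eta = 1.0 / L                      # stepsize, as in equation (40)
    grad = X.T @ (X @ a - c)
    return soft_threshold(a - eta * grad, lam * eta)
```

Iterating `ista_step` drives the augmented objective down monotonically; FISTA adds the Nesterov momentum sequence of equation (29) on top of exactly this step.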
In the experiments we show in this paper, $\lambda$ was $0.01$. The initial state $\mathbf{a}_f[0]$ is set to $\mathbf{1}$. Convergence analysis of ISTA is beyond the scope of this paper, but it has been shown in various places (Bredies and Lorenz (2008), for instance) that ISTA converges at a rate of $O(1/t)$. Iterative Soft Thresholding works well in practice, although for 4 or more factors we find that it is not quite as effective as the algorithms that do constrained descent on the negative inner product loss. By virtue of not directly constraining the coefficients, ISTA allows them to grow outside of $[0, 1]^{D_f}$. This may make it easier to find the minimizers $\mathbf{a}^\star_1, \mathbf{a}^\star_2, \ldots, \mathbf{a}^\star_F$, but it may also lead the method to encounter more suboptimal local minimizers, which we found to be the case in practice. One common criticism of ISTA is that it can get trapped in shallow parts of the loss surface and thus suffers from slow convergence (Bredies and Lorenz, 2008). A straightforward improvement, based on Nesterov's momentum for accelerating first-order methods, was proposed by Beck and Teboulle (2009), which they call Fast Iterative Soft Thresholding (FISTA). The dynamics of FISTA are written in equation (29), and converge at the significantly better rate of $O(1/t^2)$, a result proven in Beck and Teboulle (2009). Despite this difference in worst-case convergence rate, we find that the average-case convergence rate on our particular factorization problem does not significantly differ. The initial coefficients $\mathbf{a}_f[0]$ are set to $\mathbf{1}$ and the auxiliary variable $\alpha_t$ is initialized to $1$. For all experiments, $\lambda$ was set the same as for ISTA, to $0.01$.

G Projected Gradient Descent

Starting from the general optimization form of the factorization problem (9), what kind of constraint might it be reasonable to enforce on $\mathbf{a}_f$?
The most obvious is that $\mathbf{a}_f$ lie on the simplex $\Delta_{D_f} := \{\mathbf{x} \in \mathbb{R}^{D_f} \mid \sum_i x_i = 1,\ x_i \geq 0\ \forall i\}$. Enforcing this constraint means that $\hat{\mathbf{x}}^{(f)}$ stays within the $[-1, 1]$ hypercube at all times and, as we noted, the optimal values $\mathbf{a}^\star_1, \mathbf{a}^\star_2, \ldots, \mathbf{a}^\star_F$ happen to lie at vertices of the simplex, the standard basis vectors $\mathbf{e}_i$. Another constraint set worth considering is the $\ell_1$ ball $B_{\|\cdot\|_1}[1] := \{\mathbf{x} \in \mathbb{R}^{D_f} \mid \|\mathbf{x}\|_1 \leq 1\}$. This set contains the simplex, but it encompasses much more of $\mathbb{R}^{D_f}$. One reason to consider the $\ell_1$ ball is that it dramatically increases the number of feasible global optimizers of (9), from which we can easily recover the specific solution to (1). This is due to the fact that

$$\mathbf{c} = X_1\mathbf{a}^\star_1 \odot X_2\mathbf{a}^\star_2 \odot \cdots \odot X_F\mathbf{a}^\star_F \iff \mathbf{c} = X_1(-\mathbf{a}^\star_1) \odot X_2(-\mathbf{a}^\star_2) \odot \cdots \odot X_F\mathbf{a}^\star_F$$

and, moreover, any number of distinct pairs of factor coefficients can be made negative, since the sign changes cancel out. The result is that while the simplex constraint only allows the solution $\mathbf{a}^\star_1, \mathbf{a}^\star_2, \ldots, \mathbf{a}^\star_F$, the $\ell_1$ ball constraint also allows the solutions $-\mathbf{a}^\star_1, -\mathbf{a}^\star_2, \mathbf{a}^\star_3, \ldots, \mathbf{a}^\star_F$, and $\mathbf{a}^\star_1, \mathbf{a}^\star_2, -\mathbf{a}^\star_3, \ldots, -\mathbf{a}^\star_F$, and $-\mathbf{a}^\star_1, -\mathbf{a}^\star_2, -\mathbf{a}^\star_3, \ldots, -\mathbf{a}^\star_F$, etc. These spurious global minimizers can easily be detected by checking the sign of the largest-magnitude component of $\mathbf{a}_f$. If it is negative, we can multiply by $-1$ to recover $\mathbf{a}^\star_f$. Choosing the $\ell_1$ ball over the simplex is purely motivated by the perspective that increasing the size of the constraint set may make finding the global optimizers easier. However, we found that in practice it did not significantly matter whether $\Delta_{D_f}$ or $B_{\|\cdot\|_1}[1]$ was used to constrain $\mathbf{a}_f$.

There exist algorithms for efficiently computing projections onto both the simplex and the $\ell_1$ ball (see Held et al. (1974), Duchi et al. (2008), and Condat (2016)). We use a variant summarized in Duchi et al.
(2008) that has computational complexity $O(D_f \log D_f)$; recall that $\mathbf{a}_f$ has $D_f$ components, so this is the dimensionality of the simplex or $\ell_1$ ball being projected onto. When constraining to the simplex, we set the initial coefficients $\mathbf{a}_f[0]$ to $\frac{1}{D_f}\mathbf{1}$, the center of the simplex. When constraining to the unit $\ell_1$ ball, we set $\mathbf{a}_f[0]$ to $\frac{1}{2D_f}\mathbf{1}$, so that all coefficients are equal but the vector is in the interior of the ball. The only hyperparameter is $\eta$, which in all experiments was set to $0.01$. We remind the reader that we defined the nullspace of the projection operation with equation (22) in Section 6.6, and the special case for the simplex constraint in (23) and (24).

Taking projected gradient steps on the negative inner product loss works well and is guaranteed to converge, whether we use the simplex or the $\ell_1$ ball constraint. Convergence is guaranteed due to this intuitive fact: any part of $-\eta\nabla_{\mathbf{a}_f}\mathcal{L}$ not in $N_{P_{\mathcal{C}_f}}[\mathbf{x}]$ induces a change in $\mathbf{a}_f$, denoted by $\Delta\mathbf{a}_f[t]$, which must make an acute angle with $-\nabla_{\mathbf{a}_f}\mathcal{L}$. This is by the definition of orthogonal projection, and it is a sufficient condition for showing that $\Delta\mathbf{a}_f[t]$ decreases the value of the loss function. Projected Gradient Descent iterates always reduce the value of the negative inner product loss or leave it unchanged; the function is bounded below on the simplex and the $\ell_1$ ball, so this algorithm is guaranteed to converge.

Applying projected gradient descent to the squared error did not work, which is related to the smoothness issue we discussed in Appendix E.1, although the behavior was not as dramatic as with unconstrained gradient descent. We observed in practice that projected gradient descent on the squared-error loss easily falls into limit cycles of the dynamics. It was for this reason that we restricted our attention with projected gradient descent to the negative inner product loss.
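The simplex projection we use can be sketched directly from its description in the literature; the following is our own transcription of the $O(D \log D)$ sort-based variant summarized in Duchi et al. (2008):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex
    {x : sum_i x_i = 1, x_i >= 0}, via the O(D log D) sort-based
    algorithm summarized in Duchi et al. (2008)."""
    u = np.sort(v)[::-1]                        # sort descending
    css = np.cumsum(u)
    j = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * j > css - 1.0)[0][-1]  # last index kept active
    theta = (css[rho] - 1.0) / (rho + 1.0)      # shared shift
    return np.maximum(v - theta, 0.0)
```

Projection onto the $\ell_1$ ball reduces to this routine: if $\|\mathbf{v}\|_1 \leq 1$, return $\mathbf{v}$; otherwise project $|\mathbf{v}|$ onto the simplex and restore the signs.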
H Multiplicative Weights

When we have simplex constraints $\mathcal{C}_f = \Delta_{D_f}$, the Multiplicative Weights algorithm is an elegant way to perform the superposition search. It naturally enforces the simplex constraint by maintaining a set of auxiliary variables, the 'weights', which define the choice of $\mathbf{a}_f$ at each iteration. See equation (31) for the dynamics of Multiplicative Weights. We choose a fixed stepsize $\eta \leq 0.5$ and initial values for the weights all one: $\mathbf{w}_f[0] = \mathbf{1}$. In the experiments in this paper we set $\eta = 0.3$. The variable $\rho$ exists to normalize the term $\frac{1}{\rho}\nabla_{\mathbf{a}_f}\mathcal{L}$ so that each element lies in the interval $[-1, 1]$.

Multiplicative Weights is an algorithm primarily associated with game theory and online optimization, although it has been independently discovered in a wide variety of fields (Arora et al., 2012). Please see Arora's excellent review of Multiplicative Weights for a discussion of the fascinating historical and analytical details of this algorithm. Multiplicative Weights is often presented as a decision policy for discrete-time games. However, through a straightforward generalization of the discrete actions into directions in a continuous vector space, one can apply Multiplicative Weights to problems of online convex optimization, which is discussed at length in Arora et al. (2012) and Hazan et al. (2016). We can think of solving our problem (9) as if it were an online convex optimization problem in which we update each factor $\hat{\mathbf{x}}^{(f)}$ according to its own Multiplicative Weights update, one at a time. The function $\mathcal{L}$ is convex with respect to $\mathbf{a}_f$, but is changing at each iteration due to the updates for the other factors; it is in this sense that we are treating (9) as an online convex optimization problem.
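A single Multiplicative Weights update from equation (31) can be sketched as follows (our own fragment; $\rho$ is taken as the largest gradient magnitude, per the normalization described above):

```python
import numpy as np

def mw_step(w, grad, eta=0.3):
    """One Multiplicative Weights update (equation (31)) for one factor."""
    rho = np.max(np.abs(grad))       # normalizer: grad/rho lies in [-1, 1]
    w = w * (1.0 - (eta / rho) * grad)
    a = w / w.sum()                  # coefficients stay on the simplex
    return w, a
```

With $\eta \leq 0.5$, the weights stay strictly positive, which is what the descent argument in Appendix H.1 relies on.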
H.1 Multiplicative Weights is a descent method

A descent method on $\mathcal{L}$ is any algorithm that iterates $\mathbf{a}_f[t+1] = \mathbf{a}_f[t] + \eta[t]\,\Delta\mathbf{a}_f[t]$, where the update $\Delta\mathbf{a}_f[t]$ makes an acute angle with $-\nabla_{\mathbf{a}_f}\mathcal{L}$: that is, $\nabla_{\mathbf{a}_f}\mathcal{L}^\top \Delta\mathbf{a}_f[t] < 0$. In the case of Multiplicative Weights, we can equivalently define a descent method based on $\nabla_{\mathbf{w}_f}\tilde{\mathcal{L}}^\top \Delta\mathbf{w}_f[t] < 0$, where $\tilde{\mathcal{L}}(\mathbf{w}_f)$ is the loss as a function of the weights and $\nabla_{\mathbf{w}_f}\tilde{\mathcal{L}}$ is its gradient with respect to those weights. The loss as a function of the weights comes via the substitution $\mathbf{a}_f = \mathbf{w}_f / \sum_i w_{f_i} := \mathbf{w}_f / \Phi_f$. We now prove that $\nabla_{\mathbf{w}_f}\tilde{\mathcal{L}}^\top \Delta\mathbf{w}_f[t] < 0$:

$$\nabla_{\mathbf{w}_f}\tilde{\mathcal{L}} = \left(\frac{\partial \mathbf{a}_f}{\partial \mathbf{w}_f}\right)^{\!\top} \frac{\partial \mathcal{L}}{\partial \mathbf{a}_f} = \begin{pmatrix} \frac{\Phi_f - w_{f_1}}{\Phi_f^2} & -\frac{w_{f_2}}{\Phi_f^2} & \cdots & -\frac{w_{f_k}}{\Phi_f^2} \\ -\frac{w_{f_1}}{\Phi_f^2} & \frac{\Phi_f - w_{f_2}}{\Phi_f^2} & \cdots & -\frac{w_{f_k}}{\Phi_f^2} \\ \vdots & \vdots & \ddots & \vdots \\ -\frac{w_{f_1}}{\Phi_f^2} & -\frac{w_{f_2}}{\Phi_f^2} & \cdots & \frac{\Phi_f - w_{f_k}}{\Phi_f^2} \end{pmatrix} \nabla_{\mathbf{a}_f}\mathcal{L} = \left(\frac{1}{\Phi_f} I - \frac{1}{\Phi_f^2}\mathbf{1}\mathbf{w}^\top\right) \nabla_{\mathbf{a}_f}\mathcal{L} = \frac{1}{\Phi_f}\nabla_{\mathbf{a}_f}\mathcal{L} - \frac{\mathcal{L}(\mathbf{a}_f)}{\Phi_f}\mathbf{1} \tag{41}$$

(The last equality uses the fact that $\mathcal{L}$ is linear in $\mathbf{a}_f$, so $\mathbf{w}_f^\top \nabla_{\mathbf{a}_f}\mathcal{L} = \Phi_f\, \mathcal{L}(\mathbf{a}_f)$.) This allows us to write $\Delta\mathbf{w}_f[t]$ in terms of $\nabla_{\mathbf{w}_f}\tilde{\mathcal{L}}$:

$$\Delta\mathbf{w}_f[t] = -\frac{1}{\rho}\,\mathbf{w}_f[t] \odot \nabla_{\mathbf{a}_f}\mathcal{L} = -\frac{1}{\rho}\,\mathbf{w}_f[t] \odot \Big(\Phi_f \nabla_{\mathbf{w}_f}\tilde{\mathcal{L}} + \mathcal{L}(\mathbf{a}_f[t])\,\mathbf{1}\Big) = -\frac{\Phi_f}{\rho}\,\mathrm{diag}(\mathbf{w}_f[t])\,\nabla_{\mathbf{w}_f}\tilde{\mathcal{L}} - \frac{\mathcal{L}(\mathbf{a}_f[t])}{\rho}\,\mathbf{w}_f[t] \tag{42}$$

And then we can easily show the desired result:

$$\begin{aligned}
\nabla_{\mathbf{w}_f}\tilde{\mathcal{L}}^\top \Delta\mathbf{w}_f[t] &= -\frac{\Phi_f}{\rho}\,\nabla_{\mathbf{w}_f}\tilde{\mathcal{L}}^\top \mathrm{diag}(\mathbf{w}_f[t])\,\nabla_{\mathbf{w}_f}\tilde{\mathcal{L}} - \frac{\mathcal{L}(\mathbf{a}_f[t])}{\rho}\,\nabla_{\mathbf{w}_f}\tilde{\mathcal{L}}^\top \mathbf{w}_f[t] \\
&= -\frac{\Phi_f}{\rho}\,\nabla_{\mathbf{w}_f}\tilde{\mathcal{L}}^\top \mathrm{diag}(\mathbf{w}_f[t])\,\nabla_{\mathbf{w}_f}\tilde{\mathcal{L}} - \frac{\mathcal{L}(\mathbf{a}_f[t])}{\rho}\left(\frac{1}{\Phi_f}\nabla_{\mathbf{a}_f}\mathcal{L} - \frac{\mathcal{L}(\mathbf{a}_f[t])}{\Phi_f}\mathbf{1}\right)^{\!\top}\mathbf{w}_f[t] \\
&= -\frac{\Phi_f}{\rho}\,\nabla_{\mathbf{w}_f}\tilde{\mathcal{L}}^\top \mathrm{diag}(\mathbf{w}_f[t])\,\nabla_{\mathbf{w}_f}\tilde{\mathcal{L}} - \frac{\mathcal{L}(\mathbf{a}_f[t])}{\rho}\Big(\mathcal{L}(\mathbf{a}_f[t]) - \mathcal{L}(\mathbf{a}_f[t])\Big) \\
&= -\frac{\Phi_f}{\rho}\,\nabla_{\mathbf{w}_f}\tilde{\mathcal{L}}^\top \mathrm{diag}(\mathbf{w}_f[t])\,\nabla_{\mathbf{w}_f}\tilde{\mathcal{L}} \;<\; 0
\end{aligned} \tag{43}$$

The last line follows directly from the fact that the $\mathbf{w}_f$ are always positive by construction in Multiplicative Weights: the matrix $\mathrm{diag}(\mathbf{w}_f[t])$ is therefore positive definite, and the term $\frac{\Phi_f}{\rho}$ is strictly greater than $0$. We have shown that the iterates of Multiplicative Weights always make steps in descent directions.
When the loss $\mathcal{L}$ is the negative inner product, it is therefore guaranteed to decrease at each iteration. Empirically, Multiplicative Weights applied to the squared-error loss also always decreases the loss function. We said in Appendix E.1 that descent on the squared error with a fixed stepsize is not in general guaranteed to converge. However, the behavior we observe with Multiplicative Weights descent on the squared error might be explained by the fact that the stepsize is normalized by $\rho$ at each iteration in this algorithm. Both functions are bounded below over the constraint set $\Delta_{D_f}$, and therefore Multiplicative Weights must converge to a fixed point. In practice, we pick a stepsize $\eta$ between $0.1$ and $0.5$ and run the algorithm until the normalized magnitude of the change in the coefficients is below some small threshold $\epsilon$:

$$\frac{\big\|\mathbf{a}_f[t+1] - \mathbf{a}_f[t]\big\|}{\eta} < \epsilon$$

The simulations we showed in the Results section used $\eta = 0.3$ and $\epsilon = 10^{-5}$.

I Map Seeking Circuits

Map Seeking Circuits (MSCs) are neural networks designed to solve invariant pattern recognition problems. Their theory and applications have been gradually developed by Arathorn and colleagues over the past 18 years (see, for example, Arathorn (2001, 2002), Gedeon and Arathorn (2007), and Harker et al. (2007)), but remain largely unknown outside of a small community of vision researchers. In their original conception, they solve a "correspondence maximization" or "transformation discovery" problem in which the network is given a visually transformed instance of some template object and has to recover the identity of the object as well as a set of transformations that explain its current appearance. The approach taken in Map Seeking Circuits is to superimpose the possible transformations, in the same spirit as we have outlined for solving the factorization problem.
We cannot give the topic a full treatment here, but we simply note that the original formulation of Map Seeking Circuits can be directly translated to our factorization problem, wherein each type of transformation (e.g., translation, rotation, scale) is one of the $F$ factors, and the particular values of the transformation are vectors in the codebooks $X_1, X_2, \ldots, X_F$. The loss function is $L : x, y \mapsto -\langle x, y \rangle$ and the constraint set is $[0,1]^{D_f}$ (both by convention in Map Seeking Circuits). The dynamics of Map Seeking Circuits are given in equation (32), with initial values $a_f[0] = \mathbf{1}$ for each factor. The small threshold is a hyperparameter, which we set to $10^{-5}$ in experiments, along with the step size $\eta = 0.1$. Gedeon and Arathorn (2007) and Harker et al. (2007) proved (with some minor technicalities we will not detail here) that Map Seeking Circuits always converge to either a scalar multiple of a canonical basis vector or the zero vector. That is, $a_f[\infty] = \beta_f e_i$ or $\mathbf{0}$ (where $(e_i)_j = 1$ if $j = i$ and $0$ otherwise, and $\beta_f$ is a positive scalar). Due to the normalizing term $\rho$, the updates to $a_f$ can never be positive. Among the components of $\nabla_{a_f} L$ which are negative, the one with the largest magnitude corresponds to a component of $a_f$ which sees an update of $0$. All other components are decreased by an amount which is proportional to the gradient. We noted in our comments on (36) that the smallest element of $\nabla_{a_f} L$ corresponds to the codevector which best matches $c \odot \hat{o}^{(f)}$, a "suggestion" for $\hat{x}^{(f)}$ based on the current states of the other factors. The dynamics of Map Seeking Circuits thus preserve the weight of the codevector which matches best and decrease the weights of the other codevectors, each by an amount proportional to its own match. Once the weight on a codevector drops below the threshold, it is set to zero and no longer participates in the search.
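In code, one step of these dynamics looks roughly as follows. This is only a qualitative sketch of the behavior just described, not the paper's equation (32); the function name and the exact form of the nonpositive shift are illustrative:

```python
import numpy as np

def msc_step(a, grad, rho, threshold=1e-5):
    """One Map-Seeking-Circuit-style update of the superposition weights a_f.
    Qualitative sketch only: updates are never positive, the component whose
    gradient entry is most negative (the best-matching codevector) receives an
    update of zero, the others are decreased, and any weight that falls below
    the threshold is zeroed and leaves the search."""
    delta = (grad - grad.min()) / rho     # nonnegative; zero at the best match
    a = np.clip(a - delta, 0.0, 1.0)      # constraint set [0, 1]^{D_f}
    a[a < threshold] = 0.0
    return a

# With a fixed gradient, the weights collapse onto the best-matching codevector,
# consistent with the convergence result a_f[inf] = beta_f * e_i.
a = np.ones(3)
grad = np.array([-0.9, -0.2, 0.3])
for _ in range(10):
    a = msc_step(a, grad, rho=1.0)
```

After a few steps `a` has collapsed onto the coordinate of the most negative gradient entry, with all other weights driven to zero.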
The phenomenon wherein the correct coefficient $a_{f_{i^\star}}$ drops out of the search is called "sustained collusion" by Arathorn (Arathorn, 2002) and is a failure mode of Map Seeking Circuits.

J Percolated noise in Outer Product Resonator Networks

A Resonator Network with outer product weights $X_f X_f^\top$ that is initialized to the correct factorization is not guaranteed to remain there, just as a Hopfield Network with outer product weights initialized to one of the 'memories' is not guaranteed to remain there. This is in contrast to a Resonator Network (and a Hopfield Network) with Ordinary Least Squares weights $X_f \left(X_f^\top X_f\right)^{-1} X_f^\top$, for which each of the codevectors is always a fixed point. In this section, when we refer simply to a Resonator Network or a Hopfield Network, we are referring to the variants of these models that use outer product weights. The bitflip probability for the $f$th factor of a Resonator Network is denoted $r_f$ and defined in (13). Section J.1 derives $r_1$, which is equal to the bitflip probability for a Hopfield Network, first introduced in (12) in the main text. Section J.2 derives $r_2$, and then Section J.3 collects all of the ingredients to express the general $r_f$.

J.1 First factor

The stability of the first factor in a Resonator Network is the same as the stability of the state of a Hopfield Network: at issue is the distribution of $\hat{x}^{(1)}[1]$:

$$\hat{x}^{(1)}[1] = \mathrm{sgn}\left(X_1 X_1^\top x^{(1)}_\star\right) := \mathrm{sgn}(\Gamma)$$

Assuming each codevector (each column of $X_1$, including the vector $x^{(1)}_\star$) is a random bipolar vector, each component of $\Gamma$ is a random variable. Its distribution can be deduced by writing it out in terms of constant and random components:

$$\Gamma_i = \sum_m^{D_1}\sum_j^{N} x^{(1)}_{m_i} x^{(1)}_{m_j} x^{(1)}_{\star_j}
= N x^{(1)}_{\star_i} + \sum_{m\neq\star}^{D_1}\sum_j^{N} x^{(1)}_{m_i} x^{(1)}_{m_j} x^{(1)}_{\star_j}
= N x^{(1)}_{\star_i} + (D_1 - 1)\, x^{(1)}_{\star_i} + \sum_{m\neq\star}^{D_1}\sum_{j\neq i}^{N} x^{(1)}_{m_i} x^{(1)}_{m_j} x^{(1)}_{\star_j} \qquad (44)$$

The third term is a sum of $(N-1)(D_1-1)$ i.i.d. Rademacher random variables, which in the limit of large $N D_1$ can be well approximated by a Gaussian random variable with mean zero and variance $(N-1)(D_1-1)$. Therefore, $\Gamma_i$ is approximately Gaussian with mean $(N + D_1 - 1)\, x^{(1)}_{\star_i}$ and variance $(N-1)(D_1-1)$. The probability that $\hat{x}^{(1)}[1]_i \neq x^{(1)}_{\star_i}$ is given by the cumulative density function of the Normal distribution:

$$h_1 := \Pr\left(\hat{x}^{(1)}[1]_i \neq x^{(1)}_{\star_i}\right) = \Phi\left(\frac{-N - D_1 + 1}{\sqrt{(N-1)(D_1-1)}}\right) \qquad (45)$$

We care about the ratio $D_1/N$ and how the bitflip probability $h_1$ scales with this number. We have called this $h_1$ to denote the Hopfield bitflip probability, but it is also $r_1$, the bitflip probability for the first factor of a Resonator Network. We will see that for the second, third, fourth, and subsequent factors, $h_f$ will not equal $r_f$, which is what we mean by percolated noise, the focus of Section 6.1 in the main text. If we eliminate all "self-connection" terms from $X_1 X_1^\top$ by setting each element on the diagonal to zero, then the second term in (44) is eliminated and the bitflip probability is $\Phi\left(\frac{-N}{\sqrt{(N-1)(D_1-1)}}\right)$. This is actually significantly different from (45), which we can see in Figure 13. With self-connections, the bitflip probability is maximized when $D_1 = N$ (the reader can verify this via simple algebra), and its maximum value is $\approx 0.023$. Without self-connections, the bitflip probability asymptotes at $0.5$. The actual useful operating regime of both of these networks is where $D_1$ is significantly less than $N$, which we zoom in on in Figure 13b. A "mean-field" analysis of Hopfield Networks developed by Amit, Gutfreund, and Sompolinsky (Amit et al., 1985, 1987) showed that when $D_1/N > 0.138$, a phase-change phenomenon occurs in which a small number of initial bitflips (when the probability is $0.0036$ according to the above approximation) build up over subsequent iterations and the network almost always moves far away from $x^{(1)}_\star$, making it essentially useless.

Figure 13: Effect of self-connections on bitflip probability. (a) Bitflip probability for $D_1/N \in (0, 2]$. (b) Bitflip probability for $D_1/N \in (0, 0.25]$.

We can see that the same bitflip probability is suffered at a significantly higher value of $D_1/N$ when we have self-connections: the vector $x^{(1)}_\star$ is significantly more stable in this sense. We also found that a Resonator Network has higher operational capacity (see Section 6.2) when we leave in the self-connections. As a third point of interest, computing $X_f X_f^\top x^{(1)}_\star$ is often much faster when we keep each codebook matrix separate (instead of forming the synaptic matrix $X_f X_f^\top$ directly), in which case removing the self-connection terms involves extra computation in each iteration of the algorithm. For all of these reasons, we choose to keep the self-connection terms in the Resonator Network.

J.2 Second factor

When we update the second factor, we have

$$\hat{x}^{(2)}[1] = \mathrm{sgn}\left(X_2 X_2^\top \left(\hat{o}^{(2)}[1] \odot c\right)\right) := \mathrm{sgn}(\Gamma)$$

Here we are just repurposing the notation $\Gamma$ to indicate the vector which gets thresholded to $-1$ and $+1$ by the sign function to generate the new state $\hat{x}^{(2)}[1]$. Some of the components of the vector $\hat{o}^{(2)}[1] \odot c$ will be the same as $x^{(2)}_\star$, and some (hopefully small) number of the components will have been flipped relative to $x^{(2)}_\star$ by the update to factor $1$. Let us denote the set of components which flipped as $Q$; the set of components that did not flip is $Q^c$. The number of bits that did or did not flip is the size of these sets, denoted by $|Q|$ and $|Q^c|$, respectively. We have to keep track of these two sets separately because it affects the probability that a component of $\hat{x}^{(2)}[1]$ is flipped relative to $x^{(2)}_\star$.
We can write out the constant and random parts of $\Gamma_i$ along the same lines as what we did in (44):

$$\begin{aligned}
\Gamma_i &= \sum_m^{D_2}\sum_j^{N} x^{(2)}_{m_i} x^{(2)}_{m_j} \left(\hat{o}^{(2)}[1] \odot c\right)_j \\
&= \sum_m^{D_2}\sum_{j\in Q^c} x^{(2)}_{m_i} x^{(2)}_{m_j} x^{(2)}_{\star_j} - \sum_m^{D_2}\sum_{j\in Q} x^{(2)}_{m_i} x^{(2)}_{m_j} x^{(2)}_{\star_j} \\
&= |Q^c|\, x^{(2)}_{\star_i} + \sum_{m\neq\star}^{D_2}\sum_{j\in Q^c} x^{(2)}_{m_i} x^{(2)}_{m_j} x^{(2)}_{\star_j} - |Q|\, x^{(2)}_{\star_i} - \sum_{m\neq\star}^{D_2}\sum_{j\in Q} x^{(2)}_{m_i} x^{(2)}_{m_j} x^{(2)}_{\star_j} \\
&= (N - 2|Q|)\, x^{(2)}_{\star_i} + \sum_{m\neq\star}^{D_2}\sum_{j\in Q^c} x^{(2)}_{m_i} x^{(2)}_{m_j} x^{(2)}_{\star_j} - \sum_{m\neq\star}^{D_2}\sum_{j\in Q} x^{(2)}_{m_i} x^{(2)}_{m_j} x^{(2)}_{\star_j}
\end{aligned} \qquad (46)$$

If $i$ is in the set of bits which did not flip previously, then a constant $(D_2 - 1)\, x^{(2)}_{\star_i}$ comes out of the second term above. If $i$ is in the set of bits which did flip previously, then a constant $-(D_2 - 1)\, x^{(2)}_{\star_i}$ comes out of the third term above. The remaining contribution to $\Gamma_i$ is, in either case, a sum of $(N-1)(D_2-1)$ i.i.d. Rademacher random variables, analogously to what we had in (44). Technically $|Q|$ is a random variable, but when $N$ is of any moderate size it will be close to $r_1 N$, where $r_1$ is the bitflip probability for the first factor. Therefore, $\Gamma_i$ is approximately Gaussian with mean either $\left(N(1 - 2r_1) + (D_2 - 1)\right) x^{(2)}_{\star_i}$ or $\left(N(1 - 2r_1) - (D_2 - 1)\right) x^{(2)}_{\star_i}$, depending on whether $i \in Q^c$ or $i \in Q$. We call the conditional bitflip probabilities that result from these two cases $r_2'$ and $r_2''$:

$$r_2' := \Pr\left(\hat{x}^{(2)}[1]_i \neq x^{(2)}_{\star_i} \,\middle|\, \left(\hat{o}^{(2)}[1] \odot c\right)_i = x^{(2)}_{\star_i}\right) = \Phi\left(\frac{-N(1 - 2r_1) - (D_2 - 1)}{\sqrt{(N-1)(D_2-1)}}\right) \qquad (47)$$

$$r_2'' := \Pr\left(\hat{x}^{(2)}[1]_i \neq x^{(2)}_{\star_i} \,\middle|\, \left(\hat{o}^{(2)}[1] \odot c\right)_i \neq x^{(2)}_{\star_i}\right) = \Phi\left(\frac{-N(1 - 2r_1) + (D_2 - 1)}{\sqrt{(N-1)(D_2-1)}}\right) \qquad (48)$$

The total bitflip probability for updating the second factor, $r_2$, is then $r_2'(1 - h_1) + r_2'' h_1$.
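Equations (47) and (48) can be checked with a small Monte Carlo experiment, sketched below with illustrative sizes (the percolated noise from factor 1 is imposed by hand at rate $r_1$ rather than generated by an actual first-factor update; `norm_cdf` is built from `erfc` to avoid an external dependency):

```python
import numpy as np
from math import erfc, sqrt

def norm_cdf(z):
    """Standard Normal CDF, Phi(z)."""
    return 0.5 * erfc(-z / sqrt(2.0))

rng = np.random.default_rng(0)
N, D2, r1, trials = 1000, 200, 0.05, 100   # illustrative sizes

# Predicted total bitflip probability r_2 from equations (47) and (48)
s = sqrt((N - 1) * (D2 - 1))
r2p = norm_cdf((-N * (1 - 2 * r1) - (D2 - 1)) / s)    # eq. (47)
r2pp = norm_cdf((-N * (1 - 2 * r1) + (D2 - 1)) / s)   # eq. (48)
r2 = r2p * (1 - r1) + r2pp * r1

# Empirical bitflip rate: corrupt x_star at rate r1, then apply the update
flips = 0
for _ in range(trials):
    X2 = rng.choice([-1.0, 1.0], size=(N, D2))
    x_star = X2[:, 0]
    corrupted = x_star * np.where(rng.random(N) < r1, -1.0, 1.0)
    flips += np.sum(np.sign(X2 @ (X2.T @ corrupted)) != x_star)
empirical = flips / (trials * N)
```

With these settings the empirical rate lands within a fraction of a percent of the predicted $r_2$.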
J.3 All other factors

It should hopefully be clear that the general development above for the bitflip probability of the second factor applies to all subsequent factors; we just need to make one modification to the notation. We saw that the bitflip probability was different depending on whether the component had flipped in the previous factor (the difference between (47) and (48)). In the general case, what really matters is whether the factor sees a net bitflip from the other factors. It might be the case that the component had initially flipped but was flipped back by subsequent factors; all that matters is whether an odd number of previous factors flipped the component. To capture this indirectly, we define the quantity $n_f$ to be the net bitflip probability that is passed on to the next factor (this is equation (14) in the main text):

$$n_f := \Pr\left(\left(\hat{o}^{(f+1)}[t] \odot c\right)_i \neq x^{(f+1)}_{\star_i}\right)$$

For the first factor, $r_1 = n_1$, but in the general case it should be clear that

$$r_f = r_f'(1 - n_{f-1}) + r_f''\, n_{f-1},$$

which is equation (17) in the main text. This expression is just marginalizing over the probability that a net bitflip was not seen (first term) and the probability that a net bitflip was seen (second term). The expression for the general $n_f$ is slightly different:

$$n_f = r_f'(1 - n_{f-1}) + (1 - r_f'')\, n_{f-1},$$

which is equation (18) in the main text. The base case of the recursion is $n_0 = 0$, which makes intuitive sense because factor $1$ sees no percolated noise. In (47) and (48) above we had $r_1$, but what really belongs there in the general case is $n_{f-1}$. This brings us to our general statement for the conditional bitflip probabilities $r_f'$ and $r_f''$, which are equations (19) and (20) in the main text:

$$r_f' = \Phi\left(\frac{-N(1 - 2n_{f-1}) - (D_f - 1)}{\sqrt{(N-1)(D_f-1)}}\right)$$

$$r_f'' = \Phi\left(\frac{-N(1 - 2n_{f-1}) + (D_f - 1)}{\sqrt{(N-1)(D_f-1)}}\right)$$

What we have derived here in Appendix J are equations (12)–(20).
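The full recursion is compact enough to state in a few lines of code. The sketch below (illustrative parameter values, not the paper's experimental settings) iterates equations (17) through (20) and confirms that $r_1$ reduces to the Hopfield bitflip probability $h_1$ of equation (45):

```python
from math import erfc, sqrt

def norm_cdf(z):
    """Standard Normal CDF, Phi(z)."""
    return 0.5 * erfc(-z / sqrt(2.0))

def bitflip_probs(N, D):
    """Per-factor bitflip probabilities r_f for codebook sizes D = [D_1, ..., D_F],
    via the percolated-noise recursion, equations (17)-(20)."""
    r, n = [], 0.0    # base case n_0 = 0: factor 1 sees no percolated noise
    for Df in D:
        s = sqrt((N - 1) * (Df - 1))
        r_p = norm_cdf((-N * (1 - 2 * n) - (Df - 1)) / s)    # eq. (19)
        r_pp = norm_cdf((-N * (1 - 2 * n) + (Df - 1)) / s)   # eq. (20)
        r.append(r_p * (1 - n) + r_pp * n)                   # eq. (17)
        n = r_p * (1 - n) + (1 - r_pp) * n                   # eq. (18)
    return r

probs = bitflip_probs(N=500, D=[100] * 5)
h1 = norm_cdf((-500 - 100 + 1) / sqrt(499 * 99))   # equation (45)
```

Since noise accumulates across factors, the returned probabilities increase monotonically in $f$.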
This result agrees very well with data generated in experiments where one actually counts the bitflips in a randomly instantiated Resonator Network. In Figure 14 we show the sampling distribution of $r_f$ from these experiments compared to the analytical expression for $r_f$. Dots indicate the mean value of $r_f$, and the shaded region indicates one standard deviation about the mean, the standard error of this sampling distribution. We generated this plot with $250$ i.i.d. random trials for each point. Solid lines are simply the analytical values for $r_f$, which one can see are in very close agreement with the sampling distribution.

Figure 14: Agreement between simulation and theory for $r_f$. Shades indicate factors $1$–$5$ (light to dark).
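For completeness, here is a minimal version of the experiment behind Figure 14 (with illustrative sizes $N$, $D_f$, $F$ and trial counts, not the settings used for the figure): initialize every factor at its correct codevector, sweep through the factors once, and count how often each updated factor flips a bit.

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(0)
N, D, F, trials = 800, 120, 3, 30   # illustrative sizes

flip_counts = np.zeros(F)
for _ in range(trials):
    # Random bipolar codebooks; the composite c binds the first codevector of each
    X = [rng.choice([-1.0, 1.0], size=(N, D)) for _ in range(F)]
    x_star = [Xf[:, 0] for Xf in X]
    c = np.prod(x_star, axis=0)
    est = [x.copy() for x in x_star]   # initialize at the correct factorization
    for f in range(F):
        # Unbind the other factors' current estimates (the elementwise product
        # inverts binding because all entries are +-1), then update factor f
        o_hat = c * np.prod([est[g] for g in range(F) if g != f], axis=0)
        est[f] = np.sign(X[f] @ (X[f].T @ o_hat))
        flip_counts[f] += np.sum(est[f] != x_star[f])
empirical_r = flip_counts / (trials * N)

# Factor 1 sees no percolated noise, so its rate should match h_1 from (45)
h1 = 0.5 * erfc((N + D - 1) / sqrt(2 * (N - 1) * (D - 1)))
```

The first factor's empirical bitflip rate matches $h_1$, and later factors inherit the percolated noise described above.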