Linear Reservoir: A Diagonalization-Based Optimization


Authors: Romain de Coudenhove, Yannis Bendi-Ouis, Anthony Strock, Xavier Hinaut

Journal of Machine Learning Research Preprint (2026) 1-29. Submitted 02/26; Published 02/26.

Romain de Coudenhove (romain.de.coudenhove@ens.psl.eu), ENS PSL; Inria Center of Bordeaux University, LaBRI, IMN
Yannis Bendi-Ouis (yannis.bendi-ouis@inria.fr), Inria Center of Bordeaux University, LaBRI, IMN
Anthony Strock (astrock@stanford.edu), Department of Psychiatry & Behavioral Sciences, Stanford University School of Medicine, Stanford
Xavier Hinaut (xavier.hinaut@inria.fr), Inria Center of Bordeaux University, LaBRI, IMN

Editor: Preprint JMLR

Abstract

We introduce a diagonalization-based optimization for Linear Echo State Networks (ESNs) that reduces the per-step computational complexity of reservoir state updates from O(N^2) to O(N). By reformulating the reservoir dynamics in the eigenbasis of the recurrent matrix, the recurrent update becomes a set of independent element-wise operations, eliminating the matrix multiplication. We further propose three methods to apply our optimization depending on the situation: (i) Eigenbasis Weight Transformation (EWT), which preserves the dynamics of standard and trained Linear ESNs; (ii) End-to-End Eigenbasis Training (EET), which directly optimizes readout weights in the transformed space; and (iii) Direct Parameter Generation (DPG), which bypasses matrix diagonalization by directly sampling eigenvalues and eigenvectors, achieving performance comparable to standard Linear ESNs. Across all experiments, our methods preserve predictive accuracy while offering significant computational speedups, making them a replacement for standard Linear ESN computation and training, and suggesting a paradigm shift in linear ESNs toward the direct selection of eigenvalues.
Keywords: Reservoir Computing, Linear Echo State Networks, Diagonalization, Computational Complexity, Recurrent Neural Networks, Parallelization, Efficient Training, Eigenvalue Decomposition, Spectral Initialization, Linear Recurrent Units

© 2026 Romain de Coudenhove. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/vPreprint/21-0000.html.

1 Introduction

Reservoir Computing (RC), most notably implemented as Echo State Networks (ESNs), offers a compelling alternative to traditional Recurrent Neural Networks (RNNs) by keeping internal reservoir weights fixed and training only the output layer. While this approach simplifies the training procedure to a linear regression problem, the reservoir state update remains computationally expensive, scaling with quadratic complexity (O(N^2), where N is the number of neurons). Consequently, there is a strong motivation to reduce this complexity without sacrificing performance. While early RC research emphasized the necessity of non-linear reservoir dynamics, recent studies have demonstrated the surprising efficacy of linear reservoirs. Theoretically, it has been shown that a linear reservoir coupled with a non-linear readout (such as a polynomial function or neural network) is a universal approximator (Gonon and Ortega, 2019). Furthermore, linear dynamics are particularly well-suited for maximizing memory capacity, as analyzed in Dambre et al. (2012). Several architectures have leveraged these properties. For instance, Legendre Memory Units (LMUs) utilize linear dynamics derived from Legendre polynomials to project input history onto an orthogonal basis, thereby extending memory capacity (Voelker et al., 2019).
Similarly, recent works by Peter Tino (Simon Fong et al., 2025; Verzelli et al., 2021; Tino, 2020) have explored the dynamical properties of linear models, clarifying the precise conditions under which linear ESNs can match or outperform their non-linear counterparts. Despite these advantages, the computational cost of matrix multiplication remains a bottleneck for large reservoirs. To address this, approaches such as Next Generation Reservoir Computing (Gauthier et al., 2021) have proposed replacing the explicit simulation of reservoir dynamics with static non-linear feature expansions, offering significant speedups. However, abandoning the recurrent dynamics entirely may not always be desirable, as it limits the model to a fixed-size window.

In this work, we present a method to reduce computational complexity while maintaining the explicit recurrence of the reservoir. We focus on optimizing Linear ESNs through matrix diagonalization, a transformation that converts costly matrix multiplications into efficient element-wise operations, reducing the complexity from O(N^2) to O(N). Beyond immediate computational efficiency, our approach is motivated by a fundamental advantage of linear systems: the ability to parallelize computation across the entire input sequence (see Appendix B). This principle of parallelizable state propagation aligns with recent advances in modern state-space models such as Mamba (Gu and Dao, 2024) and parallelized LMU implementations (Chilkuri and Eliasmith, 2021). First, Section 2 formalizes the Linear ESN framework and highlights the computational cost of standard dynamics. Section 3 details our diagonalization-based reformulation, which reduces the update complexity to O(N). Section 4 presents three practical implementations of this optimization, including the direct generation of spectral parameters.
Finally, Section 5 validates these methods on standard benchmarks, followed by a discussion on the importance of spectral properties in Section 6.

2 Linear Echo State Networks: Definition

An Echo State Network (ESN) is a type of recurrent neural network particularly effective at learning complex dynamical systems. In its linear form, the network is fully characterized by a set of matrices:

• W ∈ R^{N×N}, the reservoir transition matrix
• W_in ∈ R^{D_in×N}, the input matrix
• W_fb ∈ R^{D_out×N}, the feedback matrix (optional)
• W_out ∈ R^{N′×D_out}, the readout matrix

Here, N denotes the number of neurons in the reservoir, while D_in and D_out represent the dimensionality of the input and output signals, respectively. The dimension N′ corresponds to the size of the extended state vector used by the readout. Depending on the chosen configuration, this vector is formed by concatenating the reservoir state r(t) with optional components: a constant bias, or the previous output y(t−1). Consequently, N′ varies based on the inclusion of these additional terms alongside the N reservoir units. This structure leverages a fixed, randomly initialized reservoir W, where only the output weights W_out are optimized.

Figure 1: Standard architecture of an Echo State Network (ESN). In the Reservoir Computing paradigm, the input weights W_in, the internal recurrent weights W, and the optional feedback weights W_fb are randomly initialized and kept fixed. The input is projected into a high-dimensional space within the reservoir, which utilizes its recurrent connections to maintain a deterministic state representation over time. Only the readout weights W_out are trained to decode this internal state and produce the final output.

Let the time series input be (u(t))_{t=1}^{T} ∈ R^{T×D_in}, where u(t) ∈ R^{D_in} represents the input vector at time step t.
For every time step t, the reservoir produces an internal state r(t) ∈ R^N, and the readout produces an output y(t) ∈ R^{D_out}.

2.1 Reservoir step

The internal state r(t) evolves according to the following dynamics. This recursive update propagates information through time via a weighted combination of past states, current inputs, and past outputs.

r(t) = r(t−1) W + u(t) W_in + y(t−1) W_fb    (1)

Note on optional input bias: An input bias term can be added to the reservoir. However, since its effect is mathematically equivalent to adding a constant dimension of 1 to the input signal u(t), we omit the input bias for the remainder of this work.

2.2 Readout step

The output y(t) at time t is a linear combination of an extended state vector (which includes a bias, the previous output, and the current reservoir state) and the readout weight matrix W_out.

y(t) = [1  y(t−1)  r(t)] W_out,  for t = 1, ..., T    (2)

W_out = [W_out,bias ; W_out,out ; W_out,res]    (3)

Note on optional features: Depending on the implementation, both the bias and the previous output are optional; only the current reservoir state is mandatory. Consequently, W_out,bias and W_out,out are optional too.

Note on initial states: According to these equations, the model requires an initial state for the reservoir, and for the output if there is feedback. By specifying these initial conditions, the entire reservoir and output trajectories are uniquely determined for a given input sequence. Numerous methods exist to initialize these vectors, but in most cases they are initialized either by setting them to the zero vector, or by using a preliminary "warmup" phase where the network runs for a short period to stabilize the state dynamics before training starts. In this context, we set the initial conditions to zero.
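As an illustration, the recursive update of Equation (1) can be simulated in a few lines of NumPy. This is a minimal sketch of our own (not the authors' implementation), assuming no feedback, no bias, and the row-vector convention used throughout; all variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_in, T = 50, 2, 100

# Fixed random weights (row-vector convention, as in Eq. (1))
W = rng.normal(size=(N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # scale spectral radius to 0.9
W_in = rng.normal(size=(D_in, N))

U = rng.normal(size=(T, D_in))  # input sequence u(1), ..., u(T)

# Recursive reservoir update, Eq. (1), without feedback and with r(0) = 0
r = np.zeros(N)
states = np.empty((T, N))
for t in range(T):
    r = r @ W + U[t] @ W_in
    states[t] = r
```

Each iteration costs one N×N matrix-vector product, which is exactly the O(N^2) bottleneck the next section removes.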
2.3 Leaking Rate

The leaking rate lr ∈ (0, 1] is a hyperparameter in reservoir computing that controls the temporal decay of internal states. It introduces a form of memory retention by blending the previous state and the potential new state at each time step. A smaller value of lr results in a slower update and consequently a larger memory capacity, while lr = 1 yields the fastest update and consequently the smallest memory capacity. In practice, in Linear ESNs the leaking rate modifies the transition matrix W by mixing it with the identity:

W^(lr) = lr · W + (1 − lr) · I    (4)

The input and feedback weights are similarly rescaled:

W_in^(lr) = lr · W_in
W_fb^(lr) = lr · W_fb

As a result, the full reservoir dynamics under the leaking rate become:

r(0) = 0    (5)
r(t) = r(t−1) W^(lr) + u(t) W_in^(lr) + y(t−1) W_fb^(lr)    (6)

The leaking rate only reparametrizes the original matrices W, W_in and W_fb, which does not alter the structure of the network. Thus, the exact same optimization method (Section 3) can be used with W^(lr), W_in^(lr) and W_fb^(lr).

2.4 Learning W_out

Given input-output pairs {(u(t), y(t))}_{t=1,...,T}, where T is the total number of timesteps used for training, we construct:

X(t) = [1  y(t−1)  r(t)] ∈ R^{N′}    (7)
Y(t) = [y(t)] ∈ R^{D_out}    (8)

W_out is computed using standard ridge regression with regularization parameter α:

W_out = (X^T X + αI)^{−1} X^T Y    (9)

2.5 Computational Complexity

Here, we analyze each major component of the ESN architecture in terms of time complexity, explicitly accounting for sparsity (1 − connectivity) where relevant.

• W_in and W_fb generation: The input weight matrix W_in ∈ R^{D_in×N} is required, while the feedback weight matrix W_fb is optional. These matrices are typically generated with a connectivity c_in / c_fb (where each weight represents a non-zero connection with probability c_in / c_fb).
The non-zero weights are sampled from a specific distribution (e.g., normal or uniform). The complexity of generating these matrices is O(D_in N) and O(D_out N), respectively.

• W generation and Spectral Radius scaling: The reservoir weight matrix W ∈ R^{N×N} is similarly generated with a connectivity c_r. A critical step is scaling W to achieve the desired spectral radius ρ_0. The computational cost depends heavily on the algorithm used to compute the leading eigenvalues: for dense matrices, standard eigenvalue decomposition algorithms are O(N^3); for sparse matrices, iterative methods such as the Implicitly Restarted Arnoldi Method (IRAM) are commonly used (Lehoucq et al., 1998). The complexity depends on the sparsity and convergence rate, scaling between O(N^2) and O(N^3).

• Reservoir step: This recursive update involves matrix-vector multiplications. By exploiting sparse matrix operations, the complexity per time step is reduced to the number of non-zero elements: O(c_r N^2 + c_in D_in N + c_fb D_out N) per time step. Therefore, the total cost over T time steps is O(T (c_r N^2 + c_in D_in N + c_fb D_out N)).

• Readout layer: The output computation y(t) = X(t) W_out involves a dense matrix multiplication. With W_out ∈ R^{N′×D_out}, the complexity is O(N′ D_out). This cost is negligible compared to the reservoir dynamics for long sequences, since D_out ≪ N.

The computational complexity of an ESN primarily depends on the reservoir step. This cost dominates due to the recurrent nature of the update, which must be computed sequentially T times. However, it is important to note that the use of sparse matrices can dramatically reduce the effective time complexity compared to dense ones.

3 Diagonalization-Based Optimization

In this section, we introduce an optimization leveraging the diagonalization of the reservoir matrix W.
By decomposing W = P D P^{−1}, we reformulate the reservoir step into a pointwise form, where each state component evolves independently according to its corresponding eigenvalue. This simplification drastically reduces the amount of computation needed to obtain the activity and output of a linear ESN.

3.1 Core Transformation

Theorem 1 (Change of basis of the ESN dynamics).
(i) Let W be the reservoir matrix and P ∈ GL_n(C) an invertible basis matrix. Let us denote by [·]_P the transformation into the basis P, and express both the weights and the state of the ESN in this new basis:

Transformed reservoir matrix:  [W]_P = P^{−1} W P
Transformed reservoir state:   [r(t)]_P = r(t) P
Transformed input weights:     [W_in]_P = W_in P
Transformed feedback weights:  [W_fb]_P = W_fb P
Transformed output weights:    [W_out]_P = [W_out,bias ; W_out,out ; [W_out,res]_P], with [W_out,res]_P := P^{−1} W_out,res, i.e. [W_out]_P = [I 0; 0 P^{−1}] W_out

where [I 0; 0 P^{−1}] denotes the block-diagonal matrix with blocks I and P^{−1}.

(ii) Then the reservoir step becomes:

[r(0)]_P = 0    (10)
[r(t)]_P = [r(t−1)]_P [W]_P + u(t) [W_in]_P + y(t−1) [W_fb]_P    (11)

(iii) And the readout step becomes:

y(t) = W_out,bias + y(t−1) W_out,out + [r(t)]_P [W_out,res]_P    (12)

(iv) Using the notation X(t) = [1  y(t−1)  r(t)], we define:

[X(t)]_P = [1  y(t−1)  [r(t)]_P] = X(t) [I 0; 0 P]    (13)

It is then possible to learn the readout weights directly in P:

[W_out]_P = ([X]_P^T [X]_P + α [I 0; 0 P^T P])^{−1} [X]_P^T Y    (14)

Proof. For the reservoir step:

[r(t)]_P = r(t) P
         = r(t−1) W P + u(t) W_in P + y(t−1) W_fb P
         = (r(t−1) P)(P^{−1} W P) + u(t) [W_in]_P + y(t−1) [W_fb]_P
         = [r(t−1)]_P [W]_P + u(t) [W_in]_P + y(t−1) [W_fb]_P

For the readout step:

y(t) = W_out,bias + y(t−1) W_out,out + r(t) W_out,res
     = W_out,bias + y(t−1) W_out,out + r(t) P P^{−1} W_out,res
     = W_out,bias + y(t−1) W_out,out + [r(t)]_P [W_out,res]_P
Proof of (iv). Let Q = [I 0; 0 P] (block-diagonal notation), so that X = [X]_P Q^{−1} by (13). Then:

[W_out]_P = Q^{−1} W_out
          = Q^{−1} (X^T X + αI)^{−1} X^T Y
          = Q^{−1} ((Q^{−1})^T [X]_P^T [X]_P Q^{−1} + αI)^{−1} (Q^{−1})^T [X]_P^T Y
          = (Q^T ((Q^{−1})^T [X]_P^T [X]_P Q^{−1} + αI) Q)^{−1} [X]_P^T Y
          = ([X]_P^T [X]_P + α Q^T Q)^{−1} [X]_P^T Y
          = ([X]_P^T [X]_P + α [I 0; 0 P^T P])^{−1} [X]_P^T Y

3.2 A special case of a diagonalizable matrix

Corollary 2 (Diagonalization Optimization). Assume the reservoir matrix W is diagonalizable. Let W = P [W]_P P^{−1}, where:

• P ∈ GL_n(C): matrix of eigenvectors
• [W]_P = diag(Λ), with Λ := (λ_1, ..., λ_N): diagonal matrix of eigenvalues

Then the reservoir step becomes pointwise:

[r(t)]_P = [r(t−1)]_P ⊙ Λ + u(t) [W_in]_P + y(t−1) [W_fb]_P    (15)

where ⊙ denotes element-wise multiplication.

Proof. Applying the change of basis to the standard reservoir update equation yields

[r(t)]_P = [r(t−1)]_P [W]_P + u(t) [W_in]_P + y(t−1) [W_fb]_P

Since [r(t−1)]_P is a row vector and [W]_P = diag(Λ), their matrix product strictly simplifies to the element-wise multiplication [r(t−1)]_P ⊙ Λ.

3.3 Applying W_in after the Reservoir Steps

This subsection demonstrates that, with our method, the input matrix W_in can actually be applied after the temporal update. This shows that the network's temporal dynamics depend merely on Λ, further justifying our focus on optimizing the spectral properties.
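The pointwise update of Equation (15) is easy to check numerically. The following sketch (our own minimal illustration, assuming a dense Gaussian reservoir without feedback) runs the dense O(N^2)-per-step update and the diagonal O(N)-per-step update side by side and verifies that they describe the same trajectory in the eigenbasis:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D_in, T = 30, 1, 60

W = rng.normal(size=(N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius 0.9
W_in = rng.normal(size=(D_in, N))
U = rng.normal(size=(T, D_in))

# One-time O(N^3) preprocessing: W = P diag(Lam) P^{-1}
Lam, P = np.linalg.eig(W)
W_in_P = W_in @ P                     # transformed input weights [W_in]_P

r = np.zeros(N)                       # dense update, O(N^2) per step
r_P = np.zeros(N, dtype=complex)      # diagonal update, O(N) per step, Eq. (15)
for t in range(T):
    r = r @ W + U[t] @ W_in
    r_P = r_P * Lam + U[t] @ W_in_P   # element-wise: no matrix product

# Both trajectories coincide in the eigenbasis: [r(t)]_P = r(t) P
assert np.allclose(r_P, r @ P)
```

The diagonal recurrence uses complex numbers, but as noted below (and in Appendix A of the paper) this does not change the readout cost.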
For what follows, we suppose a diagonal reservoir without feedback weights (W_fb = 0) and with an initial state r(0) = 0.

Lemma 3.

r(t) = Σ_{i=1}^{t} (u(i) W_in) ⊙ Λ^{t−i}

where the power of Λ denotes element-wise exponentiation, not a matrix product.

Proof. By recurrence on t. For t = 0: r(0) = 0 by definition, so the property holds. Assume the property holds for a given t ≥ 0. For t + 1:

r(t+1) = r(t) ⊙ Λ + u(t+1) W_in
       = (Σ_{i=1}^{t} (u(i) W_in) ⊙ Λ^{t−i}) ⊙ Λ + (u(t+1) W_in) ⊙ Λ^0
       = Σ_{i=1}^{t} (u(i) W_in) ⊙ Λ^{t+1−i} + (u(t+1) W_in) ⊙ Λ^{t+1−(t+1)}
       = Σ_{i=1}^{t+1} (u(i) W_in) ⊙ Λ^{t+1−i}

The property holds for t + 1, concluding the proof.

According to this formula, to compute the reservoir state, we first apply W_in to the input signal and then apply Λ. Actually, this can be done in the reverse order. Let R(t) ∈ C^{D_in×N} be a state matrix defined by:

R(0) = 0
R(t) = R(t−1) ⊙ [Λ; ...; Λ] + [u(t)^T, ..., u(t)^T]

Here, the first term broadcasts the row vector Λ across all D_in rows, and the second term repeats the column vector u(t)^T across all N columns to form a D_in × N matrix.

Lemma 4. Let row_d(·) denote the d-th row of a matrix.

row_d(R(t)) = Σ_{i=1}^{t} u(i)_d Λ^{t−i}

Proof. By recurrence on t, for a fixed row d. For t = 0: row_d(R(0)) = 0 by definition, so the property holds. Assume it holds for t ≥ 0. For t + 1:

row_d(R(t+1)) = row_d(R(t)) ⊙ Λ + u(t+1)_d Λ^0
              = (Σ_{i=1}^{t} u(i)_d Λ^{t−i}) ⊙ Λ + u(t+1)_d Λ^0
              = Σ_{i=1}^{t+1} u(i)_d Λ^{t+1−i}

Theorem 5.
The standard reservoir state r(t) can be recovered by element-wise weighting and summation:

r(t) = Σ_{d=1}^{D_in} row_d(W_in) ⊙ row_d(R(t))

Or, more compactly, letting 1 be a column vector of ones of size D_in × 1:

r(t) = 1^T (W_in ⊙ R(t))

Proof. Using the linearity of the summation and detailing the intermediate associative steps:

r(t) = Σ_{i=1}^{t} (u(i) W_in) ⊙ Λ^{t−i}
     = Σ_{i=1}^{t} (Σ_{d=1}^{D_in} u(i)_d row_d(W_in)) ⊙ Λ^{t−i}
     = Σ_{i=1}^{t} Σ_{d=1}^{D_in} (u(i)_d row_d(W_in)) ⊙ Λ^{t−i}
     = Σ_{d=1}^{D_in} Σ_{i=1}^{t} row_d(W_in) ⊙ (u(i)_d Λ^{t−i})
     = Σ_{d=1}^{D_in} row_d(W_in) ⊙ (Σ_{i=1}^{t} u(i)_d Λ^{t−i})
     = Σ_{d=1}^{D_in} row_d(W_in) ⊙ row_d(R(t))
     = 1^T (W_in ⊙ R(t))

3.3.1 Application: Diagonal Linear ESN

Let (Λ ∈ C^N, W_in ∈ C^{D_in×N}, W_out ∈ C^{N×D_out}) be a diagonal linear ESN, fed with an input signal (u(t))_{1≤t≤T}. According to the previous theorem:

y(t) = r(t) W_out = (1^T (W_in ⊙ R(t))) W_out

While this formulation is mathematically exact for any dimensions, optimizing W_out directly from R(t) without fixing W_in first remains complex when D_in > 1 or D_out > 1. However, this establishes a fundamental property: the general temporal dynamics are entirely captured by R(t) and Λ, completely independently of the input weights W_in. This property becomes highly actionable when D_in = 1 and D_out = 1 (see Appendix C).

3.4 Computational Advantages

This approach reduces computational costs after an initial preprocessing step:

• Eigendecomposition: O(N^3) (one-time preprocessing cost), compared to the spectral radius computation, which can be between O(N^2) and O(N^3) as explained above.
• [W_in]_P and [W_fb]_P computation: O(N^2 D_in) and O(N^2 D_out)

• Transformed reservoir step: O(N (D_in + D_out)) (reduced from O(N (c_r N + c_in D_in + c_fb D_out))), with D_in, D_out ≪ N (the reservoir aims to project the inputs into a high-dimensional space).

• Transformed readout step: O(N′ D_out) (unchanged)

The key advantage is the reduction of the reservoir step complexity from O(N^2) to O(N (D_in + D_out)) per time step, making this approach highly efficient despite the initial O(N^3) preprocessing cost. Figure 2 shows the comparison between the standard linear reservoir computation and our optimization, as a function of the number of neurons in the reservoir, for each processing step. We can also note that even with the use of complex numbers (which would double the number of parameters, with a real and an imaginary part), the training of the readout can be performed with real matrices (see Appendix A), as in the standard computation, leading to no additional cost.

Figure 2: Comparison of standard computation and proposed optimizations for different reservoir sizes. (i) Generation Step: evaluates three initialization methods. Normal generates a standard linear reservoir with a weight matrix W. Diagonalization generates a standard W and then diagonalizes it (applicable to EWT/EET in Section 4). DPG directly generates a diagonal matrix (see Section 4). (ii) Reservoir Step: compares the standard computation with the Diagonal one.
Because EWT, EET, and DPG share an identical diagonal structure after the generation step, they are represented together as Diagonal. (iii) Readout Step: displays only a single curve because the computational cost is identical across all methods. With the implementation proposed in Appendix A, the training of the readout can be performed with real matrices, equating the readout cost of the standard method. (Note) For the reservoir and readout steps, the duration displayed is for a single time step. To evaluate the total duration over the whole sequence, these values must be multiplied by the total number of time steps.

4 Methods

This section presents three methods for efficient linear reservoir design and training. The first, Eigenbasis Weight Transformation (EWT), enables efficient inference by transforming pre-trained weights W_out into the eigenbasis without retraining. The second, End-to-End Eigenbasis Training (EET), enables full end-to-end training by directly computing [W_out]_P in the transformed space, which drastically reduces computational cost during training. Finally, Direct Parameter Generation (DPG) eliminates the need for the initial reservoir matrix W by directly constructing its spectral components Λ and eigenvector basis P, and then computing [W_out]_P in the transformed space.

4.1 Training Data Preparation

Given input-output pairs {(u(t), y(t))}_{t=1,...,T}, we construct:

X(t) = [1  y(t−1)  r(t)] ∈ R^{N′}  (extended state)    (16)
Y(t) = [y(t)] ∈ R^{D_out}  (target output)    (17)

where T is the length of the sequence.
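The data preparation of Equations (16)–(17) and the ridge solve of Equation (9) can be sketched as follows. This is a toy next-step prediction setup of our own (no feedback or previous-output features, a washout discarding the initial transient states); every name and parameter value here is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
N, T_total, washout, alpha = 50, 400, 100, 1e-6

W = rng.normal(size=(N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius 0.9
W_in = rng.normal(size=(1, N))
u = np.sin(0.2 * np.arange(T_total + 1))   # scalar input; target is u(t+1)

r = np.zeros(N)
X, Y = [], []
for t in range(T_total):
    r = r @ W + u[t] * W_in[0]
    if t >= washout:                          # discard transient states
        X.append(np.concatenate(([1.0], r)))  # extended state [1, r(t)]
        Y.append([u[t + 1]])
X, Y = np.array(X), np.array(Y)

# Ridge regression, Eq. (9): W_out = (X^T X + alpha I)^{-1} X^T Y
W_out = np.linalg.solve(X.T @ X + alpha * np.eye(N + 1), X.T @ Y)
mse = np.mean((X @ W_out - Y) ** 2)
```

After the washout, the linear reservoir driven by a pure sinusoid settles into a two-dimensional steady state, so the linear readout fits the next-step target almost exactly.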
4.2 Method 1: Eigenbasis Weight Transformation (EWT)

If W_out has already been computed using standard ridge regression:

W_out = (X^T X + αI)^{−1} X^T Y    (18)

then the transformed output weights are (see Theorem 1):

[W_out]_P = [I 0; 0 P^{−1}] W_out    (19)

4.3 Method 2: End-to-End Eigenbasis Training (EET)

If W_out has not already been computed, it is possible to train [W_out]_P directly.

Proposition (Direct Output Weight Computation). Let [X]_P = [1  y(t−1)  [r(t)]_P] ∈ C^{N′} be the concatenated transformed reservoir states. It is then possible to obtain W_out by training [W_out]_P directly:

[W_out]_P = ([X]_P^T [X]_P + α [I 0; 0 P^T P])^{−1} [X]_P^T Y    (20)

4.4 Method 3: Direct Parameter Generation (DPG)

Rather than generating (W, W_in) and then diagonalizing, as in the EET method, this section explores directly generating (Λ, [W_in]_P, P) to avoid the initial diagonalization cost. We call this method Direct Parameter Generation (DPG).

Figure 3: Comparison of eigenvalue distributions in the complex plane. The first column (left) shows the spectrum derived from a standard random reservoir matrix W.
The second column shows the spectrum generated via the Uniform Distribution (Algorithm 1). The third column shows the spectrum generated via the deterministic Golden Distribution method (Algorithm 3) without noise, and the fourth column the spectrum generated via the Noisy Golden Distribution. We observe that the Noisy Golden Distribution (right) achieves a significantly more homogeneous coverage of the unit disk than the Uniform Distribution or its non-noisy counterpart, effectively matching the spectral density of the standard reservoir (left).

4.4.1 Uniform Distribution

The spectral structure of the reservoir weight matrix W plays a fundamental role in determining both the stability and the dynamics of its activity. A well-designed spectrum enables rich temporal dynamics, essential for tasks involving sequence modeling and prediction. Drawing upon insights from random matrix theory (particularly the classical work of Edelman and Kostlan (1995)), we observe that for an N × N random matrix with i.i.d. Gaussian entries, the expected number of real eigenvalues scales as:

E[N_real] ~ sqrt(2N/π)    (21)

We propose an approach to directly construct P and Λ, without the need to generate W, by explicitly engineering the eigenvector matrix P such that its associated eigenvalue spectrum Λ matches a desired distribution with a controlled number of real eigenvalues and complex conjugate pairs. To this end, we introduce two complementary algorithms that respectively generate a valid Λ and P.

Algorithm 1 generates a structured, random eigenvalue spectrum Λ = (λ_1, ..., λ_N) by balancing the number of real eigenvalues and complex conjugate pairs.
Specifically, it ensures that exactly N_real eigenvalues are real and, since W is real, the remaining (N − N_real) eigenvalues form (N − N_real)/2 complex conjugate pairs. To guarantee this property, the algorithm enforces that N_real and N have the same parity (by incrementing N_real by one if necessary), so that N − N_real is always even and the complex eigenvalues can consistently be arranged into conjugate pairs. The radial component of each complex pair is sampled as sr · √U (square root of a uniform distribution), while the angular component is uniformly drawn from [0, π). An example of this distribution is provided in Figure 3.

Algorithm 1 Random generation of the eigenvalues
1: function RandomEigenvalues(N, sr)
2:   N_real ← round(sqrt(2N/π))
3:   if N_real mod 2 ≠ N mod 2 then
4:     N_real ← N_real + 1
5:   end if
6:   N_cpx ← (N − N_real)/2
7:   Sample Λ_real ~ Uniform(−sr, sr), array of size N_real
8:   Sample U ~ Uniform(0, 1), array of size N_cpx
9:   Sample Θ ~ Uniform(0, π), array of size N_cpx
10:  Λ_cpx ← sr · √U · exp(i · Θ)
11:  return Λ_real, Λ_cpx, conj(Λ_cpx)
12: end function

Algorithm 2 Random generation of the eigenvectors
1: function RandomEigenvectors(N, N_real)
2:   P: array of shape (N, N)
3:   for i ← 0 to N_real − 1 do
4:     Sample v ~ Normal(0, 1)
5:     v ← v / ‖v‖_2
6:     P[:, i] ← v
7:   end for
8:   for k ← 0 to (N − N_real)/2 − 1 do
9:     Sample v_R ~ Normal(0, 1)
10:    Sample v_I ~ Normal(0, 1)
11:    v ← (v_R + i·v_I) / ‖v_R + i·v_I‖_2
12:    P[:, N_real + k] ← v
13:    P[:, (N + N_real)/2 + k] ← conj(v)
14:   end for
15:   return P
16: end function

Algorithm 2 constructs a random matrix of eigenvectors P ∈ GL_n(C).
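Algorithms 1 and 2 can be sketched in NumPy as follows (our own rendering of the pseudocode, not the authors' code; the parity adjustment and column layout follow the algorithm boxes above). Because the spectrum and eigenvectors come in conjugate pairs, reconstructing P diag(Λ) P^{−1} yields a real matrix up to round-off:

```python
import numpy as np

def random_eigenvalues(N, sr, rng):
    """Sketch of Algorithm 1: real eigenvalues plus complex-conjugate pairs."""
    n_real = round(np.sqrt(2 * N / np.pi))
    if n_real % 2 != N % 2:      # enforce same parity so N - n_real is even
        n_real += 1
    n_cpx = (N - n_real) // 2
    lam_real = rng.uniform(-sr, sr, n_real)
    # radius sr*sqrt(U) gives uniform density over the disk, angle in [0, pi)
    lam_cpx = (sr * np.sqrt(rng.uniform(0, 1, n_cpx))
               * np.exp(1j * rng.uniform(0, np.pi, n_cpx)))
    return np.concatenate([lam_real, lam_cpx, np.conj(lam_cpx)]), n_real

def random_eigenvectors(N, n_real, rng):
    """Sketch of Algorithm 2: real vectors, then conjugate complex pairs."""
    P = np.empty((N, N), dtype=complex)
    n_cpx = (N - n_real) // 2
    for i in range(n_real):
        v = rng.normal(size=N)
        P[:, i] = v / np.linalg.norm(v)
    for k in range(n_cpx):
        v = rng.normal(size=N) + 1j * rng.normal(size=N)
        v /= np.linalg.norm(v)
        P[:, n_real + k] = v
        P[:, n_real + n_cpx + k] = np.conj(v)
    return P

rng = np.random.default_rng(0)
N, sr = 100, 0.9
Lam, n_real = random_eigenvalues(N, sr, rng)
P = random_eigenvectors(N, n_real, rng)

# Conjugate pairing makes W = P diag(Lam) P^{-1} real up to round-off
W = P @ np.diag(Lam) @ np.linalg.inv(P)
```

In DPG, the explicit reconstruction of W in the last line is never needed at run time; it is shown here only to check that the sampled (Λ, P) pair corresponds to a real reservoir matrix.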
Since W is real, we sample eigenvectors associated with real eigenvalues and with complex-conjugate eigenvalue pairs in different ways. For real eigenvalues, we sample real-valued eigenvectors with entries drawn from a standard Gaussian distribution. For complex-conjugate eigenvalue pairs, we sample the first eigenvector as a complex-valued vector whose real and imaginary parts have entries drawn from a standard Gaussian distribution, and the second one as its conjugate. This sampling ensures that the resulting set of vectors forms a basis, so that P is invertible.

4.4.2 Golden Distribution

Algorithm 3 Random generation of the golden eigenvalues
1: function GoldenEigenvalues(N, sr)
2:   N_real ← round(sqrt(2N/π))
3:   N_real ← N_real + ((N − N_real) mod 2)
4:   N_cpx ← (N − N_real)/2
5:   Sample Λ_real ~ Uniform(−1, 1), array of size N_real
6:   Λ_cpx: array of size N_cpx
7:   i ← 0; k ← 0
8:   v ← Uniform(0, 2)
9:   while i < N_cpx do
10:    k ← k + 1
11:    v ← (v + 3 − √5) mod 2
12:    if v < 1 then
13:      Λ_cpx(i) ← sqrt(k / (2 N_cpx)) · exp(iπv)
14:      i ← i + 1
15:    end if
16:  end while
17:  λ ← sr / max(|Λ_real|, |Λ_cpx|)
18:  Λ_real ← λ · Λ_real
19:  Λ_cpx ← λ · Λ_cpx
20:  Sample Noise ~ Normal(0, σ) + i · Normal(0, σ), array of size N_cpx
21:  Λ_cpx ← Λ_cpx + Noise
22:  return Λ_real, Λ_cpx, conj(Λ_cpx)
23: end function

Algorithm 3 introduces a deterministic variation for generating the eigenvalue spectrum, aiming to improve the coverage of the complex plane compared to the Uniform Distribution. While the partition between real and complex eigenvalues follows the same statistical scaling as Algorithm 1 (where N_real ≈ sqrt(2N/π)), the placement of complex eigenvalues follows a deterministic sequence based on the Golden Ratio. Specifically, the complex eigenvalues are arranged in a phyllotaxis spiral pattern.
The angular position is updated using a step derived from the Golden Angle (via the term $3 - \sqrt{5}$), while the magnitude scales with the square root of the index to ensure a constant density across the unit disk. This approach mitigates the clustering and gaps often observed with pseudo-random generators, providing a more homogeneous spectral structure. An example of this distribution is provided in Figure 3.

5 Experiments and Results

In this section, we evaluate our proposed methods on two standard benchmarks: the Multiple Superimposed Oscillators (MSO) and Memory Capacity (MC). Our experiments demonstrate that our approaches (notably Direct Parameter Generation (DPG) combined with the deterministic Golden distribution) achieve equivalent or superior performance compared to standard baselines, while also exposing the structural limitations of matrix diagonalization at extremely low connectivity levels.

5.1 Multiple Superimposed Oscillators

Figure 4: Illustration of the Multiple Superimposed Oscillators (MSO) time series for the $K = 5$ task (MSO5). The target signal $U_5(t)$ is generated by summing five distinct sinusoidal components. The complete sequence of 1000 time steps is partitioned into a training part of 400 steps (blue), which includes an initial 100-step washout used to discard transient reservoir dynamics (dashed blue), a validation part of 300 steps (green) for hyperparameter tuning, and a final test part of 300 steps (pink) to evaluate predictive performance.

The Multiple Superimposed Oscillators (MSO) task (Gallicchio et al., 2017; Koryakin et al., 2012) is a well-established benchmark used to evaluate the ability of recurrent networks to capture and reproduce complex, multi-frequency temporal dynamics.
It consists of a time series generated by summing $K$ discrete-time sinusoidal components:

$$U_K(t) = \sum_{k=1}^{K} \sin(\alpha_k t) \qquad (22)$$

where $K$ denotes the total number of superimposed sinusoids and $\alpha_k$ their respective angular frequencies. Following the setup introduced by Gallicchio et al. (2017), we use a fixed set of 12 distinct frequencies for the $\alpha_k$ parameters: $\alpha_1 = 0.2$, $\alpha_2 = 0.331$, $\alpha_3 = 0.42$, $\alpha_4 = 0.51$, $\alpha_5 = 0.63$, $\alpha_6 = 0.74$, $\alpha_7 = 0.85$, $\alpha_8 = 0.97$, $\alpha_9 = 1.08$, $\alpha_{10} = 1.19$, $\alpha_{11} = 1.27$, $\alpha_{12} = 1.32$.

To systematically evaluate the performance of our models across varying levels of signal complexity, we define 12 distinct sequential tasks based on the number of components $K \in \{1, 2, \ldots, 12\}$. For a given task $K$, the target signal $U_K$ is the sum of the first $K$ sinusoids. The network is trained to predict the next time step $U_K(t+1)$ given the current ground-truth input $U_K(t)$.

For each of the 12 tasks, the generated time series is partitioned into three distinct periods: a training set of $T_{\text{train}} = 400$ time steps, a validation set of $T_{\text{valid}} = 300$ time steps, and a test set of $T_{\text{test}} = 300$ time steps. To mitigate the impact of the arbitrary initial state of the reservoir, the first 100 steps of the training phase are discarded as a washout (or warm-up) and are not used to compute the readout weights.

To ensure the statistical robustness of our comparative analysis, the evaluation for each task $U_K$ is averaged over 10 seeds. For each test, an exhaustive grid search is conducted to determine the optimal configuration. The complete set of hyperparameters considered is detailed in Table 1. Furthermore, by leveraging Theorem 5, we significantly accelerate the hyperparameter search for all diagonal reservoir methods.
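For concreteness, the MSO input of Eq. (22) with the 12 frequencies above can be generated in a few lines; a minimal sketch with illustrative names, not the authors' code:

```python
import numpy as np

# The 12 angular frequencies from Gallicchio et al. (2017).
ALPHAS = [0.2, 0.331, 0.42, 0.51, 0.63, 0.74,
          0.85, 0.97, 1.08, 1.19, 1.27, 1.32]

def mso_series(K, T=1000):
    """U_K(t) = sum_{k=1}^K sin(alpha_k * t), Eq. (22), for t = 0..T-1."""
    t = np.arange(T)
    return np.sum([np.sin(a * t) for a in ALPHAS[:K]], axis=0)

# One-step-ahead prediction pairs: input U_K(t), target U_K(t+1).
u = mso_series(5)
x_in, y_target = u[:-1], u[1:]
```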
Because the reservoir states can be computed independently of the input scaling factor, we only need to collect these states once per seed rather than recomputing them for each of the three input_scaling values evaluated in our grid search. This mathematical property effectively divides the state-computation time by a factor of three compared to the standard Normal baseline.

Table 1: Hyperparameter values considered for model selection on the MSO tasks, reported from Gallicchio et al. (2017)

Hyperparameter                         Values considered
Reservoir size (N)                     100
Input scaling (scale_in)               0.01, 0.1, 1
Leaking rate (lr)                      0.1, 0.3, 0.5, 0.7, 0.9, 1.0
Spectral radius (ρ)                    0.1, 0.3, 0.5, 0.7, 0.9, 1.0
Ridge regularization (α)               10^-11, 10^-10, ..., 10^0

Table 2: Performance comparison on the Multiple Superimposed Oscillators (MSO) across 10 independent seeds. The metric is the Root Mean Square Error (RMSE); the best score per task is marked with *. Normal corresponds to a standard W, Diagonalized to the EET method; the others correspond to the DPG method with different initialization strategies.

Task    Normal      Diagonalized  Uniform Dist.  Golden Dist. (σ=0)  Noisy Golden Dist. (σ=0.2)  Sim Dist.
MSO1    1.65e-14    1.58e-14*     5.85e-14       2.49e-14            4.77e-14                    3.56e-14
MSO2    2.55e-13    2.78e-13      2.28e-13       1.45e-13*           2.39e-13                    2.44e-13
MSO3    5.42e-12    9.14e-12      4.49e-12*      9.07e-12            6.14e-12                    8.37e-12
MSO4    1.39e-10    5.77e-10      3.64e-10       7.22e-11            6.93e-11*                   2.28e-10
MSO5    2.75e-09    4.03e-08      2.95e-08       5.24e-10*           1.63e-09                    1.87e-08
MSO6    7.38e-09*   2.54e-08      2.07e-08       1.07e-08            1.18e-08                    6.16e-08
MSO7    2.96e-08*   9.48e-08      7.16e-08       5.98e-08            5.36e-08                    8.55e-08
MSO8    2.75e-08*   9.68e-08      3.57e-07       1.15e-07            6.44e-08                    1.41e-07
MSO9    4.98e-08*   2.69e-07      4.33e-07       1.69e-07            1.03e-07                    1.63e-07
MSO10   4.65e-07    3.32e-07      4.15e-07       2.31e-07            1.61e-07*                   2.73e-07
MSO11   5.62e-07    7.38e-07      1.85e-06       7.49e-07            3.16e-07*                   3.71e-07
MSO12   9.71e-07    2.98e-06      1.34e-06       1.01e-06            8.44e-07*                   2.63e-06

The results presented in Table 2 show an interesting distribution of the best scores across the evaluated methods. This diversity suggests that our optimization approaches, particularly those based on Direct Parameter Generation (DPG) with the Golden Distribution, perform at least as well as the Normal method (standard linear ESN). The slight RMSE variations from one method to another are naturally explained by the structural differences in the weights: since the DPG approach bypasses the generation of an explicit matrix $W$, the eigenvalue spectrum and the eigenvector configuration inevitably differ from those of a standard random matrix.

Across the 12 MSO tasks, the Noisy Golden method ($\sigma = 0.2$) achieves the best score 4 times (MSO 4, 10, 11, 12), on par with the Normal baseline, which also dominates 4 tasks (MSO 6, 7, 8, 9). If we strictly isolate these two methods for a direct comparison, they are in a perfect tie (6 to 6). This strong competitiveness indicates that our strategy of generating a noisy "Golden" spectrum constitutes a highly relevant alternative to classical initialization.
Conversely, the Sim method is the only one that never attains the top rank, although it closely follows the general trend with scores near those of the leaders. Furthermore, the comparison among DPG methods reveals that adding noise to the Golden Distribution consistently outperforms the strict Golden Distribution. This phenomenon can be understood intuitively: visually and statistically, the injected noise smooths the deterministic Golden structure and much more faithfully mimics the natural empirical eigenvalue distribution of a true random matrix.

These results validate the DPG approach paired with a Noisy Golden Distribution. They empirically confirm that it is possible to directly generate a structured eigenvalue distribution, combined with random eigenvectors, and guarantee a performance level equivalent, and occasionally superior, to the standard method.

Figure 5: Visualization of spectral importance via readout weights on the MSO task. The reservoir eigenvalues are plotted in the complex plane, with marker size proportional to the absolute value of the corresponding weight in $W_{\text{out}}$ (normalized between 0 and 1). Larger points identify the specific eigenvalues that contribute most significantly to the model's prediction for this task. The points are identical in the left and right panels; only their sizes differ. In the right panel, points close to the unit circle are no longer visible, which may create the illusion that the point distribution has contracted, but this is not the case.
5.2 Memory Capacity

The Memory Capacity (MC), introduced by Jaeger (2001), is a fundamental metric used to quantify the short-term memory of a recurrent neural network. It measures the reservoir's ability to reconstruct past continuous inputs from its current internal state. For an independent and identically distributed (i.i.d.) input sequence $u(t)$ and a trained output unit $y_k(t)$, the determination coefficient for a specific delay $k$ is given by the squared correlation coefficient:

$$d(u(t-k), y_k(t)) = \frac{\operatorname{cov}^2(u(t-k), y_k(t))}{\sigma^2(u(t))\,\sigma^2(y_k(t))} \qquad (23)$$

The $k$-delay short-term memory capacity of the network is then formally defined as the maximum of this determination coefficient optimized over the output weights $w^{\text{out}}_k$:

$$MC_k = \max_{w^{\text{out}}_k} d[w^{\text{out}}_k](u(t-k), y_k(t)) \qquad (24)$$

In our experiments, we utilize this established task to test the practical memory capacity of our methods. We use reservoirs without a leaking rate and with the spectral radius strictly set to 1. We then evaluate both the $MC_k$ scores and the computational duration of the different initialization methods across various values of reservoir size $N$ and requested delay $k$.

Figure 6: Memory capacity as a function of delay across different reservoir sizes. The four subplots illustrate the memory capacity evaluated against the delay for reservoir sizes N = 100, 300, 600, and 1000. As expected, increasing the reservoir size brings an overall improvement in memory capacity.
The performance of four different methods is compared: the standard Echo State Network with a classic weight matrix $W$ (Normal), Direct Parameter Generation (DPG) utilizing a Uniform Distribution, DPG utilizing the deterministic Golden Distribution, and DPG utilizing the Sim Distribution, which uses eigenvalues extracted from a standard random matrix $W$ combined with randomly generated eigenvectors $P$.

Observing the results, the Uniform method exhibits a more balanced degradation profile than the Normal baseline, though the two remain roughly equivalent and intersect near a memory capacity of 0.5. Notably, the Golden Distribution initialization systematically outperforms the standard Normal method across all reservoir sizes. Conversely, the Sim Distribution curve closely follows the Normal baseline, suggesting that the eigenvectors play a secondary role compared to the eigenvalues, even though it shows a small but consistent underperformance.

Figure 7: Memory capacity as a function of reservoir connectivity. The four subplots illustrate the memory capacity evaluated across varying levels of matrix connectivity for reservoir sizes N = 100, 300, 600, and 1000.
To ensure a task of moderate and proportional difficulty, the requested delay for each configuration is chosen so that the memory capacity equals 0.5 when the connectivity is one (the corresponding value was read from Figure 6). The graphs compare the performance of the standard weight-matrix approach (Normal) against its diagonalized counterpart (Diagonalization, corresponding to the EET and EWT methods), with a secondary axis tracking their absolute performance gap (Difference). As expected, both methods experience a severe drop in memory capacity at extremely low connectivity levels; as the matrix $W$ becomes very sparse, the network loses the dimensionality required to maintain temporal representations. Moreover, the results highlight a specific minimum connectivity threshold, which depends on the reservoir size. Below this connectivity, the Diagonalization method begins to systematically underperform the Normal baseline. At such extreme sparsity, the standard matrix $W$ lacks sufficient structural complexity for an effective eigendecomposition, causing the eigenspectrum to collapse and lose essential dimensional richness. Above this minimal threshold, the diagonalized approach successfully matches the standard baseline's performance.

6 Conclusion

Recent work in reservoir computing has highlighted two complementary trends that motivate the present study. On the one hand, several contributions have stressed the need to reduce the computational cost of reservoir-based models, either by simplifying the dynamics or by replacing recurrent computations with alternative feature constructions (Gauthier et al., 2021).
On the other hand, a number of theoretical and empirical studies have shown that linear reservoirs already possess strong dynamical and memory-related properties, and can constitute competitive building blocks for temporal processing (Dambre et al., 2012; Gonon and Ortega, 2019; Voelker et al., 2019; Tino, 2020; Verzelli et al., 2021). In this context, our objective is not to revisit the modeling capabilities of linear reservoirs, but to revisit their implementation, and to investigate how their recurrent dynamics can be preserved while being expressed in a computationally more suitable form.

In this study, we showed that the dynamics of Linear Echo State Networks can be reformulated through matrix diagonalization, turning costly recurrent updates into efficient pointwise operations. This diagonalization-based optimization preserves the predictive power of standard linear ESNs while reducing computational complexity from $O(N^2)$ to $O(N)$. Beyond the theoretical speedup, our empirical results demonstrate that both the Eigenbasis Weight Transformation (EWT) and End-to-End Eigenbasis Training (EET) retain numerical stability and accuracy, with negligible differences from standard linear ESN outputs. We further introduced the Direct Parameter Generation (DPG) approach, which bypasses diagonalization entirely by directly sampling eigenvalues and eigenvectors, allowing better control of the eigenvalues that govern the reservoir dynamics, while replicating or even outperforming the behavior of classical Linear ESNs.

Our investigation into the diagonalization of Linear ESNs reveals that the behavior of these models is almost entirely governed by their spectral properties.
The success of the Direct Parameter Generation (DPG) approach, which bypasses the generation of an explicit weight matrix $W$ to directly sample eigenvalues, suggests that the explicit matrix structure of the reservoir is secondary to the distribution of its eigenvalues. This finding aligns with the observation that structured spectral coverages, such as the deterministic "Golden" distribution, can yield performance superior to purely stochastic initializations, particularly for tasks requiring long-term memory, though its superiority for tasks with different dynamical requirements remains an open question.

This spectral perspective is further reinforced by analyzing the trained readout weights in the eigenbasis. As illustrated in Figure 5, we observe a strong heterogeneity in the contribution of spectral components: only a subset of eigenvalues is associated with large output weights, indicating that the readout mechanism naturally selects the specific frequency components or decay rates relevant to the task. This fundamental reliance on the spectrum is mathematically corroborated by our demonstration that the input weights $W_{\text{in}}$ can be applied entirely after the recurrent updates. Since the temporal states depend strictly on the eigenvalues $\Lambda$, this proves that the memory and dynamical richness of the network are governed by its spectral configuration. Consequently, DPG positions itself as a rational approach to reservoir initialization rather than a mere computational convenience. It allows for the explicit selection of spectral components, shifting the paradigm from arbitrary matrix generation to controlled spectral initialization.

While our method offers significant advantages, it is important to note that the diagonalization of a pre-existing matrix (EWT) relies on the assumption that $W$ is diagonalizable and incurs a one-time $O(n^3)$ preprocessing cost.
Although this cost is amortized over long sequences, the DPG framework elegantly circumvents this bottleneck by avoiding matrix diagonalization entirely, making it highly scalable.

Finally, the independent evolution of state variables in the eigenbasis has implications beyond sequential efficiency. Since each variable evolves independently, our formulation allows for the parallelization of state updates across the temporal dimension (see Appendix B). This property mirrors recent advances in state-space models like Mamba (Gu and Dao, 2024) or parallelized LMUs (Chilkuri and Eliasmith, 2021), suggesting that diagonalized Linear ESNs could leverage similar hardware optimizations. For future applications, it could be interesting to adapt the methods to a nonlinear readout (Gonon and Ortega, 2019).

These contributions suggest a more efficient use of linear reservoirs, avoiding unnecessary computations, and a paradigm shift in linear reservoir initialization towards the direct selection of spectral distributions. In future work, it would be interesting to explore the various possible distributions for the eigenvalues, in order to systematically investigate their impact on reservoir dynamics and model performance.

Appendix A. Practical Implementation: Memory Reinterpretation

To maximize computational efficiency, we implement a hybrid approach: the reservoir update is performed by reinterpreting part of memory as complex for element-wise operations. Conversely, the training and readout steps remain strictly in the real domain to avoid supplementary complexity.

A.1 Basis Construction

Assume the reservoir matrix $W \in \mathbb{R}^{n \times n}$ is diagonalizable. Let $P \in \mathrm{GL}_n(\mathbb{C})$ be the matrix of eigenvectors and $\Lambda$ the diagonal matrix of eigenvalues, partitioned as follows:

• $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_{n_r}, \mu_1, \bar{\mu}_1, \ldots, \mu_{n_i}, \bar{\mu}_{n_i})$, where $\lambda_j \in \mathbb{R}$ and $\mu_k \in \mathbb{C} \setminus \mathbb{R}$.
• $P = [u_1, \ldots, u_{n_r}, v_1, \bar{v}_1, \ldots, v_{n_i}, \bar{v}_{n_i}]$ is the associated matrix of eigenvectors.
• $\Lambda_{\text{real}} = (\lambda_1, \ldots, \lambda_{n_r})$ and $\Lambda_{\text{cpx}} = (\mu_1, \ldots, \mu_{n_i})$.

Let us define $Q$ as:

$$Q = [u_1, \ldots, u_{n_r}, \operatorname{Re}\{v_1\}, \operatorname{Im}\{v_1\}, \ldots, \operatorname{Re}\{v_{n_i}\}, \operatorname{Im}\{v_{n_i}\}]$$

Applying Theorem 1 with the change-of-basis matrix $Q$ yields the transformed system. However, in the $Q$-basis, the reservoir matrix $[W]_Q = Q^{-1} W Q$ is no longer diagonal; it is block-diagonal. Performing the update $[r(t-1)]_Q [W]_Q$ would typically require $2 \times 2$ matrix multiplications for the complex components, losing the efficiency of the diagonal representation.

A.2 The Memory View Trick

To bypass the computational bottleneck of block-diagonal matrix multiplications, we propose an efficient implementation based on memory reinterpretation. The core idea is to process the state update in the complex domain, where $[W]_P$ is purely diagonal, but to project it back into the real $Q$-basis for the readout phase. Since modern numerical libraries store complex numbers as pairs of contiguous real floating-point values, we treat the exact same physical memory block as either real or complex depending on the operation context. This allows us to perform the complex part of the reservoir update in the "complex world" and the real part of the reservoir update in the "real world" with fast element-wise operations ($\odot$), while keeping the readout training only in the "real world." While processing the state $r$ entirely in complex arithmetic would normally double the number of parameters required for the readout (due to separate real and imaginary weights), our approach guarantees that the readout cost remains perfectly equivalent to that of a standard real-valued ESN.
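A minimal numpy sketch of this reinterpretation (illustrative names; it assumes float64 storage, so that two consecutive reals alias one complex128, and that the complex block starts at an aligned offset):

```python
import numpy as np

def reservoir_step(r_q, lam_real, lam_cpx, u_t, w_in_q):
    """One state update in the Q-basis via the memory-view trick:
    the first n_r entries of the real buffer are scaled by the real
    eigenvalues; the remaining 2*n_i entries are reinterpreted in
    place as n_i complex numbers and scaled by the complex
    eigenvalues; the input term is then accumulated in the real
    domain."""
    n_r = lam_real.size
    r_q[:n_r] *= lam_real                        # "real world" update
    r_q[n_r:].view(np.complex128)[:] *= lam_cpx  # same memory, complex view
    r_q += u_t * w_in_q                          # input projection, real basis
    return r_q
```

The key point is that `view(np.complex128)` reinterprets the same buffer without copying, so the complex multiplication writes its real and imaginary parts straight back into the real state vector used by the readout.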
We define $[r]^{\text{real}}_Q$ and $[r]^{\text{cpx}}_Q$ as two different slices of the memory block $[r]_Q$, with $[r]^{\text{cpx}}_Q$ interpreted either as a list of pairs of real numbers or as a list of complex numbers (containing real and imaginary parts):

• $[r]^{\text{real}}_Q = [r]_Q[1 : n_r]$
• $[r]^{\text{cpx}}_Q = [r]_Q[n_r + 1 : n].\mathrm{view}(\mathbb{C})$

Using this dual representation, the reservoir state update can be computed efficiently through separate element-wise multiplications for the real and complex partitions, before accumulating the input and feedback terms in the real domain.

Reservoir Update Step:

$$[r]^{\text{real}}_Q \leftarrow [r]^{\text{real}}_Q \odot \Lambda_{\text{real}}$$
$$[r]^{\text{cpx}}_Q \leftarrow [r]^{\text{cpx}}_Q \odot \Lambda_{\text{cpx}}$$
$$[r]_Q \leftarrow [r]_Q + u(t)[W_{\text{in}}]_Q + y(t-1)[W_{\text{fb}}]_Q$$

Proof. We define the transformation matrix $\mathcal{Z} = \operatorname{diag}(I_{n_r}, Z, \ldots, Z)$, where $Z$ is the change-of-basis matrix mapping a complex-conjugate pair to its real and imaginary parts:

$$Z = \frac{1}{2}\begin{pmatrix} 1 & -i \\ 1 & i \end{pmatrix}$$

Then $Q = P\mathcal{Z}$. By definition of the transformation bases, the relation $[r]_Q = rQ = rP\mathcal{Z} = [r]_P \mathcal{Z}$ always holds. We can expand this explicitly:

$$[r]_Q = \left[\, [r]^{(\lambda_1)}_P \;\cdots\; [r]^{(\lambda_{n_r})}_P \;\; \operatorname{Re}\{[r]^{(\mu_1)}_P\} \;\; \operatorname{Im}\{[r]^{(\mu_1)}_P\} \;\cdots\; \operatorname{Re}\{[r]^{(\mu_{n_i})}_P\} \;\; \operatorname{Im}\{[r]^{(\mu_{n_i})}_P\} \,\right]$$

When applying the memory views, the data partitions correspond exactly to the real eigenvalues and the complex eigenvalues in the original $P$-basis:

$$[r]^{\text{real}}_Q \stackrel{\text{def}}{=} ([r]_Q)[1 : n_r] = \left[\, [r]^{(\lambda_1)}_P \;\cdots\; [r]^{(\lambda_{n_r})}_P \,\right]$$
$$[r]^{\text{cpx}}_Q \stackrel{\text{def}}{=} ([r]_Q)[n_r + 1 : n].\mathrm{view}(\mathbb{C}) = \left[\, [r]^{(\mu_1)}_P \;\cdots\; [r]^{(\mu_{n_i})}_P \,\right]$$

Because the memory perfectly aligns with the real and imaginary components of the $P$-basis projection, performing the element-wise multiplications $[r]^{\text{real}}_Q \leftarrow [r]^{\text{real}}_Q \odot \Lambda_{\text{real}}$ and $[r]^{\text{cpx}}_Q \leftarrow [r]^{\text{cpx}}_Q \odot \Lambda_{\text{cpx}}$ on the specific views is strictly equivalent to performing the full complex element-wise multiplication in the original $P$-basis:

$$[r]_P \leftarrow [r]_P \odot \Lambda \qquad (25)$$

Reverting to matrix-multiplication notation, this operation directly corresponds to the state transition in the $Q$-basis:

$$[r]_Q \leftarrow [r]_Q [W]_Q \qquad (26)$$

Finally, by accumulating the real-valued input and feedback projections directly onto the real memory block $[r]_Q$, the complete reservoir step is correctly and efficiently evaluated as:

$$[r]_Q \leftarrow [r]_Q [W]_Q + u(t)[W_{\text{in}}]_Q + y(t-1)[W_{\text{fb}}]_Q \qquad (27)$$

A.3 Readout and Training

While the reservoir step partly exploits complex arithmetic to achieve optimal parallel efficiency, the readout and training are performed entirely in the $Q$-basis. Following Theorem 1 (iii) and (iv), the readout weights are learned via:

$$y(t) = W_{\text{out,bias}} + y(t-1) W_{\text{out,out}} + [r(t)]_Q [W_{\text{out,res}}]_Q \qquad (28)$$

$$[W_{\text{out}}]_Q = \left( [X]^T_Q [X]_Q + \alpha \begin{pmatrix} I & 0 \\ 0 & Q^T Q \end{pmatrix} \right)^{-1} [X]^T_Q Y \qquad (29)$$

Appendix B. Parallelization

We focus on Linear ESNs instead of nonlinear ESNs because the sequential computation in nonlinear ESNs is difficult to parallelize: each state $x(t)$ is obtained by applying a nonlinear transformation to the previous state $x(t-1)$. One can formally unroll the dynamics to express $x(t)$ directly in terms of past inputs, but this does not by itself enable parallel computation. To parallelize across time, the unrolled expression would need to be algebraically simplified to eliminate intermediate states. For generic nonlinear activations, however, such a simplification is not generally available.
With polynomial activ ations, the size of the re- sulting closed-form expression can gro w exponentially with sequence length, making this approac h impractical. Even for a simple activ ation function like ReLU, Montúfar et al. (2014) demonstrated that it can lead to an exponential n umber of linear regions, thereb y constraining the potential for algebraic simplification in non-linear mo dels. Ho wev er, with linear reserv oirs, eac h state can b e expressed as: x ( t ) = t X i =1 u ( i ) W in W t − i (30) where the con tribution of each input u i to state x t can b e computed indep endently . This indep endence allows each input’s influence (or echo) to b e ev aluated separately at ev ery time step, facilitating extensive parallelization on modern GPU arc hitectures and leading to significan t sp eedups, particularly for long sequences. App endix C. Sp ecial case of D in = D out = 1 This section follows the reasoning presen t in section 3.3.1. Supp ose no w that the dimension of the input signal and the dimension of the output signal are D in = D out = 1 . Since D in = 1 , R(t) is effectively a row vector 1 × N . 26 Linear Reser voir Optimiza tion y ( t ) = r ( t ) w out = ( W in ⊙ R ( t )) w out = N X p =1 R ( t ) p ( w in ) p ( w out ) p = R ( t )( w T in ⊙ w out ) Let L b e the ob jectiv e loss function (e.g., Mean Squared Error) and Y ( t ) the truth signal. The standard readout w out is t ypically trained such that: w out = argmin β ∈ C N × 1 L  ( r ( t ) β , Y ( t )) 1 ≤ t ≤ T  Note: If an L2 p enalt y (Ridge regression) is applied to regularize the readout, changing the parameterization alters the problem. P enalizing the transformed comp osite weigh ts instead of ∥ w out ∥ 2 will yield a differen t effective regularization, meaning standard Ridge regression is not p erfectly equiv alent under this transformation. How ev er, for unregularized optimization, the follo wing holds: Theorem 6. 
If no weight in w in is 0, then the optimization c an b e p erforme d dir e ctly on the unweighte d states R ( t ) : γ ∗ = argmin γ ∈ C N × 1 L  ( R ( t ) γ , Y ( t )) 1 ≤ t ≤ T  Onc e the optimal γ ∗ is found, the original output weights c an b e uniquely r e c over e d via w out = γ ∗ ⊘ w T in , wher e ⊘ denotes element-wise division. Pro of Let γ = w T in ⊙ w out . Since w in con tains no zeros, there is a bijection b et ween w out and γ . The output equation b ecomes: y ( t ) = R ( t )( w T in ⊙ w out ) = R ( t ) γ Th us, we can optimize for γ directly using the states R ( t ) , whic h do not dep end on w in . Under these h yp otheses, the effectiv e readout w eights can be learned without explicitly instan tiating w in during the state collection phase. A c knowledgmen ts and Disclosure of F unding A ckno wledgmen ts 27 de Coudenhove, Bendi-Ouis, Strock, Hinaut References Narsimha Reddy Chilkuri and Chris Eliasmith. Parallelizing legendre memory unit training. In In ternational conference on machine learning, pages 1898–1907. PMLR, 2021. Joni Dambre, Da vid V erstraeten, Benjamin Sc hrauw en, and Serge Massar. Information pro cessing capacity of dynamical systems. Scientific rep orts, 2(1):514, 2012. Alan Edelman and Eric Kostlan. How many zeros of a random p olynomial are real? Bulletin of the American Mathematical So ciety, 32(1):1–37, 1995. Claudio Gallicc hio, Alessio Micheli, and Luca P edrelli. Hierarc hical temp oral represen ta- tion in linear reserv oir computing. In Italian W orkshop on Neural Nets, pages 119–129. Springer, 2017. Daniel J Gauthier, Erik Bollt, Aaron Griffith, and W endson AS Barb osa. Next generation reserv oir computing. Nature communications, 12(1):5564, 2021. Luk as Gonon and Juan-P ablo Ortega. Reserv oir computing universalit y with stochastic inputs. IEEE transactions on neural net works and learning systems, 31(1):100–112, 2019. Alb ert Gu and T ri Dao. Mam ba: Linear-time sequence mo deling with selective state spaces. 
In First Conference on Language Modeling, 2024.

Herbert Jaeger. Short term memory in echo state networks. 2001.

Danil Koryakin, Johannes Lohmann, and Martin V. Butz. Balanced echo state networks. Neural Networks, 36:35-45, 2012.

Richard B. Lehoucq, Danny C. Sorensen, and Chao Yang. ARPACK Users' Guide: Solution of Large-Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods. SIAM, 1998.

Guido Montúfar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceedings.neurips.cc/paper_files/paper/2014/file/fa6f2a469cc4d61a92d96e74617c3d2a-Paper.pdf.

Robert Simon Fong, Boyu Li, and Peter Tiňo. Linear simple cycle reservoirs at the edge of stability perform Fourier decomposition of the input driving signals. Chaos: An Interdisciplinary Journal of Nonlinear Science, 35(4), 2025.

Peter Tino. Dynamical systems as temporal feature spaces. Journal of Machine Learning Research, 21(44):1-42, 2020.

Pietro Verzelli, Cesare Alippi, Lorenzo Livi, and Peter Tiňo. Input-to-state representation in linear reservoirs dynamics. IEEE Transactions on Neural Networks and Learning Systems, 33(9):4598-4609, 2021.

Aaron Voelker, Ivana Kajić, and Chris Eliasmith. Legendre memory units: Continuous-time representation in recurrent neural networks. Advances in Neural Information Processing Systems, 32, 2019.
