Analysis of Wide and Deep Echo State Networks for Multiscale Spatiotemporal Time Series Forecasting

Zachariah Carmichael, Humza Syed, Dhireesha Kudithipudi
Neuromorphic AI Lab, Rochester Institute of Technology, Rochester, New York
{zjc2920, hxs7174, dxkeec}@rit.edu

ABSTRACT
Echo state networks are computationally lightweight reservoir models inspired by the random projections observed in cortical circuitry. As interest in reservoir computing has grown, networks have become deeper and more intricate. While these networks are increasingly applied to nontrivial forecasting tasks, there is a need for comprehensive performance analysis of deep reservoirs. In this work, we study the influence of partitioning neurons given a budget and the effect of parallel reservoir pathways across different datasets exhibiting multi-scale and nonlinear dynamics.

KEYWORDS
Echo state networks (ESNs), time series forecasting, reservoir computing, recurrent neural networks

ACM Reference Format:
Zachariah Carmichael, Humza Syed, and Dhireesha Kudithipudi. 2019. Analysis of Wide and Deep Echo State Networks for Multiscale Spatiotemporal Time Series Forecasting. In Neuro-inspired Computational Elements Workshop (NICE '19), March 26–28, 2019, Albany, NY, USA. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3320288.3320303

1 INTRODUCTION
Recurrent neural networks (RNNs) have recently shown advances on a variety of spatiotemporal tasks due to their innate feedback connections. One of the most commonly used RNNs today is the long short-term memory (LSTM) network [18]. LSTMs are widely employed for solving spatiotemporal tasks and demonstrate high accuracy. However, these networks are generally prone to expensive computations that often lead to long training times. A computationally lightweight approach to address spatiotemporal processing is reservoir computing (RC), which is a biologically-inspired framework for neural networks.
RC networks comprise hidden layers, referred to as reservoirs, that consist of pools of neurons with fixed random weights and sparse random connectivity. Echo state networks (ESNs) [22] and liquid state machines (LSMs) [30] are the two major types of RC. Both architectures make use of sparse random connectivity between neurons to mimic an intrinsic form of memory, as well as to enable rapid training, as training occurs only within the readout layer.

NICE '19, March 26–28, 2019, Albany, NY, USA. © 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM. ACM ISBN 978-1-4503-6123-1/19/03. https://doi.org/10.1145/3320288.3320303

Figure 1: Various Mod-DeepESN topologies. [Figure omitted: three topologies (a)–(c), each mapping the input u(t) to the output y(t) through one or more reservoirs.]

Whereas ESNs are rate-based models, liquid state machines (LSMs) are spike-based. The focus of this work is primarily on ESNs.

In general, ESNs have been shown to perform well on small spatiotemporal tasks but underperform as task complexity increases. Prior literature has shown that ESNs are capable of various functions, such as speech processing, EEG classification, and anomaly detection [24, 44]. In recent literature, several groups have begun to study how these networks can cope with increasingly complex time series tasks with dynamics across multiple scales and domains [5, 14, 29, 32].
One technique to enhance ESNs is the addition of reservoir layers. These networks are referred to as deep ESNs, which provide a hierarchical framework for feature extraction and for capturing nontrivial dynamics while maintaining the lightweight training of a conventional ESN. Ma et al. introduced the Deep-ESN architecture, which utilizes a sequence of multiple reservoir layers and unsupervised encoders to extract intricacies of temporal data [29]. Gallicchio et al. proposed an architecture, named DeepESN, that utilizes Jaeger et al.'s leaky-integrator neurons in a deeper ESN architecture [11, 24].

In our previous work, we introduced the Mod-DeepESN, a modular architecture that allows for varying topologies of deep ESNs [6]. Intrinsic plasticity (IP) primes neurons to contribute more equally towards predictions and improves the network's performance. A network with a wide and layered topology was able to achieve a lower root-mean-squared error (RMSE) than other ESN models on a daily minimum temperature series [20] and the Mackey-Glass time series. In this paper, we propose alternative design techniques to enhance the Mod-DeepESN architecture and provide the comprehensive analysis required to understand why these networks perform well.

2 ARCHITECTURE
A framework under the RC paradigm, ESNs are RNNs which comprise one or more reservoirs with a population of rate-based neurons. Reservoirs maintain a set of random untrained weights and exhibit a nonlinear response when the network is driven by an input signal. The state of each reservoir is recorded over the duration of an input sequence, and a set of output weights is trained on the states based on a teacher signal. As the output computes a simple linear transformation, there is no need for expensive backpropagation of error throughout the network as is required for training RNNs in the deep learning paradigm.
Thus, the vanishing gradient problem is avoided while the network is still able to capture complex dynamics from the input data. Vanilla ESNs comprise a single reservoir and have limited application, especially with data exhibiting multi-scale and highly nonlinear dynamics. To this end, various architectures have been proposed with multiple reservoirs, additional projections, autoencoders, plasticity mechanisms, etc. [5, 6, 12, 14, 29, 32].

Building off of [6, 12], we introduce the flexible Mod-DeepESN architecture, which maintains parameterized connectivity between reservoirs and the input. A broad set of topologies is accommodated by its modularity, several of which are shown in Figure 1.

We denote the tensor of input data $\mathbf{U} \in \mathbb{R}^{N_S \times N_t \times N_U}$, which comprises $N_S$ sequences of $N_t$ timesteps and $N_U$ features. $N_t$ may differ between each of the $N_S$ sequences, but for simplicity such variability is left out of the formulation. Reservoirs that are connected to the input receive the vector $u(t) \in \mathbb{R}^{N_U}$ at timestep $t$, where $u(t) \in U \in \mathbb{R}^{N_t \times N_U}$ and $U \in \mathbf{U}$. $u(t)$ is mapped by the input weight matrix $W_{in} \in \mathbb{R}^{N_U \times \|C_u\|_0 N_R}$ into each reservoir. $N_R$ is the number of neurons per reservoir, and typically $N_R \gg N_U$. The binary matrix $C$ determines the feedforward connections between reservoirs and the input $u$. For example, if element $C_{u,2}$ is '1', then $u$ and reservoir 2 are connected. $\|C_u\|_0$ gives the number of reservoirs that are connected to $u$. The output of the $l$th reservoir, $x^{(l)} \in \mathbb{R}^{N_R}$, is computed as (1)

$$\tilde{x}^{(l)}(t) = \tanh\!\left(W^{(l)}_{res}\, i^{(l)}(t) + \hat{W}^{(l)}_{res}\, x^{(l)}(t-1)\right) \quad (1a)$$

$$x^{(l)}(t) = \left(1 - a^{(l)}\right) x^{(l)}(t-1) + a^{(l)}\, \tilde{x}^{(l)}(t) \quad (1b)$$

where $i^{(l)}(t)$ is given by (2).
$$i^{(l)}(t) = \begin{cases} u(t) & l = 1 \\ x^{(l-1)}(t) & l > 1 \end{cases} \quad (2)$$

$W^{(l)}_{res} \in \mathbb{R}^{N_R \times N_R}$ is a feedforward weight matrix that connects two reservoirs, while $\hat{W}^{(l)}_{res} \in \mathbb{R}^{N_R \times N_R}$ is a recurrent weight matrix that connects intra-reservoir neurons. The per-layer leaky parameter $a^{(l)}$ controls the leakage rate in a moving-exponential-average manner. Note that the bias vectors are left out of the formulation for simplicity. The state of a Mod-DeepESN network is defined as the concatenation of the outputs of the $N_L$ reservoirs, i.e. $x(t) = \left(u(t), x^{(1)}(t), \ldots, x^{(N_L)}(t)\right) \in \mathbb{R}^{N_U + N_L N_R}$. The matrix of all states is denoted as $X = \left(x(1), x(2), \ldots, x(N_t)\right) \in \mathbb{R}^{N_S N_t \times (N_U + N_L N_R)}$. Finally, the output of the network for the duration $N_t$ is computed as a linear combination of the state matrix using (3).

$$Y = X W_{out} \quad (3)$$

The matrix $W_{out} \in \mathbb{R}^{(N_U + N_L N_R) \times N_Y}$ contains the feedforward weights between reservoir neurons and the $N_Y$ output neurons, and $Y \in \mathbb{R}^{N_S N_t \times N_Y}$ is the ground truth with a label for each timestep. In a forecasting task, the dimensionality of the output is the same as that of the input, i.e. $N_Y = N_U$. Ridge regression, also known as Tikhonov regularization, is used to solve for the optimal $W_{out}$ and is shown with the explicit solution in (4)

$$W_{out} = \left(X^\top X + \beta I\right)^{-1} X^\top Y \quad (4)$$

and using singular value decomposition (SVD) in (5)

$$W_{out} = V \left(\frac{\Sigma}{\Sigma \odot \Sigma + \beta I}\right) U^\top Y \quad (5)$$

where $\beta$ is a regularization term, $I$ is the identity matrix, $\odot$ is the Hadamard product, and $X = U \Sigma V^\top$. The SVD solution of the Moore-Penrose pseudoinverse gives a more accurate result but comes at the cost of higher computational complexity.

To maintain reservoir stability, ESNs need to satisfy the echo state property (ESP) [11, 22] as stated by (6).
$$\max_{\substack{1 \le l \le N_L \\ 1 \le k \le N_R}} \left| \mathrm{eig}_k\!\left( \left(1 - a^{(l)}\right) I + a^{(l)} \hat{W}^{(l)}_{res} \right) \right| < 1 \quad (6)$$

The function $\mathrm{eig}_k$ gives the $k$th eigenvalue of its matrix argument, and $|\cdot|$ gives the modulus of its complex scalar argument. The maximum eigenvalue modulus is referred to as the spectral radius and must be less than unity ('1') in order for the initial conditions of each reservoir to be washed out asymptotically. A hyperparameter $\hat{\rho}$ is substituted for unity to allow for reservoir tuning for a given task. Each $\hat{W}^{(l)}_{res}$ is drawn from a uniform distribution and scaled such that the ESP is satisfied.

The remaining untrained weight matrices are set using one of two methods. First, a matrix can be drawn from a uniform distribution and scaled to have a specified $L_2$ (Frobenius) norm, i.e.

$$W_{in} = \frac{W'_{in}}{\left\|W'_{in}\right\|_2}\, \pi_{in}, \qquad W^{(l)}_{res} = \frac{W'^{(l)}_{res}}{\left\|W'^{(l)}_{res}\right\|_2}\, \pi_{res}$$

where $\pi_{in}$ and $\pi_{res}$ are hyperparameters. Second, Glorot (Xavier) initialization [17] can be utilized without incurring further hyperparameters. The method initializes weights such that the activation (output) variance of consecutive fully-connected layers is the same. Formally, weights are drawn from a normal distribution with zero mean and a standard deviation of $\sqrt{2 / (n_{in} + n_{out})}$, where $n_{in}$ and $n_{out}$ are the number of inputs and outputs of a layer, respectively. All weight matrices are drawn with a sparsity parameter which gives the probability that each weight is nullified. Specifically, $s_{in}$ determines sparsity for $W_{in}$, $\hat{s}_{res}$ for each $\hat{W}^{(l)}_{res}$, and $s_{res}$ for each $W^{(l)}_{res}$.

Furthermore, an unsupervised intrinsic plasticity (IP) learning rule is employed. The rule, originally proposed by Schrauwen et al., introduces gain and bias terms to the nonlinearities of reservoir neurons, i.e. $\tanh(x)$ is substituted with $\tanh(gx + b)$, where $g$ is the gain and $b$ is the bias.
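The recurrent initialization and state update described above can be sketched in NumPy. This is an illustrative single-reservoir sketch, not the paper's implementation: the sizes $N_R$, the leaky rate $a$, and $\hat{\rho} = 0.9$ are hypothetical, and the combined matrix $(1 - a)I + a\hat{W}_{res}$ is rescaled directly, which is one way to enforce condition (6) exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
N_U, N_R = 1, 100       # input features, neurons per reservoir (illustrative)
a, rho_hat = 0.5, 0.9   # leaky rate and target spectral radius (hypothetical)

# Draw recurrent weights uniformly, then rescale so the spectral radius of
# (1 - a) I + a W_hat is exactly rho_hat < 1, satisfying the ESP in (6).
W0 = rng.uniform(-1.0, 1.0, (N_R, N_R))
M0 = (1 - a) * np.eye(N_R) + a * W0
M = M0 * (rho_hat / np.max(np.abs(np.linalg.eigvals(M0))))
W_hat = (M - (1 - a) * np.eye(N_R)) / a  # recover the rescaled recurrent matrix

# Input weights; shaped (N_R, N_U) here for a single input-connected reservoir.
W_in = rng.uniform(-1.0, 1.0, (N_R, N_U))

def step(x, u):
    """One leaky state update per (1a)-(1b); biases omitted as in the text."""
    x_tilde = np.tanh(W_in @ u + W_hat @ x)
    return (1 - a) * x + a * x_tilde
```

Because $\tanh$ is bounded, the leaky update keeps the state bounded, and the sub-unity spectral radius lets initial conditions wash out as the text describes.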
Iterative application of the IP rule minimizes the Kullback-Leibler (KL) divergence between the empirical output distribution (as driven by $\mathbf{U}$) and a target Gaussian distribution [40]. The update rules for the $i$th neuron are given by (7) and (8)

$$\Delta b^{(l)}_i(t+1) = -\eta \left( -\frac{\mu}{\sigma^2} + \frac{\tilde{x}^{(l)}_i(t)}{\sigma^2} \left( 2\sigma^2 + 1 - \tilde{x}^{(l)}_i(t)^2 + \mu\, \tilde{x}^{(l)}_i(t) \right) \right) \quad (7)$$

$$\Delta g^{(l)}_i(t+1) = \frac{\eta}{g^{(l)}_i(t)} + \Delta b^{(l)}_i(t+1)\, x^{(l)}_i(t) \quad (8)$$

where $\tilde{x}_i$ is given by (1a) and $x_i$ is given by (1b). The hyperparameter $\eta$ is the learning rate, and $\sigma$ and $\mu$ are the standard deviation and mean of the target Gaussian distribution, respectively. In a pre-training phase, the learned parameters are initialized as $b^{(l)}_i(t) = 0$ and $g^{(l)}_i(t) = 1$ and are updated iteratively in a layer-wise fashion.

2.1 Particle Swarm Optimization
Instances of the Mod-DeepESN network may achieve satisfactory forecasting performance using empirical guesses of hyperparameters; however, a more sophisticated optimization approach will further improve performance. We thus propose black-box optimization of the Mod-DeepESN hyperparameters using particle swarm optimization (PSO) [26]. In PSO, a population of particles is instantiated in the search space of an optimization problem. This space contains the possible values of continuous or discrete hyperparameters, and the particles move around the space based on external fitness, or cost, signals. The communication network topology employed dictates the social behavior, or dynamics, of the swarm. Here, we utilize a star topology in which each particle is attracted to the globally best-performing particle.

Formally, each particle is a candidate solution of $N_H$ hyperparameters with position $p_i(t) \in \mathbb{R}^{N_H}$ and velocity $v_i(t) \in \mathbb{R}^{N_H}$.
The trivial position update of each particle is given by (9), while the velocity update is given by (10).

$$p_i(t+1) = p_i(t) + v_i(t+1) \quad (9)$$

$$v_i(t+1) = w\, v_i(t) + \varphi_1 U_1(t) \left( \hat{b}_i(t) - p_i(t) \right) + \varphi_2 U_2(t) \left( \hat{b}^*_i(t) - p_i(t) \right) \quad (10)$$

The matrices $U_1(t), U_2(t) \in \mathbb{R}^{N_H \times N_H}$ are populated by values uniformly drawn from the interval $[0, 1)$ at each timestep. The best position found by a particle is the vector $\hat{b}_i(t) \in \mathbb{R}^{N_H}$, while the best solution found in the neighborhood of a particle is the vector $\hat{b}^*_i(t) \in \mathbb{R}^{N_H}$. With a star communication topology, $\hat{b}^*_i(t)$ is the best position found globally. The velocity update comprises three parameters which influence a particle's dynamics: $w$ is the inertia weight, $\varphi_1$ is the cognitive acceleration (how much a particle should follow its personal best), and $\varphi_2$ is the social acceleration (how much a particle should follow the swarm's global best). All hyperparameters are considered during the optimization process, except for $\beta$, which is swept after $X$ has been computed, exploiting the fact that all the weight matrices but $W_{out}$ are fixed.

2.2 Neural Mapping
RC networks have strong underpinnings in neural processing. Recent studies have shown that the distribution of the complex representation layer and the linear construction layer in the reservoir is similar to that observed in the cerebellum. The model with a granule layer (representation layer) and the synapses between granule and Purkinje cells (linear readout layer) [46] is used to study computationally useful case studies such as vestibulo-ocular reflex adaptation [8]. The Purkinje cells are trained from the random, distributed input signals of the granule cells' parallel dendritic connections [41]. We also employ intrinsic plasticity, as in [40], to modulate neuronal activations to follow a known distribution.
Whereas a biological neuron's electrical properties are modified, a reservoir neuron's gain and bias are adjusted. It has been argued that the edge-of-chaos computational paradigm, while powerful, is yet to be understood for the biological counterparts of the reservoir. However, infusing hybrid plasticity mechanisms such as short-term plasticity or long-term synaptic plasticity can help enhance computational performance. It is interesting to note that reservoir computational models (both spiking and non-spiking) seem to gain a performance boost from embedding intrinsic plasticity, akin to the biological models [42, 43]. This convergence with the biological counterparts vastly improves our understanding of building spatiotemporal processing, though one should take a parsimonious approach with correlations.

3 MEASURING RESERVOIR GOODNESS
Various metrics have been proposed to understand reservoir performance [15, 16, 23, 28, 37]. In this work, we quantify reservoir goodness by measuring the stability of reservoir dynamics as well as evaluating the forecasting proficiency of networks trained on synthetic and real-world tasks.

3.1 Separation Ratio Graphs
Separation ratio graphs [16] are considered for determining the fitness of Mod-DeepESN state separability. The method compares the separation of inputs with the separation of outputs for a given input and output, e.g. the input and output of a single reservoir or of a set of reservoirs. The metric assumes that inputs that appear similar should have a comparable degree of similarity with outputs. That is,

$$\frac{\left\| x^{(l_2)}_i(t) - x^{(l_2)}_j(t) \right\|_2}{\left\| i^{(l_1)}_i(t) - i^{(l_1)}_j(t) \right\|_2} \approx 1, \quad l_2 > l_1 \quad (11)$$

where the numerator is the output separation, the denominator is the input separation, and $\|\cdot\|_2$ gives the Euclidean distance between inputs (or outputs) $i$ and $j$. When the output separation is plotted as a function of the input separation, the relation should be close to identity, i.e.
a linear regression trend line should yield a slope $m \approx 1$ and intercept $b \approx 0$. If the output separation $\gg$ the input separation, a reservoir (or network) is considered to be in the "chaotic" zone, whereas it is considered to be in the "attractor" zone if the output separation $\ll$ the input separation.

3.2 Lyapunov Exponent
The Lyapunov exponent (LE) offers a quantitative measure of the exponential growth (or decay) rate of infinitesimal perturbations within a dynamical system [9, 15, 27].

Figure 2: Reservoir neuronal budgeting results for $N_N = 2048$ neurons for the Mackey Glass task. Color corresponds to $\frac{N_{L_B} - N_{L_D}}{N_L}$ (deeper vs. wider). Left: NRMSE as a function of $N_R$ with a linear trend line. Right: NRMSE as a function of $N_L$ with a linear trend line of NRMSE as a function of $\log(N_L)$.

An $N$-dimensional system has $N$ LEs which describe the evolution along each dimension of its state space. The maximum LE (MLE) is a good indication of a system's stability as it commands the rate of contraction or expansion in the system state space. If the MLE of a system is below 0, the system is said to exhibit "stable" dynamics, whereas an MLE above 0 describes a system with "chaotic" dynamics. An MLE of 0 is commonly referred to as the "edge of chaos." Local MLEs [15] are considered in this work as they are more useful in practical experiments and can be estimated by driving a network with a real input signal (e.g. a time series).
The MLE, denoted $\lambda_{max}$, of a Mod-DeepESN instance can be computed for a given set of input sequences by (12)

$$\lambda_{max} = \max_{\substack{1 \le l \le N_L \\ 1 \le k \le N_R}} \frac{1}{N_S N_t} \sum_{i=1}^{N_S} \sum_{t=1}^{N_t} \ln \left| \mathrm{eig}_k\!\left( \left(1 - a^{(l)}\right) I + a^{(l)} D^{(l)}_i(t)\, \hat{W}^{(l)} \right) \right| \quad (12)$$

where for the $i$th sequence the diagonal matrix $D^{(l)}_i(t)$ is given by (13).

$$D^{(l)}_i(t) = \mathrm{diag}\!\left( 1 - \tilde{x}^{(l)}_1(t)^2,\; 1 - \tilde{x}^{(l)}_2(t)^2,\; \ldots,\; 1 - \tilde{x}^{(l)}_{N_R}(t)^2 \right) \quad (13)$$

3.3 Task Performance Metrics
The more straightforward way of measuring reservoir goodness is by evaluating the performance of a network on a task. For scalar-valued time series forecasting tasks, we consider the following three metrics to quantify network error: root-mean-square error (RMSE) (14), normalized RMSE (NRMSE) (15), and mean absolute percentage error (MAPE) (16).

$$\mathrm{RMSE} = \sqrt{ \frac{1}{N_S N_t} \sum_{i=1}^{N_S} \sum_{t=1}^{N_t} \left( y(t) - \hat{y}(t) \right)^2 } \quad (14)$$

$$\mathrm{NRMSE} = \sqrt{ \frac{ \sum_{i=1}^{N_S} \sum_{t=1}^{N_t} \left( y(t) - \hat{y}(t) \right)^2 }{ \sum_{i=1}^{N_S} \sum_{t=1}^{N_t} \left( y(t) - \bar{y} \right)^2 } } \quad (15)$$

$$\mathrm{MAPE} = \frac{100\%}{N_S N_t} \sum_{i=1}^{N_S} \sum_{t=1}^{N_t} \frac{ \left| y(t) - \hat{y}(t) \right| }{ y(t) } \quad (16)$$

The vector $\hat{y}(t)$ is the predicted time series value at timestep $t$, $\bar{y}$ is the average value of the ground truth (time series) over the $N_t$ timesteps, $y(t) \in Y$, and $\hat{y}(t), y(t), \bar{y} \in \mathbb{R}^{N_Y}$. Proposed in [3] and adapted for polyphonic music tasks in [4], frame-level accuracy (FL-ACC) rewards only true positives across every timestep of all samples, as shown in (17).

$$\text{FL-ACC} = \frac{ \sum_{i=1}^{N_S} \sum_{t=1}^{N_t} \mathrm{TP}_i(t) }{ \sum_{i=1}^{N_S} \sum_{t=1}^{N_t} \left( \mathrm{TP}_i(t) + \mathrm{FP}_i(t) + \mathrm{FN}_i(t) \right) } \quad (17)$$

The subscript $i$ of each of the {true, false} positives ({T,F}P) and false negatives (FN) denotes the corresponding quantity for the $i$th sequence. Note that FL-ACC is analogous to the intersection-over-union (IoU) metric, also known as the Jaccard index [21], which is commonly used to assess the performance of image segmentation models.
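The error metrics (14)–(16) can be sketched in NumPy for a single sequence ($N_S = 1$); this is an illustrative sketch, not the paper's code, and the MAPE denominator is taken as $|y(t)|$ to keep the metric positive (assuming the series has no zeros):

```python
import numpy as np

def rmse(y, y_hat):
    # Root-mean-square error, per (14), for a single sequence (N_S = 1)
    return np.sqrt(np.mean((y - y_hat) ** 2))

def nrmse(y, y_hat):
    # RMSE normalized by the deviation of the ground truth from its mean, per (15)
    return np.sqrt(np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2))

def mape(y, y_hat):
    # Mean absolute percentage error, per (16); |y| in the denominator
    # guards against sign flips (assumes no zeros in y)
    return 100.0 * np.mean(np.abs(y - y_hat) / np.abs(y))
```

Note that NRMSE is invariant to a common rescaling of `y` and `y_hat`, which is why the paper prefers it for comparison across studies.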
For forecasting tasks, NRMSE is considered as its value depends on neither the scale of the model nor the data, thus allowing for a more accurate comparison with results reported in the literature. FL-ACC is utilized for the polyphonic forecasting task, as the true negative (TN) rate is not a good indicator of performance, and for consistent comparison of results.

4 EXPERIMENTS
We analyze the Mod-DeepESN architecture on diverse time series forecasting tasks that exhibit multi-scale and nonlinear dynamics: the chaotic Mackey-Glass series, a daily minimum temperature series [20], and a set of polyphonic music series [4, 39].

Figure 3: Reservoir neuronal budgeting results for $N_N = 2048$ neurons for the Melbourne, Australia, minimum temperature forecasting task. Color corresponds to $\frac{N_{L_B} - N_{L_D}}{N_L}$ (deeper vs. wider). Left: NRMSE as a function of $N_R$ with a linear trend line of NRMSE as a function of $\log(N_R)$. Right: NRMSE as a function of $N_L$ with a linear trend line.

4.1 Practical Details
For each task considered in this work, we run PSO to produce a candidate set of hyperparameters that offer the best performance on the appropriate validation set. Thereafter, these parameterizations of Mod-DeepESN are evaluated on the test set for the specific task. All numerical results are averaged over 10 runs. We run PSO for 100 iterations with 50 particles and set its parameters as follows: $\varphi_1 = 0.5$, $\varphi_2 = 0.3$, $w = 0.9$.
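The PSO updates (9)–(10) of Section 2.1 with a star (global-best) topology can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's PySwarms-based implementation: `pso_minimize` and its box bounds `lo`/`hi` are hypothetical names, and the diagonal matrices $U_1, U_2$ are realized as elementwise uniform draws, which is equivalent.

```python
import numpy as np

rng = np.random.default_rng(1)

def pso_minimize(cost, lo, hi, n_particles=50, iters=100, w=0.9, phi1=0.5, phi2=0.3):
    """Global-best (star topology) PSO per (9)-(10); a sketch, not the paper's code."""
    dim = len(lo)
    p = rng.uniform(lo, hi, (n_particles, dim))        # positions p_i(t)
    v = np.zeros((n_particles, dim))                   # velocities v_i(t)
    best_p = p.copy()                                  # per-particle bests b_i(t)
    best_c = np.array([cost(x) for x in p])            # and their costs
    g = best_p[np.argmin(best_c)].copy()               # global best b*_i(t)
    for _ in range(iters):
        u1 = rng.uniform(0.0, 1.0, (n_particles, dim))  # U1(t) in (10)
        u2 = rng.uniform(0.0, 1.0, (n_particles, dim))  # U2(t) in (10)
        v = w * v + phi1 * u1 * (best_p - p) + phi2 * u2 * (g - p)  # (10)
        p = p + v                                                    # (9)
        c = np.array([cost(x) for x in p])
        improved = c < best_c
        best_p[improved], best_c[improved] = p[improved], c[improved]
        g = best_p[np.argmin(best_c)].copy()
    return g

# Toy usage: minimize a quadratic over two "hyperparameters"
best = pso_minimize(lambda x: np.sum((x - 0.3) ** 2),
                    np.array([-1.0, -1.0]), np.array([1.0, 1.0]),
                    n_particles=20, iters=50)
```

In the paper's setting, `cost` would evaluate a Mod-DeepESN instance on the validation set, keeping only the score of the best-performing $\beta$.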
During training, $\beta$ is swept from the set $\{0\} \cup \{10^{-n} \mid n \in [1 .. 8]\}$, exploiting the fact that $X$ need only be computed once. Ridge regression is carried out using SVD according to (5) for increased numerical stability. Only the score produced by the best-performing $\beta$ is considered during the PSO update.

We only consider dense grid topologies in these experiments, which are referred to as Wide and Layered in [6]. This allows all networks to be described in terms of breadth and depth, reducing the complexity of neuronal partitioning and other analyses. Additionally, the leaky rate $\alpha$ is kept constant across all reservoirs, i.e. $\alpha = a^{(l)}\; \forall l \in [1 .. N_L]$.

4.1.1 Model Implementation. The Mod-DeepESN architecture and its utilities are implemented in Python using several open-source libraries. TensorFlow [1] and Keras [7] are used for matrix/tensor operations, managing network graphs, and unrolling RNNs. The PSO implementation extends the base optimizers provided by the PySwarms [34] library, and the pandas [33], NumPy [35], and SciPy [25] libraries are employed throughout the codebase.

4.2 Neuronal Partitioning
An interesting question in reservoir models is how to determine the optimal size and connectivity of/within reservoirs. Is a large reservoir as effective as a multitude of small reservoirs, and should these small reservoirs extend deeper or wider? To address this matter, we propose neuronal partitioning to explore a space of grid topologies. Given a budget of $N_N$ neurons network-wide, we evaluate the task performance for a depth $N_{L_D}$ and breadth $N_{L_B}$, where $N_L = N_{L_D} \times N_{L_B}$ and $N_R = \lfloor N_N / N_L \rfloor$. This experiment is performed for the Mackey Glass and minimum temperature forecasting tasks; the prohibitive size of the polyphonic music data prevents us from running such an experiment there. The values of $N_{L_D}$ and $N_{L_B}$ selected for each experiment are the integer factors of $N_L \in [1 ..$
$12]$ and of $N_L \in \{16, 24, 25, 32, 36, 48, 49, 64\}$.

4.3 Mackey Glass
Mackey Glass [31] is a classical chaotic time series benchmark for evaluating the forecasting capacity of dynamical systems [2, 22, 29, 47]. The series is generated from a nonlinear time-delay differential equation using the fourth-order Runge-Kutta method (RK4) and is given by (18).

$$\frac{dx}{dt} = \frac{\beta\, x(t - \tau)}{1 + x(t - \tau)^n} - \gamma\, x(t) \quad (18)$$

During generation, we set $\tau = 17$, $\beta = 0.2$, $\gamma = 0.1$, and $n = 10$ with a time resolution ($dt$) of 0.1 to compare with methods evaluated in [29].

Figure 4: The Mackey Glass chaotic time series [31] computed over 10,000 timesteps of duration $dt = 0.1$.

Figure 5: Separation ratio plots for various Mod-DeepESN instances for the Melbourne forecasting task. (a) The best-performing model (ESN input to Reservoir 8,0; trend line $y = 0.835x + 0.565$). (b) The second-best-performing model (Reservoir 8,0 to Reservoir 8,1; trend line $y = 1.106x + 0.043$). (c) The worst-performing model (Reservoir 1,0 to Reservoir 1,1; trend line $y = 1.130x + 0.793$).

10,000 samples are split into 6,400 training samples, 1,600
validation samples, and 2,000 testing samples for 84-timestep-ahead forecasting, i.e. given $u(t)$, predict $y(t) = u(t + 84)$. To reduce the influence of transients, the first 100 timesteps of the training set are used as a washout period.

Table 1 contains the best forecasting results for Mod-DeepESN as well as those reported in [29]. The Mod-DeepESN framework falls slightly short of the performance of Deep-ESN but outperforms the other baselines in terms of (N)RMSE. MAPE exhibits several biases, such as punishing negative errors more than positive ones, which may be the reason for this discrepancy.

Table 1: Mackey-Glass Time Series 84-Step Ahead Prediction Results. All errors are reported in thousandths.

Network            N_L   RMSE   NRMSE   MAPE
Vanilla ESN [24]   1     43.7   201     7.03
ϕ-ESN [10]         2     8.60   39.6    1.00
R²SP [5]           2     27.2   125     1.00
MESM [32]          7     12.7   58.6    1.91
Deep-ESN [29]      2     1.12   5.17    .151
Mod-DeepESN        3     7.22   27.5    5.55

4.3.1 Neuronal Partitioning. Neuronal partitioning is run for the Mackey Glass task with results reported in Figure 2. It is apparent that smaller values of $N_R$ and larger values of $N_L$ yield the lowest NRMSE; an agglomeration of small reservoirs outperforms a single large reservoir for this task. While the difference is marginal, broader topologies outperform deeper ones for this task.

4.4 Melbourne, Australia, Daily Minimum Temperature Forecasting
The Melbourne, Australia, daily minimum temperature series [20] is recorded from 1981–1990 and shown in Figure 6. In this task, the goal is to predict the minimum temperature of the directly following day in Melbourne, i.e. given $u(t)$, predict $y(t) = u(t + 1)$.

Figure 6: The Melbourne, Australia, daily minimum temperature time series [20], shown daily and with a monthly-mean resample.

The data is smoothed with a 5-step moving window average and split into 2,336 training samples, 584 validation samples,
and 730 testing samples to compare with methods evaluated in [29]. A washout period of 30 timesteps (days) is used to rid transients.

Table 2 contains the best forecasting results for Mod-DeepESN as well as those reported in [29]. The Mod-DeepESN framework outperforms all baselines in terms of (N)RMSE. This result is more interesting than that of Mackey Glass, as the time series comprises real data as opposed to synthetic.

Table 2: Daily Minimum Temperature Series 1-Step Ahead Prediction Results. All errors are reported in thousandths.

Network         N_L   RMSE   NRMSE   MAPE
ESN [24]        1     501    139     39.5
ϕ-ESN [10]      2     493    141     39.6
R²SP [5]        2     495    137     39.3
MESM [32]       7     478    136     37.7
Deep-ESN [29]   2     473    135     37.0
Mod-DeepESN     4     459    132     37.1

Figure 7: 2D and 3D heatmaps of NRMSE and $\lambda_{max}$ as a function of $\alpha$ and $N_L$. Note that the color bar gradient is reversed for visualizing $\lambda_{max}$. (a) Impact of $N_L$ and $\alpha$ on NRMSE. (b) Impact of $N_L$ and $\alpha$ on $\lambda_{max}$. (c) Impact of reservoir breadth, depth, and $\alpha$ on NRMSE. (d) Impact of reservoir breadth, depth, and $\alpha$ on $\lambda_{max}$.

4.4.1 Neuronal Partitioning. Neuronal partitioning is run for the Melbourne daily minimum temperature task with results reported in Figure 3.
The trends of $N_R$ and $N_L$ are less apparent than those observed with the Mackey Glass forecasting task. The gradient of error is consistent with the change in $N_R$ and $N_L$ for deeper topologies between the tasks, but the same does not hold for broader networks; in fact, the inverse is observed, which suggests that hierarchical features and a larger memory capacity are required to improve performance. Elongated memory capacity has been shown to emerge with the depth of an ESN [15], which supports this observation.

Figure 8: Two samples from the Piano-midi.de dataset (MIDI notes A0 to C8 over time), with insets highlighting activity: Friedrich Burgmüller, "Die Quelle", Études, Opus 109, 1858; and Franz Schubert, "No. 2", Four Impromptus, D. 899, Opus 90, 1827. Terminology: Opus: number of a musical work indicating chronological order of production; D.: Deutsch Thematic Catalogue number (of a Schubert work).

4.5 Polyphonic Music Tasks
We evaluate the Mod-DeepESN on a set of polyphonic music tasks as defined in [4]. In particular, we use the data provided¹ for the Piano-midi.de² task. The data comprises a set of piano roll sequences preprocessed as described in [39]. 87 sequences with an average of 872.5 timesteps are used for training, 12 sequences with an average of 711.7 timesteps are used for validation, and 25 sequences with an average of 761.4 timesteps are used for testing. The goal of this task is to predict $y(t) = u(t + 1)$ given $u(t)$, where $N_Y = N_U = 88$. Multiple notes may be played at once, so an argmax cannot be used at the output of the readout layer; rather, the output of each neuron is binarized with a threshold. In practice, we find this threshold using training data by sweeping ~20 values uniformly distributed between the minimum and maximum of the predicted values.
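Such a sweep can be sketched as a single shared threshold chosen to maximize frame-level accuracy (17) on the training predictions. This is a simplified, hypothetical sketch (`pick_threshold` is not the paper's code, and the paper may select per-task details differently):

```python
import numpy as np

def pick_threshold(y_true, y_score, n=20):
    """Sweep n thresholds uniformly between min and max of the predictions,
    keeping the one that maximizes frame-level accuracy per (17)."""
    best_t, best_acc = None, -1.0
    for t in np.linspace(y_score.min(), y_score.max(), n):
        pred = y_score >= t                  # binarize readout outputs
        tp = np.sum(pred & (y_true == 1))    # true positives
        fp = np.sum(pred & (y_true == 0))    # false positives
        fn = np.sum(~pred & (y_true == 1))   # false negatives
        acc = tp / (tp + fp + fn) if (tp + fp + fn) > 0 else 0.0
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```

A per-neuron variant would apply the same sweep to each of the 88 output neurons independently, as the text suggests.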
This threshold could also be found by training a linear classifier on the predicted outputs, by using an adaptive threshold, or by using Otsu's method [36]. Lastly, an optimal threshold may exist on a per-neuron basis. A washout period of 20 steps is used to rid transients.

Table 3 contains the best forecasting results for Mod-DeepESN as well as those reported in [4, 13]. The Mod-DeepESN framework outperforms both RNN-RBM [4] and DeepESN [13] on the Piano-midi.de corpus with fewer trained parameters and reservoirs.

¹ http://www-etud.iro.umontreal.ca/~boulanni/icml2012
² Classical MIDI piano music (http://piano-midi.de/).

Figure 9: 2D and 3D heatmaps of NRMSE and λ_max as a function of ρ̂ and N_L (plot data omitted; the color bar gradient is reversed for visualizing λ_max). (a) Impact of N_L and ρ̂ on NRMSE; the color bar values are mapped to [0, 1] with power-law normalization, i.e., (c − c_min)^γ / (c_max − c_min)^γ with γ = 0.3 for given colors c. (b) Impact of N_L and ρ̂ on λ_max. (c) Impact of reservoir breadth, depth, and ρ̂ on NRMSE; NRMSE values span far beyond 0.21 and thus are clipped. (d) Impact of reservoir breadth, depth, and ρ̂ on λ_max.

Table 3: Piano-midi.de Time Series 1-Step-Ahead Prediction Results.

Network           FL-ACC
DeepESN [13]      33.22%
shallowESN [13]   31.76%
RNN-RBM [4]       28.92%
Mod-DeepESN       33.44%

5 DISCUSSION

It is evident that the optimal Mod-DeepESN topologies found via neuronal partitioning are task-specific; no one configuration appears optimal for multiple tasks. Between the Mackey-Glass and Melbourne forecasting tasks, wider networks exhibit a smaller confidence interval, which indicates consistency in performance. There is less of an apparent trend on the real-world Melbourne task (more so for deep networks), although this is within expectations due to noise in the data.

We create separation-ratio plots of reservoir responses to time series input, with various examples shown in Figure 5. The best-performing models achieve similar error values but exhibit considerably different dynamics at different scales of magnitude. The worst-performing model yields a separation ratio similar to that of the second-best; the biases differ, however, and this can be attributed to the difference in magnitude (as a result of, e.g., input scaling). Looking at (11), it can be observed that the identity function yields an ideal response at the empirical "edge of chaos;" this sheds light on some shortcomings of the metric. The technique can be made more robust by considering input-to-output similarity, matching the variance of inputs and reservoir responses (to avoid skewing the slope), and tracking the consistency of separation ratios over time (as reservoirs are stateful). We recommend these plots as a debugging method for ESNs, as they unveil useful attributes beyond input and output separation.

Of the three tasks considered, the Melbourne daily minimum forecasting task is selected for exploring the design space.
The data is non-synthetic, and its size is not prohibitive of such exploration. Here, we produce heatmaps of NRMSE and MLE (λ_max) as a function of several swept hyperparameters. In each, a practical range of breadth and depth values is considered.

Figure 7 delineates the impact of α and shows that λ_max is a reliable indicator of performance (ρ = −0.956⁴) for the task. There is no significant impact induced by modulating breadth or depth, which agrees with the neuronal partitioning result (see Figure 3). Figure 9 illustrates the impact of ρ̂, which demonstrably has a more substantial effect on network stability than α, as expected. The network error plateaus to a minimum near ρ̂ = 1.1 and increases dramatically afterward. This critical point is beyond the "edge of chaos" (λ_max = 0), and the error is asymmetrical about it. Here, λ_max is a poor predictor of error (ρ = 0.377); even the correlation between ρ̂ and NRMSE is higher. Again, depth and breadth are not indicative of error on their own; however, both deeper and wider networks suffer larger errors beyond the critical value of ρ̂.

⁴ Pearson correlation coefficient [38] between NRMSE and λ_max (not to be confused with ρ̂, the spectral radius).

An interesting observation is that, while the tasks differ, well-performing networks in this work often exhibit a positive λ_max, whereas the networks in [15] primarily exhibit a negative λ_max. This characterization of the Mod-DeepESN as a system with "unstable" dynamics requires further attestation but indicates that such dynamics do not preclude consistent performance.

6 CONCLUSION

We provide analytical rationale for the characteristics of deep ESN design that influence forecasting performance.
Within the malleable Mod-DeepESN architecture, we experimentally support the claim that networks perform optimally beyond the "edge of chaos." Given constraints on model size or compute resources, we explore the effects of neuron allocation and reservoir placement on performance. We also demonstrate that network breadth plays a role in dictating the certainty of performance between instances. These characteristics may be present within neuronal populations, which could confirm that powerful models emerge from an agglomeration of weak learners. Redundancy through parallel pathways, extraction of nonlinear data regularities with depth, and discernibility of latent representations all appear to have a significant impact on Mod-DeepESN performance. Future studies should explore the design space of reservoirs in tandem with neuromorphic hardware design constraints.

ACKNOWLEDGMENTS

We would like to thank the members of the Neuromorphic Artificial Intelligence Lab for helpful discussions during this work. We also thank the creators and maintainers of the matplotlib [19] and seaborn [45] libraries, which we used for plotting.

We acknowledge the Air Force Research Lab for funding part of this work under agreement number FA8750-16-1-0108. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory or the U.S. Government.

REFERENCES

[1] Martín Abadi, Ashish Agarwal, Paul Barham, et al. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. CoRR abs/1603.04467 (2015). arXiv:1603.04467 http://arxiv.org/abs/1603.04467 Software available from tensorflow.org.
[2] Pau Vilimelis Aceituno, Yan Gang, and Yang-Yu Liu. 2017.
Tailoring Artificial Neural Networks for Optimal Learning. CoRR abs/1707.02469 (2017), 1–22. arXiv:cs/1707.02469 http://arxiv.org/abs/1707.02469
[3] Mert Bay, Andreas F. Ehmann, and J. Stephen Downie. 2009. Evaluation of Multiple-F0 Estimation and Tracking Systems. In Proceedings of the 10th International Society for Music Information Retrieval Conference, ISMIR, Keiji Hirata, George Tzanetakis, and Kazuyoshi Yoshii (Eds.). International Society for Music Information Retrieval, Kobe International Conference Center, Kobe, Japan, 315–320. http://ismir2009.ismir.net/proceedings/PS2-21.pdf
[4] Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. 2012. Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription. In Proceedings of the 29th International Conference on Machine Learning, ICML (ICML'12). Omnipress, Edinburgh, Scotland, UK, 1881–1888. http://icml.cc/2012/papers/590.pdf
[5] John B. Butcher, David Verstraeten, Benjamin Schrauwen, Charles R. Day, and Peter W. Haycock. 2013. Reservoir computing and extreme learning machines for non-linear time-series data analysis. Neural Networks 38 (2013), 76–89. https://doi.org/10.1016/j.neunet.2012.11.011
[6] Zachariah Carmichael, Humza Syed, Stuart Burtner, and Dhireesha Kudithipudi. 2018. Mod-DeepESN: Modular Deep Echo State Network. Conference on Cognitive Computational Neuroscience abs/1808.00523 (Sept. 2018), 1–4. arXiv:cs/1808.00523 http://arxiv.org/abs/1808.00523 or https://ccneuro.org/2018/proceedings/1239.pdf
[7] François Chollet et al. 2015. Keras. https://github.com/fchollet/keras
[8] Paul Dean, John Porrill, and James V. Stone. 2002. Decorrelation control by the cerebellum achieves oculomotor plant compensation in simulated vestibulo-ocular reflex. The Royal Society 269, 1503 (2002), 1895–1904. https://doi.org/10.1098/rspb.2002.2103
[9] Jean-Pierre Eckmann and David Ruelle. 1985.
Ergodic theory of chaos and strange attractors. Reviews of Modern Physics 57, 3 (July 1985), 617–656. https://doi.org/10.1103/RevModPhys.57.617
[10] Claudio Gallicchio and Alessio Micheli. 2011. Architectural and Markovian factors of echo state networks. Neural Networks 24, 5 (2011), 440–456. https://doi.org/10.1016/j.neunet.2011.02.002
[11] Claudio Gallicchio and Alessio Micheli. 2017. Echo State Property of Deep Reservoir Computing Networks. Cognitive Computation 9, 3 (2017), 337–350. https://doi.org/10.1007/s12559-017-9461-9
[12] Claudio Gallicchio, Alessio Micheli, and Luca Pedrelli. 2017. Deep reservoir computing: A critical experimental analysis. Neurocomputing 268 (2017), 87–99. https://doi.org/10.1016/j.neucom.2016.12.089
[13] Claudio Gallicchio, Alessio Micheli, and Luca Pedrelli. 2018. Deep Echo State Networks for Diagnosis of Parkinson's Disease. In Proceedings of the 26th European Symposium on Artificial Neural Networks, ESANN. i6doc.com, Bruges, Belgium, 397–402. http://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2018-163.pdf
[14] Claudio Gallicchio, Alessio Micheli, and Luca Pedrelli. 2018. Design of deep echo state networks. Neural Networks 108 (2018), 33–47. https://doi.org/10.1016/j.neunet.2018.08.002
[15] Claudio Gallicchio, Alessio Micheli, and Luca Silvestri. 2018. Local Lyapunov Exponents of Deep Echo State Networks. Neurocomputing 298 (2018), 34–45. https://doi.org/10.1016/j.neucom.2017.11.073
[16] Thomas E. Gibbons. 2010. Unifying quality metrics for reservoir networks. In Proceedings of the International Joint Conference on Neural Networks, IJCNN. IEEE, Barcelona, Spain, 1–7. https://doi.org/10.1109/IJCNN.2010.5596307
[17] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, AISTATS (JMLR Proceedings), Yee Whye Teh and D. Mike Titterington (Eds.), Vol.
9. JMLR.org, Chia Laguna Resort, Sardinia, Italy, 249–256. http://jmlr.org/proceedings/papers/v9/glorot10a.html
[18] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
[19] John D. Hunter. 2007. Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering 9, 3 (2007), 90–95. https://doi.org/10.1109/MCSE.2007.55
[20] Rob J. Hyndman and Yangzhuoran Yang. 2018. Daily minimum temperatures in Melbourne, Australia (1981–1990). https://pkg.yangzhuoranyang.com/tsdl/
[21] Paul Jaccard. 1912. The Distribution of the Flora in the Alpine Zone. The New Phytologist 11, 2 (Feb. 1912), 37–50. https://doi.org/10.1111/j.1469-8137.1912.tb05611.x
[22] Herbert Jaeger. 2001. The "Echo State" Approach to Analysing and Training Recurrent Neural Networks - with an Erratum Note. Technical Report 148. Fraunhofer Institute for Autonomous Intelligent Systems, GMD-German National Research Institute for Information Technology. http://www.faculty.jacobs-university.de/hjaeger/pubs/EchoStatesTechRep.pdf
[23] Herbert Jaeger. 2002. Short term memory in echo state networks. Technical Report 152. Fraunhofer Institute for Autonomous Intelligent Systems, GMD-German National Research Institute for Information Technology. http://www.faculty.jacobs-university.de/hjaeger/pubs/STMEchoStatesTechRep.pdf
[24] Herbert Jaeger, Mantas Lukoševičius, Dan Popovici, and Udo Siewert. 2007. Optimization and applications of echo state networks with leaky-integrator neurons. Neural Networks 20, 3 (2007), 335–352. https://doi.org/10.1016/j.neunet.2007.04.016
[25] Eric Jones, Travis Oliphant, Pearu Peterson, et al. 2001. SciPy: Open Source Scientific Tools for Python. http://www.scipy.org/
[26] James Kennedy and Russell C. Eberhart. 1995. Particle swarm optimization. In Proceedings of the International Conference on Neural Networks, ICNN'95, Vol. 4.
IEEE, Perth, WA, Australia, 1942–1948. https://doi.org/10.1109/ICNN.1995.488968
[27] Aleksandr M. Lyapunov. 1992. The general problem of the stability of motion. Internat. J. Control 55, 3 (1992), 531–534. https://doi.org/10.1080/00207179208934253
[28] Thomas Lymburn, Alexander Khor, Thomas Stemler, et al. 2019. Consistency in Echo-State Networks. Chaos: An Interdisciplinary Journal of Nonlinear Science 29, 2 (2019), 23118. https://doi.org/10.1063/1.5079686
[29] Qianli Ma, Lifeng Shen, and Garrison W. Cottrell. 2017. Deep-ESN: A Multiple Projection-encoding Hierarchical Reservoir Computing Framework. CoRR abs/1711.05255 (2017), 15. arXiv:cs/1711.05255
[30] Wolfgang Maass, Thomas Natschläger, and Henry Markram. 2002. Real-Time Computing Without Stable States: A New Framework for Neural Computation Based on Perturbations. Neural Computation 14, 11 (2002), 2531–2560. https://doi.org/10.1162/089976602760407955
[31] Michael C. Mackey and Leon Glass. 1977. Oscillation and chaos in physiological control systems. Science 197, 4300 (1977), 287–289. https://doi.org/10.1126/science.267326
[32] Zeeshan K. Malik, Amir Hussain, and Qingming J. Wu. 2016. Multilayered Echo State Machine: A Novel Architecture and Algorithm. IEEE Transactions on Cybernetics 47, 4 (June 2016), 946–959. https://doi.org/10.1109/TCYB.2016.2533545
[33] Wes McKinney. 2010. Data Structures for Statistical Computing in Python. In Proceedings of the 9th Python in Science Conference (2010), Stéfan van der Walt and Jarrod Millman (Eds.). SciPy, Austin, TX, 51–56.
[34] Lester James V. Miranda. 2018. PySwarms, a Research Toolkit for Particle Swarm Optimization in Python. Journal of Open Source Software 3, 21 (2018), 433. https://doi.org/10.21105/joss.00433
[35] Travis E. Oliphant. 2006. A Guide to NumPy. Vol. 1. Trelgol Publishing, USA.
[36] Nobuyuki Otsu. 1979.
A Threshold Selection Method from Gray-Level Histograms. IEEE Transactions on Systems, Man, and Cybernetics 9, 1 (Jan. 1979), 62–66. https://doi.org/10.1109/TSMC.1979.4310076
[37] Mustafa C. Ozturk, Dongming Xu, and José Carlos Príncipe. 2007. Analysis and Design of Echo State Networks. Neural Computation 19, 1 (2007), 111–138. https://doi.org/10.1162/neco.2007.19.1.111
[38] Karl Pearson and Francis Galton. 1895. VII. Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London 58, 347-352 (1895), 240–242. https://doi.org/10.1098/rspl.1895.0041
[39] Graham E. Poliner and Daniel P. W. Ellis. 2006. A Discriminative Model for Polyphonic Piano Transcription. EURASIP Journal on Advances in Signal Processing 2007, 1 (2006), 48317. https://doi.org/10.1155/2007/48317
[40] Benjamin Schrauwen, Marion Wardermann, David Verstraeten, Jochen J. Steil, and Dirk Stroobandt. 2008. Improving reservoirs using intrinsic plasticity. Neurocomputing 71, 7-9 (2008), 1159–1171. https://doi.org/10.1016/j.neucom.2007.12.020
[41] Gordon M. Shepherd. 1990. The Synaptic Organization of the Brain (3rd ed.). Oxford University Press, New York, NY, USA.
[42] Nicholas M. Soures. 2017. Deep Liquid State Machines with Neural Plasticity and On-Device Learning. Master's thesis. Rochester Institute of Technology.
[43] Nicholas M. Soures, Lydia Hays, Eric Bohannon, Abdullah M. Zyarah, and Dhireesha Kudithipudi. 2017. On-Device STDP and Synaptic Normalization for Neuromemristive Spiking Neural Network. In Proceedings of the 60th International Midwest Symposium on Circuits and Systems (MWSCAS). IEEE, Boston, MA, USA, 1081–1084. https://doi.org/10.1109/MWSCAS.2017.8053115
[44] Nicholas M. Soures, Lydia Hays, and Dhireesha Kudithipudi. 2017. Robustness of a memristor based liquid state machine. In Proceedings of the International Joint Conference on Neural Networks, IJCNN. IEEE, Anchorage, AK, USA, 2414–2420.
https://doi.org/10.1109/IJCNN.2017.7966149
[45] Michael Waskom, Olga Botvinnik, Drew O'Kane, et al. 2018. mwaskom/seaborn: v0.9.0 (July 2018). Zenodo. https://doi.org/10.5281/zenodo.1313201
[46] Tadashi Yamazaki and Shigeru Tanaka. 2007. The cerebellum as a liquid state machine. Neural Networks 20, 3 (2007), 290–297. https://doi.org/10.1016/j.neunet.2007.04.004
[47] Mohd-Hanif Yusoff, Joseph Chrol-Cannon, and Yaochu Jin. 2016. Modeling neural plasticity in echo state networks for classification and regression. Information Sciences 364-365 (2016), 184–196. https://doi.org/10.1016/j.ins.2015.11.017

A ADDITIONAL RESULTS

We additionally construct heatmaps for the impact of topology and s_res on NRMSE and λ_max, shown in Figure 10. This result shows that λ_max somewhat correlates with NRMSE (ρ = −0.544), which only moderately supports the "edge of chaos" hypothesis. However, there is a clear trend between NRMSE and both s_res and depth. The error extending outward radially from the bottom-right corner of Figure 10a correlates positively with decreasing s_res and increasing N_L. More significantly, depth correlates with NRMSE (ρ = −0.749), with deeper networks giving lower errors. Wider networks also always yield lower errors in this experiment.

Figure 10: 2D and 3D heatmaps of NRMSE and λ_max as a function of s_res and N_L (plot data omitted; the color bar gradient is reversed for visualizing λ_max). (a) Impact of N_L and s_res on NRMSE. (b) Impact of N_L and s_res on λ_max. (c) Impact of reservoir breadth, depth, and s_res on NRMSE. (d) Impact of reservoir breadth, depth, and s_res on λ_max.
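The two reservoir hyperparameters swept above, the weight sparsity s_res and the spectral radius ρ̂, typically enter an ESN as follows. This is a minimal illustrative construction under common ESN conventions, not the authors' Mod-DeepESN implementation; the function name is an assumption:

```python
import numpy as np

def make_reservoir(n_neurons, sparsity, spectral_radius, seed=0):
    """Build a sparse random reservoir weight matrix W_res and rescale it
    so its largest absolute eigenvalue equals `spectral_radius` (rho-hat).
    Sketch only; details (distribution, masking) vary across ESN codebases."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(n_neurons, n_neurons))
    # Zero out a `sparsity` fraction of the weights (s_res).
    mask = rng.random((n_neurons, n_neurons)) < sparsity
    W[mask] = 0.0
    # Rescale so the spectral radius of W equals the requested value.
    rho = np.max(np.abs(np.linalg.eigvals(W)))
    return W * (spectral_radius / rho)

W_res = make_reservoir(100, sparsity=0.5, spectral_radius=1.1)
```

Under this convention, ρ̂ = 1.1 places the reservoir slightly beyond the classical "edge of chaos," consistent with the region where the swept networks minimize error.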