Recurrent Neural Networks for Time Series Forecasting: Current Status and Future Directions
Hansika Hewamalage*, Christoph Bergmeir, Kasun Bandara

Faculty of Information Technology, Monash University, Melbourne, Australia.

Abstract

Recurrent Neural Networks (RNN) have become competitive forecasting methods, as most notably shown in the winning method of the recent M4 competition. However, established statistical models such as ETS and ARIMA gain their popularity not only from their high accuracy, but also because they are suitable for non-expert users: they are robust, efficient, and automatic. In these areas, RNNs still have a long way to go. We present an extensive empirical study and an open-source software framework of existing RNN architectures for forecasting, which allow us to develop guidelines and best practices for their use. For example, we conclude that RNNs are capable of modelling seasonality directly if the series in the dataset possess homogeneous seasonal patterns; otherwise, we recommend a deseasonalization step. Comparisons against ETS and ARIMA demonstrate that the implemented (semi-)automatic RNN models are no silver bullets, but they are competitive alternatives in many situations.

Keywords: Time Series Forecasting, Recurrent Neural Networks

© 2020. This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/

1. Introduction

The forecasting field in the past has been characterised by practitioners on the one hand discarding Neural Networks (NN) as not being competitive, and on the other hand NN enthusiasts presenting many complex novel NN architectures, mostly without convincing empirical evaluations against simpler univariate statistical methods. In particular, this notion was supported by many time series forecasting competitions such as the M3, NN3 and NN5 competitions (Makridakis et al.
, 2018b; Crone et al., 2011; Crone, 2008). Consequently, NNs were labelled as not suitable for forecasting (Hyndman, 2018). There are a number of possible reasons for the underperformance of NNs in the past, one being that individual time series themselves are usually too short to be modelled using complex approaches. Another possibility may be that the time series' characteristics have changed over time, so that even long time series may not contain enough relevant data to fit a complex model. Thus, to model sequences with complex approaches, it is essential that they have adequate length and that they are generated from a comparatively stable system. Also, NNs are further criticized for their black-box nature (Makridakis et al., 2018b). Thus, forecasting practitioners have traditionally often opted for more straightforward statistical techniques.

However, we are now living in the Big Data era. Companies have gathered a plethora of data over the years, which contains important information about their business patterns. Big Data in the context of time series does not necessarily mean that the individual time series contain lots of data. Rather, it typically means that there are many related time series from the same domain. In such a context, univariate forecasting techniques that consider individual time series in isolation may fail to produce reliable forecasts. They become inappropriate for the Big Data context, where a single model could learn simultaneously from many similar time series. On the other hand, more complex models such as NNs benefit most from the availability of massive amounts of data.

* Corresponding author. Postal Address: Faculty of Information Technology, P.O. Box 63, Monash University, Victoria 3800, Australia. E-mail address: hansika.hewamalage@monash.edu

Preprint submitted to Elsevier, December 24, 2020
Therefore, researchers are now looking successfully into the possibility of applying NNs as substitutes for many other machine learning and statistical techniques. Most notably, in the recent M4 competition, a Recurrent Neural Network (RNN) was able to achieve impressive performance and win the competition (Smyl, 2020). Other examples of successful new developments in the field are novel architectures such as DeepAR, the Multi-Quantile Recurrent Neural Network (MQRNN), Spline Quantile Function RNNs and Deep State Space Models for probabilistic forecasting (Salinas et al., 2019; Wen et al., 2017; Gasthaus et al., 2019; Rangapuram et al., 2018). Though there is an abundance of Machine Learning and Neural Network forecasting methods in the literature, as Makridakis et al. (2018b) point out, the methods are typically not evaluated rigorously against statistical benchmarks and would usually perform worse than those. This finding of Makridakis et al. (2018b) was arguably one of the main drivers for organising the M4 competition (Makridakis et al., 2018a). Furthermore, often the datasets used or the code implementations for NN-related forecasting research are not made publicly available, which poses an issue for reproducibility of the claimed performance (Makridakis et al., 2018b). A lack of released code implementations also makes it difficult for the forecasting community to adapt such research work to their practical forecasting objectives. In contrast, popular statistical models such as ETS and ARIMA that have traditionally supported forecasting in a univariate context gain their popularity not only from their high accuracy. They also have the advantages of being relatively simple, robust, efficient, and automatic, so that they can be used by non-expert users.
For example, the forecast (Hyndman and Khandakar, 2008) package in the R programming language (R Core Team, 2014) implements a number of statistical techniques related to forecasting, such as ARIMA, ETS, and Seasonal and Trend Decomposition using Loess (STL Decomposition), in a single cohesive software package. This package still outshines many other forecasting packages developed later, mainly due to its simplicity, accuracy, robustness and ease of use. Consequently, many users of traditional univariate techniques will not have the expertise to develop and adapt complex RNN models. They will want to apply competitive, yet easy-to-use models that can replace the univariate models they currently use in production. Therefore, regardless of the recent successes of RNNs in forecasting, they may still be reluctant to try RNNs as an alternative, since they may not have the expert knowledge to use RNNs adequately and achieve satisfactory accuracy. This is also directly related to the recently emerged dispute in the forecasting community around whether 'off-the-shelf' deep learning techniques are able to outperform classical benchmarks. Furthermore, besides the abovementioned intuitions around short isolated series versus large time series databases, no established guidelines exist as to when traditional statistical methods will outperform RNNs, which particular RNN architecture should be used over another, or how their parameters should be tuned to fit a practical forecasting context. Although the results from the M4 forecasting competition have clearly shown the potential of RNNs, it remains unclear how competitive RNNs can be in practice in an automated standard approach, without extensive expert input as in a competition context.
Therefore, it is evident that the forecasting community would benefit from standard software implementations, as well as guidelines and extensive experimental comparisons of the performance of traditional forecasting methods and the different RNN architectures available. Recently, a few works have addressed the development of standard software in the area. A package has been developed by Tensorflow in Python for structural time series modelling, using the Tensorflow Probability library (Dillon et al., 2017). This package provides support for generating probabilistic forecasts by modelling a time series as a sum of several structural components such as seasonality, local linear trends, and external variables. GluonTS, a Python-based open-source forecasting library recently introduced by Alexandrov et al. (2019), is specifically designed for the purpose of supporting forecasting researchers with easy experimentation using deep neural networks.

Our work also presents a standard software framework, but focused on RNNs and supported by a review of the literature and implemented techniques, as well as an extensive empirical study. In particular, our study has four main contributions. First, we offer a systematic overview and categorization of the relevant literature. Secondly, we perform a rigorous empirical evaluation of a number of the most popular RNN architectures for forecasting on several publicly available datasets for pure univariate forecasting problems (i.e., without exogenous variables). The implemented models are standard RNN architectures with adequate preprocessing, without considering special and sophisticated network types such as in Smyl (2020)'s work involving RNNs.
This is mainly because our motivation in this study is to evaluate how competitive off-the-shelf RNN techniques are for forecasting against traditional univariate techniques, and thus provide insights to non-expert forecasting practitioners to start using RNNs. We compare the performance of the involved models against two state-of-the-art statistical forecasting benchmarks, namely the implementations of the ETS and ARIMA models from the forecast (Hyndman and Khandakar, 2008) package. We stick to single-seasonality forecasting problems to be able to compare with those two benchmarks. Although the state of the art in forecasting can now be seen as the methods of the M4 competition winners, most notably Smyl (2020) and Montero-Manso et al. (2020), we do not present comparisons against those methods, as they are not automated in a way that would make them straightforwardly applicable to other datasets, and the idea of our research is to tune RNNs to replace the fully automatic univariate benchmarks that are being used heavily by practitioners for everyday forecasting activities outside of competition environments. Thirdly, based on the experiments, we offer conclusions as a best-practices guideline to tune the networks in general. We introduce guidelines at every step of the RNN modelling process, from the initial preprocessing of the data to tuning the hyperparameters of the models. The methodologies are generic in that they are not tied to any specific domain. Thus, the main objective of our study is to make this work reproducible by other practitioners for their practical forecasting tasks. Finally, all implementations are publicly available as a cohesive open-source software framework¹.

The rest of the paper is structured as follows.
We first give a comprehensive background study of all related concepts in Section 2, including the traditional univariate forecasting techniques and the different NN architectures for forecasting mentioned in the literature. Section 3 presents the details of the methodology employed, including the RNN architectures implemented and the corresponding data preprocessing techniques. In Section 4, we explain the experimental framework used for the study, with a description of the used datasets, the training, validation and testing specifics, and the comparison benchmarks used. In Section 5, we present a critical analysis of the results, followed by the conclusions in Section 6 and future directions in Section 7.

2. Background Study

This section details the literature related to Univariate Forecasting, Traditional Univariate Forecasting Techniques, Artificial Neural Networks (ANN), and leveraging cross-series information when using ANNs.

2.1. Univariate Forecasting

A purely univariate forecasting problem refers to predicting future values of a time series based on its own past values. That is, there is only one time-dependent variable. Given the target series X = {x_1, x_2, x_3, ..., x_t, ..., x_T}, the problem of univariate forecasting can be formulated as follows:

{x_{T+1}, ..., x_{T+H}} = F(x_1, x_2, ..., x_T) + ε    (1)

Here, F is the function approximated by the model developed for the problem with the variable X. The function predicts the values of the series for the future time steps from T+1 to T+H, where H is the intended forecasting horizon. ε denotes the error associated with the function approximation F.

¹ Available at: https://github.com/HansikaPH/time-series-forecasting
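To make the formulation in Eq. (1) concrete, the sketch below implements one of the simplest possible choices of F, a Seasonal Naïve forecaster, which repeats the last observed seasonal cycle over the forecast horizon. The function name and toy data are our own illustrations, not part of the paper's framework.

```python
def seasonal_naive_forecast(series, horizon, season_length):
    """A simple choice of F in Eq. (1): repeat the last observed
    seasonal cycle of the series to predict the next `horizon` values."""
    if len(series) < season_length:
        raise ValueError("series shorter than one seasonal cycle")
    last_cycle = series[-season_length:]
    return [last_cycle[h % season_length] for h in range(horizon)]

# Toy quarterly series (season_length=4) covering two "years":
history = [10, 20, 30, 40, 12, 22, 33, 41]
forecast = seasonal_naive_forecast(history, horizon=6, season_length=4)
# forecast repeats the last cycle: [12, 22, 33, 41, 12, 22]
```

The error term ε in Eq. (1) then corresponds to whatever part of the future values this (or any other) approximation of F fails to capture.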
2.2. Traditional Univariate Forecasting Techniques

Time series forecasting has traditionally been a research topic in Statistics and Econometrics, from simple methods such as Seasonal Naïve and Simple Exponential Smoothing, to more complex ones such as ETS (Hyndman et al., 2008) and ARIMA (Box et al., 1994). In general, traditional univariate methods were on top in comparison to other computational intelligence methods at many forecasting competitions, including NN3, NN5 and M3 (Crone et al., 2011; Makridakis and Hibon, 2000). The benefit of the traditional univariate methods is that they work well when the volume of the data is minimal (Bandara et al., 2020). The number of parameters to be determined in these techniques is quite low compared to other complex machine learning techniques. However, the traditional univariate techniques introduced thus far lack a few key requirements involved with complex forecasting tasks. Since one model is built per series, frequent retraining is required, which is compute-intensive, especially in the case of massive time series databases. Also, these univariate techniques are not meant for exploiting cross-series information using global models, since they take into account only the features and patterns inherent in a single time series at a time. This is not an issue if the individual time series are long enough, with many data points, so that the models become capable of capturing the sequential patterns. However, in practice this is usually not the case. On the other hand, learning from many time series can be effectively used to address such problems of limited data availability in the single series.

2.3. Artificial Neural Networks

With the ever-increasing availability of data, ANNs have become a dominant and popular technique for machine learning tasks in the recent past. A Feed-Forward Neural Network (FFNN) is the most basic type of ANN.
It has only forward connections between the neurons, contrary to RNNs, which have feedback loops. There are a number of works where ANNs are used for forecasting. Zhang et al. (1998) provide a comprehensive summary of such work in their paper. According to these authors, ANNs possess various appealing attributes which make them good candidates for forecasting against the aforementioned statistical techniques. First, ANNs can model any form of unknown relationship in the data with minimal a-priori assumptions. Second, ANNs can generalize and transfer the learned relationships to unseen data. Third, ANNs are universal approximators, meaning that they are capable of modelling any form of relationship in the data, especially non-linear relationships (Hornik et al., 1989). The range of functions modelled by an ANN is much larger than the range covered by the statistical techniques. The studies of Tang et al. (1991) comparing ANNs against the Box-Jenkins methodology for forecasting establish that ANNs are comparatively better for forecasting problems with long forecasting horizons. Claveria and Torra (2014) derive the same observations with respect to a tourism demand forecasting problem. However, determining the best network structure and training procedure for a given problem is crucial to tune ANNs for maximal accuracy. An equally critical decision is the selection of the input variables for the modelling (Zhang and Kline, 2007). ANNs have been applied for forecasting in a multitude of domains over the years. Mandal et al. (2006) use an ANN for electric load forecasting, along with the Euclidean norm applied to weather information of both the training data and the forecasting duration, to derive similarity between data points in the training data and the forecasts. This technique is called the similar days approach.
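The similar-days selection just described can be sketched in a few lines. This is our own toy construction (feature choices and data are illustrative, not taken from Mandal et al.): the historical day whose weather feature vector lies closest, in Euclidean norm, to the target day's features is selected as the most similar.

```python
import math

def most_similar_day(history, target_features):
    """Similar-days selection: return the index of the historical day
    whose feature vector minimizes the Euclidean distance to the
    target day's features."""
    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(history)),
               key=lambda i: euclidean(history[i], target_features))

# Toy weather features per historical day: (temperature, humidity)
past_days = [(30.0, 0.60), (22.0, 0.80), (29.5, 0.58)]
idx = most_similar_day(past_days, target_features=(29.0, 0.55))
# idx == 2: day (29.5, 0.58) is closest; its load profile would then
# be used as the basis for the forecast
```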
Efforts have also been made to minimize the manual intervention in the NN modelling process to make it automated. To this end, Yan (2012) proposes a new form of ANN named the Generalized Regression Neural Network (GRNN), which is a special form of a Radial Basis Function (RBF) Network. It requires the estimation of just one design parameter, the spread factor, which decides the width of the RBF and consequently how much of the training samples contribute to the output. A hybrid approach to forecasting is proposed by Zhang (2003), combining an ARIMA model with an ANN and thus leveraging the strengths of both models. The ARIMA model is used to model the linear component of the time series, while the ANN can then model the residual, which corresponds to the non-linear part. This approach is closely related to boosting, commonly used as an ensembling technique to reduce bias in predictions. The idea has been inspired by the common concept that a combination of several models can often outperform the individual models in isolation. Zhang and Berardi (2001) and Rahman et al. (2016) have also worked using ANNs for forecasting, specifically with ensembles.

The most common way to feed time series data into an ANN, specifically an FFNN, is to break the whole sequence into consecutive input windows and then get the FFNN to predict the window or the single data point immediately following the input window. Yet, FFNNs ignore the temporal order within the input windows, and every new input is considered in isolation (Bianchi et al., 2017). No state is carried forward from the previous inputs to the future time steps. This is where RNNs come into play: a specialized NN developed for modelling data with a time dimension.

2.3.1. Recurrent Neural Networks for Forecasting

RNNs are the most commonly used NN architecture for sequence prediction problems.
They have particularly gained popularity in the domain of natural language processing. Similar to ANNs, RNNs are universal approximators as well (Schäfer and Zimmermann, 2006). However, unlike ANNs, the feedback loops of the recurrent cells inherently address the temporal order as well as the temporal dependencies of the sequences (Schäfer and Zimmermann, 2006). Every RNN is a combination of a number of RNN units. The most popular RNN units commonly used for sequence modelling tasks are the Elman RNN cell, the Long Short-Term Memory (LSTM) cell and the Gated Recurrent Unit (GRU) (Elman, 1990; Hochreiter and Schmidhuber, 1997; Cho et al., 2014). Apart from these, other variants have been introduced, such as the Depth-Gated LSTM, Clockwork RNN, Stochastic Recurrent Networks and the Bidirectional RNN (Yao et al., 2015; Koutník et al., 2014; Bayer and Osendorfer, 2014; Schuster and Paliwal, 1997). Nevertheless, these latter RNN units have hardly been used in the forecasting literature; they were mostly designed with language modelling tasks in mind. Jozefowicz et al. (2015) perform an empirical evaluation of different RNN cells. In particular, those authors aim to find an RNN unit that performs better than the LSTM and GRU in three defined tasks, namely arithmetic computations, XML modelling, where the network has to predict the next character in a sequence of XML data, and a language modelling task using the Penn Treebank dataset. They do not cover time series forecasting in particular. The work by Bianchi et al. (2017) specifically targets the problem of short-term load forecasting. Their experiments systematically test the performance of all three of the popular recurrent cells, the Elman RNN (ERNN) cell, the LSTM cell and the GRU cell, and compare those with Echo State Networks (ESN) and the Non-linear Autoregressive with eXogenous inputs (NARX) Network.
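As a concrete reference for the recurrent cells just listed, the sketch below shows a single Elman cell update in pure Python, with toy scalar weights rather than trained parameters: the new hidden state is a tanh of a weighted combination of the current input and the previous hidden state, and this feedback of the hidden state is exactly the loop that carries information across time steps. LSTM and GRU cells add gating on top of this basic recurrence.

```python
import math

def elman_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One Elman cell update: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b).
    The h_prev argument is the feedback loop distinguishing RNNs from FFNNs."""
    return math.tanh(w_x * x + w_h * h_prev + b)

# Unroll the cell over a short sequence; the hidden state accumulates
# a summary of the whole history, unlike an FFNN seeing each x alone.
h = 0.0
for x in [1.0, 0.5, -0.3]:
    h = elman_step(x, h)
```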
The experiments are performed on both synthetic and real-world time series. From the results, the authors conclude that both LSTM and GRU demonstrate similar performance on the chosen datasets. In essence, it is hard to differentiate which one is better in which scenario. Additionally, the ERNN shows a performance comparable to the gated RNN units, and it is faster to train. The authors argue that gated RNN units can potentially outperform ERNNs in language modelling tasks, where the temporal dependencies can be highly non-linear and abrupt. Furthermore, the authors state that gradient-based RNN units (ERNN, LSTM, GRU) are relatively slow in terms of training time due to the time-consuming backpropagation through time procedure.

A recurrent unit can constitute an RNN in various types of architectures, and a number of different RNN architectures for forecasting can be found in the literature. Although most commonly used for natural language processing tasks, these architectures are used in different time series forecasting tasks as well. The stacked architecture can be claimed as the most commonly used architecture for forecasting with RNNs. Bandara et al. (2020) employ the stacked model in their work on using a clustering approach for grouping related time series in forecasting. Due to the vanishing gradient problem existent in the vanilla RNN cells, LSTM cells (with peephole connections) are used instead. The method includes several essential data preprocessing steps and a clustering phase of the related time series to exploit cross-series information, and those authors demonstrate significant results on both the CIF 2016 and the NN5 forecasting competition datasets. In fact, it is the same architecture used by Smyl (2016) to win the CIF 2016 forecasting competition.
In the competition, Smyl developed two global models, one for the time series with forecasting horizon 12 and the other for the time series with horizon 6. The work by Smyl and Kuber (2016) closely follows the aforementioned stacked architecture, with a slight modification known as skip connections. Skip connections allow layers far below in the stack to directly pass information to layers well above, and thus minimize the vanishing gradient effect. With the skip connections added, the architecture is denoted as the ResNet architecture, adapted from the work of He et al. (2016), originally for image recognition.

Another RNN architecture, popular in neural machine translation tasks, is the Sequence to Sequence (S2S) architecture introduced by Sutskever et al. (2014). The overall model has two components, the encoder and the decoder, which both act as RNN networks on their own. Peng et al. (2018) apply a S2S architecture for host load prediction, using GRU cells as the RNN unit in the network. Those authors test their approach using two datasets, the Google clusters dataset and the Dinda dataset from traditional Unix systems. The approach is compared against two other state-of-the-art RNN models, an LSTM-based network and the ESN. According to the results, the GRU-based S2S model manages to outperform the other models on both datasets. The DeepAR model for probabilistic forecasting, developed by Salinas et al. (2019) for Amazon, also uses a S2S architecture for prediction. The authors of that work use the same architecture for both the encoder and the decoder components, although in practice the two components usually differ. Therefore, the weights of both the encoder and the decoder are the same. Wen et al. (2017) develop a probabilistic forecasting model using a slightly modified version of a S2S network.
As teacher-signal enforcing of the decoder using autoregressive connections generally leads to error accumulation throughout the prediction horizon, those authors use a combination of two Multi-Layer Perceptrons (MLP) for the decoder: a global MLP, which encapsulates the encoder outputs and the inputs of the future time steps, and a local MLP, which acts upon each specific time step of the prediction horizon to generate the quantiles for that point. Since the decoder does not use any autoregressive connections, the technique is called the Direct Multi-Horizon Strategy. More recently, S2S models, also known as autoencoders, are used in forecasting to extract time series features, followed by another step to generate the actual predictions. Zhu and Laptev (2017) use an autoencoder to address the issue of uncertainty, specifically in the form of model misspecification. The models that are fitted using the training data may not be the optimal models for the test data, when the training data patterns are different from the test data. The autoencoder can alleviate this by extracting the time series features during training and calculating the difference between these features and the features of the series encountered during testing. This gives the model an intuition of how different the training and test data partitions are. Laptev et al. (2017) incorporate an autoencoder to train a global model across many heterogeneous time series. The autoencoder in this context is expected to extract the features of each series, which are later concatenated to the input window for forecasting by an LSTM model. Likewise, apart from direct prediction, the S2S model is also used in intermediate feature extraction steps prior to the actual forecasting. Bahdanau et al. (2015) and Luong et al. (2015) present two different variants of the S2S model with attention mechanisms.
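At their core, both attention variants compute a context vector for each decoder step as a softmax-weighted sum of the encoder states. A minimal sketch with toy numbers follows; note that the raw scores here stand in for the learned Bahdanau/Luong scoring networks, which we do not reproduce.

```python
import math

def attention_context(encoder_states, scores):
    """Turn raw scores into softmax weights, then take the weighted
    sum of encoder states as the context vector for one decoder step."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    context = sum(w * h for w, h in zip(weights, encoder_states))
    return context, weights

# Three scalar encoder states; the second receives the highest score,
# so it dominates the resulting context vector.
context, weights = attention_context([0.2, 0.9, -0.1], scores=[0.1, 2.0, 0.3])
```

In a full model the encoder states are vectors and the scores are recomputed per decoder step, but the weighting scheme is the same.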
The fundamental idea behind these attention mechanisms is to overcome the problem in S2S models that they encode all the information of the whole time series into just a single vector and then decode this vector to generate outputs. Embedding all the information in a fixed-size vector can result in information loss (Qin et al., 2017). Hence, the attention model tries to overcome this by identifying important points in the time series to pay attention to. All the points in the time series are assigned weights, which are computed for each output time step, and the more important points are assigned higher weights than the less important ones. This method has been empirically proven to be highly successful in various neural machine translation tasks, where the translation of each word from one language to another requires specific attention on particular words in the source language sentence. Since time series possess seasonality components, attention weights as described above can be used in a forecasting context. For instance, if a particular monthly time series has a yearly seasonality, predicting the value for the next immediate month benefits more from the value of the exact same month of the previous year. This is analogous to assigning more weight to the value of the sequence exactly 12 months ago. The work of Suilin (2017), done for the Kaggle challenge of Wikipedia Web Traffic Forecasting, encompasses this idea (Google, 2017). However, that author's work does not use the traditional Bahdanau et al. (2015) or Luong et al. (2015) attention, since those techniques require re-computation of the attention weights for the whole sequence at each time step, which is computationally expensive. There are two variants in the attention mechanism proposed by Suilin (2017).
In the first method, the encoder outputs that correspond to important points identified beforehand are directly fed as inputs to the respective time steps in the decoder. In the second method, the important-points concept is relaxed to take a weighted average of those important points along with their two neighbouring points. This scheme helps to account for the noise in the time series, as well as to cater for the different lengths of the months and the leap years. This latter attention has been further extended to form a type of hierarchical attention, using an interconnected hierarchy of 1D convolution layers, corresponding to weighted averaging of the neighbouring points, with max pooling layers in between. Despite using the same weights for all the forecasting steps at the decoder, Suilin (2017) claims that this reduces the error significantly. Cinar et al. (2017) use a more complex attention scheme to treat the periods in the time series. The idea of those authors is that the classical attention mechanism proposed by Bahdanau et al. (2015) is intended for machine translation, and therefore does not attend to the seasonal periods of time series in a forecasting setting. Hence, they propose an extension to the Bahdanau-style attention and name it the Position-based Content Attention Mechanism. An extra term (π(1)) is used in the attention equations to embed the importance of each time step of the time series when calculating the outputs for the forecast horizon. Based on whether a particular time step in the history corresponds to a pseudo-period of the output step or not, the modified equation can be used to either augment or diminish the effect of the hidden state. The vector π(1) is expected to be trained along with the other parameters of the network. Those authors perform experiments using their proposed technique on six datasets, both univariate and multivariate.
They further compare the resulting error with the traditional attention mechanism, as well as with ARIMA and Random Forest. The results imply that the proposed variant of the attention mechanism is promising, since it is able to surpass the other models in terms of accuracy on five of the six tested datasets. Furthermore, the plots of the attention weights of the classical attention scheme and the proposed variant show that, with this variant, the pseudo-periods get assigned more weight, whereas in the naïve attention, the weights increase towards the time steps closer to the forecast origin. Another work, by Qin et al. (2017), proposes a dual-stage attention mechanism (DA-RNN) for multivariate forecasting problems. In the first stage of the attention, the input attention, different weights are assigned to different driving series based on their significance for the forecast at each time step. This is done prior to feeding the input to the encoder component of the S2S network. The input received by each time step of the encoder is a set of values from the different exogenous driving series, whose influence is accordingly increased or decreased. The weights pertaining to each driving series' values are derived using another MLP, which is trained along with the S2S network. In the second stage of attention, the temporal attention, a usual Bahdanau-style attention scheme is employed to assign weights to the encoder outputs at each prediction step. In terms of the experiments, the model is tested using two datasets. An extensive comparison is performed among many models, including ARIMA, NARX RNN, Encoder-Decoder, Attention RNN, Input Attention RNN and the Dual-Stage Attention RNN. The Dual-Stage Attention model is able to outperform all the other models on both datasets.
Further experiments using noisy time series reveal that the Dual-Stage Attention model is comparatively robust to noise. More recently, Liang et al. (2018) have developed a multi-level attention network named GeoMAN for time series forecasting of geo-sensory data. According to those authors, the significance of that model in comparison to the DA-RNN model is that it explicitly handles the special characteristics inherent in geo-sensory data, such as the spatio-temporal correlation. While the DA-RNN has a single input attention mechanism to differentiate between different driving series, the GeoMAN has two levels of spatial attention: the local spatial attention, which weights the different local time series produced by the same sensor based on their importance to the target series, and the global spatial attention, which does the same for the time series produced by the different surrounding sensors. The concatenation of the context vectors produced by these two attention mechanisms is then fed to a third, temporal attention mechanism used on the decoder. Those authors emphasize the importance of the two-fold spatial attention mechanism as opposed to the single input attention in the work by Qin et al. (2017), since the latter treats both local and global time series as equal in its attention technique. The underlying attention scheme used for all three attention steps follows a Bahdanau-style additive scheme. Empirical evidence on two geo-sensory datasets has shown the proposed GeoMAN model to outperform other state-of-the-art techniques such as S2S models, ARIMA, LSTM, and DA-RNN. Apart from using individual RNN models, several researchers have also experimented with ensembles of RNN models for forecasting. Ensembles combine multiple weak learners to generate a more robust prediction and thus reduce the individual training time of each base RNN.
Smyl (2017) segments the problem into two parts: developing a pool of specialized RNN models and ensembling them to derive a combined prediction. That author's methodology follows the idea that general clustering based on a standard metric does not offer the best performance for developing ensembles in the context of forecasting. Therefore, this approach randomly assigns the time series of the dataset to one of the RNNs in the pool for training in the first epoch. Once all networks are trained, every time series in the dataset is assigned to a specific number N of best networks, those which give the minimum error after the training in that epoch. The RNNs are trained in this manner iteratively, using the newly allocated series, until the validation error grows or the number of epochs is exhausted. The ensembling options include using another network to determine which RNN in the pool should make the forecasts for a given series, feeding the forecasts of each network to another network to make the final predictions, or using another network to determine the weights to be assigned to the forecasts of each RNN in the pool. Despite the complexity associated with these techniques, that author claims that none of them work well. Simple techniques, on the other hand, such as the average or weighted average of all the networks (with weights based on the training loss of each network or its rate of becoming a best network), or the weighted average of the N best networks, do work well. This methodology demonstrates encouraging results on the monthly series of the M3 competition dataset. The RNNs used for the pool comprise stacked LSTMs with skip connections. This technique is also used in the winning solution of Smyl (2020) at the M4 competition, which demonstrates that this approach works well. Krstanovic and Paulheim (2017) suggest that the higher the diversity of the base learners, the better the accuracy of the final ensemble.
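The simple combination rules that Smyl (2017) reports to work well, the plain and weighted averages of the N best networks, can be sketched as follows. The inverse-loss weighting here is one plausible realisation of weighting by training loss; the function and its names are ours, not that author's code:

```python
import numpy as np

def n_best_weighted_average(forecasts, losses, n_best):
    """Combine per-network forecasts for one series (illustrative sketch).

    forecasts: (num_networks, horizon) forecasts from each RNN in the pool
    losses:    (num_networks,) training loss of each network on this series
    n_best:    number of best networks to keep
    """
    order = np.argsort(losses)[:n_best]    # indices of the N best networks
    best = forecasts[order]
    w = 1.0 / (losses[order] + 1e-8)       # weight inversely to training loss
    w /= w.sum()                           # normalise weights to sum to 1
    return w @ best                        # weighted average forecast
```

Setting the weights uniform instead recovers the plain average of the N best networks.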
Those authors develop an ensemble using the method known as Stacking, where a number of base LSTM learners are combined using a meta-learner whose inputs are the outputs of the base learners. The outputs of the meta-learner correspond to the expected final forecasts. The dissimilarity of the base learners is enforced by using a range of values for hyperparameters such as the dropout rate, the number of hidden layers, the number of nodes per layer and the learning rate. For the meta-learner, those authors use Ridge Regression, eXtreme Gradient Boosting (XGBoost) and Random Forest. The method is compared with other popular techniques such as LSTM, ensembling via the mean forecast, Moving Average, ARIMA and XGBoost on four different datasets. The suggested ensemble with Stacking in general performs best. A modified boosting algorithm for RNNs in the time series forecasting context is proposed by Assaad et al. (2008), based on the AdaBoost algorithm. The novel aspect of this approach is that the whole training set is considered for model training at each iteration, by using a specialized parameter k which controls the weight exerted on each training series based on its error in the previous iteration. Therefore, in every iteration, the series that got a high error in the previous iteration are assigned more weight. A value of 0 for k assigns equal weights to all the time series in the training set. Furthermore, the base models in this approach are merged together using a weighted median, which is more robust against outliers than a weighted mean. Experiments on two time series datasets show promising performance of this ensemble model on both single-step-ahead and multi-step-ahead forecasting problems, compared to other models in the literature such as the MLP or the Threshold Autoregressive (TAR) model.
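The weighted median used to merge the boosted base models follows the standard rule: sort the base forecasts, accumulate their weights, and take the first forecast at which the cumulative weight reaches half the total. A minimal sketch (the function name is ours):

```python
import numpy as np

def weighted_median(values, weights):
    """Weighted median of base-model forecasts for one horizon step."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)             # sort forecasts ascending
    v, w = values[order], weights[order]
    cumulative = np.cumsum(w)
    # first sorted forecast whose cumulative weight reaches half the total
    return v[np.searchsorted(cumulative, 0.5 * w.sum())]
```

Unlike the weighted mean, a single outlying base forecast cannot drag the combined forecast arbitrarily far, which is the robustness property noted above.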
2.4. Leveraging Cross-Series Information

Exploiting cross-series information in forecasting is an idea that has gained increased attention lately, especially in the aftermath of the M4 competition. The idea is that instead of developing one model per each time series in the dataset, a model is developed by exploiting information from many time series simultaneously. In the literature such models are often referred to as global models, whereas univariate models which build a model per every series are known as local models (Januschowski et al., 2020). However, the application of global models to a set of time series does not indicate any interdependence between them with respect to the forecasts. Rather, it means that the parameters are estimated globally for all the time series available (Januschowski et al., 2020). The former is basically the scenario of multivariate forecasting, where the value of one time series is driven by other external time-varying variables. Such relationships are directly modelled in the forecast equations of the method. Yet, a global model trained across series usually works independently on the individual series in a univariate manner when producing the forecasts. In modern forecasting problems, the requirement is often to produce forecasts for many time series which may have similar patterns, as opposed to forecasting just one time series. One common example in the retail domain is to produce forecasts for many similar products. In such scenarios, global models can demonstrate their true potential by learning across series to incorporate more information, in comparison to the local methods. Trapero et al. (2015) use a similar idea in their work on demand forecasting for stock-keeping units (SKUs). In particular, for SKUs with limited or no associated promotional history, the coefficients of the regression model are calculated by pooling across many SKUs.
However, the regression model used can capture only linear relationships, and it does not maintain any internal state per each time series. In recent literature, researchers have used the idea of developing global models in the context of deep neural networks. For RNNs, this means that the weights are calculated globally, yet the state is maintained per each time series. The winning solution by Smyl (2020) at the M4 forecasting competition uses the global model concept with local parameters as well, to cater for the individual requirements of different series. Bandara et al. (2020) do this by clustering groups of related time series; a global model is developed per each cluster. Salinas et al. (2019) also apply the idea of cross-series information in their DeepAR model for probabilistic forecasting. More recently, Wen et al. (2017); Rangapuram et al. (2018); Wang et al. (2019) and Oreshkin et al. (2019) also employ the cross-series learning concept in their work using deep neural networks for forecasting. Bandara et al. (2019) develop global models for forecasting in an E-commerce environment by considering sales demand patterns of similar products.

3. Methodology

In this section, we describe the details of the methodology employed in our comprehensive experimental study. We present the different recurrent unit types, the RNN architectures as well as the learning algorithms that we implement and compare in our work.

3.1. Recurrent Neural Networks

Our work implements a number of RNN architectures along with different RNN units. These are explained in detail next.

3.1.1. Recurrent Units

Out of the different RNN units mentioned in the literature, we select the following three types of recurrent units to constitute the layers of the RNNs in our experiments.
• Elman Recurrent Unit
• Gated Recurrent Unit
• Long Short-Term Memory with Peephole Connections

The base recurrent unit is introduced by Elman (1990). The structure of the basic ERNN cell is shown in Figure 1.

Figure 1: Elman Recurrent Unit

h_t = σ(W_i · h_{t-1} + V_i · x_t + b_i)    (2a)
z_t = tanh(W_o · h_t + b_o)    (2b)

In Equations 2a and 2b, h_t ∈ R^d denotes the hidden state of the RNN cell (d being the cell dimension). This is the only form of memory in the ERNN cell. x_t ∈ R^m (m being the size of the input) and z_t ∈ R^d denote the input and output of the cell at time step t. W_i ∈ R^{d×d} and V_i ∈ R^{d×m} denote the weight matrices, whereas b_i ∈ R^d denotes the bias vector for the hidden state. Likewise, W_o ∈ R^{d×d} and b_o ∈ R^d signify the weight matrix and the bias vector of the cell output. The current hidden state depends on the hidden state of the previous time step as well as on the current input. This is supported by the feedback loops in the RNN cell connecting its current state to the next state. These connections are essential for incorporating past information when updating the current cell state. In the experiments, we use the sigmoid function (indicated by σ) as the activation of the hidden state and the hyperbolic tangent function (indicated by tanh) as the activation of the output. The ERNN cell suffers from the well-known vanishing gradient and exploding gradient problems over very long sequences. This implies that simple RNN cells are not capable of carrying long-term dependencies into the future. When the sequences are quite long, the backpropagated gradients tend to diminish (vanish) and consequently the weights do not get updated adequately. On the other hand, when the gradients are large, they may grow without bound (explode) over long sequences, resulting in unstable weight matrices.
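Equations 2a and 2b map directly to code. A minimal NumPy sketch of a single ERNN forward step, with shapes following the notation above (illustrative only, not the TensorFlow cell used in the study):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ernn_step(x_t, h_prev, W_i, V_i, b_i, W_o, b_o):
    """One Elman RNN step, following Equations 2a and 2b.

    x_t:    (m,)  input at time step t
    h_prev: (d,)  previous hidden state h_{t-1}
    W_i: (d, d), V_i: (d, m), b_i: (d,)  hidden-state parameters
    W_o: (d, d), b_o: (d,)               output parameters
    """
    h_t = sigmoid(W_i @ h_prev + V_i @ x_t + b_i)   # (2a) hidden-state update
    z_t = np.tanh(W_o @ h_t + b_o)                  # (2b) cell output
    return h_t, z_t
```

The recurrence arises simply from feeding the returned h_t back in as h_prev at the next time step.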
Both of these issues result from the gradients becoming intractable, and they hinder the ability of RNN cells to capture long-term dependencies. Over the years, several variations have been introduced to this base recurrent unit to address its shortcomings. The LSTM cell introduced by Hochreiter and Schmidhuber (1997) is perhaps the most popular cell for natural language processing tasks, due to its capability to capture long-term dependencies in the sequence while alleviating gradient vanishing issues. The structure of the LSTM cell is illustrated in Figure 2.

Figure 2: Basic Long Short-Term Memory Unit

i_t = σ(W_i · h_{t-1} + V_i · x_t + b_i)    (3a)
o_t = σ(W_o · h_{t-1} + V_o · x_t + b_o)    (3b)
f_t = σ(W_f · h_{t-1} + V_f · x_t + b_f)    (3c)
C̃_t = tanh(W_c · h_{t-1} + V_c · x_t + b_c)    (3d)
C_t = i_t ⊙ C̃_t + f_t ⊙ C_{t-1}    (3e)
h_t = o_t ⊙ tanh(C_t)    (3f)
z_t = h_t    (3g)

Compared to the basic RNN cell, the LSTM cell has two components to its state, the hidden state and the internal cell state, where the hidden state corresponds to the short-term memory component and the cell state to the long-term memory. With its Constant Error Carousel (CEC) capability, supported by the internal state of the cells, the LSTM avoids the vanishing and exploding gradient issues. Moreover, a gating mechanism is introduced which comprises the input, forget and output gates. In Equations 3a–3g, h_t ∈ R^d is a vector which denotes the hidden state of the cell, where d is the cell dimension. Similarly, C_t ∈ R^d is the cell state and C̃_t ∈ R^d is the candidate cell state at time step t, which captures the important information to be persisted into the future. x_t ∈ R^m and z_t ∈ R^d are the same as explained for the basic RNN cell. W_i, W_o, W_f, W_c ∈ R^{d×d} denote the weight matrices of the input gate, output gate, forget gate and the cell state respectively.
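Equations 3a–3g can be restated as a minimal NumPy sketch of one LSTM forward step. This is illustrative only (the study uses the TensorFlow implementation), and the dictionary-based grouping of the parameters is ours:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W, V, b):
    """One LSTM step, following Equations 3a-3g.

    W, V, b are dicts keyed by 'i', 'o', 'f', 'c' holding
    W_* in (d, d), V_* in (d, m) and b_* in (d,).
    """
    i_t = sigmoid(W['i'] @ h_prev + V['i'] @ x_t + b['i'])     # (3a) input gate
    o_t = sigmoid(W['o'] @ h_prev + V['o'] @ x_t + b['o'])     # (3b) output gate
    f_t = sigmoid(W['f'] @ h_prev + V['f'] @ x_t + b['f'])     # (3c) forget gate
    C_cand = np.tanh(W['c'] @ h_prev + V['c'] @ x_t + b['c'])  # (3d) candidate state
    C_t = i_t * C_cand + f_t * C_prev    # (3e) Hadamard products of gates and states
    h_t = o_t * np.tanh(C_t)             # (3f)
    return h_t, C_t                      # output z_t equals h_t, per (3g)
```

Gate values near 0 wipe the corresponding state contribution, and values near 1 pass it through unchanged, exactly as described for Equation 3e.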
Likewise, V_i, V_o, V_f, V_c ∈ R^{d×m} and b_i, b_o, b_f, b_c ∈ R^d denote the weight matrices corresponding to the current input and the bias vectors, respectively. i_t, o_t, f_t ∈ R^d are the input, output and forget gate vectors. The activation function σ of the gates denotes the sigmoid function, which outputs values in the range [0, 1]. In Equation 3e, the input and the forget gates together determine how much of the past information to retain in the current cell state and how much of the current context to propagate forward to future time steps. ⊙ denotes the element-wise multiplication, known as the Hadamard product. A value of 0 in the forget gate f_t denotes that nothing should be carried forward from the previous cell state; in other words, the previous cell state should be completely forgotten in the current cell state. Following this argument, a value of 1 implies that the previous cell state should be completely retained. The same notion holds for the other two gates i_t and o_t. A value between the two extremes of 0 and 1 for both the input and forget gates can carefully control the value of the current cell state, using only the important information from both the previous cell state and the current candidate cell state. For the candidate cell state, the activation function is a hyperbolic tangent function, which outputs values in the range [-1, 1]. An important difference of the LSTM cell compared to the simple RNN cell is that its output z_t is equal to the hidden state h_t. In this review, we use the variant of the vanilla LSTM cell known as the LSTM cell with peephole connections. In this LSTM unit, the states are updated as per the following equations.
i_t = σ(W_i · h_{t-1} + V_i · x_t + P_i · C_{t-1} + b_i)    (4a)
o_t = σ(W_o · h_{t-1} + V_o · x_t + P_o · C_t + b_o)    (4b)
f_t = σ(W_f · h_{t-1} + V_f · x_t + P_f · C_{t-1} + b_f)    (4c)
C̃_t = tanh(W_c · h_{t-1} + V_c · x_t + b_c)    (4d)
C_t = i_t ⊙ C̃_t + f_t ⊙ C_{t-1}    (4e)
h_t = o_t ⊙ tanh(C_t)    (4f)
z_t = h_t    (4g)

The difference is that the peephole connections let the forget and input gates of the cell look at the previous cell state C_{t-1} before updating it. In Equations 4a to 4g, P_i, P_o, P_f ∈ R^{d×d} represent the peephole weight matrices of the input, output, and forget gates, respectively. As for the output gate, it can now inspect the current cell state when generating the output. The implementation used in this research is the default implementation of peephole connections in the TensorFlow framework. The GRU is another variant, introduced by Cho et al. (2014), which is comparatively simpler than the LSTM unit as well as faster in computation. This is because the LSTM unit has three gates within its internal gating mechanism, whereas the GRU has only two, the update gate and the reset gate. The update gate in this unit plays the role of the forget gate and the input gate combined. Furthermore, similar to the vanilla RNN cell, the GRU cell has only one component to its state, i.e. the hidden state. Figure 3 and the following equations display the functionality of the GRU cell.

Figure 3: Gated Recurrent Unit

u_t = σ(W_u · h_{t-1} + V_u · x_t + b_u)    (5a)
r_t = σ(W_r · h_{t-1} + V_r · x_t + b_r)    (5b)
h̃_t = tanh(W_h · (r_t ⊙ h_{t-1}) + V_h · x_t + b_h)    (5c)
h_t = u_t ⊙ h̃_t + (1 − u_t) ⊙ h_{t-1}    (5d)
z_t = h_t    (5e)

u_t, r_t ∈ R^d denote the update and reset gates, respectively. h̃_t ∈ R^d indicates the candidate hidden state and h_t ∈ R^d the current hidden state at time step t. The weights and biases follow the same notation as mentioned before.
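Similarly, Equations 5a–5e give one GRU forward step. The sketch below is again illustrative rather than the TensorFlow cell, with the reset gate applied element-wise to the previous hidden state before the candidate-state projection:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_u, V_u, b_u, W_r, V_r, b_r, W_h, V_h, b_h):
    """One GRU step, following Equations 5a-5e."""
    u_t = sigmoid(W_u @ h_prev + V_u @ x_t + b_u)             # (5a) update gate
    r_t = sigmoid(W_r @ h_prev + V_r @ x_t + b_r)             # (5b) reset gate
    h_cand = np.tanh(W_h @ (r_t * h_prev) + V_h @ x_t + b_h)  # (5c) candidate state
    h_t = u_t * h_cand + (1.0 - u_t) * h_prev                 # (5d) interpolation
    return h_t                                                # z_t = h_t, per (5e)
```

Equation 5d makes the interpolation role of the update gate explicit: u_t near 1 adopts the candidate state, u_t near 0 keeps the previous hidden state.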
The reset gate decides how much of the previous hidden state contributes to the candidate state of the current step. Since the update gate functions alone, without a forget gate, (1 − u_t) is used as an alternative. The GRU has gained a lot of popularity owing to its simplicity (fewer parameters) compared to the LSTM cell and also to its efficiency in training.

3.1.2. Recurrent Neural Network Architectures

We use the following RNN architectures for our study.

3.1.2.1 Stacked Architecture

The stacked architecture used in this study closely relates to the architecture mentioned in the work of Bandara et al. (2020). It is illustrated in Figure 5. Figure 4 shows the folded version of the RNN, while Figure 5 demonstrates the version unfolded through time. The idea is that the same RNN unit repeats for every time step, sharing the same weights and biases. The feedback loop of the cell helps the network propagate the state h_t to future time steps. For the purpose of generalization, only the state h_t is shown in the diagram; however, for an LSTM cell, h_t should be accompanied by the cell state C_t.

Figure 4: Folded Version of RNN
Figure 5: Stacked Architecture

The notion of stacking means that multiple LSTM layers can be stacked on top of one another. In the most basic setup, the model has only one LSTM layer. Figure 5 is for such a basic case, whereas Figure 6 is for a multi-layered scenario. As seen in Figure 6, for many hidden layers the same structure as in Figure 5 is repeated multiple times, stacked on top of each other, where the output from every layer is directly fed as input to the next layer immediately above and the final forecasts are retrieved from the last layer. As seen in Figure 5, X_t denotes the input to the cell at time step t and Ŷ_t corresponds to the output. The X_t and Ŷ_t used in the stacking architecture are vectors instead of single data points.
This is done according to the moving window scheme explained later in Section 4.2.5. At every time step, the cell operates using its existing weights and produces the output Z_t, which corresponds to the next immediate output window of the time series. Likewise, the output of the RNN cell instance at the final time step, Ŷ_T, corresponds to the expected forecasts for the particular time series. However, the output of the cell does not conform to the expected dimension, which is the forecasting horizon (H). Since the cell dimension (d) is an externally tuned hyperparameter, it may take on any appropriate value. Therefore, to project the output of the cell to the expected forecasting horizon, an affine neural layer is connected on top of every recurrent cell, whose weights are trained together with the recurrent network itself. In Figure 5, this fully connected layer is not shown explicitly, and the output Ŷ_t ∈ R^H corresponds to the output of the combined RNN cell and dense layer.

Figure 6: Multi-layer Stacked Architecture

During the model training process, the error is calculated per each time step and accumulated until the end of the time series. Let the error per each time step t be e_t. Then,

e_t = Y_t − Ŷ_t    (6)

where Y_t is the actual output vector at time step t, also preprocessed according to the moving window strategy. Over all the time steps, the accumulated error E can be defined as follows:

E = Σ_{t=1}^{T} e_t    (7)

At the end of every time series, the accumulated error E is used for Backpropagation Through Time (BPTT), once for the whole sequence. This BPTT then updates the weights and biases of the RNN cells according to the optimizer algorithm used.

3.1.2.2 Sequence to Sequence Architecture

The S2S architecture used in this study is illustrated in Figure 7. There are a few major differences of the S2S network compared to the stacked architecture. The first is the input format.
The input x_t fed to each cell instance of this network is a single data point instead of a vector. Stated differently, this network does not use the moving window scheme of data preprocessing. The RNN cells keep receiving input at each time step and consequently build up the state of the network. This component is known as the Encoder. However, in contrast to the stacked architecture, the output is not considered per each time step; rather, only the forecasts produced after the last input point of the Encoder are considered.

Figure 7: Sequence to Sequence with Decoder Architecture

Here, every y_t corresponds to a single forecasted data point in the forecast horizon. The component that produces the outputs in this manner is called the Decoder. The Decoder comprises a set of RNN cell instances as well, one per each step of the forecast horizon. The initial state of the Decoder is the final state built by the Encoder, which is also known as the context vector. A distinctive feature of the Decoder is that it contains autoregressive connections from the output of the previous time step into the input of the cell instance at the next time step. During training, these connections are disregarded and the externally fed actual output of each previous time step is used instead, as a means of teacher forcing. This teaches the Decoder component how inaccurate it is in predicting the previous output and how much it should be corrected. During testing, since the actual targets are not available, the autoregressive connections are used instead, substituting the targets with the generated forecasts. This relates to scheduled sampling, where a decision is taken either to sample from the outputs or from the external inputs at the Decoder. The Decoder can also accept external inputs to support exogenous variables whose values are known for the future time steps.
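The decoder behaviour just described, teacher forcing during training and autoregressive feedback during testing, can be sketched as follows. Here `cell_step` and `project` are hypothetical stand-ins for the recurrent cell and the affine output layer, and the zero seed input for the first step is an assumption of the sketch:

```python
def decode(context, horizon, cell_step, project, teacher_targets=None):
    """Unroll a S2S decoder for `horizon` steps (illustrative sketch).

    context:         final encoder state (the context vector)
    cell_step:       function (input, state) -> new state
    project:         function state -> scalar forecast (affine layer)
    teacher_targets: actual outputs; if given, teacher forcing is used,
                     otherwise the previous forecast is fed back in.
    """
    state = context
    prev = 0.0                       # assumed seed input for the first step
    forecasts = []
    for t in range(horizon):
        state = cell_step(prev, state)
        y_hat = project(state)
        forecasts.append(y_hat)
        # teacher forcing in training, autoregressive feedback in testing
        prev = teacher_targets[t] if teacher_targets is not None else y_hat
    return forecasts
```

Scheduled sampling amounts to replacing the deterministic choice of `prev` with a random decision between the teacher target and the model's own forecast.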
Since the hidden state size may not be equal to 1, which is the expected output size of each cell of the Decoder, an affine neural layer is applied on top of every cell instance of the Decoder, similar to the stacked architecture. The error computed for backpropagation in the S2S architecture differs from that of the stacked architecture, since no error accumulation happens over the time steps of the Encoder. Only the error at the Decoder is considered for the loss function of the optimizer:

E = Σ_{t=1}^{H} (y_t − ŷ_t)    (8)

There are basically two types of components for output in a S2S network. The most common is the Decoder, as mentioned above. Inspired by the work of Wen et al. (2017), a dense layer can also be used in place of the Decoder. We implement both these possibilities in our experiments. However, since we do not feed any future information into the decoder, we do not use the concept of the local MLP as in Wen et al. (2017) and use only one global MLP for the forecast horizon. This technique is expected to obviate the error propagation issue in the Decoder with autoregressive connections. This model is illustrated in Figure 8. As seen in Figure 8, the network no longer contains a Decoder. Only the Encoder exists to take input per each time step. However, the format of the input in this model can be twofold: either with a moving window scheme or without it. Both of these are tested in our study.
In the scheme with the moving window, each Encoder cell receives the vector of inputs X_t, whereas without the moving window scheme, the input x_t corresponds to a single scalar input, as explained before for the network with the Decoder.

Figure 8: Sequence to Sequence with Dense Layer Architecture

The forecast for both of these is simply the output of the last Encoder time step, projected to the desired forecast horizon using a dense layer without bias (fully connected layer). Therefore, the output of the last Encoder step is a vector both with a moving window and without. The error considered for the backpropagation is the error produced by this forecast:

E = Y_T − Ŷ_T    (9)

In comparison to the stacked architecture, which is also fed with a moving window input, the only difference in this S2S architecture with the dense layer and the moving window input format is that the former calculates the error per each time step, while the latter calculates the error only for the last time step. The set of models selected for implementation by considering all the aforementioned architectures and input formats is shown in Table 1.

Architecture           Output Component   Input Format            Error Computation
Stacked                Dense Layer        Moving Window           Accumulated Error
Sequence to Sequence   Decoder            Without Moving Window   Last Step Error
Sequence to Sequence   Dense Layer        Without Moving Window   Last Step Error
Sequence to Sequence   Dense Layer        Moving Window           Last Step Error

Table 1: RNN Architecture Information

All the models are implemented using the three RNN units, the Elman RNN cell, the LSTM cell and the GRU cell, and tested across the five datasets detailed further in Section 4.1.

3.2. Learning Algorithms

Three learning algorithms are tested in our framework: the Adam optimizer, the Adagrad optimizer and the COntinuous COin Betting (COCOB) optimizer (Kingma and Ba, 2015; Duchi et al., 2011; Orabona and Tommasi, 2017).
Both Adam and Adagrad have built-in TensorFlow implementations, while the COCOB optimizer has an open-source implementation which uses TensorFlow (Orabona, 2017). The Adam optimizer and the Adagrad optimizer both require the learning rate hyperparameter, which, if poorly tuned, may perturb the whole learning process. The Adagrad optimizer introduces the adaptive learning rate concept, where a separate learning rate is kept for each variable of the function to be optimized. Consequently, the different weights are updated using separate equations. However, the Adagrad optimizer is prone to shrinking learning rates over time, as its learning rate update equations have accumulating gradients in the denominator. This slows down the learning process of the optimizer over time. The Adam optimizer, similar to Adagrad, keeps one learning rate per each parameter. However, to address the issue of vanishing learning rates, the Adam optimizer uses exponentially decaying averages of both the gradient moments and the squared gradients in its update equations. The Adam optimizer is generally expected to perform better than the other optimizers, and Kingma and Ba (2015) empirically demonstrate this. The COCOB optimizer attempts to minimize the loss function by self-tuning its learning rate. Therefore, this is one step closer to fully automating the NN modelling process, since the user is relieved from the burden of defining the initial learning rate ranges for the hyperparameter tuning algorithm. Learning algorithms are extremely sensitive to their learning rate; therefore, finding the optimal learning rate is crucial for model performance. The COCOB algorithm is based on a coin betting scheme where, during each iteration, an amount of money is bet on the outcome of a coin toss such that the total wealth in possession is maximized.
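The betting scheme just described can be illustrated with a toy one-dimensional Krichevsky–Trofimov-style bettor, in which the fraction of wealth bet grows as long as the outcomes keep the same sign. This is only an illustration of the intuition, not the actual COCOB-Backprop update, which among other differences bounds the outcomes and keeps per-coordinate state:

```python
def coin_betting_1d(outcomes, initial_wealth=1.0):
    """Toy KT-style coin betting (intuition only, not COCOB-Backprop).

    outcomes: sequence of coin-toss outcomes in [-1, 1]; in the
    optimisation view, these play the role of negative (sub)gradients.
    Returns the sequence of bets placed (the iterate positions).
    """
    wealth = initial_wealth
    outcome_sum = 0.0
    bets = []
    for t, outcome in enumerate(outcomes, start=1):
        fraction = outcome_sum / t     # KT fraction from past outcomes
        bet = fraction * wealth        # bet a fraction of the current wealth
        wealth += outcome * bet        # win or lose according to the outcome
        outcome_sum += outcome
        bets.append(bet)
    return bets
```

With a constant outcome of +1, both the wealth and the betting fraction grow, so the bets increase round after round, mirroring the ever larger steps taken while the sign of the negative subgradient stays constant.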
Orabona and Tommasi (2017) apply the same idea to function optimization, where the bet corresponds to the size of the step taken along the axis of the independent variable. The total wealth and the outcome of the coin flip correspond to the optimum point of the function and the negative subgradient of the function at the bet point, respectively. During each round, a fraction of the current total wealth (optimum point) is bet. The betting strategy is designed such that the total wealth does not become negative at any point, and the fraction of the money bet in each round increases as long as the outcome of the coin toss remains constant. For our context, this means that as long as the signs of the negative subgradient evaluations remain the same, the algorithm keeps making bigger steps along the same direction. This makes the convergence faster, in contrast to other gradient descent algorithms with a constant or decaying learning rate, where convergence becomes slower close to the optimum. Similar to the Adagrad optimizer, COCOB too maintains separate coins (learning rates) for each parameter.

4. Experimental Framework

To implement the models we use version 1.12.0 of the TensorFlow open-source deep learning framework introduced by Abadi et al. (2015). This section details the different datasets used for the experiments along with their associated preprocessing steps, the information of the model training and testing procedures, as well as the benchmarks used for comparison.

4.1. Datasets

The datasets used for the experiments are taken from the following forecasting competitions, held during the past few years.
• CIF 2016 Forecasting Competition Dataset
• NN5 Forecasting Competition Dataset
• M3 Forecasting Competition Dataset
• M4 Forecasting Competition Dataset
• Wikipedia Web Traffic Time Series Forecasting Competition Dataset
• Tourism Forecasting Competition Dataset

As mentioned earlier, our study is limited to using RNN architectures on univariate, multi-step-ahead forecasting considering only single seasonality, to be able to straightforwardly compare against automatic standard benchmark methods. For the NN5 dataset and the Wikipedia web traffic dataset, since they contain daily data with less than two years of history, we consider only the weekly seasonality. From the M3 and M4 datasets, we only use the monthly time series, which contain a single yearly seasonality.

Dataset Name            No. of Time Series   Forecasting Horizon   Frequency   Max. Length   Min. Length
CIF 2016                72                   6, 12                 Monthly     108           22
NN5                     111                  56                    Daily       735           735
M3                      1428                 18                    Monthly     126           48
M4                      48,000               18                    Monthly     2794          42
Wikipedia Web Traffic   997                  59                    Daily       550           550
Tourism                 366                  24                    Monthly     309           67

Table 2: Dataset Information

The NN5, Wikipedia web traffic and Tourism datasets contain non-negative series, meaning that they also have 0 values. Table 2 gives further insight into these datasets. The CIF 2016 competition dataset has 72 monthly time series, with 57 of them having a prediction horizon of 12 and the remaining 15 a prediction horizon of 6 in the original competition. Some of the series with prediction horizon 6 are shorter than two full periods and are thus considered as having no seasonality. Out of all the time series, 48 series are artificially generated, while the rest are real time series originating from the banking domain (Štěpnička and Burda, 2017). The NN5 competition was held in 2008.
This dataset has in total 111 daily time series which represent close to two years of daily cash withdrawal data from ATM machines in the UK (Ben Taieb et al., 2012). The forecasting horizon for all time series is 56. The NN5 dataset also contains missing values. The methods used to compensate for these issues are detailed in Section 4.2.2.

Two of the datasets are selected from the M competition series held over the years. From both the M3 and M4 competitions we select only the monthly category, which contains a single yearly seasonality. The yearly data contain no seasonality and the series are relatively shorter. In the quarterly data too, although they contain yearly seasonality, the series are shorter and there are fewer series compared to the monthly category. The M3 competition, held in 2000, has monthly time series with a prediction horizon of 18. In particular, this dataset consists of time series from a number of different categories, namely micro, macro, industry, finance, demography and other. The total number of time series in the monthly category is 1428 (Makridakis and Hibon, 2000). The M4 competition held in 2018 has a dataset with a similar format to that of the M3 competition. It has the same 6 categories as stated before, and the participants were required to make 18 months ahead predictions for the monthly series. Compared to the previous competitions, one of the key objectives of the M4 is to increase the number of time series available for forecasting (Makridakis et al., 2018a). Consequently, the whole dataset consists of 100,000 series, and the monthly subcategory alone has a total of 48,000 series.

The other two datasets that we select for the experiments are both taken from Kaggle challenges, the Web Traffic Time Series Forecasting Competition for Wikipedia articles and the Tourism Dataset Competition (Google, 2017; Athanasopoulos et al., 2010).
The task of the first competition is to predict the future web traffic (number of hits) of a given set of Wikipedia pages (145,000 articles), given their history of traffic for about two years. The dataset involves some metadata as well, to denote whether the traffic comes from desktop, mobile, spider or all these sources. For this study, this problem is reduced to the history of the first 997 articles for the period of 1st July 2015 to 31st December 2016. The expected forecast horizon is from 1st January 2017 to 28th February 2017, covering 59 days. One important distinction of this dataset compared to the others is that all the values are integers. Since the NN outputs continuous values, both positive and negative, the values need to be rounded to the closest non-negative integer to obtain the final forecasts.

The tourism dataset contains monthly, yearly and quarterly time series. Again, for our experiments we select the monthly category, which has 366 series in total. The requirement in this competition for the monthly category is to predict the next 24 months of the given series. However, the nature of the data is generically termed as 'tourism related' and no specific details are given about what values the individual time series hold. The idea of the competition is to encourage the public to come up with new models that can beat the results initially published by Athanasopoulos et al. (2011) using the same tourism dataset.

Table 2 also gives details of the lengths of the different time series of the datasets. The CIF 2016, M3 and Tourism datasets have relatively short time series (maximum lengths being 108, 126 and 309, respectively). The NN5 and Wikipedia Web Traffic dataset time series are longer. In the M4 monthly dataset, the lengths of the series vary considerably, from 42 to 2794. Figure 9 shows violin plots of the seasonality strengths of the different datasets.
To extract the seasonality strengths, we use the tsfeatures R package (Hyndman et al., 2019).

Figure 9: Violin Plots of Seasonality Strengths.

The two datasets NN5 and Tourism have a higher seasonality strength compared to the others; the interquartile ranges of the two respective violin plots are located comparatively higher along the y axis. Among these two, the seasonality strengths of the individual NN5 series vary less from each other, since the interquartile range is quite narrow. For the Wikipedia Web Traffic dataset, the interquartile range is even narrower and located much lower along the y axis. This indicates that the time series of the Wikipedia Web Traffic dataset carry quite minimal seasonality compared to the other datasets. For the CIF, M3 monthly and M4 monthly datasets, the seasonality strengths of the different time series are spread across a wide spectrum.

4.2. Data Preprocessing

We apply a number of preprocessing steps in our work that we detail in this section. Most of them are closely related to the ideas presented in Bandara et al. (2020).

4.2.1. Dataset Split

The training and validation datasets are separated similarly to the ideas presented by Suilin (2017) and Bandara et al. (2020). From each time series, we reserve a part from the end for validation, with a length equal to the forecast horizon. This is for finding the optimal values of the hyperparameters using the automated techniques explained in Section 4.3.1. The rest of the time series constitutes the training data. This approach is illustrated in Figure 10. We use such a fixed origin mechanism for validation instead of a rolling origin scheme, since we want to replicate a usual competition setup.
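This fixed-origin split amounts to slicing the last horizon-sized piece off each series; a minimal sketch (the function name is ours):

```python
def fixed_origin_split(series, horizon):
    """Reserve the last `horizon` points of a series for validation;
    the rest is used for training (fixed origin, as in Figure 10)."""
    if len(series) <= horizon:
        raise ValueError("series must be longer than the forecast horizon")
    return series[:-horizon], series[-horizon:]

train, valid = fixed_origin_split(list(range(100)), horizon=12)
```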
Also, given that we have 36 RNN models implemented and tested across 6 datasets involving thousands of time series for both training and evaluation, we deem our results representative even with a fixed origin evaluation, and it would be computationally extremely challenging to perform a rolling origin evaluation.

Figure 10: Train Validation Set Split

However, as stated by Suilin (2017), this kind of split is problematic since the last part of the sequence is not considered for training the model. The further away the test predictions are from the training set the worse, since the underlying patterns may change during this last part of the sequence. Therefore, in this work, the aforementioned split is used only for the validation phase. For testing, the model is re-trained using the whole sequence without any data split. For each dataset, the models are trained using all the time series available, for the purpose of developing a global model. Different RNN architectures need to be fed data in different formats.

4.2.2. Addressing Missing Values

Many machine learning methods cannot handle missing values, so these need to be properly replaced by other appropriate substitutes. There are different techniques available to fill in missing values, mean substitution and median substitution being two of them; linear interpolation and replacing by "0" are other possible methods. Out of the datasets selected for our experiments, the NN5 dataset and the Kaggle web traffic dataset contain missing values. For the NN5 dataset, we use a median substitution method. Since the NN5 dataset includes daily data, a missing value on a particular day is replaced by the median across all the same days of the week along the whole series. For example, if a missing value is encountered on a Tuesday, the median of all values on Tuesdays of that time series is taken as the substitute.
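This weekday-median substitution can be sketched as follows (an illustrative implementation, assuming missing values are marked as None and a 7-day cycle starting at a known weekday):

```python
import statistics

def weekday_median_impute(values, start_weekday=0):
    """Replace missing daily values (None) with the median over the same
    weekday of the whole series (sketch; assumes a 7-day cycle and that
    values[0] falls on weekday `start_weekday`)."""
    out = list(values)
    for day in range(7):
        offset = (day - start_weekday) % 7
        observed = [v for v in values[offset::7] if v is not None]
        if not observed:
            continue  # no observations at all for this weekday
        med = statistics.median(observed)
        for i in range(offset, len(out), 7):
            if out[i] is None:
                out[i] = med
    return out
```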
This approach seems superior to taking the median across all the available data points, since the NN5 series have a strong weekly seasonality. Compared to the NN5 dataset, the Kaggle web traffic dataset has many more missing values. In addition, this dataset does not differentiate between missing values and "0" values. Therefore, for the Kaggle web traffic dataset, a simple substitution by "0"s is carried out for all the missing values.

4.2.3. Modelling Seasonality

With regard to modelling seasonality using NNs, there have been mixed notions among researchers over the years. Some of the early works in this space infer that NNs are capable of modelling seasonality accurately (Sharda and Patil, 1992; Tang et al., 1991). However, more recent experiments suggest that deseasonalization prior to feeding data to the NNs is essential, since NNs are weak in modelling seasonality. In particular, Claveria et al. (2017) empirically show for a tourism demand forecasting problem that seasonally adjusted data can boost the performance of NNs, especially in the case of long forecasting horizons. Zhang and Qi (2005) conclude that using both detrending and deseasonalization can improve the forecasting accuracy of NNs. Similar observations are recorded in the work of Zhang and Kline (2007) as well as Nelson et al. (1999). More recently, in the winning solution by Smyl (2020) at the M4 forecasting competition, the same argument that NNs are weak at modelling seasonality is put forward. A core part of our research is to investigate whether NNs actually struggle to model seasonality on their own. Therefore, we run experiments both with and without removing the seasonality. In the following, we describe the deseasonalization procedure we use in case we remove the seasonality.
Since in theory deterministic seasonality does not change and is therewith known ahead of time, relieving the NN from the burden of modelling it can ease its task and let it predict only the non-deterministic parts of the time series. Following this general consensus, we run a set of experiments with prior deseasonalization of the time series data. For that we use the Seasonal and Trend Decomposition using Loess (STL Decomposition) introduced by Cleveland et al. (1990) as a way of decomposing a time series into its seasonal, trend and remainder components. Loess is the underlying method for estimating non-linear relationships in the data. By the application of a sequence of Loess smoothers, the STL Decomposition can efficiently separate the seasonality, trend and remainder components of a time series. However, this technique can only be used with additive trend and seasonality components. Therefore, in order to convert all multiplicative time series components to an additive format, the STL Decomposition is immediately preceded by a variance stabilization technique, as discussed in Section 4.2.4.

The STL Decomposition is potentially able to allow the seasonality to change over time. However, we use STL in a deterministic way, where we assume the seasonality of all the time series to be fixed along the whole time span. In particular, we use the implementation of the STL Decomposition available in the forecast R package introduced by Hyndman and Khandakar (2008). By specifically setting the s.window parameter of the stl method to "periodic", we make the seasonality deterministic. Hence, we remove only the deterministic seasonality component from the time series, while other stochastic seasonality components may still remain. The NN is expected to model such stochastic seasonality by itself.
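The effect of a periodic (deterministic) seasonal component can be illustrated with a crude pure-Python stand-in; the actual experiments use the stl function of the R forecast package, so this is a sketch of the idea only, with one fixed additive seasonal index per position in the cycle:

```python
def deterministic_seasonal_component(series, period):
    """Crude stand-in for STL with s.window = "periodic": one fixed additive
    seasonal index per cycle position (illustration only; the experiments
    use the R forecast package's stl implementation)."""
    if len(series) < 2 * period:
        # fewer than two full periods: treated as having no seasonality
        return [0.0] * len(series)
    overall_mean = sum(series) / len(series)
    indices = []
    for k in range(period):
        column = series[k::period]  # all observations at cycle position k
        indices.append(sum(column) / len(column) - overall_mean)
    return [indices[i % period] for i in range(len(series))]
```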
The STL Decomposition is applied in this manner to all the time series, regardless of whether they actually show seasonal behaviour or not. Nevertheless, this technique requires at least two full periods of the time series data to determine its seasonality component. In extreme cases where the full length of the series is less than two periods, the technique considers such sequences as having no seasonality and returns 0 for the seasonality component.

4.2.4. Stabilizing the Variance in the Data

Variance stabilization is necessary if an additive seasonality decomposition technique such as the STL Decomposition in Section 4.2.3 is used. For time series forecasting, variance stabilization can be done in many ways, and applying a logarithmic transformation is arguably the most straightforward one. However, a shortcoming of the logarithm is that it is undefined for negative and zero-valued inputs; all values need to be on the positive scale. For non-negative data, this can be easily resolved by defining a Log transformation as in Equation 10, a slightly altered version of a pure logarithm transformation.

w_t = log(y_t)        if min(y) > ε,
      log(y_t + 1)    if min(y) ≤ ε        (10)

Here, y in Equation 10 denotes the whole time series. For count data, ε can be defined as equal to 0, whereas for real-valued data, ε can be selected as a small positive value close to 0. The logarithm is a very strong transformation and may not always be adequate. There are some other similar transformations that attempt to overcome this shortcoming. One such transformation is the Box-Cox transformation, defined as follows:

w_t = log(y_t)             if λ = 0,
      (y_t^λ − 1) / λ      if λ ≠ 0        (11)

As indicated by Equation 11, the Box-Cox transformation is a combination of a log transformation (when λ = 0) and a power transformation y_t^λ (when λ ≠ 0). The parameter λ needs to be carefully chosen.
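The two transformations above can be sketched directly from Equations 10 and 11 (function names are ours):

```python
import math

def log_transform(series, eps=0.0):
    """Equation 10: eps is the threshold, 0 for count data and a small
    positive value close to 0 for real-valued data."""
    if min(series) > eps:
        return [math.log(y) for y in series]
    return [math.log(y + 1) for y in series]  # the shift keeps zeros defined

def box_cox(series, lam):
    """Equation 11: log transform for lam = 0, power transform otherwise."""
    if lam == 0:
        return [math.log(y) for y in series]
    return [(y ** lam - 1) / lam for y in series]
```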
When λ equals 1, the Box-Cox transformation results in y_t − 1. Thus, the shape of the data is not changed, but the series is simply shifted down by 1 (Rob J Hyndman, 2018). Another similar form of power transformation is the Yeo-Johnson transformation (Yeo and Johnson, 2000), shown in Equation 12.

w_t = ((y_t + 1)^μ − 1) / μ                     if μ ≠ 0, y_t ≥ 0,
      log(y_t + 1)                              if μ = 0, y_t ≥ 0,
      −[(−y_t + 1)^(2−μ) − 1] / (2 − μ)         if μ ≠ 2, y_t < 0,
      −log(−y_t + 1)                            if μ = 2, y_t < 0        (12)

The parameter μ in the Yeo-Johnson transformation is confined to the range [0, 2], while μ = 1 gives the identity transformation. Values of y_t are allowed to be positive, negative or zero. Even though this is an advantage over the logarithm, the choice of the optimal value of the parameter μ is again not trivial. Due to the complexities associated with selecting the optimal values of the parameters λ and μ in the Box-Cox and Yeo-Johnson transformations, and given that all data used in our experiments are non-negative, we use a Log transformation in our experiments.

4.2.5. Multiple Output Strategy

Forecasting problems are typically multi-step-ahead forecasting problems, which is also what we focus on in our experiments. Ben Taieb et al. (2012) perform extensive research involving five different techniques for multi-step-ahead forecasting. The Recursive strategy, which is a sequence of one-step-ahead forecasts, involves feeding the prediction from the last time step as input for the next prediction. The Direct strategy involves different models, one per time step of the forecasting horizon. The DiRec strategy combines the concepts from the two above methods and develops multiple models, one per forecasting step, where the prediction from each model is fed as input to the next consecutive model.
In all these techniques, the model outputs a scalar value corresponding to one forecasting step. The other two techniques tested by Ben Taieb et al. (2012) use a direct multi-step-ahead forecast, where a vector of outputs corresponding to the whole forecasting horizon is directly produced by the model. The first technique under this category is known as the Multi-Input Multi-Output (MIMO) strategy. The advantage of the MIMO strategy comes from producing the forecasts for the whole output window at once, and thus incorporating the inter-dependencies between the time steps, rather than forecasting each time step in isolation. The DIRMO technique synthesizes the ideas of the Direct technique and the MIMO technique, where each model produces forecasts for windows of size s (s ∈ {1, ..., H}, H being the forecast horizon). In the extreme cases where s = 1 and s = H, the technique narrows down to the Direct strategy and the MIMO strategy, respectively. The experimental analysis in that paper using the NN5 competition dataset demonstrates that the multiple output strategies are the best overall. The DIRMO strategy requires careful selection of the output window size s (Ben Taieb et al., 2012). Wen et al. (2017) also state in their work that the Direct Multi-Horizon strategy, also known as the MIMO strategy, performs reliably since it avoids error accumulation over the prediction time steps.

Following these early findings, we use in our study the multiple output strategy in all the RNN architectures. However, the inputs and outputs of each recurrent cell differ between the different architectures. For the S2S model, each recurrent cell of the encoder is fed a single scalar input. For the stacking model and the S2S model with the dense layer output, a moving window scheme is used to feed the inputs and create the outputs, as in the work of Bandara et al. (2020).
In this approach, every recurrent cell accepts a window of inputs and produces a window of outputs corresponding to the time steps which immediately follow the fed inputs. The recurrent cell instance of the next consecutive time step accepts an input window of the same size as the previous cell, shifted forward by one. This method can also be regarded as an effective data augmentation mechanism (Smyl and Kuber, 2016). Figure 11 illustrates this scheme. Let the length of the whole sequence be l, the size of the input window m and the size of the output window n. As mentioned in Section 4.2.1, during the training phase, the last n-sized piece of this sequence is left out for validation. The rest of the sequence (of length l − n) is broken down into blocks of size m + n, forming the input-output combination for each recurrent cell instance. In total, there are thus l − 2n − m and l − n − m such blocks for the training and the validation stages, respectively.

Figure 11: Moving Window Scheme

Even though the validation stage does not involve explicit training of the model, the created blocks should still be fed in sequence to build up the state. The output window size is set to be equal to the size of the forecasting horizon H (n = H). The selection of the input window size is a careful task. The objective of using an input window as opposed to a single input is to relax the duty of the recurrent unit to remember the whole history of the sequence. Though theoretically the RNN unit is expected to memorize all the information from the whole sequence it has seen, it is not practically competent in doing so (Smyl and Kuber, 2016). The importance of the information from distant time steps tends to fade as the model continues to see new inputs. However, in terms of the trend, it is intelligible that only the last few time steps contribute the most to forecasting.
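The moving window scheme can be sketched as follows (an illustration with a one-step shift; depending on the exact boundary convention, the block counts differ by small constants from those given above):

```python
def moving_windows(series, m, n):
    """Break a sequence into (input, output) window pairs of sizes m and n,
    shifting forward by one step each time (the scheme of Figure 11)."""
    return [
        (series[i:i + m], series[i + m:i + m + n])
        for i in range(len(series) - m - n + 1)
    ]

pairs = moving_windows(list(range(10)), m=3, n=2)
```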
Putting these ideas together, we select the input window size m with two options. One is to make the input window size slightly bigger than the output window size (m = 1.25 ∗ output_window_size). The other is to make the input window size slightly bigger than the seasonality period (m = 1.25 ∗ seasonality_period). For instance, if the time series has weekly seasonality (seasonality_period = 7) with expected forecasting horizon 56, with the first option we set the input window size to 70 (1.25 ∗ 56), whereas with the second option it is set to 9 (1.25 ∗ 7, rounded up). The constant 1.25 is selected purely as a heuristic. The idea is to ascertain that every recurrent cell instance at each time step gets to see at least its last periodic cycle, so that it can model any remaining stochastic seasonality. In case the total length of the time series is too short, m is selected to be either of the two options that is feasible, or some other possible value if neither of them works. For instance, the forecasting horizon 6 subgroup of the CIF 2016 dataset comprises such short series, and m is selected to be 7 (slightly bigger than the output window size) even though the seasonality_period is equal to 12.

4.2.6. Trend Normalization

The activation functions used in RNN cells, such as the sigmoid or the hyperbolic tangent function, have a saturation area after which the outputs are constant. Hence, when using RNN cells it should be ensured that the inputs fed are normalized adequately such that the outputs do not lie in the saturated range (Smyl and Kuber, 2016). To this end, we perform a per-window local normalization step for the RNN architectures which use the moving window scheme. From each deseasonalized input and corresponding output window pair, the trend value of the last time step of the input window (found by applying the STL Decomposition) is deducted.
This is carried out for all the input and output window pairs of the moving window strategy. The red coloured time steps in Figure 11 correspond to those last time steps of all the input windows. This technique is inspired by the batch normalization scheme, and it also helps address the trend in the time series (Ioffe and Szegedy, 2015). As for the models that do not use the moving window scheme, a per-sequence normalization is carried out, where the trend value of the last point of the whole training sequence is used for the normalization instead.

4.2.7. Mean Normalization

For the set of experiments that do not use the STL Decomposition, it is not possible to perform the trend normalization as described in Section 4.2.6. For those experiments, we perform a mean normalization of the time series before applying the Log transformation mentioned in Section 4.2.4. In this mean normalization, every time series is divided by the mean of that particular time series. The division, instead of subtraction, further helps to scale all the time series to a similar range, which aids the RNN learning process. The Log transformation is then carried out on the resulting data.

Specification      Resource 1                   Resource 2            Resource 3
CPU Model Name     Intel(R) Core(TM) i7-8700    Intel Xeon Gold 6150  Intel Xeon CPU E5-2680 v3
CPU Architecture   x86_64                       x86_64                x86_64
CPU Cores          12                           1                     4
Memory (GB)        64                           50                    5
GPU Model Name     GP102 [GeForce GTX 1080 Ti]  nVidia Tesla V100     -
No. of GPUs        2                            2                     -

Table 3: Hardware Specifications

4.3. Training & Validation Scheme

The methodology used for training the models is detailed in this section. Our work uses the idea of global models introduced in Section 2.4 and develops one model for all the available time series. However, it is important to understand that the implemented techniques work well only in the case of related (similar/homogeneous) time series.
If the individual time series are from heterogeneous domains, modelling them using a single model may not render the best results.

Due to the complexity and the total number of models associated with the study, we run the experiments on three computing resources in parallel. The details of these machines are given in Table 3. Due to the scale of the M4 monthly dataset, the GPU machines are used to run the experiments on it. Resource 2 and Resource 3 indicate allocated resources from the MASSIVE cluster (eResearch Centre, 2019). Resource 2 is used for GPU-only operations, whereas Resource 3 is used for CPU-only operations. Resource 1 is used for both CPU and GPU operations.

4.3.1. Hyperparameter Tuning

The experiments require the proper tuning of a number of hyperparameters associated with the model training. They are as follows:

1. Minibatch Size
2. Number of Epochs
3. Epoch Size
4. Learning Rate
5. Standard Deviation of the Gaussian Noise
6. L2 Regularization Parameter
7. RNN Cell Dimension
8. Number of Hidden Layers
9. Standard Deviation of the Random Normal Initializer

The minibatch size denotes the number of time series considered for each full backpropagation in the RNN. This is a more limited version than using all the available time series at once to perform one backpropagation, which poses a significant memory requirement. On the other hand, it could also be regarded as a more generalized version of the extreme case which uses only a single time series per full backpropagation, also known as stochastic gradient descent. Apart from the explicit regularization schemes used in the RNN models, further discussed in Section 4.3.2, minibatch gradient descent also introduces some implicit regularization to deep learning models in specific problem scenarios such as classification (Soudry et al., 2018).
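Grouping the series into minibatches can be sketched as a simple generator (illustrative; the function name is ours):

```python
def minibatches(series_list, batch_size):
    """Yield groups of time series; one full backpropagation per group.
    batch_size = len(series_list) corresponds to full-batch training and
    batch_size = 1 to stochastic gradient descent."""
    for start in range(0, len(series_list), batch_size):
        yield series_list[start:start + batch_size]

batches = list(minibatches([f"series_{i}" for i in range(10)], batch_size=4))
```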
An epoch denotes one full forward and backward pass through the whole dataset. Therefore, the number of epochs denotes how many such passes across the dataset are required for the optimal training of the RNN. Even within each epoch, the dataset is traversed a number of times, denoted by the epoch size. This is especially useful for datasets with a limited number of time series, as it increases the number of data points available for training; NNs generally train best when more data are available. Out of the three optimizers used in the experiments, the Adam optimizer and the Adagrad optimizer both require appropriate tuning of the learning rate to converge quickly to the optimal state of the network parameters. To reduce the effect of model overfitting, two steps are taken: adding Gaussian noise to the input, and using L2 regularization in the loss function (explained further in Section 4.3.2). Both these techniques require tuning hyperparameters: the standard deviation of the Gaussian noise distribution and the L2 regularization parameter. Furthermore, the weights of the RNN units are initialized using random samples drawn from a normal distribution, whose standard deviation is tuned as another hyperparameter. There are two other hyperparameters that directly relate to the RNN network architecture: the number of hidden layers and the cell dimension of each RNN cell. For simplicity, both the Encoder and the Decoder components of the Sequence to Sequence network are composed of the same number of hidden layers. Also, the same cell dimension is used for the RNN cells in both the Encoder and the Decoder.
However, as explained before in Section 3.1.1, the different recurrent unit types that we employ in this study, namely the LSTM cell, the ERNN cell, and the GRU, have different numbers of trainable parameters for the same cell dimension, with the LSTM having the highest and the ERNN the lowest number of parameters. Therefore, to enable a fair comparison, in other research communities such as natural language processing, it is common practice to consider the total number of trainable parameters in the different models compared (Collins et al., 2016; Ji et al., 2016; Jagannatha and Yu, 2016). This is to ascertain that the capacities of all the compared models are the same, to clearly distinguish the performance gains achieved purely through the novelty introduced to the models. As our main aim is to compare software frameworks, we use the cell dimension, which is the natural hyperparameter to be tuned by practitioners, and additionally perform experiments using the number of trainable parameters instead. For this, we choose the best model combination identified from the preceding experiments involving the cell dimension as a hyperparameter, and run it again with all three recurrent unit types, by letting the hyperparameter tuning algorithm choose the optimal number of trainable parameters as a hyperparameter. Hence, for this experiment, instead of setting the cell dimension directly, we set the same initial range for the number of trainable parameters across all the compared models with the three RNN units. As the number of trainable parameters is directly proportional to the cell dimension, and the deep learning framework we use allows us to only set the cell dimension, we calculate the corresponding cell dimension from the number of parameters and set this parameter accordingly.

For tuning hyperparameters, there are many different techniques available.
The most naïve and fundamental approach is hand tuning, which requires intensive manual experimentation to find the best possible hyperparameters. However, our approach is targeted towards a fully automated framework. To this end, Grid Search and Random Search are two of the most basic methods for automated hyperparameter tuning (Bergstra and Bengio, 2012). Grid Search, as its name suggests, explores the space of a grid of different hyperparameter value combinations. For example, if there are two hyperparameters Γ and Θ, a set of values is specified for each hyperparameter, as in (γ_1, γ_2, γ_3, ...) for Γ and (θ_1, θ_2, θ_3, ...) for Θ. The Grid Search algorithm then goes through each pair of hyperparameter values (γ, θ) that lies in the grid of Γ and Θ and uses them in the model to calculate the error on a held-out validation set. The hyperparameter combination which gives the minimum error is chosen as the optimal one. This is an exhaustive search through the grid of possible hyperparameter values. In the Random Search algorithm, instead of exact values, distributions for the hyperparameters are provided, from which the algorithm randomly samples values for each model evaluation. A maximum number of iterations is provided, until which the model evaluations are performed on the validation set, and the optimal combination found thus far is selected. There are other, more sophisticated hyperparameter tuning techniques as well.

4.3.1.1 Bayesian Optimization

The methods discussed above evaluate an objective function (the validation error) pointwise. This means retraining machine learning models from scratch, which is quite compute intensive. The idea of Bayesian Optimization is to limit the number of such expensive objective function evaluations.
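Before turning to the model-based methods, the two basic searches described above can be sketched as follows (illustrative only; the function names and the toy objective are ours):

```python
import itertools
import random

def grid_search(evaluate, grid):
    """Exhaustively evaluate every combination in a dict of value lists and
    return the combination with minimum validation error."""
    names = list(grid)
    best, best_err = None, float("inf")
    for combo in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, combo))
        err = evaluate(params)
        if err < best_err:
            best, best_err = params, err
    return best

def random_search(evaluate, sample, iterations=50, seed=0):
    """Sample configurations from user-supplied distributions for a fixed
    number of iterations and keep the best one found."""
    rng = random.Random(seed)
    best, best_err = None, float("inf")
    for _ in range(iterations):
        params = sample(rng)
        err = evaluate(params)
        if err < best_err:
            best, best_err = params, err
    return best

# Toy validation error over two hyperparameters gamma and theta.
def error(p):
    return (p["gamma"] - 2) ** 2 + (p["theta"] - 1) ** 2

best_grid = grid_search(error, {"gamma": [1, 2, 3], "theta": [0, 1]})
```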
The technique models the objective function using a Gaussian prior over all possible functions (Snoek et al., 2012). This probabilistic model embeds all prior assumptions regarding the objective function to be optimized. During every iteration, another instance of the hyperparameter space is selected to evaluate the objective function value. The decision of which point to select next depends on the minimization/maximization of a separate acquisition function, which is much cheaper to evaluate than the objective function itself. The acquisition function can take many forms, out of which the implementation of Snoek et al. (2012) uses Expected Improvement. The optimization of the acquisition function considers all the previous objective function evaluations to decide the next evaluation point. Therefore, the Bayesian Optimization process takes smarter decisions on function evaluation compared to the aforementioned Grid Search and Random Search. Given an initial range of values for every hyperparameter and a defined number of iterations, the Bayesian Optimization method initializes using a set of prior known parameter configurations and thus attempts to find the optimal values for those hyperparameters for the given problem. There are many practical implementations and variants of the basic Bayesian Optimization technique, such as hyperopt, spearmint and bayesian-optimization (Bergstra, 2012; Snoek, 2012; Fernando, 2012).

4.3.1.2 Sequential Model-based Algorithm Configuration (SMAC)

SMAC for hyperparameter tuning is a variant of Bayesian Optimization proposed by Hutter et al. (2011). This implementation is based on the Sequential Model-Based Optimization (SMBO) concept. SMBO is a technique founded on the ideas of Bayesian Optimization, but uses a Tree-structured Parzen Estimator (TPE) for modelling the objective function instead of the Gaussian prior.
The TPE algorithm produces a set of candidate optimal values for the hyperparameters in each iteration, as opposed to just one candidate in the usual Bayesian Optimization. The tree structure lets the user define conditional hyperparameters which depend on one another. The process is carried out until a given time bound or number of iterations is reached. The work by Hutter et al. (2011) further enhances the basic SMBO algorithm by adding the capability to work with categorical hyperparameters. The implementation of SMAC used for this study is available as a Python package (Lindauer et al., 2017). We use version 0.8.0 for our experiments. The initial hyperparameter ranges used for the NN models across the different datasets are listed in Table 4. Additionally, for the experiment which involves the number of trainable parameters instead of the cell dimension as a hyperparameter, we set an initial range of 2000 - 25000 across all the compared models on all the selected datasets. This range is selected based on the initial experiments involving the cell dimension.

4.3.2. Dealing with Model Overfitting

Overfitting in a machine learning model occurs when the performance on the validation dataset is much worse than the performance on the training dataset. This means that the model has fitted the training data so closely that it has lost its capability to generalize to other, unseen data. In NN models, this mostly happens due to the complexity ensuing from the large number of parameters. In order to make the models more versatile and able to capture unseen patterns, we use two techniques in our framework. One method is to add noise to the input to partially distort it. The noise distribution applied here is a Gaussian distribution with zero mean, with the standard deviation treated as another hyperparameter.
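The input-noise technique just described amounts to the following sketch; the function name and the detail of drawing fresh noise per call are illustrative assumptions, but the distribution (zero-mean Gaussian with a tunable standard deviation) is as stated above.

```python
import numpy as np

def add_input_noise(window, noise_std, rng):
    """Distort a training input window with zero-mean Gaussian noise.
    noise_std is itself treated as a tunable hyperparameter."""
    return window + rng.normal(loc=0.0, scale=noise_std, size=window.shape)
```

Because the noise is resampled on every call, the network never sees exactly the same input twice, which discourages memorization of the training windows.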
The other technique is L2 weight regularization, which uses a Ridge penalty in the loss function. Weight regularization keeps the weights of the network from growing excessively by incorporating them in the loss function, thus penalizing the network for increasing its complexity. Given the L2 regularization parameter ψ, the loss function of the NN can be defined as below.

L = E + ψ ∑_{i=1}^{p} w_i^2    (13)

Applied to our context, E in Equation 13 denotes the Mean Absolute Error of the network outputs, w_i represents the trainable parameters of the network, and p is the number of all such parameters. Hence, as seen in Equation 13, L2 regularization adds the squared magnitudes of the weights, multiplied by the regularization parameter ψ. The selection of ψ is paramount, since too large a value results in underfitting, while very small values let the model overfit. In our study, ψ is also optimized as a hyperparameter.

| Dataset | Batch Size | Epochs | Epoch Size | Std. Noise | L2 Reg. | Cell Dim. | Layers | Std. Initializer | Learning Rate (Adam) | Learning Rate (Adagrad) |
|---|---|---|---|---|---|---|---|---|---|---|
| CIF (12) | 10 - 30 | 3 - 25 | 5 - 20 | 0.01 - 0.08 | 0.0001 - 0.0008 | 20 - 50 | 1 - 2 | 0.0001 - 0.0008 | 0.001 - 0.1 | 0.01 - 0.9 |
| CIF (6) | 2 - 5 | 3 - 30 | 5 - 15 | 0.0001 - 0.0008 | 0.0001 - 0.0008 | 20 - 50 | 1 - 5 | 0.0001 - 0.0008 | 0.001 - 0.1 | 0.01 - 0.9 |
| NN5 | 5 - 15 | 3 - 25 | 2 - 10 | 0.0001 - 0.0008 | 0.0001 - 0.0008 | 20 - 25 | 1 - 2 | 0.0001 - 0.0008 | 0.001 - 0.1 | 0.01 - 0.9 |
| M3(Mic) | 40 - 100 | 3 - 30 | 2 - 10 | 0.0001 - 0.0008 | 0.0001 - 0.0008 | 20 - 50 | 1 - 2 | 0.0001 - 0.0008 | 0.001 - 0.1 | 0.01 - 0.9 |
| M3(Mac) | 30 - 70 | 3 - 30 | 2 - 10 | 0.0001 - 0.0008 | 0.0001 - 0.0008 | 20 - 50 | 1 - 2 | 0.0001 - 0.0008 | 0.001 - 0.1 | 0.01 - 0.9 |
| M3(Ind) | 30 - 70 | 3 - 30 | 2 - 10 | 0.0001 - 0.0008 | 0.0001 - 0.0008 | 20 - 50 | 1 - 2 | 0.0001 - 0.0008 | 0.001 - 0.1 | 0.01 - 0.9 |
| M3(Dem) | 20 - 60 | 3 - 30 | 2 - 10 | 0.0001 - 0.0008 | 0.0001 - 0.0008 | 20 - 50 | 1 - 2 | 0.0001 - 0.0008 | 0.001 - 0.1 | 0.01 - 0.9 |
| M3(Fin) | 20 - 60 | 3 - 30 | 2 - 10 | 0.0001 - 0.0008 | 0.0001 - 0.0008 | 20 - 50 | 1 - 2 | 0.0001 - 0.0008 | 0.001 - 0.1 | 0.01 - 0.9 |
| M3(Oth) | 10 - 30 | 3 - 30 | 5 - 20 | 0.0001 - 0.0008 | 0.0001 - 0.0008 | 20 - 50 | 1 - 2 | 0.0001 - 0.0008 | 0.001 - 0.1 | 0.01 - 0.9 |
| M4(Mic) | 1000 - 1500 | 3 - 25 | 2 - 10 | 0.0001 - 0.0008 | 0.0001 - 0.0008 | 20 - 50 | 1 - 2 | 0.0001 - 0.0008 | 0.001 - 0.1 | 0.01 - 0.9 |
| M4(Mac) | 1000 - 1500 | 3 - 25 | 2 - 10 | 0.0001 - 0.0008 | 0.0001 - 0.0008 | 20 - 50 | 1 - 2 | 0.0001 - 0.0008 | 0.001 - 0.1 | 0.01 - 0.9 |
| M4(Ind) | 1000 - 1500 | 3 - 25 | 2 - 10 | 0.0001 - 0.0008 | 0.0001 - 0.0008 | 20 - 50 | 1 - 2 | 0.0001 - 0.0008 | 0.001 - 0.1 | 0.01 - 0.9 |
| M4(Dem) | 850 - 1000 | 3 - 25 | 2 - 10 | 0.0001 - 0.0008 | 0.0001 - 0.0008 | 20 - 50 | 1 - 2 | 0.0001 - 0.0008 | 0.001 - 0.1 | 0.01 - 0.9 |
| M4(Fin) | 1000 - 1500 | 3 - 25 | 2 - 10 | 0.0001 - 0.0008 | 0.0001 - 0.0008 | 20 - 50 | 1 - 2 | 0.0001 - 0.0008 | 0.001 - 0.1 | 0.01 - 0.9 |
| M4(Oth) | 50 - 60 | 3 - 25 | 2 - 10 | 0.0001 - 0.0008 | 0.0001 - 0.0008 | 20 - 25 | 1 - 2 | 0.0001 - 0.0008 | 0.001 - 0.1 | 0.01 - 0.9 |
| Wikipedia | 200 - 700 | 3 - 25 | 2 - 10 | 0.0001 - 0.0008 | 0.0001 - 0.0008 | 20 - 25 | 1 - 2 | 0.0001 - 0.0008 | 0.001 - 0.1 | 0.01 - 0.9 |
| Tourism | 10 - 90 | 3 - 25 | 2 - 10 | 0.0001 - 0.0008 | 0.0001 - 0.0008 | 20 - 25 | 1 - 2 | 0.0001 - 0.0008 | 0.001 - 0.1 | 0.01 - 0.9 |

Table 4: Initial Hyperparameter Ranges

4.4. Model Testing

Once an optimal hyperparameter combination is found, those values are used to train the model and produce the final forecasts. During testing, the model is trained using all the available data, and the final forecasts are written to files. We address parameter uncertainty by training all the models with 10 different Tensorflow graph seeds, while using the same initially tuned hyperparameters. We do not re-run the hyperparameter tuning, as this would be computationally very expensive. The different seeds give 10 different initializations to the networks. Once the RNN forecasts are obtained from every model for the different seeds, they are ensembled by taking the median across the seeds.

4.4.1.
Data Post-Processing

To calculate the final error metrics, the effects of the prior preprocessing are properly reversed on the generated forecasts. For the experiments which use STL Decomposition, this post-processing is carried out as follows.

1. Reverse the local normalization by adding the trend value of the last input point.
2. Reverse the deseasonalization by adding back the seasonality components.
3. Reverse the log transformation by taking the exponential.
4. Subtract 1, if the data contain zeros.
5. For integer data, round the forecasts to the closest integer.
6. Clip all negative values at 0 (to allow only positive values in the forecasts).

For those experiments which do not use STL Decomposition, the post-processing of the forecasts is done as follows.

1. Reverse the log transformation by taking the exponential.
2. Subtract 1, if the data contain zeros.
3. Multiply the forecasts of every series by its corresponding mean, to reverse the effect of the mean scaling.
4. For integer data, round the forecasts to the closest integer.
5. Clip all negative values at 0 (to allow only positive values in the forecasts).

4.4.2. Performance Measures

We measure the performance of the models in terms of a number of metrics. The Symmetric Mean Absolute Percentage Error (SMAPE) is the most common performance measure used in many forecasting competitions.

SMAPE = (100% / H) ∑_{k=1}^{H} |F_k − Y_k| / ((|Y_k| + |F_k|) / 2)    (14)

H, F_k and Y_k indicate the size of the horizon, the forecast of the NN, and the actual value, respectively. As seen in Equation 14, the SMAPE is a metric based on percentage errors. However, according to Hyndman and Koehler (2006), the SMAPE measure is susceptible to instability for values close to 0. The forecasts of the Wikipedia Web Traffic, NN5 and Tourism datasets are heavily affected by this, since they all contain 0 values.
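The post-processing pipeline for the STL-decomposition experiments, enumerated above, can be sketched as follows; the function name and signature are illustrative, not our framework's actual API.

```python
import numpy as np

def postprocess_stl_forecasts(forecasts, last_trend, seasonality,
                              had_zeros, integer_data):
    """Reverse the preprocessing on forecasts from the STL-decomposition
    experiments, following the six steps listed above."""
    out = forecasts + last_trend      # 1. reverse the local normalization
    out = out + seasonality           # 2. add back the seasonal components
    out = np.exp(out)                 # 3. reverse the log transformation
    if had_zeros:
        out = out - 1.0               # 4. undo the +1 shift applied before logging
    if integer_data:
        out = np.round(out)           # 5. round to the closest integer
    return np.maximum(out, 0.0)       # 6. clip negative values at 0
```

The non-STL pipeline differs only in replacing steps 1-2 with a multiplication by the series mean, as listed above.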
Therefore, to overcome this, we use a variant of the SMAPE proposed by Suilin (2017). In this metric, the denominator of the above SMAPE is changed as follows.

max(|Y_k| + |F_k| + ε, 0.5 + ε)    (15)

where the parameter ε is set to 0.1, following Suilin (2017). The idea is that this metric avoids division by values close to zero by switching to an alternative positive constant for the denominator of the SMAPE when the forecasts are too small. Yet, the SMAPE error metric has several other pitfalls, such as its lack of interpretability and high skewness (Hyndman and Koehler, 2006). To address these issues, Hyndman and Koehler (2006) propose another metric known as the Mean Absolute Scaled Error (MASE), defined as follows.

MASE = ((1/H) ∑_{k=1}^{H} |F_k − Y_k|) / ((1/(T − M)) ∑_{k=M+1}^{T} |Y_k − Y_{k−M}|)    (16)

MASE is a scale-independent measure, where the numerator is the same as in SMAPE, but normalized by the average in-sample one-step naïve forecast error, or the seasonal naïve forecast error in the case of seasonal data. A MASE value greater than 1 indicates that the performance of the tested model is on average worse than the naïve benchmark, and a value less than 1 denotes the opposite. Therefore, this error metric provides a direct indication of the performance of the model relative to the naïve benchmark. The model evaluation of this study is presented using six metrics: Mean SMAPE, Median SMAPE, Mean MASE, Median MASE, Rank SMAPE and Rank MASE of the time series of each dataset. The rank measures compare the performance of every model against each other with respect to every time series in the dataset. Apart from these, when considering all the datasets together, the ranks of the models within each dataset with respect to both the mean SMAPE and the mean MASE metrics are plotted, since the original SMAPE and MASE values lie in very different ranges for the different datasets.
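The stabilized SMAPE variant and MASE described above can be computed as in the following sketch. Note one assumption: the text states only that the denominator of Equation 14 is replaced by Equation 15, so the sketch applies that replacement literally.

```python
import numpy as np

def smape_suilin(actuals, forecasts, eps=0.1):
    """SMAPE with the stabilized denominator max(|Y|+|F|+eps, 0.5+eps)
    of Equation 15 substituted for (|Y|+|F|)/2 of Equation 14."""
    denom = np.maximum(np.abs(actuals) + np.abs(forecasts) + eps, 0.5 + eps)
    return 100.0 * np.mean(np.abs(forecasts - actuals) / denom)

def mase(actuals, forecasts, insample, m=1):
    """Forecast MAE scaled by the in-sample (seasonal) naive MAE;
    m is the seasonal period M of Equation 16 (m=1 for non-seasonal data)."""
    naive_mae = np.mean(np.abs(insample[m:] - insample[:-m]))
    return np.mean(np.abs(forecasts - actuals)) / naive_mae
```

In particular, an all-zero actual/forecast pair yields a finite SMAPE of 0 here, whereas the plain Equation 14 would divide zero by zero.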
These metrics are referred to as Mean SMAPE Ranks and Mean MASE Ranks, respectively. The literature introduces several other performance measures for forecasting, such as the Geometric Mean Relative Absolute Error (GMRAE) (Hyndman and Koehler, 2006). The GMRAE too is a relative error measure similar to MASE, which measures the performance with respect to some benchmark method, but uses the geometric mean instead of the arithmetic mean. However, as Chen et al. (2017) state in their work, due to the use of the geometric mean, values of the GMRAE can become very small and close to zero, especially in situations like ours, where the expected forecasting horizon is very long, such as in the NN5 and the Wikipedia Web Traffic datasets, and the number of time series is large.

4.5. Benchmarks

As benchmarks for comparison, we select two strong, well-established traditional univariate techniques, ets and auto.arima, with their default parameters from the forecast package (Hyndman and Khandakar, 2008) in R. We perform experiments with the two techniques on all the aforementioned datasets and measure the performance in terms of the same metrics mentioned in Section 4.4.2. Furthermore, we perform experiments with a pooled regression, similar to the work of Trapero et al. (2015). As mentioned in Section 2.4, the pooled regression model differs from the RNN in that it does not maintain a state per series, and it models a linear relationship between the lags and the target variable. However, similar to the RNNs, the pooled regression models also calculate their weights globally by considering cross-series information, i.e., a pooled regression model works as a global AR model. Therefore, we use them in this study as a benchmark against the RNNs, to differentiate between the performance gains in RNNs purely due to building a global model versus those due to the choice of the RNN architecture.
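The pooled global AR model just described can be sketched as below. This is a simplified illustration in Python (the paper uses glmnet in R): one ridge-penalized linear model is fit on the lag windows of all mean-scaled series jointly, and multi-step forecasts are produced recursively.

```python
import numpy as np

def make_lagged(series, n_lags):
    """Embed a series into (lag-window, next-value) training pairs."""
    X = np.array([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    return X, series[n_lags:]

def fit_pooled_ar(series_list, n_lags, l2=0.0):
    """Fit ONE ridge-penalized linear AR model across all series,
    after normalizing each series by its mean (a global AR model)."""
    rows, targets = [], []
    for s in series_list:
        X, y = make_lagged(s / s.mean(), n_lags)
        rows.append(X)
        targets.append(y)
    X = np.vstack(rows)
    X = np.hstack([X, np.ones((len(X), 1))])  # intercept column
    y = np.concatenate(targets)
    # closed-form ridge solution: (X'X + l2*I)^(-1) X'y
    return np.linalg.solve(X.T @ X + l2 * np.eye(X.shape[1]), X.T @ y)

def forecast_recursive(series, w, n_lags, horizon):
    """Recursive multi-step forecasting: each one-step forecast is fed
    back in as the most recent lag for the next step."""
    mean = series.mean()
    window = list(series / mean)
    forecasts = []
    for _ in range(horizon):
        x = np.array(window[-n_lags:] + [1.0])  # lags + intercept term
        f = float(x @ w)
        forecasts.append(f * mean)              # undo the mean scaling
        window.append(f)
    return np.array(forecasts)
```

An unpooled version would simply call `fit_pooled_ar` with a single-series list per series, which is the distinction drawn in the text.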
We also train unpooled versions of the regression models on all the datasets, meaning that the models are built per series, similar to ets and arima. The unpooled regression models are used in this study to observe the accuracy gains from training regression models as global models rather than as the usual univariate models. For implementing the regression models, we use the linear regression implementation from the glmnet package in the R programming language (Friedman et al., 2010). Before feeding the series into the pooled regression models, we normalize them by dividing every series by its mean value. For the number of lags, we experiment with two options. The auto.arima model from the forecast package selects by default up to a maximum of 5 lags for the AR and MA components combined. An ARMA model can be approximated by a pure AR model with a higher number of lags (Rob J Hyndman, 2018). Since the pooled regression model is a pure AR model, to be approximately compatible with the complexity of the auto.arima model, we use 10 lags (as opposed to 5 lags) as the first option for the number of lags in the pooled regression models. On the other hand, the moving window scheme mentioned in Section 4.2.5 already imposes a number of lags on the input window of the relevant RNN architectures. Therefore, we also build versions of the regression models with this number of lagged values. Furthermore, as mentioned in Section 4.3.2, the RNN models are regularized using a Ridge penalty. Hence, to enable a fair comparison, we develop the regression models both with and without L2 regularization. To find the L2 regularization parameter, we use two methods. The first is the built-in grid search technique of the glmnet package, using a 10-fold cross-validation with the mean squared error as the validation error metric.
However, to be more comparable with the RNNs, we also use Bayesian Optimization hyperparameter tuning, with the SMAPE as the validation error metric, to find the L2 regularization parameter of the regression models. For this, we use a simple 70%/30% split into training and validation sets, respectively. For the pooled regression models, the range of the L2 regularization parameter is set to the interval 0 - 1, with 50 iterations of the tuning process. For the unpooled versions of the regression models, this initial range is set to the interval 0 - 200, since there is less data available per series, and the regularization parameter may need a larger value to avoid overfitting. For the Bayesian Optimization, we use the rBayesianOptimization package in the R programming language (Yan, 2016). During testing, a recursive strategy is carried out to produce forecasts up to the forecasting horizon, where one point is forecast at every step, taking the last forecast as the most recent lag.

4.6. Statistical Tests of the Results

To evaluate the statistical significance of the differences in performance of multiple techniques, we perform a non-parametric Friedman rank-sum test. This test determines whether the performance differences are statistically significant. To further explore these differences with respect to a control technique, specifically the best performing technique, we then perform Hochberg's post-hoc procedure (García et al., 2010). The significance level used is α = 0.05. For those tests that have only two techniques to compare, we use a non-parametric paired Wilcoxon signed-rank test with Bonferroni correction, to measure the statistical significance of the differences. For this, we apply the wilcox.test function from the stats package of the R core libraries (R Core Team, 2014).
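The testing protocol just outlined can be sketched as below, using SciPy rather than the R functions the study actually employs. As a simplifying assumption, Hochberg's step-up post-hoc procedure is replaced here by a plain Bonferroni adjustment of the per-comparison significance level.

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

def compare_methods(errors_by_method, alpha=0.05):
    """Friedman rank-sum test across all methods, then paired Wilcoxon
    signed-rank tests of every method against the best-performing one,
    judged at a Bonferroni-adjusted significance level."""
    names = list(errors_by_method)
    _, p_overall = friedmanchisquare(*errors_by_method.values())
    best = min(names, key=lambda n: errors_by_method[n].mean())
    adjusted_alpha = alpha / (len(names) - 1)  # Bonferroni: alpha / #comparisons
    results = {}
    for name in names:
        if name != best:
            _, p = wilcoxon(errors_by_method[best], errors_by_method[name])
            results[name] = (p, bool(p < adjusted_alpha))
    return p_overall, best, results
```

Here `errors_by_method` maps each method name to its per-series error array, with the series in the same order for every method, as the paired tests require.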
The Bonferroni procedure is used to adjust the significance level used for comparison at each step of a number of successive comparisons (García et al., 2010). In particular, the procedure divides the significance level by the total number of comparisons for use in every individual comparison. The mean and the median of the SMAPE measure are used where relevant for all the statistical testing.

5. Analysis of Results

This section provides a comprehensive analysis of the results obtained from the experiments. The datasets that we use for our experiments are quite different from each other in terms of the seasonality strength, the number of time series, and the length of the individual time series. Thus, it seems reasonable to assume that the selected datasets cover a sufficient number of different time series characteristics to arrive at generalized conclusions. The results of all the implemented models in terms of the mean SMAPE metric are shown in Table 5. Results in terms of all the other error metrics are available in an Online Appendix². In the tables presenting the results, we use the abbreviations 'NSTL', 'MW' and 'NMW' to denote 'without using STL Decomposition', 'moving window' and 'non moving window', respectively. Furthermore, 'IW+' indicates an increased input window size.

| Model Name | CIF | Kaggle | M3 | NN5 | Tourism | M4 |
|---|---|---|---|---|---|---|
| auto.arima | 11.7 | 47.96 | 14.25 | 25.91 | 19.74 | 13.08 |
| ets | 11.88 | 53.5 | 14.14 | 21.57 | 19.02 | 13.53 |
| Pooled Regression Lags 10 | *14.67 | *122.43† | *14.73† | *30.21† | *31.29† | *14.18† |
| Pooled Regression Lags 10 Bayesian Regularized | *14.67 | *122.43† | *14.85† | *30.21† | *31.29† | *14.18† |
| Pooled Regression Lags 10 Regularized | *14.52 | *123.94† | *14.75† | *30.21† | *31.29† | *14.19† |
| Pooled Regression Lags Window | *12.89 | **47.47 | *14.36† | *22.41 | *21.1† | *13.75† |
| Pooled Regression Lags Window Bayesian Regularized | *12.89 | **47.55 | *14.42† | *22.41 | *21.1† | *13.75† |
| Pooled Regression Lags Window Regularized | *12.89 | **47.46 | *14.38† | *22.41 | *21.1† | *13.75† |
| Unpooled Regression Lags 10 | 15.35 | *75.39† | 14.81† | *30.06† | *27.39† | *13.38† |
| Unpooled Regression Lags 10 Bayesian Regularized | 12.85 | *84.36† | *15.99† | *30.21† | *30.41† | *14.2† |
| Unpooled Regression Lags 10 Regularized | *12.74 | *94.28† | 15.32† | *30.06† | *27.61† | *14† |
| Unpooled Regression Lags Window | 14.34 | *56.75† | 14.93 | *22.63 | *20.71† | *13.46† |
| Unpooled Regression Lags Window Bayesian Regularized | 12.46 | *60.66† | *15.72† | *23.15 | *21.58† | *14.15† |
| Unpooled Regression Lags Window Regularized | 11.61 | 54.61† | 14.72 | *22.67 | *21.13† | *13.65† |
| S2S GRU adagrad | 11.57 | 51.77† | *15.91† | 28.33 | *20.57† | *13.53† |
| S2S GRU adam | 11.66 | 49.9† | 14.53† | *28.61 | *20.62† | *14.52† |
| S2S GRU cocob | 10.79 | **46.89 | *14.74† | 28.36 | *20.35† | *13.61† |
| S2S LSTM adagrad | 11.23 | 51.77† | 14.29† | 28.34 | *20.9† | *14.13† |
| S2S LSTM adam | 10.71 | **49.34‡ | 14.77† | *28.04 | *20.5† | *13.99† |
| S2S LSTM cocob | 11.04 | 52.46† | 14.6 | 24.77 | *20.51† | *14.02† |
| S2S ERNN adagrad | 10.47 | 51.45† | *14.7† | *28.99 | *20.44† | *14.24† |
| S2S ERNN adam | 11.19 | 50.7† | *14.66† | *28.32 | *20.63† | *13.86† |
| S2S ERNN cocob | 10.48 | **48.95† | *14.56† | *26.35 | *20.77† | *13.8† |
| S2SD GRU MW adagrad | 11.6 | 51.77† | 14.11 | 28.73 | 19.7 | *13.38† |
| S2SD GRU MW adam | 10.66 | 51.76† | 14.78† | *25.17 | 19.5 | *13.79† |
| S2SD GRU MW cocob | 10.3 | 48.8† | 14.38† | *27.92 | 19.67 | *13.57† |
| S2SD GRU NMW adagrad | 10.65 | *52.29† | *14.27† | 28.73 | *20.99† | *13.51† |
| S2SD GRU NMW adam | 10.41 | **46.9 | 14.35† | *25.42 | *20.5† | *13.92† |
| S2SD GRU NMW cocob | 10.48 | **47.33 | 14.35† | 24.8 | *20.8† | *13.75† |
| S2SD LSTM MW adagrad | 11.58 | 51.76† | 14.19 | 28.73 | *20.36† | *13.39† |
| S2SD LSTM MW adam | 10.28 | 51.09† | 14.68 | *26 | *20.18 | *13.76† |
| S2SD LSTM MW cocob | 11.23 | 51.77† | 14.45 | *24.29 | 19.76 | *13.59† |
| S2SD LSTM NMW adagrad | 10.81 | *52.28† | *14.39† | 28.73 | *20.22† | *13.5† |
| S2SD LSTM NMW adam | 10.34 | **47.37 | 14.43† | 25.36 | *20.33† | *13.62† |
| S2SD LSTM NMW cocob | 10.23 | 49.74† | 14.31† | *27.82 | *20.29† | *13.66† |
| S2SD ERNN MW adagrad | 11.34 | 51.58† | 14.2 | 28.58 | 19.58 | *13.42† |
| S2SD ERNN MW adam | 10.73 | 49.12† | 14.75† | *27.49 | 19.69 | *13.83† |
| S2SD ERNN MW cocob | 10.39 | **47.96† | 14.73† | *25.58 | 19.44 | *14.01† |
| S2SD ERNN NMW adagrad | 10.31 | 49.72† | *14.86† | 28.65 | *21.27† | *13.52† |
| S2SD ERNN NMW adam | 10.3 | **47.76† | *14.79† | *29.1 | 20.04 | *14.32† |
| S2SD ERNN NMW cocob | 10.28 | **47.94† | *14.94† | *26.37 | 19.68 | *13.57† |
| Stacked GRU adagrad | 10.54 | **47.23 | 14.64 | 21.84‡ | *21.56† | *14.15† |
| Stacked GRU adam | 10.55 | **47.11 | 14.76 | 21.93‡ | *21.92† | *14.29† |
| Stacked GRU cocob | 10.54 | **46.73 | 14.67 | 22.18‡ | *21.66† | *14.57† |
| Stacked LSTM adagrad | 10.51 | **47.12 | 14.34 | 22.12‡ | *20.61† | *13.97† |
| Stacked LSTM adam | 10.51 | **47.13 | 14.39 | 21.53‡ | *22.28† | *14.23† |
| Stacked LSTM cocob | 10.4 | **47.14 | 14.44 | 21.61‡ | *21.12† | *14.13† |
| Stacked ERNN adagrad | 10.53 | **47.79 | 14.73 | 23.83‡ | *21.24 | *14.3† |
| Stacked ERNN adam | 10.51 | **46.7 | 14.71 | 23.53‡ | *22.22† | *14.51† |
| Stacked ERNN cocob | 10.57 | **47.38 | 14.93 | 23.59‡ | *21.72† | *14.24† |
| Stacked GRU adagrad(IW+) | - | **47.09 | - | 23.43 | - | - |
| Stacked GRU adam(IW+) | - | **46.35 | - | 24.96 | - | - |
| Stacked GRU cocob(IW+) | - | **46.46 | - | 21.92‡ | - | - |
| Stacked LSTM adagrad(IW+) | - | **46.41 | - | 22.97 | - | - |
| Stacked LSTM adam(IW+) | - | **46.52 | - | 21.59‡ | - | - |
| Stacked LSTM cocob(IW+) | - | **46.54 | - | 21.54‡ | - | - |
| Stacked ERNN adagrad(IW+) | - | **47.08 | - | 22.96 | - | - |
| Stacked ERNN adam(IW+) | - | **46.76 | - | 22.5‡ | - | - |
| Stacked ERNN cocob(IW+) | - | **47.59 | - | 23.13 | - | - |
| NSTL S2S GRU adagrad | *20.02† | *68.6† | *18.32† | *30.98† | *199.97† | - |
| NSTL S2S GRU adam | *16.98† | 50.48† | *18.52† | *28.09† | *48.2† | - |
| NSTL S2S GRU cocob | *17.96† | 50.44† | *17.56† | *26.01 | *95.19† | - |
| NSTL S2S LSTM adagrad | *17.54† | *71.25† | *17.75† | *29.52† | *41.41† | - |
| NSTL S2S LSTM adam | *16.11† | **47.8† | *17.37† | *36.33† | *32.69† | - |
| NSTL S2S LSTM cocob | *17.8† | *68.36† | *18.42† | *30.29† | *48.44† | - |
| NSTL S2S ERNN adagrad | *19.98† | 51.4† | *19.73† | *26.33† | *49.83† | - |
| NSTL S2S ERNN adam | *18.12† | **49.84† | *19.06† | *29.81† | *51.24† | - |
| NSTL S2S ERNN cocob | *17.97† | 51.69† | *18.85† | *31.77† | *43.68† | - |
| NSTL S2SD GRU MW adagrad | *16.12† | 51.08† | *20.87† | *24.45 | *199.97† | *19.47† |
| NSTL S2SD GRU MW adam | *13.74 | 51.2† | *18.09† | *24.08 | *36.02† | - |
| NSTL S2SD GRU MW cocob | *17.77† | 51.71† | *19.48† | *24.31 | *37.91† | - |
| NSTL S2SD GRU NMW adagrad | *14.42 | **46.57 | *15.8† | *25.43 | *199.97† | - |
| NSTL S2SD GRU NMW adam | *15.71† | **46.08 | *16.17† | *23.85 | *38.63† | - |
| NSTL S2SD GRU NMW cocob | *15.05† | **47.69† | *16.14† | *24.85 | *31.57† | - |
| NSTL S2SD LSTM MW adagrad | *27.06† | 51.24† | *21.29† | *40.96† | *40.27† | - |
| NSTL S2SD LSTM MW adam | *15.23 | 51.11† | *19.33† | *25.7 | *38.52† | - |
| NSTL S2SD LSTM MW cocob | *18.61† | 49.83† | *23.05† | *24.14 | *24.34† | - |
| NSTL S2SD LSTM NMW adagrad | *14.81† | 49.77† | *15.9† | *25.03 | *25.47† | - |
| NSTL S2SD LSTM NMW adam | *14.84 | **46.07 | *17.05† | *24.55 | *23.91† | - |
| NSTL S2SD LSTM NMW cocob | *22.74† | **47.72† | *16.35† | *25.31 | *31.52† | - |
| NSTL S2SD ERNN MW adagrad | *19.19† | 50.71† | *19.22† | *27.35† | *22.29† | - |
| NSTL S2SD ERNN MW adam | *13.92 | 49.69† | *18.55† | *24.52 | *30.79† | - |
| NSTL S2SD ERNN MW cocob | *14.88† | 50.82† | *16.42† | *24.87 | *26.16† | - |
| NSTL S2SD ERNN NMW adagrad | *15.78† | **47.41 | *17.22† | *27.47† | *38.19† | - |
| NSTL S2SD ERNN NMW adam | *15.02 | **46.37 | *16.34† | *22.81 | *28.55† | - |
| NSTL S2SD ERNN NMW cocob | *14.24 | 48.27† | *15.26† | *23.22 | *25.82† | - |
| NSTL Stacked GRU adagrad | *16.77† | **45.68‡ | *16.39† | *24.39 | *199.97† | - |
| NSTL Stacked GRU adam | *16.34† | **45.62‡ | *16.76† | *23.47 | *38.66† | - |
| NSTL Stacked GRU cocob | *17.64† | **46.35 | *16.23† | *23.55 | *41.52† | - |
| NSTL Stacked LSTM adagrad | *17.08† | **45.88‡ | *16.69† | *25.1 | *23.46† | - |
| NSTL Stacked LSTM adam | *15.83† | **46.12 | *17.54† | *24.44 | *21.73† | - |
| NSTL Stacked LSTM cocob | *15.27† | **45.9 | *17.97† | *26.28† | *25.24† | *17.71† |
| NSTL Stacked ERNN adagrad | *13.25 | **46 | *16.78† | *24.22 | *25.36† | - |
| NSTL Stacked ERNN adam | *16.05† | **45.77‡ | *16.43† | *23.04 | *20.67 | - |
| NSTL Stacked ERNN cocob | *15.74† | **46.02 | *17.32† | *23.55 | *28.71† | - |
| NSTL Stacked GRU adagrad(IW+) | - | **45.9 | - | *23.53 | - | - |
| NSTL Stacked GRU adam(IW+) | - | **45.69‡ | - | 22.2 | - | - |
| NSTL Stacked GRU cocob(IW+) | - | **46.01 | - | *22.78 | - | - |
| NSTL Stacked LSTM adagrad(IW+) | - | **45.68‡ | - | *23.15 | - | - |
| NSTL Stacked LSTM adam(IW+) | - | **46.32‡ | - | 22.45 | - | - |
| NSTL Stacked LSTM cocob(IW+) | - | **46.12 | - | 22.66 | - | - |
| NSTL Stacked ERNN adagrad(IW+) | - | **45.69‡ | - | *22.79 | - | - |
| NSTL Stacked ERNN adam(IW+) | - | **45.92‡ | - | 22.13 | - | - |
| NSTL Stacked ERNN cocob(IW+) | - | **45.79‡ | - | *22.8 | - | - |

Table 5: Mean SMAPE Results

² https://drive.google.com/file/d/16rdzLTFwkKs-_eG_MU0i_rDnBQBCjsGe/view?usp=sharing

In Table 5, the best model on every dataset is indicated in boldface. Furthermore, as mentioned before in Section 4.6, since auto.arima and ets are the two established benchmarks chosen, we perform paired Wilcoxon signed-rank tests with Bonferroni correction for every model against the two benchmarks. In Table 5, † and * denote those models which are significantly worse than auto.arima and ets, respectively. Similarly, ‡ and ** denote the models which are significantly better than auto.arima and ets, respectively. We see that on the CIF2016, M3, Tourism and M4 datasets, no model is significantly better than the two benchmarks.
On the NN5 dataset, the RNN models have only managed to perform significantly better than auto.arima, but not ets. Moreover, across all the datasets it can be seen that the RNN models which perform significantly better than the benchmarks are mostly variants of the Stacked architecture. Also, it is only on the Wikipedia Web Traffic dataset that the pooled versions of the regression models perform significantly better than the benchmark auto.arima. We observe that, in general, increasing the number of lags from 10 to the size of the input window of the RNNs improves the accuracy of the regression models across all the datasets. On the Wikipedia Web Traffic, M3 and NN5 datasets, the inclusion of cross-series information alone has an effect, since the pooled regression models with the increased number of lags have performed better than the unpooled versions of the regression. Furthermore, on the two datasets Wikipedia Web Traffic and NN5, the pooled regression models with the increased number of lags have outperformed the auto.arima model, which acts upon every series independently. Although the pooled models have performed worse than the unpooled models on the CIF, Tourism and M4 datasets, the RNN models have mostly outperformed both the unpooled and the pooled regression models on these datasets. This indicates that even though the series in those datasets are not sufficiently homogeneous to build global models, the use of the RNN architectures has a clear effect on the final accuracy. In fact, on the CIF dataset, the effect of the choice of the RNN architecture has managed to outperform the two statistical benchmarks ets and auto.arima.
On the Wikipedia Web Traffic, M3 and NN5 datasets too, the RNNs have outperformed the statistical benchmarks as well as both versions of the regression models, implying that on these datasets the inclusion of cross-series information, together with the choice of the RNNs, has led to better accuracy. Another significant observation is that on many datasets, both the Bayesian Optimization and the hyperparameter tuning with 10-fold cross-validation for the L2 regularization parameter of the pooled regression models select values close to 0, so that the models effectively have nearly no L2 regularization. This is why on many datasets the mean SMAPE values are the same both with and without regularization. Therefore, for the pooled regression models we cannot conclude that L2 weight regularization improves the model performance. The model without any L2 regularization is already an appropriate fit for the data. For the unpooled regression models, we obtain mixed results. As discussed by Bergmeir et al. (2018), while being applicable to detect overfitting, a generic k-fold cross-validation may lead to an underestimation of the cross-validation errors for models that underfit. However, our pooled regression models do not show this behaviour, as they have very small L2 regularization parameters. As for the unpooled models, although for some series the L2 regularization parameters are high, this is the case for both the 10-fold cross-validation and the normal cross-validation.

5.1. Relative Performance of RNN Architectures

We compare the relative performance of the different RNN architectures against each other across all the datasets, in terms of all the error metrics. The violin plots in Figure 12 illustrate these results. We see that the best architecture differs based on the error metric. On mean SMAPE, the S2S architecture with the Dense Layer (S2SD) without the moving window performs best.
However, on all the other error metrics, the Stacked architecture performs best. With respect to the rank error plots, the Stacked architecture clearly produces the best results for most of the time series. Thus, the results on the mean SMAPE indicate that the Stacked architecture results in higher errors on certain outlying time series. The two versions of the S2SD architecture (with and without the moving window) perform comparably well, although not as well as the Stacked architecture in most cases. Therefore, between the two versions of the S2SD architecture, it is hard to derive a clear conclusion about the best model. The S2S architecture (with the decoder) performs worst overall.

Figure 12: Relative Performance of Different RNN Architectures

With respect to the mean SMAPE values, the Friedman test of statistical significance gives an overall p-value of 9.3568 × 10^-3, implying that the differences are statistically significant. Hochberg's post-hoc procedure is performed using as the control method the S2SD MW architecture, which performs best. The Stacked and the S2SD Non MW architectures do not perform significantly worse, with an adjusted p-value of 0.602. However, the S2S architecture performs significantly worse, with an adjusted p-value of 5.24 × 10^-3. In terms of the median SMAPE, the Friedman test of statistical significance gives an overall p-value of 0.349, implying that the differences between the models are not statistically significant.

5.2. Performance of Recurrent Units

The violin plots in Figure 13 demonstrate the performance comparison of the different RNN units, in terms of the mean SMAPE ranks and the mean MASE ranks. The Friedman test of statistical significance with respect to the mean SMAPE values produces an overall p-value of 0.101.
Though the p-value does not indicate statistical significance, from the plots we can derive that the LSTM cell with peephole connections performs best. The ERNN cell performs worst, and the GRU exhibits a performance in between these two.

Figure 13: Relative Performance of Different Recurrent Cell Types

5.3. Performance of Optimizers

The violin plots in Figure 14 illustrate the performance comparison of the different optimizers, in terms of both the mean SMAPE ranks and the mean MASE ranks. From the plots, we see that the Adagrad optimizer performs worst, complying with the findings in the literature stated in Section 3.2. Even though the Adam optimizer has been the best optimizer thus far, we can conclude from this study that the COCOB optimizer performs best out of the three. However, the Adam optimizer also shows quite competitive performance in between these two. The Friedman test of statistical significance with respect to the mean SMAPE values gives an overall p-value of 0.115. Although there is no strong statistical evidence in terms of significance, we conclude that the COCOB optimizer is preferable over the other two, since it does not require setting an initial learning rate. This supports fully automating the forecasting approach, since it eliminates the tuning of one more external hyperparameter.

5.4. Performance of the Output Components for the Sequence to Sequence Architecture

Figure 15 shows the relative performance of the two output components for the S2S architecture. The dense layer performs better than the decoder, owing to the error accumulation issue associated with the teacher forcing used in the decoder, as mentioned in Section 2.3.1.
With teacher forcing, the autoregressive connections in the decoder tend to carry forward the errors generated at each forecasting step, resulting in even more uncertainty of the forecasts along the prediction horizon.

Figure 14: Relative Performance of Different Optimizers

The output from the paired Wilcoxon signed-rank test with respect to the mean SMAPE values gives an overall p-value of 2.064 × 10⁻⁴, which indicates that the decoder performs significantly worse than the dense layer.

Figure 15: Comparison of the Output Component for the Sequence to Sequence with the Dense Layer Architecture

5.5. Comparison of Input Window Sizes for the Stacked Architecture

Figure 16 shows the comparison of the two input window size options for the Stacked architecture on the two daily datasets, NN5 and Wikipedia Web Traffic. Results both with and without STL Decomposition are plotted. Here, 'Large' denotes an input window size slightly larger than the expected prediction horizon, while 'Small' denotes an input window size slightly larger than the seasonality period, which is 7 (for the daily data). From the plots, we can state that large input window sizes help the Stacked architecture both when STL Decomposition is used and when it is not. However, with mean MASE ranks, small input window sizes also perform comparably. For the case when the seasonality is not removed, making the input window size large improves the accuracy by a huge margin. Thus, large input windows make it easier for the Stacked architecture to learn the underlying seasonal patterns in the time series. The output from the paired Wilcoxon signed-rank test with respect to the mean SMAPE values gives an overall p-value of 2.206 × 10⁻³, which indicates that the small input window size performs significantly worse than the large input window size when the cases with and without STL Decomposition are considered together.
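The paired Wilcoxon signed-rank tests reported in this section compare two configurations over the same set of series; a minimal sketch on hypothetical paired SMAPE values (names and numbers are illustrative only):

```python
from scipy.stats import wilcoxon

# Hypothetical mean SMAPE per series for the two input window sizes,
# paired by series (the same index refers to the same series in both lists).
smape_large = [0.10, 0.12, 0.09, 0.15, 0.11, 0.14, 0.13, 0.16]
smape_small = [0.13, 0.16, 0.14, 0.21, 0.18, 0.22, 0.22, 0.26]

# Two-sided paired test on the per-series differences.
stat, p_value = wilcoxon(smape_small, smape_large)
print(f"Wilcoxon statistic = {stat}, p-value = {p_value:.4f}")
```

A small p-value here would indicate that one configuration is consistently worse across the paired series, which is the form of evidence used for the window size and output component comparisons above.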
Figure 16: Comparison of Input Window Sizes for the Stacked Architecture

5.6. Analysis of Seasonality Modelling

The results from comparing the models with and without STL Decomposition are illustrated in Figures 17 and 18, for the mean SMAPE metric and the mean MASE metric respectively. Due to the scale of the M4 monthly dataset, we do not run all the models without removing seasonality on that dataset. Rather, only the best few models, selected by looking at the results from the first stage with removed seasonality, are run without removing seasonality. As a consequence, we do not plot those results here. The plots indicate that on the CIF, M3 and Tourism datasets, removing seasonality works better than modelling seasonality with the NN itself. On the M4 monthly dataset too, the same observation holds, looking at the results in Table 5.

Figure 17: Comparison of the Performance with and without STL Decomposition - Mean SMAPE

Figure 18: Comparison of the Performance with and without STL Decomposition - Mean MASE

However, on the Wikipedia Web Traffic dataset, except for a few outliers, modelling seasonality using the NN itself works almost as well as removing seasonality beforehand. This can be attributed to the fact that in the Wikipedia Web Traffic dataset, all the series have quite minimal seasonality according to the violin plots in Figure 9, so that there is no big difference in the data whether the seasonality is removed or not. On the NN5 dataset too, NNs seem to be able to reasonably model seasonality on their own, except for a few outliers. To further explain this result, we also plot the seasonal patterns of the different datasets, as shown in Figure 19. For the convenience of illustration, in every category of the M4 monthly dataset and the Wikipedia Web Traffic dataset, we plot only the first 400 series.
Furthermore, from every series of every dataset, only the first 50 time steps are plotted. More seasonal pattern plots for the different categories of the CIF, M3 monthly and M4 monthly datasets are available in the Online Appendix 3. In the NN5 dataset, almost all the series have the same seasonal pattern as in the 4th plot of Figure 19, with all of them having the same length as indicated in Table 2. All the series in the NN5 dataset start and end on the same dates (Crone, 2008). Also, according to Figure 9, the NN5 dataset has higher seasonality, with the seasonality strengths and patterns having very little variation among the individual series. In contrast, for the Tourism dataset, although it contains series with higher seasonality, it has a high variation of the individual seasonality strengths and patterns, as well as of the lengths of the series. The starting and ending dates differ among the series. Looking at the 5th plot of Figure 19, it is also evident that the series have very different seasonal patterns. For the rest of the datasets too, the lengths as well as the seasonal patterns of the individual series vary considerably. Therefore, from these observations we can conclude that NNs are capable of modelling seasonality on their own when all the series in the dataset have similar seasonal patterns and the lengths of the time series are equal, with the start and end dates coinciding.

Table 6 shows the rankings as well as the overall p-values obtained from the paired Wilcoxon signed-rank tests for comparing the cases with and without STL Decomposition on every dataset. The overall p-values are calculated for each dataset separately by comparing the with and without STL Decomposition cases across all the available models.

Dataset                  With STL Decomp.  Without STL Decomp.  Overall p-value
CIF                      1.0               2.0                  2.91 × 10⁻¹¹
M3                       1.0               2.0                  2.91 × 10⁻¹¹
Tourism                  1.06              1.94                 1.455 × 10⁻¹⁰
Wikipedia Web Traffic    1.73              1.27                 0.028
NN5                      1.56              1.44                 0.911

Table 6: Average rankings and results of the statistical testing for seasonality modelling with respect to mean SMAPE across all the datasets.

On the CIF, M3 and Tourism datasets, using STL Decomposition has the best ranking. On the Wikipedia Web Traffic and NN5 datasets, not using STL Decomposition has the best ranking. The overall p-values obtained from the paired Wilcoxon signed-rank test are shown for the different datasets in the Overall p-value column. On the CIF, M3 and Tourism datasets, eliminating STL Decomposition performs significantly worse than applying it. However, on the Wikipedia Web Traffic dataset, applying STL Decomposition performs significantly worse than not applying it. On the NN5 dataset, there is no significant difference between using and not using STL Decomposition.

5.7. Performance of RNN Models Vs. Traditional Univariate Benchmarks

The relative performance of the model types in terms of the mean SMAPE metric and the median SMAPE metric is shown in Figure 20 and Figure 21 respectively. More comparison plots in terms of the mean MASE and median MASE metrics are available in the Online Appendix 4.
3 https://drive.google.com/file/d/16rdzLTFwkKs-_eG_MU0i_rDnBQBCjsGe/view?usp=sharing
4 https://drive.google.com/file/d/16rdzLTFwkKs-_eG_MU0i_rDnBQBCjsGe/view?usp=sharing

Figure 19: Seasonal Patterns of All the Datasets

Model Name           CIF    Kaggle  M3     NN5
Stacked GRU cocob    10.69  45.77   14.67  22.46
Stacked LSTM cocob   10.54  45.93   14.44  24.04
Stacked ERNN cocob   10.66  129.23  14.99  24.67

Table 7: Mean SMAPE Results with the Number of Trainable Parameters as a Hyperparameter

We identify as a representative suggested model combination from the above sections the Stacked architecture with LSTM cells with peephole connections and the COCOB optimizer, and therefore indicate it in every plot as 'Stacked LSTM COCOB' (with small and large input window sizes). If the best RNN on a dataset differs from the Stacked LSTM COCOB, this best model is also shown in the plots. All the other RNNs are indicated as 'RNN'. The performance of the RNN architectures compared to the traditional univariate benchmarks depends on the performance metric used. An RNN architecture is able to outperform the benchmark techniques on all the datasets except the M4 monthly dataset, in terms of both the SMAPE and MASE error metrics. On the M4 monthly dataset, some RNNs outperform ETS, but ARIMA performs better than all of the RNNs. However, since we build the models per each category of the M4 monthly dataset, we further plot the performance in terms of the different categories in Figure 22. From these plots, we can see that RNNs outperform the traditional univariate benchmarks only in the Micro category. Further plots in terms of the other error metrics are shown in the Online Appendix 5. We also observe that mean error metrics are dominated by some outlier errors for certain time series. This is evident from the observation that on the Tourism dataset, RNNs outperform ETS and ARIMA with respect to the median SMAPE but not the mean SMAPE.
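The two error measures underlying these comparisons can be stated compactly; a sketch of SMAPE and MASE as commonly defined (following Hyndman and Koehler, 2006, for MASE), evaluated on made-up numbers:

```python
import numpy as np

def smape(actual, forecast):
    """Symmetric MAPE in percent: mean of 2|F - A| / (|A| + |F|)."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.mean(2.0 * np.abs(forecast - actual)
                   / (np.abs(actual) + np.abs(forecast))) * 100.0

def mase(actual, forecast, insample, m=1):
    """Mean Absolute Scaled Error: out-of-sample MAE scaled by the in-sample
    MAE of the seasonal naive forecast with seasonal period m."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    insample = np.asarray(insample, float)
    scale = np.mean(np.abs(insample[m:] - insample[:-m]))
    return np.mean(np.abs(actual - forecast)) / scale

print(smape([100.0], [110.0]))  # 2*10/210*100 ≈ 9.52
print(mase([6.0, 7.0], [6.5, 7.5], [1.0, 2.0, 3.0, 4.0, 5.0]))  # 0.5
```

Because SMAPE averages a bounded per-point ratio, a handful of series with near-zero values can still dominate the mean, which is consistent with the outlier sensitivity of the mean error metrics noted above.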
Furthermore, the 'Stacked LSTM COCOB' model combination in general performs competitively on most datasets. It outperforms the two univariate benchmarks on the CIF 2016, NN5 and Wikipedia Web Traffic datasets. Therefore, we can state that the Stacked model, combined with the LSTM with peephole cells and the COCOB optimizer, is a competitive model combination for forecasting.

5.8. Experiments Involving the Total Number of Trainable Parameters

The comparison of the relative performance of the recurrent units presented in Section 5.2 considers the cell dimension as a tunable hyperparameter for the RNNs. However, as mentioned before in Section 4.3.1, in other research communities it is also common to tune the total number of trainable parameters as a hyperparameter instead of the cell dimension. Therefore, we also perform experiments tuning the total number of trainable parameters as a hyperparameter to compare the relative performance of the different recurrent unit types. For this experiment we select the best configurations identified via the results analyzed thus far. Consequently, we choose the Stacked architecture along with the COCOB optimizer to run on all three recurrent unit types. We perform this experiment on the four datasets CIF, Wikipedia Web Traffic, M3 and NN5, where the RNNs have outperformed the statistical benchmarks. In particular, on the NN5 and Wikipedia Web Traffic datasets, we run the version of the Stacked architecture without applying the STL decomposition, with the increased input window size. Table 7 shows these results in terms of the mean SMAPE values. The violin plots in Figure 23 indicate the relative performance of the three RNN units in terms of both the mean SMAPE and mean MASE ranks.
In accordance with the results obtained in Section 5.2, Figure 23 also indicates that the LSTM cell with peephole connections performs the best, the ERNN cell performs the worst, and the performance of the GRU is between the other two. Again, the Friedman test of statistical significance performed in terms of the mean SMAPE values produces an overall p-value of 0.174, which means that this difference is not statistically significant.

5 https://drive.google.com/file/d/16rdzLTFwkKs-_eG_MU0i_rDnBQBCjsGe/view?usp=sharing

Figure 20: Performance of RNNs Compared to Traditional Univariate Techniques - Mean SMAPE

Figure 21: Performance of RNNs Compared to Traditional Univariate Techniques - Median SMAPE

Figure 22: Performance of RNNs Compared to Traditional Univariate Techniques in Different M4 Categories - Mean SMAPE

Figure 23: Relative Performance of Different Recurrent Cell Types under the Same Number of Total Trainable Parameters

Overall, we see that the results obtained after tuning for the total number of trainable parameters are coherent with the results obtained by tuning for the cell dimension instead. As mentioned in Section 4.3.1, there is a direct connection between the number of trainable parameters and the cell dimension. Following Smyl (2020), the cell dimension is not a very sensitive hyperparameter, which means that changes in the cell dimension, and consequently changes in the number of trainable parameters, do not have significant effects on the RNN performance. Therefore, we assume that the conclusions derived by tuning the cell dimension as a hyperparameter hold valid even after correcting for the total number of trainable parameters in the different recurrent units.
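The direct connection between the cell dimension and the number of trainable parameters can be made explicit; a sketch of the standard per-layer parameter counts for the three cell types (input size i, cell dimension d), which is how a parameter budget translates back into a cell dimension:

```python
def ernn_params(i, d):
    # Elman RNN: h_t = tanh(W x_t + U h_{t-1} + b)
    # -> input weights (d*i) + recurrent weights (d*d) + bias (d).
    return d * i + d * d + d

def gru_params(i, d):
    # Three gate/candidate blocks, each with its own W, U and b.
    return 3 * ernn_params(i, d)

def lstm_params(i, d, peepholes=False):
    # Four gate/candidate blocks; peephole connections add one d-vector
    # per gate that sees the cell state (input, forget and output gates).
    return 4 * ernn_params(i, d) + (3 * d if peepholes else 0)

# For a univariate input and a cell dimension of 10:
print(ernn_params(1, 10), gru_params(1, 10), lstm_params(1, 10, peepholes=True))
# → 120 360 510
```

Fixing the total parameter count across cell types therefore implies a roughly 2x larger cell dimension for the ERNN than for the LSTM, which is the correction the experiments in this section apply.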
Nevertheless, since tuning for the cell dimension is the viable approach due to practical limitations, we conclude that it is sufficient to tune the cell dimension across a similar range to produce a suitable number of trainable parameters in the different recurrent unit types.

5.9. Comparison of the Computational Costs of the RNN Models Vs. the Benchmarks

In addition to the comparisons of the prediction performance presented thus far, we perform a comparison of the computational times of the RNN models with respect to the chosen standard benchmarks, on the four datasets CIF 2016, Wikipedia Web Traffic, M3 and NN5, where an RNN model outperforms the benchmarks. The computational times presented in Table 8 are for the best performing RNN models on each of those datasets. For the purpose of comparison, both the benchmarks and the RNN models are allocated 25 CPU cores for the execution. The overall process of the RNN models has different stages in its pipeline, such as preprocessing of the data, tuning the hyperparameters, and the final model training with the optimal configuration for testing. Therefore, Table 8 shows a breakdown of the computational costs for these individual stages as well as the total time. From Table 8 we see that, compared to the standard benchmarks, the RNN models take a considerable amount of computational time for the overall process. For instance, on the horizon 12 category of the CIF dataset, the total time for the RNN is 3749.5s (1.0hr), whereas auto.arima and ets have taken only 85.1s (1.4min) and 41.6s respectively. Similarly, on the NN5 dataset, the RNN model has taken a total of 51490.7s (14.3hrs), whereas the auto.arima and ets models have taken only 1067.7s (17.8min) and 53.3s. However, it is also evident in Table 8 that most of the computational time of the RNNs is devoted to the hyperparameter tuning using SMAC with 50 iterations.
For example, in the Micro category of the M3 dataset, although the whole pipeline takes 2523.0s (42.1min), the training of the final model with the optimal configuration and testing accounts for only 565.6s (9.4min). On the other hand, the total running times for auto.arima and ets on the same dataset are 849.8s (14.2min) and 73.0s (1.2min) respectively. Hence, even though the overall process of the RNNs is computationally costly, the training and testing of one model can be comparable to the computational costs of the statistical benchmarks.

Dataset   Model                  Preprocessing  Hyperparameter Tuning  Model Training & Testing  Total
CIF(12)   S2SD LSTM NMW cocob    1.9            2531.8                 1215.7                    3749.5
CIF(12)   auto.arima             -              -                      -                         85.1
CIF(12)   ets                    -              -                      -                         41.6
CIF(6)    S2SD LSTM NMW cocob    0.5            2535.3                 304.9                     2840.3
CIF(6)    auto.arima             -              -                      -                         4.3
CIF(6)    ets                    -              -                      -                         8.3
Kaggle    NSTL Stacked GRU adam  159.7          12803.2                5350.2                    18313.0
Kaggle    auto.arima             -              -                      -                         834.4
Kaggle    ets                    -              -                      -                         481.7
M3(Mic)   S2SD GRU MW adagrad    30.2           1927.2                 565.6                     2523.0
M3(Mic)   auto.arima             -              -                      -                         849.8
M3(Mic)   ets                    -              -                      -                         73.0
M3(Mac)   S2SD GRU MW adagrad    28.2           5359.9                 1863.0                    7251.0
M3(Mac)   auto.arima             -              -                      -                         719.5
M3(Mac)   ets                    -              -                      -                         72.7
M3(Ind)   S2SD GRU MW adagrad    30.5           4084.0                 1054.6                    5169.1
M3(Ind)   auto.arima             -              -                      -                         1462.3
M3(Ind)   ets                    -              -                      -                         218.8
M3(Dem)   S2SD GRU MW adagrad    5.8            1875.9                 422.6                     2304.4
M3(Dem)   auto.arima             -              -                      -                         273.3
M3(Dem)   ets                    -              -                      -                         73.1
M3(Fin)   S2SD GRU MW adagrad    12.1           1848.9                 493.0                     2354.0
M3(Fin)   auto.arima             -              -                      -                         352.2
M3(Fin)   ets                    -              -                      -                         90.3
M3(Oth)   S2SD GRU MW adagrad    4.1            1319.8                 147.5                     1471.5
M3(Oth)   auto.arima             -              -                      -                         273.3
M3(Oth)   ets                    -              -                      -                         31.3
NN5       Stacked LSTM adam      42.2           37660.2                13788.3                   51490.7
NN5       auto.arima             -              -                      -                         1067.7
NN5       ets                    -              -                      -                         53.3

Table 8: Computational Times Comparison of the RNNs with the Benchmarks (in seconds)
Moreover, compared to the benchmarks, which build one model per series, the final trained models from the RNNs are less complex, with fewer parameters than the univariate techniques on a global scale. From this comparison we see that RNNs are typically computationally more expensive models compared to the traditional univariate techniques. However, regardless of the computational cost, they are capable of performing better than the traditional univariate benchmarks in all the cases mentioned in Table 8. This finding is another aspect of the recent changes in the forecasting community, now acknowledging that complex methods can have merits over simpler statistical benchmarks (Makridakis et al., 2018b). Yet, our study shows how RNNs can be trained in a way to achieve such improvements. With the availability of massive amounts of computational resources nowadays, the improved accuracy brought forward by the RNNs for forecasting is certainly beneficial for forecasting practitioners.

5.10. Hyperparameter Configurations

Usually, the larger the initial hyperparameter space that needs to be searched, the higher the number of iterations of the automated hyperparameter tuning technique should be. This depends on the number of hyperparameters as well as the initial hyperparameter ranges. We find 50 iterations of the SMAC algorithm for hyperparameter tuning to be suitable across all the datasets. For our experiments we choose roughly the same range across all the datasets, as shown in Table 4, except for the minibatch size. The minibatch size needs to be chosen proportional to the size of the dataset. Usually, for the lower bound of the initial hyperparameter range of the minibatch size, around 1/10th of the size of the dataset is appropriate. However, we vary the upper bound in different orders for the different datasets.
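One reading of this minibatch rule of thumb can be sketched as follows; the function name and the exact offsets are illustrative assumptions, not the paper's implementation:

```python
def minibatch_bounds(n_series, large_dataset=False):
    """Initial hyperparameter range for the minibatch size: the lower bound
    is around 1/10th of the dataset size, and the upper bound exceeds it by
    an amount on the order of tens for small datasets and hundreds for large
    ones (the offsets 10 and 100 here are illustrative)."""
    lower = max(1, n_series // 10)
    upper = lower + (100 if large_dataset else 10)
    return lower, upper

print(minibatch_bounds(72))                          # CIF-sized dataset → (7, 17)
print(minibatch_bounds(100000, large_dataset=True))
```

The tuning procedure (here, SMAC) then searches the minibatch size within these bounds alongside the other hyperparameters.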
For small datasets such as CIF, NN5, Tourism and the different categories of M3, the upper bound differs from the lower bound in the order of tens. For bigger datasets such as the Wikipedia Web Traffic and the different categories of M4, we set the upper bound to be larger than the lower bound in the order of hundreds. For the number of hidden layers in the RNN, many recent studies suggest that a low value usually performs better (Smyl and Kuber, 2016; Salinas et al., 2019; Wang et al., 2019; Bandara et al., 2020). Following this convention, we choose the number of layers between 1 and 2, and we observe that the RNNs perform well with such low values. On the one hand, this is due to the overfitting effects resulting from the increased number of parameters with the added layers. On the other hand, the number of layers also directly relates to the computational complexity. As for the learning rates, we see that the convergence of the Adagrad optimizer usually requires higher learning rates, in the range 0.01-0.9. For the Adam optimizer the identified range is smaller, between 0.001 and 0.1. On the other hand, variations in the cell dimension and the standard deviation of the random normal initializer barely impact the performance of the models. It is also important not to set large values for the standard deviation of the Gaussian noise and the L2 weight regularization parameters, since too high a value for them makes the model almost completely underfit the data and eliminates the NN's effect entirely in producing the final forecasts.

6. Conclusions

The motivation of this study is to address some of the key issues in using RNNs for forecasting and to evaluate if and how they can be used effectively by forecasting practitioners with limited knowledge of these techniques.
Through a number of systematic experiments across datasets with diverse characteristics, we derive conclusions ranging from general data preprocessing best practices to hyperparameter configurations, best RNN architectures, recurrent units and optimizers. All our models exploit cross-series information in the form of global models and thus leverage the existence of massive time series databases with many related time series. From our experiments we conclude that the Stacked architecture, combined with LSTM cells with peephole connections and the COCOB optimizer, fed with deseasonalized data in a moving window format, can be a competitive model generally across many datasets. In particular, though not statistically significant, the LSTM is the best unit type, even when correcting for the number of trainable parameters. We further conclude that when all the series in the dataset follow homogeneous seasonal patterns, with all of them covering the same duration in time with sufficient lengths, RNNs are capable of capturing the seasonality without prior deseasonalization. Otherwise, RNNs are weak in modelling seasonality on their own, and a deseasonalization step should be employed. From the experiments involving the pooled and unpooled versions of the regression models, we can conclude that the concept of cross-series information helps in certain types of datasets. However, even on those datasets which involve many heterogeneous series, the strong modelling capabilities of RNNs can drive them to perform competitively in terms of the forecasting accuracy. Therefore, we can conclude that leveraging cross-series information has its own benefits on sets of time series, while the predictive capability provided by the RNNs can further improve the forecasting accuracy. With respect to the computational costs, we observe that RNNs take relatively longer computational times compared to the statistical benchmarks.
However, with the cloud infrastructure nowadays commonly in place at companies, such processing times are feasible. Moreover, our study has empirically shown that RNNs are good candidates for forecasting, which in many cases outperform the statistical benchmarks that are currently the state of the art in the community. Thus, with the extensive experiments involved in this study, we can confirm that complex methods now have benefits over simpler statistical benchmarks in many forecasting situations. Our procedure is (semi-)automatic, as for the initial hyperparameter ranges of the SMAC algorithm we select approximately the same range across all the datasets, except for the minibatch size, which we select depending on the size of each dataset. Thus, though the fitting of RNNs is still not as straightforward and automatic as for the two state-of-the-art univariate forecasting benchmarks, ets and auto.arima, our paper and our released code framework are important steps in this direction. We finally conclude that RNNs are now a good option for forecasting practitioners to obtain reliable forecasts which can outperform the benchmarks.

7. Future Directions

The results of our study are limited to point forecasts in a univariate context. Nevertheless, modelling the uncertainty of NN predictions through probabilistic forecasting has received growing attention recently in the community. Also, when considering a complex forecasting scenario such as in the retail industry, the sales of different products may be interdependent. Therefore, a sales forecasting task in such a context requires multivariate forecasting as opposed to univariate forecasting. Furthermore, although this study considers only single seasonality forecasting, for higher frequency data with sufficient length it becomes beneficial to model multiple seasonalities in a big data context.
As seen in our study, global NN models often suffer from outlier errors for certain time series. This is probably due to the fact that the average weights of NNs found by fitting global models may not be suitable for the individual requirements of certain time series. Consequently, it becomes necessary to develop novel models which incorporate both global parameters as well as local parameters for individual time series, in the form of hierarchical models, potentially combined with ensembling, where the single models are trained in different ways with the existing dataset (e.g., on different subsets). The works by Bandara et al. (2020), Smyl (2020), and Sen et al. (2019) are steps in this direction, but this topic still remains largely an open research question. In general, deep learning is a fast-paced research field, where many new architectures are introduced and discussed rapidly. However, for practitioners it often remains unclear in which situations the techniques are most useful and how difficult it is to adapt them to a given application case. Recently, CNNs, although initially intended for image processing, have become increasingly popular for time series forecasting. The works carried out by Shih et al. (2019) and Lai et al. (2018) follow the argument that typical RNN-based attention schemes are not good at modelling seasonality. Therefore, they use a combination of CNN filters to capture local dependencies and a custom attention score function to model the long-term dependencies. Lai et al. (2018) have also experimented with recurrent skip connections to capture seasonality patterns. On the other hand, Dilated Causal Convolutions are specifically designed to capture long-range dependencies effectively along the temporal dimension (van den Oord et al., 2016).
Such layers stacked on top of each other can build massive hierarchical attention networks, attending to points even far back in the history. They have recently been used along with CNNs for time series forecasting problems. More advanced CNNs have also been introduced, such as Temporal Convolution Networks (TCN), which combine both dilated convolutions and residual skip connections. TCNs have also been used for forecasting in the recent literature (Borovykh et al., 2018). Recent studies suggest that TCNs are promising NN architectures for sequence modelling tasks, on top of being efficient in training (Bai et al., 2018). Therefore, a competitive advantage may begin to unfold for forecasting practitioners by using CNNs instead of RNNs.

Acknowledgment

This research was supported by the Australian Research Council under grant DE190100045, a Facebook Statistics for Improving Insights and Decisions research award, Monash University Graduate Research funding and MASSIVE - High performance computing facility, Australia.

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X., 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org. URL https://www.tensorflow.org/

Alexandrov, A., Benidis, K., Bohlke-Schneider, M., Flunkert, V., Gasthaus, J., Januschowski, T., Maddix, D. C., Rangapuram, S. S., Salinas, D., Schulz, J., Stella, L., Türkmen, A. C., Wang, Y., 2019.
GluonTS: Probabilistic time series models in Python. CoRR abs/1906.05264.

Assaad, M., Boné, R., Cardot, H., Jan. 2008. A new boosting algorithm for improved time-series forecasting with recurrent neural networks. Inf. Fusion 9 (1), 41-55.

Athanasopoulos, G., Hyndman, R., Song, H., Wu, D., 2011. The tourism forecasting competition. International Journal of Forecasting 27 (3), 822-844.

Athanasopoulos, G., Hyndman, R. J., Song, H., Wu, D., 2010. Tourism forecasting part two. URL https://www.kaggle.com/c/tourism2/data

Bahdanau, D., Cho, K., Bengio, Y., 2015. Neural machine translation by jointly learning to align and translate. In: Bengio, Y., LeCun, Y. (Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Bai, S., Kolter, J. Z., Koltun, V., 2018. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. CoRR abs/1803.01271.

Bandara, K., Bergmeir, C., Smyl, S., 2020. Forecasting across time series databases using recurrent neural networks on groups of similar series: A clustering approach. Expert Systems with Applications 140, 112896.

Bandara, K., Shi, P., Bergmeir, C., Hewamalage, H., Tran, Q., Seaman, B., 2019. Sales demand forecast in e-commerce using a long short-term memory neural network methodology. In: Gedeon, T., Wong, K. W., Lee, M. (Eds.), Neural Information Processing. Springer International Publishing, Cham, pp. 462-474.

Bayer, J., Osendorfer, C., 2014. Learning stochastic recurrent networks.

Ben Taieb, S., Bontempi, G., Atiya, A., Sorjamaa, A., 2012. A review and comparison of strategies for multi-step ahead time series forecasting based on the NN5 forecasting competition. Expert Systems with Applications 39 (8), 7067-7083.

Bergmeir, C., Hyndman, R. J., Koo, B., 2018.
A note on the validity of cross-validation for evaluating autoregressive time series prediction. Computational Statistics & Data Analysis 120, 70-83.

Bergstra, J., 2012. Hyperopt: Distributed asynchronous hyper-parameter optimization. URL https://github.com/hyperopt/hyperopt

Bergstra, J., Bengio, Y., 2012. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13 (Feb), 281-305.

Bianchi, F. M., Maiorino, E., Kampffmeyer, M. C., Rizzi, A., Jenssen, R., 2017. An overview and comparative analysis of recurrent neural networks for short term load forecasting. CoRR abs/1705.04378.

Borovykh, A., Bohte, S., Oosterlee, C. W., 2018. Conditional time series forecasting with convolutional neural networks. arXiv preprint.

Box, G., Jenkins, G., Reinsel, G., 1994. Time Series Analysis: Forecasting and Control. Forecasting and Control Series. Prentice Hall.

Chen, C., Twycross, J., Garibaldi, J. M., Mar. 2017. A new accuracy measure based on bounded relative error for time series forecasting. PLoS One 12 (3), e0174202.

Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y., 25-29 October 2014. Learning phrase representations using RNN Encoder-Decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 1724-1734.

Cinar, Y. G., Mirisaee, H., Goswami, P., Gaussier, E., Aït-Bachir, A., Strijov, V., 2017. Position-based content attention for time series forecasting with sequence-to-sequence RNNs. In: Liu, D., Xie, S., Li, Y., Zhao, D., El-Alfy, E.-S. M. (Eds.), Neural Information Processing. Springer International Publishing, Cham, pp. 533-544.

Claveria, O., Monte, E., Torra, S., Sep. 2017. Data pre-processing for neural network-based forecasting: does it really matter?
Technological and Economic Development of Economy 23 (5), 709–725.
Claveria, O., Torra, S., Jan. 2014. Forecasting tourism demand to Catalonia: Neural networks vs. time series models. Econ. Model. 36, 220–228.
Cleveland, R. B., Cleveland, W. S., McRae, J. E., Terpenning, I., Jan. 1990. STL: A seasonal-trend decomposition procedure based on loess. J. Off. Stat. 6 (1), 3–33.
Collins, J., Sohl-Dickstein, J., Sussillo, D., 2016. Capacity and trainability in recurrent neural networks. In: International Conference on Learning Representations 2016 (ICLR 2016).
Crone, S. F., 2008. NN5 competition. URL http://www.neural-forecasting-competition.com/NN5/
Crone, S. F., Hibon, M., Nikolopoulos, K., Jul. 2011. Advances in forecasting with neural networks? Empirical evidence from the NN3 competition on time series prediction. Int. J. Forecast. 27 (3), 635–660.
Dillon, J. V., Langmore, I., Tran, D., Brevdo, E., Vasudevan, S., Moore, D., Patton, B., Alemi, A., Hoffman, M. D., Saurous, R. A., 2017. TensorFlow distributions. CoRR abs/1711.10604.
Duchi, J., Hazan, E., Singer, Y., 2011. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12 (Jul), 2121–2159.
Elman, J. L., Apr. 1990. Finding structure in time. Cogn. Sci. 14 (2), 179–211.
eResearch Centre, M., 2019. M3 user guide. URL https://docs.massive.org.au/index.html
Fernando, 2012. Bayesian optimization. URL https://github.com/fmfn/BayesianOptimization
Friedman, J., Hastie, T., Tibshirani, R., 2010. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33 (1), 1–22.
García, S., Fernández, A., Luengo, J., Herrera, F., May 2010. Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Inf. Sci. 180 (10), 2044–2064.
Gasthaus, J., Benidis, K., Wang, Y., Rangapuram, S. S., Salinas, D., Flunkert, V., Januschowski, T., 16–18 Apr 2019. Probabilistic forecasting with spline quantile function RNNs. In: Chaudhuri, K., Sugiyama, M. (Eds.), Proceedings of Machine Learning Research. Vol. 89 of Proceedings of Machine Learning Research. PMLR, pp. 1901–1910.
Google, 2017. Web traffic time series forecasting. URL https://www.kaggle.com/c/web-traffic-time-series-forecasting
He, K., Zhang, X., Ren, S., Sun, J., 27–30 June 2016. Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 770–778.
Hochreiter, S., Schmidhuber, J., Nov. 1997. Long short-term memory. Neural Comput. 9 (8), 1735–1780.
Hornik, K., Stinchcombe, M., White, H., Jan. 1989. Multilayer feedforward networks are universal approximators. Neural Netw. 2 (5), 359–366.
Hutter, F., Hoos, H. H., Leyton-Brown, K., 17–21 Jan 2011. Sequential model-based optimization for general algorithm configuration. In: Coello, C. A. C. (Ed.), Learning and Intelligent Optimization. Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 507–523.
Hyndman, R., 2018. A brief history of time series forecasting competitions. URL https://robjhyndman.com/hyndsight/forecasting-competitions/
Hyndman, R., Kang, Y., Talagala, T., Wang, E., Yang, Y., 2019. tsfeatures: Time Series Feature Extraction. R package version 1.0.0. URL https://pkg.robjhyndman.com/tsfeatures/
Hyndman, R., Khandakar, Y., 2008. Automatic time series forecasting: The forecast package for R. Journal of Statistical Software, Articles 27 (3), 1–22.
Hyndman, R., Koehler, A., Ord, K., Snyder, R. D., Jan. 2008. Forecasting with exponential smoothing: The state space approach. Springer Berlin Heidelberg.
Hyndman, R. J., Koehler, A. B., Oct. 2006. Another look at measures of forecast accuracy. Int. J. Forecast. 22 (4), 679–688.
Ioffe, S., Szegedy, C., 06–11 July 2015.
Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning - Volume 37. ICML'15. JMLR.org, pp. 448–456.
Jagannatha, A. N., Yu, H., Jun. 2016. Bidirectional RNN for medical event detection in electronic health records. Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting 2016, 473–482.
Januschowski, T., Gasthaus, J., Wang, Y., Salinas, D., Flunkert, V., Bohlke-Schneider, M., Callot, L., 2020. Criteria for classifying forecasting methods. International Journal of Forecasting 36 (1), 167–177, M4 Competition.
Ji, Y., Haffari, G., Eisenstein, J., Jun. 2016. A latent variable recurrent neural network for discourse-driven language models. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, San Diego, California, pp. 332–342.
Jozefowicz, R., Zaremba, W., Sutskever, I., 06–11 July 2015. An empirical exploration of recurrent network architectures. In: Proceedings of the 32nd International Conference on Machine Learning - Volume 37. ICML'15. JMLR.org, pp. 2342–2350.
Kingma, D. P., Ba, J., 7–9 May 2015. Adam: A method for stochastic optimization. In: 3rd International Conference for Learning Representations. Vol. 1412.
Koutník, J., Greff, K., Gomez, F., Schmidhuber, J., 21–26 Jun 2014. A clockwork RNN. In: Proceedings of the 31st International Conference on Machine Learning - Volume 32. ICML'14. JMLR.org, pp. II-1863–II-1871.
Krstanovic, S., Paulheim, H., 12–14 Dec 2017. Ensembles of recurrent neural networks for robust time series forecasting. In: Artificial Intelligence XXXIV. Springer International Publishing, pp. 34–46.
Lai, G., Chang, W.-C., Yang, Y., Liu, H., 8–12 July 2018.
Modeling long- and short-term temporal patterns with deep neural networks. In: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. SIGIR '18. ACM, pp. 95–104.
Laptev, N., Yosinski, J., Li, L. E., Smyl, S., 06–11 Aug 2017. Time-series extreme event forecasting with neural networks at Uber. In: International Conference on Machine Learning. Vol. 34. pp. 1–5.
Liang, Y., Ke, S., Zhang, J., Yi, X., Zheng, Y., 13–19 July 2018. GeoMAN: Multi-level attention networks for geo-sensory time series prediction. In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18. International Joint Conferences on Artificial Intelligence Organization, pp. 3428–3434.
Lindauer, M., Eggensperger, K., Feurer, M., Falkner, S., Biedenkapp, A., Hutter, F., 2017. SMAC v3: Algorithm configuration in Python. URL https://github.com/automl/SMAC3
Luong, T., Pham, H., Manning, C. D., Sep 19–21 2015. Effective approaches to attention-based neural machine translation. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 1412–1421.
Makridakis, S., Hibon, M., Oct. 2000. The M3-Competition: results, conclusions and implications. Int. J. Forecast. 16 (4), 451–476.
Makridakis, S., Spiliotis, E., Assimakopoulos, V., Oct. 2018a. The M4 competition: Results, findings, conclusion and way forward. Int. J. Forecast. 34 (4), 802–808.
Makridakis, S., Spiliotis, E., Assimakopoulos, V., Mar. 2018b. Statistical and machine learning forecasting methods: Concerns and ways forward. PLOS ONE 13 (3), e0194889.
Mandal, P., Senjyu, T., Urasaki, N., Funabashi, T., Jul. 2006. A neural network based several-hour-ahead electric load forecasting using similar days approach. Int. J. Electr. Power Energy Syst. 28 (6), 367–373.
Montero-Manso, P., Athanasopoulos, G., Hyndman, R.
J., Talagala, T. S., 2020. FFORMA: Feature-based forecast model averaging. International Journal of Forecasting 36 (1), 86–92, M4 Competition.
Nelson, M., Hill, T., Remus, W., O'Connor, M., Sep. 1999. Time series forecasting using neural networks: Should the data be deseasonalized first? J. Forecast. 18 (5), 359–367.
Orabona, F., 2017. cocob. URL https://github.com/bremen79/cocob
Orabona, F., Tommasi, T., 04–09 Dec 2017. Training deep networks without learning rates through coin betting. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS'17. Curran Associates Inc., USA, pp. 2157–2167.
Oreshkin, B. N., Carpov, D., Chapados, N., Bengio, Y., 2019. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. CoRR abs/1905.10437.
Peng, C., Li, Y., Yu, Y., Zhou, Y., Du, S., 31 Jan – 03 Feb 2018. Multi-step-ahead host load prediction with GRU based Encoder-Decoder in cloud computing. In: 2018 10th International Conference on Knowledge and Smart Technology (KST). pp. 186–191.
Qin, Y., Song, D., Cheng, H., Cheng, W., Jiang, G., Cottrell, G. W., 2017. A dual-stage attention-based recurrent neural network for time series prediction. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence. IJCAI'17. AAAI Press, pp. 2627–2633.
R Core Team, 2014. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/
Rahman, M. M., Islam, M. M., Murase, K., Yao, X., Jan. 2016. Layered ensemble architecture for time series forecasting. IEEE Trans Cybern 46 (1), 270–283.
Rangapuram, S. S., Seeger, M., Gasthaus, J., Stella, L., Wang, Y., Januschowski, T., 2018. Deep state space models for time series forecasting. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. NIPS'18. Curran Associates Inc., USA, pp.
7796–7805.
Hyndman, R. J., Athanasopoulos, G., 2018. Forecasting: Principles and Practice, 2nd Edition. OTexts. URL https://otexts.com/fpp2/
Salinas, D., Flunkert, V., Gasthaus, J., Januschowski, T., 2019. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting.
Schäfer, A. M., Zimmermann, H. G., 10–14 Sep 2006. Recurrent neural networks are universal approximators. In: Proceedings of the 16th International Conference on Artificial Neural Networks - Volume Part I. ICANN'06. Springer-Verlag, Berlin, Heidelberg, pp. 632–640.
Schuster, M., Paliwal, K. K., Nov. 1997. Bidirectional recurrent neural networks. Trans. Sig. Proc. 45 (11), 2673–2681.
Sen, R., Yu, H.-F., Dhillon, I., 2019. Think globally, act locally: A deep neural network approach to high-dimensional time series forecasting.
Sharda, R., Patil, R. B., Oct. 1992. Connectionist approach to time series prediction: an empirical test. J. Intell. Manuf. 3 (5), 317–323.
Shih, S.-Y., Sun, F.-K., Lee, H.-y., Sep 2019. Temporal pattern attention for multivariate time series forecasting. Machine Learning 108 (8), 1421–1441.
Smyl, S., 2016. Forecasting short time series with LSTM neural networks. Accessed: 2018-10-30. URL https://gallery.azure.ai/Tutorial/Forecasting-Short-Time-Series-with-LSTM-Neural-Networks-2
Smyl, S., 25–28 Jun 2017. Ensemble of specialized neural networks for time series forecasting. In: 37th International Symposium on Forecasting.
Smyl, S., 2020. A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting. International Journal of Forecasting 36 (1), 75–85, M4 Competition.
Smyl, S., Kuber, K., 19–22 Jun 2016. Data preprocessing and augmentation for multiple short time series forecasting with recurrent neural networks. In: 36th International Symposium on Forecasting.
Snoek, J., 2012. Spearmint.
URL https://github.com/JasperSnoek/spearmint
Snoek, J., Larochelle, H., Adams, R. P., 03–08 Dec 2012. Practical Bayesian optimization of machine learning algorithms. In: Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2. NIPS'12. Curran Associates Inc., USA, pp. 2951–2959.
Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., Srebro, N., Jan. 2018. The implicit bias of gradient descent on separable data. J. Mach. Learn. Res. 19 (1), 2822–2878.
Štěpnička, M., Burda, M., 09–12 July 2017. On the results and observations of the time series forecasting competition CIF 2016. In: 2017 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE). pp. 1–6.
Suilin, A., 2017. kaggle-web-traffic. Accessed: 2018-11-19. URL https://github.com/Arturus/kaggle-web-traffic/
Sutskever, I., Vinyals, O., Le, Q. V., 08–13 Dec 2014. Sequence to sequence learning with neural networks. In: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. NIPS'14. MIT Press, Cambridge, MA, USA, pp. 3104–3112.
Tang, Z., de Almeida, C., Fishwick, P. A., Nov. 1991. Time series forecasting using neural networks vs. Box-Jenkins methodology. Simulation 57 (5), 303–310.
Trapero, J. R., Kourentzes, N., Fildes, R., Feb 2015. On the identification of sales forecasting models in the presence of promotions. Journal of the Operational Research Society 66 (2), 299–307.
van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., Kavukcuoglu, K., Sep 13–15 2016. WaveNet: A generative model for raw audio. In: The 9th ISCA Speech Synthesis Workshop. ISCA, p. 125.
Wang, Y., Smola, A., Maddix, D., Gasthaus, J., Foster, D., Januschowski, T., 09–15 Jun 2019. Deep factors for forecasting. In: Chaudhuri, K., Salakhutdinov, R. (Eds.), Proceedings of the 36th International Conference on Machine Learning.
Vol. 97 of Proceedings of Machine Learning Research. PMLR, Long Beach, California, USA, pp. 6607–6617.
Wen, R., Torkkola, K., Narayanaswamy, B., Madeka, D., 04–09 Dec 2017. A multi-horizon quantile recurrent forecaster. In: 31st Conference on Neural Information Processing Systems (NIPS 2017), Time Series Workshop.
Yan, W., Jul. 2012. Toward automatic time-series forecasting using neural networks. IEEE Trans Neural Netw Learn Syst 23 (7), 1028–1039.
Yan, Y., 2016. rBayesianOptimization: Bayesian Optimization of Hyperparameters. R package version 1.1.0. URL https://CRAN.R-project.org/package=rBayesianOptimization
Yao, K., Cohn, T., Vylomova, K., Duh, K., Dyer, C., 22 Jun – 14 Aug 2015. Depth-Gated LSTM. 20th Jelinek Summer Workshop on Speech and Language Technology 2015.
Yeo, I.-K., Johnson, R. A., 2000. A new family of power transformations to improve normality or symmetry. Biometrika 87 (4), 954–959.
Zhang, G., Eddy Patuwo, B., Hu, M. Y., 1998. Forecasting with artificial neural networks: The state of the art. Int. J. Forecast. 14, 35–62.
Zhang, G. P., Jan. 2003. Time series forecasting using a hybrid ARIMA and neural network model. Neurocomputing 50, 159–175.
Zhang, G. P., Berardi, V. L., Jun. 2001. Time series forecasting with neural network ensembles: an application for exchange rate prediction. J. Oper. Res. Soc. 52 (6), 652–664.
Zhang, G. P., Kline, D. M., Nov. 2007. Quarterly time-series forecasting with neural networks. IEEE Trans. Neural Netw. 18 (6), 1800–1814.
Zhang, G. P., Qi, M., Jan. 2005. Neural network forecasting for seasonal and trend time series. Eur. J. Oper. Res. 160 (2), 501–514.
Zhu, L., Laptev, N., 18–21 Nov 2017. Deep and confident prediction for time series at Uber. In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW). IEEE, pp. 103–110.