WAVE: Weighted Autoregressive Varying Gate for Time Series Forecasting

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

We propose a Weighted Autoregressive Varying gatE (WAVE) attention mechanism equipped with both Autoregressive (AR) and Moving-average (MA) components. It can adapt to various attention mechanisms, enhancing and decoupling their ability to capture long-range and local temporal patterns in time series data. In this paper, we first demonstrate that, for the time series forecasting (TSF) task, the previously overlooked decoder-only autoregressive Transformer model can achieve results comparable to the best baselines when appropriate tokenization and training methods are applied. Moreover, inspired by the ARMA model from statistics and recent advances in linear attention, we introduce the full ARMA structure into existing autoregressive attention mechanisms. By using an indirect MA weight generation method, we incorporate the MA term while maintaining the time complexity and parameter size of the underlying efficient attention models. We further explore how indirect parameter generation can produce implicit MA weights that align with the modeling requirements for local temporal impacts. Experimental results show that WAVE attention, which incorporates the ARMA structure, consistently improves the performance of various AR attentions on TSF tasks, achieving state-of-the-art results.


💡 Research Summary

The paper introduces WAVE (Weighted Autoregressive Varying Gate), a novel attention mechanism that embeds the full ARMA (autoregressive moving‑average) structure into modern transformer models for time‑series forecasting. The authors first demonstrate that a decoder‑only autoregressive transformer, when equipped with appropriate tokenization (non‑overlapping patches of length equal to the forecast horizon) and RevIN channel‑wise normalization, can achieve performance on par with state‑of‑the‑art encoder‑only or hybrid models. This challenges the prevailing belief that decoder‑only architectures suffer from error accumulation and are unsuitable for long‑horizon forecasting.
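The tokenization and normalization setup described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, shapes, and the horizon value are assumptions, and only the statistics of RevIN are shown (the learnable affine parameters are omitted).

```python
import numpy as np

def revin_normalize(x, eps=1e-5):
    """Channel-wise reversible instance normalization (RevIN statistics only).
    x: (batch, length, channels). Returns the normalized series plus the
    per-instance statistics needed to invert the transform on the forecast."""
    mean = x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True) + eps
    return (x - mean) / std, mean, std

def patch_tokenize(x, horizon):
    """Split each series into non-overlapping patches whose length equals the
    forecast horizon, so each autoregressive step emits one full horizon."""
    b, n, c = x.shape
    n_patches = n // horizon
    x = x[:, : n_patches * horizon]
    return x.reshape(b, n_patches, horizon * c)

x = np.random.randn(2, 96, 3)            # 2 series, 96 steps, 3 channels
x_norm, mean, std = revin_normalize(x)
tokens = patch_tokenize(x_norm, horizon=24)
print(tokens.shape)  # (2, 4, 72)
```

After the decoder predicts the next patch, the stored `mean` and `std` de-normalize the output back to the original scale.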

Building on this baseline, the core contribution is the integration of a moving‑average (MA) term without increasing computational complexity or parameter count. Traditional linear attention reduces the quadratic cost of softmax attention to O(N) by replacing the exponential kernel with a feature‑map kernel ϕ(q)·ϕ(k). However, directly adding an MA component would require constructing and inverting an N×N lower‑triangular weight matrix Θ, which would revert the complexity to O(N²). To avoid this, the authors propose an "indirect MA weight generation" strategy. After computing the AR output o_AR(t) at each step, they form the residual r_t = v_{t+1} − o_AR(t); since each residual mixes the current error with the MA contributions of past errors, the residuals and errors are related by r = (I + Θ)ε, where Θ holds the learned MA weights θ_{t−1,j}. Because Θ is strictly lower triangular, ε can be recovered by forward substitution using the same linear‑RNN state updates employed by efficient attention, without ever forming or inverting Θ explicitly. Consequently, the MA contribution o_MA(t) = Σ_j θ_{t‑1,j}·ε_j can be accumulated in O(N) time, preserving the linear‑time advantage of the underlying attention.
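To make the linear-time solve concrete, here is a minimal sketch under one simplifying assumption: the MA weights are hypothetically taken to decay exponentially, Θ[t, j] = β·γ^(t−j) for j < t (the paper generates implicit weights; this specific parametrization is illustrative). Under that structure, forward substitution on r = (I + Θ)ε collapses to a single-state linear recurrence, so both the error recovery and the MA sum run in O(N):

```python
import numpy as np

def indirect_ma(r, beta=0.3, gamma=0.9):
    """O(N) recovery of errors eps from residuals r = (I + Theta) eps,
    assuming decay-structured MA weights Theta[t, j] = beta * gamma**(t - j).
    Forward substitution becomes a linear recurrence over one scalar state."""
    eps = np.zeros_like(r)
    o_ma = np.zeros_like(r)
    s = 0.0  # running state: s_t = sum_{j<=t} gamma**(t-j) * eps_j
    for t in range(len(r)):
        o_ma[t] = beta * gamma * s   # MA contribution of all past errors
        eps[t] = r[t] - o_ma[t]      # eps_t = r_t - sum_{j<t} Theta[t,j] eps_j
        s = gamma * s + eps[t]       # linear-RNN state update
    return eps, o_ma

# Cross-check against the explicit O(N^2) construction of Theta.
N = 6
rng = np.random.default_rng(0)
r = rng.standard_normal(N)
Theta = np.array([[0.3 * 0.9 ** (t - j) if j < t else 0.0
                   for j in range(N)] for t in range(N)])
eps_direct = np.linalg.solve(np.eye(N) + Theta, r)
eps_fast, _ = indirect_ma(r)
print(np.allclose(eps_direct, eps_fast))  # True
```

The key point is that the N×N matrix Θ is never materialized in the fast path; the same trick generalizes to the state updates used by efficient attention variants.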

The paper also discusses the role of gating. Existing gated linear attention applies an exponential decay gate to the AR weights, which is beneficial for NLP tasks where recent context is more important. In time‑series data, however, seasonal and cyclic patterns persist over long horizons and should not be attenuated. WAVE therefore decouples short‑term fluctuations (captured by the MA term) from long‑term dependencies (captured by the AR term). This separation allows the AR component to focus on stable, periodic structures while the MA component handles abrupt changes, noise, and local dynamics.
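The effect of the decay gate on the AR term can be seen in a toy causal linear-attention recurrence. This is a schematic sketch, not the paper's implementation: feature maps are omitted and the scalar gate value is an assumption, but it shows how gating exponentially attenuates distant context that seasonal patterns may need.

```python
import numpy as np

def linear_attention(q, k, v, gate=None):
    """Causal linear attention as a recurrent state update:
    S_t = g * S_{t-1} + k_t v_t^T,  o_t = q_t @ S_t.
    gate=None keeps the full history (undecayed AR weights, preserving
    long-range seasonality); a gate in (0, 1) exponentially decays old
    context, as in gated linear attention for NLP. q, k, v: (T, d)."""
    T, d = q.shape
    S = np.zeros((d, v.shape[1]))
    out = np.zeros_like(v)
    for t in range(T):
        if gate is not None:
            S = gate * S               # attenuate all past key-value pairs
        S = S + np.outer(k[t], v[t])   # add the current key-value pair
        out[t] = q[t] @ S
    return out

T, d = 8, 4
rng = np.random.default_rng(1)
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
o_full = linear_attention(q, k, v)             # ungated: long-range intact
o_gated = linear_attention(q, k, v, gate=0.5)  # gated: early steps fade fast
```

In WAVE's decoupled design, the MA term absorbs the short-term dynamics that the gate was compensating for, so the AR state can retain long-range periodic structure undamped.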

WAVE is evaluated on twelve public benchmark datasets covering electricity consumption, traffic flow, weather, and other domains. The authors integrate WAVE into several baseline attention variants: standard softmax attention, linear attention, Attention‑Free Transformer (AFT), and gated linear attention. Across 48 sub‑experiments, WAVE‑augmented models consistently achieve lower MSE/MAE than their plain AR counterparts, with average improvements of 10–15 %. Moreover, WAVE models surpass the best published baselines (e.g., DLinear, PatchTST, Autoformer) in overall ranking, as shown by box‑plot visualizations. Importantly, these gains come without any increase in FLOPs or memory consumption; the models retain O(N) time complexity and essentially unchanged parameter counts.

Ablation studies confirm that the indirect MA generation is crucial: directly computing Θ leads to quadratic cost and training instability, while the proposed method maintains efficiency and stability. Visualizations of learned MA weights highlight that high MA coefficients align with timestamps of sudden spikes or anomalies, whereas AR weights remain relatively uniform, reflecting long‑term trends.

The authors release their code, facilitating reproducibility and future extensions. They acknowledge limitations such as the current focus on single‑channel MA generation and the need for adaptive selection of AR order p and MA order q. Future work may explore multi‑channel MA interactions, dynamic order selection, and application to other sequential domains like finance or healthcare.

In summary, WAVE bridges classical statistical time‑series modeling with modern deep learning attention, delivering a computationally cheap yet powerful mechanism that simultaneously captures long‑range dependencies and short‑term volatility, thereby setting a new performance benchmark for transformer‑based time‑series forecasting.

