A Policy Gradient-Based Sequence-to-Sequence Method for Time Series Prediction
Sequence-to-sequence architectures built upon recurrent neural networks have become a standard choice for multi-step-ahead time series prediction. In these models, the decoder produces future values conditioned on contextual inputs, typically either actual historical observations (ground truth) or previously generated predictions. During training, feeding ground-truth values helps stabilize learning but creates a mismatch between training and inference conditions, known as exposure bias, since such true values are inaccessible during real-world deployment. On the other hand, using the model’s own outputs as inputs at test time often causes errors to compound rapidly across prediction steps. To mitigate these limitations, we introduce a new training paradigm grounded in reinforcement learning: a policy gradient-based method to learn an adaptive input selection strategy for sequence-to-sequence prediction models. Auxiliary models first synthesize plausible input candidates for the decoder, and a trainable policy network optimized via policy gradients dynamically chooses the most beneficial inputs to maximize long-term prediction performance. Empirical evaluations on diverse time series datasets confirm that our approach enhances both accuracy and stability in multi-step forecasting compared to conventional methods.
💡 Research Summary
The paper addresses two well‑known shortcomings of sequence‑to‑sequence (Seq2Seq) models for multi‑step time‑series forecasting: exposure bias, which arises because teacher‑forcing training uses ground‑truth inputs that are unavailable at test time, and error accumulation, which results from feeding the model’s own predictions back into the decoder in an autoregressive fashion. To overcome these issues, the authors propose a reinforcement‑learning (RL) based training paradigm that learns an adaptive input‑selection policy for the decoder.
A pool of auxiliary models (e.g., ARIMA, Prophet, LightGBM, and other pre‑trained neural networks) is first constructed. Each auxiliary model generates a candidate prediction for the next time step, and the decoder itself is also included as a candidate. The selection of which candidate to feed into the decoder at each decoding step is cast as a sequential decision‑making problem modeled as a Markov Decision Process (MDP). The state is defined as the current hidden state of the decoder, which encapsulates all information about the decoding history. The action is the choice of a model from the pool. The transition dynamics follow the recurrent update of the decoder given the selected input.
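As a minimal illustration of this setup, one decoding step can be sketched as follows. The auxiliary forecasters below (persistence, moving average, linear drift) are simple stand-ins for the ARIMA/Prophet/LightGBM-style models in the paper's pool, and the function names are our own:

```python
import numpy as np

# Hypothetical stand-ins for the auxiliary pool (ARIMA, Prophet, etc.):
# each maps the observed history to a one-step-ahead candidate value.
aux_models = [
    lambda hist: hist[-1],                          # naive persistence
    lambda hist: float(np.mean(hist[-3:])),         # short moving average
    lambda hist: hist[-1] + (hist[-1] - hist[-2]),  # linear drift
]

def candidate_inputs(history, decoder_prediction):
    """Action space at one decoding step: one candidate per auxiliary
    model, plus the decoder's own previous prediction (the purely
    autoregressive option is itself one of the actions)."""
    return [m(history) for m in aux_models] + [decoder_prediction]

history = [1.0, 1.2, 1.4]
cands = candidate_inputs(history, decoder_prediction=1.5)
# The policy network then selects one index from this candidate list,
# conditioned on the decoder's current hidden state.
```

The selected candidate is fed into the recurrent decoder update, which produces the next hidden state (the next MDP state) and the next prediction.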
The reward function combines two components: (1) a rank‑based term that encourages the policy to pick models that have higher accuracy rankings in the pool, and (2) an accuracy term that directly reflects the absolute error of the decoder’s prediction after using the selected input. A hyperparameter α balances these components, while β scales the error term into a reward‑compatible range.
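A rough sketch of such a reward is given below. The exact functional form used in the paper is not reproduced here; the linear rank normalization and the signed error term are our assumptions, intended only to show how α trades off the two components and how β rescales the error:

```python
def step_reward(chosen_idx, candidate_errors, decoder_error,
                alpha=0.5, beta=1.0):
    """Hypothetical two-term reward for one decoding step.

    - rank term: rewards choosing a candidate that ranks well (low error)
      within the pool, normalized to [0, 1]
    - accuracy term: -beta * decoder_error, the (scaled) error of the
      decoder's prediction after consuming the selected input

    alpha balances the two terms; beta maps the raw error into a
    reward-compatible range.
    """
    # Rank candidates by their individual errors (rank 0 = most accurate).
    order = sorted(range(len(candidate_errors)),
                   key=lambda i: candidate_errors[i])
    rank_of_choice = order.index(chosen_idx)
    rank_reward = 1.0 - rank_of_choice / (len(candidate_errors) - 1)
    accuracy_reward = -beta * decoder_error
    return alpha * rank_reward + (1 - alpha) * accuracy_reward
```

With α = 1 the policy is trained purely to imitate the per-step ranking of the pool; with α = 0 it optimizes only the decoder's realized error.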
A stochastic policy network πθ(a|s) parameterized by θ outputs a probability distribution over the candidate models conditioned on the current decoder state. The policy is trained with the REINFORCE algorithm (a classic policy‑gradient method). For each episode (i.e., a full forecasting horizon of H steps), the cumulative discounted return G_k = Σ_{t=k}^{H} γ^{t−k} r_t is computed, and the gradient update follows the standard REINFORCE estimator ∇θ J(θ) = E[Σ_{k=1}^{H} G_k ∇θ log πθ(a_k|s_k)], approximated with Monte Carlo rollouts over sampled episodes.
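The return computation and the REINFORCE gradient can be sketched concretely. The snippet below uses a linear-softmax policy purely as a stand-in for the paper's neural policy network; the decoder hidden states are represented as plain feature vectors:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """G_k = sum_{t=k}^{H} gamma^(t-k) * r_t, computed backwards
    over one episode (one full forecasting horizon)."""
    G = np.zeros(len(rewards))
    running = 0.0
    for k in reversed(range(len(rewards))):
        running = rewards[k] + gamma * running
        G[k] = running
    return G

def reinforce_gradient(theta, states, actions, returns):
    """Monte Carlo REINFORCE estimate for a linear-softmax policy
    pi_theta(a|s) = softmax(theta @ s), a simplified stand-in for the
    paper's policy network over candidate models:

        grad J = sum_k G_k * grad log pi_theta(a_k | s_k)
    """
    grad = np.zeros_like(theta)
    for s, a, G in zip(states, actions, returns):
        logits = theta @ s
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        # For softmax, grad log pi(a|s) = outer(one_hot(a) - probs, s).
        one_hot = np.zeros(len(probs))
        one_hot[a] = 1.0
        grad += G * np.outer(one_hot - probs, s)
    return grad
```

Each rollout collects (state, action, reward) triples over the H decoding steps; the returns are then computed backwards and plugged into the gradient estimate, which is ascended on θ.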