Statistical benchmarking of transformer models in low signal-to-noise time-series forecasting
We study the performance of transformer architectures for multivariate time-series forecasting in low-data regimes consisting of only a few years of daily observations. Using synthetically generated processes with known temporal and cross-sectional dependency structures and varying signal-to-noise ratios, we conduct bootstrapped experiments that enable direct evaluation via out-of-sample correlations with the optimal ground-truth predictor. We show that two-way attention transformers, which alternate between temporal and cross-sectional self-attention, can outperform standard baselines (Lasso, boosting methods, and fully connected multilayer perceptrons) across a wide range of settings, including low signal-to-noise regimes. We further introduce a dynamic sparsification procedure for attention matrices applied during training, and demonstrate that it is particularly effective in noisy environments, where the correlation between the target variable and the optimal predictor is on the order of a few percent. Analysis of the learned attention patterns reveals interpretable structure and suggests connections to sparsity-inducing regularization in classical regression, providing insight into why these models generalize effectively under noise.
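The alternating-attention idea from the abstract can be sketched in a few lines. This is a minimal numpy illustration, not the paper's architecture: it uses a single head, identity query/key/value projections, and residual connections only, omitting layer normalization, feed-forward sublayers, and learned weights.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # stabilize before exponentiating
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (..., seq, d); scaled dot-product attention with identity projections
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)  # (..., seq, seq)
    return softmax(scores) @ x

def two_way_block(h):
    # h: (T, N, d) — time steps, series, model dimension
    # Temporal attention: each series attends over its own time axis.
    h = h + self_attention(np.swapaxes(h, 0, 1)).swapaxes(0, 1)
    # Cross-sectional attention: each time step attends across series.
    h = h + self_attention(h)
    return h

h = np.random.default_rng(0).standard_normal((60, 8, 16))
out = two_way_block(h)
print(out.shape)  # (60, 8, 16)
```

The key point is that the input keeps its (time × series × feature) tensor shape throughout, so temporal and cross-sectional dependencies are modeled by separate attention passes rather than being flattened away.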
💡 Research Summary
The paper presents a rigorous benchmark of transformer‑based models for multivariate time‑series forecasting under low‑signal‑to‑noise conditions, where only a few years of daily observations are available. The authors generate synthetic datasets with known temporal and cross‑sectional dependency structures and systematically vary the signal‑to‑noise ratio (ρ) from 2% to 50%. Each dataset is built by sampling predictor series X⁽ʲ⁾ₜ,ₙ from a standard normal distribution and then constructing the latent optimal predictor Ỹₜ,ₙ as a weighted sum of one or more "effects": (i) a simple linear combination (Order 0), (ii) a time‑shifted version (TS‑Shift), (iii) a cross‑sectional shift (CS‑Shift), (iv) a non‑linear feature interaction (Fea‑Nonlin), and (v) combined time‑ and cross‑sectional shifts or non‑linear interactions (Order 2). The observed target Yₜ,ₙ is formed by mixing Ỹₜ,ₙ with independent Gaussian noise Zₜ,ₙ so that Corr(Y, Ỹ) = ρ. This construction yields a ground‑truth optimal predictor against which all models are evaluated by computing the out‑of‑sample correlation between model forecasts and Ỹ.
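The mixing step described above can be sketched as follows. The sizes, the choice of an Order-0 linear effect, and the weight vector are illustrative assumptions; the essential point is that standardizing Ỹ and blending it with unit-variance noise in proportions ρ and √(1−ρ²) pins the target–predictor correlation at ρ by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, J = 750, 50, 10   # ~3 years of daily observations, 50 series, 10 features (illustrative)
rho = 0.05              # target signal-to-noise correlation

# Predictor series X^(j)_{t,n} ~ N(0, 1)
X = rng.standard_normal((T, N, J))

# Latent optimal predictor: here a simple Order-0 linear combination of the features
w = rng.standard_normal(J)
Y_tilde = X @ w
Y_tilde = (Y_tilde - Y_tilde.mean()) / Y_tilde.std()  # standardize to unit variance

# Mix with independent Gaussian noise Z so that Corr(Y, Y_tilde) = rho
Z = rng.standard_normal((T, N))
Y = rho * Y_tilde + np.sqrt(1.0 - rho**2) * Z

print(np.corrcoef(Y.ravel(), Y_tilde.ravel())[0, 1])  # ≈ rho
```

Because Ỹ is known exactly, any model's forecasts can be scored by their out-of-sample correlation with Ỹ rather than with the noisy observed target Y.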
Four baseline methods are considered: ordinary least squares (OLS), a global Lasso regression, a global gradient‑boosted tree model, and a global multilayer perceptron (MLP). All baselines flatten the three‑dimensional input (look‑back window × number of series × features) into a single vector, thereby discarding explicit temporal and cross‑sectional structure. For OLS the authors derive an analytical expression for the expected correlation C(ρ, γ).
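The flattening that all four baselines share can be sketched as below. The sizes are illustrative assumptions, and OLS is fit via plain least squares; the Lasso, boosting, and MLP baselines would consume the same flattened rows.

```python
import numpy as np

rng = np.random.default_rng(1)
T, N, J, L = 500, 20, 5, 10  # illustrative sizes: time, series, features, look-back window

X = rng.standard_normal((T, N, J))  # predictor tensor
Y = rng.standard_normal((T, N))     # observed targets

# Flatten each look-back window (L x N x J) into one long row vector,
# discarding the explicit temporal and cross-sectional structure.
rows, targets = [], []
for t in range(L, T):
    rows.append(X[t - L:t].ravel())  # (L * N * J,) flattened input
    targets.append(Y[t])             # predict all N series at time t
A = np.asarray(rows)                 # (T - L, L * N * J) design matrix
B = np.asarray(targets)              # (T - L, N)

# OLS fit via least squares
coef, *_ = np.linalg.lstsq(A, B, rcond=None)
print(A.shape, coef.shape)  # (490, 1000) (1000, 20)
```

Note that the flattened design matrix here has more columns than rows, which is precisely the overparameterized regime in which an expression like C(ρ, γ) quantifies how much of the signal OLS can recover.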