A Lightweight Multi-View Approach to Short-Term Load Forecasting

Notice: This research summary and analysis were generated automatically with AI assistance. For authoritative details, please refer to the original arXiv source.

Time series forecasting is a critical task across domains such as energy, finance, and meteorology, where accurate predictions enable informed decision-making. While transformer-based and large-parameter models have recently achieved state-of-the-art results, their complexity can lead to overfitting and unstable forecasts, especially when older data points become less relevant. In this paper, we propose a lightweight multi-view approach to short-term load forecasting that leverages single-value embeddings and a scaled time-range input to capture temporally relevant features efficiently. We introduce an embedding dropout mechanism to prevent over-reliance on specific features and enhance interpretability. Our method achieves competitive performance with significantly fewer parameters, demonstrating robustness across multiple datasets, including scenarios with noisy or sparse data, and provides insights into the contributions of individual features to the forecast.


💡 Research Summary

The paper addresses the practical challenges of short‑term electric load forecasting, where recent advances in transformer‑based deep learning have achieved state‑of‑the‑art accuracy but at the cost of large model sizes, high computational demand, and a tendency to overfit when historical data become less relevant. To mitigate these issues, the authors propose a lightweight multi‑view architecture that combines single‑value embeddings with a scaled time‑range input representation.

Core ideas

  1. Multi‑view encoding – Each exogenous variable (temperature, humidity, holiday flags, etc.) is treated as a separate “view”. Instead of learning high‑dimensional embeddings for each view, the method reduces each view to a compact vector, often a single scalar (single‑value embedding). This drastically cuts the number of trainable parameters.
  2. Scaled time‑range input – Rather than feeding a uniformly spaced sliding window, the model selects a small set of meaningful lags: one year, one week, one day, 12 h, 6 h, and the last five hourly steps. The resulting 10‑point lag matrix captures seasonal patterns and short‑term dynamics while keeping the input dimension tiny.
  3. Embedding dropout – During training, each non‑target view’s embedding is randomly masked with probability p (Bernoulli dropout). This prevents the network from relying too heavily on any single view, improves robustness to noisy or missing features, and yields interpretable importance scores when the dropout mask is examined.
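The scaled time-range input and the embedding-dropout mechanism above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the helper names `build_lag_vector` and `embedding_dropout` are hypothetical, and the exact lag set is the one listed in the summary (one year, one week, one day, 12 h, 6 h, and the last five hourly steps of an hourly series).

```python
import numpy as np

# Lags, in hours, for the 10-point scaled time-range input described above:
# one year, one week, one day, 12 h, 6 h, and the five most recent steps.
LAGS_HOURS = [8760, 168, 24, 12, 6, 5, 4, 3, 2, 1]

def build_lag_vector(series: np.ndarray, t: int) -> np.ndarray:
    """Gather the 10-point lag input for a forecast origin at time t."""
    return np.array([series[t - lag] for lag in LAGS_HOURS])

def embedding_dropout(view_embeddings: np.ndarray, p: float,
                      rng: np.random.Generator) -> np.ndarray:
    """Bernoulli-mask each view's embedding (rows) with probability p."""
    keep = rng.random(view_embeddings.shape[0]) >= p  # one draw per view
    return view_embeddings * keep[:, None]

# Example: two years of hourly data, forecast origin at the last step.
rng = np.random.default_rng(0)
series = rng.standard_normal(2 * 8760)
lags = build_lag_vector(series, t=len(series) - 1)
print(lags.shape)  # (10,)
```

Masking whole views (rows) rather than individual scalars is what lets the dropout mask double as an importance probe: zeroing one view and observing the change in forecast error isolates that view's contribution.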

Model architecture
The compact embeddings are stacked into a tensor Z, which is fed to a standard transformer encoder to produce contextual hidden states H. A decoder stack then performs cross‑attention between H and the original lagged inputs U, finally projecting to the forecast horizon τ. The authors experiment with three aggregation strategies (additive, concatenative, and the single‑value case) and find that the single‑value configuration already delivers competitive results.
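A shape-level walkthrough of that pipeline may help. The sketch below is not the authors' implementation: all weight matrices are random placeholders, and a single `tanh` projection stands in for the transformer encoder. It only traces how the stacked embeddings Z, the hidden states H, the lagged inputs U, and the horizon τ fit together through cross-attention.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, L, tau = 8, 16, 10, 24     # views, hidden size, lag count, horizon

Z = rng.standard_normal((V, 1))          # one single-value embedding per view
W_in = rng.standard_normal((1, d))
H = np.tanh(Z @ W_in)                    # (V, d)  stand-in encoder output

U = rng.standard_normal((L, 1))          # 10-point lagged target inputs
W_q = rng.standard_normal((1, d))
Q = U @ W_q                              # (L, d)  decoder queries

# Scaled dot-product cross-attention between lag queries and view states.
scores = Q @ H.T / np.sqrt(d)            # (L, V)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
context = weights @ H                    # (L, d)

W_out = rng.standard_normal((L * d, tau))
forecast = context.reshape(-1) @ W_out   # (tau,) point forecast
print(forecast.shape)  # (24,)
```

Note how small the view side stays: with single-value embeddings, the per-view parameter count is dominated by the shared projections rather than per-view lookup tables, which is where the parameter savings come from.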

Experimental setup
Four real‑world datasets are used: two proprietary Hydro‑Québec datasets (a single‑village REM set and a national HQ set), the public Ontario IESO dataset, and the public Panama CND dataset. Each contains hourly load plus eight weather and calendar features. Training uses four years of data, validation the following year, and testing the most recent year; for the one‑year REM set, a rolling‑window scheme (two months of training, one month of testing) is applied instead. All models are trained with AdamW (learning rate 0.001), cosine annealing, 300 epochs, and batch size 30 on a Tesla V100 GPU. Baselines include TiDE, SARIMAX, the Temporal Fusion Transformer (TFT), and N‑BEATS, implemented via Darts or statsmodels, with extensive hyper‑parameter tuning and identical cross‑validation protocols.
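The learning-rate schedule paired with AdamW is cosine annealing. As a minimal sketch, assuming the rate decays from the initial 0.001 to a floor of zero over the 300 training epochs (the floor value is an assumption, not stated in the summary):

```python
import math

def cosine_annealed_lr(epoch: int, total_epochs: int = 300,
                       lr_max: float = 1e-3, lr_min: float = 0.0) -> float:
    """Learning rate at a given epoch under cosine annealing."""
    return lr_min + 0.5 * (lr_max - lr_min) * (
        1 + math.cos(math.pi * epoch / total_epochs))

print(cosine_annealed_lr(0))    # 0.001 at the start
print(cosine_annealed_lr(300))  # ~0 at the end of training
```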

Results
The proposed model contains only a few hundred thousand parameters (versus several million for large transformers) and achieves mean absolute percentage error (MAPE) comparable to or slightly better than the baselines across all horizons (1‑day, 2‑day, 7‑day). The advantage is most pronounced on the noisy CND dataset and the limited‑size REM dataset, where baseline deep models tend to overfit. Ablation studies show that removing embedding dropout degrades performance, confirming its role in regularization and interpretability. Visualizations of learned embedding weights illustrate how the model distributes importance across views and how dropout isolates individual contributions.
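MAPE is the headline metric in these comparisons. For reference, a standard implementation (not specific to this paper; note that MAPE is undefined when the actual load is zero, which is rarely an issue for aggregate load series):

```python
import numpy as np

def mape(actual, forecast) -> float:
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.mean(np.abs((actual - forecast) / actual)) * 100.0)

print(mape([100.0, 200.0], [110.0, 190.0]))  # 7.5
```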

Limitations and future work
The scaled lag selection relies on domain knowledge; an automated lag‑selection mechanism could broaden applicability. Single‑value embeddings may struggle with complex non‑linear interactions, suggesting a hybrid approach that mixes low‑dimensional and richer embeddings. The current work focuses on point forecasts; extending to probabilistic forecasts (prediction intervals) would increase utility for risk‑aware decision making.

Conclusion
By integrating compact single‑value embeddings, a carefully chosen lag set, and embedding dropout, the authors deliver a parsimonious yet accurate forecasting system. The approach demonstrates that, for short‑term load forecasting, massive transformer models are not always necessary; a well‑designed lightweight architecture can achieve state‑of‑the‑art performance while being far more suitable for real‑time deployment and environments with limited computational resources.

