fev-bench: A Realistic Benchmark for Time Series Forecasting
Benchmark quality is critical for meaningful evaluation and sustained progress in time series forecasting, particularly with the rise of pretrained models. Existing benchmarks often have limited domain coverage or overlook real-world settings such as tasks with covariates. Their aggregation procedures frequently lack statistical rigor, making it unclear whether observed performance differences reflect true improvements or random variation. Many benchmarks lack consistent evaluation infrastructure or are too rigid for integration into existing pipelines. To address these gaps, we propose fev-bench, a benchmark of 100 forecasting tasks across seven domains, including 46 with covariates. Supporting the benchmark, we introduce fev, a lightweight Python library for forecasting evaluation emphasizing reproducibility and integration with existing workflows. Using fev, fev-bench employs principled aggregation with bootstrapped confidence intervals to report performance along two dimensions: win rates and skill scores. We report results on fev-bench for pretrained, statistical, and baseline models and identify promising future research directions.
💡 Research Summary
The paper addresses a pressing need in the rapidly evolving field of time‑series forecasting: a high‑quality, statistically rigorous benchmark that can fairly evaluate both pretrained neural models and traditional statistical approaches. Existing benchmarks suffer from limited domain coverage, a lack of covariate‑rich tasks, reliance on single‑number summaries, and brittle, monolithic evaluation pipelines that hinder reproducibility and integration with modern workflows.
To fill these gaps the authors introduce fev‑bench, a benchmark comprising 100 forecasting tasks drawn from seven real‑world domains (energy, nature, cloud, mobility, economics, health, retail). The tasks are constructed from 96 distinct datasets sourced from the Monash repository, GIFT‑Eval, BOOM, Kaggle, and domain‑specific collections. Importantly, 46 of the tasks include covariates—static identifiers, past‑only dynamic series, and known‑future variables such as holidays—thereby reflecting the multivariate, covariate‑driven forecasting scenarios encountered in practice. The benchmark deliberately avoids duplicating horizons within the same dataset; instead, it selects horizon lengths that are meaningful for each domain (e.g., 168‑step hourly forecasts for energy, 30‑day forecasts for retail).
Evaluation metrics are chosen to be scale-independent and to cover both point and probabilistic forecasting. Point forecasts are assessed with Mean Absolute Scaled Error (MASE), while probabilistic forecasts use Scaled Quantile Loss (SQL) computed over the quantile levels 0.1–0.9. Both metrics are normalized by the in-sample error of a seasonal naive forecast, which makes them scale-invariant and comparable across series with different magnitudes and seasonal patterns. The authors also report Weighted Quantile Loss (WQL) and Weighted APE for completeness, but SQL is positioned as the primary probabilistic metric because it shares MASE's normalization and is under-used in existing benchmarks.
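As a rough sketch of these two scale-free metrics (not the fev implementation), MASE and SQL can be computed as below. Both divide by the in-sample MAE of a seasonal naive forecast; the factor of 2 in SQL, which makes the 0.5-quantile term coincide with absolute error, is an assumption of this sketch:

```python
import numpy as np

def seasonal_naive_scale(y_train, season=1):
    """In-sample MAE of the seasonal naive forecast, used as the
    normalizer for both MASE and SQL."""
    return np.mean(np.abs(y_train[season:] - y_train[:-season]))

def mase(y_true, y_pred, y_train, season=1):
    """Mean Absolute Scaled Error: forecast MAE divided by the
    seasonal-naive scale."""
    return np.mean(np.abs(y_true - y_pred)) / seasonal_naive_scale(y_train, season)

def sql(y_true, q_pred, quantiles, y_train, season=1):
    """Scaled Quantile Loss: mean pinball loss over the quantile levels,
    normalized like MASE. q_pred has shape (n_quantiles, horizon)."""
    q = np.asarray(quantiles)[:, None]
    diff = y_true[None, :] - q_pred
    pinball = np.maximum(q * diff, (q - 1) * diff)
    # Factor of 2 (an assumption here) makes SQL at q=0.5 equal MASE.
    return 2 * np.mean(pinball) / seasonal_naive_scale(y_train, season)
```

With a single 0.5 quantile, `sql` reduces to `mase`, which is one motivation for pairing the two metrics.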
The core methodological contribution lies in the aggregation framework. After evaluating M models on R tasks, an error matrix E (R × M) is formed. Two complementary aggregate scores are defined:
- Average win rate (Wj) – the probability that model j achieves a lower error than a randomly chosen competing model on a randomly selected task, with ties counting as half-wins. This metric provides an intuitive relative ranking but ignores the magnitude of differences.
- Skill score (Sj) – the normalized reduction in error relative to a fixed baseline (Seasonal Naive). Errors are clipped to a bounded interval before aggregation so that a few extreme tasks cannot dominate the score.
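These two aggregates, together with the bootstrapped confidence intervals mentioned in the abstract, can be sketched directly from the error matrix E. This is an illustrative NumPy sketch, not fev's API; in particular, the geometric-mean aggregation and the clip bounds in `skill_scores` are assumptions, since the summary does not state the exact clipping interval:

```python
import numpy as np

rng = np.random.default_rng(0)

def win_rates(E):
    """Average win rate per model: probability of a strictly lower error
    than a randomly chosen competitor on a randomly chosen task, with
    ties counting as half-wins. E has shape (R tasks, M models)."""
    R, M = E.shape
    strict = (E[:, :, None] < E[:, None, :]).sum(axis=(0, 2))
    # Tie count includes each model tying with itself once per task;
    # subtract that self-comparison (0.5 per task) afterwards.
    ties = 0.5 * (E[:, :, None] == E[:, None, :]).sum(axis=(0, 2)) - 0.5 * R
    return (strict + ties) / (R * (M - 1))

def skill_scores(E, baseline=0, clip=(1e-3, 5.0)):
    """Skill score vs. a baseline column (e.g. Seasonal Naive): one minus
    the geometric mean of clipped per-task relative errors. The geometric
    mean and the clip bounds are assumptions of this sketch."""
    rel = np.clip(E / E[:, [baseline]], *clip)
    return 1.0 - np.exp(np.mean(np.log(rel), axis=0))

def bootstrap_ci(E, stat, n_boot=1000, alpha=0.05):
    """Percentile bootstrap over tasks: resample rows of E with
    replacement and recompute the aggregate statistic."""
    R = E.shape[0]
    samples = np.stack([stat(E[rng.integers(0, R, R)]) for _ in range(n_boot)])
    return np.quantile(samples, [alpha / 2, 1 - alpha / 2], axis=0)
```

Resampling tasks (rows) rather than individual forecasts is what makes the resulting intervals speak to whether a ranking would survive a different draw of benchmark tasks.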