Fidel-TS: A High-Fidelity Multimodal Benchmark for Time Series Forecasting
The evaluation of time series forecasting models is hindered by a critical lack of high-quality benchmarks, leading to a potential illusion of progress. Existing datasets suffer from issues ranging from pre-training data contamination in the age of LLMs to the temporal and description leakage prevalent in early multimodal designs. To address this, we formalize the core principles of high-fidelity benchmarking, focusing on data sourcing integrity, leak-free and causally sound design, and structural clarity. We introduce Fidel-TS, a new large-scale benchmark built from the ground up on these principles by sourcing data from live APIs. Our experiments expose flaws in previous benchmarks and the biases they introduce into model evaluation, yielding new insights into existing forecasting models and LLMs across a range of evaluation tasks.
💡 Research Summary
The paper identifies critical shortcomings in current time‑series forecasting benchmarks—namely pre‑training contamination, temporal leakage, and description leakage—that have led to an inflated perception of progress, especially with the rise of large language models (LLMs). Classic unimodal datasets such as ETT, Electricity, and M4 are small, outdated, and ambiguously structured, making them unsuitable for evaluating modern, high‑capacity models. Early multimodal benchmarks like TimeMMD attempted to add textual context but introduced severe flaws: low sampling frequencies, reliance on web‑scraped text that is likely already present in LLM training corpora, and uncontrolled temporal and description leakage.
To address these issues, the authors formalize three high‑fidelity benchmarking principles: (1) Data Sourcing Integrity – all data must be drawn from authenticated, continuously updated real‑time APIs, guaranteeing freshness, high frequency (5 min to 1 h), and minimal overlap with public pre‑training corpora; (2) Leak‑Free and Causally Sound Design – only exogenous textual information that can be known ahead of time (e.g., weather forecasts) is incorporated, eliminating temporal leakage and ensuring that text never directly describes the target variable, thus avoiding description leakage; (3) Structural Clarity – a clear separation between “Subjects” (e.g., individual sensors, regions) and “Channels” (the observed variables for each subject) is enforced, enabling rigorous evaluation of both within‑subject generalization and cross‑subject transfer.
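The Subject/Channel separation in principle (3) can be sketched as a small data model. This is an illustrative sketch only; the class and field names below are hypothetical and not the benchmark's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Channel:
    """One observed variable for a subject (hypothetical structure)."""
    name: str        # e.g. "power_output"
    unit: str        # e.g. "kW"
    values: list     # the raw time series for this channel

@dataclass
class Subject:
    """One sensor or region, holding its own set of channels."""
    subject_id: str
    description: str                               # per-subject static documentation
    channels: dict = field(default_factory=dict)   # channel name -> Channel

    def add_channel(self, ch: Channel) -> None:
        self.channels[ch.name] = ch

# Example: one photovoltaic site with two observed channels.
sensor = Subject("pv_site_01", "Rooftop PV array, Canada")
sensor.add_channel(Channel("power_output", "kW", [1.2, 1.5, 1.4]))
sensor.add_channel(Channel("irradiance", "W/m^2", [600, 720, 680]))
```

Keeping channels nested under their subject makes the two evaluation regimes explicit: within-subject generalization holds the subject fixed, while cross-subject transfer trains on some subjects and tests on held-out ones.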
Guided by these principles, the authors construct Fidel‑TS, a large‑scale multimodal benchmark comprising millions of high‑frequency data points across diverse domains (photovoltaics in Canada, renewable energy grids in Germany, atmospheric physics, NYC traffic, CAISO electricity, etc.). Each dataset includes a static metadata layer (global description, per‑subject and per‑channel documentation) and a dynamic textual layer (weather forecasts, scheduled maintenance events, sensor‑downtime descriptions). The data curation pipeline handles real‑world imperfections: short gaps are linearly interpolated, while long gaps are treated as explicit “sensor downtime” events with accompanying textual annotations, turning data quality issues into learning signals.
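The gap-handling rule described above can be sketched as follows. The threshold separating "short" from "long" gaps and the annotation format are assumptions for illustration; the paper does not specify them here.

```python
def fill_gaps(series, max_short_gap=3):
    """Sketch of the curation rule: gaps of up to `max_short_gap` missing
    points are linearly interpolated; longer (or edge) gaps are left as
    None and reported as annotated 'sensor downtime' events."""
    series = list(series)
    downtime_events = []
    i, n = 0, len(series)
    while i < n:
        if series[i] is None:
            # find the extent of this run of missing values
            j = i
            while j < n and series[j] is None:
                j += 1
            gap_len = j - i
            if gap_len <= max_short_gap and i > 0 and j < n:
                # short interior gap: interpolate between the neighbors
                lo, hi = series[i - 1], series[j]
                for k in range(gap_len):
                    series[i + k] = lo + (hi - lo) * (k + 1) / (gap_len + 1)
            else:
                # long gap: keep it missing, emit a textual annotation instead
                downtime_events.append({
                    "start": i, "end": j,
                    "text": f"sensor downtime for {gap_len} steps",
                })
            i = j
        else:
            i += 1
    return series, downtime_events
```

The key design choice, as the summary notes, is that long outages are not silently imputed: they become explicit textual events that a multimodal model can condition on, turning a data-quality problem into a learning signal.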
The benchmark is released with multiple evaluation splits: observed (in‑domain), hidden (zero‑shot), down‑sampled (‑h), and importance‑sampled mini‑sets (‑mini), supporting both realistic operational testing and controlled ablations.
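A "-h"-style down-sampled split can be produced by aggregating non-overlapping blocks of high-frequency readings. Block-mean aggregation is an assumption here; the benchmark may use a different reduction.

```python
def downsample(values, factor):
    """Average non-overlapping blocks of `factor` consecutive points,
    e.g. 5-minute readings -> hourly means with factor=12 (assumed semantics)."""
    n_blocks = len(values) // factor  # any trailing partial block is dropped
    return [
        sum(values[i * factor:(i + 1) * factor]) / factor
        for i in range(n_blocks)
    ]

# Example: twelve readings aggregated into three blocks of four.
hourly = downsample(list(range(12)), factor=4)  # -> [1.5, 5.5, 9.5]
```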
Extensive experiments compare a suite of state‑of‑the‑art time‑series models (Informer, Autoformer, N‑HiTS, etc.) and several LLMs (GPT‑4, LLaMA‑2, Falcon) under a unified framework. Results reveal three key findings: (1) No single unimodal model dominates across all subjects; foundation models for time series do not consistently deliver the promised zero‑shot robustness; (2) Multimodal gains are conditional—benefits appear only when the textual modality is causally relevant (e.g., weather for solar or electricity) and when the model architecture can effectively fuse heterogeneous inputs; (3) While LLMs excel at general reasoning, their forecasting accuracy and reliability degrade sharply in high‑frequency, long‑term, leak‑free settings, often underperforming dedicated forecasting architectures.
By exposing the inflated performance reported on legacy benchmarks and providing a rigorously constructed, leakage‑free alternative, Fidel‑TS sets a new standard for trustworthy evaluation of both traditional time‑series models and emerging multimodal/LLM approaches. The authors also open‑source the full data pipeline and code, encouraging the community to adopt these high‑fidelity practices for future research.