Empowering Time Series Analysis with Large-Scale Multimodal Pretraining

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

While existing time series foundation models primarily rely on large-scale unimodal pretraining, they lack complementary modalities to enhance time series understanding. Building multimodal foundation models is a natural next step, but it faces two key challenges: 1) the lack of a unified multimodal pretraining paradigm and of large-scale multimodal corpora for time series analysis; and 2) the difficulty of effectively integrating heterogeneous modalities while enhancing model generalization. To address these challenges, we take an early step toward multimodal foundation models for time series analysis. We first propose a multimodal pretraining paradigm that leverages time series with endogenous modalities (derived images and text) and exogenous knowledge (real-world news), providing a comprehensive multi-view perspective for time series analysis. To support this, we develop an automated data construction pipeline to curate MM-TS, the first large-scale multimodal time series dataset, spanning six domains with up to one billion points. We then propose HORAI, a frequency-enhanced multimodal foundation model. It integrates two core components, the Frequency-enhanced Cross-Modality Encoder and the Time-Frequency Decoder, designed to effectively fuse multimodal features and enhance model generalization across modalities and domains. After pretraining on MM-TS, HORAI achieves state-of-the-art zero-shot performance on time series forecasting and anomaly detection tasks, demonstrating strong generalization.


💡 Research Summary

The paper addresses a fundamental limitation of current time‑series foundation models (TSFMs), which are trained solely on numerical time‑series data and thus miss the rich contextual cues available in other modalities. To overcome this, the authors introduce a multimodal pre‑training paradigm that augments raw time‑series with two endogenous modalities—derived line‑plot images and automatically generated descriptive text—and one exogenous modality—real‑world news articles.

A fully automated pipeline is built to create MM‑TS, the first large‑scale multimodal time‑series dataset. For each raw series, a large language model (GPT‑4o) extracts trend, seasonality, and stationarity information and converts it into structured “endogenous” text. Simultaneously, the GDELT news database is queried using keywords derived from the series; retrieved articles are summarized to form “exogenous” news text. Logical consistency between the two texts is enforced by a filtering step, and multiple LLM judges score plausibility and relevance, retaining only high‑confidence pairs. Images are generated by plotting the series as line charts. The final corpus spans six domains (energy, healthcare, web, nature, transport, economics), multiple temporal granularities (seconds to months), and contains over one billion time points.
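The trend/seasonality/stationarity extraction step is performed by GPT-4o in the paper; as a hedged illustration of the *kind* of facts the pipeline turns into "endogenous" text, here is a minimal rule-based stand-in (the specific heuristics — a linear fit for trend, an autocorrelation peak for periodicity, a two-half variance comparison for stationarity — are this sketch's assumptions, not the paper's method):

```python
import numpy as np

def describe_series(x):
    """Produce a tiny 'endogenous text' summary of a 1-D series.

    Crude stand-in for the paper's LLM-based extraction step:
    only illustrates the trend/seasonality/stationarity facts
    the MM-TS pipeline converts into structured text.
    """
    t = np.arange(len(x))
    slope = np.polyfit(t, x, 1)[0]                    # linear trend estimate
    trend = "upward" if slope > 0 else "downward or flat"

    # Crude seasonality probe: lag of the strongest positive autocorrelation.
    xc = x - x.mean()
    ac = np.correlate(xc, xc, mode="full")[len(x):]   # lags 1 .. len(x)-1
    period = int(np.argmax(ac)) + 1

    # Crude stationarity probe: compare the variances of the two halves.
    v1 = np.var(x[: len(x) // 2])
    v2 = np.var(x[len(x) // 2 :])
    stationary = ("roughly stationary"
                  if max(v1, v2) < 2 * min(v1, v2) + 1e-9
                  else "non-stationary")

    return (f"The series shows a {trend} trend, a dominant period "
            f"near lag {period}, and appears {stationary}.")

x = np.arange(100, dtype=float) + np.sin(np.arange(100) / 5.0)
summary = describe_series(x)
```

In the actual pipeline, a summary like this would then be cross-checked against the retrieved GDELT news text by the LLM-judge filtering step before the pair is admitted to MM-TS.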

On the modeling side, the authors propose HORAI, a frequency‑enhanced multimodal foundation model. HORAI consists of two core components:

  1. Frequency‑enhanced Cross‑Modality Encoder – The raw series is first normalized and transformed to the frequency domain via FFT. A tunable ratio α defines a cutoff τ that separates low‑frequency components (capturing long‑term trends) from mid‑to‑high‑frequency components (capturing rapid fluctuations). Inverse FFT reconstructs two time‑domain sequences, which are then patched and embedded. Low‑frequency patches are aligned with textual embeddings (both endogenous and exogenous), while high‑frequency patches are aligned with visual embeddings from the line‑plot images. A modality‑specific attention fusion layer incorporates the frequency masks into the attention scores, ensuring that each modality contributes according to its frequency relevance.

  2. Time‑Frequency Decoder – The fused multimodal representation is fed into a Mixture‑of‑Experts Feed‑Forward Network (MoE‑FFN). A novel Time‑Frequency router decides which expert processes each token by jointly considering the token’s temporal position and its associated frequency band. This dual conditioning provides extra discriminative cues, allowing the model to separate patterns that are temporally similar but spectrally distinct, thereby improving cross‑domain generalization. The decoder outputs are projected to token logits for autoregressive pre‑training.
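The encoder's frequency decomposition in step 1 can be sketched with a plain rFFT split. This is a minimal reconstruction under stated assumptions: the cutoff τ is taken as the lowest α fraction of rFFT bins, and the exact α value and banding scheme are illustrative, not the paper's published configuration:

```python
import numpy as np

def frequency_split(x, alpha=0.1):
    """Split a 1-D series into low- and high-frequency components.

    tau keeps the lowest alpha fraction of rFFT bins as the
    "trend" band; the complementary bins form the "fluctuation"
    band. Inverse FFT maps both back to the time domain, as in
    the paper's encoder, before patching and embedding.
    """
    spec = np.fft.rfft(x)
    tau = max(1, int(alpha * len(spec)))       # cutoff index from ratio alpha
    low_spec = np.zeros_like(spec)
    low_spec[:tau] = spec[:tau]                # low-frequency bins only
    high_spec = spec - low_spec                # complementary mid/high bins
    low = np.fft.irfft(low_spec, n=len(x))     # long-term trend component
    high = np.fft.irfft(high_spec, n=len(x))   # rapid-fluctuation component
    return low, high

t = np.linspace(0, 1, 256, endpoint=False)
x = np.sin(2 * np.pi * 2 * t) + 0.3 * np.sin(2 * np.pi * 40 * t)
low, high = frequency_split(x, alpha=0.1)      # low + high reconstructs x
```

Because the two bands are complementary, `low + high` recovers the original series exactly; in HORAI the low band would then be aligned with the text embeddings and the high band with the image embeddings.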

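The Time-Frequency router in step 2 can likewise be sketched as a gate whose logits combine token content with a temporal-position bias and a frequency-band bias. The parameterization below (random stand-in matrices, additive biases, top-1 routing) is an assumption for illustration; the paper does not publish the router's exact form here:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def time_frequency_route(tokens, positions, bands, n_experts=4):
    """Pick one MoE expert per token from joint time/frequency cues.

    tokens:    (T, d) fused multimodal patch embeddings
    positions: (T,)   integer temporal position of each patch
    bands:     (T,)   0 = low-frequency patch, 1 = high-frequency patch
    The weight matrices are untrained random stand-ins.
    """
    T, d = tokens.shape
    W_tok = rng.standard_normal((d, n_experts))     # content pathway
    pos_emb = rng.standard_normal((T, n_experts))   # temporal-position bias
    band_emb = rng.standard_normal((2, n_experts))  # frequency-band bias
    logits = tokens @ W_tok + pos_emb[positions] + band_emb[bands]
    gates = softmax(logits)                         # (T, n_experts)
    return gates.argmax(axis=-1), gates             # top-1 expert per token

tokens = rng.standard_normal((8, 16))
positions = np.arange(8)
bands = np.array([0, 0, 0, 0, 1, 1, 1, 1])          # low then high patches
experts, gates = time_frequency_route(tokens, positions, bands)
```

The band term is what lets the router send two temporally similar but spectrally distinct tokens to different experts, which is the discriminative cue the paper credits for improved cross-domain generalization.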
HORAI is pre‑trained on MM‑TS using a standard autoregressive objective. The authors evaluate zero‑shot and few‑shot performance on two downstream tasks: (a) time‑series forecasting and (b) anomaly detection. Compared with state‑of‑the‑art TSFMs such as Timer, MOIRAI, ROSE, and Sundial, HORAI reduces mean squared error by an average of 12% on forecasting across all six domains, with the most pronounced gains on high‑frequency electricity load data. For anomaly detection, HORAI achieves an F1‑score of 0.92, surpassing the best baseline by more than 5 percentage points. Few‑shot fine‑tuning experiments show that the multimodal pre‑training dramatically reduces the amount of labeled data needed to reach comparable performance.

The contributions are threefold: (1) a unified multimodal pre‑training paradigm and the MM‑TS dataset, (2) the HORAI architecture that explicitly leverages frequency information to align and fuse heterogeneous modalities, and (3) extensive empirical evidence that multimodal pre‑training yields superior zero‑shot and data‑efficient performance on core time‑series tasks. The work opens avenues for incorporating additional exogenous sources (e.g., GIS, social media streams) and for deploying real‑time multimodal inference in production environments, potentially transforming how industries such as energy management, healthcare monitoring, and financial forecasting handle complex temporal data.

