PatchFormer: A Patch-Based Time Series Foundation Model with Hierarchical Masked Reconstruction and Cross-Domain Transfer Learning for Zero-Shot Multi-Horizon Forecasting
Time series forecasting is a fundamental problem with applications in climate, energy, healthcare, and finance. Many existing approaches require domain-specific feature engineering and substantial labeled data for each task. We introduce PatchFormer, a patch-based time series foundation model that uses hierarchical masked reconstruction for self-supervised pretraining and lightweight adapters for efficient transfer. PatchFormer segments time series into patches and learns multiscale temporal representations with learnable aggregation across temporal scales. Pretraining uses masked patch reconstruction with dynamic masking and objectives that encourage both local accuracy and global consistency, followed by cross-domain knowledge distillation. Experiments on 24 benchmark datasets spanning weather, energy, traffic, finance, and healthcare demonstrate state-of-the-art zero-shot multi-horizon forecasting, reducing mean squared error by 27.3 percent relative to strong baselines while requiring 94 percent less task-specific training data. The model exhibits near log-linear scaling with more pretraining data up to 100 billion points and processes length-512 sequences 3.8x faster than full-sequence transformers.
💡 Research Summary
PatchFormer introduces a foundation model for time‑series forecasting built on patch‑based tokenization, hierarchical multi‑scale representations, and a self‑supervised pre‑training regime that combines contrastive learning with masked reconstruction. The model first slices each multivariate series into overlapping patches at several lengths (16, 32, and 64 timesteps for the base configuration) and projects each patch into a dense embedding. A learnable weighted aggregation across scales produces a unified representation that captures both fine‑grained fluctuations and long‑term trends while reducing the quadratic attention cost from O(L²) to O(L²/P²), where P is the patch length.
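The multi‑scale patching and aggregation step can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the 50 % patch overlap, mean pooling over patch tokens, and random projection matrices are assumptions, since the summary specifies only the patch lengths and the learnable cross‑scale weighting.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(x, patch_len, stride):
    """Slice a 1-D series into overlapping patches of length patch_len."""
    n = (len(x) - patch_len) // stride + 1
    return np.stack([x[i * stride : i * stride + patch_len] for i in range(n)])

def multiscale_embed(x, patch_lens=(16, 32, 64), d_model=128):
    """Hypothetical sketch of hierarchical patch embedding: each scale gets
    its own linear projection, and a softmax over learnable scalars mixes
    the scales into one pooled representation."""
    scale_logits = rng.normal(size=len(patch_lens))      # learnable in the real model
    alphas = np.exp(scale_logits) / np.exp(scale_logits).sum()
    pooled = np.zeros(d_model)
    for alpha, p in zip(alphas, patch_lens):
        patches = patchify(x, p, stride=p // 2)          # 50% overlap (assumption)
        W = rng.normal(size=(p, d_model)) / np.sqrt(p)   # per-scale projection
        tokens = patches @ W                             # (num_patches, d_model)
        pooled += alpha * tokens.mean(axis=0)            # pool, then mix scales
    return pooled

emb = multiscale_embed(np.sin(np.linspace(0, 20, 512)))
print(emb.shape)  # (128,)
```

Attention would then run over the patch tokens rather than raw timesteps, which is where the O(L²/P²) saving comes from: a length‑512 series at patch length 16 yields roughly 63 tokens instead of 512 positions.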
During pre‑training, a dynamic masking strategy masks roughly 40 % of patches, with the masking probability adaptively modulated by the series’ coefficient of variation (σ/µ). Masked patches are reconstructed via an L2 loss, and two stochastic augmentations of the same series are encoded into vectors that are pulled together using an InfoNCE contrastive loss. The combined objective (reconstruction + λ_con · contrastive) encourages local fidelity and global consistency simultaneously.
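A minimal sketch of this combined objective is below. The tanh‑shaped modulation of the mask ratio by the coefficient of variation, the InfoNCE temperature, and λ_con = 0.2 are all assumptions; the summary states only the ~40 % base rate, the σ/µ dependence, and the reconstruction + λ_con · contrastive form.

```python
import numpy as np

rng = np.random.default_rng(1)

def dynamic_mask_ratio(series, base=0.4, gain=0.1):
    """Mask ratio modulated by the coefficient of variation (sigma/mu).
    base=0.4 matches the ~40% masking; gain and the tanh shape are assumed."""
    cv = series.std() / (abs(series.mean()) + 1e-8)
    return float(np.clip(base + gain * np.tanh(cv - 1.0), 0.1, 0.7))

def info_nce(z1, z2, tau=0.1):
    """InfoNCE loss pulling matched rows of two augmented views together."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                        # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))      # positives on the diagonal

def pretrain_loss(patches, recon, z1, z2, lam_con=0.2):
    """Combined objective: L2 reconstruction on masked patches
    + lambda_con * contrastive term."""
    mask = rng.random(len(patches)) < dynamic_mask_ratio(patches.ravel())
    rec = float(np.mean((patches[mask] - recon[mask]) ** 2)) if mask.any() else 0.0
    return rec + lam_con * info_nce(z1, z2)

patches = rng.normal(size=(64, 16))
recon = patches + 0.01 * rng.normal(size=(64, 16))
loss = pretrain_loss(patches, recon, rng.normal(size=(8, 32)), rng.normal(size=(8, 32)))
print(loss)
```

The reconstruction term enforces local fidelity at the patch level, while the contrastive term operates on whole-series embeddings, which is how the two losses divide the "local accuracy vs. global consistency" roles described above.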
To improve cross‑domain generalization, PatchFormer incorporates knowledge distillation from multiple domain‑specific teacher models. The student model minimizes a KL divergence between its predictions and each teacher’s output, weighted by λ_distill, thereby inheriting domain expertise without explicit fine‑tuning.
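The distillation term can be sketched as below. Note an assumption: the summary says only that a KL divergence is minimized between student and teacher predictions, so this sketch treats forecasts as categorical distributions over discretized value bins (one common way to make KL well defined for forecasting); the direction of the KL and the equal averaging over teachers are also assumed.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kl_div(p, q, eps=1e-12):
    """KL(p || q) averaged over rows of categorical distributions."""
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def distill_loss(student_logits, teacher_logits_list, lam_distill=0.5):
    """Average KL from each domain-specific teacher's predictive distribution
    to the student's, scaled by lambda_distill (value assumed)."""
    s = softmax(student_logits)
    kls = [kl_div(softmax(t), s) for t in teacher_logits_list]
    return lam_distill * float(np.mean(kls))

rng = np.random.default_rng(2)
student = rng.normal(size=(4, 100))                   # 4 horizons, 100 value bins
teachers = [rng.normal(size=(4, 100)) for _ in range(3)]
dl = distill_loss(student, teachers)
print(dl)
```

Because the loss is an average over teachers, each domain expert pulls the student toward its own predictive behavior without any teacher's domain data being replayed during training.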
Fine‑tuning for a target domain is achieved with lightweight adapter modules inserted after each transformer layer. Each adapter consists of a down‑projection to a bottleneck dimension (d_model/16) followed by an up‑projection, so fine‑tuning updates only 2–5 % of the total parameters. This design enables the model to reach baseline‑level performance with as few as 500 labeled samples, a 94 % reduction in task‑specific training data compared to training from scratch.
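The described adapter can be sketched as a standard residual bottleneck. Two details below are assumptions rather than statements from the summary: the tanh nonlinearity (the paper's activation is unspecified) and the zero‑initialized up‑projection, a common adapter trick that makes the module an identity map before fine‑tuning begins.

```python
import numpy as np

rng = np.random.default_rng(3)

class BottleneckAdapter:
    """Residual bottleneck adapter: down-project to d_model/16, nonlinearity,
    up-project. Only W_down and W_up would be trained during fine-tuning."""
    def __init__(self, d_model=128):
        d_bottleneck = d_model // 16
        self.W_down = rng.normal(size=(d_model, d_bottleneck)) * 0.02
        self.W_up = np.zeros((d_bottleneck, d_model))  # zero-init (assumption):
                                                       # adapter starts as identity

    def __call__(self, h):
        z = np.tanh(h @ self.W_down)   # tanh stands in for the unspecified nonlinearity
        return h + z @ self.W_up       # residual keeps pretrained behavior intact

adapter = BottleneckAdapter(d_model=128)
h = rng.normal(size=(10, 128))
out = adapter(h)
print(np.allclose(out, h))  # True at init, thanks to the zero-init up-projection
```

At d_model = 128 each adapter adds 2 · 128 · 8 = 2,048 trainable parameters, a small fraction of a transformer layer's weights, which is consistent with the 2–5 % figure quoted above.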
Experiments span 24 benchmark datasets across weather, energy, traffic, finance, and healthcare, drawing on a pre‑training corpus of 87 billion points. In zero‑shot mode, PatchFormer‑Base attains an average MSE of 0.262, outperforming TimeGPT by 15.7 %, Chronos by 19.8 %, and MOMENT by 11.8 %. Overall, the model reduces mean squared error by 27.3 % relative to strong baselines and processes 512‑step sequences in 3.4 ms, roughly 3.8× faster than a vanilla transformer.
A scaling study shows a log‑linear relationship between pre‑training data size and performance (MSE = 0.412 − 0.045·log₁₀N, R² = 0.97), indicating continued gains with larger corpora. Ablation results reveal that hierarchical patches contribute 14–15 % of the total improvement, the contrastive term 7–8 %, dynamic masking 3–5 %, and cross‑domain distillation 10–11 %. Robustness tests under random missing data demonstrate that PatchFormer degrades only 15.7 % in MSE at 30 % missingness, whereas competing models suffer 44–47 % degradation.
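The reported fit is easy to evaluate directly. One caveat: the summary does not state the units of N, so the sketch below treats the formula symbolically; the robust takeaway is that each 10× increase in pre‑training data shaves a constant 0.045 off the predicted MSE.

```python
import math

def predicted_mse(n):
    """Evaluate the reported scaling fit MSE = 0.412 - 0.045 * log10(N).
    Units of N are not stated in the summary, so treat this as illustrative."""
    return 0.412 - 0.045 * math.log10(n)

# Each decade of additional pretraining data yields the same absolute gain:
gain_per_decade = predicted_mse(1e3) - predicted_mse(1e4)
print(gain_per_decade)  # ≈ 0.045
```

The high R² (0.97) means the log‑linear trend held tightly over the range studied, but a linear‑in‑log fit cannot hold indefinitely: extrapolated far enough it predicts non‑positive MSE, so the claim of "continued gains" applies to corpora near the studied scale.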
Long‑horizon forecasts (720 steps) benefit especially from the hierarchical design, achieving a 10.4 % average MSE reduction over standard transformers. Computationally, the base model uses 85 M parameters and 320 M FLOPs per inference at 3.4 ms latency, while the large variant scales to 512 M parameters and 28.4 G FLOPs yet remains competitive in speed.
Limitations include the current focus on point forecasts rather than full predictive distributions, reliance on global statistics for dynamic masking (which may hinder real‑time streaming), and the absence of multimodal pre‑training. The authors suggest future work on uncertainty quantification, multimodal and continual learning, and privacy‑preserving federated pre‑training.
Overall, PatchFormer establishes a compelling blueprint for universal time‑series models: hierarchical patch tokenization reduces computational load, contrastive masked reconstruction yields robust self‑supervised representations, cross‑domain distillation transfers knowledge efficiently, and adapter‑based fine‑tuning enables data‑efficient deployment. The reported gains across diverse domains and the demonstrated scaling behavior position PatchFormer as a strong candidate for production‑grade, zero‑shot forecasting in real‑world applications.