MantisV2: Closing the Zero-Shot Gap in Time Series Classification with Synthetic Data and Test-Time Strategies


Developing foundation models for time series classification is of high practical relevance, as such models can serve as universal feature extractors for diverse downstream tasks. Although early models such as Mantis have shown the promise of this approach, a substantial performance gap remained between frozen and fine-tuned encoders. In this work, we present methods that significantly strengthen zero-shot feature extraction for time series. First, we introduce Mantis+, a variant of Mantis pre-trained entirely on synthetic time series. Second, through controlled ablation studies, we refine the architecture and obtain MantisV2, an improved and more lightweight encoder. Third, we propose an enhanced test-time methodology that leverages intermediate-layer representations and refines output-token aggregation. In addition, we show that performance can be further improved via self-ensembling and cross-model embedding fusion. Extensive experiments on UCR, UEA, Human Activity Recognition (HAR) benchmarks, and EEG datasets show that MantisV2 and Mantis+ consistently outperform prior time series foundation models, achieving state-of-the-art zero-shot performance.


💡 Research Summary

Time series classification is a fundamental problem across many domains, yet labeled data are often scarce. Recent advances in foundation models for vision and language have motivated the development of Time Series Foundation Models (TSFMs). The original Mantis model demonstrated strong zero‑shot performance with a lightweight frozen encoder, but suffered from data leakage in pre‑training and a noticeable gap between frozen and fine‑tuned performance. This paper introduces MantisV2 and its variant Mantis+, which together close the zero‑shot gap through three complementary innovations.

First, the authors pre‑train exclusively on synthetic time series generated by the CauKer framework, which uses Gaussian‑process kernel composition and structural causal models to produce diverse, out‑of‑distribution samples. Using a contrastive learning objective with a Random Crop‑Resize (RCR) augmentation, the model learns representations that are invariant to small temporal distortions. Experiments show that 100 k synthetic series achieve comparable or superior accuracy to the same amount of real data, and scaling to 2 M synthetic samples further improves performance while eliminating data leakage.
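The Random Crop‑Resize augmentation described above can be illustrated with a minimal sketch: crop a random subsequence and interpolate it back to the original length, yielding two views of the same series for a contrastive objective. The crop-ratio range here is an illustrative assumption, not a value taken from the paper.

```python
import numpy as np

def random_crop_resize(x, min_crop=0.7, rng=None):
    """Random Crop-Resize (RCR) sketch: crop a random subsequence and
    linearly interpolate it back to the original length. The min_crop
    ratio is an illustrative assumption."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    crop_len = int(n * rng.uniform(min_crop, 1.0))
    start = rng.integers(0, n - crop_len + 1)
    crop = x[start:start + crop_len]
    # Resize the crop back to length n with linear interpolation.
    old_pos = np.linspace(0.0, 1.0, num=crop_len)
    new_pos = np.linspace(0.0, 1.0, num=n)
    return np.interp(new_pos, old_pos, crop)

series = np.sin(np.linspace(0, 8 * np.pi, 512))
view_a = random_crop_resize(series)   # two RCR views of the same series
view_b = random_crop_resize(series)   # would form a positive pair
```

In a contrastive setup, `view_a` and `view_b` would be encoded and pulled together in embedding space, making the representation invariant to such small temporal distortions.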

Second, the architecture is refined. The token generator now combines three parallel branches: raw signal patches, first‑order differential patches, and patch‑wise statistics (mean and standard deviation). Each branch produces 256‑dimensional embeddings that are concatenated and linearly projected into 32 tokens of size 256. The transformer processes these tokens with six layers, eight attention heads, and a learnable class token. By fixing the token count to 32 and resizing inputs to length 512, computational cost is reduced and the model becomes roughly 30 % lighter than the original Mantis, yet yields a 1–2 % absolute gain in zero‑shot accuracy.
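The three-branch token generator can be sketched with plain numpy, using random matrices as stand-ins for the learned projections. The patching scheme (a 512-length input split into 32 non-overlapping patches) follows the shapes described above; everything else is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, n_tokens, d_model = 512, 32, 256
patch = seq_len // n_tokens              # 16 timesteps per patch

x = rng.standard_normal(seq_len)         # one input series resized to 512
patches = x.reshape(n_tokens, patch)

# Branch 1: raw signal patches.
# Branch 2: first-order differential patches.
# Branch 3: patch-wise statistics (mean and standard deviation).
diff = np.diff(x, prepend=x[0]).reshape(n_tokens, patch)
stats = np.stack([patches.mean(axis=1), patches.std(axis=1)], axis=1)

# Random matrices stand in for the learned per-branch projections to 256 dims.
W_raw = rng.standard_normal((patch, d_model))
W_diff = rng.standard_normal((patch, d_model))
W_stats = rng.standard_normal((2, d_model))

# Concatenate the 256-dim branch embeddings and project to 32 tokens of size 256.
branches = np.concatenate(
    [patches @ W_raw, diff @ W_diff, stats @ W_stats], axis=1)  # (32, 768)
W_fuse = rng.standard_normal((3 * d_model, d_model))
tokens = branches @ W_fuse               # (32, 256) tokens for the transformer
```

The fixed token count is what keeps the transformer's cost independent of the raw series length: any input is first resized to 512 samples, so the attention layers always see 32 tokens plus the class token.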

Third, a comprehensive test‑time pipeline is introduced. Intermediate transformer layers are harvested, and multiple perturbed versions of the input (via small noise) are passed through the encoder; their embeddings are averaged in a self‑ensembling step. Additionally, embeddings from a complementary vision‑based backbone (TiViT‑H) are concatenated with the Mantis embeddings, providing cross‑model fusion that further boosts robustness. This strategy improves zero‑shot accuracy by 2–3 percentage points without any additional training.
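The self-ensembling and fusion steps can be sketched as follows. The encoder, noise scale, number of views, and second-backbone embedding are all placeholders standing in for the frozen MantisV2 encoder and TiViT‑H features.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x):
    """Placeholder for a frozen encoder; a fixed random projection here."""
    W = np.random.default_rng(42).standard_normal((x.shape[-1], 256))
    return np.tanh(x @ W)

x = rng.standard_normal(512)

# Self-ensembling: encode several noise-perturbed copies of the input and
# average their embeddings (noise scale and view count are assumptions).
views = [x + 0.01 * rng.standard_normal(512) for _ in range(8)]
z_mantis = np.mean([encode(v) for v in views], axis=0)

# Cross-model fusion: concatenate with an embedding from a complementary
# backbone (a random vector here standing in for TiViT-H features).
z_vision = rng.standard_normal(384)
z_fused = np.concatenate([z_mantis, z_vision])
```

A downstream classifier is then fit on `z_fused`; since both encoders stay frozen, the gains come entirely from test-time computation rather than additional training.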

Extensive evaluation on the UCR (128 datasets), UEA, Human Activity Recognition, and EEG benchmarks demonstrates that MantisV2 and Mantis+ consistently outperform prior TSFMs, including TS2Vec and T‑Loss, and that the fusion variant matches the performance of fine‑tuned Mantis, effectively closing the zero‑shot gap. The authors release all model checkpoints, the synthetic pre‑training dataset, and the CauKer‑2M data on HuggingFace, ensuring reproducibility and facilitating future research on multi‑channel time series, domain‑specific adapters, and large‑scale zero‑shot applications.

