MTS-JEPA: Multi-Resolution Joint-Embedding Predictive Architecture for Time-Series Anomaly Prediction

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original ArXiv source.

Multivariate time series underpin modern critical infrastructure, making anomaly prediction essential for proactive risk mitigation. While Joint-Embedding Predictive Architectures (JEPAs) offer a promising framework for modeling the latent evolution of these systems, their application is hindered by representation collapse and an inability to capture precursor signals across temporal scales. To address these limitations, we propose MTS-JEPA, a specialized architecture that integrates a multi-resolution predictive objective with a soft codebook bottleneck. This design explicitly decouples transient shocks from long-term trends and uses the codebook to capture discrete regime transitions. Notably, we find this constraint also acts as an intrinsic regularizer that stabilizes optimization. Empirical evaluations on standard benchmarks confirm that our approach prevents degenerate solutions and achieves state-of-the-art performance under the early-warning protocol.


💡 Research Summary

The paper introduces MTS‑JEPA, a novel Joint‑Embedding Predictive Architecture tailored for proactive anomaly prediction in multivariate time series. Existing JEPA models, while effective for learning latent dynamics, suffer from two critical issues when applied to continuous data: (i) representation collapse, due to the lack of negative samples in self‑distillation objectives, and (ii) an inability to capture precursor signals that manifest at multiple temporal scales, because the models operate at a single fixed resolution.

To overcome these limitations, the authors propose three key innovations. First, they construct dual‑resolution inputs for each time window: a fine‑grained view obtained by patching the normalized series, and a coarse view generated by down‑sampling (averaging) the same window. The online (student) encoder processes only the fine view, while a momentum‑updated EMA (teacher) encoder ingests both fine and coarse views, producing multi‑scale targets. This asymmetry forces the student to infer global context from local evidence, effectively embedding scale‑aware representations.
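The dual-resolution construction can be sketched as follows. This is a minimal illustration of patching and average down-sampling as described above; the function name, patch length, and down-sampling factor are illustrative assumptions, not values from the paper.

```python
import numpy as np

def make_dual_views(window, patch_len=4, down_factor=4):
    """Build two views of one time window: a fine-grained view of
    non-overlapping patches and a coarse view from average down-sampling.
    Names and default sizes are illustrative, not from the paper."""
    T, C = window.shape
    # Fine view: split the (normalized) series into non-overlapping patches,
    # flattening channels within each patch.
    n_patches = T // patch_len
    fine = window[: n_patches * patch_len].reshape(n_patches, patch_len * C)
    # Coarse view: average-pool the same window along the time axis.
    n_coarse = T // down_factor
    coarse = (window[: n_coarse * down_factor]
              .reshape(n_coarse, down_factor, C)
              .mean(axis=1))
    return fine, coarse

window = np.random.randn(64, 3)   # 64 time steps, 3 variables
fine, coarse = make_dual_views(window)
print(fine.shape, coarse.shape)   # (16, 12) (16, 3)
```

In the paper's setup, the student encoder would see only `fine`, while the EMA teacher would encode both views to produce the multi-scale targets.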

Second, a soft codebook bottleneck is introduced. Continuous encoder outputs are projected onto a learnable set of K prototype vectors via temperature‑scaled cosine similarity, yielding soft assignment distributions p. The expected embedding z = Σ p_k c_k is a convex combination of prototypes, which mathematically bounds the latent space and guarantees a strictly positive batch covariance. The authors provide proofs of both an upper bound on the norm of z and a lower bound on Cov(z), thereby certifying that representation collapse cannot occur.
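A minimal sketch of the soft codebook bottleneck, assuming a plain softmax over temperature-scaled cosine similarities (variable names are illustrative):

```python
import numpy as np

def soft_codebook(h, codebook, tau=0.1):
    """Project encoder outputs h (N, d) onto K prototypes (K, d) via
    temperature-scaled cosine similarity; return soft assignments p and
    the expected embedding z = sum_k p_k c_k."""
    h_n = h / np.linalg.norm(h, axis=-1, keepdims=True)
    c_n = codebook / np.linalg.norm(codebook, axis=-1, keepdims=True)
    logits = (h_n @ c_n.T) / tau                 # (N, K) scaled similarities
    logits -= logits.max(axis=-1, keepdims=True) # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=-1, keepdims=True)           # softmax -> soft assignments
    z = p @ codebook                             # convex combination of prototypes
    return p, z

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 16))            # K = 512 prototypes
h = rng.normal(size=(8, 16))                     # batch of encoder outputs
p, z = soft_codebook(h, codebook)
print(p.shape, z.shape)                          # (8, 512) (8, 16)
```

Because each `z` is a convex combination of the prototypes, its norm is bounded by the largest prototype norm, which is the intuition behind the paper's upper-bound argument.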

Third, the predictor is split into two branches. The fine predictor, a Transformer operating at patch level, forecasts high‑frequency dynamics, while the coarse predictor uses a learnable query token and cross‑attention to predict a single global code that captures low‑frequency trends. Both predictions are supervised against the EMA‑generated multi‑scale targets using KL‑divergence losses. An auxiliary decoder reconstructs the original fine‑view patches from the soft‑quantized embeddings, ensuring that the codebook does not become overly abstract and that signal‑level semantics are retained.
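The coarse branch's query-token mechanism can be illustrated with single-head cross-attention, where one query vector attends over all patch embeddings to produce a single global code. This is a sketch under simplifying assumptions (single head, no layer norm or residuals); the projection matrices and names are not from the paper.

```python
import numpy as np

def coarse_predict(query, patch_emb, w_k, w_v, d_k):
    """One learnable query token cross-attends over patch embeddings,
    returning a single global code vector."""
    K = patch_emb @ w_k                       # (n_patches, d_k) keys
    V = patch_emb @ w_v                       # (n_patches, d_v) values
    scores = (query @ K.T) / np.sqrt(d_k)     # (1, n_patches) attention logits
    scores -= scores.max()                    # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum()                        # softmax over patches
    return attn @ V                           # (1, d_v) global code

rng = np.random.default_rng(0)
patch_emb = rng.normal(size=(16, 32))         # 16 patch embeddings, dim 32
w_k = rng.normal(size=(32, 32))
w_v = rng.normal(size=(32, 32))
query = rng.normal(size=(1, 32))              # the learnable query token
code = coarse_predict(query, patch_emb, w_k, w_v, d_k=32)
print(code.shape)                             # (1, 32)
```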

The overall training objective combines (i) fine‑scale prediction loss, (ii) coarse‑scale prediction loss, (iii) codebook alignment loss, and (iv) reconstruction loss, weighted by hyper‑parameters. EMA parameters are updated with a momentum coefficient, providing a stable teacher signal that gradually tracks the student.
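The composite objective and the EMA teacher update can be sketched as below. The lambda weights and momentum value are illustrative hyper-parameters, not values reported in the paper.

```python
import numpy as np

def kl_div(p, q, eps=1e-8):
    """KL(p || q) between teacher and student soft-assignment distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def total_loss(l_fine, l_coarse, l_code, l_rec,
               lam_coarse=1.0, lam_code=0.5, lam_rec=0.1):
    """Weighted sum of the four objectives; weights are illustrative."""
    return l_fine + lam_coarse * l_coarse + lam_code * l_code + lam_rec * l_rec

def ema_update(teacher, student, m=0.996):
    """Momentum (EMA) update: the teacher slowly tracks the student."""
    return {k: m * teacher[k] + (1.0 - m) * student[k] for k in teacher}

teacher = {"w": np.ones(4)}
student = {"w": np.zeros(4)}
teacher = ema_update(teacher, student)
print(teacher["w"][0])   # 0.996
```

A momentum `m` close to 1 gives a slowly moving, stable teacher; smaller values let the teacher track the student more quickly, which is exactly the trade-off the paper flags for further study.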

Experiments are conducted on seven public benchmarks (UCR, SMAP, MSL, Yahoo, etc.) under an early‑warning protocol where the model must predict whether an anomaly will occur within the next 5–10 steps. MTS‑JEPA achieves an average F1‑score of 0.84, outperforming strong baselines such as TS‑JEPA, PatchTST, TS2Vec, and conventional auto‑encoders by 3–7 percentage points. Ablation studies reveal that removing the codebook leads to rapid representation collapse and a steep drop in performance, while varying the codebook size (K) and temperature (τ) shows that K=512 and τ=0.1 work well across datasets. The dual‑resolution design proves especially beneficial: fine‑scale predictors excel on datasets with abrupt spikes, whereas coarse‑scale predictors dominate on slowly drifting signals.

The paper also discusses limitations and future directions. The current architecture uses only two resolution levels; extending to a full pyramid could capture even richer dynamics. Dynamic adaptation of the codebook size or prototype updating strategies may further improve flexibility. Additionally, the momentum coefficient in EMA updates presents a trade‑off between teacher stability and student learning speed that warrants deeper investigation.

In summary, MTS‑JEPA combines multi‑scale input modeling, a differentiable soft codebook, and EMA‑based teacher‑student training to deliver stable, collapse‑free latent representations that are highly effective for early anomaly prediction in complex time‑series environments, setting a new state‑of‑the‑art in this domain.

