MOMEMTO: Patch-based Memory Gate Model in Time Series Foundation Model
Recently, reconstruction-based deep models have been widely used for time series anomaly detection, but as their capacity and generalization capability increase, these models tend to over-generalize, often reconstructing unseen anomalies accurately. Prior works have attempted to mitigate this by incorporating a memory architecture that stores prototypes of normal patterns. Nevertheless, these approaches suffer from high training costs and have yet to be effectively integrated with time series foundation models (TFMs). To address these challenges, we propose MOMEMTO, an improved variant of a TFM for anomaly detection, enhanced with a patch-based memory module to mitigate over-generalization. The memory module is designed to capture representative normal patterns from multiple domains and enables a single model to be jointly fine-tuned across multiple datasets through a multi-domain training strategy. MOMEMTO initializes memory items with latent representations from a pre-trained encoder, organizes them into patch-level units, and updates them via an attention mechanism. We evaluate our method on 23 univariate benchmark datasets. Experimental results demonstrate that MOMEMTO, as a single model, achieves higher AUC and VUS scores than baseline methods, and further enhances the performance of its backbone TFM, particularly in few-shot learning scenarios.
💡 Research Summary
The paper addresses a critical limitation of reconstruction‑based time‑series anomaly detectors: over‑generalization, where the model learns to reconstruct even unseen anomalous patterns, thereby reducing detection sensitivity. While prior works have introduced memory modules (e.g., MEMTO) to store prototypes of normal behavior, those approaches suffer from high training cost, sensitivity to memory initialization, and a point‑level memory design that is less effective for interval or periodic anomalies. Moreover, existing time‑series foundation models (TFMs) such as MOMENT, MOIRAI, or Chronos are primarily built for forecasting and often employ decoder‑only architectures, making them ill‑suited for anomaly detection without substantial adaptation.
MOMEMTO (Memory‑Enhanced MOMENT) proposes a two‑fold solution. First, it reuses the pre‑trained MOMENT encoder, which is based on a T5‑style transformer and trained with a patch‑masked reconstruction objective. This encoder already provides high‑quality, domain‑agnostic representations, thereby alleviating MEMTO’s sensitivity to memory initialization. Second, it introduces a patch‑level memory gate that aligns directly with the encoder’s patch tokens. Each memory item is a matrix of shape (N × d_model), storing a prototypical normal pattern for every patch position. The memory is initialized per domain by averaging encoder outputs from a random subset of series belonging to that domain, followed by L2‑normalization for stability.
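The per-domain memory initialization described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the subset size, and the use of NumPy in place of the actual encoder pipeline are all assumptions.

```python
import numpy as np

def init_domain_memory(encoder_outputs, rng=None, subset_size=8):
    """Initialize one memory item per domain by averaging encoder outputs.

    encoder_outputs: dict mapping domain name -> array of shape
        (num_series, N_patches, d_model) holding pre-trained encoder
        embeddings for that domain's series.
    Returns a dict of memory items, each (N_patches, d_model), with every
    patch-position row L2-normalized for stability.
    """
    rng = np.random.default_rng(rng)
    memory = {}
    for domain, feats in encoder_outputs.items():
        # Average a random subset of series; patch positions stay separate.
        idx = rng.choice(len(feats), size=min(subset_size, len(feats)),
                         replace=False)
        item = feats[idx].mean(axis=0)                       # (N, d_model)
        item /= np.linalg.norm(item, axis=-1, keepdims=True) + 1e-8
        memory[domain] = item
    return memory
```

Keeping one (N × d_model) matrix per domain means the memory stores a prototype for every patch position, which is what allows the later interaction stages to stay aligned with the encoder's patch tokens.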
The memory interaction proceeds through five stages:
- Memory Alignment – Using the input mask, only observed patches are selected, and both queries and memory slices are reshaped to (P × d_model) so that variable‑length series can be processed uniformly.
- Memory Update – Cosine similarity between normalized queries and memory slices is computed; only the top‑K most similar memory items are updated. Each selected item receives a query‑based attention update, blended with the original item via a learned gate ψ^(k) = σ(m^(k) U_ψ + v^(k) W_ψ). This data‑driven selective update reduces computational load and prevents the entire memory from drifting.
- Query Update – For each of the K selected items, an attention‑produced intermediate representation q̃^(k) is generated. The final refined query q̃ is a weighted sum of these intermediates, where the weights are the similarity scores from the previous step.
- Query Alignment – The refined query q̃ is concatenated with the original query q at the patch level, preserving positional correspondence, and fed into a lightweight decoder.
- Decoder – A two‑layer fully‑connected decoder reconstructs the original time‑series patches. By keeping the decoder simple, the model forces reliance on the encoder‑memory pair, mitigating over‑generalization.
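The stages above (minus the input-mask alignment, which is omitted here for brevity) might be sketched like this. Several details the summary leaves open are filled in as labeled assumptions: similarity per memory item is averaged over patch positions, the query-mixture weights are a softmax over the top-K similarities, and the intermediate q̃^(k) is produced by a plain dot-product attention read of the memory item.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def memory_gate_step(q, memory, U_psi, W_psi, top_k=2):
    """One simplified pass of the patch-level memory interaction.

    q:            (P, d) observed-patch queries from the encoder.
    memory:       (M, P, d) memory items, one (P, d) slice per item.
    U_psi, W_psi: (d, d) gate projections (learned in the paper; given here).
    Returns (updated memory, refined query concatenated with q).
    """
    M, P, d = memory.shape
    qn = q / (np.linalg.norm(q, axis=-1, keepdims=True) + 1e-8)

    # Cosine similarity per memory item (assumption: mean over patches).
    sims = np.empty(M)
    for k in range(M):
        mn = memory[k] / (np.linalg.norm(memory[k], axis=-1, keepdims=True) + 1e-8)
        sims[k] = (qn * mn).sum(axis=-1).mean()
    top = np.argsort(sims)[-top_k:]                      # top-K selection
    weights = np.exp(sims[top]) / np.exp(sims[top]).sum()

    q_tilde = np.zeros_like(q)
    new_memory = memory.copy()
    for w, k in zip(weights, top):
        m_k = memory[k]
        # Query-based attention update of the selected memory item.
        attn = np.exp(m_k @ q.T)                         # (P, P)
        attn /= attn.sum(axis=-1, keepdims=True)
        v_k = attn @ q                                   # aggregated query info
        gate = sigmoid(m_k @ U_psi + v_k @ W_psi)        # learned blend gate
        new_memory[k] = (1 - gate) * m_k + gate * v_k
        # Intermediate refined query for this item, weighted into q_tilde.
        q_tilde += w * (attn.T @ m_k)
    # Query alignment: concatenate refined and original queries patch-wise;
    # the result would feed the two-layer decoder.
    return new_memory, np.concatenate([q_tilde, q], axis=-1)   # (P, 2d)
```

Because only the top-K items are touched, the rest of the memory is untouched on every step, which matches the stated goal of preventing the entire memory from drifting.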
A key contribution is the Multi‑Domain Training strategy. Instead of training a separate model per dataset, MOMEMTO jointly fine‑tunes a single architecture on 23 univariate benchmark datasets (TSB‑AD‑Benchmark, 870 series). The number of memory items equals the number of user‑defined domains, providing a balanced starting point. During training, memory items can accumulate information from multiple domains, evolving into domain‑general prototypes while still capturing domain‑specific nuances. This joint training reduces overall training time and memory footprint, and enables knowledge sharing across heterogeneous time series.
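A joint multi-domain fine-tuning loop needs batches that mix windows from all datasets while tracking which domain each window came from, so the model can route queries to the matching memory item. The sampler below is purely illustrative (the function name, uniform domain sampling, and batch layout are assumptions, not the paper's procedure):

```python
import numpy as np

def multi_domain_batches(domain_data, batch_size=4, steps=100, rng=None):
    """Yield mixed-domain mini-batches for joint fine-tuning (illustrative).

    domain_data: dict mapping domain name -> array (num_windows, window_len).
    Each batch carries the domain index of every window so a downstream
    model could route queries to the corresponding memory item.
    """
    rng = np.random.default_rng(rng)
    domains = list(domain_data)
    for _ in range(steps):
        idx = rng.integers(len(domains), size=batch_size)  # one domain per sample
        windows = np.stack([
            domain_data[domains[i]][rng.integers(len(domain_data[domains[i]]))]
            for i in idx
        ])
        yield windows, idx  # (batch_size, window_len), (batch_size,)
```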
Experimental evaluation uses non‑overlapping windows of length 512, and performance is measured with threshold‑independent metrics: Area Under the ROC Curve (AUC) and Volume Under the Surface (VUS). MOMEMTO consistently outperforms strong baselines—including OmniAnomaly, Anomaly Transformer, TimesNet, and the original MEMTO—across all datasets. In few‑shot scenarios (≤5 % labeled data), MOMEMTO improves AUC by an average of 7 percentage points over the MOMENT backbone alone, demonstrating that the patch‑based memory effectively regularizes the model when supervision is scarce. Ablation studies confirm that (i) the patch‑level memory, (ii) top‑K selective updates, and (iii) multi‑domain joint training each contribute significantly to the observed gains. Moreover, removing the memory module leads to a marked increase in reconstruction accuracy on anomalous inputs, validating the memory’s role in suppressing over‑generalization.
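The threshold-independent AUC used above can be computed from per-point anomaly scores (here, reconstruction error averaged over non-overlapping windows) without picking a detection threshold. The sketch below uses the rank-based (Mann-Whitney) formulation of AUC; the window length of 4 and the helper names are illustrative, and tie handling is simplified.

```python
import numpy as np

def window_anomaly_scores(x, x_hat, window=4):
    """Point-wise squared reconstruction error, pooled over
    non-overlapping windows (each point inherits its window's mean)."""
    err = (x - x_hat) ** 2
    n = len(err) // window * window
    scores = err[:n].reshape(-1, window).mean(axis=1).repeat(window)
    return np.concatenate([scores, err[n:]])   # leftover tail kept point-wise

def auc_roc(scores, labels):
    """Threshold-independent AUC via the rank (Mann-Whitney) formulation."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels.astype(bool)
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

A perfectly separating score yields AUC = 1.0, a random one about 0.5; VUS extends this idea by additionally integrating over a buffer of tolerated label lags, which is why both metrics avoid committing to a single threshold.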
In summary, MOMEMTO delivers a practical, scalable solution for time‑series anomaly detection by (1) leveraging a pre‑trained TFM encoder, (2) integrating a patch‑aligned, selectively updated memory gate, and (3) adopting a unified multi‑domain training paradigm. The approach achieves state‑of‑the‑art detection performance on a broad benchmark, reduces computational overhead, and remains robust in low‑label regimes, making it well‑suited for real‑world monitoring of heterogeneous sensor streams.