Continual-MEGA: A Large-scale Benchmark for Generalizable Continual Anomaly Detection
In this paper, we introduce a new benchmark for continual learning in anomaly detection, aimed at better reflecting real-world deployment scenarios. Our benchmark, Continual-MEGA, includes a large and diverse dataset that significantly expands existing evaluation settings by combining carefully curated existing datasets with our newly proposed dataset, ContinualAD. In addition to standard continual learning at an expanded scale, we propose a novel scenario that measures zero-shot generalization to unseen classes, i.e., classes not observed during continual adaptation. This setting poses a new problem: whether continual adaptation also enhances zero-shot performance. We also present a unified baseline algorithm that improves robustness in few-shot detection while maintaining strong generalization. Through extensive evaluations, we report three key findings: (1) existing methods show substantial room for improvement, particularly in pixel-level defect localization; (2) our proposed method consistently outperforms prior approaches; and (3) the newly introduced ContinualAD dataset enhances the performance of strong anomaly detection models. We release the benchmark and code at https://github.com/Continual-Mega/Continual-Mega.
💡 Research Summary
Continual‑MEGA introduces a comprehensive benchmark that addresses two major gaps in current anomaly detection (AD) research: the lack of large‑scale, diverse data that mimics real‑world production streams, and the absence of evaluation protocols that jointly test continual learning (CL) and zero‑shot generalization. The authors first construct a new dataset, ContinualAD, comprising 30 object categories captured with ten different consumer devices. For each category, normal images and a rich set of defect images (cracks, holes, scratches, contamination, etc.) are collected, and every defect region is annotated with pixel‑level polygon masks. In total, ContinualAD provides 14,655 normal and 15,827 anomalous images, yielding a substantially higher intra‑class variance than existing benchmarks such as MVTec‑AD or VisA.
The benchmark integrates seven public datasets (MVTec‑AD, VisA, BT‑AD, MPDD, Real‑IAD, VIADUCT, etc.) together with ContinualAD, forming three distinct evaluation scenarios:
- Scenario 1 – Standard Continual Learning: The model is pretrained on 85 “base” classes (a union of MVTec‑AD and VisA) and then incrementally learns new classes in increments of 5, 10, or 30 per step, over 12, 6, or 2 steps respectively. This setting measures classic CL metrics such as forgetting and forward transfer.
- Scenario 2 – Continual Zero‑Shot Learning (CZSL): Both the base and new class streams deliberately exclude MVTec‑AD and VisA, which are held out exclusively for testing. After the model has adapted to the incremental stream of other datasets, its ability to detect anomalies in the completely unseen MVTec‑AD and VisA domains is evaluated. This isolates the effect of continual adaptation on zero‑shot generalization.
- Scenario 3 – ContinualAD Zero‑Shot: ContinualAD is removed from both the base and incremental streams, and the model is trained only on the other public datasets. The held‑out ContinualAD then serves as a zero‑shot test set, allowing the authors to quantify how much the newly introduced dataset contributes to cross‑domain robustness.
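The classic CL metrics that Scenario 1 measures, forgetting and forward transfer, follow standard continual-learning definitions over a score matrix. The sketch below is illustrative and assumes the common conventions (a matrix `R` where `R[i][j]` is the score on task `j` after training step `i`); it is not the benchmark's exact evaluation code.

```python
# Minimal sketch of standard continual-learning metrics, assuming a score
# matrix R where R[i][j] is performance on task j after finishing step i
# (task j is first trained at step j). Names and conventions are
# illustrative, not taken from the Continual-MEGA implementation.

def average_forgetting(R):
    """Mean drop from each task's best past score to its final score.

    The last task cannot have been forgotten yet, so it is excluded.
    """
    T = len(R)
    drops = []
    for j in range(T - 1):
        best_past = max(R[i][j] for i in range(j, T - 1))
        drops.append(best_past - R[T - 1][j])
    return sum(drops) / len(drops)

def forward_transfer(R, baseline):
    """Mean gain on task j *before* training on it, vs. a scratch baseline.

    baseline[j] is the score a randomly initialized model gets on task j.
    """
    T = len(R)
    gains = [R[j - 1][j] - baseline[j] for j in range(1, T)]
    return sum(gains) / len(gains)
```

A positive `average_forgetting` means earlier tasks degraded as new classes arrived; a positive `forward_transfer` means earlier adaptation already helps on tasks not yet trained, which is the effect Scenario 2 probes in the zero-shot limit.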
To tackle these challenges, the authors propose a baseline called ADCT (Anomaly Detection across Continual Tasks). ADCT builds on a frozen CLIP (ViT‑B/16) backbone and inserts lightweight mixture‑of‑experts (MoE) adapters that are fine‑tuned at each continual step. Crucially, ADCT also synthesizes “anomalous features” from normal image embeddings, providing richer visual cues than text‑prompt‑only methods. The adapters are deliberately low‑capacity to avoid over‑fitting to previously seen classes while still enabling task‑specific adaptation.
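The two ingredients attributed to ADCT, low-capacity MoE adapters on a frozen backbone and synthesis of anomalous features from normal embeddings, can be illustrated with a minimal pure-Python sketch. All shapes, weight layouts, and the noise-based synthesis below are assumptions for illustration; the authors' actual architecture (frozen CLIP ViT‑B/16 with fine-tuned adapters) is not reproduced here.

```python
import math
import random

# Illustrative sketch only: a gated mixture-of-experts adapter with a
# residual connection, plus pseudo-anomalous feature synthesis by
# perturbing a normal embedding. Not the authors' implementation.

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def moe_adapter(x, experts, gate):
    """Low-capacity MoE adapter applied to a frozen backbone feature.

    x:       length-d feature from the frozen backbone
    experts: list of (W_down, W_up) bottleneck pairs mapping d -> r -> d,
             with r << d to keep capacity low (limits over-fitting)
    gate:    n_experts x d gating matrix
    """
    gates = softmax(matvec(gate, x))       # soft routing over experts
    out = list(x)                          # residual keeps the frozen feature
    for g, (w_down, w_up) in zip(gates, experts):
        h = matvec(w_down, x)              # d -> r bottleneck
        delta = matvec(w_up, h)            # r -> d
        out = [o + g * d_i for o, d_i in zip(out, delta)]
    return out

def synthesize_anomalous(normal_feat, scale=0.1, rng=None):
    """Create a pseudo-anomalous feature by perturbing a normal embedding."""
    rng = rng or random.Random(0)
    return [f + rng.gauss(0.0, scale) for f in normal_feat]
```

The residual form means an adapter initialized near zero leaves the frozen CLIP feature untouched, which is one common way such designs preserve zero-shot behavior while still allowing per-step adaptation.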
Extensive experiments across all three scenarios reveal several key findings:
- Performance Gap in Existing Methods: State‑of‑the‑art continual AD approaches (e.g., Cao et al., 2024; Liu et al., 2024) achieve respectable AUROC on standard CL but suffer dramatic drops (often below 0.70) when evaluated zero‑shot on unseen domains. Pixel‑level localization is especially weak, with IoU scores frequently under 0.30.
- ADCT’s Superior Trade‑off: ADCT consistently outperforms prior work by 2–5 percentage points in AUROC, 3–6 pp in average precision, and improves pixel‑level IoU by 0.10–0.15. Notably, in the CZSL setting, ADCT maintains AUROC above 0.80, demonstrating that modest adapter updates combined with feature synthesis preserve and even enhance generalization.
- Impact of ContinualAD: Removing ContinualAD from the benchmark reduces zero‑shot performance by 8–12 pp across all methods, confirming that the larger, more varied dataset is essential for learning representations that transfer to novel defect types and imaging conditions.
- Ablation Insights: Experiments disabling the feature synthesis module or increasing adapter size both lead to higher forgetting and lower zero‑shot scores, indicating that (i) synthetic anomalous features provide a crucial bridge between normal and abnormal distributions, and (ii) over‑parameterized adapters tend to overfit to the incremental stream, harming cross‑domain robustness.
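The AUROC and pixel-level IoU figures quoted in these findings have standard definitions, sketched below in pure Python (rank-based AUROC via the Mann‑Whitney formulation, and binary-mask IoU). This is a minimal illustration of what is being measured, not the benchmark's evaluation code.

```python
# Minimal sketch of the two metrics quoted above; illustrative only.

def pixel_iou(pred_mask, gt_mask):
    """Intersection-over-union between two flat binary masks (0/1 values)."""
    inter = sum(p and g for p, g in zip(pred_mask, gt_mask))
    union = sum(p or g for p, g in zip(pred_mask, gt_mask))
    return inter / union if union else 1.0

def auroc(scores, labels):
    """Image-level AUROC via the rank (Mann-Whitney U) formulation."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):                      # assign average ranks to ties
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    pos = [r for r, y in zip(ranks, labels) if y == 1]
    n_pos, n_neg = len(pos), len(labels) - len(pos)
    return (sum(pos) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

Note the asymmetry the first finding exploits: a model can rank whole images well (high AUROC) while still localizing defects poorly (low pixel IoU), since IoU penalizes every mispredicted pixel rather than only the ordering of scores.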
The paper’s contributions are threefold: (1) a publicly released, large‑scale benchmark (Continual‑MEGA) that unifies continual learning and zero‑shot evaluation for AD; (2) a thorough empirical analysis showing substantial room for improvement in current methods, especially in pixel‑level localization and cross‑domain generalization; and (3) the ADCT baseline, which demonstrates that lightweight adaptation plus feature synthesis is a promising direction for future research.
In summary, Continual‑MEGA sets a new standard for evaluating anomaly detection systems in realistic, evolving environments. By providing both a challenging dataset suite and a clear protocol for measuring continual adaptation and zero‑shot transfer, it invites the community to develop more robust, efficient, and generalizable AD models that can be deployed in real industrial settings without costly retraining. Future work may explore online replay buffers, generative synthesis of defect images, or self‑supervised pretraining tailored to the high‑variance ContinualAD domain, further narrowing the gap between laboratory benchmarks and production‑line requirements.