mTSBench: Benchmarking Multivariate Time Series Anomaly Detection and Model Selection at Scale
Anomaly detection in multivariate time series is essential across domains such as healthcare, cybersecurity, and industrial monitoring, yet remains fundamentally challenging due to high-dimensional dependencies, the presence of cross-correlations between time-dependent variables, and the scarcity of labeled anomalies. We introduce mTSBench, the largest benchmark to date for multivariate time series anomaly detection and model selection, consisting of 344 labeled time series across 19 datasets from a wide range of application domains. We comprehensively evaluate 24 anomaly detectors, including the only two publicly available large language model-based methods for multivariate time series. Consistent with prior findings, we observe that no single detector dominates across datasets, motivating the need for effective model selection. We benchmark three recent model selection methods and find that even the strongest of them remain far from optimal. Our results highlight the outstanding need for robust, generalizable selection strategies. We open-source the benchmark at https://plan-lab.github.io/mtsbench to encourage future research.
💡 Research Summary
The paper addresses two fundamental challenges in multivariate time‑series anomaly detection (MTS‑AD): the lack of a comprehensive benchmark that reflects real‑world diversity, and the difficulty of selecting the most suitable detection model without supervision. To this end, the authors introduce mTSBench, the largest publicly available benchmark for MTS‑AD and model selection. mTSBench aggregates 344 labeled multivariate time series from 19 publicly available datasets spanning 12 application domains (healthcare, cybersecurity, industrial monitoring, finance, etc.). The datasets contain point, range, and contextual anomalies, thereby capturing a wide spectrum of real‑world anomaly patterns and inter‑signal dependencies.
The benchmark evaluates 24 state‑of‑the‑art anomaly detectors covering four methodological families: (i) statistical and data‑mining approaches (PCA, RobustPCA, COPOD, CBLOF, Isolation Forest, etc.), (ii) classic machine‑learning methods (K‑Nearest Neighbors, LOF, K‑means‑AD, etc.), (iii) deep‑learning models (LSTM‑AD, USAD, AutoEncoder, TimesNet, TranAD, etc.), and (iv) the only two publicly released large‑language‑model (LLM) based detectors for multivariate series (ALLM4TS and a second LLM method). All detectors are run under a unified preprocessing pipeline and evaluated with a consistent suite of metrics: AUC‑ROC, AUC‑PR, precision, recall, F1, point‑based and event‑based scores, as well as ranking‑based measures for model selection.
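To make the metric suite concrete, here is a minimal sketch of how the threshold‑free scores (AUC‑ROC, AUC‑PR) and a point‑based F1 could be computed with scikit-learn on a toy series; this is an illustrative assumption, not mTSBench's actual evaluation code, and the synthetic data and 5% contamination threshold are invented for the example.

```python
# Illustrative sketch (assumed, not the mTSBench pipeline): scoring a
# detector's per-point anomaly scores against point-level labels.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score

rng = np.random.default_rng(0)
labels = np.zeros(200, dtype=int)
labels[50:60] = 1                      # one synthetic range anomaly
scores = rng.normal(0.0, 1.0, 200)     # detector output on normal points
scores[50:60] += 3.0                   # detector raises scores in the range

# Threshold-free metrics operate directly on the raw scores
auc_roc = roc_auc_score(labels, scores)
auc_pr = average_precision_score(labels, scores)

# Point-based F1 requires a threshold, e.g. flagging the top 5% of points
threshold = np.quantile(scores, 0.95)
preds = (scores > threshold).astype(int)
f1 = f1_score(labels, preds)
```

Event‑based variants of these scores would additionally count a range anomaly as detected if any (or enough) of its points are flagged, which is why the benchmark reports both views.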
Results confirm the long‑standing observation that no single detector dominates across all datasets. Some deep‑learning models achieve near‑perfect AUC‑ROC (>0.95) on certain industrial datasets, while simple statistical methods outperform them on others, especially when anomalies are subtle or data are low‑dimensional. Although only two LLM‑based detectors are available, they demonstrate competitive performance on datasets with complex cross‑signal relationships, hinting at the promise of foundation models for MTS‑AD.
Beyond detection, the paper benchmarks three recent unsupervised model‑selection techniques: a meta‑learning approach based on handcrafted meta‑features, an internal‑score‑driven selector (e.g., variance of model outputs), and a simple ensemble‑weighting scheme. All three are evaluated against two baselines: an “oracle” that always picks the true best detector for a given dataset, and a random selector. Even the strongest selector falls short of the oracle by 15–20 % on average, and only modestly outperforms random choice. This gap underscores that current unsupervised selection methods do not adequately capture the high‑dimensional temporal dependencies characteristic of multivariate series.
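The oracle and random baselines bounding that gap can be sketched as follows; the detector names and performance numbers here are toy values invented for illustration, not results from the paper.

```python
# Hypothetical sketch of the selection baselines: an oracle that always
# picks the per-series best detector, and a uniform random selector.
# All names and scores below are illustrative, not real benchmark data.
import random

random.seed(0)

# Per-series performance (e.g. AUC-PR) of each candidate detector
perf = {
    "series_a": {"iforest": 0.62, "lstm_ad": 0.81, "knn": 0.55},
    "series_b": {"iforest": 0.74, "lstm_ad": 0.48, "knn": 0.69},
}

def oracle_score(series_perf):
    # Upper bound: the true best detector for this series
    return max(series_perf.values())

def random_score(series_perf):
    # Lower baseline: any detector chosen uniformly at random
    return random.choice(list(series_perf.values()))

oracle_avg = sum(oracle_score(p) for p in perf.values()) / len(perf)
random_avg = sum(random_score(p) for p in perf.values()) / len(perf)
# A learned selector's average lands somewhere between these two bounds;
# the paper finds even the best selectors sit well below the oracle.
```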
The authors compare mTSBench to existing benchmarks (TODS, TimeSeriesAD, EEAD, etc.), highlighting that prior suites either focus on univariate series, contain far fewer multivariate examples, or lack any model‑selection component. By integrating both detection and selection evaluation, providing point‑ and range‑based anomaly labels, and releasing code for reproducibility, mTSBench establishes a new standard platform for the community.
In conclusion, the paper makes three key contributions: (1) the creation of the most extensive, diverse, and publicly available MTS‑AD benchmark; (2) a systematic empirical study confirming that no detector is universally superior and that current unsupervised selectors are far from optimal; (3) a unified evaluation framework that quantifies both detection performance and selector effectiveness, exposing a substantial performance gap that motivates future research. The authors suggest several promising directions: richer meta‑feature engineering tailored to multivariate dynamics, reinforcement‑learning‑based selector policies, and deeper integration of LLMs for both detection and model‑selection. mTSBench is released at https://plan-lab.github.io/mtsbench, inviting researchers to develop and benchmark more robust, generalizable solutions for multivariate time‑series anomaly detection.