Evaluating Real-time Anomaly Detection Algorithms - the Numenta Anomaly Benchmark
Much of the world’s data is streaming, time-series data, where anomalies give significant information in critical situations; examples abound in domains such as finance, IT, security, medical, and energy. Yet detecting anomalies in streaming data is a difficult task, requiring detectors to process data in real-time, not batches, and learn while simultaneously making predictions. There are no benchmarks to adequately test and score the efficacy of real-time anomaly detectors. Here we propose the Numenta Anomaly Benchmark (NAB), which attempts to provide a controlled and repeatable environment of open-source tools to test and measure anomaly detection algorithms on streaming data. The perfect detector would detect all anomalies as soon as possible, trigger no false alarms, work with real-world time-series data across a variety of domains, and automatically adapt to changing statistics. Rewarding these characteristics is formalized in NAB, using a scoring algorithm designed for streaming data. NAB evaluates detectors on a benchmark dataset with labeled, real-world time-series data. We present these components, and give results and analyses for several open source, commercially-used algorithms. The goal for NAB is to provide a standard, open source framework with which the research community can compare and evaluate different algorithms for detecting anomalies in streaming data.
💡 Research Summary
The paper addresses a critical gap in the evaluation of real‑time anomaly detection for streaming time‑series data. While many detection algorithms exist, most benchmarks are batch‑oriented and do not reflect the constraints of online operation—namely, the need to produce immediate alerts, to learn continuously, and to avoid excessive false alarms. To fill this void, the authors introduce the Numenta Anomaly Benchmark (NAB), an open‑source framework that provides a controlled, repeatable environment for testing and scoring streaming anomaly detectors.
NAB consists of three core components. First, a curated dataset of 58 real‑world time‑series drawn from five domains (IT operations, finance, energy, healthcare, and manufacturing). Human experts have manually labeled more than 1,800 anomaly events, covering a wide range of patterns such as spikes, jumps, drifts, and pattern collapses. This diversity ensures that algorithms are evaluated on realistic, heterogeneous signals rather than synthetic or narrowly scoped data.
Second, a scoring algorithm specifically designed for streaming contexts. The score starts at zero and evolves as the detector processes the data point by point. When an anomaly window is entered, the detector can earn a “True Positive” (TP) reward if it raises an alarm; the reward is larger the earlier the alarm is raised, thereby incentivizing low latency. If the detector fails to alarm within the window, a “False Negative” (FN) penalty is applied. Alarms raised outside anomaly windows incur a “False Positive” (FP) penalty, which is deliberately steep to reflect the operational cost of unnecessary alerts. The scoring system includes three tunable weight parameters (A, B, C) that let users adopt a “standard”, “conservative”, or “lenient” mode, aligning the benchmark with different production policies.
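The window-relative reward described above can be sketched with a scaled sigmoid: an alarm early in (or before the end of) an anomaly window earns close to the full TP reward, while an alarm well outside the window decays toward the FP penalty. The function below is a simplified illustration of that idea, not NAB's actual scoring code; the names `tp_weight` and `fp_weight`, and the steepness constant 5, are assumptions chosen for the sketch.

```python
import math

def detection_score(rel_pos, tp_weight=1.0, fp_weight=0.11):
    """Score a single alarm by its position relative to the end of an
    anomaly window (illustrative sketch, not NAB's implementation).

    rel_pos < 0: alarm fired inside the window; more negative means
                 earlier, which earns a reward approaching tp_weight.
    rel_pos > 0: alarm fired after the window; the score decays toward
                 the false-positive penalty -fp_weight.
    """
    sig = 1.0 / (1.0 + math.exp(5.0 * rel_pos))  # ~1 when early, ~0 when late
    return (tp_weight + fp_weight) * sig - fp_weight
```

Because the transition is smooth rather than a hard cutoff, a detector that fires slightly late is penalized gradually instead of flipping from full reward to full penalty, which matches the summary's point about rewarding low latency.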
Third, a fully open‑source Python library that bundles data loaders, the scoring engine, and visualization tools. Researchers can plug in any detection algorithm that conforms to a simple interface, run it on the same data streams, and obtain comparable scores without having to re‑implement preprocessing or evaluation logic. The framework also encourages community contributions of new data sets, labeling efforts, and scoring variants, making NAB a living benchmark.
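A streaming detector plugged into such a framework must consume one record at a time, emit an anomaly score immediately, and never look ahead. The toy detector below illustrates the shape of that contract with a rolling z-score; the class and method names are illustrative and are not NAB's actual API.

```python
class MovingZScoreDetector:
    """Toy streaming detector: score each value by how many standard
    deviations it sits from the mean of a sliding window of recent
    values. Illustrates a one-record-at-a-time interface, not NAB's API.
    """

    def __init__(self, window=100, threshold=3.0):
        self.window = window        # number of recent values kept
        self.threshold = threshold  # z-score mapped to the max score 1.0
        self.values = []

    def handle_record(self, value):
        """Return an anomaly score in [0, 1] for one incoming value,
        using only data seen so far (no lookahead)."""
        if len(self.values) < 2:
            score = 0.0  # not enough history to judge
        else:
            mean = sum(self.values) / len(self.values)
            var = sum((v - mean) ** 2 for v in self.values) / len(self.values)
            std = var ** 0.5 or 1e-9  # guard against zero variance
            z = abs(value - mean) / std
            score = min(z / self.threshold, 1.0)
        self.values.append(value)
        if len(self.values) > self.window:
            self.values.pop(0)  # slide the window forward
        return score
```

Any algorithm exposing this kind of per-record entry point can be scored on the same streams, which is what makes side-by-side comparison possible without re-implementing the evaluation logic.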
To demonstrate NAB’s utility, the authors evaluate five widely used algorithms: (1) Numenta’s Hierarchical Temporal Memory (HTM) implementation, which flags anomalies based on prediction error; (2) Twitter’s AnomalyDetection package, which combines seasonal decomposition with residual analysis; (3) Facebook Prophet, which models trend and seasonality and treats large forecast errors as anomalies; (4) a One‑Class Support Vector Machine (OC‑SVM), which learns a boundary around normal data in a high‑dimensional feature space; and (5) an LSTM‑based deep learning model that predicts future values and uses the prediction residual as an anomaly score.
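Several of the detectors listed above share one pattern: predict the next value, then treat an unusually large prediction residual as an anomaly. The helper below sketches one simple way to turn a raw residual into a score in [0, 1], by measuring how extreme it is against a Gaussian fit of recent residuals. This is a generic illustration of residual analysis, not the exact method used by HTM, Prophet, or the LSTM model.

```python
import math

def anomaly_likelihood(recent_errors, new_error, eps=1e-9):
    """Map a one-step prediction error to an anomaly score in [0, 1]
    by its one-sided Gaussian tail probability relative to recent
    errors (illustrative sketch of residual-based scoring).
    """
    n = len(recent_errors)
    if n < 2:
        return 0.0  # not enough history to estimate the error distribution
    mean = sum(recent_errors) / n
    std = (sum((e - mean) ** 2 for e in recent_errors) / n) ** 0.5
    z = (new_error - mean) / (std + eps)
    tail = 0.5 * math.erfc(z / math.sqrt(2.0))  # P(error >= new_error)
    return 1.0 - tail  # rare (large) errors map close to 1.0
```

A typical residual near the historical mean scores around 0.5, while an error many standard deviations above it scores close to 1.0, which is then compared against an alarm threshold.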
Results show that HTM and Twitter’s AnomalyDetection achieve the highest overall NAB scores, primarily because they detect anomalies quickly (average detection latency under 5 % of the anomaly‑window length) and generate few false positives. Prophet performs well on strongly seasonal series but misses more anomalies (false negatives) when abrupt drifts occur. OC‑SVM’s performance is highly sensitive to the choice of kernel and ν parameter, leading to lower scores across the board. The LSTM model, while capable of high recall when trained on ample data, incurs significant computational overhead for online updates, resulting in delayed alerts and reduced timeliness scores.
The authors extract several key insights. First, timeliness and false‑alarm suppression dominate the overall score, confirming that real‑time operational constraints differ fundamentally from batch evaluation metrics. Second, a single algorithm rarely excels across all domains; domain‑specific characteristics (seasonality, drift frequency, noise level) heavily influence performance, underscoring the need for adaptable or hybrid solutions. Third, the ability to adjust scoring weights allows practitioners to align benchmark outcomes with business‑level risk tolerances, making NAB a practical decision‑support tool rather than a purely academic metric.
Finally, the paper outlines future research directions: semi‑automated labeling to reduce annotation cost, multi‑scale anomaly detection that simultaneously monitors short‑term spikes and long‑term trend changes, context‑aware dynamic weighting of TP/FN/FP penalties, and extensions to non‑numeric streams such as log messages or event traces. By releasing NAB as open source, the authors invite the community to expand the dataset, contribute new algorithms, and refine the scoring methodology, with the ultimate goal of establishing a de‑facto standard for evaluating streaming anomaly detection systems.