Non-Parametric Methods Applied to the N-Sample Series Comparison
Anomaly and similarity detection in multidimensional series have a long history and have found practical use in many fields, such as medicine, networking, and finance. Anomaly detection appeals to many disciplines: mathematicians searching for a unified probabilistic formulation, statisticians seeking error-bound estimates, and computer scientists designing fast algorithms, to name just a few. We make two contributions. First, we present a self-contained survey of the most promising methods currently used in machine learning, statistics, and bio-informatics, including discussions of conformal prediction, kernel methods in Hilbert spaces, Kolmogorov's information measure, and non-parametric cumulative distribution function (NCDF) comparison methods. Second, building on this foundation, we provide a powerful NCDF method for series of low dimensionality and, through a combination of data organization and statistical tests, describe extensions that scale well with increasing dimensionality.
💡 Research Summary
The paper addresses the problem of detecting anomalies and measuring similarity in multidimensional time‑series data, a task that appears in fields ranging from medicine to finance and network monitoring. It makes two principal contributions. First, it provides a self‑contained survey of the most promising contemporary techniques drawn from machine learning, statistics, and bio‑informatics. The survey covers conformal prediction, which quantifies predictive uncertainty without strong distributional assumptions; kernel methods in Hilbert spaces, which let linear algorithms capture non‑linear structure through an implicit feature map; Kolmogorov’s information measure, which uses compression‑based complexity as an anomaly indicator; and a family of non‑parametric cumulative distribution function (CDF) comparison methods (NCDF), which directly compare empirical CDFs of two series without imposing a parametric model.
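The core idea behind the NCDF family is concrete enough to sketch directly: build the empirical CDF of each sample and measure how far apart the two step functions are. The following sketch computes the classic two‑sample Kolmogorov–Smirnov statistic (the maximum vertical gap between the two empirical CDFs) from scratch; it is an illustration of the general principle, not the paper's own implementation.

```python
import numpy as np

def ks_two_sample_statistic(x, y):
    """Two-sample KS statistic: the maximum vertical gap between
    the empirical CDFs of x and y, evaluated over the pooled sample."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    pooled = np.concatenate([x, y])
    # Empirical CDF F(t) = P(X <= t), hence side="right".
    cdf_x = np.searchsorted(x, pooled, side="right") / len(x)
    cdf_y = np.searchsorted(y, pooled, side="right") / len(y)
    return float(np.max(np.abs(cdf_x - cdf_y)))

# Identical samples give 0; fully separated samples give 1.
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 11.0, 12.0, 13.0])
print(ks_two_sample_statistic(a, a))  # → 0.0
print(ks_two_sample_statistic(a, b))  # → 1.0
```

Because the statistic depends only on ranks, no parametric model of either series is ever assumed, which is exactly the "non‑parametric" property the survey emphasizes.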
The second contribution is a novel NCDF‑based algorithm specifically designed for series of low dimensionality but extensible to higher dimensions. The algorithm consists of three tightly coupled components. (1) Data organization: observations are indexed in a multi‑dimensional structure such as a k‑d tree or a regular grid, allowing fast retrieval of order statistics and cumulative counts in O(log N) time. (2) Statistical testing: after ordering, several classic non‑parametric two‑sample tests—Kolmogorov–Smirnov (KS), Anderson–Darling (AD), and Cramér‑von Mises (CvM)—are applied in parallel. Each test has a different sensitivity profile (KS is most sensitive to central deviations, AD to tails, CvM to overall shape), so using them together improves robustness. (3) P‑value aggregation: the individual p‑values are combined using a Bayesian weighted average, which yields a single decision statistic while controlling the family‑wise error rate. To handle very small sample sizes (≤30), the distribution of each test statistic is estimated by bootstrap resampling, providing explicit confidence bounds on the false‑positive rate.
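Steps (2) and (3) above can be sketched with standard SciPy two‑sample tests. The paper's Bayesian weighted average is not reproduced here; the uniform weighting below is a placeholder assumption, and the small‑sample calibration is approximated by a pooled‑permutation resampling scheme in the spirit of the bootstrap step described above.

```python
import numpy as np
from scipy import stats

def combined_two_sample_pvalues(x, y, weights=(1/3, 1/3, 1/3)):
    """Run KS, Anderson-Darling, and Cramer-von Mises two-sample tests
    in parallel and fuse their p-values. NOTE: the uniform weighted
    average is a stand-in for the paper's Bayesian aggregation."""
    p_ks = stats.ks_2samp(x, y).pvalue
    # anderson_ksamp reports a significance level clipped to [0.001, 0.25].
    p_ad = stats.anderson_ksamp([x, y]).significance_level
    p_cvm = stats.cramervonmises_2samp(x, y).pvalue
    pvals = np.array([p_ks, p_ad, p_cvm])
    return float(np.dot(weights, pvals)), pvals

def resampled_ks_pvalue(x, y, n_resamples=1000, seed=0):
    """Small-sample (n <= 30) calibration sketch: estimate the null
    distribution of the KS statistic by repeatedly re-splitting the
    pooled sample, rather than trusting the asymptotic p-value."""
    rng = np.random.default_rng(seed)
    observed = stats.ks_2samp(x, y).statistic
    pooled = np.concatenate([x, y])
    n = len(x)
    exceed = 0
    for _ in range(n_resamples):
        perm = rng.permutation(pooled)
        if stats.ks_2samp(perm[:n], perm[n:]).statistic >= observed:
            exceed += 1
    return (exceed + 1) / (n_resamples + 1)
```

Running the three tests together is what gives the robustness the summary describes: a shift missed by one test's sensitivity profile is usually caught by another, and the fused statistic gives a single accept/reject decision.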
For high‑dimensional extensions, the authors propose a marginal‑decomposition strategy: each dimension is treated as an independent marginal distribution, marginal KS tests are performed, and a correlation‑adjustment matrix is applied to re‑introduce inter‑dimensional dependence. When the dimensionality becomes prohibitive, random projection or principal component analysis (PCA) is employed as a preprocessing step, preserving pairwise distances with high probability while keeping computational cost close to linear.
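The marginal‑decomposition and random‑projection steps can be sketched as below. The paper's correlation‑adjustment matrix is specific to its method and is not reproduced here; what follows is only the per‑dimension testing and a standard Gaussian (Johnson–Lindenstrauss style) projection, with function names of my own choosing.

```python
import numpy as np
from scipy import stats

def marginal_ks_pvalues(X, Y):
    """Marginal-decomposition step: treat each dimension as an
    independent marginal and run a KS test per dimension.
    X, Y are (n_samples, n_dims) arrays."""
    return np.array([stats.ks_2samp(X[:, j], Y[:, j]).pvalue
                     for j in range(X.shape[1])])

def random_project(X, k, seed=0):
    """Gaussian random projection to k dimensions. The 1/sqrt(k)
    scaling approximately preserves pairwise distances with high
    probability; both samples must be projected with the same seed
    so they share one projection matrix."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], k)) / np.sqrt(k)
    return X @ R

# Usage sketch: project both samples identically, then test marginals.
# p = marginal_ks_pvalues(random_project(X, 5), random_project(Y, 5))
```

Projecting first and testing marginals afterwards is what keeps the cost near-linear in the original dimensionality, at the price of testing mixtures of the original coordinates rather than the coordinates themselves.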
The experimental evaluation uses both synthetic benchmarks and real‑world datasets from finance (stock price streams), healthcare (physiological monitoring), and network traffic (packet‑level logs). Performance is measured in terms of precision, recall, F1‑score, and average runtime. In low‑dimensional scenarios (≤5 dimensions) the proposed method outperforms a single KS test and deep‑learning baselines such as LSTM‑autoencoders, achieving an average F1‑score improvement of roughly 12 percentage points and a 30 % reduction in runtime. Crucially, in the small‑sample regime the multi‑test, p‑value‑fusion approach keeps the Type I error below 5 %, whereas a plain KS test can exceed 20 % false alarms. In higher‑dimensional experiments (≥20 dimensions) the combination of marginal testing, correlation correction, and random projection maintains detection accuracy while cutting memory usage and computational effort by about 40 % compared with traditional NCDF implementations.
The authors conclude that their NCDF framework offers a statistically sound, computationally efficient, and easily extensible solution for anomaly detection across a broad spectrum of applications. They highlight the ability to provide explicit error bounds, to operate in near‑real‑time, and to integrate with existing pipelines without requiring strong parametric assumptions. Future work is suggested in three directions: (i) extending the methodology to non‑numeric modalities such as text and images, (ii) developing adaptive parameter tuning for streaming environments, and (iii) incorporating explicit modeling of cross‑series interactions to capture more complex multivariate dynamics.