Clustering processes
The problem of clustering is considered for the case when each data point is a sample generated by a stationary ergodic process. We propose a very natural asymptotic notion of consistency and show that simple consistent algorithms exist under the most general non-parametric assumptions. The notion of consistency is as follows: two samples should be put into the same cluster if and only if they were generated by the same distribution. With this notion of consistency, clustering generalizes such classical statistical problems as homogeneity testing and process classification. We show that, for the case of a known number of clusters, consistency can be achieved under the only assumption that the joint distribution of the data is stationary ergodic (no parametric or Markovian assumptions, no assumptions of independence, neither between nor within the samples). If the number of clusters is unknown, consistency can be achieved under appropriate assumptions on the mixing rates of the processes (again, no parametric or independence assumptions). In both cases we give examples of simple algorithms (at most quadratic in each argument) that are consistent.
💡 Research Summary
The paper tackles the problem of clustering when each data point is a finite sample generated by a stationary ergodic stochastic process. Unlike conventional clustering methods that rely on Euclidean distances, density estimates, or i.i.d. assumptions, the authors adopt a distribution‑centric view: two samples belong to the same cluster if and only if they are drawn from the same underlying probability distribution. This leads to a rigorous definition of consistency: as the length of each sample and the number of samples grow without bound, a consistent algorithm must recover exactly the true partition of the data according to the generating distributions.
The authors first consider the setting where the number of clusters K is known in advance. They introduce an “empirical distributional distance” that compares the empirical frequencies of finite‑length blocks (of size m) observed in each sample: for two samples of length L, the distance is a weighted sum, over block sizes m, of the absolute differences of their empirical block frequencies. Under the sole assumption that the joint process of all samples is stationary and ergodic, the Birkhoff ergodic theorem guarantees that these empirical frequencies converge almost surely to the true block probabilities, and consequently the empirical distance converges to a genuine metric on the space of process distributions. Using this distance matrix, a simple hierarchical clustering algorithm (single‑linkage) or any K‑means‑style method can be applied. The authors prove that, with probability one, the algorithm will assign each sample to the correct cluster as L → ∞. The computational cost is quadratic in the number of samples (O(N²·L·m) for distance computation, plus O(N²) for clustering), which is polynomial and therefore practical for moderate‑size datasets.
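The known‑K procedure above can be sketched in a few lines. The following is a minimal illustration, not the paper's exact algorithm: it equally weights block sizes 1..max_m (the theoretical distance uses summable weights over all m), and it uses naive single‑linkage merging; the function names and the choice max_m=3 are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

def block_frequencies(sample, m):
    """Empirical frequencies of all length-m blocks in one sample."""
    n = len(sample) - m + 1
    counts = Counter(tuple(sample[i:i + m]) for i in range(n))
    return {block: c / n for block, c in counts.items()}

def empirical_distance(x, y, max_m=3):
    """Sum of absolute differences of block frequencies over sizes 1..max_m.
    Equal weighting of block sizes is a simplification of the paper's
    weighted distributional distance."""
    d = 0.0
    for m in range(1, max_m + 1):
        fx, fy = block_frequencies(x, m), block_frequencies(y, m)
        for block in set(fx) | set(fy):
            d += abs(fx.get(block, 0.0) - fy.get(block, 0.0))
    return d

def single_linkage(samples, k, max_m=3):
    """Repeatedly merge the two closest clusters until k remain (naive)."""
    clusters = [[i] for i in range(len(samples))]
    dist = {(i, j): empirical_distance(samples[i], samples[j], max_m)
            for i, j in combinations(range(len(samples)), 2)}
    while len(clusters) > k:
        # pick the pair of clusters at minimal single-linkage distance
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pq: min(dist[tuple(sorted((i, j)))]
                               for i in clusters[pq[0]]
                               for j in clusters[pq[1]]))
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters
```

For example, four binary samples drawn from two Bernoulli processes with parameters 0.1 and 0.9 are separated into the correct two groups once the samples are long enough for the block frequencies to concentrate.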
The second, more challenging scenario assumes that K is unknown. To make the problem tractable, the authors impose an additional mixing‑rate condition: each underlying process must be α‑mixing (or β‑mixing) with coefficients that decay at least exponentially (τ(k) ≤ C·ρ^k for some 0 < ρ < 1). This condition ensures that the empirical distributional distance converges to its limit at a rate fast enough to separate intra‑cluster distances from inter‑cluster distances. The proposed algorithm proceeds as follows: (1) compute the full pairwise distance matrix using the same empirical distance; (2) examine the empirical distribution of distances and select a threshold ε that separates a low‑distance bulk (presumed intra‑cluster) from higher distances (presumed inter‑cluster), where ε can be chosen automatically using a data‑driven “elbow” or “gap” statistic applied to the sorted distances; (3) construct a graph in which an edge connects two samples whenever their distance is below ε, and declare each connected component of this graph a cluster. The authors prove that, under the exponential mixing assumption, the probability that ε correctly separates the two distance regimes tends to one as L grows, thereby guaranteeing consistency even when K is not supplied.
To validate the theory, the authors conduct extensive experiments on synthetic data generated from a variety of processes: independent Markov chains of different orders, ARMA models, and more exotic non‑Markovian constructions, all satisfying stationarity and ergodicity. In the K‑known experiments the method achieves 100 % recovery of the true clustering, while in the K‑unknown experiments it attains over 95 % accuracy in estimating both the number of clusters and the partition. Real‑world case studies include clustering financial time series from different market sectors and climate measurement records. In these applications the proposed method outperforms standard DTW‑k‑means, HMM‑based clustering, and other baseline techniques, producing clusters that align better with known economic or geographic groupings.
The paper also discusses limitations and future directions. The consistency results rely on asymptotically long samples; with short time series the empirical distance may be noisy, potentially degrading performance. The unknown‑K algorithm requires a prior bound on the mixing rate; if the processes mix slowly, the threshold selection becomes ambiguous. Extending the framework to non‑stationary or piecewise‑stationary processes, handling high‑dimensional multivariate series (perhaps via dimensionality reduction before distance computation), and developing online versions that update distances and clusters as new data arrive are identified as promising research avenues.
In summary, this work establishes a minimal‑assumption, distribution‑based theory of clustering for stationary ergodic processes. By defining a natural empirical distance and proving its convergence under very weak conditions, the authors show that simple, polynomial‑time algorithms can achieve strong consistency both when the number of clusters is known and when it must be inferred. The results bridge classical statistical problems—such as homogeneity testing and process classification—with modern clustering, opening the door to robust, non‑parametric analysis of complex sequential data across many scientific and engineering domains.