A Cluster-based Approach for Outlier Detection in Dynamic Data Streams (KORM: k-median OutlieR Miner)
Outlier detection in data streams has gained importance with the growing incidence of fraud in streaming applications. Outlier detection techniques are commonly classified as statistics-based, distance-based, density-based, or deviation-based. Most prior work on fraud detection has been distance-based, but that approach is computationally expensive. In this paper we introduce a new clustering-based approach that divides the stream into chunks and clusters each chunk using k-median into a variable number of clusters. Instead of storing the complete data stream chunk in memory, we replace it with the weighted medians found after mining the chunk and pass that information, along with the newly arrived data chunk, to the next phase. The weighted medians found in each phase are tested for outlierness, and after a given number of phases each is declared either a real outlier or an inlier. Our technique is theoretically better than k-means because it does not fix the number of clusters at k but instead allows a range, and it provides a more stable and better solution that runs in poly-logarithmic space.
💡 Research Summary
The paper addresses the growing need for efficient outlier detection in high‑velocity data streams such as fraud monitoring and network traffic analysis. Traditional outlier detection methods—statistical, distance‑based, density‑based, and deviation‑based—are largely designed for static datasets and become computationally prohibitive when applied to unbounded streams. Distance‑based approaches (e.g., Knorr‑Ng) require user‑defined radius R and neighbor count k, while density‑based methods like LOF need extensive nearest‑neighbor searches. Moreover, many of these techniques assume the entire dataset can be stored or accessed repeatedly, which is infeasible for streaming environments.
To overcome these limitations, the authors propose a clustering‑based framework called KORM (k‑median OutlieR Miner). The core idea is to partition the incoming stream into fixed‑size chunks (called phases) and process each chunk with an online k‑median algorithm. Specifically, they adapt Meyerson’s ONLINE‑FL algorithm, which probabilistically opens a new facility (median) at a point x with probability min(θ·w / f, 1), where θ is the distance to the nearest existing facility, w is the point’s weight, and f is the current facility cost. The facility cost for phase j is defined as F_j = L_j / (k·(1+log n)), where L_j is a lower‑bound value that grows geometrically (L_{j+1}=β·L_j). Two constants γ and β (experimentally set to 34) control the cost and facility‑count limits.
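The facility-opening rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: points are one-dimensional, the logarithm in F_j is taken base 2 (the summary does not state the base), and the names `facility_cost` and `online_fl` are hypothetical.

```python
import math
import random


def facility_cost(L_j, k, n):
    """F_j = L_j / (k * (1 + log n)); base-2 log is an assumption."""
    return L_j / (k * (1 + math.log2(n)))


def online_fl(points, f, rng):
    """One pass of an ONLINE-FL style rule over weighted 1-D points.

    points: iterable of (x, w) pairs; f: facility cost;
    rng: a random.Random instance (passed in for reproducibility).
    Returns (facilities, assignment_cost), where each facility is
    a [location, weight] pair.
    """
    facilities = []
    cost = 0.0
    for x, w in points:
        if not facilities:
            # The first point always opens a facility.
            facilities.append([x, w])
            continue
        nearest = min(facilities, key=lambda fac: abs(fac[0] - x))
        theta = abs(nearest[0] - x)  # distance to nearest facility
        # Open a new facility with probability min(theta * w / f, 1).
        if rng.random() < min(theta * w / f, 1.0):
            facilities.append([x, w])
        else:
            # Otherwise assign the point: the facility absorbs its weight.
            nearest[1] += w
            cost += theta * w
    return facilities, cost
```

With a very large facility cost almost every point is absorbed by an existing facility, while a very small cost makes nearly every distinct point its own facility; the value of F_j sits between those extremes and rises as L_j grows from phase to phase.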
During a phase, each incoming point is either assigned to the nearest open facility (increasing that facility's weight) or becomes a new facility. After the whole chunk has been processed, facilities whose weight has not increased are marked as Temporal Candidate Outliers (TCOs). These TCOs are carried forward to the next phase together with the weighted facilities, which serve as the summary of the processed chunk. In each subsequent phase the weight of every TCO is re-examined: each phase in which the weight stays unchanged raises the TCO's outlier score, and once the score reaches the user-specified threshold O, that is, after O consecutive phases without growth, the point is declared a real outlier. A declared outlier is removed from all future clustering. If the weight increases before the score reaches O, the point is treated as an inlier and continues to participate in clustering.
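The candidate bookkeeping across phases might look roughly like the sketch below. It assumes each facility has a hypothetical identifier, that the weight at the start and end of a phase is available for every facility, and that a facility whose weight grows has its score reset to zero; the summary does not say whether an inlier can become a candidate again, so the reset is an assumption.

```python
def update_outlier_scores(phase_weights, scores, threshold):
    """Advance outlier scores by one phase.

    phase_weights: {facility_id: (weight_at_start, weight_at_end)};
        for a facility opened during the phase, weight_at_start is
        its own initial weight.
    scores: {facility_id: consecutive phases without weight growth}.
    threshold: the user-specified persistence threshold O.
    Returns (new_scores, outliers). Declared outliers are dropped
    from new_scores so they take no part in later phases.
    """
    new_scores = {}
    outliers = []
    for fid, (start, end) in phase_weights.items():
        if end > start:
            # The facility absorbed points: treat it as an inlier.
            new_scores[fid] = 0
            continue
        score = scores.get(fid, 0) + 1
        if score >= threshold:
            outliers.append(fid)  # real outlier, removed from clustering
        else:
            new_scores[fid] = score  # still a temporal candidate
    return new_scores, outliers


# Two phases with O = 2: "a" never gains weight, "b" grows once.
scores, found = update_outlier_scores({"a": (1, 1), "b": (1, 4)}, {}, 2)
scores, found = update_outlier_scores({"a": (1, 1), "b": (4, 4)}, scores, 2)
```

After the second call, `found` holds `"a"` and only `"b"` remains in `scores`, now as a candidate with a score of 1.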
The algorithm guarantees poly‑logarithmic space usage: the number of facilities at any time is bounded by O(k·log n), and the total memory consumption is O(k·polylog n). The authors run 2·log n parallel instances of ONLINE‑FL per phase, stopping each when either the total cost exceeds 4·L_j·(1+4(γ+β)) or the number of facilities exceeds 4·k·(1+log n)·(1+4(γ+β)). This ensures a single pass over the stream with overall time complexity O(n·polylog n).
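To make the stopping conditions concrete, the following sketch evaluates both limits and the number of parallel instances for one phase. The function name `phase_limits` is hypothetical, the logarithm is assumed to be base 2, the instance count is rounded up to an integer, and the γ and β values in the example are placeholders rather than the constants used in the paper.

```python
import math


def phase_limits(L_j, k, n, gamma, beta):
    """Return (cost_limit, facility_limit, instances) for phase j.

    cost_limit     = 4 * L_j * (1 + 4 * (gamma + beta))
    facility_limit = 4 * k * (1 + log n) * (1 + 4 * (gamma + beta))
    instances      = 2 * log n   (rounded up; base-2 log assumed)
    """
    slack = 1 + 4 * (gamma + beta)
    cost_limit = 4 * L_j * slack
    facility_limit = 4 * k * (1 + math.log2(n)) * slack
    instances = 2 * math.ceil(math.log2(n))
    return cost_limit, facility_limit, instances


# Placeholder inputs: L_j = 1, k = 2, n = 1024, gamma = 1, beta = 2.
cost_limit, facility_limit, instances = phase_limits(1.0, 2, 1024, 1, 2)
# cost_limit == 52.0, facility_limit == 1144.0, instances == 20
```

Both limits scale with the same factor 1 + 4(γ+β), so larger constants loosen the cost and facility budgets together.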
Experimental evaluation was performed on a modest Windows Vista machine (Intel Core Duo T2450, 1 GB RAM) using MATLAB. The authors compared KORM against a previous k‑means‑based streaming outlier detector (Manzoor & Li, 2009). Results show that KORM reduces memory consumption, achieves comparable or slightly higher detection accuracy (precision, recall, F‑measure), and processes data faster. The flexibility of k‑median—allowing the number of clusters to vary between k and k·log n—provides greater stability when the underlying data distribution changes over time, a scenario where fixed‑k k‑means can struggle.
Despite these advantages, the paper has several limitations. The experiments are confined to small‑scale synthetic and modest real datasets; scalability to high‑dimensional, high‑throughput streams remains untested. The constants γ and β are fixed without a systematic sensitivity analysis, and the choice of the outlier‑persistence parameter O lacks clear guidance, potentially affecting the trade‑off between false positives and false negatives. Moreover, the algorithm’s reliance on randomization (online‑FL) means performance guarantees hold only with high probability, which may be problematic for mission‑critical applications.
Future work suggested includes extending the method to distributed streaming platforms (e.g., Apache Flink, Spark Streaming), developing adaptive mechanisms for automatically tuning γ, β, and O based on stream characteristics, and evaluating the approach on large‑scale, high‑dimensional real‑world streams such as network intrusion logs or financial transaction feeds.
In summary, the paper introduces KORM, a novel streaming outlier detection framework that leverages online k‑median clustering to maintain a compact weighted summary of the data, identifies candidate outliers via unchanged facility weights, and confirms true outliers after persistent inactivity across multiple chunks. The method achieves poly‑logarithmic space, constant‑factor approximation to the k‑median objective, and demonstrates improved stability and efficiency over earlier k‑means‑based streaming outlier detectors.