Untangling the Braid: Finding Outliers in a Set of Streams
📝 Abstract
Monitoring the performance of large shared computing systems such as the cloud computing infrastructure raises many challenging algorithmic problems. One common problem is to track users with the largest deviation from the norm (outliers), for some measure of performance. Taking a stream-computing perspective, we can think of each user’s performance profile as a stream of numbers (such as response times), and the aggregate performance profile of the shared infrastructure as a “braid” of these intermixed streams. The monitoring system’s goal then is to untangle this braid sufficiently to track the top k outliers. This paper investigates the space complexity of one-pass algorithms for approximating outliers of this kind, proves lower bounds using multi-party communication complexity, and proposes small-memory heuristic algorithms. On one hand, stream outliers are easily tracked for simple measures, such as max or min, but our theoretical results rule out even good approximations for most of the natural measures such as average, median, or the quantiles. On the other hand, we show through simulation that our proposed heuristics perform quite well for a variety of synthetic data.
📄 Content
Imagine a general-purpose stream monitoring system faced with the task of detecting misbehaving streams among a large number of distinct data streams. For instance, a network diagnostic program at an IP router may wish to highlight flows whose packets experience unusually large average network latency. Or, a cloud computing service such as Yahoo Mail or Amazon’s Simple Storage Service (S3), catering to a large number of distinct users, may wish to track the quality of service experienced by its users. The performance monitoring of large, shared infrastructures, such as cloud computing, provides a compelling backdrop for our research, so let us dwell on it briefly. An important characteristic of cloud computing applications is their sheer scale and large number of users: Yahoo Mail and Hotmail support more than 250 million users, with each user having several GBs of storage. At this scale, any downtime or performance degradation affects many users: even a guarantee of 99.9% availability (the published figure for Google Apps, including Gmail) leaves open the possibility of a large number of users suffering downtime or performance degradation. In other words, even 0.1% user downtime affects 250,000 users and translates to a significant loss of productivity. Managing and monitoring systems of this scale presents many algorithmic challenges, including the one we focus on: in the multitude of users, track those receiving the worst service.
Taking a stream-computing perspective, we can think of each user’s performance profile as a stream of numbers (such as response times), and the aggregate performance profile of the whole infrastructure as a braid of these intermixed streams. The monitoring system’s goal then is to untangle this braid sufficiently to track the top k outliers. In this paper, we study questions motivated by this general setting, such as “which stream has the highest average latency?”, or “what is the median latency of the k worst streams?,” “how many streams have their 95th percentile latency less than a given value?” and so on.
These problems seem to require peering into individual streams more deeply than is typical in most of the extant literature. In particular, while problems such as heavy hitters and quantiles also aim to understand the statistical properties of IP traffic or the latency distributions of web servers, they do so at an aggregate level: heavy hitters isolate flows that have large total mass, or users whose total response time is cumulatively large. In our context, this may be uninteresting, because a user can accumulate a large total response time simply by sending many requests, even though each request is satisfied quickly. On the other hand, streams that consistently show high latency are a cause for alarm. More generally, we wish to isolate flows or users whose service response is bad at a finer level, perhaps taking into account the entire distribution.
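To make this distinction concrete, here is a small sketch (with an invented braid of users and latencies; not from the paper) contrasting a heavy-hitter-style query on total response time with a per-request average. A chatty but fast user dominates the total, while the genuinely slow stream is only visible through the average:

```python
from collections import defaultdict

# Hypothetical braid: interleaved (user_id, response_time) pairs.
# User "A" sends many fast requests; user "B" sends a few slow ones.
braid = [("A", 0.1)] * 1000 + [("B", 5.0)] * 10

total = defaultdict(float)   # cumulative response time per user
count = defaultdict(int)     # number of requests per user
for user, latency in braid:
    total[user] += latency
    count[user] += 1

# A heavy-hitter query on total mass flags the chatty-but-fast user,
worst_by_total = max(total, key=total.get)                    # "A" (~100 s total)
# while the per-request average exposes the genuinely slow stream.
worst_by_avg = max(total, key=lambda u: total[u] / count[u])  # "B" (5 s average)
```

Here user A's total (about 100 s) dwarfs B's (50 s), yet every one of A's requests completes quickly; it is B's stream that signals a service problem.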
We have a set B, which we call a braid, of m streams {S_1, S_2, ..., S_m}, where the i-th stream has size n_i, namely, n_i = |S_i|. We assume that the number of streams is large and each stream contains a potentially unbounded number of items; that is, m ≫ 1 and n_i ≫ 1 for all i. By v_{ij}, we mean the value of the j-th item in the stream S_i; we make no assumptions about the v_{ij} beyond that they are real-valued.
In the examples mentioned above, v_{ij} represents the latency of the j-th request by the i-th user. We formalize the misbehavior quality of a stream by an abstract weight function l, which is a function of the set of values in the stream. For instance, l(S) may denote the average or a particular quantile of the stream S. Our goal is to design streaming algorithms that can estimate certain fundamental statistics of the set {l(S_1), l(S_2), ..., l(S_m)}.
When needed, we use a self-descriptive superscript for specific weight functions, such as l^avg for average, l^med for median, l^max for maximum, and l^min for minimum. For instance, if we choose the weight to be the average, then l^avg(S) denotes the average value in stream S, and max{l^avg(S_1), l^avg(S_2), ..., l^avg(S_m)} identifies the worst stream by average latency. Throughout, we focus on the one-pass model of data streams.
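As an illustration of these definitions, the following offline sketch groups an interleaved braid by stream id, applies a weight function l (here the average), and reports the top-k streams by weight. The braid data and the function name `stream_weights` are invented for the example, and buffering every stream like this is exactly what a one-pass small-memory algorithm cannot afford; this is only the exact reference computation the streaming algorithms must approximate:

```python
import heapq
import statistics
from collections import defaultdict

def stream_weights(braid, l):
    """Group an interleaved braid by stream id and apply the weight function l."""
    streams = defaultdict(list)
    for sid, value in braid:
        streams[sid].append(value)
    return {sid: l(values) for sid, values in streams.items()}

# Invented toy braid: interleaved (stream_id, latency) pairs.
braid = [(1, 2.0), (2, 9.0), (1, 4.0), (3, 1.0), (2, 7.0), (3, 2.0)]

l_avg = stream_weights(braid, statistics.mean)   # {1: 3.0, 2: 8.0, 3: 1.5}
top_k = heapq.nlargest(2, l_avg, key=l_avg.get)  # streams [2, 1]
```

Swapping `statistics.mean` for `statistics.median`, `max`, or `min` yields l^med, l^max, and l^min in the same way.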
As is commonly the case with data stream algorithms, we must content ourselves with approximate weight statistics because even in the single stream setting neither quantiles nor frequent items can be computed exactly. With this in mind, let us now precisely define what we mean by a guaranteed-quality approximation of high weight streams. There are two natural and commonly used ways to quantify an approximation: by rank or by value. (Recall that the rank of an element x in a set is the number of items with value equal to or less than x.)
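The rank notion just recalled translates directly into code; this tiny helper (a hypothetical name, for illustration only) counts the items with value equal to or less than x:

```python
def rank(x, values):
    """Rank of x in `values`: the number of items with value <= x."""
    return sum(1 for v in values if v <= x)

weights = [3.0, 1.0, 4.0, 1.5]
r = rank(3.0, weights)  # 1.0, 1.5, and 3.0 itself, so rank 3
```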
• Rank Approximation: Let l be an arbitrary weight function (e.g., the median), and let l_i = l(S_i) be the value of this function for stream S_i. We say that a value l'_i is
This content is AI-processed based on ArXiv data.