Continuous Monitoring of Distributed Data Streams over a Time-based Sliding Window

📝 Original Info

  • Title: Continuous Monitoring of Distributed Data Streams over a Time-based Sliding Window
  • ArXiv ID: 0912.4569
  • Date: 2010-02-03
  • Authors: Ho-Leung Chan, Tak-Wah Lam, Lap-Kei Lee, Hing-Fung Ting

📝 Abstract

The past decade has witnessed many interesting algorithms for maintaining statistics over a data stream. This paper initiates a theoretical study of algorithms for monitoring distributed data streams over a time-based sliding window (which contains a variable number of items and possibly out-of-order items). The concern is how to minimize the communication between individual streams and the root, while allowing the root, at any time, to report the global statistics of all streams within a given error bound. This paper presents communication-efficient algorithms for three classical statistics, namely, basic counting, frequent items and quantiles. The worst-case communication cost over a window is $O(\frac{k}{\epsilon} \log \frac{\epsilon N}{k})$ bits for basic counting and $O(\frac{k}{\epsilon} \log \frac{N}{k})$ words for the remaining two statistics, where $k$ is the number of distributed data streams, $N$ is the total number of items in the streams that arrive or expire in the window, and $\epsilon < 1$ is the desired error bound. Matching and nearly matching lower bounds are also obtained.
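To get a feel for the two bounds quoted in the abstract, the following sketch simply evaluates the expressions inside the $O(\cdot)$ for sample parameters. The function names are hypothetical, the hidden constant is set to 1, and the logarithm is taken base 2; all three are assumptions made for illustration, so only the orders of magnitude are meaningful.

```python
import math


def counting_bound_bits(k, eps, n):
    """Evaluate (k/eps) * log2(eps*n/k), the basic-counting expression.

    Assumption: hidden constant 1 and base-2 logarithm (not fixed by the O-notation).
    """
    return (k / eps) * math.log2(eps * n / k)


def items_bound_words(k, eps, n):
    """Evaluate (k/eps) * log2(n/k), the frequent-items/quantiles expression.

    Same assumptions as above.
    """
    return (k / eps) * math.log2(n / k)


if __name__ == "__main__":
    # Sample parameters chosen for illustration only.
    k, eps, n = 10, 0.01, 10**6
    print("basic counting (bits): ", counting_bound_bits(k, eps, n))
    print("frequent items (words):", items_bound_words(k, eps, n))
    print("items in the window:   ", n)
```

With $k = 10$, $\epsilon = 0.01$ and $N = 10^6$, both expressions come out on the order of $10^4$, roughly two orders of magnitude below $N$, which is the cost of forwarding every item to the root.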

📄 Full Content

arXiv:0912.4569v2 [cs.DS] 3 Feb 2010

Symposium on Theoretical Aspects of Computer Science 2010 (Nancy, France), pp. 179-190, www.stacs-conf.org

CONTINUOUS MONITORING OF DISTRIBUTED DATA STREAMS OVER A TIME-BASED SLIDING WINDOW

HO-LEUNG CHAN¹, TAK-WAH LAM¹, LAP-KEI LEE², AND HING-FUNG TING¹

¹ Department of Computer Science, University of Hong Kong, Hong Kong. E-mail address: {hlchan,twlam,hfting}@cs.hku.hk
² Max-Planck-Institut für Informatik, 66123 Saarbrücken, Germany. E-mail address: lklee@mpi-inf.mpg.de

Abstract. The past decade has witnessed many interesting algorithms for maintaining statistics over a data stream. This paper initiates a theoretical study of algorithms for monitoring distributed data streams over a time-based sliding window (which contains a variable number of items and possibly out-of-order items). The concern is how to minimize the communication between individual streams and the root, while allowing the root, at any time, to report the global statistics of all streams within a given error bound. This paper presents communication-efficient algorithms for three classical statistics, namely, basic counting, frequent items and quantiles. The worst-case communication cost over a window is $O(\frac{k}{\epsilon} \log \frac{\epsilon N}{k})$ bits for basic counting and $O(\frac{k}{\epsilon} \log \frac{N}{k})$ words for the remaining two statistics, where $k$ is the number of distributed data streams, $N$ is the total number of items in the streams that arrive or expire in the window, and $\epsilon < 1$ is the desired error bound. Matching and nearly matching lower bounds are also obtained.

1. Introduction

The problems studied in this paper are best illustrated by the following puzzle. John and Mary work in different laboratories and communicate by telephone only. In a forever-running experiment, John records every 10 seconds which devices have an exceptional signal.
To adjust her devices, Mary needs to keep track, at any time, of the number of exceptional signals generated by each of John's devices in the last hour. John can call Mary every 10 seconds to report the exceptional signals, yet this requires too many calls in an hour, and the total message size per hour is linear in the total number $N$ of exceptional signals in an hour. Mary's devices actually allow some small error. Can the number of calls and the message size be reduced to $o(N)$, or even poly-log $N$, if a small error (say, 0.1%) is allowed? It is important to note that the input is given online and Mary needs to know the answers continuously; this makes our problem different from those in other similar classical models, such as the Simultaneous Communication Complexity model [4], in which all inputs are given in advance and the parties need to compute an answer only once.

1998 ACM Subject Classification: F.2.2 [Analysis of algorithms and problem complexity]: Nonnumerical algorithms and problems.
Key words and phrases: Algorithms, distributed data streams, communication efficiency, frequent items.
T.W. Lam is partially supported by the GRF Grant HKU-713909E; H.F. Ting is partially supported by the GRF Grant HKU-716307E.
© H.L. Chan, T.W. Lam, L.K. Lee, and H.F. Ting. Creative Commons Attribution-NoDerivs License.

Motivation. The above problem appears in data stream applications, e.g., network monitoring or stock analysis. In the last decade, algorithms for continuous monitoring of a single massive data stream gained a lot of attention (see [1, 26] for a survey), and the main challenge has been how to represent the massive data using limited space, while allowing certain statistics (e.g., item counts, quantiles) to be computed with sufficient accuracy. The space-accuracy tradeoff for representing a single stream has gradually been understood over the years (e.g., [2, 15, 18, 19]).
Recently, motivated by large scale networks, the database community is enthusiastic about communication-efficient algorithms for continuous monitoring of multiple, distributed data streams. In such applications, we have $k \geq 1$ remote sites each monitoring a data stream, and there is a root (or coordinator) responsible for computing some global statistics. A remote site needs to maintain certain statistics itself, and has to communicate with the root often enough so that the root can compute, at any time, the statistics of the union of all data streams within a certain error. The objective is to minimize the communication. The communication aspects of data streams introduce several challenging theoretical questions such as what is the optimal communication-accuracy tradeoff for maintaining a particular statistic, and whether two-way communication is inherently more efficient than one-way communication.

Data stream models and ε-approximate queries. The data stream at each remote site is a sequence of items from a totally ordered set $U$. Each item is associated with an integral time-stamp recording its arrival time. Each remote site has limited space and hence it can only maintai

…(Full text truncated)…
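The setting described above ($k$ remote sites, one root, answers within a relative error ε over a time-based window) can be illustrated with a deliberately naive sketch. This is not the paper's algorithm: the class names `RemoteSite` and `Root`, the function `simulate`, and the simple "report when the local count drifts by more than ε" rule are all assumptions made for illustration. Each site here also stores every time-stamp in the window, so it ignores the limited-space constraint, and it does not handle out-of-order items.

```python
import random
from collections import deque


class RemoteSite:
    """Hypothetical site: exact local window count, lazy one-way reporting."""

    def __init__(self, window, eps):
        self.window = window   # window length W, in time units
        self.eps = eps         # relative error tolerance
        self.stamps = deque()  # time-stamps of items still inside the window
        self.reported = 0      # last count sent to the root
        self.messages = 0      # messages sent so far

    def step(self, now, arrivals):
        """Advance to time `now`; return a count to send, or None."""
        self.stamps.extend([now] * arrivals)
        # The window covers the interval (now - W, now].
        while self.stamps and self.stamps[0] <= now - self.window:
            self.stamps.popleft()
        count = len(self.stamps)
        # Report only when the root's copy is off by more than eps * count.
        if abs(count - self.reported) > self.eps * count:
            self.reported = count
            self.messages += 1
            return count
        return None


class Root:
    """Hypothetical root: sums the last count received from each site."""

    def __init__(self, k):
        self.last = [0] * k

    def receive(self, site_id, count):
        self.last[site_id] = count

    def estimate(self):
        return sum(self.last)


def simulate(k=3, window=360, eps=0.1, steps=2000, seed=0):
    """Run random arrivals; return (error always within eps, messages, naive messages)."""
    rng = random.Random(seed)
    sites = [RemoteSite(window, eps) for _ in range(k)]
    root = Root(k)
    within_bound = True
    for now in range(steps):
        for i, site in enumerate(sites):
            update = site.step(now, rng.randint(0, 2))
            if update is not None:
                root.receive(i, update)
        true_total = sum(len(s.stamps) for s in sites)
        if abs(root.estimate() - true_total) > eps * true_total + 1e-9:
            within_bound = False
    return within_bound, sum(s.messages for s in sites), k * steps


if __name__ == "__main__":
    ok, sent, naive = simulate()
    print("within error bound:", ok)
    print("messages sent:", sent, "versus one per site per step:", naive)
```

Because every site keeps its own reported count within ε of its true count, the sum held by the root is within ε of the true total, which is the kind of guarantee an ε-approximate basic-counting query asks for. The sketch says nothing about worst-case communication; it only shows why tolerating a small error lets a site stay silent most of the time.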

Reference

This content is AI-processed based on ArXiv data.
