Randomized Algorithms for Tracking Distributed Count, Frequencies, and Ranks
Randomized Algorithms for Tracking Distributed Count, Frequencies, and Ranks

Zengfeng Huang, Ke Yi
Hong Kong University of Science and Technology
{huangzf, yike}@cse.ust.hk

Qin Zhang
MADALGO, University of Aarhus
qinzhang@cs.au.dk

October 31, 2018

Abstract

We show that randomization can lead to significant improvements for a few fundamental problems in distributed tracking. Our basis is the count-tracking problem, where there are $k$ players, each holding a counter $n_i$ that gets incremented over time, and the goal is to track an $\varepsilon$-approximation of their sum $n = \sum_i n_i$ continuously at all times, using minimum communication. While the deterministic communication complexity of the problem is $\Theta(k/\varepsilon \cdot \log N)$, where $N$ is the final value of $n$ when the tracking finishes, we show that with randomization, the communication cost can be reduced to $\Theta(\sqrt{k}/\varepsilon \cdot \log N)$. Our algorithm is simple and uses only $O(1)$ space at each player, while the lower bound holds even assuming each player has infinite computing power. Then, we extend our techniques to two related distributed tracking problems: frequency-tracking and rank-tracking, and obtain similar improvements over previous deterministic algorithms. Both problems are of central importance in large data monitoring and analysis, and have been extensively studied in the literature.

1 Introduction

We start with a very basic problem in distributed tracking, what we call count-tracking. There are $k$ players, each holding a counter $n_i$ that is initially $0$. Over time, the counters get incremented, and we denote by $n_i(t)$ the value of the counter $n_i$ at time $t$. The goal is to track an $\varepsilon$-approximation of the total count $n(t) = \sum_i n_i(t)$, i.e., an $\hat{n}(t)$ such that $(1-\varepsilon) n(t) \le \hat{n}(t) \le (1+\varepsilon) n(t)$, continuously at all times. (We sometimes omit "$(t)$" when the context is clear.)
There is a coordinator whose job is to maintain such an $\hat{n}(t)$, and it will try to do so using minimum communication with the $k$ players (the formal model of computation will be defined shortly). There is a trivial solution to the count-tracking problem: every time a counter $n_i$ has increased by a $1+\varepsilon$ factor, the player informs the coordinator of the change. Thus, the coordinator always has an $\varepsilon$-approximation of every $n_i$, hence an $\varepsilon$-approximation of their sum $n$. Letting $N$ denote the final value of $n$, a simple analysis shows that the communication cost of this algorithm is $O(k/\varepsilon \cdot \log N)$. (A more careful analysis leads to a slightly better bound of $O(k/\varepsilon \cdot \log(\varepsilon N/k))$, but we will assume that $N$ is sufficiently large, compared to $k$ and $1/\varepsilon$, to simplify the bounds.) This algorithm was actually used in [16] for solving essentially the same problem, which also provided many practical motivations for studying this problem. Note that this algorithm is deterministic and only uses one-way communication (from the players to the coordinator), and yet it turns out this simple algorithm is already

Table 1: Space and communication costs of previous and new algorithms. We assume $k \le 1/\varepsilon^2$. All upper bounds are in terms of words.

  count-tracking (trivial):     space $O(1)$;   communication $\Theta(k/\varepsilon \cdot \log N)$
  count-tracking (new):         space $O(1)$;   communication $O(\sqrt{k}/\varepsilon \cdot \log N)$, $\Omega(\sqrt{k}/\varepsilon \cdot \log N)$ messages
  frequency-tracking ([29]):    space $O(1/\varepsilon)$;   communication $\Theta(k/\varepsilon \cdot \log N)$
  frequency-tracking (new):     space $O(1/(\varepsilon\sqrt{k}))$, $\Omega(1/(\varepsilon\sqrt{k}))$ bits$^\star$;   communication $O(\sqrt{k}/\varepsilon \cdot \log N)$, $\Omega(\sqrt{k}/\varepsilon \cdot \log N)$ messages
  rank-tracking ([29]):         space $O(1/\varepsilon \cdot \log n)$;   communication $O(k/\varepsilon \cdot \log N \log^2(1/\varepsilon))$
  rank-tracking (new):          space $O\big(1/(\varepsilon\sqrt{k}) \cdot \log^{1.5}\frac{1}{\varepsilon} \log^{0.5}\frac{1}{\varepsilon\sqrt{k}}\big)$, $\Omega(1/(\varepsilon\sqrt{k}))$ bits$^\star$;   communication $O\big(\sqrt{k}/\varepsilon \cdot \log N \log^{1.5}\frac{1}{\varepsilon\sqrt{k}}\big)$, $\Omega(\sqrt{k}/\varepsilon \cdot \log N)$ messages
  sampling ([9]):               space $O(1)$;   communication $O(1/\varepsilon^2 \cdot \log N)$

$^\star$ This is conditioned upon the communication cost being $O(\sqrt{k}/\varepsilon \cdot \log N)$ bits.
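The trivial deterministic algorithm is easy to state in code. Below is a minimal single-process Python sketch (the class and variable names are ours, not the paper's): each player reports its counter whenever it has grown by a $1+\varepsilon$ factor since its last report, so the coordinator's sum $\hat{n}$ always satisfies $\hat{n} \le n \le (1+\varepsilon)\hat{n}$.

```python
class CountTracker:
    """Deterministic count-tracking sketch: each player reports its counter
    whenever it has grown by a (1 + eps) factor since its last report."""

    def __init__(self, k, eps):
        self.eps = eps
        self.counts = [0] * k      # true local counters n_i
        self.reported = [0] * k    # last value each player sent
        self.messages = 0          # total messages to the coordinator

    def increment(self, i):
        self.counts[i] += 1
        # Report on the first element, or once n_i reaches (1+eps) * last report.
        if self.reported[i] == 0 or self.counts[i] >= (1 + self.eps) * self.reported[i]:
            self.reported[i] = self.counts[i]
            self.messages += 1

    def estimate(self):
        # The coordinator's estimate: sum of last reported values.
        return sum(self.reported)
```

Each player reports $O(\log_{1+\varepsilon} N)$ times, giving the $O(k/\varepsilon \cdot \log N)$ total cost mentioned above.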
optimal for deterministic algorithms, even if two-way communication is allowed [29]. Thus the immediate questions are: What about randomized algorithms that are allowed to fail with a small probability? Is two-way communication not useful at all? In this paper, we set out to address these questions, and then move on to consider other related distributed tracking problems.

1.1 The distributed tracking model

We first give a more formal definition of the computation model that we will work with, which is essentially the same as those used in prior work on distributed tracking [2, 3, 5, 6, 8, 9, 16, 23, 29]. There are $k$ distributed sites $S_1, \ldots, S_k$, each receiving a stream of elements over time, possibly at varying rates. Let $N$ be the total number of elements in all $k$ streams. We denote by $A_i(t)$ the multiset (bag) of elements received by $S_i$ up until time $t$, and let $A(t) = \biguplus_{i=1}^k A_i(t)$ be the combined data set, where $\uplus$ denotes multiset addition. There is a coordinator whose job is to maintain (an approximation of) $f(A(t))$ continuously at all times, for a given function $f$ (e.g., $f(A(t)) = |A(t)|$ for the count-tracking problem above). The coordinator has a direct two-way communication channel with each of the sites; note that broadcasting a message costs $k$ times the communication of a single message. The sites do not communicate with each other directly, but this is not a limitation since they can always pass messages via the coordinator. We assume that communication is instant, i.e., no element will arrive until all parties have decided not to send more messages. As in prior work, our measures of complexity will be the communication cost and the space used to process each stream. Unless otherwise specified, the unit of both measures is a word, and we assume that any integer less than $N$, as well as an element from the stream, can fit in one word.
This model was initially abstracted from many applied settings, ranging from distributed data monitoring and wireless sensor networks to network traffic analysis, and has been extensively studied in the database community. Since 2008 [8], the model has also attracted interest from the theory community, as it naturally combines two well-studied models: the data stream model and multi-party communication complexity. When there is only $k = 1$ site, which also plays the role of the coordinator, the model degenerates to the standard streaming model; when $k \ge 2$ and our goal is to do a one-shot computation of $f(A(\infty))$, the model degenerates to the (number-in-hand) $k$-party communication model. Thus, distributed tracking is more general than both models. Meanwhile, it also appears to be significantly different from either, with the above count-tracking problem being the best example. This problem is trivial in both the streaming and the communication model (even computing the exact count is trivial), whereas it becomes nontrivial in the distributed tracking model and requires new techniques, especially when randomization is allowed, as illustrated by our results in this paper. Note that there is some work on distributed streaming (see e.g. [10, 11, 17, 30]) that adopts a model very similar to ours, but with a fundamental difference. In that model there are $k$ streams, each of which runs a streaming algorithm on its local data, but the function $f$ on the combined streams is computed only at the end or upon requests by the user. As one can see, the count-tracking problem is also trivial in this model. The crucial difference is that, in that model, the sites wait passively to get polled: if we want to track $f$ continuously, we have to poll the sites all the time. In our model, by contrast, the sites actively participate in the tracking protocol to make sure that $f$ is always up-to-date.
1.2 Problem statements, previous and new results

In this paper, we first study the count-tracking problem. Then we extend our approach to two related, more general problems: frequency-tracking and rank-tracking. Both problems are of central importance in large data monitoring and analysis, and have been extensively studied in the literature. In all the communication upper bounds, we will assume $k \le 1/\varepsilon^2$; otherwise all of them will carry an extra additive $O(k \log N)$ term. There are other good reasons to justify this assumption, which we will explain later. All our results are summarized in Table 1; below we discuss each of them respectively.

As mentioned earlier, the deterministic communication complexity for the count-tracking problem has been settled at $\Theta(k/\varepsilon \cdot \log N)$ [29], with or without two-way communication. (The lower bound in [29] was stated for the heavy-hitters tracking problem, but essentially the same proof works for count-tracking.) In this paper, we show that with randomization and two-way communication, this is reduced to $\Theta(\sqrt{k}/\varepsilon \cdot \log N)$. We first present, in Section 2.1, a randomized algorithm with this communication cost that, at any one given time instance, maintains an $\varepsilon$-approximation of the current $n$ with a constant probability. The algorithm is very simple and uses $O(1)$ space at each site. It is easy to make the algorithm correct for all time instances and boost the probability to $1-\delta$: since we can use the same approximate value $\hat{n}$ of $n$ until $n$ grows by a $1+\varepsilon$ factor, it suffices to make the algorithm correct for $O(\log_{1+\varepsilon} N) = O(1/\varepsilon \cdot \log N)$ time instances. Then running $O\big(\log\frac{\log N}{\delta\varepsilon}\big)$ independent copies of the algorithm and taking the median will achieve the goal of tracking $n$ continuously at all times, with probability at least $1-\delta$. The $\Omega(\sqrt{k}/\varepsilon \cdot \log N)$ lower bound (Section 2.2) actually holds on the number of messages that have to be exchanged, regardless of the message size, and holds even assuming the sites have unlimited space and computing power.
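The median trick used for this boosting is standard; here is a minimal Python sketch of it (the function name is ours). If each copy is correct with constant probability above $1/2$, the median of the copies is correct except with probability exponentially small in the number of copies.

```python
import statistics


def median_boost(run_copy, copies):
    """Run `copies` independent copies of a randomized estimator and return
    the median. If each copy lands within the error bound with probability
    at least 3/4, the median fails only when at least half of the copies
    fail, which happens with probability exponentially small in `copies`."""
    return statistics.median(run_copy() for _ in range(copies))
```

In the paper's setting, `run_copy` would be one independent instance of the randomized count-tracking estimator; a single wildly wrong copy cannot move the median.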
That randomization is necessary to achieve this $\sqrt{k}$-factor improvement follows from the previous deterministic lower bound [29]; in Section 2.2 we give a proof that two-way communication is also required. More precisely, we show that any randomized algorithm with one-way communication has to use $\Omega(k/\varepsilon \cdot \log N)$ communication, i.e., the same as that for deterministic algorithms.

In the frequency-tracking (a.k.a. heavy-hitters tracking) problem, $A(t)$ is a multiset of cardinality $n(t)$ at time $t$. Let $f_j(t)$ be the frequency of element $j$ in $A(t)$. The goal is to maintain a data structure from which $f_j(t)$, for any given $j$, can be estimated with absolute error at most $\varepsilon n(t)$, with probability at least $0.9$ (say). Note that this problem degenerates to count-tracking when there is only one element. It is reasonable to ask for an error in terms of $n(t)$: if the error were $\varepsilon f_j(t)$, then every element would have to be reported if they were all distinct. In fact, this error requirement is the widely accepted definition for the heavy-hitters problem, which has been extensively studied in the streaming literature [7]. Several algorithms with the optimal $O(1/\varepsilon)$ space exist [18-20]. In the distributed tracking model, we previously [29] gave a deterministic algorithm with $O(k/\varepsilon \cdot \log N)$ communication, which is the best possible for deterministic algorithms. In this paper, by generalizing our count-tracking algorithm, we reduce the cost to $O(\sqrt{k}/\varepsilon \cdot \log N)$ with randomization (Section 3). Since this problem is more general than count-tracking, by the count-tracking lower bound this is also optimal. Our algorithm uses $O(1/(\varepsilon\sqrt{k}))$ space to process the stream at each site, which is actually smaller than the $\Omega(1/\varepsilon)$ space lower bound for this problem in the streaming model.
This should not come as a surprise: because a site is allowed to communicate with the coordinator during the streaming process, the streaming lower bounds do not apply in our model. To this end, we prove a new space lower bound of $\Omega(1/(\varepsilon\sqrt{k}))$ bits for our model, showing that our algorithm also uses near-optimal space. This space lower bound is conditioned upon the requirement that the communication cost be $O(\sqrt{k}/\varepsilon \cdot \log N)$ bits. Note that it is not possible to prove a space lower bound unconditional of communication: a site can send every element to the coordinator and thus only needs $O(1)$ space. In fact, what we prove is a space-communication trade-off; please see Section 3.2 for the precise statement.

For the rank-tracking problem, it will be convenient to assume that the elements are drawn from a totally ordered universe and $A(t)$ contains no duplicates. The rank of an element $x$ in $A(t)$ ($x$ may not be in $A(t)$) is the number of elements in $A(t)$ smaller than $x$, and our goal is to compute a data structure from which the rank of any given $x$ can be estimated with error at most $\varepsilon n(t)$, with constant probability. Note that a rank-tracking algorithm also solves the frequency-tracking problem (but not vice versa): turn each element $x$ into a pair $(x, y)$ to break all ties and maintain such a rank-tracking data structure; when the frequency of $x$ is desired, ask for the ranks of $(x, 0)$ and $(x, \infty)$ and take the difference. We previously [29] gave a deterministic algorithm for the rank-tracking problem with communication $O(k/\varepsilon \cdot \log N \log^2(1/\varepsilon))$. In this paper, we show in Section 4 how randomization can bring this down to $O(\sqrt{k}/\varepsilon \cdot \log N \log^{1.5}(1/(\varepsilon\sqrt{k})))$, which is again optimal ignoring $\mathrm{polylog}(1/\varepsilon, k)$ factors. Since rank-tracking is more general than frequency-tracking, the previous lower bounds also hold here.
Our algorithm uses space that is also close to the $\Omega(1/(\varepsilon\sqrt{k}))$ lower bound.

Since we are talking about randomized algorithms with a constant success probability, we should also compare with random sampling. It is well known [25] that this probabilistic guarantee can be achieved for all the problems above by taking a random sample of size $O(1/\varepsilon^2)$. A random sample can be maintained continuously over distributed streams [9], solving these distributed tracking problems with a communication cost of $O(1/\varepsilon^2 \cdot \log N)$. This is worse than our algorithms when $k = o(1/\varepsilon^2)$. As noted earlier, all the upper bounds we have mentioned above have a hidden additive $O(k \log N)$ term, including that for the random sampling algorithm. Thus when $k = \Omega(1/\varepsilon^2)$, all of them boil down to $O(k \log N)$, while $\Omega(k)$ is an easy lower bound for all these problems (see Theorem 2.3). This means that when $k = \Omega(1/\varepsilon^2)$, all the problems can be solved optimally by just random sampling, up to an $O(\log N)$ factor. Therefore, $k = o(1/\varepsilon^2)$ is the more interesting case worth studying. In addition, as the error (in particular for the frequency-tracking and the rank-tracking problems) is in terms of $n$, the current size of the entire data set, typical values of $\varepsilon$ are quite small. For example, $\varepsilon = 10^{-2} \sim 10^{-4}$ was used in the experimental study [7] for these problems in the streaming model, while $k$ usually ranges from $10$ to $10^4$. Thus we will assume $k \le 1/\varepsilon^2$ in all the upper bounds throughout the paper.

The idea behind all our algorithms is very simple. Instead of deterministic algorithms, we use randomized algorithms that produce unbiased estimators for the $n_i$, the frequencies, and the ranks, each with variance $(\varepsilon n)^2/k$, leading to an overall variance of $(\varepsilon n)^2$, which is sufficient to produce an estimate within error $\varepsilon n$ with constant probability. This means we can afford an error of $\varepsilon n/\sqrt{k}$ from each site, as opposed to $\varepsilon n/k$ for deterministic algorithms. This is essentially where we obtain the $\sqrt{k}$-factor improvement from randomization. Our algorithms are simple and extremely lightweight, in particular the count-tracking and frequency-tracking algorithms, and thus can be easily implemented in power-limited distributed systems like wireless sensor networks.

1.3 Other related work

As distributed tracking is closely related to the streaming and the $k$-party communication models, it is enlightening to compare with the known results for the above problems in those models. As mentioned earlier, the count-tracking problem is trivial in both models, requiring $O(1)$ space in the streaming model and $O(k)$ communication in the $k$-party communication model.

Both the frequency-tracking and rank-tracking problems have been extensively studied in the streaming model, with a long history. The former was first resolved by the MG algorithm [20] with the optimal space $O(1/\varepsilon)$, though several other algorithms with the same space bound have been proposed since [18, 19]. The rank problem is also one of the earliest problems studied in the streaming model [21]. The best deterministic algorithm to date is the one by Greenwald and Khanna [12]. It uses $O(1/\varepsilon \cdot \log n)$ working space to maintain a structure of size $O(1/\varepsilon)$, from which any rank can be estimated with error $\varepsilon n$. Note that the rank problem is often studied as the quantiles problem in the literature. Recall that for any $0 \le \varphi \le 1$, the $\varphi$-quantile of $A(t)$ is the element in $A(t)$ that ranks at $\lfloor \varphi n \rfloor$, while an $\varepsilon$-approximate $\varphi$-quantile is any element that ranks between $(\varphi - \varepsilon)n$ and $(\varphi + \varepsilon)n$. Clearly, if we have the data structure for one problem, we can do a binary search to solve the other. Thus the two problems are equivalent for deterministic algorithms.
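To make the rank/quantile reduction concrete, here is a small Python sketch (ours, not from the paper) using an exact rank oracle over a sorted list of distinct elements. The same binary search over ranks works verbatim with any approximate rank structure in place of the exact oracle.

```python
import bisect


def rank(sorted_data, x):
    """Number of elements smaller than x (the rank definition used above)."""
    return bisect.bisect_left(sorted_data, x)


def quantile(sorted_data, phi):
    """phi-quantile via binary search over ranks: the element whose rank is
    floor(phi * n). Only the rank oracle is used, so an approximate rank
    structure yields an approximate quantile in the same way."""
    n = len(sorted_data)
    target = min(int(phi * n), n - 1)  # clamp phi = 1 to the last element
    lo, hi = 0, n - 1
    while lo < hi:  # smallest position whose rank reaches the target
        mid = (lo + hi) // 2
        if rank(sorted_data, sorted_data[mid]) < target:
            lo = mid + 1
        else:
            hi = mid
    return sorted_data[lo]
```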
For algorithms with probabilistic guarantees, we need all $O(\log(1/\varepsilon))$ decisions in the binary search to succeed, which requires the failure probability to be lowered by an $O(\log(1/\varepsilon))$ factor. By running $O(\log\log(1/\varepsilon))$ independent copies of the algorithm, this is not a problem. So the two problems differ by at most a factor of $O(\log\log(1/\varepsilon))$.

The existing streaming algorithms for the frequency and rank problems can easily be used to solve the one-shot version of the problem in the $k$-party communication model. More precisely, we use a streaming algorithm to summarize the data set at each site with a structure of size $O(1/\varepsilon)$, and then send these summary structures to the coordinator, resulting in a communication cost of $O(k/\varepsilon)$. Recently, we designed randomized algorithms for these two problems with $O(\sqrt{k}/\varepsilon)$ communication [13, 14], which have just been shown to be near-optimal in an unpublished manuscript [26]. Thus, the results in this paper demonstrate that the seemingly much more challenging tracking problem, which requires us to solve the one-shot problem continuously at all times, is only harder by a $\Theta(\log N)$ factor (except for the count-tracking problem, which is much harder than its one-shot version).

Finally, we should mention that all these distributed tracking problems have been studied in the database community previously, but mostly using heuristics. Keralapura et al. [16] approached the count-tracking problem using prediction models, which do not work under adversarial inputs. Babcock and Olston [3] studied the top-$k$ tracking problem, a variant of the frequency (heavy-hitters) tracking problem, but did not offer a theoretical analysis. The rank-tracking problem was first studied by Cormode et al. [6]; their algorithm has a communication cost of $O(k/\varepsilon^2 \cdot \log N)$ under certain inputs.
2 Tracking Distributed Count

2.1 The algorithm

The algorithm with a fixed p. Let $p$ be a parameter to be determined later. For now we will assume that $p$ is fixed. The algorithm is very simple: whenever site $S_i$ receives an element (hence $n_i$ gets incremented by one), it sends the latest value of $n_i$ to the coordinator with probability $p$. Let $\bar{n}_i$ be the last updated value of $n_i$ received by the coordinator. We first estimate each $n_i$ by

$$\hat{n}_i = \begin{cases} \bar{n}_i - 1 + 1/p, & \text{if } \bar{n}_i \text{ exists}; \\ 0, & \text{else.} \end{cases} \quad (1)$$

Then we estimate $n$ as $\hat{n} = \sum_i \hat{n}_i$.

Analysis. As mentioned in the introduction, our analysis will hold for any one given time instance. It is also important to note that this given time instance shall not depend on the randomness internal to the algorithm. We show that each $\hat{n}_i$ is an unbiased estimator of $n_i$ with variance at most $1/p^2$. This is very intuitive: looking backward from the current time instance, $n_i - \bar{n}_i$ is the number of failed trials since the site last decided to send an update to the coordinator. This almost follows a geometric distribution with parameter $p$, but not quite, as it is bounded by $n_i$; this is why we need to separate the two cases in (1). A more careful analysis is given below.

Lemma 2.1 $\mathrm{E}[\hat{n}_i] = n_i$; $\mathrm{Var}[\hat{n}_i] \le 1/p^2$.

Proof. Define the random variable

$$X = \begin{cases} n_i - \bar{n}_i + 1, & \text{if } \bar{n}_i \text{ exists}; \\ n_i + 1/p, & \text{else.} \end{cases}$$

Now we can rewrite $\hat{n}_i$ as $\hat{n}_i = n_i - X + 1/p$. Thus it suffices to show that $\mathrm{E}[X] = 1/p$ and $\mathrm{Var}[X] \le 1/p^2$. Letting $t = n_i - \bar{n}_i + 1$, we have

$$\mathrm{E}[X] = \sum_{t=1}^{n_i} t(1-p)^{t-1}p + (n_i + 1/p)(1-p)^{n_i} = \frac{1}{p},$$

$$\mathrm{Var}[X] = \sum_{t=1}^{n_i} (t - 1/p)^2 (1-p)^{t-1} p + (n_i + 1/p - 1/p)^2 (1-p)^{n_i} = \frac{(1-p)\big(1-(1-p)^{n_i}\big)}{p^2} \le \frac{1}{p^2}.$$

By Lemma 2.1, we know that $\hat{n}$ is an unbiased estimator of $n$ with variance at most $k/p^2$.
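The two sums in the proof can be checked numerically. The following Python snippet (ours) evaluates the truncated-geometric moments of $X$ term by term, exactly as written in the proof of Lemma 2.1:

```python
def moments_of_X(n_i, p):
    """Exact mean and variance of the random variable X in Lemma 2.1:
    X = t with probability (1-p)^(t-1) * p for t = 1, ..., n_i, and
    X = n_i + 1/p with the remaining probability (1-p)^n_i."""
    q = 1 - p
    mean = sum(t * q ** (t - 1) * p for t in range(1, n_i + 1))
    mean += (n_i + 1 / p) * q ** n_i
    var = sum((t - 1 / p) ** 2 * q ** (t - 1) * p for t in range(1, n_i + 1))
    var += n_i ** 2 * q ** n_i  # the ((n_i + 1/p) - 1/p)^2 term
    return mean, var
```

For every $n_i$ and $p$, the mean comes out to exactly $1/p$ and the variance to $(1-p)(1-(1-p)^{n_i})/p^2 \le 1/p^2$, as the lemma claims.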
Thus, if $p = \sqrt{k}/(\varepsilon n)$, the variance of $\hat{n}$ will be $(\varepsilon n)^2$, which means that $\hat{n}$ has error at most $2\varepsilon n$ with probability at least $3/4$, by the Chebyshev inequality. Rescaling $\varepsilon$ and $p$ by a constant reduces the error to $\varepsilon n$ and improves the success probability to $0.9$, as desired. Here we also see that separating the two cases in (1) is actually important: otherwise, when $n_i = \Theta(\varepsilon n/\sqrt{k})$, there would be a constant probability that $\bar{n}_i$ does not exist, leading to a bias of $\Theta(1/p) = \Theta(\varepsilon n/\sqrt{k})$. Summed over all $k$ sites, this would exceed our error requirement. It is interesting to note that similar ideas were used to solve the one-shot quantile problem over distributed data [13].

Dealing with a decreasing p. It is neither possible nor necessary to set $p$ exactly to $\sqrt{k}/(\varepsilon n)$. From the analysis above, it should be clear that keeping $p = \Theta(\sqrt{k}/(\varepsilon n))$ will suffice. To do so, we first track $n$ within a constant factor. This can be done efficiently as follows. Each site $S_i$ keeps track of its own counter $n_i$. Whenever $n_i$ doubles, it sends an update to the coordinator. The coordinator sets $n' = \sum_{i=1}^k n'_i$, where $n'_i$ is the last update of $n_i$. When $n'$ doubles (more precisely, when $n'$ changes by a factor between $2$ and $4$), the coordinator broadcasts $n'$ to all the sites. Let $\bar{n}$ be the last broadcast value of $n'$. It is clear that $\bar{n}$ is always a constant-factor approximation of $n$. The communication cost is $O(k \log N)$, since each site sends $O(\log N)$ updates to the coordinator and the coordinator broadcasts $O(\log N)$ times, each of which costs $k$ messages. These broadcasts divide the whole tracking period into $O(\log N)$ rounds, and within each round, $n$ stays within a constant factor of $\bar{n}$, the broadcast value at the beginning of the round. Now, when $\bar{n} \le \sqrt{k}/\varepsilon$, we set $p = 1$. This causes all of the first $O(\sqrt{k}/\varepsilon)$ elements to be sent to the coordinator.
When $\bar{n} > \sqrt{k}/\varepsilon$, we set $p = 1/\lfloor \varepsilon\bar{n}/\sqrt{k} \rfloor_2$, where $\lfloor x \rfloor_2$ denotes the largest power of $2$ smaller than $x$. Since $\bar{n}$ is monotonically increasing, $p$ gets halved over the rounds. At the beginning of a round, if the new $p$ is half of that in the previous round (to be more precise, the new $p$ might also be a quarter of the previous $p$, but this case can be handled similarly), each site $S_i$ adjusts its $\bar{n}_i$ appropriately, as follows. First, with probability $1/2$, the site decides that $\bar{n}_i$ remains the same. If so, nothing changes; otherwise, it repeatedly flips a coin with success probability $p$ (the new $p$), decrementing $\bar{n}_i$ by one for every failed flip, until a flip succeeds or $\bar{n}_i = 0$. Finally, the site informs the coordinator of the new value of $\bar{n}_i$; if $\bar{n}_i = 0$, the coordinator will treat it as if $\bar{n}_i$ does not exist. It should be clear that after this adjustment, the whole system looks as if it had always been running with the new $p$. It is easy to see that the communication cost in each round is $O(k + pn) = O(k + \sqrt{k}/\varepsilon) = O(\sqrt{k}/\varepsilon)$, so the total cost is $O(\sqrt{k}/\varepsilon \cdot \log N)$.

Theorem 2.1 There is an algorithm for the count-tracking problem that, at any time, estimates $n = \sum_i n_i$ within error $\varepsilon n$ with probability at least $0.9$. It uses $O(1)$ space at each site and $O(\sqrt{k}/\varepsilon \cdot \log N)$ total communication.

2.2 The lower bound

Before proving the lower bounds, we first state our lower bound model formally, in the context of the count-tracking problem. The $N$ elements arrive at the $k$ sites in an online fashion at arbitrary time instances. We do not allow spontaneous communication. More precisely, this means that a site is allowed to send out a message only if it has just received an element or a message from the coordinator. Likewise, the coordinator is allowed to send out messages only if it has just received messages from one or more sites.
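A small Python sketch (ours) of the resulting sampling-probability schedule; rounding $\varepsilon\bar{n}/\sqrt{k}$ down to a power of two is what makes consecutive values of $p$ differ by exact factors of two, so the per-site adjustment above only ever has to halve (or quarter) $p$:

```python
def floor_pow2(x):
    """Largest power of two strictly smaller than x (the paper's floor_2);
    assumes x > 1."""
    p = 1
    while p * 2 < x:
        p *= 2
    return p


def send_probability(n_bar, k, eps):
    """Sampling probability used by every site, recomputed from the
    broadcast constant-factor estimate n_bar of the global count n."""
    if n_bar <= k ** 0.5 / eps:
        return 1.0  # early phase: forward every element
    return 1.0 / floor_pow2(eps * n_bar / k ** 0.5)
```

As $\bar{n}$ doubles round over round, `send_probability` stays within a constant factor of $\sqrt{k}/(\varepsilon n)$, which is all the analysis needs.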
When a site $S_j$ is allowed to send out a message, it decides whether it will indeed do so, and the content of the message, based only on its local counter $n_j$ and the message history between $S_j$ and the coordinator, possibly using some random source. We assume that the site does not look at the current clock. We argue that the clock conveys no information, since the elements arrive at arbitrary and unpredictable time instances. (If the elements arrive in a predictable fashion, say, one per time step, the problem can be solved without communication at all.) Similarly, when the coordinator is allowed to send out messages, it makes the decision on where and what to send based only on its message history and some random source. We will lower bound the communication cost only by the number of messages, regardless of the message size.

2.2.1 One-way communication lower bound

In this section we show that two-way communication is necessary to achieve the upper bound in Theorem 2.1, by proving the following lower bound. Remember that we assume $N$ is sufficiently large compared to $k$ and $1/\varepsilon$.

Theorem 2.2 If only the sites can send messages to the coordinator but not vice versa, then any randomized algorithm for the count-tracking problem that, at any time, estimates $n$ within error $\varepsilon n$ with probability at least $0.9$ must send $\Omega(k/\varepsilon \cdot \log N)$ messages.

Proof. We first define the hard input distribution $\mu$. (a) With probability $1/2$, all elements arrive at one site that is uniformly picked at random. (b) Otherwise, the $N$ elements arrive at the $k$ sites in a round-robin fashion, each site receiving $N/k$ elements in the end. By Yao's minimax principle [28], we only need to argue that any deterministic algorithm with success probability at least $0.8$ under $\mu$ has expected cost $\Omega(k/\varepsilon \cdot \log N)$.
Note that when only one-way communication is allowed, a site decides whether to send messages to the coordinator based only on its local counter $n_j$. Thus the communication pattern can essentially be described as follows: each site $S_j$ has a series of thresholds $t_j^1, t_j^2, \ldots$ such that when $n_j = t_j^i$, the site sends its $i$-th message to the coordinator. These thresholds are fixed at the beginning.

We lower bound the communication cost by rounds. Let $W_i$ be the number of elements that have arrived up until round $i$. We divide the rounds by setting $W_1 = k/\varepsilon$ and $W_{i+1} = \lceil (1+\varepsilon) W_i \rceil$ for $i \ge 1$. Thus there are $1/\varepsilon \cdot \log(\varepsilon N/k)$ rounds, which is $\Omega(1/\varepsilon \cdot \log N)$ for sufficiently large $N$. At the beginning of round $i+1$, suppose that $S_1, S_2, \ldots, S_k$ have already sent $z_1^i, z_2^i, \ldots, z_k^i$ messages to the coordinator, respectively. Let $t_{\max}^{i+1} = (1+\varepsilon) \cdot \max\{ t_j^{z_j^i} \mid j = 1, 2, \ldots, k \}$. We first observe that there must be at least $k/2$ sites with their next threshold $t_j^{z_j^i+1} \le t_{\max}^{i+1}$. Otherwise, suppose there are fewer than $k/2$ sites with such next thresholds; then with probability at least $1/4$, case (a) happens and the random site $S_j$ chosen to receive all the elements has $t_j^{z_j^i+1} > t_{\max}^{i+1} \ge (1+\varepsilon)\, t_j^{z_j^i}$. Thus, with probability at least $1/4$ the algorithm fails when the $t_{\max}^{i+1}$-th element arrives, contradicting the success guarantee. On the other hand, with probability $1/2$ case (b) happens. In this case all $t_j^{z_j^i}$ ($j = 1, 2, \ldots, k$) are no more than $W_i/k$, since in case (b) the elements arrive at all $k$ sites in turn. During the next $\varepsilon W_i$ elements, each site $S_j$ receives $\varepsilon W_i/k$ elements. If site $S_j$ has $t_j^{z_j^i+1} \le t_{\max}^{i+1}$, then it must send a message in this round, since $W_i/k + \varepsilon W_i/k \ge t_{\max}^{i+1} \ge t_j^{z_j^i+1}$, that is, its $(z_j^i+1)$-th threshold is triggered.
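The round structure above can be illustrated numerically. This short Python sketch (ours) counts the rounds $W_1 = k/\varepsilon$, $W_{i+1} = \lceil(1+\varepsilon)W_i\rceil$ for concrete parameters and checks that their number tracks $\log_{1+\varepsilon}(\varepsilon N/k) = \Theta(1/\varepsilon \cdot \log(\varepsilon N/k))$:

```python
import math


def num_rounds(k, eps, N):
    """Number of rounds with W_1 = k/eps and W_{i+1} = ceil((1+eps) * W_i),
    stopping once W_i exceeds the stream length N."""
    w, rounds = k / eps, 0
    while w <= N:
        w = math.ceil((1 + eps) * w)
        rounds += 1
    return rounds
```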
As argued, there are at least $k/2$ sites with $t_j^{z_j^i+1} \le t_{\max}^{i+1}$, so the communication cost in this round is at least $k/2$. Summing over all rounds, the total communication is $\Omega(k/\varepsilon \cdot \log N)$.

2.2.2 Two-way communication lower bound

Below we prove two randomized lower bounds for the case when two-way communication is allowed. The first one justifies the assumption $k \le 1/\varepsilon^2$, since otherwise random sampling will be near-optimal.

Theorem 2.3 Any randomized algorithm for the count-tracking problem that, at any time, estimates $n$ within error $0.1n$ with probability at least $0.9$ must exchange $\Omega(k)$ messages.

Proof. The hard input distribution is the same as that in the proof of Theorem 2.2. To prove this lower bound, we are only interested in the number of sites that communicate with the coordinator at least once. Before any element arrives, we may still assume that each site keeps a triggering threshold; the threshold of $S_j$ remains the same unless $S_j$ communicates with the coordinator at least once. We argue that there must be at least $k/2$ sites whose triggering threshold is no more than $1$: otherwise, if case (a) happens and the randomly chosen site is one with a triggering threshold larger than $1$, the algorithm will fail, which would happen with probability at least $1/4$. On the other hand, if case (b) happens, then all the sites with threshold $1$ will have to communicate with the coordinator at least once: either their thresholds are triggered by the round-robin arrival of elements, or they receive a message from the coordinator, which can possibly change their thresholds.

Finally, we show that the upper bound in Theorem 2.1 is asymptotically tight. We first introduce the following primitive problem.

Definition 2.1 (1-bit) Let $s$ be either $k/2 + \sqrt{k}$ or $k/2 - \sqrt{k}$, each with probability $1/2$. From the $k$ sites, a subset of $s$ sites picked uniformly at random each have bit $1$, while the other $k-s$ sites have bit $0$.
The goal of the communication problem is for the coordinator to find out the value of $s$ with probability at least $0.8$. We will show the following lower bound for this primitive problem.

Lemma 2.2 Any deterministic algorithm that solves 1-bit has distributional communication complexity $\Omega(k)$.

Lemma 2.2 immediately implies the following theorem.

Theorem 2.4 Any randomized algorithm for the count-tracking problem that, at any time, estimates $n$ within error $\varepsilon n$ with probability at least $0.9$ must exchange $\Omega(\sqrt{k}/\varepsilon \cdot \log N)$ messages, when $k < 1/\varepsilon^2$.

Proof. We will again fix a hard input distribution first and then focus on the distributional communication complexity of deterministic algorithms with success probability at least $0.8$. Let $[m] = \{0, 1, \ldots, m-1\}$. The adversarial input consists of $\ell = \log\frac{\varepsilon N}{k} = \Omega(\log N)$ rounds. We further divide each round $i \in [\ell]$ into $r = 1/(2\varepsilon\sqrt{k})$ subrounds. The input at round $i \in [\ell]$ is constructed as follows: at each subround $j \in [r]$, we first choose $s$ to be $k/2 + \sqrt{k}$ or $k/2 - \sqrt{k}$ with equal probability. Then we choose $s$ sites out of the $k$ sites uniformly at random and send $2^i$ elements to each of them (the order does not matter). It is easy to see that at the end of each subround in round $i$, the total number of items is no more than $\tau_i = \sqrt{k}/\varepsilon \cdot 2^i$. Thus, after the $s \cdot 2^i$ elements have arrived in a subround, the algorithm has to correctly identify the value of $s$ with probability at least $0.8$, since otherwise with probability at least $0.2$ the estimate of the algorithm will deviate from the true value by at least $\sqrt{k} \cdot 2^i \ge \varepsilon\tau_i$, violating the success guarantee of the algorithm. This is exactly the 1-bit problem defined above. By Lemma 2.2, the communication cost of each subround is $\Omega(k)$. Summing over all $r$ subrounds and then all $\ell$ rounds, we have that the total communication is at least $\ell \cdot r \cdot \Omega(k) = \Omega(\sqrt{k}/\varepsilon \cdot \log N)$.

Now we prove Lemma 2.2.
Proof (of Lemma 2.2). First, observe that whenever the coordinator communicates with a site, the site can send its whole input (i.e., its single bit) to the coordinator. After that, the coordinator knows everything about that site and need not communicate with it again. Therefore, all we need to bound is the number of sites the coordinator communicates with. There are two types of actions in the protocol: (a) a site initiates a communication with the coordinator based on the bit it holds; (b) the coordinator, based on all the information it has gathered so far, asks some site to send its bit. Note that if a type (b) communication takes place before a type (a) communication, we can always swap the two, since this only gives the coordinator more information at an earlier stage. Thus, we may assume that all type (a) communications happen before type (b) ones.

In the first phase, where all type (a) communications happen, let $x$ be the number of sites that send bit $0$ to the coordinator, and $y$ the number of sites that send bit $1$. If $E[x+y] = \Omega(k)$, we are done, so assume $E[x+y] = o(k)$. By Markov's inequality, with probability at least $0.9$, $x + y = o(k)$. After the first phase, the problem becomes the following: there are $s' = s - y = s - o(k)$ sites holding bit $1$, out of $k' = k - x - y = k - o(k)$ sites in total, and the coordinator needs to figure out the exact value of $s'$ with probability at least $0.8 - (1 - 0.9) = 0.7$.

In the second phase, where all type (b) communications happen, from the coordinator's perspective all the remaining sites are still symmetric (by the random choice of the input), so the best it can do is to probe an arbitrary site among those it has not yet communicated with. This remains true even after the coordinator has probed some of the remaining sites.
Therefore, the problem boils down to the following: the coordinator picks $z$ sites out of the remaining $k'$ sites to communicate with, and then decides the value of $s'$ with success probability at least $0.7$. We call this the sampling problem. We can show that to achieve this success guarantee, $z$ must be at least $\Omega(k)$. This result is perhaps folklore; proofs of more general versions of the problem can be found in [4] (Chapter 4), and also in [22, 27]. We include a simpler proof in the appendix for completeness. This concludes the proof of Lemma 2.2.

3 Tracking Distributed Frequencies

In the frequency-tracking problem, $A$ (we omit "$(t)$" when the context is clear) is a multiset, and the goal is to track the frequency of any item $j$ within error $\varepsilon n$. Let $f_{ij}$ denote the local frequency of element $j$ in $A_i$, and let $f_j = \sum_{i=1}^{k} f_{ij}$.

3.1 The algorithm

The algorithm with a fixed $p$. As in Section 2.1, we first describe the algorithm with a fixed parameter $p$. If each site tracked the local frequencies $f_{ij}$ exactly, we could essentially use the count-tracking algorithm to track the $f_j$'s. To achieve small space, we use the following algorithm, due to Manku and Motwani [18], at each site $S_i$. We maintain a list $L_i$ of counters. When an element $j$ arrives at $S_i$, we first check whether there is a counter $c_{ij}$ for $j$ in $L_i$. If so, we increase $c_{ij}$ by $1$. Otherwise, we sample this element with probability $p$; if it is sampled, we insert a counter $c_{ij}$, initialized to $1$, into $L_i$. It is easy to see that the expected size of $L_i$ is $O(p n_i)$.

Next, we follow a strategy similar to the count-tracking algorithm: the site reports the counter $c_{ij}$ to the coordinator when it is first added to the counter list with initial value $1$. Afterward, for every arriving copy of $j$, the site always increments $c_{ij}$ as before, but only sends the updated counter to the coordinator with probability $p$.
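The per-site procedure just described can be sketched as follows. This is a minimal illustration, not the paper's code; the `send` callback delivering (item, counter value) messages to the coordinator is our own assumption.

```python
import random

class Site:
    """Per-site state for the frequency-tracking protocol (a sketch).

    `send` is a hypothetical callback that delivers (item, counter value)
    messages to the coordinator; it is not part of the paper's notation.
    """
    def __init__(self, p, send, rng=None):
        self.p = p              # sampling probability (fixed for now)
        self.counters = {}      # the list L_i, kept as a dict item -> c_ij
        self.send = send
        self.rng = rng or random.Random()

    def receive(self, j):
        if j in self.counters:
            self.counters[j] += 1
            # forward the updated counter with probability p
            if self.rng.random() < self.p:
                self.send(j, self.counters[j])
        elif self.rng.random() < self.p:
            # first sampled copy: create c_ij = 1 and report it immediately
            self.counters[j] = 1
            self.send(j, 1)
```

With $p = 1$ the sketch degenerates to exact counting with every update forwarded, which is a convenient way to check the bookkeeping.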
We use $\bar{c}_{ij}$ to denote the last value of $c_{ij}$ sent to the coordinator. The tricky part is how the coordinator estimates $f_{ij}$, and hence $f_j$. Fix any time instant. The difference between $f_{ij}$ and $\bar{c}_{ij}$ comes from two sources: the number of copies of $j$ missed before one is sampled, and the number of copies of $j$ that arrive after the last update of $c_{ij}$. It is easy to see that both errors follow the same distribution as $n_i - \bar{n}_i$ in the count-tracking algorithm. Thus it is tempting to modify (1) as

$\hat{f}_{ij} = \bar{c}_{ij} - 2 + 2/p$, if $\bar{c}_{ij}$ exists; $\hat{f}_{ij} = 0$, else. (2)

However, this estimator is biased, and its bias can be as large as $\Theta(\varepsilon n/\sqrt{k})$; summed over $k$ streams, this would exceed our error guarantee. To see this, consider the $f_{ij}$ copies of $j$. Effectively, the site samples every copy with probability $p$, while $\bar{c}_{ij} - 2$ is exactly the number of copies between the first and the last sampled copy (excluding both). We define $X_1$ as before:

$X_1 = t_1$, if the $t_1$-th copy is the first one sampled; $X_1 = f_{ij} + 1/p$, if none is sampled.

We define $X_2$ in exactly the same way, except that we examine the $f_{ij}$ copies backward:

$X_2 = t_2$, if the $t_2$-th copy is the first one sampled in the reverse order; $X_2 = f_{ij} + 1/p$, if none is sampled.

It is clear that $X_1$ and $X_2$ have the same distribution, with $E[X_1] = E[X_2] = 1/p$ (by Lemma 2.1), so $\hat{f}_{ij} = f_{ij} - (X_1 + X_2) + 2/p$ is unbiased. Since $\bar{c}_{ij} - 2 = f_{ij} - t_1 - t_2$, the correct unbiased estimator should be

$\hat{f}_{ij} = \bar{c}_{ij} - 2 + 2/p$, if $\bar{c}_{ij}$ exists; $\hat{f}_{ij} = -f_{ij}$, else. (3)

Compared with the wrong estimator (2), the main difference is how the estimation is done when no copy of $j$ is sampled. When $f_{ij} = \Theta(\varepsilon n/\sqrt{k})$ and $p = \Theta(1/f_{ij})$, this happens with constant probability, which results in a bias of $\Theta(f_{ij}) = \Theta(\varepsilon n/\sqrt{k})$ for estimator (2). However, the correct estimator (3) depends on $f_{ij}$, the very quantity we want to estimate in the first place.
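The identity $E[X_1] = 1/p$ from Lemma 2.1, on which the unbiasedness of (3) rests, can be verified exactly by summing over the distribution of $X_1$; a quick sketch in exact rational arithmetic (the function name is ours):

```python
from fractions import Fraction

def expected_X1(f, p):
    """Exact E[X_1] where X_1 = t if the t-th of f copies is the first
    one sampled (each copy sampled independently with probability p),
    and X_1 = f + 1/p if no copy is sampled."""
    p = Fraction(p)
    e = sum((1 - p) ** (t - 1) * p * t for t in range(1, f + 1))
    e += (1 - p) ** f * (f + 1 / p)   # the "none sampled" penalty term
    return e

# E[X_1] equals 1/p exactly, regardless of the stream length f
for f in (1, 5, 40):
    assert expected_X1(f, Fraction(1, 4)) == 4
```

The penalty value $f + 1/p$ in the "none sampled" case is exactly what makes the truncated geometric sum telescope to $1/p$ for every $f$.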
The workaround is to use another unbiased estimator for $f_{ij}$ when $\bar{c}_{ij}$ is not yet available. It turns out that simple random sampling suffices: the site samples every element with probability $p$ (independently of the sampling process that maintains the list $L_i$), and sends the sampled elements to the coordinator. Let $d_{ij}$ be the number of sampled copies of $j$ received by the coordinator from site $i$. The final estimator for $f_{ij}$ is

$\hat{f}'_{ij} = \bar{c}_{ij} - 2 + 2/p$, if $\bar{c}_{ij}$ exists; $\hat{f}'_{ij} = -d_{ij}/p$, else. (4)

Since $d_{ij}$ is independent of $\bar{c}_{ij}$, the estimator is still unbiased. Below we analyze its variance.

Analysis. Intuitively, the variance is not affected by using the simple random-sampling estimator $d_{ij}/p$: it is used only when $\bar{c}_{ij}$ is unavailable, which means that $f_{ij}$ is likely to be small, and when $f_{ij}$ is small, $d_{ij}/p$ has small variance. When $f_{ij}$ is large, $d_{ij}/p$ has large variance, but it is then used only with small probability. Below we give a formal proof.

Lemma 3.1 $E[\hat{f}'_{ij}] = f_{ij}$; $Var[\hat{f}'_{ij}] = O(1/p^2)$.

Proof. We first analyze the estimator $\hat{f}_{ij}$ of (3). That $E[\hat{f}_{ij}] = f_{ij}$ follows from the discussion above. Its variance is $Var[\hat{f}_{ij}] = Var[X_1 + X_2]$. Note that $X_1$ and $X_2$ are not independent, but they both have expectation $1/p$ and variance at most $1/p^2$. We first rewrite

$Var[X_1 + X_2] = E[X_1^2 + X_2^2 + 2X_1X_2] - E[X_1 + X_2]^2$
$= Var[X_1] + E[X_1]^2 + Var[X_2] + E[X_2]^2 + 2E[X_1X_2] - (E[X_1] + E[X_2])^2$
$\le 4/p^2 + 2E[X_1X_2] - 4/p^2 = 2E[X_1X_2]$.

Let $\mathcal{E}_t$ be the event that the $t$-th copy of $j$ is the first one sampled.
We have

$E[X_1X_2] = \sum_{t=1}^{f_{ij}} (1-p)^{t-1} p\,t\, E[X_2 \mid \mathcal{E}_t] + (1-p)^{f_{ij}} (f_{ij} + 1/p)^2$
$= \sum_{t=1}^{f_{ij}} (1-p)^{t-1} p\,t \left( (1-p)^{f_{ij}-t}(f_{ij}-t+1) + \sum_{l=1}^{f_{ij}-t} (1-p)^{l-1} p\,l \right) + (1-p)^{f_{ij}} (f_{ij} + 1/p)^2$
$\le \frac{1}{p^2} + (1-p)^{f_{ij}} f_{ij}^2 + (1-p)^{f_{ij}} \frac{f_{ij}}{p}$.

Let $c = f_{ij} p$. If $c \le 2$, then $f_{ij} \le 2/p$, and the variance is $O(1/p^2)$. Otherwise,

$E[X_1X_2] \le \frac{1}{p^2} + \frac{c^2}{p^2 e^c} + \frac{c}{p^2 e^c} = O(1/p^2)$,

since $c^2 \le e^c$ when $c > 2$.

Next we analyze the final estimator $\hat{f}'_{ij}$ of (4). First, $d_{ij}$ is the sum of $f_{ij}$ Bernoulli random variables with probability $p$, so $E[d_{ij}/p] = f_{ij}$ and $Var[d_{ij}/p] \le f_{ij} p / p^2 = f_{ij}/p$. Let $\mathcal{E}^*$ be the event that $\bar{c}_{ij}$ is available, i.e., at least one copy of $j$ is sampled, and let $\mathcal{E}_0 = \neg\mathcal{E}^*$. Then

$E[\hat{f}'_{ij}] = E[\hat{f}_{ij} \mid \mathcal{E}^*]\Pr[\mathcal{E}^*] + E[-d_{ij}/p \mid \mathcal{E}_0]\Pr[\mathcal{E}_0] = E[\hat{f}_{ij} \mid \mathcal{E}^*]\Pr[\mathcal{E}^*] + (-f_{ij})\Pr[\mathcal{E}_0] = E[\hat{f}_{ij}] = f_{ij}$.

The variance is

$Var[\hat{f}'_{ij}] = E[\hat{f}'^{\,2}_{ij}] - E[\hat{f}'_{ij}]^2$
$= E[\hat{f}^2_{ij} \mid \mathcal{E}^*]\Pr[\mathcal{E}^*] + E[(d_{ij}/p)^2 \mid \mathcal{E}_0]\Pr[\mathcal{E}_0] - f_{ij}^2$
$= E[\hat{f}^2_{ij} \mid \mathcal{E}^*]\Pr[\mathcal{E}^*] - f_{ij}^2 + E[(d_{ij}/p)^2]\Pr[\mathcal{E}_0]$
$= E[\hat{f}^2_{ij} \mid \mathcal{E}^*]\Pr[\mathcal{E}^*] - f_{ij}^2 + (Var[d_{ij}/p] + f_{ij}^2)\Pr[\mathcal{E}_0]$.

Note that

$Var[\hat{f}_{ij}] = E[\hat{f}^2_{ij}] - f_{ij}^2 = E[\hat{f}^2_{ij} \mid \mathcal{E}^*]\Pr[\mathcal{E}^*] + E[\hat{f}^2_{ij} \mid \mathcal{E}_0]\Pr[\mathcal{E}_0] - f_{ij}^2 = E[\hat{f}^2_{ij} \mid \mathcal{E}^*]\Pr[\mathcal{E}^*] + f_{ij}^2\Pr[\mathcal{E}_0] - f_{ij}^2$,

so

$Var[\hat{f}'_{ij}] = Var[\hat{f}_{ij}] + Var[d_{ij}/p]\Pr[\mathcal{E}_0] \le Var[\hat{f}_{ij}] + \frac{f_{ij}}{p}(1-p)^{f_{ij}}$.

For the same reason as above, the second term is $O(1/p^2)$, which completes the proof.

Dealing with a decreasing $p$. As in the count-tracking algorithm, we divide the whole tracking period into $O(\log N)$ rounds. Within each round, $n$ stays within a constant factor of $\bar{n}$, while $\bar{n}$ remains fixed for the whole round. Within a round, we set the parameter $p$ for all sites to be $p = 1/\lfloor \varepsilon\bar{n}/\sqrt{k} \rfloor$.
When we proceed to a new round, all sites clear their memory and we start a new copy of the algorithm from scratch with the new $p$. Given an item $j$, the coordinator estimates its frequency in each round separately and adds the estimates up. Since the variance in a round is $O(k/p^2)$ and $1/p$ increases geometrically over the rounds, the total variance is asymptotically dominated by the variance of the last round, i.e., $O((\varepsilon n)^2)$, as desired.

The space used at a site could still be large, since the site may receive too many elements in a round: if all the $O(n)$ elements of a round went to the same site, that site would need space $O(pn) = O(\sqrt{k}/\varepsilon)$. To bound the space, we restrict the amount of space used by each site. More precisely, when a site has received more than $\bar{n}/k$ elements, it sends a message to the coordinator as notification, clears its memory, and starts a new copy of the algorithm from scratch. The coordinator treats the new copy as if it were a new site, while the original site receives no more elements. Now the space used at each site is at most $p\bar{n}/k = O(1/(\varepsilon\sqrt{k}))$. Since at most $O(k)$ such new "virtual" sites are ever created in a round, this affects the variance by at most a constant factor.

It remains to show that the total communication cost is $O(\sqrt{k}/\varepsilon \cdot \log N)$. From earlier we know that there are $O(\log N)$ rounds; within each round, $\bar{n}$ is fixed and $n$ stays within $\Theta(\bar{n})$. Focus on one round. For each arriving element, the site $S_i$ updates $\bar{c}_{ij}$ with probability $p$ and also independently samples it with probability $p$ to maintain $d_{ij}$. This costs $O(n \cdot p) = O(\sqrt{k}/\varepsilon)$ communication per round.

Theorem 3.1 There is an algorithm for the frequency-tracking problem that, at any time, estimates the frequency of any element within error $\varepsilon n$ with probability at least $0.9$. It uses $O(1/(\varepsilon\sqrt{k}))$ space at each site and $O(\sqrt{k}/\varepsilon \cdot \log N)$ communication.
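The contrast between the biased estimator (2) and the final estimator (4) can be seen in a small Monte Carlo sketch. The harness below is ours, not the paper's: for simplicity it uses one coin per copy (the first success creates $c_{ij}$, later successes forward the updated counter), which induces the same distribution of $\bar{c}_{ij}$ as the two separate rate-$p$ processes in the text, and draws $d_{ij}$ from an independent sample.

```python
import random

def one_trial(f, p, rng):
    """Simulate the f copies of one item at one site and return the pair
    (estimate by (2), estimate by (4)).  One coin per copy: before the
    counter exists the coin decides creation, afterwards it decides
    whether the incremented counter is forwarded.  Helper name is ours."""
    hits = [t for t in range(1, f + 1) if rng.random() < p]
    if hits:  # counter c_ij was created; last forwarded value:
        c_bar = hits[-1] - hits[0] + 1
        est = c_bar - 2 + 2 / p
        return est, est          # (2) and (4) agree in this case
    # counter never created: (2) outputs 0, (4) falls back to -d_ij/p,
    # where d_ij comes from the independent rate-p sample
    d = sum(rng.random() < p for _ in range(f))
    return 0.0, -d / p

rng = random.Random(7)
f, p, trials = 25, 0.08, 100000
wrong = correct = 0.0
for _ in range(trials):
    w, c = one_trial(f, p, rng)
    wrong += w / trials
    correct += c / trials
# estimator (4) averages to f, while (2) overshoots by roughly
# f * (1-p)^f, here about 3
```

The overshoot of (2) matches the bias computed in the text: when no copy is sampled, (2) reports $0$ instead of the unbiasing value $-f_{ij}$, an event of probability $(1-p)^{f_{ij}}$.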
3.2 Space lower bound

It is easy to see that the communication lower bounds for the count-tracking problem also hold for the frequency-tracking problem. In this section, we prove the following space-communication trade-off.

Theorem 3.2 Consider any randomized algorithm for the frequency-tracking problem that, at any time, estimates the frequency of any element within error $\varepsilon n$ with probability at least $0.9$. If the algorithm uses $C$ bits of communication and $M$ bits of space per site, then we must have $C \cdot M = \Omega(\log N / \varepsilon^2)$, assuming $k \le 1/\varepsilon^2$.

Thus, if the communication cost is $C = O(\sqrt{k}/\varepsilon \cdot \log N)$ bits, the space required per site is at least $\Omega(1/(\varepsilon\sqrt{k}))$ bits, as claimed in Table 1. Note, however, that our algorithm of the previous section uses $O(\sqrt{k}/\varepsilon \cdot \log N)$ words of communication and $O(1/(\varepsilon\sqrt{k}))$ words of space, so a small gap remains between the lower and upper bounds. Interestingly, this lower bound also shows that the random sampling algorithm [9] (see Table 1) attains the other end of this space-communication trade-off (ignoring the word/bit difference).

Proof (of Theorem 3.2). We will use a result of [26], which states that under the $k$-party communication model, there is an input distribution $\mu_k$ such that any algorithm that solves the one-shot version of the problem under $\mu_k$ within error $2\varepsilon n$ with probability $0.9$ needs at least $c\sqrt{k}/\varepsilon$ bits of communication, for some constant $c$, assuming $k \le 1/\varepsilon^2$. Moreover, any algorithm that solves $\ell$ independent copies of the one-shot problem needs at least $\ell \cdot c\sqrt{k}/\varepsilon$ bits of communication. We will consider the problem over $\rho k$ sites, for some integer $\rho \ge 1$ to be determined later. We divide the whole tracking period into $\log N$ rounds. In each round $i = 1, \ldots, \log N$, we feed the sites an input independently chosen from the distribution $\mu_{\rho k}$.
We pick elements from a different domain for every round, so that we have $\log N$ independent instances of the problem. In round $i$, for every element $e$ picked from $\mu_{\rho k}$ for any site, we replace it with $2^{i-1}$ copies of $e$. We arrange the element arrivals in a round so that site $S_1$ gets all its elements first, then $S_2$ gets all its elements, and so on. We only require the continuous tracking algorithm to solve the frequency-estimation problem at the end of each round. Since the last round always contains half of all the elements that have arrived so far, the algorithm must solve the problem for the elements of each round, namely $\log N$ independent instances of the one-shot problem. By the result of [26], the communication cost of solving all these instances is at least $c\sqrt{\rho k}/\varepsilon \cdot \log N$.

Let $A_k$ be a continuous tracking algorithm over $k$ sites that communicates $C$ bits in total and uses $M$ bits of space per site. Below we show how to solve the problem over the $\rho k$ sites in each round by simulating the $k$-site algorithm $A_k$. In each round, we start the simulation with sites $S_1, \ldots, S_k$. Whenever $A_k$ exchanges a message, we do the same. When $S_1$ has received all its elements, it sends its memory content to $S_{k+1}$, which then takes over the role of $S_1$ in the simulation and continues. Similarly, when $S_2$ has received all its elements, it sends its memory content to $S_{k+2}$, which replaces $S_2$ in the simulation. In general, when $S_j$ is done with all its elements, it passes its role to $S_{j+k}$. When $S_{\rho k}$ is done, the simulation finishes for this round; $S_{\rho k}$ then sends a broadcast message and we proceed to the next round.

Let us analyze the communication cost of the simulation. First, we exchange exactly the same messages as $A_k$ does, which costs $C$. We also communicate $(\rho-1)k$ memory snapshots and a broadcast message in each round, which costs at most $\rho k M \log N$ over all rounds.
Thus we have $C + \rho k M \log N \ge c\sqrt{\rho k}/\varepsilon \cdot \log N$. Rearranging,

$M \ge \frac{c}{\varepsilon\sqrt{\rho k}} - \frac{C}{\rho k \log N} = \frac{1}{\sqrt{\rho k}}\left(\frac{c}{\varepsilon} - \frac{C}{\sqrt{\rho k}\,\log N}\right)$.

Thus, if we set $\sqrt{\rho} = \left\lceil \frac{2C\varepsilon}{c\sqrt{k}\log N} \right\rceil$, then

$M \ge \frac{c}{2\varepsilon\sqrt{\rho k}} = \Omega\left(\frac{\log N}{C\varepsilon^2}\right)$,

as claimed.

4 Tracking Distributed Ranks

On a stream of $n$ elements, an algorithm that produces an unbiased estimator of any rank with variance $O((\varepsilon n)^2)$ was presented in [24]; it has very recently been improved and made to work in a stronger model [1]. It uses $O(1/\varepsilon \cdot \log^{1.5}(1/\varepsilon))$ working space to maintain a rank-estimation summary structure of size $O(1/\varepsilon)$. We call this algorithm $A$ and use it as a black box in our distributed tracking algorithm.

The overall algorithm. As before, with $O(k\log N)$ communication, we first track $\bar{n}$, a constant-factor approximation of the current $n$. This also divides the tracking period into $O(\log N)$ rounds. The $\Theta(n)$ elements arriving in a round are divided into chunks of size at most $\bar{n}/k$, each processed by an instance of algorithm $C$, described below. A site may receive more than $\bar{n}/k$ elements: when the $(\bar{n}/k+1)$-th element arrives, the site finishes the current instance of $C$ and starts a new one, which processes the next $\bar{n}/k$ elements, and so on.

Algorithm $C$. Algorithm $C$ reads at most $\bar{n}/k$ elements and divides them into blocks of size $b = \varepsilon\bar{n}/\sqrt{k}$, so there are at most $\frac{1}{\varepsilon\sqrt{k}}$ blocks. We build a balanced binary tree over the blocks in arrival order; the height of the tree is $h \le \log\frac{1}{\varepsilon\sqrt{k}}$. For each node $v$ in the tree, let $D(v)$ be the set of elements contained in the leaves of the subtree rooted at $v$. For each $D(v)$, we start an instance of $A$, denoted $A_v$, to process its elements as they arrive. We say that $v$ is active if $A_v$ is still accepting elements. For a node $v$ at level $\ell$ (the leaves are at level $0$), the error parameter of $A_v$ is set to $2^{-\ell}/\sqrt{h}$. We say that $v$ is full if all the elements in $D(v)$ have arrived.
When $v$ becomes full, we send the summary computed by $A_v$ to the coordinator and free the space used by $A_v$. Furthermore, each arriving element is sampled with probability $p = \frac{\sqrt{k}}{\varepsilon\bar{n}}$, and if it is sampled, it is sent to the coordinator.

Analysis of costs. We first analyze the various costs of $C$. At any time there are at most $h$ active nodes, one at each level, so the space used by $C$ is at most

$\sum_{\ell=0}^{h} \sqrt{h}\,2^{\ell} \log^{1.5}\frac{1}{\varepsilon} = O\left(\frac{\sqrt{h}}{\varepsilon\sqrt{k}} \log^{1.5}\frac{1}{\varepsilon}\right)$.

The communication for $C$ includes all the summaries computed and the elements sampled. For each $\ell$, the total size of the summaries on level $\ell$ is

$O\left(\frac{2^{-\ell}}{\varepsilon\sqrt{k}} \cdot 2^{\ell}\sqrt{h}\right) = O\left(\frac{\sqrt{h}}{\varepsilon\sqrt{k}}\right)$.

Summing over all $h$ levels gives $O\left(\frac{h^{1.5}}{\varepsilon\sqrt{k}}\right)$. There are at most $2k$ instances of $C$ in a round, so the total summary communication in a round is $O(h^{1.5}\sqrt{k}/\varepsilon)$. The number of sampled elements in a round is $O(np) = O(\sqrt{k}/\varepsilon)$. Thus, over all $O(\log N)$ rounds, the total communication cost is $O(h^{1.5}\sqrt{k}/\varepsilon \cdot \log N)$.

Estimation. It remains to show how the coordinator estimates the rank of any given element $x$ at any time, with variance $O((\varepsilon n)^2)$. We decompose the $n$ elements that have arrived so far into smaller subsets and estimate the rank of $x$ in each subset. Since all the estimators are unbiased, the overall estimator is also unbiased, and its variance is the sum of the individual variances. We focus on the current round; all previous rounds are handled similarly. Recall that $O(\bar{n})$ elements arrive in this round and $\bar{n} = \Theta(n)$. Every chunk of $\bar{n}/k$ elements is processed by one instance of $C$. Consider any such chunk, and suppose that so far $n'$ elements of this chunk have arrived, for some $n' \le \bar{n}/k$. We write $n' = q \cdot b + r$ for some $r < b$, and decompose these $n'$ elements into at most $h+1$ subsets. The first $qb$ elements are decomposed into at most $h$ subsets, each of which corresponds to a full node in the binary tree of $C$.
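The decomposition of the first $qb$ elements into at most $h$ full nodes is the standard dyadic cover of a prefix of leaves, one node per set bit of $q$; a sketch (function name is ours):

```python
def prefix_cover(q):
    """Decompose the first q leaves (blocks) of a balanced binary tree
    into aligned dyadic intervals [start, start + size), one interval per
    set bit of q, so at most floor(log2(q)) + 1 intervals in total.
    These are the full nodes whose summaries the coordinator combines."""
    cover, start = [], 0
    bit = 1 << q.bit_length()
    while bit:
        if q & bit:
            cover.append((start, bit))  # a full node spanning `bit` blocks
            start += bit
        bit >>= 1
    return cover

# e.g. 11 = 8 + 2 + 1 blocks -> three full nodes
assert prefix_cover(11) == [(0, 8), (8, 2), (10, 1)]
```

Each interval's start is a multiple of its size, so every interval is indeed a node of the binary tree, and the number of intervals is at most the tree height plus one.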
Each such node has already sent its summary to the coordinator, which can be used to estimate the rank. For a node at level $\ell$, the variance is $(2^{-\ell}/\sqrt{h} \cdot 2^{\ell} b)^2 = b^2/h$, so the total variance from all $h$ nodes is $b^2$. For the last $r$ elements of the chunk, which are still being processed by an active node, the coordinator has no summary. But recall that the site samples each element with probability $p = \sqrt{k}/(\varepsilon\bar{n})$ and sends it to the coordinator when sampled. Thus, the rank of $x$ within these $r$ elements can be estimated by simply counting the number $c$ of sampled elements smaller than $x$; the estimator is $c/p$. The variance of this estimator is $r/p \le b/p = b^2$.

Thus, the variance from any chunk is $O(b^2)$. Since there are at most $2k$ chunks in the round, the total variance is $O(b^2 k) = O((\varepsilon\bar{n})^2) = O((\varepsilon n)^2)$. As the variances of the previous rounds decrease geometrically, the total variance from all the rounds is still bounded by $O((\varepsilon n)^2)$, as desired.

Theorem 4.1 There is an algorithm for the rank-tracking problem that, at any time, estimates the rank of any element within error $\varepsilon n$ with probability at least $0.9$. It uses $O\left(\frac{1}{\varepsilon\sqrt{k}}\log^{1.5}\frac{1}{\varepsilon}\log^{0.5}\frac{1}{\varepsilon\sqrt{k}}\right)$ space at each site, with communication cost $O\left(\frac{\sqrt{k}}{\varepsilon}\log N \log^{1.5}\frac{1}{\varepsilon\sqrt{k}}\right)$.

References

[1] P. K. Agarwal, G. Cormode, Z. Huang, J. M. Phillips, Z. Wei, and K. Yi. Mergeable summaries. In Third Workshop on Massive Data Algorithmics, 2011.
[2] C. Arackaparambil, J. Brody, and A. Chakrabarti. Functional monitoring without monotonicity. In Proc. International Colloquium on Automata, Languages, and Programming, 2009.
[3] B. Babcock and C. Olston. Distributed top-k monitoring. In Proc. ACM SIGMOD International Conference on Management of Data, 2003.
[4] Z. Bar-Yossef. The complexity of massive data set computations. PhD thesis, University of California at Berkeley, 2002.
[5] H.-L. Chan, T. W. Lam, L.-K.
Lee, and H.-F. Ting. Continuous monitoring of distributed data streams over a time-based sliding window. Algorithmica, to appear.
[6] G. Cormode, M. Garofalakis, S. Muthukrishnan, and R. Rastogi. Holistic aggregates in a networked world: Distributed tracking of approximate quantiles. In Proc. ACM SIGMOD International Conference on Management of Data, 2005.
[7] G. Cormode and M. Hadjieleftheriou. Finding frequent items in data streams. In Proc. International Conference on Very Large Data Bases, 2008.
[8] G. Cormode, S. Muthukrishnan, and K. Yi. Algorithms for distributed functional monitoring. ACM Transactions on Algorithms, 7(2), Article 21, 2011. Preliminary version in SODA'08.
[9] G. Cormode, S. Muthukrishnan, K. Yi, and Q. Zhang. Continuous sampling from distributed streams. Journal of the ACM, to appear. Preliminary version in PODS'10.
[10] P. B. Gibbons and S. Tirthapura. Estimating simple functions on the union of data streams. In Proc. ACM Symposium on Parallelism in Algorithms and Architectures, 2001.
[11] P. B. Gibbons and S. Tirthapura. Distributed streams algorithms for sliding windows. In Proc. ACM Symposium on Parallelism in Algorithms and Architectures, 2002.
[12] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In Proc. ACM SIGMOD International Conference on Management of Data, 2001.
[13] Z. Huang, L. Wang, K. Yi, and Y. Liu. Sampling based algorithms for quantile computation in sensor networks. In Proc. ACM SIGMOD International Conference on Management of Data, 2011.
[14] Z. Huang, K. Yi, Y. Liu, and G. Chen. Optimal sampling algorithms for frequency estimation in distributed data. In IEEE INFOCOM, 2011.
[15] W. Feller. An Introduction to Probability Theory and Its Applications, Vol. 1. Wiley, New York, 1968.
[16] R. Keralapura, G. Cormode, and J. Ramamirtham. Communication-efficient distributed monitoring of thresholded counts. In Proc.
ACM SIGMOD International Conference on Management of Data, 2006.
[17] A. Manjhi, V. Shkapenyuk, K. Dhamdhere, and C. Olston. Finding (recently) frequent items in distributed data streams. In Proc. IEEE International Conference on Data Engineering, 2005.
[18] G. Manku and R. Motwani. Approximate frequency counts over data streams. In Proc. International Conference on Very Large Data Bases, 2002.
[19] A. Metwally, D. Agrawal, and A. El Abbadi. An integrated efficient solution for computing frequent and top-k elements in data streams. ACM Transactions on Database Systems, 31(3):1095–1133, 2006.
[20] J. Misra and D. Gries. Finding repeated elements. Science of Computer Programming, 2:143–152, 1982.
[21] J. I. Munro and M. S. Paterson. Selection and sorting with limited storage. Theoretical Computer Science, 12:315–323, 1980.
[22] B. Patt-Shamir and A. Shafrir. Approximate distributed top-k queries. Distributed Computing, 21(1):1–22, 2008.
[23] I. Sharfman, A. Schuster, and D. Keren. Shape sensitive geometric monitoring. In Proc. ACM Symposium on Principles of Database Systems, 2008.
[24] S. Suri, C. Toth, and Y. Zhou. Range counting over multidimensional data streams. Discrete and Computational Geometry, 2006.

Figure 1: Differentiating the two distributions.

[25] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264–280, 1971.
[26] E. Verbin, D. P. Woodruff, and Q. Zhang. Tight bounds for distributed functional monitoring. Manuscript, available at http://www.cse.ust.hk/qinzhang/distFp.pdf.
[27] D. Woodruff. Efficient and Private Distance Approximation in the Communication and Streaming Models. PhD thesis, Massachusetts Institute of Technology, 2007.
[28] A. C. Yao.
Probabilistic computations: Toward a unified measure of complexity. In Proc. IEEE Symposium on Foundations of Computer Science, 1977.
[29] K. Yi and Q. Zhang. Optimal tracking of distributed heavy hitters and quantiles. In Proc. ACM Symposium on Principles of Database Systems, 2009.
[30] H. Zhao, A. Lall, M. Ogihara, and J. Xu. Global iceberg detection over distributed data streams. In Proc. IEEE International Conference on Data Engineering, 2010.

A Lower bound for the sampling problem

Claim A.1 To solve the sampling problem, we need to probe at least $\Omega(k)$ sites.

Proof. Suppose that the coordinator samples only $z = o(k)$ sites. Let $X$ be the number of sampled sites with bit $1$. Then $X$ follows the hypergeometric distribution with probability density function (pdf)

$\Pr[X = x] = \binom{s'}{x}\binom{k'-s'}{z-x} \Big/ \binom{k'}{z}$.

The expected value of $X$ is $\frac{z}{k'} \cdot s'$, which is $\frac{z}{k'}\left(\frac{k}{2} - y + \sqrt{k}\right)$ or $\frac{z}{k'}\left(\frac{k}{2} - y - \sqrt{k}\right)$, depending on the value of $s'$. Let $p = (k/2 - y)/k' = 1/2 \pm o(1)$ and $\alpha = \sqrt{k}/k' = 1/\sqrt{k} \pm o(1/\sqrt{k})$. To avoid tedious calculation, we assume that $X$ is drawn from one of the two normal distributions $N_1(\mu_1, \sigma_1^2)$ and $N_2(\mu_2, \sigma_2^2)$, each with probability $1/2$, where $\mu_1 = z(p - \alpha)$, $\mu_2 = z(p + \alpha)$, and $\sigma_1, \sigma_2 = \Theta(\sqrt{z p(1-p)}) = \Theta(\sqrt{z})$. Feller [15] shows that the normal distribution approximates the hypergeometric distribution very well when $z$ is large and $p \pm \alpha$ are constants in $(0,1)$.

(Footnote 5: In Feller's book [15] the following is proved. Let $p \in (0,1)$ be a constant and $q = 1 - p$. The population size is $N$ and the sample size is $n$, with $n < N$ and $Np$, $Nq$ both integers. The hypergeometric distribution is $P(k; n, N) = \binom{Np}{k}\binom{Nq}{n-k} / \binom{N}{n}$ for $0 \le k \le n$.)

Now our task is to decide from which of the two distributions $X$ is drawn, based on the value of $X$, with success probability at least $0.7$.
Let $f_1(x; \mu_1, \sigma_1^2)$ and $f_2(x; \mu_2, \sigma_2^2)$ be the pdfs of the two normal distributions $N_1$ and $N_2$, respectively. It is easy to see that the best deterministic algorithm for differentiating the two distributions, based on the value of a single sample $X$, does the following:

• If $X > x_0$, decide that $X$ was drawn from $N_2$; otherwise decide that $X$ was drawn from $N_1$. Here $x_0$ is the value such that $f_1(x_0; \mu_1, \sigma_1^2) = f_2(x_0; \mu_2, \sigma_2^2)$ (thus $\mu_1 < x_0 < \mu_2$).

Indeed, if for $X > x_0$ the algorithm decided "$X$ is drawn from $N_1$", we could always flip this decision and improve the success probability. The error comes from two sources: (1) $X > x_0$ but $X$ was actually drawn from $N_1$; (2) $X \le x_0$ but $X$ was actually drawn from $N_2$. The total error is $\frac{1}{2}\left(\Phi(-\ell_1/\sigma_1) + \Phi(-\ell_2/\sigma_2)\right)$, where $\ell_1 = x_0 - \mu_1$ and $\ell_2 = \mu_2 - x_0$ (thus $\ell_1 + \ell_2 = \mu_2 - \mu_1 = 2\alpha z$), and $\Phi(\cdot)$ is the cumulative distribution function (cdf) of the standard normal distribution. See Figure 1. Finally, note that $\ell_1/\sigma_1 = O(\alpha z/\sqrt{z}) = O(\sqrt{z/k}) = o(1)$ and $\ell_2/\sigma_2 = O(\alpha z/\sqrt{z}) = o(1)$, so $\Phi(-\ell_1/\sigma_1) + \Phi(-\ell_2/\sigma_2) > 0.99$. Therefore the failure probability is at least $0.49$, contradicting our success-probability guarantee. Thus we must have $z = \Omega(k)$.

Theorem A.1 [15] If $N \to \infty$ and $n \to \infty$ such that $n/N \to t \in (0,1)$ and $x_k := (k - np)/\sqrt{npq} \to x$, then

$P(k; n, N) \sim \frac{e^{-x^2/(2(1-t))}}{\sqrt{2\pi npq(1-t)}}$.
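The $z = \Omega(k)$ phenomenon is easy to observe empirically. The harness below is ours and purely illustrative: it draws $s \in \{k/2 \pm \sqrt{k}\}$ uniformly, probes $z$ sites without replacement (so the count of $1$-bits is hypergeometric), and applies the threshold test from the proof. With $z = o(k)$ probes the test barely beats random guessing; with $z = \Theta(k)$ probes it is almost always correct.

```python
import random

def success_rate(k, sqrt_k, z, trials, rng):
    """Empirical success probability of the threshold test for the
    sampling problem: s is k/2 + sqrt(k) or k/2 - sqrt(k) with equal
    probability; the coordinator probes z sites without replacement and
    guesses the larger value iff more than half the probed bits are 1."""
    ok = 0
    for _ in range(trials):
        high = rng.random() < 0.5
        s = k // 2 + sqrt_k if high else k // 2 - sqrt_k
        # by symmetry, let sites 0..s-1 hold bit 1 and probe a random subset
        ones = sum(1 for i in rng.sample(range(k), z) if i < s)
        ok += (ones * 2 > z) == high
    return ok / trials

rng = random.Random(1)
k, sqrt_k = 2500, 50
few = success_rate(k, sqrt_k, 50, 3000, rng)     # z = o(k): near 1/2
many = success_rate(k, sqrt_k, 2000, 500, rng)   # z = Theta(k): near 1
```

The gap matches the proof: with $z$ probes the two means differ by $2\alpha z$ while the standard deviation is $\Theta(\sqrt{z})$, so distinguishing requires $\alpha z = \Omega(\sqrt{z})$, i.e., $z = \Omega(1/\alpha^2) = \Omega(k)$.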