Analysis of a Bloom Filter Algorithm via the Supermarket Model

Yousra Chabchoub, Christine Fricker and Hanene Mohamed

Y. Chabchoub and C. Fricker are with INRIA Paris-Rocquencourt, Domaine de Voluceau, 78153 Le Chesnay, France. Email: First-Name.Last-Name@inria.fr. H. Mohamed is with Institut Universitaire de Technologie de Sceaux, 8 Avenue Cauchy, 92330 Sceaux, France. Email: Hanene.Mohamed@iut-sceaux.fr

Abstract. This paper deals with the problem of identifying elephants in Internet traffic. The aim is to analyze a new adaptive algorithm based on a Bloom filter. This algorithm uses a so-called min-rule, which can be described in terms of the supermarket model. This model consists of joining the shortest queue among $d$ queues selected at random from a large number $m$ of queues. In case of equality, one of the shortest queues is chosen at random. An analysis of a simplified model gives insight into the error generated by the algorithm on the estimation of the number of elephants. The main conclusion is that, as $m$ gets large, there is a deterministic limit for the empirical distribution of the filter counters. Limit theorems are proved and the limit is identified. It depends on key parameters. The condition for the algorithm to perform well is discussed. Theoretical results are validated by experiments on a traffic trace from France Telecom and by simulations.

I. INTRODUCTION

To be efficient, network traffic measurement methods have to be adapted to actual traffic characteristics. Internet links currently carry a huge amount of data at a very high bit rate (40 Gb/s in OC-768). To analyze this traffic on-line, scalable algorithms are required: they have to operate fast and use a small, limited memory. Traffic is mainly analyzed at the flow level. A flow is a sequence of packets defined by the classical 5-tuple composed of the source and destination addresses, the source and destination port numbers, and the protocol type. Flow statistics are very useful for traffic engineering and network management. In particular, information about large flows (also called elephants) is of interest for many applications. An elephant is a flow with at least $K$ packets, where $K$ is in practice equal to 20. Elephants are not numerous (around 5 to 20% of the number of flows), but they represent the main part (80-90%) of the traffic volume in terms of packets. Elephant statistics can be exploited in various fields such as attack detection or accounting.

In the literature, some probabilistic algorithms have been developed to estimate on-line the number of elephants in dense traffic. In [1], Flajolet analyzed the Adaptive Sampling algorithm proposed by Wegman. This algorithm is based on a special sampling method that provides a random set of the original flows. Some characteristics of elephants (number, size distribution) can be inferred from this sample. Some other algorithms (see [2], [3]) based on sampling are designed to provide elephant statistics, but they seem to be designed for very specific flow size distributions, or they require a priori knowledge of the total number of flows to recover the loss of information caused by the sampling. Moreover, none of these algorithms is able to identify elephants, that is, to give their addresses. Such information is particularly useful for attack detection. For that purpose, Estan and Varghese [4] propose an algorithm based on Bloom filters. This algorithm is fast and uses a limited memory, but it does not adapt to traffic variations: it uses a fixed parameter which should be adjusted according to traffic intensity. Azzana [5] and then Chabchoub et al. [6] propose an improvement of this algorithm by adding a refreshment mechanism that depends on traffic variations.

The principle of this latter algorithm is the following. The filter is composed of $d$ stages. Each stage contains $m$ counters and is associated with a hashing function.
When a packet is received, its IP header is hashed by the $d$ independent hashing functions and the corresponding counter in each stage is incremented by one. When a counter reaches $K$ (the smallest elephant size), the corresponding flow is considered an elephant. Due to the heavy Internet traffic, the filter needs to be refreshed from time to time; otherwise all the counters will exceed the threshold $K$ and all flows will be seen as elephants. The idea is to decrease all counters by one every time the proportion of non-null counters reaches a given threshold $r$. In this way, the refreshment frequency of the filter depends closely on the actual traffic intensity. Notice that the algorithm uses an improvement, the min-rule, also called conservative update in [4]. It consists in incrementing, for an arriving packet, only those of its $d$ counters that have the minimum value. Indeed, because of collisions, the flow size is at most the smallest of its associated counters, so the min-rule reduces the overestimation of the flow size. This algorithm was first presented in [5]. In [6], a more complete version of the algorithm is developed: a new refreshment mechanism based on the average of the counter values is added, and the algorithm (with some modifications) is applied to attack detection. These algorithms are validated using several traffic traces.

Chabchoub et al. present in [7] a theoretical analysis of the algorithm proposed by Azzana and described above. Their objective is to estimate the error generated by the algorithm on the estimation of the number of elephants. That analytical study does not take the min-rule into account.

In this paper, we focus on the analysis of the min-rule. For this purpose, the algorithm described above has been slightly modified. We now consider just one filter and two hashing functions ($d = 2$). An arriving packet increments the smallest counter among the two associated counters.
In case of equality, only one counter, chosen at random, is incremented. In this way, every packet increments exactly one counter. A flow is declared an elephant when its smallest associated counter reaches $C = K/d$. The same refreshment mechanism is maintained, with a threshold $r$ of about 50%. The basic idea is that, when the filter is not overloaded, the successive packets of a given flow will in general increment its two counters alternately. The two counters will thus have almost the same values, and when the smallest one reaches $C$, the corresponding flow has a total size of about $K = dC$ packets ($C$ packets hitting each counter).

The advantage of this new algorithm is that each arriving packet increments exactly one counter. The way counters are incremented is then exactly the way, in a system of $m$ queues, a customer joins the shortest queue among $d$ queues chosen at random, ties being resolved at random. This so-called supermarket model (Luczak and McDiarmid [8], [9]), also known as the load-balancing model with choice, has been extensively studied in the literature because of its numerous applications. In computer science, the central result is stated in a pioneering paper by Azar et al. [10], then by Mitzenmacher [11], for a discrete-time model where $n$ balls are thrown into $n$ urns with choice. It is proved that, with probability tending to 1 as $n$ gets large, the maximum load of an urn is $\log n / \log\log n + O(1)$ when $d = 1$ and $\log\log n / \log d + O(1)$ if $d \ge 2$. Luczak and McDiarmid, in related continuous-time models with choice, explore the concentration of the maximum queue length (see [8], [12]). The model had also already been studied by Vvedenskaya et al. [13], Graham [14] and others for mean-field limit theorems.
In [13], a functional law of large numbers is stated: the vector of the tail proportions of queues converges in distribution, as $m$ tends to infinity, to the unique solution of a differential system. The differential equation has a unique fixed point $u_\rho(k) = \rho^{(d^k - 1)/(d - 1)}$ for a throughput $\rho$. It means that, when $d \ge 2$, the tail probabilities of the queue length decrease drastically (doubly exponentially in $k$). In Vvedenskaya and Suhov [15], variants of the choice policy and general service time distributions are investigated. Graham [14] proves the convergence of the invariant measures to a Dirac measure. In other words, when $m$ is large, the stationary vector of the proportions of queues with $k$ customers is essentially deterministic and given by this limit.

The aim of this paper is to model the min-rule via the supermarket model and to evaluate the performance of the newly proposed algorithm. In particular, we want to compute the error generated by the algorithm on the estimation of the number of elephants. This error is due to both false negatives (missed elephants) and false positives (mice considered as elephants). Let us focus on false positives. To be declared an elephant, a mouse must be hashed to counters whose minimum has reached $C$ after its insertion. The proportion of counters that have reached $C$ is therefore a good parameter to investigate in order to evaluate false positives.

Most of the paper is devoted to the analysis of a simple model where the flows are mice of size one. It is relevant because most flows are mice, so collisions between flows are mainly collisions between mice. It turns out that the probability that a mouse is detected as an elephant is bounded by the probability that a given counter has reached $C$ just before a refreshment time. The problem thus reduces to analyzing the behavior of the model at the refreshment times. Moreover, the transient phase is very short, so the study of the stationary behavior is pertinent.
The key idea of the study is to use the Markovian framework to rigorously establish limit theorems and analytical expressions in the stationary regime. The main result is that, as $m$ tends to infinity, the evolution of the model is characterized by a dynamical system which has at least one fixed point. When $d = 1$, this fixed point is unique and denoted by $\bar w$. In this paper, we conjecture its uniqueness when $d \ge 2$. The interpretation of $\bar w$ as a key quantity in a supermarket model with deterministic service times is discussed. Analytical expressions, given in [7] for $d = 1$, are more complicated to obtain here.

An objective would be to prove the convergence of the invariant measure of the Markov chain, as $m$ tends to $+\infty$, to the Dirac measure $\delta_{\bar w}$ at the fixed point $\bar w$. In practice, such a result is crucial. If it does not hold, that is, if the sequence of invariant measures does not converge, the system oscillates with long periods of transition between different configurations (metastability phenomenon). So even if the algorithm performs well for a while, it can reach another state where it gives bad results. This question is only partially addressed here: the convergence of the invariant measures is conjectured, on the basis of simulations of the algorithm in which no such phenomenon has been observed. For such a result, a possible technique is to exhibit a Lyapunov function, which proves both the convergence of the dynamical system to its unique fixed point $\bar w$ and the convergence of the sequence of invariant measures to $\delta_{\bar w}$. Such a function is exhibited in [7] for $d = 1$ and $C = 2$.

A simulation of the limit distribution $\bar w$ is done for a general (here uniform) mice size distribution. Experiments have two goals. First, the original version presented in [6] is compared with the version of the algorithm introduced here. Second, the time between two refreshments is plotted.
This quantity is crucial for the trade-off between false positives and false negatives. The time to reach the stationary phase is discussed.

The organization of the paper is as follows. Section II presents the analytical results for the simple model defined to study the question of false positives. Section III is devoted to experiments. Section IV discusses how to choose the parameter $r$ so that the algorithm performs well.

II. THE MARKOVIAN URN AND BALL MODEL

A. Description of the model

In this section, the question of false positives is addressed: the probability for a mouse to be detected by the algorithm as an elephant. The problem is studied in a simple framework, where flows are reduced to mice of size one. The model can then be described as an urn and ball model: flows of size one hashed in a filter with $m$ counters can be viewed as balls thrown into $m$ urns with capacity $C$ under the supermarket rule. For each ball, a subset of $d$ urns is chosen at random and the ball is put in the least loaded urn, ties being resolved uniformly. Balls overflowing the capacity $C$ are rejected. If, after putting the ball, the number of non-empty urns exceeds $rm$, then one ball is removed from every non-empty urn.

The probability for a flow to be detected as an elephant reduces to the probability that, after the ball arrival, all the $d$ chosen urns contain $C$ balls. It is bounded by the probability that, just before a refreshment time, after the last ball has been put in its urn, all $d$ urns chosen for that ball contain $C$ balls. The bound is more convenient to study, so the embedded model just before the refreshment times is studied.

B. A Markovian framework

For fixed $C$, let us consider the sequence $(W_n^m)_{n \in \mathbb{N}}$, where $W_n^m$ denotes the vector of the proportions of urns with $0, \ldots, C$ balls just before the $n$-th refreshment time.
For $m \ge 1$, $(W_n^m)_{n \in \mathbb{N}}$ is an ergodic Markov chain on the finite state space

$$\mathcal{P}_m^{(r)} = \Big\{ w \in \Big(\frac{\mathbb{N}}{m}\Big)^{C+1} : \sum_{i=0}^{C} w_i = 1,\ \sum_{i=1}^{C} w_i = \frac{\lceil rm \rceil}{m} \Big\}$$

(where $\lceil rm \rceil$ denotes the smallest integer larger than $rm$). Thus it has a unique invariant measure $\pi_m$.

The problem is that this quantity is combinatorially intractable. Even the transition probability $P_m$ of the Markov chain is very difficult to write down. Nevertheless, one can expect an asymptotic description of this quantity when $m$ is large. In other words, the limit of the invariant measures $\pi_m$ as $m$ gets large is investigated.

C. A dynamical system

The approach used here to obtain limit theorems is classical (see [16] for example). In fact, similar results for $d = 1$ can be found in Chabchoub et al. [7]. The following results extend the case $d = 1$ to $d \ge 1$; the motivation here is of course the case $d \ge 2$. The proofs must often be rewritten with new arguments, and the parts which remain valid will in general be omitted.

The following result is that, as $m \to +\infty$, the Markov chain converges in distribution to a deterministic dynamical system, which is made explicit below. Let

$$\mathcal{P} \stackrel{\text{def}}{=} \Big\{ w \in \mathbb{R}_+^{C+1} : \sum_{i=0}^{C} w_i = 1 \Big\}$$

and

$$\mathcal{P}^{(r)} \stackrel{\text{def}}{=} \Big\{ w \in \mathbb{R}_+^{C+1} : \sum_{i=0}^{C} w_i = 1 \text{ and } \sum_{i=1}^{C} w_i = r \Big\}$$

be the state spaces. Let the shift $s$ be defined on $\mathcal{P}$ by

$$s : w \mapsto (w_0 + w_1, w_2, \ldots, w_C, 0)$$

and let

$$\lambda : \mathcal{P}^{(r)} \to \mathbb{R}_+, \quad w \mapsto \int_{r - w_1}^{r} \frac{du}{1 - u^d}.$$

For a vector of proportions $v \in \mathcal{P}$, it is more convenient to deal with the vector of tail proportions $u$ defined by $u_k = \sum_{i \ge k} v_i$. $G$ is then defined on $\mathcal{P}^{(r)}$ by

$$G(w) = v(\lambda(w)) \tag{1}$$

where $(v(t))$ is associated with $(u(t))$, the unique solution of

$$u_k' = u_{k-1}^d - u_k^d, \quad k \in \{1, \ldots, C\}, \qquad u_0 = 1,$$

with initial condition $u(0)$ corresponding to $v(0) = s(w)$.
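As an illustration of definition (1), the map $G$ can be evaluated numerically. The following is a minimal sketch and not part of the original analysis: the function names (`lam`, `tail`, `rhs`, `G`), the Simpson rule used for $\lambda(w)$ and the fixed-step fourth-order Runge-Kutta integrator are all choices made here for illustration, and $r < 1$ is assumed.

```python
def lam(w, r, d, n=2000):
    """Simpson approximation of lambda(w): integral of du / (1 - u^d) over [r - w_1, r]."""
    a, b = r - w[1], r
    h = (b - a) / n                      # n is assumed even (Simpson's rule)
    f = lambda u: 1.0 / (1.0 - u ** d)
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


def tail(v):
    """Tail proportions u_k = sum of v_i over i >= k."""
    u, acc = [0.0] * len(v), 0.0
    for k in range(len(v) - 1, -1, -1):
        acc += v[k]
        u[k] = acc
    return u


def rhs(u, d):
    """Right-hand side u_k' = u_{k-1}^d - u_k^d for k >= 1; u_0 = 1 is kept constant."""
    return [0.0] + [u[k - 1] ** d - u[k] ** d for k in range(1, len(u))]


def G(w, r, d, steps=2000):
    """Shift w by s, then follow the differential system for a time lambda(w)."""
    C = len(w) - 1
    u = tail([w[0] + w[1]] + list(w[2:]) + [0.0])   # u(0) built from v(0) = s(w)
    h = lam(w, r, d) / steps
    for _ in range(steps):                          # classical Runge-Kutta step
        k1 = rhs(u, d)
        k2 = rhs([x + 0.5 * h * y for x, y in zip(u, k1)], d)
        k3 = rhs([x + 0.5 * h * y for x, y in zip(u, k2)], d)
        k4 = rhs([x + h * y for x, y in zip(u, k3)], d)
        u = [x + h * (p + 2 * q + 2 * s + t) / 6.0
             for x, p, q, s, t in zip(u, k1, k2, k3, k4)]
    return [u[k] - u[k + 1] for k in range(C)] + [u[C]]   # back from tails to v


w = [0.5, 0.3, 0.2, 0.0]       # a point of P^(r) with C = 3 and r = 0.5
g = G(w, r=0.5, d=2)
print(g, sum(g[1:]))           # the second value should be close to r
```

Since $u_1' = 1 - u_1^d$ with $u_1(0) = r - w_1$, the time needed for $u_1$ to reach $r$ is exactly $\lambda(w)$, which is why the non-empty proportion of the output should come back to $r$ up to discretization error.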
Proposition 1: If $W_0^m$ converges in distribution to $w \in \mathcal{P}^{(r)}$, then $(W_n^m)_{n \in \mathbb{N}}$ converges in distribution to the dynamical system $(w_n)_{n \in \mathbb{N}}$ given by the recursion

$$w_{n+1} = G(w_n), \quad n \in \mathbb{N}.$$

Notice that $G$ maps, by definition of $\lambda$, $\mathcal{P}^{(r)}$ to $\mathcal{P}^{(r)}$.

Proof: The result is a consequence of the convergence of the transition $P_m$ of the Markov chain $(W_n^m)_{n \in \mathbb{N}}$, as $m$ tends to $+\infty$, to $P$ given by $P(w, \cdot) = \delta_{G(w)}$. It means that, starting from $w$ just before a refreshment time, the vector of the proportions of urns at the next refreshment time tends to $G(w)$ as $m$ tends to $+\infty$. The uniform convergence stated by the following lemma provides a convenient way to prove Proposition 1.

Lemma 1: For $\varepsilon > 0$,

$$\sup_{w \in \mathcal{P}_m^{(r)}} P_m\big(w, \{ w' \in \mathcal{P}_m^{(r)} : \|w' - G(w)\| > \varepsilon \}\big) \xrightarrow[m \to +\infty]{} 0.$$

Proof: The idea of the proof is that, starting from $w$ (with $\lceil rm \rceil$ non-empty urns), after refreshment the vector of proportions is $s(w) = (w_0 + w_1, w_2, \ldots, w_C, 0)$, where the proportion of non-empty urns is $r - w_1$. Then a number $\tau_1^m$ of balls are thrown in order to reach a state $w'$ with again $\lceil rm \rceil$ non-empty urns. It has to be proved that $w'$ is close to $G(w)$. There are three steps:

1) It can be proved that this number $\tau_1^m$ is deterministic at first order, equivalent to $\lambda(w) m$ when $m$ is large, where

$$\lambda(w) = \int_{r - w_1}^{r} \frac{dt}{1 - t^d}. \tag{2}$$

More precisely,

$$\sup_{w \in \mathcal{P}_m^{(r)}} P_w\Big( \Big| \frac{\tau_1^m}{m} - \lambda(w) \Big| > \varepsilon \Big) \xrightarrow[m \to \infty]{} 0. \tag{3}$$

To see it, starting from $w$, $\tau_1^m$ has an analytical expression as a sum of the numbers $Y_l$ of balls necessary to hit the $(l+1)$-th non-empty urn. Indeed,

$$\tau_1^m = \sum_{l = \lceil rm \rceil - w_1 m}^{\lceil rm \rceil - 1} Y_l, \tag{4}$$

where the $Y_l$, $l \in \mathbb{N}$, are independent random variables with geometric distributions on $\mathbb{N}^*$ with respective parameters

$$a_l = \prod_{j=0}^{d-1} \frac{l - j}{m - j}$$

(the probability that the $d$ chosen urns are all among the $l$ non-empty ones), i.e. $P(Y_l = n) = a_l^{\,n-1} (1 - a_l)$, $n \ge 1$.
As $E(Y_l) = 1 / (1 - a_l)$, computing the mean and comparing this sum with integrals leads to

$$\sup_{w \in \mathcal{P}_m^{(r)}} \Big| E_w\Big( \frac{\tau_1^m}{m} \Big) - \lambda(w) \Big| \xrightarrow[m \to \infty]{} 0. \tag{5}$$

At the same time, as $\operatorname{Var}(Y_l) = a_l / (1 - a_l)^2$, uniformly in $w \in \mathcal{P}_m^{(r)}$,

$$\frac{\operatorname{Var}_w(\tau_1^m)}{m} \xrightarrow[m \to \infty]{} \int_{r - w_1}^{r} \frac{dt}{(1 - t^d)^2}. \tag{6}$$

By the Bienaymé-Chebyshev inequality, using equations (5) and (6), this proves (3).

2) From the previous fact, there is a natural coupling between throwing $\tau_1^m$ balls and throwing $\lambda(w) m$ balls, under which the vectors of proportions $W_1^m$ and, say, $\tilde W_1^m$ are close to each other. There is also a coupling between throwing $\lambda(w) m$ balls and throwing a Poisson number of balls with parameter $\lambda(w) m$, for which, by Chernoff's inequality, the vectors of proportions $\tilde W_1^m$ and, say, $\hat W_1^m$ are close.

3) The vector of proportions $\hat W_1^m$ obtained by coupling is the vector of proportions at time $\lambda(w)$ in a supermarket queueing model without departures. The model consists of $m$ queues with capacity $C$ where customers arrive according to a Poisson process with rate $m$. At each arrival, a subset of $d$ queues is chosen and the customer joins the shortest one, ties being resolved at random. Let $W^m(t)$ be the vector of the proportions of the $m$ queues with $0, 1, \ldots, C$ customers at time $t$. It is more convenient to deal with the tail proportions defined as

$$U_k^m(t) = \sum_{i \ge k} W_i^m(t).$$

Given $W^m(0) = s(w)$, we have $\hat W_1^m = W^m(\lambda(w))$. By the convergence of the Markov process $(U^m(t))$ to the fluid limit (see Vvedenskaya et al. [13]), $\hat W_1^m$ converges in distribution to $v(\lambda(w))$, where $v$ is associated with $u$, the fluid limit of $(U^m(t))$, which is the unique solution of the differential system

$$\frac{du_k}{dt} = u_{k-1}^d - u_k^d \quad (1 \le k \le C), \qquad u_0 = 1, \tag{7}$$

with initial condition $u(0)$ corresponding to $v(0) = s(w)$.
Moreover, using the continuity of the solution of a differential equation with respect to the initial condition, for each $\varepsilon > 0$ and $0 \le k \le C$,

$$P\Big( \sup_{w \in \mathcal{P}_m^{(r)}} \big| U_k^m(\lambda(w)) - u_k(\lambda(w)) \big| > \varepsilon \Big) \xrightarrow[m \to \infty]{} 0,$$

which straightforwardly leads to

$$P\Big( \sup_{w \in \mathcal{P}_m^{(r)}} \big\| \hat W_1^m - G(w) \big\| > \varepsilon \Big) \xrightarrow[m \to \infty]{} 0$$

where $G(w) = v(\lambda(w))$. This ends the proof of the lemma.

The argument to obtain Proposition 1 from Lemma 1 is standard and detailed in [7, Proposition 1]. It is omitted here.

D. Fixed point of the dynamical system

The function $G : \mathcal{P}^{(r)} \to \mathcal{P}^{(r)}$, $w \mapsto G(w)$, being continuous on the convex compact set $\mathcal{P}^{(r)}$, it has a fixed point by Brouwer's theorem. It remains to prove the uniqueness of the fixed point. Recall that, for $d = 1$, the proof is based on the interpretation of the fixed point equation $G(w) = w$ as the invariant measure equation $wP = w$ of some ergodic Markov chain with transition $P$. This Markov chain is the queue length at the service completion times of an $M/G/1/C$ queue with deterministic service times equal to 1 and arrival rate $\lambda(w)$. Writing $\mu_\lambda$ for the invariant measure of this chain when the arrival rate is $\lambda$, the proof of the uniqueness of the solution of $w = \mu_{\lambda(w)}$ is then based on the coupling argument that, if $\lambda \le \lambda'$, then $\mu_\lambda$ is stochastically dominated by $\mu_{\lambda'}$ (see [7] for details).

Let us try to extend the argument to the case $d \ge 2$. For that, consider the following system. Balls are thrown into $m$ urns with capacity $C$ according to a Poisson arrival process with rate $\lambda m$. Each ball joins the least loaded urn among a subset of $d$ urns chosen at random. Ties are resolved uniformly. At each unit time, one ball is removed from each non-empty urn. It can be proved as in Proposition 1 that the vector of the proportions of urns with $k$ balls just before time $n$ converges, when $m$ is large, to a dynamical system

$$w_{n+1} = H(w_n)$$

where, for $v$ defined as previously with initial condition $v(0) = s(w)$,

$$H(w) = v(\lambda).$$

But the argument fails for $d \ge 2$.
Indeed, the equation $w = H(w)$ cannot be interpreted as the invariant measure equation of some ergodic Markov chain $(L_n)_{n \in \mathbb{N}}$ on $\{0, \ldots, C\}$, because the differential system (7) is not linear for $d > 1$. If it were, there would exist $P_\lambda$ such that

$$H(w) = v(\lambda) = v(0) P_\lambda = s(w) P_\lambda = wP$$

where $P = Q P_\lambda$, because $s(w)$ can easily be written $wQ$ where $Q$ is a transition matrix. Thus another way to prove it should be found and, at this point, the uniqueness of the fixed point is conjectured.

E. Identification of the fixed point

If the capacity is infinite, then the parameter $\lambda(\bar w)$ is equal to $r$. In this case, it is simple to obtain an explicit expression of $\bar w_1$, which is a good approximation for the case $C = 20$. This is the purpose of this section.

Assume that $C = +\infty$. By definition,

$$\lambda(\bar w) = \int_{r - \bar w_1}^{r} \frac{dt}{1 - t^d}$$

where $F(x) = \int_0^x \frac{dt}{1 - t^d}$ defines a bijection from $[0, 1)$ to its image. Using that $\lambda(\bar w) = r$ for $C = +\infty$, it can be rewritten $r = F(r) - F(r - \bar w_1)$, or

$$\bar w_1 = r - F^{-1}(F(r) - r). \tag{8}$$

Notice that for $d \in \mathbb{N}$, $F$ has an explicit expression. For $d = 1$, $F(x) = -\log(1 - x)$, which leads (see [7]) to

$$\bar w_1 = (1 - r)(e^r - 1).$$

Moreover, for $d = 2$, $F(x) = \operatorname{artanh} x$ and thus

$$\bar w_1 = r - \frac{r - \tanh r}{1 - r \tanh r}. \tag{9}$$

In Figure 1, $\bar w_1$ is plotted for $d = 1, 2$.

Fig. 1. Limit proportion $\bar w_1$ of counters with value 1 for $d = 1, 2$, plotted as a function of $r$.

F. Convergence of invariant measures

A first result is obtained. It is proved in [7, Proposition 2] and recalled here without proof.

Proposition 2: Let, for $m \in \mathbb{N}$, $\pi_m$ be the stationary distribution of $(W_n^m)_{n \in \mathbb{N}}$. Define $P$ as the transition on $\mathcal{P}^{(r)}$ given by $P(w, \cdot) = \delta_{G(w)}$. Any limiting point $\pi$ of $(\pi_m)_{m \in \mathbb{N}}$ is a probability measure on $\mathcal{P}^{(r)}$ which is invariant for $P$, i.e. which satisfies $G(\pi) = \pi$.
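Returning briefly to the closed forms of Section II-E, the expressions for $d = 1$ and $d = 2$ can be checked against the general formula (8). The sketch below is only illustrative: the function names (`w1_d1`, `w1_d2`, `w1_from_F`) are introduced here and do not come from the paper, and $F^{-1}$ is supplied by hand for each $d$.

```python
import math


def w1_d1(r):
    """Closed form for d = 1: (1 - r)(e^r - 1)."""
    return (1.0 - r) * (math.exp(r) - 1.0)


def w1_d2(r):
    """Equation (9), the d = 2 case."""
    t = math.tanh(r)
    return r - (r - t) / (1.0 - r * t)


def w1_from_F(r, F, F_inv):
    """Equation (8): r - F^{-1}(F(r) - r), for a given F and its inverse."""
    return r - F_inv(F(r) - r)


for r in (0.1, 0.3, 0.5, 0.7, 0.9):
    # d = 1: F(x) = -log(1 - x), so F^{-1}(y) = 1 - exp(-y)
    via_8_d1 = w1_from_F(r, lambda x: -math.log(1.0 - x), lambda y: 1.0 - math.exp(-y))
    # d = 2: F = artanh, so F^{-1} = tanh
    via_8_d2 = w1_from_F(r, math.atanh, math.tanh)
    print(r, w1_d1(r), via_8_d1, w1_d2(r), via_8_d2)
```

For each $r$, the two values printed for a given $d$ should agree up to rounding, which is just the addition formula for $\tanh$ in the case $d = 2$.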
As noticed in the introduction, the limiting point of $(\pi_m)$ is not necessarily unique, because there is not necessarily a unique measure $\pi$ such that $G(\pi) = \pi$. If $G$ has a unique fixed point $\bar w$, then $G(\delta_{\bar w}) = \delta_{\bar w}$. But imagine that $G$ also has cycles, i.e. that there exist $n \ge 2$ and $w_1, \ldots, w_n$ in $\mathcal{P}^{(r)}$ such that

$$G(w_i) = w_{i+1} \ (1 \le i < n), \qquad G(w_n) = w_1;$$

then $\pi = \frac{1}{n} \sum_{i=1}^{n} \delta_{w_i}$ is invariant under $G$. This gives two different possible limiting points for $(\pi_m)$. A way to prove the convergence is to exhibit a Lyapunov function for $G$ (see [7, Theorem 1] for details). Such a Lyapunov function is exhibited in [7] for $d = 1$ and $C = 2$. This is not investigated here.

G. General mice size distribution

The aim of this subsection is to extend the previous results to a model with a general size distribution. An approximate model is used. Indeed, as mice are short (with a mean size of a few packets, close to 4 in real traffic traces), an approximation is to consider that the packets of a mouse are thrown into the target counters without interleaving with the packets of other mice. It means that the packets of the different mice arrive consecutively in the filter. The model chosen is thus an urn and ball model where balls are thrown by batches. The balls of a batch are thrown together into a single urn, the least loaded among $d$ urns chosen at random among the $m$ urns. The $i$-th batch is composed of $S_i$ balls, where the $S_i$ are independent random variables with distribution denoted by $p$. Let also $(W_n^m)_{n \in \mathbb{N}}$ be the sequence of vectors giving the proportions of urns at $0, \ldots, C$ just before the $n$-th refreshment time in this model where balls are thrown by batches. The dynamics is the same: if, before a refreshment time, the state is $w \in \mathcal{P}_m^{(r)}$, it becomes $s(w)$, and then a number $\tau_1^m(w)$, defined by (4), of successive batches are thrown into urns until $\lceil rm \rceil$ urns are non-empty. The model generalizes the previous one, obtained for mice of size one ($p(1) = 1$).
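The urn and ball dynamics described above (size-one balls in Section II-A, batches in this subsection) can be simulated directly. The following sketch is an illustration under stated assumptions rather than the simulation code used for the paper: the name `simulate` and its parameters are hypothetical, the refreshment is triggered when $\lceil rm \rceil$ urns are non-empty, a batch that would exceed the capacity is truncated at $C$, and batch sizes are assumed to be at least 1.

```python
import math
import random


def simulate(m, C, d, r, n_refresh, batch_size, seed=0):
    """Return the proportion vectors (urns holding 0..C balls) seen just before
    each of the first n_refresh refreshments."""
    rng = random.Random(seed)
    urns = [0] * m
    target = math.ceil(r * m)       # assumed trigger: ceil(r m) non-empty urns
    nonempty = 0
    history = []
    while len(history) < n_refresh:
        chosen = rng.sample(range(m), d)                  # d distinct urns
        low = min(urns[i] for i in chosen)
        i = rng.choice([j for j in chosen if urns[j] == low])   # uniform tie-break
        if urns[i] == 0:
            nonempty += 1
        urns[i] = min(C, urns[i] + batch_size(rng))       # assumed: overflow is dropped
        if nonempty >= target:
            counts = [0] * (C + 1)
            for x in urns:
                counts[x] += 1
            history.append([c / m for c in counts])
            urns = [x - 1 if x > 0 else 0 for x in urns]  # refreshment
            nonempty = sum(1 for x in urns if x > 0)
    return history


# Mice of size one, then batches uniform on {1, ..., 7} (mean 4; the support is an assumption).
print(simulate(2000, 10, 2, 0.5, 50, lambda rng: 1)[-1])
print(simulate(2000, 10, 2, 0.5, 50, lambda rng: rng.randint(1, 7))[-1])
```

Each returned vector lies in $\mathcal{P}_m^{(r)}$ by construction, so successive entries of `history` are samples of the embedded chain $(W_n^m)$.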
Note that equation (9) is extended in this case by

$$\bar w_1 = r - \frac{r - \tanh(r / E(S))}{1 - r \tanh(r / E(S))}. \tag{10}$$

Let $G$ be defined on $\mathcal{P}$ by

$$G(w) = v(\lambda(w)) \tag{11}$$

where the tail function $(u(t))$ corresponding to $(v(t))$ is the unique solution of the differential equation

$$\frac{du_k}{dt} = \sum_{j=1}^{k} p_j \big( u_{k-j}^d - u_k^d \big) \quad (1 \le k \le C), \qquad u_0 = 1. \tag{12}$$

Everything in Section II remains valid. The supermarket model obtained by coupling is a model with batch arrivals and without departures. Its mean-field limit is obtained as previously and leads to the differential equation (12). Propositions 1 and 2 hold. The description of the fixed point can be extended.

III. EXPERIMENTS

In this section, the proposed algorithm is tested against an ADSL traffic trace from the France Telecom IP backbone network. This traffic trace was captured on a Gigabit Ethernet link in October 2003 between 9:00 pm and 10:00 pm, a period corresponding to peak activity by ADSL customers. Its duration is 1 hour and it contains more than 10 million TCP flows. In our experiments, the filter consists of $m = 2^{20}$ counters associated with two independent hashing functions ($d = 2$). Elephants are here defined as flows with at least 20 packets ($K = 20$).

Fig. 2. Impact of the supermarket model on the estimation of the number of elephants, $r = 50\%$, France Telecom trace. (Relative error in % against time in minutes, for the original version of the algorithm and for the supermarket model.)

The relative error on the estimated number of elephants is plotted in Figure 2. Two different versions of the algorithm are considered: the original algorithm developed in [5], [6] and the proposed algorithm using the supermarket model.
We recall that these two algorithms both use the min-rule (incrementing only the smallest counter), but in a different way: in case of equality, only one counter, chosen at random, is incremented with the supermarket model, whereas both counters are incremented in the original version of the algorithm. Results show that both methods give a good estimation of the number of elephants for the whole duration of the trace.

Fig. 3. Duration of the transition and the stationary phase, $r = 50\%$, France Telecom trace. (Refreshment time against refreshment rank, for the original version of the algorithm and for the supermarket model.)

Figure 3 presents the inter-refreshment time (duration between two successive refreshments, in number of arriving packets) for the whole traffic trace. It can be noticed that the stationary phase is reached at the $K$-th refreshment time. So the transient phase seems to be rather short, according to experiments. The stationary inter-refreshment time of the algorithm based on the supermarket model is higher than the one obtained with the original version of the algorithm. This can be explained by the fact that with the supermarket model every arriving packet increments exactly one counter, whereas in the original version, if the two selected counters are equal, they are both incremented by one. In particular, when they are both null, both become non-null. As a consequence, the proportion of non-null counters grows faster and the filling threshold $r$ is reached more quickly.

Figure 3 explains the behavior of the algorithms plotted in Figure 2. In the original algorithm, the inter-refreshment time is lower, thus more elephants are missed and the error is negative. In the supermarket version, the error is positive due to false positives.
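The difference between the two tie-breaking rules discussed above can be made concrete with a small sketch. It is a simplified illustration, not the code used for the experiments: the function `run_filter`, its parameters and the salted use of Python's built-in `hash` are assumptions made here, flows are represented by integer identifiers, and elephant detection (and the exclusion of packets of detected elephants) is left out.

```python
import random


def run_filter(flows, m, r, tie_rule, seed=0):
    """Feed a packet stream to a single filter with two hash functions.

    flows: iterable of integer flow identifiers, one entry per packet.
    tie_rule: "one" (supermarket: one counter at random) or "both" (original min-rule).
    Returns (counters, inter_refreshment_times in number of packets).
    """
    rng = random.Random(seed)
    salts = (rng.getrandbits(32), rng.getrandbits(32))   # stand-ins for two hash functions
    counters = [0] * m
    nonzero = 0
    since_refresh = 0
    gaps = []
    for flow in flows:
        i, j = (hash((s, flow)) % m for s in salts)
        since_refresh += 1
        if i == j or counters[i] != counters[j]:
            targets = [i if counters[i] <= counters[j] else j]   # strict minimum
        elif tie_rule == "one":
            targets = [rng.choice((i, j))]                       # tie: one at random
        else:
            targets = [i, j]                                     # tie: both counters
        for t in targets:
            if counters[t] == 0:
                nonzero += 1
            counters[t] += 1
        if nonzero >= r * m:                                     # refreshment
            counters = [c - 1 if c > 0 else 0 for c in counters]
            nonzero = sum(1 for c in counters if c > 0)
            gaps.append(since_refresh)
            since_refresh = 0
    return counters, gaps


mice = list(range(20000))        # 20000 distinct one-packet flows
for rule in ("one", "both"):
    _, gaps = run_filter(mice, m=1024, r=0.5, tie_rule=rule)
    print(rule, len(gaps), gaps[-5:])
```

With the rule `"one"`, each packet adds exactly one unit to the sum of the counters, whereas `"both"` adds two units on every tie, which is the mechanism invoked above to explain the shorter inter-refreshment times of the original version.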
Fig. 4. Comparison between $rm$ and the stationary inter-refreshment time $\tau_\infty^m$, France Telecom trace. ((Stationary refreshment time)/$rm$ against $r$ in %, supermarket model.)

In Figure 4, the impact of $r$ on the stationary inter-refreshment time $\tau_\infty^m$ is investigated. More precisely, $\tau_\infty^m / rm$ is plotted for various values of $r$. According to experiments, $\tau_\infty^m$ is very close to $rm$. In fact, a refreshment can be seen as removing $rm$ from the sum $S$ of all counters (it decreases by one all non-null counters, of which there are exactly $rm$ since the refreshment is performed as soon as the filling threshold $r$ is reached). As we are in the stationary phase, $w_i$, the proportion of counters at $i$, converges for $i \in \{0, \ldots, C\}$. Therefore the sum of all counters converges. So before the next refreshment, $rm$ packets must be inserted into the filter for $S$ to recover its former value. Packets belonging to elephants which have already been detected are not taken into account: those packets are very numerous and they are not inserted into the filter, to avoid polluting it.

IV. DISCUSSION

The performance of the algorithm clearly depends on the filling threshold $r$. To have a good estimation of the number of elephants, $r$ must be around 50%. When $r$ has higher values, the number of elephants will be largely overestimated due to false positives. The key quantity is $\bar w_i$, the stationary proportion of counters at $i$ when $m$ gets large. An explicit expression for $\bar w$ is not available, even if a numerical value could be computed. Nevertheless, less ambitiously, one may simply look for the critical value of $r$ for which $\bar w_C$ becomes non-negligible. At least, the impact of $r$ on $\bar w$ is shown here by simulation.

Fig. 5. Impact of $r$ on the limit stationary distribution $\bar w$; simulation using only mice with uniform size distribution of mean 4, $m = 2000$. ($w(i)$ against $i$, for $r = 50\%$ and $r = 90\%$.)

Figure 5 (where $w$ is written for $\bar w$) is not based on a real traffic trace but on simulation. The objective here is to evaluate the limit stationary distribution $\bar w$ for traffic composed only of mice. The mean mice size is taken equal to four to be close to real traffic (this value is deduced from the real traffic trace). Under these conditions, we obtain a decreasing limit stationary distribution $\bar w$ when $r$ equals 50%. For a filling threshold of 90%, counters are very likely to be higher: the main part of the counter values is around six and there are many counters at $C$. This explains why the algorithm performs better with a filling threshold around 50%.

Equation (10) with $C = \infty$ gives a very good approximation of the values of $\bar w_1$ in this case. Indeed, the analytical expression gives $\bar w_1 = 0.10$ for $r = 0.5$ and $\bar w_1 = 0.05$ for $r = 0.9$. These values are very close to the values obtained by simulation in Figure 5.

V. CONCLUSION

We analyze in this paper a new algorithm catching elephants on-line in the Internet. This algorithm is based on Bloom filters with a refreshment mechanism that depends on the current traffic intensity. It also uses a conservative way to update counters, called the min-rule. The latter consists exactly in incrementing the lowest counter among a set of $d$ counters chosen at random, as in the supermarket model, which provides a much lighter tail distribution for the counter values. For a model involving just mice, limit theorems establish the existence of a deterministic limit for the empirical distribution of the counter values when the filter size gets large. This limit can be exploited to adjust the parameters for the algorithm to perform well.
The accuracy of the algorithm and some theoretical results are tested against a traffic trace from France Telecom and by simulations.

REFERENCES

[1] P. Flajolet, "On adaptive sampling," Computing, pp. 391-400, 1990.
[2] O. Gandouet and A. Jean-Marie, "Loglog counting for the estimation of IP traffic," in Proceedings of the 4th Colloquium on Mathematics and Computer Science: Algorithms, Trees, Combinatorics and Probabilities, Nancy, France, 2006.
[3] Y. Chabchoub, C. Fricker, F. Guillemin, and P. Robert, "Inference of flow statistics via packet sampling in the Internet," IEEE Communications Letters, pp. 897-899, 2008.
[4] C. Estan and G. Varghese, "New directions in traffic measurement and accounting," in Proc. Sigcomm'02, Pittsburgh, Pennsylvania, USA, August 19-23, 2002.
[5] Y. Azzana, "Mesures de la topologie et du trafic Internet," Ph.D. dissertation, Université de Paris 6, July 2006. [Online]. Available: http://www-c.inria.fr/twiki/pub/RAP/FormerMembers/Azzana-PhD.pdf
[6] Y. Chabchoub, C. Fricker, F. Guillemin, and P. Robert, "Adaptive algorithms for identifying large flows in IP traffic," submitted to ITC21, 21st International Teletraffic Congress, September 2009.
[7] Y. Chabchoub, C. Fricker, F. Meunier, and D. Tibi, "Analysis of an algorithm catching elephants on the Internet," in Fifth Colloquium on Mathematics and Computer Science, ser. DMTCS Proceedings Series, September 2008, pp. 299-314.
[8] M. J. Luczak and C. McDiarmid, "On the maximum queue length in the supermarket model," Ann. Probab., vol. 34, no. 2, pp. 493-527, 2006.
[9] M. J. Luczak and C. McDiarmid, "Asymptotic distributions and chaos for the supermarket model," Electron. J. Probab., vol. 12, no. 3, pp. 75-99 (electronic), 2007.
[10] Y. Azar, A. Z. Broder, and A. R. Karlin, "On-line load balancing," Theoret. Comput. Sci., vol. 130, no. 1, pp. 73-84, 1994.
[11] Mitzenmacher, "The power of two choices in randomized load balancing," Ph.D. dissertation, Berkeley, 1996.
[12] M. J. Luczak and C. McDiarmid, "On the power of two choices: balls and bins in continuous time," Ann. Appl. Probab., vol. 15, no. 3, pp. 1733-1764, 2005.
[13] N. D. Vvedenskaya, R. L. Dobrushin, and F. I. Karpelevich, "A queueing system with a choice of the shorter of two queues: an asymptotic approach," Problemy Peredachi Informatsii, vol. 32, no. 1, pp. 20-34, 1996.
[14] C. Graham, "Chaoticity results for 'join the shortest queue'," in Council for African American Researchers in the Mathematical Sciences, Vol. III (Baltimore, MD, 1997/Ann Arbor, MI, 1999), ser. Contemp. Math. Providence, RI: Amer. Math. Soc., 2001, vol. 275, pp. 53-68.
[15] N. D. Vvedenskaya and Y. M. Sukhov, "Dobrushin's mean-field approximation for a queue with dynamic routing," INRIA, Tech. Rep. 3328, Dec. 1997.
[16] V. Dumas, F. Guillemin, and P. Robert, "A Markovian analysis of additive-increase multiplicative-decrease algorithms," Adv. in Appl. Probab., vol. 34, no. 1, pp. 85-111, 2002.
