A statistical analysis of probabilistic counting algorithms
Authors: Peter Clifford, Ioana A. Cosma
A Statistical Analysis of Probabilistic Counting Algorithms

PETER CLIFFORD, Department of Statistics, University of Oxford
IOANA A. COSMA, Statistical Laboratory, University of Cambridge

ABSTRACT. This paper considers the problem of cardinality estimation in data stream applications. We present a statistical analysis of probabilistic counting algorithms, focusing on two techniques that use pseudo-random variates to form low-dimensional data sketches. We apply conventional statistical methods to compare probabilistic algorithms based on storing either selected order statistics, or random projections. We derive estimators of the cardinality in both cases, and show that the maximal-term estimator is recursively computable and has exponentially decreasing error bounds. Furthermore, we show that the estimators have comparable asymptotic efficiency, and explain this result by demonstrating an unexpected connection between the two approaches.

Key words: asymptotic relative efficiency, cardinality, data sketching, data stream, hash function, maximum likelihood estimation, space complexity, stable distribution, tail bounds.

1 Introduction

High-throughput, transiently observed data streams pose novel and challenging problems for computer scientists and statisticians (Muthukrishnan, 2005; Aggarwal, 2007). Advances in science and technology are continually expanding both the size of data sets available for analysis and the rate of data acquisition; examples include increasingly heavy Internet traffic on routers (Akella et al., 2003; Cormode & Muthukrishnan, 2005b), high-frequency financial transactions, and commercial database applications (Whang et al., 1990).
The online approximation of properties of data streams, such as cardinality, frequency moments, quantiles, and empirical entropy, is of great interest (Cormode & Muthukrishnan, 2005a; Harvey et al., 2008). The goal is to construct and maintain sub-linear representations of the data from which target properties can be inferred with high efficiency (Aggarwal, 2007). Data stream algorithms typically allow only one pass over the data, i.e., data are observed, processed to update the representation, and then discarded. By 'efficient' with respect to the inference procedure, we mean that estimators are accurate with high probability. With respect to the handling of data, we mean that the algorithm has fast processing and updating time per data element, uses low storage, and is insensitive to the order of arrival of data. This is in contrast to sampling-based techniques, which are sensitive to the pattern of repetitions in the data.

This article focuses on the problem of estimating the number of distinct items in a data stream when storage constraints preclude the possibility of maintaining a comprehensive list of previously observed items. The number of distinct items, or cardinality, can, for example, refer to pairs of source-destination IP addresses, observed within a given time window of Internet traffic, monitored for the purpose of anomaly detection, e.g., denial-of-service attacks on the network (Giroire, 2009). There is a surprisingly long history of work on cardinality estimation in the computer science literature, starting from the pioneering work of Flajolet & Martin (1985), and developed in isolation from mainstream statistical research. Our purpose is to re-analyse these algorithms in 'traditional' statistical terms.
The focus will be on comparisons in terms of asymptotic relative efficiency, pivotal quantities, and statistical error bounds, as opposed to the focus in the computer science literature on storage space and processing time. We will concentrate on sketching algorithms that exploit hash functions to record meaningful information, either by storing order statistics or by random projections.

The paper is organised as follows. In Section 2 we define terms, such as hash function and hashing, and give a brief and selective history of cardinality estimation algorithms. We then investigate two types of algorithms from a conventional statistical viewpoint, deriving maximum likelihood estimators (MLEs) for methods based on order statistics in Section 3 and random projections in Section 4. For order statistic methods, we show that the choice of sampling distribution is immaterial when sampling from a continuous distribution, but that substantial savings in storage can be achieved by using samples from the geometric distribution without significant reduction in the asymptotic relative efficiency. We also show that these estimators are recursively computable with exponentially decreasing error bounds. We then propose an approximate estimator for projection methods using $\alpha$-stable distributions, with $\alpha$ close to zero. In Section 5 we compare the two methods and find unexpectedly that, in a certain sense, they are essentially equivalent. Finally, in Section 6, we compare the performance of our algorithms to existing benchmark algorithms on simulated data. Section 7 concludes the paper.

2 Definitions and history

We define a discrete data stream to be a transiently observed sequence of data elements with types drawn from a countable, possibly infinite, set $\mathcal{I}$. At discrete time points $t = 1, \ldots, T$, a pair of the form $(i_t, d_t)$ is observed, where $i_t \in \mathcal{I}$ is the type of the data element, and $d_t$ is an integer-valued quantity. Let $\mathcal{I}_T$ be the set of distinct data types observed by time $T$. A basic goal in data stream analysis is to obtain information about the collection $a(T) = \{a_i(T), i \in \mathcal{I}_T\}$, where $a_i(T) = \sum_{t=1}^T d_t I(i_t = i)$ is the cumulative quantity of type $i$ at time $T$. When there is no possibility of confusion, we write $a$ and $a_i$ for $a(T)$ and $a_i(T)$, respectively. Our concern will be primarily with the special case when $d_t > 0$ for all $t$; the cash register case in the terminology of Cormode et al. (2003). Many summary statistics of interest are functions of $a$, e.g., $c = \sum_{i \in \mathcal{I}_T} I(a_i(T) > 0)$, the cardinality of the set $\mathcal{I}_T$ in the cash register case. Recall that we are assuming that storage constraints make it impossible to know $a$ precisely.

Hashing (Knuth, 1998) is a basic tool used in processing data, where the type of a data element is identified by a complicated label. Hashing was originally designed to speed up table lookup for the purpose of item retrieval or for identifying similar items. For example, suppose that data elements are records of company employees, uniquely identified by complicated labels, that must be stored in a table. A hash function can be designed to map the label to an integer value in a given range, called the hash value, indexing the location in the table where the corresponding employee record is stored. Press et al. (2007) present algorithms for constructing hash functions. Given the hash function and a label, the corresponding record is easily accessible for updating, for example.

In general, a hash function $h : \mathcal{I} \mapsto \{1, \ldots, L\}$ is a deterministic function of the input in $\mathcal{I}$ that has low collision probability, i.e., $\mathrm{P}(h(i) = h(j)) < 1/L$ for $i \neq j$, where a collision occurs if two or more different inputs are mapped to the same hash value (Knuth, 1998). A truly random hash function maps values in $\mathcal{I}$ to $\{1, \ldots, L\}$ independently; however, no construction exists for such functions. Instead, the requirement for independence is reduced to $k$-wise independence, where any $k$ distinct values in $\mathcal{I}$ are mapped to $k$ independent values in $\{1, \ldots, L\}$. Carter & Wegman (1979) is the first reference on constructing $k$-wise independent hash functions.

For our purposes, we can think of a hash function as the mapping between the seed of a random number generator and the first element in the sequence of computer generated pseudo-random numbers, usually uniformly distributed over some range. A collection of independent hash functions $(h_1, \ldots, h_m)$ then corresponds to the $m$ individual mappings from the seed to the first $m$ elements of a pseudo-random sequence. This method of constructing a hash function mapping to pseudo-random numbers having a given distribution is known as the method of seeding. Nisan (1992) shows that there exists an explicit implementation of a pseudo-random number generator that converts a random seed, i.e., in our case, an element in $\mathcal{I}$, to a sequence of bits, indistinguishable from truly random bits. Hence, we can assume throughout that the sequences underlying our hash functions are truly random.

2.1 Probabilistic counting

Flajolet & Martin (1985) introduce the idea of independently hashing each element $i \in \mathcal{I}_T$ to a long string of pseudo-random bits, uniformly distributed over a finite range. Let $\rho(i)$ denote the rank of the first 1-bit in $h(i)$.
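For illustration, the rank statistic can be computed directly from an integer-valued hash. The following sketch is illustrative only (it is not part of the original algorithms); a 64-bit blake2b digest stands in for the pseudo-random bit string $h(i)$, and $\rho(i)$ is read off as the 1-based position of the lowest-order 1-bit.

```python
import hashlib

def h64(item: str) -> int:
    """Hash an item to a 64-bit pseudo-random integer (a stand-in for h(i))."""
    return int.from_bytes(hashlib.blake2b(item.encode(), digest_size=8).digest(), "big")

def rho(x: int) -> int:
    """Rank of the first 1-bit: 1-based position of the lowest-order set bit."""
    return (x & -x).bit_length()  # x & -x isolates the lowest set bit

# A repeated item hashes identically, so duplicates add no new information.
assert h64("alice") == h64("alice")
assert rho(0b1000) == 4 and rho(0b0110) == 2 and rho(1) == 1
```

Since $\rho(i) = r$ occurs with probability $2^{-r}$ for uniformly random bits, large observed ranks indicate large cardinalities; this is the heuristic underlying the algorithms reviewed below.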
The algorithm stores and updates a bitmap table of all the values of $\rho$ observed, and returns an asymptotically unbiased estimate of the cardinality based on the quantity $\max\{r : \{1, \ldots, r\} \subseteq \{\rho(i), i \in \mathcal{I}_T\}\}$. The LogLog counting algorithm (Durand & Flajolet, 2003) estimates the cardinality from the summary statistic $\max_{i \in \mathcal{I}_T} \rho(i)$, avoiding the need for the bitmap table. The algorithm offers an improvement in terms of storage requirements, for given accuracy, by storing small bytes rather than integers. The Hyper-LogLog counting algorithm (Flajolet et al., 2007) improves the accuracy further by proposing a harmonic mean estimator based on this maximum statistic. The latter algorithm is particularly well suited for large scale cardinality estimation problems. Chen & Cao (2009) develop an algorithm that combines hashing to bit patterns with sampling at an adaptive rate. They show empirically that their algorithm outperforms Hyper-LogLog for small to medium scale problems, but lack theoretical justification of this claim.

Instead of estimating the cardinality from bit patterns, Giroire (2009) hashes the data types uniformly to pseudo-random variables in (0,1), stores order statistics of hash values falling in disjoint subintervals covering this range, and averages cardinality estimates over these subintervals. This approach is called stochastic averaging and was introduced by Flajolet & Martin (1985). The MinCount algorithm (Giroire, 2009) stores the third order statistic, and employs a logarithmic family transformation. Table 1 shows that these estimators have comparable precision. The asymptotic relative efficiency (ARE) is defined as the ratio of $c^2/m$ to the asymptotic variance of the estimator.
Chassaing & Gerin (2006) show that, in a large class of nearly-unbiased cardinality estimators based on order statistics, the variance of an estimator is lower-bounded approximately by $c^2/m$, where stochastic averaging over $m$ intervals is employed. This equals the asymptotic variance of our estimator based on order statistics from Section 3.

Projection methods for $l_\alpha$ norm estimation with streaming data are described in Indyk (2006) for $\alpha \in \{1, 2\}$, and references therein. The idea is to hash distinct data types $i_t$ to independent copies of $\alpha$-stable random variables, and store weighted linear combinations of the hash values. Exploiting properties of the stable law, Cormode et al. (2003) approximate the cardinality using estimates of $l_\alpha$ with $\alpha$ close to zero.

Table 1: A comparison of cardinality estimation algorithms based on hashing and storing order statistics, or random projections. The first four algorithms apply stochastic averaging with m subintervals. Float stands for floating point number.

  Algorithm                                          Cost                      ARE
  Probabilistic counting (Flajolet & Martin, 1985)   m integers (16-32 bits)   1.64
  LogLog (Durand & Flajolet, 2003)                   m small bytes (5 bits)    0.592
  Hyper-LogLog (Flajolet et al., 2007)               m small bytes             0.925
  MinCount (Giroire, 2009)                           m floats (32-64 bits)     1.00
  Maximal-term (continuous)                          m floats                  1.00
  Maximal-term (geometric, q = 1/2)                  m integers                0.930
  Maximal-term (geometric, q = 10/11)                m integers                0.999
  Random projections (Proposition 6)                 m floats                  1.00
  Random projections (Cormode et al., 2003)          m floats                  0.481

The seminal paper of Alon et al. (1999) is the first attempt at obtaining tight lower bounds on the space complexity of approximating the cardinality of a simple data stream. Bar-Yossef et al. (2002) present the best previous $(\epsilon, \delta)$-approximation of the cardinality of a simple data stream in terms of space requirements, namely $O(1/\epsilon^2 \cdot \log(\log c) \cdot \log(1/\delta))$; this work is the first to make no assumptions on the existence of a truly random hash function. An estimator $\hat{c}$ is said to be an $(\epsilon, \delta)$-approximation of $c$, for some $\epsilon, \delta > 0$ arbitrarily small, if $\mathrm{P}(|\hat{c} - c| > \epsilon c) \leq \delta$. Indyk & Woodruff (2003) show that the dependence of the space requirement on $\epsilon$ through the factor $1/\epsilon^2$ cannot be reduced to $1/\epsilon$. Kane et al. (2010) offer the best algorithm for an $(\epsilon, \delta)$-approximation of the cardinality of a simple data stream, with space requirement of $O(1/\epsilon^2 + \log c)$, and no assumptions on the existence of a truly random hash function. For a general data stream, the $(\epsilon, \delta)$-approximation of Cormode et al. (2003) requires a data sketch of length $O(1/\epsilon^2 \cdot \log(1/\delta))$; this result is obtained from Chernoff bounds on tail probabilities of the estimator $\hat{c}$. We employ the same approach in Section 3.3 to derive storage requirements for our algorithms.

3 Order statistics

3.1 Continuous random variables

A data stream in the cash register case provides data elements of the form $(i_t, d_t)$, where $i_t \in \mathcal{I}_T$, and $d_t > 0$, for $t = 1, \ldots, T$. We start with a simple adaptation of the ideas of Flajolet & Martin (1985) and Giroire (2009), which we call the maximal-term data sketch. At time $t$, the data type $i_t$ is used as the seed of a random number generator to produce the first pseudo-random number $h(i_t)$, uniformly distributed on (0,1). Write $h(i_t) \sim U(0,1)$. The algorithm records $h^+$, the maximum value of $h(i_t)$, as the stream is processed, restarting the random number generator with the seed $i_t$ at each stage.
Note that if a particular data type is seen more than once, the value of $h^+$ is unchanged, but whenever a new type $i_t$ is observed, there is a chance that $h^+$ will increase. For the idealised $U(0,1)$ hash function, the variable $Y = h^+$ has density
\[ f(y; c) = c y^{c-1}, \quad y \in (0, 1), \]
since it is the maximum of $c$ independent $U(0,1)$ variables, where $c$ is the unknown cardinality. The quantity $c$ is then an unknown parameter to be estimated by standard statistical methods. To increase the efficiency in estimating $c$, we sample $m$ successive values $h_1(i_t), \ldots, h_m(i_t)$ from the random number generator at each stage, and store $Y_j = h_j^+$, $j = 1, \ldots, m$, thus obtaining a sample of size $m$ from $f(y; c)$.

Proposition 1. The MLE of $c$ based on $(Y_1, \ldots, Y_m)$ is $\hat{c} = -m / \sum_{j=1}^m \log Y_j$ with asymptotic distribution Normal$(c, c^2/m)$ as $m \to \infty$. The expression $-c \sum_{j=1}^m \log Y_j \sim \mathrm{Gamma}(m, 1)$ can be used as a pivot in setting exact confidence intervals for $c$.

Proof. Using standard sampling theory.

Asymptotically, $\hat{c}$ is unbiased and approximately normally distributed with standard error $\hat{c}/\sqrt{m}$, so that by storing $m = 10{,}000$ values, for example, we can obtain an estimate of $c$ to within 2% with 95% confidence, regardless of the size of $c$.

Remark 1. When estimating an integer-valued parameter, such as the cardinality, the derivatives involved in the standard derivation of the large sample distribution of the MLE cannot be calculated. Nevertheless, equivalent results can be derived in terms of finite differences, and since the standard deviation of the estimators we consider is of the order of $c$, with $c$ large, the use of derivatives can be justified. Hammersley (1950) provides an early discussion of these issues.
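Proposition 1 translates directly into a one-pass algorithm. The following is an illustrative Python sketch (not the original implementation): a blake2b digest of each item seeds a generator in the spirit of the method of seeding, the running maxima $Y_1, \ldots, Y_m$ are updated in one pass, and the Gamma$(m,1)$ pivot gives an approximate 95% interval via a normal approximation to the Gamma quantiles.

```python
import hashlib
import math
import numpy as np

def uniforms(item: str, m: int) -> np.ndarray:
    """Method of seeding: the item seeds a generator whose first m outputs
    play the role of h_1(i), ..., h_m(i) ~ U(0,1)."""
    seed = int.from_bytes(hashlib.blake2b(item.encode(), digest_size=8).digest(), "big")
    return np.random.default_rng(seed).random(m)

def maximal_term_sketch(stream, m: int) -> np.ndarray:
    """One pass over the stream, keeping the running maxima Y_1, ..., Y_m."""
    y = np.zeros(m)
    for item in stream:          # duplicates re-seed identically, so y is unchanged
        y = np.maximum(y, uniforms(item, m))
    return y

def estimate(y: np.ndarray):
    """MLE of Proposition 1 with an approximate 95% interval from the
    Gamma(m, 1) pivot (normal approximation to the Gamma quantiles)."""
    m = len(y)
    s = -np.log(y).sum()         # -c * sum(log Y_j) ~ Gamma(m, 1)
    half = 1.96 * math.sqrt(m)
    return m / s, ((m - half) / s, (m + half) / s)

# A stream of 2000 distinct items, each seen three times.
stream = [f"item{k}" for k in range(2000)] * 3
y = maximal_term_sketch(stream, m=1024)
c_hat, (lo, hi) = estimate(y)
```

With $m = 1024$ the standard error is about 3% of $c$, and repeated arrivals leave the sketch, and hence the estimate, unchanged.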
Note that the maximal-term sketch does not allow deletions in the stream, i.e., $d_t < 0$, since it does not take into account the value of $d_t$, and thus cannot modify the quantities $Y_j$ if $a_{i_t}$ becomes zero. In contrast, the method of data sketching via random projections in Section 4 allows deletions and permits the estimation of $\sum_{i \in \mathcal{I}_T} I(a_i(T) > 0)$, provided that $a_i(T) \geq 0$ whenever the estimation procedure is applied.

3.1.1 Using the kth order statistic

A possible improvement might be to store the $k$th order statistic of the hash values rather than $h^+$.

Proposition 2. For $k < c$, let $Y_j$ denote the $k$th order statistic of the hash values from the $j$th hash function $h_j \sim U(0,1)$, $j = 1, \ldots, m$. The MLE $\hat{c}$ of $c$ based on $Y_1 = y_1, \ldots, Y_m = y_m$ is the unique root of
\[ \log \prod_{j=1}^m y_j + \sum_{i=1}^k \frac{m}{\hat{c} - i + 1} = 0. \quad (1) \]
When $c$ is large, the root is given approximately by $\hat{c} = k \big( 1 - \prod_{j=1}^m y_j^{1/m} \big)^{-1}$, with standard error approximately $\hat{c}/\sqrt{km}$. Furthermore, the estimator in (1) is recursively computable.

Proof. The first part of the proof is straightforward. For the second, recall that a sequence of statistics $T_m(x_1, \ldots, x_m)$ is said to be recursively computable if
\[ T_m(x_1, \ldots, x_m) = T_m(z_1, \ldots, z_m) \;\Rightarrow\; T_{m+1}(x_1, \ldots, x_m, w) = T_{m+1}(z_1, \ldots, z_m, w), \quad \forall m \in \mathbb{N}; \]
see for example Lauritzen (1988), who proves, for independent random variables $X_1, \ldots, X_m$, that if $T_m(X_1, \ldots, X_m)$ is minimal sufficient, then the sequence $T_m$, $m \geq 1$, is recursively computable. This property of sufficient statistics was first remarked by Fisher (1925). It follows from a theorem of Lehmann & Scheffé (1950) that the statistic $T_m(Y_1, \ldots, Y_m) = \prod_{j=1}^m Y_j$ is minimal sufficient for $c$, so $\hat{c}$ is also minimal sufficient and hence recursively computable.
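The large-$c$ approximation in Proposition 2 and the recursive property are easy to demonstrate numerically. In the illustrative sketch below we bypass the streaming step and draw each $Y_j$, the $k$th largest of $c$ uniform hash values, directly from its Beta$(c-k+1, k)$ distribution; two independent sketches are then combined through their recovered product terms, which reproduces the pooled estimate exactly.

```python
import numpy as np

def c_hat_approx(y: np.ndarray, k: int) -> float:
    """Large-c approximation to the root of (1): k / (1 - (prod y_j)^(1/m))."""
    g = np.exp(np.mean(np.log(y)))       # geometric mean of the y_j
    return k / (1.0 - g)

def merge(c1: float, m1: int, c2: float, m2: int, k: int) -> float:
    """Combine two estimates by recovering and pooling their product terms."""
    g = ((1 - k / c1) ** m1 * (1 - k / c2) ** m2) ** (1.0 / (m1 + m2))
    return k / (1.0 - g)

rng = np.random.default_rng(1)
c, k, m1, m2 = 50_000, 3, 1000, 1500
# The k-th largest of c independent U(0,1) variables is Beta(c - k + 1, k).
y1 = rng.beta(c - k + 1, k, size=m1)
y2 = rng.beta(c - k + 1, k, size=m2)
c1, c2 = c_hat_approx(y1, k), c_hat_approx(y2, k)
pooled = c_hat_approx(np.concatenate([y1, y2]), k)
combined = merge(c1, m1, c2, m2, k)      # recovers the pooled estimate
```

The agreement is exact up to rounding, because $1 - k/\hat{c}_j$ is precisely $\prod y^{1/m_j}$ for each sub-sample.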
The property of recursive computability is particularly important when dealing with massive data sets due to constraints on available storage. For example, suppose two independent estimates, $\hat{c}_1$ and $\hat{c}_2$, of the cardinality $c$ are available, based on samples of size $m_1$ and $m_2$. By substituting the estimates in (1), the associated product terms can be recovered; the combined estimate can then be obtained by combining the products and using (1) once again with $m = m_1 + m_2$. When $c$ is large, the combined estimate is approximated by
\[ k \Big( 1 - \big[ (1 - k/\hat{c}_1)^{m_1} (1 - k/\hat{c}_2)^{m_2} \big]^{1/(m_1 + m_2)} \Big)^{-1}. \]
Furthermore, we remark that keeping a record of the $k$th order statistic for each of the $m$ subsets as the stream is processed requires storing $km$ values. However, since the standard error of $\hat{c}$ is approximately $\hat{c}/\sqrt{km}$ for large $m$, there is no gain in accuracy relative to the storage requirement.

We also note that there is no advantage in using a hash function $h$ that maps to a continuous distribution $F$ other than $U(0,1)$. The MLE of the maximal-term data sketch merely becomes
\[ \hat{c} = -m \Big/ \sum_{j=1}^m \log F(M_j), \quad \text{where } M_j = \max_{i \in \mathcal{I}_T} h_j(i), \quad (2) \]
which has the same distribution as $\hat{c}$ in Proposition 1.

3.2 Discrete random variables

Hashing to integer values rather than floating point numbers requires less storage, a priority when handling massive data streams. We show that the loss of statistical efficiency is negligible when integer-valued hash functions are chosen appropriately. We first consider hashing to Bernoulli random variables, not previously considered in the literature, and then to geometric random variables.
3.2.1 Data sketching with Bernoulli random variables

To implement hashing to a Bernoulli variable, we start with an array of 0s of length $m$ and then change the $j$th element to 1 if $h_j(i_t) < p$, where, as before, $h_j(i_t)$ is the $j$th simulated $U(0,1)$ variable from the seed $i_t$, $j = 1, \ldots, m$. The value of $p$ is chosen to maximise Fisher's information.

Proposition 3. Fisher's information for a Bernoulli hash function with probability $p$ is maximised with $p_{\max} = 1 - \exp(-\lambda_0/c) \approx \lambda_0/c$, for large $c$, where $\lambda_0 = 2 + W(-2e^{-2}) \approx 1.594$, and $W$ is Lambert's function. The asymptotic relative efficiency of the MLE of $c$ with Bernoulli hashing ($p = \lambda/c$), relative to the estimator obtained with a continuous hash function, is $\lambda^2/(e^\lambda - 1)$ for large $c$.

This result enables lower bounds on the asymptotic relative efficiency to be specified. For example, if $c$ is known in advance to lie in $(0.3 c_0, 4.3 c_0)$ for some fixed $c_0$, then with $p = 1/c_0$ the ARE is at least 25%. Consequently, $4m$ bits of storage suffice to provide the same accuracy as storing $m$ floating point numbers when hashing to continuous random variables.

Proof. After processing the data stream we have observations from $m$ Bernoulli variables, each with probability $P = 1 - (1 - p)^c$. Fisher's information for $P$ is $m/(P(1-P))$ and hence the information for $c$ is
\[ I(c) = \frac{m}{P(1-P)} \left( \frac{dP}{dc} \right)^2 = \frac{m q^c (\log q)^2}{1 - q^c}, \quad \text{where } q = 1 - p. \]
Substituting $q = \exp(-\lambda/c)$ gives $I(c) = m c^{-2} \lambda^2 / (e^\lambda - 1)$. Since Fisher's information using continuous variables is $m/c^2$, this gives the asymptotic relative efficiency as claimed. The Fisher information from Bernoulli hashing attains its maximum when $\lambda$ is the positive root of $\lambda = 2(1 - \exp(-\lambda))$, which can be expressed in terms of Lambert's $W$ function and is given approximately by $\lambda_0 = 1.594$.
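The constants in Proposition 3 can be checked numerically. The snippet below (illustrative) locates $\lambda_0$ as the positive root of $\lambda = 2(1 - e^{-\lambda})$ by bisection, and evaluates the efficiency $\lambda^2/(e^\lambda - 1)$ over the interval used in the 25% example above.

```python
import math

def are(lam: float) -> float:
    """Asymptotic relative efficiency of Bernoulli hashing with p = lambda/c."""
    return lam ** 2 / math.expm1(lam)    # expm1(x) = e^x - 1

# Positive root of lambda = 2 * (1 - exp(-lambda)), found by bisection.
f = lambda lam: lam - 2.0 * (1.0 - math.exp(-lam))
lo, hi = 1.0, 2.0                        # f(1) < 0 < f(2)
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
lam0 = 0.5 * (lo + hi)                   # approx 1.594, i.e. 2 + W(-2 exp(-2))

# With c in (0.3 c0, 4.3 c0) and p = 1/c0, lambda = c/c0 ranges over (0.3, 4.3).
worst = min(are(l / 10.0) for l in range(3, 44))
```

The worst case over the interval is about 0.254, attained at the upper endpoint, confirming the 25% lower bound.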
3.2.2 Geometric random variables

Suppose that the hash function maps to a geometric random variable with cumulative distribution function $G_p(x) = 1 - q^x$, with $p + q = 1$, $x = 1, 2, \ldots$ We note that $p = 1/2$ is the case analysed by Durand & Flajolet (2003) and Flajolet et al. (2007). As before, for the maximal-term data sketch, we store $Y_j = h_j^+ = \max\{h_j(i_t) : i_t \in \mathcal{I}_T\}$, $j = 1, \ldots, m$, where the $h_j(i_t)$ are independently simulated from $G_p$ by the method of seeding, and estimate $c$ based on the random sample $Y_1 = y_1, \ldots, Y_m = y_m$. Let $G_p^c$ be the distribution function of the maximum of $c$ independent $G_p$ variables.

Proposition 4. The MLE of $c$ based on a sample $Y_1 = y_1, \ldots, Y_m = y_m$ drawn from $G_p^c$ satisfies
\[ \sum_{i=1}^m \frac{(1 - q^{y_i})^{\hat{c}} \log(1 - q^{y_i}) - (1 - q^{y_i - 1})^{\hat{c}} \log(1 - q^{y_i - 1})}{(1 - q^{y_i})^{\hat{c}} - (1 - q^{y_i - 1})^{\hat{c}}} = 0. \quad (3) \]
In the limit as $m \to \infty$, the distribution of $\hat{c}/c$ is asymptotically normal with mean 1 and variance $1/(m \psi_c)$, where $\psi_c$ can be approximated by
\[ \psi_\infty = \sum_{k=-\infty}^{\infty} \frac{q^{2k} (q^{-1} - 1)^2}{\exp(q^{k-1}) - \exp(q^k)}, \]
for large $c$.

Proof. The log-likelihood function is
\[ L(y_1, \ldots, y_m; c) = \sum_{j=1}^m \log \big[ (1 - q^{y_j})^c - (1 - q^{y_j - 1})^c \big]. \]
Formally differentiating with respect to $c$, we have the score function as given in (3). Squaring and taking expectations in the case $m = 1$, we have Fisher's information per observation:
\[ I(c) = \sum_{y=1}^{\infty} \frac{\big[ (1 - q^y)^c \log(1 - q^y) - (1 - q^{y-1})^c \log(1 - q^{y-1}) \big]^2}{(1 - q^y)^c - (1 - q^{y-1})^c}. \]
As $m \to \infty$, from the usual large sample theory of maximum likelihood estimation, $\hat{c}/c$ is asymptotically normally distributed with mean 1 and variance $1/(m \psi_c)$, where $\psi_c = c^2 I(c)$. Now let $c \to \infty$ through the sequence $c = q^{-r}$, where $r$ is a positive integer. Writing $y = r + k$, we have
\[ \lim_{c \to \infty} c^2 I(c) = \sum_{k=-\infty}^{\infty} \frac{q^{2k} (q^{-1} - 1)^2}{\exp(q^{k-1}) - \exp(q^k)}, \]
as claimed.

In practice, to solve for $\hat{c}$ in (3), one iteration of the Newton-Raphson algorithm started from a consistent estimator of $c$ produces an asymptotically efficient estimator (Rao, 1973). A consistent estimator is $\hat{c} = \log(r/m) / \log(1 - q^n)$, where $r = |\{y_j : y_j \leq n\}|$ and $n = \lfloor \log_q(1/2) \rfloor$, if $r \neq 0$; else, set $\hat{c} = T$, the length of the stream observed.

The statistical efficiency of the maximal-term MLE in the geometric case can be made arbitrarily close to that in the continuous case. For large $c$, the Fisher information is an increasing function of $q$ as $q \to 1$. In particular, for $q = 10/11$, the ARE of the estimator of $c$ based on a sample of maxima from $G_p$, as compared to the estimator based on a random sample of maxima from any continuous distribution, is 0.9985. For the special case considered by Durand & Flajolet (2003) and Flajolet et al. (2007) with $p = 1/2$, the asymptotic relative efficiency is 0.9304.

We note that the estimator $\hat{c}$, based on a sample of maxima from $G_p$, does not have the property of recursive computability, unlike the estimator in the continuous case. Nevertheless, when $q$ approaches 1, the geometric distribution is well approximated by the exponential distribution with parameter $\lambda = -\log q$, so the log-likelihood is approximately
\[ L(y_1, \ldots, y_m; c) = m \log(c \lambda) + (c - 1) \sum_{j=1}^m \log\big( 1 - e^{-\lambda y_j} \big) - \lambda \sum_{j=1}^m y_j. \]
For this distribution, the statistic $S_m = \prod_{j=1}^m (1 - e^{-\lambda Y_j}) = \prod_{j=1}^m (1 - q^{Y_j})$ is sufficient for the parameter $c$, and the MLE is $\hat{c} = -m/\log S_m$, so that, to this degree of approximation, recursive estimation is possible.
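The estimation procedure for the geometric case can be sketched as follows (illustrative code; the $m$ maxima are drawn directly from $G_p^c$ by inversion rather than by hashing a stream). The recursively computable value $\hat{c} = -m/\log S_m$ serves as the starting point for a single Newton-Raphson step on the score in (3), computed here with a numerical derivative.

```python
import numpy as np

def sample_maxima(c, q, m, rng):
    """Draw the m maxima directly from G_p^c by inversion; the CDF of the
    maximum of c independent G_p variables is (1 - q^y)^c."""
    u = rng.random(m)
    return np.ceil(np.log1p(-u ** (1.0 / c)) / np.log(q))

def c_hat_exp(y, q):
    """Exponential-approximation MLE: -m / log S_m with S_m = prod(1 - q^Y_j)."""
    return -len(y) / np.sum(np.log1p(-q ** y))

def score(ch, y, q):
    """Score function of equation (3) at c = ch."""
    a, b = 1.0 - q ** y, 1.0 - q ** (y - 1)
    A, B = a ** ch, b ** ch
    return np.sum((A * np.log(a) - B * np.log(b)) / (A - B))

rng = np.random.default_rng(7)
c, q, m = 5000, 10 / 11, 1024
y = sample_maxima(c, q, m, rng)

c0 = c_hat_exp(y, q)                     # recursively computable starting value
h = 0.01 * c0                            # one Newton-Raphson step on (3)
c1 = c0 - score(c0, y, q) / ((score(c0 + h, y, q) - score(c0 - h, y, q)) / (2 * h))
```

The starting value carries a small discretisation bias that the Newton step on the exact score largely removes; for $q = 10/11$ the loss of efficiency relative to the continuous sketch is negligible.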
3.3 Storage requirements

In this section we determine exponentially decreasing upper bounds on the tail probabilities of our estimators, and show that in the geometric case, the storage requirement of an algorithm implementing the estimation procedure attains the tight lower bound of Indyk & Woodruff (2003).

Proposition 5. In the continuous case, the tail error bounds for the estimator $\hat{c}$ given in (2) are
\[ \mathrm{P}(\hat{c} \geq (1 + \epsilon) c) \leq \exp(-m \epsilon^2 / C_1), \quad \text{and} \quad \mathrm{P}(\hat{c} \leq (1 - \epsilon) c) \leq \exp(-m \epsilon^2 / C_2), \]
where
\[ C_1 = \frac{\epsilon^2 (1 + \epsilon)}{-\epsilon + (1 + \epsilon) \log(1 + \epsilon)}, \qquad C_2 = \frac{\epsilon^2 (1 - \epsilon)}{\epsilon + (1 - \epsilon) \log(1 - \epsilon)}. \]
In the limit as $\epsilon \to 0$, the constants $C_1$ and $C_2$ tend to 2, so for small $\epsilon$, the tail error bounds are exponentially decreasing in $m \epsilon^2$.

Proof. In the continuous case, the pivotal quantity $mc/\hat{c}$ has a Gamma distribution with moment generating function $(1 - t)^{-m}$, $t < 1$. The bounds on the tail probabilities are obtained from the moment generating function using the method of Chernoff (1952). In the discrete geometric case, these results hold to arbitrary accuracy by approximating the geometric distribution by an exponential distribution with parameter $-\log q$ and $q$ close to 1.

From Proposition 5, $\hat{c}$ is an $(\epsilon, \delta)$-approximation of $c$ provided that $m = O(\epsilon^{-2})$. The expected value of the maximum order statistic based on a sample of size $c$ from $G_p$ is $O(\log c)$ for fixed $p$ (Kirschenhofer & Prodinger, 1993). It follows that the space requirement of an algorithm implementing the estimation procedure in the geometric case is of order $O(\epsilon^{-2} \log(\log c))$, attaining the tight lower bound of Indyk & Woodruff (2003).

4 Random projections

Data sketching via random projections exploits properties of the $\alpha$-stable distribution, introduced by Lévy (1924). The stability property lies at the heart of the random projection method.
For simplicity, we restrict attention to positive, strictly stable variables of index $\alpha$, for $\alpha \in (0, 1)$, having Laplace transform $e^{-\lambda^\alpha}$, $\lambda \geq 0$ (Feller, 1971; Zolotarev, 1986). Let $F_\alpha$ denote the distribution function. The stability property of $F_\alpha$ is as follows: if $X_1, X_2 \sim F_\alpha$ independently, and $a_1$ and $a_2$ are arbitrary positive constants, then
\[ a_1 X_1 + a_2 X_2 \overset{D}{=} \big( a_1^\alpha + a_2^\alpha \big)^{1/\alpha} X, \quad (4) \]
where $X \sim F_\alpha$ (Feller, 1971).

The random projection method for cardinality estimation proceeds as follows (Cormode et al., 2003; Indyk, 2006). For $j = 1, \ldots, m$ and $\alpha \in (0, 1)$ fixed, let $h_j$ be independent hash functions mapping from $\mathcal{I}$ to samples from $F_\alpha$, via the usual method of seeding; in practice, this will involve constructing simulated $F_\alpha$ variables from pairs of $U(0,1)$ variables. Then, update and store the projections $V_j(T) = \sum_{t=1}^T d_t h_j(i_t)$, $j = 1, \ldots, m$, to give the data sketch $V_1, \ldots, V_m$, where we write $V_j = V_j(T)$ for brevity. By the stability property in (4), we have that
\[ V_j = \sum_{t=1}^T d_t h_j(i_t) = \sum_{i \in \mathcal{I}_T} a_i h_j(i) \overset{D}{=} \ell_\alpha(a) X_j, \quad (5) \]
where $X_j \sim F_\alpha$ independently for $j = 1, \ldots, m$, and $\ell_\alpha(a) = \big( \sum_{i \in \mathcal{I}_T} a_i^\alpha \big)^{1/\alpha}$. In other words, $V_1, \ldots, V_m$ is a sample from a scale family with unknown scale parameter $\ell_\alpha(a)$. It should be noted that when $d_t = 1$ for $t = 1, \ldots, T$, then $\ell_\alpha(a) = \big( \sum_{i \in \mathcal{I}_T} n_i^\alpha \big)^{1/\alpha}$, where $n_i$ is the number of times that item $i$ is observed in the data stream by time $T$. In principle, calculation of the MLE of the scale parameter $\ell_\alpha(a)$ in (5) is straightforward. Raising this MLE to the power of $\alpha$ gives the MLE of $\sum_{i \in \mathcal{I}_T} a_i^\alpha$, and with $\alpha$ sufficiently small, this produces an approximation to $c$. In practice there are severe numerical difficulties in obtaining the MLE when $\alpha$ is small (Nolan, 1997, 2001).
Instead, Cormode et al. (2003) estimate $\ell_\alpha(a)$ by $\tilde{V}/\tilde{\mu}$, where $\tilde{V}$ is the sample median of $\{V_1, \ldots, V_m\}$, and $\tilde{\mu}$ is the numerically determined median of $F_\alpha$. They show that an $(\epsilon, \delta)$-approximation to $c$ can be obtained by choosing $m$ of order $O(1/\epsilon^2 \cdot \log(1/\delta))$ and $0 < \alpha \leq \epsilon / \log(B)$, where $B$ is an upper bound for the elements of $a$. We adopt a slightly different approach and exploit the limiting distribution of $V_j^\alpha$ for small $\alpha$.

Proposition 6. As $\alpha \to 0$, the random variable
\[ c \sum_{j=1}^m V_j^{-\alpha} \overset{D}{\to} \mathrm{Gamma}(m, 1). \quad (6) \]
Consequently, the variable can be used as an approximate pivot in setting confidence intervals for $c$. For $\alpha$ small, the estimator $\hat{c} = m / \sum_{j=1}^m V_j^{-\alpha}$ has asymptotic distribution Normal$(c, c^2/m)$ as $m \to \infty$.

Proof. Zolotarev (1986) shows that $X^\alpha \overset{D}{\to} 1/Z$ where $Z \sim \mathrm{Exp}(1)$, as $\alpha \to 0$. It follows from (5) that
\[ V_j^\alpha = \Big( \sum_{i \in \mathcal{I}_T} a_i h_j(i) \Big)^\alpha \overset{D}{=} X^\alpha \sum_{i \in \mathcal{I}_T} a_i^\alpha \overset{D}{\to} c/Z, \quad j = 1, \ldots, m \ \text{(independently)}, \]
and hence $c \sum_{j=1}^m V_j^{-\alpha} \to \mathrm{Gamma}(m, 1)$. The estimator $\hat{c}$ is obtained by equating the pivot to its mean $m$, and the approximate distribution of $\hat{c}$ then follows from the asymptotic normality of the Gamma distribution.

When comparing the estimator $\hat{c}$ above with $\tilde{c} = (\tilde{V}/\tilde{\mu})^\alpha$ in Cormode et al. (2003), we are effectively comparing the MLE of the parameter of an exponential distribution with an estimator obtained by equating the sample and population medians. The ARE of $\tilde{c}$ to $\hat{c}$ is then approximately 48%, since by using the standard asymptotic distribution of sample medians, we find that $\tilde{c} \sim$ Normal$(c, c^2 (\log 2)^{-2}/m)$ for large $m$, i.e., $\hat{c}$ is twice as efficient asymptotically as $\tilde{c}$.

At this stage we have shown that the estimators of $c$ using the maximal-term or random projection sketches can have comparable efficiency.
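Proposition 6 can be checked by simulation. In the illustrative sketch below, positive strictly stable variates are generated by Kanter's representation (one standard way of constructing $F_\alpha$ variables from a pair of uniform variates, via $\phi \sim U(0, \pi)$ and $W \sim \mathrm{Exp}(1)$), the projections $V_j$ are formed for a stream in which each distinct item is observed once ($d_t = 1$, so $\sum_i a_i^\alpha = c$), and the estimator $\hat{c} = m / \sum_j V_j^{-\alpha}$ is applied.

```python
import numpy as np

def positive_stable(alpha, size, rng):
    """Kanter's representation of a positive strictly stable variable with
    Laplace transform exp(-lambda^alpha): phi ~ U(0, pi), W ~ Exp(1)."""
    phi = np.pi * rng.random(size)
    w = rng.exponential(size=size)
    return (np.sin(alpha * phi) / np.sin(phi) ** (1.0 / alpha)
            * (np.sin((1.0 - alpha) * phi) / w) ** ((1.0 - alpha) / alpha))

alpha, c, m = 0.05, 1000, 1200
rng = np.random.default_rng(11)
# One independent stable draw per (distinct item, hash function): h_j(i).
H = positive_stable(alpha, size=(c, m), rng=rng)
V = H.sum(axis=0)                    # projections V_j with d_t = 1, each item once
c_hat = m / np.sum(V ** (-alpha))    # estimator of Proposition 6
```

For $\alpha = 0.05$ the estimator carries a small negative bias, since $E(X^{-\alpha}) = 1/\Gamma(1+\alpha) \approx 1.027$ rather than 1; the bias vanishes as $\alpha \to 0$.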
This leads us to conjecture that in some sense the methods are essentially equivalent, which we explore in the next section.

5 Comparison of projection and maximal-term sketches

In Section 3.1 we show that the efficiency of the maximal-term data sketch does not depend on the particular continuous distribution that is simulated by the hash function. For the purpose of comparison, we now hash to $F_\alpha$ in both cases. Note that we are not proposing to use this distribution directly for the maximal-term estimator, since it has an extremely heavy tail when $\alpha$ is small: storing the maximum of $c$ such variables, for $c$ large, would require high-precision floating-point numbers.

Consider a data stream in the cash register case, observed up to time $T$. Let $a$ denote the accumulation vector, and $c$ the cardinality. For $j = 1, \dots, m$, let $h_j$ be independent hash functions mapping from $\mathcal{I}$ to copies of $X \sim F_\alpha$, for fixed $\alpha \in (0,1)$. Let $\hat{c}_p = m/\sum_{j=1}^m V_j^{-\alpha}$ be the projection estimator defined in Proposition 6, and let $\hat{c}_m$ denote the maximal-term estimator in (2), where $M_j = \max_{i \in \mathcal{I}_T} h_j^\alpha(i)$ and $F$ is the distribution function of $X^\alpha$.

Theorem 1. For small $\alpha$, the pivotal quantities for the maximal-term and projection sketches are equivalent, i.e.,
\[
c\left(\frac{m}{\hat{c}_p} - \frac{m}{\hat{c}_m}\right) = c \sum_{j=1}^m V_j^{-\alpha} + c \sum_{j=1}^m \log F(M_j) \overset{P}{\to} 0, \quad \text{as } \alpha \to 0,
\]
and in particular $V_j^{-\alpha} + \log F(M_j) \overset{P}{\to} 0$ for each $j = 1, \dots, m$.

Proof. Let $M = \max_{i \in \mathcal{I}_T} X_i^\alpha$ be a typical maximal term in (2), with $X_i \sim F_\alpha$, $i \in \mathcal{I}_T$, and let $\delta > 0$ be arbitrary. Since $P(M < y) \le P(X^\alpha < y)$ and $X^{-\alpha} \overset{D}{\to} \mathrm{Exp}(1)$ as $\alpha \to 0$ (Zolotarev, 1986), there are values $\alpha_0$ and $y_0 > 0$ such that $P(M < y_0) < \delta$ for all $\alpha < \alpha_0$. Now let $G_\alpha(y)$ be the distribution function of $X^\alpha$, i.e. $G_\alpha = F$ in (2).
Since $X^{-\alpha} \overset{D}{\to} \mathrm{Exp}(1)$, we have $G_\alpha(y) \to \exp(-1/y)$ uniformly in $y > 0$, and consequently $\log G_\alpha(y) \to -1/y$ uniformly in $y > y_0$ as $\alpha \to 0$. It follows, by the usual arguments, that
\[
\log G_\alpha(M) + 1/M \overset{P}{\to} 0, \quad \text{as } \alpha \to 0. \tag{7}
\]
Finally, writing $V = \sum_{i \in \mathcal{I}_T} a_i X_i$ for the typical term in (5), we have
\[
M^{1/\alpha} a_{\min} = X_{\max}\, a_{\min} \le V \le X_{\max} \sum_{i \in \mathcal{I}_T} a_i = M^{1/\alpha} \sum_{i \in \mathcal{I}_T} a_i,
\]
where $X_{\max} = \max_{i \in \mathcal{I}_T} X_i$ and $a_{\min} = \min_{i \in \mathcal{I}_T} a_i$. It follows that
\[
a_{\min}^\alpha \le V^\alpha / M \le \Big(\sum_{i \in \mathcal{I}_T} a_i\Big)^{\alpha}, \tag{8}
\]
and as $\alpha \to 0$, $V^\alpha / M \overset{P}{\to} 1$. Since both $M$ and $V^\alpha$ have proper limiting distributions, this implies that $V^{-\alpha} - M^{-1} \overset{P}{\to} 0$ as $\alpha \to 0$, and together with (7) we have $\log F(M) + V^{-\alpha} \overset{P}{\to} 0$ as $\alpha \to 0$. We have established that the terms in the summations are individually equivalent for small $\alpha$, and since the number of terms, $m$, is finite, the result is proved.

Note that the specific values of $d_t > 0$ are unimportant in determining the cardinality. For practical purposes, positive values of $d_t$ can be taken to be 1, and this may have the effect of improving the bounds in (8).

6 Empirical study

Table 2 presents the results of an empirical study comparing various cardinality estimation algorithms on simulated data sets of exact cardinality ranging from $10^4$ to $5 \times 10^7$. We compare the performance of our estimators from Propositions 1, 4, and 6 against that of the HyperLogLog (Flajolet et al., 2007), MinCount (Giroire, 2009), and median (Cormode et al., 2003) estimators. We also compute the LogLog estimator (Durand & Flajolet, 2003), and find that its performance is not comparable; the percent error is consistently above 20%, and as high as 50% at the low end of cardinalities (results not shown). Furthermore, we compute the maximal-term estimator with hashing to the positive, $\alpha$-stable distribution ($\alpha = 0.05$); this estimator is compared to the random projection estimator in Theorem 1. Again, results are not shown: the positive, $\alpha$-stable distribution, for $\alpha$ close to zero, is very heavy tailed, and numerical difficulties are encountered in estimating the cumulative distribution function in the tail. Computations are performed on a 64GB supercomputer; the code is written in C and uses the GSL library, together with R (http://www.r-project.org/) packages. The data are simulated in R.

Overall, the performance of these algorithms is impressive, particularly on large scales, where a data sketch of size 16384 suffices to estimate cardinality values up to $5 \times 10^7$ with extremely high accuracy. From the results on asymptotic efficiency, we expect that, with 95% confidence, our estimates are within 8.66, 6.12, 4.33, 2.17, and 1.53% of the true value for $m \in \{2^9, 2^{10}, 2^{11}, 2^{13}, 2^{14}\}$, respectively, regardless of the size of $c$. For small scales, the HyperLogLog estimator is clearly outperformed by the other estimators. For the approach based on hashing and storing order statistics, the estimators of Propositions 1 and 4 have comparable performance to the MinCount estimator. Similarly, for the approach based on random projections and the stable distribution, the estimator of Proposition 6 has comparable performance to the median estimator of Cormode et al. (2003). Both in terms of performance and storage requirements, we prefer the maximal-term estimator of Proposition 4 with hashing to the geometric distribution.

Table 2: Comparison of cardinality estimation algorithms on simulated data sets: maximal-term estimators with hashing to the exponential distribution of mean 1 (Prop. 1) and to the geometric distribution with $\rho = 1.1$ (Prop. 4); HyperLogLog (Flajolet et al., 2007); MinCount (Giroire, 2009); and random projection estimators (Prop. 6, and the median estimator of Cormode et al. (2003), Sec. 4) with hashing to the positive, $\alpha$-stable distribution ($\alpha = 0.05$). The percent error appears in brackets.

c        m      ĉ (Prop. 1)        ĉ (Prop. 4)        HyperLogLog         MinCount            ĉ (Prop. 6)         c̃ (Sec. 4)
10^4     2^9    9543 (4.56)        9553 (4.47)        8040 (19.6)         10261 (2.61)        10837 (8.36)        10261 (2.61)
5×10^4   2^9    50190 (0.378)      50144 (0.288)      43754 (12.5)        48594 (2.81)        51436 (2.87)        51349 (2.70)
10^5     2^10   102761 (2.76)      102916 (2.92)      102122 (2.12)       98113 (1.89)        98527 (1.47)        110363 (10.36)
5×10^5   2^11   512056 (2.41)      511988 (2.40)      431965 (13.6)       499698 (0.0602)     493066 (1.39)       502999 (0.600)
10^6     2^13   994803 (0.520)     994702 (0.530)     992408 (0.759)      1004610 (0.461)     971817 (2.82)       1002570 (0.257)
5×10^6   2^14   5001560 (0.0311)   4992499 (0.150)    4727310 (5.45)      5001000 (0.0200)    4924112 (1.52)      5019670 (0.393)
10^7     2^14   9992780 (0.0722)   9965677 (0.343)    9313600 (6.86)      10102100 (1.02)     9826337 (1.74)      9725290 (2.75)
5×10^7   2^14   50666000 (1.33)    50221623 (0.443)   47764000 (4.47)     50118300 (0.237)    49258239 (1.48)     46677800 (6.64)

7 Conclusion

In this paper we discuss the problem of cardinality estimation over streaming data, under the assumption that the size of the data precludes the possibility of maintaining a comprehensive list of all distinct data elements observed. Probabilistic counting algorithms process data elements on the fly in three steps: (i) hash each data element to a copy of a pseudo-random variable, (ii) update a low-dimensional data sketch of the stream, and (iii) discard the data element. For this purpose, we present two approaches: indirect record keeping using pseudo-random variates, storing either selected order statistics or random projections. Both approaches exploit the idea of hashing via the method of seeding. The data sketch is a random sample of variables whose distribution is parameterised by the cardinality as unknown parameter, and we derive estimators of the cardinality in a conventional statistical framework. We believe that hashing and data sketching are novel ideas in the statistics literature, and offer great potential for further development in problems of dimension reduction and online data analysis.

We analyse the statistical properties of our estimators in terms of Fisher information, asymptotic relative efficiency, and bounds on the estimation error, and their computational properties in terms of recursive computability and storage requirements. Compared to existing algorithms that employ the same approaches to cardinality estimation, our estimators are superior in terms of ARE, with one exception: the probabilistic counting algorithm of Flajolet & Martin (1985), which stores a bitmap table and is therefore far more computationally expensive. Finally, we demonstrate an unexpected link between the method of maximal-term sketching based on hashing to the $F_\alpha$ distribution and the method of random projections, showing that the two methods are essentially the same when $\alpha$ is small. However, since there is no gain in efficiency for the maximal-term sketch in using the $F_\alpha$ distribution rather than the simpler $U(0,1)$ distribution, as shown in Section 3.1, the latter is to be preferred. Moreover, since we show in Section 3.2 that discrete hash functions are capable of comparable efficiency but with reduced storage requirements, discrete maximal-term methods must be the method of choice.
In fact, algorithms implementing our estimation procedure with discrete maximal-term sketching and geometric hashing attain the tight lower bound on storage requirements for cardinality estimation. An empirical study estimating cardinalities up to $5 \times 10^7$ supports our theoretical results.

Acknowledgement

The authors would like to thank Professor Steffen Lauritzen for insightful discussions, and the referees for suggestions that have significantly improved the paper. Ioana Cosma would like to thank the Department of Statistics at the University of Oxford for funding via the Teaching Assistant Bursary scheme.

References

Aggarwal, C. C. (2007). Data streams: Models and algorithms. Springer-Verlag, New York.

Akella, A., Bharambe, A., Reiter, M. & Seshan, S. (2003). Detecting DDoS attacks on ISP networks. In ACM SIGMOD/PODS Workshop on Management and Processing of Data Streams (MPDS). San Diego, California.

Alon, N., Matias, Y. & Szegedy, M. (1999). The space complexity of approximating the frequency moments. J. Comput. System Sci. 58, 137–147.

Bar-Yossef, Z., Jayram, T. S., Kumar, R., Sivakumar, D. & Trevisan, L. (2002). Counting distinct elements in a data stream. Lecture Notes in Comput. Sci. 2483, 1–10.

Carter, J. L. & Wegman, M. N. (1979). Universal classes of hash functions. J. Comput. System Sci. 18, 143–154.

Chassaing, P. & Gerin, L. (2006). Efficient estimation of the cardinality of large data sets. In Proceedings of the 4th Colloquium on Mathematics and Computer Science, vol. AG of Discrete Mathematics and Theoretical Computer Science Proceedings (DMTCS), pp. 419–422.

Chen, A. & Cao, J. (2009). Distinct counting with a self-learning bitmap. In Proceedings of the IEEE International Conference on Data Engineering, pp. 1171–1174.

Chernoff, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics 23, 493–507.

Cormode, G., Datar, M., Indyk, P. & Muthukrishnan, S. (2003). Comparing data streams using Hamming norms (how to zero in). IEEE Transactions on Knowledge and Data Engineering 15, 529–540.

Cormode, G. & Muthukrishnan, S. (2005a). An improved data stream summary: the Count-Min sketch and its applications. J. Algorithms 55, 58–75.

Cormode, G. & Muthukrishnan, S. (2005b). What's new: Finding significant differences in network data streams. IEEE/ACM Transactions on Networking (TON) 13, 1219–1232.

Durand, M. & Flajolet, P. (2003). Loglog counting of large cardinalities. Lecture Notes in Comput. Sci. 2832, 605–617.

Feller, W. (1971). An introduction to probability theory and its applications, vol. 2. John Wiley & Sons, Inc., New York, 2nd edn.

Fisher, R. A. (1925). Theory of statistical estimation. Math. Proc. Cambridge Philos. Soc. 22, 700–725.

Flajolet, P., Fusy, É., Gandouet, O. & Meunier, F. (2007). HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In Proceedings of the Conference on Analysis of Algorithms, vol. AH of Discrete Mathematics and Theoretical Computer Science Proceedings (DMTCS), pp. 127–146.

Flajolet, P. & Martin, G. N. (1985). Probabilistic counting algorithms for database applications. J. Comput. System Sci. 31, 182–209.

Giroire, F. (2009). Order statistics and estimating cardinalities of massive data sets. Discrete Appl. Math. 157, 406–427.

Hammersley, J. M. (1950). On estimating restricted parameters. J. Roy. Statist. Soc. Ser. B 12, 192–240.

Harvey, N. J. A., Nelson, J. & Onak, K. (2008). Sketching and streaming entropy via approximation theory. In 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 489–498.

Indyk, P. (2006). Stable distributions, pseudorandom generators, embeddings, and data stream computation. J. ACM 53, 307–323.

Indyk, P. & Woodruff, D. P. (2003). Tight lower bounds for the distinct elements problem. In Annual Symposium on Foundations of Computer Science (FOCS), vol. 44. Cambridge, MA, USA, pp. 283–289.

Kane, D. M., Nelson, J. & Woodruff, D. P. (2010). An optimal algorithm for the distinct elements problem. In Proceedings of the 29th Symposium on Principles of Database Systems (PODS). Indiana, USA, pp. 41–52.

Kirschenhofer, P. & Prodinger, H. (1993). A result in order statistics related to probabilistic counting. Computing 51, 15–27.

Knuth, D. E. (1998). The art of computer programming: Sorting and searching, vol. 3. Addison-Wesley, Massachusetts, 2nd edn.

Lauritzen, S. L. (1988). Extremal families and systems of sufficient statistics. In Lecture Notes in Statistics, vol. 49. Springer, New York.

Lehmann, E. L. & Scheffé, H. (1950). Completeness, similar regions and unbiased estimation. Part I. Sankhyā 10, 305–340.

Lévy, P. (1924). Théorie des erreurs. La loi de Gauss et les lois exceptionnelles. Bull. Soc. Math. France 52, 49–85.

Muthukrishnan, S. (2005). Data streams: Algorithms and applications. Now Publishers Inc, Cambridge, Massachusetts, 1st edn.

Nisan, N. (1992). Pseudorandom generators for space-bounded computation. Combinatorica 12, 449–461.

Nolan, J. P. (1997). Numerical calculation of stable densities and distribution functions. Stoch. Models 13, 759–774.

Nolan, J. P. (2001). Maximum likelihood estimation and diagnostics for stable distributions. In O. E. Barndorff-Nielsen, T. Mikosch & S. I. Resnick, eds., Lévy processes: Theory and applications. Birkhäuser, Boston, pp. 379–400.

Press, W. H., Teukolsky, S. A., Vetterling, W. T. & Flannery, B. P. (2007). Numerical recipes: The art of scientific computing. Cambridge University Press, New York, NY.

Rao, C. R. (1973). Linear statistical inference and its applications. John Wiley & Sons, Inc., New York, 2nd edn.

Whang, K.-Y., Vander-Zanden, B. T. & Taylor, H. M. (1990). A linear-time probabilistic counting algorithm for database applications. ACM Transactions on Database Systems 15, 208–229.

Zolotarev, V. M. (1986). One-dimensional stable distributions. American Mathematical Society, Providence, RI.

Ioana A. Cosma, Statistical Laboratory, Centre for Mathematical Sciences, University of Cambridge, Wilberforce Road, Cambridge, CB3 0WB, United Kingdom. E-mail: ioana@statslab.cam.ac.uk