Sketching and Streaming Entropy via Approximation Theory


Authors: Nicholas J. A. Harvey, Jelani Nelson, Krzysztof Onak

Abstract

We conclude a sequence of work by giving near-optimal sketching and streaming algorithms for estimating Shannon entropy in the most general streaming model, with arbitrary insertions and deletions. This improves on prior results that obtain suboptimal space bounds in the general model, and near-optimal bounds in the insertion-only model without sketching. Our high-level approach is simple: we give algorithms to estimate Rényi and Tsallis entropy, and use them to extrapolate an estimate of Shannon entropy. The accuracy of our estimates is proven using approximation theory arguments and extremal properties of Chebyshev polynomials, a technique which may be useful for other problems. Our work also yields the best-known and near-optimal additive approximations for entropy, and hence also for conditional entropy and mutual information.

1 Introduction

Streaming algorithms have attracted much attention in several computer science communities, notably theory, databases, and networking. Many algorithmic problems in this model are now well-understood, for example, the problem of estimating frequency moments [1, 2, 10, 18, 32, 35]. More recently, several researchers have studied the problem of estimating the empirical entropy of a stream [3, 6, 7, 12, 13, 37].

Motivation. There are two key motivations for studying entropy. The first is that it is a fundamentally important quantity with useful algebraic properties (chain rule, etc.). The second stems from several practical applications in computer networking, such as network anomaly detection. Let us consider a concrete example.
One form of malicious activity on the internet is port scanning, in which attackers probe target machines, trying to find open ports which could be leveraged for further attacks. In contrast, typical internet traffic is directed to a small number of heavily used ports for web traffic, email delivery, etc. Consequently, when a port scanning attack is underway, there is a significant change in the distribution of port numbers in the packets being delivered. It has been shown that measuring the entropy of the distribution of port numbers provides an effective means to detect such attacks. See Lakhina et al. [19] and Xu et al. [36] for further information about such problems and methods for their solution.

∗ MIT Computer Science and Artificial Intelligence Laboratory. nickh@mit.edu. Supported by a Natural Sciences and Engineering Research Council of Canada PGS Scholarship, by NSF contract CCF-0515221 and by ONR grant N00014-05-1-0148.
† MIT Computer Science and Artificial Intelligence Laboratory. minilek@mit.edu. Supported by a National Defense Science and Engineering Graduate (NDSEG) Fellowship.
‡ MIT Computer Science and Artificial Intelligence Laboratory. konak@mit.edu. Supported in part by NSF contract 0514771.

Our Techniques. In this paper, we give an algorithm for estimating empirical Shannon entropy while using a nearly optimal amount of space. Our algorithm is actually a sketching algorithm, not just a streaming algorithm, and it applies to general streams which allow insertions and deletions of elements. One attractive aspect of our work is its clean high-level approach: we reduce the entropy estimation problem to the well-studied frequency moment problem. More concretely, we give algorithms for estimating other notions of entropy, Rényi and Tsallis entropy, which are closely related to frequency moments.
The link to Shannon entropy is established by proving bounds on the rate at which these other entropies converge toward Shannon entropy. Remarkably, it seems that such an analysis was not previously known.

There are several technical obstacles that arise with this approach. Unfortunately, it does not seem that the optimal amount of space can be obtained while using just a single estimate of Rényi or Tsallis entropy. We overcome this obstacle by using several estimates, together with approximation theory arguments and certain infrequently-used extremal properties of Chebyshev polynomials. To our knowledge, this is the first use of such techniques in the context of streaming algorithms, and it seems likely that these techniques could be applicable to many other problems.

Such arguments yield good algorithms for additively estimating entropy, but obtaining a good multiplicative approximation is more difficult when the entropy is very small. In such a scenario, there is necessarily a very heavy element, and the task that one must solve is to estimate the moment of all elements excluding this heavy element. This task has become known as the residual moment estimation problem, and it is emerging as a useful building block for other streaming problems [3, 5, 10]. To estimate the $\alpha$-th residual moment for $\alpha \in (0,2]$, we show that $\tilde{O}(\varepsilon^{-2}\log m)$ bits of space suffice with a random oracle and $\tilde{O}(\varepsilon^{-2}\log^2 m)$ bits without. This compares with existing algorithms that use $O(\varepsilon^{-2}\log^2 m)$ bits for $\alpha = 2$ [11], and $O(\varepsilon^{-2}\log m)$ bits for $\alpha = 1$ [10]. No non-trivial algorithms were previously known for $\alpha \notin \{1, 2\}$. That said, the previously known algorithms were more general in ways unrelated to the needs of our work: they can remove the $k$ heaviest elements without requiring that they be sufficiently heavy.

Multiplicative Entropy Estimation.
Let us now state the performance of these algorithms more explicitly. We focus exclusively on single-pass algorithms unless otherwise noted. The first algorithms for approximating entropy in the streaming model are due to Guha et al. [13]; they achieved $O(\varepsilon^{-2} + \log n)$ words of space but assumed a randomly ordered stream. Chakrabarti, Do Ba and Muthukrishnan [7] then gave an algorithm for worst-case ordered streams using $O(\varepsilon^{-2}\log^2 m)$ words of space, but required two passes over the input. The algorithm of Chakrabarti, Cormode and McGregor [6] uses $O(\varepsilon^{-2}\log m)$ words of space to give a multiplicative $1+\varepsilon$ approximation, although their algorithm cannot produce sketches and only applies to insertion-only streams. In contrast, the algorithm of Bhuvanagiri and Ganguly [3] provides a sketch and can handle deletions but requires roughly $\tilde{O}(\varepsilon^{-3}\log^4 m)$ words.¹

Our work focuses primarily on the strict turnstile model (defined in Section 2), which allows deletions. Our algorithm for multiplicatively estimating Shannon entropy uses $\tilde{O}(\varepsilon^{-2}\log m)$ words of space. These bounds are nearly optimal in terms of the dependence on $\varepsilon$, since there is an $\tilde{\Omega}(\varepsilon^{-2})$ lower bound even for insertion-only streams. Our algorithms assume access to a random oracle. This assumption can be removed through the use of Nisan's pseudorandom generator [23], increasing the space bounds by a factor of $O(\log m)$.

Additive Entropy Estimation. Additive approximations of entropy are also useful, as they directly yield additive approximations of conditional entropy and mutual information, which cannot be approximated multiplicatively in small space [17]. Chakrabarti et al. noted that since

¹ A recent, yet unpublished improvement by the same authors [4] improves this to $\tilde{O}(\varepsilon^{-3}\log^3 m)$ words.
Shannon entropy is bounded above by $\log m$, a multiplicative $(1 + \varepsilon/\log m)$ approximation yields an additive $\varepsilon$-approximation. In this way, the work of Chakrabarti et al. [6] and Bhuvanagiri and Ganguly [3] yields additive $\varepsilon$ approximations using $O(\varepsilon^{-2}\log^3 m)$ and $\tilde{O}(\varepsilon^{-3}\log^7 m)$ words of space respectively. Our algorithm yields an additive $\varepsilon$ approximation using only $\tilde{O}(\varepsilon^{-2}\log m)$ words of space. In particular, our space bounds for multiplicative and additive approximation differ by only $\log\log m$ factors. Zhao et al. [37] give practical methods for additively estimating the so-called entropy norm of a stream. Their algorithm can be viewed as a special case of ours since it interpolates Shannon entropy using two estimates of Tsallis entropy, although this interpretation was seemingly unknown to those authors.

Other Information Statistics. We also give algorithms for approximating Rényi [26] and Tsallis [33] entropy. Rényi entropy plays an important role in expanders [15], pseudorandom generators, quantum computation [34, 38], and ecology [22, 27]. Tsallis entropy is an important quantity in physics that generalizes Boltzmann-Gibbs entropy, and also plays a role in the quantum context. Rényi and Tsallis entropy are both parameterized by a scalar $\alpha \ge 0$. The efficiency of our estimation algorithms depends on $\alpha$, and is stated precisely in Section 5.

A preliminary version of this work appeared in the IEEE Information Theory Workshop [14].

2 Preliminaries

Let $A = (A_1, \ldots, A_n) \in \mathbb{Z}^n$ be a vector initialized as $\vec{0}$ which is modified by a stream of $m$ updates. Each update is of the form $(i, v)$, where $i \in [n]$ and $v \in \{-M, \ldots, M\}$, and causes the change $A_i \leftarrow A_i + v$.
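The update model just defined can be made concrete with a small sketch. This is our own illustration, not part of the paper's algorithms; the helper names are ours.

```python
# A minimal sketch (ours) of the update model of Section 2: a vector A
# receives updates (i, v), and induces the distribution x_i = |A_i| / ||A||_1.
def apply_updates(n, updates):
    A = [0] * n
    for i, v in updates:
        A[i] += v
    return A

def induced_distribution(A):
    norm1 = sum(abs(a) for a in A)
    return [abs(a) / norm1 for a in A]

# Strict turnstile: deletions are allowed, but entries end nonnegative.
A = apply_updates(4, [(0, 3), (1, 2), (0, -1), (3, 4)])  # A = [2, 2, 0, 4]
x = induced_distribution(A)                              # x = [0.25, 0.25, 0.0, 0.5]
```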
For simplicity in stating bounds, we henceforth assume $m \ge n$ and $M = 1$; the latter can be simulated by increasing $m$ by a factor of $M$ and representing an update $(i, v)$ with $|v|$ separate updates (though in actuality our algorithm can perform all $|v|$ updates simultaneously in the time it takes to do one update). The vector $A$ gives rise to a probability distribution $x = (x_1, \ldots, x_n)$ with $x_i = |A_i|/\|A\|_1$. Thus for each $i$, either $x_i = 0$ or $x_i \ge 1/m$. In the strict turnstile model, we assume $A_i \ge 0$ for all $i \in [n]$ at the end of the stream. In the general update model we make no such assumption. For the remainder of this paper, we assume the strict turnstile model and assume access to a random oracle, unless stated otherwise. Our algorithms also extend to the general update model, typically increasing bounds by a factor of $O(\log m)$. As remarked above, the random oracle can be removed, using [23], while increasing the space by another $O(\log m)$ factor. When giving bounds, we often use the following tilde notation: we say $f(m, \varepsilon) = \tilde{O}(g(m, \varepsilon))$ if $f(m, \varepsilon) = O(g(m, \varepsilon)(\log\log m + \log(1/\varepsilon))^{O(1)})$.

We now define some functions commonly used in future sections. The $\alpha$-th norm of a vector is denoted $\|\cdot\|_\alpha$. We define the $\alpha$-th moment as $F_\alpha = \sum_{i=1}^n |A_i|^\alpha = \|A\|_\alpha^\alpha$. We define the $\alpha$-th Rényi entropy as $H_\alpha = \log(\|x\|_\alpha^\alpha)/(1-\alpha)$ and the $\alpha$-th Tsallis entropy as $T_\alpha = (1 - \|x\|_\alpha^\alpha)/(\alpha - 1)$. Shannon entropy $H = H_1$ is defined by $H = -\sum_{i=1}^n x_i \log x_i$. A straightforward application of l'Hôpital's rule shows that $H = \lim_{\alpha \to 1} H_\alpha = \lim_{\alpha \to 1} T_\alpha$. It will often be convenient to focus on the quantity $\alpha - 1$ instead of $\alpha$ itself. Thus, we often write $H(a) = H_{1+a}$ and $T(a) = T_{1+a}$. We will often need to approximate frequency moments, for which we use the following:

Fact 2.1 (Indyk [16], Li [20], [21]).
There is an algorithm for multiplicative approximation of $F_\alpha$ for any $\alpha \in (0, 2]$. The algorithm needs $O(\varepsilon^{-2}\log m)$ bits of space in the general update model, and $O\!\left(\left(\frac{|\alpha-1|}{\varepsilon^2} + \frac{1}{\varepsilon}\right)\log m\right)$ bits of space in the strict turnstile model.

For any function $a \mapsto f(a)$, we denote its $k$-th derivative with respect to $a$ by $f^{(k)}(a)$.

3 Estimating Shannon Entropy

3.1 Overview

We begin by describing a general algorithm for computing an additive approximation to Shannon entropy. The remainder of this paper describes and analyzes various details and incarnations of this algorithm, including extensions to give a multiplicative approximation in Section 3.4. We assume that $m$, the length of the stream, is known in advance. Computing $\|A\|_1$ is trivial since we assume the strict turnstile model at present.

Algorithm 1. Our algorithm for additively approximating empirical Shannon entropy.
  Choose error parameter $\tilde\varepsilon$ and $k$ points $\{y_0, \ldots, y_k\}$.
  Process the entire stream: for each $i$, compute $\tilde{F}_{1+y_i}$, a $(1+\tilde\varepsilon)$-approximation of the frequency moment $F_{1+y_i}$.
  For each $i$, compute $\tilde{H}(y_i) = -\log\!\big(\tilde{F}_{1+y_i}/\|A\|_1^{1+y_i}\big)/y_i$ and $\tilde{T}(y_i) = \big(1 - \tilde{F}_{1+y_i}/\|A\|_1^{1+y_i}\big)/y_i$.
  Return an estimate of $H(0)$ or $T(0)$ by interpolation using the points $\tilde{H}(y_i)$ or $\tilde{T}(y_i)$.

3.2 One-point Interpolation

The easiest implementation of this algorithm is to set $k = 0$, and estimate Shannon entropy $H$ using a single estimate of Rényi entropy $H(y_0)$. We choose $y_0 = \tilde\Theta(\varepsilon/(\log n \log m))$ and $\tilde\varepsilon = \varepsilon \cdot y_0$. By Fact 2.1, the space required is $\tilde{O}(\varepsilon^{-3}\log n \log m)$ words. The following argument shows this gives an additive $O(\varepsilon)$ approximation. With constant probability, $\tilde{F}_{1+y_0} = (1 \pm \tilde\varepsilon)F_{1+y_0}$.
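Before the formal derivation, a small numerical sketch (ours, with exact moments standing in for sketched ones) illustrates why a single Rényi estimate near $\alpha = 1$ already tracks $H$:

```python
import math

# One-point interpolation, in the notation of Section 3.2:
# H(y0) = -log(sum_i x_i^{1+y0}) / y0 approaches the Shannon entropy H
# as y0 -> 0. Moments are exact here; the streaming algorithm would use
# (1 + eps~)-approximate sketched moments instead.
def shannon(x):
    return -sum(p * math.log(p) for p in x if p > 0)

def renyi_shifted(x, y0):
    # H(y0) = H_{1+y0} = log(sum x_i^{1+y0}) / (1 - (1+y0))
    return -math.log(sum(p ** (1 + y0) for p in x if p > 0)) / y0

x = [0.5, 0.25, 0.125, 0.125]
H = shannon(x)                    # = 1.75 * ln 2
approx = renyi_shifted(x, 1e-4)   # slightly below H, and very close to it
```

Since $H_\alpha$ is non-increasing in $\alpha$, the one-point estimate approaches $H$ from below.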
Then
$$\tilde{H}(y_0) = -\frac{1}{y_0}\log\left(\frac{\tilde{F}_{1+y_0}}{\|A\|_1^{1+y_0}}\right) = -\frac{1}{y_0}\log\left((1 \pm O(\tilde\varepsilon))\sum_{i=1}^n x_i^{1+y_0}\right) = H(y_0) \pm O\!\left(\frac{\tilde\varepsilon}{y_0}\right) = H \pm O(\varepsilon). \quad (3.1)$$
The last equality follows from the following theorem, which bounds the rate of convergence of Rényi entropy towards Shannon entropy. A proof is given in Appendix A.1.

Theorem 3.1. Let $x \in \mathbb{R}^n$ be a probability distribution whose smallest positive value is at least $1/m$, where $m \ge n$. Let $0 < \varepsilon < 1$ be arbitrary. Define $\mu = \varepsilon/(4\log m)$, $\nu = \varepsilon/(4\log n \log m)$, $\alpha = 1 + \mu/(16\log(1/\mu))$, and $\beta = 1 + \nu/(16\log(1/\nu))$. Then
$$1 \le \frac{H_1}{H_\alpha} \le 1 + \varepsilon \qquad\text{and}\qquad 0 \le H_1 - H_\beta \le \varepsilon.$$

3.3 Multi-point Interpolation

The algorithm of Section 3.2 is limited by the following tradeoff: if we choose the point $y_0$ to be close to 0, the accuracy increases, but the space usage also increases. In this section, we avoid that tradeoff by interpolating with multiple points. This allows us to obtain good accuracy without taking the points too close to 0. We formalize this using approximation theory arguments and properties of Chebyshev polynomials.

The algorithm estimates the Tsallis entropy with error parameter $\tilde\varepsilon = \varepsilon/(12(k+1)^3\log m)$ using points $y_0, y_1, \ldots, y_k$, chosen as follows. First, the number of points is $k = \log(1/\varepsilon) + \log\log m$. Their values are chosen to be an affine transformation of the extrema of the $k$-th Chebyshev polynomial. Formally, set $\ell = 1/(2(k+1)\log m)$ and define the map $f : \mathbb{R} \to \mathbb{R}$ by
$$f(y) = \frac{(k^2\ell)\cdot y - \ell\cdot(k^2+1)}{2k^2+1}, \qquad\text{then define}\qquad y_i = f\big(\cos(i\pi/k)\big). \quad (3.2)$$
The correctness of this algorithm is proven in Section 3.3.2. Let us now analyze the space requirements. Computing the estimate $\tilde{F}_{1+y_i}$ uses only $\tilde{O}(\tilde\varepsilon^{-2}/\log m)$ words of space by Fact 2.1, since $|y_i| \le 1/(2(k+1)\log m)$ for each $i$.
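The interpolation points of Eq. (3.2) are easy to compute directly; the following sketch (ours) does so and checks the containment the analysis relies on:

```python
import math

# Interpolation points from Eq. (3.2): an affine image of the Chebyshev
# extrema cos(i*pi/k), chosen so that every y_i lies in [-l, 0) with
# l = 1/(2(k+1) log m).
def interpolation_points(k, m):
    l = 1.0 / (2 * (k + 1) * math.log(m))
    def f(y):
        return ((k * k * l) * y - l * (k * k + 1)) / (2 * k * k + 1)
    return [f(math.cos(i * math.pi / k)) for i in range(k + 1)]

ys = interpolation_points(k=6, m=10 ** 6)
# f maps -1 to -l and +1 to -l/(2k^2+1), so all points are negative but
# bounded away from 0, which is what keeps the moment sketches cheap.
```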
By our choice of $k = \tilde{O}(1)$ and $\tilde\varepsilon$, the total space required is $\tilde{O}(\varepsilon^{-2}\log m)$ words. We argue correctness of this algorithm in Section 3.3.2. Before doing so, we must mention some properties of Chebyshev polynomials.

3.3.1 Chebyshev Polynomials

Our algorithm exploits certain extremal properties of Chebyshev polynomials. For a basic introduction to Chebyshev polynomials we refer the reader to [24, 25, 28]. A thorough treatment of these objects can be found in [29]. We now present the background relevant for our purposes.

Definition 3.2. The set $\mathcal{P}_k$ consists of all polynomials of degree at most $k$ with real coefficients. The Chebyshev polynomial of degree $k$, $P_k(x)$, is defined by the recurrence
$$P_k(x) = \begin{cases} 1, & k = 0 \\ x, & k = 1 \\ 2xP_{k-1}(x) - P_{k-2}(x), & k \ge 2 \end{cases}$$
and satisfies $|P_k(x)| \le 1$ for all $x \in [-1, 1]$. The value $|P_k(x)|$ equals 1 for exactly $k+1$ values of $x$ in $[-1, 1]$; specifically, $P_k(\eta_{j,k}) = (-1)^j$ for $0 \le j \le k$, where $\eta_{j,k} = \cos(j\pi/k)$. The set $\mathcal{C}_k$ is defined as the set of all polynomials $p \in \mathcal{P}_k$ satisfying $\max_{0 \le j \le k} |p(\eta_{j,k})| \le 1$.

Fact 3.3 (Extremal Growth Property). If $p \in \mathcal{C}_k$ and $|t| \ge 1$, then $|p(t)| \le |P_k(t)|$.

Proof. See [29, Ex. 1.5.11] or Rogosinski [30]. □

Fact 3.3 states that all polynomials which are bounded on certain "critical points" of the interval $I = [-1, 1]$ cannot grow faster than Chebyshev polynomials once leaving $I$.

3.3.2 Correctness

To analyze our algorithm, let us first suppose that our algorithm could exactly compute the Tsallis entropies $T(y_i)$ for $0 \le i \le k$. Let $p$ be the degree-$k$ polynomial obtained by interpolating at the chosen points, i.e., $p(y_i) = T(y_i)$ for $0 \le i \le k$. The algorithm uses $p(0)$ as its estimate for $T(0)$. We analyze the accuracy of this estimate using the following fact.
Recall that the notation $g^{(k)}$ denotes the $k$-th derivative of a function $g$.

Fact 3.4 (Phillips and Taylor [25], Theorem 4.2). Let $y_0, y_1, \ldots, y_k$ be points in the interval $[a, b]$. Let $g : \mathbb{R} \to \mathbb{R}$ be such that $g^{(1)}, \ldots, g^{(k)}$ exist and are continuous on $[a, b]$, and $g^{(k+1)}$ exists on $(a, b)$. Then, for every $y \in [a, b]$, there exists $\xi_y \in (a, b)$ such that
$$g(y) - p(y) = \left(\prod_{i=0}^k (y - y_i)\right)\frac{g^{(k+1)}(\xi_y)}{(k+1)!},$$
where $p(y)$ is the degree-$k$ polynomial obtained by interpolating the points $(y_i, g(y_i))$, $0 \le i \le k$.

To apply this fact, a bound on $|T^{(k+1)}(y)|$ is needed. It suffices to consider the interval $[-\ell, 0)$, since the map $f$ defined in Eq. (3.2) sends $-1 \mapsto -\ell$ and $1 \mapsto -\ell/(2k^2+1)$, and hence Eq. (3.2) shows that $y_i \in [-\ell, 0)$ for all $i$. Since $\ell = 1/(2(k+1)\log m)$, it follows from the following lemma that
$$|T^{(k+1)}(y_i)| \le \frac{4\log^{k+1}(m)\,H}{k+2} \qquad \forall\, 0 \le i \le k. \quad (3.3)$$

Lemma 3.5. Let $\varepsilon$ be in $(0, 1/2]$. Then $\left|T^{(k)}\!\left(-\frac{\varepsilon}{(k+1)\log m}\right)\right| \le 4\log^k(m)\,H/(k+1)$.

By Fact 3.4 and Eq. (3.3), we have
$$|T(0) - p(0)| \le |\ell|^{k+1}\cdot\frac{4\log^{k+1}(m)\,H}{(k+1)!\,(k+2)} \le \frac{1}{2^{k+1}\log^{k+1}(m)}\cdot\frac{4\log^{k+1}(m)\,H}{(k+1)!\,(k+2)} \le \frac{2\varepsilon}{(k+1)!\,(k+2)} \le \frac{\varepsilon}{2}, \quad (3.4)$$
since $2^k = (\log m)/\varepsilon$ and $H \le \log m$. This demonstrates that our algorithm computes a good approximation of $T(0) = H$, under the assumption that the values $T(y_i)$ can be computed exactly. The remainder of this section explains how to remove this assumption.

Algorithm 1 does not compute the exact values $T(y_i)$; it only computes approximations. The accuracy of these approximations can be determined as follows. We have
$$\tilde{T}(y_i) = \frac{1 - \tilde{F}_{1+y_i}/\|A\|_1^{1+y_i}}{y_i} \le T(y_i) - \frac{\tilde\varepsilon\cdot\sum_{j=1}^n x_j^{1+y_i}}{y_i}. \quad (3.5)$$
Now recall that $x_j \ge 1/m$ for each $j$ and $y_i \ge -\ell$, so that $x_j^{y_i} \le m^{\ell} = m^{1/(2(k+1)\log m)} < 2$.
Thus $\sum_{j=1}^n x_j^{1+y_i} \le 2\sum_{j=1}^n x_j = 2$. Since $\tilde\varepsilon/\ell = \varepsilon/(6k^2)$, we have
$$T(y_i) \le \tilde{T}(y_i) \le T(y_i) + \varepsilon/(3k^2). \quad (3.6)$$
Now let $\tilde{p}(x)$ be the degree-$k$ polynomial defined by $\tilde{p}(y_i) = \tilde{T}(y_i)$ for all $0 \le i \le k$. Then Eq. (3.6) shows that $r(x) = p(x) - \tilde{p}(x)$ is a polynomial of degree at most $k$ satisfying $|r(y_i)| \le \varepsilon/(3k^2)$ for all $0 \le i \le k$. Let $P : \mathbb{R} \to \mathbb{R}$ be the Chebyshev polynomial of degree $k$, and let $Q(y) = P(f^{-1}(y))$ be an affine transformation of $P$. Then the polynomial $r'(y) = (3k^2/\varepsilon)\cdot r(y)$ satisfies $|r'(y_i)| \le |Q(y_i)|$ for all $0 \le i \le k$. Thus Fact 3.3 implies that $|r'(0)| \le |Q(0)|$. By definition of $Q$, $Q(0) = P(f^{-1}(0)) = P(1 + 1/k^2)$. The following lemma shows that this is at most $e^2$.

Lemma 3.6. Let $P_k$ be the $k$-th Chebyshev polynomial, $k \ge 1$, and let $x = 1 + k^{-c}$. Then
$$|P_k(x)| \le \prod_{j=1}^k \left(1 + \frac{2j}{k^c}\right) \le e^{2k^{2-c}}.$$

Thus $|r'(0)| \le e^2$ and $|r(0)| \le \varepsilon/2$ since $k \ge 2$. To conclude, we have shown $|p(0) - \tilde{p}(0)| = |r(0)| \le \varepsilon/2$. Combining with Eq. (3.4) via the triangle inequality shows $|\tilde{p}(0) - H| \le \varepsilon$.

3.4 Multiplicative Approximation of Shannon Entropy

We now discuss how to extend the multi-point interpolation algorithm to obtain a multiplicative approximation of Shannon entropy. The main tool that we require is a multiplicative estimate of Tsallis entropy, rather than the additive estimates used above. Section 5 shows that the required multiplicative estimates can be efficiently computed; Section 4 provides tools for doing this.

The modifications to the multi-point interpolation algorithm are as follows. We set the number of interpolation points to be $k = \max\{5, \log(1/\varepsilon)\}$, then argue as in Eq. (3.4) to have $|T(0) - p(0)| \le \varepsilon H/2$, where $p$ is the interpolated polynomial of degree $k$.
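The multi-point scheme of Section 3.3 can be sanity-checked numerically. The sketch below is ours; it uses exact Tsallis values in place of sketched ones, so it isolates the interpolation step from the estimation error:

```python
import math

# Multi-point interpolation (Section 3.3) with exact moments: evaluate
# T(a) = (1 - sum_i x_i^{1+a}) / a at the Chebyshev-derived points y_i,
# interpolate a degree-k polynomial, and extrapolate to a = 0, where
# T(0) equals the Shannon entropy H.
def tsallis_shifted(x, a):
    return (1.0 - sum(p ** (1 + a) for p in x if p > 0)) / a

def lagrange_at_zero(ys, ts):
    # Evaluate the interpolating polynomial through (y_i, t_i) at 0.
    total = 0.0
    for i, (yi, ti) in enumerate(zip(ys, ts)):
        w = 1.0
        for j, yj in enumerate(ys):
            if j != i:
                w *= (0.0 - yj) / (yi - yj)
        total += ti * w
    return total

x = [0.5, 0.25, 0.125, 0.125]
H = -sum(p * math.log(p) for p in x)

k, m = 6, 10 ** 6
l = 1.0 / (2 * (k + 1) * math.log(m))
f = lambda y: ((k * k * l) * y - l * (k * k + 1)) / (2 * k * k + 1)
ys = [f(math.cos(i * math.pi / k)) for i in range(k + 1)]
estimate = lagrange_at_zero(ys, [tsallis_shifted(x, a) for a in ys])
# With exact moments, the extrapolated value matches H to many digits.
```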
We then use Algorithm 1, but we compute $\tilde{T}(y_i)$ to be a $(1+\tilde\varepsilon)$-multiplicative estimate of $T(y_i)$ instead of an $\tilde\varepsilon$-additive estimate, by using Theorem 5.6. By arguing as in Eq. (3.6), we have $T(y_i) \le \tilde{T}(y_i) \le T(y_i) + \varepsilon T(y_i)/(3k^2) \le T(y_i) + 4\varepsilon H/(3k^2)$. The final inequality follows from Lemma 3.5 with $k = 0$. From this point, the argument is identical to Section 3.3.2, showing that $|p(0) - \tilde{p}(0)| \le 4\varepsilon e^2 H/(3k^2) < \varepsilon H/2$, and yielding $|\tilde{p}(0) - H| \le \varepsilon H$ by the triangle inequality.

4 Estimating Residual Moments

To multiplicatively approximate Shannon entropy, the algorithm of Section 3.4 requires a multiplicative approximation of Tsallis entropy. Section 5 shows that the required quantities can be computed. The main tool needed is an efficient algorithm for estimating residual moments. That is the topic of the present section.

Define the residual $\alpha$-th moment to be $F_\alpha^{\mathrm{res}} := \sum_{i=2}^n |A_i|^\alpha = F_\alpha - |A_1|^\alpha$, where we reorder the items such that $|A_1| \ge |A_2| \ge \cdots \ge |A_n|$. In this section, we present two efficient algorithms to compute a $1+\varepsilon$ multiplicative approximation to $F_\alpha^{\mathrm{res}}$ for $\alpha \in (0, 2]$. These algorithms succeed with constant probability under the assumption that a heavy hitter exists, say $|A_1| \ge \frac{4}{5}\|A\|_1$.

The algorithm of Section 4.2 is valid only in the strict turnstile model. Its space usage has a complicated dependence on $\alpha$; for the primary range of interest, $\alpha \in [1/3, 1)$, the bound is $O\!\left(\left(\varepsilon^{-1/\alpha} + \varepsilon^{-2}(1-\alpha) + \log n\right)\log m\right)$. The algorithm of Section 4.3 is valid in the general update model and uses $\tilde{O}(\varepsilon^{-2}\log m)$ bits of space.

4.1 Finding a Heavy Element

A subroutine that is needed for both of our algorithms is to detect whether a heavy hitter exists ($|A_i| \ge \frac{4}{5}\|A\|_1$) and to find the identity of that element.
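Offline, with the full vector $A$ in hand, both the residual moment and the heavy-hitter test are immediate; a small reference sketch (ours, not the paper's streaming procedure) fixes the definitions:

```python
# Offline reference (ours) for the quantities of Section 4: the streaming
# algorithms must recover these from small sketches without storing A.
def has_heavy_hitter(A):
    # Heavy hitter: some |A_i| >= (4/5) * ||A||_1.
    norm1 = sum(abs(a) for a in A)
    return max(abs(a) for a in A) >= 0.8 * norm1

def residual_moment(A, alpha):
    # F_res_alpha drops the single heaviest item from F_alpha.
    mags = sorted((abs(a) for a in A), reverse=True)
    return sum(v ** alpha for v in mags[1:])

A = [100, 3, 2, 1]
heavy = has_heavy_hitter(A)          # 100 >= 0.8 * 106
F_res_1 = residual_moment(A, 1.0)    # 3 + 2 + 1 = 6
```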
We will describe a procedure for doing so in the general update model. We use the following result, which essentially follows from the count-min sketch [8]. For completeness, a self-contained proof is given in Appendix A.5.

Fact 4.1. Let $w \in \mathbb{R}_+^n$ be a weight vector on $n$ elements so that $\sum_i w_i = 1$. There exists a family $\mathcal{H}$ of hash functions mapping the $n$ elements to $O(1/\varepsilon)$ bins with $|\mathcal{H}| = n^{O(1)}$ such that a random $h \in \mathcal{H}$ satisfies the following two properties with probability at least $15/16$.
(1) If $w_i \ge 1/2$, then the weight of elements that collide with element $i$ is at most $\varepsilon\cdot\sum_{j \ne i} w_j$.
(2) If $\max_i w_i < 1/2$, then the weight of elements hashing to each bin is at most $3/4$.

We use the hash function from Fact 4.1 with $\varepsilon = 1/10$ to partition the elements into bins, and for each bin maintain a counter of the net $L_1$ weight that hashes to it. If there is a heavy hitter, then the net weight in its bin is more than $4/5 - \varepsilon(1/5) > 3/4$. Conversely, if there is a bin with at least $3/4$ of the weight, then Fact 4.1 implies that there is a heavy element. We determine the identity of the heavy element via a group-testing type of argument: we maintain $\lceil \log_2 n \rceil$ counters, of which the $i$-th counts the number of elements which have their $i$-th bit set. Thus, if there is a heavy element, we can determine its $i$-th bit by checking whether the fraction of elements with their $i$-th bit set is at least $3/5$.

4.2 Bucketing Algorithm

In this section, we describe an algorithm for estimating $F_\alpha^{\mathrm{res}}$ that works only in the strict turnstile model. The algorithm has several cases, depending on the value of $\alpha$.

Case 1: $\alpha = 1$. This is the simplest case for our algorithm. We use the hash function from Fact 4.1 to partition the elements into bins, and for each bin maintain a count of the number of elements that hash to it.
If there is a bin with more than $3/4$ of the elements at the end of the procedure, then there is a heavy element, and it suffices to return the total number of elements in the other bins. Otherwise, we announce that there is no heavy hitter. The correctness follows from Fact 4.1, and the space required is $O(\frac{1}{\varepsilon}\log m)$ bits.

Case 2: $\alpha \in (0, \frac{1}{3}) \cup (1, 2]$. Again, we use the hash function from Fact 4.1 to partition the elements into bins. For each bin, we maintain a count of the number of elements, and a sketch of the $\alpha$-th moment using Fact 2.1. The counts allow us to detect if there is a heavy hitter, as in Case 1. If so, we combine the moment sketches of all bins other than the one containing the heavy hitter; this gives a good estimate with constant probability. By Fact 2.1, we need only
$$O\!\left(\frac{1}{\varepsilon}\cdot\left(\frac{|\alpha-1|}{\varepsilon^2} + \frac{1}{\varepsilon}\right)\log m + \frac{1}{\varepsilon}\log m\right) = O\!\left(\left(\frac{|\alpha-1|}{\varepsilon^3} + \frac{1}{\varepsilon^2}\right)\log m\right)$$
bits.

Case 3: $\alpha \in [\frac{1}{3}, 1)$. This is the most interesting case. The idea is to keep just one sketch of the $\alpha$-th moment for the entire stream. At the end, we estimate $F_\alpha^{\mathrm{res}}$ by artificially appending deletions to the stream which almost entirely remove the heavy hitter from the sketch.

The algorithm computes four quantities in parallel. First, $\tilde{F}_1^{\mathrm{res}} = (1 \pm \varepsilon')F_1^{\mathrm{res}}$ with error parameter $\varepsilon' = \varepsilon^{1/\alpha}$, using the above algorithm with $\alpha = 1$. Second, $\tilde{F}_\alpha = (1 \pm \varepsilon)F_\alpha$ using Fact 2.1. Third, $F_1$, which is trivial in the strict turnstile model. Lastly, we determine the identity of the heavy hitter as in Section 4.1.

Now we explain how to estimate $F_\alpha^{\mathrm{res}}$. The key observation is that $F_1 - \tilde{F}_1^{\mathrm{res}}$ is a very good approximation to $A_1$ (assume this is the heavy hitter). So if we delete the heavy hitter $(F_1 - \tilde{F}_1^{\mathrm{res}})$ times, then at most $\varepsilon' F_1^{\mathrm{res}}$ occurrences of it remain. Define $\tilde{F}_\alpha^{\mathrm{res}}$ to be the value of $\tilde{F}_\alpha$ after processing these deletions.
Clearly $F_\alpha^{\mathrm{res}} \ge (F_1^{\mathrm{res}})^\alpha$, by concavity of the function $y \mapsto y^\alpha$. On the other hand, the remaining occurrences of the heavy hitter contribute at most $(\varepsilon' F_1^{\mathrm{res}})^\alpha$. Hence, the remaining occurrences of the heavy hitter inflate $F_\alpha^{\mathrm{res}}$ by a factor of at most $1 + (\varepsilon'\cdot F_1^{\mathrm{res}})^\alpha/(F_1^{\mathrm{res}})^\alpha = 1 + \varepsilon$. Thus $\tilde{F}_\alpha^{\mathrm{res}} = (1 + O(\varepsilon))F_\alpha^{\mathrm{res}}$, as desired. The number of bits of space used by this algorithm is at most
$$O\!\left(\frac{1}{\varepsilon'}\log m + \left(\frac{1-\alpha}{\varepsilon^2} + \frac{1}{\varepsilon}\right)\log m + \log n\log m\right) = O\!\left(\left(\frac{1}{\varepsilon^{1/\alpha}} + \frac{1-\alpha}{\varepsilon^2} + \log n\right)\log m\right).$$

4.3 Geometric Mean Algorithm

This section describes an algorithm for estimating $F_\alpha^{\mathrm{res}}$ in the general update model. At a high level, the algorithm uses a hash function to partition the stream elements into two substreams, then separately estimates the moment $F_\alpha$ for the substreams. The estimate for the substream which does not contain the heavy hitter yields a good estimate of $F_\alpha^{\mathrm{res}}$. We improve the accuracy of this estimator by averaging many independent trials. A detailed description and analysis follow.

We use Li's geometric mean estimator [21] for estimating $F_\alpha$, since it is unbiased (its being unbiased will be useful later). The geometric mean estimator is defined as follows. Let $k$ and $\alpha$ be parameters. We let $y = R\cdot A$, where $A$ is the vector representing the stream and $R$ is a $k \times n$ matrix whose entries are i.i.d. samples from an $\alpha$-stable distribution. Define
$$\tilde{F}_\alpha = \frac{\prod_{j=1}^k |y_j|^{\alpha/k}}{\left[\frac{2}{\pi}\,\Gamma\!\left(\frac{\alpha}{k}\right)\Gamma\!\left(1-\frac{1}{k}\right)\sin\!\left(\frac{\pi\alpha}{2k}\right)\right]^k}.$$
The space required to compute this estimator is easily seen to be $O(k\cdot\log m)$ bits. Li analyzed the variance of $\tilde{F}_\alpha$ as $k \to \infty$; however, for our purposes we are only interested in the case $k = 3$ and henceforth restrict to only this case (one can show $\tilde{F}_\alpha$ has unbounded variance for $k < 3$). Building on Li's analysis, we show the following result.

Lemma 4.2.
There exists an absolute constant $C_{GM}$ such that $\mathrm{Var}\big[\tilde{F}_\alpha\big] \le C_{GM}\cdot\mathrm{E}\big[\tilde{F}_\alpha\big]^2$.

Let $r$ denote the number of independent trials. For each $j \in [r]$, the algorithm picks a function $h_j : [n] \to \{0, 1\}$ uniformly at random. For $j \in [r]$ and $l \in \{0, 1\}$, define $F_{\alpha,j,l} = \sum_{i : h_j(i) = l} |A_i|^\alpha$. This is the $\alpha$-th moment for the $l$-th substream during the $j$-th trial. For each $j$ and $l$, our algorithm computes an estimate $\tilde{F}_{\alpha,j,l}$ of $F_{\alpha,j,l}$ using the geometric mean estimator. We also run in parallel the algorithm of Section 4.1 to discover which $i \in [n]$ is the heavy hitter; henceforth assume $i = 1$. Our overall estimate for $F_\alpha^{\mathrm{res}}$ is then
$$\tilde{F}_\alpha^{\mathrm{res}} = \frac{2}{r}\sum_{j=1}^r \tilde{F}_{\alpha,j,\,1-h_j(1)}.$$
The space used by our algorithm is simply the space required for $r$ geometric mean estimators and the one heavy hitter algorithm. The latter uses $\tilde{O}(\varepsilon^{-1}\log n)$ bits of space [8, Theorem 7]. Thus the total space required is $\tilde{O}(r\log m + \varepsilon^{-1}\log n)$ bits.

We now sketch an analysis of the algorithm; a formal argument is given in Appendix A.4. The natural analysis would be to show that, for each item, the fraction of trials in which the item doesn't collide with the heavy hitter is concentrated around $1/2$. A union bound over all items would require choosing the number of trials to be $\Omega(\frac{1}{\varepsilon^2}\log n)$. We obtain a significantly smaller number of trials by using a different analysis. Instead of using a concentration bound for each item, we observe that items with roughly the same weight (i.e., the value of $|A_i|$) are essentially equivalent for the purposes of this analysis. So we partition the items into classes such that all items in a class have the same weight, up to a $(1+\varepsilon)$ factor. We then apply concentration bounds for each class, rather than separately for each item.
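The two-substream idea of Section 4.3 can be illustrated in isolation. The sketch below is ours and uses exact moments in place of the geometric mean estimator, so it shows only the hashing-and-averaging structure:

```python
import random

# Simplified illustration (ours) of Section 4.3: each trial hashes items
# into two substreams, and the substream avoiding the heavy hitter yields
# (half of) the residual moment in expectation; averaging r trials
# concentrates the estimate.
def residual_estimate(A, alpha, heavy, r, seed=0):
    rng = random.Random(seed)
    n = len(A)
    total = 0.0
    for _ in range(r):
        h = [rng.randrange(2) for _ in range(n)]
        side = 1 - h[heavy]  # substream without the heavy hitter
        total += sum(abs(A[i]) ** alpha for i in range(n) if h[i] == side)
    return 2.0 * total / r

A = [1000] + [1] * 200            # item 0 is the heavy hitter
exact = 200.0                     # F_res_1 = sum of the light items
est = residual_estimate(A, alpha=1.0, heavy=0, r=400)
# Each trial keeps each light item with probability 1/2, so the estimator
# is unbiased and, with 400 trials, lands close to the exact value.
```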
The number of classes is only $R = O(\frac{1}{\varepsilon}\log m)$, and a union bound over classes only requires $\Theta(\frac{1}{\varepsilon^2}\log R)$ trials. As argued, the space usage of this algorithm is $\tilde{O}(r\log m + \varepsilon^{-1}\log n) = \tilde{O}(\varepsilon^{-2}\log m)$ bits.

5 Estimation of Rényi and Tsallis Entropy

This section summarizes our algorithms for estimating Rényi and Tsallis entropy. These algorithms are used as subroutines for estimating Shannon entropy in Section 3, and may be of independent interest. The techniques we use for both entropies are almost identical. In particular, to compute an additive approximation of $T_\alpha$ or $H_\alpha$, it suffices to compute a sufficiently precise multiplicative approximation of the $\alpha$-th moment. Due to space constraints, we present proofs of all lemmas and theorems from this section in the appendix.

Theorem 5.1. There is an algorithm that computes an additive $\varepsilon$-approximation of Rényi entropy in $O\!\left(\frac{\log m}{|1-\alpha|\cdot\varepsilon^2}\right)$ bits of space for any $\alpha \in (0, 1) \cup (1, 2]$.

Theorem 5.2. There is an algorithm for additive approximation of Tsallis entropy $T_\alpha$ using
• $O\!\left(\frac{n^{2(1-\alpha)}\log m}{(1-\alpha)\,\varepsilon^2}\right)$ bits, for $\alpha \in (0, 1)$.
• $O\!\left(\frac{\log m}{(\alpha-1)\,\varepsilon^2}\right)$ bits, for $\alpha \in (1, 2]$.

In order to obtain a multiplicative approximation of Tsallis and Rényi entropy, we must prove a few facts. The next lemma says that if there is no heavy element in the empirical distribution, then Tsallis entropy is at least a constant.

Lemma 5.3. Let $x_1, x_2, \ldots, x_n$ be values in $[0, 1]$ of total sum 1. There exists a positive constant $C$ such that if $x_i \le 5/6$ for all $i$ then, for $\alpha \in (0, 1) \cup (1, 2]$,
$$\left|1 - \sum_{i=1}^n x_i^\alpha\right| \ge C\cdot|\alpha - 1|.$$

Corollary 5.4. There exists a constant $C$ such that if the probability of each element is at most $5/6$, then the Tsallis entropy is at least $C$ for any $\alpha \in (0, 1) \cup (1, 2]$.

Proof.
We have
$$T_\alpha = \frac{1-\sum_{i=1}^n x_i^\alpha}{\alpha-1} = \frac{\big|1-\sum_{i=1}^n x_i^\alpha\big|}{|\alpha-1|} \ge C. \qquad\Box$$

We now show how to deal with the case when there is an element of large probability. It turns out that in this case we can obtain a multiplicative approximation of Tsallis entropy by combining two residual moments.

Lemma 5.5. There is a positive constant $C$ such that if there is an element $i$ of probability $x_i \ge 2/3$, then the sum of a multiplicative $(1 + C\cdot|1-\alpha|\cdot\varepsilon)$-approximation to $1-x_i$ and a multiplicative $(1 + C\cdot|1-\alpha|\cdot\varepsilon)$-approximation to $\sum_{j\ne i} x_j^\alpha$ gives a multiplicative $(1+\varepsilon)$-approximation to $|1 - \sum_i x_i^\alpha|$, for any $\alpha \in (0,1)\cup(1,2]$.

We collect these facts in the following theorem.

Theorem 5.6. There is a streaming algorithm for multiplicative $(1+\varepsilon)$-approximation of Tsallis entropy for any $\alpha \in (0,1)\cup(1,2]$ using $\tilde O\big(\log m/(|1-\alpha|\varepsilon^2)\big)$ bits of space.

The next lemma shows that we can handle the logarithm that appears in the definition of Rényi entropy.

Lemma 5.7. Let $t \in (4/9,\infty)$. To compute a multiplicative $(1+C\cdot\varepsilon)$-approximation to $\log(t)$, for some constant $C$, it suffices to have a multiplicative $(1+\varepsilon)$-approximation to $t-1$.

We now have all necessary facts to estimate Rényi entropy for $\alpha \in (0,2]$.

Theorem 5.8. There is a streaming algorithm for multiplicative $(1+\varepsilon)$-approximation of Rényi entropy for any $\alpha \in (0,1)\cup(1,2]$. The algorithm uses $\tilde O\big(\log m/(|1-\alpha|\varepsilon^2)\big)$ bits of space.

In fact, Theorem 5.8 is tight in the sense that $(1+\varepsilon)$-multiplicative approximation of $H_\alpha$ for $\alpha > 2$ requires polynomial space, as seen in the following theorem.

Theorem 5.9. For any $\alpha > 2$, any randomized one-pass streaming algorithm which $(1+\varepsilon)$-approximates $H_\alpha(X)$ requires $\Omega\big(n^{1-2/\alpha-2\varepsilon-\gamma}(\varepsilon+1/\alpha)\big)$ bits of space for arbitrary constant $\gamma > 0$.
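Lemma 5.7 can be sanity-checked numerically: over a grid of $t \in (4/9,\infty)$, perturbing $t-1$ by a $(1\pm\varepsilon)$ factor perturbs $\log(t)$ by only a constant multiple of $\varepsilon$, even near $t = 1$ where $\log(t)$ vanishes (there $\log t \approx t-1$, so the errors track each other). The constant bound of 3 used below is an empirical choice for this grid, not the paper's constant $C$.

```python
import math

def log_from_shifted(t_minus_1_approx):
    """Recover log(t) from an approximation to t - 1, as in Lemma 5.7."""
    return math.log1p(t_minus_1_approx)

eps = 0.05
worst = 0.0
# grid of t values in (4/9, infinity); t = 1 is skipped since both sides vanish
ts = [0.45 + 0.01 * i for i in range(56)] + [1.5, 2.0, 5.0, 10.0, 100.0]
for t in ts:
    if abs(t - 1.0) < 1e-9:
        continue
    for sign in (+1, -1):
        approx = (1 + sign * eps) * (t - 1.0)   # multiplicative (1 +/- eps)-approx to t - 1
        rel = abs(log_from_shifted(approx) - math.log(t)) / abs(math.log(t))
        worst = max(worst, rel)
# `worst` stays within a small constant times eps over the whole grid
```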
Tsallis entropy can be efficiently approximated both multiplicatively and additively also for $\alpha > 2$, but we omit a proof of that fact in this version of the paper.

6 Modifications for General Update Streams

The algorithms described in Section 3 and Section 5 are for the strict turnstile model. They can be extended to work in the general updates model with a few modifications.

First, we cannot efficiently and exactly compute $\|A\|_1 = F_1$ in the general update model. However, a $(1+\varepsilon)$-multiplicative approximation can be computed in $O(\varepsilon^{-2}\log m)$ bits of space by Fact 2.1. In Section 3.2 and Section 3.3, the value of $\|A\|_1$ is used as a normalization factor to scale the estimate of $F_\alpha$ to an estimate of $\sum_{i=1}^n x_i^\alpha$. (See, e.g., Eq. (3.1) and Eq. (3.5).) However,
$$\frac{\tilde F_\alpha}{(\tilde F_1)^\alpha} = \frac{(1\pm\varepsilon)\cdot F_\alpha}{\big((1\pm\varepsilon)\cdot F_1\big)^\alpha} = \big(1\pm O(\varepsilon)\big)\cdot\frac{F_\alpha}{F_1^\alpha},$$
so the fact that $F_1$ can only be approximated in the general update model affects the analysis only by increasing the constant factor that multiplies $\varepsilon$. A similar modification must also be applied to all algorithms in Section 5; we omit the details.

Next, the multiplicative algorithm of Section 3.4 needs to compute a multiplicative estimate of $T(y_i)$ using Theorem 5.6. In the general updates model, a weaker result than Theorem 5.6 holds: we obtain a multiplicative $(1+\varepsilon)$-approximation of Tsallis entropy for any $\alpha \in (0,1)\cup(1,2]$ using $\tilde O\big(\log m/(|1-\alpha|\cdot\varepsilon)^2\big)$ bits of space. The proof is identical to the argument in Appendix A.6, except that the moment estimator of Fact 2.1 uses more space, and we must use the residual moment algorithm of Section 4.3 instead of Section 4.2. Similar modifications must be made to Theorem 5.1, Theorem 5.2 and Theorem 5.8, with a commensurate increase in the space bounds.
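The error-propagation step in the displayed equation is elementary and can be verified directly: when both $F_\alpha$ and $F_1$ are known only up to a $(1\pm\varepsilon)$ factor, the normalized ratio picks up at most a $(1\pm O(\varepsilon))$ factor for $\alpha \in (0,2]$. The function below is a minimal sketch of that bookkeeping; the name and the constant 4 are illustrative.

```python
def normalized_error_bounds(alpha, eps):
    """Worst-case multiplicative error of F_alpha_hat / F1_hat**alpha when both
    F_alpha and F_1 are known only up to a (1 +/- eps) factor, relative to the
    true value F_alpha / F_1**alpha."""
    hi = (1 + eps) / (1 - eps) ** alpha   # largest possible blow-up
    lo = (1 - eps) / (1 + eps) ** alpha   # smallest possible shrinkage
    return lo, hi

# For alpha in (0, 2] the result is 1 +/- O(eps): only the constant multiplying
# eps grows, which is exactly what the displayed equation asserts.
lo, hi = normalized_error_bounds(2.0, 0.01)
```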
7 Future Research

We hope that the techniques from approximation theory that we introduce may be useful for streaming and sketching other functions. For instance, consider the following function:
$$G_{\alpha,k}(x) = \sum_i x_i^\alpha\log^k(x_i),$$
where $k \in \mathbb{N}$ and $\alpha \in [0,\infty)$. One can show that
$$\lim_{\beta\to\alpha}\frac{G_{\alpha,k}(x) - G_{\beta,k}(x)}{\alpha-\beta} = G_{\alpha,k+1}(x).$$
Note that $G_{\alpha,0}(x)$ is the $\alpha$th moment of $x$, and one can attempt to estimate $G_{\alpha,k+1}$ by computing $G_{\beta,k}$ for $\beta = \alpha$ and $\beta$ close to $\alpha$. It is not unlikely that our techniques can be generalized to estimation of the functions $G_{\alpha,k}$ for $\alpha \in (0,2]$. Can one also use our techniques for approximation of other classes of functions?

Acknowledgements

We thank Piotr Indyk and Ping Li for many helpful discussions. We also thank Jonathan Kelner for some pointers to the approximation theory literature.

References

[1] Noga Alon, Yossi Matias, and Mario Szegedy. The Space Complexity of Approximating the Frequency Moments. Journal of Computer and System Sciences, 58(1):137–147, 1999.

[2] Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, and D. Sivakumar. An information statistics approach to data stream and communication complexity. J. Comput. Syst. Sci., 68(4):702–732, 2004.

[3] Lakshminath Bhuvanagiri and Sumit Ganguly. Estimating entropy over data streams. In Proceedings of the 14th Annual European Symposium on Algorithms, pages 148–159, 2006.

[4] Lakshminath Bhuvanagiri and Sumit Ganguly. Hierarchical Sampling from Sketches: Estimating Functions over Data Streams, 2008. Manuscript.

[5] Lakshminath Bhuvanagiri, Sumit Ganguly, Deepanjan Kesh, and Chandan Saha. Simpler algorithm for estimating frequency moments of data streams. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 708–713, 2006.

[6] Amit Chakrabarti, Graham Cormode, and Andrew McGregor.
A near-optimal algorithm for computing the entropy of a stream. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 328–335, 2007.

[7] Amit Chakrabarti, Khanh Do Ba, and S. Muthukrishnan. Estimating Entropy and Entropy Norm on Data Streams. In Proceedings of the 23rd Annual Symposium on Theoretical Aspects of Computer Science (STACS), pages 196–205, 2006.

[8] Graham Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. J. Algorithms, 55(1):58–75, 2005.

[9] Thomas Cover and Joy Thomas. Elements of Information Theory. Wiley Interscience, 1991.

[10] Sumit Ganguly and Graham Cormode. On estimating frequency moments of data streams. In APPROX-RANDOM, pages 479–493, 2007.

[11] Sumit Ganguly, Deepanjan Kesh, and Chandan Saha. Practical algorithms for tracking database join sizes. In Proceedings of the 25th International Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS), pages 297–309, 2005.

[12] Yu Gu, Andrew McCallum, and Donald F. Towsley. Detecting anomalies in network traffic using maximum entropy estimation. In Internet Measurement Conference, pages 345–350, 2005.

[13] Sudipto Guha, Andrew McGregor, and Suresh Venkatasubramanian. Streaming and sublinear approximation of entropy and information distances. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 733–742, 2006.

[14] Nicholas J. A. Harvey, Jelani Nelson, and Krzysztof Onak. Streaming algorithms for estimating entropy. In Proceedings of the IEEE Information Theory Workshop, 2008.

[15] Shlomo Hoory, Nathan Linial, and Avi Wigderson. Expander graphs and their applications. Bulletin of the American Mathematical Society, 43(4):439–561, 2006.

[16] Piotr Indyk.
Stable distributions, pseudorandom generators, embeddings, and data stream computation. J. ACM, 53(3):307–323, 2006.

[17] Piotr Indyk and Andrew McGregor. Declaring independence via the sketching of sketches. In Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2008.

[18] Piotr Indyk and David P. Woodruff. Optimal approximations of the frequency moments of data streams. In Proceedings of the 37th Annual ACM Symposium on Theory of Computing (STOC), pages 202–208, 2005.

[19] A. Lakhina, M. Crovella, and C. Diot. Mining anomalies using traffic feature distributions. In Proceedings of the ACM SIGCOMM Conference, pages 217–228, 2005.

[20] Ping Li. Compressed counting. CoRR abs/0802.2305v2, 2008.

[21] Ping Li. Estimators and tail bounds for dimension reduction in $l_p$ ($0 < p \le 2$) using stable random projections. In Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 10–19, 2008.

[22] Canran Liu, Robert J. Whittaker, Keping Ma, and Jay R. Malcolm. Unifying and distinguishing diversity ordering methods for comparing communities. Population Ecology, 49(2):89–100, 2006.

[23] Noam Nisan. Pseudorandom generators for space-bounded computation. Combinatorica, 12(4):449–461, 1992.

[24] George McArtney Phillips. Interpolation and Approximation by Polynomials. Springer-Verlag, New York, 2003.

[25] George McArtney Phillips and Peter John Taylor. Theory and Applications of Numerical Analysis. Academic Press, 2nd edition, 1996.

[26] Alfréd Rényi. On measures of entropy and information. In Proc. Fourth Berkeley Symp. Math. Stat. and Probability, volume 1, pages 547–561, 1961.

[27] Carlo Ricotta, Alessandra Pacini, and Giancarlo Avena. Parametric scaling from species to growth-form diversity: an interesting analogy with multifractal functions. Biosystems, 65(2–3):179–186, 2002.
[28] Theodore J. Rivlin. An Introduction to the Approximation of Functions. Dover Publications, New York, 1981.

[29] Theodore J. Rivlin. Chebyshev Polynomials: From Approximation Theory to Algebra and Number Theory. John Wiley & Sons, 2nd edition, 1990.

[30] Werner Wolfgang Rogosinski. Some elementary inequalities for polynomials. The Mathematical Gazette, 39(327):7–12, 1955.

[31] Walter Rudin. Principles of Mathematical Analysis. McGraw-Hill, third edition, 1976.

[32] Michael E. Saks and Xiaodong Sun. Space lower bounds for distance approximation in the data stream model. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC), pages 360–369, 2002.

[33] Constantino Tsallis. Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52:479–487, 1988.

[34] Wim van Dam and Patrick Hayden. Renyi-entropic bounds on quantum communication. arXiv:quant-ph/0204093, 2002.

[35] David Woodruff. Efficient and Private Distance Approximation in the Communication and Streaming Models. PhD thesis, Massachusetts Institute of Technology, 2007.

[36] Kuai Xu, Zhi-Li Zhang, and Supratik Bhattacharyya. Profiling internet backbone traffic: behavior models and applications. In Proceedings of the ACM SIGCOMM Conference, pages 169–180, 2005.

[37] Haiquan Zhao, Ashwin Lall, Mitsunori Ogihara, Oliver Spatscheck, Jia Wang, and Jun Xu. A Data Streaming Algorithm for Estimating Entropies of OD Flows. In Proceedings of the Internet Measurement Conference (IMC), 2007.

[38] Karol Życzkowski. Rényi Extrapolation of Shannon Entropy. Open Systems & Information Dynamics, 10(3):297–310, 2003.

A Proofs

A.1 Proofs from Section 3.2

Recall that $x \in \mathbb{R}^n$ is a distribution whose smallest positive value is at least $1/m$. The key technical lemma needed is as follows.

Lemma A.1.
Let $\alpha > 1$, let $\xi = \xi(\alpha)$ denote $4(\alpha-1)H_1(x)$, and let $e(\alpha) = 2\big(\xi\log n + \xi\log(1/\xi)\big)$. Assume that $\xi(\alpha) < 1/4$. Then $H_\alpha \le H_1 \le H_\alpha + e(\alpha)$.

We require the following basic results.

Claim A.2. The following inequalities follow from convexity.
• Let $0 < y \le 1$. Then $e^y < 1 + 2y$.
• Let $y > 0$. Then $1 - y \le \log(1/y)$.
• Let $0 \le y \le 1/2$. Then $1/(1-y) \le 1 + 2y$.

Claim A.3. Let $1 \le a \le b$ and let $x \in \mathbb{R}^n$. Then $\|x\|_b \le \|x\|_a \le n^{1/a-1/b}\|x\|_b$.

Claim A.4. If $0 \le \alpha \le \beta$ then $H_\alpha \ge H_\beta$.

Claim A.5. If $\alpha > 1$ then $\log(1/\|x\|_\alpha) < (\alpha-1)\cdot H_1$.

Proof.
$$\log(1/\|x\|_\alpha) = \frac{\alpha-1}{\alpha}H_\alpha(x) < (\alpha-1)\cdot H_\alpha(x) \le (\alpha-1)\cdot H_1(x). \qquad\Box$$

Claim A.6. Let $y = (y_1,\ldots,y_n)$ and $z = (z_1,\ldots,z_n)$ be probability distributions such that $\|y-z\|_1 \le 1/2$. Then
$$|H_1(y) - H_1(z)| \le \|y-z\|_1\cdot\log\Big(\frac{n}{\|y-z\|_1}\Big).$$

Proof. See Cover and Thomas [9, 16.3.2]. $\Box$

Proof (of Lemma A.1). The first inequality follows from Claim A.4, so we focus on the second one. Define $f(\alpha) = \log\|x\|_\alpha^\alpha$ and $g(\alpha) = 1-\alpha$, so that $H_\alpha = f(\alpha)/g(\alpha)$. The derivatives are
$$f'(\alpha) = \frac{\sum_{i=1}^n x_i^\alpha\log x_i}{\|x\|_\alpha^\alpha} \quad\text{and}\quad g'(\alpha) = -1,$$
so $\lim_{\alpha\to 1}f'(\alpha)/g'(\alpha)$ exists and equals $H(x)$. Since $\lim_{\alpha\to 1}f(\alpha) = \lim_{\alpha\to 1}g(\alpha) = 0$, l'Hôpital's rule implies that $\lim_{\alpha\to 1}H_\alpha = H(x)$. A stronger version of l'Hôpital's rule is as follows.

Claim A.7. Let $f : \mathbb{R}\to\mathbb{R}$ and $g : \mathbb{R}\to\mathbb{R}$ be differentiable functions such that the following limits exist: $\lim_{\alpha\to 1}f(\alpha) = 0$, $\lim_{\alpha\to 1}g(\alpha) = 0$, and $\lim_{\alpha\to 1}f'(\alpha)/g'(\alpha) = L$. Let $\varepsilon$ and $\delta$ be such that $|\alpha-1| < \delta$ implies that $|f'(\alpha)/g'(\alpha) - L| < \varepsilon$. Then $|\alpha-1| < \delta$ also implies that $|f(\alpha)/g(\alpha) - L| < \varepsilon$.

Proof. See Rudin [31, p. 109]. $\Box$

Thus, to prove our lemma, it suffices to show that $|f'(\alpha)/g'(\alpha) - H_1| < e(\alpha)$.
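The limit computed via l'Hôpital's rule above, $\lim_{\alpha\to 1}H_\alpha = H(x)$, is easy to observe numerically with $H_\alpha = \log(\sum_i x_i^\alpha)/(1-\alpha)$ (natural logarithms throughout); this small check is illustrative only.

```python
import math

def shannon(x):
    """Shannon entropy H_1(x) in nats."""
    return -sum(p * math.log(p) for p in x if p > 0)

def renyi(x, alpha):
    """Renyi entropy H_alpha = f(alpha)/g(alpha) = log(sum x_i^alpha)/(1 - alpha)."""
    return math.log(sum(p ** alpha for p in x)) / (1.0 - alpha)

x = [0.5, 0.25, 0.125, 0.125]
h1 = shannon(x)
# the gap |H_alpha - H_1| shrinks as alpha approaches 1 from above
gaps = [abs(renyi(x, 1.0 + d) - h1) for d in (0.1, 0.01, 0.001)]
```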
(In fact, we actually need $|f'(\beta)/g'(\beta) - H_1| < e(\alpha)$ for all $\beta \in (1,\alpha]$, but this follows by monotonicity of $e(\beta)$ for $\beta \in (1,\alpha]$.)

A key concept in this proof is the "perturbed" probability distribution $x(\alpha)$, defined by $x(\alpha)_i = x_i^\alpha/\|x\|_\alpha^\alpha$. We have the following relationship:
$$\frac{f'(\alpha)}{g'(\alpha)} = \frac{\sum_{i=1}^n x_i^\alpha\log(1/x_i)}{\|x\|_\alpha^\alpha} = \frac{\sum_{i=1}^n x_i^\alpha\big(\log(1/x_i) + \log\|x\|_\alpha - \log\|x\|_\alpha\big)}{\|x\|_\alpha^\alpha} = \frac{\big(\sum_{i=1}^n x_i^\alpha\log(\|x\|_\alpha/x_i)\big) - \big(\sum_{i=1}^n x_i^\alpha\log\|x\|_\alpha\big)}{\|x\|_\alpha^\alpha}$$
$$= \frac{1}{\alpha}\sum_{i=1}^n\frac{x_i^\alpha}{\|x\|_\alpha^\alpha}\log\Big(\frac{\|x\|_\alpha^\alpha}{x_i^\alpha}\Big) - \log\|x\|_\alpha = \frac{H_1\big(x(\alpha)\big)}{\alpha} + \log(1/\|x\|_\alpha).$$
In summary, we have shown that
$$\left|\frac{f'(\alpha)}{g'(\alpha)} - \frac{H_1\big(x(\alpha)\big)}{\alpha}\right| \le \log(1/\|x\|_\alpha) \le (\alpha-1)\cdot H_1(x), \tag{A.1}$$
the last inequality following from Claim A.5. To use this bound, we observe that:
$$\left|\frac{f'(\alpha)}{g'(\alpha)} - H_1\big(x(\alpha)\big)\right| = \left|\frac{f'(\alpha)}{g'(\alpha)} - \frac{H_1\big(x(\alpha)\big)}{\alpha} + \Big(\frac{1}{\alpha}-1\Big)H_1\big(x(\alpha)\big)\right| \le \left|\frac{f'(\alpha)}{g'(\alpha)} - \frac{H_1\big(x(\alpha)\big)}{\alpha}\right| + |1/\alpha - 1|\cdot H_1\big(x(\alpha)\big).$$
We now substitute Eq. (A.1) into this expression, and use $|1/\alpha - 1| \le \alpha - 1$ (valid since $\alpha \ge 1$). This yields:
$$\left|\frac{f'(\alpha)}{g'(\alpha)} - H_1\big(x(\alpha)\big)\right| \le (\alpha-1)\cdot H_1(x) + (\alpha-1)\cdot H_1\big(x(\alpha)\big). \tag{A.2}$$
Recall that our goal is to analyze $|f'(\alpha)/g'(\alpha) - H_1(x)|$. We do this by showing that $H_1(x(\alpha)) \approx H_1(x)$, and that the right-hand side of Eq. (A.2) is at most $e(\alpha)$. This is done using Claim A.6; the key step is bounding $\|x - x(\alpha)\|_1$.

Claim A.8. Suppose that $1 < \alpha \le 1 + 1/(2\log n)$. Then $1/\|x\|_\alpha^\alpha < 1 + 3(\alpha-1)H_1(x)$.

Proof. From Claim A.3 and $\|x\|_1 = 1$, we obtain $1/\|x\|_\alpha \le n^{1-1/\alpha} < n^{\alpha-1}$. Our hypothesis on $\alpha$ implies that
$$\alpha\cdot\log(1/\|x\|_\alpha) < \alpha\cdot(\alpha-1)\log n < 2\cdot(\alpha-1)\log n \le 1. \tag{A.3}$$
Thus
$$\frac{1}{\|x\|_\alpha^\alpha} = e^{\alpha\log(1/\|x\|_\alpha)} < 1 + 2\cdot\alpha\log(1/\|x\|_\alpha) < 1 + 3(\alpha-1)H_1(x).$$
The first inequality is from Claim A.2 and Eq. (A.3), and the second from Claim A.5. $\Box$

Recall that $\xi = 4(\alpha-1)H_1(x)$.

Claim A.9. $\|x - x(\alpha)\|_1 \le \xi$.

Proof. To avoid the absolute values, we shall split the sum defining $\|x - x(\alpha)\|_1$ into two cases. For that purpose, let $S = \{i : x(\alpha)_i \ge x_i\}$. Then
$$\|x - x(\alpha)\|_1 = \sum_{i\in S}\big(x(\alpha)_i - x_i\big) + \sum_{i\notin S}\big(x_i - x(\alpha)_i\big) = \sum_{i\in S}x_i\Big(\frac{x_i^{\alpha-1}}{\|x\|_\alpha^\alpha} - 1\Big) + \sum_{i\notin S}x_i\Big(1 - \frac{x_i^{\alpha-1}}{\|x\|_\alpha^\alpha}\Big).$$
The first sum is upper-bounded using $x_i^{\alpha-1} \le 1$ and $\sum_{i\in S}x_i \le 1$. The second sum is upper-bounded using $\|x\|_\alpha^\alpha \le 1$ and $1 - x_i^{\alpha-1} \le \log(1/x_i^{\alpha-1})$ (see Claim A.2). We obtain
$$\|x - x(\alpha)\|_1 \le \Big(\frac{1}{\|x\|_\alpha^\alpha} - 1\Big) + (\alpha-1)\sum_{i\notin S}x_i\log(1/x_i) \le 3(\alpha-1)H_1(x) + (\alpha-1)H_1(x),$$
using Claim A.8. This completes the proof. $\Box$

Thus, by our assumption that $\xi(\alpha) < 1/4$, by Claim A.6, by Claim A.9, and by the fact that $x \mapsto x\log(1/x)$ is monotonically increasing for $x \in (0,1/4)$, we obtain that
$$|H_1(x) - H_1(x(\alpha))| \le \xi\log n + \xi\log(1/\xi).$$
Now we assemble the error bounds. Our result from Eq. (A.2) yields
$$\left|\frac{f'(\alpha)}{g'(\alpha)} - H_1(x)\right| \le \left|\frac{f'(\alpha)}{g'(\alpha)} - H_1(x(\alpha))\right| + |H_1(x) - H_1(x(\alpha))| \le \big((\alpha-1)H_1(x) + (\alpha-1)H_1(x(\alpha))\big) + |H_1(x) - H_1(x(\alpha))|$$
$$\le 2(\alpha-1)H_1(x) + \alpha\cdot|H_1(x) - H_1(x(\alpha))| \le 2\big(\xi\log n + \xi\log(1/\xi)\big).$$
This completes the proof. $\Box$

We now use Lemma A.1 to show that $H_\alpha \approx H_1$ if $\alpha$ is sufficiently small. 

Proof (of Theorem 3.1). First we focus on the multiplicative approximation. The lower bound is immediate from Claim A.4, so we show the upper bound. For an arbitrary $\mu \in (0,1)$, we have
$$\mu^2 < \frac{\mu}{2\log(1/\mu)} < \mu;$$
this follows since $\mu\log(1/\mu) < 1/2$ for all $\mu$.
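Claim A.9 above lends itself to a quick numeric spot-check: for a small distribution and $\alpha$ within the range required by Claim A.8, the $\ell_1$ distance between $x$ and the perturbed distribution $x(\alpha)$ stays below $\xi = 4(\alpha-1)H_1(x)$. This sketch (with illustrative function names) uses natural logarithms, matching the convexity inequalities of Claim A.2.

```python
import math

def perturbed(x, alpha):
    """The perturbed distribution x(alpha)_i = x_i**alpha / ||x||_alpha^alpha."""
    s = sum(p ** alpha for p in x)
    return [p ** alpha / s for p in x]

def check_claim_a9(x, alpha):
    """Return (||x - x(alpha)||_1, xi) where xi = 4*(alpha - 1)*H_1(x)."""
    h1 = -sum(p * math.log(p) for p in x if p > 0)
    xa = perturbed(x, alpha)
    l1 = sum(abs(a - b) for a, b in zip(x, xa))
    return l1, 4 * (alpha - 1) * h1

# alpha must satisfy 1 < alpha <= 1 + 1/(2 log n); with n = 4, alpha = 1.05 qualifies
l1, xi = check_claim_a9([0.7, 0.1, 0.1, 0.1], 1.05)
```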
Let $\tilde\mu = \mu/\big(2\log(1/\mu)\big)$. Then $\tilde\mu\log(1/\tilde\mu) < \mu$. This follows since $\mu^2 < \tilde\mu \Rightarrow 1/\tilde\mu < 1/\mu^2 \Rightarrow \log(1/\tilde\mu) < 2\log(1/\mu)$. The hypotheses of Theorem 3.1 give $\alpha = 1 + \tilde\mu/8$. Hence,
$$e(\alpha) = 8(\alpha-1)H_1\Big[\log n + \log\big(1/(4(\alpha-1)H_1)\big)\Big] \le \tilde\mu H_1\Big[\log n + \log\big(2/(\tilde\mu H_1)\big)\Big].$$
Since $H_1 \ge (\log m)/m$ for any distribution satisfying our hypotheses, this is at most
$$\tilde\mu H_1\big(\log n + \log(1/\tilde\mu) + \log m\big) \le (\log m)\mu H_1 < (\varepsilon/2)H_1,$$
since our hypotheses give $\mu = \varepsilon/(4\log m)$. Applying Lemma A.1, we obtain
$$H_1 - H_\alpha \le (\varepsilon/2)H_1 \;\Longrightarrow\; (1-\varepsilon/2)H_1 \le H_\alpha \;\Longrightarrow\; \frac{H_1}{H_\alpha} \le \frac{1}{1-\varepsilon/2} \le 1+\varepsilon,$$
the last inequality following from Claim A.2. This establishes the multiplicative approximation.

Let us now consider the above argument, replacing $\mu$ with $\nu = \varepsilon/(4\log n\log m)$. We obtain $e(\alpha) \le (\log m)\nu H_1 \le \varepsilon/4$, since $H_1 \le \log n$. Thus, the additive approximation follows directly. $\Box$

A.2 Proofs from Section 3.3

Our first task is to prove Lemma 3.5. We require a definition and two preliminary technical results. For any integer $k \ge 0$ and real number $a \ge -1$, define
$$G_k(a) = \sum_{i=1}^n x_i^{1+a}\log^k(x_i),$$
so $G_0(a) = F_{1+a}/\|A\|_1^{1+a}$. Note that $G_k^{(1)}(a) = G_{k+1}(a)$ for $k \ge 0$, and $T(a) = (1-G_0(a))/a$.

Claim A.10. The $k$th derivative of the Tsallis entropy has the following expression:
$$T^{(k)}(a) = \frac{(-1)^k k!\big(1-G_0(a)\big)}{a^{k+1}} - \sum_{j=1}^k\frac{(-1)^{k-j}k!\,G_j(a)}{a^{k-j+1}\,j!}.$$

Proof. The proof is by induction, the case $k = 0$ being trivial. So assume $k \ge 1$. Taking the derivative of the expression for $T^{(k)}(a)$ above, we obtain:
$$T^{(k+1)}(a) = \sum_{j=1}^k\left(\frac{k!(k-j+1)(-1)^{(k+1)-j}G_j(a)}{a^{(k+1)-j+1}\,j!} + \frac{k!(-1)^{k-j}G_{j+1}(a)}{a^{k-j+1}\,j!}\right) + \frac{(-1)^{k+1}(k+1)!\big(G_0(a)-1\big)}{a^{k+2}} + \frac{(-1)^k k!\,G_1(a)}{a^{k+1}}$$
$$= \sum_{j=1}^k\frac{k!(-1)^{(k+1)-j}G_j(a)}{a^{(k+1)-j+1}(j-1)!}\left(1+\frac{k-j+1}{j}\right) + \frac{G_{k+1}(a)}{a} + \frac{(-1)^{k+1}(k+1)!\big(G_0(a)-1\big)}{a^{k+2}}$$
$$= \sum_{j=1}^{k+1}\frac{(k+1)!(-1)^{(k+1)-j}G_j(a)}{a^{(k+1)-j+1}\,j!} + \frac{(-1)^{k+1}(k+1)!\big(G_0(a)-1\big)}{a^{k+2}},$$
as claimed. $\Box$

Claim A.11. Define $S_k(a) = a^{k+1}T^{(k)}(a)$. Then, for $1 \le j \le k+1$,
$$S_k^{(j)}(a) = \sum_{i=0}^{j-1}\binom{j-1}{i}\frac{k!}{(k-j+i+1)!}\,a^{k-j+i+1}\,G_{k+1+i}(a).$$
In particular, for $1 \le j \le k$, we have $\lim_{a\to 0}S_k^{(j)}(a) = 0$ and $\lim_{a\to 0}S_k^{(k+1)}(a) = k!\,G_{k+1}(0)$, so that
$$\lim_{a\to 0}T^{(k)}(a) = \frac{G_{k+1}(0)}{k+1}.$$

Proof. We prove the claim by induction on $j$. First, note
$$S_k(a) = (-1)^k k!\big(1-G_0(a)\big) - \sum_{j=1}^k\frac{a^j(-1)^{k-j}k!\,G_j(a)}{j!},$$
so that
$$S_k^{(1)}(a) = (-1)^{k-1}k!\,G_1(a) - \sum_{j=1}^k\left(-\frac{a^{(j+1)-1}(-1)^{k-(j+1)}k!\,G_{j+1}(a)}{((j+1)-1)!} + \frac{a^{j-1}(-1)^{k-j}k!\,G_j(a)}{(j-1)!}\right) = a^k G_{k+1}(a).$$
Thus, the base case holds. For the inductive step with $2 \le j \le k+1$, we have
$$S_k^{(j)}(a) = \frac{\partial}{\partial a}\left(\sum_{i=0}^{j-2}\binom{j-2}{i}\frac{k!}{(k-j+i+2)!}\,a^{k-j+i+2}\,G_{k+1+i}(a)\right)$$
$$= \sum_{i=0}^{j-2}\left(\binom{j-2}{i}\frac{k!}{(k-j+i+1)!}\,a^{k-j+i+1}\,G_{k+1+i}(a) + \binom{j-2}{i}\frac{k!}{(k-j+(i+1)+1)!}\,a^{k-j+(i+1)+1}\,G_{k+1+(i+1)}(a)\right)$$
$$= \sum_{i=0}^{j-1}\binom{j-1}{i}\frac{k!}{(k-j+i+1)!}\,a^{k-j+i+1}\,G_{k+1+i}(a).$$
The final equality holds since $\binom{j-2}{0} = \binom{j-1}{0} = 1$, $\binom{j-2}{j-2} = \binom{j-1}{j-1} = 1$, and by Pascal's formula $\binom{j-2}{i} + \binom{j-2}{i+1} = \binom{j-1}{i+1}$ for $0 \le i \le j-3$.

For $1 \le j \le k$, every term in the above sum is well-defined for $a = 0$ and contains a power of $a$ which is at least 1, so $\lim_{a\to 0}S_k^{(j)}(a) = 0$. When $j = k+1$, all terms but the first contain a power of $a$ which is at least 1, and the first term is $k!\,G_{k+1}(a)$, so $\lim_{a\to 0}S_k^{(k+1)}(a) = k!\,G_{k+1}(0)$. The claim on $\lim_{a\to 0}T^{(k)}(a)$ thus follows by writing $T^{(k)}(a) = S_k(a)/a^{k+1}$ and then applying l'Hôpital's rule $k+1$ times. $\Box$

Proof (of Lemma 3.5). We will first show that
$$\left|T^{(k)}\Big(\frac{-\varepsilon}{(k+1)\log m}\Big) - \frac{G_{k+1}(0)}{k+1}\right| \le \frac{6\varepsilon\log^k(m)H(x)}{k+1}.$$
Let $S_k(a) = a^{k+1}T^{(k)}(a)$ and note $T^{(k)}(a) = S_k(a)/a^{k+1}$. By Claim A.10, $\lim_{a\to 0}S_k(a) = 0$. Furthermore, $\lim_{a\to 0}S_k^{(j)}(a) = 0$ for all $1 \le j \le k$ by Claim A.11. Thus, when analyzing $\lim_{a\to 0}S_k^{(j)}(a)/(a^{k+1})^{(j)}$ for $0 \le j \le k$, both the numerator and the denominator approach 0 and we can apply l'Hôpital's rule (here $(a^{k+1})^{(j)}$ denotes the $j$th derivative of the function $a^{k+1}$). By $k+1$ applications of l'Hôpital's rule, we can thus say that $T^{(k)}(a)$ converges to its limit at least as quickly as $S_k^{(k+1)}(a)/(a^{k+1})^{(k+1)} = S_k^{(k+1)}(a)/(k+1)!$ does (using Claim A.7).

We note that $G_j(a)$ is nonnegative for $j$ even and nonpositive otherwise. Thus, for negative $a$, each term in the summand of the expression for $S_k^{(k+1)}(a)$ in Claim A.11 is nonnegative for odd $k$ and nonpositive for even $k$. As the analyses for even and odd $k$ are nearly identical, we focus below on odd $k$, in which case every term in the summand is nonnegative. For odd $k$, $S_k^{(k+2)}(a)$ is nonpositive, so that $S_k^{(k+1)}(a)$ is monotonically decreasing. Thus, it suffices to show that $S_k^{(k+1)}\big(-\varepsilon/((k+1)\log m)\big)/(k+1)!$ is not much larger than its limit.
$$\frac{S_k^{(k+1)}\big(\frac{-\varepsilon}{(k+1)\log m}\big)}{(k+1)!} = \frac{\sum_{i=0}^k\binom{k}{i}\frac{k!}{i!}\big(\frac{-\varepsilon}{(k+1)\log m}\big)^i\,G_{k+1+i}\big(\frac{-\varepsilon}{(k+1)\log m}\big)}{(k+1)!}$$
$$\le \frac{1+2\varepsilon}{k+1}\sum_{i=0}^k\binom{k}{i}\Big(\frac{\varepsilon}{(k+1)\log m}\Big)^i|G_{k+1+i}(0)| \le \frac{1+2\varepsilon}{k+1}\sum_{i=0}^k k^i\Big(\frac{\varepsilon}{(k+1)\log m}\Big)^i|G_{k+1+i}(0)|$$
$$\le \frac{1+2\varepsilon}{k+1}\sum_{i=0}^k\Big(\frac{\varepsilon}{\log m}\Big)^i|G_{k+1+i}(0)| \le \frac{1+2\varepsilon}{k+1}\sum_{i=0}^k\varepsilon^i|G_{k+1}(0)|$$
$$\le \frac{(1+2\varepsilon)|G_{k+1}(0)|}{k+1} + \frac{1+2\varepsilon}{k+1}\sum_{i=1}^k\varepsilon^i|G_{k+1}(0)| \le \frac{(1+2\varepsilon)|G_{k+1}(0)|}{k+1} + \frac{2}{k+1}\sum_{i=1}^k\varepsilon^i\log^k(m)H(x)$$
$$\le \frac{|G_{k+1}(0)|}{k+1} + \frac{6\varepsilon\log^k(m)H(x)}{k+1}.$$
The first inequality holds since $x_i \ge 1/m$ for each $i$, so that $x_i^{-\varepsilon/((k+1)\log m)} \le m^{\varepsilon/((k+1)\log m)} \le m^{\varepsilon/\log m} \le e^\varepsilon \le 1+2\varepsilon$ for $\varepsilon \le 1/2$. The final inequality above holds since $\varepsilon \le 1/2$. The lemma follows since $|G_{k+1}(0)| \le \log^k(m)H(x)$. $\Box$

Proof (of Lemma 3.6). Let $P_j$ denote the $j$th Chebyshev polynomial. We will prove for all $j \ge 1$ that
$$P_{j-1}(x) \le P_j(x) \le P_{j-1}(x)\Big(1+\frac{2j}{kc}\Big).$$
For the first inequality, we observe $P_{j-1} \in \mathcal{C}_j$, so we apply Fact 3.3 together with the fact that $P_j(y)$ is strictly positive for $y > 1$ for all $j$. For the second inequality, we induct on $j$. For the sake of the proof, define $P_{-1}(x) = 1$ so that the inductive hypothesis holds at the base case $j = 0$. For the inductive step with $j \ge 1$, we use the recurrence definition of $P_j(x)$ and we have
$$P_{j+1}(x) = P_j(x)\Big(1+\frac{2}{kc}\Big) + \big(P_j(x)-P_{j-1}(x)\big) \le P_j(x)\Big(1+\frac{2}{kc}\Big) + P_{j-1}(x)\frac{2j}{kc} \le P_j(x)\Big(1+\frac{2}{kc}\Big) + P_j(x)\frac{2j}{kc} = P_j(x)\Big(1+\frac{2(j+1)}{kc}\Big). \qquad\Box$$

A.3 Proofs from Section 4

Fact A.12. For any real $z > 0$, $\Gamma(z+1) = z\Gamma(z)$.

Fact A.13. For any real $z \ge 0$, $\sin(z) \le z$.

Fact A.14 (Euler's Reflection Formula). For any real $z$, $\Gamma(z)\Gamma(1-z) = \pi/\sin(\pi z)$.

Definition A.15. The function $V : \mathbb{R}^+\to\mathbb{R}$ is defined by
$$V(\alpha) = \frac{\big(\frac{2}{\pi}\Gamma(\frac{2\alpha}{3})\Gamma(\frac{1}{3})\sin(\frac{\pi\alpha}{3})\big)^3}{\big(\frac{2}{\pi}\Gamma(\frac{\alpha}{3})\Gamma(\frac{2}{3})\sin(\frac{\pi\alpha}{6})\big)^6} - 1.$$

Lemma A.16.
$$\lim_{\alpha\to 0}V(\alpha) = \frac{\Gamma(\frac{1}{3})^3}{\Gamma(\frac{2}{3})^6} - 1.$$

Proof.
Define $u(\alpha) = \Gamma(2\alpha/3)(\pi\alpha/3) = \Gamma(2\alpha/3)(2\alpha/3)(\pi/2) = \Gamma\big((2\alpha/3)+1\big)(\pi/2)$ by Fact A.12. By the continuity of $\Gamma(\cdot)$ on $\mathbb{R}^+$, $\lim_{\alpha\to 0}u(\alpha) = \Gamma(1)\pi/2 = \pi/2$. Define $f(\alpha) = \Gamma(2\alpha/3)\sin(\pi\alpha/3)$. Then $f(\alpha) \le u(\alpha)$ for all $\alpha \ge 0$ by Fact A.13, and thus $\lim_{\alpha\to 0}f(\alpha) \le \pi/2$. Now define $\ell_\delta(\alpha) = \Gamma(2\alpha/3)(1-\delta)(\pi\alpha/3)$. By the definition of the derivative and the fact that the derivative of $\sin(\alpha)$ evaluated at $\alpha = 0$ is 1, it follows that $\forall\delta > 0\;\exists\varepsilon > 0$ s.t. $0 \le \alpha < \varepsilon \Rightarrow \sin(\alpha) \ge (1-\delta)\alpha$. Thus, $\forall\delta > 0\;\exists\varepsilon > 0$ s.t. $0 \le \alpha < \varepsilon \Rightarrow \ell_\delta(\alpha) \le f(\alpha)$, and so $\forall\delta > 0$ we have that $\lim_{\alpha\to 0}f(\alpha) \ge \lim_{\alpha\to 0}\ell_\delta(\alpha) = (1-\delta)\pi/2$. Thus $\lim_{\alpha\to 0}f(\alpha) \ge \pi/2$, implying $\lim_{\alpha\to 0}f(\alpha) = \pi/2$. Similarly we can define $g(\alpha) = \Gamma(\alpha/3)\sin(\pi\alpha/6)$ and show $\lim_{\alpha\to 0}g(\alpha) = \pi/2$. Now,
$$V(\alpha) = \frac{\big(\frac{2}{\pi}\Gamma(\frac{1}{3})f(\alpha)\big)^3}{\big(\frac{2}{\pi}\Gamma(\frac{2}{3})g(\alpha)\big)^6} - 1.$$
Thus $\lim_{\alpha\to 0}V(\alpha) = \Gamma(1/3)^3/\Gamma(2/3)^6 - 1$ as claimed. $\Box$

Proof (of Lemma 4.2). Li shows in [21] that the variance of the geometric mean estimator with $k = 3$ is $V(\alpha)F_\alpha^2$. As $\Gamma(z)$ and $\sin(z)$ are continuous for $z \in \mathbb{R}^+$, so is $V(\alpha)$. Furthermore, Lemma A.16 shows that $\lim_{\alpha\to 0}V(\alpha)$ exists (and equals $\Gamma(1/3)^3/\Gamma(2/3)^6 - 1$). We define $V(0)$ to be this limit. Thus $V(\alpha)$ is continuous on $[0,2]$, and the extreme value theorem implies there exists a constant $C_{GM}$ such that $V(\alpha) \le C_{GM}$ on $[0,2]$. $\Box$

A.4 Detailed Analysis of the Geometric Mean Residual Moments Algorithm

Formally, define $R = \big\lceil\log_{1+\frac{\varepsilon}{c_1}}m\big\rceil$, and let
$$I_z = \Big\{i : \Big(1+\frac{\varepsilon}{c_1}\Big)^z \le |A_i| < \Big(1+\frac{\varepsilon}{c_1}\Big)^{z+1}\Big\}$$
for $0 \le z \le R$. Let $z^*$ satisfy $(1+\frac{\varepsilon}{c_1})^{z^*} \le |A_1| < (1+\frac{\varepsilon}{c_1})^{z^*+1}$. For $1 \le j \le r$ and $0 \le z \le R$, define $X_{j,z} = \sum_{i\in I_z}\mathbf{1}_{h_j(i)\ne h_j(1)}$. We now analyze the $j$th trial.

Claim A.17. $\mathrm{E}\big[2\cdot F_{\alpha,j,1-h_j(1)}\big] = \big(1+O(\varepsilon)\big)\cdot F^{res}_\alpha$.

Proof.
We have
$$\mathrm{E}\big[2\cdot F_{\alpha,j,1-h_j(1)}\big] = 2\cdot\mathrm{E}\Big[\sum_i|A_i|^\alpha\cdot\mathbf{1}_{h_j(i)\ne h_j(1)}\Big] = 2\cdot\sum_z\mathrm{E}\Big[\sum_{i\in I_z}|A_i|^\alpha\cdot\mathbf{1}_{h_j(i)\ne h_j(1)}\Big]$$
$$= 2\cdot\sum_z\mathrm{E}\Big[\sum_{i\in I_z}\big((1\pm\varepsilon)(1+\varepsilon)^z\big)^\alpha\cdot\mathbf{1}_{h_j(i)\ne h_j(1)}\Big] = (1\pm\varepsilon)^\alpha\cdot\sum_z(1+\varepsilon)^{z\alpha}\,\mathrm{E}[2X_{j,z}].$$
Clearly $\mathrm{E}[2\cdot X_{j,z}]$ is $|I_z|-1$ if $z = z^*$ and $|I_z|$ otherwise. Thus
$$\sum_z(1+\varepsilon)^{z\alpha}\,\mathrm{E}[2\cdot X_{j,z}] = \sum_{i\ge 2}\big((1\pm\varepsilon)|A_i|\big)^\alpha = (1\pm\varepsilon)^\alpha\cdot F^{res}_\alpha.$$
Since $\alpha < 2$, $(1\pm\varepsilon)^\alpha = 1\pm O(\varepsilon)$, so this shows the desired result. $\Box$

We now show concentration for $X_z := \frac{1}{r}\sum_{1\le j\le r}X_{j,z}$. By independence of the $h_j$'s, Chernoff bounds show that $X_z = (1\pm\varepsilon)\mathrm{E}[X_z]$ with probability at least $1-\exp(-\Theta(\varepsilon^2 r))$. This quantity is at least $1-\frac{1}{8(R+1)}$ if we choose $r = c_2\big(\varepsilon^{-2}(\log\log\|A\|_1 + \log(c_3/\varepsilon))\big)$. The good event is the event that, for all $z$, $X_z = (1\pm\varepsilon)\mathrm{E}[X_z]$; a union bound shows that this occurs with probability at least $7/8$. So suppose that the good event occurs. Then a calculation analogous to Claim A.17 shows that
$$\sum_j\frac{2}{r}\cdot F_{\alpha,j,1-h_j(1)} = (1\pm\varepsilon)^\alpha\cdot\sum_z(1+\varepsilon)^{z\alpha}\cdot 2X_z = (1\pm\varepsilon)^\alpha\cdot\sum_z(1+\varepsilon)^{z\alpha}\cdot(1\pm\varepsilon)\mathrm{E}[2X_z] = \big(1\pm O(\varepsilon)\big)\cdot F^{res}_\alpha. \tag{A.4}$$
Recall that $\tilde F^{res}_\alpha = \sum_{j=1}^r\frac{2}{r}\tilde F_{\alpha,j,1-h_j(1)}$. Since the geometric mean estimator is unbiased, we also have that
$$\mathrm{E}\big[\tilde F^{res}_\alpha\big] = \mathrm{E}\Big[\sum_j\frac{2}{r}F_{\alpha,j,1-h_j(1)}\Big]. \tag{A.5}$$
We conclude the analysis by showing that the random variable $\tilde F^{res}_\alpha$ is concentrated. By Lemma 4.2 applied to each substream, and properties of variance, we have
$$\mathrm{Var}\big[\tilde F^{res}_\alpha\big] = \frac{4}{r^2}\sum_{j=1}^r\mathrm{Var}\big[\tilde F_{\alpha,j,1-h_j(1)}\big] \le \frac{4C_{GM}}{r}\cdot\mathrm{E}\big[\tilde F_{\alpha,j,1-h_j(1)}\big]^2 \le \frac{C_{GM}}{r}\cdot\mathrm{E}\big[\tilde F^{res}_\alpha\big]^2.$$
Chebyshev's inequality therefore shows that
$$\Pr\Big[\tilde F^{res}_\alpha = (1\pm\varepsilon)\mathrm{E}\big[\tilde F^{res}_\alpha\big]\Big] \ge 1 - \frac{\mathrm{Var}\big[\tilde F^{res}_\alpha\big]}{\big(\varepsilon\cdot\mathrm{E}\big[\tilde F^{res}_\alpha\big]\big)^2} \ge 1 - \frac{C_{GM}}{\varepsilon^2 r} > 6/7,$$
by appropriate choice of constants.
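The averaging step underlying this Chebyshev bound, that the mean of $r$ independent unbiased estimates has variance $\mathrm{Var}/r$ and so concentrates as $r$ grows, can be observed in a toy simulation. The estimator below mirrors $2X_{j,z}$ (keep a random half of the items, double the kept weight); names and parameters are illustrative.

```python
import random
import statistics

def trial_estimates(weights, r, seed=0):
    """Each trial keeps a uniformly random half of the items and doubles the
    kept weight -- an unbiased estimate of sum(weights), mirroring 2*X_{j,z}."""
    rng = random.Random(seed)
    ests = []
    for _ in range(r):
        kept = [w for w in weights if rng.randrange(2) == 1]
        ests.append(2.0 * sum(kept))
    return ests

weights = [5.0, 7.0, 2.0, 3.0, 1.0]          # true total is 18.0
small = statistics.mean(trial_estimates(weights, 100))
large = statistics.mean(trial_estimates(weights, 10000))
# with more trials the mean concentrates more tightly around 18.0,
# matching the 1/(eps^2 r) failure probability in the Chebyshev step
```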
This event and the good event both occur with probability at least $3/4$. When this holds, we have
$$\tilde F^{res}_\alpha = (1\pm\varepsilon)\mathrm{E}\big[\tilde F^{res}_\alpha\big] = (1\pm\varepsilon)\mathrm{E}\Big[\sum_j\frac{2}{r}F_{\alpha,j,1-h_j(1)}\Big] = \big(1\pm O(\varepsilon)\big)\cdot F^{res}_\alpha,$$
by Eq. (A.5) and Eq. (A.4).

A.5 Proofs from Section 4.2

Proof (of Fact 4.1). Let $B = \lceil 20/\varepsilon\rceil$ be the number of bins. Let $\mathcal{H}$ be a pairwise independent family of hash functions, each function mapping $[n]$ to $[B]$. Standard constructions yield such a family with $|\mathcal{H}| = n^{O(1)}$. We will let $h$ be a randomly chosen hash function from $\mathcal{H}$. For notational simplicity, suppose that $x_1 = \max_i x_i$. Let $E_{i,j}$ be the indicator variable for the event that $h(i) = j$, so that $\mathrm{E}[E_{i,j}] = 1/B$ and $\mathrm{Var}[E_{i,j}] < 1/B$. Let $X_j$ be the random variable denoting the weight of the items that hash to bin $j$, i.e., $X_j = \sum_i x_i\cdot E_{i,j}$. Since $\sum_i x_i = 1$, we have $\mathrm{E}[X_j] = 1/B$ and $\mathrm{Var}[X_j] < \|x\|_2^2/B$.

Suppose that $x_1 \ge 1/2$. Let $Y$ be the fraction of mass that hashes to $x_1$'s bin, excluding $x_1$ itself. That is, $Y = \sum_{i\ge 2}x_i\cdot E_{i,h(1)}$. Note that $\mathrm{E}[Y] = (\sum_{i\ge 2}x_i)/B < (\varepsilon/20)\cdot(\sum_{i\ge 2}x_i)$. By Markov's inequality,
$$\Pr\Big[Y \ge \varepsilon\cdot\Big(\sum_{i\ge 2}x_i\Big)\Big] \le \Pr\big[Y \ge 16\,\mathrm{E}[Y]\big] \le 1/16.$$

Suppose that $x_1 < 1/2$. This implies, by convexity, that $\|x\|_2^2 < 1/2$. Let $\beta = \sqrt{2/3} < 5/6$. Then
$$\Pr\big[|X_j - 1/B| \ge \beta\big] \le \frac{\mathrm{Var}[X_j]}{\beta^2} < \frac{3}{4B}.$$
Thus, by a union bound, $\Pr[\exists j \text{ such that } X_j \ge \beta + 1/B] \le 3/4$.

Suppose we want to test whether $x_1 \ge 1/2$ by checking if there is a bin of mass at least $5/6$. As argued above, the failure probability of one hash function is at most $3/4$. If we choose ten independent hash functions and check that all of them have a bin of mass at least $5/6$, then the failure probability decreases to less than $1/16$. $\Box$

A.6 Proofs from Section 5

Proof (of Theorem 5.1).
Let $m_i$ be the number of times the $i$th element appears in the stream. Recall that $m$ is the length of the stream. By computing a $(1+\varepsilon')$-approximation to the $\alpha$th moment (as in Fact 2.1) and dividing by $\|A\|_1^\alpha$, we get a multiplicative approximation to $F_\alpha/\|A\|_1^\alpha = \|x\|_\alpha^\alpha$. We can thus compute the value
$$\frac{1}{1-\alpha}\log\Big((1\pm\varepsilon')\sum_{i=1}^n x_i^\alpha\Big) = \frac{1}{1-\alpha}\log\Big(\sum_{i=1}^n x_i^\alpha\Big) + \frac{\log(1\pm\varepsilon')}{1-\alpha} = H_\alpha(X) \pm \frac{\varepsilon'}{1-\alpha}.$$
Setting $\varepsilon' = \varepsilon\cdot|1-\alpha|$, we obtain an additive approximation algorithm using
$$O\left(\Big(\frac{|1-\alpha|}{\varepsilon^2\cdot|\alpha-1|^2} + \frac{1}{\varepsilon\cdot|\alpha-1|}\Big)\log m\right) = O\big(\log m/(|1-\alpha|\cdot\varepsilon^2)\big)$$
bits, as claimed. $\Box$

Proof (of Theorem 5.2). If $\alpha \in (0,1)$, then because the function $x^\alpha$ is concave, we get by Jensen's inequality
$$\sum_{i=1}^n x_i^\alpha \le n\cdot\Big(\frac{1}{n}\Big)^\alpha = n^{1-\alpha}.$$
If we compute a multiplicative $(1+(1-\alpha)\cdot\varepsilon\cdot n^{\alpha-1})$-approximation to the $\alpha$th moment, we obtain an additive $(1-\alpha)\cdot\varepsilon$-approximation to $\big(\sum_{i=1}^n x_i^\alpha\big) - 1$. This in turn gives an additive $\varepsilon$-approximation to $T_\alpha$. By Fact 2.1,
$$O\left(\Big(\frac{1-\alpha}{((1-\alpha)\cdot\varepsilon\cdot n^{\alpha-1})^2} + \frac{1}{(1-\alpha)\cdot\varepsilon\cdot n^{\alpha-1}}\Big)\log m\right) = O\big(n^{2(1-\alpha)}\log m/((1-\alpha)\varepsilon^2)\big)$$
bits of space suffice to achieve the required approximation to the $\alpha$th moment.

For $\alpha > 1$, the value $F_\alpha/\|A\|_1^\alpha$ is at most 1, so it suffices to approximate $F_\alpha$ to within a factor of $1+(\alpha-1)\cdot\varepsilon$. For $\alpha \in (1,2]$, again using Fact 2.1, we can achieve this using $O\big(\log m/((\alpha-1)\varepsilon^2)\big)$ bits of space. $\Box$

Proof (of Lemma 5.3). Consider first $\alpha \in (0,1)$. For $x \in (0,5/6]$,
$$\frac{x^\alpha}{x} = x^{\alpha-1} \ge \Big(\frac{5}{6}\Big)^{\alpha-1} \ge 1 + C_1\cdot(1-\alpha),$$
for some positive constant $C_1$. The last inequality follows from convexity of $(5/6)^y$ as a function of $y$. Hence,
$$\sum_{i=1}^n x_i^\alpha \ge \sum_{i=1}^n\big(1+C_1(1-\alpha)\big)x_i = 1 + C_1(1-\alpha),$$
and furthermore,
$$\left|1-\sum_{i=1}^n x_i^\alpha\right| = \Big(\sum_{i=1}^n x_i^\alpha\Big) - 1 \ge C_1\cdot(1-\alpha) = C_1\cdot|\alpha-1|.$$
When $\alpha \in (1,2]$, then for $x \in (0,5/6]$,
$$\frac{x^\alpha}{x} = x^{\alpha-1} \le \Big(\frac{5}{6}\Big)^{\alpha-1} \le 1 - C_2\cdot(\alpha-1),$$
for some positive constant $C_2$. This implies that
$$\sum_{i=1}^n x_i^\alpha \le \sum_{i=1}^n x_i\big(1-C_2\cdot(\alpha-1)\big) = 1 - C_2\cdot(\alpha-1),$$
and
$$\left|1-\sum_{i=1}^n x_i^\alpha\right| = 1-\sum_{i=1}^n x_i^\alpha \ge C_2\cdot(\alpha-1) = C_2\cdot|\alpha-1|.$$
To finish the proof of the lemma, we set $C = \min\{C_1,C_2\}$. $\Box$

Proof (of Lemma 5.5). We first argue that a multiplicative approximation to $|1-x_i^\alpha|$ can be obtained from a multiplicative approximation to $1-x_i$. Let $g(y) = 1-(1-y)^\alpha$. Note that $g(1-x_i) = 1-x_i^\alpha$. Since $1-x_i \in [0,1/3]$, we restrict the domain of $g$ to $[0,1/3]$. The derivative of $g$ is $g'(y) = \alpha(1-y)^{\alpha-1}$. Note that $g$ is strictly increasing for $\alpha \in (0,1)\cup(1,2]$. For $\alpha \in (0,1)$, the derivative is in the range $[\alpha,\frac{3}{2}\alpha]$. For $\alpha \in (1,2]$, it always lies in the range $[\frac{2}{3}\alpha,\alpha]$. In both cases, a $(1+\frac{2}{3}\varepsilon)$-approximation to $y$ suffices to compute a $(1+\varepsilon)$-approximation to $g(y)$.

We now consider two cases:
• Assume first that $\alpha \in (0,1)$. For any $x \in (0,1/3]$, we have
$$\frac{x^\alpha}{x} \ge \Big(\frac{1}{3}\Big)^{\alpha-1} = 3^{1-\alpha} \ge 1 + C_1(1-\alpha),$$
for some positive constant $C_1$. The last inequality follows from the convexity of the function $3^{1-\alpha}$. This means that if $x_i < 1$, then
$$\frac{\sum_{j\ne i}x_j^\alpha}{1-x_i} \ge \frac{\sum_{j\ne i}x_j\big(1+C_1(1-\alpha)\big)}{1-x_i} = \frac{(1-x_i)\big(1+C_1(1-\alpha)\big)}{1-x_i} = 1 + C_1(1-\alpha).$$
Since $x_i \le x_i^\alpha < 1$, we also have
$$\frac{\sum_{j\ne i}x_j^\alpha}{1-x_i^\alpha} \ge \frac{\sum_{j\ne i}x_j^\alpha}{1-x_i} \ge 1 + C_1(1-\alpha).$$
This implies that if we compute multiplicative $(1+(1-\alpha)\varepsilon/D_1)$-approximations to both $1-x_i^\alpha$ and $\sum_{j\ne i}x_j^\alpha$, for a sufficiently large constant $D_1$, we can compute a multiplicative $(1+\varepsilon)$-approximation of $\big(\sum_{j=1}^n x_j^\alpha\big)-1$.
• The case of $\alpha \in (1,2]$ is similar.
For any $x \in (0, 1/3]$, we have

$$\frac{x^\alpha}{x} \le \left(\frac{1}{3}\right)^{\alpha-1} \le 1 - C_2(\alpha-1),$$

for some positive constant $C_2$. Hence,

$$\frac{\sum_{j \ne i} x_j^\alpha}{1 - x_i} \le \frac{\sum_{j \ne i} x_j (1 - C_2(\alpha-1))}{1 - x_i} = \frac{(1-x_i)(1 - C_2(\alpha-1))}{1 - x_i} = 1 - C_2(\alpha-1),$$

and because $x_i^\alpha \le x_i$,

$$\frac{\sum_{j \ne i} x_j^\alpha}{1 - x_i^\alpha} \le \frac{\sum_{j \ne i} x_j^\alpha}{1 - x_i} \le 1 - C_2(\alpha-1).$$

This implies that if we compute multiplicative $(1 + (\alpha-1)\varepsilon/D_2)$-approximations to both $1 - x_i^\alpha$ and $\sum_{j \ne i} x_j^\alpha$, for a sufficiently large constant $D_2$, we can compute a multiplicative $(1+\varepsilon)$-approximation to $1 - \sum_{j=1}^n x_j^\alpha$. □

Proof (of Theorem 5.6). We run the algorithm of Section 4.1 to find out whether there is a very heavy element. This only requires $O(\log n)$ words of space.

If there is no heavy element, then by Lemma 5.3 there is a constant $C \in (0,1)$ such that $|1 - \sum_i x_i^\alpha| \ge C|\alpha-1|$. We want to compute a multiplicative approximation to $|1 - \sum_i x_i^\alpha|$, and we know that the difference between $\sum_i x_i^\alpha$ and 1 is large. If we compute a multiplicative $(1 + \frac{1}{2}|\alpha-1| C \varepsilon)$-approximation to $\sum_i x_i^\alpha$, we obtain an additive $(\frac{1}{2}|\alpha-1| C \varepsilon \sum_i x_i^\alpha)$-approximation to $\sum_i x_i^\alpha$. If $\sum_i x_i^\alpha \le 2$, then

$$\frac{\frac{1}{2}|\alpha-1| C \varepsilon \sum_i x_i^\alpha}{\left|1 - \sum_i x_i^\alpha\right|} \le \frac{|\alpha-1| C \varepsilon}{C |\alpha-1|} = \varepsilon.$$

If $\sum_i x_i^\alpha \ge 2$, then $|1 - \sum_i x_i^\alpha| = \sum_i x_i^\alpha - 1 \ge \frac{1}{2}\sum_i x_i^\alpha$, so

$$\frac{\frac{1}{2}|\alpha-1| C \varepsilon \sum_i x_i^\alpha}{\left|1 - \sum_i x_i^\alpha\right|} \le \frac{1}{2}|\alpha-1| C \varepsilon \cdot 2 \le \varepsilon.$$

In either case, we obtain a multiplicative $(1+\varepsilon)$-approximation to $|1 - \sum_i x_i^\alpha|$, which in turn yields a multiplicative approximation to the Tsallis entropy. We now need to bound the amount of space we use in this case. We use the estimator of Fact 2.1, which uses $O(\log m / (|\alpha-1|\varepsilon^2))$ bits in our case.

Let us focus now on the case when there is a heavy element. By Lemma 5.5 it suffices to approximate $F^{\mathrm{res}}_1$ and $F^{\mathrm{res}}_\alpha$, which we can do using the algorithm of Section 4.2. The number of bits required is

$$O\left( \frac{\log m}{\varepsilon \cdot |\alpha-1|} \right) + \tilde{O}\left( \frac{|\alpha-1| \cdot \log m}{(\varepsilon \cdot |\alpha-1|)^2} \right) = \tilde{O}\left( \frac{\log m}{\varepsilon^2 \cdot |\alpha-1|} \right). \qquad □$$

Proof (of Lemma 5.7). For $t \in [4/9, 1]$, the derivative of the logarithm function lies in the range $[a, b]$, where $a$ and $b$ are constants such that $0 < a < b$. This implies that in this case, a $(1+\varepsilon)$-approximation to $1-t$ gives a $(1 + \frac{b}{a}\varepsilon)$-approximation to $-\log(t)$. We are given $y \in [1-t, (1+\varepsilon)(1-t)]$, and we can assume that $y \in [1-t, \min\{5/9, (1+\varepsilon)(1-t)\}]$. We have $-\log(t) \le -\log(1-y)$, and

$$\frac{-\log(1-y)}{-\log(t)} \le \frac{-\log(1-(1+\varepsilon)(1-t))}{-\log(t)} = \frac{-\log(t-\varepsilon(1-t))}{-\log(t)} = 1 + \frac{-\log(t-\varepsilon(1-t)) + \log(t)}{-\log(t)}$$
$$\le 1 + \frac{\varepsilon(1-t)\cdot \max_{z \in [\max\{t-\varepsilon(1-t),\,4/9\},\, t]} (\log(z))'}{(1-t)\cdot \min_{z \in [4/9,\,1]} (\log(z))'} \le 1 + \frac{\varepsilon(1-t)\cdot b}{(1-t)\cdot a} = 1 + \frac{b}{a}\varepsilon.$$

Consider now $t > 1$. We are given $y \in [t-1, (1+\varepsilon)(t-1)]$, and we have

$$\log(t) \le \log(y+1) \le \log((1+\varepsilon)(t-1) + 1).$$

Furthermore,

$$\frac{\log((1+\varepsilon)(t-1)+1)}{\log(t)} = 1 + \frac{\log(t+(t-1)\varepsilon) - \log(t)}{\log(t)} = 1 + \frac{\int_t^{t+(t-1)\varepsilon} (\log(z))'\, dz}{\int_1^t (\log(z))'\, dz}$$
$$\le 1 + \frac{(t-1)\varepsilon \cdot \max_{z \in [t,\, t+(t-1)\varepsilon]} (\log(z))'}{(t-1) \cdot \min_{z \in [1,\,t]} (\log(z))'} = 1 + \frac{(t-1)\varepsilon \cdot (1/t)}{(t-1) \cdot (1/t)} = 1 + \varepsilon,$$

where the last inequality uses the fact that $(\log(z))' = 1/z$ attains its maximum over $[t, t+(t-1)\varepsilon]$ and its minimum over $[1,t]$ at the common point $z = t$. Hence, we get a good multiplicative approximation to $\log(t)$. □

Proof (of Theorem 5.8). We use the algorithm of Section 4.1 to check whether there is a single element of high frequency. This only requires $O(\log m)$ bits of space. If there is no element of frequency greater than $5/6$, then the Rényi entropy for any $\alpha$ is greater than the min-entropy $H_\infty = -\log \max_i x_i \ge \log(6/5)$.
Therefore, in this case it suffices to run the additive approximation algorithm with $\varepsilon' = \log(6/5)\,\varepsilon$ to obtain a sufficiently good estimate. To run that algorithm, we use $O\left( \frac{\log m}{|1-\alpha|\,\varepsilon^2} \right)$ bits of space.

Let us consider the other case, when there is an element of frequency at least $2/3$. For $\alpha \in (1,2]$, we have

$$\left(\frac{2}{3}\right)^2 \le \sum_i x_i^\alpha \le 1,$$

and for $\alpha \in (0,1)$, $\sum_{i=1}^n x_i^\alpha \ge 1$. Therefore, by Lemma 5.7, it suffices to compute a multiplicative approximation to $|1 - \sum_i x_i^\alpha|$, which we can do by Lemma 5.5. By the algorithms from Section 4.3 and Section 4.2, we can compute the multiplicative $(1 + \Theta(|1-\alpha|\varepsilon))$-approximations required by Lemma 5.5 with the same space complexity as for the approximation of Tsallis entropy (see the proof of Theorem 5.6). □

Proof (of Theorem 5.9). The proof is nearly identical to that of Theorem 3.1 in [2]. We need merely observe that if $\tilde{H}_\alpha$ is a $(1+\varepsilon)$-approximation to $H_\alpha$, then $m^\alpha (1+\varepsilon)\, 2^{(1-\alpha)\tilde{H}_\alpha}$ is a multiplicative $m^{\alpha\varepsilon}$-approximation to $F_\alpha$. From here, we set $t = c\, m^{\varepsilon} n^{1/\alpha}$ and argue identically as in [2] via a reduction from $t$-party disjointness; we omit the details. □
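The extrapolation step underlying Theorem 5.1 can be checked numerically: a $(1+\varepsilon')$-multiplicative estimate of the moment $\sum_i x_i^\alpha$ becomes an additive $\varepsilon$-approximation of $H_\alpha$ once $\varepsilon' = \varepsilon \cdot |1-\alpha|$. The following Python sketch is illustrative only (the distribution, parameter values, and function name are our own choices, not from the paper; natural logarithms are used for simplicity):

```python
import math

def renyi_from_moment(moment, alpha):
    # H_alpha(x) = log(sum_i x_i^alpha) / (1 - alpha), here with the natural log.
    return math.log(moment) / (1.0 - alpha)

# Illustrative distribution and parameters (not taken from the paper).
x = [0.5, 0.25, 0.125, 0.125]
alpha = 0.5
eps = 0.1
eps_prime = eps * abs(1.0 - alpha)  # the setting epsilon' = epsilon * |1 - alpha|

exact_moment = sum(xi ** alpha for xi in x)
h_exact = renyi_from_moment(exact_moment, alpha)

# A (1 + eps')-factor error on the moment shifts H_alpha by exactly
# log(1 + eps') / (1 - alpha) <= eps' / |1 - alpha| = eps.
h_noisy = renyi_from_moment(exact_moment * (1.0 + eps_prime), alpha)
assert abs(h_noisy - h_exact) <= eps
```

The point of the calculation is that the multiplicative error on the moment turns into a purely additive shift of the entropy, independent of the distribution itself, which is why the proof can trade a $(1+\varepsilon')$ moment estimator for an additive entropy guarantee.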
