Tight Bounds for Hashing Block Sources

Authors: Kai-Min Chung, Salil Vadhan

Tight Bounds for Hashing Block Sources∗

Kai-Min Chung† (Harvard University) and Salil Vadhan‡ (Harvard University)

August 19, 2021

Abstract

It is known that if a 2-universal hash function H is applied to elements of a block source (X_1, . . . , X_T), where each item X_i has enough min-entropy conditioned on the previous items, then the output distribution (H, H(X_1), . . . , H(X_T)) will be "close" to the uniform distribution. We provide improved bounds on how much min-entropy per item is required for this to hold, both when we ask that the output be close to uniform in statistical distance and when we only ask that it be statistically close to a distribution with small collision probability. In both cases, we reduce the dependence of the min-entropy on the number T of items from 2 log T in previous work to log T, which we show to be optimal. This leads to corresponding improvements to the recent results of Mitzenmacher and Vadhan (SODA '08) on the analysis of hashing-based algorithms and data structures when the data items come from a block source.

1 Introduction

A block source is a sequence of items X = (X_1, . . . , X_T) in which each item has at least some k bits of "entropy" conditioned on the previous ones [CG88]. Previous works [CG88, Zuc96, MV08] have analyzed what happens when one applies a 2-universal hash function to each item in such a sequence, establishing results of the following form:

Block-Source Hashing Theorems (informal): If (X_1, . . . , X_T) is a block source with k bits of "entropy" per item and H is a random hash function from a 2-universal family mapping to m ≪ k bits, then (H(X_1), . . . , H(X_T)) is "close" to the uniform distribution.

∗ An extended abstract of this paper will appear in RANDOM '08 [CV08].
† Work done when visiting U.C. Berkeley, supported by US-Israel BSF grant 2002246 and NSF grant CNS-0430336.
‡ Work done when visiting U.C. Berkeley, supported by the Miller Institute for Basic Research in Science, a Guggenheim Fellowship, US-Israel BSF grant 2006060, and ONR grant N00014-04-1-0478.

In this paper, we prove new results of this form, achieving improved (in some cases, optimal) bounds on how much entropy k per item is needed to ensure that the output is close to uniform, as a function of the other parameters (the output length m of the hash functions, the number T of items, and the "distance" from the uniform distribution). But first we discuss the two applications that have motivated the study of Block-Source Hashing Theorems.

1.1 Applications of Block-Source Hashing

Randomness Extractors. A randomness extractor is an algorithm that extracts almost-uniform bits from a source of biased and correlated bits, using a short seed of truly random bits as a catalyst [NZ96]. Extractors have many applications in theoretical computer science and have played a central role in the theory of pseudorandomness. (See the surveys [NT99, Sha04, Vad07].) Block-Source Hashing Theorems immediately yield methods for extracting randomness from block sources, where the seed is used to specify a universal hash function. The gain over hashing the entire T-tuple at once is that the blocks may be much shorter than the entire sequence, and thus a much shorter seed is required to specify the universal hash function. Moreover, many subsequent constructions of extractors for general sources (without the block structure) work by first converting the source into a block source and performing block-source hashing.

Analysis of Hashing-Based Algorithms. The idea of hashing has been widely applied in designing algorithms and data structures, including hash tables [Knu98], Bloom filters [BM03], summary algorithms for data streams [Mut03], etc. Given a stream of data items (x_1, . . .
, x_T), we first hash the items into (H(x_1), . . . , H(x_T)), and carry out a computation using the hashed values. In the literature, the analysis of a hashing algorithm is typically a worst-case analysis over the input data items, and the best results are often obtained by unrealistically modelling the hash function as a truly random function mapping the items to uniform and independent m-bit strings. On the other hand, for realistic, efficiently computable hash functions (e.g., 2-universal or O(1)-wise independent hash functions), the provable performance is sometimes significantly worse. However, such gaps seem not to show up in practice, and even standard 2-universal hash functions empirically seem to match the performance of truly random hash functions. To explain this phenomenon, Mitzenmacher and Vadhan [MV08] have suggested that the discrepancy is due to worst-case analysis, and propose to instead model the input items as coming from a block source. Then Block-Source Hashing Theorems imply that the performance of universal hash functions is close to that of truly random hash functions, provided that each item has enough bits of entropy.

1.2 How Much Entropy is Required?

A natural question about Block-Source Hashing Theorems is: how large does the "entropy" k per item need to be to ensure a certain amount of "closeness" to uniform (where both the entropy and the closeness can be measured in various ways)? This also has practical significance for the latter motivation regarding hashing-based algorithms, as it corresponds to the amount of entropy we need to assume in the data items. In [MV08], bounds are provided on the entropy required for two measures of closeness, and these bounds are used as basic tools to bound the required entropy in various applications.
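As a concrete illustration of the per-item hashing pipeline from Section 1.1, the sketch below draws a single hash function at random and applies it to every item of a stream. The Carter-Wegman modular family, and all identifiers in the snippet, are illustrative assumptions, not constructions from this paper.

```python
import random

# h_{a,b}(x) = ((a*x + b) mod p) mod M is the classic Carter-Wegman family;
# for a prime p it is (close to) 2-universal.  Chosen here for illustration.
P = 2**61 - 1  # a Mersenne prime, larger than any item hashed below

def sample_hash(m, rng):
    """Draw one random function from the family, mapping into [2^m]."""
    a = rng.randrange(1, P)
    b = rng.randrange(P)
    return lambda x: ((a * x + b) % P) % (2 ** m)

def hash_stream(items, m, rng):
    """The block-source hashing pipeline: ONE shared random h is applied
    to every item, giving the sequence (h(x_1), ..., h(x_T))."""
    h = sample_hash(m, rng)
    return [h(x) for x in items]

rng = random.Random(0)
stream = [rng.getrandbits(40) for _ in range(8)]  # T = 8 items of n = 40 bits
hashed = hash_stream(stream, m=16, rng=rng)       # each value is an m-bit string
```

A worst-case input can make any single fixed h behave badly; the point of the block-source model is that for data with enough per-item entropy, a random h from even such a simple family behaves almost like a truly random function.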
The requirement is usually some small constant multiple of log T, where T is the number of items in the source, which can be on the borderline between a reasonable and an unreasonable assumption about real-life data. Therefore, it is interesting to pin down the optimal answers to these questions. In what follows, we first summarize the previous results, and then discuss our improved analysis and corresponding lower bounds.

A standard way to measure the distance of the output from the uniform distribution is by statistical distance.¹ In the randomness extractor literature, classic results [CG88, ILL89, Zuc96] show that using 2-universal hash functions, k = m + 2 log(T/ε) + O(1) bits of min-entropy (or even Rényi entropy)² per item is sufficient for the output distribution to be ε-close to uniform in statistical distance. Sometimes a less stringent closeness requirement suffices, where we only require that the output distribution is ε-close to a distribution having "small" collision probability.³ A result of [MV08] shows that k = m + 2 log T + log(1/ε) + O(1) suffices to achieve this requirement. Using 4-wise independent hash functions, [MV08] further reduce the required entropy to k = max{m + log T, (1/2)(m + 3 log T + log(1/ε))} + O(1).

Our Results. We reduce the entropy required in the previous results, as summarized in Table 1. Roughly speaking, we save an additive log T bits of min-entropy (or Rényi entropy) in all cases. We show that using 2-universal hash functions, k = m + log T + 2 log(1/ε) + O(1) bits per item is sufficient for the output to be ε-close to uniform, and k = m + log(T/ε) + O(1) is enough for the output to be ε-close to having small collision probability. Using 4-wise independent hash functions, the required entropy further reduces to k = max{m + log T, (1/2)(m + 2 log T + log(1/ε))} + O(1).
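The measures appearing in these bounds (min-entropy, Rényi entropy, collision probability, and statistical distance; defined precisely in the footnotes below) can be computed exactly for small distributions. A minimal sketch; the biased 2-bit source is our own toy example, not data from the paper:

```python
import math

def min_entropy(p):
    """H_inf(X) = min_x log(1/Pr[X=x])."""
    return -math.log2(max(p.values()))

def collision_prob(p):
    """cp(X) = sum_x Pr[X=x]^2."""
    return sum(q * q for q in p.values())

def renyi_entropy(p):
    """H_2(X) = log(1/cp(X)); never smaller than the min-entropy."""
    return -math.log2(collision_prob(p))

def stat_dist(p, q):
    """Delta(X,Y) = max_T |Pr[X in T] - Pr[Y in T]| = (1/2) sum_x |p(x)-q(x)|."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in keys)

X = {0: 0.4, 1: 0.2, 2: 0.2, 3: 0.2}   # a slightly biased 2-bit item
U = {x: 0.25 for x in range(4)}        # uniform over [M], M = 4
```

For this X, cp(X) = 0.28, so H_2(X) ≈ 1.84 while H_inf(X) ≈ 1.32; all the theorems quoted above need only the weaker Rényi-entropy guarantee.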
The results hold even if we consider the joint distribution (H, H(X_1), . . . , H(X_T)) (corresponding to "strong extractors" in the literature on randomness extractors). Substituting our improved bounds in the analysis of hashing-based algorithms from [MV08], we obtain similar reductions in the min-entropy required for every application with 2-universal hashing. With 4-wise independent hashing, we obtain a slight improvement for Linear Probing, and for the

¹ The statistical distance of two random variables X and Y is ∆(X, Y) = max_T |Pr[X ∈ T] − Pr[Y ∈ T]|, where T ranges over all possible events.
² The min-entropy of a random variable X is H∞(X) = min_x log(1/Pr[X = x]). All of the results mentioned actually hold for the less stringent measure of Rényi entropy H_2(X) = log(1/E_{x←X}[Pr[X = x]]).
³ The collision probability of a random variable X is Σ_x Pr[X = x]². By "small collision probability," we mean that the collision probability is within a constant factor of the collision probability of the uniform distribution.

Setting | Previous Results | Our Results
2-universal hashing, ε-close to uniform | m + 2 log T + 2 log(1/ε) [CG88, ILL89, Zuc96] | m + log T + 2 log(1/ε)
2-universal hashing, ε-close to small cp. | m + 2 log T + log(1/ε) [MV08] | m + log T + log(1/ε)
4-wise indep. hashing, ε-close to small cp. | max{m + log T, (1/2)(m + 3 log T + log(1/ε))} [MV08] | max{m + log T, (1/2)(m + 2 log T + log(1/ε))}

Table 1: Our Results. Each entry denotes the min-entropy (actually, Rényi entropy) required per item when hashing a block source of T items to m-bit strings to ensure that the output has statistical distance at most ε from uniform (or from having collision probability within a constant factor of uniform). Additive constants are omitted for readability.
other applications, we show that the previous bounds can already be achieved with 2-universal hashing. The results are summarized in Table 2.

Although the log T improvement seems small, we remark that it could be significant for practical settings of parameters. For example, suppose we want to hash 64 thousand internet traffic flows, so log T ≈ 16. Each flow is specified by the 32-bit IP addresses and 16-bit port numbers for the source and destination, plus the 8-bit transport protocol, for a total of 104 bits. There is a noticeable difference between assuming that each flow contains 3 log T ≈ 48 vs. 4 log T ≈ 64 bits of entropy, as the flows are only 104 bits long and are very structured.

We also prove corresponding lower bounds showing that our upper bounds are almost tight. Specifically, we show that when the data items do not have enough entropy, the joint distribution (H, H(X_1), . . . , H(X_T)) can be "far" from uniform. More precisely, we show that if k = m + log T + 2 log(1/ε) − O(1), then there exists a block source (X_1, . . . , X_T) with k bits of min-entropy per item such that the distribution (H, H(X_1), . . . , H(X_T)) is ε-far from uniform in statistical distance (for H coming from any hash family). This matches our upper bound up to an additive constant. Similarly, we show that if k = m + log T − O(1), then there exists a block source (X_1, . . . , X_T) with k bits of min-entropy per item such that the distribution (H, H(X_1), . . . , H(X_T)) is 0.99-far from having small collision probability (for H coming from any hash family).
This matches our upper bound up to an additive constant in the case that the statistical distance parameter ε is constant. We also exhibit a specific 2-universal family for which the log(1/ε) term in our upper bound is nearly tight: it cannot be reduced below log(1/ε) − log log(1/ε). Finally, we extend all of our lower bounds to the case where we only consider the distribution of hashed values (H(X_1), . . . , H(X_T)), rather than their joint distribution with H. For this case, the lower bounds are necessarily reduced by a term that depends on the size of the hash family. (For standard constructions of universal hash functions, this amounts to log n bits of entropy, where n is the bit-length of an individual item.)

Type of Hash Family | Previous Results [MV08] | Our Results
Linear Probing, 2-universal hashing | 4 log T | 3 log T
Linear Probing, 4-wise independence | 2.5 log T | 2 log T
Balanced Allocations with d Choices, 2-universal hashing | (d + 2) log T | (d + 1) log T
Balanced Allocations with d Choices, 4-wise independence | (d + 1) log T | —
Bloom Filters, 2-universal hashing | 4 log T | 3 log T
Bloom Filters, 4-wise independence | 3 log T | —

Table 2: Applications. Each entry denotes the min-entropy (actually, Rényi entropy) required per item to ensure that the performance of the given application is "close" to the performance when using truly random hash functions. In all cases, the bounds omit additive terms that depend on how close a performance is desired, and we restrict to the (standard) case that the size of the hash table is linear in the number of items being hashed, i.e., m = log T + O(1).

Techniques. At a high level, all of the previous analyses of hashing block sources were loose due to summing error probabilities over the T blocks. Our improvements come from avoiding this linear blow-up by choosing more refined measures of error.
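The common starting point of these analyses, the Leftover Hash Lemma guarantee that 2-universal hashing keeps the conditional collision probability near 1/M (stated as Lemma 3.2 in Section 3), can be verified exactly by enumerating a small hash family. The affine family over GF(2) below is a standard pairwise-independent family, used here purely as an illustrative stand-in:

```python
from itertools import product

n, m = 3, 2                     # 3-bit items hashed to 2-bit outputs
N, M = 2 ** n, 2 ** m

def affine_hash(A, b, x):
    """h_{A,b}(x) = Ax + b over GF(2); A is a tuple of m row-masks, b an m-bit vector."""
    y = 0
    for i, row in enumerate(A):
        bit = bin(row & x).count("1") & 1       # row . x  (inner product mod 2)
        y |= (bit ^ ((b >> i) & 1)) << i
    return y

# Enumerate the ENTIRE affine family: 2^(m*n) matrices times 2^m shifts.
# This family is pairwise independent, hence 2-universal.
family = [(A, b) for A in product(range(N), repeat=m) for b in range(M)]

support = [0, 3, 5, 6]          # X uniform on K = 4 items  =>  cp(X) = 1/K
K = len(support)

def cp_of_hash(A, b):
    mass = [0.0] * M
    for x in support:
        mass[affine_hash(A, b, x)] += 1.0 / K
    return sum(w * w for w in mass)

# cp(H(X) | H): average collision probability over the whole family.
cond_cp = sum(cp_of_hash(A, b) for A, b in family) / len(family)
```

Pairwise independence gives the exact value cp(H(X) | H) = 1/M + cp(X)·(1 − 1/M) = 7/16 here, comfortably below the Leftover Hash Lemma bound 1/M + 1/K = 1/2.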
For example, when we want the output to have small statistical distance from uniform, the classic Leftover Hash Lemma [ILL89] says that min-entropy k = m + 2 log(1/ε_0) suffices for a single hashed block to be ε_0-close to uniform, and then a "hybrid argument" implies that the joint distribution of T hashed blocks is Tε_0-close to uniform [Zuc96]. Setting ε_0 = ε/T, this leads to a min-entropy requirement of k = m + 2 log(1/ε) + 2 log T per block. We obtain a better bound, reducing 2 log T to log T, by using Hellinger distance to analyze the error accumulation over blocks, and only passing to statistical distance at the end.

For the case where we only want the output to be close to having small collision probability, the previous analysis of [MV08] worked by first showing that the expected collision probability of each hashed block h(X_i) is "small" even conditioned on the previous blocks, then using Markov's Inequality to deduce that each hashed block has small collision probability except with some probability ε_0, and finally doing a union bound to deduce that all hashed blocks have small collision probability except with probability Tε_0. We avoid the union bound by working with more refined notions of "conditional collision probability," which enable us to apply Markov's Inequality to the entire sequence rather than to each block individually.

The starting point for our negative results is the tight lower bound for randomness extractors due to Radhakrishnan and Ta-Shma [RT00]. Their methods show that if the min-entropy parameter k is not large enough, then for any hash family, there exists a (single-block) source X such that h(X) is "far" from uniform (in statistical distance) for "many" hash functions h. We then take our block source (X_1, . . .
, X_T) to consist of T i.i.d. copies of X, and argue that the statistical distance from uniform grows sufficiently fast with the number T of copies taken. For example, we show that if two distributions have statistical distance ε, then their T-fold products have statistical distance Ω(min{1, √T · ε}), strengthening a previous bound of Reyzin [Rey04], who proved a bound of Ω(min{ε^{1/3}, √T · ε}).

2 Preliminaries

Notations. All logs are base 2. We use the convention that N = 2^n, K = 2^k, and M = 2^m. We think of a data item X as a random variable over [N] = {1, . . . , N}, which can be viewed as the set of n-bit strings. A hash function h : [N] → [M] hashes an item to an m-bit string. A hash function family H is a multiset of hash functions, and H will usually denote a uniformly random hash function drawn from H. U_[M] denotes the uniform distribution over [M]. Let X = (X_1, . . . , X_T) be a sequence of data items. We use X_{<i} to denote the prefix (X_1, . . . , X_{i−1}). A block K-source is a sequence X = (X_1, . . . , X_T) in which every conditional distribution (X_i | X_{<i} = x_{<i}) has collision probability at most 1/K (equivalently, Rényi entropy at least k = log K).

Theorem 3.1 Let H : [N] → [M] be a random hash function from a 2-universal family H, and let X = (X_1, . . . , X_T) be a block K-source over [N]^T. For every ε > 0, the hashed sequence (H, Y) = (H, H(X_1), . . . , H(X_T)) is ε-close to a distribution (H, Z) = (H, Z_1, . . . , Z_T) such that

cp(H, Z) ≤ (1/(|H| · M^T)) · (1 + M/(Kε))^T.

In particular, if K ≥ MT/ε, then (H, Z) has collision probability at most (1 + 2MT/(Kε))/(|H| · M^T).

To analyze the distribution of the hashed sequence (H, Y), the starting point is the following version of the Leftover Hash Lemma [BBR85, ILL89], which says that when we hash a random variable X with enough entropy using a 2-universal hash function H, the conditional collision probability of H(X) given H is small.

Lemma 3.2 (The Leftover Hash Lemma) Let H : [N] → [M] be a random hash function from a 2-universal family H. Let X be a random variable over [N] with cp(X) ≤ 1/K. Then cp(H(X) | H) ≤ 1/M + 1/K.

We now sketch how the hashed block source Y = (Y_1, . . .
, Y_T) = (H(X_1), . . . , H(X_T)) is analyzed in [MV08], and how we improve the analysis. The following natural approach is taken in [MV08]. Since the data X is a block K-source, the Leftover Hash Lemma tells us that for every block i ∈ [T], conditioned on the previous blocks X_{<i} = x_{<i}, the hashed block H(X_i) has small conditional collision probability in expectation. Applying Markov's Inequality, the "bad" prefixes on which some conditional collision probability is large occur with probability at most ε, and one defines (H, Z) from (H, Y) by redrawing the blocks that follow the longest good prefix (h, y_1, . . . , y_j) uniformly at random:

• Choose w_{j+1}, . . . , w_T ← U_[M], and output (h, y_1, . . . , y_j, w_{j+1}, . . . , w_T).

It is easy to check that (i) (H, Z) is well-defined, (ii) (H, Y) is ε-close to (H, Z), and (iii) for every (h, z) in the support of (H, Z), the average conditional collision probability (1/T) · Σ_{i=1}^T cp(Z_i | (H, Z_{<i}) = (h, z_{<i})) is small.

For 4-wise independent hash functions, the corresponding bound is the following.

Theorem 3.5 Let H : [N] → [M] be a random hash function from a 4-wise independent family H, and let X = (X_1, . . . , X_T) be a block K-source over [N]^T. For every ε > 0, the hashed sequence (H, Y) = (H, H(X_1), . . . , H(X_T)) is ε-close to a distribution (H, Z) = (H, Z_1, . . . , Z_T) such that

cp(H, Z) ≤ (1/(|H| · M^T)) · (1 + M/K + √(2M/(K²ε)))^T.

In particular, if K ≥ MT + √(2MT²/ε), then (H, Z) has collision probability at most (1 + γ)/(|H| · M^T) for γ = 2 · (MT + √(2MT²/ε))/K.

The improvement of Theorem 3.5 over Theorem 3.1 comes from the fact that when we use 4-wise independent hash families, we have a concentration result on the conditional collision probability for each block, via the following lemma.

Lemma 3.6 ([MV08]) Let H : [N] → [M] be a random hash function from a 4-wise independent family H, and let X be a random variable over [N] with cp(X) ≤ 1/K. Then

Var_{h←H}[cp(h(X))] ≤ 2M/K².

We can then replace the application of Markov's Inequality in the proof of Theorem 3.1 by Chebyshev's Inequality to get a stronger result. Formally, we prove the following lemma, which suffices to prove Theorem 3.5.

Lemma 3.7 Let H : [N] → [M] be a random hash function from a 4-wise independent family H. Let X = (X_1, . . . , X_T) be a block K-source over [N]^T. Let (H, Y) = (H, H(X_1), . . . , H(X_T)).
Then with probability at least 1 − ε over (h, y) ← (H, Y), the average conditional collision probability (1/T) · Σ_{i=1}^T cp(Y_i | (H, Y_{<i}) = (h, y_{<i})) is small.

For statistical distance, our improved upper bound is the following.

Theorem 3.8 Let H : [N] → [M] be a random hash function from a 2-universal family H, and let X = (X_1, . . . , X_T) be a block K-source over [N]^T. For every ε > 0 such that K > MT/ε², the hashed sequence (H, Y) = (H, H(X_1), . . . , H(X_T)) is ε-close to uniform (H, U_[M]^T).

Recall that the previous analysis goes by passing to statistical distance first, and then measuring the growth of distance using statistical distance. This incurs a quadratic dependency of K on T. Since, without further information, the hybrid argument is tight, to save a factor of T we have to measure the increase of distance over the blocks in another way, and pass to statistical distance only at the end. It turns out that the Hellinger distance (cf. [GS02]) is a good measure for our purposes:

Definition 3.9 (Hellinger distance) Let X and Y be two random variables over [M]. The Hellinger distance between X and Y is

d(X, Y) := ((1/2) · Σ_i (√Pr[X = i] − √Pr[Y = i])²)^{1/2} = √(1 − Σ_i √(Pr[X = i] · Pr[Y = i])).

Like statistical distance, Hellinger distance is a distance measure on distributions, and it takes values in [0, 1]. The following standard lemma says that the two distance measures are closely related. We remark that the lemma is tight in both directions even if Y is the uniform distribution.

Lemma 3.10 (cf. [GS02]) Let X and Y be two random variables over [M]. We have

d(X, Y)² ≤ ∆(X, Y) ≤ √2 · d(X, Y).

In particular, the lemma allows us to upper-bound the statistical distance by upper-bounding the Hellinger distance. Since our goal is to bound the distance to uniform, it is convenient to introduce the following definition.

Definition 3.11 (Hellinger Closeness to Uniform) Let X be a random variable over [M]. The Hellinger closeness of X to uniform U_[M] is

C(X) := (1/M) · Σ_i √(M · Pr[X = i]) = 1 − d(X, U_[M])².
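Both directions of Lemma 3.10 can be spot-checked numerically; the sketch below uses the second form of Definition 3.9 (illustrative code, not from the paper):

```python
import math
import random

def stat_dist(p, q):
    # Delta(X, Y) = (1/2) * sum_i |p_i - q_i|
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def hellinger(p, q):
    # d(X, Y) = sqrt(1 - sum_i sqrt(p_i * q_i))   (Definition 3.9)
    return math.sqrt(max(0.0, 1.0 - sum(math.sqrt(a * b) for a, b in zip(p, q))))

rng = random.Random(1)

def random_dist(M):
    w = [rng.random() for _ in range(M)]
    s = sum(w)
    return [x / s for x in w]

# Check d^2 <= Delta <= sqrt(2)*d on many random pairs over [M], M = 8.
ok = True
for _ in range(1000):
    p, q = random_dist(8), random_dist(8)
    d, delta = hellinger(p, q), stat_dist(p, q)
    ok = ok and (d * d <= delta + 1e-9) and (delta <= math.sqrt(2) * d + 1e-9)
```

The left inequality is what lets the proof pass from Hellinger closeness back to statistical distance at the very end.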
Note that C((X, Y)) = C(X) · C(Y) when X and Y are independent random variables, so Hellinger closeness is well-behaved with respect to products (unlike statistical distance). By Lemma 3.10, if the Hellinger closeness C(X) is close to 1, then X is close to uniform in statistical distance. Recall that collision probability behaves similarly: if the collision probability cp(X) is close to 1/M, then X is close to uniform. In fact, by the following normalization, we can view the collision probability as the 2-norm of X, and the Hellinger closeness as the 1/2-norm of X. Let f(i) = M · Pr[X = i] for i ∈ [M]. In terms of f(·), the collision probability is cp(X) = (1/M²) · Σ_i f(i)², and Lemma 2.3 says that if the "2-norm" M · cp(X) = E_i[f(i)²] is at most 1 + ε, where the expectation is over uniform i ∈ [M], then ∆(X, U) ≤ √ε. Similarly, Lemma 3.10 says that if the "1/2-norm" C(X) = E_i[√f(i)] is at least 1 − ε, then ∆(X, U) ≤ √(2ε).

We now discuss our approach to proving Theorem 3.8. We want to show that (H, Y) is close to uniform, and all we know is that the conditional collision probability cp(Y_i | H, Y_{<i}) of each block is small. To convert this 2-norm information into a bound on the 1/2-norm, we prove the following lemma.

Lemma 3.12 Let X = (X_1, . . . , X_T) be a random variable over [M_1] × · · · × [M_T] such that the normalized conditional collision probability of each block satisfies M_j · cp(X_j | X_{<j}) ≤ α_j. Then C(X) ≥ √(1/(α_1 · · · α_T)).

The main tool in the proof is Hölder's Inequality:

• Let F and G be non-negative functions from [M] to R, and let p, q > 0 satisfy 1/p + 1/q = 1. Let x be a uniformly random index over [M]. We have E_x[F(x) · G(x)] ≤ E_x[F(x)^p]^{1/p} · E_x[G(x)^q]^{1/q}.

• In general, let F_1, . . . , F_n be non-negative functions from [M] to R, and let p_1, . . . , p_n > 0 satisfy 1/p_1 + · · · + 1/p_n = 1. We have E_x[F_1(x) · · · F_n(x)] ≤ E_x[F_1(x)^{p_1}]^{1/p_1} · · · E_x[F_n(x)^{p_n}]^{1/p_n}.

Proof of Lemma 3.12: We prove it by induction on T. The base case T = 1 is already non-trivial. Let X be a random variable over [M] with cp(X) ≤ α/M; we need to show that the Hellinger closeness satisfies C(X) ≥ √(1/α). Recall the normalization mentioned before, and let f(x) = M · Pr[X = x] for every x ∈ [M].
In terms of f(·), we want to show that E_x[f(x)²] ≤ α implies E_x[√f(x)] ≥ √(1/α). Note that E_x[f(x)] = 1. We now apply Hölder's inequality with F = f^{2/3}, G = f^{1/3}, p = 3, and q = 3/2. We have

E_x[f(x)] ≤ E_x[f(x)²]^{1/3} · E_x[f(x)^{1/2}]^{2/3},

which implies

C(X) = E_x[√f(x)] ≥ E_x[f(x)]^{3/2} / E_x[f(x)²]^{1/2} ≥ √(1/α).

Supposing the lemma is true for T − 1, we show that it is true for T. Let f(x) = M_1 · Pr[X_1 = x]. To apply the induction hypothesis, we consider the conditional random variables (X_2, . . . , X_T | X_1 = x) for every x ∈ [M_1]. For every x ∈ [M_1] and j = 2, . . . , T, we define g_j(x) = M_j · cp((X_j | X_1 = x) | (X_2, . . . , X_{j−1} | X_1 = x)) to be the "normalized" conditional collision probability. By the induction hypothesis, we have C(X_2, . . . , X_T | X_1 = x) ≥ √(1/(g_2(x) · · · g_T(x))) for every x ∈ [M_1]. It follows that

C(X) = E_x[√f(x) · C(X_2, . . . , X_T | X_1 = x)] ≥ E_x[√(f(x)/(g_2(x) · · · g_T(x)))].

We use Hölder's inequality twice to show that E_x[√(f(x)/(g_2(x) · · · g_T(x)))] ≥ √(1/(α_1 · · · α_T)). Let us first summarize the constraints we have. By definition, E_x[f(x)²] ≤ α_1. Fix j ∈ {2, . . . , T}.

Theorem 4.1 Let ε ∈ (0, ε_0), where ε_0 > 0 is a small absolute constant, and let H : [N] → [M] be a random hash function from an arbitrary hash family H. Then there exists an integer K = Ω(MT/ε²) and a block K-source X = (X_1, . . . , X_T) such that (H, Y) = (H, H(X_1), . . . , H(X_T)) is ε-far from uniform (H, U_[M]^T) in statistical distance.

To prove the theorem, we need to find such an X for every hash family H. Following the intuition, we find an X that incurs a certain error on a single block, and take X = (X_1, . . .
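Stepping back to the Hölder computation in the proof of Lemma 3.12 above: the base-case inequality E[f] ≤ E[f²]^{1/3} · E[f^{1/2}]^{2/3}, and the resulting bound C(X) ≥ √(1/α), can be checked numerically on a random normalized density (illustrative code; f is a random example, not a distribution from the paper):

```python
import random

rng = random.Random(2)
M = 64

# A random density f on [M], normalized so that E_x[f(x)] = 1
# (f(x) = M * Pr[X = x] for some random variable X over [M]).
w = [rng.random() + 0.01 for _ in range(M)]
mean = sum(w) / M
f = [v / mean for v in w]

E_f   = sum(f) / M                     # = 1 by construction
alpha = sum(v * v for v in f) / M      # "2-norm":   E[f^2] = M * cp(X)
C_X   = sum(v ** 0.5 for v in f) / M   # "1/2-norm": Hellinger closeness C(X)

# Hoelder with F = f^(2/3), G = f^(1/3), p = 3, q = 3/2:
#   E[f] <= E[f^2]^(1/3) * E[f^(1/2)]^(2/3)
holder_rhs = alpha ** (1 / 3) * C_X ** (2 / 3)
```

Rearranging the Hölder bound with E[f] = 1 gives exactly C(X) ≥ 1/√α, the base case of the induction.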
More precisely, we first find a K-source X such that for an Ω(1)-fraction of hash functions h ∈ H, h(X) is Ω(ε/√T)-far from uniform. This step is the same as the lower bound proof for extractors [RT00], which uses the probabilistic method: we pick X to be a random flat K-source, i.e., the uniform distribution over a random set of size K, and show that X satisfies the desired property with nonzero probability. The next step is to measure how the error accumulates over independent blocks. Note that for a fixed hash function h, the hashed sequence (h(X_1), . . . , h(X_T)) consists of T i.i.d. copies of h(X). Reyzin [Rey04] has shown that the statistical distance increases by a factor of √T when we have T independent copies, for small T. However, Reyzin's result only shows an increase up to distance O(δ^{1/3}), where δ is the statistical distance of the original random variables. We improve Reyzin's result to show that the Ω(√T) growth continues until the distance reaches some absolute constant. We then use this to show that the joint distribution (H, Y) is far from uniform.

The following lemma corresponds to the first step.

Lemma 4.2 Let N and M be positive integers, and let ε ∈ (0, 1/4) and δ ∈ (0, 1) be real numbers such that N ≥ M/ε². Let H : [N] → [M] be a random hash function from a hash family H. Then there exists an integer K = Ω(δ²M/ε²) and a flat K-source X over [N] such that with probability at least 1 − δ over h ← H, h(X) is ε-far from uniform.

Proof. Let K = ⌊min{α · M/ε², N/2}⌋ for some α to be determined later. Let X be a random flat K-source over [N]; that is, X = U_S where S ⊂ [N] is a uniformly random size-K subset of [N]. We claim that for every hash function h : [N] → [M],

Pr_S[h(U_S) is ε-far from uniform] ≥ 1 − c · √α    (3)

for some absolute constant c.
Let us assume (3) and prove the lemma first. Since the claim holds for every hash function h,

Pr_{h←H, S}[h(U_S) is ε-far from uniform] ≥ 1 − c · √α.

Thus, there exists a flat K-source U_S such that

Pr_{h←H}[h(U_S) is ε-far from uniform] ≥ 1 − c · √α.

The lemma follows by setting α = min{δ²/c², 1/32}.

We proceed to prove (3). It suffices to show that for every y ∈ [M], with probability at least 1 − c′ · √α over the random choice of U_S, the deviation of Pr[h(U_S) = y] from 1/M is at least 4ε/M, where c′ is another absolute constant. That is,

Pr_S[ |Pr[h(U_S) = y] − 1/M| ≥ 4ε/M ] ≥ 1 − c′ · √α.    (4)

Again, let us see why (4) is sufficient to prove (3) first. Let us say that y ∈ [M] is bad for S if |Pr[h(U_S) = y] − 1/M| ≥ 4ε/M. Since Inequality (4) holds for every y ∈ [M], we have

Pr_{S,y}[y is bad for S] ≥ 1 − c′ · √α,

where y is uniformly random over [M]. It follows that

Pr_S[at least a 1/2-fraction of y are bad for S] ≥ 1 − 2c′ · √α.

Observe that if at least a 1/2-fraction of y are bad for S, then ∆(h(X), U_[M]) ≥ ε. Inequality (3) follows by setting c = 2c′.

It remains to prove (4). Let T = h^{−1}(y). We have Pr_S[h(U_S) = y] = |S ∩ T|/|S|. Thus, recalling that K ≤ αM/ε², (4) follows from the inequality

Pr_S[ ||S ∩ T| − K/M| < 4Kε/M ] ≤ c′ · √(Kε²/M),

which follows from the claim below by setting L = K/M and β = 4ε√(K/M). (Working out the parameters, we have c′ = 4c′′; ε < 1/4 implies β < √L, and α ≤ 1/32 implies β < 1.)

Claim 4.3 Let N, K > 1 be positive integers such that N > 2K, and let L ∈ [0, K/2] and β ∈ (0, min{1, √L}) be real numbers. Let S ⊂ [N] be a random subset of size K, and let T ⊂ [N] be a fixed subset of arbitrary size. We have

Pr_S[ ||S ∩ T| − L| ≤ β√L ] ≤ c′′ · β,

for some absolute constant c′′.
Intuitively, the probability in the claim is maximized when the set T has size NL/K, so that L = E_S[|S ∩ T|], and the claim follows by observing that in this case the distribution of |S ∩ T| has deviation Θ(√L), while each possible outcome has probability O(√(1/L)). The formal proof of the claim is in Appendix A; it proceeds by expressing the probability in terms of binomial coefficients and estimating them using Stirling's formula.

The next step is to measure the increase of statistical distance over independent random variables.

Lemma 4.4 Let X and Y be random variables over [M] such that ∆(X, Y) ≥ ε. Let X = (X_1, . . . , X_T) be T i.i.d. copies of X, and let Y = (Y_1, . . . , Y_T) be T i.i.d. copies of Y. We have ∆(X, Y) ≥ min{ε_0, c√T · ε}, where ε_0, c are absolute constants.

We defer the proof of the above lemma to Appendix B.

Proof of Theorem 4.1: The absolute constant ε_0 in the theorem is half of the ε_0 in Lemma 4.4. By Lemma 4.2 there is a flat K-source X such that for a 1/2-fraction of hash functions h ∈ H, h(X) is (2ε/c√T)-far from uniform, for K = Ω((1/2)² · M/(2ε/c√T)²) = Ω(MT/ε²). We set X = (X_1, . . . , X_T) to be T independent copies of X. Consider a hash function h such that h(X) is (2ε/c√T)-far from uniform. By Lemma 4.4, (h(X_1), . . . , h(X_T)) is 2ε-far from uniform. Note that this holds for a 1/2-fraction of hash functions h. It follows that

∆((H, Y), (H, U_[M]^T)) = E_{h←H}[∆((h(X_1), . . . , h(X_T)), U_[M]^T)] ≥ (1/2) · 2ε = ε.

4.2 Lower Bound for Small Collision Probability

In this subsection, we prove lower bounds on the entropy needed per item to ensure that the sequence of hashed values is close to having small collision probability. Since this requirement is less stringent than being close to uniform, less entropy is needed from the source.
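The √T growth in Lemma 4.4 can be observed directly by computing the exact statistical distance between T-fold products over a small alphabet (illustrative code; the biased coin is an arbitrary example, not a source from the paper):

```python
import itertools

def tv_product(p, q, T):
    """Exact statistical distance between the T-fold i.i.d. products of two
    distributions p, q on a small alphabet (enumerates all |alphabet|^T tuples)."""
    total = 0.0
    for xs in itertools.product(range(len(p)), repeat=T):
        pp = qq = 1.0
        for x in xs:
            pp *= p[x]
            qq *= q[x]
        total += abs(pp - qq)
    return 0.5 * total

eps = 0.05
p = [0.5 + eps, 0.5 - eps]   # a coin at statistical distance eps from fair
q = [0.5, 0.5]

growth = [tv_product(p, q, T) for T in (1, 4, 9)]   # distances at T = 1, 4, 9
```

For this coin the distances come out to roughly 0.050, 0.078, and 0.121: clearly super-constant growth on the order of √T · ε, yet well below the hybrid-argument upper bound Tε. Once T ≫ 1/ε², Lemma 4.6 takes over and the distance approaches 1 exponentially fast.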
The interesting setting in applications is to require the hashed sequence (H, Y) = (H, H(X_1), . . . , H(X_T)) to be ε-close to having collision probability O(1/(|H| · M^T)). Recall that in this setting, instead of requiring K ≥ MT/ε², K ≥ Ω(MT/ε) is sufficient for 2-universal hash functions (Theorem 3.1), and K ≥ Ω(MT + T√(M/ε)) is sufficient for 4-wise independent hash functions (Theorem 3.5). The main improvement from 2-universal to 4-wise independent hashing is the better dependency on ε. Indeed, it can be shown that if we use truly random hash functions, we can reduce the dependency on ε to log(1/ε). Since we are now proving lower bounds for arbitrary hash families, we focus on the dependency on M and T. Specifically, our goal is to show that K = Ω(MT) is necessary. More precisely, we show that when K ≪ MT, it is possible for the hashed sequence (H, Y) to be 0.99-far from any distribution that has collision probability less than 100/(|H| · M^T).

We use the same strategy as in the previous subsection to prove this lower bound. Fixing a hash family H, we take T independent copies (X_1, . . . , X_T) of the worst-case X found in Lemma 4.2, and show that (H, H(X_1), . . . , H(X_T)) is far from having small collision probability. The new ingredient is to show that when we have T independent copies and K ≪ MT, then (h(X_1), . . . , h(X_T)) is very far from uniform (say, 0.99-far) for many h ∈ H. We then argue that in this case, we cannot reduce the collision probability of (h(X_1), . . . , h(X_T)) by changing a small fraction of the distribution, which implies that the overall distribution (H, Y) is far from any distribution (H′, Z) with small collision probability. Formally, we prove the following theorem.
Theorem 4.5 Let $N$, $M$, and $T$ be positive integers such that $N \geq MT$. Let $\delta \in (0,1)$ and $\alpha > 1$ be real numbers such that $\alpha < \delta^3 \cdot e^{T/32}/128$. Let $H : [N] \to [M]$ be a random hash function from a hash family $\mathcal{H}$. There exist an integer $K = \Omega(\delta^2 MT/\log(\alpha/\delta))$ and a block $K$-source $X = (X_1, \ldots, X_T)$ such that $(H, \mathbf{Y}) = (H, H(X_1), \ldots, H(X_T))$ is $(1-\delta)$-far from any distribution $(H', Z)$ with $\mathrm{cp}(H', Z) \leq \alpha/(|\mathcal{H}| \cdot M^T)$.

Think of $\alpha$ and $\delta$ as constants. Then the theorem says that $K = \Omega(MT)$ is necessary for the hashed sequence $(H, H(X_1), \ldots, H(X_T))$ to be close to having small collision probability, matching the upper bound in Theorem 3.1.

In the previous proof, we used Lemma 4.4 to measure the increase of distance over blocks. However, that lemma can only measure the progress up to some small constant. It is known that if the number of copies $T$ is larger than $\Omega(1/\varepsilon^2)$, where $\varepsilon$ is the statistical distance of the original copy, then the statistical distance goes to 1 exponentially fast. Formally, we use the following lemma.

Lemma 4.6 ([SV99]) Let $X$ and $Y$ be random variables over $[M]$ such that $\Delta(X, Y) \geq \varepsilon$. Let $\mathbf{X} = (X_1, \ldots, X_T)$ be $T$ i.i.d. copies of $X$, and let $\mathbf{Y} = (Y_1, \ldots, Y_T)$ be $T$ i.i.d. copies of $Y$. Then $\Delta(\mathbf{X}, \mathbf{Y}) \geq 1 - e^{-T\varepsilon^2/2}$.

We remark that Lemmas 4.4 and 4.6 are incomparable: in the parameter range of Lemma 4.4, Lemma 4.6 only gives $\Delta(\mathbf{X}, \mathbf{Y}) \geq \Omega(T\varepsilon^2)$ instead of $\Omega(\sqrt{T}\varepsilon)$. To argue that the overall distribution is far from having small collision probability, we introduce the following notion of nonuniformity.

Definition 4.7 Let $X$ be a random variable over $[M]$ with probability mass function $p$.
$X$ is $(\delta, \beta)$-nonuniform if for every function $q : [M] \to \mathbb{R}$ such that $0 \leq q(x) \leq p(x)$ for all $x \in [M]$ and $\sum_x q(x) \geq \delta$, we have $\sum_{x \in [M]} q(x)^2 > \beta/M$.

Intuitively, a distribution $X$ over $[M]$ being $(\delta, \beta)$-nonuniform means that even if we remove a $(1-\delta)$-fraction of the probability mass from $X$, the "collision probability" remains greater than $\beta/M$. In particular, $X$ is $(1-\delta)$-far from any random variable $Y$ with $\mathrm{cp}(Y) \leq \beta/M$.

Lemma 4.8 Let $X$ be a random variable over $[M]$. If $X$ is $(1-\eta)$-far from uniform, then $X$ is $(2\sqrt{\beta\eta}, \beta)$-nonuniform for every $\beta \geq 1$.

Proof. Let $p$ be the probability mass function of $X$, and let $q : [M] \to \mathbb{R}$ be a function such that $0 \leq q(x) \leq p(x)$ for every $x \in [M]$ and $\sum_x q(x) \geq 2\sqrt{\beta\eta}$. Our goal is to show that $\sum_x q(x)^2 > \beta/M$. Let $T = \{x \in [M] : p(x) \geq 1/M\}$. Note that $\Delta(X, U_{[M]}) = \Pr[X \in T] - \Pr[U_{[M]} \in T] \geq 1 - \eta$. This implies $\Pr[X \in T] \geq 1 - \eta$ and $\mu(T) = \Pr[U_{[M]} \in T] \leq \eta$. Now,
$$\sum_{x \in T} q(x) \geq 2\sqrt{\beta\eta} - \Pr[X \notin T] \geq 2\sqrt{\beta\eta} - \eta > \sqrt{\beta\eta},$$
and $\mu(T) \leq \eta$ (so $|T| \leq \eta M$) implies
$$\sum_{x \in [M]} q(x)^2 \geq \sum_{x \in T} q(x)^2 \geq \frac{\left(\sum_{x \in T} q(x)\right)^2}{|T|} > \frac{\beta\eta}{\eta M} = \frac{\beta}{M}. \qquad \square$$

We are now ready to prove Theorem 4.5.

Proof of Theorem 4.5: Apply Lemma 4.2 with $\varepsilon = \sqrt{2\ln(128\alpha/\delta^3)/T} < 1/4$: there is a flat $K$-source $X$ such that for a $(1-\delta/4)$-fraction of hash functions $h \in \mathcal{H}$, $h(X)$ is $\varepsilon$-far from uniform, for $K = \Omega((\delta/4)^2 M/\varepsilon^2) = \Omega(\delta^2 MT/\log(\alpha/\delta))$. We set $\mathbf{X} = (X_1, \ldots, X_T)$ to be $T$ independent copies of $X$. Consider a hash function $h$ such that $h(X)$ is $\varepsilon$-far from uniform. By Lemma 4.6, $(h(X_1), \ldots, h(X_T))$ is $(1-\eta)$-far from uniform, for $\eta = e^{-\varepsilon^2 T/2} = \delta^3/(128\alpha)$. By Lemma 4.8, $(h(X_1), \ldots, h(X_T))$ is $(\delta/4, 2\alpha/\delta)$-nonuniform for a $(1-\delta/4)$-fraction of hash functions $h$.
By the first statement of Lemma 4.9 below, this implies that $(H, \mathbf{Y})$ is $(1-\delta)$-far from any distribution $(H', Z)$ with collision probability $\alpha/(|\mathcal{H}| \cdot M^T)$. $\square$

Lemma 4.9 Let $(H, Y)$ be a joint distribution over $\mathcal{H} \times [M]$ such that the marginal distribution $H$ is uniform over $\mathcal{H}$. Let $\varepsilon, \delta, \alpha$ be positive real numbers.

1. If $Y|_{H=h}$ is $(\delta/4, 2\alpha/\delta)$-nonuniform for at least a $(1-\delta/4)$-fraction of $h \in \mathcal{H}$, then $(H, Y)$ is $(1-\delta)$-far from any distribution $(H', Z)$ with $\mathrm{cp}(H', Z) \leq \alpha/(|\mathcal{H}| \cdot M)$.

2. If $Y|_{H=h}$ is $(0.1, 2\alpha/\varepsilon)$-nonuniform for at least a $2\varepsilon$-fraction of $h \in \mathcal{H}$, then $(H, Y)$ is $\varepsilon$-far from any distribution $(H', Z)$ with $\mathrm{cp}(H', Z) \leq \alpha/(|\mathcal{H}| \cdot M)$.

Proof. We first introduce some notation. For every $h \in \mathcal{H}$, define $q_h : [M] \to \mathbb{R}$ by $q_h(y) = \min\{\Pr[(H, Y) = (h, y)],\ \Pr[(H', Z) = (h, y)]\}$ for every $y \in [M]$. Also define $f : \mathcal{H} \to \mathbb{R}$ by
$$f(h) = \sum_{y \in [M]} q_h(y) \leq \Pr[H = h] = \frac{1}{|\mathcal{H}|}.$$

For the first statement, let $(H', Z)$ be a random variable over $\mathcal{H} \times [M]$ that is $(1-\delta)$-close to $(H, Y)$. We need to show that $\mathrm{cp}(H', Z) > \alpha/(|\mathcal{H}| \cdot M)$. Note that $\sum_h f(h) = 1 - \Delta((H, Y), (H', Z)) \geq \delta$. So there is at least a $(3\delta/4)$-fraction of hash functions $h$ with $f(h) \geq (\delta/4)/|\mathcal{H}|$. Hence at least a $(3\delta/4) - (\delta/4) = \delta/2$-fraction of $h$ satisfy both $f(h) \geq (\delta/4)/|\mathcal{H}|$ and $Y|_{H=h}$ is $(\delta/4, 2\alpha/\delta)$-nonuniform. By the definition of nonuniformity, for each such $h$ we have
$$\sum_{y \in [M]} (|\mathcal{H}| \cdot q_h(y))^2 > \frac{2\alpha}{\delta M}.$$
Therefore,
$$\mathrm{cp}(H', Z) \geq \sum_{h,y} q_h(y)^2 > \left(\frac{\delta}{2} \cdot |\mathcal{H}|\right) \cdot \frac{2\alpha}{\delta \cdot |\mathcal{H}|^2 M} = \frac{\alpha}{|\mathcal{H}| \cdot M}.$$

Similarly, for the second statement, let $(H', Z)$ be a random variable over $\mathcal{H} \times [M]$ that is $\varepsilon$-close to $(H, Y)$. We need to show that $\mathrm{cp}(H', Z) > \alpha/(|\mathcal{H}| \cdot M)$. Note that $\sum_h f(h) = 1 - \Delta((H, Y), (H', Z)) \geq 1 - \varepsilon$. So there is at least a $(1 - \varepsilon/0.9)$-fraction of $h$ with $f(h) \geq 0.1/|\mathcal{H}|$. Hence at least a $2\varepsilon - \varepsilon/0.9 > \varepsilon/2$-fraction of hash functions satisfy both $f(h) \geq 0.1/|\mathcal{H}|$ and $Y|_{H=h}$ is $(0.1, 2\alpha/\varepsilon)$-nonuniform. By the definition of nonuniformity, for each such $h$ we have
$$\sum_{y \in [M]} (|\mathcal{H}| \cdot q_h(y))^2 > \frac{2\alpha}{\varepsilon M}.$$
Therefore,
$$\mathrm{cp}(H', Z) \geq \sum_{h,y} q_h(y)^2 > \left(\frac{\varepsilon}{2} \cdot |\mathcal{H}|\right) \cdot \frac{2\alpha}{\varepsilon \cdot |\mathcal{H}|^2 M} = \frac{\alpha}{|\mathcal{H}| \cdot M}. \qquad \square$$

4.3 Lower Bounds for the Distribution of Hashed Values Only

We can extend our lower bounds to the distribution of the hashed sequence $\mathbf{Y} = (H(X_1), \ldots, H(X_T))$ alone (without $H$) for both closeness requirements, at the price of losing the dependence on $\varepsilon$ and incurring some dependence on the size of the hash family. Let $2^d = |\mathcal{H}|$ be the size of the hash family. The dependence on $d$ is necessary. Intuitively, the hashed sequence $\mathbf{Y}$ contains at most $T \cdot m$ bits of entropy, while the input $(H, X_1, \ldots, X_T)$ contains at least $d + T \cdot k$ bits of entropy. When $d$ is large enough, it is possible that all the randomness of the hashed sequence comes from the randomness of the hash family. Indeed, if $H$ is $T$-wise independent (which is possible with $d \simeq T \cdot m$), then $(H(X_1), \ldots, H(X_T))$ is uniform whenever $X_1, \ldots, X_T$ are all distinct. Therefore,
$$\Delta((H(X_1), \ldots, H(X_T)), U_{[M]^T}) \leq \Pr[\text{not all of } X_1, \ldots, X_T \text{ are distinct}].$$
Thus $K = \Omega(T^2)$ (independent of $M$) suffices to make the hashed values close to uniform.

Theorem 4.10 Let $N$, $M$, $T$ be positive integers, and $d$ a positive real number, such that $N \geq MT/d$. Let $\delta \in (0,1)$ and $\alpha > 1$ be real numbers such that $\alpha \cdot 2^d < \delta^3 \cdot e^{T/32}/128$. Let $H : [N] \to [M]$ be a random hash function from a hash family $\mathcal{H}$ of size at most $2^d$. There exist an integer $K = \Omega(\delta^2 MT/(d \log(\alpha/\delta)))$ and a block $K$-source $X = (X_1, \ldots, X_T)$ such that $\mathbf{Y} = (H(X_1), \ldots, H(X_T))$ is $(1-\delta)$-far from any distribution $Z = (Z_1, \ldots, Z_T)$ with $\mathrm{cp}(Z) \leq \alpha/M^T$. In particular, $\mathbf{Y}$ is $(1-\delta)$-far from uniform.

Think of $\alpha$ and $\delta$ as constants. Then the theorem says that when the hash function contains $d \leq T/(32\ln 2) - O(1)$ bits of randomness, $K = \Omega(MT/d)$ is necessary for the hashed sequence to be close to uniform. For example, in some typical hashing applications, $N = \mathrm{poly}(M)$ and the hash function is 2-universal or $O(1)$-wise independent. In this case, $d = O(\log M)$ and we need $K = \Omega(MT/\log M)$. (Recall that our upper bound in Theorem 3.1 says that $K = O(MT)$ suffices.)

Proof. We deduce the theorem from Theorem 4.5. Replacing the parameter $\alpha$ by $\alpha \cdot 2^d$ in Theorem 4.5, we know that there exist an integer $K = \Omega(\delta^2 MT/(d \log(\alpha/\delta)))$ and a block $K$-source $X = (X_1, \ldots, X_T)$ such that $(H, \mathbf{Y}) = (H, H(X_1), \ldots, H(X_T))$ is $(1-\delta)$-far from any distribution $(H', Z)$ with $\mathrm{cp}(H', Z) \leq \alpha \cdot 2^d/(2^d \cdot M^T) = \alpha/M^T$. Now suppose we are given a random variable $Z$ on $[M]^T$ with $\Delta(\mathbf{Y}, Z) \leq 1 - \delta$. Then we can define an $H'$ such that $\Delta((H, \mathbf{Y}), (H', Z)) = \Delta(\mathbf{Y}, Z)$. (Indeed, define the conditional distribution $H'|_{Z=z}$ to equal $H|_{\mathbf{Y}=z}$ for every $z \in [M]^T$.) Then we have $\mathrm{cp}(Z) \geq \mathrm{cp}(H', Z) > \alpha/M^T$. $\square$

One limitation of the above lower bound is that it only works when $d \leq T/(32\ln 2) - O(1)$. For example, the lower bound cannot be applied when the hash function is $T$-wise independent. Although $d = \Omega(T)$ may not be interesting in practice, for the sake of completeness we provide another simple lower bound to cover this parameter regime.

Theorem 4.11 Let $N$, $M$, $T$ be positive integers, and $\delta \in (0,1)$, $\alpha > 1$, $d > 0$ real numbers. Let $H : [N] \to [M]$ be a random hash function from a hash family $\mathcal{H}$ of size at most $2^d$.
Suppose $K \leq N$ is an integer such that $K \leq (\delta^2/(4\alpha \cdot 2^d))^{1/T} \cdot M$. Then there exists a block $K$-source $X = (X_1, \ldots, X_T)$ such that $\mathbf{Y} = (H(X_1), \ldots, H(X_T))$ is $(1-\delta)$-far from any distribution $Z = (Z_1, \ldots, Z_T)$ with $\mathrm{cp}(Z) \leq \alpha/M^T$. In particular, $\mathbf{Y}$ is $(1-\delta)$-far from uniform.

Again, think of $\alpha$ and $\delta$ as constants. The theorem says that $K = \Omega(M/2^{d/T})$ is necessary for the hashed sequence to be close to uniform. In particular, when $d = \Theta(T)$, $K = \Omega(M)$ is necessary. Theorem 4.10 gives the same conclusion, but only works for $d \leq T/(32\ln 2) - O(1)$. On the other hand, when $d = o(T)$, Theorem 4.10 gives the better lower bound $K = \Omega(MT/d)$.

Proof. Let $X$ be any flat $K$-source, i.e., a uniform distribution over a set of size $K$. We simply take $\mathbf{X} = (X_1, \ldots, X_T)$ to be $T$ independent copies of $X$. Note that the support of $\mathbf{Y}$ is at most as large as that of $(H, \mathbf{X})$. Thus,
$$|\mathrm{supp}(\mathbf{Y})| \leq |\mathrm{supp}(H, \mathbf{X})| = 2^d \cdot K^T \leq \frac{\delta^2}{4\alpha} \cdot M^T.$$
Therefore, $\mathbf{Y}$ is $(1 - \delta^2/4\alpha)$-far from uniform. By Lemma 4.8 (applied over the range $[M]^T$ with $\eta = \delta^2/4\alpha$ and $\beta = \alpha$), $\mathbf{Y}$ is $(1-\delta)$-far from any distribution $Z = (Z_1, \ldots, Z_T)$ with $\mathrm{cp}(Z) \leq \alpha/M^T$. $\square$

4.4 Lower Bound for 2-universal Hash Functions

In this subsection, we show that Theorem 3.1 is almost tight in the following sense. We show that there exist $K = \Omega(MT/(\varepsilon \log(1/\varepsilon)))$, a 2-universal hash family $\mathcal{H}$, and a block $K$-source $X$ such that $(H, \mathbf{Y})$ is $\varepsilon$-far from having collision probability $100/(|\mathcal{H}| \cdot M^T)$. The improvement over Theorem 4.5 is the almost tight dependence on $\varepsilon$. Recall that Theorem 3.1 says that for a 2-universal hash family, $K = O(MT/\varepsilon)$ suffices; the upper and lower bounds differ by a factor of $\log(1/\varepsilon)$. In particular, our result for 4-wise independent hash functions (Theorem 3.5) cannot be achieved with 2-universal hash functions.
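The building block of the constructions in this subsection is the standard linear (inner-product) family over a finite field. As a quick empirical sanity check (this script is our own sketch, for a prime modulus $M$ and a tiny domain, not part of the paper), one can exhaustively verify that $h_{\vec{a}}(\vec{x}) = \sum_i a_i x_i \bmod M$ is 2-universal, i.e., that every pair of distinct inputs collides on exactly a $1/M$-fraction of keys:

```python
from itertools import product

def h(a, x, M):
    # the linear hash h_a(x) = sum_i a_i * x_i  (mod M), over F = Z_M with M prime
    return sum(ai * xi for ai, xi in zip(a, x)) % M

M, t = 5, 2                       # a small prime field; domain and key space are F^t
F_t = list(product(range(M), repeat=t))
for x in F_t:
    for y in F_t:
        if x == y:
            continue
        # count keys on which the pair collides; 2-universality demands exactly
        # |F^t| / M of them, i.e. collision probability 1/M over a random key
        coll = sum(1 for a in F_t if h(a, x, M) == h(a, y, M))
        assert coll * M == len(F_t)
print("inner-product family over GF(%d)^%d is 2-universal" % (M, t))
```

Note that 2-universality is a pairwise statement only: the all-zero key hashes the entire domain to $0$, which is exactly the degeneracy exploited by the "bad hash functions" in the proof below.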
The lower bound can be further extended to the distribution of the hashed sequence $\mathbf{Y} = (H(X_1), \ldots, H(X_T))$ alone, as in the previous subsection. Furthermore, since the 2-universal hash family we use has small size, we only pay a factor of $O(\log M)$ in the lower bound on $K$. Formally, we prove the following results.

Theorem 4.12 For every prime power $M$ and real numbers $\varepsilon \in (0, 1/4)$ and $\alpha \geq 1$, the following holds. For all integers $t$ and $N$ such that $\varepsilon \cdot M^{t-1} \geq 1$ and $N \geq 6\varepsilon M^{2t}$, and for $T = \lceil \varepsilon^2 M^{2t-1} \log(\alpha/\varepsilon) \rceil$ (see the footnote below), there exist an integer $K = \Omega(MT/(\varepsilon \log(\alpha/\varepsilon)))$, a 2-universal hash family $\mathcal{H}$ from $[N]$ to $[M]$, and a block $K$-source $X = (X_1, \ldots, X_T)$ such that $(H, \mathbf{Y}) = (H, H(X_1), \ldots, H(X_T))$ is $\varepsilon$-far from any distribution $(H', Z)$ with $\mathrm{cp}(H', Z) \leq \alpha/(|\mathcal{H}| \cdot M^T)$.

(Footnote: For technical reasons, our lower bound proof does not work for every sufficiently large $T$; however, the density of $T$ for which the lower bound holds is $1/M^2$.)

Theorem 4.13 For every prime power $M$ and real numbers $\varepsilon \in (0, 1/4)$ and $\alpha \geq 1$, the following holds. For all integers $t$ and $N$ such that $\varepsilon \cdot M^{t-1} \geq 1$ and $N \geq 6\varepsilon M^{2t}$, and for $T = \lceil \varepsilon^2 M^{2t-1} \log(\alpha M/\varepsilon) \rceil$, there exist an integer $K = \Omega(MT/(\varepsilon \log(\alpha M/\varepsilon)))$, a 2-universal hash family $\mathcal{H}$ from $[N]$ to $[M]$, and a block $K$-source $X = (X_1, \ldots, X_T)$ such that $\mathbf{Y} = (H(X_1), \ldots, H(X_T))$ is $\varepsilon$-far from any distribution $Z$ with $\mathrm{cp}(Z) \leq \alpha/M^T$.

Basically, the idea is to show that the Markov inequality applied in the proof of Theorem 3.1 (see Inequality (1)) is tight for a single block. More precisely, we show that there exist a 2-universal hash family $\mathcal{H}$ and a $K$-source $X$ such that with probability $\varepsilon$ over $h \leftarrow \mathcal{H}$, $\mathrm{cp}(h(X)) \geq 1/M + \Omega(1/K\varepsilon)$.
Intuitively, if we take $T = \Theta(K\varepsilon \log(\alpha/\varepsilon)/M)$ independent copies of such an $X$, then the collision probability will satisfy $\mathrm{cp}(h(X_1), \ldots, h(X_T)) \geq (1 + \Omega(M/K\varepsilon))^T/M^T \geq \alpha/(\varepsilon M^T)$, and so the overall collision probability is $\mathrm{cp}(H, \mathbf{Y}) \geq \alpha/(|\mathcal{H}| \cdot M^T)$. Formally, we analyze our construction below using Hellinger distance, and show that the collision probability remains high even after modifying a $\Theta(\varepsilon)$-fraction of the distribution.

Proof of Theorem 4.12: Fix a prime power $M$ and $\varepsilon > 0$, and identify $[M]$ with the finite field $\mathbb{F}$ of size $M$. Let $t$ be an integer parameter such that $M^{t-1} > 1/\varepsilon$. Recall that the set $\mathcal{H}_0$ of linear functions $\{h_{\vec{a}} : \mathbb{F}^t \to \mathbb{F}\}_{\vec{a} \in \mathbb{F}^t}$, where $h_{\vec{a}}(\vec{x}) = \sum_i a_i x_i$, is 2-universal. Note that picking a random hash function $h \leftarrow \mathcal{H}_0$ is equivalent to picking a random vector $\vec{a} \leftarrow \mathbb{F}^t$. Two special properties of $\mathcal{H}_0$ are (i) when $\vec{a} = \vec{0}$, the whole domain $\mathbb{F}^t$ is sent to $0 \in \mathbb{F}$, and (ii) the size of the hash family $|\mathcal{H}_0|$ is the same as the size of the domain, namely $|\mathbb{F}^t|$. We will use $\mathcal{H}_0$ as a building block in our construction.

We proceed to construct the hash family $\mathcal{H}$. We partition the domain $[N]$ into several sub-domains and apply a different hash function to each sub-domain. Let $s$ be an integer parameter to be determined later. We require $N \geq s \cdot M^t$, and partition $[N]$ into $D_0, D_1, \ldots, D_s$, where each of $D_1, \ldots, D_s$ has size $M^t$ and is identified with $\mathbb{F}^t$, and $D_0$ is the remaining part of $[N]$. In our construction, the data $X$ will never come from $D_0$; thus, without loss of generality, we can assume $D_0$ is empty. For every $i = 1, \ldots, s$, we use a linear hash function $h_{\vec{a}_i} \in \mathcal{H}_0$ to send $D_i$ to $\mathbb{F}$. Thus, a hash function $h \in \mathcal{H}$ consists of $s$ linear hash functions $(h_{\vec{a}_1}, \ldots, h_{\vec{a}_s})$, and can be described by $s$ vectors $\vec{a}_1, \ldots, \vec{a}_s \in \mathbb{F}^t$. Note that to make $\mathcal{H}$ 2-universal, it suffices to pick $\vec{a}_1, \ldots, \vec{a}_s$ pairwise independently. Specifically, we identify $\mathbb{F}^t$ with the finite field $\hat{\mathbb{F}}$ of size $M^t$, and pick $(\vec{a}_1, \ldots, \vec{a}_s)$ by choosing $a, b \in \hat{\mathbb{F}}$ and outputting $(a + \alpha_1 b, a + \alpha_2 b, \ldots, a + \alpha_s b)$ for some distinct constants $\alpha_1, \ldots, \alpha_s \in \hat{\mathbb{F}}$. Formally, we define the hash family to be $\mathcal{H} = \{h^{a,b} : [N] \to [M]\}_{a,b \in \hat{\mathbb{F}}}$, where
$$h^{a,b} = (h_{a+\alpha_1 b}, \ldots, h_{a+\alpha_s b}) \stackrel{\mathrm{def}}{=} (h^{a,b}_1, \ldots, h^{a,b}_s).$$
It is easy to verify that $\mathcal{H}$ is indeed 2-universal, and $|\mathcal{H}| = M^{2t}$.

We next define a (single-block) $K$-source $X$ that makes the Markov inequality (1) tight. We simply take $X$ to be the uniform distribution over $D_1 \cup \cdots \cup D_s$, so $K = s \cdot M^t$. Consider a hash function $h^{a,b} \in \mathcal{H}$. If all the $h^{a,b}_i$ are non-zero and distinct, then $h^{a,b}(X)$ is the uniform distribution. If exactly one $h^{a,b}_i = 0$, then $h^{a,b}$ sends $M^t + (s-1)M^{t-1}$ elements of $[N]$ to $0$, and $(s-1)M^{t-1}$ elements to each nonzero $y \in \mathbb{F}$; let us call such $h^{a,b}$ bad hash functions. Thus, if $h^{a,b}$ is bad, then
$$\mathrm{cp}(h^{a,b}(X)) = \left(\frac{M^t + (s-1)M^{t-1}}{K}\right)^2 + (M-1) \cdot \left(\frac{(s-1)M^{t-1}}{K}\right)^2 = \frac{1}{M} + \frac{M-1}{s^2 M} \geq \frac{1}{M} + \frac{1}{2s^2}.$$
Note that $h^{a,b}$ is bad with probability
$$\Pr[\text{exactly one } h^{a,b}_i = 0] = \Pr[b \neq 0 \wedge \exists i\,(a + \alpha_i b = 0)] = \left(1 - \frac{1}{M^t}\right) \cdot \frac{s}{M^t} \geq \frac{s}{2M^t}.$$
We set $s = \lceil 4\varepsilon M^t \rceil \leq M^t$. It follows that with probability at least $2\varepsilon$ over $h \leftarrow \mathcal{H}$, the collision probability satisfies $\mathrm{cp}(h(X)) \geq 1/M + 1/(4K\varepsilon)$, as we intuitively desired. However, instead of working with collision probability directly, we need to use Hellinger closeness to measure the growth of the distance to uniform (see Definition 3.9). The following claim upper-bounds the Hellinger closeness of $h(X)$ for bad hash functions $h$; its proof is deferred to the end of this section.
Claim 4.14 Suppose $h$ is a bad hash function as defined above. Then the Hellinger closeness of $h(X)$ satisfies $C(h(X)) \leq 1 - M/(64K\varepsilon)$.

Finally, for every integer $T \in [\varepsilon^2 M^{2t-1}\log(\alpha/\varepsilon),\ c_0 \cdot \varepsilon^2 M^{2t-1}\log(\alpha/\varepsilon)]$, we can write $T = c \cdot (64K\varepsilon/M) \cdot \ln(800\alpha/\varepsilon)$ for some constant $c < c_0$. Let $\mathbf{X} = (X_1, \ldots, X_T)$ be $T$ independent copies of $X$. We now show that $K$, $\mathcal{H}$, $\mathbf{X}$ satisfy the conclusion of the theorem; that is, $K = \Omega(MT/(\varepsilon \log(\alpha/\varepsilon)))$ (as follows from the definition of $T$) and $(H, \mathbf{Y}) = (H, H(X_1), \ldots, H(X_T))$ is $\varepsilon$-far from any distribution $(H', Z)$ with $\mathrm{cp}(H', Z) \leq \alpha/(|\mathcal{H}| \cdot M^T)$.

Consider the distribution $(h(X_1), \ldots, h(X_T))$ for a bad hash function $h \in \mathcal{H}$. From the above claim, the Hellinger closeness satisfies
$$C(h(X_1), \ldots, h(X_T)) = C(h(X))^T \leq \left(1 - \frac{M}{64K\varepsilon}\right)^T \leq e^{-MT/(64K\varepsilon)} \leq \frac{\varepsilon}{800\alpha}.$$
By Lemma 3.10 and the definition of Hellinger closeness, we have
$$\Delta((h(X_1), \ldots, h(X_T)), U_{[M]^T}) \geq 1 - C(h(X_1), \ldots, h(X_T)) \geq 1 - \frac{\varepsilon}{800\alpha}.$$
By Lemma 4.8, $(h(X_1), \ldots, h(X_T))$ is $(0.1, 2\alpha/\varepsilon)$-nonuniform for at least a $2\varepsilon$-fraction of hash functions $h$ (the bad ones). By the second statement of Lemma 4.9, this implies that $(H, \mathbf{Y})$ is $\varepsilon$-far from any distribution $(H', Z)$ with $\mathrm{cp}(H', Z) \leq \alpha/(|\mathcal{H}| \cdot M^T)$.

In sum, given $M, \varepsilon, \alpha, t$ satisfying the premise of the theorem, we set $K = \lceil 4\varepsilon M^t \rceil \cdot M^t$ and proved that for every $N \geq K$ and $T = \Theta((K\varepsilon/M) \cdot \ln(\alpha/\varepsilon))$, the conclusion of the theorem holds. It remains to prove Claim 4.14.

Proof of claim: Let $p(x) = M \cdot \Pr[h(X) = x]$ for every $x \in \mathbb{F}$. For a bad hash function $h$, we have $p(0) = 1 + (M-1)/s$ and $p(x) = 1 - 1/s$ for every $x \neq 0$. We will upper-bound $C(h(X)) = (1/M) \cdot \sum_x \sqrt{p(x)}$ using Taylor series.
Recall that for $z \geq 0$, there exist $z', z'' \in [0, z]$ such that
$$\sqrt{1+z} = 1 + \frac{z}{2} + \frac{z^2}{2}\left(-\frac{1}{4(1+z')^{3/2}}\right) \leq 1 + \frac{z}{2} - \frac{z^2}{8(1+z)^{3/2}}, \qquad \sqrt{1-z} = 1 - z \cdot \frac{1}{2\sqrt{1-z''}} \leq 1 - \frac{z}{2}.$$
We have
$$C(h(X)) = \frac{1}{M}\sum_x \sqrt{p(x)} \leq \frac{1}{M}\left(1 + \frac{M-1}{2s} - \frac{(M-1)^2}{8s^2(1+(M-1)/s)^{3/2}} + (M-1)\left(1 - \frac{1}{2s}\right)\right) = 1 - \frac{(M-1)^2}{8Ms^2(1+(M-1)/s)^{3/2}}.$$
Recalling that $M \geq 2$, $s = \Theta(\varepsilon M^t) \geq M$, and $s^2 = \Theta(K\varepsilon)$, we conclude $C(h(X)) \leq 1 - M/(64K\varepsilon)$. $\square$

Recall that $|\mathcal{H}| = M^{2t}$. Theorem 4.13 follows from Theorem 4.12 by exactly the same argument as in the proof of Theorem 4.10.

Acknowledgments

We thank Wei-Chun Kao for helpful discussions in the early stages of this work, David Zuckerman for telling us about Hellinger distance, and Michael Mitzenmacher for suggesting parameter settings useful in practice.

References

[BBR85] Charles H. Bennett, Gilles Brassard, and Jean-Marc Robert. How to reduce your enemy's information (extended abstract). In Hugh C. Williams, editor, Advances in Cryptology: CRYPTO '85, volume 218 of Lecture Notes in Computer Science, pages 468-476. Springer-Verlag, 1986, 18-22 August 1985.

[BM03] Andrei Z. Broder and Michael Mitzenmacher. Network applications of Bloom filters: A survey. Internet Mathematics, 1(4), 2003.

[CG88] Benny Chor and Oded Goldreich. Unbiased bits from sources of weak randomness and probabilistic communication complexity. SIAM J. Comput., 17(2):230-261, 1988.

[CV08] Kai-Min Chung and Salil Vadhan. Tight bounds for hashing block sources. In APPROX-RANDOM, 2008.

[Dur04] Richard Durrett. Probability: Theory and Examples. Third Edition. Duxbury, 2004.

[GS02] Alison L. Gibbs and Francis Edward Su. On choosing and bounding probability metrics. International Statistical Review, 70:419, 2002.

[ILL89] Russell Impagliazzo, Leonid A. Levin, and Michael Luby. Pseudo-random generation from one-way functions (extended abstract). In Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing, pages 12-24, Seattle, Washington, 15-17 May 1989.

[Knu98] Donald E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1998.

[MV08] Michael Mitzenmacher and Salil Vadhan. Why simple hash functions work: Exploiting the entropy in a data stream. In Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '08), pages 746-755, 20-22 January 2008.

[Mut03] S. Muthukrishnan. Data streams: algorithms and applications. In SODA, pages 413-413, 2003.

[NT99] Noam Nisan and Amnon Ta-Shma. Extracting randomness: A survey and new constructions. J. Comput. Syst. Sci., 58(1):148-173, 1999.

[NZ96] Noam Nisan and David Zuckerman. Randomness is linear in space. Journal of Computer and System Sciences, 52(1):43-52, February 1996.

[RT00] Jaikumar Radhakrishnan and Amnon Ta-Shma. Bounds for dispersers, extractors, and depth-two superconcentrators. SIAM Journal on Discrete Mathematics, 13(1):2-24 (electronic), 2000.

[Rey04] Leonid Reyzin. A note on the statistical difference of small direct products. Technical Report BUCS-TR-2004-032, Boston University Computer Science Department, 2004.

[SV99] Amit Sahai and Salil Vadhan. Manipulating statistical difference. In Panos Pardalos, Sanguthevar Rajasekaran, and José Rolim, editors, Randomization Methods in Algorithm Design (DIMACS Workshop, December 1997), volume 43 of DIMACS Series in Discrete Mathematics and Theoretical Computer Science, pages 251-270. American Mathematical Society, 1999.

[Sha04] Ronen Shaltiel. Recent developments in extractors. In G. Păun, G. Rozenberg, and A. Salomaa, editors, Current Trends in Theoretical Computer Science, volume 1: Algorithms and Complexity. World Scientific, 2004.

[Vad07] Salil Vadhan. The unified theory of pseudorandomness. SIGACT News, 38(3), September 2007.

[Zuc96] David Zuckerman. Simulating BPP using a general weak random source. Algorithmica, 16(4/5):367-391, October/November 1996.

A Technical Lemma on Binomial Coefficients

Lemma A.1 (Claim 4.3, restated) Let $N, K > 1$ be integers such that $N > 2K$, and let $L \in [0, K/2]$ and $\beta \in (0, \min\{1, \sqrt{L}\})$ be real numbers. Let $S \subset [N]$ be a random subset of size $K$, and let $T \subset [N]$ be a fixed subset of $[N]$ of arbitrary size. We have
$$\Pr_S\Big[\,\big||S \cap T| - L\big| \leq \beta\sqrt{L}\,\Big] \leq O(\beta).$$

Proof. Abusing notation, we use $T$ also to denote the size of the set $T$. The probability can be expressed as a sum of binomial coefficients as follows:
$$\Pr_S\Big[\,\big||S \cap T| - L\big| \leq \beta\sqrt{L}\,\Big] = \sum_{R = \lceil L - \beta\sqrt{L} \rceil}^{\lfloor L + \beta\sqrt{L} \rfloor} \frac{\binom{T}{R}\binom{N-T}{K-R}}{\binom{N}{K}}.$$
Since there are at most $\lfloor 2\beta\sqrt{L} \rfloor + 1$ terms, it suffices to show that for every $R \in [L - \beta\sqrt{L},\ L + \beta\sqrt{L}]$,
$$f(T) \stackrel{\mathrm{def}}{=} \frac{\binom{T}{R}\binom{N-T}{K-R}}{\binom{N}{K}} \leq O\left(\sqrt{\frac{1}{L}}\right).$$
We use the following bound on binomial coefficients, which can be derived from Stirling's formula.

Claim A.2 For integers $0 < i < a$ and $0 < j < b$, we have
$$\frac{\binom{a}{i}\binom{b}{j}}{\binom{a+b}{i+j}} \leq O\left(\sqrt{\frac{a \cdot b \cdot (i+j) \cdot (a+b-i-j)}{i \cdot (a-i) \cdot j \cdot (b-j) \cdot (a+b)}}\right).$$

Note that $L \in [0, K/2]$ implies $K - R = \Omega(K)$. When $2R \leq T \leq N - 2K + 2R$, we have
$$f(T) = \frac{\binom{T}{R}\binom{N-T}{K-R}}{\binom{N}{K}} = O\left(\sqrt{\frac{T(N-T)K(N-K)}{R(T-R)(K-R)(N-T-K+R)N}}\right) = O\left(\sqrt{\frac{1}{R} \cdot \frac{K}{K-R} \cdot \frac{N-K}{N} \cdot \frac{T(N-T)}{(T-R)(N-T-K+R)}}\right) = O\left(\sqrt{\frac{1}{R}}\right) = O\left(\sqrt{\frac{1}{L}}\right),$$
as desired. Note that when $N > 2K$, such $T$ exist. Finally, observe that $\beta^2 < L$ implies $R \geq 1$, and
$$\frac{f(T)}{f(T+1)} = \frac{(T-R+1)(N-T)}{(T+1)(N-T-K+R)}.$$
It follows that $f(T)$ is increasing when $T \leq 2R$ and decreasing when $T \geq N - 2K + 2R$. Therefore, $f(T) \leq f(2R) = O(\sqrt{1/L})$ for $T \leq 2R$, and $f(T) \leq f(N - 2K + 2R) = O(\sqrt{1/L})$ for $T \geq N - 2K + 2R$, which completes the proof. $\square$

B Proof of Lemma 4.4

Lemma B.1 (Lemma 4.4, restated) Let $X$ and $Y$ be random variables over $[M]$ such that $\Delta(X, Y) \geq \varepsilon$. Let $\mathbf{X} = (X_1, \ldots, X_T)$ be $T$ i.i.d. copies of $X$, and let $\mathbf{Y} = (Y_1, \ldots, Y_T)$ be $T$ i.i.d. copies of $Y$. Then $\Delta(\mathbf{X}, \mathbf{Y}) \geq \min\{\varepsilon_0, c\sqrt{T} \cdot \varepsilon\}$, where $\varepsilon_0, c$ are absolute constants.

Proof. We prove the lemma via the following two claims. The first claim reduces the lemma to the special case in which $X$ is a Bernoulli random variable with bias $\Omega(\varepsilon)$ and $Y$ is a uniform coin; the second claim proves that special case.

Claim B.2 Let $X$ and $Y$ be random variables over $[M]$ such that $\Delta(X, Y) = \varepsilon$. Then there exists a randomized function $f : [M] \to \{0, 1\}$ such that $f(Y) = U_{\{0,1\}}$ and $\Delta(f(X), f(Y)) \geq \varepsilon/2$.

Proof of claim: By definition, there exists a set $T \subset [M]$ such that $|\Pr[X \in T] - \Pr[Y \in T]| = \varepsilon$. Without loss of generality, we may assume $\Pr[Y \in T] = p \leq 1/2$ (otherwise, take the complement of $T$). Let $g : [M] \to \{0, 1\}$ be the indicator function of $T$, so $\Pr_Y[g(Y) = 1] = p$. For every $x \in [M]$, define $f(x) = g(x) \vee b$, where $b$ is a biased coin with $\Pr[b = 0] = 1/(2(1-p))$. The claim follows by observing that
$$\Pr[f(Y) = 0] = \Pr[g(Y) = 0 \wedge b = 0] = (1-p) \cdot \frac{1}{2(1-p)} = \frac{1}{2},$$
and $\Delta(f(X), f(Y)) \geq \Delta(X, Y) \cdot \Pr[b = 0] \geq \varepsilon/2$. $\square$

Claim B.3 Let $X$ be a Bernoulli random variable over $\{0, 1\}$ such that $\Pr[X = 0] = (1-\varepsilon)/2$. Let $\mathbf{X} = (X_1, \ldots, X_T)$ be $T$ independent copies of $X$. Then $\Delta(\mathbf{X}, U_{\{0,1\}^T}) \geq \min\{\varepsilon_0, c\sqrt{T}\varepsilon\}$, where $\varepsilon_0, c$ are absolute constants independent of $\varepsilon$ and $T$.
Proof of claim: For $x \in \{0,1\}^T$, let the weight $\mathrm{wt}(x)$ of $x$ be the number of 1's in $x$, and let
$$S = \left\{x \in \{0,1\}^T : \mathrm{wt}(x) \leq \frac{T}{2} - \sqrt{T}\right\}$$
be the subset of $\{0,1\}^T$ of small weight. (This choice of $S$ is the main source of improvement in our proof compared to that of Reyzin [Rey04], who instead considers the set of all $x$ with weight at most $T/2$.) For every $x \in S$, we have
$$\Pr[\mathbf{X} = x] \leq \frac{1}{2^T} \cdot (1-\varepsilon)^{T/2+\sqrt{T}} \cdot (1+\varepsilon)^{T/2-\sqrt{T}} \leq \left(1 - \min\left\{\frac{\sqrt{T}\varepsilon}{2}, \frac{1}{2}\right\}\right) \cdot \Pr[U_{\{0,1\}^T} = x].$$
By standard results on large deviations, we have $\Pr[U_{\{0,1\}^T} \in S] \geq \Omega(1)$. Combining the above two inequalities, we get
$$\Delta(\mathbf{X}, U_{\{0,1\}^T}) \geq \Pr[U_{\{0,1\}^T} \in S] - \Pr[\mathbf{X} \in S] \geq \left(1 - \left(1 - \min\left\{\frac{\sqrt{T}\varepsilon}{2}, \frac{1}{2}\right\}\right)\right) \cdot \Pr[U_{\{0,1\}^T} \in S] \geq \min\left\{\frac{\sqrt{T}\varepsilon}{2}, \frac{1}{2}\right\} \cdot \Omega(1) = \min\{c\sqrt{T}\varepsilon, \varepsilon_0\}$$
for some absolute constants $c, \varepsilon_0$, which completes the proof. $\square$

Note that applying the same randomized function $f$ to two random variables $X$ and $Y$ cannot increase the statistical distance, i.e., $\Delta(f(X), f(Y)) \leq \Delta(X, Y)$. The lemma now follows immediately from the two claims:
$$\Delta(\mathbf{X}, \mathbf{Y}) \geq \Delta\big((f_1(X_1), \ldots, f_T(X_T)),\ (f_1(Y_1), \ldots, f_T(Y_T))\big) \geq \min\{\varepsilon_0, c\sqrt{T}\varepsilon\},$$
where $f_1, \ldots, f_T$ are independent copies of the randomized function defined in Claim B.2, and $\varepsilon_0, c$ are the absolute constants from Claim B.3.
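Claim B.3 (and hence Lemma 4.4) is easy to check numerically. For i.i.d. coin flips, the product probabilities depend only on the Hamming weight, so $\Delta(\mathbf{X}, U_{\{0,1\}^T})$ equals the total variation distance between two binomial distributions, which the following sketch (ours, not from the paper) computes exactly; the ratio of the distance to $\sqrt{T}\cdot\varepsilon$ stays bounded below by a constant in the regime before the distance saturates:

```python
from math import comb

def tv_iid_coins(T, eps):
    # Exact Delta((X_1,...,X_T), U_{{0,1}^T}) for i.i.d. coins with
    # Pr[X_i = 0] = (1 - eps)/2: the probability of a string depends only on
    # its number of ones, so this equals TV(Bin(T, (1+eps)/2), Bin(T, 1/2)).
    p = (1 + eps) / 2
    return 0.5 * sum(abs(comb(T, k) * (p**k * (1 - p)**(T - k) - 0.5**T))
                     for k in range(T + 1))

eps = 0.01
for T in (1, 16, 64, 256):
    d = tv_iid_coins(T, eps)
    print(T, d, d / (T**0.5 * eps))   # the ratio stays Omega(1), per Claim B.3
```

(Keeping $T$ modest avoids floating-point underflow in the binomial terms; an exact-arithmetic variant with `fractions.Fraction` would remove that caveat.)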
