Fully De-Amortized Cuckoo Hashing for Cache-Oblivious Dictionaries and Multimaps


Authors: Michael T. Goodrich, Daniel S. Hirschberg, Michael Mitzenmacher, Justin Thaler

Michael T. Goodrich, Dept. of Computer Science, University of California, Irvine, Irvine, CA 92697-3435 USA, goodrich(at)acm.org
Daniel S. Hirschberg, Dept. of Computer Science, University of California, Irvine, Irvine, CA 92697-3435 USA, dan(at)ics.uci.edu
Michael Mitzenmacher, Dept. of Computer Science, Harvard University, Cambridge, MA 02138, michaelm(at)eecs.harvard.edu
Justin Thaler, Dept. of Computer Science, Harvard University, Cambridge, MA 02138, jthaler(at)fas.harvard.edu

Abstract

A dictionary (or map) is a key-value store that requires all keys be unique, and a multimap is a key-value store that allows for multiple values to be associated with the same key. We design hashing-based indexing schemes for dictionaries and multimaps that achieve worst-case optimal performance for lookups and updates, with a small or negligible probability that the data structure will require a rehash operation, depending on whether we are working in the external-memory (I/O) model or one of the well-known versions of the Random Access Machine (RAM) model. One of the main features of our constructions is that they are fully de-amortized, meaning that their performance bounds hold without one having to tune their constructions with certain performance parameters, such as the constant factors in the exponents of failure probabilities or, in the case of the external-memory model, the size of blocks or cache lines and the size of internal memory (i.e., our external-memory algorithms are cache oblivious). Our solutions are based on a fully de-amortized implementation of cuckoo hashing, which may be of independent interest.
This hashing scheme uses two cuckoo hash tables, one "nested" inside the other, with one serving as a primary structure and the other serving as an auxiliary supporting queue/stash structure that is super-sized with respect to traditional auxiliary structures but nevertheless adds negligible storage to our scheme. This auxiliary structure allows the success probability for cuckoo hashing to be very high, which is useful in cryptographic or data-intensive applications.

1 Introduction

A dictionary (or map) is a key-value store that requires all keys be unique, and a multimap [3] is a key-value store that allows for multiple values to be associated with the same key. Such structures are ubiquitous in the "inner-loop" computations involved in various algorithmic applications. Thus, we are interested in implementations of these abstract data types that are based on hashing and use O(n) words of storage, where n is the number of items in the dictionary or multimap. In addition, because such solutions are used in real-time applications, we are interested in implementations that are de-amortized, meaning that they have asymptotically optimal worst-case lookup and update complexities, but may have small probabilities of overflowing their memory spaces. Moreover, we would like these lookup and update bounds to hold without requiring that we build such a data structure specifically "tuned" to certain performance parameters, since it is not always possible to anticipate such parameters at the time such a data structure is deployed (especially if the length p(n) of the sequence of operations on the structure is not known in advance). For instance, if we wish for a failure probability that is bounded by 1/n^c or 1/n^{c log n}, for some constant c > 0, we should not be required to build the data structure using an amount of space or other components that are parameterized by c.
(For example, our first construction gives a single algorithm parameterized only by n, and for which, for any s > 0, lookups take time at most s with probability that depends on s. Previous constructions are parameterized by c as well, and lookups take time c with probability 1 − 1/n^{c′} for fixed constants c, c′.) Likewise, in the external-memory model, we would like solutions that are cache-oblivious [21], meaning that they achieve their performance bounds without being tuned for the parameters of the memory hierarchy, like the size, B, of disk blocks, or the size, M, of internal memory. We refer to solutions that avoid such parameterized constructions as fully de-amortized.

By extending and combining various ideas in the algorithms literature, we show how to develop fully de-amortized data structures, based on hashing, for dictionaries and multimaps. Without specifically tuning our structures to constant factors in the exponents, we provide solutions with performance bounds that hold with high probability, meaning that they hold with probability 1 − 1/poly(n), or with overwhelming probability, meaning that they hold with probability 1 − 1/n^{ω(1)}. We also use the term with small probability to mean 1/poly(n), and with negligible probability to mean 1/n^{ω(1)}. Briefly, we are able to achieve the following bounds:

• For dictionaries, we present two fully de-amortized constructions. The first is for the Practical RAM model [33], and it performs all standard operations (lookup, insert, delete) in O(1) steps in the worst case, where all guarantees hold for any polynomial number of operations with high probability. The second works in the external-memory (I/O) model, the standard RAM model, and the AC^0 RAM model [40], and also achieves O(1) worst-case operations, where these guarantees hold with overwhelming probability.
• For multimaps, we provide a fully de-amortized scheme that in addition to the standard operations can quickly return or delete all values associated with a key. Our construction is suitable for external memory and is cache oblivious in this setting. For instance, our external-memory solution returns all n_k values associated with a key in O(1 + n_k/B) I/Os, where B is the block size, but it is not specifically tuned with respect to the parameter B.

Our algorithms use linear space and work in the online setting, where each operation must be completed before performing the next. Our solutions for dictionaries and multimaps include the design of a variation on cuckoo hash tables, which were presented by Pagh and Rodler [36] and studied by a variety of other researchers (e.g., see [16, 28, 34]). These structures use the freedom to place each key-value pair in one of two hash tables to achieve worst-case constant-time lookups and removals and amortized constant-time insertions, where operations fail with polynomially small probability. We obtain the first fully de-amortized variation on cuckoo hash tables, with negligible failure probability to boot.

1.1 Motivations and Models

Key-value associations are used in many applications, and hash-based dictionary schemes are well-studied in the literature (e.g., see [14]). Multimaps [3] are less studied, although a multimap can be viewed as a dynamic inverted file or inverted index (e.g., see Knuth [26]). Given a collection, Γ, of documents, an inverted file is an indexing strategy that allows one to list, for any word w, all the documents in Γ where w appears. Multimaps also provide a natural representation framework for adjacency lists of graphs, with nodes being keys and adjacent edges being values associated with a key. For other applications, please see Angelino et al. [3].
Hashing-based implementations of dictionaries and multimaps must necessarily be designed in a computational model that supports indexed addressing, such as the Random Access Machine (RAM) model (e.g., see [14]) or the external-memory (I/O) model (e.g., see [1, 42]). Thus, our focus in this paper is on solutions in such models. There are, in fact, several versions of the RAM model, and, rather than insist that our solutions be implemented in a specific version, we consider solutions for several of the most well-known versions:

• The standard RAM: all arithmetic and comparison operations, including integer addition, subtraction, multiplication, and division, are assumed to run in O(1) time.^1

• The Practical RAM [33]: integer addition, as well as bit-wise Boolean and SHIFT operations on words, are assumed to run in O(1) time, but not integer multiplication and division.

• The AC^0 RAM [40]: any AC^0 function can be performed in constant time on memory words, including addition and subtraction, as well as several bit-level operations included in the instruction sets of modern CPUs, such as Boolean and SHIFT operations. This model does not allow for constant-time multiplication, however, since multiplication is not in AC^0 [20].

Thus, the AC^0 RAM is arguably the most realistic, the standard RAM is the most traditional, and the Practical RAM is a restriction of both. As we are considering dictionary and multimap solutions in all of these models, we assume that the hash functions being used are implementable in the model in question, that they run in O(1) time, and that they are sufficiently random to support cuckoo hashing. This assumption is supported in practice, for instance, by the fact that one of the most widely-used hash functions, SHA-1, can be implemented in O(1) time in the Practical RAM model. See also Section 3.5 for discussion of the theoretical foundations of this assumption.
A framework that is growing in interest and impact for designing algorithms and data structures in the external-memory model is the cache-oblivious design paradigm, introduced by Frigo et al. [21]. In this external-memory paradigm, one designs an algorithm or data structure to minimize the number of I/Os between internal memory and external memory, but the algorithm must not be explicitly parameterized by the block size, B, or the size of internal memory, M. The advantage of this approach is that one such algorithm can comfortably scale across all levels of the memory hierarchy and can also be a better match for modern compilers that perform predictive memory fetches.

Our notion of a "fully de-amortized" data structure extends the cache-oblivious design paradigm in two ways. First, it requires that all operations be characterized in terms of their worst-case performance, not their amortized performance. Second, it extends to internal-memory models the notion of avoiding specific tuning of the data structure in terms of non-essential parameters. Formally, we say that a data structure is fully de-amortized if its performance bounds hold in the worst case and the only parameter its construction details depend on is n, the number of items it stores. Thus, a fully de-amortized data structure implemented in external memory is automatically cache-oblivious and its I/O bounds hold in the worst case. In addition to being fully de-amortized, our strongest constructions have performance bounds that hold with overwhelming probability.

^1 Most, but not all, researchers also assume that the standard RAM supports bit-wise Boolean operations, such as AND, OR, and XOR; hence, we also allow for these as constant-time operations in the standard RAM model.
While our aim of achieving structures that provide worst-case constant-time operations with overwhelming probability instead of with high probability may seem like a subtle improvement, there are many applications where it is essential. In particular, it is common in cryptographic applications to aim for negligible failure probabilities. For example, cuckoo hashing structures with negligible failure probabilities have recently found applications in oblivious RAM simulations [22]. Moreover, a significant motivation for de-amortized cuckoo hashing is to prevent timing attacks and clocked adversaries from compromising a system [5]. Finally, guarantees that hold with overwhelming probability allow us to handle super-polynomially long sequences of updates, as long as the total number of items resident in the dictionary is bounded by n at all times. This may be useful in long-running or data-intensive applications.

1.2 Previous Related Work

Since the introduction of the cache-oblivious framework by Frigo et al. [21], several cache-oblivious algorithms have subsequently been presented, including cache-oblivious B-trees [7], cache-oblivious binary search trees [9], and cache-oblivious sorting [10]. Pagh et al. [37] describe a scheme for cache-oblivious hashing, which is based on linear probing and achieves O(1) expected-time performance for lookups and updates, but it does not achieve constant time bounds for any of these operations in the worst case.

As mentioned above, the multimap ADT is related to the inverted file and inverted index structures, which are well-known in text indexing applications (e.g., see Knuth [26]) and are also used in search engines (e.g., see Zobel and Moffat [44]). Cutting and Pedersen [15] describe an inverted file implementation that uses B-trees for the indexing structure and supports insertions, but doesn't support deletions efficiently.
More recently, Luk and Lam [31] describe an internal-memory inverted file implementation based on hash tables with chaining, but their method also does not support fast item removals. Lester et al. [29, 30] and Büttcher et al. [13] describe external-memory inverted file implementations that support item insertions only. Büttcher and Clarke [12] consider trade-offs for allowing for both item insertions and removals, and Guo et al. [23] give a solution for performing such operations by using a B-tree variant. Finally, Angelino et al. [3] describe an efficient external-memory data structure for the multimap ADT, but like the above-mentioned work on inverted files, their method is not cache-oblivious; hence, it is not fully de-amortized.

Also as mentioned above, our solutions include the design of a variation on cuckoo hash tables, which were presented by Pagh and Rodler [36] and studied by a variety of other researchers (e.g., see [16, 28, 34]). These structures use the freedom to place each key-value pair in one of two hash tables to achieve worst-case constant-time lookups and removals and amortized constant-time insertions with high probability. Kirsch and Mitzenmacher [24] and Arbitman et al. [5] study a method for de-amortizing cuckoo hashing, which achieves constant-time lookups, insertions, and deletions with high probability, and uses space (2 + ε)n for any constant ε > 0 (as is standard in cuckoo hashing). These methods are not fully de-amortized, however, since, in order to achieve a failure probability of 1/n^c, they construct an auxiliary structure consisting of O(c) small lookup tables. In contrast, neither of our dictionary constructions is parameterized by c; furthermore, our second construction provides guarantees that hold with overwhelming probability rather than with high probability. In a subsequent paper, Arbitman et al.
[6] study a hashing method that achieves worst-case constant-time lookups, insertions, and removals with high probability while maintaining loads very close to 1, but their method also is not fully de-amortized. Kirsch, Mitzenmacher, and Wieder [25] introduced the notion of a stash for cuckoo hashing, which allows the failure probability to be reduced to O(1/n^α), for any α > 0, by using a constant-sized adjunct memory to store items that wouldn't otherwise be able to be placed. One of our novel additions in this paper is to demonstrate that by using super-constant-sized stashes, along with a variation of the q-heap data structure, we can ensure failures happen only with negligible probability while maintaining constant-time lookup and delete operations.

1.3 Our Results

In this paper we describe efficient hashing-based implementations of the dictionary and multimap ADTs. Our constructions are fully de-amortized and are alternately designed for the external-memory (I/O) model and the well-known versions of the RAM model mentioned above. Because they are fully de-amortized, our external-memory algorithms are cache-oblivious.

We begin by presenting two new fully de-amortized cuckoo hashing schemes in Section 2. Both of our constructions provide O(1) worst-case lookups, insertions, and deletions, where the guarantees of the first construction (in the Practical RAM model) hold with high probability, and the guarantees of the second construction (in the external-memory model, the standard RAM model, and the AC^0 RAM model) hold with overwhelming probability. Moreover, these results hold even if we use polylog(n)-wise independent hash functions. Like the construction of Arbitman et al. [5], both of our dictionaries use space (2 + ε)n for any constant ε > 0, though when combined with another result of Arbitman et al. [6], we can achieve (1 + ε)n words of space for any constant ε > 0 (see Section 3.5).
Our second dictionary can be seen as a quantitative improvement over the previous solution for de-amortized cuckoo tables [5], as the guarantees of [5] only hold with high probability (and their solution is not fully de-amortized). Both of our dictionary constructions utilize a cuckoo hash table that has another cuckoo hash table as an auxiliary structure. This secondary cuckoo table functions simultaneously as an operation queue (for the sake of de-amortization [5, 6, 24]) and a stash (for the sake of improved probability of success [25]).

Our second construction also makes use of a data structure for the AC^0 RAM model that we call the atomic stash. This structure maintains a small dictionary, of size at most O(log^{1/2} n), so as to support constant-time worst-case insertions, deletions, and lookups, using O(log^{1/2} n) space. This data structure is related to the q-heap or atomic heap data structure of Fredman and Willard [20] (see also [43]), which requires the use of lookup tables of size O(n^ε) and is limited to sets of size O(log^{1/6} n). Andersson et al. [2] and Thorup [40] give alternative implementations of these data structures in the AC^0 RAM model, but these methods still need precomputed functions encoded in table lookups or have time bounds that are amortized instead of worst-case. Our methods instead make no use of precomputed lookup tables, hold in the worst case, and use techniques that are simpler than those of Andersson et al. and Thorup. We emphasize that our results do not depend on our specific atomic stash implementation; one could equally well use q-heaps, atomic heaps, or other data structures that allow constant-time lookups into data sets of size ω(1) (under suitable assumptions, such as keys fitting into a memory word) in order to obtain bounds that hold with overwhelming probability in the standard or AC^0 RAM models.
We view this combination of cuckoo hash tables with atomic stashes (or other similar data structures) as an important contribution, as they allow super-constant-sized stashes for cuckoo hashing while still maintaining constant-time lookups.

In Section 4, we also show how to build on our fully de-amortized cuckoo hashing scheme to give an efficient cache-oblivious multimap implementation in the external-memory model. Our multimap implementation utilizes two instances of the nested cuckoo structure of Section 2, together with arrays for storing collections of key-value pairs that share a common key. This implementation assumes that there is a cache-oblivious mechanism to allocate and deallocate power-of-two sized memory blocks with constant-factor space and I/O overhead; this assumption is theoretically justified by the results of Brodal et al. [11]. A lower bound of Verbin and Zhang [41] implies that our bounds are optimal up to constant factors in the external-memory model, even if we did not support fast findAll and removeAll operations. In addition to a theoretical analysis of our data structures, we have performed preliminary experiments with an implementation, and a later writeup will include full details of these. The time bounds we achieve for the dictionary and multimap ADT methods are shown in Table 1. Our space bounds are all O(n), for storing a dictionary or multimap of size at most n.

Method               Dictionary I/O Performance   Multimap I/O Performance
add(k, v)            O(1)                         O(1)
containsKey(k)       O(1)                         O(1)
containsItem(k, v)   O(1)                         O(1)
remove(k, v)         O(1)                         O(1)
get(k)/getAll(k)     O(1)                         O(1 + n_k/B)
removeAll(k)         --                           O(1)

Table 1: Performance bounds for our dictionary and multimap implementations, which all hold in the worst case with overwhelming probability, except for implementations in the Practical RAM model, in which case the above bounds hold with high probability.
These bounds are asymptotically optimal. We use B to denote the block size, and n_k to denote the number of items with key equal to k.

2 Nested Cuckoo Hashing

In this section, we describe both of our nested cuckoo hash table data structures, which provide fully de-amortized dictionary data structures with worst-case O(1)-time lookups and removals and O(1)-time insertions with high and overwhelming probability, respectively. At a high level, this structure is similar to that of Arbitman et al. [5] and Kirsch and Mitzenmacher [24], in that our scheme and these schemes use a cuckoo hash table as a primary structure, and an auxiliary queue/stash structure to de-amortize insertions and reduce the failure probability. But our approach substantially differs from prior methods in the details of the auxiliary structure. In particular, our auxiliary structure is itself a full-fledged cuckoo hash table, with its own (much smaller) queue and stash, whereas prior methods use more traditional queues for the auxiliary structure. Our two dictionary constructions are identical, except for the implementation of the "inner" cuckoo hash table's queue and stash. Below, we describe both constructions simultaneously, highlighting their differences when necessary.

2.1 The Components of Our Structure

Our dictionaries maintain a dynamic set, S, of at most n items. Our primary storage structure is a cuckoo hash table, T, subdivided into two subtables, T_0 and T_1, together with two random hash functions, h_0 and h_1. Each table T_i stores at most one item in each of its m cells, and we assume m ≥ (1 + ε)n, where n is the total number of items, for some fixed constant ε > 0. For any item x = (k, v) that is stored in T, x will either be stored in cell T_0[h_0(k)] or in cell T_1[h_1(k)]. Some of the items in S may not be stored in T, however. They will instead be stored in an auxiliary structure, Q.
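In isolation, the primary structure T behaves like a standard cuckoo table. As a point of reference, here is a minimal sketch of plain (amortized) cuckoo hashing in the style of Pagh and Rodler [36]; this is not our de-amortized scheme, and the class name, table size, seeding, and eviction limit are illustrative choices only:

```python
import random

class CuckooHashTable:
    """Plain cuckoo hashing: each key k lives in one of exactly two cells,
    T0[h0(k)] or T1[h1(k)], so get/remove probe at most two cells."""

    def __init__(self, m=64, max_kicks=32):
        self.m = m
        self.tables = [[None] * m, [None] * m]
        self.max_kicks = max_kicks  # eviction budget before giving up (rehash)
        self.seeds = [random.randrange(2**32) for _ in range(2)]

    def _h(self, i, key):
        return hash((self.seeds[i], key)) % self.m

    def get(self, key):
        for i in (0, 1):
            cell = self.tables[i][self._h(i, key)]
            if cell is not None and cell[0] == key:
                return cell[1]
        return None

    def remove(self, key):
        for i in (0, 1):
            j = self._h(i, key)
            cell = self.tables[i][j]
            if cell is not None and cell[0] == key:
                self.tables[i][j] = None
                return True
        return False

    def add(self, key, value):
        self.remove(key)  # dictionary semantics: keys are unique
        item, i = (key, value), 0
        for _ in range(self.max_kicks):
            j = self._h(i, item[0])
            item, self.tables[i][j] = self.tables[i][j], item
            if item is None:
                return  # placed (possibly after a chain of evictions)
            i = 1 - i   # the evicted item tries its cell in the other table
        raise RuntimeError("eviction walk too long; a rehash would be needed")
```

Lookups and removals probe at most two cells; only add can exceed constant time, which is exactly the amortization that the nested construction below removes.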
The structure Q is simultaneously two double-ended queues (deques), both of which support fast enqueue and dequeue operations at either the front or the rear, and a cuckoo hash table with its own queue and stash. Because Q is a cuckoo hash table, it supports worst-case constant-time lookups for items based on their keys. The two deques correspond to the queue and stash of the primary structure; we call them OuterQueue and OuterStash, respectively. Each item x that is stored in Q is therefore also augmented with a prev pointer, which points to the predecessor of x in its deque, and a next pointer, which points to the successor of x in its deque. We also augment both deques with front and rear pointers, which respectively point to the first element in the deque and the last element in the deque.

The "inner" cuckoo hash structure consists of two tables, R_0 and R_1, each having m^{2/3} cells, together with two random hash functions, f_0 and f_1, as well as a small list, L, which is used to implement two deques that we call InnerQueue and InnerStash, respectively. Each item x = (k, v) that is stored in Q will be located in cell R_0[f_0(k)] or in cell R_1[f_1(k)], or it will be in the list L. The only manner in which our two constructions differ is in the implementation of L. In summary, our dictionary data structure consists simply of the "outer" cuckoo table T with a queue and stash, and the "inner" cuckoo table Q, which contains its own queue and stash.

2.2 Operations on the Primary Structure

To perform a lookup, that is, a get(k), we try each of the cells T_0[h_0(k)], T_1[h_1(k)], R_0[f_0(k)], and R_1[f_1(k)], and perform a lookup in L, until we either locate an item, x = (k, v), or we determine that there is no item in these locations with key equal to k, in which case we conclude that there is no such item in S.
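A get(k) therefore probes a constant number of locations regardless of n. The following sketch illustrates the probe sequence, using a plain dict-of-arrays layout and caller-supplied hash functions as stand-ins for the actual structure (which also carries the deque pointers described above):

```python
def make_nested(m, m_inner, h0, h1, f0, f1):
    # Cell arrays standing in for T = (T0, T1), the inner table R = (R0, R1),
    # and the list L; each cell holds a (key, value) pair or None.
    return {"T": [[None] * m, [None] * m],
            "R": [[None] * m_inner, [None] * m_inner],
            "L": [],
            "h": (h0, h1), "f": (f0, f1)}

def get(D, k):
    """Probe T0[h0(k)], T1[h1(k)], R0[f0(k)], R1[f1(k)], then scan L."""
    for i in (0, 1):
        cell = D["T"][i][D["h"][i](k)]
        if cell is not None and cell[0] == k:
            return cell[1]
    for i in (0, 1):
        cell = D["R"][i][D["f"][i](k)]
        if cell is not None and cell[0] == k:
            return cell[1]
    for key, v in D["L"]:   # L is kept very small, so this scan is cheap
        if key == k:
            return v
    return None             # no such key in S
```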
Likewise, to perform a remove(k) operation, we first perform a get(k) operation. If an item x = (k, v) is found in one of the locations T_0[h_0(k)] or T_1[h_1(k)], then we simply remove this item from this cell. If such an x is found in R_0[f_0(k)], or R_1[f_1(k)], or in the structure L, then we remove x from this cell and we remove x from its deque(s) by updating the prev and next pointers for x's neighbors so that they now point to each other.

Performing an add(k, v) operation is somewhat more involved. We begin by performing an enqueueLast(x, 0, OuterQueue) operation, which adds x = (k, v) at the end of the deque OuterQueue, together with the bit 0 to indicate that x should next be inserted in T_0. Then, for constants α, α′ ≥ 1, set in the analysis, we perform α (outer) insertion substeps, followed by α′ (inner) insertion substeps. Each outer insertion substep begins by performing a dequeueFront(OuterQueue) operation to remove the pair ((k, v), b) at the front of the deque. If the cell T_b[h_b(k)] is empty, then we add (k, v) to this cell and this ends this substep. Otherwise, we evict the current item, y, in the cell T_b[h_b(k)], replacing it with (k, v). If this insertion substep just created a second cycle in the insertion process (which we can detect by a simple marking scheme, using O(log^2 n) space with overwhelming probability),^2 then we complete the insertion substep by performing an enqueueLast(y, b′, OuterStash) operation, where b′ = (b + 1) mod 2. If this insertion substep has not created a second cycle, however, then we complete the insertion substep by performing an enqueueFirst(y, b′, OuterQueue) operation, where b′ = (b + 1) mod 2, which adds the pair (y, b′) to the front of the primary structure's queue.
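One round of outer insertion substeps can be sketched as follows. This is a simplified illustration: ALPHA is an arbitrary fixed constant, and seen_twice is a hypothetical oracle standing in for the marking-based cycle-detection mechanism, which is not implemented here:

```python
from collections import deque

ALPHA = 3  # number of outer insertion substeps per add(); illustrative value

def add(T, h, outer_queue, outer_stash, k, v, seen_twice):
    """One add(k, v): enqueue the new pair, then run ALPHA de-amortized
    substeps. seen_twice(b, pos) stands in for cycle detection."""
    outer_queue.append(((k, v), 0))              # enqueueLast(x, 0, OuterQueue)
    for _ in range(ALPHA):
        if not outer_queue:
            break
        (key, val), b = outer_queue.popleft()    # dequeueFront(OuterQueue)
        pos = h[b](key)
        if T[b][pos] is None:
            T[b][pos] = (key, val)               # empty cell: substep done
            continue
        evicted = T[b][pos]                      # evict the occupant y
        T[b][pos] = (key, val)
        b_next = (b + 1) % 2
        if seen_twice(b, pos):                   # second cycle detected
            outer_stash.append((evicted, b_next))      # -> OuterStash (rear)
        else:
            outer_queue.appendleft((evicted, b_next))  # -> OuterQueue (front)
```

Because each call performs at most ALPHA substeps, the time per add is bounded by a constant; items that could not be placed simply wait at the front of the queue for the next operation's substeps.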
Thus, by design, an add operation takes O(α) = O(1) time in the worst case, assuming the operations on Q run in O(1) time and succeed. Finally, similar to [4], every m^{1/4} operations we try to insert an element from the stash by performing a dequeueFront(OuterStash) operation to remove the pair ((k, v), b) at the front of the deque and then spending 2 moves trying to insert (k, v). If no free slot is found, we return the current element from the connected component of (k, v) to the front of the stash. This ensures that items belonging to connected components that have had cycles removed via deletions do not remain in the stash long after the deletions occur. All constants mentioned throughout are chosen for theoretical convenience; no effort has been made to optimize them.

2.3 Operations on the Auxiliary Structure

As mentioned above, the auxiliary structure, Q, is a standard cuckoo table of size m′ = m^{2/3} augmented with its own (inner) queue and stash maintained via the structure L, and pointers to give Q the functionality of two double-ended queues, called OuterQueue and OuterStash. The enqueue and dequeue operations to OuterQueue and OuterStash, therefore, involve standard O(1)-time pointer updates to maintain the deque property, plus insertion and deletion algorithms for the inner cuckoo table. Our inner insertion algorithm is different from our outer one in that we do not immediately place items on the back of the inner queue. Instead, on an insertion of item (k, v), we spend 16 steps trying to insert (k, v) into the inner cuckoo table immediately. If the insertion does not complete in 16 steps, we place the current element from the connected component of (k, v) in the back of InnerQueue.

^2 [5] refers to this as a cycle-detection mechanism, and notes there are many possible instantiations. See also [25].
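The eager inner insertion policy can be sketched in the same style; the 16-step budget matches the text, while the table layout and queue handling are simplified stand-ins:

```python
def inner_insert(R, f, inner_queue, item, budget=16):
    """Spend up to `budget` eviction moves trying to place `item` in the
    sparse inner table (R0, R1); if the walk does not finish, park the
    currently displaced element at the back of InnerQueue."""
    i = 0
    for _ in range(budget):
        j = f[i](item[0])
        item, R[i][j] = R[i][j], item  # place, possibly evicting an occupant
        if item is None:
            return True                # done within the budget (common case)
        i = 1 - i                      # displaced element tries its other cell
    inner_queue.append((item, i))      # give up for now; queue the straggler
    return False
```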
The reason for this policy is that, as we will show, with overwhelming probability almost every item in the inner table can be inserted in 16 steps, due to the extreme sparsity of the table. Finally, every m^{1/6} "inner" operations we additionally spend 16 moves trying to insert an element from the front of InnerQueue and a single move trying to insert an element from the front of InnerStash, returning elements to the back of InnerQueue or InnerStash in the event that a vacant slot has not been found. The purpose of this is to ensure that items whose connected components have shrunk due to deletions do not remain in the inner queue or stash unnecessarily.

The only manner in which our two constructions differ is in the implementation of L, which supports the inner stash and inner queue. In our first construction, for the Practical RAM model, L is simply two double-ended queues (one for InnerQueue, and one for InnerStash), and when performing a lookup in L, we simply look at all the cells in both deques. In our second construction, we implement L using a data structure we call the atomic stash. This structure maintains a small dictionary, of size at most O(log^{1/2} n), so as to support constant-time worst-case insertions, deletions, and lookups, using O(log^{1/2} n) space. Thus, in our second construction, the structure L is simultaneously two deques and an atomic stash to enable fast lookups into L. Details are in Appendix A.

3 Analysis

The critical insights into why such a standard approach is sufficient to support fully de-amortized constant-time updates and lookups for the auxiliary structure, Q, are based on enhancing the analysis of previous work for cuckoo hash tables.

Definition 1: A sequence π of insert, delete, and lookup operations is n-bounded if at any point in time during the execution of π the data structure contains at most n elements.

We prove the following two theorems.
Theorem 1: Let C be the inner cuckoo hash table consisting of two tables of size m^{2/3}, together with a queue/stash L as described above. For any polynomial p(n), any m^{1/3}-bounded sequence of operations π′ on C of length p(n), and any s ≤ m^{1/6}, every insert into C completes in O(1) steps, and L has size at most s, with probability at least 1 − p(n)/m^{Ω(√s)}.

Theorem 1 states that with overwhelming probability, all insertions performed on the auxiliary structure Q will run in O(1) time, and all of the inner table's internal data structures will stay small (ensuring fast lookups), provided we never try to put more than m^{1/3} items in Q. The following theorem addresses this size condition.

Theorem 2: Let T be a cuckoo hashing scheme consisting of two tables of size m. Let m ≥ (1 + ε)n, for some constant ε > 0, and let Q be the queue/stash as described above. For any polynomial p(n) and any n-bounded sequence of operations π of length p(n), the number of items stored in Q will never be more than 2 log^6 n, with probability at least 1 − 1/n^{Ω(log n)}.

3.1 Notation and Preliminary Lemmata

A standard tool in the analysis of cuckoo hashing is the cuckoo graph G. Specifically, given a cuckoo hash table T, subdivided into two tables T_0 and T_1, each with m cells, and hash functions h_0 and h_1, the cuckoo graph for T is a bipartite multigraph (without self-loops), in which left vertices represent the table cells in T_0 and right vertices represent the cells in T_1. Each key x inserted into the hash table corresponds to an edge connecting T_0[h_0(x)] to T_1[h_1(x)]. Thus, if S denotes the set of n items in T, and each of T_0 and T_1 has m cells, the cuckoo graph G(S, h_0, h_1) contains 2m nodes (m on each side) and n edges. Given a cuckoo graph G(S, h_0, h_1), we use C_{S,h_0,h_1}(v) to denote the number of edges in the connected component that contains v.
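To make the definition concrete, the following sketch builds the component structure of a cuckoo graph with a small union-find, returning C(v), the number of edges in v's component, for every vertex (vertices 0..m−1 are cells of T_0 and m..2m−1 are cells of T_1). This is illustrative code for experimenting with the analysis, not part of the construction:

```python
from collections import defaultdict

def cuckoo_graph_components(keys, h0, h1, m):
    """Union-find over the 2m cells: each key x adds the edge
    (h0(x), h1(x)) between a left cell and a right cell.
    Returns {v: number of edges in v's connected component}."""
    parent = list(range(2 * m))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    edge_count = defaultdict(int)          # edges stored at each component root
    for x in keys:
        u, v = find(h0(x) % m), find(h1(x) % m + m)
        edges = edge_count.pop(u, 0) + (edge_count.pop(v, 0) if v != u else 0)
        parent[u] = v                      # merge (no-op if already joined)
        edge_count[find(v)] = edges + 1    # the new edge joins the total
    return {v: edge_count.get(find(v), 0) for v in range(2 * m)}
```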
In our analysis, there will actually be two cuckoo graphs, $G$ and $G'$. $G$ corresponds to the outer table, and has vertex set $[m] \times [m]$ and at most $n$ edges at any point in time. $G'$ corresponds to the inner table, and has vertex set $[m'] \times [m']$ for $m' = m^{2/3}$ and (we will show) $n' \le m^{1/3}$ edges with overwhelming probability at any point in time. We will make frequent use of the following lemmata.

Lemma 3: Let $G(S, h_0, h_1)$ be a cuckoo graph with $n$ edges and vertex set $[m] \times [m]$. Then
\[ \Pr(C_{S,h_0,h_1}(v) \ge k) \le \left(\frac{nk}{m}\right)^k \frac{1}{k!}, \]
where the probability is over the choice of $h_0$ and $h_1$.

Proof: By standard arguments (e.g., [17, Lemma 1]), $\Pr(C_{S,h_0,h_1}(v) \ge k) \le \Pr(\mathrm{Bin}(nk, 1/m) \ge k)$. The lemma follows by an easy calculation: $\Pr(\mathrm{Bin}(nk, 1/m) \ge k) \le \binom{nk}{k} m^{-k} \le \frac{(nk)^k}{k!} m^{-k}$.

We now present some basic facts about the stash. The first bounds the time it takes to determine whether an element needs to be put in the stash on an insertion.

Lemma 4: [4, Claim 5.4] For any element $x$ and any cuckoo graph $G(S, h_0, h_1)$, the number of moves required before $x$ is either placed in the stash or inserted into the cuckoo table is at most $2 C_{S,h_0,h_1}(x)$.

[25, Lemma 2.2] shows that the number of keys that must be stored in the stash corresponds to the quantity $\bar{e}(G(S, h_0, h_1))$, defined as follows. For a connected component $H$ of $G(S, h_0, h_1)$, define the excess $e(H) := \#\mathrm{edges}(H) - \#\mathrm{nodes}(H)$, and define
\[ \bar{e}(G(S, h_0, h_1)) := \sum_{H} \max(e(H), 0), \]
where the sum is over all connected components $H$ of $G(S, h_0, h_1)$. Given a vertex $v$ in this random graph, recall that $C_{S,h_0,h_1}(v)$ denotes the number of edges in the connected component that contains $v$, and let $B_v$ be the number of edges in the component of $v$ that need to be removed to make it acyclic ($B_v$ is also called the cyclotomic number of the component containing $v$).
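The excess $e(H)$, the stash bound $\bar{e}(G)$, and the cyclotomic number $B_v$ just defined can likewise be computed mechanically from the multigraph. An illustrative sketch (names ours), using the standard fact that for a connected component $H$ the cyclotomic number equals $\#\mathrm{edges}(H) - \#\mathrm{nodes}(H) + 1 = e(H) + 1$:

```python
from collections import defaultdict

def excess_and_cyclotomic(keys, h0, h1):
    """For the cuckoo multigraph of the given keys, compute the excess
    e(H) of each component H, the total stash bound e-bar(G), and the
    cyclotomic number B_v = e(H_v) + 1 for each non-isolated vertex."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for x in keys:
        parent[find((0, h0(x)))] = find((1, h1(x)))

    edges, nodes = defaultdict(int), defaultdict(set)
    for x in keys:
        u, w = (0, h0(x)), (1, h1(x))
        edges[find(u)] += 1
        nodes[find(u)].update((u, w))

    excess = {r: edges[r] - len(nodes[r]) for r in edges}
    ebar = sum(max(e, 0) for e in excess.values())
    B = {v: excess[find(v)] + 1 for r in nodes for v in nodes[r]}
    return excess, ebar, B
```

For a tree component this gives $e(H) = -1$ and $B_v = 0$; a component with one cycle gives $e(H) = 0$ and $B_v = 1$, matching the definitions above.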
Notice that $B_v = e(H) + 1$; that is, $B_v$ is 1 more than the number of keys from $v$'s component that need to be placed in the stash. We use the following lemma from [25].

Lemma 5: Let $|S| = n$ with $(1+\epsilon)n \le m$ for some constant $\epsilon$, and let $G(S, h_0, h_1)$ be a cuckoo graph with vertex set $[m] \times [m]$. Then
\[ \Pr(B_v \ge t \mid C_{S,h_0,h_1}(v) = k) \le \left(\frac{3 e^5 k^3}{m}\right)^t. \]

3.2 Proving Theorem 1

Throughout this section, we use the following notation, following that in [5]. Let $\pi'$ be an $m^{1/3}$-bounded sequence of $p(n)$ operations. Denote by $(x_1, \ldots, x_{p(n)})$ the elements inserted by $\pi'$. For any integer $0 < i \le p(n) - m^{1/3}$, let $S_i$ denote the set of elements that are stored in the data structure just before the insertion of $x_i$, and let $\hat{S}_i$ denote $S_i$ together with the elements $\{x_i, x_{i+1}, \ldots, x_{i+m^{1/3}}\}$, ignoring any deletions between time $i$ and time $i + m^{1/3}$. Since $\pi'$ is an $m^{1/3}$-bounded sequence, we have $|S_i| \le m^{1/3}$ and $|\hat{S}_i| \le 2m^{1/3}$ for all $i$. We need the following lemma, which states that all components of $G'$ are tiny.

Lemma 6: Let $|S| = n'$, and let $G'(S, f_0, f_1)$ be any cuckoo graph with vertex set $[m'] \times [m']$. Assume $n'/m' \le 2m^{-1/3}$. Then for any $k \le m^{1/4}$ and any node $v$, $\Pr[C_{S,f_0,f_1}(v) \ge k] \le m^{-\Omega(k)}$.

Proof: Lemma 3 implies that
\[ \Pr(C_{S,f_0,f_1}(v) \ge k) \le \left(\frac{2k}{m^{1/3}}\right)^k \frac{1}{k!}. \]
For $k \le m^{1/4}$, the probability that $C_{S,f_0,f_1}(v) \ge k$ is at most $m^{-\Omega(k)}$; applying a union bound over all $v$, the probability that some connected component has size greater than $k$ is still at most $m^{-\Omega(k)}$.

3.2.1 Showing InnerStash Stays Small

We begin by showing that InnerStash has size at most $s$ at all points in time with probability at least $1 - 1/m^{\Omega(s)}$. This requires extending the analysis of prior work [25, 27] to stashes of superconstant size, while leveraging the sparsity of the inner cuckoo table $C$.
Lemma 7: Let $|S| = n'$, and let $G'(S, f_0, f_1)$ be any cuckoo graph with vertex set $[m'] \times [m']$. Assume $n' \le 2m^{1/3}$ and $m' = m^{2/3}$. For any $s < m^{1/6}$, with probability $1 - 1/m^{\Omega(s)}$ the total number of vertices that reside in the stash of $G'(S, f_0, f_1)$ is at most $s$.

The proof is essentially the same as that for the outer stash, which we give in Lemma 12; we defer it until then. We are now in a position to show formally that the inner stash is small at all points in time.

Lemma 8: Let $\pi'$ be any $m^{1/3}$-bounded sequence of operations of length $p(n)$ on the inner cuckoo table $C$. For any $s < m^{1/12}$, with probability $1 - p(n)/m^{\Omega(s)}$ over the choice of hash functions $f_0, f_1$, InnerStash has size at most $s$ at all points in time.

Proof: Following the approach of [5], we define a good event $\xi_1$ that ensures that InnerStash has size at most $s$. Namely, let $\xi_1$ be the event that at all points $i$ in time, the number of vertices $v$ in the stash of $G'(\hat{S}_i, f_0, f_1)$ is at most $s$. By Lemma 7, and a union bound over all $p(n)$ operations in $\pi'$, $\xi_1$ occurs with probability at least $1 - p(n)/m^{\Omega(s)}$.

We distinguish between the real stash and the effective stash. The real stash at any point $i$ in time is the set of nodes that reside in the stash of $G'(S_i, f_0, f_1)$; the effective stash refers to the real stash, together with elements that used to be in the real stash but have since had cycles removed from their components due to deletions, and have not yet been inserted into the inner cuckoo table. Let $E_i$ denote the set of items in the effective stash but not the real stash of $G'(S_i, f_0, f_1)$. Clearly the event $\xi_1$ guarantees that at any point in time, the size of the real stash is at most $s$.
To see that the size of the effective stash never exceeds $s$, assume by way of induction that for $j \ge m^{1/3}$, the effective stash has size at most $s$ at all times less than $j$ (clearly event $\xi_1$ guarantees that this is true for all $j \le m^{1/3}$, as a base case). In particular, the inductive hypothesis ensures that the effective stash has size at most $s$ at time $j - m^{1/3}$. Observe that during the execution of operations $\{x_{j-m^{1/3}}, x_{j-m^{1/3}+1}, \ldots, x_j\}$, we spend $m^{1/3}/m^{1/6} = m^{1/6} \ge s^2$ moves in total on the elements of the effective stash. By Lemma 6 and a union bound over all $i \le p(n)$, we may assume all connected components of $G'(\hat{S}_j, f_0, f_1)$ have size at most $s/2$; this adds a failure probability of at most $p(n)/m^{\Omega(s)}$ to the result. Thus, by Lemma 4, the insertion of any item $x$ in the effective stash never requires more than $s$ moves before it succeeds or causes $x$ to be returned to the back of the stash. It follows that by time $j$ we have spent at least $s$ moves on each of the items in $E_{j-m^{1/3}}$, and hence all elements in $E_{j-m^{1/3}}$ have been inserted into the table by time $j$. Thus, at time $j$ the size of the effective stash is at most $s$, and this completes the induction.

3.2.2 Showing InnerQueue Stays Small

The bulk of our analysis relies on the following technical lemma.

Lemma 9: Let $|S| = n'$, and let $G'(S, f_0, f_1)$ be any cuckoo graph with vertex set $[m'] \times [m']$. Assume $n'/m' \le 2m^{-1/3}$. For any $0 < s \le m^{1/6}$, with probability $1 - 1/m^{\Omega(\sqrt{s})}$ the total number of vertices $v \in G'(S, f_0, f_1)$ with $C_{S,f_0,f_1}(v) > 8$ is at most $s$.

Proof: Lemma 6 implies that the probability that all connected components have size at most $\sqrt{s}$ is at least $1 - m^{-\Omega(\sqrt{s})}$. We assume for the remainder that there are no components of size greater than $\sqrt{s}$; this adds at most $1/m^{\Omega(\sqrt{s})}$ to the failure probability in the statement of the lemma.
Pick a special vertex $v_i$ in each component $C_i$ of size greater than 8 at random, and assign to $v_i$ a "weight" equal to the number of edges in $C_i$. We show that with probability at least $1 - m^{-\Omega(\sqrt{s})}$ we cannot find a set of weight $s$; the lemma follows. Since no component has size more than $\sqrt{s}$, we would need to find at least $j = s/\sqrt{s} = \sqrt{s}$ vertices that are in components of size greater than 8 and are the special vertex for their component. We use the fact that for any fixed set of $j$ distinct vertices $v_1, \ldots, v_j$, the probability (over both the choice of $G'$ and the choice of the special vertex for each component) that all $j$ vertices are the special vertices for their components is upper bounded by $\Pr[C_{S,f_0,f_1}(v) \ge 8]^j$. Indeed, write $E_i$ for the event that $v_1, \ldots, v_{i-1}$ are all special. We may write
\[ \Pr[v_1, \ldots, v_j \text{ are all special}] \le \prod_{i=1}^{j} \Pr[v_i \text{ is special} \mid E_i]. \]
To bound the right-hand side, notice that
\[ \Pr[v_i \text{ is special} \mid E_i] \le \Pr[v_i \text{ is in a different component from } v_1, \ldots, v_{i-1} \text{ and } C_{S,f_0,f_1}(v_i) \ge 8 \mid E_i] \le \Pr[C_{S,f_0,f_1}(v) \ge 8], \]
where the last inequality holds because the density of edges after removing the earlier (larger than expected) components is less than the density of edges a priori. Thus, taking a union bound over every set $\{v_1, \ldots, v_j\}$ of $j = \sqrt{s}$ vertices, the probability that we can find such a set is at most
\[ \binom{m'}{\sqrt{s}} \Pr[v_1, \ldots, v_j \text{ are all special}] \le \binom{m'}{\sqrt{s}} \Pr[C_{S,f_0,f_1}(v) > 8]^{\sqrt{s}} \le \binom{m'}{\sqrt{s}} \left(\frac{8^8}{8!\, m^{8/3}}\right)^{\sqrt{s}} \le m^{-\Omega(\sqrt{s})}, \]
where the second inequality follows from Lemma 3.

We are now in a position to show formally that the inner queue is small at all points in time.

Lemma 10: Let $\pi'$ be any $m^{1/3}$-bounded sequence of operations of length $p(n)$ on the inner cuckoo table $C$.
For any $s \le m^{1/6}$, with probability $1 - p(n)/m^{\Omega(\sqrt{s})}$ over the choice of hash functions $f_0, f_1$, InnerQueue has size at most $s$ at all points in time.

Proof: Following the approach of [5], we define a good event $\xi_2$ that ensures that InnerQueue has size at most $s$. Namely, let $\xi_2$ be the event that at all points $i$ in time, the number of vertices $v \in G'$ with $C_{\hat{S}_i,f_0,f_1}(v) > 8$ is at most $s$. By Lemma 9, and a union bound over all $p(n)$ operations in $\pi'$, $\xi_2$ occurs with probability at least $1 - p(n)/m^{\Omega(\sqrt{s})}$.

We distinguish between the real queue and the effective queue. The real queue at any point $i$ in time is the set of nodes $v$ such that $C_{S_i,f_0,f_1}(v) > 8$; the effective queue refers to the real queue, together with elements that used to be in the real queue but have since had their components shrunk to size at most 8 via deletions, and have not yet been inserted into the inner cuckoo table. Let $E_i$ denote the set of items in the effective queue but not the real queue at time $i$. Clearly the event $\xi_2$ guarantees that at any point in time, the size of the real queue is at most $s$.

To see that in fact the size of the effective queue never exceeds $s$, assume by way of induction that for $j \ge m^{1/3}$, the effective queue has size at most $s$ at all times less than $j$ (clearly event $\xi_2$ guarantees that this is true for all $j \le m^{1/3}$, as a base case). In particular, the inductive hypothesis ensures that the effective queue has size at most $s$ at time $j - m^{1/3}$. Observe that during the execution of operations $\{x_{j-m^{1/3}}, x_{j-m^{1/3}+1}, \ldots, x_j\}$, we spend $16 \cdot m^{1/3}/m^{1/6} \ge 16s$ moves on the elements of the effective queue, with at least 16 moves devoted to each element. By the definition of the real queue, combined with Lemma 4, at most 16 moves are required to insert each element of $E_{j-m^{1/3}}$, and thus all elements in $E_{j-m^{1/3}}$ have been inserted into the table by time $j$.
Thus, at time $j$ the size of the effective queue is at most $s$, and this completes the induction.

3.2.3 Putting It Together

By design, every insert into $C$ terminates in $O(1)$ steps. Combining Lemmata 8 and 10, with probability $1 - p(n)/m^{\Omega(\sqrt{s})}$, both InnerQueue and InnerStash have size at most $s/2$ at all times. Since $L$ only contains items from the inner queue and inner stash, it follows that $L$ never contains more than $s$ items. This proves Theorem 1.

3.3 Proving Theorem 2

Throughout this section, we use the following notation, following that in Section 3.2. Let $\pi$ be an $n$-bounded sequence of $p(n)$ operations. Denote by $(x_1, \ldots, x_{p(n)})$ the elements inserted by $\pi$. For any integer $0 < i \le p(n)$, let $S_i$ denote the set of elements that are stored in the data structure just before the insertion of $x_i$, let $\hat{S}_i$ denote $S_i$ together with the elements $\{x_i, x_{i+1}, \ldots, x_{i+\log^6 n}\}$, ignoring any deletions between time $i$ and time $i + \log^6 n$, and let $\bar{S}_i$ denote $S_i$ together with the elements $\{x_i, x_{i+1}, \ldots, x_{i+m^{1/2}}\}$, ignoring any deletions between time $i$ and time $i + m^{1/2}$ (treating any operations past time $p(n)$ as empty). Since $\pi$ is an $n$-bounded sequence, we have $|S_i| \le n$, $|\hat{S}_i| \le n + \log^6 n$, and $|\bar{S}_i| \le n + m^{1/2}$ for all $i$.

In proving Theorem 2 it suffices to show that neither the stash nor the queue of the primary structure grows too large. We will use the following lemma.

Lemma 11: Let $|S| = n$, and let $G(S, h_0, h_1)$ be a cuckoo graph with vertex set $V = [m] \times [m]$, with $m \ge (1+\epsilon)n$ for some constant $\epsilon > 0$. There exists a constant $c_2 > 0$ such that with probability $1 - 1/n^{\Omega(\log n)}$ over the choice of $h_0$ and $h_1$, all components of $G(S, h_0, h_1)$ have size at most $c_2 \log^2 n$.

Proof: It is a standard calculation that for any node $v$ in the cuckoo graph, $\Pr[C_{S,h_0,h_1}(v) \ge k] \le \beta^k$ for some constant $\beta \in (0,1)$ (see, e.g., [25, Lemma 2.4]).
The conclusion follows by setting $k = O(\log^2 n)$ and applying the union bound over all vertices.

3.3.1 Showing OuterStash Stays Small

Lemma 12: Let $|S| = n$, and let $G(S, h_0, h_1)$ be any cuckoo graph with vertex set $V = [m] \times [m]$, where $(1+\epsilon)n \le m$ for some constant $\epsilon > 0$. For any $s \le m^{1/6}$, with probability $1 - 1/m^{\Omega(s)}$ the total number of vertices that reside in the stash of $G(S, h_0, h_1)$ is at most $s$.

The choice of $s \le m^{1/6}$ is fairly arbitrary but convenient and sufficient for our purposes. Notice that Lemma 12 readily implies Lemma 7.

Proof: Here we follow the work of [22], which considers the analysis of super-constant sized stashes, extending the work of [25]. The starting point is Lemmata 3 and 5 above. Specifically, in [22] it is shown that these lemmata imply that
\[ \Pr(B_v \ge t) \le \sum_{k=1}^{\infty} \min\!\left( \left(\frac{3 e^5 k^3}{m}\right)^t, 1 \right) \beta^k \]
for some constant $\beta$. In particular, we can restrict attention to values of $k$ that are $O(m^{1/5})$, since the summation over the $\beta^k$ terms for larger values of $k$ is dominated by $2^{-\Omega(m^{1/5})} = m^{-\Omega(m^{1/6})}$. It follows that $\Pr(B_v \ge t)$ is at most $\max\!\left(m^{-\Omega(t)}, m^{-\Omega(m^{1/6})}\right)$. We therefore claim that $\Pr(B_v \ge j+1) \le m^{-1-\alpha j}$ for some constant $\alpha$, for $j \le m^{1/6}$.

Now, following the derivation of Theorem 2.2 of [25], the probability that the stash exceeds size $s$ is given by the probability that $2m$ independently chosen components have an excess of more than $s$ edges, which can be bounded as
\[ \Pr(\bar{e}(G) \ge s) \le \sum_{k=1}^{2m} \binom{2m}{k} s^k m^{-\alpha s - k} \le \sum_{k=1}^{2m} m^{-\alpha s} \left(\frac{2es}{k}\right)^k \le (2m)\, m^{-\alpha s} e^{2s} = m^{-\Omega(s)}. \]
Here the second-to-last inequality follows from a straightforward optimization to find the maximum of $(x/k)^k$ (which occurs at $k = x/e$).

We now show formally that the outer stash is small at all points in time.

Lemma 13: Let $\pi$ be any $n$-bounded sequence of operations of length $p(n)$.
For any $s \le m^{1/5}$, with probability $1 - p(n)/m^{\Omega(s)}$ over the choice of hash functions $h_0, h_1$, OuterStash has size at most $s$ at all points in time.

Proof: We define a good event $\xi_3$ that ensures that OuterStash has size at most $s$. Namely, let $\xi_3$ be the event that at all points $i$ in time, the number of vertices in the stash of $G(\bar{S}_i, h_0, h_1)$ is at most $s$, and additionally all connected components of $G(\bar{S}_i, h_0, h_1)$ have size at most $\log^2 n$. By Lemmata 11 and 12, as well as a union bound over all $p(n)$ operations in $\pi$, $\xi_3$ occurs with probability at least $1 - p(n)/m^{\Omega(s)}$. As in [5], a minor technical point in applying Lemma 12 is that $\bar{S}_i$ is $(n + m^{1/2})$-bounded, not $n$-bounded. But we can handle this by applying Lemma 12 with $\epsilon' = \epsilon/2$, since for large enough $m$, $(1 + \epsilon/2)(n + m^{1/2}) \le (1+\epsilon)n \le m$.

We distinguish between the real stash and the effective stash. The real stash at any point $i$ in time is the set of nodes $v \in V$ that reside in the stash of $G(S_i, h_0, h_1)$; the effective stash refers to the real stash, together with elements that used to be in the real stash but have since had cycles removed from their components due to deletions, and have not yet been inserted into the outer cuckoo table. Let $E_i$ denote the set of items in the effective stash but not the real stash of $G(S_i, h_0, h_1)$. Clearly the event $\xi_3$ guarantees that at any point in time, the size of the real stash is at most $s$.

To see that the size of the effective stash never exceeds $s$, assume by way of induction that for $j \ge m^{1/2}$, the effective stash has size at most $s$ at all times less than $j$ (clearly event $\xi_3$ guarantees that this is true for all $j \le m^{1/2}$, as a base case). In particular, the inductive hypothesis ensures that the effective stash has size at most $s$ at time $j - m^{1/2}$. Observe that during the execution of operations $\{x_{j-m^{1/2}}, x_{j-m^{1/2}+1}, \ldots, x_j\}$, we spend at least $m^{1/2}/m^{1/4} = m^{1/4}$ moves in total on the elements of the effective stash. Since all connected components have size at most $\log^2 n$, Lemma 4 implies that the insertion of any item $x$ in the effective stash never requires more than $2\log^2 n$ moves before it succeeds or causes $x$ to be returned to the back of the stash. Thus, all elements in the effective stash at time $j - m^{1/2}$ require at most $2 m^{1/5} \log^2 n \le m^{1/4}$ operations in total to process, and it follows that all elements in $E_{j-m^{1/2}}$ have been inserted into the table by time $j$. Thus, at time $j$ the size of the effective stash is at most $s$, and this completes the induction.

3.3.2 Showing OuterQueue Stays Small

Let $|S| = n$, and let $G(S, h_0, h_1)$ be any cuckoo graph with vertex set $[m] \times [m]$. It is well known that, for any node $v$, there is significant probability that an insertion of $v$ into $G$ takes $\Omega(\log n)$ time. But one might hope that for a sufficiently large set of distinct vertices $\{v_1, \ldots, v_N\}$, the average size of the connected components of the $v_i$'s is constant with overwhelming probability over the choice of $G$, and thus any sequence of $N$ insertions will take $O(N)$ time in total. Indeed, Lemma 4.4 of Arbitman et al. [5] establishes that, for any distinct vertices $\{v_1, \ldots, v_N\}$ with $N \le \log n$, $\sum_{i=1}^{N} C_{S,h_0,h_1}(v_i) = O(N)$ with probability $1 - 2^{-\Omega(N)}$ over the choice of $h_0$ and $h_1$. Roughly speaking, Arbitman et al. use this result to conclude that a logarithmic-sized queue suffices for de-amortizing cuckoo hashing (where all guarantees hold with high probability), since any sequence of $\log n$ insertions can be processed in $O(\log n)$ time steps. Our goal is to achieve guarantees that hold with overwhelming probability. To achieve this, we use the fact that we can afford to keep a queue of super-logarithmic size without affecting the asymptotic space usage of our algorithm.
We show that any sequence of, say, $N = \log^6 n$ operations can be cleared from the queue in $O(N)$ time with probability $1 - n^{-\omega(1)}$. It follows that with overwhelming probability the queue does not overflow. Unfortunately, the techniques of [5] do not generalize to values of $N$ larger than $O(\log n)$, so we generalize their result using different methods.

Intuitively, one should picture the random process we wish to analyze as a standard queueing process, where the time to handle each job is a random variable. In our case, the random variable for a job (which is a key $k$ to be placed) is the time to find a spot for $k$ in the cuckoo table, which by Lemma 4 is proportional to the size of the connected component in which $k$ is placed in the cuckoo graph. This random variable is known to be constant on average and to have exponentially decreasing tails, so if the job times were independent, this would be a standard queue (more specifically, a Galton-Watson process), and the bounds would follow from standard analyses. Unfortunately, the job times in our setting are not independent. Roughly speaking, our analysis proceeds by showing that, with overwhelming probability over a given instance of the cuckoo graph, the expected size of the connected component of a randomly chosen vertex is close to what it would be if the graph were chosen at random. Our main technical tool is the following lemma.

Lemma 14: Let $|S| = n$, and let $G(S, h_0, h_1)$ be a cuckoo graph with vertex set $V = [m] \times [m]$. Let $\{v_1, \ldots, v_N\}$ be a set of $N > \log^6 n$ vertices chosen uniformly at random (with replacement) from the cuckoo graph $G(S, h_0, h_1)$. There is a constant $c$ such that with probability $1 - n^{-\Omega(\log n)}$ (over both the choice of the $v_i$'s and the generation of the cuckoo graph $G$), $\sum_{i=1}^{N} C_{S,h_0,h_1}(v_i) \le cN$.
Proof: Let $X_i$ be the number of vertices in $G(S, h_0, h_1)$ in components of size $i$, let $\mu_i = E[X_i]$ be the expected number of vertices in components of size $i$ in a random cuckoo graph, and let $\mu = \frac{1}{2m} \sum_{i=1}^{2m} i \mu_i$ be the expected size of the connected component of a random node in a random cuckoo graph. By standard calculations (see, e.g., [25, Lemma 2.4]), there is a constant $\beta \in (0,1)$ such that for any $v$, $E[C_{S,h_0,h_1}(v)] \le \sum_{k=1}^{\infty} \beta^k = O(1)$, where the expectation is taken over the choice of $h_0$ and $h_1$, and hence $\mu = O(1)$.

For fixed hash functions $h_0$ and $h_1$, we can write
\[ E_{v \in V}\, C_{S,h_0,h_1}(v) = \frac{1}{2m} \sum_{v \in V} C_{S,h_0,h_1}(v) = \frac{1}{2m} \sum_{i=1}^{2m} i X_i. \quad (1) \]
Our goal is to show that with overwhelming probability over the choice of $h_0$ and $h_1$, the right-hand side of Equation (1) is close to $\mu$. We will do this by showing that for small enough $i$, $X_i$ is tightly concentrated around $\mu_i$, and that larger $i$ do not contribute significantly to the sum.

Lemma 15: The following properties both hold.

1. Suppose $\mu_i \ge n^{2/3}$. Then $\Pr(|X_i - \mu_i| \ge n^{2/3}) \le 2e^{-\tilde{\Omega}(n^{1/3})}$. (The $\tilde{\Omega}$ notation hides factors polylogarithmic in $n$.)

2. Let $i^*$ be the smallest value of $i$ such that $\mu_{i^*} < n^{2/3}$, and let $X^*$ be the number of vertices in components of size at least $i^*$. Then $\Pr[X^* \ge \gamma n^{2/3}] \le 1/n^{\Omega(\log n)}$ for some constant $\gamma$.

Proof: 1. This follows from a standard application of Azuma's inequality, applied to the edge-exposure martingale that reveals the edges of the cuckoo graph one at a time. More specifically, we reveal the edges of $G$ one at a time in an arbitrary order, say $e_1, \ldots, e_n$, and let $Z_j = E[X_i \mid e_1, \ldots, e_j]$. Then $Z_0, Z_1, \ldots, Z_n$ is a martingale, and changing a single edge can change the number of components of size $i$ by at most a constant (specifically, two), and hence $|Z_j - Z_{j-1}| \le 2i$ for all $j$.
Thus, by Azuma's inequality,
\[ \Pr(|X_i - E[X_i]| \ge 2i\lambda\sqrt{n}) = \Pr(|Z_n - Z_0| \ge 2i\lambda\sqrt{n}) \le 2e^{-\lambda^2/2}. \]
Setting $\lambda = \frac{n^{2/3}}{2i\sqrt{n}} = \Omega(n^{1/6}/i)$, we see that $\Pr(|X_i - E[X_i]| \ge n^{2/3}) \le e^{-\Omega(n^{1/3}/i^2)}$. By a standard calculation, $E[X_i] \ge n^{2/3}$ implies $i \le c_1 \log n$ for some constant $c_1$, and the claim follows.

2. Since $\Pr[C_{S,h_0,h_1}(v) \ge k] \le \beta^k$ for some constant $\beta \in (0,1)$, it follows easily that $E[X^*] \le \sum_{i=i^*}^{\infty} n\beta^i$, where $n\beta^{i^*} \le n^{2/3}$. It follows that $E[X^*] = O(n^{2/3})$.

In order to get concentration of $X^*$ about its mean, we use a slight modification of the edge-exposure martingale, which essentially allows us to assume that all connected components have size $O(\log^2 n)$ when bounding the differences between martingale steps; this holds with very high probability by Lemma 11. This technique is formalized, for example, in Theorem 3.7 of [32]. Let $Q$ be the event that all connected components have size at most $\log^2 n$. We reveal the edges of $G$ one at a time in an arbitrary order, say $e_1, \ldots, e_n$, and let $Z_j = E[X^* \mid e_1, \ldots, e_j]$ if the edges $e_1, \ldots, e_j$ do not include a component of size greater than $\log^2 n$, and $Z_j = Z_{j-1}$ otherwise. Then $Z_j$ is a martingale, and since changing a single edge can change the number of components of size $i$ by at most two, we see that $|Z_j - Z_{j-1}| \le 2c\log^2 n$. Now, as $Z_n$ equals $X^*$ except when event $Q$ fails to hold, we can apply Azuma's inequality to the above martingale to conclude that
\[ \Pr(|X^* - E[X^*]| \ge 2c\lambda\sqrt{n}\log^2 n) \le 2e^{-\lambda^2/2} + \Pr(\neg Q). \]
Setting $\lambda = \frac{n^{2/3}}{2c\log^2(n)\sqrt{n}} = \tilde{\Omega}(n^{1/6})$, we obtain
\[ \Pr(|X^* - E[X^*]| \ge n^{2/3}) \le e^{-\tilde{\Omega}(n^{1/3})} + \frac{1}{n^{\Omega(\log n)}}. \]
Noting that $E[X^*] = O(n^{2/3})$, we conclude that $\Pr(X^* \ge \gamma n^{2/3}) \le \frac{1}{n^{\Omega(\log n)}}$ for some constant $\gamma$.
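Both applications above invoke the following standard form of Azuma's inequality, for a martingale $Z_0, \ldots, Z_n$ with bounded differences $|Z_j - Z_{j-1}| \le c_j$:

```latex
\Pr\left( |Z_n - Z_0| \ge \lambda \sqrt{\sum_{j=1}^{n} c_j^2} \right) \le 2 e^{-\lambda^2/2}.
```

With $c_j = 2i$ for each of the $n$ edge exposures this gives the deviation $2i\lambda\sqrt{n}$ used in part 1, and with $c_j = 2c\log^2 n$ it gives the deviation used in part 2 (up to the additional $\Pr(\neg Q)$ term from truncating the martingale).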
Properties 1 and 2 of Lemma 15 together imply that with very high probability over the choice of $h_0$ and $h_1$,
\[ \sum_{i=1}^{2m} i X_i = \sum_{i=1}^{i^*-1} i\left(\mu_i \pm n^{2/3}\right) + O(n^{2/3}\log^2 n) \le \sum_{i=1}^{2m} 2 i \mu_i + O(n^{2/3}\log^2 n). \]
Combining the above with Equation (1), with overwhelming probability over the choice of $h_0$ and $h_1$ it holds that
\[ E_{v \in V}\, C_{S,h_0,h_1}(v) = \frac{1}{2m} \sum_{i=1}^{2m} i X_i \le 2\mu + o(1) = O(1). \]
Thus, we have shown that with very high probability over the choice of $G$, there is a constant $c_3$ such that $E_{v \in V}[C_{S,h_0,h_1}(v)] \le c_3$.

Our last step in proving Lemma 14 is to show that if we choose a set of vertices at random, the sum of the component sizes is concentrated around its mean. Indeed, by applying an argument similar to that in the proof of Property 2 of Lemma 15, we may assume that $C_{S,h_0,h_1}(v) \le c_2 \log^2 n$ for all $v \in V$, as long as we add an additional $\frac{1}{n^{\Omega(\log n)}}$ term to the probability bound obtained using Azuma's inequality. This yields
\[ \Pr\!\left( \left| \sum_{i=1}^{N} C_{S,h_0,h_1}(v_i) - c_3 N \right| \ge \lambda \sqrt{N}\, c_2 \log^2 n \right) \le 2e^{-\lambda^2/2} + \frac{1}{n^{\Omega(\log n)}}. \]
Setting $\lambda = \sqrt{N}/\log^2 n$ yields
\[ \Pr\!\left( \left| \sum_{i=1}^{N} C_{S,h_0,h_1}(v_i) - c_3 N \right| \ge c_2 N \right) \le e^{-\Omega(N/\log^4 n)} + \frac{1}{n^{\Omega(\log n)}}. \]
The conclusion follows, with $c = c_3 + c_2$.

Notice that for any set of distinct items $\{x_1, \ldots, x_N\}$ to be inserted, the sets $\{h_0(x_1), \ldots, h_0(x_N)\}$ and $\{h_1(x_1), \ldots, h_1(x_N)\}$ are uniformly distributed sets of vertices in $G(S, h_0, h_1)$. Thus, Lemma 14 ensures that for distinct items $\{x_1, \ldots, x_N\}$ with $N \ge \log^6 n$, and for any set $S$ of size $n$, with probability at least $1 - 1/n^{\Omega(\log n)}$ over the choice of $h_0$ and $h_1$, $\sum_{i=1}^{N} C_{S,h_0,h_1}(v_i) \le cN$. With this in hand, we are ready to show that with overwhelming probability OuterQueue does not exceed size $\log^6 n$ over any sequence of $\mathrm{poly}(n)$ operations.
Lemma 16: With probability $1 - n^{-\Omega(\log n)}$, the queue of the primary structure has size at most $\log^6 n$ at all times.

Proof: We define a good event $\xi_4$ that ensures that OuterQueue has size at most $\log^6 n$. Namely, let $\xi_4$ be the event that for all times $\log^6 n \le j \le p(n)$, it holds that $\sum_{i=j-\log^6 n}^{j} C_{\hat{S}_j,h_0,h_1}(x_i) \le c\log^6 n$. By Lemma 14, and a union bound over all $p(n)$ operations in $\pi$, $\xi_4$ occurs with probability at least $1 - 1/n^{\Omega(\log n)}$.

For $j > \log^6 n$, suppose by induction that there are at most $\log^6 n$ items in the effective queue at all times less than $j$ (as a base case, this is clearly true for all $j \le \log^6 n$). In particular, this holds at time $j - \log^6 n$. We can assume all items in the effective queue at time $j - \log^6 n$ are distinct, because we process deletions immediately. Since all the (at most) $\log^6 n$ items in the effective queue are distinct, event $\xi_4$ guarantees that all of these items can be cleared from the queue in $c\log^6 n$ steps. Setting the number of steps expended on elements of the queue per operation to $\alpha = c$, all (at most) $\log^6 n$ items in the effective queue at time $j - \log^6 n$ will be cleared from the queue before time $j$. Thus, at time $j$ the queue contains at most $\log^6 n$ items, and this completes the induction.

$Q$ only contains items from OuterStash and OuterQueue, and we have shown that both deques contain at most $\log^6 n$ items with overwhelming probability. Theorem 2 follows.

3.4 Putting It All Together

For any constant $\epsilon > 0$, our nested cuckoo construction uses $(2+\epsilon)n$ words for the outer cuckoo table, $O(m^{2/3})$ words for the inner structure $Q$, and (with overwhelming probability) $O(\log^2 m)$ words for the cycle-detection mechanisms. The latter two space costs are dominated by the first, and so our total space usage is $(2+\epsilon)n$ words in total, for constant $\epsilon > 0$. We now derive our final theoretical guarantees on the running time of each operation for both of our constructions.
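The per-operation guarantees below rest on a common lookup path: probe two outer cells, two inner cells, then search the auxiliary structure $L$. A minimal sketch, assuming cells hold (key, value) pairs or None and that $L$ is iterable; the parameter names follow the paper's notation, but the function itself is ours:

```python
def lookup(key, T0, T1, R0, R1, h0, h1, f0, f1, L):
    """Probe the two outer cells T0[h0(key)], T1[h1(key)], the two inner
    cells R0[f0(key)], R1[f1(key)], then scan the small structure L.
    Returns the associated value, or None if the key is absent."""
    for table, h in ((T0, h0), (T1, h1), (R0, f0), (R1, f1)):
        cell = table[h(key)]
        if cell is not None and cell[0] == key:
            return cell[1]
    for k, v in L:  # L stays small with overwhelming probability
        if k == key:
            return v
    return None
```

In Construction One the scan of $L$ is literal (hence $O(s)$ lookups when $|L| \le s$); in Construction Two the scan would be replaced by a constant-time query to the atomic stash.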
Construction One: Inserts take $O(1)$ time by design. Lookups and deletions require examining $T_0[h_0(k)]$, $T_1[h_1(k)]$, $R_0[f_0(k)]$, and $R_1[f_1(k)]$, and performing a lookup in $L$, which in Construction One potentially requires examining all elements in $L$. Theorems 1 and 2 together imply that for $0 < s < \log^2 n$, with probability at least $1 - 1/m^{\Omega(\sqrt{s})}$, $L$ contains at most $s$ items. Thus, lookups and removals take time $O(s)$ with probability at least $1 - 1/m^{\Omega(\sqrt{s})}$, even though $s$ is not a tuning parameter of our construction. To clarify, the hidden constant in the exponent is fixed, i.e., independent of all parameters. Thus, for any constant $c$, there is some larger constant $s$ such that the probability that lookups and removals take time more than $s$ is at most $1/m^c$. We also remark that it is straightforward to extend our analysis to all $s \le m^{1/12}$, but for clarity we have not presented our results in this generality.

Construction Two: Again, inserts take $O(1)$ time by design. As in Construction One, lookups and deletions require examining $T_0[h_0(k)]$, $T_1[h_1(k)]$, $R_0[f_0(k)]$, and $R_1[f_1(k)]$, and performing a lookup in $L$; assuming $L$ has size $s = O(\log^{1/2} n)$, such a lookup can be performed in constant time using our atomic stash. Theorems 1 and 2 therefore imply that lookups and removals take $O(1)$ time with probability $1 - 1/m^{\Omega(\log^{1/4} n)}$.

3.5 Extensions

3.5.1 $\mathrm{polylog}(n)$-wise Independent Hash Functions

We remark that an argument of Arbitman et al. [5] implies, almost without modification, that in Construction Two, insertions, deletions, and lookups take $O(1)$ time with overwhelming probability even if the hash functions $h_0$, $h_1$, $f_0$, and $f_1$ are chosen from $\mathrm{polylog}(n)$-wise independent hash families. For completeness, we reproduce this argument in our context.
In the analysis above, the only places we used the full independence of our hash functions were in Lemmata 7, 9, 12, and 14. These lemmata allowed us to define four events that occur with high or overwhelming probability, whose occurrence guarantees that our time bounds hold. Specifically, for any fixed $s$, our time bounds for Construction Two hold if none of the following "bad" events occur:

1. Event 1: There exists a set $\{v_1, \ldots, v_N\}$ of $N = O(\log^6 n)$ vertices in the outer cuckoo graph such that $\sum_{i=1}^{N} C_{S,h_0,h_1}(v_i) > cN$ (this is the complement of event $\xi_4$ from Section 3.3.2).

2. Event 2: There exists a set of at most $2m$ vertices in the outer cuckoo graph such that the number of stashed elements from the set exceeds $O(\log^6 n)$ (this is the complement of event $\xi_3$ from Section 3.3.1).

3. Event 3: There exists a set of at least $O(\log^{1/2} n)$ vertices in the inner cuckoo graph, all of whose connected components at some point in time have size greater than 8 (this is the complement of the event $\xi_2$ defined in Section 3.2.2).

4. Event 4: There exists a set of vertices in the inner cuckoo graph such that the number of stashed elements from the set at some point in time exceeds $O(\log^{1/2} n)$ (this is the complement of the event $\xi_1$ defined in Section 3.2.1).

Lemmata 7, 9, 12, and 14 ensure that if $h_0$, $h_1$, $f_0$, and $f_1$ are fully random, none of the four events occurs, with probability at least $1 - 1/n^{\Omega(\log^{1/4} n)}$. In order to show that this conclusion holds even if the hash functions are only $\mathrm{polylog}(n)$-wise independent, we apply a recent result of Braverman [8] stating that polylogarithmic independence fools constant-depth boolean circuits.

Theorem 17: ([8]) Let $s \ge \log m$ be any parameter. Let $F$ be a boolean function computed by a circuit of depth $d$ and size $m$. Let $\mu$ be an $r$-independent distribution where
\[ r \ge 3 \cdot 60^{d+3} \cdot (\log m)^{(d+1)(d+3)} \cdot s^{d(d+3)}; \]
then $|E_\mu[F] - E[F]| < 0.82^s \cdot 15m$.

Theorem 17 implies that if we can develop constant-depth boolean circuits of quasi-polynomial size that recognize Events 1-4 above, then the probability that any of the events occurs under polylogarithmically independent hash functions will be very close to the probability that the events occur under fully random hash functions. The circuits that recognize our events are similar to those used in Arbitman et al. [5]; the input wires to the first two circuits carry the values $h_0(x_1), h_1(x_1), \ldots, h_0(x_n), h_1(x_n)$ (where the $x_i$'s represent the elements inserted into the outer cuckoo table), while the input wires to the second two circuits carry the values $f_0(x_1), f_1(x_1), \ldots, f_0(x_j), f_1(x_j)$, where $j$ is the number of items in the inner cuckoo table.

1. Identifying Event 1: Just as in [5], this event occurs if and only if the graph contains at least one forest from a specific set of forests of the bipartite graph on $[m] \times [m]$, where $m = (1+\epsilon)n$. We denote this set of forests by $\mathcal{F}_n$, and observe that $\mathcal{F}_n$ is a subset of all forests with at most $cN = O(\log^6 n)$ vertices, which implies that $|\mathcal{F}_n| = n^{\mathrm{polylog}(n)}$. Therefore, the event can be identified by a constant-depth circuit of size $n^{\mathrm{polylog}(n)}$ that simply enumerates all forests in $\mathcal{F}_n$ and, for every such forest, checks whether it exists in the graph.

2. Identifying Event 2: A constant-depth circuit identifying this event enumerates over all $S \subseteq [m] \times [m]$ of size $\log^{1/2} n$ and checks whether all elements of $S$ are stashed. As in [5], a minor complication is that we must define a canonical set of stashed elements for each set $S$; this is only to simplify the analysis, and does not require modifying our actual construction.
Our circuit checks whether all elements of S are stashed by, for each x ∈ S, enumerating over all connected components in which the edge (h_0(x), h_1(x)) is stashed according to the canonical set of stashed items for S, and checking whether the component exists in the graph. We may assume Event 1 does not occur, and thus we need only iterate over components of O(log^6 n) vertices. The circuit thus has O(n^{polylog(n)}) size.

3. Identifying Events 3 and 4: We may assume Events 1 and 2 do not occur. Then there are at most O(log^6 n) edges in the inner cuckoo table, so a constant-depth circuit of quasipolynomial size simply enumerates over all possible edge sets E′ of size O(log^6 n) satisfying Event 3 or Event 4 and checks whether E′ equals the input to the circuit.

Thus, we can set s = polylog(n) and m = n^{polylog(n)} (the size of our circuits) in the statement of Theorem 17 to conclude that, even if we use hash functions from a polylog(n)-wise independent family of functions, Events 1–4 still only occur with negligible probability. We remark that similar arguments demonstrate that when polylog(n)-wise independent hash functions are used, all operations under Construction One still take O(s) time with probability 1 − 1/m^{Ω(√s)} when s = polylog(n) (the amount of independence required depends on s).

3.5.2 Sufficiently Independent Hash Functions Evaluated in O(1) Time

Unfortunately, all known constructions of polylog(n)-wise independent hash families that can be evaluated in the RAM model while maintaining our O(n) space bound come with important caveats. The classic construction of k-wise independent hash functions due to Carter and Wegman, based on degree-k polynomials over finite fields, requires time O(k) to evaluate; ideally we would like O(1) evaluation time to maintain our time bounds.
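To illustrate the O(k) evaluation cost mentioned above, here is a hedged sketch of the Carter-Wegman family: a uniformly random polynomial of degree k − 1 over a prime field, evaluated by Horner's rule. The specific prime and the function names are our own illustrative choices.

```python
# Sketch of a k-wise independent Carter-Wegman family: h(x) = p(x) mod P
# for a random degree-(k-1) polynomial p. Evaluation is O(k), which is
# exactly the cost discussed above.

import random

P = (1 << 61) - 1   # a Mersenne prime, assumed larger than the key universe

def sample_kwise(k, rng=random):
    """Sample one hash function from the k-wise independent family."""
    coeffs = [rng.randrange(P) for _ in range(k)]

    def h(x):
        acc = 0
        for c in coeffs:          # Horner's rule: k multiply-add steps
            acc = (acc * x + c) % P
        return acc

    return h
```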
The work of Siegel [39] is particularly relevant; for polynomial-sized universes, he proves the existence of a family of n^ε-wise independent hash functions, for ε > 0 (which is super-logarithmic independence), that can be evaluated in O(1) time using lookup tables of size n^δ for some δ < 1. However, his construction is non-uniform, in that he relies on the existence of certain expanders for which we do not possess explicit constructions. Subsequent works that improve and/or simplify [39] (e.g., [18, 19, 35]) all possess polynomial probabilities of failure, which renders them unsuitable when seeking guarantees that hold with overwhelming probability. The development of uniformly computable hash families that can be evaluated in O(1) time using o(n) words of memory remains an important open question.

3.5.3 Achieving Loads Close to One

We remark that we can achieve O(1) worst-case operations with overwhelming probability using (1 + ε)n words of memory, for any constant ε > 0, by substituting our fully de-amortized nested cuckoo hash tables for the "backyard" cuckoo table of Arbitman et al. [6]. The construction of [6] uses a main table consisting of (1 + ε/2)n/d buckets, each of size d for some constant d, and uses a de-amortized cuckoo table as a "backyard" to handle elements from overflowing buckets. They show that for constant ε > 0, with overwhelming probability the backyard cuckoo table must only store a small constant fraction of the elements. Note that n^α-wise independent hash functions for some α > 0 are required to map items to buckets in the main table; the technique therefore cuts our space usage by a factor of about 2, but increases the amount of independence we need to assume in our hash functions for the theoretical guarantees to hold.

4 Cache-Oblivious Multimaps

In this section, we describe our cache-oblivious implementation of the multimap ADT.
To illustrate the issues that arise in the construction, we first give a simple implementation for a RAM, and then give an improved (cache-oblivious) construction for the external-memory model. Specifically, we describe an amortized cache-oblivious solution and then we describe how to de-amortize this solution.

In the implementation for the RAM model, we maintain two nested cuckoo hash tables, as described in Section 2. The first table enables fast containsItem(k, v) operations; this table stores all the (k, v) pairs, using each entire key-value pair as the key, and the value associated with (k, v) is a pointer to v's entry in a linked list L(k) containing all values associated with k in the multimap. The second table ensures fast containsKey(k), getAll(k), and removeAll(k) operations: this table stores all the unique keys k, as well as a pointer to the head of L(k).

Operations in the RAM implementation.

1. containsKey(k): We perform a lookup for k in Table 2.

2. containsItem(k, v): We perform a lookup for (k, v) in Table 1.

3. add(k, v): We add (k, v) to Table 1 using the insertion procedure of Section 2. We perform a lookup for k in Table 2, and if k is not found we add k to Table 2. We then insert v as the head of the linked list corresponding to k in Table 2.

4. remove(k, v): We remove (k, v) from Table 1, and remove v from the linked list L(k); if v was the head of L(k), we also perform a lookup for k in Table 2 and update the pointer for k to point to the new head of L(k) (if L(k) is now empty, we remove k from Table 2).

5. getAll(k): We perform a lookup for k in Table 2 and return the pointer to the head of L(k).

6. removeAll(k): We remove k from Table 2.
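The RAM operations above can be sketched in Python, with ordinary dicts standing in for the two nested cuckoo tables and a list standing in for each L(k); this is an illustrative simplification of our own, not the actual pointer-based structure.

```python
# Hedged sketch of the two-table RAM multimap. dicts replace the cuckoo
# tables, and L(k) is a plain list (new values are inserted at the head).

class MultimapSketch:
    def __init__(self):
        self.table1 = {}   # (k, v) -> marker; the paper stores a pointer
        self.table2 = {}   # k -> list of values L(k), head first

    def contains_key(self, k):
        return k in self.table2

    def contains_item(self, k, v):
        return (k, v) in self.table1

    def add(self, k, v):
        self.table1[(k, v)] = True
        self.table2.setdefault(k, []).insert(0, v)   # v becomes head of L(k)

    def remove(self, k, v):
        self.table1.pop((k, v), None)
        lst = self.table2.get(k)
        if lst is not None:
            lst.remove(v)
            if not lst:                # L(k) now empty: drop k from Table 2
                del self.table2[k]

    def get_all(self, k):
        return list(self.table2.get(k, []))

    def remove_all(self, k):
        self.table2.pop(k, None)       # Table 1 entries become "spurious"
```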
In order to achieve unamortized O(1) I/O complexity, we do not update the corresponding pointers of (k, v) pairs in Table 1; this creates "spurious" pointers in Table 1, but Angelino et al. [3] explain how to handle the presence of such spurious pointers while increasing the cost of all other operations by O(1) factors. All operations above are performed in O(1) time in the worst case with overwhelming probability, by the results of Section 2.

Two major issues arise in the above construction. First, the space usage remains O(n) only if we assume the existence of a garbage collector for leaked memory, as well as a memory-allocation mechanism, both of which must run in O(1) time in the worst case. Without the memory-allocation mechanism, inserting v into L(k) cannot be done in O(1) time, and without the garbage collector for leaked memory, space cannot be reused after remove and removeAll operations. Second, in order to extract the actual values from a getAll(k) operation, one must actually traverse the list L(k). Since L(k) may be spread all over memory, this suffers from poor locality.

We now present our cache-oblivious multimap implementation. Our implementation avoids the need for garbage collection, and circumvents the poor locality of the above getAll operation. We do require a cache-oblivious mechanism to allocate and deallocate power-of-two-sized memory blocks with constant-factor space and I/O overhead; this assumption is theoretically justified by the results of Brodal et al. [11].

Amortized Cache-Oblivious Multimaps. As in the RAM implementation, we keep two nested cuckoo tables. In Table 1, we store all the (k, v) pairs, using each entire key-value pair as the key. With each such pair, we store a count, which identifies an ordinal number for this value v associated with this key k, starting from 0.
For example, if the keys were (4, Alice), (4, Bob), and (4, Eve), then (4, Alice) might be pair 0, (4, Bob) pair 1, and (4, Eve) pair 2, all for the key 4. In Table 2, we store all the unique keys. For each key k, we store a pointer to an array, A_k, that stores all the key-value pairs having key k, stored in order by their ordinal values from Table 1. With the record for a key k, we also store n_k, the number of pairs having the key k, i.e., the number of key-value pairs in A_k. We assume that each A_k is maintained as an array that supports amortized O(1)-time element access and addition, while maintaining its size to be O(n_k).

Operations.

1. containsKey(k): We perform a lookup for k in Table 2.

2. containsItem(k, v): We perform a lookup for (k, v) in Table 1.

3. add(k, v): After ensuring that (k, v) is not already in the multimap by looking it up in Table 1, we look up k in Table 2, and add (k, v) at index n_k of the array A_k if k is present in this table. If there is no key k in Table 2, then we allocate an array, A_k, of initial constant size, add (k, v) to A_k[0], and add key k to Table 2. In either case, we then add (k, v) to Table 1, giving it ordinal n_k, and increment the value of n_k associated with k in Table 2. This operation may additionally require the growth of A_k by a factor of two, which would then necessitate copying all elements to the new array location and updating the pointer for k in Table 2.

4. remove(k, v): We look up (k, v) in Table 1 and get its ordinal count, i. Then we remove (k, v) from Table 1, and we look up k in Table 2 to learn the value of n_k and get a pointer to A_k. If n_k > 1, we swap (k′, v′) = A_k[n_k − 1] and (k, v) = A_k[i], and then remove the last element of A_k. We update the ordinal value of (k′, v′) in Table 1 to now be i.
We then decrement the value of n_k associated with k in Table 2. If this results in n_k = 0, we remove k from Table 2. This operation may additionally require shrinking the array A_k by a factor of 2, so as to maintain the O(n) space bound.

5. getAll(k): We look up k in Table 2, and then list the contents of the n_k elements stored in the array A_k indexed from this record.

6. removeAll(k): For all entries (k, v) of A_k, we remove (k, v) from Table 1. We also remove k from Table 2 and deallocate the space used for A_k.

As in the RAM implementation, in order to achieve unamortized O(1) I/O cost, we do not update the pointers of (k, v) pairs in Table 1; this creates "spurious" pointers in Table 1, which are handled the same way as in the RAM case.

In terms of I/O performance, containsKey(k) and containsItem(k, v) clearly require O(1) I/Os in the worst case. getAll(k) operations use O(1 + n_k/B) I/Os in the worst case, because scanning an array of size n_k uses O(⌈n_k/B⌉) I/Os, even though we do not know the value of B. removeAll(k) uses O(n_k) I/Os in the worst case with overwhelming probability, but these can be charged to the insertions of the n_k values associated with k, for O(1) amortized I/O cost. add(k, v) and remove(k, v) operations also require O(1) amortized I/Os with overwhelming probability; the bound is amortized because there is a chance the operation will require a growth or shrinkage of the array A_k, which may require moving all (k, v) values associated with k and updating the corresponding pointers in Table 1. In the next section, we explain how to de-amortize add(k, v) and remove(k, v) operations.

De-Amortizing the Key-Value Arrays. To de-amortize the array operations, we use a rebuilding technique, which is standard in de-amortization methods (e.g., see [38]).
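Before describing the de-amortization, we note that the ordinal bookkeeping performed by add(k, v) and remove(k, v) above can be sketched as follows; Python dicts stand in for the two cuckoo tables, and all names are illustrative.

```python
# Hedged sketch of the ordinal bookkeeping: Table 1 maps (k, v) to its
# ordinal in A_k, and Table 2 maps k to the pair (A_k, n_k).

def mm_add(t1, t2, k, v):
    A, nk = t2.setdefault(k, ([], 0))
    t1[(k, v)] = nk              # (k, v) receives ordinal n_k
    A.append((k, v))             # stored at index n_k of A_k
    t2[k] = (A, nk + 1)

def mm_remove(t1, t2, k, v):
    i = t1.pop((k, v))           # ordinal of the removed pair
    A, nk = t2[k]
    if nk > 1:
        last = A[nk - 1]
        A[i] = last              # swap the last pair into slot i ...
        t1[last] = i             # ... and update its ordinal in Table 1
    A.pop()
    if nk - 1 == 0:
        del t2[k]
    else:
        t2[k] = (A, nk - 1)
```

The swap-with-last step keeps A_k packed, which is what lets getAll(k) scan one contiguous array.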
We consider the operations needed for insertions to an array; the methods for deletions are similar. The main idea is that we allocate arrays whose sizes are powers of 2. Whenever an array, A, becomes half full, we allocate an array, A′, of double the size and start copying the elements of A into A′. In particular, we maintain a crossover index, i_A, which indicates the place in A up to which we have copied its contents into A′. Each time we wish to access A during this build phase, we copy two elements of A into A′, picking up at position i_A, and updating the two corresponding pointers in Table 1. Then we perform the access of A as we would otherwise, except that if we wish to access an index i < i_A, then we actually perform this access in A′. Since we copy two elements of A for every access, we are certain to complete the building of A′ prior to our needing to allocate a new, even larger array, even if all these accesses are insertions. Thus, each access of our array will now complete in worst-case O(1) time with overwhelming probability. It immediately follows that add(k, v) and remove(k, v) operations run in O(1) worst-case time. All time bounds in Table 1 follow.

5 Conclusion

In this paper, we have studied fully de-amortized dictionary and multimap algorithms that support worst-case constant-time operations with high or overwhelming probability. At the core of our result is a "nested" cuckoo hash construction, in which an inner cuckoo table is used to support fast lookups into a queue/stash structure for an outer cuckoo table, as well as a simplified and improved implementation of an atomic stash, which is related to the atomic heap or q-heap data structure of Fredman and Willard [20].
We gave fully de-amortized constructions with guarantees that hold with high probability in the Practical RAM model, and with overwhelming probability in the external-memory (I/O) model, the standard RAM model, or the AC0 RAM model.

Several interesting questions remain for future work. First, lookups in our structure may require four or more I/Os in external memory; it would be interesting to develop fully de-amortized structures supporting lookups in as few as two I/Os. A prime possibility suited for external memory is random-walk cuckoo hashing with two hash functions and super-constant bucket sizes. Second, it would be interesting to develop a fully de-amortized dictionary for the Practical RAM model in which all operations take O(1) time with overwhelming probability.

Acknowledgments

This research was supported in part by the U.S. National Science Foundation, under grants 0713046, 0830403, and 0847968, and by an Office of Naval Research Multidisciplinary University Research Initiative (MURI) Award, number N00014-08-1-1015. Justin Thaler is supported by the Department of Defense (DoD) through the National Defense Science & Engineering Graduate Fellowship (NDSEG) Program. Michael Mitzenmacher was supported in part by the U.S. National Science Foundation, under grants 0964473, 0915922, and 0721491.

References

[1] A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. Comm. ACM, 31:1116–1127, 1988.

[2] A. Andersson, P. B. Miltersen, and M. Thorup. Fusion trees can be implemented with AC0 instructions only. Theoretical Computer Science, 215(1–2):337–344, 1999.

[3] E. Angelino, M. T. Goodrich, M. Mitzenmacher, and J. Thaler. External-memory multimaps. ArXiv ePrints, abs/1104.5533, 2011.

[4] Y. Arbitman. Efficient dictionary data structures based on cuckoo hashing. MSc thesis, Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Israel, 2010.

[5] Y. Arbitman, M. Naor, and G. Segev. De-amortized cuckoo hashing: Provable worst-case performance and experimental results. In Automata, Languages and Programming (ICALP), pages 107–118, 2009.

[6] Y. Arbitman, M. Naor, and G. Segev. Backyard cuckoo hashing: Constant worst-case operations with a succinct representation. In IEEE Symp. on Foundations of Computer Science (FOCS), pages 787–796, 2010.

[7] M. A. Bender, E. D. Demaine, and M. Farach-Colton. Cache-oblivious B-trees. In 41st IEEE Symp. on Foundations of Computer Science (FOCS), pages 399–409, 2000.

[8] M. Braverman. Poly-logarithmic independence fools AC0 circuits. Electronic Colloquium on Computational Complexity, 16:3–8, 2009.

[9] G. S. Brodal, R. Fagerberg, and R. Jacob. Cache-oblivious search trees via binary trees of small height. In 13th ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 39–48, 2002.

[10] G. S. Brodal, R. Fagerberg, and K. Vinther. Engineering a cache-oblivious sorting algorithm. J. Exp. Algorithmics, 12:2.2:1–2.2:23, 2008.

[11] G. S. Brodal, E. D. Demaine, and J. I. Munro. Fast allocation and deallocation with an improved buddy system. Acta Inf., 41:273–291, March 2005.

[12] S. Büttcher and C. L. A. Clarke. Indexing time vs. query time: Trade-offs in dynamic information retrieval systems. In Proc. of 14th ACM Conf. on Information and Knowledge Management (CIKM), pages 317–318. ACM, 2005.

[13] S. Büttcher, C. L. A. Clarke, and B. Lushman. Hybrid index maintenance for growing text collections. In Proc. of 29th ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR), pages 356–363. ACM, 2006.

[14] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, Cambridge, MA, 2nd edition, 2001.

[15] D. Cutting and J. Pedersen. Optimization for dynamic inverted index maintenance. In 13th ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR), pages 405–411. ACM, 1990.

[16] L. Devroye and P. Morin. Cuckoo hashing: Further analysis. Information Processing Letters, 86(4):215–219, 2003.

[17] L. Devroye and P. Morin. Cuckoo hashing: Further analysis. Information Processing Letters, 86:215–219, 2003.

[18] M. Dietzfelbinger and M. Rink. Applications of a splitting trick. In Proc. of the 36th International Colloquium on Automata, Languages and Programming (ICALP), Part I, pages 354–365, Berlin, Heidelberg, 2009. Springer-Verlag.

[19] M. Dietzfelbinger and P. Woelfel. Almost random graphs with simple hash functions. In Proc. of the 35th Annual ACM Symposium on Theory of Computing (STOC), pages 629–638, New York, NY, USA, 2003. ACM.

[20] M. L. Fredman and D. E. Willard. Surpassing the information theoretic bound with fusion trees. J. Comput. System Sci., 47:424–436, 1993.

[21] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In 40th IEEE Symp. on Foundations of Computer Science (FOCS), pages 285–298, 1999.

[22] M. T. Goodrich and M. Mitzenmacher. Privacy-preserving access of outsourced data via oblivious RAM simulation. CoRR, abs/1007.1259, 2010.

[23] R. Guo, X. Cheng, H. Xu, and B. Wang. Efficient on-line index maintenance for dynamic text collections by using dynamic balancing tree. In Proc. of 16th ACM Conf. on Information and Knowledge Management (CIKM), pages 751–760. ACM, 2007.

[24] A. Kirsch and M. Mitzenmacher. Using a queue to de-amortize cuckoo hashing in hardware. In 45th Allerton Conference on Communication, Control, and Computing, pages 751–758, 2007.

[25] A. Kirsch, M. Mitzenmacher, and U. Wieder. More robust hashing: Cuckoo hashing with a stash. SIAM J. Comput., 39:1543–1561, 2009.

[26] D. E. Knuth. Sorting and Searching, volume 3 of The Art of Computer Programming. Addison-Wesley, Reading, MA, 1973.

[27] R. Kutzelnigg. Bipartite random graphs and cuckoo hashing. DMTCS Proceedings, 0(1), 2006.
[28] R. Kutzelnigg. An improved version of cuckoo hashing: Average case analysis of construction cost and search operations. Mathematics in Computer Science, 3:47–60, 2010.

[29] N. Lester, A. Moffat, and J. Zobel. Efficient online index construction for text databases. ACM Trans. Database Syst., 33:19:1–19:33, September 2008.

[30] N. Lester, J. Zobel, and H. Williams. Efficient online index maintenance for contiguous inverted lists. Inf. Processing & Management, 42(4):916–933, 2006.

[31] R. W. Luk and W. Lam. Efficient in-memory extensible inverted file. Information Systems, 32(5):733–754, 2007.

[32] C. McDiarmid. Concentration. In Probabilistic Methods for Algorithmic Discrete Mathematics (M. Habib, C. McDiarmid, J. Ramirez-Alfonsin, B. Reed, eds.), pages 195–248, Berlin, 1998. Springer.

[33] P. B. Miltersen. Lower bounds for static dictionaries on RAMs with bit operations but no multiplication. In F. Meyer and B. Monien, editors, Int. Conf. on Automata, Languages and Programming (ICALP), volume 1099 of LNCS, pages 442–453. Springer, 1996.

[34] M. Naor, G. Segev, and U. Wieder. History-independent cuckoo hashing. In Proceedings of ICALP, pages 631–642. Springer, 2008.

[35] A. Pagh and R. Pagh. Uniform hashing in constant time and optimal space. SIAM J. Comput., 38:85–96, March 2008.

[36] R. Pagh and F. Rodler. Cuckoo hashing. Journal of Algorithms, 52:122–144, 2004.

[37] R. Pagh, Z. Wei, K. Yi, and Q. Zhang. Cache-oblivious hashing. In 29th ACM Symp. on Principles of Database Systems (PODS), pages 297–304, 2010.

[38] S. Rao Kosaraju and M. Pop. De-amortization of algorithms. In W.-L. Hsu and M.-Y. Kao, editors, Computing and Combinatorics, volume 1449 of LNCS, pages 4–14. Springer, 1998.

[39] A. Siegel. On universal classes of extremely random constant-time hash functions. SIAM J. Comput., 33:505–543, March 2004.

[40] M. Thorup. On AC0 implementations of fusion trees and atomic heaps. In 14th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 699–707, 2003.

[41] E. Verbin and Q. Zhang. The limits of buffering: A tight lower bound for dynamic membership in the external memory model. In Proceedings of the 42nd ACM Symposium on Theory of Computing (STOC), pages 447–456, New York, NY, USA, 2010. ACM.

[42] J. S. Vitter. External memory algorithms and data structures: Dealing with massive data. ACM Comput. Surv., 33(2):209–271, 2001.

[43] D. E. Willard. Examining computational geometry, van Emde Boas trees, and hashing from the perspective of the fusion tree. SIAM J. Comput., 29:1030–1049, December 1999.

[44] J. Zobel and A. Moffat. Inverted files for text search engines. ACM Comput. Surv., 38, July 2006.

A Our Atomic Stash Structure

In this section, we describe our deterministic atomic stash implementation of the dictionary ADT. This structure dynamically maintains a sparse set, S, of at most O(w^{1/2}) key-value pairs, using O(w) words of memory, so as to support insertions, deletions, and lookups in O(1) worst-case time, where w is our computer's word size. Our construction is valid in the I/O model and the AC0 RAM model [40], that is, the RAM model minus constant-time multiplication and division, but augmented with constant-time AC0 instructions that are included in modern CPU instruction sets. (See previous work on the atomic heap [20] data structure for a solution in the standard RAM model, albeit one that limits the size of S to be O(w^{1/6}) and is less efficient in terms of constant factors.) We assume that keys and values can each fit in a word of memory.
Our construction builds on the fusion tree and atomic heap implementations of previous researchers [2, 20, 40], but improves the simplicity and capacity of these previous implementations by taking advantage of modern CPU instruction sets and the fact that we are interested here in maintaining only a simple dictionary, rather than an ordered set. For instance, our solution avoids any lookups in pre-computed tables.

A.1 Components of an Atomic Stash

Our solution is based on associating with each key k in S a compressed key, B_k, of size w′ = ⌊w^{1/2}⌋ − 1, which we store as a representative of k. In addition, for each compressed key, B_k, we store a binary mask, M_k, of equal length, such that B_k ∧ M_k ≠ B_j ∧ M_k, for j ≠ k with j in S, where "∧" denotes bit-wise AND. That is, all of the masked keys in S are unique.

A critical element in our data structure is an associative cache, X, which is stored in a single word, which we view as being divided into w′ fields of size w′ each, plus a (high-order) indicator bit for each field, so w ≥ w′(w′ + 1). We denote the field with index i in X as X(i) and the indicator bit with index i (for field i) in X as X[i], and we assume that indices are increasing right-to-left and begin with index 1. Thus, the bit position of X[j] in X is j(w′ + 1). Note that with standard SHIFT, OR, and AND operations, we can read or write any field or indicator bit in X in O(1) time given its index and either a single bit or a bit mask, ~1, of w′ 1's. Each field X(i) is either empty or it stores a compressed key, B_k, for some key k in S. In addition, we also maintain a word, Y, with indices corresponding to those in X, such that Y(i) stores the mask, M_k, if X(i) stores the binary key B_k.
We also maintain key and value arrays, K and V, such that, if B_k is stored in X(i), then we store the key-value pair (k, v) in K and V, so that K[i] = k and V[i] = v. To keep track of the size of S, we maintain a count, n_S, which is the number of items in S. Finally, we maintain a "used" mask, U, which has all 1's in each field of X that is used. That is, U(i) = ~1 iff X(i) holds some key, B_k, which is equivalent to saying Y(i) ≠ 0.

The reason we use both a compressed key and a mask for each key stored in S is that, as we add keys to S, we may sometimes need to expand the function that compresses keys so that the masked values of keys remain unique, while still fitting in a field of X. Even in this environment, we would like previously compressed keys to still be valid. In particular, our method generates a sequence of compression functions, p_1, p_2, etc., so that p_i returns a w′-bit string whose first i bits are significant (and can be either 0 or 1) and whose remaining bits are all 0's. In addition, if we let M_d denote a bit mask with d significant 1 bits and w′ − d following 0 bits, then, for d ≥ 2, p_d(k) ∧ M_{d−1} = p_{d−1}(k), for any key k.

A.2 Operations in an Atomic Stash

Let us now describe how we perform the various update and query operations on an atomic stash. We assume we may use the following primitive operations in our methods:

• A binary operator, "⊕", which denotes the bit-wise XOR operation.

• DUPLICATE(B): Return a word Z having the binary key B (of size w′) stored in each of its fields. (Note: we can implement DUPLICATE either using a single multiplication or using O(1) instructions of a modern CPU.)

• VecEQ(W, B): Given a word W and a binary key B (of size w′), set the indicator bit W[i] = 1 iff B = W(i). This operation can be implemented using standard AC0 operations [2, 40].
• MSB(W): Return the index of the most significant 1 bit in the word W, or 0 if W = 0. As Thorup observed [40], this operation can be implemented, for example, by converting W to floating point and returning the exponent plus 1.

We perform a getIndex(k) operation, which returns the index of key k in X, or 0 if k is not a key in S, as follows. We assume in this method that we have access to the current compression function, p_d.

• getIndex(k):
    B ← p_d(k)
    Z ← DUPLICATE(B)
    T ← Y ∧ (X ⊕ Z)
    R ← U ⊕ T
    VecEQ(R, ~1)
    return MSB(R) / (w′ + 1)

The correctness of this method follows from the fact that B is the key in S at index i associated with k iff T(i) is all 0's and index i is being used to store a key from S, since all masked keys in S are unique (and if B is not in X, then MSB(R) = 0). Also, if one desires to avoid integer division, then we can define w′ so that (w′ + 1) is a power of 2, while keeping w ≥ w′(w′ + 1), so that the above division can be done as a SHIFT. In addition, note that we can implement a get(k) operation by performing a call to getIndex(k) and, if the index, i, returned is greater than 0, then returning (K[i], V[i]).

To remove the key-value pair from S associated with a key k in S, we perform the following operation:

• remove(k):
    i ← getIndex(k)
    X(i) ← X(n_S)
    Y(i) ← Y(n_S)
    U(n_S) ← 0
    K[i] ← K[n_S]
    V[i] ← V[n_S]
    n_S ← n_S − 1

Our insertion method, which follows, assumes that we have access to the current compression function, p_d, as well as the mask, M_d, of d 1's followed by (w′ − d) 0's. We also assume we know the current value of (the global variable) d and that we have a function, PEXPAND(k_1, k_2), which takes two distinct keys, k_1 and k_2, and expands the compression function from p_d to p_{d+1} so that p_{d+1}(k_1) ≠ p_{d+1}(k_2).
This method also defines M_{d+1} to consist of (d + 1) significant 1 bits and w′ − (d + 1) trailing 0's, and it increments d.

• add(k, v):
    i ← getIndex(k)
    if i > 0 then
        {We had a collision.}
        PEXPAND(k, K[i])    {Also increments d.}
        X(i) ← p_d(K[i])
        Y(i) ← M_d
    n_S ← n_S + 1
    X(n_S) ← p_d(k)
    Y(n_S) ← M_d
    U(n_S) ← ~1
    K[n_S] ← k
    V[n_S] ← v

Since the compressed keys in S form a set of unique masked keys, the new key, k, will collide with at most one of them, even when compressed. So it is sufficient in this case that we expand the compression function so that it distinguishes k and this one colliding field. Thus, by a simple inductive argument, we maintain the property that all the masked keys in S are unique.

A.3 Compressing Keys and De-Amortization

Let us now describe how we compress keys and expand the compression function, as well as how we perform the necessary de-amortization so that the size of compressed keys is never more than w′. Our method makes use of the following primitive AC0 operation [2, 40], which can also be computed from the primitives included in modern CPU instruction sets.

• SELECT(W, k): Given a word W and key k, the fields of W are viewed as bit pointers. A length-w′ bit string, B, is returned so that the i-th bit of B equals the W(i)-th bit of k.

Our compression function is encoded in terms of the counter, d, and a word, W, that encodes the bits to be selected from keys, so that W(i) is the index of the i-th bit of k in the output of p_d(k), for i ≤ d. For i > d, W(i) = 0. Thus, we define p_d simply as follows.

• p_d(k):
    return SELECT(W, k)

Thus, we also have a simple definition for PEXPAND:

• PEXPAND(k_1, k_2):
    d ← d + 1
    W(d) ← MSB(k_1 ⊕ k_2)
    M_d ← M_{d−1} ∨ (1 SHIFT d)

Note that in the way we are using the expansion function, p_d, we only ever add bits to the ends of our compressed keys. We never remove bits.
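A toy model of SELECT and PEXPAND may help: here W is represented as a Python list of bit positions rather than a packed word, and the selected bits are placed low-order first; both are illustrative simplifications of our own.

```python
# Hedged toy model of the compression scheme: W records which key bits
# the current p_d selects; PEXPAND appends the highest bit on which two
# colliding keys differ. WPRIME is an illustrative choice of w'.

WPRIME = 8

def select(W, k):
    """SELECT(W, k): bit i of the result is the W(i)-th bit of k
    (1-indexed; W(i) = 0 means the field is unused and yields a 0 bit)."""
    B = 0
    for i in range(WPRIME):
        pos = W[i]
        if pos and (k >> (pos - 1)) & 1:
            B |= 1 << i
    return B

def pexpand(W, d, k1, k2):
    """PEXPAND(k1, k2): record MSB(k1 XOR k2) as the next selected bit,
    so that p_{d+1}(k1) != p_{d+1}(k2). Returns the incremented d."""
    W[d] = (k1 ^ k2).bit_length()   # MSB, 1-indexed
    return d + 1
```

Because pexpand only fills in the next unused slot of W, previously compressed keys keep their old bits, matching the invariant p_d(k) ∧ M_{d−1} = p_{d−1}(k).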
Because bits are only ever appended, keys compressed under previous instantiations of the compression function continue to be valid. But this also implies that, as we continue to add keys to S, we might run out of fields to use, even if we keep the size, n_S, of S to be at most, say, w′/2. Of course, we could use a standard amortized rebuilding scheme to maintain d to be at most w′, but this would require using amortization instead of achieving true worst-case constant-time bounds for our updates and lookups.

As a de-amortization technique, therefore, we revise our construction so that, whenever d > w′/2, we create a new, initially empty, atomic stash. For each additional atomic-stash operation after this initialization, we remove two items from the old atomic stash and add them to this new atomic stash. In performing accesses and updates during this time, we also keep a crossover index, x, so that references to fields and indicator indices less than or equal to x are done in the new stash, and references to fields and indicator indices greater than x are done in the old stash. Thus, after w′/2 additional operations, we can discard the old stash (for which d ≤ w′) and fully replace it with the new one (for which d ≤ w′/2). Therefore, so long as n_S ≤ w′/2, we can perform all insertion, deletion, and lookup operations in worst-case O(1) time, which in the I/O model corresponds to O(1) I/Os.
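Finally, the word-parallel comparison behind getIndex (Section A.2) can be simulated at the bit level with small illustrative parameters; the field layout, helper functions, and constants below are our own illustrative choices rather than the paper's exact encoding.

```python
# Hedged bit-level simulation of getIndex on Python integers, with
# w' = 3-bit fields plus a high-order indicator bit per field.

WP = 3                      # w' (tiny, for illustration)
SLOT = WP + 1               # field width plus its indicator bit
ONES = (1 << WP) - 1        # the all-1's field mask ~1

def set_field(W, i, val):
    """Write val into field i of word W using only SHIFT/AND/OR."""
    shift = (i - 1) * SLOT
    return (W & ~(ONES << shift)) | (val << shift)

def duplicate(B):
    """DUPLICATE(B): B stored in every field, indicator bits zero."""
    Z = 0
    for i in range(1, WP + 1):
        Z |= B << ((i - 1) * SLOT)
    return Z

def vec_eq(W, B):
    """VecEQ(W, B): set indicator bit i iff field i of W equals B.
    (Done field-by-field here; the paper uses O(1) AC0 word operations.)"""
    R = 0
    for i in range(1, WP + 1):
        if (W >> ((i - 1) * SLOT)) & ONES == B:
            R |= 1 << (i * SLOT - 1)
    return R

def get_index(X, Y, U, B):
    """getIndex for an already-compressed key B = p_d(k)."""
    Z = duplicate(B)
    T = Y & (X ^ Z)                  # T(i) == 0 iff masked field i matches B
    R = U ^ T                        # R(i) == ~1 iff field i used and matches
    R = vec_eq(R, ONES)
    return R.bit_length() // SLOT    # MSB(R)/(w'+1); 0 when absent
```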
