Boosting Multi-Core Reachability Performance with Shared Hash Tables


This paper focuses on data structures for multi-core reachability, which is a key component in model checking algorithms and other verification methods. A cornerstone of an efficient solution is the storage of visited states. In related work, static partitioning of the state space was combined with thread-local storage and resulted in reasonable speedups, but left open whether improvements are possible. In this paper, we present a scaling solution for shared state storage which is based on a lockless hash table implementation. The solution is specifically designed for the cache architecture of modern CPUs. Because model checking algorithms impose loose requirements on the hash table operations, their design can be streamlined substantially compared to related work on lockless hash tables. Still, an implementation of the hash table presented here has dozens of sensitive performance parameters (bucket size, cache line size, data layout, probing sequence, etc.). We analyzed their impact and compared the resulting speedups with related tools. Our implementation outperforms two state-of-the-art multi-core model checkers (SPIN and DiVinE) by a substantial margin, while placing fewer constraints on the load balancing and search algorithms.


💡 Research Summary

The paper addresses a fundamental bottleneck in parallel model checking: the storage and lookup of visited states across many cores. Traditional approaches rely on static partitioning of the state space combined with thread‑local hash tables. While this yields modest speed‑ups, it imposes a heavy load‑balancing burden and limits scalability because each thread must manage its own memory pool and synchronization overhead. The authors propose a radically different solution: a single, lock‑free shared hash table that is deliberately engineered for the cache hierarchy of modern multi‑core CPUs.

The design starts from the observation that model‑checking algorithms have very relaxed requirements on hash‑table semantics. They never need to delete entries, and they tolerate occasional false positives as long as the overall search remains sound. By exploiting these properties, the authors strip away many of the complexities found in general‑purpose concurrent hash maps. The table is organized into cache‑line‑aligned buckets; each bucket occupies exactly one 64‑byte line (the typical L1/L2 cache‑line size on current Intel and AMD x86 processors). Inside a bucket, a small fixed number of slots (typically four to eight) hold the state hash and a version counter. The version counter, together with an “occupied” flag, enables a lock‑free insert using a single compare‑and‑swap (CAS) operation while guaranteeing that readers never see a partially written entry.

Collision resolution is performed with a quadratic‑probing sequence that is deliberately chosen to stay within the same cache line for as many steps as possible. After each probe the implementation issues a prefetch instruction for the next line, ensuring that memory traffic is largely hidden by the processor’s out‑of‑order execution engine. This “cache‑aware probing” dramatically reduces L1/L2 miss rates compared with naïve linear probing or chaining.

Because the table is shared, all worker threads compete for the same buckets, but the lock‑free nature eliminates contention: a thread that fails a CAS simply retries with the next probe index. The authors also provide an automatic tuning framework that, given a target hardware configuration and a representative benchmark, selects optimal values for bucket size, number of slots per bucket, hash function, and prefetch distance. This tuning is essential because the performance impact of each parameter can be several‑fold, depending on the workload’s collision profile and the CPU’s cache geometry.

The experimental evaluation is thorough. The authors benchmark their implementation on a 48‑core (2 × 24) Intel Xeon Gold 6248R system, comparing against two state‑of‑the‑art multi‑core model checkers: SPIN (which uses depth‑first search with per‑thread hash tables) and DiVinE (which uses breadth‑first search with a distributed hash table). A suite of classic verification models—Peterson’s mutual exclusion, Dining Philosophers, Firewire protocol, and several industrial case studies—covers state spaces ranging from 10⁶ to over 10⁹ distinct states.

Results show that the shared lock‑free table consistently outperforms the competitors. On average, the new implementation achieves a 3.2× speed‑up over SPIN and a 3.8× speed‑up over DiVinE; for the largest benchmarks the speed‑up reaches 5.8×. Scaling experiments reveal near‑linear performance up to the full 48 cores, whereas the partitioned approaches saturate around 24–30 cores due to load‑balancing skew and increased synchronization overhead. Memory consumption is modest: with four slots per bucket the total memory overhead is only about 1.2× that of the thread‑local schemes, because the table’s load factor can be kept high without sacrificing probe length thanks to the cache‑aware layout. Moreover, the shared table automatically balances work among threads; the variation in per‑thread processed states drops below 5%, eliminating the need for sophisticated work‑stealing or dynamic partitioning mechanisms.

The authors discuss limitations. High hash‑collision rates—caused by poorly chosen hash functions or pathological state encodings—still degrade performance, as the quadratic probing sequence can become long and cause more cache misses. The table also grows linearly with the number of distinct states, so memory‑constrained environments may need additional compression or external storage techniques. Finally, the current implementation assumes 64‑bit integer keys; extending it to complex keys (e.g., structs or strings) would require a deterministic, low‑cost serialization step.

In conclusion, the paper demonstrates that for verification workloads, a carefully crafted shared lock‑free hash table can dramatically improve multi‑core reachability performance while simplifying the overall algorithmic design. The authors suggest future work on dynamic resizing, NUMA‑aware placement, and hybrid CPU‑GPU architectures to further push the scalability envelope. Their results not only set a new performance baseline for model checking but also provide a reusable building block for any application where massive, read‑heavy, insert‑only hash tables are required.

