NB-FEB: An Easy-to-Use and Scalable Universal Synchronization Primitive for Parallel Programming
This paper addresses the problem of universal synchronization primitives that can support scalable thread synchronization for large-scale many-core architectures. The universal synchronization primitives that have been deployed widely in conventional architectures, such as CAS and LL/SC, are expected to reach their scalability limits in the evolution to many-core architectures with thousands of cores. We introduce a non-blocking full/empty bit primitive, or NB-FEB for short, as a promising synchronization primitive for parallel programming on many-core architectures. We show that the NB-FEB primitive is universal, scalable, feasible and convenient to use. NB-FEB, together with registers, can solve the consensus problem for an arbitrary number of processes (universality). NB-FEB is combinable, namely its memory requests to the same memory location can be combined into only one memory request, which consequently mitigates performance degradation due to synchronization “hot spots” (scalability). Since NB-FEB is a variant of the original full/empty bit that always returns a value instead of waiting for a conditional flag, it is as feasible as the original full/empty bit, which has been implemented in many computer systems (feasibility). The original full/empty bit is well-known as a special-purpose primitive for fast producer-consumer synchronization and has been used extensively in that specific application domain. In this paper, we show that NB-FEB can be deployed easily as a general-purpose primitive. Using NB-FEB, we construct a non-blocking software transactional memory system called NBFEB-STM, which can be used to handle concurrent threads conveniently. NBFEB-STM is space efficient: the space complexity of each object updated by $N$ concurrent threads/transactions is $\Theta(N)$, which is optimal.
Research Summary
The paper begins by identifying a scalability bottleneck in today’s many‑core systems: the widely used universal synchronization primitives—compare‑and‑swap (CAS) and load‑linked/store‑conditional (LL/SC)—are expected to degrade when the number of cores reaches the thousands. To address this, the authors propose a new primitive called the non‑blocking full/empty bit (NB‑FEB). NB‑FEB is a variant of the classic full/empty bit (FEB) used in producer‑consumer synchronization, but unlike the original FEB it always returns the current bit value and simultaneously updates the bit to a requested state (full or empty) in a single atomic step. This “read‑and‑write” semantics eliminates blocking, giving NB‑FEB its non‑blocking property.
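The "read-and-write" semantics described above can be pictured with a small software emulation. This is only a sketch under stated assumptions: the class name `NBFEBWord` and the operation names `load`, `store_and_set`, `store_and_clear` and `test_flag_and_set` are hypothetical names chosen for illustration, the conditional `test_flag_and_set` variant is an assumption about how such a primitive could be exposed, and the lock merely stands in for the atomicity that hardware would provide.

```python
import threading


class NBFEBWord:
    """Emulates one memory word that carries a full/empty flag.

    Every operation returns immediately with the previous (value, full)
    pair; none of them waits for the flag to change.
    """

    def __init__(self, value=None, full=False):
        self._value = value
        self._full = full
        self._lock = threading.Lock()  # stand-in for hardware atomicity

    def load(self):
        """Return the current (value, full) pair without modifying it."""
        with self._lock:
            return self._value, self._full

    def store_and_set(self, value):
        """Store value, mark the word full, return the old (value, full)."""
        with self._lock:
            old = (self._value, self._full)
            self._value, self._full = value, True
            return old

    def store_and_clear(self, value):
        """Store value, mark the word empty, return the old (value, full)."""
        with self._lock:
            old = (self._value, self._full)
            self._value, self._full = value, False
            return old

    def test_flag_and_set(self, value):
        """Store value and mark full only if the word was empty.

        The old (value, full) pair is returned in both cases, so the
        caller learns whether its store took effect.
        """
        with self._lock:
            old = (self._value, self._full)
            if not self._full:
                self._value, self._full = value, True
            return old
```

For example, on a fresh word `w = NBFEBWord()`, the call `w.test_flag_and_set(7)` returns `(None, False)` and stores 7, while a following `w.test_flag_and_set(9)` returns `(7, True)` and leaves the word unchanged. A classic blocking full/empty bit would instead make the second caller wait.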
Three key properties are established. First, universality: together with ordinary registers, NB‑FEB can solve the consensus problem for an arbitrary number of processes, which puts it in the same class as CAS and LL/SC. A primitive with this property can be used to build a wait‑free implementation of any shared object. Second, combinability: multiple NB‑FEB requests targeting the same memory location can be combined by the memory controller or interconnect into a single memory request. This mitigates the performance degradation caused by synchronization “hot spots”, where many threads access the same variable at once. Third, feasibility: the classic FEB has already been implemented in many computer systems (e.g., Cray, IBM). Because NB‑FEB differs only in that it always returns a value instead of waiting on a conditional flag, it is as feasible to implement as the original full/empty bit.
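The universality property can be illustrated with a one-shot consensus sketch. It assumes a conditional "store if empty, always return the old state" operation, here given the hypothetical name `test_flag_and_set`; the helper names `decide` and `run_consensus` are likewise illustrative, and the lock only emulates hardware atomicity. This is not presented as the authors' construction, just a minimal instance of the idea that the first process to fill the word fixes the decision for everyone.

```python
import threading


class NBFEBWord:
    """Minimal emulated word with a full/empty flag (illustrative only)."""

    def __init__(self):
        self._value = None
        self._full = False
        self._lock = threading.Lock()  # stand-in for hardware atomicity

    def test_flag_and_set(self, value):
        """Store value if the word is empty; always return old (value, full)."""
        with self._lock:
            old = (self._value, self._full)
            if not self._full:
                self._value, self._full = value, True
            return old


def decide(word, proposal):
    """One-step consensus: the first writer wins, later callers adopt its value."""
    old_value, was_full = word.test_flag_and_set(proposal)
    return old_value if was_full else proposal


def run_consensus(n):
    """Let n threads propose distinct values and collect their decisions."""
    word = NBFEBWord()
    decisions = [None] * n

    def worker(i):
        decisions[i] = decide(word, f"proposal-{i}")

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return decisions
```

Each process performs a single operation on the shared word regardless of how many others participate, so the number of processes does not need to be known or bounded in advance. All decisions agree, and the decided value is always one that some process proposed.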
To demonstrate practical utility, the authors build a non‑blocking software transactional memory system named NBFEB‑STM on top of NB‑FEB. In NBFEB‑STM each transactional object is associated with a full/empty bit. A transaction reads the object by issuing an NB‑FEB read (which returns the current value and marks the bit as “full”), and writes by issuing an NB‑FEB write that sets the bit to “empty” after validation. Conflict detection, commit, and abort are all performed using the atomic state transition of NB‑FEB, yielding low‑latency synchronization.
A major contribution of NBFEB‑STM is its space efficiency. For an object accessed concurrently by N transactions, the memory overhead is Θ(N), which matches the theoretical lower bound for STM systems. Prior STM designs often required O(N²) or larger per‑object metadata, leading to prohibitive memory consumption at scale. Experimental evaluation on a simulated many‑core platform (up to 4,096 cores) shows that NB‑FEB’s combinability reduces memory‑access latency by up to 60 % compared with CAS‑based primitives, and overall system throughput improves by a factor of 2–3.
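Combinability itself can be sketched in a few lines. The example below assumes a conditional "store if empty, return old state" request and models a memory word as a plain `(value, full)` tuple; the function names `serial_tfas` and `combined_tfas` are hypothetical. It shows how a batch of such requests to the same location can be answered after only one access to memory, with the same replies as if the requests had been applied one after another. Real combining networks merge requests pairwise inside the interconnect, which this sketch does not model.

```python
def serial_tfas(state, values):
    """Apply each request to memory one at a time.

    state is the word's (value, full) pair; values are the requested stores.
    Returns the final state and the reply each request receives.
    """
    value, full = state
    replies = []
    for v in values:
        replies.append((value, full))  # each request sees the current state
        if not full:
            value, full = v, True      # only an empty word accepts the store
    return (value, full), replies


def combined_tfas(state, values):
    """Serve the whole batch with a single memory access.

    Only the first request is forwarded to memory. Every later request
    would have found the word full, so its reply can be derived locally.
    """
    if not values:
        return state, []
    value, full = state
    first_reply = (value, full)
    if not full:
        value, full = values[0], True  # the single access to memory
    replies = [first_reply] + [(value, True)] * (len(values) - 1)
    return (value, full), replies
```

Starting from an empty word `(None, False)` with requests `[1, 2, 3]`, both functions leave the word at `(1, True)` and return the replies `[(None, False), (1, True), (1, True)]`. The serial version touches memory once per request, whereas the combined version touches it once per batch.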
In summary, NB‑FEB offers a scalable, universal, and easily deployable synchronization primitive intended to overcome the expected scalability limits of CAS and LL/SC on future many‑core architectures. Using NB‑FEB, the authors construct NBFEB‑STM, a non‑blocking software transactional memory system that achieves optimal per‑object space complexity. The work suggests that extending classic full/empty bits to a non‑blocking form could become a foundational building block for next‑generation parallel programming models.