Efficient Synchronization Primitives for GPUs
In this paper, we revisit the design of synchronization primitives—specifically barriers, mutexes, and semaphores—and how they apply to the GPU. Previous implementations are insufficient due to discrepancies between the hardware and programming models of GPUs and CPUs. We create new implementations in CUDA and analyze the performance of spinning on the GPU, as well as a method of sleeping on the GPU, by running a set of memory-system benchmarks on two of the most common GPUs in use, the Tesla- and Fermi-class GPUs from NVIDIA. From our results we define higher-level principles that are valid for generic many-core processors, the most important of which is to limit the number of atomic accesses required for a synchronization operation, because atomic accesses are slower than regular memory accesses. We use the results of the benchmarks to critique existing synchronization algorithms and guide our new implementations, and then define an abstraction of GPUs that classifies any GPU based on the behavior of its memory system. We use this abstraction to create suitable implementations of the primitives specifically targeting the GPU, and analyze the performance of these algorithms on Tesla and Fermi. We then predict performance on future GPUs based on characteristics of the abstraction. We also examine the roles of spin waiting and sleep waiting in each primitive and how their performance varies with the machine abstraction, then give a set of guidelines for when each strategy is useful based on the characteristics of the GPU and the expected contention.
💡 Research Summary
The paper revisits the design of three fundamental synchronization primitives—barriers, mutexes, and semaphores—for general‑purpose GPUs. The authors observe that most existing GPU synchronization solutions are simple spin‑locks that were inherited from early CPU designs and that these solutions are ill‑suited for modern many‑core GPUs because they rely heavily on atomic operations, generate high contention, and provide no higher‑level abstraction. To address these shortcomings the authors develop new CUDA implementations of each primitive, conduct a systematic set of micro‑benchmarks on two representative NVIDIA architectures (Tesla GT200 and Fermi GF100), and derive a machine‑abstraction model that captures the most relevant memory‑system characteristics for synchronization.
The benchmark suite consists of twelve tests that isolate (1) atomic reads, (2) atomic writes, (3) volatile (non‑atomic) reads, (4) volatile writes, and (5) mixed sequences where a single atomic operation precedes a series of volatile accesses. Each test is executed 1,000 times per block while the GPU is fully saturated (240 blocks on Tesla, 128 on Fermi). The results show dramatic differences between the two architectures: on Tesla, atomic operations are up to 90× slower than volatile accesses because there is no L2 cache and every atomic must fetch data from DRAM; on Fermi, the same ratio drops to roughly 3× thanks to an L2 cache that services most atomics. Moreover, non‑contentious accesses (each block works on its own word) are 5–10× faster than contentious accesses on Fermi, while the advantage is modest on Tesla. A particularly important observation is that on Fermi, issuing volatile accesses immediately after an atomic causes the atomic unit to retain exclusive control of the cache line, effectively serializing the subsequent volatile operations and erasing any performance benefit.
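The micro-benchmarks described above can be pictured with a minimal CUDA sketch. This is not the paper's benchmark code; the kernel names, the use of `clock64()` for timing, and the iteration constant are illustrative assumptions, but the structure mirrors the tests summarized above: a contended atomic loop (every block hammers one word) versus an uncontended volatile loop (each block reads its own word), each repeated 1,000 times per block.

```cuda
#include <cstdio>

#define ITERS 1000  // the summary notes each test runs 1,000 times per block

// Contentious atomic test: every block updates the same global word.
__global__ void atomicContended(unsigned int *word, long long *cycles) {
    long long start = clock64();
    for (int i = 0; i < ITERS; ++i)
        atomicAdd(word, 1u);               // one atomic per iteration
    if (threadIdx.x == 0)
        cycles[blockIdx.x] = clock64() - start;
}

// Non-contentious volatile test: each block reads only its own word.
__global__ void volatileUncontended(volatile unsigned int *words,
                                    long long *cycles) {
    unsigned int sink = 0;
    long long start = clock64();
    for (int i = 0; i < ITERS; ++i)
        sink += words[blockIdx.x];         // plain volatile load, no atomic
    if (threadIdx.x == 0) {
        cycles[blockIdx.x] = clock64() - start;
        if (sink == 0xFFFFFFFFu)           // keep the loop from being
            words[blockIdx.x] = sink;      // optimized away
    }
}
```

Comparing the per-block cycle counts from the two kernels, launched at full occupancy (240 blocks on Tesla, 128 on Fermi, per the summary), is what exposes the roughly 90× atomic-vs-volatile gap on Tesla versus roughly 3× on Fermi.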
From these measurements the authors construct a three‑parameter abstraction: (i) the absolute latency of an atomic operation, (ii) the slowdown factor caused by contention, and (iii) the degree to which post‑atomic volatile accesses are serialized. This abstraction guides the design of synchronization algorithms that minimize the number of atomics, front‑load any unavoidable atomics, and then rely on cheap volatile operations for the remainder of the protocol.
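The three parameters can be written down as a simple record; the struct and field names below are our own shorthand, not notation from the paper:

```cuda
// Hypothetical encoding of the three-parameter machine abstraction;
// field names are illustrative, not the paper's.
struct GpuSyncModel {
    float atomicLatency;           // (i) absolute latency of one atomic op
    float contentionFactor;        // (ii) slowdown when many blocks target
                                   //      the same word
    float postAtomicSerialization; // (iii) degree to which volatile accesses
                                   //       issued after an atomic serialize
};
```

A synchronization algorithm tuned with this model pays the `atomicLatency` cost as few times as possible, avoids amplifying it by `contentionFactor`, and spaces volatile work away from atomics where `postAtomicSerialization` is high.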
Using the abstraction, the paper proposes concrete implementations:
- Two‑phase atomic barrier – each block atomically increments a global arrival counter, then spins on a volatile flag that the last-arriving block sets once the counter reaches the total block count. Only one atomic per block is required.
- Ticket‑based mutex – blocks obtain a ticket via atomicAdd, then wait on a volatile “now serving” variable until their ticket matches. Again, a single atomic per lock acquisition.
- Counter‑plus‑queue semaphore – an atomic decrement acquires a resource; if the counter becomes negative the block waits on a volatile flag. Release performs an atomic increment and notifies a waiting block.
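The first two designs above can be sketched in CUDA device code. These are hedged illustrations of the pattern the summary describes (one atomic up front, volatile spins afterward), not the paper's actual implementations; the globals, the single-use barrier flag, and the one-thread-per-block convention are simplifying assumptions.

```cuda
// Illustrative single-use sketches; names and details are our own.
__device__ unsigned int barrierCount = 0;        // arrival counter
__device__ volatile unsigned int barrierFlag = 0; // release flag

// Two-phase atomic barrier: one atomic per block, then cheap volatile spins.
__device__ void blockBarrier(unsigned int numBlocks) {
    if (threadIdx.x == 0) {
        unsigned int arrived = atomicAdd(&barrierCount, 1u) + 1;
        if (arrived == numBlocks)
            barrierFlag = 1;          // last block to arrive releases everyone
        while (barrierFlag == 0) ;    // volatile spin, no further atomics
    }
    __syncthreads();                  // fold the rest of the block back in
}

// Ticket mutex: one atomic to take a ticket, volatile spins to wait.
__device__ unsigned int nextTicket = 0;
__device__ volatile unsigned int nowServing = 0;

__device__ void mutexLock() {
    if (threadIdx.x == 0) {
        unsigned int myTicket = atomicAdd(&nextTicket, 1u);
        while (nowServing != myTicket) ;   // wait for our turn, atomics-free
    }
    __syncthreads();
}

__device__ void mutexUnlock() {
    __syncthreads();
    if (threadIdx.x == 0)
        nowServing = nowServing + 1;       // volatile store hands off the lock
}
```

The semaphore follows the same shape: an `atomicSub` on the counter up front, and if the result goes negative, a volatile spin on a per-block wake flag until a releasing block's `atomicAdd` and notification arrive.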
All primitives operate at block granularity; intra‑block coordination is handled by __syncthreads(), which is extremely cheap. The designs also incorporate a hybrid “spin‑then‑sleep” policy: after a configurable number of volatile spins, a block may give up the GPU (by terminating the kernel and restarting later) to avoid wasting cycles under high contention.
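The spin-then-sleep policy reduces, at the device-code level, to a bounded spin that reports failure instead of waiting forever; the host can then treat a failed kernel pass as the "sleep" and relaunch later. The helper below is a sketch of that idea; `SPIN_LIMIT` and the function name are illustrative, and the save-state/relaunch machinery on the host side is omitted.

```cuda
#define SPIN_LIMIT 1024  // illustrative bound on volatile spins

// Spin on a volatile flag at most SPIN_LIMIT times. Returning false tells
// the caller to checkpoint its state and let the kernel terminate, which
// is how a block "sleeps" (the host relaunches the kernel later).
__device__ bool spinThenGiveUp(volatile unsigned int *flag) {
    for (int i = 0; i < SPIN_LIMIT; ++i)
        if (*flag != 0)
            return true;   // condition became true while spinning
    return false;          // give up the GPU rather than burn more cycles
}
```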
Performance evaluation demonstrates that the new primitives outperform naïve spin‑locks by 2.5×–7× on both Tesla and Fermi, with especially strong scaling on Fermi where the cache mitigates contention. In low‑contention scenarios on Tesla, the simpler spin‑lock can still be competitive because the cost of any atomic dominates. The authors therefore provide a decision matrix: on GPUs where atomics are cheap and caches are effective (modern architectures), pure spin‑waiting is usually preferable; on older GPUs with expensive atomics, a limited spin followed by kernel‑level sleep yields lower overall latency.
Finally, the paper extrapolates its findings to future GPUs by arguing that as memory hierarchies become deeper and atomics become more efficient, the emphasis will shift toward designs that keep atomics to a minimum and exploit fast volatile synchronization. The presented abstraction and guidelines therefore serve as a portable framework for building high‑performance synchronization primitives across current and next‑generation many‑core accelerators.