A Complexity Separation Between the Cache-Coherent and Distributed Shared Memory Models
We consider asynchronous multiprocessor systems where processes communicate by accessing shared memory. Exchange of information among processes in such a multiprocessor necessitates costly memory accesses called \emph{remote memory references} (RMRs), which generate communication on the interconnect joining processors and main memory. In this paper we compare two popular shared memory architecture models, namely the \emph{cache-coherent} (CC) and \emph{distributed shared memory} (DSM) models, in terms of their power for solving synchronization problems efficiently with respect to RMRs. The particular problem we consider entails one process sending a “signal” to a subset of other processes. We show that a variant of this problem can be solved very efficiently with respect to RMRs in the CC model, but not so in the DSM model, even when we consider amortized RMR complexity. To our knowledge, this is the first separation in terms of amortized RMR complexity between the CC and DSM models. It is also the first separation in terms of RMR complexity (for asynchronous systems) that does not rely in any way on wait-freedom—the requirement that a process makes progress in a bounded number of its own steps.
💡 Research Summary
The paper investigates the relative power of two widely used shared‑memory architectures—cache‑coherent (CC) and distributed shared memory (DSM)—with respect to the cost of remote memory references (RMRs) in asynchronous multiprocessor systems. RMRs model the expensive communication that occurs when a processor must access memory that is not locally cached (in CC) or that resides in another processor's memory module (in DSM). While many prior works have compared CC and DSM on classic synchronization problems such as mutual exclusion (ME) and group mutual exclusion (GME), those studies either focused on worst‑case RMR bounds or relied on wait‑free progress guarantees. This work departs from that line by (1) using a very simple “signaling” problem as the benchmark, and (2) establishing a separation in amortized (average) RMR complexity that holds for both wait‑free and non‑wait‑free (busy‑wait) settings.
The signaling problem is defined as follows: a designated “signaler” process must notify a predetermined subset of other processes that an event has occurred. Each notified process repeatedly reads a shared flag until it observes the signal. The authors first present an algorithm for the CC model that solves the problem with O(1) amortized RMRs. The key idea is to place a single shared flag in memory; the signaler writes the flag once, and all waiting processes obtain the flag’s cache line after the first remote miss. Subsequent reads are served from the local cache, so the total number of RMRs incurred across all processes grows only with the number of distinct signals, not with the number of waiters. This algorithm satisfies both wait‑free and terminating progress properties.
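The cache behavior this argument relies on can be illustrated with a toy simulation (all class and variable names here are illustrative, not from the paper). The model captures the one property the CC algorithm exploits: after a waiter's first remote miss it holds a valid cached copy of the flag, so arbitrarily long local spinning incurs no further RMRs, and a write by the signaler invalidates those copies.

```python
# Toy model of spinning on one shared flag in the CC model.
# A read is an RMR only on a cache miss; a write invalidates
# every other processor's cached copy. Hypothetical sketch,
# not the paper's actual algorithm or cost accounting.

class CCFlag:
    """Single shared flag with a simple invalidation-based RMR counter."""
    def __init__(self):
        self.value = False
        self.cached = set()   # processes holding a valid cached copy
        self.rmrs = 0         # total remote memory references so far

    def read(self, pid):
        if pid not in self.cached:   # cache miss -> one remote reference
            self.rmrs += 1
            self.cached.add(pid)
        return self.value            # cache hit -> free

    def write(self, pid, value):
        self.rmrs += 1               # the write itself goes remote
        self.cached = {pid}          # invalidate all other copies
        self.value = value

N = 8
flag = CCFlag()

# Each waiter spins 100 times before the signal arrives:
# one miss each, then pure local cache hits.
for p in range(N):
    for _ in range(100):
        flag.read(p)

before = flag.rmrs                   # N misses, regardless of spin length
flag.write("signaler", True)         # the signal: one remote write
for p in range(N):
    flag.read(p)                     # one fresh miss per invalidated waiter

print(before, flag.rmrs)             # → 8 17
```

Note that this toy model charges one post-invalidation miss per waiter per write; the paper's O(1) amortized bound rests on a more careful algorithm, but the crucial CC feature shown here is that the cost of busy-waiting is independent of how long the waiters spin.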
Next, the paper proves a lower bound for any DSM algorithm solving the same problem. Because DSM partitions memory into processor‑local modules, a flag that resides in the signaler’s module cannot be read locally by other processes. Consequently, each waiting process must perform at least one remote read of the signaler’s module to learn the flag’s value. By constructing an adversarial schedule that forces all waiting processes to attempt the read after the signal is issued, the authors show that the total number of remote reads is Ω(N), where N is the number of waiting processes. Moreover, they extend the argument to amortized complexity: even if the signaler issues many signals, the average number of RMRs per signal remains bounded below by a quantity proportional to the number of waiters, yielding an Ω(N) amortized lower bound.
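The same toy accounting makes the DSM gap concrete (again, a hypothetical sketch with illustrative names, not the paper's formal model). In DSM there is no caching of another module's memory: every access to a location outside a process's own module is an RMR, so even a single post-signal read by each waiter already costs N remote references.

```python
# Toy model of the DSM cost of the signaling problem: the flag
# lives in the owner's memory module, and every access by any
# other process is remote. Illustrative sketch only.

class DSMFlag:
    """Flag stored in one module; non-owner accesses are always remote."""
    def __init__(self, owner):
        self.owner = owner
        self.value = False
        self.rmrs = 0

    def read(self, pid):
        if pid != self.owner:   # no caching: every non-owner read is an RMR
            self.rmrs += 1
        return self.value

    def write(self, pid, value):
        if pid != self.owner:
            self.rmrs += 1
        self.value = value

N = 8
flag = DSMFlag(owner="signaler")
flag.write("signaler", True)    # local write in the owner's module: free
for p in range(N):
    flag.read(p)                # each waiter pays at least one RMR

print(flag.rmrs)                # → 8, i.e., Ω(N) per signal
```

The sketch shows only the easy direction (the flag's placement forces one RMR per waiter per signal); the paper's lower bound is stronger, using an adversarial schedule to rule out any cleverer placement or communication pattern.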
The separation is striking for several reasons. First, it is the first proof that CC can strictly dominate DSM in terms of average RMR cost, not just worst‑case. Second, the result does not depend on wait‑freedom; it holds for algorithms that allow busy‑waiting, thereby eliminating the usual “wait‑free penalty” that often favors CC. The authors argue that the structural difference—caches that can be shared across processors in CC versus fixed memory ownership in DSM—is the root cause of the gap. In CC, a single remote write can broadcast a signal to arbitrarily many processes; in DSM, broadcasting requires a remote access per recipient.
The paper also discusses broader implications. Since the lower bound shows that any DSM implementation of the CC algorithm would incur more than a constant‑factor overhead in total RMRs, an RMR‑preserving simulation of CC on DSM is impossible. This informs hardware architects: while DSM may offer higher raw bandwidth and simpler physical design, it cannot efficiently support certain broadcast‑style synchronization primitives. Conversely, software designers can exploit CC’s ability to co‑locate frequently accessed data with the processes that use it, achieving near‑constant RMR costs for many coordination patterns.
Related work is surveyed comprehensively. Prior separations (e.g., Hadzilacos and Danek’s GME result) required specific progress constraints and yielded a Θ(N/log N) gap in worst‑case RMRs. The present paper improves upon these by removing progress constraints and focusing on amortized cost, thereby strengthening the evidence that the two models are fundamentally incomparable for some problems. The authors also note that their lower‑bound technique—information‑propagation arguments combined with adversarial scheduling—could be adapted to other synchronization tasks such as barriers or rendezvous, opening avenues for future research.
In conclusion, the authors demonstrate that the cache‑coherent model can solve the signaling problem with constant amortized RMRs, whereas any DSM solution necessarily incurs linear amortized RMRs. This separation holds irrespective of wait‑freedom, establishing a robust distinction between the two memory models and highlighting the intrinsic advantage of cache sharing for broadcast‑type synchronization. The work deepens our understanding of the cost model underlying modern multiprocessor systems and sets the stage for further exploration of model‑specific algorithmic limits.