Design Principles for Scaling Multi-core OLTP Under High Contention


Although significant recent progress has been made in improving the multi-core scalability of high throughput transactional database systems, modern systems still fail to achieve scalable throughput for workloads involving frequent access to highly contended data. Most of this inability to achieve high throughput is explained by the fundamental constraints involved in guaranteeing ACID — the addition of cores results in more concurrent transactions accessing the same contended data for which access must be serialized in order to guarantee isolation. Thus, linear scalability for contended workloads is impossible. However, there exist flaws in many modern architectures that exacerbate their poor scalability, and result in throughput that is much worse than fundamentally required by the workload. In this paper we identify two prevalent design principles that limit the multi-core scalability of many (but not all) transactional database systems on contended workloads: the multi-purpose nature of execution threads in these systems, and the lack of advanced planning of data access. We demonstrate the deleterious results of these design principles by implementing a prototype system, ORTHRUS, that is motivated by the principles of separation of database component functionality and advanced planning of transactions. We find that these two principles alone result in significantly improved scalability on high-contention workloads, and an order of magnitude increase in throughput for a non-trivial subset of these contended workloads.


💡 Research Summary

The paper addresses a fundamental performance problem in modern multi‑core OLTP systems: workloads that repeatedly access a small set of hot records (high contention) do not scale with the number of cores. While the theoretical limit is set by the need to serialize conflicting operations, existing database engines suffer from additional, avoidable overheads that make the scalability gap far larger than the theoretical bound. The authors identify two pervasive design flaws in most commercial and research DBMSs. First, they point out the “conflated functionality” of execution threads: a single thread is responsible for both running the transaction’s business logic and interacting with the concurrency‑control subsystem (lock manager, OCC validator, etc.). This design creates three sources of overhead: (1) synchronization on shared lock‑metadata structures, which under contention leads to heavy use of atomic instructions; (2) data‑movement overhead as lock metadata constantly migrates between cores, inflating cache‑coherency traffic; and (3) cache pollution, because instruction and data caches must hold both transaction code and concurrency‑control data, increasing each transaction’s latency.

Second, they highlight the “dynamic concurrency control” approach used by most pessimistic‑locking systems: locks are requested on the fly, in whatever order the transaction happens to access records. Because different transactions may acquire the same locks in conflicting orders, deadlocks become possible, forcing the system to run deadlock‑detection and resolution logic. Under high contention, deadlock handling adds substantial overhead: longer lock hold times, wasted work due to aborts, and extra synchronization.
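The deadlock risk of on‑the‑fly acquisition can be made concrete with a small, deterministic sketch (not from the paper; the names and the barrier/timeout mechanics are purely illustrative). Two transactions take the same two locks in opposite orders; a barrier forces the problematic interleaving, and `acquire(timeout=...)` stands in for the deadlock detector a real system would need:

```python
import threading

# Two transactions acquire locks a and b in opposite orders (the
# "dynamic" acquisition order the paper criticizes). A barrier forces
# each to hold its first lock before requesting the second, so the
# cyclic wait always materializes; the timeout plays the role of
# deadlock detection, and a timed-out transaction must abort.
lock_a, lock_b = threading.Lock(), threading.Lock()
both_hold_first = threading.Barrier(2)
timed_out = []

def txn(first, second, tag):
    first.acquire()                    # lock taken in this txn's own order
    both_hold_first.wait()             # ensure the crossing interleaving
    if not second.acquire(timeout=0.2):
        timed_out.append(tag)          # would-be deadlock: abort
    else:
        second.release()
    first.release()

t1 = threading.Thread(target=txn, args=(lock_a, lock_b, "t1"))
t2 = threading.Thread(target=txn, args=(lock_b, lock_a, "t2"))
t1.start(); t2.start(); t1.join(); t2.join()
```

After the join, `timed_out` is non-empty: the arbitrary acquisition order forced at least one abort, which is exactly the wasted work and extended lock‑hold time the paper attributes to dynamic concurrency control.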

To overcome these problems, the authors propose two design principles. The first is partitioned functionality: dedicate a set of cores exclusively to concurrency control and another set to transaction execution. Communication between the two groups is performed via explicit message passing rather than shared data structures. By fixing each piece of lock metadata to a particular concurrency‑control core, the system eliminates contention on that metadata, reduces cache‑line bouncing, and improves locality. The second principle is advanced data‑access planning: before a transaction begins, the system analyses its read/write set, predicts the access pattern, and determines a global lock acquisition order. Because the order is fixed, deadlocks cannot arise, allowing the system to drop deadlock‑detection mechanisms entirely. Moreover, knowing the access pattern in advance lets the lock manager pre‑place metadata in the appropriate core’s cache, further reducing data movement.
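The second principle — a fixed global lock‑acquisition order derived from the declared read/write set — can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: here the global order is simply sorted key order, and the class name is invented for the example.

```python
import threading

class OrderedLockManager:
    """Grants a transaction's entire declared access set in one globally
    fixed order (sorted key order). Because every transaction acquires
    locks in the same order, no cyclic wait can form, so deadlock
    detection is unnecessary."""

    def __init__(self):
        self._locks = {}                      # key -> threading.Lock
        self._table_guard = threading.Lock()  # protects the lock table itself

    def _lock_for(self, key):
        with self._table_guard:
            return self._locks.setdefault(key, threading.Lock())

    def acquire_all(self, keys):
        # Sorting the pre-declared access set imposes the global order.
        ordered = sorted(set(keys))
        for k in ordered:
            self._lock_for(k).acquire()
        return ordered

    def release_all(self, ordered_keys):
        for k in reversed(ordered_keys):
            self._locks[k].release()
```

Even if two transactions name their keys in opposite orders (`["b", "a"]` vs. `["a", "b"]`), both acquire `a` before `b`, so the crossed-wait scenario above cannot occur.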

The authors implement a prototype called ORTHRUS that embodies both principles. ORTHRUS consists of execution threads that only run transaction code and generate lock‑request messages, and concurrency‑control threads that own lock tables, grant locks in the pre‑computed order, and send execution‑grant messages back. The system uses lock‑free queues for inter‑thread communication and assumes the entire database fits in main memory. It implements a pessimistic two‑phase locking protocol but with a deadlock‑free acquisition order derived from the pre‑analysis step.
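The execution/concurrency‑control split can be sketched with message passing over queues. This is a simplified, assumed shape of the architecture (a single CC thread, exclusive locks only, and Python's `queue.Queue` standing in for the lock‑free queues the paper describes); none of the names below come from ORTHRUS itself.

```python
import queue
import threading

class CCThread(threading.Thread):
    """Dedicated concurrency-control thread. It alone owns its lock
    table, so no atomic instructions or cache-line bouncing on lock
    metadata: execution threads only send it messages."""

    def __init__(self):
        super().__init__(daemon=True)
        self.inbox = queue.Queue()   # lock-request / release messages
        self._owner = {}             # key -> txn id currently holding it
        self._waiters = {}           # key -> FIFO of (txn, reply queue)

    def run(self):
        while True:
            op, key, txn, reply = self.inbox.get()
            if op == "acquire":
                if key not in self._owner:
                    self._owner[key] = txn
                    reply.put(("granted", key))
                else:
                    self._waiters.setdefault(key, []).append((txn, reply))
            elif op == "release":
                waiters = self._waiters.get(key, [])
                if waiters:
                    nxt, nxt_reply = waiters.pop(0)  # hand off FIFO
                    self._owner[key] = nxt
                    nxt_reply.put(("granted", key))
                else:
                    del self._owner[key]
            elif op == "stop":
                return

def run_txn(cc, txn_id, keys, log):
    """Execution thread: runs only transaction logic, requesting locks
    from the CC thread in the pre-planned (sorted) deadlock-free order."""
    reply = queue.Queue()
    for k in sorted(keys):
        cc.inbox.put(("acquire", k, txn_id, reply))
        reply.get()                  # block until the CC thread grants it
    log.append(txn_id)               # "execute" the transaction body
    for k in sorted(keys):
        cc.inbox.put(("release", k, txn_id, None))
```

In the real system the two thread groups are pinned to disjoint core sets, so each CC core keeps its slice of lock metadata cache-resident while execution cores keep only transaction code and data in their caches.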

Experimental evaluation is performed on a 64‑core Intel Xeon platform using several high‑contention benchmarks: a read‑only TPC‑C variant, a YCSB workload with a high proportion of updates, and synthetic workloads that concentrate accesses on a few hot records. ORTHRUS is compared against a conventional 2PL implementation and against partition‑based engines such as H‑Store and HyPer. The results show that ORTHRUS scales almost linearly up to 40–50 cores, achieving up to ten times higher throughput than the baseline 2PL system. Even for read‑only transactions, where no logical conflicts exist, ORTHRUS outperforms 2PL by a factor of 8–10 because the synchronization and cache‑coherency bottlenecks are eliminated. Deadlock‑related aborts drop to near zero, lock‑hold times shrink by roughly 30%, and overall cache‑miss rates and memory‑bandwidth consumption are significantly lower.

The paper situates its contributions relative to prior work that mainly focuses on reducing lock granularity, using optimistic concurrency control, or exploiting static data partitioning. While partitioning can avoid cross‑partition contention, it suffers when transactions span multiple partitions. ORTHRUS, by contrast, does not rely on data partitioning; instead, it restructures the internal execution model to remove the identified bottlenecks.

In conclusion, the authors demonstrate that high‑contention OLTP workloads can achieve near‑linear multi‑core scalability by (1) separating transaction execution from concurrency control into dedicated core groups and (2) planning data accesses ahead of time to enforce a deadlock‑free lock order. The prototype validates the effectiveness of these principles, delivering an order‑of‑magnitude throughput improvement on contended workloads. Suggested future work includes extending the approach to distributed settings, handling multi‑partition transactions more efficiently, and automating the access‑pattern analysis for arbitrary SQL queries.

