A High Performance Memory Database for Web Application Caches
This paper presents the architecture and characteristics of a memory database intended to be used as a cache engine for web applications. The primary goals of this database are speed and efficiency while running on SMP systems with several CPU cores (four or more). A secondary goal is support for simple metadata structures associated with the cached data, which can aid in efficient use of the cache. To meet these goals, some data structures and algorithms normally associated with this field of computing had to be adapted to the new environment.
Research Summary
The paper presents the design and implementation of a high‑performance in‑memory database intended to serve as a cache daemon for web applications. Recognizing that modern web stacks written in scripting languages such as PHP, Python, and Ruby often rely on external caching to avoid repeated expensive computations, the authors set out to build a system that maximizes throughput on symmetric multi‑processing (SMP) servers with four or more CPU cores while also offering simple metadata support.
Specification. The core data model is a key‑value store where both keys and values are opaque binary strings. To overcome the limitations of pure key‑value stores, the system adds “tags” – integer‑typed metadata that can be grouped by type and value. Tags enable applications to fetch or expire whole groups of keys in a single operation, reducing round‑trip overhead and simplifying application logic. The daemon is required to run on any POSIX‑compatible platform.
Implementation Overview. The daemon consists of three modules:
- Network Interface – a non‑blocking BSD‑socket layer that listens on both a Unix domain socket and a TCP socket. Incoming data are parsed into complete requests and placed on a thread‑safe job queue.
- Worker Thread Pool – a configurable pool of POSIX threads that dequeue jobs, parse the binary protocol, and execute the requested operation. The number of workers is set via command‑line arguments. An optional “thread‑less” mode bypasses the pool: the network thread calls the protocol parser directly, eliminating synchronization overhead.
- Data Storage – the heart of the system. Two large structures are used:
  - A fixed‑size hash table whose buckets are roots of red‑black trees holding the actual key‑value pairs. Each bucket is protected by a reader‑writer lock (`pthread_rwlock`). This design yields fine‑grained locking: readers acquire shared locks and never block each other, while writers obtain an exclusive lock only on the bucket they modify. With a well‑distributed hash function and many more buckets than worker threads (e.g., 256 buckets vs. 4 threads), contention is minimal.
  - A second hierarchy for metadata tags. The first level is a red‑black tree keyed by tag type; each type node contains another red‑black tree keyed by tag value, with pointers back to the associated key‑value records. Each tree level has its own rwlock, allowing concurrent tag queries and updates without global serialization.
  - The garbage collector is deliberately simple: it runs while holding the exclusive lock of a bucket, freeing entries as needed. Because the hash table itself is never globally locked, memory limits are enforced per bucket rather than globally.
Rationale and Design Choices. Traditional cache structures such as LRU lists or splay trees require a global queue or frequent restructuring, which forces exclusive locks on every access and severely limits scalability on multi‑core hardware. By contrast, the bucket‑level rwlock scheme isolates contention and enables high read‑throughput, which matches the read‑most nature of typical web caches. The authors also discuss the trade‑off of abandoning a global eviction policy in favor of per‑bucket reclamation.
Simulation Study. To validate the concurrency model, the authors built a discrete‑event simulator (GPSS) that mimics a configurable number of worker threads, hash buckets, and lock acquisition/release according to pthread rwlock semantics (with writer priority). The simulator generates a saturated workload with varying read/write mixes (90/10, 80/20, 50/50). Results show that when the bucket‑to‑thread ratio is high, over 90 % of lock acquisitions are “fast” (either uncontested or with negligible wait). As the write proportion rises, exclusive lock contention grows, reducing the fast‑acquisition percentage. These trends guided the decision to set the default bucket count to 256, which comfortably exceeds the number of cores on contemporary servers.
Experimental Results. Preliminary measurements were performed on a single‑CPU Pentium M (1.5 GHz) running FreeBSD 7.0, with client and daemon communicating via a Unix domain socket. Two configurations were compared: (a) thread‑less mode and (b) a single worker thread. The thread‑less mode achieved roughly a 25 % performance advantage, confirming that eliminating the job‑queue synchronization yields measurable gains on single‑core systems. The authors note that on multi‑core machines the worker pool can be scaled to exploit additional CPUs, and the simulation predicts near‑linear scaling as long as the bucket count remains sufficiently larger than the thread count.
Conclusions and Future Work. The presented memory cache daemon demonstrates that a combination of a bucketed hash table, per‑bucket red‑black trees, and fine‑grained reader‑writer locks can deliver high concurrency on SMP servers while supporting grouped operations via metadata tags. The design avoids the global bottlenecks of LRU or splay‑based caches and provides a portable POSIX implementation. Future directions include integrating an LRU‑like eviction policy, refining per‑bucket memory accounting, and extending the architecture to a distributed cache across multiple nodes.