DualMap: Enabling Both Cache Affinity and Load Balancing for Distributed LLM Serving


In LLM serving, reusing the KV cache of prompts across requests is critical for reducing TTFT and serving costs. Cache-affinity scheduling, which co-locates requests with the same prompt prefix to maximize KV cache reuse, often conflicts with load-balancing scheduling that distributes requests evenly across compute instances. Existing schedulers fail to reconcile this trade-off as they operate within a single mapping space, typically applying cache-affinity routing to a subset of requests and load-balanced routing to the rest, without a unified solution to achieve both goals. To address this limitation, we propose DualMap, a dual-mapping scheduling strategy for distributed LLM serving that achieves both cache affinity and load balancing. Its key idea is to map each request to two candidate instances via two independent hash functions based on the request prompt, then intelligently select the better candidate based on current system states. This design increases the likelihood that requests with shared prefixes are co-located, while evenly dispersing distinct prefixes across the cluster via "the power of two choices". To make DualMap robust under dynamic and skewed real-world workloads, we incorporate three techniques: 1) SLO-aware request routing, which prioritizes cache affinity but switches to load-aware scheduling when TTFT exceeds the SLO, enhancing load balance without sacrificing cache reuse; 2) hotspot-aware rebalancing, which dynamically migrates requests from overloaded to underloaded instances, mitigating hotspots and rebalancing the system; 3) lightweight dual-hash-ring scaling, which leverages a dual-hash-ring mapping to support fast and low-overhead instance scaling without costly global remapping. Experiments on real-world workloads show that DualMap improves effective request capacity by up to 2.25× under the same TTFT SLO constraints compared with SOTA work.


💡 Research Summary

Large language model (LLM) serving systems increasingly rely on KV‑cache reuse of prompt prefixes to cut the time‑to‑first‑token (TTFT) and lower operational costs. Existing scheduling approaches fall into two camps. Cache‑affinity schedulers map requests that share a prefix to the same compute node, thereby maximizing cache hits, but they can create severe load hotspots when a popular prefix dominates traffic. Load‑balancing schedulers, such as “least‑loaded”, spread requests evenly across nodes but scatter prefix‑sharing requests, dramatically reducing cache reuse and increasing TTFT. Because both strategies operate within a single mapping space, they cannot simultaneously guarantee high cache affinity and balanced load.

DualMap introduces a dual‑mapping paradigm that breaks this limitation. For each incoming request, two independent hash functions f₁ and f₂ are applied to the request’s prompt prefix, producing two candidate instances. The randomness of the two hashes ensures that distinct prefixes are distributed uniformly across the cluster, while the same prefix is always hashed to the same pair of candidates by both functions, so every request sharing that prefix deterministically sees the same two destinations and is very likely to be co‑located with its peers. This design simultaneously preserves cache locality (cache‑hit probability ≈ 1 − 2/m for m requests sharing a prefix) and leverages the classic “power‑of‑two‑choices” (PoTC) principle to achieve strong load balancing: the maximum load deviates from the average by only log log n + O(1) instead of Θ(log n / log log n) as in single‑choice allocation.
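The core two-choice idea, together with the SLO-aware switch described below, can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the hash functions, the linear TTFT model, and all names (`candidates`, `route`, `TTFT_PER_REQUEST_MS`) are our own assumptions.

```python
# Hypothetical sketch of DualMap-style two-choice routing (illustrative only):
# the prompt prefix is hashed by two independent functions to a fixed pair of
# candidate instances; cache affinity prefers the primary candidate, but if
# its predicted TTFT would violate the SLO, the less-loaded candidate wins.
import hashlib

NUM_INSTANCES = 8
load = [0] * NUM_INSTANCES          # toy proxy for per-instance queue length
TTFT_PER_REQUEST_MS = 50            # assumed linear latency model (not from the paper)
SLO_MS = 250

def candidates(prefix: str) -> tuple[int, int]:
    """Two independent hashes of the same prefix -> a fixed candidate pair."""
    h1 = int.from_bytes(hashlib.sha256(prefix.encode()).digest()[:8], "big")
    h2 = int.from_bytes(hashlib.blake2b(prefix.encode()).digest()[:8], "big")
    return h1 % NUM_INSTANCES, h2 % NUM_INSTANCES

def predicted_ttft(instance: int) -> float:
    return load[instance] * TTFT_PER_REQUEST_MS

def route(prefix: str) -> int:
    primary, secondary = candidates(prefix)
    # Cache-affinity first: stick with the primary while the SLO holds.
    if predicted_ttft(primary) <= SLO_MS:
        chosen = primary
    else:
        # SLO at risk: fall back to the less-loaded of the two candidates.
        chosen = min((primary, secondary), key=lambda i: load[i])
    load[chosen] += 1
    return chosen

# The same prefix always yields the same candidate pair, so prefix-sharing
# requests stay co-located until the primary instance nears the SLO.
```

Because the candidate pair is a pure function of the prefix, hot prefixes spill over only to their one fixed secondary, which is what lets the secondary retain useful KV-cache state.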

To turn this theoretical advantage into a practical system, DualMap adds three complementary mechanisms.

  1. SLO‑aware Request Routing – The scheduler first attempts cache‑affinity routing. If the predicted TTFT for the chosen candidate exceeds a pre‑defined service‑level objective (SLO), the algorithm switches to a load‑aware decision, selecting the less‑loaded candidate. This dynamic switch protects latency guarantees during traffic spikes while still favoring cache reuse whenever possible.

  2. Hotspot‑aware Rebalancing – When a particular prefix becomes extremely popular, the primary candidate can become overloaded. DualMap then migrates a subset of those requests to their secondary candidate (the other hash result). Migration decisions prioritize secondary instances that are under‑utilized and already hold a substantial portion of the relevant KV cache, thereby alleviating hotspots with minimal cache loss.

  3. Lightweight Dual‑hash‑Ring Scaling – Scaling the cluster (adding or removing instances) traditionally requires a full remapping of the hash space, which would invalidate many cached prefixes. DualMap instead organizes instances on two independent consistent‑hash rings. Because mapping depends only on relative positions on the rings, a scaling event affects only the neighboring segment of each ring, limiting remapping impact to O(1/n) of the total requests. This enables fast elasticity with negligible cache disruption.
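The scaling property in item 3 is the standard behavior of consistent hashing, which a minimal single-ring sketch can demonstrate (DualMap uses two such rings; the class and node names here are illustrative assumptions, not the paper's code):

```python
# Minimal consistent-hash ring: adding an instance remaps only the keys that
# fall in the new instance's segment, leaving the rest of the ring untouched.
import bisect
import hashlib

class HashRing:
    def __init__(self, instances):
        self._ring = sorted((self._h(i), i) for i in instances)
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _h(s: str) -> int:
        return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

    def lookup(self, key: str) -> str:
        """Map a key to the first instance clockwise from its hash position."""
        idx = bisect.bisect(self._keys, self._h(key)) % len(self._keys)
        return self._ring[idx][1]

    def add(self, instance: str):
        """Insert a new instance at its hash position on the ring."""
        h = self._h(instance)
        idx = bisect.bisect(self._keys, h)
        self._keys.insert(idx, h)
        self._ring.insert(idx, (h, instance))

ring = HashRing([f"node{i}" for i in range(8)])
before = {f"prefix{j}": ring.lookup(f"prefix{j}") for j in range(1000)}
ring.add("node8")                                  # scale out by one instance
after = {k: ring.lookup(k) for k in before}
moved = sum(before[k] != after[k] for k in before)
# Only keys landing in node8's new segment move (on average ~1/n of them);
# every moved key goes to node8, so no existing instance loses unrelated cache.
```

Running the dual-ring variant would apply the same insertion to both rings independently, so a scaling event disturbs only one segment per ring.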

The authors implemented DualMap on top of the vLLM inference engine and evaluated it with real‑world workloads: a conversational dataset and a tool‑agent dataset, both served on an 8‑node cluster running the Qwen2.5‑7B model. Baselines included Mooncake, Preble, Dynamo, and a pure “least‑loaded” scheduler. Results show that DualMap achieves a cache‑hit rate close to the cache‑affinity baseline (≈ 1.2× higher than least‑loaded) while maintaining a coefficient‑of‑variation (CV) for load that is comparable to the pure load‑balancing approach. Under the same TTFT SLO, DualMap’s effective request capacity—defined as the fraction of requests meeting the SLO—increases up to 2.25× relative to the best prior method. Hotspot‑aware rebalancing reduces overload on hot instances by more than 30 % in skewed scenarios, and the dual‑hash‑ring scaling experiment demonstrates less than 5 % cache‑hit degradation when nodes are added or removed.
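The two headline metrics above are straightforward to state precisely. The following toy computation (with made-up sample values, not the paper's data) shows how effective request capacity and the load CV are defined:

```python
# Illustrative definitions of the evaluation metrics discussed above,
# computed on toy data (values are invented for demonstration).
import statistics

def effective_capacity(ttfts_ms, slo_ms):
    """Fraction of requests whose TTFT meets the SLO."""
    return sum(t <= slo_ms for t in ttfts_ms) / len(ttfts_ms)

def load_cv(loads):
    """Coefficient of variation = stddev / mean; lower means better balance."""
    return statistics.pstdev(loads) / statistics.mean(loads)

ttfts = [120, 180, 95, 400, 210, 150, 90, 260]     # toy TTFT samples (ms)
print(effective_capacity(ttfts, slo_ms=250))        # -> 0.75
print(round(load_cv([10, 12, 9, 11, 10, 8, 13, 10]), 3))  # -> 0.144
```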

In summary, DualMap offers a simple yet powerful solution: by giving each request two possible destinations and intelligently picking the better one based on current load and SLO constraints, it reconciles the long‑standing trade‑off between cache affinity and load balancing in distributed LLM serving. The approach is theoretically grounded, incurs minimal overhead, and delivers substantial practical gains, making it a compelling candidate for production‑grade LLM inference platforms that must handle highly skewed, latency‑sensitive traffic at scale.

