Competitive Non-Clairvoyant KV-Cache Scheduling for LLM Inference


Large Language Model (LLM) inference presents a unique scheduling challenge due to the Key-Value (KV) cache, where a job's memory footprint grows linearly with the number of decoded tokens. This growth couples scheduling decisions with feasibility: a scheduler must minimize latency under a hard memory budget, yet the response lengths of requests are inherently unknown. While recent works have explored this problem either assuming clairvoyance – exact knowledge of response lengths – or relying on machine-learned predictions, obtaining robust performance guarantees without any prior knowledge of job sizes remains a theoretically fundamental and practically important open problem. In this work, we propose the Geometric Slicing Algorithm (GSA), the first non-clairvoyant policy to achieve a constant competitive ratio for this problem in the offline batch setting. GSA manages uncertainty through a geometric phase structure that periodically restarts jobs to bound memory exposure, combined with a staggered pipeline mechanism that enables high concurrency by smoothing aggregate memory consumption. We prove that GSA achieves a competitive ratio of at most 61.92 for general instances, improving to 32 in the large-memory regime. Our algorithmic framework also yields a clairvoyant counterpart, the Geometric Batching Algorithm (GBA), which achieves an approximation ratio of 10.67 for general instances and 6.75 in the large-memory regime – significantly improving upon the best previously known bound of over 9000. Numerical experiments on real request traces demonstrate that our algorithms perform robustly while preserving these worst-case guarantees.


💡 Research Summary

Large Language Model (LLM) inference is increasingly dominated by the cost of serving requests, and a key source of that cost is the KV‑cache that stores intermediate attention states. Each generated token adds a new key‑value pair, so a request’s memory footprint grows linearly with its (unknown) response length. This coupling of progress and memory creates a fundamentally new scheduling problem: a batch of jobs that fits within the GPU’s memory at the start may later overflow as tokens are generated, forcing costly preemptions (kill‑and‑restart). Moreover, the response length of each request is not known a priori; it only becomes apparent during execution, placing the problem in a non‑clairvoyant setting.

The paper adopts the offline batch model: all requests arrive at time zero, a single GPU with a hard KV-cache budget M is available, and the objective is to minimize total flow time (which, with all arrivals at time zero, equals the sum of completion times). The authors evaluate non-clairvoyant algorithms by their competitive ratio – the worst-case ratio of an algorithm's flow time to that of an optimal clairvoyant scheduler that knows all response lengths in advance. For the clairvoyant counterpart of their algorithm, they measure approximation ratios instead.
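The objective and quality measure above can be made concrete with a small sketch. The helper names and the completion times below are illustrative, not taken from the paper:

```python
# Hedged sketch of the objective in the offline batch setting: every
# request arrives at time zero, so each job's flow time equals its
# completion time, and the scheduler minimizes their sum.

def total_flow_time(completion_times):
    """Sum of completion times over all jobs (all released at t = 0)."""
    return sum(completion_times)

def competitive_ratio(alg_flow_time, opt_flow_time):
    """ALG / OPT: how much worse the algorithm's schedule is than the
    optimal clairvoyant schedule in the worst case (always >= 1)."""
    return alg_flow_time / opt_flow_time

# Example: three jobs finish at times 3, 5, 9 under the algorithm,
# and at 2, 4, 8 under the optimal clairvoyant schedule.
alg = total_flow_time([3, 5, 9])   # 17
opt = total_flow_time([2, 4, 8])   # 14
print(competitive_ratio(alg, opt))
```

A guarantee such as GSA's "at most 61.92" means this ratio stays below that constant for every instance, not just the average case.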

Key contributions

  1. Geometric Slicing Algorithm (GSA) – a non‑clairvoyant, polynomial‑time scheduler that achieves a constant competitive ratio of at most 61.92 for arbitrary instances and improves to 32 when the memory budget is asymptotically large (the “large‑memory regime”). GSA works without any prior knowledge of request sizes.

  2. Geometric Batching Algorithm (GBA) – the clairvoyant analogue of GSA. It attains an approximation ratio of at most 10.67 for general instances and 6.75 in the large‑memory regime, dramatically improving on the previous best bounds of 9216 and 48 respectively. When all jobs have identical response lengths, the ratio drops to 2 and approaches 1 as memory becomes unbounded.

  3. Algorithmic ingredients – two novel design principles:

    • Geometric slicing groups jobs into exponentially spaced classes based on estimated response lengths. Jobs in the same class are processed together in a “phase”. At the end of each phase, all active jobs are killed and restarted, limiting the amount of memory that can be exposed during a phase.
    • Staggered pipeline scheduling (SPS) spreads the start times of jobs within a phase so that their memory peaks occur at different rounds. This smooths the aggregate memory consumption, allowing higher concurrency under the same memory budget compared with naïve simultaneous batching.
  4. Analytical framework – the authors introduce a memory‑time area perspective. By interpreting the product of memory usage and execution time as a two‑dimensional area, they derive a clean lower bound on the optimal clairvoyant schedule and show that GSA’s area is bounded by a constant factor of this optimum. This technique overcomes the difficulty that the optimal schedule has no simple closed form.

  5. Empirical validation – using synthetic workloads and real request traces from the LMSYS‑Chat‑1M dataset, the authors demonstrate that GSA and GBA not only respect the worst‑case guarantees but also outperform state‑of‑the‑art baselines (e.g., the O(log M) algorithm of Chen et al. 2025) in average latency and memory efficiency. Heuristic variants (GSA‑SPEC, GBA‑D) retain the theoretical guarantees while simplifying implementation.
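The two algorithmic ingredients from contribution 3 can be sketched in a few lines. The base-2 class spacing and the evenly spaced start offsets are assumptions for illustration; the paper's exact constants may differ:

```python
import math

def length_class(estimated_length, base=2):
    """Geometric slicing: map an estimated response length to the class k
    with base**k <= length < base**(k+1), so classes are exponentially
    spaced. Base 2 is an illustrative choice."""
    return int(math.log(estimated_length, base))

def staggered_starts(num_jobs, phase_length):
    """Staggered pipeline scheduling: spread job start offsets evenly
    across a phase so that per-job memory peaks (which occur late in a
    job's slice, once many tokens are cached) fall in different rounds.
    Even spacing is an illustrative choice."""
    return [round(i * phase_length / num_jobs) for i in range(num_jobs)]

# Jobs with estimated lengths 3, 7, 20 fall into classes 1, 2, 4.
print([length_class(n) for n in (3, 7, 20)])
# Four jobs in a phase of 8 rounds start at offsets 0, 2, 4, 6.
print(staggered_starts(4, 8))
```

Killing and restarting all active jobs at a phase boundary then caps how much KV memory any single phase can accumulate, which is what makes the memory exposure of each phase analyzable.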

Technical details

  • Problem model: Each request consists of a fixed‑size prompt (already cached) followed by an unknown number of output tokens. Decoding one token takes one time step and consumes a fixed amount of KV memory (denoted κ). The total memory used at time t is the sum, over all active jobs, of the job's prompt memory plus κ times the number of tokens it has already generated.
  • Geometric slicing: Jobs are placed into classes C₀, C₁, … where class C_k contains jobs whose true response length lies in an exponentially growing interval (e.g., [2ᵏ, 2ᵏ⁺¹) for a doubling base), so that consecutive classes differ by a constant factor.
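The memory accounting in the problem model amounts to a simple sum over active jobs. Variable names and the unit values below are illustrative, assuming κ is measured in the same cache units as the prompt memory and the budget M:

```python
def kv_memory_in_use(generated_tokens, prompt_memory, kappa):
    """Total KV-cache usage at some time t: for each active job, its
    (already cached) prompt memory plus kappa units per token it has
    generated so far, summed over all active jobs."""
    return sum(prompt_memory[j] + kappa * generated_tokens[j]
               for j in range(len(generated_tokens)))

# Two active jobs: 5 and 12 tokens decoded so far, prompts cached at
# 10 units each, kappa = 1 unit per decoded token.
usage = kv_memory_in_use([5, 12], [10, 10], kappa=1)
M = 64  # hard KV-cache budget of the GPU
print(usage, usage <= M)
```

Because `generated_tokens` grows every decode step, a batch that satisfies `usage <= M` at the start can violate the budget later, which is exactly the feasibility coupling the paper's scheduler must manage.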
