LLM Serving Optimization with Variable Prefill and Decode Lengths


We study offline scheduling for large language model (LLM) serving under a fixed KV-cache memory budget, where requests have heterogeneous prompt (prefill) and response (decode) lengths. Prompt tokens determine initial KV usage, and each generated token increases memory by one unit. Given a backlog of n requests arriving together, we schedule mixed prefill and decode batches to minimize total end-to-end latency. We show that heterogeneity in prompt lengths makes the problem computationally intractable and that widely used heuristics such as first-come-first-served and shortest-first can be arbitrarily suboptimal. We propose Sorted-F, which repeatedly forms feasible batches using a new selection metric that balances batch size against downstream decode cost, and prove it achieves a constant-factor guarantee on total latency. We further develop practical variants – an exact solver for small instances and fast heuristics for larger ones – and evaluate them on a public workload spanning short conversations and long-document summarization, where they consistently reduce average latency relative to standard baselines. Our results highlight that during peak-hour tidal backlogs, greedy GPU packing or short-request prioritization can perform poorly when prompt lengths vary widely, and provide a principled, tunable framework for designing production batch schedulers and planning capacity in memory-constrained LLM serving systems.


💡 Research Summary

The paper tackles the offline scheduling problem that arises when serving large language models (LLMs) under a fixed KV‑cache memory budget. Each incoming request is characterized by a pre‑fill length sᵢ (the number of input tokens) and a decode length oᵢ (the number of output tokens to be generated). The KV‑cache initially consumes sᵢ slots for request i, and every newly generated token adds one more slot, so the memory footprint of request i grows linearly as sᵢ + j during the j‑th decode step. The system processes batches of tokens in discrete time; a batch may contain a mixture of pre‑fill operations and decode tokens from many requests, but at most one token per request can appear in a batch. The batch must respect the memory constraint Σ_{i∈batch}(sᵢ + aᵢ) ≤ M, where aᵢ is the current decode index for request i.
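As an illustrative sketch (not code from the paper), the per-batch memory constraint described above can be expressed as a simple feasibility check, where each request contributes its prompt length sᵢ plus its current decode index aᵢ:

```python
def batch_fits(batch, M):
    """Check that a mixed prefill/decode batch respects the KV-cache budget M.

    batch: list of (s_i, a_i) pairs, where s_i is the prompt length and
    a_i is the current decode index (a_i = 0 for a request at its prefill
    step). Request i occupies s_i + a_i cache slots, so the batch is
    feasible iff the total occupancy fits within M.
    """
    return sum(s + a for s, a in batch) <= M
```

For example, a batch holding one fresh long-prompt request `(300, 0)` and one short request twenty tokens into decoding `(15, 20)` occupies 335 slots, so it fits a budget of 400 but not one of 330.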

The authors first prove that once the uniform‑input‑size assumption is dropped, the problem becomes NP‑hard. They also construct adversarial instances showing that naïve heuristics such as first‑come‑first‑served (FCFS) or shortest‑first (prioritizing smallest oᵢ) can have unbounded competitive ratios: a single request with a huge pre‑fill can monopolize the cache and force all other requests to wait, inflating total latency arbitrarily.

To obtain provable performance guarantees, the paper introduces a new algorithm called Sorted‑F. The core idea is a composite quality metric Qᵢ that captures both pre‑fill and decode costs (e.g., Qᵢ = α·sᵢ + β·oᵢ or a product sᵢ·oᵢ). Sorted‑F proceeds in three steps:

  1. Compute Qᵢ for every request and sort requests in descending order of Qᵢ.
  2. Scan the sorted list and greedily form the largest feasible batch that fits within the memory budget M, allowing a mixture of pre‑fill tokens and decode tokens (PD‑mixing).
  3. Within each batch, schedule decode tokens in shortest‑output‑first order to reduce downstream memory pressure.
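The three steps above can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes the additive metric Qᵢ = α·sᵢ + β·oᵢ, packs requests by prompt length at admission time, and abstracts away the finer-grained PD-mixing of individual decode steps.

```python
def sorted_f_schedule(requests, M, alpha=1.0, beta=1.0):
    """Sketch of Sorted-F batch formation.

    requests: list of (s_i, o_i) pairs (prompt length, decode length).
    Returns a list of batches, each a list of request indices.
    """
    # Step 1: sort requests in descending order of Q_i = alpha*s_i + beta*o_i.
    order = sorted(range(len(requests)),
                   key=lambda i: alpha * requests[i][0] + beta * requests[i][1],
                   reverse=True)
    batches, remaining = [], order
    while remaining:
        batch, used, leftover = [], 0, []
        # Step 2: greedily form the largest feasible batch within budget M.
        for i in remaining:
            s, _ = requests[i]
            if used + s <= M:          # prefill footprint at admission time
                batch.append(i)
                used += s
            else:
                leftover.append(i)
        if not batch:
            # A single request's prompt exceeds M and can never be scheduled.
            raise ValueError("request larger than cache budget M")
        # Step 3: within the batch, decode shortest outputs first to relieve
        # downstream memory pressure.
        batch.sort(key=lambda i: requests[i][1])
        batches.append(batch)
        remaining = leftover
    return batches
```

On a toy backlog of one long summarization request and two short conversational ones, e.g. `sorted_f_schedule([(300, 150), (15, 20), (15, 20)], M=320)`, the first batch admits the long request plus one short request (ordered shortest-output-first) and defers the other to a second batch.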

The authors prove that this procedure yields a constant‑factor approximation for total end‑to‑end latency, with a competitive ratio bounded by 48 regardless of problem size. The analysis balances batch size (which reduces the number of batches) against the downstream cost of long decodes (which increase memory usage in later steps).

Because exact optimization is infeasible for realistic workloads, the authors develop several practical variants:

  • Exact DP – a dynamic‑programming algorithm that enumerates all feasible batch partitions; it is polynomial in M and n but only practical for small n (≤ 50).
  • Local Swap Search – starts from a feasible schedule (e.g., Sorted‑F) and repeatedly swaps two requests or moves a request to a different batch if it reduces total latency; this yields modest improvements with low overhead.
  • Quantile‑Greedy – selects the top‑q percentile of requests by Qᵢ to form batches first, then fills remaining capacity with lower‑Qᵢ requests; this scales to thousands of requests with negligible runtime.
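Of these variants, Local Swap Search is the most generic, since it only needs a way to score a candidate schedule. A minimal sketch is shown below; the `latency` callback is an assumption of this sketch (the paper does not prescribe an interface) and is taken to return the total latency of a schedule, with infeasible schedules scored as infinity.

```python
import copy
import itertools

def local_swap_search(schedule, latency, max_rounds=100):
    """Sketch of Local Swap Search.

    schedule: a feasible starting schedule (e.g. from Sorted-F), given as a
    list of batches, each a list of request ids.
    latency: assumed callback mapping a schedule to its total latency.
    Repeatedly swaps a pair of requests across two batches whenever the
    swap lowers total latency, until no improving swap is found.
    """
    best = copy.deepcopy(schedule)
    best_cost = latency(best)
    for _ in range(max_rounds):
        improved = False
        # Try every cross-batch pair swap; accept any strict improvement.
        for bi, bj in itertools.combinations(range(len(best)), 2):
            for x in range(len(best[bi])):
                for y in range(len(best[bj])):
                    cand = copy.deepcopy(best)
                    cand[bi][x], cand[bj][y] = cand[bj][y], cand[bi][x]
                    cost = latency(cand)
                    if cost < best_cost:
                        best, best_cost, improved = cand, cost, True
        if not improved:
            break
    return best
```

For instance, with a toy cost that penalizes imbalance between two batches, starting from `[[1, 2], [3, 4]]` the search swaps elements until the batch sums are balanced.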

The paper also formulates the problem as an integer program (IP) to establish a true optimum, but notes that solving the IP in real time is prohibitive. By relaxing the IP to a linear program (LP) and extracting expected start times from the fractional solution, the authors devise Sorted‑LP, a heuristic that sorts requests by these expected start times and then batches them greedily. Sorted‑LP achieves performance close to Sorted‑F while being even simpler to implement.

Empirical evaluation uses a public mixed workload that combines short conversational prompts (average pre‑fill ≈ 15 tokens, decode ≈ 20 tokens) with long document‑summarization tasks (average pre‑fill ≈ 300 tokens, decode ≈ 150 tokens). Experiments vary the backlog size n from 200 to 2000 and set the KV‑cache capacity M to about 1.5 × the average request size. Four schedulers are compared: FCFS, Shortest‑First, Sorted‑LP, and Sorted‑F. Results show that Sorted‑F consistently reduces average latency by 30 %–45 % relative to FCFS and by 25 %–40 % relative to Shortest‑First. The gains are especially pronounced when many large‑pre‑fill requests are present, because Sorted‑F prevents them from blocking the cache. Sorted‑LP performs almost as well as Sorted‑F but with roughly half the scheduling overhead. The exact DP achieves optimal latency for small n but becomes impractical beyond a few hundred requests.

Key practical takeaways include:

  • Accurate profiling of the distribution of pre‑fill lengths is essential; the quality metric Qᵢ should be tuned to the observed workload.
  • The scaling assumption L(M) = o(M) (the maximum per‑request memory usage grows slower than the total cache) is realistic for modern GPUs and guarantees that concurrency can increase with hardware upgrades.
  • During peak hours, policies that prioritize only short decodes can backfire when long pre‑fills dominate; a balanced metric that accounts for both phases yields robust performance.

In summary, the paper makes three major contributions: (1) it proves that heterogeneous pre‑fill lengths render LLM serving scheduling NP‑hard and that common heuristics lack any bounded guarantee; (2) it introduces Sorted‑F, a constant‑factor approximation algorithm with a provable competitive ratio, and several scalable variants; (3) it validates the approach on realistic workloads, demonstrating substantial latency reductions and better resource utilization. The work provides a solid theoretical foundation and a practical toolkit for engineers building production‑grade LLM serving stacks under memory constraints.

