EWSJF: An Adaptive Scheduler with Hybrid Partitioning for Mixed-Workload LLM Inference

Notice: This research summary and analysis were generated automatically using AI technology. For absolute accuracy, please refer to the original arXiv source.

Serving Large Language Models (LLMs) under mixed workloads (short, latency-sensitive interactive queries alongside long, throughput-oriented batch requests) poses a fundamental scheduling challenge. Standard First-Come, First-Served (FCFS) policies suffer from severe head-of-line blocking, leading to high tail latency and underutilized hardware. We introduce EWSJF (Effective Workload-based Shortest Job First), an adaptive request-level scheduler that learns workload structure in real time to jointly improve fairness and throughput. EWSJF operates upstream of execution-level schedulers and integrates four components: (1) Refine-and-Prune, an unsupervised partitioning algorithm that discovers performance-homogeneous request groups; (2) Dynamic Queue Routing for assigning requests to these groups; (3) Density-Weighted Scoring, a context-aware prioritization function balancing urgency and fairness; and (4) Bayesian Meta-Optimization, which continuously tunes scoring and partitioning parameters based on live performance feedback. Implemented in vLLM, EWSJF improves end-to-end throughput by over 30% and reduces average Time-To-First-Token for short requests by up to 4x compared to FCFS. These results demonstrate that adaptive, learning-based request scheduling is a critical missing layer for efficient and responsive LLM serving. Implementation available at https://anonymous.4open.science/r/vllm_0110-32D8.


💡 Research Summary

Large Language Model (LLM) serving systems increasingly face mixed workloads in which short, latency‑sensitive interactive queries arrive together with long, throughput‑oriented batch jobs. Under such conditions, the naïve First‑Come‑First‑Served (FCFS) admission policy suffers from severe head‑of‑line (HoL) blocking: a single long request can stall all subsequent requests, inflating tail latency and leaving GPUs under‑utilized. Existing research either tackles token‑level scheduling (e.g., Orca, Sarathi) or applies static priority queues (e.g., G‑Fair), but none directly addresses the upstream request‑level admission problem in a dynamic, unstructured traffic environment.
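The head-of-line effect is easy to make concrete with a toy model (our illustration, not from the paper): if prefill cost scales with prompt length, a single long request admitted first dominates the time-to-first-token (TTFT) of every short request queued behind it, while a shortest-job-first order leaves the long job almost unaffected.

```python
# Toy illustration of head-of-line blocking: serial, non-preemptive prefill,
# with cost proportional to prompt length (numbers are illustrative only).

def ttft(order):
    """Return TTFT per request id for a serial, non-preemptive schedule."""
    clock, out = 0.0, {}
    for rid, cost in order:
        out[rid] = clock + cost   # first token appears once prefill finishes
        clock += cost
    return out

# One long batch request arrives just ahead of three short interactive ones.
requests = [("long", 8.0), ("s1", 0.2), ("s2", 0.2), ("s3", 0.2)]

fcfs = ttft(requests)
sjf = ttft(sorted(requests, key=lambda r: r[1]))

print(f"FCFS short-request TTFT: {fcfs['s1']:.1f}s")  # 8.2s, stuck behind the long job
print(f"SJF  short-request TTFT: {sjf['s1']:.1f}s")   # 0.2s
```

The long job finishes at 8.6s in both orders, but the short requests' TTFT drops from ~8.2-8.6s to 0.2-0.6s, which is the asymmetry EWSJF exploits.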

The paper introduces EWSJF (Effective Workload‑based Shortest Job First), an adaptive request‑level scheduler that sits upstream of the execution engine (implemented as a plug‑in for vLLM). EWSJF consists of four tightly coupled components:

  1. Refine‑and‑Prune – an unsupervised hybrid clustering algorithm that automatically discovers performance‑homogeneous request groups. Starting from a coarse K‑means partition with k = 3 (capturing short, medium, and long regimes), the algorithm recursively splits clusters at significant density gaps and prunes partitions that violate domain‑specific constraints such as minimum queue width. The result is a set of contiguous, non‑overlapping prompt‑length intervals (queues) where all requests share similar pre‑fill cost.

  2. Dynamic Queue Routing – an online dispatcher that maps each incoming request to the appropriate queue based on its prompt length. If a request falls into a gap between existing queues, an “On‑Demand Bubble Queue Creation” mechanism instantly inserts a new temporary queue, ensuring the system can react to sudden shifts in request size distribution without waiting for the strategic loop.

  3. Density‑Weighted Scoring – a context‑aware priority function that evaluates the oldest request in each non‑empty queue, balancing how long a request has waited (urgency) against cross‑queue fairness, so that short interactive queries are served promptly without starving long batch jobs.

  4. Bayesian Meta‑Optimization – a slower strategic loop that continuously re‑tunes the scoring and partitioning parameters based on live performance feedback, keeping the scheduler matched to the evolving workload.
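The Refine-and-Prune step described above can be sketched in miniature. This is our own reconstruction of the idea, not the paper's code: the function names, the gap-splitting threshold (`gap_factor`), and the merge-left pruning rule are illustrative assumptions.

```python
# Sketch of Refine-and-Prune: coarse 1-D k-means over prompt lengths,
# recursive-style splits at large density gaps, then pruning of intervals
# narrower than a minimum queue width. Reconstruction, not the paper's code.
from statistics import mean

def kmeans_1d(xs, k=3, iters=20):
    """Coarse partition: 1-D k-means (short / medium / long regimes)."""
    xs = sorted(xs)
    centers = [xs[int((i + 0.5) * len(xs) / k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda i: abs(x - centers[i]))].append(x)
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return [g for g in groups if g]

def split_at_gaps(group, gap_factor=4.0):
    """Refine: split a cluster wherever a neighbor gap is far above the mean gap."""
    gaps = [b - a for a, b in zip(group, group[1:])]
    if not gaps or max(gaps) == 0:
        return [group]
    thresh = gap_factor * mean(gaps)
    parts, cur = [], [group[0]]
    for prev, x in zip(group, group[1:]):
        if x - prev > thresh:
            parts.append(cur)
            cur = []
        cur.append(x)
    parts.append(cur)
    return parts

def refine_and_prune(lengths, min_width=8):
    """Pipeline: coarse k-means -> density-gap splits -> prune narrow intervals."""
    parts = [p for g in kmeans_1d(lengths) for p in split_at_gaps(g)]
    parts.sort(key=lambda p: p[0])
    queues = []
    for p in parts:
        # Prune: an interval narrower than the minimum queue width is merged
        # into its left neighbor (a stand-in for the paper's constraints).
        if queues and p[-1] - p[0] < min_width:
            queues[-1] += p
        else:
            queues.append(list(p))
    return [(q[0], q[-1]) for q in queues]

# Short, medium, and long prompt regimes, with a density gap inside "medium".
lengths = list(range(10, 60, 2)) + list(range(300, 400, 5)) + \
          list(range(460, 520, 5)) + [3000, 3200, 3400]
print(refine_and_prune(lengths))
# -> [(10, 58), (300, 395), (460, 515), (3000, 3400)]
```

Note how the coarse k-means places all medium prompts in one cluster, and the gap-split step then separates the 300-395 and 460-515 regimes, yielding contiguous, non-overlapping prompt-length intervals.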
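The routing step with On-Demand Bubble Queue Creation can likewise be sketched. Again this is our own reconstruction: the `Router`/`Queue` names and the gap-spanning policy for the bubble interval are illustrative assumptions, not the paper's implementation.

```python
# Sketch of Dynamic Queue Routing: each queue owns a contiguous prompt-length
# interval; a request landing in a gap between intervals triggers creation of
# a temporary "bubble" queue spanning that gap. Reconstruction, not vLLM code.
from dataclasses import dataclass, field

@dataclass
class Queue:
    lo: int
    hi: int                      # inclusive prompt-length interval [lo, hi]
    bubble: bool = False         # True if created on demand for a gap
    requests: list = field(default_factory=list)

class Router:
    def __init__(self, intervals):
        self.queues = [Queue(lo, hi) for lo, hi in sorted(intervals)]

    def route(self, request_id, prompt_len):
        for q in self.queues:
            if q.lo <= prompt_len <= q.hi:
                q.requests.append(request_id)
                return q
        # Gap: create a temporary bubble queue covering the uncovered range,
        # so the system reacts immediately without waiting for the strategic loop.
        prev_hi = max((q.hi for q in self.queues if q.hi < prompt_len), default=0)
        next_lo = min((q.lo for q in self.queues if q.lo > prompt_len),
                      default=prompt_len)
        q = Queue(prev_hi + 1, max(next_lo - 1, prompt_len), bubble=True)
        q.requests.append(request_id)
        self.queues.append(q)
        self.queues.sort(key=lambda qq: qq.lo)
        return q

router = Router([(1, 128), (512, 2048)])
q1 = router.route("a", 100)   # lands in the existing [1, 128] queue
q2 = router.route("b", 300)   # gap -> bubble queue [129, 511] is created
print(q2.lo, q2.hi, q2.bubble)  # 129 511 True
```

Subsequent requests in the same gap (e.g. prompt length 400) are routed to the existing bubble queue rather than creating another one, so the bubble persists until the next Refine-and-Prune pass can absorb it.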

