Hummingbird: SLO-Oriented GPU Preemption at Microsecond-scale
Existing GPU-sharing techniques, including spatial and temporal sharing, aim to improve utilization but struggle to ensure SLO adherence while maximizing efficiency, because closed-source GPUs offer no fine-grained task scheduling. This paper presents Hummingbird, an SLO-oriented GPU scheduling system that overcomes these challenges by enabling microsecond-scale preemption on closed-source GPUs while effectively harvesting idle GPU time slices. Comprehensive evaluations across diverse GPU architectures show that Hummingbird improves the SLO attainment of high-priority tasks by 9.7x over state-of-the-art spatial-sharing approaches and 3.5x over temporal-sharing approaches. Compared to exclusive execution, the SLO attainment of a high-priority task collocated with low-priority tasks on Hummingbird drops by less than 1%, while the throughput of the low-priority tasks exceeds state-of-the-art temporal-sharing approaches by 2.4x. Hummingbird thus ensures SLO adherence while substantially enhancing GPU utilization.
💡 Research Summary
The paper addresses the chronic under‑utilization of GPUs in modern AI services, where high‑priority inference or training jobs are often allocated dedicated GPUs to meet strict Service Level Objectives (SLOs), leaving the hardware idle for large portions of time. Existing sharing mechanisms fall into two categories: spatial sharing (e.g., CUDA multi‑streams, Multi‑Instance GPU) and temporal sharing (e.g., REEF). Spatial sharing suffers from uncontrolled interference across shared resources such as L2 cache, HBM bandwidth, and PCIe, making it impossible to guarantee latency for latency‑critical workloads. Temporal sharing improves SLO attainment by giving exclusive GPU access to high‑priority tasks, but it cannot preempt a running low‑priority kernel on NVIDIA hardware, leading to potentially millisecond‑scale delays that violate SLOs, especially when kernel execution times vary from a few microseconds to several milliseconds.
The authors observe two key facts that enable a new approach. First, while whole‑kernel execution times are highly heterogeneous, the execution time of individual thread blocks is consistently short—99.999 % of blocks across a wide range of models (CNNs, LLMs, MLLMs) complete within 400 µs on an A100. By controlling the number of blocks launched per kernel (a technique they call “split‑kernel”), the runtime can bound the execution time of each block, creating frequent preemption points at the microsecond scale. Second, real‑world serving traces exhibit abundant “bubbles” – idle GPU intervals caused by request variability, memory‑host synchronizations, inter‑GPU communication, or CPU stalls. These bubbles range from hundreds of microseconds to several milliseconds and represent up to 24 % of total GPU time in large models under tensor or expert parallelism.
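The split-kernel idea described above can be illustrated with a minimal Python sketch. This is an assumed model, not the paper's implementation: it partitions a kernel's grid into launch slices of whole "waves" (one block per SM per wave), capping the wave count so each slice stays within the 400 µs block-execution budget the authors report. The function name `plan_slices` and the example numbers are hypothetical.

```python
BUDGET_US = 400.0  # per-slice execution budget (the paper's 400 us block bound)

def plan_slices(total_blocks: int, est_block_time_us: float, sm_count: int) -> list[int]:
    """Partition a kernel's grid into launch slices (illustrative model).

    Each slice holds whole "waves" of blocks (one block per SM per wave), and
    the number of waves is capped so a slice's estimated execution time stays
    within BUDGET_US, which bounds the worst-case preemption delay.
    """
    waves = max(1, int(BUDGET_US // est_block_time_us))  # waves that fit the budget
    per_slice = waves * sm_count                         # blocks launched per slice
    slices = []
    remaining = total_blocks
    while remaining > 0:
        n = min(per_slice, remaining)
        slices.append(n)
        remaining -= n
    return slices

# Hypothetical example: a 10,000-block kernel whose blocks take ~50 us
# on a GPU with 108 SMs (A100-like) -> 12 slices of at most 864 blocks,
# each finishing within the 400 us budget.
slices = plan_slices(10_000, 50.0, 108)
```

Each boundary between consecutive slices is a point where a high-priority kernel can jump in, which is what turns a multi-millisecond monolithic launch into a sequence of microsecond-scale preemption opportunities.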
Building on these observations, the paper introduces Hummingbird, an SLO‑oriented GPU scheduling system that achieves microsecond‑scale preemption on closed‑source NVIDIA GPUs while harvesting idle bubbles to improve overall utilization. Hummingbird consists of three components:
- Kernel Splitter – a PTX‑level transformer that profiles low‑priority kernels, determines the optimal block count based on SM count and memory bandwidth, and generates split‑kernel logs. The splitter automatically balances two competing forces: smaller blocks reduce preemption latency but increase launch overhead and may under‑utilize the GPU; larger blocks improve throughput but increase the worst‑case preemption delay. By aligning block count with the GPU’s compute capacity, Hummingbird finds the sweet spot where each split‑kernel fills the SMs without exceeding the 400 µs block execution budget.
- Runtime Scheduler – consumes the splitter’s logs to launch split‑kernels, monitors the GPU work queue, and detects idle bubbles in real time. When a high‑priority request arrives, the scheduler issues a “kernel‑tick” preemption: it allows the currently running low‑priority block to finish its microsecond‑scale slice, then immediately schedules the high‑priority block. If a large bubble is detected, the scheduler may coalesce multiple split‑kernels to reduce launch overhead and improve throughput. This policy ensures that the preemption delay never exceeds the execution time of a single split‑kernel, effectively achieving µs‑scale preemption despite the lack of hardware support for proactive interruption.
- Memory Management Module – leverages NVLink to implement hierarchical memory offloading, enabling multiple GPUs to share data without excessive PCIe traffic. This is crucial for distributed training or inference scenarios where inter‑GPU synchronization can otherwise dominate the bubble budget.
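The Kernel Splitter's trade-off between launch overhead and preemption delay can be sketched with a toy cost model. This is not the paper's actual cost function: the 5 µs per-launch overhead and the search over whole waves are assumptions chosen for illustration, and `pick_blocks_per_slice` is a hypothetical name.

```python
LAUNCH_OVERHEAD_US = 5.0  # assumed per-launch cost; hypothetical number

def pick_blocks_per_slice(total_blocks: int, block_time_us: float,
                          sm_count: int, budget_us: float = 400.0) -> int:
    """Pick a slice size minimizing modeled total time, subject to the budget.

    Smaller slices shorten worst-case preemption delay but multiply launch
    overhead; larger slices amortize launches but delay preemption. We search
    over whole waves of blocks that still respect the per-slice time budget.
    """
    best = None
    max_waves = int(budget_us // block_time_us)
    for waves in range(1, max_waves + 1):
        per_slice = waves * sm_count
        launches = -(-total_blocks // per_slice)  # ceil division
        # modeled total = launch overhead + serialized slice compute time
        total_us = launches * LAUNCH_OVERHEAD_US + launches * waves * block_time_us
        if best is None or total_us < best[1]:
            best = (per_slice, total_us)
    return best[0]
```

With the same hypothetical numbers as before (10,000 blocks, 50 µs blocks, 108 SMs), this toy model settles on three waves (324 blocks) per slice: enough to amortize launches while keeping the worst-case preemption delay around 150 µs, well inside the 400 µs budget. The real splitter additionally folds in memory bandwidth, which this sketch omits.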
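The "kernel-tick" policy of the Runtime Scheduler can be captured in a small discrete-time simulation. This is a sketch under assumptions, not the paper's scheduler: a low-priority slice in flight cannot be interrupted, so a high-priority request waits at most one slice; `simulate` and its parameters are illustrative names.

```python
def simulate(lp_slice_us: float, n_lp_slices: int,
             hp_arrival_us: float, hp_exec_us: float):
    """Simulate kernel-tick preemption; return (preemption_delay_us, timeline).

    Low-priority (LP) slices run back-to-back. When the high-priority (HP)
    request arrives, the in-flight LP slice finishes (it cannot be interrupted
    on closed-source hardware), then HP runs at the next kernel tick. The
    preemption delay is therefore bounded by one LP slice's execution time.
    """
    t = 0.0
    timeline = []           # list of (task, start_time) events
    hp_done = False
    lp_left = n_lp_slices
    delay = None
    while lp_left > 0 or not hp_done:
        if not hp_done and t >= hp_arrival_us:
            delay = t - hp_arrival_us          # time HP waited past arrival
            timeline.append(("HP", t))
            t += hp_exec_us
            hp_done = True
        elif lp_left > 0:                      # harvest idle time with LP slices
            timeline.append(("LP", t))
            t += lp_slice_us
            lp_left -= 1
        else:
            t = hp_arrival_us                  # GPU idles until HP arrives
    return delay, timeline
```

For example, with 100 µs LP slices and an HP request arriving at t = 250 µs, the HP kernel starts at t = 300 µs (a 50 µs delay, bounded by one slice), after which the remaining LP slices resume in the leftover time, which mirrors how the scheduler bounds preemption delay while still draining low-priority work.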
The evaluation spans three GPU families (mid‑range L40, high‑end A100, and H100) and includes two high‑priority workloads (LLM inference and CNN training) plus four low‑priority workloads covering inference, fine‑tuning, and training. Experiments also cover memory‑intensive cases and distributed settings using up to sixteen GPUs across two DGX‑A100 servers. Key results include:
- SLO Attainment – Hummingbird improves the proportion of high‑priority requests meeting the 99th‑percentile latency target by 9.7× over spatial sharing (Orion, LithOS) and 3.5× over temporal sharing (REEF). Compared to exclusive execution, the SLO drop is less than 1 %, demonstrating near‑perfect isolation.
- Throughput of Low‑Priority Tasks – Through aggressive bubble harvesting, low‑priority throughput increases by 2.4× relative to REEF, directly translating into higher overall GPU utilization.
- Utilization Gains – The system consistently achieves GPU utilization well above 80 % across all tested configurations, even in the presence of highly variable request patterns.
- Scalability – In distributed experiments, hierarchical memory offloading prevents NVLink saturation and maintains the same SLO and throughput benefits as in single‑GPU scenarios.
The authors discuss limitations such as the reliance on predictable block execution times (which may be violated by highly irregular kernels) and the need for PTX‑level transformation, which may not be portable to future GPU architectures without additional engineering. They also note that while Hummingbird’s kernel‑splitting approach is orthogonal to compiler‑based block reduction techniques, combining both could further reduce launch overhead.
In summary, Hummingbird demonstrates that microsecond‑scale preemption is feasible on closed‑source GPUs by exploiting the natural granularity of thread blocks and the prevalence of idle bubbles in real AI workloads. By doing so, it reconciles the historically conflicting goals of strict SLO adherence for latency‑critical services and high overall GPU utilization for batch workloads, offering a practical path for data‑center operators to reduce over‑provisioning and improve cost efficiency.