gpu_ext: Extensible OS Policies for GPUs via eBPF
Performance in modern GPU-centric systems increasingly depends on resource management policies, including memory placement, scheduling, and observability. However, uniform policies typically yield suboptimal performance across diverse workloads. Existing approaches present a tradeoff: user-space runtimes provide programmability and flexibility but lack cross-tenant visibility and fine-grained control over hardware resources, while modifications to the OS kernel introduce significant complexity and safety risks. To address this, we argue that the GPU driver and device layer should provide an extensible OS interface for policy enforcement. While the emerging eBPF technology shows potential, directly applying existing host-side eBPF is insufficient because it lacks visibility into and control over critical device-side events, and directly embedding policy code into GPU kernels could compromise safety and efficiency. We propose gpu_ext, an eBPF-based runtime that treats the GPU driver and device as a programmable OS subsystem. gpu_ext extends GPU drivers by exposing safe programmable hooks and introduces a device-side eBPF runtime capable of executing verified policy logic within GPU kernels, enabling coherent and transparent policies. Evaluation across realistic workloads, including inference, training, and vector search, demonstrates that gpu_ext improves throughput by up to 4.8x and reduces tail latency by up to 2x with low overhead, and without modifying or restarting applications.
💡 Research Summary
The paper introduces gpu_ext, a novel framework that turns the GPU driver and the device itself into a programmable operating‑system‑level subsystem using eBPF. Modern GPU‑centric workloads (LLM inference, graph neural network training, vector search, etc.) suffer from static, one‑size‑fits‑all resource‑management policies. Existing solutions fall into two camps: user‑space runtimes that are easy to program but lack visibility into low‑level hardware state and cannot coordinate across tenants, and kernel‑space driver modifications that provide fine‑grained control but are hard to deploy and maintain, and risk system stability.
Key contributions
- Extensible OS‑level policy interface – The authors design a narrow, verified set of hooks inside the NVIDIA driver (page‑table updates, command‑buffer handling, interrupt processing, etc.). These hooks expose only hardware‑aligned abstractions, preventing unsafe manipulation while still allowing expressive policies.
- Device‑side eBPF runtime – A custom eBPF interpreter runs on the GPU. Because GPU execution follows a SIMT (single‑instruction‑multiple‑thread) model, the authors extend the traditional eBPF verifier to be “SIMT‑aware”. The verifier enforces warp‑uniform control flow, bounded loops, and memory safety, guaranteeing that a policy program cannot cause divergence, deadlock, or hardware hangs.
- Cross‑layer shared state – Policy state is stored in eBPF maps that are accessible from both CPU and GPU. The maps use a relaxed‑consistency model to hide the latency gap between host memory and device memory while still providing sufficiently fresh information for dynamic decisions.
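To make the cross‑layer map idea concrete, here is a minimal C sketch of a relaxed‑consistency map entry shared between a host‑side hook and device‑side policy code. All names (`policy_map_entry`, `admit_transfer`, the budget field) are illustrative, not the gpu_ext API; C11 relaxed atomics stand in for the host/device memory model.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Toy model of a cross-layer eBPF map entry: host-side hooks update a
 * byte counter with relaxed atomics, and device-side policy code reads
 * a possibly slightly stale value -- fresh enough for heuristic
 * decisions, but never torn. Illustrative only, not the gpu_ext API. */
typedef struct {
    _Atomic uint64_t bytes_in_flight;  /* updated by hooks */
    uint64_t budget;                   /* fixed at policy load time */
} policy_map_entry;

/* Device-side policy decision: admit a transfer only if the (relaxed)
 * view of in-flight bytes still leaves room in the budget. */
static int admit_transfer(policy_map_entry *e, uint64_t size)
{
    uint64_t seen = atomic_load_explicit(&e->bytes_in_flight,
                                         memory_order_relaxed);
    if (seen + size > e->budget)
        return 0;                      /* defer: over budget */
    atomic_fetch_add_explicit(&e->bytes_in_flight, size,
                              memory_order_relaxed);
    return 1;                          /* admit */
}
```

The relaxed ordering mirrors the paper's tradeoff: a reader may see a slightly stale counter, but the decision remains safe because the counter is only ever a heuristic input, never a correctness invariant.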
Design challenges addressed
- C1 (Safe interface): Balancing expressiveness with stability by exposing only a small, well‑defined API.
- C2 (SIMT mismatch): Adapting scalar eBPF semantics to warp‑level execution, preventing divergence and ensuring that every thread in a warp follows the same path.
- C3 (Host‑device shared state): Providing efficient, low‑overhead mechanisms for synchronizing policy data across the heterogeneous memory hierarchy.
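The C2 challenge can be illustrated with a toy version of the core SIMT‑aware verifier rule. This sketch (my own simplification, not the actual gpu_ext verifier) tags each value as warp‑uniform or potentially divergent and rejects any branch whose condition may diverge:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy two-point lattice for a SIMT-aware verifier pass: a value is
 * UNIFORM if every thread in a warp is guaranteed to hold the same
 * value, DIVERGENT otherwise. Illustrative only. */
typedef enum { UNIFORM, DIVERGENT } simt_tag;

/* A derived value is uniform only if all of its operands are. */
static simt_tag simt_join(simt_tag a, simt_tag b)
{
    return (a == UNIFORM && b == UNIFORM) ? UNIFORM : DIVERGENT;
}

/* Verifier rule: branching on a possibly divergent condition is
 * rejected, so policy programs cannot introduce intra-warp
 * control-flow divergence. */
static bool branch_allowed(simt_tag cond)
{
    return cond == UNIFORM;
}
```

For example, a branch on a map value loaded by all threads at the same key would be tagged `UNIFORM` and accepted, while a branch on a per‑thread index would be tagged `DIVERGENT` and rejected at load time.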
Implementation
The prototype extends the open‑source NVIDIA GPU kernel modules on Linux and adds a GPU‑resident eBPF byte‑code interpreter. Policy programs are compiled and verified on the host, then loaded into the driver via an ioctl interface. At runtime the driver invokes the eBPF hooks, and the device‑side interpreter executes the policy logic inside the kernel launch, using warp‑level scheduling primitives.
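One host‑side verification step described above is bounded‑loop checking before a program is loaded into the driver. The sketch below is an assumed simplification (the struct, the `MAX_DEVICE_TRIPS` budget, and `verify_loops` are all hypothetical) of how such a check might reject programs that could hang the device:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical host-side verification step: before a policy program
 * is handed to the driver, every loop must carry a compile-time
 * trip-count bound small enough that the device-side interpreter
 * cannot stall the GPU. Not the real gpu_ext verifier. */
typedef struct {
    uint32_t trip_count;     /* statically derived iteration bound */
    bool bound_is_constant;  /* false if the bound depends on input */
} loop_info;

#define MAX_DEVICE_TRIPS 4096  /* assumed per-hook iteration budget */

static bool verify_loops(const loop_info *loops, int n)
{
    for (int i = 0; i < n; i++) {
        if (!loops[i].bound_is_constant)
            return false;                 /* unbounded loop: reject */
        if (loops[i].trip_count > MAX_DEVICE_TRIPS)
            return false;                 /* bound too large: reject */
    }
    return true;                          /* safe to load */
}
```

Only programs passing checks of this kind would then be handed to the driver through the ioctl interface the prototype exposes.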
Evaluation
Four representative workloads were used: (1) LLM inference with KV‑cache management, (2) GNN training, (3) high‑throughput vector search, and (4) a mixed‑priority multi‑tenant scenario. Policies implemented with gpu_ext include:
- Adaptive memory prefetch/eviction, reducing page‑fault rates by up to 73 % and improving average latency by 1.8×.
- Fine‑grained kernel preemption, cutting 99th‑percentile latency for latency‑critical inference by more than half.
- Dynamic work‑stealing scheduler, boosting overall system throughput by up to 4.8× and halving tail latency.
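The work‑stealing policy in the last bullet can be sketched as a simple victim‑selection routine. Everything here (the threshold, the "steal half" rule, the function names) is my illustrative guess at the general shape of such a policy, not gpu_ext's actual implementation:

```c
#include <assert.h>
#include <stdint.h>

/* Toy work-stealing decision: an idle worker scans per-queue backlog
 * counts (in gpu_ext these would live in a shared eBPF map) and
 * steals half the work from the longest queue, provided it exceeds a
 * threshold. Names and constants are illustrative. */
#define STEAL_THRESHOLD 8

/* Returns the victim queue index, or -1 if no queue is worth
 * stealing from. On success, *stolen is set to the item count to
 * take (half the victim's backlog). */
static int pick_victim(const uint32_t *backlog, int n, uint32_t *stolen)
{
    int victim = -1;
    uint32_t best = STEAL_THRESHOLD;   /* only steal above threshold */
    for (int i = 0; i < n; i++) {
        if (backlog[i] > best) {
            best = backlog[i];
            victim = i;
        }
    }
    if (victim >= 0)
        *stolen = backlog[victim] / 2; /* take half the backlog */
    return victim;
}
```

Because the decision reads only the relaxed‑consistency map state, a stale backlog count can at worst cause a mildly suboptimal steal, never an incorrect one.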
Across all experiments, the runtime overhead of the eBPF hooks and map synchronization stayed below 2 % (average 0.8 µs per hook call). Importantly, policies could be swapped without restarting the application or recompiling the GPU binary, demonstrating true dynamic programmability.
Discussion and limitations
The current implementation targets NVIDIA GPUs; extending to AMD or Intel GPUs would require adapting the driver hooks and possibly the SIMT model. Very large or complex eBPF programs increase verification time at load, suggesting a need for tooling to aid developers. The relaxed consistency model may be insufficient for workloads that need strict, up‑to‑date state, indicating future work on stronger synchronization primitives.
Conclusion
gpu_ext shows that treating the GPU driver/device as a programmable OS subsystem is feasible and beneficial. By marrying the safety guarantees of eBPF with a SIMT‑aware verifier and cross‑layer shared maps, the framework delivers high‑performance, dynamically updatable policies without sacrificing stability. The results suggest a promising path toward more adaptable, tenant‑aware GPU resource management in data‑center environments, and open avenues for broader hardware support and automated policy synthesis.