Evaluating Kubernetes Performance for GenAI Inference: From Automatic Speech Recognition to LLM Summarization
As Generative AI (GenAI), particularly inference, rapidly emerges as a dominant workload category, the Kubernetes ecosystem is proactively evolving to natively support its unique demands. This industry paper demonstrates how emerging Kubernetes-native projects can be combined to deliver the benefits of container orchestration, such as scalability and resource efficiency, to complex AI workflows. We implement and evaluate an illustrative, multi-stage use case consisting of automatic speech recognition and summarization. First, we address batch inference by using Kueue to manage jobs that transcribe audio files with Whisper models and Dynamic Accelerator Slicer (DAS) to increase parallel job execution. Second, we address a discrete online inference scenario by feeding the transcripts to a Large Language Model for summarization hosted using llm-d, a novel solution utilizing the recent developments around the Kubernetes Gateway API Inference Extension (GAIE) for optimized routing of inference requests. Our findings illustrate that these complementary components (Kueue, DAS, and GAIE) form a cohesive, high-performance platform, proving Kubernetes’ capability to serve as a unified foundation for demanding GenAI workloads: Kueue reduced total makespan by up to 15%; DAS shortened mean job completion time by 36%; and GAIE working in conjunction with llm-d improved tail Time to First Token latency by up to 90% even under high loads.
💡 Research Summary
The paper presents a comprehensive performance evaluation of a Kubernetes‑native stack for generative AI (GenAI) inference, focusing on a two‑stage pipeline that first transcribes corporate earnings‑call audio with OpenAI’s Whisper model and then summarizes the resulting transcripts using a large language model (LLM). Three emerging projects—Kueue, Dynamic Accelerator Slicer (DAS), and the Gateway API Inference Extension (GAIE) used together with the llm‑d serving framework—are integrated to address the distinct challenges of batch processing, accelerator slicing, and low‑latency online serving.
In the batch stage, Kueue layers job-level queueing on top of the native Kubernetes scheduler, introducing a hierarchical model (LocalQueue → ClusterQueue → ResourceFlavor). This allows operators to declare resource pools, priorities, and admission policies that are enforced cluster-wide. By routing Whisper transcription jobs through Kueue, the authors reduce total makespan by up to 15% compared with a plain Kubernetes Job controller, demonstrating improved fairness and higher overall GPU utilization.
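The LocalQueue → ClusterQueue → ResourceFlavor hierarchy can be sketched with Kueue's standard CRDs. The flavor name, namespace, and quota values below are illustrative placeholders, not the paper's actual cluster configuration:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor          # describes a pool of homogeneous nodes
metadata:
  name: gpu-flavor            # illustrative name
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue            # cluster-wide quota and admission policy
metadata:
  name: batch-transcription
spec:
  namespaceSelector: {}       # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: gpu-flavor
      resources:
      - name: "cpu"
        nominalQuota: 32      # example quotas only
      - name: "memory"
        nominalQuota: 128Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 4
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue              # namespaced entry point that jobs target
metadata:
  name: whisper-jobs
  namespace: transcription    # illustrative namespace
spec:
  clusterQueue: batch-transcription
```

A transcription Job then opts in by carrying the `kueue.x-k8s.io/queue-name: whisper-jobs` label; Kueue holds it suspended until the ClusterQueue's quota admits it.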
DAS builds on NVIDIA’s Multi‑Instance GPU (MIG) technology, but unlike static MIG configurations it creates and destroys slices on‑the‑fly based on current demand. The paper shows that dynamic slicing increases the number of jobs that run in parallel on a single GPU, cutting the mean job completion time by 36% (from roughly 12 minutes per batch to 7.7 minutes) and leaving less GPU memory and compute capacity idle.
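The intuition behind the speedup can be captured in a toy queueing model: each MIG slice runs a job somewhat slower than a full GPU, but several jobs proceed concurrently instead of waiting in line. All numbers here are illustrative and are not taken from the paper's measurements:

```python
# Toy model of why GPU slicing shortens mean job completion time (JCT).
# Illustrative parameters only; not the paper's experimental data.

def mean_jct_serial(n_jobs: int, job_minutes: float) -> float:
    """Whole-GPU allocation: jobs run one at a time, so job i
    (0-indexed) finishes at (i + 1) * job_minutes."""
    finish = [(i + 1) * job_minutes for i in range(n_jobs)]
    return sum(finish) / n_jobs

def mean_jct_sliced(n_jobs: int, job_minutes: float,
                    n_slices: int, slowdown: float) -> float:
    """GPU split into n_slices MIG instances: each job runs `slowdown`x
    slower on a slice, but n_slices jobs execute concurrently.
    Job i belongs to wave i // n_slices."""
    per_job = job_minutes * slowdown
    finish = [(i // n_slices + 1) * per_job for i in range(n_jobs)]
    return sum(finish) / n_jobs

serial = mean_jct_serial(8, 6.0)                          # 27.0 minutes
sliced = mean_jct_sliced(8, 6.0, n_slices=4, slowdown=1.5)  # 13.5 minutes
print(f"mean JCT: {serial} -> {sliced} ({1 - sliced / serial:.0%} lower)")
```

Even with a 1.5x per-slice slowdown, the added concurrency halves mean JCT in this toy setting; the 36% figure reported for DAS reflects real contention and slicing overheads that this sketch deliberately ignores.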
The second stage tackles real‑time inference. The llm‑d framework leverages GAIE, a new layer on top of the standard Kubernetes Gateway API, to route LLM requests to the most appropriate pod based on live metrics such as queue length, prefix‑cache hit rate, and LoRA adapter availability. GAIE’s model‑aware routing, combined with a vLLM‑Optimized Inference Scheduler that employs “Precise Prefix‑Cache‑Aware Scheduling,” dramatically improves tail latency. Under a stress test of 2,000 requests per second, tail Time‑to‑First‑Token (TTFT) is reduced by up to 90% (from ~6 seconds to under 1 second), and average response latency drops from 2.3 seconds to 1.1 seconds.
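The metric-aware routing decision can be sketched as a scoring function over candidate pods. The weights, thresholds, and pod metrics below are invented for illustration and do not reproduce GAIE's actual endpoint-picker logic:

```python
# Illustrative sketch of metric-aware request routing in the spirit of
# GAIE's endpoint picker. Scoring weights and metrics are assumptions.
from dataclasses import dataclass

@dataclass
class PodMetrics:
    name: str
    queue_len: int          # requests queued at the serving pod
    prefix_hit_rate: float  # fraction of prompt prefix already cached (0..1)
    has_lora: bool          # requested LoRA adapter already loaded

def score(pod: PodMetrics, max_queue: int = 16) -> float:
    """Higher is better: prefer short queues, warm prefix caches,
    and pods that already hold the requested adapter."""
    if pod.queue_len >= max_queue:
        return float("-inf")  # saturated pod: never route here
    queue_score = 1.0 - pod.queue_len / max_queue
    return 0.4 * queue_score + 0.4 * pod.prefix_hit_rate + 0.2 * pod.has_lora

def pick_endpoint(pods: list[PodMetrics]) -> str:
    return max(pods, key=score).name

pods = [
    PodMetrics("pod-a", queue_len=12, prefix_hit_rate=0.9, has_lora=True),
    PodMetrics("pod-b", queue_len=2,  prefix_hit_rate=0.1, has_lora=True),
    PodMetrics("pod-c", queue_len=3,  prefix_hit_rate=0.8, has_lora=False),
]
print(pick_endpoint(pods))  # -> pod-a
```

Note that the highest prefix-cache hit rate wins here despite a longer queue, mirroring the paper's observation that prefix-cache-aware placement, not raw load balancing, drives the TTFT improvement.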
The authors position their work against prior scheduling and resource‑sharing research (e.g., PAX, KaiS, nvshare, DISC) by emphasizing that the three components are lightweight, fully compatible with standard Kubernetes APIs, and require minimal custom device plugins. They also note that while the study is limited to NVIDIA GPUs and fixed model sizes, the methodology can be extended to other accelerators and larger model families. Future directions include multi‑cloud resource federation, cost‑efficiency modeling, and broader GAIE policy extensions for heterogeneous workloads.
Overall, the study demonstrates that a carefully orchestrated combination of Kueue, DAS, and GAIE can turn Kubernetes into a high‑performance, unified platform for both batch and online GenAI inference, delivering measurable gains in makespan, job completion time, and tail latency without sacrificing the scalability and declarative management that make Kubernetes attractive to cloud operators.