VoxServe: Streaming-Centric Serving System for Speech Language Models

Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the [Original Paper Viewer] below or the original arXiv source.

Deploying modern Speech Language Models (SpeechLMs) in streaming settings requires systems that provide low latency, high throughput, and strong guarantees of streamability. Existing systems fall short of supporting diverse models flexibly and efficiently. We present VoxServe, a unified serving system for SpeechLMs that optimizes streaming performance. VoxServe introduces a model-execution abstraction that decouples model architecture from system-level optimizations, thereby enabling support for diverse SpeechLM architectures within a single framework. Building on this abstraction, VoxServe implements streaming-aware scheduling and an asynchronous inference pipeline to improve end-to-end efficiency. Evaluations across multiple modern SpeechLMs show that VoxServe achieves 10-20x higher throughput than existing implementations at comparable latency while maintaining high streaming viability. The code of VoxServe is available at https://github.com/vox-serve/vox-serve.


💡 Research Summary

VoxServe is a unified serving system designed specifically for modern Speech Language Models (SpeechLMs) that must operate in streaming scenarios. The authors identify two fundamental gaps in existing serving infrastructures: (1) the inability to handle the heterogeneous, multi‑stage pipelines of SpeechLMs—comprising an LLM backbone, audio tokenizers, detokenizers, and sometimes encoders—and (2) the lack of performance metrics and scheduling policies that address streaming‑specific requirements such as Time‑to‑First‑Audio (TTFA) and continuous streaming viability.

To bridge these gaps, VoxServe introduces a model‑execution abstraction that decouples model architecture from system‑level optimizations. Every SpeechLM is expressed as a sequence of five logical stages—Preprocess, LLM Forward, Sampling, Detokenize, and Postprocess—each exposed through a common interface. Model‑specific logic (e.g., handling multiple codebooks, depth‑wise LLMs, or continuous audio features) is encapsulated in subclasses, while the scheduler and worker components operate on the abstracted stages without needing to know the underlying details.
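The five-stage abstraction can be sketched as a base class whose methods the scheduler calls uniformly, with model-specific behavior confined to subclasses. The class and method names below are illustrative assumptions for exposition, not VoxServe's actual API; the stage bodies are placeholders.

```python
# Sketch of the model-execution abstraction: five logical stages behind
# one interface. Names and signatures are hypothetical.
from abc import ABC, abstractmethod

class SpeechLM(ABC):
    """Every model exposes the same five stages to the scheduler."""

    @abstractmethod
    def preprocess(self, request): ...    # text/audio -> input tokens

    @abstractmethod
    def llm_forward(self, batch): ...     # backbone forward pass -> logits

    @abstractmethod
    def sample(self, logits): ...         # logits -> audio tokens

    @abstractmethod
    def detokenize(self, tokens): ...     # audio tokens -> waveform chunk

    @abstractmethod
    def postprocess(self, waveform): ...  # e.g. encode/stream the chunk

class MultiCodebookLM(SpeechLM):
    """Model-specific logic (here: several codebooks per decoding step)
    lives in a subclass; the scheduler never sees these details."""

    def __init__(self, num_codebooks=9):
        self.num_codebooks = num_codebooks

    def preprocess(self, request):
        return list(request)              # placeholder tokenization

    def llm_forward(self, batch):
        return batch                      # placeholder forward pass

    def sample(self, logits):
        # one token per codebook per step
        return [logits for _ in range(self.num_codebooks)]

    def detokenize(self, tokens):
        return b""                        # placeholder waveform bytes

    def postprocess(self, waveform):
        return waveform
```

Under this design, adding a new SpeechLM family means writing one subclass; batching, caching, and scheduling logic remain untouched.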

The system architecture consists of an interface process (exposing HTTP APIs) and an execution process that contains three modules: Scheduler, Worker, and Model. The Scheduler orchestrates request lifecycles, continuously deciding which requests should run LLM inference and which should trigger the detokenizer. The Worker manages GPU resources, executes batched kernels, and leverages CUDA Graphs to pre‑compile repetitive execution graphs, dramatically reducing kernel launch overhead.
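The Scheduler's per-iteration decision — which requests take an LLM step and which trigger the detokenizer — might look roughly like the loop below. This is a simplified sketch under assumed data structures (e.g. a fixed `CHUNK_TOKENS` per audio chunk and a dict-based request record), not VoxServe's actual scheduler.

```python
# Illustrative scheduler iteration: batch LLM steps, then detokenize
# any request that has buffered a full chunk of audio tokens.
CHUNK_TOKENS = 50  # assumed number of audio tokens per chunk

class StubModel:
    """Stand-in for a real SpeechLM, for demonstration only."""
    def generate_token(self, request):
        return 0                          # placeholder sampled token

    def detokenize(self, tokens):
        return bytes(len(tokens))         # placeholder waveform bytes

def scheduler_step(requests, model):
    """One iteration: advance all active requests by one LLM step,
    then emit a waveform chunk for each request with enough tokens."""
    active = [r for r in requests if not r["done"]]
    # 1) LLM forward + sampling for every active request (batched in
    #    the real system; sequential here for clarity)
    for r in active:
        r["tokens"].append(model.generate_token(r))
    # 2) trigger the detokenizer for requests with a full chunk buffered
    chunks = {}
    for r in active:
        if len(r["tokens"]) - r["detokenized"] >= CHUNK_TOKENS:
            start = r["detokenized"]
            chunk = r["tokens"][start:start + CHUNK_TOKENS]
            chunks[r["id"]] = model.detokenize(chunk)
            r["detokenized"] += CHUNK_TOKENS
    return chunks
```

In the real system both phases run as pre-captured CUDA Graphs, so the per-iteration kernel launch cost is paid once at capture time rather than on every step.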

A key contribution is the streaming‑aware scheduling algorithm. It treats TTFA and streaming viability as constraints: the first audio chunk must be delivered within a tight latency budget, and each subsequent chunk must arrive before the playback of the previous chunk finishes. The scheduler dynamically adjusts the pre‑fill length of the LLM, the frequency of detokenizer invocations, and cache eviction policies to satisfy these constraints while maximizing overall goodput. An asynchronous pipeline further overlaps LLM generation with detokenizer execution, eliminating idle GPU periods that are common in naïve sequential designs.
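The streaming constraints described above can be made concrete with a small worked example: chunk 0 must arrive within the TTFA budget, and chunk i must arrive before playback of the previously delivered chunks runs out. The deadline formula and the numeric budgets below are illustrative assumptions, not values from the paper.

```python
# Worked sketch of the streaming-viability constraint: chunk i's
# delivery deadline is the TTFA budget plus the playback time already
# queued by chunks 0..i-1. Budgets here are illustrative.
def chunk_deadline(i, ttfa_budget=0.2, chunk_playback=0.5):
    """Deadline (seconds after request arrival) for delivering chunk i."""
    return ttfa_budget + i * chunk_playback

def is_stream_viable(delivery_times, ttfa_budget=0.2, chunk_playback=0.5):
    """True iff every chunk arrived before its playback deadline,
    i.e. the listener never hears a gap."""
    return all(t <= chunk_deadline(i, ttfa_budget, chunk_playback)
               for i, t in enumerate(delivery_times))
```

A scheduler that treats these deadlines as constraints can safely deprioritize a request whose next deadline is far away, spending that slack on prefills or on requests that are close to missing theirs.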

The authors evaluate VoxServe on three state‑of‑the‑art SpeechLMs with diverse architectures (e.g., DAC with 9 codebooks, SNAC with multi‑granular codebooks, and CosyVoice2 with a flow‑matching transformer and HiFi‑GAN vocoder). Compared against existing LLM serving stacks such as vLLM and FasterTransformer, VoxServe achieves 10–20× higher request throughput on identical hardware (NVIDIA A100 40 GB) while keeping TTFA in the 30–50 ms range—well below human perceptual thresholds. Streaming viability tests show zero missed deadlines, whereas baseline systems frequently drop chunks, causing audible glitches.

Beyond single‑node performance, VoxServe supports distributed inference and a throughput‑oriented mode, enabling cost‑effective multi‑tenant deployments in cloud environments. The open‑source release (linked above) allows researchers and engineers to extend the abstraction to new SpeechLM families without re‑implementing core optimizations such as batching, cache management, or CUDA‑graph integration.

In summary, VoxServe delivers a comprehensive, model‑agnostic serving platform that unifies the complex components of SpeechLMs under a single abstraction, introduces streaming‑specific scheduling and asynchronous execution, and demonstrates order‑of‑magnitude throughput gains at comparable latency. It represents a significant step toward practical, large‑scale deployment of speech‑centric generative AI services.

