Enabling Disaggregated Multi-Stage MLLM Inference via GPU-Internal Scheduling and Resource Sharing
Multimodal large language models (MLLMs) extend LLMs with visual understanding through a three-stage pipeline: multimodal preprocessing, vision encoding, and LLM inference. While these stages enhance capability, they introduce significant system bottlenecks. First, multimodal preprocessing, especially video decoding, often dominates Time-to-First-Token (TTFT). Most systems rely on CPU-based decoding, which severely limits throughput, while existing GPU-based approaches prioritize throughput-oriented parallelism and fail to meet the latency-sensitive requirements of MLLM inference. Second, the vision encoder is a standalone, compute-intensive stage that produces visual embeddings and cannot be co-batched with LLM prefill or decoding. This heterogeneity forces inter-stage blocking and increases token-generation latency. Even when deployed on separate GPUs, these stages underutilize available compute and memory resources, constraining system throughput. To address these challenges, we present FlashCodec and UnifiedServe, two complementary designs that jointly optimize the end-to-end MLLM pipeline. FlashCodec accelerates the multimodal preprocessing stage through collaborative multi-GPU video decoding, reducing decoding latency while preserving high throughput. UnifiedServe optimizes the vision-encoding and inference stages by logically decoupling their execution to eliminate inter-stage blocking while physically sharing GPU resources to maximize overall utilization. By carefully orchestrating execution across stages and minimizing interference, UnifiedServe reduces TTFT while keeping token-generation latency stable. Together, our proposed framework forms an end-to-end optimized stack that can serve up to 3.0$\times$ more requests or enforce 1.5$\times$ tighter SLOs, while achieving up to 4.4$\times$ higher throughput compared to state-of-the-art systems.
💡 Research Summary
The paper addresses two critical performance bottlenecks in serving multimodal large language models (MLLMs): (1) the latency‑heavy multimodal preprocessing stage, especially video decoding, and (2) the interference caused by the compute‑intensive vision encoder when integrated with the LLM inference stage. Existing serving architectures fall into two categories. Monolithic designs co‑locate the vision encoder and LLM on a single GPU instance, achieving high throughput but suffering from severe token‑generation stalls because the encoder monopolizes GPU compute, inflating Time‑Between‑Tokens (TBT). Split designs disaggregate the encoder and LLM onto separate instances, eliminating direct interference but fragmenting compute and memory resources, which leads to poor overall utilization and high Time‑to‑First‑Token (TTFT).
To overcome these issues, the authors propose two complementary systems: FlashCodec and UnifiedServe. FlashCodec rethinks video decoding by collaboratively using all NVDEC engines across multiple GPUs. It partitions a video into non‑overlapping segments, dispatches them across the GPUs, and employs a stall‑free scheduling policy that keeps every decoder engine busy. This collaborative approach reduces per‑video decoding latency by 2.8–9× on a single A100 and by 5.7–9.1× when scaled across four GPUs, while preserving high throughput.
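The partitioning step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each GPU exposes a fixed number of NVDEC engines and splits the frame range into one contiguous, non-overlapping segment per engine; the `Segment` type and `partition_video` function are hypothetical names.

```python
# Hedged sketch of FlashCodec-style video partitioning (illustrative only):
# split a video's frame range into non-overlapping segments, one per NVDEC
# engine across all GPUs, so every decoder engine can run in parallel.
from dataclasses import dataclass

@dataclass
class Segment:
    gpu: int      # GPU that decodes this segment
    engine: int   # NVDEC engine index on that GPU
    start: int    # first frame (inclusive)
    end: int      # last frame (exclusive)

def partition_video(num_frames: int, num_gpus: int, engines_per_gpu: int) -> list[Segment]:
    """Split [0, num_frames) into one contiguous segment per decoder engine."""
    total_engines = num_gpus * engines_per_gpu
    base, rem = divmod(num_frames, total_engines)
    segments, start = [], 0
    for idx in range(total_engines):
        length = base + (1 if idx < rem else 0)  # spread remainder evenly
        segments.append(Segment(gpu=idx // engines_per_gpu,
                                engine=idx % engines_per_gpu,
                                start=start, end=start + length))
        start += length
    return segments

# Example: a 300-frame clip across 4 GPUs with 2 NVDEC engines each
segs = partition_video(300, num_gpus=4, engines_per_gpu=2)
```

A real system would also align segment boundaries to keyframes and interleave dispatch to avoid engine stalls, which this sketch omits.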
UnifiedServe restructures the MLLM pipeline into three asynchronous workers: a Vision‑Preprocess worker (using FlashCodec), an Encode‑Prefill worker, and an LLM‑Decode worker. The Encode‑Prefill worker interleaves vision encoding and LLM prefill, allowing the two to block each other so that contention stays bounded, whereas the Decode worker runs in a separate process to guarantee low‑latency token generation. A lightweight buffering mechanism shares intermediate states (patch tokens and visual embeddings) between workers without excessive memory overhead. Crucially, although the stages are logically decoupled, they physically share the same GPU pool, allowing idle SM cycles during LLM decoding to be harvested for vision encoding. This asymmetry‑aware scheduling eliminates the blocking seen in monolithic designs while avoiding the resource fragmentation of split designs.
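The three-worker structure can be sketched with plain threads and bounded queues. This is an assumed, simplified shape, not the paper's code: string transformations stand in for the GPU stages, and the function and worker names are illustrative. The point is the logical decoupling: each stage consumes from and produces into a buffer, so a slow stage never blocks token generation directly.

```python
# Hedged sketch of a UnifiedServe-style pipeline (illustrative only):
# three logically decoupled workers connected by bounded queues, mirroring
# Vision-Preprocess -> Encode-Prefill -> Decode. Real deployments share
# GPUs across stages; here simple functions stand in for each stage.
import queue
import threading

def run_pipeline(requests):
    pre_q = queue.Queue(maxsize=4)   # decoded frames buffer
    enc_q = queue.Queue(maxsize=4)   # visual embeddings buffer
    out_q = queue.Queue()            # generated tokens

    def preprocess_worker():         # e.g. FlashCodec video decoding
        for req in requests:
            pre_q.put(f"frames({req})")
        pre_q.put(None)              # sentinel: no more work

    def encode_prefill_worker():     # vision encoding + LLM prefill
        while (item := pre_q.get()) is not None:
            enc_q.put(f"embeds({item})")
        enc_q.put(None)

    def decode_worker():             # latency-critical token generation
        while (item := enc_q.get()) is not None:
            out_q.put(f"tokens({item})")

    threads = [threading.Thread(target=t) for t in
               (preprocess_worker, encode_prefill_worker, decode_worker)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [out_q.get() for _ in range(out_q.qsize())]

print(run_pipeline(["req0", "req1"]))
```

The bounded queues play the role of the lightweight buffering mechanism: they cap how many intermediate states (frames, embeddings) are held at once, bounding memory overhead while keeping every stage fed.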
Experimental evaluation on Qwen‑2.5‑VL‑7B/32B models across a four‑GPU A100 cluster demonstrates that the combined stack can serve up to 3.0× more requests, enforce 1.5× tighter SLOs, and achieve up to 4.4× higher throughput compared to state‑of‑the‑art baselines. Decoding latency stays below 2 seconds with 2–4 GPUs and under 1 second with 8 GPUs, and the system maintains stable TBT while dramatically reducing TTFT. Memory overhead remains modest (<10 % increase), confirming the practicality of the approach.
In summary, the paper introduces a novel GPU‑internal scheduling and resource‑sharing paradigm that simultaneously accelerates multimodal preprocessing and harmonizes vision encoding with LLM inference. By leveraging collaborative multi‑GPU decoding and asymmetry‑aware shared scheduling, FlashCodec and UnifiedServe provide a scalable, low‑latency, high‑throughput solution for real‑time MLLM serving, bridging the gap between existing CPU‑centric or throughput‑oriented GPU solutions and the stringent latency requirements of production multimodal AI services.