vLLM-Omni: Fully Disaggregated Serving for Any-to-Any Multimodal Models
Any-to-any multimodal models that jointly handle text, images, video, and audio represent a significant advance in multimodal AI. However, their complex architectures (typically combining multiple autoregressive LLMs, diffusion transformers, and other specialized components) pose substantial challenges for efficient model serving. Existing serving systems are mainly tailored to a single paradigm, such as autoregressive LLMs for text generation or diffusion transformers for visual generation. They lack support for any-to-any pipelines that involve multiple interconnected model components. As a result, developers must manually handle cross-stage interactions, leading to severe performance degradation. We present vLLM-Omni, a fully disaggregated serving system for any-to-any models. vLLM-Omni features a novel stage abstraction that enables users to decompose complex any-to-any architectures into interconnected stages represented as a graph, and a disaggregated stage execution backend that optimizes resource utilization and throughput across stages. Each stage is independently served by an LLM or diffusion engine with per-stage request batching, flexible GPU allocation, and unified inter-stage connectors for data routing. Experimental results demonstrate that vLLM-Omni reduces job completion time (JCT) by up to 91.4% compared to baseline methods. The code is publicly available at https://github.com/vllm-project/vllm-omni.
💡 Research Summary
The paper addresses a pressing problem in the emerging field of any‑to‑any multimodal models, which aim to accept and generate across text, image, video, and audio modalities within a single end‑to‑end architecture. While such models have demonstrated impressive capabilities, existing serving frameworks (e.g., vLLM, SGLang for autoregressive LLMs, and diffusion‑specific servers for DiT) are each specialized for a single generation paradigm. Consequently, developers must manually stitch together multiple model components—often two or more autoregressive LLMs, diffusion transformers, and modality‑specific decoders—leading to fragmented pipelines, inefficient resource usage, and severe performance degradation.
To solve this, the authors introduce vLLM‑Omni, a fully disaggregated serving system that treats each component of a multimodal pipeline as an independent “stage.” The key contributions are:
- Stage Graph Abstraction – Users describe a model as a directed acyclic graph (DAG) where nodes are stages (AR LLM, DiT, CNN, etc.) and edges are user‑defined transfer functions that transform intermediate data (tokens, embeddings, KV caches) between stages. This representation captures arbitrary multimodal pipelines, from the simple “Thinker‑Talker” architecture (two sequential LLMs) to more complex AR + DiT hybrids.
- Disaggregated Execution Backend – An orchestrator schedules incoming requests across stages. Each stage runs on its own execution engine: the familiar vLLM engine for autoregressive stages and a dedicated diffusion engine for DiT stages. Per‑stage request batching maximizes GPU utilization, while developers can explicitly allocate GPUs, memory budgets, and parallelism strategies (tensor‑parallel, pipeline‑parallel) per stage. This fine‑grained resource control eliminates the “one‑size‑fits‑all” limitation of existing servers.
- Unified Connector for Data Transfer – The system provides a unified connector that abstracts CPU‑GPU copies, shared memory, and RDMA transfers. It also allows custom transformation logic (e.g., token → latent, spectrogram → waveform) to be injected, enabling seamless flow of heterogeneous data without hand‑crafted glue code.
- Comprehensive Evaluation – Experiments on representative any‑to‑any models—Qwen‑3‑Omni (Thinker‑Talker), GLM‑Image (AR + DiT), and BAGEL (Mixture‑of‑Transformers + DiT)—show that vLLM‑Omni reduces job completion time (JCT) by up to 91.4% compared with baseline single‑stage serving. GPU memory utilization and compute throughput improve by roughly 1.8×, and scaling studies demonstrate near‑linear throughput gains when adding more stage‑specific engines.
- Limitations and Future Work – Current diffusion support focuses on image/video generation; audio‑oriented diffusion models and newer multimodal tokenizers are not yet integrated. Moreover, as stage graphs become large, orchestration overhead may increase, suggesting a need for automated graph optimization (stage merging, pipeline re‑balancing) in future versions.
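To make the stage-graph idea concrete, here is a minimal Python sketch of a DAG of stages connected by user-defined transfer functions, executed along a linear "Thinker-Talker" path. The names (`Stage`, `StageGraph`, `run_linear`) and the per-stage `gpus` field are illustrative assumptions for this sketch, not the actual vLLM-Omni API; each stage's `run` callable stands in for a call into a real LLM or diffusion engine.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# NOTE: hypothetical sketch of the stage-graph abstraction described in the
# paper; class and method names do not come from the vLLM-Omni codebase.

@dataclass
class Stage:
    name: str
    engine: str                       # e.g. "ar_llm" or "diffusion"
    run: Callable[[object], object]   # stand-in for the stage's engine call
    gpus: int = 1                     # per-stage GPU allocation (illustrative)

@dataclass
class StageGraph:
    stages: Dict[str, Stage] = field(default_factory=dict)
    # edge (src, dst) -> transfer function that converts intermediate data
    # (tokens, embeddings, KV caches) between the two stages
    edges: Dict[Tuple[str, str], Callable[[object], object]] = field(default_factory=dict)

    def add_stage(self, stage: Stage) -> None:
        self.stages[stage.name] = stage

    def connect(self, src: str, dst: str,
                transfer: Callable[[object], object]) -> None:
        self.edges[(src, dst)] = transfer

    def run_linear(self, order: List[str], request: object) -> object:
        """Execute a linear path through the graph (a simple DAG special case)."""
        data = request
        prev: Optional[str] = None
        for cur in order:
            if prev is not None:
                data = self.edges[(prev, cur)](data)  # inter-stage connector
            data = self.stages[cur].run(data)
            prev = cur
        return data

# Toy "Thinker-Talker" pipeline: two sequential AR stages plus a transfer
# function that plays the role of a token -> embedding conversion.
graph = StageGraph()
graph.add_stage(Stage("thinker", "ar_llm", run=lambda req: f"plan({req})", gpus=2))
graph.add_stage(Stage("talker", "ar_llm", run=lambda emb: f"speech({emb})", gpus=1))
graph.connect("thinker", "talker", transfer=lambda text: f"emb[{text}]")

print(graph.run_linear(["thinker", "talker"], "hello"))
# -> speech(emb[plan(hello)])
```

In the real system, each node would map to an independently scheduled engine with its own batching and GPU budget; the sketch only illustrates how a DAG of stages plus edge transfer functions can describe an arbitrary any-to-any pipeline.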
In summary, vLLM‑Omni offers the first unified serving framework that natively supports arbitrary any‑to‑any multimodal pipelines. By decoupling stages, providing per‑stage batching and resource allocation, and abstracting data movement, it dramatically improves latency, throughput, and developer productivity. The open‑source release (https://github.com/vllm-project/vllm-omni) positions the system as a practical foundation for the next generation of multimodal AI services.