Native LLM and MLLM Inference at Scale on Apple Silicon
The growing adoption of Apple Silicon for machine learning development has created demand for efficient inference solutions that leverage its unique unified memory architecture. However, existing tools either lack native optimization (PyTorch MPS) or focus solely on text models, leaving multimodal workloads underserved. We present vllm-mlx, a framework for efficient LLM and MLLM inference on Apple Silicon built natively on MLX. For text models, we achieve 21% to 87% higher throughput than llama-cpp across models ranging from Qwen3-0.6B to Nemotron-30B, while providing continuous batching that scales to 4.3x aggregate throughput at 16 concurrent requests. For multimodal models, we introduce content-based prefix caching that eliminates redundant vision encoding by identifying identical images through content hashing, regardless of input format. Our evaluation on Apple M4 Max demonstrates throughput of up to 525 tokens per second on text models and 28x speedup on repeated image queries, reducing multimodal latency from 21.7 seconds to under 1 second. Video analysis with up to 64 frames achieves 24.7x cache speedup. We release our implementation as open source to support efficient inference on consumer Apple hardware.
💡 Research Summary
The paper introduces vllm‑mlx, a native inference framework for large language models (LLMs) and multimodal language models (MLLMs) on Apple Silicon devices. The authors observe that existing solutions either lack deep integration with Apple’s unified memory architecture (e.g., PyTorch MPS) or are limited to text‑only workloads (e.g., llama‑cpp). Moreover, the official vLLM‑metal backend provides continuous batching but does not support vision‑language models or any form of vision‑side caching. To fill this gap, the authors build on Apple’s MLX library, which offers true zero‑copy tensor operations, lazy evaluation, and native quantization, thereby exploiting the shared CPU‑GPU‑Neural Engine memory pool without costly data transfers.
Key technical contributions
- Continuous batching scheduler – Implemented as Algorithm 1, the scheduler admits new requests at token boundaries, dynamically expands the active batch up to a configurable maximum, and removes completed sequences immediately. This design maximizes GPU utilization for multi‑user serving while preserving low latency for each request. Benchmarks on an M4 Max (128 GB unified memory) show throughput scaling from 441 tokens/s (single request) to 1,642 tokens/s (16 concurrent requests) for the Qwen3‑0.6B model—a 3.7× increase. Larger models still benefit, albeit with diminishing returns due to memory‑bandwidth saturation.
- Content‑based prefix caching for multimodal inference – The framework computes a SHA‑256 hash over the raw pixel data of each image (or decoded video frame), making the cache agnostic to input representation (URL, base64, file path). Cached entries store both the vision encoder’s embedding and the KV‑cache state at the point where the image is consumed. When a request contains an image already present in the cache, the vision encoder is bypassed entirely (Algorithm 3). This eliminates the typical 1.5–4 seconds of vision processing per turn, achieving up to a 28× speed‑up (latency reduced from 21.7 s to 0.78 s) for repeated image queries. The same principle extends to video, where identical frames map to the same cache entries, delivering up to a 24.7× speed‑up for 64‑frame analyses.
- Text prefix caching – For purely textual workloads that share a common system prompt or other static prefix, the KV‑cache for that prefix is reused across requests, cutting the time‑to‑first‑token (TTFT) by up to 5.8× for 512‑token shared prefixes.
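The continuous batching loop described above can be sketched as follows. This is an illustrative skeleton, not the framework's actual scheduler API: `Request`, `decode_step`, and `max_batch` are stand-ins, and the per-sequence loop would be a single batched forward pass in practice.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    prompt: str
    max_tokens: int
    generated: list = field(default_factory=list)


def decode_step(req: Request) -> str:
    # Stand-in for one model forward pass producing one token.
    return f"tok{len(req.generated)}"


def continuous_batching(waiting: deque, max_batch: int = 4) -> list:
    """Admit requests at token boundaries; evict finished ones immediately."""
    active: list = []
    finished: list = []
    while waiting or active:
        # Admit new requests up to the configured batch limit.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        # One decode step for every active sequence (batched in practice).
        for req in active:
            req.generated.append(decode_step(req))
        # Remove completed sequences so their slots free up at once.
        finished += [r for r in active if len(r.generated) >= r.max_tokens]
        active = [r for r in active if len(r.generated) < r.max_tokens]
    return finished


done = continuous_batching(deque(Request(f"p{i}", max_tokens=i + 1) for i in range(6)))
print(len(done))  # 6: all requests complete, short ones freed their slots early
```

The key property is that a short request finishing at step *t* frees its batch slot for a waiting request at step *t* + 1, rather than holding the whole batch until the longest sequence completes.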
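The content-based cache key can be sketched like this. The cache shape and `encode_image` helper are assumptions for illustration; the point is that the hash is computed over decoded pixel bytes, so URL, base64, and file-path inputs that decode to the same image all map to one cache entry.

```python
import base64
import hashlib


def content_key(pixel_bytes: bytes) -> str:
    # Hash the decoded pixel data, not the input representation.
    return hashlib.sha256(pixel_bytes).hexdigest()


cache: dict = {}


def encode_image(pixel_bytes: bytes) -> dict:
    """Run the (expensive) vision encoder only on a cache miss."""
    key = content_key(pixel_bytes)
    if key not in cache:
        cache[key] = {"embedding": f"embed({key[:8]})"}  # stand-in for encoder output
    return cache[key]


# Toy "image": the same raw pixels delivered in two different formats.
pixels = bytes(range(256)) * 4
as_base64 = base64.b64encode(pixels)

first = encode_image(pixels)
second = encode_image(base64.b64decode(as_base64))  # different format, same pixels
print(first is second)  # True: the second call skips vision encoding
```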
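Text prefix caching follows the same pattern, keyed on the shared prefix rather than pixel data. A minimal sketch, with `prefill` standing in for the expensive prompt processing that builds KV state:

```python
import hashlib

kv_cache: dict = {}
prefill_calls = 0


def prefill(text: str) -> str:
    """Stand-in for the expensive prefill that builds KV state for a prefix."""
    global prefill_calls
    prefill_calls += 1
    return f"kv[{len(text)} chars]"


def serve(system_prompt: str, user_msg: str) -> str:
    # Reuse the KV state of the shared static prefix across requests.
    key = hashlib.sha256(system_prompt.encode()).hexdigest()
    if key not in kv_cache:
        kv_cache[key] = prefill(system_prompt)
    return f"{kv_cache[key]} + decode({user_msg!r})"


system = "You are a helpful assistant. " * 20  # long shared prefix
serve(system, "first question")
serve(system, "second question")
print(prefill_calls)  # 1: the prefix was prefilled once and reused
```

Since TTFT is dominated by prefill, reusing the prefix KV state is what yields the reported up-to-5.8× TTFT reduction for long shared prefixes.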
Evaluation
All experiments use 4‑bit quantization (Q4_K_M for GGUF models, 4‑bit for MLX). The authors benchmark a suite of models ranging from 0.6 B to 30 B parameters, including Qwen3, Llama 3.2, Gemma 3, GLM‑4, and Nemotron. Results (Table 1) show vllm‑mlx surpassing llama‑cpp by 21%–87% in tokens‑per‑second throughput across the board, with the most pronounced gains on smaller models (1.87× on Qwen3‑0.6B). Compared to the official vLLM‑metal and mlx‑lm baselines, vllm‑mlx consistently delivers higher aggregate throughput thanks to its optimized scheduler.
Multimodal experiments (Table 2) demonstrate that a single cached image reduces latency from 21.7 s to 1.15 s on the second turn and to 0.78 s on subsequent turns. Video benchmarks (Table 3) illustrate the trade‑off between frame rate, total processing time, token throughput, and memory consumption, confirming that caching remains effective even as frame counts rise.
Ablation studies isolate the contributions of vision‑embedding caching (7.8× speed‑up) and KV‑cache reuse (2.4×), which together yield a 19× overall reduction in multimodal latency. Additional analyses show that higher image resolutions and larger video frame counts amplify the benefits of caching, while the LRU eviction policy keeps memory usage bounded (default 512 MB).
Discussion and Limitations
The authors attribute the superiority over llama‑cpp to three factors: (1) true zero‑copy operations enabled by MLX’s unified‑memory design, (2) lazy evaluation that fuses kernels and reduces launch overhead, and (3) the continuous batching scheduler. They acknowledge that the current LRU cache may evict useful entries under heavy multimodal workloads and that KV‑cache reuse is limited to identical prefixes, not similar or partially overlapping prompts. Suggested future work includes more sophisticated cache eviction (e.g., frequency‑based), multi‑engine scheduling (CPU + GPU + Neural Engine), and fuzzy‑hash or similarity‑based prefix matching.
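The lazy-evaluation factor (2) can be illustrated with a toy deferred-computation graph: operations only record what to do, and nothing executes until the result is forced, which is what gives a backend like MLX the opportunity to fuse kernels and amortize launch overhead. This is a pure-Python illustration of the concept, not MLX's implementation.

```python
class Lazy:
    """A node in a deferred-computation graph; nothing runs until force()."""

    def __init__(self, fn, *deps):
        self.fn, self.deps, self._value = fn, deps, None

    def force(self):
        if self._value is None:
            # Force dependencies first, then run this node's operation.
            args = [d.force() if isinstance(d, Lazy) else d for d in self.deps]
            self._value = self.fn(*args)
        return self._value


# Building the graph does no arithmetic; only force() triggers evaluation.
a = Lazy(lambda: [1, 2, 3])
b = Lazy(lambda xs: [x * 2 for x in xs], a)
c = Lazy(lambda xs: sum(xs), b)
print(c.force())  # 12
```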
Impact
vllm‑mlx provides the first open‑source, production‑ready solution that unifies high‑throughput text inference, continuous batching, and efficient multimodal serving on consumer‑grade Apple Silicon. By delivering OpenAI‑compatible APIs, it enables privacy‑preserving local AI agents, edge deployments, and cost‑effective alternatives to cloud‑based inference. The framework’s release (GitHub) invites the community to extend and adapt it for broader model families and emerging Apple hardware.