Efficient Multimodal Large Language Models: A Survey
In the past year, Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in tasks such as visual question answering, visual understanding, and reasoning. However, their large model sizes and high training and inference costs have hindered the widespread application of MLLMs in academia and industry. Studying efficient and lightweight MLLMs therefore has enormous potential, especially for edge-computing scenarios. In this survey, we provide a comprehensive and systematic review of the current state of efficient MLLMs. Specifically, we summarize the timeline of representative efficient MLLMs, the state of research on efficient structures and strategies, and their applications. Finally, we discuss the limitations of current efficient-MLLM research and promising future directions. Please refer to our GitHub repository for more details: https://github.com/lijiannuist/Efficient-Multimodal-LLMs-Survey.
💡 Research Summary
This survey provides a comprehensive overview of recent efforts to make Multimodal Large Language Models (MLLMs) more efficient and lightweight, a crucial step toward democratizing these powerful systems beyond large corporate clouds. The authors first motivate the need for efficiency by quantifying the massive computational demands of state‑of‑the‑art models such as MiniGPT‑v2 (≈800 GPU‑hours for training) and LLaVA‑1.5 (≈18 T FLOPs and 41 GB memory for a single inference on a 336×336 image with a 40‑token prompt). These costs impede academic research, limit deployment on edge devices, and raise concerns about equitable access and privacy.
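As a back-of-the-envelope check (our illustration, not a figure from the survey), the common rule of thumb of roughly 2 FLOPs per parameter per token for a transformer forward pass lands in the same ballpark as the ≈18 T FLOPs quoted above, once the 576 patch tokens of a 336×336 image (24×24 patches at stride 14) are counted alongside the 40-token prompt:

```python
def forward_flops(n_params: float, n_tokens: int) -> float:
    """Rough transformer forward-pass cost: ~2 FLOPs per parameter per token."""
    return 2.0 * n_params * n_tokens

# Illustrative numbers: a 13B-parameter LLM backbone (as in LLaVA-1.5-13B),
# 576 visual patch tokens from a 336x336 image, plus a 40-token prompt.
n_visual = (336 // 14) ** 2            # 24 x 24 = 576 patch tokens
n_tokens = n_visual + 40               # 616 tokens total
tflops = forward_flops(13e9, n_tokens) / 1e12
print(f"{tflops:.1f} TFLOPs")          # ~16 TFLOPs for the LLM backbone alone
```

The estimate covers only the LLM backbone; the vision encoder, projector, and autoregressive decoding push the real cost higher, which is consistent with the quoted figure.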
The paper then structures the literature into six major categories: Architecture, Vision, Language, Training, Data & Benchmarks, and Applications.
- Architecture – Efficient MLLMs retain the classic three-module pipeline (vision encoder, vision-language projector, language model) but replace heavy components with compact alternatives. Table 1 lists 17 representative models, showing a shift from large CLIP-ViT backbones to lightweight encoders such as ViTamin and the multi-encoder designs used by BRAVE, Cobra, and SPHINX-X, and from massive LLMs (e.g., LLaMA-13B) to 1–3 B-parameter models like Phi-2, Gemma, and Qwen-1.5.
- Vision Encoders – Two strategies dominate: (a) lightweight backbones (ViTamin uses a convolutional stem and MobileConv blocks, reaching 82.9 % zero-shot ImageNet accuracy with only 0.4 B parameters) and (b) multi-encoder fusion (BRAVE concatenates K different encoders; SPHINX-X combines DINOv2 and CLIP-ConvNeXt). Experiments reveal that no single encoder dominates across tasks; instead, diverse visual priors improve robustness.
- Vision-Language Projectors – Traditional MLP projectors are replaced by query-based transformers (Q-Former), Perceiver Resamplers, MEQ-Former, and convolution-plus-pooling modules (LDPv2). The Honeybee framework introduces C-Abstractor (convolutional) and D-Abstractor (deformable-attention) to adaptively control the visual token count while preserving local context, a key factor for high-resolution inputs.
- Small Language Models – The survey highlights the adoption of 1–3 B-scale LLMs (Phi-2, Gemma, Qwen-1.5) and the integration of efficiency-focused designs such as Mixture-of-Experts (MoE-LLaVA), state-space backbones (VL-Mamba), and token pruning (FastV). These models process compressed visual tokens with minimal degradation, achieving near-state-of-the-art scores on VQA, TextVQA, and GQA while using far fewer FLOPs.
- Token Compression & Structural Efficiency – High-resolution images are handled by techniques like 2-D Vision Selective Scanning (VSS), LLaVA-UHD, the S2-Wrapper, and specialized token reducers (Mini-Gemini, TinyChart). Token-reduction schemes (SPD, FastV, VTW) cut the quadratic cost of self-attention by shrinking the token sequence, enabling real-time inference on modest hardware.
- Training Strategies – Efficient MLLMs employ a multi-stage pipeline: (a) pre-training on large image-text corpora (LAION-2M, ShareGPT4V-PT) with careful sampling to curb cost; (b) instruction tuning on multimodal instruction sets (LaVIN, HyperLLaVA); (c) diverse training recipes (e.g., SPHINX-X, Cobra) that combine frozen vision backbones with lightweight adapters; and (d) parameter-efficient transfer learning (EAS, MemVP) that reuses large-model weights.
- Data & Benchmarks – The authors catalog pre-training datasets, instruction-tuning corpora, and evaluation suites (VQA-v2, TextVQA, GQA, MME, POPE). They also note domain-specific benchmarks for biomedical imaging (LLaVA-Rad), document understanding (TextHawk, TinyChart), and video comprehension (Video-LLaVA). These benchmarks are essential for quantifying the performance-efficiency trade-off.
- Applications – Efficient MLLMs have been applied to medical image QA (LLaVA-Rad, reporting >85 % accuracy), document and chart analysis (TinyChart, with a 2× speedup), and video understanding (Video-LLaVA). Such use cases demonstrate that compact multimodal models can run on edge devices, reduce latency, and preserve user privacy by avoiding cloud reliance.
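To make the projector discussion concrete, here is a minimal sketch of a pooling-based token abstractor in plain Python, in the spirit of pooling modules like LDPv2 or Honeybee's C-Abstractor. Real abstractors use learned convolutions and attention; this toy version (function name and windowing scheme are our own) only shows how pooling turns an arbitrary number of visual tokens into a fixed, smaller count:

```python
def pool_tokens(tokens, out_len):
    """Average-pool a sequence of token vectors down to out_len tokens.

    tokens:  list of equal-length feature vectors (lists of floats)
    out_len: desired number of output tokens -- the efficiency knob.
    """
    n = len(tokens)
    pooled = []
    for j in range(out_len):
        # Contiguous window of input tokens mapped to output slot j,
        # so local spatial context is preserved.
        lo, hi = j * n // out_len, (j + 1) * n // out_len
        window = tokens[lo:hi]
        dim = len(window[0])
        pooled.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
    return pooled

# 576 dummy 1-D "features" pooled down to 144 tokens (4x compression).
feats = [[float(i)] for i in range(576)]
out = pool_tokens(feats, 144)
print(len(out), out[0])  # 144 [1.5]
```

Shrinking 576 visual tokens to 144 cuts the LLM's sequence length (and hence its attention cost) before a single language-model layer runs, which is why the projector is such a popular place to spend the efficiency budget.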
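Similarly, the token-compression methods surveyed above (FastV and related schemes) share one core move: score visual tokens and drop the unimportant ones before the expensive LLM layers. A minimal sketch, where the salience scores and keep ratio are illustrative inputs of our own choosing, not any paper's exact method:

```python
def prune_visual_tokens(tokens, scores, keep_ratio=0.25):
    """Keep the top fraction of visual tokens by salience score.

    tokens:     list of token payloads (embeddings, here just labels)
    scores:     one salience score per token (e.g., attention received)
    keep_ratio: fraction of tokens to retain -- an illustrative knob.
    """
    assert len(tokens) == len(scores)
    k = max(1, int(len(tokens) * keep_ratio))
    # Indices of the k highest-scoring tokens, re-sorted into their
    # original order so positional structure is preserved.
    top = sorted(sorted(range(len(tokens)),
                        key=lambda i: scores[i], reverse=True)[:k])
    return [tokens[i] for i in top]

# Eight dummy tokens with hand-made salience scores; keep the top half.
toks = [f"v{i}" for i in range(8)]
sal = [0.1, 0.9, 0.2, 0.8, 0.05, 0.7, 0.3, 0.6]
print(prune_visual_tokens(toks, sal, keep_ratio=0.5))  # ['v1', 'v3', 'v5', 'v7']
```

Because self-attention cost grows quadratically with sequence length, halving the visual tokens roughly quarters the attention FLOPs in the layers that follow the pruning point.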
- Limitations & Future Directions – The survey identifies two main gaps: (i) the performance loss that aggressive compression inevitably incurs, and (ii) the lack of standardized efficiency-focused benchmarks. Hardware-software co-design, automated architecture search for multimodal efficiency, and privacy-preserving training (federated learning, encrypted inference) are highlighted as promising research avenues.
In summary, this survey systematically maps the landscape of efficient MLLMs, offering a taxonomy of architectural choices, compression techniques, training regimes, and application domains. It serves as a roadmap for researchers and practitioners aiming to build cost‑effective, high‑performing multimodal AI systems that can be deployed beyond the confines of large data‑center GPUs.