Dynamic Pyramid Network for Efficient Multimodal Large Language Model
Multimodal large language models (MLLMs) have demonstrated impressive performance on various vision-language (VL) tasks, but their expensive computations still limit real-world applications. To address this issue, recent efforts aim to compress the visual features to reduce the computational costs of MLLMs. However, direct visual compression methods, e.g., efficient projectors, inevitably destroy the visual semantics in MLLMs, especially on difficult samples. To overcome this shortcoming, we propose a novel dynamic pyramid network (DPN) for efficient MLLMs. Specifically, DPN formulates the MLLM as a hierarchical structure where visual features are gradually compressed with increasing depth. In this case, even with a high compression ratio, fine-grained visual information can still be perceived in shallow layers. To maximize the benefit of DPN, we further propose an innovative Dynamic Pooling Experts (DPE) module that can dynamically choose the optimal visual compression rate according to the input features. With this design, harder samples are assigned more computation, thus preserving model performance. To validate our approach, we conduct extensive experiments on two popular MLLMs and ten benchmarks. Experimental results show that DPN can save up to 56% average FLOPs on LLaVA while achieving a further +0.74% performance gain. Besides, the generalization ability of DPN is also validated on an existing high-resolution MLLM, LLaVA-HR. The source code will be released at https://github.com/aihao2000/DPN-LLaVA.
💡 Research Summary
Multimodal large language models (MLLMs) have achieved impressive results on a wide range of vision‑language tasks, but their inference cost remains a major obstacle for real‑world deployment. The dominant paradigm inserts a large number of visual tokens—produced by a pretrained image encoder such as ViT—into a large language model (LLM) and fine‑tunes the whole system. Because visual tokens far outnumber textual tokens (e.g., 576 vs. ~30 in LLaVA), the self‑attention and feed‑forward layers of the LLM dominate the FLOPs, especially when high‑resolution images are used.
Existing efficiency approaches fall into two categories. The first compresses visual features before they enter the LLM, using fixed‑rate projectors or resamplers. While this reduces the token count, it discards fine‑grained visual semantics (e.g., OCR text, small objects) and applies the same compression ratio to every sample, which harms hard cases. The second family adopts sparse MoE structures inside the LLM, but typically requires extra multi‑stage training and still incurs considerable overhead.
The paper proposes a Dynamic Pyramid Network (DPN) that re‑thinks the architecture of an MLLM as a hierarchical pyramid. Instead of a single, static compression stage, DPN embeds visual compression modules (pooling layers) directly into several intermediate Transformer layers of the LLM. In shallow layers the visual tokens are kept almost intact, preserving fine details; in deeper layers they are progressively pooled, dramatically reducing the token count for high‑level reasoning. This mirrors successful pyramid designs in vision (e.g., FPN, Swin‑Transformer) and aligns the granularity of visual information with the depth of language processing.
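The progressive schedule described above can be sketched with a toy token-count calculation. The pooling layers and a uniform 2×2 kernel below are illustrative assumptions for a 24×24 visual token grid (576 tokens, as in LLaVA); the paper's actual kernels are chosen per sample by the router.

```python
# Illustrative token schedule for a 32-layer LLM with pyramid pooling
# inserted at layers 8, 16, and 24 (kernel size here is a fixed 2x2
# example; in DPN it is chosen dynamically per sample).

def token_schedule(n_layers=32, grid=24, pool_layers=(8, 16, 24), kernel=2):
    """Return the visual token count seen by each Transformer layer."""
    counts = []
    side = grid  # 24 x 24 = 576 visual tokens, as in LLaVA
    for layer in range(n_layers):
        if layer in pool_layers:
            side = max(1, side // kernel)  # pooling shrinks the token grid
        counts.append(side * side)
    return counts

counts = token_schedule()
print(counts[0], counts[8], counts[16], counts[24])  # 576 144 36 9
```

Shallow layers thus retain the full 576-token grid for fine-grained perception, while deep layers reason over an order of magnitude fewer tokens.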
A key component is the Dynamic Pooling Experts (DPE). DPE is a Mixture‑of‑Experts (MoE) layer where each expert corresponds to a different pooling kernel (e.g., 1×1, 1×2, 2×2). A learnable routing token r is concatenated to the visual‑textual input; a small MLP router processes r and outputs a softmax distribution over the experts. During inference the expert with the highest probability is selected, and its pooling operation is applied to the visual tokens. This mechanism enables sample‑dependent compression: easy images receive aggressive pooling, while difficult images retain more tokens.
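A minimal NumPy sketch of this routing-then-pooling step is given below. The single-matrix "router" and all tensor shapes are simplifications (the paper uses a small MLP router on a learnable routing token); only the overall flow, softmax routing over pooling experts followed by hard selection at inference, follows the description above.

```python
import numpy as np

# Sketch of Dynamic Pooling Experts (DPE). The router here is a single
# linear map for brevity; the paper uses a small MLP on a routing token.

KERNELS = [(1, 1), (1, 2), (2, 2)]  # candidate pooling experts

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def max_pool(grid, kh, kw):
    """Non-overlapping max pooling over an (H, W, d) token grid."""
    H, W, d = grid.shape
    out = grid[:H - H % kh, :W - W % kw].reshape(H // kh, kh, W // kw, kw, d)
    return out.max(axis=(1, 3))

def dpe_forward(visual_grid, routing_token, W_router):
    """Route the sample to one pooling expert and apply its pooling."""
    probs = softmax(W_router @ routing_token)   # distribution over experts
    kh, kw = KERNELS[int(np.argmax(probs))]     # hard selection at inference
    return max_pool(visual_grid, kh, kw), probs

rng = np.random.default_rng(0)
grid = rng.standard_normal((24, 24, 8))         # 576 visual tokens, dim 8
r = rng.standard_normal(8)                      # routing token (learnable)
W = rng.standard_normal((len(KERNELS), 8))      # router weights
pooled, probs = dpe_forward(grid, r, W)
print(pooled.shape, probs.round(3))
```

During training the soft probabilities keep the router differentiable; the hard argmax shown here corresponds to the inference-time behavior described in the text.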
To prevent the router from collapsing to a single expert, the authors introduce a routing loss L_r, a hinge function that penalizes deviation from a target average compression rate t (set to 1.5 in experiments). The overall training objective combines the standard autoregressive language loss L_a with λ·L_r (λ=0.01). This encourages the router to respect the desired compression budget while still allowing diversity among experts.
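One plausible instantiation of this objective is sketched below. The exact hinge used in the paper may differ; here the loss penalizes the router only when its expected compression rate falls below the target t, which prevents collapse onto the no-compression expert (the language loss L_a already discourages over-compression). The per-expert rates are hypothetical values.

```python
# Assumed hinge form of the routing loss L_r: penalize an expected
# compression rate below the target t = 1.5, weighted by lambda = 0.01.

RATES = [1.0, 2.0, 4.0]   # hypothetical token-reduction factor per expert
TARGET = 1.5              # target average compression rate t
LAMBDA = 0.01             # weight of L_r in the total objective

def routing_loss(expert_probs):
    expected_rate = sum(p * c for p, c in zip(expert_probs, RATES))
    return max(0.0, TARGET - expected_rate)

def total_loss(lm_loss, expert_probs):
    # Overall objective: L_a + lambda * L_r
    return lm_loss + LAMBDA * routing_loss(expert_probs)

print(routing_loss([1.0, 0.0, 0.0]))    # collapsed router is penalized: 0.5
print(routing_loss([0.5, 0.25, 0.25]))  # mixed routing above target: 0.0
```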
The FLOPs analysis treats each Transformer layer’s cost as the sum of multi‑head attention (MHA) and feed‑forward network (FFN) operations. Because DPN changes the token count n_i at each layer, the total FLOPs become Σ_i (4 n_i d² + 2 n_i² d + 2 n_i d m), where n_i is the number of tokens at layer i, d is the hidden size, and m is the FFN intermediate size. Empirically, the average token count per layer drops dramatically, yielding up to 56% reduction in FLOPs compared with the baseline LLaVA.
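This accounting can be checked numerically. The helper below applies the formula above to a baseline schedule (all layers see the full token count) and to an assumed pyramid schedule; the model sizes and schedule are illustrative (LLaMA-7B-like d and m), so the resulting percentage is indicative rather than the paper's measured 56%.

```python
# Per-layer FLOPs following the decomposition in the text:
# MHA ~ 4*n*d^2 + 2*n^2*d, FFN ~ 2*n*d*m, for n tokens, hidden size d,
# FFN intermediate size m. Sizes and schedule are illustrative.

def layer_flops(n, d, m):
    return 4 * n * d**2 + 2 * n**2 * d + 2 * n * d * m

def total_flops(token_counts, d=4096, m=11008):  # LLaMA-7B-like sizes
    return sum(layer_flops(n, d, m) for n in token_counts)

TEXT = 30  # ~30 textual tokens, as in the LLaVA example above
baseline = total_flops([576 + TEXT] * 32)
pyramid = total_flops([576 + TEXT] * 8 + [144 + TEXT] * 8
                      + [36 + TEXT] * 8 + [9 + TEXT] * 8)
print(f"FLOPs saved: {1 - pyramid / baseline:.0%}")
```

Under this toy schedule the savings land in the same ballpark as the reported figure, with most of the cost concentrated in the shallow, uncompressed layers.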
Experiments are conducted on two popular MLLMs—LLaVA (7B and 13B) and its high‑resolution variant LLaVA‑HR—using the same visual‑language alignment weights and fine‑tuning on the 665K instruction dataset. DPN layers replace the 8th, 16th, and 24th (or 10th, 20th, and 30th for the larger model) Transformer blocks. The method is evaluated on ten benchmarks covering object recognition, spatial reasoning, OCR, and broader multimodal understanding (VQA‑v2, GQA, ScienceQA, TextVQA, MME, MM‑VET, MMMU, SEED, POPE, etc.). Results show that DPN consistently saves FLOPs (up to 56% on average) while achieving modest performance gains (+0.5% to +0.8% absolute improvement). Notably, in the high‑resolution LLaVA‑HR‑X setting, DPN delivers a 1.4× speed‑up and a +0.62% gain over the strong baseline. The routing mechanism successfully assigns larger compression rates to easy samples and smaller rates to hard ones, as illustrated by expert activation statistics.
Compared with prior efficient projectors and MoE‑based methods, DPN requires only a single additional training stage (the instruction‑tuning phase) and reuses the pretrained visual projector, avoiding costly multi‑stage MoE training. The Dynamic Pooling Experts are lightweight (simple max‑pooling) and the router is a tiny MLP, making the overall overhead negligible.
In summary, the contributions are:
- A novel Dynamic Pyramid Network that embeds hierarchical visual compression inside the LLM, preserving fine‑grained visual cues while reducing computation.
- Dynamic Pooling Experts with a routing token and a dedicated routing loss, enabling sample‑adaptive compression rates.
- Demonstrated FLOPs reductions of up to 56% on ten diverse benchmarks with equal or improved accuracy, and seamless integration into existing MLLMs without extra pre‑training.
The work opens a practical path toward deploying large multimodal models on resource‑constrained platforms, and suggests future extensions such as more sophisticated experts (e.g., attention‑based downsampling), multi‑modal routing, or automated tuning of the target compression rate.