HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices
Current multimodal large language models possess strong perceptual and reasoning capabilities; however, their high computational and memory requirements make them difficult to deploy directly in on-device environments. While small-parameter models are progressively endowed with strong general capabilities, standard Vision Transformer (ViT) encoders remain a critical bottleneck, suffering from excessive latency and memory consumption when processing high-resolution inputs. To address these challenges, we introduce HyperVL, an efficient multimodal large language model tailored for on-device inference. HyperVL adopts an image-tiling strategy to cap peak memory usage and incorporates two novel techniques: (1) a Visual Resolution Compressor (VRC) that adaptively predicts optimal encoding resolutions to eliminate redundant computation, and (2) Dual Consistency Learning (DCL), which aligns multi-scale ViT encoders within a unified framework, enabling dynamic switching between visual branches under a shared LLM. Extensive experiments demonstrate that HyperVL achieves state-of-the-art performance among models of comparable size across multiple benchmarks. Furthermore, it significantly reduces latency and power consumption on real mobile devices, demonstrating its practicality for on-device multimodal inference.
💡 Research Summary
The paper introduces HyperVL, an efficient and dynamic Multimodal Large Language Model (MLLM) specifically designed for on-device inference on edge devices like smartphones. It addresses the critical bottleneck in deploying MLLMs at the edge: the excessive latency and memory consumption of standard Vision Transformer (ViT) encoders when processing high-resolution visual inputs, which are essential for tasks like UI understanding and grounding.
HyperVL’s core innovation lies in a multi-faceted efficiency framework. First, it employs an image-tiling strategy to split high-resolution images into smaller, independently encoded patches, capping peak memory usage. Building on this, it introduces two novel techniques:
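The tiling idea can be illustrated with a minimal sketch. This is not the paper's implementation: the tile size of 448 pixels and the simple non-overlapping grid are illustrative assumptions; the key point is that each tile is encoded independently, so peak memory scales with one tile rather than the full image.

```python
import numpy as np

def tile_image(image, tile_size=448):
    """Split a high-resolution image array (H, W, C) into fixed-size tiles.

    Each tile can be passed to the ViT encoder independently, capping
    peak activation memory at the cost of one tile. Edge tiles may be
    smaller than tile_size; a real pipeline would typically pad them.
    """
    h, w = image.shape[:2]
    tiles = []
    for top in range(0, h, tile_size):
        for left in range(0, w, tile_size):
            tiles.append(image[top:top + tile_size, left:left + tile_size])
    return tiles
```

For an 896x896 input and a 448-pixel tile, this yields a 2x2 grid of four tiles, each encoded on its own.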
- Visual Resolution Compressor (VRC): A lightweight, plug-and-play module that predicts an optimal image compression ratio (from 10% to 100% of original size) based on the input image’s information density. This adaptive resolution scaling eliminates redundant computation for semantically simple images, significantly reducing visual encoding latency and the subsequent LLM’s processing load.
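A VRC-style module can be sketched as a small predictor that maps pooled image features to a ratio in the paper's stated [0.1, 1.0] range and downscales the image before ViT encoding. The conv-plus-linear head below is an assumption for illustration; the paper does not specify this architecture here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualResolutionCompressor(nn.Module):
    """Hedged sketch of a VRC-style module (architecture assumed).

    A lightweight head predicts a resolution ratio in [0.1, 1.0] from
    the input image; the image is then resized by that ratio before
    visual encoding, so simple images cost fewer ViT FLOPs and fewer
    downstream LLM tokens.
    """

    def __init__(self, in_channels=3, hidden=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, image):
        # Map the sigmoid output into the [0.1, 1.0] ratio range.
        ratio = 0.1 + 0.9 * self.head(self.features(image))
        h, w = image.shape[-2:]
        new_h = max(1, int(h * ratio.item()))
        new_w = max(1, int(w * ratio.item()))
        resized = F.interpolate(image, size=(new_h, new_w),
                                mode="bilinear", align_corners=False)
        return resized, ratio
```

In a full pipeline the resized image (or its tiles) would then feed the ViT encoder, with the ratio acting as a per-image compute budget.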
- Dual Consistency Learning (DCL): This framework aligns two ViT encoders of different capacities (a smaller SigLIP2-Base and a larger SigLIP2-Large) with a single, shared LLM backbone (Qwen3-1.7B). Through alternating training and semantic consistency distillation (using a KL-divergence loss), DCL enables the two visual branches to produce semantically consistent outputs. This allows the system to dynamically switch between the high-accuracy branch (for complex tasks) and the lightweight branch (for resource-constrained scenarios) based on device capability or latency budget.
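The semantic-consistency term of DCL can be sketched as a standard KL-divergence distillation loss between the two branches' output distributions. The distillation direction (small branch toward the frozen large branch) and the temperature are assumptions; the paper may define the loss differently.

```python
import torch
import torch.nn.functional as F

def consistency_kl_loss(logits_small, logits_large, temperature=1.0):
    """Hedged sketch of a semantic-consistency distillation loss.

    Pushes the lightweight branch's output distribution toward the
    high-capacity branch's (treated as the teacher and detached from
    the graph), so the two visual branches stay interchangeable under
    the shared LLM.
    """
    p_large = F.softmax(logits_large.detach() / temperature, dim=-1)
    log_p_small = F.log_softmax(logits_small / temperature, dim=-1)
    return F.kl_div(log_p_small, p_large, reduction="batchmean") * temperature ** 2
```

When the two branches already agree, the loss is zero; gradients flow only into the lightweight branch.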
The model architecture integrates the VRC, the dual ViT encoders, a vision-language projector (which uses pixel shuffle to compress visual token length by 4x), and the shared LLM.
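The 4x token compression via pixel shuffle amounts to merging each 2x2 neighborhood of visual tokens into a single token with four times the channel width, which a projector then maps into the LLM embedding space. A minimal sketch of that reshaping step (the exact projector design is not specified here):

```python
import torch

def pixel_shuffle_compress(tokens, grid_h, grid_w, factor=2):
    """Hedged sketch of pixel-shuffle token compression.

    Merges each factor x factor neighborhood of visual tokens into one
    token with factor**2 times the channels, cutting the token count by
    factor**2 (4x for factor=2) before the vision-language projector.
    """
    b, n, c = tokens.shape
    assert n == grid_h * grid_w, "tokens must form a grid_h x grid_w grid"
    x = tokens.view(b, grid_h, grid_w, c)
    x = x.view(b, grid_h // factor, factor, grid_w // factor, factor, c)
    # Bring the factor x factor neighborhood next to the channel dim.
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(b, (grid_h // factor) * (grid_w // factor),
                  factor * factor * c)
```

For a 4x4 token grid with channel width 8, the output is a 2x2 grid of tokens with channel width 32, i.e. a quarter of the sequence length at the same total information volume.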
Comprehensive experiments demonstrate HyperVL’s effectiveness. On capability benchmarks (OpenCompass, MMB, MME, DocVQA, etc.), HyperVL achieves state-of-the-art performance among comparable-sized open-source models (e.g., Qwen3-VL-2B, InternVL2-2B), particularly excelling in on-device-centric tasks. More importantly, on-device system evaluations on real commercial mobile phones show dramatic improvements in efficiency. HyperVL significantly reduces inference latency (up to 74% faster) and memory consumption (up to 40% lower) compared to baseline models, while also lowering power consumption. The VRC is further validated as a general acceleration tool that can be integrated into other pre-trained MLLMs.
In conclusion, HyperVL presents a holistic solution for efficient multimodal inference on edge devices. By combining adaptive resolution selection, a dynamically switchable dual-encoder design, and memory-aware tiling, it successfully balances high performance with the stringent computational constraints of mobile deployment, marking a significant step towards practical on-device multimodal AI.