Mobile-VideoGPT: Fast and Accurate Model for Mobile Video Understanding

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Video understanding models often struggle with high computational requirements, extensive parameter counts, and slow inference speed, making them inefficient for practical use. To tackle these challenges, we propose Mobile-VideoGPT, an efficient multimodal framework designed to operate with fewer than a billion parameters. Unlike traditional video large multimodal models (LMMs), Mobile-VideoGPT consists of lightweight dual visual encoders, efficient projectors, and a small language model (SLM), enabling real-time throughput. To further improve efficiency, we present an Attention-Based Frame Scoring mechanism to select the key frames, along with an efficient token projector that prunes redundant visual tokens and preserves essential contextual cues. We evaluate our model across six well-established video understanding benchmarks (e.g., MVBench, EgoSchema, NextQA, and PerceptionTest). Our results show that Mobile-VideoGPT-0.5B can generate up to 46 tokens per second while outperforming existing state-of-the-art 0.5B-parameter models by 6 points on average with 40% fewer parameters and more than 2x higher throughput. Our code and models are publicly available at: https://github.com/Amshaker/Mobile-VideoGPT.


💡 Research Summary

The paper introduces Mobile‑VideoGPT, a lightweight multimodal framework designed for real‑time video understanding on resource‑constrained devices such as smartphones and edge AI platforms. While existing video large multimodal models (LMMs) often require tens of billions of parameters and large GPU memory footprints, Mobile‑VideoGPT operates with fewer than one billion parameters, a model size of roughly 1 GB in FP16, and a memory demand of about 3 GB VRAM, making it suitable for deployment on devices like the NVIDIA Jetson Orin Nano.
The architecture consists of four main components. First, a CLIP‑based image encoder extracts high‑quality spatial features from every frame of the input video. Second, an Attention‑Based Frame Scoring module computes a spatial attention matrix over the flattened token representations, aggregates token‑level attention scores into a per‑frame importance vector, and selects the top‑K frames (K = T/2, with T = 16 in experiments). This key‑frame selection dramatically reduces redundant computation while preserving the most informative visual content. Third, the selected frames are fed into a lightweight VideoMamba encoder, which captures temporal dynamics using linear‑complexity operations, enabling efficient processing of longer video sequences. Fourth, an Efficient Token Projector (ET‑Proj) compresses the high‑dimensional visual tokens from both the image and video encoders. ET‑Proj applies a feed‑forward network, followed by an Adaptive Pooling layer that reduces spatial resolution (H×W → Hr×Wr), and finally a convolution‑based positional encoder with a skip connection to retain spatial‑temporal positional cues. The resulting compact visual token sequence is then consumed by a small language model (SLM), such as Qwen2‑0.5B, to generate answers, captions, or other textual outputs.
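The frame-scoring and token-compression steps above can be sketched in NumPy. This is an illustrative sketch, not the paper's implementation: the rule for aggregating token-level attention into a frame score (peak attention received by any token) and the evenly-divisible pooling layout are assumptions made here for simplicity.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def select_key_frames(tokens, keep_ratio=0.5):
    """Score each frame via spatial self-attention over its tokens, keep the top K.

    tokens: (T, N, D) array of per-frame spatial token features.
    """
    T, N, D = tokens.shape
    scores = np.empty(T)
    for t in range(T):
        x = tokens[t]                                   # (N, D)
        attn = softmax(x @ x.T / np.sqrt(D), axis=-1)   # (N, N) spatial attention
        # Aggregate token-level attention into one frame-importance score.
        # Peak attention received by any token is one plausible rule (an assumption).
        scores[t] = attn.sum(axis=0).max()
    k = max(1, int(T * keep_ratio))                     # K = T/2 in the paper's setup
    idx = np.sort(np.argsort(scores)[-k:])              # keep temporal order
    return tokens[idx], idx

def adaptive_pool_tokens(tokens, hr, wr):
    """Average-pool the H×W token grid down to Hr×Wr (ET-Proj's pooling step)."""
    T, N, D = tokens.shape
    h = w = int(np.sqrt(N))                 # assumes a square token grid
    assert h % hr == 0 and w % wr == 0      # simple evenly-divisible case
    grid = tokens.reshape(T, hr, h // hr, wr, w // wr, D)
    return grid.mean(axis=(2, 4)).reshape(T, hr * wr, D)

# Toy run: 16 frames, a 4x4 token grid, 8-dim features.
rng = np.random.default_rng(0)
video = rng.standard_normal((16, 16, 8))
kept, idx = select_key_frames(video)        # half the frames survive
compact = adaptive_pool_tokens(kept, 2, 2)  # 4x fewer tokens per kept frame
```

Dropping half the frames and pooling each surviving 4x4 grid to 2x2 shrinks the visual token count eightfold before anything reaches the language model, which is where most of the latency savings come from.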
Training proceeds in three stages. (1) The image token projector is pretrained together with the CLIP image encoder to learn effective spatial token compression. (2) The video token projector and VideoMamba encoder are subsequently pretrained to integrate temporal information. (3) In the final instruction‑tuning stage, both projectors become learnable, and LoRA (Low‑Rank Adaptation) is applied to the language model, allowing the system to adapt to multimodal instruction data with minimal additional parameters.
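The LoRA step in stage (3) replaces full fine-tuning of the language model with small trainable low-rank factors added to frozen weights. A minimal sketch follows; the dimension, rank, and scaling values here are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 16                  # illustrative sizes, not the paper's values

W = rng.standard_normal((d, d))          # frozen pretrained weight (never updated)
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, zero-initialized

def lora_forward(x):
    # Frozen base path plus a low-rank update path, scaled by alpha / r.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d)
y = lora_forward(x)
# Trainable parameters: 2 * r * d instead of d * d for a full-weight update.
```

Because B is zero-initialized, the adapted model exactly reproduces the frozen base model at the start of instruction tuning, and only the 2·r·d adapter parameters (plus the two projectors) need gradients.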
The authors evaluate Mobile‑VideoGPT on six well‑established video understanding benchmarks: ActivityNet‑QA, EgoSchema, MLVU, MVBench, NextQA, and PerceptionTest. Across these datasets, Mobile‑VideoGPT‑0.5B achieves an average improvement of 6 accuracy points over the previous state‑of‑the‑art 0.5B‑parameter models, while using 40% fewer parameters. Throughput measurements show 7.3 tokens per second on a Jetson Orin Nano and 45.9 tokens per second on an RTX A6000, more than double the speed of comparable models. The model also maintains strong performance on both multiple‑choice and open‑ended tasks, demonstrating robust spatio‑temporal reasoning despite its compact size.
In summary, Mobile‑VideoGPT addresses the core challenges of video LMMs—parameter bloat, high computational cost, and slow inference—by (1) selecting informative key frames via attention‑based scoring, (2) aggressively reducing visual token counts with an efficient projector, and (3) coupling the compressed visual representation with a small language model. This combination yields a system that delivers state‑of‑the‑art accuracy while meeting the strict latency and memory constraints of edge devices. The paper suggests future directions such as more sophisticated frame‑selection strategies, quantization‑aware training, and hardware‑specific optimizations to further push the limits of mobile video AI.

