AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model
Authors: Zhiwei Jin, Xiaohui Song, Nan Wang
AndesVL Team, OPPO AI Center
https://github.com/OPPO-Mente-Lab/AndesVL_Evaluation
https://huggingface.co/OPPOer

Abstract

In recent years, while cloud-based MLLMs such as QwenVL, InternVL, GPT-4o, Gemini, and Claude Sonnet have demonstrated outstanding performance with enormous model sizes reaching hundreds of billions of parameters, they far exceed the memory, power-consumption, and compute limits of edge devices such as mobile phones. This paper introduces AndesVL, a suite of mobile-side MLLMs with 0.6B to 4B parameters based on Qwen3's LLM and various visual encoders. We comprehensively outline the model architectures, training pipeline, and training data of AndesVL, which achieves first-tier performance across a wide range of open-source benchmarks, including fields such as text-rich image understanding, reasoning and math, multi-image comprehension, general VQA, hallucination mitigation, multilingual understanding, and GUI-related tasks, when compared with state-of-the-art models of a similar scale. Furthermore, we introduce a 1+N LoRA architecture alongside a Quantization-Aware LoRA Fine-Tuning (QALFT) framework to facilitate efficient task adaptation and model compression during mobile-side deployment of AndesVL. Our QALFT experiments reveal that AndesVL maintains performance with only negligible degradation (3%) after deployment on mobile devices compared to the original floating-point model. Moreover, utilizing our cache eviction algorithm, OKV, along with customized speculative decoding and compression strategies, we achieve a 6.7x peak decoding speedup ratio, up to 30.9% memory reduction, and 1.8 bits-per-weight when deploying AndesVL-4B on MediaTek Dimensity 9500 chips. We release all models at https://huggingface.co/OPPOer.
1 Introduction

In recent years, the advent of large language models (LLMs) represented by ChatGPT [178], the Qwen series [13, 241, 242, 240], and the DeepSeek series [17, 126, 68] has ushered in a new era of artificial intelligence. These LLMs have demonstrated remarkable capabilities in natural language processing tasks such as text generation, question answering, and language translation. Building upon the success of LLMs, multimodal large language models (MLLMs) have emerged, expanding the functionality of large models from pure text to multiple modalities. MLLMs incorporate modalities such as image, video, and even audio, enabling more diverse and comprehensive interactions. The typical training paradigm of MLLMs involves leveraging a pre-trained LLM. By aligning the LLM with visual encoders and engaging in continual pre-training and fine-tuning, an MLLM that can process multimodal inputs and generate text outputs is developed. For effective training, a substantial amount of multimodal data is necessary, in addition to extensive text datasets. This data encompasses image-text pairs, optical character recognition (OCR) data, and visual question-answering (VQA) data. These datasets provide the model with a wide range of multimodal capabilities, such as image captioning, OCR, chart question answering, visual semantic recognition, and visual reasoning. On the cloud side, there are numerous outstanding MLLMs. Models such as the QwenVL series [14, 227, 16], the InternVL series [30, 29, 28, 274, 229], GPT-4o [177], Gemini [210, 189, 44, 43, 219], and Claude Sonnet [8] have demonstrated SOTA competence in multimodal tasks.

Figure 1: Overall performance comparisons over six domains (text-rich, reasoning and math, general VQA, multi-image, multilingual, and hallucination) of different SOTA MLLMs with 4B parameters.

Despite their groundbreaking performance, these
models are generally oriented towards reaching maximum performance, involving parameter sizes running into hundreds of billions. Such large-scale parameter demands significantly exceed the processing capabilities of mobile devices such as smartphones and tablets, particularly in terms of memory capacity, running speed, and the computing power of chips. Consequently, MLLMs that typically run on mobile platforms are limited to approximately 4 billion parameters, as illustrated by Qwen2.5-VL-3B [16] and InternVL3.5-4B [229]. To maintain optimal functionality on mobile hardware, additional measures, such as quantization-aware training (QAT) and deployment optimization on the mobile side, are essential. Currently, only a limited number of mobile-device manufacturers and internet companies have started exploring mobile-side MLLMs. For example, vivo has introduced BlueLM-V-3B [146] and BlueLM-2.5-3B [238], Meituan has launched the MobileVLM series [36, 37], Xiaomi has concentrated on mobile agents with the development of MobileVLM [233], and Apple has released the Ferret-UI series [249, 123] aimed at UI comprehension. Despite these initiatives, a thorough study explicating the training process, deployment strategies, and performance assessments on both general and mobile-specific benchmarks of mobile-side MLLMs is still absent. In this paper, we introduce the AndesVL suite. By integrating Qwen3 [240] LLMs and various visual encoders, we have successfully developed mobile-side MLLMs with parameter sizes ranging from 0.6B to 4B. Our models focus on several key general-purpose capabilities, including knowledge acquisition, mathematical reasoning, handling text-rich content, dealing with hallucination issues, processing multi-image and multilingual inputs, and general VQA. We thoroughly introduce the model architectures, training pipeline, and data preparation strategies.
Additionally, we have given special consideration to functions crucial for mobile-side MLLMs, such as user interface (UI) understanding. To evaluate the performance of our models, we have developed mobile-specific benchmarks. Specifically, inspired by Qwen3-4B-Instruct and Qwen3-4B-Thinking [240], we propose to train the instruct and thinking models of AndesVL separately to achieve the best instruction-following and reasoning abilities, respectively. Our floating-point models have achieved first-tier results among models of similar sizes across various benchmarks, as shown in Fig. 1, including 32 open-source benchmarks related to the domains mentioned above. For practical application on mobile devices, we have also designed a 1+N Low-Rank Adaptation (LoRA) [78] architecture to make the model adaptable to different tasks. Based on the AndesVL backbone model, downstream tasks can be clustered, and similar tasks can be fine-tuned using a single LoRA module to achieve optimal performance in specific domains. In addition to floating-point models within the 4B parameter range, to enable large models to run on the mobile side, the QAT and Quantization-Aware LoRA Fine-Tuning (QALFT) frameworks are necessary for model compression. With this pipeline, our mobile-side models have also demonstrated excellent results in various realistic applications. Additionally, we meticulously design a comprehensive mobile-side acceleration suite, with cache eviction, speculative decoding, and sparsification, which achieves a block efficiency (BE) of up to 7.9, with about a 6.7x end-to-end decoding speedup over the baseline (with auto-regressive decoding and without compression optimization). Furthermore, we achieve a memory reduction of up to 30.9% and a weight compression of 1.8 bits-per-weight when deploying AndesVL-4B on MediaTek Dimensity 9500 chips.
This work represents a significant step forward in the development and deployment of mobile-side MLLMs. The structure of this article is as follows: Sec. 2 introduces the mobile-side MLLM and related work on mobile-side deployment. Sec. 3 focuses on the model architecture, training data, and training scheme of AndesVL. Sec. 4 introduces the mobile-side 1+N LoRA training architecture of AndesVL and the technical scheme for mobile-side deployment. Sec. 5 details the performance of AndesVL on public benchmarks and self-built mobile-side benchmarks, as well as its comparison with SOTA models. Sec. 6 presents the benchmark results and mobile-side performance of AndesVL after being deployed on mobile devices. Sec. 7 looks ahead to future directions. Sec. 8 summarizes the entire article. The main contributions of this article can be summarized as follows:
• First, addressing the speed and performance trade-off for mobile implementations of MLLMs, we introduce the AndesVL suite, a collection of MLLMs designed for efficient deployment on edge devices, with parameter scales ranging from 0.6B to 4B, demonstrating competitive performance with SOTA models of comparable parameter counts.
• Second, we offer separate Instruct and Thinking models, making each ideal for tackling the challenges associated with high-efficiency tasks in understanding and generation, as well as applications in complex mathematical reasoning and planning.
• Third, we design a 1+N LoRA training pipeline for mobile deployment, which enables efficient task clustering and adaptation. We further propose the QALFT framework to ensure flexible application of the 1+N LoRA architecture on mobile devices.
• Finally, based on our mobile-side acceleration and compression strategies, e.g.
, customized cache eviction, sparsification, and speculative decoding, AndesVL-4B can achieve a 6.7x peak decoding speedup ratio, a memory reduction of up to 30.9%, and 1.8 bits-per-weight on MediaTek Dimensity 9500 chips.

2 Related Works

2.1 Mobile-side MLLMs

Recent years have witnessed a proliferation of remarkable advances in MLLMs. Numerous remarkable MLLMs [14, 227, 15, 30, 29, 28, 274, 229, 177, 210, 189, 44, 43, 219, 8] have been introduced, primarily driven by the pursuit of exploring the upper bounds of model performance through scaling laws. This endeavor has resulted in models with astronomically large parameter counts, reaching hundreds of billions or even trillions. Nonetheless, this emphasis on large-scale models has left the development of mobile-side MLLMs relatively underexplored. Among the efforts towards more mobile-friendly MLLMs, the Qwen series has made notable progress. Qwen2-VL [227] and Qwen2.5-VL [15] introduced model sizes of 2B and 3B, respectively, which are particularly suited for deployment on mobile devices. These model sizes effectively balance performance and the computational limitations of mobile hardware. Similarly, the InternVL series [28, 274, 229] presented a range of model sizes (1B, 2B, and 4B) designed to fulfill various operational needs on mobile platforms. In 2023, Meituan emerged as a pioneer in the mobile MLLM domain with the introduction of MobileVLM [36]. Built upon MobileLLaMA in a LLaVA-like [129] architecture, MobileVLM came in 1.7B and 3B model sizes. It achieved SOTA results in some benchmarks for models of similar sizes at that time. Meituan offered significant insights into the processing speeds on mobile and IoT platforms, reporting rates of 12.21 and 21.54 tokens per second, respectively.
In 2024, the release of MobileVLM V2 [37] further advanced the field by exploring the data scaling law, improving training strategies, and optimizing the modality alignment design. These developments contributed to a comprehensive enhancement in the performance of the MobileVLM framework. Also in 2024, the Apple MM series [154, 258] demonstrated that even relatively compact models, specifically those with 1B and 3B parameters, could achieve impressive performance through meticulous data curation and optimized training strategies. The Ferret-UI series [249, 123] marked a significant step forward, as it was the first series extensively dedicated to improving the capabilities of screen UI understanding. It extended the capabilities of MLLMs to tasks such as referring and grounding on mobile UI screens and answering questions related to screen operations. However, Apple did not reveal the performance metrics for these models when deployed on mobile platforms. Xiaomi's MobileVLM [233] also made important contributions by leveraging carefully constructed UI understanding and app operation trajectory data. This enabled the model to expand its capabilities from understanding within a single UI (intra-UI) to understanding and operating across multiple UIs (inter-UI). Nevertheless, Xiaomi's 9.8B MobileVLM model was not successfully deployed on mobile devices. Finally, vivo's BlueLM-V-3B [146] and BlueLM-2.5-3B [238] achieved mobile-side deployment of an MLLM through systematic optimizations in algorithms and hardware deployment. Specifically, BlueLM-V-3B achieved a running memory of 2.2GB and a token throughput of 24.4 tokens/s on MediaTek Dimensity 9300 NPUs. This not only showcases its effectiveness but also provides practical performance metrics for mobile-side MLLMs.
Despite these efforts, there remains a gap in comprehensively documenting training processes, deployment solutions, and benchmark results for general and mobile-specific tasks of mobile-side MLLMs. Our work aims to fill this void by presenting the AndesVL suite, which offers a comprehensive approach to mobile-side MLLMs, including detailed training, deployment, and benchmarking aspects.

2.2 Mobile-Side Deployment of MLLMs

The deployment of MLLMs on mobile devices presents unique challenges, including limited computational resources, diverse hardware architectures, and stringent energy constraints. To address these issues, various solutions [156, 61, 212, 82, 86, 114, 42, 10] have been proposed that take advantage of CPUs, GPUs, and NPUs.

CPU-based Deployment In 2020, Alibaba developed the Mobile Neural Network (MNN) [86], an inference engine tailored for mobile applications. It introduces a "pre-inference" mechanism for runtime optimization, thorough kernel optimizations for optimal computation performance, and a back-end abstraction module that enables hybrid scheduling while maintaining a lightweight engine footprint on mobile CPUs. In 2023, Georgi Gerganov [61] introduced llama.cpp, a lightweight, dependency-free C/C++ implementation designed for efficient LLM inference across diverse hardware platforms, including mobile CPUs. It supports several quantization levels (ranging from 1.5-bit to 8-bit), enabling reduced memory consumption and accelerated inference.

GPU-based Deployment In 2024, MLC LLM [212], a machine learning compiler and high-performance deployment engine for LLMs, was developed, aiming to enable native deployment across various platforms, including mobile GPUs. It compiles models into optimized binaries compatible with platforms such as iOS, Android, and web browsers. In addition, Li et al. [114] proposed Transformer-Lite, which focuses on the high-efficiency deployment of LLMs on mobile phone GPUs.
It introduced four optimization techniques: a symbolic expression-based approach for dynamic-shape model inference, operator optimizations with execution priority settings, an FP4 quantization method termed M0E4 to reduce dequantization overhead, and a sub-tensor-based technique to eliminate the need for copying the key-value (KV) cache after inference. These optimizations enable significant speedups in both prefill and decoding phases compared to existing CPU-based and GPU-based inference engines.

NPU-based Deployment Gemini Nano [42], developed by Google, is designed for on-device use cases, running within Android's AICore system service to leverage device hardware for low-latency inference. It is accessible through the AI Edge SDK, which allows developers to customize the inference and prompts. Gemini Nano models, such as Nano-1 (1.8B parameters) and Nano-2 (3.25B parameters), are distilled from larger Gemini models and optimized for edge devices such as smartphones. Finally, Apple's on-device deployment utilizes the Core ML framework to optimize and deploy large language models on Apple silicon [10]. Techniques such as grouped-query attention (GQA) mechanisms, mixed 2-bit and 4-bit quantization, and efficient memory management strategies enable the deployment of models like Llama-3.1-8B-Instruct on devices such as the iPhone 15 Pro, achieving decoding speeds of approximately 30 tokens per second. Despite notable progress in the mobile-side deployment of MLLMs, several challenges persist. These include balancing model performance with resource constraints, ensuring cross-device compatibility, standardizing deployment processes, and establishing comprehensive evaluation frameworks for multimodal tasks.

Figure 2: The overall architecture of AndesVL mainly includes a visual encoder, an MLP projector, and an LLM.
To address these issues, we introduce the AndesVL series, which offers a comprehensive suite of optimized deployment solutions tailored for mobile platforms. This includes detailed training methodologies, quantization techniques, compilation strategies, and hardware-specific optimizations. Our work aims to bridge existing gaps, providing a robust foundation for future research and practical applications in mobile-side MLLM deployment.

Model Name   | #Param (B) | Vision Encoder     | Language Model
AndesVL-0.6B | 0.695      | SigLIP2-Base [217] | Qwen3-0.6B [240]
AndesVL-1B   | 0.927      | AIMv2-Large [57]   | Qwen3-0.6B [240]
AndesVL-2B   | 2.055      | AIMv2-Large [57]   | Qwen3-1.7B [240]
AndesVL-4B   | 4.360      | AIMv2-Large [57]   | Qwen3-4B [240]
Table 1: AndesVL model architectures of different sizes.

3 AndesVL

3.1 Model Architecture

AndesVL comprises models ranging from 0.6B to 4B parameters, with detailed architectures provided in Table 1. Following the paradigm of typical MLLMs [129, 26, 227], it consists of three fundamental components: a visual encoder, an MLP projector, and an LLM, as illustrated in Fig. 2. As a general-purpose MLLM, AndesVL is designed to handle image inputs with arbitrary aspect ratios and resolutions. To achieve this, we avoid the image cropping methods employed in other works [128, 26, 239] and instead implement a Native Resolution ViT (NaViT) [45]-based strategy, allowing the visual encoder to process input of any resolution directly. This method is particularly beneficial for efficiently processing low-resolution images and ensures consistency between model inputs and the original data. The MLP projector includes two MLP layers used to align the ViT output with the LLM's embedding layer. To decrease the sequence length of the ViT output going into the LLM, a straightforward yet adaptable pixel shuffle operation is used to reduce the sequence length to a quarter of its original size.
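The quarter-length reduction can be made concrete with a small sketch: each adjacent 2x2 neighborhood of patch tokens is merged into a single token whose channels are concatenated. This is an illustrative NumPy implementation, not the AndesVL code; the function name and shapes are our assumptions.

```python
import numpy as np

def pixel_shuffle(vit_tokens, h, w):
    """Merge each adjacent 2x2 block of ViT patch tokens into one token.

    vit_tokens: (h * w, c) patch features in row-major order.  Returns
    (h * w // 4, 4 * c): a quarter of the sequence length, with the four
    patches' channels concatenated, ready for the MLP projector.
    """
    c = vit_tokens.shape[-1]
    x = vit_tokens.reshape(h, w, c)
    x = x.reshape(h // 2, 2, w // 2, 2, c)   # split into 2x2 neighborhoods
    x = x.transpose(0, 2, 1, 3, 4)           # (h/2, w/2, 2, 2, c)
    return x.reshape(h * w // 4, 4 * c)

tokens = np.arange(16 * 8, dtype=np.float32).reshape(16, 8)  # 4x4 grid, c=8
merged = pixel_shuffle(tokens, h=4, w=4)
print(merged.shape)  # (4, 32)
```

With a 4x4 patch grid of 8-dimensional tokens, the 16 input tokens become 4 output tokens of dimension 32, which the two-layer MLP then projects into the LLM embedding space.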
This operation combines and concatenates the data from each adjacent 2x2 group of patches before passing them to the MLP. For the language model, AndesVL employs Qwen3 [240], utilizing the 0.6B, 1.7B, and 4B models from the Qwen3 series. To save memory on embedding parameters, we preserve the tied word embeddings configuration across all LLM variants.

3.2 Training Pipeline

In this paper, inspired by the recently released Qwen3-4B-Instruct and Thinking models [240], we propose to develop AndesVL as two distinct models: an instruction model (supporting only non-thinking mode) and a thinking model (supporting both non-thinking and thinking modes). Consequently, the training datasets and methodologies are distinct from one another, as will be thoroughly explained in the subsequent subsections.

Stages               | Visual-Language alignment | Joint V-L pre-training                         | Multi-task pre-training
Main data type       | Caption + OCR + VQA       | Interleaved image-text + Pure text + Long CoT* | All multi-task data + VQA + Long CoT*
Trainable parameters | ViT + MLP                 | Full model                                     | Full model
ViT sequence length  | 4,096 / 16,384            | 4,096                                          | 16,384
LLM sequence length  | 2,048 / 8,192             | 8,192                                          | 32,768
Trained tokens       | 100B / 100B               | 160B                                           | 160B
Table 2: Pre-training stages of AndesVL. * indicates data exclusively used for the Thinking models.

3.2.1 Pre-train

As illustrated in Table 2, the pre-training phase of AndesVL consists of three stages: vision-language alignment, joint vision-language pre-training, and multi-task pre-training. To improve training efficiency, we pack not only the ViT input but also the LLM input tokens. Our pre-training commences with the Qwen3-Instruct/Thinking versions of the language model. Throughout all pre-training stages, we incorporate a proportion of instruction-following data. This allows us to maintain the model's instruction-following capabilities and monitor its performance progression directly via instruction-based evaluation.
Vision-Language Alignment Our primary visual encoder leverages AIMv2-Large [57], a compact 300M-parameter model that offers superior power efficiency compared to larger alternatives such as the 675M-parameter Qwen2VL-ViT [227], making it particularly well-suited for mobile deployment. To enhance the encoder's versatility across varying input resolutions, we integrate 2D Rotary Position Embeddings (2D-RoPE) [202], whose strong extrapolation capabilities enable our vision encoder to effectively process high-resolution inputs even when trained on lower resolutions. To maintain model performance, we preserve the original position embeddings with a length of 1,024 and adapt them to different resolutions using bicubic interpolation. We employ a two-stage training procedure for the visual encoder within our MLLM framework, keeping the LLM frozen while utilizing diverse training data from caption, OCR, and VQA tasks. The first stage processes 100B tokens with a ViT sequence length of 4,096, applying higher learning rates specifically to the randomly initialized MLP layers, while the second stage continues with an additional 100B tokens at a ViT sequence length of 16,384. For our 1B and 2B model variants, we streamline the training process by directly leveraging the vision encoder from our 4B model and performing alignment by training the MLP layer only. For our most compact 0.6B variant, we adopt the SigLIP2-Base-Patch16-512 [217] model, which follows a similar adaptation strategy combining bicubic interpolation for position embeddings with 2D-RoPE and two-stage training.

Joint Vision-Language Pre-training The second stage involves joint vision-language pre-training. After the visual encoder's output aligns well with the LLM's representations, we unfreeze the LLM parameters and conduct full-parameter pre-training using a relatively low learning rate.
In this stage, we utilize a large volume of unsupervised interleaved image-text data, enabling the model to acquire extensive visual knowledge. During pre-training on this data, we compute loss only on text tokens, excluding image tokens from the calculation. In unidirectional autoregressive transformers, inappropriate image positioning may prevent the model from learning multimodal knowledge effectively. For instance, images placed at the end of a sequence cannot contribute to learning even when encoded. To mitigate this issue, we employed a strategy where, with 50% probability, we maintained the original image positions. With the remaining 50% probability, we relocated all images in the data to precede all text content, replacing the images with their corresponding indices. Fig. 3 illustrates this transformation.

Original interleaved document:
The sunset over the Pacific Ocean was breathtaking. [pacific_sunset.jpg] The vibrant colors painted the sky in shades of orange and pink. Later that evening, we hiked to the mountain viewpoint. [mountain_vista.jpg]

Transformed format:
<|image_0|> [pacific_sunset.jpg] <|image_1|> [mountain_vista.jpg] The sunset over the Pacific Ocean was breathtaking. <|image_0|> The vibrant colors painted the sky in shades of orange and pink. Later that evening, we hiked to the mountain viewpoint. <|image_1|>

Figure 3: Image repositioning strategy for joint vision-language pre-training. Images are moved to the beginning of the sequence with 50% probability to ensure effective multimodal learning.

Since interleaved image-text data can be viewed as a multimodal extension of unlabeled text data, we also incorporate unlabeled text data from text pre-training. To maintain instruction-following capabilities, we include text instruction data in this stage as well. Furthermore, a certain proportion of multi-task pre-training data is added to enhance the model's overall multimodal abilities. For the Thinking version of the model, we additionally incorporate long CoT data, which will be detailed in Sec. 3.3.1.

Multi-task Pre-training The final stage is multi-task pre-training. In this stage, our approach transitions from self-supervised learning with unsupervised data to supervised learning using annotated data, focusing solely on calculating the text-token loss for the answer portions. Data types mainly consist of general VQA, captions, and OCR, alongside task-specific data like grounding/UI. For the Thinking model variant, we continue to incorporate long CoT data as in the previous stage, while increasing the proportion of multimodal types to enhance its step-by-step reasoning capabilities with visual inputs. Although we use 2D-RoPE to allow model inference at high resolutions, we increased the ViT patch input from 4,096 to 16,384 to facilitate learning from data that require high resolution. To enhance the LLM's capabilities in long contexts, particularly its reasoning ability in Thinking mode, we expanded the LLM's sequence length from 8,192 to 32,768.
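The 50/50 repositioning strategy of Fig. 3 can be sketched as a small helper. The segment representation, function name, and placeholder format below are illustrative assumptions, not the actual AndesVL data pipeline.

```python
import random

def reposition_images(segments, p_front=0.5, rng=None):
    """With probability p_front, move every image to the front of the
    document and leave an <|image_i|> index at its original position.

    `segments` is a list of ("text", str) / ("image", str) pairs.
    """
    rng = rng or random.Random()
    if rng.random() >= p_front:
        return list(segments)  # keep the original interleaved order
    images = [c for k, c in segments if k == "image"]
    tag = {img: f"<|image_{i}|>" for i, img in enumerate(images)}
    front = []
    for img in images:
        front += [("text", tag[img]), ("image", img)]
    body = [("text", tag[c]) if k == "image" else (k, c) for k, c in segments]
    return front + body

doc = [
    ("text", "The sunset over the Pacific Ocean was breathtaking."),
    ("image", "pacific_sunset.jpg"),
    ("text", "Later that evening, we hiked to the mountain viewpoint."),
    ("image", "mountain_vista.jpg"),
]
moved = reposition_images(doc, p_front=1.0)  # force the transform for demo
```

With `p_front=1.0` the output deterministically matches the "transformed format" of Fig. 3: tagged images first, then the text with index placeholders where the images used to be.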
Consequently, by completing the three pre-training stages mentioned above, we developed the base versions of our Instruct and Thinking models, referred to as AndesVL-Instruct-Base and AndesVL-Thinking-Base, respectively, which are subsequently utilized for post-training.

3.2.2 Post-train

The AndesVL post-training process consists of two main stages: supervised fine-tuning (SFT) and reinforcement learning (RL). SFT is utilized for both instruction and thinking models. Notably, mixed preference optimization (MPO) [228] is adopted for refining the instruction models, while Group Relative Policy Optimization (GRPO) [68] is employed for the thinking models. Following the application of SFT and MPO to AndesVL-Base, we derive the AndesVL-Instruct model. Conversely, the AndesVL-Thinking model is attained through the application of SFT and GRPO.

SFT Supervised fine-tuning (SFT) of the pre-trained AndesVL model is conducted utilizing meticulously formatted instruction data. Recognizing the critical influence of data diversity and quality on the performance of downstream tasks, an extensive array of multimodal instruction data is compiled, covering a wide range of task areas. To improve the model's conversational abilities, the Chat-ML instruction data format is employed. The instruction dataset is meticulously crafted to introduce the model to multiple input modalities, enabling the development of strong representational learning capabilities. Additionally, the dataset encompasses a diverse range of task objectives, such as image captioning, visual question answering, text summarization, and code generation. This deliberate diversification in data sources and task outlines is designed to enhance the model's generalization capacity and keep it robust across various application scenarios.
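The Chat-ML format mentioned above wraps each conversation turn in role-tagged blocks. The serializer below follows the public Chat-ML convention; AndesVL's exact template (and its image placeholder) is not shown in the paper, so the details are illustrative.

```python
def to_chatml(messages):
    """Serialize role-tagged turns into the Chat-ML wire format and open
    an assistant turn for the model to complete."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
             for m in messages]
    return "\n".join(parts) + "\n<|im_start|>assistant\n"

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful multimodal assistant."},
    {"role": "user", "content": "<image>\nDescribe this screenshot."},
])
print(prompt)
```

During SFT, the loss is typically computed only on the assistant completion that follows the final `<|im_start|>assistant` tag, mirroring the answer-only loss used in multi-task pre-training.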
Compliance with the Chat-ML format supports seamless integration with contemporary dialogue-oriented systems, thus promoting coherent and informative conversational exchanges. This strategic SFT method is essential for unlocking the full potential of the AndesVL model, thereby facilitating its effective use in real-world scenarios.

MPO Direct preference optimization (DPO) has emerged as the dominant approach for aligning LLMs with human preferences, as highlighted in [187]; it avoids complex on-policy RL pipelines and is suitable for training non-thinking models. Leveraging its effectiveness in language processing, recent research has extended the application of DPO to multimodal settings [113, 272]. Nonetheless, two challenges arise when implementing DPO in MLLMs: the scarcity of comprehensive, high-quality multimodal reasoning preference datasets and DPO's inability to assess the absolute quality of individual responses. To address these issues, Mixed Preference Optimization (MPO) was introduced by [228], which has shown enhancements across various multimodal reasoning evaluation sets. We borrow the MMPR dataset and MPO from [228]. During training, a joint loss consisting of preference loss $L_p$, quality loss $L_q$, and generation loss $L_g$ is used, which can be formulated as $L = w_p L_p + w_q L_q + w_g L_g$. The preference loss $L_p$ is formulated as

$$L_p = -\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_c \mid x)}{\pi_0(y_c \mid x)} - \beta \log \frac{\pi_\theta(y_r \mid x)}{\pi_0(y_r \mid x)}\right), \quad (1)$$

where $\beta$ is the KL penalty coefficient, and $x$, $y_c$, and $y_r$ are the user query, chosen response, and rejected response, respectively. The policy model $\pi_\theta$ is initialized from model $\pi_0$.
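Eq. (1) can be checked numerically on a single preference pair. The log-probabilities below are made-up illustrative values (summed token log-probabilities of each response under the policy and the frozen reference model), not numbers from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def preference_loss(logp_c, logp_r, ref_logp_c, ref_logp_r, beta=0.1):
    """Preference loss L_p of Eq. (1) for one (chosen, rejected) pair.

    logp_* are log pi_theta(y|x); ref_logp_* are log pi_0(y|x).
    """
    margin = beta * (logp_c - ref_logp_c) - beta * (logp_r - ref_logp_r)
    return -math.log(sigmoid(margin))

# The policy favors the chosen response more strongly than the reference
# does, so the implicit reward margin (0.7 here) is positive and the loss
# falls below log 2, the value at a zero margin.
loss = preference_loss(logp_c=-10.0, logp_r=-30.0,
                       ref_logp_c=-12.0, ref_logp_r=-25.0)
```

Note that $-\log \sigma(m) = \log(1 + e^{-m})$, i.e. a softplus on the negated margin, which is how the loss is usually implemented for numerical stability.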
The quality loss $L_q$ is formulated as

$$L_q = -\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_c \mid x)}{\pi_0(y_c \mid x)} - \delta\right) - \log \sigma\!\left(-\left(\beta \log \frac{\pi_\theta(y_r \mid x)}{\pi_0(y_r \mid x)} - \delta\right)\right), \quad (2)$$

where $\delta$ represents the reward shift, introduced by [89] and calculated as the moving average of previous rewards to stabilize training. The generation loss $L_g$ is the standard cross-entropy loss:

$$L_g = -\sum_{t=1}^{T} \log p_\theta(y_t \mid x, y_{<t}).$$

In the multi-scenario LoRA stage, an entity-focused weighted cross-entropy loss is used for the captioning task:

$$L_{\text{entity}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T_i} w_{i,t} \log p_\theta(y_{i,t} \mid x_i, y_{i,<t}), \quad (3)$$

where $w_{i,t}$ ($> 1$ for entity tokens, $= 1$ otherwise) is the weight of the $t$-th token in the $i$-th sample, $N$ represents the batch size, and $T_i$ is the sequence length of the $i$-th sample. The total training objective that combines entity-focused and fluency-preserving terms is

$$L_{\text{total}} = \lambda_1 L_{\text{entity}} + \lambda_2 L_{\text{BLEU/ROUGE}}, \quad (4)$$

where $\lambda_1$ is the weight for entity preservation, $\lambda_2$ is the fluency control coefficient maintaining grammatical quality, and $L_{\text{BLEU/ROUGE}}$ denotes a standard metric-based loss for text quality.

The RL phase in the multi-scenario LoRA stage is also highly customized. For example, in the captioning task, in addition to ensuring the accuracy and conciseness of the generated captions, it is also necessary to achieve a higher density of entity words (i.e., including useful information such as color, quantity, etc.). This enables the model to output more informative content within the same token length, thereby facilitating improvements in downstream tasks such as album search. We define the Entity Density Reward as

$$R_{\text{entity}} = \frac{\text{Number of entity words in caption}}{\text{Total number of words in caption}}, \quad (5)$$

and the Key Information Reward as

$$R_{\text{info}} = \beta_1 \cdot I(\text{caption contains color}) + \beta_2 \cdot I(\text{caption contains number}), \quad (6)$$

where $I(\cdot)$ is the indicator function (1 if the condition is satisfied and 0 otherwise), and $\beta_1$, $\beta_2$ are weighting coefficients. Finally, the Total Reward is

$$R_{\text{total}} = \lambda_1 R_{\text{entity}} + \lambda_2 R_{\text{info}} + \lambda_3 R_{\text{BLEU/ROUGE}}, \quad (7)$$

where $\lambda_1$, $\lambda_2$, $\lambda_3$ are weighting coefficients and the $\lambda_3$ term ensures the fluency and relevance of the caption.
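Eqs. (5)-(7) can be sketched directly. The entity vocabulary, the color list, the digit test, and all weights below are illustrative assumptions; the paper does not specify the detectors or coefficients.

```python
def entity_density_reward(caption_words, entity_vocab):
    """R_entity of Eq. (5): entity words / total words in the caption."""
    hits = sum(1 for w in caption_words if w in entity_vocab)
    return hits / len(caption_words)

def key_info_reward(caption, beta1=0.5, beta2=0.5,
                    colors=("red", "orange", "blue", "green")):
    """R_info of Eq. (6): indicator-based bonus for color and number
    mentions, with a toy color vocabulary and a digit check."""
    has_color = any(c in caption for c in colors)
    has_number = any(ch.isdigit() for ch in caption)
    return beta1 * has_color + beta2 * has_number

caption = "2 red apples on a wooden table"
words = caption.split()
r_entity = entity_density_reward(words, {"red", "apples", "table", "2"})
r_info = key_info_reward(caption)
# Total reward of Eq. (7) with illustrative lambda weights and a stand-in
# BLEU/ROUGE score of 0.9 for the fluency term:
r_total = 0.5 * r_entity + 0.3 * r_info + 0.2 * 0.9
```

Here 4 of the 7 caption words are entities, and the caption mentions both a color and a number, so both indicator bonuses fire; the lambda-weighted sum then trades entity density off against fluency exactly as in Eq. (7).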
Various detailed examples of reward signals are presented in Table 5. This mechanism ensures consistent and high-quality outputs in diverse scenarios. The multi-scenario LoRA training phase focuses on deep customization and strong adaptation, utilizing scenario-specific data along with customized loss and reward functions to significantly enhance model precision and practical utility in targeted application scenarios.

4.2 Quantization and Deployment

We have established an end-to-end quantization optimization pipeline, comprising a QAT framework for base models and a scenario-specific Quantization-Aware LoRA Fine-Tuning (QALFT) framework. This pipeline leverages cloud-based computational resources and engineering investments to maximally preserve AndesVL performance on edge devices, while simultaneously enhancing on-device inference efficiency through fine-grained mixed-precision quantization.

4.2.1 Quantization-Aware Training for AndesVL

Although post-training quantization (PTQ) techniques have advanced rapidly, directly deploying models to mobile devices via PTQ still incurs significant performance degradation. Moreover, the inherent unpredictability of PTQ-induced accuracy loss imposes an additional burden on algorithm validation and testing. To address these challenges, we have developed a robust and flexible Quantization-Aware Training (QAT) framework. It supports multiple quantization configurations: weights can be quantized to 2, 3, 4, or 8 bits, and activations to 8 or 16 bits. The framework also enables fine-grained mixed-precision combinations and includes automated precision assignment strategies to maintain model accuracy while maximizing inference efficiency. Furthermore, through close collaboration with silicon vendors, we have established a deterministic mapping mechanism that directly translates static-QAT models into hardware-compatible, edge-deployable quantized representations.
This approach aims to fundamentally eliminate the performance uncertainty on edge devices that arises from PTQ.

4.2.2 QALFT

QAT effectively satisfies the accuracy requirements for deploying a single base model on edge devices. However, in multi-LoRA scenarios, the activation quantization encodings of the base model must jointly account for the activation ranges introduced by all LoRA adapters. Consequently, any update to a LoRA adapter necessitates re-quantizing both the base model and all associated LoRAs to maintain optimal performance across diverse use cases, which is an impractical requirement for edge deployment. To overcome this limitation, we co-developed the Quantization-Aware LoRA Fine-Tuning (QALFT) framework in collaboration with MediaTek. QALFT begins by applying PTQ to a QAT-pretrained base model and permanently freezing its quantization encodings. Subsequent LoRA weights are then trained on top of this fixed, quantized backbone, following a paradigm analogous to QLoRA [49]. This design enables independent updates of scenario-specific LoRA modules without re-quantizing the base model, thereby eliminating quantization-induced performance degradation during deployment and significantly streamlining the iteration cycle for task-specific algorithms. Empirical evaluations demonstrate that, in this framework, the on-device performance of AndesVL degrades by only 3% relative to the full-precision model; this marginal loss validates the efficacy of QALFT in real-world applications.

As illustrated in Fig. 8, QALFT employs a layered architectural design. Its core principle is the complete decoupling of three essential components: the floating-point base model, the training data, and the QALFT trainer. This decoupling ensures that the training logic remains agnostic to and isolated from vendor-specific hardware infrastructure, thereby facilitating seamless and efficient deployment on MediaTek platforms.
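The key structural idea of QALFT, a frozen quantized backbone with trainable low-rank factors on top, can be sketched as follows. Shapes, the rank, and the 4-bit quantizer are illustrative assumptions; this is a toy of the QLoRA-style paradigm the text describes, not MediaTek's or OPPO's implementation:

```python
# Toy sketch of a QALFT-style layer: the base weight is quantized once
# and frozen; only the LoRA factors (A, B) would receive gradients.
import numpy as np

rng = np.random.default_rng(0)

def quantize_frozen(w, num_bits=4):
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    return np.round(w / scale).astype(np.int8), scale  # frozen encoding

d, r = 8, 2
w_fp = rng.normal(size=(d, d))
w_q, s = quantize_frozen(w_fp)           # quantized once, never updated
A = rng.normal(scale=0.01, size=(d, r))  # trainable LoRA factor
B = np.zeros((r, d))                     # standard LoRA init: B = 0

def forward(x):
    # Dequantized frozen base plus low-rank update. Updating A and B
    # never touches w_q or s, so the base encodings stay fixed.
    return x @ (w_q * s) + (x @ A) @ B
```

Because `w_q` and `s` are untouched by adapter training, swapping or updating a scenario-specific `(A, B)` pair requires no re-quantization of the backbone, which is exactly the deployment property QALFT targets.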
4.3 Mobile-side Acceleration with Cache Eviction

The key-value cache (KV cache) plays a crucial role in enhancing the inference performance of LLMs. Nevertheless, as the input sequence length expands, the size of the KV cache grows proportionally; this not only imposes significant pressure on memory resources but also undermines time efficiency. The issue is particularly pronounced for edge devices such as mobile phones: on these platforms, both performing inference on long text inputs and storing massive volumes of KV cache data are highly inefficient and impractical. Notably, attention over a long text prompt is extremely sparse, meaning that only a small number of tokens contribute most of the value. Therefore, we can perform an eviction operation on the KV cache.

Figure 8: QALFT framework.

Classic cache eviction solutions include StreamingLLM [236], H2O [266], and SnapKV [120]. The essence of these solutions lies in retaining the most recent tokens and the previously important tokens, identified from observations, attention weights, etc. We designed a new solution, called OKV, that outperforms these solutions at the same compression rate and supports context lengths up to 128K.

4.4 Mobile-Side Acceleration with Speculative Decoding

Due to the sequential nature of auto-regressive LLMs, the decoding phase is expensive and slow. Speculative decoding has been proven to be an effective solution to this problem: EAGLE-2 [121] performs auto-regression at the feature level, reusing top-layer features from the target model during drafting to achieve better results than vanilla speculative decoding; HASS [261] proposes a training-time-testing method, which further improves the acceptance length by reusing features of the draft model in the training phase to maintain consistency with the inference phase.
Based on the characteristics of mobile-side devices, we made customizations and adaptations to existing EAGLE-like methods that reuse top-layer features, so as to fully utilize the storage and computation resources on devices. We experimentally evaluate the AndesVL model with speculative decoding on multiple specific tasks. The results show that our customized method achieves a block efficiency (BE) of up to 7.9. Additionally, when combined with hardware compression and LLM sparsification, it obtains a 6.7x peak speedup ratio over the baseline.

5 Experiments

In this section, we provide experimental results to demonstrate the comprehensive capabilities of AndesVL. The general multimodal capabilities of AndesVL are compared with those of SOTA MLLMs using widely adopted multimodal benchmarks. Following this, the domain-specific performance of AndesVL is detailed, covering text-rich image understanding (including OCR, chart, and document comprehension), reasoning and math, multi-image comprehension, general VQA, hallucination mitigation, multilingual capability, and GUI-related tasks. Most of the benchmarks are tested using VLMEvalKit [54].
| Model | Text-rich | Reasoning & Math | Multi-image | General VQA | Hallucination | Multilingual | Overall |
|---|---|---|---|---|---|---|---|
| Phi-3.5-Vision [2] | 65.8 | 24.0 | 36.8 | 55.4 | 63.9 | 47.0 | 48.8 |
| Phi-4-Multimodal [3] | 81.2 | 33.2 | 42.4 | 64.2 | 66.0 | 54.3 | 56.9 |
| Gemma3-4B [211] | 61.0 | 28.9 | 38.1 | 57.8 | 62.1 | 52.4 | 50.0 |
| Qwen2.5-VL-3B [16] | 82.1 | 32.1 | 44.8 | 62.2 | 66.9 | 58.9 | 57.8 |
| Ovis2-4B [145] | 85.1 | 34.1 | 45.6 | 71.3 | 73.2 | 63.4 | 62.1 |
| MiniCPM-V-4-4B [245] | 82.4 | 33.8 | 59.1 | 70.9 | 69.3 | 55.4 | 61.8 |
| R-4B-RL [85] | - | 57.1 | - | - | - | - | - |
| InternVL3.5-4B [229] | 82.6 | 56.9 | 62.3 | 72.8 | 69.6 | 62.1 | 67.7 |
| AndesVL-4B-Instruct | 85.3 | 42.1 | 64.5 | 72.7 | 73.0 | 64.6 | 67.0 |
| AndesVL-4B-Thinking | 86.0 | 58.3 | 67.8 | 73.8 | 74.8 | 64.9 | 70.9 |
| Qwen2-VL-2B [227] | 75.8 | 23.1 | 49.7 | 60.5 | 66.1 | 52.8 | 54.7 |
| MiniCPM-V-2-2B [245] | 60.5 | 20.8 | 50.5 | 53.5 | 63.6 | 40.2 | 48.2 |
| SAIL-VL-1.5-2B [52] | 82.1 | 29.5 | 55.6 | 68.4 | 70.5 | 57.3 | 60.6 |
| SAIL-VL2-2B [247] | 84.1 | 33.1 | 59.0 | 71.8 | 71.2 | 61.7 | 63.5 |
| Ovis2-2B [145] | 81.8 | 29.5 | 59.7 | 67.2 | 70.3 | 58.0 | 61.1 |
| InternVL3-2B [274] | 78.3 | 31.6 | 54.3 | 69.4 | 67.9 | 57.4 | 59.8 |
| InternVL3.5-2B [229] | 79.6 | 49.9 | 56.6 | 68.3 | 70.5 | 58.0 | 63.8 |
| AndesVL-2B-Instruct | 82.4 | 33.8 | 56.5 | 66.1 | 70.9 | 60.3 | 61.7 |
| AndesVL-2B-Thinking | 81.3 | 45.7 | 59.8 | 68.3 | 71.8 | 59.4 | 64.4 |
| Ovis2-1B [145] | 77.3 | 24.3 | 53.0 | 59.5 | 65.4 | 52.4 | 55.3 |
| InternVL3-1B [274] | 71.2 | 23.8 | 47.8 | 61.2 | 65.4 | 47.9 | 52.9 |
| InternVL3.5-1B [229] | 73.5 | 32.8 | 52.2 | 59.9 | 65.4 | 49.1 | 55.5 |
| AndesVL-1B-Instruct | 76.7 | 27.4 | 52.1 | 60.7 | 67.0 | 53.3 | 56.2 |
| AndesVL-1B-Thinking | 77.4 | 35.8 | 54.3 | 63.4 | 67.4 | 54.1 | 58.8 |
| SmolVLM2-0.5B [150] | 55.5 | 18.4 | 42.0 | 43.6 | 54.4 | 26.1 | 40.0 |
| AndesVL-0.6B-Instruct | 73.5 | 26.0 | 51.5 | 55.3 | 65.7 | 51.0 | 53.8 |
| AndesVL-0.6B-Thinking | 73.3 | 29.4 | 53.1 | 57.1 | 65.9 | 49.7 | 54.7 |

Table 6: The overall comparison of AndesVL with existing MLLMs on 32 benchmarks, which are grouped into 6 domains. The best results are marked in bold.
5.1 Overall Performance

5.1.1 Benchmarks

We evaluated the performance of AndesVL in comparison to several advanced multimodal models of comparable size, including Qwen2-VL [227], Qwen2.5-VL [16], InternVL3 [274], InternVL3.5 [229], Gemma3 [211], BlueLM-2.5-3B [238], Phi-3.5-Vision [2], Phi-4-Multimodal [3], MiniCPM-V [245], R-4B [85], Qianfan-VL [51], Ovis2 [145], SAIL-VL-1.5-2B [52], SAIL-VL2-2B [247], and SmolVLM2 [150]. For fair comparison, these models are grouped by their parameter sizes in the following evaluations. The diverse multimodal capabilities of the proposed AndesVL are assessed using 32 commonly adopted benchmarks, covering various multimodal tasks across six domains: reasoning and math, text-rich, multi-image, general VQA, hallucination, and multilingual capability. Detailed benchmark information is presented in subsequent subsections.

5.1.2 Evaluation Results

Table 6 summarizes the overall performance of various existing MLLMs across 32 benchmarks spanning six categories: Text-rich, Reasoning & Math, Multi-image, General VQA, Hallucination, and Multilingual. We compute the average scores, drawn from the models' original papers or the OpenCompass leaderboard [39], to represent their capabilities in each domain and overall. Our proposed AndesVL series substantially outperforms existing models of similar size on multiple test sets, across all evaluated scales. These results highlight the effectiveness of our advanced training strategies and the quality of the training corpus. Specifically, across the 32 benchmarks, the AndesVL-4B-Thinking model achieves an overall score of 70.9, outperforming the second-best model, InternVL3.5-4B [229], by a margin of 3.2 points. Across every multimodal task category, the AndesVL-4B-Thinking model secures a margin of 0.9 to 5.5 points, underscoring its broad superiority in diverse multimodal scenarios.
AndesVL-4B-Instruct also demonstrates remarkably strong performance across multiple vertical domains, especially on multilingual and multi-image tasks.

At the 2B scale, the AndesVL-2B-Thinking model achieves the highest overall score of 64.4. It exhibits a clear advantage over existing models in multi-image understanding and hallucination mitigation, even surpassing some 4B-scale models.

For even more compact and lightweight models, our proposed 1B and 0.6B models hold a decisive advantage across all metrics, with their Thinking and Instruct versions occupying the top spots and surpassing other leading models in the literature. Notably, our 0.6B variants, the AndesVL-0.6B series, achieve performance comparable to existing 1B models, such as InternVL3.5-1B.

The above results underscore the models' proficiency in addressing a wide range of real-world tasks that require multimodal perception, understanding, knowledge, and reasoning. Moreover, the diversity of our models' sizes, combined with their strong performance, makes them suitable for deployment in a wide range of mobile scenarios, including those with highly limited computing resources.

5.2 Reasoning and Math

5.2.1 Benchmarks

To evaluate AndesVL's multimodal reasoning and mathematical capabilities, we extensively evaluate the model on the following mathematical reasoning benchmarks:

MMMU [256]: MMMU evaluates MLLMs on college-level tasks across six disciplines, testing expert-level reasoning and advanced perception in specific fields. The accuracy achieved from the model's direct answers on its validation set is recorded.

MMMU Pro [256]: MMMU Pro evaluates the multimodal understanding and reasoning capabilities of the model across a wide range of academic disciplines. It is the upgraded version of the MMMU benchmark. The overall accuracy score for direct answers is reported.
MathVista [139]: MathVista evaluates the mathematical reasoning ability of MLLMs with visual contexts, covering areas such as algebra, geometry, and statistics. The accuracy scores on the testmini set are recorded.

MathVision [226]: MathVision is made up of math problems with visual contexts, sourced from real math competitions. The results on the full set of the benchmark are reported.

MathVerse [263]: MathVerse evaluates a model's capability to solve visual diagram-based math problems. The performance on its vision-only set is reported.

DynaMath [276]: DynaMath consists of variant questions generated from seed questions under various conditions. The worst-case accuracy is reported to reflect the reliability of MLLMs' reasoning abilities.

WeMath [186]: WeMath decomposes composite visual math problems into sub-problems to hierarchically assess inherent issues in MLLMs' reasoning, covering 67 knowledge concepts across 5 levels of granularity.

LogicVista [237]: LogicVista evaluates models across five logical reasoning tasks: spatial, deductive, inductive, numeric, and mechanical reasoning, leveraging a diverse dataset of visual multiple-choice questions.

5.2.2 Evaluation Results

As shown in Table 7, AndesVL-4B-Thinking achieves the highest overall score of 58.3 across the various math and reasoning benchmarks among existing models. Notably, AndesVL-4B exhibits considerable superiority over advanced models on the MathVista, MathVerse, and WeMath benchmarks. With an overall score of 45.7, the AndesVL-2B-Thinking model ranks second, performing very close to the top score of 49.9 in the literature. Furthermore, AndesVL's 1B and 0.6B Thinking models deliver dominant performance within their respective size groups, achieving top ranks not only overall but also on most individual benchmarks. These improvements over existing models highlight the efficacy of our training strategy.
Our approach enhances visual-text joint reasoning by leveraging a large corpus of refined, long Chain-of-Thought (CoT) multimodal data in pre-training and an intricately designed reinforcement learning process in post-training. Collectively, these findings underscore AndesVL's comprehensive capabilities in addressing multimodal mathematical problems, as well as reasoning challenges in scientific, engineering, and real-world contexts.

| Model | MMMU (val) | MMMU Pro | MathVista (mini) | MathVision | MathVerse (vision-only) | DynaMath (worst case) | WeMath | LogicVista | Overall |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-3B [16] | 51.2 | 30.9* | 60.9* | 18.8* | 25.7* | 11.0* | 23.2* | 35.1* | 32.1 |
| BlueLM-2.5-3B [238] | 47.5 | - | 70.8 | 28.5 | - | - | - | - | - |
| BlueLM-2.5-3B-thinking [238] | 51.3 | - | 78.4 | 47.7 | - | - | - | - | - |
| Qianfan-VL-3B [51] | 46.4 | - | - | - | - | - | - | - | - |
| Gemma3-4B [211] | 47.3 | 24.5* | 46.3 | 23.1* | 23.2* | 11.0* | 23.1* | 32.7* | 28.9 |
| Phi-3.5-Vision-4B [2] | 44.6 | 23.6* | 43.3 | 14.9* | 19.7* | 9.8* | 11.2* | 25.1* | 24.0 |
| Phi-4-Multimodal [3] | 55.1 | 38.5 | 62.4 | 19.7* | 22.0* | 13.0* | 19.2* | 35.6* | 33.2 |
| Ovis2-4B [145] | 49.0 | 28.0* | 69.6 | 21.1* | 39.0* | 12.6* | 18.0* | 35.1* | 34.1 |
| MiniCPM-V-4-4B [245] | 51.2 | 33.4* | 66.9 | 20.7 | 22.0* | 14.2* | 30.0* | 32.0* | 33.8 |
| R-4B-RL [85] | 68.1 | 46.5 | 78.0 | 47.8 | 64.9 | 39.5 | 52.8 | 59.1 | 57.1 |
| InternVL3.5-4B [229] | 66.6 | 53.5* | 77.1 | 54.4 | 61.7 | 35.7 | 50.1 | 56.4 | 56.9 |
| AndesVL-4B-Instruct | 58.0 | 37.6 | 73.3 | 27.1 | 34.3 | 21.2 | 33.7 | 41.6 | 40.9 |
| AndesVL-4B-Thinking | 66.9 | 51.4 | 79.4 | 51.0 | 66.9 | 35.5 | 57.4 | 57.7 | 58.3 |
| Qwen2-VL-2B [227] | 42.2 | 19.9* | 48.0 | 17.3* | 16.8* | 4.0* | 11.3* | 25.5* | 23.1 |
| MiniCPM-V-2B [245] | 38.2 | 20.8* | 39.8 | 15.0* | 16.8* | 2.8* | 6.3* | 26.6* | 20.8 |
| SAIL-VL-1.5-2B [52] | 46.7* | 23.6* | 67.3 | 18.0* | 21.7* | 8.6* | 16.5* | 33.8* | 29.5 |
| SAIL-VL2-2B [247] | 47.7 | 29.1* | 71.1 | 23.4 | 24.5* | 10.2 | 22.7 | 36.2 | 33.1 |
| SAIL-VL2-2B-Thinking [247] | - | - | 68.5 | 27.5 | - | 20.2 | 38.8 | 47.0 | - |
| Ovis2-2B [145] | 45.6 | 23.8* | 64.1 | 17.6* | 30.7* | 10.0* | 10.4* | 33.6* | 29.5 |
| InternVL3-2B [274] | 43.2 | 26.9* | 57.0 | 19.5* | 21.8* | 14.6 | 22.4 | 47.7 | 31.6 |
| InternVL3.5-2B [229] | 59.0 | 42.6* | 71.8 | 42.8 | 53.4 | 31.5 | 48.5 | 49.4 | 49.9 |
| AndesVL-2B-Instruct | 46.1 | 30.7 | 64.9 | 22.4 | 26.8 | 15.2 | 30.3 | 34.0 | 33.8 |
| AndesVL-2B-Thinking | 52.1 | 37.3 | 73.3 | 35.2 | 54.8 | 27.5 | 41.1 | 44.3 | 45.7 |
| Ovis2-1B [145] | 36.1 | 20.9* | 59.4 | 16.0 | 23.9* | 2.8* | 9.6* | 26.0* | 24.3 |
| InternVL3-1B [274] | 43.4 | 20.1* | 45.8 | 18.8 | 18.7 | 5.8 | 13.4 | 29.8 | 24.5 |
| InternVL3.5-1B [229] | 44.2 | 25.7* | 59.3 | 27.3 | 37.8 | 17.2 | 21.5 | 29.3 | 32.8 |
| AndesVL-1B-Instruct | 43.1 | 24.4 | 53.8 | 18.1 | 18.5 | 10.2 | 21.0 | 30.2 | 27.4 |
| AndesVL-1B-Thinking | 44.0 | 27.9 | 66.4 | 23.5 | 45.1 | 11.6 | 33.9 | 34.0 | 35.8 |
| SmolVLM2-0.5B [150] | 34.1 | 14.7* | 37.5 | 13.2* | 14.0* | 3.2* | 7.2* | 23.5* | 18.4* |
| AndesVL-0.6B-Instruct | 40.7 | 24.9 | 51.8 | 19.2 | 18.7 | 6.4 | 16.2 | 29.8 | 26.0 |
| AndesVL-0.6B-Thinking | 43.3 | 24.3 | 54.9 | 19.2 | 34.0 | 7.0 | 22.8 | 29.3 | 29.4 |

Table 7: Comparison of reasoning and mathematical performance. The best results are marked in bold. Data marked with * are from our evaluation, while others are from their original papers or the OpenCompass leaderboard.

5.3 Text-rich Image Understanding

5.3.1 Benchmarks

In order to evaluate the OCR, chart, and document understanding capabilities of AndesVL, we perform assessments over a variety of text-rich datasets, including the following seven benchmarks.

AI2D [97]: AI2D consists of visual questions based on elementary school science diagrams. The results on its test set, with and without mask settings, are reported.

OCRBench [134]: OCRBench evaluates the overall OCR capabilities of MLLMs across five tasks: text recognition, scene text VQA, document VQA, key information extraction, and handwritten math expression recognition.

ChartQA [151]: ChartQA requires a model to comprehend charts and graphs visually. The average relaxed accuracy across both the human and augmented test sets in ChartQA is taken as the evaluation metric.

TextVQA [200]: TextVQA evaluates a model's capability for visual reasoning over text within images. The accuracy on the validation set is reported.
DocVQA [153]: DocVQA requires a model to read, comprehend, and retrieve text within document images to answer related questions. Performance is reported on the test set using the ANLS text similarity metric.

InfoVQA [152]: InfoVQA consists of various complex infographics that combine text, graphics, and visual elements in creative layouts. The ANLS similarity score computed on the test set is reported.

SEEDBench-2-Plus [109]: SEEDBench-2-Plus evaluates a model's multimodal capability on text-rich visual tasks across charts, maps, and webs. The average accuracy on this dataset is reported.

5.3.2 Evaluation Results

Table 8 shows a detailed comparison of AndesVL with several promising existing MLLMs on OCR-related benchmarks. AndesVL demonstrates performance superior or comparable to these models.

| Model | AI2D (w M) | AI2D (w/o M) | ChartQA (test) | TextVQA (val) | DocVQA (test) | InfoVQA (test) | OCRBench | SEED-2-Plus | Overall |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-3B [16] | 81.4 | 91.3* | 84.2* | 79.2* | 93.0* | 77.0* | 82.6* | 68.2* | 82.1 |
| BlueLM-2.5-3B [238] | 83.0 | - | - | - | - | - | 82.6 | - | - |
| BlueLM-2.5-3B-thinking [238] | 82.6 | - | - | - | - | - | 84.0 | - | - |
| Qianfan-VL-3B [51] | 81.4 | - | 81.8 | 80.1 | - | - | 83.1 | 67.6 | - |
| Gemma3-4B [211] | 70.7 | 86.3* | 33.7 | 57.7 | 70.0* | 43.0* | 66.0 | 60.7 | 61.0 |
| Phi-3.5-Vision-4B [2] | 77.8 | 87.6* | 70.0* | 65.1* | 69.0* | 35.0* | 59.9 | 62.2 | 65.8 |
| Phi-4-Multimodal [3] | 82.3 | 91.7* | 81.4 | 75.6 | 93.2 | 72.7 | 84.4 | 68.5 | 81.2 |
| Ovis2-4B [145] | 85.7 | 94.2* | 84.2* | 83.2* | 94.0* | 79.0* | 91.1 | 69.3 | 85.1 |
| MiniCPM-V-4-4B [245] | 82.9 | 93.0* | 84.4 | 80.8 | 93.0* | 69.0* | 89.4 | 67.0 | 82.4 |
| R-4B-RL [85] | 86.2 | - | - | - | 91.0 | - | 83.6 | - | - |
| InternVL3.5-4B [229] | 82.6 | 92.3 | 86.0 | 77.9 | 92.4 | 78.0* | 82.2 | 69.4 | 82.6 |
| AndesVL-4B-Instruct | 84.5 | 94.6 | 87.8 | 81.6 | 96.0 | 81.0 | 86.1 | 70.9 | 85.3 |
| AndesVL-4B-Thinking | 84.9 | 94.1 | 90.4 | 82.1 | 95.4 | 81.9 | 87.0 | 72.0 | 86.0 |
| Qwen2-VL-2B [227] | 74.7 | 84.1* | 72.5* | 79.5* | 90.0* | 65.0* | 79.7 | 61.2 | 75.8 |
| MiniCPM-V-2B [245] | 62.9 | 68.8* | 55.6 | 73.2 | 71.0* | 40.0* | 60.5 | 51.9 | 60.5 |
| SAIL-VL-1.5-2B [52] | 83.7 | 92.4* | 78.4* | 82.0 | 92.0* | 72.0* | 88.5 | 68.0* | 82.1 |
| SAIL-VL2-2B [247] | 83.0 | 92.8* | 85.0* | 83.0* | 93.1 | 77.0* | 89.5 | 69.1* | 84.1 |
| Ovis2-2B [145] | 82.7 | 91.8* | 81.3* | 80.0* | 92.0* | 72.0* | 87.3 | 67.4 | 81.8 |
| InternVL3-2B [274] | 78.7 | 87.4 | 80.2 | 77.0 | 88.0* | 67.0* | 83.5 | 64.6 | 78.3 |
| InternVL3.5-2B [229] | 78.8 | 89.1 | 80.7 | 76.5 | 89.4 | 70.8 | 83.6 | 68.0 | 79.6 |
| AndesVL-2B-Instruct | 80.1 | 89.9 | 87.4 | 79.9 | 94.2 | 74.2 | 84.6 | 68.8 | 82.4 |
| AndesVL-2B-Thinking | 77.8 | 89.3 | 86.6 | 80.0 | 93.9 | 72.9 | 82.9 | 67.1 | 81.3 |
| Ovis2-1B [145] | 76.4 | 85.3* | 74.9* | 78.4* | 89.0* | 64.0* | 89.0 | 61.4 | 77.3 |
| InternVL3-1B [274] | 69.4 | 78.3 | 75.3 | 74.1 | 81.9 | 53.7 | 79.0 | 58.2 | 71.2 |
| InternVL3.5-1B [229] | 69.3 | 81.8 | 77.7 | 71.5 | 85.6 | 60.5 | 79.5 | 62.3 | 73.5 |
| AndesVL-1B-Instruct | 71.5 | 83.8 | 80.4 | 77.0 | 91.5 | 65.3 | 78.9 | 64.8 | 76.7 |
| AndesVL-1B-Thinking | 74.4 | 86.1 | 82.3 | 76.2 | 91.4 | 65.8 | 77.7 | 65.5 | 77.4 |
| SmolVLM2-0.5B [150] | 57.3 | 59.5* | 59.6 | 60.3 | 70.0* | 29.0* | 60.9 | 47.7 | 55.5 |
| AndesVL-0.6B-Instruct | 68.4 | 82.1 | 78.2 | 69.7 | 89.5 | 63.4 | 72.2 | 64.3 | 73.5 |
| AndesVL-0.6B-Thinking | 68.8 | 82.9 | 78.2 | 68.9 | 88.8 | 61.0 | 73.8 | 64.0 | 73.3 |

Table 8: Comparison of OCR, chart, and document understanding performance. The best results are marked in bold. Data marked with * are from our evaluation, while others are from their original papers or the OpenCompass leaderboard.

Among existing models, our AndesVL-4B-Thinking model claims the top rank with an overall score of 86.0, and it achieves the top results on four of the eight benchmarks. Meanwhile, the AndesVL-4B-Instruct model also delivers strong and comparable performance on text-rich tasks with a score of 85.3. Most notably, on ChartQA, the AndesVL-4B-Thinking model scores 90.4, exceeding the previous best, InternVL3.5-4B (86.0), by 4.4 points. A similarly marked advantage is observed on DocVQA. AndesVL's success on the ChartQA and DocVQA benchmarks, which feature long-text images and complex questions, directly illustrates its ability not only to recognize long text accurately but also to apply advanced reasoning to solve challenging, contextual problems effectively.
Moreover, the advantages over existing models on text-rich tasks persist down to our smaller-scale versions. Specifically, our proposed AndesVL-2B-Instruct, AndesVL-1B-Thinking, and AndesVL-0.6B-Instruct models all achieve leading results in their respective model-size groups, with overall scores of 82.4, 77.4, and 73.5, respectively. These outcomes demonstrate the effectiveness of our models' multimodal recognition and comprehension capabilities across a variety of text-rich tasks.

5.4 Multi-Image Understanding

5.4.1 Benchmarks

To evaluate AndesVL's capabilities in perceiving and understanding multi-image relations, we conducted assessments on various multi-image benchmarks.

BLINK [59]: BLINK contains visual questions over multiple images from 14 computer vision tasks. Over half of the questions involve multiple images. The accuracy on the validation set is reported.

MMT-Bench [248]: MMT-Bench consists of multimodal tasks across recognition, reasoning, and planning, with many sub-tasks requiring multi-image understanding. The accuracy on the validation set is reported.
| Model | BLINK | Q-Bench1 (val) | MMT (val) | MuirBench | Overall |
|---|---|---|---|---|---|
| Qwen2.5-VL-3B [16] | 49.3* | 30.3* | 61.0* | 38.7* | 44.8 |
| Qianfan-VL-3B [51] | 50.0 | 73.5 | 62.2 | - | - |
| Gemma3-4B [211] | 43.9 | 14.2* | 53.9 | 40.5* | 38.1 |
| Phi-3.5-Vision-4B [2] | 58.3 | 3.7* | 61.6 | 23.6* | 36.8 |
| Phi-4-Multimodal [3] | 61.3 | 10.2* | 60.4 | 37.6* | 42.4 |
| Ovis2-4B [145] | 53.0 | 20.9* | 65.2 | 43.4* | 45.6 |
| MiniCPM-V-4-4B [245] | 54.0 | 76.5* | 59.7 | 46.1 | 59.1 |
| R-4B-RL [85] | 56.3 | - | - | - | - |
| InternVL3.5-4B [229] | 58.1 | 73.8* | 64.3 | 53.1 | 62.3 |
| AndesVL-4B-Instruct | 58.2 | 77.8 | 66.4 | 55.5 | 64.5 |
| AndesVL-4B-Thinking | 58.4 | 77.5 | 66.5 | 68.8 | 67.8 |
| Qwen2-VL-2B [227] | 45.2 | 72.8* | 55.0 | 25.9* | 49.7 |
| MiniCPM-V-2B [245] | 41.2 | 67.0* | 53.5 | 40.1* | 50.5 |
| SAIL-VL-1.5-2B [52] | 45.4* | 75.9* | 61.7* | 39.5* | 55.6 |
| SAIL-VL2-2B [247] | 54.5* | 77.1* | 60.9* | 43.6* | 59.0 |
| Ovis2-2B [145] | 65.7 | 76.2* | 55.0 | 41.9* | 59.7 |
| InternVL3-2B [274] | 50.3 | 68.4* | 59.5 | 38.8* | 54.3 |
| InternVL3.5-2B [229] | 51.3 | 72.4* | 58.5 | 44.0 | 56.6 |
| AndesVL-2B-Instruct | 48.1 | 73.8 | 58.8 | 45.5 | 56.5 |
| AndesVL-2B-Thinking | 48.6 | 74.6 | 58.5 | 57.4 | 59.8 |
| Ovis2-1B [145] | 44.0 | 71.3 | 54.7* | 42.0* | 53.0 |
| InternVL3-1B [274] | 42.9 | 63.4* | 53.5* | 31.2 | 47.8 |
| InternVL3.5-1B [229] | 44.0 | 68.5* | 54.5 | 41.7 | 52.2 |
| AndesVL-1B-Instruct | 44.7 | 70.4 | 55.2 | 38.0 | 52.1 |
| AndesVL-1B-Thinking | 44.7 | 72.4 | 57.0 | 43.2 | 54.3 |
| SmolVLM2-0.5B [150] | 40.7 | 56.5 | 44.7 | 26.2* | 42.0 |
| AndesVL-0.6B-Instruct | 46.6 | 69.2 | 52.0 | 38.0 | 51.5 |
| AndesVL-0.6B-Thinking | 46.0 | 71.7 | 52.7 | 42.0 | 53.1 |

Table 9: Comparison of multi-image understanding performance. The best results are marked in bold. Data marked with * are from our evaluation, while others are from their original papers or the OpenCompass leaderboard.

MuirBench [224]: MuirBench evaluates MLLMs' capabilities in multi-image understanding on 12 tasks and 10 types of multi-image relations. The accuracy score is reported.

Q-Bench [232]: Q-Bench assesses the abilities of MLLMs in low-level visual perception and understanding. The accuracy on the validation set is reported.
5.4.2 Evaluation Results

The detailed results presented in Table 9 indicate that AndesVL-4B-Thinking achieves superior outcomes across various multi-image benchmarks, culminating in a top overall score of 67.8 and outperforming the previous best (InternVL3.5-4B, 62.3) by a margin of 5.5 points. It also scores the highest on three of the four individual multi-image benchmarks. Moreover, as the model scale decreases, our models continue to demonstrate highly competitive accuracy, with the 0.6B variant attaining a score of 53.1. This superiority suggests that the advanced pre-training strategies and enhanced training datasets employed in AndesVL significantly strengthen its ability to capture and reason about inter-image relationships by concurrently understanding and analyzing multiple images.

5.5 General VQA

5.5.1 Benchmarks

We evaluate AndesVL's general visual question-answering ability through a range of benchmarks, including real-world understanding and comprehensive benchmarks. These evaluations test the model's capabilities on complex, realistic tasks and various comprehensive tasks. The following six benchmarks are included.
| Model | MME (sum) | MMBench v1.1 | MMVet | MMStar | RealWorldQA | R-Bench (dis) | Overall |
|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-3B [16] | 2,181* | 51.2 | 60.0 | 56.3 | 66.3* | 61.8* | 62.2 |
| BlueLM-2.5-3B [238] | - | 82.1 | 66.7 | 64.5 | - | - | - |
| BlueLM-2.5-3B-thinking [238] | - | 78.3 | 65.1 | 66.3 | - | - | - |
| Qianfan-VL-3B [51] | - | - | 48.2 | 57.9 | 65.8 | - | - |
| Gemma3-4B [211] | 1,744 | 66.4 | 57.8 | 47.9 | 55.6 | 56.6* | 57.8 |
| Phi-3.5-Vision-4B [2] | 1,838 | 67.4 | 43.2 | 47.5 | 53.6 | 55.4* | 55.4 |
| Phi-4-Multimodal [3] | 1,962 | 77.2 | 51.9 | 58.9 | 64.1 | 62.8* | 64.2 |
| Ovis2-4B [145] | 2,162 | 81.4 | 65.5 | 61.9 | 71.1* | 70.5* | 71.3 |
| MiniCPM-V-4-4B [245] | 2,298 | 79.7 | 68.0 | 62.8 | 68.5 | 64.7* | 71.0 |
| R-4B-RL [85] | - | 84.8 | 73.1 | 81.9 | 69.1 | - | - |
| InternVL3.5-4B [229] | 2,272 | 79.3* | 76.6 | 65.0 | 66.3 | 68.7 | 72.8 |
| AndesVL-4B-Instruct | 2,345 | 81.2 | 61.2 | 66.1 | 72.2 | 71.7 | 72.7 |
| AndesVL-4B-Thinking | 2,412 | 81.7 | 61.9 | 69.9 | 73.2 | 69.9 | 73.8 |
| Qwen2-VL-2B [227] | 1,899 | 72.2* | 51.5 | 47.5 | 60.7 | 62.8* | 60.5 |
| MiniCPM-V-2B [245] | 1,808 | 65.8* | 41.0 | 39.1 | 55.8 | 64.7* | 53.5 |
| SAIL-VL-1.5-2B [52] | 2,063 | 78.5* | 61.4 | 62.8 | 67.1 | 66.7* | 68.4 |
| SAIL-VL2-2B [247] | 2,144 | 80.3* | 68.7 | 64.1 | 72.0* | 69.1* | 71.8 |
| Ovis2-2B [145] | 2,005 | 77.0* | 67.9 | 56.7 | 66.0 | 64.2* | 67.2 |
| InternVL3-2B [274] | 2,221 | 78.0* | 62.2 | 60.7 | 64.3 | 71.4* | 69.4 |
| InternVL3.5-2B [229] | 2,123 | 75.3* | 71.7 | 62.7 | 62.0 | 62.4 | 68.3 |
| AndesVL-2B-Instruct | 2,081 | 77.3 | 52.0 | 60.0 | 67.8 | 65.3 | 66.1 |
| AndesVL-2B-Thinking | 2,326 | 75.4 | 59.5 | 62.7 | 64.8 | 64.2 | 68.3 |
| Ovis2-1B [145] | 1,720 | 68.4* | 50.0* | 52.1 | 63.9 | 61.0* | 59.5 |
| InternVL3-1B [274] | 1,935 | 68.2* | 59.5* | 51.5 | 58.2 | 60.4 | 61.2 |
| InternVL3.5-1B [229] | 1,910 | 67.6* | 56.5* | 51.9 | 57.6 | 57.4 | 59.9 |
| AndesVL-1B-Instruct | 1,938 | 70.9 | 43.5 | 52.5 | 65.0 | 63.4 | 60.7 |
| AndesVL-1B-Thinking | 2,006 | 73.1 | 48.9 | 57.9 | 65.1 | 64.0 | 63.4 |
| SmolVLM2-0.5B [150] | 1,448 | 41.6* | 29.9 | 38.2 | 52.7* | 47.4* | 43.6 |
| AndesVL-0.6B-Instruct | 1,866 | 65.3 | 39.7 | 44.3 | 58.6 | 57.2 | 55.3 |
| AndesVL-0.6B-Thinking | 1,925 | 66.3 | 36.1 | 49.7 | 61.6 | 59.8 | 57.1 |

Table 10: Comparison of general VQA performance. Notice that the MME score is divided by 28 to calculate the overall score. The best results are marked in bold. Data marked with * are from our evaluation, while others are from their original papers or the OpenCompass leaderboard.

MME [58]: MME evaluates models' perception and cognitive abilities across 14 sub-tasks. The overall summed score across all tasks is reported; note that the MME score is divided by 28 when calculating the overall average score.

MMBench v1.1 [132]: MMBench v1.1 evaluates the multimodal understanding capability of MLLMs. It consists of multimodal questions over 20 dimensions and supports English and Chinese versions. The average performance on both test sets is reported.

MMVet [252]: MMVet evaluates six core competencies for MLLMs: recognition, knowledge, spatial awareness, language generation, OCR, and mathematics, across 16 integrated tasks.

MMStar [24]: MMStar evaluates the multimodal capabilities of MLLMs, focusing on advanced perception, reasoning, math, and science & technology for visual and language understanding.

RealWorldQA [40]: RealWorldQA evaluates the spatial understanding capabilities of MLLMs under various real-world scenarios.

R-Bench [111]: R-Bench focuses on evaluating the robustness of MLLMs to real-world distortion, covering 33 distortion dimensions. The accuracy on the distortion set is reported.

5.5.2 Evaluation Results

As illustrated in Table 10, the AndesVL series (4B, 1B, and 0.6B) achieves the top performance in its respective groups, while the 2B model also delivers a highly competitive result. This suggests that our model extracts robust representations and displays a strong ability to comprehend real-world scenarios, enabling it to effectively tackle complex and dynamic tasks.
| Model | HallusionBench | CRPE (relation) | POPE (avg) | Overall |
|---|---|---|---|---|
| Qwen2.5-VL-3B [16] | 46.6 | 64.9* | 89.3* | 66.9 |
| BlueLM-2.5-3B [238] | 53.7 | - | - | - |
| BlueLM-2.5-3B-thinking [238] | 57.3 | - | - | - |
| Qianfan-VL-3B [51] | - | - | 85.1 | - |
| Gemma3-4B [211] | 40.8 | 61.0* | 84.6 | 62.1 |
| Phi-3.5-Vision-4B [2] | 40.5 | 68.5* | 82.8 | 63.9 |
| Phi-4-Multimodal [3] | 40.5 | 72.0* | 85.6 | 66.0 |
| Ovis2-4B [145] | 53.8 | 77.0* | 88.7 | 73.2 |
| MiniCPM-V-4-4B [245] | 50.8 | 74.6* | 82.4 | 69.3 |
| R-4B-RL [85] | 58.9 | - | - | - |
| InternVL3.5-4B [229] | 44.8 | 75.0 | 88.9 | 69.6 |
| AndesVL-4B-Instruct | 54.7 | 75.8 | 88.5 | 73.0 |
| AndesVL-4B-Thinking | 59.2 | 75.5 | 89.8 | 74.8 |
| Qwen2-VL-2B [227] | 42.4 | 68.5* | 87.3 | 66.1 |
| MiniCPM-V-2B [245] | 36.1 | 68.5* | 86.3 | 63.6 |
| SAIL-VL-1.5-2B [52] | 49.8 | 73.9* | 87.7* | 70.5 |
| SAIL-VL2-2B [247] | 51.7 | 75.2 | 86.8 | 71.2 |
| Ovis2-2B [145] | 50.2 | 73.0* | 87.8 | 70.3 |
| InternVL3-2B [274] | 42.5 | 71.5 | 89.6 | 67.9 |
| InternVL3.5-2B [229] | 48.6 | 75.6 | 87.2 | 70.5 |
| AndesVL-2B-Instruct | 51.8 | 73.0 | 87.9 | 70.9 |
| AndesVL-2B-Thinking | 51.4 | 74.1 | 89.8 | 71.8 |
| Ovis2-1B [145] | 45.2 | 63.2 | 87.7 | 65.4 |
| InternVL3-1B [274] | 41.4 | 64.0 | 90.7 | 65.4 |
| InternVL3.5-1B [229] | 41.0 | 68.4 | 86.8 | 65.4 |
| AndesVL-1B-Instruct | 43.2 | 68.7 | 89.2 | 67.0 |
| AndesVL-1B-Thinking | 45.6 | 68.8 | 87.7 | 67.4 |
| SmolVLM2-0.5B [150] | 27.7 | 52.9* | 82.7 | 54.4 |
| AndesVL-0.6B-Instruct | 45.3 | 67.4 | 84.3 | 65.7 |
| AndesVL-0.6B-Thinking | 42.5 | 68.3 | 86.8 | 65.9 |

Table 11: Comparison of hallucination alleviation performance. The best results are marked in bold. Data marked with * are from our evaluation, while others are from their original papers or the OpenCompass leaderboard.

5.6 Hallucination Evaluation

5.6.1 Benchmarks

We evaluate AndesVL's hallucination alleviation ability through a range of widely used hallucination benchmarks. These evaluations test the model's capabilities under visual hallucination settings. The following three benchmarks are included.

HallusionBench [66]: HallusionBench mainly evaluates a model's capabilities under language hallucination and visual illusion settings.
The average of its three metrics (aAcc, fAcc, and qAcc) is taken as the reported performance score.

CRPE [230]: CRPE quantitatively evaluates the object recognition and relation comprehension ability of MLLMs. The accuracy on the CRPE Relation subset is reported.

POPE [119]: POPE evaluates object hallucination tendencies in MLLMs. The overall average score is reported.

5.6.2 Evaluation Results

As illustrated in Table 11, the AndesVL series achieves exceptionally high scores overall: the 4B, 2B, 1B, and 0.6B models attain 74.8, 71.8, 67.4, and 65.9 points, respectively. Each maintains a substantial lead over other models of comparable size, by a margin of 1.5 to 11.5 points, a lead that becomes even more pronounced at smaller model scales. This finding demonstrates that our architecture delivers superior hallucination alleviation capabilities while maintaining high accuracy, a key strength that persists even in the smallest 0.6B variants.

| Model | MMMB | Multilingual MMBench | MTVQA | Overall |
|---|---|---|---|---|
| Qwen2.5-VL-3B [16] | 76.9* | 74.9* | 24.8 | 58.9 |
| Qianfan-VL-3B [51] | - | - | 26.5 | - |
| Gemma3-4B [211] | 69.6* | 65.3* | 22.1 | 52.4 |
| Phi-3.5-Vision-4B [2] | 61.3* | 67.4* | 12.3* | 47.0 |
| Phi-4-Multimodal [3] | 74.5* | 74.2* | 14.3 | 54.3 |
| Ovis2-4B [145] | 79.7* | 81.2* | 29.4 | 63.4 |
| MiniCPM-V-4-4B [245] | 72.8* | 70.7* | 22.6* | 55.4 |
| InternVL3.5-4B [229] | 80.2 | 76.4 | 29.6 | 62.1 |
| AndesVL-4B-Instruct | 81.9 | 80.8 | 31.2* | 64.6 |
| AndesVL-4B-Thinking | 81.7 | 83.2 | 29.9 | 64.9 |
| Qwen2-VL-2B [227] | 71.3* | 66.3* | 20.8 | 52.8 |
| MiniCPM-V-2B [245] | 60.0* | 51.3* | 9.3 | 40.2 |
| SAIL-VL-1.5-2B [52] | 76.0* | 72.9* | 22.9* | 57.3 |
| SAIL-VL2-2B [247] | 79.9* | 78.1* | 27.2* | 61.7 |
| Ovis2-2B [145] | 76.6* | 72.0* | 25.6 | 58.0 |
| InternVL3-2B [274] | 73.6 | 71.9 | 26.7 | 57.4 |
| InternVL3.5-2B [229] | 74.6 | 70.9 | 28.5 | 58.0 |
| AndesVL-2B-Instruct | 76.5 | 75.3 | 29.1 | 60.3 |
| AndesVL-2B-Thinking | 76.5 | 75.0 | 26.7 | 59.4 |
| Ovis2-1B [145] | 70.8* | 62.6* | 23.7 | 52.4 |
| InternVL3-1B [274] | 63.2 | 58.2 | 22.2 | 47.9 |
| InternVL3.5-1B [229] | 66.0 | 58.5 | 22.9 | 49.1 |
| AndesVL-1B-Instruct | 72.0 | 63.0 | 24.9 | 53.3 |
| AndesVL-1B-Thinking | 71.3 | 67.5 | 23.6 | 54.1 |
| SmolVLM2-0.5B [150] | 46.8* | 23.7* | 7.7 | 26.1 |
| AndesVL-0.6B-Instruct | 70.3 | 60.8 | 21.8 | 51.0 |
| AndesVL-0.6B-Thinking | 69.4 | 58.4 | 21.3 | 49.7 |

Table 12: Comparison of multilingual performance. The best results are marked in bold. Data marked with * are from our evaluation, while others are from their original papers or the OpenCompass leaderboard.

5.7 Multimodal Multilingual Understanding

5.7.1 Benchmarks

We evaluate AndesVL's multilingual understanding capabilities through the following three benchmarks:

MMMB [205]: MMMB assesses the multilingual capabilities of MLLMs, comprising 6 languages, 15 categories, and 12,000 questions. The average score is reported.

Multilingual MMBench [205]: Multilingual MMBench extends the original MMBench [132] dataset to six languages: English, Chinese, Portuguese, Arabic, Turkish, and Russian. The average score is reported.

MTVQA [209]: MTVQA evaluates the multilingual capability of MLLMs with human-annotated, text-rich images across 9 diverse languages. The average accuracy on the test set is reported.

5.7.2 Evaluation Results

As demonstrated in Table 12, both the Thinking and Instruct variants of AndesVL-4B exhibit exceptional multilingual capabilities, with the Thinking variant achieving a leading score of 64.9, which surpasses the previous best model, Ovis2-4B [145], by 1.5 points. This advantage persists in the smaller-scale variants of AndesVL, with each achieving leading multilingual scores within its respective sub-group. The model's strong multilingual capability provides a foundation for the cross-lingual transfer of its multimodal functions, a feature paramount for extending its global utility in mobile applications.
| Model | ScreenSpot | ScreenSpot-V2 | ScreenSpot-Pro | Overall |
|---|---|---|---|---|
| Qwen2.5-VL-3B [16] | 55.5* | 80.9* | 27.3* | 54.6 |
| OS-Atlas-4B [238] | 70.1 | 71.9 | - | - |
| InternVL3.5-4B [229] | 83.6 | 85.1 | 18.1* | 62.3 |
| AndesVL-4B-Instruct | 84.3 | 86.1 | 28.2 | 66.2 |
| AndesVL-4B-Thinking | 85.2 | 87.4 | 32.5 | 68.4 |
| UI-TARS-2B [227] | 82.3 | 84.7 | 27.7 | 64.9 |
| InternVL3-2B [274] | 45.1* | 47.0* | 1.0* | 31.0 |
| InternVL3.5-2B [229] | 77.2* | 79.6* | 12.2* | 56.3 |
| AndesVL-2B-Instruct | 74.6 | 76.3 | 20.9 | 57.3 |
| AndesVL-2B-Thinking | 67.2 | 70.2 | 19.6 | 52.4 |
| InternVL3-1B [274] | 31.3* | 30.9* | 0.6* | 20.9 |
| InternVL3.5-1B [229] | 60.9* | 61.9* | 9.2* | 44.0 |
| AndesVL-1B-Instruct | 71.8 | 73.2 | 23.1 | 56.0 |
| AndesVL-1B-Thinking | 73.9 | 74.4 | 20.9 | 56.4 |

Table 13: Comparison of UI understanding performance with other general and domain-specific models on the ScreenSpot test set. The best results are marked in bold. Data marked with * are from our evaluation, while others are from their original papers.

5.8 GUI Tasks

5.8.1 Benchmarks

To validate the capability of AndesVL in UI understanding, we conducted experiments on ScreenSpot [32], ScreenSpot-V2 [234], ScreenSpot-Pro [112], and our proposed AndesUI-Bench.

ScreenSpot [32]: ScreenSpot is a realistic GUI grounding benchmark that encompasses mobile, desktop, and web environments. It contains over 600 screenshots and 1,200 instructions from iOS, Android, macOS, Windows, and webpages, and specifically includes both text-based elements and a variety of widgets and icons.

ScreenSpot-V2 [234]: ScreenSpot-V2 is an enhanced version of the ScreenSpot benchmark that addresses annotation errors and ambiguities in the original dataset. Specifically, it corrects spelling errors, clarifies ambiguous instructions, removes overly similar questions, and fixes mislabeled ground-truth bounding boxes. These improvements ensure a more accurate and reliable evaluation of GUI grounding.
ScreenSpot-Pro [112]: ScreenSpot-Pro is a benchmark designed to evaluate the grounding capabilities of MLLMs in high-resolution professional settings. It includes 1,581 unique instructions over high-resolution screenshots sourced from 23 applications across five industries (development, creative, CAD, scientific, and office) and three operating systems (Linux, macOS, and Windows). The benchmark highlights the challenges posed by high-resolution displays, smaller target sizes, and complex environments in professional applications.

For the ScreenSpot, ScreenSpot-V2, and ScreenSpot-Pro datasets, the annotations are bounding boxes. For each instance, the model is required to output a specific coordinate; if the coordinate falls within the annotated bounding box, the prediction is counted as correct and contributes to the final accuracy.

AndesUI-Bench: AndesUI-Bench was developed to evaluate the smartphone UI understanding capabilities of MLLMs. As mentioned in Appendix B, AndesUI-Bench is the test set of the AndesUI dataset. It includes 9k referring entries, 7.6k grounding entries, 455 comprehensive description entries, and 1.2k complex question-answer entries.

5.8.2 Evaluation Results

In this study, we present a comprehensive analysis of various models' performance on UI understanding tasks. As illustrated in Table 13, AndesVL-4B surpasses other models of comparable size in accuracy, achieving a leading overall score of 68.4. While slightly trailing UI-TARS-2B, a model specialized for the GUI domain, the AndesVL 2B and 1B variants remain highly competitive, demonstrating robust UI comprehension. Table 14 shows the performance comparison between AndesVL and other leading open-source models on the AndesUI-Bench test set.
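The point-in-box scoring rule used for the ScreenSpot family of benchmarks can be sketched as follows; the function and variable names are ours, and coordinates are assumed to share one normalized coordinate system.

```python
def grounding_accuracy(predictions, boxes):
    """Fraction of predicted click points that land inside their
    ground-truth bounding box, given as (x1, y1, x2, y2)."""
    hits = 0
    for (px, py), (x1, y1, x2, y2) in zip(predictions, boxes):
        if x1 <= px <= x2 and y1 <= py <= y2:
            hits += 1
    return hits / len(boxes)

# Toy example: two of the three predicted points fall inside their boxes.
preds = [(0.5, 0.5), (0.1, 0.9), (0.8, 0.2)]
boxes = [(0.4, 0.4, 0.6, 0.6), (0.0, 0.8, 0.2, 1.0), (0.0, 0.0, 0.5, 0.5)]
print(grounding_accuracy(preds, boxes))  # 0.666...
```

Note that only the predicted point is scored; the model does not need to reproduce the box itself.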
AndesVL-4B delivers outstanding performance across all evaluation metrics, achieving the top overall score of 83.5.

| Model | Grounding | Referring | QA | Overall |
|---|---|---|---|---|
| Qwen2.5-VL-3B [16] | 61.4 | 62.1 | 15.0 | 46.2 |
| InternVL3.5-4B [229] | 91.9 | 68.9 | 82.4 | 81.1 |
| AndesVL-4B-Instruct | 95.1 | 72.5 | 82.6 | 83.4 |
| AndesVL-4B-Thinking | 94.5 | 73.4 | 82.5 | 83.5 |

Table 14: Comparison of UI understanding performance on the AndesUI-Bench test set. The best results are marked in bold. All results are from our evaluation.

| Model | MMVet | MathVerse (vision-only) | RealWorldQA | OCRBench | HallusionBench | Overall |
|---|---|---|---|---|---|---|
| AndesVL-2B-Instruct-Base | 48.3 | 22.8 | 65.1 | 82.4 | 49.3 | 53.6 |
| AndesVL-2B-Instruct-SFT | 51.0 | 25.9 | 66.9 | 83.9 | 49.4 | 55.4 |
| AndesVL-2B-Instruct-MPO | 52.0 | 26.8 | 67.8 | 84.6 | 51.8 | 56.6 |

Table 15: Comparison on several general benchmarks among AndesVL-2B-Instruct-Base, AndesVL-2B-Instruct-SFT, and AndesVL-2B-Instruct-MPO.

These results collectively demonstrate our model's substantial expertise and distinct competitive advantage in UI understanding and application.

5.9 Ablation Studies

In Table 15, we present ablation results for AndesVL-2B-Instruct-Base, AndesVL-2B-Instruct-SFT, and AndesVL-2B-Instruct-MPO. We find that MPO enhances the mathematical reasoning and multimodal understanding capabilities of the Instruct model, with the MPO model achieving improvements of 1.0 pp and 0.9 pp on MMVet and MathVerse, respectively. Furthermore, MPO improves the model's resistance to hallucinations, yielding a 2.4 pp gain on HallusionBench. Notably, RealWorldQA and OCRBench show 0.9 pp and 0.7 pp improvements, respectively. We conjecture that this is because MPO corrects errors made by the SFT model on instances it was originally capable of solving correctly.

In Table 16, we present ablation studies for AndesVL-2B-Thinking-Base, AndesVL-2B-Thinking-SFT, AndesVL-2B-Thinking-RL, AndesVL-4B-Thinking-Base, AndesVL-4B-Thinking-SFT, and AndesVL-4B-Thinking-RL.
As shown, the model trained with RL exhibits significant improvements in mathematical reasoning: performance increases by 2 pp or more on MathVista, MathVision, and WeMath for both AndesVL-2B-Thinking and AndesVL-4B-Thinking. Moreover, the RL-enhanced model also demonstrates improved performance on complex multimodal understanding tasks, such as a gain of up to 2.7 pp on MMMU_Pro. These results indicate that RL substantially enhances the model's multimodal understanding and mathematical reasoning capabilities, even when the model has only 2B parameters.

Overall, both the Instruct and Thinking models exhibit improved performance after the SFT stage. For the Instruct model, MPO leads to significant gains in mathematical reasoning, multimodal understanding, OCR accuracy, and hallucination resistance. For the Thinking model, RL notably enhances its abilities in mathematical reasoning and complex multimodal understanding.

| Model | MathVista (mini) | MathVision | WeMath | MathVerse (vision-only) | MMMU | MMMU_Pro | Overall |
|---|---|---|---|---|---|---|---|
| AndesVL-2B-Thinking-Base | 68.1 | 32.1 | 38.2 | 51.7 | 48.0 | 35.7 | 45.6 |
| AndesVL-2B-Thinking-SFT | 69.7 | 32.0 | 37.0 | 52.3 | 52.6 | 35.5 | 46.5 |
| AndesVL-2B-Thinking-RL | 73.3 | 35.2 | 41.1 | 54.8 | 52.1 | 37.3 | 49.0 |
| AndesVL-4B-Thinking-Base | 76.2 | 48.1 | 49.5 | 64.9 | 62.3 | 46.0 | 57.8 |
| AndesVL-4B-Thinking-SFT | 77.4 | 48.4 | 54.2 | 66.4 | 64.8 | 48.7 | 60.0 |
| AndesVL-4B-Thinking-RL | 79.4 | 51.0 | 57.4 | 66.9 | 66.9 | 51.4 | 62.2 |

Table 16: Comparison on several reasoning and math benchmarks among AndesVL-2B-Thinking-Base, AndesVL-2B-Thinking-SFT, AndesVL-2B-Thinking-RL, AndesVL-4B-Thinking-Base, AndesVL-4B-Thinking-SFT, and AndesVL-4B-Thinking-RL.

| Model | DocVQA (test) | InfoVQA (test) | TextVQA (val) | ChartQA (test) | Overall |
|---|---|---|---|---|---|
| AndesVL-4B-Instruct-Base (PTQ) | 93.2 | 89.0 | 91.4 | 89.3 | 90.7 |
| AndesVL-4B-Instruct-Base (QAT+PTQ) | 95.4 | 95.2 | 97.5 | 95.1 | 95.8 |

Table 17: Top-1 overlap between AndesVL-4B-Instruct-Base (PTQ) and AndesVL-4B-Instruct-Base (QAT+PTQ) on 4 OCR benchmarks.
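The Top-1 overlap metric reported in Table 17 measures how often the quantized model selects the same argmax token as its floating-point counterpart. A minimal sketch of this reading of the metric, with illustrative names and toy logits (not the paper's exact implementation), is:

```python
import numpy as np

def top1_overlap(logits_fp, logits_q):
    """Fraction of positions where the quantized model's argmax token
    matches the floating-point model's argmax token."""
    fp_tokens = np.argmax(logits_fp, axis=-1)  # greedy tokens, shape [seq_len]
    q_tokens = np.argmax(logits_q, axis=-1)
    return float(np.mean(fp_tokens == q_tokens))

# Toy logits over a 2-token vocabulary: the models agree on 3 of 4 positions.
fp = np.array([[2.0, 0.1], [0.3, 1.0], [1.5, 0.2], [0.0, 0.9]])
q = np.array([[1.8, 0.2], [0.9, 0.4], [1.4, 0.1], [0.1, 0.8]])
print(top1_overlap(fp, q))  # 0.75
```

A value of 0.95 thus means the quantized model reproduces the floating-point model's greedy token at 95% of positions.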
| LoRA Models | TextVQA (val) | ChartQA (test) | AI2D (w M) | Overall |
|---|---|---|---|---|
| AndesVL-4B-Instruct-Base-LoRA (floating point) | 81.1 | 87.5 | 83.4 | 84.0 |
| AndesVL-4B-Instruct-Base-LoRA (PTQ) | 67.2 | 66.1 | 65.5 | 66.3 |
| AndesVL-4B-Instruct-Base-LoRA (QAT+PTQ) | 77.2 | 84.0 | 80.8 | 80.7 |
| AndesVL-4B-Instruct-Base (QAT+PTQ+QALFT) | 80.8 | 86.4 | 81.3 | 82.8 |

Table 18: Comparison of performance on 3 OCR benchmarks among AndesVL-4B-Instruct-Base-LoRA (floating point), AndesVL-4B-Instruct-Base-LoRA (PTQ), AndesVL-4B-Instruct-Base-LoRA (QAT+PTQ), and AndesVL-4B-Instruct-Base (QAT+PTQ+QALFT). The results show that QAT and QALFT significantly improve performance.

6 On-Device Performance

6.1 Results of Quantization-Aware Training

To evaluate the capabilities of our on-device models, we use OCR as a testbed and conduct experiments on multiple OCR-related benchmarks, including DocVQA [153], InfoVQA [152], TextVQA [200], and ChartQA [151]. As mentioned in Sec. 4.2.1, directly applying PTQ to floating-point models can significantly degrade performance, and we introduced QAT to address this. We compare the quantized and floating-point models by Top-1 overlap across these benchmarks; the experimental results are shown in Table 17. In Table 17, AndesVL-4B-Instruct-Base (PTQ) denotes AndesVL-4B-Instruct-Base post-trained on OCR data with PTQ, and AndesVL-4B-Instruct-Base (QAT+PTQ) denotes the same model post-trained on OCR data with QAT followed by PTQ. The results demonstrate that QAT+PTQ achieves 95% Top-1 overlap [115] between the quantized and floating-point models, a significant improvement over PTQ alone.

6.2 Results of QALFT

To further improve performance across various on-device scenarios, we utilize the QALFT framework to train LoRA weights specific to each scenario. The experiments in Table 18 are all based on the pre-trained AndesVL-4B-Instruct-Base model.
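QALFT's exact recipe is not reproduced here; as a generic sketch of the underlying idea, in the spirit of QLoRA [49], the base weight is kept frozen in (simulated) quantized form while only the low-rank LoRA factors receive gradients in floating point. All names, shapes, and the quantizer are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_quantize(w, bits=4):
    """Simulated symmetric round-to-nearest quantization."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

d, r = 8, 2                                   # hidden size and LoRA rank (toy values)
W_q = fake_quantize(rng.standard_normal((d, d)))  # frozen quantized base weight
A = rng.standard_normal((d, r)) * 0.01        # trainable LoRA factor
B = np.zeros((r, d))                          # standard LoRA init: B = 0

def forward(x):
    # Frozen quantized base path plus floating-point LoRA path.
    return x @ W_q + x @ A @ B

x = rng.standard_normal((1, d))
print(np.allclose(forward(x), x @ W_q))  # True at init, since B = 0
```

Because B starts at zero, the adapter is a no-op at initialization; training then moves only A and B, so one quantized base can serve many per-task adapters (the 1+N LoRA layout).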
In Table 18, AndesVL-4B-Instruct-Base-LoRA (floating point) denotes the LoRA fine-tuned floating-point model trained on OCR data; AndesVL-4B-Instruct-Base-LoRA (PTQ) denotes that model with PTQ applied; AndesVL-4B-Instruct-Base-LoRA (QAT+PTQ) denotes that model with QAT and PTQ applied; and AndesVL-4B-Instruct-Base (QAT+PTQ+QALFT) denotes QALFT training on AndesVL-4B-Instruct-Base with QAT and PTQ. The results in Table 18 show that the performance of the model with PTQ alone decreases significantly, while QAT and QALFT substantially improve on-device performance; the QALFT model degrades only slightly, by about 3%, compared to the floating-point model.

6.3 Results of Cache Eviction

Our cache eviction strategy is tailored for tasks with long prompts. We use the call summary task, a popular and pioneering feature of OPPO AI phones that involves substantial input redundancy, to verify its effectiveness. On this task, our proprietary OKV cache eviction algorithm yields a more than 10% relative improvement in ROUGE-1 over SnapKV at a 50% eviction ratio, and in certain instances it even outperforms the baseline with full KV caches.

| Eviction Ratio | Method | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|---|
| 0% (Baseline) | AndesVL-4B-Instruct-Base | 0.59 | 0.33 | 0.42 |
| 25% | SnapKV | 0.55 | 0.30 | 0.39 |
| 25% | OKV | 0.60 | 0.33 | 0.41 |
| 50% | SnapKV | 0.50 | 0.25 | 0.36 |
| 50% | OKV | 0.56 | 0.30 | 0.39 |

Table 19: ROUGE performance of the reproduced SnapKV and our OKV under 25% and 50% key-value cache eviction ratios on the call summary task.

| Compression & Acceleration Method | Peak Speedup | BPW |
|---|---|---|
| PTQ (baseline) | 1.0x | 3.0 |
| + Hardware-aware compression | 1.1x | 3.0 |
| + Sparsification | 1.6x | 1.8 |
| + Speculative decoding | 6.7x | 1.8 |

Table 20: Peak decoding speedup ratio and bits-per-weight (BPW) of AndesVL-4B-Instruct-Base under various compression and acceleration techniques on an edge device. The baseline is PTQ-only.
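OKV itself is proprietary and its scoring rule is not given here. As a generic illustration of score-based KV-cache eviction in the spirit of the reproduced SnapKV baseline, the sketch below keeps the cache entries that received the most attention mass from recent queries; all names, shapes, and the scoring heuristic are our assumptions.

```python
import numpy as np

def evict_kv(keys, values, attn, keep_ratio=0.5):
    """Keep the cache entries that received the most attention mass from
    recent queries and drop the rest (SnapKV-style importance scoring)."""
    scores = attn.sum(axis=0)                # total attention each position received
    k = max(1, int(len(scores) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])  # top-k positions, in original order
    return keys[keep], values[keep]

seq_len, head_dim = 8, 4
rng = np.random.default_rng(0)
keys = rng.standard_normal((seq_len, head_dim))
values = rng.standard_normal((seq_len, head_dim))
attn = rng.random((3, seq_len))              # attention weights from 3 recent queries
k2, v2 = evict_kv(keys, values, attn, keep_ratio=0.5)
print(k2.shape)  # (4, 4): half of the cache evicted
```

The eviction ratio in Table 19 corresponds to `1 - keep_ratio` in this sketch; decoding then proceeds against the pruned cache only.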
Comprehensive results are presented in Table 19. All experiments are based on the same AndesVL-4B-Instruct-Base model and are carried out on a single device. The baseline AndesVL-4B-Instruct-Base is supervised fine-tuned on the call summary task; SnapKV and OKV are then applied to the model at inference time, respectively.

6.4 Results of Speculative Decoding

Our customized speculative decoding achieves significant decoding acceleration across multiple multimodal and text-only tasks. We combined it with our key breakthrough in LLM sparsification and MediaTek's hardware-aware compression, and show the final results in Table 20. In this table, PTQ (baseline) denotes the quantized version of the floating-point AndesVL-4B-Instruct-Base; + Hardware-aware compression denotes PTQ (baseline) with hardware-aware compression; + Sparsification denotes PTQ (baseline) with hardware-aware compression and sparsification; and + Speculative decoding denotes PTQ (baseline) with speculative decoding, sparsification, and hardware-aware compression. The results show that we achieve a 6.7x peak decoding speedup ratio and 1.8 bits-per-weight under extreme sparsification and hardware-aware compression. Moreover, we achieve a memory reduction of up to 30.9% on MediaTek Dimensity 9500 chips.

7 Future Directions

In the future, several promising directions can be explored to further enhance the capabilities of mobile-side MLLMs. First, designing more optimal visual encoder solutions holds great potential. By leveraging advanced network architectures and novel feature extraction strategies, we aim to improve the efficiency and accuracy of visual information processing, enabling the model to better understand complex visual content on resource-constrained mobile devices. Second, developing superior post-training schemes is crucial.
Refining the post-training process can optimize model performance on various multimodal tasks, reduce hallucinations, and enhance the consistency and reliability of generated outputs. This may involve exploring new types of training data, adjusting training objectives, and optimizing training algorithms to make the model more adaptable to real-world scenarios. Third, implementing effective distillation schemes between large and small models can significantly improve the performance-to-resource ratio of mobile-side models. By transferring knowledge from large, high-performance cloud-based models to smaller mobile-side counterparts, we can boost the capabilities of the latter while maintaining low computational costs and memory requirements. Finally, the development of a unified mobile-side model integrating text, image, and speech modalities (a three-mode integrated model) represents an exciting frontier. Such a model would enable seamless interaction with users across multiple modalities, providing more natural and intelligent user experiences. This will require in-depth research on multimodal fusion techniques, cross-modal representation learning, and efficient inference algorithms to ensure the model's effectiveness and efficiency on mobile devices. These research directions will not only drive the progress of mobile-side MLLMs but also expand their application scope across various fields.

8 Conclusion

This paper presents AndesVL, a suite of mobile-side MLLMs with parameter sizes ranging from 0.6B to 4B. By integrating Qwen3's LLM and various visual encoders, AndesVL achieves first-tier performance on multiple open-source benchmarks and the self-developed AndesUI benchmark, excelling in particular at mobile UI understanding. The proposed 1+N LoRA architecture and Quantization-Aware LoRA Fine-Tuning (QALFT) framework enable efficient task adaptation and model compression.
QALFT ensures that AndesVL maintains high-precision performance with only negligible degradation (3%) after deployment on mobile devices compared to the original floating-point model. By employing our proposed OKV together with carefully designed speculative decoding and compression strategies, we achieve 1.8 bits-per-weight, a 6.7x peak decoding speedup ratio, and up to 30.9% memory reduction when deploying AndesVL-4B on MediaTek Dimensity 9500 chips. This work bridges the gap between cloud-based MLLMs and edge devices, providing a practical solution for mobile-side MLLMs and paving the way for future advancements in edge AI.

References

[1] Invoice-to-json: A document understanding and information extraction dataset, 2024.

[2] Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024.

[3] Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, et al. Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras. arXiv preprint, 2025.

[4] Manoj Acharya, Kushal Kafle, and Christopher Kanan. Tallyqa: Answering complex counting questions. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8076–8084, 2019.

[5] Wasi Uddin Ahmad, Sean Narenthiran, Somshubra Majumdar, Aleksander Ficek, Siddhartha Jain, Jocelyn Huang, Vahid Noroozi, and Boris Ginsburg. Opencodereasoning: Advancing data distillation for competitive coding. 2025.
[6] Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, Joshua Lochner, Caleb Fahlgren, Xuan-Son Nguyen, Clémentine Fourrier, Ben Burtenshaw, Hugo Larcher, Haojun Zhao, Cyril Zakka, Mathieu Morlon, Colin Raffel, Leandro von Werra, and Thomas Wolf. Smollm2: When smol goes big – data-centric training of a small language model, 2025.

[7] Anas989898. Vision ocr financial reports 10k. https://huggingface.co/datasets/Anas989898/Vision-OCR-Financial-Reports-10k, 2024.

[8] Anthropic. The claude 3 model family: Opus, sonnet, haiku. https://www.anthropic.com, 2024.

[9] apoidea. fintabnet. https://huggingface.co/datasets/apoidea/fintabnet-html, 2024.

[10] Apple. On device llama 3.1 with core ml. https://machinelearning.apple.com/research/core-ml-on-device-llama, 2024.

[11] Anas Awadalla, Le Xue, Oscar Lo, Manli Shu, Hannah Lee, Etash Guha, Sheng Shen, Mohamed Awadalla, Silvio Savarese, Caiming Xiong, et al. Mint-1t: Scaling open-source multimodal data by 10x: A multimodal dataset with one trillion tokens. Advances in Neural Information Processing Systems, 37:36805–36828, 2024.

[12] Chongyang Bai, Xiaoxue Zang, Ying Xu, Srinivas Sunkara, Abhinav Rastogi, Jindong Chen, et al. Uibert: Learning generic multimodal representations for ui understanding. arXiv preprint arXiv:2107.13731, 2021.

[13] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.

[14] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
[15] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint, 2025.

[16] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint, 2025.

[17] Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.

[18] CC1984. mall receipt extraction dataset. https://huggingface.co/datasets/CC1984/mall_receipt_extraction_dataset, 2023.

[19] Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Dingyu Zhang, Peng Gao, Shuai Ren, and Hongsheng Li. Amex: Android multi-annotation expo dataset for mobile gui agents. arXiv preprint arXiv:2407.17490, 2024.

[20] Shuaichen Chang, David Palzer, Jialin Li, Eric Fosler-Lussier, and Ningchuan Xiao. Mapqa: A dataset for question answering on choropleth maps. arXiv preprint arXiv:2211.08545, 2022.

[21] Jiaqi Chen, Tong Li, Jinghui Qin, Pan Lu, Liang Lin, Chongyu Chen, and Xiaodan Liang. Unigeo: Unifying geometry logical reasoning via reformulating mathematical expression. arXiv preprint, 2022.

[22] Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric P Xing, and Liang Lin. Geoqa: A geometric question answering benchmark towards multimodal numerical reasoning. arXiv preprint arXiv:2105.14517, 2021.

[23] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. In European Conference on Computer Vision, pages 370–387. Springer, 2024.
[24] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330, 2024.

[25] Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint, 2023.

[26] Qiaoling Chen, Diandian Gu, Guoteng Wang, Xun Chen, YingTong Xiong, Ting Huang, Qinghao Hu, Xin Jin, Yonggang Wen, Tianwei Zhang, et al. Internevo: Efficient long-sequence large language model training via hybrid parallelism and redundant sharding. arXiv preprint arXiv:2401.09149, 2024.

[27] Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, and Yu Cheng. Advancing multimodal reasoning: From optimized cold start to staged reinforcement learning. arXiv preprint arXiv:2506.04207, 2025.

[28] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.

[29] Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024.

[30] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
[31] Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, et al. Finqa: A dataset of numerical reasoning over financial data. arXiv preprint arXiv:2109.00122, 2021.

[32] Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents. arXiv preprint, 2024.

[33] Liying Cheng, Lidong Bing, Ruidan He, Qian Yu, Yan Zhang, and Luo Si. Iam: a comprehensive and large-scale dataset for integrated argument mining tasks. arXiv preprint, 2022.

[34] Zewen Chi, Heyan Huang, Heng-Da Xu, Houjin Yu, Wanxuan Yin, and Xian-Ling Mao. Complicated table structure recognition. arXiv preprint arXiv:1908.04729, 2019.

[35] Yew Ken Chia, Vernon Toh Yan Han, Deepanway Ghosal, Lidong Bing, and Soujanya Poria. Puzzlevqa: Diagnosing multimodal reasoning challenges of language models with abstract visual patterns. arXiv preprint arXiv:2403.13315, 2024.

[36] Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, et al. Mobilevlm: A fast, strong and open vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886, 2023.

[37] Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, et al. Mobilevlm v2: Faster and stronger baseline for vision language model. arXiv preprint arXiv:2402.03766, 2024.

[38] cognitivecomputations. Systemchat-2.0. https://huggingface.co/datasets/cognitivecomputations/SystemChat-2.0, 2024.

[39] OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.

[40] X.AI Corp. Grok-1.5 vision preview: Connecting the digital and physical worlds with our first multimodal model.
https://x.ai/blog/grok-1.5v, 2024.

[41] Brian Davis, Bryan Morse, Scott Cohen, Brian Price, and Chris Tensmeyer. Deep visual template-free form parsing. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 134–141. IEEE, 2019.

[42] Deepmind. Gemini-nano. https://deepmind.google/models/gemini/nano/, 2024.

[43] Google Deepmind. Gemini 2.0 is now available to everyone. https://blog.google/technology/google-deepmind/gemini-model-updates-february-2025/, 2025.

[44] Google Deepmind. Introducing gemini 2.0: our new ai model for the agentic era. https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/, 2024.

[45] Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M Alabdulmohsin, et al. Patch n' pack: Navit, a vision transformer for any aspect ratio and resolution. Advances in Neural Information Processing Systems, 36:2252–2274, 2023.

[46] Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models. arXiv preprint arXiv:2409.17146, 2024.

[47] Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar. Rico: A mobile app dataset for building data-driven design applications. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology, pages 845–854, 2017.

[48] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36, 2024.

[49] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer.
Qlora: Efficient finetuning of quantized llms, 2023.

[50] Khang T. Doan, Bao G. Huynh, Dung T. Hoang, Thuc D. Pham, Nhat H. Pham, Quan T. M. Nguyen, Bang Q. Vo, and Suong N. Hoang. Vintern-1b: An efficient multimodal large language model for vietnamese, 2024.

[51] Daxiang Dong, Mingming Zheng, Dong Xu, Bairong Zhuang, Wenyu Zhang, Chunhua Luo, Haoran Wang, Zijian Zhao, Jie Li, Yuxuan Li, et al. Qianfan-vl: Domain-enhanced universal vision-language models. arXiv preprint arXiv:2509.18189, 2025.

[52] Hongyuan Dong, Zijian Kang, Weijie Yin, Xiao Liang, Chao Feng, and Jiao Ran. Scalable vision language model training via high quality data curation. arXiv preprint arXiv:2501.05952, 2025.

[53] Tianjiao Du, Junming Cao, Qinyue Wu, Wei Li, Beijun Shen, and Yuting Chen. Cocoqa: Question answering for coding conventions over knowledge graphs. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 1086–1089. IEEE, 2019.

[54] Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 11198–11201, 2024.

[55] Abhishek Dutta, Ankush Gupta, and Andrew Zisserman. Vgg image annotator (via). https://www.robots.ox.ac.uk/~vgg/software/via/.

[56] Felprot75. Mmc instructed dataset. https://huggingface.co/datasets/Felprot75/MMC_Instructed_Dataset, 2024.

[57] Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor G Turrisi da Costa, Louis Béthune, Zhe Gan, et al. Multimodal autoregressive pre-training of large vision encoders. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 9641–9654, 2025.
[58] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.

[59] Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. arXiv preprint arXiv:2404.12390, 2024.

[60] Fzkuji. orand-car-a. https://github.com/Fzkuji/Labeled_ORAND-CAR-A, 2024.

[61] Ggerganov. llama.cpp - llm inference with minimal setup and state-of-the-art performance on a wide range of hardware. https://github.com/ggerganov/llama.cpp/, 2023.

[62] glaiveai. glaive-function-calling. https://huggingface.co/datasets/glaiveai/glaive-function-calling, 2023.

[63] Othón González-Chávez, Guillermo Ruiz, Daniela Moctezuma, and Tania Ramirez-delReal. Are metrics measuring what they should? an evaluation of image captioning task metrics. Signal Processing: Image Communication, 120:117071, 2024.

[64] Jiaxi Gu, Xiaojun Meng, Guansong Lu, Lu Hou, Niu Minzhe, Xiaodan Liang, Lewei Yao, Runhui Huang, Wei Zhang, Xin Jiang, et al. Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark. Advances in Neural Information Processing Systems, 35:26418–26431, 2022.

[65] Shuhao Gu, Jialing Zhang, Siyuan Zhou, Kevin Yu, Zhaohu Xing, Liangdong Wang, Zhou Cao, Jintao Jia, Zhuoyi Zhang, Yixuan Wang, et al. Infinity-mm: Scaling multimodal performance with large-scale and high-quality instruction data. arXiv preprint arXiv:2410.18558, 2024.

[66] Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. Hallusionbench: An advanced diagnostic suite for entangled language hallucination & visual illusion in large vision-language models.
arXiv preprint , 2023. 23 [67] Etash Guha, Ryan Marten, Sedrick Keh, Ne gin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, T rung V u, Zayne Sprague, Ashima Suvarna, Benjamin Feuer , Liangyu Chen, Zaid Khan, Eric Frank el, Sachin Grov er, Caroline Choi, Niklas Muennighoff, Shiye Su, W anjia Zhao, John Y ang, Shreyas Pimpalgaonkar, Kartik Sharma, Charlie Cheng-Jie Ji, Y ichuan Deng, Sarah Pratt, V ivek Ramanujan, Jon Saad-Falcon, Jeffrey Li, Achal Dav e, Alon Albalak, Kushal Arora, Blake W ulfe, Chinmay Hegde, Greg Durrett, Sew oong Oh, Mohit Bansal, Saadia Gabriel, Aditya Grov er , Kai-W ei Chang, V aishaal Shankar , Aaron Gokaslan, Mike A. Merrill, T atsunori Hashimoto, Y ejin Choi, Jenia Jitsev , Reinhard Heckel, Maheswaran Sathiamoorthy , Alexandros G. Dimakis, and Ludwig Schmidt. Openthoughts: Data recipes for reasoning models, 2025. 9 [68] Daya Guo, Dejian Y ang, Haowei Zhang, Junxiao Song, Ruo yu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi W ang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv pr eprint arXiv:2501.12948 , 2025. 1 , 8 , 11 [69] He Guo, Xiameng Qin, Jiaming Liu, Junyu Han, Jingtuo Liu, and Errui Ding. Eaten: Entity-aware attention for single shot visual text e xtraction. In International Confer ence on Document Analysis and Recognition , pages 254–259, 2019. 12 32 [70] Jarvis Guo, Tune y Zheng, Y uelin Bai, Bo Li, Y ubo W ang, King Zhu, Y izhi Li, Graham Neubig, W enhu Chen, and Xiang Y ue. Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale. arXiv preprint arXiv:2412.05237 , 2024. 9 , 10 [71] Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. V izwiz grand challenge: Answering visual questions from blind people. In Pr oceedings of the IEEE/CVF Conference on Computer V ision and P attern Recognition , pages 3608–3617, 2018. 
12 [72] Conghui He, Zhenjiang Jin, Chao Xu, Jiantao Qiu, Bin W ang, W ei Li, Hang Y an, Jiaqi W ang, and Dahua Lin. W anjuan: A comprehensiv e multimodal dataset for advancing english and chinese large models. arXiv preprint arXiv:2308.10755 , 2023. 9 , 10 [73] Vlad Hosu, Hanhe Lin, T amas Sziranyi, and Dietmar Saupe. Koniq-10k: An ecologically valid database for deep learning of blind image quality assessment. IEEE T ransactions on Image Pr ocessing , 29:4041–4056, 2020. 12 [74] how ard hou. Coco-text. https://huggingface.co/datasets/howard- hou/COCO- Text , 2024. 12 [75] Y u-Chung Hsiao, Fedir Zubach, Gilles Baechler, V ictor Carbune, Jason Lin, Maria W ang, Sriniv as Sunkara, Y un Zhu, and Jindong Chen. Screenqa: Large-scale question-answer pairs ov er mobile app screenshots. arXiv pr eprint arXiv:2209.08199 , 2022. 12 [76] Anwen Hu, Haiyang Xu, Jiabo Y e, Ming Y an, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, et al. mplug-docowl 1.5: Unified structure learning for ocr-free document understanding. arXiv pr eprint arXiv:2403.12895 , 2024. 9 , 12 [77] Anwen Hu, Haiyang Xu, Liang Zhang, Jiabo Y e, Ming Y an, Ji Zhang, Qin Jin, Fei Huang, and Jingren Zhou. mplug-docowl2: High-resolution compressing for ocr-free multi-page document understanding. arXiv pr eprint arXiv:2409.03420 , 2024. 9 [78] Edward J Hu, Y elong Shen, Phillip W allis, Zeyuan Allen-Zhu, Y uanzhi Li, Shean W ang, Lu W ang, W eizhu Chen, et al. Lora: Lo w-rank adaptation of large language models. ICLR , 1(2):3, 2022. 3 [79] Siming Huang, T ianhao Cheng, Jason Klein Liu, Jiaran Hao, Liuyihan Song, Y ang Xu, J. Y ang, J. H. Liu, Chenchen Zhang, Linzheng Chai, Ruifeng Y uan, Zhaoxiang Zhang, Jie Fu, Qian Liu, Ge Zhang, Zili W ang, Y uan Qi, Y inghui Xu, and W ei Chu. Opencoder: The open cookbook for top-tier code large language models. 2024. 9 , 10 [80] ilhamxx. dataset receipt. https://huggingface.co/datasets/ilhamxx/dataset_receipt , 2024. 12 [81] "T eknium" "interstellarninja". 
Hermes-function-calling-dataset-v1. https://huggingface.co/ NousResearch/hermes- function- calling- v1 . 12 [82] V enkatraman Iyer , Sungho Lee, Semun Lee, Juitem Joonwoo Kim, Hyunjun Kim, and Y oungjae Shin. Automated backend allocation for multi-model, on-de vice ai inference. Pr oceedings of the ACM on Measur ement and Analysis of Computing Systems , 7(3):1–33, 2023. 4 [83] Mengzhao Jia, W enhao Y u, Kaixin Ma, Tianqing F ang, Zhihan Zhang, Siru Ouyang, Hongming Zhang, Meng Jiang, and Dong Y u. Leopard: A vision language model for text-rich multi-image tasks. arXiv pr eprint arXiv:2410.01744 , 2024. 9 [84] Y iming Jia, Jiachen Li, Xiang Y ue, Bo Li, Ping Nie, Kai Zou, and W enhu Chen. V isualwebinstruct: Scaling up multimodal instruction data through web search. arXiv pr eprint arXiv:2503.10582 , 2025. 9 , 10 [85] Jie Jiang, Qi Y ang, Bolin Ni, Shiming Xiang, Han Hu, and Houwen Peng. R-4b: Incentivizing general-purpose auto-thinking capability in mllms via bi-mode annealing and reinforce learning. arXiv preprint , 2025. 17 , 19 , 20 , 21 , 22 , 23 [86] Xiaotang Jiang, Huan W ang, Y iliu Chen, Ziqi W u, Lichuan W ang, Bin Zou, Y afeng Y ang, Zongyang Cui, Y u Cai, T ianhang Y u, et al. Mnn: A univ ersal and efficient inference engine. Pr oceedings of Machine Learning and Systems , 2:1–13, 2020. 4 [87] Justin Johnson, Bharath Hariharan, Laurens V an Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Pr oceedings of the IEEE confer ence on computer vision and pattern r ecognition , pages 2901–2910, 2017. 12 [88] JourneyBench. Journe ybench hallucination. https://huggingface.co/datasets/JourneyBench/ JourneyBench_Hallucination , 2024. 12 33 [89] Seungjae Jung, Gunsoo Han, Daniel W ontae Nam, and Kyoung-W oon On. Binary classifier optimization for lar ge language model alignment. arXiv pr eprint arXiv:2404.04656 , 2024. 
8 [90] Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. Dvqa: Understanding data visualizations via question answering. In Pr oceedings of the IEEE/CVF Confer ence on Computer V ision and P attern Recognition , pages 5648–5656, 2018. 12 [91] Samira Ebrahimi Kahou, V incent Michalski, Adam Atkinson, Ákos Kádár, Adam T rischler, and Y oshua Bengio. Figureqa: An annotated figure dataset for visual reasoning. arXiv pr eprint arXiv:1710.07300 , 2017. 12 [92] Raghav Kapoor , Y ash Parag Butala, Melisa Russak, Jing Y u Koh, Kiran Kamble, W aseem AlShikh, and Ruslan Salakhutdinov . Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web . In Eur opean Confer ence on Computer V ision , pages 161–178. Springer , 2025. 9 [93] kashindra mahato. nutritional-data-poie. https://huggingface.co/datasets/kashindra- mahato/ nutritional- data- poie- 1 , 2024. 12 [94] katanaml org. in voices-donut-data-v1. https://huggingface.co/datasets/katanaml- org/ invoices- donut- data- v1 , 2023. 12 [95] Mehran Kazemi, Hamidreza Alvari, Ankit Anand, Jialin W u, Xi Chen, and Radu Soricut. Geomverse: A systematic ev aluation of large models for geometric reasoning. arXiv pr eprint arXiv:2312.12241 , 2023. 12 [96] Aniruddha Kembha vi, Mike Salv ato, Eric Kolv e, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In Computer V ision–ECCV 2016: 14th Eur opean Confer ence, Amster dam, The Netherlands, October 11–14, 2016, Pr oceedings, P art IV 14 , pages 235–251. Springer, 2016. 12 [97] Aniruddha Kembha vi, Mike Salv ato, Eric Kolv e, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In Eur opean Conference on Computer V ision , pages 235–251, 2016. 19 [98] Geew ook Kim, T eakgyu Hong, Moonbin Y im, JeongY eon Nam, Jinyoung Park, Jinyeong Y im, W onseok Hwang, Sangdoo Y un, Dongyoon Han, and Seunghyun Park. Ocr-free document understanding transformer . 
In Eur opean Confer ence on Computer V ision , pages 498–517. Springer , 2022. 10 [99] Jin-Hwa Kim, Soohyun Lim, Jaesun Park, and Hansu Cho. K orean localization of visual question answering for blind people. In SK T -Brain-AI for Social Good W orkshop at NeurIPS , volume 2, 2019. 12 [100] Ranjay Krishna, Y uke Zhu, Oliv er Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Y annis Kalantidis, Li-Jia Li, David A Shamma, et al. V isual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer V ision , 123:32–73, 2017. 9 , 10 [101] laion. laion-gpt4v . https://huggingface.co/datasets/laion/gpt4v- dataset , 2024. 12 [102] Hugo Laurençon, Andrés Marafioti, V ictor Sanh, and Léo T ronchon. Building and better understanding vision- language models: insights and future directions. arXiv pr eprint arXiv:2408.12637 , 2024. 9 [103] Hugo Laurençon, Andrés Marafioti, V ictor Sanh, and Léo T ronchon. Building and better understanding vision- language models: insights and future directions. In W orkshop on Responsibly Building the Next Generation of Multimodal F oundational Models , 2024. 12 [104] Hugo Laurençon, Léo Tronchon, Matthieu Cord, and V ictor Sanh. What matters when building vision-language models? Advances in Neural Information Pr ocessing Systems , 37:87874–87907, 2024. 12 [105] Hugo Laurençon, Léo Tronchon, Matthieu Cord, and V ictor Sanh. What matters when building vision-language models?, 2024. 9 , 10 [106] Hugo Laurençon, Léo T ronchon, and V ictor Sanh. Unlocking the conv ersion of web screenshots into html code with the websight dataset, 2024. 12 [107] Paul Lerner , Olivier Ferret, Camille Guinaudeau, Hervé Le Borgne, Romaric Besançon, José G Moreno, and Jesús Lovón Melgarejo. V iquae, a dataset for knowledge-based visual question answering about named entities. 
In Pr oceedings of the 45th International A CM SIGIR Confer ence on Resear ch and Development in Information Retrie val , pages 3108–3120, 2022. 12 [108] Bo Li, Y uanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Y anwei Li, Ziwei Liu, and Chunyuan Li. Llava-one vision: Easy visual task transfer. arXiv preprint , 2024. 9 , 10 [109] Bohao Li, Y uying Ge, Y i Chen, Yix iao Ge, Ruimao Zhang, and Ying Shan. Seed-bench-2-plus: Benchmarking multimodal large language models with te xt-rich visual comprehension. arXiv preprint , 2024. 19 34 [110] Chenxia Li, W eiwei Liu, Ruoyu Guo, Xiaoting Y in, Kaitao Jiang, Y ongkun Du, Y uning Du, Lingfeng Zhu, Baohua Lai, Xiaoguang Hu, et al. Pp-ocrv3: More attempts for the improv ement of ultra lightweight ocr system. arXiv pr eprint arXiv:2206.03001 , 2022. 10 [111] Chunyi Li, Jianbo Zhang, Zicheng Zhang, Haoning W u, Y uan T ian, W ei Sun, Guo Lu, Xiaohong Liu, Xiongkuo Min, W eisi Lin, et al. R-bench: Are your lar ge multimodal model robust to real-w orld corruptions? arXiv pr eprint arXiv:2410.05474 , 2024. 22 [112] Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Y uchen T ian, Jing Ma, Zhiyong Huang, and T at-Seng Chua. Screenspot-pro: Gui grounding for professional high-resolution computer use. arXiv preprint , 2025. 25 [113] Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi W ang, Liang Chen, Y azheng Y ang, Benyou W ang, and Lingpeng K ong. Silkie: Preference distillation for large visual language models. arXiv preprint , 2023. 8 [114] Luchang Li, Sheng Qian, Jie Lu, Lunxi Y uan, Rui W ang, and Qin Xie. Transformer -lite: High-ef ficiency deployment of large language models on mobile phone gpus. arXiv preprint , 2024. 4 [115] Minchong Li, Feng Zhou, and Xiaohui Song. Bild: Bi-directional logits difference loss for large language model distillation. In Pr oceedings of the 31st International Confer ence on Computational Linguistics , pages 1168–1182, 2025. 
27 [116] Qingyun Li, Zhe Chen, W eiyun W ang, W enhai W ang, Shenglong Y e, Zhenjiang Jin, Guanzhou Chen, Y inan He, Zhangwei Gao, Erfei Cui, et al. Omnicorpus: An unified multimodal corpus of 10 billion-le vel images interleav ed with text. arXiv preprint , 2024. 9 , 10 [117] W ei Li, W illiam E Bishop, Alice Li, Christopher Rawles, Folawiyo Campbell-Ajala, Di vya T yamagundlu, and Oriana Riv a. On the effects of data scale on ui control agents. In The Thirty-eight Conference on Neural Information Pr ocessing Systems Datasets and Benchmarks T rac k , 2024. 9 [118] Y ang Li, Gang Li, Luheng He, Jingjie Zheng, Hong Li, and Zhiwei Guan. Widget captioning: Generating natural language description for mobile user interface elements. arXiv pr eprint arXiv:2010.04295 , 2020. 9 [119] Y ifan Li, Y ifan Du, Kun Zhou, Jinpeng W ang, W ayne Xin Zhao, and Ji-Rong W en. Evaluating object hallucination in large vision-language models. In The Confer ence on Empirical Methods in Natural Langua ge Pr ocessing , pages 292–305, 2023. 23 [120] Y uhong Li, Y ingbing Huang, Bo wen Y ang, Bharat V enkitesh, Acyr Locatelli, Hanchen Y e, T ianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Pr ocessing Systems , 37:22947–22970, 2024. 16 [121] Y uhui Li, Fangyun W ei, Chao Zhang, and Hongyang Zhang. Eagle-2: Faster inference of language models with dynamic draft trees. arXiv pr eprint arXiv:2406.16858 , 2024. 16 [122] Zhang Li, Biao Y ang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Y ang, Y abo Sun, Y uliang Liu, and Xiang Bai. Monkey: Image resolution and text label are important things for large multi-modal models. arXiv pr eprint arXiv:2311.06607 , 2023. 12 [123] Zhangheng Li, Keen Y ou, Haotian Zhang, Di Feng, Harsh Agrawal, Xiujun Li, Mohana Prasad Sathya Moorthy , Jeff Nichols, Y infei Y ang, and Zhe Gan. Ferret-ui 2: Mastering univ ersal user interface understanding across platforms. 
arXiv pr eprint arXiv:2410.18967 , 2024. 2 , 4 , 10 [124] LIME-D A T A. Lime-data-ai2d-train. https://huggingface.co/datasets/LIME- DATA/ai2d , 2024. 12 [125] Tsung-Y i Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, De va Ramanan, Piotr Dollár , and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Eur opean Conference on Computer V ision , pages 740–755, 2014. 9 , 10 [126] Aixin Liu, Bei Feng, Bing Xue, Bingxuan W ang, Bochao W u, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv pr eprint arXiv:2412.19437 , 2024. 1 [127] Fuxiao Liu, Ke vin Lin, Linjie Li, Jianfeng W ang, Y aser Y acoob, and Lijuan W ang. Aligning large multi-modal model with robust instruction tuning. CoRR , 2023. 12 [128] Haotian Liu, Chunyuan Li, Y uheng Li, Bo Li, Y uanhan Zhang, Sheng Shen, and Y ong Jae Lee. Llav a- next: Improved reasoning, ocr , and world knowledge. https://llava- vl.github.io/blog/ 2024- 01- 30- llava- next/ , January 2024. 6 35 [129] Haotian Liu, Chunyuan Li, Qingyang W u, and Y ong Jae Lee. V isual instruction tuning. Advances in Neural Information Pr ocessing Systems , 36, 2023. 4 , 6 [130] Junpeng Liu, T ianyue Ou, Y ifan Song, Y uxiao Qu, W ai Lam, Chenyan Xiong, W enhu Chen, Graham Neubig, and Xiang Y ue. Harnessing webpage uis for text-rich visual understanding, 2024. 9 [131] W entao Liu, Qianjun Pan, Y i Zhang, Zhuo Liu, Ji W u, Jie Zhou, Aimin Zhou, Qin Chen, Bo Jiang, and Liang He. Cmm-math: A chinese multimodal math dataset to evaluate and enhance the mathematics reasoning of large multimodal models. arXiv pr eprint arXiv:2409.02834 , 2024. 12 [132] Y uan Liu, Haodong Duan, Y uanhan Zhang, Bo Li, Songyang Zhang, W angbo Zhao, Y ike Y uan, Jiaqi W ang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv pr eprint arXiv:2307.06281 , 2023. 22 , 24 [133] Y uliang Liu, Lianwen Jin, Shuaitao Zhang, Canjie Luo, and Sheng Zhang. 
Curved scene text detection via transv erse and longitudinal sequence connection. P attern Recognition , 90:337–345, 2019. 12 [134] Y uliang Liu, Zhang Li, Hongliang Li, W enwen Y u, Mingxin Huang, Dezhi Peng, Mingyu Liu, Mingrui Chen, Chunyuan Li, Lianwen Jin, et al. On the hidden mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895 , 2023. 19 [135] Y ulong Liu, Guibo Zhu, Bin Zhu, Qi Song, Guojing Ge, Haoran Chen, GuanHui Qiao, Ru Peng, Lingxiang W u, and Jinqiao W ang. T aisu: A 166m large-scale high-quality dataset for chinese vision-language pre-training. Advances in Neural Information Pr ocessing Systems , 35:16705–16717, 2022. 9 [136] Zihan Liu, W ei Ping, Rajarshi Roy , Peng Xu, Chankyu Lee, Mohammad Shoeybi, and Bryan Catanzaro. Chatqa: Surpassing gpt-4 on con versational qa and rag. arXiv pr eprint arXiv:2401.10225 , 2024. 12 [137] LooksJuicy . ruozhiba. https://huggingface.co/datasets/LooksJuicy/ruozhiba , 2024. 12 [138] Anton Lozhkov , Loubna Ben Allal, Leandro von W erra, and Thomas W olf. Fineweb-edu: the finest collection of educational content, 2024. 9 , 10 [139] Pan Lu, Hritik Bansal, T ony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-W ei Chang, Michel Galley , and Jianfeng Gao. Mathvista: Ev aluating mathematical reasoning of foundation models in visual contexts. arXiv preprint , 2023. 18 [140] Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-gps: Inter- pretable geometry problem solving with formal language and symbolic reasoning. arXiv pr eprint arXiv:2105.04165 , 2021. 12 [141] Pan Lu, Swaroop Mishra, T ony Xia, Liang Qiu, Kai-W ei Chang, Song-Chun Zhu, Oyvind T afjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Confer ence on Neural Information Pr ocessing Systems (NeurIPS) , 2022. 
12 [142] Pan Lu, Liang Qiu, Kai-W ei Chang, Y ing Nian W u, Song-Chun Zhu, T anmay Rajpurohit, Peter Clark, and Ashwin Kalyan. Dynamic prompt learning via polic y gradient for semi-structured mathematical reasoning. arXiv preprint arXiv:2209.14610 , 2022. 12 [143] Pan Lu, Liang Qiu, Jiaqi Chen, T ony Xia, Y izhou Zhao, W ei Zhang, Zhou Y u, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. arXiv pr eprint arXiv:2110.13214 , 2021. 12 [144] Quanfeng Lu, W enqi Shao, Zitao Liu, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, Y u Qiao, and Ping Luo. Gui odyssey: A comprehensive dataset for cross-app gui navigation on mobile de vices. arXiv pr eprint arXiv:2406.08451 , 2024. 9 [145] Shiyin Lu, Y ang Li, Qing-Guo Chen, Zhao Xu, W eihua Luo, Kaifu Zhang, and Han-Jia Y e. Ovis: Structural embedding alignment for multimodal large language model. arXiv pr eprint arXiv:2405.20797 , 2024. 17 , 19 , 20 , 21 , 22 , 23 , 24 [146] Xudong Lu, Y inghao Chen, Cheng Chen, Hui T an, Boheng Chen, Y ina Xie, Rui Hu, Guanxin T an, Renshou W u, Y an Hu, et al. Bluelm-v-3b: Algorithm and system co-design for multimodal large language models on mobile de vices. arXiv pr eprint arXiv:2411.10640 , 2024. 2 , 4 [147] Luckyjhg. Geo170k. https://huggingface.co/datasets/Luckyjhg/Geo170K , 2024. 12 36 [148] Michael Luo, Sijun T an, Justin W ong, Xiaoxiang Shi, W illiam T ang, Manan Roongta, Colin Cai, Jef- frey Luo, Tianjun Zhang, Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-previe w with a 1.5b model by scaling rl. https://pretty- radio- b75.notion.site/ DeepScaleR- Surpassing- O1- Preview- with- a- 1- 5B- Model- by- Scaling- RL- 19681902c1468005bed8ca30301 3a4e2 , 2025. Notion Blog. 13 [149] Chuofan Ma, Y i Jiang, Jiannan Wu, Zehuan Y uan, and Xiaojuan Qi. Groma: Localized visual tokenization for grounding multimodal large language models. arXiv preprint , 2024. 
9 , 10 [150] Andrés Marafioti, Orr Zohar , Miquel Farré, Merve Noyan, Elie Bak ouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov , Nouamane T azi, et al. Smolvlm: Redefining small and effi cient multimodal models. arXiv pr eprint arXiv:2504.05299 , 2025. 17 , 19 , 20 , 21 , 22 , 23 , 24 [151] Ahmed Masry , Xuan Long Do, Jia Qing T an, Shafiq Joty , and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In Proceedings of the Annual Meeting of the Association for Computational Linguistics , pages 2263–2279, 2022. 13 , 19 , 27 [152] Minesh Mathew , V iraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest V alveny , and CV Jawahar . Infographicvqa. In Pr oceedings of the IEEE/CVF W inter Conference on Applications of Computer V ision , pages 1697–1706, 2022. 19 , 27 [153] Minesh Mathew , Dimosthenis Karatzas, and CV Jawahar . Docvqa: A dataset for vqa on document images. In Pr oceedings of the IEEE/CVF W inter Conference on Applications of Computer V ision , pages 2200–2209, 2021. 12 , 19 , 27 [154] Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier , Sam Dodge, Bowen Zhang, Philipp Dufter , Dhruti Shah, Xianzhi Du, Futang Peng, Floris W eers, et al. Mm1: Methods, analysis & insights from multimodal llm pre-training. arXiv pr eprint arXiv:2403.09611 , 2024. 4 [155] Nitesh Methani, Pritha Ganguly , Mitesh M. Khapra, and Pratyush Kumar . Plotqa: Reasoning ov er scientific plots. In The IEEE W inter Conference on Applications of Computer V ision (W ACV) , March 2020. 12 [156] Microsoft. Accelerated edge machine learning. https://onnxruntime.ai/ , 2023. 4 [157] Anand Mishra, Shashank Shekhar , Ajeet Kumar Singh, and Anirban Chakraborty . Ocr-vqa: V isual question answering by reading text in images. In International Confer ence on Document Analysis and Recognition , pages 947–952, 2019. 12 [158] MiXaiLL76. T extocr ocr . https://huggingface.co/datasets/MiXaiLL76/TextOCR_OCR , 2025. 12 [159] MMR1. 
Mmr1-math-rl-data-v0. https://huggingface.co/datasets/MMR1/ MMR1- Math- RL- Data- v0 , 2025. 12 [160] Ivan Moshko v , Darragh Hanley , Ivan Sorokin, Shubham T oshniwal, Christof Henkel, Benedikt Schifferer , W ei Du, and Igor Gitman. Aimo-2 winning solution: Building state-of-the-art mathematical reasoning models with openmathreasoning dataset. arXiv pr eprint arXiv:2504.16891 , 2025. 9 [161] mychen76. ds receipts v2 train. https://huggingface.co/datasets/mychen76/ds_receipts_v2_ train , 2023. 12 [162] mychen76. in voices and receipts ocr v1. https://huggingface.co/datasets/mychen76/ invoices- and- receipts_ocr_v1 , 2023. 12 [163] mychen76. in voices and receipts ocr v2. https://huggingface.co/datasets/mychen76/ invoices- and- receipts_ocr_v2 , 2023. 12 [164] Dhruv Nathawani, Igor Gitman, Somshubra Majumdar , Evelina Bakhturina, Ameya Sunil Mahabaleshwarkar , , Jian Zhang, and Jane Polak Scowcroft. Nemotron-Post-Training-Dataset-v1, 2025. 9 [165] nimapourjafar . Lacr i2i. https://huggingface.co/datasets/nimapourjafar/mm_LACR_I2I , 2024. 12 [166] nimapourjafar . Ladd. https://huggingface.co/datasets/nimapourjafar/mm_LADD , 2024. 12 [167] nimapourjafar . mm datikz. https://huggingface.co/datasets/nimapourjafar/mm_datikz , 2024. 12 [168] nimapourjafar . mm diagram image to te xt. https://huggingface.co/datasets/nimapourjafar/mm_ diagram_image_to_text , 2024. 12 37 [169] nimapourjafar . mm intergps. https://huggingface.co/datasets/nimapourjafar/mm_intergps , 2024. 12 [170] nimapourjafar . mm tqa. https://huggingface.co/datasets/nimapourjafar/mm_tqa , 2024. 12 [171] nimapourjafar . mm vqarad. https://huggingface.co/datasets/nimapourjafar/mm_vqarad , 2024. 12 [172] nimapourjafar . rob ut-wikisql. https://huggingface.co/datasets/nimapourjafar/mm_robut_ wikisql , 2024. 12 [173] nz. arxiv-ocr . https://huggingface.co/datasets/nz/arxiv- ocr- v0.1- sft , 2024. 12 [174] Jason Obeid and Enamul Hoque. 
Chart-to-text: Generating natural language descriptions for charts by adapting the transformer model. arXiv pr eprint arXiv:2010.09142 , 2020. 12 [175] Beijing Academy of Artificial Intelligence (B AAI). Infinity instruct. arXiv pr eprint arXiv:2406.XXXX , 2024. 9 , 10 , 11 [176] open r1. Openr1-math-220k. https://huggingface.co/datasets/open- r1/OpenR1- Math- 220k , 2025. 12 [177] OpenAI. Gpt-4o system card. https://openai.com/index/gpt- 4o- system- card/ , 2025. 1 , 3 , 11 , 12 [178] OpenAI. Introducing chatgpt. https://openai.com/index/chatgpt/ , 2025. 1 [179] OpenGVLab . Sharegpt-4o. https://huggingface.co/datasets/OpenGVLab/ShareGPT- 4o , 2005. 12 [180] Seunghyun Park, Seung Shin, Bado Lee, Junyeop Lee, Jaeheung Surh, Minjoon Seo, and Hwalsuk Lee. Cord: A consolidated receipt dataset for post-ocr parsing. 2019. 12 [181] pengshuai rin. multimath-300k. https://huggingface.co/datasets/pengshuai- rin/ multimath- 300k , 2024. 12 [182] Bryan A Plummer, Liwei W ang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier , and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Pr oceedings of the IEEE/CVF International Confer ence on Computer V ision , pages 2641–2649, 2015. 9 , 10 [183] Jordi Pont-T uset, Jasper Uijlings, Soravit Changpinyo, Radu Soricut, and V ittorio Ferrari. Connecting vision and language with localized narrativ es. In ECCV , 2020. 12 [184] qgyd2021. chinese ner sft. https://huggingface.co/datasets/qgyd2021/chinese_ner_sft , 2023. 12 [185] qgyd2021. few shot ner sft. https://huggingface.co/datasets/qgyd2021/few_shot_ner_sft , 2024. 12 [186] Runqi Qiao, Qiuna T an, Peiqing Y ang, Y anzi W ang, Xiaowan W ang, Enhui W an, Sitong Zhou, Guanting Dong, Y uchen Zeng, Y ida Xu, Jie W ang, Chong Sun, Chen Li, and Honggang Zhang. W e-math 2.0: A versatile mathbook system for incentivizing visual mathematical reasoning, 2025. 
12 , 18 [187] Rafael Rafailo v , Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Y our language model is secretly a re ward model. Advances in Neural Information Pr ocessing Systems , 36, 2024. 8 [188] Christopher Ra wles, Alice Li, Daniel Rodriguez, Oriana Ri va, and T imothy Lillicrap. Androidinthewild: A lar ge-scale dataset for android device control. Advances in Neural Information Pr ocessing Systems , 36, 2024. 9 [189] Machel Reid, Nikolay Savino v , Denis T eplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint , 2024. 1 , 3 [190] Christoph Schuhmann, Romain Beaumont, Richard V encu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell W ortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Pr ocessing Systems , 35:25278–25294, 2022. 9 , 10 38 [191] Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world kno wledge. In Eur opean confer ence on computer vision , pages 146–162. Springer , 2022. 12 [192] Sanket Shah, Anand Mishra, Naganand Y adati, and Partha Pratim T alukdar . Kvqa: Knowledge-aw are visual question answering. In Pr oceedings of the AAAI Conference on Artificial Intelligence , v olume 33, pages 8876–8884, 2019. 13 [193] Shuai Shao, Zeming Li, T ianyuan Zhang, Chao Peng, Gang Y u, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In Pr oceedings of the IEEE/CVF International Confer ence on Computer V ision , pages 8430–8439, 2019. 
12 [194] Zhihong Shao, Peiyi W ang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y ang W u, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv pr eprint arXiv:2402.03300 , 2024. 9 [195] Jianhao Shen, Y e Y uan, Srbuhi Mirzoyan, Ming Zhang, and Chenguang W ang. Measuring vision-language stem skills of neural models. arXiv pr eprint arXiv:2402.17205 , 2024. 13 [196] W enhao Shi, Zhiqiang Hu, Y i Bin, Junhua Liu, Y ang Y ang, See-Kiong Ng, Lidong Bing, and Roy Ka-W ei Lee. Math- llav a: Bootstrapping mathematical reasoning for multimodal lar ge language models. arXiv pr eprint arXiv:2406.17294 , 2024. 12 [197] W enhao Shi, Zhiqiang Hu, Y i Bin, Junhua Liu, Y ang Y ang, See-Kiong Ng, Lidong Bing, and Roy Ka-W ei Lee. Math- llav a: Bootstrapping mathematical reasoning for multimodal lar ge language models. arXiv pr eprint arXiv:2406.17294 , 2024. 12 [198] shibing624. sharegpt gpt4. https://huggingface.co/datasets/shibing624/sharegpt_gpt4 , 2023. 12 [199] Amanpreet Singh, V iv ek Natarajan, Meet Shah, Y u Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. T owards vqa models that can read. In Pr oceedings of the IEEE/CVF confer ence on computer vision and pattern r ecognition , pages 8317–8326, 2019. 12 [200] Amanpreet Singh, V iv ek Natarajan, Meet Shah, Y u Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. T owards vqa models that can read. In Pr oceedings of the IEEE/CVF Confer ence on Computer V ision and P attern Recognition , pages 8317–8326, 2019. 19 , 27 [201] sr5434. Codegebragpt data. https://huggingface.co/datasets/sr5434/CodegebraGPT_data , 2024. 12 [202] Jianlin Su. Transformer upgrade path: 4. rotary position encoding for two-dimensional positions, May 2021. 6 [203] Alane Suhr and Y oav Artzi. Nlvr2 visual bias analysis. arXiv preprint , 2019. 12 [204] Hamed Rahimi Sujet AI, Allaa Boutaleb . 
Sujet-finance-qa-vision-100k: A large-scale dataset for financial document vqa, 2024. 12 [205] Hai-Long Sun, Da-W ei Zhou, Y ang Li, Shiyin Lu, Chao Y i, Qing-Guo Chen, Zhao Xu, W eihua Luo, Kaifu Zhang, De-Chuan Zhan, et al. Parrot: Multilingual visual instruction tuning. arXiv preprint , 2024. 24 [206] Quan Sun, Y ufeng Cui, Xiaosong Zhang, F an Zhang, Qiying Y u, Y ueze W ang, Y ongming Rao, Jingjing Liu, T iejun Huang, and Xinlong W ang. Generativ e multimodal models are in-context learners. In Pr oceedings of the IEEE/CVF Confer ence on Computer V ision and P attern Recognition , pages 14398–14409, 2024. 9 , 10 [207] L ynette Y ihui T an. Spot the diff. https://huggingface.co/datasets/Lancelot53/ spot- the- diff , 2013. 12 [208] Benny J T ang, Angie Boggust, and Arvind Satyanarayan. V istext: A benchmark for semantically rich chart captioning. arXiv pr eprint arXiv:2307.05356 , 2023. 12 [209] Jingqun T ang, Qi Liu, Y ongjie Y e, Jinghui Lu, Shu W ei, Chunhui Lin, W anqing Li, Mohamad Fitri Faiz Bin Mahmood, Hao Feng, Zhen Zhao, et al. Mtvqa: Benchmarking multilingual text-centric visual question answering. arXiv pr eprint arXiv:2405.11985 , 2024. 12 , 24 [210] Gemini T eam, Rohan Anil, Sebastian Borgeaud, Y onghui W u, Jean-Baptiste Alayrac, Jiahui Y u, Radu Soricut, Johan Schalkwyk, Andre w M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv pr eprint arXiv:2312.11805 , 2023. 
[211] Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report, 2025.
[212] MLC team. MLC-LLM: Universal LLM deployment engine with ML compilation. https://github.com/mlc-ai/mlc-llm, 2023-2024.
[213] Theonewhomadethings. fsc147-controlnet. https://huggingface.co/datasets/Theonewhomadethings/fsc147-controlnet, 2024.
[214] Rubèn Tito, Dimosthenis Karatzas, and Ernest Valveny. Hierarchical multimodal transformers for multi-page DocVQA. arXiv preprint arXiv:2212.05935, 2022.
[215] toghrultahirov. handwritten_text_ocr. https://huggingface.co/datasets/toghrultahirov/handwritten_text_ocr, 2024.
[216] Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs. arXiv preprint arXiv:2406.16860, 2024.
[217] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.
[218] Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He, Yifeng Geng, and Xuansong Xie. AnyText: Multilingual visual text generation and editing. 2023.
[219] Google DeepMind. Gemini 2.5 Pro. https://deepmind.google/technologies/gemini/pro/, 2025.
[220] vikhyatk. st-vqa. https://huggingface.co/datasets/vikhyatk/st-vqa, 2024.
[221] VLM-Perception. HME100k-400.
https://huggingface.co/datasets/VLM-Perception/HME100k-400, 2025.
[222] Bin Wang, Zhuangcheng Gu, Guang Liang, Chao Xu, Bo Zhang, Botian Shi, and Conghui He. UniMERNet: A universal network for real-world mathematical expression recognition, 2024.
[223] Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, and Yang Li. Screen2Words: Automatic mobile UI summarization with multimodal learning. In The 34th Annual ACM Symposium on User Interface Software and Technology, pages 498–510, 2021.
[224] Fei Wang, Xingyu Fu, James Y. Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, et al. MuirBench: A comprehensive benchmark for robust multi-image understanding. arXiv preprint arXiv:2406.09411, 2024.
[225] Junjie Wang, Yin Zhang, Yatai Ji, Yuxiang Zhang, Chunyang Jiang, Yubo Wang, Kang Zhu, Zekun Wang, Tiezhen Wang, Wenhao Huang, et al. PIN: A knowledge-intensive dataset for paired and interleaved multimodal documents. arXiv preprint arXiv:2406.13923, 2024.
[226] Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with MATH-Vision dataset. arXiv preprint arXiv:2402.14804, 2024.
[227] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
[228] Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, and Jifeng Dai. Enhancing the reasoning ability of multimodal large language models via mixed preference optimization. arXiv preprint arXiv:2411.10442, 2024.
[229] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint, 2025.
[230] Weiyun Wang, Yiming Ren, Haowen Luo, Tiantong Li, Chenxiang Yan, Zhe Chen, Wenhai Wang, Qingyun Li, Lewei Lu, Xizhou Zhu, et al. The All-Seeing Project V2: Towards general relation comprehension of the open world. In European Conference on Computer Vision, pages 471–490. Springer, 2024.
[231] Xiyao Wang, Zhengyuan Yang, Chao Feng, Hongjin Lu, Linjie Li, Chung-Ching Lin, Kevin Lin, Furong Huang, and Lijuan Wang. SoTA with less: MCTS-guided sample selection for data-efficient visual reasoning self-improvement. arXiv preprint arXiv:2504.07934, 2025.
[232] Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Chunyi Li, Wenxiu Sun, Qiong Yan, Guangtao Zhai, and Weisi Lin. Q-Bench: A benchmark for general-purpose foundation models on low-level vision. In ICLR, 2024.
[233] Qinzhuo Wu, Weikai Xu, Wei Liu, Tao Tan, Jianfeng Liu, Ang Li, Jian Luan, Bin Wang, and Shuo Shang. Mobile-VLM: A vision-language model for better intra- and inter-UI understanding. arXiv preprint, 2024.
[234] Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. OS-Atlas: A foundation action model for generalist GUI agents. arXiv preprint arXiv:2410.23218, 2024.
[235] Renqiu Xia, Bo Zhang, Haoyang Peng, Hancheng Ye, Xiangchao Yan, Peng Ye, Botian Shi, Junchi Yan, and Yu Qiao. StructChart: Perception, structuring, reasoning for visual chart understanding. arXiv preprint, 2023.
[236] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis.
Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023.
[237] Yijia Xiao, Edward Sun, Tianyu Liu, and Wei Wang. LogicVista: Multimodal LLM logical reasoning benchmark in visual contexts. arXiv preprint, 2024.
[238] Baojiao Xiong, Boheng Chen, Chengzhi Wang, Daxiong Luo, Dongsheng Xu, Dongyang Liu, Fan Yang, Fangyuan Li, Fei Teng, Feng Wang, et al. BlueLM-2.5-3B technical report. arXiv preprint, 2025.
[239] Ruyi Xu, Yuan Yao, Zonghao Guo, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, Maosong Sun, and Gao Huang. LLaVA-UHD: An LMM perceiving any aspect ratio and high-resolution images. arXiv preprint arXiv:2403.11703, 2024.
[240] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint, 2025.
[241] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024.
[242] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, and Fei Huang. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
[243] Jianxin Yang. LongQLoRA: Efficient and effective method to extend context length of large language models. arXiv preprint arXiv:2311.04879, 2023.
[244] Yuhao Yang, Yue Wang, Dongxu Li, Ziyang Luo, Bei Chen, Chao Huang, and Junnan Li. Aria-UI: Visual grounding for GUI instructions, 2025.
[245] Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. MiniCPM-V: A GPT-4V level MLLM on your phone. arXiv preprint, 2024.
[246] Yeenyi. ner_sentiment_analysis_sharegpt. https://huggingface.co/datasets/Yeenyi/ner_sentiment_analysis_sharegpt, 2024.
[247] Weijie Yin, Yongjie Ye, Fangxun Shu, Yue Liao, Zijian Kang, Hongyuan Dong, Haiyang Yu, Dingkang Yang, Jiacong Wang, Han Wang, et al. SAIL-VL2 technical report. arXiv preprint, 2025.
[248] Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, Jiayi Lei, Quanfeng Lu, Runjian Chen, Peng Xu, Renrui Zhang, Haozhe Zhang, Peng Gao, Yali Wang, Yu Qiao, Ping Luo, Kaipeng Zhang, and Wenqi Shao. MMT-Bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask AGI. arXiv preprint arXiv:2404.16006, 2024.
[249] Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, and Zhe Gan. Ferret-UI: Grounded mobile UI understanding with multimodal LLMs. In European Conference on Computer Vision, pages 240–255. Springer, 2024.
[250] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, and Tamara L. Berg. Modeling context in referring expressions. In European Conference on Computer Vision, pages 69–85, 2016.
[251] Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. MetaMath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023.
[252] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. MM-Vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint, 2023.
[253] Yijiong Yu, Ziyun Dai, Zekun Wang, Wei Wang, Ran Chen, and Ji Pei. OpenCSG Chinese corpus: A series of high-quality Chinese datasets for LLM training, 2025.
[254] Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen.
MAmmoTH: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653, 2023.
[255] Xiang Yue, Yueqi Song, Akari Asai, Seungone Kim, Jean de Dieu Nyandwi, Simran Khanuja, Anjali Kantharuban, Lintang Sutawika, Sathyanarayanan Ramamoorthy, and Graham Neubig. Pangea: A fully open multilingual multimodal LLM for 39 languages. arXiv preprint arXiv:2410.16153, 2024.
[256] Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Ming Yin, Botao Yu, Ge Zhang, et al. MMMU-Pro: A more robust multi-discipline multimodal understanding benchmark. arXiv preprint arXiv:2409.02813, 2024.
[257] Chi Zhang, Feng Gao, Baoxiong Jia, Yixin Zhu, and Song-Chun Zhu. RAVEN: A dataset for relational and analogical visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[258] Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, Bowen Zhang, Yanghao Li, et al. MM1.5: Methods, analysis & insights from multimodal LLM fine-tuning. arXiv preprint arXiv:2409.20566, 2024.
[259] Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, William Yang Wang, Shih-Fu Chang, Zhe Gan, et al. Ferret-v2: An improved baseline for referring and grounding with large language models. arXiv preprint arXiv:2404.07973, 2024.
[260] Jiwen Zhang, Jihao Wu, Yihua Teng, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, and Duyu Tang. Android in the Zoo: Chain-of-action-thought for GUI agents. arXiv preprint arXiv:2403.02713, 2024.
[261] Lefan Zhang, Xiaodan Wang, Yanhua Huang, and Ruiwen Xu. Learning harmonized representations for speculative sampling. arXiv preprint arXiv:2408.15766, 2024.
[262] Peng Zhang, Can Li, Liang Qiao, Zhanzhan Cheng, Shiliang Pu, Yi Niu, and Fei Wu. VSR: A unified framework for document layout analysis combining vision, semantics and relations. In Document Analysis and Recognition, ICDAR 2021, Part I, pages 115–130. Springer, 2021.
[263] Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Yu Qiao, et al. MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems? In European Conference on Computer Vision, pages 169–186. Springer, 2025.
[264] Wenqi Zhang, Hang Zhang, Xin Li, Jiashuo Sun, Yongliang Shen, Weiming Lu, Deli Zhao, Yueting Zhuang, and Lidong Bing. 2.5 years in class: A multimodal textbook for vision-language pretraining. arXiv preprint arXiv:2501.00958, 2025.
[265] Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. PMC-VQA: Visual instruction tuning for medical visual question answering. arXiv preprint, 2023.
[266] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2O: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661–34710, 2023.
[267] Bo Zhao, Boya Wu, and Tiejun Huang. SVIT: Scaling up visual instruction tuning. arXiv preprint, 2023.
[268] Xiangyu Zhao, Shengyuan Ding, Zicheng Zhang, Haian Huang, Maosong Cao, Weiyun Wang, Jiaqi Wang, Xinyu Fang, Wenhai Wang, Guangtao Zhai, et al. OmniAlign-V: Towards enhanced alignment of MLLMs with human preference. arXiv preprint arXiv:2502.18411, 2025.
[269] Yilun Zhao, Yunxiang Li, Chenying Li, and Rui Zhang. MultiHiertt: Numerical reasoning over multi hierarchical tabular and textual data. arXiv preprint, 2022.
[270] Yilun Zhao, Chen Zhao, Linyong Nan, Zhenting Qi, Wenlin Zhang, Xiangru Tang, Boyu Mi, and Dragomir Radev. RobuT: A systematic study of table QA robustness against human-annotated adversarial perturbations. arXiv preprint arXiv:2306.14321, 2023.
[271] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36, 2024.
[272] Yiyang Zhou, Chenhang Cui, Rafael Rafailov, Chelsea Finn, and Huaxiu Yao. Aligning modalities in vision large language models via preference fine-tuning. arXiv preprint arXiv:2402.11411, 2024.
[273] Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua. TAT-QA: A question answering benchmark on a hybrid of tabular and textual content in finance. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3277–3287, Online, August 2021. Association for Computational Linguistics.
[274] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models, 2025.
[275] Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimodal C4: An open, billion-scale corpus of images interleaved with text. Advances in Neural Information Processing Systems, 36:8958–8974, 2023.
[276] Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. DynaMath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models. arXiv preprint arXiv:2411.00836, 2024.

| Category | APP Names |
| --- | --- |
| Shopping | Alibaba, Dewu, JD, Pinduoduo, Taobao, Taote, Xianyu, Vipshop |
| Transportation | Baidu Maps, Amap, Tencent Maps, Hello, Didi, Traffic 12123, Railway 12306 |
| Lifestyle Services | Meituan, Dazhong Dianping, Ele.me, Meituan Waimai, Ctrip, Qunar, SF Express |
| Automotive | Dongchedi, Autohome |
| Telecommunications | State Grid Online, China Telecom, China Unicom, China Mobile |
| Video | Tencent Video, iQIYI, Bilibili, Youku, Kuaishou, Douyin, Migu Video, Tencent Animation, Hongguo Short Drama |
| Social Media | Toutiao, Weibo, WeChat, Xiaohongshu, Douban, Zhihu, Baidu Tieba, Momo, Facebook, YouTube |
| Gaming | Xiaohonghe, League of Legends Mobile, Happy Match |
| Music | NetEase Cloud Music, Ximalaya |
| Fitness | Keep |
| Tools | Tianyancha, Quark, Cloud Flash Pay, Industrial and Commercial Bank of China, 58 City, Meitu Xiuxiu |
| OPPO Built-in Apps | Settings, Phone Migration, Xiaobu Assistant, Clock, Weather, Calendar, Notes, Calculator, Compass, Camera, Recorder, Album, Music, OPPO Video, Reader, Contacts, Dialer, Messages, Mini Games, Game Center, Wallet, Cloud Services, My OPPO, OPPO Store, Main App Store |

Table 21: App list from the Andes-UI dataset collection.

| Data Type | Training Set | Test Set |
| --- | --- | --- |
| Screenshots | 13,002 | 455 |
| Referring data count | 226,901 | 8,642 |
| Grounding data count | 185,968 | 7,194 |
| Overall descriptive data | 13,002 | 455 |
| Natural Q&A pairs | 107,688 | 1,181 |

Table 22: AndesUI dataset statistics.

A Contributors

Zhiwei Jin, Xiaohui Song, Nan Wang, Yafei Liu, Chao Li, Xin Li, Ruichen Wang, Zhihao Li, Qi Qi, Long Cheng, Dongze Hao, Quanlong Zheng, Yanhao Zhang, Haobo Ji, Jian Ma, Zhitong Zheng, Zhenyi Lin, Haolin Deng, Xin Zou, Xiaojie Yin, Ruilin Wang, Liankai Cai, Haijing Liu, Yuqing Qiu, Ke Chen, Zixian Li, Chi Xie, Huafei Li, Chenxing Li, Chuangchuang Wang, Kai Tang, Zhiguang Zhu, Kai Tang, Wenmei Gao, Rui Wang, Jun Wu, Chao Liu, Qin Xie, Chen Chen¹, Haonan Lu²

¹ chenchen4@oppo.com
² luhaonan@oppo.com

B AndesUI Dataset

In this section, we provide a comprehensive presentation of the AndesUI dataset construction pipeline, including the data collection process, human annotation, and data generation.

Selection of APPs. We collected a total of 90 APPs: 65 popular, widely downloaded APPs from the OPPO Software Store, covering a wide range of categories commonly used by users, along with 25 ColorOS pre-installed APPs. These APPs are listed in Table 21.

Screenshot Data Collection. For each APP, we instructed annotators to capture screenshots of diverse pages within the app, ensuring that each screenshot had a distinct layout and content. If two screenshots had similar layout structures but differed solely in text and images, they were classified as homogeneous interfaces. Our objective was to maximize diversity within the dataset while covering all typical interfaces of the app. Depending on the degree of homogeneity, we collected between 1 and 10 screenshots for each heterogeneous page. For example, in the Xiaohongshu post interface, the display of different users' posts is similar enough to be regarded as a homogeneous page; however, since some posts include images while others do not, we collected additional screenshots from this homogeneous interface.
Throughout the screenshot collection process, we focused on capturing various atypical scenarios, including network interruptions and pop-ups (advertisement, log-in, confirmation, and phone pop-ups). For the training dataset, we collected a total of 10,747 screenshots from third-party apps and 2,255 screenshots from system pre-installed apps. The test set contained a total of 455 screenshots, which were kept heterogeneous to reduce duplicate and similar pages. All detailed statistics of the dataset are shown in Table 22.

Annotation of Widgets. Our objective was to provide annotations for all widgets present within each screenshot. This included delineating bounding boxes, identifying widget types, recording any text on the widgets (when available), and indicating whether they are clickable, among other details. For this process, we employed the VIA-2.0.12 tool [55]. Annotating all widgets manually from scratch is a labor-intensive endeavor; hence, we initially used Qwen2-VL-72B [227] to generate preliminary annotations on each screenshot, converting them into a JSON format compatible with VIA. Annotators then modified and refined these annotations. On average, each interface contained 18 widgets. The training dataset contained a total of 226,901 widgets, while the test dataset included 9,068 widgets. Examples of labeled widgets are provided in Fig. 9.

We needed to construct both basic and advanced data. Basic data includes grounding and referring data, while advanced data consists of comprehensive descriptive data and natural question-answer pairs. Basic data can be generated programmatically: for each widget, a single grounding data entry and a single referring data entry are generated.
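The per-widget generation step can be sketched as follows. This is a minimal illustration, not the actual AndesUI pipeline: the widget schema, question wording, and screen size are hypothetical, and only the question/answer templates and the coordinate scaling by 10,000 follow the dataset description.

```python
# Sketch of programmatic grounding/referring generation from one widget
# annotation. Field names ("name", "box") and the screen size are assumed,
# not the real AndesUI schema.

def normalize_box(box, width, height, scale=10000):
    """Map pixel coordinates (x1, y1, x2, y2) to integers in [0, scale]."""
    x1, y1, x2, y2 = box
    return (round(x1 / width * scale), round(y1 / height * scale),
            round(x2 / width * scale), round(y2 / height * scale))

def make_entries(widget, width, height):
    """Build one grounding and one referring Q&A entry for a widget."""
    x1, y1, x2, y2 = normalize_box(widget["box"], width, height)
    box_str = f"<|box_start|>({x1}, {y1}),({x2}, {y2})<|box_end|>"
    grounding = {
        "question": f"Can you tell me the coordinates of the widget named '{widget['name']}'?",
        "answer": box_str,
    }
    referring = {
        "question": f"What is the widget located within the bounding box {box_str}?",
        "answer": widget["name"],
    }
    return grounding, referring

# Example: a hypothetical "send" widget on an illustrative screen size.
g, r = make_entries({"name": "send", "box": (3212, 1047, 3550, 2242)}, 3600, 2400)
```

In practice a templated question would be drawn at random from the seed library rather than fixed as above.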
As an illustration, for a "send" widget with coordinates [3212, 1047, 3550, 2242], the associated grounding and referring data are:

• Question: "Can you tell me the coordinates of the widget named 'send'?" Answer: "<|box_start|>(3212, 1047),(3550, 2242)<|box_end|>"
• Question: "What is the widget located within the bounding box <|box_start|>(3212, 1047),(3550, 2242)<|box_end|>?" Answer: "send"

Consequently, the training dataset produced 226,901 referring entries and 185,968 grounding entries. The test dataset included 7,194 grounding entries and 8,642 referring entries; grounding entries are fewer because a single page can contain several widgets sharing the same name. The questions were randomly selected from a seed library: initially, GPT-4 was employed to create 100 different question formulations. The bounding box coordinates were normalized and then scaled by a factor of 10,000.

Generation of Advanced Data. For comprehensive descriptive data, each screenshot was analyzed by aggregating the details of individual widgets. GPT-4 was then utilized to generate a detailed description of the current page, including the theme, function, spatial arrangement of principal widgets, and a general usage guide for the interface. For natural question-answer pairs, we constructed several pairs for each screenshot. To achieve this, we used the information from each widget to instruct GPT-4 to formulate approximately ten question-answer pairs emulating possible user inquiries during real-world use. The natural question-answer pairs were divided into four categories: descriptive questions, locating questions, interaction questions, and questions regarding natural scenes. They can also be classified by difficulty level: easy, medium, and hard.
Questions classified as "easy" can be immediately answered, whereas "hard" questions might necessitate reasoning or multiple steps to resolve. Initially, GPT-4 was employed to generate several preliminary questions, which were then refined by annotators. Ultimately, we generated 107,688 natural question-answer pairs for the training set and constructed 1,181 pairs for the test set.

Figure 9: Examples of widget labels in the AndesUI dataset.

Below is the system prompt used to guide GPT-4 in generating the natural question-answer pairs:

You are an AI visual assistant capable of analyzing mobile screens. You will receive a screenshot from the {app_info} app of the {page_description} page, along with a string representation of a widget dictionary. Each element in the dictionary is a dictionary that represents a UI widget, where the key is the widget number and the value contains information about the widget, including its bounding box coordinates, widget type, and widget description. The bounding box coordinates are represented as (x1, y1, x2, y2), with floating-point values ranging from 0 to 1. Based on the provided text and coordinates, please design several simulated question-and-answer dialogues that represent interactions between the user and the system. These dialogues should focus on the user's potential actions on the screen (rather than perceptions). The questions you create should be divided into three levels of difficulty: easy, medium, and hard. Easy questions can be answered directly from the widget dictionary. Medium and hard questions require some reasoning. The questions can also be categorized into four content types: descriptive questions, locating questions, interaction questions, and natural scene questions. Here are four examples for reference; they may not be related to this image, and please do not restrict yourself to these few questions.
Examples of descriptive questions: Can you describe the function of widget_23? / How many video list items are there in this screenshot?
Examples of locating questions: How do I access the creation page from the current page? / How can I view detailed information about the fourth video?
Examples of interaction questions: Can widget_2 be swiped?
Examples of natural scene questions: How can I search for the latest movies? / How do I share the second video on social media?
For non-descriptive questions, you do not need to specify the type information of the widget in your responses. When creating dialogues involving specific widgets, please strictly refer to the widget number (e.g., "widget_3") rather than using the bounding box coordinates. This is necessary for me to use this data for function calls, so a clear reference is required.
Your response format should be: [{"User":"...","System":"...","Difficulty Level":"","Content Category":""},...]

C Qualitative Examples

In this section, a series of qualitative examples illustrates various capabilities of the AndesVL model through responses generated by AndesVL-4B. Table 23 shows strong image understanding; Tables 24, 25, and 26 demonstrate multi-image perception; Table 27 shows the basic reasoning capability of AndesVL; and Tables 28 and 29 illustrate document understanding and OCR capabilities. Examples demonstrating AndesVL-4B-Thinking's visual reasoning capability are provided in Tables 30, 31, and 32. Table 30 shows AndesVL-4B-Thinking solving a geometry problem, and Table 32 illustrates the model's capability to handle reasoning tasks related to charts. UI understanding and reasoning cases of AndesVL-4B are shown in Tables 33 and 34.
Prompt: 详细描述一下这张图片的内容。
Model Response: 这张图片展示了一个夜晚的场景。在图片的右侧，有一只猫，它正坐在一根树枝上，猫的轮廓清晰地映衬在背景的明亮光线中。这只猫的姿势看起来很放松，它正面对着树枝的另一端。背景中，有一个巨大的满月，月光呈现出橙红色调，与夜晚的黑色背景形成鲜明对比。月球的表面细节不是很清晰，但可以看到一些暗淡的云状纹理。整个场景给人一种宁静而神秘的感觉。

Table 23: A dense captioning example of AndesVL.

Prompt: What are these attractions? Please give a short description of the attractions and output in Markdown format.
Model Response:
- **Left**: The Pyramids of Giza, located in Egypt, are the largest and oldest of the three pyramids in the Giza pyramid complex. They were built as tombs for the Pharaohs of Ancient Egypt.
- **Right**: The Great Wall of China is an ancient series of walls and fortifications that was built to protect China from invasions. It stretches for over 13,000 miles and is one of the most famous landmarks in the world.

Table 24: A knowledge question answering example of AndesVL.

Prompt: Here is one math problem with its ground-truth answer, the solving of which requires interpreting the paired diagram. Please select the correct paired math diagram from the four given images. Question: As shown in the figure, the straight line a is parallel to b, then the degree of angle 1 is ().
# Choices:
- A: 75°
- B: 105°
- C: 140°
- D: 145°
Answer the question first and give a brief explanation of your choice.
Model Response: B: 105°. The given angles are 40° and 35°. Since line a is parallel to line b, angle 2 is equal to 40°. Therefore, angle 1 is the sum of angles 2 and 35°, which is 40° + 35° = 75°. However, angle 1 is an exterior angle to the triangle formed by angles 35° and 40°, so the exterior angle theorem tells us that angle 1 is equal to 180° - 75° = 105°.

Table 25: An example of multi-image understanding of AndesVL.

Prompt: A point is circled on the first image, labeled with REF.
We change the camera position or lighting and shoot the second image. You are given multiple red-circled points on the second image; choices of "A, B, C, D" are drawn beside each circle. Which point on the second image corresponds to the point in the first image?
# Select from the following options first and briefly explain the reason:
- A: Point A
- B: Point B
- C: Point C
- D: Point D
Model Response: B: Point B. The first image's REF point is in the center of the ski lift area, and Point B in the second image is in the same relative position, maintaining the same perspective and orientation.

Table 26: An example of multi-image understanding of AndesVL.

Prompt: Look at this sequence of three shapes. What shape should come as the fourth shape? Explain your reasoning with detailed descriptions of the first shapes.
Model Response: The sequence consists of a triangle, a square, and a pentagon. The fourth shape in the sequence should be a hexagon. The reasoning is as follows: each shape in the sequence has one more side than the previous shape. The triangle has three sides, the square has four sides, and the pentagon has five sides. Therefore, the next shape should have six sides, which is a hexagon.

Table 27: A visual reasoning example of AndesVL.

Prompt: Make a detailed summary of the image.
Model Response: This image is an infographic titled "How Brilliant Women Make Business Bloom." It highlights the growth and impact of women-owned businesses in the United States. Here's a detailed summary of the content:
1. Title and Introduction:
· The infographic begins with the title "How Brilliant Women Make Business Bloom."
· A brief introduction mentions that women entrepreneurs are making significant strides in the marketplace. It states that women-owned businesses grew by 44% between 1997 and 2007 ...
2. Where Girls Rule:
· Women-Owned Businesses by State:
· California: 1,000,000 businesses (13.3% of total) ...
· Women-Owned Businesses by Metro Area:
· Los Angeles - Long Beach - Santa Ana: 410,000 businesses. ...
3. Industries: The infographic lists the top 5 industries with the most women-owned businesses:
1. Other Services (not public administration): 1,300,000 businesses. ...
4. Confronting the Challenges: ...
Additional Information:
· The infographic also includes a note on financial tips and deals, mentioning CreditDonkey.com, which helps small business owners with "donkey-proof" financial advice. ...
Visual Elements: ...

Prompt: Which state has the second highest number of women-owned businesses? Answer the question with a detailed explanation.
Model Response: According to the text, Texas has 610,000 women-owned businesses, which is the second highest number after California. The text also provides percentages for the number of women-owned businesses in different industries and metro areas. However, the question specifically asks for the state with the second highest number of women-owned businesses, which is Texas.

Table 28: Demonstration of AndesVL's ability to accurately extract and comprehend texts and visual elements from an image, understand the user's query, and retrieve the answer from the document.

Prompt: Parse the image into a markdown table.
Model Response:
```markdown
|10%氯化钾注射液（哈森）（省集采2）（高警示药品）|10ml|1支|0.1900|0.19|
|5%葡萄糖注射液（威高）|250ml|1袋|4.2600|4.26|
|一次性使用无菌注射器|1.2mm(1.2*38TWLB)100支*6盒/箱|1支|0.64|0.64|
|碳酸氢钠注射液（哈森）（省集采2）|10ml:0.5g|1支|0.4000|0.40|
|一次性使用无菌注射器|1.2mm(1.2*38TWLB)100支*6盒/箱|2支|0.64|1.28|
|浓氯化钠注射液（高警示药品）|10ml:1g|1支|0.7040|0.70|
|静脉输液（2瓶以上每瓶加收）|1|2次|2.00|2.00|
|一次性使用输液器（超低密度聚乙烯）|15支*20包/箱|1支|5.00|5.00|
|5%葡萄糖注射液（威高）|250ml|1袋|4.2600|4.26|
|静脉输液|4|1组|4.00|4.00|
|0.9%氯化钠注射液（威高）|100ml|1瓶|4.8900|4.89|
|注射用奥美拉唑钠（美斯汀）（国集采7）|40mg*1瓶/瓶|1瓶|0.8400|0.84|
|浓氯化钠注射液（高警示药品）|10ml:1g|1支|0.7040|0.70|
|维生素B6注射液（省集采1）|2ml:0.1g|1支|0.4300|0.43|
|合计:|29.59|
```

Table 29: An example of extracting multilingual texts from a long receipt.

Prompt: What is the volume of this azure right square pyramid?

Model Response: Got it, let's see. The problem is to find the volume of a right square pyramid. First, I need to recall the formula for the volume of a pyramid. The volume V of a pyramid is (1/3) × base area × height. First, identify the base. The base is a square, so the base area is the side length squared. From the diagram, the side length of the square base is 8. The height h of the pyramid is given as 7. So, base area = 8 × 8 = 64. Then, volume = (1/3) × 64 × 7. ... So the answer is 149.33 (or 149.333..., but let's present it as a floating number).

To find the volume of a right square pyramid, we use the formula: V = (1/3) × base area × height.
Step 1: Determine the base area. The base of the pyramid is a square with side length 8. The area of a square is side², so: base area = 8 × 8 = 64.
Step 2: Identify the height of the pyramid. From the diagram, the height h of the pyramid is 7.
Step 3: Calculate the volume. Substitute the base area and height into the volume formula: V = (1/3) × 64 × 7. First, multiply 64 × 7 = 448. Then divide by 3: V = 448/3 ≈ 149.333...
Step 4: Present the result as a floating number. Rounding to two decimal places, the volume is 149.33.
The answer is 149.33.

Table 30: An example of solving a geometry problem.

Prompt: What is the value of (e - f) / g?

Model Response: Got it, let's try to solve this problem. First, we need to find the value of (e - f)/g. Let's list out the given equations: 1. e - f = 8; 2. g * h = 3; 3. e - f + g = 11. First, let's find the value of g. From equation 3: e - f + g = 11. We know from equation 1 that e - f = 8. So substitute that into equation 3: 8 + g = 11. Solving for g, we get g = 11 - 8 = 3. So g is 3. Now, we need to find (e - f)/g. We know e - f is 8 (from equation 1) and g is 3. So substitute those values in: 8 / 3. Let's compute that. 8 divided by 3 is approximately 2.666..., which as a floating number is 2.666666... (repeating). But we need to present it as a floating number. Let's check if we need to round or keep the exact value. Since it's a math problem, 8/3 is approximately 2.666..., so we can write it as 2.6667 (if we round to four decimal places) or just 2.666666..., but let's confirm. Wait, let's make sure we didn't miss anything. Let's check the other equations. ... The problem says "convert the final result as a floating number". So 8 divided by 3 is approximately 2.666666..., so we can write it as 2.666666... or 2.6667. Let's go with 2.666666... as the floating number.

To solve for the value of (e - f)/g, we analyze the given equations step by step:
Step 1: Find the value of g. We know: e - f = 8 (from the first equation), and e - f + g = 11 (from the third equation).
Substitute e - f = 8 into e - f + g = 11: 8 + g = 11. Solve for g: g = 11 - 8 = 3.
Step 2: Calculate (e - f)/g. ...
The answer is 2.666666.

Table 31: An example of solving an algebra problem.

Prompt: Find the value of z(z*).

Model Response: Got it, let's try to solve this problem. First, we need to find the value of z multiplied by its complex conjugate z*. Let's start by identifying the complex number z from the graph. Looking at the graph, the point (-2, 3i) corresponds to the complex number z = -2 + 3i. ... Alternatively, using the formula (a + bi)(a - bi) = a² + b². Here, a = -2, b = 3. So a² = (-2)² = 4, b² = 3² = 9. Then 4 + 9 = 13. So the value of z(z*) is 13.

To solve for z(z*), follow these steps:
Step 1: Identify the complex number z from the graph. The graph shows a point at (-2, 3i). For a complex number z = a + bi, the real part a is the x-coordinate, and the imaginary part b is the y-coordinate. Thus, z = -2 + 3i.
Step 2: Find the complex conjugate z*. The complex conjugate of z = a + bi is z* = a - bi. For z = -2 + 3i, the conjugate is z* = -2 - 3i.
Step 3: Calculate z(z*). For any complex number z = a + bi, the product z(z*) simplifies to a² + b² (derived from multiplying (a + bi)(a - bi)). For z = -2 + 3i: the real part is a = -2, so a² = (-2)² = 4; the imaginary part is b = 3, so b² = 3² = 9. Adding these: a² + b² = 4 + 9 = 13.
The answer is 13.

Table 32: An example of solving an analytic geometry problem.

Prompt: Is there a pop-up window on the current page?

Model Response: Yes, there is a pop-up window on the current page. The control at (1741,7204),(8398,7858) is a text button described as "立即抢购" (Buy Now).

Prompt: How do I close the pop-up on the current page?

Model Response: You can tap the control at (4509,8167),(5565,8533); it is an icon described as "关闭" (Close).

Prompt: I want to find a tablet; what should I do?

Model Response: You can tap the control at (472,5638),(2000,6450); it is a text button described as "平板电脑" (Tablet).

Prompt: I want to check my orders; what should I do?
Model Response: You can tap the control at (8241,9371),(9250,9988); it is a multi-tab (Multi-Tab) control described as "我的" (My Account). After entering it, you can view your orders.

Table 33: An example of mobile UI understanding and reasoning.

Prompt: What page is this?

Model Response: This is the playback page of a music app, used to control and play music.

Prompt: What music is currently playing?

Model Response: The music currently playing is "别怕我伤心".

Prompt: How do I skip to the next track?

Model Response: Tap the control at (6376,8752),(7342,9218).

Table 34: An example of mobile UI understanding and reasoning.
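The worked arithmetic in the model responses of Tables 30–32 (the pyramid volume 448/3, the ratio 8/3, and the product z·z* for z = -2 + 3i) can be spot-checked with a few lines of Python. This is an illustrative check of the numbers only, not part of AndesVL's evaluation pipeline:

```python
# Table 30: volume of a right square pyramid, base side 8, height 7.
volume = (1 / 3) * 8 * 8 * 7
assert round(volume, 2) == 149.33

# Table 31: (e - f) / g with e - f = 8 and g = 11 - 8 = 3.
ratio = 8 / 3
assert abs(ratio - 2.666666) < 1e-5

# Table 32: z * conj(z) for z = -2 + 3i equals a^2 + b^2 = 4 + 9.
z = complex(-2, 3)
product = (z * z.conjugate()).real
assert product == 13.0

print(round(volume, 2), round(ratio, 6), product)  # 149.33 2.666667 13.0
```

All three assertions pass, matching the final answers the model reports in the tables.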