HyDRA: Hierarchical and Dynamic Rank Adaptation for Mobile Vision Language Model
📝 Abstract
Vision Language Models (VLMs) have undergone significant advancements, particularly with the emergence of mobile-oriented VLMs, which offer a wide range of application scenarios. However, the substantial computational requirements for training these models present a significant obstacle to their practical application. To address this issue, Low-Rank Adaptation (LoRA) has been proposed. Nevertheless, the standard LoRA with a fixed rank lacks sufficient capability for training mobile VLMs that process both text and image modalities. In this work, we introduce HyDRA, a parameter-efficient fine-tuning framework designed to implement hierarchical and dynamic rank scheduling for mobile VLMs. This framework incorporates two essential optimization strategies: (1) hierarchical optimization, which involves a coarse-grained approach that assigns different ranks to various layers, as well as a fine-grained method that adjusts ranks within individual layers, and (2) dynamic adjustment, which employs an end-to-end automatic optimization using a lightweight performance model to determine and adjust ranks during the fine-tuning process. Comprehensive experiments conducted on popular benchmarks demonstrate that HyDRA consistently outperforms the baseline, achieving a 4.7% improvement across various model sizes without increasing the number of trainable parameters. In some tasks, it even surpasses full-parameter fine-tuning.
📄 Content
Yuanhao Xi¹,²,³†, Xiaohuan Bing¹,²,³†, Ramin Yahyapour²,³∗

¹ Liaoning Technical University, Huludao, China
² University of Göttingen, Göttingen, Germany
³ Gesellschaft für Wissenschaftliche Datenverarbeitung mbH Göttingen, Göttingen, Germany

Index Terms—Instruction Tuning, Rank Adaptation, Mobile Vision Language Model.
I. INTRODUCTION

In recent years, the field of multimodal large language models has experienced rapid development, providing novel solutions to complex tasks spanning various modalities [1], [2]. Some researchers have already applied VLMs to mobile devices, such as MobileVLM [3]. This significantly broadens the application scenarios of VLMs and carries considerable practical value. However, training VLMs requires substantial computational resources [4], [5]. Therefore, developing an efficient fine-tuning methodology specifically tailored for mobile-oriented VLMs is of significant practical importance.

To bridge this gap, techniques such as LoRA [6] have been developed. LoRA and its variants have demonstrated excellent performance in fine-tuning current large language models (LLMs). These variants fall into three categories: the first dynamically adjusts the rank values of LoRA, the second focuses on enhancements to LoRA itself, and the third uses multiple LoRA modules [7], [8]. DyLoRA [9] sorts the representations learned by the adapter module at various ranks during fine-tuning, enabling the LoRA blocks to be trained across a range of ranks rather than at a fixed rank. QLoRA [10] introduces a 4-bit NormalFloat data type to save memory without sacrificing performance.

†Equal contribution. Emails: {yuanhao.xi, xiaohuan.bing}@stud.uni-goettingen.de
∗Corresponding author: Ramin Yahyapour (ramin.yahyapour@gwdg.de)

Fig. 1: An illustration of hierarchical and dynamic rank adaptation. The average gradient norms serve as the basis for assigning the rank of layers. A lightweight performance model determines the optimal set of rank values.
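To make the fixed-rank baseline concrete, the following is a minimal sketch of a LoRA-adapted linear layer: the pretrained weight W stays frozen while a low-rank update (alpha / r) · B · A is trained on top of it. The class name, dimensions, and initialization values here are illustrative, not from the paper.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer with a trainable low-rank update (illustrative sketch).

    The adapted weight is W + (alpha / r) * B @ A, where A is (r x in) and
    B is (out x r); only A and B would receive gradients during fine-tuning.
    """

    def __init__(self, in_features, out_features, rank, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(out_features, in_features))      # frozen pretrained weight
        self.A = rng.normal(scale=0.01, size=(rank, in_features))  # trainable
        self.B = np.zeros((out_features, rank))                    # trainable, zero-init
        self.scale = alpha / rank

    def forward(self, x):
        # Base path plus the scaled low-rank update; because B is zero at
        # initialization, the adapter starts as an exact no-op.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(in_features=8, out_features=4, rank=2)
x = np.ones((1, 8))
out = layer.forward(x)
print(out.shape)  # (1, 4)
```

Standard LoRA uses one `rank` for every adapted layer; HyDRA's premise is that this single hyperparameter is too coarse for models mixing text and image modalities.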
LoRAHub [11] gathers diverse task-specific LoRA modules and autonomously combines suitable ones without human input. However, these methods, designed for LLMs, are not well suited to the instruction-tuning phase of mobile VLMs. Unlike LLMs, which primarily handle text, mobile VLMs process a combination of text and image modalities. Because the text and image modalities exhibit differing sensitivities to the various layers of the model during fine-tuning, a fixed-rank setup yields unsatisfactory performance on downstream tasks. To address these issues, we propose adopting different rank settings for different layers to account for the varying sensitivities of multimodal tasks. As shown in Fig. 1, we introduce HyDRA, which incorporates two key techniques: hierarchical optimization and dynamic adjustment of the rank. In hierarchical optimization, we refine the rank values of different layers based on the uneven distribution of average gradient norms across layers.

arXiv:2512.20674v1 [cs.LG] 20 Dec 2025

[Figure: overall architecture — vision encoder, projector, and large language model; the attention (Q, K, V, O) and FFN (Gate, Up, Down) modules in each layer receive per-layer ranks.]
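The coarse-grained idea above — assigning larger ranks to layers with larger average gradient norms while holding the total rank budget (and thus the trainable-parameter count) roughly fixed — can be sketched as a simple proportional allocation. This is an illustrative heuristic only; it is not HyDRA's actual performance-model scheduler, and the function name and budget are assumptions.

```python
import numpy as np

def allocate_ranks(grad_norms, total_rank_budget, min_rank=1):
    """Coarse-grained rank allocation sketch: give layers with larger average
    gradient norms a larger share of a fixed total rank budget, keeping the
    overall number of trainable parameters roughly constant.

    Note: enforcing min_rank can slightly overshoot the budget when some
    layers have near-zero gradient norms.
    """
    norms = np.asarray(grad_norms, dtype=float)
    weights = norms / norms.sum()          # normalize norms into shares
    ranks = np.maximum(min_rank, np.round(weights * total_rank_budget)).astype(int)
    return ranks.tolist()

# Example: four layers, uneven gradient-norm distribution, total budget 16.
print(allocate_ranks([0.1, 0.4, 0.3, 0.2], total_rank_budget=16))  # → [2, 6, 5, 3]
```

HyDRA replaces this static heuristic with dynamic adjustment: a lightweight performance model re-evaluates and updates the rank set during fine-tuning rather than fixing it once up front.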