AutoQRA: Joint Optimization of Mixed-Precision Quantization and Low-rank Adapters for Efficient LLM Fine-Tuning

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Quantization followed by parameter-efficient fine-tuning has emerged as a promising paradigm for downstream adaptation under tight GPU memory constraints. However, this sequential pipeline fails to exploit the intricate interaction between quantization bit-width and LoRA rank. Specifically, a carefully optimized quantization allocation with low quantization error does not always translate into strong fine-tuning performance, and different bit-width and rank configurations can lead to significantly different outcomes under the same memory budget. To address this limitation, we propose AutoQRA, a joint optimization framework that simultaneously optimizes the bit-width and LoRA rank configuration of each layer during mixed-precision quantized fine-tuning. To tackle the large discrete search space and the high evaluation cost of frequent fine-tuning runs, AutoQRA decomposes the optimization into two stages. First, it conducts a global multi-fidelity evolutionary search whose initial population is warm-started by injecting layer-wise importance priors; this stage employs importance-guided mutation operators and a performance model to efficiently screen candidate configurations. Second, trust-region Bayesian optimization is applied to locally refine promising regions of the search space and identify optimal configurations under the given memory budget. This design enables active compensation for quantization noise in specific layers during training. Experiments show that AutoQRA achieves performance close to full-precision fine-tuning with a memory footprint comparable to uniform 4-bit methods.


💡 Research Summary

The paper introduces AutoQRA, a novel framework that jointly optimizes per‑layer quantization bit‑widths and LoRA adapter ranks for efficient fine‑tuning of large language models (LLMs) under strict GPU memory budgets. The authors first observe that the conventional “quantize‑then‑fine‑tune” pipeline treats quantization and adapter capacity as independent decisions. In practice, low‑bit quantization introduces noise that can be mitigated only if sufficient adapter capacity (rank) is allocated to the affected layers. Consequently, a bit‑width allocation that looks optimal under reconstruction or calibration metrics may still lead to poor downstream performance, and different combinations of bit‑width and rank can produce dramatically different results even when they consume the same amount of memory.

To address this, AutoQRA formulates the problem as a constrained black‑box optimization: maximize validation performance after fine‑tuning subject to a global memory budget, where the decision variables are the per‑layer pairs (qℓ, rℓ) drawn from discrete sets of bit‑widths and ranks. Direct exhaustive search is infeasible because the space grows exponentially with the number of transformer layers.
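As a rough illustration of this formulation, the sketch below computes per-layer memory for a (qℓ, rℓ) pair and checks a global budget. The layer dimensions, candidate sets, fp16 adapter storage, and the choice of a uniform 4-bit configuration as the budget are all illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch of the joint (bit-width, rank) search space and the
# memory constraint described above. All concrete numbers are assumptions.

BITS = [2, 3, 4, 8]          # candidate per-layer bit-widths q_l
RANKS = [4, 8, 16, 32, 64]   # candidate per-layer LoRA ranks r_l
NUM_LAYERS = 32              # e.g. the transformer blocks of a 7B model

def layer_memory_bytes(q, r, d_in=4096, d_out=4096):
    """Quantized weight storage plus a LoRA adapter (A: d_in x r,
    B: r x d_out), with the adapter kept in fp16 (2 bytes/param)."""
    quantized = d_in * d_out * q / 8      # q bits per frozen weight
    adapter = (d_in * r + r * d_out) * 2  # fp16 LoRA factors
    return quantized + adapter

def total_memory(config):
    """config: a list of (q_l, r_l) pairs, one per layer."""
    return sum(layer_memory_bytes(q, r) for q, r in config)

# The discrete design space grows exponentially with depth, which is why
# exhaustive search is infeasible:
search_space = (len(BITS) * len(RANKS)) ** NUM_LAYERS  # 20^32 configs here

# Use a uniform 4-bit, rank-16 configuration as the memory budget.
uniform_4bit_r16 = [(4, 16)] * NUM_LAYERS
budget = total_memory(uniform_4bit_r16)

def feasible(config):
    return total_memory(config) <= budget
```

Note that under this model a 2-bit layer with rank 64 costs less memory than a 4-bit layer with rank 16, which is exactly the kind of trade the joint search can exploit.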

AutoQRA therefore adopts a two‑phase coarse‑to‑fine search strategy. Phase I performs a global multi‑fidelity evolutionary search. An initial population is warm‑started using layer‑wise importance priors: I_q(ℓ) measures sensitivity to low‑bit perturbations, while I_r(ℓ) measures the amount of update energy a layer exhibits during fine‑tuning. Importance‑guided mutation operators focus changes on influential layers. Candidates are evaluated at low fidelity (short fine‑tuning runs) and screened by a learned surrogate model that predicts high‑fidelity performance. Only the most promising configurations are promoted to high‑fidelity evaluation, and a Pareto front of accuracy versus memory consumption is approximated. The process stops automatically when hypervolume improvement saturates.
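The Phase I loop above can be sketched as follows. The population size, number of generations, importance priors, and both score functions are placeholder assumptions: the real system runs short fine-tuning jobs for low fidelity, full runs for high fidelity, and a learned surrogate for screening, none of which is reproduced here.

```python
import random

# Toy sketch of Phase I: importance-guided evolutionary search with cheap
# screening before expensive evaluation. All numbers are illustrative.

random.seed(0)
BITS, RANKS, L = [2, 3, 4, 8], [4, 8, 16, 32], 8
IMPORTANCE = [1.0 / (l + 1) for l in range(L)]  # stand-in layer priors

def random_config():
    return [(random.choice(BITS), random.choice(RANKS)) for _ in range(L)]

def importance_guided_mutate(cfg):
    """Mutate one layer, sampled in proportion to its importance prior."""
    l = random.choices(range(L), weights=IMPORTANCE)[0]
    new = list(cfg)
    new[l] = (random.choice(BITS), random.choice(RANKS))
    return new

def low_fidelity_score(cfg):
    # Placeholder for a short fine-tuning run: rewards capacity on
    # important layers, with noise standing in for evaluation variance.
    s = sum(w * (q + 0.1 * r) for w, (q, r) in zip(IMPORTANCE, cfg))
    return s + random.gauss(0, 0.1)

def high_fidelity_score(cfg):
    # Placeholder for a full (expensive, less noisy) fine-tuning run.
    return sum(w * (q + 0.1 * r) for w, (q, r) in zip(IMPORTANCE, cfg))

population = [random_config() for _ in range(16)]
for _ in range(5):
    children = [importance_guided_mutate(random.choice(population))
                for _ in range(16)]
    # Cheap screening keeps the search affordable: only survivors of the
    # low-fidelity ranking stay in the population.
    population = sorted(population + children,
                        key=low_fidelity_score, reverse=True)[:16]

# Only the screened survivors receive a high-fidelity evaluation.
best = max(population, key=high_fidelity_score)
```

In the paper the screening step is a learned performance model rather than a direct low-fidelity re-evaluation, and the stopping rule is hypervolume saturation on the accuracy-memory Pareto front rather than a fixed generation count.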

Phase II refines the promising configurations from Phase I using trust‑region Bayesian optimization. A Gaussian‑process surrogate is trained on the high‑fidelity evaluations, and Expected Improvement (EI) acquisition selects new (q, r) configurations within a trust region that respects the memory constraint. This local search fine‑tunes the allocation, allowing the adapter capacity to precisely compensate quantization noise in the most sensitive layers. The Bayesian loop terminates when acquisition improvement plateaus.
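A minimal sketch of the two ingredients of Phase II follows: the Expected Improvement acquisition in closed form, and a trust-region filter over discrete configurations. The Hamming-ball trust region, the per-candidate surrogate (mu, sigma) values, and the toy candidates are assumptions for illustration; the paper fits a Gaussian process to the high-fidelity evaluations from Phase I.

```python
import math

def expected_improvement(mu, sigma, f_best):
    """EI for maximization: E[max(f - f_best, 0)] under N(mu, sigma^2)."""
    if sigma <= 0:
        return max(mu - f_best, 0.0)
    z = (mu - f_best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return (mu - f_best) * cdf + sigma * pdf

def in_trust_region(candidate, center, radius):
    """Hamming-ball trust region: at most `radius` layers may change."""
    return sum(a != b for a, b in zip(candidate, center)) <= radius

# Toy selection step around an incumbent configuration (4 layers).
center = [(4, 16)] * 4
candidates = [
    [(2, 64), (4, 16), (4, 16), (4, 16)],  # one layer changed
    [(2, 64), (2, 64), (2, 64), (2, 64)],  # outside a radius-1 region
]
surrogate = {0: (0.82, 0.05), 1: (0.85, 0.10)}  # illustrative (mu, sigma)
f_best = 0.80

in_region = [i for i, c in enumerate(candidates)
             if in_trust_region(c, center, radius=1)]
next_idx = max(in_region,
               key=lambda i: expected_improvement(*surrogate[i], f_best))
```

A memory-feasibility check (as in the constrained formulation above) would be applied alongside the trust-region filter before scoring candidates with EI.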

Experiments are conducted on LLaMA‑7B and LLaMA‑13B models across several downstream benchmarks (Winogrande, ARC‑Challenge, MMLU, etc.). AutoQRA consistently outperforms uniform 4‑bit quantization (QLoRA) and adaptive‑rank methods with fixed precision (AdaLoRA). Under the same memory budget, AutoQRA achieves 1.2–2.0 % higher accuracy on average and approaches full‑precision fine‑tuning performance. Analysis of the learned configurations reveals a pattern: layers quantized to very low bits (2‑4) are often paired with higher LoRA ranks, confirming the hypothesis that additional adapter capacity can offset quantization noise.

The contributions are threefold: (1) a formal joint optimization problem that captures the interaction between quantization precision and adapter rank; (2) the AutoQRA framework that combines multi‑fidelity evolutionary search with trust‑region Bayesian refinement to efficiently explore a large discrete design space; (3) empirical evidence that joint optimization yields near‑full‑precision performance with a memory footprint comparable to uniform 4‑bit methods.

Limitations include the reliance on an initial set of high‑fidelity evaluations to train the surrogate, which may be costly for extremely large models, and the fact that the surrogate’s accuracy depends on the quality of the importance priors. Future work could explore meta‑learning to transfer surrogate knowledge across tasks, dynamic quantization schedules during fine‑tuning, and more scalable distributed search strategies.

