On-the-Fly Adaptation to Quantization: Configuration-Aware LoRA for Efficient Fine-Tuning of Quantized LLMs
As increasingly large pre-trained models are released, deploying them on edge devices for privacy-preserving applications requires effective compression. Recent works combine quantization with the fine-tuning of high-precision LoRA adapters, which can substantially reduce model size while mitigating the accuracy loss from quantization. However, edge devices have inherently heterogeneous capabilities, and performing configuration-specific fine-tuning for every quantization setting is computationally prohibitive. In this paper, we propose CoA-LoRA, a method that dynamically adjusts the LoRA adapter to arbitrary quantization configurations (i.e., the per-layer bit-width choices of a pre-trained model) without requiring repeated fine-tuning. This is accomplished via a configuration-aware model that maps each configuration to its low-rank adjustments. The effectiveness of this model critically depends on the training configuration set, a collection of configurations chosen to cover different total bit-width budgets. However, constructing a high-quality configuration set is non-trivial. We therefore design a Pareto-based configuration search that iteratively optimizes the training configuration set, yielding more precise low-rank adjustments. Our experiments demonstrate that, unlike state-of-the-art methods that require fine-tuning a separate LoRA adapter for each configuration, CoA-LoRA incurs no additional time cost while achieving comparable or even superior performance.
💡 Research Summary
The paper introduces CoA‑LoRA, a method that enables a single Low‑Rank Adaptation (LoRA) adapter to work across arbitrary per‑layer quantization configurations without the need to fine‑tune a separate adapter for each setting. The core idea is a configuration‑aware model that maps a quantization configuration—represented by five per‑layer parameters (two bit‑widths, two bucket sizes, and a final casting bit‑width)—into a compact embedding. This embedding, together with layer‑type and block‑index embeddings, is fed into a lightweight neural network θ that outputs an r × r adjustment matrix Uθ for each layer. The original LoRA matrix L₂ᵢ is then re‑parameterized as (I + Uθ(Qᵢ)) L₂ᵢ, allowing the model to apply a subtle, configuration‑specific transformation while keeping the total number of generated parameters at N·r² (N = number of layers).
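The re-parameterization above can be sketched in a few lines. The following NumPy example is illustrative only: the tiny MLP standing in for the configuration-aware model θ, its hidden width, the input normalization, and the example configuration values are all assumptions, not the paper's actual architecture. It shows only the shape mechanics of mapping a 5-parameter per-layer configuration Qᵢ to an r × r adjustment Uθ(Qᵢ) and forming (I + Uθ(Qᵢ)) L₂ᵢ.

```python
import numpy as np

rng = np.random.default_rng(0)
r, d_in = 8, 64          # LoRA rank and layer input dim (illustrative values)
n_cfg_params = 5         # per layer: 2 bit-widths, 2 bucket sizes, casting bit-width

# Hypothetical tiny MLP standing in for the configuration-aware model theta.
W1 = rng.normal(0.0, 0.1, (n_cfg_params, 32))
W2 = rng.normal(0.0, 0.1, (32, r * r))

def adjustment(cfg: np.ndarray) -> np.ndarray:
    """Map a per-layer quantization configuration to an r x r adjustment U_theta."""
    h = np.tanh(cfg @ W1)
    return (h @ W2).reshape(r, r)

cfg = np.array([4.0, 8.0, 64.0, 128.0, 16.0])  # example per-layer configuration Q_i
U = adjustment(cfg / cfg.max())                # input scaling is an assumption

L2_mat = rng.normal(0.0, 0.02, (r, d_in))      # original LoRA factor L2_i
L2_adj = (np.eye(r) + U) @ L2_mat              # re-parameterized (I + U_theta(Q_i)) L2_i
```

Because the adjustment is additive around the identity, Uθ ≈ 0 recovers the original LoRA factor, which matches the paper's framing of a subtle configuration-specific transformation; across N layers the generated parameters total N·r².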
Training proceeds by minimizing the expected task loss over a set of quantization configurations 𝒞: θ* = arg min_θ E_{C∈𝒞}[ℒ(C; θ)], where ℒ(C; θ) denotes the task loss of the model quantized under configuration C with the θ-adjusted LoRA adapters.
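The objective can be estimated in the obvious Monte-Carlo way: average the task loss over configurations drawn from 𝒞. The sketch below is a minimal illustration under stated assumptions; `task_loss` is a hypothetical placeholder (the real loss would run the quantized model with adjusted adapters on task data), and the random configuration pool merely stands in for the Pareto-searched training set 𝒞.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical pool standing in for the training configuration set C
# (5 parameters per layer; here just one layer's worth per sample).
config_set = [rng.integers(2, 9, size=5).astype(float) for _ in range(16)]

def task_loss(cfg: np.ndarray, theta: np.ndarray) -> float:
    """Placeholder for L(C; theta): the real loss would evaluate the
    quantized model with the theta-adjusted LoRA adapters on task data."""
    return float(np.sum((cfg * theta) ** 2))  # illustrative only

def expected_loss(theta: np.ndarray, configs: list) -> float:
    """Monte-Carlo estimate of E_{C in C}[L(C; theta)]."""
    return sum(task_loss(c, theta) for c in configs) / len(configs)

theta = rng.normal(0.0, 0.1, 5)   # stand-in for the parameters of the model theta
loss = expected_loss(theta, config_set)
```

In practice this expectation would be minimized with stochastic gradients, sampling a batch of configurations from 𝒞 at each step; the quality of the estimate depends directly on how well 𝒞 covers the bit-width budgets of interest, which is what the Pareto-based search addresses.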