Learning Rate Scaling across LoRA Ranks and Transfer to Full Finetuning
Low-Rank Adaptation (LoRA) is a standard tool for parameter-efficient finetuning of large models. While it keeps the memory footprint small, its training dynamics can be surprisingly complex, as they depend on several hyperparameters such as initialization, adapter rank, and learning rate. In particular, it is unclear how the optimal learning rate scales with adapter rank, which forces practitioners to re-tune the learning rate whenever the rank is changed. In this paper, we introduce Maximal-Update Adaptation ($μ$A), a theoretical framework that characterizes how the “optimal” learning rate should scale with model width and adapter rank to produce stable, non-vanishing feature updates under standard configurations. $μ$A is inspired by the Maximal-Update Parametrization ($μ$P) used in pretraining. Our analysis leverages techniques from hyperparameter transfer and reveals that the optimal learning rate exhibits different scaling patterns depending on initialization and the LoRA scaling factor. Specifically, we identify two regimes: one where the optimal learning rate remains roughly invariant across ranks, and another where it scales inversely with rank. We further identify a configuration that allows learning rate transfer from LoRA to full finetuning, drastically reducing the cost of learning rate tuning for full finetuning. Experiments across language, vision, vision–language, image generation, and reinforcement learning tasks validate our scaling rules and show that learning rates tuned on LoRA transfer reliably to full finetuning.
💡 Research Summary
Low‑Rank Adaptation (LoRA) has become the de facto method for parameter‑efficient fine‑tuning of large pretrained models. While it dramatically reduces memory and compute requirements, practitioners have observed that the optimal learning rate (LR) is highly sensitive to the adapter rank r, the initialization scheme, and the LoRA scaling factor α. Changing any of these hyperparameters often forces a costly re‑tuning of the LR, which hampers the practical adoption of LoRA at scale.
This paper introduces Maximal‑Update Adaptation (μA), a theoretical framework that extends the Maximal‑Update Parametrization (μP) used for pre‑training to the fine‑tuning setting. μA defines a “maximal yet stable” regime: the LoRA‑induced feature update ΔZ⁽ᴮ⁾ must remain bounded, O(1) in width, to avoid divergence, while also staying non‑vanishing, Ω(1), so that learning actually occurs; together, ΔZ⁽ᴮ⁾ = Θ(1). The authors formalize LoRA as W = W* + α B A, with frozen pretrained weight W* and trainable low‑rank factors A∈ℝ^{r×n} and B∈ℝ^{n×r}, so that the product B A has the same shape as W*. Two canonical initializations are considered:
- Init
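The parameterization above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation; the width n, rank r, and scaling α are illustrative values, and the initialization shown (A Gaussian, B zero) is one common LoRA choice under which the adapter contributes nothing at step 0:

```python
import numpy as np

rng = np.random.default_rng(0)

n, r, alpha = 64, 8, 2.0  # width, adapter rank, LoRA scaling (illustrative)

# Frozen pretrained weight W* (standard 1/sqrt(n) scale).
W_star = rng.standard_normal((n, n)) / np.sqrt(n)

# One common LoRA initialization: A Gaussian, B zero, so the adapted
# layer coincides with the pretrained one at initialization.
A = rng.standard_normal((r, n)) / np.sqrt(n)  # trainable, shape (r, n)
B = np.zeros((n, r))                          # trainable, shape (n, r)

def forward(x):
    """Adapted layer: W x with W = W* + alpha * B @ A."""
    return W_star @ x + alpha * (B @ (A @ x))

x = rng.standard_normal(n)
z0 = forward(x)
# With B = 0 the adapter term vanishes, so z0 equals W_star @ x exactly.
assert np.allclose(z0, W_star @ x)
```

The feature update ΔZ⁽ᴮ⁾ studied by μA is the change in `forward(x)` after gradient steps on A and B; the framework asks how the LR must scale with n and r so that this change stays Θ(1).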