DoRAN: Stabilizing Weight-Decomposed Low-Rank Adaptation via Noise Injection and Auxiliary Networks
Parameter-efficient fine-tuning (PEFT) methods have become the standard paradigm for adapting large-scale models. Among these techniques, Weight-Decomposed Low-Rank Adaptation (DoRA) has been shown to improve both the learning capacity and training stability of the Low-Rank Adaptation (LoRA) method by explicitly decomposing pre-trained weights into magnitude and directional components. In this work, we propose DoRAN, a new technique designed to stabilize training and boost the sample efficiency of DoRA. Our framework introduces two key components: (i) the injection of learnable noise into the denominator of DoRA weight decomposition, which serves as an adaptive regularizer to mitigate instabilities and improve the estimation rate of low-rank matrices; and (ii) the replacement of static low-rank matrices with auxiliary networks that generate them dynamically, enabling parameter coupling between the query and value projection matrices, leading to improved sample efficiency both theoretically and empirically. Comprehensive experiments on vision and language benchmarks show that DoRAN consistently outperforms LoRA, DoRA, and other PEFT baselines, underscoring the effectiveness of combining noise-based regularization with network-based parameter generation.
💡 Research Summary
DoRAN (Weight‑Decomposed Low‑Rank Adaptation with Noise Injection and Auxiliary Networks) addresses two critical shortcomings of the existing DoRA method for parameter‑efficient fine‑tuning of large pretrained models. First, DoRA normalizes the adapted weight matrix by its column‑wise norm, which can become arbitrarily small and cause gradient explosions. DoRAN adds a learnable positive scalar τ to the denominator, effectively acting as adaptive noise. This term guarantees that the denominator ‖W′‖_c + τ never vanishes, stabilizing gradients throughout training. Gradient analysis shows that the orthogonal component of the upstream gradient is scaled by m/(‖W′‖_c + τ), while the parallel component is scaled by τ/(‖W′‖_c + τ). Consequently, τ interpolates smoothly between the pure directional update of DoRA (τ → 0) and a linear scaling regime (τ ≫ ‖W′‖_c), allowing the model to balance direction learning and norm control automatically.
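The stabilized decomposition above can be illustrated with a minimal NumPy sketch. Shapes, function names, and the tanh-free assembly below are illustrative assumptions, not the paper's implementation; the key point is that the learnable offset τ keeps the column-norm denominator bounded away from zero.

```python
import numpy as np

def doran_weight(W0, B, A, m, tau):
    """Assemble the adapted weight as in DoRA, with DoRAN's learnable
    positive scalar tau added to the column-norm denominator.

    W0:  (d_out, d_in) pre-trained weight
    B:   (d_out, r), A: (r, d_in) low-rank update, so W' = W0 + B A
    m:   (d_in,) learnable magnitude vector (one entry per column)
    tau: learnable positive scalar; keeps ||W'||_c + tau > 0 even if a
         column norm of W' collapses to zero (DoRA would divide by
         ||W'||_c alone, which can blow up gradients)
    """
    W_prime = W0 + B @ A                         # adapted weight W'
    col_norm = np.linalg.norm(W_prime, axis=0)   # ||W'||_c, per column
    return m * W_prime / (col_norm + tau)

# tiny usage example with illustrative dimensions
rng = np.random.default_rng(0)
W0 = rng.standard_normal((8, 4))
B, A = rng.standard_normal((8, 2)), rng.standard_normal((2, 4))
m = np.ones(4)
W = doran_weight(W0, B, A, m, tau=0.05)
```

Note that with `tau=0` this reduces to DoRA's pure directional normalization, while for large `tau` the expression approaches a plain linear scaling of W′, matching the interpolation described above.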
Second, DoRAN replaces the static low‑rank adapters A and B with two‑layer hypernetworks g_A and g_B. A shared embedding (A′, B′) and shared first‑layer weights (W_{A1}, W_{B1}) extract a common latent representation for both the query and value projections, while separate second‑layer weights (W_{A2}, W_{B2}) produce projection‑specific low‑rank matrices. This design couples the query and value adapters, encouraging information sharing across attention heads and reducing the total number of trainable parameters. The hypernetwork can be viewed as a mixture‑of‑experts (MoE) generator: each attention head acts as an expert, and the hypernetwork provides a compact gating mechanism. The authors prove that this re‑parameterization reduces the sample complexity from exponential in the number of experts to polynomial, yielding a provable improvement in estimation rate.
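The coupling can be sketched as follows. This is a schematic NumPy version of one hypernetwork branch (g_A); all dimensions and weight names are hypothetical, chosen only to show how a shared embedding and first layer feed two projection-specific second layers.

```python
import numpy as np

rng = np.random.default_rng(0)
r, e, h, d = 4, 16, 32, 64   # rank, embedding dim, hidden dim, model dim (illustrative)

def generate_adapters(emb, W1, W2_q, W2_v):
    """Two-layer hypernetwork sketch: the shared embedding and first
    layer compute a common latent; separate second-layer weights map it
    to the query- and value-projection low-rank matrices."""
    latent = np.tanh(emb @ W1)            # shared latent representation
    A_q = latent @ W2_q                   # query-projection adapter
    A_v = latent @ W2_v                   # value-projection adapter
    return A_q, A_v

A_emb = rng.standard_normal((r, e))       # shared embedding A'
W_A1  = rng.standard_normal((e, h))       # shared first-layer weights
W_A2q = rng.standard_normal((h, d))       # query-specific second layer
W_A2v = rng.standard_normal((h, d))       # value-specific second layer

A_q, A_v = generate_adapters(A_emb, W_A1, W_A2q, W_A2v)
```

Because `A_emb` and `W_A1` are updated by gradients from both projections, the query and value adapters share parameters rather than being learned independently, which is the source of the sample-efficiency gain the summary describes.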
Empirically, DoRAN was evaluated on vision benchmarks (VTAB‑1K, FGVC) and language tasks (commonsense reasoning with LLaMA‑7B/13B). Across all datasets, DoRAN consistently outperformed LoRA, DoRA, and other recent PEFT baselines, achieving 1–2.5 percentage‑point gains in accuracy while exhibiting smoother loss curves and faster convergence. Learned values of τ typically ranged from 0.01 to 0.1, decreasing automatically as training progressed. The hypernetwork contributed less than 5 % of the total parameter budget yet delivered a 10–15 % boost in parameter efficiency compared to independently learned adapters. Computational overhead was negligible, and memory usage remained comparable to baseline methods.
In summary, DoRAN introduces (i) a learnable noise term that eliminates singularities in DoRA’s normalization and stabilizes gradient flow, and (ii) a hypernetwork‑based dynamic generation of low‑rank adapters that couples query and value projections, enhancing sample efficiency both theoretically and practically. The paper provides a thorough theoretical analysis, gradient derivations, and extensive experiments, establishing DoRAN as a robust and efficient alternative for fine‑tuning large foundation models, especially in low‑data regimes.