AdaGradSelect: An adaptive gradient-guided layer selection method for efficient fine-tuning of SLMs

While Large Language Models (LLMs) excel at diverse NLP tasks, their adaptation through full fine-tuning is computationally expensive and memory-intensive. Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) mitigate this by introducing low-rank updates to frozen weights, but this constrains optimization to a low-rank subspace and can limit performance. Focusing on Small Language Models (SLMs), where efficiency gains offer significant practical benefits, we introduce AdaGradSelect, an adaptive, gradient-guided block selection strategy for efficient fine-tuning. Motivated by preliminary findings that selectively updating transformer blocks with the highest gradient norms approaches full fine-tuning performance, AdaGradSelect dynamically prioritizes which blocks to train. The method combines Dirichlet-based sampling, informed by historical update frequencies, with an 𝜖-greedy exploration strategy. This approach initially balances the exploitation of important blocks with the exploration of new candidates before transitioning to full exploitation in later epochs, optimizing the training process. Experimental results demonstrate that AdaGradSelect trains approximately 12% faster and uses 35% less GPU memory while achieving performance nearly identical to full fine-tuning. On the GSM8K dataset, our method consistently outperforms LoRA (rank 256) by an average of 3% across Qwen2.5-0.5B, LLaMA3.2-1B, and Phi4-mini-3.8B models. It also shows comparable accuracy on the MATH dataset, establishing AdaGradSelect as a more effective and resource-efficient fine-tuning approach.


💡 Research Summary

The paper addresses the high computational and memory cost of full fine‑tuning for large language models (LLMs), focusing on small language models (SLMs) where efficiency gains are most impactful. While parameter‑efficient fine‑tuning (PEFT) methods such as Low‑Rank Adaptation (LoRA) reduce the number of trainable parameters by adding low‑rank updates to frozen weights, they constrain optimization to a low‑dimensional subspace and can limit final performance. The authors observe that updating only those transformer blocks whose gradient norms are highest can approximate full‑model fine‑tuning performance, motivating a dynamic block‑selection strategy.

AdaGradSelect is introduced as an adaptive, gradient‑guided block‑selection algorithm. At each training step the method computes the L2 norm of the gradient for every transformer block. These norms are accumulated over time to form a historical frequency vector, which is then used as the concentration parameters of a Dirichlet distribution. Sampling from this Dirichlet yields a probability distribution over blocks, biasing selection toward those that have historically received large updates. To avoid premature convergence on a narrow set of blocks, an ε‑greedy exploration scheme is incorporated: with probability ε a block is chosen uniformly at random, while with probability 1‑ε the block is drawn from the Dirichlet‑based distribution. The ε value is annealed across epochs, providing balanced exploration in early training and full exploitation later on.
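The selection step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function and variable names (`select_block`, `annealed_epsilon`, the small additive constant keeping concentrations positive, and the linear annealing schedule) are hypothetical, and the gradient norms below are simulated stand-ins.

```python
import numpy as np

def select_block(grad_norm_history, epsilon, rng):
    """Pick one transformer block to update (illustrative sketch).

    grad_norm_history: per-block accumulated gradient L2 norms, used as
    the concentration parameters of a Dirichlet distribution. The small
    additive constant is an assumption to keep all concentrations positive.
    """
    n_blocks = len(grad_norm_history)
    if rng.random() < epsilon:
        # Exploration: choose a block uniformly at random.
        return int(rng.integers(n_blocks))
    # Exploitation: sample a categorical distribution over blocks from the
    # Dirichlet, then draw a block from that distribution.
    alpha = grad_norm_history + 1e-3
    probs = rng.dirichlet(alpha)
    return int(rng.choice(n_blocks, p=probs))

def annealed_epsilon(epoch, total_epochs, eps0=0.5):
    # Linear anneal: balanced exploration early, full exploitation by the
    # final epoch. The exact schedule and eps0 value are assumptions; the
    # paper only states that epsilon is annealed across epochs.
    return eps0 * max(0.0, 1.0 - epoch / (total_epochs - 1))

# One simulated training step (values are illustrative, not from the paper):
rng = np.random.default_rng(0)
history = np.zeros(12)              # e.g. a 12-block transformer
grad_norms = rng.random(12)         # stand-in per-block gradient L2 norms
history += grad_norms               # accumulate into the historical vector
eps = annealed_epsilon(epoch=1, total_epochs=10)
block = select_block(history, eps, rng)
```

Sampling from the Dirichlet, rather than normalizing the history directly, injects useful stochasticity: blocks with smaller accumulated norms still get occasional updates even during exploitation.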

The authors evaluate AdaGradSelect on three SLMs—Qwen2.5‑0.5B, LLaMA3.2‑1B, and Phi4‑mini‑3.8B—using two benchmark datasets: GSM8K (grade‑school arithmetic word problems) and MATH (competition‑level mathematics problems). Baselines include (1) full fine‑tuning of all parameters, (2) LoRA with rank 256, and (3) a random‑block selection method. Results show that AdaGradSelect achieves nearly identical accuracy to full fine‑tuning while reducing wall‑clock training time by roughly 12% and GPU memory consumption by about 35%. On GSM8K, AdaGradSelect outperforms LoRA (rank 256) by an average of 3% across the three models, and on MATH it matches LoRA's performance within statistical noise. The random‑block baseline consistently underperforms, confirming that gradient magnitude is an effective signal for identifying important blocks.

The paper also discusses the theoretical underpinnings of the approach. The Dirichlet‑based sampling implements a Bayesian update rule: blocks that have been selected frequently (and thus have larger accumulated gradients) receive higher concentration, increasing their future selection probability. The ε‑greedy schedule formalizes the exploration–exploitation trade‑off, ensuring that the algorithm does not become trapped in suboptimal block subsets. This combination yields a principled, data‑driven mechanism for allocating limited training resources.
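The Bayesian update rule can be illustrated numerically. For a Dirichlet with concentration vector α, the expected selection probability of block i is E[p_i] = α_i / Σα, so accumulating further gradient mass on a block sharpens the distribution toward it. The concentration values below are illustrative, not taken from the paper.

```python
import numpy as np

# Concentration parameters after some training: block 2 has accumulated the
# largest gradient norms (illustrative numbers).
alpha = np.array([1.0, 2.0, 8.0, 1.0])

# Expected selection probability under Dirichlet(alpha): E[p_i] = alpha_i / sum(alpha).
expected = alpha / alpha.sum()      # block 2: 8/12, about 67%

# A further large gradient norm on block 2 acts as a Bayesian update to the
# concentration, increasing that block's expected selection probability:
alpha_post = alpha + np.array([0.0, 0.0, 4.0, 0.0])
expected_post = alpha_post / alpha_post.sum()   # block 2: 12/16 = 75%
```

This is why the ε‑greedy floor matters: without it, a block that dominates early would keep compounding its concentration and could lock out blocks that become important later in training.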

In the discussion, the authors argue that the method is model‑agnostic and can be extended beyond block‑level granularity to attention‑head or parameter‑group selection. They also propose future work on multi‑task scenarios where block importance may shift across tasks, and on integrating other importance metrics such as Fisher information. Overall, AdaGradSelect offers a practical solution for fine‑tuning SLMs in resource‑constrained environments, delivering substantial savings in compute and memory while preserving, or even improving, downstream task performance.

