AGZO: Activation-Guided Zeroth-Order Optimization for LLM Fine-Tuning
Zeroth-Order (ZO) optimization has emerged as a promising solution for fine-tuning LLMs under strict memory constraints, as it avoids the prohibitive memory cost of storing activations for backpropagation. However, existing ZO methods typically employ isotropic perturbations, neglecting the rich structural information available during the forward pass. In this paper, we identify a crucial link between gradient formation and activation structure: the gradient of a linear layer is confined to the subspace spanned by its input activations. Leveraging this insight, we propose Activation-Guided Zeroth-Order optimization (AGZO). Unlike prior methods, AGZO extracts a compact, activation-informed subspace on the fly during the forward pass and restricts perturbations to this low-rank subspace. We provide a theoretical framework showing that AGZO optimizes a subspace-smoothed objective and provably yields update directions with higher cosine similarity to the true gradient than isotropic baselines. Empirically, we evaluate AGZO on Qwen3 and Pangu models across various benchmarks. AGZO consistently outperforms state-of-the-art ZO baselines and significantly narrows the performance gap with first-order fine-tuning, while maintaining almost the same peak memory footprint as other ZO methods.
💡 Research Summary
The paper tackles the memory bottleneck that hampers fine‑tuning of large language models (LLMs) by proposing Activation‑Guided Zeroth‑Order optimization (AGZO), a novel zeroth‑order (ZO) method that leverages information generated during the forward pass. Traditional ZO fine‑tuning approaches such as MeZO and LOZO rely on isotropic Gaussian perturbations over the full parameter space or on random low‑rank factorizations that are independent of the model’s internal activations. Consequently, a large portion of the query budget is spent exploring directions that carry little gradient signal.
The authors first establish a deterministic relationship for linear layers: the gradient with respect to a weight matrix Wℓ is the product Qℓ Hℓᵀ, where Hℓ contains the input activations for the current minibatch and Qℓ stores the upstream gradients. This factorization implies that the row‑space of the gradient is a subspace of the column‑space of Hℓ. Empirical SVD analyses on GPT‑2 fine‑tuned on SST‑2 show that projecting the true gradient onto the subspace spanned by the top r singular vectors of Hℓ retains almost all its energy (cosine similarity ≈ 1 for r ≥ 10). Both gradients and activations exhibit rapidly decaying spectra, confirming that the effective dimensionality is far lower than the ambient parameter space.
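The factorization above can be checked numerically on a toy linear layer: projecting the rows of the gradient Gℓ = Qℓ Hℓᵀ onto the column space of Hℓ leaves it unchanged. This is a minimal sketch with random matrices standing in for real activations and upstream gradients:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, batch = 256, 128, 32

# Toy linear layer: H_l holds input activations, Q_l the upstream gradients.
H = rng.standard_normal((d_in, batch))   # H_l, shape (d_in, batch)
Q = rng.standard_normal((d_out, batch))  # Q_l, shape (d_out, batch)
G = Q @ H.T                              # dL/dW_l = Q_l H_l^T, shape (d_out, d_in)

# Each row of G is a linear combination of the columns of H, so projecting
# the rows of G onto span(H) is a no-op.
U, _, _ = np.linalg.svd(H, full_matrices=False)  # orthonormal basis of col(H)
G_proj = G @ U @ U.T                             # project rows onto span(H)
print(np.allclose(G, G_proj))                    # True
```

Because the batch size (here 32) is far smaller than d_in, the gradient is automatically low-rank, which is exactly the structure AGZO exploits.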
Motivated by these observations, AGZO constructs, on the fly, a low‑rank subspace from the current activations. For each linear layer, a few power‑iteration steps are applied to Hℓ Hℓᵀ to obtain an orthonormal basis Aℓ ∈ ℝ^{din×r}. The activation matrix Hℓ is then discarded to keep memory usage minimal. Perturbations for linear layers are generated as Δℓ = Rℓ Aℓᵀ, where Rℓ ∈ ℝ^{dout×r} has i.i.d. standard normal entries, yielding a rank‑r update whose row space lies entirely within the activation‑guided subspace. Non‑linear layers fall back to standard Gaussian perturbations, preserving generality across architectures.
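The subspace-extraction and perturbation steps described above can be sketched as follows. The helper `activation_subspace` is a hypothetical illustration of the power-iteration step; the paper's exact implementation (number of iterations, orthonormalization scheme) may differ:

```python
import numpy as np

def activation_subspace(H, r, n_iter=2, rng=None):
    """Approximate the top-r eigenvectors of H @ H.T via power iteration.

    H: (d_in, batch) activation matrix. Returns A: (d_in, r), orthonormal.
    Illustrative sketch; not the paper's exact routine.
    """
    rng = rng or np.random.default_rng()
    A = rng.standard_normal((H.shape[0], r))
    for _ in range(n_iter):
        A = H @ (H.T @ A)       # one power-iteration step on H H^T
        A, _ = np.linalg.qr(A)  # re-orthonormalize the basis
    return A

rng = np.random.default_rng(0)
H = rng.standard_normal((256, 32))        # activations from the forward pass
A = activation_subspace(H, r=8, rng=rng)  # keep only the basis; H is discarded
R = rng.standard_normal((128, 8))         # i.i.d. standard normal factor (d_out x r)
Delta = R @ A.T                           # rank-r perturbation, row space in span(A)
```

Only A (d_in × r) and R (d_out × r) need to be stored per layer, which is how the method keeps its memory footprint close to that of plain ZO.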
Theoretically, the authors show that AGZO optimizes a “subspace‑smoothed” objective F_{μ,sub}(W) = E_Δ[F(W + μΔ)], where the perturbation Δ = RAᵀ is confined to the activation‑guided subspace rather than drawn from the full parameter space. They further prove that the resulting update directions achieve higher expected cosine similarity with the true gradient than those produced by isotropic ZO baselines.
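How a low-rank perturbation of this kind plugs into a standard two-point (SPSA-style) ZO update can be sketched on a toy quadratic. Here `loss`, `mu`, and `lr` are illustrative placeholders, and the random rank-2 perturbation stands in for the activation-guided one:

```python
import numpy as np

def zo_step(loss_fn, W, Delta, mu=1e-3, lr=1e-3):
    """One two-point ZO update along a fixed perturbation direction Delta.

    Standard SPSA-style estimator used by ZO fine-tuners; hyperparameters
    here are toy values, not the paper's settings.
    """
    g_hat = (loss_fn(W + mu * Delta) - loss_fn(W - mu * Delta)) / (2 * mu)
    return W - lr * g_hat * Delta  # move along Delta, scaled by the estimate

rng = np.random.default_rng(0)
W_star = rng.standard_normal((4, 4))
loss = lambda W: 0.5 * np.sum((W - W_star) ** 2)  # toy quadratic objective

W = np.zeros((4, 4))
for _ in range(2000):
    A, _ = np.linalg.qr(rng.standard_normal((4, 2)))  # stand-in subspace basis
    Delta = rng.standard_normal((4, 2)) @ A.T         # rank-2 perturbation
    W = zo_step(loss, W, Delta)
```

Only two forward passes per step are needed, which is the core memory advantage of ZO methods: no activations are stored for backpropagation.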