Fine-tuning Quantized Neural Networks with Zeroth-order Optimization


As the size of large language models grows exponentially, GPU memory has become a bottleneck for adapting these models to downstream tasks. In this paper, we aim to push the limits of memory-efficient training by minimizing memory usage on model weights, gradients, and optimizer states, within a unified framework. Our idea is to eliminate both gradients and optimizer states using zeroth-order optimization, which approximates gradients by perturbing weights during forward passes to identify gradient directions. To minimize memory usage on weights, we employ model quantization, e.g., converting from bfloat16 to int4. However, directly applying zeroth-order optimization to quantized weights is infeasible due to the precision gap between discrete weights and continuous gradients, which would otherwise require de-quantization and re-quantization. To overcome this challenge, we propose Quantized Zeroth-order Optimization (QZO), a simple yet effective approach that perturbs the continuous quantization scale for gradient estimation and uses a directional derivative clipping method to stabilize training. QZO is orthogonal to both scalar-based and codebook-based post-training quantization methods. Compared to full-parameter fine-tuning in 16 bits, QZO can reduce the total memory cost by more than 18$\times$ for 4-bit LLMs, and enables fine-tuning Llama-2-13B within a single 24GB GPU.


💡 Research Summary

This paper tackles the severe GPU memory bottleneck that hampers fine‑tuning of ever‑larger large language models (LLMs). Traditional fine‑tuning consumes memory for four major components: model weights, gradients, optimizer states (often twice the size of gradients), and cached activations. For a 7‑billion‑parameter model stored in BF16, weights alone require ~14 GB, gradients another ~14 GB, and AdamW‑style optimizer states ~28 GB, totaling over 56 GB. To dramatically shrink this footprint, the authors propose Quantized Zeroth‑order Optimization (QZO), a unified framework that simultaneously (1) quantizes model weights to low‑bit formats (e.g., 4‑bit or 2‑bit) and (2) eliminates gradients and optimizer states by using zeroth‑order (ZO) optimization.
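The memory breakdown above is a simple multiplication; a quick sketch (assuming, as in the figures above, that AdamW's two moment buffers are kept at the same 2-byte precision as the BF16 weights, and excluding activations):

```python
def full_finetune_memory_gb(num_params: float, bytes_per_param: int = 2) -> dict:
    """Rough memory breakdown (decimal GB) for full fine-tuning with AdamW."""
    weights = num_params * bytes_per_param        # model weights (BF16 = 2 bytes)
    grads = weights                               # gradients, same dtype as weights
    adamw = 2 * weights                           # first + second moment buffers
    parts = {"weights": weights, "gradients": grads,
             "optimizer": adamw, "total": weights + grads + adamw}
    return {k: round(v / 1e9, 1) for k, v in parts.items()}

print(full_finetune_memory_gb(7e9))
# {'weights': 14.0, 'gradients': 14.0, 'optimizer': 28.0, 'total': 56.0}
```

QZO attacks every term in this sum: quantization shrinks the weights entry, and zeroth-order optimization removes the gradient and optimizer entries entirely.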

Zeroth‑order methods estimate gradients solely from forward passes. The classic Simultaneous Perturbation Stochastic Approximation (SPSA) perturbs the full parameter vector θ with a random Gaussian vector z scaled by ε, evaluates the loss at θ + εz and θ − εz, and forms a central‑difference estimator. While SPSA (and its recent memory‑efficient variant MeZO) works for full‑precision models, it cannot be directly applied to quantized models because (i) quantized weights are discrete and cannot be smoothly perturbed, and (ii) continuous gradient estimates cannot be applied to discrete weights without costly de‑quantize‑re‑quantize cycles.
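The SPSA estimator described above can be sketched in a few lines; the quadratic toy loss and the value of ε are illustrative choices, not from the paper:

```python
import numpy as np

def spsa_grad(loss_fn, theta, eps=1e-3, seed=0):
    """One SPSA estimate: two forward passes along a shared Gaussian direction."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(theta.shape)          # random perturbation direction z
    l_plus = loss_fn(theta + eps * z)             # forward pass at θ + εz
    l_minus = loss_fn(theta - eps * z)            # forward pass at θ − εz
    return (l_plus - l_minus) / (2 * eps) * z     # central-difference estimate

loss = lambda t: float(np.sum(t ** 2))            # toy quadratic, true gradient 2θ
theta = np.array([1.0, -2.0, 0.5])
g_hat = spsa_grad(loss, theta)                    # rank-1 estimate along z
```

Note that the estimate is always a scalar multiple of z; MeZO's memory trick is to regenerate z from a stored random seed, so neither the perturbation nor any optimizer state is ever materialized.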

QZO solves this by shifting the perturbation from the discrete weights to the continuous quantization scales Δ. In typical post‑training quantization, each weight w is represented as an integer w̄ and a scale Δ such that w = Δ·w̄. QZO keeps the integer matrix w̄ fixed throughout training and only perturbs Δ. The Quantized SPSA (Q‑SPSA) estimator is:

$$\hat{\nabla}_{\Delta}\mathcal{L} \;\approx\; \frac{\mathcal{L}\big((\Delta + \varepsilon z)\cdot\bar{w}\big) - \mathcal{L}\big((\Delta - \varepsilon z)\cdot\bar{w}\big)}{2\varepsilon}\, z, \qquad z \sim \mathcal{N}(0, I)$$
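A minimal sketch of this idea: the integer codes w̄ stay fixed while only the continuous scales Δ are perturbed, and the directional derivative is clipped before forming the update, as the summary describes. The function name, per-element scale layout, clip threshold, and toy loss are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def qspsa_scale_grad(loss_fn, w_bar, delta, eps=1e-3, clip=1.0, seed=0):
    """Estimate the gradient w.r.t. the continuous scales Δ; w̄ stays fixed."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(delta.shape)             # perturb the scales, not the weights
    l_plus = loss_fn((delta + eps * z) * w_bar)      # de-quantize on the fly: Δ'·w̄
    l_minus = loss_fn((delta - eps * z) * w_bar)
    d = (l_plus - l_minus) / (2 * eps)               # directional derivative
    d = float(np.clip(d, -clip, clip))               # clipping to stabilize training
    return d * z                                     # estimated ∇_Δ L

w_bar = np.array([3.0, -2.0, 1.0, 4.0])              # fixed integer codes (as floats)
delta = np.array([0.10, 0.10, 0.20, 0.20])           # trainable continuous scales
loss = lambda w: float(np.sum(w ** 2))               # toy stand-in for the LLM loss
g_delta = qspsa_scale_grad(loss, w_bar, delta)
delta = delta - 0.01 * g_delta                       # plain SGD step on Δ only
```

Because only Δ is updated, the integer matrix never needs to be de-quantized and re-quantized, which is exactly the cycle QZO is designed to avoid.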