LoRA-Squeeze: Simple and Effective Post-Tuning and In-Tuning Compression of LoRA Modules
Despite its many variants, standard Low-Rank Adaptation (LoRA) remains a dominant technique for parameter-efficient fine-tuning (PEFT). Nonetheless, it faces persistent challenges, including the pre-selection of an optimal rank and rank-specific hyper-parameters, as well as the deployment complexity of heterogeneous-rank modules and more sophisticated LoRA derivatives. In this work, we introduce LoRA-Squeeze, a simple and efficient methodology that improves standard LoRA learning by changing LoRA module ranks either post hoc or dynamically during training. Our approach posits that it is better to first learn an expressive, higher-rank solution and then compress it than to learn a constrained, low-rank solution directly. The method involves fine-tuning with a deliberately high(er) source rank, reconstructing (or efficiently approximating) the full weight-update matrix, and then using Randomized Singular Value Decomposition (RSVD) to create a new, compressed LoRA module at a lower target rank. Extensive experiments across 13 text and 10 vision-language tasks show that post-hoc compression often produces lower-rank adapters that outperform those trained directly at the target rank, especially when a small number of fine-tuning steps at the target rank is allowed. Moreover, a gradual, in-tuning rank-annealing variant of LoRA-Squeeze consistently achieves the best LoRA size-performance trade-off.
💡 Research Summary
LoRA‑Squeeze tackles two persistent practical issues of Low‑Rank Adaptation (LoRA): the difficulty of selecting an optimal rank before fine‑tuning and the deployment complexity caused by heterogeneous‑rank adapters. The core idea is a “learn‑then‑compress” paradigm: first train a LoRA adapter with a deliberately high source rank (r_src), then compress it to any desired lower target rank (r_tgt) using Randomized Singular Value Decomposition (RSVD).
The method consists of three steps. (1) High‑rank fine‑tuning: a standard LoRA training run is performed with r_src chosen larger than any anticipated deployment rank. This yields matrices A_src∈ℝ^{m×r_src} and B_src∈ℝ^{r_src×n}. (2) Full‑delta reconstruction: the product ΔW_src = A_src·B_src reconstructs the weight‑update matrix in the original parameter space. (3) RSVD compression: ΔW_src is fed to an RSVD pipeline (random Gaussian sketch Ω, power‑iteration, QR orthogonalization) to obtain the top r_tgt singular vectors U_r, singular values Σ_r, and right singular vectors V_r^T. The compressed adapter is then built as A_tgt = U_r·Σ_r^{1/2} and B_tgt = Σ_r^{1/2}·V_r^T.
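The three steps above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: it follows the summary's conventions (ΔW = A_src·B_src with A_src of shape m×r_src and B_src of shape r_src×n), and the function names, oversampling size, and power-iteration count are illustrative assumptions.

```python
# Minimal sketch of the Post-Squeeze compression step (illustrative, not the
# paper's code). Conventions follow the summary: ΔW = A_src @ B_src.
import numpy as np

def rsvd(delta_w, r_tgt, n_oversample=8, n_power_iter=2, seed=0):
    """Randomized SVD: Gaussian sketch, power iteration, QR, small exact SVD."""
    rng = np.random.default_rng(seed)
    m, n = delta_w.shape
    k = r_tgt + n_oversample                     # oversampled sketch width
    omega = rng.standard_normal((n, k))          # random Gaussian sketch Ω
    y = delta_w @ omega                          # sample the range of ΔW
    for _ in range(n_power_iter):                # power iterations sharpen spectrum
        y = delta_w @ (delta_w.T @ y)
    q, _ = np.linalg.qr(y)                       # orthonormal range basis (QR)
    b = q.T @ delta_w                            # small (k × n) projection
    u_small, s, vt = np.linalg.svd(b, full_matrices=False)
    u = q @ u_small
    return u[:, :r_tgt], s[:r_tgt], vt[:r_tgt]   # top-r_tgt factors

def squeeze(a_src, b_src, r_tgt):
    """Compress a LoRA pair to a lower target rank via RSVD of the full ΔW."""
    delta_w = a_src @ b_src                      # step 2: reconstruct ΔW_src
    u, s, vt = rsvd(delta_w, r_tgt)              # step 3: RSVD compression
    sqrt_s = np.sqrt(s)
    a_tgt = u * sqrt_s                           # A_tgt = U_r Σ_r^{1/2}
    b_tgt = sqrt_s[:, None] * vt                 # B_tgt = Σ_r^{1/2} V_r^T
    return a_tgt, b_tgt
```

Splitting Σ_r^{1/2} symmetrically across the two factors, as in the summary, keeps the magnitudes of A_tgt and B_tgt balanced, which is a common choice when the compressed adapter will be fine-tuned further.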
Two deployment strategies are described. Post‑Squeeze applies the RSVD compression once after the entire fine‑tuning is finished, allowing a single high‑rank training run to serve many different deployment ranks without any additional training. In‑Squeeze integrates the compression into the training loop: at predefined intervals the current adapter is reconstructed, compressed to a lower rank, and training continues with the newly compressed adapter. This gradual rank annealing lets the model first explore a richer parameter space and then progressively shrink its footprint while preserving performance.
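The In-Squeeze loop can be outlined as follows. This skeleton is an assumption-laden sketch: the geometric rank schedule, the stage structure, and the `train_steps` callback are illustrative stand-ins (the paper's actual intervals and schedule are not specified in the summary), and exact truncated SVD is used here in place of RSVD for brevity.

```python
# Illustrative In-Squeeze skeleton (names and schedule are assumptions, not
# from the paper): the adapter starts at r_src and is periodically compressed
# toward r_tgt, with ordinary LoRA training resuming between compressions.
import numpy as np

def truncate_adapter(a, b, r):
    """Compress a LoRA pair (ΔW = a @ b) to rank r via exact truncated SVD."""
    u, s, vt = np.linalg.svd(a @ b, full_matrices=False)
    root = np.sqrt(s[:r])
    return u[:, :r] * root, root[:, None] * vt[:r]

def rank_schedule(r_src, r_tgt, n_stages):
    """Geometric rank annealing, e.g. 64 → 32 → 16 → 8 for three stages."""
    ranks = np.geomspace(r_src, r_tgt, n_stages + 1).round().astype(int)
    return list(ranks[1:])

def in_squeeze(a, b, r_tgt, n_stages, train_steps):
    """Alternate fine-tuning and compression until the target rank is reached.

    `train_steps` stands in for a segment of ordinary LoRA training that
    returns updated (a, b) factors at the current rank.
    """
    for r in rank_schedule(a.shape[1], r_tgt, n_stages):
        a, b = train_steps(a, b)              # fine-tune at the current rank
        a, b = truncate_adapter(a, b, r)      # anneal the rank downward
    return a, b
```

The gradual schedule mirrors the summary's motivation: early stages explore the richer high-rank space, while each compression discards only the weakest directions before training resumes.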
Experiments span 13 text‑only benchmarks (GLUE, SuperGLUE, SQuAD, etc.) and 10 vision‑language tasks (ViLT, CLIP‑FineTune, VQA, etc.). Across a range of source ranks (64–128) and target ranks (8–32), Post‑Squeeze consistently outperforms adapters trained directly at the target rank, delivering 0.2–0.8 % higher accuracy on average. In‑Squeeze shows the strongest gains when the final rank is very low (≤16), achieving the best size‑performance trade‑off while cutting total fine‑tuning compute by 30–45 % because a single hyper‑parameter search suffices for all target ranks.
Compared to related work such as AdaLoRA (which scores importance and reallocates rank per layer) or LoRA‑XS (which freezes singular vectors), LoRA‑Squeeze requires no extra scoring, no layer‑wise rank scheduling, and preserves homogeneous rank structures, simplifying batching, memory management, and serving pipelines. The RSVD step is GPU‑memory friendly and scales to models with billions of parameters, making real‑time compression feasible.
Key contributions are: (1) decoupling training rank from deployment rank, eliminating the need for rank‑specific hyper‑parameter tuning; (2) introducing an efficient RSVD‑based compression that closely approximates the Frobenius‑optimal low‑rank factorization given by truncated SVD (Eckart–Young); (3) providing both offline (Post‑Squeeze) and online (In‑Squeeze) strategies; and (4) demonstrating broad applicability across a large suite of NLP and multimodal tasks. Future directions include extending the approach to multiple LoRA modules with shared subspaces, combining RSVD compression with non‑linear adapter transformations, and evaluating scaling behavior on extremely large language models (e.g., 175 B parameters).
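The Frobenius-norm claim in contribution (2) rests on the Eckart–Young theorem: the best rank-r approximation error equals the norm of the discarded singular-value tail, and exact truncated SVD attains it (RSVD approximates this optimum). A small numerical sanity check of that identity, with illustrative function names:

```python
# Sanity check of the Eckart–Young identity: the Frobenius error of the best
# rank-r approximation equals the norm of the discarded singular values, and
# exact truncated SVD attains exactly that error.
import numpy as np

def best_rank_r_error(delta_w, r):
    """Optimal rank-r Frobenius error: sqrt of the sum of squared tail σ_i."""
    s = np.linalg.svd(delta_w, compute_uv=False)
    return float(np.sqrt(np.sum(s[r:] ** 2)))

def truncated_svd_error(delta_w, r):
    """Error actually achieved by the exact rank-r truncated SVD."""
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    approx = (u[:, :r] * s[:r]) @ vt[:r]
    return float(np.linalg.norm(delta_w - approx))
```

In practice this gives a cheap diagnostic for choosing r_tgt: the tail of the singular-value spectrum of ΔW_src directly reports how much adapter signal a given target rank would discard.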