CR-Net: Scaling Parameter-Efficient Training with Cross-Layer Low-Rank Structure
Low-rank architectures have become increasingly important for efficient large language model (LLM) pre-training, providing substantial reductions in both parameter complexity and memory/computational demands. Despite these advantages, current low-rank methods face three critical shortcomings: (1) compromised model performance, (2) considerable computational overhead, and (3) limited activation memory savings. To address these limitations, we propose Cross-layer Low-Rank residual Network (CR-Net), an innovative parameter-efficient framework inspired by our discovery that inter-layer activation residuals possess low-rank properties. CR-Net implements this insight through a dual-path architecture that efficiently reconstructs layer activations by combining previous-layer outputs with their low-rank differences, thereby maintaining high-rank information with minimal parameters. We further develop a specialized activation recomputation strategy tailored for CR-Net that dramatically reduces memory requirements. Extensive pre-training experiments across model scales from 60M to 7B parameters demonstrate that CR-Net consistently outperforms state-of-the-art low-rank frameworks while requiring fewer computational resources and less memory.
💡 Research Summary
The paper introduces CR‑Net (Cross‑layer Low‑Rank residual Network), a novel parameter‑efficient framework for large‑scale language model (LLM) pre‑training. The authors first identify a previously unreported property of transformer activations: the difference between the activations of adjacent layers (ΔY) exhibits a strong low‑rank structure. This observation is validated empirically on LLaMA‑3 8B and GPT‑2‑small models, where low‑rank approximations of ΔY achieve significantly lower relative reconstruction error than direct low‑rank approximations of the raw activations, even when using the same rank budget.
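The low-rank structure of the activation residual can be illustrated with a small synthetic numpy experiment (the data and shapes here are illustrative stand-ins, not the paper's actual LLaMA‑3/GPT‑2 activations): a rank-r SVD truncation of the residual Y − Y_prev reconstructs it far more accurately than the same rank budget applied to the raw activations Y.

```python
import numpy as np

def relative_error(M, r):
    """Relative Frobenius error of the best rank-r approximation of M."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    M_r = (U[:, :r] * S[:r]) @ Vt[:r]
    return np.linalg.norm(M - M_r) / np.linalg.norm(M)

rng = np.random.default_rng(0)
# Toy stand-ins for adjacent-layer activations: Y_prev is full-rank,
# and Y = Y_prev + a genuinely low-rank update plus small noise.
n, d, r_true = 256, 512, 8
Y_prev = rng.standard_normal((n, d))
delta = rng.standard_normal((n, r_true)) @ rng.standard_normal((r_true, d))
Y = Y_prev + delta + 0.01 * rng.standard_normal((n, d))

r = 16
err_raw = relative_error(Y, r)             # rank-r approx of the raw activations
err_delta = relative_error(Y - Y_prev, r)  # rank-r approx of the residual ΔY
print(f"raw:   {err_raw:.3f}")
print(f"delta: {err_delta:.3f}")
```

On this toy data the residual error is an order of magnitude smaller than the raw-activation error at the same rank, mirroring the qualitative finding reported above.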
Building on this insight, CR‑Net replaces the full‑rank weight matrix of every layer (except the first) with two learnable low‑rank matrices A∈ℝ^{h_in×r} and B∈ℝ^{r×h_out}. The output of layer l at position P is computed as
Yₚˡ = βₚˡ·Yₚˡ⁻¹ + Xₚˡ·Aₚˡ·Bₚˡ,
where βₚˡ is a learnable scalar scaling factor. When βₚˡ≈0 the layer relies mainly on the low‑rank residual; when βₚˡ≈1 it heavily incorporates the high‑rank signal propagated from the previous layer. The first layer retains a conventional full‑rank weight matrix to preserve high‑rank information at the network's input. This design mitigates the information‑loss problem that plagues pure low‑rank methods such as LoRA, while keeping the total number of trainable parameters dramatically lower than in a full‑rank model.
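The dual-path update above can be sketched in a few lines of numpy (class and variable names are illustrative, not the authors' code; initialization and training logic are omitted):

```python
import numpy as np

class CRNetLayer:
    """One CR-Net-style layer: Y^l = beta * Y^{l-1} + X^l @ A^l @ B^l.

    A (h_in x r) and B (r x h_out) are the learnable low-rank factors;
    beta is the learnable scalar that balances the high-rank signal
    carried over from the previous layer against the low-rank residual.
    """
    def __init__(self, h_in, h_out, r, rng):
        self.A = rng.standard_normal((h_in, r)) / np.sqrt(h_in)
        self.B = rng.standard_normal((r, h_out)) / np.sqrt(r)
        self.beta = 1.0

    def forward(self, X, Y_prev):
        # Grouping as (X @ A) @ B keeps the cost at O(n*h*r) rather
        # than the O(n*h^2) of a full-rank weight matrix.
        return self.beta * Y_prev + (X @ self.A) @ self.B

rng = np.random.default_rng(0)
h, r, n = 64, 4, 8
layer = CRNetLayer(h, h, r, rng)
X = rng.standard_normal((n, h))
Y_prev = rng.standard_normal((n, h))
Y = layer.forward(X, Y_prev)
print(Y.shape)  # (8, 64)
```

Note the parameter count per layer: 2·h·r plus one scalar, versus h² for a full-rank matrix, which is where the reported parameter savings come from.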
A second contribution is a customized activation recomputation scheme. Standard gradient checkpointing stores a subset of activations and recomputes the rest during back‑propagation, but CR‑Net's dependence on previous‑layer outputs would make naive checkpointing incur O(L²) recomputation cost. The authors therefore store all layer inputs Xₗ plus the outputs of a selected subset 𝒜 of layers (always including the first). During the backward pass, activations of non‑checkpointed layers are reconstructed from the stored inputs together with the low‑rank matrices, yielding only linear‑time overhead while cutting activation memory by 30‑55 % across model sizes.
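The recomputation idea can be sketched with a toy numpy model (the layer sizes, checkpoint choice, and names below are illustrative assumptions, not the authors' implementation): layer 0 is full-rank, later layers follow the low-rank recurrence, and any non-checkpointed activation is rebuilt from the nearest earlier checkpoint using only the stored inputs.

```python
import numpy as np

# Toy CR-Net-style stack: Y_0 = X_0 @ W0 (full-rank first layer), then
# Y_l = beta_l * Y_{l-1} + X_l @ A_l @ B_l for l >= 1.
rng = np.random.default_rng(0)
n, h, r, L = 8, 32, 4, 6
W0 = rng.standard_normal((h, h)) / np.sqrt(h)
params = [(rng.standard_normal((h, r)) / np.sqrt(h),
           rng.standard_normal((r, h)) / np.sqrt(r),
           1.0) for _ in range(L - 1)]               # (A_l, B_l, beta_l)
X = [rng.standard_normal((n, h)) for _ in range(L)]  # all layer inputs are stored

def step(Y_prev, X_l, A, B, beta):
    return beta * Y_prev + (X_l @ A) @ B

# Forward pass: keep Y only at checkpointed layers (layer 0 is always kept).
# reference_Y is retained here purely to verify the sketch at the end.
checkpoints = {0, 3}
saved, reference_Y = {}, {}
Y = X[0] @ W0
saved[0] = Y
reference_Y[0] = Y
for l in range(1, L):
    A, B, beta = params[l - 1]
    Y = step(Y, X[l], A, B, beta)
    reference_Y[l] = Y
    if l in checkpoints:
        saved[l] = Y

def recompute(l):
    """Rebuild Y_l from the nearest earlier checkpoint using the stored
    inputs and low-rank factors: total cost is linear in L, not O(L^2)."""
    c = max(k for k in saved if k <= l)
    Y = saved[c]
    for j in range(c + 1, l + 1):
        A, B, beta = params[j - 1]
        Y = step(Y, X[j], A, B, beta)
    return Y

print(np.allclose(recompute(5), reference_Y[5]))  # True
```

Each layer between two checkpoints is replayed exactly once during the backward pass, which is why the overhead stays linear while only the cheap low-rank factors and inputs need to be held in memory.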
Extensive experiments cover models from 60 M to 7 B parameters, all based on the LLaMA‑2 architecture with SwiGLU. Compared with state‑of‑the‑art low‑rank techniques (LoRA, GaLore, RSO, CoLA‑M), CR‑Net consistently reduces trainable parameters by 20‑45 % and GPU memory consumption by 35‑55 %. Validation perplexity improves by 1.2‑2.0 % relative to LoRA, with the most pronounced gain (≈1.8 %) on the 7 B model. Training throughput remains on par with or slightly exceeds that of competing methods because the low‑rank residual computation is essentially a cheap matrix multiplication, and the tailored recomputation adds negligible overhead.
The paper also discusses stability: the full‑rank first layer and the learnable scaling factor β together prevent the collapse of representations into a low‑dimensional subspace, a risk observed in methods that enforce hard low‑rank constraints via SVD or QR projections. Ablation studies confirm that removing β or using a fixed scaling degrades performance, highlighting the importance of dynamic balancing between high‑rank and low‑rank signals.
Limitations and future directions are acknowledged. The current scalar β could be extended to per‑channel or per‑head scalars for finer control, and the rank r could be made adaptive during training to further optimize the trade‑off between compression and expressivity. Nonetheless, CR‑Net demonstrates a principled way to achieve simultaneous gains in parameter efficiency, computational cost, and activation memory, addressing the three major shortcomings, (1)–(3) above, of existing low‑rank training paradigms. This makes it a compelling candidate for cost‑effective pre‑training of ever larger LLMs.