Cut Less, Fold More: Model Compression through the Lens of Projection Geometry
Compressing neural networks without retraining is vital for deployment at scale. We study calibration-free compression through the lens of projection geometry: structured pruning is an axis-aligned projection, whereas model folding performs a low-rank projection via weight clustering. We formalize both as orthogonal operators and show that, within a rank distance of one, folding provably yields smaller parameter reconstruction error, and under mild smoothness assumptions, smaller functional perturbations than pruning. At scale, we evaluate more than 1,000 checkpoints spanning ResNet18, PreActResNet18, ViT-B/32, and CLIP ViT-B/32 on CIFAR-10 and ImageNet-1K, covering diverse training hyperparameters (optimizers, learning rates, augmentations, regularization, sharpness-aware training), as well as multiple LLaMA-family 60M and 130M parameter models trained on C4. We show that folding typically achieves higher post-compression accuracy, with the largest gains at moderate-to-high compression. The gap narrows, and occasionally reverses, under specific training setups. Our results position folding as a geometry-aware, calibration-free alternative to pruning that is often superior in practice and principled in theory.
💡 Research Summary
This paper investigates calibration‑free neural network compression, focusing on two structured post‑training methods: magnitude‑based structured pruning and model folding (weight clustering). The authors cast both techniques as orthogonal projections in the parameter space, revealing a unified geometric perspective. Pruning corresponds to an axis‑aligned projection that zeros out entire rows (neurons, filters, or channels), while folding corresponds to a low‑rank projection that clusters weight vectors and replaces each cluster by its mean, preserving directional information.
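The two operators can be made concrete in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the toy weight matrix, the retained-row set, and the cluster assignment are all made up for the example. Pruning applies a diagonal 0/1 matrix `P` (axis-aligned projection), while folding applies a block-averaging matrix `C` that replaces each row by its cluster mean; both are symmetric and idempotent, i.e. orthogonal projections.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))  # toy weight matrix: 6 neurons (rows)

# Pruning: axis-aligned projection that keeps rows {0, 2, 3} and zeros the rest.
keep = [0, 2, 3]
P = np.zeros((6, 6))
P[keep, keep] = 1.0          # 0/1 diagonal projector
W_pruned = P @ W             # pruned rows become exact zeros

# Folding: cluster rows and replace each row by its cluster mean.
# Hypothetical cluster assignment, chosen only for illustration.
clusters = [[0, 1], [2, 5], [3, 4]]
C = np.zeros((6, 6))
for c in clusters:
    for i in c:
        C[i, c] = 1.0 / len(c)  # uniform averaging within each cluster
W_folded = C @ W

# Both operators are orthogonal projections: symmetric and idempotent.
assert np.allclose(P @ P, P) and np.allclose(P, P.T)
assert np.allclose(C @ C, C) and np.allclose(C, C.T)
```

Note that `C` has rank 3 (one dimension per cluster), so folding is a low-rank projection that keeps a (cluster-averaged) contribution from every row, whereas `P` discards the pruned rows' directions entirely.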
Under a Lipschitz‑continuous loss assumption, the authors prove two central theorems. The first theorem shows that for any pruning of rank k_p (meaning k_p rows are kept), there exists a folding of rank k_f = k_p + 1 that yields a Frobenius‑norm reconstruction error no larger than that of pruning. Intuitively, folding “adds” one extra degree of freedom by merging the pruned rows into a new cluster, thus staying closer to the original weight matrix. The second theorem strengthens this claim: when folding uses an optimal k‑means clustering, its reconstruction error is never larger than any possible pruning of the same effective rank (k_f − 1). Consequently, via the Lipschitz bound |L(W) − L(Ŵ)| ≤ κ‖W − Ŵ‖_F, folding also guarantees a tighter bound on the change in the loss (and thus on functional perturbation) compared with pruning.
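The first theorem's construction can be checked numerically. The sketch below (toy data, magnitude-based row selection assumed for illustration) prunes the smallest-norm rows and, for folding, keeps each retained row as its own singleton cluster while merging the dropped rows into one extra cluster replaced by their mean. Since the cluster mean minimizes within-cluster squared error and the zero vector used by pruning is just one (generally suboptimal) candidate center, folding's Frobenius error can never exceed pruning's, and the Lipschitz bound then transfers this gap to the loss.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 5))

# Pruning of rank k_p = 5: keep the 5 rows with largest L2 norm, zero the rest.
k_p = 5
order = np.argsort(-np.linalg.norm(W, axis=1))
keep, drop = order[:k_p], order[k_p:]
W_prune = W.copy()
W_prune[drop] = 0.0

# Folding of rank k_f = k_p + 1: retained rows stay untouched (singleton
# clusters); the dropped rows form one cluster, replaced by their mean.
W_fold = W.copy()
W_fold[drop] = W[drop].mean(axis=0)

err_prune = np.linalg.norm(W - W_prune)  # = norm of the dropped rows
err_fold = np.linalg.norm(W - W_fold)    # = within-cluster scatter of dropped rows

# Theorem 1's guarantee: one extra rank lets folding do no worse than pruning.
assert err_fold <= err_prune
# Via |L(W) - L(W_hat)| <= kappa * ||W - W_hat||_F, the same ordering
# carries over to the worst-case loss perturbation.
```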
To validate the theory, the authors conduct an extensive empirical study involving more than 1,000 checkpoints across a diverse set of models: ResNet‑18, PreActResNet‑18, ViT‑B/32, CLIP ViT‑B/32, and LLaMA‑60M/130M. Each model family is trained under a wide range of hyper‑parameters—including different optimizers (Adam, SGD), learning rates, data augmentations, regularization strengths, and Sharpness‑Aware Minimization (SAM)—to capture how upstream training influences compression outcomes. Compression is performed at matched sparsity/FLOP budgets, ensuring a fair comparison.
For CNNs, after compression the authors apply REPAIR to recompute batch‑norm statistics, isolating the structural effect of the compression. For ViTs and LLaMA models, they evaluate both the raw compressed models and versions that receive lightweight LayerNorm resets or a few epochs of full fine‑tuning. Across all settings, folding consistently outperforms magnitude pruning, especially at moderate to high compression ratios (e.g., retaining 20–50% of parameters). The advantage is most pronounced in Vision Transformers, where folding yields larger accuracy gains than in ResNets. At very low compression ratios, the gap narrows and can occasionally reverse, particularly when strong L1 regularization was used during training.
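The batch-norm recomputation step can be illustrated with a minimal sketch. This is not the authors' REPAIR code; it only shows the underlying idea under simplifying assumptions (a single linear layer, a made-up calibration batch): after compression changes the pre-activation distribution, the normalization statistics are recomputed from forward passes so the layer re-standardizes the compressed model's activations rather than the original model's.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup: a compressed linear layer feeding a BatchNorm whose running
# statistics are stale (they were estimated for the uncompressed weights).
W_compressed = rng.normal(size=(16, 8))
X = rng.normal(size=(256, 8))      # hypothetical calibration batch of inputs
A = X @ W_compressed.T             # post-compression pre-activations

# REPAIR-style reset (sketch): recompute mean/variance per channel from the
# compressed model's own activations, then re-standardize with them.
bn_mean = A.mean(axis=0)
bn_var = A.var(axis=0)
A_normed = (A - bn_mean) / np.sqrt(bn_var + 1e-5)

# After the reset, each channel is approximately zero-mean, unit-variance
# again, isolating the structural effect of the compression itself.
```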
The empirical results align with the theoretical predictions: folding’s cluster‑structured subspace retains more of the original weight geometry, leading to smaller parameter distortion and, consequently, smaller functional degradation. Even after light fine‑tuning, folding’s edge remains, indicating that the retained directional information facilitates more efficient adaptation. The study also confirms that optimal k‑means folding achieves the minimal possible reconstruction error among all clustering‑based compressions, validating the optimality claim of Theorem 2.2.
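A clustering-based folding step can be sketched with a plain Lloyd's k-means over the rows of a weight matrix. This is a simplified stand-in for the paper's folding procedure (the helper `fold_rows`, the toy matrix, and the iteration budget are all illustrative assumptions; Lloyd's algorithm also only approximates the optimal k-means clustering invoked by the theory). It makes the rank trade-off tangible: with as many clusters as rows the folding is lossless, and fewer clusters shrink the matrix's effective rank at the cost of reconstruction error.

```python
import numpy as np

def fold_rows(W, k, iters=20, seed=0):
    """Fold W by clustering its rows with Lloyd's k-means and replacing
    each row by its cluster mean. Illustrative sketch, not the paper's code."""
    rng = np.random.default_rng(seed)
    centers = W[rng.choice(len(W), size=k, replace=False)]  # init from rows
    labels = np.zeros(len(W), dtype=int)
    for _ in range(iters):
        # Assign each row to its nearest center, then recompute the means.
        d = np.linalg.norm(W[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = W[labels == j].mean(axis=0)
    return centers[labels], labels

rng = np.random.default_rng(3)
W = rng.normal(size=(12, 6))

# With one cluster per row, folding reproduces W exactly (zero error).
W_exact, _ = fold_rows(W, len(W))
assert np.allclose(W_exact, W)

# With k = 4 clusters, the folded matrix has at most 4 distinct rows.
W4, labels4 = fold_rows(W, 4)
assert np.unique(labels4).size <= 4
```

Sweeping `k` downward traces out the compression curve the paper studies: each reduction in cluster count removes one direction from the projection subspace and adds its within-cluster scatter to the reconstruction error.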
In summary, the paper makes three major contributions: (1) a unified projection‑based framework that mathematically relates pruning and folding, together with provable guarantees that folding yields lower reconstruction error and tighter loss perturbation bounds; (2) a large‑scale empirical benchmark covering diverse architectures, datasets (CIFAR‑10, ImageNet‑1K), and training regimes, demonstrating folding’s superior post‑compression accuracy, especially in the moderate‑to‑high compression regime; and (3) practical insights that folding can serve as a geometry‑aware, calibration‑free alternative to pruning, offering consistent benefits across CNNs, Vision Transformers, and large language models, while also highlighting regimes where the advantage diminishes. This work bridges theoretical geometry with practical model compression, opening avenues for designing new compression schemes that explicitly optimize for functional closeness rather than merely parameter count.