A Study on Optimizing the Ordering of LLM Compression Techniques

Reading time: 5 minutes

📝 Abstract

Large Language Models (LLMs) require substantial computational resources, making model compression essential for efficient deployment in constrained environments. The individual effects of the dominant compression techniques, knowledge distillation, structured pruning, and low-bit quantization, are well studied, but their interactions and optimal sequencing remain unclear. This work systematically examines how these techniques perform both independently and in combination when applied to the Qwen2.5 3B model. We evaluate multiple compression pipelines, including single-technique baselines and the proposed three-technique sequences, using perplexity, G-Eval, clarity, prompt alignment, and compression ratio as metrics. Our experiments show that quantization provides the greatest standalone compression, while pruning introduces moderate quality degradation. Critically, the ordering of techniques significantly affects final model quality: the sequence Pruning, Knowledge Distillation, Quantization (P-KD-Q) yields the best balance, achieving a 3.68x compression ratio while preserving strong instruction-following and language-understanding capabilities. Conversely, pipelines that apply quantization early suffer severe performance degradation due to irreversible information loss that impairs subsequent training. Overall, this study offers practical insight into designing effective, ordering-aware compression pipelines for deploying LLMs in resource-limited settings.

📄 Content

Over the past decade, artificial intelligence has experienced an extraordinary surge in capability and adoption, evolving from traditional machine learning into large language models whose scale makes them difficult to run on edge devices, real-time systems, or cost-constrained environments. Their inference latency can hinder interactive applications, and frequent retraining or fine-tuning exacerbates the computational burden. These limitations have created a strong need for efficient model compression techniques, such as pruning [16], knowledge distillation [17], and quantization [18], which aim to reduce model size, improve inference speed, and lower resource requirements while maintaining accuracy. Motivated by these issues, this research explores how such optimization strategies can make large language models more scalable, accessible, and practical for real-world use.

Quantization in deep learning is an optimization technique that reduces the precision of a model's numerical values, typically converting them from floating-point numbers to lower-precision integers. The authors in [18,19] published a comprehensive study on the quantization of LLMs, covering range-mapping schemes (affine quantization and scale quantization) and quantization techniques (post-training quantization, quantization-aware training, weight quantization, and activation-aware weight quantization). Although post-training quantization techniques improve an LLM's computational efficiency and memory footprint, their hand-crafted quantization settings result in poor performance, particularly at very low bit widths. This issue is addressed by the omnidirectionally calibrated quantization (OmniQuant) technique for LLMs. To address the substantial training resources required by quantization-aware training, the EfficientQAT method was proposed, consisting of block-wise training of all parameters and end-to-end training of quantization parameters [20].
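As a concrete illustration of the affine (asymmetric) range mapping described above, the following sketch quantizes a float tensor to unsigned 8-bit integers and maps it back. The function names and NumPy-only setup are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def affine_quantize(w: np.ndarray, n_bits: int = 8):
    """Map a float tensor to n_bits unsigned integers (affine/asymmetric scheme)."""
    qmax = 2 ** n_bits - 1
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / qmax           # real-valued step between integer levels
    zero_point = round(-w_min / scale)       # integer that represents the real value 0.0
    q = np.clip(np.round(w / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point

def affine_dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover approximate float values from the integer representation."""
    return (q.astype(np.float32) - zero_point) * scale

# Round-trip demo: reconstruction error is bounded by roughly one quantization step.
w = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = affine_quantize(w, n_bits=8)
w_hat = affine_dequantize(q, scale, zp)
```

Lower bit widths shrink `qmax`, enlarging `scale` and hence the rounding error, which is why hand-crafted settings degrade at very low precision.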
The authors in [21] applied low-rank adaptation together with quantization-aware training, which reduces the gap between the full-precision and quantized models and greatly enhances generalization on downstream tasks. Pruning in deep learning is a technique to reduce the size and complexity of a neural network by removing less important parameters, such as weights, neurons, or entire layers. In one proposed approach, each weight matrix is parameterized by its low-rank factorization, and rank-1 components are adaptively eliminated during training [22]. Wanda (Pruning by Weights and Activations) is a novel and effective pruning method designed to induce sparsity in pretrained LLMs [16]. SlimGPT, a batched greedy pruning method for rapid and near-optimal pruning, enhances the accuracy of head-wise pruning error estimation through grouped Cholesky decomposition [23]. The authors in [24] proposed Fluctuation-based Adaptive Structured Pruning, which formulates structured importance metrics, adaptively searches for the globally compressed model, and implements compensation mechanisms to mitigate performance loss.
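To make the pruning idea concrete, here is a minimal, unstructured sketch of a Wanda-style importance score (weight magnitude scaled by the input-activation norm) followed by threshold pruning. This simplifies the published method, which ranks scores per output row rather than globally, and all names here are illustrative.

```python
import numpy as np

def wanda_score(weights: np.ndarray, activations: np.ndarray) -> np.ndarray:
    """Importance of each weight: |W_ij| * ||X_j||, where X_j is the j-th input feature
    over a batch of calibration activations."""
    # weights: (out_features, in_features); activations: (n_samples, in_features)
    act_norm = np.linalg.norm(activations, axis=0)   # per-input-feature L2 norm
    return np.abs(weights) * act_norm                # broadcasts across output rows

def prune_by_score(weights: np.ndarray, scores: np.ndarray, sparsity: float):
    """Zero out the fraction `sparsity` of weights with the lowest scores (unstructured)."""
    k = int(weights.size * sparsity)
    threshold = np.sort(scores.ravel())[k]
    mask = scores >= threshold
    return weights * mask, mask
```

Using activation norms rather than weight magnitude alone lets the score account for how strongly each input feature is actually driven by real data, which is the key observation behind Wanda.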

While the individual compression techniques, quantization [25], pruning [26], and knowledge distillation [27], have been extensively studied in isolation, real-world deployment scenarios often require combining multiple techniques to achieve aggressive compression ratios while maintaining acceptable performance [28]. However, the existing literature provides limited guidance on the optimal ordering of these techniques when applied sequentially to small-scale LLMs. Recent surveys [29] identify this as a critical gap: different orderings may exhibit synergistic or antagonistic interactions, and certain sequences may be infeasible due to technical constraints (e.g., quantization's incompatibility with gradient-based training). The foundational "Deep Compression" work in [30] used a pipeline of pruning, quantization, and encoding to drastically reduce the size of the AlexNet model. The same principle of applying techniques in sequence can be extended to modern small-scale LLMs. This work systematically explores compression technique orderings to identify optimal strategies for practitioners deploying compressed LLMs in resource-constrained environments.
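The ordering constraint noted above, that gradient-based training cannot follow plain integer quantization, can be made concrete with a toy enumeration of the six three-stage orderings. The stage names and the feasibility rule are illustrative assumptions for exposition, not the paper's evaluation code.

```python
from itertools import permutations

STAGES = ("prune", "distill", "quantize")

def feasible(pipeline) -> bool:
    """Reject orderings where a gradient-based stage (distillation) would have to
    train a model that is already in low-bit integer form."""
    seen_quantize = False
    for stage in pipeline:
        if stage == "distill" and seen_quantize:
            return False
        if stage == "quantize":
            seen_quantize = True
    return True

# Of the six permutations, only those placing distillation before quantization survive.
feasible_pipelines = [p for p in permutations(STAGES) if feasible(p)]
```

Under this rule, P-KD-Q (the sequence the paper finds best) is feasible, while any quantize-first pipeline that still needs distillation is ruled out, mirroring the severe degradation the paper reports for quantize-early orderings.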

This paper is organized as follows: Section 2 describes the methodology, covering the single-strategy baseline compression techniques in Section 2.1: knowledge distillation in Section 2.1.1, pruning in Section 2.1.2, and quantization in Section 2.1.3. We propose our compression ordering strategy in Section 2.2, with specific focus on the six three-technique sequences explained in Section 2.2.1. Section 3 discusses the results and analysis, and Section 4 concludes the paper.

The methodology begins by outlining each individual compression strategy, knowledge distillation, pruning, and quantization, highlighting how each technique independently reduces model size or computation while maintaining acceptable performance. However, relying on any single method alone rarely achieves the aggressive compression ratios needed for deployment, which motivates the combined, ordered pipelines examined in this work.
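For the distillation component, a minimal sketch of the soft-target loss commonly used for knowledge distillation is shown below (a Hinton-style blend of temperature-scaled KL divergence and hard-label cross-entropy). The exact loss used in the paper may differ, and all names here are illustrative.

```python
import numpy as np

def softmax(z: np.ndarray, T: float = 1.0) -> np.ndarray:
    """Temperature-scaled softmax over the last axis, with a stability shift."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, T: float = 2.0, alpha: float = 0.5):
    """Blend soft-target KL(teacher || student) with hard-label cross-entropy.
    The T**2 factor keeps soft-target gradients comparable across temperatures."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1).mean()
    probs = softmax(student_logits)
    ce = -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * (T ** 2) * kl + (1 - alpha) * ce
```

A higher temperature T softens the teacher's distribution, exposing the relative probabilities of wrong classes ("dark knowledge") that the student would not see from hard labels alone.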

This content is AI-processed based on ArXiv data.
