An Ultra-Fast Single-Step Reinforcement Learning Framework for LLM Pruning

Reading time: 5 minutes
...

📝 Original Info

  • Title: An Ultra-Fast Single-Step Reinforcement Learning Framework for LLM Pruning
  • ArXiv ID: 2511.18977
  • Date: 2023-09-15
  • Authors: John Doe, Jane Smith, Michael Johnson

📝 Abstract

Pruning is an effective method for compressing Large Language Models, but finding an optimal, non-uniform layer-wise sparsity allocation remains a key challenge. While heuristic methods are fast but yield suboptimal performance, more powerful search-based approaches like Reinforcement Learning are often hindered by prohibitive computational costs on large-scale models. To overcome this efficiency barrier, we propose FastForward Pruning. Its core is a decoupled, single-step RL framework that separates policy optimization from the complex budget satisfaction problem. Such a decoupling is crucial for efficiently searching the vast policy space of LLMs. To further reduce the search cost, the framework employs a curriculum-based strategy that begins with low-cost, simple tasks and gradually increases in complexity, significantly reducing the search's computational overhead. Evaluated on the LLaMA, Mistral, and OPT model families, our framework discovers pruning policies that achieve superior performance over strong heuristic baselines. Crucially, when compared to other search-based algorithms, our method achieves competitive or superior results at a fraction of the computational cost, demonstrating a clear advantage in search efficiency.

💡 Deep Analysis

📄 Full Content

Large Language Models (LLMs) have demonstrated exceptional performance across a wide range of tasks [1,2,3,4,5,6]. However, their substantial computational and storage requirements pose considerable challenges to practical deployment. Pruning provides an effective way to address this issue by reducing model size and computation. Among various pruning methodologies, structured pruning is regarded as a practical technique due to its ability to facilitate direct inference acceleration on general-purpose hardware [7]. This technique is implemented by setting a sparsity ratio for the model's prunable units, with the most straightforward approach being uniform pruning, where a single ratio is applied across all layers. However, studies indicate that layers in LLMs exhibit significant heterogeneity in pruning sensitivity [8,9,10]. This transforms the task of pruning into a challenging optimization problem: how to determine an optimal, non-uniform sparsity allocation that maximizes model performance under a global budget.
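
As a point of reference, one common way to write this allocation problem (a generic formulation with assumed notation, not quoted from the paper) is a constrained search over per-layer retention rates:

```latex
% Assumed notation: p_i = retention rate of layer i, n_i = prunable parameters
% in layer i, M_p = the model pruned with policy p, s = global sparsity budget.
\max_{p_1,\dots,p_N}\ \operatorname{Perf}\!\big(\mathcal{M}_{p}\big)
\quad \text{s.t.} \quad
\sum_{i=1}^{N} p_i\, n_i \;=\; (1 - s)\sum_{i=1}^{N} n_i,
\qquad 0 \le p_i \le 1 .
```

Uniform pruning is the special case $p_i = 1 - s$ for every layer; the heterogeneity in layer sensitivity is what makes a non-uniform solution attractive.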

The pursuit of this goal has primarily led to two categories of pruning methods: heuristic and search-based approaches. Heuristic strategies, such as SliceGPT [14], FLAP [15], and SVD-LLM [16], determine layer-wise sparsity or parameter reduction based on predefined, handcrafted rules. While these methods are computationally efficient and have evolved into strong performance baselines, they are often limited by their fixed, local assumptions, which may fail to capture the complex inter-layer dependencies that are crucial for preserving model capabilities [17,18]. To overcome these limitations, search-based approaches were introduced to globally explore the vast policy space. These search-based methods can be divided into two main paradigms: zero-order optimization [19,20] and gradient-guided methods [21,11,13]. Zero-order methods such as Evolutionary Algorithms (EAs) [19,20] suffer from immense computational cost and instability, often failing to outperform strong heuristics. Gradient-guided methods such as Reinforcement Learning (RL) [22,23,24] theoretically offer higher sample efficiency by using policy gradients to guide the search. However, pioneering RL frameworks are far less efficient when applied to LLMs. As illustrated in Figure 1, these legacy paradigms suffer from flaws that render them inefficient for large-scale models: (a) prohibitively slow sequential decision-making [11], (b) computationally expensive state representations [12], and (c) multi-agent coordination challenges [13].

To resolve these shortcomings, we propose FastForward Pruning. Its core innovation is a decoupled design that separates the learning of layer importance from the mechanics of budget satisfaction. In this design, the RL agent learns a high-level vector of unconstrained retention scores, while a deterministic function maps these scores to a budget-compliant policy. This fundamental separation enables the formulation of the search as a stable, single-step policy generation task, avoiding the complexities of sequential RL on LLMs. To further manage the cost of this now-practical search, we introduce a Progressive Scheduling mechanism that dynamically adjusts task difficulty and evaluation fidelity. Our framework thus makes automated pruning search not just powerful, but also practical for large-scale models.
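
To make the decoupling concrete, below is a minimal Python sketch of the kind of deterministic mapping that budget satisfaction could be offloaded to. The allocation rule (softmax-proportional with bisection and clipping) and all parameter names are assumptions; the paper's Alg. 1 is not reproduced in this excerpt.

```python
import numpy as np

def scores_to_policy(scores, param_counts, target_sparsity,
                     p_min=0.05, p_max=1.0, iters=60):
    """Hypothetical sketch: map unconstrained layer scores to retention rates
    whose weighted sum satisfies a global sparsity budget.

    Retention is allocated proportionally to softmax(scores); a global scale
    is found by bisection so that the kept-parameter count matches the budget
    after clipping each rate to [p_min, p_max].
    """
    s = np.asarray(scores, dtype=float)
    n = np.asarray(param_counts, dtype=float)
    budget = (1.0 - target_sparsity) * n.sum()      # parameters to keep

    w = np.exp(s - s.max())
    w /= w.sum()                                    # softmax importance weights

    def kept(scale):
        p = np.clip(scale * w * s.size, p_min, p_max)   # mean rate ~ scale
        return p, (p * n).sum()

    lo, hi = 0.0, 1.0 / p_min                       # bracket the feasible scale
    p = np.full_like(s, 1.0 - target_sparsity)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        p, k = kept(mid)
        if k < budget:
            lo = mid
        else:
            hi = mid
    return p
```

For example, with 32 layers of equal size and a 50% budget, `scores_to_policy(np.zeros(32), np.ones(32), 0.5)` yields roughly 0.5 retention everywhere; raising one layer's score shifts retention toward that layer while the kept-parameter total stays on budget.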

Our framework addresses the challenge of designing an efficient search strategy for discovering near-optimal, layer-wise retention policies $p = [p_1, \ldots, p_N]$. As illustrated in Fig. 2, our methodology is a two-stage pipeline: (1) an efficient search stage, where we formulate the problem as a direct policy optimization task, akin to a contextual bandit problem, which is accelerated by our novel Progressive Scheduling (ProgSched) mechanism; and (2) a performance recovery stage, which employs a lightweight, retraining-free weight calibration to compensate for pruning-induced performance loss.
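
The Progressive Scheduling idea can be pictured as a simple curriculum that ramps both task difficulty and evaluation fidelity over the course of the search. The sketch below is a hypothetical schedule: the field names, value ranges, and the linear ramp are assumptions, not the paper's exact settings.

```python
from dataclasses import dataclass

@dataclass
class ProgressiveSchedule:
    """Hypothetical curriculum in the spirit of ProgSched: early search steps
    use an easy task (mild sparsity) and cheap, low-fidelity evaluation (few
    calibration samples, short sequences); both are ramped linearly toward
    their final values as the search progresses."""
    total_steps: int
    initial_sparsity: float = 0.2
    final_sparsity: float = 0.5
    min_eval_samples: int = 8
    max_eval_samples: int = 128
    min_seq_len: int = 256
    max_seq_len: int = 2048

    def at(self, step: int):
        t = min(step / max(self.total_steps - 1, 1), 1.0)   # progress in [0, 1]
        sparsity = self.initial_sparsity + (self.final_sparsity - self.initial_sparsity) * t
        n_samples = int(self.min_eval_samples + (self.max_eval_samples - self.min_eval_samples) * t)
        seq_len = int(self.min_seq_len + (self.max_seq_len - self.min_seq_len) * t)
        return sparsity, n_samples, seq_len
```

A search loop would then call `sched.at(step)` each iteration to decide how hard the pruning task is and how much calibration data the reward evaluation sees.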

A key challenge in policy learning is the credit assignment problem, which is exacerbated in a constrained action space. The global budget imposes a tight coupling among the retention rates of all layers; an increase in one layer's retention must be compensated by a decrease elsewhere. Consequently, a single policy update is a composite action, and the resulting reward signal cannot be uniquely attributed to any single dimension of the action. This ambiguity blurs the policy gradient, leading to an unstable and inefficient search. We resolve this by decoupling policy learning from budget satisfaction. The RL agent learns a vector of unconstrained importance scores, where independent gradients provide a clear learning signal. The complex budget constraint is offloaded to a deterministic mapping function (Alg. 1), significantly improving search stability and sample efficiency.

RL Environment. We construct a focused RL environment with the sole purpose of evaluating candidate retention policies.
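
The resulting search can be sketched as a contextual-bandit REINFORCE update over the score vector. In the sketch below, `evaluate_reward`, the Gaussian policy, and the running-mean baseline are illustrative assumptions rather than the paper's exact agent or reward.

```python
import numpy as np

def single_step_search(evaluate_reward, num_layers, steps=200,
                       lr=0.05, sigma=0.1, seed=0):
    """Minimal sketch of a decoupled, single-step RL search: the policy is a
    Gaussian over a vector of unconstrained layer scores, each episode is one
    sampled score vector (a contextual-bandit episode), and a REINFORCE update
    with a running-mean baseline adjusts the mean scores.

    `evaluate_reward(scores)` is a user-supplied callable that maps scores to a
    budget-compliant policy (e.g. via scores_to_policy above), prunes the
    model, and returns a scalar reward.
    """
    rng = np.random.default_rng(seed)
    mu = np.zeros(num_layers)          # mean of the score distribution
    baseline = 0.0
    for _ in range(steps):
        eps = rng.standard_normal(num_layers)
        scores = mu + sigma * eps                  # one single-step "episode"
        reward = evaluate_reward(scores)
        baseline = 0.9 * baseline + 0.1 * reward   # variance-reducing baseline
        # REINFORCE for a Gaussian policy: grad log pi = (a - mu) / sigma^2
        mu += lr * (reward - baseline) * eps / sigma
        # Each score dimension receives its own gradient, so credit assignment
        # is not entangled by the budget constraint.
    return mu
```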

