Winning the Lottery by Preserving Network Training Dynamics with Concrete Ticket Search

Reading time: 6 minutes

📝 Abstract

The Lottery Ticket Hypothesis asserts the existence of highly sparse, trainable subnetworks (‘winning tickets’) within dense, randomly initialized neural networks. However, state-of-the-art methods of drawing these tickets, like Lottery Ticket Rewinding (LTR), are computationally prohibitive, while more efficient saliency-based Pruning-at-Initialization (PaI) techniques suffer from a significant accuracy-sparsity trade-off and fail basic sanity checks. In this work, we argue that PaI’s reliance on first-order saliency metrics, which ignore inter-weight dependencies, contributes substantially to this performance gap, especially in the sparse regime. To address this, we introduce Concrete Ticket Search (CTS), an algorithm that frames subnetwork discovery as a holistic combinatorial optimization problem. By leveraging a Concrete relaxation of the discrete search space and a novel gradient balancing scheme (GRADBALANCE) to control sparsity, CTS efficiently identifies high-performing subnetworks near initialization without requiring sensitive hyperparameter tuning. Motivated by recent works on lottery ticket training dynamics, we further propose a knowledge distillation-inspired family of pruning objectives, finding that minimizing the reverse Kullback-Leibler divergence between sparse and dense network outputs (CTS-KL) is particularly effective. Experiments on varying image classification tasks show that CTS produces subnetworks that robustly pass sanity checks and achieve accuracy comparable to or exceeding LTR, while requiring only a small fraction of the computation. For example, on ResNet-20 on CIFAR10, it reaches 99.3% sparsity with 74.0% accuracy in 7.9 minutes, while LTR attains the same sparsity with 68.3% accuracy in 95.2 minutes. CTS’s subnetworks outperform saliency-based methods across all sparsities, but its advantage over LTR is most pronounced in the highly sparse regime.

📄 Content

Over the last decades, the exponential scaling of neural networks and the demand for edge deployment have driven significant research into model compression through pruning, i.e., removing a large portion of a network's weights. Although pruning was traditionally applied after training, owing to the conventional wisdom that overparameterization was necessary for effective gradient descent [1], [2], Frankle and Carbin [3] provide empirical evidence that this computationally expensive process can be avoided.

They propose the Lottery Ticket Hypothesis (LTH): for any task, given a sufficiently complex, randomly initialized model, there exist sparse subnetworks that can be trained to accuracy comparable to that of their fully trained dense counterparts. These subnetworks, termed ‘winning’ or ‘lottery’ tickets, are found through Iterative Magnitude Pruning (IMP): a cycle of training to convergence, pruning a small percentage of least-magnitude weights, and ‘rewinding’ the remaining weights to their values at initialization. In practice, however, on deeper models, IMP is unable to find lottery tickets unless weights are rewound to an iteration slightly after initialization, a variant known as Lottery Ticket Rewinding (LTR) [4].
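The pruning step of one IMP round can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: `imp_round` is a hypothetical helper, and the train-to-convergence and rewind-to-initialization steps of the cycle are elided.

```python
import numpy as np

def imp_round(weights, masks, prune_frac):
    """One IMP pruning step: globally remove the smallest-magnitude
    `prune_frac` fraction of the weights that are still alive.

    After this step, IMP would rewind the surviving weights to their
    initial values (or, for LTR, to an early training iteration)
    and retrain before the next round.
    """
    # Magnitudes of surviving weights, pooled across all layers.
    alive = np.concatenate([np.abs(w[m == 1]) for w, m in zip(weights, masks)])
    k = int(prune_frac * alive.size)
    thresh = np.sort(alive)[k]  # k-th smallest surviving magnitude
    # Keep a weight only if it was alive and its magnitude clears the threshold.
    return [m * (np.abs(w) >= thresh) for w, m in zip(weights, masks)]
```

For example, with `prune_frac=0.5`, a network whose layers hold six surviving weights keeps the three of largest magnitude after one round.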

The intense retraining required by LTR defeats the original purpose of lottery tickets: to improve training efficiency. To address this, researchers have explored pruning methods that identify such winning tickets at the start of training, known as Pruning-at-Initialization (PaI) methods. However, PaI methods consistently fall short of LTR, showing a noticeable tradeoff between sparsity and accuracy. Frankle and Carbin [5] demonstrated that popular PaI methods like SNIP, GraSP, and SynFlow also fail basic sanity checks, such as randomly shuffling the pruning mask within each layer. Such methods seem to merely identify good sparsity ratios for each layer, but lose initialization-specific information. They remark that the consistent failure of these diverse methods may point to a fundamental weakness in pruning at initialization and urge the development of new signals to guide pruning early in training.
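The layer-wise shuffle sanity check mentioned above can be sketched as a small helper (an illustrative NumPy version, not code from the cited works): each layer's mask is permuted, which preserves per-layer sparsity but destroys any initialization-specific choice of which weights survive. If a pruning method's accuracy is unchanged under this shuffle, the method has only discovered good layer-wise sparsity ratios.

```python
import numpy as np

def shuffle_mask_within_layers(masks, rng):
    """Permute each layer's pruning mask independently.

    Per-layer sparsity is preserved exactly; the specific surviving
    weights are randomized. Comparing accuracy before and after this
    shuffle is one of the sanity checks PaI methods fail.
    """
    shuffled = []
    for m in masks:
        flat = m.flatten().copy()
        rng.shuffle(flat)
        shuffled.append(flat.reshape(m.shape))
    return shuffled
```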

To our knowledge, the search for an efficient algorithm to draw lottery tickets near initialization with state-of-the-art accuracy remains an open problem.

In this work, we answer this challenge by introducing Concrete Ticket Search (CTS), which reframes ticket discovery as a direct combinatorial optimization problem. Rather than scoring weight importance independently, CTS learns entire tickets through a Concrete relaxation [6] of the binary mask space. A novel gradient balancing scheme ensures that the target sparsity is met without any extensive tuning, and pruning objectives inspired by knowledge distillation, rather than the naive task loss, guide the sparse model to preserve the behavior of its dense counterpart. For example, an especially effective objective comes from minimizing the reverse Kullback-Leibler (KL) divergence of the ticket with respect to its dense parent.
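The two ingredients can be sketched in a few lines of NumPy. This is a hedged illustration of the general techniques, not the paper's CTS implementation: the binary Concrete sample follows the standard logistic-noise construction, and the KL direction shown (sparse relative to dense) is one common convention for "reverse" KL in distillation.

```python
import numpy as np

def sample_concrete_mask(log_alpha, temperature, rng):
    """Binary Concrete relaxation of a Bernoulli mask.

    Returns values strictly in (0, 1) that concentrate toward {0, 1}
    as `temperature` decreases, so the mask parameters `log_alpha`
    can be learned by gradient descent instead of discrete search.
    """
    u = rng.uniform(1e-6, 1 - 1e-6, size=np.shape(log_alpha))
    logistic_noise = np.log(u) - np.log1p(-u)
    return 1.0 / (1.0 + np.exp(-(log_alpha + logistic_noise) / temperature))

def reverse_kl(p_dense, q_sparse, eps=1e-12):
    """KL(q_sparse || p_dense) between softmax outputs of the sparse
    ticket and its dense parent; zero iff the outputs match."""
    q = np.clip(q_sparse, eps, 1.0)
    p = np.clip(p_dense, eps, 1.0)
    return float(np.sum(q * np.log(q / p)))
```

Minimizing this divergence pushes the relaxed subnetwork to reproduce the dense network's output distribution rather than merely to fit the task labels.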

The main contributions of this work can be summarized as follows. We shed light on a fundamental flaw in the first-order approximations used by nearly all PaI works. As such, we posit that holistic subnetwork search is a necessity and develop CTS, an algorithm that draws lottery tickets near initialization with accuracy better than or comparable to the current state-of-the-art, LTR. It avoids the need for extra hyperparameter tuning and achieves these results in significantly less time. For example, at ∼200 times compression (0.47% density), the proposed method draws tickets in 192 (Quick CTS) and 24 (CTS) times fewer training iterations than LTR on CIFAR-10 tasks; this speedup only grows as subnetworks become sparser. These lottery tickets pass the sanity checks proposed in [5] and [7]. Further, inspired by knowledge distillation, we propose pruning criteria that perform better than existing approaches when pruning early in training.

While there have been a variety of approaches explored to compress neural networks, in this work, we focus on unstructured pruning.

Traditionally, network pruning has been treated as a post-processing step applied to fully trained, dense models. The prevailing wisdom was that the overparameterization of dense networks was essential for successful optimization dynamics [1], [2]. The most common methods operate by assigning an importance score, or saliency score, to each weight and removing those with the lowest scores. Common criteria include least-magnitude pruning [13], and those that leverage second-order information of the loss landscape, as in [14], [15], and [16]. However, one-shot approaches suffer from extreme drops in performance in the sparse regime, often due to layer collapse: all weights of a single layer are removed, reducing the network to a trivial function. To combat this, effective works all follow a taxing iterative procedure consisting of repeatedly pruning a small percentage of weights and retraining.
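Layer collapse is easy to reproduce with one-shot global magnitude pruning. The sketch below (an illustrative NumPy example, not drawn from the cited works) prunes to an extreme density; a layer whose weights are uniformly small loses every weight, severing the forward path.

```python
import numpy as np

def global_magnitude_prune(weights, density):
    """One-shot global magnitude pruning: keep only the top `density`
    fraction of weights by magnitude, pooled across all layers."""
    flat = np.concatenate([np.abs(w).flatten() for w in weights])
    k = max(1, int(density * flat.size))
    thresh = np.sort(flat)[-k]  # k-th largest magnitude overall
    return [(np.abs(w) >= thresh).astype(float) for w in weights]

# A layer of uniformly small weights next to a layer of larger ones:
# at 5% density, every weight of the first layer is removed.
layers = [np.full((10, 10), 0.01), np.linspace(1, 2, 100).reshape(10, 10)]
masks = global_magnitude_prune(layers, 0.05)
```

Here `masks[0]` is all zeros: the first layer has collapsed, even though removing every one of its weights disconnects the network. Iterative prune-retrain cycles avoid this by letting surviving weights grow before the next pruning step.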

This content is AI-processed based on ArXiv data.
