(PASS) Visual Prompt Locates Good Structure Sparsity through a Recurrent HyperNetwork

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Large-scale neural networks have demonstrated remarkable performance in different domains such as vision and language processing, although at the cost of massive computation resources. As illustrated by the compression literature, structural model pruning is a prominent approach to model efficiency, thanks to its acceleration-friendly sparsity patterns. One of the key questions in structural pruning is how to estimate channel significance. In parallel, work on data-centric AI has shown that prompting-based techniques enable impressive generalization of large language models across diverse downstream tasks. In this paper, we investigate an appealing possibility: leveraging visual prompts to capture channel importance and derive high-quality structural sparsity. To this end, we propose a novel algorithmic framework, PASS. It is a tailored hyper-network that takes both visual prompts and network weight statistics as input, and outputs layer-wise channel sparsity in a recurrent manner. Such a design accounts for the intrinsic channel dependency between layers. Comprehensive experiments across multiple network architectures and six datasets demonstrate the superiority of PASS in locating good structural sparsity. For example, at the same FLOPs level, PASS subnetworks achieve 1%–3% better accuracy on the Food101 dataset; and at a comparable performance of 80% accuracy, PASS subnetworks obtain 0.35× more speedup than the baselines.


💡 Research Summary

The paper introduces PASS (Visual Prompt Locates Good Structure Sparsity through a Recurrent HyperNetwork), a novel data‑centric structural pruning framework for convolutional neural networks. Traditional channel pruning methods rely solely on model‑centric statistics (weight norms, batch‑norm scaling factors, loss‑based metrics) and typically treat each layer independently, ignoring the sequential dependencies between layers. PASS addresses these limitations by incorporating visual prompts—learnable perturbations added to input images—as an additional source of information that reflects the data distribution and can highlight important channels.
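The "learnable perturbation added to input images" can be sketched as follows. This is a hypothetical illustration assuming a border-shaped prompt (a common visual-prompt design); the paper's exact parameterization may differ.

```python
import numpy as np

# Hypothetical sketch: a visual prompt V as a learnable additive perturbation,
# restricted here to a border region around a 32x32 RGB image (an assumption;
# the paper's exact prompt design may differ).
C, H, W, pad = 3, 32, 32, 4
prompt = np.zeros((C, H, W), dtype=np.float32)      # learnable parameters of V
active = np.ones((C, H, W), dtype=bool)
active[:, pad:H - pad, pad:W - pad] = False         # only border pixels are active

def apply_prompt(x: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Form the prompted input x + V, with V masked to its active region."""
    return x + np.where(active, V, 0.0)

x = np.random.rand(C, H, W).astype(np.float32)
x_prompted = apply_prompt(x, prompt)
```

During training, gradients of the task loss with respect to `prompt` would update only these border pixels, letting the prompt encode data-distribution information without touching the network weights.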

PASS consists of three main components. First, a three-layer convolutional encoder g_ω extracts a dense embedding from the visual prompt V; this embedding serves as the initial hidden state for the second component. Second, a Long Short-Term Memory (LSTM) hyper-network parameterized by θ processes, for each layer i, three inputs: (i) the mask from the previous layer M^(i−1), (ii) the current layer's weight statistics W̃^(i) = M^(i−1) ⊗ W^(i) (i.e., the weights after pruning the input channels), and (iii) the prompt embedding. The LSTM outputs a latent vector that is linearly projected to a set of channel-wise importance scores for layer i. Finally, a straight-through estimator converts these scores into a binary mask M^(i) by selecting the top-(1−s) fraction of channels, where s is the desired sparsity for that layer. Global pruning is applied across all layers to achieve non-uniform layer-wise sparsity ratios.
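The recurrent mask generation can be sketched end to end. This is a toy illustration, not the paper's implementation: the LSTM step is replaced by a stand-in score function, and the channel counts, sparsities, and recurrence update are invented for demonstration. Only the structure (previous mask + weight stats + prompt embedding → scores → top-k binary mask, fed forward layer by layer) mirrors the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def topk_mask(scores: np.ndarray, sparsity: float) -> np.ndarray:
    """Forward pass of the straight-through binarization:
    keep the top-(1 - sparsity) fraction of channels as a 0/1 mask."""
    c = scores.size
    k = max(1, int(round((1.0 - sparsity) * c)))
    mask = np.zeros(c, dtype=np.float32)
    mask[np.argsort(scores)[-k:]] = 1.0
    return mask

# Toy per-layer channel counts and target sparsities (illustrative numbers).
channels = [8, 16, 16]
sparsities = [0.25, 0.5, 0.5]

prompt_emb = rng.standard_normal(4)       # stands in for the encoder output g_omega(V)
hidden = prompt_emb.copy()                # prompt embedding seeds the hidden state
prev_mask = np.ones(3, dtype=np.float32)  # input (RGB) channels are all kept

masks = []
for c, s in zip(channels, sparsities):
    # Stand-in for one LSTM step: mix hidden state, previous mask,
    # and (mock) statistics of the input-pruned weights M^(i-1) ⊗ W^(i).
    w_stats = rng.standard_normal(c)
    scores = w_stats + hidden.mean() + prev_mask.mean()
    mask = topk_mask(scores, s)
    masks.append(mask)
    hidden = 0.5 * hidden + 0.5 * scores[:4]  # toy recurrence update
    prev_mask = mask                          # mask flows to the next layer

for (c, s), m in zip(zip(channels, sparsities), masks):
    print(f"kept {int(m.sum())} of {c} channels at sparsity {s}")
```

In the real framework the straight-through estimator also defines a backward pass (gradients flow through the binarization as if it were the identity), which is what makes the whole pipeline trainable by backpropagation.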

The overall training objective jointly optimizes the visual prompt V, the encoder parameters ω, and the LSTM parameters θ by minimizing the classification loss L(Φ_Ŵ(x+V), y), where Φ_Ŵ denotes the original CNN with pruned weights Ŵ^(i) = M^(i−1) ⊗ W^(i) ⊗ M^(i). After the pruning phase, the binary masks are frozen and a fine-tuning stage optimizes the remaining dense weights W together with the visual prompt V, further recovering any performance loss.
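The two stages described above can be written compactly (a reconstruction from the summary's notation, with the pruned weights written as \widehat{W}; indexing conventions are assumed):

```latex
\text{Pruning stage:}\quad
\min_{V,\,\omega,\,\theta}\;
  \mathcal{L}\!\left(\Phi_{\widehat{W}}(x + V),\, y\right),
\qquad
\widehat{W}^{(i)} = M^{(i-1)} \otimes W^{(i)} \otimes M^{(i)}

\text{Fine-tuning stage (masks } M^{(i)} \text{ frozen):}\quad
\min_{W,\,V}\;
  \mathcal{L}\!\left(\Phi_{\widehat{W}}(x + V),\, y\right)
```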

Experiments are conducted on four classic CNNs (ResNet‑18/34/50, VGG‑16) and three modern architectures (ResNeXt‑50, Vision Transformer‑B/16, Swin‑T), all pretrained on ImageNet‑1K. Six downstream tasks are evaluated: CIFAR‑10/100, Tiny‑ImageNet, Food101, DTD, and Stanford Cars. PASS is compared against five strong structural pruning baselines: Group‑L1, GrowReg, Slim, DepGraph, and ABC‑Pruner. Results show that, at equal FLOPs, PASS consistently yields 1–3% higher top‑1 accuracy (e.g., on Food101) and, for a fixed 80% accuracy target, achieves up to 0.35× greater speed‑up than the baselines. Moreover, the learned masks and hyper‑network generalize well: when transferred to unseen datasets or architectures, they retain most of their advantage, indicating that PASS captures transferable structural knowledge.

Ablation studies confirm the necessity of each component. Removing the visual prompt degrades performance, demonstrating that data‑centric cues are essential. Replacing the recurrent LSTM with a simple feed‑forward mapper also harms accuracy, highlighting the importance of modeling inter‑layer dependencies. The two‑step mask generation (embedding → linear projection → straight‑through binarization) proves effective for differentiable pruning.

In summary, PASS innovatively merges visual prompting with a recurrent hyper‑network to produce layer‑wise channel masks that respect both data‑driven signals and structural dependencies. This approach outperforms existing pruning techniques across a variety of models and tasks, and its masks exhibit strong transferability. The work opens a new direction for data‑centric model compression, suggesting that carefully designed input perturbations can guide more efficient and accurate network sparsification.

