Routing the Lottery: Adaptive Subnetworks for Heterogeneous Data

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

In neural network pruning, the Lottery Ticket Hypothesis posits that large networks contain sparse subnetworks, or winning tickets, that can be trained in isolation to match the performance of their dense counterparts. However, most existing approaches assume a single universal winning ticket shared across all inputs, ignoring the inherent heterogeneity of real-world data. In this work, we propose Routing the Lottery (RTL), an adaptive pruning framework that discovers multiple specialized subnetworks, called adaptive tickets, each tailored to a class, semantic cluster, or environmental condition. Across diverse datasets and tasks, RTL consistently outperforms single- and multi-model baselines in balanced accuracy and recall, while using up to 10 times fewer parameters than independent models and exhibiting semantically aligned specialization. Furthermore, we identify subnetwork collapse, a performance drop under aggressive pruning, and introduce a subnetwork similarity score that enables label-free diagnosis of oversparsification. Overall, our results recast pruning as a mechanism for aligning model structure with data heterogeneity, paving the way toward more modular and context-aware deep learning.


💡 Research Summary

The paper revisits the Lottery Ticket Hypothesis (LTH), which traditionally assumes a single, universal sparse subnetwork (a “winning ticket”) that can be trained in isolation to match the performance of its dense counterpart. The authors argue that this assumption ignores the inherent heterogeneity of real‑world data—different classes, semantic clusters, or environmental conditions often require distinct feature representations. To address this gap, they introduce “Routing the Lottery” (RTL), an adaptive pruning framework that discovers multiple specialized subnetworks, called adaptive tickets, each tailored to a specific data subset while sharing a common dense initialization.

RTL operates in two stages. First, Adaptive Ticket Extraction iteratively applies magnitude‑based pruning (similar to Iterative Magnitude Pruning) separately for each of K data subsets (e.g., classes, clusters, or acoustic conditions). Starting from the same random initialization θ₀, the network is briefly trained on a subset, the lowest‑magnitude weights are pruned by a factor p, and the resulting binary mask mₖ is stored. The weights are then reset to θ₀ before moving to the next subset. This loop repeats until every mask reaches a target sparsity s, yielding K distinct masks that are largely non‑overlapping.
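The extraction loop described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: weights are flattened to a 1-D array, and `train_step` is a stand-in for the brief subset-specific training; the function names and the simplified pruning schedule are assumptions.

```python
import numpy as np

def extract_adaptive_tickets(theta0, subsets, train_step,
                             prune_frac=0.2, target_sparsity=0.75):
    """Sketch of RTL stage 1: iterative magnitude pruning per data subset.

    theta0       -- shared dense initialization (1-D array for simplicity)
    subsets      -- the K data subsets (classes, clusters, conditions)
    train_step   -- callable(theta, subset, mask) -> briefly trained weights
    prune_frac   -- fraction p of surviving weights removed per round
    target_sparsity -- stop once this fraction of weights is pruned
    """
    masks = []
    for subset in subsets:
        mask = np.ones_like(theta0, dtype=bool)
        # Prune rounds continue until the mask reaches the target sparsity s.
        while mask.mean() > 1.0 - target_sparsity:
            # Reset to theta0 (rewind), then train briefly on this subset only.
            theta = train_step(theta0.copy(), subset, mask)
            active = np.where(mask)[0]
            k = max(1, int(len(active) * prune_frac))
            # Drop the k lowest-magnitude surviving weights.
            drop = active[np.argsort(np.abs(theta[active]))[:k]]
            mask[drop] = False
        masks.append(mask)                # binary mask m_k for this subset
    return masks
```

Because every subset restarts from the same `theta0`, the K resulting masks can specialize independently, which is what makes them largely non-overlapping.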

Second, Joint Retraining fixes the masks and jointly fine‑tunes the shared dense parameter tensor θ. Each subnetwork fₖ(x)=f(x; mₖ⊙θ) is trained exclusively on its own subset using mini‑batches interleaved across subsets. Gradient updates are masked (∇θ·mₖ) so that only the active weights of a given subnetwork are modified, preventing interference and catastrophic forgetting. By balancing batch counts across subsets, each subnetwork receives an equal number of updates, preserving specialization while still benefiting from shared weight statistics.
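A compact sketch of the masked, interleaved update rule follows. Again this is an illustrative NumPy toy, not the paper's implementation; `grad_fn` is a hypothetical stand-in for computing a gradient on one mini-batch from subset k.

```python
import numpy as np

def joint_retrain(theta, masks, batches_per_subset, grad_fn, lr=0.1):
    """Sketch of RTL stage 2: jointly fine-tune shared theta with fixed masks.

    Each subnetwork f_k(x) = f(x; m_k * theta) sees only its own subset.
    Multiplying the gradient by m_k ensures only that ticket's active
    weights change, preventing interference between subnetworks.
    """
    K = len(masks)
    for step in range(batches_per_subset):
        for k in range(K):                    # interleave batches across subsets
            theta_k = theta * masks[k]        # active weights of ticket k
            grad = grad_fn(theta_k, k, step)  # gradient on subset k's batch
            theta -= lr * grad * masks[k]     # masked update (grad * m_k)
    return theta
```

The round-robin inner loop gives every subnetwork the same number of updates, matching the balanced-batch scheme described above.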

The authors evaluate RTL on several domains. On CIFAR‑10, they assign one subnetwork per class (K=10). Compared to a single‑mask IMP baseline and a multi‑model IMP baseline (independent models per class), RTL achieves higher balanced accuracy, precision, and recall at 25 %, 50 %, and 75 % sparsity levels, while using up to ten times fewer parameters than the independent‑model baseline. On CIFAR‑100, they cluster the 100 classes into eight semantic groups (derived via unsupervised text‑based clustering) and show that RTL still outperforms the baselines despite imperfect alignment between clusters and visual features. In an Implicit Neural Representation (INR) task, RTL learns region‑specific subnetworks for semantic segments within a single image, improving PSNR by an average of 1.2 dB over a baseline that uses class embeddings to condition a single network. Finally, in a speech‑enhancement experiment with heterogeneous acoustic environments, RTL’s environment‑specific tickets achieve higher SI‑SNR improvement than both universal and independently‑pruned models.

A notable contribution is the identification of “subnetwork collapse,” a sharp performance drop that occurs when pruning is too aggressive. To diagnose this without ground‑truth labels, the authors propose a Subnetwork Similarity Score based on cosine similarity between masks; high similarity indicates excessive overlap and potential oversparsification. This metric enables label‑free early warning and adaptive adjustment of sparsity targets.
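One way to realize such a mask-overlap score, assuming the straightforward reading of "cosine similarity between masks" (the exact aggregation the authors use is not specified here), is a mean pairwise cosine similarity over flattened binary masks:

```python
import numpy as np

def subnetwork_similarity(masks):
    """Mean pairwise cosine similarity between binary masks (sketch).

    A value near 1 means the tickets overlap heavily -- a label-free
    warning sign of oversparsification and impending subnetwork collapse.
    """
    M = np.stack([m.astype(float).ravel() for m in masks])
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    sim = (M @ M.T) / (norms * norms.T)       # K x K cosine-similarity matrix
    iu = np.triu_indices(len(masks), k=1)     # upper triangle: distinct pairs
    return sim[iu].mean()
```

Since the score depends only on the masks, it can be monitored during extraction and used to back off the sparsity target before collapse occurs.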

Overall, RTL reframes pruning from a static compression technique into a dynamic, data‑aware mechanism that aligns model structure with data heterogeneity. It offers a lightweight alternative to Mixture‑of‑Experts (MoE) architectures—no auxiliary routing network, no extra parameters—while delivering modular, interpretable experts. The paper suggests future directions such as end‑to‑end joint learning of clusters and masks, scaling to large language or vision transformers, and exploring knowledge transfer among tickets. In sum, the work demonstrates that adaptive, multi‑ticket pruning can dramatically improve parameter efficiency and task performance across vision, representation, and audio domains, opening a new research avenue for modular deep learning.

