Pruning at Initialisation through the lens of Graphon Limit: Convergence, Expressivity, and Generalisation

Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the original arXiv source.

Pruning at Initialisation methods discover sparse, trainable subnetworks before training, but their theoretical mechanisms remain elusive. Existing analyses are often limited to finite-width statistics, lacking a rigorous characterisation of the global sparsity patterns that emerge as networks grow large. In this work, we connect discrete pruning heuristics to graph limit theory via graphons, establishing the graphon limit of PaI masks. We introduce a Factorised Saliency Model that encompasses popular pruning criteria and prove that, under regularity conditions, the discrete masks generated by these algorithms converge to deterministic bipartite graphons. This limit framework establishes a novel topological taxonomy for sparse networks: while unstructured methods (e.g., Random, Magnitude) converge to homogeneous graphons representing uniform connectivity, data-driven methods (e.g., SNIP, GraSP) converge to heterogeneous graphons that encode implicit feature selection. Leveraging this continuous characterisation, we derive two fundamental theoretical results: (i) a Universal Approximation Theorem for sparse networks that depends only on the intrinsic dimension of active coordinate subspaces; and (ii) a Graphon-NTK generalisation bound demonstrating how the limit graphon modulates the kernel geometry to align with informative features. Our results transform the study of sparse neural networks from combinatorial graph problems into a rigorous framework of continuous operators, offering a new mechanism for analysing expressivity and generalisation in sparse neural networks.


💡 Research Summary

This paper establishes a rigorous theoretical foundation for Pruning at Initialization (PaI) by connecting discrete pruning masks to the continuous framework of graphons, the limit objects of convergent graph sequences. The authors focus on a single‑hidden‑layer fully‑connected network and view the binary mask applied to the first‑layer weights as the adjacency matrix of a bipartite graph between input coordinates and hidden units. By embedding this matrix into a step‑kernel on the unit square and measuring convergence with the bipartite cut distance, they ask whether the mask stabilises into a deterministic object as the input dimension d and hidden width n tend to infinity.
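The step-kernel embedding described above can be sketched in a few lines: a binary d×n mask becomes a piecewise-constant function on the unit square, constant on each cell of a d×n grid. This is the standard graphon embedding of a matrix; the paper's exact normalisation conventions may differ.

```python
import numpy as np

def step_kernel(mask):
    """Embed a binary d-by-n mask into a step function W_M on [0, 1]^2.

    W_M(u, v) equals mask[i, j] on the grid cell
    [i/d, (i+1)/d) x [j/n, (j+1)/n) -- the usual graphon embedding
    of a matrix (normalisation conventions may differ from the paper's).
    """
    d, n = mask.shape
    def W(u, v):
        i = min(int(u * d), d - 1)
        j = min(int(v * n), n - 1)
        return float(mask[i, j])
    return W

# toy example: a 4x6 mask whose top two rows are fully connected
mask = np.zeros((4, 6), dtype=int)
mask[:2, :] = 1
W = step_kernel(mask)
print(W(0.1, 0.5))  # inside the dense block -> 1.0
print(W(0.9, 0.5))  # inside the empty block -> 0.0
```

Convergence is then measured by how close such step kernels get, in cut distance, to a fixed limit function as d and n grow.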

A central contribution is the “Factorised Saliency Model”. In this model each edge’s saliency score Sᵢⱼ factorises into three non‑negative components: an input feature φᵢ (forward signal strength), a neuron feature ψⱼ (backward/gradient signal strength), and a random edge noise ξᵢⱼ arising from weight initialisation. The mask is then obtained by thresholding these scores. The authors demonstrate that many popular PaI criteria—Random, Magnitude, SNIP, GraSP, SynFlow—either exactly fit or asymptotically approximate this factorisation.
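A minimal simulation of this model, with illustrative (not the paper's) choices for the three factors, shows how thresholding a product-form saliency produces a mask at a prescribed sparsity:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 100, 200      # input dimension and hidden width (toy sizes)
sparsity = 0.9       # fraction of weights to prune

# Hypothetical factors: phi_i (input strength), psi_j (neuron strength),
# xi_ij (i.i.d. initialisation noise). These distributions are illustrative
# assumptions, not the paper's exact specification.
phi = np.abs(rng.normal(1.0, 0.5, size=d))
psi = np.abs(rng.normal(1.0, 0.5, size=n))
xi = np.abs(rng.normal(0.0, 1.0, size=(d, n)))

# factorised saliency: S_ij = phi_i * psi_j * xi_ij
S = phi[:, None] * psi[None, :] * xi

# keep the top (1 - sparsity) fraction of scores
threshold = np.quantile(S, sparsity)
mask = (S >= threshold).astype(int)
print(mask.mean())  # surviving edge density, close to 0.1
```

Different pruning criteria correspond to different choices of φ, ψ, and ξ in this template.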

Under mild growth conditions (log(d+n)/min{d,n} → 0) and regularity assumptions (φₙ,i converges to a bounded continuous function φ, the empirical distribution of ψₙ,j converges to a deterministic CDF, and ξₙ,ij are i.i.d. with bounded moments), they prove Theorem 4.7: the step‑kernel associated with the mask converges in probability, in cut‑norm, to a deterministic bipartite graphon W. This validates the “Graphon Limit Hypothesis” previously posited for sparse networks.
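The convergence statement can be probed numerically. Computing the cut norm exactly is NP-hard, but the spectral inequality ‖A‖_cut ≤ ‖A‖₂ / √(dn) (for the normalised cut norm) gives a cheap upper-bound proxy. For Random pruning at density ρ, whose limit is the constant graphon W ≡ ρ, this proxy shrinks as the mask grows:

```python
import numpy as np

rng = np.random.default_rng(1)
rho = 0.2  # target density for Random pruning

# The normalised cut norm is NP-hard to compute exactly; the spectral bound
# ||A||_cut <= ||A||_2 / sqrt(d * n) gives a cheap upper-bound proxy.
proxies = []
for size in (50, 200, 800):
    d = n = size
    mask = (rng.random((d, n)) < rho).astype(float)
    deviation = mask - rho  # deviation from the constant limit graphon W = rho
    proxies.append(np.linalg.norm(deviation, 2) / np.sqrt(d * n))
print([round(p, 3) for p in proxies])  # decreasing in the mask size
```

This is only a sanity check on one special case (i.i.d. Bernoulli masks), not a proof; Theorem 4.7 covers the general factorised model.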

The limit graphon provides a taxonomy of pruning methods. Unstructured approaches (Random, Magnitude) have φ≡1 and ψ≡1, yielding a constant graphon W(u,v)=ρ, i.e., a homogeneous connectivity pattern. Data‑dependent methods (SNIP, GraSP) produce heterogeneous graphons of the form W(u,v)=φ(u)·ψ(v), where φ reflects input magnitude (e.g., |xᵢ|) and ψ captures neuron‑level gradient magnitude (e.g., |aⱼ|·|σ′(hⱼ)|). Thus, these methods implicitly perform feature selection by concentrating connections on “active” coordinates while sparsifying noisy ones.
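The implicit feature selection in the heterogeneous case is easy to see in simulation: with a product saliency whose φ is large on “active” coordinates and small elsewhere (an illustrative choice, not taken from the paper), the surviving connections concentrate almost entirely on the active block.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 500, 500

# heterogeneous case: only the first half of the input coordinates are "active"
phi = np.where(np.arange(d) < d // 2, 2.0, 0.5)  # illustrative values
psi = np.ones(n)
S = phi[:, None] * psi[None, :] * np.abs(rng.normal(size=(d, n)))
mask = (S >= np.quantile(S, 0.8)).astype(int)    # keep the top 20% of edges

# active coordinates end up with far more surviving connections
deg_active = mask[: d // 2].mean()
deg_inactive = mask[d // 2:].mean()
print(deg_active, deg_inactive)
```

The row densities approximate the sections W(u, ·) of the limit graphon: roughly constant within the active and inactive blocks, with a sharp jump between them.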

Beyond topology, the authors analyse functional expressivity. They prove a Universal Approximation Theorem for sparse networks: if the target function lives within the subspace spanned by the active coordinates identified by φ, then a network sparsified according to the limiting graphon can approximate any continuous function on that subspace arbitrarily well, provided the hidden width is sufficiently large. This extends classical universal approximation results for dense networks to the sparse, graphon‑induced regime.
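Stated schematically in our own notation (a hedged paraphrase, not the theorem's exact statement): if A ⊆ {1, …, d} is the set of active coordinates selected by φ, then

```latex
\forall f \in C(K),\; K \subset \mathbb{R}^{|A|} \text{ compact},\;
\forall \varepsilon > 0:\quad
\exists\, n,\ (a_j, b_j, w_{ij}) \text{ supported on the mask } (m_{ij})
\text{ such that}
\]
\[
\sup_{x \in K}\,
\Big|\, f(x_A) \;-\; \sum_{j=1}^{n} a_j\,
\sigma\Big( \textstyle\sum_{i \in A} m_{ij}\, w_{ij}\, x_i + b_j \Big) \Big|
\;<\; \varepsilon .
```

That is, the sparse network retains universal approximation on the active subspace, with the required width depending on |A| rather than on the ambient dimension d.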

For generalisation, the paper extends the Neural Tangent Kernel (NTK) framework to a “Graphon‑NTK”. The graphon modulates the path‑density term in the NTK, effectively re‑weighting kernel contributions according to the limiting connectivity pattern. By plugging the Graphon‑NTK into existing NTK‑based generalisation bounds (e.g., Arora et al., 2019; Cao & Gu, 2019), they derive a bound that explicitly depends on the cut‑norm distance between the empirical mask and its limit graphon. Homogeneous graphons lead to uniform path density and thus average‑case generalisation, whereas heterogeneous graphons concentrate paths on informative features, improving kernel alignment and tightening the bound.
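The re-weighting effect can be illustrated with the first-layer contribution to the empirical NTK of a masked one-hidden-layer ReLU network. This is a simplified sketch of how a mask modulates kernel contributions, not the paper's full Graphon-NTK construction:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 20, 2000
mask = (rng.random((d, n)) < 0.3).astype(float)  # hypothetical PaI mask
W = rng.normal(size=(d, n)) / np.sqrt(d)         # NTK-scaled first-layer weights

def masked_ntk(x, xp):
    """First-layer empirical NTK of a masked one-hidden-layer ReLU network.

    K(x, x') = (1/n) * sum_j <m_j * x, m_j * x'> * 1[h_j(x) > 0] * 1[h_j(x') > 0],
    with pre-activations h_j(x) = <m_j * w_j, x>.  The mask re-weights each
    unit's inner-product contribution -- the "path density" modulation
    described above, in simplified form.
    """
    Wm = mask * W
    hx, hxp = x @ Wm, xp @ Wm  # pre-activations, shape (n,)
    dots = (x * xp) @ mask     # <m_j * x, m_j * x'>; mask is binary so m^2 = m
    return float(np.mean(dots * (hx > 0) * (hxp > 0)))

x = rng.normal(size=d)
print(masked_ntk(x, x))  # non-negative on the diagonal by construction
```

Replacing the mask columns m_j with samples from a limit graphon W(u, v) yields the graphon-weighted kernel in the infinite-width limit; a heterogeneous W concentrates the inner products on informative coordinates, which is the alignment effect behind the tightened bound.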

In summary, the paper transforms the study of sparse neural networks from combinatorial graph problems into a continuous‑operator perspective. It provides (i) a provable convergence of PaI masks to deterministic bipartite graphons, (ii) a taxonomy distinguishing unstructured and structured pruning via the shape of the limit graphon, (iii) a universal approximation guarantee for sparse networks confined to the active subspace, and (iv) a graphon‑aware generalisation bound via the Graphon‑NTK. The work opens avenues for extending graphon analysis to deeper architectures, alternative activations, and empirical validation on large‑scale models.

