Alternating Gradient Flow Utility: A Unified Metric for Structural Pruning and Dynamic Routing in Deep Networks


Authors: Tianhao Qian, Zhuoxuan Li, Jinde Cao, Xinli Shi, and Leszek Rutkowski

Abstract—Efficient deep learning traditionally relies on static heuristics like weight magnitude or activation awareness (e.g., Wanda, RIA). While successful in unstructured settings, we observe a critical limitation when applying these metrics to the structural pruning of deep vision networks: these contemporary metrics suffer from a magnitude bias, failing to preserve critical functional pathways. To overcome this, we propose a decoupled kinetic paradigm inspired by Alternating Gradient Flow (AGF), utilizing an absolute feature-space Taylor expansion to accurately capture the network's structural "kinetic utility". First, we uncover a topological phase transition at extreme sparsity, where AGF preserves baseline functionality and exhibits topological implicit regularization, avoiding the collapse seen in models trained from scratch. Second, turning to architectures without strict structural priors, we reveal a Sparsity Bottleneck in Vision Transformers (ViTs). Through a gradient-magnitude decoupling analysis, we discover that dynamic signals suffer from signal compression in converged models, rendering them suboptimal for real-time routing. Finally, driven by these empirical constraints, we design a hybrid routing framework that decouples AGF-guided offline structural search from online execution via zero-cost physical priors. We validate our paradigm on large-scale benchmarks: under a 75% compression stress test on ImageNet-1K, AGF avoids the structural collapse in which traditional metrics fall below random sampling. Furthermore, when systematically deployed for dynamic inference on ImageNet-100, our hybrid approach achieves Pareto-optimal efficiency.
It reduces the usage of the heavy expert by approximately 50% (achieving an estimated overall cost of 0.92×) without sacrificing the full-model accuracy.

Index Terms—Neural Efficiency, Alternating Gradient Flows, Channel Pruning, Dynamic Routing, Gradient-Magnitude Decoupling, Feature Learning Theory.

1 Introduction

Deep Neural Networks (DNNs) have evolved from static computational graphs to dynamic and flexible systems. To mitigate the computational costs of modern vision backbones, two main paradigms have emerged: Channel Pruning, which permanently removes redundant structures [1], [2], and Dynamic Routing, which conditionally skips computations based on input complexity [3], [4].

Beyond their success, a central challenge is the accurate estimation of channel importance. Standard approaches rely on magnitude-based heuristics (e.g., ℓ1-norm) [5], on the premise that "smaller weights are less important." While effective in saturated settings, these static metrics often fail to capture the complex learning dynamics of modern networks.

• T. Qian and J. Cao are with the School of Mathematics, Southeast University, Nanjing 210096, China (e-mail: qth2mir@seu.edu.cn; jdcao@seu.edu.cn; liuhanjie1993@gmail.com).
• Z. Li is with the School of Mathematics, Southeast University, Nanjing 210096, China, and also with the Systems Research Institute of the Polish Academy of Sciences, Warsaw 01-447, Poland (e-mail: 230229338@seu.edu.cn).
• X. Shi is with the School of Cyber Science & Engineering, Southeast University, Nanjing 210096, China (e-mail: xinli_shi@seu.edu.cn).
• L. Rutkowski is with the Systems Research Institute of the Polish Academy of Sciences, Warsaw 01-447, Poland; the Institute of Computer Science, AGH University of Krakow, Kraków 30-059, Poland; and also with SAN University, Łódź 90-113, Poland (e-mail: leszek.rutkowski@ibspan.waw.pl).
• J. Cao is the corresponding author.
Conversely, gradient-based methods (e.g., SNIP, GraSP) [6], [7] utilize instantaneous sensitivity to estimate importance, often resulting in instability during optimization. Recently, activation-aware pruning metrics (such as Wanda [8] and RIA [9]) have established themselves as highly effective criteria, largely owing their success to exploiting activation outliers in Large Language Models (LLMs).

However, a critical vulnerability is revealed when these methods are applied to the structural pruning of deep vision backbones (e.g., ResNets). Our empirical results demonstrate a critical limitation of static metrics: at 25% width retention, networks pruned by these metrics underperform even uniform random pruning, hitting an entropic lower bound. We identify this failure as a magnitude bias: despite exhibiting low magnitudes, some neurons act as critical feature integrators in highly compressed routing pathways, and static metrics fail to preserve these vital topological pathways. While our primary rigorous analysis focuses on CNNs, we further demonstrate in an extended experiment (see Section 4.4) that this bias is equally destructive in architectures lacking strict structural priors, such as Vision Transformers (ViTs). Recent advancements, such as Wanda++ [10], recognize this limitation by incorporating block-level regional gradients. Yet, these methods remain inherently confined to localized output matching and lack global topological awareness. In contrast, our framework captures the end-to-end kinetic utility along the global optimization trajectory.

To circumvent magnitude bias, our work draws on the Alternating Gradient Flows (AGF) framework [11]. We reconstruct this "gradient-driven utility" as a dynamic spatial criterion for structural efficiency, defining the utility U(c) as the cumulative gradient norm, which measures true learning potential rather than static amplitude.
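As a concrete illustration, a cumulative gradient-norm utility of this kind can be accumulated in a few lines of NumPy. This is a simplified sketch under our own toy data layout (lists of per-batch activation/gradient arrays of shape (batch, channels, features)); the helper names `agf_utility` and `select_topk` are illustrative, not the paper's released code:

```python
import numpy as np

def agf_utility(acts, grads):
    """Accumulate a per-channel cumulative gradient-norm utility U_c:
    the mean over calibration batches of |Y_c * grad_{Y_c} L|, summed over
    feature positions and averaged over samples in the batch."""
    U = np.zeros(acts[0].shape[1])
    for Y, G in zip(acts, grads):
        # Absolute value prevents oscillating gradients from cancelling out.
        U += np.abs(Y * G).sum(axis=-1).mean(axis=0)
    return U / len(acts)

def select_topk(U, k):
    """Keep the k channels with the highest accumulated kinetic utility."""
    return np.argsort(U)[::-1][:k]
```

Note that a channel with small activation magnitude but persistently large gradient flow accumulates a high U_c, which is exactly the low-magnitude "feature integrator" case that static magnitude metrics discard.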
In this way, AGF naturally acts as a topological proxy: it preserves low-magnitude neurons if their gradient flow indicates high sensitivity and contribution to the global loss reduction.

Translating this insight into practice reveals an asymmetry between topology construction and inference routing:

1) The Construction Phase (Pruning): AGF Utility is highly effective at constructing robust network topologies. Capturing critical routing pathways allows the network to smoothly traverse the topological phase transition. Consequently, the model survives extreme structural pruning (e.g., 25% width in ViTs, or 3% width in ResNets), conditions where magnitude-based selection simply fails to converge.

2) The Execution Phase (Routing): AGF Utility is suboptimal as a real-time inference proxy. As the model converges, gradient dynamics diminish and saturate, offering little structural guidance. Instead, robust physical priors (such as Confidence or ℓ1) prevail for efficient execution.

By resolving this dilemma, we propose a decoupled kinetic paradigm that uses AGF to build the road (topology) and Confidence/ℓ1 to navigate it (routing). In summary, our contributions follow a progressive trajectory from topological discovery to large-scale system deployment:

1) AGF-Guided Pruning and Topological Phase Transitions: Through extensive analysis on CNN backbones, we identify a topological phase transition at extreme sparsity boundaries. Under these constraints, networks trained from scratch inevitably collapse. In contrast, AGF safely bypasses magnitude bias by anchoring to orthogonal dynamic routing hubs, preserving structural functionality. Furthermore, we uncover a Topological Implicit Regularization effect, demonstrating that stochastic gradient noise from sparse calibration significantly enhances the network's long-term recovery elasticity.
2) Signal Saturation and the ViT Sparsity Bottleneck: Advancing beyond standard CNNs, we rigorously analyze metric fidelity across optimization trajectories. We reveal a severe signal compression phenomenon in converged models, where the dynamic AGF signal dramatically underestimates the true physical cost ratio (compressing from ~149.4× to ~21.4×). Consequently, when extending structural pruning to architectures lacking strict inductive biases, such as Vision Transformers (ViTs), all deterministic metrics (both magnitude- and gradient-based) hit a strict Sparsity Bottleneck. These dual limitations necessitate decoupling dynamic gradient proxies from static physical costs.

3) The Decoupled Kinetic Paradigm and Large-Scale Verification: To resolve the aforementioned dilemma, we propose a hybrid routing framework that decouples offline topology construction (via AGF) from online dynamic execution (via zero-cost priors like Confidence). We subject this paradigm to an extreme 75% structural compression stress test on ImageNet-1K, where traditional metrics aggressively destroy core routing pathways and fall below the uniform random baseline; AGF, however, successfully navigates this intrinsic capacity limit. Furthermore, when deployed as a dynamic inference system on ImageNet-100, our input-adaptive approach breaks the sparsity bottleneck: it matches the full-capacity baseline accuracy while reducing the usage of the heavy expert by ≈50% (an overall estimated computational cost of 0.92×).

2 Related Work

This section locates our work at the intersection of feature learning dynamics, structural pruning (from CNNs to ViTs and LLMs), and dynamic routing.

2.1 Theoretical Dynamics and Phase Transitions

Recent years have witnessed the empirical success of deep learning, but its optimization landscapes and information bottlenecks remain challenging to characterize.
Traditional statistical learning bounds struggle to explain the generalization of over-parameterized networks. Physics-inspired approaches have provided alternative perspectives. For example, studies on the Information Bottleneck theory [12] and optimization dynamics [13], [14] have revealed that neural networks undergo distinct topological phase transitions during training (e.g., grokking [15], [16]). Concurrently, Kunin et al. (2025) [11] proposed the Alternating Gradient Flow (AGF) framework, modeling feature learning as a saddle-to-saddle transition driven by a specific kinetic utility function. Our work translates this macroscopic continuous theory into practical network design: the continuous AGF utility is repurposed into a discrete, O(1)-cost metric to overcome the severe entropic lower bounds typical of extreme structural compression.

2.2 Neural Network Pruning and Magnitude Bias

Pruning techniques trace back to Optimal Brain Damage [17] and Optimal Brain Surgeon [18], which harnessed second-order derivative information (e.g., WoodFisher [19]). Modern approaches generally fall into two distinct trajectories:

1) Magnitude- and Activation-Aware Pruning: Classical heuristics like ℓ1 [5], [20] assume small weights are redundant. Recently, activation-aware metrics have dominated the compression of Large Language Models (LLMs), with prominent methods like Wanda [8], SparseGPT [21], LLM-Pruner [22], and SliceGPT [23] achieving state-of-the-art results by exploiting activation outliers [24]. However, we identify a limitation: when applied to extreme structural pruning of vision networks (CNNs and ViTs), these static proxies suffer from a severe magnitude bias, often disproportionately eliminating critical low-magnitude structural pathways and degrading performance below the random pruning baseline.
2) Gradient-Based Sensitivity: Methods like SNIP [6], GraSP [7], and SynFlow [25] leverage gradient signals to estimate structural importance [26]. Taylor Pruning [27] remains a cornerstone in this category. Yet traditional first-order Taylor expansions suffer from signal cancellation and high variance. Furthermore, recent universal structural pruning frameworks (e.g., DepGraph [28]) and Vision Transformer (ViT) compression methods (e.g., FlexiViT [29], ToMe [30]) highlight the difficulty of preserving global spatial priors. Our AGF-guided approach tackles this by integrating absolute gradient norms over a calibration trajectory, which circumvents signal cancellation and provides a mathematically tractable proxy for the network's kinetic utility, ensuring robust topology construction even in highly sensitive ViT architectures.

Global Sparsity Distribution: Recognizing the limitations of layer-wise pruning, recent works like Týr-the-Pruner [31] employ extensive evolutionary searches over multi-sparsity supernets to find optimal global topologies. Our AGF utility provides an elegant alternative: because AGF integrates the backpropagated loss gradient over an optimization trajectory, it inherently captures global, cross-layer dependencies through the chain rule. This endows our method with the topological awareness of a global search framework without the prohibitive computational overhead of supernet generation [32], [33].

2.3 Dynamic Routing and Conditional Computation

Dynamic Neural Networks (DyNNs) [34] adapt their computational graphs conditioned on intermediate features to optimize the efficiency-accuracy trade-off. Routing strategies have seen massive scaling through Mixture-of-Experts (MoE) [35], [36], and more recently, continuous routing relaxations like Soft MoE [37]. In the vision domain, dynamic token sparsification (e.g., DynamicViT [38]) and dynamic convolution [39] have become prevalent.
Early Exiting (cascading) frameworks like BranchyNet [40] and MSDNet [4] use intermediate classifiers to route easy samples to early exits. A recurring challenge across all routing strategies is the calibration and saturation of the routing signals [41]. In this work, we formalize a decoupled kinetic paradigm: while Alternating Gradient Flow (AGF) excels at offline architecture search by identifying the critical skeleton of the network, its dynamic signals lose discriminative power during online execution in converged models. Instead, our experiments indicate that AGF-pruned manifolds exhibit superior intrinsic calibration [42], enabling computationally free static proxies to govern real-time inference. This decoupling resolves the dilemma of gradient signal compression, leveraging AGF to build the structural "road" and zero-cost priors to navigate it.

3 Methodology

In this section, we formally introduce the Alternating Gradient Flow (AGF) utility framework.

3.1 Framework Overview

Our proposed framework unifies structural pruning with dynamic inference. As illustrated in Figure 1, the system consists of a calibration phase using AGF scores, an iterative inheritance pruning process, and a runtime dynamic routing mechanism.

3.2 Discrete Proxy of AGF Dynamics: Feature-Space Total Variation

The continuous Alternating Gradient Flow (AGF) [11] reveals that structural learning occurs through oscillatory "saddle-to-saddle" dynamics. However, computing the continuous path integral of these gradients is computationally intractable for deep vision backbones. To bridge continuous AGF theory with tractable structural pruning, we identify that the core topological value of a routing node lies in its dynamic "kinetic utility" along the optimization trajectory, mathematically represented by its Total Variation (TV). Traditional first-order Taylor pruning metrics evaluate the expected net gradient (E[Y ⊙ ∇L]).
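Before formalizing the alternative, a toy numeric contrast shows why the expected net gradient is fragile under oscillating dynamics. The data here is synthetic and purely illustrative: a gradient whose sign flips from batch to batch cancels in the net score but survives in the absolute score.

```python
import numpy as np

# A channel whose gradient oscillates across calibration batches, as in
# saddle-to-saddle dynamics: the sign flips from batch to batch.
y = np.ones(1000)                    # activation values (fixed, for clarity)
g = np.tile([1.0, -1.0], 500)        # perfectly oscillating gradient signal

net_score = abs(np.mean(y * g))      # classic net-gradient Taylor score
abs_score = np.mean(np.abs(y * g))   # absolute (TV-style) score

# The net score cancels to zero, wrongly marking the channel as useless,
# while the absolute score retains the full magnitude of the gradient flow.
print(net_score, abs_score)          # → 0.0 1.0
```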
This formulation inherently suffers from signal cancellation due to the oscillating nature of AGF dynamics, frequently leading to the premature removal of critical routing hubs.

Mathematically, the Total Variation (TV) of the objective function L with respect to a structural unit (feature map Y_c) along a continuous optimization trajectory t ∈ [0, T] is defined by the path integral:

TV(Y_c) = \int_0^T \left| \frac{\partial \mathcal{L}}{\partial Y_c^{(t)}} \, \frac{d Y_c^{(t)}}{dt} \right| dt    (1)

By applying a first-order Taylor expansion within a local discrete neighborhood, the infinitesimal change dY_c can be empirically approximated by the activation output itself (Y_c), establishing a feature-space sensitivity proxy. Discretizing the trajectory over T mini-batches B_t, we derive the Absolute Feature-Space Taylor proxy:

U_c = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{x \sim \mathcal{B}_t} \left[ \left\| Y_c(x) \odot \nabla_{Y_c(x)} \mathcal{L} \right\| \right]    (2)

Evaluating the continuous path integral (Eq. 1) is computationally prohibitive for deep over-parameterized networks. The discrete approximation in Eq. 2 serves as an empirical invariant that efficiently accumulates the structural kinetic utility over the optimization trajectory. This provides a mathematically tractable, O(1)-cost metric that translates macroscopic learning dynamics into explicit microscopic routing decisions.

3.3 Decoupled Architecture: Confidence-Based Cascading Router

Confidence-based dynamic routing and early exiting have been extensively explored since BranchyNet [40]. However, current routing paradigms typically struggle with a structural trade-off. Joint search-and-routing methods often entangle topology discovery with inference-time gating, which induces gradient saturation and suboptimal convergence. Conversely, multidimensional co-optimization frameworks like MDP [43] rely on Mixed-Integer Nonlinear Programming (MINLP) to generate static architectures, ultimately sacrificing input-adaptive flexibility during inference.

Fig.
1: Overview of the AGF-Guided Efficiency Framework. The pipeline integrates (A) Gradient-based Utility Calibration, (B) Iterative Structural Pruning, and (C) Confidence-based Dynamic Routing. AGF identifies the topological skeleton, while the routing policy handles runtime complexity.

Rather than modifying the cascading routing mechanism itself, we address this trade-off through a Decoupled Evaluation Paradigm that separates architectural search from real-time execution. Specifically, we use dynamic Feature Sensitivity (Eq. 2) purely as an offline metric to construct and maintain the topological integrity of a lightweight pruned expert. For online dynamic routing, we bypass expensive gradient evaluations and instead rely on a computationally free static physical prior. In practice, our cascading system first processes each input through the pruned expert to generate a predictive probability distribution. The input is forwarded to the full-capacity expert only if the top-1 confidence falls below a predefined threshold τ. By confining the computationally heavy gradient evaluations to the offline calibration phase, this decoupling enables zero-cost, input-adaptive inference, achieving Pareto-optimal efficiency without the need for hardware-specific re-optimization.

4 Experiments

4.1 Experimental Setup

We evaluate on CIFAR-100 using WideResNet-18-2 (k = 1024) and ResNet-50. Pruning targets specific bottleneck layers to analyze redundancy. Please refer to Appendix B for detailed hardware specifications.

4.2 Part I: The Limits of Structural Inheritance

We first establish the validity of AGF as a feature selection metric by exploring the limits of model compression. Table 1 presents a comprehensive comparison across two distinct sparsity settings.
Algorithm 1: AGF-Guided Decoupled Routing Framework

Input: Teacher Model Θ, Calibration Data D_calib, Width Target k, Confidence Threshold τ
Output: Pruned Expert Θ_sub, Routing Policy π

 1: // Phase 1: Offline Topology Construction (Calibration)
 2: Initialize U_c = 0 for all channels c ∈ Θ
 3: for mini-batch x ∼ D_calib do
 4:     Forward pass to compute activations Y(x)
 5:     Backward pass to compute gradients ∇_Y L
 6:     Update U_c ← U_c + ‖Y_c(x) ⊙ ∇_{Y_c(x)} L‖
 7: end for
 8: Θ_sub ← Select top-k channels based on U_c
 9: Fine-tune Θ_sub to convergence
10: // Phase 2: Online Dynamic Routing (Inference)
11: for test sample x_test do
12:     P_sub ← Θ_sub(x_test)  // Lightweight inference
13:     if max(P_sub) < τ then
14:         return Θ(x_test)  // Route to Full Expert
15:     else
16:         return P_sub  // Early Exit
17:     end if
18: end for

4.2.1 Setting 1: Moderate Compression (k ∈ {256, 128})

In this moderately sparse setting (25% to 12.5% width), the sub-networks retain sufficient capacity. Consequently, directly training a structurally identical narrow model from scratch (Narrow) achieves an accuracy of ≈70.9%, performing competitively with sophisticated pruning algorithms. However, as sparsity scales from k = 256 to k = 128, we observe a significant divergence in metric consistency. While AGF maintains robust, SOTA-level performance (70.05%) at k = 128, activation-scaled magnitude methods like RIA suffer an obvious degradation, dropping to 68.51% and falling short of even the simple ℓ1-norm baseline. This suggests that heuristics derived purely from activation magnitude may overfit to specific channel redundancies and fail to scale consistently across width constraints.

4.2.2 Setting 2: Phase Transition at Extreme Sparsity (k = 32)

A dramatic shift occurs at extreme sparsity (k = 32, 3% width). Training a narrow model from scratch collapses entirely to 45.42%, failing to capture the underlying data distribution.
This collapse indicates that under extremely sparse settings, transferring topological knowledge (structural inheritance) from the dense teacher is an absolute prerequisite.

Under this stringent constraint, activation-scaled heuristics like RIA achieve the highest mean accuracy (68.97%). This indicates that when the network capacity is just sufficient to cross the phase transition, prioritizing static activation outliers remains an effective strategy. However, this peak performance comes at the cost of high structural instability (σ = 0.40).

In contrast, AGF operates from an orthogonal perspective. Rather than relying on static magnitude, AGF evaluates dynamic learning potential. While its accuracy (68.40%) at this specific capacity does not surpass highly optimized magnitude metrics, AGF achieves this competitive baseline by anchoring to a completely distinct, magnitude-free structural subset. To visualize this phenomenon, Figure 2 analyzes the underlying metric behavior of the WideResNet baseline at this extreme sparsity (k = 32). As shown in Figure 2(a), AGF exhibits superior batch-to-batch selection stability (72.97% Jaccard Index vs. Taylor's 68.42%), explaining its extremely low accuracy variance (σ = 0.12, compared to RIA's 0.40). More importantly, Figure 2(b) reveals a strict Metric Orthogonality (J ≈ 0): magnitude-based heuristics inherently suffer from a magnitude bias, heavily favoring static, high-capacity channels (blue crosses). By relying on dynamic utility (Feature-Space TV) rather than fragile magnitude outliers, AGF identifies "high-potential" routing hubs (red dots) that are conventionally overlooked, providing the essential structural resilience required for network survival at extreme physical limits.
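The batch-to-batch selection stability reported in Figure 2(a) is a Jaccard index over top-k channel sets. A minimal sketch of that measure (the helper name `topk_jaccard` is ours):

```python
import numpy as np

def topk_jaccard(scores_a, scores_b, k):
    """Jaccard overlap between the top-k channel sets selected by two
    score vectors (e.g., utilities estimated on two calibration batches)."""
    a = set(np.argsort(scores_a)[::-1][:k])
    b = set(np.argsort(scores_b)[::-1][:k])
    return len(a & b) / len(a | b)

# Two batches that agree on the top-2 channels give J = 1.0;
# fully disjoint selections give J = 0.0.
print(topk_jaccard([9, 8, 1, 2], [8, 9, 2, 1], k=2))  # → 1.0
```

A metric whose per-batch scores reorder channels near the selection boundary yields a low Jaccard index, which is exactly the instability contrasted against AGF in Figure 2(a).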
4.3 Part II: Verification of Stochastic Regularization

While topological inheritance explains the immediate survival of pruned networks, their long-term plasticity remains a critical question. Specifically, does the value inversion observed in static metrics persist during extended fine-tuning? And crucially, does AGF require large-scale data calibration to be effective?

4.3.1 Experimental Configuration

To probe this, we extend the fine-tuning phase to 20 epochs on ImageNet-100. We benchmark our AGF against the established Taylor baseline. A key variable is the calibration

Fig. 2: Analysis of Metric Stability and Orthogonality (WideResNet on CIFAR-100 at k = 32). (a) Selection Stability: AGF demonstrates superior batch-to-batch structural consistency compared to the Taylor baseline (72.97% vs. 68.42% Jaccard, +4.55%). (b) Metric Orthogonality (J ≈ 0): A normalized scatter plot (ℓ1 magnitude vs. AGF sensitivity) reveals a fundamental divergence between static and dynamic metrics. Traditional magnitude metrics (ℓ1) rigidly select high-capacity channels (blue crosses), whereas AGF identifies an entirely distinct, orthogonal subset of dynamic routing hubs with high kinetic potential (red dots).

TABLE 1: Detailed Comparison of Pruning vs. Training-from-Scratch. Consistency vs. Peak Performance: While RIA achieves a slight advantage at extreme sparsity (k = 32), it exhibits inconsistency at moderate sparsity (k = 128), falling behind simple baselines.
AGF (Ours) demonstrates the best trade-off: it dominates at k = 128 and maintains competitive accuracy with superior stability (σ = 0.12) at k = 32, avoiding the high variance of activation-based methods like Wanda (σ = 0.63) and RIA (σ = 0.40).

Strategy             | Width (k) | Mean Acc (%) | Std (σ) | Best Seed (%)
---------------------|-----------|--------------|---------|--------------
Baseline (Full)      | 1024      | 68.66        | N/A     | 68.66
-- Moderate Compression (k = 256, 25% Width) --
Random Pruning       | 256       | 70.21        | 0.45    | 70.66
ℓ1-Norm Pruning      | 256       | 70.32        | 0.22    | 70.46
AGF Pruning (Ours)   | 256       | 70.75        | 0.08    | 70.81
Narrow (Scratch)     | 256       | 70.92        | 0.28    | 71.17
-- Aggressive Compression (k = 128, 12.5% Width) --
ℓ1-Norm Pruning      | 128       | 69.78        | 0.23    | 69.95
Taylor Pruning       | 128       | 69.50        | 0.04    | 69.54
Wanda Pruning        | 128       | 69.49        | 0.13    | 69.64
RIA Pruning          | 128       | 68.51        | 0.32    | 68.70
AGF Pruning (Ours)   | 128       | 70.05        | 0.17    | 70.23
Narrow (Scratch)     | 128       | 70.96        | N/A     | 70.96
-- Extreme Compression (k = 32, 3% Width) --
Random Pruning       | 32        | 67.79        | 0.27    | 68.09
ℓ1-Norm Pruning      | 32        | 68.60        | 0.28    | 68.84
Taylor Pruning       | 32        | 68.05        | 0.37    | 68.45
Wanda Pruning        | 32        | 68.47        | 0.63    | 69.19
RIA Pruning          | 32        | 68.97        | 0.40    | 69.30
AGF Pruning (Ours)   | 32        | 68.40        | 0.12    | 68.53
Narrow (Scratch)     | 32        | 45.42        | N/A     | 45.42

data volume: we compare a "Dense" AGF variant (100 batches) against a "Sparse" AGF variant (only 10 batches).

4.3.2 Phenomenon I: Asynchronous Convergence Behaviors

As detailed in Table 2, we observe a clear temporal crossover.

• Taylor metric: Leveraging local loss curvature, Taylor pruning preserves weights with high instantaneous sensitivity. This yields a strong initial accuracy (84.26% at Ep 10) but suffers from rapid saturation, peaking at 84.57% (Ep 20).

• AGF: Initially, AGF-selected structures fall slightly behind (83.60% at Ep 10). However, they exhibit markedly higher Recovery Elasticity: the AGF-Sparse model shows an upward trend, gaining +1.30% accuracy to reach a peak of 84.90%, ultimately surpassing the Taylor baseline.
This result challenges the static view of magnitude: structures that are "good enough" at initialization (Taylor) are not necessarily those with the highest potential (AGF).

4.3.3 Phenomenon II: Topological Implicit Regularization

Table 2 reveals an intriguing property: the Sparse AGF variant (10 batches) consistently outperforms its Dense counterpart (100 batches), achieving a +0.33% accuracy gain. Rather than an experimental anomaly, this systematic performance trajectory points to a mechanism we formalize as Topological Implicit Regularization.

In post-training calibration, computing the exact global gradient expectation (E[∇L]) over a dense dataset risks overfitting the pruning metric to high-magnitude outlier samples specific to the calibration distribution. By restricting the AGF TV proxy to a sparse subset, we inherently introduce gradient variance. The estimated batch-level gradient can be formulated as ∇L_B = ∇L + ε, where ε ∼ N(0, Σ) represents the sampling noise. Analogous to how batch noise in Stochastic Gradient Descent (SGD) assists in escaping sharp minima, this stochasticity ε regularizes the utility estimation. It acts as a structural perturbation that prevents the pruning proxy from overfitting to data-specific noise; instead, it leverages the AGF metric to identify low-frequency but high-potential topological hubs that maintain high utility even under high-variance conditions. Consequently, sparse calibration acts less as a test of data memorization and more as a measure of the sub-network's true structural plasticity.

TABLE 2: 20-Epoch Recovery Analysis on ImageNet-100, comparing the peak accuracy of pruning strategies. Surprisingly, AGF with only 10 batches achieves the highest performance (84.90%), outperforming both Taylor and the high-sample AGF variant. This suggests that the stochastic noise in limited-batch gradient estimation acts as a beneficial regularizer for structure selection.
Strategy          | Calibration Data | Ep 10 (Post-Prune) | Peak Accuracy | Epoch of Peak | vs. Taylor
------------------|------------------|--------------------|---------------|---------------|-----------
Taylor (Baseline) | 100 Batches      | 84.26%             | 84.57%        | Ep 20         | -
AGF (Dense)       | 100 Batches      | 83.84%             | 84.56%        | Ep 17         | -0.01%
AGF (Sparse)      | 10 Batches       | 83.60%             | 84.90%        | Ep 20         | +0.33%
Random            | N/A              | 83.64%             | 83.61%        | -             | -0.96%
ℓ1-Norm           | N/A              | 83.15%             | 83.53%        | -             | -1.04%

4.4 Part III: Breaking the Bias in Vision Transformers

To demonstrate the universality of our framework and the boundaries of metric efficacy, we extend our structural pruning experiments to Vision Transformers (ViT-Base, Patch-16) on CIFAR-100. ViTs present a different topological challenge compared to CNNs or LLMs: they lack strict inductive biases and rely heavily on global attention routing within their intermediate MLP layers. We target the intermediate MLP expansion layers (original width 3072) across two critical regimes: Moderate Compression (k = 1536, 50% capacity) and Extreme Compression (k = 768, 25% capacity).
This bottleneck illustrates that no single deterministic proxy is theoretically sufficient at zer o-tolerance sparsity . This empirical limitation mathematically necessitates our proposed Hybrid Routing approach: using AGF for robust offline topology construc- tion, and Confidence-based priors for online dynamic rout- ing. W e diagnose this severe degradation as an Informa- tion Bottleneck inherent to single-dimensional determinis- tic proxies at zero-tolerance sparsity: 1) magnitude bias (W anda/RIA): Strict penalty on low- magnitude weights irreversibly destroys the structural integrity of V iT’s attention routing pathways. 2) Sensitivity V ariance (AGF): At extreme sparsity , the metric relies on sample calibration, rendering the iso- lated feature sensitivity proxy vulnerable to variance. 4.4.3 The Mathematical Necessity of the Decoupled Kinetic P aradigm The rigid performance ceiling reached by all static proxies under extreme V iT compression provides the empirical jus- tification for our Decoupled Kinetic Paradigm . It mathe- matically demonstrates that to safely compress ar chitectures lacking structural priors, one can not rely on static topology alone. Instead, the problem must be decoupled: utilizing a robust kinetic signal (AGF) to construct an initial, highly capable offline topological pool, and relying on zero-cost physical priors (Confidence) to dynamically r oute inputs during online inference. Importantly , while AGF necessitates backward passes during the calibration phase (incurring higher offline com- putational overhead than forward-only metrics), this cost is strictly confined to the offline topology construction. During online inference, our Hybrid Router entirely relies on zero-cost physical priors , ensuring no latency penalty in production. 
4.5 Part IV: Signal Saturation and Metric Fidelity

A critical requirement for dynamic routing is an energy proxy that can accurately quantify the computational cost of experts. Our experiments reveal that the fidelity of gradient-based metrics is setting-dependent (Table 4).

TABLE 3: Structural Pruning on ViT-Base (MLP Width = 3072). Moderate (k = 1536): AGF demonstrates superior recovery and peak performance over magnitude-based methods. Extreme (k = 768): all deterministic metrics encounter a severe Sparsity Bottleneck, highlighting the theoretical limitations of single-dimensional proxies and motivating our Hybrid Routing approach.

Metric Type                          Strategy                 Moderate (k = 1536, 50%)     Extreme (k = 768, 25%)
                                                              Ep 1 Acc    Peak Acc         Ep 1 Acc    Peak Acc
Unpruned                             Baseline (Full Capacity)         91.50%*
Static/Magnitude (Activation-Aware)  RIA Pruning              -           -                73.61%      82.24%
                                     Wanda Pruning            79.82%      84.96%           66.50%      81.36%
Dynamic/Gradient                     AGF Pruning (Ours)       85.00%      86.81%           73.88%      81.89%
*Estimated full fine-tuning performance on CIFAR-100.

4.5.1 Kinetic Regime (Signal Alignment)

In the ResNet-50 experiments (Kinetic Regime, 62% Acc), the gradient flow remains vigorous. The AGF Utility demonstrates high fidelity: the gradient-derived cost ratio (44.7×) is closely aligned with the physical parameter ratio measured by the ℓ1-norm (60.2×).

4.5.2 Saturated Regime (Signal Compression)

Conversely, the WideResNet-18 baseline is highly converged. As gradients diminish, the AGF signal becomes compressed: while the physical capacity gap remains massive (ℓ1 ratio 149.4×), the AGF ratio shrinks to 21.4×. This justifies our Hybrid Strategy: AGF for topology construction, but ℓ1 for routing penalties.

TABLE 4: Regime-Dependent Fidelity of Energy Proxies. ResNet-50 (Kinetic): AGF utility (44.7×) remains aligned with the physical capacity gap. WideResNet-18 (Saturated): AGF utility becomes compressed (21.4×) due to vanishing gradients, underestimating the massive physical disparity (149.4×).

Regime                   Metric       Full      Pruned    Ratio     Signal
ResNet-50 (Kinetic)      ℓ1-Norm      107,142   1,781     60.2×     Physical Truth
                         AGF Utility  0.0048    0.0001    44.7×     Aligned
WideResNet (Saturated)   ℓ1-Norm      232,682   1,557     149.4×    Physical Truth
                         AGF Utility  4.89e-4   2.29e-5   21.4×     Compressed

4.6 Part V: Efficiency and the Inference Routing Phase

We construct a confidence-based cascading system to test the quality of uncertainty estimates. Inputs are processed by the lightweight Pruned Expert (k = 32) unless the prediction confidence drops below a threshold τ, which triggers routing to the full-capacity model.

4.6.1 Routing Mechanism Analysis

To fully understand the internal mechanism of our cascading system, we visualize the routing distribution with respect to sample difficulty. As illustrated in Figure 3, we approximate the absolute difficulty of each input by its prediction entropy under the full-capacity model. The distribution demonstrates a clear structural disentanglement: the router assigns "easy" samples (low prediction entropy, tightly peaked near zero) to the computationally cheap Pruned Expert (green distribution), while the Full Expert is strictly reserved for "hard", ambiguous samples with high uncertainty (red distribution). This empirical evidence validates that our AGF-guided dynamic inference does not merely skip computations at random but fundamentally aligns computational cost with structural sample complexity.

Fig. 3: Difficulty Distribution of Routed Samples. We measure sample difficulty using the prediction entropy of the full-capacity baseline. The lightweight router successfully learns to disentangle the input space without human priors: "easy" samples (low entropy) are predominantly routed to the Pruned Expert (green), while "hard", ambiguous samples (high entropy) are forwarded to the Full Expert (red).
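The cascade described above can be sketched as follows. This is a minimal illustration; the expert functions, the threshold value, and the use of max-softmax as the confidence score are our assumptions, not the paper's released implementation:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(probs):
    """Prediction entropy, used here as the sample-difficulty proxy."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def cascade_predict(x, pruned_expert, full_expert, tau=0.9):
    """Run the cheap pruned expert first; escalate to the full-capacity
    model only when max-softmax confidence falls below tau."""
    probs = softmax(pruned_expert(x))
    if max(probs) >= tau:
        return probs, "pruned"
    return softmax(full_expert(x)), "full"

# Toy experts returning fixed logits (illustrative only).
easy = lambda x: [5.0, 0.0, 0.0]   # confident -> stays on the pruned expert
hard = lambda x: [1.0, 0.9, 0.8]   # ambiguous -> escalates to the full expert
full = lambda x: [0.0, 4.0, 0.0]

p1, route1 = cascade_predict(None, easy, full)
p2, route2 = cascade_predict(None, hard, full)
```

Note that the online decision uses only the pruned expert's forward pass, consistent with the zero-cost-prior design: no gradients are computed at inference time.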
This adaptive decoupling is the core mechanism enabling our system's high efficiency.

4.6.2 Routing Efficiency and Marginal Gains

Driven by this adaptive decoupling mechanism, our empirical results demonstrate a highly favorable accuracy-cost trade-off. Specifically, to achieve a strong baseline accuracy of 69.3%, a naive Random pruning topology requires a normalized computational cost of 8.26, whereas our AGF-guided topology requires only 7.28 (a 12% efficiency gain). Furthermore, in the tail of performance recovery, traditional metrics such as ℓ1 saturate with a negative slope at higher cost budgets (Cost > 50). In contrast, the AGF-anchored system maintains a positive marginal gain, pushing peak accuracy to 72.65%.

Notably, in the extreme high-efficiency zone, our hybrid router achieves 68.63% accuracy at a normalized cost of merely 7.5 (approximately 1/20th of the full computational budget). This confirms that the router matches the static pruned expert's baseline performance (a +0.13% boost) with negligible computational overhead, achieving maximum computation reduction without sacrificing accuracy.

While the current 12% efficiency gain is demonstrated under restricted layer-wise pruning, we hypothesize that extending the AGF-guided topology to all Transformer blocks (e.g., joint MLP and attention pruning) will substantially amplify this comparative advantage. As structural errors compound across layers, naive heuristics such as Random or ℓ1 are expected to suffer serious topological collapse, demanding exponentially higher computational budgets to recover. In contrast, the global topological awareness of AGF suggests a much wider efficiency gap in deep, multi-layer compression regimes, presenting a promising direction for future extreme-scale network optimization.
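As a quick arithmetic check on the efficiency figure quoted above, the gain is the relative cost reduction at matched accuracy:

```python
# Normalized costs at matched 69.3% accuracy, as reported in the text.
cost_random = 8.26   # naive Random pruning topology
cost_agf = 7.28      # AGF-guided topology

gain = (cost_random - cost_agf) / cost_random  # fraction of compute saved
# gain is approximately 0.119, i.e. the ~12% efficiency gain quoted above.
```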
4.7 Part VI: Scalability and Topological Stress Tests on ImageNet

To verify the scalability of our proposed AGF-guided routing mechanism beyond CIFAR, we conducted extensive experiments on ImageNet-100 and ImageNet-1K. We constructed a dynamic inference system consisting of a Full Expert (standard ResNet-50) and a Pruned Expert (ResNet-50 with layer3.1 pruned to k = 64).

4.7.1 Performance vs. Baselines

As summarized in Table 5, the static Pruned Expert suffers a significant accuracy drop (-8.94%) compared to the Full Expert. A naive Random Policy recovers some performance but remains suboptimal. Crucially, our Adaptive Router achieves 88.78% accuracy, effectively matching the Full Expert (88.74%) while routing approximately 50% of samples to the lightweight expert. This represents a +4.46% improvement over the Random baseline.

TABLE 5: Main Results on ImageNet-100. Our Adaptive Router (Ours) matches the upper bound (Full Model) while reducing the usage of the heavy expert by ≈ 50%.

Method           Acc (%)   vs. Random   Route Ratio (Full:Small)   Est. Cost
Static Full      88.74     +4.42        100 : 0                    1.00x
Static Pruned    79.80     -4.52        0 : 100                    0.85x
Random Policy    84.32     -            50 : 50                    0.92x
Ours (Adaptive)  88.78     +4.46        48.5 : 51.5                0.92x

4.7.2 Pareto Frontier Analysis

To demonstrate the flexibility of our approach, we varied the cost penalty λ to generate an accuracy-efficiency trade-off curve. As shown in Figure 4, our method (red curve) consistently dominates the Random baseline (blue cross), forming a convex hull that allows flexible deployment configurations. Notably, at λ = 0.1, the system achieves an optimal balance, preserving full accuracy with minimized FLOPs.

4.7.3 Topological Baselines at Extreme Sparsity (25% Capacity)

Under extreme structural compression (25% capacity) on ImageNet-1K, the network faces a severe information bottleneck.
As shown in Table 6, the uniform random sampling baseline yields an accuracy of 64.93%, which effectively establishes the intrinsic lower bound of the remaining network capacity.

Notably, standard deterministic heuristics (Wanda, ℓ1) and classical Taylor pruning perform slightly worse than this random baseline. This indicates that at near-zero redundancy, magnitude-driven proxies eliminate critical routing pathways, introducing destructive structural biases. In contrast, our AGF proxy (64.99%) performs on par with uniform sampling. Rather than a metric failure, this convergence suggests that at 75% sparsity the sub-network has reached a strict capacity limit: structural priors can no longer compensate for the sheer loss of parameters.

Fig. 4: Accuracy-Efficiency Trade-off on ImageNet-100. Our adaptive method (red) establishes a superior Pareto frontier compared to static and random baselines, enabling dynamic trade-offs between performance and cost.

TABLE 6: Extreme Stress Test on ImageNet-1K (ResNet-50). Comparison of pruning strategies under severe 75% structural compression (k = 64) on the bottleneck layers. Reference: the unpruned model achieves 80.35%. The significant absolute drop across all methods reflects an information bottleneck. Notably, magnitude-based heuristics (ℓ1, Wanda) and Taylor approximations degrade below the Random baseline. AGF remains competitive with uniform sampling, indicating that the network has reached an intrinsic capacity limit without suffering from negative proxy biases.

Method               Metric Proxy            Acc (%)   vs. Random   vs. Wanda
Reference Upper Bound (100% Capacity)
Unpruned Baseline    Full Capacity           80.35     -            -
Extreme Compression Regime (25% Capacity)
Taylor Pruning       Loss Approx (∇W · W)    64.42     -0.51        -0.27
ℓ1-Norm Pruning      Weights (|W|)           64.54     -0.39        -0.15
Wanda Pruning        Weights × Act           64.69     -0.24        -
Random Pruning       Uniform Sampling        64.93     -            +0.24
AGF Pruning (Ours)   Feature Sensitivity     64.99     +0.06        +0.30

This behavioral difference derives largely from proxy design. Classical Taylor pruning computes importance in parameter space (∇W · W), making it vulnerable to signal saturation. Consequently, strictly directional metrics (Taylor) and static activation-scaled weights (Wanda) are prone to biased, suboptimal structural choices under extreme constraints, pulling performance below the random baseline. By evaluating absolute expected utility in feature space (Y ⊙ ∇Y L), AGF avoids cross-sample signal cancellation. This property allows AGF to safely reduce the network to its topological limit without aggressively disrupting the core architecture, a common drawback of traditional heuristics.

4.7.4 Qualitative Visualization

Finally, we visualize representative samples routed to each expert in Figure 5. As shown, the router assigns images with clean backgrounds and centered subjects (e.g., fish) to the efficient expert. Conversely, images with complex textures or multiple objects are correctly forwarded to the full-capacity expert for robust classification. This qualitative evidence reinforces that AGF-guided routing is semantically meaningful.

Fig. 5: Qualitative Visualization of Routing Decisions. Top row: "easy" samples (e.g., clear objects) routed to the Pruned Expert. Bottom row: "hard" samples (e.g., clutter) routed to the Full Expert. The router effectively identifies samples requiring higher capacity for correct classification.
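Returning to the proxy-design argument of Section 4.7.3, the cross-sample cancellation effect can be made concrete with a toy calculation (illustrative numbers only): a signed, directionally accumulated score such as a Taylor-style ∇W · W sum can vanish for a channel whose per-sample contributions are large but opposite in sign, whereas an absolute accumulation, as in AGF's |Y ⊙ ∇Y L| utility, preserves it:

```python
# Two calibration samples whose signed contributions for one channel
# are equal in magnitude and opposite in sign.
contributions = [0.8, -0.8]

signed_score = sum(contributions)                      # Taylor-style signed sum
absolute_score = sum(abs(c) for c in contributions)    # AGF-style absolute sum

# signed_score == 0.0: the channel looks useless and would be pruned.
# absolute_score == 1.6: the channel is correctly flagged as active.
```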
5 DISCUSSION

5.1 The Role of Weight Inheritance

Our observations are consistent with the Lottery Ticket Hypothesis (LTH) [44]. While LTH establishes that specific sparse subnetworks initialized with the original weights can train effectively from scratch, our findings extend this principle to structural compression: discovering these "tickets" in highly saturated models requires dynamic feature-space sensitivity rather than static weight magnitudes.

A surprising finding in our experiments (Table 1) is that Random Pruning at k = 32 achieves 67.79%, far surpassing the training-from-scratch baseline (45.42%). This phenomenon suggests that the initialization values inherited from the teacher model carry significant inductive bias: even with a random topological subset, the network starts in a "basin of attraction" that is inaccessible from random initialization. However, our routing analysis (Fig. 6) shows that these random sub-networks are poorly calibrated. While they classify well on average, their confidence scores are noisy, forcing the router to call the expensive expert more often (Cost 8.26 vs. 7.28). AGF adds the topological structure necessary to maximize dynamic efficiency.

Fig. 6: Metric Efficiency and Slope Analysis. Main plot: both AGF and ℓ1 significantly outperform Random selection. Zoomed inset: highlights the critical divergence at high costs; while ℓ1 saturates (negative slope), AGF maintains a positive marginal gain.

5.2 Limitations and Boundary Conditions

We acknowledge three limitations in this study:
• Architectural Scope: Our initial rigorous analysis and theoretical validation focus on ResNet-style CNNs, which possess strong inductive biases. To test the universality of our framework, we subsequently extend our structural probing to Vision Transformers (ViTs), architectures notoriously lacking such spatial priors, in Part VI.
• Calibration Overhead: Unlike zero-cost static metrics (ℓ1), AGF requires a calibration phase (approx. 10-20 backward passes). However, this is a one-time offline cost that is negligible compared to the long-term inference savings.
• Baseline Selection: Our study is designed as an analysis of metric properties (gradient vs. magnitude). We therefore benchmark against canonical pruning metrics (Taylor, ℓ1, Random) to isolate the metric's contribution, rather than combining orthogonal engineering tricks (e.g., distillation) to chase state-of-the-art leaderboards.

6 CONCLUSION

We have presented a unified framework connecting feature-learning theory (AGF) to system efficiency. By identifying the "Performance Degradation Threshold" in pruning and resolving "Gradient-Magnitude Decoupling" in routing, we demonstrate that gradient-flow utilities provide the structural foundation for next-generation dynamic networks.

APPENDIX A
DETAILED EXPERIMENTAL RESULTS

A.1 Confidence-Based Routing Numerical Sweep

Table 7 provides numerical results of the threshold sweep for the AGF, ℓ1-Norm, and Random pruning strategies, as referenced in Section 4.6.

APPENDIX B
IMPLEMENTATION DETAILS

To facilitate reproducibility, we provide the specific hardware and hyperparameter settings used in our experiments.

TABLE 7: Full Numerical Sweep of Confidence-Based Routing.
We compare the accuracy (%) and normalized cost of the three strategies across varying confidence thresholds τ. Bold indicates the best trade-off (highest accuracy or lowest cost) at critical operating points.

Threshold (τ)   AGF (Ours)         ℓ1-Norm            Random             Regime
                Acc (%)   Cost     Acc (%)   Cost     Acc (%)   Cost
0.000           68.54     1.00     68.83     1.00     68.13     1.00     Pruned Only
0.500           69.26     7.28     69.61     7.93     69.25     8.26     Low Cost
0.700           70.94     21.33    71.04     21.35    70.87     21.69
0.800           71.40     28.39    71.47     29.16    71.37     29.30
0.900           72.10     38.08    72.11     39.29    72.08     39.61    Balanced
0.950           72.35     45.70    72.51     47.39    72.35     47.63
0.980           72.48     54.61    72.66     56.39    72.48     56.85
0.990           72.65     60.44    72.57     62.57    72.48     62.66    Peak (Last Mile)
0.999           72.49     76.17    72.33     78.71    72.11     80.80    Over-Conservative
1.000           71.17     150.41   71.17     150.41   71.17     150.41   Full Expert

B.1 Hardware and Software Environment

All CIFAR-100 experiments were conducted on a workstation with the following specifications:
• CPU: AMD Ryzen 7 7840H (8 cores, 16 threads).
• GPU: NVIDIA GeForce RTX 4060 (8 GB VRAM).
• Software Stack: PyTorch 2.9.0 (Nightly/Custom Build) + CUDA 12.6.
ImageNet-100 scalability experiments were performed on NVIDIA L4/T4 GPUs via Google Colab Pro.

B.2 Training Recipes

1) Teacher Model Pre-training: We train the WideResNet-18-2 teacher model from scratch to ensure a stable, saturated baseline.

TABLE 8: Hyperparameters for Teacher Training (CIFAR-100).

Parameter              Value
Optimizer              SGD (Nesterov)
Momentum               0.9
Weight Decay           5 × 10^-4
Total Epochs           150
Batch Size             32
Initial Learning Rate  1.0 × 10^-3
LR Schedule            Cosine Annealing
Data Augmentation      RandomCrop, RandomHorizontalFlip

2) Pruning and Fine-tuning: For the AGF pruning phase, we employ an iterative schedule to allow the network dynamics to heal.
• Initialization: weights inherited from the converged teacher (Epoch 150).
• Fine-tuning LR: fixed at 1.0 × 10^-3 (CIFAR) and 1.0 × 10^-4 (ImageNet).
• AGF Calculation: we use T = 4–8 batches for score accumulation.
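The cosine-annealing schedule in Table 8 follows the standard formulation; a minimal sketch (assuming a minimum learning rate of 0, which the paper does not specify):

```python
import math

def cosine_lr(epoch, total_epochs=150, lr0=1.0e-3, lr_min=0.0):
    """Cosine annealing from lr0 down to lr_min over total_epochs,
    matching the Table 8 settings (150 epochs, initial LR 1e-3)."""
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))

# The LR starts at lr0, passes through lr0/2 at the midpoint,
# and decays to lr_min at the final epoch.
start = cosine_lr(0)
mid = cosine_lr(75)
end = cosine_lr(150)
```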
REFERENCES

[1] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in Advances in Neural Information Processing Systems, 2015.
[2] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, "Learning efficient convolutional networks through network slimming," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2736–2744.
[3] X. Wang, F. Yu, Z.-Y. Dou, T. Darrell, and J. E. Gonzalez, "SkipNet: Learning dynamic routing in convolutional networks," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 409–424.
[4] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Q. Weinberger, "Multi-scale dense networks for resource efficient image classification," in International Conference on Learning Representations, 2018.
[5] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient convnets," in International Conference on Learning Representations, 2017.
[6] N. Lee, T. Ajanthan, and P. H. Torr, "SNIP: Single-shot network pruning based on connection sensitivity," in International Conference on Learning Representations, 2019.
[7] C. Wang, G. Zhang, and R. Grosse, "Picking winning tickets before training by preserving gradient flow," in International Conference on Learning Representations, 2020.
[8] M. Sun, Z. Liu, A. Bair, and J. Z. Kolter, "A simple and effective pruning approach for large language models," in International Conference on Learning Representations (ICLR), 2024.
[9] Y. Zhang, H. Bai, H. Lin, J. Zhao, L. Hou, and C. V. Cannistraci, "Plug-and-play: An efficient post-training pruning method for large language models," in The Twelfth International Conference on Learning Representations (ICLR), 2024. [Online]. Available: https://openreview.net/forum?id=Tr0lPx9woF
[10] Y. Yang, K. Zhen, B. Ganesh, A. Galstyan, G. Huybrechts, M. Müller et al., "Wanda++: Pruning large language models via regional gradients," in ICLR Workshop on Sparsity in LLMs, 2025.
[11] D. Kunin et al., "Alternating gradient descent and the dynamics of feature learning," Preprint, 2025.
[12] R. Shwartz-Ziv and N. Tishby, "Opening the black box of deep neural networks via information," arXiv preprint, 2017.
[13] A. M. Saxe, J. L. McClelland, and S. Ganguli, "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks," arXiv preprint, 2013.
[14] P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, and I. Sutskever, "Deep double descent: Where bigger models and more data hurt," in International Conference on Learning Representations (ICLR), 2020.
[15] A. Power, Y. Burda, H. Edwards, I. Babuschkin, and V. Misra, "Grokking: Generalization beyond overfitting on small algorithmic datasets," in International Conference on Learning Representations (ICLR), 2022.
[16] Z. Liu, E. Matsubara, and S. Y. Uehara, "Omnigrok: Grokking beyond algorithmic data," in International Conference on Learning Representations (ICLR), 2023.
[17] Y. LeCun, J. S. Denker, and S. A. Solla, "Optimal brain damage," Advances in Neural Information Processing Systems, vol. 2, 1989.
[18] B. Hassibi and D. G. Stork, "Second order derivatives for network pruning: Optimal brain surgeon," in Advances in Neural Information Processing Systems, 1993, pp. 164–171.
[19] S. P. Singh and D. Alistarh, "WoodFisher: Efficient second-order approximation for neural network compression," in Advances in Neural Information Processing Systems (NeurIPS), 2020.
[20] Y. He, X. Zhang, and J. Sun, "Channel pruning for accelerating very deep neural networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1389–1397.
[21] E. Frantar and D. Alistarh, "SparseGPT: Massive language models can be accurately pruned in one-shot," in International Conference on Machine Learning (ICML), 2023.
[22] X. Ma, G. Fang, and X. Wang, "LLM-Pruner: On the structural pruning of large language models," in Advances in Neural Information Processing Systems (NeurIPS), 2023.
[23] S. Ashkboos, M. L. Croce et al., "SliceGPT: Compress large language models by deleting rows and columns," in International Conference on Learning Representations (ICLR), 2024.
[24] H. Yin et al., "Outlier weighed layerwise sparsity (OWL): A missing secret sauce for pruning LLMs to high sparsity," in International Conference on Machine Learning (ICML), 2023.
[25] H. Tanaka, D. Kunin, D. L. Yamins, and S. Ganguli, "Pruning neural networks without any data by iteratively conserving synaptic flow," Advances in Neural Information Processing Systems, vol. 33, pp. 6377–6389, 2020.
[26] Y. Wang et al., "A unified view of importance metrics for structural pruning," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
[27] P. Molchanov, A. Mallya, S. Tyree, I. Frosio, and J. Kautz, "Importance estimation for neural network pruning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11264–11272.
[28] G. Fang, X. Ma, M. Ming, and X. Wang, "DepGraph: Towards any structural pruning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[29] L. Beyer et al., "FlexiViT: One model for all patch sizes," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[30] D. Bolya, C.-Z. Fu, T. Darrell, and J. Hoffman, "Token merging: Your ViT but faster," in International Conference on Learning Representations (ICLR), 2023.
[31] G. Li, Y. Xu, Z. Li, J. Liu, X. Yin, D. Li, and E. Barsoum, "Týr-the-Pruner: Structural pruning LLMs via global sparsity distribution optimization," in Advances in Neural Information Processing Systems (NeurIPS), 2025.
[32] H. Cai, C. Gan, T. Wang, Z. Zhang, and S. Han, "Once-for-all: Train one network and specialize it for efficient deployment," in International Conference on Learning Representations (ICLR), 2020.
[33] Z. Guo, X. Zhang, H. Mu, W. Heng, Z. Liu, Y. Wei, and J. Sun, "Single path one-shot neural architecture search with uniform sampling," in European Conference on Computer Vision (ECCV), 2020, pp. 544–560.
[34] Y. Han, G. Huang, S. Song, L. Yang, H. Wang, and Y. Wang, "Dynamic neural networks: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 7436–7456, 2021.
[35] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer," in International Conference on Learning Representations, 2017.
[36] W. Fedus, B. Zoph, and N. Shazeer, "Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity," Journal of Machine Learning Research (JMLR), 2022.
[37] J. Puigcerver, C. Riquelme, B. Mustafa, and N. Houlsby, "From sparse to soft mixtures of experts," in International Conference on Learning Representations (ICLR), 2024.
[38] Y. Rao, W. Zhao, B. Liu, J. Lu, J. Zhou, and C.-J. Hsieh, "DynamicViT: Efficient vision transformers with dynamic token sparsification," in Advances in Neural Information Processing Systems (NeurIPS), 2021.
[39] Y. Chen, X. Dai, M. Liu, D. Chen, L. Yuan, and Z. Liu, "Dynamic convolution: Attention over convolution kernels," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[40] S. Teerapittayanon, B. McDanel, and H.-T. Kung, "BranchyNet: Fast inference via early exiting from deep neural networks," in 2016 23rd International Conference on Pattern Recognition (ICPR), 2016, pp. 2464–2469.
[41] B. E. Bejnordi, T. Blankevoort, and M. Welling, "Batch-shaping for learning conditional channel gated networks," in International Conference on Learning Representations, 2020.
[42] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, "On calibration of modern neural networks," in International Conference on Machine Learning, 2017, pp. 1321–1330.
[43] B. Xiao, P. Wang, Q. He, and M. Dong, "MDP: Multidimensional vision model pruning with latency constraint," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 20113–20123.
[44] J. Frankle and M. Carbin, "The lottery ticket hypothesis: Finding sparse, trainable neural networks," in International Conference on Learning Representations, 2019.
