DynamicGate-MLP: Conditional Computation via Learned Structural Dropout and Input-Dependent Gating for Functional Plasticity
Authors: Yong Il Choi
Sorynorydotcom Co., Ltd., AI Open Research Lab
hurstchoi@sorynory.com
https://orcid.org/0009-0009-8813-5420
December 24, 2025

Abstract

Dropout is a representative regularization technique that stochastically deactivates hidden units during training to mitigate overfitting. In contrast, standard inference executes the full network with dense computation, so its goal and mechanism differ from conditional computation, where the executed operations depend on the input. This paper organizes DynamicGate-MLP into a single framework that simultaneously satisfies both the regularization view and the conditional-computation view. Instead of a random mask, the proposed model learns gates that decide whether to use each unit (or block), suppressing unnecessary computation while implementing sample-dependent execution that concentrates computation on the parts needed for each input. To this end, we define continuous gate probabilities and, at inference time, generate a discrete execution mask from them to select an execution path. Training controls the compute budget via a penalty on expected gate usage and uses a Straight-Through Estimator (STE) to optimize the discrete mask. We evaluate DynamicGate-MLP on MNIST, CIFAR-10, Tiny-ImageNet, Speech Commands, and PBMC3k, and compare it with various MLP baselines and MoE-style variants. Compute efficiency is compared under a consistent criterion using gate activation ratios and a layer-weighted relative MAC metric, rather than wall-clock latency that depends on hardware and backend kernels.
1. Introduction

1.1 Motivation

The starting point of this study dates back to the mid-2000s, when we encountered early neuromorphic chips such as General Vision and IBM's ZISC and began to suspect that "brain-inspired" chips might point beyond the traditional von Neumann architecture. Later, while training and analyzing large neural networks, especially LLMs, we repeatedly observed that dense computation in hidden layers is structurally far from biological mechanisms such as neuronal firing/silence and synaptic plasticity. We also found that sparse-activation phenomena such as firing/silence can be approximated more precisely in a mathematical form. Considering the current technical limitations of neuromorphic hardware, we believe there is a need for conditional-computation structures that are implementable on general-purpose hardware. In this context, we designed DynamicGate-MLP, which selectively performs computation via input-dependent gating.

1.2 Problem Statement

Deep learning models are often over-parameterized, which can be beneficial for expressivity and optimization but increases compute cost and may raise the risk of overfitting. Dropout is a widely used regularization method that removes random units during training to reduce co-adaptation and improve generalization [1]. However, standard dropout has the following limitations.

• Training-time sparsity only: inference is typically executed with dense computation, making it difficult to translate into conditional execution.
• Input-agnostic stochasticity: in standard dropout, the mask is not chosen adaptively per input, but is sampled from fixed-probability randomness.

Pruning, on the other hand, can compress a model by removing weights/channels after training [14], but it typically applies the same static structure to all inputs.
Conditional computation and sparse routing aim to reduce average computation by executing only a subset of paths per input; Mixture-of-Experts (MoE) [12] and Switch Transformer [13] are representative examples.

1.3 Core Idea

Figure 1: Conceptual comparison of (a) Baseline, (b) Dropout (random mask, training only), (c) Pruning (static sparsity), (d) DynamicGate-MLP (input-dependent gates z(x), p(x), g(x)), (e) RigL, dynamic sparse training (m(t), W ⊙ m(t)), and (f) DynamicGate-MLP + RigL, gated dynamic sparse (gates plus rewiring).

DynamicGate-MLP can control sample-wise activation patterns via input-dependent gating. In Fig. 1, (d) and (f) are models newly designed in this work. This paper connects dropout, pruning, and conditional computation into a single narrative through DynamicGate-MLP.

The core idea of DynamicGate-MLP is to implement the viewpoint: "turn units off like dropout, but not randomly; turn them off by learning, and turn them off differently for each input at execution time." Concretely, we insert learnable gates into each layer, replacing dropout's random mask with a learned gate probability p. We then discretize this probability with a threshold to obtain a hard mask g(x), allowing only selected units (or blocks) to participate in computation. At the same time, by including a penalty on gate usage (e.g., E[p]) in the objective, we can directly tune the activation-rate (compute) budget during training while maintaining accuracy.
Furthermore, combining grow-and-prune methods such as RigL/SET enables an extension that couples fast time-scale functional selection ("which units to use for this input") with slow time-scale structural change ("which connections should exist at all"). From a neuroscience perspective, the brain exhibits both functional plasticity, selectively activating circuits depending on tasks and context, and structural plasticity, reconfiguring circuits via synapse/spine formation and elimination [27, 28].

Figure 2: Conceptual diagram of the gating layer

DynamicGate-MLP focuses primarily on the former (functional selection), while its combination with RigL-style rewiring can complementarily connect it to the latter (structural change).

1.4 Contributions

• We introduce a shared gating structure that decides unit activation, unifying dropout-style probabilistic masking (during training) and input-dependent conditional execution (at inference) within a single gating layer.
• We introduce expected gate-usage regularization to directly control a compute budget and provide an implementable training method that stably learns a discrete gate policy via STE.
• We compare Baseline / Dropout / Pruned / DynamicGate-MLP / RigL / DynamicGate-MLP + RigL on MNIST and CIFAR.
• We report compute using proxy metrics based on gate activation ratios (Compute Proxy) and a MAC-weighted relative metric (RelMAC), and we do not equate these directly with wall-clock latency, which depends on backend optimizations.

Figure 3: Conceptual illustration of active and silent neurons

2. Background and Related Work

2.1 Neuronal firing, silence, and removal as conditional computation

In neuroscience, neuronal silence refers to a state where a neuron structurally exists but does not fire under certain conditions.
This can be understood as part of sparse information processing in which we do not "always compute everything" but rather "compute only when needed." Over longer time scales, it is also related to structural changes such as pruning, where rarely used connections weaken or are removed. This view can be summarized as follows.

• Short term: input-dependent firing/silence (functional selection)
• Long term: removal/rewiring based on usage statistics (structural change)

DynamicGate-MLP implements such "reversible silence" via input-dependent gating,¹ and it provides a bridge between gating and pruning in that units with low average gate usage can become candidates for long-term removal.

2.2 Dropout, DropConnect, and Bayesian interpretations

Dropout [1] and DropConnect [2] stochastically deactivate units or connections during training to improve generalization.² Dropout can also be interpreted from a Bayesian approximation perspective [4], and many extensions exist, including variational dropout [3] and Concrete Dropout [5]. DynamicGate-MLP can be viewed as a learnable/structured variant of dropout in that it uses learned gate probabilities instead of a fixed-probability random mask.

2.3 Learned sparsity, discrete gating, and continuous relaxations

L0-regularization-based sparse learning [7] learns structural sparsity by directly controlling the expected activation of a mask. To make discrete choices differentiable, various methods have been proposed, including

¹ Here, "reversible silence" means that a unit/connection is not permanently removed, but is temporarily deactivated depending on the input and thus excluded from computation. Therefore, an inactive unit can become active again for future inputs, and once re-activated, the parameters on that path re-enter gradient updates from that point onward.
² I.e., improving performance on unseen inputs.
STE [6], Gumbel-Softmax [8], and Concrete distributions [9]. For simplicity and clarity of implementation, this paper focuses on hard threshold-based discrete gates trained with STE.

2.4 Pruning, structured sparsity, and hardware friendliness

Pruning compresses a model by removing weights/channels/filters [14]. The Lottery Ticket Hypothesis [15] and Movement Pruning [16] discuss the existence of learnable sparse structures and training strategies. Structured pruning (at the channel/filter level) [17] is often favorable for real speedups, but it typically yields a static structure with weak input dependence.

2.5 Conditional computation, MoE, and adaptive computation

Conditional computation reduces average computation by executing only a subset of paths for each input. ACT [11], MoE [12], and Switch Transformer [13] are representative approaches. Unlike MoE, which introduces multiple experts, DynamicGate-MLP implements selective execution of units (or blocks) inside a simple MLP, aiming for a compact formulation that explicitly includes budget control.

2.6 Dynamic sparse training (DST) and rewiring

SET [20] and RigL [30] are dynamic sparse training (DST) methods that prune and regrow connections during training, offering a rewiring perspective in which the structure itself changes over time. While this paper primarily focuses on functional gating, we later present an extension that combines gating with rewiring.

3. DynamicGate-MLP: Unified Formulation

Consider an L-layer MLP. For an input x, let h^{(\ell-1)}(x) denote the activation vector of layer \ell - 1. A standard linear layer is defined as follows.

    z^{(\ell)}(x) = W^{(\ell)} h^{(\ell-1)}(x) + b^{(\ell)}, \quad h_{\mathrm{raw}}^{(\ell)}(x) = \phi\!\left(z^{(\ell)}(x)\right)   (1)

DynamicGate-MLP multiplies each hidden unit by a gate.

    h^{(\ell)}(x) = g^{(\ell)}(x) \odot h_{\mathrm{raw}}^{(\ell)}(x)   (2)

Here, g^{(\ell)}(x) is defined as follows:

    g^{(\ell)}(x) \in [0, 1] \ (\text{soft gate}), \quad g^{(\ell)}(x) \in \{0, 1\} \ (\text{hard gate})   (3)
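The definitions above (linear layer, gated output, soft vs. hard gate) fit in a few lines of NumPy. The following is a minimal sketch of our own, not the paper's released implementation; ReLU is assumed for the activation φ, and all function names are illustrative.

```python
import numpy as np

def gate_probs(z_g, tau=1.0):
    """Soft gate: p = sigmoid(z_g / tau); smaller tau sharpens p toward 0/1."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z_g, dtype=float) / tau))

def hard_gate(p, theta=0.5):
    """Hard gate: g = 1[p > theta], the discrete execution mask."""
    return (np.asarray(p) > theta).astype(float)

def gated_layer(h_prev, W, b, gate):
    """Eqs. (1)-(2): h = gate * phi(W h_prev + b), with phi = ReLU (assumed)."""
    h_raw = np.maximum(W @ h_prev + b, 0.0)
    return gate * h_raw
```

A fully open gate (all ones) reduces the layer to a standard dense layer, while a closed gate silences the corresponding units, mirroring the soft/hard distinction of Eq. (3).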
That is, a soft gate is a real value between 0 and 1, while a hard gate takes either 0 or 1.

3.1 Static vs. dynamic gating

• Static (input-agnostic): for each unit, introduce a learnable logit z_i^{(\ell)} and produce a fixed probability p_i^{(\ell)}. This can be interpreted as learned structural dropout.
• Dynamic (input-dependent): use a small gate network, as follows.

(1) Logit (score) generation

    z_g^{(\ell)}(x) = \mathrm{GateNet}\!\left(h^{(\ell-1)}(x)\right)   (4)

• h^{(\ell-1)}(x): representation from the previous layer (feature vector)
• GateNet(·): a small network that decides "which neurons/blocks to open, and by how much, for this input." This naturally connects to conditional computation [11, 12].
• z_g^{(\ell)}(x): logit, the gate's "raw score"
  – Value range: (−∞, +∞)
  – Meaning: learned toward large positive values if the gate should open, and toward large negative values if it should close

(2) Converting to gate probabilities with a sigmoid

• σ(t) = 1 / (1 + e^{−t}) (sigmoid)
• Result: g^{(\ell)}(x) ∈ (0, 1)
• Interpretation:
  – z_g ≫ 0 ⇒ g ≈ 1 (almost fully ON)
  – z_g ≪ 0 ⇒ g ≈ 0 (almost fully OFF)
  – z_g = 0 ⇒ g = 0.5

In other words, this step converts GateNet's score z_g into something like the probability/fraction that the gate is ON.

3.2 Probability parameterization, threshold, and temperature

From the gate logit z_{g,i}^{(\ell)}(x), we define

    p_i^{(\ell)}(x) = \sigma\!\left(\frac{z_{g,i}^{(\ell)}(x)}{\tau}\right), \quad g_i^{(\ell)}(x) = \mathbb{1}\!\left[p_i^{(\ell)}(x) > \theta\right]   (5)

where τ is the temperature and θ is a global hard threshold. (Extension: if layerwise thresholds are needed, this can be generalized to θ^{(\ell)}, but for simplicity we use a global θ.) For stability, training may use soft gating (p), while the deployment/inference policy is often defined by hard gating (g). Accordingly, we distinguish the following when reporting results.
• Training proxy: E[p] (differentiable expected activation)
• Deployment activation: E[g] = E[1(p > θ)]

3.3 Budget control via expected gate usage

We denote the total objective by J to avoid confusion with the number of layers L.

    J = \mathcal{L}_{\mathrm{task}} + \lambda_g \sum_{\ell=1}^{L-1} \frac{1}{n_\ell} \sum_{i=1}^{n_\ell} \bar{p}_i^{(\ell)}, \quad \bar{p}_i^{(\ell)} = \frac{1}{|B|} \sum_{x \in B} p_i^{(\ell)}(x)   (6)

(1) Expected activation ratio of layer \ell (definition)

    \rho^{(\ell)} := \frac{1}{n_\ell} \sum_{i=1}^{n_\ell} \mathbb{E}\!\left[p_i^{(\ell)}(x)\right] \approx \frac{1}{n_\ell} \sum_{i=1}^{n_\ell} \frac{1}{|B|} \sum_{x \in B} p_i^{(\ell)}(x)   (7)

³ \mathbb{1}[\cdot]: the indicator function.

\rho^{(\ell)} indicates "how much (probabilistically) layer \ell is ON on average."

(2) The corresponding penalty term in the objective

    \lambda_g \sum_{\ell=1}^{L-1} \rho^{(\ell)}   (8)

This is similar to L0-style sparse learning objectives [7].

(3) Compute-cost weighting (optional). If we introduce a per-unit compute cost c_i^{(\ell)}, then

    J = \mathcal{L}_{\mathrm{task}} + \lambda_g \sum_{\ell=1}^{L-1} \frac{\sum_{i=1}^{n_\ell} c_i^{(\ell)} p_i^{(\ell)}}{\sum_{i=1}^{n_\ell} c_i^{(\ell)}}   (9)

which enables cost-aligned budget control [17].

3.4 Top-k gating (optional hard budget)

Instead of a threshold, one can activate exactly k_\ell units per layer:

    g^{(\ell)}(x) = \mathrm{TopK}\!\left(p^{(\ell)}(x),\, k_\ell\right)   (10)

This enforces exactly k_\ell activations per layer to guarantee a strict budget, but it may reduce the flexibility of per-input adaptive control [13].

3.5 Learning discrete gates via STE

Because hard gates are non-differentiable, we use STE [6]. The forward pass uses g, while the backward pass uses the derivative of p.

STE equations (hard gate in forward, soft-gate gradient in backward):

(1) Soft gate probability (differentiable path)

    p_i^{(\ell)}(x) = \sigma\!\left(\frac{z_{g,i}^{(\ell)}(x)}{\tau}\right)   (11)

(2) Forward: use a hard gate (binarization)

    g_i^{(\ell)}(x) = \mathbb{1}\!\left[p_i^{(\ell)}(x) > \theta\right]   (12)

(3) Backward (STE): ∂g/∂z_g is approximated by ∂p/∂z_g.
    \frac{\partial g_i^{(\ell)}(x)}{\partial z_{g,i}^{(\ell)}(x)} \approx \frac{\partial p_i^{(\ell)}(x)}{\partial z_{g,i}^{(\ell)}(x)} = \frac{1}{\tau}\, p_i^{(\ell)}(x)\left(1 - p_i^{(\ell)}(x)\right)   (13)

If we write it in chain-rule form including the loss J,

    \frac{\partial J}{\partial z_{g,i}^{(\ell)}(x)} \approx \frac{\partial J}{\partial g_i^{(\ell)}(x)} \cdot \frac{1}{\tau}\, p_i^{(\ell)}(x)\left(1 - p_i^{(\ell)}(x)\right)   (14)

Figure 4: Conceptual diagram of forward execution with hard gating and STE-based backpropagation

Alternatively, one may use Gumbel-Softmax [8] or Concrete [9]. A gate-usage penalty enables budget control. Fig. 4 illustrates the concept of learning discrete gates with STE. Budget control is performed via a penalty based on E[p], while deployment-time activation is interpreted via E[g].

3.6 The gap between compute proxies and real deployment speed

In this paper, compute reduction is not reported via direct measurements of FLOPs/latency, but via proxy metrics based on gate activations. A simple average unit-activation proxy is defined as follows.

    \alpha_p^{(\ell)} = \frac{1}{n_\ell} \sum_{i=1}^{n_\ell} p_i^{(\ell)}, \quad \alpha_g^{(\ell)} = \frac{1}{n_\ell} \sum_{i=1}^{n_\ell} \mathbb{E}\!\left[\mathbb{1}\!\left(p_i^{(\ell)} > \theta\right)\right]   (15)

    \mathrm{ComputeProxy}_p = \frac{1}{L-1} \sum_{\ell=1}^{L-1} \alpha_p^{(\ell)}, \quad \mathrm{ComputeProxy}_g = \frac{1}{L-1} \sum_{\ell=1}^{L-1} \alpha_g^{(\ell)}   (16)

However, MACs of fully connected (FC) layers are weighted by fan-in/out, so we additionally define a relative MAC (RelMAC) metric:

    \mathrm{RelMAC} = \frac{\sum_{\ell=1}^{L-1} \alpha^{(\ell)}\, n_{\ell-1} n_\ell}{\sum_{\ell=1}^{L-1} n_{\ell-1} n_\ell}, \quad \alpha^{(\ell)} \in \left\{\alpha_p^{(\ell)},\, \alpha_g^{(\ell)}\right\}   (17)

Here, \alpha^{(\ell)} denotes the layerwise average activation (under either the training proxy or the deployment policy).
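Both the usage penalty (Eqs. 6–8) and the proxies above (Eqs. 15–17) reduce to averages of gate activity, optionally weighted by layer MACs. A hedged sketch with helper names of our own:

```python
import numpy as np

def gate_usage_penalty(p_layers, lam_g):
    """Eqs. (6)-(8): lambda_g * sum_l rho^(l), where rho^(l) is the
    batch- and unit-averaged gate probability of layer l.

    p_layers: list of (batch, n_l) arrays of gate probabilities p_i^(l)(x).
    """
    return lam_g * sum(float(p.mean()) for p in p_layers)

def rel_mac(alphas, widths):
    """Eq. (17): MAC-weighted relative compute.

    alphas: per-layer activation ratios (length L-1), either E[p] or E[g].
    widths: [n_0, ..., n_{L-1}], so gated layer l costs n_{l-1} * n_l MACs.
    """
    macs = [widths[l] * widths[l + 1] for l in range(len(alphas))]
    return float(sum(a * m for a, m in zip(alphas, macs)) / sum(macs))
```

With widths [784, 256, 10] and activation ratios [0.5, 1.0], the unweighted proxy of Eq. (16) is 0.75, but RelMAC stays close to 0.5 because the large 784-to-256 layer dominates the MAC count; this is exactly why the fan-in/out weighting matters.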
Real speedups depend on sparse kernels, block/channel-wise structured sparsity, or routing-based implementations [17, 14]. This paper focuses on controllable activation patterns and accuracy rather than absolute speed.

3.7 RigL: learning dynamic sparse connectivity (rewiring)

RigL is a Dynamic Sparse Training method that dynamically reallocates the connections themselves (a sparse mask over weights) during training, maximizing performance under a fixed parameter budget [30].

(1) Sparse-mask parameterization. For the layer-\ell weight matrix W^{(\ell)} \in \mathbb{R}^{n_\ell \times n_{\ell-1}}, define a discrete mask at training step t

    m^{(\ell)}(t) \in \{0, 1\}^{n_\ell \times n_{\ell-1}}   (18)

and define the sparse weights used in computation as

    \widetilde{W}^{(\ell)}(t) = W^{(\ell)}(t) \odot m^{(\ell)}(t)   (19)

Here, \widetilde{W}^{(\ell)}(t) is the mask-applied weight actually used in computation. Note that the mask m(t) is not updated at every step; under RigL it is updated every fixed period ΔT, and remains fixed between updates while only W(t) is trained. Then the forward computation of layer \ell is

    a^{(\ell)}(x; t) = \widetilde{W}^{(\ell)}(t)\, h^{(\ell-1)}(x; t) + b^{(\ell)}, \quad h^{(\ell)}(x; t) = \phi\!\left(a^{(\ell)}(x; t)\right)   (20)

(2) Fixed sparsity constraint. To keep the number of active connections per layer constant, the mask is maintained to satisfy the following budget condition.

    \left\| m^{(\ell)}(t) \right\|_0 = (1 - s_\ell)\, n_\ell\, n_{\ell-1}   (21)

Here, s_\ell \in [0, 1] is the layerwise sparsity; RigL keeps the number of connections fixed while changing which connections remain active during training.

(3) Prune & Grow (RigL update). RigL updates the mask every period ΔT, i.e., at steps t \in \{\Delta T, 2\Delta T, 3\Delta T, \ldots\}. At each update, it prunes (removes) active connections (m = 1) with small weight magnitude and grows (adds) new connections among inactive locations (m = 0) with large gradient magnitude.
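The mask parameterization (Eqs. 18–21) and the prune-and-grow step just described can be sketched in a few lines of NumPy. This is a toy illustration under our own naming, not the reference RigL implementation:

```python
import numpy as np

def init_sparse_mask(n_out, n_in, sparsity, rng):
    """Random binary mask with exactly (1 - sparsity) * n_out * n_in ones (Eq. 21)."""
    k = int(round((1.0 - sparsity) * n_out * n_in))   # active-connection budget
    mask = np.zeros(n_out * n_in)
    mask[rng.choice(n_out * n_in, size=k, replace=False)] = 1.0
    return mask.reshape(n_out, n_in)

def rigl_update(W, mask, grad, k):
    """One prune-and-grow step: drop the k active weights with smallest |W|,
    revive the k inactive positions with largest |grad|; budget is preserved."""
    w, m, g = np.abs(W).ravel(), mask.ravel().copy(), np.abs(grad).ravel()
    active, inactive = np.flatnonzero(m == 1), np.flatnonzero(m == 0)
    prune = active[np.argsort(w[active])[:k]]       # BottomK by |W| among m = 1
    grow = inactive[np.argsort(-g[inactive])[:k]]   # TopK by |grad| among m = 0
    m[prune], m[grow] = 0.0, 1.0
    return m.reshape(mask.shape)
```

Between updates the mask stays fixed and only the surviving weights train; `rigl_update` would run once every ΔT steps.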
In equations, define the set of indices to prune as

    P^{(\ell)}(t) = \mathrm{BottomK}\!\left(\left\{\left|W_{ij}^{(\ell)}(t)\right| : m_{ij}^{(\ell)}(t) = 1\right\},\, K_\ell\right)   (22)

The set of indices to grow (new connections) is the Top-K gradients among locations with no current connection (m = 0):

    G^{(\ell)}(t) = \mathrm{TopK}\!\left(\left\{\left|\nabla_{W_{ij}^{(\ell)}(t)} \mathcal{L}(t)\right| : m_{ij}^{(\ell)}(t) = 0\right\},\, K_\ell\right)   (23)

• \mathcal{L}(t) or \mathcal{L}_{\mathrm{task}}(t): the loss
• \nabla_{W_{ij}} \mathcal{L}: how sensitive the loss would be to that connection if it existed
• m_{ij}(t) = 0: crucially, we only select candidates among currently absent connections

Define them as above, and update the mask as follows.

    m_{ij}^{(\ell)}(t^+) = \begin{cases} 0, & (i, j) \in P^{(\ell)}(t), \\ 1, & (i, j) \in G^{(\ell)}(t), \\ m_{ij}^{(\ell)}(t), & \text{otherwise}. \end{cases}   (24)

Here, t^+ denotes immediately after the update. This procedure preserves the parameter budget while allowing training to discover important connections [30].

(4) RigL compute proxy. The layerwise connection density in RigL is

    \rho_\ell = \frac{\left\| m^{(\ell)}(t) \right\|_0}{n_\ell\, n_{\ell-1}} = 1 - s_\ell   (25)

and can be used as a proxy for structural (connection-level) sparsity.

3.8 DynamicGate-MLP + RigL: a unified formulation of gating and rewiring

Figure 5: Diagram of adding RigL's dynamic rewiring to the hard-gating (STE) training path. DynamicGate controls input-dependent unit activation g(x), and RigL periodically rewires a sparse connectivity mask m(t) during training to learn structural sparsity.
In this subsection, we define a fused model that applies input-dependent unit gating (DynamicGate-MLP) and connection rewiring (RigL) simultaneously. The key point is that (1) unit-level conditional computation and (2) connection-level structural learning provide sparsity along different axes.

(1) Fused forward computation. With RigL, the sparse weight is defined as

    \widetilde{W}^{(\ell)}(t) = W^{(\ell)}(t) \odot m^{(\ell)}(t)   (26)

and the forward pass combines the sparse structure with unit gating:

    a^{(\ell)}(x; t) = \widetilde{W}^{(\ell)}(t)\, h^{(\ell-1)}(x; t) + b^{(\ell)}   (27)

    h_{\mathrm{raw}}^{(\ell)}(x; t) = \phi\!\left(a^{(\ell)}(x; t)\right)   (28)

    h^{(\ell)}(x; t) = g^{(\ell)}(x; t) \odot h_{\mathrm{raw}}^{(\ell)}(x; t)   (29)

Here, the mask m^{(\ell)}(t) constrains the connections (structure), and the gate g^{(\ell)}(x; t) selects unit activation (function) depending on the input. Written in one line:

    h^{(\ell)}(x; t) = g^{(\ell)}(x; t) \odot \phi\!\left(\left(W^{(\ell)}(t) \odot m^{(\ell)}(t)\right) h^{(\ell-1)}(x; t) + b^{(\ell)}\right)   (30)

In short, connectivity is constrained by the mask, and unit activation is selected per input on top of it.

GateNet input: the gate logits are defined in the same way as before,

    z_g^{(\ell)}(x) = \mathrm{GateNet}\!\left(h^{(\ell-1)}(x; t)\right), \quad p^{(\ell)}(x) = \sigma\!\left(\frac{z_g^{(\ell)}(x)}{\tau}\right), \quad g^{(\ell)}(x) = \mathbb{1}\!\left[p^{(\ell)}(x) > \theta\right]   (31)

and GateNet takes as input a representation computed under the current sparse structure.

(2) Fused objective. RigL maintains sparsity primarily as a (fixed) constraint, while DynamicGate-MLP penalizes the expected activation E[p]. Therefore, the combined loss naturally becomes

    J_{\mathrm{fuse}} = \mathcal{L}_{\mathrm{task}} + \lambda_g \sum_{\ell=1}^{L-1} \frac{1}{n_\ell} \sum_{i=1}^{n_\ell} \bar{p}_i^{(\ell)}   (32)

(3) Fused compute proxy (important). In layer \ell, FC compute can be approximated by the product of connection density \rho_\ell and unit activation ratio \alpha_\ell. Here,

    \rho_\ell = \frac{\left\| m^{(\ell)} \right\|_0}{n_{\ell-1} n_\ell}, \quad \alpha_\ell \in \left\{\mathbb{E}\!\left[p^{(\ell)}\right],\, \mathbb{E}\!\left[g^{(\ell)}\right]\right\}   (33)
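The fused forward pass of Eqs. (26)–(30) simply composes the two mechanisms: the masked weights set the structure, and the input-dependent gate selects units on top. A minimal sketch of our own (ReLU assumed for φ):

```python
import numpy as np

def fused_layer(h_prev, W, b, mask, gate):
    """Eq. (30): h = g(x) * phi((W ⊙ m) h_prev + b).

    mask: binary connection mask m (structure, from RigL-style rewiring).
    gate: binary or soft unit gate g(x) (function, input-dependent).
    """
    W_sparse = W * mask                             # Eq. (26): structural sparsity
    h_raw = np.maximum(W_sparse @ h_prev + b, 0.0)  # phi = ReLU (assumed)
    return gate * h_raw                             # unit-level conditional execution
```

With the mask and gate fully open this collapses back to a plain dense layer, which makes the complementarity explicit: the mask removes connections for all inputs, the gate removes units per input.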
Therefore, the relative MAC proxy of the fused model is

    \mathrm{RelMAC}_{\mathrm{fuse}} = \frac{\sum_{\ell=1}^{L-1} (n_{\ell-1} n_\ell)\, \rho_\ell\, \alpha_\ell}{\sum_{\ell=1}^{L-1} (n_{\ell-1} n_\ell)}   (34)

This proxy metric simultaneously reflects the effects of RigL (structure) and DynamicGate (input-conditional execution). RigL learns which connections should exist, while DynamicGate selects which units to use among existing connections depending on the input. Thus, they can operate on complementary sparsity axes rather than being redundant.

4. Training recipe and practical considerations

4.1 Preventing gate collapse and tuning guide

A strong gate penalty (λ_g) or a low temperature (τ) can cause gate collapse early in training, where gates close excessively. In particular, when forming a thresholded binary gate g ∈ {0, 1} from the gate probability p as in our model, abruptly increasing θ or λ_g in a certain phase can make g drop sharply and collapse performance. To prevent this, we recommend the following practical tuning recipes.

(1) Early diagnosis of collapse (log-based). Recording the batch-averaged values of both p and g during training helps quickly classify the failure mode.

• p is high but g goes to 0: the evaluation threshold θ is too high (or jumps up at a phase transition), making hard thresholding overly aggressive.
• p itself rapidly drops to 0: the λ_g ramp is too fast, or τ is too low so the sigmoid saturates and training becomes unstable.
• Collapse only in a specific phase: a shock pattern where the keep target changes, θ increases, and λ_g increases simultaneously when entering phase-3.

(2) Safe scheduling rules

• Increase the penalty "later" and "slowly": keep λ_g = 0 during the initial warmup E_w to first secure capacity, then gradually increase λ_g up to λ_max with a linear or cosine ramp.
• Anneal temperature gently: start with a relatively high temperature (e.g., τ ∈ [1, 2]) and decrease it gradually, avoiding abrupt drops to overly low values (e.g., a sudden drop to τ ≲ 0.5 increases collapse risk).
• Initialize the gate bias to start "open": choose an initial open rate p_0 (e.g., 0.8) and initialize the gate bias as

    b \leftarrow \tau \cdot \mathrm{logit}(p_0)   (35)

so that training starts from a naturally open state.

(3) Recommended tuning order for a 3-phase schedule. When using a 3-phase schedule, it is important not to change the keep target and threshold/penalty abruptly at phase boundaries.

• Phase-1 (stabilization): high keep target (e.g., 0.85–0.95), λ_g = 0, low θ (e.g., 0.50–0.65).
• Phase-2 (encourage reduction): medium keep target (e.g., 0.45–0.65), gradually increase λ_g, gently raise θ (e.g., 0.65–0.80).
• Phase-3 (final compression): guide toward a low keep target (e.g., 0.20–0.35), but avoid overly high θ (e.g., above 0.90) or abrupt increases of λ_g. If collapse signs appear, it is often more stable to buffer the keep target around ~0.30 at the beginning of phase-3 and then lower it further.

(4) Guaranteeing minimum activity (optional safety mechanisms). If collapse repeats or the model is sensitive to data/initialization, one can add safety mechanisms that structurally prevent total collapse.

• Force Top-k activations: keep the top-k units with the highest probabilities in each layer always active (g = 1).
• Lower bound on the open rate: enforce E[g_\ell] ≥ r_min (e.g., 0.05–0.10) per layer to prevent total collapse.

(5) Practical context. The above warmup → gradual regularization → smooth temperature annealing → optional minimum-activity constraint pattern aligns with practical conventions in sparse routing/sparse learning, and can be interpreted as a general stabilization strategy to avoid gate collapse [7, 13].
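The open-start initialization of Eq. (35) and the warmup-then-linear λ_g ramp recommended above are both one-liners. An illustrative sketch with names of our own:

```python
import math

def open_gate_bias(p0, tau):
    """Eq. (35): b = tau * logit(p0), so that sigmoid(b / tau) = p0 at init."""
    return tau * math.log(p0 / (1.0 - p0))

def lambda_ramp(t, warmup, total, lam_max):
    """lambda_g = 0 during warmup epochs, then a linear ramp up to lam_max."""
    if t <= warmup:
        return 0.0
    return lam_max * min(1.0, (t - warmup) / max(1, total - warmup))
```

Starting from p_0 = 0.8 with τ = 2, the bias is 2·ln(4) ≈ 2.77, and the gate probability at a zero logit contribution from GateNet is exactly 0.8, so training begins from a naturally open state.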
4.2 Training algorithm

Algorithm 1 summarizes the training loop with an expected gate-usage penalty and optional hard masking.

Algorithm 1: Training DynamicGate-MLP with an expected gate-usage penalty
Input: data loader D, model parameters Θ, gate parameters Φ, total epochs T, warmup epochs E_w, maximum penalty λ_g^max, temperature schedule τ(t), threshold θ
 1: for t ← 1 to T do
 2:   Set penalty coefficient:
 3:   if t ≤ E_w then λ_g ← 0
 4:   else λ_g ← λ_g^max · min(1, (t − E_w) / max(1, T − E_w))   ▷ linear ramp (example)
 5:   end if
 6:   for minibatch B ∼ D do
 7:     for all layers ℓ do
 8:       z^(ℓ)(x) ← GateNet^(ℓ)(h^(ℓ−1)(x))
 9:       p^(ℓ)(x) ← σ(z^(ℓ)(x) / τ(t))
10:       g^(ℓ)(x) ← 𝟙[p^(ℓ)(x) > θ]   ▷ forward hard gate
11:       h^(ℓ)(x) ← g^(ℓ)(x) ⊙ φ(W^(ℓ) h^(ℓ−1)(x) + b^(ℓ))
12:     end for
13:     L_task ← TaskLoss(B)
14:     Batch-average gate usage:
15:     for all layers ℓ: p̄_i^(ℓ) ← (1/|B|) Σ_{x∈B} p_i^(ℓ)(x)
16:     L_g ← Σ_ℓ (1/n_ℓ) Σ_{i=1}^{n_ℓ} p̄_i^(ℓ)
17:     J ← L_task + λ_g L_g
18:     BackwardUpdate(J; Θ, Φ) using STE ∂g/∂z ≈ ∂p/∂z
19:   end for
20: end for

Algorithm 2 summarizes RigL's dynamic sparse rewiring.
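The `BackwardUpdate` step of Algorithm 1 backpropagates through the hard gate using the soft-gate slope p(1 − p)/τ from Eq. (13). As a quick sanity check (our own sketch, not part of the training code), the surrogate matches the finite-difference slope of the soft gate, even though the hard gate itself has zero gradient almost everywhere:

```python
import numpy as np

def ste_surrogate(z, tau=1.0):
    """Surrogate gradient: d g / d z ≈ d p / d z = p (1 - p) / tau (Eq. 13)."""
    p = 1.0 / (1.0 + np.exp(-z / tau))
    return p * (1.0 - p) / tau

def numeric_slope(z, tau=1.0, eps=1e-6):
    """Central finite difference of the soft gate p = sigmoid(z / tau)."""
    p = lambda t: 1.0 / (1.0 + np.exp(-t / tau))
    return (p(z + eps) - p(z - eps)) / (2.0 * eps)
```

The surrogate is largest near z = 0 (where p ≈ 0.5) and vanishes as the sigmoid saturates, which is one reason an overly low τ can stall gate learning, as discussed in Section 4.1.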
Algorithm 2: RigL dynamic sparse rewiring (Dynamic Sparse Training)
Input: data loader D, weights Θ = {W^(ℓ)}, masks M = {m^(ℓ)}, total steps S, sparsity s_ℓ, update period ΔT, rewires per update K_ℓ
 1: Initialize: for each layer ℓ, create a random sparse mask satisfying ‖m^(ℓ)‖_0 = (1 − s_ℓ) n_{ℓ−1} n_ℓ
 2: for t ← 1 to S do
 3:   Sample minibatch B ∼ D
 4:   Forward: compute with sparse weights W̃^(ℓ) ← W^(ℓ) ⊙ m^(ℓ)
 5:   Compute L_task and then compute ∇_W L_task
 6:   Update Θ with an optimizer   ▷ connections with mask 0 are excluded from updates
 7:   if t mod ΔT = 0 then
 8:     for all layers ℓ do
 9:       Prune: remove the K_ℓ active connections with the smallest |W_ij^(ℓ)|
10:       Grow: add the K_ℓ inactive connections with the largest |∇_{W_ij^(ℓ)} L_task|
11:       Update mask: set pruned indices to 0 and grown indices to 1
12:       Check consistency: keep ‖m^(ℓ)‖_0 fixed (preserve the number of connections)
13:     end for
14:   end if
15: end for

Algorithm 3 summarizes the fused training of DynamicGate-MLP with RigL.
Algorithm 3: DynamicGate-MLP + RigL fused training (Gated Dynamic Sparse Training)
Input: data loader D, weight parameters Θ = {W^(ℓ), b^(ℓ)}, gate parameters Φ, sparse masks M = {m^(ℓ)}, total steps S, total epochs T, warmup epochs E_w, maximum gate penalty λ_g^max, temperature schedule τ(e), threshold θ, RigL update period ΔT, rewires per update K_ℓ
 1: Initialize: create sparse masks satisfying ‖m^(ℓ)‖_0 = (1 − s_ℓ) n_{ℓ−1} n_ℓ for each layer ℓ
 2: for e ← 1 to T do
 3:   Set gate penalty coefficient:
 4:   if e ≤ E_w then λ_g ← 0
 5:   else λ_g ← λ_g^max · min(1, (e − E_w) / max(1, T − E_w))
 6:   end if
 7:   for t ← 1 to S do
 8:     Sample minibatch B ∼ D
 9:     (1) Sparse forward: W̃^(ℓ) ← W^(ℓ) ⊙ m^(ℓ)(t)
10:     (2) Gating forward: for all layers ℓ:
11:       z^(ℓ)(x) ← GateNet^(ℓ)(h^(ℓ−1)(x; t));  p^(ℓ)(x) ← σ(z^(ℓ)(x) / τ(e));  g^(ℓ)(x) ← 𝟙[p^(ℓ)(x) > θ]
12:     (3) Fused forward: for all layers ℓ:
13:       h^(ℓ)(x; t) ← g^(ℓ)(x) ⊙ φ(W̃^(ℓ)(t) h^(ℓ−1)(x; t) + b^(ℓ))
14:     Compute L_task
15:     Batch-average gate usage: p̄_i^(ℓ) ← (1/|B|) Σ_{x∈B} p_i^(ℓ)(x)
16:     L_g ← Σ_ℓ (1/n_ℓ) Σ_{i=1}^{n_ℓ} p̄_i^(ℓ);  J ← L_task + λ_g L_g
17:     (4) Parameter update: learn gates with STE (backward uses p) and update (Θ, Φ) with an optimizer
18:     (5) RigL rewiring (periodic): if t mod ΔT = 0 then
19:       for all layers ℓ: prune the K_ℓ active connections with the smallest |W_ij^(ℓ)|; grow the K_ℓ inactive connections with the largest |∇_{W_ij^(ℓ)} L_task|; update m^(ℓ)(t) and keep ‖m^(ℓ)‖_0 fixed
20:     end if
21:   end for
22: end for

5. Experimental Results

Due to limited resources, all experiments were run in Google Colab, and we therefore evaluated primarily on relatively small datasets. The datasets used in this paper are as follows.
• MNIST: a standard toy dataset of handwritten digits.
• CIFAR-10: a small natural image dataset (Krizhevsky, 2009).
• Tiny ImageNet: a reduced version of ImageNet with 200 classes, for natural image classification at 64×64 resolution.
• Speech Commands: a keyword-spotting dataset that classifies short (about 1-second) speech clips into a limited vocabulary.
• PBMC3k: a single-cell RNA-seq dataset of human peripheral blood mononuclear cells (PBMC), widely used to classify/cluster cell types from high-dimensional gene-expression vectors.

To demonstrate that DynamicGate-MLP can reduce computation, we selected diverse datasets and report several key results. Recently, interest in GPU-efficient models has increased; for example, the emergence of DeepSeek raised public awareness that LLM-level accuracy might be achievable with fewer GPUs under certain conditions. We have also seen online reports that OpenAI's ChatGPT-4.0 and France-based Mistral AI adopt MoE-style models. Motivated by this trend, we additionally compared DynamicGate-MLP to a Switch-MoE-style MLP on MNIST. The following experiments highlight learned structural-dropout behavior and conditional execution characteristics. More detailed descriptions of all experiments and datasets are provided in Appendix A.

Table 1: Overview of datasets used in this paper

Dataset         Domain     Dimensionality                   Training Set   Test Set
MNIST           Vision     28×28 grayscale (784)            60,000         10,000
CIFAR-10        Vision     32×32 RGB (3,072)                50,000         10,000
Tiny ImageNet   Vision     200 classes, 64×64 RGB (12,288)  100,000        10,000
Speech          Speech     2,624 (= 64×41)                  86,843         11,305
PBMC3k          Genomics   2,000 (HVG)                      100,000        10,000

5.1 MNIST comparison

Table 2 shows model accuracy and relative compute (proxy) comparison. The compute reduction ratio is computed relative to the Baseline.
Table 2: Model comparison (Accuracy / Params / FLOPs)

Model         Accuracy (%)  Param Reduction (%)  FLOPs Reduction (%)  Params   FLOPs
Baseline MLP  98.07         0.0                  0.000000             203,530  406,528
Dropout MLP   98.31         0.0                  0.000000             203,530  406,528
Pruned MLP    98.02         0.0                  29.622314            203,530  286,105
Dynamic MLP   98.07         0.0                  21.711912            204,570  318,263

Figure 6: Model comparison plots on MNIST.

Interpretation. DynamicGate-MLP matches the Baseline accuracy (98.07%) while reducing proxy compute by about 21.71%. The parameter count increases slightly due to the gate logits and related parameters. The average activation of the input gate (the hard ON ratio) is lower than that of the hidden gate, suggesting that most of the savings come from the first large FC operation (e.g., 784 → 256) in the fan-in-dominant region. Meanwhile, the hidden gate remains mostly open, which is favorable for preserving accuracy. This can be interpreted as a trade-off: block unnecessary input dimensions aggressively while keeping representational capacity in the hidden layer. Pruning achieves a larger reduction ratio but slightly decreases accuracy. We verified that further hyperparameter tuning can push the reduction beyond the Pruned setting, but in this experiment we prioritized accuracy. DynamicGate-MLP provides stable accuracy retention even when the reduction ratio is moderate, and it offers interpretability as conditional execution through input-dependent activation patterns.

5.2 CIFAR-10

Table 3: Accuracy and relative FLOPs (proxy) comparison

Model     Accuracy (%)  FLOPsrel  Notes
Baseline  43.30         1.000
Dropout   41.45         1.000
Pruned    48.90         0.941
Dynamic   43.29         0.843     Open rate L1/L2: 0.94/0.29

Figure 7: Model comparison on CIFAR-10.

Table 3 compares accuracy and relative compute (FLOPs) across model variants. Baseline serves as the reference point with accuracy 43.30% and FLOPsrel of 1.000.
With Dropout, accuracy decreases to 41.45% (−1.85%p), while inference compute remains the same (FLOPsrel = 1.000). This suggests that, under this setting, dropout regularization did not translate into better generalization. In contrast, the Pruned model achieves the highest accuracy at 48.90% (+5.60%p over Baseline) with a slight compute reduction (FLOPsrel = 0.941, about 5.9%). This can be interpreted as pruning suppressing redundant capacity and improving generalization by removing redundant connections/representations. Gate-MLP achieves 43.29% accuracy, nearly identical to Baseline (−0.01%p), while reducing relative FLOPs to FLOPsrel = 0.843, i.e., about a 15.7% compute reduction. Thus, Gate-MLP is an efficiency-oriented alternative that meaningfully reduces compute with minimal accuracy loss. Additionally, at threshold 0.5, the gate open-rates are measured as L1 = 93.9% and L2 = 28.9%. This indicates that the first layer remains active for most inputs while the second layer is selectively activated, and that most compute savings arise from selective computation in the deeper layer (L2). In summary, Dropout did not improve accuracy in this setting, whereas pruning showed both an accuracy gain and a modest compute reduction, suggesting that suppressing excess capacity may have benefited generalization.
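The open-rate numbers interpreted above come directly from thresholded gates. The following minimal sketch shows inference-time unit gating and how a per-layer open rate is measured; the shapes, seed, and single-matrix gate head are illustrative stand-ins, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def gated_layer(x, W, b, W_gate, b_gate, threshold=0.5):
    """One MLP layer with input-dependent unit gating at inference:
    a small gate head produces per-unit probabilities, which are
    thresholded into a hard 0/1 execution mask."""
    p = 1.0 / (1.0 + np.exp(-(x @ W_gate + b_gate)))  # gate probabilities
    g = (p > threshold).astype(x.dtype)               # hard execution mask
    h = np.maximum(x @ W + b, 0.0)                    # dense ReLU layer
    return g * h, g

# Toy layer: 32 inputs -> 64 hidden units, batch of 128 samples.
x = rng.normal(size=(128, 32))
W, b = rng.normal(size=(32, 64)) * 0.1, np.zeros(64)
W_gate, b_gate = rng.normal(size=(32, 64)) * 0.1, np.zeros(64)

h, g = gated_layer(x, W, b, W_gate, b_gate)
open_rate = g.mean()            # the "gate open-rate" reported in the tables
assert 0.0 <= open_rate <= 1.0
assert np.all(h[g == 0] == 0)   # closed units contribute nothing downstream
```

Note that multiplying by g only zeroes values; as discussed in Section 7, real speed gains additionally require skipping the corresponding rows/columns structurally.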
5.3 Tiny ImageNet

Table 4: Model comparison (Accuracy / Params / FLOPs)

Model         Accuracy (%)  Param Reduction (%)  FLOPs Reduction (%)  Params     FLOPs
Baseline MLP  3.24          0.0                  0.000                3,197,384  6,393,856
Dropout MLP   2.51          0.0                  0.000                3,197,384  6,393,856
Pruned MLP    4.28          0.0                  29.519               3,197,384  4,506,418
Dynamic MLP   2.97          0.0                  80.088               3,197,640  1,273,117

On the more complex ImageNet-like dataset, all MLP variants achieve low absolute accuracy. However, DynamicGate-MLP slightly outperforms Dropout and achieves a large proxy compute reduction (about 80%) relative to the Baseline.

5.4 Speech Commands (keyword spotting)

Data and preprocessing (reproducibility). Speech Commands is an audio dataset; after preprocessing, each sample becomes a 2D feature map (log-mel spectrogram) with shape 40×98. We used the standard (official) 12-class mapping to construct the classification task. We used 40 mel bins, an STFT window of 30 ms and a hop of 10 ms, and normalized each sample by per-sample standardization.

Model and training. We used a lightweight MLP-based classifier on top of the flattened feature input. Gates were inserted into the hidden layers, and the gate penalty was tuned to achieve a target activation ratio without harming accuracy. DynamicGate achieves accuracy comparable to the Baseline while reducing proxy compute (Compute Proxy/RelMAC). This suggests that, even in non-image modalities, input-dependent gating can suppress unnecessary computation.

Interpretation. Speech Commands has inputs with heterogeneous difficulty and redundancy (e.g., silence vs. speech segments).
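The RelMAC metric used in this subsection weights each layer's gate open rate by that layer's share of the dense MAC count. A minimal sketch of one plausible way to compute such a layer-weighted relative MAC follows; the exact weighting used in the paper's code may differ.

```python
def rel_mac(open_rates, layer_macs):
    """Layer-weighted relative MAC proxy: each layer's gate open rate
    is weighted by that layer's share of the dense MAC count."""
    total = sum(layer_macs)
    return sum(r * m for r, m in zip(open_rates, layer_macs)) / total

# Fully open gates reproduce the dense cost; closing gates reduces the
# proxy in proportion to the affected layer's MAC share.
dense = rel_mac([1.0, 1.0], [800_000, 200_000])
gated = rel_mac([0.9, 0.5], [800_000, 200_000])
assert dense == 1.0
assert abs(gated - (0.9 * 0.8 + 0.5 * 0.2)) < 1e-12
```

Under this weighting, high open rates in MAC-heavy early layers dominate the proxy, which is consistent with the small RelMAC reduction observed when L1 stays mostly open.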
Table 5: Speech Commands preprocessing parameters (reproducibility)

Item                         Value
Sample rate                  16 kHz
Clip length                  1.0 s (fixed length, 16,000 samples)
STFT window / hop            30 ms / 10 ms
Mel bins                     40
Feature                      log-mel spectrogram (shape: 40×98)
Augmentation                 time shift (100 ms), additive noise (optional)
Train/valid/test split       official split
Batch size                   256
Normalization                per-sample standardization: (x − µ)/(σ + ε), ε = 1e−5
Input dimension (flattened)  64×41 = 2,624
Label space                  12-class (mapped): {yes, no, up, down, left, right, on, off, stop, go, _silence_, _unknown_}

Table 6: Model comparison on Speech Commands

Model           Val Acc (best)  Test Acc  Gate open (g1/g2)  RelMAC  MACs (baseline → gated)
BaselineMLP     0.8762          0.8656    –                  1.000   973,824 → 973,824
DynamicGateMLP  0.8739          0.8654    0.924 / 0.820      0.989   973,824 → 963,287

DynamicGate-MLP can allocate compute more selectively by closing units for easy or redundant inputs, while matching the Baseline in test accuracy (0.8656 → 0.8654; best validation accuracy is also similar: 0.8762 vs. 0.8739). However, as discussed in Section 6, the degree of real speedup depends on whether the runtime can exploit the sparsity pattern (e.g., block-wise structured kernels). At threshold 0.5, the open-rates are L1 = 92.4% and L2 = 82.0%, meaning that about 7.6% (L1) and 18.0% (L2) of units are deactivated on average. In this paper, we therefore focus on controllable activation patterns and proxy compute reduction as a consistent measure across environments. In this configuration, RelMAC = 0.989 (only about a 1.1% reduction) because the open-rates remain relatively high; stronger reduction would require a higher penalty or a higher threshold, but with increased collapse risk. Further improvements are expected by introducing block-/channel-wise gating or routing-friendly implementations.

5.5 MoE vs. DynamicGate-MLP

We compare DynamicGate-MLP to a SwitchMoE-style MLP on MNIST to understand differences between within-layer unit gating and expert routing.

Table 7: Epoch-wise comparison of DynamicGateMLP vs. SwitchMoE_MLP (MNIST snapshots)

      DynamicGateMLP                        SwitchMoE_MLP
ep    acc     loss      main      aux       acc     loss      main      aux       drop
21    0.9795  0.010982  0.010823  0.000159  0.9787  0.025858  0.018638  0.007220  44
22    0.9806  0.007150  0.006996  0.000155  0.9815  0.021875  0.014871  0.007004  13
23    0.9815  0.007374  0.007222  0.000152  0.9815  0.020910  0.014248  0.006663  0

(continued)

      DynamicGateMLP        SwitchMoE_MLP
ep    active_g  corr(h,a)   entropy   cv        MI        cap
21    0.377229  -0.001672   2.044537  0.266152  0.871885  63.291139
22    0.367201  0.002379    2.043974  0.265866  0.899625  63.291139
23    0.359287  -0.006189   2.051539  0.229780  0.848144  63.291139

Table 8: Performance and efficiency averaged over multiple random seeds (mean ± std)

Model             Acc (%)       Macro-F1 (%)  MACs (×10^6)   MACs Red. (%)  Time (s)
Baseline          91.83 ± 1.49  93.29 ± 1.77  2.505 ± 0.000  0.00 ± 0.00    7.20 ± 0.24
Dropout           92.48 ± 1.07  94.22 ± 1.11  2.505 ± 0.000  0.00 ± 0.00    7.75 ± 0.57
Pruned            92.17 ± 1.46  93.95 ± 1.36  1.252 ± 0.000  50.00 ± 0.00   3.31 ± 0.38
DynamicGate       92.57 ± 0.89  94.18 ± 0.78  0.988 ± 0.446  60.57 ± 17.82  8.32 ± 0.26
RigL-only         93.33 ± 1.04  94.18 ± 2.06  0.629 ± 0.000  74.87 ± 0.00   9.61 ± 0.53
DynamicGate+RigL  92.43 ± 1.34  93.43 ± 2.41  0.541 ± 0.033  78.41 ± 1.33   11.24 ± 0.30

Interpretation. Across all epochs, DynamicGate-MLP achieves accuracy comparable to or better than SwitchMoE while using fewer effective compute resources under the proxy metric (e.g., at epoch 23, both reach 0.9815). Notably, SwitchMoE shows unstable behavior early in training (accuracy collapsing to around 10% in several epochs), whereas DynamicGate-MLP remains stable. When including the compute proxy, DynamicGate-MLP also shows a meaningful reduction ratio.
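For reference, the routing machinery that a SwitchMoE-style model adds on top of an MLP can be sketched as follows: a toy top-1 router with a Switch-style load-balancing auxiliary loss and a capacity limit that drops overflow tokens. All dimensions, names, and the capacity factor are illustrative assumptions, not the configuration used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d, n_experts = 64, 16, 4

h = rng.normal(size=(n_tokens, d))          # token representations
W_router = rng.normal(size=(d, n_experts))  # stand-in router weights

logits = h @ W_router
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)   # softmax over experts
choice = probs.argmax(axis=1)               # top-1 routing decision

# Switch-style auxiliary load-balancing loss: n_experts * sum_e f_e * p_e,
# where f_e is the fraction of tokens routed to expert e and p_e is the
# mean router probability for expert e.
counts = np.bincount(choice, minlength=n_experts)
f = counts / n_tokens
p_mean = probs.mean(axis=0)
aux_loss = n_experts * float(f @ p_mean)

# Capacity limit: tokens routed beyond an expert's capacity are dropped,
# analogous to the train_dropped counts in Table 7.
capacity = int(1.25 * n_tokens / n_experts)
dropped = int(sum(max(0, c - capacity) for c in counts))
```

This is the extra machinery (auxiliary loss, capacity accounting, dropped tokens) that within-layer gating avoids entirely.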
This suggests that, in this small MLP setting, gating within a single expert (unit-/block-level conditional execution) can be a simpler and more stable alternative to routing among multiple experts. For DynamicGate-MLP, the average active-gate ratio (avg_active_ratio_g) gradually decreases as training progresses (0.377 → 0.367 → 0.359), indicating convergence toward sparser activation patterns. In practice, the trade-off depends on the target scale and implementation: MoE can scale capacity by adding experts, while DynamicGate focuses on selectively executing parts of a fixed network. For SwitchMoE_MLP, training uses an auxiliary routing loss (train_aux) in addition to the main loss (train_main); the auxiliary loss is non-negligible (e.g., train_aux = 0.007220 at epoch 21). As training proceeds, the number of tokens/samples dropped due to capacity constraints (train_dropped) decreases from 44 → 13 → 0, suggesting that routing stabilizes over time. DynamicGate-MLP's hardness–activation correlation (corr_hardness_active) stays near zero (about −0.0017 to 0.0024), implying that, in this setup, gating does not strongly correlate with the chosen hardness measure. Overall, DynamicGate-MLP provides a simpler, more stable training path in this small setting, while MoE introduces additional routing dynamics (auxiliary loss, capacity/dropped samples) that can add instability early in training.

5.6 PBMC3k: MACs–Accuracy Pareto (mean ± std over seeds)

MACs denote the compute proxy (multiply–accumulate operations), and MACs reduction is reported relative to the Baseline. Time is the measured wall-clock runtime under the same experimental setting.
Figure 8: Pareto plot of classification accuracy versus compute reduction on PBMC3k.

Interpretation. In Fig. 8, each point denotes a model variant, and the Pareto frontier highlights the non-dominated trade-offs between predictive accuracy and computational efficiency, i.e., the models for which no other variant achieves both higher accuracy and greater compute reduction simultaneously. Models located on or near this frontier provide the most favorable accuracy–efficiency balance, whereas points below the frontier are comparatively suboptimal. In Table 8, we compare predictive performance (Acc, Micro-F1, Macro-F1) and efficiency (seconds, MACs reduction) across three random seeds. Dropout slightly improves Acc/Micro-F1/Macro-F1 compared to Baseline but yields no structural compute reduction (MACs reduction 0.00%). Therefore, in this setting, dropout's benefit appears mainly as regularization (mitigating overfitting) rather than as conditional computation. Pruned achieves a 50.00% MACs reduction with little performance degradation and also records the fastest wall time (3.31 s), showing that static structured sparsity can translate into practical gains. By contrast, DynamicGate achieves the largest gain in Macro-F1 (93.90) and reduces MACs by 60.57% on average; however, its wall time is slower than Baseline (8.24 s), likely due to gating overhead and limited kernel-level skipping. RigL-only records strong efficiency (74.87% MACs reduction) while achieving the best accuracy (93.33%), indicating that dynamic structural sparsity can improve both accuracy and the compute proxy. In particular, RigL-only provides a favorable accuracy–efficiency point among the compared models. However, wall time does not perfectly match the MACs proxy: despite large MACs reductions, RigL variants can be slower due to overheads in sparse execution and memory/launch costs.
This mismatch can be explained by (i) overhead from gating/masking, (ii) dynamic sparse-structure management costs (e.g., mask updates), and (iii) limited sparse-kernel optimization in general backends. DynamicGate+RigL combines both methods and attains the largest MACs reduction (78.41%), showing the complementary effect of functional and structural sparsity. Nevertheless, it is the slowest in wall-clock time (11.24 s) in this prototype implementation, highlighting the need for hardware-/kernel-aware realizations (e.g., block-structured sparsity) for real speed gains. Thus, proxy compute reduction should be interpreted as potential efficiency, and deploying it effectively requires implementation support. Finally, DynamicGate shows relatively large variance in MACs reduction (60.57% ± 17.82%), meaning that the active-gate ratio varies significantly with input/seed. If inference-time compute stability is important, this variance can be reduced via additional regularization and smoother scheduling (e.g., for λ_g and τ). In addition, the Pareto chart in Fig. 8 shows that even DynamicGate alone yields a lower average MACs (0.988 ×10^6) than Pruned (1.252 ×10^6) while achieving higher accuracy (92.57% vs. 92.17%). When RigL and DynamicGate are combined, the average MACs is the best at 0.541 ×10^6, and accuracy remains high at 92.43%, above the Baseline (91.83%) although slightly below RigL-only. These results suggest that, depending on the experiment and input conditions, DynamicGate alone or its combination with RigL can produce strong results. Because of the limits of repeated experiments, more extensive sweeps are needed to tune parameters; we leave this to future work.

6. Discussion

6.1 Connecting dropout, pruning, and routing

Dropout samples many subnetworks during training, while pruning selects a single static subnetwork.
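This distinction between the three regimes can be made concrete in a toy sketch; all sizes, the seed, and the single-matrix gate parameterization below are illustrative, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden = 8
x_batch = rng.normal(size=(4, 5))  # 4 samples, 5 input features

# 1) Dropout: a fresh random mask at every training step (regularization
#    only; inference runs the full dense network).
drop_mask = (rng.random(n_hidden) > 0.5).astype(float)

# 2) Pruning: one static mask chosen once and shared by every input.
prune_mask = np.zeros(n_hidden)
prune_mask[:4] = 1.0

# 3) Learned gating (DynamicGate-style): the mask is a deterministic
#    function of the input, so different samples execute different units.
#    W_gate is a stand-in for the trained gate parameters.
W_gate = rng.normal(size=(5, n_hidden))
p = 1.0 / (1.0 + np.exp(-x_batch @ W_gate))  # gate probabilities
gate_mask = (p > 0.5).astype(float)          # hard mask at threshold 0.5

# Dropout and pruning masks ignore the input; the learned gate does not.
assert drop_mask.shape == prune_mask.shape == (n_hidden,)
assert gate_mask.shape == (4, n_hidden)      # one mask per sample
```

The per-sample mask in case 3 is what gives learned gating its routing-like character while staying inside a single MLP.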
DynamicGate-MLP forms a family of subnetworks via learned gates and can select among them depending on the input, which gives it routing-like properties [12, 13]. The key distinction of this paper is that we do not rely on large-scale MoE; instead, we provide input-dependent gating in a simple MLP that remains interpretable from the dropout perspective, together with explicit budget control (penalty / Top-k).

6.2 Compute proxies vs. deployment reality

Even if Compute Proxy/RelMAC decreases, real speedup is not automatically guaranteed. To achieve wall-clock acceleration, one typically needs (i) block-/channel-wise structured gating, (ii) sparse kernels, and/or (iii) routing-based implementations. This paper focuses on learning controllable activation patterns rather than claiming direct speedups [17, 14, 13].

6.3 Extending toward structural plasticity: gating + rewiring

DynamicGate-MLP implements selective activation via functional gating. To extend this toward structural change, one can combine grow-and-prune methods:

• RigL-style growth: remove inactive connections and grow new connections at large-gradient locations [30].
• SET-style evolution: maintain a sparse structure while repeatedly pruning and regrowing connections [20].

6.4 Continual learning and forgetting

Gating encourages the selection of different sparse subnetworks depending on input/task conditions, and rewiring enables capacity reallocation. This can potentially be interpreted as a structural/functional separation mechanism that alleviates catastrophic forgetting in continual learning [24, 25, 26]. Here, "alleviating catastrophic forgetting" means that when learning a new task (Task B), the model is less likely to suddenly and severely lose performance on a previously learned task (Task A).
This happens because neural networks typically reuse and overwrite the same parameters while learning new tasks (parameter interference), which can break representations needed for old tasks.

• Gating (DynamicGate): by making the active neurons/paths depend on the task/input, it reduces the overlap between the paths used by Task A and Task B, which means less mutual interference and less damage to older knowledge.
• Rewiring (RigL-like): by reorganizing connections during training, it can allocate separate capacity for the new task or preserve important connections, reducing the need to overwrite existing connections when learning the new task.

Therefore, "alleviating forgetting" means that the drop in Task A accuracy after learning Task B becomes smaller.

7. Limitations

This work shows that gating can reduce average compute (Compute Proxy/RelMAC), but it does not guarantee that the compute reduction translates directly into lower wall-clock latency, for the following reasons.

• Lack of sparse kernels / backend optimization: on GPUs and general CPUs, dense matmul is heavily optimized. Therefore, even if some units are deactivated, computation may still be performed unless the backend provides a sparse kernel that truly skips those operations.
• Masking/gating overhead: additional costs arise from computing gate probabilities (GateNet), threshold comparisons, applying masks, and conditional branching. In particular, sample-wise gating changes activation patterns across batches, which can reduce kernel and vectorization efficiency; in such cases, wall-clock time may even increase.
• Memory/launch overhead may dominate: for small MLPs or small batch sizes, memory movement and kernel launch/synchronization overhead can dominate arithmetic operations. In these regimes, reducing arithmetic operations may yield limited perceived speedup.
• Implementations that only make values zero (not structural sparsity): if deactivation is implemented as a simple multiplication such as h ← h ⊙ g, the next layer's matmul can still run densely, so real computation may not decrease. To obtain speed gains, gates must be reflected structurally in the computation graph or implemented in hardware-friendly forms such as block-/channel-wise structured sparsity.

For these reasons, rather than directly measuring FLOPs/latency and claiming "acceleration," this paper reports compute reduction conservatively using hardware-/backend-agnostic proxy metrics, namely Compute Proxy and RelMAC. Additional limitations include:

• Hyperparameter sensitivity: λ_g, τ, and θ can cause gate collapse or under-/over-activation depending on their settings; stable training typically requires warmup and smooth scheduling.
• Scale/generalization: the current evaluation focuses on small MLP settings. Extending to Transformers (FFN/attention) requires additional validation, including routing, sparse kernels, and batch-efficiency issues [16, 13].

8. Future Work

• Real latency measurement with block/channel gating and sparse kernels (or routing-friendly execution).
• Extension to Transformers (FFN/attention) with head/block gating and comparison to sparse routing methods [13].
• Quantifying forgetting metrics on continual-learning benchmarks and testing whether gating reduces interference [24, 25].
• Studying the coupling of fast functional gating and slow structural rewiring (gating + RigL/SET) under various time-scale schedules [30, 20].
• Additional experiments on gene-expression and other high-dimensional tabular/omics datasets.
• Block-structured sparsity, compaction (reordering active blocks), and efficient GPU execution via block-GEMM kernels.
9. Conclusion

We proposed DynamicGate-MLP, a unified framework that bridges dropout-style regularization and input-dependent conditional computation via learned gating. By introducing gate probabilities, thresholded hard masks, and an expected gate-usage penalty, the model can control a compute budget while maintaining accuracy. Across MNIST, CIFAR-10, Tiny ImageNet, Speech Commands, and PBMC3k, DynamicGate-MLP demonstrates meaningful reductions in proxy compute metrics with competitive performance. We further showed that combining DynamicGate with RigL-style dynamic sparse rewiring can yield complementary functional and structural sparsity, improving the accuracy–efficiency trade-off. Future work will focus on hardware-/kernel-aware implementations and extensions to larger architectures.

Acknowledgements

This study employed a generative AI tool (OpenAI ChatGPT Plus) in an iterative prototyping workflow to assist with the implementation of experimental code and the generation of experimental data. The tool was mainly used for code-structure design, repetitive implementation, debugging assistance, correction of XeLaTeX errors, and the search and comparison of related literature. However, the experimental design, data-generation procedures, interpretation of results, and final conclusions were directly verified by the author, who assumes full responsibility for them. It should be noted that some automatically generated processes have limitations in that their internal reasoning steps are not always fully explicit; accordingly, these outputs were evaluated primarily in terms of reproducibility and validation of results. The author also sincerely thanks Professor In-Joong Kim of Handong Global University for recommending the submission of the author's first MLP-related paper to arXiv.

Code and License

The reference implementation associated with this work is made available under the Apache License 2.0.
For commercial or enterprise use involving proprietary optimizations, deployment toolchains, or hardware-specific runtime integrations, separate commercial licensing terms may apply.

References

[1] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15:1929–1958, 2014.
[2] L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus. Regularization of neural networks using DropConnect. In ICML, 2013.
[3] D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparameterization trick. In NeurIPS, 2015.
[4] Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.
[5] Y. Gal, J. Hron, and A. Kendall. Concrete dropout. In NeurIPS, 2017.
[6] Y. Bengio, N. Léonard, and A. Courville. Estimating or propagating gradients through stochastic neurons. arXiv:1305.2982, 2013.
[7] C. Louizos, M. Welling, and D. P. Kingma. Learning sparse neural networks through L0 regularization. In ICLR, 2018.
[8] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with Gumbel-Softmax. In ICLR, 2017.
[9] C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. In ICLR, 2017.
[10] Y. Bengio, N. Boulanger-Lewandowski, and R. Pascanu. Advances in optimizing recurrent networks. In ICASSP, 2013.
[11] A. Graves. Adaptive computation time for recurrent neural networks. arXiv:1603.08983, 2016.
[12] N. Shazeer, A. Mirhoseini, K. Maziarz, et al. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In ICLR, 2017.
[13] W. Fedus, B. Zoph, and N. Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. JMLR, 23(120):1–39, 2022.
[14] S. Han, H. Mao, and W. J. Dally.
Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv:1510.00149, 2015.
[15] J. Frankle and M. Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2019.
[16] V. Sanh, T. Wolf, and A. M. Rush. Movement pruning: Adaptive sparsity by fine-tuning. In NeurIPS, 2020.
[17] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. In ICLR, 2017.
[18] P. Warden. Speech Commands: A dataset for limited-vocabulary speech recognition. 2018.
[19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[20] D. C. Mocanu, E. Mocanu, P. Stone, et al. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 9:2383, 2018.
[21] H. Mostafa and X. Wang. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In ICML, 2019.
[22] T. Dettmers. Sparse networks from scratch: Faster training without losing performance. 2019.
[23] T. Gale, E. Elsen, and S. Hooker. The state of sparsity in deep neural networks. 2019.
[24] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, et al. Overcoming catastrophic forgetting in neural networks. PNAS, 114(13):3521–3526, 2017.
[25] F. Zenke, B. Poole, and S. Ganguli. Continual learning through synaptic intelligence. In ICML, 2017.
[26] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert. iCaRL: Incremental classifier and representation learning. In CVPR, 2017.
[27] A. Holtmaat and K. Svoboda. Experience-dependent structural synaptic plasticity in the mammalian brain. Nature Reviews Neuroscience, 10:647–658, 2009.
[28] Y. Zuo, G. Yang, E. Kwon, and W.-B. Gan. Long-term sensory deprivation prevents dendritic spine loss in the adult cortex.
Nature, 436:261–265, 2005.
[29] S. Fusi, P. J. Drew, and L. F. Abbott. Cascade models of synaptically stored memories. Neuron, 45(4):599–611, 2005.
[30] U. Evci, T. Gale, J. Menick, P. S. Castro, and E. Elsen. Rigging the lottery: Making all tickets winners. In Proceedings of Machine Learning and Systems (MLSys), 2020.

Appendix A. Reproducibility

This appendix provides concrete values and links to reproduce the experiments. For reproducibility, we provide Google Colab notebooks that can run the reported settings as-is. These notebooks are executable reference materials and do not replace the experimental descriptions in the paper.

A.1 Reproducibility checklist

A.1.1 Code and environment

All code is organized in a public GitHub repository; each dataset has its own folder and a corresponding Colab notebook.

A.1.2 Reproducibility artifacts

• Code repository: https://github.com/YongilChoi/DynamicGate_MLP_Model.git
• Experimental environment: Google Colab (Python 3.12.12, PyTorch 2.9.0+cpu, CUDA None, GPU: CPU)

A.1.3 Hardware

We used Google Colab GPUs (e.g., T4/A100 depending on availability). For CPU-only runs, results may differ in wall time, but proxy metrics remain comparable.

A.1.4 Key hyperparameters

• Optimizer: AdamW (learning rate and weight decay as in each experiment config).
• Training: epochs T, batch size, and random seeds (3 seeds) are reported per experiment; hardware and software versions are logged in the Colab notebooks.

A.1.5 Gate scheduling

In addition, due to Google Colab session interruptions (e.g., memory limits), we performed memory optimizations and various parameter tuning tailored to each dataset. We used a warmup (E_w) with λ_g = 0, followed by a gradual ramp of λ_g up to λ_g^max; the temperature τ is annealed smoothly. The threshold θ is fixed unless otherwise stated.
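The warmup-then-ramp penalty schedule matches the update rule in Algorithm 3 and can be written out directly. In the sketch below, the linear λ_g ramp follows Algorithm 3, while the exponential τ decay is only an illustrative choice, since the paper states only that τ is annealed smoothly.

```python
def gate_penalty(epoch, total_epochs, warmup_epochs, lambda_max):
    """Gate-usage penalty coefficient, as in Algorithm 3:
    zero during warmup, then a linear ramp up to lambda_max."""
    if epoch <= warmup_epochs:
        return 0.0
    frac = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return lambda_max * min(1.0, frac)

def temperature(epoch, tau_start=1.0, tau_end=0.1, decay=0.95):
    """Illustrative smooth annealing of the sigmoid temperature tau(e);
    the exact schedule is not specified in the paper."""
    return max(tau_end, tau_start * decay ** epoch)

# Warmup keeps the penalty off; afterwards it ramps linearly to lambda_max.
assert gate_penalty(3, total_epochs=20, warmup_epochs=5, lambda_max=0.1) == 0.0
assert gate_penalty(20, total_epochs=20, warmup_epochs=5, lambda_max=0.1) == 0.1
```

Delaying the penalty this way lets the task loss shape the features before sparsity pressure is applied, which is what makes warmup effective against early gate collapse.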
A.1.6 Compute Proxy and RelMAC

Compute Proxy is the average gate activation ratio; RelMAC weights the activation by layerwise MAC contributions (fan-in/out). We report both as hardware-agnostic proxy metrics.

A.2 Colab notebooks

Table 9: Colab source code links for each dataset

Dataset         Colab Source Code (URL)
MNIST           https://colab.research.google.com/drive/1PbKL4So4Vqel9VTN-SxFsEcHJGoo7aib?usp=sharing
CIFAR-10        https://colab.research.google.com/drive/1q8aA74ImyfW8QB55RrqImBDSOt2IUtbv?usp=sharing
Tiny ImageNet   https://colab.research.google.com/drive/19V6E8EHPjwdW-zgv1jnyxCgPiRT9_F3M?usp=sharing
SpeechCommands  https://colab.research.google.com/drive/1TZBecTZQxlEu2ME_q3Ta5mM_jgt-fQTZ?usp=sharing
PBMC3k (HVG)    https://colab.research.google.com/drive/1DffTQEO8Ctw0h8UQGL-RajQW6HtBKEjX?usp=sharing