Inhibitory normalization of error signals improves learning in neural circuits



Roy Henha Eyono 1,2,*, Daniel Levenstein 3, Arna Ghosh 4, Jonathan Cornford 5, Blake Richards 1,2,4,6,7,8

1 Mila-Quebec AI Institute, 2 School of Computer Science, McGill University, 3 Yale University, 4 Google, Paradigms of Intelligence Team, 5 Leeds University, 6 Department of Neurology & Neurosurgery, McGill University, 7 Montreal Neurological Institute, McGill University, 8 CIFAR Learning in Machines & Brains Program

* Correspondence: roy.eyono@mila.quebec

Keywords: Layer Normalization, Inhibition, Credit Assignment, Neural Networks

Abstract

Normalization is a critical operation in neural circuits. In the brain, there is evidence that normalization is implemented via inhibitory interneurons and allows neural populations to adjust to changes in the distribution of their inputs. In artificial neural networks (ANNs), normalization is used to improve learning in tasks that involve complex input distributions. However, it is unclear whether inhibition-mediated normalization in biological neural circuits also improves learning. Here, we explore this possibility using ANNs with separate excitatory and inhibitory populations trained on an image recognition task with variable luminosity. We find that inhibition-mediated normalization does not improve learning if normalization is applied only during inference. However, when this normalization is extended to include back-propagated errors, performance improves significantly. These results suggest that if inhibition-mediated normalization improves learning in the brain, it additionally requires the normalization of learning signals.

1 Introduction

Inhibitory plasticity has traditionally been studied as a means of maintaining excitation–inhibition (E–I) balance in neural circuits (van Vreeswijk and Sompolinsky, 1998; Vogels et al., 2011; Denève and Machens, 2016).
In this work, we examine a complementary function of inhibitory circuits: their role in normalization, in which inhibitory neurons scale the activity of excitatory neurons relative to nearby neurons (Carandini and Heeger, 2012). There are several examples in which inhibitory interneurons facilitate normalization in the brain. For instance, Atallah et al. (2012) showed that manipulating parvalbumin-positive interneurons in mouse V1 produces largely divisive and partially additive changes in pyramidal cell responses, suggesting that this class of interneurons can implement gain control consistent with normalization. Similarly, Carandini and Heeger (2012) described a feedforward normalization circuit in the fly antennal lobe, in which presynaptic local interneurons divisively scaled odor inputs.

Similar to the brain, normalization plays an important role in artificial neural networks (ANNs) (Wu and He, 2018). Layer normalization, which normalizes across units within the same layer (Ba et al., 2016), has become a key component of transformer-based architectures and recurrent neural network models (Vaswani et al., 2017; Ba et al., 2016), and it leads to significant improvements in learning, especially for sequence modeling tasks (Xiong et al., 2020).

While inhibitory circuits are critical for learning (Richards et al., 2010), it is unclear whether their importance for learning relates to their role in normalization. This raises a question: could normalization mediated by inhibitory circuits improve learning in the same way that layer normalization does in ANNs? To understand the potential role of inhibitory normalization in learning, we trained ANNs with separate excitatory and inhibitory populations (EI-networks) on a visual classification task. First, using hard-coded layer normalization, we found that adding layer normalization to the EI-network significantly improved model training in the face of luminance changes.
We then asked whether layer normalization mediated by inhibitory circuits could provide a similar boost in performance. To answer this, we trained the inhibitory cells in the network to perform layer normalization (I-normalization). We observed that, while I-normalization could successfully center and scale neural activity, it did not produce the same benefits for learning as the hard-coded layer normalization operation. Closer examination of the layer normalization operation revealed that its primary contribution to learning lay not in stabilizing activations, but in shaping gradients during backpropagation (Xu et al., 2019). This insight led us to implement an additional lateral inhibitory mechanism to normalize the back-propagated error signals in EI-networks. With this form of I-normalization, the EI-network was able to recapitulate the performance benefits of hard-coded layer normalization.

Altogether, our results support the idea that inhibition-mediated normalization could be one of the reasons that inhibition is important for learning in real neural networks. But our results also suggest that normalizing activity alone would not be sufficient. To obtain the benefits for learning, I-normalization would need to normalize not only the neural activity, but also any signals used for learning. This has interesting implications for inhibitory circuits in the brain that target the apical dendrites of pyramidal neurons, where learning signals may be received.

2 Results

2.1 Layer normalization improves perceptual invariance

To study how I-normalization could impact learning, we used an ANN that enforces Dale's principle by constraining each neuron to be either purely excitatory or purely inhibitory, so that all outgoing synaptic weights from the same neuron share the same sign (EI-networks; Fig. 1a).
Following prior work (Cornford et al., 2021), such networks have been shown to achieve performance comparable to standard ANNs when trained with gradient descent. In these networks, the activity of excitatory units at layer $\ell$ is governed by the interaction between a direct excitatory drive and an indirect, feedforward inhibitory pathway. The activity in the network is calculated as follows:

$$h^I_\ell = W^{IE}_\ell h^E_{\ell-1}, \qquad z_\ell = W^{EE}_\ell h^E_{\ell-1} - W^{EI}_\ell h^I_\ell, \qquad h^E_\ell = \phi(z_\ell + b_\ell),$$

where $h^E_\ell$ and $h^I_\ell$ represent the activity vectors of the excitatory and inhibitory populations at layer $\ell$, respectively. The weight matrices are defined as follows: $W^{EE}_\ell$ is the direct excitatory-to-excitatory connection, $W^{IE}_\ell$ projects activity from the previous excitatory layer to the current inhibitory population (feedforward inhibition), and $W^{EI}_\ell$ represents the inhibitory weights that subtractively modulate the excitatory drive. The term $b_\ell$ denotes the learnable bias vector, and $\phi(x) = \mathrm{ReLU}(x)$ is the non-linear activation function applied element-wise.

We trained the EI-network on a modified Fashion-MNIST categorization task with shifts in luminosity. We did this because we reasoned that normalization would be especially important for handling input distribution variability. Specifically, for each image, we created a series of luminance augmentations by translating the pixel values by a constant, $\Delta$, sampled within the threshold range $\Delta \sim \mathrm{Uniform}(-\epsilon, +\epsilon)$. This was done for every sampled image during both training and test time (see Methods). The variable $\epsilon$ then served as a hyperparameter to adjust the magnitude of the range of luminance shifts in the data distribution (Fig. 1b). Succeeding in this task requires perceptual invariance to changes in luminosity, a capability that animals routinely exhibit and which is critical for navigating a dynamic environment.
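The EI layer dynamics above can be sketched in a few lines of numpy. This is our own illustration, not the authors' code: the layer sizes and weight scales are arbitrary assumptions, and Dale's principle is enforced simply by drawing non-negative weight magnitudes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes: 784 inputs, 100 excitatory, 10 inhibitory units.
n_in, n_e, n_i = 784, 100, 10

# Dale's principle: all outgoing weights from a population share one sign,
# enforced here by sampling non-negative magnitudes.
W_EE = rng.uniform(0.0, 0.05, size=(n_e, n_in))   # excitatory -> excitatory
W_IE = rng.uniform(0.0, 0.05, size=(n_i, n_in))   # excitatory -> inhibitory
W_EI = rng.uniform(0.0, 0.05, size=(n_e, n_i))    # inhibitory -> excitatory (subtractive)
b = np.zeros(n_e)                                 # learnable bias (zero-initialized)

def ei_forward(h_prev):
    """One EI layer: feedforward inhibition subtracted from the excitatory drive."""
    h_I = W_IE @ h_prev                 # inhibitory population activity
    z = W_EE @ h_prev - W_EI @ h_I      # net excitatory pre-activation
    h_E = np.maximum(z + b, 0.0)        # phi = ReLU
    return h_E, h_I

h_E, h_I = ei_forward(rng.uniform(0, 1, n_in))
```

In a trained network these weights would be optimized under a non-negativity constraint; here they are fixed random draws purely to show the shape of the computation.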
Figure 1: Schematic of the excitatory-inhibitory (EI) network with layer normalization and the perceptual invariance task. a: Feedforward EI network architecture. Gray units represent the excitatory (E) population, and purple units represent the inhibitory (I) population. Outgoing synaptic weights share the same sign (E, +; I, −). b: To test perceptual invariance, a shift is applied to each individual image in the Fashion-MNIST dataset during both training and testing. For every image, a unique constant $\Delta$ is sampled within the threshold $|\Delta| < \epsilon$. The figure displays three example augmentations to illustrate how the shift varies across the allowable range.

We first examined the impact of hard-coded layer normalization in these EI-networks (Fig. 1a). Specifically, we applied a centering and scaling operation to the excitatory pre-activations in each layer in order to bring the pre-activation mean to zero and the variance to one:

$$h^E_\ell = \phi(\hat{z}_\ell), \qquad \hat{z}_\ell = \frac{z_\ell - \mu_\ell}{\sqrt{\sigma^2_\ell + c}}, \qquad \mu_\ell = \frac{1}{n_\ell}\sum_{i=1}^{n_\ell} z^i_\ell, \qquad \sigma^2_\ell = \frac{1}{n_\ell}\sum_{i=1}^{n_\ell} \left(z^i_\ell - \mu_\ell\right)^2,$$

where $n_\ell$ is the number of excitatory neurons in layer $\ell$, $c = 1 \times 10^{-5}$ is a small constant to prevent division by zero, and the bias term has been omitted for simplicity. We found that this layer normalization operation improved training, with the improvement being more pronounced for larger ranges of luminance shifts (Fig. 2a).

Figure 2: Layer normalization (LN) improves perceptual invariance in excitatory-inhibitory (EI) networks. a: Test accuracy (Acc %) comparison of EI networks with LN (x-axis) to those without LN (y-axis).
Data points represent performance across 30 hyperparameter combinations (layer widths and E/I learning rates) and four luminosity ranges ($\epsilon = 0, 0.25, 0.5, 0.75$). Points below the dashed diagonal line indicate cases where networks with LN performed better. b: Top-10 test accuracy comparison of an EI network with LN against an excitatory-only (E-only) network, also with LN. The box plots summarize the distribution across the same 30 hyperparameter combinations reported in panel a.

To assess the empirical contribution of the inhibitory units, we selectively ablated them before training. In this condition, only the excitatory weights were trained, with layer normalization alone preventing activity blow-up. Despite having fewer parameters than their EI-network counterparts, the E-networks with layer normalization showed no statistically significant difference in performance from the EI-networks with layer normalization (Fig. 2b), suggesting that, for this task, training the inhibitory units on the task loss (cross-entropy) provides little to no benefit when hard-coded layer normalization is present. In other words, inhibition does not meaningfully contribute to task performance under these conditions, suggesting that the inhibitory units' capacity could instead be directed toward implementing layer normalization.

2.2 Learned inhibition normalizes excitatory activity

We next asked whether hard-coded layer normalization could be removed entirely and replaced with layer normalization implemented by inhibitory neurons. To implement layer normalization with inhibitory neurons, we used two distinct inhibitory populations: one providing subtractive inhibition

Figure 3: Inhibitory populations learn to implement layer normalization of excitatory activity.
a: Schematic showing how the inhibitory circuit (purple) is trained locally via the $\mathcal{L}_{\text{I-Norm}}$ loss (purple lines) to normalize excitatory activity. Excitatory-to-excitatory weights are updated only by the task loss $\mathcal{L}_{\text{Task}}$ (dotted left arrow). Forward pass: inhibition performs subtractive (−) and divisive (÷) modulation. Backward pass: inhibitory gradients enforce layer-normalized excitatory statistics. b: Box-and-whisker plots of the first and second moments of excitatory activations. Each plot compares three conditions: No-Norm, subtractive-only I-Norm (sub), and I-Norm (as depicted in a). Each box plot summarizes model results aggregated across the sampled range of $\epsilon$ luminosity augmentations.

and the other providing divisive inhibition. This design accounts for the functional diversity of inhibitory subtypes, which can implement divisive or subtractive operations, or both, depending on the context (Wilson et al., 2012; Pouille et al., 2013; El-Boustani and Sur, 2014). We will refer to these networks as I-normalization (or I-Norm) networks. Their activity was calculated as follows:

$$h^E_\ell = \phi(z_\ell), \qquad z_\ell = \frac{W^{EE}_\ell h^E_{\ell-1} - W^{EI}_\ell h^I_\ell}{\sqrt{U^{EI}_\ell h^D_\ell}}, \qquad (1)$$

where $h^D_\ell = U^{IE}_\ell h^E_{\ell-1}$ represents the divisive inhibition population, $h^I_\ell = W^{IE}_\ell h^E_{\ell-1}$ represents the subtractive inhibition population, and $U^{X}_\ell$, $W^{X}_\ell$ represent the synaptic weights associated with each respective inhibitory population.

We then introduced a new normalization loss ($\mathcal{L}_{\text{I-Norm}}$), which was applied only to the inhibitory pathway weights ($W^{EI}$, $W^{IE}$, $U^{EI}$, $U^{IE}$), while the excitatory weights ($W^{EE}$) were trained solely with the cross-entropy loss from the Fashion-MNIST categorization task ($\mathcal{L}_{\text{Task}}$, see Methods). The normalization loss was calculated as:

$$\mathcal{L}_{\text{I-Norm}} = \left( \frac{1}{n} \sum_{i=1}^{n} h^E_i \right)^2 + \left( \frac{1}{n} \sum_{i=1}^{n} \left(h^E_i\right)^2 - 1 \right)^2.$$

This loss was designed to optimize the first and second moments of the excitatory unit activity ($h^E$).
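A minimal numpy sketch of the I-Norm forward pass (Eq. 1) and the normalization loss may help make the two inhibitory pathways concrete. The sizes and weight scales are hypothetical, and this is our own illustration rather than the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_e, n_i = 784, 100, 10   # hypothetical sizes

# Non-negative weights for the two inhibitory pathways and the excitatory drive.
W_EE = rng.uniform(0.0, 0.05, size=(n_e, n_in))
W_IE = rng.uniform(0.0, 0.05, size=(n_i, n_in))   # input to subtractive population
W_EI = rng.uniform(0.0, 0.05, size=(n_e, n_i))
U_IE = rng.uniform(0.0, 0.05, size=(n_i, n_in))   # input to divisive population
U_EI = rng.uniform(0.0, 0.05, size=(n_e, n_i))

def inorm_forward(h_prev):
    """I-Norm layer: subtractive inhibition in the numerator,
    divisive inhibition in the denominator (Eq. 1)."""
    h_I = W_IE @ h_prev                                   # subtractive population
    h_D = U_IE @ h_prev                                   # divisive population
    z = (W_EE @ h_prev - W_EI @ h_I) / np.sqrt(U_EI @ h_D)
    return np.maximum(z, 0.0)                             # phi = ReLU

def inorm_loss(h_E):
    """L_I-Norm: drive the first moment to 0 and the second moment to 1."""
    return h_E.mean() ** 2 + (np.mean(h_E ** 2) - 1.0) ** 2

h_E = inorm_forward(rng.uniform(0, 1, n_in))
loss = inorm_loss(h_E)
```

In the paper this loss is minimized with respect to the inhibitory pathway weights only; here the weights are random, so the loss is merely evaluated, not optimized.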
Figure 3a provides a schematic of this setup, showing the distinct inhibitory pathways and how the loss gradients propagate through them.

Analysis of excitatory unit activity revealed that the inhibitory circuits effectively learned to normalize excitatory activity (Fig. 3b). In contrast, networks without layer normalization exhibited first and second moments that were off the target ($\mu = 0$, $\sigma^2 = 1$). Figure 3b similarly shows that a purely subtractive inhibitory circuit does not match the layer normalization target statistics as effectively as the segregated inhibitory approach depicted in Figure 3a, despite being trained with the same I-Norm loss objective.

These results demonstrate that inhibitory circuits can effectively recapitulate the impact of hard-coded layer normalization on neural activity. Importantly, unlike hard-coded layer normalization, which requires a population-level computation to aggregate statistics, the learned I-normalization depends only on the feedforward activity of the preceding layer ($h^E_{\ell-1}$, see Eq. 1). This makes normalization inherently "predictive": the inhibitory population must anticipate excitatory activity, rather than simply enforcing a hard-coded normalization operation.

2.3 Normalizing error signals recapitulates layer normalization function

We next examined the performance of the trained I-Norm networks on the perceptual invariance task. Although the I-Norm network normalized excitatory activity, it still underperformed compared to EI-networks with hard-coded layer normalization (Fig. 4a). These observations point to a key mechanism found in layer normalization being absent from I-Norm networks. One possibility, motivated by observations in the machine learning literature (Xu et al., 2019), is that the effectiveness of hard-coded layer normalization depends on how it transforms error signals during error propagation, rather than how it shapes activity during forward processing.
To test this idea, we compared the activity vectors and weight updates during training in I-Norm networks with those of a version of the network with inhibition removed and hard-coded layer normalization added. Despite the strong cosine similarity in the activity statistics (Fig. 4b, top), the weight updates between the two networks were poorly aligned (Fig. 4b, bottom). This suggests that the impact of layer normalization is related to its effect on the weight updates, rather than its effect on forward-pass activity per se.

Figure 4: I-Norm networks struggle to recapitulate the performance of LN in EI networks. a: Test accuracy comparison between LN (x-axis) and I-Norm (y-axis) across hyperparameter and luminosity ranges ($\epsilon = 0, 0.25, 0.5, 0.75$). Each point represents a single network training run. Points falling below the dashed diagonal line indicate cases where LN achieved higher test accuracy than I-Norm. b: Layer-wise alignment between I-Norm and LN. We quantify alignment as the cosine similarity between I-Norm and LayerNorm (LN) across all network layers for outputs (top) and gradients (bottom). All results correspond to the highest luminosity range ($\epsilon = 0.75$).

Formal analysis of the weight updates confirmed that the effect of hard-coded layer normalization on learning arises from its influence on the weight updates themselves (see Appendix, 5.1). Specifically, for a network with layer normalization, the partial derivative used to update weights, $\partial \mathcal{L}_{\text{Task}} / \partial z_i$, corresponds to a normalized version of the propagated error signal, which we denote as $\hat{\delta}_i$ for neuron $i$.
This relationship can be expressed as follows (see Appendix for a full derivation):

$$\Delta W^{EE} \propto -\frac{\partial \mathcal{L}_{\text{Task}}}{\partial z} = -\hat{\delta}, \qquad \hat{\delta}_i = \underbrace{\frac{1}{\sqrt{\sigma^2 + c}}}_{\text{scale}} \Bigg( \underbrace{\delta_i - \frac{1}{n}\sum_{j=1}^{n}\delta_j}_{\text{center}} - \underbrace{\frac{\hat{z}_i}{n}\sum_{j=1}^{n}\delta_j\, \hat{z}_j}_{\text{decorrelate}} \Bigg), \qquad (2)$$

where $z_i$ is the activation of unit $i$, $\hat{z}_i$ is its normalized activation, $\delta_i$ is the backpropagated error before normalization, $\hat{\delta}_i$ is the error after normalization, $\mu$ and $\sigma^2$ are the mean and variance across the $n$ units, and $c$ is a small constant for numerical stability. For clarity, we have omitted the layer index $\ell$ and labeled the terms to highlight their different functional roles. As shown in the equation, layer normalization transforms the propagated error signals in three distinct ways: (1) it rescales the

Figure 5: Hard-coded LN gradients in I-Norm networks restore LN performance in EI networks. a: Schematic illustrating the backward pass of the I-Norm network incorporating GradNorm, from Equation 2. The GradNorm operation is applied to the backward signal ($\delta$) to enforce the LN gradient. b: Average cosine similarity across all network layers. The top and bottom panels show the similarity between LN and I-Norm (with GradNorm) for outputs and gradients, respectively. c: Test accuracy (Acc %) comparison. LN network performance (x-axis) versus I-Norm network with GradNorm (y-axis). Data are shown across 30 hyperparameter initializations and four luminosity ranges ($\epsilon = 0, 0.25, 0.5, 0.75$). Points clustered along the dashed diagonal line indicate a strong match in performance between the two models.
errors according to the variance of the excitatory activity; (2) it centers the errors by removing their mean; and (3) it orthogonalizes the error signal from the activations.

We next evaluated I-Norm networks using the gradient modification in Equation 2, which we call GradNorm. In this setting, inhibition handles activity normalization, while the error calculations are hard-coded (Eqn. 2; Fig. 5a). GradNorm significantly improved the alignment of I-Norm weight updates with those of standard layer normalization (Fig. 5b), successfully recapitulating the performance gains associated with hard-coded layer normalization (Fig. 5c). This performance held across all ranges of luminosity, despite the lack of perfect output alignment (Fig. 5b).

Together, these analyses and experiments indicate that the performance improvements observed in EI-networks with layer normalization derive from its effect on error signal propagation. This suggests that achieving the benefits of layer normalization in I-Norm networks would require an additional inhibitory population capable of normalizing error signals.

Figure 6: Gradient centering, not scaling or decorrelating, drives LN performance recovery in I-Norm networks. a: Box plots of test accuracy comparing the Scale, Decorrelate, Center, and full LN gradient components applied to the I-Norm network. The red dashed line (LN+) shows the baseline performance of the hard-coded LN network. Results are separated by luminosity range ($\epsilon = 0, 0.75$). Significance is indicated comparing components to the LN+ baseline. Red tildes (∼) denote outliers performing near random chance, omitted to preserve plot scaling.
b: The panels contrast the LN gradient with specific gradient components trained on an I-Norm network: Center (top, orange), Decorrelate (middle, teal), and Scale (bottom, purple).

2.4 Mean centering of error signals is the most salient component of credit assignment normalization

To better understand which aspects of gradient normalization drive performance, we first examined the contribution of each term in the layer normalization gradient equation. In particular, we asked whether the performance of I-Norm networks with hard-coded GradNorm hinges more on the scaling, centering, or decorrelation of the error signals (Eqn. 2).

To investigate this, we trained I-Norm networks with hard-coded modifications to the gradient calculations (Eqn. 2), applying each of the three operations in isolation. Across luminance levels, we found that centering was critical for performance, while scaling and decorrelation had little impact. Networks trained with centering alone achieved test accuracies comparable to those of hard-coded layer normalization at luminance ranges of 0 and 0.75 (Mann-Whitney U test, $p = 0.4641$ and $p = 0.6099$, respectively), whereas networks trained with only scaling or decorrelation performed significantly worse across luminance levels (Fig. 6a).

We further confirmed these findings by analyzing the alignment of the gradients in I-Norm networks with those trained using standard layer normalization. Networks trained with the centering operation showed higher alignment with the layer normalization networks compared to networks trained with either scaling or decorrelation alone (Fig. 6b). These results indicate that the centering of gradient calculations induced by layer normalization is the key factor driving its impact on training for the task. Motivated by this, we next aimed to implement the centering operation using an inhibitory circuit.
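The GradNorm decomposition and its component-wise ablations can be made concrete with a short numpy sketch of Equation 2. This is our own illustration, not the authors' code; each term can be toggled off to mimic the isolation experiments above:

```python
import numpy as np

def gradnorm(delta, z_hat, sigma2, c=1e-5,
             center=True, decorrelate=True, scale=True):
    """Transform a backpropagated error vector as layer normalization would
    (Eq. 2). Toggle individual terms to isolate their contributions."""
    n = delta.shape[0]
    out = delta.copy()
    if center:
        out = out - delta.mean()                   # remove the mean error
    if decorrelate:
        out = out - z_hat * (delta @ z_hat) / n    # project out the activation direction
    if scale:
        out = out / np.sqrt(sigma2 + c)            # rescale by activity variance
    return out

rng = np.random.default_rng(1)
z = rng.normal(size=256)
z_hat = (z - z.mean()) / np.sqrt(z.var() + 1e-5)   # normalized activations
delta = rng.normal(size=256)                       # a toy error vector

full = gradnorm(delta, z_hat, z.var())
centered_only = gradnorm(delta, z_hat, z.var(), decorrelate=False, scale=False)
```

The full transform yields an error vector that is (approximately) zero-mean and orthogonal to the normalized activations, while the centering-only variant reproduces just the mean-removal step that Section 2.4 identifies as the critical component.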
2.5 Centering of credit assignment signals via lateral inhibition

An inhibitory implementation of the mean-centering operation must (i) pool information across the $\delta_i$'s within the layer, and (ii) broadcast back a common signal that each excitatory neuron can subtract from its update. But inhibitory units in real neural circuits cannot all possess exactly the same synaptic weights. This raises a fundamental question: can random inhibitory synapses implement gradient centering?

Several theories argue for gradient approximation in the brain, each sharing a common requirement: error signals (i.e., $\delta_i$) are propagated between regions or layers (Richards et al., 2019; Guerguiev et al., 2017; Lillicrap et al., 2016; Whittington and Bogacz, 2017; Lillicrap et al., 2020). Motivated by this, we assume here, for the sake of theorizing, that such across-layer error signals are available, and focus on a more specific question: given access to the $\delta_i$ within a layer, how could an inhibitory circuit implement the mean-centering operation?

Below, we establish a theoretical guarantee showing that a single inhibitory unit with fixed, random synaptic weights can indeed approximate the population mean of error signals.

Theorem 1: Mean estimation via fixed random inhibition

Let $\{\omega_i\}_{i=1}^{n}$ be i.i.d. positive random variables with $\mathbb{E}[\omega_i] = \mu > 0$ and $\mathrm{Var}(\omega_i) < \infty$ that parameterize a set of $n$ synaptic weights, $\{\nu_{i,n}\}_{i=1}^{n}$, whose values sum to 1. Let $\{\delta_i\}_{i=1}^{n}$ be a bounded sequence of error signals with empirical mean

$$\bar{\delta} = \frac{1}{n}\sum_{i=1}^{n} \delta_i.$$

Define the normalized synaptic weight from neuron $i$ onto the inhibitory neuron as

$$\nu_{i,n} = \frac{\omega_i}{\sum_{j=1}^{n} \omega_j},$$

and consider the inhibitory error pooling operation, expressed as the dot product between the $n$-dimensional weight vector $\nu_n = [\nu_{1,n}, \ldots, \nu_{n,n}]^\top$ and the error vector $\delta_n = [\delta_1, \ldots, \delta_n]^\top$:

$$s_n = \nu_n \cdot \delta_n = \sum_{i=1}^{n} \nu_{i,n}\, \delta_i.$$
Figure 7: Lateral inhibition with fixed synaptic weights implements gradient centering. a: Schematic of the I-Norm network with a lateral inhibition pathway. The inhibitory unit (red) pools and transforms excitatory activity (gray) using fixed, random connections to compute the centering term (mean). b: Test accuracies (Acc %) of the lateral inhibition solution, explicit gradient centering, and the full layer normalization gradient applied to the I-Norm network. The red dashed line (LN+) shows the baseline performance of the explicit layer normalization network. Results are averaged across hyperparameter and luminosity ranges ($\epsilon = 0, 0.75$). Significance is shown relative to the LN+ baseline.

Then, in the limit as $n \to \infty$, the pooled signal converges to the empirical mean:

$$\lim_{n \to \infty} s_n = \bar{\delta} \quad \text{almost surely.}$$

(End of Theorem 1)

The proof of this theorem is provided in the Appendix. Theorem 1 shows that pooling errors via normalized random synapses becomes equivalent to uniform averaging in the large-$n$ limit. Subtracting this inhibitory signal, therefore, could approximate gradient centering.

Motivated by this theoretical guarantee, we constructed a simple lateral inhibitory circuit (Fig. 7a), inspired by "blanket" inhibition in cortex, in which a single inhibitory pool targets large populations of excitatory cells (Karnani et al., 2014). We tested this lateral inhibition mechanism for error normalization and compared its performance to networks using explicit centering or full gradient normalization. Across all luminance ranges, test accuracy with lateral inhibition was comparable to the explicit centering and full gradient-normalization baselines, with no significant differences found at luminance range 0 (Mann-Whitney U test, $p = 0.4641$) or luminance range 0.75 (Mann-Whitney U test, $p = 0.6099$) (Fig. 7b). Together, these results show that the learning benefits of layer normalization can be reproduced using three distinct inhibitory populations: two inhibitory populations that estimate the mean and variance to normalize excitatory activity in the forward pass, and a third lateral inhibitory population that centers error signals to normalize gradient updates.

3 Discussion

Layer normalization enhances learning in ANNs by stabilizing activations in the forward pass and regularizing error signals in the backward pass. In our work, we leveraged a neural network with distinct excitatory and inhibitory circuits to understand the potential role of inhibitory normalization in learning in neural circuits. Using local objectives, the inhibitory circuits learned to stabilize excitatory activations, but lacked the additional learning benefits provided by standard layer normalization. Through ablation studies, we showed that centering the gradient alone is sufficient to recover these benefits, even without the scaling and decorrelation components of the error transform. Moreover, we demonstrated that such centering can be achieved through lateral inhibitory circuits with fixed random weights, leading to performance improvements approaching those of standard layer normalization. Altogether, our results suggest that if inhibitory normalization in the brain helps learning, it may be because there are multiple inhibitory populations with distinct roles, some related to normalizing activity, and others related to normalizing error signals used for learning.

Previous studies have emphasized how inhibition can center (subtractive) or scale (divisive) neuronal activity: somatic-targeting PV interneurons typically scale down (divide) responses, whereas dendrite-targeting SST interneurons subtract from them (Wilson et al., 2012; Pouille et al., 2013). Consistent with this, Atallah et al.
(2012) showed that increasing PV-cell activity produces a linear combination of additive and multiplicative changes in pyramidal firing, effectively a form of gain control with bias while maintaining tuning specificity. These findings collectively support the idea that PV cells could implement activity normalization. Our work builds on this foundation by training inhibitory cells to perform normalization of excitatory activity, demonstrating that such circuits can recapitulate both the centering and gain control functions of standard layer normalization.

However, previous models of inhibitory normalization have focused primarily on regulating neuronal activity. Learned divisive normalization frameworks (Burg et al., 2021; Shen et al., 2021), for example, describe how cortical circuits can perform normalization but do not consider how inhibitory mechanisms might also regulate error signals. Our findings address this gap by showing that forward-pass normalization by inhibition alone cannot reproduce the full learning benefits of layer normalization. We propose that inhibitory subcircuits, potentially SST subtypes targeting apical dendrites where error-related signals arrive, or neurogliaform cells (Overstreet-Wadiche and McBain, 2015), can normalize credit assignment signals and thereby support efficient and stable learning. This aligns with recent theories of burst-dependent backpropagation (Payeur et al., 2021), which suggest that inhibitory microcircuits are essential in shaping back-propagated error signals rather than merely transforming neural responses. In this view, inhibition not only stabilizes neuronal activity but also contributes to modulating credit assignment signals, revealing a hypothetical, previously unrecognized role for inhibition.
One of the more interesting insights from our work is the ability of inhibitory circuits to infer the mean and variance of downstream excitatory activity from an earlier layer using a simple regularization loss, without direct knowledge of the excitatory activations. In the context of ANNs, predicting these statistics eliminates the need for hard-coded layer normalization. Although this mechanism is not present in standard networks, it could help maintain stable representations under varying sensory conditions, such as large shifts in luminance for image categorization or other distributional changes, reducing the need for explicit gain, bias, or error correction in variance computation.

Reflecting on the role of gradient normalization, a particularly striking result of our study was that centering the gradients alone was sufficient to recapitulate layer normalization's performance on the task, indicating that the mean component of the gradient carries the majority of the functional impact. In practice, the mean subtraction occasionally reverses the signs of individual gradients, which in the context of gradient descent may produce substantially different learning dynamics. Why such sign changes can improve learning remains unclear and represents a potentially interesting direction for future investigation.

Here, we implemented the gradient-centering operation using a lateral inhibitory circuit with fixed random weights. This mechanism is conceptually related to feedback alignment (Lillicrap et al., 2016), in which fixed random feedback weights transmit useful gradient signals. In our case, the fixed random inhibitory weights serve to compute the mean of the gradient rather than the exact backpropagated values. This observation raises further questions for future research, including whether using random inhibitory weights to maintain homeostasis could represent a biologically plausible alternative to approximating precise gradient signals.
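The convergence claimed in Theorem 1 is easy to check numerically. The following sketch (our own illustration, with an arbitrary bounded error sequence and an arbitrary positive weight distribution) pools errors through fixed random synapses normalized to sum to one:

```python
import numpy as np

rng = np.random.default_rng(0)

def pooled_error(n):
    """Pool a bounded error sequence through fixed, random, positive
    synaptic weights normalized to sum to one (as in Theorem 1)."""
    omega = rng.uniform(0.5, 1.5, size=n)   # i.i.d. positive weights
    nu = omega / omega.sum()                # normalized weights, sum to 1
    delta = np.sin(np.arange(n))            # a bounded "error" sequence
    return nu @ delta, delta.mean()

s_small, mean_small = pooled_error(10)
s_large, mean_large = pooled_error(100_000)

err_small = abs(s_small - mean_small)       # deviation at small n
err_large = abs(s_large - mean_large)       # deviation at large n
```

As $n$ grows, the pooled signal tracks the empirical mean increasingly closely, consistent with the almost-sure convergence in the theorem.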
All things considered, this work has some important limitations that should be considered. First, our empirical findings rely primarily on a perceptual invariance task in which the dominant inductive bias is a global shift in pixel intensities. In this task, mean-centering would naturally be the most functionally relevant component of normalization. Consequently, subtracting the mean from the activations directly targets the structure of the perturbation. In more complex sensory domains, there is no guarantee that the mean will remain the most salient statistic to normalize. Natural images, for example, exhibit variance fluctuations, which might require inhibitory circuits to compensate beyond centering the gradient. Although centering the gradient works well for our perceptual invariance task, we have yet to test whether it generalizes to other tasks.

Finally, a core conceptual limitation is that the random-weight inhibitory mechanism we employ to center error signals is not biologically grounded; for example, it is linear and operates under the assumption that the random weights $\nu_i$ sum to one. Nonetheless, it provides a high-level illustration of how lateral inhibition with random synaptic connections could implement normalization of error signals. Thus, the biological claims of this work are best interpreted as a high-level algorithmic hypothesis, rather than as a proposal for a physiological mechanism. Future work should evaluate inhibitory normalization in tasks where additional statistics beyond the mean (e.g., variance, covariance, or sparsity) are behaviorally relevant. Furthermore, additional studies could build biophysical models that examine the relationship between plasticity rules and normalization in more realistic inhibitory circuits.
By expanding both the task domain and the realism of the circuit motifs studied, future work may uncover a more complete and unified account of how normalization driven by inhibition could impact learning in neural circuits.

4 Methods

4.1 Dataset

We constructed a modified version of the Fashion-MNIST classification task that incorporates shifts in image luminosity. All images were normalized to lie within the range [0, 1]. For each image, we generated luminance augmentations by adding a constant offset, $\Delta$, to all pixel values, where $\Delta \sim \mathrm{Uniform}(-\epsilon, +\epsilon)$. Since the pixel values are bounded between 0 and 1, the augmented images were clamped to remain within this range. The parameter $\epsilon$ served as a hyperparameter controlling the magnitude of the luminance shifts. We conducted experiments with $\epsilon \in \{0, 0.25, 0.5, 0.75\}$, where $\epsilon = 0$ corresponds to the standard Fashion-MNIST dataset. Each model was trained on 60,000 images and evaluated on a test set of 10,000 images, both of which included luminosity-shifted samples. Reported accuracies in the manuscript refer to model performance on the test dataset.

4.2 Layer Normalization

Layer Normalization (LN) (Ba et al., 2016) normalizes the pre-activations of a layer across hidden units for each input sample, rather than across the batch. Given a vector of pre-activation inputs $z = (z_1, \ldots, z_N)$ to a layer of $N$ hidden units, LN computes the mean and variance over the hidden units for a single sample:

$$\mu = \frac{1}{N}\sum_{i=1}^{N} z_i, \qquad \sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (z_i - \mu)^2.$$

Each activation is then normalized as

$$\hat{z}_i = \frac{z_i - \mu}{\sqrt{\sigma^2 + c}},$$

where $c$ is a small constant to prevent numerical instability. In our experiments, we use $c = 10^{-5}$. We omit the additional gain and bias terms often used in LN, because existing literature indicates that the gain and bias occasionally hurt training (Xu et al., 2019).
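As a concrete reference, the gain- and bias-free LN used here can be written directly from the equations above (a minimal NumPy sketch for illustration; the paper's own implementation is in PyTorch):

```python
import numpy as np

def layer_norm(z, c=1e-5):
    """Per-sample layer normalization without gain/bias terms.

    z: array of shape (batch, N) containing pre-activations.
    """
    mu = z.mean(axis=-1, keepdims=True)   # mean over hidden units
    var = z.var(axis=-1, keepdims=True)   # biased variance, as in LN
    return (z - mu) / np.sqrt(var + c)

z = np.random.default_rng(0).normal(size=(32, 100))
z_hat = layer_norm(z)
# each row of z_hat has approximately zero mean and unit variance
```

Note that the statistics are computed along the hidden-unit axis, so each sample is normalized independently of the rest of the batch.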
4.3 Network Architecture & Loss Functions

We enforce Dale's principle in our networks by constraining each neuron to be exclusively excitatory (E) or inhibitory (I), such that all outgoing synaptic weights share the same sign. Our networks are limited to 2 hidden layers, with the inhibitory layer width set to 10% of the excitatory layer width. The architecture is trained end-to-end using standard gradient descent.

Standard EI Network. In the standard EI network, the activity of the excitatory units ($h^E_\ell$) at layer $\ell$ is governed by the subtractive interaction of E and I populations:

$$h^E_\ell = \phi(z_\ell) \quad \text{where} \quad z_\ell = W^{EE}_\ell h^E_{\ell-1} - W^{EI}_\ell h^I_\ell.$$

The inhibitory population activity ($h^I_\ell$) is a feedforward function of the preceding excitatory activity:

$$h^I_\ell = W^{IE}_\ell h^E_{\ell-1}.$$

Here, $\phi(x) = \mathrm{ReLU}(x)$ is the non-linear activation function. For E-only networks (Fig. 2), the inhibitory computation ($h^I_\ell$) is excluded, simplifying the pre-activation to $z_\ell = W^{EE}_\ell h^E_{\ell-1}$.

The Inhibitory Normalization (I-Norm) Network. To approximate Layer Normalization (LN) using segregated inhibitory populations, we extended the EI-network architecture by introducing a separate inhibitory population dedicated to divisive inhibition ($h^D_\ell$). The resulting I-Norm network equations introduce divisive normalization to the pre-activation $z_\ell$:

$$h^E_\ell = \phi(z_\ell) \quad \text{where} \quad z_\ell = \frac{W^{EE}_\ell h^E_{\ell-1} - W^{EI}_\ell h^I_\ell}{\sqrt{U^{EI}_\ell h^D_\ell}} \qquad (\text{Eq. } 1)$$

The subtractive ($h^I_\ell$) and divisive ($h^D_\ell$) inhibitory populations are computed as:

$$h^I_\ell = W^{IE}_\ell h^E_{\ell-1} \quad \text{and} \quad h^D_\ell = U^{IE}_\ell h^E_{\ell-1}.$$

The weights associated with the divisive pathway are denoted by $U^X$.

Initialization of Standard EI-network. Following the procedures established by Cornford et al. (2021), all excitatory parameters ($W^{EE}$, $W^{IE}$) are initialized from an exponential distribution: $W^{EE}_{ij} \sim \mathrm{Exp}(\lambda_E)$.
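The forward computation of one I-Norm layer (Eq. 1) can be sketched as follows. This is an illustrative NumPy toy with hypothetical dimensions and weight scales, and the small `eps` added under the square root is our own numerical-stability guard, not part of Eq. 1:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def inorm_layer(h_prev, W_EE, W_EI, W_IE, U_EI, U_IE, eps=1e-5):
    """One I-Norm layer (cf. Eq. 1): subtractive inhibition in the
    numerator, divisive inhibition in the denominator."""
    h_I = W_IE @ h_prev            # subtractive inhibitory population
    h_D = U_IE @ h_prev            # divisive inhibitory population
    z = (W_EE @ h_prev - W_EI @ h_I) / np.sqrt(U_EI @ h_D + eps)
    return relu(z)

# Toy dimensions (hypothetical; inhibitory width ~10% of excitatory width)
rng = np.random.default_rng(1)
n_in, n_out, n_i = 20, 10, 2
h0 = rng.random(n_in)
W_EE = rng.exponential(0.1, size=(n_out, n_in))
W_IE = rng.exponential(0.1, size=(n_i, n_in))
W_EI = rng.exponential(0.1, size=(n_out, n_i))
U_IE = rng.exponential(0.1, size=(n_i, n_in))
U_EI = np.full((n_out, n_i), 1.0 / n_in)
h1 = inorm_layer(h0, W_EE, W_EI, W_IE, U_EI, U_IE)
print(h1.shape)  # (10,)
```

With all divisive-pathway weights positive, as in this toy, the denominator stays positive; as noted later in this section, the paper's SVD-based initialization of $U^{IE}$ can produce negative entries, technically breaking the sign constraint of Dale's law.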
The inhibitory parameters are initialized to ensure that excitation and subtractive inhibition are balanced ($\mathbb{E}[z^E_k] = \mathbb{E}[(W^{EI} z^I)_k]$). Specifically: $W^{IE}$ is initialized using the mean of the rows of $W^{EE}$: $W^{IE} = \frac{1}{n_e}\sum_{j=1}^{n_e} w^{EE}_{j,:}$. $W^{EI}$ is initialized from $\mathrm{Exp}(\lambda_E)$ and then row-normalized ($W^{EI}_{i,:} \leftarrow W^{EI}_{i,:} / \sum_k W^{EI}_{ik}$), which approximates the balancing constant $\frac{1}{n_i}$.

Initialization of I-Norm Divisive Pathway. To maintain consistency and ensure $\mathbb{E}[z^E] = 0$ at initialization, the subtractive inhibition pathway ($W^{EI}$, $W^{IE}$) is initialized exactly as in the standard EI-network. For the divisive pathway ($U^{IE}$, $U^{EI}$), the goal is to initialize the denominator ($\sqrt{U^{EI} h^D}$) to approximate the empirical variance of the subtractive pre-activations. We achieve this by employing a Singular Value Decomposition (SVD) of the effective excitatory weight matrix $W = W^{EX} - W^{EI} W^{IX}$ (where $W^{IX}$ denotes the $W^{IE}$ weights): $W = U\Sigma V^\top$. The principal components of $V$ are used to initialize the divisive pathway: $U^{IE} = \Sigma V^\top$, and $U^{EI}_{ij} = \frac{1}{n_e}$. This ensures that the denominator approximates the required empirical variance at initialization.

We note that the use of SVD for $U^{IE}$ initialization does not guarantee that all resulting weights are positive, so technically, the divisive pathway in our I-Norm network breaks with Dale's law. However, divisive inhibition in real neurons is likely driven by shunting (Carandini and Heeger, 1994), and shunting inhibition is more likely to be able to switch signs due to chloride dynamics in dendrites (Raimondo et al., 2012).

Training Procedure and Dual Learning Objectives. The network optimizes excitatory and inhibitory synapses under distinct learning objectives to decouple task learning from neural activity normalization.
Excitatory connections ($W^{EE}$) are trained on the standard cross-entropy classification loss ($\mathcal{L}_{\text{task}}$), following the rule:

$$\Delta W^{EE} = -\eta_{EE} \frac{\partial \mathcal{L}_{\text{task}}}{\partial W^{EE}}.$$

Inhibitory connections ($W^{EI}$, $W^{IE}$, $U^{EI}$, $U^{IE}$) are optimized with a local loss function ($\mathcal{L}_{\text{I-Norm}}$) designed to preserve stability by enforcing the statistics of layer normalization (mean $= 0$, variance $= 1$):

$$\mathcal{L}_{\text{I-Norm}} = \left(\frac{1}{n}\sum_{i=1}^{n} h^E_i\right)^2 + \left(\frac{1}{n}\sum_{i=1}^{n} (h^E_i)^2 - 1\right)^2.$$

To ensure that stability mechanisms do not interfere with task learning, gradient isolation is enforced using stop-gradient mechanisms: the excitatory activations are detached when computing the I-Norm loss, and the inhibitory outputs are detached during the main forward pass (propagation of $\mathcal{L}_{\text{task}}$), preventing I-Norm gradients from affecting the task objective.

4.4 Hyperparameters

We conducted a comprehensive hyperparameter sweep to evaluate performance on the Fashion-MNIST dataset. Our experimental design employed a grid search strategy combined with random sampling to explore the hyperparameter space systematically (Table 1).

Training Configuration: All experiments were trained for 50 epochs using mini-batches of size 32. We used the FashionMNIST dataset with 10 output classes. Training employed SGD optimization with no momentum or weight decay to isolate the effects of the I-Norm mechanisms.

Learning Rate Sampling: We implemented separate learning rates for excitatory and inhibitory connections, sampled from log-uniform distributions. The excitatory learning rate ($\eta_{\text{exc}}$) was sampled from $[10^{-3}, 10^{-1}]$, while inhibitory learning rates for excitatory-inhibitory ($\eta_{\text{wei}}$) and inhibitory-inhibitory ($\eta_{\text{wix}}$) connections were sampled from $[10^{-5}, 10^{-2}]$ and $[10^{-2}, 10^{0}]$, respectively. We employed the same inhibitory learning rates for both the subtractive and divisive inhibitory components.
This design reflects the biological principle that inhibitory plasticity operates on different timescales than excitatory plasticity.

Network Architecture: The hidden layer width was randomly sampled from a uniform distribution over [100, 500] neurons, allowing us to evaluate the robustness of the I-Norm mechanism across different network sizes. We fixed the depth at 2 hidden layers.

I-Norm Loss Parameters: We systematically varied the I-Norm loss weight ($\lambda_{\text{I-Norm}}$) across six values: $10^{-5}$, $10^{-4}$, $10^{-3}$, $10^{-2}$, $10^{-1}$, and $10^{0}$. The default I-Norm weight we show in our results is $10^{-2}$.

Table 1: Hyperparameters used in the experimental evaluation

Parameter                      Value/Range          Description
Training Parameters
  Epochs                       50                   Number of training epochs
  Batch size                   32                   Mini-batch size
  Dataset                      FashionMNIST         Source dataset
Learning Rates
  Excitatory LR (η_exc)        10^-3 to 10^-1       Log-uniform sampling
  Inhibitory LR (η_wei)        10^-5 to 10^-2       Log-uniform sampling
  Inhibitory LR (η_wix)        10^-2 to 10^0        Log-uniform sampling
Network Architecture
  Hidden layer depth           2                    Fixed
  Hidden layer width           100-500              Uniform sampling
  Output classes               10                   FashionMNIST classes
I-Norm Loss
  Weight (λ_I-Norm)            10^-5 to 10^0        Grid search
Optimization
  Momentum                     0                    SGD momentum
  Weight decay                 0                    L2 regularization
  Algorithm                    SGD                  Optimizer type

Data Augmentation: To test robustness to input variations, we applied brightness jitter with factors of 0, 0.25, 0.5, and 0.75, simulating varying lighting conditions.

4.5 Measures and Analysis

Cosine Similarity Analysis: To measure the alignment of the internal dynamics of the I-Norm network against the explicit LN+ baseline, we computed the cosine similarity for both the activity outputs and the gradient signals.
We measured the output alignment between the excitatory unit outputs ($h^E_l$) of the I-Norm network and the LN+ baseline at layer $l$:

$$\text{output alignment}_l = \frac{(h^E_l)_{\text{I-Norm}} \cdot (h^E_l)_{\text{LN}^+}}{\|(h^E_l)_{\text{I-Norm}}\|_2 \, \|(h^E_l)_{\text{LN}^+}\|_2}.$$

We similarly measured the alignment of the gradient signals, specifically the cosine similarity between the gradient of the $\mathcal{L}_{\text{I-Norm}}$ loss and the task-driven gradient of the $\mathcal{L}_{\text{Task}}$ loss, with respect to the excitatory weights $W^{EE}_l$. Note that the $\mathcal{L}_{\text{Task}}$ loss is computed with respect to the excitatory-only network with LN+.

$$\text{gradient alignment}_l = \frac{\nabla_{W^{EE}_l} \mathcal{L}_{\text{I-Norm}} \cdot \nabla_{W^{EE}_l} \mathcal{L}_{\text{Task}}}{\|\nabla_{W^{EE}_l} \mathcal{L}_{\text{I-Norm}}\|_2 \, \|\nabla_{W^{EE}_l} \mathcal{L}_{\text{Task}}\|_2}.$$

Statistical Moments of Neural Activations: To confirm that the I-Norm mechanism successfully learned to enforce normalization constraints, we tracked the first and second moments of the excitatory pre-activations ($z_l$) throughout training, where $l$ is the layer:

• First moment (mean): $\mu_l = \mathbb{E}[z_l] = \frac{1}{N}\sum_{i=1}^{N} z_{l,i}$.
• Second moment: $\sigma^2_l = \mathbb{E}[z_l^2] = \frac{1}{N}\sum_{i=1}^{N} z_{l,i}^2$.

The objective of the $\mathcal{L}_{\text{I-Norm}}$ loss is to drive these moments toward the LN targets (mean $= 0$, variance $= 1$).

Statistical Significance Testing: To robustly determine statistical significance for comparisons between different network architectures and training conditions (e.g., accuracy comparisons across hyperparameter runs), we employed the Mann-Whitney U test. This non-parametric test was chosen because the accuracy distributions obtained from hyperparameter sampling may not strictly follow a normal distribution. The results of this test are reported using conventional notations.

4.6 Implementation and Reproducibility

All experiments were implemented in PyTorch and conducted on NVIDIA RTX 8000 GPUs with 16GB memory allocation.
Training was performed using SLURM job arrays with 30 random hyperparameter configurations per experimental condition, where each run required approximately 20 minutes of compute time. The codebase includes automated batch scripts for hyperparameter sweeps and random configuration generation, ensuring consistent experimental protocols across all runs. The full repository is publicly available at: https://github.com/RoyHEyono/inhibitory-normalization.

4.7 Acknowledgments

The authors would like to thank Tom George and Ibrahima Daw for helpful comments on the manuscript. This work was supported by the following sources. RHE: DeepMind Fellowship. DL: FRQNT Strategic Clusters Program (2020-RS4-265502 – Centre UNIQUE – Unifying Neuroscience and Artificial Intelligence – Québec), the Richard and Edith Strauss Postdoctoral Fellowship in Medicine and the Thomas Kingsley Lawrence Fund. BR: This work was supported by NSERC (Discovery Grant: RGPIN-2020-05105; Discovery Accelerator Supplement: RGPAS-2020-00031), CIFAR (Canada AI Chair; Learning in Machines & Brains Fellowship), and DoD OUSD (R&E) under Cooperative Agreement PHY-2229929 (The NSF AI Institute for Artificial and Natural Intelligence). This research was enabled in part by support provided by Calcul Québec (https://www.calculquebec.ca/en/) and the Digital Research Alliance of Canada (https://alliancecan.ca/en). The authors acknowledge the material support of NVIDIA in the form of computational resources.

5 Appendix

5.1 Derivative of Layer Normalization

We begin by establishing the objective function for our network. Let's define the cross-entropy loss

$$\mathcal{L}_{\text{task}} = -\sum_i y_i \log(\hat{y}_i),$$

where $y_i$ is the label and $\hat{y} = \mathrm{softmax}(\phi(z_L))$ is the prediction of the network. The error at the final softmax layer $L$ can be defined as:

$$\delta^L = \nabla_{\hat{y}} \mathcal{L}_{\text{task}} \odot \phi'(z^L),$$

where $\phi$ is the activation function. In the final output layer $L$, $\phi_L$ is defined as the softmax function.
For all preceding hidden layers $l < L$, we employ the Rectified Linear Unit (ReLU) activation, $\phi_l(z) = \max(0, z)$.

Using the chain rule, we can propagate this error backward through the network. The back-propagated error for earlier layers $l$ is defined as:

$$\delta^l = (W^{EE}_{l+1})^T \delta^{l+1} \odot \phi'(z^l).$$

With the error signal established, we can define the gradient for the weight parameters. The weight update for excitatory weights $W^{EE}_{l+1}$ with respect to the error $\delta^{l+1}$ is defined as:

$$\frac{\partial \mathcal{L}_{\text{task}}}{\partial W^{EE}_{l+1}} = \delta^{l+1} (h^E_l)^T.$$

In standard architectures, layer normalization is often introduced before the activation function. Layer normalization is defined as:

$$\hat{z}^l = \frac{z^l - \mu^l}{\sigma^l}, \qquad \mu^l = \frac{1}{H}\sum_{i=1}^{H} z^l_i, \qquad \sigma^l = \sqrt{\frac{1}{H}\sum_{i=1}^{H} (z^l_i - \mu^l)^2 + c}. \qquad (3)$$

To backpropagate through this operation, we must account for the dependency of the normalized output $\hat{z}^l$ on the input vector $z^l$. When layer normalization is applied to layer $l$, the error signal at layer $l$ becomes:

$$\delta^l_{\text{norm}} = \left(\frac{\partial \hat{z}^l}{\partial z^l}\right)^T \delta^l. \qquad (4)$$

Hence, the weight update for the preceding layer now incorporates this adjusted error:

$$\frac{\partial \mathcal{L}_{\text{task}}}{\partial W^{EE}_{l+1}} = \delta^{l+1}_{\text{norm}} (h^E_l)^T.$$

Let's now compute the Jacobian $\frac{\partial \hat{z}^l}{\partial z^l}$ in $\delta^l_{\text{norm}}$. This represents how each component of the normalized vector changes with respect to each component of the input vector. We first apply the quotient rule to $\frac{\partial \hat{z}^l}{\partial z^l}$ from Equation 3:

$$\begin{aligned}
\frac{\partial \hat{z}^l}{\partial z^l} &= \frac{1}{\sigma^l}\frac{\partial (z^l - \mu^l)}{\partial z^l} - \frac{z^l - \mu^l}{(\sigma^l)^2}\frac{\partial \sigma^l}{\partial z^l} \\
&= \frac{1}{\sigma^l}\left(I_H - \frac{\partial \mu^l}{\partial z^l} - \frac{z^l - \mu^l}{\sigma^l}\frac{\partial \sigma^l}{\partial z^l}\right) \\
&= \frac{1}{\sigma^l}\left(I_H - \frac{\partial \mu^l}{\partial z^l} - \hat{z}^l \left(\frac{\partial \sigma^l}{\partial z^l}\right)^T\right).
\end{aligned}$$

To complete the derivation, we solve for the partial derivatives of the mean ($\mu$) and standard deviation ($\sigma$). Note that

$$\frac{\partial \mu^l}{\partial z^l} = \frac{1}{H}\mathbf{1}_H \mathbf{1}_H^T, \qquad \frac{\partial \sigma^l}{\partial z^l} = \frac{1}{H}\left(\frac{z^l - \mu^l}{\sigma^l}\right) = \frac{1}{H}\hat{z}^l.$$
By substituting these intermediate results back into our Jacobian expression, we arrive at the full matrix representation. Altogether,

$$\frac{\partial \hat{z}^l}{\partial z^l} = \frac{1}{\sigma^l}\left(I_H - \frac{1}{H}\mathbf{1}_H \mathbf{1}_H^T - \hat{z}^l \left(\frac{\partial \sigma^l}{\partial z^l}\right)^T\right) = \frac{1}{\sigma^l}\left(I_H - \frac{1}{H}\mathbf{1}_{H\times H} - \frac{1}{H}(\hat{z}^l)(\hat{z}^l)^T\right).$$

When we re-introduce Equation 4, the following expression for the normalized error emerges:

$$\delta^l_{\text{norm}} = \left(\frac{\partial \hat{z}^l}{\partial z^l}\right)^T \delta^l = \frac{1}{\sigma^l}\left(I_H - \frac{1}{H}\mathbf{1}_{H\times H} - \frac{1}{H}\hat{z}^l (\hat{z}^l)^T\right)\delta^l.$$

For implementation purposes, it is often useful to view the transformation on a per-element basis. Elementwise, this translates to:

$$(\delta^l_{\text{norm}})_i = \sum_{j=1}^{H} \delta^l_j \frac{\partial \hat{z}^l_j}{\partial z^l_i} = \frac{1}{\sigma^l}\left(\delta^l_i - \frac{1}{H}\sum_{j=1}^{H}\delta^l_j - \frac{\hat{z}^l_i}{H}\sum_{j=1}^{H}\delta^l_j \hat{z}^l_j\right).$$

5.2 Mean estimation via fixed random inhibition

Theorem 1: Let $\{\omega_i\}_{i=1}^n$ be i.i.d. positive random variables with $\mathbb{E}[\omega_i] = \mu > 0$ and $\mathrm{Var}(\omega_i) < \infty$ that parameterize a set of $n$ synaptic weights, $\{\nu_{i,n}\}_{i=1}^n$, whose values sum to 1. Let $\{\delta_i\}_{i=1}^n$ be a bounded sequence of error signals with an empirical mean denoted by $\bar{\delta}$, i.e.,

$$\bar{\delta} = \frac{1}{n}\sum_{i=1}^{n} \delta_i.$$

Define the normalized synaptic weights

$$\nu_{i,n} = \frac{\omega_i}{\sum_{j=1}^{n} \omega_j},$$

and the inhibitory pooling operation, which can be expressed as the dot product between the $n$-dimensional weight vector $\nu_n = [\nu_{1,n}, \ldots, \nu_{n,n}]^\top$ and the error vector $\delta_n = [\delta_1, \ldots, \delta_n]^\top$:

$$s_n = \nu_n \cdot \delta_n = \sum_{i=1}^{n} \nu_{i,n} \delta_i.$$

Then, in the limit as $n \to \infty$, the pooled signal converges to the empirical mean:

$$\lim_{n \to \infty} s_n = \bar{\delta}.$$

Proof: We first rewrite the pooled signal as a ratio:

$$s_n = \sum_{i=1}^{n} \nu_{i,n}\delta_i = \frac{\sum_{i=1}^{n}\omega_i \delta_i}{\sum_{j=1}^{n}\omega_j}.$$

Dividing the numerator and denominator by $n$ gives

$$s_n = \frac{\frac{1}{n}\sum_{i=1}^{n}\omega_i\delta_i}{\frac{1}{n}\sum_{j=1}^{n}\omega_j}.$$

Convergence of the denominator. By Kolmogorov's Strong Law of Large Numbers (SLLN) (Kolmogoroff, 1933),

$$\frac{1}{n}\sum_{j=1}^{n}\omega_j \to \mu \quad \text{as } n \to \infty.$$

Convergence of the numerator. Decompose:

$$\frac{1}{n}\sum_{i=1}^{n}\omega_i\delta_i = \frac{1}{n}\sum_{i=1}^{n}(\omega_i - \mu)\delta_i + \mu\,\frac{1}{n}\sum_{i=1}^{n}\delta_i.$$
Because the sequence $\{\delta_i\}$ is bounded, say $|\delta_i| \le C$, we have

$$\mathrm{Var}\big((\omega_i - \mu)\delta_i\big) \le C^2\,\mathrm{Var}(\omega_i) < \infty.$$

Thus the random variables $(\omega_i - \mu)\delta_i$ are independent, mean-zero, and uniformly square-integrable. Kolmogorov's Strong Law of Large Numbers for independent, non-identically distributed random variables therefore implies

$$\frac{1}{n}\sum_{i=1}^{n}(\omega_i - \mu)\delta_i \to 0.$$

By assumption on $\{\delta_i\}$,

$$\frac{1}{n}\sum_{i=1}^{n}\delta_i \to \bar{\delta},$$

hence

$$\frac{1}{n}\sum_{i=1}^{n}\omega_i\delta_i \to \mu\bar{\delta}.$$

Taking the ratio. Since the denominator converges to $\mu > 0$, we obtain

$$s_n = \frac{\frac{1}{n}\sum_{i=1}^{n}\omega_i\delta_i}{\frac{1}{n}\sum_{j=1}^{n}\omega_j} \to \frac{\mu\bar{\delta}}{\mu} = \bar{\delta}.$$

(End of Proof 1)

References

Atallah, B. V., Bruns, W., Carandini, M., and Scanziani, M. (2012). Parvalbumin-expressing interneurons linearly transform cortical responses to visual stimuli. Neuron, 73(1):159–170.

Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.

Burg, M. F., Cadena, S. A., Denfield, G. H., Walker, E. Y., Tolias, A. S., Bethge, M., and Ecker, A. S. (2021). Learning divisive normalization in primary visual cortex. PLoS Computational Biology, 17(6):e1009028.

Carandini, M. and Heeger, D. J. (1994). Summation and division by neurons in primate visual cortex. Science, 264(5163):1333–1336.

Carandini, M. and Heeger, D. J. (2012). Normalization as a canonical neural computation. Nature Reviews Neuroscience, 13(1):51–62.

Cornford, J., Kalajdzievski, D., Leite, M., Lamarquette, A., Kullmann, D., and Richards, B. (2021). Learning to live with Dale's principle: ANNs with separate excitatory and inhibitory units. In ICLR 2021 - 9th International Conference on Learning Representations. ICLR.

Denève, S. and Machens, C. K. (2016). Efficient codes and balanced networks. Nature Neuroscience, 19(3):375–382.

El-Boustani, S. and Sur, M. (2014). Response-dependent dynamics of cell-specific inhibition in cortical networks in vivo. Nature Communications, 5(1):5689.
Guerguiev, J., Lillicrap, T. P., and Richards, B. A. (2017). Towards deep learning with segregated dendrites. eLife, 6:e22901.

Karnani, M. M., Agetsuma, M., and Yuste, R. (2014). A blanket of inhibition: functional inferences from dense inhibitory connectivity. Current Opinion in Neurobiology, 26:96–102.

Kolmogoroff, A. (1933). Grundbegriffe der Wahrscheinlichkeitsrechnung.

Lillicrap, T. P., Cownden, D., Tweed, D. B., and Akerman, C. J. (2016). Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications, 7(1):13276.

Lillicrap, T. P., Santoro, A., Marris, L., Akerman, C. J., and Hinton, G. (2020). Backpropagation and the brain. Nature Reviews Neuroscience, 21(6):335–346.

Overstreet-Wadiche, L. and McBain, C. J. (2015). Neurogliaform cells in cortical circuits. Nature Reviews Neuroscience, 16(8):458–468.

Payeur, A., Guerguiev, J., Zenke, F., Richards, B. A., and Naud, R. (2021). Burst-dependent synaptic plasticity can coordinate learning in hierarchical circuits. Nature Neuroscience, 24(7):1010–1019.

Pouille, F., Watkinson, O., Scanziani, M., and Trevelyan, A. J. (2013). The contribution of synaptic location to inhibitory gain control in pyramidal cells. Physiological Reports, 1(5).

Raimondo, J. V., Markram, H., and Akerman, C. J. (2012). Short-term ionic plasticity at GABAergic synapses. Frontiers in Synaptic Neuroscience, 4:5.

Richards, B. A., Lillicrap, T. P., Beaudoin, P., Bengio, Y., Bogacz, R., Christensen, A., Clopath, C., Costa, R. P., de Berker, A., Ganguli, S., et al. (2019). A deep learning framework for neuroscience. Nature Neuroscience, 22(11):1761–1770.

Richards, B. A., Voss, O. P., and Akerman, C. J. (2010). GABAergic circuits control stimulus-instructed receptive field development in the optic tectum. Nature Neuroscience, 13(9):1098–1106.

Shen, Y., Wang, J., and Navlakha, S. (2021).
A correspondence between normalization strategies in artificial and biological neural networks. Neural Computation, 33(12):3179–3203.

van Vreeswijk, C. and Sompolinsky, H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Computation, 10(6):1321–1371.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

Vogels, T. P., Sprekeler, H., Zenke, F., Clopath, C., and Gerstner, W. (2011). Inhibitory plasticity balances excitation and inhibition in sensory pathways and memory networks. Science, 334(6062):1569–1573.

Whittington, J. C. and Bogacz, R. (2017). An approximation of the error backpropagation algorithm in a predictive coding network with local Hebbian synaptic plasticity. Neural Computation, 29(5):1229–1262.

Wilson, N. R., Runyan, C. A., Wang, F. L., and Sur, M. (2012). Division and subtraction by distinct cortical inhibitory networks in vivo. Nature, 488(7411):343–348.

Wu, Y. and He, K. (2018). Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19.

Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., and Liu, T. (2020). On layer normalization in the transformer architecture. In International Conference on Machine Learning, pages 10524–10533. PMLR.

Xu, J., Sun, X., Zhang, Z., Zhao, G., and Lin, J. (2019). Understanding and improving layer normalization. Advances in Neural Information Processing Systems, 32.
