BEAN: Interpretable Representation Learning with Biologically-Enhanced Artificial Neuronal Assembly Regularization


Authors: Yuyang Gao, Giorgio A. Ascoli, Liang Zhao

Yuyang Gao¹, Giorgio A. Ascoli² and Liang Zhao¹*

¹ Department of Information Sciences and Technology, George Mason University, Fairfax, VA, United States
² Center for Neural Informatics, Bioengineering Department, and Krasnow Institute for Advanced Study, George Mason University, Fairfax, VA, United States

*Correspondence: Liang Zhao, lzhao9@gmu.edu

ABSTRACT

Deep neural networks (DNNs) are known for extracting useful information from large amounts of data. However, the representations learned in DNNs are typically hard to interpret, especially in dense layers. One crucial issue of classical DNN models such as the multilayer perceptron (MLP) is that neurons in the same layer are conditionally independent of each other, which makes co-training and the emergence of higher modularity difficult. In contrast to DNNs, biological neurons in mammalian brains display substantial dependency patterns. Specifically, biological neural networks encode representations by so-called neuronal assemblies: groups of neurons interconnected by strong synaptic interactions and sharing joint semantic content. The resulting population coding is essential for human cognitive and mnemonic processes. Here, we propose a novel Biologically-Enhanced Artificial Neuronal assembly (BEAN) regularization¹ to model neuronal correlations and dependencies, inspired by cell assembly theory from neuroscience. Experimental results show that BEAN enables the formation of interpretable neuronal functional clusters and consequently promotes a sparse, memory/computation-efficient network without loss of model performance. Moreover, our few-shot learning experiments demonstrate that BEAN can also enhance the generalizability of the model when training samples are extremely limited.
1 INTRODUCTION

Deep neural networks (DNNs) are known for extracting useful information from large amounts of data [Bengio et al. (2013)]. Despite the success and popularity of DNNs in a wide variety of fields, including computer vision [Krizhevsky et al. (2012); He et al. (2016)] and natural language processing [Collobert and Weston (2008); Young et al. (2018)], modern DNNs still have many drawbacks and limitations, including lack of interpretability [Zhang and Zhu (2018)], the requirement of large data [Kimura et al. (2018)], and post selection on complex model architectures [Zheng and Weng (2016b,a)]. Specifically, the representations learned in DNNs are typically hard to interpret, especially in dense (fully connected) layers. Despite recent attempts to build intrinsically more interpretable convolutional units [Zhang and Zhu (2018); Sabour et al. (2017)], the exploration of learned representations in the dense layer has remained limited. In fact, dense layers are a fundamental and critical component of most state-of-the-art DNNs, typically used for the late stage of the network's computation, akin to the inference and decision-making processes [Krizhevsky et al. (2012); Simonyan and Zisserman (2014); He et al. (2016)]. Thus, improving the interpretability of the dense layer representation is crucial if we are to fully understand and exploit the power of DNNs. However, interpreting the representations learned in dense layers of DNNs is typically a very challenging task.

¹ Please find the source code at https://drive.google.com/file/d/115bNbyiXj-Ip1RMExj3c18a6hliiNneo/view?usp=sharing.
One crucial issue of classical DNN models such as the multilayer perceptron (MLP) is that neurons in the same layer are conditionally independent of each other, as dense layers in an MLP are typically activated by all-to-all feed-forward neuron activity and trained by all-to-all feedback weight adjustment. In this comprehensively 'vertical' connectivity, every node is independent and abstracted 'out of the context' of the other nodes. This issue limits the analysis of the representation learned in DNNs to the single-unit level, as opposed to the higher modularity in principle afforded by neuron population coding. Moreover, recent studies on single-unit importance suggest that individually selective units may have little correlation with overall network performance [Morcos et al. (2018); Zhou et al. (2018)]. Specifically, [Morcos et al. (2018); Zhou et al. (2018)] conducted unit-level ablation experiments on CNNs trained on large-scale image datasets and found that ablating any individual unit does not hurt overall classification accuracy. On the other hand, understanding the complex patterns of neuron correlations in biological neural networks (BNNs) has long been a subject of intense interest for neuroscience researchers. Circuitry blueprints in the real brain are 'filtered' by the physical requirements of axonal projections and the consequent need to minimize cable while maximizing connections. One could naively expect that the non-all-to-all limitations imposed in natural neural systems would be detrimental to their computational power. Instead, these limitations make natural neural systems superiorly efficient and allow cell assemblies to emerge. Neuronal assemblies or cell assemblies [Hebb (1949)] can be described as groups of neurons interconnected by strong synaptic interactions and sharing joint semantic content. The resulting population coding is essential for human cognitive and mnemonic processes [Braitenberg (1978)].
In this paper, we bridge this crucial gap between DNNs and BNNs by modeling the neuron correlations within each layer of DNNs. Leveraging biologically inspired learning rules from neuroscience and graph theory, we propose a novel Biologically-Enhanced Artificial Neuronal assembly (BEAN) regularization that enforces dependencies among neurons in dense layers of DNNs without substantially altering the conventional architecture. The resulting advantages are threefold:

• Enhancing interpretability and modularity at the neuron population level. Modeling neural correlations and dependencies allows us to better interpret and visualize the learned representation in hidden layers at the neuron population level instead of the single-neuron level. Both qualitative and quantitative analyses show that BEAN enables the formation of identifiable neuronal assembly patterns in the hidden layers, enhancing the modularity and interpretability of the DNN representations.

• Promoting jointly sparse and efficient encoding of rich semantic correlation among neurons. We show that BEAN can promote jointly sparse and efficient encoding of rich semantic correlation among neurons in DNNs, similar to connection patterns in BNNs. BEAN enables the model to parsimoniously leverage available neurons and possible connections through modeling structural correlation, yielding both connection-level and neuron-level sparsity in the dense layers. Experimental results show that BEAN not only enables the formation of neuronal functional clusters that encode rich semantic correlation, but also allows the model to achieve state-of-the-art memory/computational efficiency without loss of model performance.

• Improving model generalizability with few training samples.
Humans and animals can learn and generalize to new concepts with just a few trials of learning, while DNNs generally perform poorly on such tasks. Current few-shot learning techniques in deep learning still rely heavily on a large amount of additional knowledge to work well. For example, transfer-learning-based methods typically leverage a model pre-trained with a large amount of data [Xian et al. (2018); Socher et al. (2013)], and meta-learning-based methods require a large number of additional side tasks [Finn et al. (2017); Snell et al. (2017)]. Here we explore BEAN on the substantially more challenging few-shot learning from scratch task first studied by [Kimura et al. (2018)], where no additional knowledge is provided aside from a few training observations. Extensive experiments show that BEAN has a significant advantage over conventional techniques in improving model generalizability.

2 BIOLOGICALLY-ENHANCED ARTIFICIAL NEURONAL ASSEMBLY REGULARIZATION

This section describes the overall objective of Biologically-Enhanced Artificial Neuronal Assembly (BEAN) regularization as well as its implementation on DNNs, namely the Layer-wise Neuron Correlation and Co-activation Divergence used to model the implicit dependencies between neurons within the same layer.

2.1 Layer-wise Neuron Co-activation Divergence

Due to the physical restrictions imposed by dendrites and axons [Rivera-Alba et al. (2014)] and for energy efficiency, biological neural systems are "parsimonious" and can only afford to form a limited number of connections between neurons. The neuron connectivity patterns of BNNs are intertwined with their activation patterns based on the principle "cells that fire together wire together", which is known as cell assembly theory. It explains and relates to several characteristics and advantages of BNN architecture, such as modularity [Peyrache et al.
(2010)], efficiency, and generalizability, which are precisely the aspects with which current DNNs usually struggle [LeCun et al. (2015)]. To take advantage of the beneficial architectural features of BNNs and overcome the existing drawbacks of DNNs, we propose the Biologically-Enhanced Artificial Neuronal assembly (BEAN) regularization. BEAN ensures that neurons which "wire" together with a high outgoing weight correlation also "fire" together with small divergence in terms of their activation patterns. An example of the artificial neuronal assembly achieved by our method can be seen in Figure 1(d). The regularization is formulated as follows:

L_c^{(l)} = \frac{1}{S N_l^2} \sum_s \sum_i \sum_j A_{i,j}^{(l)} \times d(H_{s,i}^{(l)}, H_{s,j}^{(l)})    (1)

where L_c is the regularization loss; the term A_{i,j}^{(l)} characterizes the wiring strength (the higher the value, the stronger the connection) between two neurons i and j within layer l; and the term d(H_{s,i}^{(l)}, H_{s,j}^{(l)}) models the divergence of their firing patterns (the higher the value, the more different the firing) between the two neurons on input sample s. Thus, by multiplying these two functions, we penalize those neurons with strong connectivity but high activation divergence, in line with the principles of cell assembly theory. S is the total number of input samples, while N_l is the total number of hidden neurons in layer l. Specifically, A_{i,j}^{(l)} defines the connectivity relation between neuron i and neuron j in the DNN; it is instantiated by our newly proposed "Layer-wise Neuron Correlation" and will be elaborated in Sections 2.2 and 2.3.
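As a concrete reading of Equation 1, the penalty can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' released code: the function name is ours, A is assumed to be a precomputed correlation matrix (defined later in Sections 2.2 and 2.3), and the squared difference is used as the divergence d.

```python
import numpy as np

def bean_loss(H, A):
    """Hypothetical sketch of Equation 1: penalize pairs of neurons that
    are strongly wired (high A[i, j]) but whose activations diverge on
    the same inputs.

    H : (S, N) array, activations of layer l for S input samples.
    A : (N, N) layer-wise neuron correlation matrix.
    Uses the squared difference as the divergence d.
    """
    S, N = H.shape
    # D[s, i, j] = (H[s, i] - H[s, j]) ** 2 : co-activation divergence
    D = (H[:, :, None] - H[:, None, :]) ** 2
    # Weight each pairwise divergence by the wiring strength, normalize by S * N_l^2
    return float((A[None, :, :] * D).sum() / (S * N ** 2))
```

Note that the loss vanishes when strongly correlated neurons fire identically, which is exactly the "wire together, fire together" behavior the regularizer rewards.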
On the other hand, to model the "co-firing" correlation, d(H_{s,i}^{(l)}, H_{s,j}^{(l)}) is defined as the "Layer-wise Neuron Co-activation Divergence", which denotes the difference between the activation patterns H_{s,i}^{(l)} and H_{s,j}^{(l)} of neuron i and neuron j in the l-th layer, respectively. Here H_{s,i}^{(l)} represents the activation of neuron i in layer l for a given input sample s. The function d(x, y) can be a common divergence metric such as the absolute difference or the squared difference. In this study, we show the results for the squared difference in the Experimental Study section; the absolute-difference results follow a similar trend.

Model Training: The general objective function of training a DNN model along with the proposed regularization on fully connected layer l can be written as L = L_{DNN} + \alpha L_c^{(l)}, where L_{DNN} represents the general deep learning model training loss and the hyper-parameter \alpha controls the relative strength of the regularization. Equation 1 can be optimized with backpropagation [Rumelhart et al. (1988)] using the chain rule:

\frac{\partial L_c^{(l)}}{\partial W^{(l+1)}} = \frac{\partial A^{(l)}}{\partial W^{(l+1)}} D^{(l)}, \quad \frac{\partial L_c^{(l)}}{\partial W^{(l)}} = A^{(l)} \frac{\partial D^{(l)}}{\partial H^{(l)}} \frac{\partial H^{(l)}}{\partial W^{(l)}}, \ldots    (2)

where D^{(l)} \in R^{S \times N_l \times N_l}, of which each element is D_{s,i,j}^{(l)} = d(H_{s,i}^{(l)}, H_{s,j}^{(l)}).

REMARK 1. BEAN regularization has several strengths. First, it enforces interpretable neuronal assemblies without the need to introduce sophisticated handcrafted designs into the architecture, as justified later in Section 3.1. In addition, modeling the neuron correlations and dependencies further results in sparse and efficient connectivity in dense layers, which substantially reduces the computation/memory cost of the model, as shown in Section 3.2.
Besides, the encoding of rich semantic correlation among neurons may improve the generalizability of the model when insufficient data and knowledge are provided, as demonstrated later in Section 3.3. Finally, the Layer-wise Neuron Correlation can be efficiently computed with matrix operations, as per Equations 5 and 7, which enables modern GPUs to speed up model training. In practice, we observe negligible run-time overhead from the additional computation needed for BEAN regularization.

2.2 The First-Order Layer-wise Neuron Correlation

This section introduces the formulation of the layer-wise neuron correlation A_{i,j}^{(l)} between any pair of neurons i and j. In the human brain, the correlation between two neurons depends on the wiring between them [Buzsáki (2010)] and hence is typically treated as a binary value in BNN studies, with "1" indicating the presence of a connection and "0" the absence, so the correlation among a group of neurons can be represented by the corresponding adjacency matrix. Although there is typically no direct connection between neurons within the same layer of DNNs, it is possible to model neuron correlations based on their connectivity patterns to the next layer. This resembles a common approach in network science, where it is useful to consider the relationships between nodes based on their common neighbors in addition to their direct connections. One classic concept widely used to describe such a pattern is called triadic closure [Granovetter (1977)]. As shown in Figure 1(b), triadic closure can be interpreted here as a property among three nodes i, j, and k, such that if connections exist between i−k and j−k, there is also a connection between i−j.
Figure 1. An illustration of how the proposed constraint drew inspiration from BNNs and bipartite graphs. (a) Neuron correlations in BNNs correspond to connections between dendrites (blue lines) and axons (red lines). (b) and (c) Analogy of panel (a) represented as connections between layers in DNNs; although nodes i and j cannot form direct links, they can be correlated by a given node k as a first-order correlation, or by two nodes k and m as a second-order correlation, which is also equivalent to a 4-cycle in bipartite graphs. (d) An example of a learned neuronal assembly in the neurons' outgoing weight space, with the dimensionality reduced to 2D with t-SNE [Maaten and Hinton (2008)]. Each point represents one neuron, and neurons are colored according to their highest activated class in the test data.

We take this scheme a step further to model the correlations between neurons within the same layer by their connections to the neurons in the next layer.
This can be considered loosely analogous to the degree of similarity of the axonal connection pattern of biological neurons in BNNs [Rees et al. (2017)]. To simulate the relative strength of such connections in DNNs, we introduce a function f(·) that converts the actual weights into a relative connectivity strength. Suppose the matrix W^{(l+1)} \in R^{N_l \times N_{l+1}} represents all the weights between neurons in layers l and l+1 of the DNN, where N_l and N_{l+1} represent the respective numbers of neurons. The relative connectivity strength can be estimated by the following equation²:

f(W^{(l+1)}) = |\tanh(\gamma W^{(l+1)})|    (3)

where |·| is the element-wise absolute-value operator; tanh(·) is the element-wise hyperbolic tangent function; and γ is a scalar that controls the curvature of the hyperbolic tangent. The values of f(W^{(l+1)}) \in R^{N_l \times N_{l+1}} are all positive and in the range [0, 1), with each value simulating the relative connectivity strength of the corresponding synapse between neurons. Although DNN weights can be positive or negative, our assumption on connection strength follows the typical approach of BNN studies, which measures the presence or absence of a connection as mentioned above. Moreover, since DNN optimization requires continuous rather than discrete values to keep the function differentiable, we use Equation (3) to convert the concept of presence/absence of connections into a relative connection strength. More specifically, instead of treating a connection as either "1" (indicating the presence of a connection) or "0" (indicating its absence), we treat the output of Equation (3) as the strength of that connection, where high values (i.e., close to "1") indicate a strong connection and low values (i.e., close to "0") indicate a weak or absent connection.
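Equation 3 is a one-liner in practice. The sketch below (function name ours) shows the mapping and its two relevant properties: the sign of a weight is discarded, and γ sharpens how quickly a weight saturates toward strength 1.

```python
import numpy as np

def rel_strength(W, gamma=1.0):
    """Sketch of Equation 3: map raw weights to a relative connectivity
    strength in [0, 1). gamma controls the curvature of tanh."""
    return np.abs(np.tanh(gamma * W))
```

For example, weights of -2 and +2 map to the same strength (≈0.96 at γ=1), while a zero weight maps to strength 0, i.e. "no connection".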
Based on this, we can now give the definition of the layer-wise first-order neuron correlation:

² Similar to the ReLU activation function, our formulation introduces a non-differentiable point at zero; we follow the conventional setting of using the sub-gradient for model optimization.

Definition 1. Layer-wise first-order neuron correlation. For a given neuron i and neuron j in layer l, the layer-wise first-order neuron correlation is given by:

A_{i,j}^{(l)} = \frac{1}{N_{l+1}} \sum_{k=1}^{N_{l+1}} f(W_{i,k}^{(l+1)}) \times f(W_{j,k}^{(l+1)})    (4)

The above formula can be expressed as the product of two matrices:

A^{(l)} = \frac{1}{N_{l+1}} f(W^{(l+1)}) \cdot f(W^{(l+1)})^T    (5)

where · represents matrix multiplication. The layer-wise neuron correlation matrix A^{(l)} is a symmetric square matrix that models all the pairwise correlations of neurons with respect to their corresponding outgoing weights in layer l. Each entry A_{i,j}^{(l)} takes a value in the range [0, 1) and models the correlation between neuron i and neuron j in terms of the similarity of their connectivity patterns; the higher the value, the stronger the correlation between the two. In this setting, two neurons i and j from layer l will be linked and correlated by an intermediate node k from layer l+1 if and only if both edges f(W_{i,k}^{(l+1)}) and f(W_{j,k}^{(l+1)}) are non-zero, and the relative strength can be estimated by f(W_{i,k}^{(l+1)}) \times f(W_{j,k}^{(l+1)}), which lies in the range [0, 1). Since there are N_{l+1}) neurons in layer l+1, each of which can contribute such a connection, summing over all neurons k in layer l+1 yields Equations 4 and 5.
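The matrix form in Equation 5 can be sketched directly (a minimal NumPy sketch with names of our choosing, inlining Equation 3), which also makes the symmetry and the [0, 1) range easy to verify:

```python
import numpy as np

def first_order_corr(W_next, gamma=1.0):
    """Sketch of Equations 4-5. W_next is the (N_l, N_{l+1}) outgoing
    weight matrix of layer l; the result is the symmetric (N_l, N_l)
    first-order correlation matrix A."""
    F = np.abs(np.tanh(gamma * W_next))   # Equation 3
    return F @ F.T / W_next.shape[1]      # A = (1 / N_{l+1}) f(W) f(W)^T
```

Entry A[i, j] equals the average over all intermediate neurons k of f(W[i, k]) f(W[j, k]), i.e. the matrix product reproduces the element-wise sum in Equation 4.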
2.3 The Second-Order Layer-wise Neuron Correlation

Although the first-order correlation can estimate the degree of dependency between each pair of neurons, it may not be sufficient to strictly reflect the degree of grouping or assembly of the neurons. Thus, we further propose a second-order neuron correlation based on the first-order correlation defined in Equations 4 and 5:

Definition 2. Layer-wise second-order neuron correlation. For a given neuron i and neuron j in layer l, the layer-wise second-order neuron correlation is given by:

A_{i,j}^{(l)} = \frac{1}{N_{l+1}^2} \sum_{k,m} f(W_{i,k}^{(l+1)}) \times f(W_{j,k}^{(l+1)}) \times f(W_{i,m}^{(l+1)}) \times f(W_{j,m}^{(l+1)})    (6)

The above formula can be expressed as the product of four matrices:

A^{(l)} = \frac{1}{N_{l+1}^2} \left( f(W^{(l+1)}) \cdot f(W^{(l+1)})^T \right) \odot \left( f(W^{(l+1)}) \cdot f(W^{(l+1)})^T \right)    (7)

where ⊙ represents element-wise matrix multiplication. The second-order correlation provides a stricter criterion for relating neurons, as it requires at least two common neighbor nodes from the layer above to have strong connectivity, whereas the first-order correlation requires just one common neighbor. Moreover, the second-order neuron correlation is closely related both to graph-theory concepts and to a neuroscience-inspired learning rule:

REMARK 2. Graph theory and neuroscience interpretation. Modeling the first-order correlation between two neurons within the same layer is based on their co-connection to a common neighbor neuron in the layer above, which is closely related to the concepts of clustering coefficient [Watts and Strogatz (1998)] and transitivity [Holland and Leinhardt (1971)] in graph theory. On the other hand, modeling the second-order correlation between two neurons involves two common neighbor neurons in the layer
above, which is closely related to counting the 4-cycle pattern where all 4 possible connections in between are taken into account, as shown in Figure 1(b). This 4-cycle pattern is linked to the global clustering coefficients of bipartite networks [Robins and Alexander (2004)], where the set of vertices can be decomposed into two disjoint sets such that no two vertices within the same set are adjacent. Similarly, if we consider the neurons within one layer as the nodes belonging to one set of the bipartite network formed by two adjacent layers of the neural network, forming this 4-cycle will tend to increase the clustering coefficients of the network. Moreover, the second-order correlation is also related to several cognitive neuroscience studies, such as the BIG-ADO learning rule and the principal semantic components of language [Mainetti and Ascoli (2015); Samsonovich et al. (2010)], as well as the notion of discrete neuronal circuits [Pulvermüller and Knoblauch (2009)]. Figure 1(a) illustrates a scenario of the BIG-ADO learning rule in BNNs. The blue blobs represent a connection formed between two neurons (i.e., a synapse), while the dashed circle between neurons j and m represents an Axo-Dendritic Overlap (ADO) (i.e., a potential synapse) between the two neurons. BIG-ADO posits that in order to form a synapse, there must be a potential synapse in place, and the probability of having a potential synapse grows with the second-order correlation. Notably, both of the neuroscience papers cited above relate this learning mechanism to the formation of cell assemblies in the brain, which parallels our observation of neuronal functional clusters among neurons in DNNs when BEAN is imposed, as shown in Figure 1(c) and Figure 6(b).
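Because the double sum in Equation 6 factorizes over k and m, the second-order correlation is simply the element-wise square of the unnormalized first-order product, which is what Equation 7 states. A minimal sketch (names ours), checkable against the brute-force double sum:

```python
import numpy as np

def second_order_corr(W_next, gamma=1.0):
    """Sketch of Equations 6-7: A = (1 / N_{l+1}^2) (FF^T) ⊙ (FF^T),
    where F = f(W^{(l+1)}) from Equation 3."""
    F = np.abs(np.tanh(gamma * W_next))
    G = F @ F.T                            # unnormalized first-order product
    return (G * G) / W_next.shape[1] ** 2  # element-wise square = 4-cycle weight
```

Squaring the first-order term is what makes the criterion stricter: a pair correlated through only one weak common neighbor gets its (already small) score squared.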
3 EXPERIMENTAL STUDY

Our description of the empirical analysis design and results is organized as follows. In Section 3.1, we first characterize the interpretable patterns in the learning outcomes of BEAN regularization on multiple classic image recognition tasks. We then further analyze in Section 3.2 how BEAN can benefit the model by learning sparse and efficient neuron connections. Finally, in Section 3.3 we study the effect of BEAN regularization on improving the generalizability of the model on several few-shot learning from scratch task simulations. We refer to the two distinct BEAN variations as BEAN-1 and BEAN-2, based on the two proposed layer-wise neuron correlations defined by Equation 5 and Equation 7, respectively. The value of γ (Equation 3) was set to 1. This paper focuses on examining the effects of the proposed regularization rather than the differences between distinct types of neural network architectures. Hence, we simply adopted several of the most popular neural network architectures for the chosen datasets and did not perform any hyperparameter or system-parameter tuning using the test set; in other words, we did not perform any "post selection" (i.e., selectively reporting model results based on the test set [Zheng and Weng (2016a,b)]). All network architectures used in this paper are fully described in their respective cited references, including the specification of their system parameters. The regularization factors of BEAN and the other baseline methods were chosen based on model performance on the validation set. All experiments were conducted on a 64-bit machine with an Intel(R) Xeon(R) W-2155 3.30GHz CPU, 32GB of memory, and an NVIDIA TITAN Xp GPU.
3.1 The Interpretable Patterns of BEAN Regularization

Due to the highly complex computation among numerous layers of neurons in traditional DNNs, it is typically difficult to understand how the network learned what it remembers, and the system is more commonly treated as a black-box model [Zhang et al. (2018)]. Here, to ascertain the effect of BEAN regularization on the interpretability of network dynamics, we analyze the differences in the neuronal representation properties of DNNs with and without BEAN regularization. We conducted experiments on three classic image recognition tasks on the MNIST [LeCun et al. (1998)], Fashion-MNIST [Xiao et al. (2017)], and CIFAR-10 [Krizhevsky and Hinton (2009)] datasets, starting with three predefined network architectures as listed below:

1. An MLP with one hidden layer of 500 neurons with ReLU activation function for the MNIST and Fashion-MNIST datasets.
2. A LeNet-5 [LeCun et al. (1998)] for the MNIST and Fashion-MNIST datasets.
3. A ResNet18 [He et al. (2016)] for the CIFAR-10 dataset.

The Adam optimizer [Kingma and Ba (2014)] was used with a learning rate of 0.0005 and a batch size of 100 for model training until training-loss convergence was achieved; BEAN was applied to all the dense layers of each model.

3.1.1 Biological plausibility of the learned neuronal assemblies

By analyzing the neurons' connectivity patterns based on their outgoing weights, we discovered neuronal assemblies in the dense layers where BEAN regularization was enforced. Specifically, for both datasets, we found that the neuronal assemblies in the last dense layer could be best described by 10 clusters with K-means clustering [MacQueen et al. (1967)], validated by Silhouette analysis [Rousseeuw (1987)]. Silhouette analysis is a widely used method for interpreting and validating the consistency of clusters of data.
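For reference, the Silhouette value behind this validation can be computed in plain NumPy. This is a minimal sketch (function name ours; a standard library implementation would normally be used): for each point, a is the mean distance to its own cluster, b the lowest mean distance to any other cluster, and the score (b − a) / max(a, b) approaches 1 for well-separated clusters.

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-point Silhouette values for a cluster assignment (sketch).
    X : (n, d) points, labels : (n,) cluster ids."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    scores = np.zeros(len(X))
    for i, c in enumerate(labels):
        same = labels == c
        same[i] = False                    # exclude the point itself
        if not same.any():                 # singleton cluster: score 0 by convention
            continue
        a = D[i, same].mean()              # mean intra-cluster distance
        b = min(D[i, labels == other].mean()
                for other in np.unique(labels) if other != c)
        scores[i] = (b - a) / max(a, b)
    return scores
```

Averaging these per-point values gives the Silhouette index reported below, e.g. values around 0.9 for tight, well-separated neuron clusters.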
The technique provides a succinct graphical representation of how well each object has been classified. As shown in Figure 2, we visualized the K-means clustering results in the neurons' weight space of the dense layer on both the MNIST (top) and CIFAR-10 (bottom) datasets. Each data point in the figure indicates one single neuron, and the color indicates its cluster assignment by the clustering algorithm. The Silhouette value is further used to assess the quality of the clustering assignment: high Silhouette values support the existence of clear clusters in the data points, which here correspond to neural assembly patterns among neurons. Both BEAN-1 and BEAN-2 enforced neuronal assemblies for various models on several datasets, yielding Silhouette indices around 0.9, which indicates strong clustering patterns among neurons in the dense layers where BEAN regularization was applied. In contrast, training conventional DNN models with the same architectures yielded Silhouette indices near 0.5, indicating no clear clustering patterns in conventional dense layers of deep neural networks. Moreover, we found co-activation behavior of neurons within each neuronal assembly that is both interpretable and biologically plausible. Figure 3 shows the visualization of neuron co-activation patterns found in the last dense layer of the LeNet-5+BEAN-2 model on the MNIST dataset. For the samples of each specific class, only the neurons in the specific neuron group associated with that digit class have high activation, while all the other neurons remain silent. This strong correlation between each unique assembly and each unique class concept allows straightforward interpretation of the neuron populations in the dense layers. From the neuroscience perspective, these co-activation patterns and the association between high-level concepts and neuron groups may reflect similar co-firing patterns observed in biological neural systems [Peyrache et al.
(2010)] and underscore the strong association between neuronal assemblies and concepts [Tononi and Sporns (2003)] in biological neural networks.

Figure 2. Neuronal assembly patterns found in the neurons' weight space of the dense layer of different models on both the MNIST (top) and CIFAR-10 (bottom) datasets, along with clustering validation via Silhouette score for 10-cluster K-means clustering. The dimensionality of the neurons' weight space was reduced to 2D with t-SNE for visualization.

We also found a strong correlation between neuronal assemblies and class selectivity indices. The selectivity index was originally proposed and used in systems neuroscience [De Valois et al. (1982); Freedman and Assad (2006)]. Recently, machine learning researchers have also studied unit class selectivity [Morcos et al. (2018); Zhou et al. (2018)] as a metric for interpreting the behaviors of single units in deep neural networks. Mathematically, it is calculated as selectivity = (\mu_{max} - \mu_{-max}) / (\mu_{max} + \mu_{-max}), where

Figure 3.
Neuron co-activ ation patterns found in the representation of the last dense layer of LeNet- 5+BEAN-2 model on MNIST dataset. The dimensionality of neurons’ weight space was reduced to 2D with T -SNE for visualization. Each point represents one neuron within the last dense layer of the model and is colored based on its acti v ation scale. The 10 subplots sho w the av erage activ ation heat-maps when each digit’ s samples were fed into the model. The warmer color indicates a higher neuron acti vation. µ max represents the highest class-conditional mean activity and µ − max represents the mean activity across all other classes. T o better visualize how high-le vel concepts are associated with the learned neuron assemblies, we further labeled each neuron with the class in which it achiev ed its highest class-conditional mean activity µ max in the test data. Figure 4 shows the results for the last dense layer of the models trained with both datasets. W e found that the neuronal assembly could be well described based on selectivity . The strong association between neuronal assemblies and neurons’ selecti vity index further demonstrated the biological plausibility of the learning outcomes of BEAN regularization. Moreover , the strong neuron activ ation patterns towards each individual high-le vel concepts or classes could in principle enable one to better understand what each indi vidual neuron has learned to represent. Ho wev er , more relev ant to and consistent with our regularization, these selecti ve acti vation patterns re veal ho w a group of neurons (i.e. neuronal assembly) together capture the whole picture of each high-le vel concept, such as the ‘bird’ class in CIF AR-10 as shown in Figure 4. In this subsection, we ha ve demonstrated the promising effect of the proposed BEAN regularization on forming the neural assembly patterns among the neurons in the last layer of the network and their correspondence with biological neural netw orks. 
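The selectivity index above is easy to reproduce from recorded activations. Below is a minimal NumPy sketch, assuming a hypothetical `activations` matrix of per-sample neuron activities and integer class `labels`; interpreting μ_−max as the mean of the other classes' class-conditional means is one common reading:

```python
import numpy as np

def class_selectivity(activations, labels):
    """Per-neuron selectivity = (mu_max - mu_-max) / (mu_max + mu_-max).

    activations: (n_samples, n_neurons) non-negative activities (e.g. post-ReLU)
    labels:      (n_samples,) integer class ids
    """
    classes = np.unique(labels)
    # class-conditional mean activity, shape (n_classes, n_neurons)
    mu = np.stack([activations[labels == c].mean(axis=0) for c in classes])
    mu_max = mu.max(axis=0)
    # mean activity across all classes other than each neuron's preferred one
    mu_minus_max = (mu.sum(axis=0) - mu_max) / (len(classes) - 1)
    # small epsilon guards against neurons that never fire
    return (mu_max - mu_minus_max) / (mu_max + mu_minus_max + 1e-12)
```

A neuron that fires only for its preferred class scores near 1; one that fires equally for all classes scores 0.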
Although the effect of BEAN regularization is not yet clear on the lower layers of the networks, it will be interesting in the future to explore additional relations between computational function and the architecture of earlier processing stations in biological neural systems.

Figure 4. The strong association between neuronal assemblies and the neurons' class selectivity indices with BEAN regularization on both the MNIST (left) and CIFAR-10 (right) datasets. Each point represents one neuron, and the color represents the class for which the neuron achieved its highest class-conditional mean activity in the test data.

Figure 5. The ablation study at the neuron population level of the last dense layer of the LeNet-5 models. Each time, one distinct group of neurons was ablated based on their most selective class, and the model performance changes for each individual class were recorded.

3.1.2 Quantitative analysis of interpretability

Experimental neuropsychologists commonly use an ablation protocol when studying neural function, whereby parts of the brain are removed to investigate the cognitive effects. Similar ablation studies have also been adapted for interpreting deep neural networks, for example to understand which layers or units are critical for model performance [Girshick et al. (2014); Morcos et al. (2018); Zhou et al. (2018)]. To quantitatively evaluate and compare interpretability, we performed an ablation study at the neuron population level, each time ablating one distinct group of neurons and recording the consequent model performance changes for each class. As shown in Figure 4, we identified neuron groups via class selectivity and performed neuron population ablation accordingly. Figure 5 shows the results of all 10 ablation runs for each class in the MNIST dataset. As also reported by [Morcos et al. (2018)], for conventional deep neural networks there is indeed no clear association between a neuron's selectivity and its importance to the overall model performance, as revealed by neuron population ablation. However, when BEAN regularization was utilized during training, such an association clearly emerged, especially for BEAN-2. This is because BEAN-2, with its second-order correlation, enforces stricter neuron correlations than BEAN-1, enabling groups of neurons to represent more compact and disentangled concepts, such as handwritten digits.
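The neuron-population ablation protocol can be illustrated on a toy linear readout: zero the outgoing weights of one neuron group, then compare per-class accuracy before and after. The function and variable names below are hypothetical, and the example is a simplified sketch rather than the paper's exact procedure:

```python
import numpy as np

def ablate_group(W, group_idx):
    """Zero all outgoing weights of the neurons in `group_idx`.

    W: (n_neurons, n_classes) weight matrix of the layer above.
    """
    W_ablated = W.copy()
    W_ablated[group_idx, :] = 0.0
    return W_ablated

def per_class_accuracy(logits, labels, n_classes):
    """Accuracy computed separately for each class, as in Figure 5."""
    preds = logits.argmax(axis=1)
    return np.array([(preds[labels == c] == c).mean() for c in range(n_classes)])
```

Running `per_class_accuracy` before and after `ablate_group` for each selectivity-defined group reproduces, in spirit, the per-class accuracy-change bars of the ablation study.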
This discovery further demonstrated the interpretability and the concept-level representation in each neuronal assembly learned by applying BEAN regularization. Such a compact and interpretable structure of concept-level information encoding could also benefit the field of disentangled representation learning [Bengio et al. (2013)].

3.2 Learning Sparse and Efficient Networks

To evaluate the effect of BEAN regularization on learning sparse and efficient networks, we conducted experiments on two real-world benchmark datasets, i.e., the MNIST [LeCun et al. (1998)] and Fashion-MNIST [Xiao et al. (2017)] datasets. We compared BEAN with several state-of-the-art regularization methods that can enforce sparse connectivity of the network, including the ℓ1-norm, group sparsity based on the ℓ2,1-norm [Yuan and Lin (2006); Alvarez and Salzmann (2016)], and exclusive sparsity based on the ℓ1,2-norm [Zhou et al. (2010); Kong et al. (2014)]. Notable studies have also investigated combinations of the sparsity terms listed above, such as combining group sparsity and the ℓ1-norm [Scardapane et al. (2017)], and combining group and exclusive sparsity [Yoon and Hwang (2017)]. Such combinatorial study is outside the scope of this work, as our focus is on showing and comparing the effectiveness of a single regularization term applied to the network. To keep the comparison fair and accurate, we used the same base network architecture for all regularization methods tested in this experiment: a predefined fully connected neural network with 3 hidden layers, 500 neurons per layer, and ReLU as the neuron activation function. The regularization methods were applied to all layers of the network, except the bias terms.
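For concreteness, the three baseline penalties differ only in how they aggregate the entries of a weight matrix. A sketch in NumPy, assuming each row of `W` (one neuron's outgoing weights) forms a group; the exact grouping and scaling constants vary across the cited works:

```python
import numpy as np

def l1_norm(W):
    """Plain l1 penalty: promotes weight-level sparsity."""
    return np.abs(W).sum()

def group_sparsity(W):
    """l2,1 penalty: sum of per-group l2 norms, zeroing whole groups."""
    return np.sqrt((W ** 2).sum(axis=1)).sum()

def exclusive_sparsity(W):
    """l1,2 penalty: sum of squared per-group l1 norms, promoting
    competition (sparsity) inside each group."""
    return ((np.abs(W).sum(axis=1)) ** 2).sum()
```

On `W = [[3, 4], [0, 0]]` these give 7, 5, and 49 respectively, which shows how group sparsity rewards an entirely zero row while exclusive sparsity penalizes dense rows quadratically.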
The regularization coefficients were selected through a grid search ranging from 10^-5 to 10^3, based on the model performance on the validation set, as shown in Algorithm 1. To obtain a more reliable and fair result, we ran a total of 20 random weight initializations for every network architecture studied and reported the overall average performance of all 20 runs as the final model performance of each architecture.

Algorithm 1: Pseudocode for searching for the best α value in BEAN

    def hyperparameter_tuner(training_data, validation_data,
                             alpha_list=(0.001, 0.01, 0.1, 1, 10, 100)):
        hp_perf = []
        # train and evaluate on all hyperparameter settings
        for alpha in alpha_list:
            m = train_model(training_data, alpha)
            validation_result = eval_model(m, validation_data)
            hp_perf.append(validation_result)
        # find the best alpha on the validation set
        best_alpha = alpha_list[hp_perf.index(max(hp_perf))]
        return best_alpha

To quantitatively measure the performance of the various sparse regularization techniques, we used three evaluation metrics: the prediction accuracy on test data (i.e., the number of correct predictions divided by the total number of samples in the test data), the ratio of parameters used in the network (i.e., the total number of non-zero weights divided by the total number of weights in the network after training), and the corresponding number of floating-point operations (FLOPs). A higher accuracy means that the model trains a better network for the classification task. A lower FLOP count indicates that the network needs fewer computational operations per forward pass, which reflects computational efficiency. Similarly, a lower parameter usage indicates that the network requires less memory, which reflects memory efficiency. The results are shown in Table 1. For each evaluation metric, the best and second-best results are highlighted in boldface and italic font, respectively.
As can be seen, both BEAN-1 and BEAN-2 achieve high memory and computational efficiency without sacrificing network performance on the classification tasks.

Table 1. Efficient model learning experiments on the MNIST and Fashion-MNIST datasets. The FLOPs and effective parameters (i.e., the number of non-zero parameters) are normalized by the values of the vanilla model. Performance is averaged over 20 runs. The best and second-best results are highlighted in boldface and italic font, respectively.

    Dataset        Measure    Vanilla  ℓ1-norm  Group Sparsity  Exclusive Sparsity  BEAN-1  BEAN-2
    MNIST          accuracy   0.9812   0.9835   0.9813          0.9824              0.9842  0.9823
                   FLOPs      1        0.8106   0.6098          0.4248              0.2212  0.1320
                   parameter  1        0.2921   0.0982          0.1375              0.1496  0.0730
    Fashion-MNIST  accuracy   0.8986   0.8924   0.8925          0.8930              0.8960  0.8916
                   FLOPs      1        0.8011   0.5384          0.5320              0.2913  0.1622
                   parameter  1        0.4357   0.1378          0.2257              0.2592  0.1259

Specifically, BEAN-2 achieved the best memory and computational efficiency, outperforming the baseline models by 25-75% on memory efficiency and 69-84% on computational efficiency on the MNIST dataset, and by 9-71% on memory efficiency and 69-80% on computational efficiency on the Fashion-MNIST dataset. BEAN-1 also achieved a good trade-off between model performance and efficiency, being second-best on computational efficiency and best on model performance on both the MNIST and Fashion-MNIST datasets. Compared with BEAN-2, BEAN-1 leans more toward the model-performance side of this trade-off. This is because the first-order correlation used in BEAN-1 is less restrictive than the higher-order correlation in BEAN-2, as a single support neuron in the layer above is enough to build up a strong correlation. Thus, in practice, using a higher-order correlation might be promising when the objective is to learn a more efficient model.
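The parameter-usage and FLOP metrics reported above can be estimated directly from the trained weight matrices. The sketch below is a simplified reading: one multiply-add per weight of each pruned-down dense sub-matrix, with a neuron treated as pruned within a matrix when all of its incoming or outgoing weights there are zero; cross-layer propagation of pruning is ignored:

```python
import numpy as np

def parameter_ratio(weights):
    """Fraction of non-zero weights across dense layers (memory efficiency)."""
    total = sum(w.size for w in weights)
    return sum(np.count_nonzero(w) for w in weights) / total

def flop_ratio(weights):
    """Multiply-adds after dropping all-zero rows/columns, relative to the
    fully dense model (computational efficiency)."""
    dense = sum(w.size for w in weights)          # one multiply-add per weight
    kept = 0
    for w in weights:                             # w: (n_in, n_out)
        rows = np.flatnonzero(np.abs(w).sum(axis=1))   # active input neurons
        cols = np.flatnonzero(np.abs(w).sum(axis=0))   # active output neurons
        kept += len(rows) * len(cols)
    return kept / dense
```

Note the two ratios can differ: a matrix with scattered non-zeros keeps every neuron alive (high FLOP ratio) even when most individual weights are zero (low parameter ratio).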
Interestingly, BEAN regularization appears to advance the state-of-the-art by an even larger margin in terms of computational efficiency. In fact, BEAN regularization reduces the number of FLOPs needed by the network by automatically "pruning" a substantial proportion of neurons in the hidden layers (where a neuron is considered pruned if either all of its incoming or all of its outgoing weights are zero), owing to the penalization of connections between neurons that encode divergent information. Although group sparsity and exclusive sparsity are designed to achieve a similar objective of neuron-level sparsity, they are less effective than BEAN regularization. This is because BEAN takes into consideration not only the correlations between neurons via their connection patterns but also the consistency of those correlations with their activation patterns.

We have shown in Table 1 that the proposed BEAN regularization can effectively make the connections sparser in the dense layers of artificial neural networks. In general, this 'sparsifying' effect can benefit any model with at least one dense layer in its architecture. Most modern deep neural networks (such as VGG [Simonyan and Zisserman (2014)] and networks trained on ImageNet [Russakovsky et al. (2015)]) can enjoy this sparsity benefit, as the dense layers typically contribute the majority of the model parameters [Cheng et al. (2015)].

3.3 Towards few-shot learning from scratch with BEAN regularization

In an attempt to test the influence of BEAN regularization on the generalizability of DNNs in scenarios where training samples are extremely limited, we conducted a few-shot learning from scratch task, i.e., without the help of any additional side tasks or pre-trained models [Kimura et al. (2018)]. Notice that in the few-shot learning setting, the model typically requires an iterative learning process over the sample set.
In other words, for each individual few-shot learning experiment, only a few image samples per digit are randomly selected to form the training set. The model then iteratively learns from the selected image samples until convergence is achieved. So far, this kind of learning task has rarely been explored, due to the difficulty of its problem setup compared to conventional few-shot learning tasks in which additional data or knowledge can be accessed. Currently, only [Kimura et al. (2018)] have carried out a preliminary exploration, with their proposed Imitation Networks model.

We conducted several simulations of the few-shot learning from scratch task on the MNIST [LeCun et al. (1998)], Fashion-MNIST [Xiao et al. (2017)], and CIFAR-10 [Krizhevsky and Hinton (2009)] datasets. Besides Kimura's Imitation Networks, we also compared BEAN with other conventional regularization techniques commonly used in the deep learning literature: dropout [Srivastava et al. (2014)], weight decay [Krogh and Hertz (1992)], and the ℓ1-norm. As in Section 3.2, we kept the comparison fair and accurate by using a predefined network architecture, namely LeNet-5 [LeCun et al. (1998)], as the base network for all regularization methods studied in this experiment. The regularization terms were applied to all three dense layers of the base LeNet-5 network. Once again, the hyperparameter of each regularization, along with all other system parameters, was selected through a grid search based on the best performance on a predefined 10k validation set sampled from the original training base and completely distinct from both the training samples used in the few-shot learning tasks and the testing set.

Table 2 shows the model performance on several few-shot learning from scratch experiments on the MNIST, Fashion-MNIST, and CIFAR-10 datasets.

Table 2. Few-shot learning from scratch experiments on the MNIST (left), Fashion-MNIST (middle), and CIFAR-10 (right) datasets. Performance is averaged over 20 simulations of randomly sampled training data from the original training base. The best and second-best results for each few-shot learning setting are highlighted in boldface and italic font, respectively.

    Dataset             MNIST                           Fashion-MNIST                   CIFAR-10
    Model               1-shot  5-shot  10-shot 20-shot 1-shot  5-shot  10-shot 20-shot 1-shot  5-shot  10-shot 20-shot
    Vanilla             38.63   70.21   78.97   86.68   39.32   59.02   64.50   70.23   15.60   18.49   22.45   26.39
    Dropout             40.13   72.45   82.04   89.22   40.78   60.04   65.40   71.83   15.10   18.85   22.73   26.01
    Weight decay        39.51   71.76   82.87   90.15   41.31   61.98   67.25   71.88   15.47   19.17   23.74   26.77
    ℓ1-norm             40.96   74.35   81.17   90.68   41.26   62.18   67.30   70.85   15.64   18.95   23.16   26.99
    Imitation networks  44.10   70.40   80.00   86.70   44.80   62.10   68.00   72.50   -       -       -       -
    BEAN-cos            54.05   80.16   86.28   92.22   42.48   65.49   68.97   74.20   18.23   21.45   24.66   28.74
    BEAN-1              54.79   83.42   87.51   92.79   50.57   66.95   69.21   74.25   19.39   21.92   24.81   28.95
    BEAN-2              53.75   80.76   88.08   92.97   49.94   65.98   70.21   75.06   19.28   21.28   25.04   29.23

As can be seen, the proposed BEAN regularization advanced the state-of-the-art by a significant margin on all four few-shot settings across all three datasets. Moreover, BEAN improved performance more markedly when training samples were more limited. For instance, BEAN outperformed all comparison methods by 24-42%, 13-29%, and 24-28% on the 1-shot learning tasks on the MNIST, Fashion-MNIST, and CIFAR-10 datasets, respectively.
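The sampling step of the few-shot-from-scratch protocol, drawing k training images per class from the full training base and repeating over random draws, can be sketched as follows (function name hypothetical):

```python
import numpy as np

def sample_k_shot(labels, k, rng):
    """Return the indices of k randomly chosen samples per class.

    labels: (n_samples,) integer class ids of the full training base
    k:      number of shots per class
    rng:    numpy random Generator, for reproducible draws
    """
    idx = []
    for c in np.unique(labels):
        pool = np.flatnonzero(labels == c)          # all samples of class c
        idx.extend(rng.choice(pool, size=k, replace=False))
    return np.array(idx)
```

Training to convergence on one such subset, then averaging over 20 independent draws, matches the evaluation protocol described for Table 2.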
This observation demonstrates the promising effect of BEAN regularization on improving the generalizability of neural networks when training samples are extremely limited. Another interesting observation is that BEAN-1 generally performed best with extremely limited training samples, as in the 1-shot and 5-shot learning tasks, while BEAN-2 generally performed best with slightly more training samples, as in the 10-shot and 20-shot learning tasks. The reason behind this observation might be the more stringent higher-order correlation, which requires more common neighbor neurons with strong connections to both neurons. Thus, a modestly increased availability of sample observations could enable BEAN-2 to form more effective neuronal assemblies, further improving the model performance.

Furthermore, we studied an additional variant of BEAN, BEAN-cos, which calculates the layer-wise neuron correlation via the cosine similarity between the downstream weights of two neurons. As shown in Table 2, we found that BEAN-cos still yields good performance and beats the other existing regularization methods, obtaining competitive results compared with BEAN-1 and BEAN-2. However, it is inferior to BEAN-1 in the 1-shot and 5-shot settings and inferior to BEAN-2 in the 10-shot and 20-shot settings. This is because BEAN-cos cannot handle the order of the correlation between neurons: using cosine similarity requires treating the outgoing weights of a neuron as a whole (a vector) to compute the pairwise similarity between neurons. Doing so loses the ability to calculate higher-order correlations (such as the second-order correlation) and consequently loses the clean interpretation from graph theory and neuroscience (as described in Remark 2).

To better understand why BEAN regularization could help the seemingly over-parameterized model generalize well on a small sample set, we further analyzed the learned hidden representation of the dense layers where BEAN regularization was employed. We found that BEAN helped the model gain better generalization power in two respects: 1) by automatic sparse and structured connectivity learning, and 2) by weak parameter sharing among neurons within each neuronal assembly. Both aspects enhanced the dense layers to promote efficient and parsimonious connections, which consequently prevented the model from over-fitting on a small training sample size. Figure 6 shows the learned parameters of the last dense layer of LeNet-5+BEAN-2 on the MNIST 10-shot learning task.

Figure 6. Analysis and visualization of the last dense layer of the LeNet-5+BEAN-2 model on the MNIST 10-shot learning from scratch task. (a) Heat-map of the learned second-order neuron correlation matrix: neuron indices are re-ordered for best visualization of neuronal assembly patterns; BEAN is able to enforce plausible assembly patterns that act as functional clusters for the categorical learning task. (b) Visualization of the parsimonious connectivity learned in the dense layer: both neuron-level and weight-level sparsity are simultaneously promoted in the network after applying BEAN regularization. The neurons are grouped and colored by neuronal assemblies. (c) Visualization of the scales of the neurons' outgoing weights: the weights are colored consistently with the neuron groups in (b).
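The BEAN-cos affinity discussed above reduces to pairwise cosine similarity between neurons' outgoing weight vectors, which can be sketched as follows (names hypothetical):

```python
import numpy as np

def cosine_affinity(W):
    """Pairwise cosine similarity between neurons' outgoing weights.

    W: (n_neurons, n_out) matrix; row i holds neuron i's outgoing weights.
    Returns an (n_neurons, n_neurons) affinity matrix in [-1, 1].
    """
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    U = W / np.maximum(norms, 1e-12)    # guard against zero rows
    return U @ U.T
```

Because each row is collapsed into a single vector before comparison, this affinity carries no notion of shared individual neighbors, which is the limitation noted above relative to the second-order correlation.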
As shown in Figure 6 (b), instead of using all possible weights in the dense layer, BEAN caused the model to leverage the weights, and even the neurons, parsimoniously, yielding a biologically plausible sparse and structured connectivity pattern. This is because the learned neuron correlation helped the model disentangle the co-connections between neurons from different assemblies, as shown in Figure 6 (a). Additionally, BEAN enhanced parameter sharing among neurons within each assembly, as demonstrated in Figure 6 (c). For instance, the neurons in the red-colored assembly all had high positive weights toward class 4, meaning that this group of neurons was helping the model identify Digit 4. Similarly, the neurons in the green-colored assembly were distinguishing between Digits 9 and 7. Such automatic weak parameter sharing not only helped prevent the model from over-fitting but also enabled an intuitive interpretation of the behavior of the system as a whole at a higher modularity level.

3.3.1 Parameter sensitivity study

There are two hyperparameters in the proposed BEAN regularization: 1) α, which balances the regularization loss against the DNN training loss, and 2) γ, which controls the curvature of the hyperbolic tangent function as shown in Equation 3. As already mentioned in the first paragraph of Section 3, γ was set to 1 for all experiments. Thus, the only parameter we need to study is α. Figure 7 shows the accuracy of the model versus α in the few-shot learning setting on the MNIST dataset. Only the results for the 10-shot learning task are shown due to space limitations. Varying α across the range from 0.001 to 100, the best performance is obtained with α = 1 for BEAN-1 and α = 100 for BEAN-2. Specifically, for BEAN-1 we can see a clear trend in which the model performance drops when α is too small or too large. Furthermore, the results show that the performance on the validation set is well aligned with the model performance on the test set, which demonstrates the superior generalizability of the model when applying BEAN regularization. Notably, although in Figure 7 we assessed the model performance under multiple settings of α, we did not use any of the results on the test set to choose any parameters of the model, i.e., no post-selection was performed. We believe post-selection should be completely avoided, as it causes the test set to lose its power to test the model's generalizability to future unseen data.

Table 3. Statistics of the values (test set error rate − validation set error rate) on 10-shot learning on the MNIST dataset from 20 random runs. Other n-shot learning settings follow the same trend.

    Model / Metric    Max     75%-rank  50%-rank  Mean    25%-rank  Min
    Vanilla           0.06%   0.01%     -0.06%    -0.30%  -0.74%    -0.81%
    BEAN-1 (α = 1)    -0.04%  -0.20%    -0.62%    -0.58%  -0.93%    -1.13%
    BEAN-2 (α = 100)  0.10%   -0.02%    -0.42%    -0.48%  -0.97%    -1.35%

Figure 7. Parameter sensitivity study of BEAN regularization on 10-shot learning on the MNIST dataset. Each data point is centered on the mean value, and the error bars measure the standard deviation over 20 runs.

4 CONCLUSION

In this paper, we proposed a novel Biologically Enhanced Artificial Neuronal assembly (BEAN) regularization to model neuronal correlations and dependencies, inspired by cell assembly theory from neuroscience. We showed that BEAN can promote a jointly sparse and efficient encoding of rich semantic correlations among neurons in DNNs, similar to the connection patterns in BNNs.
Experimental results show that BEAN enables the formation of interpretable neuronal functional clusters and consequently promotes a sparse, memory- and computation-efficient network without loss of model performance. Moreover, our few-shot learning experiments demonstrated that BEAN can also enhance the generalizability of the model when training samples are extremely limited. Our regularization method has demonstrated its capability to enhance the modularity of the representations of neurons for image semantic meanings such as digits, animals, and objects. While the generality of the approach introduced here has so far been evaluated on the MNIST and CIFAR datasets, future studies might consider additional experiments on other data such as texts or graphs to demonstrate the broader effectiveness of the proposed method. Another direction to further enhance the model might be to include separate excitatory and inhibitory nodes, as in BNNs, which would allow the implementation of specific microcircuit computational motifs [Ascoli and Atkeson (2005)]. Furthermore, since there are other choices for defining the affinity matrix between neurons in a given layer based on their downstream weights, the question of the best way to compute the affinity matrix is an interesting direction to be studied more comprehensively in future work.

5 ACKNOWLEDGMENTS

This work is supported by the National Institutes of Health grant NS39600, National Science Foundation grants #1755850, #1841520, and #1907805, a Jeffress Trust Award, an NVIDIA GPU Grant, and the Design Knowledge Company (subcontract number: 10827.002.120.04). This manuscript has been released as a pre-print at arXiv: 1909.13698 [Gao et al. (2019)].

REFERENCES

Alvarez, J. M. and Salzmann, M. (2016). Learning the number of neurons in deep networks. In Advances in Neural Information Processing Systems. 2270-2278
Ascoli, G. A. and Atkeson, J. C. (2005). Incorporating anatomically realistic cellular-level connectivity in neural network models of the rat hippocampus. Biosystems 79, 173-181
Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 1798-1828
Braitenberg, V. (1978). Cell assemblies in the cerebral cortex. In Theoretical Approaches to Complex Systems (Springer). 171-188
Buzsáki, G. (2010). Neural syntax: cell assemblies, synapsembles, and readers. Neuron 68, 362-385
Cheng, Y., Yu, F. X., Feris, R. S., Kumar, S., Choudhary, A., and Chang, S.-F. (2015). An exploration of parameter redundancy in deep networks with circulant projections. In Proceedings of the IEEE International Conference on Computer Vision. 2857-2865
Collobert, R. and Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning (ACM), 160-167
De Valois, R. L., Yund, E. W., and Hepler, N. (1982). The orientation and direction selectivity of cells in macaque visual cortex. Vision Research 22, 531-544
Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70 (JMLR.org), 1126-1135
Freedman, D. J. and Assad, J. A. (2006). Experience-dependent representation of visual categories in parietal cortex. Nature 443, 85
Gao, Y., Ascoli, G., and Zhao, L. (2019). BEAN: Interpretable representation learning with biologically-enhanced artificial neuronal assembly regularization. arXiv preprint
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014).
Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 580-587
Granovetter, M. S. (1977). The strength of weak ties. In Social Networks (Elsevier). 347-367
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770-778
Hebb, D. O. (1949). The Organization of Behavior: A Neuropsychological Theory
Holland, P. W. and Leinhardt, S. (1971). Transitivity in structural models of small groups. Comparative Group Studies 2, 107-124
Kimura, A., Ghahramani, Z., Takeuchi, K., Iwata, T., and Ueda, N. (2018). Few-shot learning of neural networks from scratch by pseudo example optimization. arXiv preprint
Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint
Kong, D., Fujimaki, R., Liu, J., Nie, F., and Ding, C. (2014). Exclusive feature learning on arbitrary structures via ℓ1,2-norm. In Advances in Neural Information Processing Systems. 1655-1663
Krizhevsky, A. and Hinton, G. (2009). Learning Multiple Layers of Features from Tiny Images. Tech. rep., Citeseer
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097-1105
Krogh, A. and Hertz, J. A. (1992). A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems. 950-957
LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature 521, 436
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278-2324
Maaten, L. v. d. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research 9, 2579-2605
MacQueen, J. et al. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (Oakland, CA, USA), vol. 1, 281-297
Mainetti, M. and Ascoli, G. A. (2015). A neural mechanism for background information-gated learning based on axonal-dendritic overlaps. PLoS Computational Biology 11, e1004155
Morcos, A. S., Barrett, D. G., Rabinowitz, N. C., and Botvinick, M. (2018). On the importance of single directions for generalization. arXiv preprint
Peyrache, A., Benchenane, K., Khamassi, M., Wiener, S. I., and Battaglia, F. P. (2010). Principal component analysis of ensemble recordings reveals cell assemblies at high temporal resolution. Journal of Computational Neuroscience 29, 309-325
Pulvermüller, F. and Knoblauch, A. (2009). Discrete combinatorial circuits emerging in neural networks: A mechanism for rules of grammar in the human brain? Neural Networks 22, 161-172
Rees, C. L., Moradi, K., and Ascoli, G. A. (2017). Weighing the evidence in Peters' rule: does neuronal morphology predict connectivity? Trends in Neurosciences 40, 63-71
Rivera-Alba, M., Peng, H., de Polavieja, G. G., and Chklovskii, D. B. (2014). Wiring economy can account for cell body placement across species and brain areas. Current Biology 24, R109-R110
Robins, G. and Alexander, M. (2004). Small worlds among interlocking directors: Network structure and distance in bipartite graphs. Computational & Mathematical Organization Theory 10, 69-94
Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20, 53-65
Rumelhart, D. E., Hinton, G. E., Williams, R. J., et al.
(1988). Learning representations by back-propagating errors. Cognitive modeling 5, 1 Russako vsky , O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). Imagenet large scale visual recognition challenge. International journal of computer vision 115, 211–252 Sabour , S., Frosst, N., and Hinton, G. E. (2017). Dynamic routing between capsules. In Advances in neur al information pr ocessing systems . 3856–3866 Samsonovich, A. V ., Goldin, R. F ., and Ascoli, G. A. (2010). T ow ard a semantic general theory of e verything. Complexity 15, 12–18 Scardapane, S., Comminiello, D., Hussain, A., and Uncini, A. (2017). Group sparse regularization for deep neural networks. Neur ocomputing 241, 81–89 Simonyan, K. and Zisserman, A. (2014). V ery deep con volutional networks for large-scale image recognition. arXiv pr eprint Snell, J., Swersky , K., and Zemel, R. (2017). Prototypical networks for fe w-shot learning. In Advances in Neural Information Pr ocessing Systems . 4077–4087 Socher , R., Ganjoo, M., Manning, C. D., and Ng, A. (2013). Zero-shot learning through cross-modal transfer . In Advances in neural information pr ocessing systems . 935–943 Sri v astav a, N., Hinton, G., Krizhe vsky , A., Sutske ver , I., and Salakhutdinov , R. (2014). Dropout: a simple way to pre vent neural networks from ov erfitting. The journal of machine learning r esear ch 15, 1929–1958 T ononi, G. and Sporns, O. (2003). Measuring information integration. BMC neur oscience 4, 31 W atts, D. J. and Strogatz, S. H. (1998). Collectiv e dynamics of ‘small-world’networks. natur e 393, 440 Xian, Y ., Lampert, C. H., Schiele, B., and Akata, Z. (2018). Zero-shot learning-a comprehensi ve e valuation of the good, the bad and the ugly . IEEE transactions on pattern analysis and mac hine intelligence Xiao, H., Rasul, K., and V ollgraf, R. (2017). F ashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv pr eprint Y oon, J. and Hwang, S. J. (2017). 
Combined group and exclusi ve sparsity for deep neural networks. In Pr oceedings of the 34th International Confer ence on Machine Learning-V olume 70 (JMLR. org), 3958–3966 Y oung, T ., Hazarika, D., Poria, S., and Cambria, E. (2018). Recent trends in deep learning based natural language processing. ieee Computational intelligenCe magazine 13, 55–75 Y uan, M. and Lin, Y . (2006). Model selection and estimation in regression with grouped v ariables. J ournal of the Royal Statistical Society: Series B (Statistical Methodology) 68, 49–67 Zhang, Q., Nian W u, Y ., and Zhu, S.-C. (2018). Interpretable con volutional neural networks. In Pr oceedings of the IEEE Confer ence on Computer V ision and P attern Recognition . 8827–8836 Zhang, Q.-s. and Zhu, S.-C. (2018). V isual interpretability for deep learning: a surv ey . F r ontiers of Information T echnology & Electr onic Engineering 19, 27–39 Zheng, Z. and W eng, J. (2016a). Challenges in visual parking and how a dev elopmental netw ork approaches the problem. In 2016 International J oint Confer ence on Neural Networks (IJCNN) (IEEE), 4593–4600 Zheng, Z. and W eng, J. (2016b). Mobile device based outdoor navigation with on-line learning neural network: A comparison with con volutional neural netw ork. In Pr oceedings of the IEEE Confer ence on Computer V ision and P attern Recognition W orkshops . 11–18 Zhou, B., Sun, Y ., Bau, D., and T orralba, A. (2018). Revisiting the importance of indi vidual units in cnns via ablation. arXiv pr eprint Zhou, Y ., Jin, R., and Hoi, S. C.-H. (2010). Exclusi ve lasso for multi-task feature selection. In Pr oceedings of the Thirteenth International Confer ence on Artificial Intelligence and Statistics . 988–995 This is a provisional file , not the final typeset ar ticle 18