Robustness in sparse artificial neural networks trained with adaptive topology

Bendegúz Sulyok 1, Gergely Palla 1,2, Filippo Radicchi 3, and Santo Fortunato 3

1 Dept. of Biological Physics, Eötvös Loránd University, Budapest, Hungary
2 Semmelweis University, Faculty of Health and Public Administration, Health Services Management Training Centre, Budapest, Hungary
3 Center for Complex Networks and Systems Research, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, USA

February 26, 2026

Abstract

We investigate the robustness of sparse artificial neural networks trained with adaptive topology. We focus on a simple yet effective architecture consisting of three sparse layers with 99% sparsity followed by a dense layer, applied to image classification tasks such as MNIST and Fashion MNIST. By updating the topology of the sparse layers between each epoch, we achieve competitive accuracy despite the significantly reduced number of weights. Our primary contribution is a detailed analysis of the robustness of these networks, exploring their performance under various perturbations including random link removal, adversarial attack, and link weight shuffling. Through extensive experiments, we demonstrate that adaptive topology not only enhances efficiency but also maintains robustness. This work highlights the potential of adaptive sparse networks as a promising direction for developing efficient and reliable deep learning models.

1 Introduction

Deep Neural Networks (DNNs) have become the cornerstone of modern artificial intelligence, achieving state-of-the-art performance in diverse domains such as computer vision, natural language processing, and scientific computing [1, 2]. This success has been fueled by a trend of ever-increasing model size and complexity.
Modern models often contain billions of parameters, leading to substantial computational and memory burdens as well as significant energy costs for both training and inference [3]. This over-parameterization not only poses significant challenges for deployment on resource-constrained devices like mobile phones and IoT sensors, but also raises environmental concerns regarding the carbon footprint of AI.

In parallel, the human brain provides an exceptionally efficient and powerful computational system, for which neurological studies have long established that its connectivity is remarkably sparse [4, 5]. A typical neuron connects to only a vanishing fraction of other neurons, forming a complex yet highly efficient network structure. This biological sparsity is believed to be a key principle enabling the brain to learn and perform complex tasks with unparalleled energy efficiency [5, 6, 7].

This contrast between dense, computationally heavy artificial networks and the sparse, efficient biological brain has drawn considerable scientific interest towards sparsity in deep learning [8, 9, 10, 11, 12, 13, 14, 15]. However, it is worth noting that while artificial Sparse Neural Networks (SNNs) are significantly less connected than traditional dense models, they still retain a finite fraction of all possible edges, meaning that they are not "sparse" in the jargon of network science [16]. Real neural networks, on the other hand, are sparse, as they display only a vanishing fraction of all possible connections. Despite this distinction, SNNs have emerged as a promising solution to mitigate the burdens of over-parameterization and are attractive for deployment on resource-constrained devices. By ensuring that a significant portion of the network's weights are zero, SNNs can theoretically achieve dramatic reductions in storage requirements and computational complexity.
Seminal work in this area demonstrated that the sparsification of pre-trained dense models via network pruning can be achieved without a significant loss in accuracy [8, 9]. More recent discoveries, such as the Lottery Ticket Hypothesis, suggest that sparse sub-networks with exceptional potential performance exist within dense models from the very beginning of the training [10]. Accordingly, in the pruning approach to SNNs one starts from a fully connected neural network and gradually removes links while aiming to achieve an optimal balance between final performance and post-training inference speed [11, 12, 13, 14].

An equally important alternative to network pruning is the idea of sparse training, where a sparsely connected network is initiated, ensuring an inherent sparsity throughout the training [17, 18, 19, 20, 21, 22, 23]. A prominent approach in this direction is dynamic sparse training [17, 19, 20], which provides an efficient framework where the network structure is also dynamically evolving during training. Epitopological learning is a variation of dynamic sparse training that is inspired by the plasticity of the brain and the way organic neural networks change their connectivity during learning, forming epitopological engrams (memory traces) [24, 25, 26, 27]. A relatively simple implementation of epitopological learning consists in applying link prediction during training, where one evaluates the likelihood of the existence of missing links based on the current network structure and takes this likelihood into account before the next modification to the network topology. In a slightly more general framework, epitopological learning can be implemented with the help of network automatons that provide connectivity predictions based on the input knowledge and the topological network organization.
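As a concrete illustration of link prediction during training, the sketch below scores unobserved links in a single bipartite layer by counting the length-3 paths between their endpoints (the simple path-counting predictor also mentioned in Methods). This is a deliberate simplification: the actual CH3L3 rule additionally normalizes such counts by the connectivity of the intermediate nodes, so the code should be read as a minimal stand-in, not the published automaton.

```python
import numpy as np

def l3_scores(A):
    """Number of length-3 paths between each (row, column) neuron pair of a
    bipartite layer with biadjacency matrix A, i.e. (A @ A.T @ A)[u, v].
    Existing links are masked out so that only missing links are ranked."""
    scores = A @ A.T @ A
    scores[A > 0] = -np.inf  # rank only links that are currently absent
    return scores

# toy layer: 3 neurons on each side
A = np.array([[1., 1., 0.],
              [1., 0., 0.],
              [0., 1., 0.]])
best = np.unravel_index(np.argmax(l3_scores(A)), A.shape)
```

On this toy matrix, the top-ranked missing link is the one whose endpoints share the most length-3 paths through the existing connections; a rewiring step would insert it (or sample among the highest scores).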
Recently, a sparse network trained in the framework of Epitopological Sparse Meta-deep Learning (ESML) was shown to outperform fully connected networks across multiple architectures and datasets while retaining only 1% of the connections [28]. The basis of the framework was provided by the Cannistraci-Hebb (CH) automata learning theory, and the training procedure relied heavily on the CH3-CL network automata rules [29], offering an effective solution for general-purpose link prediction in bipartite networks.

In the present work, we examine this highly effective sparse neural network from the perspective of robustness. In the light of practical applications, robustness in neural networks is a critical concern, as models must perform reliably under various real-world conditions, including the presence of noise, adversarial perturbations, and shifts in data distribution. While sparsity can reduce overfitting and improve generalization, its impact on robustness is less understood, especially in the context of adaptive topologies. Here, we study the resilience of sparse neural networks trained in the ESML approach against various link removal procedures.

2 Results

2.1 Sparse architecture and training

The structure of the network architecture we used followed the setup proposed in Ref. [28], as illustrated in Fig. 1, showing the final stage of the training in one of our experiments with the MNIST dataset. The input layer consists of 784 pixels, which is followed by 3 sparsely connected neuron layers with 1000 neurons each. The network also contains an additional layer for readout, containing 10 neurons that are densely connected to the third layer.
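A quick back-of-envelope count shows how few parameters this architecture retains. The sketch below only restates the layer sizes and the 1% density given above, so no assumptions beyond those figures are involved.

```python
# Layer sizes from the text: 784 input pixels, three sparse hidden layers
# of 1000 neurons each (1% density), and a dense 10-neuron readout.
layer_sizes = [784, 1000, 1000, 1000, 10]
density = 0.01

pairs = list(zip(layer_sizes[:-1], layer_sizes[1:]))
sparse_links = sum(int(density * a * b) for a, b in pairs[:-1])  # three sparse layers
dense_links = pairs[-1][0] * pairs[-1][1]                        # dense readout
total_links = sparse_links + dense_links
fully_connected = sum(a * b for a, b in pairs)                   # dense counterpart
```

At 1% density the three sparse layers hold 27,840 links, so together with the 1000 × 10 dense readout the network trains fewer than 38,000 weights, versus roughly 2.8 million for the fully connected counterpart of the same architecture.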
The training of the network starts already with a sparse configuration, where only 1% of the possible connections are present between the input and the first 3 neuron layers, placed uniformly at random with a weight drawn from a normal distribution having a mean μ = 0.0 and a standard deviation σ = √(2/f_in), where f_in is the size of the previous neuron layer. During the training, in each epoch first the weights are adjusted based on backpropagation. This is followed by rewiring of the connections, where first a small fraction of the links is deleted, and then an equal number of new links is introduced to keep the overall number of connections in the network constant. For the link regrowth procedure we applied two distinct strategies: Random Link Regrowth (RLR) and the CH3L3 heuristic [29, 28]. The RLR method, as its name suggests, assigns the new links simply at random. In contrast, the CH3L3 method is based on link prediction, where the new links are placed according to the highest likelihoods for unobserved connections in the current network topology. The two possible link regrowth strategies were not mixed, i.e., for any given experiment, one of the two possibilities was used exclusively throughout the entire training.

Figure 1: Illustration of the network architecture. The input layer (shown on the left) contains 28 × 28 pixels, followed by 3 sparsely connected neuron layers with 1000 neurons each. Blue coloured connections have a positive link weight, whereas red colour indicates a negative weight. The last layer, providing the readout, is densely connected to the third sparse layer.

In Fig. 2, we show the duration of the topology update for the two different link regrowth strategies. As expected, due to its simplicity, the RLR method roughly halves the average topology update time compared to the more complex CH3L3 method for all studied datasets.
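The rewiring step described above can be sketched in a few lines. The version below implements the Random Link Regrowth variant (prune the weakest existing links, regrow the same number uniformly at random, initialized from N(0, 2/f_in)); the CH3L3 variant would differ only in ranking the candidate positions by a link-prediction score instead of sampling uniformly. The 5% prune fraction is an illustrative choice, not a value taken from the text.

```python
import numpy as np

def topology_update(W, prune_frac, rng):
    """One RLR rewiring step on a layer's weight matrix W (rows = f_in):
    remove the weakest prune_frac of existing links in place, then regrow
    the same number uniformly at random among the empty positions,
    drawing the new weights from N(0, 2/f_in)."""
    f_in = W.shape[0]
    rows, cols = np.nonzero(W)
    k = int(prune_frac * rows.size)
    weakest = np.argsort(np.abs(W[rows, cols]))[:k]   # smallest |w| first
    W[rows[weakest], cols[weakest]] = 0.0
    empty_r, empty_c = np.nonzero(W == 0)
    pick = rng.choice(empty_r.size, size=k, replace=False)
    W[empty_r[pick], empty_c[pick]] = rng.normal(0.0, np.sqrt(2.0 / f_in), size=k)
    return W

# sparse 784 -> 1000 layer at 1% density, as in the paper's first layer
rng = np.random.default_rng(0)
W = np.zeros((784, 1000))
idx = rng.choice(W.size, size=int(0.01 * W.size), replace=False)
W.flat[idx] = rng.normal(0.0, np.sqrt(2.0 / 784), size=idx.size)
links_before = np.count_nonzero(W)
W = topology_update(W, prune_frac=0.05, rng=rng)
```

Because exactly as many links are regrown as were pruned, the total link count of the layer is conserved across the update, as required by the procedure above.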
For a more detailed description of the training procedures, see Methods.

2.1.1 Accuracy

Figure 3 shows the evolution of test accuracy as a function of the training epochs for the MNIST (Fig. 3a), Fashion MNIST (Fig. 3b), KMNIST (Fig. 3c), and EMNIST letters datasets (Fig. 3d), where each curve represents the median over 10 independent training runs started from a different random seed. The shaded region around the curves falls between the 40th and 60th percentiles and indicates that a noticeable variability can be introduced by random initialization in the middle range of the training procedure for some systems. For the studied datasets, the accuracy of the networks trained according to the CH3L3 method approaches its limiting value considerably faster compared to the accuracy of the networks trained with random link regrowth. However, the accuracy at the end of the training procedure is the same for both approaches within a small margin of uncertainty for all datasets except the KMNIST dataset (Fig. 3c), where the maximal accuracy of the network trained with the CH3L3 method is slightly higher. Interestingly, in the case of the Fashion MNIST dataset (Fig. 3b), after hitting a local maximum, the accuracy curve briefly decreases until reaching a local minimum, and then monotonically improves with each further training epoch for both training methods.

Put together, these results demonstrate that both training methods benefit from the alternating approach between weight update and link rewiring, and the applied sparse dynamic training framework is effective. In the meantime, the specific design of the update rule plays an important role in determining the speed of convergence.

Figure 2: Duration of the topology update. We show the bar chart of the average duration in seconds for a single topology update over 4 training instances and 50 epochs. The results for the CH3L3 method are shown in light blue, whereas for the Random Link Regrowth method in purple, where the dataset is marked under the corresponding bars.

Figure 3: Accuracy during training. We show the accuracy (fraction of correctly classified inputs, where the shaded area indicates the standard deviation) as a function of the number of epochs (on logarithmic scale). The results for the CH3L3 method are shown in blue, for the Random Link Regrowth method in red. Results are displayed for the following datasets: a) MNIST, b) FashionMNIST, c) KMNIST, and d) EMNIST Letters. Each setup was run 10 times with a different random seed and each training instance ran for 1000 epochs with a topology update between consecutive epochs.

2.2 Robustness analysis

We assessed the robustness of the trained sparse networks through a systematic post-training perturbation analysis. After the completion of the training phase, we exposed the networks to two distinct types of disruption: iterative structural pruning (removing connections) and stochastic weight perturbation (randomly altering trained weights). The network performance was re-evaluated on the test set at each perturbation step, critically, without any subsequent fine-tuning or retraining. The detailed list of the applied types of perturbation is:

• Random Pruning - links are removed from the network layer-by-layer uniformly at random.

• Weight Order Pruning - links are removed in decreasing order of weight magnitude (i.e., highest-magnitude connections are removed first).

• Reverse Weight Order Pruning - links are removed in increasing order of weight magnitude.

• Weight Shuffling - weight values are shuffled within fixed-size bins applied layer-wise. The size of the perturbation is controlled by a value ranging from 0 to 1, which represents the ratio of the bin size to the total range of weight values within that layer. As this value increases, the bins become larger, allowing more disparate weight values to be shuffled together, thereby increasing the strength of the perturbation.

• Weight Modification - random noise, sampled from a normal distribution centered on 0, is added to the weight values. The perturbation value m_p controls the standard deviation of the noise as σ_p = w̄ · m_p, where w̄ is the average weight magnitude for the given layer.

In Fig. 4 we show the accuracy as a function of the fraction of removed links for Random Pruning (green), Weight Order Pruning (gray), and Reverse Weight Order Pruning (red). As expected, for all tested datasets, Weight Order Pruning decreases the performance of the networks in the most rapid fashion, with an almost vertical drop in the accuracy close to the origin, reaching the minimum value already at 1 to 5% of removed links. The studied sparse neural networks show more resilience against Random Pruning, where we can observe a gradual decrease in accuracy with a roughly constant (and finite) slope in the early stages of the link removal process, with the steepness of the curves slightly reduced in the later stages, eventually decreasing to the minimum at around 80% removed links. Finally, the studied sparse dynamic neural networks show high resilience against Reverse Weight Order Pruning, where accuracy may suffer only a minor decrease even at 80% of the links removed, which is then followed by a steep decline for larger fractions of removed links.

Figure 4: Robustness of trained sparse networks against link removal. Average accuracy over 32 networks as a function of the fraction of removed links (perturbation value) for systems trained according to the CH3L3 method (continuous lines) and the RLR method (dashed lines) for a) the MNIST dataset, b) the Fashion MNIST dataset, c) the KMNIST dataset, and d) the EMNIST letters dataset. Results for Reverse Weight Order Pruning are shown in red, for Random Pruning in green, and for Weight Order Pruning in gray.

When comparing the robustness of the networks from the point of view of the two different training strategies, we can observe that networks trained with the RLR method (dashed lines) show higher resilience against Reverse Weight Order Pruning for all datasets in Fig. 4. Resilience against Random Pruning yielded mixed results. Networks trained with the CH3L3 regrowth method showed slightly higher resilience across the MNIST, Fashion MNIST, and KMNIST datasets (Fig. 4a-c). However, for the EMNIST letters dataset, the resilience curves for the two training strategies were nearly indistinguishable (Fig. 4d). Finally, performance under Weight Order Pruning was dependent on the dataset. Networks trained using CH3L3 appeared slightly more resilient, though by a very small margin, for the MNIST and Fashion MNIST datasets (Fig. 4a-b). Conversely, the RLR method resulted in slightly higher resilience for the KMNIST and EMNIST letters datasets (Fig. 4c-d).

In Fig. 5 we show the decay of accuracy as a function of the perturbation value for Weight Shuffling (purple) and Weight Modification (light blue). Similarly to Fig. 4, curves plotted with continuous lines correspond to the results obtained for the networks trained with the CH3L3 link regrowth method, whereas results shown with dashed lines were obtained with the RLR method.
For all tested datasets, the accuracy drops very fast for both types of perturbation, with Weight Shuffling reducing the accuracy slightly faster than Weight Modification at the very start of the perturbation process. However, at some point the curves cross, and in the later stages the accuracy for Weight Shuffling remains higher compared to that of Weight Modification. According to Fig. 5, networks trained according to the CH3L3 method show slightly but noticeably stronger resilience against Weight Modification compared to networks trained with RLR in all tested datasets. In parallel, the comparison of the two training methods with respect to resilience against Weight Shuffling yields mixed results: networks trained according to the CH3L3 method achieved higher robustness in the case of the MNIST and the FashionMNIST datasets (Fig. 5a-b), whereas in the case of the KMNIST and the EMNIST Letters datasets it is the other way around (Fig. 5c-d).

To investigate potential structural differences between networks trained using the CH3L3 algorithm and the RLR method, we measured the link weight distribution upon completion of the training processes, with Fig. 6 displaying the resulting density distributions of the weight magnitude across the various datasets. For all datasets and both training methods, we observe a narrow and pronounced local minimum in the density distribution near the origin. This is a consequence of the repeated link removal in each epoch, where a fixed fraction of the connections, specifically the weakest links, is consistently pruned. The steep decay immediately to the left of this minimum is primarily attributed to new links inserted in the final epoch, as these are initialized with weights drawn from a normal distribution centered at zero.

Following this local minimum, a similarly narrow global maximum can be observed, and the density curves then proceed with a near-exponential decay towards larger weight values. When comparing the two training methods, the distributions show a consistent difference within the intermediate and moderately large weight range: specifically, in the range of approximately w = 0.2 to w = 1, the density curves for networks trained with RLR (red lines) consistently exceed those for networks trained with the CH3L3 algorithm (light blue lines) across all datasets. However, the results for extremely large and rare weights (w > 1) vary by dataset: the density for CH3L3 networks was higher than for RLR networks in the MNIST (Fig. 6a) and EMNIST datasets (Fig. 6d), was lower than for RLR networks in the Fashion MNIST dataset (Fig. 6b), and was roughly equal for the KMNIST dataset (Fig. 6c).

The measured properties of the weight magnitude densities, coupled with the assumption that connections with large weights contribute more significantly to network accuracy, offer a qualitative explanation for the observed robustness differences. Networks trained with RLR exhibit a higher density of medium-to-moderately-large weights compared to CH3L3 networks. Consequently, when both network types are subjected to Reverse Weight Order Pruning, the RLR networks retain larger weights among their remaining connections for any given fraction of links removed. This structural difference is expected to translate into superior performance and resilience for RLR-trained networks under Reverse Weight Order Pruning. Conversely, when links are removed randomly, the higher density of important, medium-to-moderately-large weights in RLR networks makes them more vulnerable to damage, resulting in a larger performance degradation compared to the CH3L3-trained networks at the same removal fraction.
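The two weight perturbations analysed in this section, Weight Shuffling and Weight Modification, can be made concrete with a short sketch. The bin construction for shuffling is our reading of the definition above (bins of width equal to the given ratio times the layer's total weight range), so treat it as an interpretation rather than the authors' exact implementation.

```python
import numpy as np

def weight_modification(W, m_p, rng):
    """Add zero-mean Gaussian noise to the existing weights of one layer,
    with standard deviation sigma_p = mean(|w|) * m_p."""
    W = W.copy()
    mask = W != 0
    sigma_p = np.abs(W[mask]).mean() * m_p
    W[mask] += rng.normal(0.0, sigma_p, size=mask.sum())
    return W

def weight_shuffling(W, bin_ratio, rng):
    """Shuffle weight values within bins whose width is bin_ratio (0 < r <= 1)
    times the layer's total weight range; bin_ratio = 1 shuffles all weights."""
    W = W.copy()
    mask = W != 0
    w = W[mask]
    n_bins = int(np.ceil(1.0 / bin_ratio))
    span = w.max() - w.min()
    if span == 0:
        bins = np.zeros(w.size, dtype=int)
    else:
        bins = np.minimum(((w - w.min()) / (bin_ratio * span)).astype(int),
                          n_bins - 1)
    for b in np.unique(bins):          # shuffle each bin independently
        sel = bins == b
        w[sel] = rng.permutation(w[sel])
    W[mask] = w
    return W

rng = np.random.default_rng(1)
W = np.zeros((4, 4))
W.flat[[0, 3, 5, 9, 12]] = [0.5, -1.0, 0.2, 0.8, -0.3]
Ws = weight_shuffling(W, bin_ratio=1.0, rng=rng)
Wm = weight_modification(W, m_p=0.5, rng=rng)
```

Note that shuffling preserves the multiset of weight values of the layer while scrambling their positions, whereas modification preserves the sparsity pattern while perturbing the values, which is why the two perturbations degrade accuracy in qualitatively different ways.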
Figure 5: Robustness of trained sparse networks against perturbing the weights by shuffling or adding noise. Accuracy as a function of the perturbation value for networks trained with the CH3L3 link regrowth method (continuous lines) and with the RLR method (dashed lines) for a) the MNIST dataset, b) the Fashion MNIST dataset, c) the KMNIST dataset, and d) the EMNIST letters dataset. Results for Weight Shuffling are shown in purple, for Weight Modification in light blue.

Figure 6: Weight magnitude distribution in the trained networks. We plot the density distribution of the absolute value of the connection weights at the end of the sparse training process, where results for networks trained according to the CH3L3 algorithm are shown in light blue and results for networks trained with RLR are shown in red. Panels depict the distributions observed for a) the MNIST dataset, b) the FashionMNIST dataset, c) the KMNIST dataset, and d) the EMNIST letters dataset.

3 Discussion

The study of network robustness, often framed as an inverse percolation process, where the network structure is incrementally deteriorated through subsequent link removal, has a long history in network science [30, 31]. Depending on the initial network structure and the specific perturbation applied, at a certain point the system may undergo a drastic change analogous to a phase transition in statistical physics. Such percolation transitions have been observed both in living neural networks, when chemically blocking neurotransmitter receptors [32], and in simulated neural networks, where resilience against various attacks strongly depended on the underlying architecture.
The sequential pruning of connections is also a crucial method for compression in artificial neural networks, allowing an initially dense network to be converted into a sparse one while retaining near-original performance, despite the loss of the majority of its connections.

Relatedly, in the present paper we investigated the consequences of pruning and weight perturbations in neural networks that are already sparse from the outset. The networks studied were trained on four distinct datasets using two alternative dynamic training methods: the CH3L3 method (in which a fixed fraction of connections is removed and then reintroduced based on missing link prediction in each epoch) and the Random Link Regrowth (RLR) method (where the removed links are reintroduced at random).

Analysis of the training dynamics revealed that, across all datasets, CH3L3 networks reached maximum accuracy after fewer training epochs compared to RLR networks, though the final convergence accuracy for both methods was consistently the same within a negligible margin. Regarding resilience, CH3L3 networks proved to be more resilient against random link removal for three of the four datasets, with the fourth dataset showing nearly identical accuracy decay curves for both methods. In contrast, RLR networks showed higher resilience against Reverse Weight Order removal across all datasets. This latter result is particularly notable, as it demonstrates the potential for significant further sparsification in RLR-trained networks, allowing up to approximately 80% of the connections to be removed without a critical loss in accuracy. However, both network types were highly susceptible to Weight Order removal, with accuracy often reducing to its minimum value after the removal of only 3% to 10% of the links.
Beyond link removal, we analysed the effects of direct weight modification, finding that RLR-trained networks exhibited slightly greater resilience against random noise added to the weight values, whereas CH3L3 networks proved more resilient against random weight shuffling in three out of the four tested datasets. A qualitative explanation for these differences stems from the resulting weight distributions. Our results show that in RLR-trained networks, a higher proportion of the distribution is concentrated in the medium-to-high weight range compared to CH3L3 networks. Based on this structural difference, RLR networks are expected to show higher resilience under Reverse Weight Order Pruning (which preferentially removes small weights), while CH3L3 networks are expected to be more resilient against Random Pruning (which is more likely to disrupt the concentrated, critical weights in the RLR distribution).

In summary, our study of the robustness of dynamically trained sparse neural networks against link pruning and weight modification shows that the specific dynamic training method employed has a notable effect on network resilience. Furthermore, the results reveal that RLR-trained networks possess a structural quality that allows for extensive post-training compressibility without a significant loss in accuracy.

4 Methods

Our approach is based on training sparse neural networks while dynamically modifying their connectivity structure during training. The overall procedure consists of two main components: (i) weight learning and (ii) topology update. These are applied in an alternating fashion throughout training.

4.1 Architecture

We employ multiple sparse layers followed by a single dense layer.
The size of the input layer is fixed by the problem we are solving; for instance, the popular MNIST dataset consists of input images of 28 × 28 pixels, which corresponds to an input size of 784 after flattening the image. This is followed by a certain number of hidden neuron layers; for all experiments in this study we used three layers of 1000 neurons unless otherwise specified. These hidden neurons are sparsely connected to the neurons of the preceding layer. Finally, the last layer of hidden neurons is densely connected to the output layer, whose size is again specified by the problem at hand. MNIST has 10 classes, therefore this last dense layer contains 1000 × 10 links. Each neuron in the hidden layers represents an aggregation (weighted sum) of incoming links followed by a ReLU activation function [33].

4.2 Initialization

We begin by initializing a sparse neural network with a predefined sparsity level. The initial connectivity pattern is created at random following the Erdős–Rényi random graph model. Of course, any other method can be used here. The weight parameters of the network are then initialized. The Kaiming initialization [34] is a popular option when working with ReLU activations, as it preserves the variance of activations during the forward pass and the variance of gradients during the backward pass. In practice this is achieved by sampling weight values from the normal distribution N(0, 2/n_in), where n_in is the number of neurons of the previous layer (i.e., the total number of possible incoming links). We use Kaiming initialization unless otherwise mentioned.

4.3 Training Procedure

Training takes place in discrete epochs. Each epoch is composed of the following steps:

1. Weight learning. Given the current sparse connectivity, standard gradient-based optimization is applied to update the active parameters.
During this step the network adapts its weight values according to the task at hand, conditioned on the present topology.

2. Topology update. After each epoch, the connectivity of the network is modified according to a specified update rule.

The stages of the topology update are shown in Fig. 7. One always starts with pruning (removing) links that are deemed irrelevant. In the simplest case we remove a predefined fraction of all existing links based on the magnitude of their weight values. This is motivated by the fact that connections with vanishing weight values tend not to contribute meaningfully to the activation of the neuron they feed into. During an optional second pruning stage, we remove all disconnected neurons and their associated links. A neuron is considered disconnected if it does not lie on any directed path originating from the input layer and ending on the output layer. A link that is not connected to the output simply occupies precious space in the sparse network that could be used to house a meaningful link. Similarly, a link not connected to the input will at best have contributed to the activation of the neuron by a constant offset and should therefore be merged with the appropriate bias neuron. Then, new links are grown. One may opt to introduce new links randomly or according to some heuristic. We analyzed trained sparse networks with a random rule (this is the method used in [PAPER FOR S.E.T. METHOD]), with CH3L3 link prediction [PAPER for CH3L3], as well as with a simpler link prediction method based on counting the number of paths connecting two neurons. Finally, the new links are initialized by sampling from a distribution. Here, again, we use Kaiming initialization unless otherwise specified.
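The optional second pruning stage, removing neurons that lie on no directed input-output path, reduces to one forward and one backward reachability pass over the layer connectivity. The sketch below assumes the connectivity is stored as 0/1 biadjacency matrices, one per pair of consecutive layers; that storage choice is ours, not the paper's.

```python
import numpy as np

def alive_hidden_neurons(masks):
    """Flag hidden neurons lying on at least one directed input->output path,
    given 0/1 biadjacency matrices for consecutive layers
    (masks[0]: input x h1, ..., masks[-1]: h_last x output)."""
    n_hidden = len(masks) - 1
    fwd, bwd = [None] * n_hidden, [None] * n_hidden
    reach = np.ones(masks[0].shape[0])            # every input neuron is a source
    for i in range(n_hidden):                     # forward: receives input signal?
        reach = ((masks[i].T @ reach) > 0).astype(float)
        fwd[i] = reach > 0
    reach = np.ones(masks[-1].shape[1])           # every output neuron is a sink
    for i in range(n_hidden - 1, -1, -1):         # backward: can reach the output?
        reach = ((masks[i + 1] @ reach) > 0).astype(float)
        bwd[i] = reach > 0
    return [f & b for f, b in zip(fwd, bwd)]      # alive = reachable both ways

# toy network: 2 inputs, two hidden layers of 2 neurons each, 2 outputs
M0 = np.array([[1., 0.], [0., 0.]])  # input_0 -> h1_0
M1 = np.array([[0., 1.], [0., 0.]])  # h1_0 -> h2_1
M2 = np.array([[0., 0.], [1., 0.]])  # h2_1 -> out_0
alive = alive_hidden_neurons([M0, M1, M2])
```

In the toy network only the chain input_0 -> h1_0 -> h2_1 -> out_0 exists, so h1_0 and h2_1 are flagged alive while the remaining hidden neurons (and all their links) would be removed in this stage.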
By alternating b et w een weigh t learning and topology up dates, the net work explores different sparse configurations o ver the course of training, p otentially improving its ability to disco ver efficient and expressiv e structures. References [1] Alex Krizhevsky , Ilya Sutsk ev er, and Geoffrey E. Hin ton. Imagenet classification with deep con vo- lutional neural netw orks. In Peter L. Bartlett, F ernando C. N. P ereira, Christopher J. C. Burges, L ´ eon Bottou, and Kilian Q. W ein b erger, editors, A dvanc es in Neur al Information Pr o c essing Sys- tems 25: 26th Annual Confer enc e on Neur al Information Pr o c essing Systems 2012. Pr o c e e dings of a me eting held De c emb er 3-6, 2012, L ake T aho e, Nevada, Unite d States , pages 1106–1114, 2012. 11 (a) (b) (c) Figure 7: The three stages of the top ology up date. In the first stage (a) a fraction of links (dotted red lines) is remo ved in eac h sparse lay er. In the second optional stage (b) all links that are no longer connected to the input and/or output la y er (dotted red lines) are also remo v ed. In the final stage (c) new links (green) are drawn and initialized such that the total n umber of links in each lay er will b e unchanged after the procedure. [2] Ashish V asw ani, Noam Shazeer, Niki P armar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, L uk asz Kaiser, and Illia Polosukhin. Atten tion is all you need. In I. Guy on, U. V on Luxburg, S. Bengio, H. W allach, R. F ergus, S. Vish w anathan, and R. Garnett, editors, A dvanc es in Neur al Information Pr o c essing Systems , v olume 30. Curran Asso ciates, Inc., 2017. [3] Emma Strubell, Anany a Ganesh, and Andrew McCallum. Energy and p olicy considerations for deep learning in NLP. In Anna Korhonen, Da vid T raum, and Llu ´ ıs M` arquez, editors, Pr o c e e dings of the 57th A nnual Me eting of the Asso ciation for Computational Linguistics , pages 3645–3650, Florence, Italy , July 2019. Asso ciation for Computational Linguistics. [4] S. 
Ramón y Cajal. The Croonian lecture.—La fine structure des centres nerveux. Proceedings of the Royal Society of London, 55:444–468, 1894.

[5] V. Braitenberg and A. Schüz. Cortex: Statistics and Geometry of Neuronal Connectivity. Springer-Verlag, Berlin Heidelberg, 1998.

[6] Babak Hassibi, David G. Stork, and Gregory J. Wolff. Optimal brain surgeon and general network pruning. IEEE International Conference on Neural Networks, pages 293–299 vol. 1, 1993.

[7] Bruno A. Olshausen and David J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.

[8] Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. In D. Touretzky, editor, Advances in Neural Information Processing Systems, volume 2. Morgan-Kaufmann, 1989.

[9] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.

[10] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.

[11] Jiajun Li and Ahmed Louri. AdaPrune: An accelerator-aware pruning technique for sustainable CNN accelerators. IEEE Transactions on Sustainable Computing, 7(1):47–60, 2022.

[12] Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. A simple and effective pruning approach for large language models, 2024.

[13] Elias Frantar and Dan Alistarh. SparseGPT: Massive language models can be accurately pruned in one-shot, 2023.

[14] Yingtao Zhang, Haoli Bai, Haokun Lin, Jialin Zhao, Lu Hou, and Carlo Vittorio Cannistraci.
Plug-and-play: An efficient post-training pruning method for large language models. In The Twelfth International Conference on Learning Representations, 2024.

[15] Ziming Liu, Eric Gan, and Max Tegmark. Seeing is believing: Brain-inspired modular training for mechanistic interpretability. Entropy, 26(1), 2024.

[16] Mark Newman. Networks. Oxford University Press, 2018.

[17] Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H. Nguyen, Madeleine Gibescu, and Antonio Liotta. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 9, 2018.

[18] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. SNIP: Single-shot network pruning based on connection sensitivity. In International Conference on Learning Representations, 2019.

[19] Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. Rigging the lottery: Making all tickets winners. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 2943–2952. PMLR, 2020.

[20] Geng Yuan, Xiaolong Ma, Wei Niu, Zhengang Li, Zhenglun Kong, Ning Liu, Yifan Gong, Zheng Zhan, Chaoyang He, Qing Jin, Siyue Wang, Minghai Qin, Bin Ren, Yanzhi Wang, Sijia Liu, and Xue Lin. MEST: Accurate and fast memory-economic sparse training framework on the edge, 2021.

[21] Geng Yuan, Yanyu Li, Sheng Li, Zhenglun Kong, Sergey Tulyakov, Xulong Tang, Yanzhi Wang, and Jian Ren. Layer freezing & data sieving: Missing pieces of a generic framework for sparse training, 2022.

[22] Yuxin Zhang, Mingbao Lin, Mengzhao Chen, Fei Chao, and Rongrong Ji. OptG: Optimizing gradient-driven criteria in network sparsity, 2022.

[23] Yuxin Zhang, Yiting Luo, Mingbao Lin, Yunshan Zhong, Jingjing Xie, Fei Chao, and Rongrong Ji.
Bi-directional masks for efficient N:M sparse training, 2023.

[24] Simone Daminelli, Josephine Maria Thomas, Claudio Durán, and Carlo Vittorio Cannistraci. Common neighbours and the local-community-paradigm for topological link prediction in bipartite networks. New Journal of Physics, 17(11):113037, November 2015.

[25] Claudio Duran, Simone Daminelli, Josephine M. Thomas, V. Joachim Haupt, Michael Schroeder, and Carlo Vittorio Cannistraci. Pioneering topological methods for network-based drug–target prediction by exploiting a brain-network self-organization theory. Briefings in Bioinformatics, 19(6):1183–1202, 2017.

[26] Vaibhav Narula, Anders S. Sørensen, Josephine M. Thomas, Carlo Vittorio Cannistraci, Jay A. Sankar, Christian A. Jurgens, Anne E. Raftery, and Anders Eklund. Can local-community-paradigm and epitopological learning enhance our understanding of how local brain connectivity is able to process, learn and memorize chronic pain? Applied Network Science, 2(1):48, 2017.

[27] Carlo Vittorio Cannistraci. Modelling self-organization in complex networks via a brain-inspired network automata theory improves link reliability in protein interactomes. Scientific Reports, 8(1), October 2018.

[28] Yingtao Zhang, Jialin Zhao, Wenjing Wu, Alessandro Muscoloni, and Carlo Vittorio Cannistraci. Epitopological learning and Cannistraci-Hebb network shape intelligence brain-inspired theory for ultra-sparse advantage in deep learning. In The Twelfth International Conference on Learning Representations, 2024.

[29] Alessandro Muscoloni, Umberto Michieli, and Carlo Vittorio Cannistraci. Adaptive network automata modelling of complex networks, December 2020.

[30] R. Albert, H. Jeong, and A.-L. Barabási. Error and attack tolerance of complex networks. Nature, 406:378–382, 2000.

[31] Oriol Artime, Marco Grassia, Manlio De Domenico, James P. Gleeson, Hernán A. Makse, Giuseppe Mangioni, Matjaž Perc, and Filippo Radicchi.
Robustness and resilience of complex networks. Nature Reviews Physics, 6(2):114–131, February 2024.

[32] Ilan Breskin, Jordi Soriano, Elisha Moses, and Tsvi Tlusty. Percolation in living neural networks. Phys. Rev. Lett., 97:188102, October 2006.

[33] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.

[34] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
