
Deep Neural Regression Collapse
Akshay Rangamani¹, Altay Unal¹
¹Department of Data Science, New Jersey Institute of Technology
{akshay.rangamani, au252}@njit.edu

Abstract: Neural Collapse is a phenomenon that helps identify sparse and low rank structures in deep classifiers. Recent work has extended the definition of neural collapse to regression problems, albeit only measuring the phenomenon at the last layer. In this paper, we establish that Neural Regression Collapse (NRC) also occurs below the last layer across different types of models. We show that in the collapsed layers of neural regression models, features lie in a subspace that corresponds to the target dimension, the feature covariance aligns with the target covariance, the input subspace of the layer weights aligns with the feature subspace, and the linear prediction error of the features is close to the overall prediction error of the model. In addition to establishing Deep NRC¹, we also show that models that exhibit Deep NRC learn the intrinsic dimension of low rank targets, and we explore the necessity of weight decay in inducing Deep NRC. This paper provides a more complete picture of the simple structure learned by deep networks in the context of regression.

1. Introduction

Neural Collapse is a phenomenon that was recently identified in supervised deep classification problems and that characterizes the representations as well as the weights learned at the last layer of deep classifiers. Neural Collapse (NC) is characterized by four distinct but related conditions involving the class means and covariance of the last layer features, and the weights of the last layer. These conditions were first proposed and measured at the last layer [1], and later modified and extended to deeper layers [2] to provide a more complete understanding of deep classifiers.
The extension of neural collapse to deep neural collapse thus helps identify an important implicit bias of deep network training towards networks of minimum depth. This low rank bias in the top layers of deep classifiers allows us to derive tighter generalization bounds for networks exhibiting deep neural collapse [3, 4]. Thus Deep Neural Collapse can help us understand how deep learning finds favorable solutions. Moreover, identifying these sparse and low rank structures in models can potentially aid us in model editing and control. This leads us to the question: Can we explain the power of deep learning through the emergence of deep neural collapse? However, before we can claim that the implicit bias towards simple solutions of minimal depth is universal across training deep networks, we need to investigate whether this appears in models beyond supervised classification. This question was partially explored in a recent paper [5] that proposed three conditions to describe Neural Regression Collapse (NRC) at the last layer. The conditions proposed by Andriopoulos et al. [5] are: last layer features become low rank, the features lie in the row space of the weights, and the outer product of the last layer weights matches the target covariance. The authors of this paper measured these conditions at the output layer of deep regression models, and established that NRC is a prevalent phenomenon. However, characterizing this just at the last layer does not immediately allow us to explain how deep networks learn powerful, generalizable regression models. It still remains to identify whether the conditions of NRC extend below the last layer, and whether the implicit bias towards simple models of minimal depth also emerges in the case of deep regression. This is the focus of our paper.
Our Contributions:

1. In this paper we provide a complete description of Deep Neural Regression Collapse by proposing a set of NRC conditions that can be applied across all layers of a deep regressor. We train a number of models on real datasets and demonstrate that Deep NRC occurs across different types of model architectures.

2. We show that solutions that exhibit Deep NRC learn the intrinsic dimension of low rank targets. This means that Deep NRC solutions do not just memorize a task but instead learn generalizable solutions.

3. We explore how weight decay can control the emergence of Deep NRC and show that it is necessary for finding these solutions.

Outline: We briefly discuss work on the phenomenon of neural collapse (Section 2) before describing how the conditions of NRC can be derived from the NC conditions in Section 3. Our main results are presented in Section 4, and we conclude in Section 5.

¹The Deep NRC codebase is available at: https://github.com/altayunal/neural-collapse-regression

Third Conference on Parsimony and Learning (CPAL 2026).

2. Related Work

Neural collapse is a recent phenomenon that was first identified by [1]. During the terminal phase of training (TPT), where the training error becomes zero, it is observed that the class mean vectors and last layer features converge to each other while the class mean vectors form a simplex equiangular tight frame (ETF) structure. Since the identification of this phenomenon, different aspects of neural collapse have been investigated. Although neural collapse was initially identified under the cross entropy (CE) loss, several studies [6, 7] discovered neural collapse under the mean squared error (MSE) loss. Meanwhile, losses other than the MSE and CE losses have also been shown to exhibit neural collapse [8, 9].
Neural collapse has also been investigated in relation to different techniques used in model training, such as weight decay and batch normalization [10–12]. In addition, some studies [2, 13] showed that neural collapse extends beyond the last layer and is also observed within the intermediate layers. After the observation of neural collapse under different settings, neural collapse started to be investigated in other networks and tasks as well. Kothapalli et al. [14] studied neural collapse within graph neural networks (GNNs), while Wu and Papyan [15] defined neural collapse properties for large language models (LLMs). As for other tasks, neural collapse has been investigated in imbalanced data classification [16, 17], robustness [18], and regression [5]. In addition to observing neural collapse in different tasks, some studies leverage neural collapse to improve performance on several tasks such as imbalanced data classification [19] and continual learning [20, 21].

Theoretical explanations have also been explored for neural collapse, primarily through the Unconstrained Features Model (UFM) [22], though alternate approaches also exist [23]. Prior work has shown that neural collapse at the last layer is the optimal solution for the unconstrained features model under both the cross entropy and MSE losses [24, 25]. This landscape argument has also been extended to modern architectures like ResNets and Transformers [26], though a characterization of how collapse emerges in different types of layers is still missing. Theoretical characterizations of Deep Neural Collapse are fewer in number. The most popular approach has been to extend the UFM to many layers, showing that Deep Neural Collapse is the optimal solution to the Deep UFM [27] in the case of binary classification.
Prior work [28] has also attempted to explain the "law of data separation" [13] through a phenomenological lens, but only focuses on linear predictivity of the targets from the representations. While this can model the phenomenon of data separation and provide some understanding of the training hyperparameters, it does not explain the low rank nature of the representations and the weights that is crucial to explaining better generalization [3, 4].

In this paper we will primarily study regression using the MSE (mean squared error) loss. The MSE loss has also been used in the context of classification [29, 30] with comparable performance to the cross-entropy loss. It has also been studied specifically in the context of neural collapse [6, 25, 31] and low rank matrix factorization [32]. While there are definite similarities between regression and classification using the MSE loss [33], we can expect that regression and classification models learn different solution geometries. Deep neural collapse in classifiers ensures that features in collapsed layers form a simplex Equiangular Tight Frame, and lie in a $(C-1)$-dimensional subspace for a $C$-class classification problem. In contrast, collapsed layer features in regression models preserve the continuous covariance structure of the target $Y$. Thus the spectral properties, and the metrics necessary to capture them, will be different between the two settings.

The paper that is most related to our investigation is that of Andriopoulos et al. [5], which defines three conditions of Neural Regression Collapse and characterizes the emergence of NRC at the last layer. In this paper we will explain where these conditions emerge from and establish that they occur beyond the last layer.
From a theoretical perspective as well, it is not clear whether deep neural collapse is the optimal solution to the Deep UFM for multi-class classification [34] or regression with multiple targets. Our findings in this paper make it clear that deep neural collapse occurs in regression settings as well.

3. Conditions for Neural Regression Collapse

Neural Collapse in Classification: Consider a deep classifier $f_W(x)$ with $L$ layers that maps inputs $x \in \mathbb{R}^d$ to a vector of $C$ class scores. Let $H_\ell \in \mathbb{R}^{p \times NC}$ denote the activation matrix, $M_\ell = [\mu^\ell_c - \mu^\ell_G] \in \mathbb{R}^{p \times C}$ denote the matrix of class means, and $\Sigma^\ell_W, \Sigma^\ell_B, \Sigma^\ell_T$ denote the within-class, between-class, and total covariance matrices of the activations at layer $\ell$. Deep neural collapse is characterized by the four conditions of variability collapse (NC1), emergence of simplex equiangular tight frames (ETFs) in the mean features (NC2), feature-weight alignment (NC3), and equivalence to nearest class center classification (NC4).

Extensions to regression: The core insight of neural collapse follows from the decomposition of the total covariance into the within- and between-class covariances, $\Sigma^\ell_T = \Sigma^\ell_W + \Sigma^\ell_B$. This is a signal-noise decomposition where the between-class covariance $\Sigma^\ell_B$ is the signal, and the within-class covariance $\Sigma^\ell_W$ is the noise. Also, for a balanced classification problem, we have the target covariance $\Sigma_Y = I_C - \frac{1}{C} \mathbf{1}_C \mathbf{1}_C^\top$. Highlighting these facts allows us to reinterpret the four NC conditions in terms of how layers of a deep network extract the target signal and suppress the noise. Consider a regression problem with inputs and targets $\{(x_i, y_i)\}_{i=1}^N \subseteq \mathbb{R}^d \times \mathbb{R}^t$, being solved with a deep network $f_W$ with depth $L$ and width $h$.
For deep neural collapse to occur in this scenario, we would expect that the top layers of the network extract the target signal and minimize the amount of noise in their weights and representations. This leads us to the following conditions for neural regression collapse:

(NRC1) Noise Suppression: For the features $H_\ell$ at layer $\ell$, let $U_\ell \in \mathbb{R}^{h \times t}$ denote the top $t$ singular vectors of the feature covariance $\Sigma_{H_\ell}$. The magnitude of the noise component of the covariance can be computed as $\mathrm{Tr}\left[(I_h - U_\ell U_\ell^\top)\, \Sigma_{H_\ell}\right]$. If we compute this as a fraction of the total feature covariance, we get the noise component as $1 - \mathrm{Tr}(U_\ell^\top \Sigma_{H_\ell} U_\ell) / \mathrm{Tr}(\Sigma_{H_\ell})$. We can thus say a layer of a regression model is collapsed and obeys NRC1 if the noise component is $\ll 1$.

(NRC2) Signal-Target Alignment: We can define NRC2 as the condition under which the signal component of the layer features, $U_\ell U_\ell^\top H_\ell$, is aligned with the target $Y$. However, this alignment can only happen up to a certain scaling and rotation factor, so we use the Centered Kernel Alignment (CKA) [35, 36] between the features and the target as our NRC2 criterion. A collapsed layer will have $\mathrm{CKA}(H_\ell, Y) \approx 1$.

(NRC3) Feature-Weight Alignment: We find the alignment between the signal components of layer features and the input subspace of the weights. More precisely, we compute the mean cosines of the principal angles between subspaces (PABS) between the signal subspace $U_\ell$ and the top $t$-dimensional input subspace of $W_\ell$. A layer is said to be collapsed if $\frac{1}{t}\sum_{k=1}^{t} \cos(\theta_k) \to 1$.

(NRC4) Linear Predictability: In classification, NC4 expresses the idea that all information required to perform classification is present in the features of collapsed layers.
Generalizing this idea, we expect that in the collapsed layers of regression models, one can predict the target from the layer features through just a linear transformation. We can say that a collapsed layer shows NRC4 if the mean squared error (MSE) of the pseudoinverse solution, $\frac{1}{N}\|H_\ell (H_\ell^\dagger Y) - Y\|_F^2$, is not much larger than the MSE of the entire trained network.

We summarize our observations in this section in Table 1.

Table 1: A unified picture of Neural Collapse in classification and regression

Collapse Condition | Classification (NC) [1, 2] | Regression (NRC) [5], this paper
NC1/NRC1: Noise Suppression | $\mathrm{Tr}(\Sigma^\ell_W)/\mathrm{Tr}(\Sigma^\ell_T) \to 0$ | $1 - \frac{\mathrm{Tr}(U_\ell^\top \Sigma_{H_\ell} U_\ell)}{\mathrm{Tr}(\Sigma_{H_\ell})} \to 0$
NC2/NRC2: Signal-Target Alignment | $M_\ell M_\ell^\top \propto I_C - \frac{1}{C}\mathbf{1}_C \mathbf{1}_C^\top$ | $\mathrm{CKA}(H_\ell, Y) \approx 1$
NC3/NRC3: Feature-Weight Alignment | $\frac{1}{C}\sum \cos\angle(M_\ell, W_\ell^\top) \to 1$ | $\frac{1}{t}\sum \cos\angle(U_\ell, W_\ell^\top) \to 1$
NC4/NRC4: Linear Predictability | $\mathrm{Acc}(\mathrm{NCC}(H_\ell)) \approx \mathrm{Acc}(f_W)$ | $\frac{1}{N}\|H_\ell (H_\ell^\dagger Y) - Y\|_F^2 \approx \frac{1}{N}\|f_W(X) - Y\|_F^2$

4. Experiments & Results

In this section, we demonstrate that Deep Neural Regression Collapse emerges in well trained models across model architectures and datasets. We will present results on synthetically generated and real datasets using multilayer perceptrons (MLPs) as well as convolutional networks (CNNs). We also demonstrate that deep regressors that exhibit collapse can learn the intrinsic dimension of low rank targets, which means we can expect the collapsed solutions to generalize as well. Finally, we investigate the role of weight decay in inducing Deep NRC, and show through experiments that weight decay is necessary.

4.1. Datasets & Models

Synthetic Data: We generated a regression dataset by first drawing $n = 10{,}000$ samples of $d = 20$ dimensional input vectors. The targets were generated by passing these vectors through a linear/nonlinear generative model with output dimension $t$.
The nonlinear generative models were fully-connected neural networks with 2 hidden layers of dimension $r$ and randomly initialized weights, while the linear generative models were $d \times t$ matrices of rank $r \leq t$. The first 80% of the data was used for training while the remaining 20% was kept aside as test data.

Real Data: We used imitation learning datasets based on the MuJoCo physics engine [40, 41]. We ran experiments on three datasets based on the Swimmer, Reacher, and Hopper environments. The inputs for the imitation learning environments correspond to raw robotic states, while the targets correspond to the choice of actions to take in that state. We also used two image datasets, Carla2D [37] and UTKFace [38], to test whether Deep NRC occurs in CNNs. Carla2D consists of images from an autonomous driving simulator, and the task is to predict the speed as well as the steering angle. UTKFace is a dataset of face images commonly used for age estimation. Finally, we also used the SGEMM [39] dataset, where the task is to estimate the runtime of different GPU kernels. The dataset contains 4 different measurements of the runtime, so we used this as an example of a dataset with low rank targets, which we investigate in Section 4.3. Since the 4 targets are expected to be correlated, the rank of the target should be 1. The dataset dimensions as well as sizes are provided in Table 2.

Table 2: Overview of datasets and models used in experiments

Dataset | Input Dimension | Target Dimension | Number of Samples | Task | Architecture
Swimmer | 8 | 2 | 1,000 | Imitation Learning | 8-layer MLP, hidden dimension h = 256
Reacher | 11 | 2 | 1,000 | Imitation Learning | 8-layer MLP, hidden dimension h = 512
Hopper | 11 | 3 | 5,000 | Imitation Learning | 8-layer MLP, hidden dimension h = 256
Carla2D [37] | 288 × 200 × 3 | 2 | 50,000 | Driving Simulation | ResNet-18
UTKFace [38] | 200 × 200 × 3 | 1 | 25,000 | Age Regression | ResNet-34
SGEMM [39] | 14 | 4 | 240,000 | Matrix Multiplication | 8-layer MLP, hidden dimension h = 512
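The synthetic data generation described above can be sketched in a few lines. This is a minimal reading of the text, not the authors' script: the Gaussian input distribution matches Section 4.3, but the ReLU activation and the particular values of $t$ and $r$ below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, t, r = 10_000, 20, 4, 2   # n, d from the text; t and r are illustrative choices

# Inputs: n samples of d-dimensional vectors
X = rng.standard_normal((n, d))

# Linear generative model: a d x t matrix of rank r <= t
A = rng.standard_normal((d, r)) @ rng.standard_normal((r, t))
Y_linear = X @ A

# Nonlinear generative model: randomly initialized fully-connected network
# with 2 hidden layers of dimension r (ReLU activation is an assumption)
W1 = rng.standard_normal((d, r))
W2 = rng.standard_normal((r, r))
W3 = rng.standard_normal((r, t))
relu = lambda z: np.maximum(z, 0.0)
Y_nonlinear = relu(relu(X @ W1) @ W2) @ W3

# First 80% of the data for training, remaining 20% kept aside as test data
split = int(0.8 * n)
X_train, X_test = X[:split], X[split:]
Y_train, Y_test = Y_linear[:split], Y_linear[split:]
```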
As in the case of the synthetic data experiments, 80% of the data was used for training while the remaining 20% was kept aside as test data. Our code repository is available at https://github.com/altayunal/neural-collapse-regression.

4.2. Measuring Deep Neural Regression Collapse

Our main results demonstrating the emergence of Deep NRC in CNNs as well as MLPs are presented in Figures 1 and 2, respectively. In each figure, we present from top to bottom the noise component (NRC1), signal-target alignment (NRC2), feature-weight alignment (NRC3), and linear predictability (NRC4). Each column of Figures 1 and 2 represents the results from one dataset.

In the top row of Figures 1 and 2, we can observe the noise component, which is the fraction of the energy in the feature covariance at each layer ($\Sigma_{H_\ell}$) that lies outside its top $t$-dimensional subspace $U_\ell$. If the noise component is $\approx 1$, the layer features are not low rank, whereas a noise component $\to 0$ means that the layer features lie in a subspace whose dimension corresponds to the target dimension $t$.

The second row of Figures 1 and 2 establishes the feature-target alignment (NRC2) condition. We project the layer features $H_\ell$ onto $U_\ell$ and measure their centered kernel alignment (CKA) with the target $Y$. Using CKA, we can measure the alignment between feature and target covariances even though they may differ in scale and rotation.

Next, in the third row of Figures 1 and 2, we can see the Feature-Weight Alignment (NRC3) condition, which measures the mean cosines of the angles between the top $t$-dimensional input subspace of the weights at a layer $W_\ell$ and the top $t$-dimensional subspace of the previous layer features $H_{\ell-1}$. Here we find that as we move towards the output layers of the models, the alignment increases ($\frac{1}{t}\sum_{i=1}^{t} \cos\theta^\ell_i \to 1$), showing that the top layers are indeed collapsed.
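The three metrics above reduce to a few lines of linear algebra. The sketch below is our reading of the definitions in Section 3 (with features stored as an $N \times h$ matrix of samples), not the authors' reference implementation:

```python
import numpy as np

def noise_component(H, t):
    """NRC1: fraction of feature-covariance energy outside the top-t subspace.
    H: (N, h) feature matrix with rows as samples."""
    Hc = H - H.mean(axis=0)
    Sigma = Hc.T @ Hc / H.shape[0]      # (h, h) feature covariance
    w, V = np.linalg.eigh(Sigma)        # eigenvalues in ascending order
    U = V[:, -t:]                       # top-t eigenvectors
    return 1.0 - np.trace(U.T @ Sigma @ U) / np.trace(Sigma)

def linear_cka(H, Y):
    """NRC2: linear Centered Kernel Alignment between features and targets."""
    Hc = H - H.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    num = np.linalg.norm(Yc.T @ Hc, "fro") ** 2
    den = np.linalg.norm(Hc.T @ Hc, "fro") * np.linalg.norm(Yc.T @ Yc, "fro")
    return num / den

def subspace_alignment(U, W, t):
    """NRC3: mean cosine of the principal angles between the feature signal
    subspace U (h, t) and the top-t input subspace of the weights W (out, h)."""
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    Uw = Vt[:t].T                       # top-t right singular vectors of W
    cosines = np.linalg.svd(U.T @ Uw, compute_uv=False)
    return cosines.mean()
```

The principal-angle cosines in `subspace_alignment` are the singular values of the product of two orthonormal bases, which is the standard way to compute PABS.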
In the case of convolutional layers, we reshape the feature and filter tensors to matrices with the first dimension equaling the number of input channels ($c^\ell_{\mathrm{in}}$) and measure the angles between the top $t$-dimensional left singular vectors.

Finally, in the bottom row of Figures 1 and 2, we measure the Linear Predictability (NRC4) condition. Here we show the mean squared error (MSE) of predicting the target from the features at each layer with the pseudoinverse solution $H_\ell^\dagger Y$. In each plot, we also display the MSE of the entire model on the training set as a dashed red line for reference. We find the layer prediction MSE to be comparable to the model's MSE on the training set.

Across models and datasets, our results show that the NRC conditions are more likely to be satisfied close to the output layers of deep regressors. We use the NRC1 condition to identify the collapsed layers, and mark the first collapsed layer in each model using a green line. In the layers subsequent to the first collapsed layer, we find that the conditions of neural regression collapse occur together: we have a small noise component, a high feature-target alignment, a high feature-weight alignment, and low linear prediction error. More importantly, these conditions are observed beyond just the output layer.

4.3. Deep NRC Learns Low Rank Targets

In the previous subsection, we observed how deep networks trained on regression problems learn features that correspond to the target subspace. However, it is unclear whether the solutions that display collapse truly learn generalizable solutions. To test this, we study whether deep NRC solutions for models trained on low-rank targets learn the intrinsic dimension of their targets or span the entire target subspace.
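The NRC4 measurement described above amounts to an ordinary least-squares fit per layer. A minimal sketch (again assuming an $N \times h$ feature matrix and $N \times t$ targets; not the authors' code) might look like:

```python
import numpy as np

def nrc4_linear_mse(H, Y):
    """NRC4: MSE of predicting the target from layer features with the
    pseudoinverse solution, (1/N) * ||H (H^+ Y) - Y||_F^2."""
    B, *_ = np.linalg.lstsq(H, Y, rcond=None)   # B = pinv(H) @ Y
    return np.linalg.norm(H @ B - Y, "fro") ** 2 / H.shape[0]
```

A layer would satisfy NRC4 when this value is comparable to the MSE of the full trained network on the same data.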
We conduct this experiment using two datasets with low rank structure: a synthetic dataset, and SGEMM [39], a GPU kernel performance dataset. For the synthetic dataset, we draw $n = 10{,}000$ inputs from a $d = 20$ dimensional normal distribution, and we compute $Y = f_0(X)$ where $f_0 : \mathbb{R}^d \to \mathbb{R}^t$ is a fully connected neural network with 2 hidden layers of dimension $r = 2$.

Figure 1: Deep NRC in ResNets: NRC measurements from a ResNet-34 trained on the age-regression task in UTKFace (left column) and a ResNet-18 trained on Carla2D (right column). The vertical green line in all plots indicates the first collapsed layer. The first row (NRC1) shows the noise component being a small fraction of the energy in the collapsed layer representations. The second row shows the CKA between layer features and the target (NRC2). The third row (NRC3) shows the alignment between the features and the weights in the collapsed layers, and the final row (NRC4) shows the MSE of linearly predicting the targets from the features in each layer.

Figure 2: Deep NRC in MLPs: NRC measurements from 8-layer MLPs trained on imitation learning tasks in the MuJoCo Swimmer (left column) and Hopper (right column) environments. The vertical green line in all plots indicates the first collapsed layer. The top row (NRC1) shows the noise component being a small fraction of the energy in the collapsed layer representations. The second row shows the CKA between layer features and the target (NRC2). The third row (NRC3) shows the alignment between the features and the weights in the collapsed layers, and the bottom row (NRC4) shows the MSE of linearly predicting the targets from the features in each layer.

The SGEMM task requires us to predict the runtimes of different matrix multiplication kernels from various input features. The targets are 4 different measurements of the same kernel. This means that even though the target is notionally $t = 4$-dimensional, these measurements should be highly correlated, and in fact of rank $r = 1$.

In order to identify whether deep NRC solutions learn the intrinsic dimension, we measure NRC1 using the top $r$-dimensional feature subspace rather than the top $t$-dimensional subspace of the features $H_\ell$. If the noise component is still small, we can conclude that the models truly learn the intrinsic dimension of the problem. We also measure the stable rank of the features and see whether it corresponds with the stable rank of the targets. For the NRC3 condition, in addition to feature-weight alignment in the top $r$-dimensional subspaces of the weights and features, we also measure the alignment between the top $r$-dimensional weight subspace and the bottom $(h - r)$-dimensional feature subspace to check whether the collapsed layers pass noisy information to subsequent layers.

Results: The results of these experiments are presented in Figure 3. The top row presents the measurements on the SGEMM task, while the bottom row presents the low-rank nonlinear target results.
Figure 3: Learning intrinsic dimension of low rank targets: Top row: 8-layer, 256-width MLP trained on SGEMM (4-dim target, rank 1). Bottom row: 10-layer, 1024-width MLP trained on a synthetic low rank nonlinear dataset (10-dim target, rank 2). In the left column we plot the noise component using the NRC1 formula, but measured using the bottom $(h - r)$-dimensional subspace, not $(h - t)$. In the middle column, we plot the stable rank of the layer features and observe that it matches the target stable rank in the collapsed layers. In the right column, we plot the feature-weight alignment (NRC3) using the top $r$-dimensional subspace in purple, and the alignment between the features and the bottom $(h - r)$-dimensional subspace of the weights in salmon. These plots establish that collapsed layers learn the intrinsic dimension of the data.

The left column contains noise component (NRC1) measurements, the middle column contains the feature stable rank measurements, and the right column contains feature-weight alignment (NRC3) results. In the right column plots, the signal target-subspace alignments are depicted in purple, while the signal noise-subspace alignments are in salmon. From Figure 3, we can clearly see that the layer features collapse to the intrinsic dimension of the target, not just the target dimension.
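The stable rank reported in the middle column is, under its standard definition (the text does not spell it out), the ratio of the squared Frobenius norm to the squared spectral norm. A short sketch:

```python
import numpy as np

def stable_rank(A):
    """Stable rank ||A||_F^2 / ||A||_2^2: a smooth lower bound on rank(A)
    that is robust to small singular values (undefined for the zero matrix)."""
    s = np.linalg.svd(A, compute_uv=False)   # singular values, descending
    return float((s ** 2).sum() / s[0] ** 2)
```

For an exactly rank-$r$ matrix with equal nonzero singular values this returns $r$; noisy near-low-rank features return a value slightly above their intrinsic rank, which is why it is a useful diagnostic here.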
Moreover, the signal component of the features aligns with the top $r$-dimensional subspace of the weights, while the noise subspace of the features is orthogonal to the weights. We can thus make the claim that deep networks that exhibit collapse learn the intrinsic dimension of their targets, indicating feature learning instead of memorization. We also note that this is related to a result in [42] that shows, through the unconstrained features model (UFM), that learning a single model on multiple targets can be beneficial. As we have demonstrated, learning a single model on multiple targets allows one to discover the intrinsic relationships between the multiple targets, if they have low rank structure.

4.4. Effect of Weight Decay on Deep NRC

In this section, we investigate the conditions under which deep NRC occurs, and specifically the role of weight decay. In [5], the authors use the unconstrained features model to show that in the presence of weight decay, we can guarantee that NRC will be satisfied at the last layer. Moreover, in the absence of weight decay, we cannot guarantee that training deep networks will find the NRC solution. This is because without weight decay, the problem of training deep networks (in the unconstrained features model) just requires minimizing the MSE loss $\mathcal{L}(W, H) = \frac{1}{2N}\|WH - Y\|_F^2$, which results in the family of solutions $H = W^\dagger Y + (I - W^\dagger W)Z$ for any full rank matrix $W$. This does not guarantee alignment between $W$ and $H$ at the last layer, let alone layers below the last layer. We confirm that this observation still holds for the case of deep NRC. We trained ResNet-18 models on Carla2D, and MLPs on SGEMM and a synthetic dataset, with different values of weight decay to study how weight decay influences deep NRC.
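The family of zero-loss solutions above can be checked numerically: with a random full-row-rank $W$ (so that $W W^\dagger = I$), the loss vanishes for any choice of $Z$, which is why nothing forces $W$ and $H$ to align without weight decay. A small sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
t, h, N = 3, 16, 50                  # small illustrative sizes (assumption)
W = rng.standard_normal((t, h))      # full row rank with probability 1, so W @ W^+ = I_t
Y = rng.standard_normal((t, N))

W_pinv = np.linalg.pinv(W)
losses = []
for _ in range(3):                   # three arbitrary choices of Z
    Z = rng.standard_normal((h, N))
    H = W_pinv @ Y + (np.eye(h) - W_pinv @ W) @ Z
    # MSE loss L(W, H) = (1/2N) ||W H - Y||_F^2 from the unconstrained features model
    losses.append(np.linalg.norm(W @ H - Y, "fro") ** 2 / (2 * N))
# every loss in `losses` is numerically zero, independent of Z
```

The identity behind this is $W H = W W^\dagger Y + (W - W W^\dagger W) Z = Y$, since $W W^\dagger W = W$ for any pseudoinverse.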
In Figure 4, we present the results from Carla2D (left column) and the synthetic dataset (right column). We compare the noise component (NRC1), feature-weight alignment (NRC3), and the stable rank of the weights for Carla2D models trained with different values of weight decay between $\lambda = 0$ and $\lambda = 5 \times 10^{-3}$. We see that for small values of weight decay (in salmon), our Carla2D models do not achieve deep NRC and show high values of the noise component (top left) and low feature-weight alignment (middle left), even though they achieve low train and test loss. The model that exhibits deep NRC is the one trained with sufficient weight decay, $\lambda = 5 \times 10^{-3}$ (plotted in purple). When we compare the stable rank of the weight matrices in the Carla2D models in the bottom row of the left column, we see that using a higher value of weight decay can result in lower rank weight matrices. The best way to jointly minimize the loss and the rank of the weight matrices is to find the deep NRC solution.

However, large values of weight decay can also be detrimental. In our experiments on MLPs trained on the synthetic dataset, presented in the right column of Figure 4, we observe that increasing the weight decay to a large value $\lambda = 1 \times 10^{-3}$ may result in a model with collapsed layers, as indicated by the noise component and feature-weight alignment graphs. However, with excessively high weight decay, we will find models with higher train and test loss. As shown by Theorem 4.1 in [5], if the weight decay exceeds the size of the target covariance, the optimal solution for the unconstrained features model is $W = 0, H = 0$.

4.5. Discussion

In the previous sections we have presented experimental results that demonstrate that the four conditions of Deep NRC defined in Section 3 appear beyond the last layer of deep networks trained on regression problems.
A reasonable question is whether these four conditions uniquely characterize NRC, and whether alternative formulations exist. To answer this concretely, we define a collapsed layer in a deep network to be one that exclusively contains information about the target in its representations and weights. Our NRC conditions in section 3 attempt to characterize this geometrically by ensuring that the representations lie in a low-rank subspace (NRC1), that this low-rank subspace aligns with the target (NRC2), and that it also aligns with the top target subspace of the weights (NRC3). In Appendix C we show that NRC1 and NRC2 together ensure that linear predictability (NRC4) is satisfied.

Our results show that deep NRC occurs for both full-rank (Figure 2) and low-rank targets (Figure 3). Our observations about low-rank targets are similar to those made by Jacot [43] about the representation costs of deep networks and their relationship to low-rank structures. Deep networks that exhibit Deep NRC may lie in the "rank recovery" regime, where they learn the true rank of the function being estimated. This may be used to estimate the intrinsic dimensionality of a prediction task. Prior theoretical analysis of neural regression collapse at the last layer through the unconstrained features model [5] has shown that global minima of the square loss exhibit NRC. Extending this to deeper layers is, however, not straightforward, as Súkeník et al. [34] show that lower-rank solutions that do not exhibit neural collapse exist for multi-class classification beyond the binary case. Our goal in this paper is to establish Deep Neural Regression Collapse as an empirical phenomenon, and we leave the theoretical characterization to future research. It appears that we may need to think beyond the deep unconstrained features model for this.
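The NRC2 alignment is measured with linear CKA [36]. A minimal sketch of the standard linear CKA between a centered feature matrix and the targets (the matrix shapes here are illustrative, not the paper's):

```python
import numpy as np

def linear_cka(H, Y):
    """Linear CKA between features H (N x h) and targets Y (N x t)."""
    H = H - H.mean(axis=0)                      # center each column
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(H.T @ Y, "fro") ** 2
    den = np.linalg.norm(H.T @ H, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)

rng = np.random.default_rng(0)
Y = rng.standard_normal((200, 3))
H_aligned = Y @ rng.standard_normal((3, 64))    # features spanned by the target
H_random = rng.standard_normal((200, 64))       # features unrelated to the target

cka_hi = linear_cka(H_aligned, Y)               # close to 1
cka_lo = linear_cka(H_random, Y)                # much smaller
print(cka_hi, cka_lo)
```

Linear CKA lies in [0, 1] and is invariant to orthogonal transformations and isotropic scaling of either argument, which is why it is a natural measure of subspace alignment between features and targets.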
Future theoretical analysis would also ideally characterize whether the NRC conditions always occur together. In Figure 4 we see that NRC1 may be satisfied without NRC3 occurring when the weight decay is too low, meaning the network may solve the regression problem without enforcing any structure on the weights. However, if we set the weight decay parameter too high, we may underfit the training data while enforcing collapse, which would lead to a poor NRC2 metric while observing NRC1 and NRC3. More research into the training dynamics in the NRC regime will help answer these questions.

Figure 4: Effect of weight decay. Left column - we train ResNet18s on Carla2D with varying values of weight decay, and observe the effects on deep neural regression collapse. In the top and middle rows we plot the NRC1 and NRC3 metrics; the bottom row shows the stable rank of the weights. In each plot, the measurements with the right value of weight decay (λ = 5e−3) are shown in purple, while the measurements with smaller values of weight decay (λ ∈ {5e−4, 5e−5, 0}) are in salmon. This shows that weight decay is necessary to achieve feature-weight alignment, which implies a low-rank bias in the weights of the layers. Right column - we explore, through an experiment on synthetic data, how larger values of weight decay (λ = 1e−3) can induce NRC1 and NRC3 (top and middle rows) but perform worse at prediction (bottom row). While weight decay is necessary for observing Deep NRC, too high a value can hinder learning.

5. Conclusion

In this paper, we present the four conditions of Deep Neural Regression Collapse, namely Noise Suppression, Signal-Target Alignment, Feature-Weight Alignment, and Linear Predictability. We obtain these conditions as a principled extension of the NC conditions in classification. We show that Neural Regression Collapse occurs beyond the last layer on a variety of model architectures and datasets. In addition, we explore the necessity of weight decay, and observe that Deep NRC solutions capture the intrinsic dimension rather than the ambient dimension of low-rank targets, showing that inducing Deep NRC can lead to generalizable solutions. This paper provides further evidence for a universal bias towards minimal depth in deep networks. In the future, we would like to leverage the bias of Deep Neural Regression Collapse to design more efficient training algorithms. It also remains to prove theoretically that deep NRC is a consequence of training deep networks with MSE loss. Other potential directions for future research include using the trained models to uncover intrinsic relationships between target variables, and using the low-rank structure of Deep NRC for efficient model editing.

References

[1] Vardan Papyan, XY Han, and David L Donoho. Prevalence of neural collapse during the terminal phase of deep learning training.
Proceedings of the National Academy of Sciences, 117(40):24652–24663, 2020.

[2] Akshay Rangamani, Marius Lindegaard, Tomer Galanti, and Tomaso A. Poggio. Feature learning in deep classifiers through intermediate neural collapse. In International Conference on Machine Learning, 2023. URL https://api.semanticscholar.org/CorpusID:259289154.

[3] Andrea Pinto, Akshay Rangamani, and Tomaso A Poggio. On generalization bounds for neural networks with low rank layers. In Gautam Kamath and Po-Ling Loh, editors, Proceedings of The 36th International Conference on Algorithmic Learning Theory, volume 272 of Proceedings of Machine Learning Research, pages 921–936. PMLR, 24–27 Feb 2025. URL https://proceedings.mlr.press/v272/pinto25a.html.

[4] Antoine Ledent, Rodrigo Alves, and Yunwen Lei. Generalization bounds for rank-sparse neural networks. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=n3M8h9mqDm.

[5] George Andriopoulos, Zixuan Dong, Li Guo, Zifan Zhao, and Keith W. Ross. The prevalence of neural collapse in neural multivariate regression. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=Wq6aY6fC2H.

[6] X.Y. Han, Vardan Papyan, and David L. Donoho. Neural collapse under MSE loss: Proximity to and dynamics on the central path. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=w1UbdvWH_R3.

[7] Tomaso Poggio, Qianli Liao, and Andrzej Banburski. Complexity control by gradient descent in deep networks. Nature Communications, 11(1):1027, 2020.

[8] Tolga Ergen and Mert Pilanci. Revealing the structure of deep neural networks via convex duality. In International Conference on Machine Learning, 2020. URL https://api.semanticscholar.org/CorpusID:222142337.
[9] Jinxin Zhou, Chong You, Xiao Li, Kangning Liu, Sheng Liu, Qing Qu, and Zhihui Zhu. Are all losses created equal: A neural collapse perspective. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 31697–31710. Curran Associates, Inc., 2022.

[10] Arthur Jacot, Peter Súkeník, Zihan Wang, and Marco Mondelli. Wide neural networks trained with weight decay provably exhibit neural collapse. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=1HCN4pjTb4.

[11] Leyan Pan and Xinyuan Cao. Towards understanding neural collapse: The effects of batch normalization and weight decay. arXiv preprint arXiv:2309.04644, 2023.

[12] Akshay Rangamani and Andrzej Banburski-Fahey. Neural collapse in deep homogeneous classifiers and the role of weight decay. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4243–4247. IEEE, 2022.

[13] Hangfeng He and Weijie J. Su. A law of data separation in deep learning. Proceedings of the National Academy of Sciences, 120(36):e2221704120, 2023.

[14] Vignesh Kothapalli, Tom Tirer, and Joan Bruna. A neural collapse perspective on feature evolution in graph neural networks. Advances in Neural Information Processing Systems, 36, 2024.

[15] Robert Wu and Vardan Papyan. Linguistic collapse: Neural collapse in (large) language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=G0LfcMiRkc.

[16] Cong Fang, Hangfeng He, Qi Long, and Weijie J Su. Exploring deep neural networks via layer-peeled model: Minority collapse in imbalanced training. Proceedings of the National Academy of Sciences, 118(43):e2103091118, 2021.
[17] Hongren Yan, Yuhua Qian, Furong Peng, Jiachen Luo, Zheqing Zhu, and Feijiang Li. Neural collapse to multiple centers for imbalanced data. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=RJEC9fZ9Ma.

[18] Jingtong Su, Ya Shi Zhang, Nikolaos Tsilivis, and Julia Kempe. On the robustness of neural collapse and the neural collapse of robustness. arXiv preprint, 2023.

[19] Yibo Yang, Shixiang Chen, Xiangtai Li, Liang Xie, Zhouchen Lin, and Dacheng Tao. Inducing neural collapse in imbalanced learning: Do we really need a learnable classifier at the end of deep neural network? Advances in Neural Information Processing Systems, 35:37991–38002, 2022.

[20] Trung-Anh Dang, Vincent Nguyen, Ngoc-Son Vu, and Christel Vrain. Memory-efficient continual learning with neural collapse contrastive. arXiv preprint arXiv:2412.02865, 2024.

[21] Antoine Montmaur, Nicolas Larue, and Ngoc-Son Vu. Neural collapse inspired contrastive continual learning. BMVC, 2024.

[22] Dustin G Mixon, Hans Parshall, and Jianzong Pi. Neural collapse with unconstrained features. Sampling Theory, Signal Processing, and Data Analysis, 20(2):11, 2022.

[23] Mengjia Xu, Akshay Rangamani, Qianli Liao, Tomer Galanti, and Tomaso Poggio. Dynamics in deep classifiers trained with the square loss: Normalization, low rank, neural collapse, and generalization bounds. Research, 6:0024, 2023.

[24] Zhihui Zhu, Tianyu Ding, Jinxin Zhou, Xiao Li, Chong You, Jeremias Sulam, and Qing Qu. A geometric analysis of neural collapse with unconstrained features. Advances in Neural Information Processing Systems, 34:29820–29834, 2021.

[25] Jinxin Zhou, Xiao Li, Tianyu Ding, Chong You, Qing Qu, and Zhihui Zhu. On the optimization landscape of neural collapse under MSE loss: Global optimality with unconstrained features. In International Conference on Machine Learning, pages 27179–27202.
PMLR, 2022.

[26] Peter Súkeník, Christoph H. Lampert, and Marco Mondelli. Neural collapse is globally optimal in deep regularized resnets and transformers. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=8WKOk4U9R4.

[27] Peter Súkeník, Marco Mondelli, and Christoph H Lampert. Deep neural collapse is provably optimal for the deep unconstrained features model. Advances in Neural Information Processing Systems, 36:52991–53024, 2023.

[28] Cheng Shi, Liming Pan, and Ivan Dokmanić. Spring-block theory of feature learning in deep neural networks. Physical Review Letters, 134(25):257301, 2025.

[29] Ryan Michael Rifkin. Everything old is new again: a fresh look at historical approaches in machine learning. PhD thesis, Massachusetts Institute of Technology, 2002.

[30] Like Hui and Mikhail Belkin. Evaluation of neural architectures trained with square loss vs cross-entropy in classification tasks. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=hsFN92eQEla.

[31] Jinxin Zhou, Chong You, Xiao Li, Kangning Liu, Sheng Liu, Qing Qu, and Zhihui Zhu. Are all losses created equal: A neural collapse perspective. Advances in Neural Information Processing Systems, 35:31697–31710, 2022.

[32] Can Yaras, Peng Wang, Wei Hu, Zhihui Zhu, Laura Balzano, and Qing Qu. Invariant low-dimensional subspaces in gradient descent for learning deep matrix factorizations. In NeurIPS 2023 Workshop on Mathematics of Modern Machine Learning, 2023. URL https://openreview.net/forum?id=4pPnQqUMLS.

[33] Vidya Muthukumar, Adhyyan Narang, Vignesh Subramanian, Mikhail Belkin, Daniel Hsu, and Anant Sahai. Classification vs regression in overparameterized regimes: Does the loss function matter? Journal of Machine Learning Research, 22(222):1–69, 2021. URL http://jmlr.org/papers/v22/20-603.html.
[34] Peter Súkeník, Christoph Lampert, and Marco Mondelli. Neural collapse vs. low-rank bias: Is deep neural collapse really optimal? Advances in Neural Information Processing Systems, 37:138250–138288, 2024.

[35] Yanting Lu, Liantao Wang, Jianfeng Lu, Jingyu Yang, and Chunhua Shen. Multiple kernel clustering based on centered kernel alignment. Pattern Recognition, 47(11):3656–3664, 2014.

[36] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In International Conference on Machine Learning, pages 3519–3529. PMLR, 2019.

[37] Felipe Codevilla, Matthias Müller, Antonio López, Vladlen Koltun, and Alexey Dosovitskiy. End-to-end driving via conditional imitation learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 4693–4700. IEEE, 2018.

[38] Zhifei Zhang, Yang Song, and Hairong Qi. Age progression/regression by conditional adversarial autoencoder. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5810–5818, 2017.

[39] Enrique Paredes and Rafael Ballester-Ripoll. SGEMM GPU kernel performance. UCI Machine Learning Repository, 2017. DOI: https://doi.org/10.24432/C5MK70.

[40] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.

[41] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.

[42] George Andriopoulos, Soyuj Jung Basnet, Juan Guevara, Li Guo, and Keith Ross. Neural multivariate regression: Qualitative insights from the unconstrained feature model. arXiv preprint arXiv:2505.09308, 2025.

[43] Arthur Jacot. Implicit bias of large depth networks: a notion of rank for nonlinear functions.
In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=6iDHce-0B-a.

A. Experimental Settings

We conduct experiments with MLP and ResNet architectures. We trained our models to minimize the mean squared error (MSE) loss using a stochastic gradient descent (SGD) optimizer with a momentum of 0.9, and employed a multi-step learning rate scheduler.

We trained MLP models for the MuJoCo environments and the SGEMM dataset: 8-layer MLPs with h = 256 dimensional hidden layers for the Swimmer and Hopper environments, and 8-layer MLPs with h = 512 dimensional hidden layers for the Reacher environment and the SGEMM dataset. Our initial learning rates are 0.005 for SGEMM, 0.05 for Reacher, and 0.1 for both the Hopper and Swimmer environments. In addition, we used weight decay parameters of 5e−3 for Swimmer, 1e−3 for Hopper and SGEMM, and 1e−4 for Reacher. We trained all MLPs with a batch size of 128 for 1000 epochs each, except for SGEMM, where the MLP was trained for 200 epochs.

For the ResNet experiments, we trained a ResNet-18 on Carla2D and a ResNet-34 on UTKFace. The initial learning rates are 0.01 for Carla2D and 0.001 for UTKFace. Additionally, our weight decay hyperparameters are 5e−3 and 5e−4 for Carla2D and UTKFace, respectively. We trained the ResNet-18 for 250 epochs on Carla2D, while the ResNet-34 was trained for 100 epochs on UTKFace. The batch size is 128 for Carla2D and 512 for UTKFace.

B. Additional Figures

B.1. Loss Plots of Collapsed Models

The training and test loss plots for the collapsed models are presented in Figure 5 due to the lack of space in the main paper.

B.2. Deep NRC on Additional Datasets

We present results on the Reacher dataset, SGEMM, and the synthetic data experiments that could not be included in the main paper due to the lack of space.
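The optimizer described above can be sketched in a few lines. This is a generic reimplementation of SGD with momentum, L2 weight decay, and a multi-step schedule, following the momentum-buffer convention common to deep learning frameworks; the toy quadratic objective and milestone epochs are illustrative, not the paper's:

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr, momentum=0.9, weight_decay=5e-3):
    # L2 weight decay enters as lambda * w added to the gradient;
    # the momentum buffer accumulates the regularized gradient.
    g = grad + weight_decay * w
    velocity = momentum * velocity + g
    w = w - lr * velocity
    return w, velocity

def multistep_lr(base_lr, epoch, milestones=(100, 150), gamma=0.1):
    # multiply the learning rate by gamma at each milestone (hypothetical epochs)
    return base_lr * gamma ** sum(epoch >= m for m in milestones)

# toy quadratic objective: minimize 0.5 * ||w - target||^2
w, v = np.zeros(4), np.zeros(4)
target = np.ones(4)
for epoch in range(200):
    lr = multistep_lr(0.05, epoch)
    grad = w - target
    w, v = sgd_momentum_step(w, grad, v, lr)
print(w)  # close to target, shrunk slightly towards 0 by weight decay
```

Note that the stationary point is target / (1 + λ) rather than the target itself: weight decay trades a small bias for the low-rank structure studied in section 4.4.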
Figure 5: Loss plots. The training and test loss plots for the collapsed models trained on SGEMM (top left), Swimmer (top right), Reacher (middle left), Hopper (middle right), Carla2D (bottom left), and UTKFace (bottom right) are shown. The vertical green line indicates the epoch at which the trained models first start experiencing Deep NRC. We observe that the collapsed models exhibit Deep NRC after both the training and test losses have stabilized. We also observe that the generalization gap is small for the models experiencing Deep NRC, showing that Deep NRC promotes the generalization capabilities of deep learning models.

Figure 6: Results on the Reacher dataset. Top row - Noise Suppression (NRC1) shown in the left plot and CKA alignment (NRC2) shown in the right plot. Bottom row - Feature-Weight Alignment (NRC3) shown in the left plot, while the right plot shows Linear Predictability (NRC4), i.e. the MSE of predicting the target from the layer features using the pseudoinverse method (overall train loss shown for reference). We observe that all four conditions of collapse occur not just in the last layer, but also in the previous layers.
Figure 7: Results on the SGEMM dataset. Top row - Noise Suppression (NRC1) shown in the left plot and CKA alignment (NRC2) shown in the right plot. Bottom row - Feature-Weight Alignment (NRC3) shown in the left plot, while the right plot shows Linear Predictability (NRC4), i.e. the MSE of predicting the target from the layer features using the pseudoinverse method (overall train loss shown for reference). We observe that all four conditions of collapse occur not just in the last layer, but also in the previous layers.

Figure 8: Results on the synthetic nonlinear dataset. Top row - Noise Suppression (NRC1) shown in the left two plots (noise component and stable rank of H, with the target stable rank shown for reference) and CKA alignment (NRC2) shown in the rightmost plot. Bottom row - Feature-Weight Alignment (NRC3) shown in the leftmost plot; the middle plot measures the stable rank of the weight matrix W, and the rightmost plot shows Linear Predictability (NRC4), i.e. the MSE of predicting the target from the layer features using the pseudoinverse method (overall train loss shown for reference). We observe that all four conditions of collapse occur not just in the last layer, but also in the previous layers.
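The synthetic dataset used in Figure 8 draws inputs from a 20-dimensional normal distribution and produces 3-dimensional targets through a small random teacher network (described below). A hypothetical generation sketch, assuming a ReLU teacher and a sample count of our choosing (neither is stated in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, r, t = 5000, 20, 10, 3   # N is an assumed sample count; d, r, t follow the text

X = rng.standard_normal((N, d))          # inputs ~ N(0, I_20)

# random teacher: 2 hidden layers of width r = 10 (ReLU is an assumption)
W1 = rng.standard_normal((d, r)) / np.sqrt(d)
W2 = rng.standard_normal((r, r)) / np.sqrt(r)
W3 = rng.standard_normal((r, t)) / np.sqrt(r)

relu = lambda z: np.maximum(z, 0.0)
Y = relu(relu(X @ W1) @ W2) @ W3         # t = 3 dimensional targets

print(X.shape, Y.shape)
```

Because the targets factor through an r = 10 dimensional bottleneck, their effective rank is low relative to the input dimension, which is what the stable-rank reference lines in Figure 8 track.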
Inputs are drawn from a 20-dimensional normal distribution. The t = 3 dimensional targets for this model were generated from a fully-connected neural network with 2 hidden layers of dimension r = 10.

B.3. Weight Decay Plots for SGEMM Dataset

We present our results from the weight decay experiments on the SGEMM dataset here due to the lack of space in the main paper.

Figure 9: Weight decay experiment results on the SGEMM dataset. We train MLPs on the SGEMM dataset with varying values of weight decay, and observe the effects on deep NRC. From left to right, we plot the NRC1 metric, the NRC3 metric, and the stable rank of the weights at each layer. In each plot, the measurements with the right weight decay value (λ = 1e−3) are shown in purple, while the remaining weight decay values (λ ∈ {1e−4, 1e−5, 0}) are shown in salmon. From the leftmost plot, we observe that the noise component progresses towards 0 even for the models trained with lower amounts of weight decay or without weight decay; however, the model trained with the right weight decay induces NRC1 in earlier layers compared to the other models. Additionally, we observe that NRC3 is only induced with the right weight decay value, while the models with lower weight decay cannot achieve feature-weight alignment. This shows that weight decay is a necessity for NRC3, and implies a low-rank bias in the weights of the layers, which can also be observed in the stable rank plot.

C. Validity of NRC Conditions

A layer in a deep network is said to be collapsed if it exclusively contains information about the target Y. We can prove that the conditions defined in section 3 capture this notion of collapse through the following analysis, which shows that NRC1 and NRC2 together imply NRC4.

Proposition 1. Let H ∈ R^{N×h} be the centered feature matrix from any layer in a deep regression model and Y ∈ R^{N×t} be the centered targets. Let P_Y be the projection onto the target subspace, let H_sig = P_U H denote the projection of the features onto their top t-dimensional singular subspace U, and let H_noise = H − H_sig denote the residual, so that H = H_sig + H_noise. Suppose the layer is collapsed and satisfies NRC1 and NRC2 up to ε1 and ε2 respectively, and that the energy of H_sig is spread out evenly among its singular vectors.

NRC1 (Noise Suppression): 1 − Tr(H_sig^T H_sig) / Tr(H^T H) ≤ ε1, which means that ||H_noise||_F / ||H||_F ≤ √ε1.

NRC2 (Signal Alignment): 1 − CKA(H, Y) ≤ ε2.

Then the MSE of linearly predicting Y from H is bounded as:

||H − P_Y H||_F / ||H||_F ≤ √ε1 + √ε2

Proof. By the triangle inequality, the projection error onto the target subspace P_Y satisfies:

||(I − P_Y) H||_F ≤ ||(I − P_Y) H_noise||_F + ||(I − P_Y) H_sig||_F

Bounding the noise term: since the projection operator (I − P_Y) is non-expansive (spectral norm ≤ 1), the error contributed by the noise is bounded by the noise magnitude defined in NRC1:

||(I − P_Y) H_noise||_F ≤ ||H_noise||_F ≤ √ε1 ||H||_F

Bounding the signal term: the signal component H_sig lies in the subspace spanned by U. The error is determined by the principal angles θ_i between U and the target subspace. Linear CKA measures the cosines of these angles: CKA(H, Y) = (1/t) Σ_i cos²θ_i, where the θ_i are the principal angles between the subspaces P_Y and P_U. This implies that (1/t) Σ_i sin²θ_i ≤ ε2.
Since the energy of H_sig is spread evenly among its singular vectors σ_i, the ratio of the total squared error to the total signal energy is:

||(I − P_Y) H_sig||_F / ||H_sig||_F = sqrt( Σ_i σ_i² sin²θ_i / Σ_i σ_i² ) ≤ √ε2

and ||H_sig||_F ≤ ||H||_F. Putting the two terms together yields the desired bound.

Moreover, if a layer is not collapsed, in the sense that it contains information that does not relate to the target Y, it should violate either NRC1 (noise component → 0) or NRC2 (CKA → 1). Without both of these conditions we cannot guarantee linear predictability (NRC4). The NRC3 condition extends this description to the weights of the collapsed layers as well.
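The bound in Proposition 1 can be sanity-checked numerically. The construction below (dimensions, noise level, and the equal-scale orthonormal-row signal matrix are all illustrative choices made so the even-energy assumption holds approximately) builds features that nearly satisfy NRC1 and NRC2 and verifies the NRC4 bound:

```python
import numpy as np

rng = np.random.default_rng(0)
N, h, t = 500, 64, 3

# orthonormal basis Q_Y of a random t-dimensional target subspace of R^N
Q_Y, _ = np.linalg.qr(rng.standard_normal((N, t)))
P_Y = Q_Y @ Q_Y.T

# signal with evenly spread energy (orthonormal rows, equal scale) plus small noise
A = np.linalg.qr(rng.standard_normal((h, t)))[0].T * 8.0   # t x h
H = Q_Y @ A + 0.01 * rng.standard_normal((N, h))

# epsilon_1 from the top-t SVD of H (NRC1: noise suppression)
U, s, Vt = np.linalg.svd(H, full_matrices=False)
eps1 = 1.0 - (s[:t] ** 2).sum() / (s ** 2).sum()

# epsilon_2 from the principal angles between the top-t subspace and Q_Y (NRC2)
cos_theta = np.linalg.svd(Q_Y.T @ U[:, :t], compute_uv=False)
eps2 = 1.0 - (cos_theta ** 2).mean()

# NRC4: relative error of linearly predicting the target subspace from H
lhs = np.linalg.norm(H - P_Y @ H) / np.linalg.norm(H)
rhs = np.sqrt(eps1) + np.sqrt(eps2)
print(lhs, rhs)  # lhs <= rhs, as the proposition predicts
```

By Eckart-Young, √ε1 alone is always a lower bound on the left-hand side (P_Y H is one particular rank-t approximation of H), so the √ε2 term is what makes the upper bound close.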
