Path-conditioned training: a principled way to rescale ReLU neural networks
Authors: Arthur Lebeurrier, Titouan Vayer, Rémi Gribonval
Arthur Lebeurrier¹  Titouan Vayer²  Rémi Gribonval³

Abstract

Despite recent algorithmic advances, we still lack principled ways to leverage the well-documented rescaling symmetries in ReLU neural network parameters. While two properly rescaled weights implement the same function, the training dynamics can be dramatically different. To offer a fresh perspective on exploiting this phenomenon, we build on the recent path-lifting framework, which provides a compact factorization of ReLU networks. We introduce a geometrically motivated criterion to rescale neural network parameters whose minimization leads to a conditioning strategy that aligns a kernel in the path-lifting space with a chosen reference. We derive an efficient algorithm to perform this alignment. In the context of random network initialization, we analyze how the architecture and the initialization scale jointly impact the output of the proposed method. Numerical experiments illustrate its potential to speed up training.

1. Introduction

Deep learning has become the dominant paradigm in machine learning, with neural networks achieving breakthrough performance across vision, language, and scientific domains. Training these networks represents a central computational challenge, particularly for large language models, where the cost approximately doubles every 6 months (Sastry et al., 2024). Addressing this challenge requires progress on two fronts: (1) a better understanding of the optimization dynamics, and (2) translating these insights into practical improvements for state-of-the-art models.

¹ENS de Lyon, CNRS, Inria, Université Claude Bernard Lyon 1, LIP, UMR 5668, 69342, Lyon cedex 07, France. ²Inria, Rennes, France. ³Inria, CNRS, ENS de Lyon, Université Claude Bernard Lyon 1, LIP, UMR 5668, 69342, Lyon cedex 07, France.
Correspondence to: Arthur Lebeurrier <arthur.lebeurrier@ens-lyon.fr>.

Preprint. February 24, 2026. Code available at https://github.com/Artim436/pathcond

Among all neural network architectures, ReLU networks play a singular role, as they have demonstrated strong empirical performance over the years on a wide range of tasks (Krizhevsky et al., 2012; Szegedy et al., 2015; Howard et al., 2017; Silver et al., 2016; Dosovitskiy et al., 2021). Importantly, ReLU networks exhibit a well-documented symmetry: specific weight rescalings leave the network function unchanged, due to the positive homogeneity of the ReLU activation. This symmetry has many implications.

First, it has proven useful in understanding existing optimization algorithms through conservation laws and implicit biases (Du et al., 2018; Kunin et al., 2021; Marcotte et al., 2023; Zhao et al., 2023). In short, these analyses reveal conserved quantities, arising for example from the rescaling symmetry, that constrain and shape the geometry of learning trajectories in both parameter and function space. These effects are closely related to the so-called "rich feature regime", in which the network learns "meaningful" representations from the data (Kunin et al., 2024; Dominé et al., 2025). Moreover, it has been shown that the implicit bias of gradient-based optimization can be characterized through a mirror-flow reparameterization of the dynamics (Gunasekar et al., 2018; Azulay et al., 2021), which can be understood, in part, by leveraging the rescaling symmetry of ReLU networks (Marcotte et al., 2025).

Second, rescaling symmetry has enabled the design of novel optimization methods that exploit this invariance to either accelerate training or reach better local minima. Notable examples include G-SGD (Meng et al., 2019), Path-SGD (Neyshabur et al., 2015), and Equinormalization (Stock et al., 2019), among others (Badrinarayanan et al., 2015; Saul, 2023b; Mustafa & Burkholz, 2024). Broadly, these methods can be categorized into two groups: teleportation approaches (Zhao et al., 2022; Armenta et al., 2023), which perform a standard optimization step followed by a "symmetry-aware" correction that preserves the implemented function, and algorithms that are intrinsically invariant to the underlying symmetries.

However, the problem of leveraging rescaling symmetries to better understand and improve training dynamics remains open, as the previously mentioned schemes are often guided more by practical considerations than by formal theoretical principles. In this context, the objective of this article is to leverage the concept of path-lifting Φ (Neyshabur et al., 2015; Bona-Pellissier et al., 2022; Stock & Gribonval, 2023; Gonon et al., 2023) in a geometrically motivated manner to design algorithms that improve training dynamics. The path-lifting Φ(θ) provides an intermediate representation between the finite-dimensional parameter space θ (weights and biases) and the infinite-dimensional space of network realizations f_θ (the functions implemented by the network).

Contributions and outline. After developing a geometric perspective based on path-lifting that clarifies the role of rescaling symmetries in neural network training dynamics, we propose a new geometry-aware rescaling criterion to "teleport" any given parameter θ to a rescaling-equivalent one θ′ with better training properties, as well as an efficient algorithm, called PathCond, to compute θ′. We provide a thorough analysis of the effects of our criterion, establishing regimes of interest in terms of initialization scale and architectural choices.
We demonstrate that on specific datasets and architectures, using our (parameter-free) rescaling criterion at initialization alone can significantly improve training dynamics, and never degrades it. On CIFAR-10, we show that we can achieve the same accuracy as the baseline in 1.5× fewer epochs.

Notations. The data dimension will be denoted by d and the output dimension by k; parameters of the neural network will be in R^p and the lifted representation in R^q. The networks will have H neurons.

2. Sketch of the idea on a simple example

To illustrate the overall approach considered in this paper, let us start with a standard problem where we have access to n training data (x_i, y_i) with x_i ∈ R^d, y_i ∈ R^k, and, given a loss, we wish to find a parametrized function (typically a neural network) f_θ : R^d → R^k with θ ∈ R^p that minimizes L(θ) := (1/n) Σ_{i=1}^n loss(f_θ(x_i), y_i). To give intuition about our method, we place ourselves in the idealized scenario where the optimization is carried out via a gradient flow

    θ̇ = −∇L(θ).    (1)

From parameter space to lifted space. Consider the simplest ReLU setting where H = d = k = 1 and f_θ is a one-neuron ReLU network with bias, f_{θ=(u,v,w)}(x) = u ReLU(vx + w). The rescaling invariance property manifests itself via the fact that, for each λ > 0, the rescaled parameter θ^(λ) := (u/λ, λv, λw) satisfies f_θ = f_{θ^(λ)}. This function rewrites as f_θ(x) = u 1_{vx+w>0} (vx + w), that is, f_θ(x) = 1_{vx+w>0} ⟨Φ(θ), (x, 1)⊤⟩ where Φ(θ) = (uv, uw)⊤. One important property of the vector Φ(θ), called the path-lifting (formal definition in Section 3), is its rescaling invariance: Φ(θ) = Φ(θ^(λ)) for any λ > 0. In other words, if we rescale the input parameters by λ > 0 and the output parameter by 1/λ, we change neither f_θ nor Φ(θ).
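This invariance is straightforward to check numerically; a minimal sketch of the toy network in plain NumPy (the particular values of θ and λ below are arbitrary illustrations):

```python
import numpy as np

def f(theta, x):
    # One-neuron ReLU network f_theta(x) = u * ReLU(v*x + w).
    u, v, w = theta
    return u * np.maximum(v * x + w, 0.0)

def phi(theta):
    # Path-lifting of the toy network: Phi(theta) = (u*v, u*w).
    u, v, w = theta
    return np.array([u * v, u * w])

def rescale(theta, lam):
    # Rescaled parameter theta^(lambda) = (u/lambda, lambda*v, lambda*w).
    u, v, w = theta
    return np.array([u / lam, lam * v, lam * w])

theta = np.array([2.0, -1.0, 0.5])
x = np.linspace(-3, 3, 101)
for lam in (0.1, 1.0, 7.0):
    t2 = rescale(theta, lam)
    assert np.allclose(f(theta, x), f(t2, x))   # same function f_theta
    assert np.allclose(phi(theta), phi(t2))     # same path-lifting Phi(theta)
```

Both the implemented function and the path-lifting are unchanged along the whole rescaling orbit, as stated above.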
In particular, for a parameter vector θ = (u, v, w) with u > 0, rescaling with λ := u yields that for every input x we have f_θ(x) = f_{θ^(u)}(x) = ReLU(⟨Φ(θ), (x, 1)⊤⟩), i.e., f_θ only depends on Φ(θ). With θ_opt = (1, 1, 0)⊤, f_{θ_opt} = ReLU, and if we sample x_i ∼ N(0, 1), set y_i = f_{θ_opt}(x_i) = ReLU(x_i), and train with the square loss, the previous discussion leads to two conclusions. First, the risk L(θ) only depends on Φ(θ) when u > 0: it can be factorized as L(θ) = ℓ(Φ(θ)). Second, any model f_θ satisfying Φ(θ) = Φ(θ_opt) = (1, 0)⊤ implements the same function ReLU and achieves zero training loss. This motivates an approach in which the learning problem is viewed not only as optimization over θ in parameter space, but also as optimization in the lifted space over Φ(θ), where targeting Φ_opt := Φ(θ_opt) provides a principled strategy from the perspective of minimizing the training loss.

Rescaling to mimic gradient descent in lifted space. The overall approach of this paper is thus guided by a geometric view of the training dynamics in a lifted representation of the network. This can be illustrated on our simple example by minimizing L(θ) with standard gradient descent (GD) and small learning rates (simulating the gradient flow (1)), and observing it both in parameter space and in lifted space, for different initializations (Figure 1). By the chain rule, the ODE (1) also induces an ODE in the lifted space: denoting P_θ := ∂Φ(θ) ∂Φ(θ)⊤ the path-kernel,

    ∂_t Φ(θ) := (d/dt) Φ(θ(t)) = −P_θ ∇_Φ ℓ(Φ(θ)).    (2)

The rescaling invariance of Φ(θ) does not extend to the path-kernel P_θ: the latter varies when θ is replaced by θ^(λ). As described later, given θ, PathCond aims to identify a parameter rescaling λ such that θ′ := θ^(λ) conditions the path-kernel P_{θ′}, i.e., best aligns it with the identity so that P_{θ′} ≈ I.
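In the same toy setting, one can verify that the path-lifting is constant along the rescaling orbit while the path-kernel is not; a short NumPy sketch (the chosen θ and λ values are arbitrary):

```python
import numpy as np

def phi(theta):
    # Path-lifting of the toy network: Phi(theta) = (u*v, u*w).
    u, v, w = theta
    return np.array([u * v, u * w])

def path_kernel(theta):
    u, v, w = theta
    J = np.array([[v, u, 0.0],    # Jacobian of Phi w.r.t. theta = (u, v, w)
                  [w, 0.0, u]])
    return J @ J.T                # P_theta = dPhi dPhi^T (2x2 here)

def rescale(theta, lam):
    u, v, w = theta
    return np.array([u / lam, lam * v, lam * w])

theta = np.array([4.0, 1.0, 0.5])
t2 = rescale(theta, 2.0)
assert np.allclose(phi(theta), phi(t2))                      # Phi is invariant ...
assert not np.allclose(path_kernel(theta), path_kernel(t2))  # ... but P_theta is not
# The conditioning of P_theta therefore varies along the rescaling orbit:
conds = {lam: np.linalg.cond(path_kernel(rescale(theta, lam)))
         for lam in (0.5, 1.0, 2.0)}
```

Since the condition number of P_θ changes with λ while the function and Φ(θ) do not, the rescaling degree of freedom can be spent purely on conditioning, which is what PathCond exploits.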
This alignment aims to induce trajectories in lifted space close to those that would have been achieved by directly performing gradient descent / flow in the lifted space, Φ̇ ≈ −∇ℓ(Φ), without ever incurring the cost of pseudo-inverting the large matrices appearing in "natural gradient" approaches (see discussion in Appendix B).

[Footnote: In general, f_θ(x) = sign(u) ReLU(sign(u) ⟨Φ(θ), (x, 1)⊤⟩), so f_θ only depends on (sign(θ), Φ(θ)).]

Figure 1. GD for a toy model f_{θ=(u,v,w)}(x) = u ReLU(vx + w) on a loss L(θ) that can be factorized as L(θ) = ℓ(Φ(θ)) (see Section 2). (Left) Loss L(θ) during GD iterations for three different initializations θ_0 (three colors). Dashed lines correspond to GD starting at θ_0, bold lines to GD starting at the rescaled θ_0^(λ) ∼ θ_0 obtained with PathCond. (Middle) Trajectories in lifted space Φ(θ) = (uv, uw)⊤. Dotted lines are trajectories corresponding to ∂_t Φ = −∇_Φ ℓ(Φ) (GD on ℓ(Φ)). (Right) Trajectories in parameter space.

For our example, such rescaling factors are computed using PathCond for the three considered initializations, and the resulting trajectories of (θ(t), Φ(θ(t))) are illustrated in Figure 1. As shown in the left subplot, rescaled initializations (plain lines) lead to faster convergence, and indeed, as illustrated in the middle subplot, the trajectories
in lifted space obtained with rescaled initializations (bold lines) more closely follow the trajectory of the idealized ODE ∂_t Φ = −∇_Φ ℓ(Φ) (dotted lines) than the trajectories starting from the non-rescaled initializations (dashed lines).

Relation to (neural tangent) kernel conditioning. The general approach described in the next section builds on the vision that the insights drawn from the above simple example, as well as the underlying motivations, extend to larger-scale models. Before entering into the details, let us briefly explain how the resulting approach also bears connections with the conditioning of the celebrated neural tangent kernel (NTK) (Jacot et al., 2018). As described in many contexts, the dynamics in (1) also induce an ODE on f_{θ(t)}(x) (at any point x), driven by the NTK K_θ(x, x′) := ∂_θ f_θ(x) ∂_θ f_θ(x′)⊤ ∈ R^{k×k} (Jacot et al., 2018), where ∂_θ f_θ(x) ∈ R^{k×p} denotes the Jacobian of θ → f_θ(x). The NTK can be used to analyze the dynamics of neural networks in specific training regimes. In particular, the spectrum of the NTK governs the convergence rate of the training loss (Bowman, 2023); for instance, larger NTK eigenvalues in certain over-parameterized (lazy) regimes, corresponding to linearized networks (Chizat et al., 2019), lead to faster convergence rates (Jacot et al., 2018; Arora et al., 2019).

As already observed by Gebhart et al. (2021) and Patil & Dovrolis (2021), when applying the general path-lifting framework to ReLU networks, the NTK admits a factorization that separates the architectural contribution from a data-dependent term:

    K_θ(x, x′) = Z(x, θ) P_θ Z(x′, θ)⊤,    (3)

where P_θ = ∂Φ(θ) ∂Φ(θ)⊤ is the previously introduced path-kernel and Z(x, θ) depends on the activations at every ReLU neuron.
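For the one-neuron toy model of Section 2 (taking u > 0), the factorization (3) can be checked directly: in that case Z(x, θ) = 1_{vx+w>0} (x, 1), a derivation from the formulas above rather than a statement of the paper. A minimal NumPy sketch with hand-written gradients:

```python
import numpy as np

u, v, w = 2.0, 1.5, -0.5              # any theta with u > 0
J = np.array([[v, u, 0.0],            # J = dPhi/dtheta for Phi = (u*v, u*w)
              [w, 0.0, u]])
P = J @ J.T                           # path-kernel P_theta

def grad_f(x):
    # Hand-computed gradient of f(x) = u*ReLU(v*x + w) w.r.t. theta = (u, v, w).
    a = float(v * x + w > 0)          # activation indicator
    return np.array([a * (v * x + w), u * x * a, u * a])

def Z(x):
    # Data-dependent factor: activation indicator times the augmented input (x, 1).
    return float(v * x + w > 0) * np.array([x, 1.0])

for x, xp in [(1.0, 2.0), (-3.0, 0.4), (0.7, -0.2)]:
    assert np.isclose(grad_f(x) @ grad_f(xp),   # NTK via the Jacobian of f ...
                      Z(x) @ P @ Z(xp))         # ... equals Z(x) P_theta Z(x')^T
```

The data (through Z) and the architecture/parameters (through P_θ) enter the NTK separately, which is precisely the leverage that path-conditioning exploits.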
Equation (3) suggests that modifying the spectral properties of P_θ, for instance through appropriate changes of θ, can improve the training dynamics, and that such modifications can be performed solely based on the network architecture via Φ(θ), i.e., independently of the data. For example, in the context of parameter pruning, Patil & Dovrolis (2021) exploit this decomposition to select a subset of parameters that maximizes the trace of P_θ within the subnetwork. In contrast, our approach PathCond focuses on improving the conditioning of P_θ without altering the function f_θ, by applying an efficiently chosen rescaling based on a principled criterion.

3. Rescaling symmetries and the path-lifting

In this section, we introduce the tools underlying our method PathCond, which builds upon the rescaling symmetries of ReLU networks and the path-lifting framework.

Rescaling-equivalent parameters. As advertised in Section 2, for any neuron in a ReLU network, we can scale its incoming weights by a factor λ > 0 and simultaneously scale its outgoing weights by 1/λ without changing the neuron's input-output mapping (Neyshabur et al., 2015; Dinh et al., 2017). This corresponds to multiplying the parameter vector θ by a certain admissible diagonal matrix D ∈ D with positive entries. Formally, denoting D the multiplicative group generated by the set of all such diagonal matrices (when both the chosen neuron and the factor λ vary), two parameter vectors θ and θ′ are said to be rescaling-equivalent (denoted θ′ ∼ θ) if there is D ∈ D such that θ′ = Dθ (see details in Appendix D). This defines an equivalence relation on the parameter space, partitioning it into equivalence classes. Importantly, when θ and θ′ belong to the same class we have f_{θ′} = f_θ.

Gradient flow is not rescaling-invariant.
Crucially, gradient flow does not respect this symmetry: rescaling-equivalent parameters θ ∼ θ′ used at initialization of the gradient flow (1) generally lead to different trajectories. This lack of invariance has profound implications. For instance, unbalanced parameter initializations, obtained by applying extreme rescalings (with λ ≫ 1 or λ ≪ 1 for each neuron) to a balanced initialization, can lead to arbitrarily poor optimization and generalization performance (Neyshabur et al., 2015). Conversely, this sensitivity can be exploited: by carefully controlling the rescaling structure, we can hope to guide the training dynamics toward more favorable trajectories (as illustrated in Figure 1). While these ideas have been exploited through several heuristics, which we review below, our goal is to design a more principled heuristic.

Path-lifting. The path-lifting framework (Neyshabur et al., 2015; Bona-Pellissier et al., 2022; Gonon et al., 2023) factors out the parameter redundancy induced by rescaling symmetries. This is achieved through a lifting map Φ from R^p to a lifted space R^q that captures the essential geometric structure of the network while eliminating redundant degrees of freedom. For any parameter θ and path p, the p-th entry of Φ(θ) writes Π_{i∈p} θ_i, i.e., it is the product of the weights along the path. Interestingly, for any output neuron v_out of f_θ, we can compactly write

    v_out(θ; x) = ⟨Φ(θ), A_{v_out}(θ, x) (x, 1)⊤⟩,    (4)

where A_{v_out}(θ, x) is a binary-valued matrix capturing the data-dependent activation patterns (this generalizes the toy example of Section 2). Two of the key properties of Φ are: a) rescaling-equivalent parameters have the same lifted representation:

    Φ(θ) = Φ(Dθ), ∀D ∈ D;    (5)

and b) it induces a (local) factorization of the global loss:
    ∀θ ∈ Ω ⊂ R^p,  L(θ) = ℓ(Φ(θ)),    (6)

for some ℓ : R^q → R, enabling the analysis of the dynamics in the lifted space rather than in the redundant parameter space (Stock & Gribonval, 2023; Marcotte et al., 2025).

[Footnotes: A path is a sequence of edges from some neuron to an output neuron. In the simple example given in Section 2, one possible choice of Ω corresponds to the region u > 0.]

4. Path-conditioned training

Using the local factorization of the loss (6), the starting point of our analysis is that, by the chain rule, the variable z(t) := Φ(θ(t)) in lifted space R^q satisfies

    ż(t) = ∂Φ(θ(t)) θ̇(t) = −∂Φ(θ(t)) ∂Φ(θ(t))⊤ ∇ℓ(z(t)) = −P_{θ(t)} ∇ℓ(z(t))    (7)

as soon as the variable θ(t) in parameter space R^p satisfies (1). The path-kernel P_θ = ∂Φ(θ) ∂Φ(θ)⊤ ∈ R^{q×q} (Gebhart et al., 2021; Marcotte et al., 2025) is thus a (local) metric tensor that encodes how parameter updates propagate in the lifted space R^q. This shows that the gradient flow (1) in parameter space R^p induces a preconditioned flow in the lifted space R^q, with preconditioner equal to P_θ. This can be put in perspective with natural gradient approaches (Amari, 1998; Martens, 2020), where the geometry is induced by the Fisher information matrix, whereas in our setting it is shaped by the path-kernel.

4.1. Rescaling is pre-conditioning

If we replace a parameter θ (either at initialization or during training iterations) by a rescaled version θ′ = Dθ, D ∈ D, then since z = Φ(θ) is unchanged, only P_θ on the right-hand side of (7) can depend on D. In other words, selecting some admissible rescaling corresponds to shaping the geometry of the optimization dynamics in the lifted space. However, what criterion should guide the choice of a rescaling? Several heuristic approaches exist: Equinormalization (Stock et al., 2019; Saul, 2023a) explicitly chooses D to minimize some ℓ_{p,q} norm of θ′ = Dθ, while Mustafa & Burkholz (2024) select rescalings that promote more balanced weights across successive layers (for graph attention networks).

We show in Appendix C that on toy examples several natural criteria used in the literature are in fact equivalent: the Euclidean norms ∥θ′∥₂ and ∥∇L(θ′)∥₂, the condition number κ(∇²L(θ′)), the trace tr(∇²L(θ′)), and the operator norm ∥∇²L(θ′)∥_{2→2}. However, this no longer holds for larger models, and an appropriate criterion must be selected.

4.2. Proposed rescaling criterion

Rather than the heuristic criteria to choose D described above, PathCond relies on a criterion defined by the path-kernel. Our strategy is based on the assumption that gradient descent in the lifted space provides a suitable, rescaling-invariant idealized trajectory. Accordingly, the rescaling D is chosen to mimic this behavior by seeking to best align P_{Dθ} ∇ℓ(Φ(θ)) ∈ R^q with ∇ℓ(Φ(θ)) ∈ R^q. A first naive approach would attempt to actually compute ∇ℓ(Φ(θ)) and optimize some correlation with P_{Dθ} ∇ℓ(Φ(θ)). This would however be highly impractical for practical networks, as the dimension q of these vectors is the number of paths in the graph associated to the network, which grows combinatorially with the network size (e.g., it is of the order of q = 10^58 for ResNet18; details in Appendix I). Instead, we propose to directly seek an alignment between P_θ and the identity matrix. As before, a naive attempt to explicitly compute and optimize P_θ ∈ R^{q×q} is computationally infeasible.
However, as we show next, a carefully designed criterion enables us to entirely bypass this bottleneck: we identify optimal rescalings (according to the proposed criterion) without ever forming or storing P_θ.

Algorithm 1: Idealized Rescaling Algorithm
1: Input: Learning rates (µ_k)_{k≥0}, loss function L(·), divergence d(·∥·).
2: Iterate the following, or only teleport the initialization.
3: for k = 0, 1, 2, . . . do
4:   (α_k, D_k) = argmin_{α>0, D∈D} d(α G_{Dθ_k} ∥ I_p)   // see Algorithm 2
5:   Set θ_{k+1/2} = D_k θ_k   // Teleportation step
6:   θ_{k+1} = θ_{k+1/2} − µ_k ∇L(θ_{k+1/2})   // Gradient step
7: end for

To this end, we must address two challenges: how to measure the "distance" between P_θ and the identity matrix I, and how to circumvent the computational bottleneck arising from the combinatorial size q × q of P_θ.

To address the first challenge, we use Bregman matrix divergences d_ζ(X ∥ Y), induced by a strictly convex function ζ (Kulis et al., 2009), which provide a principled geometry on the space of symmetric matrices (see Appendix E for a formal definition). In particular, we focus on spectral divergences, a subclass defined via functions of the eigenvalues, encompassing common choices such as the Frobenius and logdet (James et al., 1961) divergences.

For the second challenge, we note that P_θ can be expressed as P_θ = AA⊤, where A = ∂Φ(θ). Using any divergence satisfying the property d(AA⊤ ∥ I_q) = d(A⊤A ∥ I_p) for every q × p matrix A, we can work with the matrix G_θ := A⊤A = ∂Φ(θ)⊤ ∂Φ(θ), which has size p × p, with p the number of parameters. In typical neural networks, we have p ≪ q, making G_θ significantly more manageable than P_θ. As shown in Appendix E, the above property is satisfied by spectral divergences, as AA⊤ and A⊤A share the same nonzero spectrum.
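This spectrum argument is easy to check numerically. A minimal sketch (a random Gaussian A as a stand-in for ∂Φ(θ), and a hand-rolled logdet divergence to the identity restricted to the nonzero spectrum, in the spirit of the generalized-determinant variant discussed below; the specific sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
q, p = 50, 8                       # q plays the role of #paths, p of #parameters
A = rng.standard_normal((q, p))    # stand-in for dPhi(theta)

P_big = A @ A.T                    # q x q "path-kernel" (huge in practice)
G_small = A.T @ A                  # p x p Gram matrix, same nonzero spectrum

eig_P = np.sort(np.linalg.eigvalsh(P_big))[-p:]   # the p largest (nonzero) ones
eig_G = np.sort(np.linalg.eigvalsh(G_small))
assert np.allclose(eig_P, eig_G)

def d_logdet_to_identity(eigs, tol=1e-10):
    # Bregman logdet divergence to I evaluated on the nonzero spectrum
    # (generalized-determinant variant): sum of (l - log l - 1).
    lz = eigs[eigs > tol]
    return float(np.sum(lz - np.log(lz) - 1.0))

assert np.isclose(d_logdet_to_identity(np.linalg.eigvalsh(P_big)),
                  d_logdet_to_identity(np.linalg.eigvalsh(G_small)))
```

The two divergences coincide because the extra q − p eigenvalues of AA⊤ are all zero, so the p × p matrix G_θ carries all the spectral information that the criterion needs.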
Generic conditioning algorithm. Given such a spectral divergence measure d(·∥·), the above viewpoint, combined with the teleportation-based algorithm of Zhao et al. (2022), leads to Algorithm 1. Note that we add a coefficient α_k in the optimization step to manage the relative scale of the kernel with respect to the identity matrix.

4.3. Explicit algorithm with the logdet divergence

The final step is to select an appropriate divergence. We choose the Bregman divergence induced by ζ(X) = −log det(X), defined on positive semi-definite matrices. In light of this, we focus on the optimization problem

    min_{D∈D, α>0} d_{−logdet}(α G_{Dθ} ∥ I_p).    (8)

With our final criterion, only the diagonal of G_θ is required. Interestingly, d_{−logdet}(X ∥ I) provides an upper bound on the condition number of X (Dhillon, 2008; Bock & Andersen, 2025), further supporting the interpretation of our approach as selecting a rescaling that improves the condition number of the path-kernel.

To handle potential issues arising from zero eigenvalues, we consider in Appendix F a slight variation that employs the generalized determinant in place of the standard det. Furthermore, as shown in Appendix F, if G := G_θ is positive definite, that is, when ∂Φ is full rank, then G_{Dθ} remains positive definite for every D ∈ D, and the resulting optimization problem coincides with the formulation using the standard log det. The latter can be rewritten as (Lemma F.3)

    min_{u∈R^H} F(u) := p log( Σ_{i=1}^p e^{(Bu)_i} G_ii ) − Σ_{i=1}^p (Bu)_i,    (9)

where H is the number of hidden neurons and B ∈ R^{p×H} is a matrix with entries in {1, −1, 0}, indicating whether a parameter enters (−1), leaves (1), or is unrelated to (0) a given neuron.
Each column of B corresponding to a neuron h ∈ H takes the form b_h = (1_{out_h}, −1_{in_h}, 0_{other_h})⊤ ∈ R^p (up to a permutation of coordinates), where out_h, in_h, and other_h are defined formally in Theorem D.1. The rescaling of h is given by λ_h = e^{u_h}, and D = e^{Bu} (the exponential is taken entrywise) describes the overall rescaling of all parameters. As shown in Lemma F.6 in the appendix, F is convex and (9) admits a solution as soon as g := diag(G) > 0. Although G is generally only positive semi-definite (g ≥ 0), we nonetheless adopt (9) as the basis for our concrete criterion.

Indeed, the reformulation (9) presents several practical advantages. First, the objective F is defined in dimension H, corresponding to the number of neurons, which is typically much smaller than the number of parameters p. Moreover, F depends only on g = diag(G) and therefore does not require the explicit computation of the full p × p matrix G_θ. The vector g can itself be computed efficiently with automatic differentiation, using a single backward pass on the network (see Appendix G). Finally, (9) can be solved efficiently via alternating minimization, as described below.

Lemma 4.1. If g > 0, then for any neuron h the problem min_{u_h∈R} F(u_1, . . . , u_h, . . . , u_H) has a solution given by log(r_h), where r_h is the unique positive root of the polynomial B(A + p) X² + AD X + C(A − p), where, with ρ_{i,h} := e^{(Bu)_i − B_{ih} u_h}, we consider

    A := |{i ∈ in_h}| − |{i ∈ out_h}|,   B := Σ_{i∈out_h} ρ_{i,h} g_i,
    C := Σ_{i∈in_h} ρ_{i,h} g_i,         D := Σ_{i∈other_h} ρ_{i,h} g_i.

We provide a detailed proof of this lemma in Appendix F.

Algorithm 2: PathCond: Path-conditioned rescaling
input: DAG ReLU network, θ to rescale
1: u^(0) = 0 ∈ R^H
2: for k = 0, . . . , n_iter do
3:   for each neuron h do
4:     Given u^(k), compute A, B, C, D defined in Lemma 4.1.
5:     Solve for the positive root r_h > 0 of B(A + p) X² + AD X + C(A − p).
6:     u_h^(k+1) = log(r_h)
output: Rescaled Dθ, where D = diag(e^{B u^(n_iter)})

Based on this result, we use alternating minimization with closed-form coordinate updates, yielding Algorithm 2.

Computational complexity. Algorithm 2 has time complexity O(n_iter · p) and space complexity O(p + H). In practice, n_iter ≪ p, with typical convergence in fewer than 10 iterations. Indeed, the computation of diag(G) requires one backward pass at cost O(p). We never compute Bu explicitly: at the first iteration Bu = u = 0, and thereafter we incrementally update Bu by adding the coefficient difference at cost |out_h| + |in_h| per update. Each iteration cycles through the H neurons, where updating neuron h costs O(|out_h| + |in_h|). Since Σ_{h=1}^H (|out_h| + |in_h|) ≈ 2p, each iteration has complexity O(p). The algorithm stores the dual variables Bu ∈ R^p, the primal variables u ∈ R^H, a sparse matrix B with ≈ 2p nonzero coefficients (all ±1), and index tensors for the edge sets, yielding space complexity O(p + H). Full implementation details and pseudocode are provided in Section H.

5. Experiments

In this section, we demonstrate that rescaling only at initialization with PathCond improves training dynamics by accelerating training loss convergence. In a nutshell, PathCond enables us to reach the same accuracy with up to 1.5 times fewer epochs without compromising generalization (Figure 3). We also compare our approach with the baseline (GD with standard initialization and no rescaling) and with the Equinormalization heuristic (Stock et al., 2019) (ENorm), which performs rescaling at every iteration to minimize a weighted ℓ_{2,2} norm of the layer weights.
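For contrast with such norm-based criteria, the objective (9) is small enough to minimize by hand on a toy network. A minimal sketch (illustrative assumptions, not the paper's implementation: a 1-2-1 ReLU network without biases, its incidence matrix B built by hand, and plain gradient descent in place of the closed-form coordinate updates of Algorithm 2):

```python
import numpy as np

# Tiny 1-2-1 ReLU network without biases: theta = (w11, w12, w21, w22),
# two input-output paths, Phi(theta) = (w11*w21, w12*w22), H = 2, p = 4.
theta = np.array([1.0, 3.0, 2.0, 1.0])
J = np.array([[theta[2], 0.0, theta[0], 0.0],   # dPhi/dtheta
              [0.0, theta[3], 0.0, theta[1]]])
g = np.diag(J.T @ J).copy()                     # diag(G) = [4, 1, 1, 9]
p = theta.size

# Incidence matrix B (rows = parameters, columns = hidden neurons):
# -1 if the parameter enters the neuron, +1 if it leaves it.
B = np.array([[-1, 0], [0, -1], [1, 0], [0, 1]], dtype=float)

def minimize_F(g, iters=4000, lr=0.05):
    """Gradient descent on the convex objective F of eq. (9); a simple
    stand-in for the closed-form coordinate updates of Algorithm 2."""
    u = np.zeros(B.shape[1])
    for _ in range(iters):
        s = np.exp(B @ u) * g
        u -= lr * (p * (B.T @ (s / s.sum())) - B.T @ np.ones(p))
    return u

u_star = minimize_F(g)            # here u* = (log 2, -log 3): the two paths' scales balance
u_flat = minimize_F(np.ones(p))   # constant diag(G): u* = 0, no rescaling needed
```

The second call illustrates the no-op regime discussed in Section 5.3: when diag(G) is constant, the minimizer of (9) is u = 0 and PathCond leaves the parameters unchanged.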
Unlike the ENorm criterion and its variants (Saul, 2023a), which require tuning of hyperparameters (the choice of p, q and of the weights), the PathCond criterion has no hyperparameter. Full details of the ENorm configuration used in our experiments are provided in Section N. Finally, we characterize theoretically the favorable regimes for PathCond and validate them on MNIST autoencoders (Figure 5).

5.1. Faster Training Dynamics

Training setup. Following the experimental protocol of Stock et al. (2019), we evaluate our method on fully connected networks for CIFAR-10 classification. The models take flattened 3072-dimensional images as input. The architecture consists of an initial linear layer (3072 × 500), followed by L − 1 hidden layers of size 500 × 500, and a final classification layer (500 × 10). We vary the depth L from 3 to 9 to analyze depth-dependent effects. All weights are initialized using Kaiming uniform initialization (He et al., 2015). We run 100 epochs of SGD, with a batch size of 128 and a fixed learning rate (for all architectures) of 10⁻³. Results are averaged over 3 independent runs.

Results. The results are shown in Figure 2, where we report the number of epochs required to reach 99% training accuracy (left plot). Overall, PathCond reaches the target accuracy up to 1.5× faster at the same learning rate, and as a side effect also improves parameter efficiency: a 2.5M-parameter model matches the performance of 3.25M-parameter baseline models. The magnitude of the improvement varies with the depth. The 3-hidden-layer network (2M parameters) exhibits the strongest gains: training accuracy converges faster with PathCond initialization and the training loss decreases more rapidly toward zero. Deeper networks show more modest but consistent advantages. Notably, ENorm does not yield improvements in this experiment; only PathCond achieves speedups.
Additional learning rates are reported in Appendix K . PathCond matches or exceeds baseline performance across small to moderate learning rates, with the greatest improv ements observed at small v alues where initialization most critically influences training dynamics. As the learn- ing rate increases, the ef fect becomes less pronounced, yet remains comparable to the baseline and ENorm . 5.2. Generalization While our method is designed to improv e training loss con- ver gence, it does not explicitly target generalization. In this experiment, we sho w that PathCond accelerates training without degrading generalization performance. T raining Setup. W e ev aluate our method on the CIF AR- NV fully con volutional architecture ( Gitman & Ginsburg , 2017 ). The architecture processes CIF AR-10 images with channel-wise normalization. The original training set is split into 40,000 images for training and 10,000 for v ali- dation. Training is performed for 128 epochs using SGD with a learning rate of 0.001. W eights are initialized using Kaiming’ s initialization scheme. W e employ a BatchNorm layer after the final fully connected layer for all methods (Baseline, ENorm , and PathCond ). Additional learning rates are reported in Appendix L . 6 Path-conditioned training: a principled way to rescale ReLU neural networks 2.0M 2.5M 3.0M Number of parameters 30 40 50 60 70 80 90 Epochs to reach 99% Train Acc 0 50 100 Epochs 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Train Accuracy 0 50 100 Epochs 1 0 1 1 0 0 Train Loss Baseline P a t h c o n d Enorm F igur e 2. PathCond performance comparison across network depths on CIF AR-10 with multilayer perceptrons ( Left ) Number of epochs required to reach 99% training accuracy for networks with 2 to 8 hidden layers (abscissa = number of parameters, which increases with depth). ( Middle ) T raining accuracy curv es for the 3 -hidden-layer network. 
(Right) Corresponding training loss curves.

Figure 3. Training dynamics on CIFAR-10 with a fully convolutional architecture (CIFAR-NV). (Left) Training loss, (Middle) training accuracy, and (Right) test accuracy.

Results. Figure 3 shows the training dynamics with the fully convolutional architecture. PathCond achieves faster convergence in both training loss (left panel) and training accuracy (middle panel), reaching near-optimal performance approximately 20 epochs earlier than baseline methods. Critically, this acceleration translates into improved test accuracy (right panel): PathCond reaches 80% best test accuracy while Baseline and ENorm plateau at 77%, demonstrating that the improved training dynamics can also enhance rather than compromise generalization.

5.3. What are the Favorable Regimes for PathCond?

As observed in Figure 2, the network architecture influences the effectiveness of our algorithm. This raises a natural question: are there architectures or initializations for which $G$ is already close to the identity (according to the Bregman divergence) without rescaling $\theta$? Understanding such cases helps identify when PathCond provides the greatest benefits. A simple example is when the diagonal of $G$ is constant, i.e., when there exists $\alpha > 0$ such that $g = \alpha \mathbf{1}$. In this case, the objective function (9) admits $u = 0$ as its unique minimizer. Consequently, the rescaling matrix is the identity and the algorithm has no effect (proof in Appendix J).

Identifying favorable regimes. Our goal is to identify regimes in which the diagonal exhibits a wide spread, i.e., deviates significantly from being constant, by analyzing how $g$ depends on the network architecture and initialization.
We hypothesize—and verify empirically—that a significant spread in the diagonal correlates with the effectiveness of our rescaling algorithm.

For layered fully-connected ReLU networks (LFCNs) without biases (biases are dealt with in Section J.2), the expected diagonal coefficients at initialization can be computed explicitly from the network widths $n_0, \ldots, n_L$ and the variance $\sigma_k^2$ of the initialization of each layer. Standard initializations such as Kaiming (He et al., 2015) have the property that $n_k \sigma_k^2$ is constant across all layers. We denote this value by $a := n_k \sigma_k^2$. We have the following result.

Figure 4. Analysis of the relationship between architectural balance (controlled by the maximum width ratio $\max_{i,j} n_i / n_j$) and log-rescaling magnitude for small and large variance regimes.

Figure 5. Effect of compression on MNIST autoencoder training. (Top) Final training loss for different compression factors. (Bottom) Maximum absolute value of the log-rescaling at initialization.

Proposition 5.1 (Expected diagonal under standard initialization). For any edge parameter $i$ at layer $k \in \{0, \ldots, L-1\}$, the expected $i$-th coefficient of $\mathrm{diag}(G)$ is given by
$$\mathbb{E}[G_{ii}] = \frac{n_L}{n_{k+1}} a^{L-1} + \frac{n_L}{n_{k+1}} \sum_{j=0}^{k-1} \frac{a^{L-1-j}}{n_j}. \quad (10)$$

The full derivation, including the case with biases, is provided in Appendix J. This result helps us identify regimes in which $\mathrm{diag}(G)$ is close to constant. Consider networks with constant width across layers, $n_0 = \ldots = n_L =: n$:

• Case $a = 1$: (10) simplifies to $\mathbb{E}[G_{ii}] = 1 + \frac{k}{n}$.
When $n \gg L$, we have $\frac{k}{n} \ll 1$ and thus $\mathbb{E}[G_{ii}] \approx 1$.

• Case $a \to +\infty$: We have $\mathbb{E}[G_{ii}] \sim a^{L-1}(1 + 1/n)$ for $1 \le k \le L-1$. Again, for $n$ large enough, $\mathrm{diag}(G)$ is close to $\alpha \mathbf{1}$ with $\alpha = a^{L-1}$.

In contrast, when $a \to 0$, for all weights on layer $k$ we have $\mathbb{E}[G_{ii}] \sim \frac{1}{n} a^{L-k}$. Because $a$ is small, there is a large difference in the expected values of $g$ for coefficients on different layers, creating a significant spread. In Appendix J.4, we analyze networks with non-constant widths and show that $g$ depends on the width ratios between layers. Based on this analysis, we identify the two most favorable regimes:

1. Varying-width architectures: When layer widths vary significantly ($n_i \neq n_j$), the diagonal exhibits a wide spread regardless of the initialization scale $a$.

2. Small-variance initialization: As shown above, for near constant-width architectures ($n_i \approx n$ for all $i$), the diagonal exhibits a significant spread when the initialization standard deviation $\sigma_k$ is small relative to $1/\sqrt{n}$, i.e., when $a \ll 1$.

In both cases, the diagonal magnitudes vary widely across parameters, creating favorable conditions for PathCond.

Empirical validation. To validate these predictions, we generate networks with fixed depth (8 layers) and mean width (32 neurons per layer), varying the architectural regularity via a Dirichlet-based sampling method (detailed in Appendix J), from nearly uniform (all layers ≈ 32 neurons) to highly non-uniform configurations. For each network, we compute: (1) the width ratio $\max_i n_i / \min_i n_i$, quantifying architectural non-uniformity, and (2) the infinity norm of the log-rescaling factors $\|\log(\text{rescaling})\|_\infty$, measuring the magnitude of the required rescaling adjustments. We test both small and large variance initializations.
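The closed form of Proposition 5.1 is cheap to evaluate directly, which gives a quick numeric check of the two regimes above (a sketch under the no-bias LFCN assumptions; the function name is ours):

```python
def expected_diag(widths, a, k):
    # E[G_ii] for an edge parameter at layer k of an LFCN without biases,
    # following the closed form of equation (10): widths = [n_0, ..., n_L]
    # and a = n_k * sigma_k^2 (assumed constant across layers).
    L = len(widths) - 1
    lead = widths[L] / widths[k + 1]
    return lead * (a ** (L - 1)
                   + sum(a ** (L - 1 - j) / widths[j] for j in range(k)))

n, L = 512, 6
widths = [n] * (L + 1)
# a = 1 (Kaiming-like scale): the diagonal is nearly constant, 1 + k/n.
flat = [expected_diag(widths, 1.0, k) for k in range(L)]
# a << 1: layer-dependent magnitudes ~ a^(L-k)/n, hence a wide spread.
spread = [expected_diag(widths, 0.1, k) for k in range(L)]
```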
Figure 4 shows a strong correlation between width ratio and log-rescaling magnitude, as well as between small-variance initialization and larger rescaling effects, confirming our theoretical predictions.

Experiment on the MNIST autoencoder. To further validate our characterization of favorable regimes, we consider an autoencoder task where we can systematically vary architectural balance through the compression factor. We follow the experimental setup of Desjardins et al. (2015). The encoder is a fully connected ReLU network with architecture $[784, 784/c_f, 784/c_f^2]$, where the compression factor $c_f \in \{1, 2, 4, 7\}$ (the decoder is symmetric). All weights are initialized using Kaiming uniform initialization. We train for 500 epochs using SGD with a batch size of 128 and learning rate $10^{-3}$ (additional results in Section M). Each experiment is repeated over 3 independent runs.

Results. Figure 5 shows the final training loss as a function of the compression factor. The effect of PathCond is more pronounced for larger compression factors, which produce more unbalanced architectures, consistent with our theoretical predictions. In all cases, PathCond achieves a smaller training loss at convergence.

6. Conclusion

PathCond is a rescaling strategy based on geometric principles and on the path-lifting that aims to accelerate the training of ReLU networks. Our criterion is motivated by the idea that promoting a better-conditioned gradient flow in the network's lifted space provides a principled approach to minimizing the training loss. This work opens several promising directions: adapting the method to adaptive optimizers with momentum such as Adam, and investigating applications to modern architectures, including transformers with self-attention mechanisms.
Acknowledgements

This project was supported by the SHARP project of the PEPR-IA (ANR-23-PEIA-0008, granted by France 2030). We thank the Blaise Pascal Center for its computational support, using the SIDUS solution (Quemener & Corvellec, 2013). All experiments were implemented in Python using PyTorch (Paszke et al., 2019) and NumPy (Harris et al., 2020).

References

Amari, S.-I. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

Armenta, M., Judge, T., Painchaud, N., Skandarani, Y., Lemaire, C., Gibeau Sanchez, G., Spino, P., and Jodoin, P.-M. Neural teleportation. Mathematics, 11(2):480, 2023.

Arora, S., Du, S., Hu, W., Li, Z., and Wang, R. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning (ICML), pp. 322–332. PMLR, 2019.

Azulay, S., Moroshko, E., Nacson, M. S., Woodworth, B. E., Srebro, N., Globerson, A., and Soudry, D. On the implicit bias of initialization shape: Beyond infinitesimal mirror descent. In International Conference on Machine Learning (ICML), volume 139. PMLR, 2021.

Badrinarayanan, V., Mishra, B., and Cipolla, R. Symmetry-invariant optimization in deep networks. arXiv preprint arXiv:1511.01754, 2015.

Bock, A. A. and Andersen, M. S. Connecting Kaporin's condition number and the Bregman log determinant divergence, 2025.

Bona-Pellissier, J., Malgouyres, F., and Bachoc, F. Local identifiability of deep ReLU neural networks: the theory. Volume 35, 2022.

Bowman, B. A brief introduction to the neural tangent kernel. 2023.

Chizat, L., Oyallon, E., and Bach, F. On lazy training in differentiable programming. Volume 32, 2019.

Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. Introduction to Algorithms. MIT Press, 2022.

Desjardins, G., Simonyan, K., Pascanu, R., et al. Natural neural networks.
In Advances in Neural Information Processing Systems (NeurIPS), volume 28, 2015.

Dhillon, I. S. The log-determinant divergence and its applications. In Householder Symposium XVII, Zeuthen, Germany, 2008.

Dinh, L., Pascanu, R., Bengio, S., and Bengio, Y. Sharp minima can generalize for deep nets. In International Conference on Machine Learning (ICML). PMLR, 2017.

Dominé, C. C. J., Anguita, N., Proca, A. M., Braun, L., Kunin, D., Mediano, P. A. M., and Saxe, A. M. From lazy to rich: Exact learning dynamics in deep linear networks. In International Conference on Learning Representations (ICLR), 2025.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021.

Du, S. S., Hu, W., and Lee, J. D. Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. In Advances in Neural Information Processing Systems (NeurIPS), volume 31, 2018.

Gao, B. and Pavel, L. On the properties of the softmax function with application in game theory and reinforcement learning. arXiv preprint, 2017.

Gebhart, T., Saxena, U., and Schrater, P. A unified paths perspective for pruning at initialization. arXiv preprint arXiv:2101.10552, 2021.

Gitman, I. and Ginsburg, B. Comparison of batch normalization and weight normalization algorithms for the large-scale image classification. arXiv preprint arXiv:1709.08145, 2017.

Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256. JMLR Workshop and Conference Proceedings, 2010.
Gonon, A. Harnessing symmetries for modern deep learning challenges: a path-lifting perspective. PhD thesis, École normale supérieure de Lyon - ENS LYON, 2024.

Gonon, A., Brisebarre, N., Riccietti, E., and Gribonval, R. A path-norm toolkit for modern networks: consequences, promises and challenges. arXiv preprint arXiv:2310.01225, 2023.

Gunasekar, S., Lee, J., Soudry, D., and Srebro, N. Characterizing implicit bias in terms of optimization geometry. In International Conference on Machine Learning (ICML). PMLR, 2018.

Harris, C. R. et al. Array programming with NumPy. Nature, 585:357–362, 2020.

He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034, 2015.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint, 2017.

Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems (NeurIPS), volume 31, 2018.

James, W., Stein, C., et al. Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pp. 361–379. University of California Press, 1961.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), volume 25, 2012.

Kulis, B., Sustik, M.
A., and Dhillon, I. S. Low-rank kernel learning with Bregman matrix divergences. Journal of Machine Learning Research (JMLR), 10(2), 2009.

Kunin, D., Sagastuy-Brena, J., Ganguli, S., Yamins, D. L., and Tanaka, H. Neural mechanics: Symmetry and broken conservation laws in deep learning dynamics. In International Conference on Learning Representations (ICLR), 2021.

Kunin, D., Raventós, A., Dominé, C., Chen, F., Klindt, D., Saxe, A., and Ganguli, S. Get rich quick: exact solutions reveal how unbalanced initializations promote rapid feature learning. In Advances in Neural Information Processing Systems (NeurIPS), volume 37, 2024.

Marcotte, S., Gribonval, R., and Peyré, G. Abide by the law and follow the flow: Conservation laws for gradient flows. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, 2023.

Marcotte, S., Peyré, G., and Gribonval, R. Intrinsic training dynamics of deep neural networks. arXiv preprint arXiv:2508.07370, 2025.

Martens, J. New insights and perspectives on the natural gradient method. Journal of Machine Learning Research (JMLR), 21(146):1–76, 2020.

Meng, Q., Zheng, S., Zhang, H., Chen, W., Ma, Z.-M., and Liu, T.-Y. G-SGD: Optimizing ReLU neural networks in its positively scale-invariant space. In International Conference on Learning Representations (ICLR), 2019.

Mustafa, N. and Burkholz, R. Dynamic rescaling for training GNNs. In Advances in Neural Information Processing Systems (NeurIPS), 2024.

Neyshabur, B., Salakhutdinov, R. R., and Srebro, N. Path-SGD: Path-normalized optimization in deep neural networks. Volume 28, 2015.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, volume 32, 2019.

Patil, S. M. and Dovrolis, C.
Phew: Constructing sparse networks that learn fast and generalize well without training data. In International Conference on Machine Learning (ICML), pp. 8432–8442. PMLR, 2021.

Quemener, E. and Corvellec, M. SIDUS—the solution for extreme deduplication of an operating system. Linux Journal, 2013(235):3, 2013.

Sastry, G., Heim, L., Belfield, H., Anderljung, M., Brundage, M., Hazell, J., O'Keefe, C., Hadfield, G. K., Ngo, R., Pilz, K., et al. Computing power and the governance of artificial intelligence. arXiv preprint arXiv:2402.08797, 2024.

Saul, L. K. Weight-balancing fixes and flows for deep learning. Transactions on Machine Learning Research (TMLR), 2023a.

Saul, L. K. Weight-balancing fixes and flows for deep learning. Transactions on Machine Learning Research (TMLR), 2023b. ISSN 2835-8856.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Stock, P. and Gribonval, R. An embedding of ReLU networks and an analysis of their identifiability. Constructive Approximation, 57(2):853–899, 2023.

Stock, P., Graham, B., Gribonval, R., and Jégou, H. Equi-normalization of neural networks, 2019.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.

Verbockhaven, M., Rudkiewicz, T., Chevallier, S., and Charpiat, G. Growing tiny networks: Spotting expressivity bottlenecks and fixing them optimally.
Transactions on Machine Learning Research (TMLR), 2024. ISSN 2835-8856.

Von Neumann, J. Thermodynamik quantenmechanischer Gesamtheiten. Nachrichten von der Gesellschaft der Wissenschaften zu Göttingen, Mathematisch-Physikalische Klasse, 1927:273–291, 1927.

Zhao, B., Dehmamy, N., Walters, R., and Yu, R. Symmetry teleportation for accelerated optimization. Volume 35, 2022.

Zhao, B., Ganev, I., Walters, R., Yu, R., and Dehmamy, N. Symmetries, flat minima, and the conserved quantities of gradient flow. In International Conference on Learning Representations (ICLR), 2023.

A. Properties of the path-lifting

We have the following properties of the path-lifting.

Lemma A.1. Let $D \in \mathcal{D}$. Then $\partial\Phi(D\theta) = \partial\Phi(\theta) D^{-1}$.

Proof. Using the definition of the path-lifting, $\Phi(\theta) := (\Phi_p(\theta))_p$ with $\Phi_p(\theta) := \prod_{i \in p} \theta_i$, we have
$$\frac{\partial \Phi_p(\theta)}{\partial \theta_i} = \begin{cases} \prod_{k \in p,\, k \neq i} \theta_k & \text{if } i \in p, \\ 0 & \text{otherwise.} \end{cases} \quad (11)$$
Given any $D \in \mathcal{D}$, we have for all $\theta \in \mathbb{R}^p$, $\Phi(D\theta) = \Phi(\theta)$ (Gonon, 2024, Theorem 2.4.1). This means the functions $\theta \mapsto \Phi(\theta)$ and $\theta \mapsto \Phi(D\theta)$ are identical and share the same Jacobian. Applying the chain rule yields
$$\partial\Phi(\theta) = \partial\Phi(D\theta)\, D. \quad (12)$$
The conclusion follows by multiplying on the right by $D^{-1}$, which exists by Theorem D.3.

B. Gradient in $\Phi$-space

As described in the main text, if $\theta(t)$ follows the gradient flow $\dot{\theta} = -\nabla L(\theta)$ of (1) then $z(t) := \Phi(\theta(t))$ satisfies $\dot{z} = \partial_t \Phi(\theta) = -P_\theta \nabla_\Phi \ell(\Phi(\theta))$ with $P_\theta = \partial\Phi(\theta)\, \partial\Phi(\theta)^\top$. As an alternative to the proposed PathCond approach to "align" $P_\theta \nabla_\Phi \ell(\Phi(\theta))$ with $\nabla_\Phi \ell(\Phi(\theta))$, one could envision a "natural gradient" version along the following route (Amari, 1998; Verbockhaven et al., 2024). Considering any $p \times p$ pre-conditioning matrix $M(\theta)$, it is possible to replace (1) with
$$\dot{\theta} = -M(\theta)\, \nabla L(\theta). \quad (13)$$
With the same reasoning based on the chain rule, if the trajectory $\theta(t)$ is governed by such an ODE then $z(t)$ satisfies the modified ODE
$$\dot{z} = \partial\Phi(\theta)\, \dot{\theta} = -\partial\Phi(\theta)\, M(\theta)\, \nabla L(\theta) = -\partial\Phi(\theta)\, M(\theta)\, \partial\Phi(\theta)^\top \nabla_\Phi \ell(\Phi(\theta)). \quad (14)$$
The best alignment with $\nabla_\Phi \ell(\Phi(\theta))$ is achieved with the pseudo-inverse $M(\theta) := [\partial\Phi(\theta)^\top \partial\Phi(\theta)]^+$, leading to
$$\dot{z} = -Q_\theta \nabla_\Phi \ell(\Phi(\theta)), \quad (15)$$
with $Q_\theta$ the $q \times q$ matrix projecting orthogonally onto $\mathrm{range}(\partial\Phi(\theta))$. While this approach leads to the maximum alignment, it involves the computation of the pseudo-inverse of a large $p \times p$ matrix, which is a strong computational bottleneck that PathCond circumvents.

C. Equivalence between different rescaling criteria on toy examples

First example. Consider a minimal architecture with two weights $\theta = (u, v)$, $f_\theta(x) := uvx$, and a loss $L(\theta) = \ell(\Phi(\theta))$ where $\Phi(\theta) := uv$. Straightforward calculus yields
$$\nabla\Phi(\theta) = \begin{pmatrix} v \\ u \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \theta \quad \text{and} \quad \nabla^2\Phi(\theta) = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix},$$
$$\nabla L(\theta) = \ell'(\Phi(\theta)) \cdot \nabla\Phi(\theta),$$
$$\nabla^2 L(\theta) = \underbrace{\ell''(\Phi(\theta))}_{=:\,a} \cdot \nabla\Phi(\theta)(\nabla\Phi(\theta))^\top + \underbrace{\ell'(\Phi(\theta))}_{=:\,b} \cdot \nabla^2\Phi(\theta) = a \begin{pmatrix} v^2 & uv \\ uv & u^2 \end{pmatrix} + b \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} = \begin{pmatrix} a v^2 & a\Phi(\theta) + b \\ a\Phi(\theta) + b & a u^2 \end{pmatrix}.$$
Since $\Phi$ is rescaling-invariant, it follows that for any norm $\|\cdot\|$
$$\arg\min_{\theta' \sim \theta} \|\nabla L(\theta')\| = \arg\min_{\theta' \sim \theta} \|\nabla\Phi(\theta')\| = \arg\min_{\theta' \sim \theta} \|\theta'\|, \quad (16)$$
and for the special case of the Euclidean norm
$$\arg\min_{\theta' \sim \theta} \mathrm{tr}(\nabla^2 L(\theta')) = \arg\min_{\theta' \sim \theta} \|\theta'\|_2^2. \quad (17)$$
We now turn to the behavior of the condition number $\kappa(\nabla^2 L(\theta))$ and the operator norm $\|\nabla^2 L(\theta)\|_{2\to 2}$ by considering the spectrum of $\nabla^2 L(\theta)$, which consists of two eigenvalues $\lambda_- \le \lambda_+$ such that
$$\lambda_+ \lambda_- = \det(\nabla^2 L(\theta)) = a^2 (uv)^2 - (a\Phi(\theta) + b)^2 = (a\Phi(\theta))^2 - (a\Phi(\theta) + b)^2 = -b(2a\Phi(\theta) + b) =: c,$$
$$\lambda_+ + \lambda_- = \mathrm{tr}(\nabla^2 L(\theta)) = a(u^2 + v^2) = a\|\theta\|_2^2.$$
Assuming that $a \ge 0$ (this is the case, for example, when $\ell$ is the quadratic loss), the sum $\lambda_+ + \lambda_-$ is thus non-negative, hence $\lambda_+ \ge |\lambda_-|$ and therefore $\|\nabla^2 L(\theta)\|_{2\to 2} = \lambda_+$. We also obtain that $\kappa(\nabla^2 L(\theta)) = \lambda_+ / |\lambda_-|$. Notice that each of these quantities can be expressed as $g(\lambda_+, |\lambda_-|)$ where $g$ increases with its first argument and decreases with its second one. We now investigate how they vary with $\|\theta'\|_2^2$ when $\theta' \sim \theta$. Since $a, b, c$ can be expressed as functions of $\Phi(\theta)$, they are unchanged if we replace $\theta$ by a rescaling-equivalent parameter $\theta' \sim \theta$. When varying such a $\theta'$, an increase in $\|\theta'\|_2^2$ increases the sum $\lambda_+ + \lambda_-$ while leaving the product $\lambda_+ \lambda_-$ constant. Let us distinguish two cases:

• Case $c \ge 0$: both eigenvalues have the same sign, and since their sum $a\|\theta'\|_2^2$ is positive, they are both positive. Straightforward calculus shows that increasing their sum while keeping their product constant increases the largest one $\lambda_+$ while decreasing the smallest one $\lambda_- = |\lambda_-|$. Therefore, with $g$ as above, the value $g(\lambda_+, |\lambda_-|)$ increases with $\|\theta'\|_2^2$.

• Case $c < 0$: here, since we have seen that $\lambda_+$ remains positive, we must have $\lambda_- = -|\lambda_-|$. The same type of reasoning shows that we reach the same conclusion.

Overall, it follows that
$$\arg\min_{\theta' \sim \theta} \kappa(\nabla^2 L(\theta')) = \arg\min_{\theta' \sim \theta} \|\nabla^2 L(\theta')\|_{2\to 2} = \arg\min_{\theta' \sim \theta} \|\theta'\|_2. \quad (18)$$

Second example. The above example is a particular case of a deeper toy example where $\theta = (u_1, \ldots, u_L)$ and $\Phi(\theta) = u_1 \cdots u_L$.
Similar computations (we skip the details) show that, when all entries $u_i$ are nonzero, denoting $\theta^{-1} := (u_i^{-1})_{i=1}^L$,
$$\nabla\Phi(\theta) = (\Phi(\theta)/u_i)_{i=1}^L = \Phi(\theta) \cdot \theta^{-1},$$
$$\nabla L(\theta) = \ell'(\Phi(\theta)) \cdot \nabla\Phi(\theta) = \underbrace{\ell'(\Phi(\theta)) \cdot \Phi(\theta)}_{=:\,a(\Phi(\theta))} \cdot \theta^{-1},$$
$$\nabla^2 L(\theta) = a(\Phi(\theta)) \cdot \mathrm{offdiag}(\theta^{-1}(\theta^{-1})^\top) + (\Phi(\theta))^2 \ell''(\Phi(\theta)) \cdot \theta^{-1}(\theta^{-1})^\top.$$
We deduce that $\mathrm{tr}(\nabla^2 L(\theta)) = (\Phi(\theta))^2 \ell''(\Phi(\theta)) \|\theta^{-1}\|_2^2$. Overall, we obtain with the same reasoning that
$$\arg\min_{\theta' \sim \theta} \|\nabla L(\theta')\| = \arg\min_{\theta' \sim \theta} \mathrm{tr}(\nabla^2 L(\theta')) = \arg\min_{\theta' \sim \theta} \|\theta'^{-1}\|_2. \quad (19)$$

D. Characterization of the rescaling symmetry

Let $G = (V, E)$ be a fixed directed acyclic graph (DAG), representing a fixed network architecture, with vertices $V$ called neurons and edges $E$. For a neuron $v \in V$, we define the sets of antecedents and successors as
$$\mathrm{ant}(v) := \{u \in V \mid (u, v) \in E\}, \qquad \mathrm{suc}(v) := \{u \in V \mid (v, u) \in E\}. \quad (20)$$
Neurons with no antecedents (resp. no successors) are called input (resp. output) neurons, and their sets are denoted $N_{\mathrm{in}} := \{v \in V \mid \mathrm{ant}(v) = \emptyset\}$ and $N_{\mathrm{out}} := \{v \in V \mid \mathrm{suc}(v) = \emptyset\}$, respectively. The input and output dimensions are $d_{\mathrm{in}} := |N_{\mathrm{in}}|$ and $d_{\mathrm{out}} := |N_{\mathrm{out}}|$. Neurons in $\mathcal{H} := V \setminus (N_{\mathrm{in}} \cup N_{\mathrm{out}})$ are called hidden neurons, and we denote their cardinality by $H := |\mathcal{H}|$. Note that for any hidden neuron $v \in \mathcal{H}$, we have $\mathrm{ant}(v) \neq \emptyset$ and $\mathrm{suc}(v) \neq \emptyset$, i.e., there exist both incoming and outgoing edges (Gonon, 2024, Definition 2.2.2).

To parametrize this graph in the context of neural networks, we perform a topological ordering (Cormen et al., 2022) and assign indices to all neurons and edges. All parameters (weights along edges and biases on neurons) can then be arranged into a vector $\theta \in \mathbb{R}^p$, where $p$ is the total number of parameters. As the architecture is fixed, this vector fully represents our network.
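Before stating the formal definitions, the rescaling symmetry can be illustrated numerically on a minimal DAG with two inputs, one hidden neuron, and one output (a toy sketch; the names and specific weights are ours):

```python
# Toy DAG without biases: inputs x1, x2 -> hidden h -> output o.
# theta = (w1, w2, v): w1, w2 are the incoming weights of h, v its outgoing one.
# The two input-output paths give Phi(theta) = (w1 * v, w2 * v).
def path_lifting(theta):
    w1, w2, v = theta
    return (w1 * v, w2 * v)

def rescale(theta, lam):
    # Neuron-wise rescaling of h: incoming weights times lam, outgoing over lam.
    w1, w2, v = theta
    return (lam * w1, lam * w2, v / lam)

theta = (0.5, -1.2, 2.0)
for lam in (0.1, 3.0, 7.5):
    before, after = path_lifting(theta), path_lifting(rescale(theta, lam))
    # Phi is invariant under the rescaling, up to floating-point error.
    assert all(abs(x - y) < 1e-9 for x, y in zip(before, after))
```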
Definition D.1 (Neuron-wise rescaling symmetry). Let $v \in \mathcal{H}$ be a hidden neuron and $\lambda > 0$ a scaling factor. We define the sets of incoming and outgoing parameter indices associated to $v$ as
$$\mathrm{in}_v := \{\mathrm{index}(b_v)\} \cup \{\mathrm{index}(\theta_{u \to v}) \mid u \in \mathrm{ant}(v)\} \quad (21)$$
and
$$\mathrm{out}_v := \{\mathrm{index}(\theta_{v \to u}) \mid u \in \mathrm{suc}(v)\}, \quad (22)$$
where $b_v$ denotes the bias of neuron $v$, and $\theta_{u \to v}$ (resp. $\theta_{v \to u}$) denotes the weight of edge $(u, v) \in E$ (resp. $(v, u) \in E$) (Gonon, 2024, Definition 2.2.2). The diagonal matrix $D_{\lambda,v} \in \mathbb{R}^{p \times p}$ associated to the neuron-wise rescaling symmetry (Stock & Gribonval, 2023, Def. 2.3) is defined by its diagonal entries:
$$(D_{\lambda,v})_{ii} = \begin{cases} \lambda & \text{if } i \in \mathrm{in}_v, \\ 1/\lambda & \text{if } i \in \mathrm{out}_v, \\ 1 & \text{otherwise.} \end{cases} \quad (23)$$
This matrix implements a rescaling of neuron $v$ by factor $\lambda$: incoming weights and bias are multiplied by $\lambda$, while outgoing weights are divided by $\lambda$.

We can then define the following set:

Definition D.2 (Rescaling symmetry group). We denote by $\mathcal{D}$ the subgroup generated by the neuron-wise rescaling symmetries (Stock & Gribonval, 2023, Def. 2.3):
$$\mathcal{D} = \mathrm{gr}\{D_{\lambda,v} \mid v \in \mathcal{H},\ \lambda \in \mathbb{R}^*_+\}. \quad (24)$$

Lemma D.3. The rescaling symmetry group admits the following explicit characterization:
$$\mathcal{D} = \left\{ \prod_{v \in \mathcal{H}} D_{\lambda_v, v} \;\middle|\; \lambda \in (\mathbb{R}^*_+)^H \right\}. \quad (25)$$
The order of the product does not matter since diagonal matrices commute.

Proof. Let $\mathcal{A} := \{\prod_{v \in \mathcal{H}} D_{\lambda_v,v} \mid \lambda \in (\mathbb{R}^*_+)^H\}$. Since diagonal matrices commute and $D_{\lambda,v}^{-1} = D_{1/\lambda,v}$, it is straightforward to show that $\mathcal{A}$ is stable under products and inversion, and is therefore a subgroup of the group of strictly positive diagonal matrices. Let us show that it contains all $D_{\lambda,v}$. Indeed, given any $v \in \mathcal{H}$ and $\lambda > 0$, choosing $\lambda_v = \lambda$ and $\lambda_{v'} = 1$ for all $v' \in \mathcal{H} \setminus \{v\}$ yields $\prod_{v' \in \mathcal{H}} D_{\lambda_{v'}, v'} = D_{\lambda,v}$. By minimality of the generated subgroup, we therefore have $\mathcal{D} \subseteq \mathcal{A}$. Conversely, every element of $\mathcal{A}$ is a product of elements of $\mathcal{D}$, hence an element of $\mathcal{D}$ by the group property.
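The group structure can be checked concretely on a hypothetical chain network (our own minimal example with two hidden neurons and no biases): the generators commute, invert each other, and compose into a single diagonal rescaling that leaves the path-lifting unchanged.

```python
import numpy as np

# Chain in -> h1 -> h2 -> out with edge weights theta = (t1, t2, t3), so the
# single path gives Phi(theta) = t1 * t2 * t3 (toy illustration, our naming).
def D(lam, neuron):
    # Diagonal of the neuron-wise rescaling D_{lam,v} for hidden neuron 0 or 1:
    # incoming edge times lam, outgoing edge over lam.
    d = np.ones(3)
    d[neuron] *= lam
    d[neuron + 1] /= lam
    return d

# Generators commute (diagonal matrices), as used in Lemma D.3.
assert np.allclose(D(2.0, 0) * D(0.5, 1), D(0.5, 1) * D(2.0, 0))
# D_{1/lam,v} inverts D_{lam,v}.
assert np.allclose(D(3.0, 0) * D(1 / 3.0, 0), np.ones(3))
# Any product of generators leaves Phi invariant: Phi(D * theta) = Phi(theta).
theta = np.array([0.7, -2.0, 1.3])
d = D(2.0, 0) * D(0.5, 1)
assert np.isclose(np.prod(d * theta), np.prod(theta))
```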
For the next definitions, we recall some notation from Gonon et al. (2023).

Definition D.4 (Path in a DAG). (Gonon et al., 2023, Definition A.1) A path of a DAG $G = (V, E)$ is any sequence of neurons $v_0, \ldots, v_d$ such that each $v_i \to v_{i+1}$ is an edge. Such a path is denoted $p = v_0 \to \ldots \to v_d$. The length of a path $p = v_0 \to \ldots \to v_d$ is $\mathrm{length}(p) = d$ (the number of edges).

Definition D.5 (Path lifting). (Gonon et al., 2023, Definition A.2) For a path $p \in \mathcal{P}$ with $p = v_0 \to \ldots \to v_d$, the path lifting is defined as
$$\Phi_p(\theta) := \begin{cases} \displaystyle\prod_{\ell=1}^{\mathrm{length}(p)} \theta_{v_{\ell-1} \to v_\ell} & \text{if } v_0 \in N_{\mathrm{in}}, \\[2ex] b_{v_0} \displaystyle\prod_{\ell=1}^{\mathrm{length}(p)} \theta_{v_{\ell-1} \to v_\ell} & \text{otherwise,} \end{cases} \quad (26)$$
where an empty product equals 1 by convention. The path-lifting $\Phi(\theta)$ of $\theta$ over the entire graph is
$$\Phi(\theta) := (\Phi_p(\theta))_{p \in \mathcal{P}}. \quad (27)$$
By a slight abuse of notation, we extend the path-lifting to diagonal matrices by applying it to their diagonal entries. For $D \in \mathbb{R}^{p \times p}$ diagonal, we define
$$\Phi(D) := \Phi\left((D_{ii})_{i=1,\ldots,p}\right). \quad (28)$$

Definition D.6 (Admissible rescaling matrices). We denote by $\Phi^{-1}(\mathbf{1})_+$ the set of diagonal matrices with strictly positive entries whose path-lifting equals the all-ones vector:
$$\Phi^{-1}(\mathbf{1})_+ := \left\{ D \in \mathbb{R}^{p \times p} \text{ diagonal} \mid \Phi(D) = \mathbf{1} \text{ and } D_{ii} > 0 \text{ for all } i \right\}. \quad (29)$$

Lemma D.7. The set of admissible rescaling matrices coincides with $\Phi^{-1}(\mathbf{1})_+$:
$$\mathcal{D} = \Phi^{-1}(\mathbf{1})_+. \quad (30)$$

Proof. We prove both inclusions.

• $\mathcal{D} \subseteq \Phi^{-1}(\mathbf{1})_+$. Let $D \in \mathcal{D}$. By Theorem D.3, there exists $\lambda \in (\mathbb{R}^*_+)^H$ such that
$$D = \prod_{v \in \mathcal{H}} D_{\lambda_v, v}. \quad (31)$$
Since $\lambda \in (\mathbb{R}^*_+)^H$, we have $D_{ii} > 0$ for all $i$. It remains to show that $\Phi(D) = \mathbf{1}$. By Gonon (2024, Theorem 2.4.1), the rescaling symmetry property ensures that for any $\theta \in \mathbb{R}^p$,
$$\Phi(D\theta) = \Phi(\theta). \quad (32)$$
Applying this property with $\theta = \mathbf{1}$, we obtain
$$\Phi(D\mathbf{1}) = \Phi(\mathbf{1}) = \mathbf{1}, \quad (33)$$
where the last equality holds because $\Phi_p(\mathbf{1}) = \prod_{i \in p} 1 = 1$ for all paths $p \in \mathcal{P}$.
Now, observe that $(D\mathbf{1})_i = D_{ii}$ for all $i$. Therefore,
$$\Phi(D) = \Phi(D\mathbf{1}) = \mathbf{1}. \quad (34)$$
This shows that $D \in \Phi^{-1}(\mathbf{1})_+$, hence $\mathcal{D} \subseteq \Phi^{-1}(\mathbf{1})_+$.

• $\Phi^{-1}(\mathbf{1})_+ \subseteq \mathcal{D}$. Let $D \in \Phi^{-1}(\mathbf{1})_+$. We can write $D = \mathrm{diag}(d)$ for some $d \in (\mathbb{R}^*_+)^p$. We want to show that $D \in \mathcal{D}$, which by Theorem D.3 means showing that $D$ can be written as a product of neuron-wise rescaling matrices:
$$D = \prod_{v \in \mathcal{H}} D_{\lambda_v, v} \quad (35)$$
for some $\lambda \in (\mathbb{R}^*_+)^H$.

Notation. To simplify the exposition, for any neuron $v \in V$ and edge $(u, v) \in E$, we denote
$$d_v := d_{\mathrm{index}(b_v)} \quad \text{and} \quad d_{u \to v} := d_{\mathrm{index}(\theta_{u \to v})}, \quad (36)$$
i.e., $d_v$ is the rescaling factor of the bias of neuron $v$, and $d_{u \to v}$ is the rescaling factor of the weight on edge $(u, v)$.

Characterization of $\mathcal{D}$. Recall from Theorem D.1 that the matrix $D_{\lambda_v, v}$ multiplies the bias and all incoming weights of neuron $v$ by $\lambda_v$, while dividing all outgoing weights by $\lambda_v$. Therefore, a diagonal matrix $D$ belongs to $\mathcal{D}$ if and only if there exist rescaling factors $\lambda_v > 0$ for each $v \in \mathcal{H}$ such that:

(i) For each hidden neuron $v \in \mathcal{H}$, the bias rescaling factor satisfies
$$d_v = \lambda_v. \quad (37)$$

(ii) For each edge $(u, v) \in E$, the weight rescaling factor satisfies
$$d_{u \to v} = \frac{\lambda_v}{\lambda_u}, \quad (38)$$
where we adopt the convention $\lambda_u = 1$ for $u \in N_{\mathrm{in}}$ and $\lambda_v = 1$ for $v \in N_{\mathrm{out}}$.

Indeed, condition (i) ensures that each bias is rescaled by the appropriate factor, while condition (ii) captures the combined effect on edges: as an outgoing edge from $u$, the weight is divided by $\lambda_u$; as an incoming edge to $v$, it is multiplied by $\lambda_v$.

Proof strategy. Given $D \in \Phi^{-1}(\mathbf{1})_+$, we define for all hidden neurons $v \in \mathcal{H}$:
$$\lambda_v := d_v > 0. \quad (39)$$
This definition automatically satisfies condition (i). To show that $D \in \mathcal{D}$, it therefore suffices to verify condition (ii), namely that for any edge $(u, v) \in E$:
$$d_{u \to v} = \frac{\lambda_v}{\lambda_u}. \quad (40)$$
We distinguish three cases based on the nature of the neurons $u$ and $v$. Note that we do not need to consider the case where $u \in N_{\mathrm{in}}$ and $v \in N_{\mathrm{out}}$ simultaneously, as such edges would bypass all hidden neurons and involve no rescaling from $\mathcal{D}$.

Case 1: $v \in N_{\mathrm{out}}$. Consider the path $p = (u \to v)$. By hypothesis $\Phi(D) = \mathbf{1}$, so
$$\Phi_p(D) = d_u \cdot d_{u \to v} = 1. \quad (41)$$
Since $d_u = \lambda_u$, we obtain
$$d_{u \to v} = \frac{1}{\lambda_u}. \quad (42)$$

Case 2: $u \in N_{\mathrm{in}}$. Consider a path $p$ starting from input neuron $u$ and passing through the edge $(u, v)$. We can write $p = (u \to v \to w_1 \to \cdots \to w_k)$ where $w_k \in N_{\mathrm{out}}$. Consider also the path $p' = (v \to w_1 \to \cdots \to w_k)$, which coincides with $p$ after vertex $v$. By hypothesis $\Phi(D) = \mathbf{1}$, we have
$$\Phi_p(D) = d_{u \to v} \prod_{\ell=1}^{k} d_{w_{\ell-1} \to w_\ell} = 1, \quad (43)$$
where $w_0 := v$, and
$$\Phi_{p'}(D) = d_v \prod_{\ell=1}^{k} d_{w_{\ell-1} \to w_\ell} = 1. \quad (44)$$
Dividing the first equation by the second yields
$$d_{u \to v} = d_v = \lambda_v. \quad (45)$$

Case 3: $u, v \in \mathcal{H}$. Let $(u, v) \in E$ be an edge between two hidden neurons. Consider any path $p$ starting from neuron $u$ and passing through the edge $(u, v)$. We can write $p = (u \to v \to w_1 \to \cdots \to w_k)$ where $w_k \in N_{\mathrm{out}}$. Consider also the path $p' = (v \to w_1 \to \cdots \to w_k)$, which coincides with $p$ after vertex $v$. Since both $u$ and $v$ are hidden neurons, both paths start from their respective biases. By hypothesis $\Phi(D) = \mathbf{1}$, we have
$$\Phi_p(D) = d_u \cdot d_{u \to v} \prod_{\ell=1}^{k} d_{w_{\ell-1} \to w_\ell} = 1, \quad (46)$$
where $w_0 := v$, and
$$\Phi_{p'}(D) = d_v \prod_{\ell=1}^{k} d_{w_{\ell-1} \to w_\ell} = 1. \quad (47)$$
Dividing the first equation by the second yields
$$d_u \cdot d_{u \to v} = d_v. \quad (48)$$
Since $d_v = \lambda_v$ and $d_u = \lambda_u$, we obtain
$$d_{u \to v} = \frac{\lambda_v}{\lambda_u}. \quad (49)$$
In all cases, equation (40) holds, which shows that $D = \prod_{v \in \mathcal{H}} D_{\lambda_v, v} \in \mathcal{D}$. Therefore, $\Phi^{-1}(\mathbf{1})_+ \subseteq \mathcal{D}$.

E. Bregman matrix divergences

We first formally introduce Bregman matrix divergences. Let $S^q$ be the space of $q \times q$ symmetric matrices over $\mathbb{R}$.
Given a strictly convex and differentiable function $\zeta : S_q \to \mathbb{R}$, the Bregman divergence between matrices $X$ and $Y$ is defined as
$$d_\zeta(X \,\|\, Y) = \zeta(X) - \zeta(Y) - \mathrm{tr}\big(\nabla\zeta(Y)^\top (X - Y)\big). \quad (50)$$
The most well-known example is built from the squared Frobenius norm $\zeta(X) = \|X\|_F^2$, which yields $d_\zeta(X\|Y) = \|X - Y\|_F^2$. The logdet divergence corresponds to $\zeta(X) = -\log\det(X)$ (James et al., 1961), and the von Neumann divergence to $\zeta(X) = \mathrm{tr}(X\log X - X)$ (Von Neumann, 1927).

Within the class of Bregman matrix divergences, spectral divergences are particularly important: they take the form $\zeta(X) = \varphi \circ \lambda(X)$, where $\lambda$ denotes the vector of eigenvalues of $X$ in decreasing order and $\varphi$ is a strictly convex real-valued function on vectors (Kulis et al., 2009). The aforementioned examples are all spectral divergences: $\varphi(x) = \sum_i |x_i|^2$ for the Frobenius norm, $\varphi(x) = -\sum_i \log(x_i)$ for the logdet divergence, and $\varphi(x) = \sum_i (x_i\log x_i - x_i)$ for the von Neumann divergence. Spectral divergences with $\varphi(x) = \sum_i \psi(x_i)$ for some $\psi : \mathbb{R} \to \mathbb{R}$ admit the expression (Kulis et al., 2009, Lemma 1)
$$d_\zeta(X\|Y) = \sum_{ij} (u_i^\top v_j)^2 \big(\psi(\lambda_i(X)) - \psi(\lambda_j(Y)) - (\lambda_i(X) - \lambda_j(Y))\,\nabla\psi(\lambda_j(Y))\big), \quad (51)$$
where $u_i$ (resp. $v_j$) is an eigenvector associated to the $i$-th largest eigenvalue $\lambda_i(X)$ of $X$ (resp. $Y$). Hence, $d_\zeta(X\|Y)$ can be written as $d_\zeta(X\|Y) = \sum_{ij} (u_i^\top v_j)^2\,\mathrm{div}_\psi(\lambda_i(X)\|\lambda_j(Y))$, where $\mathrm{div}_\psi$ is the Bregman divergence associated to $\psi$.

Definition E.1. Consider $\zeta = \varphi \circ \lambda$ with $\varphi(x) = \sum_i \psi(x_i)$ inducing a spectral divergence as described above. We define the $\zeta_+$ divergence as the divergence that only considers the nonzero eigenvalues of the matrices. In other words,
$$d_{\zeta_+}(X\|Y) := \sum_{\substack{i:\, \lambda_i(X) \neq 0 \\ j:\, \lambda_j(Y) \neq 0}} (u_i^\top v_j)^2\,\mathrm{div}_\psi(\lambda_i(X)\|\lambda_j(Y)). \quad (52)$$
We have the following lemma.

Lemma E.2.
Consider a spectral divergence on $S_q$ with $\zeta(X) = \varphi \circ \lambda(X)$, where $\varphi$ can be written as $\varphi(x) = \sum_i \psi(x_i)$ for some strictly convex and differentiable function $\psi : \mathbb{R} \to \mathbb{R}$, and $\zeta_+$ as defined in (52). Then, for any $A \in \mathbb{R}^{q\times p}$,
$$d_{\zeta_+}(AA^\top \,\|\, I_q) = d_{\zeta_+}(A^\top A \,\|\, I_p). \quad (53)$$

Proof. When $Y = I_q$, $d_{\zeta_+}(X\|Y)$ becomes $\sum_{i:\,\lambda_i(X)\neq 0} \|u_i\|_2^2\,\mathrm{div}_\psi(\lambda_i(X)\|1) = \sum_{i:\,\lambda_i(X)\neq 0} \mathrm{div}_\psi(\lambda_i(X)\|1)$, which only depends on the eigenvalues of $X$. Using that the nonzero eigenvalues of $AA^\top$ and $A^\top A$ are the same concludes.

Applying this result to the logdet divergence, we have the following.

Lemma E.3. Take $\zeta(X) = -\log\det(X)$. Then, for any $A \in \mathbb{R}^{q\times p}$,
$$d_{\zeta_+}(AA^\top\|I_q) = \mathrm{tr}(A^\top A) - \sum_{i:\,\lambda_i(A^\top A)\neq 0} \log\lambda_i(A^\top A) - \mathrm{rank}(A) = \mathrm{tr}(A^\top A) - \log{\det}_+(A^\top A) - \mathrm{rank}(A), \quad (54)$$
where $\det_+(X)$ is the generalized determinant, that is, the product of the nonzero eigenvalues of $X$.

Proof. Based on the proof of Equation (52), we have $d_{\zeta_+}(AA^\top\|I_q) = \sum_{i:\,\lambda_i(A^\top A)\neq 0} \mathrm{div}_\psi(\lambda_i(A^\top A)\|1)$, where $\psi = -\log$. In this case, $\mathrm{div}_\psi(x\|y) = -\log(x) + \log(y) + \frac{1}{y}(x-y) = -\log\big(\frac{x}{y}\big) + \frac{x}{y} - 1$. Using this formula, $\mathrm{rank}(A^\top A) = \mathrm{rank}(A)$, and $\sum_{i:\,\lambda_i(A^\top A)\neq 0}\lambda_i(A^\top A) = \mathrm{tr}(A^\top A)$ gives the result.

F. PathCond optimization problem

We detail here the definition of the PathCond optimization problem. We recall that $P_\theta = \partial\Phi(\theta)\,\partial\Phi(\theta)^\top \in \mathbb{R}^{q\times q}$ is the path-kernel. We define $G = G_\theta = \partial\Phi(\theta)^\top\partial\Phi(\theta) \in \mathbb{R}^{p\times p}$ and consider $\zeta(X) = -\log\det(X)$, with $d_\zeta$, $d_{\zeta_+}$ the associated spectral divergence and its version that considers only nonzero eigenvalues (see Definition E.1). Furthermore, we place ourselves in the regime $q \geq p$, which is the case for most DAG ReLU networks, except very small ones.

Proposition F.1.
Denote $r = \mathrm{rank}(\partial\Phi(\theta))$ and let $\partial\Phi(\theta) = U\Sigma V^\top$ be a compact SVD of $\partial\Phi(\theta)$, i.e., $U \in \mathbb{R}^{q\times r}$, $V \in \mathbb{R}^{p\times r}$, $U^\top U = V^\top V = I_r$, $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_r) \succ 0$. For any admissible $D = \mathrm{diag}(d_1, \dots, d_p) \in \mathcal{D}$ and $\alpha > 0$, we have
$$d_{\zeta_+}(\alpha P_{D\theta}\|I_q) = d_{\zeta_+}(\alpha G_{D\theta}\|I_p) = \alpha\sum_{i=1}^{p} d_i^{-2} G_{ii} - r\log(\alpha) - \log\det(\Sigma^2) - \log\det(V^\top D^{-2} V) - r. \quad (55)$$
When $r = p$, the matrix $G_{D\theta}$ is positive definite for any $D \in \mathcal{D}$ and
$$d_{\zeta_+}(\alpha P_{D\theta}\|I_q) = d_{\zeta_+}(\alpha G_{D\theta}\|I_p) = d_\zeta(\alpha G_{D\theta}\|I_p). \quad (56)$$
Moreover, the minimization over $\alpha > 0$, $D \in \mathcal{D}$ of these quantities is equivalent to the problem
$$\min_{D'\in\mathcal{D}}\; p\log\Big(\sum_{i=1}^{p} d'_i G_{ii}\Big) - \sum_{i=1}^{p}\log(d'_i). \quad (57)$$
Precisely, if $D'$ solves (57), then $D := D'^{-1/2}$ and $\alpha = p\,\big/\sum_i d_i^{-2}G_{ii}$ minimize the previous quantities.

Proof. First, recall that $\det_+$ is the product of the nonzero eigenvalues, and that from Lemma A.1, $\partial\Phi(D\theta) = \partial\Phi(\theta)D^{-1}$. Using this property, Lemma E.3 with $A = \sqrt{\alpha}\,\partial\Phi(\theta)D^{-1}$, and the cyclic property of the trace, we have
$$d_{\zeta_+}(\alpha P_{D\theta}\|I_q) = d_{\zeta_+}\big(\alpha\,\partial\Phi(\theta)D^{-1}(\partial\Phi(\theta)D^{-1})^\top\,\big\|\,I_q\big) = d_{\zeta_+}(AA^\top\|I_q) = \mathrm{tr}(A^\top A) - \log{\det}_+(A^\top A) - \mathrm{rank}(A) = \alpha\,\mathrm{tr}(D^{-2}G) - r\log(\alpha) - \log{\det}_+(D^{-1}GD^{-1}) - r. \quad (58)$$
Using the compact singular value decomposition of $\partial\Phi(\theta)$, we have $D^{-1}GD^{-1} = D^{-1}V\Sigma^2 V^\top D^{-1}$. Since $V$ has orthogonal columns and $D$ is invertible, the nonzero eigenvalues of $D^{-1}V\Sigma^2 V^\top D^{-1}$ are the same as the nonzero eigenvalues of $\Sigma^2 V^\top D^{-2}V$. Indeed, if $\lambda$ is such an eigenvalue, then $D^{-1}V\Sigma^2 V^\top D^{-1}x = \lambda x \implies V\Sigma^2 V^\top D^{-1}x = \lambda Dx$. Thus $Dx \in \mathrm{Im}(V)$ and there is $y$ such that $Dx = Vy$, that is, $x = D^{-1}Vy$.
Thus, $V\Sigma^2 V^\top D^{-1}\cdot D^{-1}Vy = \lambda D\cdot D^{-1}Vy$, that is, $V\Sigma^2 V^\top D^{-2}Vy = \lambda Vy$; left-multiplying by $V^\top$ gives that $\lambda$ is an eigenvalue of $\Sigma^2 V^\top D^{-2}V$. The converse is straightforward. Using the definition of $\det_+$, we get
$$\det{}_+(D^{-1}GD^{-1}) = \det{}_+(\Sigma^2 V^\top D^{-2}V) = \det(\Sigma^2 V^\top D^{-2}V) = \det(\Sigma^2)\det(V^\top D^{-2}V), \quad (59)$$
where we used that $\Sigma^2(V^\top D^{-2}V)$ is invertible, since both $\Sigma$ and $V^\top D^{-2}V$ are invertible (the latter is positive definite because $D \succ 0$). This gives the first formula (55). Consequently, for any diagonal matrix $D$ we have
$$\min_{\alpha>0} d_{\zeta_+}(\alpha P_{D\theta}\|I_q) = -r\log(r) + r\log\Big(\sum_{i=1}^{p} d_i^{-2}G_{ii}\Big) - \log\det(\Sigma^2) - \log\det(V^\top D^{-2}V), \quad (60)$$
as the first-order conditions for the loss in $\alpha$ give $\alpha = r\,\big/\sum_i d_i^{-2}G_{ii}$ (note that $G_{ii} \geq 0$ since the matrix is positive semi-definite). When $r = p$, $G$ is full rank, and since $G_{D\theta} = D^{-1}GD^{-1}$, it follows that $G_{D\theta}$ is positive definite, hence $d_{\zeta_+}(\alpha G_{D\theta}\|I_p) = d_\zeta(\alpha G_{D\theta}\|I_p)$ for any $\alpha > 0$ and any diagonal matrix $D$. In this case, since $V$ is square and unitary, $\log\det(V^\top D^{-2}V) = \log\det(D^{-2}) + \log\det(VV^\top) = \log\det(D^{-2})$. In light of (60), the minimization over $\alpha > 0$ and $D \in \mathcal{D}$ of $d_{\zeta_+}(\alpha P_{D\theta}\|I_q)$ thus corresponds to choosing $\alpha = r\,\big/\sum_i d_i^{-2}G_{ii}$, where $D \in \mathcal{D}$ minimizes the sum of the terms that depend on $D$ in (60): $r\log\sum_{i=1}^{p} d_i^{-2}G_{ii} - \log\det(D^{-2})$. This corresponds to (57) with the change of variable $D' = D^{-2}$ (observe that $D' \in \mathcal{D}$ if, and only if, $D \in \mathcal{D}$) and the expression of the determinant of a diagonal matrix.

Remark F.2. With the same reasoning, when $r = \mathrm{rank}(\partial\Phi(\theta)) < p$, the minimization of (55) over $\alpha, D$ is equivalent to
$$\min_{D'\in\mathcal{D}}\; r\log\Big(\sum_{i=1}^{p} d'_i G_{ii}\Big) - \log\det(V^\top D' V). \quad (61)$$

We now prove that the optimization problem (57) can be rewritten as (9) and that it has a solution.

Lemma F.3.
Consider $B \in \mathbb{R}^{p\times H}$ with entries in $\{1, -1, 0\}$ indicating whether a parameter enters ($-1$), leaves ($1$), or is unrelated to ($0$) a given neuron. Precisely, $B = (b_1, \dots, b_H)$ with, for every neuron $h$,
$$(b_h)_i = \begin{cases} -1 & \text{if } i \in \mathrm{in}_h \\ 1 & \text{if } i \in \mathrm{out}_h \\ 0 & \text{otherwise.} \end{cases} \quad (62)$$
Then $\mathcal{D} = \{\mathrm{diag}(\exp(Bu)),\; u \in \mathbb{R}^H\}$. Thus (57) is equivalent to
$$\min_{u\in\mathbb{R}^H} F(u) \quad \text{with} \quad F(u) := p\log\Big(\sum_{i=1}^{p} e^{(Bu)_i} G_{ii}\Big) - \sum_{i=1}^{p} (Bu)_i, \quad (63)$$
in the sense that if $u$ solves (63), then $D' := \mathrm{diag}(\exp(Bu))$ solves (57).

Proof. Let $D' \in \mathcal{D}$, and let $d$ be the vector of $\mathbb{R}^p$ such that $D' = \mathrm{diag}(d)$. We want to show that $d$ can be written as $d = \exp(Bu)$. Thanks to Theorem D.3, we know that there exists $\lambda \in (\mathbb{R}_+^*)^H$ such that
$$d = \bigodot_{v\in H} d_{\lambda_v,v}, \quad (64)$$
where $\bigodot$ denotes the Hadamard (coordinate-wise) product and $d_{\lambda_v,v}$ is the diagonal of the neuron-wise rescaling matrix $D_{\lambda_v,v}$ introduced in Theorem D.1. For $v \in H$ and $\lambda_v \in \mathbb{R}_+^*$, by definition we get
$$(d_{\lambda_v,v})_i = \begin{cases} \lambda_v & \text{if } i \in \mathrm{in}_v \\ 1/\lambda_v & \text{if } i \in \mathrm{out}_v \\ 1 & \text{otherwise.} \end{cases} \quad (65)$$
But because $\lambda_v \in \mathbb{R}_+^*$, there exists $u_v \in \mathbb{R}$ such that $\lambda_v = \exp(-u_v)$. We can rewrite $d_{\lambda_v,v}$ as
$$(d_{\lambda_v,v})_i = \begin{cases} \exp(-u_v) & \text{if } i \in \mathrm{in}_v \\ \exp(u_v) & \text{if } i \in \mathrm{out}_v \\ \exp(0) & \text{otherwise.} \end{cases} \quad (66)$$
We can rewrite the previous equation with a column of $B$, precisely
$$d_{\lambda_v,v} = \exp(u_v b_v), \quad (67)$$
where the $\exp$ is taken coordinate-wise. Now let $i$ be an integer with $1 \leq i \leq p$; thanks to (64), we get
$$d_i = \prod_{h=1}^{H} (d_{\lambda_h,h})_i = \prod_{h=1}^{H} \exp(u_h b_h)_i = \prod_{h=1}^{H} \exp(B_{ih}u_h), \quad (68)$$
so
$$d_i = \exp\Big(\sum_{h=1}^{H} B_{ih}u_h\Big). \quad (69)$$
We can conclude that
$$d = \exp(Bu). \quad (70)$$
Vice versa, if $d$ writes this way, it is a Hadamard product of $d_h := \exp(b_h u_h)$, so that $D' = \mathrm{diag}(d)$ is a product of diagonal matrices which can easily be checked to belong to $\mathcal{D}$. We conclude using that $\mathcal{D}$ is a group.

To prove existence of a solution, we will use the following result.
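Before turning to existence, here is a minimal NumPy sketch of the parametrization $\mathcal{D} = \{\mathrm{diag}(\exp(Bu))\}$ just established, on a hypothetical one-hidden-layer network (the parameter index layout is our own choice, not the paper's). It checks that the $\exp(Bu)$ rescaling leaves every path product, hence the implemented function, unchanged:

```python
import numpy as np

# Hypothetical toy network: 2 inputs -> 3 hidden ReLU neurons (with biases) -> 1 output.
# Parameter vector: W1 entries (6), hidden biases (3), W2 entries (3); p = 12, H = 3.
rng = np.random.default_rng(2)
n_in, n_hid = 2, 3
p, H = n_in * n_hid + n_hid + n_hid, n_hid

# Index layout (our convention): W1[i, j] at i*n_hid + j, bias b[j] at 6 + j, W2[j] at 9 + j.
B = np.zeros((p, H))
for j in range(n_hid):
    for i in range(n_in):
        B[i * n_hid + j, j] = -1.0        # incoming weight of neuron j
    B[n_in * n_hid + j, j] = -1.0         # bias of neuron j
    B[n_in * n_hid + n_hid + j, j] = 1.0  # outgoing weight of neuron j

theta = rng.standard_normal(p)
u = rng.standard_normal(H)
theta_resc = np.exp(B @ u) * theta        # D theta with D = diag(exp(Bu))

def path_products(t):
    """All path products: input->hidden->output paths and bias->output paths."""
    W1 = t[:6].reshape(n_in, n_hid)
    b, W2 = t[6:9], t[9:]
    edges = [W1[i, j] * W2[j] for i in range(n_in) for j in range(n_hid)]
    biases = [b[j] * W2[j] for j in range(n_hid)]
    return np.array(edges + biases)

# Every path crosses each hidden neuron once on the way in and once on the way out,
# so the exp(Bu) factors cancel along paths: Phi(D theta) = Phi(theta).
assert np.allclose(path_products(theta), path_products(theta_resc))
```

Since the rescaled network has the same path-lifting, it implements the same function as the original one, as stated in the introduction.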
Lemma F.4. If $G_{ii} > 0$ for each $i$ and $B \neq 0$ satisfies $\mathbf{1} \notin \mathrm{span}(B)$, then the function $F$ in (63) is coercive.

Proof. $F$ can be written as $F(u) = f(Bu)$, where $f(x) = p\log(\sum_{i=1}^{p} e^{x_i}G_{ii}) - \sum_i x_i$. We have $\sum_{i=1}^{p} e^{x_i}G_{ii} \geq \max_{k\in[\![p]\!]} e^{x_k}G_{kk}$, thus $f(x) \geq p\max_{k\in[\![p]\!]}\{x_k + \log(G_{kk})\} - \sum_i x_i \geq pc + \sum_i(\max_k x_k - x_i)$, where $c = \min_k\log(G_{kk}) > -\infty$ using the hypothesis on $G$. Denoting $e_i$ the $i$-th standard basis vector, this gives
$$F(u) \geq pc + \sum_{i=1}^{p}\max_k\{\langle e_k, Bu\rangle\} - \langle e_i, Bu\rangle. \quad (71)$$
We define, for all $u$,
$$\alpha(u) = \sum_{i=1}^{p}\max_k\{\langle e_k, Bu\rangle\} - \langle e_i, Bu\rangle. \quad (72)$$
Clearly, for any $t \geq 0$, $\alpha(t\cdot u) = t\cdot\alpha(u)$. To conclude that the function is coercive, we introduce
$$\beta := \min_{\|u\|_2 = 1}\alpha(u). \quad (73)$$
Since $\alpha$ is continuous, this minimum is attained. Also, for any $u \neq 0$,
$$F(u) = F\Big(\|u\|_2\cdot\frac{u}{\|u\|_2}\Big) \geq pc + \alpha\Big(\|u\|_2\cdot\frac{u}{\|u\|_2}\Big) = pc + \|u\|_2\cdot\alpha\Big(\frac{u}{\|u\|_2}\Big) \geq pc + \beta\|u\|_2. \quad (74)$$
Proving that $\beta > 0$ is thus sufficient to ensure the coercivity of $F$. For this, first note that for any $u$ and $i \in [\![p]\!]$,
$$\max_k\{\langle e_k, Bu\rangle\} - \langle e_i, Bu\rangle \geq 0, \quad (75)$$
hence $\beta \geq 0$, and we only need to rule out the case $\beta = 0$. The above inequality indeed also shows that $\beta = 0$ if and only if there exists $u$ such that $\forall i$, $\max_k\{\langle e_k, Bu\rangle\} = \langle e_i, Bu\rangle$. This condition is equivalent to the existence of $u$ such that $\langle e_i, Bu\rangle = \mathrm{cte}$ for all $i$, that is to say, the existence of $u$ such that $Bu = \mathrm{cte}\cdot\mathbf{1}$, which is not possible since $\mathbf{1} \notin \mathrm{span}(B)$.

To conclude, we use a second lemma.

Lemma F.5. Consider $B$ as defined in (62). Then $\mathbf{1} \notin \mathrm{span}(B)$.

Proof. By contradiction, assume that $\mathbf{1} \in \mathrm{span}(B)$. Then there is $u$ such that $Bu = \mathbf{1}$, and by Lemma F.3 the diagonal matrix $D := \mathrm{diag}(\exp(Bu)) = e\,I$ belongs to $\mathcal{D}$, and thanks to Theorem D.7 it belongs to $\Phi^{-1}(\mathbf{1})_+$.
However, for any path $p \in \mathcal{P}$, $\Phi_p(e\,I) = \prod_{i\in p} e = e^{|p|} \neq 1$, which leads to the desired contradiction. We conclude that $\mathbf{1} \notin \mathrm{span}(B)$.

Corollary F.6. Suppose that $G_{ii} > 0$ for all $i$. The optimization problem (9) always has a solution.

Proof. Since the logsumexp function is convex (Gao & Pavel, 2017, Lemma 4), one easily shows that the function $F$ is convex. It is also continuous, and coercive by Lemma F.4 and Lemma F.5. So there exists a minimizer.

Finally, we prove that (9) can be solved via simple alternating minimization, where each minimization admits a closed form.

Lemma F.7. Write $\rho_{i,h} = e^{(Bu)_i - B_{ih}u_h}$. If $g > 0$, then for any neuron $h$ the problem $\min_{u_h\in\mathbb{R}} F(u_1, \dots, u_h, \dots, u_H)$ has a solution given by $\log(r_h)$, where $r_h$ is the unique positive root of the polynomial
$$B(A+p)X^2 + AD\,X + C(A-p),$$
with
$$A := |\mathrm{in}_h| - |\mathrm{out}_h|, \quad B := \sum_{i\in\mathrm{out}_h}\rho_{i,h}\,g_i, \quad C := \sum_{i\in\mathrm{in}_h}\rho_{i,h}\,g_i, \quad D := \sum_{i\in\mathrm{other}_h}\rho_{i,h}\,g_i.$$

Proof. Write $g_i = G_{ii}$ and $f(x) = p\log(\sum_i e^{x_i}g_i) - \sum_i x_i$, so that $F(u) = f(Bu)$. We first have
$$\nabla f(x) = \frac{p}{\langle e^x\odot g, \mathbf{1}\rangle}(e^x\odot g) - \mathbf{1}. \quad (76)$$
Since $\nabla F(u) = B^\top\nabla f(Bu)$, the $h$-th coordinate of this gradient is
$$(\nabla F(u))_h = \langle b_h, \nabla f(Bu)\rangle = \frac{p}{\langle e^{Bu}\odot g, \mathbf{1}\rangle}\langle b_h, e^{Bu}\odot g\rangle - \langle b_h, \mathbf{1}\rangle = p\,\frac{\sum_{i\in\mathrm{out}_h} e^{(Bu)_i}g_i - \sum_{i\in\mathrm{in}_h} e^{(Bu)_i}g_i}{\sum_i e^{(Bu)_i}g_i} + \big(|\mathrm{in}_h| - |\mathrm{out}_h|\big). \quad (77)$$
Now, writing $u = (u_1, \dots, u_h, \dots, u_H)$ and using that
$$(Bu)_i = \sum_k B_{ik}u_k = B_{ih}u_h + \sum_{k\neq h} B_{ik}u_k, \quad (78)$$
with the notation $\rho_{i,h} := \exp((Bu)_i - B_{ih}u_h)$, we get
$$e^{(Bu)_i} = \rho_{i,h}\,e^{B_{ih}u_h} = \begin{cases} \rho_{i,h}\,e^{u_h} & i \in \mathrm{out}_h \\ \rho_{i,h}\,e^{-u_h} & i \in \mathrm{in}_h \\ \rho_{i,h} & i \in \mathrm{other}_h. \end{cases}$$
(79)

So, with $A, B, C, D$ defined in the lemma, (77) becomes
$$(\nabla F(u))_h = p\,\frac{e^{u_h}B - e^{-u_h}C}{e^{u_h}B + e^{-u_h}C + D} + A. \quad (80)$$
Hence,
$$(\nabla F(u))_h = 0 \iff B(A+p)e^{u_h} + AD + C(A-p)e^{-u_h} = 0. \quad (81)$$
Introducing $X = e^{u_h} > 0$, the problem of finding a minimizer $u_h$ (which we know exists by coercivity of $F$) reduces to finding a positive root of the polynomial
$$B(A+p)X^2 + AD\,X + C(A-p). \quad (82)$$
One easily checks that $-p < A < p$ and $B, C > 0$ (since $g > 0$), hence $B(A+p) > 0$ and $C(A-p) < 0$; thus the polynomial has exactly one positive root.

G. Computation of the diagonal of G

Proposition G.1. The diagonal of $G = \partial\Phi^\top\partial\Phi$ can be computed efficiently as
$$\mathrm{diag}(G) = \nabla_{\theta^2}\|\Phi(\theta)\|_2^2, \quad (83)$$
where $\nabla_{\theta^2}$ denotes differentiation with respect to the vector of squared parameters $\theta^2 = (\theta_1^2, \dots, \theta_p^2)$, or equivalently $\theta^2 = \theta\odot\theta$, where $\odot$ denotes the Hadamard (element-wise) product and $p$ is the number of parameters.

Proof. Denote $\mathcal{P}$ the set of all paths. For a parameter $i$, making the change of variable $\omega_i = \theta_i^2$, we have
$$G_{ii} = \sum_{p\in\mathcal{P},\,p\ni i}\;\prod_{k\in p,\,k\neq i}\theta_k^2 \quad \text{(definition of } G_{ii}\text{)} \quad (84)$$
$$= \sum_{p\in\mathcal{P},\,p\ni i}\;\prod_{k\in p,\,k\neq i}\omega_k \quad (85)$$
$$= \sum_{p\in\mathcal{P},\,p\ni i}\frac{\partial}{\partial\omega_i}\Phi_p(\omega) \quad \text{(because } \Phi_p(\omega) = \textstyle\prod_{k\in p}\omega_k\text{, the derivative is straightforward)} \quad (86)$$
$$= \frac{\partial}{\partial\omega_i}\sum_{p\in\mathcal{P}}\Phi_p(\omega) \quad \text{(terms with } i\notin p \text{ do not depend on } \omega_i \text{ and vanish under the derivative)} \quad (87)$$
$$= (\nabla_\omega\|\Phi(\omega)\|_1)_i \quad \text{(because } \Phi_p(\omega)\geq 0\text{)} \quad (88)$$
$$= (\nabla_{\theta^2}\|\Phi(\theta)\|_2^2)_i \quad \text{(because } \|\Phi(\omega)\|_1 = \|\Phi(\theta)\|_2^2\text{).} \quad (89)$$

H. Detailed Algorithm and Complexity Analysis

Algorithm 3 gives the full pseudocode of PathCond. The matrix $B \in \mathbb{R}^{p\times H}$ is never formed explicitly; only its non-zero entries are stored.
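For concreteness, here is a dense-matrix Python sketch of the coordinate-descent sweep that Algorithm 3 implements (our own simplified rendering: $B$ is stored densely and the sum $E$ is recomputed naively, whereas the actual algorithm stores $B$ sparsely and maintains $E$ incrementally; the toy network layout is hypothetical):

```python
import numpy as np

def pathcond_sweeps(B, g, n_iter=200, eps=1e-10):
    """Coordinate descent on F(u) = p*log(sum_i e^{(Bu)_i} g_i) - sum_i (Bu)_i,
    using the closed-form per-neuron update of Lemma F.7."""
    p, H = B.shape
    u = np.zeros(H)
    v = np.zeros(p)                       # invariant: v = B u
    for _ in range(n_iter):
        delta_total = 0.0
        for h in range(H):
            out_h = B[:, h] == 1
            in_h = B[:, h] == -1
            E = np.sum(np.exp(v) * g)     # recomputed naively here (O(p))
            D = E - np.sum(np.exp(v[out_h]) * g[out_h]) - np.sum(np.exp(v[in_h]) * g[in_h])
            A = in_h.sum() - out_h.sum()
            Bc = np.sum(np.exp(v[out_h] - u[h]) * g[out_h])   # rho_{i,h} g_i, i in out_h
            Cc = np.sum(np.exp(v[in_h] + u[h]) * g[in_h])     # rho_{i,h} g_i, i in in_h
            # unique positive root of Bc(A+p) X^2 + A*D X + Cc(A-p)
            r = (-A * D + np.sqrt((A * D) ** 2 - 4 * Bc * (A + p) * Cc * (A - p))) \
                / (2 * Bc * (A + p))
            delta = np.log(r) - u[h]
            v[out_h] += delta
            v[in_h] -= delta
            u[h] += delta
            delta_total += abs(delta)
        if delta_total < eps:
            break
    return u, v

def F(u, B, g):
    p = B.shape[0]
    x = B @ u
    return p * np.log(np.sum(np.exp(x) * g)) - np.sum(x)

# Toy instance: the B matrix of a hypothetical 2-3-1 one-hidden-layer network
# (rows 0-5: incoming weights, 6-8: biases, 9-11: outgoing weights).
B = np.zeros((12, 3))
for j in range(3):
    B[[j, 3 + j, 6 + j], j] = -1
    B[9 + j, j] = 1
rng = np.random.default_rng(3)
g = rng.uniform(0.1, 10.0, size=12)       # stand-in for diag(G) > 0
u, v = pathcond_sweeps(B, g)
assert F(u, B, g) <= F(np.zeros(3), B, g) + 1e-12
assert np.allclose(B @ u, v)
```

On this toy instance the sweep monotonically decreases $F$, as expected from exact per-coordinate minimization; the sparse, incremental bookkeeping described below is what brings each sweep down to $O(p)$.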
Since each parameter (edge) appears in at most two columns of $B$ (once as an outgoing edge and once as an incoming edge), we have $\mathrm{nnz}(B) \leq 2p$. The $O(n_{\mathrm{iter}}\cdot p)$ total complexity rests on two incremental updates.

Dual variable $v = Bu$. At each coordinate-descent step, only $u_h$ changes, by some increment $\delta$, inducing the update $\Delta v = \delta\,b_h$, where $b_h$ is the $h$-th column of $B$. Since $b_h$ has only $|\mathrm{out}_h| + |\mathrm{in}_h|$ non-zero entries, this costs $O(|\mathrm{out}_h| + |\mathrm{in}_h|)$ rather than $O(p)$.

Global auxiliary sum $E$. Computing the degree-2 polynomial (Lemma 4.1) requires $D = \sum_{i\in\mathrm{other}_h} e^{v_i}g_i$, a sum over $p - |\mathrm{out}_h| - |\mathrm{in}_h| \approx p$ terms. Naïvely evaluating this for each of the $H$ neurons would give $O(pH)$ per sweep. Instead, we maintain the global sum $E = \sum_{i=1}^{p} e^{v_i}g_i$ and recover
$$D = E - S_{\mathrm{out}} - S_{\mathrm{in}}, \quad S_{\mathrm{out}} = \sum_{i\in\mathrm{out}_h} e^{v_i}g_i, \quad S_{\mathrm{in}} = \sum_{i\in\mathrm{in}_h} e^{v_i}g_i,$$
at cost $O(|\mathrm{out}_h| + |\mathrm{in}_h|)$. After updating $v$, the sum $E$ is refreshed within the same budget by recomputing only the affected terms. One full sweep therefore costs $\sum_{h=1}^{H} O(|\mathrm{out}_h| + |\mathrm{in}_h|) = O(p)$, and the total complexity over $n_{\mathrm{iter}}$ sweeps is $O(n_{\mathrm{iter}}\cdot p)$, linear in the number of parameters per iteration.

Algorithm 3: PathCond, path-conditioned rescaling

Input: DAG ReLU network with parameters $\theta$; scaling weights $g = \mathrm{diag}(G) \in \mathbb{R}^p$; tolerance $\varepsilon > 0$; maximum iterations $n_{\mathrm{iter}}$
Output: Rescaled parameters $\tilde\theta = \mathrm{diag}(e^v)\theta$

1. $u \leftarrow 0 \in \mathbb{R}^H$  // primal variables
2. $v \leftarrow 0 \in \mathbb{R}^p$  // dual variables, invariant: $v = Bu$
3. $E \leftarrow \sum_{i=1}^{p} g_i$  // global sum $E = \sum_i e^{v_i}g_i$, initialized with $v = 0$
4. for $k = 0, \dots, n_{\mathrm{iter}} - 1$ do
5.   $\Delta \leftarrow 0$
6.   for each neuron $h = 1, \dots, H$ do
7.     // index sets of outgoing/incoming parameters of neuron $h$
       $\mathrm{out}_h \leftarrow \{i : B_{i,h} = +1\}$; $\mathrm{in}_h \leftarrow \{i : B_{i,h} = -1\}$
8.     // recover $D$ from the global sum in $O(|\mathrm{out}_h| + |\mathrm{in}_h|)$
       $S_{\mathrm{out}} \leftarrow \sum_{i\in\mathrm{out}_h} e^{v_i}g_i$; $S_{\mathrm{in}} \leftarrow \sum_{i\in\mathrm{in}_h} e^{v_i}g_i$; $D \leftarrow E - S_{\mathrm{out}} - S_{\mathrm{in}}$
9.     // coefficients of the degree-2 polynomial (Lemma 4.1)
       $A \leftarrow |\mathrm{in}_h| - |\mathrm{out}_h|$; $B \leftarrow \sum_{i\in\mathrm{out}_h} e^{v_i - u_h}g_i$  // $\rho_{i,h} = e^{v_i - u_h}$ for $i \in \mathrm{out}_h$
       $C \leftarrow \sum_{i\in\mathrm{in}_h} e^{v_i + u_h}g_i$  // $\rho_{i,h} = e^{v_i + u_h}$ for $i \in \mathrm{in}_h$
10.    // optimal step: solve $B(A+p)X^2 + ADX + C(A-p) = 0$
       $r_h \leftarrow \dfrac{-AD + \sqrt{(AD)^2 - 4B(A+p)C(A-p)}}{2B(A+p)}$; $\delta \leftarrow \ln r_h - u_h$  // $u_h^{(k+1)} = \ln r_h$
11.    // incremental update of $v$ and $E$ in $O(|\mathrm{out}_h| + |\mathrm{in}_h|)$
       $v_i \leftarrow v_i + \delta$ for $i \in \mathrm{out}_h$; $v_i \leftarrow v_i - \delta$ for $i \in \mathrm{in}_h$
       $E \leftarrow D + \sum_{i\in\mathrm{out}_h} e^{v_i}g_i + \sum_{i\in\mathrm{in}_h} e^{v_i}g_i$
12.    $u_h \leftarrow u_h + \delta$; $\Delta \leftarrow \Delta + |\delta|$
13.  if $\Delta < \varepsilon$ then break
14. return $\tilde\theta = \mathrm{diag}(e^v)\theta$

I. Architecture Statistics: Number of Parameters, Hidden Units, and Paths

This section reports detailed architectural statistics for several standard convolutional neural network (CNN) models commonly used in computer vision. For each architecture, we consider the following three quantities:

• Parameters: the total number of trainable parameters (weights and biases) in the network.
• Hidden Units: the total number of hidden channels across all convolutional layers, serving as the convolutional analogue of hidden neurons in fully connected networks.
• Number of Paths: the total number of distinct computational paths from the input to the output induced by the network architecture.

The number of hidden units is computed by summing the output dimensionality (i.e., the number of channels) of each convolutional layer, excluding the final classification layer. To compute the number of paths, we rely on the path-norm formalism introduced by Gonon et al. (2023).
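Concretely, the path count can be read off a single forward pass through an all-ones copy of the network. A minimal sketch for a bias-free fully-connected architecture (a simplification: the architectures in Table 1 also involve biases, pooling, and convolutional weight sharing):

```python
import numpy as np

def num_paths_forward(widths):
    """Forward-propagate a vector of ones through all-ones weight matrices;
    the sum of the output coordinates is the number of input-output paths."""
    x = np.ones(widths[0])
    for k in range(len(widths) - 1):
        W = np.ones((widths[k + 1], widths[k]))
        x = W @ x                  # with theta = 1 the ReLUs act as identities
    return int(x.sum())

def num_paths_exact(widths):
    """Direct count: each path picks one neuron per layer."""
    return int(np.prod(widths))

widths = [3, 4, 5, 2]
assert num_paths_forward(widths) == num_paths_exact(widths) == 3 * 4 * 5 * 2
```

The formal justification in terms of the path-lifting $\Phi$ is given next.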
Let $\Pi$ denote the set of all paths in the network, and let $\Phi(\theta)$ denote the vector of path products associated with parameters $\theta$. The path-1 norm is defined as
$$\|\Phi(\theta)\|_1 = \sum_{\pi\in\Pi}\Phi_\pi(\theta).$$
By setting all network parameters to one, i.e., $\theta = \mathbf{1}$, each path product equals one, and therefore
$$\|\Phi(\mathbf{1})\|_1 = \sum_{\pi\in\Pi} 1 = |\Pi|,$$
which exactly corresponds to the number of paths in the architecture. Importantly, this quantity can be computed efficiently using a single forward pass through the network (Gonon et al., 2023). Table 1 summarizes these statistics for several widely used CNN architectures.

Model                                   Parameters     Hidden Units   Paths
ResNet-18 (He et al., 2016)             11,689,512     4,800          1.13 × 10^58
ResNet-34 (He et al., 2016)             21,797,672     8,512          6.68 × 10^109
ResNet-50 (He et al., 2016)             25,557,032     26,560         7.16 × 10^140
AlexNet (Krizhevsky et al., 2012)       61,100,840     9,344          1.21 × 10^30
VGG-16 (Simonyan & Zisserman, 2014)     138,357,544    12,416         1.15 × 10^56
CIFAR-NV (Gitman & Ginsburg, 2017)      2,616,576      1,792          2.46 × 10^28

Table 1. Architectural statistics for common convolutional neural networks.

J. Regimes Analysis at Initialization

We begin with a fundamental observation about the rescaling algorithm.

Proposition J.1 (Trivial case: constant diagonal). If the diagonal of $G$ is constant, i.e., there exists $\alpha > 0$ such that $g = \alpha\mathbf{1}$, then the objective of the rescaling problem (9) admits $u = 0$ as its unique minimizer. Consequently, the rescaling matrix is the identity and the algorithm has no effect.

Proof. Going back to (9),
$$F(u) = p\log\Big(\sum_{i=1}^{p} e^{(Bu)_i}G_{ii}\Big) - \sum_{i=1}^{p}(Bu)_i.$$
If $G_{ii} = \alpha > 0$ for all $i$, then
$$F(u) = p\log\Big(\alpha\sum_{i=1}^{p} e^{(Bu)_i}\Big) - \sum_{i=1}^{p}(Bu)_i = p\log(p\alpha) + p\Big[\log\Big(\frac{1}{p}\sum_{i=1}^{p} e^{(Bu)_i}\Big) - \frac{1}{p}\sum_{i=1}^{p}(Bu)_i\Big].$$
Because the log is concave on $\mathbb{R}_+^*$, Jensen's inequality gives $\log\big(\frac{1}{p}\sum_i e^{(Bu)_i}\big) \geq \frac{1}{p}\sum_i(Bu)_i$, hence $F(u) \geq p\log(p\alpha)$, which is achieved at $u = 0$. Since the objective is strictly convex, $u = 0$ is the unique minimizer.

J.1. The Matrix G

Consider a neural network represented as a directed acyclic graph (DAG) with ReLU, identity, and max-pooling operations. Let $\Phi$ denote the path-feature map and define the path-magnitude matrix $G := \partial\Phi^\top\partial\Phi$. We focus on its diagonal $g = \mathrm{diag}(G)$, whose entries encode path magnitudes. For any parameter $i$ (edge weight or bias), the corresponding diagonal entry is
$$g_i = \sum_{p\in\mathcal{P},\,p\ni i}\;\prod_{j\in p,\,j\neq i}\theta_j^2, \quad (90)$$
where $\mathcal{P}$ denotes the set of all paths ending at an output neuron.

Expected path magnitudes. At initialization, the parameters $\theta_i$ are independent random variables. By independence, the expectation of each product factorizes, yielding
$$\mathbb{E}[g_i] = \sum_{p\in\mathcal{P},\,p\ni i}\;\prod_{j\in p,\,j\neq i}\mathbb{E}[\theta_j^2]. \quad (91)$$

J.2. Layered Fully-Connected ReLU Networks

We specialize to layered fully-connected ReLU networks (LFCNs) (Gonon et al., 2023). Such a network is specified by a depth $L$ and widths $(n_0, \dots, n_L)$, where layer $k$ maps $\mathbb{R}^{n_k}$ to $\mathbb{R}^{n_{k+1}}$. We distinguish two types of paths: (i) paths starting from input neurons, and (ii) paths starting from bias parameters. As is customary, we omit a bias in the last layer since it does not affect expressivity. Let $\sigma_k^2 := \mathbb{E}[\theta_j^2]$ for any edge parameter in layer $k$. This assumption of layer-wise constant variance is standard in neural network initialization schemes such as Xavier/Glorot initialization (Glorot & Bengio, 2010) and He initialization (He et al., 2015), which are designed to maintain activation variance across layers during forward and backward passes.

Counting paths through a parameter.
To compute expected path magnitudes via equation (91), we decompose the calculation into: (1) counting the paths passing through a parameter, and (2) computing the expected squared magnitude of each path type.

Consider an edge parameter $w$ located in layer $k$ (connecting layer $k$ to layer $k+1$). The number of paths passing through $w$ and originating from input neurons equals the number of input-to-output paths in the reduced network obtained by fixing layers $k$ and $k+1$ to a single neuron:
$$N^{\mathrm{input}}_{\mathrm{edge},k} = \prod_{\substack{i=0\\ i\neq k,k+1}}^{L} n_i.$$
This counts all combinations of neurons in layers other than $k$ and $k+1$. For any such path $p$ containing $w$, the expected contribution to $g_w$ (excluding $w$ itself) is
$$\mathbb{E}\Big[\prod_{\substack{j\in p\\ j\neq w}}\theta_j^2\Big] = \prod_{\substack{\ell=0\\ \ell\neq k}}^{L-1}\sigma_\ell^2,$$
where the product ranges over all layers except layer $k$ (which contains $w$) and the output layer $L$ (which has no outgoing edges).

Additionally, we must account for paths originating from bias parameters in layers $0$ to $k-1$. A path starting from a bias in layer $m < k$ and passing through $w$ traverses all layers from $m+1$ to $L$, excluding the neurons in layers $k$ and $k+1$. The number of such paths is
$$N^{\mathrm{bias},m}_{\mathrm{edge},k} = \prod_{\substack{j=m+1\\ j\neq k,k+1}}^{L} n_j.$$
Each such path has expected squared magnitude (excluding $w$ from the product)
$$\prod_{\substack{j=m\\ j\neq k}}^{L-1}\sigma_j^2,$$
where the product starts from layer $m$ (containing the bias) and excludes layer $k$.

For a bias parameter $b$ in layer $k$, the analysis is simpler. Such a bias only participates in paths from layer $k$ onward to the output. The number of such paths is
$$N_{\mathrm{bias},k} = \prod_{i=k+2}^{L} n_i,$$
and each path has expected squared magnitude (excluding the bias itself from the product)
$$\prod_{i=k+1}^{L-1}\sigma_i^2.$$

Expected path magnitudes: closed-form formula.
Combining path counts with their corresponding expected squared magnitudes yields the following closed-form expression for expected path magnitudes:
$$\mathbb{E}[g_i] = \begin{cases} \displaystyle\prod_{j=k+2}^{L} n_j \prod_{j=k+1}^{L-1}\sigma_j^2, & \text{if } i \text{ is a bias in layer } k, \\[2mm] \displaystyle\prod_{\substack{j=0\\ j\neq k,k+1}}^{L} n_j \prod_{\substack{j=0\\ j\neq k}}^{L-1}\sigma_j^2 \;+\; \sum_{m=0}^{k-1}\;\prod_{\substack{j=m+1\\ j\neq k,k+1}}^{L} n_j \prod_{\substack{j=m\\ j\neq k}}^{L-1}\sigma_j^2, & \text{if } i \text{ is an edge in layer } k. \end{cases} \quad (92)$$
Introducing the quantities $a_k := n_k\sigma_k^2$, equation (92) simplifies to:
$$\mathbb{E}[g_i] = \begin{cases} \displaystyle n_L\,\sigma_{k+1}^2\prod_{j=k+2}^{L-1} a_j, & \text{if } i \text{ is a bias in layer } k, \\[2mm] \displaystyle n_L\,\sigma_{k+1}^2\prod_{\substack{j=0\\ j\neq k,k+1}}^{L-1} a_j \;+\; n_L\,\sigma_{k+1}^2\sum_{m=0}^{k-1}\sigma_m^2\prod_{\substack{j=m+1\\ j\neq k,k+1}}^{L-1} a_j, & \text{if } i \text{ is an edge in layer } k. \end{cases} \quad (93)$$

J.3. Classical Initializations

For standard initializations such as He or Kaiming initialization, one has $a_k = n_k\sigma_k^2 = a$ for all $k$, for some constant $a > 0$.

Proposition J.2 (Expected diagonal under standard initialization). For an LFCN with standard initialization, the expected path magnitude associated with a parameter in layer $k$ is given by
$$\mathbb{E}[g_i] = \begin{cases} \dfrac{n_L}{n_{k+1}}\,a^{L-1-k}, & \text{if } i \text{ is a bias in layer } k, \\[1mm] \dfrac{n_L}{n_{k+1}}\,a^{L-1} + \dfrac{n_L}{n_{k+1}}\displaystyle\sum_{m=0}^{k-1}\frac{a^{L-1-m}}{n_m}, & \text{if } i \text{ is an edge in layer } k. \end{cases} \quad (94)$$

J.4. Consequences for the Rescaling Algorithm

For the rescaling algorithm to have a non-trivial effect, the expected diagonal of $G$ must exhibit variation across parameters. The analysis above reveals that this depends delicately on both the network architecture and the initialization scheme.

Architectures with non-constant width. We focus on edge parameters, as their behavior already suffices to highlight architectural regimes where the expected diagonal of $G$ exhibits significant variations in magnitude. For an edge parameter in layer $k$, equation (94) becomes
$$\mathbb{E}[g_i] = \frac{n_L}{n_{k+1}}\,a^{L-1} + \frac{n_L}{n_{k+1}}\sum_{m=0}^{k-1}\frac{a^{L-1-m}}{n_m}.$$
The behavior of this expression depends critically on the regime of the variance scaling factor $a$.
In the large variance regime ($a \to +\infty$), the first term dominates:
$$\mathbb{E}[g_i] \sim \frac{n_L}{n_{k+1}}\,a^{L-1}.$$
The expected path magnitude is determined primarily by the ratio $n_L/n_{k+1}$. The ratio between $\mathbb{E}[g_i]$ and $\mathbb{E}[g_j]$ for parameters in different layers is of order $n_k/n_{k+1}$, provided that $a$ is large enough.

In the small variance regime ($a \to 0$), the sum dominates and retains only its last term:
$$\mathbb{E}[g_i] \sim \frac{n_L}{n_{k+1}}\,\frac{a^{L-k}}{n_{k-1}}.$$
This regime exhibits much stronger variation in expected path magnitudes: the ratio between $\mathbb{E}[g_i]$ and $\mathbb{E}[g_j]$ for parameters in different layers can be of order $1/a$, which is very large.

In the critical case $a = 1$, corresponding for instance to Kaiming normal initialization with unit gain and fan-in scaling, we obtain
$$\mathbb{E}[g_i] = \frac{n_L}{n_{k+1}}\Big(1 + \sum_{m=0}^{k-1}\frac{1}{n_m}\Big), \quad k \in \{1, \dots, L-1\}.$$
Hence, unless the width is constant across all layers, the expected diagonal of $G$ exhibits non-trivial variation.

J.5. Constant-Width Networks

Setup. We now consider multi-layer perceptrons with constant width, $n_i = n$ for all hidden layers $i \in \{0, \dots, L-1\}$. In this simplified setting, the expected path magnitude for an edge parameter in layer $k$ reduces to:
$$\mathbb{E}[g_i] = \begin{cases} a^{L-1}, & \text{if } k = 0, \\ a^{L-1} + \dfrac{1}{n}\displaystyle\sum_{m=1}^{k} a^{L-m}, & \text{if } k \in \{1, \dots, L-1\}. \end{cases} \quad (95)$$

Case $a = 1$. When $a = 1$, equation (95) simplifies to
$$\mathbb{E}[g_i] = 1 + \frac{k}{n}, \quad k \in \{0, \dots, L-1\}. \quad (96)$$
The expected path magnitude increases linearly with depth. For networks with large width $n \gg L$, the diagonal is approximately constant, and the rescaling algorithm has little effect.

Case $a \neq 1$. When $a \neq 1$, the geometric series yields
$$\mathbb{E}[g_i] = a^{L-1} + \frac{a^{L-1}}{n}\cdot\frac{1 - (1/a)^k}{1 - 1/a}, \quad k \in \{0, \dots, L-1\}. \quad (97)$$

Large variance regime: $a \to +\infty$. In this regime,
$$\mathbb{E}[g_i] \sim \begin{cases} a^{L-1}, & \text{if } k = 0, \\ a^{L-1}\big(1 + 1/n\big), & \text{if } k \in \{1, \dots, L-1\}. \end{cases}$$
(98)

Small variance regime: $a \to 0$. In this regime,
$$\mathbb{E}[g_i] \sim \begin{cases} a^{L-1}, & \text{if } k = 0, \\ a^{L-1}(1 + 1/n), & \text{if } k = 1, \\ \dfrac{a^{L-k}}{n}, & \text{if } k \in \{2, \dots, L-1\}. \end{cases} \quad (99)$$
The diagonal exhibits a multi-level structure with significant variation across layers, especially for deeper layers.

J.6. Summary and Implications

Based on the analysis above, we identify regimes in which the rescaling algorithm is expected to have a non-trivial effect.

• Varying-width architectures: for networks with non-uniform layer widths, regardless of the initialization scale $a$, the expected diagonal of $G$ is non-constant across layers. The rescaling algorithm is expected to have a significant effect on such irregular configurations.
• Near constant-width architectures with small variance: for networks with approximately uniform width $n_i \approx n$ for all $i$, the algorithm has a non-trivial effect when initialized with small variance, i.e., when $a \ll 1$. In this regime, the diagonal exhibits significant variation across layers despite the regularity of the architecture.
• Near constant-width architectures with critical initialization: for such regular configurations with $a = 1$, the diagonal varies as $1 + k/n$. The effect is significant when the depth $L$ is comparable to or larger than the width $n$.
• Near constant-width architectures with large variance: for near constant-width networks with $a \gg 1$, the diagonal exhibits a simple two-level structure. The rescaling may help, but the effect is limited to distinguishing the first layer from subsequent layers.

Conclusion. The rescaling algorithm is expected to have the most significant impact in two scenarios:
1. varying-width architectures, regardless of initialization;
2. near constant-width architectures of width $n$, initialized with a small standard deviation relative to $1/\sqrt{n}$.
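Before turning to the empirical validation, here is a quick numerical check of the constant-width formulas (95)-(97) (pure Python; the depth and width are arbitrary toy values):

```python
def expected_g_edge(a, n, L, k):
    """E[g_i] for an edge in layer k of a constant-width-n, depth-L LFCN,
    as given by formula (95)."""
    val = a ** (L - 1)
    if k >= 1:
        val += sum(a ** (L - m) for m in range(1, k + 1)) / n
    return val

L, n = 8, 32
# Critical case a = 1: matches the closed form 1 + k/n of (96).
assert all(abs(expected_g_edge(1.0, n, L, k) - (1 + k / n)) < 1e-12 for k in range(L))

# Case a != 1: matches the geometric-series closed form (97).
a = 0.5
for k in range(L):
    closed = a ** (L - 1) + a ** (L - 1) / n * (1 - (1 / a) ** k) / (1 - 1 / a)
    assert abs(expected_g_edge(a, n, L, k) - closed) < 1e-12

# Small-variance regime: deeper layers dominate, so the diagonal of G varies a lot.
ratios = [expected_g_edge(a, n, L, k) / expected_g_edge(a, n, L, 0) for k in range(L)]
assert ratios[-1] > ratios[1]
```

This mirrors the regimes identified above: with $a = 1$ the diagonal is nearly flat for $n \gg L$, while with $a < 1$ its entries spread over several orders of magnitude.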
In both cases, the expected path magnitudes exhibit significant variation across parameters, leading to ill-conditioning of the path-magnitude matrix $G$ that the rescaling algorithm can address. We empirically validate these assumptions through a controlled experiment on synthetic networks.

Experimental Setup. We generate multiple networks with a fixed depth of 8 layers and a mean width of 32 neurons per layer, ensuring a total budget of $8 \times 32 = 256$ neurons distributed across all hidden layers. To systematically vary the width regularity of the network architecture, we introduce a concentration parameter $\alpha \in [10^{-1}, 10^2]$ that controls a symmetric Dirichlet distribution used to sample the proportion of neurons allocated to each layer. When $\alpha$ is very large, the Dirichlet distribution becomes highly concentrated around uniform proportions, resulting in near constant-width configurations (all layers $\approx 32$ neurons). Conversely, when $\alpha$ is small, the distribution allows for greater variability in the proportions, creating varying-width architectures where some layers are significantly wider or narrower than others. To ensure architectural validity, each layer is guaranteed a minimum width of 1 neuron, with the remaining budget distributed proportionally according to the Dirichlet samples using a largest-remainder method to maintain the exact total neuron count. We repeat this experiment 20 times, sampling 20 different $\alpha$ values in each run to ensure statistical robustness.

Remark. We can notice that these favorable regimes are common in practice. Even when most hidden layers share a near constant width, input and output dimensions often differ significantly, giving rise to an inherently non-uniform configuration. For instance, in the CIFAR-10 experiments (Figure 2), the 3072-dimensional input and 10-dimensional output create architectural irregularity despite the many intermediate layers of width 500.
However, as depth increases, the network approaches near constant-width behavior, which explains why the performance gains decrease for very deep networks.

K. Additional Experiments for Section 5.1

We provide here additional experiments complementing Section 5.1, covering a broader range of learning rates. Figure 6 demonstrates that PathCond achieves strong performance at small to moderate learning rates, consistent with our theoretical framework where proper initialization critically affects training dynamics. However, performance degrades at larger learning rates, where optimization instabilities overshadow initialization benefits and our theoretical assumptions (small learning rate regime) no longer hold.

Figure 6. Convergence speed across different learning rates and network sizes: epochs to reach 99% train accuracy vs. number of parameters, for lr ∈ {0.1, 0.01, 0.005, 0.001} (Baseline, PathCond, ENorm).

L. Additional Experiments for Section 5.2

We provide complementary experiments for Section 5.2, exploring the impact of learning rate on the generalization benefits of PathCond across a wider range of values. Figure 7 presents training and test dynamics for learning rates ranging from 10^{-2} to 10^{-4}. Consistent with our theoretical predictions, PathCond demonstrates improved convergence and generalization performance at small to moderate learning rates, where proper initialization plays a critical role in training stability. For the smallest learning rate (lr = 10^{-4}), we extend training to 500 epochs to observe convergence behavior, while other configurations use 128 epochs.
The results confirm that PathCond's advantages are most pronounced in regimes where our theoretical assumptions hold, with performance converging to baseline as learning rates increase and optimization dynamics begin to dominate over initialization effects.

Figure 7. Training dynamics across different learning rates: train loss, train accuracy, and test accuracy over epochs for Baseline, PathCond, and ENorm, at (a) lr = 0.01, (b) lr = 0.001, (c) lr = 0.0005, and (d) lr = 0.0001 (500 epochs).

M. Additional Experiments for Section 5.3

We provide complementary experiments for Section 5.3, exploring the impact of learning rate on the relationship between varying-width architectures and required rescaling adjustments. Figure 8 presents the final training loss and rescaling magnitudes across different compression factors for three learning rates. Consistent with our theoretical framework, the required rescaling magnitude increases with architectural width variation (higher compression factors) across all learning rates.

Figure 8. Final training loss (top) and rescaling magnitude ‖log(rescaling)‖∞ (bottom) across compression factors (1.0, 2.0, 4.0, 7.0) for Baseline and PathCond, at (a) lr = 0.1, (b) lr = 0.01, and (c) lr = 0.001.

N. ENorm: Experimental Setup

We describe the ENorm hyperparameter choices used in our comparisons. ENorm (Stock & Gribonval, 2023) rescales network weights at regular intervals to minimize a weighted sum of per-layer norms,
\[
\sum_{\ell} c_{\ell}\, \| W_{\ell} \|_p^p,
\]
where the depth-dependent coefficient c_ℓ weights each layer. We use the default value p = 2, so ENorm minimizes the sum of squared ℓ2 norms across layers. We set c = 1, which removes the depth penalty and treats all layers uniformly; as noted by the original authors, values of c at or slightly above 1 generally yield the best results. ENorm can be applied every k ≥ 1 SGD iterations. Following the recommendation of the original authors, we apply one ENorm cycle after each SGD iteration (k = 1). For networks with BatchNorm layers, we follow the recommendation of the authors and exclude BatchNorm parameters from the rescaling cycle, applying ENorm to the remaining weights only, consistently with how PathCond handles BatchNorm. In summary, all ENorm results in this paper use the default configuration: p = 2, c = 1, one rescaling cycle per SGD iteration, BatchNorm layers excluded.
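As a concrete illustration of the p = 2, c = 1 objective, one balancing pass over a single hidden layer of a bias-free ReLU block can be sketched as below: each hidden neuron's incoming row and outgoing column are rescaled so their ℓ2 norms match, which is the per-neuron minimizer of the sum of squared norms, and ReLU positive homogeneity keeps the network function unchanged. This is our own minimal sketch of the principle, not the reference ENorm implementation of Stock & Gribonval (2023).

```python
import numpy as np

def balance_layer(W1, W2, eps=1e-12):
    """One balancing pass over a hidden layer (bias-free sketch).

    For hidden neuron i with incoming row norm r_i = ||W1[i, :]|| and
    outgoing column norm c_i = ||W2[:, i]||, the scale s_i = sqrt(c_i / r_i)
    minimizes (s_i * r_i)**2 + (c_i / s_i)**2, i.e., the p = 2, c = 1
    objective restricted to that neuron. Since s_i > 0, the rescaling
    commutes with ReLU and preserves the network function.
    """
    r = np.linalg.norm(W1, axis=1)
    c = np.linalg.norm(W2, axis=0)
    s = np.sqrt((c + eps) / (r + eps))
    return W1 * s[:, None], W2 / s[None, :]

rng = np.random.default_rng(0)
W1 = rng.normal(size=(32, 10))   # 10 inputs -> 32 hidden units
W2 = rng.normal(size=(5, 32))    # 32 hidden units -> 5 outputs
B1, B2 = balance_layer(W1, W2)

x = rng.normal(size=10)
before = W2 @ np.maximum(W1 @ x, 0)
after = B2 @ np.maximum(B1 @ x, 0)
print(np.allclose(before, after))  # prints True: rescaling is function-preserving
```

For a deeper network, ENorm applies such passes cyclically across layers; here only the two-layer case is shown.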