Reading time: 55 minutes
...

📝 Original Info

  • Title:
  • ArXiv ID: 2512.18209
  • Date:
  • Authors: Unknown

📝 Abstract

Empirical power-law scaling has been widely observed across modern deep learning systems, yet its theoretical origins and scope of validity remain incompletely understood. The Generalized Resolution-Shell Dynamics (GRSD) framework models learning as spectral energy transport across logarithmic resolution shells, providing a coarse-grained dynamical description of training. Within GRSD, power-law scaling corresponds to a particularly simple renormalized shell dynamics; however, such behavior is not automatic and requires additional structural properties of the learning process. In this work, we identify a set of sufficient conditions under which the GRSD shell dynamics admits a renormalizable coarse-grained description. These conditions constrain the learning configuration at multiple levels, including boundedness of gradient propagation in the computation graph, weak functional incoherence at initialization, controlled Jacobian evolution along training, and log-shift invariance of renormalized shell couplings. We further show that power-law scaling does not follow from renormalizability alone, but instead arises as a rigidity consequence: once log-shift invariance is combined with the intrinsic time-rescaling covariance of gradient flow, the renormalized GRSD velocity field is forced into a power-law form. Beyond the theoretical analysis, we provide direct empirical evidence that the key structural requirement-log-shift invariance of renormalized shell couplings-is approximately realized in modern residual architectures, and is significantly degraded in non-residual counterparts. These experiments validate the non-vacuous nature of the sufficient conditions and clarify the architectural mechanisms that promote renormalizable spectral transport. We emphasize that the conditions identified here are sufficient but not necessary, and we do not claim universality of power-law scaling across all deep learning settings.

📄 Full Content

Empirical power-law scaling has emerged as one of the most robust and reproducible phenomena in modern deep learning. Across model families, tasks, and training regimes, performance metrics such as loss, error, or perplexity often exhibit smooth power-law dependence on model size, dataset size, or compute budget [18,25,17,20]. More recently, scaling behavior has been observed to persist beyond classical regimes, extending to precision, data pruning, and architectural interventions [42,31]. Despite its empirical ubiquity, the theoretical origins of these power laws remain only partially understood.

A growing body of work has sought to explain neural scaling laws through kernel limits and mean-field approximations [23,32,51], as well as through dynamical models that interpolate between lazy and feature-learning regimes [5,6,7]. While these approaches provide valuable insight into specific limits or architectures, they do not by themselves explain why power-law behavior arises so broadly, nor under what conditions it should be expected to fail. In particular, empirical evidence clearly indicates that scaling laws do not hold universally: depending on architecture, optimization stability, data distribution, or training regime, scaling behavior may degrade, break, or transition between distinct regimes [42,26,10].

In recent work, the Generalized Resolution-Shell Dynamics (GRSD) framework was proposed as a spectral and operator-theoretic description of learning dynamics in deep networks [52]. GRSD formulates learning as an energy transport process across logarithmic spectral shells, leading to a coarse-grained, one-dimensional conservation law for error energy. Within this framework, empirical neural scaling laws correspond to a particularly simple form of the renormalized shell velocity: a power-law function of the spectral coordinate. However, in GRSD this power-law form is not automatic. Rather, it reflects a nontrivial structural property of the learning process-namely, that the shell dynamics admits a renormalizable closure in the sense familiar from statistical physics and turbulence theory [24,49,28,8,43].

The central question addressed in this paper is therefore the following: under what conditions does the GRSD shell dynamics admit a power-law renormalized description? Our goal is not to claim that all deep learning systems satisfy such conditions, nor that power-law behavior is universal. Indeed, the empirical literature suggests the opposite: scaling laws are contingent and can fail outside specific regimes. Instead, we seek to identify a coherent and mathematically well-defined set of sufficient conditions under which renormalizable shell dynamics can be established, and under which power-law behavior follows as a rigidity consequence.

A central difficulty in this program is that some of the required conditions-most notably log-shift invariance of renormalized shell couplings-are genuinely structural and do not follow from generic stability or locality considerations alone. Such conditions would be of limited interest if they were never even approximately realized in practical learning systems. For this reason, an essential part of our analysis is to identify concrete architectural mechanisms that promote these properties, and to empirically test whether they are approximately satisfied in realistic training settings.

Concretely, we study learning configurations defined by a network architecture, an initialization scheme, and an optimization trajectory. We provide a theorem establishing that the GRSD shell dynamics admits a power-law velocity field when a collection of structural and dynamical conditions are satisfied. These conditions include (i) locality in the gradient computation graph, (ii) weak functional incoherence at initialization, (iii) controlled evolution of the Jacobian along training, and (iv) log-shift invariance of the renormalized shell couplings. Crucially, power-law scaling does not follow from renormalizability or log-shift invariance alone; rather, it emerges from a rigidity mechanism once these properties are combined with the intrinsic time-rescaling covariance of gradient flow.

At first sight, the sufficient conditions identified in this work may appear strong or even restrictive. Indeed, they impose nontrivial requirements on architectural structure, initialization geometry, and training stability. However, it is precisely these requirements that align strikingly well with several core design principles of modern deep learning systems.

Contemporary architectures are overwhelmingly engineered to avoid uncontrolled recurrence or long-range instantaneous coupling, favoring feedforward or residual structures with bounded gradient propagation. Training pipelines are carefully designed to ensure stability and controllability, with explicit emphasis on preventing gradient explosion or catastrophic spectral reorganization. Likewise, random or weakly correlated initialization schemes are ubiquitous, reflecting an implicit preference for functional incoherence at the start of training. Viewed through the GRSD lens, these widely adopted engineering choices correspond closely to Conditions 1-3, which enforce locality, stability, and statistical regularity of the induced operator dynamics.

From this perspective, the emergence of neural scaling laws should not be interpreted as an intrinsic or inevitable property of deep learning models. Rather, it reflects the fact that engineering requirements of trainability, stability, and scalability progressively constrain learning systems toward a dynamical regime with no intrinsic spectral scale and a renormalizable coarse-grained description. Once learning dynamics enters this approximately scale-free and renormalizable region, power-law behavior is no longer a matter of modeling choice or empirical coincidence, but instead arises as a rigidity consequence of the underlying dynamics.

In addition to the theoretical analysis, we empirically probe the spectral transport operators induced during training and directly measure the resulting coarse-grained shell couplings. These experiments demonstrate that residual architectures approximately realize the log-shift invariance required by the theory, while structurally similar non-residual architectures do not. This contrast provides direct evidence that the sufficient conditions identified here capture meaningful architectural distinctions, rather than abstract or vacuous assumptions.

Importantly, the conditions identified here are sufficient but not necessary. We do not exclude the possibility that other mechanisms-such as stochastic optimization effects or alternative sources of effective regularization-may also lead to scaling behavior. Rather, our objective is to construct a principled framework in which renormalizability and power-law scaling can be derived from structural and dynamical assumptions, rather than postulated a priori.

A key feature of our analysis is that the sufficient conditions are expressed at the level of operator dynamics and gradient evolution, rather than being tied to a specific model family. As a result, they can be meaningfully interpreted across a range of modern architectures-including multilayer perceptrons, convolutional networks, transformers, and structured state-space models-provided these architectures satisfy the stated assumptions. We emphasize that this correspondence is not universal: the presence of architectural features aligned with our conditions does not imply that scaling must occur, but rather that scaling is structurally permitted within the GRSD framework.

From a broader perspective, our results position neural scaling laws within the classical theory of renormalization and large-scale dynamics. Just as power laws in turbulence or critical phenomena arise when microscopic details become irrelevant under coarse-graining, GRSD predicts power-law learning dynamics when the learning process itself admits a renormalizable and log-shift invariant closure compatible with gradient-flow covariance. Viewed through this lens, the observed success-and occasional failure-of scaling laws in deep learning reflects whether the underlying learning configuration lies within a renormalizable universality class.

The remainder of the paper is organized as follows. Section 2 reviews the GRSD framework and the formulation of shell dynamics. Section 3 states the sufficient conditions for renormalizable shell dynamics and presents the main theorem. Section 4 provides a concrete structural mechanism, based on residual learning, by which the most nontrivial condition can be realized. Section 5 interprets these conditions and discusses their relationship to common deep learning design choices. Section 6 discusses how common architectures align with the sufficient conditions. Section 7 presents empirical validation of the structural assumptions underlying the theory. Finally, Section 8 discusses limitations, non-universal regimes, and directions for future work.

In this section we briefly review the Generalized Resolution-Shell Dynamics (GRSD) framework, which provides the dynamical and spectral setting for the results of this paper. Our presentation is intentionally concise and focuses only on the aspects of GRSD that are required to formulate and analyze power-law renormalizability. We refer to [53] for a detailed and comprehensive treatment.

Consider a supervised learning problem with model parameters θ(t) trained by gradient-based optimization. Let J(t) denote the Jacobian of the model outputs with respect to parameters, viewed as a linear operator from parameter space to function space. We define the associated positive semidefinite operator

$$M(t) \;:=\; J(t)\, J(t)^{*},$$

which plays the role of a time-dependent kernel governing error dynamics. Operators of this form have been extensively studied in learning theory and operator-based analyses of neural networks [39,29]. GRSD analyzes learning dynamics through the spectral decomposition of M(t). Let λ denote the spectral coordinate of M(t), and introduce logarithmic spectral shells

$$S_\alpha \;:=\; \{\lambda \;:\; s_\alpha \le \log\lambda < s_{\alpha+1}\},$$

where {s_α} is a uniform partition of the log-spectrum. For each shell S_α, GRSD defines a shell energy E_α(t) corresponding to the error energy carried by modes within that spectral band. Passing to a shell-averaged continuum description yields an energy density ε(λ, t), defined as the piecewise-constant interpolation of E_α(t) over logarithmic shells. This representation enables a coarse-grained description of learning dynamics in which microscopic spectral details are suppressed while large-scale spectral structure is retained.
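
As a concrete illustration of the shell construction (our own sketch, not taken from the paper), the following Python snippet bins the spectrum of a toy operator M = J J^T into logarithmic shells of width h and accumulates the error energy carried by each shell; the helper name shell_energies and the toy dimensions are ours.

```python
import numpy as np

def shell_energies(eigvals, eigvecs, e, h=0.5):
    """Bin eigenvalues into logarithmic shells of width h (in s = log lambda)
    and accumulate the error energy E_alpha carried by each shell."""
    s = np.log(eigvals)                        # spectral coordinate s = log(lambda)
    edges = np.arange(s.min(), s.max() + h, h)
    idx = np.clip(np.digitize(s, edges) - 1, 0, len(edges) - 1)
    coeffs = eigvecs.T @ e                     # error components in the eigenbasis
    energy = np.zeros(len(edges))
    np.add.at(energy, idx, coeffs**2)          # E_alpha = squared error mass in shell alpha
    return edges, energy

# toy PSD operator M = J J^T and a random error vector
rng = np.random.default_rng(0)
J = rng.standard_normal((200, 400)) / np.sqrt(400)
M = J @ J.T
lam, U = np.linalg.eigh(M)
keep = lam > 1e-12                             # keep strictly positive modes
edges, E = shell_energies(lam[keep], U[:, keep], rng.standard_normal(200), h=0.5)
print(E.round(3))
```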

A central result of GRSD is that, under mild structural assumptions, the shell energies satisfy an approximate one-dimensional balance law of conservative form. Specifically, the shell-averaged energy density obeys

$$\partial_t\, \varepsilon(\lambda, t) \;+\; \partial_\lambda\, J(\lambda, t) \;=\; -\,D(\lambda, t),$$

where J(λ, t) is a spectral flux density describing energy transport across scales, and D(λ, t) collects dissipative contributions. This equation is the learning-theoretic analogue of energy cascade equations in turbulence and large-scale interacting systems [28,8,43]. The GRSD formulation does not, by itself, specify the functional form of the flux J(λ, t). Instead, it expresses learning as a transport process whose qualitative behavior depends on how J relates to the local energy density ε. A particularly simple and analytically tractable regime arises when the dynamics admits a renormalized velocity field

$$v(\lambda, t) \;:=\; \frac{J(\lambda, t)}{\varepsilon(\lambda, t)},$$

which depends on λ through a power law. In this case, the shell dynamics becomes renormalizable in the classical sense: under spectral coarse-graining, the functional form of the evolution equation is preserved up to rescaling. Importantly, such power-law renormalizability is not automatic. The GRSD framework allows for general, scale-dependent fluxes that may involve multiple characteristic scales or exhibit broken scaling behavior. Consequently, the appearance of power-law velocity fields reflects additional structural properties of the learning configuration rather than a generic consequence of spectral coarse-graining.
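
The closure property described above can be checked numerically in a toy setting (our own construction): if the shell velocity is a power law in λ, merging adjacent logarithmic shells, i.e. coarse-graining by a factor of two in log λ, changes the velocity only by a constant factor and leaves the fitted exponent unchanged.

```python
import numpy as np

# fine logarithmic shells and a power-law velocity field v(lambda) = c * lambda**a
h = 0.1
s = np.arange(-6.0, 0.0, h)                   # shell centers in s = log(lambda)
c, a = 0.7, 1.3
v_fine = c * np.exp(a * s)

# coarse-grain: merge pairs of adjacent shells (width 2h), averaging the velocity
s_coarse = 0.5 * (s[0::2] + s[1::2])
v_coarse = 0.5 * (v_fine[0::2] + v_fine[1::2])

# exponents fitted in log coordinates: the functional form is preserved up to rescaling
a_fine = np.polyfit(s, np.log(v_fine), 1)[0]
a_coarse = np.polyfit(s_coarse, np.log(v_coarse), 1)[0]
print(round(a_fine, 6), round(a_coarse, 6))   # both equal the exponent a
```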

The goal of this paper is to make this distinction explicit. Rather than postulating power-law renormalizability as an assumption, we ask under what conditions on the learning configuration-encompassing architecture, initialization, optimization stability, and statistical scale relations-the GRSD shell dynamics necessarily admits a power-law renormalized description. The next section formalizes this question and presents a set of sufficient conditions under which such renormalizability can be rigorously established.

This section formulates a set of sufficient conditions under which the GRSD shell dynamics admits a power-law renormalized description. We begin by formalizing the notion of a learning configuration and its associated Jacobian dynamics. We then state the sufficient conditions and present the main theorem. Interpretation and architectural implications are deferred to Section 5.

We consider supervised learning dynamics parameterized by time t ∈ [0, T ], induced by a learning configuration consisting of: (i) a network architecture specifying a parameter-to-function map, (ii) an initialization scheme for the parameters, and (iii) an optimization trajectory generated by gradient-based training.

Let J(t) denote the Jacobian of the model outputs with respect to parameters, viewed as a linear operator from parameter space to function space. As in GRSD, we define the associated positive semidefinite operator

$$M(t) \;:=\; J(t)\, J(t)^{*},$$

which governs the instantaneous learning dynamics in function space. Operator evolutions of this form arise naturally in both kernel limits and feature-learning regimes [23,32,5,53].

To capture structural locality in the gradient computation graph, we decompose the Jacobian into a collection of blocks,

$$J(t) \;=\; \big(J^{(1)}(t),\, J^{(2)}(t),\, \ldots,\, J^{(L)}(t)\big),$$

where each block corresponds to a contiguous subgraph of the computation graph (e.g., layers, modules, or residual units). This decomposition is assumed to be fixed throughout training and serves only as a bookkeeping device for expressing locality and coherence properties. No assumption is made that the blocks correspond to distinct spectral shells. The evolution of J(t) along the optimization trajectory induces an evolution of M(t) and, through its spectrum, the GRSD shell dynamics reviewed in Section 2. Our goal is to characterize conditions under which this induced shell dynamics is renormalizable with a power-law velocity field.
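
For concreteness, a minimal sketch (ours, with arbitrary toy dimensions) of the block bookkeeping: the blocks J^{(l)} are concatenated along the parameter axis, and the induced operator M = J J^* is the sum of the per-block positive semidefinite contributions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_out, widths = 64, [32, 48, 48, 32]          # function-space dim, per-block parameter counts

# Jacobian blocks J^(l): (function space) x (parameter block)
blocks = [rng.standard_normal((n_out, w)) / np.sqrt(w) for w in widths]

# the full Jacobian is the concatenation of the blocks over the parameter axis
J = np.concatenate(blocks, axis=1)

# M = J J^* decomposes additively over blocks
M_full = J @ J.T
M_blockwise = sum(Jl @ Jl.T for Jl in blocks)
print(np.allclose(M_full, M_blockwise))       # True
```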

We now state a set of sufficient conditions under which power-law renormalizability of GRSD shell dynamics can be rigorously derived. These conditions constrain the learning configuration at the level of computation graph structure, initialization geometry, training stability, and scale consistency. They are not claimed to be necessary, nor are they asserted to hold universally across all deep learning settings.

Condition 1 (Graph-banded Jacobian evolution). There exists a constant K = O(1) such that for each block index l and all t ∈ [0, T ], the instantaneous Jacobian evolution couples block l only to blocks within graph distance K,

$$\dot J^{(l)}(t) \;=\; \sum_{p\,:\,|p-l|\le K} A_{lp}(t)\, J^{(p)}(t),$$

for suitable operator coefficients A_{lp}(t).

Condition 1 expresses locality of gradient propagation in the computation graph. It allows for recurrent or cyclic dependencies provided that their influence remains confined to a bounded neighborhood and does not induce long-range instantaneous coupling across blocks.

Condition 2 (Weak functional incoherence at initialization). There exists a nonnegative sequence {ε_k}_{k≥1} with $\sum_{k\ge 1} \varepsilon_k < \infty$ such that the operator inner products between Jacobian blocks at initialization satisfy

$$\big|\big\langle J^{(l)}(0),\, J^{(m)}(0)\big\rangle\big| \;\le\; \varepsilon_{|l-m|}$$

for all l, m.

Condition 2 requires that long-range correlations between Jacobian blocks at initialization are summable. It does not preclude structured local dependencies or short-range correlations, and is naturally satisfied by a wide class of random or weakly correlated initializations.

Condition 3 (Controlled Jacobian evolution). The Jacobian and its instantaneous time derivative remain uniformly bounded in operator norm along the optimization trajectory: there exists a constant $C_J < \infty$ such that $\sup_{t \in [0,T]} \max\{\|J(t)\|_{\mathrm{op}},\, \|\dot J(t)\|_{\mathrm{op}}\} \le C_J$. Moreover, on any intermediate spectral window away from the extreme edges, the Jacobian-induced operators and their spectral projectors evolve without abrupt reorganization: all quadratic statistics entering the GRSD closure remain uniformly bounded, admit Lipschitz dependence in logarithmic spectral coordinates, and their shell-wise averages concentrate around their expectations at rates compatible with the width and depth scaling limits.

Condition 3 enforces stability and regularity of the learning trajectory. Beyond uniform operator norm control, it rules out catastrophic spectral rearrangements and ensures that the statistical quantities required for renormalized shell closure are well-defined and self-averaging along the optimization path.

Assumptions of this type-uniformly bounded Jacobians/gradients along the optimization trajectory, sometimes supplemented by smoothness controls-are standard in analyses of overparameterized learning and trainability; see, e.g., bounded-Jacobian conditions in [45] and Jacobian-spectrum stability via dynamical isometry in [38,44].

Condition 4 (Log-shift invariance of renormalized shell couplings). Let s = log λ, and let {B_i} denote logarithmic spectral shells of fixed width h in s, centered at s_i = ih. Let Ω_ij denote the shell-level renormalized coupling, obtained by coarse-graining the mode-level operator Ṁ over shells B_i and B_j.

On an intermediate spectral window, the shell-level statistics satisfy log-shift invariance, i.e. Ω_ij follows a scale-free structure,

$$\Omega_{ij} \;=\; K_h\big((j-i)\,h\big) \;+\; \mathrm{err}_{ij},$$

where K_h depends only on the relative log-scale separation, and the residual terms err_ij vanish in the joint limit of large model size and fine shell resolution.

Condition 4 requires the absence of any intrinsic absolute scale in the renormalized spectral couplings beyond relative separations in logarithmic scale. It forces the log-spectral drift $\dot s = \dot\lambda/\lambda$ to be invariant to the absolute scale of λ (a detailed proof is provided in A.2). Unlike Conditions 1-3, this condition does not follow from generic stability or locality considerations and constitutes a genuinely nontrivial structural requirement.

We emphasize that Condition 4 is not intended as a verifiable microscopic assumption with necessary and sufficient criteria. Rather, it should be interpreted as an effective principle characterizing learning configurations that operate near a renormalization fixed point in the GRSD sense. In practice, such configurations may be approached through architectural design (e.g., residual parameterizations), optimization choices, data preprocessing, and hyperparameter tuning, without being exactly realized. In this sense, Condition 4 plays a role analogous to idealized principles such as reversibility in Carnot engines or scale invariance at criticality: it defines a limiting structure that constrains admissible large-scale dynamics when approximately satisfied.

Together, Conditions 1-4 ensure that GRSD shell dynamics admits a closed, scale-consistent, and renormalizable description under logarithmic spectral coarse-graining.

We are now ready to state the main result of this paper.

Theorem 1 (Power-law renormalizability of GRSD shell dynamics). Fix a learning configuration and consider the induced GRSD shell dynamics defined on logarithmic spectral shells. Suppose Conditions 1-4 hold on a training horizon t ∈ [t 0 , T ]. Assume in addition the standard GRSD structural property that intra-shell couplings are antisymmetric, so that shell-internal transfers cancel in the energy balance.

Then the renormalized GRSD velocity field

$$v(\lambda, t) \;:=\; \frac{J(\lambda, t)}{\varepsilon(\lambda, t)}$$
is uniquely constrained to the power-law form

$$v(\lambda, t) \;=\; c_0\, \lambda$$
for a scalar coefficient $c_0 = v(\lambda, t_0)/\lambda$.

Remark 1 (Effective time and learning-rate schedules). Throughout this work, the time variable t denotes the effective time associated with the continuous-time gradient flow

$$\dot\theta(t) \;=\; -\nabla_\theta L(\theta(t)),$$
which corresponds to a constant learning rate and no explicit scheduler. All covariance and scaling properties used in our analysis are derived directly from this ODE. When a learning-rate schedule η(t) is present, the parameter dynamics take the form $\dot\theta = -\eta(t)\,\nabla_\theta L$ and can be mapped to the above gradient-flow form by introducing a reparameterized time [34]

$$\tau(t) \;:=\; \int_0^{t} \eta(s)\, \mathrm{d}s.$$
All results in this paper then apply with t replaced by the effective time τ .
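
A small numerical sketch of the remark (the cosine schedule is our own example, not taken from the paper): the effective time is the running integral of the learning-rate schedule.

```python
import numpy as np

T, steps = 100.0, 1001
t = np.linspace(0.0, T, steps)
eta = 0.1 * 0.5 * (1.0 + np.cos(np.pi * t / T))    # example cosine schedule (assumed)

# effective time tau(t) = integral_0^t eta(s) ds via the cumulative trapezoidal rule
dtau = 0.5 * (eta[1:] + eta[:-1]) * np.diff(t)
tau = np.concatenate([[0.0], np.cumsum(dtau)])
print(tau[-1])    # total effective gradient-flow time accumulated by the schedule
```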

We briefly outline the structure of the proof; technical details are deferred to the appendix. First, Conditions 1-3 imply that locality and summable incoherence of Jacobian blocks propagate along the training trajectory. At the level of the error dynamics ė = -M e with M = JJ^*, a Taylor expansion around a fixed log-shell basis shows that direct off-diagonal error exchange between shells separated by more than one shell index arises only at second order in variations of M [46]. Consequently, the leading-order shell dynamics is governed solely by nearest-neighbor boundary fluxes, while nonlocal contributions enter only as subleading renormalizations (see Appendix A.2).

Second, Condition 4 implies homogeneity of shell-averaged energy and flux densities under spectral rescaling. Finally, homogeneity of the ratio v(λ, t) = J(λ, t)/ε(λ, t) enforces a power-law functional form. Detailed proofs are provided in Appendix A.2.

Condition 4 plays a fundamentally different role from Conditions 1-3. While the latter encode generic locality, stability, and regularity properties that are naturally satisfied by a wide class of modern block-stacked architectures, Condition 4 imposes a genuinely nontrivial structural requirement: the absence of any intrinsic absolute scale in the renormalized spectral couplings beyond relative separations in logarithmic scale.

In particular, log-shift invariance does not follow from graph locality, controlled optimization paths, or statistical self-averaging alone. Establishing this condition therefore requires additional architectural structure. In this section, we show that residual learning, when combined with sufficient depth, provides a concrete mechanism by which Condition 4 can be rigorously realized.

We begin by formalizing the Jacobian structure induced by residual learning configurations at the level of parameters.

Let θ = (θ_1, . . . , θ_L) denote the decomposition of model parameters into depth-indexed blocks, and let f(θ) denote the model output. The full parameter Jacobian is defined as

$$J(\theta) \;:=\; \frac{\partial f(\theta)}{\partial \theta},$$

which we view as a linear operator from parameter space to function space. Throughout this section, we adopt the block decomposition $J = \big(J^{(1)}, J^{(2)}, \ldots, J^{(L)}\big)$ with $J^{(\ell)} := \partial f / \partial \theta_\ell$, corresponding to concatenation over parameter blocks.

Layerwise parameter gradients and Jacobian factors. In residual architectures, the contribution J^{(ℓ)} of the ℓ-th block admits an explicit factorization reflecting the forward- and backward-propagation structure. Let L denote the training loss and h_L the final hidden representation. For each residual block, write the state-to-state Jacobian

$$A_k \;:=\; \frac{\partial h_k}{\partial h_{k-1}}.$$

Then the layerwise parameter gradient admits the factorization

$$\nabla_{\theta_\ell} L \;=\; \Big(\frac{\partial F_\ell}{\partial \theta_\ell}\Big)^{\!*} \Big(\prod_{k=\ell+1}^{L} A_k\Big)^{\!*} \nabla_{h_L} L,$$

where F_ℓ denotes the residual branch of the ℓ-th block. Thus, while the full Jacobian J is obtained by concatenation, each layerwise Jacobian J^{(ℓ)} involves the same multiplicative Jacobian structure analyzed in earlier residual learning arguments, with left and right factors corresponding to downstream and upstream propagation, respectively.

Residual form. In a residual learning configuration, the state Jacobian takes the near-identity form

$$A_k \;=\; I + \varepsilon\, G_k,$$

where 0 < ε ≤ ε_0 and ‖G_k‖_op is uniformly controlled along the optimization path.

Additive GRSD operator. Collect the layerwise contributions by concatenation,

$$J \;=\; \big(J^{(1)}, J^{(2)}, \ldots, J^{(L)}\big),$$

and define the associated self-adjoint operator

$$M \;:=\; J\, J^{*}.$$

By concatenation, M decomposes additively as

$$M \;=\; \sum_{\ell=1}^{L} J^{(\ell)} J^{(\ell)*} \;=:\; \sum_{\ell=1}^{L} M^{(\ell)},$$

i.e., M is the sum of the self-adjoint contributions induced by individual layerwise parameter Jacobians.

We now state the main structural result of this section. It shows that residual learning configurations, once sufficiently deep, satisfy Condition 4 through a two-stage mechanism: first, beyond a finite depth threshold, individual layerwise Jacobian contributions become depth-stationary on logarithmic spectral scales; second, the additive structure of the GRSD operator dilutes finite-depth inhomogeneities once the total depth is sufficiently large.

Theorem 2 (Residual learning induces log-shift invariance beyond a depth threshold). Consider a residual learning configuration in the sense of Section 4.1 and fix an error tolerance η ∈ (0, 1). For any layer ℓ, if the total depth L satisfies

$$L \;\ge\; \max\Big\{ C_1\, \varepsilon^{-2}\, \eta^{-2},\; C_2\, \varepsilon^{-2}\, \log(1/\eta) \Big\},$$

where the constants C_1, C_2 depend only on the residual block law and not on ℓ, then Condition 4 holds for $M^{(\ell)}$ on the intermediate spectral window. In particular, the renormalized log-bin shell coupling statistics depend, up to O(η) error, only on relative log-scale separations.

Proof sketch. The residual form $A_k = I + \varepsilon G_k$ induces additive log-spectral increments across depth. Under the absence of depth-wise parameter sharing and the controlled path regularity of Condition 3, these increments define a stationary Markov-additive process on spectral directions. Two mechanisms govern the depth threshold. First, Berry-Esseen type bounds [4] yield self-averaging of accumulated log-spectral increments once $L \gtrsim \varepsilon^{-2}\eta^{-2}$. Second, the near-identity residual perturbations induce quantitative mixing on the projective space, leading to isotropization of spectral directions on a scale $L \gtrsim \varepsilon^{-2}\log(1/\eta)$. Beyond the larger of these two thresholds, bin-averaged coupling statistics become translation-invariant in logarithmic scale. A complete proof is provided in Appendix A.3.
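
The self-averaging mechanism in the proof sketch can be illustrated with a toy product of near-identity random matrices (our own construction; the dimension and the Gaussian block law are arbitrary choices): the per-depth log-spectral increment of the accumulated product settles down as depth grows, which is the depth-stationarity exploited above.

```python
import numpy as np

rng = np.random.default_rng(2)
d, eps, depth = 32, 0.05, 400

P = np.eye(d)
increments = []
for k in range(1, depth + 1):
    G = rng.standard_normal((d, d)) / np.sqrt(d)   # toy residual-branch Jacobian G_k
    P = (np.eye(d) + eps * G) @ P                  # accumulate A_k = I + eps * G_k
    sv = np.linalg.svd(P, compute_uv=False)
    increments.append(np.log(sv).mean() / k)       # per-depth mean log-spectral increment

# the running per-depth increment self-averages (fluctuations shrink with depth)
print(np.round(increments[9::78], 6))
```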

Theorem 2 identifies residual learning as a mechanism by which individual layerwise Jacobian contributions become asymptotically free of absolute scale information once depth exceeds a finite threshold. Its proof relies in an essential way on the near-identity residual parameterization and depth-wise statistical stationarity. The emergence of directionally homogeneous behavior under repeated small random perturbations is consistent with classical results on random matrix products and Markovian mixing on projective spaces [9,33]. As a result, the argument does not directly extend to architectures with explicit gating or depth-dependent weighting, nor to general non-skip architectures. This limitation should not be interpreted as ruling out scaling behavior in such models; rather, it indicates that different analytical tools would be required to verify Condition 4 outside the residual setting.

We now show that the additive structure of the GRSD operator converts this bulk stationarity into global log-shift invariance.

Proposition 1 (Depth averaging yields Condition 4). Assume the hypotheses of Theorem 2. Fix a tolerance η ∈ (0, 1) and an intermediate spectral window as in Condition 4. Let ℓ_* be any depth index such that for every ℓ > ℓ_*, the layerwise operator $M^{(\ell)} := J^{(\ell)} J^{(\ell)*}$ satisfies Condition 4 on the intermediate window with error at most O(η) (as guaranteed by Theorem 2).

Decompose

$$M \;=\; \sum_{\ell \le \ell_*} M^{(\ell)} \;+\; \sum_{\ell > \ell_*} M^{(\ell)},$$

splitting the operator into a boundary part and a bulk part. Then there exists a constant C > 0, depending only on the uniform bounds in Condition 3, such that if $L \ge C\, \ell_*\, \eta^{-1}$, the boundary layers ℓ ≤ ℓ_* contribute at most O(η), in relative terms, to the shell-level coupling statistics on the intermediate window. For the bulk layers, Theorem 2 shows that each individual layer already satisfies log-shift invariance on the intermediate window, up to an O(η) error. Averaging these bulk contributions over depth preserves the same log-shift invariant form, with the same O(η) accuracy.

Combining the two parts, the depth-averaged shell coupling statistics induced by M consist of a log-shift invariant bulk term plus errors of order $O(\eta) + O(\ell_*/L)$. Choosing L sufficiently large compared to ℓ_*, specifically $L \gtrsim \ell_*\, \eta^{-1}$, makes the boundary dilution error comparable to the bulk error. This yields Condition 4 for the full operator M on the intermediate spectral window.

Proposition 1 completes the structural argument. Log-shift invariance emerges not from exact scale symmetry at finite depth, but from depth averaging over a stationary bulk of residual blocks, with finite-depth inhomogeneities rendered negligible once the network exceeds a controlled depth threshold.

In this section we interpret the sufficient conditions introduced in Section 3 and clarify their conceptual meaning. Our goal is not to justify these conditions empirically or to argue that they hold universally in deep learning, but rather to explain what structural and dynamical properties they encode. Throughout, we emphasize that these conditions are sufficient but not necessary for power-law renormalizability of the GRSD shell dynamics.

Condition 1 should be interpreted as a boundedness condition on the evolution of the gradient computation graph, rather than as a strict locality or acyclicity assumption. It requires that the instantaneous evolution of each Jacobian block can be expressed using only a uniformly bounded neighborhood of other blocks, with coefficients that remain controlled along the training trajectory.

A canonical example of a bounded computation graph is a directed acyclic graph (DAG) [1,36], such as those induced by standard feedforward networks, residual architectures, or transformers. In these settings, gradient propagation is naturally confined by the depth of the graph, and the influence of any parameter update cannot spread arbitrarily far in an infinitesimal time step. As a result, the induced Jacobian evolution is uniformly bounded in the sense required by Condition 1.

By contrast, recurrent architectures with unconstrained feedback loops may violate this boundedness condition. In particular, when cyclic structures contribute divergent amplification-such as repeated multiplication by unstable recurrent weights-the Jacobian evolution may accumulate unbounded long-range couplings. Classical recurrent neural networks (RNN) [21] and certain long short-term memory (LSTM) [19] configurations fall into this category when their recurrent dynamics is not properly controlled. In such cases, the GRSD coarse-graining procedure may fail, as infinitesimal updates can induce global spectral reorganization.

Importantly, the presence of cycles alone does not preclude boundedness. Recurrent or iterative architectures may still satisfy Condition 1 provided that the contribution of cycles is contractive or effectively damped. For example, when recurrent interactions decay exponentially [14,41,13,37,50], the cumulative influence of long feedback paths remains finite, and Jacobian evolution can be bounded despite the existence of loops. From the GRSD perspective, such architectures behave similarly to bounded-depth computation graphs at the level of infinitesimal learning dynamics. We provide a precise proposition and a proof sketch for this case in Appendix 2.

Viewed in this light, Condition 1 is best understood as excluding learning configurations in which gradient propagation is unbounded or unstable, rather than as forbidding recurrence or complex connectivity per se. It delineates the class of computation graphs for which shell-wise coarse-graining remains well-defined over the training horizon.

Condition 2 constrains the geometry of the Jacobian at initialization. It requires that correlations between distant Jacobian blocks, measured through operator inner products, decay sufficiently fast so that their tail is summable. Modern zero-mean random i.i.d. initializations [11,15,27] commonly used in deep learning naturally satisfy the condition of initial gradient incoherence, in the sense that gradient components associated with distinct functional modes are uncorrelated at initialization.

This assumption does not require strict independence or orthogonality between blocks. Local correlations are permitted, and even expected, as a consequence of shared inputs, architectural coupling, or structured parameterization. The essential requirement is that long-range correlations are weak enough to prevent the formation of coherent global modes at initialization.

Within the GRSD framework, initial functional incoherence plays a role analogous to short-range correlations in statistical physics. It ensures that coarse-grained shell dynamics is not dominated by fine-tuned cancellations or global alignments present at initialization. As shown in the proof of Theorem 1, this condition propagates along the training trajectory under the locality and stability assumptions.

Condition 3 imposes a stability constraint on the learning trajectory. It requires that both the Jacobian and its instantaneous time derivative remain uniformly bounded over the training horizon.

Conceptually, this condition ensures that learning proceeds through a sequence of controlled, incremental updates rather than through abrupt transitions. From the perspective of GRSD, it guarantees that temporal evolution and spectral coarse-graining are compatible: the shell dynamics remains well-defined and does not develop singular behavior over finite time.

This assumption should be viewed as an abstraction of common stability-inducing practices in modern deep learning, such as normalization, residual connections, and conservative optimization schedules [45,38,44]. However, the condition itself is agnostic to the specific optimization algorithm and does not rely on stochasticity or noise-induced regularization.

Condition 4 encodes the absence of any intrinsic absolute scale in the renormalized shell couplings. Unlike pointwise equivariance or parameter-level symmetries, this condition is formulated at the level of coarse-grained spectral statistics: it requires that, on the intermediate spectral window, renormalized shell interactions depend only on relative separations in logarithmic scale and not on absolute spectral position.

Crucially, log-shift invariance should be understood as a structural constraint rather than a generic consequence of stability or locality. It rules out not only explicit scale parameters, but also implicit time-dependent reference scales that could emerge during training and induce drift of the effective shell dynamics. In this sense, Condition 4 is a genuinely nontrivial requirement that restricts the admissible large-scale behavior of the renormalized dynamics.

Within the GRSD framework, log-shift invariance alone enforces homogeneity of the renormalized shell equations, but does not by itself fix the functional form of the shell velocity. The emergence of a power-law velocity is instead a rigidity phenomenon: when log-shift invariance is combined with the intrinsic time-rescaling covariance of gradient flow, all admissible functional forms except pure power laws are excluded. Such rigidity phenomena, in which scale invariance combined with consistency constraints restricts admissible large-scale behavior to power laws, are well known in renormalization-group analyses of critical phenomena and turbulent transport [24,49,8]. If Condition 4 fails, the shell dynamics may exhibit scale-dependent drift, moving spectral reference points, or broken scaling behavior, even if the remaining conditions are satisfied.

Taken together, the four conditions introduced in Section 3 enforce a learning dynamics that is local in the computation graph, weakly correlated at initialization, stable along the optimization path, and free of intrinsic absolute scales in the renormalized spectral description. These properties are precisely those required for the GRSD shell dynamics to admit a well-defined and renormalizable closure under logarithmic spectral coarse-graining.

We emphasize that none of the conditions alone is sufficient to guarantee power-law behavior. Conditions 1-3 ensure the existence and stability of the renormalized shell dynamics, while Condition 4 restricts its large-scale structure by enforcing log-shift invariance. Power-law scaling then emerges as a rigidity consequence when these structural properties are combined with the intrinsic time-rescaling covariance of gradient flow. Under this combination, the GRSD framework predicts that the only admissible large-scale form of the shell velocity is a power law.

The sufficient conditions introduced in Section 3 are formulated at the level of operator dynamics and gradient evolution, rather than in terms of specific model classes. As a result, they can be meaningfully interpreted across a broad range of modern architectures. At the same time, the conditions-especially Condition 4-are structural and nontrivial, and are not generically guaranteed by architectural design alone. In this section, we discuss how common deep learning architectures align with the sufficient conditions, and where their limitations arise. Throughout, we emphasize that the discussion is interpretive rather than universal: the presence of architectural features consistent with some conditions does not imply that power-law scaling must occur, but rather that it is not a priori excluded within the GRSD framework.

Multilayer perceptrons (MLPs) [40] provide the simplest setting in which the sufficient conditions can be examined. Their feedforward structure naturally induces locality in the gradient computation graph, consistent with Condition 1. Moreover, standard random initialization schemes typically lead to weak functional correlations between distant layers in sufficiently wide networks, aligning with Condition 2 in early training regimes.

When trained with stable optimization procedures, MLPs may also satisfy Condition 3 over substantial training horizons. However, the absence of an explicit near-identity accumulation mechanism means that Condition 4 is not generically enforced in MLPs. As a result, while MLPs often admit a renormalizable shell description, power-law behavior cannot be structurally guaranteed and may depend sensitively on training dynamics, data distribution, or implicit regularization effects.

Convolutional networks introduce additional structure through spatial locality and weight sharing. These features further reinforce locality in the gradient computation graph and typically support Condition 1. Standard initialization and normalization practices can also promote weak long-range functional correlations at initialization and stable Jacobian evolution during training.

From the perspective of GRSD, convolutional architectures often admit a well-defined renormalized shell dynamics. However, as with MLPs, log-shift invariance of renormalized shell couplings is not generically guaranteed. Architectural choices such as aggressive pooling, explicit bottlenecks, or strongly scale-dependent preprocessing may introduce effective spectral reference scales, leading to deviations from simple power-law behavior even when the remaining conditions are approximately satisfied.

An important exception arises in convolutional architectures equipped with residual parameterizations. Residual convolutional networks [16] introduce an explicit near-identity accumulation mechanism across depth, which aligns closely with the structural setting analyzed in Section 4. In this case, log-shift invariance of the renormalized shell couplings can be established under additional assumptions, rather than merely postulated. The residual parameterizations, as introduced in ResNet architectures [16], have also been shown to promote stable Jacobian spectra across depth [44].

Transformer architectures combine multiple computational modules within each block and rely heavily on normalization and residual connections. These design choices promote stable training dynamics and help maintain bounded Jacobian norms over long training horizons, consistent with Condition 3 [48,3].

The effective gradient propagation distance per training step in transformers remains limited, supporting the locality requirement of Condition 1. At initialization, widely used random parameterizations lead to weak functional correlations between distant blocks in sufficiently wide settings. Nevertheless, the presence of residual connections alone does not automatically imply log-shift invariance of renormalized shell couplings. Whether Condition 4 is realized depends on additional structural and dynamical properties, such as the near-identity nature of residual accumulation and the absence of depth-dependent gating or weighting effects.

Accordingly, while transformers empirically exhibit robust scaling behavior across many regimes, deviations from power-law scaling are naturally interpreted within GRSD as violations of one or more sufficient conditions, rather than as contradictions of the framework.

Structured state-space models and related architectures [14,41,13] introduce explicit recurrence or long-range temporal structure. From the perspective of Condition 1, such models are compatible with the sufficient conditions provided that recurrent interactions remain effectively contractive, so that Jacobian evolution does not induce unbounded long-range coupling.

When recurrence is stabilized through normalization, decay mechanisms, or controlled parameterization, the resulting Jacobian dynamics may remain localized and amenable to renormalized shell descriptions. However, strong recurrence or unstable feedback can violate locality or stability assumptions, leading to nonrenormalizable shell dynamics and the breakdown of simple scaling behavior. In such cases, GRSD predicts deviations from power-law scaling as a natural consequence of the learning configuration.

Across architectures, the sufficient conditions identified in this work do not single out a specific model class. Rather, they characterize a regime of learning dynamics in which locality, weak long-range correlations, stability, and the absence of intrinsic spectral reference scales coexist. While modern architectures frequently incorporate design choices that support some of these properties, none guarantees them in isolation.

From the GRSD perspective, the empirical success of neural scaling laws reflects the prevalence of learning configurations that lie within this renormalizable regime. Equally, observed deviations from power-law behavior signal departures from the sufficient conditions-particularly from Condition 4-rather than failures of the GRSD framework itself.

The theory developed in this work relies on three intertwined empirical assumptions about the spectral organization of learning dynamics:

  1. Controlled inter-shell coupling. Spectral shells interact weakly, so that cross-shell transport acts as a subleading and stable perturbation.

  2. Necessity of shell coarse-graining. Intra-shell mixing is strong compared to inter-shell transport, making coarse-graining over spectral shells essential for any effective description.

  3. Approximate shift-invariance across shells. In residual architectures, the induced spectral transport operator becomes approximately Toeplitz in an intermediate spectral window, reflecting approximate shift-invariance across shells.

In this section, we provide minimal yet direct empirical evidence for all three claims using controlled experiments on CIFAR-10 [30]. We emphasize that the goal is not to exhaustively characterize training dynamics, but to validate the structural assumptions required for a closed, shell-based description of learning.

We evaluate the spectral coupling structure induced by training dynamics on image classification using CIFAR-10. All experiments are conducted with ResNet-18 and a corresponding plain (non-residual) architecture with identical depth and channel configuration.

Models. The residual model is a standard ResNet-18 [16]. The plain counterpart (PlainNet-18) is constructed by removing all residual skip connections while keeping convolutional blocks, batch normalization, and parameter counts identical. Both models share the same trained parameter checkpoints.

Dataset and checkpoints. We use the CIFAR-10 validation set with batch size B = 64. Unless otherwise stated, all spectral operators are evaluated at a fixed training checkpoint using a local finite-difference approximation of the operator derivative (see below), ensuring that both architectures are compared at the same effective training stage.

Error representation. The error vector e(x; θ) is constructed from model logits using the probability-minus-one-hot formulation, $e = p_\theta(x) - \mathrm{onehot}(y)$, flattened over batch and class dimensions. This choice aligns the error space with the Fisher-like geometry of the classification loss and is used consistently across all experiments.
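
A minimal sketch of this error representation (helper name and toy shapes are ours):

```python
import numpy as np

def error_vector(logits, labels):
    """e = softmax(logits) - onehot(labels), flattened over batch and class dims."""
    z = logits - logits.max(axis=1, keepdims=True)       # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    onehot = np.eye(logits.shape[1])[labels]
    return (p - onehot).reshape(-1)

rng = np.random.default_rng(3)
logits = rng.standard_normal((64, 10))     # batch of 64, 10 CIFAR-10 classes
labels = rng.integers(0, 10, size=64)
e = error_vector(logits, labels)
print(e.shape)                             # (640,) = batch * classes
```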

Our goal is to probe the spectral structure of the operator

$$M \;:=\; \mathbb{E}\big[\, J\, J^{*} \,\big],$$

where J = ∂e/∂θ is the Jacobian of the error vector with respect to model parameters and the expectation is taken over input mini-batches.

Monte Carlo operator evaluation. The expectation defining M is approximated using a small number (S = 4) of fixed mini-batches sampled from the validation set. Matrix-vector products M v are computed without explicitly forming M , using a Jacobian-vector product followed by a vector-Jacobian product (JVP-VJP), enabling scalable operator access.

Block Lanczos projection. We apply a block Lanczos algorithm [12,47] with block size b = 8 and k = 16 steps, resulting in a Krylov subspace of dimension r = bk = 128. This produces a block-tridiagonal matrix T whose eigenpairs {(λ_i, u_i)} approximate the spectrum of M in the probed subspace.
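
The matrix-free access pattern can be sketched as follows (our own simplified stand-in: a plain Lanczos via SciPy on a single explicit toy Jacobian, rather than the paper's block Lanczos with b = 8, k = 16 and autodiff JVP/VJP primitives, where J is never materialized):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

rng = np.random.default_rng(4)
n_err, n_param = 640, 5000
J = rng.standard_normal((n_err, n_param)) / np.sqrt(n_param)   # stands in for de/dtheta

def jvp(v):                  # J v : parameter space -> error space
    return J @ v

def vjp(u):                  # J^T u : error space -> parameter space
    return J.T @ u

def matvec(u):               # M u = J (J^T u); M is never formed explicitly
    return jvp(vjp(u))

M_op = LinearOperator((n_err, n_err), matvec=matvec, dtype=np.float64)
lam, U = eigsh(M_op, k=32, which='LM')       # leading eigenpairs in the probed subspace
print(np.sort(lam)[-5:])                     # largest approximate eigenvalues of M
```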

Directional derivative of the operator. To probe spectral transport, we compute a finite-difference approximation of the operator derivative

$$\dot M \;\approx\; \frac{M(\theta + \varepsilon\, g) \;-\; M(\theta)}{\varepsilon}, \qquad g := -\nabla_\theta L,$$

where ε is a small step along the gradient direction. The projected operator is then formed in the basis of approximate eigenvectors returned by block Lanczos,

$$\Omega_{ij} \;:=\; \big\langle u_i,\; \dot M\, u_j \big\rangle.$$

More details regarding the estimation algorithm are included in Appendix B.
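
Continuing the toy setup above, a hedged sketch of the finite-difference probe and the projection (the parameter dependence of J below is an arbitrary illustrative choice; the paper's actual estimator is described in its Appendix B):

```python
import numpy as np

rng = np.random.default_rng(5)
n_err, n_param = 200, 400
A0 = rng.standard_normal((n_err, n_param)) / np.sqrt(n_param)
A1 = rng.standard_normal((n_err, n_param)) / np.sqrt(n_param)
c = rng.standard_normal(n_param) / np.sqrt(n_param)

def M_of(theta):
    # toy parameter dependence J(theta) = A0 + (c . theta) * A1, purely for illustration
    Jt = A0 + (c @ theta) * A1
    return Jt @ Jt.T

theta = rng.standard_normal(n_param)
g = -rng.standard_normal(n_param)            # stands in for the (negative) loss gradient
eps = 1e-3

# finite-difference directional derivative of M along the gradient direction
M0 = M_of(theta)
Mdot = (M_of(theta + eps * g) - M0) / eps

# projection onto leading eigendirections of M(theta): Omega_ij = <u_i, Mdot u_j>
lam, U = np.linalg.eigh(M0)
lam, U = lam[-64:], U[:, -64:]
Omega = U.T @ Mdot @ U
print(Omega.shape, bool(np.allclose(Omega, Omega.T)))   # (64, 64) True
```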

Eigenvalues of M are binned into logarithmically spaced spectral shells based on log λ. Given the projected operator Ω, we define a coarse-grained shell coupling matrix K by averaging squared couplings within each shell pair:

$$K_{ab} \;:=\; \mathrm{mean}\big\{\, \Omega_{ij}^{2} \;:\; \lambda_i \in \text{shell } a,\; \lambda_j \in \text{shell } b \,\big\}.$$

Toeplitz residual. To quantify the degree of shift-invariance across shells, we compare K to its best Toeplitz approximation $\mathcal{T}(K)$, obtained by averaging along diagonals. The normalized Frobenius residual

$$r_{\mathrm{Toep}} \;:=\; \frac{\big\| K - \mathcal{T}(K) \big\|_F}{\big\| K \big\|_F}$$

serves as our primary diagnostic: lower values indicate stronger Toeplitz structure, i.e. the coupling strength depends on the relative scale shift rather than on absolute scales.
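
A minimal sketch of these two diagnostics (helper names are ours; the paper additionally evaluates them over sliding spectral windows):

```python
import numpy as np

def shell_coupling(lam, Omega, n_shells=10):
    """Average squared couplings Omega_ij**2 over pairs of logarithmic shells."""
    edges = np.linspace(np.log(lam.min()), np.log(lam.max()) + 1e-9, n_shells + 1)
    idx = np.clip(np.digitize(np.log(lam), edges) - 1, 0, n_shells - 1)
    K = np.zeros((n_shells, n_shells))
    for a in range(n_shells):
        for b in range(n_shells):
            ma, mb = idx == a, idx == b
            if ma.any() and mb.any():
                K[a, b] = (Omega[np.ix_(ma, mb)] ** 2).mean()
    return K

def toeplitz_residual(K):
    """Normalized Frobenius distance between K and its diagonal-averaged Toeplitz fit."""
    n = K.shape[0]
    T = np.zeros_like(K)
    for d in range(-(n - 1), n):
        T += np.diag(np.full(n - abs(d), np.diagonal(K, offset=d).mean()), k=d)
    return np.linalg.norm(K - T) / np.linalg.norm(K)

# usage with a (lam, Omega) pair estimated as in the toy sketch above:
# r = toeplitz_residual(shell_coupling(lam, Omega))
```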

All scores are computed over sliding spectral windows. We focus on windows centered in the intermediate spectral regime, where the assumptions underlying our effective theory are expected to hold.

We empirically examine the diagonal components of the renormalized spectral coupling matrix, Ω_ii, to test the scale-free diagonal structure postulated in Condition 4. Recall that Condition 4 predicts that, on an intermediate spectral window, the diagonal terms satisfy a linear scaling relation

$$\Omega_{ii} \;\propto\; \lambda_i,$$

corresponding to scale-invariant drift in logarithmic spectral coordinates.

Figure 1 shows log-log scatter plots of Ω ii versus λ i for both architectures, together with linear least-squares fits in logarithmic coordinates. In both cases, the data exhibit a clear power-law relation over a broad intermediate spectral range, consistent with the scale-free diagonal ansatz of Condition 4.
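
A sketch of how such a slope and coefficient of determination can be extracted (our own minimal implementation, not the paper's fitting code; the synthetic data below is purely illustrative):

```python
import numpy as np

def loglog_fit(lam_diag, omega_diag):
    """Least-squares fit of log|Omega_ii| against log(lambda_i); returns (slope, R^2)."""
    x, y = np.log(lam_diag), np.log(np.abs(omega_diag) + 1e-30)
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    r2 = 1.0 - resid.var() / y.var()
    return slope, r2

# synthetic check: data generated with slope 1 plus noise recovers slope ~ 1, R^2 ~ 1
rng = np.random.default_rng(7)
lam_i = np.exp(rng.uniform(-6.0, 0.0, size=200))
omega_ii = 0.3 * lam_i * np.exp(0.05 * rng.standard_normal(200))
print(loglog_fit(lam_i, omega_ii))
```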

Quantitatively, the ResNet model displays an almost perfectly linear scaling with fitted slope close to unity and an excellent coefficient of determination ($R^2 \approx 0.99$). In contrast, while PlainNet also shows approximate power-law behavior, its fitted slope deviates more noticeably from unity and exhibits increased scatter, reflected in a lower $R^2$ value. This indicates that the diagonal scaling relation is significantly more stable and coherent in the presence of residual connections.

Interpretation. The near-linear relation $\Omega_{ii} \sim \lambda_i$ observed in ResNet supports the assumption that diagonal spectral couplings introduce no intrinsic absolute scale beyond that already encoded in λ_i. This behavior is a necessary ingredient for scale-invariant log-spectral drift and underpins the emergence of renormalizable spectral dynamics. The degradation of this relation in PlainNet suggests that residual connections play a structural role in stabilizing diagonal scaling, thereby promoting the conditions required for log-shift invariance and controlled spectral transport.

Overall, these results provide direct empirical support for the diagonal component of Condition 4 and highlight the architectural dependence of scale-free spectral structure.

Residual networks exhibit a rapid and monotonic improvement in Toeplitz structure when approaching the intermediate spectral regime. In contrast, the plain architecture shows significantly weaker Toeplitz behavior with little sensitivity to window position. This provides direct empirical evidence that residual learning facilitates the approximate shift-invariance across spectral shells required by Condition 4.

Near the spectral edges, Toeplitz structure deteriorates for both models. This behavior is expected, as these regions lie outside the effective spectral and time window assumed by the theory, rather than indicating a failure of the condition itself.

This work identifies a set of sufficient conditions under which the GRSD shell dynamics admits a power-law renormalized description. While these conditions provide a rigorous route to power-law behavior, they are intentionally not presented as universal or exhaustive. In this section we discuss the scope of the results, their limitations, and directions for future investigation.

Empirical studies have repeatedly shown that neural scaling laws are not universal. Depending on architecture, data distribution, optimization stability, or training regime, power-law behavior may weaken, fragment into multiple regimes, or fail altogether. Within the GRSD framework, such deviations are naturally interpreted as violations of one or more sufficient conditions identified in this paper.

For example, strong long-range functional correlations at initialization may obstruct coarse-grained shell closure. Unstable optimization trajectories may generate abrupt spectral reorganizations that invalidate controlled Jacobian evolution. Similarly, the presence of intrinsic scales in the data or preprocessing pipeline may break statistical scale covariance. Rather than viewing these failures as anomalies, GRSD predicts them as signatures of non-renormalizable learning dynamics.

A central limitation of the present analysis is that the conditions identified here are sufficient but not necessary. We do not claim that all learning configurations exhibiting power-law scaling must satisfy these assumptions, nor do we rule out alternative mechanisms leading to scaling behavior.

In particular, stochastic optimization effects, noise-induced regularization, or implicit averaging over training trajectories may produce effective renormalization even when deterministic stability or scale covariance assumptions fail. Characterizing such mechanisms lies beyond the scope of the current work. Our results should therefore be understood as delineating a class of learning dynamics for which power-law renormalizability can be rigorously established, rather than as an exhaustive theory of all observed scaling laws.

The GRSD framework encompasses both kernel-like and feature-learning regimes, depending on the structure of the Jacobian evolution. In kernel limits, where M (t) remains close to its initialization, shell dynamics may become effectively frozen, leading to degenerate or trivial velocity fields. In contrast, feature-learning regimes permit nontrivial spectral transport and admit renormalized dynamics.

The sufficient conditions identified here are compatible with both perspectives, but they do not reduce to either limit. Instead, they describe a regime in which learning dynamics is sufficiently structured to admit coarse-grained closure, yet sufficiently flexible to allow nontrivial spectral flow. Understanding how these conditions interpolate between known limits remains an open direction for future work.

Several directions for further investigation emerge from this work. First, it would be valuable to develop empirical diagnostics that test the sufficient conditions directly, for example by measuring locality of Jacobian evolution or scale covariance of gradient statistics during training. Second, extending the analysis to stochastic optimization dynamics may clarify whether noise can relax or replace certain stability assumptions. Finally, exploring non-renormalizable regimes within GRSD may shed light on multi-scaling phenomena, regime transitions, and the limits of predictability in deep learning.

More broadly, this work suggests that neural scaling laws are best understood not as universal empirical facts, but as manifestations of renormalizable learning dynamics. GRSD provides a framework in which both the success and the failure of scaling laws can be interpreted within a unified spectral and operator-theoretic perspective.

  1. Uniform exponential stability. There exist constants ρ ∈ (0, 1) and C_A < ∞ such that the state propagators satisfy, for all integers k ≥ 0,

$$\big\| A(t)^{k} \big\|_{\mathrm{op}} \;\le\; C_A\, \rho^{k} \qquad \text{uniformly in } t \in [0, T]. \tag{8}$$

Then for any tolerance δ ∈ (0, 1) there exists an effective interaction range

$$K_\delta \;:=\; \Big\lceil \frac{\log(C_A/\delta)}{\log(1/\rho)} \Big\rceil$$

such that the GRSD Jacobian path is δ-approximately graph-banded in the sense that

$$\dot J^{(l)}(t) \;=\; \sum_{p\,:\,|p-l|\le K_\delta} A_{lp}(t)\, J^{(p)}(t) \;+\; R^{(l)}_{\delta}(t), \qquad \big\| R^{(l)}_{\delta}(t) \big\|_{\mathrm{op}} \;\le\; \delta, \tag{10}$$

uniformly for all t ∈ [0, T ] and all blocks l. In particular, when the blocks J^{(l)} are chosen as coarse-grained time/graph blocks with width ≳ K_δ, Condition (1) in Theorem 1 holds with some constant K = O(1) at that coarse scale.

Proof sketch. The key point is that in a stable SSM/RWKV-type model, every “loop” contribution along the recurrent edge carries a factor of the state propagator, which decays exponentially in the number of traversals. Fix a parameter block index l and consider the block Jacobian J (l) that collects the function-space derivatives contributed by a localized portion of the computation graph (e.g. a window of time steps, or a local module in a scan/SSM implementation). Differentiating J (l) with respect to training time produces terms of the form

Whenever a term couples block l to a far block m (large |m − l|), it must traverse a long recurrent path in the unrolled graph. By (8), the operator norm of the corresponding propagator is at most $C_A \rho^{|m-l|}$ up to bounded local factors. Thus the total contribution from distances |m − l| > K is bounded by a geometric tail:

$$\sum_{|m-l| > K} C_A\, \rho^{|m-l|} \;\lesssim\; \frac{C_A\, \rho^{K+1}}{1-\rho}.$$

Choosing K = K_δ so that $\rho^{K_\delta} \lesssim \delta$ yields (10). Finally, if we define the GRSD blocks at a coarse scale larger than K_δ, the residual $R^{(l)}_{\delta}$ becomes negligible at the GRSD shell scale, and the exact banded form holds at that renormalized resolution.
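
A numerical sketch of the geometric-tail argument (the contraction factor ρ = 0.8 and the matrix size are our own toy choices): powers of a contractive state propagator decay like ρ^k, so coupling contributions beyond a range K_δ chosen with ρ^{K_δ} ≲ δ are negligible.

```python
import numpy as np

rng = np.random.default_rng(6)
d, rho = 16, 0.8
A = rng.standard_normal((d, d)) / np.sqrt(d)
A *= rho / np.max(np.abs(np.linalg.eigvals(A)))       # rescale to spectral radius rho

# empirical stability constant C_A with ||A^k|| <= C_A * rho^k over the probed range
norms = [np.linalg.norm(np.linalg.matrix_power(A, k), 2) for k in range(40)]
C_A = max(nk / rho**k for k, nk in enumerate(norms))

delta = 1e-3
K_delta = int(np.ceil(np.log(C_A / delta) / np.log(1.0 / rho)))
tail = C_A * rho**(K_delta + 1) / (1.0 - rho)         # geometric tail beyond range K_delta
print(K_delta, tail)                                   # tail is O(delta)
```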

Theorem 3 (Power-law renormalizability of GRSD shell dynamics). Fix a learning configuration (architecture, initialization, and optimization path) and consider the GRSD shell dynamics defined on logarithmic spectral shells (i.e., shells in s = log λ as in GRSD). Let J(t) denote the (function-space) Jacobian, decomposed into computation-graph blocks $J(t) = \big(J^{(1)}(t), \ldots, J^{(L)}(t)\big)$, and let $M(t) := J(t)\,J(t)^{*}$.

Assume the following conditions hold on a training horizon t ∈ [0, T ]:

  1. Graph-banded Jacobian path. There exists a constant K = O(1) such that for every block index l,

$$\dot J^{(l)}(t) \;=\; \sum_{p\,:\,|p-l|\le K} A_{lp}(t)\, J^{(p)}(t) \tag{11}$$

for suitable operator coefficients A_{lp}(t).

  2. Weak functional incoherence at initialization. There exists a nonnegative sequence {ε_k}_{k≥1} with $\sum_{k\ge 1} \varepsilon_k < \infty$ such that

$$\big|\big\langle J^{(l)}(0),\, J^{(m)}(0)\big\rangle\big| \;\le\; \varepsilon_{|l-m|} \qquad \text{for all } l, m. \tag{12}$$

  3. Controlled Jacobian path. There exists a constant $C_J < \infty$ such that

$$\sup_{t \in [0,T]} \max\big\{ \|J(t)\|_{\mathrm{op}},\, \|\dot J(t)\|_{\mathrm{op}} \big\} \;\le\; C_J, \tag{13}$$

and the shell-level quadratic statistics entering the GRSD closure remain uniformly bounded and Lipschitz in the logarithmic coordinate s = log λ.

  4. Log-shift invariance of renormalized shell couplings. On an intermediate spectral window, the renormalized shell coupling statistics are translation invariant in the logarithmic coordinate s = log λ, i.e. there exists a kernel $K_h(\Delta)$, $\Delta = (j-i)h$, such that

$$\Omega_{ij} \;=\; K_h\big((j-i)\,h\big) \;+\; \mathrm{err}(n, h, L),$$

where err(n, h, L) → 0 in the joint limit of large width n, small bin size h, and sufficient depth L.

In addition, assume the GRSD structural property already adopted in the main text: (i) intra-shell couplings are antisymmetric, so shell-internal transfers cancel in the shell energy balance.

Then the induced GRSD shell dynamics admits a renormalized velocity field

for some exponent a ∈ R and scalar coefficient c(t).

Proof. We prove Theorem 1 by decomposing the argument into three steps.

Step I establishes an effective one-dimensional shell conservation law with nearest-neighbor boundary fluxes, where the closure is obtained as a structural consequence of M = JJ * via a Taylor expansion. Steps II and III use Condition 4 together with gradient-flow covariance to derive the power-law constraint on the velocity field.

Notation. Let {S_α} be the GRSD logarithmic shells in λ (equivalently, equal-width bins in s = log λ). Let E_α(t) denote the shell energy (shell-aggregated error energy), and let ε(λ, t) be the corresponding shell-averaged energy density (defined by the usual piecewise-constant interpolation over shells). Let F_{α+1/2}(t) denote the net flux across the boundary between S_α and S_{α+1}, and let D_α(t) be the shell-aggregated dissipation term.
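As a concrete illustration of this notation, the sketch below builds logarithmic shells of width h in s = log λ for a toy operator M = JJ^⊤ and computes shell energies E_α from the eigen-coefficients of an error vector (one natural choice of shell-aggregated error energy); the sizes, bin width, and random operators are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy PSD operator M = J J^T and an error vector e (stand-ins for the GRSD objects).
n = 256
J = rng.standard_normal((n, n)) / np.sqrt(n)
M = J @ J.T
lam, U = np.linalg.eigh(M)           # eigenvalues lam_i, eigenvectors u_i
e = rng.standard_normal(n)
a = U.T @ e                          # coefficients a_i of e in the eigenbasis

# Logarithmic shells: equal-width bins of width h in s = log(lambda).
s = np.log(lam)
h = 0.5
edges = np.arange(s.min(), s.max() + h, h)
shell = np.digitize(s, edges)        # shell index alpha for each mode

# Shell energies E_alpha = sum over modes in S_alpha of a_i^2.
E = np.array([np.sum(a[shell == k] ** 2) for k in range(1, len(edges) + 1)])
print("shell energies:", np.round(E, 3))
```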

Step I: From the graph-banded Jacobian path to an effective shell conservation law.

(I.a) Summable incoherence propagation along the training path. By (11), for each l there exist operator coefficients A_lp(t) such that

By (13), these coefficients are uniformly bounded: there exists

Substituting (16) into (15) yields

Taking operator norms and using submultiplicativity gives

Define the distance-profile envelope

Using (19) and that |p -m| differs from |l -m| by at most K, one obtains

with u_k = 0 for k < 0. Let U(t) := Σ_{k≥0} ρ^k u_k(t) for some ρ ∈ (0, 1). Then (20) implies a Grönwall inequality

hence U(t) ≤ e^{Cρt} U(0). By (12),

and in particular there exists a summable tail {ε k } k≥1 such that

(I.b) Automatic nearest-neighbor closure via a parameter-space Taylor expansion and the exact Ω formula. By antisymmetry, intra-shell transfers cancel in the energy balance, hence

where T αβ (t) denotes net transfer from shell β to shell α.

We now show that direct transfers between non-adjacent shells are second order, and therefore the inter-shell dynamics closes (up to a controlled remainder) on nearest-neighbor boundary fluxes. The argument uses a Taylor expansion of the model output in parameter space and the exact formula for the spectral rotation generator Ω.

Taylor expansion of the output and the order of ∆M. Let f(θ) denote the model output in function space, with Jacobian J(θ) = ∂_θ f(θ) and squared loss L(θ) = (1/2)∥e(θ)∥², where e(θ) = f(θ) − y. For a small parameter increment δθ, the output admits the expansion

where H(θ) is the second derivative (a Hessian tensor) of f. Consider one infinitesimal gradient-flow step δθ = θ̇(t) δt with θ̇(t) = −∇_θ L(θ(t)) = −J(θ(t))^* e(t). The linearized (first-order) model f_lin(θ + δθ) := f(θ) + J(θ)δθ keeps J(θ) fixed, hence the corresponding operator M_lin := J(θ)J(θ)^* is constant over the step. Therefore any change of M(t) = J(t)J(t)^* must come from the quadratic and higher terms in (24), i.e. from curvature. Equivalently, over a time increment δt,

and the contribution of ∆M to the error update appears only at second order in time, because the first-order error update uses M(t) frozen:

e(t + δt) = e(t) − M(t)e(t) δt + ((δt)²/2) [M(t)² e(t) − Ṁ(t)e(t)] + O((δt)³).   (26)

In particular, the term involving Ṁ (hence curvature) enters the error evolution only through the (δt)² term.
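This order counting can be checked numerically on a toy model whose Jacobian genuinely depends on θ. The sketch below uses a hypothetical quadratic-feature map (not an architecture from the paper): the change in M over one gradient-flow step shrinks like δt, while the gap between the true error update and the frozen-M linearized update shrinks like (δt)².

```python
import numpy as np

rng = np.random.default_rng(1)
p, d = 6, 4                                    # toy parameter and output dimensions
A = rng.standard_normal((d, p))
B = rng.standard_normal((d, p, p))
B = 0.5 * (B + np.transpose(B, (0, 2, 1)))     # symmetric quadratic terms give genuine curvature
y = rng.standard_normal(d)

def f(theta):                                  # toy model: f_k = (A theta)_k + 0.5 theta^T B_k theta
    return A @ theta + 0.5 * np.einsum("kij,i,j->k", B, theta, theta)

def jac(theta):                                # exact Jacobian: J(theta)_{k,i} = A_{k,i} + (B_k theta)_i
    return A + np.einsum("kij,j->ki", B, theta)

theta = rng.standard_normal(p)
e = f(theta) - y
J = jac(theta)
M = J @ J.T

for dt in [1e-1, 1e-2, 1e-3]:
    dtheta = -dt * (J.T @ e)                   # one infinitesimal gradient-flow increment
    dM = jac(theta + dtheta) @ jac(theta + dtheta).T - M   # change of M over the step: O(dt)
    e_true = f(theta + dtheta) - y             # true error after the step
    e_lin = e - dt * (M @ e)                   # frozen-M (linearized) error update
    print(f"dt={dt:.0e}  ||dM||={np.linalg.norm(dM):.2e}  ||e_true - e_lin||={np.linalg.norm(e_true - e_lin):.2e}")
```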

Exact Ω formula and gap suppression across non-adjacent shells. Let {(λ_i(t), u_i(t))} be an orthonormal eigen-decomposition of M(t) on the GRSD window and define the instantaneous rotation generator Ω(t) by Ω_ij(t) := ⟨u_i(t), u̇_j(t)⟩.

For i ≠ j, the standard exact identity gives

Ω_ij(t) = ⟨u_i(t), Ṁ(t) u_j(t)⟩ / (λ_j(t) − λ_i(t)).   (27)
Assume the GRSD shell partition {S α } is chosen so that there exists a minimal spectral gap between non-adjacent shells:

Under the controlled path condition (13) we have sup_{t∈[0,T]} ∥Ṁ(t)∥_op < ∞, hence (27) and (28) imply the uniform bound

Second-order non-adjacent transfers and boundary-flux closure. Write the error in the instantaneous eigenbasis, e(t) = Σ_i a_i(t) u_i(t), and note that inter-shell exchange is induced by basis rotation. Combining the time-Taylor expansion (26) with the fact that Ω is generated by Ṁ through (27), we obtain the key structural consequence: since Ṁ enters e(t + δt) only at order (δt)², any energy exchange mediated by Ω is at least second order in δt. Moreover, for non-adjacent shells |α − β| ≥ 2, the gap bound (28) controls the corresponding rotation coefficients via (29), so the associated transfers are uniformly of second order:

with constants controlled by the bounds in Conditions (1)-(3). (Adjacent shells are not covered by the uniform gap bound and therefore may contribute at leading order; these contributions define the renormalized boundary fluxes.)

Consequently, the inter-shell balance (23) closes, up to a controlled remainder collecting the non-adjacent second-order transfers, in telescoping boundary-flux form

where F_{α+1/2}(t) depends only on spectral content in a fixed neighborhood of the boundary between S_α and S_{α+1}, and the remainder R_α(t) := Σ_{|β−α|≥2} T_αβ(t) is uniformly controlled on [0, T] by the second-order estimate (30) together with the summable incoherence propagation from Step I.a. Passing to the shell-averaged density representation yields the effective conservation law

where r corresponds to the controlled remainder. This completes Step I.
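The gap-suppression mechanism behind this closure is easy to illustrate numerically. The sketch below builds a random PSD surrogate M and a synthetic symmetric Ṁ (stand-ins, not quantities measured during training), forms the rotation coefficients Ω_ij = ⟨u_i, Ṁ u_j⟩ / (λ_j − λ_i), and checks that couplings between non-adjacent log shells are bounded by ∥Ṁ∥_op divided by the corresponding spectral gap.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
J = rng.standard_normal((n, 2 * n)) / np.sqrt(2 * n)
M = J @ J.T                                   # random PSD surrogate for the GRSD operator
Mdot = rng.standard_normal((n, n))
Mdot = 0.5 * (Mdot + Mdot.T)                  # synthetic symmetric stand-in for dM/dt

lam, U = np.linalg.eigh(M)
s = np.log(lam)
h = 0.4
shell = np.floor((s - s.min()) / h).astype(int)   # log-shell index of each eigenmode

# Rotation coefficients Omega_ij = <u_i, Mdot u_j> / (lambda_j - lambda_i) for i != j.
X = U.T @ Mdot @ U
dlam = lam[None, :] - lam[:, None]
np.fill_diagonal(dlam, np.inf)                    # exclude the diagonal
Omega = X / dlam

# Non-adjacent shell pairs |alpha - beta| >= 2 versus the uniform bound ||Mdot||_op / gap.
nonadj = np.abs(shell[:, None] - shell[None, :]) >= 2
gap = np.abs(dlam[nonadj]).min()
bound = np.linalg.norm(Mdot, 2) / gap
print(f"max |Omega_ij| over non-adjacent shells = {np.abs(Omega[nonadj]).max():.3f}")
print(f"uniform bound ||Mdot||_op / gap         = {bound:.3f}")
```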

Step II: Scale covariance as a rigidity consequence of log-shift invariance.

We now invoke Condition 4 (log-shift invariance). On the intermediate spectral window, the renormalized shell coupling statistics governing the effective flux law (32) are asymptotically translation invariant in s = log λ. Equivalently, for any τ ∈ R, a shift ε(s, t) → ε(s + τ, t) induces a corresponding shift of the flux J(s, t), without introducing any intrinsic length scale. As a consequence, the velocity field v(s, t) := J(s, t)/ε(s, t) cannot depend on the absolute position s except through multiplicative rescaling. In particular, any admissible dependence of v on s must be compatible with a functional relation of the form

for some scale factor α(τ ) and a possibly rescaled time t τ . To identify the form of α(τ ) and the relation between t τ and t, we use the covariance of gradient flow.

Step III: Gradient-flow covariance and exclusion of time-dependent log shifts.

We first record a basic covariance property of gradient flow.

Lemma 1 (Gradient-flow covariance fixes the time dependence of v). Consider gradient-flow training with mean-squared error loss

L(θ) = (1/2) ∥y − f_θ∥², with e := y − f_θ,

and induced error dynamics

ė(t) = −M(t) e(t), M(t) := J_θ J_θ^*.

Let λ(t) denote an eigenvalue of M(t) and define the spectral drift v(λ, t) := dλ/dt. Then v satisfies the covariance functional equation

v(aλ, t/a) = a² v(λ, t) for all a > 0,   (33)

and consequently admits the representation

v(λ, t) = t⁻² F(λt), where F(u) := v(u, 1).   (34)
Proof.

Step 1: Loss rescaling induces time reparameterization.

For any a > 0, consider the rescaled loss L a := a L. The corresponding gradient flow satisfies

If θ(t) solves the original gradient flow, then θ a (t) := θ(at) satisfies

and hence solves the rescaled flow. Thus, rescaling the loss by a is equivalent to the time reparameterization t → at.

Step 2: Loss rescaling multiplies the error operator.

For MSE loss, ∇_θ L(θ) = −J_θ^* e, so θ̇ = J_θ^* e. Differentiating e = y − f_θ yields ė = −J_θ θ̇ = −J_θ J_θ^* e = −M(t)e. Under L_a = aL, the parameter flow becomes θ̇_a = a J_θ^* e, and the induced error dynamics are ė_a = −J_θ θ̇_a = −a J_θ J_θ^* e_a = −a M(t)e_a. Hence, at the level of the error ODE, loss rescaling induces M(t) → a M(t), and therefore λ(t) → a λ(t).

Step 3: Covariance of eigenvalue trajectories. Combining Steps 1-2, the eigenvalue trajectories obey λ_a(t) = a λ(at), where λ_a denotes the eigenvalue under L_a. Differentiating with respect to t gives v_a(t) = λ̇_a(t) = d/dt [a λ(at)] = a² λ̇(at) = a² v(at).

Rewriting this relation in field form yields (33).

Step 4: Solving the functional equation. Fix t > 0 and choose a = t in (33). Then

v(tλ, 1) = t² v(λ, t),

which can be rearranged as v(λ, t) = t⁻² v(tλ, 1).

Defining F (u) := v(u, 1) and substituting u = λt yields (34).
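A quick numerical check of Lemma 1 is possible by integrating both flows with a small Euler step. The model below is a toy tanh map chosen only for illustration (not the paper's experimental setup); the check confirms θ_a(t) ≈ θ(at) and the eigenvalue relation λ_a(t) ≈ a λ(at) for the rescaled error operator aM.

```python
import numpy as np

rng = np.random.default_rng(3)
p, d = 5, 7
A = rng.standard_normal((d, p))
y = rng.standard_normal(d)

def f(theta):                                   # toy nonlinear model, so M(t) actually evolves
    return np.tanh(A @ theta)

def jac(theta):                                 # Jacobian of f at theta
    return (1.0 - np.tanh(A @ theta) ** 2)[:, None] * A

def grad(theta):                                # gradient of L = 0.5 * ||y - f(theta)||^2
    return -jac(theta).T @ (y - f(theta))

def flow(theta0, scale, T, dt=1e-4):            # Euler integration of theta_dot = -scale * grad(L)
    theta = theta0.copy()
    for _ in range(int(round(T / dt))):
        theta = theta - dt * scale * grad(theta)
    return theta

theta0 = rng.standard_normal(p)
a, t = 3.0, 0.5
theta_a = flow(theta0, scale=a, T=t)            # gradient flow of the rescaled loss a*L up to time t
theta_ref = flow(theta0, scale=1.0, T=a * t)    # original gradient flow up to time a*t

lam_a = np.linalg.eigvalsh(a * jac(theta_a) @ jac(theta_a).T)     # spectrum of a*M along the rescaled flow
lam_ref = np.linalg.eigvalsh(jac(theta_ref) @ jac(theta_ref).T)   # spectrum of M along the original flow
print("||theta_a(t) - theta(a t)|| =", np.linalg.norm(theta_a - theta_ref))
print("max |lam_a(t) - a lam(a t)| =", np.abs(lam_a - a * lam_ref).max())
```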

Condition 4 rules out any dependence of the shell couplings on the absolute spectral scale: on the intermediate spectral window, they are translation-invariant in s = log λ. In other words, it forces the log-spectral drift ṡ = λ̇/λ to be invariant to the absolute scale of λ.

Scale-invariance of log-spectral drift induced by Condition 4. Recall that under Condition 4, the shell-level coupling statistics admit a log-shift invariant form on an intermediate spectral window,

where h denotes the log-shell width, λ_i = λ_0 e^{−ih}, and K_h depends only on the relative log-scale separation.

The evolution of individual eigenvalues takes the form

Introducing the log-spectral coordinate s_i := log λ_i, we obtain ṡ_i = λ̇_i / λ_i.

The diagonal contribution simplifies immediately to Ω_ii / λ_i, which by the diagonal part of Condition 4 (Ω_ii ≈ c(t) λ_i on the intermediate window) is independent of the absolute scale of λ_i.

For the off-diagonal terms, writing λ_i − λ_j = λ_i (1 − e^{s_j − s_i}) and using the log-shift invariant structure Ω_ij = K_h(s_j − s_i), we find that each summand depends only on the relative separation s_j − s_i. Upon coarse-graining over shells and passing to the continuum limit, the sum reduces to an integral of the form

where K is a scale-free kernel inherited from K_h. Equation (39) depends exclusively on relative log-scale separations and contains no reference to the absolute value of s or λ. Consequently,

demonstrating that the log-eigenvalue drift is invariant under global spectral rescalings λ → e^∆ λ. This establishes explicitly that Condition 4 enforces the absence of any intrinsic absolute spectral scale in the renormalized spectral dynamics.
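A toy discretization makes the shift-invariance tangible: if the coarse-grained drift has the form ṡ_i = c + Σ_{j≠i} K(s_j − s_i) for a hypothetical scale-free kernel K (any integrable function of the log-separation would do), then shifting every log-eigenvalue by the same ∆ leaves the drift unchanged.

```python
import numpy as np

# Hypothetical scale-free kernel: any function of the log-separation alone would do.
def K(delta):
    return np.exp(-np.abs(delta)) * np.sign(delta)

def log_drift(s, c=0.1):
    # Discretized drift s_dot_i = c + sum_{j != i} K(s_j - s_i): only separations enter.
    D = s[None, :] - s[:, None]
    np.fill_diagonal(D, np.nan)                  # exclude the i = j term
    return c + np.nansum(K(D), axis=1)

rng = np.random.default_rng(4)
s = np.sort(np.log(rng.uniform(1e-3, 1.0, size=50)))
shift = 2.3                                      # global rescaling lambda -> e^shift * lambda
print(np.allclose(log_drift(s), log_drift(s + shift)))   # True: the drift is log-shift invariant
```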

As a consequence, any effective drift velocity in s-space cannot depend explicitly on the absolute position s. In particular, u(s, t) := ṡ = v(λ, t)/λ is independent of λ within the window. Equivalently,

v(λ, t) = λ c(t) on the intermediate spectral window. (C2)

Combining this formula with v(λ, t) = t⁻² v(tλ, 1), we have:

This establishes the power-law form of the velocity field and completes the proof of Theorem 1.

In this appendix we provide a complete proof of Theorem 2. The argument refines the intermediate result (log-bin relative-scale kernel) by making explicit the probabilistic and dynamical mechanisms underlying log-shift invariance. Compared to the informal derivation sketched in the main text, the present proof addresses three technical issues: (i) the additive log-spectral increments are treated as a Markov-additive process rather than i.i.d.; (ii) directional mixing is quantified via a uniform mixing time; (iii) no independence is assumed between eigenvector couplings and spectral gaps.

Consider the depth-L residual Jacobian J^(L) := (I + ε G_L) ⋯ (I + ε G_1), where ε > 0 is sufficiently small and {G_ℓ} satisfy the structural assumptions of Section 4.1. Define M^(L) := J^(L) (J^(L))^*, with eigen-decomposition

Let {B i } denote logarithmic spectral bins of width h in s, and write I i := {u : s u ∈ B i }. We restrict attention to an intermediate spectral window bounded away from the spectral edges, as specified in Condition 3.

For any unit vector u ∈ S^{n−1} define the normalized direction process

u_ℓ := J_ℓ u_{ℓ−1} / ∥J_ℓ u_{ℓ−1}∥, with u_0 := u.

The associated single-layer log-increment is

δ_ℓ := log ∥J_ℓ u_{ℓ−1}∥.

The pair (u ℓ , δ ℓ ) defines a Markov-additive process [35] on S n-1 × R. Under Condition 3 and the residual small-step assumption, δ ℓ admits uniform moment bounds

uniformly in u ℓ-1 . Moreover, on the intermediate spectral window the conditional variance Var(δ ℓ | u ℓ-1 ) is bounded below by c 0 ε 2 for some c 0 > 0.

Let

where expectation is taken with respect to the stationary distribution of the direction process. Since u ℓ is a uniformly ergodic Markov chain (see Section A.3.4), standard Berry-Esseen bounds [4] for Markov-additive processes apply. In particular, there exists C BE < ∞ such that

where σ 2 = Var(δ ℓ ) = Θ(ε 2 ). Consequently, for any Lipschitz test function f and any tolerance η > 0, there exists

such that the distribution of accumulated log-increments is η-close (in total variation or Wasserstein distance) to a translation-invariant limit on the intermediate spectral window. This yields approximate shift-invariance of log-spectral statistics.
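The Markov-additive picture can be illustrated with a small simulation: for residual factors J_ℓ = I + ε G_ℓ with isotropic Gaussian blocks (an idealized stand-in for the structural assumptions of Section 4.1, with illustrative sizes), the accumulated log-increments are already close to Gaussian after a few hundred layers, consistent with a Berry-Esseen-type rate.

```python
import numpy as np

rng = np.random.default_rng(5)
n, eps, L, trials = 32, 0.1, 200, 300

S = np.zeros(trials)
for t in range(trials):
    u = rng.standard_normal(n)
    u /= np.linalg.norm(u)
    acc = 0.0
    for _ in range(L):
        G = rng.standard_normal((n, n)) / np.sqrt(n)   # approximately isotropic residual block G_l
        v = u + eps * (G @ u)                          # J_l u with J_l = I + eps * G_l
        acc += np.log(np.linalg.norm(v))               # single-layer log-increment delta_l
        u = v / np.linalg.norm(v)                      # direction process u_l on the sphere
    S[t] = acc                                         # accumulated log-increment log ||J^(L) u||

z = (S - S.mean()) / S.std()
print(f"mean log-growth per layer: {S.mean() / L:.4f}")
print(f"skewness: {np.mean(z**3):+.3f}   excess kurtosis: {np.mean(z**4) - 3.0:+.3f}")
```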

The direction chain {u ℓ } evolves on the unit sphere. Under the residual perturbation J ℓ = I + εG ℓ with approximately isotropic law and no parameter sharing across depth, the chain admits the uniform distribution as its unique invariant measure. For products of i.i.d. random matrices, classical results of Furstenberg and Kesten establish the existence of a unique stationary distribution on the projective space and ergodicity of the induced Markov dynamics, leading to direction independent asymptotic behavior of matrix products under suitable conditions [9]. The directional updates induced by the random residual perturbations define a Markov process on the unit sphere, whose mixing properties can be analyzed in the classical Markov chain framework [33]. Standard coupling or diffusion-approximation arguments imply that the mixing time τ mix (η) satisfies

Hence, for all L ≥ τ mix (η), the law of u L is η-close to uniform, uniformly over initial directions. This ensures isotropization of eigenvector statistics on the intermediate spectral window.
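Directional mixing can be probed in the same idealized setting by tracking how quickly the direction chain forgets its initialization; after roughly ε⁻² layers the overlap with u_0 decays to the O(n^{−1/2}) level typical of a uniform direction on the sphere.

```python
import numpy as np

rng = np.random.default_rng(6)
n, eps, L = 64, 0.1, 600

u0 = np.zeros(n)
u0[0] = 1.0                                   # fixed initial direction
u = u0.copy()
overlaps = []
for _ in range(L):
    G = rng.standard_normal((n, n)) / np.sqrt(n)
    u = u + eps * (G @ u)
    u /= np.linalg.norm(u)
    overlaps.append(abs(u @ u0))

# After mixing (roughly eps**-2 layers), the overlap with u0 fluctuates at the
# O(1/sqrt(n)) level expected for a uniformly distributed direction on the sphere.
print("overlap at l = 10 :", round(overlaps[9], 3))
print("overlap at l = 600:", round(overlaps[-1], 3), "  uniform-direction scale:", round(1 / np.sqrt(n), 3))
```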

Off-diagonal shell statistics. Let X_uv := φ_u^* A φ_v denote generic quadratic couplings entering the GRSD closure. By Condition 3, conditional second moments satisfy

with a Lipschitz on the intermediate window.

Combining the shift-invariance from Appendix A.3.3 with directional mixing from Appendix A.3.4 yields

uniformly on the window. Logarithmic bin averaging then gives

for some kernel K h depending only on the bin width h.

For quantities involving spectral gaps, such as Ω v→u = X uv /(λ v -λ u ), we restrict to pairs with |s v -s u | ≥ δ. Since λ v -λ u is then uniformly bounded away from zero on the intermediate window, Cauchy-Schwarz yields the same bin-level structure without invoking independence.

On-diagonal shell statistics. The diagonal structure in Condition 4 can be obtained from the same bulk self-mixing mechanism used in the off-diagonal analysis. Fix an intermediate spectral window and condition on a mode u with λ = ⟨u, M u⟩ lying in a log-shell B_i. Beyond a boundary depth, residual blocks induce fast mixing on the projective space of directions, so that the conditional distribution of mode directions u | (λ ∈ B_i) is asymptotically isotropic (up to a small mixing error).

Consequently, for any layerwise diagonal quadratic observable of the form Ω (ℓ) uu = ⟨u, Ṁ (ℓ) u⟩, isotropy implies that its conditional expectation retains only the Rayleigh component:

for some scalar α ℓ (t) independent of the absolute shell index i on the intermediate window, and where the O(η) term captures the finite-depth imperfect mixing (the same O(η) level as in the bulk part of the Toeplitz argument). This reduction to the Rayleigh component is standard for isotropic quadratic observables and follows from classical results on quadratic forms under rotational invariance [22,2,29]. Summing over layers and applying the same boundary-bulk decomposition as in depth dilution, the boundary contribution is suppressed by ℓ * /L, while the bulk average preserves the shell-independent coefficient. Hence the full diagonal coupling satisfies, on the intermediate window,

i.e. Ω ii = c(t)λ i + err ii after shell averaging, which is the on-diagonal part of Condition 4. Standard self-averaging over depth and within-shell averaging further reduce the residual fluctuations around this mean.

Combining Appendix A.3.3 and A.3.4, log-shift invariance holds whenever

which is precisely the depth threshold stated in Theorem 2. This completes the proof.

In this appendix we provide a complete proof of Proposition 1. The proof proceeds by exploiting the additive structure of the GRSD operator across depth. Since the renormalized shell coupling statistics used in Condition 4 are quadratic and binwise in the Jacobian, they depend on the network only through the operator M = JJ * and therefore decompose additively over layerwise contributions. This allows us to split M into a finite boundary part and a bulk part, control the boundary contribution using uniform operator bounds, and transfer the layerwise log-shift invariance established in Theorem 2 to the depth-averaged operator. The resulting argument shows that sufficiently large depth suppresses nonstationary boundary effects and upgrades bulk stationarity to global log-shift invariance.

Proof. Throughout, we work on the intermediate spectral window of Condition 4. Let K ij (•) denote the renormalized log-bin shell coupling statistics appearing in Condition 4. In GRSD these couplings are quadratic statistics of the Jacobian and therefore depend on J only through M = JJ * ; in particular, under the additive decomposition M = ℓ M (ℓ) (Section 4.1, paragraph “Additive GRSD operator”), the corresponding bin-level couplings decompose additively:

(Concretely, this is because all GRSD closure statistics are formed from shell/boundary quadratic forms of M , hence linear in M once the binning is fixed.) Define the depth-averaged couplings

Using (41) and splitting into boundary and bulk,

Step 1: the boundary contribution is suppressed by depth dilution. The boundary part involves only the first ℓ* layers and, by the uniform operator bounds, contributes at most O(ℓ*/L) to the depth-averaged couplings.

Step 2: the bulk contribution inherits log-shift invariance from Theorem 2. By choice of ℓ * and Theorem 2, for every ℓ > ℓ * , the layerwise couplings satisfy

uniformly over bins (i, j) in the intermediate window, for the same bin width h as in Condition 4. Averaging over ℓ = ℓ * + 1, . . . , L yields

where the O(η) term is uniform over (i, j) in the window. (Here we used that the average of O(η) errors remains O(η).)

Step 3: conclude Condition 4 for M. Combining (42)-(43) gives

Since (L − ℓ*)/L = 1 − O(ℓ*/L), the prefactor only perturbs the kernel by an additional O(ℓ*/L) on the same window, and thus

Choosing L ≥ C ℓ* η⁻¹ makes the boundary dilution term O(ℓ*/L) at most O(η), so the total error is O(η). This is exactly Condition 4 for the couplings induced by M (on the intermediate spectral window), completing the proof.

Block Lanczos construction of the approximate spectral basis:

1: Initialize an orthonormal block V_0
2: for j = 1, . . . , m do
3:   W ← M V_{j−1}
4:   Orthogonalize W against {V_0, . . . , V_{j−1}}
5:   QR factorization W = V_j R_j
6: end for
7: Assemble block tridiagonal matrix T
8: Compute eigenpairs (λ_i, u_i) of T

The eigenvectors of the block tridiagonal matrix T define a Ritz basis that approximates the true eigenmodes of M within the Krylov subspace. Expressing operators in this basis yields a meaningful discretization of spectral dynamics while remaining computationally tractable.

All subsequent diagnostics, including shell binning and Toeplitz analysis, are performed in this approximate spectral basis.
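A minimal, self-contained sketch of this construction is given below, assuming only matrix-vector access to M; the block size, number of blocks, and the use of full reorthogonalization are illustrative implementation choices rather than the exact settings used for the paper's diagnostics.

```python
import numpy as np

def block_lanczos(matvec, n, block_size=8, num_blocks=10, seed=0):
    """Build an orthonormal block Krylov basis V = [V_0, ..., V_{m-1}] for the operator
    behind `matvec` and return Ritz pairs of the projected matrix T = V^T M V."""
    rng = np.random.default_rng(seed)
    V0, _ = np.linalg.qr(rng.standard_normal((n, block_size)))
    blocks = [V0]
    for _ in range(1, num_blocks):
        W = matvec(blocks[-1])                       # W <- M V_{j-1}
        Vfull = np.concatenate(blocks, axis=1)
        W -= Vfull @ (Vfull.T @ W)                   # orthogonalize W against all previous blocks
        W -= Vfull @ (Vfull.T @ W)                   # second pass for numerical stability
        Vj, _ = np.linalg.qr(W)                      # QR factorization W = V_j R_j
        blocks.append(Vj)
    V = np.concatenate(blocks, axis=1)
    T = V.T @ matvec(V)                              # projected operator (block tridiagonal up to round-off)
    ritz_vals, S = np.linalg.eigh(0.5 * (T + T.T))
    ritz_vecs = V @ S                                # Ritz vectors approximating eigenmodes of M
    return ritz_vals, ritz_vecs

# Toy usage: M = J J^T for a random Jacobian, accessed only through matrix-vector products.
rng = np.random.default_rng(1)
n = 400
J = rng.standard_normal((n, 2 * n)) / np.sqrt(2 * n)
matvec = lambda X: J @ (J.T @ X)
vals, vecs = block_lanczos(matvec, n)
print("top Ritz values:      ", np.round(vals[-5:], 3))
print("exact top eigenvalues:", np.round(np.linalg.eigvalsh(J @ J.T)[-5:], 3))
```

Full reorthogonalization keeps the basis numerically orthonormal at modest extra cost, which matters here because the shell diagnostics are quadratic in the basis vectors.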

To probe spectral transport, we compute a finite-difference approximation of the directional derivative of M along the training trajectory, Ṁ(t) ≈ (M(t + ∆t) − M(t)) / ∆t. Projecting Ṁ into the Ritz basis yields the operator Ω (the matrix of Ṁ expressed in the approximate spectral basis), which captures how spectral modes exchange energy under training. This operator forms the basis of all shell-level coupling measurements.

Eigenvalues are grouped into logarithmic spectral shells. The projected operator Ω is coarse-grained into a shell coupling matrix K by averaging squared couplings within each shell pair.
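The sketch below strings these steps together on synthetic inputs: a random surrogate for Ṁ, projection into a spectral basis, logarithmic shell binning, averaging of squared couplings into a shell coupling matrix K, and a crude Toeplitz check along its diagonals. The operators are random stand-ins rather than Jacobians measured during training, so only the mechanics of the diagnostic are illustrated.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300
J = rng.standard_normal((n, 2 * n)) / np.sqrt(2 * n)
M = J @ J.T
dJ = rng.standard_normal(J.shape) / np.sqrt(2 * n)
Mdot = dJ @ J.T + J @ dJ.T                     # synthetic stand-in for (M(t + dt) - M(t)) / dt

lam, U = np.linalg.eigh(M)
Omega = U.T @ Mdot @ U                         # projection of Mdot into the spectral basis

# Logarithmic shells of width h in s = log(lambda).
s = np.log(lam)
h = 0.4
shell = np.floor((s - s.min()) / h).astype(int)
num_shells = shell.max() + 1

# Shell coupling matrix: average squared coupling within each shell pair.
K = np.zeros((num_shells, num_shells))
for a in range(num_shells):
    for b in range(num_shells):
        block = Omega[np.ix_(shell == a, shell == b)]
        if block.size:
            K[a, b] = np.mean(block ** 2)

# Toeplitz diagnostic: spread of K along each diagonal offset
# (a small relative spread would indicate approximate log-shift invariance).
for off in range(1, 4):
    diag = np.array([K[a, a + off] for a in range(num_shells - off)])
    print(f"offset {off}: mean = {diag.mean():.3e}, relative spread = {diag.std() / diag.mean():.2f}")
```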

Proposition 1 asserts that the renormalized log-bin shell coupling statistics induced by M satisfy Condition 4 on the intermediate window (with O(η) error). Proof sketch. The key input is the additive structure of the GRSD operator across depth. Since M decomposes as a sum of layerwise contributions, the renormalized log-bin shell coupling statistics considered in Condition 4 decompose additively as well. We therefore analyze the depth-averaged couplings by splitting M into a finite boundary part and a bulk part. The boundary contribution involves only the first ℓ* layers and is therefore suppressed by the depth-dilution factor ℓ*/L.


The idea is inspired by [46].
