Gradient Flow Through Diagram Expansions: Learning Regimes and Explicit Solutions
We develop a general mathematical framework to analyze scaling regimes and derive explicit analytic solutions for gradient flow (GF) in large learning problems. Our key innovation is a formal power series expansion of the loss evolution, with coefficients encoded by diagrams akin to Feynman diagrams. We show that this expansion has a well-defined large-size limit that can be used to reveal different learning phases and, in some cases, to obtain explicit solutions of the nonlinear GF. We focus on learning Canonical Polyadic (CP) decompositions of high-order tensors, and show that this model has several distinct extreme lazy and rich GF regimes such as free evolution, NTK and under- and over-parameterized mean-field. We show that these regimes depend on the parameter scaling, tensor order, and symmetry of the model in a specific and subtle way. Moreover, we propose a general approach to summing the formal loss expansion by reducing it to a PDE; in a wide range of scenarios, it turns out to be 1st order and solvable by the method of characteristics. We observe a very good agreement of our theoretical predictions with experiment.
💡 Research Summary
The paper introduces a novel analytical framework for studying gradient flow (GF) dynamics in large‑scale learning problems, focusing on the Canonical Polyadic (CP) decomposition of high‑order tensors. The authors start by expanding the loss L(t) as a formal power series in time, L(t) = ∑ₛ (dˢL/dtˢ)(0) · tˢ/s!. Each time derivative is a polynomial in the model parameters u (the CP factors), and the coefficients of these polynomials are expressed through combinatorial objects called “diagrams”, which are essentially graphs reminiscent of Feynman diagrams.
In the diagram language, p‑nodes represent summations over the data dimension (size p), H‑nodes represent summations over the rank dimension (size H), and edges (colored by the tensor mode) correspond to the individual parameters u^{(m)}_{k,i}. The loss itself decomposes into three parts: a pure model term (graph D_{2ν}), a model‑target interaction term (graph R_ν), and a constant target‑only term. The evolution equation dG/dt = -(1/T)·G⋆(½D_{2ν}−R_ν) defines a binary “merger” operation ⋆ that contracts edges of the same color and identifies the corresponding nodes, mirroring the chain rule for gradients.
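To make the setup concrete, here is a minimal numerical sketch (our own illustration, not the paper's code) of a symmetric order‑3 rank‑H CP model fitted to the identity target by an explicit‑Euler discretization of gradient flow; the names p, H, ν, σ follow the summary, while the step size and iteration count are arbitrary choices.

```python
# Minimal sketch (ours, not the paper's code): a symmetric rank-H CP model
# T_hat[i,j,l] = sum_k u[k,i]*u[k,j]*u[k,l] (order nu = 3) fitted to the
# identity target F_{ijl} = delta_{ijl} by Euler-discretized gradient flow.
import numpy as np

rng = np.random.default_rng(0)
p, H, sigma = 4, 8, 0.3            # data dim, rank, init scale (i.i.d. N(0, sigma^2))
u = sigma * rng.standard_normal((H, p))

F = np.zeros((p, p, p))
F[np.arange(p), np.arange(p), np.arange(p)] = 1.0   # identity target

def loss(u):
    T_hat = np.einsum('ki,kj,kl->ijl', u, u, u)
    return 0.5 * np.sum((T_hat - F) ** 2)

def grad(u):
    R = np.einsum('ki,kj,kl->ijl', u, u, u) - F       # residual tensor
    return 3.0 * np.einsum('ijl,kj,kl->ki', R, u, u)  # d(loss)/du, symmetric model

L0 = loss(u)
for _ in range(400):
    u -= 0.005 * grad(u)           # one Euler step of the gradient flow
print(L0, loss(u))                 # the loss decreases along the flow
```

The factor 3 in the gradient reflects the symmetry of the order‑3 model: the residual tensor contracts identically against each of the three modes.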
Because the parameters are initialized i.i.d. Gaussian with variance σ², the expectation of any diagram can be computed via Wick’s theorem. Each pairing of edges yields a contracted diagram whose contribution is a monomial of the form p^q·H^n·σ^{2l}, where q and n count the remaining p‑ and H‑nodes after contraction, and l is the number of original edges. Thus the s‑th coefficient of the loss expansion, denoted Y_s(H,p,σ²), becomes a finite sum of such monomials.
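As a sanity check on the Wick step, the pairing count for a single Gaussian edge variable can be compared against Monte Carlo; this toy check (our illustration, not from the paper) reproduces the σ^{2l} factor contributed by l contracted pairs.

```python
# Wick's theorem for a centered Gaussian g with variance sigma^2:
# E[g^{2l}] = (2l - 1)!! * sigma^{2l}, one factor sigma^2 per contracted pair.
import numpy as np
from math import prod

def double_factorial(n):
    return prod(range(n, 0, -2)) if n > 0 else 1

sigma, l = 0.7, 3
wick = double_factorial(2 * l - 1) * sigma ** (2 * l)   # 15 pairings for 6 edges

rng = np.random.default_rng(1)
g = sigma * rng.standard_normal(2_000_000)
mc = np.mean(g ** (2 * l))
print(wick, mc)   # Monte Carlo matches the Wick count up to sampling error
```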
The authors then study the large‑size limit p, H → ∞ under a power‑law scaling p≈a^{α_p}, H≈a^{α_H}, σ≈a^{α_σ} (a→∞). For each monomial the effective exponent is α·(q,n,l)=α_p q+α_H n+2α_σ l. The dominant term for a given scaling triple α is the one with the maximal exponent; these are called “Pareto‑optimal” terms because they are not dominated by any other term with the same l. Pareto‑optimal terms correspond to minimally contracted diagrams, i.e., diagrams that keep the number of summed indices as large as possible.
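The exponent‑maximization step can be sketched in a few lines; the term list below is a hypothetical example, not a coefficient taken from the paper.

```python
# Sketch: selecting the dominant monomial p^q * H^n * sigma^(2l) under the
# power-law scaling p ~ a^alpha_p, H ~ a^alpha_H, sigma ~ a^alpha_sigma.
# The effective exponent of a term (q, n, l) is
#   alpha_p*q + alpha_H*n + 2*alpha_sigma*l,
# and the large-a limit is governed by the terms achieving the maximum.
def effective_exponent(term, alpha):
    q, n, l = term
    alpha_p, alpha_H, alpha_sigma = alpha
    return alpha_p * q + alpha_H * n + 2 * alpha_sigma * l

def dominant_terms(terms, alpha):
    best = max(effective_exponent(t, alpha) for t in terms)
    return [t for t in terms if effective_exponent(t, alpha) == best]

# Hypothetical coefficient with three monomials sharing the same l:
terms = [(2, 0, 4), (1, 1, 4), (0, 2, 4)]
alpha = (1.0, 0.5, -0.25)            # p grows fastest, sigma shrinks
print(dominant_terms(terms, alpha))  # -> [(2, 0, 4)]
```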
The paper provides a complete classification of Pareto‑optimal terms for the identity target tensor F_{i₁…i_ν}=δ_{i₁…i_ν} in both symmetric (SYM) and asymmetric (ASYM) settings, except for the SYM odd‑ν case, which is left for future work. The resulting “Pareto polygons” in the (q,n) plane delineate distinct learning regimes:
- Free evolution (q=n=0) – the loss decays essentially as if the model were absent; this regime appears when the scaling makes all parameter contributions sub‑dominant.
- Neural Tangent Kernel (NTK) regime – characterized by linearized dynamics; it emerges only in the asymmetric case because the contraction constraints in SYM prevent the necessary edge‑color matching.
- Mean‑field regimes – under‑parameterized (q>0, n=0) and over‑parameterized (q=0, n>0) regimes where the loss obeys a nonlinear first‑order PDE of the form L_t + a L L_x = 0.
- Rich (rich‑learning) regimes – interior points of the Pareto polygon where both q and n are positive, leading to more intricate dynamics that still reduce to solvable first‑order PDEs.
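The correspondence between Pareto‑polygon corners and regimes described by the bullets above can be written as a tiny lookup; the labels are our shorthand for the summary's regimes, not the paper's exact taxonomy.

```python
# Sketch: mapping a point (q, n) of the Pareto polygon to a regime label
# (illustrative labeling based on the summary, not the paper's full taxonomy).
def regime(q, n):
    if q == 0 and n == 0:
        return "free evolution"
    if q > 0 and n == 0:
        return "under-parameterized mean-field"
    if q == 0 and n > 0:
        return "over-parameterized mean-field"
    return "rich"       # interior point: both q and n positive

print(regime(0, 0), "|", regime(3, 0), "|", regime(0, 2), "|", regime(3, 2))
```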
A central technical contribution is the method for summing the formal series in the large‑size limit. In all identified regimes the series collapses to a first‑order partial differential equation for the expected loss. By applying the method of characteristics, the authors obtain explicit closed‑form expressions for L(t) in several cases:
- Free evolution – L(t) follows a simple exponential decay.
- NTK – the model linearizes around initialization, and L(t) follows the exponential decay of the classic linearized (kernel) dynamics.
- Mean‑field (under‑parameterized) – the solution is a rational function of time reflecting a logistic‑type decay.
- Mean‑field (over‑parameterized) – the solution exhibits a finite‑time blow‑up in the loss for gradient ascent, and a saturating decay for descent.
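As a worked instance of the characteristics method, take (for illustration; this initial condition is ours, not the paper's) linear initial data L(x,0) = x in the model PDE L_t + aLL_x = 0. The implicit characteristic solution L = L₀(x − aLt) then closes to the rational‑in‑time form L(x,t) = x/(1 + at), the kind of logistic‑type decay mentioned above, which can be verified numerically:

```python
# Sketch: the model PDE L_t + a*L*L_x = 0 solved by characteristics.
# With linear initial data L(x, 0) = x, the implicit solution
# L = L0(x - a*L*t) yields the closed form L(x, t) = x / (1 + a*t).
a = 2.0

def L(x, t):
    return x / (1.0 + a * t)

# Finite-difference check that L satisfies the PDE at a sample point
x, t, h = 0.8, 0.5, 1e-5
L_t = (L(x, t + h) - L(x, t - h)) / (2 * h)   # central difference in t
L_x = (L(x + h, t) - L(x - h, t)) / (2 * h)   # central difference in x
residual = L_t + a * L(x, t) * L_x
print(residual)   # vanishes up to finite-difference error
```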
These analytical predictions are validated on synthetic tensor decomposition tasks. Experiments show that the loss trajectories match the derived formulas across a wide range of p, H, σ, and learning rates, confirming the correctness of the diagram‑based expansion and the Pareto‑optimal analysis.
Finally, the authors extend the framework to a fourth‑order tensor (ν=4) where the explicit solution is only available for gradient ascent. They demonstrate two qualitatively different “unlearning” regimes depending on the magnitude of the weight initialization, illustrating how the same formalism can capture both learning and forgetting dynamics.
In summary, the paper establishes a powerful diagrammatic expansion for gradient flow, links scaling exponents to a geometric Pareto picture, and leverages this structure to derive explicit loss dynamics in multiple regimes, including novel nonlinear solutions beyond the traditional NTK and mean‑field limits. This approach opens a systematic pathway to analyze and solve gradient‑based learning dynamics for a broad class of high‑dimensional models.