The Nuclear Route: Sharp Asymptotics of ERM in Overparameterized Quadratic Networks


We study the high-dimensional asymptotics of empirical risk minimization (ERM) in over-parametrized two-layer neural networks with quadratic activations trained on synthetic data. We derive sharp asymptotics for both training and test errors by mapping the $\ell_2$-regularized learning problem to a convex matrix sensing task with nuclear norm penalization. This reveals that capacity control in such networks emerges from a low-rank structure in the learned feature maps. Our results characterize the global minima of the loss and yield precise generalization thresholds, showing how the width of the target function governs learnability. This analysis bridges and extends ideas from spin-glass methods, matrix factorization, and convex optimization and emphasizes the deep link between low-rank matrix sensing and learning in quadratic neural networks.


💡 Research Summary

This paper investigates the high‑dimensional asymptotics of empirical risk minimization (ERM) for over‑parameterized two‑layer neural networks with quadratic activations, trained on synthetic Gaussian data. The authors consider a “student” network with m hidden units (m ≥ d) and a “teacher” network with m★ hidden units that generates the labels yμ = f★(xμ) + √Δ ξμ, where ξμ is standard Gaussian noise and the inputs xμ ∈ ℝ^d are i.i.d. N(0, I_d). Both networks use the centered quadratic activation σ(u)=u²−‖w‖²/d, so the function class consists of centered positive‑semidefinite quadratic forms.
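The teacher model above is easy to simulate directly. The sketch below uses toy sizes and a guessed normalization in which the centered quadratic network reduces to f★(x) = xᵀS★x − Tr(S★) (consistent with the summary's "centered PSD quadratic forms", though the paper's exact scaling may differ); by construction the labels are centered under x ~ N(0, I_d):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m_star, n, Delta = 30, 15, 5000, 0.1   # toy sizes, not the paper's regime

# Teacher weights and Gram matrix S* = (W*)^T W* / sqrt(m* d).
W_star = rng.standard_normal((m_star, d))
S_star = W_star.T @ W_star / np.sqrt(m_star * d)

# Centered quadratic teacher: f*(x) = x^T S* x - Tr(S*), so that
# E[f*(x)] = 0 for x ~ N(0, I_d).  The overall scale is an assumption.
X = rng.standard_normal((n, d))
f_star = np.einsum("ni,ij,nj->n", X, S_star, X) - np.trace(S_star)
y = f_star + np.sqrt(Delta) * rng.standard_normal(n)

print(y.mean())   # close to 0 by construction
```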

The learning objective is the ℓ₂‑regularized square loss
L(W)=∑_{μ=1}^n (yμ−f̂(xμ;W))² + λ‖W‖_F²,
with λ>0. The key technical step is to rewrite the non‑convex problem in terms of the symmetric matrix S = WᵀW/√(md). Under this mapping, the ℓ₂ penalty on W becomes a nuclear‑norm (trace) penalty λ·Tr(S) on S, while the data‑fit term becomes a linear measurement of S with Gaussian sensing matrices. Consequently, the original ERM is equivalent to a convex matrix sensing problem with nuclear‑norm regularization.
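The penalty equivalence is easy to verify numerically. The sketch below (dimensions are arbitrary) checks that the Frobenius penalty on W equals √(md)·Tr(S), and that Tr(S) coincides with the nuclear norm of S because S is positive semidefinite:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 30, 60                      # arbitrary input dimension and student width
W = rng.standard_normal((m, d))

# Change of variables from the summary: S = W^T W / sqrt(m d).
S = W.T @ W / np.sqrt(m * d)

# S is symmetric PSD, so its nuclear norm (sum of singular values)
# equals its trace -- this is why the l2 penalty on W turns into a
# trace/nuclear-norm penalty on S.
nuclear = np.linalg.svd(S, compute_uv=False).sum()
assert np.isclose(nuclear, np.trace(S))

# And the Frobenius penalty on W is exactly sqrt(m d) * Tr(S).
assert np.isclose(np.linalg.norm(W, "fro") ** 2, np.sqrt(m * d) * np.trace(S))
print(nuclear, np.trace(S))
```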

The authors work in the proportional asymptotic regime d → ∞, n ≈ α d², m ≈ κ d, m★ ≈ κ★ d, with α, κ, κ★ = O(1). The empirical spectral distribution of the teacher’s Gram matrix S★ = (W★)ᵀW★/√(m★ d) converges to a limiting law μ★ (e.g., Marchenko–Pastur when W★ has i.i.d. entries). The analysis assumes μ★ has bounded first two moments and compact support.
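For i.i.d. teacher weights the limiting law is a rescaled Marchenko–Pastur distribution. A quick empirical check (toy sizes and this summary's normalization, not the paper's experiments): the mean eigenvalue of S★ concentrates around √κ★, since Tr(S★)/d = ‖W★‖_F²/(d√(m★ d)) ≈ √(m★/d):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m_star = 400, 200               # toy sizes; kappa_star = m_star / d = 0.5
W_star = rng.standard_normal((m_star, d))

# Teacher Gram matrix with the summary's normalization.
S_star = W_star.T @ W_star / np.sqrt(m_star * d)
eigs = np.linalg.eigvalsh(S_star)

# With i.i.d. entries the spectrum follows a rescaled Marchenko-Pastur
# law whose first moment is sqrt(kappa_star).
kappa_star = m_star / d
print(eigs.mean(), np.sqrt(kappa_star))   # the two values nearly agree
```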

Using Gaussian universality, the random feature matrices (xxᵀ−I)/√d are replaced by GOE(d) matrices, reducing the problem to a rank‑penalized matrix recovery with i.i.d. Gaussian measurements. The authors then apply Approximate Message Passing (AMP) with a non‑separable denoiser tailored to the nuclear‑norm penalty. The state‑evolution (SE) equations of AMP admit a unique non‑trivial fixed point, which coincides with the global minimizer of the convex matrix problem and, via the mapping, with any global minimizer of the original ERM.
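A standard non‑separable denoiser for nuclear‑norm‑penalized problems is eigenvalue soft‑thresholding, the proximal operator of the trace penalty over the PSD cone. The paper's exact AMP denoiser may differ in detail; the sketch below (toy sizes, illustrative threshold) only shows how such a denoiser recovers a low‑rank PSD matrix from a noisy observation:

```python
import numpy as np

def nuclear_prox(Y, tau):
    """Soft-threshold the eigenvalues of (the symmetric part of) Y at tau,
    keeping only the PSD part.  This is the prox of tau*Tr(S) restricted
    to PSD matrices, a typical non-separable denoiser for nuclear-norm
    matrix problems (an illustrative choice, not the paper's exact one)."""
    vals, vecs = np.linalg.eigh((Y + Y.T) / 2)   # symmetrize first
    vals = np.maximum(vals - tau, 0.0)           # shrink and clip at zero
    return (vecs * vals) @ vecs.T

rng = np.random.default_rng(2)
d, r = 40, 3
U = rng.standard_normal((d, r))
S_true = U @ U.T / d                             # rank-3 PSD target
Y = S_true + 0.01 * rng.standard_normal((d, d))  # noisy observation

S_hat = nuclear_prox(Y, tau=0.15)
print(np.linalg.matrix_rank(S_hat))              # → 3: the noise bulk is zeroed out
```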

The SE analysis yields two scalar order parameters δ̄ and ε̄ that solve a coupled system of equations (Equation 6 in the paper). These equations involve the free convolution μ★_δ = μ★ ⊞ μ_sc,δ (the semicircular law of radius 2δ) and a scalar function J(a,b)=∫ (x−b)² dμ★_a(x). The regularization strength appears as λ̃ = √κ·λ. The final asymptotic formulas are:

  • Test error: E_test → 2α δ̄² − Δ²,
  • Scaled training loss: d⁻² E_L → (δ̄²)/(4 ε̄²) − λ̃² ∂₂ J(δ̄, λ̃ ε̄),

where ∂₂ denotes derivative with respect to the second argument of J. Remarkably, the expressions do not depend on κ as long as κ ≥ 1; thus any sufficiently wide student network (even massively over‑parameterized) achieves the same asymptotic performance.
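The function J can be estimated numerically by approximating the free convolution μ★ ⊞ μ_sc,a with a finite random‑matrix model: the eigenvalues of diag(μ★ samples) + a·GOE. The sketch below (the toy measure μ★, sizes, and the Monte‑Carlo approach are illustrative assumptions, not the paper's method) checks the estimate against the exact second‑moment identity J(a,b) = Var(μ★) + a² + (E[μ★] − b)², which holds because free convolution adds variances:

```python
import numpy as np

def J(a, b, mu_star_samples, d=500, rng=None):
    """Monte-Carlo estimate of J(a,b) = ∫ (x-b)^2 dμ*_a(x), where μ*_a is
    the free convolution of μ* with a semicircle law of radius 2a,
    approximated by the spectrum of diag(μ* samples) + a * GOE(d)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    D = np.diag(rng.choice(mu_star_samples, size=d))
    G = rng.standard_normal((d, d))
    H = (G + G.T) / np.sqrt(2 * d)      # GOE normalized to spectral radius 2
    eigs = np.linalg.eigvalsh(D + a * H)
    return np.mean((eigs - b) ** 2)

rng = np.random.default_rng(3)
samples = rng.uniform(0.0, 2.0, size=10_000)   # toy μ*: Uniform(0, 2)
a, b = 0.5, 1.0
exact = samples.var() + a**2 + (samples.mean() - b) ** 2
print(J(a, b, samples, rng=rng), exact)        # the two values nearly agree
```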

Moreover, the singular‑value distribution of the optimal weight matrix Ŵ is characterized explicitly (Equation 8). It consists of an atom at zero of mass F_δ (the CDF of μ★_δ at zero) plus a continuous density derived from μ★_δ shifted by λ̃ ε̄. Numerical experiments (Figures 1–2) confirm the theoretical predictions even for moderate dimensions (d ≈ 50–400), for both gradient‑based solvers and convex optimization of the equivalent matrix problem.

The paper leverages this framework to answer several concrete questions:

  1. Interpolation threshold – For λ → 0⁺, a unique global minimizer exists when the sample complexity α exceeds a critical value α_c(κ★, Δ). Below this threshold the loss landscape contains a continuum of minima.
  2. Perfect generalization – In the noiseless case (Δ=0), the test error vanishes once α surpasses α_perfect(κ★). The required α scales with the teacher width κ★; narrower (low‑rank) teachers need fewer samples.
  3. Low‑rank limit – When κ★ ≪ 1, the asymptotic test error scales linearly with κ★, showing that low‑rank target functions are learned efficiently despite the high ambient dimension.

The work unifies three previously separate strands of research: (i) teacher‑student dynamics for quadratic networks, (ii) high‑dimensional Bayesian optimality for wide networks, and (iii) low‑rank matrix recovery with nuclear‑norm regularization. It provides a rigorous justification for the empirical observation that weight decay (ℓ₂ regularization) implicitly promotes low‑effective‑rank solutions, thereby controlling capacity even in massively over‑parameterized regimes.

In summary, the authors deliver a complete, mathematically rigorous description of the training and generalization behavior of over‑parameterized quadratic neural networks. By mapping the problem to convex nuclear‑norm matrix sensing and analyzing it via AMP, they obtain sharp, closed‑form asymptotics for both training loss and test error, elucidate the role of model width, regularization, and target rank, and validate the theory with extensive simulations. This contribution deepens our theoretical understanding of why heavily over‑parameterized nonlinear models can generalize well and offers a concrete analytical tool for studying similar architectures beyond the linear or kernel regimes.

