Directional Convergence, Benign Overfitting of Gradient Descent in leaky ReLU two-layer Neural Networks

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

In this paper, we provide sufficient conditions for benign overfitting of fixed-width leaky ReLU two-layer neural network classifiers trained on mixture data via gradient descent. Our results are derived by establishing directional convergence of the network parameters and a classification error bound for the convergent direction. Our classification error bound also leads to the discovery of a new phase transition. Previously, directional convergence in (leaky) ReLU neural networks was established only for gradient flow. Due to the lack of directional convergence, previous results on benign overfitting were limited to networks trained on nearly orthogonal data. All of our results hold for mixture data, a broader setting than the nearly orthogonal data of prior work. We demonstrate our findings by showing that benign overfitting occurs with high probability in a much wider range of scenarios than previously known. Our results also allow us to characterize cases in which benign overfitting provably fails even when directional convergence occurs. Our work thus provides a more complete picture of benign overfitting in leaky ReLU two-layer neural networks.


💡 Research Summary

The paper investigates the phenomenon of benign overfitting in fixed-width two-layer neural networks with leaky ReLU activation, trained by discrete-time gradient descent on a mixture data distribution. The authors consider a binary classification setting where each data point is generated as $x = y\mu + z$, with label $y\in\{-1,+1\}$ equally likely, a deterministic signal vector $\mu$, and a noise vector $z$ that may follow either a sub-Gaussian (sG) or a polynomial-tail (PM) distribution. The network has the form

$$f(x;W)=\sum_{j=1}^{m} a_j\,\phi(\langle w_j,x\rangle)$$

with fixed second-layer weights $a_j\in\{\pm 1/\sqrt{m}\}$ and leaky ReLU $\phi(t)=\max\{t,\gamma t\}$ for $\gamma\in(0,1)$. Training minimizes the exponential loss $\ell(u)=e^{-u}$ via gradient descent with a constant step size $\alpha$.
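
As a concrete illustration, the data model and network above can be sketched in a few lines of NumPy. The standard Gaussian choice for the noise $z$ and all dimensions and constants below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def sample_mixture(n, d, mu, rng):
    """Draw n points from the mixture x = y*mu + z with y uniform on {-1,+1}.
    The standard Gaussian choice for z is an illustrative assumption."""
    y = rng.choice([-1.0, 1.0], size=n)
    z = rng.standard_normal((n, d))
    return y[:, None] * mu[None, :] + z, y

def leaky_relu(t, gamma):
    # phi(t) = max{t, gamma*t} for gamma in (0, 1)
    return np.maximum(t, gamma * t)

def forward(X, W, a, gamma):
    # f(x; W) = sum_j a_j * phi(<w_j, x>); W is (m, d), a is (m,)
    return leaky_relu(X @ W.T, gamma) @ a

rng = np.random.default_rng(0)
d, m, n = 50, 8, 20                                # illustrative sizes
mu = np.zeros(d); mu[0] = 3.0                      # signal along one coordinate
X, y = sample_mixture(n, d, mu, rng)
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)   # fixed second layer, +-1/sqrt(m)
W = 0.01 * rng.standard_normal((m, d))             # small initialization
out = forward(X, W, a, gamma=0.1)                  # one output per sample
```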

The central technical contribution is a directional convergence theorem for gradient descent, a result previously known only for gradient flow (continuous-time dynamics). The authors introduce the notion of "neuron activation": a neuron is said to be activated if $a_j y_i\,\phi(\langle w_j,x_i\rangle)>0$ for all training samples. By imposing deterministic conditions on the data, captured by an event $E(\theta_1,\theta_2)$ that bounds normalized inner products among the noise vectors and between noise and the signal, they prove that, under suitable choices of initialization magnitude $\sigma$ and step size $\alpha$, all neurons become activated after the first iteration and thereafter the weight vectors evolve only in direction, not magnitude. Consequently, the empirical loss decays as $O(t^{-1})$ and the parameter matrix converges to a limiting direction $\widehat W$.
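
The activation condition and the discrete-time update can be made concrete with a small sketch. Assuming the empirical exponential loss is averaged over the sample (the network sizes and step size here are illustrative):

```python
import numpy as np

def leaky_relu(t, gamma):
    return np.maximum(t, gamma * t)

def all_neurons_activated(X, y, W, a, gamma):
    """Neuron j is 'activated' when a_j * y_i * phi(<w_j, x_i>) > 0 for every sample i."""
    acts = a[None, :] * y[:, None] * leaky_relu(X @ W.T, gamma)   # (n, m)
    return bool(np.all(acts > 0))

def gd_step(X, y, W, a, gamma, alpha):
    """One step of gradient descent on the averaged exponential loss
    (1/n) * sum_i exp(-y_i * f(x_i; W))."""
    pre = X @ W.T                        # (n, m) preactivations <w_j, x_i>
    f = leaky_relu(pre, gamma) @ a       # (n,) network outputs
    coef = -y * np.exp(-y * f) / len(y)  # d(loss)/d(f_i)
    dphi = np.where(pre > 0, 1.0, gamma) # subgradient of the leaky ReLU
    grad = (coef[:, None] * a[None, :] * dphi).T @ X   # (m, d) gradient in W
    return W - alpha * grad
```

Iterating `gd_step` and monitoring `all_neurons_activated` shows the activation pattern the analysis tracks.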

Two regimes are analyzed:

  1. Positive correlation case (Case 1) – where for any distinct pair $i\neq k$ the inner product $\langle y_i x_i, y_k x_k\rangle$ is non-negative. This is guaranteed when the signal $\mu$ is sufficiently large relative to the noise, formalized in Assumption 4.1. Small initialization (Assumption 4.2) and a modest step size (Assumption 4.3) ensure immediate activation and stable convergence.

  2. Near-orthogonal case (Case 2) – where the inner products may be negative, but the data are almost orthogonal (controlled by small parameters $\varepsilon_1,\varepsilon_3$ in Assumptions 4.4 and 4.6). This regime includes the setting of prior works that required $\mu=0$ and strong orthogonality among samples.
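
A quick simulation, with illustrative dimensions and signal strength, shows how a strong signal makes all pairwise inner products $\langle y_i x_i, y_k x_k\rangle$ non-negative (Case 1), whereas with $\mu=0$ some of them are negative:

```python
import numpy as np

def min_pairwise_alignment(X, y):
    """Smallest <y_i x_i, y_k x_k> over distinct pairs; Case 1 needs this >= 0."""
    S = y[:, None] * X                    # rows are the signed samples y_i * x_i
    G = S @ S.T                           # Gram matrix of signed samples
    return G[~np.eye(len(y), dtype=bool)].min()

rng = np.random.default_rng(1)
d, n = 200, 30                            # illustrative dimension and sample size
y = rng.choice([-1.0, 1.0], size=n)
z = rng.standard_normal((n, d))
mu = np.zeros(d); mu[0] = 25.0            # strong signal: ||mu||^2 dominates noise
X_strong = y[:, None] * mu[None, :] + z   # Case 1: all alignments non-negative
X_nosig = z                               # mu = 0: alignments can be negative
```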

Under either set of assumptions, Theorem 4.8 states that the limiting direction $\widehat W$ solves a strictly convex optimization problem (5): minimize a weighted sum of squared norms of the positive-neuron vector $w^+$ and the negative-neuron vector $w^-$ subject to margin constraints that enforce a margin of at least one on all positively and negatively labeled samples. The solution yields a linear decision boundary $\bar w = |J^+|\sqrt{m}\,w^+ - |J^-|\sqrt{m}\,w^-$ such that $\operatorname{sign}(f(x;\widehat W)) = \operatorname{sign}(\langle x,\bar w\rangle)$. In other words, the network converges to the maximum-margin classifier for the given data.
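
Since the limiting classifier is linear, its flavor can be conveyed by a simplified stand-in for problem (5): a plain hard-margin program, minimize $\|w\|^2$ subject to $y_i\langle w, x_i\rangle \ge 1$, ignoring the paper's split into weighted positive- and negative-neuron parts. This is a sketch under that simplification, not the paper's exact program:

```python
import numpy as np
from scipy.optimize import minimize

def max_margin_direction(X, y):
    """Hard-margin program: minimize ||w||^2 subject to y_i * <w, x_i> >= 1.
    A simplified stand-in for the paper's problem (5), which instead minimizes a
    weighted sum of squared norms of w^+ and w^- under analogous constraints."""
    cons = {"type": "ineq", "fun": lambda w: y * (X @ w) - 1.0}
    res = minimize(lambda w: w @ w, x0=np.ones(X.shape[1]),
                   constraints=[cons], method="SLSQP")
    return res.x

# Tiny linearly separable example: the solver recovers the max-margin direction.
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w_hat = max_margin_direction(X, y)        # every margin y_i * <w, x_i> ends up >= 1
```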

The authors then analyze the generalization performance of this limiting classifier. For Gaussian mixture data they derive a lower bound on the classification error, showing that it matches the upper bound up to constants and thereby proving the tightness of their analysis. Moreover, they demonstrate that the deterministic event $E(\theta_1,\theta_2)$ holds with high probability for both sub-Gaussian and polynomial-tail mixtures when the network is sufficiently over-parameterized (i.e., the product of input dimension and hidden width exceeds the sample size). This extends prior results that were limited to sub-Gaussian mixtures and required the hidden width to grow with the sample size.
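
The near-orthogonality behind the event $E(\theta_1,\theta_2)$ is easy to observe empirically: normalized inner products of independent Gaussian noise vectors shrink roughly like $1/\sqrt{d}$ as the input dimension grows. The dimensions below are illustrative:

```python
import numpy as np

def max_normalized_overlap(Z):
    """Largest |<z_i, z_k>| / (||z_i|| * ||z_k||) over distinct pairs; the event
    E(theta_1, theta_2) bounds quantities of this kind."""
    U = Z / np.linalg.norm(Z, axis=1, keepdims=True)   # unit-normalize each row
    G = np.abs(U @ U.T)
    np.fill_diagonal(G, 0.0)                           # ignore self-overlaps
    return G.max()

rng = np.random.default_rng(2)
n = 30                                                 # illustrative sample size
low_dim = max_normalized_overlap(rng.standard_normal((n, 10)))
high_dim = max_normalized_overlap(rng.standard_normal((n, 10000)))
# In high dimension the worst-case overlap is far smaller than in low dimension.
```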

Compared with earlier literature, the paper makes several notable advances:

  • Gradient descent vs. gradient flow – It is the first work to establish directional convergence for the discrete‑time algorithm in leaky ReLU networks, removing the need for continuous‑time analysis.
  • Relaxed data assumptions – The analysis covers a broader class of mixtures, allowing a non-zero signal $\mu$ and polynomial-tail noise, whereas previous works were confined to nearly orthogonal or strictly sub-Gaussian settings.
  • Fixed‑width networks – The results hold for a constant hidden width, moving beyond the neural‑tangent‑kernel (NTK) or lazy‑training regimes that require the width to diverge with the sample size.
  • Explicit error lower bound – The derived lower bound for Gaussian mixtures quantifies precisely when benign overfitting can fail, even if directional convergence is achieved.

In conclusion, the paper provides a rigorous theoretical framework that links implicit bias of gradient descent, directional convergence, and maximum‑margin classification to explain benign overfitting in leaky ReLU two‑layer networks. It broadens the understanding of when and why over‑parameterized neural networks can generalize well despite perfect interpolation of noisy training data. Future directions suggested include extending the analysis to deeper architectures, other loss functions (e.g., hinge or logistic), adaptive learning‑rate schedules, and empirical validation on real‑world datasets.

