Optimal Initialization in Depth: Lyapunov Initialization and Limit Theorems for Deep Leaky ReLU Networks
The development of effective initialization methods requires an understanding of random neural networks. In this work, a rigorous probabilistic analysis of deep unbiased Leaky ReLU networks is provided. We prove a Law of Large Numbers and a Central Limit Theorem for the logarithm of the norm of network activations, establishing that, as the number of layers increases, their growth is governed by a parameter called the Lyapunov exponent. This parameter characterizes a sharp phase transition between vanishing and exploding activations, and we calculate the Lyapunov exponent explicitly for Gaussian or orthogonal weight matrices. Our results reveal that standard methods, such as He initialization or orthogonal initialization, do not guarantee activation stability for deep networks of low width. Based on these theoretical insights, we propose a novel initialization method, referred to as Lyapunov initialization, which sets the Lyapunov exponent to zero and thereby ensures that the neural network is as stable as possible, leading empirically to improved learning.
💡 Research Summary
The paper presents a rigorous probabilistic analysis of deep unbiased Leaky ReLU networks and introduces a novel initialization scheme—Lyapunov initialization—designed to keep activations stable across many layers. The authors first formalize a fully‑connected feed‑forward network of fixed width d, where each layer computes Xₗ = φ(Wₗ Xₗ₋₁) with φ(x)=max(x,αx) (α≠0) applied element‑wise. The weight matrices Wₗ are drawn i.i.d. from a distribution μ that either (i) has independent entries with bounded densities, finite second moments, and positive density near the identity (Assumption 3.1), or (ii) is absolutely continuous with respect to the Haar measure on a scaled orthogonal group η·O(d) (Assumption 3.2).
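The recursion Xₗ = φ(Wₗ Xₗ₋₁) is straightforward to simulate. The sketch below implements the unbiased forward pass with element-wise Leaky ReLU; the width, depth, α, and He-style Gaussian weight scale are illustrative choices, not values fixed by the paper.

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    # Element-wise phi(x) = max(x, alpha * x), as in the paper's setup.
    return np.maximum(x, alpha * x)

def forward(x0, weights, alpha=0.1):
    # Unbiased fully connected network: X_l = phi(W_l X_{l-1}).
    xs = [np.asarray(x0, dtype=float)]
    for W in weights:
        xs.append(leaky_relu(W @ xs[-1], alpha))
    return xs

rng = np.random.default_rng(0)
d, depth, alpha = 8, 20, 0.1
# Illustrative i.i.d. Gaussian weights with He-style variance 2/d.
weights = [rng.normal(0.0, np.sqrt(2.0 / d), size=(d, d)) for _ in range(depth)]
x0 = rng.normal(size=d)
x0 /= np.linalg.norm(x0)  # unit-norm input
activations = forward(x0, weights, alpha)
print(len(activations), float(np.linalg.norm(activations[-1])))
```

Even at moderate depth, the final norm can already be far from 1, which is the instability the paper quantifies.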
Under these mild conditions the authors prove two fundamental limit theorems for the logarithm of the Euclidean norm of the activations, Lₗ = log‖Xₗ‖. The Law of Large Numbers (Theorem 3.3) shows that Lₗ/ℓ converges almost surely (and in L¹ uniformly over unit‑norm inputs) to a deterministic constant λ_{μ,φ}, which they call the Lyapunov exponent of the network. If λ>0 the activations explode exponentially with depth; if λ<0 they vanish. This result extends the classical Furstenberg‑Kesten theorem for products of random matrices to the nonlinear setting of Leaky ReLU networks, a non‑trivial achievement because the nonlinearity destroys the usual multiplicative structure.
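The Law of Large Numbers can be checked numerically: track Lₗ = log‖Xₗ‖ along one realization and watch Lₗ/ℓ settle toward a constant. A minimal sketch, assuming Gaussian weights and illustrative choices of d, α, and σ; renormalizing the activation each layer is valid because Leaky ReLU is positively homogeneous, so it leaves Lₗ unchanged while avoiding overflow.

```python
import numpy as np

def log_norm_trajectory(d, alpha, sigma, depth, rng):
    # Accumulate L_l = log ||X_l|| layer by layer, renormalizing X_l to
    # unit norm after each step (harmless: phi(c x) = c phi(x) for c > 0).
    x = rng.normal(size=d)
    x /= np.linalg.norm(x)
    logs = []
    L = 0.0
    for _ in range(depth):
        h = rng.normal(0.0, sigma, size=(d, d)) @ x
        x = np.maximum(h, alpha * h)
        n = np.linalg.norm(x)
        L += np.log(n)
        logs.append(L)
        x /= n
    return np.array(logs)

rng = np.random.default_rng(0)
d, alpha, sigma = 4, 0.1, np.sqrt(2.0 / 4)   # He scale at small width
traj = log_norm_trajectory(d, alpha, sigma, depth=5000, rng=rng)
ratios = traj / np.arange(1, 5001)           # L_l / l, converging to lambda
print(ratios[99], ratios[999], ratios[4999])
```

The printed values stabilize with depth, and their sign indicates whether this initialization sits in the exploding or vanishing phase.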
The Central Limit Theorem (Theorem 3.4) refines this picture: after centering by ℓλ, the fluctuations of Lₗ are Gaussian with variance growing linearly in ℓ, i.e., (Lₗ – ℓλ)/√ℓ → N(0,γ_{μ,φ}) in distribution, where γ_{μ,φ}>0 depends on μ and α. Consequently, even when λ=0 the norm of activations fluctuates on the order of e^{±O(√ℓ)}.
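The predicted √ℓ scaling of the fluctuations can also be probed empirically: if Var(Lₗ) grows linearly in ℓ, the sample standard deviation of Lₗ should roughly double when the depth is quadrupled. A hedged sketch with illustrative parameters (small width, Gaussian weights):

```python
import numpy as np

def log_norm_samples(d, alpha, sigma, depth, n_runs, seed):
    # Draw i.i.d. samples of L_l = log ||X_l|| at a fixed depth l,
    # renormalizing per layer for numerical stability.
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(n_runs):
        x = rng.normal(size=d)
        x /= np.linalg.norm(x)
        L = 0.0
        for _ in range(depth):
            h = rng.normal(0.0, sigma, size=(d, d)) @ x
            x = np.maximum(h, alpha * h)
            n = np.linalg.norm(x)
            L += np.log(n)
            x /= n
        samples.append(L)
    return np.array(samples)

d, alpha, sigma = 4, 0.1, np.sqrt(2.0 / 4)
s100 = np.std(log_norm_samples(d, alpha, sigma, depth=100, n_runs=300, seed=0))
s400 = np.std(log_norm_samples(d, alpha, sigma, depth=400, n_runs=300, seed=1))
print(s400 / s100)  # roughly 2 if Var(L_l) grows linearly in l
```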
The authors then compute λ_{μ,φ} explicitly for two widely used families of weight initializations. For Gaussian weights W_{ij}∼N(0,σ²) they obtain λ = E
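Whatever the closed form, the exponent can also be calibrated numerically, which gives a sketch of the idea behind Lyapunov initialization (not necessarily the authors' exact procedure). Because Leaky ReLU is positively homogeneous, rescaling Gaussian weights by a factor c shifts the Lyapunov exponent by exactly log c, so a Monte Carlo estimate λ̂(σ₀) can be driven to zero by setting σ* = σ₀·e^{−λ̂}.

```python
import numpy as np

def estimate_lyapunov(d, alpha, sigma, depth=1000, n_runs=50, seed=0):
    # Monte Carlo estimate of lambda = lim L_l / l for Gaussian weights,
    # with per-layer renormalization for numerical stability.
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_runs):
        x = rng.normal(size=d)
        x /= np.linalg.norm(x)
        L = 0.0
        for _ in range(depth):
            h = rng.normal(0.0, sigma, size=(d, d)) @ x
            x = np.maximum(h, alpha * h)
            n = np.linalg.norm(x)
            L += np.log(n)
            x /= n
        estimates.append(L / depth)
    return float(np.mean(estimates))

# Scaling sigma by c scales each layer's output by c (phi is positively
# homogeneous), so lambda(c * sigma) = lambda(sigma) + log(c).  Hence
# sigma_star = sigma0 * exp(-lambda(sigma0)) sets the exponent to ~0.
d, alpha = 4, 0.1
sigma0 = np.sqrt(2.0 / d)                 # start from He initialization
lam0 = estimate_lyapunov(d, alpha, sigma0)
sigma_star = sigma0 * np.exp(-lam0)
lam_star = estimate_lyapunov(d, alpha, sigma_star, seed=1)
print(lam0, lam_star)  # lam_star should be near 0
```

Even at λ = 0, the CLT above says the norm still fluctuates on the order of e^{±O(√ℓ)}, so this calibration makes the network "as stable as possible" rather than exactly norm-preserving.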