Topology and Geometry of the Learning Space of ReLU Networks: Connectivity and Singularities


Understanding the properties of the parameter space in feed-forward ReLU networks is critical for effectively analyzing and guiding training dynamics. After initialization, training under gradient flow restricts the parameters to an algebraic variety that emerges from the positive homogeneity of the ReLU activation function. In this study, we examine two key questions about feed-forward ReLU networks built on general directed acyclic graph (DAG) architectures: the (dis)connectedness of the parameter space and the existence of singularities within it. We extend previous results by providing a thorough characterization of connectedness, highlighting the roles of bottleneck nodes and balance conditions associated with specific subsets of the network. Our findings demonstrate that singularities are intricately connected to the topology of the underlying DAG and its induced sub-networks. We discuss the reachability of these singularities and establish a principled connection with differentiable pruning. We validate our theory with simple numerical experiments.


💡 Research Summary

This paper investigates the geometry and topology of the parameter space of feed‑forward ReLU networks whose architectures are described by arbitrary directed acyclic graphs (DAGs). The authors begin by formalizing the well‑known positive‑homogeneity of ReLU (and Leaky ReLU) as a family of rescaling symmetries: for each hidden neuron v, multiplying all incoming weights by a positive scalar α and dividing all outgoing weights by α leaves the network function unchanged. When training is performed by continuous‑time gradient flow (GF) on a loss that depends only on the network output, the loss is constant along each rescaling orbit, which forces the gradient to be orthogonal to that orbit.
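This rescaling symmetry is easy to check numerically. The sketch below is a minimal illustration with a single hidden ReLU neuron in NumPy (all names and the specific architecture are illustrative, not taken from the paper): multiplying the incoming weight by a positive α and dividing the outgoing weight by α leaves the output unchanged.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def net(x, w_in, w_out):
    # One hidden ReLU neuron: y = w_out * relu(w_in * x).
    return w_out * relu(w_in * x)

rng = np.random.default_rng(0)
w_in, w_out = rng.normal(), rng.normal()
x = rng.normal(size=5)

alpha = 3.7  # any positive scalar
y1 = net(x, w_in, w_out)
y2 = net(x, alpha * w_in, w_out / alpha)
assert np.allclose(y1, y2)  # the rescaling orbit leaves the function unchanged
```

Positive homogeneity is what makes this work: relu(α·z) = α·relu(z) for α > 0, so the factor α introduced on the incoming side cancels against 1/α on the outgoing side.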

Using the incidence matrix B of the graph, with rows corresponding to input and output nodes removed (denoted B̃), the authors rewrite the orthogonality condition compactly as

 B̃ (θ ⊙ g(θ)) = 0,

where θ is the vector of edge weights, g(θ) is the gradient of the loss, and ⊙ denotes element‑wise multiplication. Differentiating the squared‑weight vector θ² along the flow yields

 d/dt B̃ θ(t)² = −2 B̃ (θ(t) ⊙ g(θ(t))) = 0,

so the quantity c = B̃ θ(0)² is conserved for all time. Consequently, the trajectory of any gradient‑flow training initialized in the set

 H_G(c) = { θ ∈ ℝ^{|E|} | B̃ θ² = c }

remains inside this algebraic variety for the entire training process. Each hidden node v contributes a quadratic equation

 ∑_{i→v} θ_{i,v}² − ∑_{v→j} θ_{v,j}² = c_v,

which can be interpreted as a hyperbolic quadric hypersurface in the local weight space.
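The conservation law survives (up to discretization error) when gradient flow is replaced by small-step gradient descent. The toy sketch below, a scalar chain x → hidden → output trained with squared loss and hand-coded gradients (an illustration, not one of the paper's experiments), tracks c = w₁² − w₂² and confirms it drifts only at the O(η²) error of the Euler discretization:

```python
import numpy as np

# Scalar chain: y = w2 * relu(w1 * x), trained on the single pair (x, t) = (1, 1).
# The hidden node's conserved quantity is c = w1**2 - w2**2.
w1, w2 = 0.8, 0.5
x, t = 1.0, 1.0
c0 = w1**2 - w2**2

lr = 1e-4
for _ in range(1000):
    y = w2 * max(w1 * x, 0.0)  # forward pass (pre-activation stays positive here)
    err = y - t
    g1 = err * w2 * x          # dL/dw1 for the squared loss 0.5 * err**2
    g2 = err * w1 * x          # dL/dw2
    w1 -= lr * g1
    w2 -= lr * g2

drift = abs((w1**2 - w2**2) - c0)
assert drift < 1e-4  # conserved up to O(lr**2) discretization error
```

The first-order term of the drift vanishes identically because w₁·g₁ − w₂·g₂ = err·x·(w₁w₂ − w₂w₁) = 0, exactly the orthogonality condition B̃(θ ⊙ g(θ)) = 0 specialized to this chain.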

The core contributions of the paper are twofold: (1) a complete characterization of when H_G(c) is connected, and (2) a detailed analysis of its singularities.

Connectedness. The authors identify “bottleneck” subsets of hidden nodes that separate all input‑to‑output paths. If such a subset S exists and the conserved constants c_v have opposite signs for nodes inside versus outside S, the quadratic system splits into two disjoint solution components, making H_G(c) disconnected. Conversely, if every bottleneck satisfies a “balance” condition, namely that for each node v the sum of squared incoming weights equals the sum of squared outgoing weights (so that c_v = 0 and no sign conflict can arise across the graph), then the variety is a single connected component. This result generalizes earlier work on shallow (single‑layer) networks to arbitrary DAGs, showing that the topology of the underlying graph directly governs the topology of the invariant set.

Singularities. Points where some edge weights become exactly zero correspond to the emergence of sub‑networks that are completely cut off from the rest of the graph. At such points the Jacobian of the defining equations loses rank, creating non‑smooth singularities of H_G(c). The paper proves that these singularities are in one‑to‑one correspondence with the disconnected sub‑graphs identified in the connectivity analysis. Moreover, for generic random initializations (which typically yield all c_v of the same sign), gradient flow never reaches these singular points because the flow is confined to the interior of the invariant set where the conservation law holds strictly. Hence singularities are “unreachable” under standard training dynamics.
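The rank drop can be made concrete: for the constraint map F(θ) = B̃θ², the Jacobian is 2·B̃·diag(θ), so zeroing every edge incident to a hidden node wipes out that node's row. A minimal sketch on a three-edge chain follows (the sign convention and node labels are illustrative choices, not the paper's notation):

```python
import numpy as np

# Chain DAG: input --e1--> h1 --e2--> h2 --e3--> output.
# Rows of B_tilde index the hidden nodes h1, h2; the convention here
# is +1 for an incoming edge and -1 for an outgoing one.
B_tilde = np.array([[1, -1,  0],    # h1: e1 in, e2 out
                    [0,  1, -1]])   # h2: e2 in, e3 out

def jacobian(theta):
    # F(theta) = B_tilde @ theta**2  =>  J = 2 * B_tilde @ diag(theta).
    return 2.0 * B_tilde * theta    # broadcasting scales each column by theta_i

theta_generic  = np.array([0.7, -1.2, 0.4])
theta_singular = np.array([0.0,  0.0, 0.4])  # both edges at h1 vanish: h1 is cut off

assert np.linalg.matrix_rank(jacobian(theta_generic)) == 2   # smooth point
assert np.linalg.matrix_rank(jacobian(theta_singular)) == 1  # rank drop: singularity
```

Note that zeroing a single edge (say e2 alone) does not drop the rank here; the Jacobian loses rank only when an entire node, i.e. a sub-network, is disconnected, matching the correspondence described above.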

Implications for pruning. Leveraging the link between singularities and network sparsity, the authors propose a differentiable pruning scheme based on a nuclear‑norm regularizer applied to the weight matrix. This regularizer encourages the squared‑weight flow to concentrate, driving certain edges toward zero and thereby inducing singular configurations that correspond to pruned sub‑networks. Empirically, simple L₁ regularization exhibits a similar effect, suggesting that standard sparsity‑inducing penalties may implicitly push the dynamics toward the singular regions of H_G(c). The experiments on small synthetic DAGs confirm that the regularized training converges to low‑dimensional singular manifolds without incurring loss in performance, effectively achieving lossless pruning.
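As a much-simplified illustration of how an L₁ penalty drives individual weights to zero, the sketch below uses a linear two-weight model with subgradient descent (an assumption-laden toy, not the paper's ReLU setting or its nuclear-norm scheme): the weight on an irrelevant input is pruned to near zero while the informative weight is retained.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)                   # informative feature
x2 = rng.normal(size=n)
x2 -= (x2 @ x1) / (x1 @ x1) * x1          # decorrelate features for a clean illustration
t = 2.0 * x1                              # target ignores x2 entirely

w = np.array([0.5, 0.5])
lr, lam = 0.1, 0.01
for _ in range(500):
    err = w[0] * x1 + w[1] * x2 - t
    grad = np.array([np.mean(err * x1), np.mean(err * x2)])
    grad += lam * np.sign(w)              # L1 subgradient term
    w -= lr * grad

assert abs(w[1]) < 1e-2                   # irrelevant edge driven to (near) zero
assert abs(w[0] - 2.0) < 0.1              # informative edge kept close to its true value
```

The data gradient for w₂ vanishes once w₂ is small, so the constant L₁ subgradient dominates and pins it near zero; w₂ then settles into an oscillation of amplitude about lr·lam around the sparse solution.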

Overall significance. By unifying symmetry‑based conservation laws with graph‑theoretic concepts, the paper provides a data‑independent, function‑space‑agnostic perspective on ReLU network training. The derived connectivity criteria give network designers concrete guidelines: avoid bottlenecks or enforce balance to ensure a connected learning manifold, which may improve optimization stability. The singularity analysis clarifies why certain pruning methods succeed and offers a principled, differentiable alternative that exploits the intrinsic geometry of the loss landscape rather than heuristic sparsity thresholds. The work thus bridges deep learning theory, algebraic geometry, and practical model compression, opening avenues for further research on topology‑aware architecture design and optimization algorithms.

