Gradient flow in parameter space is equivalent to linear interpolation in output space

We prove that the standard gradient flow in parameter space that underlies many training algorithms in deep learning can be continuously deformed into an adapted gradient flow which yields (constrained) Euclidean gradient flow in output space. Moreover, for the $L^{2}$ loss, if the Jacobian of the outputs with respect to the parameters is full rank (for fixed training data), then the time variable can be reparametrized so that the resulting flow is simply linear interpolation, and a global minimum can be achieved. For the cross-entropy loss, under the same rank condition and assuming the labels have positive components, we derive an explicit formula for the unique global minimum.


💡 Research Summary

The paper establishes a rigorous connection between the standard gradient flow in parameter space, commonly used to model continuous-time gradient descent in deep learning, and a Euclidean gradient flow in the network’s output space. By pre-conditioning the parameter-space dynamics with the pseudoinverse of the Jacobian $D = \partial \mathbf{x}/\partial\theta$, the authors construct an “adapted gradient flow” $\dot\theta = -(D^{\top}D)^{+}\nabla_\theta C$ whose induced dynamics on the output vector $\mathbf{x}(\theta)$ satisfy $\dot{\mathbf{x}} = -\nabla_{\mathbf{x}} C$. This flow shares exactly the same equilibrium points as the original parameter-space flow, and a one-parameter family of vector fields interpolates continuously between the two (Theorem 2.3).
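The mechanism can be checked numerically. The sketch below is illustrative, not from the paper: the toy output map `f`, the dimensions, and the finite-difference Jacobian are all assumptions. It applies the adapted update $\dot\theta = -(D^{\top}D)^{+}\nabla_\theta C$ to a small smooth map with more parameters than outputs and confirms that the induced output velocity $D\dot\theta$ equals $-\nabla_{\mathbf{x}} C$ when $D$ has full row rank:

```python
import numpy as np

rng = np.random.default_rng(0)
P, Q = 10, 3                      # parameter and output dimensions (toy sizes)
A = 0.3 * rng.normal(size=(Q, P))
y = rng.normal(size=Q)            # target outputs (toy)

def f(theta):
    """Toy smooth output map standing in for a network (illustrative)."""
    return np.tanh(A @ theta)

def jacobian(f, theta, eps=1e-5):
    """Central-difference Jacobian D = d f / d theta."""
    D = np.zeros((f(theta).size, theta.size))
    for j in range(theta.size):
        e = np.zeros_like(theta); e[j] = eps
        D[:, j] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return D

theta = rng.normal(size=P)
x = f(theta)
D = jacobian(f, theta)

grad_x = x - y                    # ∇_x C for the L² cost C = ½‖x − y‖²
grad_theta = D.T @ grad_x         # chain rule: ∇_θ C = Dᵀ ∇_x C

theta_dot = -np.linalg.pinv(D.T @ D) @ grad_theta   # adapted gradient flow
x_dot = D @ theta_dot             # induced output-space velocity

# With full row rank, D (DᵀD)⁺ Dᵀ is the identity on output space,
# so the induced dynamics are exactly ẋ = -∇_x C.
print(np.allclose(x_dot, -grad_x, atol=1e-4))
```

The key identity is that $D(D^{\top}D)^{+}D^{\top}$ is the orthogonal projector onto $\operatorname{range}(D)$, which reduces to the identity on output space precisely when $D$ has full row rank.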

For the squared-error ($L^{2}$) loss, the output-space flow reduces to a linear ordinary differential equation, $\dot{\mathbf{x}} = -(\mathbf{x}-\mathbf{y})/N$, the factor $1/N$ coming from averaging the loss over the $N$ training samples. When the Jacobian has full rank ($\operatorname{rank} D = QN$), the solution is explicit: $\mathbf{x}(s) = \mathbf{y} + e^{-s/N}(\mathbf{x}_0-\mathbf{y})$. Re-parameterising time as $t = 1 - e^{-s/N}$ turns the trajectory into exactly the linear interpolation $\mathbf{x}(t) = \mathbf{y} + (1-t)(\mathbf{x}_0-\mathbf{y})$ (Proposition 2.5). Under the full-rank condition, gradient flow in output space is therefore equivalent to moving along the straight line connecting the initial network outputs to the target labels. If the Jacobian loses rank, the flow acquires a projection term onto the range of $DD^{\top}$; the deviation from linear interpolation is quantified in Proposition 2.6 via a matrix-valued ODE involving the orthogonal projector onto the null space of the Jacobian.
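This reparametrisation is easy to verify numerically. In the sketch below (the dimensions, the value of $N$, and the integration horizon are arbitrary choices, not from the paper), the ODE $\dot{\mathbf{x}} = -(\mathbf{x}-\mathbf{y})/N$ is Euler-integrated, compared against the closed form $\mathbf{x}(s) = \mathbf{y} + e^{-s/N}(\mathbf{x}_0-\mathbf{y})$, and then checked against the straight-line parametrisation:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 4                                  # number of training samples (arbitrary)
x0 = rng.normal(size=6)                # initial network outputs (toy)
y = rng.normal(size=6)                 # target labels (toy)

# Euler-integrate ẋ = -(x - y)/N up to time s_end ...
s_end, steps = 8.0, 50_000
ds = s_end / steps
x = x0.copy()
for _ in range(steps):
    x += -ds * (x - y) / N

# ... and compare with the closed-form solution x(s) = y + e^{-s/N}(x0 - y).
closed = y + np.exp(-s_end / N) * (x0 - y)
print(np.allclose(x, closed, atol=1e-3))

# Reparametrising t = 1 - e^{-s/N} turns the trajectory into a straight line.
t = 1.0 - np.exp(-s_end / N)
print(np.allclose(closed, y + (1 - t) * (x0 - y)))  # exact by construction
```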

The analysis is extended to the cross-entropy loss with a softmax output. Assuming again that the Jacobian has full rank and that the label vectors have strictly positive components, the adapted flow again satisfies $\dot{\mathbf{x}} = -\nabla_{\mathbf{x}} C$. Solving the stationary condition $\nabla_{\mathbf{x}} C = 0$ yields a unique global minimiser $\mathbf{x}^* = \log \mathbf{y} + c\mathbf{1}$, where the constant $c$ enforces the softmax normalisation. This provides an explicit formula for the optimal network logits under the cross-entropy loss (Proposition 2.8).
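The formula can be checked directly: for a strictly positive label vector $\mathbf{y}$ summing to one, the logits $\log\mathbf{y} + c\mathbf{1}$ are mapped back to $\mathbf{y}$ by the softmax for any constant $c$ (a minimal sketch; the label vector and the constant are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(z):
    e = np.exp(z - z.max())            # shift for numerical stability
    return e / e.sum()

# a strictly positive (soft) label vector, normalised to sum to one
y = rng.uniform(0.1, 1.0, size=5)
y /= y.sum()

c = 1.7                                # any constant: softmax is shift-invariant
x_star = np.log(y) + c                 # claimed global minimiser of the CE loss

print(np.allclose(softmax(x_star), y))  # the logits reproduce the labels
```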

A key conceptual contribution is the emphasis on the rank of the Jacobian (equivalently, of the Neural Tangent Kernel $DD^{\top}$). The authors argue that a full-rank Jacobian guarantees that the output-space flow descends a convex objective, and that any fixed point of the flow must correspond either to exact data interpolation or to a rank-deficient Jacobian. When the Jacobian is rank-deficient, the dynamics are confined to a lower-dimensional submanifold, which can prevent convergence to a global minimum.
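The role of rank can be illustrated with a deliberately rank-deficient Jacobian (a synthetic example, not from the paper): the NTK $DD^{\top}$ inherits the rank of $D$, the adapted flow only moves the outputs inside $\operatorname{range}(D)$, and the component of $-\nabla_{\mathbf{x}} C$ in the complementary directions is projected away:

```python
import numpy as np

rng = np.random.default_rng(3)
Q, P = 4, 10
B = rng.normal(size=(Q - 1, P))
D = np.vstack([B, B[0]])          # duplicated row: Jacobian has rank 3, not 4

ntk = D @ D.T                     # Neural Tangent Kernel (Gram matrix of D)
print(np.linalg.matrix_rank(ntk) == np.linalg.matrix_rank(D))

grad_x = rng.normal(size=Q)       # a generic output-space gradient ∇_x C
# The adapted flow induces ẋ = -D (DᵀD)⁺ Dᵀ ∇_x C, i.e. minus the
# projection of ∇_x C onto range(D).
proj = D @ np.linalg.pinv(D.T @ D) @ D.T
x_dot = -proj @ grad_x

print(np.allclose(proj @ x_dot, x_dot))   # motion stays inside range(D)
print(not np.allclose(x_dot, -grad_x))    # full steepest descent is blocked
```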

Section 3 discusses several implications: (a) one can prescribe any smooth path in output space and recover a corresponding parameter trajectory via the pseudoinverse construction; (b) the phenomenon of “neural collapse” emerges naturally under the trivialised dynamics of the L² case (Corollary 3.2); (c) Theorem 2.3 is re‑interpreted in terms of the Neural Tangent Kernel, highlighting the geometric nature of the homotopy between the two flows.
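Point (a) can be sketched as follows: given any desired output-space velocity $\dot{\mathbf{x}}$, the minimum-norm parameter velocity $\dot\theta = D^{+}\dot{\mathbf{x}}$ realises it exactly whenever $D$ has full row rank (toy dimensions and a generic random Jacobian, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(4)
Q, P = 3, 8
D = rng.normal(size=(Q, P))           # a generic full-row-rank Jacobian

v = rng.normal(size=Q)                # any prescribed output-space velocity ẋ
theta_dot = np.linalg.pinv(D) @ v     # minimum-norm parameter velocity D⁺ ẋ

print(np.allclose(D @ theta_dot, v))  # the prescribed path is realised exactly
```

Integrating such velocities along a smooth output-space path recovers the corresponding parameter trajectory via the same pseudoinverse construction.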

In summary, the paper provides a unifying geometric framework that transforms the often intractable non‑convex optimisation in parameter space into a tractable Euclidean gradient flow in output space, provided the Jacobian retains full rank. This perspective not only yields explicit solution formulas for common loss functions but also clarifies the role of the Jacobian/NTK in governing training dynamics, offering a fresh lens through which to analyse and potentially design more robust optimisation algorithms for deep neural networks.

