Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks

Introduction

Through their myriad successful applications across a wide range of disciplines, it is now well established that deep neural networks possess an unprecedented ability to model complex real-world datasets, and in many cases they can do so with minimal overfitting. Indeed, the list of practical achievements of deep learning has grown at an astonishing rate, and includes models capable of human-level performance in tasks such as image recognition, speech recognition, and machine translation.

Yet to each of these deep learning triumphs corresponds a large engineering effort to produce such a high-performing model. Part of the practical difficulty in designing good models stems from a proliferation of hyperparameters and a poor understanding of the general guidelines for their selection. Given a candidate network architecture, some of the most impactful hyperparameters are those governing the choice of the model’s initial weights. Although considerable study has been devoted to the selection of initial weights, relatively little has been proved about how these choices affect important quantities such as the rate of convergence of gradient descent.

In this work, we examine the effect of initialization on the rate of convergence of gradient descent in deep linear networks. We provide for the first time a rigorous proof that drawing the initial weights from the orthogonal group speeds up convergence relative to the standard Gaussian initialization with iid weights. In particular, we show that for deep networks, the width needed for efficient convergence for orthogonal initializations is independent of the depth, whereas the width needed for efficient convergence of Gaussian networks scales linearly in the depth.

Orthogonal weight initializations have been the subject of significant prior theoretical and empirical investigation. For example, in a line of work focusing on dynamical isometry, it was found that orthogonal weights can speed up convergence for deep linear networks and for deep non-linear networks when they operate in the linear regime. In the context of recurrent neural networks, orthogonality can help improve the system’s stability. A main limitation of prior work is that it has focused almost exclusively on a model’s properties at initialization. In contrast, our analysis focuses on the benefit of orthogonal initialization throughout the entire training process, thereby establishing a provable benefit for optimization.

The paper is organized as follows. After reviewing related work in Section 16 and establishing some preliminaries in Section 15, we present our main positive result on efficient convergence from orthogonal initialization in Section 7. In Section 18, we show that Gaussian initialization leads to exponentially long convergence time if the width is too small compared with the depth. In Section 17, we perform experiments to support our theoretical results.

Efficient Convergence using Orthogonal Initialization

In this section we present our main positive result for orthogonal initialization. We show that orthogonal initialization enables efficient convergence of gradient descent to a global minimum provided that the hidden width is not too small.

In order to properly define orthogonal weights, we let the widths of all hidden layers be equal: $`d_1=d_2=\cdots=d_{L-1}=m`$, and let $`m \ge \max\{d_x, d_y\}`$. Note that all intermediate matrices $`W_2, \ldots, W_{L-1}`$ are $`m\times m`$ square matrices, and $`W_1 \in \R^{m\times d_x}, W_L\in\R^{d_y\times m}`$. We sample each initial weight matrix $`W_i(0)`$ independently from a uniform distribution over scaled orthogonal matrices satisfying

\begin{equation}
 \label{eqn:ortho-init}
\begin{aligned}
&W_1^\top(0) W_1(0) = mI_{d_x},\\
&W_i^\top(0) W_i(0) = W_i(0)W_i^\top(0) = mI_m, \qquad 2\le i \le L-1, \\
&W_L(0)W_L^\top(0)=mI_{d_y}.
\end{aligned}
\end{equation}

In accordance with such initialization, the scaling factor $`\alpha`$ in [eqn:linear-net] is set as $`\alpha = \frac{1}{\sqrt{m^{L-1}d_y}}`$, which ensures $`\expect{\norm{f(x; W_L(0), \ldots, W_1(0))}^2} = \norm{x}^2`$ for any $`x\in \R^{d_x}`$.1 The same scaling factor was adopted in , which preserves the expectation of the squared $`\ell_2`$ norm of any input.
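As a sanity check on this normalization, the following sketch (illustrative dimensions of our choosing; orthogonal factors drawn via QR of a Gaussian matrix, one standard way to sample them) averages $`\norm{f(x)}^2`$ over fresh initializations and compares it with $`\norm{x}^2`$:

```python
import numpy as np

rng = np.random.default_rng(0)

def scaled_orthogonal(rows, cols, m, rng):
    # Haar-random orthogonal factor from the QR of a Gaussian matrix, scaled by sqrt(m).
    n = max(rows, cols)
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return np.sqrt(m) * q[:rows, :cols]

L, m, d_x, d_y = 6, 32, 8, 4
alpha = 1.0 / np.sqrt(m ** (L - 1) * d_y)
x = rng.standard_normal(d_x)

# Average ||f(x)||^2 over fresh initializations; it concentrates around ||x||^2.
sq_norms = []
for _ in range(500):
    Ws = [scaled_orthogonal(m, d_x, m, rng)]                      # W_1
    Ws += [scaled_orthogonal(m, m, m, rng) for _ in range(L - 2)]  # W_2 .. W_{L-1}
    Ws += [scaled_orthogonal(d_y, m, m, rng)]                     # W_L
    out = alpha * np.linalg.multi_dot(Ws[::-1]) @ x
    sq_norms.append(out @ out)

ratio = np.mean(sq_norms) / (x @ x)  # close to 1
```

Note that the first $`L-1`$ layers are exact isometries (up to the $`\sqrt{m}`$ scaling); only the last layer, which projects onto a random $`d_y`$-dimensional subspace, contributes randomness to the squared norm, which is why the identity holds in expectation rather than pointwise.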

Let $`W^* \in \argmin_{W \in \R^{d_y \times d_x}} \norm{WX-Y}_F`$ and $`\opt = \frac12 \norm{W^* X - Y}_F^2`$. Then $`\opt`$ is the minimum value for the objective [eqn:loss-func]. Denote $`r = \rank(X)`$, $`\kappa = \frac{\lambda_{\max}(X^\top X)}{\lambda_r(X^\top X)}`$, and $`\tilde{r} = \frac{\norm{X}_F^2}{\norm{X}^2}`$.2 Our main theorem in this section is the following:

Suppose

\begin{equation}
 \label{eqn:m-bound-for-ortho}
    m\ge C \cdot \tilde{r} \kappa^2 \left( d_y(1+\norm{W^*}^2) + \log(r/\delta) \right) \text{ and } m \ge d_x,
\end{equation}

for some $`\delta\in(0, 1)`$ and a sufficiently large universal constant $`C>0`$. Set the learning rate $`\eta \le \frac{d_y}{2L \norm{X}^2}`$. Then with probability at least $`1-\delta`$ over the random initialization, we have

\begin{align*}
    &\ell(0) - \opt \le O\left( 1 + \frac{\log(r/\delta)}{d_y} + \norm{W^*}^2 \right) \norm{X}_F^2, \\
    &\ell(t) - \opt \le \left( 1 - \frac{1}{2} \eta L \lambda_r(X^\top X) / d_y \right)^t (\ell(0)-\opt), \quad t = 0, 1, 2, \ldots,
\end{align*}

where $`\ell(t)`$ is the objective value at iteration $`t`$.
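The quantities appearing in the theorem can be computed directly from the data; a minimal numpy sketch with arbitrary illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_y, n = 4, 3, 10                     # illustrative sizes
X = rng.standard_normal((d_x, n))
Y = rng.standard_normal((d_y, n))

# W* = argmin ||W X - Y||_F is the least-squares solution Y X^+.
W_star = Y @ np.linalg.pinv(X)
opt = 0.5 * np.linalg.norm(W_star @ X - Y, 'fro') ** 2

s = np.linalg.svd(X, compute_uv=False)     # singular values of X
r = int(np.sum(s > 1e-10))                 # r = rank(X)
kappa = s[0] ** 2 / s[r - 1] ** 2          # lambda_max(X^T X) / lambda_r(X^T X)
stable_rank = (s ** 2).sum() / s[0] ** 2   # r~ = ||X||_F^2 / ||X||^2, at most r
```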

Notably, in , the width $`m`$ need not depend on the depth $`L`$. This is in sharp contrast with the result of for Gaussian initialization, which requires $`m\ge \tilde{\Omega}(L r \kappa^3 d_y)`$. It turns out that a near-linear dependence between $`m`$ and $`L`$ is necessary for Gaussian initialization to have efficient convergence, as we will show in . Therefore the requirement in is nearly tight in terms of the dependence on $`L`$. These results together rigorously establish the benefit of orthogonal initialization in optimizing very deep linear networks.

If we set the learning rate optimally according to Theorem [thm:ortho] to $`\eta = \Theta(\frac{d_y}{L\norm{X}^2})`$, we obtain that $`\ell(t) - \opt`$ decreases by a ratio of $`1 - \Theta(\kappa^{-1})`$ after every iteration. This matches the convergence rate of gradient descent on the ($`1`$-layer) linear regression problem $`\min\limits_{W\in\R^{d_y\times d_x}} \frac12 \norm{WX-Y}_F^2`$.
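The following sketch (not the paper’s experiments; small illustrative sizes, with $`X`$ constructed to have condition number $`\kappa = 4`$ so the rate is predictable, and widths too small to satisfy the theorem’s constant) runs gradient descent from an orthogonal initialization with the theorem’s step size and exhibits the geometric decrease of $`\ell(t)`$:

```python
import numpy as np

rng = np.random.default_rng(1)
L, m, d_x, d_y, n = 4, 16, 3, 2, 8

def scaled_orthogonal(rows, cols, m, rng):
    q, _ = np.linalg.qr(rng.standard_normal((max(rows, cols),) * 2))
    return np.sqrt(m) * q[:rows, :cols]

def chain(mats, dim):
    # product mats[-1] @ ... @ mats[0]; identity of size dim if mats is empty
    P = np.eye(dim)
    for M in mats:
        P = M @ P
    return P

# A well-conditioned X with known singular values {2, 1.5, 1}, so kappa = 4.
u, _ = np.linalg.qr(rng.standard_normal((d_x, d_x)))
v, _ = np.linalg.qr(rng.standard_normal((n, n)))
X = u @ np.diag([2.0, 1.5, 1.0]) @ v[:d_x, :]
Y = rng.standard_normal((d_y, d_x)) @ X        # realizable targets, so opt = 0

Ws = [scaled_orthogonal(m, d_x, m, rng)]
Ws += [scaled_orthogonal(m, m, m, rng) for _ in range(L - 2)]
Ws += [scaled_orthogonal(d_y, m, m, rng)]
alpha = 1.0 / np.sqrt(m ** (L - 1) * d_y)

eta = d_y / (2 * L * np.linalg.norm(X, 2) ** 2)  # the theorem's step size
losses = []
for t in range(500):
    U = alpha * chain(Ws, d_x) @ X
    losses.append(0.5 * np.linalg.norm(U - Y) ** 2)
    R = U - Y
    grads = []
    for k in range(L):
        left = chain(Ws[k + 1:], Ws[k].shape[0])   # W_{L:k+2}
        right = chain(Ws[:k], d_x)                 # W_{k:1}
        # d loss / d W_{k+1} = alpha * W_{L:k+2}^T (U - Y) (W_{k:1} X)^T
        grads.append(alpha * left.T @ R @ (right @ X).T)
    Ws = [W - eta * g for W, g in zip(Ws, grads)]
```

With these parameters the theorem’s guaranteed per-step contraction factor is $`1 - \frac{1}{4\kappa} = \frac{15}{16}`$, so after 500 iterations the loss should be negligible compared to its initial value.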

Proof of

The proof uses the high-level framework from , which tracks the evolution of the network’s output during optimization. This evolution is closely related to a time-varying positive semidefinite (PSD) matrix (defined in [eqn:P-def]), and the proof relies on carefully upper and lower bounding the eigenvalues of this matrix throughout training, which in turn implies the desired convergence result.

First, we can make the following simplifying assumption without loss of generality. See Appendix B in for justification.

(Without loss of generality) $`X \in \R^{d_x \times r}`$, $`\rank(X) = r`$, $`Y = W^* X`$, and $`\opt =0`$.

Now we briefly review ’s framework. The key idea is to look at the network’s output, defined as

\begin{align*}
U = \alpha W_{L:1} X \in \mathbb{R}^{d_y \times n}.
\end{align*}

We also write $`U(t) = \alpha W_{L:1}(t) X`$ as the output at time $`t`$. Note that $`\ell(t) = \frac{1}{2}\norm{U(t)-Y}_F^2`$. According to the gradient descent update rule, we write

\begin{align*}
&W_{L:1}(t+1) 
= \prod_i \left( W_i(t) - \eta \frac{\partial \ell}{\partial W_i}(t)  \right) 
= W_{L:1}(t) - \sum_{i=1}^L \eta W_{L:i+1}(t) \frac{\partial \ell}{\partial W_i}(t) W_{i-1:1}(t) + E(t),
\end{align*}

where $`E(t)`$ contains all the high-order terms (i.e., those with $`\eta^2`$ or higher). With this definition, the evolution of $`U(t)`$ can be written as the following equation:

\begin{equation}
 \label{eqn:U-dynamics}
\begin{aligned}
&\vectorize{U(t+1) - U(t)} 
= -\eta P(t) \cdot \vectorize{U(t)-Y} + \alpha \cdot \vectorize{E(t)X},
\end{aligned}
\end{equation}

where

\begin{equation}
 \label{eqn:P-def}
\begin{aligned}
    P(t) = \alpha^2 \sum_{i=1}^L \Big[ \left( \left( W_{i-1:1}(t) X \right)^\top \left( W_{i-1:1}(t)X \right) \right)  \otimes \left( W_{L:i+1}(t)  W_{L:i+1}^\top(t) \right) \Big] .
    \end{aligned}
\end{equation}

Notice that $`P(t)`$ is always PSD since it is the sum of $`L`$ PSD matrices. Therefore, in order to establish convergence, we only need to (i) show that the higher-order term $`E(t)`$ is small and (ii) prove upper and lower bounds on $`P(t)`$’s eigenvalues. For the second task, it suffices to control the singular values of $`W_{i-1:1}(t)`$ and $`W_{L:i+1}(t)`$ ($`i\in[L]`$).3 Under orthogonal initialization, these matrices are perfectly isometric at initialization, and we will show that they stay close to isometry during training, thus enabling efficient convergence.
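At initialization, the isometries in [eqn:ortho-init] collapse every summand of [eqn:P-def]: $`(W_{i-1:1}(0)X)^\top (W_{i-1:1}(0)X) = m^{i-1} X^\top X`$ and $`W_{L:i+1}(0)W_{L:i+1}^\top(0) = m^{L-i} I_{d_y}`$, so $`P(0) = \frac{L}{d_y} (X^\top X) \otimes I_{d_y}`$. The sketch below (illustrative sizes) assembles $`P(0)`$ term by term from [eqn:P-def] and verifies this:

```python
import numpy as np

rng = np.random.default_rng(2)
L, m, d_x, d_y, n = 4, 8, 3, 2, 5

def scaled_orthogonal(rows, cols, m, rng):
    q, _ = np.linalg.qr(rng.standard_normal((max(rows, cols),) * 2))
    return np.sqrt(m) * q[:rows, :cols]

def chain(mats, dim):
    # product mats[-1] @ ... @ mats[0]; identity of size dim if mats is empty
    P = np.eye(dim)
    for M in mats:
        P = M @ P
    return P

X = rng.standard_normal((d_x, n))
Ws = [scaled_orthogonal(m, d_x, m, rng)]
Ws += [scaled_orthogonal(m, m, m, rng) for _ in range(L - 2)]
Ws += [scaled_orthogonal(d_y, m, m, rng)]
alpha = 1.0 / np.sqrt(m ** (L - 1) * d_y)

# Assemble P(0) term by term, following the definition of P(t).
P = np.zeros((n * d_y, n * d_y))
for i in range(1, L + 1):
    A = chain(Ws[:i - 1], d_x) @ X           # W_{i-1:1} X
    B = chain(Ws[i:], Ws[i - 1].shape[0])    # W_{L:i+1}
    P += alpha ** 2 * np.kron(A.T @ A, B @ B.T)

# Orthogonality makes every term isometric: P(0) = (L/d_y) * (X^T X kron I_{d_y}).
eigs = np.linalg.eigvalsh(P)
target = np.linalg.eigvalsh((L / d_y) * np.kron(X.T @ X, np.eye(d_y)))
```

In particular the eigenvalues of $`P(0)`$ are $`\frac{L}{d_y}\lambda_j(X^\top X)`$, which is exactly the scale appearing in the convergence rate of the theorem.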

The following lemma summarizes some properties at initialization.

At initialization, we have

\begin{equation}
 \label{eqn:init-spec}
    \begin{aligned}
    &\sigma_{\max}(W_{j:i}(0)) = \sigma_{\min}(W_{j:i}(0)) = m^{\frac{j-i+1}{2}} , & \forall 1\le i\le j \le L, (i, j)\not=(1,L).
    \end{aligned}
\end{equation}

Furthermore, with probability at least $`1-\delta`$, the loss at initialization satisfies

\begin{equation}
 \label{eqn:init-loss}
    \ell(0) \le O\left( 1 + \frac{\log(r/\delta)}{d_y} + \norm{W^*}^2 \right) \norm{X}_F^2.
\end{equation}

Proof sketch. The spectral property [eqn:init-spec] follows directly from [eqn:ortho-init].

To prove [eqn:init-loss], we essentially need to upper bound the magnitude of the network’s initial output. This turns out to be equivalent to studying the magnitude of the projection of a vector onto a random low-dimensional subspace, which we can bound using standard concentration inequalities. The details are given in Appendix 14.1. ◻
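The spectral property [eqn:init-spec] can also be checked numerically; the sketch below (illustrative sizes) verifies that every partial product of the scaled orthogonal factors, except the full product $`W_{L:1}(0)`$, has all singular values equal to $`m^{\frac{j-i+1}{2}}`$:

```python
import numpy as np

rng = np.random.default_rng(3)
L, m, d_x, d_y = 5, 10, 4, 3

def scaled_orthogonal(rows, cols, m, rng):
    q, _ = np.linalg.qr(rng.standard_normal((max(rows, cols),) * 2))
    return np.sqrt(m) * q[:rows, :cols]

Ws = [scaled_orthogonal(m, d_x, m, rng)]
Ws += [scaled_orthogonal(m, m, m, rng) for _ in range(L - 2)]
Ws += [scaled_orthogonal(d_y, m, m, rng)]

# For each partial product W_{j:i}(0), record the worst relative deviation of its
# singular values from m^{(j-i+1)/2}.
devs = []
for i in range(L):
    for j in range(i, L):
        if (i, j) == (0, L - 1):
            continue  # the full product W_{L:1}(0) is excluded in the lemma
        prod = Ws[i]
        for W in Ws[i + 1:j + 1]:
            prod = W @ prod
        s = np.linalg.svd(prod, compute_uv=False)
        devs.append(np.max(np.abs(s / m ** ((j - i + 1) / 2) - 1.0)))
```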

Now we proceed to prove . We define $`B = O \left( 1 + \frac{\log(r/\delta)}{d_y} + \norm{W^*}^2 \right) \norm{X}_F^2`$ which is the upper bound on $`\ell(0)`$ from [eqn:init-loss]. Conditioned on [eqn:init-loss] being satisfied, we will use induction on $`t`$ to prove the following three properties $`\gA(t)`$, $`\gB(t)`$ and $`\gC(t)`$ for all $`t=0, 1, \ldots`$:

  • $`\gA(t)`$: $`\ell(t) \le \left( 1 - \frac12 \eta L \sigma_{\min}^2(X) / d_y \right)^t \ell(0) \le \left( 1 - \frac12 \eta L \sigma_{\min}^2(X) /d_y \right)^t B`$.

  • $`\gB(t)`$: $`\sigma_{\max}(W_{i:1}(t) X) \le 1.1\, m^{\frac i2} \sigma_{\max}(X), \ \sigma_{\min}(W_{i:1}(t) X) \ge 0.9\, m^{\frac i2} \sigma_{\min}(X), \quad \forall 1\le i < L`$, and $`\sigma_{\max}(W_{L:i}(t)) \le 1.1\, m^{\frac{L-i+1}{2}}, \quad \forall 1 < i \le L`$.

  • $`\gC(t)`$: $`\norm{W_i(t) - W_i(0)}_F \le \frac{8\sqrt{Bd_y}\norm{X}}{ L \sigma_{\min}^2(X)}, \quad \forall 1\le i\le L`$.

$`\gA(0)`$ and $`\gB(0)`$ are true according to , and $`\gC(0)`$ is trivially true. In order to prove $`\gA(t)`$, $`\gB(t)`$ and $`\gC(t)`$ for all $`t`$, we will prove the following claims for all $`t\ge0`$:

$`\gA(0), \ldots, \gA(t), \gB(0), \ldots, \gB(t) \Longrightarrow \gC(t+1)`$.

$`\gC(t) \Longrightarrow \gB(t)`$.

$`\gA(t), \gB(t) \Longrightarrow \gA(t+1)`$.

The proofs of these claims are given in Appendix 14. Notice that we finish the proof of once we prove $`\gA(t)`$ for all $`t\ge 0`$.

Conclusion

In this work, we studied the effect of initialization on the convergence time of gradient descent in deep linear neural networks. We found that when the initial weights are iid Gaussian, the convergence time grows exponentially in the depth unless the width is at least as large as the depth. In contrast, when the initial weight matrices are drawn from the orthogonal group, the width needed to guarantee efficient convergence is independent of the depth. These results provide the first rigorous proof that orthogonal initialization is superior to Gaussian initialization in terms of convergence time.


  1. We have $`\expect{\norm{f(x; W_L(0), \ldots, W_1(0))}^2} = \alpha^2 \expect{x^\top W_1^\top(0) \cdots W_L^\top(0) W_L(0) \cdots W_1(0) x }`$. Note that by our choice [eqn:ortho-init] we have $`\expect{W_L^\top(0) W_L(0)}=d_yI_m`$ and $`W_i^\top(0) W_i(0)=mI \,(1\le i \le L-1)`$, so we have $`\expect{\norm{f(x; W_L(0), \ldots, W_1(0))}^2} = \alpha^2 m^{L-1} d_y \norm{x}^2 = \norm{x}^2`$. ↩︎

  2. $`\tilde{r}`$ is known as the stable rank of $`X`$, which is always no more than the rank. ↩︎

  3. Note that for symmetric matrices $`A`$ and $`B`$, the set of eigenvalues of $`A \otimes B`$ is the set of products of an eigenvalue of $`A`$ and an eigenvalue of $`B`$. ↩︎