Simple2Complex: Global Optimization by Gradient Descent

Reading time: 5 minutes
...

📝 Original Info

  • Title: Simple2Complex: Global Optimization by Gradient Descent
  • ArXiv ID: 1605.00404
  • Date: 2016-05-03
  • Authors: Ming Li

📝 Abstract

A method named simple2complex for modeling and training deep neural networks is proposed. Simple2complex trains deep neural networks by smoothly adding more and more layers to a shallow network; as learning proceeds, the network effectively grows. Compared with end2end learning, simple2complex is less likely to become trapped in local minima, i.e., it has the capacity for global optimization. CIFAR-10 is used to verify the superiority of simple2complex.

💡 Deep Analysis

Figure 1: directed-graph representation of the series networks, with the single-layer $N_{s_0}$ on the left and the grown network $N_{s_n}$ on the right.

📄 Full Content

Deep learning has dramatically advanced the state of the art in many fields such as vision, NLP, and games. Stochastic gradient descent (SGD) and its variants such as Momentum and Adagrad, used to achieve state-of-the-art performance on many problems, have proved to be an effective way of training deep neural networks [1,2]. In recent years, it has become popular to model a complex problem as an extremely deep neural network and train it with SGD in an end-to-end way. It is well known that SGD is an optimization method based on gradient descent, and that end-to-end optimization by gradient descent is prone to falling into local minima as the nonlinearity of the parameter space grows. Though this is alleviated by batch normalization [3] and residual networks [4,5,6], it still becomes more and more difficult as the depth of the neural network increases. Therefore, two questions arise: does a global optimization method based on gradient descent exist, and is there a better choice than training end to end?
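To make the local-minimum concern concrete, here is a minimal toy sketch (my own illustration, not from the paper): plain gradient descent on a one-dimensional non-convex function ends up in whichever basin its starting point feeds into.

```python
# My own toy example (not from the paper): plain gradient descent on a
# non-convex 1-D function converges to whichever minimum its start point feeds into.

def f(x):
    # Non-convex objective with a deep minimum near x ~ -2.35 and a shallow one near x ~ 2.1.
    return 0.1 * x**4 - x**2 + 0.5 * x

def grad_f(x):
    return 0.4 * x**3 - 2.0 * x + 0.5

def gradient_descent(x0, lr=0.01, steps=2000):
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

print(gradient_descent(-3.0))  # ends near -2.35, the deeper (global) minimum
print(gradient_descent(+3.0))  # ends near  2.1, stuck in the shallower local minimum
```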

In this paper, the above questions are answered by proposing a method named simple2complex for modeling and training deep neural networks.

The world is complex and highly non-convex, but if we look at it hierarchically, “from global to local” or “from abstract to detail”, a simple and relatively convex world presents itself. “One cannot see the true face of Mount Lu, because one is standing within the mountains.” Imagine we are birds flying high over a mountain and looking down at it: we see its global appearance and can easily find the approximate region where the lowest valley is located. We then fly toward that region and lower our flight altitude slightly to see it more clearly, further narrowing the region. This process repeats until we find the exact position of that valley, and the road to Arcadia is right over there.
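The bird's-eye search reads as a coarse-to-fine strategy. Below is a minimal sketch of that idea on the same kind of toy one-dimensional objective (my own illustration; the sampling grid and shrink factor are arbitrary choices, not values from the paper).

```python
# My own sketch of the "global to local" idea (not the paper's algorithm):
# coarsely scan a region, zoom into the most promising sub-region, and repeat.
import numpy as np

def f(x):
    # Toy non-convex objective; global minimum near x ~ -2.35.
    return 0.1 * x**4 - x**2 + 0.5 * x

def coarse_to_fine_min(lo=-4.0, hi=4.0, samples=9, rounds=8):
    best = 0.5 * (lo + hi)
    for _ in range(rounds):
        xs = np.linspace(lo, hi, samples)   # "look down from high above"
        best = xs[np.argmin(f(xs))]         # roughly locate the lowest valley
        width = (hi - lo) / 4.0             # "descend" and narrow the field of view
        lo, hi = best - width, best + width
    return best

print(coarse_to_fine_min())  # close to the global minimum near x ~ -2.35
```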

It is interesting to consider the long history of biological evolution as a single optimization process. The process began with the emergence of multi-molecular systems about four billion years ago, followed by prokaryotes, eukaryotes, invertebrates, fish, reptiles, birds, mammals, apes, and finally humans. Early species possessed simple stimulus-responsive molecular structures as the original prototype of a nervous system; under long and rigorous natural selection, this prototype evolved into shallow neural systems, at which stage the fundamental global biological structure was formed. Then, driven by various kinds of stimuli such as light, sound, and chemical compounds, neurons close to those stimuli continuously divided and grew; as neurons grew toward different stimuli, various sub-neural structures formed, and those sub-neural structures eventually evolved into eyes, ears, noses, and so on. As nervous systems grew deeper and deeper, more advanced species arose with the assistance of natural selection. This process continued for hundreds of millions of years until, finally, human beings appeared. It is a typical simple2complex optimization procedure: first, learn a shallow neural system that models basic global biological functions; then let the neural system grow deeper and deeper to model more and more complicated biological functions. On the contrary, if we imagine the history of biological evolution as an end2end optimization procedure, then at the very beginning there would have to be a huge deep neural network initialized by lots of random inorganic matter, and optimizing that system by nature would be much harder than the simple2complex way described above.

We usually model a complex problem with a highly non-convex function $f_c(x)$ that has a large number of parameters, then search over those parameters by gradient descent, and are almost always trapped in local minima. In this paper, a complex problem is instead modeled by a sequence of functions $f_{s_0}(x), f_{s_1}(x), \ldots, f_{s_n}(x)$, in which $f_{s_0}$ is a relatively simple and convex function with few parameters, and each later function in the sequence is deeper and more expressive than the one before it, with the transition between successive functions kept smooth.
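As a rough illustration of this growing sequence of models (my own sketch, not the paper's code; the near-identity initialization of each added layer is an assumption consistent with "smoothly adding more and more layers"), one can picture $f_{s_{i+1}}$ as $f_{s_i}$ extended by one extra layer that initially barely changes the function.

```python
# A hedged sketch (mine, not the paper's code) of the sequence f_{s_0}, ..., f_{s_n}:
# each step extends the previous model by one layer that starts near the identity,
# so the deeper model initially behaves like the simpler one it grew out of.
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class GrowingMLP:
    def __init__(self, dim):
        self.dim = dim
        # f_{s_0}: a single layer with few parameters.
        self.layers = [rng.normal(scale=0.1, size=(dim, dim))]

    def grow(self):
        # Assumption: the new layer starts near the identity, so the function changes smoothly.
        self.layers.append(np.eye(self.dim) + rng.normal(scale=1e-3, size=(self.dim, self.dim)))

    def forward(self, x):
        for w in self.layers:
            x = relu(x @ w)
        return x

net = GrowingMLP(dim=4)
x = rng.normal(size=(2, 4))
y0 = net.forward(x)           # output of the shallow f_{s_0}
net.grow()                    # the model now represents f_{s_1}
y1 = net.forward(x)
print(np.abs(y1 - y0).max())  # tiny: growth barely changes the function at first
```

The printed difference is tiny, which is the sense in which the sequence is meant to change smoothly: each deeper model starts its training from the behavior of the simpler one.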

The difference between end2end and simple2complex is that end2end optimizes the highly non-convex $f_c(x)$ directly, whereas simple2complex optimizes the sequence $f_{s_0}(x), f_{s_1}(x), \ldots, f_{s_n}(x)$ in order, starting from the simple function and carrying each solution forward as the model grows.

For simplicity, the simple-to-complex method is denoted s2c and the end-to-end method is denoted e2e.
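The control-flow difference between the two regimes can be sketched as follows (a schematic reading of the text, not the paper's code; `build_network`, `grow`, and `sgd_steps` are hypothetical stand-ins for a model constructor, the layer-adding step, and an ordinary SGD training loop).

```python
# A schematic reading of the two regimes (not the paper's code). `build_network`,
# `grow`, and `sgd_steps` are hypothetical stand-ins for a model constructor, the
# layer-adding step, and an ordinary SGD training loop; only the control flow matters.

def build_network(depth):
    return {"depth": depth}          # stand-in for a real model of the given depth

def grow(net):
    net["depth"] += 1                # stand-in for adding a near-identity layer

def sgd_steps(net, steps):
    pass                             # stand-in for `steps` gradient-descent updates

def train_e2e(full_depth, steps):
    net = build_network(full_depth)  # the full, highly non-convex model from the start
    sgd_steps(net, steps)
    return net

def train_s2c(full_depth, steps_per_stage):
    net = build_network(1)           # start from the simple, nearly convex f_{s_0}
    sgd_steps(net, steps_per_stage)
    while net["depth"] < full_depth: # grow toward f_{s_n}, reusing the previous solution
        grow(net)
        sgd_steps(net, steps_per_stage)
    return net
```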

As shown in Figure 1, a directed edge represents a conv-bn layer; one or several directed edges emitted from a point indicate a data split, and one or several directed edges pointing to a point indicate an element-wise add operation followed by a nonlinear activation layer. The left graph in Figure 1 represents $N_{s_0}$ with only one layer, and the right graph represents $N_{s_n}$. This type of neural network can be called a series neural network, because any neuron in the network can be seen as a mathematical series; the type of series depends on the type of nonlinear activation layer. It can be called a Fourier series neural network if a trigonometric function such as tanh is used, and a Taylor series neural network if a power function such as $y = x^2$ is used. In this paper, ReLU series neural networks are used in all experiments. The valu…
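A hedged PyTorch-style sketch of one such node follows (my reading of Figure 1; the framework, channel count, and kernel size are assumptions): each incoming directed edge is a conv-bn layer, incoming edges are merged by element-wise addition, and the merged result passes through a nonlinearity (ReLU, as in the paper's experiments).

```python
# A hedged PyTorch-style sketch of one node of Figure 1 (my reading; the framework,
# channel count, and kernel size are assumptions, not values from the paper).
import torch
import torch.nn as nn

class ConvBNEdge(nn.Module):
    """One directed edge in Figure 1: convolution followed by batch normalization."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        return self.bn(self.conv(x))

class SeriesNode(nn.Module):
    """A point where several edges meet: element-wise add, then a nonlinear activation."""
    def __init__(self, channels, n_edges=2):
        super().__init__()
        self.edges = nn.ModuleList(ConvBNEdge(channels) for _ in range(n_edges))
        self.act = nn.ReLU()

    def forward(self, x):
        # The edges all leave the same point (a data split) and are summed at this node.
        return self.act(sum(edge(x) for edge in self.edges))

x = torch.randn(1, 16, 32, 32)            # a CIFAR-10-sized feature map, for example
print(SeriesNode(channels=16)(x).shape)   # torch.Size([1, 16, 32, 32])
```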

📸 Image Gallery

cover.png

Reference

This content is AI-processed based on open access ArXiv data.
