Deep Big Simple Neural Nets Excel on Handwritten Digit Recognition
Good old on-line back-propagation for plain multi-layer perceptrons yields a very low 0.35% error rate on the famous MNIST handwritten digits benchmark. All we need to achieve this best result so far are many hidden layers, many neurons per layer, numerous deformed training images, and graphics cards to greatly speed up learning.
arXiv:1003.0358v1 [cs.NE] 1 Mar 2010

Automatic handwriting recognition is of great academic and commercial interest. Current algorithms are already quite good at learning to recognize handwritten digits: post offices use them to sort letters; banks use them to read personal checks. MNIST (LeCun et al., 1998; http://yann.lecun.com/exdb/mnist/) is the most widely used benchmark for isolated handwritten digit recognition. More than a decade ago, artificial neural networks called Multilayer Perceptrons or MLPs (Werbos, 1974; LeCun, 1985; Rumelhart et al., 1986) were among the first classifiers tested on MNIST. Most had few layers or few artificial neurons (units) per layer (LeCun et al., 1998), but they were apparently the biggest feasible MLPs at the time, trained when CPU cores were at least 20 times slower than today. A more recent MLP with a single hidden layer of 800 units achieved 0.70% error (Simard et al., 2003).
However, more complex methods listed on the MNIST web page always seemed to outperform MLPs, and the general trend went towards more and more complex variants of Support Vector Machines or SVMs (Decoste & Schölkopf, 2002), combinations of NNs and SVMs (Lauer et al., 2007), etc. Convolutional neural networks (CNNs) achieved a record-breaking 0.40% error rate (Simard et al., 2003) using novel elastic training image deformations. More recent methods pre-train each hidden CNN layer one by one in an unsupervised fashion (this seems especially promising for small training sets), then use supervised learning to achieve a 0.39% error rate (Ranzato et al., 2006, 2007).
The biggest MLP so far (Salakhutdinov et al., 2007) was also pre-trained without supervision, then piped its output into another classifier to achieve an error of 1% without domain-specific knowledge.
Are all these complexifications of plain MLPs really necessary? Can't one simply train really big plain MLPs on MNIST? Why is there no literature on this? One reason is that at first glance deep MLPs do not seem to work better than shallow networks (Bengio et al., 2006). Training them is hard, as back-propagated gradients quickly vanish exponentially in the number of layers (Hochreiter, 1991; Hochreiter et al., 2001; Hinton, 2007), just like in the first recurrent neural networks (Hochreiter & Schmidhuber, 1997). Indeed, previous deep networks successfully trained with back-propagation (BP) either had few free parameters due to weight-sharing (e.g. LeCun et al., 1998; Simard et al., 2003) or used unsupervised, layer-wise pre-training (e.g. Bengio et al., 2006; Ranzato et al., 2006).

But is it really true that deep BP-MLPs do not work at all, or do they just need more training time? How to test this? Unfortunately, on-line BP for hundreds or thousands of epochs on large MLPs may take weeks or months on standard serial computers. But can't one parallelize it? On computer clusters this is hard due to communication latencies between individual machines, and multi-threading on a multi-core processor is not easy either. We may speed up BP using SSE (Streaming SIMD Extensions), either manually or by setting appropriate compiler flags. The maximum theoretical speedup under single-precision floating point, however, is four, which is not enough. And MNIST is large: its 60,000 images take almost 50 MB, too much to fit in the L2/L3 cache of any current processor, forcing continual access to considerably slower RAM. To summarize, it is currently next to impossible to train big MLPs on CPUs.
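The exponential vanishing of back-propagated gradients mentioned above can be illustrated with a toy experiment: push a unit error signal backwards through a stack of tanh layers initialized with small uniform weights (the [-0.05, 0.05] range used later in this paper) and watch its norm shrink at every layer. This is a sketch for intuition only; layer width, depth, and the random seed are arbitrary choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_norms_through_depth(n_layers=9, width=100):
    """Back-propagate a unit error through stacked tanh layers and
    record the gradient norm after each layer (toy illustration)."""
    x = rng.uniform(-1.0, 1.0, width)
    # Small uniform init, as in the paper's [-0.05, 0.05] scheme.
    Ws = [rng.uniform(-0.05, 0.05, (width, width)) for _ in range(n_layers)]
    # Forward pass, storing pre-activations for the backward pass.
    acts, a = [], x
    for W in Ws:
        z = W @ a
        acts.append(z)
        a = np.tanh(z)
    # Backward pass: at each layer the gradient is multiplied by
    # tanh'(z) and by W^T, so its norm decays with depth.
    g = np.ones(width)
    norms = []
    for W, z in zip(reversed(Ws), reversed(acts)):
        g = (g * (1.0 - np.tanh(z) ** 2)) @ W
        norms.append(np.linalg.norm(g))
    return norms

norms = grad_norms_through_depth()
print(norms[0], norms[-1])  # the last norm is orders of magnitude smaller
```

With these small initial weights the gradient norm drops by a roughly constant factor per layer, so after nine layers almost no learning signal reaches the bottom — which is why longer training (and faster hardware) is the question the paper raises.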
We will show how to overcome all these problems by training large, deep MLPs on graphics cards.
MNIST consists of two datasets, one for training (60,000 images) and one for testing (10,000 images). Many studies divide the training set into two subsets of 50,000 images for training and 10,000 for validation. Our network is trained on slightly deformed images, continually generated in on-line fashion; hence we may use the whole un-deformed training set for validation, without wasting training images. Pixel intensities of the original gray-scale images range from 0 (background) to 255 (maximum foreground intensity). The 28 × 28 = 784 pixels per image are mapped to real values intensity/127.5 - 1.0 in [-1.0, 1.0] and fed into the NN input layer.
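The pixel mapping above is a simple affine rescaling; a minimal sketch (function name and array layout are our own, not from the paper):

```python
import numpy as np

def normalize(images_u8):
    """Map grayscale pixel values from [0, 255] to [-1.0, 1.0],
    as fed to the network's 784-unit input layer."""
    return images_u8.astype(np.float32) / 127.5 - 1.0

img = np.array([[0, 127, 255]], dtype=np.uint8)
print(normalize(img))  # background 0 -> -1.0, max intensity 255 -> 1.0
```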
We train five MLPs with 2 to 9 hidden layers and varying numbers of hidden units. Mostly, but not always, the number of hidden units per layer decreases towards the output layer (Table 1). The networks have 1.34 to 12.11 million free parameters (weights, or synapses).
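For a fully connected MLP the free-parameter count follows directly from the layer sizes: each layer contributes (inputs × outputs) weights plus one bias per output unit. The layer sizes below are a hypothetical example, not a row of Table 1 (which is not reproduced here):

```python
def count_weights(layer_sizes):
    """Number of free parameters (weights + biases) of a fully
    connected MLP with the given layer sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical net: 784 inputs, hidden layers shrinking towards 10 outputs.
print(count_weights([784, 1000, 500, 10]))  # -> 1290510, i.e. ~1.29 million
```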
We use standard on-line BP (e.g. Russell & Norvig, 2002, pages 744-748), without momentum, but with a variable learning rate that shrinks by a multiplicative constant after each epoch, from 10^-3 down to 10^-6. Weights are initialized from a uniform random distribution in [-0.05, 0.05]. Each neuron's activation function is a scaled hyperbolic tangent: y(a) = A tanh(Ba), where A = 1.7159 and B = 0.6666 (LeCun et al., 1998).
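The activation function and learning-rate schedule can be sketched as follows. The per-epoch decay factor is an assumption for illustration (the text only states that the rate shrinks multiplicatively from 10^-3 to 10^-6); A, B, and the init range are the paper's values.

```python
import numpy as np

A, B = 1.7159, 0.6666  # scaled-tanh constants from the paper

def activation(a):
    """Scaled hyperbolic tangent y(a) = A tanh(B a)."""
    return A * np.tanh(B * a)

def lr_schedule(epoch, lr0=1e-3, lr_min=1e-6, decay=0.99):
    """Learning rate shrinks by a multiplicative constant each epoch,
    from 1e-3 down to a floor of 1e-6 (decay factor is our assumption)."""
    return max(lr0 * decay ** epoch, lr_min)

# Weight init in the paper's uniform range [-0.05, 0.05].
w = np.random.uniform(-0.05, 0.05, size=(784,))
print(lr_schedule(0), lr_schedule(10_000))
```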
…(Full text truncated)…