Knowledge Matters: Importance of Prior Information for Optimization
Çağlar Gülçehre — gulcehrc@iro.umontreal.ca
Département d'informatique et de recherche opérationnelle, Université de Montréal, Montréal, QC, Canada

Yoshua Bengio — bengioy@iro.umontreal.ca
Département d'informatique et de recherche opérationnelle, Université de Montréal, Montréal, QC, Canada

Editor: Not Assigned

Abstract

We explore the effect of introducing prior information into the intermediate level of deep supervised neural networks for a learning task on which all of the black-box state-of-the-art machine learning algorithms tested have failed to learn. We motivate our work from the hypothesis that there is an optimization obstacle involved in the nature of such tasks, and that humans learn useful intermediate concepts from other individuals via a form of supervision or guidance using a curriculum. The experiments we have conducted provide positive evidence in favor of this hypothesis. In our experiments, a two-tiered MLP architecture is trained on a dataset in which each image input contains three sprites, and the binary target class is 1 if all three have the same shape. Black-box machine learning algorithms performed only at chance on this task. Standard deep supervised neural networks also failed. However, using a particular structure and guiding the learner by providing intermediate targets in the form of intermediate concepts (the presence of each object) allows the learner to nail the task. Much better than chance but imperfect results are also obtained by exploring architecture and optimization variants, pointing towards a difficult optimization task. We hypothesize that the learning difficulty is due to the composition of two highly non-linear tasks.
Our findings are also consistent with hypotheses on cultural learning inspired by the observations of effective local minima (possibly due to ill-conditioning and the training procedure not being able to escape what appears like a local minimum).

Keywords: Deep Learning, Neural Networks, Optimization, Evolution of Culture, Curriculum Learning, Training with Hints

1. Introduction

There is a recent emerging interest in different fields of science for cultural learning (Henrich and McElreath, 2003) and how groups of individuals exchanging information can learn in ways superior to individual learning. This is also witnessed by the emergence of new research fields such as "Social Neuroscience". Learning from other agents in an environment by means of cultural transmission of knowledge with peer-to-peer communication is an efficient and natural way of acquiring or propagating common knowledge. The most popular belief on how information is transmitted between individuals is that bits of information are transmitted in small units, called memes, which share some characteristics of genes, such as self-replication, mutation and response to selective pressures (Dawkins, 1976).

This paper is based on the hypothesis (which is further elaborated in Bengio (2013a)) that human culture and the evolution of ideas have been crucial to counter an optimization issue: this difficulty would otherwise make it difficult for human brains to capture high-level knowledge of the world without the help of other educated humans.
In this paper, machine learning experiments are used to investigate some elements of this hypothesis by seeking answers to the following questions: are there machine learning tasks which are intrinsically hard for a lone learning agent but that may become very easy when intermediate concepts are provided by another agent as additional intermediate learning cues, in the spirit of Curriculum Learning (Bengio et al., 2009b)? What makes such learning tasks more difficult? Can specific initial values of the neural network parameters yield success when random initialization yields complete failure? Is it possible to verify that the problem being faced is an optimization problem rather than a regularization problem? These are the questions discussed (if not completely addressed) here, which relate to the following broader question: how can humans (and potentially one day, machines) learn complex concepts?

In this paper, results of different machine learning algorithms on an artificial learning task involving binary 64 × 64 images are presented. In that task, each image in the dataset contains 3 Pentomino tetris sprites (simple shapes). The task is to figure out if all the sprites in the image are the same or if there are different sprite shapes in the image. Several state-of-the-art machine learning algorithms have been tested and none of them could perform better than a random predictor on the test set. Nevertheless, by providing hints about the intermediate concepts (the presence and location of particular sprite classes), the problem can easily be solved, whereas the same-architecture neural network without the intermediate-concepts guidance fails. Surprisingly, our attempts at solving this problem with unsupervised pre-training algorithms also failed. However, with specific variations in the network architecture or training procedure, it is found that one can make a big dent in the problem.
To show the impact of intermediate-level guidance, we experimented with a two-tiered neural network, with supervised pre-training of the first part to recognize the category of sprites independently of their orientation and scale, at different locations, while the second part learns from the output of the first part and predicts the binary task of interest.

The objective of this paper is not to propose a novel learning algorithm or architecture, but rather to refine our understanding of the learning difficulties involved with composed tasks (here a logical formula composed with the detection of object classes), in particular the training difficulties involved for deep neural networks. The results also bring empirical evidence in favor of some of the hypotheses from Bengio (2013a), discussed below, as well as introducing a particular form of curriculum learning (Bengio et al., 2009b).

Building difficult AI problems has a long history in computer science. Specifically, hard AI problems have been studied to create CAPTCHAs that are easy to solve for humans, but hard to solve for machines (Von Ahn et al., 2003). In this paper we are investigating a problem that is difficult for off-the-shelf black-box machine learning algorithms.¹

1.1 Curriculum Learning and Cultural Evolution Against Effective Local Minima

What Bengio (2013a) calls an effective local minimum is a point where iterative training stalls, either because of an actual local minimum or because the optimization algorithm is unable (in reasonable time) to find a descent path (e.g., because of serious ill-conditioning).

1. You can access the source code of some experiments presented in this paper and their hyperparameters from here: https://github.com/caglar/kmatters
In this paper, it is hypothesized that some more abstract learning tasks, such as those obtained by composing simpler tasks, are more likely to yield effective local minima for neural networks, and are generally hard for general-purpose machine learning algorithms. The idea that learning can be enhanced by guiding the learner through intermediate easier tasks is old, starting with animal training by shaping (Skinner, 1958; Peterson, 2004; Krueger and Dayan, 2009). Bengio et al. (2009b) introduce a computational hypothesis related to a presumed issue with effective local minima when directly learning the target task: the good solutions correspond to hard-to-find-by-chance effective local minima, and intermediate tasks prepare the learner's internal configuration (parameters) in a way similar to continuation methods in global optimization (which go through a sequence of intermediate optimization problems, starting with a convex one where local minima are no issue, and gradually morphing into the target task of interest).

In a related vein, Bengio (2013a) makes the following inferences based on experimental observations of deep learning and neural network learning:

Point 1: Training deep architectures is easier when some hints are given about the function that the intermediate levels should compute (Hinton et al., 2006; Weston et al., 2008; Salakhutdinov and Hinton, 2009; Bengio, 2009). The experiments performed here expand in particular on this point.

Point 2: It is much easier to train a neural network with supervision (where examples are provided to it of when a concept is present and when it is not present in a variety of examples) than to expect unsupervised learning to discover the concept (which may also happen but usually leads to poorer renditions of the concept). The poor results obtained with unsupervised pre-training reinforce that hypothesis.
Point 3: Directly training all the layers of a deep network together not only makes it difficult to exploit all the extra modeling power of a deeper architecture, but in many cases it actually yields worse results as the number of required layers is increased (Larochelle et al., 2009; Erhan et al., 2010). The experiments performed here also reinforce that hypothesis.

Point 4: Erhan et al. (2010) observed that no two training trajectories ended up in the same effective local minimum, out of hundreds of runs, even when comparing solutions as functions from input to output, rather than in parameter space (thus eliminating from the picture the presence of symmetries and multiple local minima due to relabeling and other reparametrizations). This suggests that the number of different effective local minima (even when considering them only in function space) must be huge.

Point 5: Unsupervised pre-training, which changes the initial conditions of the descent procedure, sometimes allows one to reach substantially better effective local minima (in terms of generalization error!), and these better local minima do not appear to be reachable by chance alone (Erhan et al., 2010). The experiments performed here provide another piece of evidence in favor of the hypothesis that where random initialization can yield rather poor results, specifically targeted initialization can have a drastic impact, i.e., that effective local minima are not just numerous but that some small subset of them are much better and hard to reach by chance.²

Based on the above points, Bengio (2013a) then proposed the following hypotheses regarding the learning of high-level abstractions.

• Optimization Hypothesis: When it learns, a biological agent performs an approximate optimization with respect to some implicit objective function.
• Deep Abstractions Hypothesis: Higher-level abstractions represented in brains require deeper computations (involving the composition of more non-linearities).

• Local Descent Hypothesis: The brain of a biological agent relies on approximate local descent and gradually improves itself while learning.

• Effective Local Minima Hypothesis: The learning process of a single human learner (not helped by others) is limited by effective local minima.

• Deeper Harder Hypothesis: Effective local minima are more likely to hamper learning as the required depth of the architecture increases.

• Abstractions Harder Hypothesis: High-level abstractions are unlikely to be discovered by a single human learner by chance, because these abstractions are represented by a deep subnetwork of the brain, which learns by local descent.

• Guided Learning Hypothesis: A human brain can learn high-level abstractions if guided by the signals produced by other agents that act as hints or indirect supervision for these high-level abstractions.

• Memes Divide-and-Conquer Hypothesis: Linguistic exchange, individual learning and the recombination of memes constitute an efficient evolutionary recombination operator in the meme-space. This helps human learners to collectively build better internal representations of their environment, including fairly high-level abstractions.

This paper is focused on "Point 1" and on testing the "Guided Learning Hypothesis", using machine learning algorithms to provide experimental evidence. The experiments performed also provide evidence in favor of the "Deeper Harder Hypothesis" and the associated "Abstractions Harder Hypothesis". Machine learning is still far behind the current capabilities of humans, and it is important to tackle the remaining obstacles to approach AI.
For this purpose, the question to be answered is: why do machine learning algorithms fail miserably on tasks that humans learn effortlessly from very few examples?

2. Recent work showed that rather deep feedforward networks can be very successfully trained when large quantities of labeled data are available (Ciresan et al., 2010; Glorot et al., 2011a; Krizhevsky et al., 2012). Nonetheless, the experiments reported here suggest that it all depends on the task being considered, since even with very large quantities of labeled examples, the deep networks trained here were unsuccessful.

2. Culture and Optimization Difficulty

As hypothesized in the "Local Descent Hypothesis", human brains would rely on a local approximate descent, just like a Multi-Layer Perceptron trained by a gradient-based iterative optimization. The main argument in favor of this hypothesis relies on the biologically-grounded assumption that although firing patterns in the brain change rapidly, synaptic strengths underlying these neural activities change only gradually, making sure that behaviors are generally consistent across time. If a learning algorithm is based on a form of local (e.g. gradient-based) descent, it can be sensitive to effective local minima (Bengio, 2013a). When one trains a neural network, at some point in the training phase the evaluation of error seems to saturate, even if new examples are introduced. In particular, Erhan et al. (2010) find that early examples have a much larger weight in the final solution. It looks like the learner is stuck in or near a local minimum. But since it is difficult to verify whether this is near a true local minimum or simply an effect of strong ill-conditioning, we call such a "stuck" configuration an effective local minimum, whose definition depends not just on the optimization objective but also on the limitations of the optimization algorithm. Erhan et al.
(2010) highlighted both the issue of effective local minima and a regularization effect when initializing a deep network with unsupervised pre-training. Interestingly, as the network gets deeper, the difficulty due to effective local minima seems to become more pronounced. That might be because the number of effective local minima increases (more like an actual local minima issue), or maybe because the good ones are harder to reach (more like an ill-conditioning issue); more work will be needed to clarify this question.

As a result of Point 4, we hypothesize that it is very difficult for an individual's brain to discover some higher-level abstractions by chance alone. As mentioned in the "Guided Learning Hypothesis", humans get hints from other humans and learn high-level concepts with the guidance of other humans.³ Curriculum learning (Bengio et al., 2009a) and incremental learning (Solomonoff, 1989) are examples of this. This is done by properly choosing the sequence of examples seen by the learner, where simpler examples are introduced first and more complex examples shown when the learner is ready for them. One of the hypotheses on why curriculum learning works states that it acts as a continuation method that allows one to discover a good minimum, by first finding a good minimum of a smoother error function. Recent experiments on human subjects also indicate that humans teach by using a curriculum strategy (Khan et al., 2011).

Some parts of the human brain are known to have a hierarchical organization (e.g. the visual cortex) consistent with the deep architecture studied in machine learning papers. As we go from the sensory level to higher levels of the visual cortex, we find higher-level areas corresponding to more abstract concepts. This is consistent with the Deep Abstractions Hypothesis.
Training neural networks and machine learning algorithms by decomposing the learning task into sub-tasks and exploiting prior information about the task is well-established and in fact constitutes the main approach to solving industrial problems with machine learning. The contribution of this paper is rather on rendering explicit the effective local minima issue and providing evidence on the type of problems for which this difficulty arises. The prior information and hints given to the learner can be viewed as inductive bias for a particular task, an important ingredient to obtain a good generalization error (Mitchell, 1980). An interesting earlier finding in that line of research was done with Explanation-Based Neural Networks (EBNN), in which a neural network transfers knowledge across multiple learning tasks. An EBNN uses previously learned domain knowledge as an initialization or search bias, i.e. to constrain the learner in the parameter space (O'Sullivan, 1996; Mitchell and Thrun, 1993). Another related line of work in machine learning is mainly focused on reinforcement learning algorithms, incorporating prior knowledge in the form of logical rules into the learning algorithm to speed up and bias learning (Kunapuli et al., 2010; Towell and Shavlik, 1994).

As discussed in the "Memes Divide-and-Conquer Hypothesis", societies can be viewed as distributed computational processing systems. In civilized societies, knowledge is distributed across different individuals; this yields a space efficiency. Moreover, computation is also divided across the individuals in the society (i.e. each individual can specialize on a particular task/topic), and hence this yields a computational efficiency.

3. But some high-level concepts may also be hardwired in the brain, as assumed in the universal grammar hypothesis (Montague, 1970), or in nature vs. nurture discussions in cognitive science.
Considering the limitations of the human brain, the whole processing cannot be done by just a single agent in an efficient manner. A recent study in paleoanthropology states that there has been a substantial decline in the endocranial volume of the brain in the last 30,000 years (Henneberg, 1988). The volume of the brain shrunk to 1241 ml from 1502 ml (Henneberg and Steyn, 1993). One of the hypotheses on the reduction of the volume of the skull claims that the decline in the volume of the brain might be related to functional changes in the brain that arose as a result of cultural development and the emergence of societies, given that this time period overlaps with the transition from a hunter-gatherer lifestyle to agricultural societies.

3. Experimental Setup

Some tasks which seem reasonably easy for humans to learn⁴ nonetheless appear almost impossible to learn for current generic state-of-the-art machine learning algorithms. Here we study more closely such a task, which becomes learnable if one provides hints to the learner about appropriate intermediate concepts. Interestingly, the task we used in our experiments is not only hard for deep neural networks but also for non-parametric machine learning algorithms such as SVMs, boosting and decision trees. The results of the experiments for varying sizes of dataset with several off-the-shelf black-box machine learning algorithms and some popular deep learning algorithms are provided in Table 1. Detailed explanations about the algorithms and the hyperparameters used for those algorithms are given in Appendix Section 5.2. We also provide some explanations about the methodologies used for the experiments in Section 3.2.

3.1 Pentomino Dataset

In order to test our hypothesis, an artificial dataset for object recognition using 64 × 64 binary images was designed.⁵
If the task is two-tiered (i.e., with guidance provided), the task of the first part is to recognize and locate each Pentomino object class⁶ in the image. The second part/final binary classification task is to figure out if all the Pentominos in the image are of the same shape class or not. If a neural network learned to detect the categories of each object at each location in an image, the remaining task becomes an XOR-like operation between the detected object categories. The types of Pentomino objects that are used for generating the dataset are as follows: Pentomino sprites N, P, F, Y, J, and Q, along with the Pentomino N2 sprite (mirror of the "Pentomino N" sprite), the Pentomino F2 sprite (mirror of the "Pentomino F" sprite), and the Pentomino Y2 sprite (mirror of the "Pentomino Y" sprite).

4. Keeping in mind that humans can exploit prior knowledge, either from previous learning or innate knowledge.
5. The source code for the script that generates the artificial Pentomino datasets (Arcade-Universe) is available at: https://github.com/caglar/Arcade-Universe. This implementation is based on Olivier Breuleux's bugland dataset generator.
6. A human learner does not seem to need to be taught the shape categories of each Pentomino sprite in order to solve the task. On the other hand, humans have lots of previously learned knowledge about the notion of shape and how central it is in defining categories.

Figure 1: Left (a): An example image from the dataset which has different sprite types in it. Right (b): An example image from the dataset that has only one type of Pentomino object in it, but with different orientations and scales.

Figure 2: Different classes of Pentomino shapes used in our dataset.

As shown in Figures 1(a) and 1(b), the synthesized images are fairly simple and do not have any texture.
Foreground pixels are "1" and background pixels are "0". Images of the training and test sets are generated i.i.d. For notational convenience, assume that the domain of raw input images is X, the set of sprites is S, the set of intermediate object categories for each possible location in the image is Y, and the set of final binary task outcomes is Z = {0, 1}. Two different types of rigid-body transformation are performed: sprite rotation rot(X, γ), where Γ = {γ : (γ = 90 × φ) ∧ [(φ ∈ N), (0 ≤ φ ≤ 3)]}, and scaling scale(X, α), where α ∈ {1, 2} is the scaling factor. The data generating procedure is summarized below.

Sprite transformations: Before placing the sprites in an empty image, for each image x ∈ X, a value for z ∈ Z is randomly sampled, determining whether or not the image is to contain three sprites of the same shape. Conditioned on the constraint given by z, three sprites s_ij are randomly selected from S without replacement. Using a uniform probability distribution over all possible scales, a scale is chosen and each sprite image is scaled accordingly. Then each sprite is randomly rotated by a multiple of 90 degrees.

Sprite placement: Upon completion of the sprite transformations, a 64 × 64 uniform grid is generated and divided into 8 × 8 blocks, each block being of size 8 × 8 pixels; three different blocks are randomly selected from the 64 = 8 × 8 blocks on the grid and the transformed objects are placed into these different blocks (so the sprites cannot overlap, by construction). Each sprite is centered in the block in which it is located. Thus there is no object translation inside the blocks; the only translation invariance is due to the location of the block inside the image. A Pentomino sprite is guaranteed not to overflow the block in which it is located, and there are no collisions or overlaps between sprites, making the task simpler.
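The generating procedure above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' Arcade-Universe generator: the sprite masks below are simplified 5-cell stand-ins for the real Pentomino shapes, and the sampling of the "all different" case is an approximation of sampling without replacement.

```python
import numpy as np

# Simplified 5-cell stand-ins for the Pentomino sprite masks (the real
# generator uses the N, P, F, Y, J, Q shapes and their mirrors).
SPRITES = {
    "P": np.array([[1, 1], [1, 1], [1, 0]], dtype=np.uint8),
    "N": np.array([[0, 1], [0, 1], [1, 1], [1, 0]], dtype=np.uint8),
    "Y": np.array([[0, 1], [1, 1], [0, 1], [0, 1]], dtype=np.uint8),
}

def transform(mask, rng):
    """Random scale in {1, 2} and rotation by a multiple of 90 degrees."""
    scale = rng.choice([1, 2])
    mask = np.kron(mask, np.ones((scale, scale), dtype=np.uint8))
    return np.rot90(mask, k=int(rng.integers(0, 4)))

def generate_image(rng):
    """One 64x64 binary image with 3 sprites; z = 1 iff all shapes match."""
    z = int(rng.integers(0, 2))
    names = list(SPRITES)
    if z == 1:
        chosen = [rng.choice(names)] * 3
    else:
        # guarantee at least two distinct shape classes
        chosen = list(rng.choice(names, size=2, replace=False))
        chosen.append(rng.choice(names))
        rng.shuffle(chosen)
    img = np.zeros((64, 64), dtype=np.uint8)
    blocks = rng.choice(64, size=3, replace=False)  # 3 of the 8x8 blocks
    for name, b in zip(chosen, blocks):
        sprite = transform(SPRITES[name], rng)
        r, c = (b // 8) * 8, (b % 8) * 8
        h, w = sprite.shape
        dr, dc = (8 - h) // 2, (8 - w) // 2  # center the sprite in its block
        img[r + dr:r + dr + h, c + dc:c + dc + w] = sprite
    return img, z
```

Because the blocks are sampled without replacement and every scaled, rotated mask fits inside an 8 × 8 block, the no-overlap guarantee of the original procedure is preserved by construction.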
The largest possible Pentomino sprite fits into an 8 × 4 mask.

3.2 Learning Algorithms Evaluated

Initially the models were cross-validated using 5-fold cross-validation. With 40,000 examples, this gives 32,000 examples for training and 8,000 examples for testing. For the neural network algorithms, stochastic gradient descent (SGD) is used for training. The following standard learning algorithms were first evaluated: decision trees, SVMs with Gaussian kernel, ordinary fully-connected Multi-Layer Perceptrons, Random Forests, k-Nearest Neighbors, Convolutional Neural Networks, and Stacked Denoising Auto-Encoders with supervised fine-tuning. More details of the configurations and hyper-parameters for each of them are given in Appendix Section 5.2. The only better-than-chance results were obtained with variations of the Structured Multi-Layer Perceptron described below.

3.2.1 Structured Multi-Layer Perceptron (SMLP)

The neural network architecture that is used to solve this task is called the SMLP (Structured Multi-Layer Perceptron), a deep neural network with two parts, as illustrated in Figures 5 and 7:

The lower part, P1NN (Part 1 Neural Network, as it is called in the rest of the paper), has shared weights and local connectivity, with one identical MLP instance of the P1NN for each patch of the image, and typically an 11-element output vector per patch (unless otherwise noted). The idea is that these 11 outputs per patch could represent the detection of the sprite shape category (or the absence of a sprite in the patch).

The upper part, P2NN (Part 2 Neural Network), is a fully connected one-hidden-layer MLP that takes the concatenation of the outputs of all patch-wise P1NNs as input.
Note that the first layer of P1NN is similar to a convolutional layer, but with the stride equal to the kernel size, so that windows do not overlap; i.e., P1NN can be decomposed into separate networks sharing the same parameters but applied on different patches of the input image, so that each network can actually be trained patch-wise in the case where a target is provided for the P1NN outputs. The P1NN output for a patch p_i extracted from the image x is computed as follows:

f_θ(p_i) = g_2(V g_1(U p_i + b) + c)    (1)

where p_i ∈ R^d is the input patch/receptive field extracted from location i of a single image, U ∈ R^{d_h × d} is the weight matrix for the first layer of P1NN, and b ∈ R^{d_h} is the vector of biases for the first layer of P1NN. g_1(·) is the activation function of the first layer and g_2(·) is the activation function of the second layer. In many of the experiments, the best results were obtained with g_1(·) a rectifying non-linearity (a.k.a. ReLU), max(0, x) (Jarrett et al., 2009b; Nair and Hinton, 2010; Glorot et al., 2011a; Krizhevsky et al., 2012). V ∈ R^{d_o × d_h} is the second layer's weight matrix and c ∈ R^{d_o} is the vector of biases of the second layer of the P1NN, with d_o expected to be smaller than d_h. In this way, g_1(U p_i + b) is an overcomplete representation of the input patch that can potentially represent all the possible Pentomino shapes for all factors of variation in the patch (rotation, scaling and Pentomino shape type). On the other hand, when trained with hints, f_θ(p_i) is expected to be the lower-dimensional representation of a Pentomino shape category, invariant to scaling and rotation in the given patch.
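A minimal sketch of Equation (1) applied patch-wise, assuming g_1 is a ReLU and g_2 a softmax (as in the hints-trained P1NN described below); the weights here are random placeholders, purely for illustrating the shapes involved:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d, d_h, d_o = 64, 1024, 11  # flattened 8x8 patch, hidden width, 11 outputs

rng = np.random.default_rng(0)
U = rng.normal(scale=0.01, size=(d_h, d))    # first-layer weights
b = np.zeros(d_h)                            # first-layer biases
V = rng.normal(scale=0.01, size=(d_o, d_h))  # second-layer weights
c = np.zeros(d_o)                            # second-layer biases

def p1nn(patch):
    """Eq. (1): f_theta(p_i) = g2(V g1(U p_i + b) + c)."""
    return softmax(V @ relu(U @ patch + b) + c)

def p2nn_input(image):
    """Apply the shared P1NN to all 64 non-overlapping 8x8 patches of a
    64x64 image and concatenate the outputs, as fed to P2NN."""
    outs = [p1nn(image[r:r + 8, cc:cc + 8].ravel())
            for r in range(0, 64, 8) for cc in range(0, 64, 8)]
    return np.concatenate(outs)  # shape (64 * 11,) = (704,)
```

The stride-equals-kernel-size structure shows up here as the non-overlapping `range(0, 64, 8)` loops: the same (U, b, V, c) parameters are shared across all 64 patches, which is what makes patch-wise pre-training possible.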
In the experiments with the SMLP trained with hints (targets at the output of P1NN), the P1NN is expected to perform classification of each 8 × 8 non-overlapping patch of the original 64 × 64 input image without having any prior knowledge of whether that specific patch contains a Pentomino shape or not. The P1NN in the SMLP without hints just outputs the local activations for each patch, and gradients on f_θ(p_i) are backpropagated from the upper layers. In both cases P1NN produces the input representation for the Part 2 Neural Network (P2NN). Thus the input representation of P2NN is the concatenated output of P1NN across all 64 patch locations:

h_o = [f_θ(p_0), ..., f_θ(p_i), ..., f_θ(p_{N−1})]

where N is the number of patches and h_o ∈ R^{d_i}, with d_i = d_o × N; h_o is the concatenated output of the P1NN at each patch.

There is a standardization layer on top of the output of P1NN that centers the activations and performs divisive normalization, dividing by the standard deviation over a minibatch of the activations of that layer. We denote the standardization function z(·). Standardization makes use of the mean and standard deviation computed for each hidden unit, such that each hidden unit of h_o will have zero mean activation and unit standard deviation on average over the minibatch. X is the set of Pentomino images in the minibatch, where X ∈ R^{d_in × N} is a matrix of N images (overloading N here to denote the minibatch size). h_o^{(i)}(x_j) is the activation of the i-th hidden unit of hidden layer h_o(x_j) for the j-th example, with x_j ∈ X:

µ_{h_o^{(i)}} = (1/N) ∑_{x_j ∈ X} h_o^{(i)}(x_j)    (2)

σ_{h_o^{(i)}} = sqrt( (1/N) ∑_{j=1}^{N} ( h_o^{(i)}(x_j) − µ_{h_o^{(i)}} )² + ε )    (3)

z(h_o^{(i)}(x_j)) = ( h_o^{(i)}(x_j) − µ_{h_o^{(i)}} ) / max(σ_{h_o^{(i)}}, ε)    (4)

where ε is a very small constant, used to prevent numerical underflow in the standard deviation. P1NN is trained on each of the 8 × 8 patches extracted from the image.
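The standardization of Equations (2)-(4) can be sketched directly in NumPy; here `h` is a minibatch of concatenated P1NN outputs, one row per example:

```python
import numpy as np

def standardize(h, eps=1e-8):
    """Per-unit standardization of Eqs. (2)-(4): center each hidden unit
    of h_o over the minibatch and divide by its standard deviation.
    h has shape (N, d_i): N examples, d_i concatenated P1NN outputs."""
    mu = h.mean(axis=0)                                  # Eq. (2)
    sigma = np.sqrt(((h - mu) ** 2).mean(axis=0) + eps)  # Eq. (3)
    return (h - mu) / np.maximum(sigma, eps)             # Eq. (4)
```

After this transformation each column (hidden unit) has zero mean and, up to the ε term, unit standard deviation over the minibatch, which is what produces the sparse, spiky outputs shown in Figures 3 and 4.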
h_o is standardized for each training and test sample separately. Different values of ε were used for SMLP-hints and SMLP-nohints. The concatenated output of P1NN is fed as input to the P2NN. P2NN is a feedforward MLP with a sigmoid output layer and a single ReLU hidden layer. The task of P2NN is to perform a nonlinear logical operation on the representation provided at the output of P1NN.

3.2.2 Structured Multi-Layer Perceptron Trained with Hints (SMLP-hints)

The SMLP-hints architecture exploits a hint about the presence and category of Pentomino objects, specifying a semantics for the P1NN outputs. P1NN is trained with the intermediate target Y, specifying the type of Pentomino sprite shape present (if any) at each of the 64 patches (8 × 8 non-overlapping blocks) of the image. Because a possible answer at a given location can be "none of the object types", i.e., an empty patch, y_p (for patch p) can take one of 11 possible values: one for rejection and the rest for the Pentomino shape classes illustrated in Figure 2:

y_p = { 0 if patch p is empty; s ∈ S if patch p contains a Pentomino sprite }.

A similar task has been studied by Fleuret et al. (2011) (see Problem 17 in their SI appendix), who compared the performance of humans vs. computers. The SMLP-hints architecture takes advantage of dividing the task into two subtasks during training, with prior information about intermediate-level relevant factors. Because the sum of the training losses decomposes into the loss on each patch, the P1NN can be pre-trained patch-wise. Each patch-specific component of the P1NN is a fully connected MLP with 8 × 8 inputs and 11 outputs with a softmax output layer. SMLP-hints uses the standardization given in Equations (2)-(4) but with ε = 0. The standardization is a crucial step for training the SMLP on the Pentomino dataset, and yields much sparser outputs, as seen in Figures 3 and 4.
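The intermediate targets y_p come for free from the data generator, which knows where each sprite was placed. A minimal sketch of building the 64-element target vector for one image; the `placements` mapping is a hypothetical format for the generator's placement record, not the authors' actual data structure:

```python
import numpy as np

N_CLASSES = 11  # class 0 = empty patch, classes 1..10 = sprite shape types

def patch_targets(placements):
    """Build the 64-element intermediate target vector y for one image.
    `placements` maps a block index (0..63, row-major over the 8x8 grid
    of 8x8-pixel blocks) to a sprite class id in 1..10; every other
    block is labeled 0 ("no object"). Hypothetical helper, for
    illustration only."""
    y = np.zeros(64, dtype=np.int64)
    for block, sprite_class in placements.items():
        assert 1 <= sprite_class < N_CLASSES
        y[block] = sprite_class
    return y

# Example: sprites of classes 3, 3 and 7 placed in blocks 5, 20 and 42.
y = patch_targets({5: 3, 20: 3, 42: 7})
```

Each patch-wise P1NN is then trained with an 11-way softmax cross-entropy loss against these targets, which is what lets the total loss decompose over patches.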
If the standardization is not used, even SMLP-hints could not solve the Pentomino task. In general, the standardization step dampens the small activations and augments larger ones (reducing the noise). Centering the activations of each feature detector in a neural network has been studied in (Raiko et al., 2012) and (Vatanen et al., 2013), who proposed that transforming the outputs of each hidden neuron in a multi-layer perceptron network to have zero output and zero slope on average makes first-order optimization methods closer to second-order techniques.

By default, the SMLP uses rectifier hidden units as the activation function; we found a significant boost from rectification compared to the hyperbolic tangent and sigmoid activation functions. The P1NN has a highly overcomplete architecture with 1024 hidden units per patch, and the L1 and L2 weight decay regularization coefficients on the weights (not the biases) are respectively 1e-6 and 1e-5. The learning rate for the P1NN is 0.75. One training epoch was enough for the P1NN to learn the features of Pentomino shapes perfectly on the 40000 training examples. The P2NN has 2048 hidden units. The L1 and L2 penalty coefficients for the P2NN are 1e-6, and the learning rate is 0.1. These were selected by trial and error based on validation set error. Both P1NN (for each patch) and P2NN are fully-connected neural networks, even though P1NN globally is a special kind of convolutional neural network. Filters of the first layer of the SMLP are shown in Figure 6. These are examples of the filters obtained with the SMLP-hints trained with 40k examples, whose results are given in Table 1. Those filters look very noisy, but they work perfectly on the Pentomino task.
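The sense in which P1NN is globally a special kind of convolutional network can be made concrete: the same weights are applied to every patch, with a stride equal to the patch size. A minimal sketch, with illustrative weight shapes matching the dimensions described above (64-pixel patches, 1024 rectifier units, 11-way softmax):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    # numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def p1nn_forward(patches, U, bU, V, bV):
    """Forward pass of the patch-wise P1NN (a sketch).

    The same weights (U, V) are applied to every patch, which is what
    makes P1NN, viewed globally, a convolution-like network with stride
    equal to the 8x8 patch size. Shapes assumed here: patches (64, 64),
    U (64, 1024), V (1024, 11).
    """
    h = relu(patches @ U + bU)    # 1024 rectifier units per patch
    probs = softmax(h @ V + bV)   # 11-way softmax per patch
    return probs.ravel()          # concatenated h_o of length 64 * 11 = 704
```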
3.2.3 Deep and Structured Supervised MLP without Hints (SMLP-nohints)

SMLP-nohints uses the same connectivity pattern (and deep architecture) as the SMLP-hints architecture, but without using the intermediate targets ($Y$). It directly predicts the final outcome of the task ($Z$), using the same number of hidden units, the same connectivity and the same activation function for the hidden units as SMLP-hints.

Figure 3: Bar chart of the concatenated softmax output activations $h_o$ of P1NN (11 × 64 = 704 outputs) in SMLP-hints before standardization, for a selected example. There are very large spikes at each location for one of the 11 possible outcomes (1-of-K representation).

Figure 4: Softmax output activations $h_o$ of P1NN in SMLP-hints after standardization. There are positive spiked outputs at the locations where there is a Pentomino shape. Positive and negative spikes arise because most of the outputs are near an average value. Activations are higher at the locations where there is a Pentomino shape.

Figure 5: Structured MLP architecture, used with hints (trained in two phases: first P1NN, the bottom two layers, then P2NN, the top two layers). In SMLP-hints, P1NN is trained on each 8 × 8 patch extracted from the image, and the softmax output probabilities of all 64 patches are concatenated into a 64 × 11 vector that forms the input of P2NN. Only U and V are learned in the P1NN, and its output on each patch is fed into P2NN. The first-level and second-level neural networks are trained separately, not jointly.

Figure 6: Filters of the structured MLP architecture, trained with hints on 40k examples.
120 hyperparameter configurations have been evaluated by randomly selecting the number of hidden units from [64, 128, 256, 512, 1024, 1200, 2048] and randomly sampling 20 learning rates uniformly in the log-domain within the interval [0.008, 0.8]. Two fully connected hidden layers with 1024 hidden units (same as P1NN) per patch are used, and 2048 (same as P2NN) for the last hidden layer, with twenty training epochs. For this network the best results were obtained with a learning rate of 0.05.(7)

Figure 7: Structured MLP architecture, used without hints (SMLP-nohints). It is the same architecture as SMLP-hints (Figure 5), but with both parts (P1NN and P2NN) trained jointly with respect to the final binary classification task.

We chose to experiment with various SMLP-nohints architectures and optimization procedures, trying unsuccessfully to achieve as good results with SMLP-nohints as with SMLP-hints.

Rectifier Non-Linearity. A rectifier nonlinearity is used for the activations of the MLP hidden layers. We observed that using a piecewise-linear activation function such as the rectifier can make the optimization more tractable.

7. The source code of the structured MLP is available at the github repository: https://github.com/caglar/structured_mlp

Figure 8: First-layer filters learned by the structured MLP architecture, trained without using hints on 447600 examples with online SGD and a sigmoid intermediate layer activation.

Intermediate Layer. The output of the P1NN is considered as an intermediate layer of the SMLP. For the SMLP-hints, only softmax output activations have been tried at the intermediate layer, and that sufficed to learn the task.
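Returning to the random hyperparameter search described at the start of this section, one way it might be coded is the following sketch (the function name and the trial structure are illustrative assumptions, not the paper's actual search script):

```python
import numpy as np

def sample_hyperparameters(n_trials=120, seed=0):
    """Random hyperparameter search (a sketch).

    Hidden-layer sizes are drawn from the fixed list given in the text,
    and learning rates are sampled uniformly in the log-domain within
    [0.008, 0.8].
    """
    rng = np.random.default_rng(seed)
    sizes = [64, 128, 256, 512, 1024, 1200, 2048]
    trials = []
    for _ in range(n_trials):
        n_hidden = int(rng.choice(sizes))
        log_lr = rng.uniform(np.log(0.008), np.log(0.8))
        trials.append({"n_hidden": n_hidden, "lr": float(np.exp(log_lr))})
    return trials
```

Sampling in the log-domain spreads the trials evenly across orders of magnitude of the learning rate, rather than clustering them near the upper end of the interval.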
Since things did not work nearly as well with the SMLP-nohints, several different activation functions have been tried: softmax(·), tanh(·), sigmoid(·) and linear activation functions.

Standardization Layer. Normalization at the last layer of convolutional neural networks has been used occasionally to encourage competition between the hidden units. Jarrett et al. (2009a) used a local contrast normalization layer in their architecture, which performs subtractive and divisive normalization. A local contrast normalization layer enforces a local competition between adjacent features in a feature map and between features at the same spatial location in different feature maps. Similarly, Krizhevsky et al. (2012) observed that using a local response normalization layer aids generalization.

Standardization has been observed to be crucial for the SMLP trained both with and without hints. In both the SMLP-hints and SMLP-nohints experiments, the neural network was not able to generalize, or even learn the training set, without using standardization in the SMLP intermediate layer, achieving only chance performance. More specifically, in the SMLP-nohints architecture, standardization is part of the computational graph, hence the gradients are backpropagated through it. The mean and the standard deviation are computed for each hidden unit separately at the intermediate layer, as in Equation 4. But in order to prevent numerical underflows or overflows during backpropagation, we have used $\epsilon = 1e{-}8$ (Equation 3).

The benefit of having sparse activations may be specifically important for ill-conditioned problems, for the following reasons. When a hidden unit is "off", its gradient (the derivative of the loss with respect to its output) is usually close to 0 as well.
That means that all off-diagonal second derivatives involving that hidden unit (e.g. its input weights) are also near 0. This is basically like removing some columns and rows from the Hessian matrix associated with a particular example. It has been observed that the condition number of the Hessian matrix (specifically, its largest eigenvalue) increases as the size of the network increases, making training considerably slower and less efficient (Dauphin and Bengio, 2013). Hence one would expect that as the sparsity of the gradients (obtained because of the sparsity of the activations) increases, training would become more efficient, as if we were training a smaller sub-network for each example, with shared weights across examples, as in dropout (Hinton et al., 2012).

In Figure 9, the activation of each hidden unit is shown in a bar chart: the effect of standardization is significant, making the activations sparser.

Figure 9: Activations of the intermediate-level hidden units of an SMLP-nohints for a particular example (x-axis: hidden unit number, y-axis: activation value). Left (a): before standardization. Right (b): after standardization.

In Figure 10, one can see the activation histogram of the SMLP-nohints intermediate layer, showing the distribution of activation values before and after standardization. Again the sparsifying effect of standardization is very apparent.

Figure 10: Histogram of activation values of the SMLP-nohints intermediate layer. Left (a): before standardization. Right (b): after standardization.

In Figures 9 and 10, the intermediate-level activations of SMLP-nohints are shown before and after standardization. These are for the same SMLP-nohints architecture whose results are presented in Table 1.
For that same SMLP, the Adadelta (Zeiler, 2012) adaptive learning rate scheme has been used, with 512 hidden units for the hidden layer of P1NN and the rectifier activation function. For the output of the P1NN, 11 sigmoidal units have been used, while P2NN has 1200 hidden units with the rectifier activation function. The output nonlinearity of the P2NN is a sigmoid and the training objective is the binary cross-entropy.

Adaptive Learning Rates. We have experimented with several different adaptive learning rate algorithms. We tried rmsprop(8), Adadelta (Zeiler, 2012), Adagrad (Duchi et al., 2010) and a linearly (1/t) decaying learning rate (Bengio, 2013b). For the SMLP-nohints with sigmoid activation function, we found Adadelta (Zeiler, 2012) to converge faster to an effective local minimum and usually to yield better generalization error than the others.

3.2.4 Deep and Structured MLP with Unsupervised Pre-Training

Several experiments have been conducted using an architecture similar to SMLP-nohints, but with unsupervised pre-training of P1NN, using Denoising Auto-Encoders (DAE) and/or Contractive Auto-Encoders (CAE). Supervised fine-tuning proceeds as in the deep and structured MLP without hints. Because an unsupervised learner may not focus the representation just on the shapes, a larger number of intermediate-level units at the output of P1NN has been explored: previous work on unsupervised pre-training generally found that larger hidden layers were optimal when using unsupervised pre-training, because not all unsupervised features will be relevant to the task at hand. Instead of limiting ourselves to 11 units per patch, we experimented with networks with up to 20 hidden (i.e., code) units per patch in the second-layer patch-wise auto-encoder. In Appendix 5.1 we also provide the results of some experiments with binary-binary RBMs trained on 8 × 8 patches from the 40k training dataset.
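A tied-weight denoising auto-encoder of the kind used for this pre-training can be sketched as follows (a sketch with zero-masking corruption and a binary cross-entropy reconstruction loss; the function signature and corruption level are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dae_loss(x, W, bh, bv, corruption=0.25, rng=None):
    """Binary cross-entropy loss of a tied-weight denoising auto-encoder.

    x: (n, d) batch of binary patch vectors. The input is corrupted by
    zero-masking noise, encoded with a sigmoid layer, and reconstructed
    with the transposed (tied) weights W.T.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    x_tilde = x * (rng.random(x.shape) > corruption)  # zero-masking corruption
    h = sigmoid(x_tilde @ W + bh)                     # encoder code units
    x_hat = sigmoid(h @ W.T + bv)                     # tied-weight decoder
    eps = 1e-9                                        # guard against log(0)
    return -np.mean(x * np.log(x_hat + eps) + (1 - x) * np.log(1 - x_hat + eps))
```

Training would minimize this loss with respect to (W, bh, bv) by gradient descent; the clean input x, not the corrupted x_tilde, is the reconstruction target, which is what makes the auto-encoder "denoising".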
In the unsupervised pretraining experiments in this paper, both a contractive auto-encoder (CAE) with sigmoid nonlinearity and binary cross-entropy cost function, and a denoising auto-encoder (DAE), have been used. In the second layer, experiments were performed with a DAE with rectifier hidden units, using L1 sparsity and weight decay on the weights of the auto-encoder. The greedy layerwise unsupervised training procedure is used to train the deep auto-encoder architecture (Bengio et al., 2007). In the unsupervised pretraining experiments, tied weights have been used. Different combinations of CAE and DAE for unsupervised pretraining have been tested, but none of the configurations tested managed to learn the Pentomino task, as shown in Table 1.

3.3 Experiments with 1 of K Representation

To explore the effect of changing the complexity of the input representation on the difficulty of the task, a set of experiments has been designed with symbolic representations of the information in each patch. In all cases an empty patch is represented with a 0 vector. These representations can be seen as an alternative input for a P2NN-like network, i.e., they were fed as input to an MLP or another black-box classifier. The following four experiments have been conducted, each one using a different input representation for each patch:

8. This is a learning rate scaling method discussed by G. Hinton in his video lecture 6.5 (rmsprop: Divide the gradient by a running average of its recent magnitude), COURSERA: Neural Networks for Machine Learning, 2012.
Table 1: The error percentages of different learning algorithms on the Pentomino dataset with different numbers of training examples.

                                                 20k dataset      40k dataset      80k dataset
Algorithm                                        Train   Test     Train   Test     Train   Test
SVM RBF                                          26.2    50.2     28.2    50.2     30.2    49.6
K Nearest Neighbors                              24.7    50.0     25.3    49.5     25.6    49.0
Decision Tree                                     5.8    48.6      6.3    49.4      6.9    49.9
Randomized Trees                                  3.2    49.8      3.4    50.5      3.5    49.1
MLP                                              26.5    49.3     33.2    49.9     27.2    50.1
Convnet/Lenet5                                   50.6    49.8     49.4    49.8     50.2    49.8
Maxout Convnet                                   14.5    49.5      0.0    50.1      0.0    44.6
2 layer sDA                                      49.4    50.3     50.2    50.3     49.7    50.3
Struct. Supervised MLP w/o hints                  0.0    48.6      0.0    36.0      0.0    12.4
Struct. MLP+CAE, Supervised Finetuning           50.5    49.7     49.8    49.7     50.3    49.7
Struct. MLP+CAE+DAE, Supervised Finetuning       49.1    49.7     49.4    49.7     50.1    49.7
Struct. MLP+DAE+DAE, Supervised Finetuning       49.5    50.3     49.7    49.8     50.3    49.7
Struct. MLP with Hints                            0.21   30.7      0       3.1      0       0.01

Experiment 1 - Onehot representation without transformations: In this experiment, several trials have been done with a 10-input one-hot vector per patch. Each input corresponds to an object category given in the clear, i.e., the ideal input for P2NN if a supervised P1NN perfectly did its job.

Experiment 2 - Disentangled representations: In this experiment, we did trials with 16 binary inputs per patch: 10 one-hot bits for representing each object category, 4 for rotations and 2 for scaling. That is, the whole information about the input is given, but perfectly disentangled. This would be the ideal input for P2NN if an unsupervised P1NN perfectly did its job.

Experiment 3 - Onehot representation with transformations: For each of the ten object types there are 8 = 4 × 2 possible transformations. Two objects in two different patches are considered "the same" (for the final task) if their category is the same, regardless of the transformations.
The one-hot representation of a patch corresponds to the cross-product between the 10 object shape classes and the 4 × 2 transformations, i.e., one out of 80 = 10 × 4 × 2 possibilities represented in an 80-bit one-hot vector. This also contains all the information about the input image patch, but spread out in a kind of non-parametric and non-informative (not disentangled) way, as a perfect memory-based unsupervised learner (like clustering) could produce. Nevertheless, the shape class would be easier to read out from this representation than from the image representation (it would be an OR over 8 of the bits).

Experiment 4 - Onehot representation with 80 choices: This representation has the same 1-of-80 one-hot representation per patch, but the target task is defined differently. Two objects in two different patches are considered the same iff they have exactly the same 80-bit one-hot representation (i.e., are of the same object category with the same transformation applied).

The first experiment is a sanity check. It was conducted with single-hidden-layer MLPs with rectifier and tanh nonlinearities, and the task was learned perfectly (0 error on both training and test datasets) within very few training epochs.

Figure 11: Tanh MLP training and test error curves with 100k training examples. Left (a): Experiment 3 over 800 training epochs. Right (b): Experiment 4 over 700 training epochs.

The results of Experiment 2 are given in Table 2.
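The symbolic patch representations of Experiments 2-4 can be constructed as follows (a sketch; the exact bit ordering within each vector is an illustrative assumption, and an empty patch would be the all-zero vector in both cases):

```python
import numpy as np

def onehot80(shape_id, rotation, scale):
    """80-bit one-hot code of Experiments 3-4: the cross-product of
    10 shape classes, 4 rotations and 2 scales, so one active bit out
    of 80 = 10 * 4 * 2 possibilities."""
    v = np.zeros(80)
    v[shape_id * 8 + rotation * 2 + scale] = 1.0
    return v

def disentangled16(shape_id, rotation, scale):
    """16-bit disentangled code of Experiment 2: 10 one-hot bits for
    the shape class, 4 for the rotation, 2 for the scale."""
    v = np.zeros(16)
    v[shape_id] = 1.0       # shape factor
    v[10 + rotation] = 1.0  # rotation factor
    v[14 + scale] = 1.0     # scale factor
    return v
```

With this layout, reading the shape class out of the 80-bit code is an OR over the 8 bits sharing the same shape_id block, whereas in the disentangled code it is a single bit.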
To improve results, we experimented with the Maxout non-linearity in a feedforward MLP (Goodfellow et al., 2013) with two hidden layers. Unlike the typical Maxout network of the original paper, regularizers have been deliberately avoided in order to focus on the optimization issue, i.e., no weight decay, no norm constraint on the weights, and no dropout. Although learning from a disentangled representation is more difficult than learning from perfect object detectors, it is feasible with some architectures such as the Maxout network. Note that this representation is the kind of representation that one could hope an unsupervised learning algorithm could discover, at best, as argued in Bengio et al. (2012).

The results obtained on the validation set for Experiment 3 and Experiment 4 are shown in Table 3 and Table 4, respectively. In these experiments, a tanh MLP with two hidden layers has been tested with the same hyperparameters. In Experiment 3 the complexity of the problem comes from the transformations (8 = 4 × 2) and the number of object types, whereas in Experiment 4 the only source of complexity of the task comes from the number of different object types.

These results are in between the complete failure and complete success observed with the other experiments, suggesting that the task could become solvable with better training or more training examples. Figure 11 illustrates the progress of training a tanh MLP, on both the training and test error, for Experiments 3 and 4. Clearly, something has been learned, but the task is not nailed yet. In Experiment 3, for both the maxout and tanh networks, there was a long plateau where the training error and objective stayed almost the same. Maxout performed at chance on both the training and test sets for about 120 iterations.
But after the 120th iteration, the training and test error started to decline and eventually the network was able to solve the task. Moreover, as seen from the curves in Figures 11(a) and 11(b), the training and test error curves are almost the same for both tasks. This implies that for one-hot inputs, whether one increases the number of possible transformations for each object or the number of object categories, as long as the number of possible configurations is the same, the complexity of the problem is almost the same for the MLP.

Table 2: Performance of different learning algorithms on the disentangled representation in Experiment 2.

Learning Algorithm    Training Error    Test Error
SVM                        0.0             35.6
Random Forests             1.29            40.475
Tanh MLP                   0.0              0.0
Maxout MLP                 0.0              0.0

Table 3: Performance of different learning algorithms using a dataset with one-hot vectors of 80 inputs, as discussed for Experiment 3.

Learning Algorithm    Training Error    Test Error
SVM                       11.212           32.37
Random Forests            24.839           48.915
Tanh MLP                   0.0             22.475
Maxout MLP                 0.0              0.0

3.4 Does the Effect Persist with Larger Training Set Sizes?

The results shown in this section indicate that the problem in the Pentomino task is clearly not just a regularization problem, but rather hinges on an optimization problem; otherwise, we would expect test error to decrease as the number of training examples increases. This is shown first by studying the online case and then by studying the ordinary training case with a fixed-size training set, considering increasing training set sizes. In the online minibatch setting, parameter updates are performed as follows:
$$\theta_{t+1} = \theta_t - \epsilon \, \Delta\theta_t \quad (5)$$
$$\Delta\theta_t = \frac{\sum_{i}^{N} \nabla_{\theta_t} L(x_i, \theta_t)}{N} \quad (6)$$
where $L(x_i, \theta_t)$ is the loss incurred on example $x_i$ with parameters $\theta_t$, $t \in \mathbb{Z}^+$, and $\epsilon$ is the learning rate.
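The update of Equations 5-6 can be sketched as follows (a sketch; the function name and a single-example gradient callback are illustrative assumptions):

```python
import numpy as np

def sgd_minibatch_step(theta, minibatch, grad_fn, lr=0.1):
    """One online minibatch SGD update, as in Equations 5-6.

    grad_fn(x, theta) returns the gradient of the loss L(x, theta) for a
    single example x. The update averages the per-example gradients over
    the minibatch (Equation 6) and moves against them with learning rate
    lr, the epsilon of Equation 5.
    """
    grads = [grad_fn(x, theta) for x in minibatch]
    delta = np.mean(grads, axis=0)  # Equation 6: averaged gradient
    return theta - lr * delta       # Equation 5: parameter update
```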
Ordinary batch algorithms converge linearly to the optimum $\theta^*$; however, the noisy gradient estimates in online SGD cause the parameters $\theta$ to fluctuate near a local optimum. On the other hand, online SGD directly optimizes the expected risk, because the examples are drawn iid from the ground-truth distribution (Bottou, 2010). Thus:
$$L_{\infty} = \mathbb{E}[L(x, \theta)] = \int_x L(x, \theta) \, p(x) \, dx \quad (7)$$
where $L_{\infty}$ is the generalization error. Therefore online SGD is trying to minimize the expected risk with noisy updates. Those noisy updates have the effect of a regularizer:
$$\Delta\theta_t = \frac{\sum_{i}^{N} \nabla_{\theta_t} L(x_i, \theta_t)}{N} = \nabla_{\theta_t} L(x, \theta_t) + \xi_t \quad (8)$$
where $\nabla_{\theta_t} L(x, \theta_t)$ is the true gradient and $\xi_t$ is the zero-mean stochastic gradient "noise" due to computing the gradient over a finite-size minibatch sample.

Table 4: Performance of different algorithms using a dataset with one-hot vectors of 80 binary inputs, as discussed in Experiment 4.

Learning Algorithm    Training Error    Test Error
SVM                        4.346           40.545
Random Forests            23.456           47.345
Tanh MLP                   0               25.8

We would like to know whether the problem with the Pentomino dataset is more a regularization or an optimization problem. An SMLP-nohints model was trained by online SGD on the randomly generated online Pentomino stream. The learning rate was adaptive, with the Adadelta procedure (Zeiler, 2012), on minibatches of 100 examples. In the online SGD experiments, two SMLP-nohints networks, trained with and without standardization at the intermediate layer and with exactly the same hyperparameters, were tested. The SMLP-nohints P1NN patch-wise submodel has 2048 hidden units and the SMLP intermediate layer has 1152 = 64 × 18 hidden units. The nonlinearity used for the intermediate layer is the sigmoid. P2NN has 2048 hidden units. SMLP-nohints has been trained either with or without standardization on top of the output units of the P1NN.
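The Adadelta procedure used to adapt the learning rate in these experiments can be sketched as follows (a minimal sketch of the rule in Zeiler (2012); the decay rate rho and the constant eps shown here are the paper's suggested defaults, not necessarily the values used in these experiments):

```python
import numpy as np

class Adadelta:
    """Adadelta update rule (Zeiler, 2012), a minimal sketch.

    Keeps exponential running averages of squared gradients and squared
    updates; no global learning rate is needed, only rho and eps.
    """
    def __init__(self, shape, rho=0.95, eps=1e-6):
        self.rho, self.eps = rho, eps
        self.Eg2 = np.zeros(shape)   # running average of grad**2
        self.Edx2 = np.zeros(shape)  # running average of update**2

    def step(self, params, grad):
        self.Eg2 = self.rho * self.Eg2 + (1 - self.rho) * grad**2
        # update scaled by the ratio of the two running RMS values
        dx = -np.sqrt(self.Edx2 + self.eps) / np.sqrt(self.Eg2 + self.eps) * grad
        self.Edx2 = self.rho * self.Edx2 + (1 - self.rho) * dx**2
        return params + dx
```

The ratio of the two RMS accumulators acts as a per-parameter learning rate that adapts to the local curvature of the loss.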
The experiments illustrated in Figures 12 and 13 are with the same SMLP-nohints architecture for which results are given in Table 1. In those graphs, only the results for training on the randomly generated 545400 Pentomino samples are presented. As shown in the plots, SMLP-nohints was not able to generalize without standardization. Although without standardization the training loss seems to decrease initially, it eventually gets stuck in a plateau where the training loss doesn't change much.

Training of SMLP-nohints with online minibatch SGD was performed using standardization in the intermediate layer and Adadelta learning rate adaptation, on 1046000 training examples from the randomly generated Pentomino stream. At the end of training, the test error is down to 27.5%, which is much better than chance but far from the near-0 error obtained with SMLP-hints.

In another SMLP-nohints experiment without standardization, the model was trained on 1580000 Pentomino examples using online minibatch SGD. P1NN has 2048 hidden units in its hidden layer and 16 sigmoidal outputs per patch. P2NN has 1024 hidden units in its hidden layer. Adadelta is used to adapt the learning rate. At the end of training this SMLP, the test error remained stuck at 50.1%.

Figure 12: Test errors of SMLP-nohints with and without standardization in the intermediate layer. Sigmoid was used as the intermediate layer activation. Each tick (batch no) on the x-axis represents 400 examples.

Figure 13: Training errors of SMLP-nohints with and without standardization in the intermediate layer.
Sigmoid nonlinearity has been used as the intermediate layer activation function. The x-axis is in units of blocks of 400 examples in the training set.

3.4.1 Experiments with Increased Training Set Size

Here we consider the effect of training different learners with different numbers of training examples. For the experimental results shown in Table 1, three training set sizes (20k, 40k and 80k examples) had been used. Each dataset was generated with a different random seed (so they do not overlap). Figure 14 also shows the error bars for an ordinary MLP with three hidden layers, for a larger range of training set sizes, between 40k and 320k examples. The number of training epochs is 8 (more did not help), and there are three hidden layers with 2048 feature detectors each. The learning rate we used in these experiments is 0.01. The activation function of the MLP is the tanh nonlinearity, while the L1 and L2 penalty coefficients are both 1e-6.

Table 1 shows that, without guiding hints, none of the state-of-the-art learning algorithms could perform noticeably better than a random predictor on the test set. This shows the importance of the intermediate hints introduced in the SMLP. The decision trees and SVMs can overfit the training set, but they could not generalize on the test set. Note that the numbers reported in the table are for hyperparameters selected based on validation set error, hence lower training errors are possible when avoiding all regularization and taking large enough models. On the training set, the MLP with two large hidden layers (several thousand units) could reach nearly 0% training error, but still did not manage to achieve good test error.

In the experimental results shown in Figure 14, we evaluate the impact of adding more training data for the fully-connected MLP. As mentioned before, for these experiments we have used an MLP with three hidden layers, where each layer has 2048 hidden units.
The tanh(·) activation function is used with a 0.05 learning rate and minibatches of size 200. As can be seen from the figure, adding more training examples did not help either training or test error (both are near 50%, with training error slightly lower and test error slightly higher), reinforcing the hypothesis that the difficulty encountered is one of optimization, not of regularization.

Figure 14: Training and test error bar charts for a regular MLP with 3 hidden layers. There is no significant improvement in the generalization error of the MLP as new training examples are introduced.

3.5 Experiments on Effect of Initializing with Hints

Initialization of the parameters in a neural network can have a big impact on learning and generalization (Glorot and Bengio, 2010). Previously, Erhan et al. (2010) showed that initializing the parameters of a neural network with unsupervised pretraining guides the learning towards basins of attraction of local minima that provide better generalization from the training dataset. In this section, we analyze the effect of initializing the SMLP with hints and then continuing the rest of the training without hints.

For the experimental analysis of hint-based initialization, the SMLP is trained for 1 epoch using the hints and then for 60 epochs without hints on the 40k-example training set. We also compared the same architecture with the same hyperparameters against an SMLP-nohints trained for 61 iterations on the same dataset. After one epoch of hint-based training, the SMLP obtained 9% training error and 39% test error. Following the hint-based training, the SMLP is trained without hints for 60 epochs, but by epoch 18 it had already reached 0% training and 0% test error. The hyperparameters for this experiment and for the SMLP-hints experiment whose results are shown in Table 1 are the same.
The test results for initialization with and without hints are shown in Figure 15. This figure suggests that initializing with hints can give the same generalization performance, but training takes longer.

Figure 15: Test error of the SMLP with random initialization vs. initialization by hint-based training.

3.5.1 Further Experiments on Optimization for the Pentomino Dataset

With extensive hyperparameter optimization and using standardization with the softmax nonlinearity in the intermediate level of the SMLP, SMLP-nohints was able to reach 5.3% training and 6.7% test error on the 80k Pentomino training dataset. We used 2050 hidden units for the hidden layer of P1NN and 11 softmax outputs per patch. For the P2NN, we used 1024 hidden units with sigmoid and learning rate 0.1, without using any adaptive learning rate method. This SMLP uses a rectifier nonlinearity for the hidden layers of both P1NN and P2NN. Considering that this architecture uses softmax as the intermediate activation function of SMLP-nohints, it is very likely that P1NN is learning to detect the presence of a specific Pentomino shape in a given patch. This architecture has a very large capacity in the P1NN, which probably provides it enough capacity to learn the presence of Pentomino shapes at each patch effortlessly.

An MLP with 2 hidden layers of 1024 rectifier units each was trained using LBFGS (the implementation from the scipy.optimize library) on 40k training examples, with gradients computed on batches of 10000 examples at each iteration. However, after convergence of training, the MLP was still performing at chance on the test dataset.

We also observed that using linear units for the intermediate layer yields better generalization error without standardization, compared to using activation functions such as sigmoid, tanh and RELU for the intermediate layer.
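As a concrete illustration of the batch LBFGS training mentioned above, the following sketch uses scipy.optimize on a tiny logistic-regression model rather than the paper's 2-hidden-layer MLP (the function name and model are illustrative assumptions; only the optimizer usage pattern carries over):

```python
import numpy as np
from scipy.optimize import minimize

def fit_with_lbfgs(X, y):
    """Batch training with scipy's L-BFGS implementation (a sketch).

    loss_and_grad returns both the scalar loss and its gradient, which
    minimize() consumes when jac=True. For the paper's MLP, the same
    pattern applies with the flattened network weights as the vector w.
    """
    def loss_and_grad(w):
        z = X @ w
        p = 1.0 / (1.0 + np.exp(-z))           # sigmoid predictions
        eps = 1e-12                            # guard against log(0)
        loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
        grad = X.T @ (p - y) / len(y)          # gradient of the mean loss
        return loss, grad

    w0 = np.zeros(X.shape[1])
    res = minimize(loss_and_grad, w0, jac=True, method='L-BFGS-B')
    return res.x
```

Unlike online SGD, each L-BFGS iteration here uses the full batch gradient, which is the regime the experiment above probes.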
SMLP-nohints was able to reach 25% generalization error with linear units without standardization, whereas all the other activation functions tested failed to generalize within the same number of training iterations without standardization and hints. This suggests that using non-linear intermediate-level activation functions without standardization introduces an optimization difficulty for the SMLP-nohints, maybe because the intermediate level acts like a bottleneck in this architecture.

4. Conclusion and Discussion

In this paper we have shown an example of a task which seems almost impossible to solve by standard black-box machine learning algorithms, but can be almost perfectly solved when one encourages a semantics for the intermediate-level representation that is guided by prior knowledge. The task has the particularity that it is defined by the composition of two non-linear sub-tasks (object detection on one hand, and a non-linear logical operation similar to XOR on the other hand).

What is interesting is that, in the case of the neural network, we can compare two networks with exactly the same architecture but different pre-training, one of which uses the known intermediate concepts to teach an intermediate representation to the network. With enough capacity and training time the networks can overfit, but they did not capture the essence of the task, as seen from test set performance. We know that a structured deep network can learn the task, if it is initialized in the right place, and can do it from very few training examples. Furthermore, we have shown that if one pre-trains the SMLP with hints for only one epoch, it can nail the task, whereas exactly the same architecture, starting from a random initialization, failed to generalize. Consider the fact that even SMLP-nohints with standardization, after being trained with online SGD on 1046000 generated examples, still gets 27.5% test error.
This is an indication that the problem is not a regularization problem but possibly an inability to find a good effective local minimum of generalization error. What we hypothesize is that for most initializations and architectures (in particular the fully-connected ones), although it is possible to find a good effective local minimum of training error when enough capacity is provided, it is difficult (without the proper initialization) to find a good local minimum of generalization error. On the other hand, when the network architecture is constrained enough but still allows the representation of a good solution (such as the structured MLP of our experiments), the optimization problem can still be difficult, and even training error remains stuck high if standardization is not used. Standardization clearly makes the training objective of the SMLP easier to optimize and helps it find at least a better effective local minimum of training error. This finding suggests that by using specific architectural constraints and sometimes domain-specific knowledge about the problem, one can alleviate the optimization difficulty that generic neural network architectures face.

It could be that the combination of the network architecture and the training procedure produces training dynamics that tend to yield minima that are poor from the point of view of generalization error, even when they manage to nail training error given enough capacity. Of course, as the number of examples increases, we would expect this discrepancy to decrease, but then the optimization problem could still make the task unfeasible in practice. Note however that our preliminary experiments with increasing the training set size (8-fold) for MLPs did not reveal signs of potential improvements in test error yet, as shown in Figure 14.
Even using online training on 545400 Pentomino examples, the SMLP-nohints architecture was still far from perfect in terms of generalization error (Figure 12). These findings bring supporting evidence to the "Guided Learning Hypothesis" and the "Deeper Harder Hypothesis" from Bengio (2013a): higher-level abstractions, which are expressed by composing simpler concepts, are more difficult to learn (with the learner often getting stuck in an effective local minimum), but that difficulty can be overcome if another agent provides hints about the importance of learning other, intermediate-level abstractions which are relevant to the task.

Many interesting questions remain open. Would a network without any guiding hint eventually find the solution with enough training time and/or with alternate parametrizations? To what extent is ill-conditioning a core issue? The results with L-BFGS were disappointing, but changes in the architecture (such as standardization of the intermediate level) seem to make training much easier. Clearly, one can reach good solutions from an appropriate initialization, pointing in the direction of an issue with local minima, but it may be that good solutions are also reachable from other initializations, albeit through a tortuous, ill-conditioned path in parameter space. Why did our attempts at learning the intermediate concepts in an unsupervised way fail? Are these results specific to the task we are testing, or a limitation of the unsupervised feature learning algorithms tested? Trying many more unsupervised variants and exploring explanatory hypotheses for the observed failures could help us answer these questions. Finally, and most ambitiously, can we solve these kinds of problems if we allow a community of learners to collaborate and collectively discover and combine partial solutions in order to obtain solutions to more abstract tasks like the one presented here?
Indeed, we would like to discover learning algorithms that can solve such tasks without the use of prior knowledge as specific and strong as the one used in the SMLP here. These experiments could be inspired by, and inform us about, potential mechanisms for collective learning through cultural evolution in human societies.

Acknowledgments

We would like to thank the ICLR 2013 reviewers for their insightful comments, and NSERC, CIFAR, Compute Canada and Canada Research Chairs for funding.

References

A. Ben-Hur and J. Weston. A user's guide to support vector machines. Methods in Molecular Biology, 609:223–239, 2010.

Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In NIPS'2006, 2007.

Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009. Also published as a book, Now Publishers, 2009.

Yoshua Bengio. Evolving culture vs local minima. In T. Kowaliw, N. Bredeche, and R. Doursat, editors, Growing Adaptive Machines: Integrating Development and Learning in Artificial Neural Networks. Springer-Verlag, March 2013a. Also available as arXiv:1203.2990. URL http://arxiv.org/abs/1203.2990.

Yoshua Bengio. Practical recommendations for gradient-based training of deep architectures. In K.-R. Müller, G. Montavon, and G. B. Orr, editors, Neural Networks: Tricks of the Trade. Springer, 2013b.

Yoshua Bengio, Jerome Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Léon Bottou and Michael Littman, editors, Proceedings of the Twenty-sixth International Conference on Machine Learning (ICML'09). ACM, 2009.

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Unsupervised feature learning and deep learning: A review and new perspectives.
Technical Report arXiv:1206.5538, U. Montreal, 2012.

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Unsupervised feature learning and deep learning: A review and new perspectives. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 2013.

James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), 2010.

Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010.

Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber. Deep big simple neural nets for handwritten digit recognition. Neural Computation, 22:1–14, 2010.

Yann Dauphin and Yoshua Bengio. Big neural networks waste capacity. Technical Report arXiv:1301.3583, Université de Montréal, 2013.

Richard Dawkins. The Selfish Gene. Oxford University Press, 1976.

J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2010.

Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11:625–660, February 2010.

François Fleuret, Ting Li, Charles Dubout, Emma K. Wampler, Steven Yantis, and Donald Geman. Comparing machines and humans on a visual categorization test. Proceedings of the National Academy of Sciences, 108(43):17621–17625, 2011.

X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In AISTATS, 2011a.

Xavier Glorot and Yoshua Bengio.
Understanding the difficulty of training deep feedforward neural networks. In JMLR W&CP: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), volume 9, pages 249–256, May 2010.

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In JMLR W&CP: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011), April 2011b.

Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. In ICML, 2013.

Maciej Henneberg. Decrease of human skull size in the Holocene. Human Biology, pages 395–405, 1988.

Maciej Henneberg and Maryna Steyn. Trends in cranial capacity and cranial index in Subsaharan Africa during the Holocene. American Journal of Human Biology, 5(4):473–479, 1993.

J. Henrich and R. McElreath. The evolution of cultural evolution. Evolutionary Anthropology: Issues, News, and Reviews, 12(3):123–135, 2003.

Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580, 2012.

C. W. Hsu, C. C. Chang, C. J. Lin, et al. A practical guide to support vector classification, 2003.

Kevin Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato, and Yann LeCun. What is the best multi-stage architecture for object recognition? In Proc. International Conference on Computer Vision (ICCV'09), pages 2146–2153. IEEE, 2009.

Faisal Khan, Xiaojin Zhu, and Bilge Mutlu.
How do humans teach: On curriculum learning and teaching dimension. In Advances in Neural Information Processing Systems 24 (NIPS'11), pages 1449–1457, 2011.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (NIPS'2012), 2012.

Kai A. Krueger and Peter Dayan. Flexible shaping: how learning in small steps helps. Cognition, 110:380–394, 2009.

G. Kunapuli, K. P. Bennett, R. Maclin, and J. W. Shavlik. The adviceptron: Giving advice to the perceptron. In Proceedings of the Conference on Artificial Neural Networks in Engineering (ANNIE 2010), 2010.

Hugo Larochelle, Yoshua Bengio, Jerome Louradour, and Pascal Lamblin. Exploring strategies for training deep neural networks. Journal of Machine Learning Research, 10:1–40, 2009.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

T. M. Mitchell. The Need for Biases in Learning Generalizations. Department of Computer Science, Laboratory for Computer Science Research, Rutgers University, 1980.

T. M. Mitchell and S. B. Thrun. Explanation-based neural network learning for robot control. Advances in Neural Information Processing Systems, pages 287–287, 1993.

R. Montague. Universal grammar. Theoria, 36(3):373–398, 1970.

V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML'10, 2010.

L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Belmont, Calif.: Wadsworth, 1984.

Joseph O'Sullivan. Integrating initialization bias and search bias in neural network learning, 1996.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in Python.
The Journal of Machine Learning Research, 12:2825–2830, 2011.

Gail B. Peterson. A day of great illumination: B. F. Skinner's discovery of shaping. Journal of the Experimental Analysis of Behavior, 82(3):317–328, 2004.

Tapani Raiko, Harri Valpola, and Yann LeCun. Deep learning made easier by linear transformations in perceptrons. In International Conference on Artificial Intelligence and Statistics, pages 924–932, 2012.

Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In ICML'2011, 2011.

Salah Rifai, Yoshua Bengio, Yann Dauphin, and Pascal Vincent. A generative process for sampling contractive auto-encoders. In Proceedings of the Twenty-ninth International Conference on Machine Learning (ICML'12). ACM, 2012. URL http://icml.cc/discuss/2012/590.html.

R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS 2009), volume 8, 2009.

Burrhus F. Skinner. Reinforcement today. American Psychologist, 13:94–99, 1958.

R. J. Solomonoff. A system for incremental learning based on algorithmic probability. In Proceedings of the Sixth Israeli Conference on Artificial Intelligence, Computer Vision and Pattern Recognition, pages 515–527, 1989.

G. G. Towell and J. W. Shavlik. Knowledge-based artificial neural networks. Artificial Intelligence, 70(1):119–165, 1994.

Tommi Vatanen, Tapani Raiko, Harri Valpola, and Yann LeCun. Pushing stochastic gradient towards second-order methods: backpropagation learning with transformations in nonlinearities. arXiv preprint arXiv:1301.3476, 2013.

Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol.
Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408, December 2010.

Luis von Ahn, Manuel Blum, Nicholas J. Hopper, and John Langford. CAPTCHA: Using hard AI problems for security. In Advances in Cryptology, EUROCRYPT 2003, pages 294–311. Springer, 2003.

Jason Weston, Frédéric Ratle, and Ronan Collobert. Deep learning via semi-supervised embedding. In William W. Cohen, Andrew McCallum, and Sam T. Roweis, editors, Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML'08), pages 1168–1175, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-205-4. doi: 10.1145/1390156.1390303.

Matthew D. Zeiler. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

5. Appendix

5.1 Binary-Binary RBMs on the Pentomino Dataset

We trained binary-binary RBMs (both visible and hidden units are binary) on 8×8 patches extracted from the Pentomino dataset, using PCD (stochastic maximum likelihood), a weight decay of 0.0001, and a sparsity penalty (implemented as TorontoSparsity in pylearn2; see the yaml file in the repository for more details). We used 256 hidden units and trained by SGD with a batch size of 32 and an annealed learning rate (Bengio, 2013b) starting from 1e-3 with annealing rate 1.000015. The RBM is trained with momentum starting from 0.5. The biases are initialized to -2 in order to get a sparse representation. The RBM is trained for 120 epochs (approximately 50 million updates).

After pretraining the RBM, its parameters are used to initialize the first layer of an SMLP-nohints network. As in the usual SMLP-nohints architecture, there is an intermediate layer on top of P1NN. Both P1NN and the intermediate layer have a sigmoid nonlinearity, and the intermediate layer has 11 units per location.
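The effect of initializing the hidden biases to -2 can be read directly off the logistic sigmoid: with zero input drive, each binary hidden unit is then active with probability sigmoid(-2), roughly 12%, which is what makes the initial representation sparse. A quick check:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# With biases initialized to -2 and no input drive, a binary hidden unit
# turns on with probability sigmoid(-2) ~ 0.12, i.e. about 12% of units
# active on average at initialization.
p_active = sigmoid(-2.0)
```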
This SMLP-nohints is trained with Adadelta and standardization at the intermediate layer. (In our auto-encoder experiments, by contrast, we directly fed the features to P2NN without standardization or Adadelta.)

Figure 16: Training and test errors of an SMLP-nohints network whose first layer is pre-trained as an RBM. Training error reaches 0% at epoch 42, but test error is still at chance.

Figure 17: Filters learned by the binary-binary RBM after training on the 40k examples. The RBM did learn the edge structure of Pentomino shapes.

Figure 18: 100 samples generated from the trained RBM. All the generated samples are valid Pentomino shapes.

5.2 Experimental Setup and Hyper-parameters

5.2.1 Decision Trees

We used the decision tree implementation in the scikit-learn (Pedregosa et al., 2011) Python package, which is an implementation of the CART (Classification and Regression Trees) algorithm. The CART algorithm constructs the decision tree recursively, partitioning the input space so that samples belonging to the same category are grouped together (Olshen and Stone, 1984). We used the Gini index as the impurity criterion. We evaluated hyper-parameter configurations with a grid search, cross-validating the maximum depth of the tree (max_depth, to prevent the algorithm from severely overfitting the training set) and the minimum number of samples required to create a split (min_split). 20 different configurations of hyper-parameter values were evaluated. We obtained the best validation error with max_depth = 300 and min_split = 8.

5.2.2 Support Vector Machines

We used the "Support Vector Classifier (SVC)" implementation from the scikit-learn package, which in turn uses libsvm's Support Vector Machine (SVM) implementation.
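The Gini impurity criterion used for the decision trees above is 1 minus the sum of squared class proportions at a node; CART greedily chooses the split that most reduces it. A minimal self-contained sketch (the function name is ours, not scikit-learn's API):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions.
    0 for a pure node, maximal for a uniform mix of classes."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# A pure node has impurity 0; a balanced binary node has impurity 0.5.
pure = gini([1, 1, 1, 1])
mixed = gini([0, 0, 1, 1])
```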
Kernel-based SVMs are non-parametric models that map the data into a high-dimensional space and separate the classes with hyperplane(s) such that the support vectors of each category are separated by a large margin. We cross-validated three hyper-parameters of the model using grid search: C, γ and the type of kernel (kernel_type). C is the penalty term (weight decay) for the SVM, and γ is a hyper-parameter that controls the width of the Gaussian for the RBF kernel. For the polynomial kernel, γ controls the flexibility of the classifier (degree of the polynomial) as the number of parameters increases (Hsu et al., 2003; Ben-Hur and Weston, 2010). We evaluated forty-two hyper-parameter configurations: two kernel types, {RBF, polynomial}; three gammas, {1e-2, 1e-3, 1e-4} for the RBF kernel and {1, 2, 5} for the polynomial kernel; and seven C values among {0.1, 1, 2, 4, 8, 10, 16}. As a result of the grid search and cross-validation, we obtained the best test error using the RBF kernel with C = 2 and γ = 1.

5.2.3 Multi-Layer Perceptron

We have our own implementation of the Multi-Layer Perceptron based on the Theano (Bergstra et al., 2010) machine learning libraries. We selected 2 hidden layers, the rectifier activation function, and 2048 hidden units per layer. We cross-validated three hyper-parameters of the model using random search, sampling the learning rates in log-domain and selecting the L1 and L2 regularization penalty coefficients from sets of fixed values, evaluating 64 hyper-parameter configurations. The ranges of the hyper-parameter values were: learning rate ∈ [0.0001, 1], L1 ∈ {0, 1e-6, 1e-5, 1e-4} and L2 ∈ {0, 1e-6, 1e-5}. As a result, the following were selected: L1 = 1e-6, L2 = 1e-5 and learning rate = 0.05.

5.2.4 Random Forests

We used scikit-learn's implementation of "Random Forests" decision tree learning.
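The role of γ in the RBF kernel described above, k(x, y) = exp(-γ ||x - y||²), can be made concrete in a few lines: larger γ narrows the Gaussian and makes the kernel more local. A minimal sketch:

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel: k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

Any point has similarity 1 with itself, and for a fixed pair of distinct points the similarity shrinks as γ grows, which is why γ and C have to be tuned jointly against overfitting.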
The Random Forests algorithm creates an ensemble of decision trees by randomly selecting for each tree a subset of features and applying bagging to combine the individual decision trees (Breiman, 2001). We used grid search and cross-validated max_depth, min_split, and the number of trees (n_estimators), over the following hyper-parameter values: n_estimators ∈ {5, 10, 15, 25, 50}, max_depth ∈ {100, 300, 600, 900}, and min_split ∈ {1, 4, 16}. We obtained the best validation error with max_depth = 300, min_split = 4 and n_estimators = 10.

5.2.5 k-Nearest Neighbors

We used scikit-learn's implementation of k-Nearest Neighbors (k-NN). k-NN is an instance-based, lazy learning algorithm that selects the training examples closest in Euclidean distance to the input query. It assigns a class label to the test example based on the categories of the k closest neighbors. The hyper-parameters we evaluated in the cross-validation are the number of neighbors (k) and weights. The weights hyper-parameter can be either "uniform" or "distance". With "uniform", the value assigned to the query point is computed by the majority vote of the nearest neighbors. With "distance", the vote of each neighbor is weighted by the inverse of its distance to the query point. We used n_neighbors ∈ {1, 2, 4, 6, 8, 12} and weights ∈ {"uniform", "distance"} for the hyper-parameter search. As a result of cross-validation and grid search, we obtained the best validation error with k = 2 and weights = "uniform".

5.2.6 Convolutional Neural Nets

We used a Theano (Bergstra et al., 2010) implementation of Convolutional Neural Networks (CNN) from the deep learning tutorial at deeplearning.net, which is based on a vanilla version of a CNN (LeCun et al., 1998).
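The two k-NN weighting schemes described above ("uniform" vs. "distance") can be sketched with plain Python; the function and variable names are illustrative, not scikit-learn's API:

```python
from collections import defaultdict

def knn_vote(neighbors, weights="uniform", eps=1e-12):
    """Vote among (distance, label) pairs for the k nearest neighbors.
    'uniform': plain majority vote; 'distance': votes weighted by 1/distance."""
    scores = defaultdict(float)
    for dist, label in neighbors:
        w = 1.0 if weights == "uniform" else 1.0 / (dist + eps)
        scores[label] += w
    return max(scores, key=scores.get)

# Two far "B" neighbors outvote one near "A" under 'uniform',
# but the near neighbor dominates under 'distance' weighting.
neighbors = [(0.1, "A"), (2.0, "B"), (2.5, "B")]
```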
Our CNN has two convolutional layers, each followed by a max-pooling layer. On top of the convolution-pooling-convolution-pooling layers there is an MLP with one hidden layer. In the cross-validation we sampled 36 learning rates in log-domain in the range [0.0001, 1] and the number of filters uniformly from {10, 20, 30, 40, 50, 60}. For the first convolutional layer we used 9×9 receptive fields in order to guarantee that each object fits inside the receptive field. As a result of random hyper-parameter search and manual hyper-parameter search on the validation dataset, the following values were selected:

• The number of feature maps is 30 for the first layer and 60 for the second.

• For the second convolutional layer, 7×7 receptive fields. The stride for both convolutional layers is 1.

• Convolved images are downsampled by a factor of 2×2 at each pooling operation.

• The learning rate for the CNN is 0.01, and it was trained for 8 epochs.

5.2.7 Maxout Convolutional Neural Nets

We used the pylearn2 (https://github.com/lisa-lab/pylearn2) implementation of maxout convolutional networks (Goodfellow et al., 2013). There are two convolutional layers in the selected architecture, without any pooling. In the last convolutional layer, there is a maxout non-linearity. The following were selected by cross-validation: the learning rate, the number of channels for both convolutional layers, the number of kernels for the second layer, the number of units and pieces per maxout unit in the last layer, a linearly decaying learning rate, and momentum starting from 0.5 and saturating to 0.8 at the 200th epoch. Random search was used to evaluate 48 different hyper-parameter configurations on the validation dataset.
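A maxout unit, as used in the last convolutional layer above, takes the maximum over k linear "pieces". A minimal numpy sketch, with shapes chosen purely for illustration (not the paper's configuration):

```python
import numpy as np

def maxout(x, W, b):
    """Maxout unit: elementwise max over k linear pieces.
    W has shape (k, num_inputs, num_units); b has shape (k, num_units)."""
    z = np.einsum('i,kij->kj', x, W) + b   # (k, num_units) pre-activations
    return z.max(axis=0)                   # max over the k pieces

# Tiny example: 2 pieces, 2 inputs, 1 output unit.
W = np.array([[[1.0], [0.0]],    # piece 1 picks x[0]
              [[0.0], [1.0]]])   # piece 2 picks x[1]
b = np.zeros((2, 1))
out = maxout(np.array([3.0, 5.0]), W, b)
```

Because the output is a max of learned linear functions, a maxout unit can approximate convex piecewise-linear activations, which is what makes it a drop-in generalization of the rectifier.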
For the first convolutional layer, 8×8 kernels were selected to make sure that each Pentomino shape fits into the kernel. Early stopping was used, and we report the test error of the model with the best validation error. Using a norm constraint on the fan-in of the final softmax units yields a slightly better result on the validation dataset. As a result of cross-validation and manual tuning, we used the following hyper-parameters:

• 16 channels per convolutional layer and 600 hidden units for the maxout layer.

• 6×6 kernels for the second convolutional layer.

• 5 pieces per maxout unit for the convolutional layers and 4 pieces for the maxout layer.

• The learning rate is decayed by a factor of 0.001, with an initial learning rate of 0.026367; the learning rate of the second convolutional layer is scaled by a constant factor of 0.6.

• The norm constraint (on the incoming weights of each unit) is 1.9365.

Figure 19 shows the first-layer filters of the maxout convolutional net after being trained on the 80k training set for 85 epochs.

Figure 19: Maxout convolutional net first-layer filters. Most of the filters were able to learn the basic edge structure of the Pentomino shapes.

5.2.8 Stacked Denoising Auto-Encoders

Denoising Auto-Encoders (DAE) are a form of regularized auto-encoder (Bengio et al., 2013). The DAE forces the hidden layer to discover more robust features, and prevents it from simply learning the identity, by reconstructing the input from a corrupted version of it (Vincent et al., 2010). Two DAEs were stacked, resulting in an unsupervised transformation with two hidden layers of 1024 units each. The parameters of all layers are then fine-tuned with supervised training, using logistic regression as the classifier and SGD as the gradient-based optimization algorithm.
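The binomial corruption used by the DAEs can be sketched as follows. This is one plausible reading of the description (each input value is, with some probability, replaced by a random 0 or 1, i.e. salt-and-pepper noise); the function name and toy input are ours:

```python
import random

def corrupt(x, p, rng=random.Random(0)):
    """Binomial (salt-and-pepper) corruption: with probability p, replace
    each input value by a random 0 or 1; otherwise keep it unchanged.
    An illustrative interpretation, not the paper's Theano code."""
    return [rng.choice((0.0, 1.0)) if rng.random() < p else v for v in x]

x = [0.2, 0.7, 0.9, 0.1] * 50   # toy input vector
x_tilde = corrupt(x, p=0.2)     # ~20% of entries forced to 0 or 1
```

The DAE is then trained to reconstruct the clean x from the corrupted x_tilde, which is what pushes the hidden layer away from the identity mapping.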
The stochastic corruption process is binomial (each input value is replaced by 0 or 1 with probability 0.2). The selected learning rates are 0.01 for the DAEs and 0.1 for the supervised fine-tuning. Both the L1 and L2 penalties, for the DAEs and for the logistic regression layer, are set to 1e-6.

CAE+MLP with Supervised Fine-tuning: A regularized auto-encoder which sometimes outperforms the DAE is the Contractive Auto-Encoder (CAE) (Rifai et al., 2012), which penalizes the Frobenius norm of the Jacobian matrix of derivatives of the hidden units with respect to the CAE's inputs. The CAE serves as pre-training for an MLP, and in the supervised fine-tuning stage the Adagrad method was used to automatically tune the learning rate (Duchi et al., 2010).

After training a CAE with 100 sigmoid units patch-wise, the features extracted on each patch are concatenated and fed as input to an MLP. The selected Jacobian penalty coefficient is 2, the learning rate for pre-training is 0.082 with a batch size of 200, and 200 epochs of unsupervised learning are performed on the training set. For supervised fine-tuning, the learning rate is 0.12 over 100 epochs, the L1 and L2 regularization penalty terms are respectively 1e-4 and 1e-6, and the top-level MLP has 6400 hidden units.

Greedy Layerwise CAE+DAE with Supervised Fine-tuning: For this experiment we stack a CAE with sigmoid non-linearities and then a DAE with rectifier non-linearities during the pre-training phase. As recommended by Glorot et al. (2011b), we used a softplus nonlinearity for reconstruction, softplus(x) = log(1 + e^x). We used an L1 penalty on the rectifier outputs to obtain a sparser representation, and L2 regularization to keep the non-zero weights small.

The main difference between the DAE and the CAE is that the DAE yields more robust reconstruction, whereas the CAE obtains more robust features (Rifai et al., 2011).
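For a sigmoid hidden layer h = sigmoid(Wx + b), the CAE's contraction penalty mentioned above has a closed form: the Jacobian of h with respect to x is diag(h(1-h))W, and the penalty is its squared Frobenius norm. A small numpy sketch (an illustration under these assumptions, not the paper's pylearn2/Theano code):

```python
import numpy as np

def cae_penalty(x, W, b):
    """Squared Frobenius norm of the Jacobian of a sigmoid hidden layer
    h = sigmoid(W x + b) with respect to the input x."""
    h = 1.0 / (1.0 + np.exp(-(W @ x + b)))
    jac = (h * (1.0 - h))[:, None] * W   # J[i, j] = dh_i / dx_j
    return float((jac ** 2).sum())
```

Adding this term (scaled by the Jacobian penalty coefficient) to the reconstruction loss drives the hidden units to be locally insensitive to input perturbations, which is the sense in which the CAE's features are "more robust".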
As seen in Figure 7, the weights U and V are shared across patches, and we concatenate the outputs of the last auto-encoder on each patch to feed them as input to an MLP with a large hidden layer. We used 400 hidden units for the CAE and 100 hidden units for the DAE. The learning rate used for the CAE is 0.82 and for the DAE it is 9e-3. The corruption level for the DAE (binomial noise) is 0.25 and the contraction level for the CAE is 2.0. The L1 regularization penalty for the DAE is 2.25e-4 and the L2 penalty is 9.5e-5. For the supervised fine-tuning phase, the learning rate used is 4e-4, with L1 and L2 penalties respectively 1e-5 and 1e-6. The top-level MLP has 6400 hidden units. The auto-encoders are each trained for 150 epochs, while the whole MLP is fine-tuned for 50 epochs.

Greedy Layerwise DAE+DAE with Supervised Fine-tuning: For this architecture, we trained two layers of denoising auto-encoders greedily and performed supervised fine-tuning after the unsupervised pre-training. The motivation for using two denoising auto-encoders is that rectifier nonlinearities work well with deep networks, but it is difficult to train CAEs with the rectifier non-linearity. We used the same type of denoising auto-encoder as in the greedy layerwise CAE+DAE supervised fine-tuning experiment. In this experiment we used 400 hidden units for the first-layer DAE and 100 hidden units for the second-layer DAE. The other hyper-parameters for the DAEs and the supervised fine-tuning are the same as in the CAE+DAE MLP supervised fine-tuning experiment.