SKYNET: an efficient and robust neural network training tool for machine learning in astronomy

We present the first public release of our generic neural network training algorithm, called SkyNet. This efficient and robust machine learning tool is able to train large and deep feed-forward neural networks, including autoencoders, for use in a wi…

Authors: Philip Graff, Farhan Feroz, Michael P. Hobson, Anthony Lasenby

Philip Graff (1), Farhan Feroz (2), Michael P. Hobson (2) and Anthony Lasenby (2,3)
(1) Gravitational Astrophysics Laboratory, NASA Goddard Space Flight Center, 8800 Greenbelt Rd., Greenbelt, MD 20771, USA. Email: philip.b.graff@gmail.com
(2) Astrophysics Group, Cavendish Laboratory, JJ Thomson Avenue, Cambridge CB3 0HE, UK. Email: f.feroz@mrao.cam.ac.uk
(3) Kavli Institute for Cosmology, Madingley Road, Cambridge CB3 0HA, UK

Accepted —. Received —; in original form 28 January 2014

ABSTRACT

We present the first public release of our generic neural network training algorithm, called SkyNet. This efficient and robust machine learning tool is able to train large and deep feed-forward neural networks, including autoencoders, for use in a wide range of supervised and unsupervised learning applications, such as regression, classification, density estimation, clustering and dimensionality reduction. SkyNet uses a 'pre-training' method to obtain a set of network parameters that has empirically been shown to be close to a good solution, followed by further optimisation using a regularised variant of Newton's method, where the level of regularisation is determined and adjusted automatically; the latter uses second-order derivative information to improve convergence, but without the need to evaluate or store the full Hessian matrix, by using a fast approximate method to calculate Hessian-vector products. This combination of methods allows for the training of complicated networks that are difficult to optimise using standard backpropagation techniques. SkyNet employs convergence criteria that naturally prevent overfitting, and also includes a fast algorithm for estimating the accuracy of network outputs. The utility and flexibility of SkyNet are demonstrated by application to a number of toy problems, and to astronomical problems focusing on the recovery of structure from blurred and noisy images, the identification of gamma-ray bursters, and the compression and denoising of galaxy images. The SkyNet software, which is implemented in standard ANSI C and fully parallelised using MPI, is available at http://www.mrao.cam.ac.uk/software/skynet/.

Key words: methods: data analysis – methods: statistical

1 INTRODUCTION

In modern astronomy, one is increasingly faced with the problem of analysing large, complicated and multidimensional data sets. Such analyses typically include tasks such as: data description and interpretation, inference, pattern recognition, prediction, classification, compression, and many more. One way of performing such tasks is through the use of machine learning methods. For accessible accounts of machine learning and its use in astronomy, see, for example, MacKay (2003), Ball & Brunner (2010) and Way et al. (2012). Moreover, machine learning software that is easily used for astronomy, such as the Python-based astroML package (http://astroml.github.com/) or the C-based Fast Artificial Neural Network Library (FANN, http://leenissen.dk/fann/wp/), has recently started to become available.

Two major categories of machine learning are supervised learning and unsupervised learning.
In supervised learning, the goal is to infer a function from labelled training data, which consist of a set of training examples. Each example has both 'properties' and 'labels'. The properties are known 'input' quantities whose values are to be used to predict the values of the labels, which may be considered as 'output' quantities. Thus, the function to be inferred is the mapping from properties to labels. Once learned, this mapping can be applied to data sets for which the values of the labels are not known. Supervised learning is usually further subdivided into classification and regression. In classification the labels take discrete values, whereas in regression the labels are continuous. In astronomy, for example, using multifrequency observations of a supernova lightcurve (its properties) to determine its type (e.g. Ia, Ib, II, etc.) is a classification problem, since the label (supernova type) is discrete (see, e.g., Karpenka, Feroz & Hobson 2013), whereas using the observations to determine (say) the energy output of the supernova explosion is a regression problem, since the label (energy output) is continuous. Classification can also be used to obtain a distribution for an output value that would normally be treated as a regression problem; this is demonstrated by Bonnett (2013) for measuring redshifts in CFHTLenS.

A particularly important recent application of regression supervised learning in astrophysics and cosmology (and beyond) is the acceleration of the statistical analysis of large data sets in the context of complicated models. In such analyses, one typically performs many (~10^4-10^6) evaluations of the likelihood function describing the probability of obtaining the data for different sets of values of the model parameters. For some problems, in particular in cosmology, each such function evaluation can take up to tens of seconds, making the analysis very computationally expensive. By performing regression supervised learning to infer and then replace the mapping between model parameters and likelihood value, one can reduce the computation required for each likelihood evaluation by several orders of magnitude, thereby vastly accelerating the analysis (see, e.g., Fendt & Wandelt 2007; Auld et al. 2008; Auld, Bridges & Hobson 2008).

In unsupervised learning, the data have no labels. More precisely, the quantities (often termed 'observations') associated with each data item are not divided into properties (inputs) and labels (outputs). This lack of a 'causal structure', where the inputs are assumed to be at the beginning and outputs at the end of a causal chain, is the key difference from supervised learning. Instead, all the observations are considered to be at the end of a causal chain, which is assumed to begin with some set of 'latent' (or hidden) variables. The aim of unsupervised learning is to infer the number and/or nature of these latent variables (which may be discrete or continuous) by finding similarities between the data items. This then enables one to summarise and explain key features of the data set. The most common tasks in unsupervised learning include density estimation, clustering and dimensionality reduction. Indeed, in some cases, dimensionality reduction can be used as a pre-processing step to supervised learning, since classification and regression can sometimes be performed in the reduced space more accurately than in the original space.
As an astronomical example of unsupervised learning, one might wish to use multifrequency observations of the lightcurves of a set of supernovae to determine how many different types of supernovae are contained in the set (a clustering task). Alternatively, if the data set also includes the type of each supernova (determined using spectroscopic observations), one might wish to determine which properties, or combination of properties, in the lightcurves are most important for determining their type photometrically (a dimensionality reduction task). This reduced set of property combinations could then be used instead of the original lightcurve data to perform the supernova classification or regression analyses mentioned above.

An intuitive and well-established approach to machine learning, both supervised and unsupervised, is based on the use of artificial neural networks (NNs), which are loosely inspired by the structure and functional aspects of a brain. They consist of a group of interconnected nodes, each of which processes information that it receives and then passes this product on to other nodes via weighted connections. In this way, NNs constitute a non-linear statistical data modelling tool, which may be used to represent complex relationships between a set of inputs and outputs, to find patterns in data, or to capture the statistical structure in an unknown joint probability distribution between observed variables. In general, the structure of a NN can be arbitrary, but many machine learning applications can be performed using only feed-forward NNs. For such networks the structure is directed: an input layer of nodes passes information to an output layer via zero, one, or many 'hidden' layers in between. Such a network is able to 'learn' a mapping between inputs and outputs, given a set of training data, and can then make predictions of the outputs for new input data. Moreover, a universal approximation theorem assures us that we can accurately and precisely approximate the mapping with a NN of a given form. A useful introduction to NNs can be found in MacKay (2003).

In astronomy, feed-forward NNs have been applied to various machine learning problems for over 20 years (see, e.g., Andreon et al. 1999, 2000; Longo, Tagliaferri & Andreon 2001; Tagliaferri et al. 2003a,b; Way et al. 2012). Nonetheless, their more widespread use in astronomy has been limited by the difficulty associated with standard techniques, such as backpropagation, in training networks having many nodes and/or numerous hidden layers (i.e. 'large' and/or 'deep' networks), which are often necessary to model the complicated mappings between the numerous inputs and outputs in modern astronomical applications.

In this paper, we therefore present the first public release of SkyNet: an efficient and robust neural network training tool that is able to train large and/or deep feed-forward networks. (SkyNet may also be used to train recurrent neural networks, see, e.g., Mandic & Chambers 2001, but its application to such networks will be discussed in a future work.) SkyNet is able to achieve this by using a combination of the 'pre-training' method of Hinton, Osindero & Teh (2006), to obtain a set of network weights close to a good optimum of the training objective function, followed by further optimisation of the weights using a regularised variant of Newton's method based on that developed for the MemSys software package (Gull & Skilling 1999).
In particular, second-order derivative information is used to improve convergence, but without the need to evaluate or store the full Hessian matrix, by using a fast approximate method to calculate Hessian-vector products (Schraudolph 2002; Martens 2010). SkyNet is implemented in the standard ANSI C programming language and parallelised using MPI (a version of SkyNet parallelised for GPUs using CUDA is currently in development).

We also note that SkyNet has already been combined with MultiNest (Feroz & Hobson 2008; Feroz, Hobson & Bridges 2009; Feroz et al. 2013) to produce the Blind Accelerated Multimodal Bayesian Inference (BAMBI) package (Graff et al. 2012), which is a generic and completely automated tool for greatly accelerating Bayesian inference problems (by up to a factor of ~10^6; see, e.g., Bridges et al. 2011). MultiNest is a fully-parallelised implementation of nested sampling (Skilling 2004), extended to handle multimodal and highly-degenerate distributions. In most astrophysical (and particle physics) Bayesian inference problems, MultiNest typically reduces the number of likelihood evaluations required by an order of magnitude or more, compared to standard MCMC methods, but BAMBI achieves further substantial gains by speeding up the evaluation of the likelihood itself by replacing it with a trained regression neural network. BAMBI proceeds by first using MultiNest to obtain a specified number of new samples from the model parameter space, and then uses these as input to SkyNet to train a network on the likelihood function. After convergence to the optimal weights, the network's ability to predict likelihood values to within a specified tolerance level is tested. If it fails, sampling continues using the original likelihood until enough new samples have been made for training to be performed again. Once a network is trained that is sufficiently accurate, its predictions are used in place of the original likelihood function for future samples for MultiNest. On typical problems in cosmology, for example, using the network reduces the likelihood evaluation time from seconds to less than a millisecond, allowing MultiNest to complete the analysis much more rapidly. As a bonus, at the end of the analysis the user also obtains a network that is trained to provide more likelihood evaluations near the peak if needed, or in subsequent analyses. With the public release of SkyNet, we now also make BAMBI publicly available (http://www.mrao.cam.ac.uk/software/bambi/).

The structure of this paper is as follows. In Section 2 we describe the general structure of feed-forward NNs, including a particular special case of such networks, called autoencoders, which may be used for performing non-linear dimensionality reduction. In Section 3 we present the procedures used by SkyNet to train networks of these types. SkyNet is then applied to some toy machine learning examples in Section 4, including a regression task, a classification task, and a dimensionality reduction task using autoencoders.
We also apply SkyNet to the problem of classifying images of handwritten digits from the MNIST database, which is a widely-used benchmarking test of machine learning algorithms. The application of SkyNet to astronomical machine learning examples is presented in Section 5, including: a regression task to determine the projected ellipticity of a galaxy from blurred and noisy images of the galaxy and of a field star; a classification task, based on a simulated gamma-ray burst detection pipeline for the Swift satellite (Gehrels et al. 2004), to determine if a GRB with given source parameters will be detected; and a dimensionality reduction task using autoencoders to compress and denoise galaxy images. Finally, we present our conclusions in Section 6.

2 NETWORK STRUCTURE

2.1 Feed-forward neural networks

A multilayer perceptron feed-forward neural network is the simplest type of network and consists of ordered layers of perceptron nodes that pass scalar values from one layer to the next. The perceptron is the simplest kind of node, and maps an input vector $\mathbf{x} \in \mathbb{R}^n$ to a scalar output $f(\mathbf{x}; \mathbf{w}, \theta)$ via

  $f(\mathbf{x}; \mathbf{w}, \theta) = \theta + \sum_{i=1}^{n} w_i x_i$,   (1)

where $\mathbf{w} = \{w_i\}$ and $\theta$ are the parameters of the perceptron, called the 'weights' and 'bias', respectively. For a 3-layer NN, which consists of an input layer, a hidden layer and an output layer, as shown in Fig. 1, the outputs of the nodes in the hidden and output layers are given by the following equations:

  hidden layer: $h_j = g^{(1)}(f_j^{(1)})$,  $f_j^{(1)} = \theta_j^{(1)} + \sum_l w_{jl}^{(1)} x_l$,   (2)

  output layer: $y_i = g^{(2)}(f_i^{(2)})$,  $f_i^{(2)} = \theta_i^{(2)} + \sum_j w_{ij}^{(2)} h_j$,   (3)

where $l$ runs over input nodes, $j$ runs over hidden nodes, and $i$ runs over output nodes. The functions $g^{(1)}$ and $g^{(2)}$ are called activation functions and must be smooth and monotonic for our purposes. We use $g^{(1)}(x) = 1/(1 + e^{-x}) \equiv \mathrm{sig}(x)$ (sigmoid) and $g^{(2)}(x) = x$; the non-linearity of $g^{(1)}$ is essential to allowing the network to model non-linear functions. To expand the NN to include more hidden layers, we iterate (2) for each connection from one hidden layer to the next, each time using the same activation function, $g^{(1)}$. The final hidden layer connects to the output layer using the relation (3).

Figure 1. A 3-layer neural network with 3 inputs, 4 hidden nodes, and 2 outputs. Image courtesy of Wikimedia Commons.

The weights and biases are the values we wish to determine in our training (described in Section 3). As they vary, a huge range of non-linear mappings from inputs to outputs is possible. In fact, a universal approximation theorem (Cybenko 1989; Hornik, Stinchcombe & White 1990) states that a NN with three or more layers can approximate any continuous function to some given accuracy, as long as the activation function is locally bounded, piecewise continuous, and not a polynomial (hence our use of the sigmoid $g^{(1)}$, although other functions, such as tanh, would work just as well). By increasing the number of hidden nodes, one can achieve more accuracy at the risk of overfitting to the training data.

Other activation functions have also been proposed, such as the rectified linear function, for which $g(x) = \max\{0, x\}$, or the 'softsign' function, for which $g(x) = x/(1 + |x|)$. It has been argued that the former removes the need for pre-training (as described in Section 3.3) and serves as a better model of biological neurons (Glorot et al. 2011). The 'softsign' is similar to tanh, but with a slower approach to the asymptotes of ±1 (quadratic rather than exponential) (Bergstra et al. 2009; Glorot & Bengio 2010).
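To make the mapping of equations (1)-(3) concrete, here is a minimal NumPy sketch of the forward pass for a single-hidden-layer network with a sigmoid hidden activation and a linear output. The function and variable names are illustrative only and are not part of the SkyNet interface.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation g1(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W1, theta1, W2, theta2):
    """Forward pass of a 3-layer network, following equations (2) and (3).

    x      : input vector, shape (n_in,)
    W1     : hidden-layer weights, shape (n_hid, n_in)
    theta1 : hidden-layer biases,  shape (n_hid,)
    W2     : output-layer weights, shape (n_out, n_hid)
    theta2 : output-layer biases,  shape (n_out,)
    """
    f1 = theta1 + W1 @ x     # weighted sum into each hidden node
    h = sigmoid(f1)          # hidden activations, g1 = sigmoid
    f2 = theta2 + W2 @ h     # weighted sum into each output node
    return f2                # g2 is the identity for regression outputs

# Example: a 3 + 4 + 2 network with small random parameters
rng = np.random.default_rng(0)
W1, theta1 = rng.normal(0, 0.1, (4, 3)), np.zeros(4)
W2, theta2 = rng.normal(0, 0.1, (2, 4)), np.zeros(2)
print(forward(np.array([0.2, -1.0, 0.5]), W1, theta1, W2, theta2))
```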
2.2 Autoencoders

Autoencoders are a specific type of feed-forward neural network containing one or more hidden layers, in which the inputs are mapped to themselves, i.e. the network is trained to approximate the identity operation; when more than one hidden layer is used, this is typically referred to as a 'deep' autoencoder. Such networks typically contain several hidden layers and are symmetric about a central layer containing fewer nodes than there are inputs (or outputs). A basic diagram of an autoencoder is shown in Fig. 2, in which the three inputs $(x_1, x_2, x_3)$ are mapped to themselves via three symmetrically-arranged hidden layers, with two nodes in the central layer.

Figure 2. Schematic diagram of an autoencoder. The three input values are encoded to two feature variables. Pre-training (described in Section 3.3) defines the weight matrices W1 and W2.

An autoencoder can thus be considered as two half-networks, with one part mapping the inputs to the central layer and the second part mapping the central-layer values to the outputs (which approximate as closely as possible the original inputs). These two parts are called the 'encoder' and 'decoder', respectively, and map either to or from a reduced set of 'feature variables' embodied in the nodes of the central layer (denoted by $z_1$ and $z_2$ in Fig. 2). These variables are, in general, non-linear functions of the original input variables. One can determine this dependence for each feature variable in turn simply by decoding $(z_1, 0, 0, \ldots, 0)$, $(0, z_2, 0, \ldots, 0)$, and so on, as the corresponding $z_i$ value is varied; in this way, for each feature variable, one obtains a curve in the original data space. Conversely, the collection of feature values $(z_1, z_2, \ldots, z_M)$ in the central layer might reasonably be termed the feature vector of the input data. Autoencoders therefore provide a very intuitive approach to non-linear dimensionality reduction and constitute a natural generalisation of linear methods such as principal component analysis (PCA) and independent component analysis (ICA), which are widely used in astronomy. Indeed, an autoencoder with a single hidden layer and linear activation functions may be shown to be identical to PCA (Sanger 1989). This topic is explored further in Section 4.3. It is worth noting that encoding from input data to feature variables can also be useful in performing clustering tasks; this is illustrated in Section 4.4.

Autoencoders are, however, notoriously difficult to train, since the objective function contains a broad local maximum where each output is the average value of the inputs (Erhan et al. 2010). Nonetheless, this difficulty can be overcome by the use of pre-training methods, as discussed in Section 3.3.
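As an illustration of the encoder/decoder decomposition described above, the following sketch encodes an input vector to the central-layer feature variables and decodes it back. The layer lists, the choice of sigmoid hidden activations with a linear output, and the function names are assumptions made for this example rather than SkyNet's actual implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode(x, enc_layers):
    """Map inputs to the central-layer feature variables z.

    enc_layers is a list of (W, theta) pairs for the encoding half;
    sigmoid activations are assumed on each hidden layer.
    """
    h = x
    for W, theta in enc_layers:
        h = sigmoid(theta + W @ h)
    return h  # feature vector (z_1, ..., z_M)

def decode(z, dec_layers):
    """Map central-layer feature variables back to the data space."""
    h = z
    for W, theta in dec_layers[:-1]:
        h = sigmoid(theta + W @ h)
    W, theta = dec_layers[-1]
    return theta + W @ h  # linear output layer reproduces the inputs

# Example: a 3 -> 2 -> 3 autoencoder with one encoding and one decoding matrix
rng = np.random.default_rng(0)
enc = [(rng.normal(0, 0.1, (2, 3)), np.zeros(2))]
dec = [(rng.normal(0, 0.1, (3, 2)), np.zeros(3))]
x = np.array([0.1, 0.7, 0.3])
z = encode(x, enc)              # the feature vector (z1, z2)
x_reconstructed = decode(z, dec)
```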
2.3 Choosing the number of hidden layers and nodes

An important choice when training a NN is the number of nodes in its hidden layers. The optimal number, and its organisation into one or more layers, has a complicated dependence on the number of training data points, the number of inputs and outputs, and the complexity of the function to be learned. Choosing too few nodes will mean that the NN is unable to learn the relationship to the highest possible accuracy; choosing too many will increase the risk of overfitting to the training data and will also slow down the training process. Using empirical evidence (Murtagh 1991) and theoretical considerations (Geva & Sitte 1992), it has been suggested that the optimal architecture for approximating a continuous function is one hidden layer containing 2N + 1 nodes, where N is the number of input nodes. Serra-Ricart et al. (1993) also find empirical support for this suggestion. Such a choice allows the network to model the form of the mapping function without unnecessary work.

In practice, it can be better to (slightly) over-estimate the number of hidden nodes required. As described in Section 3, SkyNet performs basic checks to prevent overfitting, and the additional training time associated with having more hidden nodes is not a large penalty if an optimal network can be obtained in an early attempt. In any case, given a particular problem, the optimal network structure, both in terms of the number of hidden nodes and how they are distributed into layers, can be determined by comparing the correlation and error-squared of different trained NNs; this is illustrated in Section 4.

3 NETWORK TRAINING

In training a NN, we wish to find the optimal set of network weights and biases that maximise the accuracy of the predicted outputs. However, we must be careful to avoid overfitting to our training data, which may lead to inaccurate predictions for inputs on which the network has not been trained.

The set of training data inputs and outputs (or 'targets'), $\mathcal{D} = \{\mathbf{x}^{(k)}, \mathbf{t}^{(k)}\}$, is provided by the user (where $k$ counts training items). Approximately 75 per cent should be used for actual NN training and the remainder retained as a validation set that will be used to determine convergence and to avoid overfitting. This ratio of 3:1 gives plenty of information for training but still leaves a representative subset of the data for checks to be made.

3.1 Data whitening

It is prudent to 'whiten' the data before training a network. Whitening normalises the input and/or output values, so that it is easier to train a network starting from initial weights that are small and centred on zero. The network weights in the first and last layers can then be 'unwhitened' after training so that the network will be able to perform the mapping from the original inputs to outputs.

Standard whitening transforms each input to a standard distribution by subtracting the mean and dividing by the standard deviation over all elements in the training data, such that

  $\tilde{x}_l^{(k)} = \dfrac{x_l^{(k)} - \bar{x}_l}{\sigma_l}$,   (4a)

  $\bar{x}_l = \dfrac{1}{K} \sum_{k=1}^{K} x_l^{(k)}$,   (4b)

  $\sigma_l^2 = \dfrac{1}{K-1} \sum_{k=1}^{K} \left( x_l^{(k)} - \bar{x}_l \right)^2$.   (4c)

An alternative whitening transform is also commonly used, wherein all values are scaled and shifted into the interval [0, 1], such that

  $\tilde{x}_l^{(k)} = \dfrac{x_l^{(k)} - \min_k (x_l^{(k)})}{\max_k (x_l^{(k)}) - \min_k (x_l^{(k)})}$.   (5)

One of these transforms may be chosen by the user if they wish to whiten the inputs of the training data. The whitening is normally performed separately on each input, but can be calculated across all inputs if they are related; the mean, standard deviation, minimum, or maximum would then be computed over all inputs for all training data items. The chosen whitening transform is also used for whitening the outputs. Since both transforms consist of subtracting an offset and multiplying by a scale factor, they can easily be performed and reversed. To unwhiten network weights, the inverse transform is applied, with the offset and scale determined by the source input node or target output node. Outputs for a classification network are not whitened, since they are already just probabilities (see below).
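Both whitening transforms are simple to apply and to reverse, as the following sketch illustrates. The function names are illustrative, and the per-column treatment corresponds to whitening each input separately, as described above.

```python
import numpy as np

def whiten_standard(X):
    """Whiten each column to zero mean and unit variance, as in equation (4).

    X has shape (K, n): K training items, n inputs (or outputs).
    Returns the whitened data plus the offsets and scales needed to undo it.
    """
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)       # (K - 1) normalisation, as in (4c)
    return (X - mean) / std, mean, std

def whiten_minmax(X):
    """Scale and shift each column into [0, 1], as in equation (5)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo), lo, hi - lo

def unwhiten(Xw, offset, scale):
    """Reverse either transform: multiply by the scale and add back the offset."""
    return Xw * scale + offset
```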
3.2 Network objective function

Let us denote the network weights and biases collectively by the network parameter vector $\mathbf{a}$. SkyNet considers the parameters $\mathbf{a}$ to be random variables with a posterior distribution given by

  $P(\mathbf{a}; \alpha, \boldsymbol{\sigma}) \propto \mathcal{L}(\mathbf{a}; \boldsymbol{\sigma}) \times S(\mathbf{a}; \alpha)$,   (6)

where $\mathcal{L}(\mathbf{a}; \boldsymbol{\sigma})$ is the likelihood, which also depends on a set of hyperparameters $\boldsymbol{\sigma}$ that describe the standard deviation of the outputs (see below). The likelihood encodes how well the NN, characterised by a given set of parameters $\mathbf{a}$, is able to reproduce the known training data outputs. This is modulated by the prior $S(\mathbf{a}; \alpha)$, which is assumed to have the (logarithmic) form

  $\log S(\mathbf{a}; \alpha) = -\dfrac{\alpha}{2} \sum_i a_i^2$,   (7)

where $\alpha$ is a hyperparameter that plays the role of a regularisation parameter during optimisation, since it determines the relative importance of the prior and the likelihood. This prior can also be seen as an $\ell_2$-norm penalty. The form of the likelihood depends on the type of network being trained.

3.2.1 Regression likelihood

For regression problems, SkyNet assumes a log-likelihood function for the network parameters $\mathbf{a}$ given by the standard $\chi^2$ misfit function

  $\log \mathcal{L}(\mathbf{a}; \boldsymbol{\sigma}) = -\dfrac{K \log(2\pi)}{2} - \sum_{i=1}^{N} \log(\sigma_i) - \dfrac{1}{2} \sum_{k=1}^{K} \sum_{i=1}^{N} \left[ \dfrac{t_i^{(k)} - y_i(\mathbf{x}^{(k)}; \mathbf{a})}{\sigma_i} \right]^2$,   (8)

where $N$ is the number of outputs, $K$ is the number of training data examples and $\mathbf{y}(\mathbf{x}^{(k)}; \mathbf{a})$ is the NN's predicted output vector for the input vector $\mathbf{x}^{(k)}$ and network parameters $\mathbf{a}$. The hyperparameters $\boldsymbol{\sigma} = \{\sigma_i\}$ describe the standard deviation (error size) of each of the outputs.

3.2.2 Classification likelihood

For classification problems, SkyNet again uses continuous outputs (rather than discrete ones), which are interpreted as the probabilities that a set of inputs belongs to a particular output class. This is achieved by applying the softmax transformation to the output values, so that they are all non-negative and sum to unity, namely

  $y_i(\mathbf{x}^{(k)}; \mathbf{a}) \rightarrow \dfrac{\exp[y_i(\mathbf{x}^{(k)}; \mathbf{a})]}{\sum_{j=1}^{N} \exp[y_j(\mathbf{x}^{(k)}; \mathbf{a})]}$.   (9)

The classification likelihood is then given by the cross-entropy of the targets and softmaxed output values (MacKay 2003),

  $\log \mathcal{L}(\mathbf{a}; \boldsymbol{\sigma}) = -\sum_{k=1}^{K} \sum_{i=1}^{N} t_i^{(k)} \log y_i(\mathbf{x}^{(k)}; \mathbf{a})$.   (10)

In this scenario, the true and predicted output values are both probabilities (which lie in [0, 1]). For the true outputs, all are zero except for the correct output, which has a value of unity. For classification networks, the $\boldsymbol{\sigma}$ hyperparameters do not appear in the log-likelihood.
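For illustration, a direct transcription of equations (9) and (10) into NumPy is given below; the subtraction of the maximum output before exponentiation is a standard numerical-stability device and is not part of the equations themselves. The function names are illustrative.

```python
import numpy as np

def softmax(y):
    """Softmax transformation of raw network outputs, equation (9)."""
    e = np.exp(y - y.max())     # subtract the maximum for numerical stability
    return e / e.sum()

def cross_entropy(targets, raw_outputs):
    """Cross-entropy of targets and softmaxed outputs, as in equation (10).

    targets     : (K, N) array of one-hot true class probabilities
    raw_outputs : (K, N) array of un-softmaxed network outputs
    """
    total = 0.0
    for t, y in zip(targets, raw_outputs):
        p = softmax(y)
        total -= np.sum(t * np.log(p))
    return total
```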
3.3 Initialisation and pre-training

The training of the NN can be started from some random initial state, or from a state determined by the 'pre-training' procedure discussed below. In the former case, network training begins by setting random values for the network parameters, sampled from a normal distribution with zero mean and variance of 0.01 (this value can be modified by the user).

In the latter case, SkyNet makes use of the pre-training approach developed by Hinton, Osindero & Teh (2006), which obtains a set of network weights and biases close to a good solution of the network objective function. This method was originally devised with autoencoders in mind and is based on the model of restricted Boltzmann machines (RBMs). An RBM is a generative model that can learn a probability distribution over a set of inputs. It consists of a layer of input nodes and a layer of hidden nodes, as shown in Figure 3. In this case, the map from the inputs to the hidden layer and then back is treated symmetrically, and the weights are adjusted through a number of 'epochs', gradually reducing the reproduction error. To model an autoencoder, RBMs are 'stacked', with each RBM's hidden layer being the input for the next. The initial case is the NN's inputs to the first hidden layer; this is repeated for the first to second hidden layer, and so on until the central layer is reached. The network weights can then be 'unfolded' by using the transpose for the symmetric connections in the decoding half to provide a decent starting point for the full training to begin. This is shown in Fig. 2, where the W1 and W2 weight matrices are defined by pre-training.

Figure 3. Diagram of an RBM with 3 visible nodes and 4 hidden nodes. Bias nodes are not shown. Image courtesy of Wikimedia Commons.

The training is then performed using contrastive divergence (Carreira-Perpiñán & Hinton 2005). This procedure can be summarised in the following steps, where sampling indicates setting the value of the node to 1 with the probability calculated and 0 otherwise (a code sketch of a single update of this kind is given at the end of this subsection).

(i) Take a training sample x and compute the probabilities of the hidden nodes (their values using a sigmoid activation function) and sample a hidden vector h from this distribution.
(ii) Let g+ = x ⊗ h, where ⊗ indicates the outer product.
(iii) Using h, compute the probabilities of the visible nodes and sample x′ from this distribution. Resample the hidden vector from this to obtain h′.
(iv) Let g− = x′ ⊗ h′.
(v) Update w_{i,j} ↦ w_{i,j} + r (g+ − g−) for some learning rate 0 < r ≤ 1.

More details can be found in Hinton, Osindero & Teh (2006) and in Hinton & Salakhutdinov (2006), the latter of which has useful diagrams and explanations.

This pre-training approach can also be used for more general feed-forward networks. All layers of weights, except for the final one that connects the last hidden layer to the outputs, are pre-trained as if they were the first half of a symmetric autoencoder. However, the network weights are not unfolded; instead, the final layer of weights is initialised randomly, as would have been done without pre-training. In this way, the network 'learns the inputs' before mapping to a set of outputs. This has been shown to greatly reduce the training time on multiple problems (Glorot & Bengio 2010; Erhan et al. 2010).

We note that when an autoencoder is pre-trained, the activation function to the central hidden layer is made linear and the activation function from the final hidden layer to the outputs is made sigmoidal. General feed-forward networks that are pre-trained continue to use the original activation functions. Both of these are simply the default settings and the user has the freedom to alter them to suit their specific problem.
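A minimal sketch of a single contrastive-divergence update, following steps (i)-(v) above, is given below. For brevity it assumes binary visible units and omits the bias terms, so it illustrates the weight update only rather than a complete RBM implementation; the function and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(x, W, r=0.1):
    """One contrastive-divergence (CD-1) weight update for an RBM.

    x : visible vector of 0/1 values, shape (n_vis,)
    W : weight matrix, shape (n_vis, n_hid)
    r : learning rate, 0 < r <= 1
    """
    # (i) hidden probabilities given the data, then sample a binary hidden vector
    p_h = sigmoid(x @ W)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    # (ii) positive statistics: outer product of the data and sampled hidden state
    g_plus = np.outer(x, h)
    # (iii) reconstruct the visible layer from h, then resample the hidden layer
    p_v = sigmoid(W @ h)
    x_prime = (rng.random(p_v.shape) < p_v).astype(float)
    p_h2 = sigmoid(x_prime @ W)
    h_prime = (rng.random(p_h2.shape) < p_h2).astype(float)
    # (iv) negative statistics from the reconstruction
    g_minus = np.outer(x_prime, h_prime)
    # (v) gradient step on the weights
    return W + r * (g_plus - g_minus)
```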
3.4 Optimisation of the objective function

Once the initial set of network parameters has been obtained, either by assigning them randomly or through pre-training, the network is then trained (further) by iterative optimisation of the objective function.

First, initial values of the hyperparameters $\boldsymbol{\sigma}$ (for regression networks) and $\alpha$ are set. The values $\boldsymbol{\sigma}$ are set by the user and can be specified either on the true output values themselves or on their whitened values (as defined in Section 3.1); the only difference between these two settings is the magnitude of the error used. The algorithm then calculates a large initial estimate for $\alpha$,

  $\alpha = \dfrac{|\nabla \log \mathcal{L}|}{\sqrt{M}\, r}$,   (11)

where $M$ is the total number of weights and biases (NN parameters) and $r$ is a rate set by the user ($0 < r \le 1$, default $r = 0.1$) that defines the size of the 'confidence region' for the gradient. This expression for $\alpha$ sets larger regularisation (or 'damping') when the magnitude of the gradient of the likelihood is larger; this relates the amount of 'smoothing' required to the steepness of the function being smoothed. The rate factor in the denominator allows us to increase the damping for smaller confidence regions on the value of the gradient. This results in smaller, more conservative steps that are more likely to produce an increase in the function value, but means that more steps are required to reach the optimal weights.

NN training then proceeds using an adapted form of a truncated Newton (or 'Hessian-free') optimisation algorithm, as described below, to calculate the step, $\delta\mathbf{a}$, that should be taken at each iteration. Following each such step, adjustments to $\alpha$ and $\boldsymbol{\sigma}$ may be made before another step is calculated. First, $\boldsymbol{\sigma}$ can be updated by multiplying it by a value $c$ such that $c^2 = -2(\log \mathcal{L} + \log S)/M$; this serves to ensure that, at convergence, the $\chi^2$ value equals the number of unconstrained data points of the problem. Similarly, $\alpha$ is then updated such that the probability $\Pr(\mathcal{D}\,|\,\alpha)$ is maximised for the current set of NN parameters $\mathbf{a}$. These procedures are described in detail by Gull & Skilling (1999, Sec. 2.3 & 2.6) and Hobson et al. (1998, Sec. 3.6 & Appendix B).

To obtain the step $\delta\mathbf{a}$ at each iteration, we first note that one may approximate a general function $f$ up to second order in its Taylor expansion by

  $f(\mathbf{a} + \delta\mathbf{a}) \approx f(\mathbf{a}) + \mathbf{g}^{\rm T} \delta\mathbf{a} + \tfrac{1}{2} (\delta\mathbf{a})^{\rm T} \mathbf{B}\, \delta\mathbf{a}$,   (12)

where $\mathbf{g} = \nabla f(\mathbf{a})$ is the gradient and $\mathbf{B} = \nabla\nabla f(\mathbf{a})$ is the Hessian matrix of second derivatives, both evaluated at $\mathbf{a}$. For our purposes, the function $f$ is the log-posterior distribution of the NN parameters, and hence (12) represents a Gaussian approximation to the posterior. The Hessian of the log-posterior is the regularised ('damped') Hessian of the log-likelihood function, where the prior, whose magnitude is set by $\alpha$, provides the regularisation. If we define the Hessian matrix of the log-likelihood as $\mathbf{H}$, then $\mathbf{B} = \mathbf{H} + \alpha \mathbf{I}$, where $\mathbf{I}$ is the identity matrix. The regularisation parameter $\alpha$ can be interpreted as controlling the level of 'conservatism' in the Gaussian approximation to the posterior. In particular, regularisation helps prevent the optimisation becoming trapped in small local maxima by smoothing out the function being explored. It also aids in reducing the region of confidence for the gradient information, which makes it less likely that a step results in a worse set of parameters.

Ideally, we seek a step $\delta\mathbf{a}$ such that $\nabla f(\mathbf{a} + \delta\mathbf{a}) = 0$.
Using the approximation (12), one thus requires

  $\mathbf{B}\, \delta\mathbf{a} = -\mathbf{g}$.   (13)

In the standard Newton's method of optimisation, one simply solves this equation directly for $\delta\mathbf{a}$ to obtain

  $\delta\mathbf{a} = -\mathbf{B}^{-1} \mathbf{g}$.   (14)

In principle, iterating this stepping procedure will eventually bring us to a local maximum of $f$. Moreover, Newton's method has the important property of being scale-invariant, namely its behaviour is unchanged under any linear rescaling of the parameters. Methods without this property often have problems optimising poorly scaled parameters.

There are, however, some major practical difficulties with the standard Newton's method. First, the Hessian $\mathbf{H}$ of the log-likelihood is not guaranteed to be positive semi-definite; thus, even after the addition of the damping term $\alpha\mathbf{I}$ derived from the log-prior, the full Hessian $\mathbf{B}$ of the log-posterior may not be invertible. Second, even if $\mathbf{B}$ is invertible, the inversion is prohibitively expensive if the number of parameters is large, as is the case even for modestly-sized neural networks.

To address the first issue, we replace the Hessian $\mathbf{H}$ with a form of Gauss-Newton approximation $\mathbf{G}$, which is guaranteed to be positive semi-definite and can be defined both for the regression likelihood (8) and the classification likelihood (10) (Schraudolph 2002). In particular, the approximation used differs from the classical Gauss-Newton matrix in that it retains some second-derivative information. Second, to avoid the prohibitive expense of calculating the inverse in (14), we instead solve (13) (with $\mathbf{H}$ replaced by $\mathbf{G}$ in $\mathbf{B}$) for $\delta\mathbf{a}$ iteratively using a conjugate-gradient algorithm, which requires only matrix-vector products $\mathbf{B}\mathbf{v}$ for some vector $\mathbf{v}$.

One can avoid even the computational burden of calculating and storing the Hessian $\mathbf{B}$. In principle, products of the form $\mathbf{B}\mathbf{v}$ can be easily computed using finite differences, at the cost of a single extra gradient evaluation, using the identity

  $\mathbf{B}\mathbf{v} = \lim_{r \to 0} \dfrac{\nabla f(\mathbf{a} + r\mathbf{v}) - \nabla f(\mathbf{a})}{r}$.   (15)

This approach is, however, subject to numerical problems. Therefore, we instead calculate $\mathbf{B}\mathbf{v}$ products using a stable and efficient procedure applicable to NNs (Pearlmutter 1994; Schraudolph 2002). This involves an additional forward and backward pass through the network beyond the initial ones required for a gradient calculation.

The combination of all the above methods makes practical the use of second-order derivative information even for large networks, and significantly improves the rate of convergence of NN training over standard backpropagation methods. It has been noted by Pascanu & Bengio (2013) that this method for quasi-Newton second-order descent is equivalent to the first-order 'natural gradient'.
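The following sketch illustrates how a damped Newton step of the form (13) can be obtained with a conjugate-gradient solver that only ever sees Hessian-vector products. For simplicity it forms the products with the finite-difference identity (15); as noted above, SkyNet instead uses the exact Pearlmutter-style procedure together with a Gauss-Newton approximation to H, so this is an illustration of the idea rather than of the actual implementation.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def newton_step(a, grad_f, alpha, eps=1e-6, cg_iters=50):
    """Approximate damped Newton step: solve (H + alpha*I) da = -g, equation (13),
    by conjugate gradients, using only Hessian-vector products.

    a      : current parameter vector
    grad_f : function returning the gradient of the objective at a point
    alpha  : regularisation (damping) parameter
    """
    g = grad_f(a)
    M = a.size

    def Bv(v):
        # finite-difference Hessian-vector product, equation (15), plus damping
        Hv = (grad_f(a + eps * v) - g) / eps
        return Hv + alpha * v

    B = LinearOperator((M, M), matvec=Bv)
    da, _ = cg(B, -g, maxiter=cg_iters)
    return da
```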
3.5 Convergence

Following each iteration of the optimisation algorithm, the posterior, likelihood, correlation, and error-squared values are calculated both for the training data and for the validation data (which were not used in calculating the steps in the optimisation). The correlation of the network outputs is defined for each output $i$ as

  $\mathrm{Corr}_i(\mathbf{a}) = \dfrac{\sum_{k=1}^{K} (t_i^{(k)} - \bar{t}_i)(y_i^{(k)} - \bar{y}_i)}{\sqrt{\sum_{k=1}^{K} (t_i^{(k)} - \bar{t}_i)^2 \, \sum_{k=1}^{K} (y_i^{(k)} - \bar{y}_i)^2}}$,   (16)

where $\bar{t}_i$ and $\bar{y}_i$ are the means of these output variables over all the training data; the functional dependencies of $y_i^{(k)}$ have been dropped for brevity. The correlation provides a relative measure of how well the predicted outputs match the true ones. In practice, the correlations from each output can be averaged together to give an average correlation for the network's predictions. The average error-squared of the network outputs is defined by

  $\mathrm{ErSq}(\mathbf{a}) = \dfrac{1}{NK} \sum_{k=1}^{K} \sum_{i=1}^{N} \left[ t_i^{(k)} - y_i(\mathbf{x}^{(k)}; \mathbf{a}) \right]^2$,   (17)

and is complementary to the correlation, since it is an absolute measure of accuracy.

As one might expect, as the optimisation proceeds there is a steady increase in the values of the posterior, likelihood, correlation, and negative of the error-squared, evaluated both for the training and validation data. Eventually, however, the algorithm will begin to overfit, resulting in the continued increase of these quantities when evaluated on the training data, but a decrease in them when evaluated on the validation data. This divergence in behaviour is taken as indicating that the algorithm has converged, and the optimisation is terminated. The user may choose which of the four quantities listed above is used to determine convergence, although the default is to use the error-squared, since it does not include the hyperparameters $\boldsymbol{\sigma}$ and $\alpha$ in its calculation and is less prone to problems with zeros than the correlation.

3.6 Optimising network structure

We note that the correlation and error-squared functions discussed above also provide quantitative measures with which to compare the performance of different network architectures, both in terms of the number of hidden nodes and how they are distributed into layers. As network size and complexity are increased, a point will be reached at which minimal or no further gains may be achieved in increasing correlation or reducing error-squared. Therefore, any network architecture that can achieve this peak performance is equally well-suited. In practice, we will wish to find the smallest or simplest network that does so, as this minimises the risk of overfitting and the time required for training.

3.7 Estimating the error on network outputs

After training a network, in particular a regression network, one may want to calculate the accuracy of the network's predicted outputs. A computationally cheap method of estimating this was suggested by MacKay (1995), whereby one adds Gaussian noise to the true outputs of the training data and trains a new network on this noisy data. After performing many realisations, the networks' predictions will average to the predictions in the absence of the added noise. Moreover, the standard deviation of their predictions will provide a good estimate of the accuracy of the original network's predictions. Since one can train the new networks using the original trained network as a starting point, the re-training on noisy data is very fast. Additionally, evaluating the ensemble of predictions to measure the accuracy is not very computationally intensive, as network evaluations are simple to perform and can be done in less than a millisecond. Explicitly, the steps of this method are:

(i) Start with the converged network with parameters $\mathbf{a}^*$, trained on the original data set $\mathcal{D}^* = \{\mathbf{x}^{(k)}, \mathbf{t}^{(k)}\}$. Estimate the noise on the residuals using $\sigma^2 = \sum_k [\mathbf{t}^{(k)} - \mathbf{y}(\mathbf{x}^{(k)}, \mathbf{a}^*)]^2 / K$.
(ii) Define a new data set $\mathcal{D}_1$ by adding Gaussian noise of zero mean and variance $\sigma^2$ to the outputs (targets) in $\mathcal{D}^*$.
(iii) Train a NN on $\mathcal{D}_1$ using the parameters $\mathbf{a}^*$ as a starting point.
Training should converge rapidly, as the new data set is only slightly different from the original. Denote the new network parameters by $\mathbf{a}_1$.
(iv) Repeat steps (ii) and (iii) multiple times to obtain an ensemble of networks with parameters $\mathbf{a}_j$.
(v) Use each of the networks $\mathbf{a}_j$ to make a prediction for a given set of inputs. The accuracy of the original network's outputs can be estimated as the standard deviation of the outputs of these networks.

In addition to these steps, SkyNet includes the option for the user to add random Gaussian offsets to the parameters $\mathbf{a}^*$ before training is performed on the new data set (step iii). This offset will aid the training in moving the optimisation away from a potential local maximum in the posterior distribution of the network parameters, but the size of the offset must be chosen for each problem; for this, we recommend using a value $s \lesssim 1/\alpha$. We thus add noise to both the training data and the saved network parameters before training a new network whose posterior maximum will be near to, but not exactly the same as, the original network's.

This method may be compared with that described in Graff et al. (2012) for determining the accuracy of the NN predictions for the likelihood used in BAMBI. Although the method described here requires the overhead time of training additional networks, this is small compared with the possible speed gains. Indeed, the new method's accuracy computations require less than a millisecond, as opposed to tenths of a second for the method used previously. Consequently, the faster method described here is now incorporated into our new public release version of BAMBI, leading to around two orders of magnitude increase in speed over that reported in Graff et al. (2012).

4 APPLICATIONS: TOY EXAMPLES

4.1 Regression

As our first toy example, we consider a simple regression problem. We generate 200 points randomly in the range $x \in [-5\pi, 5\pi]$, for which we evaluate the ramped sinc function,

  $y(x) = \dfrac{\sin(x)}{x} + 0.04\,x$,   (18)

and then add Gaussian noise with zero mean and a standard deviation of 0.05. The addition of noise makes the regression problem more difficult and prevents any exact solution being possible.

To perform the regression, the 200 data items $(x, y)$ are divided randomly into 150 items for training and 50 for validation. For this simple problem, we use a network with a single hidden layer containing $N$ nodes (we denote the full network by 1 + N + 1), and we whiten the input and output data using (4). The network was not pre-trained. The optimal value for $N$ is determined by comparing the correlation and error-squared for networks with different numbers of hidden nodes. These results are presented in Fig. 4, which shows that the correlation increases and the error-squared decreases until we reach $N \approx 6-7$ hidden nodes, after which both measures level off. Thus, adding additional nodes beyond this number does not improve the accuracy of the network. For the network with $N = 7$ hidden nodes, we obtain a correlation of greater than 99.3 per cent; a comparison of the true and predicted outputs in this case is shown in Figure 5.

Figure 4. The correlation and error-squared values as a function of the number of hidden nodes N obtained from converged NNs with architecture 1 + N + 1 for the ramped sinc function regression problem.
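For reference, the toy data set and the two convergence measures of equations (16) and (17) can be reproduced with a few lines of NumPy; the random seed and variable names are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Equation (18): 200 points in [-5*pi, 5*pi] with Gaussian noise (sigma = 0.05)
# added to y, then a random 150/50 split into training and validation sets.
x = rng.uniform(-5 * np.pi, 5 * np.pi, 200)
y = np.sin(x) / x + 0.04 * x + rng.normal(0.0, 0.05, x.size)

idx = rng.permutation(x.size)
train, valid = idx[:150], idx[150:]

def correlation(t, y_pred):
    """Correlation of predictions with targets for one output, equation (16)."""
    return np.corrcoef(t, y_pred)[0, 1]

def error_squared(t, y_pred):
    """Average error-squared of equation (17) for a single-output network."""
    return np.mean((t - y_pred) ** 2)
```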
4.2 Classification

We now consider a toy classification problem based on the three-way classification data set created by Radford Neal for testing his own algorithms for NN training (http://www.cs.toronto.edu/~radford/fbm.2004-11-10.doc/Ex-netgp-c.html). In this data set, each of four variables $x_1$, $x_2$, $x_3$ and $x_4$ is drawn 1000 times from the standard uniform distribution U[0, 1]. If the two-dimensional Euclidean distance between $(x_1, x_2)$ and (0.4, 0.5) is less than 0.35, the point is placed in class 0; otherwise, if $0.8 x_1 + 1.8 x_2 < 0.6$, the class is set to 1; and if neither of these conditions is true, the class is set to 2. Note that the values of $x_3$ and $x_4$ play no part in the classification. Gaussian noise with zero mean and standard deviation 0.1 is then added to the input values.

Figure 5. Comparisons of the true and predicted values obtained from the converged NN with architecture 1 + 7 + 1 on the training data (left) and validation data (right) for the ramped sinc function regression problem.

Approximately 75 per cent of the data was used for training and the remaining 25 per cent for validation. We again use a network with a single hidden layer containing $N$ nodes, and we whiten the input and output data using (4). The network was not pre-trained. The full network thus has the architecture 4 + N + 3, where the three output nodes give the probabilities (which sum to unity) that the input data belong to class 0, 1, or 2, respectively. The final class assigned is that having the largest probability. The optimal value for $N$ is again determined by comparing the correlation and error-squared for networks with different numbers of hidden nodes. These results are shown in Fig. 6, from which we see that the correlation increases and the error-squared decreases until we reach $N \approx 4$ hidden nodes, after which both measures level off.

Figure 6. The correlation and error-squared of the converged NNs for the classification problem as a function of nodes in the single hidden layer.

For the network with $N = 8$ hidden nodes, a total of 87.8 per cent of training data points and 85.4 per cent of validation points were correctly classified. A summary of the classification results for this network is given in Table 1. These results compare well with Neal's own original results and are similar to classifications based on applying the original criteria directly to data points that have noise added. Figure 7 shows the data set and the true classifications.

Table 1. Classification results for the converged NN with architecture 4 + 8 + 3 for the Neal data set.

                   True class   Number   Predicted class (%)
                                            0      1      2
  Training data        0          282     84.0   4.96   11.0
                       1           93     14.0   82.8    3.2
                       2          386      7.0    1.3   91.7
  Validation data      0           99     75.7    6.1   18.2
                       1           19     21.1   78.9    0.0
                       2          121      5.0    0.8   94.2

Figure 7. The full Neal data set (training and validation), showing the NN classifications (colour-coded) and the true criteria (solid and dashed lines) determining the classes of the data in the absence of noise.
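The Neal data set as described above can be regenerated with the following sketch; the random seed is arbitrary, and the one-hot targets are simply the form expected for a three-class network with softmaxed outputs.

```python
import numpy as np

rng = np.random.default_rng(7)

# Four uniform inputs, class labels set by the two criteria on (x1, x2),
# and Gaussian noise of standard deviation 0.1 added to the inputs afterwards.
X = rng.uniform(0.0, 1.0, (1000, 4))

dist = np.hypot(X[:, 0] - 0.4, X[:, 1] - 0.5)
labels = np.full(1000, 2)                        # default: class 2
labels[0.8 * X[:, 0] + 1.8 * X[:, 1] < 0.6] = 1  # second criterion: class 1
labels[dist < 0.35] = 0                          # first criterion takes precedence

X_noisy = X + rng.normal(0.0, 0.1, X.shape)

# One-hot targets for the 4 + N + 3 classification network
targets = np.eye(3)[labels]
```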
4.3 Dimensionality reduction using autoencoders

Dimensionality reduction is a very common task in astronomy, which is usually performed using principal component analysis (PCA, Kendall 1957) and its variants. In PCA, the eigenvalues and eigenvectors of the correlation matrix of the centred data (from which the mean has been subtracted) are found. The eigenvector directions define a new set of variables that are mutually-orthogonal linear combinations of the original variables describing each data item. Dimensionality reduction is then achieved by keeping only those combinations corresponding to (a certain number of) the largest eigenvalues. PCA is limited, however, by its use of orthogonal projections, and this has led to more recent interest in independent component analysis (ICA), which still constructs linear combinations of the original variables, but relaxes the condition of orthogonality (see, e.g., Hyvärinen & Oja 2000). More specifically, ICA finds a set of directions such that the projections of the data onto these directions have maximum statistical independence, either by minimisation of mutual information or maximisation of non-Gaussianity.

As discussed in Section 2.2, autoencoders provide a natural generalisation of PCA and ICA, and constitute an intuitive approach to non-linear dimensionality reduction that reduces to PCA in the special case of a single hidden layer and linear activation functions.

4.3.1 Multivariate Gaussian data

To provide a quick comparison of autoencoders and traditional PCA, we first consider two examples in which the data points are drawn from a single multivariate Gaussian distribution, as assumed in PCA, using a theoretical covariance matrix with given eigenvalues and eigenvectors. As a basic check, in both cases we perform the main PCA step of calculating the eigenvalues and eigenvectors of the sample covariance matrix of the resulting data points, and we find that they match those assumed very closely.

In our first example, we draw the data points $(x_1, x_2)$ from a two-dimensional correlated Gaussian. For this simple case, we first train an autoencoder with a single hidden layer. Moreover, to effect a dimensionality reduction, we place just one node in the hidden layer, so the full network architecture is 2 + 1 + 2. Whitening of the input and output data using (5) was performed. Pre-training was also used since, even in such a very small autoencoder network (with a total of 7 network parameters), it is easy for the optimiser to fall into the large local maximum where each output is the average of the inputs.

Fig. 8(a) shows the original data and the curve traced out in the data space when one performs a decoding as the value of the feature variable $z_1$ in the single central-layer node is varied between the limits obtained when encoding the data. As one might expect, this curve approximates the eigenvector with the larger eigenvalue of the covariance matrix of the data. The curve is not exactly a straight line because of the non-linearity of the activation function from the hidden layer to the output layer, and is influenced by the particular realisation of the data analysed. It should be noted that dimensionality reduction is performed conversely, by (non-linearly) encoding each data point $(x_1, x_2)$ to obtain the corresponding feature value $z_1$ in the central-layer node, rather than performing any PCA-like (linear) projection in the data space. The resulting error-squared and correlation for the autoencoder are 0.476 and 90.5 per cent, respectively.

To pursue the comparison with PCA further, we also train in a similar manner an autoencoder with two nodes in a single hidden layer, so the full network architecture is 2 + 2 + 2, although this is clearly no longer relevant for dimensionality reduction.
Fig. 8(b) again shows the original data, together with the two curves traced out in the data space when one performs a decoding as one varies (between the limits obtained when encoding the data) the feature vectors $(z_1, 0)$ and $(0, z_2)$, respectively, in the central-layer nodes. We see that, in the first case, one recovers a curve very similar to that shown in Fig. 8(a), whereas, in the second case, the curve approximates the eigenvector of the data covariance matrix having the smaller eigenvalue. As before, neither curve is exactly a straight line because of the non-linearity of the activation function. Moreover, the curves do not intersect at right angles, as would be the case for principal component directions. The resulting error-squared and correlation for the autoencoder are 0.022 and 99.8 per cent, respectively. We note that the latter is very close to 100 per cent, as one would expect for this two-dimensional data set.

Figure 8. Original data points (red) drawn from a two-dimensional correlated Gaussian distribution, together with (a) the curve traced out by performing a decoding as one varies the single feature value z1 in the central layer of a trained autoencoder with architecture 2 + 1 + 2; and (b) two curves traced by performing a decoding as one varies the feature vectors (z1, 0) and (0, z2), respectively, in the central layer of a trained autoencoder with architecture 2 + 2 + 2. In both cases, the feature values are varied within the limits obtained when encoding the data.

In our second example, we demonstrate the ability to determine the optimal number of nodes in the single hidden layer (and hence the optimal number of feature values) for an autoencoder, when redundant information is provided. To accomplish this, we draw data points from a 3-dimensional correlated Gaussian, but then 'rotate' these points into 5-dimensional space. Thus, each data point has the form $(x_1, x_2, \ldots, x_5)$, but all the points lie on a 3-dimensional hyperplane in this space. In a similar manner to above, we train autoencoders with architecture 5 + N + 5. As $N$ is varied, the error-squared and correlation of the resulting networks are given in Table 2. As expected, once three hidden nodes are used, the correlation is very close to 100 per cent, and adding more does not improve the results.

Table 2. The error-squared and correlation for autoencoder networks with architecture 5 + N + 5 applied to data points drawn from a 3-dimensional correlated Gaussian distribution that are then 'rotated' into a 5-dimensional space.

  N_hid   Error-squared   Correlation (%)
    1       0.00613          79.6
    2       0.00127          96.0
    3       4.87 x 10^-5     99.86
    4       4.87 x 10^-5     99.86
    5       4.87 x 10^-5     99.86

We note that, if desired, one can 'rank' the feature variables according to the amount by which they decrease the error-squared or increase the correlation, which is analogous to ranking eigenvectors according to their eigenvalues in PCA.

4.3.2 Data distributed on a ring

We now consider another two-dimensional example, but one for which the data are drawn from a distribution very different to the single multivariate Gaussian assumed by PCA. In particular, the data $(x_1, x_2)$ are distributed about a partial ring centred on (0.5, 0.5) with radius 0.5, to produce a long curving degeneracy in the data space. The data are generated according to

  $x_1 = 0.5 + (0.5 - n)\cos\theta$,   (19a)
  $x_2 = 0.5 + (0.5 - n)\sin\theta$,   (19b)
where $\theta$ is drawn uniformly in the range $[0.1\pi, 1.9\pi]$ and $n$ is drawn from a Gaussian distribution with zero mean and standard deviation of 0.1; this example was originally presented in Serra-Ricart et al. (1993). The data are plotted as the light blue points in Figure 9 (top).

Although the noiseless data are fully determined by the single parameter $\theta$, it is clear that a linear dimensionality reduction method, such as PCA, would be unable to represent this data set accurately in a single component. Indeed, the dominant principal component would lie along a straight, horizontal (symmetry) line passing through the point (0.5, 0.5), as is easily verified in practice. Projections onto this direction clearly do not distinguish between data points having the same $x_1$-coordinate, but lying on opposite sides of this symmetry line. As we will now show, however, it is possible to represent this data set well using just a single variable, by taking advantage of the non-linearity of an autoencoder.

In this slightly more challenging example, we train an autoencoder with three hidden layers, again with a single node in the central layer to perform a dimensionality reduction. Thus, the full network architecture is 2 + N + 1 + N + 2. Whitening of the input and output data was applied using (5), and the network was pre-trained. The optimal value for $N$ is determined by comparing the correlation and error-squared for networks with different numbers of hidden nodes. These results are shown in Fig. 10, which shows that optimal performance is reached for $N \gtrsim 10$, beyond which no significant improvement results from adding further hidden nodes.

The results obtained for the autoencoder with architecture 2 + 13 + 1 + 13 + 2 are shown in Figure 9. The top panel shows the curve (in black) traced out by performing a decoding as one varies (between the limits obtained when encoding the data) the single feature value $z_1$ in the central layer; this clearly follows the ring structure very closely. The curve traced out when $z_1$ is allowed to vary below (above) the range encountered in training is shown in red (green). One sees that each of these curves extends a short distance into the gap in the ring, with one curve extending at either end of the gap. Conversely, the bottom panel shows the encoded feature value obtained as compared to the true angle $\theta$ of the input data. We see that, as expected, there is a strong and monotonic relationship between these two variables.

Figure 9. (Top) Original data points (light blue) and the curve (black) traced out by performing a decoding as one varies (between the limits obtained when encoding the data) the single feature value z1 in the central layer of a trained autoencoder with architecture 2 + 13 + 1 + 13 + 2. The curve traced out when z1 is allowed to vary below (above) the range encountered in training is shown in red (green). (Bottom) The true angle θ of the training data points versus their encoded feature values.

Figure 10. The correlation and error-squared values as a function of the number of hidden nodes N obtained from converged autoencoders with architecture 2 + N + 1 + N + 2 for the data-on-a-ring problem.
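A sketch of generating the ring data of equation (19) is given below; the sample size K is an illustrative choice, since the number of points used is not restated here.

```python
import numpy as np

rng = np.random.default_rng(3)

# Equation (19): theta uniform in [0.1*pi, 1.9*pi],
# n Gaussian with zero mean and standard deviation 0.1.
K = 1000                         # illustrative sample size
theta = rng.uniform(0.1 * np.pi, 1.9 * np.pi, K)
n = rng.normal(0.0, 0.1, K)
x1 = 0.5 + (0.5 - n) * np.cos(theta)
x2 = 0.5 + (0.5 - n) * np.sin(theta)

# Inputs (and identical targets) for the 2 + N + 1 + N + 2 autoencoder
data = np.column_stack([x1, x2])
```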
4.3.3 Data distributed in multiple Gaussian modes

In this example, we again consider a two-dimensional case in which the data are drawn from a distribution very different from a single multivariate Gaussian but, rather than simulating a long, curving degeneracy, we focus here on a distribution possessing multiple modes. In particular, the data are generated from the sum of four equal Gaussian modes, each having a standard deviation of 0.1 in both the x1 and x2 directions, with means of (0.25, 0.25), (0.25, 0.75), (0.75, 0.25) and (0.75, 0.75), respectively; this example was also originally presented in Serra-Ricart et al. (1993). The data are plotted as the light blue points in Figure 11 (top).

In this case it is not intuitively obvious to what extent the data can be represented using a single variable. It is once again clear, however, that a linear dimensionality-reduction method, such as PCA, would be unable to represent this data set accurately in a single component. Indeed, in this case, the two principal directions lie along diagonal (symmetry) lines at ±45 degrees passing through the point (0.5, 0.5) and (in theory) have equal eigenvalues, so either direction could be used to perform the dimensionality reduction. Projection onto the line at +45 degrees (say) will clearly not distinguish between any two data points that lie on any given line perpendicular to that direction, thereby conflating data points in modes 1 and 4 in the figure. Thus, the resulting histogram of projected values for the data points will contain only three (broad) peaks (see Serra-Ricart et al. 1993). As we now show, however, it is possible to distinguish all four modes using just a single variable, if one again makes use of the non-linear nature of autoencoders.

We again train an autoencoder with architecture 2 + N + 1 + N + 2, using whitening of the input with (5), and pre-training. The optimal value of N is determined by comparing the correlation and error-squared for networks with different numbers of hidden nodes. Investigating values of N between 3 and 30, we find that N = 30 performed best by a small margin. The results from this network are shown in Figure 11. The top panel shows the curve (in black) traced out by performing a decoding as one varies (between the limits obtained when encoding the data) the single feature value z1 in the central layer of the trained autoencoder; this curve traces out a path through the centre and outskirts of all four modes, each of which corresponds to a distinct range of feature values. This is illustrated in the bottom panel, which shows the histogram of the encoded feature values obtained from the data set. The histogram contains four clear peaks, each corresponding to one of the modes in the original data distribution. By setting appropriate threshold values on this feature variable, we can classify the points as belonging to each of the modes with 97.83 per cent accuracy (classifying based on the raw x1 and x2 values yields only a slightly better accuracy of 98.85 per cent).

Figure 11. (Top) Original data points (light blue) and the curve (black) traced out by performing a decoding as one varies (between the limits obtained when encoding the data) the single feature value z1 in the central layer of a trained autoencoder with architecture 2 + 30 + 1 + 30 + 2. (Bottom) Histogram of the encoded feature values obtained from the input data; four separate peaks are visible, corresponding to the four Gaussian modes as labelled.
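To make the final thresholding step concrete, the sketch below (ours, not part of SkyNet) generates the four-mode data and shows how each point would be assigned to a mode by cutting the single encoded feature value at three thresholds. The function encode and the thresholds t1, t2, t3 are placeholders for the trained encoder half of the autoencoder and for cuts read off between the histogram peaks.

```python
import numpy as np

rng = np.random.default_rng(1)
means = np.array([[0.25, 0.25], [0.25, 0.75], [0.75, 0.25], [0.75, 0.75]])
labels = rng.integers(0, 4, 4000)                    # true mode of each point
data = means[labels] + rng.normal(0.0, 0.1, (4000, 2))

def classify_from_feature(z1, thresholds):
    """Assign each point to one of four classes by cutting the single
    encoded feature value z1 at three thresholds (four contiguous ranges,
    one per peak of the histogram in Figure 11, bottom)."""
    return np.digitize(z1, thresholds)

# Hypothetical usage, with 'encode' standing in for the trained
# 2 + 30 + 1 + 30 + 2 autoencoder's encoding half:
#   z1 = encode(data)                                # shape (4000,)
#   pred = classify_from_feature(z1, [t1, t2, t3])
#   # after matching bin indices to mode labels:
#   accuracy = np.mean(pred == labels)               # ~0.978 quoted above
```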
4.4 MNIST handwriting recognition

The MNIST database of handwritten digits is a subset of a larger collection available from NIST (the National Institute of Standards and Technology). It consists of 60,000 training and 10,000 validation images of handwritten digits. Each digit has been size-normalised and centred in a 28 × 28 pixel greyscale image. The images are publicly available online (http://yann.lecun.com/exdb/mnist/), along with more information on the generation of the data set and results from previous analyses by other researchers. This data set has become a standard for the testing of machine learning algorithms. Some sample digits are shown in Figure 12. The learning task is to identify correctly the digit written in each image. Although this may be an easy task for a human brain, it is quite a challenging machine learning application.

Figure 12. Sample handwritten digits from the MNIST database.

4.4.1 Direct classification

Several classification networks of varying complexity were trained on this problem. In all cases, pre-training was used for any hidden layers and all inputs were whitened using the transformation (5). Each network has 28 × 28 = 784 inputs (corresponding to the greyscale image size) and 10 outputs (one for each digit). The class (digit) assigned to each image was that having the highest output probability.

The results obtained are summarised in Table 3, where the error rates, defined as the fraction of images incorrectly classified, are those calculated on the set of validation images. We note that some of these networks are large and deep, but are nonetheless well trained using SkyNet. These results may be compared with those referenced on the MNIST website, which yield error rates as low as 0.35 per cent (Ciresan et al. 2010), but more typically between 1 and 5 per cent (LeCun et al. 1998).

Table 3. Error rates for classification networks with different architectures, trained to identify handwritten digits in greyscale images from the MNIST database. All networks have 784 inputs and 10 outputs.

Hidden layers         Error rate (%)
0                     8.08
100                   4.15
250                   3.00
1000                  2.38
300 + 30              2.83
500 + 50              2.62
1000 + 300 + 30       2.31
500 + 500 + 2000      1.76

4.4.2 Dimensionality reduction and classification

A dimensionality reduction of the MNIST data was also performed. In particular, two large and deep autoencoder networks were trained, one with hidden layers 1000 + 300 + 30 + 300 + 1000 (called AE-30) and the other with hidden layers 1000 + 500 + 50 + 500 + 1000 (called AE-50); both networks have 784 inputs and outputs, corresponding to the image size. Thus, the dimensionality reduction to 30 or 50 feature variables, respectively, represents a significant data compression. The AE-30 network was able to obtain an average total error-squared of only 4.64, and AE-50 obtained an average total error-squared of 3.29. These values are comparable to those obtained by Hinton & Salakhutdinov (2006). Thus, despite reducing the dimensionality of the input data from 784 pixels to 30 or 50 non-linear feature variables, these reduced basis sets retain enough information about the original images to reproduce them to within small errors. This also demonstrates that SkyNet is capable of training large and deep autoencoder networks.
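For reference, the sketch below shows one plausible reading of the 'average total error-squared' quoted for AE-30 and AE-50, namely the sum of squared reconstruction errors over the 784 outputs of each image, averaged over all images. This definition (and whether whitened units are used) is our assumption rather than a statement of the exact normalisation used here, and the arrays are random stand-ins for the images and network outputs.

```python
import numpy as np

def average_total_error_squared(originals, reconstructions):
    """Sum of squared reconstruction errors over the 784 outputs of each
    image, averaged over the set of images (our assumed definition)."""
    per_image = np.sum((reconstructions - originals) ** 2, axis=1)
    return per_image.mean()

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, (100, 784))           # stand-in for (whitened) images
x_rec = x + rng.normal(0.0, 0.05, x.shape)      # stand-in for autoencoder output
print(average_total_error_squared(x, x_rec))    # ~784 * 0.05**2, i.e. about 2 here
```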
As mentioned previously, dimensionality reduction is sometimes used as a prelude to a supervised-learning task such as classification, since the latter can sometimes be performed just as accurately (or sometimes more so) in the reduced space as in the original data space. To illustrate this approach, using each of AE-30 and AE-50, all the training images were passed through the autoencoder to obtain the (30 or 50) encoded feature values for each image. These feature values (rather than the original images) were then used to train a classification network (with just 30 or 50 input nodes, respectively, and 10 output nodes) to identify the handwritten digits. The resulting error rates are listed in Table 4 for networks with different numbers of hidden layers and nodes. In particular, comparing with Table 3, we see that the resulting classifications are nearly as accurate as the best-performing network trained on the full images, despite reducing the dimensionality of the input data from 784 pixels to 30 or 50 non-linear feature variables and reducing the number of classification-network parameters by several orders of magnitude.

Table 4. Error rates for classification networks with different architectures, trained on autoencoder feature values to identify handwritten digits from the MNIST database. For AE-30 (AE-50), all the classification networks have 30 (50) inputs and 10 outputs.

Autoencoder    Classification hidden layers    Error rate (%)
AE-30          0                               9.57
               10                              6.39
               30                              3.03
               100 + 50 + 10                   2.55
AE-50          0                               8.68
               10                              6.61
               50                              2.65
               100 + 50 + 10                   2.71

4.4.3 Dimensionality reduction and clustering

Finally, a massive dimensionality reduction to just two feature variables was performed on the images by training a very large and deep autoencoder, with hidden layers 1000 + 500 + 250 + 2 + 250 + 500 + 1000 (and 784 inputs and outputs). As expected, this network is significantly less able to reproduce the original images, having an average total error-squared of 31.0, but it has the advantage that one can plot the two feature values obtained for each of the images to provide a simple illustration of clustering.

Such a scatterplot is shown in Fig. 13 for the 10,000 validation images, where the points are colour-coded according to the true digit contained in each image; this figure may be compared with figure 3B of Hinton & Salakhutdinov (2006). We see that there is some significant overlap between digits with similar shapes, but that some digits do occupy distinct regions of the parameter space (particularly 1 in the top right, some 0s in the bottom right, and many examples of 2 in the middle right).

Figure 13. Scatterplot of the two feature variables for the 10,000 validation images from the MNIST database, obtained using the encoding half 784 + 1000 + 500 + 250 + 2 of a symmetric autoencoder.
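A plot such as Figure 13 takes only a few lines to produce; the Python sketch below is our own illustration, in which encode2, images and digits are placeholders for the trained encoding half of the autoencoder, the 10,000 validation images and their true labels, none of which are reproduced here.

```python
import matplotlib.pyplot as plt

def plot_feature_plane(z, digits):
    """Scatterplot of the two encoded feature values, colour-coded by the
    true digit, in the spirit of Figure 13."""
    fig, ax = plt.subplots()
    sc = ax.scatter(z[:, 0], z[:, 1], c=digits, cmap="tab10", s=2)
    fig.colorbar(sc, ax=ax, label="true digit")
    ax.set_xlabel("feature 1")
    ax.set_ylabel("feature 2")
    return fig

# Hypothetical usage:
#   z = encode2(images)              # (10000, 2) encoded feature values
#   plot_feature_plane(z, digits).savefig("mnist_features.png")
```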
4.5 Comparison with FANN library

In this section, we perform a simple comparison between SkyNet and an alternative algorithm for training a NN. The case we use is the simple sinc problem from Section 4.1, and we compare against the FANN library. This is a NN library that has been developed over several years and thus has more features and interfaces than have been implemented thus far in SkyNet. However, its training is performed via standard backpropagation techniques, which are first-order in nature.

Training on the exact same sinc data with one hidden layer containing 7 nodes, we found that FANN's optimal predictions and run time were equivalent to those of SkyNet. However, FANN performed approximately an order of magnitude more steps in parameter space to reach this solution (meaning that an individual step was correspondingly faster) and has no feature to prevent overfitting had a larger NN been trained (it uses only a target error and a maximum number of steps). Furthermore, while running SkyNet multiple times consistently produces similar run times and results, FANN run times varied greatly and the runs often did not converge to the same result. Lastly, FANN requires the user to write their own 'main' function in C (or another language) to set up the network to be trained, read in the data, perform the training, and save the network. By comparison, SkyNet seeks to make these steps easier for the user by asking only for an input settings file and formatted data. Additional functionality can be implemented in future releases, while a useful and simple tool is provided now.

5 APPLICATIONS: ASTROPHYSICAL EXAMPLES

5.1 Regression: Mapping Dark Matter challenge

The Mapping Dark Matter (MDM) Challenge was presented on www.kaggle.com as a simplified version of the GREAT10 Challenge (Kitching et al. 2011). In this problem, each data item consists of two 48 × 48 greyscale images, of a galaxy and a star, respectively. Each pixel value in each image is drawn from a Poisson distribution with mean equal to the (unknown) underlying intensity value in that pixel. Moreover, both images have been convolved with the same, but unknown, point spread function. The learning task is, for each pair of images, to predict the ellipticities e1 and e2 (defined below) of the underlying true galaxy image, and thus constitutes a regression problem. The training data set contains 40,000 image pairs and the challenge data set contains 60,000 image pairs. A sample galaxy and star image pair is shown in Fig. 14. During the challenge, the solutions for the validation data set were kept secret and participating teams submitted their predictions. Further details about the challenge and descriptions of the top results can be found in Kitching et al. (2012).

Figure 14. Example image pair of (a) galaxy and (b) star from the Mapping Dark Matter Challenge; each image contains 48 × 48 greyscale pixels.

Figure 15. Definition of the underlying galaxy image ellipse parameters used in the Mapping Dark Matter Challenge. Image from http://www.kaggle.com/c/mdm/.

5.1.1 Galaxy image model

The true underlying galaxy image is assumed to be elliptical with semi-major axis a, semi-minor axis b, and position angle θ, as shown in Fig. 15. The ellipticities e1 and e2 are related to these parameters by

e1 = [(a − b)/(a + b)] cos(2θ) ,    (20a)
e2 = [(a − b)/(a + b)] sin(2θ) ,    (20b)

and may vary in the range [−1, 1]. Further details about the data set can be found at the challenge's webpage (http://www.kaggle.com/c/mdm/). This also gives the unweighted quadrupole moments (UWQM) formula for calculating the ellipticities of an image. As the competition organisers note, however, this formula will not provide very good estimates of the true galaxy ellipticities, since it does not account for the point spread function or noise.
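Equation (20) is straightforward to evaluate; the short Python sketch below (ours, for illustration only) converts the ellipse parameters (a, b, θ) into the ellipticity components (e1, e2).

```python
import numpy as np

def ellipticities(a, b, theta):
    """Ellipticity components of eq. (20) for an ellipse with semi-major
    axis a, semi-minor axis b and position angle theta (radians); both
    components lie in the range [-1, 1]."""
    e = (a - b) / (a + b)
    return e * np.cos(2.0 * theta), e * np.sin(2.0 * theta)

print(ellipticities(2.0, 2.0, 0.3))          # circular source: (0.0, 0.0)
print(ellipticities(2.0, 1.0, np.pi / 4.0))  # (~0, ~0.333)
```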
5.1.2 Results

We use SkyNet to train several regression networks, each of which takes the galaxy and star images as inputs and produces estimates of the true galaxy ellipticities as outputs. Following the approach used in the original challenge, the quality of a network's predictions is measured by the root mean squared error (RMSE) of its predicted ellipticities over the challenge data set of 60,000 pairs of images. Clearly, better predictions result in lower values of the RMSE.

The size of the data set meant that training large networks was very computationally expensive. Therefore, for this demonstration, we train only relatively small networks, but use three different data sets: (i) the full galaxy and star images; (ii) the full galaxy image and a centrally cropped star image; and (iii) the full galaxy image alone. Of the training data provided, consisting of 40,000 image pairs, 75 per cent were used for training the networks (without pre-training and whitening using transformation (5)) and the remaining 25 per cent were used for validation. The RMSE values for trained networks of different architectures, evaluated on the challenge data set, are given in Table 5.

Table 5. Root mean square errors on ellipticity predictions for networks with different architectures, evaluated on the 60,000 image pairs of the Mapping Dark Matter Challenge. All networks have two outputs: the galaxy ellipticities e1 and e2.

Data set                                  Hidden layers    RMSE
Full galaxy and star images               0                0.0224146
(48 × 48 × 2 = 4608 inputs)               2                0.0186944
                                          5                0.0184237
                                          10               0.0182661
Full galaxy and cropped star images       0                0.0175578
(48 × 48 + 24 × 24 = 2880 inputs)         2                0.0176236
                                          5                0.0175945
                                          10               0.0174997
                                          50               0.0172977
                                          50 + 10          0.0171719
Full galaxy image only                    0                0.0234740
(48 × 48 = 2304 inputs)                   2                0.0234669
                                          5                0.0236508
                                          10               0.0226440

The RMSE values obtained even for the naive first approach of using the full data set (i) as inputs are quite good; for comparison, the standard software package SExtractor (Bertin & Arnouts 1996) produced an RMSE score of 0.0862191 on this test data, and the UWQM method scores 0.1855600. One sees from Table 5 that increasing the number of hidden nodes beyond two does improve the network accuracy, but only very slowly.

The NN results can, however, be improved more easily simply by reducing the number of inputs without affecting the information content, for example by cropping the star images to the central 24 × 24 pixels to yield data set (ii) (see the short sketch below). This simple change increases the accuracy of the ellipticity predictions, thereby lowering the RMSE. Increasing the number of nodes in a single hidden layer, or adding an extra hidden layer, does yield improving predictions, although the rate of improvement is quite gradual. Nonetheless, this indicates that more complex networks could further improve the accuracy of the ellipticity predictions. The best result obtained for the networks investigated, with an RMSE of 0.0171719, compares well with the competition winners, who achieved an RMSE of 0.0150907 (Kitching et al. 2012) using a mixture of methods that included NNs. We note that our score, produced using an immediate 'out-of-the-box' application of SkyNet that involves no specialised data processing, would have placed us in 32nd place out of the 66 teams that entered the challenge.
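For concreteness, the sketch below shows one way in which data set (ii) might be assembled: the full 48 × 48 galaxy image concatenated with the central 24 × 24 crop of its star image, giving the 2880 inputs quoted in Table 5. The function name and the pixel ordering of the concatenation are our own choices, not necessarily those used for the networks described here.

```python
import numpy as np

def build_input(galaxy, star, crop=24):
    """Input vector for data set (ii): the full 48x48 galaxy image plus
    the central crop x crop pixels of the 48x48 star image."""
    size = star.shape[0]
    lo, hi = (size - crop) // 2, (size + crop) // 2
    star_crop = star[lo:hi, lo:hi]
    return np.concatenate((galaxy.ravel(), star_crop.ravel()))

galaxy = np.zeros((48, 48))
star = np.zeros((48, 48))
print(build_input(galaxy, star).size)   # 48*48 + 24*24 = 2880
```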
We see from Table 5, however, that reducing the number of inputs further, by removing the star images altogether, leads to a significant increase in the RMSE. This is to be expected, since the absence of the star images does not allow the NN to infer the point spread function sufficiently well to predict the underlying galaxy ellipticities accurately.

Finally, we note that all of our results could potentially be improved further by fitting profiles to the images and using the parameters of these fits for training, which would reduce the number of inputs by about two orders of magnitude. Alternatively, one could train an autoencoder to dimensionally reduce each image into a set of feature variables; this would again vastly reduce the number of network inputs and also potentially alleviate the effect of noise in the images. Such investigations are, however, postponed to a future publication.

5.2 Classification: identifying gamma-ray bursters

Long gamma-ray bursts (GRBs) are almost all indicators of core-collapse supernovae from the deaths of massive stars. The ability to determine the intrinsic rate of these events as a function of redshift is essential for studying numerous aspects of stellar evolution and cosmology. The Swift space telescope is a multi-wavelength detector that is currently observing hundreds of GRBs (Gehrels et al. 2004). However, Swift uses a complicated combination of over 500 triggering criteria for identifying GRBs, which makes it difficult to infer the intrinsic GRB rate. Indeed, most studies approximate the Swift trigger algorithm by a single detection threshold, which can lead to biases in inferring the intrinsic GRB rate as a function of redshift.

To investigate this issue further, a recent study by Lien et al. (2012) performed a Monte Carlo analysis that generated a mock sample of GRBs, using the GRB rate and luminosity function of Wanderman & Piran (2010), and processed them through an entire simulated Swift detection pipeline, applying the full set of Swift trigger criteria, to determine which GRBs would be detected. It was found that the resulting measured GRB rate as a function of redshift followed very closely that of the true Swift GRB set described in Fynbo et al. (2009); this finding is consistent with both the mock GRB sample and the simulated trigger pipeline being good approximations to reality. This analysis was, however, quite computationally expensive, since determining whether each GRB is detected by Swift requires over a minute on a single CPU.

5.2.1 Form of the classification problem

Our goal here is to replace the simulated Swift trigger pipeline with a classification NN, which (as we will see) can determine in just a few microseconds whether a given GRB is detected. To this end, we use as training data a pre-computed mock sample of 10,000 GRBs from Lien et al. (2012). In particular, we divide this sample randomly into ∼4000 for training, ∼1000 for validation, and the final ∼5000 as a blind test set on which to perform our final evaluations. For each GRB we use 13 inputs: the GRB total luminosity, redshift, and energy peak, together with the arrival bin at the Swift detector, the bin size of the light curve in the emitting frame, the α and β parameters of the GRB's Band-function spectrum, the background intensity in four different energy ranges (15–25 keV, 15–50 keV, 25–100 keV, and 50–350 keV), the angle of arrival at the detector, and the total GRB flux. The two softmaxed outputs correspond to probabilities (which sum to unity) for class 0, that the GRB is not detected, and for class 1, that the GRB is detected.
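The softmax transformation referred to here is the standard one; the sketch below (ours) shows how a network's two raw output values would be mapped to the pair of class probabilities that sum to unity.

```python
import numpy as np

def softmax(raw_outputs):
    """Map raw network outputs to class probabilities that sum to unity:
    here Pr(class 0, not detected) and Pr(class 1, detected)."""
    z = raw_outputs - np.max(raw_outputs)   # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

p0, p1 = softmax(np.array([0.3, 1.7]))
print(p1, p0 + p1)   # probability of detection (class 1); total is 1.0
```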
In our analysis, we will focus on the probability for class 1. Different NN architectures were trained on these data, and it was found that NNs with hidden-layer configurations of 50, 100 + 30, and 300 + 100 + 30 all performed equally well on the classification task. The results presented in this section are therefore those from the network with hidden layers 100 + 30, applied to the blind set of ∼5000 GRBs.

5.2.2 Results

Since the NN outputs are probabilities, we can investigate the quality of the classification as a function of the threshold probability, p_th, required in class 1 to claim a detection. As discussed in Feroz, Marshall & Hobson (2008), we can compute the expected number of total GRB detections, correct detections, and false detections, as well as other derived statistics, as a function of p_th, without knowing the true classifications (as would be the case in the analysis of real data). If we label the probability of detection for each GRB in the blind set as p_i, then the expected total number of detected GRBs, N̂_total, the expected number correctly predicted, N̂_true, and the expected number falsely predicted, N̂_false, are given by

N̂_total = Σ_{i=1..N} p_i ,    (21a)
N̂_true = Σ_{i: p_i > p_th} p_i ,    (21b)
N̂_false = Σ_{i: p_i > p_th} (1 − p_i) ,    (21c)

where N is the total number of GRBs in the blind sample. From these, we can compute the completeness ε (the fraction of truly detected GRBs that have been correctly classified as detected; also referred to as the efficiency) and the purity τ (the fraction of GRBs classified as detected that have been correctly classified), which are given by

ε = N̂_true / N̂_total ,    (22a)
τ = N̂_true / (N̂_true + N̂_false) .    (22b)

In Figure 16 we plot the actual and expected completeness and purity. It is clear that the actual and expected curves lie on top of one another, with only minimal differences. Thus, without knowing the true classifications of the GRBs as detected or not, we can set p_th to obtain the desired completeness and purity levels for the final sample.

Figure 16. Actual and expected values of the completeness and purity as a function of the probability threshold, p_th. The expected curves are very accurate predictors of the actual ones.
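Equations (21) and (22) are simple to evaluate from the vector of predicted detection probabilities; the Python sketch below (our own illustration, with a uniform random stand-in for the NN outputs) traces the expected completeness and purity as the threshold p_th is swept, as in Figure 16.

```python
import numpy as np

def expected_counts(p, p_th):
    """Expected detection statistics of eq. (21), computed from the
    predicted probabilities p_i alone, without the true classifications."""
    sel = p > p_th                       # GRBs classified as detected
    n_total = p.sum()                    # eq. (21a)
    n_true = p[sel].sum()                # eq. (21b)
    n_false = (1.0 - p[sel]).sum()       # eq. (21c)
    return n_total, n_true, n_false

def completeness_purity(p, p_th):
    """Expected completeness (eq. 22a) and purity (eq. 22b)."""
    n_total, n_true, n_false = expected_counts(p, p_th)
    return n_true / n_total, n_true / (n_true + n_false)

# Sweep the threshold to trace curves like those in Figure 16:
p = np.random.default_rng(3).uniform(0.0, 1.0, 5000)  # stand-in for NN outputs
for p_th in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(p_th, completeness_purity(p, p_th))
```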
With this information, we can also plot the actual and expected receiver operating characteristic (ROC) curves (see, e.g., Fawcett 2006). The ROC curve originated in signal detection theory and is a reliable way of choosing an optimal threshold value, as well as of comparing binary classifiers. The ROC curve plots the true positive rate (identical to the completeness, and also equal to the Neyman–Pearson 'power' of a test) against the false positive rate (also known as the contamination, and the Neyman–Pearson type-I error rate) as a function of the threshold value. (It is worth noting that, in terms of conditional probabilities, completeness is simply Pr(classified as detected | detected), purity is its Bayes-theorem complement Pr(detected | classified as detected), and contamination is Pr(classified as detected | not detected).) A perfect classifier will have a ROC curve that connects (0, 0) to (0, 1) and then to (1, 1), whereas a random classifier will yield a ROC curve that is the diagonal line connecting (0, 0) and (1, 1) directly. In general, the larger the area underneath a ROC curve, the more powerful the classifier.

Figure 17. Actual (black solid) and expected (dashed red) ROC curves for a NN classifier that predicts whether a GRB will be detected by Swift. The curve traces true versus false positive rates as the probability threshold varies, as illustrated in the inset log-log plot.

From Figure 17, we can see that the expected and actual ROC curves for the NN classifier are very close, with small deviations occurring only at very low false positive rates; it should be noted that here the actual ROC curve is better than the expected one. The curves also indicate that this test is quite powerful at predicting which GRBs will be detected by Swift.

Using the completeness, purity, and ROC curves, one can decide on an appropriate value of p_th to use. One may require a certain level of completeness, regardless of false positives, or one may require a minimal level of contamination in the final sample. Alternatively, one can use the ROC curve to derive an optimal value of p_th. For example, one can use the p_th value where the ROC curve intersects the diagonal line connecting (0, 1) and (1, 0), or where the line from the point (1, 0) intersects the ROC curve at right angles. From either of these criteria, we conclude that p_th = 0.5, the original naive choice, is a near-optimal threshold value.

Using p_th = 0.5, we now wish to determine how well the GRB detection rate with respect to redshift is reproduced, since this relationship is key to deriving scientific results. In Figure 18, we show the event counts as a function of redshift for both our NN classifier and the original simulated Swift trigger pipeline. It is clear that the two sets of counts agree very well. In the bottom part of the figure, we calculate a measure of the error within each redshift bin by computing (N_t − N_p)/√N_t, where N_t is the 'true' number of GRBs detected by the full simulated Swift pipeline, and N_p is the number obtained using our classification NN. As the original detection counting is essentially a Poissonian process, its intrinsic normalised error will remain within [−1, 1] for most bins, and we can see that the error introduced by the NN similarly does not exceed this magnitude.

We note that the networks used to obtain these results were trained in a few CPU hours, and thereafter can make an accurate determination of whether a GRB will be detected by Swift in just a few microseconds, instead of the minutes of computation time required by the full simulated Swift trigger pipeline.

Figure 18. (Top) Comparison of the detected GRB event counts obtained using the full simulated Swift trigger pipeline and our classification NN. (Bottom) Normalised error between the two counts.

5.3 Dimensionality reduction: compressing galaxy images

In astronomical data analysis, the raw data set often contains a great deal of redundant information. Simply put, there are usually many more pixels in an astronomical image than there are distinct features to be identified in the object being imaged, so not all pixels are independent measures of structure. If we are able to compress the data by removing these redundancies, and instead quantify only the distinct features present, then subsequent analyses can be performed more efficiently. One way of finding these features in the data and performing compression – and denoising – is through the use of autoencoders.
As we have already shown, autoencoders are able to represent non-linear features in a data set and to reduce the number of variables used to describe it to a value closer to the intrinsic dimensionality of the data.

5.3.1 Image compression and denoising

Our previous analysis, presented in Section 5.1, for measuring galaxy ellipticities from the MDM Challenge data set images provides a good example. The original galaxy images each contain 2304 pixels, but these are clearly not all independent measurements. Additionally, even the cropped star images each contain 576 non-independent pixels. With an autoencoder, we can compress both of these images by two orders of magnitude, from thousands of input variables to just tens, from which one can then measure ellipticities.

To perform the compression, autoencoders with a single hidden layer were trained, since these images contain relatively few features and simpler networks require less time to train. Pre-training was used in all examples. The input data were the 2880 pixels of each galaxy and its accompanying (cropped) star. Since the noise in each pixel is Poissonian, we report in Table 6 not only the RMSE of the autoencoder outputs, but also the RMSE normalised by the square root of the original pixel value. The values listed in Table 6 correspond to the mean values obtained over the collection of images comprising the MDM training data set. The values obtained from one image to another are mutually consistent, but when the autoencoder has 10 or more nodes in the hidden layer the image-to-image variations are significantly larger than the differences between the means.

Table 6. The mean RMSE values for autoencoder reconstructions of galaxy and cropped star images from the MDM Challenge data set. Pixel values range from 0 to 255. For the normalised RMSE, the errors have been divided by the square root of the original pixel value that was to be reconstructed.

Hidden nodes    RMSE    Norm. RMSE
1               9.09    0.955
2               8.83    0.936
3               8.66    0.922
5               8.50    0.909
10              8.44    0.904
30              8.43    0.903
50              8.42    0.901
100             8.38    0.897

One sees from Table 6 that only a slight improvement is obtained when using 10 or more hidden nodes. Indeed, taking into account the image-to-image variation, the predictions of the larger networks are statistically indistinguishable from those of the smaller ones. We can therefore represent the images well with just 10 feature values. If we consider the original construction of the images, each galaxy can be represented by 4 parameters (two ellipticities, a position angle and an amplitude) and each star by 2 parameters (a radius for the point spread function and an amplitude). This means that, without noise, only 6 parameters are needed to describe the images completely. This is reflected in the ability of our autoencoder to perform the 'majority' of the fit with just 5 features; additional features produce more marginal decreases in the RMSE, as they are then fine-tuning and/or fitting for the noise in the data.

By looking at pixel-to-pixel comparisons, one finds that a large part of the error comes from the fainter pixels, which are, for the most part, distinctly external to the galaxy. Therefore, the galaxy itself is being described even more accurately than the numbers presented indicate. A plot of this comparison for a typical galaxy/star pair is shown in Figure 19, with the corresponding original (input) and reconstructed (output) images shown in Figure 20.
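The two error measures of Table 6 can be written down directly; the sketch below (ours) computes the RMSE of a reconstruction and the RMSE after dividing each pixel error by the square root of the original pixel value, as appropriate for Poisson noise. Excluding zero-valued pixels from the normalised measure is our own choice, made only to avoid division by zero in this illustration.

```python
import numpy as np

def reconstruction_errors(original, reconstructed):
    """RMSE of an autoencoder reconstruction, and the RMSE after
    normalising each pixel error by sqrt(original pixel value), as in
    Table 6.  Zero-valued pixels are excluded from the normalised measure."""
    err = reconstructed - original
    rmse = np.sqrt(np.mean(err ** 2))
    mask = original > 0
    norm_rmse = np.sqrt(np.mean((err[mask] / np.sqrt(original[mask])) ** 2))
    return rmse, norm_rmse

rng = np.random.default_rng(4)
truth = rng.integers(1, 255, (48, 48)).astype(float)   # stand-in image
noisy = rng.poisson(truth).astype(float)               # Poisson-noised 'reconstruction'
print(reconstruction_errors(truth, noisy))             # norm. RMSE ~ 1 for pure noise
```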
Figure 19. A pixel-by-pixel comparison, for a typical galaxy/star image pair, between the original and autoencoder-reconstructed images. The larger errors occur at smaller true pixel values, associated with the background. This example used an autoencoder with a single hidden layer of 10 nodes.

Figure 20. Comparison of the original (input) and reconstructed (output) galaxy and star images for an autoencoder with a single hidden layer of 10 nodes. This example is the same as that shown in Figure 19.

We can illustrate the nature of the feature vectors constructed by the network by decoding the central-layer values (1, 0, 0, ...), (0, 1, 0, ...), etc. of our autoencoder, which has 10 hidden nodes, to obtain the corresponding 10 output images. We plot these images in Figure 21 for the galaxy/star example shown in Figure 20. Some shape features can clearly be seen, although the greyscale has been reversed for these images in order to view them as brighter structures. This reversal can be accounted for by the network assigning the original features negative weights and a positive bias.

Figure 21. Feature vectors obtained by decoding (1, 0, 0, ...), (0, 1, 0, ...), etc. in the central layer of an autoencoder with 10 central-layer nodes. Shown are the extracted (a) galaxy and (b) star features. The greyscale has been reversed relative to the actual values.

5.3.2 Estimating ellipticities with compressed data

Having trained the autoencoders, we now investigate using the compressed feature values, rather than the original images, to determine the ellipticities of the galaxies. Since the number of inputs has been decreased from 2880 to 10–100, we can use more and larger hidden layers in our regression network and the analysis will still take less time to run. Results from training regression networks with many different configurations are given in Table 7.

Table 7. RMSE values for galaxy ellipticity predictions using the compressed feature values as inputs. Various compressions and regression network sizes were used.

# Features    Hidden layers    RMSE
5             10               0.022316
              30               0.022534
              100              0.022015
              30 + 10          0.025472
              100 + 30         0.025165
10            10               0.016802
              30               0.016237
              100              0.016296
              30 + 10          0.017028
              100 + 30         0.017869
30            10               0.016559
              30               0.016927
              100              0.016608
              30 + 10          0.017312
              100 + 30         0.017351
50            10               0.017056
              30               0.017056
              100              0.016769
              30 + 10          0.016492
              100 + 30         0.016629
100           10               0.019459
              30               0.019561
              100              0.019596
              30 + 10          0.019750
              100 + 30         0.019949

These results show that the extra information given to the regression networks trained on 100 feature values from the autoencoder acted as a disadvantage in predicting the galaxy ellipticities. For networks trained on 50 or 30 features, however, the accuracies of the predicted ellipticities were in some cases better even than those obtained using the full original images. This demonstrates the power of being able to eliminate redundant information and noise, and thereby improve the accuracy of the analysis. We also observe that adding unnecessary complexity to the NN structure makes it more difficult for the algorithm to find the global maximum.

This same method of dimensionality reduction – which also eliminates noise – before performing measurements can clearly be applied to a wide range of other astronomical applications. Examples include the classification of supernovae by type, or measurements of galaxies and stars from their spectra.

6 CONCLUSIONS

We have described an efficient and robust neural network training algorithm, called SkyNet, which we have now made freely available for academic purposes. This generic tool is capable of training large and deep feed-forward networks, including autoencoders, and may be applied to supervised and unsupervised machine learning tasks in astronomy, such as regression, classification, density estimation, clustering and dimensionality reduction.
SkyNet employs a combination of (optional) pre-training followed by iterative refinement of the network parameters using a regularised variant of Newton's optimisation algorithm that incorporates second-order derivative information without the need even to compute or store the Hessian matrix. Linear and sigmoidal activation functions are provided for the user to choose between. SkyNet adopts convergence criteria that naturally prevent overfitting, and it also includes a fast algorithm for estimating the accuracy of network outputs.

We first demonstrate the capabilities of SkyNet on toy examples of regression, classification, and dimensionality reduction using autoencoder networks, and then apply it to the classic machine learning problem of handwriting classification for determining digits from the MNIST database. In an astronomical context, SkyNet is applied to: the regression problem of measuring the ellipticity of noisy and convolved galaxy images in the Mapping Dark Matter Challenge; the classification problem of identifying gamma-ray bursters that are detectable by the Swift satellite; and the dimensionality reduction problem of compressing and denoising images of galaxies. In each case, the straightforward use of SkyNet produces networks that perform the desired task quickly and accurately, and typically achieve results that are competitive with machine learning approaches that have been tailored to the required task.

Future development of SkyNet will expand upon many of the current features and introduce new ones. We are working to include more activation functions (e.g. tanh, softsign, and rectified linear), pooling of nodes, convolutional NNs, diversity in outputs (i.e. mixing regression and classification), and more robust support of recursive NNs. This is all in addition to further improving the speed and efficiency of the training algorithm itself. However, SkyNet in its current state is already a useful tool for performing machine learning in astronomy.

ACKNOWLEDGMENTS

The authors thank John Skilling for providing very useful advice in the early stages of algorithm development. We also thank Amy Lien for providing the data used in Section 5.2. This work utilised three different high-performance computing facilities at different times: initial work was performed on COSMOS VIII, an SGI Altix UV1000 supercomputer, funded by SGI/Intel, HEFCE and PPARC, and the authors thank Andrey Kaliazin for assistance; early work also utilised the Darwin Supercomputer of the University of Cambridge High Performance Computing Service (http://www.hpc.cam.ac.uk/), provided by Dell Inc.
using Strategic Research Infrastructure Funding from the Higher Education Funding Council for England; later work utilised the Discover system of the NASA Center for Climate Simulation at NASA Goddard Space Flight Center. PG is currently supported by a NASA Postdoctoral Fellowship from the Oak Ridge Associated Universities and completed a portion of this work while funded by a Gates Cambridge Scholarship at the University of Cambridge. FF is supported by a Research Fellowship from the Leverhulme and Newton Trusts.

REFERENCES

Andreon S., Gargiulo G., Longo G., Tagliaferri R., Capuano N., 1999, arXiv:astro-ph/9906099
Andreon S., Gargiulo G., Longo G., Tagliaferri R., Capuano N., 2000, MNRAS, 319, 700–716
Auld T., Bridges M., Hobson M.P., Gull S.F., 2008, MNRAS, 376, L11
Auld T., Bridges M., Hobson M.P., 2008, MNRAS, 387, 1575
Ball N.M., Brunner R.J., 2010, Int. J. Mod. Phys., 19, 1049
Bergstra J., Desjardins G., Lamblin P., Bengio Y., 2009, Technical Report 1337, Département d'Informatique et de Recherche Opérationnelle, Université de Montréal
Bertin E., Arnouts S., 1996, A&AS, 117, 393
Bonnett C., 2013, arXiv:1312.1287 [astro-ph.CO]
Bridges M., Cranmer K., Feroz F., Hobson M.P., Ruiz de Austri R., Trotta R., 2011, JHEP, 03, 012
Carreira-Perpiñán M.A., Hinton G.E., 2005, in Cowell R.G., Ghahramani Z., eds, Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, 33–40
Ciresan D.C., Meier U., Gambardella L.M., Schmidhuber J., 2010, Neural Comput., 22, 3207–3220
Cybenko G., 1989, Mathematics of Control, Signals, and Systems, 2, 303–314
Erhan D. et al., 2010, Journal of Machine Learning Research, 11, 625–660
Fawcett T., 2006, Pattern Recogn. Lett., 27, 861
Fendt W.A., Wandelt B.D., 2007, ApJ, 654, 2
Feroz F., Hobson M.P., 2008, MNRAS, 384, 449
Feroz F., Hobson M.P., Bridges M., 2009, MNRAS, 398, 1601
Feroz F., Hobson M.P., Cameron E., Pettitt A.N., 2013, arXiv:1306.2144 [astro-ph.IM]
Feroz F., Marshall P.J., Hobson M.P., 2008, arXiv [astro-ph]
Fynbo J. et al., 2009, ApJS, 185, 526
Gehrels N. et al., 2004, ApJ, 611, 1005
Geva S., Sitte J., IEEE, 3, 621
Glorot X., Bengio Y., 2010, in Teh Y.W., Titterington M., eds, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Journal of Machine Learning Research, 249–256
Glorot X., Bordes A., Bengio Y., 2011, in Gordon G., Dunson D., eds, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Journal of Machine Learning Research, 315–323
Graff P., Feroz F., Hobson M.P., Lasenby A.N., 2012, MNRAS, 421, 169
Gull S.F., Skilling J., 1999, Quantified Maximum Entropy: MemSys 5 Users' Manual. Maximum Entropy Data Consultants Ltd, Bury St Edmunds, Suffolk, UK. http://www.maxent.co.uk/
Hinton G.E., Osindero S., Teh Y.-W., 2006, Neural Comput., 18, 1527–1554
Hinton G.E., Salakhutdinov R.R., 2006, Science, 313, 504–507
Hobson M.P., Jones A.W., Lasenby A.N., Bouchet F.R., 1998, MNRAS, 300, 1–29
Hornik K., Stinchcombe M., White H., 1990, Neural Networks, 3, 359
Hyvärinen A., Oja E., 2000, Neural Networks, 13, 411
Karpenka N.V., Feroz F., Hobson M.P., 2013, MNRAS, 429, 1278–1285
Kendall M.G., 1957, A Course in Multivariate Analysis. Griffin, London
Kitching T. et al., 2011, Annals of Applied Statistics, 5, 2231–2263
Kitching T. et al., 2012, New Astronomy Reviews, submitted
LeCun Y., Bottou L., Bengio Y., Haffner P., 1998, Proc. IEEE, 86, 2278–2324
Lien A., Sakamoto T., Gehrels N., Palmer D., Graziani C., 2012, Proceedings of the International Astronomical Union, 279, 347
Longo G., Tagliaferri R., Andreon S., 2001, in Banday A.J., Zaroubi S., Bartelmann M., eds, Mining the Sky: Proceedings of the MPA/ESO/MPE Workshop, 379–385
MacKay D.J.C., 1992, Neural Computation, 4, 415–447
MacKay D.J.C., 1995, Network: Computation in Neural Systems, 6, 469
MacKay D.J.C., 2003, Information Theory, Inference, and Learning Algorithms. Cambridge Univ. Press. www.inference.phy.cam.ac.uk/mackay/itila/
Mandic D., Chambers J., 2001, Recurrent Neural Networks for Prediction: Learning Algorithms, Architectures and Stability. Wiley, New York
Martens J., 2010, in Fürnkranz J., Joachims T., eds, Proc. 27th Int. Conf. Machine Learning. Omnipress, Haifa, p. 735
Murtagh F., 1991, Neural Comput., 2, 183
Pascanu R., Bengio Y., 2013, arXiv:1301.3584 [cs.LG]
Pearlmutter B.A., 1994, Neural Comput., 6, 147
Sanger T.D., 1989, Neural Networks, 2, 459
Schraudolph N.N., 2002, Neural Comput., 14, 1723
Serra-Ricart M., Calbet X., Garrido L., Gaitan V., 1993, AJ, 106, 1685
Skilling J., 2004, AIP Conference Series, 735, 395
Tagliaferri R. et al., 2003a, Neural Networks, 16, 297
Tagliaferri R., Longo G., Andreon S., Capozziello S., Donalek C., Giordano G., 2003b, in Apolloni B., Marinaro M., Tagliaferri R., eds, Neural Nets: 14th Italian Workshop on Neural Nets, 226–234
Wanderman D., Piran T., 2010, MNRAS, 406, 1944
Way M.J., Scargle J.D., Ali K.M., Srivastava A.N., 2012, Advances in Machine Learning and Data Mining for Astronomy. CRC Press
