Convolutional Networks on Graphs for Learning Molecular Fingerprints
Convolutional Networks on Graphs for Learning Molecular Fingerprints

David Duvenaud†, Dougal Maclaurin†, Jorge Aguilera-Iparraguirre, Rafael Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, Ryan P. Adams
Harvard University

Abstract

We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.

1 Introduction

Recent work in materials design used neural networks to predict the properties of novel molecules by generalizing from examples. One difficulty with this task is that the input to the predictor, a molecule, can be of arbitrary size and shape. Currently, most machine learning pipelines can only handle inputs of a fixed size. The current state of the art is to use off-the-shelf fingerprint software to compute fixed-dimensional feature vectors, and to use those features as inputs to a fully-connected deep neural network or other standard machine learning method. This formula was followed by [28, 3, 19]. During training, the molecular fingerprint vectors were treated as fixed.

In this paper, we replace the bottom layer of this stack – the function that computes molecular fingerprint vectors – with a differentiable neural network whose input is a graph representing the original molecule. In this graph, vertices represent individual atoms and edges represent bonds. The lower layers of this network are convolutional in the sense that the same local filter is applied to each atom and its neighborhood. After several such layers, a global pooling step combines features from all the atoms in the molecule.
These neural graph fingerprints offer several advantages over fixed fingerprints:

• Predictive performance. By being adapted to the task at hand, machine-optimized fingerprints can provide substantially better predictive performance than fixed fingerprints. We show that neural graph fingerprints match or beat the predictive performance of standard fingerprints on solubility, drug efficacy, and organic photovoltaic efficiency datasets.

• Parsimony. Fixed fingerprints must be extremely large to encode all possible substructures without overlap. For example, [28] used a fingerprint vector of size 43,000, after having removed rarely-occurring features. Differentiable fingerprints can be optimized to encode only relevant features, reducing downstream computation and regularization requirements.

• Interpretability. Standard fingerprints encode each possible fragment completely distinctly, with no notion of similarity between fragments. In contrast, each feature of a neural graph fingerprint can be activated by similar but distinct molecular fragments, making the feature representation more meaningful.

† Equal contribution.

Figure 1: Left: A visual representation of the computational graph of both standard circular fingerprints and neural graph fingerprints. First, a graph is constructed matching the topology of the molecule being fingerprinted, in which nodes represent atoms and edges represent bonds. At each layer, information flows between neighbors in the graph. Finally, each node in the graph turns on one bit in the fixed-length fingerprint vector. Right: A more detailed sketch including the bond information used in each operation.

2 Circular fingerprints

The state of the art in molecular fingerprints is extended-connectivity circular fingerprints (ECFP) [21].
Circular fingerprints [6] are a refinement of the Morgan algorithm [17], designed to encode which substructures are present in a molecule in a way that is invariant to atom-relabeling.

Circular fingerprints generate each layer's features by applying a fixed hash function to the concatenated features of the neighborhood in the previous layer. The results of these hashes are then treated as integer indices, where a 1 is written to the fingerprint vector at the index given by the feature vector at each node in the graph. Figure 1 (left) shows a sketch of this computational architecture. Ignoring collisions, each index of the fingerprint denotes the presence of a particular substructure. The size of the substructures represented by each index depends on the depth of the network. Thus the number of layers is referred to as the 'radius' of the fingerprints.

Circular fingerprints are analogous to convolutional networks in that they apply the same operation locally everywhere, and combine information in a global pooling step.

3 Creating a differentiable fingerprint

The space of possible network architectures is large. In the spirit of starting from a known-good configuration, we designed a differentiable generalization of circular fingerprints. This section describes our replacement of each discrete operation in circular fingerprints with a differentiable analog.

Hashing
The purpose of the hash functions applied at each layer of circular fingerprints is to combine information about each atom and its neighboring substructures. This ensures that any change in a fragment, no matter how small, will lead to a different fingerprint index being activated. We replace the hash operation with a single layer of a neural network. Using a smooth function allows the activations to be similar when the local molecular structure varies in unimportant ways.
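To make the hash-then-index scheme concrete, the following is a minimal sketch of hash-based circular fingerprinting, assuming a molecule is given as a list of per-atom feature tuples plus adjacency lists (the molecule representation and the choice of md5 as the hash are illustrative assumptions, not the ECFP specification):

```python
import hashlib

def circular_fingerprint(atom_features, neighbors, radius, fp_length):
    """Hash-based circular fingerprint sketch.

    atom_features: list of hashable per-atom feature tuples
    neighbors: adjacency lists, neighbors[a] = indices bonded to atom a
    """
    fp = [0] * fp_length
    r = list(atom_features)  # current per-atom identifiers
    for _ in range(radius):
        new_r = []
        for a, nbrs in enumerate(neighbors):
            # concatenate own identifier with sorted neighbor identifiers
            # (sorting gives invariance to neighbor ordering)
            v = (r[a],) + tuple(sorted(r[i] for i in nbrs))
            h = int(hashlib.md5(repr(v).encode()).hexdigest(), 16)
            new_r.append(h)
            fp[h % fp_length] = 1  # write a 1 at the hashed index
        r = new_r
    return fp

# toy 3-atom "molecule": C-O-C
fp = circular_fingerprint([('C',), ('O',), ('C',)],
                          [[1], [0, 2], [1]],
                          radius=2, fp_length=64)
```

Note how any change to a fragment changes the hash and hence the activated index, which is exactly the brittleness the smooth neural replacement described above is meant to soften.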
Indexing
Circular fingerprints use an indexing operation to combine all the nodes' feature vectors into a single fingerprint of the whole molecule. Each node sets a single bit of the fingerprint to one, at an index determined by the hash of its feature vector. This pooling-like operation converts an arbitrary-sized graph into a fixed-sized vector. For small molecules and a large fingerprint length, the fingerprints are always sparse. We use the softmax operation as a differentiable analog of indexing. In essence, each atom is asked to classify itself as belonging to a single category. The sum of all these classification label vectors produces the final fingerprint. This operation is analogous to the pooling operation in standard convolutional neural networks.

Algorithm 1 Circular fingerprints
 1: Input: molecule, radius R, fingerprint length S
 2: Initialize: fingerprint vector f ← 0_S
 3: for each atom a in molecule
 4:   r_a ← g(a)                      (lookup atom features)
 5: for L = 1 to R                    (for each layer)
 6:   for each atom a in molecule
 7:     r_1 ... r_N = neighbors(a)
 8:     v ← [r_a, r_1, ..., r_N]      (concatenate)
 9:     r_a ← hash(v)                 (hash function)
10:     i ← mod(r_a, S)               (convert to index)
11:     f_i ← 1                       (write 1 at index)
12: Return: binary vector f

Algorithm 2 Neural graph fingerprints
 1: Input: molecule, radius R, hidden weights H_1^1 ... H_R^5, output weights W_1 ... W_R
 2: Initialize: fingerprint vector f ← 0_S
 3: for each atom a in molecule
 4:   r_a ← g(a)                      (lookup atom features)
 5: for L = 1 to R                    (for each layer)
 6:   for each atom a in molecule
 7:     r_1 ... r_N = neighbors(a)
 8:     v ← r_a + Σ_{i=1}^{N} r_i     (sum)
 9:     r_a ← σ(v H_L^N)              (smooth function)
10:     i ← softmax(r_a W_L)          (sparsify)
11:     f ← f + i                     (add to fingerprint)
12: Return: real-valued vector f

Figure 2: Pseudocode of circular fingerprints (left) and neural graph fingerprints (right). Differences are highlighted in blue. Every non-differentiable operation is replaced with a differentiable analog.
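Algorithm 2 can be sketched in a few lines of NumPy. This is a simplified illustration, not the released implementation: in particular it uses a single hidden weight matrix per layer rather than the paper's separate matrix per neighbor count, and the graph representation (feature matrix plus adjacency lists) is an assumption:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def neural_fingerprint(atom_feats, neighbors, H, W):
    """Simplified Algorithm 2: sum neighbors, smooth 'hash', soft index.

    atom_feats: (num_atoms, F) initial atom features
    neighbors:  adjacency lists
    H: list of (F, F) hidden weight matrices, one per layer
       (the paper uses one per possible bond count; simplified here)
    W: list of (F, fp_length) output weight matrices, one per layer
    """
    r = np.asarray(atom_feats, dtype=float)
    f = np.zeros(W[0].shape[1])
    for H_l, W_l in zip(H, W):
        new_r = np.empty_like(r)
        for a, nbrs in enumerate(neighbors):
            v = r[a] + r[nbrs].sum(axis=0)       # sum self + neighbors
            new_r[a] = np.tanh(v @ H_l)          # smooth "hash"
            f += softmax(new_r[a] @ W_l)         # soft indexing
        r = new_r
    return f

rng = np.random.default_rng(0)
F, fp_len, R = 4, 16, 2
feats = rng.normal(size=(3, F))
nbrs = [[1], [0, 2], [1]]
H = [rng.normal(size=(F, F)) for _ in range(R)]
W = [rng.normal(size=(F, fp_len)) for _ in range(R)]
f = neural_fingerprint(feats, nbrs, H, W)
```

Because each softmax sums to one, the fingerprint entries sum to (number of atoms) × (number of layers), making the analogy to "each atom classifies itself into one category per layer" explicit.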
Canonicalization
Circular fingerprints are identical regardless of the ordering of atoms in each neighborhood. This invariance is achieved by sorting the neighboring atoms according to their features and bond features. We experimented with this sorting scheme, and also with applying the local feature transform to all possible permutations of the local neighborhood. An alternative to canonicalization is to apply a permutation-invariant function, such as summation. In the interests of simplicity and scalability, we chose summation.

Circular fingerprints can be interpreted as a special case of neural graph fingerprints having large random weights. This is because, in the limit of large input weights, tanh nonlinearities approach step functions, which when concatenated form a simple hash function. Also, in the limit of large input weights, the softmax operator approaches a one-hot-coded argmax operator, which is analogous to an indexing operation. Algorithms 1 and 2 summarize these two algorithms and highlight their differences.

Given a fingerprint length L and F features at each layer, the parameters of neural graph fingerprints consist of a separate output weight matrix of size F × L for each layer, as well as a set of hidden-to-hidden weight matrices of size F × F at each layer, one for each possible number of bonds an atom can have (up to 5 in organic molecules).

4 Experiments

We ran two experiments to demonstrate that neural fingerprints with large random weights behave similarly to circular fingerprints. First, we examined whether distances between circular fingerprints were similar to distances between neural fingerprints. Figure 3 (left) shows a scatterplot of pairwise distances between circular vs. neural fingerprints. Fingerprints had length 2048, and were calculated on pairs of molecules from the solubility dataset [4]. Distance was measured using a continuous generalization of the Tanimoto (a.k.a.
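The large-weight limit described above can be checked numerically: scaling the pre-activations by a large constant drives tanh to a ±1 step function and softmax to a one-hot argmax (the input vector and scale here are arbitrary illustrative values):

```python
import numpy as np

x = np.array([0.3, -1.2, 0.7, 2.1, -0.4])
scale = 1e4  # the "large random weights" regime

# tanh saturates to a step function: effectively a binary hash
step = np.tanh(scale * x)
print(np.allclose(np.abs(step), 1.0))  # True

# softmax approaches a one-hot argmax: effectively an indexing operation
e = np.exp(scale * x - (scale * x).max())
soft = e / e.sum()
one_hot = np.zeros_like(x)
one_hot[np.argmax(x)] = 1.0
print(np.allclose(soft, one_hot))  # True
```

With small weights the same operations stay smooth, which is what lets nearby molecular structures produce nearby activations.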
Jaccard) similarity measure, given by

    distance(x, y) = 1 − Σ_i min(x_i, y_i) / Σ_i max(x_i, y_i).    (1)

There is a correlation of r = 0.823 between the distances. The line of points on the right of the plot shows that for some pairs of molecules, binary ECFP fingerprints have exactly zero overlap.

Second, we examined the predictive performance of neural fingerprints with large random weights vs. that of circular fingerprints. Figure 3 (right) shows average predictive performance on the solubility dataset, using linear regression on top of fingerprints. The performances of both methods follow similar curves. In contrast, the performance of neural fingerprints with small random weights follows a different curve, and is substantially better. This suggests that even with random weights, the relatively smooth activation of neural fingerprints helps generalization performance.

[Figure 3 plots: left, neural vs. circular fingerprint distances (r = 0.823); right, RMSE (log Mol/L) vs. fingerprint radius 0–6.]

Figure 3: Left: Comparison of pairwise distances between molecules, measured using circular fingerprints and neural graph fingerprints with large random weights. Right: Predictive performance of circular fingerprints (red), neural graph fingerprints with fixed large random weights (green), and neural graph fingerprints with fixed small random weights (blue). The performance of neural graph fingerprints with large random weights closely matches the performance of circular fingerprints.

4.1 Examining learned features

To demonstrate that neural graph fingerprints are interpretable, we show substructures which most activate individual features in a fingerprint vector.
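Equation (1) is straightforward to implement; for binary vectors it reduces to the standard Tanimoto/Jaccard distance of one minus intersection over union:

```python
import numpy as np

def tanimoto_distance(x, y):
    """Continuous Tanimoto (Jaccard) distance, Equation (1)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return 1.0 - np.minimum(x, y).sum() / np.maximum(x, y).sum()

# identical fingerprints have distance 0
print(tanimoto_distance([1, 0, 1], [1, 0, 1]))  # 0.0
# disjoint binary fingerprints have distance 1 (the "zero overlap" case)
print(tanimoto_distance([1, 0, 0], [0, 1, 0]))  # 1.0
```

The disjoint case is exactly the vertical line of points at distance 1 visible in Figure 3 (left) for binary ECFP fingerprints.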
Each feature of a circular fingerprint vector can only be activated by a single fragment of a single radius, except for accidental collisions. In contrast, neural graph fingerprint features can be activated by variations of the same structure, making them more interpretable and allowing shorter feature vectors.

Solubility features
Figure 4 shows the fragments that maximally activate the most predictive features of a fingerprint. The fingerprint network was trained as input to a linear model predicting solubility, as measured in [4]. The feature shown in the top row has a positive predictive relationship with solubility, and is most activated by fragments containing a hydrophilic R-OH group, a standard indicator of solubility. The feature shown in the bottom row, strongly predictive of insolubility, is activated by non-polar repeated ring structures.

[Figure 4 panel titles: "Fragments most activated by pro-solubility feature"; "Fragments most activated by anti-solubility feature". Fragment structures omitted.]

Figure 4: Examining fingerprints optimized for predicting solubility. Shown here are representative examples of molecular fragments (highlighted in blue) which most activate different features of the fingerprint. Top row: the feature most predictive of solubility. Bottom row: the feature most predictive of insolubility.

Toxicity features
We trained the same model architecture to predict toxicity, as measured in two different datasets in [26]. Figure 5 shows fragments which maximally activate the feature most predictive of toxicity, in two separate datasets.

[Figure 5 panel titles: "Fragments most activated by toxicity feature on SR-MMP dataset"; "Fragments most activated by toxicity feature on NR-AHR dataset". Fragment structures omitted.]

Figure 5: Visualizing fingerprints optimized for predicting toxicity. Shown here are representative samples of molecular fragments (highlighted in red) which most activate the feature most predictive of toxicity.
Top row: the most predictive feature identifies groups containing a sulphur atom attached to an aromatic ring. Bottom row: the most predictive feature identifies fused aromatic rings, also known as polycyclic aromatic hydrocarbons, a well-known class of carcinogens.

[27] constructed similar visualizations, but in a semi-manual way: to determine which toxic fragments activated a given neuron, they searched over a hand-made list of toxic substructures and chose the one most correlated with a given neuron. In contrast, our visualizations are generated automatically, without the need to restrict the range of possible answers beforehand.

4.2 Predictive Performance

We ran several experiments to compare the predictive performance of neural graph fingerprints to that of the standard state-of-the-art setup: circular fingerprints fed into a fully-connected neural network.

Experimental setup
Our pipeline takes as input the SMILES [30] string encoding of each molecule, which is then converted into a graph using RDKit [20]. We also used RDKit to produce the extended circular fingerprints used in the baseline. Hydrogen atoms were treated implicitly.

In our convolutional networks, the initial atom and bond features were chosen to be similar to those used by ECFP: initial atom features concatenated a one-hot encoding of the atom's element, its degree, the number of attached hydrogen atoms, the implicit valence, and an aromaticity indicator. The bond features were a concatenation of whether the bond type was single, double, triple, or aromatic, whether the bond was conjugated, and whether the bond was part of a ring.

Training and Architecture
Training used batch normalization [11]. We also experimented with tanh vs. relu activation functions for both the neural fingerprint network layers and the fully-connected network layers. relu had a slight but consistent performance advantage on the validation set.
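A featurization along the lines described above can be sketched without RDKit itself; the element list and the caps on degree, hydrogen count, and valence below are illustrative assumptions, not the paper's exact choices:

```python
# Hypothetical one-hot atom featurization: element, degree, attached
# hydrogens, implicit valence, aromaticity flag.
ELEMENTS = ['C', 'N', 'O', 'S', 'F', 'Cl', 'Br', 'I', 'P', 'other']
MAX_DEGREE, MAX_H, MAX_VALENCE = 5, 4, 5  # assumed caps

def one_hot(value, choices):
    return [1.0 if value == c else 0.0 for c in choices]

def atom_features(element, degree, num_h, implicit_valence, is_aromatic):
    elem = element if element in ELEMENTS else 'other'
    return (one_hot(elem, ELEMENTS)
            + one_hot(degree, range(MAX_DEGREE + 1))
            + one_hot(num_h, range(MAX_H + 1))
            + one_hot(implicit_valence, range(MAX_VALENCE + 1))
            + [1.0 if is_aromatic else 0.0])

# e.g. an aromatic carbon with two bonded neighbors and one hydrogen
feat = atom_features('C', 2, 1, 1, True)
print(len(feat))  # 10 + 6 + 5 + 6 + 1 = 28
```

In the actual pipeline these quantities would come from RDKit atom queries; the fixed-length concatenation is what lets every atom share the same convolutional filter.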
We also experimented with dropconnect [29], a variant of dropout in which weights, rather than hidden units, are randomly set to zero, but found that it led to worse validation error in general. Each experiment optimized for 10,000 minibatches of size 100 using the Adam algorithm [13], a variant of RMSprop that includes momentum.

Hyperparameter Optimization
To optimize hyperparameters, we used random search. The hyperparameters of all methods were optimized using 50 trials for each cross-validation fold. The following hyperparameters were optimized: log learning rate, log of the initial weight scale, the log L2 penalty, fingerprint length, fingerprint depth (up to 6), and the size of the hidden layer in the fully-connected network. Additionally, the size of the hidden feature vector in the convolutional neural fingerprint networks was optimized.

Dataset                        Solubility [4]   Drug efficacy [5]   Photovoltaic efficiency [8]
Units                          log Mol/L        EC50 in nM          percent
Predict mean                   4.29 ± 0.40      1.47 ± 0.07         6.40 ± 0.09
Circular FPs + linear layer    1.71 ± 0.13      1.13 ± 0.03         2.63 ± 0.09
Circular FPs + neural net      1.40 ± 0.13      1.36 ± 0.10         2.00 ± 0.09
Neural FPs + linear layer      0.77 ± 0.11      1.15 ± 0.02         2.58 ± 0.18
Neural FPs + neural net        0.52 ± 0.07      1.16 ± 0.03         1.43 ± 0.09

Table 1: Mean predictive accuracy of neural fingerprints compared to standard circular fingerprints.

Datasets
We compared the performance of standard circular fingerprints against neural graph fingerprints on a variety of domains:

• Solubility: the aqueous solubility of 1144 molecules, as measured by [4].

• Drug efficacy: the half-maximal effective concentration (EC50) in vitro of 10,000 molecules against a sulfide-resistant strain of P. falciparum, the parasite that causes malaria, as measured by [5].

• Organic photovoltaic efficiency: the Harvard Clean Energy Project [8] uses expensive DFT simulations to estimate the photovoltaic efficiency of organic molecules.
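The random-search scheme described above amounts to sampling each hyperparameter independently and keeping the best-scoring trial. The following sketch uses illustrative ranges (the actual search bounds are not given in the text) and a dummy stand-in for the expensive train-and-validate step:

```python
import random

def sample_hyperparams(rng):
    # ranges are illustrative assumptions, not the paper's actual bounds
    return {
        'log_learning_rate': rng.uniform(-6, -2),   # searched in log space
        'log_init_scale':    rng.uniform(-6, -2),
        'log_l2_penalty':    rng.uniform(-6, 0),
        'fp_length':         rng.choice([16, 32, 64, 128]),
        'fp_depth':          rng.randint(1, 6),     # up to 6 layers
        'hidden_size':       rng.choice([32, 64, 128]),
        'conv_width':        rng.choice([10, 20, 40]),  # hidden feature size
    }

def validation_rmse(hp):
    # stand-in for: train the full pipeline with hp, return validation RMSE
    return abs(hp['log_learning_rate'] + 4.0)

rng = random.Random(0)
trials = [sample_hyperparams(rng) for _ in range(50)]  # 50 trials per fold
best = min(trials, key=validation_rmse)
```

Random search is a natural fit here because the hyperparameters act on very different scales, and log-space sampling gives equal coverage across orders of magnitude.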
We used a subset of 20,000 molecules from this dataset.

Predictive accuracy
We compared the performance of circular fingerprints and neural graph fingerprints under two conditions: in the first condition, predictions were made by a linear layer using the fingerprints as input; in the second condition, predictions were made by a one-hidden-layer neural network using the fingerprints as input. In all settings, all differentiable parameters in the composed models were optimized simultaneously. Results are summarized in Table 1.

In all experiments, the neural graph fingerprints matched or beat the accuracy of circular fingerprints, and the methods with a neural network on top of the fingerprints typically outperformed the linear layers.

Software
Automatic differentiation (AD) software packages such as Theano [1] significantly speed up development time by providing gradients automatically, but can only handle limited control structures and indexing. Since we required relatively complex control flow and indexing in order to implement variants of Algorithm 2, we used a more flexible automatic differentiation package for Python called Autograd (github.com/HIPS/autograd). This package handles standard Numpy [18] code, and can differentiate code containing while loops, branches, and indexing. Code for computing neural fingerprints and producing visualizations is available at github.com/HIPS/neural-fingerprint.

5 Limitations

Computational cost
Neural fingerprints have the same asymptotic complexity in the number of atoms and the depth of the network as circular fingerprints, but have additional terms due to the matrix multiplies necessary to transform the feature vector at each step. To be precise, computing a neural fingerprint of depth R and fingerprint length L for a molecule with N atoms, using a molecular convolutional net with F features at each layer, costs O(RNFL + RNF²).
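The two terms in the O(RNFL + RNF²) bound can be read directly off Algorithm 2: a back-of-envelope multiply count, with all constants omitted and example sizes chosen purely for illustration:

```python
def fingerprint_flops(R, N, F, L):
    """Approximate multiply count for a depth-R neural fingerprint of
    length L on an N-atom molecule with F features per layer."""
    output = R * N * F * L  # per-layer softmax projections, F -> L
    hidden = R * N * F * F  # per-layer hidden transforms, F -> F
    return output + hidden

# e.g. a depth-4 net, a 30-atom molecule, 20 features, length-2048 fingerprint
print(fingerprint_flops(R=4, N=30, F=20, L=2048))  # 4963200
```

For typical settings the F → L output projection dominates, since the fingerprint length L is usually much larger than the per-layer feature count F.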
In practice, training neural networks on top of circular fingerprints usually took several minutes, while training both the fingerprints and the network on top took on the order of an hour on the larger datasets.

Limited computation at each layer
How complicated should we make the function that goes from one layer of the network to the next? In this paper we chose the simplest feasible architecture: a single layer of a neural network. However, it may be fruitful to apply multiple layers of nonlinearities between each message-passing step (as in [22]), or to make information preservation easier by adapting the Long Short-Term Memory [10] architecture to pass information upwards.

Limited information propagation across the graph
The local message-passing architecture developed in this paper scales well in the size of the graph (due to the low degree of organic molecules), but its ability to propagate information across the graph is limited by the depth of the network. This may be appropriate for small graphs such as those representing the small organic molecules used in this paper. However, in the worst case, it can take a network of depth N/2 to distinguish between graphs of size N. To avoid this problem, [2] proposed a hierarchical clustering of graph substructures. A tree-structured network could examine the structure of the entire graph using only log(N) layers, but would require learning to parse molecules. Techniques from natural language processing [25] might be fruitfully adapted to this domain.

Inability to distinguish stereoisomers
Special bookkeeping is required to distinguish between stereoisomers, including enantiomers (mirror images of molecules) and cis/trans isomers (rotation around double bonds). Most circular fingerprint implementations have the option to make these distinctions. Neural fingerprints could be extended to be sensitive to stereoisomers, but this remains a task for future work.
6 Related work

This work is similar in spirit to the neural Turing machine [7], in the sense that we take an existing discrete computational architecture and make each part differentiable in order to do gradient-based optimization.

Neural nets for quantitative structure-activity relationship (QSAR)
The modern standard for predicting properties of novel molecules is to compose circular fingerprints with fully-connected neural networks or other regression methods. [3] used circular fingerprints as inputs to an ensemble of neural networks, Gaussian processes, and random forests. [19] used circular fingerprints (of depth 2) as inputs to a multitask neural network, showing that multiple tasks helped performance.

Neural graph fingerprints
The most closely related work is [15], who build a neural network having graph-valued inputs. Their approach is to remove all cycles and build the graph into a tree structure, choosing one atom to be the root. A recursive neural network [23, 24] is then run from the leaves to the root to produce a fixed-size representation. Because a graph having N nodes has N possible roots, all N possible graphs are constructed. The final descriptor is a sum of the representations computed by all distinct graphs; there are as many distinct graphs as there are atoms in the molecule. The computational cost of this method thus grows as O(F²N²), where F is the size of the feature vector and N is the number of atoms, making it less suitable for large molecules.

Convolutional neural networks
Convolutional neural networks have been used to model images, speech, and time series [14]. However, standard convolutional architectures use a fixed computational graph, making them difficult to apply to objects of varying size or structure, such as molecules. More recently, [12] and others have developed a convolutional neural network architecture for modeling sentences of varying length.
Neural networks on fixed graphs
[2] introduce convolutional networks on graphs in the regime where the graph structure is fixed, and each training example differs only in having different features at the vertices of the same graph. In contrast, our networks address the situation where each training input is a different graph.

Neural networks on input-dependent graphs
[22] propose a neural network model for graphs with an interesting training procedure: the forward pass consists of running a message-passing scheme to equilibrium, a fact which allows the reverse-mode gradient to be computed without storing the entire forward computation. They apply their network to predicting mutagenesis of molecular compounds as well as web page rankings. [16] also propose a neural network model for graphs, with a learning scheme whose inner loop optimizes not the training loss, but rather the correlation between each newly-proposed vector and the training error residual. They apply their model to a dataset of boiling points of 150 molecular compounds. Our paper builds on these ideas, with the following differences: our method replaces their complex training algorithms with simple gradient-based optimization, generalizes existing circular fingerprint computations, and applies these networks in the context of modern QSAR pipelines which use neural networks on top of the fingerprints to increase model capacity.

Unrolled inference algorithms
[9] and others have noted that iterative inference procedures sometimes resemble the feedforward computation of a recurrent neural network. One natural extension of these ideas is to parameterize each inference step, and train a neural network to approximately match the output of exact inference using only a small number of iterations. The neural fingerprint, when viewed in this light, resembles an unrolled message-passing algorithm on the original graph.
7 Conclusion

We generalized existing hand-crafted molecular features to allow their optimization for diverse tasks. By making each operation in the feature pipeline differentiable, we can use standard neural-network training methods to scalably optimize the parameters of these neural molecular fingerprints end-to-end. We demonstrated the interpretability and predictive performance of these new fingerprints.

Data-driven features have already replaced hand-crafted features in speech recognition, machine vision, and natural-language processing. Carrying out the same task for virtual screening, drug design, and materials design is a natural next step.

Acknowledgments
We thank Edward Pyzer-Knapp, Jennifer Wei, and Samsung Advanced Institute of Technology for their support. This work was partially funded by NSF IIS-1421780.

References

[1] Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.

[2] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.

[3] George E. Dahl, Navdeep Jaitly, and Ruslan Salakhutdinov. Multi-task neural networks for QSAR predictions. arXiv preprint arXiv:1406.1231, 2014.

[4] John S. Delaney. ESOL: Estimating aqueous solubility directly from molecular structure. Journal of Chemical Information and Computer Sciences, 44(3):1000–1005, 2004.

[5] Francisco-Javier Gamo, Laura M. Sanz, Jaume Vidal, Cristina de Cozar, Emilio Alvarez, Jose-Luis Lavandera, Dana E. Vanderwall, Darren V. S. Green, Vinod Kumar, Samiul Hasan, et al. Thousands of chemical starting points for antimalarial lead identification. Nature, 465(7296):305–310, 2010.

[6] Robert C. Glem, Andreas Bender, Catrin H.
Arnby, Lars Carlsson, Scott Boyer, and James Smith. Circular fingerprints: flexible molecular descriptors with applications from physical chemistry to ADME. IDrugs: the investigational drugs journal, 9(3):199–204, 2006.

[7] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.

[8] Johannes Hachmann, Roberto Olivares-Amaya, Sule Atahan-Evrenk, Carlos Amador-Bedolla, Roel S. Sánchez-Carrera, Aryeh Gold-Parker, Leslie Vogt, Anna M. Brockway, and Alán Aspuru-Guzik. The Harvard clean energy project: large-scale computational screening and design of organic photovoltaics on the world community grid. The Journal of Physical Chemistry Letters, 2(17):2241–2251, 2011.

[9] John R. Hershey, Jonathan Le Roux, and Felix Weninger. Deep unfolding: Model-based inspiration of novel deep architectures. arXiv preprint arXiv:1409.2574, 2014.

[10] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[11] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[12] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, June 2014.

[13] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[14] Yann LeCun and Yoshua Bengio. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361, 1995.

[15] Alessandro Lusci, Gianluca Pollastri, and Pierre Baldi. Deep architectures and deep learning in chemoinformatics: the prediction of aqueous solubility for drug-like molecules. Journal of Chemical Information and Modeling, 53(7):1563–1575, 2013.

[16] Alessio Micheli.
Neural network for graphs: A contextual constructive approach. Neural Networks, IEEE Transactions on, 20(3):498–511, 2009.

[17] H. L. Morgan. The generation of a unique machine description for chemical structures. Journal of Chemical Documentation, 5(2):107–113, 1965.

[18] Travis E. Oliphant. Python for scientific computing. Computing in Science & Engineering, 9(3):10–20, 2007.

[19] Bharath Ramsundar, Steven Kearnes, Patrick Riley, Dale Webster, David Konerding, and Vijay Pande. Massively multitask networks for drug discovery. arXiv preprint arXiv:1502.02072, 2015.

[20] RDKit: Open-source cheminformatics. www.rdkit.org. [Accessed 11-April-2013].

[21] David Rogers and Mathew Hahn. Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 50(5):742–754, 2010.

[22] F. Scarselli, M. Gori, Ah Chung Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. Neural Networks, IEEE Transactions on, 20(1):61–80, January 2009.

[23] Richard Socher, Eric H. Huang, Jeffrey Pennin, Christopher D. Manning, and Andrew Y. Ng. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems, pages 801–809, 2011.

[24] Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 151–161. Association for Computational Linguistics, 2011.

[25] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015.

[26] Tox21 Challenge. National center for advancing translational sciences. http://tripod.nih.gov/tox21/challenge, 2014. [Online; accessed 2-June-2015].

[27] Thomas Unterthiner, Andreas Mayr, Günter Klambauer, and Sepp Hochreiter.
Toxicity prediction using deep learning. arXiv preprint arXiv:1503.01445, 2015.

[28] Thomas Unterthiner, Andreas Mayr, Günter Klambauer, Marvin Steijaert, Jörg Wenger, Hugo Ceulemans, and Sepp Hochreiter. Deep learning as an opportunity in virtual screening. In Advances in Neural Information Processing Systems, 2014.

[29] Li Wan, Matthew Zeiler, Sixin Zhang, Yann L. Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In International Conference on Machine Learning, 2013.

[30] David Weininger. SMILES, a chemical language and information system. Journal of Chemical Information and Computer Sciences, 28(1):31–36, 1988.