Identification of functionally related enzymes by learning-to-rank methods
Enzyme sequences and structures are routinely used in the biological sciences as queries to search for functionally related enzymes in online databases. To this end, one usually departs from some notion of similarity, comparing two enzymes by looking…
Authors: Michiel Stock, Thomas Fober, Eyke Hüllermeier
Michiel Stock, Thomas Fober, Eyke Hüllermeier, Serghei Glinca, Gerhard Klebe, Tapio Pahikkala, Antti Airola, Bernard De Baets, Willem Waegeman∗

Abstract

Enzyme sequences and structures are routinely used in the biological sciences as queries to search for functionally related enzymes in online databases. To this end, one usually starts from some notion of similarity, comparing two enzymes by looking for correspondences in their sequences, structures or surfaces. For a given query, the search operation results in a ranking of the enzymes in the database, from very similar to dissimilar enzymes, while information about the biological function of annotated database enzymes is ignored. In this work we show that rankings of that kind can be substantially improved by applying kernel-based learning algorithms. This approach enables the detection of statistical dependencies between similarities of the active cleft and the biological function of annotated enzymes. This is in contrast to search-based approaches, which do not take annotated training data into account. Similarity measures based on the active cleft are known to outperform sequence-based or structure-based measures under certain conditions. We consider the Enzyme Commission (EC) classification hierarchy for obtaining annotated enzymes during the training phase. The results of a set of sizeable experiments indicate a consistent and significant improvement for a set of similarity measures that exploit information about small cavities in the surface of enzymes.

1 Introduction

Modern high-throughput technologies in molecular biology are generating more and more protein sequences and tertiary structures, of which only a small fraction can ever be experimentally annotated w.r.t. functionality.
Predicting the biological function of enzymes¹ remains extremely challenging, and novel functions are especially hard to detect, despite the large number of automated annotation methods that have been introduced in the last decade [1, 2]. Existing online services, such as BLAST or ReliBase, often provide tools to search in databases that contain collections of annotated enzymes. These systems rely on some notion of similarity when searching for related enzymes, but the definition of similarity differs from system to system. Indeed, a vast number of measures for expressing similarity between two enzymes exists in the literature, performing calculations on different levels of abstraction. These measures fall into three major groups: approaches that solely use the sequence of amino acids, approaches that also take into account the tertiary structure, and approaches that consider local fold information by analyzing small cavities (hypothetical binding sites) at the surface of an enzyme.

Sequence-based measures such as BLAST [3] and PSI-BLAST [4] can be computed efficiently and are able to find enzymes with related functions under certain conditions. In addition, several kernel-based methods have been developed to make predictions for proteins at the sequence level; see e.g. [5, 6]. A high sequence similarity usually results in a high structural similarity, and proteins with a sequence identity (the number of matches in an alignment) above 40% are generally considered to share the same structure [7]. However, this assumption becomes less reliable in the twilight zone, where the sequence identity lies between 25 and 40%. Furthermore, enzymes with comparable functions can exhibit very low sequence identity [8, 9].
∗ M. Stock, B. De Baets and W. Waegeman are with the Department of Mathematical Modelling, Statistics and Bioinformatics, Ghent University, Coupure links 653, 9000 Ghent, Belgium, email: firstname.surname@ugent.be. T. Fober and E. Hüllermeier are with the Philipps-Universität of Marburg, Department of Mathematics and Computer Science, Hans-Meerwein-Straße 6, D-35032 Marburg, Germany. S. Glinca and G. Klebe are also with the Philipps-Universität of Marburg, Department of Pharmacy, Marbacher Weg 6-10, D-35032 Marburg, Germany. T. Pahikkala and A. Airola are with the Department of Information Technology and the Turku Centre for Computer Science, University of Turku, Joukahaisenkatu 3-5 B 20520 Turku, Finland.

¹ Enzymes are biomolecules that catalyze chemical reactions. All the enzymes we consider in this work are proteins and vice versa. Both notions will be used interchangeably.

For these reasons, and because three-dimensional crystal structures are becoming increasingly available in online databases, the comparison of proteins at the structural level has gained increasing attention. The secondary structure of an enzyme is known to strongly influence its biological function [10] and contains valuable information that is missing at the sequence level [11, 12, 13]. Many approaches that perform calculations on the overall fold of the protein have been developed; see e.g. [14, 15, 16]. Unfortunately, such approaches are also not optimal for determining the function of enzymes. They require knowledge of active site residues and usually lead to a rather coarse representation, especially for enzymes, where often only a few specific residues are responsible for the catalytic mechanism [17]. For example, the vicinal-oxygen-chelate superfamily shows a large functional diversity while having only a limited sequence diversity [18, 19].
It has also been shown that some parts of the protein structure space have a high functional diversity [20], further limiting the use of global fold similarity. For these reasons, many methods consider local structural features, such as evolutionarily conserved residues [21, 22]. The most appropriate similarity measures for the prediction of enzyme functions focus on surface regions in which ligands, co-factors and substrates can bind [23]. Cavities in the surface are known to contain valuable information, and exploiting similarities between those cavities helps to find functionally related enzymes. By considering structural and physico-chemical information of binding sites, one can detect relationships that cannot be found using traditional sequence- and fold-based methods, making such similarities of particular interest for applications in drug discovery [24, 25]. In addition to providing a complementary notion of protein families [26], these methods also allow for extracting relationships between cavities of unrelated proteins [27]. Similarity measures that highlight cavities and binding sites can be further subdivided into graph-based approaches such as [28, 29, 30, 31, 32], geometric approaches such as [33, 34] and feature-based approaches such as [35, 36]. These measures are discussed more thoroughly in Section 2.

This paper aims to show that the search for functionally related enzymes can be substantially improved by applying learning-to-rank methods. These algorithms use training data to build a mathematical model for ranking objects, such as enzymes, that are not necessarily seen among the training data. While our methods can be applied to all types of data, as long as a meaningful similarity measure can be constructed, we only demonstrate their power using cavity-based measures, for the reasons explained above.
Ranking-based machine learning algorithms are often used for applications in information retrieval [37]. Due to their proven added value for search engines, ranking-based machine learning methods have gained some popularity in bioinformatics, for example in drug discovery [38, 39] or to find similarities between proteins [40, 41]. Despite this, many online services such as BLAST, PDB, Dali and CavBase rely solely on similarity measures to construct rankings, without utilizing annotated enzymes and learning algorithms to steer the search process during a training phase. However, because annotated enzymes are present in online databases, improvements can be made by applying ranking-based machine learning algorithms. This amounts to a transition from an unsupervised to a supervised learning scenario.

Using four different cavity-based similarity measures and one based on sequence alignment as input for RankRLS [42], a kernel-based ranking algorithm, we demonstrate a significant improvement for each of these measures. RankRLS works in a similar way to competitors such as RankSVM [43], because it uses annotated training data to learn rankings during a training phase. The training data is annotated via the Enzyme Commission (EC) functional classification hierarchy, a commonly used way to subdivide enzymes into functional classes. EC numbers adopt a four-number hierarchical structure, representing different levels of catalytic detail. Importantly, this representation focuses on the chemical reactions that are performed, and not on structure or homology. As explained in more detail in Section 2, the EC numbers are used to construct a ground-truth catalytic similarity measure, and subsequently to generate ground-truth rankings. In addition to obtaining annotated training data, this procedure also allows for a fair comparison with the more traditional approach, using conventional performance measures for rankings.
This way of evaluating also characterizes the difference between our search engine approach and previous work in which supervised learning algorithms for EC number assignment have been considered (for a far from complete list see e.g. [44, 29, 45, 46, 47]). In this work we are unable to compare to such methods, because they do not return rankings as output. Nonetheless, similar to some of these approaches, we do take the hierarchical structure of the EC numbers into account. Instead of predicting one EC number, a ranking of functionally related enzymes is returned for a given query. In this scheme the top of the obtained ranking is expected to contain enzymes with functions similar to the query enzyme with an unknown EC number. A ranking provides end users with a generally well-known and easily understandable output, while useful results can still be retrieved when an enzyme with a new EC number is encountered.

2 Material and methods

2.1 Database

Our work builds upon CavBase, a database that is made commercially available as part of ReliBase [48]. CavBase can be used for the automated detection, extraction, and storage of protein cavities from experimentally determined protein structures, which are available through the Protein Data Bank (PDB). The geometrical arrangement of the pocket and its physico-chemical properties are first represented by predefined pseudocenters: spatial points that characterize the geometric center of a functional group specified by a particular property. The type and the spatial position of the pseudocenters depend on the amino acids that border the binding pocket and expose their functional groups. They are derived from the protein structure using a set of predefined rules [49].
Hydrogen-bond donor, acceptor, mixed donor/acceptor, hydrophobic aliphatic, metal ion, pi (accounting for the ability to form π-π interactions) and aromatic properties are considered as possible types of pseudocenters. These pseudocenters can be regarded as a compressed representation of surface areas where certain protein-ligand interactions are encountered. Consequently, a set of pseudocenters is an approximate representation of a spatial distribution of physico-chemical properties.

To build and test our models we require an appropriate data set that contains sufficiently many proteins and EC classes. Based on the experience of local pharmaceutical experts, we chose the data set of EC classes listed in Table 1. To generate the first data set (data set I), we retrieved all proteins from the PDB that were assigned one of these EC classes, yielding a set of 5,257 proteins. To ensure that only unique proteins were contained in our data set, we used the protein culling server² with its default parameterization. As such, all proteins with high pairwise homology were filtered out. This procedure resulted in a data set of cardinality 1,714. To extract the active site of each protein we assumed that the largest binding site of a protein contains its catalytic center [23]. Hence, for each protein we took the binding site from CavBase that maximized the volume. Of the proteins in our data set, 158 were not contained in CavBase (e.g., because the structure was determined by NMR instead of X-ray). These proteins were therefore removed, resulting in a final data set of size 1,556.

The first data set comes with two drawbacks. First of all, the binding site containing the catalytic center was determined by a pure heuristic, namely by taking the largest binding site among all binding sites a protein exhibits.
Moreover, sufficient resolution was not a criterion for selecting the cavities. This may lead to a data set of low quality. Therefore, relying on the expertise of pharmaceutical experts, we compiled another data set, referred to as data set II, containing the same EC classes. For this data set, all proteins from the PDB with a resolution of 2.5 Å or better were considered. Moreover, the binding site volume was required to range between 350 Å³ and 3500 Å³. Structures not meeting these conditions were eliminated, since a resolution poorer than 2.5 Å usually leads to a too coarse representation, while binding sites with volumes outside the above-mentioned range are usually artefacts produced by the algorithm used for their detection. From the resulting set of 24,102 proteins the active site was selected. This resulted in a data set of 1,730 enzymes, on which we applied the protein culling server to finally end up with a second data set of 561 enzymes. A pairwise sequence similarity matrix and a phylogenetic tree of our data sets can be found in the supplementary materials.

2.2 Similarity measures for cavities

In the introduction we motivated why our analysis is restricted to similarity measures for cavities, which are three-dimensional objects that can be represented in multiple ways. Some measures are graph-based, transforming cavities into node-labeled and edge-weighted graphs. This makes it possible to apply traditional techniques for comparing graphs (e.g. [28]). Unfortunately, techniques that construct a Boolean similarity measure, such as those based on graph isomorphisms, are not appropriate for comparing noisy and flexible protein structures. Computing the maximum common subgraph [50] can be considered a more appropriate alternative, and this method is used in this paper as a baseline (see below).
The graph edit distance [51] is another measure to compare graphs, specifying the number of edit operations needed to transform a given graph into another graph. This distance can be calculated in different ways, e.g., by using a greedy heuristic [52] or quadratic programming [53]. Unfortunately, the graph edit distance is very hard to parameterize and often quite inefficient. More efficient approaches belong to the class of graph kernels. They have gained a lot of attention in bioinformatics, as they allow for a sufficiently high degree of error tolerance. Different realizations are available, such as the shortest path kernel [54], the random walk kernel [30] and the graphlet kernel [31, 32]. Graph kernels work particularly well for small molecules such as ligands, but they are less useful for larger molecules such as proteins. They gave rather poor results in [36], which explains why we concentrate here on the maximum common subgraph as a representative of graph-based approaches.

² http://www.bioinf.manchester.ac.uk/leaf/

Table 1: List of the 21 EC numbers with their accepted name and the number of examples of each class for the two data sets.

EC number      accepted name                                  # set I   # set II
EC 1.1.1.1     alcohol dehydrogenase                               23         15
EC 1.1.1.21    aldehyde reductase                                  35         30
EC 1.5.1.3     dihydrofolate reductase                            110          6
EC 1.11.1.5    cytochrome-c peroxidase                             92         31
EC 1.14.15.1   camphor 5-monooxygenase                             30         36
EC 2.1.1.45    thymidylate synthase                                63         22
EC 2.1.1.98    diphthine synthase                                   5         43
EC 2.4.1.1     phosphorylase                                       43         40
EC 2.4.2.29    tRNA-guanine transglycosylase                       32         16
EC 2.7.11.1    non-specific serine/threonine enzyme kinase        304         24
EC 3.1.1.7     acetylcholinesterase                                23         13
EC 3.1.3.48    enzyme-tyrosine-phosphatase                        151         28
EC 3.4.21.4    trypsin                                            118         72
EC 3.4.21.5    thrombin                                            87         51
EC 3.5.2.6     β-lactamase                                        153          8
EC 4.1.2.13    fructose-bisphosphate aldolase                      48          4
EC 4.2.1.1     carbonate dehydratase                              186         76
EC 4.2.1.20    tryptophan synthase                                 13          7
EC 5.3.1.5     xylose isomerase                                    18         21
EC 5.3.3.1     steroid Δ-isomerase                                 14         10
EC 6.3.2.1     pantoate-β-alanine ligase                            8          8

As a second category of measures for cavities, geometric methods directly process the labeled spatial coordinates of the functional parts, denoted as point clouds, instead of transforming a protein cavity into a graph. Remarkably, only a few approaches have been proposed that build on this representation. In [33] geometric hashing is employed to calculate a superposition of protein cavities that can be used to derive an alignment and a similarity score. A similar approach was used in [34], in which an optimization problem was solved instead of applying geometric hashing. Besides these two approaches, several other methods exist for comparing two point clouds [55]. Unfortunately, the majority of these methods cannot cope with biological data, due to a very high complexity or error intolerance.

As a third family of approaches, one can also represent the protein cavity as a feature vector, taking both the geometry of the cavity and physico-chemical properties into account; see e.g. [36, 35]. Subsequently, traditional or specialized measures can be applied to these vectors to obtain similarity scores between protein cavities [56, 57].

In the experiments we selected a representative method for each of the three groups: one graph-based measure, one geometric measure and one feature-based measure.
We also considered the original CavBase measure and a measure obtained from the Smith-Waterman protein sequence alignment. This led to a comparison of five different measures, four based on cavities and one based on sequence alignment. Below, these measures are explained in more detail:

Labeled Point Cloud Superposition (LPCS) [34]. This value is obtained by processing labeled point clouds. Hence, the CavBase data can be used directly without a need for transforming it into another representation. Intuitively, two labeled point clouds are considered similar if they can be spatially superimposed. More specifically, an approximate superposition of the two structures is obtained by fixing the first point cloud and moving the second point cloud as a whole. Two point clouds are well superimposed when each point in the first cloud can be matched with a point in the second point cloud, while the distances between these points are small and their labels consistent. This concept is used to define a fitness function that is maximized using a direct search approach [58]. The obtained maximal fitness is taken as the similarity between the two labeled point clouds. A similar measure was also proposed in [59], but there a convolution kernel is suggested to obtain similarities between the point clouds.

Maximum Common Subgraph (MCS) [50]. To use the MCS, the original representation in the form of a labeled point cloud must be transformed into a node-labeled and edge-weighted graph. Each pseudocenter becomes a node labeled with the corresponding physico-chemical property. To capture the geometry, a complete graph is considered, where each edge is weighted with the Euclidean distance between the two pseudocenters it connects. The problem of measuring similarity between protein cavities now boils down to the problem of measuring similarity between graphs.
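The graph construction described for the MCS measure can be sketched as follows. This is a minimal illustration only, not the CavBase or MCS implementation: the function name `cavity_to_graph`, the tuple layout of a pseudocenter and the toy coordinates are all assumptions made for the example.

```python
from itertools import combinations
from math import dist


def cavity_to_graph(pseudocenters):
    """Turn a labeled point cloud into a complete node-labeled, edge-weighted graph.

    pseudocenters: list of (label, (x, y, z)) tuples (hypothetical layout).
    Returns (nodes, edges), where nodes maps index -> label and
    edges maps (i, j) with i < j -> Euclidean distance.
    """
    nodes = {i: label for i, (label, _) in enumerate(pseudocenters)}
    edges = {
        (i, j): dist(pseudocenters[i][1], pseudocenters[j][1])
        for i, j in combinations(range(len(pseudocenters)), 2)
    }
    return nodes, edges


# Hypothetical toy cavity with three pseudocenters (coordinates in angstrom).
toy_cavity = [
    ("donor", (0.0, 0.0, 0.0)),
    ("acceptor", (3.0, 0.0, 0.0)),
    ("aromatic", (0.0, 4.0, 0.0)),
]
nodes, edges = cavity_to_graph(toy_cavity)
```

Storing each edge once under an ordered index pair keeps the graph undirected, which matches the symmetric nature of a distance.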
A well-known approach here is to search for the maximum common subgraph of the two input graphs and to define similarity as the size of the maximum common subgraph relative to the size of the larger graph. In case of noisy data, a threshold is required, defining two edges as equal if their weights differ by at most this threshold. In this paper, this parameter is set to 0.2 Å, as recommended by several authors [52, 60].

CavBase (CB) similarity [49]. CavBase also makes use of an algorithm for the detection of common subgraphs. Instead of considering the largest common subgraph, as done in the case of MCS, the 100 largest common subgraphs are considered. Each common subgraph is used to determine a transformation rule by means of the Kabsch algorithm [61], which superimposes both proteins. In a post-processing step the surface points are also superimposed according to the transformation rule, and a similarity score is derived using these surface points. Eventually, a set of 100 similarity values is obtained, from which the highest value is returned as the similarity between the two protein cavities.

Fingerprints (FP). Fingerprints are a well-known concept and have been used successfully in many domains. For the comparison of protein binding sites, the authors in [62] transformed the protein binding site into a node-labeled and edge-weighted graph as described above. Moreover, they generically defined a set of features, namely complete node-labeled and edge-weighted graphs of size 3. For each such feature, a test is performed to decide whether or not the feature is contained in the graph representing the protein. This is done by a subgraph isomorphism test, which checks whether the labels are identical. The nodes of the features are labeled by the set of physico-chemical properties.
Edges of patterns are labeled by intervals or bins and, instead of testing for equality, a test is performed whether the edge weight of the graph representing the protein falls into the bin of the pattern. The fingerprints thus generated are compared by means of the Jaccard similarity measure, as proposed by [56].

Smith-Waterman (SW). Besides using structure-based approaches to compare protein binding sites, we also used sequence alignment in our experimental study. To calculate sequence alignments we used the Smith-Waterman algorithm [63], parameterized with the BLOSUM62 matrix. From the sequence alignment we derived the sequence identity, which was subsequently used to perform the experiments.

2.3 Unsupervised ranking

In the introduction we explained that existing online services such as BLAST, PDB, Dali and CavBase construct rankings in an unsupervised way. These systems create a ranking by means of a similarity measure only, without training a model that uses annotated enzymes. The annotated enzymes in a database are simply ranked according to their similarity with a query enzyme of unknown function. In the case of CavBase, the enzymes with a high cavity-based similarity appear at the top of the ranking, while those exhibiting a low cavity-based similarity end up at the bottom. More formally, let us represent the similarity between a pair of enzymes by $K : V^2 \to \mathbb{R}^+$, where $V$ represents the set of all potential enzymes. Given the similarities $K(v, v')$ and $K(v, v'')$, we compose the ranking of $v'$ and $v''$ conditioned on the query $v$ as:

\[ v' \succeq_{K_v} v'' \;\Leftrightarrow\; K(v, v') \geq K(v, v''), \tag{1} \]

where $\succeq_{K_v}$ denotes the relation "is ranked higher than, for query $v$, based on the similarity $K$". Note that this is a relation between two enzymes conditioned on a third enzyme. In our context, there is no meaningful ranking possible between enzymes $v'$ and $v''$ without referring to another enzyme $v$.
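To make the unsupervised ranking of Eq. (1) concrete, the sketch below ranks a toy database by Jaccard similarity between fingerprints. It is an illustration under stated assumptions, not the pipeline used in the experiments: fingerprints are represented as plain sets of feature identifiers, and the enzyme names and feature ids are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two fingerprints given as sets of feature ids."""
    union = a | b
    # Convention chosen for this sketch: two empty fingerprints count as identical.
    return len(a & b) / len(union) if union else 1.0


def rank_database(query, database, similarity):
    """Sort database entries from most to least similar to the query (Eq. 1)."""
    return sorted(
        database,
        key=lambda name: similarity(query, database[name]),
        reverse=True,
    )


# Hypothetical fingerprints: each set holds the ids of the features that are present.
database = {
    "enzyme_a": {1, 2, 3, 4},
    "enzyme_b": {1, 2, 9},
    "enzyme_c": {7, 8},
}
query = {1, 2, 3}
ranking = rank_database(query, database, jaccard)
```

Any of the other similarity measures could be passed as `similarity` instead; the ranking step itself does not depend on how the similarity is computed and never looks at EC numbers.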
This approach adopts the same methodology as a nearest neighbor classifier, but a ranking rather than a class label should be seen as the output of the algorithm. The quality of such rankings can be evaluated when the database contains annotated enzymes and annotated queries. In an evaluation phase, we compare the obtained ranking with the ground truth ranking, which can be constructed from the EC numbers of annotated enzymes. This ground truth ranking can be deduced from the catalytic similarity (i.e., ground truth similarity) between the query and all database enzymes, by counting the number of successive matches, starting from the first digit, in the EC labels of the query and the database enzymes. Thus the catalytic similarity is a property of only a pair of enzymes. In contrast, in order to create the ground truth ranking of two enzymes, the catalytic similarity has to be calculated w.r.t. a third enzyme. For example, an enzyme with EC number EC 2.4.2.23 has a catalytic similarity of two compared to an enzyme labeled as EC 2.4.99.12, since both enzymes belong to the family of glycosyltransferases. Conversely, the same enzyme has a similarity value of only one with an enzyme labeled as EC 2.8.2.23. Both are transferases in this case, but they show no further relevant similarity in the chemistry of the reactions to be catalyzed. More formally, let us represent the catalytic similarity between two enzymes by a relation $Q : V^2 \to \{0, 1, 2, 3, 4\}$. $Q$ is defined by:

\[ Q(v, v') = \sum_{i=1}^{4} \prod_{j=1}^{i} q_i q_j, \]

where $q_i$ equals 1 if the $i$-th digit of the EC numbers of $v$ and $v'$ are the same and 0 otherwise. Because each $q_i$ is binary, the $i$-th product equals 1 exactly when the first $i$ digits all match, so $Q$ counts the leading digits shared by both EC numbers. Figure 1 gives an example for six enzymes, five of which correspond to a known EC number. The catalytic similarity $Q$ is depicted on the edges of the graph. The proposed algorithm allows us to infer for an unannotated query a ranking of the annotated enzymes, some of which the algorithm may not have encountered among the training data. Given the similarities $Q(v, v')$ and $Q(v, v'')$, we compose, similar to Eq. (1), the ground truth ranking of $v'$ and $v''$ conditioned on the query $v$ as:

\[ v' \succeq_{Q_v} v'' \;\Leftrightarrow\; Q(v, v') \geq Q(v, v''). \]

As a result, an entire ground truth ranking of database enzymes with known EC numbers can be constructed, given an annotated query enzyme.

Figure 1: Six enzyme structures are shown, five of which correspond to a known EC number. The catalytic similarity Q is depicted on the edges of the graph. The algorithm that we present allows us to infer for the unannotated query (denoted as EC ?.?.?.?) a ranking of the annotated enzymes. To this end, the unsupervised approach solely uses cavity-based similarity measures, whereas the supervised approach also takes the EC numbers of annotated enzymes into account.

2.4 Supervised ranking

In contrast to unsupervised ranking approaches, supervised algorithms do take ground truth information into account during a training phase. We perform experiments with so-called conditional ranking algorithms [64, 65], using the RankRLS implementation [42]. Let us introduce the short-hand notation $e = (v, v')$ to denote a couple consisting of a query enzyme $v$ and a database enzyme $v'$. RankRLS produces a linear basis function model of the type:

\[ h(e) = h(v, v') = \langle w, \Phi(e) \rangle, \tag{2} \]

in which $w$ denotes a vector of parameters and $\Phi(e)$ an implicit feature representation for the couple $e = (v, v')$. RankRLS differs from more conventional kernel-based methods because it optimizes a convex and differentiable approximation of the rank loss in bipartite ranking (i.e., area under the ROC curve) instead of the zero-one loss.
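Since the ground truth used during training is the catalytic similarity Q of Section 2.3, a small worked example may help. The sketch below evaluates the sum-of-products definition literally and reproduces the two EC examples given in the text. The function name `catalytic_similarity` and the string parsing are assumptions made for illustration.

```python
def catalytic_similarity(ec_a, ec_b):
    """Catalytic similarity Q: number of leading EC digits shared by two EC numbers."""
    digits_a = ec_a.replace("EC", "").strip().split(".")
    digits_b = ec_b.replace("EC", "").strip().split(".")
    if len(digits_a) != 4 or len(digits_b) != 4:
        raise ValueError("expected four-level EC numbers such as 'EC 2.4.2.23'")
    # q[i] is 1 when the (i+1)-th digits agree and 0 otherwise.
    q = [int(x == y) for x, y in zip(digits_a, digits_b)]
    total = 0
    for i in range(1, 5):
        product = 1
        for j in range(1, i + 1):
            product *= q[i - 1] * q[j - 1]
        total += product
    return total


same_family = catalytic_similarity("EC 2.4.2.23", "EC 2.4.99.12")
same_class_only = catalytic_similarity("EC 2.4.2.23", "EC 2.8.2.23")
```

Note that in the second pair the third and fourth digits coincide, yet they do not contribute, because the mismatch at the second level zeroes every later product.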
Together with the standard L2 regularization term on the parameter vector $w$ and a regularization parameter $\lambda$, the following loss is minimized:

\[ L(h, T) = \sum_{v \in V} \sum_{e, \bar{e} \in E_v} \bigl( Q_e - Q_{\bar{e}} - h(e) + h(\bar{e}) \bigr)^2, \tag{3} \]

for a given training set $T = \{ (e, Q_e) \mid e \in E \}$. Here $Q_e = Q(v, v')$ denotes the ground truth similarity as defined above, $E$ the set of training query-object couples for which ground truth information is available, and $E_v$ the subset of $E$ containing results for the query $v$. The outer sum in Eq. (3) takes all queries into account, and the inner sum analyzes all pairwise differences between the ranked results for a given query. This loss can be minimized in a computationally efficient manner, using analytic shortcuts and gradient-based methods, as shown in [64, 65]. According to the representer theorem [66], one can rewrite Eq. (2) in the following dual form:

\[ h(e) = \langle w, \Phi(e) \rangle = \sum_{\bar{e} \in E} a_{\bar{e}} K^{\Phi}(e, \bar{e}), \]

with $K^{\Phi}(e, \bar{e})$ a kernel function with four enzymes as input and $a_{\bar{e}}$ the weights in the dual space. In this paper we adopt the Kronecker product feature mapping, containing information on couples of enzymes:

\[ \Phi(e) = \Phi(v, v') = \phi(v) \otimes \phi(v'), \]

with $\phi(v)$ a feature mapping of an individual enzyme and $\otimes$ the Kronecker product. One can easily show that this pairwise feature mapping yields the Kronecker product pairwise kernel in the dual representation:

\[ K^{\Phi}(e, \bar{e}) = K^{\Phi}(v, v', \bar{v}, \bar{v}') = K^{\phi}(v, \bar{v}) \, K^{\phi}(v', \bar{v}'), \]

with $K^{\phi}$ a traditional kernel for enzymes. Specifying a universal kernel for $K^{\phi}$ leads to a universal kernel for $K^{\Phi}$ [67], indicating that one can use the kernel to represent any arbitrary relation, provided that the learning algorithm has access to training data of sufficient quality. This kernel was introduced in [68] for modelling protein-protein interactions.
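The identity behind the Kronecker product pairwise kernel can be checked numerically on a toy example. The sketch below assumes explicit, low-dimensional feature vectors and a linear base kernel, purely for illustration; in the paper the feature maps are implicit and the base kernels come from the similarity measures. All names (`dot`, `kron`, `pairwise_kernel`, `predict`) and all numbers are hypothetical.

```python
def dot(x, y):
    """Inner product, used here as a stand-in linear base kernel."""
    return sum(a * b for a, b in zip(x, y))


def kron(x, y):
    """Kronecker product of two vectors, returned as a flat list."""
    return [a * b for a in x for b in y]


def pairwise_kernel(v, v_prime, v_bar, v_bar_prime, base_kernel=dot):
    """Kronecker product pairwise kernel: product of two base kernel values."""
    return base_kernel(v, v_bar) * base_kernel(v_prime, v_bar_prime)


def predict(couple, training_couples, dual_weights, base_kernel=dot):
    """Dual-form prediction: weighted sum of pairwise kernel values."""
    v, v_prime = couple
    return sum(
        a * pairwise_kernel(v, v_prime, v_bar, v_bar_prime, base_kernel)
        for a, (v_bar, v_bar_prime) in zip(dual_weights, training_couples)
    )


# Hypothetical explicit feature vectors for four enzymes.
phi_v, phi_v_prime = [1.0, 2.0], [0.5, -1.0, 3.0]
phi_v_bar, phi_v_bar_prime = [2.0, 0.0], [1.0, 1.0, 1.0]

# Inner product of the explicit Kronecker features versus the kernel shortcut.
explicit = dot(kron(phi_v, phi_v_prime), kron(phi_v_bar, phi_v_bar_prime))
implicit = pairwise_kernel(phi_v, phi_v_prime, phi_v_bar, phi_v_bar_prime)
score = predict((phi_v, phi_v_prime), [(phi_v_bar, phi_v_bar_prime)], [0.2])
```

The two values agree, which is why the pairwise kernel can be evaluated from two base kernel values without ever forming the Kronecker feature vector.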
We consider this kernel because of its universal approximation property, but other pairwise kernels also exist, such as the Cartesian pairwise kernel [69], the metric learning pairwise kernel [70] and the transitive pairwise kernel [71, 64]. Nonetheless, such kernels only yield an improvement if the concepts to be learned satisfy the restrictions that are imposed by the kernels [67].

With the exception of the FP measure, none of the similarity measures discussed in Section 2.2 are, strictly speaking, valid kernels. Using the above construction, all similarity measures can be converted into kernels of type $K^{\phi}$ once they are made symmetric and positive definite. These properties guarantee a numerically stable and unique solution of the learning algorithm. We simply enforced symmetry by averaging the similarity matrix with its transpose. Subsequently, we made the different similarity matrices positive definite by performing an eigenvalue decomposition and setting all eigenvalues smaller than $10^{-10}$ equal to zero. This method leads to a negligible loss of information compared to the numerical accuracy of our algorithms and data storage. Finally, each kernel matrix was normalized so that all diagonal elements have a value equal to one. Since these procedures were performed on the whole data set, one arrives at a so-called transductive learning setting [72]. Minor adjustments would yield a more traditional inductive learning setting. Note that this procedure does not cause overfitting, since the EC numbers of the enzymes in the data set are not taken into account. Since the catalytic similarity is a symmetric measure, we also apply a post-processing step to the output of our algorithm: the matrix with the predicted values used for ranking the enzymes is made symmetric by averaging it with its transpose.
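The three preprocessing steps (symmetrization, eigenvalue clipping, diagonal normalization) can be sketched with NumPy as below. This is a minimal sketch, not the authors' code. In particular, the text only states that the diagonal is normalized to one; dividing each entry by the square root of the product of the corresponding diagonal entries is an assumed, commonly used way to achieve that, and it presumes that all diagonal entries remain positive after clipping. The function name and the toy matrix are hypothetical.

```python
import numpy as np


def similarity_to_kernel(similarity, eig_threshold=1e-10):
    """Convert a square similarity matrix into a symmetric, normalized kernel matrix."""
    s = np.asarray(similarity, dtype=float)
    # Step 1: enforce symmetry by averaging with the transpose.
    s_sym = 0.5 * (s + s.T)
    # Step 2: eigenvalue decomposition, zeroing eigenvalues below the threshold.
    eigvals, eigvecs = np.linalg.eigh(s_sym)
    eigvals = np.where(eigvals < eig_threshold, 0.0, eigvals)
    k = (eigvecs * eigvals) @ eigvecs.T
    # Step 3 (assumed formula): rescale so that every diagonal element equals one.
    d = np.sqrt(np.diag(k))
    return k / np.outer(d, d)


# Hypothetical asymmetric, indefinite similarity matrix for three enzymes.
toy_similarity = [
    [1.0, 1.0, -0.9],
    [0.8, 1.0, 0.9],
    [-0.9, 0.9, 1.0],
]
kernel = similarity_to_kernel(toy_similarity)
```

Because the whole matrix is decomposed at once, query and database enzymes are processed together, which corresponds to the transductive setting mentioned in the text.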
2.5 Performance measures for ranking

The ranking obtained with unsupervised or supervised learning algorithms can be compared to the ground truth ranking by applying performance measures that are commonly used in information retrieval. First of all, the ranking accuracy (RA) is considered, and it is defined as follows:

RA = \frac{1}{|V|} \sum_{v \in V} \frac{1}{|\{ E_v \mid y_e > y_{\bar{e}} \}|} \sum_{e, \bar{e} \in E_v : y_e \ldots}