Faster graphical model identification of tandem mass spectra using peptide word lattices

Shengjie Wang, John T. Halloran, Jeff A. Bilmes, and William S. Noble
University of Washington, Seattle, Washington, USA

Abstract. Liquid chromatography coupled with tandem mass spectrometry, also known as shotgun proteomics, is a widely-used high-throughput technology for identifying and quantifying proteins in complex biological samples. Analysis of the tens of thousands of fragmentation spectra produced by a typical shotgun proteomics experiment begins by assigning to each observed spectrum the peptide that is hypothesized to be responsible for generating the spectrum. This assignment is typically done by searching each spectrum against a database of peptides. We have recently described a machine learning method, Dynamic Bayesian Network for Rapid Identification of Peptides (DRIP), that not only achieves state-of-the-art spectrum identification performance on a variety of datasets but also provides a trainable model capable of returning valuable auxiliary information regarding specific peptide-spectrum matches. In this work, we present two significant improvements to DRIP. First, we describe how to use word lattices, which are widely used in natural language processing, to significantly speed up DRIP's computations. To our knowledge, all existing shotgun proteomics search engines compute independent scores between a given observed spectrum and each possible candidate peptide from the database. The key idea of the word lattice is to represent the set of candidate peptides in a single data structure, thereby allowing sharing of redundant computations among the different candidates. We demonstrate that using lattices in conjunction with DRIP leads to speedups on the order of tens across yeast and worm data sets.
Second, we introduce a variant of DRIP that uses a discriminative training framework, performing maximum mutual information estimation rather than maximum likelihood estimation. This modification improves DRIP's statistical power, enabling us to increase the number of identified spectra at a 1% false discovery rate on yeast and worm data sets.

1 Introduction

The most widely used high-throughput technology to identify and quantify proteins in complex mixtures is shotgun proteomics, in which proteins are enzymatically digested, separated by micro-capillary liquid chromatography, and subjected to two rounds of mass spectrometry. The primary output of a shotgun proteomics experiment is a collection of, typically, tens of thousands of fragmentation spectra, each of which ideally corresponds to a single generating peptide. The first, and arguably the most important, task in interpreting such data is to identify the peptide responsible for generating each observed spectrum. The most accurate methods to solve this spectrum identification problem employ a database of peptides derived from the genome of the organism of interest (reviewed in [13]). Given an observed spectrum, peptides in the database are scored, and the top-scoring peptide is assigned to the spectrum. We recently proposed a machine learning method, called DRIP, that solves the spectrum identification problem using a dynamic Bayesian network (DBN) [4]. In this model, the "time" axis of the DBN corresponds to the mass-to-charge (m/z) axis of the observed spectrum. The model uses Viterbi decoding to align the observed spectrum to a theoretical spectrum derived from a given candidate peptide, while adjusting the corresponding score to account for insertions (a peak in the observed spectrum that is absent from the theoretical spectrum) and deletions (a peak in the theoretical spectrum that is absent from the observed spectrum).
In DRIP, observed peaks are scored using Gaussians positioned along the m/z axis, the parameters of which are learned using a training set of high-confidence peptide-spectrum matches (PSMs). This approach allows the model to score observed spectra in their native resolution, without quantization of the m/z axis. The current work introduces DRIP to the computational biology community and describes several important improvements to the method.

– First, and most significantly, we describe how to use word lattices [7] to make DRIP more efficient. Word lattices are widely used in natural language processing to jointly represent a large collection of natural language strings. In the context of shotgun proteomics, a word lattice can be used to compactly represent the collection of candidate peptides (i.e., peptides whose masses are close to the precursor mass) associated with an observed fragmentation spectrum. Using the lattice during Viterbi decoding allows for the sharing of computation among the different candidate peptides. Without loss of performance, the lattice approach provides a significant reduction in computational expense, ranging from 85–93% in the yeast and worm data sets that we examine here. Notably, this general approach to sharing computations among candidate peptides is applicable to any scoring function that can be framed as a dynamic programming operation.

– Second, facilitated by the incorporation of word lattices, we extend DRIP's learning framework to use discriminative training. Thus, rather than performing maximum likelihood estimation on a training set of high-confidence PSMs, DRIP performs maximum mutual information estimation to discriminate between the high-confidence PSMs and a large "background" collection of candidate PSMs. Empirical evidence suggests that this discriminative approach provides an improvement in statistical power relative to the generatively trained version of DRIP.
– Third, we introduce several improvements to the model, including removing the need to specify several hyperparameters (the maximum numbers of allowed insertions and deletions per match) a priori, and a calibration procedure that allows joint ranking of PSMs with different charge states.

The final DRIP model provides an efficient and accurate method for assigning peptides to observed fragmentation spectra from a shotgun proteomics experiment, all implemented within a rigorously probabilistic and easily extensible modeling paradigm.

2 Overview of Tandem Mass Spectrometry

Fig. 1: A typical shotgun proteomics experiment.

A typical shotgun proteomics experiment (Figure 1) begins by cleaving proteins into peptides using a digesting enzyme, such as trypsin. The resulting peptides are then separated via liquid chromatography and injected into the mass spectrometer. In the first round of mass spectrometry, the mass and charge of the intact precursor peptide are measured. Peptides are then fragmented, and the fragments undergo a second round of mass spectrometry. The intensity of each observed peak in the resulting fragmentation spectrum is roughly proportional to the abundance of a single fragment ion with a particular m/z value. Formally, we can represent the spectrum identification problem as follows. Let P be the set of all possible peptide sequences. Given an observed spectrum s with precursor mass m_s and precursor charge c_s, and given a database of peptides D ⊆ P, we wish to find the peptide x ∈ D responsible for generating s.
Using the precursor mass and charge, we may constrain the set of peptides to be scored by setting a mass tolerance threshold, w, such that we score the set of candidate peptides

$$D(m_s, c_s, \mathcal{D}, w) = \left\{ x : x \in \mathcal{D},\ \left|\frac{m(x)}{c_s} - m_s\right| \le w \right\}, \qquad (1)$$

where m(x) denotes the mass of peptide x. Denoting an arbitrary scoring function as ψ(x, s), the spectrum identification problem requires finding

$$x^* = \operatorname*{argmax}_{x \in D(m_s, c_s, \mathcal{D}, w)} \psi(x, s). \qquad (2)$$

3 Benchmark methods

To score a peptide x with respect to a spectrum s, most scoring algorithms first construct a theoretical spectrum comprised of the peptide's expected fragment ions. Denoting the length of x as n and, for convenience, letting ñ = n − 1, x = x_0 x_1 … x_ñ is a string. The fragment ions of x are shifted prefix and suffix sums called b- and y-ions, respectively. Denoting these two as functions b(·,·,·) and y(·,·,·), we have

$$b(x, c_b, k) = \operatorname{round}\!\left(\frac{\sum_{i=0}^{k-1} m(x_i) + c_b}{c_b}\right), \qquad y(x, c_y, k) = \operatorname{round}\!\left(\frac{\sum_{i=n-k}^{\tilde{n}} m(x_i) + 18 + c_y}{c_y}\right), \qquad (3)$$

where c_b and c_y are charges related to the precursor charge state. The b-ion offset corresponds to the mass of a c_b-charged hydrogen atom, while the y-ion offset corresponds to the mass of a water molecule plus a c_y-charged hydrogen atom. When c_s ∈ {1, 2}, c_b and c_y are unity because, for any b- and y-ion pair, an ion with no charge is undetectable in the mass spectrometer, and it is unlikely for both charges of a +2 charged precursor ion to end up on a single fragment ion. When c_s ≥ 3, we search both singly and doubly charged ions, so that c_b, c_y ∈ {1, 2}. Denoting the number of unique b- and y-ions as n_x and, for convenience, letting ñ_x = n_x − 1, our theoretical spectrum is thus a sorted vector v_x = (v_0, …, v_{ñ_x}) consisting of the unique b- and y-ions of x. We benchmark DRIP's performance relative to four widely used search algorithms.
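As a concrete illustration, the b- and y-ion computation of Equation 3 can be sketched as follows. This is a minimal sketch: the residue mass table is truncated to a few amino acids (a real implementation would cover all twenty, with exact proton and water masses rather than the integer offsets of Equation 3).

```python
# Hypothetical, truncated monoisotopic residue mass table.
AA_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
           "V": 99.06841, "L": 113.08406, "K": 128.09496, "R": 156.10111}

def b_ion(x, c_b, k):
    """b(x, c_b, k): rounded, charged prefix mass of the first k residues (Eq. 3)."""
    return round((sum(AA_MASS[a] for a in x[:k]) + c_b) / c_b)

def y_ion(x, c_y, k):
    """y(x, c_y, k): rounded, charged suffix mass of the last k residues plus water (Eq. 3)."""
    return round((sum(AA_MASS[a] for a in x[len(x) - k:]) + 18 + c_y) / c_y)

def theoretical_spectrum(x, charges=(1,)):
    """The sorted vector v_x of unique b- and y-ions over fragment lengths 1..n-1."""
    ions = {f(x, c, k) for c in charges for k in range(1, len(x))
            for f in (b_ion, y_ion)}
    return sorted(ions)
```

For example, `theoretical_spectrum("GAS")` yields the four unique singly charged b- and y-ions of the tripeptide GAS.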
The score function used by the first such algorithm, MS-GF+ [10], is a scalar product between a heavily preprocessed observed spectrum, s̃, and a binary theoretical vector of equal length, u. MS-GF+ then ranks all the candidate peptides for a given spectrum by calculating the p-value of uᵀs̃ over a distribution of scores for all peptides with equal precursor mass. Although the number of distinct peptide sequences with masses close to the observed precursor mass is large (on the order of 10^20 in many cases), the linearity of uᵀs̃ allows the full distribution of scores of all such peptides to be computed efficiently using dynamic programming. MS-GF+ version 9980 was used, and the E-value score was used for scoring of spectra. The remaining three benchmark algorithms are variants of SEQUEST [2], each implemented in Crux v2.1 [12]. Like MS-GF+, the SEQUEST algorithm begins by quantizing and preprocessing the observed spectrum into a vector, s̃. A vector of equal length, ũ, is constructed based on the theoretical spectrum, and XCorr takes the form

$$\mathrm{XCorr}(s, x) = \tilde{u}^T \tilde{s} - \frac{1}{151} \sum_{\tau=-75}^{75} \tilde{u}^T \tilde{s}_\tau = \tilde{u}^T\!\left(\tilde{s} - \frac{1}{151} \sum_{\tau=-75}^{75} \tilde{s}_\tau\right) = \tilde{u}^T \tilde{s}', \qquad (4)$$

where s̃_τ denotes the vector shifted by τ m/z units. Thus, XCorr is a foreground (scalar product) minus a background score. We report results from (1) the XCorr score as implemented in Tide [1], (2) the XCorr E-value computed by Comet [3] by fitting, for each spectrum, an exponential distribution to the candidate peptide XCorr scores, and (3) the XCorr p-value computed by Tide using a dynamic programming approach similar to that used by MS-GF+ [6].
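The foreground-minus-background structure of Equation 4 can be sketched as follows. This is an assumption-laden sketch, not the Tide/Comet implementation: the vectors are hypothetical, and a circular shift (`np.roll`) stands in for the zero-padded shift used by real XCorr code.

```python
import numpy as np

def xcorr(u, s, tau_max=75):
    """XCorr (Eq. 4): the foreground dot product u.s minus the mean dot
    product over the 2*tau_max+1 background shifts of s.
    u: binary theoretical vector; s: preprocessed observed vector."""
    background = np.zeros(len(s))
    for tau in range(-tau_max, tau_max + 1):
        # Assumption: circular shift as a stand-in for the zero-padded shift.
        background += np.roll(s, tau)
    s_prime = s - background / (2 * tau_max + 1)  # s' in Eq. 4
    return float(np.dot(u, s_prime))
```

Note that a flat observed spectrum scores exactly zero, since the background average then equals the foreground, illustrating why XCorr rewards peaks that stand out from their local neighborhood.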
4 Dynamic Bayesian Network for Rapid Identification of Peptides (DRIP)

A graphical model is a formal representation of the factorization of the joint distribution governing a set of random variables via a graph, the nodes of which denote random variables and the edges of which denote potential conditional dependence between nodes. This formalization enables a host of tractable inference algorithms while offering incredible modeling flexibility. A Bayesian network is a graphical model defined over directed acyclic graphs, and a dynamic Bayesian network (DBN) is a Bayesian network defined over variable-length temporal sequences. The basic time unit of a DBN, determined by the time units of the temporal process being modeled, is called a frame and consists of a group of nodes and edges amongst these nodes. A DBN is often defined in terms of a template, where the first and last frames are called the prologue and epilogue, respectively, and where the chunk is expanded to occupy all frames in between. The template of DRIP is depicted in Figure 2.

Fig. 2: Graph of DRIP.

The DRIP model itself represents the observed spectrum, and each frame in DRIP corresponds to a single observed peak. The theoretical spectrum is hidden and traversed from left to right as follows. Denoting the number of frames as n_s and, for convenience, letting ñ_s = n_s − 1 and t be an arbitrary frame, 0 ≤ t ≤ ñ_s, K_t denotes the index of the current theoretical peak, which is a deterministic function of its parents such that K_0 = δ_0 and K_t = K_{t−1} + δ_t for t > 0, where δ_t is a multinomial random variable. Thus, δ_t denotes the number of theoretical peaks we traverse in subsequent frames. Furthermore, the parents of δ_t, namely ñ_x and K_{t−1}, constrain it from being larger than the number of remaining theoretical peaks left to traverse.
v_x(K_t) is thus the K_t-th theoretical peak. The variables O^mz_t and O^in_t are the observed m/z and intensity values, respectively, of the t-th observed peak, where O^mz_t is scored using a Gaussian centered near v_x(K_t), and O^in_t is scored using a Gaussian whose learned variance is larger than that of the m/z Gaussians, thus prioritizing matching well along the m/z axis over matching high-intensity peaks. Parent to these observed variables is i_t, a Bernoulli random variable denoting whether an observed peak is considered an insertion (i_t = 1) or not. When i_t = 1, a constant penalty is returned rather than scoring the observations with the Gaussians dictated by K_t; otherwise, scores could become arbitrarily bad by allowing Gaussians to score m/z observations far from their means, which would render scores incomparable (a single insertion would make an otherwise excellent alignment terrible). Furthermore, to enforce that some observations be scored rather than inserted, i_t forces its child δ_{t+1} to be zero in the frame following an insertion. Because δ_t and i_t are hidden, DRIP considers all possible alignments, i.e., all possible combinations of insertions and of theoretical peaks scoring observed peaks. The Viterbi path, i.e., the alignment that maximizes the log-likelihood of all the random variables, is used to score a peptide as well as to determine which observed peaks are most likely insertions and which theoretical peaks are most likely deletions. Note that the version of DRIP used in this work (Figure 2) is somewhat simplified relative to the previously described DRIP model [4]. In particular, the previously described version of DRIP required two user-specified quantities, corresponding to the maximum allowable numbers of insertions and deletions.
These constraints were enforced via two chains of variables, which kept track of the number of utilized insertions and deletions in an alignment, counting down in subsequent frames. The current, simplified model automatically determines these two quantities on the basis of the deletion and insertion probabilities as well as the insertion penalties. This simplification comes at no detriment to performance or running time (data not shown).

4.1 Calibrating DRIP with respect to charge

Observed fragmentation spectra with low-resolution precursor data often have indeterminate charge states. For such spectra, all candidates are scored and ranked under each possible charge state, with the highest-scoring peptide amongst all charges returned. This approach requires that scores among differently charged spectra be comparable to one another. In DRIP, however, this is not the case. Because the number of theoretical peaks essentially doubles when moving from charge 2+ to charge 3+, higher-charged PSMs have denser theoretical spectra and thus much better scores than lower-charged PSMs, rendering differently charged PSMs incomparable. In order to alleviate this problem, we recalibrate scores on a per-charge basis, projecting differently charged PSM scores to roughly the same range. The procedure consists of first generating a secondary set of decoy peptides disjoint from the target peptide set and the primary decoy peptide set. Let D_t be the set of target peptides, D_d the set of decoy peptides, and D_dd our new set of decoy peptides, such that D_dd ∩ (D_t ∪ D_d) = ∅. Next, all spectra and charge states are searched, returning top-ranking sets of PSMs X_t ⊆ D_t, X_d ⊆ D_d, and X_dd ⊆ D_dd. We then partition X_dd based on charge. For each charge c and corresponding partition, Z_dd, we rank all PSMs based on score.
Let Z_t ⊆ X_t be all PSMs of charge c, whose scores we linearly interpolate onto the normalized ranks in Z_dd, using these interpolated normalized ranks as their new calibrated scores. In order to recalibrate scores greater than those found in Z_dd, we use the 99.99th percentile score and the maximal score for linear interpolation. Once this procedure is done, the majority of recalibrated scores for PSMs in Z_t and Z_d lie in [0, 1], with a few barely above unity (this is largely true for the targets, since a set of well-expressed and identifiable target peptides typically outscores all decoys), thus greatly decreasing the dynamic range of scores. With these recalibrated scores, we may easily take the top-ranking PSM of an observed spectrum amongst differently charged PSMs, without any loss in overall accuracy.

5 Word Lattices

Word lattices (abbreviated as lattices in this section) represent data instances compactly in graphical structures and have seen great success in the fields of natural language processing and speech recognition. For example, natural language dictionaries can be stored in lattices for more efficient querying; in speech recognition, lattices constructed out of the top phone/word-level hypotheses can be used to rescore and select the best hypothesis effectively. A lattice over an alphabet Σ is a directed graph G = (N, E, s, t), where N is the node set, E is the edge set, and s, t ∈ N denote the source and target node, respectively. Every edge e ∈ E is a tuple (n_1, n_2, α(e)), where n_1 and n_2 are the from-node and to-node, respectively, and α(e) ∈ Σ is the alphabet element encoded by e. A lattice encodes data on paths from s to t. Let P(G, s, t) denote the set of paths from s to t; then every p = e_1, e_2, …, e_{|p|}, p ∈ P(G, s, t), represents a sequence of characters, or a string, over alphabet Σ: α(e_1), α(e_2), …, α(e_{|p|}).
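The path-to-string correspondence can be illustrated with a small lattice in the spirit of Fig. 3. The node names below are hypothetical, and each edge symbol is a short string, as Fig. 3's alphabet allows; shared prefixes ("sea") and suffixes ("fu") are stored once each.

```python
def paths(edges, s, t):
    """Enumerate all strings spelled by s-to-t paths (DFS; assumes a DAG).
    edges: list of (from_node, to_node, symbol) tuples."""
    out = {}
    for n1, n2, sym in edges:
        out.setdefault(n1, []).append((n2, sym))
    def dfs(node, prefix):
        if node == t:
            yield prefix
        for nxt, sym in out.get(node, []):
            yield from dfs(nxt, prefix + sym)
    return sorted(dfs(s, ""))

# A lattice encoding "seattle", "seafood", "kungfu", and "tofu" (cf. Fig. 3):
edges = [("s", "n0", "sea"), ("n0", "t", "ttle"), ("n0", "t", "food"),
         ("s", "n1", "kung"), ("s", "n1", "to"), ("n1", "t", "fu")]
```

Here `paths(edges, "s", "t")` recovers exactly the four encoded words, using six edges instead of the twenty-two characters a disjoint representation would need.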
For notational simplicity, let P(G, e_1, e_2), where e_1, e_2 ∈ E, denote the set of paths starting at e_1 and ending at e_2. An alternative way to define a lattice is to consider it as a nondeterministic finite automaton (NFA) G = (Q, Σ, Δ, q_0, F), where Q denotes the set of states, which is equivalent to the node set in the directed graph definition, Δ is the transition function Q × Σ → 2^Q, q_0 is the initial state, and F = {f} is the final state. Here q_0 and f correspond to s and t in the directed graph definition, respectively. The NFA definition is useful when we consider the problem of constructing a lattice from a set of strings. Suppose we are given a set of strings X = {x_1, x_2, …, x_m}. The lattice representation of the strings, G(X), can be extremely compressed, as there may be a significant amount of redundant information shared among the elements of X. More specifically, if x_i and x_j share some substring in common, we can merge the shared parts into a sequence of common edges, instead of representing the same substring twice. Fig. 3 gives a lattice over four word strings, with the shared substrings collapsed into common edges to reduce the redundancy.

Fig. 3: An example lattice: encodes "seattle", "seafood", "kungfu", and "tofu"; the alphabet can be any string over [a-z] of length ≤ 4.

Fig. 4: Jump histogram (x-axis: length of jumps/deletions, 1–15; y-axis: counts, ×10^4).

The common edges of lattices not only save space for data representation, but also speed up complex operations on the encoded data. For this paper in particular, we focus on the task of dynamic graphical model inference with the Viterbi algorithm.
With lattices, we manage both to reduce the state space of the Viterbi algorithm and to apply smart pruning strategies more effectively, as we will discuss in detail later, resulting in order-of-magnitude speed-ups.

5.1 Lattice Construction

Constructing the optimal lattice from an input set of strings X over an alphabet Σ is a hard problem. The objective of the "optimal" lattice is task dependent. For example, for dictionary queries, the optimal lattice should have the minimal number of nodes/states, which dominate the computation time; for data compression, the optimal lattice should be minimal in overall size, in which case both the nodes and the edges matter. Moreover, some tasks do not require the lattice to be an "exact" representation of the input strings. For a set of strings X, an exact lattice is an NFA that accepts only the language L(G(X)) = X. For our task, which is to speed up graphical model training/inference, we choose the objective: construct a lattice G(X) = (N, E, s, t) such that G(X) is exact and |E| is minimized. As stated above, we can think of a lattice as an NFA, and it has been proven that NFA minimization in terms of the number of states/transitions is NP-hard to approximate within constant factors. To approximate the optimal lattice, we give the following algorithm, which is similar to the determinize-minimize approach to minimizing a DFA [14]. The resulting lattice G(X) is the minimum-state DFA of the same language X.

Algorithm 1 ConstructLattice(X, Σ)
1: G = ({s, t}, ∅, s, t).
2: Assign an ordering on Σ, and sort X.
3: for x ∈ X do
4:   Starting from s, traverse G with x. Stop when no matching edge can be found: suppose the non-matching character is x[i], and we stop at node n_0.
5:   Build a chain structure C out of x[i:], where each edge on the chain corresponds to one character.
6:   Merge C into G by merging the start node of C with n_0 and the end node of C with t.
7: end for
8: Run DFA minimization on G.
9: return G.

The for loop, which merges prefixes of the input strings, constructs a DFA out of X. Minimization of the constructed DFA can be thought of as a process that merges nodes sharing the same suffixes. Both merging prefixes and merging suffixes reduce the number of edges in the lattice, making the algorithm a powerful heuristic in practice, reducing the number of edges by up to 50%, as we will show in the results section. The complexity of the algorithm is bounded by the DFA minimization step. With Hopcroft's algorithm [5], the running time is O(|Σ| |X| l_max log(|X| l_max)), where l_max = max_{x ∈ X} |x|.

5.2 Representation of Lattices in Dynamic Graphical Models

Fig. 5: Representation of lattices in dynamic graphical models.

Fig. 6: Lattice DRIP model.

We can utilize dynamic graphical model structures to naturally traverse a lattice G and access the encoded data. In particular, for time frame t, we use three vertices to access the lattice: the lattice node vertex V_t, the lattice link vertex L_t, and the transition vertex T_t. The values of V_t and T_t decide the set of values that L_t can take on, and each value of L_t corresponds to one character in the encoded strings. Fig. 5 shows the lattice representation as dynamic graphical model structures. V_t = n_i and T_{t+1} = d (d ≥ 0) determine the set of nodes V_{t+1} can take: V_{t+1} ∈ {n_j ∈ N | ∃p ∈ P(G, n_i, n_j), |p| = d}. Together, V_t = n_i, V_{t+1} = n_j, and T_{t+1} = d determine the set of edges L_{t+1} ∈ {e ∈ E | p ∈ P(G, n_i, n_j), |p| = d, p[d − 1] = e}. In other words, L_{t+1} is a random variable ranging over all edges that go into n_j and can be reached from n_i by a path of length d. For the simple case, we may enforce T_t ∈ {0, 1}. If T_{t+1} = 0, we stay at the current node; if T_{t+1} = 1, L_{t+1} ranges over the set of outgoing edges from n_i, and V_{t+1} over the set of destination nodes of those outgoing edges. The "FIRST_NODE" vertex observes the source node value s and is used to initialize the time-dependent structure. The complexity of constructing the deterministic conditional probability table (CPT), which stores Pr(L_t | V_t, V_{t+1}, T_{t+1}), for traversing the lattice as stated above is |{(e_1, e_2) | e_1, e_2 ∈ E, ∃p ∈ P(G, e_1, e_2), |p| ≤ d_max}|, where d_max is the largest value T_t can take, or in other words, the maximum number of edges to "jump over". Note that the CPT can be very sparse, depending on the structure of the lattice, since there may not exist a path between two given nodes n_i and n_j. The value of d_max can vary based on the underlying graphical model. If only edge-by-edge traversal of the lattice is required, then d_max = 1. For DRIP, which we discuss in detail later, d_max = max_{x ∈ X} |x|, the length of the longest candidate peptide. However, we may constrain d_max to a value much smaller than max_{x ∈ X} |x|, since such long jumps are very unlikely in probability, thereby trading negligible performance loss for significant speed-ups. Note that this lattice representation suits any underlying dynamic graphical model that accesses data in a streaming-like manner (i.e., never goes back to access previous data).
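The sparse support of the deterministic CPT Pr(L_{t+1} | V_t, V_{t+1}, T_{t+1}) can be sketched by enumerating, for each start node and each jump length d ≤ d_max, the admissible final edges of length-d paths. This is a simplified sketch with an assumed `(from_node, to_node, symbol)` edge format, not DRIP's actual CPT code.

```python
def cpt_support(edges, d_max):
    """Map (n_i, n_j, d) -> set of indices of edges that end a length-d
    path from n_i to n_j, for all d <= d_max. The size of this map is the
    CPT-construction complexity |{(e1, e2) : exists p, |p| <= d_max}|."""
    adj = {}
    for i, (n1, _n2, _sym) in enumerate(edges):
        adj.setdefault(n1, []).append(i)
    support = {}
    def walk(start, node, d):
        if d >= d_max:
            return
        for ei in adj.get(node, []):
            n2 = edges[ei][1]
            # edge ei can be the last edge of a (d+1)-length path start -> n2
            support.setdefault((start, n2, d + 1), set()).add(ei)
            walk(start, n2, d + 1)
    for n in adj:
        walk(n, n, 0)
    return support
```

On a two-edge chain s → a → t with d_max = 2, the support contains the two single-edge entries plus one length-2 "jump" from s to t, illustrating how sparse the table stays when few node pairs are connected.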
Rather than accessing data in the traditional way, the lattice representation brings two major benefits: 1) various pruning strategies can be applied to significantly speed up the underlying graphical model; 2) the compressed representation makes feasible certain expensive learning methods that require access to the entire data set, such as discriminative training.

5.3 Lattice with DRIP

Naturally, we convert the input of DRIP, which is a set of theoretical peptide peaks v_x within a certain mass window around the precursor mass, into a peptide peak lattice G_p. The alphabet Σ_p for G_p contains all possible peak m/z values (in Daltons) rounded to the nearest integer. We build a lattice for each set of peptide candidates within a given mass window. The lattices then become the "database" in which to search for the best peptide candidate match. We do not count lattice construction time as overhead, since we can reuse the lattices for future queries, just like a database. Fig. 6 shows the complete graphical model for Lattice DRIP. The "THEO_INCREMENT" vertex δ_t, which controls the number of deletions in DRIP, behaves similarly to the transition random variable T_t in the lattice representation, and we feed the value of δ_t into T_t. "LAT_INDEX" keeps track of the number of edges from the current node to s. In the first frame, the maximum value T_0 can take is the maximum value of δ_t, which is capped at max_{x ∈ X} |x|. Over subsequent time frames, the maximum value of T_t decreases as the remaining lengths of the peptides decrease. Unlike in the original DRIP model, where the remaining length of each peptide candidate can be specified exactly because only one peptide is processed at a time, it is hard to map from the set of edges traversed so far to one specific peptide candidate, because multiple peptides can share those edges.
As an approximation, we use the remaining length of the longest peptide encoded in the lattice as the remaining length for all peptides. Theoretically, the Lattice DRIP score of a peptide is then a lower bound on the peptide's original DRIP score, since Pr(δ_t = a | max(δ_t) = b) < Pr(δ_t = a | max(δ_t) = c) if b > c > a. In practice, we find this effect negligible, as we will show in the results section. The value of L_t, which contains the set of m/z values of the b/y-ions, is fed into the "THEO_PEAK" vertex v_x(K_t) for scoring based on the m/z Gaussians and the intensity Gaussian. The lattice acts like a "database" for querying theoretical peaks, and it does not affect the mechanism of the underlying graphical model.

5.4 Speeding Up Graphical Models with Lattices

Lattices are compressed representations of the encoded data instances. By applying lattices in graphical models, the state space for accessing all the data instances is smaller than in the original case, where each data instance is accessed separately. An intuitive illustration is the "simple lattice", which contains |X| disjoint paths from s to t, each path corresponding to one data instance. The state space of the simple lattice is no different from accessing each data instance separately. By constructing the lattice, common structures in the simple lattice get merged into shared edges so that the state space becomes smaller. Depending on the input data instances, the state space reduction can be significant. As the data get larger, we would expect more shared structures, so that the sizes of the lattices grow only sublinearly. For the task of mass spectrometry, there are often hundreds of candidate peptides within a given mass window for one spectrum, of which up to around 50% of the peaks can be identified as redundant by the lattice.
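The edge-count saving from merging shared structure can be illustrated by comparing the "simple lattice" (one edge per character per string) against prefix merging alone, i.e., a trie; the suffix merging performed by the DFA-minimization step of Algorithm 1 reduces the count further. A minimal sketch:

```python
def simple_lattice_edges(strings):
    """Edges in the |X|-disjoint-paths 'simple lattice': one per character."""
    return sum(len(x) for x in strings)

def trie_edges(strings):
    """Edges after prefix merging only: each shared prefix is stored once."""
    trie, count = {}, 0
    for x in strings:
        node = trie
        for ch in x:
            if ch not in node:
                node[ch] = {}
                count += 1  # a new edge is created only for an unseen prefix extension
            node = node[ch]
    return count
```

For instance, "seattle" and "seafood" need 14 simple-lattice edges but only 11 trie edges, since the shared prefix "sea" is stored once; suffix merging would shrink this further when strings share endings.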
PTMs and other modifications to the peptides may enlarge the number of candidates to search tremendously, in which case lattices perform even more efficiently. Beam pruning, a heuristic that expands only a pruned set of hypotheses during graphical model inference, fits perfectly with lattices. Different beam pruning strategies, such as k-beam pruning, which preserves only the k best hypothesis states at each time frame t, shrink the state space of graphical model inference significantly. The hypothesis states of the original DRIP consist only of the various deletion/insertion sequences of a single peptide candidate. Therefore, when beam pruning strategies are applied, bad deletion/insertion sequences of that peptide get pruned away, yet we still end up evaluating every peptide candidate, even when the best deletion/insertion sequences of certain peptides do not match the spectrum well. With lattices, beam pruning methods force the peptide candidates to filter themselves collaboratively. When applying beam pruning with lattices, the state space consists of the deletion/insertion sequences of all the peptide candidates; badly scoring peptides get pruned early on, and we end up scoring only a subset of the candidates. Reducing the maximum value of T_t also decreases the search space of the lattice. As stated above, the complexity of the lattice-traversal deterministic CPT is |{(e_1, e_2) | e_1, e_2 ∈ E, ∃p ∈ P(G, e_1, e_2), |p| ≤ d_max}|. For DRIP, longer "jumps", or deletions, are assigned smaller probabilities. From Fig. 4, which depicts the histogram of deletion lengths, long jumps are unlikely for the best-matching candidate. Therefore, the maximum value of T_t can be set small, and rarely do we lose any good candidate. In practice, for Lattice DRIP we set the maximum allowed value of T_t to 20.
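A k-beam pruning pass of the kind described above can be sketched generically as follows. The `expand` and `score` functions are hypothetical placeholders, not DRIP's actual inference code; the point is that only the k best hypothesis states survive each frame, so poorly scoring peptide candidates are abandoned early.

```python
import heapq

def beam_viterbi(init_states, expand, score, n_frames, k):
    """Viterbi with k-beam pruning.
    init_states: starting hypothesis states; expand(state) -> successor states;
    score(state, t) -> log-score increment for reaching `state` at frame t.
    Returns the (state, score) pair of the best surviving hypothesis."""
    beam = {st: 0.0 for st in init_states}
    for t in range(n_frames):
        nxt = {}
        for st, sc in beam.items():
            for succ in expand(st):
                cand = sc + score(succ, t)
                if cand > nxt.get(succ, float("-inf")):
                    nxt[succ] = cand  # keep best score per state (Viterbi max)
        # prune: keep only the k highest-scoring hypothesis states
        beam = dict(heapq.nlargest(k, nxt.items(), key=lambda kv: kv[1]))
    return max(beam.items(), key=lambda kv: kv[1]) if beam else None
```

With a lattice, the states at frame t range over all candidate peptides at once, so the beam implicitly compares candidates against each other rather than pruning within a single candidate's alignments.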
There are numerous other ways to prune lattices to accelerate graphical model inference, thanks to the lattices' capability of identifying common structures within the data. Pruning lattices statically, as one prunes trees, or embedding probabilities into lattice edges to obtain pruning subject to certain distributions, may achieve good speed-ups; beam pruning and limiting the longest jump, as discussed in the results section, already gain us 7 to 15 times acceleration.

6 Training DRIP

DRIP is a highly trainable model amenable to learning its Gaussian parameters, which provides both a tool to explore the nonlinear m/z offsets caused by machine error and a significant boost in performance. For the overall training procedure, assume that we have a collection C of N i.i.d. pairs (s_i, x_i), where s_i is an observed spectrum and x_i the corresponding PSM we have strong evidence to believe generated s_i. Let θ be the set of parameters we would like to learn, in our case DRIP's Gaussian parameters. For generative training, we wish to find θ* = argmax_θ Σ_{i=1}^N p(s_i | x_i, θ), i.e., we wish to maximize DRIP's likelihood with respect to the parameters to be learned, achieved via expectation-maximization (EM). A much more difficult training paradigm is discriminative training, wherein we do not simply maximize the likelihood under a set of parameters but simultaneously minimize a parameterized distribution defined over a set of alternative hypotheses. In our case, this alternative set of hypotheses consists of all candidate peptides within the precursor mass tolerance not equal to x_i, i.e., all incorrect explanations of s_i.
More formally, our discriminative training criterion is that of Maximum Mutual Information Estimation (MMIE). Defining the set of candidate peptides for s_i within precursor mass tolerance w as C_i = D(m_s, c_s, D, w), and denoting the set of all training spectra and high-confidence PSMs as S and X, respectively, the function we would like to maximize with respect to θ is

I_θ(S; X) = E[ log ( p(s_i, x_i | θ) / ( p(s_i | θ) p(x_i | θ) ) ) ]
          = Σ_{s_i, C_i} p(s_i, C_i | θ) log ( p(s_i, x_i | θ) / ( p(s_i | θ) p(x_i | θ) ) )
          = Σ_{s_i, C_i} p(s_i, C_i | θ) log ( p(s_i | x_i, θ) p(x_i | θ) / ( p(s_i | θ) p(x_i | θ) ) )
          = Σ_{s_i, C_i} p(s_i, C_i | θ) log ( p(s_i | x_i, θ) / p(s_i | θ) )
          = Σ_{s_i, C_i} p(s_i, C_i | θ) log ( p(s_i | x_i, θ) / Σ_{x ∈ C_i} p(s_i, x | θ) ).    (5)

We approximate this objective using the quantity Ĩ_θ(S; X) = (1/N) Σ_{i=1}^N log ( p(s_i | x_i, θ) / Σ_{x ∈ C_i} p(s_i, x | θ) ), which converges to the quantity in Equation 5 for large N by the i.i.d. assumption and the weak law of large numbers. Our objective is then

max_θ Ĩ_θ(S; X) = max_θ (1/N) Σ_{i=1}^N log ( p(s_i | x_i, θ) / Σ_{x ∈ C_i} p(s_i, x | θ) )
                = max_θ (1/N) Σ_{i=1}^N ( log p(s_i | x_i, θ) − log Σ_{x ∈ C_i} p(s_i, x | θ) ),    (6)

where, for obvious reasons, we call M_n(s_i, x_i) = log p(s_i | x_i, θ) the numerator model and M_d(s_i) = log Σ_{x ∈ C_i} p(s_i, x | θ) the denominator model. Note that the numerator model is our objective function for generative training, and in general the sum over possible peptide candidates in the denominator model makes the discriminative training objective difficult to compute. However, by using lattices to efficiently perform the computation in the denominator model, we solve Equation 6 using stochastic gradient ascent.
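The approximated objective Ĩ_θ(S; X) can be computed stably in log space with a log-sum-exp over the denominator model. A minimal sketch, assuming the per-candidate log-likelihoods log p(s_i, x | θ) have already been produced by inference (the numbers below are placeholders, not real DRIP scores):

```python
import math

def log_sum_exp(vals):
    """Numerically stable log of a sum of exponentials."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def mmie_objective(numerator_ll, candidate_lls):
    """~I_theta(S; X): average over PSMs of the numerator model's
    log-likelihood minus the denominator model's log-sum over all
    candidate peptides in C_i (Equation 6)."""
    return sum(num - log_sum_exp(cands)
               for num, cands in zip(numerator_ll, candidate_lls)
               ) / len(numerator_ll)

# One spectrum whose correct peptide (log p = -2.0) dominates its
# candidate set; the objective is negative and approaches 0 as the
# correct peptide accounts for more of the denominator mass.
obj = mmie_objective([-2.0], [[-2.0, -5.0, -6.0]])
```

The objective is bounded above by 0, which it attains only when all denominator mass sits on the correct peptide.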
In stochastic gradient ascent, we calculate the gradient of the objective function with respect to a single training instance,

∇_θ Ĩ_θ(s_i; x_i) = ∇_θ M_n(s_i, x_i) − ∇_θ M_d(s_i),    (7)

where the gradients of M_n and M_d are vectors referred to as Fisher scores. Correspondingly, we update the parameters θ using the previous parameters plus a damped version of Equation 7, iterating this process until convergence. The overall algorithm is detailed in Algorithm 2. In practice, we begin the algorithm by initializing θ_0 to a good initial value, i.e., the output of generative training, and the learning rate η_j is updated with η_{j+1} = (√j)^{-1}. Intuitively, the gradients move in the direction maximizing the difference between the numerator and denominator models, encouraging improvement for the numerator while discriminating against the incorrect labels in the denominator. Our experimental results show that discriminative training positively influences performance (Section 8.3).

Algorithm 2 Discriminative Training via Stochastic Gradient Ascent
1: Initialize θ_0, η_0. Let j = 0.
2: while True do
3:   θ_{j+1} := θ_j; j := j + 1;
4:   Update learning rate η_j;
5:   for (s_i, x_i) ∈ C do
6:     θ_j := θ_j + η_j (∇_θ M_n(s_i, x_i) − ∇_θ M_d(s_i));
7:   end for
8:   if ||θ_j − θ_{j−1}|| / ||θ_{j−1}|| < ε then
9:     break;
10:  end if
11: end while

6.1 Discriminative Training with Lattices

Discriminative training is expensive to execute. The denominator model requires calculating the gradients over all possible candidate peptides C_i, which can be infeasible for many tasks; moreover, representing all possible labels in some graphical models, e.g. DRIP, is itself non-trivial. The difficulty of representing all possible labels in DRIP comes from the fact that it is hard to constrain the model to consider valid peptides only, as the distance between subsequent theoretical peaks can take on any value.
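Algorithm 2 can be sketched as below. Here `grad_num` and `grad_den` are abstract stand-ins for the Fisher scores ∇_θ M_n and ∇_θ M_d, which in DRIP would be computed by graphical model inference; the stopping tolerance `eps` and the toy test gradients are assumptions for illustration.

```python
import math

def discriminative_train(data, grad_num, grad_den, theta0, eps=1e-6):
    """Stochastic gradient ascent on the MMIE objective (Algorithm 2):
    damped numerator-minus-denominator updates, iterated until the
    relative parameter change falls below eps."""
    theta, j = list(theta0), 0
    while True:
        prev, j = list(theta), j + 1
        eta = 1.0 / math.sqrt(j)        # learning rate eta_j = j^(-1/2)
        for s, x in data:
            g_n = grad_num(s, x, theta)  # Fisher score of numerator model
            g_d = grad_den(s, theta)     # Fisher score of denominator model
            theta = [t + eta * (gn - gd)
                     for t, gn, gd in zip(theta, g_n, g_d)]
        change = sum((a - b) ** 2 for a, b in zip(theta, prev)) ** 0.5
        if change / max(sum(b * b for b in prev) ** 0.5, 1e-12) < eps:
            return theta
```

As a sanity check, a toy gradient difference of the form (target − θ), which corresponds to maximizing −(θ − target)²/2, drives θ to the target.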
Lattices work perfectly with discriminative training in solving both the scalability and the representability problem. The lattice of all possible labels can be very compact; together with the strategies for speeding up graphical models with lattices discussed in the previous section, discriminative training with lattices becomes feasible and efficient. The lattice is a general framework for representing hypotheses in any dynamic graphical model. The denominator model of discriminative training for DRIP is exactly the Lattice DRIP model. Even for graphical models that are capable of encoding all possible labels for discriminative training, lattices have the advantage of being more efficient and amenable to various modifications: we can encode any set of labels into the lattice, so discriminative training against any distribution is achievable.

7 Experimental methods

We benchmark all methods using three data sets: one low-resolution data set from the yeast S. cerevisiae consisting of 35,236 spectra (Yeast), one low-resolution data set from C. elegans consisting of 23,697 spectra (Worm-I), and one high-resolution C. elegans dataset consisting of 7,557 spectra (Worm-II). Further details regarding the Yeast and Worm-I datasets and corresponding target databases may be found in [8], and details regarding Worm-II and its database may be found in [6]. The datasets and target databases are available on the corresponding supplementary pages. In order to ensure that all methods score exactly the same peptides, each search engine was provided with a pre-digested set of peptide sequences, rather than intact protein sequences. Each protein database was digested using trypsin without suppression of cleavage by proline. Precursor charges range from 1+ to 3+ for Worm-I and Yeast and from 1+ to 5+ for Worm-II. For spectra with multiple charge states, the top-scoring PSM was chosen per method.
All search algorithms were run with settings as equivalent as possible: machine type was specified as CID, no allowance was made for missed cleavages or isotope errors, and a single fixed carbamidomethyl modification was used. For the two data sets with low-resolution precursors (Yeast and Worm-I), the mass tolerance w was set to ±3 Th, whereas for Worm-II w was set to ±10 ppm. Comet searches (XCorr E-values) used flanking peaks around each b- and y-ion, whereas Tide searches (XCorr and XCorr p-values) did not; these settings provided optimal performance for the two methods. All benchmark methods included neutral loss peaks in theoretical spectra, although these peaks are not modeled in DRIP.

A significant challenge in evaluating the quality of a spectrum identification algorithm is the absence of a "ground truth" data set in which the generating peptide is known for each observed spectrum. We therefore follow the standard approach of using decoy peptides (which in our case correspond to randomly shuffled versions of each real target peptide) to estimate the number of incorrect identifications in a given set of identified spectra. In this work, targets and decoys are scored separately and used to estimate the number of identified matches at a particular false discovery rate (FDR), i.e., the fraction of spectra improperly identified at a given significance threshold [9]. Because the FDR is not monotonic with respect to the underlying score, we instead use the q-value, defined to be the minimum FDR threshold at which a given score is deemed to be significant. Since data sets containing more than 10% incorrect identifications are generally not practically useful, we only plot q-values in the range [0, 0.1].

8 Results

8.1 Charge calibration improves statistical power
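The target-decoy q-value computation described above can be sketched as follows. This is a deliberately simplified version of the procedure (separate target/decoy scoring, no pseudocount or other corrections that production tools apply); the scores are made-up placeholders.

```python
def q_values(target_scores, decoy_scores):
    """Estimate an FDR for each target PSM score by counting decoys
    scoring at least as high, then convert FDRs to q-values (the minimum
    FDR threshold at which the score is significant) via a running minimum."""
    targets = sorted(target_scores, reverse=True)
    decoys = sorted(decoy_scores, reverse=True)
    fdrs, d = [], 0
    for rank, score in enumerate(targets, start=1):
        while d < len(decoys) and decoys[d] >= score:
            d += 1                        # decoys scoring at least `score`
        fdrs.append(min(d / rank, 1.0))   # estimated FDR at this threshold
    qs, running_min = [], float("inf")
    for fdr in reversed(fdrs):            # enforce monotonicity, starting
        running_min = min(running_min, fdr)   # from the loosest threshold
        qs.append(running_min)
    return qs[::-1]

# Four target PSMs and two decoys; one decoy outscores the third target.
print(q_values([10, 8, 6, 4], [7, 3]))    # [0.0, 0.0, 0.25, 0.25]
```

Note how the third target's raw FDR (1/3) is replaced by the smaller FDR (1/4) available at a more permissive threshold, which is exactly the non-monotonicity the q-value removes.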
Fig. 7: Charge variation results for all datasets: spectra identified as a function of q-value for DRIP, MS-GF+, XCorr p-value, XCorr, and XCorr E-value, on (a) Worm-I, (b) Yeast, and (c) Worm-II. Charge-recalibrated DRIP outperforms XCorr and XCorr E-value on all datasets, and outperforms all methods on Worm-I by a large margin. Though DRIP is outperformed by the two p-value methods on Yeast and Worm-II, this gap may be closed by gathering training data better representative of the corresponding spectra and their charge states (the high-quality training data used contains only charge 2+ PSMs), an effort that is currently underway. Furthermore, with the incorporation of lattices into DRIP, we now have a mechanism by which to sequence through entire sets of peptides. We believe this to be a critical step toward evaluating DRIP score thresholds with respect to arbitrary peptide sets, thus allowing the computation of DRIP p-values, for which a performance increase such as that from XCorr to XCorr p-value is expected. It is worth noting that while MS-GF+ and XCorr p-value evaluate the entire set of possible peptides equal to a given precursor mass, the utilization of lattices allows DRIP to evaluate an arbitrary set of peptides (even regardless of precursor mass), which is strictly more general and flexible.

8.2 Faster DRIP scoring using lattices

We first present results on the ability of lattices to compress peptides. We build lattices on four peptide datasets with theoretical peak values as the alphabet. Figure 8 shows the effectiveness of compression as the number of peptide candidates in the precursor mass window (size of ±3 Th) varies.
The compression ratio is defined to be the ratio of the number of edges in the lattices, built on mass windows containing a certain number of peptide candidates, to the number of theoretical peaks in the candidates in those windows. We note that the compression ratio decreases almost linearly as the number of peptides increases, and the reduction can exceed 50%. To illustrate the effectiveness of lattices in improving inference time, we test Lattice DRIP with the following beam pruning strategies and compare the results against DRIP (using the beam pruning settings described in [4]):

– lattice base: pruning with k-beams that are dynamic across time frames, with wider beams for the early part and narrower beams later on.
– lattice speed: pruning aggressively with dynamic k-beams (as in lattice base) but with narrower beams. Moreover, the pbeam pruning algorithm is applied, which prunes the state space while building up the inference structures.

Fig. 8: Lattice compression ratio: for mass windows containing a certain number of peptide candidates, the number of edges in the corresponding lattices divided by the number of peaks in the candidates, for the Yeast (left) and Worm (right) databases.

Fig. 9: Performance curves comparing the two lattice methods against the original DRIP, on Yeast (left) and Worm-I (right) with charge 2.

Table 1: Percentage running time of lattice methods compared to original DRIP.
dataset    lattice speed    lattice base    DRIP
Yeast      7.78%            14.80%          100%
Worm-I     8.59%            16.36%          100%

We note that while the space of beam pruning strategies is large, and it is possible that the two methods above are not optimal in terms of the absolute computational efficiency achievable using lattices, they give significant speed improvements over the original DRIP. To evaluate the speed-ups possible with lattices, we randomly select 50 spectra each from Yeast and Worm-I and record inference time. We use the same graphical model inference engine for all three methods. Experiments were carried out on a 3.40 GHz CPU with 16 GB memory. The lowest CPU time out of 3 runs is recorded, and we report the relative CPU time of the lattice methods compared to the original DRIP in Table 1. The timing tests show that, utilizing lattices, DRIP runs 7-15 times faster. Note that this comes at practically no expense to performance, as shown in Figure 9.

8.3 Discriminative training further boosts statistical power

As detailed in Section 6, we use a set of high-confidence, charge 2+ PSMs and their corresponding peptide database to discriminatively train DRIP, utilizing lattices in the denominator model. To illustrate the power afforded by DRIP's learning capabilities, we also show performance under hand-tuned parameters, in which DRIP's Gaussian means are placed at unit intervals along the m/z axis (representative of fixed binning strategies). Figures 10(a) and 10(b) show that discriminatively trained DRIP improves performance, especially at low q-values, arguably the most important region of performance. Note that the discriminatively trained model depicted employs the lattice base pruning strategy, and thus we achieve an increase in accuracy as well as an approximately seven-fold speed-up.
Fig. 10: Effects of training DRIP parameters: (a) Yeast, charge 2+; (b) Worm-I, charge 2+; (c) the means (Disc. Train, Gen. Train, Hand Tune) used to score observed peaks of three randomly chosen spectra.

In Figure 10(c), we further investigate the influence of the training methods on DRIP's m/z Gaussian means and their performance. Choosing three spectra from Yeast at random, four high-intensity peaks are displayed per spectrum, along with the means used to score these peaks. With the observed peaks plotted in black, the discriminatively trained Gaussian means lie much closer to the observed peaks than all other means, yielding better scoring and more accurate results.

9 Conclusions

In this paper, we present several significant improvements to the DRIP model. We show how to apply word lattices to compress peptide candidate sets in order to dramatically speed up inference in DRIP (a 7-15 fold speed-up). With the ability to compactly represent entire sets of peptides, we extend DRIP's learning framework to discriminative training, leading to performance gains at low q-values. We have also greatly simplified the DRIP model itself and added the ability to accurately search varying charge states per dataset. There are several avenues for future work. With the ability, afforded by lattices, to efficiently sequence through entire sets of peptides, we will investigate ways to take thresholds with respect to DRIP scores in order to compute p-values.
As evidenced by other scores for which exact p-values are computed (MS-GF+, XCorr p-value), this is expected to greatly increase DRIP's performance. We have also only scratched the surface of training the model's parameters. Collecting an assortment of training data, we will discriminatively train the model for a host of different charge states, machine types, and organisms in an effort to further increase accuracy over the wide array of tandem mass spectra encountered in practice. We also plan to explore ways to alleviate the labor-intensive process of collecting high-confidence training PSMs described in [11], in order to simplify the overall training process, beginning from data collection.

References

1. Diament, B., Noble, W.S.: Faster SEQUEST searching for peptide identification from tandem mass spectra. Journal of Proteome Research 10(9), 3871–3879 (2011), PMC3166376
2. Eng, J.K., McCormack, A.L., Yates, III, J.R.: An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of the American Society for Mass Spectrometry 5, 976–989 (1994)
3. Eng, J.K., Jahan, T.A., Hoopmann, M.R.: Comet: an open-source MS/MS sequence database search tool. Proteomics 13(1), 22–24 (2013)
4. Halloran, J.T., Bilmes, J.A., Noble, W.S.: Learning peptide-spectrum alignment models for tandem mass spectrometry. In: Uncertainty in Artificial Intelligence (UAI). AUAI, Quebec City, Quebec, Canada (July 2014)
5. Hopcroft, J.: An n log n algorithm for minimizing states in a finite automaton (1971)
6. Howbert, J.J., Noble, W.S.: Computing exact p-values for a cross-correlation shotgun proteomics score function. Molecular & Cellular Proteomics pp. mcp–O113 (2014)
7. Ji, G., Bilmes, J., Michels, J., Kirchhoff, K., Manning, C.: Graphical model representations of word lattices. In: IEEE/ACL Workshop on Spoken Language Technology (SLT), Palm Beach, Aruba (December 2006)
8. Käll, L., Canterbury, J., Weston, J., Noble, W.S., MacCoss, M.J.: A semi-supervised machine learning technique for peptide identification from shotgun proteomics datasets. Nature Methods 4, 923–925 (2007)
9. Käll, L., Storey, J.D., MacCoss, M.J., Noble, W.S.: Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. Journal of Proteome Research 7(1), 29–34 (2008)
10. Kim, S., Mischerikow, N., Bandeira, N., Navarro, J.D., Wich, L., Mohammed, S., Heck, A.J., Pevzner, P.A.: The generating function of CID, ETD, and CID/ETD pairs of tandem mass spectra: applications to database search. Molecular and Cellular Proteomics 9(12), 2840–2852 (2010)
11. Klammer, A.A., Reynolds, S.M., Bilmes, J.A., MacCoss, M.J., Noble, W.S.: Modeling peptide fragmentation with dynamic Bayesian networks for peptide identification. Bioinformatics 24(13), i348–i356 (July 2008), http://bioinformatics.oxfordjournals.org/cgi/content/abstract/24/13/i348
12. McIlwain, S., Tamura, K., Kertesz-Farkas, A., Grant, C.E., Diament, B., Frewen, B., Howbert, J.J., Hoopmann, M.R., Käll, L., Eng, J.K., et al.: Crux: rapid open source protein tandem mass spectrometry analysis. Journal of Proteome Research (2014)
13. Nesvizhskii, A.I.: A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. Journal of Proteomics 73(11), 2092–2123 (2010)
14. Watson, B.W.: A taxonomy of finite automata minimization algorithms. Computer Science Note 44 (1993)