A Preliminary Work on Evolutionary Identification of Protein Variants and New Proteins on Grids

February 23, 2026

📝 Abstract

Protein identification is one of the major tasks of proteomics researchers. It can be summarized as searching for the best match between an experimental mass spectrum and the proteins in a database. However, this approach cannot be used to identify new proteins or protein variants. In this paper, an evolutionary approach is proposed to discover new proteins or protein variants by means of a “de novo sequencing” method. This approach has been tested on a specific grid called Grid5000 with both simulated and real spectra.

📄 Content

arXiv:0804.1202v1 [q-bio.BM] 8 Apr 2008

Jean-Charles Boisson, Laetitia Jourdan and El-Ghazali Talbi
LIFL/INRIA Futurs-Université de Lille1, Bât M3-Cité Scientifique
{boisson,jourdan,talbi}@lifl.fr

Christian Rolando
Plateforme de Protéomique / Centre Commun de Spectrométrie de masse, 59655 Villeneuve d'Ascq Cedex, FRANCE
Christian.Rolando@univ-lille1.fr

Introduction

Proteomics can be defined as the global analysis of proteins. Protein identification is one of the major tasks of proteomics researchers, as it helps to understand the biological mechanisms in living cells. All current methods use data from mass spectrometers and generally give good results. In the case of protein variants or new proteins, however, these methods can only recognize a protein if it is stored in a database, and they cannot clearly explain why this protein differs from any other in the database. The aim of our approach is to find the entire sequence of a protein, even in the case of variants or unknown proteins. To do so, we need to identify the different peptides that compose the protein. First, their masses (their chemical formulas) have to be found from an MS spectrum; second, from their masses, their sequences can be found from MS/MS spectra. Once the peptides are known, the complete protein can be obtained.

This article is organized as follows. Section 2 deals with the specificities of the protein variant and new protein identification problems; section 3 describes our approach and the different algorithms that compose it; section 4 introduces the parallel framework; section 5 presents and discusses our results; finally, conclusions and perspectives on this work are provided.

The Positioning of the Protein Variants and New Proteins Identification Problem

The identification of new proteins and protein variants is a complex problem. All existing protein identification methods are based on two types of data: MS and MS/MS spectra (MS for Mass Spectrometry), which are mass/intensity spectra. An MS spectrum is obtained by extracting an experimental protein from a protein mixture, digesting it with a specific enzyme and analyzing it in a mass spectrometer. From an MS spectrum, databases make it possible to identify all the peptides by their masses. Techniques using MS spectra for protein identification are called peptide mass fingerprint (PMF) methods. Their scoring is based on the comparison of an experimental peptide mass list with a theoretical peptide mass list [5, 11]. They give good results, but they only find the protein closest to the experimental one, without further information.

A way to overcome the limitations of MS data is to also use MS/MS data (tandem mass spectrometry). Each peptide from the MS spectrum is selected and fragmented to obtain the corresponding MS/MS spectrum. The ions detected are characteristic of the structure of the parent peptide. Thus it is theoretically possible to obtain the sequence of each peptide from the digested protein. Combining MS data (masses of the peptides) with MS/MS data (partial sequences of the peptides) increases the accuracy of the PMF techniques [1, 9]. These scores use several properties of the ions obtained from MS/MS spectra in order to find amino acid sequences. With partial amino acid sequences and masses, proteins can be distinguished more easily than with masses only. However, this is not sufficient to identify unknown proteins.

An alternative method named de novo sequencing has been proposed, using tandem mass spectrometry. It works on random protein sequences in order to find the experimental one (without databases).
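
The peptide mass fingerprint comparison described in this section can be illustrated with a minimal sketch. It is not the scoring used in [5, 11]: the enzyme is assumed to be trypsin with a simplified cleavage rule (cut after K or R unless followed by P, no missed cleavages), the score is a plain fraction of matched masses, and the names `digest`, `peptide_mass` and `pmf_score`, the tolerance and the example sequences are all hypothetical.

```python
import re

# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def digest(protein):
    """Assumed trypsin rule: cleave after K or R, except before P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide):
    """Monoisotopic mass of a neutral peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def pmf_score(experimental_masses, protein, tol=0.2):
    """Fraction of experimental masses matched by a theoretical peptide."""
    theoretical = [peptide_mass(p) for p in digest(protein)]
    matched = sum(
        1 for m in experimental_masses
        if any(abs(m - t) <= tol for t in theoretical)
    )
    return matched / len(experimental_masses)


if __name__ == "__main__":
    reference = "MKWVTFISLLR"  # hypothetical database entry
    variant = "MKWVTFISLVR"    # same protein with one L -> V substitution
    observed = [peptide_mass(p) for p in digest(variant)]
    print(pmf_score(observed, reference))  # only one of two peptides matches
    print(pmf_score(observed, variant))    # every peptide matches
```

In this toy case the database entry is still the closest hit for the variant spectrum, but the score alone does not say which residue changed, which is the limitation noted above.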
In de novo sequencing, the identification is based on random peptides or on peptides resulting from an earlier identification (made by specific tools) [3, 4, 10, 13]. But the MS/MS data are so fragmented (the deduced sequences are limited) and the number of theoretical proteins that can be generated is so large that this kind of technique is only used on small amounts of data. This is referred to as de novo peptide sequencing. Furthermore, alignment tools such as Blast are necessary to find the closest peptide corresponding to the resulting sequence and to validate it. Evolutionary approaches have already been used as optimization methods to cope with the huge search space of the de novo peptide sequencing problem
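
To give a feel for the size of this search space: with 20 standard amino acids, a peptide of only 10 residues already has 20^10, roughly 10^13, candidate sequences. The sketch below shows one very simple way an evolutionary search could score random candidates against an MS/MS spectrum. It is not the authors' algorithm: it assumes singly charged b and y ions only, a peak-counting score, and a basic (1+1) mutate-and-keep loop; `fragment_ions`, `score`, `mutate` and `evolve` are hypothetical names.

```python
import random

# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728
ALPHABET = sorted(RESIDUE_MASS)


def fragment_ions(peptide):
    """Singly charged b and y ion m/z values for every cleavage site."""
    ions = []
    for i in range(1, len(peptide)):
        b = sum(RESIDUE_MASS[a] for a in peptide[:i]) + PROTON
        y = sum(RESIDUE_MASS[a] for a in peptide[i:]) + WATER + PROTON
        ions.extend((b, y))
    return ions


def score(candidate, spectrum, tol=0.05):
    """Number of observed peaks explained by a theoretical b or y ion."""
    ions = fragment_ions(candidate)
    return sum(
        1 for peak in spectrum
        if any(abs(peak - ion) <= tol for ion in ions)
    )


def mutate(candidate, rng):
    """Replace one randomly chosen residue with a random amino acid."""
    pos = rng.randrange(len(candidate))
    return candidate[:pos] + rng.choice(ALPHABET) + candidate[pos + 1:]


def evolve(spectrum, length, generations=2000, seed=0):
    """(1+1) loop: keep the mutated child if it scores at least as well."""
    rng = random.Random(seed)
    best = "".join(rng.choice(ALPHABET) for _ in range(length))
    best_score = score(best, spectrum)
    for _ in range(generations):
        child = mutate(best, rng)
        child_score = score(child, spectrum)
        if child_score >= best_score:
            best, best_score = child, child_score
    return best, best_score


if __name__ == "__main__":
    # Simulated, noise-free spectrum of a hypothetical target peptide.
    spectrum = fragment_ions("PEPTIDE")
    print(evolve(spectrum, length=7))
```

Even on a noise-free simulated spectrum, such a search is not guaranteed to recover the target, and it can never separate leucine from isoleucine, which have identical masses. This is one reason an alignment step is still needed to validate the resulting sequence.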

This content is AI-processed based on ArXiv data.
