An Optimized Data Structure for High Throughput 3D Proteomics Data: mzRTree

Reading time: 5 minutes
...

📝 Original Info

  • Title: An Optimized Data Structure for High Throughput 3D Proteomics Data: mzRTree
  • ArXiv ID: 1002.3724
  • Date: 2010-04-27
  • Authors: Not listed in the source text.

📝 Abstract

As an emerging field, MS-based proteomics still requires software tools for efficiently storing and accessing experimental data. In this work, we focus on the management of LC-MS data, which are typically made available in standard XML-based portable formats. The structures that are currently employed to manage these data can be highly inefficient, especially when dealing with high-throughput profile data. LC-MS datasets are usually accessed through 2D range queries. Optimizing this type of operation could dramatically reduce the complexity of data analysis. We propose a novel data structure for LC-MS datasets, called mzRTree, which embodies a scalable index based on the R-tree data structure. mzRTree can be efficiently created from the XML-based data formats and is suitable for handling very large datasets. We experimentally show that, on all range queries, mzRTree outperforms other known structures used for LC-MS data, even on the queries those structures are optimized for. Moreover, mzRTree is also more space-efficient. As a result, mzRTree reduces data analysis computational costs for very large profile datasets.

💡 Deep Analysis

Figure 1 (image not reproduced in this extraction)

📄 Full Content

Mass spectrometry-based proteomics [1] plays an ever-increasing role in different biological and medical fields but, as an emerging field, it still requires reliable tools for the storage, exchange, and analysis of experimental data. Over the last years, a wide range of technologies has become available that can generate a huge quantity of data, potentially able to address relevant questions, e.g., to identify proteins in a biological sample, to quantify their concentration, to monitor post-translational modifications, to measure individual protein turnover, or to infer interactions with other proteins, transcripts, drugs, or molecules. The technology is quickly advancing but, without efficient bioinformatics tools, high-throughput proteomics data handling and analysis are difficult and error-prone. Therefore, a major challenge facing proteomic research is how to manage the overwhelming amount of data in order to extract the desired information. This holds especially for high-throughput quantitative proteomics, which needs highly informative, high-resolution profile data in order to achieve reliable quantifications. Moreover, the data lock-in imposed by proprietary instrument formats slows down the evolution of proteomics, mainly because comparisons among different experiments or analytical methods often become unfeasible.

In order to facilitate data exchange and management, the Human Proteome Organization (HUPO) [2] established the Proteomics Standards Initiative (PSI). HUPO-PSI released the Minimum Information About a Proteomics Experiment (MIAPE) reporting guidelines [3] and proposed mzData [4], which, like mzXML [5], is an eXtensible Markup Language (XML) based data format developed to standardize data representation. Recently, merging the best features of each of these formats, HUPO introduced mzML as a unified data format [6]. XML-based data formats are characterized by an intuitive language and a standardized structure. At the current state of the art, these formats are widely adopted among proteomics research groups, thanks also to the extensive support of instrument and database-search vendors and the availability of converters from proprietary data formats. In spite of their success, the currently adopted formats suffer from some limitations [7]: the inability to store raw data [8]; the lack of information on the experimental design, necessary for regulatory submission; and the lack of scalability with respect to data size, a bottleneck for the analysis of profile data. Above all, the 1-dimensional (1D) data indexing provided by these formats considerably penalizes the analysis of datasets embodying an inherently 2-dimensional (2D) indexing structure, such as Liquid Chromatography-MS (LC-MS) datasets.

LC-MS provides intensity data on a 2D (t, m/z) domain: LC separates proteins along the retention time dimension (temporal index) based on their chemical-physical properties, while MS separates proteins based on their mass-over-charge (m/z index) ratios. Minimizing the computational time needed to access these huge datasets plays a key role in the progress of LC-MS data mining, and can also help in a variety of other MS techniques, since MS experiments usually have a temporal index related to the experimental time at which the MS acquisition takes place (e.g., a scan in mzXML). Therefore, MS data can be accessed by means of either an m/z range, a temporal range, or a combination of the two, defining different range queries. On LC-MS data, these accesses provide chromatograms, spectra, and peptide data, respectively, whereas on generic MS data they provide a set of sub-spectra belonging to the specified range. A large number of range queries is required during data analysis, so optimizing them would significantly improve computational performance.
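The three access patterns described above (spectrum, chromatogram, peptide data) can all be viewed as special cases of one 2D range query over (t, m/z). The sketch below illustrates this reduction on a toy in-memory point list; the data layout and function name are hypothetical illustrations, not the paper's actual mzRTree interface:

```python
# Hypothetical LC-MS map: each point is (rt, mz, intensity).
# A 2D range query selects all points with rt_lo <= rt <= rt_hi
# and mz_lo <= mz <= mz_hi; a naive linear scan suffices to show
# the semantics (an index like mzRTree makes this sub-linear).
def range_query(points, rt_lo, rt_hi, mz_lo, mz_hi):
    return [(rt, mz, i) for (rt, mz, i) in points
            if rt_lo <= rt <= rt_hi and mz_lo <= mz <= mz_hi]

points = [(1.0, 400.1, 10), (1.0, 500.2, 7),
          (2.0, 400.1, 12), (2.0, 500.2, 3)]

# Spectrum: fix the retention time, span all m/z values.
spectrum = range_query(points, 2.0, 2.0, 0.0, float("inf"))
# Chromatogram: fix a narrow m/z window, span all retention times.
chromatogram = range_query(points, 0.0, float("inf"), 400.0, 400.2)
# Peptide data: restrict both dimensions at once.
peptide = range_query(points, 1.0, 2.0, 400.0, 401.0)
```

The point of the reduction is that a single 2D index can serve all three access patterns, instead of one intermediate structure per privileged dimension.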

Most research groups develop, often in a sub-optimal way, intermediate data structures optimized for access along a privileged dimension, depending on the downstream analysis. For instance, accredited software packages like Maspectras [9] and MapQuant [10] use the method-specific intermediate data structures Chrom and OpenRaw, respectively: the former is optimized for chromatogram-based access, the latter for spectrum-based access. In a recent work [11], Khan et al. provide evidence that a spatial indexing structure, namely the kd-tree, is suitable for handling large LC-MS datasets and supporting the extraction of quantitative measurements. The authors emphasize the effectiveness of the kd-tree for performing analyses based on range queries, but they do not explicitly compare the kd-tree's range query performance with that attainable by other known data structures. Moreover, their experimental assessment is carried out only on centroid datasets and does not consider profile data, which, as the literature often remarks [8], are the most informative, especially for quantitative analysis, but also the most challenging to handle.
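A kd-tree of the kind Khan et al. advocate answers 2D range queries by recursively partitioning the (t, m/z) plane and pruning subtrees that cannot intersect the query rectangle. The following is a minimal, simplified sketch of that idea on toy data; it is neither their implementation nor mzRTree, and the point layout is hypothetical:

```python
# Minimal 2D kd-tree sketch. Points are (rt, mz, intensity); the tree
# alternates split axes (rt, then mz) and a range query descends only
# into subtrees whose half-plane can overlap the query rectangle.
def build_kdtree(points, depth=0):
    if not points:
        return None
    axis = depth % 2                      # 0 -> rt, 1 -> mz
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                # median split keeps the tree balanced
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def kd_range_query(node, lo, hi, out):
    if node is None:
        return out
    p, axis = node["point"], node["axis"]
    if lo[0] <= p[0] <= hi[0] and lo[1] <= p[1] <= hi[1]:
        out.append(p)
    if lo[axis] <= p[axis]:               # query rectangle may reach the left subtree
        kd_range_query(node["left"], lo, hi, out)
    if p[axis] <= hi[axis]:               # query rectangle may reach the right subtree
        kd_range_query(node["right"], lo, hi, out)
    return out

tree = build_kdtree([(1.0, 400.1, 10), (1.0, 500.2, 7),
                     (2.0, 400.1, 12), (2.0, 500.2, 3)])
# Peptide-style query: rt in [0, 2], mz in [400, 401].
hits = kd_range_query(tree, (0.0, 400.0), (2.0, 401.0), [])
```

The pruning condition is what distinguishes a spatial index from a linear scan: only branches whose splitting coordinate overlaps the query interval are visited.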

Reference

This content is AI-processed based on open access ArXiv data.
