THEA: ontology-driven analysis of microarray data

February 23, 2026

Reading time: 6 minutes

📝 Original Info

Title: THEA: ontology-driven analysis of microarray data
ArXiv ID: 0709.1397
Date: 2007-09-10
Authors: Claude Pasquier, Fabrice Girardot, Karim Jevardat De Fombelle, Richard Christen

📝 Abstract

MOTIVATION: Microarray technology makes it possible to measure thousands of variables and to compare their values under hundreds of conditions. Once microarray data are quantified, normalized and classified, the analysis phase is essentially a manual and subjective task based on visual inspection of classes in the light of the vast amount of information available. Data interpretation is currently the bottleneck of such analyses, and there is an obvious need for tools that bridge the gap between data processed with mathematical methods and existing biological knowledge. RESULTS: THEA (Tools for High-throughput Experiments Analysis) is an integrated information processing system that allows convenient handling of data. It automatically annotates data produced by classification systems with selected biological information drawn from a knowledge base, and lets users either search and browse these annotations manually or automatically generate meaningful generalizations according to statistical criteria (data mining). AVAILABILITY: The software is available on the website http://thea.unice.fr/

📄 Full Content

During the last decade, the various genome sequencing projects have fed the biological databanks with an extraordinary amount of data that remains of little use unless it is transformed into knowledge. Currently, the laborious process of annotation is carried out jointly by human experts and data-processing programs. Similarly, new technologies (proteomics, transcriptomics) are starting to produce mountains of data. The goal is now to track the activity of whole genomes, temporally and spatially, rather than to study individual biological objects thoroughly in isolation. Knowledge is deduced from overall gene expression measurements in particular experimental contexts. The assumption is that a set of gene products is probably involved in a functional module when their levels of expression vary in a coordinated manner (Segal et al., 2003). The work thus consists of two distinct phases: identifying these modules and then understanding their roles.

The first phase has now been studied extensively (Quackenbush, 2002). Numerous approaches dedicated to the acquisition, normalization, filtering and clustering of such high-throughput results are available (Chuaqui et al., 2002). In the end, the processed data are more reliable and better organized, but still very numerous. There is more than ever a need for automatic or semi-automatic approaches relying on structured and controlled vocabularies (ontologies) to analyze large quantities of data in order to discover meaningful patterns and rules (Attwood and Miller, 2001).

The THEA project is dedicated to the development of tools and methods suited to the analysis of post-genomic data. In this paper, we present the first module developed within the framework of the project. It belongs to the field of knowledge discovery and focuses on the exploration and annotation of data generated by microarray experiments (Schena et al., 1995).

Two basic requirements of knowledge discovery are access to the most complete and up-to-date information and its rapid availability (Fayyad et al., 1996). In THEA, these requirements have led to the development of ALLONTO, a dedicated data warehouse that stores selected data extracted from electronic resources, supplemented by a mediator that dynamically queries specific complementary information over the internet as required.

In order to fully exploit data, knowledge discovery systems rely on a formal representation of information based on well-defined semantics (Simoff and Maher, 1998). In ALLONTO, this formal system is represented by ontologies, which are a popular way to model biological concepts and their relationships.

THEA is designed to make use of ontologies described as Directed Acyclic Graphs (DAGs). A DAG is a structure composed of nodes (representing terms) and directed arcs (representing relationships between terms) that contains no cycle. This means that if there is a path from one node to another, then there is no way back. Such a model is very popular because it is intuitive, easily editable and less restrictive than a strict hierarchy, since a term can be the source and the target of many relationships. DAG-based ontologies cover many biological domains (see for example the list collected by OBO at http://obo.sourceforge.net/list.shtml).
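The DAG structure described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not THEA's or ALLONTO's actual data model: the class name `OntologyDAG`, its methods and the single-letter term names are all hypothetical. Each term stores its direct parents, an arc that would close a cycle is rejected, and the ancestors of a term are collected by following parent arcs upward.

```python
from collections import defaultdict


class OntologyDAG:
    """Hypothetical minimal term graph: each term points to its direct parents."""

    def __init__(self):
        self.parents = defaultdict(set)  # term -> set of direct parent terms

    def ancestors(self, term):
        """Return every term reachable from `term` by following parent arcs."""
        seen = set()
        stack = list(self.parents.get(term, ()))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.parents.get(current, ()))
        return seen

    def add_relation(self, child, parent):
        """Add the arc child -> parent, refusing any arc that would create a cycle."""
        # If `child` is already above `parent`, the new arc would give a way back.
        if child == parent or child in self.ancestors(parent):
            raise ValueError(f"arc {child} -> {parent} would create a cycle")
        self.parents[child].add(parent)


# Placeholder terms: D has two parents, which a strict hierarchy would not allow.
dag = OntologyDAG()
dag.add_relation("B", "A")
dag.add_relation("C", "A")
dag.add_relation("D", "B")
dag.add_relation("D", "C")
ancestors_of_d = dag.ancestors("D")
```

Here `ancestors_of_d` contains `A`, `B` and `C`, while a call such as `dag.add_relation("A", "D")` raises `ValueError` because it would create a path from `A` back to itself.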

Presently, ALLONTO includes two ontologies. Gene Ontology (GO) (Ashburner et al., 2000) is a controlled vocabulary developed by a consortium of scientists. It can be used to describe ('annotate') a gene product with regard to its molecular functions (GO:MF), cellular localizations (GO:CL) and biological processes (GO:BP). Specific vocabularies dedicated to Drosophila melanogaster are developed by FlyBase (The FlyBase Consortium, 2003). They describe the developmental stages and the anatomy of the fly: Drosophila Developmental Stages (FB:DDS) and Drosophila Gross Anatomy (FB:DGA), respectively. Work is in progress to incorporate other ontologies as they develop.

Our database schema is designed as an extension of the GO database schema.

Ontologies provide a mechanism for expressing and sharing biological concepts; to be useful, their terms must be used as qualifiers for the underlying data. Associations between gene products and GO terms are imported from text files produced by a growing number of biological databases (http://www.geneontology.org/doc/GO.current.annotations.shtml).

For the Drosophila ontologies, associations are queried from the “Gene Expression” page of FlyBase (http://flybase.bio.indiana.edu/cgi-bin/expat).
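As a rough illustration of how associations from such text files might be loaded, the sketch below reads a tab-separated file and builds a map from gene product identifiers to term identifiers. It assumes the usual GO annotation file layout, in which lines starting with `!` are comments, the second column holds the gene product identifier and the fifth column holds the term identifier. The function name `read_associations` and the sample rows are hypothetical placeholders, not real annotations and not THEA's import code.

```python
import csv
import io
from collections import defaultdict


def read_associations(handle):
    """Map each gene product ID to the set of term IDs it is annotated with."""
    gene_to_terms = defaultdict(set)
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines, comment lines and rows too short to hold a term ID.
        if not row or row[0].startswith("!") or len(row) < 5:
            continue
        gene_id, term_id = row[1], row[4]  # assumed column positions
        gene_to_terms[gene_id].add(term_id)
    return gene_to_terms


# Made-up sample in the assumed layout: database, gene ID, symbol, qualifier, term ID.
SAMPLE = (
    "!hypothetical sample, not real annotations\n"
    "DB\tGENE1\tsym1\t\tTERM:1\n"
    "DB\tGENE1\tsym1\t\tTERM:2\n"
    "DB\tGENE2\tsym2\t\tTERM:1\n"
)

associations = read_associations(io.StringIO(SAMPLE))
```

Keeping a set per gene product means that a repeated association is stored only once. The same function accepts a file opened with `open(path, newline="")` in place of the in-memory sample.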

As no single source contains all the necessary information, one of the most tedious tasks in functional genomics is finding the correspondences among the multiple identifiers of genes or gene products. To assist the knowledge discovery process, we have collected cross-links about all known genes, transcripts and proteins for nine organisms (Homo sapiens, Mus musculus, Rattus norvegicus, Danio rerio, Fugu rubripes, Anopheles gambiae, D. melanogaster, Caenorhabditis elegans and Caenorhabditis briggsa

…(Full text truncated)…

This content is AI-processed based on ArXiv data.
