Random-set methods identify distinct aspects of the enrichment signal in gene-set analysis

February 23, 2026

Reading time: 6 minutes

...

📝 Original Info

Title: Random-set methods identify distinct aspects of the enrichment signal in gene-set analysis
ArXiv ID: 0708.4350
Date: 2009-09-29
Authors: Researchers from original ArXiv paper

📝 Abstract

A prespecified set of genes may be enriched, to varying degrees, for genes that have altered expression levels relative to two or more states of a cell. Knowing the enrichment of gene sets defined by functional categories, such as gene ontology (GO) annotations, is valuable for analyzing the biological signals in microarray expression data. A common approach to measuring enrichment is by cross-classifying genes according to membership in a functional category and membership on a selected list of significantly altered genes. A small Fisher's exact test $p$-value, for example, in this $2\times2$ table is indicative of enrichment. Other category analysis methods retain the quantitative gene-level scores and measure significance by referring a category-level statistic to a permutation distribution associated with the original differential expression problem. We describe a class of random-set scoring methods that measure distinct components of the enrichment signal. The class includes Fisher's test based on selected genes and also tests that average gene-level evidence across the category. Averaging and selection methods are compared empirically using Affymetrix data on expression in nasopharyngeal cancer tissue, and theoretically using a location model of differential expression. We find that each method has a domain of superiority in the state space of enrichment problems, and that both methods have benefits in practice. Our analysis also addresses two problems related to multiple-category inference, namely, that equally enriched categories are not detected with equal probability if they are of different sizes, and also that there is dependence among category statistics owing to shared genes. Random-set enrichment calculations do not require Monte Carlo for implementation. They are made available in the R package allez.

💡 Deep Analysis

Deep Dive into Random-set methods identify distinct aspects of the enrichment signal in gene-set analysis.
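The abstract notes that random-set enrichment calculations need no Monte Carlo. One way to see why, for an averaging-type statistic, is that the mean and variance of the average score over a uniformly random gene set of fixed size are available in closed form from standard finite-population sampling results. The sketch below is an illustration under that assumption, not the allez implementation; the function name `random_set_z` and the toy data are hypothetical, and the paper's exact statistic is not shown in this excerpt.

```python
from math import sqrt
from statistics import fmean, pvariance


def random_set_z(scores, category):
    """Standardize a category's average score against a random set.

    scores:   gene-level scores, one per gene on the array (length G)
    category: indices of the m genes annotated to the category
    Assumes 0 < m < G and that the scores are not all equal.
    """
    G = len(scores)
    m = len(category)
    mu = fmean(scores)                 # mean over all genes
    sigma2 = pvariance(scores, mu)     # population variance (divides by G)
    xbar = fmean([scores[g] for g in category])
    # Variance of the mean of m scores drawn without replacement from G,
    # including the finite-population correction (G - m) / (G - 1).
    var_xbar = (sigma2 / m) * (G - m) / (G - 1)
    return (xbar - mu) / sqrt(var_xbar)


# Toy example: six genes, category holds the two highest-scoring ones.
z = random_set_z([0, 1, 2, 3, 4, 5], [4, 5])
```

Because the reference distribution conditions on the gene-level scores and randomizes only the set membership, no relabeling of the microarrays is needed to compute the standardized score.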

📄 Full Content

1. Introduction. In processing results of a microarray study, one is faced with the daunting task of relating differential-expression evidence to other information about the genes. Any interesting connections that can be revealed are critical in developing a fuller understanding of the biology and in providing guidance toward the next experiment [e.g., Rhodes and Chinnaiyan (2005)]. Much of the exogenous information is organized in networks of functional categories; genes are annotated to the same category by virtue of a shared biological property. The Gene Ontology (GO) project is perhaps the best example of how biological information is carried by networked collections of functional categories [Gene Ontology Consortium (2000, 2004)]. Initiated as a collaboration among different genome projects, GO has become a fundamental resource that records attributes of genes and gene products and that organizes these attributes using networks of connected functional categories.

The problem of enrichment emerges in relating gene-level expression results to functional categories. To what extent, if at all, are genes with altered expression over-represented in a named category? At the risk of oversimplifying things, the extensive research and development toward solving this problem may be classified by two statistical approaches. The first begins by selecting a short list of genes that are altered significantly relative to the cell grouping under study: for instance, genes with extreme fold change or with an extreme value of a test statistic. The intersection of the selected list and the functional category is then evaluated, perhaps by Fisher’s exact test or a variant, which scores the category highly for enrichment if many more selected genes than expected belong to the category [Drăghici et al. (2003), Berriz et al. (2003), Doniger et al. (2003), Al-Shahrour, Uriarte and Dopazo (2004), Beißbarth and Speed (2004), Cheng et al. (2004), Zhong et al. (2004), Dodd et al. (2006)]. Available informatics tools and related problems are reviewed in Khatri and Drăghici (2005). A second approach, called SAFE (significance analysis of function and expression), is developed in Virtaneva et al. (2001) and Barry, Nobel and Wright (2005); a closely related method, GSEA (gene set enrichment analysis), is developed in Mootha et al. (2003) and Subramanian et al. (2005). Briefly, expression information on all the genes under study is retained; then a permutation analysis is used to measure the significance of category-level statistics computed from these gene-level statistics.
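The selection approach can be made concrete with a small calculation. The sketch below computes the one-sided Fisher's exact test p-value for the 2 × 2 table of category membership against selected-list membership, using the hypergeometric upper tail. The function name `enrichment_p_value` and the toy counts are hypothetical, chosen only for illustration.

```python
from math import comb


def enrichment_p_value(G, m, n, x):
    """One-sided Fisher's exact test p-value, P(X >= x).

    G: number of genes under study
    m: number of genes annotated to the category
    n: number of genes on the selected list
    x: observed overlap between the category and the selected list
    """
    total = comb(G, n)  # number of equally likely selected lists of size n
    upper = min(m, n)   # the overlap cannot exceed either margin
    tail = sum(comb(m, k) * comb(G - m, n - k) for k in range(x, upper + 1))
    return tail / total


# Toy example: 20 genes, a category of 5, a selected list of 4,
# of which 3 fall in the category.
p = enrichment_p_value(G=20, m=5, n=4, x=3)
```

Note that the result depends on the selected-list size `n`, which is the stringency dependence that motivates methods retaining the quantitative gene-level scores.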

Existing tools have been effective in adding value to expression results, but they remain limited for evaluating enrichment signals. Analysis is simplified when considering selected gene lists, since quantitative scores from the gene-level analysis are not required. But then the enrichment results depend on the stringency of the selection, and give equal weight to genes at both ends of the selected list. This problem is redressed in the SAFE/GSEA approach. The permutation method adopted by SAFE/GSEA refers back to the labeled microarray data themselves rather than to the results of the differential-expression analysis. There is an added computational burden in this strategy and also it can become ineffective when few microarrays enter the permutation. A technical issue, further, concerns the

…(Full text truncated)…

This content is AI-processed based on ArXiv data.
