MOST: detecting cancer differential gene expression

📝 Original Info

  • Title: MOST: detecting cancer differential gene expression
  • ArXiv ID: 0709.1307
  • Date: 2008-12-17
  • Authors: Heng Lian

📝 Abstract

We propose a new statistic for detecting differentially expressed genes when the genes are activated in only a subset of the samples. Statistics designed for this unconventional circumstance have proved valuable for most cancer studies, where oncogenes are activated in only a small number of disease samples. Previous efforts in this direction include COPA, OS and ORT. We propose a new statistic, the maximum ordered subset t-statistic (MOST), which seems natural when the number of activated samples is unknown. We compare MOST to other statistics and find that the proposed method often has more power than its competitors.

📄 Full Content

arXiv:0709.1307v1 [stat.AP] 10 Sep 2007

MOST: detecting cancer differential gene expression

HENG LIAN

October 24, 2018

Abstract

We propose a new statistic for detecting differentially expressed genes when the genes are activated in only a subset of the samples. Statistics designed for this unconventional circumstance have proved valuable for most cancer studies, where oncogenes are activated in only a small number of disease samples. Previous efforts in this direction include COPA ([Tomlins and others(2005)]), OS ([Tibshirani and Hastie(2006)]) and ORT ([Wu(2007)]). We propose a new statistic, the maximum ordered subset t-statistic (MOST), which seems natural when the number of activated samples is unknown. We compare MOST to other statistics and find that the proposed method often has more power than its competitors.

Keywords: Cancer; COPA; Differential gene expression; Microarray.

1 Introduction

The most popular method for detecting differential gene expression in two-sample microarray studies is to compute the t-statistic. The differentially expressed genes are those whose t-statistics exceed a certain threshold. Recently, it has been realized that in many cancer studies, many genes show increased expression in disease samples, but only in a small number of those samples. The study of [Tomlins and others(2005)] shows that the t-statistic has low power in this case, and its authors introduced the so-called "cancer outlier profile analysis" (COPA). Their study shows clearly that COPA can perform better than the traditional t-statistic for cancer microarray data sets.

More recently, progress has been made in this direction, with the aim of designing better statistics that account for the heterogeneous activation pattern of cancer genes. In [Tibshirani and Hastie(2006)], the authors introduced a new statistic, which they called the outlier sum.
Later, [Wu(2007)] proposed the outlier robust t-statistic (ORT) and showed that it usually outperforms the previously proposed statistics, both in simulation studies and in application to a real data set.

In this paper, we propose another statistic for detecting cancer differential gene expression. It has power similar to ORT when the number of activated samples is very small, but performs better when more samples are differentially expressed. We call our new method the maximum ordered subset t-statistic (MOST). In simulation studies, the new statistic outperformed the previously proposed ones under some circumstances and was never significantly worse in any situation. Thus we think it is a valuable addition to the dictionary of cancer outlier expression detection.

2 Maximum ordered subset t-statistics (MOST)

We consider simple two-class microarray data for detecting cancer genes. We assume there are $n$ normal samples and $m$ cancer samples. The gene expressions for normal samples are denoted by $x_{ij}$ for genes $i = 1, 2, \ldots, p$ and samples $j = 1, 2, \ldots, n$, while $y_{ij}$ denote the expressions for cancer samples with $i = 1, 2, \ldots, p$ and $j = 1, 2, \ldots, m$. In this paper, we are only interested in the one-sided test in which the activated genes from cancer samples have a higher expression level. The extension to the two-sided test is straightforward.

The usual t-statistic (up to a multiplication factor independent of genes) for the two-sample test of a difference in means is defined for each gene $i$ by

$$T_i = \frac{\bar{x}_i - \bar{y}_i}{s_i}, \tag{1}$$

where $\bar{x}_i = \sum_j x_{ij}/n$ is the average expression of gene $i$ in normal samples, $\bar{y}_i = \sum_j y_{ij}/m$ is the average expression of gene $i$ in cancer samples, and $s_i$ is the usual pooled standard deviation estimate

$$s_i^2 = \frac{\sum_{1 \le j \le n} (x_{ij} - \bar{x}_i)^2 + \sum_{1 \le j \le m} (y_{ij} - \bar{y}_i)^2}{n + m - 2}.$$

The t-statistic is powerful when the alternative distribution is such that $y_{ij}$, $j = 1, 2, \ldots, m$, all come from a distribution with a higher mean.
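As a rough illustration of equation (1), the following Python sketch computes the pooled two-sample t-statistic for a single gene. The function name `pooled_t` and the toy numbers are assumptions made for this example and do not come from the paper; the sign convention follows the equation as printed.

```python
from math import sqrt


def pooled_t(x, y):
    """Two-sample t-statistic of equation (1) for one gene.

    x: expressions of the gene in the n normal samples
    y: expressions of the gene in the m cancer samples
    The gene-independent multiplication factor is omitted, as in the text.
    """
    n, m = len(x), len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / m
    # Pooled variance: within-group sums of squares divided by n + m - 2.
    ss = sum((v - x_bar) ** 2 for v in x) + sum((v - y_bar) ** 2 for v in y)
    s = sqrt(ss / (n + m - 2))
    # Sign convention as printed in equation (1): mean(x) - mean(y).
    return (x_bar - y_bar) / s


# Toy data (hypothetical): every cancer sample is shifted up by one unit.
normal = [1.0, 2.0, 3.0]
cancer = [2.0, 3.0, 4.0]
t_value = pooled_t(normal, cancer)
```

Applying the function gene by gene gives one value $T_i$ per gene. When only a few cancer samples are activated, those few values tend to inflate $s_i$ while shifting $\bar{y}_i$ only slightly, which is one way to see the loss of power discussed in the text.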
[Tomlins and others(2005)] argue that for most cancer types, heterogeneous activation patterns make the t-statistic inefficient for detecting those expression profiles. They defined the COPA statistic

$$C_i = \frac{q_r(\{y_{ij}\}_{1 \le j \le m}) - \mathrm{med}_i}{\mathrm{mad}_i}, \tag{2}$$

where $q_r(\cdot)$ is the $r$th percentile of the data, $\mathrm{med}_i = \mathrm{median}(\{x_{ij}\}_{1 \le j \le n}, \{y_{ij}\}_{1 \le j \le m})$ is the median of the pooled samples for gene $i$, and $\mathrm{mad}_i = 1.4826 \times \mathrm{median}(\{|x_{ij} - \mathrm{med}_i|\}_{1 \le j \le n}, \{|y_{ij} - \mathrm{med}_i|\}_{1 \le j \le m})$ is the median absolute deviation of the pooled samples. The choice of $r$ in (2) depends on the subjective judgement of the user. The mean and the standard deviation in (1) are replaced by $\mathrm{med}_i$ and $\mathrm{mad}_i$ for robustness, since it is already known that some of the genes are differentially expressed.

In (2), only one value of $\{y_{ij}\}$ is used in the computation. A more efficient strategy would be to use additional expression values. Let

$$O_i = \{\, y_{ij} : y_{ij} > q_{75}(\{x_{ij}\}_{1 \le j \le n}, \{y_{ij}\}_{1 \le j \le m}) + \mathrm{IQR}(\{x_{ij}\}_{1 \le j \le n}, \{y_{ij}\}_{1 \le j \le m}) \,\} \tag{3}$$

be the outliers from the cancer samples for gene $i$, where $\mathrm{IQR}(\cdot)$ is the interquartile range of the d

…(Full text truncated)…
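The COPA statistic (2) and the outlier set (3) quoted above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the helper names `percentile`, `copa` and `outlier_set` are hypothetical, the linear-interpolation percentile rule is one common convention that the text does not specify, and the default `r=90` is only a placeholder for the user-chosen percentile.

```python
from statistics import median


def percentile(values, r):
    """r-th percentile (0 to 100) by linear interpolation between order
    statistics. The interpolation rule is an assumption of this sketch."""
    v = sorted(values)
    pos = (len(v) - 1) * r / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(v) - 1)
    return v[lo] + (v[hi] - v[lo]) * (pos - lo)


def copa(x, y, r=90):
    """COPA statistic of equation (2) for one gene.

    x: normal-sample expressions, y: cancer-sample expressions.
    r is chosen by the user; 90 is an illustrative default only.
    """
    pooled = list(x) + list(y)
    med = median(pooled)
    # Median absolute deviation of the pooled samples, scaled by 1.4826.
    mad = 1.4826 * median([abs(v - med) for v in pooled])
    return (percentile(y, r) - med) / mad


def outlier_set(x, y):
    """Outlier set of equation (3): cancer values above q75 + IQR,
    with both quantities computed on the pooled samples."""
    pooled = list(x) + list(y)
    q25 = percentile(pooled, 25)
    q75 = percentile(pooled, 75)
    threshold = q75 + (q75 - q25)
    return [v for v in y if v > threshold]


# Toy data (hypothetical): a single cancer sample is strongly activated.
normal = [0, 1, 2, 3, 4]
cancer = [0, 1, 2, 3, 20]
copa_value = copa(normal, cancer, r=90)
outliers = outlier_set(normal, cancer)
```

In this toy example only the activated sample exceeds the pooled threshold, so `outliers` contains just that one value. A different percentile convention would change `copa_value` slightly but not the overall picture.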

Reference

This content is AI-processed based on ArXiv data.