We propose a new statistic for the detection of differentially expressed genes when the genes are activated in only a subset of the samples. Statistics designed for this unconventional circumstance have proved valuable in most cancer studies, where oncogenes are activated in only a small number of disease samples. Previous efforts in this direction include COPA, OS and ORT. We propose a new statistic called maximum ordered subset t-statistics (MOST), which seems natural when the number of activated samples is unknown. We compare MOST to other statistics and find that the proposed method often has more power than its competitors.
arXiv:0709.1307v1 [stat.AP] 10 Sep 2007
MOST: detecting cancer differential gene expression
HENG LIAN
October 24, 2018
Abstract
We propose a new statistic for the detection of differentially expressed genes when the genes are activated in only a subset of the samples. Statistics designed for this unconventional circumstance have proved valuable in most cancer studies, where oncogenes are activated in only a small number of disease samples. Previous efforts in this direction include COPA [Tomlins and others (2005)], OS [Tibshirani and Hastie (2006)] and ORT [Wu (2007)]. We propose a new statistic called maximum ordered subset t-statistics (MOST), which seems natural when the number of activated samples is unknown. We compare MOST to other statistics and find that the proposed method often has more power than its competitors.
Keywords: Cancer; COPA; Differential gene expression; Microarray.
1 Introduction
The most popular method for detecting differential gene expression in two-sample microarray studies is to compute the t-statistic for each gene. The differentially expressed genes are those whose t-statistics exceed a certain threshold. Recently, it has been recognized that in many cancer studies, many genes show increased expression in disease samples, but only in a small number of those samples. The study of [Tomlins and others (2005)] shows that the t-statistic has low power in this case, and they introduced the so-called “cancer outlier profile analysis” (COPA). Their study shows clearly that COPA can perform better than the traditional t-statistic for cancer microarray data sets.
More recently, several advances have been made in this direction, with the aim of designing better statistics that account for the heterogeneous activation pattern of cancer genes. [Tibshirani and Hastie (2006)] introduced a new statistic, which they called the outlier sum (OS). Later, [Wu (2007)] proposed the outlier robust t-statistic (ORT) and showed that it usually outperformed the previously proposed statistics in both simulation studies and an application to a real data set.
In this paper, we propose another statistic for the detection of cancer differential gene expression, which has similar power to ORT when the number of activated samples is very small but performs better when more samples are differentially expressed. We call our new method the maximum ordered subset t-statistics (MOST). Through simulation studies we found that the new statistic outperformed the previously proposed ones under some circumstances and was never significantly worse in any situation. Thus we think it is a valuable addition to the dictionary of cancer outlier expression detection.
2 Maximum ordered subset t-statistics (MOST)
We consider simple two-class microarray data for detecting cancer genes. We assume there are $n$ normal samples and $m$ cancer samples. The gene expressions for normal samples are denoted by $x_{ij}$ for genes $i = 1, 2, \ldots, p$ and samples $j = 1, 2, \ldots, n$, while $y_{ij}$ denotes the expressions for cancer samples with $i = 1, 2, \ldots, p$ and $j = 1, 2, \ldots, m$. In this paper, we are only interested in one-sided tests, where the activated genes from cancer samples have a higher expression level. The extension to two-sided tests is straightforward.
The usual t-statistic (up to a multiplication factor independent of genes) for a two-sample test of differences in means is defined for each gene $i$ by
$$T_i = \frac{\bar{x}_i - \bar{y}_i}{s_i}, \tag{1}$$
where $\bar{x}_i = \sum_j x_{ij}/n$ is the average expression of gene $i$ in normal samples, $\bar{y}_i = \sum_j y_{ij}/m$ is the average expression of gene $i$ in cancer samples, and $s_i$ is the usual pooled standard deviation estimate
$$s_i^2 = \frac{\sum_{1 \le j \le n} (x_{ij} - \bar{x}_i)^2 + \sum_{1 \le j \le m} (y_{ij} - \bar{y}_i)^2}{n + m - 2}.$$
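As a rough illustration of Equation (1), the following Python sketch computes the statistic for a single gene. The function name `pooled_t` and the example values are hypothetical and not from the paper. The sign follows Equation (1) as written (normal mean minus cancer mean), so a gene with higher expression in the cancer samples gives a negative value under this convention.

```python
from math import sqrt
from statistics import mean


def pooled_t(x, y):
    """Statistic of Equation (1) for one gene.

    x: expression values in the n normal samples
    y: expression values in the m cancer samples
    """
    n, m = len(x), len(y)
    x_bar, y_bar = mean(x), mean(y)
    # Pooled variance: within-group sums of squares divided by n + m - 2.
    ss_x = sum((v - x_bar) ** 2 for v in x)
    ss_y = sum((v - y_bar) ** 2 for v in y)
    s = sqrt((ss_x + ss_y) / (n + m - 2))
    # Sign convention of Equation (1) as written: normal mean minus cancer mean.
    return (x_bar - y_bar) / s


# Hypothetical toy data: every cancer sample is shifted upward by 2.
normal = [1.0, 2.0, 3.0]
cancer = [3.0, 4.0, 5.0]
print(pooled_t(normal, cancer))  # -2.0
```

Here both within-group sums of squares equal 2, so the pooled variance is 4/4 = 1 and the statistic reduces to the difference in means.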
The t-statistic is powerful when the alternative distribution is such that $y_{ij}$, $j = 1, 2, \ldots, m$, all come from a distribution with a higher mean. [Tomlins and others (2005)] argue that for most cancer types, heterogeneous activation patterns make the t-statistic inefficient for detecting those expression profiles. They defined the COPA statistic
$$C_i = \frac{q_r(\{y_{ij}\}_{1 \le j \le m}) - \mathrm{med}_i}{\mathrm{mad}_i}, \tag{2}$$
where $q_r(\cdot)$ is the $r$th percentile of the data, $\mathrm{med}_i = \mathrm{median}(\{x_{ij}\}_{1 \le j \le n}, \{y_{ij}\}_{1 \le j \le m})$ is the median of the pooled samples for gene $i$, and $\mathrm{mad}_i = 1.4826 \times \mathrm{median}(\{|x_{ij} - \mathrm{med}_i|\}_{1 \le j \le n}, \{|y_{ij} - \mathrm{med}_i|\}_{1 \le j \le m})$ is the median absolute deviation of the pooled samples.
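Equation (2) can be sketched in Python for a single gene as follows. The helper `percentile`, its linear-interpolation rule, the default `r = 90` and the toy data are all assumptions made for illustration; the text does not fix a percentile convention or a value of $r$.

```python
from statistics import median


def percentile(values, r):
    """r-th percentile (0 <= r <= 100) by linear interpolation.

    The interpolation rule is an assumption; other conventions exist.
    """
    v = sorted(values)
    pos = (len(v) - 1) * r / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(v) - 1)
    return v[lo] + (v[hi] - v[lo]) * (pos - lo)


def copa(x, y, r=90):
    """COPA statistic of Equation (2) for one gene.

    x: normal samples, y: cancer samples, r: percentile (illustrative default).
    """
    pooled = list(x) + list(y)
    med = median(pooled)
    # Median absolute deviation of the pooled samples, scaled by 1.4826.
    # Note: this is zero, and the division fails, if more than half of the
    # pooled values equal the median.
    mad = 1.4826 * median(abs(v - med) for v in pooled)
    return (percentile(y, r) - med) / mad


# Hypothetical toy data: a single cancer sample is strongly activated.
normal = [0.0, 1.0, 2.0, 3.0, 4.0]
cancer = [0.0, 1.0, 2.0, 3.0, 10.0]
print(copa(normal, cancer, r=100))
print(copa(normal, cancer, r=75))
```

With these toy values the pooled median is 2 and the unscaled median absolute deviation is 1, so the statistic is large at `r=100`, where the single activated sample is picked up, and small at `r=75`, which illustrates how the result depends on the choice of $r$.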
The choice of $r$ in (2) depends on the subjective judgement of the user. The use of $\mathrm{med}_i$ and $\mathrm{mad}_i$ to replace the mean and the standard deviation in (1) is due to robustness considerations, since it is already known that some of the genes are differentially expressed.
In (2), only one value of $\{y_{ij}\}$ is used in the computation. A more efficient strategy would be to use additional expression values. Let
$$O_i = \{y_{ij} : y_{ij} > q_{75}(\{x_{ij}\}_{1 \le j \le n}, \{y_{ij}\}_{1 \le j \le m}) + \mathrm{IQR}(\{x_{ij}\}_{1 \le j \le n}, \{y_{ij}\}_{1 \le j \le m})\} \tag{3}$$
be the outliers from the cancer samples for gene $i$, where $\mathrm{IQR}(\cdot)$ is the interquartile range of the d
…(Full text truncated)…