Inference with Transposable Data: Modeling the Effects of Row and Column Correlations

We consider the problem of large-scale inference on the row or column variables of data in the form of a matrix. Often this data is transposable, meaning that both the row variables and column variables are of potential interest. An example of this s…

Authors: Genevera I. Allen, Robert Tibshirani

As statisticians, we often make assumptions when constructing a model to ease computations or employ existing methodologies. When analyzing matrix data, we often assume that the variables along one dimension (say the columns) are independent, allowing us to pool these observations to make inferences on the variables along the other dimension (rows). In microarrays, for example, it is common to assume that the arrays are independent observations when computing test statistics, allowing us to assess differential expression in genes. Since we are testing many row variables (for example, genes) simultaneously, we commonly correct for multiple testing using procedures that theoretically are known only to control error measures when the row variables are independent or follow limited dependence structures. Thus, for inference with matrix data, we often make assumptions of independence or limited dependencies among the row variables and among the column variables to be able to employ existing statistical methodologies. What if these assumptions are incorrect? What if this matrix data is in fact transposable, meaning that potentially both the rows and/or columns are correlated? In this paper, we consider the problem of testing the significance of row variables in a data matrix where there are correlations among the rows, or among the columns, or among both. We study the behavior of standard statistical methodology on transposable data and then propose a method to directly account for the dependencies when conducting inference. Throughout this paper, we often refer to the example of detecting genes that are differentially expressed between two classes in microarray data. These genomic datasets contain complicated correlation structures. Genes in similar pathways, for example, are usually highly positively correlated. Other genes may encode proteins that act as inhibitors leading to negative correlations. In the analysis of microarrays, it is common to assume that the arrays are independent. Many have suggested, however, that this may not be correct (Efron, 2009;Leek and Storey, 2008;Owen, 2005;Qiu et al., 2005), due to the measurement process or latent variables. Arranged in the form of a matrix, this means that both the row (gene) and column (array) variables could be dependent, indicating that the data could be transposable. While we focus on the example of detecting significant genes in the two-class microarray, our methods can be applied to many examples of large-scale inference with transposable data. These include: testing the significance of proteins, genes, or isoforms in data such as protein arrays and next-generation sequencing data, testing the significance of voxels in functional magnetic resonance imaging data, and testing the significance of biomarkers in three-way data where measurements are taken on multiple subjects at several time points or in many different laboratories. In all of these examples, the assumptions of independence along one dimension of the data is questionable. We begin by introducing two examples that we will refer to throughout this paper. The first is a two-class microarray study of cardiovascular disease (Efron, 2009). We will refer to this as the "Cardio" data. This data has m = 20, 426 genes and n = 63 arrays consisting of 44 controls and 19 diseased patients. The second is a two-class microarray study of two types of Leukemia cancer (Golub et al., 1999), which we will refer to as the "Leukemia" data. This data has m = 3, 701 filtered genes and n = 72 arrays with 25 and 47 samples in each subtype. For each of these datasets, we calculate the two-sample t-statistic for each gene and compare their distribution to that of the theoretical null distribution in Figure 1. We see that the t-statistics are over-dispersed compared to their theoretical null distributions. This could be due to the highly correlated nature of the thousands of genes, or another cause could be correlations among the arrays. In fact, the permutation tests of Efron (2009) reject the null hypothesis of independent arrays for both of these microarrays. Figure 1: Histograms of two-sample t-statistics for the "Cardio" data (left) and the "Leukemia" data (right). Log intensity values were used with the genes and arrays centered. The theoretical null distribution, the t-distribution with 70 degrees of freedom (left) and 61 degrees of freedom (right), is drawn in red. When studying inference with transposable data, the effects of row and column correlations must be considered separately. Since the columns are generally considered to be independent, population column correlations lead to the used of incorrect test statistics and null distributions which in turn result in problems when correcting for multiple testing. Row correlations lead to the much discussed problem of multiple testing dependence (Benjamini and Yekutieli, 2001;Hommel, 1986;Leek and Storey, 2008;Sarkar, 2008;Storey et al., 2004). We propose to study and solve these problems by modeling row and column correlations using the mean-restricted matrix-variate normal distribution (Allen and Tibshirani, 2010) described in Section 2. The first half of our paper is devoted to studying the effects of these correlations on test statistics and their theoretical null distributions, Section 2.2, and on power and multiple testing procedures through a simulation study in Section 3.2. Interestingly, this study finds the following results. 1. Unanticipated column correlations dramatically alter the null distributions of test statistics leading to the use of incorrect test statistics, null distributions and estimates of the FDR. 2. Row correlations do not seem to affect the estimates of the FDR. The later half of our paper is focused on solving the problems associated with row and column correlations by directly making use of the correlation structure. In Section 4, we simultaneously estimate row and column covariances using transposable regularized covariance models (Allen and Tibshirani, 2010). We then present an algorithm to sphere or de-correlate the rows and columns so that they are approximately independent. This algorithm is to be used as a pre-processing step and in conjunction with standard multiple testing procedures. Simulation results using our sphering algorithm are presented in Section 5 under various models on both structured covariance and real microarray covariance examples. These reveal two important results: (c) Sphering can alter the rank of the test statistics leading to an ordering with higher statistical power. (d) Sphering often leads to substantial improvements in the estimation of the FDR. We conclude with a discussion of our study and methods in Section 6. In this section, we first present a matrix decomposition model based on the mean-restricted matrix-variate normal in Section 2.1. Then, going back to the two-class microarray example, we consider the test statistic for a single gene. Since the arrays are usually assumed to be independent, the two-sample z and t-tests are used commonly to assess differential expression. We give the theoretical null distributions for these test statistics under our model with column correlations in Section 2.2. We propose to study row and column correlations through a simple matrix decomposition model based on the matrix-variate normal. We motivate the use of this distribution through the example of microarrays. In microarray data, the genes are often assumed to follow a multivariate normal distribution with the arrays independent and identically distributed. Since we aim to study the effects of array correlations, we need a parametric model that has the flexibility to model either array independence or various array correlation structures. To this end, we turn to the mean-restricted matrix-variate normal introduced in Allen and Tibshirani (2010). (We also note that Efron (2009) proposes the matrix-variate normal as a model for microarrays). This distribution, denoted as X ∼ N m,n (ν, µ, Σ, ∆), has separate mean and covariance parameters for the rows, ν ∈ m and Σ ∈ m×m , and columns, µ ∈ n and ∆ ∈ n×n . Thus, we can model array correlations directly though the covariance matrix ∆. If the matrix is transformed into a vector of length np, we have that vec(X) ∼ N (vec(M), Ω), where and Ω = ∆ ⊗ Σ. Also, the commonly used multivariate normal is a special case of the distribution. If ∆ = I and µ = 0, then X ∼ N (ν, Σ). In fact, all marginal models of the matrix-variate normal are multivariate normal, meaning that both the genes and arrays separately are multivariate normal. Further properties of this distribution are given in Allen and Tibshirani (2010). In our matrix decomposition model, we will assume that the data, X, has m rows and n columns. We define the overall row means as ν ∈ m and column means as µ ∈ n . The covariance of the rows is Σ ∈ m×m and the covariance of the columns is ∆ ∈ n×n . Then, we decompose the data into a mean, signal, and correlated noise matrix as follows. where Thus, X -S ∼ N m,n (ν, µ, Σ, ∆), meaning that after removing the signal, the data follows a mean-restricted matrix-variate normal distribution. For the example of the two-class microarray, we let there be n 1 arrays in class one, with indices denoted by C 1 , and n 2 in class two, C 2 . (For simplicity of notation, we assume that the first n 1 arrays are in class one and the last n 2 arrays are in class two.) The class signals, or the gene means for each class are defined as ψ 1 ∈ m and ψ 2 ∈ m . Then, the signal matrix, S, can be written as follows. There are several remarks to make regarding this model. First, prior to analyzing data, it is common to standardize the rows. Sometimes this two-way data is doubly-standardized, or both the rows and columns are iteratively scaled (Efron, 2009;Olshen and Rajaratnam, 2010). Here, we center both the rows and columns through the mean matrix M, but do not directly scale them. Instead, we allow the diagonals of the covariance matrices of the rows, Σ, and columns ∆, to capture the differences in variablities. Thus, our model keeps the mean and variances separate in the estimation process. In this section, we study the effect of column correlations on the theoretical null distribution of two-sample test statistics computed for a single row of the data matrix. More specifically, we calculate the distributions of test statistics under our matrix decomposition model instead of the typical two-sample framework where samples are drawn independently from two populations. This corresponds to considering a single test for differential expression of gene i between the two classes. In the familiar two-sample hypothesis testing problem, we have a vector with x 1 of length n 1 and x 2 of length n 2 where the elements of each vector are ). We wish to test whether there is a shift in means between the two classes, namely (2) Throughout this paper, we will assume that the variances σ 2 are equal between the two classes, a common assumption in microarrays. If the variance, σ 2 is known, we have the familiar two-sample Z-statistic, . Now, going back to our matrix decomposition model, we wish to know the distribution of the Z-statistic for each row when there are column correlations. where and L is the matrix square root of ∆. In terms of the decomposition (1), the assumptions of Theorem 1 correspond to a row vector previously centered by ν and µ, with signal [ψ 1 1 (n 1 ) ψ 2 1 (n 2 ) ], column covariance ∆, and row variance σ 2 , the diagonal element of Σ. For microarrays, the result states that when the columns (arrays) are correlated, the variance of the Z-statistic is inflated or deflated by η, a function of the column covariance. Notice that if ∆ = I, η = c n and the variance of Z is one. If there is only column correlation within the two classes we have the following result. where In both of the previous results, we assumed that the row variance, σ 2 was known. However, in most microarray experiments this is not known and must be estimated. With σ 2 unknown, for testing the hypothesis (2), the two-sample t-statistic is used. with c n and xk as previously defined. Under the null hypothesis, T ∼ t (n-2) , while under the alternative, T ∼ t(δ) (n-2) , a non-central t distribution with non-centrality parameter When there are column correlations as in the assumptions of Proposition 1, however, the distribution of T does not have a closed form. (The square of the pooled sample standard deviation is no longer distributed as a Chi-squared random variable and the numerator and denominator of T are not independent.) Hence, we explore the effects of column correlations on the T -statistic through a small simulation study. Data is simulated according to the assumptions of Theorem 1 with n = 50 columns with n 1 = n 2 = 25 in each class. Four structured covariance matrices were used to assess the Z and T -statistics under the column correlations scenarios, as given below. • ∆1 : ∆1,ij = 0.9 |i-j| . • ∆2: Blocked diagonal with blocks of size 10. Within each block, ∆2,ij = 0.9 |i-j| . • ∆3 : ∆3,ij = 0.5 |i-j| . • ∆4: Blocked diagonal with blocks of size 10. Within each block, ∆4,ij = 0.5 |i-j| . Figure 2 reveals the effect of column correlations on the distributions of Z and T . We see that column correlations can cause dramatic over-dispersion of the test statistics compared to their theoretical null distribution. This is a possible explanation to the over-dispersion seen in the real microarray examples of Figure 1. Compared to the variance of the Zstatistic, the T -statistic appears to be even more affected by column correlations. This is confirmed in Table 1 where we present the variances of the Z-statistic calculated by Theorem 1 and the variances of the T -statistic estimated by Monte Carlo simulation. Indeed, small amounts of correlation in the columns can cause a dramatic increase in the variance of the T -statistic. In this section, we have shown how the distribution of T and Z-statistics behave when columns or arrays are correlated. When analyzing microarrays, however, many have advocated using a non-parametric method, estimating the null distribution by permutations Ge et al. ( 2003); Storey and Tibshirani (2003); Tusher et al. (2001). For the two-class microarray, one would permute the class labels and calculate the T -statistic for each permutation. These permutations form a null distribution, as under the null hypothesis (2), the class means are the same. Thus, each permutation of the labels is equally likely. When the arrays are correlated, however, this assumptions fails. Each permutation of the columns is not equally likely under the null due to the array covariance structure. While we do not explore the behavior of the permutation nulls further in this section, we include permutation-based methods from Storey and Tibshirani (2003) in our simulation study in the following section. 3 Study: Dependence and Multiple Testing In the previous section, we presented the theoretical null distributions of commonly used test statistics for a single two-sample test statistic when the columns are correlated. With transposable data, however, one needs to test possibly tens of thousands of row variables, thus creating a problem of multiplicity. In this section, we first review some multiple testing procedures that are known to control errors under certain types of dependencies. We then present a series of simulations to study the behavior of commonly used multiple testing procedures when the rows and columns are correlated. A common error measure for controling the number of false positives in microarrays is the False Discovery Rate (FDR). This is the expectation of the False Discovery Proportion (FDP): let V be the number of false positives and R be the total number of rejections, then q = F DR = E(V /R|R > 0). Typically, investigators seek to control the FDR at q = 0.1, meaning that on average 10% of rejections are false. The step-up method of Benjamini and Hochberg (1995) is one of the most widely used methods for controling the FDR. Benjamini and Yekutieli (2001) have shown that this method controls the FDR under types of positive dependence, specifically positive regression dependence, and Sarkar (2008) has relaxed this assumption to slightly broader forms of positive dependence. This may not be appropriate for all types of transposable data, especially microarrays where we expect some negative correlations between genes. Alternatively, Benjamini and Yekutieli (2001) have shown that dividing the thresholds in the step-up procedure by a constant controls the FDR under arbitrary dependencies. Another commonly used method to control the FDR is based on re-sampling or permutation distributions (Ge et al., 2003;Storey, 2002;Tusher et al., 2001;Yekutieli and Benjamini, 1999). Theoretically, these methods are only known to control the FDR asymptotically under types of weak dependence, which encompasses forms of local dependence such as finite blocks (Storey et al., 2004). Thus, there could be many transposable data sets in which the row variables do not satisfy these dependence structures. (We also note that applying the step-up method to the permutation-adjusted p-values is equivalent to the direct FDR estimation via re-sampling (Storey et al., 2004)). Also, to directly account for correlations, Efron (2004Efron ( , 2007) ) proposed a method to fit an empirical null to the data. One can then estimate the local FDR and then the FDR by averaging the local FDR over the tail regions. We study the effects of both row and column correlations on standard statistical methodology used for large-scale inference through a simulation study based on our matrix decomposition model. We compare FDR estimates of four types of FDR-controlling procedures to the true false discovery proportion (FDP). The four methods we compare are the step-up method of Benjamini and Hochberg (1995), the step-up method for control under arbitrary dependence of Benjamini and Yekutieli (2001), the permutation-based method of Storey and Tibshirani (2003), and the method based on the empirical null and local FDR's of Efron (2007). The two-sample t-statistic was used for all methods with p-values computed by comparing it to the t (n-2) distribution for the step-up procedures. We used 1000 permutations for the permutation-based method. The defaults in the localfdr package available on CRAN, the R language repository. These defaults fit the null distribution as a natural spline with seven degrees of freedom, for the empirical null-based method. Our simulation study is structured as follows. The data is simulated under the matrix decomposition model (1) and is of size 250 by 50. The first 50 rows are non-null with a twoclass signal matrix given by ψ 1,1:25 = 0.5, ψ 1,26:50 = -0.5, ψ 2,1:25 = -0.5, ψ 2,26:50 = 0.5 and the last 200 elements of ψ 1 and ψ 2 equal to zero. We consider two types of row covariances, Σ 1 with all positive correlations satisfying the positive regression dependence assumption of (Benjamini and Yekutieli, 2001), and Σ 2 with both positive and negative correlations. Both of these row covariances are block diagonal. We simulate data under three column covariances, with the first being the identity, or no column correlations. The others, ∆ 1 and ∆ 2 reflect a local and a class effect, respectively. These simulation covariances are summarized below. • Σ1: Blocked diagonal with blocks of size 10. Within each block, Σ1,ij = 0.9 |i-j| . • Σ2: Blocked diagonal with blocks of size 10. Within each block, Σ2,ij = (-0.9) |i-j| . • ∆1: Blocked diagonal with blocks of size 10. Within each block, ∆1,ij = 0.5 |i-j| . • ∆2: Blocked diagonal with blocks of size 25. Within each block, ∆2,ij = 0.5 |i-j| . (Efron, 2007) Permutation-based method (Storey and Tibshirani, 2003) Step-up method (Benjamini and Hochberg, 1995) Step-up method for dependence (Benjamini and Yekutieli, 2001) We present plots of the true FDP verses the number of hypotheses rejected for the four methods for one realization of each of the six simulation scenarios in Figure 3. We note that the lines above the true FDP curve denote conservative FDR estimates. In Table 2, we report the true and estimated false discovery proportions (FDP) when fixed numbers of hypotheses are rejected (40, 45, 50, 55 and 60 tests). Results are averaged over ten simulations with the standard error also reported. This simulation study reveals several interesting results. First, dependencies among the rows do not seem to effect FDR estimation with the four multiple testing procedures. When, ∆ = I as in simulations (a) and (b), the methods generally conservatively estimate the true FDP. This is noteworthy since besides the method of Benjamini and Yekutieli (2001), there are limited theoretical results supporting FDR control under various dependencies. When there are even moderate correlations between columns, simulations (c) through (f), the four methods give poor estimates of the FDR. The step-up method and the permutation-based method perform similarly. They both give extremely conservative estimates of the FDP when there is either a local or a class effect among the columns. Thus, when using these methods for controlling the FDR at q = 0.1, for example, one would reject The effect of row and column correlations on estimation of the false discovery rates. The true false discovery proportion (FDP) and estimates with standard errors using the step-up, stepup for dependence, permutation, and empirical null based methods, as described in Section 3.2, are given when a pre-specified number of tests are rejected. All simulations were done using the matrix decomposition model, (1), with parameters given in Section 3.2, and repeated ten times. less than 45 genes, when in reality one should be permitted to reject around 55 genes. We also see that while the method of Benjamini and Yekutieli (2001) controls the FDR are arbitrary dependencies, in practice this method is much too conservative for general use. On the other hand, the empirical null-based method of Efron (2007) performs inconsistently. Overall, the results of this simulation study reveal that dependencies among rows do not seem to effect the performance of the multiple testing procedures. On the other hand, the theoretical results of Section 2.2 are confirmed: dependencies among the columns are extremely problematic when conducting large-scale inference. In the previous sections, we have presented theoretical and simulation results demonstrating some of the problems with using standard statistical methodology for making inferences on transposable data. In the remainder of this paper, we present a solution to these problems by directly estimating the covariances and using these to sphere or de-correlate the data. The key to our method, based on the matrix decomposition model ( 1), is the simultaneous estimation of the row and column covariances. This is important because of the close relationship between the observed row and column covariances. Take, for example, the empirical covariances of a centered data matrix X, Σ = X X T /m and ∆ = X T X /n. If we take the singular value decomposition, X = U D V T , then Σ = U D U T and ∆ = V D V T , i.e. the two covariances estimates share the same eigenvalues. In fact, Efron (2009) shows that the variance of the two correlation matrices is the same. Because of this, population correlations among the rows, for example, often make the columns seem correlated. Thus, estimating the column covariance without accounting for the covariances of the rows is problematic. With Transposable Regularized Covariance Models, we can estimate both Σ and ∆ simultaneously according the the matrix-variate normal framework. We review these models and discuss their relevance for the example of microarrays in the next section. The Transposable Regularized Covariance Model (TRCM) allows us to estimate a nonsingular row and column covariance matrix by maximizing a penalized log-likelihood of the matrix-variate normal distribution (Allen and Tibshirani, 2010). The model places a strictly convex penalty on the inverse covariances, or concentration matrices, of the rows and columns. For estimating the covariances in this context, we propose to use a sparsityinducing penalty, an L 1 penalty, on the concentration matrices. Following from the matrix decomposition model, (1), if we let N be the noise matrix remaining after removing the means and the signal in the data, then the penalized log-likelihood is as follows. where and λ is a penalty parameter that must be estimated. We motivate the use of (6) first by discussing practical considerations. As the columns of the data matrix are usually There are several advantages assumed to be independent, ∆ = I, this should be our default position. By placing an L 1 penalty on ∆ -1 , our model encourages sparsity in the off-diagonal elements of ∆. Also, notice that we have one penalty parameter, λ, that is modulated by the dimension of the rows and columns. (We note that the penalty parameter, λ, can be selected by cross-validation). Thus, the evidence of a partial correlation among columns must be strong relative to the correlations among the rows for an off-diagonal element of ∆ -1 to be estimated as non-zero. Secondly,specifically for microarrays, it seems reasonable to assume that the covariances among the genes is sparse, since biologically genes are likely only to be correlated with genes in the same or related pathways. We also pause briefly to discuss the theoretical rationale for using L 1 penalties, instead of, for example, L 2 penalties. Recall that covariance solutions to the TRCM model with L 2 penalties have eigenvectors that are equal to the left and right singular vectors of the data (Allen and Tibshirani, 2010). Thus, the singular vectors of the data would remain the same when sphering with these estimates. In high-dimensional settings, however, it is well established that eigenvectors of empirical covariances are inconsistent (Johnstone and Lu, 2004), and thus, sphering by the L 2 covariance estimates seems ill-advised. While the consistency of L 1 TRCM estimates has not been established, there are consistency results for multivariate covariance estimation with an L 1 penalty. Rothman et al. (2008) show convergence of the multivariate covariance estimate in the Frobenius norm and more importantly for the correlation estimate in the operator norm which implies convergence of the eigenvectors (El Karoui, 2008). These results reveal some of the possible theoretical advantages of using L 1 penalties to estimate the covariances. Based on the matrix decomposition model, (1), we present a method of de-correlating or sphering the data so that the rows and columns are approximately independent. This sphered data can then be used with standard multiple testing procedures to identify significant row variables. Given a data matrix X with m rows and n columns, we present our sphering algorithm in Algorithm 1. Algorithm 1 Sphering Algorithm 1. Estimate row and column means, ν and μ forming M, and the signal matrix, Ŝ. 2. Define the noise, N X -M -Ŝ. Estimate row and column covariances of noise, Σ and ∆ via TRCM. 3. Sphere the noise: Ñ Σ-1 2 N ∆-1 2 . Form the sphered data matrix: X Ŝ + Ñ. The Sphering Algorithm simply estimates the means and signal according to the matrix decomposition model (1) and then estimates the correlation structure among the rows and columns in the remaining noise. The TRCM covariance estimates, Σ and ∆ are used to de-correlated the noise. Here, Σ-1/2 is the matrix square root of Σ-1 and ∆-1/2 of ∆. (We use the symmetric square root defined by the following. Let Σ-1 = PΛP T be the eigenvalue decomposition of Σ-1 , then the symmetric matrix square root is given by Σ-1/2 = PΛ 1/2 P T .) Adding the signal back into this sphered noise, we obtain X which we call the sphered data. One can use this de-correlated data to find significant row variables. Now, we investigate some of the theoretical properties of the sphered data for the twoclass problem introduced in Section 2.2. The sphered data, X has the following properties. ] and let X be the sphered data given by Algorithm 1. Then, Thus, the class signal remains the same between X and X, and the covariance structure is all that changes. By sphering the noise, the noise of each row in X becomes a linear combination of the noise in the other rows. We now study how sphering the data affects the Z and T statistics from Section 2.2. First, the Z-statistic does not change with sphering. The numerator of both the Z and T statistic, x1 -x2 is given by ψ1 -ψ2 , the components of the estimated signal matrix Ŝ. The denominator of the T -statistic, namely s x 1 ,x 2 , the estimate of the noise, however, changes with sphering. Recall that in Section 2.2, we discussed how the T -statistic does not have a closed form distribution when there are column correlations. After sphering the data, however, the T -statistic on the sphered data follows a scaled t distribution under certain conditions. This is given by the following result. ]. Let X be the sphered data given by Algorithm 1, and let the statistic Ti be the statistic for the i th row defined by (5) for the data X. Then under the null hypothesis H 0 : where and L is the matrix square root of ∆, σ i = Σ ii and σi = Σii . Using our sphering algorithm, we obtain test statistics that follow known distributions when the sphered column covariance, ∆ is the identity. If ∆ is instead a diagonal matrix, then a simple scaling of the columns will give the above result. Notice that if the original data, X, has no column correlations, ∆ = I, and σi = σ i , then T and T both follow a t distribution with n -2 degrees of freedom. Thus, if the data originally follows the correct theoretical null distribution, then sphering the data does not change its null distribution. Also, if the sphered rows are independent, Σ = I, or approximately independent, then the statistics, T i are independent or approximately independent. We also note that we can often assume that σi = σ i , thus eliminating that coefficient ratio from the distribution. This is especially a reasonable assumption if the rows are scaled prior to applying the sphering algorithm. Our results in Proposition 2 hold if the sphered column covariance ∆ is the identity or diagonal. The TRCM model, however, estimates sparse penalized row and column covariances. These penalized estimates will not capture the full covariances, but will instead estimate the major correlations. Thus, in practice, ∆ and Σ are not likely to be exactly the identity. We have observed in simulations, however, that ∆ is often diagonal or nearly diagonal, and thus the theoretical results are appropriate. When calculating p-values for T based on the distribution given in Proposition 2, we must know the value of η which depends on the original column covariance ∆. One could estimate this from ∆, but since the TRCM framework estimates penalized covariances, an estimate, η, based on ∆ will underestimate the population η. Hence to obtain the null distribution of the T -statistics, we have opted to scale T by the variance of the central portion of the observed distribution of the test statistics. This procedure is outlined in Algorithm 4.2 where ρ α (x) denotes the α th quantile of x. and I() is the indicator function. Algorithm 2 Scaling by the central portion of T . 1. Let the expected proportion of null test statistics be π0 = m0 /m. T (π 0 ) Var Ti I Ti ≥ ρ ((1-π 0 )/2) ( T ), Ti ≤ ρ (1-π 0 /2) ( T ) 3. Define the central-matched T -statistics: T * T σ t (n-2) (π 0 )/σ T (π 0 ), where σ 2 t (n-2) (π 0 ) is the variance of the central portion of the t (n-2) distribution. We scale by the central portion of the T -statistics so that the statistics can be tested against the t (n-2) distribution. Notice that if all of the test statistics are null and π 0 = 1 then, central-matching the variances reduces to scaling the T -statistics. Since under the assumptions of Proposition 2, only the null Ti follow a scaled t-distribution, we do not want statistics corresponding to non-null tests to contaminate the variance estimates. Thus, we recommend using a conservative estimate of π 0 , such as 0.8 or 0.9 for microarrays. By applying our sphering algorithm, we directly account for correlations among the rows and columns. This results in test statistics that more closely follow both their theoretical nulls and the theoretical assumptions under which common multiple testing procedures are known to control the false discovery rate. We now evaluate the performance of our sphering algorithm through many simulated examples. First, we compare data pre-processed by sphering to the standard row and column centering method on simulations based on the matrix decomposition model, (1). We use simulations from the matrix-variate normal with the structured covariances from the simulation study in Section 3.2 and also with covariances based on the observed dependencies in real microarray data. Finally, we test the robustness of our method and compare it to other methods for modeling dependencies in Section 5.2. For all simulations, the sphering algorithm was applied with the TRCM penalty parameter λ selected by five-fold cross-validation and with statistics scaled by the central portion using π 0 = 0.8. (Note that the "standard" pre-processing method refers to row and column centering throughout this section.) In all of these simulations, the data, X is simulated from the matrix decomposition model ( 1) with m = 250 rows and n = 50 columns. The first 50 rows are non-null given by ψ 1,1:25 = 0.5, ψ 1,26:50 = -0.5, ψ 2,1:25 = -0.5, ψ 2,26:50 = 0.5 and the last 200 elements of ψ 1 and ψ 2 equal to zero. In Table 3, we present results on a subset of the simulations from our simulation study in Section 3.2. The remaining simulation study results are given in Appendix A. The results in Table 3 show that de-correlating the data matrix yields improvements in 1) statistical power and 2) estimation of the FDR. We briefly illustrate this by examining a specific example from Table 3.2. Take the simulation with parameters Σ 1 , ∆ 1 and look at the results with 55 tests rejected. We notice that the true FDP for the data pre-processed by the standard method is 0.111 whereas it is lower, 0.105, on the data that was sphered. This results from a favorable re-ordering of the test statistics that gives a higher statistical power, one minus the true FDP. Next, notice that the FDR estimates for the step-up and permutation-based methods are 0.426 and 0.43 respectively for the un-sphered data. These estimates are overly conservative, as the true FDP is 0.111. After sphering, however, the FDR estimates are 0.124 and 0.125 which are much closer to the true FDP of 0.105. Hence, sphering also improves FDR estimation. Sphering the data as a pre-processing step to multiple testing procedures has many advantages. First in microarrays, the higher statistical power that results from a re-ordering of test statistics is important to scientists who desire the top ranked genes from one microarray study to translate to the top genes in another study. Also, while sphering leads to improvements in FDR estimation, it is still a slightly conservative estimate, as desired, for the true FDP. As with the simulation study, we find that the empirical null based method of Efron (2007) gives an inconsistent estimate of the FDR as it is both a conservative and estimates with standard errors are given when a pre-specified number of tests are rejected. Results using the sphering algorithm (in bold) are compared to data that has been row and column centered. All data was simulated under the matrix decomposition model, (1), with parameters given in Section 3.2, and repeated ten times. Two sets of values should be compared: the true FDP with sheering to without sphering, and the FDR estimates compared to the true FDP for both with and without sphering. liberal estimate for differing numbers of rejected tests. We also wish, however, to test the performance of our sphering algorithm on data with dependencies more similar to real microarray data. Thus, we build a second simulation study based upon the empirical covariances of the "Cardio" and the "Leukemia" microarrays. For each of the ten repetitions, we sample 250 genes and 50 arrays at random from each microarray. Let i ∈ I and j ∈ J , be the sampled sets of the genes and arrays respectively, and assume X is the centered data matrix. We then calculate the empirical covariances, Hence, the simulated data follows the observed covariance of the "Cardio" and "Leukemia" studies. Example images and FDR curves from this simulation are given in Figure 4 as well as the simulation results in Table 4. The results of the structured covariance study, namely improvements in statistical power and FDR estimation, are confirmed on these microarray-based simulations. There are also (Efron, 2007) Permutation-based method (Storey and Tibshirani, 2003) Step-up method (Benjamini and Hochberg, 1995) Figure 4: Example data images (top panel) and FDR curves (bottom panel) for the simulations based on dependencies within the "Cardio" and "Leukemia" microarrays. Data is either gene and array centered or sphered. In the FDR curves, the true and estimated false discovery proportions are plotted against the number of genes rejected. All data was simulated under the matrix decomposition model, (1), with parameters given in Section 5.1. some specific notes to make regarding these simulations. First in Table 4, notice that using un-sphered data the FDR is overestimated on the "Cardio" simulations and underestimated on the "Leukemia" simulations. This is confirmed by the example giving the full FDR curves in Figure 4. After sphering, however, we see that the FDR estimates for both simulations are still conservative, but much closer to the true FDP. We note that all ten repetitions of the "Cardio" simulation estimated both Σ and ∆ to be non-diagonal. This means that even after accounting for the correlations among the genes, there still appear to be significant correlations among the arrays. In the "Leukemia" simulation, however, ∆ was estimated to be diagonal in all ten simulations. Thus, the correlations among the genes may be driving the over-dispersion seen in the t-statistic distributions of Figure 1. Thus, from these simulations based on our matrix decomposition model, we see that sphering the data as a pre-processing step greatly improves statistical power and false discovery rate estimation. True FDP (Benjamini & (Storey & (Efron, 2007) Hochberg, 1995) Table 4: Results for simulations based on observed dependencies within the "Cardio" and "Leukemia" microarrays: True false discovery proportions (FDP) and FDR estimates with standard errors are given when a pre-specified number of tests are rejected. Results using the sphering algorithm (in bold) are compared to data that has been row and array centered. All data was simulated under the matrix decomposition model, (1), with parameters given in Section 5.1, and repeated ten times. Two sets of values should be compared: the true FDP with sheering to without sphering, and the FDR estimates compared to the true FDP for both with and without sphering. We now evaluate the performance of our method using simulations based on models other than the matrix-variate normal, namely a latent variable model and a random effects model. In these simulations we will not only compare our sphering method to the standard method, but also to the surrogate variable analysis method of Leek and Storey (2008). We first compare this method and model's properties to our sphering algorithm and matrix decomposition model, and then compare these methods numerically. To account for possible latent variables in a multiple testing framework, Leek and Storey (2008) propose a matrix model and the surrogate variable analysis (SVA) method. They propose the model for the data X ∈ m×n , X = B S +ΓG + U where S is a signal matrix, G ∈ d×n for d < n is the latent variable matrix, U ∈ m×n is independent noise and B and Γ ∈ m×d are coefficients to be estimated. This model is similar in nature to our matrix decomposition model (1). If we assume X has been previously centered, using the notation Table 5: Simulations based on a latent variable and random-effects models as described in Section 5.2. The true FDP (FDP) is compared to the FDR estimates ( FDR) for a fixed number of rejected tests using the step-up method of Benjamini and Hochberg (1995) for three pre-processing techniques: Standard (row and column centered), our sphering algorithm, and the surrogate variable analysis (SVA) method. Averages are taken on ten repetitions and standard errors are given. of the latent variable model, we can write (1) as . Thus, our model accounts for structure within the data through the row and column covariances Σ and ∆, whereas their method estimates the structure through G and assumes the noise is additive. Assuming that the latent variable model or our model is correct, applying the respective algorithms results in approximately independent p-values. Also similar to our method, SVA can change the rankings of the test statistics. Unlike SVA, however, our model and sphering algorithm directly capture and account for possible correlations among the columns as well as the rows. We simulate data from a latent variable model taken directly from Leek and Storey (2008) as well as from a random effects model denoting a batch effect. For both models, the data is of dimension 250 × 50, with 25 columns in each class with the following signal: The first 50 rows are non-null given by ψ 1,1:25 = 0.5, ψ 1,26:50 = -0.5, ψ 2,1:25 = -0.5, ψ 2,26:50 = 0.5 and the last 200 elements of ψ 1 and ψ 2 equal to zero. For the latent variable model, there are two latent variables given by G ij iid ∼ Bern(0.5), coefficients Γ ij iid ∼ N (0, 1) and noise U ij idd ∼ N (0, 1). For the random-effects model with K batches indicated by indices I(k), a column of the data is given by the following: , where ν, µ j and ψ i are fixed effects, and iid ∼ N (0, Σ 1 ) are random effects. In our simulation, we have µ k = [-0.5 -0.25 0 0.25 0.5], σ 2 = 0.5, Σ = Σ 1 as defined in Section 3.2 and I(k) indicating batches of five columns. In Table 5, we compare the true FDP to the estimate of the FDR via the step-up method (Benjamini and Hochberg, 1995) for the data with standard pre-processing, with sphering and with Leek and Storey (2008)'s surrogate variable analysis (SVA). The SVA method was implemented using the defaults available in the package sva from CRAN, the R language repository. For the latent variable simulation, both our sphering method and the SVA method improve the rank ordering of the test statistics resulting in higher statistical power as well as improved estimates of the FDR. In the random-effects model simulation, however, the sphering algorithm substantially outperforms the standard pre-processing and the SVA method. We illustrate this by looking at the specific case where 50 tests are rejected. For the standard pre-processing and SVA methods, the true FDP is 0.48 and 0.475 respectively, meaning that on average 25 out of the 50 rejected tests are false positives. With sphering, however, the order of the test statistics is dramatically changed leading to a true FDP of 0.102 so that on average only 5 out of 50 rejected tests are false. The FDR estimates using the step-up method are also problematic for the standard and SVA methods as 0.0196 and 0.185 are substantially below the true FDPs of 0.48 and 0.475 respectively. If these methods were used, the number of false positives would not be controlled. With sphering, however, the FDR estimate of 0.14 is much closer to the true FDP, 0.102 and is a conservative estimate, as desired. These simulations based on models other than the matrix-variate normal reveal the robustness of our pre-processing technique. Our sphering method also compares very favorably to the surrogate variable analysis, another pre-processing method. In this paper, we have demonstrated that using standard statistical methodology to conduct inference on transposable data is problematic. As a method of solving these problems, we have prosed a sphering pre-processing technique that de-correlates the data yielding approximately independent rows and columns. We have revealed the advantages and robustness of this method through simulations on many correlated data sets. The major disadvantage of our method is its computational cost. Fitting the transposable regularized covariance model with L 1 penalties is approximately O(k(m 3 +p 3 )), where k is the total number of iterations needed until convergence. Thus, directly fitting this model to microarrays, for example, where m may be twenty or thirty thousand, is not currently feasible. A simple fix can be proposed, that is to first filter the genes by the absolute value of their un-sphered T -statistics down to say 1,000 or 500 genes. Since the signal in each gene remains the same before and after sphering, filtering should not effect the power to detect non-null genes, especially since researchers are rarely interested in re-testing over 500 genes. As future work, we will examine approximations to the TRCM covariance estimates that can be used in high-dimensional settings and would circumvent the need to filter the genes before sphering. There are many components of our work that deserve further investigation and testing. First, Allen and Tibshirani (2010) outline some of the properties of the TRCM covariances estimates, but several questions, such as the consistency of the estimates, remain. Also, direct estimation of η, the scaled variance of the Z-statistic that depends on the array covariance, should be examined to find a consistent estimate of η. In conclusion, our model and study have revealed several important issues related to large-scale inference with transposable data such as microarrays. First, correlation among the columns proves to be a major problem, both theoretically and in simulations, when comparing test statistics to a theoretical or permutation null distribution. This results is striking as inference is often conducted under the false assumption of column independence. Second, despite the lack of theoretical results supporting the use of many common FDRcontrolling procedures for test statistics with arbitrary dependence structure, the procedures seem to conservatively estimate the FDP under a variety of correlation scenarios. Finally, our method of de-correlating the data is a way to directly model the covariance structure in a multiple testing framework. This method leads to 1) improvements in the statistical power, and to 2) better estimation of the FDR. While this paper has focused on the example of two-class microarrays, our model and methods may prove useful in a variety large-scale inference problems with highly transposable data sets. We would like to thank Jonathan Taylor for several helpful comments and conversations regarding this work. Thanks to Joseph Romano for discussions and references for papers on multiple testing with dependencies. Thanks also to Bradley Efron whose observations and ideas on microarrays partly inspired this work. Proof 1 (Theorem 1) Let z be a vector of N (0, 1) random variables. Then, if we arrange x as a column vector, we have True FDP (Benjamini & (Storey & (Efron, 2007) Hochberg, 1995) Tibshirani with standard errors are given when a pre-specified number of tests are rejected. Results using the sphering algorithm (in bold) are compared to data that has been row and column centered. All data was simulated under the matrix decomposition model, (1), with parameters given in Section 3.2, and repeated ten times. Two sets of values should be compared: the true FDP with sphering to without sphering, and the FDR estimates compared to the true FDP for both with and without sphering. can be written as T = Z/ √ D. From Theorem 1, Z ∼ N ((ψ1 -ψ2)/σ √ cn, η/cn). Then, under the null, Z/ η/cn ∼ N (0, 1). Also, under the null, Dσ 2 /σ 2 ∼ χ 2 (n-2) with D and Z independent. Then,

Original Paper

Loading high-quality paper...

Comments & Academic Discussion

Loading comments...

Leave a Comment