Stable Estimation of a Covariance Matrix Guided by Nuclear Norm Penalties


Authors: Eric C. Chi (Department of Electrical and Computer Engineering, Rice University, Texas, USA) and Kenneth Lange (Departments of Human Genetics, Biomathematics, and Statistics, University of California, Los Angeles, California, USA)

Abstract

Estimation of covariance matrices or their inverses plays a central role in many statistical methods. For these methods to work reliably, estimated matrices must not only be invertible but also well-conditioned. In this paper we present an intuitive prior that shrinks the classic sample covariance estimator towards a stable target. We prove that our estimator is consistent and asymptotically efficient. Thus, it gracefully transitions towards the sample covariance matrix as the number of samples grows relative to the number of covariates. We also demonstrate the utility of our estimator in two standard situations – discriminant analysis and EM clustering – when the number of samples is dominated by or comparable to the number of covariates.

Keywords: Covariance estimation, Regularization, Condition number, Discriminant analysis, EM clustering

1. Introduction

Estimates of covariance matrices and their inverses play a central role in many core statistical methods, ranging from least squares regression to EM clustering. In these applications it is crucial to obtain estimates that are not just non-singular but also stable. It is well known that the sample covariance matrix

$$S = \frac{1}{n}\sum_{j=1}^{n} (y_j - \bar y)(y_j - \bar y)^t$$

is the maximum likelihood estimate of the population covariance Σ of a random sample y_1, ..., y_n from a multivariate normal distribution. When the number of components p of y exceeds the sample size n, the sample covariance S is no longer invertible. Even when p is close to n, S becomes unstable in the sense that small perturbations in measurements can lead to disproportionately large fluctuations in its entries.

A reliable way to combat instability is to perform penalized maximum likelihood estimation. To motivate our choice of penalization, consider the eigenvalues of the sample covariance matrix in a simple simulation experiment. We drew n independent samples from a 10-dimensional multivariate normal distribution y_i ∼ N(0, I_10). Figure 1 presents boxplots of the sorted eigenvalues of the sample covariance matrix S over 100 trials for sample sizes n drawn from the set {5, 10, 20, 50, 100, 500}. The boxplots descend from the largest eigenvalue on the left to the smallest eigenvalue on the right. The figure vividly illustrates the previous observation that the highest eigenvalues tend to be inflated upwards above 1, while the lowest eigenvalues are deflated downwards below 1 (Ledoit and Wolf, 2004, 2012). In general, if the sample size n and the number of components p approach ∞ in such a way that the ratio p/n approaches ζ ∈ (0, 1), then the eigenvalues of S tend to the Marčenko-Pastur law (Marčenko and Pastur, 1967), which is supported on the interval $([1-\sqrt{\zeta}]^2, [1+\sqrt{\zeta}]^2)$. Thus, the distortion worsens as ζ approaches 1. The obvious remedy is to pull the highest eigenvalues down and push the lowest eigenvalues up.
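This distortion is easy to reproduce. The sketch below is our own illustration, not code from the paper; assuming NumPy, it reruns the experiment behind Figure 1 and prints the average sorted spectrum for each sample size, showing the top of the spectrum overshooting 1 and the bottom undershooting it.

```python
import numpy as np

rng = np.random.default_rng(0)
p, trials = 10, 100

# Repeat the Figure 1 experiment: eigenvalues of the sample covariance
# of n draws from N(0, I_10), sorted from largest to smallest.
for n in [5, 10, 20, 50, 100, 500]:
    eigvals = np.empty((trials, p))
    for t in range(trials):
        Y = rng.standard_normal((n, p))
        S = np.cov(Y, rowvar=False, bias=True)       # divide by n, as in the text
        eigvals[t] = np.sort(np.linalg.eigvalsh(S))[::-1]
    # All population eigenvalues equal 1; the sample spectrum spreads around 1.
    print(n, eigvals.mean(axis=0).round(2))
```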
In this paper, we introduce a novel prior which effects the desired adjustment on the sample eigenvalues. Maximum a posteriori (MAP) estimation under the prior boils down to a simple nonlinear transformation of the sample eigenvalues. In addition to proving that our estimator has desirable theoretical properties, we also demonstrate its utility in extending two fundamental statistical methods – discriminant analysis and EM clustering – to contexts where the number of samples n is either on the order of or dominated by the number of parameters p.

The rest of our paper is organized as follows. Section 2 discusses the history of robust estimation of structured and unstructured covariance matrices. Section 3 specifies our Bayesian prior and derives the MAP estimator under the prior. Section 4 proves that the estimator is consistent and asymptotically efficient. Section 5 reports finite sample studies comparing our MAP estimator to relevant existing estimators. Section 6 illustrates the estimator for some common tasks in statistics. Finally, Section 7 discusses limitations, generalizations, and further applications of the estimator.

[Figure 1: Boxplots of the sorted eigenvalues of the sample covariance matrix S over 100 random trials. Here the number of components is p = 10, and the sample size n is drawn from the set {5, 10, 20, 50, 100, 500}.]

2. Related Work

Structured estimation of covariance matrices can be attacked from two complementary perspectives: generalized linear models and regularization (Pourahmadi, 2011, 2013). In this work we consider the problem from the latter perspective. Regularized estimation of covariance matrices and their inverses has been a topic of intense scrutiny (Wu and Pourahmadi, 2003; Bickel and Levina, 2008), and the current literature reflects a wide spectrum of structural assumptions. For instance, banded covariance matrices make sense for time series and spatial data, where the order of the components is important. It is also helpful to impose sparsity on a covariance matrix, its inverse, or its factors in a Cholesky decomposition or other factorization (Huang et al., 2006; Rohde and Tsybakov, 2011; Cai and Zhou, 2012; Ravikumar et al., 2011; Rajaratnam et al., 2008; Khare and Rajaratnam, 2011; Fan et al., 2011; Banerjee et al., 2008; Friedman et al., 2008; Hero and Rajaratnam, 2011, 2012; Peng et al., 2009).

In the current paper, we do not assume any special structure. Our sole concern is to directly address the distortion in the eigenvalues of the sample covariance matrix. Thus, we work in the context of the rotationally-invariant estimators first proposed by Stein (1975). If S = U D U^t is the spectral decomposition of S, then Stein suggests alternative estimators of the form Σ̂ = U diag(e_1, ..., e_p) U^t that change the eigenvalues but not the eigenvectors of S. In particular, Stein (1975), Haff (1991), Ledoit and Wolf (2004), and Warton (2008) study the family

$$\hat\Sigma = (1-\gamma)S + \gamma T \tag{1}$$

of linear shrinkage estimators, where γ ∈ [0, 1] and T = ρI for some ρ > 0. The estimator (1) obviously entails

$$e_i = (1-\gamma)d_i + \gamma\rho.$$
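In code, a linear shrinkage step is a one-liner. The sketch below is our own illustration (NumPy assumed); γ and ρ are taken as given here, whereas the LW estimator discussed next chooses γ from the data.

```python
import numpy as np

def linear_shrinkage(S, gamma, rho):
    """Estimator (1) with T = rho * I: each sample eigenvalue d_i becomes
    (1 - gamma) * d_i + gamma * rho, while the eigenvectors of S are retained."""
    return (1.0 - gamma) * S + gamma * rho * np.eye(S.shape[0])
```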
A natural choice of ρ, and the one taken by the popular estimator of Ledoit and Wolf (2004), is the mean of the sample eigenvalues, namely σ̂ = (1/p) tr(S). Under the assumption that y_i ∼ N(0, σI), σ̂ is the maximum likelihood estimate of σ. With this choice, the linear estimator becomes a mixture of the covariance model with the greatest number of degrees of freedom and the simplest non-trivial model. For the rest of this paper, we will assume that ρ = σ̂. We also highlight the fact that the Ledoit and Wolf linear estimator, which we refer to by the acronym LW, is a notable member of the class of linear estimators, as it specifies an asymptotically optimal value for γ based on the data.

Ledoit and Wolf (2004, 2012) show that linear shrinkage works well when p/n is large or the population eigenvalues are close to one another. On the other hand, if p/n is small or the population eigenvalues are dispersed, linear shrinkage yields marginal improvements over the sample covariance. Nonlinear shrinkage estimators may present avenues for further improvement (Dey and Srinivasan, 1985; Daniels and Kass, 2001; Sheena and Gupta, 2003; Pourahmadi et al., 2007; Ledoit and Wolf, 2012; Won et al., 2012).

Our shrinkage estimator is closest in spirit to the estimator of Won et al. (2012), who put a prior on the condition number of the covariance matrix. Recall that the condition number κ of a matrix is the ratio of its largest singular value to its smallest singular value. For a symmetric matrix, the singular values are the absolute values of the eigenvalues, and for a covariance matrix they are the eigenvalues themselves. The best conditioned matrices are multiples of the identity matrix and have κ = 1. A well-conditioned covariance estimate is one where κ is not too large, say not in excess of 1000. When n does not greatly exceed p, Figure 1 shows that the sample covariance matrix is often ill-conditioned. To address this defect, Won et al. perform maximum likelihood estimation restricted to the space of positive definite matrices whose condition number does not exceed a threshold κ_max. Let ℓ(Σ) denote the negative loglikelihood, namely

$$\ell(\Sigma) = \frac{n}{2}\ln\det\Sigma + \frac{n}{2}\operatorname{tr}(S\Sigma^{-1}).$$

Won et al. seek a Σ that solves

$$\text{minimize } \ell(\Sigma) \quad\text{subject to}\quad \lambda_{\max}(\Sigma)/\lambda_{\min}(\Sigma) \le \kappa_{\max},$$

where λ_max and λ_min are the largest and smallest eigenvalues of Σ respectively, and κ_max ≥ 1 is a tuning parameter. Note that when κ_max = 1, the unique solution is σ̂I. In practice, κ_max is determined by cross-validation. We will refer to the solution of the above problem as Won et al.'s CNR, for condition number regularized estimator.

Won et al. show that CNR has improved finite sample performance compared to linear estimators in simulations, but the greatest gains arise when only a few eigenvalues of the population covariance are much larger than the rest. The gains diminish when this does not hold. The main contribution of the estimator we describe next is that it provides superior performance in scenarios where CNR loses its competitive edge.
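Both CNR and the estimator introduced next penalize the same Gaussian negative loglikelihood ℓ(Σ). For later reference, a minimal helper (our sketch, assuming NumPy):

```python
import numpy as np

def neg_loglik(Sigma, S, n):
    """l(Sigma) = (n/2) ln det Sigma + (n/2) tr(S Sigma^{-1})."""
    sign, logdet = np.linalg.slogdet(Sigma)
    assert sign > 0, "Sigma must be positive definite"
    return 0.5 * n * logdet + 0.5 * n * np.trace(S @ np.linalg.inv(Sigma))
```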
3. Maximum a Posteriori Estimation with a Novel Prior

Adding a penalty is equivalent to imposing a prior π(Σ) on the population covariance Σ. The prior we advocate is designed to steer the eigenvalues of Σ away from the extremes of 0 and ∞. Recall that the nuclear norm of a matrix Σ, which we denote by ‖Σ‖_*, is the sum of the singular values of Σ. The reasonable choice

$$\pi(\Sigma) \propto e^{-\frac{\lambda}{2}\left[\alpha\|\Sigma\|_* + (1-\alpha)\|\Sigma^{-1}\|_*\right]}$$

relies on the nuclear norms of Σ and Σ^{-1}, a positive strength constant λ, and a mixture constant α ∈ (0, 1). This is a proper prior on the set of invertible matrices. One can demonstrate this fact by comparing the nuclear norm ‖Σ‖_* to the Frobenius norm ‖Σ‖_F, which coincides with the Euclidean norm of the vector of singular values of Σ. In view of the equivalence of vector norms on R^{p(p+1)/2}, (α/2)‖Σ‖_* ≥ η‖Σ‖_F for some positive constant η. Integrating the resulting inequality

$$e^{-\frac{\lambda}{2}\left[\alpha\|\Sigma\|_* + (1-\alpha)\|\Sigma^{-1}\|_*\right]} \le e^{-\eta\lambda\|\Sigma\|_F}$$

over Σ demonstrates that the prior is proper. The normalizing constant of π(Σ) is irrelevant in the ensuing discussion. Consider therefore minimization of the objective function

$$f(\Sigma) = \frac{n}{2}\ln\det\Sigma + \frac{n}{2}\operatorname{tr}(S\Sigma^{-1}) + \frac{\lambda}{2}\left[\alpha\|\Sigma\|_* + (1-\alpha)\|\Sigma^{-1}\|_*\right].$$

The maximum of −f(Σ) occurs at the posterior mode. In the limit as λ tends to 0, −f(Σ) reduces to the loglikelihood −ℓ(Σ). In the sequel we will refer to our MAP covariance estimate by the acronym CERNN (Covariance Estimate Regularized by Nuclear Norms).

Fortunately, three of the four terms of f(Σ) can be expressed as functions of the eigenvalues e_i of Σ. The trace contribution presents a greater challenge. As before, let S = U D U^t denote the spectral decomposition of S with nonnegative diagonal entries d_i ordered from largest to smallest. Likewise, let Σ = V E V^t denote the spectral decomposition of Σ with positive diagonal entries e_i ordered from largest to smallest. In view of the von Neumann-Fan inequality (Mirsky, 1975), we can assert that

$$-\operatorname{tr}(S\Sigma^{-1}) \le -\sum_{i=1}^{p}\frac{d_i}{e_i},$$

with equality if and only if V = U. Consequently, we make the latter assumption and replace f(Σ) by

$$g(E) = \frac{n}{2}\sum_{i=1}^{p}\ln e_i + \frac{n}{2}\sum_{i=1}^{p}\frac{d_i}{e_i} + \frac{\lambda}{2}\left[\alpha\sum_{i=1}^{p} e_i + (1-\alpha)\sum_{i=1}^{p}\frac{1}{e_i}\right],$$

using the cyclic permutation property of the trace function. At a stationary point of g(E), we have

$$0 = \frac{n}{e_i} - \frac{n d_i + \lambda(1-\alpha)}{e_i^2} + \lambda\alpha.$$

The solution to this essentially quadratic equation is

$$e_i = \frac{-n + \sqrt{n^2 + 4\lambda\alpha[n d_i + \lambda(1-\alpha)]}}{2\lambda\alpha}. \tag{2}$$

We reject the negative root as inconsistent with Σ being positive definite. For the special case n = 0 of no data, all $e_i = \sqrt{(1-\alpha)/\alpha}$, and the prior mode occurs at a multiple of the identity matrix. In contrast, CNR shrinks the sample eigenvalues d_i as follows:

$$e_i = \begin{cases} \kappa_{\max}\tau^* & d_i \ge \kappa_{\max}\tau^* \\ d_i & \tau^* < d_i < \kappa_{\max}\tau^* \\ \tau^* & d_i \le \tau^* \end{cases} \tag{3}$$

for a τ* > 0 that is determined from the data. Thus, CNR truncates extreme sample eigenvalues and leaves the moderate ones unchanged.

Figure 2 compares the eigenvalue solution paths obtained by CERNN and CNR to the solution paths of the linear estimator on a set of five sample eigenvalues (13.29, 5.73, 1.51, 0.55, 0.44). At each condition number κ, the regularization parameters (λ, κ_max, γ) of the three methods are adjusted to give a condition number matching κ. Each path starts at a sample eigenvalue and ends at the mean σ̂ of the sample eigenvalues. As desired, all three methods pull the large eigenvalues down towards σ̂ and the small eigenvalues up towards σ̂. There are important differences, however. Compared to the linear estimator, both of the nonlinear estimators pull the larger eigenvalues down more aggressively and pull the smaller eigenvalues up less aggressively. The discrepancy in the treatment of high and low eigenvalues is more pronounced in CNR than in CERNN. We will see later in finite sample experiments that CERNN's more moderate approach usually leads to better performance.

[Figure 2: A comparison of the solution paths for CERNN (left panel, solid line) and CNR (right panel, solid line) against the path for the linear estimator (both panels, dashed line) for sample eigenvalues (13.29, 5.73, 1.51, 0.55, 0.44). The dotted line indicates the mean of the sample eigenvalues.]
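Both nonlinear maps are cheap to apply once the spectral decomposition of S is in hand. A sketch (ours, assuming NumPy; for CNR the threshold τ* is passed in as given, whereas Won et al. derive it from the data):

```python
import numpy as np

def cernn_eigenvalues(d, n, lam, alpha):
    """CERNN map (2): shrink sample eigenvalues d toward the prior mode."""
    disc = n**2 + 4.0 * lam * alpha * (n * d + lam * (1.0 - alpha))
    return (-n + np.sqrt(disc)) / (2.0 * lam * alpha)

def cnr_eigenvalues(d, tau, kappa_max):
    """CNR map (3): truncate eigenvalues to the interval [tau, kappa_max * tau]."""
    return np.clip(d, tau, kappa_max * tau)

def cernn(S, n, lam, alpha):
    """CERNN estimate: transform the eigenvalues of S, keep its eigenvectors."""
    d, U = np.linalg.eigh(S)
    return (U * cernn_eigenvalues(d, n, lam, alpha)) @ U.T
```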
Some insight into the limiting behavior of our estimator can be gained through asymptotic expansions. Holding all but one variable fixed in formula (2), one can demonstrate after a fair amount of algebra that

$$e_i = d_i + \frac{\lambda(1-\alpha-\alpha d_i^2)}{n} + O\!\left(\frac{1}{n^2}\right), \quad n \to \infty \tag{4}$$

$$e_i = \sqrt{\frac{1-\alpha}{\alpha}} + \left[\sqrt{\frac{1-\alpha}{\alpha}}\,\frac{n d_i}{2(1-\alpha)} - \frac{n}{2\alpha}\right]\frac{1}{\lambda} + O\!\left(\frac{1}{\lambda^2}\right), \quad \lambda \to \infty.$$

These asymptotic expansions accord with common sense. Namely, the data eventually overwhelm a fixed prior, and increasing the penalty strength for a fixed amount of data pulls the estimate of Σ toward the prior mode. The second asymptotic expansion will prove useful in selecting λ in a practical data-dependent way.

3.1. Choosing λ and α

Let us first consider how to choose α. A natural choice would match the prior to the scale of the data. Thus, we would determine α as the solution to the equation

$$p\sqrt{\frac{1-\alpha}{\alpha}} = \operatorname{tr}\!\left(\sqrt{\frac{1-\alpha}{\alpha}}\,I\right) = \operatorname{tr}(S),$$

namely

$$\hat\alpha = \left[1 + \left(\frac{1}{p}\operatorname{tr}(S)\right)^2\right]^{-1}. \tag{5}$$

Of course, we recognize that this choice of α̂ results in the prior mode being σ̂I. For the remainder of the paper we will set α to α̂.

We recommend different strategies for choosing λ depending on whether the covariance estimation is being performed for a supervised or unsupervised task. Both strategies employ cross-validation. We defer describing the former strategy until we discuss an application to discriminant analysis. To choose λ in the unsupervised context, we implement cross-validation as follows. Let Y ∈ R^{n×p} denote the observed data. In K-fold cross-validation we partition the observations, or rows of Y, into K disjoint sets. Let Y_k denote the kth subset, n_k denote the number of its rows, and Σ̂_λ^{(-k)} denote the estimate we obtain when we fit the data using all but the kth partition Y_k. Our notation indicates that the estimate depends on λ but not on α, since we have fixed α at the value α̂ displayed in equation (5). For a grid of increasing regularization parameters λ = 0, ..., λ_max, we evaluate the predictive negative loglikelihood of Σ̂_λ^{(-k)} on the held-out kth subgroup Y_k:

$$\ell_k(\hat\Sigma_\lambda^{(-k)}, Y_k) = \frac{n_k}{2}\ln\det\hat\Sigma_\lambda^{(-k)} + \frac{n_k}{2}\operatorname{tr}\!\left(\frac{1}{n_k}Y_k^t Y_k\left[\hat\Sigma_\lambda^{(-k)}\right]^{-1}\right).$$

We select the λ that minimizes the average ℓ_k over the K folds, namely

$$\hat\lambda = \arg\min_{\lambda\in\{0,\dots,\lambda_{\max}\}} \frac{1}{n}\sum_{k=1}^{K}\ell_k(\hat\Sigma_\lambda^{(-k)}, Y_k).$$
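A minimal version of this selection loop (our sketch; it reuses the cernn and neg_loglik helpers sketched above, and the upper end of the λ grid is chosen as described next):

```python
import numpy as np

def choose_lambda(Y, lambdas, K=5, seed=0):
    """Select lambda by K-fold cross-validation of the predictive negative loglikelihood."""
    rng = np.random.default_rng(seed)
    n, p = Y.shape
    folds = np.array_split(rng.permutation(n), K)
    S_full = np.cov(Y, rowvar=False, bias=True)
    alpha_hat = 1.0 / (1.0 + (np.trace(S_full) / p) ** 2)    # equation (5)
    scores = np.zeros(len(lambdas))
    for k in range(K):
        test = folds[k]
        train = np.setdiff1d(np.arange(n), test)
        S_train = np.cov(Y[train], rowvar=False, bias=True)
        S_test = Y[test].T @ Y[test] / len(test)    # uncentered held-out scatter, as in l_k
        for j, lam in enumerate(lambdas):
            # lambda = 0 reduces to the sample covariance; usable only when S_train
            # is nonsingular, so in p > n regimes start the grid above 0.
            Sigma = S_train if lam == 0 else cernn(S_train, len(train), lam, alpha_hat)
            scores[j] += neg_loglik(Sigma, S_test, len(test))
    return lambdas[np.argmin(scores / n)]
```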
We want to choose λ_max sufficiently large to adequately explore the dynamic range of possible solutions. The second expansion in (4) gives us a rough bound on how much each of the shrunken eigenvalues deviates from the prior mode. We choose λ_max so that those deviations are no more than some small fraction ε of the prior mode; namely, we choose λ_max so that

$$\max_{i=1,\dots,p}\left|\sqrt{\frac{1-\hat\alpha}{\hat\alpha}}\,\frac{n d_i}{2(1-\hat\alpha)} - \frac{n}{2\hat\alpha}\right|\frac{1}{\lambda_{\max}} \le \varepsilon\sqrt{\frac{1-\hat\alpha}{\hat\alpha}},$$

where ε ≪ 1, say 10^{-2}.

4. Consistency and Asymptotic Efficiency

In proving consistency, we will need various facts. First, suppose A and B are two p × p symmetric matrices with ordered eigenvalues {a_i}_{i=1}^p and {b_i}_{i=1}^p. Then one has

$$\sum_{i=1}^{p}(a_i - b_i)^2 \le \|A - B\|_F^2. \tag{6}$$

This is a consequence of Fan's inequality because $\sum_{i=1}^p a_i^2 = \|A\|_F^2$ and $\sum_{i=1}^p b_i^2 = \|B\|_F^2$. If the two matrices A = U diag(a) U^t and B = U diag(b) U^t are simultaneously diagonalizable, then equality holds in inequality (6). We will also need the inequalities

$$\sqrt{1+x} \le 1 + \frac{x}{2} \quad\text{and}\quad \sqrt{1+x} \ge 1 + \frac{x}{2} - \frac{x^2}{8} \tag{7}$$

for nonnegative x. Verification is left to the reader, based on the fact that the derivatives of $\sqrt{1+x}$ alternate in sign. Functions having this property are said to be completely monotonic.

Let S_n be the sample covariance matrix with eigenvalues d_{n1} through d_{np} for the first n sample points. The sequence S_n converges almost surely to the true covariance matrix Σ with eigenvalues ω_1 through ω_p. Inequality (6) therefore implies $\lim_{n\to\infty}\sum_{i=1}^p (d_{ni}-\omega_i)^2 = 0$. On this basis we will argue that $\lim_{n\to\infty}\sum_{i=1}^p (e_{ni}-\omega_i)^2 = 0$ as well, where the e_{ni} are the transformed eigenvalues of S_n. To make this reasoning rigorous, we must show that the asymptotic expansion (4) is uniform as the eigenvalues d_{ni} converge to the eigenvalues ω_i. This is where the inequalities (7) come into play. Indeed, we have

$$\frac{\lambda(1-\alpha)}{n} - \frac{n}{2\lambda\alpha}\cdot\frac{x^2}{8} \le e_{ni} - d_{ni} \le \frac{\lambda(1-\alpha)}{n}, \qquad x = \frac{4\lambda\alpha d_{ni}}{n} + \frac{4\lambda^2\alpha(1-\alpha)}{n^2}. \tag{8}$$

The identity

$$\|S_n - \Sigma_n\|_F^2 = \sum_{i=1}^{p}(d_{ni} - e_{ni})^2$$

finishes the proof that Σ_n tends to Σ.

Now consider the question of asymptotic efficiency. The scaled difference √n(S_n − Σ) tends in distribution to a multivariate normal distribution with mean 0 because the sequence of estimators S_n is asymptotically efficient (Ferguson, 1996). The representation

$$\sqrt{n}(\Sigma_n - \Sigma) = \sqrt{n}(S_n - \Sigma) + \sqrt{n}(\Sigma_n - S_n)$$

and Slutsky's theorem (Ferguson, 1996) imply that √n(Σ_n − Σ) tends in distribution to the same limit. In this regard note that

$$\|\sqrt{n}(\Sigma_n - S_n)\|_F^2 = n\sum_{i=1}^{p}(d_{ni} - e_{ni})^2$$

tends almost surely to 0, owing to the bounds (8) and the convergence of d_{ni} to ω_i.
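A quick numerical check of the first expansion in (4), which drives the argument above (our illustration; cernn_eigenvalues is the sketch from Section 3):

```python
import numpy as np

# For fixed lambda and alpha, the gap between the exact map (2) and its
# first-order expansion d_i + lambda * (1 - alpha - alpha * d_i^2) / n
# vanishes much faster than 1/n, consistent with the O(1/n^2) remainder.
lam, alpha, d = 2.0, 0.5, np.array([0.4, 1.0, 3.0])
for n in [10, 100, 1000]:
    e = cernn_eigenvalues(d, n, lam, alpha)
    approx = d + lam * (1.0 - alpha - alpha * d**2) / n
    print(n, np.max(np.abs(e - approx)))
```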
CNR and linear estimators are also asymptotically very well behaved. The asymptotic properties of CNR are stated in terms of the entropy risk, which is the expectation of the entropy loss given below:

$$L_E(\Sigma_n, \Sigma) = \operatorname{tr}(\Sigma^{-1}\Sigma_n) - \log\det(\Sigma^{-1}\Sigma_n) - p. \tag{9}$$

CNR has asymptotically lower entropy risk than the sample covariance (Won et al., 2012). But since the sample covariance is a consistent estimator of the covariance, it follows that CNR is also a consistent estimator. Among all linear estimators, the LW estimator of Ledoit and Wolf (2004) is asymptotically optimal with respect to a quadratic risk. To make this claim more precise, consider the optimization problem

$$\min_{\gamma_1,\gamma_2} \|\tilde\Sigma - \Sigma\|_F^2 \quad\text{subject to}\quad \tilde\Sigma = \gamma_1 I + \gamma_2 S_n, \tag{10}$$

where the weights γ_1 and γ_2 are allowed to be random and can therefore depend on the data. One can think of the solution Σ* to (10) as the best possible linear combination of I and the sample covariance S_n. Even though Σ* may not be an estimator, since it is allowed to depend on the unseen population covariance Σ, Ledoit and Wolf (2004) prove that LW asymptotically attains the same quadratic risk as Σ*. Their results are actually even stronger than stated here, because p is allowed to increase along with n under suitable but somewhat complicated regularity conditions that we omit. Since the sample covariance is consistent, and its quadratic risk by definition exceeds the quadratic risk of Σ*, it follows that LW is also a consistent estimator.

5. Finite Sample Performance

Given their similar asymptotic behavior, to better understand the relative strengths of CERNN, CNR, and the optimal linear estimator LW, we conducted a finite sample study almost identical to the one carried out by Won et al. (2012). We assessed the estimators based on two commonly used criteria, namely the entropy loss (9) and the quadratic loss

$$L_Q(\hat\Sigma, \Sigma) = \|\hat\Sigma\Sigma^{-1} - I\|_F^2.$$

In our experiments, we simulated data in which we varied the ratio p/n and the spread of the eigenvalues of the population covariance. As noted earlier, linear shrinkage improves on the sample covariance when p/n is large or the population eigenvalues are close to one another. Won et al. report that when the eigenvalues of the population covariance are bimodal, CNR dramatically outperforms linear estimators if very few eigenvalues take on the high values. As the proportion of large eigenvalues grows, however, the discrepancy diminishes. CERNN shrinks extreme eigenvalues in a manner similar to CNR but less drastically, and shrinks intermediate eigenvalues similarly to linear estimators. In contrast, CNR leaves intermediate eigenvalues unchanged. Consequently, we anticipate that CERNN has the potential to improve on CNR in the latter case, where there is a need to shrink all eigenvalues, like linear estimators, but with extra emphasis on the extreme ones, unlike linear estimators.

We simulated p-dimensional zero mean multivariate normal samples with diagonal covariances with p = 125, 250, and 500. The eigenvalues took on either high values of 1 − υ + υp or low values of 1 − υ, where υ = 0.1. For each p, the number of high-value eigenvalues ranged from a single high value to 10%, 20%, 30%, and 40% of p. For each p we drew one of three sample sizes n such that the ratio r = p/n took on values of roughly 1.25, 2, or 4. For each choice of p, n, and r, we simulated 100 data sets and computed CNR, CERNN, and LW. For each data set, we chose an optimal κ_max and λ via 10-fold cross-validation. We used the R package CondReg to obtain the CNR estimate (Oh et al., 2013). LW specifies a penalization parameter γ based on the data. To expedite comparisons, we report the average ratio of the loss of other estimators to the loss of CERNN. When this ratio is less than one, the other estimator is performing better than CERNN, and when the ratio is greater than one, CERNN is performing better. Tables 1 and 2 summarize comparisons under the quadratic loss and entropy loss, respectively.
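For reference, the two loss criteria transcribe directly into code (our sketch, NumPy assumed):

```python
import numpy as np

def entropy_loss(Sigma_hat, Sigma):
    """Entropy loss (9): tr(Sigma^{-1} Sigma_hat) - ln det(Sigma^{-1} Sigma_hat) - p."""
    M = np.linalg.solve(Sigma, Sigma_hat)            # Sigma^{-1} Sigma_hat
    return np.trace(M) - np.linalg.slogdet(M)[1] - Sigma.shape[0]

def quadratic_loss(Sigma_hat, Sigma):
    """Quadratic loss: || Sigma_hat Sigma^{-1} - I ||_F^2."""
    p = Sigma.shape[0]
    return np.linalg.norm(Sigma_hat @ np.linalg.inv(Sigma) - np.eye(p), "fro") ** 2
```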
We first note that these experiments confirm what was previously observed by Won et al. (2012). Regardless of loss criterion, CNR typically outperforms linear estimators over a wide range of scenarios, but especially when few eigenvalues dominate and the ratio p/n is larger. Compared to CERNN, however, CNR soundly outperforms CERNN only in the extreme case of a single large eigenvalue. In all other scenarios, under either loss criterion, CERNN performs better. It is especially notable that CERNN performs very well in comparison to CNR in scenarios where CNR provides marginal improvement over linear estimators, namely when the fraction of high eigenvalues is highest at 40%. According to equation (3), recall that CNR only shrinks the most extreme sample eigenvalues and leaves the intermediate eigenvalues unchanged. It is not surprising then that it works best when there are very few large population eigenvalues and loses its competitive edge in the opposite circumstance. CERNN's more moderate approach of shrinking all eigenvalues, with extra emphasis on larger ones, appears to win out when there is a more balanced mix of high and low eigenvalues.

Table 1: Average ratio of quadratic loss of CNR and LW to that of CERNN. Values less than 1 indicate better performance than CERNN. Standard deviations are in parentheses.

singleton       CNR/CERNN                                  LW/CERNN
p/n      p = 125      p = 250      p = 500      p = 125      p = 250      p = 500
4        0.42 (0.13)  0.19 (0.03)  0.12 (0.01)  12.4 (6.05)  17.9 (3.97)  33.2 (4.23)
2        0.36 (0.07)  0.23 (0.03)  0.20 (0.02)  9.28 (2.76)  18.1 (2.66)  31.7 (2.09)
1.25     0.37 (0.04)  0.29 (0.03)  0.27 (0.02)  8.87 (1.59)  16.9 (1.48)  26.7 (1.18)

10% high
4        1.15 (0.20)  1.80 (0.12)  2.77 (0.13)  4.91 (0.53)  8.01 (0.42)  17.0 (0.66)
2        1.18 (0.11)  1.66 (0.08)  1.84 (0.06)  4.42 (0.30)  6.85 (0.28)  12.3 (0.41)
1.25     1.46 (0.08)  1.39 (0.05)  1.81 (0.05)  4.14 (0.21)  6.08 (0.20)  10.3 (0.29)

20% high
4        4.28 (0.70)  8.52 (0.86)  18.5 (1.06)  8.72 (0.88)  23.5 (1.62)  78.5 (3.32)
2        1.97 (0.16)  2.51 (0.13)  5.25 (0.16)  6.10 (0.37)  12.9 (0.53)  31.1 (1.00)
1.25     1.45 (0.07)  1.95 (0.07)  4.90 (0.15)  4.98 (0.20)  9.42 (0.31)  20.9 (0.61)

30% high
4        11.9 (1.90)  37.8 (3.17)  128 (6.03)   16.9 (1.74)  62.6 (3.49)  244 (7.32)
2        3.97 (0.36)  6.65 (0.34)  13.6 (0.31)  10.7 (0.67)  31.4 (1.24)  95.0 (2.22)
1.25     2.15 (0.10)  3.76 (0.12)  10.6 (0.30)  7.71 (0.36)  18.6 (0.63)  51.8 (1.56)

40% high
4        25.6 (3.18)  83.2 (5.06)  304 (10.7)   28.3 (2.54)  105 (4.60)   407 (10.9)
2        11.0 (1.10)  28.2 (1.61)  68.9 (3.48)  20.1 (1.23)  66.9 (2.28)  234 (4.74)
1.25     3.66 (0.22)  6.39 (0.14)  18.9 (0.27)  13.5 (0.68)  38.6 (1.05)  119 (1.62)
Table 2: Average ratio of entropy loss of CNR and LW to that of CERNN. Values less than 1 indicate better performance than CERNN. Standard deviations are in parentheses.

singleton       CNR/CERNN                                  LW/CERNN
p/n      p = 125      p = 250      p = 500      p = 125      p = 250      p = 500
4        0.83 (0.11)  0.56 (0.05)  0.40 (0.03)  5.38 (2.24)  9.20 (2.10)  18.7 (2.61)
2        0.78 (0.07)  0.59 (0.03)  0.49 (0.02)  5.24 (1.63)  11.9 (2.08)  24.6 (2.17)
1.25     0.79 (0.04)  0.66 (0.03)  0.56 (0.02)  5.75 (1.15)  12.8 (1.37)  24.4 (1.30)

10% high
4        1.11 (0.12)  1.34 (0.07)  1.62 (0.06)  1.43 (0.11)  1.83 (0.09)  3.19 (0.11)
2        1.04 (0.07)  1.08 (0.04)  1.16 (0.03)  1.29 (0.07)  1.34 (0.05)  1.81 (0.05)
1.25     1.13 (0.05)  0.96 (0.03)  1.26 (0.03)  1.29 (0.05)  1.20 (0.03)  1.38 (0.03)

20% high
4        1.90 (0.19)  2.71 (0.15)  3.74 (0.10)  2.16 (0.19)  4.16 (0.20)  8.27 (0.17)
2        1.19 (0.06)  1.20 (0.05)  1.97 (0.03)  1.73 (0.10)  2.93 (0.10)  5.57 (0.11)
1.25     0.92 (0.03)  1.03 (0.02)  2.14 (0.04)  1.42 (0.05)  2.18 (0.06)  4.05 (0.07)

30% high
4        2.50 (0.20)  4.04 (0.17)  6.49 (0.15)  2.69 (0.17)  5.06 (0.17)  9.06 (0.15)
2        1.50 (0.08)  1.72 (0.06)  2.39 (0.02)  2.44 (0.12)  4.58 (0.12)  8.53 (0.10)
1.25     0.96 (0.03)  1.24 (0.02)  2.64 (0.02)  2.11 (0.10)  3.78 (0.10)  7.13 (0.07)

40% high
4        2.80 (0.17)  4.44 (0.13)  7.29 (0.13)  2.79 (0.15)  4.91 (0.12)  8.45 (0.12)
2        2.09 (0.11)  3.04 (0.11)  4.23 (0.12)  2.81 (0.13)  4.91 (0.12)  8.51 (0.09)
1.25     1.13 (0.05)  1.33 (0.02)  2.66 (0.02)  2.67 (0.11)  4.79 (0.09)  8.64 (0.08)

6. Applications

Several common statistical procedures are potential beneficiaries of shrinkage estimation of sample covariance matrices. Here we illustrate how CERNN applies to discriminant analysis and clustering.

6.1. Discriminant Analysis

The classical discriminant function

$$\delta_k(x) = x^t\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^t\Sigma^{-1}\mu_k + \ln\pi_k$$

incorporates the mean μ_k and prior probability π_k of each class k. A new observation x is assigned to the class k maximizing δ_k(x). If there are c classes C_1, ..., C_c, then the standard estimator of Σ is

$$\hat\Sigma = \frac{1}{n-c}\sum_{k=1}^{c}\sum_{i\in\mathcal{C}_k}(x_i - \hat\mu_k)(x_i - \hat\mu_k)^t, \quad\text{where}\quad \hat\mu_k = \frac{1}{|\mathcal{C}_k|}\sum_{i\in\mathcal{C}_k} x_i.$$

One can obviously shrink Σ̂ to moderate its eigenvalues. In quadratic discriminant analysis, a separate covariance matrix Σ_k is assigned to each class k. These are estimated in the usual way, and eigenvalue shrinkage is likely even more beneficial than in linear discriminant analysis. Friedman (1989) advocates regularized discriminant analysis (RDA), a compromise between linear and quadratic discriminant analysis that shrinks Σ_k toward a common Σ via a convex combination γΣ_k + (1 − γ)Σ. Although Friedman also suggests shrinking toward class-specific multiples of the identity matrix, we do not consider his more complicated version here. Guo et al. (2007) shrink covariance estimates towards the identity matrix and also apply lasso shrinkage on the centroids to obtain improved classification performance in microarray studies. The main difference between CERNN and these methods is that CERNN performs nonlinear shrinkage of the sample eigenvalues.

Since we are primarily interested in the case where all or most of the predictors are instrumental in grouping, we consider only Friedman's method in a comparison on three data sets from the UCI machine learning repository (Bache and Lichman, 2013). In the case of the E. coli data set, we restricted analysis to the five most abundant classes. We split each data set into training and testing sets. In each experiment we used 1/5 of the data for training and 4/5 for testing.
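A sketch of the regularized quadratic discriminant classifier used in this comparison (our own reconstruction, reusing the cernn sketch above; in the paper a separate λ_k per class is chosen by cross-validation on the misclassification rate, but here λ and α are plain arguments):

```python
import numpy as np

def fit_qda_cernn(X, y, lam, alpha):
    """One CERNN-regularized covariance per class for quadratic discriminant analysis."""
    model = {}
    for k in np.unique(y):
        Xk = X[y == k]
        mu = Xk.mean(axis=0)
        Sk = np.cov(Xk, rowvar=False, bias=True)
        model[k] = (mu, cernn(Sk, len(Xk), lam, alpha), len(Xk) / len(X))
    return model

def predict_qda(model, x):
    """Assign x to the class with the largest Gaussian discriminant score."""
    def score(mu, Sigma, pi):
        diff = x - mu
        return (-0.5 * np.linalg.slogdet(Sigma)[1]
                - 0.5 * diff @ np.linalg.solve(Sigma, diff) + np.log(pi))
    return max(model, key=lambda k: score(*model[k]))
```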
Table 3 records the number of samples per group in each set. In these data-poor examples, even linear discriminant analysis is not viable, since a common sample covariance estimate will be ill-conditioned if not singular. Nonetheless, our results show that the combination of separate covariances with regularization works well. We modeled a separate covariance for each class and used 5-fold cross-validation to select c regularization parameters for CERNN. We used essentially the same procedure described in Section 3.1, except that we used the misclassification rate instead of the predictive negative loglikelihood. We also used 5-fold cross-validation to select the single γ parameter for RDA (Friedman, 1989). The testing results in Table 3 demonstrate that CERNN performs well in comparison with RDA. Even when it does not perform as accurately, its drop-off is small.

Table 3: Comparison of CERNN and RDA on three data sets from the UCI machine learning repository. The fourth column indicates the number of parameters (mean and covariance) per group in the QDA model. The fifth and sixth columns break down the number of samples per group. The last two columns report the classification success rate in the test set.

data     p    c    p(p+3)/2   samples (train)   samples (test)      CERNN   RDA
wine     13   3    104        13/13/10          46/58/38            0.859   0.627
seeds    7    3    35         14/15/13          56/55/57            0.929   0.935
ecoli    7    5    35         30/17/7/3/9       113/60/28/17/43     0.670   0.705

6.2. Covariance Regularized EM Clustering

We now show how CERNN stabilizes estimation in the standard EM clustering algorithm (McLachlan and Peel, 2000). Let φ(y | μ, Σ) denote a multivariate Gaussian density with mean μ and covariance Σ. EM clustering revolves around the mixture density

$$h(y \mid \Xi) = \sum_{k=1}^{c}\pi_k\,\phi(y \mid \mu_k, \Sigma_k)$$

with parameters Ξ = {μ_k, Σ_k, π_k}_{k=1}^c. The π_k are nonnegative mixture weights summing to 1. We are given n independent observations y_1, ..., y_n from the mixture density and wish to estimate Ξ by maximizing the loglikelihood. It is well known that the loglikelihood is unbounded from above and plagued by many suboptimal local extrema (McLachlan and Peel, 2000). Hathaway (1985) proposed constraints leading to a maximum likelihood problem free of singularities and beset by fewer local extrema. Later it was shown that these constraints could be met by imposing bounds on the largest and smallest eigenvalues of the Σ_k (Ingrassia, 2004; Ingrassia and Rocci, 2007). These findings suggest that employing our Bayesian prior to estimate the Σ_k can also improve the search for good local optima, since it shrinks the extreme sample eigenvalues toward the sample mean eigenvalue.

If z_ik is the indicator function of the event that observation i comes from cluster k, then the complete data loglikelihood plus logprior amounts to

$$\ell(\Xi) = \sum_{i=1}^{n}\sum_{k=1}^{c} z_{ik}\left[\ln\pi_k + \ln\phi(y_i \mid \mu_k, \Sigma_k)\right] - \frac{\lambda}{2}\sum_{k=1}^{c}\left[\alpha_k\|\Sigma_k\|_* + (1-\alpha_k)\|\Sigma_k^{-1}\|_*\right].$$

Straightforward application of Bayes' rule yields the conditional expectation

$$w_{ik} = E[z_{ik} \mid Y, \Xi] = \frac{\pi_k\,\phi(y_i \mid \mu_k, \Sigma_k)}{\sum_{l=1}^{c}\pi_l\,\phi(y_i \mid \mu_l, \Sigma_l)}.$$

These weights should be subscripted by the current iteration number m, but to avoid clutter we omit the subscripts. If we set

$$w_k = \sum_{i=1}^{n} w_{ik} \quad\text{and}\quad S_k = \frac{1}{w_k}\sum_{i=1}^{n} w_{ik}(y_i - \mu_k)(y_i - \mu_k)^t,$$

then the EM updates are π_k = w_k/n, $\mu_k = \frac{1}{w_k}\sum_{i=1}^n w_{ik} y_i$, and

$$\Sigma_k = \arg\min_{\Sigma}\; \frac{w_k}{2}\log\det\Sigma + \frac{w_k}{2}\operatorname{tr}(\Sigma^{-1}S_k) + \frac{\lambda}{2}\left[\alpha_k\|\Sigma\|_* + (1-\alpha_k)\|\Sigma^{-1}\|_*\right].$$
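The Σ_k update has exactly the form of the CERNN objective with n replaced by the effective sample size w_k, so the eigenvalue map (2) applies verbatim. A sketch of one EM iteration (ours; it reuses the cernn sketch above, sets a per-cluster α_k by the scale-matching rule discussed below, and skips empty clusters, also as discussed below):

```python
import numpy as np

def log_gauss(Y, mu, Sigma):
    """Row-wise log density of N(mu, Sigma)."""
    diff = Y - mu
    maha = np.sum(diff * np.linalg.solve(Sigma, diff.T).T, axis=1)
    return -0.5 * (maha + np.linalg.slogdet(Sigma)[1] + len(mu) * np.log(2 * np.pi))

def em_step(Y, pi, mus, Sigmas, lam):
    """One EM iteration with a CERNN-regularized M-step for each cluster covariance."""
    n, p = Y.shape
    logw = np.log(pi) + np.column_stack([log_gauss(Y, m, S) for m, S in zip(mus, Sigmas)])
    logw -= logw.max(axis=1, keepdims=True)          # stabilize before exponentiating
    W = np.exp(logw)
    W /= W.sum(axis=1, keepdims=True)                # E-step weights w_ik
    for k in range(len(pi)):
        wk = W[:, k].sum()
        if wk <= 0:                                  # leave a starved cluster untouched
            continue
        pi[k] = wk / n
        mus[k] = W[:, k] @ Y / wk
        diff = Y - mus[k]
        Sk = (W[:, k, None] * diff).T @ diff / wk
        alpha_k = 1.0 / (1.0 + (np.trace(Sk) / p) ** 2)  # per-cluster analogue of (5)
        Sigmas[k] = cernn(Sk, wk, lam, alpha_k)          # map (2) with n replaced by w_k
    return pi, mus, Sigmas
```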
We now address two practical issues. First, there is the question of how to choose α_k. In the previous examples we sought a stable estimate of a single covariance matrix. Here we seek c covariance matrices whose imputed data change from iteration to iteration. In accord with our previous choice for α, we set α_k by matching the prior scale to (1/p) tr(S_k), as in equation (5). This α_k changes dynamically as S_k changes. Second, it is possible for w_ik ≈ 0 for all i for a given k at some point as the algorithm iterates. Indeed, finite machine precision may assign w_k the value 0 at some point, making the updates for Σ_k and μ_k undefined. Consequently, we only update Σ_k and μ_k when w_k > 0. This is not a major issue, however, since this scenario only arises when no data points should be assigned to the kth cluster.

Besides the work of Ingrassia and Rocci (2007), similar approaches have been employed previously. Fraley and Raftery (2007) suggest a restricted parameterization of covariance matrices. While they offer a menu of parameterizations that cover a range of degrees of freedom, each model has a fixed number of degrees of freedom. One advantage of our model is that the degrees of freedom may be tuned continuously with a single parameter λ.

Figure 3 shows the results of clustering with our algorithm on a simulated data set. A total of 60 data points were generated from a mixture of 10 bivariate normals, corresponding to 59 parameters in the most general case. The number of observations per cluster ranged from 3 to 11. We set the number of clusters c to be 10 and considered several different λ values over a wide range (0.1, 10, 100, 10000). We could choose λ in a completely data-dependent way through cross-validation, but our main concern is to stabilize the procedure, so fine-tuning λ is not of paramount importance, especially when doing so complicates the procedure. We ran our algorithm 500 times using random initializations with the k-means++ algorithm (Arthur and Vassilvitskii, 2007) and kept the clustering that gave the greatest value of the expected loglikelihood

$$\sum_{i=1}^{n}\sum_{k=1}^{c} w_{ik}\left[\ln\pi_k + \ln\phi(y_i \mid \mu_k, \Sigma_k)\right].$$

The resulting clusterings are quite good over our broad range of λ values. It is notable that for three out of the four values of λ, even clusters 2 and 10, which overlap, are somewhat distinguished. The only missteps occur when λ = 10, where cluster 1 is split into two clusters and clusters 2 and 10 have been merged. The latter decision is reasonable given how much clusters 2 and 10 overlap.

[Figure 3: CERNN clustering projected onto the first two principal components of the data for λ = 0.1, 10, 100, and 10000. Ellipses depict the first two eigenvectors (and their corresponding eigenvalues) of the estimated covariances of each cluster.]

7. Discussion

The initial insight of Stein (1975) has led to several methods for shrinkage estimation of a sample covariance matrix S. These methods preserve the eigenvectors of S while pushing S towards a multiple of the identity matrix. Our Bayesian prior does precisely this in a nonlinear fashion. Finite sample experiments comparing CERNN and CNR show that the two complement each other. CNR performs better when only a few eigenvalues of the population covariance are very large. CERNN performs better when there is a more even mix of high and low eigenvalues. Both perform at least as well as, and often better than, the simple and asymptotically optimal LW estimator.

CERNN does require a singular value decomposition (SVD), as does CNR.
Although highly optimized routines for accurately computing the SVD are readily available, such calculations are not cheap. Randomized linear algebra may provide computational relief (Halko et al., 2011; Mahoney, 2011). If one can tolerate a small loss in accuracy, the SVD of a randomly sampled subset of the data or a random projection of the data can give an acceptable surrogate SVD.

Applications extend well beyond the classical statistical methods illustrated here. For example, in gene mapping with pedigree data, a covariance matrix is typically parameterized as a mixture of three components, one of which is the global kinship coefficient matrix capturing the relatedness between individuals in the study (Lange, 2002). The kinship matrix can be estimated from a high-density SNP (single nucleotide polymorphism) panel rather than calculated from possibly faulty genealogical records. Because a typical study contains thousands of individuals typed at hundreds of thousands of genetic markers, this application occurs in the regime n ≪ p. The construction of networks from gene co-expression data is another obvious genetic example (Horvath, 2011). Readers working in other application areas can doubtless think of other pertinent examples.

Acknowledgments

This research was supported by NIH (United States Public Health Service) grants GM53275 and HG006139.

References

Arthur, D., Vassilvitskii, S., 2007. k-means++: The advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '07. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, pp. 1027–1035.

Bache, K., Lichman, M., 2013. UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Banerjee, O., El Ghaoui, L., d'Aspremont, A., 2008. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research 9, 485–516.

Bickel, P. J., Levina, E., 2008. Regularized estimation of large covariance matrices. Annals of Statistics 36 (1), 199–227.

Cai, T., Zhou, H., 2012. Minimax estimation of large covariance matrices under ℓ1 norm. Statistica Sinica 22, 1319–1378.

Daniels, M. J., Kass, R. E., 2001. Shrinkage estimators for covariance matrices. Biometrics 57 (4), 1173–1184.
Dey, D. K., Srinivasan, C., 1985. Estimation of a covariance matrix under Stein's loss. Annals of Statistics 13 (4), 1581–1591.

Fan, J., Liao, Y., Mincheva, M., 2011. High-dimensional covariance matrix estimation in approximate factor models. Annals of Statistics 39 (6), 3320–3356.

Ferguson, T. S., 1996. A Course in Large Sample Theory. CRC Texts in Statistical Science Series. Chapman and Hall.

Fraley, C., Raftery, A. E., 2007. Bayesian regularization for normal mixture estimation and model-based clustering. Journal of Classification 24 (2), 155–181.

Friedman, J., Hastie, T., Tibshirani, R., 2008. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9 (3), 432–441.

Friedman, J. H., 1989. Regularized discriminant analysis. Journal of the American Statistical Association 84 (405), 165–175.

Guo, Y., Hastie, T., Tibshirani, R., 2007. Regularized linear discriminant analysis and its application in microarrays. Biostatistics 8 (1), 86–100.

Haff, L. R., 1991. The variational form of certain Bayes estimators. The Annals of Statistics 19 (3), 1163–1190.

Halko, N., Martinsson, P. G., Tropp, J. A., 2011. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review 53 (2), 217–288.

Hathaway, R. J., 1985. A constrained formulation of maximum-likelihood estimation for normal mixture distributions. Annals of Statistics 13 (2), 795–800.

Hero, A., Rajaratnam, B., 2011. Large-scale correlation screening. Journal of the American Statistical Association 106 (496), 1540–1552.

Hero, A., Rajaratnam, B., 2012. Hub discovery in partial correlation graphs. IEEE Transactions on Information Theory 58 (9), 6064–6078.

Horvath, S., 2011. Weighted Network Analysis: Applications in Genomics and Systems Biology. Springer, New York.

Huang, J. Z., Liu, N., Pourahmadi, M., Liu, L., 2006. Covariance matrix selection and estimation via penalised normal likelihood. Biometrika 93 (1), 85–98.

Ingrassia, S., 2004. A likelihood-based constrained algorithm for multivariate normal mixture models. Statistical Methods and Applications 13 (2), 151–166.

Ingrassia, S., Rocci, R., 2007. Constrained monotone EM algorithms for finite mixture of multivariate Gaussians. Computational Statistics & Data Analysis 51 (11), 5339–5351.

Khare, K., Rajaratnam, B., 2011. Wishart distributions for decomposable covariance graph models. Annals of Statistics 39, 514–555.

Lange, K., 2002. Mathematical and Statistical Methods for Genetic Analysis, 2nd Edition. Statistics for Biology and Health. Springer-Verlag, New York.

Ledoit, O., Wolf, M., 2004. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88 (2), 365–411.

Ledoit, O., Wolf, M., 2012. Nonlinear shrinkage estimation of large-dimensional covariance matrices. Annals of Statistics 40 (2), 1024–1060.

Mahoney, M. W., 2011. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning 3 (2), 123–224.

Marčenko, V. A., Pastur, L. A., 1967. Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik 1 (4), 507–536.

McLachlan, G., Peel, D., 2000. Finite Mixture Models. Wiley, New York.

Mirsky, L., 1975. A trace inequality of John von Neumann. Monatshefte für Mathematik 79, 303–306.

Oh, S.-Y., Rajaratnam, B., Won, J.-H., 2013. CondReg: Condition Number Regularized Covariance Estimation. R package version 0.16.
Peng, J., Wang, P., Zhou, N., Zhu, J., 2009. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association 104 (486), 735–746.

Pourahmadi, M., 2011. Covariance estimation: The GLM and regularization perspectives. Statistical Science 26 (3), 369–387.

Pourahmadi, M., 2013. High-Dimensional Covariance Estimation: With High-Dimensional Data. Wiley.

Pourahmadi, M., Daniels, M. J., Park, T., 2007. Simultaneous modelling of the Cholesky decomposition of several covariance matrices. Journal of Multivariate Analysis 98 (3), 568–587.

Rajaratnam, B., Massam, H., Carvalho, C. M., 2008. Flexible covariance estimation in graphical Gaussian models. Annals of Statistics 36 (6), 2818–2849.

Ravikumar, P., Wainwright, M. J., Raskutti, G., Yu, B., 2011. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics 5, 935–980.

Rohde, A., Tsybakov, A. B., 2011. Estimation of high-dimensional low-rank matrices. Annals of Statistics 39 (2), 887–930.

Sheena, Y., Gupta, A. K., 2003. Estimation of the multivariate normal covariance matrix under some restrictions. Statistics & Decisions 21 (4), 327–342.

Stein, C., 1975. Estimation of a covariance matrix. 39th Annual Meeting of the Institute of Mathematical Statistics, Atlanta.

Warton, D. I., 2008. Penalized normal likelihood and ridge regularization of correlation and covariance matrices. Journal of the American Statistical Association 103 (481), 340–349.

Won, J.-H., Lim, J., Kim, S.-J., Rajaratnam, B., 2012. Condition-number-regularized covariance estimation. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

Wu, W. B., Pourahmadi, M., 2003. Nonparametric estimation of large covariance matrices of longitudinal data. Biometrika 90 (4), 831–844.
