Robust regularized covariance matrix estimation: well-posedness and convergent algorithm

Mengxi Yi (School of Statistics, Beijing Normal University, Beijing, China)
David E. Tyler (Department of Statistics, Rutgers University, Piscataway, USA)

Abstract

In this paper, we study properties of penalized and structured M-estimators of multivariate scatter based on geodesically convex, but not necessarily smooth, penalty functions. Existence and uniqueness conditions for these penalized and structured estimators are given. However, we show that the standard fixed-point algorithm, which is usually applied to an M-estimation problem, does not necessarily converge for penalized M-estimation problems. Hence, we develop a new but simple re-weighting algorithm and prove that it has monotone convergence for a broad class of penalized and structured M-estimators of multivariate scatter.

Keywords: Constrained estimation, Geodesic convexity, M-estimation, MM-algorithm, Regularization, Reweighting algorithm.

1. Introduction and Motivation

Covariance matrix estimation is an important topic in multivariate statistics. For a random sample x_1, ..., x_n ∈ R^p, the classical estimate of the covariance matrix Σ = E[(x − E x)(x − E x)^T] is the empirical covariance matrix

S_n = (1/n) Σ_{i=1}^n (x_i − x̄)(x_i − x̄)^T, where x̄ = (1/n) Σ_{i=1}^n x_i.

When the sample size n is small relative to the dimension p, and in particular when n < p, S_n is well known to be a rather poor estimate of Σ, even when sampling from a multivariate normal distribution. In such cases, one may wish to consider more parsimonious models for the p(p+1)/2 parameters in the population covariance matrix, such as a Toeplitz model (Snyder et al., 1989), a factor model (Fan et al., 2008) or a graphical model (Dempster, 1972).
Alternatively, one may wish to use a regularized or penalized version of the sample covariance matrix, such as the Ledoit-Wolf estimator (Ledoit and Wolf, 2004) or the graphical lasso (Yuan and Lin, 2007), among others (Warton, 2008; Deng and Tsui, 2013). Since the negative log-likelihood under multivariate normal sampling, taken as a loss function, is convex in the inverse covariance matrix Σ^{-1}, it is natural to consider additive penalties which are also convex in Σ^{-1}. More recently, Tyler and Yi (2020) and Yi and Tyler (2020) studied a class of smooth and non-smooth penalties which do not have this convexity property, but rather are geodesically convex. By utilizing the little-known property that the negative log-likelihood under multivariate normal sampling is also geodesically convex (g-convex), they show that such penalized sample covariance problems yield unique solutions.

The sample covariance matrix is well known to be highly sensitive to outliers in the observations and to be a highly inefficient estimator when the samples are drawn from heavy-tailed distributions. Robust estimators can protect against this. Here we concentrate on robust penalized methods for estimating covariance matrices of multidimensional observations. In the context of robust estimation, the covariance matrix is often referred to as the pseudo-covariance or scatter matrix, to allow for the possible nonexistence of second-order moments. Our aim is to study robust versions of the penalized sample covariances, which are defined by applying a rigorous and general treatment of penalization to loss functions that are more robust than the normal negative log-likelihood. In particular, we consider the class of loss functions which give rise to the monotonic M-estimators of multivariate scatter.
Using the concept of geodesic convexity, we derive necessary and sufficient conditions ensuring the well-posedness of a broad class of penalized M-estimators of scatter. These well-posedness conditions, which guarantee both existence and uniqueness of the estimator, are generally weaker than the traditional conditions required for M-estimators of scatter.

Since their introduction, a proven approach for finding a solution to the M-estimating equations has been via a fixed-point algorithm, which can be related to iteratively re-weighted least squares algorithms (Tyler, 1997; Kent and Tyler, 1991). Although it is known to work in the unpenalized setting, this algorithm is not, in general, applicable to the penalized setting. Furthermore, even though a fixed-point algorithm has been successfully applied when using the Kullback-Leibler penalty (Ollila and Tyler, 2014), there are no convergence results for general penalties. One result we show in this paper is that this algorithm does not necessarily converge for other penalty functions, and in particular for the Riemannian penalty. Consequently, we propose in Section 3 a new but simple re-weighting algorithm. We prove that this algorithm always converges for any g-convex penalized or structured M-estimator of multivariate scatter, and that the convergence is monotone.

The rest of the paper is organized as follows. Section 2 reviews the M-estimators of multivariate scatter, introduces the general framework for the penalized version, and establishes some new results regarding their existence and uniqueness. In addition, Subsection 2.3 discusses a Bayesian interpretation of penalized M-estimation. In Section 3, the aforementioned re-weighting algorithm for the penalized case is presented and some examples are given.
Section 4 demonstrates the duality between the problem of minimizing a g-convex loss function and a minimization problem under a g-convex constraint. Algorithms for a general g-convex constraint problem are also given there. Section 5 includes concluding remarks. The concept of g-convexity is reviewed in Appendix A, and proofs are given in Appendix B.

2. Regularized M-estimators of multivariate scatter

2.1. Review of elliptical distributions and the M-estimators of scatter

A random vector x ∈ R^p is said to have a real elliptically symmetric distribution with center µ and scatter matrix Σ ∈ P_p, denoted by E(µ, Σ, g), if it has a density of the form

f(x) = C_{p,g} det(Σ)^{-1/2} g{(x − µ)^T Σ^{-1} (x − µ)},

where P_p represents the class of positive definite symmetric matrices of order p, C_{p,g} > 0 denotes a normalizing constant ensuring that f(x) integrates to one, and g: R_+ → R_+ is a given function, with R_+ = {x ∈ R | x ≥ 0}. The scatter matrix Σ is proportional to the covariance matrix whenever x possesses second moments. When the second moments do not exist, the scatter matrix Σ can be viewed as a generalization of the covariance matrix. Note that over the semi-parametric class of elliptical distributions, i.e. when the radial function g is not specified, Σ is only well defined up to proportionality. The function g determines the radial distribution of the elliptical population and hence the degree of its "heavy-tailedness". For example, when g(s) = exp(−s/2), one obtains the multivariate normal distribution. If x is a random vector having an elliptical distribution E(µ, Σ, g), then the standardized variable z = Σ^{-1/2}(x − µ) has a spherical distribution centered at 0, with density C_{p,g} g(z^T z).
The spherical vector z admits a decomposition z = r s, where r = ||z|| and s = z/||z|| are independent, with s being uniformly distributed on the unit sphere. For a detailed account of the properties of elliptical distributions, we refer readers to Kelker (1970) and Ollila et al. (2012).

Since this paper focuses on the covariance matrix and its generalizations, we presume throughout that the p-dimensional sample {x_1, ..., x_n} has been robustly centered by an estimate of µ, e.g. by the vector of marginal medians or the spatial median, or by a known center. Alternatively, the data may be replaced with their pairwise differences. Hereafter, we assume µ = 0.

For a given function g, the maximum likelihood estimator (MLE) of the scatter matrix is then the minimizer of the corresponding negative log-likelihood function

ℓ_ρ(Σ) = (1/n) Σ_{i=1}^n ρ(x_i^T Σ^{-1} x_i) + log{det(Σ)},   (1)

over all Σ > 0, i.e. over all positive definite symmetric matrices of order p, where ρ(s) = −2 log g(s). If ρ is differentiable, then setting the derivative of ℓ_ρ(Σ) with respect to Σ to zero yields the estimating equation

Σ = (1/n) Σ_{i=1}^n u(x_i^T Σ^{-1} x_i) x_i x_i^T,   (2)

where u(s) = ρ′(s) = −2 g′(s)/g(s) represents a weight function. The critical points of the objective function (1) then satisfy the estimating equation (2). M-estimators of scatter matrices were defined in Maronna (1976) as generalizations of the MLEs for an elliptical distribution, namely by allowing for a general ρ function in (1), which need not be related to an elliptical distribution. Hence we refer to (1) as an M-loss function in general, while equation (2) is called an M-estimating equation. Some examples of ML- and M-estimators are given below.

Sample covariance matrix.
For a Gaussian sample, ρ(s) = s and u(s) = ρ′(s) = 1, and so (1) becomes

ℓ(Σ; S_n) = tr(Σ^{-1} S_n) + log{det(Σ)},   (3)

where S_n = (1/n) Σ_{i=1}^n x_i x_i^T is the sample covariance matrix. When n > p, S_n is the MLE and an unbiased estimator of Σ.

MLE for an elliptical t-family. For the t_ν-distribution, g(s) = (1 + s/ν)^{−(ν+p)/2}, and so ρ(s) = (ν + p) log(ν + s) and u(s) = (ν + p)/(ν + s), with ν being the degrees of freedom. A t_ν-distribution has finite second moments, with Cov(x) = ν/(ν − 2) Σ, only when ν > 2. When ν = 1, the corresponding distribution is said to be a p-variate elliptical Cauchy distribution. The limiting case ν → ∞ yields the normal distribution. The loss function (1) associated with the elliptical t_ν-distribution is

ℓ_t(Σ) = ((ν + p)/n) Σ_{i=1}^n log(ν + x_i^T Σ^{-1} x_i) + log{det(Σ)}.

The existence and uniqueness of the t_ν-MLEs have been considered in Kent and Tyler (1991). As ν goes to zero, the t_ν-MLE yields Tyler's M-estimator (Tyler, 1987a), which is the unique solution, up to proportionality, to the M-estimating equation

Σ = (p/n) Σ_{i=1}^n x_i x_i^T / (x_i^T Σ^{-1} x_i).

Tyler's M-estimator also corresponds to the MLE of Σ when the data come from an angular central Gaussian distribution, for which ρ(s) = p log s and u(s) = p/s (Tyler, 1987b). The corresponding loss function can then be written as

ℓ_Tyler(Σ) = (p/n) Σ_{i=1}^n log(x_i^T Σ^{-1} x_i) + log{det(Σ)}.

For more details and properties of this estimator, see Tyler (1987a,b).

Huber's M-estimator. Huber's M-estimator (Huber, 1964, 1977) is defined via the weight function

u(s) = 1/b for s ≤ c², and u(s) = c²/(s b) for s > c²,

where c is a tuning constant chosen so that r = F_{χ²_p}(c²) for a given r (0 < r ≤ 1), and F_{χ²_p}(·) denotes the c.d.f. of the χ²_p distribution.
The scaling factor b is usually chosen so that the resulting M-estimate is consistent for the covariance matrix at the normal distribution. When r = 1, u(s) = 1, which corresponds to the sample covariance matrix, whereas r → 0 yields u(s) = p/s, corresponding to Tyler's M-estimator.

2.2. Penalized M-estimation of scatter

An obvious way to "robustify" the penalized sample covariance matrix is to replace S_n with an M-estimator or some other robust estimator of covariance. This may be a good approach for large sample sizes, but for smaller sample sizes the properties of the M-estimators and other affine equivariant estimators of covariance matrices tend not to differ substantially from those of the sample covariance matrix; see Tyler (2010). Given that regularized or penalized approaches are primarily of interest when the problem is ill-posed, alternative methods are worth exploring. In particular, penalized M-estimators of scatter have been proposed (Warton, 2008; Ollila and Tyler, 2014), which are defined as a minimizer over Σ > 0 of

L_ρ(Σ; η) = ℓ_ρ(Σ) + η Π(Σ),   (4)

where Π(Σ) denotes a non-negative penalty function and η ≥ 0 is a tuning parameter. In particular, the penalized sample covariance is defined as a minimizer of the penalized normal likelihood function

L(Σ; S_n; η) = tr(Σ^{-1} S_n) + log{det(Σ)} + η Π(Σ).   (5)

A detailed study of the penalized normal problem based on general penalties is given in Tyler and Yi (2020) and Yi and Tyler (2020). These studies include provably convergent algorithms for finding the minima of (5). However, finding a minimizer of (4) is more involved than in the penalized sample covariance case (5).

It has been known since their introduction that, under certain conditions on the function ρ(s) and on the sample, there exists a unique critical point of ℓ_ρ(Σ); see Maronna (1976) and Kent and Tyler (1991) for details.
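To make the penalized normal problem (5) concrete, the following is a minimal numerical sketch for the Kullback-Leibler penalty Π_KL(Σ) = tr(Σ^{-1}) + log det(Σ). The linear shrinkage form of the minimizer below follows from setting the gradient of (5) to zero; it is a standard calculation and not derived in the text, so it should be read as an illustration, and all function names here are ours, not the paper's.

```python
import numpy as np

def kl_penalized_objective(Sigma, S, eta):
    """L(Sigma; S_n; eta) of (5) with the Kullback-Leibler penalty
    Pi_KL(Sigma) = tr(Sigma^{-1}) + log det(Sigma)."""
    Sinv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)
    return np.trace(Sinv @ S) + logdet + eta * (np.trace(Sinv) + logdet)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))   # n = 20 centered observations in R^5
S = X.T @ X / 20                   # sample covariance matrix (mu = 0 assumed)
eta = 0.5

# Setting the gradient of (5) with respect to Sigma^{-1} to zero gives
# S + eta*I - (1 + eta)*Sigma = 0, i.e. the linear shrinkage form below.
Sigma_hat = (S + eta * np.eye(5)) / (1.0 + eta)

f_hat = kl_penalized_objective(Sigma_hat, S, eta)
# a stationary point of a g-convex objective is its global minimum,
# so the candidate should beat any other positive definite matrix
assert f_hat <= kl_penalized_objective(S, S, eta)
assert f_hat <= kl_penalized_objective(np.eye(5), S, eta)
```

The check at the end is only a sanity test at two competitor matrices; the global-minimum claim itself rests on the g-convexity of (5), as discussed below.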
This suggests that ℓ_ρ(Σ) may possess some convexity property, but, unlike the normal case, it is not necessarily convex in Σ^{-1}. Insightful work by Zhang et al. (2013), though, has shown ℓ_ρ(Σ) to be geodesically convex in Σ whenever the function ρ(s) is convex in log(s), i.e. whenever h: R → R defined by h(a) ≡ ρ(e^a) is convex. The notion of geodesic convexity is reviewed in the appendix. As with convexity, any local minimum of a g-convex function is a global minimum; when the function is differentiable, any critical point is a global minimum; and the set of all minima is g-convex. In addition, if a minimum exists, then it is unique when the function is strictly g-convex. Finally, the sum of two g-convex functions is g-convex, and the sum is strictly g-convex if either of the two summands is strictly g-convex. These basic properties of geodesic convexity can be applied to establish the following lemma.

Lemma 1. If ρ(s) and Π(Σ) are g-convex, then L_ρ(Σ; η) is g-convex and the solution set

A_ρ = {Σ* > 0 | L_ρ(Σ*; η) ≤ L_ρ(Σ; η) for all Σ > 0}   (6)

is g-convex. Furthermore, if either ρ(s) or Π(Σ) is strictly g-convex, then L_ρ(Σ; η) is strictly g-convex and hence the solution set A_ρ is either empty or contains a single element.

The above lemma applies when using the Kullback-Leibler penalty Π_KL(Σ) = tr(Σ^{-1}) + log{det(Σ)} or the Riemannian penalty Π_R(Σ) = ||log(Σ)||_F², since, in addition to being strictly convex in Σ^{-1} and log(Σ) respectively, they are known to be strictly g-convex; see Yi and Tyler (2020) for details.

As in the unpenalized case, the existence of a minimizer may depend on the data itself. Other than being non-empty, it is desirable for the solution set A_ρ to be compact.
The next lemma establishes that if these properties hold in the well-studied non-penalized case, η = 0, then they hold in the penalized case for any η > 0.

Lemma 2. Suppose ρ(s) and Π(Σ) are g-convex. If A_ρ is a non-empty compact set when η = η_o, then it is a non-empty compact set when η > η_o. Furthermore, if either ρ(s) or Π(Σ) is strictly g-convex, the unique minimizer Σ̂_η > 0 is a continuous function of η > η_o.

A sufficient condition to guarantee that A_ρ is a non-empty compact set when η = 0 is

P_n(V) < 1 − {p − dim(V)}/K_ρ for any proper subspace V ⊂ R^p,   (7)

where K_ρ = sup{k > 0 | s^k e^{−ρ(s)} → 0 as s → ∞} and P_n represents the empirical probability measure of the data set; see Kent and Tyler (1991). The constant K_ρ is referred to as the sill of ρ(s). For K_ρ > p and n > p, this condition holds with probability one when random sampling from a continuous multivariate distribution. When ρ(s) is differentiable, it is g-convex if and only if the function ψ(s) = s u(s) is non-decreasing, in which case K_ρ = ψ(∞). The function ψ(s) is related to the influence function of the M-estimator, with K_ρ being related to its gross error sensitivity (Hampel, 1968, 1974). Smaller values of K_ρ imply smaller gross error sensitivities, and hence condition (7) is more restrictive for more robust M-estimators.

When η > 0, condition (7) can usually be relaxed. For example, Sun et al. (2014) studied a family of shrinkage Tyler's estimators and provided sufficient conditions, dependent on the sample, for the existence and uniqueness of the estimators when using a Kullback-Leibler-type penalty.
For a general g-convex ρ-function that is bounded from below, Duembgen and Tyler (2016) stated that when using certain g-coercive penalty functions, like the Riemannian penalty Π_R or the symmetrized Kullback-Leibler penalty Π_KLs(Σ) = tr(Σ) + tr(Σ^{-1}), no conditions on the sample are needed when η > 0 to ensure that a unique minimizer exists. We show in the next theorem that, in general, if the penalty is g-coercive then, under some other conditions independent of the data, a minimum always exists for large enough η. This result holds even for n = 0.

Theorem 1. Suppose ρ(s) is bounded below, as well as g-convex, and Π(Σ) is g-convex and g-coercive. If log{det(Σ)}/Π(Σ) → 0 as log{det(Σ)} → −∞, the solution set A_ρ is non-empty for η > 0. If log{det(Σ)}/Π(Σ) is bounded below by −η_o as log{det(Σ)} → −∞, the solution set A_ρ is non-empty for η > η_o.

G-coercivity of Π(Σ) may not hold for some penalties of interest. In particular, it cannot hold for any shape penalty, i.e. a scale-invariant penalty for which Π(Σ) = Π(cΣ) for any c > 0. As noted within the proof of Theorem 1, the condition that log{det(Σ)} + η Π(Σ) be g-coercive is sufficient to guarantee the existence of a minimizer. This implies the following more general result.

Theorem 2. Suppose ρ(s) is bounded below, as well as g-convex, and Π(Σ) is g-convex. If log{det(Σ)} + η Π(Σ) is g-coercive for some η = η_o, then the solution set A_ρ is non-empty for any η ≥ η_o.

The conditions of Theorem 1 hold for the Kullback-Leibler penalty Π_KL, the symmetrized Kullback-Leibler penalty Π_KLs, and the Riemannian penalty Π_R, with η_o = 0. They do not hold for the trace precision matrix penalty Π_o(Σ) = tr(Σ^{-1}), considered in Ollila and Tyler (2014). The conditions of Theorem 2, though, do hold for Π_o(Σ) for any η > 0.
The conditions of Theorem 2 are still too strong for shape penalties. This can be seen by considering the sequence Σ_k = c_k I_p, with c_k → 0, where I_p is the identity matrix of order p. Here log{det(Σ_k)} + η Π(Σ_k) = p log c_k + η Π(I_p) → −∞ for any η ≥ 0. For shape penalties, the following modification of the above theorem can be used to establish the existence of the corresponding penalized M-estimators of scatter.

Theorem 3. Suppose ρ(s) is bounded below, as well as g-convex, and Π(Σ) is g-convex with Π(Σ) = Π(cΣ) for any c > 0. Also, suppose that for any sequence of matrices Γ > 0 whose eigenvalues γ_1 ≥ · · · ≥ γ_p > 0 satisfy γ_1 = 1 and γ_p → 0, one has log{det(Γ)} + η Π(Γ) → ∞ for some η = η_o. Then, provided n ≥ 1, K_ρ > p and P_n(0) < 1 − p/K_ρ, the solution set A_ρ is non-empty for any η ≥ η_o.

Examples of shape penalties which satisfy Theorem 3 are the Riemannian shape penalty Π_Rs(Σ) = ||log(Σ/det(Σ)^{1/p})||_F², with the theorem holding for any η > 0, and the non-smooth "elasso" penalty Π_E(Σ) = Σ_{i=1}^p a_i log λ_i, studied in Tyler and Yi (2020), for η > max{−1/a_j, j = 1, ..., p}, where λ_1 ≥ · · · ≥ λ_p are the eigenvalues of Σ and a_1 ≥ · · · ≥ a_p are fixed constants.

If we replace (1/n) Σ_{i=1}^n ρ(x_i^T Σ^{-1} x_i) in (4) with (1/n_1) Σ_{x_i ≠ 0} ρ(x_i^T Σ^{-1} x_i), where n_1 = #{x_i ≠ 0}, then Theorem 3 still holds if we replace the condition on P_n(0) with the condition that x_i ≠ 0 for some x_i. This is reasonable when we are only interested in the shape component, i.e. in functions of Σ such that H(Σ) = H(cΣ) for all c > 0, since x_i = 0 carries no information regarding shape.

The above theorems do not hold for the regularized Tyler's M-estimators, since ρ(s) = p log(s) is not bounded from below. For this case, we can apply the following theorem.

Theorem 4. Suppose ρ(s) = p log(s), and Π(Σ) is g-convex.
Also, suppose that for any sequence of matrices Γ > 0 whose eigenvalues γ_1 ≥ · · · ≥ γ_p > 0 satisfy γ_1 = 1 and γ_p → 0, one has log{det(Γ)} + η Π(Γ) → ∞ for some η = η_o. Then, the solution set A_ρ is non-empty for any η ≥ η_o.

2.3. Bayesian interpretation

In regression, adding an L_2 penalty is equivalent to using a Gaussian prior on β, while adding an L_1 penalty is equivalent to using a Laplacian prior, i.e. priors proportional to exp(−η β²) and exp(−η |β|), respectively. Thus the ridge and lasso estimators of β (Tibshirani, 1996) can be interpreted as the Bayesian posterior mode under these respective priors; see Li and Goel (2006) for a comprehensive review.

Similarly, the penalized objective function (4) can be related to the Bayesian posterior

det(Σ)^{−n/2} Π_{i=1}^n g(x_i^T Σ^{-1} x_i) · exp{−η Π(Σ)}.

When Π(Σ) is orthogonally invariant, the prior exp{−η Π(Σ)} corresponds to independent priors for P and λ, where Σ = P Λ P^T represents its spectral decomposition, with P having a Haar measure on O(p), and λ ∈ R^p, the vector of diagonal elements of Λ, having a density proportional to exp{−η π(log(λ))}.

Since L_ρ(Σ; η) is g-convex, the corresponding posterior density of Σ, say f(Σ), is a decreasing transformation of a g-convex function, and thus its superlevel sets {Σ ∈ P_p | f(Σ) ≥ t} are g-convex for any t > 0. As defined in Anderson (1955), a multivariate function h(x) is unimodal if its superlevel sets {x ∈ R^n | h(x) ≥ c} are convex for every real number c. This notion can be generalized to the space of positive definite symmetric matrices of order p; that is, unimodality of a matrix distribution function can be expressed by the condition that its superlevel sets are g-convex. Therefore, the posterior density of Σ is unimodal if the density generator function g is monotonically non-increasing.
Examples of unimodal elliptical densities include the normal, t-, angular Gaussian and logistic distributions. Analogous to Anderson (1955) and Khatri (1967), this generalization has applications to the concentration of matrix-variate distributions and to the coverage probability of simultaneous confidence intervals for the components of the covariance matrix of a matrix-variate population. For example, if X ∼ N(0, Σ_1) and Y ∼ N(0, Σ_2) with Σ_2 − Σ_1 ≥ 0, where X, Y ∈ R^p, then the distribution of X is more concentrated about 0 than that of Y, in the sense that P[Y ∈ J] ≤ P[X ∈ J] for every symmetric g-convex set J ⊂ R^p.

3. Reweighting algorithms

When ρ(s) and Π(Σ) are both differentiable, the minima of L_ρ(Σ; η) satisfy the M-estimating equation

Σ = (1/n) Σ_{i=1}^n u(x_i^T Σ^{-1} x_i) x_i x_i^T + η ∇Π(Σ^{-1}),

where u(s) = ρ′(s) and ∇Π(Σ^{-1}) denotes the derivative of Π with respect to Σ^{-1}. A common way to find a solution to an M-estimating equation is then to use the fixed-point (FP) algorithm

Σ_{k+1} = (1/n) Σ_{i=1}^n u(x_i^T Σ_k^{-1} x_i) x_i x_i^T + η ∇Π(Σ_k^{-1}).   (8)

For the unpenalized case, i.e. η = 0, and when equation (2) corresponds to the MLE for a scale mixture of normals, such as a multivariate t_ν-distribution, the FP algorithm arises from an application of a standard EM algorithm, whose properties have been well studied. In particular, the FP algorithm is thus monotonically decreasing in ℓ_ρ(Σ), meaning ℓ_ρ(Σ_{k+1}) ≤ ℓ_ρ(Σ_k). For monotonic M-estimates, i.e. when ψ(s) = s u(s) is non-decreasing, with ρ(s) unbounded and not necessarily related to a scale mixture of normals, Kent and Tyler (1991) show that the FP algorithm always converges to the unique solution, regardless of the choice of the initial value.
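As a concrete illustration of iteration (8), the sketch below implements it for the t_3 weight u(s) = (ν + p)/(ν + s) combined with the trace precision penalty Π_o(Σ) = tr(Σ^{-1}), the convergent case treated in Example 1 below (whose displayed iteration uses the gradient term 2η I_p). The data-generating step and all function names are our own illustration, not the paper's.

```python
import numpy as np

def fp_iteration_t(X, eta, nu=3.0, tol=1e-10, max_iter=1000):
    """Fixed-point algorithm (8) for the elliptical t_nu weight
    u(s) = (nu + p)/(nu + s) with the trace precision penalty
    Pi_o(Sigma) = tr(Sigma^{-1}), following the display in Example 1."""
    n, p = X.shape
    Sigma = np.eye(p)
    for _ in range(max_iter):
        Sinv = np.linalg.inv(Sigma)
        s = np.einsum('ij,jk,ik->i', X, Sinv, X)   # s_i = x_i^T Sigma_k^{-1} x_i
        w = (nu + p) / (nu + s)                    # u(s_i)
        Sigma_new = (X * w[:, None]).T @ X / n + 2 * eta * np.eye(p)
        if np.linalg.norm(Sigma_new - Sigma) < tol:
            return Sigma_new
        Sigma = Sigma_new
    return Sigma

# heavy-tailed synthetic sample: t_3 variates as normal / sqrt(chi2_3 / 3)
rng = np.random.default_rng(1)
Z = rng.standard_normal((100, 2))
r = np.sqrt(rng.chisquare(3.0, size=100) / 3.0)
X = Z / r[:, None]

Sigma_hat = fp_iteration_t(X, eta=0.5)

# residual of the penalized M-estimating equation at the returned iterate
Sinv = np.linalg.inv(Sigma_hat)
s = np.einsum('ij,jk,ik->i', X, Sinv, X)
w = (3.0 + 2) / (3.0 + s)
residual = np.linalg.norm((X * w[:, None]).T @ X / 100 + np.eye(2) - Sigma_hat)
```

The residual check verifies that the returned matrix (approximately) solves the penalized M-estimating equation; for penalties other than Π_KL and Π_o, the same iteration can fail to converge, as Example 1 shows for the Riemannian penalty.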
These conditions are satisfied, for example, by Huber's M-estimators of scatter, even though they do not correspond to MLEs based on a scale mixture of normals. Furthermore, it was later shown in Arslan (2004) that for this more general case the FP algorithm is monotonically decreasing in the corresponding M-objective function.

For the penalized case, Ollila and Tyler (2014) give a partial proof that the fixed-point algorithm converges when using the Kullback-Leibler penalty or the trace precision matrix penalty Π_o(Σ) = tr(Σ^{-1}). However, there is no general proof for other penalties. The following example shows that this algorithm does not always converge for the Riemannian penalty.

Example 1. We consider a sample x = {x_1, ..., x_100} ⊂ R² generated from a mean-zero elliptical t_3-distribution, with Σ being the identity matrix. We use the weight function generated from the t_3-distribution, i.e. u(s_i) = (ν + p)/(ν + s_i) in equation (8), where s_i = x_i^T Σ_k^{-1} x_i, ν = 3 and p = 2. First, consider the penalty Π_o(Σ) = tr(Σ^{-1}). The fixed-point algorithm is given by

Σ_{k+1} ← ((ν + p)/n) Σ_{i=1}^n x_i x_i^T / (ν + x_i^T Σ_k^{-1} x_i) + 2η I_p.

For a fixed η = 0.5, we compute Σ̂_η with two different initial values given, respectively, by

Σ_00 = [1, 0; 0, 1] and Σ_01 = [3, 1; 1, 3].

As expected, the fixed-point algorithm converges for both initial values. In both cases, convergence occurred after five steps and gives the estimate

Σ̂_η = [0.5136, −0.0072; −0.0072, 0.5038].

Next, consider the Riemannian penalty Π_R(Σ) = ||log Σ||_F², for which the fixed-point algorithm is given by

Σ_{k+1} ← ((ν + p)/n) Σ_{i=1}^n x_i x_i^T / (ν + x_i^T Σ_k^{-1} x_i) − 2η log(Σ_k) Σ_k.

For η = 0.5, we use this algorithm with the same data and the same two initial values as before. (i) With Σ_00, it converges at the 38th step and gives the estimate

Σ̂_η = [0.9758, −0.0063; −0.0063, 1.1291].

(ii) With Σ_01, it fails to converge.

In general, the minima of the M-optimization problem (4) do not have closed-form solutions. We show here, though, that if one knows how to minimize (5), the normal optimization problem, then one can use a simple reweighting algorithm to find a minimum of (4) whenever ρ(s) is concave as well as g-convex. In such cases, the weight function u(s) = ρ′(s) is non-increasing, while the "influence function" ψ(s) = s u(s) is non-decreasing. These are the original conditions imposed on the weight and influence functions for the multivariate M-estimators (Maronna, 1976). Examples which satisfy these conditions include the u(s) and ψ(s) functions associated with the sample covariance matrix, the maximum likelihood estimators of scatter under an elliptical t-distribution with ν degrees of freedom, and Huber's M-estimator of scatter. A first-order approximation to ℓ_ρ(Σ) yields the following monotonic algorithm.

Theorem 5. Suppose ρ(s) is concave and differentiable with u(s) = ρ′(s), and let L(Σ; M; η) be defined as in (5). Given an initial Σ_0 > 0, define the sequence

Σ_{k+1} = argmin_{Σ > 0} L(Σ; M(Σ_k); η), where M(Σ) = (1/n) Σ_{i=1}^n u(x_i^T Σ^{-1} x_i) x_i x_i^T.

Then L_ρ(Σ_{k+1}; η) ≤ L_ρ(Σ_k; η), i.e. L_ρ(Σ_k; η) is a non-increasing sequence.

For the unpenalized case, Σ_{k+1} = M(Σ_k), and so this reweighting algorithm coincides with the original fixed-point algorithm proposed in Maronna (1976) and Huber (1977). When using the Kullback-Leibler penalty or the trace precision matrix penalty, this algorithm can be shown to be equivalent to the FP algorithm. The algorithm can also be used with non-smooth penalties. In particular, it readily applies to the non-smooth elasso penalty Π_E(Σ).
This application is facilitated by the finite-step algorithm presented in Tyler and Yi (2020), which can efficiently find the minimum of the normal objective L(Σ; M(Σ_k); η). Note that a special case of the elasso penalty is the log-condition-number penalty Π_cn(Σ) = log(λ_1) − log(λ_p).

To establish the convergence of the sequence Σ_k, the solution set A_ρ needs to be non-empty and compact. We then have the following convergence theorem.

Theorem 6. Let ρ(s) be concave and g-convex, and Π(Σ) be g-convex. If the solution set A_ρ as defined by (6) is non-empty and compact, then the sequence Σ_k has an accumulation point, and any accumulation point, say Σ_∞, lies in A_ρ. Furthermore, if either ρ(s) or Π(Σ) is strictly g-convex, then A_ρ contains exactly one element, say Σ_∞, with Σ_k → Σ_∞.

Example 2. To illustrate our new algorithm, consider the same data set used in Example 1. Recall that when using the Π_o(Σ) penalty, our algorithm corresponds to the fixed-point algorithm. Unlike the Kullback-Leibler and Π_o(Σ) penalties, the normal minimization problem Σ_{k+1} = argmin_{Σ > 0} L(Σ; M(Σ_k); η) generally does not have a closed-form solution for other penalty functions. For the Riemannian penalty Π_R(Σ), however, a simple iterative algorithm is provided in Yi and Tyler (2020). Applying our proposed algorithm with η = 0.5 and the same two initial values from Example 1, we observe convergence for both Σ_00 and Σ_01. The algorithm converges at the 12th and 14th iterations, respectively, reaching the value of Σ̂_η given in Example 1.

Remark 1. Our algorithm can also be interpreted within the majorization-minimization (MM) framework, where L(Σ; M(Σ_k); η), the normal objective function, serves as a surrogate function that locally approximates the M-objective function L_ρ(Σ; η), with their difference minimized at the current point.
In other words, the M-objective function L_ρ(Σ; η) is upper-bounded by the surrogate function up to a constant, that is, L_ρ(Σ; η) ≤ L(Σ; M(Σ_k); η) + (1/n) Σ_{i=1}^n ρ(x_i^T M^{-1}(Σ_k) x_i). However, unlike traditional MM approaches such as those in Sun et al. (2014) and Wiesel (2012), which require deriving problem-specific surrogate functions for each combination of likelihood and penalty, limiting their applicability to Tyler's loss and Kullback-Leibler-type penalties, our re-weighting algorithm provides a unified framework. We establish global convergence for a broad class of loss functions ρ and penalty functions Π(Σ), without requiring case-by-case surrogate function construction. This generality enables practitioners to apply our method to diverse robust estimation problems without redeveloping the theoretical machinery for each new objective function.

Remark 2. The conditions in the above two theorems apply to the Kullback-Leibler and the trace precision matrix penalties. As already noted, our proposed re-weighting algorithm is equivalent to the FP algorithm when using either of these two penalties. Theorem 5 then implies that the FP algorithm given in Ollila and Tyler (2014) is monotonic. It was also noted that only a partial proof is given in Ollila and Tyler (2014) for the convergence of their FP algorithm; Theorem 6 provides a complete proof.

4. Constrained optimization

It has been noted that in some applications covariance matrices may possess certain structures, such as banded matrices, which are associated with time-varying moving average models, as studied in Bickel and Levina (2008), or Toeplitz matrices, which are used to model the correlation of cyclostationary processes in periodic time series; see Miller and Snyder (1987).
In such cases, one may wish to consider a structured maximum likelihood estimator or a structured M-estimator of scatter, defined as:

min_{Σ>0} ℓ_ρ(Σ), s.t. Σ ∈ M,   (9)

where M ⊂ P_p is a constraint set that characterizes the covariance structure. When the constraint set corresponds to banded or Toeplitz matrix classes, these form convex subsets of the positive semidefinite cone. Consequently, if ℓ_ρ(Σ) is the Gaussian negative log-likelihood, standard convex optimization methods can be used to solve the constrained problem (9). For Tyler's likelihood under structural constraints, Sun et al. (2016) applied an MM algorithm with convex surrogate functions but provided no convergence guarantees, establishing neither local nor global convergence. Furthermore, they did not address well-posedness, leaving unresolved when solutions exist and are unique.

Beyond convex structures, important classes of covariance matrices, such as the Kronecker matrices Srivastava et al. (2008), lack standard convexity. Additionally, as already noted, ℓ_ρ(Σ) is not convex in either Σ or Σ^{-1} for most M-estimators of scatter of interest. However, the class of Kronecker matrices is geodesically convex. We therefore establish well-posedness for the constrained problem and extend our penalized algorithm to handle constraints whenever M forms a geodesically convex set.

Typically, constrained optimization problems (9) do not have closed-form solutions and need to be solved using iterative numerical techniques. According to optimization theory, if both the objective function and the constraint set are convex, then, due to Lagrange duality, a constrained optimization problem is mathematically equivalent to a penalized optimization problem.
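This equivalence can be made concrete in a small numerical sketch. Everything below is an illustrative special case, not the paper's method: we take the Gaussian loss with the Kullback-Leibler-type penalty Π(Σ) = tr(Σ^{-1}T) + log det(Σ), for which the penalized minimizer is Σ̂_η = (S_n + ηT)/(1 + η) in closed form, and recover the constrained solution for a given bound κ by bisecting on η, using the monotonicity of η ↦ Π(Σ̂_η) (formalized in Theorem 7 below). The helper names are our own.

```python
import numpy as np

def sigma_hat(S, T, eta):
    # Penalized Gaussian estimate under the KL-type penalty (closed form).
    return (S + eta * T) / (1.0 + eta)

def penalty(Sigma, T):
    # Pi(Sigma) = tr(Sigma^{-1} T) + log det(Sigma)
    return np.trace(np.linalg.solve(Sigma, T)) + np.linalg.slogdet(Sigma)[1]

def constrained_estimate(S, T, kappa, hi=1e6, iters=200):
    """Solve min log det(Sigma) + tr(Sigma^{-1} S) s.t. Pi(Sigma) <= kappa
    by bisection on the penalty parameter eta, using that the achieved
    penalty value kappa(eta) is non-increasing in eta."""
    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if penalty(sigma_hat(S, T, mid), T) > kappa:
            lo = mid          # penalty value still too large: increase eta
        else:
            hi = mid
    return sigma_hat(S, T, hi)
```

The bisection is only viable because the map η ↦ Π(Σ̂_η) is monotone; for penalties without a closed-form Σ̂_η, the same outer loop would wrap an inner penalized solver, which is exactly why the text instead advocates solving the constrained problem directly via the iteration (11).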
Similarly, we note that under g-convexity, results for the penalized likelihood or M-estimation framework (4) readily apply to the constrained likelihood or M-estimation framework (9), due to the following duality between the two settings.

Theorem 7. Suppose L_ρ(Σ; η) has a unique minimizer Σ̂_η > 0 which is continuous as a function of η. Define κ_L = inf{Π(Σ) | Σ > 0} and κ_U = sup{Π(Σ̂_η) | η > η_o}, where η_o satisfies the conditions of Theorem 1 or Theorem 2. For κ_L < κ ≤ κ_U, there exists a unique solution Σ̃_κ > 0 to the problem arg min{ℓ_ρ(Σ) | Σ > 0, Π(Σ) ≤ κ}, with the solution Σ̃_κ being a continuous function of κ. Furthermore, for each η > η_o there exists a κ > 0, and vice versa, such that Σ̂_η = Σ̃_κ. The relationship between η and κ is given by κ(η) = Π(Σ̂_η) for η > η_o.

Therefore, if the g-convex constraint set can be expressed using a g-convex function, the existence of a solution to (9) follows immediately from the penalized problem (4). For example, for the condition number constraint problem first studied by Won et al. (2013), the constraint set {Σ ∈ P_p | λ_1/λ_p ≤ κ} can be related to the log-condition-number penalty Π_cn. Theoretically, one can compute the solution to (9) for a given κ by first computing the solution to (4) for a range of η and then finding the value of η which gives Π(Σ̂_η) = κ. This is unnecessarily complicated, since we can use the iterative algorithm (11) proposed below to solve (9) directly. The proof of the convergence of (11) makes use of the duality between (9) and (4).

We first give a lemma that characterizes the solution set of the constrained problem (9).

Lemma 3. Suppose M ⊂ P_p is a closed g-convex set, and ℓ_ρ(Σ) is g-coercive. Then the solution set

A*_ρ = {Σ* ∈ M | ℓ_ρ(Σ*) ≤ ℓ_ρ(Σ), Σ ∈ M}   (10)

is non-empty and g-convex.
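The geodesic notions underlying Lemma 3 (and reviewed in Appendix A) are easy to experiment with numerically. The sketch below, with helper names of our own, computes the geodesic path Σ_t = Σ_0^{1/2} (Σ_0^{-1/2} Σ_1 Σ_0^{-1/2})^t Σ_0^{1/2} between two positive definite matrices.

```python
import numpy as np

def spd_power(A, t):
    """A**t for a symmetric positive definite matrix, via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * w**t) @ V.T

def geodesic(S0, S1, t):
    """Point Sigma_t on the geodesic from S0 to S1 in the SPD manifold:
    Sigma_t = S0^{1/2} (S0^{-1/2} S1 S0^{-1/2})^t S0^{1/2}."""
    R, Rinv = spd_power(S0, 0.5), spd_power(S0, -0.5)
    return R @ spd_power(Rinv @ S1 @ Rinv, t) @ R

def logdet(A):
    return np.linalg.slogdet(A)[1]
```

One can use this, for instance, to verify that log det(Σ_t) = (1 − t) log det(Σ_0) + t log det(Σ_1) along the path, so log det is affine along geodesics (both g-convex and g-concave), a fact used repeatedly in the coercivity arguments.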
For a general closed g-convex constraint set, similar to the algorithm for the penalized problem, it is presumed that the solution to the constrained likelihood problem under the normal negative log-likelihood function ℓ(Σ; S_n) can be obtained. For some initial Σ_0 ∈ M, the (k+1)-th update of Σ is then given by

Σ_{k+1} = arg min_{Σ∈M} ℓ(Σ; M(Σ_k)),   (11)

where M(Σ) is the weighted covariance matrix defined in Theorem 5. We show below that at each step of this reweighting algorithm the objective function ℓ_ρ decreases.

Lemma 4. Suppose ρ(s) is strictly concave and differentiable with u(s) = ρ′(s). Given an initial Σ_0 ∈ M, define the sequence Σ_{k+1} as in (11). Then ℓ_ρ(Σ_{k+1}) < ℓ_ρ(Σ_k), i.e., ℓ_ρ(Σ_k) is a monotone decreasing sequence, unless Σ_{k+1} = Σ_k.

Finally, using these two previous lemmas, we can establish the convergence of the iterative reweighting algorithm (11).

Theorem 8. Let ρ(s) be strictly concave and g-convex. If the solution set A*_ρ as defined by (10) is non-empty and compact, then the sequence Σ_k, as defined by (11), has an accumulation point, with any accumulation point, say Σ_∞, in A*_ρ. Furthermore, if ρ(s) is strictly g-convex, then A*_ρ contains exactly one element, say Σ_∞, with Σ_k → Σ_∞.

Example 3. For matrix-valued data, say X_i ∈ R^{p_1×p_2}, i = 1, …, n, it is common to model the covariance structure by using a Kronecker product model; that is, Σ = Σ_1 ⊗ Σ_2, where Σ_1 ∈ P_{p_1} and Σ_2 ∈ P_{p_2} are proportional to the covariance matrices of the row and column variables, respectively. Further restrictions, such as group symmetry on the matrices Σ_1 and/or Σ_2, may also be imposed. Let K_j be a set of orthogonal matrices on R^{p_j}, j = 1, 2. Consider the set

M_2 = {(Σ_1, Σ_2) ∈ P_{p_1} ⊗ P_{p_2} | det(Σ_j) = 1, U_j Σ_j U_j^T = Σ_j, ∀ U_j ∈ K_j, j = 1, 2}.
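Before the group-symmetric updates are derived below, the simplest instance of this setting, with trivial symmetry groups K_1 = K_2 = {I}, can be sketched as one "flip-flop" pass of the constrained re-weighting iteration. The t-type weight u(s) = (p_1 p_2 + ν)/(ν + s) and the function name are our illustrative choices, not prescriptions of the text.

```python
import numpy as np

def kronecker_step(Xs, S1, S2, nu=3.0):
    """One flip-flop update for a Kronecker-structured M-estimator of scatter.

    Minimal sketch of the no-group-symmetry case (K_1 = K_2 = {I}), using
    the t-type weight u(s) = (p1*p2 + nu)/(nu + s) as an illustrative choice.
    """
    n = len(Xs)
    p1, p2 = Xs[0].shape
    S1inv, S2inv = np.linalg.inv(S1), np.linalg.inv(S2)
    T1, T2 = np.zeros((p1, p1)), np.zeros((p2, p2))
    for X in Xs:
        s = np.trace(S1inv @ X @ S2inv @ X.T)   # tr(S1^{-1} X S2^{-1} X^T)
        w = (p1 * p2 + nu) / (nu + s)           # weight u(s)
        T1 += w * (X @ S2inv @ X.T)
        T2 += w * (X.T @ S1inv @ X)
    T1 /= n * p2
    T2 /= n * p1
    # enforce the identification condition det(S_j) = 1
    T1 /= np.linalg.det(T1) ** (1.0 / p1)
    T2 /= np.linalg.det(T2) ** (1.0 / p2)
    return T1, T2
```

Iterating this step alternates the roles of the row and column scatter matrices, with the determinant normalization resolving the scale ambiguity inherent in Σ_1 ⊗ Σ_2 = (cΣ_1) ⊗ (Σ_2/c).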
Following Wiesel (2012b), it can be shown that M_2 is a g-convex set. So, applying our constrained re-weighting algorithm, an M-estimator of scatter having both the Kronecker and group symmetry structure can be computed via

Σ̃_{1,k+1} = (1/(n p_2 |K_1||K_2|)) ∑_{U_j∈K_j, j=1,2} ∑_{i=1}^n u{tr(Σ_{1,k}^{-1} Y_i Σ_{2,k}^{-1} Y_i^T)} Y_i Σ_{2,k}^{-1} Y_i^T,
Σ̃_{2,k+1} = (1/(n p_1 |K_1||K_2|)) ∑_{U_j∈K_j, j=1,2} ∑_{i=1}^n u{tr(Σ_{1,k}^{-1} Y_i Σ_{2,k}^{-1} Y_i^T)} Y_i^T Σ_{1,k}^{-1} Y_i,
Σ_{1,k+1} = Σ̃_{1,k+1} / (det(Σ̃_{1,k+1}))^{1/p_1},
Σ_{2,k+1} = Σ̃_{2,k+1} / (det(Σ̃_{2,k+1}))^{1/p_2},

where Y_i = U_1 X_i U_2.

• Remark 3. Note that if we only assume the covariance has the group symmetry structure, then our algorithm reduces to the one developed in Soloveychik et al. (2015). If we only assume the Kronecker structure, use Tyler's weight function, and change the identification condition det(Σ_j) = 1 to ∥Σ_j∥_2 = 1, our algorithm becomes a generalization of the algorithm considered in Soloveychik and Trushin (2016).

• Remark 4. Example 3 can be readily generalized to K-mode multi-way array data X_i ∈ R^{p_1×···×p_K}, i = 1, …, n, whose covariance has the tensor structure Σ = Σ_1 ⊗ ··· ⊗ Σ_K, where Σ_k ∈ P_{p_k}, k = 1, …, K. It can again be shown that the corresponding set M_K is g-convex, and so our constrained re-weighting algorithm can be applied to construct M-estimators of scatter for tensor covariance models.

5. Concluding remarks

There has been considerable interest, especially in the signal processing and EE community, in penalized and structured M-estimation of multivariate scatter. Much of the previous literature addresses more specific problems, with the corresponding arguments not always being complete. In this paper, we have established a unified theoretical and computational framework for penalized and structured M-estimators of multivariate scatter.
By leveraging geodesic convexity, we provide comprehensive well-posedness results that apply to broad classes of penalty functions and structural constraints, moving beyond the case-by-case analyses prevalent in the existing literature.

The theoretical analysis resolves several open questions in robust covariance estimation. Existence and uniqueness conditions are established for penalized M-estimators under geodesically convex penalties, including non-smooth cases previously unaddressed. For structured estimation, the duality relationship between penalized and constrained formulations immediately yields existence results for geodesically convex constraint sets.

Our re-weighting algorithm addresses fundamental convergence failures of the standard fixed-point iteration for penalty functions beyond the Kullback-Leibler-type penalties. The algorithm achieves monotone convergence for general geodesically convex penalties, while maintaining the same convergence rate as fixed-point methods when they do converge. Critically, the algorithm reduces each iteration to solving a penalized Gaussian problem, a well-studied subproblem with efficient solutions, thereby providing practitioners with an implementable method that requires minimal problem-specific derivation. The extension to constrained problems demonstrates that the same algorithmic framework applies to geodesically convex constraint sets, unifying the treatment of penalized and structured estimation. This generality encompasses non-convex but geodesically convex structures, including covariance structures such as Kronecker products, group symmetry, and the log-condition number, providing convergence guarantees absent in previous work.

Acknowledgments

This work was supported by the National Science Foundation Grants DMS-1407751 and DMS-1812198, the National Natural Science Foundation of China (No.
12101119), and the Fundamental Research Funds for the Central Universities.

Appendix A. Geodesic convexity

The set of symmetric positive definite matrices of order p can be viewed as a Riemannian manifold, with the geodesic path from Σ_0 > 0 to Σ_1 > 0 being given by Σ_t = Σ_0^{1/2} {Σ_0^{-1/2} Σ_1 Σ_0^{-1/2}}^t Σ_0^{1/2} for 0 ≤ t ≤ 1; see Wiesel (2012) and Duembgen and Tyler (2016) for more details. An alternative representation for this path is given by Σ_t = B e^{t∆} B^T, where Σ_0 = B B^T and Σ_1 = B e^∆ B^T, with ∆ being a diagonal matrix of order p. A function f(Σ) is said to be geodesically convex, or g-convex, if and only if f(Σ_t) ≤ (1−t) f(Σ_0) + t f(Σ_1) for 0 < t < 1, and it is strictly g-convex if strict inequality holds.

As previously noted, g-convex functions possess properties similar to those of convex functions. In particular, a g-convex function defined on Σ > 0 is continuous, and any critical point of a g-convex function corresponds to a global minimum. When the g-convex function is differentiable, then Σ_0 = B B^T is a global minimum if and only if

lim_{t→0+} {f(B e^{t∆} B^T) − f(B B^T)}/t ≥ 0, for any diagonal matrix ∆ of order p,   (12)

see e.g. Lemma 3.11 in Duembgen and Tyler (2016). Furthermore, the set of all minima of a g-convex function forms a non-empty compact set if and only if the function is also g-coercive. A function f(Σ) defined for Σ > 0 is said to be g-coercive if f(Σ) → ∞ as ∥log(Σ)∥ → ∞. More generally, for a g-convex and g-coercive function, the level sets C_c = {Σ > 0 | f(Σ) ≤ c} are non-empty compact g-convex sets for any c ≥ inf_{Σ>0} f(Σ), where a subset C of symmetric positive definite matrices is said to be g-convex if Σ_0, Σ_1 ∈ C implies Σ_t ∈ C.

Appendix B. Proofs

Proof of Lemma 1. As previously noted, this lemma follows from properties of g-convex functions. In particular, it has been shown by Zhang et al.
(2013) that when ρ(s) is g-convex, then ℓ_ρ(Σ) is a g-convex function of Σ > 0, and is strictly g-convex when ρ(s) is strictly g-convex and at least one x_i ≠ 0. Hence, the g-convexity properties of L_ρ(Σ; η) follow. □

Proof of Lemma 2. The first part of the lemma follows from noting that, for any fixed Σ, the term L_ρ(Σ; η) is non-decreasing in η. Hence, if L_ρ(Σ; η) is g-coercive at η = η_o, then it is g-coercive at η > η_o. The result then follows since A_ρ is a non-empty compact set if and only if L_ρ(Σ; η) is g-coercive.

To prove continuity, we first state the following general lemma.

Lemma 5. Let D be a closed subset of R^p. Suppose the real-valued functions f(x) and g(x) are continuous on D, with g(x) > 0. Furthermore, suppose h(x; η) = f(x) + η g(x) has a unique minimum in D for any 0 ≤ η_o ≤ η ≤ η_1. If the set {x ∈ D | h(x; η_o) ≤ c} is compact for any c ≥ inf{h(x; η_o) | x ∈ D}, then the function x(η) = arg inf{h(x; η) | x ∈ D} is continuous for η_o ≤ η < η_1.

To prove this lemma, first note that h(x; η) is increasing in η, and so the set {x(η) | η_o ≤ η < η_1} is contained in the compact set {x | h(x; η_o) ≤ h(x(η_1); η_1)}. So, if η_k → η, then x(η_k) has a convergent subsequence, say x(η_{k′}) → x̃. By definition, h(x(η_{k′}); η_{k′}) ≤ h(x(η); η_{k′}). By continuity, the left-hand side converges to h(x̃; η) and the right-hand side converges to h(x(η); η). By uniqueness, this implies x̃ = x(η). Hence, x(η_k) → x(η), which establishes Lemma 5.

The continuity assertion follows from Lemma 5, with f(Σ) = ℓ_ρ(Σ) and g(Σ) = Π(Σ), since, by g-convexity, ℓ_ρ(Σ) and Π(Σ) are continuous, and, by g-coercivity, the level sets of L_ρ(Σ; η) are compact. □

Proof of Theorem 1. We only need to show that L_ρ(Σ; η) is g-coercive for some η > 0.
Since ρ(s) is bounded below, it is sufficient to show that D(Σ; η) = log{det(Σ)} + η Π(Σ) = Π(Σ){η + log{det(Σ)}/Π(Σ)} is g-coercive. Consider a sequence Σ_k > 0 such that ∥log Σ_k∥ → ∞, which implies Π(Σ_k) → ∞ since Π(Σ) is g-coercive. If log{det(Σ_k)} is bounded below, then it readily follows that D(Σ_k; η) → ∞ for any η > 0. Also, if log{det(Σ_k)} → −∞, then log{det(Σ_k)}/Π(Σ_k) > −η_o for large enough k, and so D(Σ_k; η) → ∞ for η > η_o. □

Proof of Theorem 2. As already noted, this theorem is established within the proof of Theorem 1. □

Proof of Theorem 3. Since x_i^T Σ^{-1} x_i ≥ s_i/λ_1, where s_i = x_i^T x_i, and ρ(s) is non-decreasing, it follows that L_ρ(Σ; η) ≥ H_ρ(λ_1) + h(Γ; η), where Γ = Σ/λ_1,

H_ρ(λ_1) = (1/n) ∑_{i=1}^n {ρ(s_i/λ_1) + p log λ_1} and h(Γ; η) = log{det(Γ)} + η Π(Γ).

Now, ∥log Σ∥_F → ∞ if and only if λ_1 → ∞, λ_1 → 0, or γ_p → 0. Since γ_1 = 1, it follows by the conditions of the theorem that if γ_p → 0 then h(Γ; η) → ∞ for η ≥ η_o. If λ_1 → ∞, then H_ρ(λ_1) → ∞, whereas if λ_1 → 0, then, by the definition of the sill K_ρ and the condition on P_n(0), H_ρ(λ_1) → ∞. Thus, if λ_1 → ∞, λ_1 → 0, and/or γ_p → 0, then L_ρ(Σ; η) → ∞ for η ≥ η_o. Hence L_ρ(Σ; η) is g-coercive for such η, and the theorem then follows. □

Proof of Theorem 4. Since x_i^T Σ^{-1} x_i ≥ x_i^T x_i/λ_1, it follows that

L_ρ(Σ; η) ≥ (p/n) ∑_{i=1}^n log(x_i^T x_i) + log{det(Γ)} + η Π(Σ),

where Γ = Σ/λ_1. The result then follows by an argument similar to the proof of Theorem 1. □

Proof of Theorem 5. Since ρ(s) is concave, ρ(s) ≤ ρ(s_o) + (s − s_o) u(s_o), and so

ave{ρ(x_i^T Σ_{k+1}^{-1} x_i)} ≤ ave{ρ(x_i^T Σ_k^{-1} x_i)} + tr{(Σ_{k+1}^{-1} − Σ_k^{-1}) M(Σ_k)}.

Furthermore, by definition, L(Σ_{k+1}; M(Σ_k); η) ≤ L(Σ_k; M(Σ_k); η).
Together, these two inequalities imply

L_ρ(Σ_{k+1}; η) = L(Σ_{k+1}; M(Σ_k); η) + ave{ρ(x_i^T Σ_{k+1}^{-1} x_i)} − tr{Σ_{k+1}^{-1} M(Σ_k)}
≤ L(Σ_k; M(Σ_k); η) + ave{ρ(x_i^T Σ_k^{-1} x_i)} − tr{Σ_k^{-1} M(Σ_k)}   (13)
= L_ρ(Σ_k; η). □

Proof of Theorem 6. We first establish the following lemma.

Lemma 6. The sequence Σ_k has an accumulation point, with any accumulation point, say Σ_∞, satisfying L(Σ_∞; M_∞; η) = inf_{Σ>0} L(Σ; M_∞; η), where M_∞ = M(Σ_∞).

To show this, recall that the solution set A_ρ is non-empty and compact if and only if the function L_ρ(Σ; η) is g-coercive. Consequently, the set S_o = {Σ > 0 | L_ρ(Σ; η) ≤ L_ρ(Σ_0; η)} is g-convex and compact, with the sequence Σ_k ∈ S_o. Hence, there exists a convergent subsequence Σ_{k′} such that Σ_{k′} → Σ_∞ and Σ_{k′+1} → Σ_{1,∞}, where Σ_{1,∞} = arg min_{Σ>0} L(Σ; M_∞; η). Since L_ρ(Σ_k; η) is non-increasing, it follows that L_ρ(Σ_∞; η) = L_ρ(Σ_{1,∞}; η). This implies L(Σ_{1,∞}; M_∞; η) = L(Σ_∞; M_∞; η), since otherwise (13), with Σ_k set to Σ_∞, would be a strict inequality, which contradicts L_ρ(Σ_∞; η) = L_ρ(Σ_{1,∞}; η). Hence Lemma 6 holds.

Continuing, since Σ_∞ is a minimizer of the g-convex function L(Σ; M_∞; η), it follows from (12) that lim_{t→0+} {L(B e^{t∆} B^T; M_∞; η) − L(B B^T; M_∞; η)}/t ≥ 0 for any diagonal matrix ∆ and for B B^T = Σ_∞. Next, express L_ρ(Σ; η) = L(Σ; M_∞; η) + Q(Σ), where Q(Σ) = n^{-1} ∑_{i=1}^n {ρ(x_i^T Σ^{-1} x_i)} − tr(Σ^{-1} M_∞). The function Q is differentiable, and it can be readily shown that lim_{t→0+} {Q(B e^{t∆} B^T) − Q(B B^T)}/t = 0. Thus, lim_{t→0+} {L_ρ(B e^{t∆} B^T; η) − L_ρ(B B^T; η)}/t ≥ 0 for any diagonal ∆, and hence Σ_∞ is a minimizer of the g-convex function L_ρ(Σ; η). □

Proof of Theorem 7. First note that κ(η) is continuous and non-increasing, with κ_L ≤ κ(η) ≤ κ_U.
Continuity follows since both Π and Σ̂_η are continuous. To prove that κ(η) is non-increasing, suppose η_1 < η_2 and define, for j = 1, 2, Σ̂_j = Σ̂_{η_j}, ℓ_{ρ,j} = ℓ_ρ(Σ̂_j) and κ_j = κ(η_j). By the definition of Σ̂_j, it follows that ℓ_{ρ,1} + η_1 κ_1 ≤ ℓ_{ρ,2} + η_1 κ_2 and ℓ_{ρ,2} + η_2 κ_2 ≤ ℓ_{ρ,1} + η_2 κ_1, which together imply that η_1(κ_1 − κ_2) ≤ ℓ_{ρ,2} − ℓ_{ρ,1} ≤ η_2(κ_1 − κ_2). Hence κ_1 ≥ κ_2. Furthermore, if κ_1 = κ_2, then ℓ_{ρ,1} = ℓ_{ρ,2} and ℓ_{ρ,1} + η_1 κ_1 = ℓ_{ρ,2} + η_1 κ_2, which, by uniqueness, implies Σ̂_1 = Σ̂_2.

Next, for κ_L < κ ≤ κ_U, define η(κ) = inf{η ≥ 0 | κ(η) = κ}, and note that Π(Σ̂_{η(κ)}) = κ. It readily follows from the previous paragraph that η(κ) is strictly decreasing and continuous from above. By definition, ℓ_ρ(Σ̂_{η(κ)}) + η(κ)κ ≤ ℓ_ρ(Σ) + η(κ)Π(Σ) for any Σ > 0, and so ℓ_ρ(Σ̂_{η(κ)}) − ℓ_ρ(Σ) ≤ η(κ)(Π(Σ) − κ). Hence, if Π(Σ) ≤ κ, then ℓ_ρ(Σ̂_{η(κ)}) ≤ ℓ_ρ(Σ), which implies Σ̃_κ = Σ̂_{η(κ)}. Since Σ̂_η is continuous, it follows that Σ̃_κ is continuous at points of continuity of η(κ). It is also continuous at points of discontinuity of η(κ), since Σ̂_η is constant on the sets {η | κ(η) = κ}. □

Proof of Lemma 3. If ℓ_ρ(Σ) = ∞ for every Σ ∈ M, then every point is a global minimizer. Otherwise, assume there exists a Σ_0 ∈ M such that ℓ_ρ(Σ_0) < ∞. In this case, it follows that ℓ*_ρ := inf_{Σ∈M} ℓ_ρ(Σ) ≤ ℓ_ρ(Σ_0) < ∞. Now let Σ_k be a minimizing sequence for the optimization problem. By g-coercivity, Σ_k is bounded, and thus admits a convergent subsequence Σ_{k_j} → Σ* ∈ M, since M is closed. Therefore,

inf_{Σ∈M} ℓ_ρ(Σ) = ℓ*_ρ = lim_{k→∞} ℓ_ρ(Σ_k) = lim_{k_j→∞} ℓ_ρ(Σ_{k_j}) = lim inf_{k_j→∞} ℓ_ρ(Σ_{k_j}) ≥ ℓ_ρ(Σ*),

which shows that Σ* is a global minimizer in M; thus A*_ρ is non-empty.
Suppose Σ_1, Σ_2 ∈ M with ℓ_ρ(Σ_1) ≤ ℓ_ρ(Σ) and ℓ_ρ(Σ_2) ≤ ℓ_ρ(Σ) for all Σ ∈ M. Then, by the g-convexity of M, Σ_t ∈ M, and, by the g-convexity of ℓ_ρ,

ℓ_ρ(Σ_t) ≤ (1−t) ℓ_ρ(Σ_1) + t ℓ_ρ(Σ_2) ≤ ℓ_ρ(Σ), Σ ∈ M.

Thus Σ_t ∈ A*_ρ, and A*_ρ is g-convex. □

Proof of Lemma 4. Since ρ(s) is strictly concave, ρ(s) < ρ(s_o) + (s − s_o) u(s_o), and so

ave{ρ(x_i^T Σ_{k+1}^{-1} x_i)} < ave{ρ(x_i^T Σ_k^{-1} x_i)} + tr{(Σ_{k+1}^{-1} − Σ_k^{-1}) M(Σ_k)}.

Furthermore, by definition, ℓ(Σ_{k+1}; M(Σ_k)) ≤ ℓ(Σ_k; M(Σ_k)), where Σ_k, Σ_{k+1} ∈ M. Together, these two inequalities imply

ℓ_ρ(Σ_{k+1}) = ℓ(Σ_{k+1}; M(Σ_k)) + ave{ρ(x_i^T Σ_{k+1}^{-1} x_i)} − tr{Σ_{k+1}^{-1} M(Σ_k)}
< ℓ(Σ_k; M(Σ_k)) + ave{ρ(x_i^T Σ_k^{-1} x_i)} − tr{Σ_k^{-1} M(Σ_k)}
= ℓ_ρ(Σ_k). □

Proof of Theorem 8. The sequence Σ_k has a convergent subsequence Σ_{k_j} → Σ_∞. By the continuity of ℓ_ρ, we have ℓ_ρ(Σ_{k_j}) → ℓ_ρ(Σ_∞) as j → ∞. First, we show that ℓ_ρ(Σ_k) → ℓ_ρ(Σ_∞) as k → ∞. Observe from Lemma 4 that ℓ_ρ is monotonically decreasing along the sequence {Σ_k}_{k=0}^∞. Hence, we must have ℓ_ρ(Σ_k) − ℓ_ρ(Σ_∞) ≥ 0 for all k. Now, since ℓ_ρ(Σ_{k_j}) → ℓ_ρ(Σ_∞) as j → ∞, given ϵ > 0, there exists a j_0 such that ℓ_ρ(Σ_{k_j}) − ℓ_ρ(Σ_∞) < ϵ for all j ≥ j_0. Hence, for k ≥ k_{j_0},

ℓ_ρ(Σ_k) − ℓ_ρ(Σ_∞) = ℓ_ρ(Σ_k) − ℓ_ρ(Σ_{k_{j_0}}) + ℓ_ρ(Σ_{k_{j_0}}) − ℓ_ρ(Σ_∞) < ϵ.

Thus ℓ_ρ(Σ_k) → ℓ_ρ(Σ_∞).

Now we want to show that Σ_∞ is a solution. We prove this by contradiction. Suppose that Σ_∞ is not a solution. Consider the sequence {Σ_{k_j+1}}_{j=1}^∞, where Σ_{k_j+1} = arg min_{Σ∈M} ℓ(Σ; M(Σ_{k_j})), which admits a convergent subsequence Σ_{(k_j+1)_l} → Σ_0 as l → ∞. Since the mapping Σ_{k_j} ↦ arg min_{Σ∈M} ℓ(Σ; M(Σ_{k_j})) is closed, we have Σ_0 = arg min_{Σ∈M} ℓ(Σ; M(Σ_∞)).
On the other hand, ℓ_ρ(Σ_k) → ℓ_ρ(Σ_∞) implies that we must have ℓ_ρ(Σ_0) = ℓ_ρ(Σ_∞), which contradicts the fact that ℓ_ρ(Σ_0) < ℓ_ρ(Σ_∞) for Σ_∞ ∉ A*_ρ. □

References

T. Anderson (1955). The integral of a symmetric unimodal function over a symmetric convex set and some probability inequalities. Proceedings of the American Mathematical Society, 6, 170-176.

O. Arslan (2004). Convergence behavior of an iterative reweighting algorithm to compute multivariate M-estimates for location and scatter. Journal of Statistical Planning and Inference, 118, 115-128.

O. Arslan, P. Constable, and L. Kent (1995). Convergence behavior of the EM algorithm for the multivariate t-distribution. Communications in Statistics - Theory and Methods, 24(12), 2981-3000.

J. Baik and J.W. Silverstein (2006). Eigenvalues of large sample covariance matrices of spiked population models. Journal of Multivariate Analysis, 97(6), 1382-1408.

O. Banerjee, L.E. Ghaoui, and A. d'Aspremont (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9, 485-516.

P.J. Bickel and E. Levina (2008). Regularized estimation of large covariance matrices. The Annals of Statistics, 36(1), 199-227.

L. Davies (1987). Asymptotic behavior of S-estimators of multivariate location parameters and dispersion matrices. The Annals of Statistics, 15, 1269-1292.

A. Dempster (1972). Covariance selection. Biometrics, 157-175.

A. Dempster, N. Laird, and D. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39, 1-22.

X. Deng and K. Tsui (2013). Penalized covariance matrix estimation using a matrix-logarithm transformation. Journal of Computational and Graphical Statistics, 22(2), 494-512.

L. Duembgen and D. E. Tyler (2016).
Geodesic convexity and regularized scatter estimators. arXiv preprint.

P. Dutilleul (1999). The MLE algorithm for the matrix normal distribution. Journal of Statistical Computation and Simulation, 64(2), 105-123.

J. Fan, Y. Fan, and J. Lv (2008). High dimensional covariance matrix estimation using a factor model. Journal of Econometrics, 147(1), 186-197.

J. Friedman, T. Hastie and R. Tibshirani (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9, 432-441.

D. R. Fuhrmann and M. I. Miller (1988). On the existence of positive-definite maximum-likelihood estimates of structured covariance matrices. IEEE Transactions on Information Theory, 34(4), 722-729.

R. Furrer and T. Bengtsson (2007). Estimation of high-dimensional prior and posterior covariance matrices in Kalman filter variants. Journal of Multivariate Analysis, 98(2), 227-255.

F. Hampel (1968). Contribution to the theory of robust estimation. Ph.D. Thesis. University of California, Berkeley.

F. Hampel (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346), 383-393.

P. Huber (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35, 73-101.

P. Huber (1977). Robust covariances. In Statistical Decision Theory and Related Topics (pp. 165-191). Academic Press.

D. Kelker (1970). Distribution theory of spherical distributions and a location-scale parameter generalization. Sankhyā: The Indian Journal of Statistics, Series A, 419-430.

J. Kent and D. Tyler (1991). Redescending M-estimates of multivariate location and scatter. The Annals of Statistics, 19, 2102-2119.

C. Khatri (1967). On certain inequalities for normal distributions and their applications to simultaneous confidence bounds. The Annals of Mathematical Statistics, 38, 1853-1867.

O. Ledoit and M. Wolf (2004).
A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2), 365-411.

B. Li and P. Goel (2006). Regularized optimization in statistical learning: A Bayesian perspective. Statistica Sinica, 16, 411-424.

A. M. Manceur and P. Dutilleul (2013). Maximum likelihood estimation for the tensor normal distribution: Algorithm, minimum sample size, and empirical bias and dispersion. Journal of Computational and Applied Mathematics, 239, 37-49.

R. Maronna (1976). Robust M-estimators of multivariate location and scatter. The Annals of Statistics, 4, 51-67.

M. I. Miller and D. L. Snyder (1987). The role of likelihood and entropy in incomplete-data problems: Applications to estimating point-process intensities and Toeplitz constrained covariances. Proceedings of the IEEE, 75(7), 892-907.

E. Ollila, I. Soloveychik, D. E. Tyler and A. Wiesel (2016). Simultaneous penalized M-estimation of covariance matrices using geodesically convex optimization. arXiv preprint.

E. Ollila and D. E. Tyler (2014). Regularized M-estimators of scatter matrix. IEEE Transactions on Signal Processing, 62(22), 6059-6070.

E. Ollila, D. E. Tyler, V. Koivunen, and H. V. Poor (2012). Complex elliptically symmetric distributions: Survey, new results and applications. IEEE Transactions on Signal Processing, 60, 5597-5625.

D. Paul (2007). Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, 1617-1642.

M. Razaviyayn, M. Hong and Z. Luo (2013). A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization, 23, 1126-1153.

P. Rousseeuw (1984). Least median of squares regression. Journal of the American Statistical Association, 79, 871-880.

D. Rubin (1983). Iteratively reweighted least squares. In: Kotz, S., Johnson, N.L. (Eds.), Encyclopedia of Statistical Sciences.
Wiley, New York.

I. Soloveychik and D. Trushin (2016). Gaussian and robust Kronecker product covariance estimation: Existence and uniqueness. Journal of Multivariate Analysis, 149, 92-113.

I. Soloveychik, D. Trushin, and A. Wiesel (2015). Group symmetric robust covariance estimation. IEEE Transactions on Signal Processing, 64(1), 244-257.

D. Snyder, J. O'Sullivan, and M. Miller (1989). The use of maximum likelihood estimation for forming images of diffuse radar targets from delay-Doppler data. IEEE Transactions on Information Theory, 35(3), 536-548.

S. Sra and R. Hosseini (2015). Conic geometric optimization on the manifold of positive definite matrices. SIAM Journal on Optimization, 25(1), 713-739.

M. S. Srivastava, T. von Rosen and D. von Rosen (2008). Models with a Kronecker product covariance structure: estimation and testing. Mathematical Methods of Statistics, 17(4), 357-370.

Y. Sun, P. Babu, and D. P. Palomar (2014). Regularized Tyler's scatter estimator: Existence, uniqueness, and algorithms. IEEE Transactions on Signal Processing, 62, 5143-5156.

Y. Sun, P. Babu, and D. P. Palomar (2016). Robust estimation of structured covariance matrix for heavy-tailed elliptical distributions. IEEE Transactions on Signal Processing, 64, 3576-3590.

K. Tatsuoka and D. E. Tyler (2000). On the uniqueness of S-functionals and M-functionals under nonelliptical distributions. The Annals of Statistics, 28, 1219-1243.

R. Tibshirani (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58, 267-288.

D. E. Tyler (1987). A distribution-free M-estimator of multivariate scatter. The Annals of Statistics, 15, 234-251.

D. E. Tyler (1987). Statistical analysis for the angular central Gaussian distribution on the sphere. Biometrika, 74(3), 579-589.

D. E. Tyler (1997).
Discussion of "The EM algorithm - an old folk-song sung to a fast new tune." Journal of the Royal Statistical Society: Series B (Methodological), 59, 550-551.

D. E. Tyler (2010). A note on multivariate location and scatter statistics for sparse data sets. Statistics & Probability Letters, 80(17-18), 1409-1413.

D. E. Tyler and M. Yi (2020). Lassoing eigenvalues. Biometrika, 107(2), 397-414.

S. Visuri, V. Koivunen and H. Oja (2000). Sign and rank covariance matrices. Journal of Statistical Planning and Inference, 91, 557-575.

D. I. Warton (2008). Penalized normal likelihood and ridge regularization of correlation and covariance matrices. Journal of the American Statistical Association, 103(481), 340-349.

A. Wiesel (2012). Unified framework to regularized covariance estimation in scaled Gaussian models. IEEE Transactions on Signal Processing, 60(1), 29-38.

A. Wiesel (2012). Geodesic convexity and covariance estimation. IEEE Transactions on Signal Processing, 60(12), 6182-6189.

J. H. Won, J. Lim, S. J. Kim and B. Rajaratnam (2013). Condition number regularized covariance estimation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3), 427-450.

M. Yi and D. E. Tyler (2020). Shrinking the sample covariance matrix using convex penalties on the matrix-log transformation. Journal of Computational and Graphical Statistics, 30(2), 442-451.

M. Yuan and Y. Lin (2007). Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1), 19-35.

T. Zhang, A. Wiesel and M. Greco (2013). Multivariate generalized Gaussian distribution: Convexity and graphical models. IEEE Transactions on Signal Processing, 61(16), 4141-4148.