Deep Dive into Adaptive First-Order Methods for General Sparse Inverse Covariance Selection

In this paper, we consider estimating the sparse inverse covariance of a Gaussian graphical model whose conditional independence is assumed to be partially known. Similarly to [5], we formulate this as an $l_1$-norm penalized maximum likelihood estimation problem. We then propose an algorithmic framework and develop two first-order methods, namely the adaptive spectral projected gradient (ASPG) method and the adaptive Nesterov's smooth (ANS) method, for solving this estimation problem. Finally, we compare the performance of these two methods on a set of randomly generated instances. Our computational results demonstrate that both methods can solve problems of dimension at least a thousand with nearly half a million constraints within a reasonable amount of time, and that the ASPG method generally outperforms the ANS method.
It is well known that sparse undirected graphical models are capable of describing and explaining the relationships among a set of variables. Given a set of random variables with a Gaussian distribution, estimating such a model amounts to finding the pattern of zeros in the inverse covariance matrix, since these zeros correspond to conditional independencies among the variables. In recent years, a variety of approaches have been proposed for estimating a sparse inverse covariance matrix. (All notations used below are defined in Subsection 1.1.) Given a sample covariance matrix $\Sigma \in S^n_+$, d'Aspremont et al. [5] formulated sparse inverse covariance selection as the following $l_1$-norm penalized maximum likelihood estimation problem:
$$\max_X \; \{\log\det X - \langle \Sigma, X\rangle - \rho\, e^T |X| e : X \succ 0\}, \qquad (1)$$
where $\rho > 0$ is a parameter controlling the trade-off between likelihood and sparsity of the solution. They also studied Nesterov's smooth approximation scheme [10] and a block-coordinate descent (BCD) method for solving (1). Independently, Yuan and Lin [13] proposed an estimation problem similar to (1), which penalizes only the off-diagonal entries:
$$\max_X \; \Big\{\log\det X - \langle \Sigma, X\rangle - \rho \sum_{i \ne j} |X_{ij}| : X \succ 0\Big\}. \qquad (2)$$
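As a concrete illustration, the penalized objective in (1) is cheap to evaluate once a Cholesky factorization of $X$ is available. The sketch below (a hypothetical helper using NumPy, not code from the paper) computes $\log\det X - \langle\Sigma, X\rangle - \rho\, e^T|X|e$; the Cholesky step doubles as a certificate that $X \succ 0$.

```python
import numpy as np

def penalized_log_likelihood(X, Sigma, rho):
    """Objective of problem (1): log det X - <Sigma, X> - rho * e^T |X| e.

    Raises np.linalg.LinAlgError if X is not positive definite.
    """
    # Cholesky gives log det cheaply: log det X = 2 * sum(log diag(L)),
    # where X = L L^T, and fails exactly when X is not positive definite.
    L = np.linalg.cholesky(X)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    inner = np.sum(Sigma * X)       # trace inner product <Sigma, X>
    l1 = rho * np.sum(np.abs(X))    # e^T |X| e sums all |X_ij|
    return logdet - inner - l1
```

For example, at $X = \Sigma = I_2$ and $\rho = 0.1$, the value is $0 - 2 - 0.2 = -2.2$.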
They showed that problem (2) can be suitably solved by the interior point algorithm developed in Vandenberghe et al. [12]. As demonstrated in [5,13], the estimation problems (1) and (2) are capable of effectively discovering the sparse structure, or equivalently, the conditional independence in the underlying graphical model. Recently, Lu [8] proposed a variant of Nesterov's smooth method [10] for problems (1) and (2) that substantially outperforms the existing methods in the literature. In addition, Dahl et al. [4] studied the maximum likelihood estimation of a Gaussian graphical model whose conditional independence is known, which can be formulated as
$$\max_X \; \{\log\det X - \langle \Sigma, X\rangle : X_{ij} = 0 \ \ \forall (i,j) \in \bar E, \ X \succ 0\}, \qquad (3)$$
where $\bar E$ is the collection of all pairs of conditionally independent nodes. They showed that when the underlying graph is nearly chordal, Newton's method and the preconditioned conjugate gradient method can be applied efficiently to solve (3).
In practice, the sparsity structure of a Gaussian graphical model is often partially known from some knowledge of its random variables. In this paper we consider estimating the sparse inverse covariance of a Gaussian graphical model whose conditional independence is assumed to be partially known in advance (though it may also be completely unknown). Given a sample covariance matrix $\Sigma \in S^n_+$, we can naturally formulate this as the following constrained $l_1$-norm penalized maximum likelihood estimation problem:
$$\max_X \; \Big\{\log\det X - \langle \Sigma, X\rangle - \sum_{(i,j)\notin\Omega} \rho_{ij} |X_{ij}| : X_{ij} = 0 \ \ \forall (i,j)\in\Omega, \ X \succ 0\Big\}, \qquad (4)$$
where $\Omega$ consists of a set of pairs of conditionally independent nodes, and $\{\rho_{ij}\}_{(i,j)\notin\Omega}$ is a set of nonnegative parameters controlling the trade-off between likelihood and sparsity of the solution. It is worth mentioning that, unlike [4], we do not assume any specific sparsity structure on the underlying graph for problem (4). We can clearly observe that: (i) $(i,i)\notin\Omega$ for $1 \le i \le n$, and $(i,j)\in\Omega$ if and only if $(j,i)\in\Omega$; (ii) $\rho_{ij} = \rho_{ji}$ for any $(i,j)\notin\Omega$; and (iii) problems (1)-(3) can be viewed as special cases of problem (4) by choosing appropriate $\Omega$ and $\{\rho_{ij}\}_{(i,j)\notin\Omega}$. For example, setting $\Omega = \emptyset$ and $\rho_{ij} = \rho$ for all $(i,j)$ reduces problem (4) to (1).
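To make observation (iii) concrete, the sketch below (hypothetical names, assuming NumPy) evaluates the objective of problem (4), treating the constraint $X_{ij} = 0$ for $(i,j)\in\Omega$ as a feasibility check; with $\Omega = \emptyset$ and constant weights $\rho_{ij} = \rho$ it coincides with the objective of (1).

```python
import numpy as np

def general_objective(X, Sigma, rho, Omega):
    """Objective of problem (4); Omega is a symmetric set of index pairs,
    rho an (n, n) array of nonnegative weights. Returns -inf if X violates
    the conditional-independence constraint X_ij = 0 for (i, j) in Omega.
    """
    mask = np.zeros(X.shape, dtype=bool)
    for (i, j) in Omega:
        mask[i, j] = True
    if np.any(X[mask] != 0.0):
        return -np.inf                     # infeasible for problem (4)
    L = np.linalg.cholesky(X)              # certifies X > 0 (pos. definite)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    l1 = np.sum(rho[~mask] * np.abs(X[~mask]))
    return logdet - np.sum(Sigma * X) - l1
```

With `Omega = set()` and `rho` filled with a constant $\rho$, the l1 term is exactly $\rho\, e^T|X|e$, recovering problem (1).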
It is easy to observe that problem (4) can be reformulated as a constrained smooth convex problem that has an explicit $O(n^2)$-logarithmically homogeneous self-concordant barrier function. Thus, it can be suitably solved by interior point (IP) methods (see Nesterov and Nemirovski [11] and Vandenberghe et al. [12]). The worst-case iteration complexity of IP methods for finding an $\epsilon$-optimal solution to (4) is $O(n \log(\epsilon_0/\epsilon))$, where $\epsilon_0$ is an initial gap. Each iteration of an IP method requires $O(n^6)$ arithmetic operations for assembling and solving a typically dense Newton system with $O(n^2)$ variables. Thus, the total worst-case arithmetic cost of IP methods for finding an $\epsilon$-optimal solution to (4) is $O(n^7 \log(\epsilon_0/\epsilon))$, which is prohibitive when $n$ is relatively large.
Recently, Friedman et al. [6] proposed a gradient-type method for solving problem (4). They first converted (4) into the following penalization problem:
$$\max_X \; \Big\{\log\det X - \langle \Sigma, X\rangle - \sum_{i,j} \rho_{ij} |X_{ij}| : X \succ 0\Big\}, \qquad (5)$$
by setting $\rho_{ij}$ to an extraordinarily large number (say, $10^9$) for all $(i,j)\in\Omega$. Then they applied a slight variant of the BCD method [5] to the dual problem of (5), in which each iteration is solved by a coordinate descent approach applied to a lasso ($l_1$-regularized) least-squares problem.
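The lasso subproblem arising in each BCD iteration is a standard $l_1$-regularized least-squares problem, $\min_x \tfrac12\|Ax-b\|^2 + \lambda\|x\|_1$. A textbook cyclic coordinate descent with soft-thresholding can be sketched as follows (a generic illustration with hypothetical names, not the authors' implementation):

```python
import numpy as np

def soft_threshold(z, t):
    """Shrinkage operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(A, b, lam, n_iter=200):
    """Cyclic coordinate descent for min_x 0.5*||Ax - b||^2 + lam*||x||_1."""
    n = A.shape[1]
    x = np.zeros(n)
    col_sq = np.sum(A * A, axis=0)      # precomputed column norms ||A_j||^2
    r = b - A @ x                       # running residual b - Ax
    for _ in range(n_iter):
        for j in range(n):
            if col_sq[j] == 0.0:
                continue
            # correlation of column j with the partial residual (x_j removed)
            rho_j = A[:, j] @ r + col_sq[j] * x[j]
            x_new = soft_threshold(rho_j, lam) / col_sq[j]
            r += A[:, j] * (x[j] - x_new)   # update residual incrementally
            x[j] = x_new
    return x
```

With $A = I$ the minimizer is the componentwise soft-thresholding of $b$, which gives a quick sanity check on the implementation.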
Given that their method is a gradient-type method and the dual problem of (5) is highly ill-conditioned for the above choice of $\{\rho_{ij}\}$, it is not surprising that their method converges extremely slowly. Moreover, since the associated lasso least-squares problems can only be solved inexactly, their method often fails to converge even for small problems.
In this paper, we propose adaptive first-order methods for problem (4). Instead of solving (5) once with a set of huge penalty parameters $\{\rho_{ij}\}_{(i,j)\in\Omega}$, our methods consist of solving a sequence of problems
…(Full text truncated)…