Efficient Implementation of the AI-REML Iteration for Variance Component QTL Analysis

Reading time: 6 minutes

📝 Original Info

  • Title: Efficient Implementation of the AI-REML Iteration for Variance Component QTL Analysis
  • ArXiv ID: 0709.0625
  • Date: 2008-02-11
  • Authors: Not specified in the source (author names and affiliations are missing from the original)

📝 Abstract

Regions in the genome that affect complex traits, quantitative trait loci (QTL), can be identified using statistical analysis of genetic and phenotypic data. When restricted maximum-likelihood (REML) models are used, the mapping procedure is normally computationally demanding. We develop a new, efficient computational scheme for QTL mapping using variance component analysis and the AI-REML algorithm. The scheme uses an exact or approximate low-rank representation of the identity-by-descent (IBD) matrix, which, combined with the Woodbury formula for matrix inversion, allows the computations in the AI-REML iteration body to be performed more efficiently. For cases where an exact low-rank representation of the IBD matrix is available a priori, the improved AI-REML algorithm normally runs almost twice as fast as the standard version. When an exact low-rank representation is not available, a truncated spectral decomposition is used to determine a low-rank approximation. We show that the computational efficiency of the AI-REML scheme can often be significantly improved in this case as well.


📄 Full Content

Traits that vary continuously are called quantitative. In general, such traits are affected by an interplay between multiple genetic factors and the environment. Most medically and economically important traits in humans, animals, and plants are quantitative, and understanding the genetics behind them is of great importance. A robust global optimization algorithm is used for the outer optimization problem, where the most likely position of the QTL is determined.

A general linear mixed model is given by

y = Xb + Rr + e,  (1)

where y is a vector of n observations, X is the n × n_f design matrix for the n_f fixed effects, R is the n × n_r design matrix for the n_r random effects, b is the vector of n_f unknown fixed effects, r is the vector of n_r unknown random effects, and e is a vector of n residuals. In the QTL analysis setting, we assume that the entries of e are identically and independently distributed and that there is a single observation for each individual in the pedigree. In this case, the covariance matrices are given by var(e) = σ_e² I and var(r) = σ_a² A, where A is referred to as the identity-by-descent (IBD) matrix. Using these assumptions, we have

var(y) ≡ V = σ_a² A + σ_e² I ≡ σ₁ A + σ₂ I.  (2)

Here, σ₁ ≥ 0 and σ₂ > 0, and we assume that the phenotype follows a normal distribution, y ∼ MVN(Xb, V). At least two different procedures can be used for computing estimates of b and σ₁, σ₂. In the standard codes mentioned above, the parameters are computed from the mixed-model equations (MME) (Lynch and Walsh, 1998) using the inverse of A. To be able to use this approach, A has to be positive definite, or it must be modified so that this property holds. Also, the computations can be performed efficiently if A is sparse. For QTL analysis problems, the IBD matrix A is often only semi-definite and not necessarily sparse. Here, the values of σ₁, σ₂ are given by the solution of the minimization problem

min L(σ₁, σ₂)  subject to  σ₁ ≥ 0, σ₂ > 0,  (3)

where L is the negative REML log-likelihood for the model (1); up to an additive constant,

L = ½ (log det V + log det(Xᵀ V⁻¹ X) + yᵀ P y),  (5)

and the projection matrix P is defined by

P = V⁻¹ − V⁻¹ X (Xᵀ V⁻¹ X)⁻¹ Xᵀ V⁻¹.  (6)
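To make these definitions concrete, the following NumPy sketch (synthetic toy data, not from the paper) forms V as in (2) and P as in (6), and checks two characteristic properties of the projection matrix, P X = 0 and P V P = P:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30

# Synthetic toy problem: design matrix X and a semi-definite "IBD" matrix A
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
B = rng.standard_normal((n, 5))
A = B @ B.T                      # rank 5, positive semi-definite
s1, s2 = 0.5, 1.0                # sigma_1 >= 0, sigma_2 > 0

V = s1 * A + s2 * np.eye(n)      # eq. (2)
Vinv = np.linalg.inv(V)

# Projection matrix, eq. (6): P = V^-1 - V^-1 X (X^T V^-1 X)^-1 X^T V^-1
VX = Vinv @ X
P = Vinv - VX @ np.linalg.solve(X.T @ VX, VX.T)

# P annihilates the fixed-effect design and satisfies P V P = P
assert np.allclose(P @ X, 0.0, atol=1e-8)
assert np.allclose(P @ V @ P, P, atol=1e-8)
```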

In the original AI-REML algorithm, the minimization problem is solved using the standard Newton scheme, but with the Hessian substituted by the average information matrix H_AI, whose entries are given by

(H_AI)_ij = ½ yᵀ P (∂V/∂σᵢ) P (∂V/∂σⱼ) P y,  (7)

where ∂V/∂σ₁ = A and ∂V/∂σ₂ = I.

The entries of the gradient of L are given by

(∇L)_i = ½ (tr(P ∂V/∂σᵢ) − yᵀ P (∂V/∂σᵢ) P y).  (8)
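The gradient (8) and AI matrix (7) can be verified numerically. The sketch below (synthetic data, assuming the ½-scaled conventions written above for (5), (7), and (8)) checks the gradient formula against central finite differences of L:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
B = rng.standard_normal((n, 6))
A = B @ B.T                      # synthetic semi-definite "IBD" matrix
y = rng.standard_normal(n)
I = np.eye(n)

def reml_pieces(s1, s2):
    """L as in (5) (up to a constant), gradient (8), and AI matrix (7)."""
    V = s1 * A + s2 * I
    Vinv = np.linalg.inv(V)
    VX = Vinv @ X
    P = Vinv - VX @ np.linalg.solve(X.T @ VX, VX.T)
    Py = P @ y
    L = 0.5 * (np.linalg.slogdet(V)[1]
               + np.linalg.slogdet(X.T @ VX)[1]
               + y @ Py)
    dV = [A, I]                                   # dV/ds1 = A, dV/ds2 = I
    grad = np.array([0.5 * (np.trace(P @ Vi) - Py @ Vi @ Py) for Vi in dV])
    H_AI = np.array([[0.5 * (Py @ Vi @ P @ Vj @ Py) for Vj in dV]
                     for Vi in dV])
    return L, grad, H_AI

s = np.array([0.8, 1.2])
_, grad, H_AI = reml_pieces(*s)

# Central-difference check of the gradient formula (8)
eps = 1e-6
for i in range(2):
    sp, sm = s.copy(), s.copy()
    sp[i] += eps
    sm[i] -= eps
    fd = (reml_pieces(*sp)[0] - reml_pieces(*sm)[0]) / (2 * eps)
    assert abs(fd - grad[i]) < 1e-4 * max(1.0, abs(grad[i]))

# The AI matrix is symmetric by construction
assert np.allclose(H_AI, H_AI.T)
```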

The main result of this paper is an algorithm that allows for an efficient implementation of the iteration body in the AI-REML algorithm for variance component analysis in QTL mapping problems. In (Mishchenko et al.), we examine optimization schemes for the REML optimization and different approaches for introducing the constraints on σ₁, σ₂ in the AI-REML scheme. For cases when the solution to the optimization problem is close to a constraint, various ad hoc fixes have previously been attempted to ensure convergence of the Newton scheme. In (Mishchenko et al.), we show that by introducing an active-set formulation in the AI-REML optimization scheme, robustness and good convergence properties are achieved also when the solution is close to a constraint. When computing the results below, we use the active-set AI-REML optimization scheme from (Mishchenko et al.).

From the formulas in the previous section, it is clear that the most computationally demanding part of the AI-REML algorithm is the computation of the matrix P, and especially the explicit inversion of the matrix V = σ₁ A + σ₂ I, where A is a constant semi-definite matrix and σ₁, σ₂ are updated in each iteration. If V is regarded as a general matrix, i.e. no information about the structure of the problem is used, a standard algorithm based on Cholesky factorization with complete pivoting is the only alternative, and O(n³) arithmetic operations are required. If A is very sparse, the work can be significantly reduced by using a sparse Cholesky factorization, possibly combined with a reordering of the equations. This can be compared to the approach taken e.g. in (Bates, 2004), where a sparse factorization of A is used when solving the mixed-model equations. However, as remarked earlier, A is only semi-definite and not very sparse for QTL analysis problems, so this approach is not an option here.

The IBD matrix A is a function of the location in the genome where the REML model is applied. The key observation leading to a more efficient algorithm for inverting V is that the rank of A at a location with complete genetic information only depends on the size of the base generation. At such locations, A will be a rank-k matrix where k ≪ n in experimental pedigrees with small base generations. If the genetic information is not fully complete, A can still be approximated by a low-rank matrix. The error in such an approximation can be made small when the genetic distance to a location with complete data is small.
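One way to obtain such an approximation is the truncated spectral decomposition mentioned in the abstract: keep the k largest eigenpairs of A and set Z = U_k Λ_k^{1/2}, so that Z Zᵀ ≈ A. A minimal NumPy sketch (the test matrix is synthetic and exactly rank 4, so the truncation is exact there):

```python
import numpy as np

def low_rank_factor(A, k):
    """Truncated spectral decomposition: A ~ Z Z^T with Z = U_k sqrt(Lambda_k)."""
    w, U = np.linalg.eigh(A)            # eigenvalues in ascending order
    top = np.argsort(w)[::-1][:k]       # indices of the k largest eigenvalues
    w_k = np.clip(w[top], 0.0, None)    # guard against tiny negative eigenvalues
    return U[:, top] * np.sqrt(w_k)

rng = np.random.default_rng(2)
B = rng.standard_normal((50, 4))
A = B @ B.T                             # exactly rank 4, so truncation at k=4 is exact
Z = low_rank_factor(A, 4)
assert np.allclose(Z @ Z.T, A, atol=1e-8)
```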

Let a symmetric rank-k representation of A be given by

A = Z Zᵀ,  (9)

where Z is an n × k matrix. We now exploit this type of representation to compute the inverse of V = σ₁ A + σ₂ I efficiently using the Woodbury formula.

Case 1: An exact low-rank representation of A is available a priori.
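In this case, the Woodbury formula reduces the inversion of V = σ₁ Z Zᵀ + σ₂ I to the factorization of a k × k matrix. A minimal sketch under these assumptions (the function name is ours, not the paper's; it assumes σ₁ > 0, since σ₁ = 0 makes V trivially diagonal):

```python
import numpy as np

def invert_V(Z, s1, s2):
    """V^-1 for V = s1 * Z Z^T + s2 * I via the Woodbury formula.

    Only a k x k system is factorized, so forming the explicit inverse
    costs O(n^2 k + k^3) instead of O(n^3) for a dense factorization.
    Assumes s1 > 0 and s2 > 0.
    """
    n, k = Z.shape
    M = np.eye(k) / s1 + (Z.T @ Z) / s2           # k x k capacitance matrix
    return np.eye(n) / s2 - (Z @ np.linalg.solve(M, Z.T)) / s2**2

rng = np.random.default_rng(1)
n, k = 200, 5
Z = rng.standard_normal((n, k))
s1, s2 = 0.7, 1.3
V = s1 * (Z @ Z.T) + s2 * np.eye(n)
assert np.allclose(invert_V(Z, s1, s2) @ V, np.eye(n), atol=1e-8)
```

If only products V⁻¹x are needed rather than the explicit inverse, the same identity gives them in O(n k) per product after an O(n k²) setup.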


Reference

This content is AI-processed based on open access ArXiv data.
