Differentiation of the Cholesky decomposition


Authors: Iain Murray

February 2016

Abstract

We review strategies for differentiating matrix-based computations, and derive symbolic and algorithmic update rules for differentiating expressions containing the Cholesky decomposition. We recommend new 'blocked' algorithms, based on differentiating the Cholesky algorithm DPOTRF in the LAPACK library, which uses 'Level 3' matrix-matrix operations from BLAS, and so is cache-friendly and easy to parallelize. For large matrices, the resulting algorithms are the fastest way to compute Cholesky derivatives, and are an order of magnitude faster than the algorithms in common usage. In some computing environments, symbolically-derived updates are faster for small matrices than those based on differentiating Cholesky algorithms. The symbolic and algorithmic approaches can be combined to get the best of both worlds.

1 Introduction

The Cholesky decomposition L of a symmetric positive definite matrix Σ is the unique lower-triangular matrix with positive diagonal elements satisfying Σ = L Lᵀ. Alternatively, some library routines compute the upper-triangular decomposition U = Lᵀ. This note compares ways to differentiate the function L(Σ), and larger expressions containing the Cholesky decomposition (Section 2). We consider compact symbolic results (Section 3) and longer algorithms (Section 4).

Existing computer code that differentiates expressions containing Cholesky decompositions often uses an algorithmic approach proposed by Smith (1995). This approach results from manually applying the ideas behind 'automatic differentiation' (e.g. Baydin et al., 2015) to a numerical algorithm for the Cholesky decomposition. Experiments by Walter (2011) suggested that, despite conventional wisdom, computing symbolically-derived results is actually faster. However, these experiments were based on differentiating slow algorithms for the Cholesky decomposition.
In this note we introduce 'blocked' algorithms for propagating Cholesky derivatives (Section 4), which use cache-friendly and easy-to-parallelize matrix-matrix operations. In our implementations (Appendix A), these are faster than all previously-proposed methods.

2 Computational setup and tasks

This section can be safely skipped by readers familiar with 'automatic differentiation', the Σ̇ notation for 'forward-mode sensitivities', and the Σ̄ notation for 'reverse-mode sensitivities' (e.g. Giles, 2008).

We consider a sequence of computations,

    x → Σ → L → f,    (1)

that starts with an input x, computes an intermediate symmetric positive-definite matrix Σ, its lower-triangular Cholesky decomposition L, and then a final result f. Derivatives of the overall computation, ∂f/∂x, can be decomposed into reusable parts with the chain rule. However, there are multiple ways to proceed, some much better than others.

Matrix chain rule: It's tempting to simply write down the chain rule for the overall procedure:

    ∂f/∂x = Σ_{i, j≤i} Σ_{k, l≤k} (∂f/∂L_ij) (∂L_ij/∂Σ_kl) (∂Σ_kl/∂x),    (2)

where we only sum over the independent elements of the symmetric matrix Σ and the occupied lower-triangle of L. We can also rewrite the same chain rule in matrix form,

    ∂f/∂x = (∂f/∂vech(L)) (∂vech(L)/∂vech(Σ)) (∂vech(Σ)/∂x),    (3)

where the vech operator creates a vector by stacking the lower-triangular columns of a matrix. A derivative ∂y/∂z is a matrix or vector, with a row for each element of y and a column for each element of z, giving a row vector if y is a scalar, and a column vector if z is a scalar.

The set of all partial derivatives {∂L_ij/∂Σ_kl}, or equivalently the matrix ∂vech(L)/∂vech(Σ), contains O(N⁴) values for the Cholesky decomposition of an N×N matrix. Explicitly computing each of the terms in equations (2) or (3) is inefficient, and simply not practical for large matrices.
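To make the size of this object concrete, for a small matrix the full Jacobian ∂vech(L)/∂vech(Σ) from (3) can be built by finite differences. A minimal NumPy sketch (the helper names `vech`, `unvech_sym`, and `cholesky_jacobian` are ours, not from the note's code):

```python
import numpy as np

def vech(A):
    """Stack the lower-triangular columns of A into a vector."""
    N = A.shape[0]
    return np.concatenate([A[j:, j] for j in range(N)])

def unvech_sym(v, N):
    """Rebuild a symmetric N x N matrix from its vech vector."""
    A = np.zeros((N, N))
    idx = 0
    for j in range(N):
        A[j:, j] = v[idx:idx + N - j]
        idx += N - j
    return np.tril(A) + np.tril(A, -1).T

def cholesky_jacobian(Sigma, eps=1e-6):
    """Central finite-difference approximation to d vech(L) / d vech(Sigma).
    The result has N(N+1)/2 rows and columns: O(N^4) values overall."""
    N = Sigma.shape[0]
    M = N * (N + 1) // 2
    J = np.zeros((M, M))
    s0 = vech(Sigma)
    for k in range(M):
        e = np.zeros(M)
        e[k] = eps
        Lp = np.linalg.cholesky(unvech_sym(s0 + e, N))
        Lm = np.linalg.cholesky(unvech_sym(s0 - e, N))
        J[:, k] = vech(Lp - Lm) / (2 * eps)
    return J
```

For N = 4 this is already a 10×10 matrix; the quartic growth is what makes the naive chain rule impractical for large N.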
We give expressions for these O(N⁴) derivatives at the end of Section 3 for completeness, and because they might be useful for analytical study. However, the computational primitives we really need are methods to accumulate the terms in the chain rule moving left (forwards) or right (backwards), without creating enormous matrices. We outline these processes now, adopting the 'automatic differentiation' notation used by Giles (2008) and others.

Forwards-mode accumulation: We start by computing a matrix of sensitivities for the first stage of the computation, with elements Σ̇_kl = ∂Σ_kl/∂x. If we applied an infinitesimal perturbation to the input, x ← x + dx, the intermediate matrix would be perturbed by dΣ = Σ̇ dx. This change would in turn perturb the output of the Cholesky decomposition by dL = L̇ dx, where L̇_ij = ∂L_ij/∂x. We would like to compute the sensitivities of the Cholesky decomposition, L̇, from the sensitivities of the input matrix Σ̇ and other 'local' quantities (L and/or Σ), without needing to consider where these came from. Finally, we would compute the required result ḟ = ∂f/∂x from L and L̇, again without reference to the earlier stages of the computation (here, the Cholesky decomposition).

The forwards-mode algorithms in this note describe how to compute the reusable function L̇(L, Σ̇), which propagates the effect of a perturbation forwards through the Cholesky decomposition. The computational cost will have the same scaling with matrix size as the Cholesky decomposition. However, if we want the derivatives with respect to D different inputs to the computation, we must perform the whole forwards propagation D times, each time accumulating sensitivities with respect to a different input x.

Reverse-mode accumulation: We can instead accumulate derivatives by starting at the other end of the computation sequence (1).
The effect of perturbing the final stage of the computation is summarized by a matrix with elements L̄_ij = ∂f/∂L_ij. We need to 'back-propagate' this summary to compute the sensitivity of the output with respect to the downstream matrix, Σ̄_kl = ∂f/∂Σ_kl. In turn, this signal is back-propagated to compute x̄ = ∂f/∂x, the target of our computation, equal to ḟ in the forwards propagation above.

The reverse-mode algorithms in this note describe how to construct the reusable function Σ̄(L, L̄), which propagates the effect of a perturbation in the Cholesky decomposition backwards, to compute the effect of perturbing the original positive definite matrix. Like forwards-mode propagation, the computational cost has the same scaling with matrix size as the Cholesky decomposition.

Reverse-mode differentiation or 'back-propagation' has the advantage that Σ̄ can be reused to compute derivatives with respect to multiple inputs. Indeed, if the input x to the sequence of computations (1) is a D-dimensional vector, the cost to obtain all D partial derivatives ∇ₓf scales the same as a single forwards computation of f. For D-dimensional inputs, reverse-mode differentiation scales a factor of D times better than forwards-mode. Reverse-mode computations can have greater memory requirements than forwards-mode, and are less appealing than forwards-mode if there are more outputs of the computation than inputs.

3 Symbolic differentiation

It is not immediately obvious whether a small, neat symbolic form should exist for the derivatives of some function of a matrix, or whether the forward- and reverse-mode updates are simple to express. For the Cholesky decomposition, the literature primarily advises using algorithmic update rules, derived from the algorithms for numerically evaluating the original function (Smith, 1995; Giles, 2008).
However, there are also fairly small algebraic expressions for the derivatives of the Cholesky decomposition, and for forwards- and reverse-mode updates.

Forwards-mode: Särkkä (2013) provides a short derivation of a forwards propagation rule (his Theorem A.1), which we adapt to the notation used here. An infinitesimal perturbation to the expression Σ = L Lᵀ gives:

    dΣ = dL Lᵀ + L dLᵀ.    (4)

We wish to re-arrange to get an expression for dL. The trick is to left-multiply by L⁻¹ and right-multiply by L⁻ᵀ:

    L⁻¹ dΣ L⁻ᵀ = L⁻¹ dL + dLᵀ L⁻ᵀ.    (5)

The first term on the right-hand side is now lower-triangular. The second term is the transpose of the first, meaning it is upper-triangular and has the same diagonal. We can therefore remove the second term by applying a function Φ to both sides, where Φ takes the lower-triangular part of a matrix and halves its diagonal:

    Φ(L⁻¹ dΣ L⁻ᵀ) = L⁻¹ dL,  where  Φ_ij(A) = { A_ij, i > j;  ½ A_ii, i = j;  0, i < j }.    (6)

Multiplying both sides by L gives us the perturbation of the Cholesky decomposition:

    dL = L Φ(L⁻¹ dΣ L⁻ᵀ).    (7)

Substituting the forward-mode sensitivity relationships dΣ = Σ̇ dx and dL = L̇ dx (Section 2) immediately gives a forwards-mode update rule, which is easy to implement:

    L̇ = L Φ(L⁻¹ Σ̇ L⁻ᵀ).    (8)

The input perturbation Σ̇ must be a symmetric matrix, Σ̇_kl = Σ̇_lk = ∂Σ_kl/∂x, because Σ is assumed to be symmetric for all inputs x.

Reverse-mode: We can also obtain a neat symbolic expression for the reverse-mode updates. We substitute (7) into df = Tr(L̄ᵀ dL), and with a few lines of manipulation, rearrange it into the form df = Tr(Sᵀ dΣ).
Brewer (1977)'s Theorem 1 then implies that for a symmetric matrix Σ, the symmetric matrix containing reverse-mode sensitivities will be:

    Σ̄ = S + Sᵀ − diag(S),  where  S = L⁻ᵀ Φ(Lᵀ L̄) L⁻¹,    (9)

where diag(S) is a diagonal matrix containing the diagonal elements of S, and the function Φ is still as defined in (6). Alternatively, a lower-triangular matrix containing the independent elements of Σ̄ can be constructed as:

    tril(Σ̄) = Φ(S + Sᵀ) = Φ( L⁻ᵀ (P + Pᵀ) L⁻¹ ),  where  P = Φ(Lᵀ L̄),    (10)

with S as in (9), and using the function Φ again from (6).

Since first writing this section we have discovered two similar reverse-mode expressions (Walter, 2011; Koerber, 2015). It seems likely that other authors have also independently derived equivalent results, although these update rules do not appear to have seen wide-spread use.

Matrix of derivatives: By choosing the input of interest to be x = Σ_kl = Σ_lk, and fixing the other elements of Σ, the sensitivity Σ̇ becomes a matrix of zeros, except for ones at Σ̇_kl = Σ̇_lk = 1. Substituting into (8) gives an expression for all of the partial derivatives of the Cholesky decomposition with respect to any chosen element of the covariance matrix. Some further manipulation, expanding matrix products as sums over indices, gives an explicit expression for any element,

    ∂L_ij/∂Σ_kl = ( Σ_{m>j} L_im (L⁻¹)_mk + ½ L_ij (L⁻¹)_jk ) (L⁻¹)_jl
                + (1 − δ_kl) ( Σ_{m>j} L_im (L⁻¹)_ml + ½ L_ij (L⁻¹)_jl ) (L⁻¹)_jk.    (11)

If we compute every (i, j, k, l) element, each one can be evaluated in constant time by keeping running totals of the sums in (11) as we decrement j from N to 1. Explicitly computing every partial derivative therefore costs Θ(N⁴).

These derivatives can be arranged into a matrix, by 'vectorizing' the expression (Magnus and Neudecker, 2007; Minka, 2000; Harmeling, 2013).
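The symbolic updates (8) and (9) are only a few lines of code. A minimal NumPy sketch (the function names are ours; for efficiency, the generic solves below should be replaced by dedicated triangular-solve routines):

```python
import numpy as np

def Phi(A):
    """Take the lower-triangular part of A and halve its diagonal (eq. 6)."""
    return np.tril(A) - 0.5 * np.diag(np.diag(A))

def chol_fwd_symbolic(L, Sigma_dot):
    """Forward mode, eq. (8): L_dot = L Phi(L^{-1} Sigma_dot L^{-T}).
    Sigma_dot must be symmetric."""
    X = np.linalg.solve(L, Sigma_dot)   # L^{-1} Sigma_dot
    X = np.linalg.solve(L, X.T).T       # (L^{-1} Sigma_dot) L^{-T}
    return L @ Phi(X)

def chol_rev_symbolic(L, L_bar):
    """Reverse mode, eq. (9): Sigma_bar = S + S^T - diag(S),
    where S = L^{-T} Phi(L^T L_bar) L^{-1}."""
    P = Phi(L.T @ L_bar)
    X = np.linalg.solve(L.T, P)         # L^{-T} P
    S = np.linalg.solve(L.T, X.T).T     # (L^{-T} P) L^{-1}
    return S + S.T - np.diag(np.diag(S))
```

With the off-diagonal-doubling convention above, the two routines are consistent: Tr(L̄ᵀ L̇) equals the sum over the lower triangle of Σ̄ ⊙ Σ̇.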
We use a well-known identity involving the vec operator, which stacks the columns of a matrix into a vector, and the Kronecker product ⊗:

    vec(A B C) = (Cᵀ ⊗ A) vec(B).    (12)

Applying this identity to (7) yields:

    vec(dL) = (I ⊗ L) vec( Φ(L⁻¹ dΣ L⁻ᵀ) ).    (13)

We can remove the function Φ by introducing a diagonal matrix Z, defined such that Z vec(A) = vec(Φ(A)) for any N×N matrix A. Applying (12) again gives:

    vec(dL) = (I ⊗ L) Z (L⁻¹ ⊗ L⁻¹) vec(dΣ).    (14)

Using the standard elimination matrix ℒ and duplication matrix 𝒟 (Magnus and Neudecker, 1980), we can convert between the vec and vech of a matrix, where vech(A) is a vector made by stacking the columns of the lower triangle of A:

    vech(dL) = ℒ (I ⊗ L) Z (L⁻¹ ⊗ L⁻¹) 𝒟 vech(dΣ)  ⇒  ∂vech(L)/∂vech(Σ) = ℒ (I ⊗ L) Z (L⁻¹ ⊗ L⁻¹) 𝒟.    (15)

This compact-looking result was stated on MathOverflow¹ by pseudonymous user 'pete'. It may be useful for further analytical study, but doesn't immediately help with scalable computation.

4 Differentiating Cholesky algorithms

We have seen that it is inefficient to compute each term in the chain rule, (2) or (3), applied to a high-level matrix computation. For Cholesky derivatives the cost is Θ(N⁴), compared to O(N³) for the forward- or reverse-mode updates in (8), (9), or (10). However, evaluating the terms of the chain rule applied to any low-level computation, expressed as a series of elementary scalar operations, gives derivatives with the same computational complexity as the original function (e.g. Baydin et al., 2015). Therefore O(N³) algorithms for the dense Cholesky decomposition can be mechanically converted into O(N³) forward- and reverse-mode update algorithms; this process is called 'automatic differentiation'.
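As a quick sanity check, the vec/Kronecker identity (12) used in the derivation above is easy to verify numerically. An illustrative snippet (not from the paper's code), noting that vec stacks columns, so column-major ('F') ordering is needed:

```python
import numpy as np

def vec(A):
    """Stack the columns of A into a vector (column-major order)."""
    return A.reshape(-1, order='F')

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((5, 2))

lhs = vec(A @ B @ C)                 # vec(ABC)
rhs = np.kron(C.T, A) @ vec(B)       # (C^T kron A) vec(B)
```

The two sides agree to machine precision for any conformable A, B, C.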
Smith (1995) proposed taking this automatic differentiation approach, although he presented hand-derived propagation algorithms that could be easily implemented in any programming environment. Smith also reported applications to sparse matrices, where automatic differentiation inherits the improved complexity of computing the Cholesky decomposition. However, the algorithms that were considered for dense matrices aren't cache-friendly or easy to parallelize, and will be slow in practice.

Currently-popular numerical packages such as NumPy, Octave, and R (Oliphant, 2006; Eaton et al., 2009; R Core Team, 2012) compute the Cholesky decomposition using the LAPACK library (Anderson et al., 1999). LAPACK implements block algorithms that express computations as cache-friendly, parallelizable 'Level 3 BLAS' matrix-matrix operations that are fast on modern architectures. Dongarra et al. (1990) described the Level 3 BLAS operations, including an example block implementation of a Cholesky decomposition. For large matrices, we have sometimes found LAPACK's routine to be 50× faster than a C or Fortran implementation of the Cholesky algorithm considered by Smith (1995). Precise timings are machine-dependent; however, it's clear that any large dense matrix computations, including derivative computations, should be implemented using blocked algorithms where possible².

Block routines, like those in LAPACK, ultimately come down to elementary scalar operations inside calls to BLAS routines. In principle, automatic differentiation tools could be applied. However, the source code and compilation tools for the optimized BLAS routines for a particular machine are not always available to users. Even if they were, automatic differentiation tools would not necessarily create cache-friendly algorithms.

1. http://mathoverflow.net/questions/150427/the-derivative-of-the-cholesky-factor#comment450752_167719 (comment from 2014-09-01).
For these reasons, Walter (2011) used symbolic approaches (Section 3) to provide update rules based on standard matrix-matrix operations. An alternative approach is to extend the set of elementary routines understood by an automatic differentiation procedure to the operations supported by BLAS. We could then pass derivatives through the Cholesky routine implemented by LAPACK, treating the best available matrix-matrix routines as black-box functions.

Giles (2008) provides an excellent tutorial on deriving forward- and reverse-mode update rules for elementary matrix operations, which we found invaluable for deriving the algorithms that follow³. While his results can largely be found in materials already mentioned (Magnus and Neudecker, 2007; Minka, 2000; Harmeling, 2013), Giles emphasised forwards- and reverse-mode update rules, rather than huge objects like (15).

In the end, we didn't follow an automatic differentiation procedure exactly. While we derived derivative propagation rules from the structure of the Cholesky algorithms (unlike Section 3), we still symbolically manipulated some of the results to make the updates neater and in-place. In principle, a sophisticated optimizing compiler for automatic differentiation could do the same.

4.1 Level 2 routines

LAPACK also provides 'unblocked' routines, which use 'Level 2' BLAS operations (Dongarra et al., 1988a,b) like matrix-vector products. Although a step up from scalar-based algorithms, these are intended for small matrices only, and as helpers for 'Level 3' blocked routines (Section 4.2).

The LAPACK routine DPOTF2 loops over columns of an input matrix A, replacing the lower-triangular part in-place with its Cholesky decomposition.
At each iteration, the algorithm uses a row vector r, a diagonal element d, a matrix B, and a column vector c, as follows:

    function level2partition(A, j)
        r = A[j, 1:j−1]
        d = A[j, j]
        B = A[j+1:N, 1:j−1]
        c = A[j+1:N, j]
        return r, d, B, c

where the partitions sit in A as:

    A = [  ·               ]
        [ -- r --  d       ]
        [    B     c    ·  ]

Here '=' creates a view into the matrix A, meaning that in the algorithm below, '←' assigns results into the corresponding part of matrix A.

    function chol_unblocked(A)
        # If at input tril(A) = tril(Σ) = tril(L Lᵀ), at output tril(A) = L.
        for j = 1 to N:
            r, d, B, c = level2partition(A, j)
            d ← √(d − r rᵀ)
            c ← (c − B rᵀ) / d
        return A

The algorithm only inspects and updates the lower-triangular part of the matrix. If the upper-triangular part did not start out filled with zeros, then the user will need to zero out the upper triangle of the final array with the tril function:

    tril(A)_ij = { A_ij, i ≥ j;  0, otherwise }.    (16)

In each iteration, r and B are parts of the Cholesky decomposition that have already been computed, and d and c are updated in place, from their original settings in A, to give another column of the Cholesky decomposition. The matrix-vector multiplication B rᵀ is a Level 2 BLAS operation. These multiplications are the main computational cost of this algorithm.

2. Historical note: It's entirely reasonable that Smith (1995) did not use blocked algorithms. Primarily, Smith's applications used sparse computations. In any case, blocked algorithms weren't universally adopted until later. For example, Matlab didn't incorporate LAPACK until 2000, http://www.mathworks.com/company/newsletters/articles/matlab-incorporates-lapack.html.

3. Ironically, Giles (2008) also considered differentiating the Cholesky decomposition but, like Smith (1995), gave slow scalar-based algorithms.
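For concreteness, a NumPy transcription of chol_unblocked with zero-based indexing (a sketch; the note's own appendix code is the reference implementation):

```python
import numpy as np

def chol_unblocked(A):
    """Unblocked Cholesky of the lower triangle of A (DPOTF2-style).
    Works on a copy; at output the lower triangle holds L."""
    A = np.tril(A).copy()
    N = A.shape[0]
    for j in range(N):
        r = A[j, :j]                         # already-computed row of L
        A[j, j] = np.sqrt(A[j, j] - r @ r)   # d <- sqrt(d - r r^T)
        # c <- (c - B r^T) / d, using the already-computed block B below r
        A[j+1:, j] = (A[j+1:, j] - A[j+1:, :j] @ r) / A[j, j]
    return A
```

The result matches numpy.linalg.cholesky (which itself calls LAPACK's blocked routine) on any symmetric positive-definite input.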
Forwards-mode differentiation: The in-place updates obscure the relationships between parts of the input matrix and its Cholesky decomposition. We could rewrite the updates more explicitly as

    L_d = √(Σ_d − L_r L_rᵀ),    (17)
    L_c = (Σ_c − L_B L_rᵀ) / L_d.    (18)

Applying infinitesimal perturbations to these equations gives

    dL_d = ½ (Σ_d − L_r L_rᵀ)^(−1/2) (dΣ_d − 2 dL_r L_rᵀ) = (1/L_d) (dΣ_d/2 − dL_r L_rᵀ),    (19)

    dL_c = (dΣ_c − dL_B L_rᵀ − L_B dL_rᵀ)/L_d − ((Σ_c − L_B L_rᵀ)/L_d²) dL_d
         = (dΣ_c − dL_B L_rᵀ − L_B dL_rᵀ − L_c dL_d)/L_d.    (20)

We then get update rules for the forward-mode sensitivities by substituting their relationships, dΣ = Σ̇ dx and dL = L̇ dx (Section 2), into the equations above. Mirroring the original algorithm, we can thus convert Σ̇ to L̇ in-place, with the algorithm below:

    function chol_unblocked_fwd(L, Ȧ)
        # If at input tril(Ȧ) = tril(Σ̇), at output tril(Ȧ) = L̇, where Σ = L Lᵀ.
        for j = 1 to N:
            r, d, B, c = level2partition(L, j)
            ṙ, ḋ, Ḃ, ċ = level2partition(Ȧ, j)
            ḋ ← (ḋ/2 − r ṙᵀ) / d
            ċ ← (ċ − Ḃ rᵀ − B ṙᵀ − c ḋ) / d
        return Ȧ

Alternatively, the Cholesky decomposition and its forward sensitivity can be accumulated in one loop, by placing the updates from this algorithm after the corresponding lines in chol_unblocked.

Reverse-mode differentiation: Reverse-mode automatic differentiation traverses an algorithm backwards, reversing the direction of loops and the updates within them. At each step, the effect Z̄ of perturbing an output Z(A, B, C, ...) is 'back-propagated' to compute the effects (Ā^(Z), B̄^(Z), C̄^(Z), ...) of perturbing the inputs to that step. If the effects of the perturbations are consistent, then

    Tr(Z̄ᵀ dZ) = Tr(Ā^(Z)ᵀ dA) + Tr(B̄^(Z)ᵀ dB) + Tr(C̄^(Z)ᵀ dC) + ...,    (21)

and we can find (Ā^(Z), B̄^(Z), C̄^(Z), ...
) by comparing coefficients in this equation. If a quantity A is an input to multiple computations (X, Y, Z, ...), then we accumulate its total sensitivity,

    Ā = Ā^(X) + Ā^(Y) + Ā^(Z) + ...,    (22)

summarizing the quantity's effect on the final computation, Ā_ij = ∂f/∂A_ij (as reviewed in Section 2).

Using the standard identities Tr(AB) = Tr(BA), Tr(Aᵀ) = Tr(A), and (AB)ᵀ = Bᵀ Aᵀ, the perturbations from the final line of the Cholesky algorithm (20) imply:

    Tr(L̄_cᵀ dL_c) = Tr((L̄_c/L_d)ᵀ dΣ_c) − Tr((L̄_c L_r/L_d)ᵀ dL_B)
                  − Tr((L̄_cᵀ L_B/L_d)ᵀ dL_r) − Tr((L_cᵀ L̄_c/L_d)ᵀ dL_d).    (23)

We thus read off that Σ̄_c = L̄_c/L_d, where the sensitivities L̄_c include the direct effect on f, provided by the user of the routine, and the knock-on effects that changing this column would have on the columns computed to the right. These knock-on effects should have been accumulated through previous iterations of the reverse propagation algorithm. From this equation, we can also identify the knock-on effects that changing L_d, L_r, and L_B would have through changing column c, which should be added on to their existing sensitivities for later.

The perturbation (19) to the other update in the Cholesky algorithm implies:

    Tr(L̄_dᵀ dL_d) = Tr((L̄_d/(2L_d))ᵀ dΣ_d) − Tr((L̄_d L_r/L_d)ᵀ dL_r).    (24)

Comparing coefficients again, we obtain another output of the reverse-mode algorithm, Σ̄_d = L̄_d/(2L_d). We also accumulate the term −L̄_d L_r/L_d into the running total for the sensitivity of L_r, for later updates. The algorithm below tracks all of these sensitivities, with the updates rearranged to simplify some expressions and to make an algorithm that can update the sensitivities in-place.

    function chol_unblocked_rev(L, Ā)
        # If at input tril(Ā) = L̄, at output tril(Ā) = tril(Σ̄), where Σ = L Lᵀ.
        for j = N to 1, in steps of −1:
            r, d, B, c = level2partition(L, j)
            r̄, d̄, B̄, c̄ = level2partition(Ā, j)
            d̄ ← d̄ − cᵀ c̄ / d
            [d̄; c̄] ← [d̄; c̄] / d
            r̄ ← r̄ − [d̄, c̄ᵀ] [r; B]
            B̄ ← B̄ − c̄ r
            d̄ ← d̄ / 2
        return Ā

4.2 Level 3 routines

The LAPACK routine DPOTRF also updates the lower-triangular part of an array A in place with its Cholesky decomposition. However, this routine updates blocks at a time, rather than single column vectors, using the following partitions:

    function level3partition(A, j, k)
        R = A[j:k, 1:j−1]
        D = A[j:k, j:k]
        B = A[k+1:N, 1:j−1]
        C = A[k+1:N, j:k]
        return R, D, B, C

where the partitions sit in A as:

    A = [ ·           ]
        [ R   D       ]
        [ B   C    ·  ]

Only the lower-triangular part of D, the matrix on the diagonal, is referenced. The algorithm below loops over each diagonal block D, updating it and the matrix C below it. Each diagonal block (except possibly the last) is of size Nb × Nb. The optimal block-size Nb depends on the size of the matrix N, and on the machine running the code. Implementations of LAPACK select the block-size with a routine called ILAENV.

    function chol_blocked(A, Nb)
        # If at input tril(A) = tril(Σ) = tril(L Lᵀ), at output tril(A) = L, for integer Nb ≥ 1.
        for j = 1 to at most N, in steps of Nb:
            k ← min(N, j + Nb − 1)
            R, D, B, C = level3partition(A, j, k)
            D ← D − tril(R Rᵀ)
            D ← chol_unblocked(D)
            C ← C − B Rᵀ
            C ← C tril(D)⁻ᵀ
        return A

The computational cost of the blocked algorithm is dominated by Level 3 BLAS operations for the matrix-matrix multiplies and for solving a triangular system. The unblocked Level 2 routine from Section 4.1 (DPOTF2 in LAPACK) is also called as a subroutine on a small triangular block. For large matrices it may be worth replacing this unblocked routine with one that performs more Level 3 operations (Gustavson et al., 2013).
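The two unblocked derivative routines from Section 4.1 can be transcribed to NumPy in the same way as the factorization itself (zero-based indexing; a sketch following the pseudo-code, with the function names as used in this note):

```python
import numpy as np

def chol_unblocked_fwd(L, A_dot):
    """Forward mode: tril(A_dot) = tril(Sigma_dot) in,
    tril(A_dot) = L_dot out, where Sigma = L L^T."""
    A_dot = np.tril(A_dot).copy()
    N = L.shape[0]
    for j in range(N):
        r, d = L[j, :j], L[j, j]
        B, c = L[j+1:, :j], L[j+1:, j]
        r_dot = A_dot[j, :j]
        A_dot[j, j] = (A_dot[j, j] / 2 - r @ r_dot) / d        # d_dot update
        A_dot[j+1:, j] = (A_dot[j+1:, j] - A_dot[j+1:, :j] @ r
                          - B @ r_dot - c * A_dot[j, j]) / d   # c_dot update
    return A_dot

def chol_unblocked_rev(L, A_bar):
    """Reverse mode: tril(A_bar) = L_bar in,
    tril(A_bar) = tril(Sigma_bar) out, where Sigma = L L^T."""
    A_bar = np.tril(A_bar).copy()
    N = L.shape[0]
    for j in range(N - 1, -1, -1):
        r = L[j, :j]
        d = L[j, j]
        B = L[j+1:, :j]
        A_bar[j, j] -= L[j+1:, j] @ A_bar[j+1:, j] / d         # d_bar -= c^T c_bar / d
        A_bar[j, j] /= d                                       # [d_bar; c_bar] /= d
        A_bar[j+1:, j] /= d
        A_bar[j, :j] -= A_bar[j, j] * r + A_bar[j+1:, j] @ B   # r_bar update
        A_bar[j+1:, :j] -= np.outer(A_bar[j+1:, j], r)         # B_bar -= c_bar r
        A_bar[j, j] /= 2
    return A_bar
```

Both can be checked against finite differences of numpy.linalg.cholesky, or against the symbolic updates (8) and (10); with the off-diagonal-doubling convention, Tr(L̄ᵀ L̇) equals the lower-triangle sum of Σ̄ ⊙ Σ̇.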
Forwards-mode differentiation: Following the same strategy as for the unblocked case, we obtained the algorithm below. As before, the input sensitivities Σ̇_ij = ∂Σ_ij/∂x can be updated in-place to give L̇_ij = ∂L_ij/∂x, the sensitivities of the resulting Cholesky decomposition. Again, these updates could be accumulated at the same time as computing the original Cholesky decomposition.

    function chol_blocked_fwd(L, Ȧ)
        # If at input tril(Ȧ) = tril(Σ̇), at output tril(Ȧ) = tril(L̇), where Σ = L Lᵀ.
        for j = 1 to at most N, in steps of Nb:
            k ← min(N, j + Nb − 1)
            R, D, B, C = level3partition(L, j, k)
            Ṙ, Ḋ, Ḃ, Ċ = level3partition(Ȧ, j, k)
            Ḋ ← Ḋ − tril(Ṙ Rᵀ + R Ṙᵀ)
            Ḋ ← chol_unblocked_fwd(D, Ḋ)
            Ċ ← Ċ − Ḃ Rᵀ − B Ṙᵀ
            Ċ ← (Ċ − C Ḋᵀ) D⁻ᵀ
        return Ȧ

The unblocked derivative routine is called as a subroutine. Alternatively, chol_blocked_fwd could call itself recursively with a smaller block size, we could use the symbolic result (8), or we could differentiate other algorithms (e.g. Gustavson et al., 2013).

Minor detail: the standard BLAS operations don't provide a routine to neatly perform the first update for the lower-triangular Ḋ. One option is to wastefully subtract the full matrix (Ṙ Rᵀ + R Ṙᵀ), then zero out the upper-triangle of Ḋ, meaning that the upper triangle of Ȧ can't be used for auxiliary storage.

Reverse-mode differentiation: Again, deriving the reverse-mode algorithm and arranging it into a convenient form was more involved. The strategy is the same as in the unblocked case, however, and still relatively mechanical.

    function chol_blocked_rev(L, Ā)
        # If at input tril(Ā) = L̄, at output tril(Ā) = tril(Σ̄), where Σ = L Lᵀ.
        for k = N down to no less than 1, in steps of −Nb:
            j ← max(1, k − Nb + 1)
            R, D, B, C = level3partition(L, j, k)
            R̄, D̄, B̄, C̄ = level3partition(Ā, j, k)
            C̄ ← C̄ D⁻¹
            B̄ ← B̄ − C̄ R
            D̄ ← D̄ − tril(C̄ᵀ C)
            D̄ ← chol_unblocked_rev(D, D̄)
            R̄ ← R̄ − C̄ᵀ B − (D̄ + D̄ᵀ) R
        return Ā

The partitioning into columns is arbitrary, so the reverse-mode algorithm doesn't need to select the same set of blocks as the forwards computation. Here, when the matrix size N isn't a multiple of the block-size Nb, we've put the smaller blocks at the other edge of the matrix. As in the blocked forwards-mode update, there is a call to the unblocked routine, which can be replaced with alternative algorithms. In the implementation provided (Appendix A) we use the symbolically-derived update (10).

5 Discussion and Future Directions

The matrix operations required by the Cholesky algorithms implemented in LAPACK can be implemented with straightforward calls to BLAS. However, the forwards- and reverse-mode updates we have derived from these algorithms give some expressions where only the triangular part of a matrix product is required. There aren't standard BLAS routines that implement exactly what is required, and our implementations must perform unnecessary computations to exploit the fast libraries available. In future, it would be desirable to have standard fast matrix libraries that offer a set of routines that are closed under the rules for deriving derivative updates.

The automatic differentiation tools that have proved popular in machine learning differentiate high-level array-based code. As a result, these tools don't have access to the source code of the Cholesky decomposition, and need to be told how to differentiate it. Theano (Bastien et al., 2012; Bergstra et al., 2010), the first tool to be widely-adopted in machine learning, and AutoGrad (Maclaurin et al., 2015) use the algorithm by Smith (1995).
TensorFlow (Abadi et al., 2015) in its first release can't differentiate expressions containing a Cholesky decomposition, but a fork (Hensman and de G. Matthews, 2016) also uses the algorithm by Smith (1995), as previously implemented by The GPy authors (2015). The approaches in this note will be an order of magnitude faster for large matrices than the codes that are in current wide-spread use. Some illustrative timings are given at the end of the code listing (Appendix A). As the algorithms are only a few lines long, they could be ported to a variety of settings without introducing any large dependencies. The simple symbolic expressions (Section 3) could be differentiated using most existing matrix-based tools. Currently AutoGrad can't repeatedly differentiate the Cholesky decomposition because of the in-place updates in the Smith (1995) algorithm.

The 'Level 3' blocked algorithms (Section 4.2) are the fastest forwards- and reverse-mode update rules for large matrices. However, these require helper routines to perform the updates on small triangular blocks. In high-level languages (Matlab, Octave, Python), the 'Level 2' routines, similar to the algorithms that automatic differentiation would provide, are slow, and we recommend using the symbolic updates (Section 3) for small matrices instead.

It should be relatively easy to provide similar derivative routines for many standard matrix functions, starting with the rest of the routines in LAPACK. However, it would save a lot of work to have automatic tools to help make these routines. Although there is a wide variety of tools for automatic differentiation, we are unaware of practical tools that can currently create algorithms as neat and accessible as those made by hand for this note.

References

M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, R.
Jozefowicz, Y. Jia, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, M. Schuster, R. Monga, S. Moore, D. Murray, C. Olah, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems. White paper, Google Research, 2015. Software available from http://tensorflow.org. TensorFlow is a trademark of Google Inc.

E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide, volume 9. SIAM, 1999. http://www.netlib.org/lapack/.

F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. J. Goodfellow, A. Bergeron, N. Bouchard, and Y. Bengio. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.

A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind. Automatic differentiation in machine learning: a survey, 2015.

J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), 2010.

J. W. Brewer. The gradient with respect to a symmetric matrix. IEEE Transactions on Automatic Control, 22(2):265–267, 1977.

J. J. Dongarra, J. Du Croz, S. Hammarling, and R. Hanson. An extended set of Fortran basic linear algebra subprograms. ACM Transactions on Mathematical Software, 14(1):1–17, 1988a.

J. J. Dongarra, J. Du Croz, S. Hammarling, and R. Hanson. Algorithm 656: An extended set of Fortran basic linear algebra subprograms: Model implementation and test programs. ACM Transactions on Mathematical Software, 16(1):1–17, 1988b.

J. J. Dongarra, J. Du Croz, S. Hammarling, and I. S. Duff.
A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software, 16(1):1–17, 1990.

J. W. Eaton, D. Bateman, and S. Hauberg. GNU Octave version 3.0.1 manual: a high-level interactive language for numerical computations. CreateSpace Independent Publishing Platform, 2009. URL http://www.gnu.org/software/octave/doc/interpreter. ISBN 1441413006.

M. B. Giles. An extended collection of matrix derivative results for forward and reverse mode automatic differentiation, 2008.

F. G. Gustavson, J. Waśniewski, J. J. Dongarra, J. R. Herrero, and J. Langou. Level-3 Cholesky factorization routines improve performance of many Cholesky algorithms. ACM Transactions on Mathematical Software, 39(2):9:1–9:10, 2013.

S. Harmeling. Matrix differential calculus cheat sheet. Technical Report Blue Note 142, Max Planck Institute for Intelligent Systems, 2013. http://people.tuebingen.mpg.de/harmeling/bn142.pdf.

J. Hensman and A. G. de G. Matthews. GPflow, 2016. As of February 2016, https://github.com/GPflow/GPflow.

P. Koerber. Adjoint algorithmic differentiation and the derivative of the Cholesky decomposition, 2015. Preprint, available at SSRN: http://dx.doi.org/10.2139/ssrn.2703893.

D. Maclaurin, D. Duvenaud, M. Johnson, and R. P. Adams. Autograd: Reverse-mode differentiation of native Python, 2015. Version 1.1.3, http://github.com/HIPS/autograd and https://pypi.python.org/pypi/autograd/.

J. R. Magnus and H. Neudecker. The elimination matrix: some lemmas and applications. SIAM Journal on Algebraic and Discrete Methods, 1(4):422–449, 1980.

J. R. Magnus and H. Neudecker. Matrix differential calculus with applications in statistics and econometrics. 3rd edition, 2007. Available from http://www.janmagnus.nl/misc/mdc2007-3rdedition and older editions from Wiley.

T. Minka. Old and new matrix algebra useful for statistics, 2000.
MIT Media Lab note (1997; revised 12/00), http://research.microsoft.com/en-us/um/people/minka/papers/matrix/.

T. E. Oliphant. Guide to NumPy. Provo, UT, 2006.

R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2012. URL http://www.R-project.org/. ISBN 3-900051-07-0.

S. Särkkä. Bayesian filtering and smoothing. Cambridge University Press, 2013.

S. P. Smith. Differentiation of the Cholesky algorithm. Journal of Computational and Graphical Statistics, 4(2):134–147, 1995.

The GPy authors. GPy: A Gaussian process framework in Python, 2015. Version 0.8.8, http://github.com/SheffieldML/GPy.

S. F. Walter. Structured higher-order algorithmic differentiation in the forward and reverse mode with application in optimum experimental design. PhD thesis, Humboldt-Universität zu Berlin, 2011.

A Illustrative Python code

Equations (11) and (15) were checked numerically using Octave/Matlab code, not provided here. The rest of the equations and algorithms in this note are illustrated below using Python code that closely follows the equations and pseudo-code. There are differences due to the note using Matlab/Fortran-style ranges, which are one-based and inclusive, e.g. 1:3 = [1, 2, 3]. In contrast, Python uses zero-based, half-open ranges, e.g. 0:3 = :3 = [0, 1, 2]. The code is also available as pseudocode_port.py in the source tar-ball for this paper, available from arXiv. Development of alternative implementations in multiple programming languages is on-going. At the time of writing, Fortran code with Matlab/Octave and Python bindings, and pure Matlab code, is available at https://github.com/imurray/chol-rev. The Fortran code is mainly useful for smaller matrices, as for large matrices the time spent inside BLAS routines dominates, regardless of the language used.
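As a small check of the range translation described above (this snippet is illustrative only, and is not part of the paper's listing):

```python
import numpy as np

# Matlab/Fortran-style A(3, 1:2) (one-based, inclusive) selects the first
# two elements of the third row. With Python's zero-based, half-open
# ranges, the same elements of the row with index j = 2 are A[j, 0:2],
# written A[j, :j] in the partitioning code below.
A = np.arange(16).reshape(4, 4)
j = 2
assert A[j, :j].tolist() == [8, 9]
```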
The code repository also contains a demonstration of pushing derivatives through a whole computation (the log-likelihood of the hyperparameters of a Gaussian process).

# Demonstration code for Cholesky differentiation
# Iain Murray, February 2016

# These routines need Python >= 3.5 and NumPy >= 1.10 for matrix
# multiplication with the infix operator "@". For earlier Python/NumPy,
# replace all uses of "@" with the np.dot() function.

# Tested with Python 3.5.0, NumPy 1.10.4, and SciPy 0.17.0, with MKL
# from Anaconda's distribution with a quad core i5-3470 CPU @ 3.20GHz.

import numpy as np
from numpy import tril
from scipy.linalg import solve_triangular as _solve_triangular

# There are operations that are not performed in place but could be by
# splitting up the operations, and/or using lower-level code. In this
# version of the code, I've instead tried to keep the Python syntax
# close to the illustrative pseudo-code.

# Where the pseudo-code contains inverses of triangular matrices, it's
# commonly understood that the matrix product of the inverse with the
# adjacent term should be found by solving the resulting linear system of
# equations. The code below contains some commented-out lines with a
# straightforward rewriting of the pseudo-code using inv(), followed by
# equivalent but more efficient lines calling a linear solver (st()
# defined below). To get an inv function for testing we could do:
#     from numpy.linalg import inv
# I'd have to call the underlying LAPACK routines myself to solve these
# systems in-place, as the matrix transposes haven't worked out to match
# what the SciPy routine can do in-place.

def st(A, b, trans=0):
    """
    solve triangular system "tril(A) @ x = b", returning x

    if trans == 1, solve "tril(A).T @ x = b" instead.
    """
    if b.size == 0:
        return b
    else:
        return _solve_triangular(A, b, trans=trans, lower=True)

def Phi(A):
    """Return lower-triangle of matrix and halve the diagonal"""
    A = tril(A)
    A[np.diag_indices_from(A)] *= 0.5
    return A

def chol_symbolic_fwd(L, Sigma_dot):
    """
    Forwards-mode differentiation through the Cholesky decomposition

    This version uses a "one-line" symbolic expression to return L_dot
    where "dot" means sensitivities in forwards-mode differentiation,
    and Sigma = L @ L.T.
    """
    # invL = inv(L)
    # return L @ Phi(invL @ Sigma_dot @ invL.T)
    return L @ Phi(st(L, st(L, Sigma_dot.T).T))

def chol_symbolic_rev(L, Lbar):
    """
    Reverse-mode differentiation through the Cholesky decomposition

    This version uses a short symbolic expression to return
    tril(Sigma_bar) where "bar" means sensitivities in reverse-mode
    differentiation, and Sigma = L @ L.T.
    """
    P = Phi(L.T @ Lbar)
    # invL = inv(L)
    # return Phi(invL.T @ (P + P.T) @ invL)
    return Phi(st(L, st(L, (P + P.T), 1).T, 1))

def level2partition(A, j):
    """Return views into A used by the unblocked algorithms"""
    # diagonal element d is A[j,j]
    # we access [j, j:j+1] to get a view instead of a copy.
    rr = A[j, :j]      # row
    dd = A[j, j:j+1]   # scalar on diagonal   /     \
    B = A[j+1:, :j]    # Block in corner      | r d |
    cc = A[j+1:, j]    # column               \ B c /
    return rr, dd, B, cc

def chol_unblocked(A, inplace=False):
    """
    Cholesky decomposition, mirroring LAPACK's DPOTF2

    Intended to illustrate the algorithm only. Use a Cholesky routine
    from numpy or scipy instead.
    """
    if not inplace:
        A = A.copy()
    for j in range(A.shape[0]):
        rr, dd, B, cc = level2partition(A, j)
        dd[:] = np.sqrt(dd - rr @ rr)
        cc[:] = (cc - B @ rr) / dd
    return A

def chol_unblocked_fwd(L, Adot, inplace=False):
    """
    Forwards-mode differentiation through the Cholesky decomposition

    Obtain L_dot from Sigma_dot, where "dot" means sensitivities in
    forwards-mode differentiation, and Sigma = L @ L.T.

    This version uses an unblocked algorithm to update sensitivities
    Adot in place. tril(Adot) should start containing Sigma_dot, and
    will end containing the L_dot. The upper triangular part of Adot
    is untouched, so take tril(Adot) at the end if triu(Adot, 1) did
    not start out filled with zeros.

    If inplace=False, a copy of Adot is modified instead of the
    original. The Adot that was modified is returned.
    """
    if not inplace:
        Adot = Adot.copy()
    for j in range(L.shape[0]):
        rr, dd, B, cc = level2partition(L, j)
        rdot, ddot, Bdot, cdot = level2partition(Adot, j)
        ddot[:] = (ddot/2 - rr @ rdot) / dd
        cdot[:] = (cdot - Bdot @ rr - B @ rdot - cc*ddot) / dd
    return Adot

def chol_unblocked_rev(L, Abar, inplace=False):
    """
    Reverse-mode differentiation through the Cholesky decomposition

    Obtain tril(Sigma_bar) from L_bar, where "bar" means sensitivities
    in reverse-mode differentiation, and Sigma = L @ L.T.

    This version uses an unblocked algorithm to update sensitivities
    Abar in place. tril(Abar) should start containing L_bar, and will
    end containing the tril(Sigma_bar). The upper triangular part of
    Abar is untouched, so take tril(Abar) at the end if triu(Abar, 1)
    did not start out filled with zeros. Alternatively, (tril(Abar) +
    tril(Abar).T) will give the symmetric, redundant matrix of
    sensitivities.

    If inplace=False, a copy of Abar is modified instead of the
    original. The Abar that was modified is returned.
    """
    if not inplace:
        Abar = Abar.copy()
    for j in range(L.shape[0] - 1, -1, -1):  # N-1, N-2, ..., 1, 0
        rr, dd, B, cc = level2partition(L, j)
        rbar, dbar, Bbar, cbar = level2partition(Abar, j)
        dbar -= cc @ cbar / dd
        dbar /= dd  # / These two lines could be
        cbar /= dd  # \ done in one operation
        rbar -= dbar * rr  # / These two lines could be done
        rbar -= cbar @ B   # \ with one matrix multiply
        Bbar -= cbar[:, None] @ rr[None, :]
        dbar /= 2
    return Abar

def level3partition(A, j, k):
    """Return views into A used by the blocked algorithms"""
    # Top left corner of diagonal block is [j,j]
    # Block size is NB = (k-j)
    R = A[j:k, :j]   # Row block                     /     \
    D = A[j:k, j:k]  # triangular block on Diagonal  |     |
    B = A[k:, :j]    # Big corner block              | R D |
    C = A[k:, j:k]   # Column block                  \ B C /
    return R, D, B, C

def chol_blocked(A, NB=256, inplace=False):
    """Cholesky decomposition, mirroring LAPACK's DPOTRF

    Intended to illustrate the algorithm only. Use a Cholesky routine
    from numpy or scipy instead."""
    if not inplace:
        A = A.copy()
    for j in range(0, A.shape[0], NB):
        k = min(A.shape[0], j + NB)
        R, D, B, C = level3partition(A, j, k)
        D -= tril(R @ R.T)
        chol_unblocked(D, inplace=True)
        C -= B @ R.T
        # C[:] = C @ inv(tril(D)).T
        C[:] = st(D, C.T).T
    return A

def chol_blocked_fwd(L, Adot, NB=256, inplace=False):
    """
    Forwards-mode differentiation through the Cholesky decomposition

    Obtain L_dot from Sigma_dot, where "dot" means sensitivities in
    forwards-mode differentiation, and Sigma = L @ L.T.

    This version uses a blocked algorithm to update sensitivities Adot
    in place. tril(Adot) should start containing Sigma_dot, and will
    end containing the L_dot. Take tril() of the answer if
    triu(Adot, 1) did not start out filled with zeros. Unlike the
    unblocked routine, if the upper triangular part of Adot started
    with non-zero values, some of these will be overwritten.

    If inplace=False, a copy of Adot is modified instead of the
    original. The Adot that was modified is returned.
    """
    if not inplace:
        Adot = Adot.copy()
    for j in range(0, L.shape[0], NB):
        k = min(L.shape[0], j + NB)
        R, D, B, C = level3partition(L, j, k)
        Rdot, Ddot, Bdot, Cdot = level3partition(Adot, j, k)
        Ddot[:] = tril(Ddot) - tril(Rdot @ R.T + R @ Rdot.T)
        # chol_unblocked_fwd(D, Ddot, inplace=True) # slow in Python
        Ddot[:] = chol_symbolic_fwd(D, Ddot + tril(Ddot, -1).T)
        Cdot -= (Bdot @ R.T + B @ Rdot.T)
        # Cdot[:] = (Cdot - C @ Ddot.T) @ inv(tril(D)).T
        Cdot[:] = st(D, Cdot.T - Ddot @ C.T).T
    return Adot

def chol_blocked_rev(L, Abar, NB=256, inplace=False):
    """
    Reverse-mode differentiation through the Cholesky decomposition

    Obtain tril(Sigma_bar) from L_bar, where "bar" means sensitivities
    in reverse-mode differentiation, and Sigma = L @ L.T.

    This version uses a blocked algorithm to update sensitivities Abar
    in place. tril(Abar) should start containing L_bar, and will end
    containing the tril(Sigma_bar). Take tril(Abar) at the end if
    triu(Abar, 1) did not start out filled with zeros. Alternatively,
    (tril(Abar) + tril(Abar).T) will give the symmetric, redundant
    matrix of sensitivities.

    Unlike the unblocked routine, if the upper triangular part of Abar
    started with non-zero values, some of these will be overwritten.

    If inplace=False, a copy of Abar is modified instead of the
    original. The Abar that was modified is returned.
    """
    if not inplace:
        Abar = Abar.copy()
    for k in range(L.shape[0], -1, -NB):
        j = max(0, k - NB)
        R, D, B, C = level3partition(L, j, k)
        Rbar, Dbar, Bbar, Cbar = level3partition(Abar, j, k)
        # Cbar[:] = Cbar @ inv(tril(D))
        Cbar[:] = st(D, Cbar.T, trans=1).T
        Bbar -= Cbar @ R
        Dbar[:] = tril(Dbar) - tril(Cbar.T @ C)
        # chol_unblocked_rev(D, Dbar, inplace=True) # slow in Python
        Dbar[:] = chol_symbolic_rev(D, Dbar)
        Rbar -= (Cbar.T @ B + (Dbar + Dbar.T) @ R)
    return Abar

# Testing code follows

def tracedot(A, B):
    """tracedot(A, B) = trace(A.T @ B) = A.ravel() @ B.ravel()"""
    return A.ravel() @ B.ravel()

def testme(N):
    """Exercise each function using NxN matrices"""
    import scipy as sp
    from time import time
    if N > 1:
        Sigma = np.cov(sp.randn(N, 2*N))
        Sigma_dot = np.cov(sp.randn(N, 2*N))
    elif N == 1:
        Sigma = np.array([[sp.rand()]])
        Sigma_dot = np.array([[sp.rand()]])
    else:
        assert(False)
    tic = time()
    L = np.linalg.cholesky(Sigma)
    toc = time() - tic
    print('Running np.linalg.cholesky:')
    print('  Time taken: %0.4f s' % toc)
    tic = time()
    L_ub = tril(chol_unblocked(Sigma))
    toc = time() - tic
    print('Unblocked chol works: %r'
            % np.all(np.isclose(L, L_ub)))
    print('  Time taken: %0.4f s' % toc)
    tic = time()
    L_bl = tril(chol_blocked(Sigma))
    toc = time() - tic
    print('Blocked chol works: %r'
            % np.all(np.isclose(L, L_bl)))
    print('  Time taken: %0.4f s' % toc)
    tic = time()
    Ldot = chol_symbolic_fwd(L, Sigma_dot)
    toc = time() - tic
    hh = 1e-5
    L2 = np.linalg.cholesky(Sigma + Sigma_dot*hh/2)
    L1 = np.linalg.cholesky(Sigma - Sigma_dot*hh/2)
    Ldot_fd = (L2 - L1) / hh
    print('Symbolic chol_fwd works: %r'
            % np.all(np.isclose(Ldot, Ldot_fd)))
    print('  Time taken: %0.4f s' % toc)
    tic = time()
    Ldot_ub = tril(chol_unblocked_fwd(L, Sigma_dot))
    toc = time() - tic
    print('Unblocked chol_fwd works: %r'
            % np.all(np.isclose(Ldot, Ldot_ub)))
    print('  Time taken: %0.4f s' % toc)
    tic = time()
    Ldot_bl = tril(chol_blocked_fwd(L, Sigma_dot))
    toc = time() - tic
    print('Blocked chol_fwd works: %r'
            % np.all(np.isclose(Ldot, Ldot_bl)))
    print('  Time taken: %0.4f s' % toc)
    Lbar = tril(sp.randn(N, N))
    tic = time()
    Sigma_bar = chol_symbolic_rev(L, Lbar)
    toc = time() - tic
    Delta1 = tracedot(Lbar, Ldot)
    Delta2 = tracedot(Sigma_bar, Sigma_dot)
    print('Symbolic chol_rev works: %r'
            % np.all(np.isclose(Delta1, Delta2)))
    print('  Time taken: %0.4f s' % toc)
    tic = time()
    Sigma_bar_ub = chol_unblocked_rev(L, Lbar)
    toc = time() - tic
    Delta3 = tracedot(Sigma_bar_ub, Sigma_dot)
    print('Unblocked chol_rev works: %r'
            % np.all(np.isclose(Delta1, Delta3)))
    print('  Time taken: %0.4f s' % toc)
    tic = time()
    Sigma_bar_bl = chol_blocked_rev(L, Lbar)
    toc = time() - tic
    Delta4 = tracedot(Sigma_bar_bl, Sigma_dot)
    print('Blocked chol_rev works: %r'
            % np.all(np.isclose(Delta1, Delta4)))
    print('  Time taken: %0.4f s' % toc)

if __name__ == '__main__':
    import sys
    if len(sys.argv) > 1:
        N = int(sys.argv[1])
    else:
        N = 500
    testme(N)

# Example output for N = 500
# --------------------------
# Running np.linalg.cholesky:
#   Time taken: 0.0036 s
# Unblocked chol works: True
#   Time taken: 0.0319 s
# Blocked chol works: True
#   Time taken: 0.0356 s
# Symbolic chol_fwd works: True
#   Time taken: 0.0143 s
# Unblocked chol_fwd works: True
#   Time taken: 0.0592 s
# Blocked chol_fwd works: True
#   Time taken: 0.0112 s
# Symbolic chol_rev works: True
#   Time taken: 0.0165 s
# Unblocked chol_rev works: True
#   Time taken: 0.1069 s
# Blocked chol_rev works: True
#   Time taken: 0.0093 s

# Example output for N = 4000
# ---------------------------
# Running np.linalg.cholesky:
#   Time taken: 0.4020 s
# Unblocked chol works: True
#   Time taken: 25.8296 s
# Blocked chol works: True
#   Time taken: 0.7566 s
# Symbolic chol_fwd works: True
#   Time taken: 3.9871 s
# Unblocked chol_fwd works: True
#   Time taken: 51.6754 s
# Blocked chol_fwd works: True
#   Time taken: 1.2495 s
# Symbolic chol_rev works: True
#   Time taken: 4.1324 s
# Unblocked chol_rev works: True
#   Time taken: 96.3179 s
# Blocked chol_rev works: True
#   Time taken: 1.2938 s

# Times are machine and configuration dependent. On the same test
# machine, my Level 3 Fortran implementation is only ~10% faster
# for N = 4000, although can be a lot faster for small matrices.
# A Level 2 Fortran implementation isn't as bad as the Python
# version, but is still >15x slower than the blocked Python code.
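The repository's demonstration of pushing derivatives through a whole computation (a Gaussian-process log-likelihood) can be sketched in a few self-contained lines. The snippet below is my own illustrative sketch, not the repository's code: the helper names gp_loglik and gp_loglik_grad_tril are hypothetical, and Phi and chol_symbolic_rev are repeated from the listing so the sketch runs on its own. Reverse-mode sensitivities flow from the log-likelihood back through the triangular solve and the log-determinant term to give L_bar, and chol_symbolic_rev then converts L_bar into tril(Sigma_bar), which is checked against central finite differences.

```python
import numpy as np
from scipy.linalg import solve_triangular

def Phi(A):
    """Lower triangle of A with the diagonal halved (as in the listing)."""
    A = np.tril(A)
    A[np.diag_indices_from(A)] *= 0.5
    return A

def chol_symbolic_rev(L, Lbar):
    """tril(Sigma_bar) from L_bar; repeats the symbolic update above."""
    P = Phi(L.T @ Lbar)
    tmp = solve_triangular(L, P + P.T, trans=1, lower=True)
    return Phi(solve_triangular(L, tmp.T, trans=1, lower=True))

def gp_loglik(Sigma, y):
    """Gaussian log-likelihood of y ~ N(0, Sigma)."""
    L = np.linalg.cholesky(Sigma)
    z = solve_triangular(L, y, lower=True)  # z = inv(L) y
    return -0.5*(z @ z) - np.sum(np.log(np.diag(L))) \
           - 0.5*len(y)*np.log(2*np.pi)

def gp_loglik_grad_tril(Sigma, y):
    """tril accumulation of d loglik / d Sigma, via chol_symbolic_rev."""
    L = np.linalg.cholesky(Sigma)
    z = solve_triangular(L, y, lower=True)
    alpha = solve_triangular(L, z, trans=1, lower=True)  # inv(Sigma) y
    # Reverse through "z = solve(L, y)" and "-sum(log(diag(L)))":
    Lbar = np.tril(np.outer(alpha, z))
    Lbar[np.diag_indices_from(Lbar)] -= 1.0/np.diag(L)
    return chol_symbolic_rev(L, Lbar)

# Directional-derivative check against central finite differences:
rng = np.random.default_rng(0)
N = 6
A = rng.standard_normal((N, 2*N))
Sigma = A @ A.T / (2*N)
y = rng.standard_normal(N)
Sdot = rng.standard_normal((N, N))
Sdot = Sdot + Sdot.T
exact = np.sum(gp_loglik_grad_tril(Sigma, y) * Sdot)
hh = 1e-6
fd = (gp_loglik(Sigma + Sdot*hh/2, y) - gp_loglik(Sigma - Sdot*hh/2, y)) / hh
assert np.isclose(exact, fd, rtol=1e-4, atol=1e-6)
```

As in the paper's own tests, the tril accumulation pairs with a symmetric perturbation Sdot via an element-wise inner product, so the assertion checks the full chain through the Cholesky decomposition.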
