PROBABILISTIC INTERPRETATION OF LINEAR SOLVERS

PHILIPP HENNIG∗

Abstract. This manuscript proposes a probabilistic framework for algorithms that iteratively solve unconstrained linear problems $Bx = b$ with positive definite $B$ for $x$. The goal is to replace the point estimates returned by existing methods with a Gaussian posterior belief over the elements of the inverse of $B$, which can be used to estimate errors. Recent probabilistic interpretations of the secant family of quasi-Newton optimization algorithms are extended. Combined with properties of the conjugate gradient algorithm, this leads to uncertainty-calibrated methods with very limited cost overhead over conjugate gradients, a self-contained novel interpretation of the quasi-Newton and conjugate gradient algorithms, and a foundation for new nonlinear optimization methods.

Key words. linear programming, quasi-Newton methods, conjugate gradient, Gaussian inference

AMS subject classifications. 49M15, 65K05, 60G15

1. Introduction.

1.1. Motivation. Solving the unconstrained linear problem of finding $x$ in

(1.1) $Bx = b$

with symmetric, positive definite $B \in \mathbb{R}^{N \times N}$ and $b \in \mathbb{R}^N$ is a basic task for computational linear algebra. It is equivalent to minimizing the quadratic $f(x) = \frac{1}{2}x^\intercal B x - x^\intercal b$, with gradient $F(x) = \nabla_x f(x) = Bx - b$ and constant Hessian $B$. If $N$ is too large for exact solution, iterative solvers such as the method of conjugate gradients [27] (CG) are widely applied. These methods produce a sequence of estimates $\{x_i\}_{i=0,\dots,M}$, updated by evaluating $F(x_i)$. The question addressed here is: Assume we run an iterative solver for $M < N$ steps. How much information does doing so provide about $B$ and its (pseudo-)inverse $H$?
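As a concrete companion to Eq. (1.1), the following sketch (illustrative code, not from the paper; the function name `conjugate_gradients`, the dimensions, and the random data are choices of this example) runs a textbook CG loop on a small positive definite system and confirms that its output solves the linear system, i.e. that the gradient $F(x) = Bx - b$ vanishes at the minimizer of $f$.

```python
import numpy as np

def conjugate_gradients(B, b, x0, max_iter=None, tol=1e-10):
    """Textbook CG for B x = b with symmetric positive definite B."""
    x = x0.copy()
    r = b - B @ x                   # negative gradient -F(x) = b - B x
    d = r.copy()
    rs = r @ r
    for _ in range(max_iter or len(b)):
        Bd = B @ d
        alpha = rs / (d @ Bd)       # exact line search along d
        x = x + alpha * d
        r = r - alpha * Bd
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs) * d   # next conjugate direction
        rs = rs_new
    return x

rng = np.random.default_rng(0)
N = 50
A = rng.standard_normal((N, N))
B = A @ A.T + N * np.eye(N)         # symmetric positive definite
b = rng.standard_normal(N)

x_cg = conjugate_gradients(B, b, np.zeros(N))
x_direct = np.linalg.solve(B, b)
assert np.allclose(x_cg, x_direct, atol=1e-6)
assert np.linalg.norm(B @ x_cg - b) < 1e-6   # gradient F(x) = B x - b vanishes
```

In exact arithmetic the loop terminates after at most $N$ steps; the point of the paper is what can be said about $B$ and $H$ when it is stopped after $M < N$ of them.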
If we had to give estimates for $B$, $H$, and for the solution to related problems $B\tilde{x} = \tilde{b}$, what should they be, and how big of an 'error bar' (a joint posterior distribution) should we put on these estimates? The gradient $F(x_i)$ provides an error residual on $x_i$, but not on $B$, $H$ and $\tilde{x}$. It will turn out that a family of quasi-Newton methods (§1.2), more widely used to solve nonlinear optimization problems, can help answer this question, because classic derivations of these methods can be re-formulated and extended into a probabilistic interpretation of these methods as maxima of Gaussian posterior probability distributions (§2). The covariance of these Gaussians offers a new object of interest and provides error estimates (§3). Because there are entire linear spaces of Gaussian distributions with the same posterior mean but differing posterior error estimates, selecting one error measure consistent with the algorithm is a new statistical estimation task (§4).

1.2. The Dennis family of secant methods. The family of secant update rules for an approximation to the Newton–Raphson search direction is among the most popular building blocks for continuous nonlinear programming. Their evolution chiefly occurred from the late 1950s [8] to the 1970s, and is widely understood to be crowned by the development of the BFGS rule due to Broyden [5], Fletcher [18], Goldfarb [22] and Shanno [39], which now forms a core part of many contemporary optimization methods. But the family also includes the earlier and somewhat less popular DFP rule of Davidon [8], Fletcher and Powell [19]; the Greenstadt [23] rule, and the so-called symmetric rank-1 method [8, 4].

∗Max Planck Institute for Intelligent Systems, Spemannstraße 38, 72076 Tübingen, Germany.
Several authors have proposed grouping these methods into broader classes, among them Broyden in 1967 [4] (subsequently refined by Fletcher [18]) and Davidon in 1975 [9]. Of particular interest here will be a class of updates formulated in 1971 by Dennis [10], which includes all the specific rules cited above. It is the class of update rules mapping a current estimate $B_i$ for the Hessian, a vector-valued pair of observations $y_i = F(x_i) - F(x_{i-1}) \in \mathbb{R}^N$ and $s_i = x_i - x_{i-1}$ with $y_i = Bs_i$, into a new estimate $B_{i+1}$ of the form

(1.2) $B_{i+1} = B_i + \frac{(y_i - B_i s_i)c_i^\intercal + c_i (y_i - B_i s_i)^\intercal}{c_i^\intercal s_i} - \frac{c_i\, s_i^\intercal (y_i - B_i s_i)\, c_i^\intercal}{(c_i^\intercal s_i)^2}$

for some $c_i \in \mathbb{R}^N$. This ensures the secant relation $y_i = B_{i+1} s_i$, sometimes called 'the quasi-Newton equation' [13]. Convergence of the sequence of iterates $x_i$ for various members of this class (and the classes of Broyden and Davidon) is well-understood [21, 12]. The rules named above can be found in the Dennis class as [37]:

(1.3) Symmetric rank-1 (SR1): $c = y - B_0 s$
(1.4) Powell symmetric Broyden [36]: $c = s$
(1.5) Greenstadt [23]: $c = B_0 s$
(1.6) Davidon–Fletcher–Powell (DFP): $c = y$
(1.7) Broyden–Fletcher–Goldfarb–Shanno (BFGS): $c = y + \frac{y^\intercal s}{s^\intercal B_0 s} B_0 s$

Inverse updates. Because the update of Equation (1.2) is of rank 2, the corresponding estimate for the inverse $H = B^{-1}$ (assuming it exists) can be constructed using the matrix inversion lemma. Alternatively, all Dennis rules can also be used as inverse updates [13], i.e. estimates for $H$ itself, by exchanging $s \leftrightarrow y$ and $B \leftrightarrow H$, $B_0 \leftrightarrow H_0$ above (corresponding to the secant relation $s = Hy$). Interestingly, the DFP and BFGS updates are duals of each other under this exchange [13]: the inverse of $B_1$ as constructed by the DFP rule (1.6) equals the $H_1$ arising from the inverse BFGS rule (1.7).
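The update (1.2) and the secant relation it guarantees can be checked directly. The sketch below (an illustration with arbitrary small dimensions and seeded random data; the helper name `dennis_update` is this example's, not the paper's) implements the Dennis update for a general $c$ and verifies $y = B_{i+1}s$ and preservation of symmetry for each of the parameter choices (1.3)–(1.7).

```python
import numpy as np

def dennis_update(B, s, y, c):
    """Rank-2 Dennis-family update, Eq. (1.2)."""
    r = y - B @ s                       # residual of the secant relation
    cs = c @ s
    return (B
            + (np.outer(r, c) + np.outer(c, r)) / cs
            - (s @ r) * np.outer(c, c) / cs**2)

rng = np.random.default_rng(1)
N = 6
A = rng.standard_normal((N, N))
B_true = A @ A.T + N * np.eye(N)        # plays the role of B in y = B s
B0 = np.eye(N)
s = rng.standard_normal(N)
y = B_true @ s

choices = {
    "SR1":        y - B0 @ s,
    "PSB":        s,
    "Greenstadt": B0 @ s,
    "DFP":        y,
    "BFGS":       y + (y @ s) / (s @ B0 @ s) * (B0 @ s),
}
for name, c in choices.items():
    B1 = dennis_update(B0, s, y, c)
    assert np.allclose(B1 @ s, y), name     # secant relation y = B_{i+1} s
    assert np.allclose(B1, B1.T), name      # symmetry is preserved
```

For $c = y - B_0 s$ the two rank-2 terms collapse into the familiar single rank-1 SR1 correction, which the same function reproduces.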
This does not mean BFGS and DFP are the same, but only that they fill opposing roles in the inverse and direct formulation. To avoid confusion, in this text the DFP rule will always be used in the sense of a direct update (estimating $B$, with $c = y$), and the BFGS rule in the inverse sense (i.e. estimating $H$, with $c = s$). The first parts of this text will focus on direct updates and thus mostly talk about the DFP method instead of the BFGS rule. All results extend to the inverse models (and thus BFGS) under the exchange of variables mentioned above. Sections 3.2 and 4 will make some specific choices geared to inverse updates. They will then talk explicitly about BFGS, always in the sense of an inverse update.

Towards probabilistic quasi-Newton methods. This text gives a probabilistic interpretation of the Dennis family, for the linear problems of Eq. (1.1). We will interpret the secant methods as estimators of (inverse) Hessians of an objective function, and ask what kind of prior assumptions would give rise to these specific estimators. This results in a self-contained derivation of inference rules for symmetric matrices. Some of the rules quoted above can be motivated as 'natural' from the inference perspective.

Another major strand of nonlinear optimization methods extends from the conjugate gradient algorithm of Hestenes & Stiefel [27] for linear problems, nonlinearly extended by Fletcher and Reeves [20] and others. On linear problems, the CG and quasi-Newton ideas are closely linked: Nazareth [32] showed that CG is equivalent to BFGS for linear problems (with exact line searches, when the initial estimate $B_0 = I$).
More generally, Dixon [16, 17] showed for quasi-Newton methods in the Broyden class (which also contains the methods listed above) that, under exact line searches and the same starting point, all methods in Broyden's class generate a sequence of points identical to CG, if the starting matrix $B_0$ is taken as a preconditioner of CG. In this sense, this text also provides a novel derivation for conjugate gradients, and will use several well-known properties of that method. Implications of the results presented herein to nonlinear variants of conjugate gradients will be left for future work.

1.3. Numerical methods perform inference — The value of a statistical interpretation. The defining aspect of quasi-Newton methods is that they approximate—estimate—the Hessian matrix of the objective function, or its inverse, based on evaluations—observations—of the objective's gradient and certain prior structural restrictions on the estimate. They can therefore be interpreted as inferring the latent quantity $B$ or $H$ from the observed quantities $s, y$. This creates a connection to statistics and probability theory, in particular the probabilistic framework of encoding prior assumptions in a probability measure over a hypothesis space, and describing observations using a likelihood function, which combines with the prior according to Bayes' theorem into a posterior measure over the hypothesis space (§2). On the one hand, this elucidates prior assumptions of quasi-Newton methods (§3). On the other, it suggests new functionality for the existing methods, in particular error estimates on $B$ and $H$ (§4). In future work, it may also allow for algorithms robust to 'noisy' linear maps, such as they arise in physical inverse problems.
The interpretation of numerical problems as estimation was pointed out by statisticians like Diaconis in 1988 [14], and O'Hagan in 1992 [35], well after the introduction of quasi-Newton methods. To the author's knowledge, the idea has rarely attracted interest in numerical mathematics, and has not been studied in the context of quasi-Newton methods before recent work by Hennig & Kiefel [25, 26]. An argument sometimes raised against analysing numerical methods probabilistically is that numerical problems do not generally feature an aspect of randomness. But probability theory makes no formal distinction between epistemic uncertainty, arising from lack of knowledge, and aleatoric uncertainty, arising from 'randomness', whatever the latter may be taken to mean precisely. Randomness is not a prerequisite for the use of probabilities. Those who do feel uneasy about applying probability theory to unknown deterministic quantities, however, may prefer another, perhaps more subjective argument: from the point of view of a numerical algorithm's designer, the 'population' of problems that practitioners will apply the algorithm to does in fact form a probability distribution from which tasks are 'sampled'.

Numerical algorithms running on a finite computational budget make numerical errors. A notion of the imprecision of these answers is helpful, in particular when the method is used within a larger computational framework. Explicit error estimates can be propagated through the computational pipeline, helping identify points of instability, and to distribute or save computational resources. Needless to say, it makes no sense to ask for the exact error (if the exact difference between the true and estimated answer were known, the exact answer would be known, too). But it is meaningful to ask for the remaining volume of hypotheses consistent with the computations so far.
This paper attempts to construct such an answer for linear problems.

1.4. Overview of main results. As pointed out above, although quasi-Newton methods are most popular for nonlinear optimization, here the focus will be on linear problems. Extending the probabilistic interpretation constructed here to the nonlinear setting of inferring the (inverse) Hessian of a function $f$ will be left for future work (see [26] for pointers). The present aim is an iterative linear solver iterating through posterior beliefs $\{p_t(x, H)\}_{t=1,\dots,M}$ for $H = B^{-1}$ and the solution $x = Hb$ of the linear problem. These beliefs will be constructed as Gaussian densities $p_t(H) = \mathcal{N}(H; H_t, V_t)$ over the elements¹ of $H$, with mean $H_t$ and covariance matrix $V_t$. The results in this paper significantly clarify and extend previous results by Hennig & Kiefel [26] and Hennig [24]. Here is a brief outlook of the main results:

Dennis family derived in a symmetric hypothesis class (§2). As a probabilistic interpretation of results by Dennis & Moré [13] and Dennis & Schnabel [11], Hennig [24] provided a derivation of rank-2 secant methods in terms of two independent observations of two separate parts of the Hessian. This viewpoint affords a nonparametric extension to nonlinear optimization, but is not particularly elegant. This paper provides a cleaner derivation: the Dennis family can in fact be derived naturally from a prior over only symmetric matrices. This extends the results of Dennis & Schnabel [11], from statements about the maximum of a Frobenius norm in the space of symmetric matrices to the entire structure of that norm in that space.

New interpretation for SR1, Greenstadt, DFP & BFGS updates (§3). The choice of prior parameters distinguishes between the members of the Dennis family.
An analysis shows that DFP and BFGS are 'more correct' than other members of the family in the sense that they are consistent with exact probabilistic inference for the entire run of the algorithm, while general Dennis rules are only consistent after the first step (Lemmas 3.2 and 3.3). Further, SR1, Greenstadt, DFP and BFGS all use different prior measures that, although all 'scale-free', give imperfect notions of calibration for the prior measure. Finally, because BFGS is equivalent to CG ([32] and Lemma 3.4), its set of evaluated gradients is orthogonal. This allows a computationally convenient parameterization of posterior uncertainty. Overall, the picture arising is that, from the probabilistic perspective, the DFP and particularly BFGS methods have convenient numerical properties, but their posterior measure can be calibrated better.

Posterior uncertainty by parameter estimation (§4). It will transpire that the decision for a specific member of the Dennis family still leaves a space of possible choices of prior covariances consistent with this update rule. Constructing a meaningful posterior uncertainty estimate (covariance) on $H$ after finitely many steps requires a choice in this unidentified space, which, as in other estimation problems, needs to be motivated based on some notion of regularity in $H$. Several possible choices are discussed in Section 3, all of which add very low overhead to the standard conjugate gradient algorithm.

2. Gaussian inference from matrix-vector multiplications.

2.1. Introduction to Gaussian inference. Gaussian inference—probabilistic inference using both a Gaussian prior and a Gaussian likelihood—is one of the best-studied areas of probabilistic inference. The following is a very brief introduction; more can be found in introductory texts [38, 29].
Consider a hypothesis class consisting of elements of the $D$-dimensional real vector space, $v \in \mathbb{R}^D$, and assign a Gaussian prior probability density over this space:

(2.1) $p(v) = \mathcal{N}(v; \mu, \Sigma) := \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(v - \mu)^\intercal \Sigma^{-1} (v - \mu)\right]$,

parametrised by mean vector $\mu \in \mathbb{R}^D$ and positive definite covariance matrix $\Sigma \in \mathbb{R}^{D \times D}$. If we now observe a linear mapping $A^\intercal v + a = y \in \mathbb{R}^M$ of $v$, up to Gaussian uncertainty of covariance $\Lambda \in \mathbb{R}^{M \times M}$, i.e. according to the likelihood function

(2.2) $p(y \mid A, a, v) = \mathcal{N}(y; A^\intercal v + a, \Lambda)$,

then, by Bayes' theorem and a simple linear computation (see e.g. [38, §2.1.2]), the posterior, the unique distribution over $v$ consistent with both prior and likelihood, is

(2.3) $p(v \mid y, A, a) = \mathcal{N}\left[v;\ \mu + \Sigma A (A^\intercal \Sigma A + \Lambda)^{-1}(y - A^\intercal \mu - a),\ \Sigma - \Sigma A (A^\intercal \Sigma A + \Lambda)^{-1} A^\intercal \Sigma\right]$.

This derivation also works in the limit of perfect information, i.e. for a well-defined limit of $\Lambda \to 0$, in which case² the likelihood converges to the Dirac distribution $p(y \mid A, a, v) \to \delta(y - A^\intercal v - a)$. The crucial point is that constructing the posterior after linear observations involves only linear algebraic operations, with the posterior covariance (the 'error bar') using many of the computations also required to compute the mean (the 'best guess'). Gaussian inference is closely linked to least-squares estimation: because the logarithm is monotonically increasing, the maximum of the posterior (2.3) (which equals the mean) is also the minimizer of the quadratic norm (using $\|x\|_K^2 := x^\intercal K^{-1} x$)

(2.4) $-2 \log p(v \mid y, A, a) = \|y - A^\intercal v - a\|_\Lambda^2 + \|v - \mu\|_\Sigma^2 + \text{const.}$

¹For notational convenience, the elements of $H$ will be treated as the elements of a vector of length $N^2$; see more at the beginning of §2.2.
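The update (2.3) and its noise-free limit can be checked with a few lines of linear algebra. The sketch below (illustrative only; dimensions, seed, and data are arbitrary choices of this example) computes the posterior for $\Lambda = 0$ and confirms that the posterior mean then explains the observation exactly, while the posterior covariance retains no uncertainty in the observed directions.

```python
import numpy as np

rng = np.random.default_rng(2)
D, M = 8, 3
mu = rng.standard_normal(D)                 # prior mean
L = rng.standard_normal((D, D))
Sigma = L @ L.T + np.eye(D)                 # prior covariance (SPD)
A = rng.standard_normal((D, M))             # observation operator: y = A.T v + a
a = rng.standard_normal(M)

v_true = rng.standard_normal(D)
y = A.T @ v_true + a                        # noise-free observation (Lambda = 0)

G = A.T @ Sigma @ A                         # A.T Sigma A + Lambda, with Lambda = 0
K = Sigma @ A @ np.linalg.inv(G)            # gain matrix
mu_post = mu + K @ (y - A.T @ mu - a)       # posterior mean, Eq. (2.3)
Sigma_post = Sigma - K @ A.T @ Sigma        # posterior covariance, Eq. (2.3)

# the posterior mean reproduces the data exactly, and no
# uncertainty remains along the observed directions
assert np.allclose(A.T @ mu_post + a, y)
assert np.allclose(A.T @ Sigma_post @ A, 0)
```

Note that $\Sigma_{\text{post}}$ generally remains nonzero off the observed subspace: with $M < D$ observations, degrees of freedom survive, which is exactly the 'remaining volume of hypotheses' referred to above.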
The added value of the probabilistic interpretation is embodied in the posterior covariance, which quantifies remaining degrees of freedom of the estimator, and can thus also be interpreted as a measure of uncertainty, or estimated error.

2.2. Inference on asymmetric matrices from matrix-vector multiplications. We now consider Gaussian inference in the specific context of iterative solvers for linear problems as defined in Eq. (1.1). Our solver shall maintain a current probability density estimate, either $p_i(B)$ or $p_i(H)$, $i = 0, \dots, M$. The solver does not have direct access to $B$ itself, but only to a function mapping $s \mapsto Bs$, for arbitrary $s \in \mathbb{R}^N$. It is possible to use the Gaussian inference framework in the context of secant methods [26] through the use of Kronecker algebra: we write the elements of $B$ as a vector $\vec{B} \in \mathbb{R}^{N^2}$, indexed as $\vec{B}_{ij}$ by the matrix's index pairs³ $(i, j)$. The Kronecker product provides the link between such 'vectorized matrices' and linear operations (e.g. [40]). The Kronecker product $A \otimes C$ of two matrices $A \in \mathbb{R}^{M_a \times N}$ and $C \in \mathbb{R}^{M_c \times N}$ is the $M_a M_c \times N^2$ matrix with elements $(A \otimes C)_{(ij)(k\ell)} = A_{ik} C_{j\ell}$. It has the property $(A \otimes C)\vec{B} = \overrightarrow{ABC^\intercal}$. Thus, $\overrightarrow{BS}$ can be written as $(I \otimes S^\intercal)\vec{B}$, which allows incorporating the kind of observations made by an iterative solver in a Gaussian inference framework, according to the following Lemma.

²If $A$ is not of maximal rank, a precise formulation requires a projection of $y$ into the preimage of $A$. This is merely a technical complication. It is circumvented here by assuming, later on, that line-search directions are linearly independent. This amounts to a maximal-rank $A$.

³In the notation used here, this vector is assumed to be created by stacking the elements of $B$ row after row into a column vector. An equivalent column-by-column formulation is also widely used.
In that formulation, some of the formulae below are permuted.

Lemma 2.1 (proof in Hennig & Kiefel, 2013). Given a Gaussian prior over a general square matrix $\vec{B}$, with prior mean $\vec{B}_0$ and a prior covariance with Kronecker structure, $p(B) = \mathcal{N}(\vec{B}; \vec{B}_0, W \otimes W)$, the posterior mean after observing $BS = Y \in \mathbb{R}^{N \times M}$ (i.e. $M$ projections along the line-search directions $S \in \mathbb{R}^{N \times M}$) is

(2.5) $B_M = B_0 + (Y - B_0 S)(S^\intercal W S)^{-1} S^\intercal W$,

and the posterior covariance is

(2.6) $V_M = W \otimes \left(W - W S (S^\intercal W S)^{-1} S^\intercal W\right)$.

This implies, for example, that Broyden's rank-1 method [3] is equal to the posterior mean update after a single line search for the parameter choice $W = I$. This is a probabilistic re-phrasing of the much older observation, most likely by Dennis & Moré [13], that Broyden's method minimizes a change in the Frobenius norm $\|B_i - B_{i-1}\|_{F,I}$ such that $B_i s_i = y_i$. The weighted Frobenius norm $\|A\|^2_{F,W} = \mathrm{tr}(A W^{-1} A^\intercal W^{-1})$ (with the positive definite weighting $W$) is the $\ell_2$ loss on vectorized matrices in the sense that $\|A\|^2_{F,W} = \|\vec{A}\|^2_{W \otimes W}$. An important observation is that Broyden's method ceases to be a direct match to this update after the first line search, because the matrix $S^\intercal W S$ is not a diagonal matrix. This matrix will come to play a central role; we will call it the Gram matrix, because it is an inner product of $S$ weighted by the positive definite $W$.

2.2.1. Symmetric hypothesis classes. It is well known that, because the posterior mean of Eq. (2.5) is not in general a symmetric matrix, it is a suboptimal learning rule for the Hessian of an objective function. This is why this class was quickly abandoned in favour of the rank-2 updates in the Dennis family mentioned above.
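Both the vectorization identity and Lemma 2.1 are easy to validate numerically. In the sketch below (illustrative; NumPy's row-major `flatten` happens to match the row-stacking convention used here, which this example assumes), the posterior mean (2.5) is checked to reproduce all observations, $B_M S = Y$, and to reduce to Broyden's rank-1 rule for $W = I$, $M = 1$.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 7, 3

# vectorization identity: (A ⊗ C) vec(B) = vec(A B C^T) under row-stacking
A = rng.standard_normal((4, N))
C = rng.standard_normal((5, N))
B = rng.standard_normal((N, N))
assert np.allclose(np.kron(A, C) @ B.flatten(), (A @ B @ C.T).flatten())

# Lemma 2.1: posterior mean after observing Y = B S under a W ⊗ W prior
B_true = rng.standard_normal((N, N))        # a general (asymmetric) matrix
S = rng.standard_normal((N, M))             # line-search directions
Y = B_true @ S                              # observations
B0 = np.zeros((N, N))                       # prior mean
L = rng.standard_normal((N, N))
W = L @ L.T + np.eye(N)                     # symmetric positive definite

G = S.T @ W @ S                             # the Gram matrix
B_M = B0 + (Y - B0 @ S) @ np.linalg.solve(G, S.T @ W)   # Eq. (2.5)
assert np.allclose(B_M @ S, Y)              # the mean fits all M observations

# with W = I and a single line search this is Broyden's rank-1 update
s1, y1 = S[:, 0], Y[:, 0]
broyden = B0 + np.outer(y1 - B0 @ s1, s1) / (s1 @ s1)
S1, Y1 = S[:, :1], Y[:, :1]
post1 = B0 + (Y1 - B0 @ S1) @ np.linalg.solve(S1.T @ S1, S1.T)
assert np.allclose(broyden, post1)
```

The Gram matrix appears here exactly as in the text: for $M > 1$ its off-diagonal entries couple the search directions, which is what the repeated rank-1 rule ignores.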
Dennis & Moré [13] and Dennis & Schnabel [11] showed that the minimizer of weighted Frobenius regularizers (the maximizer of the Gaussian posterior) within the linear subspace of symmetric matrices is given by the Dennis class of update rules. Hennig & Kiefel [26] constructed a probabilistic interpretation based on this derivation, which involves doubling the input domain of the objective function and introducing two separate, independent observations. This has the advantage of allowing for relatively straightforward nonparametric extensions, and a broad class of noise models for cases in which gradients cannot be evaluated without error [24]. But artificially doubling the input dimensionality is dissatisfying. We now introduce a cleaner derivation of the same updates, by explicitly restricting the hypothesis class to symmetric matrices. This gives the covariance matrix a more involved structure than the Kronecker product, and makes derivations more challenging. It results in a new interpretation for the Dennis class, fully consistent with the probabilistic framework. Since the identity of the posterior mean was known from [13, 11] and [26], the interesting novel aspect here is the structure of the posterior covariance. In essence, it provides insight into the structure of the loss function around the previously known estimates.

We begin by building a Gaussian prior over the symmetric matrices, using the symmetrization operator $\Gamma$, the linear operator acting on vectorized matrices defined implicitly through its effect $\Gamma \vec{A} = \frac{1}{2}\overrightarrow{(A + A^\intercal)}$ (explicit definition in Appendix A.1).

Lemma 2.2 (proof in Appendix A.1).
Assuming a Gaussian prior $p(B) = \mathcal{N}(\vec{B}; \vec{B}_0, W \otimes W)$ over the space of square matrices $B \in \mathbb{R}^{N \times N}$ with Kronecker covariance $\mathrm{cov}(B_{ij}, B_{k\ell}) = W_{ik} W_{j\ell}$ (this requires $W$ to be a symmetric positive definite matrix), the prior over the symmetric matrix $\Gamma \vec{B}$ is $p(B) = \mathcal{N}(\Gamma \vec{B}; \Gamma \vec{B}_0, W \otimes_\ominus W)$.

Here, $W \otimes_\ominus W = \Gamma (W \otimes W) \Gamma^\intercal$ is the symmetric Kronecker product of $W$ with itself (see e.g. [40] for an earlier mention). It is the matrix containing elements

(2.7) $(W \otimes_\ominus W)_{(ij),(k\ell)} = \frac{1}{2}\left(W_{ik} W_{j\ell} + W_{jk} W_{i\ell}\right)$.

It can easily be seen that, when acting on a square (not necessarily symmetric) vectorized matrix $K \in \mathbb{R}^{N \times N}$, it has the effect $(W \otimes_\ominus W)\vec{K} = \frac{1}{2}\overrightarrow{(W K W^\intercal + W^\intercal K^\intercal W)}$. Unfortunately, not all of the Kronecker product's convenient algebraic properties carry over to the symmetric Kronecker product. For example, $(W \otimes_\ominus W)^{-1} = W^{-1} \otimes_\ominus W^{-1}$, but $(A \otimes_\ominus B)^{-1} \neq A^{-1} \otimes_\ominus B^{-1}$ in general, and inversion of this general form is straightforward only for commuting, symmetric $A, B$ [1]. This is why the proof for the following Theorem is considerably more tedious than the one for Lemma 2.1.

Theorem 2.3 (proof in Appendix A.2). Assume a Gaussian prior of mean $B_0$ and covariance $V = W \otimes_\ominus W$ on the elements of a symmetric matrix $B$. After $M$ linearly independent noise-free observations of the form $Y = BS$, $Y, S \in \mathbb{R}^{N \times M}$, $\mathrm{rk}(S) = M$, the posterior belief over $B$ is a Gaussian with mean

(2.8) $B_M = B_0 + (Y - B_0 S)(S^\intercal W S)^{-1} S^\intercal W + W S (S^\intercal W S)^{-1} (Y - B_0 S)^\intercal - W S (S^\intercal W S)^{-1} \left(S^\intercal (Y - B_0 S)\right) (S^\intercal W S)^{-1} S^\intercal W$,

and posterior covariance

(2.9) $V_M = \left(W - W S (S^\intercal W S)^{-1} S^\intercal W\right) \otimes_\ominus \left(W - W S (S^\intercal W S)^{-1} S^\intercal W\right)$.

This immediately leads to the following

Corollary 2.4.
The Dennis family of quasi-Newton methods is the posterior mean after one step ($M = 1$) of Gaussian regression on matrix elements.

Proof. Assume $Y, S \in \mathbb{R}^{N \times 1}$, and set $c = WS$ in Equation (1.2).

Note that, for each member of the Dennis class, there is an entire vector space of $W$ consistent with $c = WS$. Additionally, each member of the Dennis family itself corresponds to a one-dimensional space of choices $c$, because Equation (1.2) is unchanged under the transformation $c \mapsto \alpha c$ for $\alpha \in \mathbb{R} \setminus \{0\}$. Dealing with these degrees of freedom turns out to be the central task when defining probabilistic interpretations of linear solvers.

2.2.2. Remark on the structure of the prior covariance. The fact that symmetric Kronecker product covariance matrices give rise to some of the most popular secant methods may be reason enough to be interested in these structured Gaussian priors. This section provides two additional arguments in their favor. The first argument, applicable to the entire family of Gaussian inference rules, is that they give consistent estimates, and thus convergent solvers: the priors of Lemma 2.1 and Theorem 2.3 assign nonzero mass to all square, and all symmetric matrices, respectively. It thus follows, from standard theorems about the consistency of parametric Bayesian priors (e.g. [6]), that linear solvers based on the mean estimate arising from either of these two Gaussian priors, applied to linear problems of general, or symmetric structure, respectively, are guaranteed (assuming perfect arithmetic precision) to converge to the correct $B$ (and $B^{-1}$, where it exists) after $M = N$ linearly independent line searches (i.e. $\mathrm{rk}(S) = M$). This is because the Schur complement $W - WS(S^\intercal WS)^{-1}S^\intercal W$ is of rank $N - M$ [41, Eq. 0.9.2], so the remaining belief after $M = N$ is a point-mass at the unique $B = YS^{-1}$.
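The structural claims of Theorem 2.3 can be sanity-checked numerically. The sketch below (an illustration with arbitrary small dimensions and seeded data, not code from the paper) builds the posterior mean (2.8) for a symmetric ground truth and confirms that it remains symmetric and is consistent with all $M$ observations jointly.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 7, 3

A = rng.standard_normal((N, N))
B_true = A @ A.T + N * np.eye(N)           # symmetric matrix to be inferred
S = rng.standard_normal((N, M))            # line-search directions, rk(S) = M
Y = B_true @ S                             # observations Y = B S
B0 = np.eye(N)                             # symmetric prior mean
L = rng.standard_normal((N, N))
W = L @ L.T + np.eye(N)                    # symmetric positive definite

G = np.linalg.inv(S.T @ W @ S)             # inverse Gram matrix
R = Y - B0 @ S                             # residual of the prior mean
WS = W @ S

# posterior mean, Eq. (2.8)
B_M = (B0
       + R @ G @ WS.T
       + WS @ G @ R.T
       - WS @ G @ (S.T @ R) @ G @ WS.T)

assert np.allclose(B_M, B_M.T)             # the estimate stays symmetric
assert np.allclose(B_M @ S, Y)             # and fits all M observations at once
```

Both assertions rely on $S^\intercal Y = S^\intercal B S$ being symmetric, i.e. on the observations actually coming from a symmetric $B$; for an asymmetric ground truth the same formula would no longer interpolate the data.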
By a generalization of the same argument, it also follows that these linear solvers are always exact within the vector space spanned by the line-search directions. This holds for all choices of prior parameters $B_0$ and $W$, as long as $W$ is strictly positive definite. Of course, good convergence rates do depend crucially on these two choices. And the aim in this paper is to also identify choices for these parameters such that the posterior uncertainty around the mean estimate is meaningful, too.

Since we know $B$ to be positive definite, it would be desirable to restrict the prior explicitly to the positive definite cone. Unfortunately, this is not straightforward within the Gaussian family, because normal distributions have full support. A seemingly more natural prior over this cone is the Wishart distribution popular in statistics,

(2.10) $\mathcal{W}(B; W, \nu) \propto |B|^{\nu/2 - (N-1)/2} \exp\left[-\frac{\nu}{2} \mathrm{tr}(W^{-1} B)\right]$

(the $\propto$ symbol suppresses an irrelevant normalization constant). Using this prior in conjunction with linear observations, however, causes various complications, because the Wishart is not conjugate to one-sided linear observations of the form discussed above. So one may be interested in finding a 'linearization' (a Gaussian approximation of some form) for the Wishart, for example through moment matching. And indeed, the second moment (covariance) of the Wishart is $\nu^{-1}(W \otimes_\ominus W)$ (see e.g. [30]).

3. Choice of parameters. Having motivated the Gaussian hypothesis class, the next step is to identify individual desirable parameter choices in this class. The following Corollary follows directly from Theorem 2.3, by comparing Equation (2.5) with Equations (1.3) to (1.7). In each of the following cases, $\alpha \in \mathbb{R} \setminus \{0\}$.

Corollary 3.1.
1. The Powell symmetric Broyden update rule is the one-step posterior mean for a Gaussian regression model with $W = \alpha I$.
2.
The symmetric rank-1 rule is the one-step posterior mean for a Gaussian regression model with the implicit choice $W = \alpha(B - B_0)$. (For a specific rank-1 observation, there is a linear subspace of choices $W$ consistent with this $c$, but $W = \alpha(B - B_0)$ is the only globally consistent such choice.)
3. The Greenstadt update rule is the one-step posterior mean for a Gaussian regression model with $W = \alpha B_0$.
4. The DFP update is the one-step posterior mean for the implicit choice $W = \alpha B$. (This choice is unique in a manner analogous to the above for SR1.)
5. The BFGS rule is the one-step posterior mean for the implicit choice $W = \alpha\left(B + \frac{s^\intercal B s}{s^\intercal B_0 s} B_0\right)$. (This, too, is unique in a manner analogous to the above.)

It may seem circular for an inference algorithm trying to infer the matrix $B$ to use that very matrix as part of its computations (SR1, DFP, BFGS). But computation of the mean in Equation (2.8) only requires the projections $BS$ of $B$, which are accessible because $BS = Y$. However, the posterior uncertainty (Eq. 2.9), which is not part of the optimizers in their contemporary form, cannot be computed this way. Hence, with the exception of PSB, the popular secant rules all involve what would be called empirical Bayesian estimation in statistics, i.e. parameter adaptation from observed data. We also note again that the connection between probabilistic maximum a posteriori estimates and Dennis-class updates only applies in the first of $M$ steps. As such, the Dennis updates ignore the dependence between information collected in older and newer search directions that leads to the matrix inverse of $G = (S^\intercal W S)$ in Equations (2.8) and (2.9) (obviously, including this information explicitly requires solving $M$ linear problems, at additional cost).
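The DFP case of Corollary 3.1 can be checked directly: with $W = B$ we have $Ws = Bs = y$, so the one-step posterior mean (2.8) should coincide with the Dennis update (1.2) for $c = y$. A small numerical sketch (illustrative only; data and dimensions are this example's choices):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 6
A = rng.standard_normal((N, N))
B_true = A @ A.T + N * np.eye(N)
B0 = np.eye(N)
s = rng.standard_normal(N)
y = B_true @ s                      # observation; note W s = B s = y for W = B

# Dennis update (1.2) with c = y: the DFP rule
c, r = y, y - B0 @ s
cs = c @ s
dfp = (B0 + (np.outer(r, c) + np.outer(c, r)) / cs
       - (s @ r) * np.outer(c, c) / cs**2)

# one-step posterior mean (2.8) with W = B_true;
# only the projection W s = y is ever needed, never W itself
Ws = y
G = s @ Ws                          # Gram 'matrix' s^T W s, a scalar for M = 1
post = (B0
        + np.outer(r, Ws) / G
        + np.outer(Ws, r) / G
        - (s @ r) / G**2 * np.outer(Ws, Ws))

assert np.allclose(dfp, post)
```

The comment in the code mirrors the point made above: the mean only touches $B$ through $Bs = y$, whereas evaluating the covariance (2.9) for $W = B$ would require $B$ itself.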
As will be shown in Lemma 3.3 below, though, for some members of the Dennis family, and for their use within linear problems, this simplification is in fact exact.

[Figure 3.1: two panels plotting $\|B - \hat{B}\|_F / \|B - \hat{B}_0\|_F$ against the number of gradient evaluations (0–100), comparing DFP, PSB, BFGS, and the probabilistic posterior means and standard deviations for $W = I$ and $W = B$.]

Fig. 3.1. Effect of parameter choice and exact vs. independent updates. Left: 10 randomly generated linear problems with $N = 100$ with eigenvalue scale $\lambda = 10$. Right: analogous problems with eigenvalue scale $\lambda = 1000$. Individual experiments as thin lines, means over all 10 experiments as thick lines. The spikes for the $W = B$ estimate at the end of the left plot are numerical artifacts caused by ill-conditioned random projections. They do not arise in the optimization setting.

3.1. A motivating experiment. How relevant is the difference between the full rank-$2M$ posterior update and a sequence of $M$ rank-2 updates? Figure 3.1 shows results from a simple conceptual experiment. For this test only, the various estimation rules are treated as 'stand-alone' inference algorithms, not as optimizers. Random positive definite matrices $B \in \mathbb{R}^{N \times N}$ were generated as follows: eigenvalues $d_i$, $i = 1, \dots, N$ were drawn i.i.d. from an exponential distribution $p(d) = \frac{1}{\lambda}\exp(-\frac{d}{\lambda})$ with scale $\lambda = 10$ (small eigenvalues, left plot) or $\lambda = 1000$ (large eigenvalues, right plot), respectively. A random rotation matrix $Q \in SO(N)$ was drawn uniformly from the Haar measure over $SO(N)$, using the subgroup algorithm of Diaconis & Shahshahani [15], giving $B = QDQ^\intercal$ (where $D = \mathrm{diag}(d)$). Projections—simulated 'search directions'—were drawn uniformly at random as $S \in \mathbb{R}^{N \times M}$, $s_{nm} \sim \mathcal{N}(0, 1)$. For $M = 1, \dots,$
, N , the P ow ell Symmetric Broyden (PSB), DFP and BF GS, as w ell the corresp onding posterior means from Equation ( 2.8 ) with W = I (equal to PSB after one step) and W = B (equal to DFP after one step) w ere used to construct p oin t estimates B M for B . The plot shows the F robenius norm B M − B F b et ween true and estimated B , normalised by the initial error B 0 − B F . All algorithms used B 0 = I . Because directions s where chosen randomly , these results say little ab out these algorithms as optimizers. What they do offer is an intuition for the difference b et w een the exact rank-2 M p osterior and rep eated application of rank-2 Dennis-class up date rules. A first observ ation is that, in this setup, keeping trac k of the dep endence b et ween consecutive searc h directions through S ⊺ W S mak es a big difference: F or b oth pairs of ‘related’ algorithms PSB and W = I , as w ell as DFP and W = B , the full p osterior mean dominates the simpler ‘indep enden t’ up date rule. In fact, the classic secan t rules do not conv erge to the true Hessian B in this setup. The consistency argumen t in Section 2.2.2 only applies to estimators constructed by exact inference. The exp eriment sho ws how crucial trac king the full Gram matrix S ⊺ W S is after M > 1. A second, not surprising observ ation is that, although both probabilistic algorithms are consisten t—they con v erge to the correct B after N steps—the qualit y of the inferred 10 PHILIPP HENNIG p oin t estimate after M < N steps dep ends on the choice of parameters. The simpler W = I (PSB) choice p erforms qualitatively worse than the W = B (DFP) choice. 
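A scaled-down version of this experiment is easy to reproduce. The sketch below uses a smaller N than the text and draws the Haar rotation via QR decomposition of a Gaussian matrix (in place of the subgroup algorithm), and tracks only the W = I posterior mean; these simplifications are ours:

```python
import numpy as np

rng = np.random.default_rng(1)
N, lam = 50, 10.0
d = rng.exponential(scale=lam, size=N)            # eigenvalues ~ Exp, scale lam
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))  # approximately Haar rotation
B = Q @ np.diag(d) @ Q.T                          # B = Q D Q^T, s.p.d.

B0 = np.eye(N)
errs = []
for M in range(1, N + 1):
    S = rng.standard_normal((N, M))               # random 'search directions'
    Y = B @ S
    # full posterior mean for W = I (coincides with PSB after one step)
    Delta = Y - B0 @ S
    U = S @ np.linalg.inv(S.T @ S)                # S (S^T W S)^{-1} for W = I
    BM = B0 + Delta @ U.T + U @ Delta.T - U @ (S.T @ Delta) @ U.T
    errs.append(np.linalg.norm(BM - B) / np.linalg.norm(B0 - B))

print(errs[-1])   # consistency: the estimate is exact at M = N (up to round-off)
```

As in the text, the relative Frobenius error of the exact posterior mean decays to zero as M approaches N, which the repeated rank-2 secant updates do not achieve in this random-projection setting.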
The posterior covariances were used to compute posterior uncertainty estimates for ‖B_M − B‖_F (gray lines in Figure 3.1): the Frobenius norm can be written as ‖B_M − B‖²_F = (B_M − B)⃗⊺ (B_M − B)⃗; thus the expected value of this quadratic form is

(3.1) E[(B_M − B)⃗⊺ (B_M − B)⃗] = ∑_ij V_M,(ij)(ij) = ∑_ij ½(W_M,ii W_M,jj + W_M,ij W_M,ij),

with W_M := W − WS(S⊺WS)⁻¹S⊺W. (To be clear: for W = B, computing this uncertainty required the unrealistic step of giving the algorithm access to B, which only makes sense for this conceptual experiment.) The uncertainty estimate for W = I (dashed gray lines) is all but invisible in the right-hand plot because its values are very close to 0: this algorithm has a badly calibrated uncertainty measure that has no practical use as an estimate of error. The uncertainty under W = B (solid gray lines), on the other hand, scales qualitatively with the size of B. This is because scaling B by a scalar factor automatically also scales the covariance by the same factor. This has been noted before as a 'non-dimensional' property of BFGS/DFP [34, Eq. 6.11]. However, it is also apparent that the uncertainty estimate is too large in both plots, here by about a factor of 5. To understand why, we consider the individual terms in the sum of Equation (3.1) at the beginning of the inference: the ratio between the true estimation error on element B_ij and the estimated error is

(3.2) e²_ij = (B_0 − B)²_ij / E[(B_0 − B)²_ij] = 2(B²_ij − 2B_ij B_0,ij + B²_0,ij) / (W_ii W_jj + W²_ij).

One may argue that a 'well-calibrated' algorithm should achieve e_ij ≈ 1. A problem with the choice W = B becomes apparent considering diagonal elements and B_0 = I:

(3.3) e²_ii = (B_ii − 1)² / B²_ii = (1 − 1/B_ii)².

This means the DFP prior is well-calibrated only for large diagonal elements (B_ii ≫ 1).
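Equation (3.1) is straightforward to evaluate from the posterior covariance factor W_M alone. A minimal sketch, with names and the W = I test case our own:

```python
import numpy as np

def expected_sq_frobenius_error(W, S):
    """E[||B_M - B||_F^2] from the posterior covariance, Eq. (3.1):
    sum_ij (W_M,ii W_M,jj + W_M,ij^2) / 2, W_M = W - WS (S^T W S)^{-1} S^T W."""
    WS = W @ S
    WM = W - WS @ np.linalg.solve(S.T @ WS, WS.T)
    diag = np.diag(WM)
    return 0.5 * (diag.sum() ** 2 + (WM ** 2).sum())

rng = np.random.default_rng(2)
N, M = 20, 5
S = rng.standard_normal((N, M))
# for W = I, W_M is an orthogonal projector of rank N - M, so the
# expectation is ((N-M)^2 + (N-M)) / 2 = 120 here
print(expected_sq_frobenius_error(np.eye(N), S))
```

For W = I the result can be checked in closed form, since W_M is then the projector onto the orthogonal complement of span(S); for W = B it requires access to B, exactly as the text notes.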
For diagonal elements B_ii ≈ 1, it is under-confident (e_ii → 0, estimating too large an error), and for very small diagonal elements B_ii > 0, B_ii ≪ 1, it can be severely over-confident (e_ii → ∞, estimating too small an error). For off-diagonal elements and unit prior mean, the error estimate is

(3.4) e²_ij = 2B²_ij / (B²_ij + B_ii B_jj) for i ≠ j.

For positive definite B, e²_ij < 1 off the diagonal holds because, for such matrices, B²_ij < B_ii B_jj (see e.g. [28, Corollary 7.1.5]), but of course e²_ij can still be very small or even vanish, e.g. for diagonal matrices. It is possible to at least fix the over-confidence problem, using the degree of freedom in Corollary 3.1 to scale the prior covariance to W = θ²B with θ = λ_min/(λ_min − 1), using λ_min, the smallest eigenvalue of B. This at least ensures e_ij ≤ 1 for all (i, j).

Interestingly, setting W = B − B_0 (which gives the SR1 rule after the first observation, but not after subsequent ones) gives e²_ii = 1, and e_ij < 1 for i ≠ j. It also has the property that the norm of the true B under this prior is

(3.5) (B − B_0)⃗⊺ ((B − B_0) ⊗⊖ (B − B_0))⁻¹ (B − B_0)⃗ = I⃗⊺ I⃗ = N,

so the true B is exactly one standard deviation away from the mean under this prior. These properties suggest this covariance, which will be called the standardized norm covariance, for further investigation in § 4, which addresses the question: is it possible to construct a linear solver that, without 'cheating' (using B or H explicitly in the covariance), has a well-calibrated uncertainty measure, and can thus meaningfully estimate the error of its computation; ideally, without major cost increase?

3.2. Structure of the Gram matrix.
The above established that, treated as inference rules for matrices, general Dennis rules are probabilistically exact only after one rank-1 observation y = Bs. How strong is the error thus introduced? In fact, as the following lemma shows, there are choices of search directions S for which the existing algorithms do become exact probabilistic inference.

Lemma 3.2 (proof in Appendix A.3). If the Gram matrix S⊺WS is a diagonal matrix (i.e., if the search directions S ∈ ℝ^{N×M} are conjugate under the covariance parameter W), then the M repeated rank-2 update steps of classic secant-rule implementations result in an estimate that is equal to the posterior mean under exact probabilistic Gaussian inference from (Y, S). (The equivalent statement for inverse updates requires conjugacy of the Y under W.)

So a cheap⁴, probabilistic optimizer can be constructed by choosing search directions conjugate under W. The following reformulation of a previously known lemma⁵ shows that, in fact, both the BFGS and DFP update rules have this property.

Lemma 3.3 (additional proof in Appendix A.4). For linear problems Bx = b with symmetric positive definite B and exact line searches, under the DFP covariance W = B, and line searches along the inverse of the posterior mean of the Gaussian belief, the Gram matrix is diagonal. Analogously for inverse updates: for inference on H = B⁻¹ under the BFGS covariance W = H on the same linear optimization problem and line searches along the posterior mean over H, the Gram matrix is diagonal.

The following result by Nazareth [32] establishes that, for linear problems, the inference interpretation for BFGS transfers directly to the conjugate gradient (CG) method of Hestenes & Stiefel [27].

Theorem 3.4 (Nazareth⁶ [32]).
For linear optimization problems as defined in Lemma 3.3, BFGS inference on H with scalar prior mean, H_0 = αI, α ∈ ℝ, is equivalent to the conjugate gradient algorithm in the sense that the sequence of search directions is equal: s_i^BFGS = s_i^CG.

The connection between BFGS and CG is intuitive within the probabilistic framework: BFGS uses W = H, so its mean estimate H_M is the 'best guess' for H under (the minimizer of) the norm (H − H_M)⃗⊺ (H ⊗⊖ H)⁻¹ (H − H_M)⃗, and its iterated estimate x_M is the best rank-M estimate for x when the error is measured as (x − x_M)⊺W⁻¹(x − x_M) = (x − x_M)⊺B(x − x_M). Minimizing this quantity after M steps is a well-known characterisation of CG [34, Eq. 5.27].

Theorem 3.4 implies that describing BFGS in terms of Gaussian inference also gives a Gaussian interpretation for CG 'for free'. From the probabilistic perspective, and exclusively for linear problems, CG is 'just' a compact implementation of iterated Gaussian inference on H from p(H) = N(H; I, H ⊗⊖ H), with search directions along H_M F(x_M) = H_M(Bx_M − b).

⁴ We note in passing that, to reduce cost further, and regardless of whether the Gram matrix is diagonal or not, the updates of Eq. (2.5) can be approximated by using only the M̃ most recent pairs (s_i, y_i), or by retaining a restricted rank-M̃ form of the update. This is the analogue of the 'limited-memory' methods [33] well known in the literature for large-scale problems.
⁵ This result is quoted by Nazareth in 1979 [32] as 'well-known', with a citation to [31]. The proof in the appendix is less general, but may help put this lemma in the context of this text.
⁶ Dixon [16, 17] provided a related result linking CG to the whole Broyden class of quasi-Newton methods: they become equivalent to CG when the starting matrix is chosen as the pre-conditioner.
This observation has conceptual value in itself (the natural question, left open here, is what it implies for the nonparametric extensions of CG). But Theorem 3.4, among other things, also implies the following helpful properties of the search directions s_i chosen by, and gradients F_i 'observed' by, the (scalar prior mean) BFGS algorithm. They are all well-known properties of the conjugate gradient method (e.g. [34, Thm. 5.3]). In the following, generally assume that the algorithm has not converged at step M < N, and recall that the F_M = Bx_M − b are the residuals (gradients of f(x) = ½x⊺Bx − x⊺b) after M steps, whose differences form y_M = F_M − F_{M−1}.

● the set of evaluated gradients / residuals is orthogonal:
(3.6) F_i⊺F_j = 0 for i ≠ j and i, j < N
● the gradients (and thus Y) span the Krylov subspaces generated by (B, b):
(3.7) span{F_0, F_1, …, F_M} = span{F_0, BF_0, …, B^M F_0}
● line searches and gradients span the same vector space:
(3.8) span{s_0, s_1, …, s_M} = span{F_0, BF_0, …, B^M F_0}

3.3. Discussion. We have established a probabilistic interpretation of the Dennis class of quasi-Newton methods, and of the CG algorithm, as Gaussian inference: the Dennis class can be seen as Gaussian posterior means after the first line search (Corollary 2.4), but this connection extends to multiple search directions only if the search directions are conjugate under the prior covariance (Lemma 3.2). For linear problems, this is the case for the DFP and BFGS update rules (Lemma 3.3). Since BFGS is equivalent to CG on linear problems (Theorem 3.4), this also establishes a probabilistic interpretation for linear CG. These results offer new ways of thinking about linear solvers, in terms of solving an inference problem by collecting information and building a model, rather than by designing a dynamic process converging to the minimum of a function.
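The conjugacy of Lemma 3.3 and the properties (3.6) and (3.8) are easy to verify numerically. The following sketch (our own, on a small random test problem) runs plain conjugate gradients and checks that the Gram matrix S⊺BS is diagonal, that the residuals are mutually orthogonal, and that the search directions lie in the span of the gradients:

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 30, 8
A = rng.standard_normal((N, N))
B = A @ A.T + N * np.eye(N)          # a well-conditioned s.p.d. test matrix
b = rng.standard_normal(N)

# plain conjugate gradients, storing directions s_i and residuals F_i
x = np.zeros(N)
r = b - B @ x
p = r.copy()
Fs, Ss = [r.copy()], []
for _ in range(M):
    Bp = B @ p
    alpha = (r @ r) / (p @ Bp)
    Ss.append(p.copy())
    x = x + alpha * p
    r_new = r - alpha * Bp
    p = r_new + ((r_new @ r_new) / (r @ r)) * p
    r = r_new
    Fs.append(r.copy())
S, F = np.column_stack(Ss), np.column_stack(Fs)

G = S.T @ B @ S                      # Gram matrix under W = B (Lemma 3.3)
print(np.max(np.abs(G - np.diag(np.diag(G)))) / np.max(np.abs(G)))  # ~ 0

Fn = F / np.linalg.norm(F, axis=0)   # Eq. (3.6): residuals are orthogonal
print(np.max(np.abs(Fn.T @ Fn - np.eye(M + 1))))                    # ~ 0

# Eq. (3.8): the directions lie in the span of the gradients
coef, *_ = np.linalg.lstsq(F, S, rcond=None)
print(np.linalg.norm(S - F @ coef) / np.linalg.norm(S))             # ~ 0
```

All three quantities are zero up to round-off; in finite precision they degrade slowly for ill-conditioned B, which is the usual loss-of-conjugacy effect in CG.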
It is intriguing that, from this vantage point, the extremely popular CG / BFGS methods look less well-calibrated than one might have expected (§ 3.1). The obvious next question is: can one design explicitly uncertain linear solvers with a reasonably well-calibrated posterior? In addition to the scaling issues, a challenge is that, for BFGS / CG, the prior covariance W = H is only an implicit object. After M < N steps, there exists a ½(N − M)(N − M + 1)-dimensional cone of positive definite covariance matrices fulfilling WY = S (and, additionally, a scalar degree of freedom inherent to the Dennis class). How do we pick a point in this space?

4. Constructing explicit posteriors. The remainder will focus exclusively on inference on H = B⁻¹, i.e. on inverse update rules with priors p(H) = N(H⃗; H⃗_0, W ⊗⊖ W). As pointed out in § 1.2, these arise from the direct rules under exchange of S and Y: given (S, Y) ∈ ℝ^{N×M}, the posterior belief is p(H | S, Y) = N(H⃗; H⃗_M, W_M ⊗⊖ W_M) with

(4.1) H_M = H_0 + (S − H_0Y)(Y⊺WY)⁻¹Y⊺W + WY(Y⊺WY)⁻¹(S − H_0Y)⊺ − WY(Y⊺WY)⁻¹[Y⊺(S − H_0Y)](Y⊺WY)⁻¹Y⊺W,
      W_M = W − WY(Y⊺WY)⁻¹Y⊺W.

Recall from Section 1.2 and Corollary 3.1 that, cast as an inverse update, BFGS (CG) arises from the prior p(H) = N(H; I, θ²(H ⊗⊖ H)) for arbitrary θ ∈ ℝ₊.

4.1. Fitting covariance matrices. Equations (1.6), (1.7) show that both the BFGS and DFP priors in principle require access to H. As noted above, for the mean estimate it is implicitly feasible to use W = H, because this computation only requires the observed projections WY = HY = S. Computing the covariance under W = H, however, can only be an idealistic goal: after M steps, only a subspace of rank ½(N(N + 1) − (N − M)(N − M + 1)) of the elements of H is identified.
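The inverse-update posterior of Equation (4.1) can be sketched directly; the code below (our own notation, illustrative prior W = I) verifies the two defining properties: the mean reproduces H_M Y = S, and the covariance factor vanishes along the observed directions:

```python
import numpy as np

def inverse_posterior(H0, W, S, Y):
    """Posterior mean and covariance factor for H from observations
    satisfying H Y = S (cf. Eq. 4.1); prior mean H0, covariance factor W."""
    R = S - H0 @ Y                           # residual of the prior mean
    WY = W @ Y
    G = np.linalg.inv(Y.T @ WY)              # (Y^T W Y)^{-1}
    U = WY @ G
    HM = H0 + R @ U.T + U @ R.T - U @ (Y.T @ R) @ U.T
    WM = W - U @ WY.T
    return HM, WM

rng = np.random.default_rng(5)
N, M = 12, 4
A = rng.standard_normal((N, N))
B = A @ A.T + N * np.eye(N)
S = rng.standard_normal((N, M))
Y = B @ S                                    # so that H Y = S for H = B^{-1}
HM, WM = inverse_posterior(np.eye(N), np.eye(N), S, Y)
print(np.allclose(HM @ Y, S))    # posterior mean reproduces the observations
print(np.allclose(WM @ Y, 0))    # covariance collapses along observed directions
```

The second property is exactly the identified-subspace statement above: after M steps, the posterior uncertainty has vanished on the rank-M subspace probed by Y, and remains untouched on its complement.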
To see this explicitly, consider the singular value decomposition⁷ Y = QΣU⊺, which defines a symmetric positive definite T ∈ ℝ^{N×N} through W = QTQ⊺. This notation gives

(4.2) W_M = W − WY(Y⊺WY)⁻¹Y⊺W = Q(T − TΣ(Σ⊺TΣ)⁻¹Σ⊺T)Q⊺.

Considering the structure of Σ, one can write T in terms of block matrices

(4.3) T = [T₊₊, T₊₋; T₋₊, T₋₋], then W_M = Q [0, 0; 0, T₋₋ − T₋₊T₊₊⁻¹T₊₋] Q⊺,

with T₊₊ ∈ ℝ^{M×M}, T₋₊ = T₊₋⊺ with T₊₋ ∈ ℝ^{M×(N−M)}, and T₋₋ ∈ ℝ^{(N−M)×(N−M)} (and positive definite T₊₊, a principal block of the positive definite T). Observing (S, Y) exactly identifies [T₊₊, T₊₋]⊺ = Q⊺SUD⁻¹, and provides no information at all⁸ about T₋₋.

A primary goal in designing a probabilistic linear solver is thus, at step M < N, to (1) identify the span of W_M, ideally without incurring additional cost, and to (2) fix the entries in the remaining free dimensions of W_M, by using some regularity assumptions⁹ about H. The equivalence between BFGS and CG offers an elegant way of solving problem (1), with no additional computational cost: recall from Theorem 2.3 and Lemma 3.2 that the covariance after M steps under W = H is W_M ⊗⊖ W_M with

(4.4) W_M = W − ∑_i^M (Wy_i)(Wy_i)⊺ / (s_i⊺y_i) = W − ∑_i^M s_is_i⊺ / (s_i⊺y_i) = W − S(S⊺Y)⁻¹S⊺.

Because, by Equation (3.8), the vector space spanned by S is identical to that spanned by the orthogonal gradients, we can write the space of all symmetric positive semidefinite matrices W with the property WY = S as

(4.5) W(Ω) = S(S⊺Y)⁻¹S⊺ + (I − F̄F̄⊺) Ω (I − F̄F̄⊺),

with the right-orthonormal matrix F̄ containing the M normalised gradients F_i/‖F_i‖ in its columns, and a positive definite matrix Ω ∈ ℝ^{N×N} (the effective size of the space spanned in this way is only ℝ^{(N−M)×(N−M)}, so Ω is over-parameterising this space).

4.1.1.
Standardized norm posteriors using conjugate gradient observations. Eq. (4.5) parametrises posterior covariances of the BFGS family. In light of the scaling issues of these priors discussed in § 3.1, one would prefer, from the probabilistic standpoint, to use the standardized norm priors p(H) = N(H; αI, (H − H_0) ⊗⊖ (H − H_0)), but these priors do not share BFGS/CG's other good numerical properties. Instead, a hybrid algorithm can be constructed as follows:

⁷ With orthonormal Q ∈ ℝ^{N×N} and U ∈ ℝ^{M×M}, and rectangular diagonal Σ ∈ ℝ^{N×M}, which can be written as Σ = [D, 0]⊺ with an invertible diagonal matrix D ∈ ℝ^{M×M} and empty 0 ∈ ℝ^{M×(N−M)}.
⁸ Knowing H to be positive definite does provide a lower bound on the eigenvalues of T₋₋.
⁹ A probabilistically more appealing approach would be to use a hyper-prior on the elements of W, marginalized over the unidentified degrees of freedom. It is currently unclear to the author how to do this in a computationally efficient way.

1. Solve the linear problem using the conjugate gradient method. While the algorithm runs, collect S, Y, F̄. This has a storage cost of 2NM + M floats: because Y consists of differences between subsequent columns of F, it does not need to be stored explicitly; the column norms ‖F_i‖ required to compute F̄ require M extra floats. The computational cost of the standard conjugate gradient algorithm is O(M) matrix-vector multiplications (that is, O(MN²) assuming a dense matrix), plus O(MN) operations for the algorithm itself (including computation of the ‖F_i‖).
2. Using the (S, Y, F̄) constructed by CG, compute the standardized-norm posterior on H, i.e.
use the prior p(H) defined above, which yields a Gaussian posterior with mean

(4.6) H_M = H_0 + (S − H_0Y)(Y⊺(S − H_0Y))⁻¹(S − H_0Y)⊺ = αI + (S − αY)(Y⊺S − αY⊺Y)⁻¹(S − αY)⊺

and covariance factor

(4.7) W_M = (H − H_0) − (S − H_0Y)(Y⊺(S − H_0Y))⁻¹(S − H_0Y)⊺
(4.8)     = S(S⊺Y)⁻¹S⊺ + (I − F̄F̄⊺) Ω (I − F̄F̄⊺) − αI
(4.9)       − (S − αY)(Y⊺S − αY⊺Y)⁻¹(S − αY)⊺.

A prerequisite for this is to choose α < λ_min(H), less than the smallest eigenvalue of H, to ensure that W = H − H_0 is positive definite. But λ_min(H) = 1/λ_max(B), which can be estimated efficiently (and without additional cost) from the F_i. Another minor hurdle is that Equations (4.7) & (4.9) require the inverse of S⊺Y − αY⊺Y. The columns of Y are Y_i = F_i − F_{i−1}, so, because conjugate gradients constructs orthogonal gradients, Y⊺Y is a symmetric tridiagonal matrix, with Y_i⊺Y_i = ‖F_i‖² + ‖F_{i−1}‖², Y_i⊺Y_{i+1} = Y_{i+1}⊺Y_i = −‖F_i‖², and all other entries zero; and S⊺Y is diagonal because the S are conjugate under B. So the entire Gram matrix is tridiagonal, and the M linear problems in (Y⊺S − αY⊺Y)⁻¹(S − αY)⊺ can be solved in O(M²), e.g. using the Thomas algorithm [7, Alg. 4.3].
3. Estimate Ω according to some rule; § 4.2 proposes several rules of O(M) cost.

While there is a vague connection between the standardized norm prior and the SR1 algorithm by Corollary 3.1, the algorithm described above is quite different from the SR1 method. It uses search directions constructed by BFGS/CG, and its update rule uses the exact Gram matrix, not the repeated rank-1 updates that give SR1 its name.

Computational cost.
The computational overhead of constructing this posterior mean and covariance, after running the conjugate gradient algorithm, is O(M²), which is small compared even to the internal O(MN) cost of CG, let alone the O(MN²) for the matrix-vector multiplications in CG. Storing the posterior mean and covariance requires O(NM) space, which is feasible even for relatively large problems. Crucially, retaining the covariance adds almost no overhead to storing the mean alone.

Fig. 4.1. Fitting posterior uncertainty during iterative solution of linear problems, for three different generative processes of B. Each plot shows results from 20 randomly generated experiments with, top row: uniformly, middle row: exponentially distributed eigenvalues; bottom row: structured eigenvalue spectrum (details in text). Left: residual (gradient) ‖Bx − b‖ as a function of the number of line searches. Right: projections ω = s_M⊺F_{M−1}, whose regular structure is used for estimating W(Ω).

4.2. Estimation rules. The remaining step is to find estimates for Ω. It is clear that there are myriad options for fixing such rules. For an initial evaluation, we adopt the perhaps simplistic, but straightforward, approach of restricting Ω to a scalar matrix Ω = ω²I (one way to motivate this is to argue that, at step M, future line searches s_{M+i} will point in an unknown direction in the span of I − F̄F̄⊺, so it makes sense not to prefer any direction in the choice of Ω). A natural idea is to use regularity structure on quantities already computed during the run of the conjugate gradient algorithm: assume the algorithm is currently at step T. If, at step M < T, we had tried to predict the Gram matrix diagonal element y⊺_{M+1}Wy_{M+1} = −s⊺_{M+1}F_M using the structure for W described above, we would have predicted, because F_M is known to be in the span of S, and orthogonal to (I − F̄F̄⊺),

(4.10) y⊺_{M+1}Wy_{M+1} = F⊺_M S(S⊺Y)⁻¹S⊺F_M + F⊺_{M+1} Ω F_{M+1},
(4.11) −s⊺_{M+1}F_M = −∑_{i=1}^{M} (F⊺_M s_i)² / (s⊺_i F_{i−1}) + ω²‖F_{M+1}‖²,

and thus

(4.12) ω² = ‖F_{M+1}‖⁻² [∑_{i=1}^{M} (F⊺_M s_i)² / (s⊺_i F_{i−1}) − s⊺_{M+1}F_M].

‖F_{M+1}‖ can be estimated from the norms of preceding gradients. The second term on the right-hand side of Equation (4.12) is known at step M. The first term of the right-hand side can be estimated by regression, in ways further explored below.

First, to confirm that ω indeed tends to have regular structure related to the eigenvalue spectrum of H, Figure 4.1, right column, shows ω_i for i = 1, …, M during runs of CG on 20 linear problems, sampled from three different generative processes for B = QDQ⊺ ∈ ℝ^{200×200}. In each case, orthonormal matrices were drawn uniformly from the Haar measure over SO(N) as in § 3.1. For the top row of Figure 4.1, the eigenvalues (elements of D = diag(d)) were drawn uniformly from p(d_i) = U(0, 10) (the uniform distribution over [0, 10]). For the middle row, eigenvalues were drawn from the exponential distribution p(d_i) = λ⁻¹ exp(−d_i/λ) with scale λ = 10/log 2 (giving a median eigenvalue of 10). Finally, for the bottom row, eigenvalues were drawn from a structured process, with d_i for i = 1, …, 20 drawn from p(d) = U(0, 10³), and d_i for i = 21, …, 200 drawn from p(d) = U(0, 10) (i.e. the corresponding eigenvalues of H lie non-uniformly on the scales 10⁻³ and 0.1).
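As a concrete illustration of how such regular structure in the ω_i can be exploited (the data below are a synthetic ω sequence of our own, not from a CG run), a linear trend can be fitted and extrapolated to obtain a conservative scalar for Ω = ω²I:

```python
import numpy as np

def omega_bar(omegas, N):
    """Fit omega_i ~ a*i + b by least squares and extrapolate to i = N,
    giving a noisy upper bound for the scalar covariance parameter."""
    i = np.arange(1, len(omegas) + 1)
    a, b = np.polyfit(i, omegas, deg=1)     # slope and intercept
    return a * N + b

# hypothetical sequence of projections omega_i, with a mild upward trend
rng = np.random.default_rng(6)
omegas = 0.02 * np.arange(1, 21) + 0.5 + 0.05 * rng.standard_normal(20)
print(omega_bar(omegas, N=200))   # close to 0.02 * 200 + 0.5 = 4.5
```

The fit costs O(M), in line with the cost budget stated for the estimation rules; the extrapolated value deliberately over-estimates the ω_i seen so far, erring on the conservative side.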
Clear structure is visible in all cases. Using these observations, several different regression schemes for ω can be adopted.

● A simple baseline is a stationary model for the ω_i. This was used to construct error estimates in Figures 4.2 to 4.4 (in gray for the middle and bottom rows, black for the top row). Of course, if the eigenvalues of B are uniformly distributed in the top row, the eigenvalues of H (their inverses) are not.
● A slightly more elaborate model is a linear trend with noise: ω_i = ai + b + n (with n ∼ N(0, σ²)). Linear regression on the values of ω_i can be performed in O(M). We can then set Ω = ω̄I with ω̄ = aN + b, the expected largest value of ω_i (i.e. a noisy upper bound). This approach was used to construct the (black) error estimates in the middle rows of Figures 4.2 to 4.4.
● Finally, if structural knowledge is available, e.g. that the first L eigenvalues of B are α times larger than the later ones, one may use the stationary rule from above, but explicitly multiply the estimate ω by α for the first L steps. This may seem contrived, but in fact it is not uncommon in applications to know an effective number of degrees of freedom in B. For example, in nonparametric least-squares regression with a very large number N of data points distributed approximately uniformly over a range of width ρ, using an RBF kernel of length scale λ, the model's number of degrees of freedom is L = ρ/(2πλ) [38, Eq. 4.3]. This rule was used to construct the (black) error estimates in the bottom rows of Figures 4.2 to 4.4.

4.3. Estimating quantities of interest. This final part demonstrates a few example uses of the Gaussian posterior p(H) = N(H⃗; H⃗_M, W_M ⊗⊖ W_M) on H constructed by the BFGS / CG method. Figures 4.2 to 4.4 show three such uses, explained below. Each row of these figures uses data from one of the experiments shown in the corresponding row of Figure 4.1.
4.3.1. Estimating H itself. The most obvious question is how far the estimate H_M for H after M steps is from the true H. This distance is estimated directly by the Gaussian posterior of Equation (4.1). The marginal distribution on any linear projection AH⃗ is N(AH⃗; AH⃗_M, A(W_M ⊗⊖ W_M)A⊺). In particular, the marginal distribution on each element H_ij is a scalar Gaussian

(4.13) p(H_ij | S_M, Y_M) = N[H_ij; H_M,ij, ½(W_M,ii W_M,jj + W²_M,ij)].

Figure 4.2 shows this error estimate for 40 elements of one particular H (drawn uniformly at random from the 4·10⁴ elements of the 200 × 200 matrix). The estimate arising from the uniform estimation rule for ω from Section 4.2 is shown in gray in each panel (black for the top panel). The same quantity, estimated with the linear-regression and structured estimation rules from Section 4.2, is shown in black in the middle and bottom rows, respectively. The left column of the figure shows results from the BFGS/CG prior, the right column shows results using the standardized norm prior on data constructed with the CG algorithm as described in § 4.1.1. As expected from the argument in § 3.1, the BFGS estimates are regularly considerably too small, while the standardized-norm estimates have a meaningful width. The error estimators have varying behaviour. For the exponential eigenvalue spectrum, the estimator fluctuates strongly in the first few steps before settling to a good value (this could be corrected using a regularizer, left out here so as not to bias the results). For the structured-eigenvalues problems, the region around the step from small to large eigenvalues is problematic. But overall, they do provide a meaningful notion of error. In particular, they are rarely too small. For most uses of statistical error estimators, it is better to be too conservative (too large) than too confident. Of course, it would be good if future research found better-calibrated error estimates.

Fig. 4.2. Error estimation on H. Posterior mean (solid red) and one standard deviation (dashed black, gray). Left: BFGS/CG prior. Right: standardized norm prior, from CG observations. Rows as in Figure 4.1. The cut-off error bars in the bottom right plot rise up to values < 6.

As explained in Equation (3.1), the same error estimates can also be collapsed into an error estimate on the norm ‖H − H_M‖_F. Figure 4.3 shows results from such an experiment, for the 20 different H's from Figure 4.1. The quantitative results are similar to the previous figure, but this figure shows more clearly the difference between the baseline (gray) and the exponential and structured error estimates (black), and the behaviour of the estimated errors relative to the varying norms of the drawn H's.

4.3.2. Estimating solutions for new linear problems. An obvious use for the estimate of H found by CG / BFGS when solving one linear problem Bx = b is as an instantaneous solution estimate for other linear problems Bx_test = b_test. The left and middle columns of Figure 4.4 show this use. In each case, an x_test was drawn from N(x; 0, 10I), and the corresponding b_test = Bx_test presented to the algorithm.
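The posterior on such a new solution can be sketched directly from the posterior mean and covariance factor. In the code below (our own names; H_M and W_M are stand-ins for the quantities constructed by the solver), the covariance follows the symmetric-Kronecker marginalisation rule used throughout the text:

```python
import numpy as np

def xtest_posterior(HM, WM, b_test):
    """Gaussian posterior on x_test = H b_test: mean H_M b_test and
    covariance Sigma_ij = (W_ij b^T W b + (Wb)_i (Wb)_j) / 2 (cf. Eq. 4.14)."""
    Wb = WM @ b_test
    Sigma = 0.5 * (WM * (b_test @ Wb) + np.outer(Wb, Wb))
    return HM @ b_test, Sigma

rng = np.random.default_rng(7)
N = 10
C = rng.standard_normal((N, N))
WM = C @ C.T                       # stand-in posterior covariance factor
HM = np.eye(N)                     # stand-in posterior mean
mean, Sigma = xtest_posterior(HM, WM, rng.standard_normal(N))
# marginal error bars on the elements of x_test (diagonal of Sigma):
print(np.sqrt(np.diag(Sigma)))
```

Since W_M is positive semidefinite, so is Σ, and the diagonal of Σ gives exactly the element-wise error bars plotted in Figure 4.4.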
Since x_test = Hb_test is a linear projection of H, the posterior marginal on x_test is also Gaussian, p(x_test | S_M, Y_M) = N(x_test; H_M b_test, Σ), with covariance matrix elements

(4.14) cov(x_test,i, x_test,j) = Σ_ij = ½(W_ij b⊺_test W b_test + (Wb_test)_i (Wb_test)_j).

Fig. 4.3. True and estimated norm error ‖H − H_M‖_F. Posterior mean (red) and one standard deviation (black, gray). Left: BFGS/CG prior. Right: standardized norm prior, from CG observations. Rows as in Figure 4.1.

Figure 4.4 shows the true errors on the elements of x_test in blue, and the estimated marginal errors (the diagonal elements of Σ) in black for the stationary, linear, and structured models, respectively (and, as in previous figures, the stationary model in gray in the two non-stationary cases). More drastically than the previous ones, these figures show that the BFGS posterior can severely underestimate the error on elements of x_test, while the standardized norm prior at least provides outer bounds (albeit sometimes quite loose ones).

Remark on convergence. The error on x_test does not always collapse over the course of finding x. This says more about CG as such than about its probabilistic interpretation: CG does not aim to construct H, but only to find x∗. For simplicity of exposition, we have assumed that H = B⁻¹ exists and that CG requires the full N steps to converge, thus identifying B and H. In general, CG regularly converges much earlier. For an intuition, consider the special case where x_0 = 0 and b = [1, …
, 1, 0, 0, …, 0] consists of K consecutive ones and N − K zeros. The CG/BFGS algorithm will never explore the lower (N − K) × (N − K) block of H, which may contain arbitrary numbers. If the primary aim is not x∗ = Hb but H itself, a more elaborate course is needed, e.g. choosing several b to span a space of interest over H. It is an interesting open question whether the probabilistic interpretation can be used to actively collapse the uncertainty on H in a typically more efficient way than established matrix inversion methods like Gauss-Jordan (which is also a conjugate direction method [27]).

Fig. 4.4. Estimating solutions to Bx′ = b′. Element-wise error on a single test vector x_test. True error in blue. Error estimate with stationary model for ω in gray. Error estimate for model-specific estimate for ω (as in Figure 4.1) in black. Left: BFGS/CG prior. Right: standardized norm prior, from CG observations. Rows as in Figure 4.1.

5. Conclusion & outlook. This text developed a probabilistic interpretation of iterative solvers for linear problems Bx = b with symmetric B. The Dennis family of secant updates can be derived as the posterior mean of a parametric Gaussian model after one rank-1 observation. For rank-M observations, the match between these updates and Gaussian inference only holds if the search directions are conjugate under the prior covariance. This is the case for the DFP direct and BFGS inverse update rules.
Their equivalence to CG in the linear case makes them particularly interesting. However, it also became apparent that, from an inference perspective, the BFGS rule does not yield a well-scaled error measure. As a first step toward a better-scaled Gaussian belief, the standardized norm covariance was proposed. It is inspired by the SR1 rule, but leads to probabilistic corrections in the form of off-diagonal terms, and can be used with data produced by the CG algorithm, thus retaining the good numerical properties of that method. The space of possible covariance matrices consistent with the resulting mean is a sub-space of the positive definite cone, which collapses during the run of the algorithm (the same holds for the BFGS/CG method). Several possible estimation rules for choosing elements in this space of covariances were proposed, arising from different structural assumptions over H. The resulting Gaussian posterior provides joint uncertainty estimates on the elements of H, and on all linear projections of H, in particular solutions of other linear problems x_test = H b_test. This adds functionality to the conjugate gradient method, at a computational overhead much smaller than the cost of CG itself.

The implications for nonlinear optimization methods of both the quasi-Newton and CG families remain interesting open questions. For example, the conjugacy assumption implicit in the Dennis class members is clearly inconsistent with the probabilistic interpretation. This was already noted by Hennig & Kiefel [25, 26], who also proposed using a nonparametric Gaussian formulation to give a more explicit inference interpretation to nonlinear optimization. This left questions regarding the choice of prior covariance, which are only made more pressing by the results presented here.
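The projection-covariance computation summarized above (cf. Equation (4.14)) is inexpensive to implement. The following numpy sketch is an illustrative addition, not part of the original text: W stands in for the (hypothetical) posterior covariance factor and x for a test projection; the random sizes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5

# Hypothetical posterior covariance factor W (symmetric positive definite)
A = rng.standard_normal((N, N))
W = A @ A.T + N * np.eye(N)

x = rng.standard_normal(N)  # stand-in for the test vector x_test

# Equation (4.14): Sigma_ij = 1/2 ( W_ij x'Wx + (Wx)_i (Wx)_j )
Wx = W @ x
Sigma = 0.5 * (W * (x @ Wx) + np.outer(Wx, Wx))

# Sigma is symmetric positive semidefinite: a nonnegative multiple of W
# plus a rank-1 outer product.
assert np.allclose(Sigma, Sigma.T)
assert np.all(np.linalg.eigvalsh(Sigma) >= -1e-10)
```

The diagonal of Σ gives the marginal error estimates plotted in the experiments; the off-diagonal entries are the joint-uncertainty corrections discussed above.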
Another direction is inference from noisy evaluations, in which case the posterior covariance does not collapse to zero after finitely many steps of optimization, not even in the linear case. Some related results were previously discussed in [24], but the study of probabilistic numerical optimization remains at an early stage.

Appendix. Proofs for results from main text. Throughout the appendix, the notation ∆ = Y − B_0 S will be used to represent the residual.

A.1. Proof for Lemma 2.2. Because the operator Γ acts as ∑_{kℓ} Γ_{ij,kℓ} A_{kℓ} = 1/2 (A_ij + A_ji) for all A, its elements can be written as Γ_{ij,kℓ} = 1/2 (δ_ik δ_jℓ + δ_iℓ δ_jk), using Kronecker's δ function. We also note that Gaussians are closed under linear operations (see e.g. [2, Eq. 2.115]): p(B) = N(→B; →B_0, V) implies p(Γ→B) = N(Γ→B; Γ→B_0, Γ V Γ⊺). We complete the proof by observing that

(A.1)    (Γ(W ⊗ W)Γ⊺)_{ij,kℓ} = ∑_{ab,cd} 1/4 (δ_ia δ_jb + δ_ib δ_ja)(δ_kc δ_ℓd + δ_kd δ_ℓc) W_ac W_bd
(A.2)    = 1/4 (W_ik W_jℓ + W_iℓ W_jk + W_jk W_iℓ + W_jℓ W_ik)
(A.3)    = 1/2 (W_ik W_jℓ + W_iℓ W_jk).

A.2. Proof for Theorem 2.3. To be shown: Given p(B) = N(→B; →B_0, W ⊗⊖ W), the posterior from the likelihood δ(Y − BS) = lim_{Λ→0} N(→Y; (I ⊗ S⊺)→B, Λ), with Y, S ∈ R^{N×M} and rk(S) = M, has mean (with ∆ = Y − B_0 S)

(A.4)    B_M = B_0 + ∆(S⊺WS)^{-1}S⊺W + WS(S⊺WS)^{-1}∆⊺ − WS(S⊺WS)^{-1}(S⊺∆)(S⊺WS)^{-1}S⊺W,

and covariance

(A.5)    V_M = (W − WS(S⊺WS)^{-1}S⊺W) ⊗⊖ (W − WS(S⊺WS)^{-1}S⊺W).

We begin with the posterior mean (A.4). From Equation (2.3), it has the form (with the prior covariance V = W ⊗⊖ W)

    B_0 + V(I ⊗ S)[(I ⊗ S⊺)V(I ⊗ S)]^{-1}→∆.
(A.6)

A few straightforward steps establish that the NM × NM matrix to be inverted is indeed invertible for linearly independent columns of S, and has elements

(A.7)    [(I ⊗ S⊺)V(I ⊗ S)]_{ia,jb} = 1/2 [ W_ij (S⊺WS)_ab + (WS)_ib (WS)_ja ].

Also, the elements of V(I ⊗ S) are

(A.8)    [V(I ⊗ S)]_{ij,ka} = 1/2 ( W_ik (WS)_ja + (WS)_ia W_jk ).

So we are searching for the unique matrix X ∈ R^{N×M} satisfying

(A.9)    →∆ = [(I ⊗ S⊺)V(I ⊗ S)]→X, i.e.  ∆ = 1/2 (W X S⊺WS + WS X⊺WS),

which then gives the posterior mean correction as 1/2 (W X S⊺W + WS X⊺W). (Because X is rectangular, Equation (A.9) is a generalization of a Lyapunov equation. Standard solutions for such equations do not apply directly.) Instead of just presenting a solution, the following lines give a constructive proof. We first re-write Eq. (A.9) (S⊺WS is invertible because W is positive definite and S is assumed to be of rank M) as

(A.10)    2∆ = W X S⊺WS + WS X⊺WS,
(A.11)    2 W^{-1}∆(S⊺WS)^{-1} = X + S X⊺WS(S⊺WS)^{-1}.

Let QΣU⊺ = S be the singular value decomposition of S. That is, Q ∈ R^{N×N} and U ∈ R^{M×M} are orthonormal, and Σ ∈ R^{N×M} consists of an upper part containing the diagonal matrix D ∈ R^{M×M} and a lower part in R^{(N−M)×M} containing only zeros. We will write Q = [Q_+, Q_−], where Q_+ ∈ R^{N×M} is a basis of the range of S, and Q_− ∈ R^{N×(N−M)} is a basis of its orthogonal complement, the kernel of S⊺. Because S is full rank, D is invertible, and we can equivalently write

(A.12)    X = Q R D^{-1} U⊺

with a (generally dense) matrix R = [R_+; R_−] (R_+ ∈ R^{M×M}, R_− ∈ R^{(N−M)×M}).
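Before completing the construction, the target expression (A.4) itself can be sanity-checked numerically (this sketch is an addition to the text): for observations generated as Y = BS with symmetric B and symmetric B_0 (so that S⊺∆ = ∆⊺S), the posterior mean must reproduce the data, B_M S = Y, and remain symmetric. Problem sizes and seed below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 8, 3

def random_spd(n):
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

B, B0, W = random_spd(N), random_spd(N), random_spd(N)  # truth, prior mean, prior factor
S = rng.standard_normal((N, M))                          # search directions, rank M
Y = B @ S                                                # noise-free observations
Delta = Y - B0 @ S

G = np.linalg.inv(S.T @ W @ S)                           # inverse Gram matrix
# Posterior mean, Equation (A.4)
BM = (B0 + Delta @ G @ S.T @ W + W @ S @ G @ Delta.T
      - W @ S @ G @ (S.T @ Delta) @ G @ S.T @ W)

assert np.allclose(BM @ S, Y)   # posterior mean reproduces the observations
assert np.allclose(BM, BM.T)    # and is symmetric
```

Both properties follow algebraically from (A.4) using S⊺∆ = ∆⊺S, which holds whenever B and B_0 are symmetric.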
This allows re-writing Equation (A.11) as

(A.13)    2 [ Q_+⊺W^{-1}∆(S⊺WS)^{-1}UD ; Q_−⊺W^{-1}∆(S⊺WS)^{-1}UD ] = [ R_+ + (R⊺Q⊺WS)(S⊺WS)^{-1}UD ; R_− ],

which identifies R_−. Noting that Q_+Q_+⊺ = Q_+D^{-1}U⊺ U D Q_+⊺ = S^{+⊺}S⊺, where S^+ = UD^{-1}Q_+⊺ denotes the pseudo-inverse of S, we can write

(A.14)    R⊺Q⊺WS = (R_+⊺Q_+⊺ + R_−⊺Q_−⊺)WS
(A.15)    = (R_+⊺Q_+⊺ + 2DU⊺(S⊺WS)^{-1}∆⊺W^{-1}Q_−Q_−⊺)WS
(A.16)    = R_+⊺Q_+⊺WS + 2DU⊺(S⊺WS)^{-1}∆⊺W^{-1}(I − Q_+Q_+⊺)WS
(A.17)    = R_+⊺Q_+⊺WS + 2DU⊺(S⊺WS)^{-1}∆⊺S − 2DU⊺(S⊺WS)^{-1}∆⊺W^{-1}S^{+⊺}(S⊺WS).

Plugging back into Equation (A.13), using (S⊺WS)^{-1}UD = (Q_+⊺WS)^{-1}, we get

    2Q_+⊺W^{-1}∆(S⊺WS)^{-1}UD = R_+ + R_+⊺Q_+⊺WS(Q_+⊺WS)^{-1} + 2DU⊺(S⊺WS)^{-1}∆⊺S(S⊺WS)^{-1}UD − 2DU⊺(S⊺WS)^{-1}∆⊺W^{-1}S^{+⊺}(S⊺WS)(S⊺WS)^{-1}UD
(A.18)    = R_+ + R_+⊺ + 2DU⊺(S⊺WS)^{-1}∆⊺S(S⊺WS)^{-1}UD − 2DU⊺(S⊺WS)^{-1}∆⊺W^{-1}S^{+⊺}UD,

and hence

(A.19)    1/2 (R_+ + R_+⊺) = Q_+⊺W^{-1}∆(S⊺WS)^{-1}UD + DU⊺(S⊺WS)^{-1}∆⊺W^{-1}Q_+
(A.20)    − DU⊺(S⊺WS)^{-1}∆⊺S(S⊺WS)^{-1}UD.

We see directly that this is a symmetric matrix, because S⊺∆ = S⊺BS − S⊺B_0S = ∆⊺S. Now, noting that XS⊺ + SX⊺ = Q_+(R_+ + R_+⊺)Q_+⊺ + Q_−R_−Q_+⊺ + Q_+R_−⊺Q_−⊺, we find

(A.21)    1/2 (XS⊺ + SX⊺) = Q_+Q_+⊺W^{-1}∆(S⊺WS)^{-1}S⊺ − S(S⊺WS)^{-1}∆⊺S(S⊺WS)^{-1}S⊺ + S(S⊺WS)^{-1}∆⊺W^{-1}Q_+Q_+⊺ + (I − Q_+Q_+⊺)W^{-1}∆(S⊺WS)^{-1}S⊺ + S(S⊺WS)^{-1}∆⊺W^{-1}(I − Q_+Q_+⊺)
(A.22)    = −S(S⊺WS)^{-1}∆⊺S(S⊺WS)^{-1}S⊺ + W^{-1}∆(S⊺WS)^{-1}S⊺ + S(S⊺WS)^{-1}∆⊺W^{-1}.
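As an added sanity check (not part of the original text), the construction can be verified numerically: taking R_− as identified in (A.13) and the symmetric representative for R_+ from (A.19)–(A.20), the resulting X solves Equation (A.10). The sketch assumes, as in the proof, a random SPD W, a full-rank S, and ∆ = (B − B_0)S with symmetric B, B_0 so that S⊺∆ is symmetric.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 7, 3

def random_spd(n):
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

W, B, B0 = random_spd(N), random_spd(N), random_spd(N)
S = rng.standard_normal((N, M))
Delta = (B - B0) @ S                      # ensures S' Delta is symmetric

Q, d, Ut = np.linalg.svd(S)               # S = Q [diag(d); 0] Ut
U, D = Ut.T, np.diag(d)
Qp, Qm = Q[:, :M], Q[:, M:]               # Q_+ (range of S) and Q_- (complement)

G = np.linalg.inv(S.T @ W @ S)
Winv = np.linalg.inv(W)

# R_- from (A.13); symmetric choice of R_+ from (A.19)-(A.20)
Rm = 2 * Qm.T @ Winv @ Delta @ G @ U @ D
Rp = (Qp.T @ Winv @ Delta @ G @ U @ D
      + D @ U.T @ G @ Delta.T @ Winv @ Qp
      - D @ U.T @ G @ Delta.T @ S @ G @ U @ D)

X = Q @ np.vstack([Rp, Rm]) @ np.linalg.inv(D) @ U.T   # Equation (A.12)

# X solves Equation (A.10): W X S'WS + W S X' W S = 2 Delta
lhs = W @ X @ S.T @ W @ S + W @ S @ X.T @ W @ S
assert np.allclose(lhs, 2 * Delta)
```

The symmetric choice of R_+ is what the final result uses; other choices differing in the antisymmetric part leave the posterior mean unchanged.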
From Equation (A.8), the posterior mean can be written as

(A.23)    B_M = B_0 + 1/2 (W X S⊺W + WS X⊺W),

which is clearly equal to Equation (A.4). To establish the form of the posterior covariance, we make use of the structural similarities between the posterior mean and covariance (Equation (2.3)), and notice that we have just established

(A.24)    ∑_{ka,nb} [V(I ⊗ S)]_{ij,ka} [(I ⊗ S⊺)V(I ⊗ S)]^{-1}_{ka,nb} ∆_{nb} = [∆(S⊺WS)^{-1}S⊺W + WS(S⊺WS)^{-1}∆⊺]_{ij} − [WS(S⊺WS)^{-1}∆⊺S(S⊺WS)^{-1}S⊺W]_{ij}.

So we can simply replace ∆_{nb} with [(I ⊗ S⊺)V]_{nb,kℓ} = 1/2 [W_nk(S⊺W)_bℓ + W_nℓ(S⊺W)_bk] and find, after a few lines of simple algebra, the form of Equation (A.5) for the posterior covariance. This completes the proof.

A.3. Proof for Lemma 3.2. To be shown: If the Gram matrix S⊺WS is diagonal, then the exact posterior mean B_M after M steps, which is

(A.25)    B_M = B_0 + ∆(S⊺WS)^{-1}S⊺W + WS(S⊺WS)^{-1}∆⊺ − WS(S⊺WS)^{-1}(S⊺∆)(S⊺WS)^{-1}S⊺W,

is equal to the rank-2 update of B_{M−1} using the Dennis update

(A.26)    B_M = B_{M−1} + [(y_M − B_{M−1}s_M)c_M⊺ + c_M(y_M − B_{M−1}s_M)⊺] / (c_M⊺s_M) − c_M [s_M⊺(y_M − B_{M−1}s_M)] c_M⊺ / (c_M⊺s_M)²

for c_M = W s_M. We first harmonize the notation between the two formulations by writing the elements of the diagonal Gram matrix as (S⊺WS)_ij = δ_ij c_i⊺s_i =: δ_ij a_i.
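The claim of Lemma 3.2 can also be checked empirically before working through the algebra. The following numpy sketch (an addition to the text; sizes and seed arbitrary) makes random directions W-conjugate by Gram-Schmidt in the W inner product, then compares M sequential Dennis rank-2 updates with c = Ws against the batch posterior mean (A.25).

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 6, 3

def random_spd(n):
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

B, B0, W = random_spd(N), random_spd(N), random_spd(N)

# Make the directions W-conjugate by Gram-Schmidt in the W inner product
S = rng.standard_normal((N, M))
for i in range(M):
    for j in range(i):
        S[:, i] -= (S[:, j] @ W @ S[:, i]) / (S[:, j] @ W @ S[:, j]) * S[:, j]
Y = B @ S

# Batch posterior mean, Equation (A.25)
Delta = Y - B0 @ S
G = np.linalg.inv(S.T @ W @ S)
batch = (B0 + Delta @ G @ S.T @ W + W @ S @ G @ Delta.T
         - W @ S @ G @ (S.T @ Delta) @ G @ S.T @ W)

# Sequential rank-2 Dennis updates, Equation (A.26), with c = W s
Bi = B0.copy()
for i in range(M):
    s, y = S[:, i], Y[:, i]
    c = W @ s
    r = y - Bi @ s
    Bi = (Bi + (np.outer(r, c) + np.outer(c, r)) / (c @ s)
          - (s @ r) * np.outer(c, c) / (c @ s) ** 2)

assert np.allclose(Bi, batch)   # sequential and batch estimates coincide
```

With directions that are not W-conjugate, the final assertion fails in general, which is precisely the point of the lemma.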
With this notation, the posterior mean B_M, Equation (A.25), can be written as

(A.27)    B_M = B_0 + ∑_{i=1}^M (∆_i c_i⊺ + c_i ∆_i⊺) / a_i − ∑_{i=1}^M ∑_{j=1}^M c_i [∆⊺S]_ij c_j⊺ / (a_i a_j),

which can be written recursively as

(A.28)    B_M = B_{M−1} + (∆_M c_M⊺ + c_M ∆_M⊺) / a_M − ∑_{i=1}^{M−1} (c_M [∆⊺S]_{Mi} c_i⊺ + c_i [∆⊺S]_{iM} c_M⊺) / (a_M a_i) − c_M [∆⊺S]_{MM} c_M⊺ / (a_M a_M)

(A.29)    = B_{M−1} + (∆_M − ∑_{i=1}^{M−1} c_i (∆⊺S)_{iM} / a_i) c_M⊺ / a_M + c_M (∆_M − ∑_{i=1}^{M−1} c_i (∆⊺S)_{iM} / a_i)⊺ / a_M − c_M (∆⊺S)_{MM} c_M⊺ / a_M².

On the other hand, the expression y_M − B_{M−1}s_M from Equation (A.26) can be written using Equation (A.27) as

(A.30)    y_M − B_{M−1}s_M = y_M − B_0 s_M − ∑_{i=1}^{M−1} (∆_i c_i⊺ s_M + c_i ∆_i⊺ s_M) / a_i + ∑_{i=1}^{M−1} ∑_{j=1}^{M−1} c_i (∆⊺S)_ij c_j⊺ s_M / (a_i a_j).

But since, by assumption, c_i⊺s_M = 0 for i ≠ M, this expression simplifies to

(A.31)    y_M − B_{M−1}s_M = y_M − B_0 s_M − ∑_{i=1}^{M−1} c_i ∆_i⊺ s_M / a_i = ∆_M − ∑_{i=1}^{M−1} c_i ∆_i⊺ s_M / a_i.

Similarly, the expression s_M⊺(y_M − B_{M−1}s_M) from Equation (A.26) simplifies to

(A.32)    s_M⊺(y_M − B_{M−1}s_M) = s_M⊺y_M − s_M⊺B_0 s_M − ∑_{i=1}^{M−1} s_M⊺(∆_i c_i⊺ + c_i ∆_i⊺)s_M / a_i + ∑_{i=1}^{M−1} ∑_{j=1}^{M−1} s_M⊺ c_i [∆⊺S]_ij c_j⊺ s_M / (a_i a_j) = s_M⊺∆_M.

Reinserting these expressions into Equation (A.26), we see that it equals Equation (A.29), which completes the proof.

A.4. Proof for Lemma 3.3. The DFP update is the direct update with the choice W = B; and the BFGS update is the inverse update with the choice W = H. So the Gram matrix, in both cases, is S⊺BS = Y⊺HY = S⊺Y. The i,j-th element of this symmetric M × M matrix is y_i⊺s_j. The statement to be shown is that this matrix is diagonal if the line search directions are chosen as

(A.33)    s_{i+1} = −α_{i+1} H_{i+1} F_i,

with the residual (the gradient of the equivalent quadratic optimization objective) F_i = B x_i − b.
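Before turning to the proof, the statement can be illustrated numerically (an added sketch, not part of the original text): running BFGS with exact line searches on a random quadratic and collecting the matrices S and Y, the Gram matrix Y⊺S = S⊺BS should come out diagonal to machine precision. The initialization H_0 = I and the problem size are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 8, 4                         # dimension and number of steps, M < N

A = rng.standard_normal((N, N))
B = A @ A.T + N * np.eye(N)         # SPD Hessian of the quadratic
b = rng.standard_normal(N)

x = np.zeros(N)
H = np.eye(N)                       # initial inverse-Hessian estimate
S_list, Y_list = [], []

F = B @ x - b                       # gradient of f(x) = 1/2 x'Bx - x'b
for _ in range(M):
    d = -H @ F
    alpha = -(d @ F) / (d @ B @ d)  # exact line search for the quadratic
    s = alpha * d
    x = x + s
    F_new = B @ x - b
    y = F_new - F
    S_list.append(s); Y_list.append(y)
    # BFGS inverse update (satisfies the secant equation H y = s)
    rho = 1.0 / (y @ s)
    V = np.eye(N) - rho * np.outer(s, y)
    H = V @ H @ V.T + rho * np.outer(s, s)
    F = F_new

S, Y = np.column_stack(S_list), np.column_stack(Y_list)
Gram = Y.T @ S                      # = S'BS; diagonal by Lemma 3.3
off = Gram - np.diag(np.diag(Gram))
assert np.max(np.abs(off)) < 1e-8 * np.max(np.abs(Gram))
```

This is the classical observation that, on a quadratic with exact line searches and H_0 = I, BFGS produces B-conjugate directions (cf. Nazareth [32]).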
We also assume exact line searches. First, consider the special case where j = i + 1 (i.e. subsequent line searches). Because they are in the Dennis class, the estimates for H (irrespective of whether they were constructed by inverting a direct estimate or using an inverse estimate directly) fulfill the 'quasi-Newton equation' s_i = H_{i+1}y_i = H_{i+1}(F_i − F_{i−1}). Thus

(A.34)    s_{i+1} = −α_{i+1}(s_i + H_{i+1}F_{i−1}).

The exact line search along s_i ended when s_i⊺F_i = 0, so

(A.35)    y_i⊺s_{i+1} = −α_{i+1}(F_i − F_{i−1})⊺(s_i + H_{i+1}F_{i−1}) = −α_{i+1}(y_i⊺H_{i+1}F_{i−1} − s_i⊺F_{i−1}) = −α_{i+1}(s_i⊺F_{i−1} − s_i⊺F_{i−1}) = 0

(the last step follows because, by the quasi-Newton equation, s_i = H_j y_i for all j > i). By symmetry of the Gram matrix, Eq. (A.35) also implies y_{i+1}⊺s_i = 0.

We complete the proof inductively: Let j > i + 1 or i > j + 1, and assume y_i⊺s_{j−a} = y_{j−a}⊺s_i = 0 for all a > 0. Also, F_{j−1} can be written with a telescoping sum as

(A.36)    F_{j−1} = (F_{j−1} − F_{j−2} + F_{j−2} − F_{j−3} + ⋯ − F_i + F_i) = ∑_{a=i+1}^{j−1} y_a + F_i.

Hence

(A.37)    y_i⊺s_j = −α_j y_i⊺(s_{j−1} + H_j F_{j−2})    [by definition of the search direction, Eq. (A.34)]
(A.38)    = −α_j (0 + y_i⊺ H_j F_{j−2})    [by induction hypothesis]
(A.39)    = −α_j s_i⊺ F_{j−2}    [by the quasi-Newton property]
(A.40)    = −α_j s_i⊺ (∑_{a=i+1}^{j−2} y_a + F_i)    [by Eq. (A.36)]
(A.41)    = −α_j s_i⊺ F_i    [by induction hypothesis]
(A.42)    = 0    [because the i-th line search is exact].

This completes the proof.

Remark. This also implies F_i⊺s_j = 0 for i ≠ j: Assume w.l.o.g. that i > j. Then use the telescoping sum of Equation (A.36) to get

(A.43)    0 = y_i⊺s_j = (F_i − F_{i−1})⊺s_j = (F_i − ∑_{a=j+1}^{i−1} y_a − F_j)⊺ s_j = F_i⊺ s_j.

Acknowledgments. The author would like to thank Maren Mahsereci and Martin Kiefel for helpful discussions both prior to and during the preparation of this manuscript.
He is particularly grateful to Maren Mahsereci for helpful discussions about details of the symmetric basis in Theorem 2.3. In addition, the author is thankful for comments from two anonymous reviewers, particularly for pointing out relevant previous work.

REFERENCES

[1] F. Alizadeh, J.-P. A. Haeberly, and M. L. Overton, Primal-dual interior-point methods for semidefinite programming: Convergence rates, stability and numerical results, SIAM J. Optimization, 8 (1998), pp. 746–768.
[2] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[3] C. G. Broyden, A class of methods for solving nonlinear simultaneous equations, Math. Comp., 19 (1965), pp. 577–593.
[4] C. G. Broyden, Quasi-Newton methods and their application to function minimization, Math. Comp., 21 (1967), p. 45.
[5] C. G. Broyden, A new double-rank minimization algorithm, Notices of the AMS, 16 (1969), p. 670.
[6] L. M. Le Cam, Convergence of estimates under dimensionality restrictions, Ann. Statistics, 1 (1973), pp. 38–53.
[7] S. D. Conte and C. de Boor, Elementary Numerical Analysis. An Algorithmic Approach, McGraw-Hill, 1980.
[8] W. Davidon, Variable metric method for minimization, tech. report, Argonne National Laboratories, Ill., 1959.
[9] W. Davidon, Optimally conditioned optimization algorithms without line searches, Mathematical Programming, 9 (1975), pp. 1–30.
[10] J. Dennis, On some methods based on Broyden's secant approximations, in Numerical Methods for Non-Linear Optimization, Dundee, 1971.
[11] J. E. Dennis, Jr. and R. B. Schnabel, Least change secant updates for quasi-Newton methods, SIAM Review, 21 (1979), pp. 443–459.
[12] J. E. Dennis, Jr. and H. F. Walker, Convergence theorems for least-change secant update methods, SIAM Journal on Numerical Analysis, 18 (1981), pp. 949–987.
[13] J. E. Dennis, Jr. and J. J.
Moré, Quasi-Newton methods, motivation and theory, SIAM Review, (1977), pp. 46–89.
[14] P. Diaconis, Bayesian numerical analysis, Statistical Decision Theory and Related Topics, IV (1988), pp. 163–175.
[15] P. Diaconis and M. Shahshahani, The subgroup algorithm for generating uniform random variables, Probability in the Engineering and Informational Sciences, 1 (1987), p. 40.
[16] L. C. W. Dixon, Quasi-Newton algorithms generate identical points, Mathematical Programming, 2 (1972), pp. 383–387.
[17] L. C. W. Dixon, Quasi-Newton techniques generate identical points II: The proofs of four new theorems, Mathematical Programming, 3 (1972), pp. 345–358.
[18] R. Fletcher, A new approach to variable metric algorithms, The Computer Journal, 13 (1970), p. 317.
[19] R. Fletcher and M. J. D. Powell, A rapidly convergent descent method for minimization, The Computer Journal, 6 (1963), pp. 163–168.
[20] R. Fletcher and C. M. Reeves, Function minimization by conjugate gradients, The Computer Journal, 7 (1964), pp. 149–154.
[21] D. M. Gay, Some convergence properties of Broyden's method, SIAM Journal on Numerical Analysis, 16 (1979), pp. 623–630.
[22] D. Goldfarb, A family of variable metric updates derived by variational means, Math. Comp., 24 (1970), pp. 23–26.
[23] J. Greenstadt, Variations on variable-metric methods, Math. Comp., 24 (1970), pp. 1–22.
[24] P. Hennig, Fast probabilistic optimization from noisy gradients, in International Conference on Machine Learning (ICML), 2013.
[25] P. Hennig and M. Kiefel, Quasi-Newton methods – a new direction, in International Conference on Machine Learning (ICML), 2012.
[26] P. Hennig and M. Kiefel, Quasi-Newton methods – a new direction, Journal of Machine Learning Research, 14 (2013), pp. 834–865.
[27] M. R. Hestenes and E. Stiefel, Methods of conjugate gradients for solving linear systems, Journal of Research of the National Bureau of Standards, 49 (1952), pp. 409–436.
[28] R. A. Horn and C. R.
Johnson, Matrix Analysis, Cambridge University Press, 1990.
[29] D. J. C. MacKay, Introduction to Gaussian processes, NATO ASI Series F Computer and Systems Sciences, 168 (1998), pp. 133–166.
[30] R. J. Muirhead, Aspects of Multivariate Statistical Theory, Wiley, 2005.
[31] W. Murray, ed., Numerical Methods for Unconstrained Optimization, Academic Press, New York-London, 1972.
[32] L. Nazareth, A relationship between the BFGS and conjugate gradient algorithms and its implications for new algorithms, SIAM J. Numerical Analysis, 16 (1979), pp. 794–800.
[33] J. Nocedal, Updating quasi-Newton matrices with limited storage, Math. Comp., 35 (1980), pp. 773–782.
[34] J. Nocedal and S. J. Wright, Numerical Optimization, Springer Verlag, 1999.
[35] A. O'Hagan, Some Bayesian numerical analysis, Bayesian Statistics, 4 (1992), pp. 345–363.
[36] M. J. D. Powell, A new algorithm for unconstrained optimization, in Nonlinear Programming, O. L. Mangasarian and K. Ritter, eds., AP, 1970.
[37] H. J. Martinez R., Local and superlinear convergence of structured secant methods from the convex class, Tech. Report 88-01, Rice University, 1988.
[38] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, MIT, 2006.
[39] D. F. Shanno, Conditioning of quasi-Newton methods for function minimization, Math. Comp., 24 (1970), pp. 647–656.
[40] C. F. van Loan, The ubiquitous Kronecker product, J. of Computational and Applied Mathematics, 123 (2000), pp. 85–100.
[41] F. Zhang, The Schur Complement and Its Applications, vol. 4, Springer, 2005.