A Bregman Extension of quasi-Newton updates I: An Information Geometrical framework
We study quasi-Newton methods from the viewpoint of the information geometry associated with Bregman divergences. Fletcher studied a variational problem that derives the approximate Hessian update formula of the quasi-Newton methods. We point out that this variational problem is identical to minimization of the Kullback-Leibler divergence, a discrepancy measure between two probability distributions: the Kullback-Leibler divergence for the multivariate normal distribution corresponds to the objective function Fletcher considered. We introduce the Bregman divergence as an extension of the Kullback-Leibler divergence and derive extended quasi-Newton update formulae based on the variational problem with the Bregman divergence. Like the Kullback-Leibler divergence, the Bregman divergence induces an information-geometric structure on the set of positive definite matrices. From this geometric viewpoint, we study the approximate Hessian update, the invariance property of the update formulae, and sparse quasi-Newton methods. In particular, we point out that sparse quasi-Newton methods are closely related to statistical methods such as the EM algorithm and boosting. Information geometry is a useful tool not only for better understanding quasi-Newton methods but also for designing new update formulae.
💡 Research Summary
The paper revisits quasi‑Newton optimization methods through the lens of information geometry, using Bregman divergences as the central tool. It begins by recalling Fletcher’s variational formulation, which seeks the smallest change in the Hessian approximation subject to the secant condition. The authors observe that this variational problem is mathematically equivalent to minimizing the Kullback‑Leibler (KL) divergence between two probability distributions, specifically between multivariate normal laws whose covariance matrices are the current and updated Hessian approximations. Since KL divergence for Gaussian families is a special case of a Bregman divergence, the paper naturally extends the framework by replacing KL with a general Bregman divergence defined by a strictly convex generator φ.
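The equivalence can be made concrete by writing out the KL divergence between two zero-mean multivariate normals whose covariances are the current and updated approximations. A minimal numerical sketch (the matrices `P` and `Q` are arbitrary illustrative positive-definite matrices, not taken from the paper):

```python
import numpy as np

def kl_gaussian(P, Q):
    """KL(N(0, P) || N(0, Q)) for zero-mean multivariate normals with
    positive-definite covariances P and Q:
    0.5 * (tr(Q^{-1} P) - log det(Q^{-1} P) - n)."""
    n = P.shape[0]
    M = np.linalg.solve(Q, P)               # Q^{-1} P
    return 0.5 * (np.trace(M) - np.log(np.linalg.det(M)) - n)

# Illustrative positive-definite matrices.
P = np.array([[2.0, 0.3], [0.3, 1.0]])
Q = np.array([[1.5, 0.0], [0.0, 1.2]])

assert abs(kl_gaussian(P, P)) < 1e-10       # zero iff the arguments coincide
assert kl_gaussian(P, Q) > 0                # strictly positive otherwise
```

Note that the divergence is asymmetric in its arguments, which is why the e- and m-projections discussed below are genuinely different operations.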
A Bregman divergence D_φ(P‖Q)=φ(P)−φ(Q)−⟨∇φ(Q),P−Q⟩ induces a dually flat manifold on the cone of positive‑definite matrices S_{++}^n. The e‑connection (exponential) and m‑connection (mixture) give rise to two complementary affine coordinate systems, allowing the variational problem to be expressed as either an e‑projection or an m‑projection onto the affine subspace defined by the secant condition. By choosing different generators φ (e.g., φ(H)=−log det H, φ(H)=½ tr(H²), φ(H)=tr(H log H)), the resulting update formulas recover classical quasi‑Newton schemes such as DFP and BFGS, and also generate novel families of updates that inherit the same secant property.
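As a concrete instance, the log-det generator φ(H) = −log det H has gradient ∇φ(Q) = −Q⁻¹, and plugging this into the definition gives a closed form that is exactly twice the zero-mean Gaussian KL divergence with covariances P and Q. A sketch checking the closed form against the defining formula (illustrative matrices, not the paper's code):

```python
import numpy as np

def bregman_logdet(P, Q):
    """Bregman divergence for phi(H) = -log det H on positive-definite
    matrices.  With grad phi(Q) = -Q^{-1}, the definition simplifies to
    D_phi(P||Q) = tr(Q^{-1} P) - log det(Q^{-1} P) - n."""
    n = P.shape[0]
    M = np.linalg.solve(Q, P)
    return np.trace(M) - np.log(np.linalg.det(M)) - n

def bregman_generic(phi, grad_phi, P, Q):
    """Direct evaluation of D_phi(P||Q) = phi(P) - phi(Q) - <grad phi(Q), P - Q>,
    using the Frobenius inner product <A, B> = sum(A * B)."""
    return phi(P) - phi(Q) - np.sum(grad_phi(Q) * (P - Q))

phi = lambda H: -np.log(np.linalg.det(H))
grad_phi = lambda H: -np.linalg.inv(H)

P = np.array([[2.0, 0.3], [0.3, 1.0]])
Q = np.array([[1.5, 0.0], [0.0, 1.2]])

# The closed form matches the defining formula.
assert np.isclose(bregman_logdet(P, Q), bregman_generic(phi, grad_phi, P, Q))
```

Swapping in a different generator (e.g. `phi = lambda H: 0.5 * np.trace(H @ H)` with `grad_phi = lambda H: H`) changes only the two lambdas, which is the sense in which the framework parameterizes a whole family of updates.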
The authors prove that these Bregman‑based updates are invariant under affine transformations of the variable space: if x̃ = A x, then the transformed inverse‑Hessian approximation H̃ = A H Aᵀ follows the same update rule with the same generator φ. This affine invariance is a direct consequence of the underlying information‑geometric structure and guarantees that the algorithm’s performance does not depend on the choice of coordinate system.
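The invariance is easy to verify numerically for the classical BFGS inverse-Hessian update, which the framework recovers as a special case. Under x̃ = A x, the step transforms as s̃ = A s, the gradient difference as ỹ = A⁻ᵀ y, and the inverse-Hessian approximation as H̃ = A H Aᵀ; updating in either coordinate system gives the same result. A sketch with random illustrative data:

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    """BFGS update of the inverse-Hessian approximation H, given
    step s and gradient difference y (requires y @ s != 0)."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned invertible transform
H = np.eye(n)
s = rng.standard_normal(n)
y = s + 0.1 * rng.standard_normal(n)              # keeps the curvature y @ s positive

# Update in the original coordinates, then in the transformed ones.
H_new = bfgs_inverse_update(H, s, y)
Ht_new = bfgs_inverse_update(A @ H @ A.T, A @ s, np.linalg.solve(A.T, y))

# The two results are related by the same transformation rule.
assert np.allclose(Ht_new, A @ H_new @ A.T)
```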
A substantial portion of the paper is devoted to sparse quasi‑Newton methods. When the Hessian approximation is required to respect a sparsity pattern (e.g., a tree or banded structure), the authors propose an alternating projection scheme: first project the full matrix onto the sparsity manifold using an m‑projection (which minimizes the Bregman divergence while preserving sparsity), then project back onto the secant manifold using an e‑projection. This two‑step process mirrors the Expectation‑Maximization (EM) algorithm, where the E‑step corresponds to the sparsity projection and the M‑step to the secant projection. Moreover, the iterative refinement resembles boosting, where each projection incrementally improves the approximation, analogous to adding weak learners.
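The alternation can be sketched in its simplest instance: with the squared-Frobenius generator φ(H) = ½ tr(H²), both projections have elementary closed forms. Projecting onto the secant set {H : H s = y} is the minimum-norm rank-one correction (Broyden's update), and projecting onto a sparsity pattern just zeroes the excluded entries. This Euclidean sketch (with an illustrative lower-triangular pattern and synthetic data) only illustrates the alternation; the paper's e- and m-projections use the general Bregman geometry:

```python
import numpy as np

def project_secant(H, s, y):
    """Frobenius projection onto the affine set {H : H s = y}:
    the minimum-norm correction is Broyden's rank-one update."""
    return H + np.outer(y - H @ s, s) / (s @ s)

def project_sparsity(H, mask):
    """Projection onto the subspace of matrices with a given sparsity
    pattern: zero out the entries outside the mask."""
    return H * mask

rng = np.random.default_rng(1)
n = 5
mask = np.tril(np.ones((n, n)))               # illustrative pattern: lower triangular
H_true = rng.standard_normal((n, n)) * mask   # a matrix obeying the pattern
s = np.array([3.0, 1.0, -1.0, 2.0, 1.0])
y = H_true @ s                                # secant data consistent with the pattern

H = np.eye(n)
for _ in range(300):                          # alternating projections (von Neumann)
    H = project_sparsity(project_secant(H, s, y), mask)

assert np.allclose(H * (1 - mask), 0)         # sparsity pattern holds exactly
assert np.allclose(H @ s, y, atol=1e-8)       # secant condition holds in the limit
```

Since both constraint sets are affine and their intersection is nonempty by construction, the alternating projections converge to a matrix satisfying both constraints; the limit found generally differs from `H_true`, because the intersection is underdetermined.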
Theoretical analysis shows that each projection strictly reduces the chosen Bregman divergence, ensuring monotonic progress toward the optimal sparse Hessian. Numerical experiments on benchmark problems demonstrate that selecting a generator φ adapted to the problem structure (for instance, a log‑det generator for covariance estimation) can accelerate convergence compared to standard BFGS, while preserving sparsity and affine invariance.
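The monotone-decrease guarantee rests on the generalized Pythagorean theorem for Bregman projections. In the Euclidean special case (generator φ(H) = ½ tr(H²)), the relation can be verified directly for the secant projection, using random illustrative data:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
s = rng.standard_normal(n)
P = rng.standard_normal((n, n))
y = P @ s                                        # P lies in the secant set {H : H s = y}
Q = rng.standard_normal((n, n))                  # arbitrary point outside the set

# Frobenius projection of Q onto the secant set (Broyden correction).
Q_proj = Q + np.outer(y - Q @ s, s) / (s @ s)

# Pythagorean relation for projection onto an affine set:
# ||P - Q||^2 = ||P - Q_proj||^2 + ||Q_proj - Q||^2,
# so the projection can only decrease the divergence to any P in the set.
lhs = np.linalg.norm(P - Q) ** 2
rhs = np.linalg.norm(P - Q_proj) ** 2 + np.linalg.norm(Q_proj - Q) ** 2
assert np.isclose(lhs, rhs)
```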
In conclusion, the paper establishes a unified information‑geometric framework for quasi‑Newton updates. By interpreting Fletcher’s variational principle as KL‑divergence minimization and then generalizing to arbitrary Bregman divergences, it provides a systematic way to derive, analyze, and extend quasi‑Newton schemes. The framework clarifies the geometric meaning of classical updates, guarantees desirable invariance properties, and opens new avenues for designing sparse, problem‑specific algorithms that are closely related to well‑known statistical procedures such as EM and boosting.