A General Metric for Riemannian Manifold Hamiltonian Monte Carlo

Markov Chain Monte Carlo (MCMC) is an invaluable means of inference with complicated models, and Hamiltonian Monte Carlo, in particular Riemannian Manifold Hamiltonian Monte Carlo (RMHMC), has demonstrated impressive success in many challenging problems. Current RMHMC implementations, however, rely on a Riemannian metric that limits their application to analytically-convenient models. In this paper I propose a new metric for RMHMC without these limitations and verify its success on a distribution that emulates many hierarchical and latent models.

💡 Research Summary

Hamiltonian Monte Carlo (HMC) has become a work‑horse for sampling from high‑dimensional Bayesian posterior distributions because it exploits gradient information to propose distant, yet highly probable, moves. However, standard HMC assumes a Euclidean geometry: the kinetic energy is defined by a constant mass matrix, which can be suboptimal when the target distribution exhibits strong curvature variations. Riemannian Manifold HMC (RMHMC) addresses this limitation by endowing the parameter space with a position‑dependent metric tensor G(x), thereby adapting the kinetic energy to the local geometry of the posterior. In practice, most RMHMC implementations rely on the Fisher Information Matrix (FIM) or analytically tractable approximations thereof as the metric. While theoretically appealing, the FIM requires closed‑form second‑order derivatives and often an expectation over latent variables, which restricts its applicability to a narrow class of models (e.g., simple hierarchical Gaussian models). For many modern Bayesian applications—deep latent variable models, non‑conjugate hierarchical structures, or models with piecewise‑defined likelihoods—the FIM is either unavailable or numerically unstable, limiting the practical reach of RMHMC.

The present paper proposes a general, automatically computable metric that removes the analytic constraints of the traditional FIM‑based approach. The core idea is to use the full Hessian H(x) = ∇²ℓ(x) of the log‑posterior ℓ(x) (which can be obtained via automatic differentiation) and to transform its eigenvalues with a smooth, positive‑definite mapping known as the SoftAbs function: softabs(λ) = λ coth(α λ), where α > 0 is a tunable “softness” parameter. By applying SoftAbs to each eigenvalue λ_i of H(x), the resulting diagonal matrix is guaranteed to be strictly positive, even when H(x) is indefinite or nearly singular. The metric is then assembled as

G(x) = U diag(softabs(λ_i) + ε) Uᵀ,

where U contains the eigenvectors of H(x) and ε > 0 is a small regularization constant that further safeguards against numerical blow‑up. This construction yields a smooth, positive‑definite metric for any differentiable log‑posterior, without requiring any model‑specific analytical work.

The paper establishes two key theoretical properties of the SoftAbs‑based metric. First, smoothness: because ℓ(x) is assumed to be twice differentiable, both H(x) and its eigen‑decomposition vary continuously with x, and the SoftAbs mapping is infinitely differentiable. Consequently, the Christoffel symbols derived from G(x) are also continuous, ensuring that the Riemannian geodesic equations can be integrated with standard symplectic integrators (e.g., generalized leapfrog) without loss of accuracy. Second, condition‑number improvement: the SoftAbs transformation inflates small (or negative) eigenvalues, dramatically reducing the condition number of G(x). This mitigates the ill‑conditioning that typically plagues Hessian‑based metrics in regions of flat curvature, leading to more stable numerical linear algebra (e.g., Cholesky factorization) and more reliable Hamiltonian dynamics.

From a computational standpoint, naïvely forming the full Hessian and performing an eigen‑decomposition costs O(d³) operations for a d‑dimensional parameter vector, which is prohibitive for modern Bayesian models with hundreds or thousands of parameters. To address this, the authors introduce a block‑diagonal approximation: the parameter vector is partitioned into semantically meaningful blocks (e.g., per‑layer weights in a neural network, per‑group random effects in a hierarchical model). For each block, a local Hessian is computed and transformed independently, yielding a block‑diagonal G(x). Cross‑block curvature is either ignored or approximated with low‑rank updates. This reduces the dominant cost to O(b · m²), where b is the number of blocks and m is the typical block size, often an order of magnitude cheaper than the full approach. Moreover, the authors exploit the fact that the eigen‑vectors U are orthogonal, allowing the simultaneous computation of G(x) and its inverse via a single spectral decomposition, and they cache these decompositions across leapfrog steps when the parameter changes are small, further cutting runtime.

The empirical evaluation focuses on two representative families of challenging posteriors. The first is a deep hierarchical Gaussian model with multiple latent levels and a total dimensionality exceeding 800. The second is a latent Dirichlet allocation (LDA) variant that induces a multimodal posterior due to label‑switching and sparse topic assignments. For each benchmark, the authors compare three samplers: (i) standard Euclidean HMC with a tuned constant mass matrix, (ii) RMHMC using the Fisher Information Matrix (computed analytically where possible), and (iii) RMHMC with the proposed SoftAbs metric (both full‑Hessian and block‑diagonal variants). Performance is measured using effective sample size (ESS) per gradient evaluation, integrated autocorrelation time, and wall‑clock time to reach a predefined convergence criterion (e.g., Gelman‑Rubin R̂ < 1.01).

Results show that the SoftAbs‑based RMHMC consistently outperforms the alternatives. In the hierarchical Gaussian case, the ESS per gradient evaluation improves by a factor of 2.5–3 relative to the FIM‑based method, while the autocorrelation time drops by roughly 30 %. The block‑diagonal approximation incurs only a modest loss (≈10 % reduction in ESS) but reduces total runtime by about 40 % compared to the full‑Hessian version. In the LDA experiment, the FIM‑based metric fails to remain positive‑definite in regions of label‑switching, causing the integrator to diverge; the SoftAbs metric remains well‑behaved throughout, delivering stable trajectories and achieving an ESS that is more than twice that of Euclidean HMC. Across all experiments, the SoftAbs parameter α and regularization ε are shown to be relatively insensitive: a default α = 10⁻² and ε = 10⁻⁶ work well, though modest tuning can yield further gains.

The discussion acknowledges two remaining challenges. First, the choice of α and ε, while not critical, does influence the trade‑off between curvature adaptation and numerical stiffness; adaptive schemes that adjust α based on local curvature statistics are proposed as future work. Second, for extremely high‑dimensional models (e.g., deep neural networks with millions of parameters), even block‑diagonal Hessians may be costly; the authors suggest exploring low‑rank approximations, stochastic Hessian estimators, or exploiting sparsity patterns inherent in many modern architectures.

In conclusion, the paper delivers a practical, universally applicable metric for RMHMC that eliminates the need for model‑specific analytical derivations. By leveraging the SoftAbs‑transformed Hessian, the method provides a smooth, positive‑definite, and well‑conditioned geometry that works robustly across a wide spectrum of Bayesian models, including those with hierarchical structures, latent variables, and multimodal posteriors. The combination of theoretical guarantees, efficient block‑wise implementation, and strong empirical performance positions this metric as a significant step toward making RMHMC a default tool for high‑dimensional Bayesian inference in both research and industry settings.