Scalable Krylov Subspace Methods for Generalized Mixed Effects Models with Crossed Random Effects
Mixed-effects models are widely used to model data with hierarchical grouping structures and high-cardinality categorical predictor variables. However, for high-dimensional crossed random effects, sparse Cholesky decompositions, the current standard approach, can become prohibitively slow. In this work, we present Krylov subspace-based methods that address these computational bottlenecks and analyze them both theoretically and empirically. In particular, we derive new results on the convergence and accuracy of the preconditioned stochastic Lanczos quadrature and conjugate gradient methods for mixed-effects models, and we develop scalable methods for calculating predictive variances. In experiments with simulated and real-world data, the proposed methods yield speedups by factors of up to about 10,000 and are numerically more stable than Cholesky-based computations as implemented in state-of-the-art packages such as lme4 and glmmTMB. Our methodology is available in the open-source C++ software library GPBoost, with accompanying high-level Python and R packages.
💡 Research Summary
This paper addresses the severe computational bottlenecks that arise when fitting generalized mixed‑effects models (GLMMs) with high‑dimensional crossed random effects. Traditional approaches rely on a sparse Cholesky factorization of the matrix Σ⁻¹ + ZᵀWZ, where Σ encodes the variance components of the random effects, Z is the incidence matrix, and W contains second‑derivative information from the likelihood. In high‑cardinality crossed settings (e.g., recommender systems with millions of user‑item interactions), fill‑in drives the time and memory cost of this factorization up super‑linearly, and it quickly becomes prohibitive.
The authors propose a suite of Krylov subspace methods that replace Cholesky entirely. First, they employ a preconditioned conjugate‑gradient (CG) algorithm to solve linear systems of the form (Σ⁻¹ + ZᵀWZ) u = v. Two preconditioners are investigated: a simple diagonal preconditioner D = diag(Σ⁻¹ + ZᵀWZ) and a more sophisticated symmetric successive over‑relaxation (SSOR) preconditioner. Theoretical analysis shows that the SSOR preconditioner dramatically reduces the effective condition number, leading to convergence in far fewer CG iterations (often < 50) compared with the diagonal case.
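The linear-solve step can be illustrated with a small self-contained sketch (NumPy/SciPy with made-up sizes, not the GPBoost implementation): it assembles a toy system matrix Σ⁻¹ + ZᵀWZ for a single grouping factor and solves it with CG under the diagonal (Jacobi) preconditioner.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg, LinearOperator

rng = np.random.default_rng(0)
n, m = 2000, 300  # hypothetical sizes: n observations, m random-effect levels

# Incidence matrix Z: each observation is assigned to one random-effect level.
Z = sp.csr_matrix((np.ones(n), (np.arange(n), rng.integers(0, m, size=n))),
                  shape=(n, m))
W = sp.diags(rng.uniform(0.5, 2.0, size=n))  # likelihood second-derivative weights
Sigma_inv = sp.identity(m) / 0.5             # Sigma = 0.5 * I: one variance component

A = (Sigma_inv + Z.T @ W @ Z).tocsr()        # system matrix Sigma^{-1} + Z'WZ

# Diagonal preconditioner D = diag(A), applied as v -> D^{-1} v.
d = A.diagonal()
M = LinearOperator(A.shape, matvec=lambda v: v / d)

b = rng.standard_normal(m)
u, info = cg(A, b, M=M)  # info == 0 indicates convergence
```

An SSOR preconditioner would replace the elementwise division by forward and backward triangular sweeps with the lower and upper triangular parts of A; the CG call itself is unchanged.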
Second, the paper tackles the evaluation of log‑determinants and their derivatives, which are required for the Laplace‑approximated marginal likelihood and for Fisher‑information calculations. The authors adopt stochastic Lanczos quadrature (SLQ) to approximate log det(Σ⁻¹ + ZᵀWZ). Crucially, they embed the same preconditioner P used in CG into the SLQ formulation:
log det(Σ⁻¹ + ZᵀWZ) = log det(P) + log det(P^(−1/2) (Σ⁻¹ + ZᵀWZ) P^(−ᵀ/2)).
The second term is estimated by SLQ using a small number of Gaussian probe vectors. By reusing the Lanczos coefficients that naturally arise during the CG solves, the method avoids a separate Lanczos run, eliminates loss‑of‑orthogonality issues, and requires no storage of the Lanczos basis. This integration also yields a variance‑reduced stochastic trace estimator for derivatives of the log‑determinant with respect to variance components θ, enabling efficient gradient and Fisher‑information computation.
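A minimal, standalone SLQ sketch in NumPy can illustrate the quadrature idea (without preconditioning or the CG-coefficient reuse described above, which are specific to the paper's pipeline): each Gaussian probe z yields zᵀ log(A) z ≈ ‖z‖² · e₁ᵀ log(T) e₁, where T is the Lanczos tridiagonal matrix.

```python
import numpy as np

def lanczos_tridiag(A, v, k):
    """Run k Lanczos steps on symmetric A starting from v; return tridiagonal T."""
    n = len(v)
    alpha, beta = [], []
    q_prev, q, b = np.zeros(n), v / np.linalg.norm(v), 0.0
    for _ in range(k):
        w = A @ q - b * q_prev
        a = q @ w
        w -= a * q
        b = np.linalg.norm(w)
        alpha.append(a)
        if b < 1e-12:  # lucky breakdown: Krylov space exhausted
            break
        beta.append(b)
        q_prev, q = q, w / b
    T = np.diag(alpha)
    off = beta[:len(alpha) - 1]
    if off:
        T += np.diag(off, 1) + np.diag(off, -1)
    return T

def slq_logdet(A, num_probes=50, k=30, rng=None):
    """Stochastic Lanczos quadrature estimate of log det(A) for SPD A."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = A.shape[0]
    est = 0.0
    for _ in range(num_probes):
        z = rng.standard_normal(n)
        T = lanczos_tridiag(A, z, k)
        evals, evecs = np.linalg.eigh(T)
        # Gauss quadrature: z' log(A) z ~ ||z||^2 * sum_i tau_i^2 log(lambda_i)
        est += (z @ z) * np.sum(evecs[0, :] ** 2 * np.log(evals))
    return est / num_probes
```

In the paper's variant, the tridiagonal coefficients come for free from the preconditioned CG solves, so no separate Lanczos run or basis storage is needed.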
Third, the authors develop a scalable approach for computing predictive variances when the number of prediction points nₚ is large. Instead of solving nₚ linear systems individually, they employ a Monte‑Carlo‑style estimator that draws a modest number of random probe vectors z, solves (Σ⁻¹ + ZᵀWZ) x = z for each via preconditioned CG, and combines the solutions to approximate diag(Ωₚ), the diagonal of the posterior predictive covariance. The cost scales with the number of probe vectors and CG iterations, not with nₚ, making the approach feasible for millions of predictions.
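The principle behind such stochastic diagonal estimation can be sketched generically (a textbook Hutchinson-style estimator, not the paper's exact construction): for any matrix B available only through matrix-vector products or solves, diag(B) = E[z ⊙ Bz] for Rademacher probes z.

```python
import numpy as np

def stochastic_diag(apply_B, n, num_probes=500, rng=None):
    """Hutchinson-style estimator of diag(B), using E[z * (B z)] = diag(B)
    for Rademacher probes z. apply_B(v) returns B @ v -- e.g., a CG solve
    when B is the inverse of Sigma^{-1} + Z'WZ."""
    if rng is None:
        rng = np.random.default_rng(0)
    acc = np.zeros(n)
    for _ in range(num_probes):
        z = rng.choice([-1.0, 1.0], size=n)
        acc += z * apply_B(z)
    return acc / num_probes
```

The estimator's variance at entry i is the sum of squared off-diagonal entries in row i of B, which is why variance-reduction techniques such as the paper's control variates matter in practice.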
Theoretical contributions include: (i) bounds on the extremal eigenvalues of the preconditioned matrix for both diagonal and SSOR preconditioners; (ii) explicit expressions for the effective condition number governing CG convergence; (iii) convergence rate and variance bounds for the preconditioned SLQ estimator; and (iv) a control‑variates scheme that further reduces stochastic estimator variance.
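For context, the effective condition number κ in (ii) enters through the classical CG error bound (a standard result, not specific to this paper):

‖u − uₖ‖_A ≤ 2 ((√κ − 1)/(√κ + 1))^k ‖u − u₀‖_A,

so the number of iterations needed for a fixed accuracy scales like √κ, which is why the SSOR preconditioner's reduction of the effective condition number translates directly into fewer CG iterations.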
Empirical evaluation comprises synthetic experiments (varying m from 10⁴ to 2 × 10⁵ and n up to 10⁶) and real‑world case studies: a large‑scale movie‑rating dataset and a genetics dataset with crossed family and site effects. Across all settings, the Krylov‑based pipeline achieves speed‑ups ranging from 10³ to 10⁴ relative to the state‑of‑the‑art packages lme4 and glmmTMB, while delivering virtually identical parameter estimates and predictive performance. Moreover, the new methods exhibit far greater numerical stability: the Cholesky‑based solvers frequently encounter overflow/underflow or loss of positive‑definiteness, whereas the CG/SLQ pipeline remains robust.
Implementation is provided in the open‑source C++ library GPBoost, with high‑level Python and R bindings that mimic the familiar lme4/glmmTMB APIs. This makes the methodology readily accessible to practitioners who need to fit GLMMs with massive crossed random effects without sacrificing statistical accuracy.
In summary, the paper establishes Krylov subspace methods—preconditioned CG for linear solves, preconditioned SLQ for log‑determinants, and stochastic trace estimation for gradients—as a comprehensive, theoretically grounded, and practically scalable alternative to sparse Cholesky factorization for generalized mixed‑effects models with high‑dimensional crossed random effects.