Conjugating Variational Inference for Large Mixed Multinomial Logit Models and Consumer Choice
Heterogeneity in multinomial choice data is often accounted for using logit models with random coefficients. Such models are called "mixed", but they can be difficult to estimate for large datasets. We review current Bayesian variational inference (VI) methods that can do so, and propose a new VI method that scales more effectively. The key innovation is a step that efficiently updates a Gaussian approximation to the conditional posterior of the random coefficients, addressing a bottleneck within the variational optimization. The approach is used to estimate three types of mixed logit models: standard, nested, and bundle variants. We first demonstrate the improvement of our new approach over existing VI methods using simulations. Our method is then applied to a large scanner panel dataset of pasta choices. We find that consumer response to price and promotion variables exhibits substantial heterogeneity at the grocery-store and product levels. Store size, premium positioning, and geography are found to drive store-level estimates of price elasticities. Extending the model to bundle choice with pasta sauce improves accuracy further. Predictions from the mixed models are more accurate than those from their fixed-coefficient equivalents, and our VI method provides insights in circumstances that other methods find challenging.
💡 Research Summary
The paper addresses the computational challenges of Bayesian inference for large‑scale mixed multinomial logit (MMNL) models, where the number of observations can reach millions and the random‑coefficient vector may have dozens or hundreds of dimensions. Traditional Markov chain Monte Carlo (MCMC) methods become infeasible in such settings, prompting the authors to explore variational inference (VI) as a scalable alternative. Existing VI approaches for MMNL—data‑augmented VI (DA‑VI) and amortized VI (A‑VI)—either fail to capture the dependence between random effects and global parameters or require heavy neural‑network training, leading to sub‑optimal accuracy or excessive memory usage.
The authors propose a new algorithm called Conjugating Variational Inference (CVI). The key idea is to perform a second‑order Taylor expansion of the log‑likelihood with respect to the random coefficients, yielding a quadratic form that is conjugate to a Gaussian prior. This results in a closed‑form Gaussian approximation to the conditional posterior of the random coefficients, which can be updated efficiently without solving a numerical optimization problem for each group at every iteration. The variational family therefore consists of (i) global parameters θ (fixed coefficients, hyper‑means, and hyper‑covariance), (ii) the mean μ and covariance Σ of the random coefficients, and (iii) auxiliary expansion‑center parameters η that are refreshed periodically using minibatch data. The ELBO is maximized via stochastic gradient descent with the re‑parameterization trick, and Adam is used for adaptive learning rates.
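The core conjugate update can be sketched numerically. The snippet below is a minimal illustration, not the paper's implementation: it takes a plain multinomial logit likelihood for a single decision maker, expands it to second order around an expansion center η, and combines the resulting quadratic form with a Gaussian prior to get a closed-form Gaussian conditional posterior. Function names and the dense NumPy linear algebra are assumptions for exposition.

```python
import numpy as np

def mnl_grad_hess(beta, X, y):
    """Gradient and negative Hessian of the multinomial logit
    log-likelihood for one decision maker.
    X: (T, J, K) alternative attributes; y: (T,) chosen alternative indices."""
    K = beta.shape[0]
    g = np.zeros(K)
    H = np.zeros((K, K))
    for Xt, yt in zip(X, y):
        v = Xt @ beta
        p = np.exp(v - v.max())          # numerically stable softmax
        p /= p.sum()
        g += Xt[yt] - p @ Xt             # score contribution
        H += Xt.T @ (np.diag(p) - np.outer(p, p)) @ Xt  # PSD negative Hessian
    return g, H

def conjugate_gaussian_update(mu0, Sigma0, eta, X, y):
    """Closed-form Gaussian approximation to the conditional posterior of the
    random coefficients, from a second-order expansion of the log-likelihood
    at eta. Because the quadratic form is conjugate to the N(mu0, Sigma0)
    prior, no per-group numerical optimization is needed."""
    g, H = mnl_grad_hess(eta, X, y)
    Lam0 = np.linalg.inv(Sigma0)
    Lam = Lam0 + H                       # posterior precision
    Sigma = np.linalg.inv(Lam)
    mu = Sigma @ (Lam0 @ mu0 + H @ eta + g)
    return mu, Sigma
```

Iterating this update with η reset to the current mean gives a Newton-style fixed point; per the summary above, the paper instead refreshes η only periodically from minibatch data.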
CVI’s computational advantage stems from three aspects: (1) the quadratic approximation yields a closed‑form update for Σ, reducing the cost of that step from O(N·K) operations to O(K²); (2) η is updated only intermittently, so the algorithm does not need to scan the entire dataset at each iteration; (3) the method naturally accommodates an unrestricted full covariance matrix for the random coefficients, a setting that is notoriously difficult for other VI schemes.
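The stochastic-gradient step with the reparameterization trick, mentioned in the summary of the ELBO optimization, can be illustrated on a toy objective. The snippet below is a stand-in sketch, not the paper's ELBO: it maximizes E_q[f(β)] over the variational mean μ for a Gaussian q = N(μ, LLᵀ), using a quadratic f whose optimum (the vector `a`) is known in advance; `a`, `mu`, and `L` are illustrative names.

```python
import numpy as np

# Reparameterization trick: write beta = mu + L @ eps with eps ~ N(0, I),
# so gradients of E_q[f(beta)] pass through the sampling step to mu.
rng = np.random.default_rng(1)
a = np.array([1.5, -0.5])       # known optimum of the toy objective
mu = np.zeros(2)                # variational mean, to be learned
L = 0.1 * np.eye(2)             # fixed Cholesky factor of the variance

step = 0.05
for _ in range(3000):
    eps = rng.standard_normal((10, 2))   # 10 reparameterized draws
    beta = mu + eps @ L.T                # beta = mu + L eps
    # f(beta) = -0.5 * ||beta - a||^2, so df/dbeta = a - beta;
    # chain rule gives d beta / d mu = I, so this is also the gradient in mu.
    grad = (a - beta).mean(axis=0)
    mu = mu + step * grad                # stochastic gradient ascent
```

The same pattern, with the full ELBO in place of the toy objective and an adaptive optimizer such as Adam in place of the fixed step size, matches the optimization loop described above.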
Simulation studies compare CVI against MCMC (ground truth), DA‑VI, and A‑VI across a range of sample sizes (10⁴–10⁶) and random‑coefficient dimensions (30–90). CVI matches MCMC’s posterior means and covariances with negligible bias, while achieving 3–5× faster convergence than the other VI methods. In particular, CVI’s estimates of the covariance matrix are substantially more accurate, which is crucial for interpreting heterogeneity.
The authors then apply CVI to a real‑world scanner panel dataset comprising over 500,000 pasta purchases from 381 U.S. grocery stores. Three model specifications are estimated: (a) a standard MMNL, (b) a mixed nested logit (MNestL) that groups alternatives into nests with nest‑specific scale parameters τ, and (c) a bundled MMNL (B‑MMNL) that allows joint purchase of pasta and sauce, introducing complementary effects γ for each bundle. Priors are weakly informative Gaussian for fixed effects, half‑t for τ, and two alternatives for Σ (LKJ and Huang‑Wand). The results reveal substantial heterogeneity in price and promotion sensitivities at both store and product levels. Larger, premium‑positioned stores exhibit more negative price elasticities, indicating higher price sensitivity. The nested model uncovers distinct τ values for premium versus regular stores, suggesting different substitution patterns. The bundled model improves out‑of‑sample predictive accuracy by about 2.3 percentage points, an improvement that disappears when using fixed‑coefficient models, highlighting the value of capturing joint purchase behavior.
Overall, the paper makes several contributions: (1) a novel VI algorithm (CVI) that efficiently handles high‑dimensional random effects with unrestricted covariance; (2) a thorough empirical comparison demonstrating CVI’s speed‑accuracy trade‑off; (3) application to large‑scale marketing data, providing new insights into consumer heterogeneity and the benefits of modeling bundles; and (4) a discussion of limitations (second‑order approximation bias, choice of η‑update frequency) and avenues for future work (higher‑order expansions, non‑Gaussian priors, online extensions).
In sum, CVI offers a practical, scalable solution for Bayesian estimation of large mixed logit models, opening the door for richer choice‑behavior analyses in marketing, transportation, health economics, and other fields where heterogeneity matters.