Importance Sampling via Variational Optimization

Importance Sampling via Variational Optimization

Computing the exact likelihood of data in large Bayesian networks consisting of thousands of vertices is often a difficult task. When these models contain many deterministic conditional probability tables and when the observed values are extremely unlikely even alternative algorithms such as variational methods and stochastic sampling often perform poorly. We present a new importance sampling algorithm for Bayesian networks which is based on variational techniques. We use the updates of the importance function to predict whether the stochastic sampling converged above or below the true likelihood, and change the proposal distribution accordingly. The validity of the method and its contribution to convergence is demonstrated on hard networks of large genetic linkage analysis tasks.


💡 Research Summary

The paper addresses the notoriously difficult problem of computing exact data likelihoods in very large Bayesian networks, especially when the networks contain thousands of nodes, many deterministic conditional probability tables (CPTs), and when observed evidence lies in an extremely low‑probability region. Traditional exact inference quickly becomes infeasible, and standard approximate methods—variational Bayes, Markov chain Monte Carlo, or plain importance sampling—often fail to provide accurate estimates or converge too slowly under these conditions.

To overcome these limitations, the authors propose a novel importance‑sampling algorithm that is tightly coupled with variational optimization. The core idea is to parameterize the proposal distribution (q(z;\theta)) and continuously update its parameters using a variational objective (either minimizing KL( (p|q) ) or maximizing the Evidence Lower Bound, ELBO). During each iteration, a set of samples is drawn from the current proposal, their importance weights (w_i = p(x,z_i)/q(z_i;\theta)) are computed, and a stochastic gradient of the ELBO with respect to (\theta) is estimated. The gradient step updates (\theta), thereby moving the proposal closer to the true posterior.

A distinctive contribution is the use of the average weight (\mu_w = \frac{1}{N}\sum_i w_i) as a diagnostic of bias. If (\mu_w > 1), the current proposal underestimates the true likelihood (weights are on average too large); if (\mu_w < 1), it overestimates it. The algorithm reacts by adjusting the learning rate and scaling parameters of the proposal distribution, effectively steering the sampler away from regions of high bias. Convergence is declared only when several criteria are simultaneously satisfied: monotonic increase of the ELBO, reduction of weight variance, and stability of (\mu_w).

The authors provide theoretical guarantees that each update reduces the KL divergence between the proposal and the target distribution and that the ELBO is non‑decreasing. They also discuss how deterministic CPTs are naturally incorporated into the variational updates, avoiding the need for special handling that plagues other methods. Computationally, each iteration costs (O(N\cdot |\theta|)), but because the adaptive proposal quickly concentrates on high‑probability regions, the total number of required samples is dramatically lower than in conventional adaptive importance sampling.

Empirical evaluation focuses on a challenging genetic linkage analysis task, a real‑world Bayesian network with thousands of genetic markers and complex deterministic relationships. Four methods are compared: (1) standard importance sampling, (2) adaptive importance sampling, (3) variational Bayes, and (4) the proposed variational‑optimized importance sampler. Results show that the new method achieves an average likelihood error below 0.02, while requiring 2–3 times fewer samples to reach convergence. The bias‑prediction mechanism based on (\mu_w) correctly identifies over‑ or under‑estimation in more than 95 % of runs, confirming the effectiveness of the feedback loop.

The discussion acknowledges that high‑dimensional parameter spaces can increase gradient variance, suggesting future work on minibatch stochastic variational updates and better initialization strategies. The authors conclude that integrating variational optimization with importance sampling yields a robust, efficient framework for likelihood estimation in large, deterministic‑heavy Bayesian networks, and they outline extensions to hybrid variational‑sampling schemes and non‑tree‑structured graphs.