Improving parameter learning of Bayesian nets from incomplete data
This paper addresses the estimation of parameters of a Bayesian network from incomplete data. The task is usually tackled by running the Expectation-Maximization (EM) algorithm several times in order to obtain a high log-likelihood estimate. We argue that choosing the maximum log-likelihood estimate (as well as the maximum penalized log-likelihood and the maximum a posteriori estimate) has severe drawbacks, being affected both by overfitting and model uncertainty. Two ideas are discussed to overcome these issues: a maximum entropy approach and a Bayesian model averaging approach. Both ideas can be easily applied on top of EM, while the entropy idea can also be implemented in a more sophisticated way, through a dedicated non-linear solver. A vast set of experiments shows that these ideas produce significantly better estimates and inferences than the traditional and widely used maximum (penalized) log-likelihood and maximum a posteriori estimates. In particular, if EM is adopted as optimization engine, the model averaging approach is the best-performing one; its performance is matched by the entropy approach when implemented using the non-linear solver. The results suggest that the applicability of these ideas is immediate (they are easy to implement and to integrate in currently available inference engines) and that they constitute a better way to learn Bayesian network parameters.
💡 Research Summary
The paper tackles the problem of learning the parameters of a Bayesian network (BN) when the training data contain missing values. Under the common MAR (Missing At Random) assumption, the log‑likelihood surface becomes non‑concave and multimodal, so the standard practice is to run the Expectation‑Maximization (EM) algorithm from many random initializations and keep the solution with the highest score (maximum likelihood, penalized likelihood, or MAP). The authors argue that this “maximum‑score” strategy suffers from two fundamental drawbacks: (1) the selected solution may be over‑fitted to the limited data, even when MAP or a penalty term is used; (2) many EM runs often converge to solutions with almost identical scores but very different parameter values, so discarding all but the top scorer ignores model uncertainty.
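The multi-restart strategy the authors criticize can be sketched as follows. This is a toy illustration, not the paper's implementation: `toy_em_run` stands in for a full EM run, hill-climbing a deliberately multimodal one-dimensional surrogate of the log-likelihood surface so that different random starts reach different local optima, exactly the situation described above.

```python
import math
import random

def toy_em_run(seed):
    """Stand-in for one EM run: hill-climb a multimodal 1-D surrogate
    log-likelihood from a random start (illustrative only)."""
    random.seed(seed)
    theta = random.uniform(0.0, 1.0)
    # Non-concave surrogate with two interior local maxima on [0, 1].
    loglik = lambda t: math.sin(12 * t) - (t - 0.5) ** 2
    step = 0.01
    for _ in range(500):
        up = min(theta + step, 1.0)
        down = max(theta - step, 0.0)
        if loglik(up) > loglik(theta):
            theta = up
        elif loglik(down) > loglik(theta):
            theta = down
    return theta, loglik(theta)

# Standard practice: many random restarts, keep only the top scorer,
# discarding every other (possibly near-tied) solution.
runs = [toy_em_run(s) for s in range(30)]
best_theta, best_score = max(runs, key=lambda r: r[1])
```

Different seeds converge to different local optima with different `theta`; the "maximum-score" rule throws all but one away, which is precisely the loss of model-uncertainty information the authors object to.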
To address these issues, two alternative criteria are proposed, both of which can be layered on top of the existing EM engine.
- Maximum‑Entropy Criterion – Instead of picking the single highest‑scoring estimate, the method first filters the set of EM solutions to those whose scores are at least a fraction c of the global maximum (e.g., c = 0.99). Among this high‑score subset, the estimate with the largest entropy of the conditional probability tables is chosen. Maximizing entropy yields the most conservative distribution consistent with the data, thereby reducing over‑fitting. Two implementations are described: a simple “pick‑the‑most‑entropic among the EM runs” approach, and a more sophisticated formulation that treats the entropy maximization with a score constraint as a nonlinear optimization problem solved by a generic solver.
- Bayesian Model Averaging (BMA) at the CPT Level – Classical BMA averages predictions over a set of models weighted by their posterior probabilities, but applying it directly to full joint distributions would break the BN factorization. The authors instead apply BMA locally: for each variable Xj and each parent configuration πj, the conditional probabilities obtained from each EM run are averaged using weights proportional to the run’s score (or posterior). The resulting averaged CPTs define a single BN that can be used with any standard inference engine. This approach preserves the original graph structure while incorporating model uncertainty.
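The simple "pick-the-most-entropic among the EM runs" variant can be sketched as below. The data structures are our assumptions (a solution as a dict with a log-likelihood `score` and nested-list `cpts`); since scores are log-likelihoods, the fraction-of-maximum threshold on the likelihood translates into an additive `log(c)` threshold in log space.

```python
import math

def cpt_entropy(cpts):
    """Total Shannon entropy of all conditional distributions in a BN's CPTs."""
    return -sum(p * math.log(p)
                for table in cpts   # one table per variable
                for row in table    # one row per parent configuration
                for p in row if p > 0)

def pick_max_entropy(solutions, c=0.99):
    """Among EM solutions whose likelihood is at least a fraction c of the
    best one (log-likelihood >= best + log(c)), return the most entropic."""
    best = max(s["score"] for s in solutions)
    admissible = [s for s in solutions if s["score"] >= best + math.log(c)]
    return max(admissible, key=lambda s: cpt_entropy(s["cpts"]))

# Two hypothetical EM solutions with near-identical log-likelihoods
# but very different (sharp vs. flat) parameter estimates:
sols = [
    {"score": -100.000, "cpts": [[[0.9, 0.1]], [[0.8, 0.2], [0.3, 0.7]]]},
    {"score": -100.005, "cpts": [[[0.6, 0.4]], [[0.5, 0.5], [0.4, 0.6]]]},
]
chosen = pick_max_entropy(sols)  # selects the flatter, more entropic solution
```

Both solutions clear the score threshold, so the tie is broken in favor of the more conservative (higher-entropy) CPTs rather than the marginally higher-scoring sharp ones.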
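The CPT-level averaging can be sketched as follows, again with our own assumed data layout. Weights are taken proportional to each run's exponentiated log-score, shifted by the maximum for numerical stability; the averaged tables define a single BN with the original structure.

```python
import math

def bma_cpts(solutions):
    """Average each CPT row across EM runs, weighting run k by the
    normalized weight exp(score_k - max_score) (illustrative sketch)."""
    m = max(s["score"] for s in solutions)
    w = [math.exp(s["score"] - m) for s in solutions]
    z = sum(w)
    w = [x / z for x in w]
    n_runs = len(solutions)
    avg = []
    for j in range(len(solutions[0]["cpts"])):           # variables
        table = []
        for r in range(len(solutions[0]["cpts"][j])):    # parent configs
            n_states = len(solutions[0]["cpts"][j][r])
            row = [sum(w[k] * solutions[k]["cpts"][j][r][i]
                       for k in range(n_runs))
                   for i in range(n_states)]
            table.append(row)
        avg.append(table)
    return avg

# Two hypothetical runs with equal scores -> equal weights of 0.5 each,
# so the averaged row is the plain mean, approximately [0.7, 0.3].
runs = [
    {"score": -50.0, "cpts": [[[0.9, 0.1]]]},
    {"score": -50.0, "cpts": [[[0.5, 0.5]]]},
]
avg = bma_cpts(runs)
```

Because each averaged row is a convex combination of valid conditional distributions, every row still sums to one, so the result plugs directly into any standard inference engine.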
The experimental evaluation uses three network structures (Asia, Alarm, and randomly generated nets) with sample sizes n = 100 and 200, and missing‑data rates of 30 % and 60 %. For each setting, 300 independent trials are performed: a reference BN is sampled, complete data are generated, MCAR missingness is introduced, and 30 EM runs are executed from different random starts. Three estimators are compared: (i) MAP (i.e., the highest‑score EM solution), (ii) the maximum‑entropy estimator, and (iii) the CPT‑level BMA estimator.
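The MCAR step of this pipeline is simple to reproduce: each entry is hidden independently with probability equal to the missingness rate, regardless of its value. A minimal sketch (function name and data layout are our own):

```python
import random

def apply_mcar(data, rate, seed=0):
    """Replace each entry with None independently with probability `rate`.
    MCAR: the missingness mechanism ignores the data values entirely."""
    rng = random.Random(seed)
    return [[None if rng.random() < rate else v for v in row]
            for row in data]

# Tiny complete dataset (rows = records, columns = binary variables):
data = [[0, 1, 1], [1, 0, 1], [0, 0, 0], [1, 1, 0]]
masked = apply_mcar(data, rate=0.3)
```

At rate 0.3 roughly a third of the entries become missing; at rate 0.6 (the harder setting above) most of each record is hidden, which is what makes the likelihood surface so multimodal.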
Two performance metrics are reported: (a) KL‑divergence between the full joint distribution of the learned BN and the reference BN, and (b) KL‑divergence between the marginal joint distribution of all leaf nodes (the “leaf metric”) and the reference. Because KL values are not normally distributed, the authors employ a non‑parametric Friedman test (α = 0.01) followed by Tukey’s Honestly Significant Difference post‑hoc analysis to rank the methods.
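Both metrics reduce to the same computation: KL divergence between two discrete distributions over aligned joint states (the full joint in one case, the leaf-node marginal in the other). A minimal sketch with hypothetical numbers:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as aligned lists
    of probabilities over the same enumeration of joint states."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical joints over the 4 states of a tiny two-variable network:
reference = [0.40, 0.30, 0.20, 0.10]   # true (reference) BN
learned   = [0.35, 0.30, 0.25, 0.10]   # estimated BN
kl = kl_divergence(reference, learned)
```

Lower is better, and KL is zero only when the two distributions coincide; its skewed, non-negative values are why the authors fall back on the rank-based Friedman test rather than a parametric comparison.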
Results consistently show that both the entropy‑based and BMA‑based estimators achieve lower KL values than the MAP baseline. BMA typically attains the best rank across all settings, while the entropy method matches BMA’s performance when the more elaborate nonlinear‑solver implementation is used. The simpler variant that picks the most entropic solution among the EM runs also outperforms MAP, but is slightly less accurate than the full optimization.
The study concludes that relying solely on the maximum score to select a parameter set is insufficient for robust BN learning with incomplete data. Incorporating a conservative entropy criterion mitigates over‑fitting, and applying BMA locally to CPTs effectively captures model uncertainty without sacrificing the BN’s factorized representation. Both techniques require only modest modifications to existing EM pipelines, making them readily deployable in current BN software. Future work is suggested on extending the approaches to non‑MAR missingness, continuous variables, and scaling the nonlinear optimization to very large networks.