Finding Optimal Bayesian Networks
In this paper, we derive optimality results for greedy Bayesian-network search algorithms that perform single-edge modifications at each step and use asymptotically consistent scoring criteria. Our results extend those of Meek (1997) and Chickering (2002), who demonstrate that in the limit of large datasets, if the generative distribution is perfect with respect to a DAG defined over the observable variables, such search algorithms will identify this optimal (i.e., generative) DAG model. We relax their assumption about the generative distribution, and assume only that this distribution satisfies the composition property over the observable variables, which is a more realistic assumption for real domains. Under this assumption, we guarantee that the search algorithms identify an inclusion-optimal model; that is, a model that (1) contains the generative distribution and (2) has no sub-model that contains this distribution. In addition, we show that the composition property is guaranteed to hold whenever the dependence relationships in the generative distribution can be characterized by paths between singleton elements in some generative graphical model (e.g. a DAG, a chain graph, or a Markov network), even when the generative model includes unobserved variables, and even when the observed data is subject to selection bias.
💡 Research Summary
The paper investigates the asymptotic optimality of greedy search algorithms for learning Bayesian‑network structures when the search is restricted to single‑edge modifications (addition, deletion, or reversal) and the scoring criterion is asymptotically consistent (e.g., BIC, BDeu). Earlier work by Meek (1997) and Chickering (2002) showed that, in the limit of infinite data, if the true data‑generating distribution is perfect with respect to some DAG over the observable variables, then such greedy procedures will recover that exact generative DAG. The present study relaxes the stringent “perfectness” assumption and replaces it with the composition property—a weaker conditional‑independence condition that holds in many realistic domains.
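To make the search procedure concrete, the following is a minimal sketch (not code from the paper) of greedy hill-climbing over DAGs with single‑edge additions, deletions, and reversals, scored by a linear‑Gaussian BIC, one of the asymptotically consistent criteria the paper covers. All function names and the representation of a DAG as per‑node parent sets are our own choices for illustration:

```python
import numpy as np

def family_bic(data, child, parents):
    """Gaussian BIC contribution of one node given its parent set."""
    n = data.shape[0]
    y = data[:, child]
    X = np.column_stack([data[:, sorted(parents)], np.ones(n)]) if parents else np.ones((n, 1))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    k = X.shape[1] + 1  # regression coefficients plus noise variance
    return -0.5 * n * np.log(rss / n) - 0.5 * k * np.log(n)

def is_acyclic(parents, d):
    """Kahn's algorithm over the parent-set representation of the graph."""
    indeg = {v: len(parents[v]) for v in range(d)}
    children = {v: [c for c in range(d) if v in parents[c]] for v in range(d)}
    queue = [v for v in range(d) if indeg[v] == 0]
    seen = 0
    while queue:
        v = queue.pop()
        seen += 1
        for c in children[v]:
            indeg[c] -= 1
            if indeg[c] == 0:
                queue.append(c)
    return seen == d

def greedy_search(data):
    """Hill-climb over DAGs using single-edge add/delete/reverse moves."""
    d = data.shape[1]
    parents = {v: set() for v in range(d)}
    score = sum(family_bic(data, v, parents[v]) for v in range(d))
    while True:
        best = None
        for u in range(d):
            for v in range(d):
                if u == v:
                    continue
                moves = []
                if u not in parents[v] and v not in parents[u]:
                    moves.append("add")
                if u in parents[v]:
                    moves += ["delete", "reverse"]
                for op in moves:
                    cand = {w: set(s) for w, s in parents.items()}
                    if op == "add":
                        cand[v].add(u)
                    elif op == "delete":
                        cand[v].discard(u)
                    else:  # reverse the edge u -> v
                        cand[v].discard(u)
                        cand[u].add(v)
                    if not is_acyclic(cand, d):
                        continue
                    s = sum(family_bic(data, w, cand[w]) for w in range(d))
                    if best is None or s > best[0]:
                        best = (s, cand)
        if best is None or best[0] <= score + 1e-9:
            return parents
        score, parents = best

# Demo on data sampled from the chain X -> Y -> Z (variables 0, 1, 2).
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)
z = 2 * y + rng.normal(size=n)
g = greedy_search(np.column_stack([x, y, z]))
skeleton = {frozenset((p, c)) for c in g for p in g[c]}
print(skeleton)  # adjacencies of the learned DAG
```

With enough data the learned adjacencies are those of the generating chain (any orientation within its equivalence class scores identically), illustrating the consistency claim on this easy case; the paper's contribution is what such a search achieves when no perfect DAG over the observables exists.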
The composition property states that if a variable X is conditionally independent of Y given a set W, and also conditionally independent of Z given the same set W, then X is conditionally independent of the union Y∪Z given W. This property is satisfied by any distribution that can be represented by a graphical model (DAG, chain graph, or Markov network) in which dependencies are mediated by paths between singleton variables. Crucially, the property remains valid even when the underlying generative model contains hidden (unobserved) variables or when the observed data are subject to selection bias, provided that the observable dependencies can still be expressed as paths in some latent graphical structure.
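One family where composition always holds is the multivariate Gaussian. The sketch below (our illustration, not from the paper) uses exact population covariances for a model where X, Y, and Z are each W plus independent noise, and computes the conditional covariance given W via the Schur complement:

```python
import numpy as np

# Population covariance of (W, X, Y, Z) with X = W + ex, Y = W + ey, Z = W + ez,
# where W, ex, ey, ez are independent with unit variance.
cov = np.array([
    [1.0, 1.0, 1.0, 1.0],  # W
    [1.0, 2.0, 1.0, 1.0],  # X
    [1.0, 1.0, 2.0, 1.0],  # Y
    [1.0, 1.0, 1.0, 2.0],  # Z
])

# Conditional covariance of (X, Y, Z) given W (Schur complement).
A = cov[1:, 1:]
B = cov[1:, :1]
cond = A - B @ np.linalg.inv(cov[:1, :1]) @ B.T

# For Gaussians, zero conditional covariance is conditional independence:
# cond[0, 1] == 0 gives X _||_ Y | W, and cond[0, 2] == 0 gives X _||_ Z | W.
# Composition then asserts X _||_ (Y, Z) | W jointly, i.e. the entire
# cross-block of X against (Y, Z) vanishes -- which it does here, since the
# conditional covariance is the 3x3 identity matrix.
print(cond)
```

The same joint-from-pairwise pattern is what the search proof exploits: pairwise independencies detected by single-edge moves can be combined into independencies from whole sets.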
Under the composition property, the authors prove that greedy single‑edge search converges to an inclusion‑optimal model. An inclusion‑optimal model (1) contains the true distribution (i.e., some parameterization of the learned DAG represents the true distribution exactly) and (2) has no proper sub‑model that also contains it. In other words, the algorithm returns a most parsimonious network that still explains all statistical regularities of the data. This guarantee is weaker than exact recovery of the true DAG, but it is far more widely applicable because it does not require the data‑generating process to be perfectly faithful to a DAG over the observed variables.
The proof proceeds by showing that, with an asymptotically consistent score, the composition property guarantees that whenever the current graph fails to capture a genuine dependence, some single‑edge addition strictly improves the score; conversely, consistency ensures that deleting a superfluous edge (one not needed to represent the distribution) also improves the score. Consequently, the search cannot become trapped in a local optimum that omits a necessary dependence, nor can it retain superfluous edges once the data are sufficiently large.
The paper also establishes that the composition property is guaranteed whenever the observable independencies can be characterized by paths between individual variables in some underlying graphical model, even if that model includes latent variables or selection mechanisms. This result broadens the applicability of the optimality guarantee to a wide class of real‑world problems such as medical diagnosis (where unmeasured confounders are common), social science surveys (subject to non‑random participation), and biological networks (with hidden molecular species).
In summary, the contributions are threefold: (1) a theoretical extension of greedy Bayesian‑network search optimality from the perfect‑DAG assumption to the more realistic composition property; (2) a rigorous definition and justification of inclusion‑optimality as a meaningful target when exact recovery is impossible; and (3) a demonstration that the composition property holds in many practical settings, including those with latent variables and selection bias. These insights provide a solid foundation for applying greedy structure‑learning algorithms to complex, imperfect data while retaining strong asymptotic guarantees about the quality of the learned model.