Learning Arithmetic Circuits
Graphical models are usually learned without regard to the cost of doing inference with them. As a result, even if a good model is learned, it may perform poorly at prediction, because it requires approximate inference. We propose an alternative: learning models with a score function that directly penalizes the cost of inference. Specifically, we learn arithmetic circuits with a penalty on the number of edges in the circuit (in which the cost of inference is linear). Our algorithm is equivalent to learning a Bayesian network with context-specific independence by greedily splitting conditional distributions, at each step scoring the candidates by compiling the resulting network into an arithmetic circuit, and using its size as the penalty. We show how this can be done efficiently, without compiling a circuit from scratch for each candidate. Experiments on several real-world domains show that our algorithm is able to learn tractable models with very large treewidth, and yields more accurate predictions than a standard context-specific Bayesian network learner, in far less time.
💡 Research Summary
The paper addresses a fundamental mismatch in most probabilistic graphical model learning pipelines: models are typically optimized for statistical fit without regard to the computational cost of performing inference on them. As a result, even a highly accurate model may be impractical at prediction time because exact inference becomes intractable and approximate inference may degrade performance. The authors propose a novel learning framework that directly incorporates inference cost into the objective function. Their approach focuses on arithmetic circuits (ACs), a compiled representation of Bayesian networks (BNs) in which the cost of exact inference is linear in the number of circuit edges. By penalizing the edge count of the compiled circuit, the learning algorithm simultaneously seeks a model that fits the data well and remains tractable for inference.
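To make the "cost of exact inference is linear in the number of circuit edges" point concrete, here is a minimal sketch of an arithmetic circuit (a hypothetical `Node` class and helper functions, not the authors' implementation). Leaves hold indicator or parameter values; internal nodes are sums or products. With caching of shared sub-circuits, one bottom-up pass touches each edge once, so evaluation cost is linear in the edge count that the learner penalizes.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                               # 'leaf', 'sum', or 'product'
    value: float = 0.0                    # used only by leaves
    children: list = field(default_factory=list)

def evaluate(node, cache=None):
    """Evaluate the circuit rooted at `node`, caching shared sub-circuits
    so each node (and hence each edge) is visited once."""
    if cache is None:
        cache = {}
    if id(node) in cache:
        return cache[id(node)]
    if node.op == 'leaf':
        val = node.value
    else:
        vals = [evaluate(c, cache) for c in node.children]
        val = sum(vals) if node.op == 'sum' else math.prod(vals)
    cache[id(node)] = val
    return val

def num_edges(node, seen=None):
    """Count directed edges; exact inference cost is linear in this."""
    if seen is None:
        seen = set()
    if id(node) in seen:
        return 0
    seen.add(id(node))
    return len(node.children) + sum(num_edges(c, seen) for c in node.children)

# Circuit for one binary variable X with P(X=1) = 0.3:
# root = sum( product(indicator_x1, 0.3), product(indicator_x0, 0.7) )
ix1, ix0 = Node('leaf', 1.0), Node('leaf', 0.0)   # evidence: X = 1
root = Node('sum', children=[
    Node('product', children=[ix1, Node('leaf', 0.3)]),
    Node('product', children=[ix0, Node('leaf', 0.7)]),
])
assert abs(evaluate(root) - 0.3) < 1e-12          # P(X = 1)
ix0.value = 1.0                                   # no evidence: sum X out
assert abs(evaluate(root) - 1.0) < 1e-12          # partition function = 1
assert num_edges(root) == 6
```

Setting all indicator leaves to 1 sums the variable out, while clamping them to evidence yields the evidence probability, which is the standard network-polynomial use of an AC.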
The algorithm operates as a greedy, context‑specific independence (CSI) learner. Starting from a simple initial BN with full conditional probability tables (CPTs), it iteratively proposes splits of a CPT on the values of a candidate parent variable, partitioning the conditional distribution into refined sub‑distributions and thereby encoding CSI. For each candidate split, the resulting BN is compiled into an AC, and the size of the circuit (its number of directed edges) serves as a regularization term. The scoring function can be written as
Score = Log‑likelihood – λ·|E(AC)|,
where λ controls the trade‑off between data fit and inference cost, and |E(AC)| denotes the edge count of the compiled circuit. The key technical contribution is an efficient incremental compilation scheme that avoids rebuilding the circuit from scratch for every candidate. Instead, the algorithm reuses the existing AC structure and only recomputes the sub‑circuit affected by the new split. This incremental update runs in time proportional to the number of newly added edges, dramatically reducing the overhead of evaluating many split candidates.
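The per‑context CPT splits described above can be pictured as tree‑structured conditional distributions: each internal node tests one parent variable, a split refines one leaf into per‑value children, and distinct contexts can share a leaf. A hedged sketch using a hypothetical dict‑based representation (not the authors' data structure):

```python
def cpd_lookup(node, assignment):
    """Walk the tree to the leaf distribution for this parent context."""
    while 'var' in node:                      # internal test node
        node = node['children'][assignment[node['var']]]
    return node['dist']                       # leaf: P(X | context)

# Illustrative CPD for P(Alarm | Burglary, Earthquake) in which
# Earthquake only matters when Burglary = 0 -- a context-specific
# independence (numbers are made up for the example):
tree = {'var': 'Burglary',
        'children': {
            1: {'dist': {1: 0.95, 0: 0.05}},              # one shared leaf
            0: {'var': 'Earthquake',
                'children': {1: {'dist': {1: 0.29, 0: 0.71}},
                             0: {'dist': {1: 0.001, 0: 0.999}}}}}}

assert cpd_lookup(tree, {'Burglary': 1, 'Earthquake': 0})[1] == 0.95
assert cpd_lookup(tree, {'Burglary': 0, 'Earthquake': 1})[1] == 0.29
```

A full table for two binary parents would need four leaves; the tree needs only three, and each further split adds exactly the structure the incremental compilation step has to account for.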
Theoretical analysis shows that evaluating all split candidates at a given iteration costs O(|V|·|E|) time, where |V| is the number of variables and |E| is the current circuit size, while each incremental compilation step costs O(Δ|E|). Consequently, the overall learning process scales to networks with high treewidth that would be infeasible for exact inference if learned without cost awareness.
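Putting the score function and the greedy search together, the candidate‑evaluation loop can be sketched as a self‑contained toy (all names and numbers are illustrative, not the authors' implementation). Each candidate split is summarized by the log‑likelihood it would add and the edges the incremental compilation would add; the gain trades these off via λ. In the real algorithm these deltas would be re‑estimated after every accepted split, which this toy omits.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Split:
    delta_loglik: float   # improvement in training log-likelihood
    delta_edges: int      # edges the incremental compilation would add

def greedy_learn(candidates, lam):
    """Repeatedly apply the best-scoring split until none has positive
    gain, where gain = delta_LL - lam * delta_edges; returns the
    accepted splits in order."""
    remaining = list(candidates)
    chosen = []
    while remaining:
        best = max(remaining,
                   key=lambda s: s.delta_loglik - lam * s.delta_edges)
        if best.delta_loglik - lam * best.delta_edges <= 0:
            break                     # no remaining split improves the score
        chosen.append(best)
        remaining.remove(best)
    return chosen

# With a small lambda even edge-heavy splits are accepted; with a
# larger lambda only splits that are cheap for inference survive.
cands = [Split(5.0, 10), Split(2.0, 100), Split(0.5, 1)]
assert len(greedy_learn(cands, lam=0.01)) == 3
assert [s.delta_edges for s in greedy_learn(cands, lam=0.1)] == [10, 1]
```

The two assertions illustrate the λ trade‑off discussed later: raising λ from 0.01 to 0.1 prices the 100‑edge split out of the model while keeping the two cheap splits.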
Empirical evaluation is conducted on several real‑world domains, including medical diagnosis records, text categorization corpora, and sensor network logs. In each case the proposed learner produces models with circuit sizes on the order of a few thousand edges, enabling exact inference in milliseconds. Compared with a state‑of‑the‑art CSI Bayesian network learner that optimizes only likelihood, the AC‑based learner achieves higher predictive accuracy (typically 3–5 % absolute improvement) while requiring substantially less training time (2–4× faster). Moreover, the learned models retain tractable inference even when the underlying graphical structure has treewidth values of 30–50, a regime where conventional BN inference would be prohibitive.
A sensitivity analysis on the λ parameter demonstrates that the framework can be tuned to prioritize either compact, fast‑inference models (large λ) or richer, higher‑accuracy models (small λ), offering flexibility for different application constraints. The authors also discuss extensions such as incorporating memory‑usage penalties, applying the method to dynamic Bayesian networks, and exploring alternative compiled representations (e.g., sum‑product networks).
In summary, the paper introduces a principled method for learning probabilistic models that are guaranteed to be tractable at prediction time. By integrating inference cost directly into the learning objective and leveraging efficient incremental compilation of arithmetic circuits, the approach bridges the gap between statistical optimality and computational feasibility. This contribution is particularly relevant for real‑time decision‑making systems, embedded AI devices, and large‑scale data analytics where inference speed is as critical as predictive performance.