Learning Topic Models by Belief Propagation
Latent Dirichlet allocation (LDA) is an important hierarchical Bayesian model for probabilistic topic modeling, which has attracted worldwide interest and touches on many important applications in text mining, computer vision, and computational biology. This paper represents LDA as a factor graph within the Markov random field (MRF) framework, which enables the classic loopy belief propagation (BP) algorithm for approximate inference and parameter estimation. Although the two commonly used approximate inference methods, variational Bayes (VB) and collapsed Gibbs sampling (GS), have achieved great success in learning LDA, the proposed BP is competitive in both speed and accuracy, as validated by encouraging experimental results on four large-scale document data sets. Furthermore, the BP algorithm has the potential to become a generic learning scheme for variants of LDA-based topic models. To this end, we show how to learn two typical variants of LDA-based topic models, the author-topic model (ATM) and the relational topic model (RTM), using BP based on the factor graph representation.
💡 Research Summary
The paper “Learning Topic Models by Belief Propagation” re‑examines Latent Dirichlet Allocation (LDA) from a Markov Random Field (MRF) perspective and shows how the classic loopy Belief Propagation (BP) algorithm can be used for both inference and parameter estimation.
Traditional LDA is a three‑layer hierarchical Bayesian model (HBM) in which document‑specific topic proportions (θ) and topic‑specific word distributions (φ) are drawn from Dirichlet priors and generate observed word tokens. Exact inference is intractable because the graphical model contains loops. Existing approximate methods—Variational Bayes (VB) and collapsed Gibbs Sampling (GS)—have become standard, but each has drawbacks: VB requires costly variational updates, while GS can be slow to converge.
The authors first exploit Dirichlet-multinomial conjugacy to collapse the model, integrating out θ and φ. The remaining hidden variables are the topic assignments z for each word token. By treating the Dirichlet hyper-parameters (α, β) as pseudo-counts, the joint probability of the collapsed model can be factorized into a product of two types of factor functions: one associated with each document (a θ-factor) and one with each word type (a φ-factor). This factorization yields a two-layer factor graph (Fig. 2) that is mathematically equivalent to the original three-layer HBM.
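For reference, the collapsed joint distribution of LDA with symmetric Dirichlet priors is a standard consequence of Dirichlet-multinomial conjugacy; written in generic notation (not necessarily the paper's), it factorizes into document-side and word-side terms:

```latex
P(\mathbf{w}, \mathbf{z} \mid \alpha, \beta)
  = \underbrace{\prod_{d=1}^{D} \frac{\Gamma(K\alpha)}{\Gamma(N_d + K\alpha)}
      \prod_{k=1}^{K} \frac{\Gamma(n_{d,k} + \alpha)}{\Gamma(\alpha)}}_{\text{document ($\theta$) factors}}
    \times
    \underbrace{\prod_{k=1}^{K} \frac{\Gamma(W\beta)}{\Gamma(n_{k} + W\beta)}
      \prod_{w=1}^{W} \frac{\Gamma(n_{k,w} + \beta)}{\Gamma(\beta)}}_{\text{word ($\phi$) factors}}
```

where n_{d,k} counts tokens in document d assigned to topic k, n_{k,w} counts tokens of word type w assigned to topic k, and n_k = Σ_w n_{k,w}.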
Because the factor graph is now explicit, the loopy BP algorithm can be applied, and the paper derives the message-passing equations in detail. The message μ_{w,d}(k) for the topic label z_{w,d}^k of word w in document d is proportional to the product of two incoming messages: one from the document factor θ_d and one from the word factor φ_w. The authors replace exact neighbor configurations with expected counts (μ_{-w,d}, which excludes the current word from document d's counts, and μ_{w,-d}, which excludes document d from word w's counts) and incorporate the Dirichlet hyper-parameters as additive pseudo-messages, leading to the compact update rule (Eq. 7). To avoid numerical underflow, they further approximate the product of many small messages by a sum-sum operation, which is equivalent to a relaxation-labeling scheme used in MRF learning.
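As a concrete illustration, the synchronous form of this update (Eq. 7) can be vectorized over all nonzero document-word pairs. The sketch below is a minimal NumPy rendering under assumed names and a dense array layout, not the authors' reference implementation:

```python
import numpy as np

def bp_sync_update(mu, docs, words, counts, alpha, beta, W):
    """One synchronous BP sweep (a sketch of Eq. 7).

    mu:     (N, K) messages, one row per nonzero doc-word pair, rows sum to 1
    docs:   (N,) document index of each pair
    words:  (N,) word-type index of each pair
    counts: (N,) term frequency of each pair
    W:      vocabulary size
    """
    K = mu.shape[1]
    weighted = mu * counts[:, None]
    # expected topic counts per document and per word type
    nd = np.zeros((docs.max() + 1, K))
    np.add.at(nd, docs, weighted)
    nw = np.zeros((W, K))
    np.add.at(nw, words, weighted)
    nk = nw.sum(axis=0)  # total expected count per topic
    # exclude each pair's own contribution (the "-w,d" and "w,-d" counts)
    theta_part = nd[docs] - weighted + alpha
    phi_part = (nw[words] - weighted + beta) / (nk - weighted + W * beta)
    new_mu = theta_part * phi_part
    return new_mu / new_mu.sum(axis=1, keepdims=True)
```

Because only nonzero doc-word pairs carry messages, one sweep costs time linear in the number of observed entries of the sparse matrix.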
Two scheduling strategies are discussed: synchronous updates, where all variables use messages from the previous iteration, and asynchronous updates, where a variable’s new message is immediately used by its neighbors. The asynchronous scheme typically accelerates convergence. Convergence can be declared after a fixed number of iterations or when changes in the estimated multinomial parameters (θ, φ) fall below a small threshold.
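The asynchronous schedule can be sketched by keeping running expected counts and folding each updated message back in immediately, so later tokens in the same sweep see it. The helper below uses hypothetical names, and the counts nd, nw, nk are assumed consistent with mu on entry:

```python
import numpy as np

def bp_async_sweep(mu, docs, words, counts, alpha, beta, W, nd, nw, nk):
    """One asynchronous BP sweep: each updated message is folded back into
    the running counts at once, so its neighbors use it immediately."""
    for i in range(len(docs)):
        d, w, c = docs[i], words[i], counts[i]
        old = mu[i] * c
        # remove this pair's own contribution (the "-w,d" / "w,-d" exclusion)
        nd[d] -= old
        nw[w] -= old
        nk -= old
        msg = (nd[d] + alpha) * (nw[w] + beta) / (nk + W * beta)
        mu[i] = msg / msg.sum()
        new = mu[i] * c
        nd[d] += new
        nw[w] += new
        nk += new
    return mu
```

Convergence can then be monitored by comparing successive estimates of θ and φ against a small threshold, as described above.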
Parameter estimation follows an EM-like procedure. In the E-step, the normalized messages μ_{w,d}(k) are interpreted as posterior expectations of the hidden topic counts. Using Dirichlet-multinomial conjugacy, the posterior distributions of θ_d and φ_w become Dirichlet with parameters μ_{·,d}(k) + α and μ_{w,·}(k) + β, respectively. The M-step then takes the mean of these Dirichlet posteriors, yielding closed-form updates for θ and φ (Eqs. 12-13). The hyper-parameters α and β are held fixed in the experiments, but the authors note that standard methods can be employed to learn them.
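In the same illustrative layout as the message updates, the closed-form M-step is just normalized pseudo-counts; the function and parameter names below are assumptions for the sketch, not the authors' code:

```python
import numpy as np

def estimate_theta_phi(mu, docs, words, counts, alpha, beta, D, W):
    """M-step sketch: means of the Dirichlet posteriors (Eqs. 12-13).

    Returns theta (D x K, rows sum to 1) and phi (W x K, columns sum to 1).
    """
    K = mu.shape[1]
    weighted = mu * counts[:, None]
    nd = np.zeros((D, K))
    np.add.at(nd, docs, weighted)   # expected doc-topic counts
    nw = np.zeros((W, K))
    np.add.at(nw, words, weighted)  # expected word-topic counts
    theta = (nd + alpha) / (nd.sum(axis=1, keepdims=True) + K * alpha)
    phi = (nw + beta) / (nw.sum(axis=0, keepdims=True) + W * beta)
    return theta, phi
```

Each topic's word distribution is normalized over the vocabulary (columns of phi), while each document's topic proportions are normalized over topics (rows of theta).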
The framework is then extended to two popular LDA extensions: the Author‑Topic Model (ATM) and the Relational Topic Model (RTM). For ATM, an additional author‑factor links each author to topic assignments, while for RTM a link‑factor connects pairs of documents that share a citation or hyperlink. In both cases the same BP message‑passing machinery applies, demonstrating the generic nature of the approach.
Empirical evaluation is performed on four large‑scale corpora (including NIPS conference papers, a Korean news set, PubMed abstracts, and a Wikipedia dump). The authors compare BP against VB and GS in terms of training time, perplexity, and topic coherence. Results show that BP converges 2–3 times faster than VB and achieves perplexities comparable to or better than GS, while maintaining competitive topic coherence. Because BP only needs to pass messages for non‑zero entries in the sparse document‑word matrix, its computational cost scales linearly with the number of observed word tokens.
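The linear-in-tokens cost carries over to evaluation: per-word perplexity also only touches nonzero entries of the document-word matrix. A hedged sketch, assuming θ is a D×K matrix of document-topic proportions and φ a W×K matrix of topic-word probabilities:

```python
import numpy as np

def perplexity(theta, phi, docs, words, counts):
    """Per-word perplexity exp(-sum_i c_i log p(w_i|d_i) / sum_i c_i);
    the sum runs only over nonzero doc-word pairs, so the cost is
    linear in the number of observed word tokens."""
    p = (theta[docs] * phi[words]).sum(axis=1)  # p(w|d) = sum_k theta[d,k] * phi[w,k]
    return float(np.exp(-(counts * np.log(p)).sum() / counts.sum()))
```

A quick sanity check: with uniform θ = 1/K and uniform φ = 1/W, every token has probability 1/W and the perplexity is exactly the vocabulary size W.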
The paper’s contributions can be summarized as follows:
- A rigorous factor‑graph representation of LDA that preserves the exact joint distribution after collapsing θ and φ.
- Derivation of a loopy BP algorithm tailored to this factor graph, including practical approximations (sum‑sum) to ensure numerical stability.
- Demonstration that BP is both faster and as accurate as the dominant VB and GS methods on real‑world large datasets.
- Extension of the BP framework to ATM and RTM, illustrating its applicability to a broad class of topic models.
The authors conclude by suggesting future work on (i) theoretical convergence guarantees for asynchronous BP on loopy graphs, (ii) automatic hyper‑parameter learning within the BP loop, (iii) parallel and distributed implementations to further exploit the message‑passing structure, and (iv) applying the factor‑graph/BP paradigm to non‑text domains such as image segmentation or biological network analysis. Overall, the paper offers a compelling alternative to variational and sampling‑based inference, positioning belief propagation as a versatile and efficient tool for modern topic modeling.