Factorized Multi-Modal Topic Model


Multi-modal data collections, such as corpora of paired images and text snippets, require analysis methods beyond single-view component and topic models. For continuous observations the current dominant approach is based on extensions of canonical correlation analysis, factorizing the variation into components shared by the different modalities and those private to each of them. For count data, multiple variants of topic models attempting to tie the modalities together have been presented. All of these, however, lack the ability to learn components private to one modality, and consequently will try to force dependencies even between minimally correlating modalities. In this work we combine the two approaches by presenting a novel HDP-based topic model that automatically learns both shared and private topics. The model is shown to be especially useful for querying the contents of one domain given samples of the other.


💡 Research Summary

The paper addresses the challenge of modeling multi‑modal data collections—such as paired images and text snippets—where each modality contains both shared information and modality‑specific (private) information. Existing approaches fall into two camps. For continuous data, extensions of canonical correlation analysis (CCA) factorize variation into shared and private components using structural sparsity, but these methods are unsuitable for count‑based data (e.g., bag‑of‑words). For count data, several multi‑modal topic models based on Latent Dirichlet Allocation (LDA) have been proposed, yet they enforce strong correlations across modalities, which is unrealistic for weakly related pairs like Wikipedia images and their surrounding text.

The authors propose a novel factorized multi‑modal topic model that combines the strengths of correlated topic models (CTM) and hierarchical Dirichlet processes (HDP). The model introduces a high‑dimensional Gaussian latent variable ξ for each document, drawn from N(μ, Σ). The covariance matrix Σ encodes both intra‑modality (diagonal blocks) and inter‑modality (off‑diagonal blocks) correlations, allowing the model to capture how topics co‑vary within and across views. To achieve automatic selection of shared versus private topics, the authors embed a separate HDP for each modality, using the stick‑breaking construction to generate modality‑specific selection weights p^(m). The final topic proportions for modality m are defined as

θ_k^(m) ∝ Gamma(β^(m) p_k^(m), exp(−ξ_k^(m)))

where β^(m) is the second‑level concentration parameter of the HDP. Because ξ is shared across modalities while p^(m) selectively turns topics on or off for each view, some topics become active in all modalities (shared topics) and others only in a subset (private topics). This mechanism mirrors the component‑switching ideas used in CCA‑based continuous models but is adapted to count data.
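The generative mechanism described above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the truncation level, concentration values, and the `stick_breaking` helper are illustrative assumptions, and the shared Gaussian is simply sliced into per-modality blocks as described for Σ.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 10            # truncation level (illustrative)
M = 2             # number of modalities
mu = np.zeros(M * T)      # mean of the document-level Gaussian
Sigma = np.eye(M * T)     # diagonal blocks: intra-modality; off-diagonal: inter-modality
beta = [5.0, 5.0]         # second-level HDP concentrations (illustrative values)

# One shared Gaussian draw per document; per-modality slices are xi^(m)
xi = rng.multivariate_normal(mu, Sigma)

def stick_breaking(alpha, T, rng):
    """Truncated stick-breaking weights p_1..p_T for one modality's HDP."""
    v = rng.beta(1.0, alpha, size=T)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

theta = []
for m in range(M):
    p_m = stick_breaking(1.0, T, rng)      # modality-specific selection weights
    xi_m = xi[m * T:(m + 1) * T]
    # Gamma draw with shape beta^(m) * p_k^(m) and scale exp(-xi_k^(m)), normalized
    g = rng.gamma(shape=beta[m] * p_m, scale=np.exp(-xi_m))
    theta.append(g / g.sum())
```

Topics whose selection weight `p_m[k]` is near zero receive a near-zero Gamma shape and are effectively switched off for that modality, which is the mechanism that separates shared from private topics.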

Inference is performed via a truncated variational approximation, closely following the Discrete Infinite Logistic Normal (DILN) algorithm. The variational distribution over ξ is a diagonal‑covariance Gaussian, updated by gradient ascent while keeping Σ⁻¹ fixed. Global parameters μ and Σ are updated by maximum marginal likelihood, yielding closed‑form updates. The HDP parameters (β, V) and the topic‑word distributions η are updated using standard DILN variational steps, with the only novelty being the presence of M separate sets (one per modality). The truncation level T limits the number of instantiated topics, but the stick‑breaking weights p^(m) automatically prune unused topics, effectively learning the number of shared and private topics from data.
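Once the per-modality weights p^(m) have been learned, classifying each truncated topic as shared, private, or unused is a simple thresholding step. The following sketch is an assumption about how one might read off the result; the function name and threshold are not from the paper.

```python
import numpy as np

def classify_topics(p, threshold=1e-2):
    """Label each truncated topic as shared, private to one modality, or unused,
    based on its selection weights p[m][k] for modality m (threshold is illustrative)."""
    p = np.asarray(p)            # shape (M, T)
    active = p > threshold       # which modalities 'switch on' topic k
    labels = []
    for k in range(p.shape[1]):
        n = active[:, k].sum()
        if n == p.shape[0]:
            labels.append("shared")
        elif n == 0:
            labels.append("unused")
        else:
            m = int(np.argmax(active[:, k]))
            labels.append(f"private-{m}")
    return labels

# Toy example: two modalities, four truncated topics
p = [[0.5, 0.3, 1e-5, 1e-6],
     [0.4, 1e-6, 0.2, 1e-6]]
print(classify_topics(p))  # ['shared', 'private-0', 'private-1', 'unused']
```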

For prediction, the model excels at inferring a missing modality given an observed one. After variationally estimating the topic proportions θ̂^(j) and latent ξ̂^(j) for the observed modality j, the latent vector for the missing modality i is obtained via the conditional expectation of a multivariate Gaussian:

ξ̂^(i) = μ^(i) + Σ_{i,j} Σ_{j,j}^{−1} (ξ̂^(j) − μ^(j)).

This linear transformation (matrix W) projects the observed view’s latent representation into the space of the missing view. The projected ξ̂^(i) is then exponentiated and weighted by p^(i) to recover θ̂^(i), which in turn generates the missing words or visual features. Because private topics are suppressed in the projection, the prediction focuses on shared structure, reducing noise from modality‑specific content.
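The conditional-mean projection above is a standard multivariate-Gaussian identity and can be sketched directly. The function below is illustrative, assuming the block means and covariances have already been estimated; it applies W = Σ_{i,j} Σ_{j,j}^{−1} without forming the inverse explicitly.

```python
import numpy as np

def cross_modal_project(mu_i, mu_j, Sigma_ij, Sigma_jj, xi_j):
    """Conditional Gaussian mean for the missing view i given observed view j:
    xi_i = mu_i + Sigma_ij @ Sigma_jj^{-1} @ (xi_j - mu_j)."""
    # Solve the linear system rather than inverting Sigma_jj
    return mu_i + Sigma_ij @ np.linalg.solve(Sigma_jj, xi_j - mu_j)

# Toy example: two latent dimensions per view, cross-covariance 0.5 * I
xi_hat_i = cross_modal_project(
    mu_i=np.zeros(2), mu_j=np.zeros(2),
    Sigma_ij=0.5 * np.eye(2), Sigma_jj=np.eye(2),
    xi_j=np.array([1.0, -2.0]),
)
# xi_hat_i == [0.5, -1.0]
```

The projected ξ̂^(i) would then be exponentiated and weighted by p^(i) to recover the missing view's topic proportions, as described above.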

The authors evaluate the model on a Wikipedia dataset consisting of images and the full page text, a setting characterized by low inter‑modality correlation. Compared against baseline HDP‑LDA, Corr‑LDA, and other multi‑view LDA variants, the proposed model achieves substantially higher text reconstruction accuracy and better image‑text matching scores. Qualitative inspection of learned topics shows clear separation: shared topics capture generic concepts (e.g., “people”, “landscape”), while private topics capture text‑only details (historical descriptions) or image‑only visual patterns (color, composition). The model also automatically determines the appropriate number of topics for each category without manual tuning.

In summary, the paper introduces a principled, fully Bayesian framework for multi‑modal count data that simultaneously discovers shared and private topics, learns the number of topics automatically via HDP, and provides an effective mechanism for cross‑modal prediction. The approach bridges the gap between CCA‑style factorization for continuous data and topic modeling for discrete data, offering a versatile tool for a wide range of applications such as image captioning, cross‑modal retrieval, and multimodal content analysis. Future work may explore integration with deep visual features, scalable inference for large corpora, and extensions to more than two modalities.

