Kernel Topic Models
Latent Dirichlet Allocation models discrete data as a mixture of discrete distributions, using Dirichlet beliefs over the mixture weights. We study a variation of this concept, in which the documents’ mixture weight beliefs are replaced with squashed Gaussian distributions. This allows documents to be associated with elements of a Hilbert space, admitting kernel topic models (KTM), modelling temporal, spatial, hierarchical, social and other structure between documents. The main challenge is efficient approximate inference on the latent Gaussian. We present an approximate algorithm cast around a Laplace approximation in a transformed basis. The KTM can also be interpreted as a type of Gaussian process latent variable model, or as a topic model conditional on document features, uncovering links between earlier work in these areas.
💡 Research Summary
The paper introduces a novel extension of Latent Dirichlet Allocation (LDA) called the Kernel Topic Model (KTM). Traditional LDA treats each document as a bag‑of‑words generated from a mixture of K topics, where the per‑document topic proportions π are drawn from a Dirichlet distribution. KTM replaces this Dirichlet prior with a “squashed” Gaussian: a K‑dimensional latent vector y is drawn from a Gaussian process (GP) defined over a Hilbert space of document features φ, and the soft‑max σ(y) yields the topic proportions π. This construction allows arbitrary metadata (time stamps, geographic coordinates, social‑network positions, etc.) to be incorporated via kernel functions η_k that define the GP mean and covariance for each topic k.
The generative process is as follows: for each topic k, a word distribution θ_k is sampled from a Dirichlet(β_k). Independently, K functions h_k(·) are drawn from GP priors with mean μ_k(·) and kernel η_k(·,·). For each document d with feature vector φ_d, a latent vector y_d is obtained by evaluating h_k(φ_d) and adding isotropic Gaussian noise τ. The soft‑max σ(y_d) produces the topic mixture π_d. Words are then generated by first sampling a topic assignment c_{di} from π_d and then sampling a word from the corresponding θ_{c_{di}}.
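The generative process above can be sketched in a few lines of numpy. This is a minimal illustrative simulation, not the authors' code: the RBF kernel, the feature dimensionality, and the interpretation of τ as a noise standard deviation are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, D, F = 4, 50, 3, 2      # topics, vocabulary size, documents, feature dim
beta = np.ones(V)             # symmetric Dirichlet prior over words (assumed)
tau = 0.1                     # isotropic noise scale (std, by assumption)

# Per-topic word distributions theta_k ~ Dirichlet(beta)
theta = rng.dirichlet(beta, size=K)                      # shape (K, V)

# Document features phi_d (stand-ins for time stamps, coordinates, ...)
phi = rng.normal(size=(D, F))

# Assumed RBF kernel eta_k, shared across topics for simplicity
def rbf(X, Y, ell=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

Kmat = rbf(phi, phi) + 1e-8 * np.eye(D)                  # jitter for stability
h = rng.multivariate_normal(np.zeros(D), Kmat, size=K).T # h_k(phi_d), (D, K)

# Latent y_d = h(phi_d) + noise; soft-max gives topic proportions pi_d
y = h + tau * rng.normal(size=(D, K))
pi = np.exp(y) / np.exp(y).sum(axis=1, keepdims=True)

# Words: sample topic assignment c_di from pi_d, then word from theta_{c_di}
docs = []
for d in range(D):
    c = rng.choice(K, size=20, p=pi[d])                  # topic assignments
    docs.append([rng.choice(V, p=theta[k]) for k in c])  # word indices
```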
The main technical challenge is that inference on the latent Gaussian y is not analytically tractable because π lives on the simplex while y lives in Euclidean space. The authors solve this by constructing a Laplace approximation of the Dirichlet distribution in the soft‑max (log‑odds) space, extending a technique originally proposed by MacKay (1998). By changing variables from π to y = σ⁻¹(π), the Dirichlet density becomes a log‑concave function whose Hessian can be expressed as L = A + XBXᵀ, where A = diag(α) (α are Dirichlet parameters) and B is a 2×2 diagonal matrix. Using the matrix inversion lemma, they obtain an explicit inverse for L, leading to a Gaussian approximation N(y; μ, Σ) with mean μ_k = log α_k − (1/K)∑_ℓ log α_ℓ and diagonal covariance Σ_{kk} ≈ (1/α_k)(1 − 2/K) for large K. This approximation is accurate even when α < 1, which is important for sparse topic priors.
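The mean and diagonal-covariance formulas quoted above are simple enough to implement directly. The sketch below codes exactly those two expressions (including the large-K approximation of the variance as stated in the text); the function name is a placeholder, not an API from the paper.

```python
import numpy as np

def laplace_bridge(alpha):
    """Map Dirichlet(alpha) to a Gaussian N(mu, diag(sigma)) in soft-max space.

    mu_k       = log alpha_k - (1/K) * sum_l log alpha_l
    sigma_kk  ~= (1/alpha_k) * (1 - 2/K)      (large-K approximation)
    """
    alpha = np.asarray(alpha, dtype=float)
    K = alpha.size
    log_a = np.log(alpha)
    mu = log_a - log_a.mean()                  # zero-sum log-odds mean
    sigma_diag = (1.0 / alpha) * (1.0 - 2.0 / K)
    return mu, sigma_diag
```

Note that μ sums to zero by construction, reflecting the shift-invariance of the soft-max, and that the stated variance approximation only makes sense for K > 2.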
Inference proceeds via a semi‑collapsed variational scheme. The Dirichlet prior over π_d is retained, while the per‑topic word distributions θ_k are analytically integrated out. The variational posterior over π_d takes the form D(π_d; α_d + ν_d), where ν_d are pseudo‑counts derived from the observed words. The Laplace bridge then maps this Dirichlet posterior to a Gaussian over y_d, providing “messages” (means μ_d and precisions Σ_d⁻¹) that serve as observations for the GP regression. Standard GP inference (e.g., Rasmussen & Williams 2006) is then applied topic‑wise to update the posterior over each function h_k, yielding closed‑form expressions for posterior means and variances at any feature location.
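The topic-wise GP update is standard heteroscedastic GP regression: each document contributes a Gaussian message with its own precision, which enters the GP as per-observation noise. A minimal sketch, assuming a Cholesky-based solve in the style of Rasmussen & Williams; the function and argument names are illustrative, not from the paper.

```python
import numpy as np

def gp_regress(K_nn, K_ns, k_ss_diag, mu_msg, prec_msg):
    """GP regression for one topic k from per-document Gaussian messages.

    K_nn:      (N, N) kernel matrix between document features
    K_ns:      (N, M) kernel between document and test features
    k_ss_diag: (M,)   prior variances at the test features
    mu_msg:    (N,)   message means (from the Laplace bridge)
    prec_msg:  (N,)   message precisions, used as per-point noise
    """
    R = np.diag(1.0 / prec_msg)            # message variances as noise term
    L = np.linalg.cholesky(K_nn + R)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, mu_msg))
    mean = K_ns.T @ alpha                  # posterior mean at test points
    V = np.linalg.solve(L, K_ns)
    var = k_ss_diag - (V**2).sum(axis=0)   # posterior variance at test points
    return mean, var
```

Running this once per topic k gives the closed-form posterior means and variances at any feature location mentioned in the text.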
Hyper‑parameter learning (kernel parameters and noise τ) is performed by maximizing an approximate evidence term log Z that depends only on the Gaussian messages, not on the full word likelihood. This avoids the prohibitive cost of repeatedly running the variational LDA sub‑routine during evidence maximization, and the resulting expression for log Z is numerically stable.