Infinite Author Topic Model based on Mixed Gamma-Negative Binomial Process


Incorporating the side information of a text corpus, e.g., authors, time stamps, and emotional tags, into traditional text mining models has gained significant interest in the areas of information retrieval, statistical natural language processing, and machine learning. One branch of this work is the so-called Author Topic Model (ATM), which incorporates the authors' interests as side information into the classical topic model. However, the existing ATM needs to predefine the number of topics, which is difficult and inappropriate in many real-world settings. In this paper, we propose an Infinite Author Topic (IAT) model to resolve this issue. Instead of assigning a discrete probability to a fixed number of topics, we use a stochastic process to determine the number of topics from the data itself. To be specific, we extend a gamma-negative binomial process to three levels in order to capture the author-document-keyword hierarchical structure. Furthermore, each document is assigned a mixed gamma process that accounts for multiple authors' contributions towards this document. An efficient Gibbs sampling inference algorithm, with each conditional distribution being closed-form, is developed for the IAT model. Experiments on several real-world datasets show the capabilities of our IAT model to learn the hidden topics, authors' interests in these topics, and the number of topics simultaneously.


💡 Research Summary

The paper addresses a fundamental limitation of the traditional Author Topic Model (ATM), namely the need to pre‑specify the number of latent topics. In many real‑world applications the appropriate number of topics is unknown and fixing it a priori can lead to poor model fit. To overcome this, the authors propose the Infinite Author Topic (IAT) model, which leverages non‑parametric Bayesian processes to let the data determine the number of topics automatically while still capturing the hierarchical relationship among authors, documents, and words.

The core of the IAT model is an extension of the Gamma‑Negative Binomial Process (GNBP) to three hierarchical levels. A global base measure H generates a Gamma process Γ₀. Each author a draws its own Gamma process Γₐ from Γ₀, representing the author's distribution over a (potentially infinite) set of topics. Because a document can have multiple authors, the model constructs a mixed Gamma process for each document d by summing the Gamma processes of all its authors (Σ_{a∈A_d} Γₐ). A document‑specific dispersion parameter p_d then scales this mixed process to produce a document‑level Gamma process Γ_d, which serves as the intensity measure for a Poisson (or equivalently a Negative Binomial) distribution over word counts. Consequently, the observed word tokens are generated from a Poisson process whose rate is governed by the mixed author interests, allowing the model to attribute words to individual authors within a multi‑author document.
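The generative process above can be simulated with a simple truncated sketch. The hyperparameter values, truncation level, and variable names below are illustrative assumptions, not taken from the paper; the infinite gamma processes are approximated by finite K-dimensional gamma draws, and the negative binomial counts are generated via the standard gamma–Poisson mixture representation:

```python
import numpy as np

rng = np.random.default_rng(0)

K = 50            # truncation level approximating the "infinite" topic set
n_authors = 4
gamma0 = 1.0      # concentration of the global gamma process (assumed value)

# Global gamma process Γ0: a finite-dimensional approximation of the
# topic weights drawn from the base measure.
global_weights = rng.gamma(gamma0 / K, 1.0, size=K)

# Per-author gamma processes Γa, drawn with Γ0 as the base measure.
author_weights = rng.gamma(global_weights, 1.0, size=(n_authors, K))

def sample_document(author_ids, p_d=0.5):
    """Sample per-topic word counts for a document with the given authors.

    The document's rate is the sum (mixed gamma process) of its authors'
    weights; per-topic word counts are negative binomial, generated here
    through the gamma-Poisson mixture: NB(r, p) == Poisson(lam) with
    lam ~ Gamma(r, scale = p / (1 - p)).
    """
    r_d = author_weights[author_ids].sum(axis=0)   # mixed gamma process
    lam = rng.gamma(r_d, p_d / (1.0 - p_d))
    return rng.poisson(lam)                        # word count per topic

counts = sample_document([0, 2])   # a two-author document
print(counts.shape)                # K-dimensional topic-count vector
```

Summing the authors' gamma processes before scaling by p_d is what lets a word in a multi-author document be explained by any one of the authors' topic interests.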

Inference in such an infinite mixture model is non‑trivial. The authors adopt a truncation approach, fixing a large upper bound K on the number of topics and treating the model as a finite approximation. The main difficulty lies in updating the mixed Gamma parameters r_{d,k}, which are sums of author‑specific Gamma variables r_{a,k}. To resolve this, the paper exploits the additive property of the Negative Binomial distribution: an NB(r_{d,k}, p_d) variable can be decomposed into a sum of |A_d| independent NB(r_{a,k}, p_d) variables. By introducing auxiliary variables n_{ad,k} that count how many words of topic k in document d are contributed by author a, the authors obtain closed‑form conditional distributions for each r_{a,k} (Gamma) and for the auxiliary counts (NB). This yields a full Gibbs sampling algorithm in which every conditional is analytically tractable, leading to efficient posterior inference.
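The key auxiliary-variable step can be sketched concretely. Because a sum of independent NB(r_a, p) variables with a shared p is NB(Σ_a r_a, p), the per-author counts conditioned on their observed total follow a Dirichlet-multinomial with parameters r_a. The function and variable names below are illustrative, and the Dirichlet draw is one standard way to sample that conditional, not necessarily the exact procedure in the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def split_count_among_authors(n_dk, r_ak, rng):
    """Decompose a document-topic word count among the document's authors.

    Conditioned on the total n_dk, the per-author auxiliary counts
    n_{ad,k} follow a Dirichlet-multinomial with parameters r_{a,k},
    sampled here by drawing author shares and then a multinomial.
    """
    theta = rng.dirichlet(r_ak)          # author shares for this topic
    return rng.multinomial(n_dk, theta)  # auxiliary counts n_{ad,k}

# Example: 12 words of topic k in a three-author document.
r_ak = np.array([0.8, 0.3, 1.5])         # the authors' weights on topic k
parts = split_count_among_authors(12, r_ak, rng)
assert parts.sum() == 12                 # decomposition preserves the total
```

Once the counts are split this way, each r_{a,k} sees only its own author's counts, which is what makes its Gamma conditional tractable.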

Empirical evaluation is performed on two real‑world corpora: a collection of academic papers and a social‑media text dataset. The IAT model is compared against the classic ATM (with a fixed K), HDP‑LDA, and other Gamma‑NBP based topic models. Results show that IAT automatically discovers a sensible number of topics, achieves lower perplexity and higher topic coherence than the baselines, and provides interpretable author‑topic profiles. In particular, the ability to disentangle each author’s contribution within multi‑author documents enables downstream tasks such as author recommendation, paper browsing, and author disambiguation.

In summary, the paper makes three key contributions: (1) a novel three‑level non‑parametric Bayesian model that integrates author information without pre‑defining the number of topics; (2) an elegant Gibbs sampling scheme that handles the mixed Gamma process via auxiliary NB variables, ensuring closed‑form updates; and (3) extensive experiments demonstrating that IAT outperforms existing models in both predictive performance and interpretability. Future work may extend the framework to incorporate temporal dynamics or sentiment tags, or develop scalable variational inference methods for even larger corpora.

