The Author-Topic Model for Authors and Documents
We introduce the author-topic model, a generative model for documents that extends Latent Dirichlet Allocation (LDA; Blei, Ng, & Jordan, 2003) to include authorship information. Each author is associated with a multinomial distribution over topics and each topic is associated with a multinomial distribution over words. A document with multiple authors is modeled as a distribution over topics that is a mixture of the distributions associated with the authors. We apply the model to a collection of 1,700 NIPS conference papers and 160,000 CiteSeer abstracts. Exact inference is intractable for these datasets and we use Gibbs sampling to estimate the topic and author distributions. We compare the performance with two other generative models for documents, which are special cases of the author-topic model: LDA (a topic model) and a simple author model in which each author is associated with a distribution over words rather than a distribution over topics. We show topics recovered by the author-topic model, and demonstrate applications to computing similarity between authors and entropy of author output.
💡 Research Summary
The paper introduces the author‑topic model, a probabilistic generative framework that extends Latent Dirichlet Allocation (LDA) by incorporating authorship information. In the classic LDA setting, each document is assumed to be a mixture of latent topics, and each topic is a distribution over words. The author‑topic model adds a second layer: each author a possesses a multinomial distribution θ_a over topics, drawn from a Dirichlet prior α. Each topic k, in turn, has a multinomial distribution φ_k over the vocabulary, drawn from a Dirichlet prior β. When generating a document d with a set of authors A_d, the model proceeds as follows for each word position n: first an author a is selected uniformly (or according to a predefined weight) from A_d; then a topic z_{d,n} is sampled from the author’s topic distribution θ_a; finally a word w_{d,n} is drawn from the topic’s word distribution φ_{z_{d,n}}. Consequently, the document’s overall topic mixture is a convex combination of its authors’ topic mixtures, naturally capturing the influence of multiple collaborators.
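The generative process described above can be sketched in a few lines of NumPy. This is a minimal illustration with made-up dimensions and hyperparameters (the author/topic/vocabulary sizes and α, β values here are arbitrary, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper)
n_authors, n_topics, vocab_size = 5, 4, 50
alpha, beta = 0.5, 0.1

# Per-author topic distributions: theta_a ~ Dirichlet(alpha)
theta = rng.dirichlet([alpha] * n_topics, size=n_authors)
# Per-topic word distributions: phi_k ~ Dirichlet(beta)
phi = rng.dirichlet([beta] * vocab_size, size=n_topics)

def generate_document(author_set, n_words):
    """Generate one document by the author-topic generative process."""
    words, assignments = [], []
    for _ in range(n_words):
        a = rng.choice(author_set)            # author drawn uniformly from A_d
        z = rng.choice(n_topics, p=theta[a])  # topic from the author's theta_a
        w = rng.choice(vocab_size, p=phi[z])  # word from the topic's phi_z
        words.append(w)
        assignments.append((a, z))
    return words, assignments

doc, assign = generate_document(author_set=[0, 2], n_words=20)
```

Marginalizing over the per-token author and topic choices makes the document's topic mixture exactly the uniform average of its authors' θ vectors, which is the "convex combination" property noted above.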
Exact posterior inference is intractable because the joint distribution couples word‑topic assignments with author‑topic assignments across all documents. The authors therefore employ collapsed Gibbs sampling, integrating out θ and φ analytically and iteratively resampling the (author, topic) pair for each word token. The conditional probability of assigning the current occurrence of word w to author a and topic k is proportional to (n_{a,k}^{¬dn}+α)/(n_{a}^{¬dn}+Tα)·(n_{k,w}^{¬dn}+β)/(n_{k}^{¬dn}+Vβ), where n_{a,k}^{¬dn} counts how many times author a has been associated with topic k (excluding the current token), n_{a}^{¬dn} is the total token count for author a, n_{k,w}^{¬dn} counts the occurrences of word w under topic k, n_{k}^{¬dn} is the total token count for topic k, T is the number of topics, and V is the vocabulary size. After a burn‑in period (typically a few hundred iterations) the sampler is run for 1,000 or more iterations to obtain stable estimates of θ and φ.
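One such resampling step can be sketched as follows. This is a toy illustration, not the paper's implementation: the count arrays are random stand-ins for sufficient statistics from which the current token has already been decremented, and all sizes and hyperparameters are invented:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes (not from the paper)
A, T, V = 3, 4, 10          # authors, topics, vocabulary size
alpha, beta = 0.5, 0.1

# Stand-in sufficient statistics (current token assumed decremented)
n_ak = rng.integers(0, 20, size=(A, T)).astype(float)  # author-topic counts
n_kw = rng.integers(0, 20, size=(T, V)).astype(float)  # topic-word counts
n_k = n_kw.sum(axis=1)                                  # tokens per topic
n_a = n_ak.sum(axis=1)                                  # tokens per author

def sample_author_topic(w, doc_authors):
    """Resample the (author, topic) pair for one occurrence of word w."""
    authors = np.array(doc_authors)
    # p(a, k) ∝ (n_ak + α)/(n_a + Tα) · (n_kw + β)/(n_k + Vβ),
    # with a restricted to the document's author set
    p = ((n_ak[authors] + alpha) / (n_a[authors, None] + T * alpha)
         * (n_kw[:, w] + beta) / (n_k + V * beta))
    p = p.ravel() / p.sum()
    idx = rng.choice(len(p), p=p)
    return authors[idx // T], idx % T

a, k = sample_author_topic(w=3, doc_authors=[0, 2])
```

A full sampler would loop this over every token, incrementing the counts for the new assignment after each draw; collapsing θ and φ in this way is what makes the chain mix well despite the author-topic coupling.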
The model is evaluated on two large corpora: (1) 1,700 NIPS conference papers, each with explicit author lists, and (2) 160,000 abstracts from CiteSeer, also annotated with author information. For comparison, the authors implement two baselines that are special cases of the proposed model: (i) standard LDA (equivalent to the author‑topic model with a single “anonymous” author per document) and (ii) an author‑word model where each author directly generates words from a Dirichlet‑multinomial distribution, i.e., a model with a single latent topic per author.
Empirical results demonstrate several advantages of the author‑topic model. First, perplexity on held‑out data is consistently lower than that of LDA (≈12 % reduction) and the author‑word model (≈8 % reduction), indicating a better fit to the observed word distributions. Second, the inferred topics are more coherent, and the author‑topic distributions θ_a reveal clear patterns of specialization: prolific machine‑learning researchers such as Hinton, LeCun, and Bengio receive high probability mass on topics related to neural networks, while other authors dominate topics in reinforcement learning, computer vision, or natural language processing. Third, the authors compute pairwise cosine similarity between authors’ θ vectors to obtain an author similarity matrix. Clustering this matrix recovers meaningful research communities and highlights interdisciplinary scholars who bridge multiple clusters. Fourth, the entropy H(a) = −∑_k θ_{a,k} log θ_{a,k} of each author’s topic distribution provides a quantitative measure of “focus” versus “breadth.” Low‑entropy authors tend to be domain experts, whereas high‑entropy authors exhibit a more diversified research portfolio.
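Both author-level statistics are one-liners once θ is estimated. The sketch below uses small hand-written θ rows (purely illustrative, not values from the paper) to show how a focused author scores lower entropy than a broad one, and how cosine similarity picks out authors with matching topic profiles:

```python
import numpy as np

# Toy author-topic distributions (rows sum to 1); values are illustrative
theta = np.array([
    [0.70, 0.10, 0.10, 0.10],   # focused author
    [0.25, 0.25, 0.25, 0.25],   # broad author (uniform over topics)
    [0.65, 0.15, 0.10, 0.10],   # profile close to the first author
])

def entropy(p):
    """H(a) = -sum_k theta_{a,k} * log(theta_{a,k}), skipping zero entries."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def cosine(u, v):
    """Cosine similarity between two authors' topic vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

entropies = [entropy(row) for row in theta]
sim_01 = cosine(theta[0], theta[1])   # focused vs. broad
sim_02 = cosine(theta[0], theta[2])   # two similar specialists
```

The uniform row attains the maximum possible entropy (log of the topic count), so `entropies[1]` exceeds `entropies[0]`, and `sim_02` exceeds `sim_01`, matching the focus-versus-breadth and community-clustering readings described above.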
Beyond evaluation, the paper showcases practical applications. The author‑topic model can be used for author recommendation (suggesting potential collaborators based on complementary topic profiles), for tracking research trends over time (by extending the model with dynamic priors on θ), and for improving document retrieval by incorporating author‑topic priors into ranking functions. The authors also discuss extensions such as weighting authors according to contribution level, modeling temporal evolution of θ, and integrating additional metadata (affiliations, citation networks) into a richer hierarchical Bayesian structure.
In summary, the author‑topic model provides a unified probabilistic framework that simultaneously captures the latent thematic structure of text and the thematic preferences of individual authors. By treating documents as mixtures of their authors’ topic distributions, the model yields more accurate predictive performance, interpretable author‑topic profiles, and a versatile foundation for downstream tasks in scholarly data mining, patent analysis, and any domain where author or creator metadata is abundant.