Hierarchical relational models for document networks
We develop the relational topic model (RTM), a hierarchical model of both network structure and node attributes. We focus on document networks, where the attributes of each document are its words, that is, discrete observations taken from a fixed vocabulary. For each pair of documents, the RTM models their link as a binary random variable that is conditioned on their contents. The model can be used to summarize a network of documents, predict links between them, and predict words within them. We derive efficient inference and estimation algorithms based on variational methods that take advantage of sparsity and scale with the number of links. We evaluate the predictive performance of the RTM for large networks of scientific abstracts, web documents, and geographically tagged news.
💡 Research Summary
The paper introduces the Relational Topic Model (RTM), a hierarchical probabilistic framework that jointly models the textual content of documents and the binary links between them. Building on Latent Dirichlet Allocation (LDA), each document d is assigned a topic proportion vector θ_d drawn from a Dirichlet prior α. Words are generated by first sampling a topic assignment z_{dn} from a multinomial distribution parameterized by θ_d and then emitting a word w_{dn} from the corresponding topic‑word distribution β. The novel contribution lies in the link generation process: for any pair of documents (d, d′), a binary link variable y_{dd′} is drawn from a Bernoulli distribution whose success probability is a logistic function σ(η·f(θ_d, θ_{d′})). The function f can be a simple inner product, a cosine similarity, or any differentiable combination of the two topic vectors, and η is a learned weight vector that captures how content similarity translates into link formation.
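The link model described above can be sketched in a few lines. The snippet below uses the element-wise product of the two topic vectors as f(θ_d, θ_{d′}) (one of the choices mentioned in the text); the intercept term ν and all numeric values are illustrative, not taken from the paper.

```python
import numpy as np

def link_probability(theta_d, theta_dp, eta, nu=0.0):
    """P(y_{dd'} = 1) under the RTM-style logistic link model.

    f(theta_d, theta_dp) is taken here to be the element-wise
    (Hadamard) product; eta is the learned weight vector and the
    intercept nu is an illustrative extra parameter.
    """
    f = theta_d * theta_dp                          # element-wise product
    return 1.0 / (1.0 + np.exp(-(eta @ f + nu)))    # logistic sigmoid

# Two documents with similar topic proportions vs. a dissimilar pair
theta_a = np.array([0.7, 0.2, 0.1])
theta_b = np.array([0.6, 0.3, 0.1])
theta_c = np.array([0.1, 0.1, 0.8])
eta = np.array([3.0, 3.0, 3.0])

p_similar = link_probability(theta_a, theta_b, eta)
p_dissimilar = link_probability(theta_a, theta_c, eta)
```

With a positive η, thematically similar pairs receive a higher link probability than dissimilar ones, which is exactly the behavior the model is designed to capture.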
Inference is performed using a variational Expectation‑Maximization (EM) algorithm. The per‑document variational distribution q(θ_d, z_d) factorizes into a Dirichlet over θ_d (with parameter γ_d) and an independent multinomial over each topic assignment z_{dn} (with parameters φ_{dn}). In the E‑step, γ_d and φ_{dn} are updated using coordinate ascent, incorporating an additional term that accounts for the expected log‑likelihood of observed links. Because real‑world document networks are extremely sparse, the authors exploit this sparsity: the contribution of observed links scales with the number of edges |E|, while non‑links are handled via negative sampling, avoiding a quadratic blow‑up. In the M‑step, the global parameters β (topic‑word distributions) and η (link weight vector) are updated by maximizing the expected complete‑data log‑likelihood under the current variational posterior. The resulting algorithm has linear complexity in the number of documents plus the number of observed links, making it suitable for large‑scale corpora.
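The sparsity argument above can be made concrete: the link term of the objective sums over observed edges (cost O(|E|)) and over a small random sample of non-links instead of all O(n²) pairs. The sketch below illustrates this accounting only; the function name, the sampling scheme, and the use of topic proportions in place of the full variational expectations are assumptions for illustration, not the paper's exact update equations.

```python
import random
import numpy as np

def sampled_link_term(edges, thetas, eta, n_docs, n_neg=2, seed=0):
    """Sketch of the sparse link term in the variational objective.

    Observed links contribute log sigma(eta . (theta_d * theta_d')),
    summed over |E| edges; non-links are approximated by n_neg random
    candidate pairs per edge rather than enumerating every pair.
    """
    rng = random.Random(seed)
    edge_set = set(edges)

    def sigma(x):
        return 1.0 / (1.0 + np.exp(-x))

    total = 0.0
    for d, dp in edges:                       # O(|E|) observed-link sum
        total += np.log(sigma(eta @ (thetas[d] * thetas[dp])))
        for _ in range(n_neg):                # sampled non-link penalty
            d2, dp2 = rng.randrange(n_docs), rng.randrange(n_docs)
            if d2 != dp2 and (d2, dp2) not in edge_set:
                total += np.log(1.0 - sigma(eta @ (thetas[d2] * thetas[dp2])))
    return total

thetas = np.array([[0.7, 0.2, 0.1],
                   [0.6, 0.3, 0.1],
                   [0.1, 0.1, 0.8]])
eta = np.array([2.0, 2.0, 2.0])
obj = sampled_link_term([(0, 1)], thetas, eta, n_docs=3)
```

The returned value is a (negative) log-likelihood contribution; the key point is that its cost grows with the number of edges and the sample size, not with the square of the number of documents.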
The authors evaluate RTM on three large real‑world datasets: (1) a collection of scientific abstracts with citation links, (2) a web‑page corpus with hyperlink structure, and (3) geographically tagged news articles where proximity induces links. They compare RTM against several baselines: (i) LDA followed by a logistic regression classifier on document pairs, (ii) the Mixed Membership Stochastic Blockmodel (MMSB), which models links but not text, and (iii) pure network models such as graph‑based embeddings. Performance is measured in three ways: link prediction (area under the ROC curve, AUC), word prediction (perplexity), and topic interpretability (topic coherence). RTM consistently outperforms all baselines on link prediction, achieving AUC scores above 0.85 across datasets, and also yields lower perplexity than standard LDA, indicating that incorporating link information improves the quality of the learned topics. Analysis of the learned η parameters shows that, for citation networks, the inner‑product similarity receives a strong positive weight, confirming the intuition that documents with similar thematic content are more likely to cite each other.
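The AUC metric used for link prediction has a direct rank-statistic interpretation: it is the probability that a randomly chosen held-out link receives a higher predicted score than a randomly chosen non-link. A minimal sketch, with scores invented purely for illustration (they would come from the model's predicted link probabilities):

```python
def roc_auc(pos_scores, neg_scores):
    """AUC via the pairwise-comparison (Mann-Whitney) form: the fraction
    of (link, non-link) pairs where the link outscores the non-link,
    counting ties as half a win."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical predicted probabilities for held-out pairs
held_out_links = [0.91, 0.78, 0.66]      # true links
held_out_non_links = [0.40, 0.12, 0.70]  # true non-links

auc_value = roc_auc(held_out_links, held_out_non_links)
```

Here 8 of the 9 (link, non-link) comparisons rank correctly, giving an AUC of 8/9; a perfect ranker scores 1.0 and a random one about 0.5.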
Despite its strengths, RTM has limitations. It only handles binary links, so weighted or multi‑type relationships require extensions. The model’s performance can be sensitive to the choice of the number of topics K, and the variational approximation may struggle with highly clustered network structures where higher‑order dependencies matter. The authors suggest several avenues for future work: extending the link model to accommodate real‑valued edge weights, incorporating temporal dynamics to capture evolving topics and links, and integrating multimodal data (e.g., images, metadata) into the hierarchical framework.
In summary, the Relational Topic Model provides a principled and scalable solution for jointly analyzing document content and network structure. By tying link probabilities directly to latent topic representations, RTM achieves superior predictive performance, yields more coherent topics, and offers a versatile foundation for downstream applications such as recommendation, summarization, and community detection in document‑rich networks.