When people explore and manage information, they think in terms of topics and themes. However, the software that supports information exploration sees text at only the surface level. In this paper we show how topic modeling -- a technique for identifying latent themes across large collections of documents -- can support semantic exploration. We present TopicViz, an interactive environment for information exploration. TopicViz combines traditional search and citation-graph functionality with a range of novel interactive visualizations, centered around a force-directed layout that links documents to the latent themes discovered by the topic model. We describe several use scenarios in which TopicViz supports rapid sensemaking on large document collections.
As information repositories continue to expand and diversify, there is an urgent need for systems that help people explore and make sense of large document collections. While researchers in information-seeking and related areas have developed increasingly effective interaction techniques for navigating document collections [1,16], these methods are hampered by a view of language that is generally restricted to the surface level; such techniques are oblivious to the semantic meaning behind the text. Meanwhile, researchers in machine learning and natural language processing have developed powerful statistical methods for recovering latent semantics [2], but the output of these methods is difficult to present to domain experts and novice users alike.
In this paper, we introduce TopicViz, a new tool for searching and navigating large document collections (Figure 1). TopicViz infers a set of topics that summarize the latent high-level semantic organization of a collection, and provides a novel interactive view that exposes this semantic organization using a force-directed layout. This layout permits a range of interactive affordances, allowing users to gradually refine their understanding of the search results and citation links, while focusing on key semantic distinctions of interest.
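The core idea of such a layout can be illustrated with a minimal sketch (not TopicViz's actual implementation): topic nodes act as fixed anchors, and each document node is pulled toward every topic by a spring force weighted by that document's topic proportions. All positions, weights, and names below are made up for illustration.

```python
# Fixed anchor positions for three hypothetical topics.
topic_pos = {"hci": (0.0, 0.0), "ml": (1.0, 0.0), "nlp": (0.5, 1.0)}

def layout_step(doc_pos, doc_theta, step=0.1):
    """One relaxation step: move each document along its net spring force,
    where the spring to each topic is weighted by the topic proportion."""
    new_pos = {}
    for doc, (x, y) in doc_pos.items():
        fx = fy = 0.0
        for topic, weight in doc_theta[doc].items():
            tx, ty = topic_pos[topic]
            fx += weight * (tx - x)
            fy += weight * (ty - y)
        new_pos[doc] = (x + step * fx, y + step * fy)
    return new_pos

# A document that is 70% HCI and 30% ML drifts toward the "hci" anchor,
# settling at the theta-weighted centroid of its topics.
theta = {"paper1": {"hci": 0.7, "ml": 0.3}}
pos = {"paper1": (0.5, 0.5)}
for _ in range(200):
    pos = layout_step(pos, theta)
```

Under this simple scheme each document converges to the weighted centroid of its topics; a full force-directed layout would also add repulsion between nodes to prevent overlap.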
The analytic engine of our approach is the topic model -- a powerful statistical technique for identifying latent themes in a document collection [2]. Without any annotation, topic models can extract topics -- sets of semantically-related words -- and describe each document as a mixture of these topics. For example, a given research paper might be characterized as 70% human-computer interaction, and 30% machine learning. Topic models have been successfully applied to a broad range of text, and the extracted topics have been shown to cohere with readers’ semantic judgments [4]. But while topic models are often motivated as a technique to support information seeking, there has been little investigation of how users can understand and exploit them.
One of the principal strengths of topic models is their flexibility: topics need not correspond to any predefined taxonomy, but rather represent the latent structure inherent to the document collection. However, this means that the content of each topic must somehow be conveyed to the user. In topic modeling research, this issue is almost invariably addressed by showing ranked lists of words and documents that are closely associated with each topic. But such lists have undesirable properties: it is difficult to show more than a few entries per topic; the meaning of individual terms may be unknown to non-experts; and the numerical scores for each word and topic are hard to interpret.
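The ranked-list presentation criticized above amounts to truncating each topic's word distribution. A minimal sketch, with a made-up topic-word distribution (not from the paper's model):

```python
def top_words(beta, k=3):
    """Return the k highest-probability words in a topic's word distribution."""
    return [w for w, _ in sorted(beta.items(), key=lambda kv: -kv[1])[:k]]

# Hypothetical topic-word distribution beta for one topic.
beta = {"parse": 0.30, "grammar": 0.25, "tree": 0.20, "the": 0.15, "of": 0.10}
top_words(beta, 3)  # → ["parse", "grammar", "tree"]
```

Note how the truncation hides both the long tail of the distribution and the actual probability values, which is exactly the interpretability problem the list format raises.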
While hundreds of papers address the mathematical methodology of topic modeling, relatively few take up the question of how topic models can support information exploration. Our approach is distinguished from prior work in its emphasis on interaction: the user is empowered to manipulate the visualization by adding, rearranging, or removing topics, and by controlling the set of documents to visualize. The motivation for this design stems from our focus on local information exploration: we aim to provide a deep understanding of a local area of the information landscape that is relevant to the user’s goals, rather than a surface-level static view of thousands of documents. We provide affordances for users to quickly focus on the topical distinctions that relate to their goals, allowing them to interactively manipulate topics within this space to better understand document-document, document-topic, and topic-topic relationships.
Topic models of document collections

A topic model is a hierarchical probabilistic model of document content [2]. Each topic is a probability distribution over words, β; every word in every document is assumed to be randomly generated from one topic. In a given document the proportion of words generated from each topic is given by a latent vector θ_d. Thus, the matrix θ provides a succinct summary of the semantics of each document.
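The generative process just described can be sketched in a few lines. This is a toy illustration under assumed parameters (two invented topics, a symmetric Dirichlet prior), not the inference machinery of an actual topic model: draw θ_d for a document, then for each word draw a topic from θ_d and a word from that topic's distribution β.

```python
import random

def sample_dirichlet(alpha):
    """Sample a probability vector from a Dirichlet via normalized Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, items):
    """Draw one item according to the given probabilities."""
    r, cum = random.random(), 0.0
    for p, item in zip(probs, items):
        cum += p
        if r < cum:
            return item
    return items[-1]

# Two invented topic-word distributions beta (toy vocabulary).
topics = [
    {"user": 0.4, "interface": 0.35, "study": 0.25},        # an "HCI" topic
    {"model": 0.4, "inference": 0.35, "likelihood": 0.25},  # an "ML" topic
]

def generate_document(num_words, alpha=(1.0, 1.0)):
    """Generate a document: draw theta_d, then one topic and one word per token."""
    theta_d = sample_dirichlet(alpha)
    words = []
    for _ in range(num_words):
        k = sample_categorical(theta_d, list(range(len(topics))))
        vocab = list(topics[k].keys())
        probs = list(topics[k].values())
        words.append(sample_categorical(probs, vocab))
    return theta_d, words

theta_d, doc = generate_document(10)
```

In practice β and θ are not written down by hand but recovered by inverting this process through statistical inference, as discussed next.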
Both the topics β and the document descriptions θ can be obtained through offline statistical inference [2,10], without any need for manual annotation. Thus, topics need not correspond to any predefined categories; indeed, this is why they are useful for exploratory analysis. In the research literature, to

Figure 1: The TopicViz environment. The main panel shows the initial presentation for a selected set of documents, which are arranged in a force-directed layout controlled by the seven best-matching topics. The upper-left panel shows the search results in list form, and the lower-left panel describes the selected topic, “multilingual”.

Figure 2: A textual display of the top ten automatically-identified keywords from three topics obtained from a dataset of research papers on computational linguistics. The topic names were assigned manually.