We present the nested Chinese restaurant process (nCRP), a stochastic process which assigns probability distributions to infinitely-deep, infinitely-branching trees. We show how this stochastic process can be used as a prior distribution in a Bayesian nonparametric model of document collections. Specifically, we present an application to information retrieval in which documents are modeled as paths down a random tree, and the preferential attachment dynamics of the nCRP leads to clustering of documents according to sharing of topics at multiple levels of abstraction. Given a corpus of documents, a posterior inference algorithm finds an approximation to a posterior distribution over trees, topics and allocations of words to levels of the tree. We demonstrate this algorithm on collections of scientific abstracts from several journals. This model exemplifies a recent trend in statistical machine learning: the use of Bayesian nonparametric methods to infer distributions on flexible data structures.
The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies (arXiv:0710.0845v3 [stat.ML], 27 Aug 2009).
For much of its history, computer science has focused on deductive formal methods, allying itself with deductive traditions in areas of mathematics such as set theory, logic, algebra, and combinatorics. There has accordingly been less focus on efforts to develop inductive, empirically-based formalisms in computer science, a gap which became increasingly visible over the years as computers have been required to interact with noisy, difficult-to-characterize sources of data, such as those deriving from physical signals or from human activity. In more recent history, the field of machine learning has aimed to fill this gap, allying itself with inductive traditions in probability and statistics, while focusing on methods that are amenable to analysis as computational procedures.
Machine learning methods can be divided into supervised learning methods and unsupervised learning methods. Supervised learning has been a major focus of machine learning research. In supervised learning, each data point is associated with a label (e.g., a category, a rank or a real number) and the goal is to find a function that maps data into labels (so as to predict the labels of data that have not yet been labeled). A canonical example of supervised machine learning is the email spam filter, which is trained on known spam messages and then used to mark incoming unlabeled email as spam or non-spam.
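The spam-filter example above can be made concrete with a toy supervised learner. The sketch below is not from the paper; it is a minimal Naive Bayes classifier over word counts with add-one smoothing, using hypothetical training data, intended only to illustrate the "map data to labels" setup:

```python
import math
from collections import Counter

def train(messages):
    """Fit per-class word counts and class counts from (text, label) pairs,
    where label is "spam" or "ham"."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for text, label in messages:
        counts[label].update(text.lower().split())
        totals[label] += 1
    return counts, totals

def classify(text, counts, totals):
    """Return the label maximizing the smoothed Naive Bayes log-score."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    best, best_score = None, -math.inf
    for label in ("spam", "ham"):
        n = sum(counts[label].values())
        # log prior plus smoothed log likelihood of each word
        score = math.log(totals[label] / sum(totals.values()))
        for w in text.lower().split():
            score += math.log((counts[label][w] + 1) / (n + len(vocab) + 1))
        if score > best_score:
            best, best_score = label, score
    return best
```

Having been trained on labeled messages, the classifier predicts a label for previously unseen text, which is exactly the supervised setting described above.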
While supervised learning remains an active and vibrant area of research, more recently the focus in machine learning has turned to unsupervised learning methods. In unsupervised learning the data are not labeled, and the broad goal is to find patterns and structure within the data set. Different formulations of unsupervised learning are based on different notions of “pattern” and “structure.” Canonical examples include clustering, the problem of grouping data into meaningful groups of similar points, and dimension reduction, the problem of finding a compact representation that retains useful information in the data set. One way to render these notions concrete is to tie them to a supervised learning problem; thus, a structure is validated if it aids the performance of an associated supervised learning system. Often, however, the goal is more exploratory. Inferred structures and patterns might be used, for example, to visualize or organize the data according to subjective criteria. With the increased access to all kinds of unlabeled data (scientific data, personal data, consumer data, economic data, government data, text data), exploratory unsupervised machine learning methods have become increasingly prominent.
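Clustering, the canonical unsupervised problem mentioned above, can be sketched with Lloyd's k-means algorithm. This is an illustrative stand-in, not a method from the paper; the farthest-first initialization is one common choice made here for determinism:

```python
def kmeans(points, k, iters=20):
    """Lloyd's algorithm on 2-D points with farthest-first initialization."""
    def d2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    # Start each new center as far as possible from the existing ones.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(d2(p, c) for c in centers)))

    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: d2(p, centers[i]))].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, c in enumerate(clusters):
            if c:
                centers[i] = (sum(p[0] for p in c) / len(c),
                              sum(p[1] for p in c) / len(c))
    return centers, clusters
```

Note that k, the number of clusters, is fixed in advance here; that design choice is precisely what the parametric/nonparametric distinction in the next paragraph turns on.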
Another important dichotomy in machine learning distinguishes between parametric and nonparametric models. A parametric model involves a fixed representation that does not grow structurally as more data are observed. Examples include linear regression and clustering methods in which the number of clusters is fixed a priori. A nonparametric model, on the other hand, is based on representations that are allowed to grow structurally as more data are observed. Nonparametric approaches are often adopted when the goal is to impose as few assumptions as possible and to “let the data speak.”
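The Chinese restaurant process, the building block of the nCRP introduced in the abstract, is a concrete example of a representation that grows with the data: the number of clusters ("tables") is not fixed a priori but increases, roughly logarithmically, with the number of observations. A minimal simulation, assuming a single concentration parameter alpha:

```python
import random

def crp_partition(n, alpha, seed=0):
    """Seat n customers by the Chinese restaurant process: customer i joins
    an existing table with probability proportional to its occupancy, or
    starts a new table with probability proportional to alpha."""
    rng = random.Random(seed)
    tables = []  # occupancy count of each table
    for i in range(n):
        r = rng.uniform(0, i + alpha)  # total mass: i seated customers + alpha
        acc = 0.0
        for t, size in enumerate(tables):
            acc += size
            if r < acc:
                tables[t] += 1  # join an occupied table (preferential attachment)
                break
        else:
            tables.append(1)  # open a new table
    return tables
```

Because larger tables attract new customers with higher probability, the process exhibits the preferential attachment dynamics the abstract refers to, while the number of tables remains unbounded in principle.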
The nonparametric approach underlies many of the most significant developments in the supervised learning branch of machine learning over the past two decades. In particular, modern classifiers such as decision trees, boosting and nearest neighbor methods are nonparametric, as are the class of supervised learning systems built on “kernel methods,” including the support vector machine. (See Hastie et al. (2001) for a good review of these methods.) Theoretical developments in supervised learning have shown that as the number of data points grows, these methods can converge to the true labeling function underlying the data, even when the data lie in an uncountably infinite space and the labeling function is arbitrary (Devroye et al., 1996). This would clearly not be possible for parametric classifiers.
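Among the nonparametric classifiers listed above, nearest neighbor methods are the simplest to state: the model is the training set itself, so its effective complexity grows with the data. A minimal 1-NN sketch on 2-D points, included here only as an illustration of that property:

```python
def nearest_neighbor_classify(train, x):
    """1-NN: label x with the label of the closest training point.
    The 'model' is just the stored data, so it grows as data accumulate."""
    def d2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    point, label = min(train, key=lambda pl: d2(pl[0], x))
    return label
```

It is this growth of the representation with the sample that allows consistency results of the kind proved by Devroye et al. (1996), which no fixed-dimensional parametric classifier can match for arbitrary labeling functions.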
The assumption that labels are available in supervised learning is a strong assumption, but it has the virtue that few additional assumptions are generally needed to obtain a useful supervised learning methodology. In unsupervised learning, on the other hand, the absence of labels and the need to obtain operational definitions of “pattern” and “structure” generally make it necessary to impose additional assumptions on the data source. In particular, unsupervised learning methods are often based on “generative models,” which are probabilistic models that express hypotheses about the way in which the data may have been generated. Probabilistic graphical models (also known as “Bayesian networks” and “Markov random fields”) have emerged as a broadly useful approach to specifying generative models (Lauritzen, 1996; Jordan, 2000). The elegant marriage of graph theory and probability theory in graphical models makes it p
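A generative model in the sense described above can be sampled from by following its generative story top-down (ancestral sampling). The toy sketch below is not the paper's model; it assumes a hypothetical two-topic vocabulary and simply draws a topic, then draws words from that topic's distribution, mirroring the structure of topic models of documents:

```python
import random

def generate_document(topics, weights, length, seed=0):
    """Ancestral sampling from a toy generative model of documents:
    first draw a topic, then draw each word from that topic's
    word distribution."""
    rng = random.Random(seed)
    topic = rng.choices(list(topics), weights=weights, k=1)[0]
    words, probs = zip(*topics[topic].items())
    return topic, rng.choices(words, weights=probs, k=length)
```

Inference runs this story in reverse: given only the observed words, it recovers a posterior over the hidden choices (here, the topic), which is the pattern the unsupervised method reports.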
…(Full text truncated)…