Correction: A correlated topic model of Science
Correction to Annals of Applied Statistics 1 (2007) 17--35
[doi:10.1214/07-AOAS114]. Comment: Published at
http://dx.doi.org/10.1214/07-AOAS136 in the Annals of Applied Statistics
(http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics
(http://www.imstat.org).
Inferring Networks of Substitutable and Complementary Products
In a modern recommender system, it is important to understand how products
relate to each other. For example, while a user is looking for mobile phones,
it might make sense to recommend other phones, but once they buy a phone, we
might instead want to recommend batteries, cases, or chargers. These two types
of recommendations are referred to as substitutes and complements: substitutes
are products that can be purchased instead of each other, while complements are
products that can be purchased in addition to each other.
Here we develop a method to infer networks of substitutable and complementary
products. We formulate this as a supervised link prediction task, where we
learn the semantics of substitutes and complements from data associated with
products. The primary source of data we use is the text of product reviews,
though our method also makes use of features such as ratings, specifications,
prices, and brands. Methodologically, we build topic models that are trained to
automatically discover topics from text that are successful at predicting and
explaining such relationships. Experimentally, we evaluate our system on the
Amazon product catalog, a large dataset consisting of 9 million products, 237
million links, and 144 million reviews. Comment: 12 pages, 6 figures.
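
The abstract gives the recipe only at a high level, so the following is a
minimal sketch of that recipe rather than the paper's actual model: learn
per-product topic vectors from review text, then train a supervised classifier
on product pairs to predict the link type. The toy reviews and links data, the
LDA-plus-logistic-regression pipeline, and all hyperparameters are illustrative
assumptions.

    # Hedged sketch: supervised link prediction from review-text topics.
    # LDA features + logistic regression stand in for the paper's model;
    # the `reviews` and `links` data are hypothetical.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.linear_model import LogisticRegression

    reviews = {                      # product id -> concatenated review text
        "phone_a": "great screen and the battery lasts all day",
        "phone_b": "solid phone but the battery drains fast",
        "case_a":  "snug case fits the phone and protects the screen",
        "charger": "fast charger cable charges the battery quickly",
    }
    links = [                        # (product, product, 1=complement, 0=substitute)
        ("phone_a", "phone_b", 0),
        ("phone_a", "case_a", 1),
        ("phone_b", "charger", 1),
        ("phone_a", "charger", 1),
    ]

    ids = list(reviews)
    counts = CountVectorizer().fit_transform([reviews[i] for i in ids])
    lda = LatentDirichletAllocation(n_components=3, random_state=0)
    topic = dict(zip(ids, lda.fit_transform(counts)))   # topic proportions

    # Edge features: topic vectors of both endpoints, concatenated, so the
    # features stay directional (complements are often asymmetric: a case
    # complements a phone more than the reverse).
    X = np.array([np.concatenate([topic[a], topic[b]]) for a, b, _ in links])
    y = np.array([lbl for _, _, lbl in links])

    clf = LogisticRegression().fit(X, y)
    print(clf.predict_proba(X)[:, 1])   # P(complement) per candidate edge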
A correlated topic model of Science
Topic models, such as latent Dirichlet allocation (LDA), can be useful tools
for the statistical analysis of document collections and other discrete data.
The LDA model assumes that the words of each document arise from a mixture of
topics, each of which is a distribution over the vocabulary. A limitation of
LDA is the inability to model topic correlation even though, for example, a
document about genetics is more likely to also be about disease than X-ray
astronomy. This limitation stems from the use of the Dirichlet distribution to
model the variability among the topic proportions. In this paper we develop the
correlated topic model (CTM), where the topic proportions exhibit correlation
via the logistic normal distribution [J. Roy. Statist. Soc. Ser. B 44 (1982)
139--177]. We derive a fast variational inference algorithm for approximate
posterior inference in this model, which is complicated by the fact that the
logistic normal is not conjugate to the multinomial. We apply the CTM to the
articles from Science published from 1990--1999, a data set that comprises 57M
words. The CTM gives a better fit of the data than LDA, and we demonstrate its
use as an exploratory tool of large document collections. Comment: Published at
http://dx.doi.org/10.1214/07-AOAS114 in the Annals of Applied Statistics
(http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics
(http://www.imstat.org).
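
As a concrete illustration of the generative step the abstract describes, the
sketch below draws topic proportions as theta = softmax(eta) with
eta ~ N(mu, Sigma); the particular mu and Sigma values are invented for
illustration. The off-diagonal structure of Sigma is exactly what a Dirichlet
prior cannot encode.

    # Hedged sketch of the CTM's topic-proportion draw:
    #   eta ~ N(mu, Sigma),  theta = softmax(eta).
    # mu and Sigma are made-up values chosen so that topics 0 and 1
    # co-occur while both anti-correlate with topic 2.
    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.zeros(3)                        # K = 3 topics
    Sigma = np.array([[ 1.0,  0.8, -0.5],
                      [ 0.8,  1.0, -0.5],
                      [-0.5, -0.5,  1.0]])

    eta = rng.multivariate_normal(mu, Sigma, size=10_000)
    theta = np.exp(eta) / np.exp(eta).sum(axis=1, keepdims=True)  # softmax rows

    # Empirical correlations among topic proportions mirror Sigma:
    print(np.corrcoef(theta.T).round(2))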
Stochastic Variational Inference
We develop stochastic variational inference, a scalable algorithm for
approximating posterior distributions. We develop this technique for a large
class of probabilistic models and we demonstrate it with two probabilistic
topic models, latent Dirichlet allocation and the hierarchical Dirichlet
process topic model. Using stochastic variational inference, we analyze several
large collections of documents: 300K articles from Nature, 1.8M articles from
The New York Times, and 3.8M articles from Wikipedia. Stochastic inference can
easily handle data sets of this size and outperforms traditional variational
inference, which can only handle a smaller subset. (We also show that the
Bayesian nonparametric topic model outperforms its parametric counterpart.)
Stochastic variational inference lets us apply complex Bayesian models to
massive data sets.
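
The abstract states the idea without the update rule, so the following is a
minimal sketch of the generic stochastic variational update on the simplest
conjugate model, a Gaussian mean with known unit variance and an N(0, 1) prior,
rather than on a topic model. The corpus size, minibatch size, and the step
schedule rho_t = (t + tau)^(-kappa) are illustrative assumptions.

    # Hedged sketch of stochastic variational inference on a toy model.
    # Global variational parameters are the natural parameters of q(mu):
    # lam = (precision * mean, precision). Each step forms an estimate
    # lam_hat from one minibatch (rescaled as if it were the whole data
    # set) and blends it in with a decreasing step size rho.
    import numpy as np

    rng = np.random.default_rng(0)
    N, B = 100_000, 100                 # corpus size, minibatch size
    data = rng.normal(2.0, 1.0, N)      # true mean = 2.0
    prior = np.array([0.0, 1.0])        # N(0, 1) prior in natural form

    lam = prior.copy()
    tau, kappa = 1.0, 0.7               # step-size schedule hyperparameters
    for t in range(1, 501):
        batch = rng.choice(data, B, replace=False)
        # Noisy natural-gradient target: prior + rescaled batch statistics.
        lam_hat = prior + (N / B) * np.array([batch.sum(), B])
        rho = (t + tau) ** -kappa       # Robbins-Monro step size
        lam = (1 - rho) * lam + rho * lam_hat

    print(f"posterior mean ~ {lam[0] / lam[1]:.3f}, precision ~ {lam[1]:.0f}")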
