1,119 research outputs found

Some considerations in the selection of aircraft for earth resource observations

Author: Arno R. D.
Deerwester J. M.

Comparison of logistics problems and cost aspects in selection of aircraft for earth resources survey

NASA Technical Reports Server

Inference and Evaluation of the Multinomial Mixture Model for Text Clustering

Author: Banerjee
Church
Deerwester
François Yvon
Halkidi
Hofmann
Jain
Katz
Kuhn
Lange
Loïs Rigouste
Mosimann
Nigam
Olivier Cappé
Robert
Sebastiani
Shahnaz
Publication venue: 'Elsevier BV'
Publication date: 01/01/2006

In this article, we investigate the use of a probabilistic model for unsupervised clustering in text collections. Unsupervised clustering has become a basic module for many intelligent text processing applications, such as information retrieval, text classification or information extraction. The model considered in this contribution consists of a mixture of multinomial distributions over the word counts, each component corresponding to a different theme. We present and contrast various estimation procedures, which apply both in supervised and unsupervised contexts. In supervised learning, this work suggests a criterion for evaluating the posterior odds of new documents which is more statistically sound than the "naive Bayes" approach. In an unsupervised context, we propose measures to set up a systematic evaluation framework and start by examining the Expectation-Maximization (EM) algorithm as the basic tool for inference. We discuss the importance of initialization and the influence of other features such as the smoothing strategy or the size of the vocabulary, thereby illustrating the difficulties incurred by the high dimensionality of the parameter space. We also propose a heuristic algorithm based on iterative EM with vocabulary reduction to solve this problem. Using the fact that the latent variables can be analytically integrated out, we finally show that the Gibbs sampling algorithm is tractable and compares favorably to the basic expectation maximization approach.

arXiv.org e-Print Archive

CiteSeerX

Crossref

HAL Descartes

Detecting Large Concept Extensions for Conceptual Analysis

Author: C Dutilh Novaes
DJ Chalmers
DM Blei
F Jackson
KL Gwet
S Deerwester
S Haslanger
S Laurence
TL Griffiths
U Fayyad
Publication date: 18/06/2017

When performing a conceptual analysis of a concept, philosophers are interested in all forms of expression of a concept in a text, be it direct or indirect, explicit or implicit. In this paper, we experiment with topic-based methods of automating the detection of concept expressions in order to facilitate philosophical conceptual analysis. We propose six methods based on LDA, and evaluate them on a new corpus of court decisions that we had annotated by experts and non-experts. Our results indicate that these methods can yield important improvements over the keyword heuristic, which is often used for concept detection in many contexts. While more work remains to be done, this indicates that detecting concepts through topics can serve as a general-purpose method for at least some forms of concept expression that are not captured using naive keyword approaches.

arXiv.org e-Print Archive

Crossref

Spoken query processing for interactive information retrieval

Author: Barnett
Crestani
Deerwester
Fabio Crestani
Garofolo
Harman
Markowitz
Porter
Silipo
Singhal
Tombros
van Rijsbergen
Voorhees
Publication venue: 'Elsevier BV'
Publication date: 01/01/2002

It has long been recognised that interactivity improves the effectiveness of information retrieval systems. Speech is the most natural and interactive medium of communication, and recent progress in speech recognition is making it possible to build systems that interact with the user via speech. However, given the typical length of queries submitted to information retrieval systems, it is easy to imagine that word recognition errors in spoken queries would severely damage the system's effectiveness. The experimental work reported in this paper shows that the use of classical information retrieval techniques for spoken query processing is robust to considerably high levels of word recognition errors, in particular for long queries. Moreover, in the case of short queries, both standard relevance feedback and pseudo relevance feedback can be effectively employed to improve the effectiveness of spoken query processing.

Crossref

University of Strathclyde Institutional Repository

Community detection based on links and node features in social networks

Author: A. Pothen
D.M. Blei
J. Xie
J.M. Kleinberg
M. Girvan
S. Fortunato
S.C. Deerwester
Publication date: 01/01/2015

© Springer International Publishing Switzerland 2015. Community detection is a significant but challenging task in the field of social network analysis. Many effective methods have been proposed to solve this problem. However, most of them are mainly based on the topological structure or node attributes. In this paper, based on SPAEM [1], we propose a joint probabilistic model to detect communities which combines node attributes and topological structure. In our model, we create a novel feature-based weighted network, within which each edge weight is represented by the node feature similarity between the two nodes at the ends of the edge. Then we fuse the original network and the created network with a parameter and employ the expectation-maximization (EM) algorithm to identify a community. Experiments on a diverse set of data, collected from Facebook and Twitter, demonstrate that our algorithm has achieved promising results compared with other algorithms.

Crossref

OPUS - University of Technology Sydney

Word Embeddings for Entity-annotated Texts

Author: A Das
A Spitz
CD Manning
D Nadeau
E Bruni
F Hill
H Abdi
H Rubenstein
J Mitchell
J Strötgen
JG Moreno
L Maaten
P Bojanowski
P Goyal
S Deerwester
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 12/02/2020

Learned vector representations of words are useful tools for many information retrieval and natural language processing tasks due to their ability to capture lexical semantics. However, while many such tasks involve or even rely on named entities as central components, popular word embedding models have so far failed to include entities as first-class citizens. While it seems intuitive that annotating named entities in the training corpus should result in more intelligent word features for downstream tasks, performance issues arise when popular embedding approaches are naively applied to entity-annotated corpora. Not only are the resulting entity embeddings less useful than expected, but one also finds that the performance of the non-entity word embeddings degrades in comparison to those trained on the raw, unannotated corpus. In this paper, we investigate approaches to jointly train word and entity embeddings on a large corpus with automatically annotated and linked entities. We discuss two distinct approaches to the generation of such embeddings, namely the training of state-of-the-art embeddings on raw-text and annotated versions of the corpus, as well as node embeddings of a co-occurrence graph representation of the annotated corpus. We compare the performance of annotated embeddings and classical word embeddings on a variety of word similarity, analogy, and clustering evaluation tasks, and investigate their performance in entity-specific tasks. Our findings show that it takes more than training popular word embedding models on an annotated corpus to create entity embeddings with acceptable performance on common test cases. Based on these results, we discuss how and when node embeddings of the co-occurrence graph representation of the text can restore the performance. Comment: This paper is accepted in 41st European Conference on Information Retrieval

arXiv.org e-Print Archive

Crossref

Looking at Vector Space and Language Models for IR using Density Matrices

Author: A Gleason
AI Lvovsky
B Piwowarski
C Carpineto
ChX Zhai
G Birkhoff
G Salton
G Zuccon
J Rocchio
J Zobel
K Rijsbergen van
K Tsuda
M Melucci
MA Nielsen
MK Warmuth
S Deerwester
SKM Wong
T Hofmann
X Zhao
Publication date: 08/01/2014

In this work, we conduct a joint analysis of both Vector Space and Language Models for IR using the mathematical framework of Quantum Theory. We shed light on how both models allocate the space of density matrices. A density matrix is shown to be a general representational tool capable of leveraging capabilities of both VSM and LM representations, thus paving the way for a new generation of retrieval models. We analyze the possible implications suggested by our findings. Comment: In Proceedings of Quantum Interaction 201

arXiv.org e-Print Archive

Crossref

Meaning-focused and Quantum-inspired Information Retrieval

Author: AY Khrennikov
D Aerts
D Osherson
D Widdows
DM Blei
EM Pothos
G Zuccon
JA Hampton
JR Busemeyer
K Lund
K Rijsbergen Van
M Melucci
S Deerwester
S Dumais
Y Li
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 30/03/2013

In recent years, quantum-based methods have been promisingly integrated with the traditional procedures in information retrieval (IR) and natural language processing (NLP). Inspired by our research on the identification and application of quantum structures in cognition, more specifically our work on the representation of concepts and their combinations, we put forward a 'quantum meaning based' framework for structured query retrieval in text corpora and standardized testing corpora. This scheme for IR rests on considering as basic notions (i) 'entities of meaning', e.g., concepts and their combinations, and (ii) traces of such entities of meaning, which is how documents are considered in this approach. The meaning content of these 'entities of meaning' is reconstructed by solving an 'inverse problem' in the quantum formalism, consisting of reconstructing the full states of the entities of meaning from their collapsed states identified as traces in relevant documents. The advantages with respect to traditional approaches, such as Latent Semantic Analysis (LSA), are discussed by means of concrete examples. Comment: 11 pages

arXiv.org e-Print Archive

CiteSeerX

Crossref

Archivio istituzionale della ricerca - Università degli Studi di Udine

Selective Metal Cation Capture by Soft Anionic Metal-Organic Frameworks via Drastic Single-Crystal-to-Single-Crystal Transformations

Author: B Settles
CW Arnold
DM Blei
DR Rhodes
DR Swanson
GA Miller
JH Miles
L Kanner
SC Deerwester
Publication date: 13/06/2012

In this paper we describe a novel framework for the discovery of the topical content of a data corpus, and the tracking of its complex structural changes across the temporal dimension. In contrast to previous work, our model does not impose a prior on the rate at which documents are added to the corpus, nor does it adopt the Markovian assumption, which overly restricts the type of changes that the model can capture. Our key technical contribution is a framework based on (i) discretization of time into epochs, (ii) epoch-wise topic discovery using a hierarchical Dirichlet process-based model, and (iii) a temporal similarity graph which allows for the modelling of complex topic changes: emergence and disappearance, evolution, and splitting and merging. The power of the proposed framework is demonstrated on the medical literature corpus concerned with autism spectrum disorder (ASD), an increasingly important research subject of significant social and healthcare importance. In addition to the collected ASD literature corpus, which we will make freely available, our contributions also include two free online tools we built as aids to ASD researchers. These can be used for semantically meaningful navigation and searching, as well as knowledge discovery from this large and rapidly growing corpus of literature. Comment: In Proc. Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 201

arXiv.org e-Print Archive

Heriot Watt Pure

DRO Deakin Research Online

Crossref

Edinburgh Research Explorer

University of St. Andrews - Pure

Effect of Tuned Parameters on a LSA MCQ Answering Model

Author: A. C. Graesser
Alain Lifchitz
C. H. Q. Ding
D. I. Martin
G. Denhière
G. Salton
Guy Denhière
J. Diaz
J. Quesada
M. Efron
M. F. Porter
M. W. Berry
S. Deerwester
S. T. Dumais
Sandra Jhean-Larose
W. Kintsch
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2009

This paper presents the current state of a work in progress, whose objective is to better understand the effects of factors that significantly influence the performance of Latent Semantic Analysis (LSA). A difficult task, which consists of answering (French) biology Multiple Choice Questions, is used to test the semantic properties of the truncated singular space and to study the relative influence of the main parameters. Dedicated software has been designed to fine-tune the LSA semantic space for the Multiple Choice Questions task. With optimal parameters, the performance of our simple model is, quite surprisingly, equal or superior to that of 7th- and 8th-grade students. This indicates that the semantic spaces were quite good despite their low dimensions and the small sizes of the training data sets. In addition, we present an original entropy-based global weighting of the answer terms of each Multiple Choice Question, which was necessary to achieve the model's success. Comment: 9 pages

arXiv.org e-Print Archive

HAL: Hyper Article en Ligne

Hal-Diderot