457 research outputs found
Inference and Evaluation of the Multinomial Mixture Model for Text Clustering
In this article, we investigate the use of a probabilistic model for
unsupervised clustering in text collections. Unsupervised clustering has become
a basic module for many intelligent text processing applications, such as
information retrieval, text classification or information extraction. The model
considered in this contribution consists of a mixture of multinomial
distributions over the word counts, each component corresponding to a different
theme. We present and contrast various estimation procedures, which apply both
in supervised and unsupervised contexts. In supervised learning, this work
suggests a criterion for evaluating the posterior odds of new documents which
is more statistically sound than the "naive Bayes" approach. In an unsupervised
context, we propose measures to set up a systematic evaluation framework and
start with examining the Expectation-Maximization (EM) algorithm as the basic
tool for inference. We discuss the importance of initialization and the
influence of other features such as the smoothing strategy or the size of the
vocabulary, thereby illustrating the difficulties incurred by the high
dimensionality of the parameter space. We also propose a heuristic algorithm
based on iterative EM with vocabulary reduction to solve this problem. Using
the fact that the latent variables can be analytically integrated out, we
finally show that the Gibbs sampling algorithm is tractable and compares
favorably to the basic expectation-maximization approach.
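The basic EM procedure the abstract refers to can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy four-word corpus, the Laplace smoothing constant, and the random soft initialization are all assumptions made for the example.

```python
import math
import random

def em_multinomial_mixture(counts, k, n_iter=50, smooth=1.0, seed=0):
    """Fit a k-component mixture of multinomials to word-count vectors via EM."""
    rng = random.Random(seed)
    n, v = len(counts), len(counts[0])
    # Random soft responsibilities to start; as the abstract notes,
    # initialization matters for EM in this model.
    resp = []
    for _ in range(n):
        row = [rng.random() + 1e-3 for _ in range(k)]
        s = sum(row)
        resp.append([r / s for r in row])
    for _ in range(n_iter):
        # M-step: mixing weights and Laplace-smoothed word distributions.
        pi = [sum(resp[d][z] for d in range(n)) / n for z in range(k)]
        theta = []
        for z in range(k):
            wc = [smooth + sum(resp[d][z] * counts[d][w] for d in range(n))
                  for w in range(v)]
            tot = sum(wc)
            theta.append([c / tot for c in wc])
        # E-step: posterior responsibilities from per-component log-likelihoods.
        for d in range(n):
            logp = [math.log(pi[z]) +
                    sum(c * math.log(theta[z][w])
                        for w, c in enumerate(counts[d]) if c)
                    for z in range(k)]
            m = max(logp)
            p = [math.exp(x - m) for x in logp]
            s = sum(p)
            resp[d] = [x / s for x in p]
    return pi, theta, resp

# Toy corpus over a 4-word vocabulary with two clear themes.
docs = [[5, 4, 0, 0], [6, 3, 1, 0], [0, 1, 5, 6], [1, 0, 4, 5]]
pi, theta, resp = em_multinomial_mixture(docs, k=2)
labels = [max(range(2), key=lambda z: r[z]) for r in resp]
```

On this well-separated toy data the soft assignments quickly harden, grouping the first two documents against the last two; the smoothing term keeps word probabilities strictly positive, which is what makes the log-likelihoods in the E-step well defined.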
Cluster validation by measurement of clustering characteristics relevant to the user
There are many cluster analysis methods that can produce quite different
clusterings on the same dataset. Cluster validation is about the evaluation of
the quality of a clustering; "relative cluster validation" is about using such
criteria to compare clusterings. This can be used to select one of a set of
clusterings from different methods, or from the same method run with different
parameters such as different numbers of clusters.
There are many cluster validation indexes in the literature. Most of them
attempt to measure the overall quality of a clustering by a single number, but
this can be inappropriate. There are various different characteristics of a
clustering that can be relevant in practice, depending on the aim of
clustering, such as low within-cluster distances and high between-cluster
separation.
In this paper, a number of validation criteria will be introduced that refer
to different desirable characteristics of a clustering, and that characterise a
clustering in a multidimensional way. In specific applications the user may be
interested in some of these criteria rather than others. A focus of the paper
is on methodology to standardise the different characteristics so that users
can aggregate them in a suitable way specifying weights for the various
criteria that are relevant in the clustering application at hand.
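The weighted aggregation idea can be sketched as follows. The min-max standardisation below is a simple stand-in, not the paper's methodology, and the criterion names and numbers are invented for illustration; the only assumption is that each criterion has been sign-aligned so that larger is better.

```python
def minmax_standardise(raw):
    """Map raw criterion values across candidate clusterings to [0, 1].
    (A simple stand-in for the paper's standardisation methodology.)"""
    lo, hi = min(raw), max(raw)
    if hi == lo:
        return [0.5 for _ in raw]
    return [(x - lo) / (hi - lo) for x in raw]

def aggregate(criteria, weights):
    """criteria: dict name -> list of standardized values, one per clustering.
    Returns one weighted score per candidate clustering."""
    n = len(next(iter(criteria.values())))
    total_w = sum(weights.values())
    return [sum(weights[c] * criteria[c][i] for c in criteria) / total_w
            for i in range(n)]

# Three candidate clusterings scored on two (already sign-aligned) criteria.
crit = {
    "homogeneity": minmax_standardise([0.9, 0.5, 0.7]),
    "separation":  minmax_standardise([0.2, 0.8, 0.6]),
}
# A user who cares twice as much about separation weights it accordingly.
scores = aggregate(crit, weights={"homogeneity": 1.0, "separation": 2.0})
best = max(range(3), key=lambda i: scores[i])
```

Because the criteria are standardised to a common scale before weighting, the user's weights express relative importance directly rather than being distorted by the criteria's raw ranges.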
An Approach to Web-Scale Named-Entity Disambiguation
We present a multi-pass clustering approach to large-scale, wide-scope named-entity disambiguation (NED) on collections of web pages. Our approach uses name co-occurrence information to cluster and hence disambiguate entities, and is designed to handle NED on the entire web. We show that on web collections, NED becomes increasingly difficult as the corpus size increases, not only because of the challenge of scaling the NED algorithm, but also because new and surprising facets of entities become visible in the data. This effect limits the potential benefits for data-driven approaches of processing larger data sets, and suggests that efficient clustering-based disambiguation methods for the web will require extracting more specialized information from documents.
An efficient density-based clustering algorithm using reverse nearest neighbour
Density-based clustering is the task of discovering high-density regions of
entities (clusters) that are separated from each other by contiguous regions of
low-density. DBSCAN is, arguably, the most popular density-based clustering
algorithm. However, its cluster recovery capabilities depend on the combination
of its two parameters. In this paper we present a new density-based clustering
algorithm which uses reverse nearest neighbour (RNN) and has a single
parameter. We also show that it is possible to estimate a good value for this
parameter using a clustering validity index. The RNN queries enable our
algorithm to estimate densities taking more than a single entity into account,
and to recover clusters that are not well-separated or have different
densities. Our experiments on synthetic and real-world data sets show our
proposed algorithm outperforms DBSCAN and its recent variant ISDBSCAN.
Comment: Accepted in: Computing Conference 2019 in London, UK.
http://saiconference.com/Computin
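The core reverse-nearest-neighbour idea can be sketched as follows. This is a brute-force toy, not the paper's algorithm; the point set and the choice of k are assumptions. The intuition is that a point appearing in few or no other points' k-nearest-neighbour lists lies in a low-density region.

```python
def reverse_nearest_neighbour_counts(points, k):
    """Count, for each point, how many other points include it among their
    k nearest neighbours. High counts suggest dense regions; a count of zero
    flags a likely noise point."""
    n = len(points)
    counts = [0] * n
    for i in range(n):
        # Brute-force k nearest neighbours of point i (squared Euclidean).
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(points[i], points[j])), j)
            for j in range(n) if j != i)
        for _, j in dists[:k]:
            counts[j] += 1
    return counts

# A tight cluster of four points plus one far-away outlier.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
rnn = reverse_nearest_neighbour_counts(pts, k=2)
```

Here every point in the tight cluster is a nearest neighbour of several others, while the outlier is nobody's nearest neighbour, so its reverse count is zero. Because the count aggregates over many entities rather than a single distance threshold, this kind of estimate can adapt to clusters of different densities.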
Spatial correlations in attribute communities
Community detection is an important tool for exploring and classifying the
properties of large complex networks and should be of great help for spatial
networks. Indeed, in addition to their location, nodes in spatial networks can
have attributes such as the language for individuals, or any other
socio-economical feature that we would like to identify in communities. We
discuss in this paper a crucial aspect which was not considered in previous
studies which is the possible existence of correlations between space and
attributes. Introducing a simple toy model in which both space and node
attributes are considered, we discuss the effect of space-attribute
correlations on the results of various community detection methods proposed for
spatial networks in this paper and in previous studies. When space is
irrelevant, our model is equivalent to the stochastic block model which has
been shown to display a detectability-non detectability transition. In the
regime where space dominates the link formation process, most methods can fail
to recover the communities, an effect which is particularly marked when
space-attributes correlations are strong. In this latter case, community
detection methods which remove the spatial component of the network can miss a
large part of the community structure and can lead to incorrect results.
Recovering the number of clusters in data sets with noise features using feature rescaling factors
In this paper we introduce three methods for re-scaling data sets, aiming at improving the likelihood that clustering validity indexes return the true number of spherical Gaussian clusters with additional noise features. Our method obtains feature re-scaling factors taking into account the structure of a given data set and the intuitive idea that different features may have different degrees of relevance at different clusters. We experiment with the Silhouette (using squared Euclidean, Manhattan, and the pth power of the Minkowski distance), Dunn's, Calinski–Harabasz and Hartigan indexes on data sets with spherical Gaussian clusters with and without noise features. We conclude that our methods indeed increase the chances of estimating the true number of clusters in a data set.
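One of the validity indexes the abstract mentions, the silhouette, can be sketched as follows. The squared-Euclidean variant shown here is one of the distance choices the paper experiments with, but the brute-force implementation, the tiny 1-D data set, and the particular labellings compared are all illustrative assumptions.

```python
def mean_silhouette(points, labels):
    """Mean silhouette width using squared Euclidean distance."""
    def d(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    widths = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            widths.append(0.0)  # convention for singleton clusters
            continue
        # Mean distance within own cluster (d(p, p) = 0, so divide by size - 1).
        a = sum(d(p, q) for q in own) / (len(own) - 1)
        # Mean distance to the nearest other cluster.
        b = min(sum(d(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)

# Two well-separated 1-D clusters: the correct 2-cluster labelling should
# score higher than an arbitrary 3-cluster labelling.
pts = [(0.0,), (0.2,), (0.1,), (5.0,), (5.2,), (5.1,)]
two = mean_silhouette(pts, [0, 0, 0, 1, 1, 1])
three = mean_silhouette(pts, [0, 0, 1, 2, 2, 2])
```

Scanning such an index over candidate numbers of clusters and picking the maximum is the standard way a validity index is used to estimate the true number of clusters; the paper's contribution is to re-scale the features first so that noise features do not mislead this choice.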
An effective non-parametric method for globally clustering genes from expression profiles
Clustering is widely used in bioinformatics to find gene correlation patterns. Although many algorithms have been proposed, these are usually confronted with difficulties in meeting the requirements of both automation and high quality. In this paper, we propose a novel algorithm for clustering genes from their expression profiles. The unique features of the proposed algorithm are twofold: it takes into consideration global, rather than local, gene correlation information in clustering processes; and it incorporates clustering quality measurement into the clustering processes to implement non-parametric, automatic and global optimal gene clustering. The evaluation on simulated and real gene data sets demonstrates the effectiveness of the algorithm.
Clustering sensory inputs using NeuroEvolution of augmenting topologies
Sorting data into groups and clusters is one of the fundamental tasks of artificially intelligent systems. Classical clustering algorithms rely on heuristic (k-nearest neighbours) or statistical (k-means, fuzzy c-means) methods to derive clusters, and these have performed well. Neural networks have also been used to cluster data, but researchers have only recently begun to adopt the strategy of having neural networks directly determine the cluster membership of an input datum. This paper presents a novel strategy, employing NeuroEvolution of Augmenting Topologies to produce an evolutionary neural network capable of directly clustering unlabelled inputs. It establishes the use of cluster validity metrics in a fitness function to train the neural network.
