457 research outputs found

Inference and Evaluation of the Multinomial Mixture Model for Text Clustering

Author: Banerjee
Church
Deerwester
François Yvon
Halkidi
Hofmann
Jain
Katz
Kuhn
Lange
Loïs Rigouste
Mosimann
Nigam
Olivier Cappé
Robert
Sebastiani
Shahnaz
Publication venue: 'Elsevier BV'
Publication date: 01/01/2006

In this article, we investigate the use of a probabilistic model for unsupervised clustering in text collections. Unsupervised clustering has become a basic module for many intelligent text processing applications, such as information retrieval, text classification or information extraction. The model considered in this contribution consists of a mixture of multinomial distributions over the word counts, each component corresponding to a different theme. We present and contrast various estimation procedures, which apply in both supervised and unsupervised contexts. In supervised learning, this work suggests a criterion for evaluating the posterior odds of new documents which is more statistically sound than the "naive Bayes" approach. In an unsupervised context, we propose measures to set up a systematic evaluation framework and start by examining the Expectation-Maximization (EM) algorithm as the basic tool for inference. We discuss the importance of initialization and the influence of other features such as the smoothing strategy or the size of the vocabulary, thereby illustrating the difficulties incurred by the high dimensionality of the parameter space. We also propose a heuristic algorithm based on iterative EM with vocabulary reduction to solve this problem. Using the fact that the latent variables can be analytically integrated out, we finally show that a Gibbs sampling algorithm is tractable and compares favorably to the basic expectation maximization approach.
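As an illustration of the model this abstract describes, here is a minimal sketch of EM for a mixture of multinomials over word counts. It is not the authors' implementation: the function name `em_multinomial_mixture`, the additive smoothing constant `alpha`, the random soft initialization and the toy documents are all assumptions made for this example.

```python
import math
import random

def em_multinomial_mixture(counts, k, n_iter=50, alpha=1.0, seed=0):
    """Fit a k-component multinomial mixture to word-count vectors with EM.

    counts: list of documents, each a list of word counts over a fixed vocabulary.
    alpha:  additive smoothing constant (an assumption of this sketch).
    Returns (mixture weights, per-component word probabilities, responsibilities).
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    # Random soft initialization of the responsibilities (each row sums to 1).
    resp = []
    for _ in range(n_docs):
        row = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(row)
        resp.append([r / total for r in row])

    for _ in range(n_iter):
        # M-step: mixture weights (lightly smoothed so no weight is exactly 0)
        # and smoothed word probabilities for each component.
        weights = [(1e-6 + sum(resp[d][j] for d in range(n_docs)))
                   / (n_docs + k * 1e-6) for j in range(k)]
        beta = []
        for j in range(k):
            expected = [alpha + sum(resp[d][j] * counts[d][w] for d in range(n_docs))
                        for w in range(n_words)]
            total = sum(expected)
            beta.append([e / total for e in expected])

        # E-step: posterior component probabilities, computed in log space.
        # The multinomial coefficient is the same for every component, so it cancels.
        for d in range(n_docs):
            log_post = [math.log(weights[j])
                        + sum(c * math.log(beta[j][w])
                              for w, c in enumerate(counts[d]) if c)
                        for j in range(k)]
            top = max(log_post)
            unnorm = [math.exp(lp - top) for lp in log_post]
            total = sum(unnorm)
            resp[d] = [u / total for u in unnorm]

    return weights, beta, resp

# Toy corpus (hypothetical): three documents per theme over a 4-word vocabulary.
docs = [[5, 4, 0, 0], [6, 5, 0, 0], [4, 6, 0, 0],
        [0, 0, 5, 5], [0, 0, 6, 4], [0, 1, 4, 6]]
weights, beta, resp = em_multinomial_mixture(docs, k=2)
labels = [max(range(2), key=lambda j: r[j]) for r in resp]
print(labels)
```

The result depends on the initialization, which is the sensitivity the abstract points to; changing `seed` or `alpha` is a quick way to observe it on larger vocabularies.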

arXiv.org e-Print Archive

CiteSeerX

Crossref

HAL Descartes

Cluster validation by measurement of clustering characteristics relevant to the user

Author: Bowcock
Calinski
Coretto
Fang
Franck
Halkidi
Hausdorf
Hennig
Hubert
Katsnelson
Kaufman
Lago-Fernandez
Stigler
Tibshirani
Publication date: 01/01/2019

There are many cluster analysis methods that can produce quite different clusterings on the same dataset. Cluster validation is about the evaluation of the quality of a clustering; "relative cluster validation" is about using such criteria to compare clusterings. This can be used to select one of a set of clusterings from different methods, or from the same method run with different parameters such as different numbers of clusters. There are many cluster validation indexes in the literature. Most of them attempt to measure the overall quality of a clustering by a single number, but this can be inappropriate. There are various characteristics of a clustering that can be relevant in practice, depending on the aim of clustering, such as low within-cluster distances and high between-cluster separation. In this paper, a number of validation criteria will be introduced that refer to different desirable characteristics of a clustering, and that characterise a clustering in a multidimensional way. In specific applications the user may be interested in some of these criteria rather than others. A focus of the paper is on methodology to standardise the different characteristics so that users can aggregate them in a suitable way, specifying weights for the various criteria that are relevant in the clustering application at hand. Comment: 20 pages, 2 figures.

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

An Approach to Web-Scale Named-Entity Disambiguation

Author: C. Whitelaw
I. Bhattacharya
L. Sarmento
M. Halkidi
M. Meilă
P. Pantel
S. Dill
S. Guha
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2009

We present a multi-pass clustering approach to large-scale, wide-scope named-entity disambiguation (NED) on collections of web pages. Our approach uses name co-occurrence information to cluster and hence disambiguate entities, and is designed to handle NED on the entire web. We show that on web collections, NED becomes increasingly difficult as the corpus size increases, not only because of the challenge of scaling the NED algorithm, but also because new and surprising facets of entities become visible in the data. This effect limits the potential benefits for data-driven approaches of processing larger data-sets, and suggests that efficient clustering-based disambiguation methods for the web will require extracting more specialized information from documents.

Crossref

Repositório Aberto da Universidade do Porto

Density-based clustering is the task of discovering high-density regions of entities (clusters) that are separated from each other by contiguous regions of low density. DBSCAN is, arguably, the most popular density-based clustering algorithm. However, its cluster recovery capabilities depend on the combination of its two parameters. In this paper we present a new density-based clustering algorithm which uses reverse nearest neighbour (RNN) queries and has a single parameter. We also show that it is possible to estimate a good value for this parameter using a clustering validity index. The RNN queries enable our algorithm to estimate densities taking more than a single entity into account, and to recover clusters that are not well-separated or have different densities. Our experiments on synthetic and real-world data sets show that our proposed algorithm outperforms DBSCAN and its recent variant ISDBSCAN. Comment: Accepted in: Computing Conference 2019 in London, UK. http://saiconference.com/Computin
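To make the reverse nearest neighbour idea in this abstract concrete, the sketch below counts, for each point, how many other points include it among their k nearest neighbours. It illustrates the query only and is not the clustering algorithm proposed in the paper; the function name `reverse_knn_counts`, the brute-force search and the sample points are assumptions made for this example.

```python
import math

def reverse_knn_counts(points, k):
    """For each point, count how many other points list it among their k nearest neighbours."""
    n = len(points)
    counts = [0] * n
    for i in range(n):
        # Indices of all other points, nearest to point i first (brute force).
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:
            counts[j] += 1  # point i is a reverse nearest neighbour of point j
    return counts

# Hypothetical data: a tight group of four points plus one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
rnn = reverse_knn_counts(points, k=2)
print(rnn)
```

Points in dense regions tend to collect many reverse neighbours, while isolated points collect few, so the count can serve as a rough density estimate controlled by the single parameter `k`.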

University of Essex Research Repository

arXiv.org e-Print Archive

Crossref

Spatial correlations in attribute communities

Author: A Barrat
A De Montis
A Decelle
A Lancichinetti
AK Jain
Alessandro Chessa
B Karrer
D Grady
D Hu
F Calabrese
Federica Cerina
G Daraganova
L Danon
L Denoeud
M Barthelemy
M Chavez
M Halkidi
MA Porter
Marc Barthelemy
MEJ Newman
P Expert
P Kaluza
R Guimerà
R Guimerá
RandW
RJGB Campello
S Erlander
S Fortunato
S Gregory
Sergio Gómez
VD Blondel
Vincenzo De Leo
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2012

Community detection is an important tool for exploring and classifying the properties of large complex networks and should be of great help for spatial networks. Indeed, in addition to their location, nodes in spatial networks can have attributes such as the language for individuals, or any other socio-economic feature that we would like to identify in communities. We discuss in this paper a crucial aspect that was not considered in previous studies: the possible existence of correlations between space and attributes. Introducing a simple toy model in which both space and node attributes are considered, we discuss the effect of space-attribute correlations on the results of various community detection methods proposed for spatial networks in this paper and in previous studies. When space is irrelevant, our model is equivalent to the stochastic block model, which has been shown to display a detectability/non-detectability transition. In the regime where space dominates the link formation process, most methods can fail to recover the communities, an effect which is particularly marked when space-attribute correlations are strong. In this latter case, community detection methods which remove the spatial component of the network can miss a large part of the community structure and can lead to incorrect results. Comment: 10 pages and 7 figures.

arXiv.org e-Print Archive

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Archivio istituzionale della ricerca - Università di Cagliari

IMT Institutional Repository

Recovering the number of clusters in data sets with noise features using feature rescaling factors

Author: Arbelaitz
Ball
Bezdek
Caliński
Chan
Chiang
Christian Hennig
David
de Amorim
Dudoit
Dunn
Gasch
Halkidi
Hartigan
Hennig
Huang
Hubert
Jain
Kaufman
MacQueen
Milligan
Mirkin
Pollard
Renato Cordeiro de Amorim
Rousseeuw
Steinley
Sturn
Vedaldi
Publication venue: 'Elsevier BV'
Publication date: 01/01/2015

In this paper we introduce three methods for re-scaling data sets aiming at improving the likelihood of clustering validity indexes to return the true number of spherical Gaussian clusters with additional noise features. Our methods obtain feature re-scaling factors taking into account the structure of a given data set and the intuitive idea that different features may have different degrees of relevance at different clusters. We experiment with the Silhouette (using squared Euclidean, Manhattan, and the pth power of the Minkowski distance), Dunn’s, Calinski–Harabasz and Hartigan indexes on data sets with spherical Gaussian clusters with and without noise features. We conclude that our methods indeed increase the chances of estimating the true number of clusters in a data set.
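For readers unfamiliar with how a validity index is used to pick the number of clusters, the following sketch computes the average silhouette width of a labelling and compares two candidate labellings. It covers only the plain silhouette with Euclidean distance, not the feature re-scaling methods of the paper; the function name `silhouette`, the `dist` parameter and the sample data are assumptions made for this example.

```python
import math

def silhouette(points, labels, dist=math.dist):
    """Average silhouette width of a labelling (higher is better, at most 1)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes s(i) = 0
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(dist(points[i], points[j]) for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

# Hypothetical data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
candidates = {
    2: [0, 0, 0, 1, 1, 1],  # labelling with two clusters
    3: [0, 0, 1, 2, 2, 2],  # labelling that splits the first group
}
scores = {k: silhouette(points, labs) for k, labs in candidates.items()}
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

Another distance, such as Manhattan, can be swapped in through the `dist` argument. Noise features would enter these distances unweighted, which is the situation the re-scaling factors in the abstract are meant to address.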

arXiv.org e-Print Archive

University of Essex Research Repository

Crossref

UCL Discovery

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

University of Hertfordshire Research Archive

An effective non-parametric method for globally clustering genes from expression profiles

Author: AK Jain
F Azuaje
G Sherlock
Gang Li
Jingyu Hou
M Halkidi
MS Aldenderfer
MT Özsu
PC Boutros
PT Spellman
R Simon
R Tibshirani
RB Altman
RJ Hathaway
RR Sokal
S Raychaudhuri
SM Tseng
T Zhang
VS Tseng
Wanlei Zhou
Wei Shi
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/12/2007

Clustering is widely used in bioinformatics to find gene correlation patterns. Although many algorithms have been proposed, they usually have difficulty meeting the requirements of both automation and high quality. In this paper, we propose a novel algorithm for clustering genes from their expression profiles. The unique features of the proposed algorithm are twofold: it takes into consideration global, rather than local, gene correlation information in clustering processes; and it incorporates clustering quality measurement into the clustering processes to implement non-parametric, automatic and globally optimal gene clustering. The evaluation on simulated and real gene data sets demonstrates the effectiveness of the algorithm.

DRO Deakin Research Online

Crossref

Clustering sensory inputs using NeuroEvolution of augmenting topologies

Author: Goudbeek Martijn
Halkidi M.
Hsu Yen-Chang
Raue Federico
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/07/2018

Sorting data into groups and clusters is one of the fundamental tasks of artificially intelligent systems. Classical clustering algorithms rely on heuristic (k-nearest neighbours) or statistical methods (k-means, fuzzy c-means) to derive clusters and these have performed well. Neural networks have also been used in clustering data, but researchers have only recently begun to adopt the strategy of having neural networks directly determine the cluster membership of an input datum. This paper presents a novel strategy, employing NeuroEvolution of Augmenting Topologies to produce an evolutionary neural network capable of directly clustering unlabelled inputs. It establishes the use of cluster validity metrics in a fitness function to train the neural network.

Crossref

The IT University of Copenhagen's Repository