Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust Gaussian clustering
The two main topics of this paper are the introduction of the "optimally
tuned robust improper maximum likelihood estimator" (OTRIMLE) for robust
clustering based on the multivariate Gaussian model for clusters, and a
comprehensive simulation study comparing the OTRIMLE to maximum likelihood
estimation in Gaussian
mixtures with and without noise component, mixtures of t-distributions, and the
TCLUST approach for trimmed clustering. The OTRIMLE uses an improper constant
density for modelling outliers and noise. This can be chosen optimally so that
the non-noise part of the data looks as close to a Gaussian mixture as
possible. Some deviation from Gaussianity can be traded in for lowering the
estimated noise proportion. Covariance matrix constraints and computation of
the OTRIMLE are also treated. In the simulation study, all methods are
confronted with setups in which their model assumptions are not exactly
fulfilled, and in order to evaluate the experiments in a standardized way by
misclassification rates, a new model-based definition of "true clusters" is
introduced that deviates from the usual identification of mixture components
with clusters. In the study, every method turns out to be superior for one or
more setups, but the OTRIMLE achieves the most satisfactory overall
performance. The methods are also applied to two real datasets, one without and
one with known "true" clusters.
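As a rough illustration of the improper-density idea described above, the following minimal sketch (not the authors' implementation; the names delta, pis, mus and Sigmas are assumptions) evaluates the pseudo-log-likelihood of a Gaussian mixture augmented by a component with constant improper density for noise and outliers.

```python
# Minimal sketch of an OTRIMLE/RIMLE-style pseudo-density: a Gaussian mixture
# plus a component with constant improper density `delta` catching outliers.
# All parameter names here are illustrative assumptions.
import numpy as np
from scipy.stats import multivariate_normal

def improper_pseudo_loglik(X, pis, mus, Sigmas, pi0, delta):
    """sum_i log( pi0*delta + sum_k pi_k * N(x_i; mu_k, Sigma_k) )."""
    dens = pi0 * delta * np.ones(len(X))          # improper constant component
    for pik, mu, Sig in zip(pis, mus, Sigmas):    # proper Gaussian components
        dens += pik * multivariate_normal.pdf(X, mean=mu, cov=Sig)
    return np.log(dens).sum()

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2)),
               rng.uniform(-10, 15, (10, 2))])    # two clusters plus scattered noise
print(improper_pseudo_loglik(X, [0.45, 0.45], [np.zeros(2), np.full(2, 5.0)],
                             [np.eye(2), np.eye(2)], 0.1, 1e-3))
```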
Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering
The robust improper maximum likelihood estimator (RIMLE) is a new method for
robust multivariate clustering that finds approximately Gaussian clusters. It
maximizes a pseudo-likelihood defined by adding a component with improper
constant density for accommodating outliers to a Gaussian mixture. A special
case of the RIMLE is MLE for multivariate finite Gaussian mixture models. In
this paper we treat existence, consistency, and breakdown theory for the RIMLE
comprehensively. RIMLE's existence is proved under non-smooth covariance matrix
constraints. It is shown that these can be implemented via a computationally
feasible Expectation-Conditional Maximization algorithm.
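One iteration of an EM-type algorithm of this kind can be sketched as below; this is a simplified stand-in, not the paper's Expectation-Conditional Maximization algorithm, and the eigenratio bound gamma is an assumed, simplified version of the covariance matrix constraints treated in the paper.

```python
# Sketch (assumed, simplified) of one EM-type iteration for a Gaussian mixture
# with an improper constant noise density; covariance eigenvalues are clipped
# to satisfy an eigenratio constraint so that no component degenerates.
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pis, mus, Sigmas, pi0, delta, gamma=10.0):
    n, p = X.shape
    K = len(pis)
    # E-step: posterior weights for the noise component (column 0) and the
    # K Gaussian components.
    tau = np.empty((n, K + 1))
    tau[:, 0] = pi0 * delta
    for k in range(K):
        tau[:, k + 1] = pis[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
    tau /= tau.sum(axis=1, keepdims=True)
    # CM-steps: update proportions, means and covariances block by block.
    pi0 = tau[:, 0].mean()
    for k in range(K):
        w = tau[:, k + 1]
        pis[k] = w.mean()
        mus[k] = (w[:, None] * X).sum(0) / w.sum()
        R = X - mus[k]
        S = (w[:, None, None] * np.einsum('ni,nj->nij', R, R)).sum(0) / w.sum()
        # Clip eigenvalues into [lmax/gamma, lmax]: a crude version of a
        # non-smooth eigenratio constraint.
        lam, V = np.linalg.eigh(S)
        lam = np.clip(lam, lam.max() / gamma, None)
        Sigmas[k] = V @ np.diag(lam) @ V.T
    return pis, mus, Sigmas, pi0
```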
A theory of decidability: entropy and choices under uncertainty
This work presents a new model of choice under uncertainty. After introducing a characterization of the concept of uncertainty, it is shown on an axiomatic basis how the entropy function can be interpreted as a weak measure of uncertainty. The concepts of entropy and expected utility are then used to construct, again axiomatically, a new function: the decidability function, which is able to order preferences over the space of lotteries. Finally, it is shown that this model can rationalize both the Allais paradox and the Ellsberg paradox.
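As a toy numerical illustration of the two ingredients (not the paper's axiomatic construction, and without reproducing the decidability function itself): the Shannon entropy of a lottery as an uncertainty measure, next to its expected utility.

```python
# Toy illustration: entropy and expected utility of a lottery, the two
# building blocks the abstract combines into a "decidability" function
# whose exact form is not reproduced here.
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def expected_utility(probs, utils):
    return sum(p * u for p, u in zip(probs, utils))

lottery = [0.5, 0.5]        # a fair coin between two prizes
utils = [0.0, 1.0]
print(entropy(lottery), expected_utility(lottery, utils))
```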
Cluster validation by measurement of clustering characteristics relevant to the user
There are many cluster analysis methods that can produce quite different
clusterings on the same dataset. Cluster validation is about the evaluation of
the quality of a clustering; "relative cluster validation" is about using such
criteria to compare clusterings. This can be used to select one of a set of
clusterings from different methods, or from the same method run with different
parameters such as different numbers of clusters.
There are many cluster validation indexes in the literature. Most of them
attempt to measure the overall quality of a clustering by a single number, but
this can be inappropriate. There are various different characteristics of a
clustering that can be relevant in practice, depending on the aim of
clustering, such as low within-cluster distances and high between-cluster
separation.
In this paper, a number of validation criteria will be introduced that refer
to different desirable characteristics of a clustering, and that characterise a
clustering in a multidimensional way. In specific applications the user may be
interested in some of these criteria rather than others. A focus of the paper
is on methodology to standardise the different characteristics so that users
can aggregate them in a suitable way, specifying weights for the various
criteria that are relevant in the clustering application at hand.
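A minimal sketch of the aggregation idea follows; the z-score standardisation and all names are assumptions for illustration, not the paper's exact methodology.

```python
# Sketch: standardise several validation statistics to a common scale,
# orient them so that larger means better, and combine with user weights.
import numpy as np

def aggregate_validation(values, weights, larger_is_better):
    """values: dict criterion -> array of values over candidate clusterings."""
    total = np.zeros(len(next(iter(values.values()))))
    for name, v in values.items():
        v = np.asarray(v, float)
        z = (v - v.mean()) / v.std()          # crude z-score standardisation
        if not larger_is_better[name]:
            z = -z                             # orient so larger = better
        total += weights[name] * z
    return total                               # argmax picks a clustering

scores = aggregate_validation(
    {"avg_within_dist": [2.1, 1.7, 1.9], "separation": [0.8, 1.2, 0.9]},
    {"avg_within_dist": 0.5, "separation": 1.0},
    {"avg_within_dist": False, "separation": True})
print(scores.argmax())  # index of the preferred clustering
```

The clustering with the largest aggregated score would then be selected.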
Nonparametric consistency for maximum likelihood estimation and clustering based on mixtures of elliptically-symmetric distributions
The consistency of the maximum likelihood estimator for mixtures of
elliptically-symmetric distributions for estimating its population version is
shown, where the underlying distribution P is nonparametric and does not
necessarily belong to the class of mixtures on which the estimator is based. In
a situation where P is a mixture of well enough separated but nonparametric
distributions, it is shown that the components of the population version of the
estimator correspond to the well separated components of P. This provides
some theoretical justification for the use of such estimators for cluster
analysis in case that P has well separated subpopulations, even if these
subpopulations differ from what the mixture model assumes.
An adequacy approach for deciding the number of clusters for OTRIMLE robust Gaussian mixture-based clustering
We introduce a new approach to deciding the number of clusters. The approach is applied to Optimally Tuned Robust Improper Maximum Likelihood Estimation (OTRIMLE; Coretto & Hennig, Journal of the American Statistical Association, 111, 1648-1659) of a Gaussian mixture model allowing for observations to be classified as 'noise', but it can be applied to other clustering methods as well. The quality of a clustering is assessed by a statistic Q that measures how close the within-cluster distributions are to elliptical unimodal distributions whose only mode is at the mean. This non-parametric measure allows for non-Gaussian clusters as long as they have a good quality according to Q. The simplicity of a model is assessed by a measure S that prefers a smaller number of clusters unless additional clusters can reduce the estimated noise proportion substantially. The simplest model is then chosen that is adequate for the data in the sense that its observed value of Q is not significantly larger than what is expected for data truly generated from the fitted model, as can be assessed by parametric bootstrap. The approach is compared with model-based clustering using the Bayesian information criterion (BIC) and the integrated complete likelihood (ICL) in a simulation study and on two real data sets.
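The bootstrap logic can be sketched generically as follows (an illustrative stand-in, not the OTRIMLE implementation; fit, simulate and Q are user-supplied placeholders). Starting from the smallest number of clusters, one would return the first model judged adequate.

```python
# Sketch of a parametric-bootstrap adequacy check: the fitted model is
# adequate if the observed statistic Q is not significantly larger than Q
# on data simulated from that fitted model.
import numpy as np

def is_adequate(X, fit, simulate, Q, B=200, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    model = fit(X)
    q_obs = Q(X, model)
    q_boot = []
    for _ in range(B):
        Xb = simulate(model, len(X), rng)   # data from the fitted model
        q_boot.append(Q(Xb, fit(Xb)))       # re-fit on each bootstrap sample
    # one-sided p-value: fraction of bootstrap Q at least as large as q_obs
    pval = (1 + sum(q >= q_obs for q in q_boot)) / (B + 1)
    return pval >= alpha                    # adequate if not significantly larger
```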
Selecting the number of clusters, clustering models, and algorithms. A unifying approach based on the quadratic discriminant score
Cluster analysis requires many decisions: the clustering method and the
implied reference model, the number of clusters and, often, several
hyperparameters and algorithm tunings. In practice, one produces several
partitions, and a final one is chosen based on validation or selection
criteria. There exists an abundance of validation methods that, implicitly or
explicitly, assume a certain clustering notion. Moreover, they are often
restricted to operate on partitions obtained from a specific method. In this
paper, we focus on groups that can be well separated by quadratic or linear
boundaries. The reference cluster concept is defined through the quadratic
discriminant score function and parameters describing clusters' size, center
and scatter. We develop two cluster-quality criteria called quadratic scores.
We show that these criteria are consistent with groups generated from a general
class of elliptically-symmetric distributions. The quest for this type of
grouping is common in applications. The connection with likelihood theory for
mixture models and model-based clustering is investigated. Based on bootstrap
resampling of the quadratic scores, we propose a selection rule that allows
choosing among many clustering solutions. The proposed method has the
distinctive advantage that it can compare partitions that cannot be compared
with other state-of-the-art methods. Extensive numerical experiments and the
analysis of real data show that, even if some competing methods turn out to be
superior in some setups, the proposed methodology achieves a better overall
performance.
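For illustration, a standard quadratic discriminant score of a point under a cluster with given size, center and scatter can be written as below; this is the textbook form of the score, not necessarily the exact criteria developed in the paper.

```python
# Quadratic discriminant score of point x under a cluster with proportion
# pi_k, center mu_k and scatter Sigma_k:
#   log(pi_k) - 0.5*log|Sigma_k| - 0.5*(x-mu_k)' Sigma_k^{-1} (x-mu_k)
import numpy as np

def quadratic_score(x, pi_k, mu_k, Sigma_k):
    d = x - mu_k
    _, logdet = np.linalg.slogdet(Sigma_k)
    return (np.log(pi_k) - 0.5 * logdet
            - 0.5 * d @ np.linalg.solve(Sigma_k, d))

# A point is assigned to the cluster whose score is largest; equal
# covariances across clusters make the resulting boundaries linear.
x = np.array([0.5, 0.2])
print(quadratic_score(x, 0.5, np.zeros(2), np.eye(2)))
```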
A simulation study to compare robust clustering methods based on mixtures
The following mixture model-based clustering methods are compared in a simulation study with one-dimensional data, a fixed number of clusters, and a focus on outliers and uniform "noise": an ML-estimator (MLE) for Gaussian mixtures, an MLE for a mixture of Gaussians and a uniform distribution (interpreted as a "noise component" to catch outliers), an MLE for a mixture of Gaussian distributions where a uniform distribution over the range of the data is fixed (Fraley and Raftery in Comput J 41:578-588, 1998), …
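A small sketch of the "noise component" likelihood compared in the study (simplified and assumed, for one-dimensional data): a Gaussian mixture plus a uniform density fixed over the range of the data to absorb outliers.

```python
# Sketch: log-likelihood of a 1-d Gaussian mixture plus a uniform noise
# component fixed over the observed data range.
import numpy as np
from scipy.stats import norm

def loglik_noise_component(x, pis, mus, sigmas, pi0):
    unif = 1.0 / (x.max() - x.min())          # uniform over the data range
    dens = pi0 * unif
    for pik, mu, sd in zip(pis, mus, sigmas):
        dens = dens + pik * norm.pdf(x, mu, sd)
    return np.log(dens).sum()

x = np.concatenate([np.random.default_rng(1).normal(0, 1, 100),
                    np.random.default_rng(2).uniform(-8, 8, 10)])
print(loglik_noise_component(x, [0.9], [0.0], [1.0], 0.1))
```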
