An efficient density-based clustering algorithm using reverse nearest neighbour
Density-based clustering is the task of discovering high-density regions of
entities (clusters) that are separated from each other by contiguous regions of
low-density. DBSCAN is, arguably, the most popular density-based clustering
algorithm. However, its cluster recovery capabilities depend on the combination
of its two parameters. In this paper we present a new density-based clustering
algorithm which uses reverse nearest neighbour (RNN) and has a single
parameter. We also show that it is possible to estimate a good value for this
parameter using a clustering validity index. The RNN queries enable our
algorithm to estimate densities taking more than a single entity into account,
and to recover clusters that are not well-separated or have different
densities. Our experiments on synthetic and real-world data sets show our
proposed algorithm outperforms DBSCAN and its recent variant ISDBSCAN.
Comment: Accepted in: Computing Conference 2019 in London, UK.
http://saiconference.com/Computin
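To make the reverse nearest neighbour idea in the abstract above concrete, here is a minimal sketch in Python. It is not the paper's algorithm. It only computes, for each point, which other points list it among their own k nearest neighbours, assuming Euclidean distance and ties broken by index. The function names `k_nearest` and `reverse_nearest` and the toy data are hypothetical, chosen for illustration.

```python
from math import dist


def k_nearest(points, k):
    """For each point index, return the indices of its k nearest other points."""
    neighbours = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        # Sort by Euclidean distance; break ties by index (an assumption).
        others.sort(key=lambda j: (dist(p, points[j]), j))
        neighbours.append(others[:k])
    return neighbours


def reverse_nearest(points, k):
    """For each point index, return the indices of the points that have it
    among their k nearest neighbours (its reverse nearest neighbours)."""
    rnn = [[] for _ in points]
    for i, nbrs in enumerate(k_nearest(points, k)):
        for j in nbrs:
            rnn[j].append(i)
    return rnn


# Toy data (hypothetical): four points in a tight square plus one distant point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
rnn = reverse_nearest(points, k=2)
counts = [len(r) for r in rnn]
```

In this toy data the distant point has no reverse nearest neighbours, because no other point counts it among its two nearest, while every point in the square has at least two. A count of this kind reflects how many entities regard a point as close, which is one way to read the abstract's remark about estimating densities from more than a single entity.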
Linear, Deterministic, and Order-Invariant Initialization Methods for the K-Means Clustering Algorithm
Over the past five decades, k-means has become the clustering algorithm of
choice in many application domains primarily due to its simplicity, time/space
efficiency, and invariance to the ordering of the data points. Unfortunately,
the algorithm's sensitivity to the initial selection of the cluster centers
remains its most serious drawback. Numerous initialization methods have
been proposed to address this drawback. Many of these methods, however, have
time complexity superlinear in the number of data points, which makes them
impractical for large data sets. On the other hand, linear methods are often
random and/or sensitive to the ordering of the data points. These methods are
generally unreliable in that the quality of their results is unpredictable.
Therefore, it is common practice to perform multiple runs of such methods and
take the output of the run that produces the best results. Such a practice,
however, greatly increases the computational requirements of the otherwise
highly efficient k-means algorithm. In this chapter, we investigate the
empirical performance of six linear, deterministic (non-random), and
order-invariant k-means initialization methods on a large and diverse
collection of data sets from the UCI Machine Learning Repository. The results
demonstrate that two relatively unknown hierarchical initialization methods due
to Su and Dy outperform the remaining four methods with respect to two
objective effectiveness criteria. In addition, a recent method due to Erisoglu
et al. performs surprisingly poorly.
Comment: 21 pages, 2 figures, 5 tables, Partitional Clustering Algorithms
(Springer, 2014). arXiv admin note: substantial text overlap with
arXiv:1304.7465, arXiv:1209.196
Flexible parametric bootstrap for testing homogeneity against clustering and assessing the number of clusters
Webometrics benefitting from web mining? An investigation of methods and applications of two research fields
Webometrics and web mining are two fields where research is focused on quantitative analyses of the web. This literature review outlines definitions of the fields, and then focuses on their methods and applications. It also discusses the potential of closer contact and collaboration between them. A key difference between the fields is that webometrics has focused on exploratory studies, whereas web mining has been dominated by studies focusing on the development of methods and algorithms. Differences in the type of data can also be seen, with webometrics more focused on analyses of the structure of the web and web mining more focused on web content and usage, even though both fields have been embracing the possibilities of user-generated content. It is concluded that research problems where big data is needed can benefit from collaboration between webometricians, with their tradition of exploratory studies, and web miners, with their tradition of developing methods and algorithms.
Identification of anti-tumour biologics using primary tumour models, 3-D phenotypic screening and image-based multi-parametric profiling
Intelligent Routing System for a Personalised Electronic Tourist
When tourists are at a destination, they typically search for information at the Local Tourist Organizations. There, the staff categorize each tourist's profile and restrictions. Combining this information with their up-to-date knowledge of the local attractions, weather and public transportation, they suggest a personalised route for the tourist's agenda. This paper presents an intelligent routing system for a Personalised Electronic Tourist Guide to fulfil the same task. This system improves the automatic route creation functionality of existing PETs to better meet the needs of tourists in several respects: i) it includes public transportation, ii) it takes varying travelling times into account, adapting to real circumstances such as rush hours, iii) it calculates routes in real time to react to unexpected events, iv) it applies the latest generation of heuristics from Operations Research to create routes efficiently, even in destinations with a large number of points of interest and a dense public transportation network.
status: publishe
