Quadratic distances on probabilities: A unified foundation
This work builds a unified framework for the study of quadratic form distance
measures as they are used in assessing the goodness of fit of models. Many
important procedures have this structure, but the theory for these methods is
dispersed and incomplete. Central to the statistical analysis of these
distances is the spectral decomposition of the kernel that generates the
distance. We show how this determines the limiting distribution of natural
goodness-of-fit tests. Additionally, we develop a new notion, the spectral
degrees of freedom of the test, based on this decomposition. The degrees of
freedom are easy to compute and estimate, and can be used as a guide in the
construction of useful procedures in this class.
Published in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics; http://dx.doi.org/10.1214/009053607000000956
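The abstract above ties the limiting behavior of quadratic-distance tests to the spectral decomposition of the generating kernel, and introduces a spectral notion of degrees of freedom. A minimal numerical sketch of the idea follows; the Gaussian kernel, the double-centering step, and the Satterthwaite-style summary (sum of eigenvalues squared over sum of squared eigenvalues) are illustrative assumptions, not the paper's exact definition.

```python
import numpy as np

def gaussian_kernel(x, y, h=1.0):
    # Gaussian kernel matrix; one common choice of non-negative definite kernel
    return np.exp(-(x[:, None] - y[None, :]) ** 2 / (2 * h ** 2))

def spectral_dof(sample, h=1.0):
    """Illustrative effective degrees of freedom from the empirical spectrum
    of a centered kernel matrix.  The formula (sum lam)^2 / sum(lam^2) is a
    Satterthwaite-style summary used here as a stand-in for the paper's
    spectral degrees of freedom."""
    n = len(sample)
    K = gaussian_kernel(sample, sample, h)
    # double-center so the kernel acts on mean-zero functions
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H
    lam = np.linalg.eigvalsh(Kc) / n   # empirical eigenvalues
    lam = lam[lam > 1e-12]             # drop numerical zeros
    return lam.sum() ** 2 / (lam ** 2).sum()

rng = np.random.default_rng(0)
x = rng.normal(size=200)
dof = spectral_dof(x, h=0.5)
```

A narrower bandwidth spreads mass over more eigenvalues and raises the effective degrees of freedom, which is what makes the quantity useful as a tuning guide.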
A unified framework for multivariate two-sample and k-sample kernel-based quadratic distance goodness-of-fit tests
In the statistical literature, as well as in artificial intelligence and machine learning, measures of discrepancy between two probability distributions are widely used to develop measures of goodness of fit. We concentrate on quadratic distances, which depend on a non-negative definite kernel. We propose a unified framework for the study of two-sample and k-sample goodness-of-fit tests based on the concept of matrix distance. We provide a succinct review of the goodness-of-fit literature related to the use of distance measures, and specifically to quadratic distances. We show that the kernel-based quadratic distance two-sample test has the same functional form as the maximum mean discrepancy test. We develop tests for the k-sample scenario, of which the two-sample problem is a special case. We derive their asymptotic distribution under the null hypothesis and discuss computational aspects of the test procedures. We assess their performance, in terms of level and power, via extensive simulations and a real data example. The proposed framework is implemented in the QuadratiK package, available in both R and Python environments. (35 pages, 18 figures)
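The functional equivalence noted in the abstract, between the kernel-based quadratic two-sample distance and the maximum mean discrepancy, can be sketched with a plain V-statistic estimate. This is a generic illustration with an assumed Gaussian kernel, not the QuadratiK package's API.

```python
import numpy as np

def rbf(a, b, h=1.0):
    # Gaussian kernel matrix between the rows of a and b
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * h ** 2))

def mmd2(x, y, h=1.0):
    """Biased (V-statistic) estimate of the squared maximum mean discrepancy;
    the kernel-based quadratic two-sample distance shares this functional
    form: within-sample kernel means minus twice the cross-sample mean."""
    return rbf(x, x, h).mean() + rbf(y, y, h).mean() - 2 * rbf(x, y, h).mean()

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=(100, 2))
y = rng.normal(1.0, 1.0, size=(100, 2))      # shifted distribution
stat_far = mmd2(x, y)                         # distributions differ
stat_near = mmd2(x, rng.normal(0.0, 1.0, size=(100, 2)))  # same law
```

Under the null the statistic concentrates near zero (up to the V-statistic bias), while a mean shift pushes it up; calibrating that difference is exactly where the asymptotic null distribution from the paper enters.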
kamila: Clustering Mixed-Type Data in R and Hadoop
In this paper we discuss the challenge of equitably combining continuous (quantitative) and categorical (qualitative) variables for the purpose of cluster analysis. Existing techniques require strong parametric assumptions or difficult-to-specify tuning parameters. We describe the kamila package, which includes a weighted k-means approach to clustering mixed-type data, a method for estimating weights for mixed-type data (Modha-Spangler weighting), and an additional semiparametric method recently proposed in the literature (KAMILA). We include a discussion of strategies for estimating the number of clusters in the data, and describe the implementation of one such method in the current R package. Background and usage of these clustering methods are presented. We then show how the KAMILA algorithm can be adapted to a map-reduce framework, and implement the resulting algorithm using Hadoop for clustering very large mixed-type data sets.
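The core difficulty the abstract describes, balancing continuous against categorical contributions, can be made concrete with a toy weighted dissimilarity. The specific distance components and the weight below are assumptions chosen for illustration; they sketch the weighted k-means idea, not the kamila package's actual implementation.

```python
import numpy as np

def mixed_dissimilarity(a_num, a_cat, b_num, b_cat, w=0.5):
    """Illustrative dissimilarity for mixed-type records: squared Euclidean
    distance on the continuous part plus simple matching on the categorical
    part.  The weight w is the kind of balance a weighted k-means approach
    must set or estimate (e.g. via Modha-Spangler-style weighting); this is
    a sketch, not the kamila API."""
    d_num = ((a_num - b_num) ** 2).sum()   # continuous contribution
    d_cat = (a_cat != b_cat).mean()        # fraction of mismatched categories
    return w * d_num + (1 - w) * d_cat

a = (np.array([0.2, 1.5]), np.array(["red", "small"]))
b = (np.array([0.1, 1.4]), np.array(["red", "large"]))
d = mixed_dissimilarity(a[0], a[1], b[0], b[1], w=0.5)
```

With a fixed w the categorical term can dominate or vanish depending on the scale of the continuous variables, which is precisely why the package estimates the weights rather than leaving them to the user.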
Machine learning and word sense disambiguation in the biomedical domain: design and evaluation issues
BACKGROUND: Word sense disambiguation (WSD) is critical in the biomedical domain for improving the precision of natural language processing (NLP), text mining, and information retrieval systems, because ambiguous words negatively impact accurate access to literature containing biomolecular entities such as genes, proteins, cells, and diseases. Automated techniques have been developed that address the WSD problem for a number of text processing situations, but the problem is still a challenging one. Supervised WSD machine learning (ML) methods have been applied in the biomedical domain and have shown promising results, but the results typically incorporate a number of confounding factors, and it is problematic to truly understand the effectiveness and generalizability of the methods because these factors interact with each other and affect the final results. Thus, there is a need to explicitly address the factors and to systematically quantify their effects on performance. RESULTS: Experiments were designed to measure the effect of "sample size" (i.e. size of the datasets), "sense distribution" (i.e. the distribution of the different meanings of the ambiguous word) and "degree of difficulty" (i.e. the measure of the distances between the meanings of the senses of an ambiguous word) on the performance of WSD classifiers. Support Vector Machine (SVM) classifiers were applied to an automatically generated data set containing four ambiguous biomedical abbreviations: BPD, BSA, PCA, and RSV, which were chosen because of varying degrees of differences in their respective senses. Results showed that: 1) increasing the sample size generally reduced the error rate, but this was limited mainly to well-separated senses (i.e. cases where the distances between the senses were large); in difficult cases an impractically large increase in sample size was needed to improve performance even slightly, 2) the sense distribution did not have an effect on performance when the senses were separable, 3) when there was a majority sense of over 90%, the WSD classifier was not better than use of the simple majority sense, 4) error rates were proportional to the similarity of senses, and 5) there was no statistical difference between results when using a 5-fold or 10-fold cross-validation method. Other issues that impact performance are also enumerated. CONCLUSION: Several different independent aspects affect performance when using ML techniques for WSD. We found that combining them into one single result obscures understanding of the underlying methods. Although we studied only four abbreviations, we utilized a well-established statistical method that indicates the results are likely to be generalizable for abbreviations with similar characteristics. The results of our experiments show that in order to understand the performance of these ML methods it is critical that papers report the baseline performance, the distribution and sample size of the senses in the datasets, and the standard deviation or confidence intervals. In addition, papers should also characterize the difficulty of the WSD task, the WSD situations addressed and not addressed, as well as the ML methods and features used. This should lead to an improved understanding of the generalizability and the limitations of the methodology.
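The supervised setup the abstract evaluates, an SVM classifying the sense of an ambiguous abbreviation from its surrounding text, can be sketched as a small text-classification pipeline. The sentences, senses, and feature choice below are invented toy data for illustration, not the paper's dataset or exact feature set.

```python
# Minimal sketch of supervised WSD as text classification with an SVM,
# assuming a bag-of-words (tf-idf) representation of the context.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Toy training contexts for the ambiguous abbreviation "PCA"
train_texts = [
    "PCA reduced the dimensionality of the expression matrix",
    "principal component analysis PCA of the gene features",
    "the patient received PCA after surgery for pain control",
    "PCA pump settings were adjusted by the anesthesiologist",
]
train_senses = [
    "principal_component_analysis", "principal_component_analysis",
    "patient_controlled_analgesia", "patient_controlled_analgesia",
]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_texts, train_senses)
pred = clf.predict(["principal component analysis of the expression data"])[0]
```

With senses this well separated even a tiny sample suffices, which mirrors finding 1) above: sample size mainly matters when the senses' contexts overlap.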
A Platform for Processing Expression of Short Time Series (PESTS)
Background: Time course microarray profiles examine the expression of genes over a time domain. They are necessary in order to determine the complete set of genes that are dynamically expressed under given conditions, and to determine the interaction between these genes. Because of cost and resource issues, most time series datasets contain fewer than 9 points, and there are few tools available geared towards the analysis of this type of data. Results: To this end, we introduce a platform for Processing Expression of Short Time Series (PESTS). It was designed with a focus on usability and interpretability of analyses for the researcher. As such, it implements several standard techniques for comparability as well as visualization functions. However, it is designed specifically for the unique methods we have developed for significance analysis, multiple test correction and clustering of short time series data. The central tenet of these methods is the use of biologically relevant features for analysis. Features summarize short gene expression profiles, inherently incorporate dependence across time, and allow for both full description of the examined curve and missing data points. Conclusions: PESTS is fully generalizable to other types of time series analyses. PESTS implements novel methods as well as several standard techniques for comparability and visualization functions. These features and functionality make PESTS a valuable resource for a researcher's toolkit. PESTS is available to download for free to academic and non-profit users at http://www.mailman.columbia.edu/academic-departments/biostatistics/research-service/software-development.
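The abstract's central idea, summarizing a short expression profile by a few interpretable features that tolerate missing points, can be sketched as follows. The particular features (linear trend, time of peak, dynamic range) are assumptions for illustration, not PESTS's actual feature set.

```python
import numpy as np

def profile_features(t, y):
    """Illustrative feature summary of one short expression profile.
    NaN observations are dropped, so a missing time point does not
    prevent computing the features."""
    mask = ~np.isnan(y)
    t, y = t[mask], y[mask]
    slope = np.polyfit(t, y, 1)[0]   # overall linear trend over the course
    t_max = t[int(np.argmax(y))]     # time of peak expression
    span = y.max() - y.min()         # dynamic range of the profile
    return {"slope": slope, "t_max": t_max, "range": span}

t = np.array([0.0, 1.0, 2.0, 4.0, 8.0])
y = np.array([0.1, 0.8, np.nan, 1.5, 0.3])   # one missing observation
feats = profile_features(t, y)
```

Because each profile collapses to the same small feature vector regardless of how many of its few time points survive, downstream significance testing and clustering can operate on complete vectors, which is the design point the abstract emphasizes.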
