Optimal cross-validation in density estimation with the $L^2$-loss
We analyze the performance of cross-validation (CV) in the density estimation
framework with two purposes: (i) risk estimation and (ii) model selection. The
main focus is given to the so-called leave-$p$-out CV procedure (Lpo), where $p$
denotes the cardinality of the test set. Closed-form expressions are
established for the Lpo estimator of the risk of projection estimators. These
expressions provide a great improvement over $V$-fold cross-validation in terms
of variability and computational complexity. From a theoretical point of view,
closed-form expressions also make it possible to study the Lpo performance in terms of
risk estimation. The optimality of leave-one-out (Loo), that is Lpo with $p=1$,
is proved among CV procedures used for risk estimation. Two model selection
frameworks are also considered: estimation, as opposed to identification. For
estimation with finite sample size $n$, optimality is achieved for $p$ large
enough [with $p/n = o(1)$] to balance the overfitting resulting from the
structure of the model collection. For identification, model selection
consistency is established for Lpo as long as $p/n$ is conveniently related to the
rate of convergence of the best estimator in the collection: (i) $p/n \to 1$ as
$n \to +\infty$ with a parametric rate, and (ii) with some
nonparametric estimators. These theoretical results are validated by simulation
experiments.
Comment: Published at http://dx.doi.org/10.1214/14-AOS1240 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org).
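To fix ideas, the sketch below (not taken from the paper) computes the classical closed-form leave-one-out ($p=1$) estimate of the $L^2$ risk for a regular histogram, the simplest projection estimator; the paper's general Lpo formulas and model collections are not reproduced, and the bin grid and Beta-distributed sample are illustrative assumptions.

```python
import numpy as np

def loo_l2_risk_histogram(x, n_bins, lo=0.0, hi=1.0):
    """Closed-form leave-one-out estimate of the L2 risk (up to the constant
    integral of f^2) for a regular histogram on [lo, hi]:
        LOO(h) = 2 / ((n-1) h) - (n+1) / ((n-1) h) * sum_j (N_j / n)^2,
    where h is the bin width and N_j are the bin counts."""
    x = np.asarray(x, dtype=float)
    n = x.size
    h = (hi - lo) / n_bins
    counts, _ = np.histogram(x, bins=n_bins, range=(lo, hi))
    p_hat = counts / n
    return 2.0 / ((n - 1) * h) - (n + 1) / ((n - 1) * h) * np.sum(p_hat ** 2)

# Toy usage: pick the number of bins minimizing the LOO risk estimate.
rng = np.random.default_rng(0)
sample = rng.beta(2.0, 5.0, size=500)
best = min(range(2, 60), key=lambda m: loo_l2_risk_histogram(sample, m))
print("selected number of bins:", best)
```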
A One-Sample Test for Normality with Kernel Methods
We propose a new one-sample test for normality in a Reproducing Kernel
Hilbert Space (RKHS). Namely, we test the null hypothesis that the distribution
belongs to a given family of Gaussian distributions. Hence our procedure may be applied
either to test data for normality or to test parameters (mean and covariance)
if data are assumed Gaussian. Our test is based on the same principle as the
MMD (Maximum Mean Discrepancy) which is usually used for two-sample tests such
as homogeneity or independence testing. Our method makes use of a special kind
of parametric bootstrap (typical of goodness-of-fit tests) which is
computationally more efficient than standard parametric bootstrap. Moreover, an
upper bound for the Type-II error highlights the dependence on influential
quantities. Experiments illustrate the practical improvement allowed by our
test in high-dimensional settings where common normality tests are known to
fail. We also consider an application to covariance rank selection through a
sequential procedure.
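As an illustration only, here is a minimal Python sketch of the parametric-bootstrap calibration idea, using a plug-in two-sample MMD between the data and a fresh sample drawn from the fitted Gaussian as a stand-in statistic; the statistic, the bandwidth heuristic and the bootstrap size are my assumptions and do not reproduce the authors' test or their more efficient bootstrap.

```python
import numpy as np

def rbf_gram(a, b, gamma):
    # Pairwise Gaussian RBF kernel matrix between rows of a and rows of b.
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * d2)

def mmd2(x, y, gamma):
    # Biased (V-statistic) estimate of the squared MMD between two samples.
    return (rbf_gram(x, x, gamma).mean() + rbf_gram(y, y, gamma).mean()
            - 2 * rbf_gram(x, y, gamma).mean())

def normality_pvalue(x, n_boot=200, gamma=None, seed=0):
    """Parametric-bootstrap p-value for H0: the rows of x are Gaussian."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    mu, cov = x.mean(0), np.cov(x, rowvar=False)
    if gamma is None:                     # crude trace-based bandwidth heuristic
        gamma = 1.0 / (2.0 * np.trace(np.atleast_2d(cov)))
    def stat(sample):
        # Refit the Gaussian on the sample, compare it with a synthetic draw.
        m, c = sample.mean(0), np.cov(sample, rowvar=False)
        ref = rng.multivariate_normal(m, np.atleast_2d(c), size=n)
        return mmd2(sample, ref, gamma)
    t_obs = stat(x)
    t_null = [stat(rng.multivariate_normal(mu, np.atleast_2d(cov), size=n))
              for _ in range(n_boot)]
    return (1 + sum(t >= t_obs for t in t_null)) / (n_boot + 1)
```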
Theoretical analysis of cross-validation for estimating the risk of the k-Nearest Neighbor classifier
The present work aims at deriving theoretical guarantees on the behavior of
some cross-validation procedures applied to the $k$-nearest neighbors ($k$NN)
rule in the context of binary classification. Here we focus on the
leave-$p$-out cross-validation (L$p$O) used to assess the performance of the
$k$NN classifier. Remarkably, this L$p$O estimator can be efficiently computed
in this context using closed-form formulas derived by
\cite{CelisseMaryHuard11}. We describe a general strategy to derive moment and
exponential concentration inequalities for the L$p$O estimator applied to the
$k$NN classifier. Such results are obtained first by exploiting the connection
between the L$p$O estimator and U-statistics, and second by making intensive
use of the generalized Efron-Stein inequality applied to the L$p$O estimator.
Another important contribution is the derivation of new quantifications of the
discrepancy between the L$p$O estimator and the classification error/risk of
the $k$NN classifier. The optimality of these bounds is discussed by means of
several lower bounds as well as simulation experiments.
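The closed-form formulas of \cite{CelisseMaryHuard11} are not reproduced here; the brute-force sketch below merely illustrates the quantity being studied, the leave-one-out ($p=1$) estimate of the $k$NN misclassification risk, under the assumption of numeric features and 0/1 labels.

```python
import numpy as np

def loo_error_knn(X, y, k):
    """Brute-force leave-one-out estimate of the misclassification risk of
    the k-NN rule with 0/1 labels (the closed-form L_pO formulas avoid this
    O(n^2) computation)."""
    X, y = np.asarray(X, float), np.asarray(y, int)
    n = len(y)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)               # hold out the point itself
    errors = 0
    for i in range(n):
        neigh = np.argsort(d2[i])[:k]           # k nearest among the others
        vote = np.bincount(y[neigh], minlength=2).argmax()
        errors += int(vote != y[i])
    return errors / n
```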
New normality test in high dimension with kernel methods
A new goodness-of-fit test for normality in high-dimension (and Reproducing
Kernel Hilbert Space) is proposed. It shares common ideas with the Maximum Mean
Discrepancy (MMD), which it outperforms both in terms of computation time and
applicability to a wider range of data. Theoretical results are derived for the
Type-I and Type-II errors. They guarantee control of the Type-I error at the
prescribed level and an exponentially fast decrease of the Type-II error.
Synthetic and real data also illustrate the practical improvement allowed by
our test compared with other leading approaches in high-dimensional settings.
New efficient algorithms for multiple change-point detection with kernels
Several statistical approaches based on reproducing kernels have been
proposed to detect abrupt changes arising in the full distribution of the
observations and not only in the mean or variance. Some of these approaches
enjoy good statistical properties (oracle inequality, \ldots). Nonetheless,
they have a high computational cost both in terms of time and memory. This
makes their application difficult even for small and medium sample sizes. This
computational issue is addressed by first describing a new
efficient and exact algorithm for kernel multiple change-point detection with
an improved worst-case complexity that is quadratic in time and linear in
space. It allows dealing with medium-size signals.
Second, a faster but approximate algorithm is described. It is based on a
low-rank approximation to the Gram matrix. It is linear in time and space. This
approximation algorithm can be applied to large-scale signals.
These exact and approximation algorithms have been implemented in \texttt{R}
and \texttt{C} for various kernels. The computational and statistical
performances of these new algorithms have been assessed through empirical
experiments. The runtime of the new algorithms is observed to be faster than
that of other considered procedures. Finally, simulations confirmed the higher
statistical accuracy of kernel-based approaches to detect changes that are not
only in the mean. These simulations also illustrate the flexibility of
kernel-based approaches to analyze complex biological profiles made of DNA copy
number and allele B frequencies. An R package implementing the approach will be
made available on GitHub.
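The published implementation is in R and C; the Python sketch below only illustrates the underlying principle, an exact dynamic program over a kernel-based segment cost built from the Gram matrix. The Gaussian kernel, the naive O(D n^2) recursion and the prefix-sum bookkeeping are illustrative choices, not the paper's optimized linear-space algorithm.

```python
import numpy as np

def kernel_changepoints(X, n_segments, gamma=1.0):
    """Exact dynamic programming for kernel multiple change-point detection:
    a naive O(D * n^2) time, O(n^2) memory version. Returns the (exclusive)
    end indices of the estimated segments."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    n = X.shape[0]
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    K = np.exp(-gamma * d2)                      # Gaussian kernel Gram matrix
    S = np.zeros((n + 1, n + 1))
    S[1:, 1:] = K.cumsum(0).cumsum(1)            # 2-D prefix sums of K
    diag = np.concatenate(([0.0], np.cumsum(np.diag(K))))

    def cost(s, e):                              # kernel cost of segment [s, e)
        block = S[e, e] - S[s, e] - S[e, s] + S[s, s]
        return diag[e] - diag[s] - block / (e - s)

    C = np.full((n_segments + 1, n + 1), np.inf)
    back = np.zeros((n_segments + 1, n + 1), dtype=int)
    C[0, 0] = 0.0
    for d in range(1, n_segments + 1):
        for t in range(d, n + 1):
            C[d, t], back[d, t] = min(
                (C[d - 1, s] + cost(s, t), s) for s in range(d - 1, t))
    ends, t = [], n                              # backtrack the segment ends
    for d in range(n_segments, 0, -1):
        ends.append(t)
        t = back[d, t]
    return sorted(ends)

# Toy usage: one change in distribution around index 100.
rng = np.random.default_rng(0)
signal = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)])
print(kernel_changepoints(signal, n_segments=2))
```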
MPAgenomics : An R package for multi-patients analysis of genomic markers
MPAgenomics, standing for multi-patients analysis (MPA) of genomic markers,
is an R-package devoted to: (i) efficient segmentation, and (ii) genomic marker
selection from multi-patient copy number and SNP data profiles. It provides
wrappers around commonly used packages to facilitate their repeated (sometimes
difficult) use, offering an easy-to-use pipeline for beginners in R. The
segmentation of successive multiple profiles (finding losses and gains) is
based on a new automatic choice of influential parameters since default ones
were misleading in the original packages. Considering multiple profiles at the
same time, MPAgenomics wraps efficient penalized regression methods to select
relevant markers associated with a given response.
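MPAgenomics itself is an R package; the hypothetical Python snippet below only illustrates the generic "penalized regression for marker selection" step on simulated data, not the package's actual pipeline, wrappers or defaults.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Hypothetical inputs: one row per patient, one column per genomic marker
# (e.g. segmented copy-number values), plus one clinical response per patient.
rng = np.random.default_rng(1)
copy_numbers = rng.normal(2.0, 0.3, size=(60, 2000))   # 60 patients, 2000 markers
response = 1.5 * copy_numbers[:, 42] + rng.normal(size=60)

# L1-penalized regression: markers with a nonzero coefficient are "selected".
model = LassoCV(cv=5).fit(copy_numbers, response)
print("selected markers:", np.flatnonzero(model.coef_))
```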
Minimum discrepancy principle strategy for choosing $k$ in $k$-NN regression
We present a novel data-driven strategy to choose the hyperparameter $k$ in
the $k$-NN regression estimator. We treat the problem of choosing the
hyperparameter as an iterative procedure (over $k$) and propose an easily
implemented strategy based on the idea of early stopping and the
minimum discrepancy principle. This model selection strategy is proven to be
minimax-optimal, under the fixed-design assumption on covariates, over some
smoothness function classes, for instance, the Lipschitz functions class on a
bounded domain. The novel method shows consistent simulation results on
artificial and real-world data sets in comparison to other model selection
strategies, such as the Hold-out method and generalized cross-validation. The
novelty of the strategy comes from reducing the computational time of the model
selection procedure while preserving the statistical (minimax) optimality of
the resulting estimator. More precisely, given a sample of size $n$ and assuming
that the nearest neighbors are already precomputed, if one should choose $k$
among the candidate values, the strategy reduces the computational time of the
generalized cross-validation or Akaike's AIC criteria; the size of the gain
depends on the proposed (minimum discrepancy principle) number of nearest
neighbors.
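A minimal sketch of one possible reading of this rule is given below: increase $k$ and stop at the first value whose residual sum of squares reaches the noise level $n\sigma^2$. The stopping direction, the threshold and the assumption that $\sigma^2$ is known (or pre-estimated) are illustrative choices, not necessarily the paper's exact procedure.

```python
import numpy as np

def mdp_choice_of_k(X, y, sigma2):
    """Return the first k whose residual sum of squares reaches n * sigma2
    (one possible reading of the minimum discrepancy principle, with the
    noise level sigma2 assumed known or estimated beforehand)."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    n = len(y)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    order = np.argsort(d2, axis=1)               # precomputed nearest neighbors
    for k in range(1, n + 1):
        pred = y[order[:, :k]].mean(axis=1)      # k-NN fit at the design points
        if np.sum((y - pred) ** 2) >= n * sigma2:
            return k
    return n
```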
Analyzing the discrepancy principle for kernelized spectral filter learning algorithms
We investigate the construction of early stopping rules in the nonparametric
regression problem where iterative learning algorithms are used and the optimal
iteration number is unknown. More precisely, we study the discrepancy
principle, as well as modifications based on smoothed residuals, for kernelized
spectral filter learning algorithms including gradient descent. Our main
theoretical bounds are oracle inequalities established for the empirical
estimation error (fixed design), and for the prediction error (random design).
From these finite-sample bounds it follows that the classical discrepancy
principle is statistically adaptive for slow rates occurring in the hard
learning scenario, while the smoothed discrepancy principles are adaptive over
ranges of faster rates (resp. higher smoothness parameters). Our approach
relies on deviation inequalities for the stopping rules in the fixed design
setting, combined with change-of-norm arguments to deal with the random design
setting.
Comment: 68 pages, 4 figures.
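As a rough illustration (not the paper's estimators or their smoothed variants), the sketch below runs kernelized gradient descent and stops at the first iteration whose squared residual norm falls to the presumed noise level; the step-size choice, the threshold form and the known-$\sigma^2$ assumption are all illustrative.

```python
import numpy as np

def kernel_gd_discrepancy(K, y, sigma2, lr=None, max_iter=10_000):
    """Gradient descent on the kernel least-squares problem, stopped at the
    first iteration whose squared residual norm drops to the presumed noise
    level n * sigma2 (a plain discrepancy-principle rule, sigma2 assumed
    known); predictions at the design points are K @ alpha."""
    n = len(y)
    if lr is None:
        lr = 1.0 / np.linalg.eigvalsh(K)[-1]     # step size <= 1 / ||K||_op
    alpha = np.zeros(n)
    for t in range(1, max_iter + 1):
        residual = y - K @ alpha
        if np.sum(residual**2) <= n * sigma2:    # stop: residual at noise level
            break
        alpha = alpha + lr * residual            # Landweber / gradient update
    return alpha, t
```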
