Optimal cross-validation in density estimation with the $L^2$-loss
We analyze the performance of cross-validation (CV) in the density estimation
framework with two purposes: (i) risk estimation and (ii) model selection. The
main focus is given to the so-called leave-$p$-out CV procedure (Lpo), where $p$
denotes the cardinality of the test set. Closed-form expressions are
established for the Lpo estimator of the risk of projection estimators. These
expressions provide a great improvement over $V$-fold cross-validation in terms
of variability and computational complexity. From a theoretical point of view,
closed-form expressions also make it possible to study the Lpo performance in terms of
risk estimation. The optimality of leave-one-out (Loo), that is Lpo with $p=1$,
is proved among CV procedures used for risk estimation. Two model selection
frameworks are also considered: estimation, as opposed to identification. For
estimation with finite sample size $n$, optimality is achieved for $p$ large
enough [with $p/n = o(1)$] to balance the overfitting resulting from the
structure of the model collection. For identification, model selection
consistency is established for Lpo as long as $p/n$ is conveniently related to the
rate of convergence of the best estimator in the collection: (i) $p/n \to 1$ as
$n \to +\infty$ with a parametric rate, and (ii) with some
nonparametric estimators. These theoretical results are validated by simulation
experiments.
Comment: Published at http://dx.doi.org/10.1214/14-AOS1240 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org).
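To fix ideas, the sketch below (not taken from the paper) computes the classical closed-form leave-one-out ($p=1$) estimate of the $L^2$ risk for a regular histogram, the simplest projection estimator; the paper's general Lpo formulas and model collections are not reproduced, and the bin grid and Beta-distributed sample are illustrative assumptions.

```python
import numpy as np

def loo_l2_risk_histogram(x, n_bins, lo=0.0, hi=1.0):
    """Closed-form leave-one-out estimate of the L2 risk (up to the constant
    integral of f^2) for a regular histogram on [lo, hi]:
        LOO(h) = 2 / ((n-1) h) - (n+1) / ((n-1) h) * sum_j (N_j / n)^2,
    where h is the bin width and N_j are the bin counts."""
    x = np.asarray(x, dtype=float)
    n = x.size
    h = (hi - lo) / n_bins
    counts, _ = np.histogram(x, bins=n_bins, range=(lo, hi))
    p_hat = counts / n
    return 2.0 / ((n - 1) * h) - (n + 1) / ((n - 1) * h) * np.sum(p_hat ** 2)

# Toy usage: pick the number of bins minimizing the LOO risk estimate.
rng = np.random.default_rng(0)
sample = rng.beta(2.0, 5.0, size=500)
best = min(range(2, 60), key=lambda m: loo_l2_risk_histogram(sample, m))
print("selected number of bins:", best)
```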
A One-Sample Test for Normality with Kernel Methods
We propose a new one-sample test for normality in a Reproducing Kernel
Hilbert Space (RKHS). Namely, we test the null hypothesis that the distribution
belongs to a given family of Gaussian distributions. Hence our procedure may be applied
either to test data for normality or to test parameters (mean and covariance)
if data are assumed Gaussian. Our test is based on the same principle as the
MMD (Maximum Mean Discrepancy) which is usually used for two-sample tests such
as homogeneity or independence testing. Our method makes use of a special kind
of parametric bootstrap (typical of goodness-of-fit tests) which is
computationally more efficient than standard parametric bootstrap. Moreover, an
upper bound for the Type-II error highlights the dependence on influential
quantities. Experiments illustrate the practical improvement allowed by our
test in high-dimensional settings where common normality tests are known to
fail. We also consider an application to covariance rank selection through a
sequential procedure.
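As an illustration only, here is a minimal Python sketch of the parametric-bootstrap calibration idea, using a plug-in two-sample MMD between the data and a fresh sample drawn from the fitted Gaussian as a stand-in statistic; the statistic, the bandwidth heuristic and the bootstrap size are my assumptions and do not reproduce the authors' test or their more efficient bootstrap.

```python
import numpy as np

def rbf_gram(a, b, gamma):
    # Pairwise Gaussian RBF kernel matrix between rows of a and rows of b.
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * d2)

def mmd2(x, y, gamma):
    # Biased (V-statistic) estimate of the squared MMD between two samples.
    return (rbf_gram(x, x, gamma).mean() + rbf_gram(y, y, gamma).mean()
            - 2 * rbf_gram(x, y, gamma).mean())

def normality_pvalue(x, n_boot=200, gamma=None, seed=0):
    """Parametric-bootstrap p-value for H0: the rows of x are Gaussian."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    mu, cov = x.mean(0), np.cov(x, rowvar=False)
    if gamma is None:                     # crude trace-based bandwidth heuristic
        gamma = 1.0 / (2.0 * np.trace(np.atleast_2d(cov)))
    def stat(sample):
        # Refit the Gaussian on the sample, compare it with a synthetic draw.
        m, c = sample.mean(0), np.cov(sample, rowvar=False)
        ref = rng.multivariate_normal(m, np.atleast_2d(c), size=n)
        return mmd2(sample, ref, gamma)
    t_obs = stat(x)
    t_null = [stat(rng.multivariate_normal(mu, np.atleast_2d(cov), size=n))
              for _ in range(n_boot)]
    return (1 + sum(t >= t_obs for t in t_null)) / (n_boot + 1)
```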
Theoretical analysis of cross-validation for estimating the risk of the k-Nearest Neighbor classifier
The present work aims at deriving theoretical guarantees on the behavior of
some cross-validation procedures applied to the $k$-nearest neighbors ($k$NN)
rule in the context of binary classification. Here we focus on the
leave-$p$-out cross-validation (L$p$O) used to assess the performance of the
$k$NN classifier. Remarkably, this L$p$O estimator can be efficiently computed
in this context using closed-form formulas derived by
\cite{CelisseMaryHuard11}. We describe a general strategy to derive moment and
exponential concentration inequalities for the L$p$O estimator applied to the
$k$NN classifier. Such results are obtained first by exploiting the connection
between the L$p$O estimator and U-statistics, and second by making intensive
use of the generalized Efron-Stein inequality applied to the L$p$O estimator.
Another important contribution is the derivation of new quantifications of the
discrepancy between the L$p$O estimator and the classification error/risk of
the $k$NN classifier. The optimality of these bounds is discussed by means of
several lower bounds as well as simulation experiments.
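The closed-form formulas of \cite{CelisseMaryHuard11} are not reproduced here; the brute-force sketch below merely illustrates the quantity being studied, the leave-one-out ($p=1$) estimate of the $k$NN misclassification risk, under the assumption of numeric features and 0/1 labels.

```python
import numpy as np

def loo_error_knn(X, y, k):
    """Brute-force leave-one-out estimate of the misclassification risk of
    the k-NN rule with 0/1 labels (the closed-form L_pO formulas avoid this
    O(n^2) computation)."""
    X, y = np.asarray(X, float), np.asarray(y, int)
    n = len(y)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)               # hold out the point itself
    errors = 0
    for i in range(n):
        neigh = np.argsort(d2[i])[:k]           # k nearest among the others
        vote = np.bincount(y[neigh], minlength=2).argmax()
        errors += int(vote != y[i])
    return errors / n
```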
New normality test in high dimension with kernel methods
A new goodness-of-fit test for normality in high-dimension (and Reproducing
Kernel Hilbert Space) is proposed. It shares common ideas with the Maximum Mean
Discrepancy (MMD), which it outperforms both in terms of computation time and
applicability to a wider range of data. Theoretical results are derived for the
Type-I and Type-II errors. They guarantee control of the Type-I error at the
prescribed level and an exponentially fast decrease of the Type-II error.
Synthetic and real data also illustrate the practical improvement allowed by
our test compared with other leading approaches in high-dimensional settings.
New efficient algorithms for multiple change-point detection with kernels
Several statistical approaches based on reproducing kernels have been
proposed to detect abrupt changes arising in the full distribution of the
observations and not only in the mean or variance. Some of these approaches
enjoy good statistical properties (oracle inequality, \ldots). Nonetheless,
they have a high computational cost both in terms of time and memory. This
makes their application difficult even for small and medium sample sizes. This
computational issue is addressed by first describing a new
efficient and exact algorithm for kernel multiple change-point detection with
an improved worst-case complexity that is quadratic in time and linear in
space. It allows dealing with medium-size signals.
Second, a faster but approximate algorithm is described. It is based on a
low-rank approximation to the Gram matrix. It is linear in time and space. This
approximation algorithm can be applied to large-scale signals.
These exact and approximation algorithms have been implemented in \texttt{R}
and \texttt{C} for various kernels. The computational and statistical
performances of these new algorithms have been assessed through empirical
experiments. The runtime of the new algorithms is observed to be faster than
that of other considered procedures. Finally, simulations confirmed the higher
statistical accuracy of kernel-based approaches to detect changes that are not
only in the mean. These simulations also illustrate the flexibility of
kernel-based approaches to analyze complex biological profiles made of DNA copy
number and allele B frequencies. An R package implementing the approach will be
made available on GitHub.
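The published implementation is in R and C; the Python sketch below only illustrates the underlying principle, an exact dynamic program over a kernel-based segment cost built from the Gram matrix. The Gaussian kernel, the naive O(D n^2) recursion and the prefix-sum bookkeeping are illustrative choices, not the paper's optimized linear-space algorithm.

```python
import numpy as np

def kernel_changepoints(X, n_segments, gamma=1.0):
    """Exact dynamic programming for kernel multiple change-point detection:
    a naive O(D * n^2) time, O(n^2) memory version. Returns the (exclusive)
    end indices of the estimated segments."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    n = X.shape[0]
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    K = np.exp(-gamma * d2)                      # Gaussian kernel Gram matrix
    S = np.zeros((n + 1, n + 1))
    S[1:, 1:] = K.cumsum(0).cumsum(1)            # 2-D prefix sums of K
    diag = np.concatenate(([0.0], np.cumsum(np.diag(K))))

    def cost(s, e):                              # kernel cost of segment [s, e)
        block = S[e, e] - S[s, e] - S[e, s] + S[s, s]
        return diag[e] - diag[s] - block / (e - s)

    C = np.full((n_segments + 1, n + 1), np.inf)
    back = np.zeros((n_segments + 1, n + 1), dtype=int)
    C[0, 0] = 0.0
    for d in range(1, n_segments + 1):
        for t in range(d, n + 1):
            C[d, t], back[d, t] = min(
                (C[d - 1, s] + cost(s, t), s) for s in range(d - 1, t))
    ends, t = [], n                              # backtrack the segment ends
    for d in range(n_segments, 0, -1):
        ends.append(t)
        t = back[d, t]
    return sorted(ends)

# Toy usage: one change in distribution around index 100.
rng = np.random.default_rng(0)
signal = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)])
print(kernel_changepoints(signal, n_segments=2))
```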
MPAgenomics : An R package for multi-patients analysis of genomic markers
MPAgenomics, standing for multi-patients analysis (MPA) of genomic markers,
is an R-package devoted to: (i) efficient segmentation, and (ii) genomic marker
selection from multi-patient copy number and SNP data profiles. It provides
wrappers around commonly used packages to facilitate their repeated (sometimes
difficult) use, offering an easy-to-use pipeline for beginners in R. The
segmentation of successive multiple profiles (finding losses and gains) is
based on a new automatic choice of influential parameters since default ones
were misleading in the original packages. Considering multiple profiles at the
same time, MPAgenomics wraps efficient penalized regression methods to select
relevant markers associated with a given response.
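MPAgenomics itself is an R package; the hypothetical Python snippet below only illustrates the generic "penalized regression for marker selection" step on simulated data, not the package's actual pipeline, wrappers or defaults.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Hypothetical inputs: one row per patient, one column per genomic marker
# (e.g. segmented copy-number values), plus one clinical response per patient.
rng = np.random.default_rng(1)
copy_numbers = rng.normal(2.0, 0.3, size=(60, 2000))   # 60 patients, 2000 markers
response = 1.5 * copy_numbers[:, 42] + rng.normal(size=60)

# L1-penalized regression: markers with a nonzero coefficient are "selected".
model = LassoCV(cv=5).fit(copy_numbers, response)
print("selected markers:", np.flatnonzero(model.coef_))
```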
Minimum discrepancy principle strategy for choosing $k$ in $k$-NN regression
We present a novel data-driven strategy to choose the hyperparameter $k$ in
the $k$-NN regression estimator. We treat the problem of choosing the
hyperparameter as an iterative procedure (over $k$) and propose an easily
implemented strategy based on the idea of early stopping and the
minimum discrepancy principle. This model selection strategy is proven to be
minimax-optimal, under the fixed-design assumption on covariates, over some
smoothness function classes, for instance, the Lipschitz functions class on a
bounded domain. The novel method shows consistent simulation results on
artificial and real-world data sets in comparison to other model selection
strategies, such as the Hold-out method and generalized cross-validation. The
novelty of the strategy comes from reducing the computational time of the model
selection procedure while preserving the statistical (minimax) optimality of
the resulting estimator. More precisely, given a sample of size $n$ and assuming
that the nearest neighbors are already precomputed, if one should choose $k$
among the candidate values, the strategy reduces the computational time of the
generalized cross-validation or Akaike's AIC criteria; the size of the gain
depends on the proposed (minimum discrepancy principle) number of nearest
neighbors.
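A minimal sketch of one possible reading of this rule is given below: increase $k$ and stop at the first value whose residual sum of squares reaches the noise level $n\sigma^2$. The stopping direction, the threshold and the assumption that $\sigma^2$ is known (or pre-estimated) are illustrative choices, not necessarily the paper's exact procedure.

```python
import numpy as np

def mdp_choice_of_k(X, y, sigma2):
    """Return the first k whose residual sum of squares reaches n * sigma2
    (one possible reading of the minimum discrepancy principle, with the
    noise level sigma2 assumed known or estimated beforehand)."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    n = len(y)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    order = np.argsort(d2, axis=1)               # precomputed nearest neighbors
    for k in range(1, n + 1):
        pred = y[order[:, :k]].mean(axis=1)      # k-NN fit at the design points
        if np.sum((y - pred) ** 2) >= n * sigma2:
            return k
    return n
```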
Analyzing the discrepancy principle for kernelized spectral filter learning algorithms
We investigate the construction of early stopping rules in the nonparametric
regression problem where iterative learning algorithms are used and the optimal
iteration number is unknown. More precisely, we study the discrepancy
principle, as well as modifications based on smoothed residuals, for kernelized
spectral filter learning algorithms including gradient descent. Our main
theoretical bounds are oracle inequalities established for the empirical
estimation error (fixed design), and for the prediction error (random design).
From these finite-sample bounds it follows that the classical discrepancy
principle is statistically adaptive for slow rates occurring in the hard
learning scenario, while the smoothed discrepancy principles are adaptive over
ranges of faster rates (resp. higher smoothness parameters). Our approach
relies on deviation inequalities for the stopping rules in the fixed design
setting, combined with change-of-norm arguments to deal with the random design
setting.
Comment: 68 pages, 4 figures.
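As a rough illustration (not the paper's estimators or their smoothed variants), the sketch below runs kernelized gradient descent and stops at the first iteration whose squared residual norm falls to the presumed noise level; the step-size choice, the threshold form and the known-$\sigma^2$ assumption are all illustrative.

```python
import numpy as np

def kernel_gd_discrepancy(K, y, sigma2, lr=None, max_iter=10_000):
    """Gradient descent on the kernel least-squares problem, stopped at the
    first iteration whose squared residual norm drops to the presumed noise
    level n * sigma2 (a plain discrepancy-principle rule, sigma2 assumed
    known); predictions at the design points are K @ alpha."""
    n = len(y)
    if lr is None:
        lr = 1.0 / np.linalg.eigvalsh(K)[-1]     # step size <= 1 / ||K||_op
    alpha = np.zeros(n)
    for t in range(1, max_iter + 1):
        residual = y - K @ alpha
        if np.sum(residual**2) <= n * sigma2:    # stop: residual at noise level
            break
        alpha = alpha + lr * residual            # Landweber / gradient update
    return alpha, t
```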
