Nonparametric Linear Feature Learning in Regression Through Regularisation
Representation learning plays a crucial role in automated feature selection,
particularly in the context of high-dimensional data, where non-parametric
methods often struggle. In this study, we focus on supervised learning
scenarios where the pertinent information resides within a lower-dimensional
linear subspace of the data, namely the multi-index model. If this subspace
were known, it would greatly enhance prediction, computation, and
interpretation. To address this challenge, we propose a novel method for linear
feature learning with non-parametric prediction, which simultaneously estimates
the prediction function and the linear subspace. Our approach employs empirical
risk minimisation, augmented with a penalty on function derivatives, ensuring
versatility. Leveraging the orthogonality and rotation invariance properties of
Hermite polynomials, we introduce our estimator, named RegFeaL. By utilising
alternating minimisation, we iteratively rotate the data to improve alignment
with leading directions and accurately estimate the relevant dimension in
practical settings. We establish that our method yields a consistent estimator
of the prediction function with explicit rates. Additionally, we provide
empirical results demonstrating the performance of RegFeaL in various
experiments.
Comment: 42 pages, 5 figures
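To make the alternating rotate-and-fit idea concrete, here is a small, hypothetical Python sketch of linear feature learning for a multi-index model. It is not the RegFeaL estimator (which relies on Hermite-polynomial expansions and a derivative penalty); it instead combines kernel ridge regression with the average gradient outer product to estimate the leading subspace, and all hyperparameters are arbitrary choices for illustration.

    import numpy as np

    def rbf_kernel(A, B, sigma):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))

    def fit_krr(Z, y, sigma, lam):
        # kernel ridge regression: solve (K + lam*n*I) alpha = y
        K = rbf_kernel(Z, Z, sigma)
        return np.linalg.solve(K + lam * len(y) * np.eye(len(y)), y)

    def gradients(Z, alpha, sigma):
        # analytic gradient of the fitted predictor f(z) = sum_j alpha_j k(z, z_j) at each training point
        K = rbf_kernel(Z, Z, sigma)
        diff = Z[:, None, :] - Z[None, :, :]
        return -(alpha[None, :, None] * K[:, :, None] * diff).sum(1) / sigma ** 2

    def alternating_feature_learning(X, y, k, n_iter=5, sigma=2.0, lam=1e-3):
        n, d = X.shape
        R = np.eye(d)                                    # current orthonormal basis (columns = candidate directions)
        s = np.ones(d)                                   # per-direction relevance weights
        for _ in range(n_iter):
            Z = (X @ R) * s                              # rotate and re-weight the data
            alpha = fit_krr(Z, y, sigma, lam)            # fit the prediction function
            Gx = (gradients(Z, alpha, sigma) * s) @ R.T  # chain rule: gradients in the original coordinates
            eigvals, eigvecs = np.linalg.eigh(Gx.T @ Gx / n)   # average gradient outer product
            order = np.argsort(eigvals)[::-1]
            R = eigvecs[:, order]                        # re-align with the leading directions
            s = np.sqrt(np.maximum(eigvals[order], 0.0) / (eigvals[order][0] + 1e-12)).clip(1e-3)
        return R[:, :k]

    # toy multi-index data: y depends on a single direction of 5-dimensional inputs
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    w = np.array([1.0, -1.0, 0.0, 0.0, 0.0]) / np.sqrt(2)
    y = np.sin(2 * X @ w) + 0.1 * rng.normal(size=200)
    P = alternating_feature_learning(X, y, k=1)
    print("alignment with the true direction:", abs(P[:, 0] @ w))

On this toy data the recovered direction should be reasonably aligned with the true index vector, though this simplified scheme comes with no guarantees.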
Approximate Heavy Tails in Offline (Multi-Pass) Stochastic Gradient Descent
A recent line of empirical studies has demonstrated that SGD might exhibit
heavy-tailed behavior in practical settings, and the heaviness of the tails
might correlate with the overall performance. In this paper, we investigate the
emergence of such heavy tails. To our knowledge, previous works on this problem
only considered online (also called single-pass) SGD, where the theoretical
emergence of heavy tails is contingent on access to an infinite amount of data.
Hence, the underlying mechanism generating the
reported heavy-tailed behavior in practical settings, where the amount of
training data is finite, is still not well-understood. Our contribution aims to
fill this gap. In particular, we show that the stationary distribution of
offline (also called multi-pass) SGD exhibits 'approximate' power-law tails and
the approximation error is controlled by how fast the empirical distribution of
the training data converges to the true underlying data distribution in the
Wasserstein metric. Our main takeaway is that, as the number of data points
increases, offline SGD will behave increasingly 'power-law-like'. To achieve
this result, we first prove nonasymptotic Wasserstein convergence bounds for
offline SGD to online SGD as the number of data points increases, which may be
of independent interest. Finally, we illustrate our theory with experiments on
synthetic data and neural networks.
Comment: In Neural Information Processing Systems (NeurIPS), Spotlight Presentation, 202
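As a rough, hypothetical illustration of this setting, the Python sketch below runs multi-pass (offline) SGD for linear regression on a fixed finite dataset and estimates the tail index of the stationary iterate norms with a simple Hill estimator; the model, sample size, and step sizes are arbitrary choices rather than the paper's experimental setup.

    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 100, 5                               # finite training set: offline / multi-pass SGD
    A = rng.normal(size=(n, d))
    y = A @ rng.normal(size=d) + rng.normal(size=n)

    def offline_sgd(eta, n_iter=100_000, burn_in=20_000):
        x, norms = np.zeros(d), []
        for k in range(n_iter):
            i = rng.integers(n)                 # resample from the SAME finite dataset
            grad = (A[i] @ x - y[i]) * A[i]     # stochastic gradient of 0.5*(a_i^T x - y_i)^2
            x = x - eta * grad
            if k >= burn_in:
                norms.append(np.linalg.norm(x))
        return np.array(norms)

    def hill_tail_index(samples, k_frac=0.01):
        # Hill estimator computed on the largest k order statistics
        s = np.sort(samples)[::-1]
        k = max(int(k_frac * len(s)), 10)
        return 1.0 / np.mean(np.log(s[:k] / s[k]))

    for eta in (0.1, 0.2, 0.3):                 # larger steps typically give heavier tails
        print(f"step size {eta}: estimated tail index ~ {hill_tail_index(offline_sgd(eta)):.2f}")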
Uniform-in-Time Wasserstein Stability Bounds for (Noisy) Stochastic Gradient Descent
Algorithmic stability is an important notion that has proven powerful for
deriving generalization bounds for practical algorithms. The last decade has
witnessed an increasing number of stability bounds for different algorithms
applied to different classes of loss functions. While these bounds have
illuminated various properties of optimization algorithms, the analysis of each
case typically required a different proof technique with significantly
different mathematical tools. In this study, we make a novel connection between
learning theory and applied probability and introduce a unified guideline for
proving Wasserstein stability bounds for stochastic optimization algorithms. We
illustrate our approach on stochastic gradient descent (SGD) and we obtain
time-uniform stability bounds (i.e., the bound does not increase with the
number of iterations) for strongly convex losses and non-convex losses with
additive noise, where we recover results similar to the prior art or extend
them to more general cases by using a single proof technique. Our approach is
flexible and can be generalized to other popular optimizers, as it mainly
requires developing Lyapunov functions, which are often readily available in
the literature. It also illustrates that ergodicity is an important component
for obtaining time-uniform bounds -- which might not be achieved for convex or
non-convex losses unless additional noise is injected into the iterates. Finally,
we slightly stretch our analysis technique and prove time-uniform bounds for
SGD under convex and non-convex losses (without additional additive noise),
which, to our knowledge, is novel.
Comment: 49 pages, NeurIPS 202
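The hypothetical Python sketch below probes the quantity that such stability bounds control: two noisy SGD runs on neighbouring datasets (identical up to one sample) are coupled by sharing minibatch indices and injected noise, and the distance between their iterates is tracked over time. Under this synchronous coupling, the expected distance upper-bounds the Wasserstein distance between the two iterate distributions, and a time-uniform bound corresponds to the printed curve plateauing rather than growing. The quadratic objective, step size, and noise level are placeholder choices.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 200, 10
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
    X2, y2 = X.copy(), y.copy()
    X2[0], y2[0] = rng.normal(size=d), rng.normal()     # neighbouring dataset: one sample replaced

    def grad(w, Xb, yb, lam=0.1):
        # gradient of the strongly convex objective 0.5*mean((x^T w - y)^2) + 0.5*lam*||w||^2
        return Xb.T @ (Xb @ w - yb) / len(yb) + lam * w

    eta, sigma, T, batch = 0.05, 0.01, 20_000, 8
    w1, w2 = np.zeros(d), np.zeros(d)
    for t in range(1, T + 1):
        idx = rng.integers(n, size=batch)               # shared minibatch indices (the coupling)
        noise = sigma * rng.normal(size=d)              # shared injected Gaussian noise (the coupling)
        w1 = w1 - eta * grad(w1, X[idx], y[idx]) + noise
        w2 = w2 - eta * grad(w2, X2[idx], y2[idx]) + noise
        if t in (100, 1_000, 10_000, 20_000):
            print(f"iteration {t:>6}: ||w - w'|| = {np.linalg.norm(w1 - w2):.4f}")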
Learning via Wasserstein-Based High Probability Generalisation Bounds
Minimising upper bounds on the population risk or the generalisation gap has been widely used
in structural risk minimisation (SRM) -- this is in particular at the core of PAC-Bayesian
learning. Despite its successes and the sustained surge of interest in recent years, a
limitation of the PAC-Bayesian framework is that most bounds involve a Kullback-Leibler (KL)
divergence term (or its variations), which might exhibit erratic behaviour and fail to capture
the underlying geometric structure of the learning problem -- hence restricting its use in
practical applications. As a remedy, recent studies have attempted to replace the KL divergence
in the PAC-Bayesian bounds with the Wasserstein distance. Even though these bounds alleviated
the aforementioned issues to a certain extent, they either hold in expectation, are restricted
to bounded losses, or are nontrivial to minimise in an SRM framework. In this work, we
contribute to this line of research and prove novel Wasserstein distance-based PAC-Bayesian
generalisation bounds for both batch learning with independent and identically distributed
(i.i.d.) data, and online learning with potentially non-i.i.d. data. Contrary to previous art,
our bounds are stronger in the sense that (i) they hold with high probability, (ii) they apply
to unbounded (potentially heavy-tailed) losses, and (iii) they lead to optimisable training
objectives that can be used in SRM. As a result, we derive novel Wasserstein-based PAC-Bayesian
learning algorithms, and we illustrate their empirical advantage on a variety of experiments.
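As a loose, hypothetical illustration of a Wasserstein-regularised SRM objective (not the paper's bound or algorithm), the Python sketch below trains a Gaussian posterior over linear-regression weights by minimising the closed-form expected empirical risk plus a multiple of the squared 2-Wasserstein distance to a Gaussian prior; for diagonal Gaussians this distance has the simple closed form used in the code, and the prior, loss, and trade-off weight are placeholders.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 300, 20
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    m0, s0, lam = np.zeros(d), np.ones(d), 0.05      # Gaussian prior N(m0, diag(s0^2)) and trade-off weight
    m, s = np.zeros(d), np.ones(d)                   # posterior mean and standard deviations

    def objective(m, s):
        # expected empirical squared loss under N(m, diag(s^2)), in closed form,
        # plus lam * W2^2(posterior, prior); for diagonal Gaussians W2^2 = ||m - m0||^2 + ||s - s0||^2
        exp_risk = 0.5 * np.mean((X @ m - y) ** 2) + 0.5 * np.mean(X ** 2, axis=0) @ (s ** 2)
        return exp_risk + lam * (np.sum((m - m0) ** 2) + np.sum((s - s0) ** 2))

    eta = 0.05
    for t in range(2001):
        # closed-form gradients of the objective above
        grad_m = X.T @ (X @ m - y) / n + 2 * lam * (m - m0)
        grad_s = np.mean(X ** 2, axis=0) * s + 2 * lam * (s - s0)
        m, s = m - eta * grad_m, np.maximum(s - eta * grad_s, 1e-3)   # keep the std deviations positive
        if t % 500 == 0:
            print(f"iteration {t:>4}: objective = {objective(m, s):.4f}")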
Efficient Bayesian Model Selection in PARAFAC via Stochastic Thermodynamic Integration
Parallel factor analysis (PARAFAC) is one of the most popular tensor factorization models. Even
though it has proven successful in diverse application fields, the performance of PARAFAC
usually hinges on the rank of the factorization, which is typically specified manually by the
practitioner. In this study, we develop a novel parallel and distributed Bayesian model
selection technique for rank estimation in large-scale PARAFAC models. The proposed approach
integrates ideas from the emerging field of stochastic gradient Markov chain Monte Carlo,
statistical physics, and distributed stochastic optimization. As opposed to existing methods,
which are based on heuristics, our method has a clear mathematical interpretation and has
significantly lower computational requirements, thanks to data subsampling and parallelization.
We provide a formal theoretical analysis of the bias induced by the proposed approach. Our
experiments on synthetic and large-scale real datasets show that our method is able to find the
optimal model order while being significantly faster than the state of the art.
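For readers unfamiliar with thermodynamic integration, the hypothetical Python sketch below illustrates the underlying identity log p(D) = \int_0^1 E_{p_beta}[log p(D | theta)] d beta, with p_beta proportional to p(theta) p(D | theta)^beta, on a toy conjugate Gaussian model where the exact log marginal likelihood is available for comparison. It uses plain unadjusted Langevin dynamics with a hand-picked temperature schedule rather than the paper's parallel and distributed stochastic-gradient MCMC scheme, and it is not a PARAFAC rank-selection procedure; the two printed numbers should agree up to Monte Carlo and discretisation error.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 50
    data = rng.normal(loc=1.0, scale=1.0, size=n)        # model: theta ~ N(0,1), x_i | theta ~ N(theta, 1)

    def log_lik(theta):
        return -0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum((data - theta) ** 2)

    def mean_loglik_at(beta, n_steps=10_000, burn_in=2_000):
        # unadjusted Langevin dynamics targeting p_beta(theta) ∝ p(theta) * p(D|theta)^beta,
        # with the step size adapted to the tempered posterior's precision (1 + beta*n)
        h = 0.2 / (1.0 + beta * n)
        theta, samples = 0.0, []
        for k in range(n_steps):
            grad = -theta + beta * np.sum(data - theta)  # d/dtheta [log p(theta) + beta*log p(D|theta)]
            theta = theta + 0.5 * h * grad + np.sqrt(h) * rng.normal()
            if k >= burn_in:
                samples.append(log_lik(theta))
        return np.mean(samples)

    betas = np.linspace(0.0, 1.0, 21) ** 3               # finer spacing near beta = 0, where the integrand is steep
    values = np.array([mean_loglik_at(b) for b in betas])
    ti_estimate = np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(betas))   # trapezoidal rule

    # exact log marginal likelihood of this conjugate model, for comparison
    s, ss = data.sum(), np.sum(data ** 2)
    exact = -0.5 * n * np.log(2 * np.pi) - 0.5 * np.log(1 + n) - 0.5 * (ss - s ** 2 / (1 + n))
    print(f"thermodynamic-integration estimate: {ti_estimate:.1f}")
    print(f"exact log marginal likelihood:      {exact:.1f}")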
Uniform Generalization Bounds on Data-Dependent Hypothesis Sets via PAC-Bayesian Theory on Random Sets
We propose data-dependent uniform generalization bounds by approaching the
problem from a PAC-Bayesian perspective. We first apply the PAC-Bayesian
framework on 'random sets' in a rigorous way, where the training algorithm is
assumed to output a data-dependent hypothesis set after observing the training
data. This approach allows us to prove data-dependent bounds, which are
applicable in numerous contexts. To highlight the power of our approach, we
consider two main applications. First, we propose a PAC-Bayesian formulation of
the recently developed fractal-dimension-based generalization bounds. The
derived results are shown to be tighter, and they unify the existing results
around one simple proof technique. Second, we prove uniform bounds over the
trajectories of continuous Langevin dynamics and stochastic gradient Langevin
dynamics. These results provide novel information about the generalization
properties of noisy algorithms.
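The hypothetical Python sketch below illustrates the object that such trajectory bounds control: it runs stochastic gradient Langevin dynamics (SGLD), treats the set of all visited iterates as a data-dependent hypothesis set, and evaluates the worst-case gap between training and held-out loss over that entire set. The ridge-regression loss, step size, and temperature are placeholder choices.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 200, 10
    w_true = rng.normal(size=d)
    X_tr, X_te = rng.normal(size=(n, d)), rng.normal(size=(5 * n, d))
    y_tr = X_tr @ w_true + 0.5 * rng.normal(size=n)
    y_te = X_te @ w_true + 0.5 * rng.normal(size=5 * n)

    def loss(w, X, y, lam=0.1):
        return 0.5 * np.mean((X @ w - y) ** 2) + 0.5 * lam * np.sum(w ** 2)

    def sgld_trajectory(eta=1e-3, temp=1e-2, T=5_000, batch=16, lam=0.1):
        w, traj = np.zeros(d), []
        for _ in range(T):
            idx = rng.integers(n, size=batch)
            g = X_tr[idx].T @ (X_tr[idx] @ w - y_tr[idx]) / batch + lam * w
            w = w - eta * g + np.sqrt(2 * eta * temp) * rng.normal(size=d)   # SGLD step
            traj.append(w.copy())
        return traj

    traj = sgld_trajectory()                     # the data-dependent hypothesis set: all visited iterates
    gaps = [abs(loss(w, X_tr, y_tr) - loss(w, X_te, y_te)) for w in traj]
    print(f"worst-case gap over the whole trajectory: {max(gaps):.4f}")
    print(f"gap at the final iterate only:            {gaps[-1]:.4f}")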
SGD with Clipping is Secretly Estimating the Median Gradient
There are several applications of stochastic optimization that can benefit from a robust
estimate of the gradient: for example, distributed learning with corrupted nodes, large
outliers in the training data, learning under privacy constraints, or heavy-tailed noise
arising from the dynamics of the algorithm itself. Here we study SGD with
robust gradient estimators based on estimating the median. We first consider
computing the median gradient across samples, and show that the resulting
method can converge even under heavy-tailed, state-dependent noise. We then
derive iterative methods based on the stochastic proximal point method for
computing the geometric median and generalizations thereof. Finally, we propose
an algorithm estimating the median gradient across iterations, and find that
several well-known methods, in particular different forms of clipping, are
particular cases of this framework.
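Two of the ingredients above are easy to sketch in Python: the geometric median of a batch of per-sample gradients via Weiszfeld's fixed-point iteration, and a running estimate of the median gradient across iterations whose update is exactly a clipping step (which can be read as a stochastic proximal-point step on E||m - g||). The toy gradients, threshold, and dimensions below are placeholders rather than the paper's algorithms or experiments.

    import numpy as np

    def geometric_median(G, n_iter=100, eps=1e-8):
        # Weiszfeld fixed-point iteration for the geometric median of the rows of G
        m = G.mean(axis=0)
        for _ in range(n_iter):
            w = 1.0 / np.maximum(np.linalg.norm(G - m, axis=1), eps)
            m = (w[:, None] * G).sum(axis=0) / w.sum()
        return m

    def clip(v, tau):
        norm = np.linalg.norm(v)
        return v if norm <= tau else v * (tau / norm)

    # toy per-sample gradients: mostly concentrated, with a few heavy outliers
    rng = np.random.default_rng(0)
    G = rng.normal(loc=1.0, scale=0.1, size=(100, 5))
    G[:5] += 50.0 * rng.normal(size=(5, 5))                  # corrupted / heavy-tailed samples

    print("mean gradient:            ", np.round(G.mean(axis=0), 2))
    print("geometric median gradient:", np.round(geometric_median(G), 2))

    # running estimate across iterations: m <- m + clip(g - m, tau) is a stochastic
    # proximal-point step on E||m - g||, i.e. it tracks a median of the gradient
    # stream rather than its mean
    m, tau = np.zeros(5), 0.5
    for g in G:
        m = m + clip(g - m, tau)
    print("clipped running estimate:  ", np.round(m, 2))

On this toy stream, the mean gradient is typically dragged away by the corrupted samples, while both the geometric median and the clipped running estimate stay close to the bulk of the gradients.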
