Stochastic Gradient Descent with Exponential Convergence Rates of Expected Classification Errors
We consider stochastic gradient descent and its averaging variant for binary
classification problems in a reproducing kernel Hilbert space. In the
traditional analysis using a consistency property of loss functions, it is
known that the expected classification error converges more slowly than the
expected risk even when assuming a low-noise condition on the conditional label
probabilities. Consequently, the resulting rate is sublinear. Therefore, it is
important to consider whether much faster convergence of the expected
classification error can be achieved. In recent research, an exponential
convergence rate for stochastic gradient descent was shown under a strong low-noise condition, but the theoretical analysis provided there was limited to the squared loss function, which is somewhat inadequate for binary classification
tasks. In this paper, we show an exponential convergence of the expected
classification error in the final phase of the stochastic gradient descent for
a wide class of differentiable convex loss functions under similar assumptions.
As for the averaged stochastic gradient descent, we show that the same
convergence rate holds from the early phase of training. In experiments, we
verify our analyses on L2-regularized logistic regression.
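For context, here is a minimal sketch (not the paper's algorithm or analysis) of plain SGD versus its Polyak-Ruppert averaged variant for L2-regularized logistic regression, with a simple linear model standing in for the paper's RKHS setting; the synthetic data, decaying step size, and regularization strength are illustrative assumptions.

```python
import numpy as np

def averaged_sgd_logistic(X, y, steps=10_000, lr=0.1, lam=1e-3, seed=0):
    """SGD and its averaged variant for L2-regularized logistic loss.
    X: (n, d) features; y: (n,) labels in {-1, +1}.
    Returns the last iterate and the running average of all iterates."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)        # current SGD iterate
    w_avg = np.zeros(d)    # Polyak-Ruppert average of the iterates
    for t in range(1, steps + 1):
        i = rng.integers(n)                          # sample one example
        margin = y[i] * (X[i] @ w)
        # gradient of log(1 + exp(-margin)) plus the L2 penalty
        grad = -y[i] * X[i] / (1.0 + np.exp(margin)) + lam * w
        w -= lr / np.sqrt(t) * grad                  # decaying step size
        w_avg += (w - w_avg) / t                     # running average of iterates
    return w, w_avg

# Toy usage on synthetic, roughly separable data (placeholder setup).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = np.sign(X @ rng.normal(size=5) + 0.1 * rng.normal(size=500))
w_last, w_bar = averaged_sgd_logistic(X, y)
print("error (last iterate):    ", np.mean(np.sign(X @ w_last) != y))
print("error (averaged iterate):", np.mean(np.sign(X @ w_bar) != y))
```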
Why is parameter averaging beneficial in SGD? An objective smoothing perspective
It is often observed that stochastic gradient descent (SGD) and its variants
implicitly select a solution with good generalization performance; such
implicit bias is often characterized in terms of the sharpness of the minima.
Kleinberg et al. (2018) connected this bias with the smoothing effect of SGD
which eliminates sharp local minima through convolution with the stochastic gradient noise. We follow this line of research and study the commonly used
averaged SGD algorithm, which has been empirically observed in Izmailov et al. (2018) to prefer a flat minimum and therefore achieve better generalization.
We prove that in certain problem settings, averaged SGD can efficiently
optimize the smoothed objective which avoids sharp local minima. In
experiments, we verify our theory and show that parameter averaging with an
appropriate step size indeed leads to significant improvement in the
performance of SGD.
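As a toy numerical illustration of the smoothing intuition above (our own construction, not the paper's experiments), the snippet below estimates the Gaussian-smoothed version of a one-dimensional objective whose global minimum is a sharp, narrow well: after smoothing, the minimizer moves to the wide, flat basin.

```python
import numpy as np

def f(x):
    """Toy objective: a sharp, narrow global minimum near x = 0
    and a wide, flat local minimum near x = 3."""
    return -np.exp(-200.0 * x**2) + 0.05 * (x - 3.0) ** 2

xs = np.linspace(-3.0, 9.0, 4001)
sigma = 0.5                                   # scale of the smoothing noise
noise = np.random.default_rng(0).normal(scale=sigma, size=2000)
f_smooth = np.array([np.mean(f(x + noise)) for x in xs])   # Monte Carlo E[f(x + eps)]

print("raw objective argmin:     ", xs[np.argmin(f(xs))])      # near 0 (sharp well)
print("smoothed objective argmin:", xs[np.argmin(f_smooth)])   # near 3 (flat basin)
```

The connection drawn in the abstract is that averaged SGD behaves roughly like an optimizer of such a noise-smoothed objective, which is why it tends to end up in flatter minima.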
Hyperbolic Ordinal Embedding
Given ordinal relations such as "object i is more similar to j than k is to l," ordinal embedding aims to embed these objects into a low-dimensional space with all ordinal constraints
preserved. Although existing approaches have preserved ordinal relations in Euclidean
space, whether Euclidean space is compatible with the true data structure is largely ignored, even though it is essential to effective embedding. Since real data often exhibit hierarchical
structure, it is hard for Euclidean approaches to achieve effective embeddings in low dimensionality; the higher dimensionality they require incurs high computational complexity or overfitting. In this paper we
propose a novel hyperbolic ordinal embedding (HOE) method to embed objects in hyperbolic space. Due to the hierarchy-friendly property of hyperbolic space, HOE can effectively
capture the hierarchy to achieve embeddings in an extremely low-dimensional space. We
have not only theoretically proved the superiority of hyperbolic space and the limitations
of Euclidean space for embedding hierarchical data, but also experimentally demonstrated
that HOE significantly outperforms Euclidean-based methods.
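To make the setup concrete, here is a small, self-contained sketch of the general idea of ordinal embedding in the Poincare ball model of hyperbolic space: minimize a hinge loss over ordinal quadruples (i, j, k, l) using the hyperbolic distance. This is not the paper's HOE algorithm; for readability it uses finite-difference gradients and a crude projection back into the ball instead of Riemannian optimization, and the constants and toy constraints are placeholders.

```python
import numpy as np

def poincare_dist(u, v, eps=1e-7):
    """Distance between u and v in the Poincare ball model of hyperbolic space."""
    uu = max(1.0 - u @ u, eps)
    vv = max(1.0 - v @ v, eps)
    diff = u - v
    return np.arccosh(1.0 + 2.0 * (diff @ diff) / (uu * vv))

def quadruple_loss(Z, quads, margin=0.1):
    """Hinge loss asking d(i, j) + margin <= d(k, l) for each quadruple
    (i, j, k, l), i.e. object i is more similar to j than k is to l."""
    return sum(max(0.0, margin + poincare_dist(Z[i], Z[j]) - poincare_dist(Z[k], Z[l]))
               for i, j, k, l in quads)

def hyperbolic_ordinal_embed(n, quads, dim=2, steps=500, lr=0.01, seed=0, h=1e-5):
    """Toy projected gradient descent with finite-difference gradients."""
    rng = np.random.default_rng(seed)
    Z = 0.01 * rng.normal(size=(n, dim))      # start near the origin of the ball
    for _ in range(steps):
        base = quadruple_loss(Z, quads)
        grad = np.zeros_like(Z)
        for p in range(n):                    # forward-difference gradient estimate
            for d in range(dim):
                Zp = Z.copy()
                Zp[p, d] += h
                grad[p, d] = (quadruple_loss(Zp, quads) - base) / h
        Z -= lr * grad
        norms = np.linalg.norm(Z, axis=1, keepdims=True)
        Z = np.where(norms >= 1.0, Z * (1.0 - 1e-3) / norms, Z)   # stay inside the ball
    return Z

# Toy usage: 4 objects and two placeholder ordinal constraints.
quads = [(0, 1, 2, 3), (0, 1, 1, 3)]
Z = hyperbolic_ordinal_embed(n=4, quads=quads)
print("d(0,1) =", poincare_dist(Z[0], Z[1]), " d(2,3) =", poincare_dist(Z[2], Z[3]))
```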
