150 research outputs found

    Stochastic Gradient Descent with Exponential Convergence Rates of Expected Classification Errors

    We consider stochastic gradient descent and its averaging variant for binary classification problems in a reproducing kernel Hilbert space. In the traditional analysis, which relies on a consistency property of loss functions, the expected classification error is known to converge more slowly than the expected risk even under a low-noise condition on the conditional label probabilities; the resulting rate is therefore sublinear. It is thus natural to ask whether much faster convergence of the expected classification error can be achieved. Recent work showed an exponential convergence rate for stochastic gradient descent under a strong low-noise condition, but the analysis was limited to the squared loss function, which is somewhat inadequate for binary classification tasks. In this paper, we show exponential convergence of the expected classification error in the final phase of stochastic gradient descent for a wide class of differentiable convex loss functions under similar assumptions. For averaged stochastic gradient descent, we show that the same convergence rate holds from the early phase of training. In experiments, we verify our analyses on L_2-regularized logistic regression.
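    As a concrete illustration of the setting, the sketch below runs averaged SGD on L_2-regularized logistic regression for a linear model on synthetic streaming data. It is a minimal sketch, not the paper's RKHS construction: the data generator, step-size schedule, and uniform iterate averaging are all illustrative choices.

```python
# Illustrative sketch: averaged SGD on L_2-regularized logistic loss for a
# linear model, one fresh sample per step (streaming / expected-risk setting).
# The synthetic data, step size, and averaging scheme are assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, T, lam = 20, 50_000, 1e-3              # dimension, iterations, L_2 penalty
w_star = rng.normal(size=d)               # ground-truth direction (synthetic)

def sample(n=1):
    X = rng.normal(size=(n, d))
    # labels follow a logistic model, so most points sit in a low-noise region
    p = 1.0 / (1.0 + np.exp(-4.0 * X @ w_star))
    y = np.where(rng.random(n) < p, 1.0, -1.0)
    return X, y

w = np.zeros(d)
w_bar = np.zeros(d)                       # running (uniform) average of iterates
for t in range(1, T + 1):
    X, y = sample()
    margin = y * (X @ w)
    sig = 1.0 / (1.0 + np.exp(np.clip(margin, -30.0, 30.0)))
    grad = -(y * sig) @ X + lam * w       # gradient of logistic loss + L_2 term
    eta = 1.0 / (1.0 + lam * t)           # one common decaying step size
    w -= eta * grad
    w_bar += (w - w_bar) / t              # averaged SGD iterate

X_te, y_te = sample(10_000)
err = np.mean(np.sign(X_te @ w_bar) != y_te)
print(f"classification error of the averaged iterate: {err:.4f}")
```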

    Why is parameter averaging beneficial in SGD? An objective smoothing perspective

    It is often observed that stochastic gradient descent (SGD) and its variants implicitly select a solution with good generalization performance; such implicit bias is often characterized in terms of the sharpness of the minima. Kleinberg et al. (2018) connected this bias with the smoothing effect of SGD, which eliminates sharp local minima through convolution with the stochastic gradient noise. We follow this line of research and study the commonly used averaged SGD algorithm, which Izmailov et al. (2018) empirically observed to prefer flat minima and therefore achieve better generalization. We prove that in certain problem settings, averaged SGD can efficiently optimize the smoothed objective, which avoids sharp local minima. In experiments, we verify our theory and show that parameter averaging with an appropriate step size indeed leads to significant improvement in the performance of SGD.
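    To make the smoothing intuition concrete, the toy sketch below runs SGD with and without iterate averaging on a one-dimensional objective that has a deep but narrow (sharp) minimum and a shallow but flat one. It is an illustrative example, not the paper's analysis; the objective, noise level, and step size are all assumed.

```python
# Toy illustration: gradient noise smooths away the sharp minimum, and the
# averaged iterate behaves as if it optimized the smoothed objective,
# settling in the flat basin. All constants below are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def grad(x):
    # f(x) = -exp(-20 x^2) + 0.05 (x - 3)^2
    # deep, narrow minimum near x = 0; shallow, flat basin near x = 3
    return 40.0 * x * np.exp(-20.0 * x**2) + 0.1 * (x - 3.0)

def run_sgd(x0, steps=20_000, eta=0.05, noise=1.0):
    x, x_bar = x0, 0.0
    for t in range(1, steps + 1):
        g = grad(x) + noise * rng.normal()   # stochastic gradient
        x -= eta * g
        x_bar += (x - x_bar) / t             # running average of iterates
    return x, x_bar

x_last, x_avg = run_sgd(x0=0.0)
print(f"last iterate:     {x_last: .2f}")    # fluctuates around the flat basin
print(f"averaged iterate: {x_avg: .2f}")     # concentrates near the flat minimum at x = 3
```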

    Hyperbolic Ordinal Embedding

    Given ordinal relations such as "object i is more similar to object j than object k is to object l", ordinal embedding aims to place these objects in a low-dimensional space so that all ordinal constraints are preserved. Existing approaches preserve ordinal relations in Euclidean space, but whether Euclidean space is compatible with the true data structure is largely ignored, even though this compatibility is essential for effective embedding. Since real data often exhibit hierarchical structure, Euclidean approaches struggle to achieve effective embeddings in low dimensions, which incurs high computational complexity or overfitting. In this paper we propose a novel hyperbolic ordinal embedding (HOE) method that embeds objects in hyperbolic space. Owing to the hierarchy-friendly property of hyperbolic space, HOE can effectively capture hierarchy and achieve embeddings in an extremely low-dimensional space. We not only theoretically prove the superiority of hyperbolic space and the limitations of Euclidean space for embedding hierarchical data, but also experimentally demonstrate that HOE significantly outperforms Euclidean-based methods.
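    The sketch below illustrates the general idea of ordinal embedding in the Poincaré ball with a hinge loss over quadruplet constraints. It is not the authors' HOE algorithm: the toy constraints, margin, finite-difference gradients, and projection step are all illustrative simplifications.

```python
# Illustrative sketch: embed a handful of items in a 2-D Poincare ball so that
# each quadruplet (i, j, k, l), meaning d(x_i, x_j) < d(x_k, x_l), is
# encouraged via a hinge loss. Everything below is a toy assumption.
import numpy as np

rng = np.random.default_rng(0)
n, dim = 10, 2
margin, lr, eps, h = 0.1, 0.05, 1e-5, 1e-5
X = 1e-2 * rng.normal(size=(n, dim))          # start near the origin of the ball

def poincare_dist(u, v):
    # geodesic distance in the Poincare ball model
    num = 2.0 * np.sum((u - v) ** 2)
    den = (1.0 - np.sum(u * u)) * (1.0 - np.sum(v * v))
    return np.arccosh(1.0 + num / den)

def project(x):
    # keep every point strictly inside the unit ball
    norm = np.linalg.norm(x)
    return x if norm < 1.0 - eps else x * (1.0 - eps) / norm

def slack(Z, i, j, k, l):
    # hinge slack of the constraint d(i, j) + margin < d(k, l)
    return margin + poincare_dist(Z[i], Z[j]) - poincare_dist(Z[k], Z[l])

# toy quadruplets (i, j, k, l): "i is closer to j than k is to l"
quads = [(0, 1, 0, 5), (1, 2, 1, 7), (3, 4, 3, 9), (5, 6, 2, 8)]

for epoch in range(300):
    for (i, j, k, l) in quads:
        if slack(X, i, j, k, l) > 0:          # constraint violated: hinge is active
            for idx in {i, j, k, l}:
                g = np.zeros(dim)
                for a in range(dim):          # finite-difference gradient (toy-sized)
                    Xp, Xm = X.copy(), X.copy()
                    Xp[idx, a] += h
                    Xm[idx, a] -= h
                    g[a] = (slack(Xp, i, j, k, l) - slack(Xm, i, j, k, l)) / (2 * h)
                X[idx] = project(X[idx] - lr * g)

# if training succeeded, the first distance should now be smaller than the second
print(poincare_dist(X[0], X[1]), poincare_dist(X[0], X[5]))
```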