Eliminating all bad Local Minima from Loss Landscapes without even adding an Extra Unit
Recent work has noted that all bad local minima can be removed from neural
network loss landscapes by adding a single unit with a particular
parameterization. We show that the core technique from these papers can be used
to remove all bad local minima from any loss landscape, so long as the global
minimum has a loss of zero. The procedure requires neither the addition of
auxiliary units nor that the loss be associated with a neural network. The
method acts by converting all bad local minima into bad (non-local) minima at
infinity in terms of auxiliary parameters.
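As a minimal sketch of this mechanism (one simple instantiation, not necessarily the paper's exact parameterization), wrap an arbitrary non-negative loss L(theta) with a single auxiliary parameter a as L_tilde(theta, a) = L(theta) * exp(-a). Wherever L(theta) > 0, the gradient with respect to a is strictly negative, so no finite point with positive loss can be a local minimum; gradient descent instead pushes a toward infinity:

```python
import numpy as np

# Tilted double-well: theta ~ +0.96 is a bad local minimum, theta ~ -1.03 global.
def f(theta):
    return (theta**2 - 1.0)**2 + 0.3 * (theta + 1.0)

grid = np.linspace(-2.0, 2.0, 100001)
f_min = f(grid).min()                 # shift so the global minimum is ~0,
L = lambda th: f(th) - f_min          # as the construction requires

# Augmented loss: wherever L > 0, dL_tilde/da = -L_tilde < 0, so no finite
# point with positive loss is a local minimum; descent drives a to infinity.
def L_tilde(theta, a):
    return L(theta) * np.exp(-a)

theta, a = 1.0, 0.0                   # start in the bad well
for _ in range(2000):
    eps = 1e-6                        # finite-difference gradient in theta
    dth = (L_tilde(theta + eps, a) - L_tilde(theta - eps, a)) / (2 * eps)
    da = -L_tilde(theta, a)           # exact gradient in a
    theta, a = theta - 0.05 * dth, a - 0.05 * da

# theta stays trapped near 0.96, yet L_tilde -> 0 as a grows without bound:
# the bad local minimum has been moved to infinity in the auxiliary parameter.
print(f"theta={theta:.3f}  a={a:.2f}  L={L(theta):.3f}  L_tilde={L_tilde(theta, a):.2e}")
```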
Note on Equivalence Between Recurrent Neural Network Time Series Models and Variational Bayesian Models
We observe that the standard log likelihood training objective for a
Recurrent Neural Network (RNN) model of time series data is equivalent to a
variational Bayesian training objective, given the proper choice of generative
and inference models. This perspective may motivate extensions to both RNNs and
variational Bayesian models. We propose one such extension, where multiple
particles are used for the hidden state of an RNN, allowing a natural
representation of uncertainty or multimodality.
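A hypothetical toy sketch of the multiple-particle extension (the particle count, noise scale, and Gaussian emission model here are illustrative assumptions, not the paper's exact construction): propagate several noisy copies of the hidden state and weight each by how well it predicts the next observation, as in a particle filter.

```python
import numpy as np

rng = np.random.default_rng(0)
K, H, T = 8, 16, 50          # particles, hidden size, sequence length
Wh = rng.normal(0, 0.3, (H, H))
Wx = rng.normal(0, 0.3, (H, 1))
Wo = rng.normal(0, 0.3, (1, H))

x = np.sin(np.linspace(0, 6, T)).reshape(T, 1)   # toy 1-D time series
h = np.zeros((K, H))                             # one hidden state per particle
log_w = np.zeros(K)                              # particle log-weights

for t in range(T - 1):
    # Stochastic transition: each particle gets its own noise sample.
    h = np.tanh(h @ Wh.T + x[t] @ Wx.T + 0.1 * rng.normal(size=(K, H)))
    mu = h @ Wo.T                                # per-particle prediction
    # Gaussian emission log-likelihood of the next observation.
    log_w += -0.5 * ((x[t + 1] - mu[:, 0]) ** 2) / 0.25

# The weighted mixture over particles represents uncertainty/multimodality.
w = np.exp(log_w - log_w.max()); w /= w.sum()
print("effective number of particles:", 1.0 / np.sum(w**2))
```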
A universal tradeoff between power, precision and speed in physical communication
Maximizing the speed and precision of communication while minimizing power
dissipation is a fundamental engineering design goal. Biological systems, too,
achieve remarkable speed, precision, and power efficiency through physical
design principles that remain poorly understood. Powerful theories like information
theory and thermodynamics do not provide general limits on power, precision and
speed. Here we go beyond these classical theories to prove that the product of
precision and speed is universally bounded by power dissipation in any physical
communication channel whose dynamics is faster than that of the signal.
Moreover, our derivation involves a novel connection between friction and
information geometry. These results may yield insight into both the engineering
design of communication devices and the structure and function of biological
signaling systems.
Generalizing Hamiltonian Monte Carlo with Neural Networks
We present a general-purpose method to train Markov chain Monte Carlo
kernels, parameterized by deep neural networks, that converge and mix quickly
to their target distribution. Our method generalizes Hamiltonian Monte Carlo
and is trained to maximize expected squared jumped distance, a proxy for mixing
speed. We demonstrate large empirical gains on a collection of simple but
challenging distributions, for instance achieving a 106x improvement in
effective sample size in one case, and mixing when standard HMC makes no
measurable progress in a second. Finally, we show quantitative and qualitative
gains on a real-world task: latent-variable generative modeling. We release an
open source TensorFlow implementation of the algorithm.
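For orientation, here is a plain NumPy sketch of the standard HMC kernel being generalized; in the paper's method, parts of this hand-coded leapfrog are replaced by neural-network-parameterized transformations trained to maximize expected squared jumped distance E[||x' - x||^2] (the step size, step count, and toy target below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy target: an ill-conditioned Gaussian, a classically hard case for HMC.
prec = np.diag([1.0, 100.0])                 # inverse covariance

def U(q): return 0.5 * q @ prec @ q          # potential energy = -log density
def gradU(q): return prec @ q

def hmc_step(q, eps=0.05, n_leap=20):
    p = rng.normal(size=q.shape)             # resample momentum
    q_new, p_new = q.copy(), p.copy()
    p_new -= 0.5 * eps * gradU(q_new)        # leapfrog integrator
    for _ in range(n_leap - 1):
        q_new += eps * p_new
        p_new -= eps * gradU(q_new)
    q_new += eps * p_new
    p_new -= 0.5 * eps * gradU(q_new)
    # Metropolis accept/reject keeps the target distribution exact.
    dH = (U(q_new) + 0.5 * p_new @ p_new) - (U(q) + 0.5 * p @ p)
    return q_new if np.log(rng.uniform()) < -dH else q

q, samples = np.zeros(2), []
for _ in range(5000):
    q = hmc_step(q)
    samples.append(q)
samples = np.array(samples)
print("sample variances:", samples.var(axis=0))   # ~ [1.0, 0.01]
```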
Efficient and optimal binary Hopfield associative memory storage using minimum probability flow
We present an algorithm to store binary memories in a Hopfield neural network
using minimum probability flow, a recent technique to fit parameters in
energy-based probabilistic models. In the case of memories without noise, our
algorithm provably achieves optimal pattern storage (which we show is at least
one pattern per neuron) and outperforms classical methods both in speed and
memory recovery. Moreover, when trained on noisy or corrupted versions of a
fixed set of binary patterns, our algorithm finds networks which correctly
store the originals. We also demonstrate this finding visually with the
unsupervised storage and clean-up of large binary fingerprint images from
significantly corrupted samples.
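A simplified sketch of the fitting procedure under stated assumptions (Hopfield energy E(x) = -1/2 x^T J x with no bias terms, a single-bit-flip neighborhood, and plain gradient descent rather than the paper's optimizer): the MPF objective K(J) = sum over patterns and bits of exp(-x_i (J x)_i) is driven down until each stored pattern is a stable fixed point of the recall dynamics.

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 64, 8                                  # neurons, patterns
X = rng.choice([-1.0, 1.0], size=(P, N))      # random binary patterns

J = np.zeros((N, N))                          # symmetric, zero-diagonal weights
lr = 0.1
for _ in range(500):
    A = X @ J                                 # A[n, i] = (J x^n)_i
    E = np.exp(-X * A)                        # one single-bit-flip term per bit
    G = -(E * X).T @ X / P                    # dK/dJ before projection
    G = 0.5 * (G + G.T)                       # keep J symmetric
    np.fill_diagonal(G, 0.0)                  # keep the diagonal at zero
    J -= lr * G

# Recall: start from a corrupted pattern, run asynchronous Hopfield updates.
x = X[0].copy()
x[rng.choice(N, size=10, replace=False)] *= -1   # flip 10 of 64 bits
for _ in range(5):
    for i in rng.permutation(N):
        x[i] = 1.0 if J[i] @ x > 0 else -1.0
print("bits recovered:", int(np.sum(x == X[0])), "/", N)
```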
PCA of high dimensional random walks with comparison to neural network training
One technique to visualize the training of neural networks is to perform PCA
on the parameters over the course of training and to project to the subspace
spanned by the first few PCA components. In this paper we compare this
technique to the PCA of a high dimensional random walk. We compute the
eigenvalues and eigenvectors of the covariance of the trajectory and prove that
in the long trajectory and high dimensional limit most of the variance is in
the first few PCA components, and that the projection of the trajectory onto
any subspace spanned by PCA components is a Lissajous curve. We generalize
these results to a random walk with momentum and to an Ornstein-Uhlenbeck
process (i.e., a random walk in a quadratic potential) and show that in high
dimensions the walk is not mean reverting, but will instead be trapped at a
fixed distance from the minimum. Finally, we compare the distribution of PCA
variances and the PCA-projected training trajectories of a linear model trained
on CIFAR-10 and a ResNet-50-v2 trained on ImageNet, and find that the
distribution of PCA variances resembles that of a random walk with drift.
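The variance concentration is straightforward to reproduce numerically; a minimal sketch (trajectory length and dimension are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
T, D = 2000, 500                                     # steps, dimensions
walk = np.cumsum(rng.normal(size=(T, D)), axis=0)    # high-dimensional random walk

# PCA of the trajectory via SVD of the centered parameter history.
centered = walk - walk.mean(axis=0)
_, s, vt = np.linalg.svd(centered, full_matrices=False)
var = s**2 / np.sum(s**2)
print("variance fractions:", np.round(var[:4], 3))   # ~ 0.61, 0.15, 0.07, ...
print("first two components:", round(var[:2].sum(), 2))

# Projected onto any two PCA components, the trajectory is a Lissajous-like
# curve: component k varies roughly as cos(pi * k * t / T) along the walk.
proj = centered @ vt[:2].T                           # (T, 2), ready to plot
```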
SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability
We propose a new technique, Singular Vector Canonical Correlation Analysis
(SVCCA), a tool for quickly comparing two representations in a way that is both
invariant to affine transforms (allowing comparison between different layers and
networks) and fast to compute (allowing more comparisons to be calculated than
with previous methods). We deploy this tool to measure the intrinsic
dimensionality of layers, showing in some cases needless over-parameterization;
to probe learning dynamics throughout training, finding that networks converge
to final representations from the bottom up; to show where class-specific
information in networks is formed; and to suggest new training regimes that
simultaneously save computation and overfit less. Code:
https://github.com/google/svcca/
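A compact sketch of the two stages, using a standard QR route to the canonical correlations (the repository above is the reference implementation; the 0.99 variance threshold and the toy data are assumptions):

```python
import numpy as np

def svcca(X, Y, keep=0.99):
    """Compare two representations; rows are datapoints, columns are neurons.
    Stage 1: SVD each representation, keep directions covering `keep` of the
    variance. Stage 2: CCA between the two reduced representations."""
    def reduce(Z):
        Z = Z - Z.mean(axis=0)
        U, s, _ = np.linalg.svd(Z, full_matrices=False)
        k = np.searchsorted(np.cumsum(s**2) / np.sum(s**2), keep) + 1
        return U[:, :k] * s[:k]
    Xr, Yr = reduce(X), reduce(Y)
    # Canonical correlations = singular values of Qx^T Qy, where Qx, Qy are
    # orthonormal bases for the two (centered) subspaces.
    qx, _ = np.linalg.qr(Xr)
    qy, _ = np.linalg.qr(Yr)
    return np.linalg.svd(qx.T @ qy, compute_uv=False)

rng = np.random.default_rng(4)
A = rng.normal(size=(1000, 64))
B = A @ rng.normal(size=(64, 32)) + 0.1 * rng.normal(size=(1000, 32))
print("mean SVCCA similarity:", svcca(A, B).mean())   # high: B is ~affine in A
```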
Density estimation using Real NVP
Unsupervised learning of probabilistic models is a central yet challenging
problem in machine learning. Specifically, designing models with tractable
learning, sampling, inference and evaluation is crucial in solving this task.
We extend the space of such models using real-valued non-volume preserving
(real NVP) transformations, a set of powerful invertible and learnable
transformations, resulting in an unsupervised learning algorithm with exact
log-likelihood computation, exact sampling, exact inference of latent
variables, and an interpretable latent space. We demonstrate its ability to
model natural images on four datasets through sampling, log-likelihood
evaluation and latent variable manipulations.
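The building block behind these exactness properties is the affine coupling layer; a minimal sketch with toy one-matrix scale and shift functions (the paper stacks many such layers with deep networks and alternating partitions):

```python
import numpy as np

# Affine coupling: half the variables pass through unchanged and parameterize
# a scale-and-shift of the other half. The Jacobian is triangular, so the
# log-determinant is simply sum(s), and the inverse is available in closed form.
def coupling_forward(x, w_s, w_t):
    x1, x2 = np.split(x, 2, axis=-1)
    s = np.tanh(x1 @ w_s)                 # toy scale "network"
    t = x1 @ w_t                          # toy shift "network"
    y2 = x2 * np.exp(s) + t
    log_det = s.sum(axis=-1)              # exact log |det J|
    return np.concatenate([x1, y2], axis=-1), log_det

def coupling_inverse(y, w_s, w_t):
    y1, y2 = np.split(y, 2, axis=-1)
    s = np.tanh(y1 @ w_s)
    t = y1 @ w_t
    x2 = (y2 - t) * np.exp(-s)            # exact inverse, no iteration needed
    return np.concatenate([y1, x2], axis=-1)

rng = np.random.default_rng(5)
d = 4
w_s, w_t = rng.normal(size=(d // 2, d // 2)), rng.normal(size=(d // 2, d // 2))
x = rng.normal(size=(3, d))
y, log_det = coupling_forward(x, w_s, w_t)
print(np.allclose(coupling_inverse(y, w_s, w_t), x))   # True: exactly invertible
```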
An Unsupervised Algorithm For Learning Lie Group Transformations
We present several theoretical contributions which allow Lie groups to be fit
to high dimensional datasets. Transformation operators are represented in their
eigen-basis, reducing the computational complexity of parameter estimation to
that of training a linear transformation model. A transformation specific
"blurring" operator is introduced that allows inference to escape local minima
via a smoothing of the transformation space. A penalty on traversed manifold
distance is added, encouraging the discovery of sparse, minimal-distance
transformations between states. Both learning and inference are demonstrated
using these methods for the full set of affine transformations on natural image
patches. Transformation operators are then trained on natural video sequences.
It is shown that the learned video transformations provide a better description
of inter-frame differences than the standard motion model based on rigid
translation.
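A sketch of why the eigen-basis representation helps, using a generic one-parameter group T(s) = expm(s * A) rather than the paper's learned operators: diagonalize the generator once, and applying T(s) for any s reduces to the cost of a linear transformation.

```python
import numpy as np

rng = np.random.default_rng(6)
d = 8
A = rng.normal(size=(d, d))
A = A - A.T                              # skew-symmetric generator -> rotations

# One-time eigen-decomposition of the generator (complex for skew A).
lam, U = np.linalg.eig(A)
U_inv = np.linalg.inv(U)

def apply_T(x, s):
    # T(s) x = U diag(exp(s * lam)) U^{-1} x; cost of a linear model per apply.
    return (U @ (np.exp(s * lam) * (U_inv @ x))).real

x = rng.normal(size=d)
# Group property: T(s1) T(s2) = T(s1 + s2).
print(np.allclose(apply_T(apply_T(x, 0.3), 0.7), apply_T(x, 1.0)))   # True
```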
Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods
We present an algorithm for minimizing a sum of functions that combines the
computational efficiency of stochastic gradient descent (SGD) with the second
order curvature information leveraged by quasi-Newton methods. We unify these
disparate approaches by maintaining an independent Hessian approximation for
each contributing function in the sum. We maintain computational tractability
and limit memory requirements even for high dimensional optimization problems
by storing and manipulating these quadratic approximations in a shared, time
evolving, low dimensional subspace. Each update step requires only a single
contributing function or minibatch evaluation (as in SGD), and each step is
scaled using an approximate inverse Hessian, so little to no hyperparameter
adjustment is required (as is typical for quasi-Newton methods). This
algorithm contrasts with earlier stochastic second order techniques that treat
the Hessian of each contributing function as a noisy approximation to the full
Hessian, rather than as a target for direct estimation. We experimentally
demonstrate improved convergence on seven diverse optimization problems. The
algorithm is released as open source Python and MATLAB packages.
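A greatly simplified sketch of the core idea (diagonal Hessian estimates, a toy quadratic objective, and no shared low-dimensional subspace, unlike the released packages): maintain one quadratic model per contributing function, refresh a single model per step, and jump to the minimizer of the summed models.

```python
import numpy as np

rng = np.random.default_rng(7)
D, M = 10, 5                                  # dimension, number of minibatches

# Toy sum-of-functions problem: f_i(x) = 0.5 (x - c_i)^T Q_i (x - c_i).
Qs = [np.diag(rng.uniform(0.5, 5.0, D)) for _ in range(M)]
cs = [rng.normal(size=D) for _ in range(M)]
def grad_i(x, i): return Qs[i] @ (x - cs[i])

# Per-function quadratic model: anchor point, gradient there, and a diagonal
# Hessian estimate updated by a secant (finite-difference) rule.
x = np.zeros(D)
anchors = [x.copy() for _ in range(M)]
grads = [grad_i(x, i) for i in range(M)]
Hs = [np.ones(D) for _ in range(M)]

for step in range(200):
    i = step % M                              # one minibatch evaluation per step
    g = grad_i(x, i)
    dx, dg = x - anchors[i], g - grads[i]
    mask = np.abs(dx) > 1e-10
    Hs[i][mask] = np.clip(dg[mask] / dx[mask], 0.1, 100.0)   # secant estimate
    anchors[i], grads[i] = x.copy(), g
    # Newton step on the sum of the M quadratic models (diagonal solve).
    H_sum = np.sum(Hs, axis=0)
    b = np.sum([H * a - gg for H, a, gg in zip(Hs, anchors, grads)], axis=0)
    x = b / H_sum

x_star = np.linalg.solve(sum(Qs), sum(Q @ c for Q, c in zip(Qs, cs)))
print("distance to optimum:", np.linalg.norm(x - x_star))    # ~0
```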
