Eliminating all bad Local Minima from Loss Landscapes without even adding an Extra Unit
Recent work has noted that all bad local minima can be removed from neural
network loss landscapes by adding a single unit with a particular
parameterization. We show that the core technique from these papers can be used
to remove all bad local minima from any loss landscape, so long as the global
minimum has a loss of zero. The procedure requires neither the addition of
auxiliary units nor that the loss be associated with a neural network. The
method acts by converting all bad local minima into bad (non-local) minima at
infinity in terms of auxiliary parameters.
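As a minimal sketch of this mechanism (one simple instantiation, not necessarily the paper's exact parameterization), wrap an arbitrary non-negative loss L(theta) with a single auxiliary parameter a as L_tilde(theta, a) = L(theta) * exp(-a). Wherever L(theta) > 0, the gradient with respect to a is strictly negative, so no finite point with positive loss can be a local minimum; gradient descent instead pushes a toward infinity:

```python
import numpy as np

# Tilted double-well: theta ~ +0.96 is a bad local minimum, theta ~ -1.03 global.
def f(theta):
    return (theta**2 - 1.0)**2 + 0.3 * (theta + 1.0)

grid = np.linspace(-2.0, 2.0, 100001)
f_min = f(grid).min()                 # shift so the global minimum is ~0,
L = lambda th: f(th) - f_min          # as the construction requires

# Augmented loss: wherever L > 0, dL_tilde/da = -L_tilde < 0, so no finite
# point with positive loss is a local minimum; descent drives a to infinity.
def L_tilde(theta, a):
    return L(theta) * np.exp(-a)

theta, a = 1.0, 0.0                   # start in the bad well
for _ in range(2000):
    eps = 1e-6                        # finite-difference gradient in theta
    dth = (L_tilde(theta + eps, a) - L_tilde(theta - eps, a)) / (2 * eps)
    da = -L_tilde(theta, a)           # exact gradient in a
    theta, a = theta - 0.05 * dth, a - 0.05 * da

# theta stays trapped near 0.96, yet L_tilde -> 0 as a grows without bound:
# the bad local minimum has been moved to infinity in the auxiliary parameter.
print(f"theta={theta:.3f}  a={a:.2f}  L={L(theta):.3f}  L_tilde={L_tilde(theta, a):.2e}")
```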
Note on Equivalence Between Recurrent Neural Network Time Series Models and Variational Bayesian Models
We observe that the standard log likelihood training objective for a
Recurrent Neural Network (RNN) model of time series data is equivalent to a
variational Bayesian training objective, given the proper choice of generative
and inference models. This perspective may motivate extensions to both RNNs and
variational Bayesian models. We propose one such extension, where multiple
particles are used for the hidden state of an RNN, allowing a natural
representation of uncertainty or multimodality.
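A hypothetical toy sketch of the multiple-particle extension (the particle count, noise scale, and Gaussian emission model here are illustrative assumptions, not the paper's exact construction): propagate several noisy copies of the hidden state and weight each by how well it predicts the next observation, as in a particle filter.

```python
import numpy as np

rng = np.random.default_rng(0)
K, H, T = 8, 16, 50          # particles, hidden size, sequence length
Wh = rng.normal(0, 0.3, (H, H))
Wx = rng.normal(0, 0.3, (H, 1))
Wo = rng.normal(0, 0.3, (1, H))

x = np.sin(np.linspace(0, 6, T)).reshape(T, 1)   # toy 1-D time series
h = np.zeros((K, H))                             # one hidden state per particle
log_w = np.zeros(K)                              # particle log-weights

for t in range(T - 1):
    # Stochastic transition: each particle gets its own noise sample.
    h = np.tanh(h @ Wh.T + x[t] @ Wx.T + 0.1 * rng.normal(size=(K, H)))
    mu = h @ Wo.T                                # per-particle prediction
    # Gaussian emission log-likelihood of the next observation.
    log_w += -0.5 * ((x[t + 1] - mu[:, 0]) ** 2) / 0.25

# The weighted mixture over particles represents uncertainty/multimodality.
w = np.exp(log_w - log_w.max()); w /= w.sum()
print("effective number of particles:", 1.0 / np.sum(w**2))
```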
A universal tradeoff between power, precision and speed in physical communication
Maximizing the speed and precision of communication while minimizing power
dissipation is a fundamental engineering design goal. Biological systems, too,
achieve remarkable speed, precision, and power efficiency through physical
design principles that remain poorly understood. Powerful theories like information
theory and thermodynamics do not provide general limits on power, precision and
speed. Here we go beyond these classical theories to prove that the product of
precision and speed is universally bounded by power dissipation in any physical
communication channel whose dynamics is faster than that of the signal.
Moreover, our derivation involves a novel connection between friction and
information geometry. These results may yield insight into both the engineering
design of communication devices and the structure and function of biological
signaling systems.
Generalizing Hamiltonian Monte Carlo with Neural Networks
We present a general-purpose method to train Markov chain Monte Carlo
kernels, parameterized by deep neural networks, that converge and mix quickly
to their target distribution. Our method generalizes Hamiltonian Monte Carlo
and is trained to maximize expected squared jumped distance, a proxy for mixing
speed. We demonstrate large empirical gains on a collection of simple but
challenging distributions, for instance achieving a 106x improvement in
effective sample size in one case, and mixing when standard HMC makes no
measurable progress in a second. Finally, we show quantitative and qualitative
gains on a real-world task: latent-variable generative modeling. We release an
open source TensorFlow implementation of the algorithm.
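For orientation, here is a plain NumPy sketch of the standard HMC kernel being generalized; in the paper's method, parts of this hand-coded leapfrog are replaced by neural-network-parameterized transformations trained to maximize expected squared jumped distance E[||x' - x||^2] (the step size, step count, and toy target below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy target: an ill-conditioned Gaussian, a classically hard case for HMC.
prec = np.diag([1.0, 100.0])                 # inverse covariance

def U(q): return 0.5 * q @ prec @ q          # potential energy = -log density
def gradU(q): return prec @ q

def hmc_step(q, eps=0.05, n_leap=20):
    p = rng.normal(size=q.shape)             # resample momentum
    q_new, p_new = q.copy(), p.copy()
    p_new -= 0.5 * eps * gradU(q_new)        # leapfrog integrator
    for _ in range(n_leap - 1):
        q_new += eps * p_new
        p_new -= eps * gradU(q_new)
    q_new += eps * p_new
    p_new -= 0.5 * eps * gradU(q_new)
    # Metropolis accept/reject keeps the target distribution exact.
    dH = (U(q_new) + 0.5 * p_new @ p_new) - (U(q) + 0.5 * p @ p)
    return q_new if np.log(rng.uniform()) < -dH else q

q, samples = np.zeros(2), []
for _ in range(5000):
    q = hmc_step(q)
    samples.append(q)
samples = np.array(samples)
print("sample variances:", samples.var(axis=0))   # ~ [1.0, 0.01]
```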
Efficient and optimal binary Hopfield associative memory storage using minimum probability flow
We present an algorithm to store binary memories in a Hopfield neural network
using minimum probability flow, a recent technique to fit parameters in
energy-based probabilistic models. In the case of memories without noise, our
algorithm provably achieves optimal pattern storage (which we show is at least
one pattern per neuron) and outperforms classical methods both in speed and
memory recovery. Moreover, when trained on noisy or corrupted versions of a
fixed set of binary patterns, our algorithm finds networks which correctly
store the originals. We also demonstrate this finding visually with the
unsupervised storage and clean-up of large binary fingerprint images from
significantly corrupted samples.
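A simplified sketch of the fitting procedure under stated assumptions (Hopfield energy E(x) = -1/2 x^T J x with no bias terms, a single-bit-flip neighborhood, and plain gradient descent rather than the paper's optimizer): the MPF objective K(J) = sum over patterns and bits of exp(-x_i (J x)_i) is driven down until each stored pattern is a stable fixed point of the recall dynamics.

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 64, 8                                  # neurons, patterns
X = rng.choice([-1.0, 1.0], size=(P, N))      # random binary patterns

J = np.zeros((N, N))                          # symmetric, zero-diagonal weights
lr = 0.1
for _ in range(500):
    A = X @ J                                 # A[n, i] = (J x^n)_i
    E = np.exp(-X * A)                        # one single-bit-flip term per bit
    G = -(E * X).T @ X / P                    # dK/dJ before projection
    G = 0.5 * (G + G.T)                       # keep J symmetric
    np.fill_diagonal(G, 0.0)                  # keep the diagonal at zero
    J -= lr * G

# Recall: start from a corrupted pattern, run asynchronous Hopfield updates.
x = X[0].copy()
x[rng.choice(N, size=10, replace=False)] *= -1   # flip 10 of 64 bits
for _ in range(5):
    for i in rng.permutation(N):
        x[i] = 1.0 if J[i] @ x > 0 else -1.0
print("bits recovered:", int(np.sum(x == X[0])), "/", N)
```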
PCA of high dimensional random walks with comparison to neural network training
One technique to visualize the training of neural networks is to perform PCA
on the parameters over the course of training and to project to the subspace
spanned by the first few PCA components. In this paper we compare this
technique to the PCA of a high dimensional random walk. We compute the
eigenvalues and eigenvectors of the covariance of the trajectory and prove that
in the long trajectory and high dimensional limit most of the variance is in
the first few PCA components, and that the projection of the trajectory onto
any subspace spanned by PCA components is a Lissajous curve. We generalize
these results to a random walk with momentum and to an Ornstein-Uhlenbeck
process (i.e., a random walk in a quadratic potential) and show that in high
dimensions the walk is not mean reverting, but will instead be trapped at a
fixed distance from the minimum. Finally, we compare the distribution of PCA
variances and the PCA-projected training trajectories of a linear model trained
on CIFAR-10 and a ResNet-50-v2 trained on ImageNet, and find that the
distribution of PCA variances resembles that of a random walk with drift.
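The variance concentration is straightforward to reproduce numerically; a minimal sketch (trajectory length and dimension are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
T, D = 2000, 500                                     # steps, dimensions
walk = np.cumsum(rng.normal(size=(T, D)), axis=0)    # high-dimensional random walk

# PCA of the trajectory via SVD of the centered parameter history.
centered = walk - walk.mean(axis=0)
_, s, vt = np.linalg.svd(centered, full_matrices=False)
var = s**2 / np.sum(s**2)
print("variance fractions:", np.round(var[:4], 3))   # ~ 0.61, 0.15, 0.07, ...
print("first two components:", round(var[:2].sum(), 2))

# Projected onto any two PCA components, the trajectory is a Lissajous-like
# curve: component k varies roughly as cos(pi * k * t / T) along the walk.
proj = centered @ vt[:2].T                           # (T, 2), ready to plot
```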
SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability
We propose a new technique, Singular Vector Canonical Correlation Analysis
(SVCCA), a tool for quickly comparing two representations in a way that is both
invariant to affine transforms (allowing comparison between different layers and
networks) and fast to compute (allowing more comparisons to be calculated than
with previous methods). We deploy this tool to measure the intrinsic
dimensionality of layers, showing in some cases needless over-parameterization;
to probe learning dynamics throughout training, finding that networks converge
to final representations from the bottom up; to show where class-specific
information in networks is formed; and to suggest new training regimes that
simultaneously save computation and overfit less. Code:
https://github.com/google/svcca/
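A compact sketch of the two stages, using a standard QR route to the canonical correlations (the repository above is the reference implementation; the 0.99 variance threshold and the toy data are assumptions):

```python
import numpy as np

def svcca(X, Y, keep=0.99):
    """Compare two representations; rows are datapoints, columns are neurons.
    Stage 1: SVD each representation, keep directions covering `keep` of the
    variance. Stage 2: CCA between the two reduced representations."""
    def reduce(Z):
        Z = Z - Z.mean(axis=0)
        U, s, _ = np.linalg.svd(Z, full_matrices=False)
        k = np.searchsorted(np.cumsum(s**2) / np.sum(s**2), keep) + 1
        return U[:, :k] * s[:k]
    Xr, Yr = reduce(X), reduce(Y)
    # Canonical correlations = singular values of Qx^T Qy, where Qx, Qy are
    # orthonormal bases for the two (centered) subspaces.
    qx, _ = np.linalg.qr(Xr)
    qy, _ = np.linalg.qr(Yr)
    return np.linalg.svd(qx.T @ qy, compute_uv=False)

rng = np.random.default_rng(4)
A = rng.normal(size=(1000, 64))
B = A @ rng.normal(size=(64, 32)) + 0.1 * rng.normal(size=(1000, 32))
print("mean SVCCA similarity:", svcca(A, B).mean())   # high: B is ~affine in A
```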
Density estimation using Real NVP
Unsupervised learning of probabilistic models is a central yet challenging
problem in machine learning. Specifically, designing models with tractable
learning, sampling, inference and evaluation is crucial in solving this task.
We extend the space of such models using real-valued non-volume preserving
(real NVP) transformations, a set of powerful invertible and learnable
transformations, resulting in an unsupervised learning algorithm with exact
log-likelihood computation, exact sampling, exact inference of latent
variables, and an interpretable latent space. We demonstrate its ability to
model natural images on four datasets through sampling, log-likelihood
evaluation and latent variable manipulations.
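The building block behind these exactness properties is the affine coupling layer; a minimal sketch with toy one-matrix scale and shift functions (the paper stacks many such layers with deep networks and alternating partitions):

```python
import numpy as np

# Affine coupling: half the variables pass through unchanged and parameterize
# a scale-and-shift of the other half. The Jacobian is triangular, so the
# log-determinant is simply sum(s), and the inverse is available in closed form.
def coupling_forward(x, w_s, w_t):
    x1, x2 = np.split(x, 2, axis=-1)
    s = np.tanh(x1 @ w_s)                 # toy scale "network"
    t = x1 @ w_t                          # toy shift "network"
    y2 = x2 * np.exp(s) + t
    log_det = s.sum(axis=-1)              # exact log |det J|
    return np.concatenate([x1, y2], axis=-1), log_det

def coupling_inverse(y, w_s, w_t):
    y1, y2 = np.split(y, 2, axis=-1)
    s = np.tanh(y1 @ w_s)
    t = y1 @ w_t
    x2 = (y2 - t) * np.exp(-s)            # exact inverse, no iteration needed
    return np.concatenate([y1, x2], axis=-1)

rng = np.random.default_rng(5)
d = 4
w_s, w_t = rng.normal(size=(d // 2, d // 2)), rng.normal(size=(d // 2, d // 2))
x = rng.normal(size=(3, d))
y, log_det = coupling_forward(x, w_s, w_t)
print(np.allclose(coupling_inverse(y, w_s, w_t), x))   # True: exactly invertible
```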
An Unsupervised Algorithm For Learning Lie Group Transformations
We present several theoretical contributions which allow Lie groups to be fit
to high dimensional datasets. Transformation operators are represented in their
eigen-basis, reducing the computational complexity of parameter estimation to
that of training a linear transformation model. A transformation specific
"blurring" operator is introduced that allows inference to escape local minima
via a smoothing of the transformation space. A penalty on traversed manifold
distance is added, encouraging the discovery of sparse, minimal-distance
transformations between states. Both learning and inference are demonstrated
using these methods for the full set of affine transformations on natural image
patches. Transformation operators are then trained on natural video sequences.
It is shown that the learned video transformations provide a better description
of inter-frame differences than the standard motion model based on rigid
translation.
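A sketch of why the eigen-basis representation helps, using a generic one-parameter group T(s) = expm(s * A) rather than the paper's learned operators: diagonalize the generator once, and applying T(s) for any s reduces to the cost of a linear transformation.

```python
import numpy as np

rng = np.random.default_rng(6)
d = 8
A = rng.normal(size=(d, d))
A = A - A.T                              # skew-symmetric generator -> rotations

# One-time eigen-decomposition of the generator (complex for skew A).
lam, U = np.linalg.eig(A)
U_inv = np.linalg.inv(U)

def apply_T(x, s):
    # T(s) x = U diag(exp(s * lam)) U^{-1} x; cost of a linear model per apply.
    return (U @ (np.exp(s * lam) * (U_inv @ x))).real

x = rng.normal(size=d)
# Group property: T(s1) T(s2) = T(s1 + s2).
print(np.allclose(apply_T(apply_T(x, 0.3), 0.7), apply_T(x, 1.0)))   # True
```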
Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods
We present an algorithm for minimizing a sum of functions that combines the
computational efficiency of stochastic gradient descent (SGD) with the second
order curvature information leveraged by quasi-Newton methods. We unify these
disparate approaches by maintaining an independent Hessian approximation for
each contributing function in the sum. We maintain computational tractability
and limit memory requirements even for high dimensional optimization problems
by storing and manipulating these quadratic approximations in a shared, time
evolving, low dimensional subspace. Each update step requires only a single
contributing function or minibatch evaluation (as in SGD), and each step is
scaled using an approximate inverse Hessian, so little to no hyperparameter
adjustment is required (as is typical for quasi-Newton methods). This
algorithm contrasts with earlier stochastic second order techniques that treat
the Hessian of each contributing function as a noisy approximation to the full
Hessian, rather than as a target for direct estimation. We experimentally
demonstrate improved convergence on seven diverse optimization problems. The
algorithm is released as open source Python and MATLAB packages.
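A greatly simplified sketch of the core idea (diagonal Hessian estimates, a toy quadratic objective, and no shared low-dimensional subspace, unlike the released packages): maintain one quadratic model per contributing function, refresh a single model per step, and jump to the minimizer of the summed models.

```python
import numpy as np

rng = np.random.default_rng(7)
D, M = 10, 5                                  # dimension, number of minibatches

# Toy sum-of-functions problem: f_i(x) = 0.5 (x - c_i)^T Q_i (x - c_i).
Qs = [np.diag(rng.uniform(0.5, 5.0, D)) for _ in range(M)]
cs = [rng.normal(size=D) for _ in range(M)]
def grad_i(x, i): return Qs[i] @ (x - cs[i])

# Per-function quadratic model: anchor point, gradient there, and a diagonal
# Hessian estimate updated by a secant (finite-difference) rule.
x = np.zeros(D)
anchors = [x.copy() for _ in range(M)]
grads = [grad_i(x, i) for i in range(M)]
Hs = [np.ones(D) for _ in range(M)]

for step in range(200):
    i = step % M                              # one minibatch evaluation per step
    g = grad_i(x, i)
    dx, dg = x - anchors[i], g - grads[i]
    mask = np.abs(dx) > 1e-10
    Hs[i][mask] = np.clip(dg[mask] / dx[mask], 0.1, 100.0)   # secant estimate
    anchors[i], grads[i] = x.copy(), g
    # Newton step on the sum of the M quadratic models (diagonal solve).
    H_sum = np.sum(Hs, axis=0)
    b = np.sum([H * a - gg for H, a, gg in zip(Hs, anchors, grads)], axis=0)
    x = b / H_sum

x_star = np.linalg.solve(sum(Qs), sum(Q @ c for Q, c in zip(Qs, cs)))
print("distance to optimum:", np.linalg.norm(x - x_star))    # ~0
```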
