Evolving Culture vs Local Minima
We propose a theory that relates difficulty of learning in deep architectures
to culture and language. It is articulated around the following hypotheses: (1)
learning in an individual human brain is hampered by the presence of effective
local minima; (2) this optimization difficulty is particularly important when
it comes to learning higher-level abstractions, i.e., concepts that cover a
vast and highly-nonlinear span of sensory configurations; (3) such high-level
abstractions are best represented in brains by the composition of many levels
of representation, i.e., by deep architectures; (4) a human brain can learn
such high-level abstractions if guided by the signals produced by other humans,
which act as hints or indirect supervision for these high-level abstractions;
and (5), language and the recombination and optimization of mental concepts
provide an efficient evolutionary recombination operator, and this gives rise
to rapid search in the space of communicable ideas that help humans build up
better high-level internal representations of their world. These hypotheses put
together imply that human culture and the evolution of ideas have been crucial
to counter an optimization difficulty that would otherwise make it very hard
for human brains to capture high-level knowledge of the world. The theory is
grounded in experimental observations of
the difficulties of training deep artificial neural networks. Plausible
consequences of this theory for the efficiency of cultural evolution are
sketched.
The Consciousness Prior
A new prior is proposed for learning representations of high-level concepts
of the kind we manipulate with language. This prior can be combined with other
priors in order to help disentangle abstract factors from each other. It is
inspired by cognitive neuroscience theories of consciousness, seen as a
bottleneck through which just a few elements, after having been selected by
attention from a broader pool, are then broadcast and condition further
processing, both in perception and decision-making. The set of recently
selected elements one becomes aware of is seen as forming a low-dimensional
conscious state. This conscious state combines the few concepts
constituting a conscious thought, i.e., what one is immediately conscious of at
a particular moment. We claim that this architectural and
information-processing constraint corresponds to assumptions about the joint
distribution between high-level concepts. To the extent that these assumptions
are generally true (and the form of natural language seems consistent with
them), they can form a useful prior for representation learning. A
low-dimensional thought or conscious state is analogous to a sentence: it
involves only a few variables and yet can make a statement with very high
probability of being true. This is consistent with a joint distribution (over
high-level concepts) which has the form of a sparse factor graph, i.e., where
the dependencies captured by each factor of the factor graph involve only very
few variables while creating a strong dip in the overall energy function. The
consciousness prior also makes it natural to map conscious states to natural
language utterances or to express classical AI knowledge in a form similar to
facts and rules, while also capturing uncertainty as well as the efficient
search implemented by attention mechanisms.
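As an editorial illustration of the sparse-factor-graph assumption, the toy
Python snippet below (our sketch; all names and potentials are hypothetical,
not from the paper) writes the joint energy over high-level concept variables
as a sum of factors that each touch only a few variables, any one of which can
carve a strong dip in the overall energy:

import numpy as np

n_concepts = 64
# Each factor names the small subset of concept variables it depends on.
factors = [
    {"vars": (3, 17), "weight": 5.0},      # e.g. a two-way dependency
    {"vars": (8, 21, 40), "weight": 3.0},  # a three-way dependency
]

def energy(z, factors):
    """Total energy of concept vector z: a sum of sparse factors.
    Low energy = plausible joint configuration of concepts."""
    e = 0.0
    for f in factors:
        subset = z[list(f["vars"])]
        # Toy potential: a factor is 'satisfied' when its few variables
        # agree in sign; satisfaction creates a deep dip in the energy.
        e -= f["weight"] * np.prod(np.tanh(subset))
    return e

z = np.random.randn(n_concepts)
print(energy(z, factors))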
Independently Controllable Features
Finding features that disentangle the different causes of variation in real
data is a difficult task that has nonetheless received considerable attention
in static domains like natural images. Interactive environments, in which an
agent can deliberately take actions, offer an opportunity to tackle this task
better, because the agent can experiment with different actions and observe
their effects. We introduce the idea that in interactive environments, latent
factors that control the variation in observed data can be identified by
figuring out what the agent can control. We propose a naive method to find
factors that explain or measure the effect of the actions of a learner, and
test it in illustrative experiments.
Comment: RLDM submission
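A minimal sketch of the controllability idea, assuming a selectivity-style
score of our own naming (the paper's actual objective may differ in form):
feature k counts as independently controllable when the policy dedicated to
it can change it while leaving the other learned features unchanged.

import numpy as np

def selectivity(f_before, f_after, k, eps=1e-8):
    """Fraction of the total feature change that falls on feature k.
    f_before, f_after: feature vectors before/after an action sampled
    from the policy dedicated to feature k."""
    delta = np.abs(f_after - f_before)
    return delta[k] / (delta.sum() + eps)

# Training would maximize, for each k, the expected selectivity of
# policy pi_k, learning the feature map and the policies jointly.
f0 = np.array([0.2, 0.5, 0.1])
f1 = np.array([0.9, 0.5, 0.1])   # the action moved mostly feature 0
print(selectivity(f0, f1, k=0))  # close to 1 -> feature 0 was controlled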
Low-memory convolutional neural networks through incremental depth-first processing
We introduce an incremental processing scheme for convolutional neural
network (CNN) inference, targeted at embedded applications with limited memory
budgets. Instead of processing layers one by one, individual input pixels are
propagated through all parts of the network they can influence under the given
structural constraints. This depth-first updating scheme comes with hard bounds
on the memory footprint: the memory required is constant in the case of 1D
input and proportional to the square root of the input dimension in the case of
2D input.
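To make the constant-memory claim for 1D input concrete, here is a minimal
sketch (our construction, not the paper's code) of depth-first streaming
inference: each layer keeps only a fixed-size buffer of past activations, so
memory does not grow with the length of the input signal.

import numpy as np

class StreamingConv1D:
    def __init__(self, kernel):
        self.kernel = np.asarray(kernel, dtype=float)
        self.buf = np.zeros(len(kernel))  # fixed-size state per layer

    def push(self, x):
        """Feed one input sample; return one output sample (stride 1;
        the first few outputs see the zero-initialized warm-up buffer)."""
        self.buf = np.roll(self.buf, -1)
        self.buf[-1] = x
        return float(self.buf @ self.kernel[::-1])

layers = [StreamingConv1D([1, 0, -1]), StreamingConv1D([0.25, 0.5, 0.25])]

def depth_first_step(x):
    # Propagate a single input sample through every layer it can
    # influence before touching the next sample.
    for layer in layers:
        x = layer.push(x)
    return x

outputs = [depth_first_step(x) for x in np.sin(np.linspace(0, 6, 50))]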
Depth with Nonlinearity Creates No Bad Local Minima in ResNets
In this paper, we prove that depth with nonlinearity creates no bad local
minima in a type of arbitrarily deep ResNets with arbitrary nonlinear
activation functions, in the sense that the values of all local minima are no
worse than the global minimum value of corresponding classical machine-learning
models, and are guaranteed to further improve via residual representations. As
a result, this paper provides an affirmative answer to an open question stated
in a paper at the Conference on Neural Information Processing Systems (NeurIPS)
2018. This paper advances the optimization theory of deep learning only for
ResNets, not for other network architectures.
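In symbols, the main claim can be paraphrased as follows (our rendering, not
the paper's exact theorem statement): writing L for the ResNet training loss,

\[
  L(\theta) \;\le\; \min_{\phi} L_{\mathrm{classical}}(\phi)
  \quad \text{for every local minimum } \theta,
\]

i.e., no local minimum of the ResNet is worse than the global minimum of the
corresponding classical model, with residual representations guaranteeing
further improvement below this bound.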
Early Inference in Energy-Based Models Approximates Back-Propagation
We show that Langevin MCMC inference in an energy-based model with latent
variables has the property that the early steps of inference, starting from a
stationary point, correspond to propagating error gradients into internal
layers, similarly to back-propagation. The error that is back-propagated is
with respect to visible units that have received an outside driving force
pushing them away from the stationary point. Back-propagated error gradients
correspond to temporal derivatives of the activation of hidden units. This
observation could be an element of a theory for explaining how brains perform
credit assignment in deep hierarchies as efficiently as back-propagation does.
In this theory, the continuous-valued latent variables correspond to averaged
voltage potential (across time, spikes, and possibly neurons in the same
minicolumn), and neural computation corresponds to approximate inference and
error back-propagation at the same time.
Comment: arXiv admin note: text overlap with arXiv:1509.0593
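A toy numpy sketch (ours) of the claim for a simple quadratic energy: after
the network settles to a stationary point, nudging the visible units makes the
early temporal derivative of the latent units equal the back-propagated error.

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3)) * 0.1      # visible-latent coupling

def dE_dh(v, h):
    # E(v, h) = 0.5 ||h||^2 - v^T W h  (a simple quadratic energy)
    return h - W.T @ v

v0 = rng.normal(size=4)
h = np.zeros(3)
for _ in range(200):                   # settle to a stationary point
    h -= 0.1 * dE_dh(v0, h)

eps = 0.01 * rng.normal(size=4)        # outside driving force on v
v = v0 + eps
h_dot = -dE_dh(v, h)                   # early inference step on h

# For this energy the back-propagated error on h is W^T eps; the early
# temporal derivative of h matches it.
print(np.allclose(h_dot, W.T @ eps, atol=1e-6))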
Adaptive Drift-Diffusion Process to Learn Time Intervals
Animals learn the timing between consecutive events very easily. Their
precision is usually proportional to the interval being timed (Weber's law for
timing). Most current timing models either require a central clock and
unbounded accumulator or whole pre-defined populations of delay lines, decaying
traces or oscillators to represent elapsing time. Current adaptive recurrent
neural networks fail at learning to predict the timing of future events (the
'when') in a realistic manner. In this paper, we present a new model of
interval timing, based on simple temporal integrators, derived from
drift-diffusion models. We develop a simple geometric rule to learn 'when'
instead of 'what'. We provide an analytical proof that the model can learn
inter-event intervals in a number of trials independent of the interval size
and that the temporal precision of the system is proportional to the timed
interval. This new model uses no clock, no gradient, no unbounded accumulators,
no delay lines, and has internal noise allowing the generation of individual
trials. Three interesting predictions are made.
Comment: 9 pages, 4 figures
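A hedged reconstruction (ours, not necessarily the paper's exact rule) of such
a geometric update: an integrator drifts toward a fixed threshold of 1, and
after each trial the drift is rescaled by the accumulator value observed at
the reinforced time, so the threshold is reached at that time on later trials.

import numpy as np

rng = np.random.default_rng(1)
dt, T = 0.01, 3.0          # time step and (unknown) interval to learn
drift = 1.0                # initial guess: would time out after 1 s

for trial in range(10):
    x, t = 0.0, 0.0
    while t < T:           # integrate (with noise) until the event
        x += drift * dt + 0.01 * np.sqrt(dt) * rng.normal()
        t += dt
    drift /= x             # geometric correction: next trial x(T) ~= 1
    print(f"trial {trial}: x(T) = {x:.3f}, drift -> {drift:.3f}")

Consistent with the abstract, the number of trials this rule needs does not
depend on the size of T, and the trial-to-trial noise in x scales with the
timed interval.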
Deep Directed Generative Models with Energy-Based Probability Estimation
Training energy-based probabilistic models is confronted with apparently
intractable sums, whose Monte Carlo estimation requires sampling from the
estimated probability distribution in the inner loop of training. This can be
approximately achieved by Markov chain Monte Carlo methods, but may still face
a formidable obstacle: the difficulty of mixing between modes with sharp
concentrations of probability. Whereas an MCMC process is usually derived from
a given energy function based on mathematical considerations and requires an
arbitrarily long time to obtain good and varied samples, we propose to train a
deep directed generative model (not a Markov chain) so that its sampling
distribution approximately matches the energy function that is being trained.
Inspired by generative adversarial networks, the proposed framework involves
training of two models that represent dual views of the estimated probability
distribution: the energy function (mapping an input configuration to a scalar
energy value) and the generator (mapping a noise vector to a generated
configuration), both represented by deep neural networks.
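A self-contained toy sketch (ours) of the dual-model training loop on a 1D
problem, with a quadratic energy and a location-parameter generator standing
in for the deep networks: the energy is pushed down on data and up on
generated samples, while the generator moves its samples toward low energy,
replacing MCMC sampling in the inner loop.

import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=2.0, scale=0.3, size=1000)  # true distribution

theta = 0.0   # energy E(x) = (x - theta)^2, lowest at x = theta
phi = -1.0    # generator g(z) = phi + 0.3 z, a directed sampler

lr = 0.05
for step in range(200):
    x_data = rng.choice(data, 32)
    x_gen = phi + 0.3 * rng.normal(size=32)
    # Energy update: dE/dtheta = -2(x - theta); descend on data,
    # ascend on generated samples.
    grad_theta = (-2 * (x_data - theta)).mean() - (-2 * (x_gen - theta)).mean()
    theta -= lr * grad_theta
    # Generator update: move samples toward low energy,
    # dE(g(z))/dphi = 2(g(z) - theta).
    grad_phi = (2 * (x_gen - theta)).mean()
    phi -= lr * grad_phi

print(theta, phi)  # both approach 2.0, the data mode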
Interpretable Convolutional Filters with SincNet
Deep learning is currently playing a crucial role in the progress toward
higher levels of artificial intelligence. This paradigm allows neural networks
to learn complex and abstract representations that are progressively obtained
by combining simpler ones. Nevertheless, the internal "black-box"
representations automatically discovered by current neural architectures often
suffer from a lack of interpretability, making the study of explainable
machine learning techniques of primary interest. This paper summarizes our
recent efforts to
develop a more interpretable neural model for directly processing speech from
the raw waveform. In particular, we propose SincNet, a novel Convolutional
Neural Network (CNN) that encourages the first layer to discover more
meaningful filters by exploiting parametrized sinc functions. In contrast to
standard CNNs, which learn all the elements of each filter, only low and high
cutoff frequencies of band-pass filters are directly learned from data. This
inductive bias offers a very compact way to derive a customized filter-bank
front-end that depends only on a few parameters with a clear physical meaning.
Our experiments, conducted on both speaker and speech recognition, show that
the proposed architecture converges faster, performs better, and is more
interpretable than standard CNNs.
Comment: In Proceedings of NIPS@IRASL 2018. arXiv admin note: substantial text
overlap with arXiv:1808.0015
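A minimal numpy rendering of the parametrized filter the abstract describes
(our sketch; the released SincNet code differs in details such as the exact
windowing and minimum-bandwidth constraints): each first-layer kernel is a
band-pass built as the difference of two sinc low-pass filters, so only the
two cutoff frequencies are learned instead of every kernel tap.

import numpy as np

def sinc_bandpass(f_low, f_high, kernel_size=101):
    """Band-pass FIR kernel with cutoffs in normalized frequency
    (0 .. 0.5, i.e. fraction of the sampling rate)."""
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    # Difference of two low-pass sinc filters = band-pass.
    kernel = (2 * f_high * np.sinc(2 * f_high * n)
              - 2 * f_low * np.sinc(2 * f_low * n))
    return kernel * np.hamming(kernel_size)  # smooth the truncation

# During training only f_low and f_high are updated per filter,
# instead of all kernel_size taps of a standard CNN kernel.
kernel = sinc_bandpass(f_low=0.05, f_high=0.15)
speech = np.random.randn(16000)              # stand-in waveform
filtered = np.convolve(speech, kernel, mode="same")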
Reweighted Wake-Sleep
Training deep directed graphical models with many hidden variables and
performing inference remains a major challenge. Helmholtz machines and deep
belief networks are such models, and the wake-sleep algorithm has been proposed
to train them. The wake-sleep algorithm relies on training not just the
directed generative model but also a conditional generative model (the
inference network) that runs backward from visible to latent, estimating the
posterior distribution of latent given visible. We propose a novel
interpretation of the wake-sleep algorithm which suggests that better
estimators of the gradient can be obtained by sampling latent variables
multiple times from the inference network. This view is based on importance
sampling as an estimator of the likelihood, with the approximate inference
network as a proposal distribution. This interpretation is confirmed
experimentally, showing that better likelihood can be achieved with this
reweighted wake-sleep procedure. Based on this interpretation, we propose that
a sigmoidal belief network is not a sufficiently powerful layer model for the
inference network to recover a good estimator of the posterior distribution of
latent variables. Our experiments show that using a more
powerful layer model, such as NADE, yields substantially better generative
models.
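A toy numpy sketch (ours) of the importance-sampling view, with 1D Gaussians
standing in for the generative and inference networks: the inference network
q(h|x) serves as the proposal, and drawing h several times yields a better
likelihood estimate than the single sample of classical wake-sleep.

import numpy as np

rng = np.random.default_rng(3)

def log_normal(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def log_joint(x, h):
    # Generative model: h ~ N(0, 1), x | h ~ N(h, 0.5).
    return log_normal(h, 0.0, 1.0) + log_normal(x, h, 0.5)

def estimate_log_px(x, K):
    # h ~ q(h|x): the inference network as importance-sampling proposal.
    h = rng.normal(loc=x, scale=0.6, size=K)
    log_w = log_joint(x, h) - log_normal(h, x, 0.6)
    m = log_w.max()
    return m + np.log(np.exp(log_w - m).mean())  # stable log-mean-exp

x = 1.3
print(estimate_log_px(x, K=1), estimate_log_px(x, K=50))
print(log_normal(x, 0.0, np.sqrt(1.25)))  # exact log p(x) for this toy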
