Evolving Culture vs Local Minima
We propose a theory that relates difficulty of learning in deep architectures
to culture and language. It is articulated around the following hypotheses: (1)
learning in an individual human brain is hampered by the presence of effective
local minima; (2) this optimization difficulty is particularly important when
it comes to learning higher-level abstractions, i.e., concepts that cover a
vast and highly-nonlinear span of sensory configurations; (3) such high-level
abstractions are best represented in brains by the composition of many levels
of representation, i.e., by deep architectures; (4) a human brain can learn
such high-level abstractions if guided by the signals produced by other humans,
which act as hints or indirect supervision for these high-level abstractions;
and (5), language and the recombination and optimization of mental concepts
provide an efficient evolutionary recombination operator, and this gives rise
to rapid search in the space of communicable ideas that help humans build up
better high-level internal representations of their world. These hypotheses put
together imply that human culture and the evolution of ideas have been crucial
to counter an optimization difficulty that would otherwise make it very hard
for human brains to capture high-level knowledge of the world. The theory is
grounded in experimental observations of
the difficulties of training deep artificial neural networks. Plausible
consequences of this theory for the efficiency of cultural evolution are
sketched.
The Consciousness Prior
A new prior is proposed for learning representations of high-level concepts
of the kind we manipulate with language. This prior can be combined with other
priors in order to help disentangle abstract factors from each other. It is
inspired by cognitive neuroscience theories of consciousness, seen as a
bottleneck through which just a few elements, after having been selected by
attention from a broader pool, are then broadcast and condition further
processing, both in perception and decision-making. The set of recently
selected elements one becomes aware of is seen as forming a low-dimensional
conscious state. This conscious state combines the few concepts
constituting a conscious thought, i.e., what one is immediately conscious of at
a particular moment. We claim that this architectural and
information-processing constraint corresponds to assumptions about the joint
distribution between high-level concepts. To the extent that these assumptions
are generally true (and the form of natural language seems consistent with
them), they can form a useful prior for representation learning. A
low-dimensional thought or conscious state is analogous to a sentence: it
involves only a few variables and yet can make a statement with very high
probability of being true. This is consistent with a joint distribution (over
high-level concepts) which has the form of a sparse factor graph, i.e., where
the dependencies captured by each factor of the factor graph involve only very
few variables while creating a strong dip in the overall energy function. The
consciousness prior also makes it natural to map conscious states to natural
language utterances or to express classical AI knowledge in a form similar to
facts and rules, while also capturing uncertainty as well as the efficient
search implemented by attention mechanisms.
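As an editorial illustration of the sparse-factor-graph assumption, the toy
Python snippet below (our sketch; all names and potentials are hypothetical,
not from the paper) writes the joint energy over high-level concept variables
as a sum of factors that each touch only a few variables, any one of which can
carve a strong dip in the overall energy:

import numpy as np

n_concepts = 64
# Each factor names the small subset of concept variables it depends on.
factors = [
    {"vars": (3, 17), "weight": 5.0},      # e.g. a two-way dependency
    {"vars": (8, 21, 40), "weight": 3.0},  # a three-way dependency
]

def energy(z, factors):
    """Total energy of concept vector z: a sum of sparse factors.
    Low energy = plausible joint configuration of concepts."""
    e = 0.0
    for f in factors:
        subset = z[list(f["vars"])]
        # Toy potential: a factor is 'satisfied' when its few variables
        # agree in sign; satisfaction creates a deep dip in the energy.
        e -= f["weight"] * np.prod(np.tanh(subset))
    return e

z = np.random.randn(n_concepts)
print(energy(z, factors))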
Independently Controllable Features
Finding features that disentangle the different causes of variation in real
data is a difficult task that has nonetheless received considerable attention
in static domains like natural images. Interactive environments, in which an
agent can deliberately take actions, offer an opportunity to tackle this task
better, because the agent can experiment with different actions and observe
their effects. We introduce the idea that in interactive environments, latent
factors that control the variation in observed data can be identified by
figuring out what the agent can control. We propose a naive method to find
factors that explain or measure the effect of the actions of a learner, and
test it in illustrative experiments.
Comment: RLDM submission
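A minimal sketch of the controllability idea, assuming a selectivity-style
score of our own naming (the paper's actual objective may differ in form):
feature k counts as independently controllable when the policy dedicated to
it can change it while leaving the other learned features unchanged.

import numpy as np

def selectivity(f_before, f_after, k, eps=1e-8):
    """Fraction of the total feature change that falls on feature k.
    f_before, f_after: feature vectors before/after an action sampled
    from the policy dedicated to feature k."""
    delta = np.abs(f_after - f_before)
    return delta[k] / (delta.sum() + eps)

# Training would maximize, for each k, the expected selectivity of
# policy pi_k, learning the feature map and the policies jointly.
f0 = np.array([0.2, 0.5, 0.1])
f1 = np.array([0.9, 0.5, 0.1])   # the action moved mostly feature 0
print(selectivity(f0, f1, k=0))  # close to 1 -> feature 0 was controlled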
Low-memory convolutional neural networks through incremental depth-first processing
We introduce an incremental processing scheme for convolutional neural
network (CNN) inference, targeted at embedded applications with limited memory
budgets. Instead of processing layers one by one, individual input pixels are
propagated through all parts of the network they can influence under the given
structural constraints. This depth-first updating scheme comes with hard bounds
on the memory footprint: the memory required is constant in the case of 1D
input and proportional to the square root of the input dimension in the case of
2D input.
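To make the constant-memory claim for 1D input concrete, here is a minimal
sketch (our construction, not the paper's code) of depth-first streaming
inference: each layer keeps only a fixed-size buffer of past activations, so
memory does not grow with the length of the input signal.

import numpy as np

class StreamingConv1D:
    def __init__(self, kernel):
        self.kernel = np.asarray(kernel, dtype=float)
        self.buf = np.zeros(len(kernel))  # fixed-size state per layer

    def push(self, x):
        """Feed one input sample; return one output sample (stride 1;
        the first few outputs see the zero-initialized warm-up buffer)."""
        self.buf = np.roll(self.buf, -1)
        self.buf[-1] = x
        return float(self.buf @ self.kernel[::-1])

layers = [StreamingConv1D([1, 0, -1]), StreamingConv1D([0.25, 0.5, 0.25])]

def depth_first_step(x):
    # Propagate a single input sample through every layer it can
    # influence before touching the next sample.
    for layer in layers:
        x = layer.push(x)
    return x

outputs = [depth_first_step(x) for x in np.sin(np.linspace(0, 6, 50))]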
Depth with Nonlinearity Creates No Bad Local Minima in ResNets
In this paper, we prove that depth with nonlinearity creates no bad local
minima in a type of arbitrarily deep ResNets with arbitrary nonlinear
activation functions, in the sense that the values of all local minima are no
worse than the global minimum value of corresponding classical machine-learning
models, and are guaranteed to further improve via residual representations. As
a result, this paper provides an affirmative answer to an open question stated
in a paper at the Conference on Neural Information Processing Systems (NeurIPS)
2018. This paper advances the optimization theory of deep learning only for
ResNets, not for other network architectures.
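In symbols, the main claim can be paraphrased as follows (our rendering, not
the paper's exact theorem statement): writing L for the ResNet training loss,

\[
  L(\theta) \;\le\; \min_{\phi} L_{\mathrm{classical}}(\phi)
  \quad \text{for every local minimum } \theta,
\]

i.e., no local minimum of the ResNet is worse than the global minimum of the
corresponding classical model, with residual representations guaranteeing
further improvement below this bound.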
Early Inference in Energy-Based Models Approximates Back-Propagation
We show that Langevin MCMC inference in an energy-based model with latent
variables has the property that the early steps of inference, starting from a
stationary point, correspond to propagating error gradients into internal
layers, similarly to back-propagation. The error that is back-propagated is
with respect to visible units that have received an outside driving force
pushing them away from the stationary point. Back-propagated error gradients
correspond to temporal derivatives of the activation of hidden units. This
observation could be an element of a theory for explaining how brains perform
credit assignment in deep hierarchies as efficiently as back-propagation does.
In this theory, the continuous-valued latent variables correspond to averaged
voltage potential (across time, spikes, and possibly neurons in the same
minicolumn), and neural computation corresponds to approximate inference and
error back-propagation at the same time.
Comment: arXiv admin note: text overlap with arXiv:1509.0593
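A toy numpy sketch (ours) of the claim for a simple quadratic energy: after
the network settles to a stationary point, nudging the visible units makes the
early temporal derivative of the latent units equal the back-propagated error.

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3)) * 0.1      # visible-latent coupling

def dE_dh(v, h):
    # E(v, h) = 0.5 ||h||^2 - v^T W h  (a simple quadratic energy)
    return h - W.T @ v

v0 = rng.normal(size=4)
h = np.zeros(3)
for _ in range(200):                   # settle to a stationary point
    h -= 0.1 * dE_dh(v0, h)

eps = 0.01 * rng.normal(size=4)        # outside driving force on v
v = v0 + eps
h_dot = -dE_dh(v, h)                   # early inference step on h

# For this energy the back-propagated error on h is W^T eps; the early
# temporal derivative of h matches it.
print(np.allclose(h_dot, W.T @ eps, atol=1e-6))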
Adaptive Drift-Diffusion Process to Learn Time Intervals
Animals learn the timing between consecutive events very easily. Their
precision is usually proportional to the interval being timed (Weber's law for
timing). Most current timing models either require a central clock and
unbounded accumulator or whole pre-defined populations of delay lines, decaying
traces or oscillators to represent elapsing time. Current adaptive recurrent
neural networks fail at learning to predict the timing of future events (the
'when') in a realistic manner. In this paper, we present a new model of
interval timing, based on simple temporal integrators, derived from
drift-diffusion models. We develop a simple geometric rule to learn 'when'
instead of 'what'. We provide an analytical proof that the model can learn
inter-event intervals in a number of trials independent of the interval size
and that the temporal precision of the system is proportional to the timed
interval. This new model uses no clock, no gradient, no unbounded accumulators,
no delay lines, and has internal noise allowing the generation of individual
trials. Three interesting predictions are made.
Comment: 9 pages, 4 figures
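A hedged reconstruction (ours, not necessarily the paper's exact rule) of such
a geometric update: an integrator drifts toward a fixed threshold of 1, and
after each trial the drift is rescaled by the accumulator value observed at
the reinforced time, so the threshold is reached at that time on later trials.

import numpy as np

rng = np.random.default_rng(1)
dt, T = 0.01, 3.0          # time step and (unknown) interval to learn
drift = 1.0                # initial guess: would time out after 1 s

for trial in range(10):
    x, t = 0.0, 0.0
    while t < T:           # integrate (with noise) until the event
        x += drift * dt + 0.01 * np.sqrt(dt) * rng.normal()
        t += dt
    drift /= x             # geometric correction: next trial x(T) ~= 1
    print(f"trial {trial}: x(T) = {x:.3f}, drift -> {drift:.3f}")

Consistent with the abstract, the number of trials this rule needs does not
depend on the size of T, and the trial-to-trial noise in x scales with the
timed interval.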
Deep Directed Generative Models with Energy-Based Probability Estimation
Training energy-based probabilistic models is confronted with apparently
intractable sums, whose Monte Carlo estimation requires sampling from the
estimated probability distribution in the inner loop of training. This can be
approximately achieved by Markov chain Monte Carlo methods, but may still face
a formidable obstacle: the difficulty of mixing between modes with sharp
concentrations of probability. Whereas an MCMC process is usually derived from
a given energy function based on mathematical considerations and requires an
arbitrarily long time to obtain good and varied samples, we propose to train a
deep directed generative model (not a Markov chain) so that its sampling
distribution approximately matches the energy function that is being trained.
Inspired by generative adversarial networks, the proposed framework involves
training of two models that represent dual views of the estimated probability
distribution: the energy function (mapping an input configuration to a scalar
energy value) and the generator (mapping a noise vector to a generated
configuration), both represented by deep neural networks.
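A self-contained toy sketch (ours) of the dual-model training loop on a 1D
problem, with a quadratic energy and a location-parameter generator standing
in for the deep networks: the energy is pushed down on data and up on
generated samples, while the generator moves its samples toward low energy,
replacing MCMC sampling in the inner loop.

import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=2.0, scale=0.3, size=1000)  # true distribution

theta = 0.0   # energy E(x) = (x - theta)^2, lowest at x = theta
phi = -1.0    # generator g(z) = phi + 0.3 z, a directed sampler

lr = 0.05
for step in range(200):
    x_data = rng.choice(data, 32)
    x_gen = phi + 0.3 * rng.normal(size=32)
    # Energy update: dE/dtheta = -2(x - theta); descend on data,
    # ascend on generated samples.
    grad_theta = (-2 * (x_data - theta)).mean() - (-2 * (x_gen - theta)).mean()
    theta -= lr * grad_theta
    # Generator update: move samples toward low energy,
    # dE(g(z))/dphi = 2(g(z) - theta).
    grad_phi = (2 * (x_gen - theta)).mean()
    phi -= lr * grad_phi

print(theta, phi)  # both approach 2.0, the data mode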
Interpretable Convolutional Filters with SincNet
Deep learning is currently playing a crucial role in the progress toward
higher levels of artificial intelligence. This paradigm allows neural networks
to learn complex and abstract representations that are progressively obtained
by combining simpler ones. Nevertheless, the internal "black-box"
representations automatically discovered by current neural architectures often
suffer from a lack of interpretability, making the study of explainable
machine learning techniques of primary interest. This paper summarizes our
recent efforts to
develop a more interpretable neural model for directly processing speech from
the raw waveform. In particular, we propose SincNet, a novel Convolutional
Neural Network (CNN) that encourages the first layer to discover more
meaningful filters by exploiting parametrized sinc functions. In contrast to
standard CNNs, which learn all the elements of each filter, only low and high
cutoff frequencies of band-pass filters are directly learned from data. This
inductive bias offers a very compact way to derive a customized filter-bank
front-end that depends only on a few parameters with a clear physical meaning.
Our experiments, conducted on both speaker and speech recognition, show that
the proposed architecture converges faster, performs better, and is more
interpretable than standard CNNs.
Comment: In Proceedings of NIPS@IRASL 2018. arXiv admin note: substantial text
overlap with arXiv:1808.0015
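A minimal numpy rendering of the parametrized filter the abstract describes
(our sketch; the released SincNet code differs in details such as the exact
windowing and minimum-bandwidth constraints): each first-layer kernel is a
band-pass built as the difference of two sinc low-pass filters, so only the
two cutoff frequencies are learned instead of every kernel tap.

import numpy as np

def sinc_bandpass(f_low, f_high, kernel_size=101):
    """Band-pass FIR kernel with cutoffs in normalized frequency
    (0 .. 0.5, i.e. fraction of the sampling rate)."""
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    # Difference of two low-pass sinc filters = band-pass.
    kernel = (2 * f_high * np.sinc(2 * f_high * n)
              - 2 * f_low * np.sinc(2 * f_low * n))
    return kernel * np.hamming(kernel_size)  # smooth the truncation

# During training only f_low and f_high are updated per filter,
# instead of all kernel_size taps of a standard CNN kernel.
kernel = sinc_bandpass(f_low=0.05, f_high=0.15)
speech = np.random.randn(16000)              # stand-in waveform
filtered = np.convolve(speech, kernel, mode="same")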
Reweighted Wake-Sleep
Training deep directed graphical models with many hidden variables and
performing inference remains a major challenge. Helmholtz machines and deep
belief networks are such models, and the wake-sleep algorithm has been proposed
to train them. The wake-sleep algorithm relies on training not just the
directed generative model but also a conditional generative model (the
inference network) that runs backward from visible to latent, estimating the
posterior distribution of latent given visible. We propose a novel
interpretation of the wake-sleep algorithm which suggests that better
estimators of the gradient can be obtained by sampling latent variables
multiple times from the inference network. This view is based on importance
sampling as an estimator of the likelihood, with the approximate inference
network as a proposal distribution. This interpretation is confirmed
experimentally, showing that better likelihood can be achieved with this
reweighted wake-sleep procedure. Based on this interpretation, we propose that
a sigmoidal belief network is not a sufficiently powerful layer model for the
inference network to recover a good estimator of the posterior distribution of
latent variables. Our experiments show that using a more
powerful layer model, such as NADE, yields substantially better generative
models.
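A toy numpy sketch (ours) of the importance-sampling view, with 1D Gaussians
standing in for the generative and inference networks: the inference network
q(h|x) serves as the proposal, and drawing h several times yields a better
likelihood estimate than the single sample of classical wake-sleep.

import numpy as np

rng = np.random.default_rng(3)

def log_normal(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def log_joint(x, h):
    # Generative model: h ~ N(0, 1), x | h ~ N(h, 0.5).
    return log_normal(h, 0.0, 1.0) + log_normal(x, h, 0.5)

def estimate_log_px(x, K):
    # h ~ q(h|x): the inference network as importance-sampling proposal.
    h = rng.normal(loc=x, scale=0.6, size=K)
    log_w = log_joint(x, h) - log_normal(h, x, 0.6)
    m = log_w.max()
    return m + np.log(np.exp(log_w - m).mean())  # stable log-mean-exp

x = 1.3
print(estimate_log_px(x, K=1), estimate_log_px(x, K=50))
print(log_normal(x, 0.0, np.sqrt(1.25)))  # exact log p(x) for this toy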
