30 research outputs found
Residual Doping in Homoepitaxial Zinc Oxide Layers Grown by Metal Organic Vapor Phase Epitaxy
Full maximum-entropy mobility spectrum analysis was carried out on the basis of temperature- and magnetic-field-dependent Hall measurements to assess the transport properties of homoepitaxial metal organic vapor phase epitaxy zinc oxide layers. Two different conductivity channels were clearly identified, and the channel with the higher mobility and higher carrier concentration is associated with the epitaxial layer. A hydrogen impurity acting as a residual donor and as a passivating species for acceptors is proposed to explain the higher carrier concentration and mobility in the epilayer. In contrast to heteroepitaxial layers, no conduction channel is observed at the substrate-to-epilayer interface.
Self-Imitation Advantage Learning
Self-imitation learning is a Reinforcement Learning (RL) method that
encourages actions whose returns were higher than expected, which helps in hard
exploration and sparse reward problems. It was shown to improve the performance
of on-policy actor-critic methods in several discrete control tasks.
Nevertheless, applying self-imitation to the mostly action-value based
off-policy RL methods is not straightforward. We propose SAIL, a novel
generalization of self-imitation learning for off-policy RL, based on a
modification of the Bellman optimality operator that we connect to Advantage
Learning. Crucially, our method mitigates the problem of stale returns by
choosing the most optimistic return estimate between the observed return and
the current action-value for self-imitation. We demonstrate the empirical
effectiveness of SAIL on the Arcade Learning Environment, with a focus on hard
exploration games. Comment: AAMAS 202
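As a rough, non-authoritative illustration of the target described above, here is a minimal NumPy sketch of a tabular update in which an Advantage-Learning-style correction uses the more optimistic of the observed return and the current action-value. The tabular setting, the function name sail_target, and the constants gamma and alpha are assumptions for illustration, not the authors' implementation.

import numpy as np

def sail_target(q, s, a, r, s_next, observed_return, gamma=0.99, alpha=0.9):
    """One plausible reading of a SAIL-style target for a tabular Q-table q:
    the standard Bellman optimality target plus an Advantage-Learning-style
    correction in which the current action-value Q(s, a) is replaced by
    max(Q(s, a), G), where G is the (possibly stale) observed return."""
    bellman = r + gamma * np.max(q[s_next])        # T*Q(s, a)
    optimistic = max(q[s, a], observed_return)     # most optimistic estimate (self-imitation)
    gap = np.max(q[s]) - optimistic                # V(s) - max(Q(s, a), G)
    return bellman - alpha * gap

# Toy usage: 3 states, 2 actions, one TD update toward the SAIL-style target.
q = np.zeros((3, 2))
target = sail_target(q, s=0, a=1, r=1.0, s_next=2, observed_return=2.5)
q[0, 1] += 0.1 * (target - q[0, 1])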
Self-Attentional Credit Assignment for Transfer in Reinforcement Learning
The ability to transfer knowledge to novel environments and tasks is a
sensible desideratum for general learning agents. Despite its apparent promise,
transfer in RL is still an open and little-explored research area. In this
paper, we take a new perspective on transfer: we suggest that the
ability to assign credit unveils structural invariants in the tasks that can be
transferred to make RL more sample-efficient. Our main contribution is SECRET,
a novel approach to transfer learning for RL that uses a backward-view credit
assignment mechanism based on a self-attentive architecture. Two aspects are
key to its generality: it learns to assign credit as a separate offline
supervised process and exclusively modifies the reward function. Consequently,
it can be supplemented by transfer methods that do not modify the reward
function, and it can be plugged on top of any RL algorithm. Comment: 21 pages, 10 figures, 3 tables (accepted as an oral presentation at the Learning Transferable Skills workshop, NeurIPS 2019).
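The abstract describes backward-view credit assignment that only modifies the reward function. The sketch below gives one hedged reading of that idea: given attention weights from some already-trained self-attentive credit model (stubbed here with random lower-triangular weights), each observed reward is redistributed over the past steps the model attends to, leaving the total return unchanged. The function name, the uniform normalization, and the random stand-in are assumptions, not the paper's exact procedure.

import numpy as np

def redistribute_reward(rewards, attention):
    """Backward-view reward shaping in the spirit of the approach above
    (illustrative only). rewards has shape (T,); attention has shape (T, T),
    with attention[t, k] the weight a trained credit model assigns to step
    k <= t when explaining the reward observed at step t. Each reward is
    spread over the steps deemed responsible, keeping the return unchanged."""
    T = len(rewards)
    shaped = np.zeros(T)
    for t in range(T):
        weights = attention[t, : t + 1]
        weights = weights / (weights.sum() + 1e-8)
        shaped[: t + 1] += rewards[t] * weights
    return shaped

# Stand-in for attention weights produced by a trained self-attentive credit model.
rng = np.random.default_rng(0)
T = 5
attention = np.tril(rng.random((T, T)))
rewards = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
print(redistribute_reward(rewards, attention))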
There Is No Turning Back: A Self-Supervised Approach for Reversibility-Aware Reinforcement Learning
We propose to learn to distinguish reversible from irreversible actions for
better informed decision-making in Reinforcement Learning (RL). From
theoretical considerations, we show that approximate reversibility can be
learned through a simple surrogate task: ranking randomly sampled trajectory
events in chronological order. Intuitively, pairs of events that are always
observed in the same order are likely to be separated by an irreversible
sequence of actions. Conveniently, learning the temporal order of events can be
done in a fully self-supervised way, which we use to estimate the reversibility
of actions from experience, without any priors. We propose two different
strategies that incorporate reversibility in RL agents, one strategy for
exploration (RAE) and one strategy for control (RAC). We demonstrate the
potential of reversibility-aware agents in several environments, including the
challenging Sokoban game. In synthetic tasks, we show that we can learn control
policies that never fail and reduce to zero the side-effects of interactions,
even without access to the reward function.
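The surrogate task above (ranking randomly sampled trajectory events in chronological order) can be sketched in a few lines. The NumPy toy below samples observation pairs from a trajectory, labels them by temporal order, and fits a logistic precedence model; confident predictions for a transition would flag it as likely irreversible. The window size, feature choice, and model are assumptions, and the RAE/RAC strategies themselves are not shown.

import numpy as np

rng = np.random.default_rng(0)

def sample_pairs(trajectory, num_pairs=64, window=10):
    """Surrogate task: sample pairs of observations from a trajectory and label
    them by whether they are presented in chronological order. trajectory has
    shape (T, obs_dim)."""
    T = len(trajectory)
    xs, ys = [], []
    for _ in range(num_pairs):
        i = rng.integers(0, T - 1)
        j = rng.integers(i + 1, min(i + window, T - 1) + 1)
        if rng.random() < 0.5:
            xs.append(np.concatenate([trajectory[i], trajectory[j]]))
            ys.append(1.0)  # chronological order
        else:
            xs.append(np.concatenate([trajectory[j], trajectory[i]]))
            ys.append(0.0)  # swapped order
    return np.array(xs), np.array(ys)

def train_precedence(xs, ys, lr=0.1, steps=500):
    """Minimal logistic model: sigmoid(x . w) estimates the probability that the
    first observation in the pair precedes the second. Pairs that are always
    classified with high confidence suggest irreversible transitions."""
    w = np.zeros(xs.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-xs @ w))
        w -= lr * xs.T @ (p - ys) / len(ys)
    return w

trajectory = rng.normal(size=(100, 4)).cumsum(axis=0)  # toy random-walk trajectory
xs, ys = sample_pairs(trajectory)
w = train_precedence(xs, ys)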
Better state exploration using action sequence equivalence
Incorporating prior knowledge into reinforcement learning algorithms is largely an open question. Even when insights about the environment dynamics are available, reinforcement learning is traditionally used in a tabula rasa setting and must explore and learn everything from scratch. In this paper, we consider the problem of exploiting priors about action sequence equivalence: that is, when different sequences of actions produce the same effect. We propose a new local exploration strategy calibrated to minimize collisions and maximize new state visitations. We show that this strategy can be computed at little cost by solving a convex optimization problem. By replacing the usual ϵ-greedy strategy in a DQN, we demonstrate its potential in several environments with various dynamic structures.
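To make the notion of action sequence equivalence concrete, here is a toy sketch in a grid world where moves commute, so sorting a sequence yields a canonical form; exploratory sequences are drawn while skipping any sequence equivalent to one already tried (a collision). This only illustrates the prior itself; the paper's convex-optimization exploration strategy is not reproduced, and the names below are placeholders.

import itertools
import random

# Toy grid world where moves commute, so two action sequences are equivalent
# iff they contain the same moves in any order; sorting yields a canonical form.
ACTIONS = ["up", "down", "left", "right"]

def canonical(seq):
    return tuple(sorted(seq))

def sample_noncolliding_sequences(length=2, budget=5, seed=0):
    """Draw exploratory action sequences while skipping any sequence equivalent
    to one already drawn, so that no two draws collide in the same state."""
    rng = random.Random(seed)
    candidates = list(itertools.product(ACTIONS, repeat=length))
    rng.shuffle(candidates)
    seen, chosen = set(), []
    for seq in candidates:
        if canonical(seq) in seen:
            continue  # equivalent to a previously drawn sequence: a collision
        seen.add(canonical(seq))
        chosen.append(seq)
        if len(chosen) == budget:
            break
    return chosen

print(sample_noncolliding_sequences())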
Adversarially Guided Actor-Critic
Despite definite success in deep reinforcement learning problems, actor-critic algorithms are still confronted with sample inefficiency in complex environments, particularly in tasks where efficient exploration is a bottleneck. These methods consider a policy (the actor) and a value function (the critic) whose respective losses are built using different motivations and approaches. This paper introduces a third protagonist: the adversary. While the adversary mimics the actor by minimizing the KL-divergence between their respective action distributions, the actor, in addition to learning to solve the task, tries to differentiate itself from the adversary's predictions. This novel objective stimulates the actor to follow strategies that could not have been correctly predicted from previous trajectories, making its behavior innovative in tasks where the reward is extremely rare. Our experimental analysis shows that the resulting Adversarially Guided Actor-Critic (AGAC) algorithm leads to more exhaustive exploration. Notably, AGAC outperforms current state-of-the-art methods on a set of various hard-exploration and procedurally-generated tasks.
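One hedged reading of the interplay described above: the adversary is trained to minimize a KL-divergence toward the actor's action distribution, while the actor's advantage is augmented with a bonus for actions the adversary failed to predict. The NumPy sketch below computes these two signals; the coefficient c and the exact form of the bonus are assumptions, not the authors' losses.

import numpy as np

def kl(p, q, eps=1e-8):
    """KL divergence between two discrete action distributions."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def agac_style_signals(pi_actor, pi_adv, advantage, action, c=0.5):
    """Illustrative signals, not the authors' exact losses:
    - the adversary minimizes KL(actor || adversary), i.e. learns to predict
      the actor's action distribution;
    - the actor's advantage receives a bonus for actions to which the adversary
      assigned low probability, pushing the actor toward hard-to-predict behavior."""
    adversary_loss = kl(pi_actor, pi_adv)
    novelty_bonus = np.log(pi_actor[action] + 1e-8) - np.log(pi_adv[action] + 1e-8)
    return adversary_loss, advantage + c * novelty_bonus

pi_actor = np.array([0.7, 0.2, 0.1])
pi_adv = np.array([0.4, 0.4, 0.2])
print(agac_style_signals(pi_actor, pi_adv, advantage=1.0, action=0))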
WARP: On the Benefits of Weight Averaged Rewarded Policies
Reinforcement learning from human feedback (RLHF) aligns large language
models (LLMs) by encouraging their generations to have high rewards, using a
reward model trained on human preferences. To prevent the forgetting of
pre-trained knowledge, RLHF usually incorporates a KL regularization; this
forces the policy to remain close to its supervised fine-tuned initialization,
though it hinders the reward optimization. To tackle the trade-off between KL
and reward, in this paper we introduce a novel alignment strategy named Weight
Averaged Rewarded Policies (WARP). WARP merges policies in the weight space at
three distinct stages. First, it uses the exponential moving average of the
policy as a dynamic anchor in the KL regularization. Second, it applies
spherical interpolation to merge independently fine-tuned policies into a new
enhanced one. Third, it linearly interpolates between this merged model and the
initialization, to recover features from pre-training. This procedure is then
applied iteratively, with each iteration's final model used as an advanced
initialization for the next, progressively refining the KL-reward Pareto front,
achieving superior rewards at fixed KL. Experiments with GEMMA policies
validate that WARP improves their quality and alignment, outperforming other
open-source LLMs. Comment: 11 main pages (34 pages with Appendix).
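The three weight-space stages described above can be sketched directly on flat parameter vectors. In the NumPy sketch below, ema gives the dynamic anchor, slerp merges two independently fine-tuned policies, and lerp_toward_init interpolates back toward the initialization; the decay and interpolation coefficients are illustrative, and whether the interpolations act on raw weights or on deltas from the initialization is an assumption here, not a detail taken from the paper.

import numpy as np

def ema(anchor, policy, decay=0.99):
    """Stage 1: exponential moving average of the policy, used as the dynamic
    anchor of the KL regularization."""
    return decay * anchor + (1.0 - decay) * policy

def slerp(theta_a, theta_b, t=0.5, eps=1e-8):
    """Stage 2: spherical interpolation between two independently fine-tuned
    policies, here applied to flat weight vectors."""
    a = theta_a / (np.linalg.norm(theta_a) + eps)
    b = theta_b / (np.linalg.norm(theta_b) + eps)
    omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    if omega < eps:  # nearly parallel: fall back to linear interpolation
        return (1 - t) * theta_a + t * theta_b
    return (np.sin((1 - t) * omega) * theta_a + np.sin(t * omega) * theta_b) / np.sin(omega)

def lerp_toward_init(theta_merged, theta_init, eta=0.3):
    """Stage 3: linear interpolation back toward the initialization to recover
    pre-trained features."""
    return (1 - eta) * theta_merged + eta * theta_init

rng = np.random.default_rng(0)
theta_init = rng.normal(size=1000)                  # stand-in for the fine-tuned initialization
theta_a = theta_init + 0.1 * rng.normal(size=1000)  # two independently fine-tuned policies
theta_b = theta_init + 0.1 * rng.normal(size=1000)
theta_next = lerp_toward_init(slerp(theta_a, theta_b), theta_init)  # initialization of the next iteration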
Direct Language Model Alignment from Online AI Feedback
Direct alignment from preferences (DAP) methods, such as DPO, have recently
emerged as efficient alternatives to reinforcement learning from human feedback
(RLHF) that do not require a separate reward model. However, the preference
datasets used in DAP methods are usually collected ahead of training and never
updated, thus the feedback is purely offline. Moreover, responses in these
datasets are often sampled from a language model distinct from the one being
aligned, and since the model evolves over training, the alignment phase is
inevitably off-policy. In this study, we posit that online feedback is key and
improves DAP methods. Our method, online AI feedback (OAIF), uses an LLM as
annotator: on each training iteration, we sample two responses from the current
model and prompt the LLM annotator to choose which one is preferred, thus
providing online feedback. Despite its simplicity, we demonstrate via human
evaluation in several tasks that OAIF outperforms both offline DAP and RLHF
methods. We further show that the feedback leveraged in OAIF is easily
controllable via instruction prompts to the LLM annotator. Comment: 18 pages, 9 figures, 4 tables.
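A schematic of one OAIF training step, under the assumption that the DAP loss is DPO: sample two responses from the current model, let an LLM annotator pick the preferred one, and build the loss from this fresh on-policy pair. The callables sample, logp, ref_logp, and annotator_prefers are placeholders for the real policy, reference model, and annotator, not an actual API.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO objective on a single preference pair:
    -log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -np.log(sigmoid(beta * margin) + 1e-8)

def oaif_step(prompt, sample, logp, ref_logp, annotator_prefers):
    """One online step as described in the abstract: sample two responses from
    the current model, let the LLM annotator pick the preferred one, and build
    a DPO-style loss from this fresh on-policy pair."""
    y1, y2 = sample(prompt), sample(prompt)
    y_w, y_l = (y1, y2) if annotator_prefers(prompt, y1, y2) else (y2, y1)
    return dpo_loss(logp(prompt, y_w), logp(prompt, y_l),
                    ref_logp(prompt, y_w), ref_logp(prompt, y_l))

# Toy usage with dummy callables standing in for the LLM policy and annotator.
rng = np.random.default_rng(0)
loss = oaif_step(
    "a prompt",
    sample=lambda p: "response-" + str(rng.integers(1000)),
    logp=lambda p, y: -0.10 * len(y),
    ref_logp=lambda p, y: -0.12 * len(y),
    annotator_prefers=lambda p, a, b: len(a) <= len(b),
)
print(loss)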
