30 research outputs found
Residual Doping in Homoepitaxial Zinc Oxide Layers Grown by Metal Organic Vapor Phase Epitaxy
Full maximum-entropy mobility spectrum analysis was carried out on the basis of temperature- and magnetic-field-dependent Hall measurements to assess the transport properties of homoepitaxial metal organic vapor phase epitaxy zinc oxide layers. Two different conductivity channels were clearly identified, and the channel with the higher mobility and higher carrier concentration is associated with the epitaxial layer. A hydrogen impurity acting as a residual donor and as a passivating species for acceptors is proposed to explain the higher carrier concentration and mobility in the epilayer. In contrast to heteroepitaxial layers, no conduction channel is observed at the substrate-to-epilayer interface.
Self-Imitation Advantage Learning
Self-imitation learning is a Reinforcement Learning (RL) method that
encourages actions whose returns were higher than expected, which helps in hard
exploration and sparse reward problems. It was shown to improve the performance
of on-policy actor-critic methods in several discrete control tasks.
Nevertheless, applying self-imitation to the mostly action-value based
off-policy RL methods is not straightforward. We propose SAIL, a novel
generalization of self-imitation learning for off-policy RL, based on a
modification of the Bellman optimality operator that we connect to Advantage
Learning. Crucially, our method mitigates the problem of stale returns by
choosing the most optimistic return estimate between the observed return and
the current action-value for self-imitation. We demonstrate the empirical
effectiveness of SAIL on the Arcade Learning Environment, with a focus on hard
exploration games. Comment: AAMAS 202
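As a rough, non-authoritative illustration of the target described above, here is a minimal NumPy sketch of a tabular update in which an Advantage-Learning-style correction uses the more optimistic of the observed return and the current action-value. The tabular setting, the function name sail_target, and the constants gamma and alpha are assumptions for illustration, not the authors' implementation.

import numpy as np

def sail_target(q, s, a, r, s_next, observed_return, gamma=0.99, alpha=0.9):
    """One plausible reading of a SAIL-style target for a tabular Q-table q:
    the standard Bellman optimality target plus an Advantage-Learning-style
    correction in which the current action-value Q(s, a) is replaced by
    max(Q(s, a), G), where G is the (possibly stale) observed return."""
    bellman = r + gamma * np.max(q[s_next])        # T*Q(s, a)
    optimistic = max(q[s, a], observed_return)     # most optimistic estimate (self-imitation)
    gap = np.max(q[s]) - optimistic                # V(s) - max(Q(s, a), G)
    return bellman - alpha * gap

# Toy usage: 3 states, 2 actions, one TD update toward the SAIL-style target.
q = np.zeros((3, 2))
target = sail_target(q, s=0, a=1, r=1.0, s_next=2, observed_return=2.5)
q[0, 1] += 0.1 * (target - q[0, 1])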
Self-Attentional Credit Assignment for Transfer in Reinforcement Learning
The ability to transfer knowledge to novel environments and tasks is a
sensible desideratum for general learning agents. Despite its apparent promise,
transfer in RL is still an open and little-explored research area. In this
paper, we take a new perspective on transfer: we suggest that the
ability to assign credit unveils structural invariants in the tasks that can be
transferred to make RL more sample-efficient. Our main contribution is SECRET,
a novel approach to transfer learning for RL that uses a backward-view credit
assignment mechanism based on a self-attentive architecture. Two aspects are
key to its generality: it learns to assign credit as a separate offline
supervised process and exclusively modifies the reward function. Consequently,
it can be supplemented by transfer methods that do not modify the reward
function, and it can be plugged on top of any RL algorithm. Comment: 21 pages, 10 figures, 3 tables (accepted as an oral presentation at the Learning Transferable Skills workshop, NeurIPS 2019).
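The abstract describes backward-view credit assignment that only modifies the reward function. The sketch below gives one hedged reading of that idea: given attention weights from some already-trained self-attentive credit model (stubbed here with random lower-triangular weights), each observed reward is redistributed over the past steps the model attends to, leaving the total return unchanged. The function name, the uniform normalization, and the random stand-in are assumptions, not the paper's exact procedure.

import numpy as np

def redistribute_reward(rewards, attention):
    """Backward-view reward shaping in the spirit of the approach above
    (illustrative only). rewards has shape (T,); attention has shape (T, T),
    with attention[t, k] the weight a trained credit model assigns to step
    k <= t when explaining the reward observed at step t. Each reward is
    spread over the steps deemed responsible, keeping the return unchanged."""
    T = len(rewards)
    shaped = np.zeros(T)
    for t in range(T):
        weights = attention[t, : t + 1]
        weights = weights / (weights.sum() + 1e-8)
        shaped[: t + 1] += rewards[t] * weights
    return shaped

# Stand-in for attention weights produced by a trained self-attentive credit model.
rng = np.random.default_rng(0)
T = 5
attention = np.tril(rng.random((T, T)))
rewards = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
print(redistribute_reward(rewards, attention))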
There Is No Turning Back: A Self-Supervised Approach for Reversibility-Aware Reinforcement Learning
We propose to learn to distinguish reversible from irreversible actions for
better informed decision-making in Reinforcement Learning (RL). From
theoretical considerations, we show that approximate reversibility can be
learned through a simple surrogate task: ranking randomly sampled trajectory
events in chronological order. Intuitively, pairs of events that are always
observed in the same order are likely to be separated by an irreversible
sequence of actions. Conveniently, learning the temporal order of events can be
done in a fully self-supervised way, which we use to estimate the reversibility
of actions from experience, without any priors. We propose two different
strategies that incorporate reversibility in RL agents, one strategy for
exploration (RAE) and one strategy for control (RAC). We demonstrate the
potential of reversibility-aware agents in several environments, including the
challenging Sokoban game. In synthetic tasks, we show that we can learn control
policies that never fail and reduce to zero the side-effects of interactions,
even without access to the reward function.
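The surrogate task above (ranking randomly sampled trajectory events in chronological order) can be sketched in a few lines. The NumPy toy below samples observation pairs from a trajectory, labels them by temporal order, and fits a logistic precedence model; confident predictions for a transition would flag it as likely irreversible. The window size, feature choice, and model are assumptions, and the RAE/RAC strategies themselves are not shown.

import numpy as np

rng = np.random.default_rng(0)

def sample_pairs(trajectory, num_pairs=64, window=10):
    """Surrogate task: sample pairs of observations from a trajectory and label
    them by whether they are presented in chronological order. trajectory has
    shape (T, obs_dim)."""
    T = len(trajectory)
    xs, ys = [], []
    for _ in range(num_pairs):
        i = rng.integers(0, T - 1)
        j = rng.integers(i + 1, min(i + window, T - 1) + 1)
        if rng.random() < 0.5:
            xs.append(np.concatenate([trajectory[i], trajectory[j]]))
            ys.append(1.0)  # chronological order
        else:
            xs.append(np.concatenate([trajectory[j], trajectory[i]]))
            ys.append(0.0)  # swapped order
    return np.array(xs), np.array(ys)

def train_precedence(xs, ys, lr=0.1, steps=500):
    """Minimal logistic model: sigmoid(x . w) estimates the probability that the
    first observation in the pair precedes the second. Pairs that are always
    classified with high confidence suggest irreversible transitions."""
    w = np.zeros(xs.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-xs @ w))
        w -= lr * xs.T @ (p - ys) / len(ys)
    return w

trajectory = rng.normal(size=(100, 4)).cumsum(axis=0)  # toy random-walk trajectory
xs, ys = sample_pairs(trajectory)
w = train_precedence(xs, ys)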
Better state exploration using action sequence equivalence
Incorporating prior knowledge into reinforcement learning algorithms is largely an open question. Even when insights about the environment dynamics are available, reinforcement learning is traditionally used in a tabula rasa setting and must explore and learn everything from scratch. In this paper, we consider the problem of exploiting priors about action sequence equivalence: that is, when different sequences of actions produce the same effect. We propose a new local exploration strategy calibrated to minimize collisions and maximize new state visitations. We show that this strategy can be computed at little cost by solving a convex optimization problem. By replacing the usual ϵ-greedy strategy in a DQN, we demonstrate its potential in several environments with various dynamic structures.
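To make the notion of action sequence equivalence concrete, here is a toy sketch in a grid world where moves commute, so sorting a sequence yields a canonical form; exploratory sequences are drawn while skipping any sequence equivalent to one already tried (a collision). This only illustrates the prior itself; the paper's convex-optimization exploration strategy is not reproduced, and the names below are placeholders.

import itertools
import random

# Toy grid world where moves commute, so two action sequences are equivalent
# iff they contain the same moves in any order; sorting yields a canonical form.
ACTIONS = ["up", "down", "left", "right"]

def canonical(seq):
    return tuple(sorted(seq))

def sample_noncolliding_sequences(length=2, budget=5, seed=0):
    """Draw exploratory action sequences while skipping any sequence equivalent
    to one already drawn, so that no two draws collide in the same state."""
    rng = random.Random(seed)
    candidates = list(itertools.product(ACTIONS, repeat=length))
    rng.shuffle(candidates)
    seen, chosen = set(), []
    for seq in candidates:
        if canonical(seq) in seen:
            continue  # equivalent to a previously drawn sequence: a collision
        seen.add(canonical(seq))
        chosen.append(seq)
        if len(chosen) == budget:
            break
    return chosen

print(sample_noncolliding_sequences())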
Adversarially Guided Actor-Critic
Despite definite success in deep reinforcement learning problems, actor-critic algorithms are still confronted with sample inefficiency in complex environments, particularly in tasks where efficient exploration is a bottleneck. These methods consider a policy (the actor) and a value function (the critic) whose respective losses are built using different motivations and approaches. This paper introduces a third protagonist: the adversary. While the adversary mimics the actor by minimizing the KL-divergence between their respective action distributions, the actor, in addition to learning to solve the task, tries to differentiate itself from the adversary's predictions. This novel objective stimulates the actor to follow strategies that could not have been correctly predicted from previous trajectories, making its behavior innovative in tasks where the reward is extremely rare. Our experimental analysis shows that the resulting Adversarially Guided Actor-Critic (AGAC) algorithm leads to more exhaustive exploration. Notably, AGAC outperforms current state-of-the-art methods on a set of various hard-exploration and procedurally-generated tasks.
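One hedged reading of the interplay described above: the adversary is trained to minimize a KL-divergence toward the actor's action distribution, while the actor's advantage is augmented with a bonus for actions the adversary failed to predict. The NumPy sketch below computes these two signals; the coefficient c and the exact form of the bonus are assumptions, not the authors' losses.

import numpy as np

def kl(p, q, eps=1e-8):
    """KL divergence between two discrete action distributions."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def agac_style_signals(pi_actor, pi_adv, advantage, action, c=0.5):
    """Illustrative signals, not the authors' exact losses:
    - the adversary minimizes KL(actor || adversary), i.e. learns to predict
      the actor's action distribution;
    - the actor's advantage receives a bonus for actions to which the adversary
      assigned low probability, pushing the actor toward hard-to-predict behavior."""
    adversary_loss = kl(pi_actor, pi_adv)
    novelty_bonus = np.log(pi_actor[action] + 1e-8) - np.log(pi_adv[action] + 1e-8)
    return adversary_loss, advantage + c * novelty_bonus

pi_actor = np.array([0.7, 0.2, 0.1])
pi_adv = np.array([0.4, 0.4, 0.2])
print(agac_style_signals(pi_actor, pi_adv, advantage=1.0, action=0))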
WARP: On the Benefits of Weight Averaged Rewarded Policies
Reinforcement learning from human feedback (RLHF) aligns large language
models (LLMs) by encouraging their generations to have high rewards, using a
reward model trained on human preferences. To prevent the forgetting of
pre-trained knowledge, RLHF usually incorporates a KL regularization; this
forces the policy to remain close to its supervised fine-tuned initialization,
though it hinders the reward optimization. To tackle the trade-off between KL
and reward, in this paper we introduce a novel alignment strategy named Weight
Averaged Rewarded Policies (WARP). WARP merges policies in the weight space at
three distinct stages. First, it uses the exponential moving average of the
policy as a dynamic anchor in the KL regularization. Second, it applies
spherical interpolation to merge independently fine-tuned policies into a new
enhanced one. Third, it linearly interpolates between this merged model and the
initialization, to recover features from pre-training. This procedure is then
applied iteratively, with each iteration's final model used as an advanced
initialization for the next, progressively refining the KL-reward Pareto front,
achieving superior rewards at fixed KL. Experiments with GEMMA policies
validate that WARP improves their quality and alignment, outperforming other
open-source LLMs. Comment: 11 main pages (34 pages with Appendix).
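The three weight-space stages described above can be sketched directly on flat parameter vectors. In the NumPy sketch below, ema gives the dynamic anchor, slerp merges two independently fine-tuned policies, and lerp_toward_init interpolates back toward the initialization; the decay and interpolation coefficients are illustrative, and whether the interpolations act on raw weights or on deltas from the initialization is an assumption here, not a detail taken from the paper.

import numpy as np

def ema(anchor, policy, decay=0.99):
    """Stage 1: exponential moving average of the policy, used as the dynamic
    anchor of the KL regularization."""
    return decay * anchor + (1.0 - decay) * policy

def slerp(theta_a, theta_b, t=0.5, eps=1e-8):
    """Stage 2: spherical interpolation between two independently fine-tuned
    policies, here applied to flat weight vectors."""
    a = theta_a / (np.linalg.norm(theta_a) + eps)
    b = theta_b / (np.linalg.norm(theta_b) + eps)
    omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    if omega < eps:  # nearly parallel: fall back to linear interpolation
        return (1 - t) * theta_a + t * theta_b
    return (np.sin((1 - t) * omega) * theta_a + np.sin(t * omega) * theta_b) / np.sin(omega)

def lerp_toward_init(theta_merged, theta_init, eta=0.3):
    """Stage 3: linear interpolation back toward the initialization to recover
    pre-trained features."""
    return (1 - eta) * theta_merged + eta * theta_init

rng = np.random.default_rng(0)
theta_init = rng.normal(size=1000)                  # stand-in for the fine-tuned initialization
theta_a = theta_init + 0.1 * rng.normal(size=1000)  # two independently fine-tuned policies
theta_b = theta_init + 0.1 * rng.normal(size=1000)
theta_next = lerp_toward_init(slerp(theta_a, theta_b), theta_init)  # initialization of the next iteration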
Direct Language Model Alignment from Online AI Feedback
Direct alignment from preferences (DAP) methods, such as DPO, have recently
emerged as efficient alternatives to reinforcement learning from human feedback
(RLHF) that do not require a separate reward model. However, the preference
datasets used in DAP methods are usually collected ahead of training and never
updated, thus the feedback is purely offline. Moreover, responses in these
datasets are often sampled from a language model distinct from the one being
aligned, and since the model evolves over training, the alignment phase is
inevitably off-policy. In this study, we posit that online feedback is key and
improves DAP methods. Our method, online AI feedback (OAIF), uses an LLM as
annotator: on each training iteration, we sample two responses from the current
model and prompt the LLM annotator to choose which one is preferred, thus
providing online feedback. Despite its simplicity, we demonstrate via human
evaluation in several tasks that OAIF outperforms both offline DAP and RLHF
methods. We further show that the feedback leveraged in OAIF is easily
controllable via instruction prompts to the LLM annotator. Comment: 18 pages, 9 figures, 4 tables.
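A schematic of one OAIF training step, under the assumption that the DAP loss is DPO: sample two responses from the current model, let an LLM annotator pick the preferred one, and build the loss from this fresh on-policy pair. The callables sample, logp, ref_logp, and annotator_prefers are placeholders for the real policy, reference model, and annotator, not an actual API.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO objective on a single preference pair:
    -log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -np.log(sigmoid(beta * margin) + 1e-8)

def oaif_step(prompt, sample, logp, ref_logp, annotator_prefers):
    """One online step as described in the abstract: sample two responses from
    the current model, let the LLM annotator pick the preferred one, and build
    a DPO-style loss from this fresh on-policy pair."""
    y1, y2 = sample(prompt), sample(prompt)
    y_w, y_l = (y1, y2) if annotator_prefers(prompt, y1, y2) else (y2, y1)
    return dpo_loss(logp(prompt, y_w), logp(prompt, y_l),
                    ref_logp(prompt, y_w), ref_logp(prompt, y_l))

# Toy usage with dummy callables standing in for the LLM policy and annotator.
rng = np.random.default_rng(0)
loss = oaif_step(
    "a prompt",
    sample=lambda p: "response-" + str(rng.integers(1000)),
    logp=lambda p, y: -0.10 * len(y),
    ref_logp=lambda p, y: -0.12 * len(y),
    annotator_prefers=lambda p, a, b: len(a) <= len(b),
)
print(loss)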
