Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes
While designing the state space of an MDP, it is common to include states
that are transient or not reachable by any policy (e.g., in mountain car, the
product space of speed and position contains configurations that are not
physically reachable). This leads to defining weakly-communicating or
multi-chain MDPs. In this paper, we introduce TUCRL, the first algorithm able
to perform efficient exploration-exploitation in any finite Markov Decision
Process (MDP) without requiring any form of prior knowledge. In particular, for
any MDP with $S^C$ communicating states, $A$ actions and $\Gamma^C \leq S^C$
possible communicating next states, we derive a $\widetilde{O}(D^C \sqrt{\Gamma^C S^C A T})$
regret bound, where $D^C$ is the diameter
(i.e., the longest shortest path) of the communicating part of the MDP. This is
in contrast with optimistic algorithms (e.g., UCRL, Optimistic PSRL) that
suffer linear regret in weakly-communicating MDPs, as well as posterior
sampling or regularised algorithms (e.g., REGAL), which require prior knowledge
on the bias span of the optimal policy to bias the exploration to achieve
sub-linear regret. We also prove that in weakly-communicating MDPs, no
algorithm can ever achieve a logarithmic growth of the regret without first
suffering a linear regret for a number of steps that is exponential in the
parameters of the MDP. Finally, we report numerical simulations supporting our
theoretical findings and showing how TUCRL overcomes the limitations of the
state-of-the-art.
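As a concrete illustration of the quantity $D^C$ appearing in the bound above, the sketch below computes the diameter of the communicating part of a small MDP with deterministic transitions, where expected travel times reduce to graph shortest paths. The `transitions` encoding, the choice of reference state `s0`, and the restriction to deterministic dynamics are illustrative assumptions, not part of TUCRL.

```python
# A minimal sketch (not TUCRL itself): D^C is the longest shortest path
# between any two communicating states. For deterministic transitions this
# is an all-pairs shortest-path computation on the transition graph.
import math

def communicating_diameter(transitions, s0=0):
    # transitions[s] lists the states reachable from s in one step (assumed encoding)
    n = len(transitions)
    dist = [[0 if s == t else math.inf for t in range(n)] for s in range(n)]
    for s in range(n):
        for t in transitions[s]:              # one action moves s -> t
            dist[s][t] = min(dist[s][t], 1)
    for k in range(n):                        # Floyd-Warshall relaxation
        for s in range(n):
            for t in range(n):
                if dist[s][k] + dist[k][t] < dist[s][t]:
                    dist[s][t] = dist[s][k] + dist[k][t]
    # communicating part: states that can both reach s0 and be reached from it
    comm = [s for s in range(n)
            if dist[s0][s] < math.inf and dist[s][s0] < math.inf]
    return max(dist[s][t] for s in comm for t in comm)

# Example: a 4-state chain plus a transient 5th state (index 4) that only
# feeds into the chain; the transient state is excluded from D^C.
T = [[1], [0, 2], [1, 3], [2], [0]]
print(communicating_diameter(T, s0=0))        # -> 3
```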
Smoothing Policies and Safe Policy Gradients
Policy gradient algorithms are among the best candidates for the much
anticipated application of reinforcement learning to real-world control tasks,
such as the ones arising in robotics. However, the trial-and-error nature of
these methods introduces safety issues whenever the learning phase itself must
be performed on a physical system. In this paper, we address a specific safety
formulation, where danger is encoded in the reward signal and the learning
agent is constrained to never worsen its performance. By studying actor-only
policy gradient from a stochastic optimization perspective, we establish
improvement guarantees for a wide class of parametric policies, generalizing
existing results on Gaussian policies. This, together with novel upper bounds
on the variance of policy gradient estimators, allows us to identify those
meta-parameter schedules that guarantee monotonic improvement with high
probability. The two key meta-parameters are the step size of the parameter
updates and the batch size of the gradient estimators. By a joint, adaptive
selection of these meta-parameters, we obtain a safe policy gradient algorithm.
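To make the role of these two meta-parameters concrete, here is a hedged toy sketch (not the paper's actual schedules): an actor-only policy gradient on a one-step Gaussian-policy problem, where the batch size is grown until the gradient estimate is statistically resolved and the step size is shrunk with the estimator's uncertainty. The reward function, the fixed policy standard deviation `sigma`, and the specific growth and shrinkage rules are all assumptions made for illustration.

```python
# Toy stand-in for adaptive (step size, batch size) selection; not the
# high-probability improvement conditions derived in the paper.
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0                       # fixed policy std (smoothing parameter)
theta = 0.0                       # policy mean, the parameter we learn

def reward(a):                    # toy objective with its peak at a = 2
    return -(a - 2.0) ** 2

def grad_samples(theta, n):
    """Per-sample REINFORCE gradients for the Gaussian policy N(theta, sigma^2)."""
    a = rng.normal(theta, sigma, size=n)
    return reward(a) * (a - theta) / sigma ** 2   # score-function estimator

for it in range(200):
    n = 100
    g = grad_samples(theta, n)
    # grow the batch until the gradient's sign is resolved with some confidence
    while np.abs(g.mean()) < 2 * g.std(ddof=1) / np.sqrt(len(g)) and len(g) < 10_000:
        g = np.concatenate([g, grad_samples(theta, n)])
    ghat, se = g.mean(), g.std(ddof=1) / np.sqrt(len(g))
    step = 0.05 * ghat / (abs(ghat) + se)          # shrink the step when uncertain
    theta += step

print(theta)                                       # approaches 2.0, the optimal mean
```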
Genetic basis of between-individual and within-individual variance of docility
Funded by Alces Software, the UCLA Academic Senate Division of Life Sciences, the National Geographic Society, the National Science Foundation (grant numbers IDBR-0754247, DEB-1119660, DBI-0242960, DBI-0731346), and the University of Aberdeen. Data deposited at Dryad: doi:10.5061/dryad.11vf0. Peer reviewed. Postprint.
Blepharitis
Blepharitis is a very common and under-appreciated eyelid margin condition
which causes non-specific ocular irritation and significant patient distress.
Chronic blepharitis is often difficult to manage. The true prevalence of
blepharitis is difficult to estimate; figures cited in the literature range
from 12% to 79%, owing to the different ways in which blepharitis may manifest
itself and to ill-defined diagnostic criteria. Peer reviewed.
Inverse Reinforcement Learning through Policy Gradient Minimization
Inverse Reinforcement Learning (IRL) deals with the problem of recovering the reward function optimized by an expert, given a set of demonstrations of the expert's policy. Most IRL algorithms need to repeatedly compute the optimal policy for different reward functions. This paper proposes a new IRL approach that allows the reward function to be recovered without the need to solve any "direct" RL problem. The idea is to find the reward function that minimizes the gradient of a parameterized representation of the expert's policy. In particular, when the reward function can be represented as a linear combination of some basis functions, we show that the aforementioned optimization problem can be solved efficiently. We present an empirical evaluation of the proposed approach on a multidimensional version of the Linear-Quadratic Regulator (LQR), both in the case where the parameters of the expert's policy are known and in the (more realistic) case where the parameters of the expert's policy need to be inferred from the expert's demonstrations. Finally, the algorithm is compared against the state-of-the-art on the mountain car domain, where the expert's policy is unknown.
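The linear-reward case lends itself to a simple illustration. Since $r_w(s,a) = \sum_i w_i \phi_i(s,a)$ makes the policy gradient at the expert's parameters linear in $w$, minimizing its norm over unit-norm weights reduces to a smallest-eigenvalue problem. The sketch below shows that reduction on synthetic data; the `feature_gradients` input (one estimated gradient column per basis function) and the unit-norm constraint are assumptions for illustration, not necessarily the paper's exact formulation.

```python
# Hedged sketch of the core idea: with a linear reward, the policy gradient
# at the expert's parameters is G w, where column i of G is the gradient
# obtained when basis function phi_i alone is used as the reward. A (near)
# optimal expert should have a vanishing gradient, so we minimize ||G w||^2
# over unit-norm w, i.e. take the eigenvector of G^T G with smallest eigenvalue.
import numpy as np

def recover_reward_weights(feature_gradients):
    """feature_gradients: array of shape (d_theta, d_features); column i holds the
    estimated policy gradient of the expert's policy under reward phi_i."""
    G = np.asarray(feature_gradients, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(G.T @ G)   # symmetric PSD matrix
    w = eigvecs[:, np.argmin(eigvals)]           # direction minimizing ||G w||
    return w / np.linalg.norm(w)

# Example with synthetic gradients whose null space contains the true weights.
rng = np.random.default_rng(0)
w_true = np.array([0.6, 0.8, 0.0])
B = rng.normal(size=(5, 2))                      # gradients span a 2-D subspace
G = B @ np.array([[0.8, -0.6, 0.0], [0.0, 0.0, 1.0]])  # G @ w_true == 0
print(np.round(recover_reward_weights(G), 3))    # +/- [0.6, 0.8, 0.0] up to sign
```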
Managing the wildlife tourism commons
ACKNOWLEDGMENTS This work received funding from the MASTS pooling initiative (the Marine Alliance for Science and Technology for Scotland) and their support is gratefully acknowledged. MASTS is funded by the Scottish Funding Council (grant reference HR09011) and contributing institutions. This work was stimulated by discussions with the Moray Firth Dolphin Space Programme and we particularly thank Ben Leyshon (Scottish Natural Heritage) for fruitful discussions. The authors would like to thank K. Barton and C. Konrad for their advice on biased random walks and correlated random fields, D. Murphy for useful discussions during the development of the simulations, and M. Marcoux for important comments on an earlier version of this work. Finally, we thank two anonymous reviewers, whose comments have greatly improved the manuscript. Peer reviewed. Publisher PDF.
- …
