
    Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes

    When designing the state space of an MDP, it is common to include states that are transient or not reachable by any policy (e.g., in mountain car, the product space of speed and position contains configurations that are not physically reachable). This leads to weakly-communicating or multi-chain MDPs. In this paper, we introduce TUCRL, the first algorithm able to perform efficient exploration-exploitation in any finite Markov Decision Process (MDP) without requiring any form of prior knowledge. In particular, for any MDP with $S^{\texttt{C}}$ communicating states, $A$ actions and $\Gamma^{\texttt{C}} \leq S^{\texttt{C}}$ possible communicating next states, we derive a $\widetilde{O}(D^{\texttt{C}} \sqrt{\Gamma^{\texttt{C}} S^{\texttt{C}} A T})$ regret bound, where $D^{\texttt{C}}$ is the diameter (i.e., the longest shortest path) of the communicating part of the MDP. This is in contrast with optimistic algorithms (e.g., UCRL, Optimistic PSRL), which suffer linear regret in weakly-communicating MDPs, and with posterior sampling or regularised algorithms (e.g., REGAL), which require prior knowledge of the bias span of the optimal policy to bias the exploration and achieve sub-linear regret. We also prove that in weakly-communicating MDPs no algorithm can achieve logarithmic growth of the regret without first suffering linear regret for a number of steps that is exponential in the parameters of the MDP. Finally, we report numerical simulations supporting our theoretical findings and showing how TUCRL overcomes the limitations of the state of the art.
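    To make the diameter quantity concrete, here is a minimal sketch (not from the paper) that computes $D^{\texttt{C}}$ for a hypothetical deterministic toy MDP, where the expected hitting time under the best policy reduces to the graph shortest-path length; the `transitions` dictionary and state names are invented purely for illustration.

```python
# Minimal sketch: the diameter of (the communicating part of) a deterministic
# toy MDP, i.e. the longest shortest path between any pair of communicating
# states. With deterministic dynamics the expected hitting time under the best
# policy is just the graph distance, so Floyd-Warshall suffices.
# The 3-state MDP below is hypothetical and only for illustration.

import math

# transitions[state][action] -> next state (deterministic for simplicity)
transitions = {
    "s0": {"a0": "s1", "a1": "s0"},
    "s1": {"a0": "s2", "a1": "s0"},
    "s2": {"a0": "s0", "a1": "s2"},
}

states = list(transitions)
dist = {s: {t: 0 if s == t else math.inf for t in states} for s in states}
for s, acts in transitions.items():
    for nxt in acts.values():
        if nxt != s:
            dist[s][nxt] = 1

# Floyd-Warshall: shortest number of steps between every ordered pair.
for k in states:
    for i in states:
        for j in states:
            dist[i][j] = min(dist[i][j], dist[i][k] + dist[k][j])

# Diameter of the communicating part: maximum over all ordered pairs.
diameter = max(d for row in dist.values() for d in row.values() if d < math.inf)
print("communicating diameter D^C =", diameter)  # 2 for this toy MDP
```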

    Smoothing Policies and Safe Policy Gradients

    Policy gradient algorithms are among the best candidates for the much anticipated application of reinforcement learning to real-world control tasks, such as those arising in robotics. However, the trial-and-error nature of these methods introduces safety issues whenever the learning phase itself must be performed on a physical system. In this paper, we address a specific safety formulation, where danger is encoded in the reward signal and the learning agent is constrained to never worsen its performance. By studying actor-only policy gradient from a stochastic optimization perspective, we establish improvement guarantees for a wide class of parametric policies, generalizing existing results on Gaussian policies. This, together with novel upper bounds on the variance of policy gradient estimators, allows us to identify the meta-parameter schedules that guarantee monotonic improvement with high probability. The two key meta-parameters are the step size of the parameter updates and the batch size of the gradient estimators. By a joint, adaptive selection of these meta-parameters, we obtain a safe policy gradient algorithm.
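    As a rough illustration of the setting, the sketch below implements a plain REINFORCE-style actor-only update with a Gaussian policy on a hypothetical one-dimensional linear-quadratic task, exposing the two meta-parameters the abstract highlights (step size and batch size). The paper's adaptive, safety-guaranteeing schedules are not reproduced; fixed conservative values stand in for them, and all names and constants are invented.

```python
# Minimal sketch of an actor-only (REINFORCE-style) policy gradient step with
# a Gaussian policy, exposing the two meta-parameters discussed in the
# abstract: the step size of the parameter update and the batch size of the
# gradient estimator. The paper's adaptive, safety-guaranteeing schedules are
# NOT reproduced; fixed conservative values are used instead. The 1-D
# linear-quadratic environment below is hypothetical.

import numpy as np

rng = np.random.default_rng(0)
theta, sigma = 0.0, 0.5           # policy mean parameter, fixed exploration noise
step_size, batch_size = 1e-2, 64  # the two key meta-parameters
horizon = 20

def rollout(theta):
    """One episode: linear-Gaussian policy a = theta*s + noise, quadratic cost."""
    s, ret, grad_log = 1.0, 0.0, 0.0
    for _ in range(horizon):
        a = theta * s + sigma * rng.standard_normal()
        grad_log += (a - theta * s) * s / sigma**2   # d/dtheta log pi(a|s)
        ret += -(s**2 + 0.1 * a**2)                  # reward = negative cost
        s = 0.9 * s + 0.1 * a
    return ret, grad_log

for _ in range(200):
    samples = [rollout(theta) for _ in range(batch_size)]
    returns = np.array([r for r, _ in samples])
    grads = np.array([g for _, g in samples])
    baseline = returns.mean()                         # simple variance reduction
    grad_est = np.mean((returns - baseline) * grads)  # REINFORCE estimator
    theta += step_size * grad_est                     # gradient ascent on return

print("learned gain theta:", theta)
```

    In the paper's formulation, the step size and batch size would instead be chosen jointly and adaptively so that, with high probability, performance never decreases across updates.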

    Genetic basis of between-individual and within-individual variance of docility

    Funded by: Alces Software; UCLA Academic Senate Division of Life Sciences; National Geographic Society; National Science Foundation (Grant Numbers IDBR-0754247, DEB-1119660, DBI-0242960, DBI-0731346); University of Aberdeen. Data deposited at Dryad: doi:10.5061/dryad.11vf0. Peer reviewed. Postprint.

    Blepharitis

    Blepharitis is a very common and under-appreciated eyelid margin condition that causes non-specific ocular irritation and significant patient distress. Chronic blepharitis is often difficult to manage. The true prevalence of blepharitis is difficult to estimate; figures cited in the literature range from 12% to 79%, owing to the different ways in which blepharitis may manifest itself and to ill-defined diagnostic criteria. Peer reviewed.

    Inverse Reinforcement Learning through Policy Gradient Minimization

    Inverse Reinforcement Learning (IRL) deals with the problem of recovering the reward function optimized by an expert, given a set of demonstrations of the expert's policy. Most IRL algorithms need to repeatedly compute the optimal policy for different reward functions. This paper proposes a new IRL approach that makes it possible to recover the reward function without solving any "direct" RL problem. The idea is to find the reward function that minimizes the gradient of a parameterized representation of the expert's policy. In particular, when the reward function can be represented as a linear combination of some basis functions, we show that the resulting optimization problem can be solved efficiently. We present an empirical evaluation of the proposed approach on a multidimensional version of the Linear-Quadratic Regulator (LQR), both in the case where the parameters of the expert's policy are known and in the (more realistic) case where they must be inferred from the expert's demonstrations. Finally, the algorithm is compared against the state of the art on the mountain car domain, where the expert's policy is unknown.
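    Under the linear-reward assumption mentioned above, the "gradient minimization" idea admits a compact sketch: the policy gradient evaluated at the expert's parameters is linear in the reward weights, so minimizing its norm over unit-norm weights reduces to an eigenvalue problem. This is only an illustration of the principle, not the paper's algorithm; the gradient matrix `G` below is a hypothetical placeholder, and its estimation from demonstrations is not shown.

```python
# Minimal sketch of the principle: with a reward linear in basis functions,
# r(s, a) = sum_i w_i * phi_i(s, a), the policy gradient at the expert's
# parameters is linear in w, i.e. grad J(theta_E; w) = G @ w, where column i
# of G is the gradient estimated using phi_i alone as the reward. Choosing w
# to minimise ||G w||^2 subject to ||w|| = 1 is a smallest-eigenvector
# problem. The matrix G below is a hypothetical placeholder; estimating it
# from demonstrations is assumed done elsewhere.

import numpy as np

def recover_reward_weights(G):
    """Return a unit-norm w minimising ||G w||^2, i.e. the eigenvector of
    G^T G associated with its smallest eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(G.T @ G)  # eigenvalues in ascending order
    return eigvecs[:, 0]

# Hypothetical gradient matrix: (policy parameters) x (reward basis functions).
G = np.array([[0.8, 0.4, 1.2],
              [0.1, 0.05, 0.15]])
w = recover_reward_weights(G)
print("recovered reward weights:", w)
print("policy gradient norm at expert parameters:", np.linalg.norm(G @ w))
```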

    Managing the wildlife tourism commons

    ACKNOWLEDGMENTS This work received funding from the MASTS pooling initiative (the Marine Alliance for Science and Technology for Scotland) and their support is gratefully acknowledged. MASTS is funded by the Scottish Funding Council (grant reference HR09011) and contributing institutions. This work was stimulated by discussions with the Moray Firth Dolphin Space Programme and we particularly thank Ben Leyshon (Scottish Natural Heritage) for fruitful discussions. The authors would like to thank K. Barton and C. Konrad for their advice on biased random walks and correlated random fields, D. Murphy for useful discussions during the development of the simulations, and M. Marcoux for important comments on an earlier version of this work. Finally, we thank two anonymous reviewers, whose comments have greatly improved the manuscript. Peer reviewed. Publisher PDF.