
    Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes

    When designing the state space of an MDP, it is common to include states that are transient or not reachable by any policy (e.g., in mountain car, the product space of speed and position contains configurations that are not physically reachable). This leads to weakly-communicating or multi-chain MDPs. In this paper, we introduce TUCRL, the first algorithm able to perform efficient exploration-exploitation in any finite Markov Decision Process (MDP) without requiring any form of prior knowledge. In particular, for any MDP with $S^{\texttt{C}}$ communicating states, $A$ actions and $\Gamma^{\texttt{C}} \leq S^{\texttt{C}}$ possible communicating next states, we derive a $\widetilde{O}(D^{\texttt{C}} \sqrt{\Gamma^{\texttt{C}} S^{\texttt{C}} A T})$ regret bound, where $D^{\texttt{C}}$ is the diameter (i.e., the longest shortest path) of the communicating part of the MDP. This is in contrast with optimistic algorithms (e.g., UCRL, Optimistic PSRL), which suffer linear regret in weakly-communicating MDPs, and with posterior sampling or regularised algorithms (e.g., REGAL), which require prior knowledge of the bias span of the optimal policy to bias the exploration and achieve sub-linear regret. We also prove that in weakly-communicating MDPs no algorithm can achieve logarithmic growth of the regret without first suffering linear regret for a number of steps that is exponential in the parameters of the MDP. Finally, we report numerical simulations supporting our theoretical findings and showing how TUCRL overcomes the limitations of the state of the art.
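    To make the diameter quantity concrete, here is a minimal sketch (not from the paper) that computes $D^{\texttt{C}}$ for a hypothetical deterministic toy MDP, where the expected hitting time under the best policy reduces to the graph shortest-path length; the `transitions` dictionary and state names are invented purely for illustration.

```python
# Minimal sketch: the diameter of (the communicating part of) a deterministic
# toy MDP, i.e. the longest shortest path between any pair of communicating
# states. With deterministic dynamics the expected hitting time under the best
# policy is just the graph distance, so Floyd-Warshall suffices.
# The 3-state MDP below is hypothetical and only for illustration.

import math

# transitions[state][action] -> next state (deterministic for simplicity)
transitions = {
    "s0": {"a0": "s1", "a1": "s0"},
    "s1": {"a0": "s2", "a1": "s0"},
    "s2": {"a0": "s0", "a1": "s2"},
}

states = list(transitions)
dist = {s: {t: 0 if s == t else math.inf for t in states} for s in states}
for s, acts in transitions.items():
    for nxt in acts.values():
        if nxt != s:
            dist[s][nxt] = 1

# Floyd-Warshall: shortest number of steps between every ordered pair.
for k in states:
    for i in states:
        for j in states:
            dist[i][j] = min(dist[i][j], dist[i][k] + dist[k][j])

# Diameter of the communicating part: maximum over all ordered pairs.
diameter = max(d for row in dist.values() for d in row.values() if d < math.inf)
print("communicating diameter D^C =", diameter)  # 2 for this toy MDP
```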

    Smoothing Policies and Safe Policy Gradients

    Policy gradient algorithms are among the best candidates for the much anticipated application of reinforcement learning to real-world control tasks, such as those arising in robotics. However, the trial-and-error nature of these methods introduces safety issues whenever the learning phase itself must be performed on a physical system. In this paper, we address a specific safety formulation, where danger is encoded in the reward signal and the learning agent is constrained to never worsen its performance. By studying actor-only policy gradient from a stochastic optimization perspective, we establish improvement guarantees for a wide class of parametric policies, generalizing existing results on Gaussian policies. This, together with novel upper bounds on the variance of policy gradient estimators, allows us to identify the meta-parameter schedules that guarantee monotonic improvement with high probability. The two key meta-parameters are the step size of the parameter updates and the batch size of the gradient estimators. By a joint, adaptive selection of these meta-parameters, we obtain a safe policy gradient algorithm.
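    As a rough illustration of the setting, the sketch below implements a plain REINFORCE-style actor-only update with a Gaussian policy on a hypothetical one-dimensional linear-quadratic task, exposing the two meta-parameters the abstract highlights (step size and batch size). The paper's adaptive, safety-guaranteeing schedules are not reproduced; fixed conservative values stand in for them, and all names and constants are invented.

```python
# Minimal sketch of an actor-only (REINFORCE-style) policy gradient step with
# a Gaussian policy, exposing the two meta-parameters discussed in the
# abstract: the step size of the parameter update and the batch size of the
# gradient estimator. The paper's adaptive, safety-guaranteeing schedules are
# NOT reproduced; fixed conservative values are used instead. The 1-D
# linear-quadratic environment below is hypothetical.

import numpy as np

rng = np.random.default_rng(0)
theta, sigma = 0.0, 0.5           # policy mean parameter, fixed exploration noise
step_size, batch_size = 1e-2, 64  # the two key meta-parameters
horizon = 20

def rollout(theta):
    """One episode: linear-Gaussian policy a = theta*s + noise, quadratic cost."""
    s, ret, grad_log = 1.0, 0.0, 0.0
    for _ in range(horizon):
        a = theta * s + sigma * rng.standard_normal()
        grad_log += (a - theta * s) * s / sigma**2   # d/dtheta log pi(a|s)
        ret += -(s**2 + 0.1 * a**2)                  # reward = negative cost
        s = 0.9 * s + 0.1 * a
    return ret, grad_log

for _ in range(200):
    samples = [rollout(theta) for _ in range(batch_size)]
    returns = np.array([r for r, _ in samples])
    grads = np.array([g for _, g in samples])
    baseline = returns.mean()                         # simple variance reduction
    grad_est = np.mean((returns - baseline) * grads)  # REINFORCE estimator
    theta += step_size * grad_est                     # gradient ascent on return

print("learned gain theta:", theta)
```

    In the paper's formulation, the step size and batch size would instead be chosen jointly and adaptively so that, with high probability, performance never decreases across updates.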

    Genetic basis of between-individual and within-individual variance of docility

    Funded by: Alces Software; UCLA Academic Senate Division of Life Sciences; National Geographic Society; National Science Foundation (Grant Numbers IDBR-0754247, DEB-1119660, DBI-0242960, DBI-0731346); University of Aberdeen. Data deposited at Dryad: doi:10.5061/dryad.11vf0. Peer reviewed. Postprint.

    Blepharitis

    Blepharitis is a very common and under-appreciated eyelid margin condition that causes non-specific ocular irritation and significant patient distress. Chronic blepharitis is often difficult to manage. The true prevalence of blepharitis is difficult to estimate; figures cited in the literature range from 12% to 79%, owing to the different ways in which blepharitis may manifest itself and to ill-defined diagnostic criteria. Peer reviewed.

    Inverse Reinforcement Learning through Policy Gradient Minimization

    Inverse Reinforcement Learning (IRL) deals with the problem of recovering the reward function optimized by an expert, given a set of demonstrations of the expert's policy. Most IRL algorithms need to repeatedly compute the optimal policy for different reward functions. This paper proposes a new IRL approach that makes it possible to recover the reward function without solving any "direct" RL problem. The idea is to find the reward function that minimizes the gradient of a parameterized representation of the expert's policy. In particular, when the reward function can be represented as a linear combination of some basis functions, we show that the resulting optimization problem can be solved efficiently. We present an empirical evaluation of the proposed approach on a multidimensional version of the Linear-Quadratic Regulator (LQR), both in the case where the parameters of the expert's policy are known and in the (more realistic) case where they must be inferred from the expert's demonstrations. Finally, the algorithm is compared against the state of the art on the mountain car domain, where the expert's policy is unknown.
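    Under the linear-reward assumption mentioned above, the "gradient minimization" idea admits a compact sketch: the policy gradient evaluated at the expert's parameters is linear in the reward weights, so minimizing its norm over unit-norm weights reduces to an eigenvalue problem. This is only an illustration of the principle, not the paper's algorithm; the gradient matrix `G` below is a hypothetical placeholder, and its estimation from demonstrations is not shown.

```python
# Minimal sketch of the principle: with a reward linear in basis functions,
# r(s, a) = sum_i w_i * phi_i(s, a), the policy gradient at the expert's
# parameters is linear in w, i.e. grad J(theta_E; w) = G @ w, where column i
# of G is the gradient estimated using phi_i alone as the reward. Choosing w
# to minimise ||G w||^2 subject to ||w|| = 1 is a smallest-eigenvector
# problem. The matrix G below is a hypothetical placeholder; estimating it
# from demonstrations is assumed done elsewhere.

import numpy as np

def recover_reward_weights(G):
    """Return a unit-norm w minimising ||G w||^2, i.e. the eigenvector of
    G^T G associated with its smallest eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(G.T @ G)  # eigenvalues in ascending order
    return eigvecs[:, 0]

# Hypothetical gradient matrix: (policy parameters) x (reward basis functions).
G = np.array([[0.8, 0.4, 1.2],
              [0.1, 0.05, 0.15]])
w = recover_reward_weights(G)
print("recovered reward weights:", w)
print("policy gradient norm at expert parameters:", np.linalg.norm(G @ w))
```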

    Managing the wildlife tourism commons

    ACKNOWLEDGMENTS This work received funding from the MASTS pooling initiative (the Marine Alliance for Science and Technology for Scotland) and their support is gratefully acknowledged. MASTS is funded by the Scottish Funding Council (grant reference HR09011) and contributing institutions. This work was stimulated by discussions with the Moray Firth Dolphin Space Programme and we particularly thank Ben Leyshon (Scottish Natural Heritage) for fruitful discussions. The authors would like to thank K. Barton and C. Konrad for their advice on biased random walks and correlated random fields, D. Murphy for useful discussions during the development of the simulations, and M. Marcoux for important comments on an earlier version of this work. Finally, we thank two anonymous reviewers, whose comments have greatly improved the manuscript. Peer reviewed. Publisher PDF.