
Self-Optimizing and Pareto-Optimal Policies in General Environments based on Bayes-Mixtures

Author: Marcus Hutter
Publication date: 01/01/2002

The problem of making sequential decisions in unknown probabilistic environments is studied. In cycle $t$, action $y_t$ results in perception $x_t$ and reward $r_t$, where all quantities may in general depend on the complete history. The perception $x_t$ and reward $r_t$ are sampled from the (reactive) environmental probability distribution $\mu$. This very general setting includes, but is not limited to, (partially observable, $k$-th order) Markov decision processes. Sequential decision theory tells us how to act in order to maximize the total expected reward, called value, if $\mu$ is known. Reinforcement learning is usually used if $\mu$ is unknown. In the Bayesian approach one defines a mixture distribution $\xi$ as a weighted sum of distributions $\nu\in\mathcal{M}$, where $\mathcal{M}$ is any class of distributions including the true environment $\mu$. We show that the Bayes-optimal policy $p^\xi$ based on the mixture $\xi$ is self-optimizing in the sense that the average value converges asymptotically for all $\mu\in\mathcal{M}$ to the optimal value achieved by the (infeasible) Bayes-optimal policy $p^\mu$, which knows $\mu$ in advance. We show that the necessary condition that $\mathcal{M}$ admits self-optimizing policies at all is also sufficient. No other structural assumptions are made on $\mathcal{M}$. As an example application, we discuss ergodic Markov decision processes, which allow for self-optimizing policies. Furthermore, we show that $p^\xi$ is Pareto-optimal in the sense that there is no other policy yielding higher or equal value in all environments $\nu\in\mathcal{M}$ and a strictly higher value in at least one. Comment: 15 pages
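The sketch below illustrates only the mixture $\xi=\sum_{\nu\in\mathcal{M}} w_\nu\,\nu$ and its Bayesian posterior update, under simplifying assumptions that are not part of the abstract: $\mathcal{M}$ is a small finite class of i.i.d. Bernoulli environments, and there are no actions or rewards, so this is passive sequence prediction rather than the policy $p^\xi$ itself. The names `mixture_predict`, `posterior_update` and `run` are hypothetical.

```python
import random

def mixture_predict(weights, thetas):
    # xi(x_t = 1 | x_<t) = sum over nu of w_nu(x_<t) * nu(x_t = 1)
    # (assumption: each nu is i.i.d. Bernoulli with parameter theta)
    return sum(w * th for w, th in zip(weights, thetas))

def posterior_update(weights, thetas, bit):
    # Bayes rule: w_nu(x_1:t) = w_nu(x_<t) * nu(x_t | x_<t) / xi(x_t | x_<t)
    likes = [th if bit == 1 else 1.0 - th for th in thetas]
    unnorm = [w * l for w, l in zip(weights, likes)]
    z = sum(unnorm)  # this normaliser equals xi(bit | history)
    return [u / z for u in unnorm]

def run(true_theta, thetas, prior, n, seed=0):
    # Sample n bits from the true environment mu = Bernoulli(true_theta)
    # and return the posterior weights after observing all of them.
    rng = random.Random(seed)
    weights = list(prior)
    for _ in range(n):
        bit = 1 if rng.random() < true_theta else 0
        weights = posterior_update(weights, thetas, bit)
    return weights
```

With `thetas = [0.2, 0.5, 0.7, 0.9]`, a uniform prior and `true_theta = 0.7`, the posterior weight on 0.7 should approach 1 after a couple of thousand samples, so the mixture's prediction approaches that of the true environment.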

arXiv.org e-Print Archive

CiteSeerX

Crossref

The Australian National University

On the Convergence Speed of MDL Predictions for Bernoulli Sequences

Authors: Jan Pol, Marcus Hutter
Publication date: 01/01/2004

We consider the Minimum Description Length principle for online sequence prediction. If the underlying model class is discrete, then the total expected square loss is a particularly interesting performance measure: (a) this quantity is bounded, implying convergence with probability one, and (b) it additionally specifies a 'rate of convergence'. Generally, for MDL only exponential loss bounds hold, as opposed to the linear bounds for a Bayes mixture. We show that this is even the case if the model class contains only Bernoulli distributions. We derive a new upper bound on the prediction error for countable Bernoulli classes. This implies a small bound (comparable to the one for Bayes mixtures) for certain important model classes. The results apply to many Machine Learning tasks including classification and hypothesis testing. We provide arguments that our theorems generalize to countable classes of i.i.d. models. Comment: 17 pages
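A minimal sketch of the two predictors being compared, under assumptions not taken from the abstract: a finite truncation of a countable Bernoulli class, hypothetical prior weights $w_i\propto 2^{-i}$, and a static two-part MDL predictor that picks the single model minimizing the code length $-\log w_\nu-\log\nu(x_{1:t})$ and predicts with it. The Bayes mixture instead averages all models by their posterior weights. The MDL variant analysed in the paper may differ, and all helper names are hypothetical. The loss computed is the realized cumulative square loss $\sum_t (p_t-\theta_\mu)^2$ along one sample path, not its expectation.

```python
import math
import random

def log_likelihood(theta, ones, zeros):
    # log nu(x_1:t) for i.i.d. Bernoulli(theta); only the counts matter
    return ones * math.log(theta) + zeros * math.log(1 - theta)

def mdl_predict(prior, thetas, ones, zeros):
    # Static MDL: choose the model with the shortest two-part code length
    best = min(range(len(thetas)),
               key=lambda i: -math.log(prior[i]) - log_likelihood(thetas[i], ones, zeros))
    return thetas[best]

def bayes_predict(prior, thetas, ones, zeros):
    # Bayes mixture: posterior-weighted average of the models' predictions
    logs = [math.log(w) + log_likelihood(th, ones, zeros) for w, th in zip(prior, thetas)]
    m = max(logs)  # shift before exponentiating to avoid underflow
    post = [math.exp(l - m) for l in logs]
    return sum(p * th for p, th in zip(post, thetas)) / sum(post)

def cumulative_square_loss(predict, prior, thetas, true_theta, n, seed=0):
    # Sum of (predicted P(x_t = 1) - true probability)^2 over n steps
    rng = random.Random(seed)
    ones = zeros = 0
    loss = 0.0
    for _ in range(n):
        p = predict(prior, thetas, ones, zeros)
        loss += (p - true_theta) ** 2
        if rng.random() < true_theta:
            ones += 1
        else:
            zeros += 1
    return loss
```

Because the sampled bits depend only on `seed`, not on the predictions, calling `cumulative_square_loss` with `mdl_predict` and with `bayes_predict` under the same seed compares both on the same sequence. The sketch makes no claim about which loss is smaller on a given run; the abstract's bounds are statements about the total expected loss. Parameters must lie strictly between 0 and 1 so that the logarithms are defined.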

arXiv.org e-Print Archive

CiteSeerX

Crossref

The Australian National University

Feature Reinforcement Learning: Part I: Unstructured MDPs

Author: Marcus Hutter
Publication date: 01/01/2009

General-purpose, intelligent, learning agents cycle through sequences of observations, actions, and rewards that are complex, uncertain, unknown, and non-Markovian. On the other hand, reinforcement learning is well-developed for small finite state Markov decision processes (MDPs). Up to now, extracting the right state representations out of bare observations, that is, reducing the general agent setup to the MDP framework, is an art that involves significant effort by designers. The primary goal of this work is to automate the reduction process and thereby significantly expand the scope of many existing reinforcement learning algorithms and the agents that employ them. Before we can think of mechanizing this search for suitable MDPs, we need a formal objective criterion. The main contribution of this article is to develop such a criterion. I also integrate the various parts into one learning algorithm. Extensions to more realistic dynamic Bayesian networks are developed in Part II. The role of POMDPs is also considered there. Comment: 24 LaTeX pages, 5 diagrams

arXiv.org e-Print Archive

CiteSeerX

Crossref

The Australian National University

Feature Dynamic Bayesian Networks

Author: Marcus Hutter
Publication date: 24/12/2008

Feature Markov Decision Processes (PhiMDPs) are well-suited for learning agents in general environments. Nevertheless, unstructured (Phi)MDPs are limited to relatively simple environments. Structured MDPs like Dynamic Bayesian Networks (DBNs) are used for large-scale real-world problems. In this article I extend PhiMDP to PhiDBN. The primary contribution is to derive a cost criterion that allows one to automatically extract the most relevant features from the environment, leading to the "best" DBN representation. I discuss all building blocks required for a complete general learning algorithm. Comment: 7 pages

arXiv.org e-Print Archive

CiteSeerX

Crossref

The Australian National University