
    Optimistic Planning for Markov Decision Processes

    The reinforcement learning community has recently intensified its interest in online planning methods, due to their relative independence from the size of the state space. However, tight near-optimality guarantees are not yet available for the general case of stochastic Markov decision processes and closed-loop, state-dependent planning policies. We therefore consider an algorithm related to AO* that optimistically explores a tree representation of the space of closed-loop policies, and we analyze the near-optimality of the action it returns after n tree node expansions. While this optimistic planning requires a finite number of actions and possible next states for each transition, its asymptotic performance does not depend directly on these numbers, but only on the subset of nodes that significantly impact near-optimal policies. We characterize this set by introducing a novel measure of problem complexity, called the near-optimality exponent. Specializing the exponent and performance bound for some interesting classes of MDPs illustrates that the algorithm works better when there are fewer near-optimal policies and the transition probabilities are less uniform.
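
    The paper's algorithm handles stochastic transitions and closed-loop policies; the sketch below illustrates only the underlying optimistic-expansion principle, on the simpler deterministic-transition variant, assuming rewards in [0, 1] and an illustrative `step(s, a) -> (next_state, reward)` interface.

```python
import heapq

def optimistic_plan_deterministic(step, s0, actions, gamma=0.95, budget=300):
    """Minimal sketch of optimistic planning with deterministic transitions.

    Not the paper's stochastic, closed-loop algorithm: this only shows the
    optimistic-expansion idea. Every leaf is scored by an upper bound on the
    return reachable through it (reward collected so far plus the optimistic
    remainder gamma^depth / (1 - gamma)), and the best-scored leaf is
    expanded next.
    """
    tie = 0  # tie-breaker so the heap never compares states directly
    frontier = [(-(1.0 / (1.0 - gamma)), tie, s0, 0, 0.0, None)]
    best_first_action, best_return = None, float("-inf")

    for _ in range(budget):
        _, _, s, depth, acc, first = heapq.heappop(frontier)  # most optimistic leaf
        for a in actions:
            s2, r = step(s, a)
            acc2 = acc + (gamma ** depth) * r        # discounted reward along this path
            first2 = a if first is None else first
            if acc2 > best_return:                   # best lower bound seen so far
                best_return, best_first_action = acc2, first2
            b = acc2 + (gamma ** (depth + 1)) / (1.0 - gamma)  # optimistic b-value
            tie += 1
            heapq.heappush(frontier, (-b, tie, s2, depth + 1, acc2, first2))

    return best_first_action  # recommended first action after the expansion budget
```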

    Least-squares methods for policy iteration

    Approximate reinforcement learning deals with the essential problem of applying reinforcement learning in large and continuous state-action spaces by using function approximators to represent the solution. This chapter reviews least-squares methods for policy iteration, an important class of algorithms for approximate reinforcement learning. We discuss three techniques for solving the core policy evaluation component of policy iteration: least-squares temporal difference, least-squares policy evaluation, and Bellman residual minimization. We introduce these techniques starting from their general mathematical principles and detail them down to fully specified algorithms. We pay attention to online variants of policy iteration, and provide a numerical example highlighting the behavior of representative offline and online methods. For the policy evaluation component as well as for the overall resulting approximate policy iteration, we provide guarantees on the performance obtained asymptotically, as the number of samples processed and iterations executed grows to infinity. We also provide finite-sample results, which apply when a finite number of samples and iterations are considered. Finally, we outline several extensions and improvements to the techniques and methods reviewed.
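
    As a concrete illustration of the policy evaluation step, here is a minimal least-squares temporal-difference (LSTD) sketch; the transition format, the feature map `phi`, and the ridge term are assumptions made for the example, not details taken from the chapter.

```python
import numpy as np

def lstd(samples, phi, gamma=0.95, reg=1e-6):
    """Minimal LSTD sketch for evaluating a fixed policy.

    `samples` is assumed to be a list of (s, r, s_next) transitions generated
    by the policy under evaluation, and `phi(s)` a feature vector. It solves
    A w = b with A = sum phi(s) (phi(s) - gamma phi(s'))^T and
    b = sum phi(s) r, so that V(s) is approximated by phi(s)^T w.
    """
    k = len(phi(samples[0][0]))
    A = reg * np.eye(k)              # small ridge term keeps A well conditioned
    b = np.zeros(k)
    for s, r, s_next in samples:
        f = np.asarray(phi(s))
        f_next = np.asarray(phi(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)     # weights of the approximate value function
```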

    Optimal Bidding and Operation of a Power Plant with Solvent-Based Carbon Capture under a CO2 Allowance Market: A Solution with a Reinforcement Learning-Based Sarsa Temporal-Difference Algorithm

    In this paper, a reinforcement learning (RL)-based Sarsa temporal-difference (TD) algorithm is applied to search for a unified bidding and operation strategy for a coal-fired power plant with monoethanolamine (MEA)-based post-combustion carbon capture under different carbon dioxide (CO2) allowance market conditions. The objective of the power plant's decision maker is to maximize the discounted cumulative profit over the power plant lifetime. Two constraints are considered in the objective formulation. First, the tradeoff between the energy-intensive carbon capture and the electricity generation must be made under a presumed fixed fuel consumption. Second, the CO2 allowances purchased from the CO2 allowance market should be approximately equal to the quantity of CO2 emitted by power generation. Three case studies are then presented. In the first case, we show the convergence of the Sarsa TD algorithm and find a deterministic optimal bidding and operation strategy. In the second case, compared with the independently designed operation and bidding strategies discussed in most of the relevant literature, the Sarsa TD-based unified bidding and operation strategy with time-varying, flexible, market-oriented CO2 capture levels is shown to help the power plant decision maker gain a higher discounted cumulative profit. In the third case, a competitor operating another power plant identical to the preceding one is considered under the same CO2 allowance market. The competitor also has carbon capture facilities but applies a different strategy to earn profits. The discounted cumulative profits of the two power plants are then compared, exhibiting the competitiveness of the power plant that uses the unified bidding and operation strategy found by the Sarsa TD algorithm.
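
    For reference, this is what a generic tabular Sarsa learner looks like; the environment interface (`reset`, `step`, `actions`) and the plant-specific state/action encoding are illustrative assumptions, not details from the paper.

```python
import random
from collections import defaultdict

def sarsa(env, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Generic tabular Sarsa sketch with an epsilon-greedy behavior policy.

    `env` is assumed to expose reset() -> state, step(a) -> (state, reward,
    done), and a list `env.actions`; these names are illustrative only.
    """
    Q = defaultdict(float)

    def choose(s):
        if random.random() < eps:                         # explore
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])  # exploit

    for _ in range(episodes):
        s = env.reset()
        a = choose(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = choose(s2)
            # on-policy TD update: bootstrap on the action actually taken next
            target = r if done else r + gamma * Q[(s2, a2)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q
```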

    Reinforcement learning for condition-based control of gas turbine engines

    A condition-based control framework is proposed for gas turbine engines using reinforcement learning and adaptive dynamic programming (RL-ADP). The system behaviour, specifically the fuel efficiency function and constraints, exhibits unknown degradation patterns that vary from engine to engine. Due to these variations, accurate system models that describe the true system states over the life of the engines are difficult to obtain. Consequently, model-based control techniques are unable to fully compensate for the effects of the variations on system performance. The proposed RL-ADP control framework is based on Q-learning and uses measurements of desired performance quantities as reward signals to learn and adapt the system efficiency maps. This is achieved without knowledge of the system variation or degradation dynamics, thus providing a through-life adaptation strategy that delivers improved system performance. In order to overcome the long-standing difficulties associated with the application of adaptive techniques in a safety-critical setting, a dual control-loop structure is proposed in the implementation of the RL-ADP scheme. The overall control framework maintains guarantees on the main thrust control loop whilst extracting improved performance as the engine degrades by tuning sets of variable geometry components in the RL-ADP control loop. Simulation results on representative engine data sets demonstrate the effectiveness of this approach compared to an industry-standard gain-scheduling controller.
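
    The Q-learning update at the core of such an RL-ADP loop is summarised below as a hedged sketch; the reward stands in for the measured performance quantities, and the state/action encoding is purely illustrative.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.05, gamma=0.98):
    """One off-policy Q-learning step of the kind an RL-ADP loop builds on.

    `Q` is a dict keyed by (state, action); `r` stands in for the measured
    performance signal. All names here are illustrative assumptions.
    """
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    td_error = r + gamma * best_next - Q.get((s, a), 0.0)   # temporal-difference error
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error       # move estimate toward target
    return Q
```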

    Optimistic minimax search for noncooperative switched control with or without dwell time

    We consider adversarial problems in which two agents control two switching signals, the first agent aiming to maximize a discounted sum of rewards and the second aiming to minimize it. Both signals may be subject to constraints on the dwell time after a switch. We search the tree of possible mode sequences with an algorithm called optimistic minimax search with dwell time (OMSd), showing that it obtains a solution close to the minimax-optimal one, and we characterize the rate at which the suboptimality goes to zero. The analysis is driven by a novel measure of problem complexity, and it is first given in the general dwell-time case, after which it is specialized to the unconstrained case. We exemplify the framework for networked control systems where the minimizer signal is a discrete time delay on the control channel, and we provide extensive simulations and a real-time experiment for nonlinear systems of this type.
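
    The game being searched can be pictured with the plain depth-limited, discounted minimax sketch below; OMSd itself grows the tree best-first using optimistic bounds and enforces dwell-time constraints, neither of which is shown here, and the `sim(state, u, w) -> (next_state, reward)` interface with rewards in [0, 1] is an assumption made for the example.

```python
def minimax_switched(sim, state, max_modes, min_modes, gamma=0.9, depth=5):
    """Depth-limited discounted minimax over two switching signals.

    A sketch of the unconstrained game only; the paper's OMSd algorithm
    expands the tree optimistically and handles dwell-time constraints.
    """
    def value(s, d):
        if d == depth:
            return 0.0                       # conservative tail; true tail lies in [0, gamma^d/(1-gamma)]
        best = float("-inf")
        for u in max_modes:                  # maximizer picks its mode
            worst = float("inf")
            for w in min_modes:              # minimizer replies with its mode
                s2, r = sim(s, u, w)
                worst = min(worst, r + gamma * value(s2, d + 1))
            best = max(best, worst)
        return best

    # return the maximizer mode with the best guaranteed discounted value
    best_u, best_val = None, float("-inf")
    for u in max_modes:
        worst = float("inf")
        for w in min_modes:
            s2, r = sim(state, u, w)
            worst = min(worst, r + gamma * value(s2, 1))
        if worst > best_val:
            best_u, best_val = u, worst
    return best_u
```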

    Multi-Agent Reinforcement Learning as a Rehearsal for Decentralized Planning

    Decentralized partially observable Markov decision processes (Dec-POMDPs) are a powerful tool for modeling multi-agent planning and decision-making under uncertainty. Prevalent Dec-POMDP solution techniques require centralized computation given full knowledge of the underlying model. Multi-agent reinforcement learning (MARL) based approaches have recently been proposed for distributed solution of Dec-POMDPs without full prior knowledge of the model, but these methods assume that conditions during learning and policy execution are identical. In some practical scenarios this may not be the case. We propose a novel MARL approach in which agents are allowed to rehearse with information that will not be available during policy execution. The key is for the agents to learn policies that do not explicitly rely on these rehearsal features. We also establish a weak convergence result for our algorithm, RLaR, demonstrating that RLaR converges in probability when certain conditions are met. We show experimentally that incorporating rehearsal features can enhance the learning rate compared to non-rehearsal-based learners, and demonstrate fast, (near) optimal performance on many existing benchmark Dec-POMDP problems. We also compare RLaR against an existing approximate Dec-POMDP solver which, like RLaR, does not assume a priori knowledge of the model. While RLaR's policy representation is not as scalable, we show that RLaR produces higher quality policies for most problems and horizons studied.
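
    The rehearsal idea can be illustrated, though not with RLaR's actual update, by a single-agent sketch in which a full-state Q-table (available only during learning) is used purely for bootstrapping, while the executed policy is read from a Q-table keyed by the local observation alone; the environment interface is an assumption for the example.

```python
import random
from collections import defaultdict

def rehearsal_q_learning(env, episodes=1000, alpha=0.1, gamma=0.95, eps=0.1):
    """Illustrative rehearsal pattern, not the paper's RLaR update.

    `env` is assumed to expose reset() -> (state, obs),
    step(a) -> (state, obs, reward, done), and a list `env.actions`.
    The full state is a rehearsal feature: it feeds Q_full, used only as a
    bootstrapping target, so the executed policy (from Q_obs) never depends
    on information that is unavailable at execution time.
    """
    Q_full = defaultdict(float)   # uses the rehearsal feature (full state)
    Q_obs = defaultdict(float)    # what the agent actually acts from

    for _ in range(episodes):
        s, o = env.reset()
        done = False
        while not done:
            if random.random() < eps:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda x: Q_obs[(o, x)])
            s2, o2, r, done = env.step(a)
            target = r if done else r + gamma * max(Q_full[(s2, x)] for x in env.actions)
            Q_full[(s, a)] += alpha * (target - Q_full[(s, a)])
            Q_obs[(o, a)] += alpha * (target - Q_obs[(o, a)])
            s, o = s2, o2
    return Q_obs
```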