Increasing the Action Gap: New Operators for Reinforcement Learning


  • Increasing the Action Gap: New Operators for Reinforcement Learning

    2017/03/18 @ NIPS+

  • 2

    D1 / ryo.iwaki@ams.eng.osaka-u.ac.jp

  • 3

    NIPS 2016

    Safe and Efficient Off-Policy Reinforcement Learning

    Safe and efficient off-policy reinforcement learning

    Rémi Munos (munos@google.com), Google DeepMind

    Tom Stepleton (stepleton@google.com), Google DeepMind

    Anna Harutyunyan (anna.harutyunyan@vub.ac.be), Vrije Universiteit Brussel

    Marc G. Bellemare (bellemare@google.com), Google DeepMind

    Abstract

    In this work, we take a fresh look at some old and new algorithms for off-policy, return-based reinforcement learning. Expressing these in a common form, we derive a novel algorithm, Retrace(λ), with three desired properties: (1) low variance; (2) safety, as it safely uses samples collected from any behaviour policy, whatever its degree of "off-policyness"; and (3) efficiency, as it makes the best use of samples collected from near on-policy behaviour policies. We analyse the contractive nature of the related operator under both off-policy policy evaluation and control settings and derive online sample-based algorithms. To our knowledge, this is the first return-based off-policy control algorithm converging a.s. to Q* without the GLIE assumption (Greedy in the Limit with Infinite Exploration). As a corollary, we prove the convergence of Watkins's Q(λ), which was still an open problem. We illustrate the benefits of Retrace(λ) on a standard suite of Atari 2600 games.

    One fundamental trade-off in reinforcement learning lies in the definition of the update target: should one estimate Monte Carlo returns or bootstrap from an existing Q-function? Return-based methods (where return refers to the sum of discounted rewards ∑_t γ^t r_t) offer some advantages over value bootstrap methods: they are better behaved when combined with function approximation, and quickly propagate the fruits of exploration (Sutton, 1996). On the other hand, value bootstrap methods are more readily applied to off-policy data, a common use case. In this paper we show that learning from returns need not be at cross-purposes with off-policy learning.
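    As a rough illustration of the two kinds of update target contrasted above, the following Python sketch computes a Monte Carlo return target and a one-step bootstrap target for a toy trajectory; the function names and numbers are illustrative, not from the paper.

```python
import numpy as np

def monte_carlo_return(rewards, gamma=0.99):
    """Return-based target: the full discounted sum of rewards, sum_t gamma^t r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def one_step_bootstrap_target(reward, next_q_values, gamma=0.99):
    """Value-bootstrap target: one observed reward plus the current Q estimate
    at the next state, r_0 + gamma * max_a Q(x_1, a)."""
    return reward + gamma * np.max(next_q_values)

# Toy example (made-up numbers): a short trajectory of rewards and a Q estimate.
rewards = [0.0, 0.0, 1.0, 0.0, 2.0]
q_next = np.array([0.5, 1.2, 0.3])
print(monte_carlo_return(rewards))                     # uses the whole return
print(one_step_bootstrap_target(rewards[0], q_next))   # bootstraps after one step
```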

    We start from the recent work of Harutyunyan et al. (2016), who show that naive off-policy policy evaluation, without correcting for the off-policyness of a trajectory, still converges to the desired Q^π value function provided the behavior μ and target π policies are not too far apart (the maximum allowed distance depends on the λ parameter). Their Q^π(λ) algorithm learns from trajectories generated by μ simply by summing discounted off-policy corrected rewards at each time step. Unfortunately, the assumption that μ and π are close is restrictive, as well as difficult to uphold in the control case, where the target policy is always greedy with respect to the current Q-function. In that sense this algorithm is not safe: it does not handle the case of arbitrary off-policyness.

    Alternatively, the Tree-backup, TB(λ), algorithm (Precup et al., 2000) tolerates arbitrary target/behavior discrepancies by scaling information (here called traces) from future temporal differences by the product of target policy probabilities. TB(λ) is not efficient in the near on-policy case (similar π and μ), though, as traces may be cut prematurely, blocking learning from full returns.

    In this work, we express several off-policy, return-based algorithms in a common form. From this we derive an improved algorithm, Retrace(λ), which is both safe and efficient, enjoying convergence guarantees for off-policy policy evaluation and, more importantly, for the control setting.
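    The common form referred to here writes each algorithm as a sum of discounted temporal differences weighted by a product of per-step trace coefficients c_s. The coefficients in the sketch below are the ones reported for these algorithms in Munos et al. (2016); the surrounding trajectory bookkeeping (array layout, variable names) is a hypothetical illustration, not the paper's implementation.

```python
import numpy as np

def trace_coefficient(algo, pi_prob, mu_prob, lam):
    """Per-step trace c_s in the common form; coefficients as listed in
    Munos et al. (2016) for each algorithm."""
    if algo == "importance_sampling":
        return pi_prob / mu_prob                  # unbiased but high variance
    if algo == "q_lambda":
        return lam                                # no correction: not safe far off-policy
    if algo == "tree_backup":
        return lam * pi_prob                      # safe, but cuts traces near on-policy
    if algo == "retrace":
        return lam * min(1.0, pi_prob / mu_prob)  # safe and efficient
    raise ValueError(f"unknown algorithm: {algo}")

def return_based_correction(q, pi, mu, states, actions, rewards,
                            gamma=0.99, lam=1.0, algo="retrace"):
    """Sketch of the off-policy correction
        Delta Q(x_0, a_0) = sum_t gamma^t (c_1 ... c_t)
                            [ r_t + gamma * E_pi Q(x_{t+1}, .) - Q(x_t, a_t) ]
    for a single finite trajectory. q, pi, mu are (n_states, n_actions) arrays;
    states has len(rewards) + 1 entries, actions has len(rewards) entries."""
    delta_q, trace, discount = 0.0, 1.0, 1.0
    for t in range(len(rewards)):
        s, a, s_next = states[t], actions[t], states[t + 1]
        expected_q_next = np.dot(pi[s_next], q[s_next])        # E_pi Q(x_{t+1}, .)
        td_error = rewards[t] + gamma * expected_q_next - q[s, a]
        delta_q += discount * trace * td_error
        discount *= gamma
        if t + 1 < len(rewards):                # c_{t+1} is evaluated at (x_{t+1}, a_{t+1})
            a_next = actions[t + 1]
            trace *= trace_coefficient(algo, pi[s_next, a_next], mu[s_next, a_next], lam)
    return delta_q
```

    With algo="q_lambda" the sketch reduces to the Q^π(λ) update of Harutyunyan et al. (2016) discussed above, and with algo="tree_backup" to TB(λ); Retrace(λ) keeps the trace whenever π assigns at least as much probability to the taken action as μ does, which is why it remains efficient near on-policy while staying safe far off-policy.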

    Retrace(λ) can learn from full returns retrieved from past policy data, as in the context of experience replay (Lin, 1993), which has returned to favour with advances in deep reinforcement learning (Mnih

    arXiv:1606.02647v1 [cs.LG] 8 Jun 2016

    NIPS 2016


  • 4

  • 5

    difficult and engaging for human players. We used the same network architecture, hyperparameter values (see Extended Data Table 1) and learning procedure throughout, taking high-dimensional data (210 × 160 colour video at 60 Hz) as input, to demonstrate that our approach robustly learns successful policies over a variety of games based solely on sensory inputs with only very minimal prior knowledge (that is, merely the input data were visual images, and the number of actions available in each game, but not their correspondences; see Methods). Notably, our method was able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner, illustrated by the temporal evolution of two indices of learning (the agent's average score-per-episode and average predicted Q-values; see Fig. 2 and Supplementary Discussion for details).

    We compared DQN with the best performing methods from the reinforcement learning literature on the 49 games where results were available (refs 12, 15). In addition to the learned agents, we also report scores for a professional human games tester playing under controlled conditions and a policy that selects actions uniformly at random (Extended Data Table 2 and Fig. 3, denoted by 100% (human) and 0% (random) on the y axis; see Methods). Our DQN method outperforms the best existing reinforcement learning methods on 43 of the games without incorporating any of the additional prior knowledge about Atari 2600 games used by other approaches (for example, refs 12, 15). Furthermore, our DQN agent performed at a level that was comparable to that of a professional human games tester across the set of 49 games, achieving more than 75% of the human score on more than half of the games (29 games;
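    The 0% (random) / 100% (human) y axis mentioned above corresponds to a linear rescaling of raw game scores between the random-policy and human-tester baselines; a minimal sketch with made-up numbers (the function name is illustrative):

```python
def human_normalized_score(agent_score, random_score, human_score):
    """Rescale a raw game score so that the random policy maps to 0% and the
    professional human tester maps to 100%, as on the y axis of Fig. 3."""
    return 100.0 * (agent_score - random_score) / (human_score - random_score)

# Illustrative numbers only (not actual results from the paper).
print(human_normalized_score(agent_score=4000.0, random_score=150.0, human_score=5000.0))
```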


    Figure 1 | Schematic illustration of the convolutional neural network. The details of the architecture are explained in the Methods. The input to the neural network consists of an 84 × 84 × 4 image produced by the preprocessing map φ, followed by three convolutional layers (note: snaking blue line symbolizes sliding of each filter across input image) and two fully connected layers with a single output for each valid action. Each hidden layer is followed by a rectifier nonlinearity (that is, max(0, x)).
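    A sketch of the network described in this caption: an 84 × 84 × 4 input, three convolutional layers, two fully connected layers with one output per valid action, and a rectifier after each hidden layer. The filter sizes, strides, and hidden width follow the commonly cited values from the paper's Methods, which are not reproduced in this excerpt, so treat them as assumptions here.

```python
import torch
import torch.nn as nn

class DQNNetwork(nn.Module):
    """Convolutional Q-network: 84x84x4 input -> three conv layers -> two fully
    connected layers, one output per valid action, ReLU after each hidden layer.
    Layer sizes are assumed from the Nature DQN Methods, not from this excerpt."""

    def __init__(self, n_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 4x84x84 -> 32x20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # -> 64x9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # -> 64x7x7
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(7 * 7 * 64, 512), nn.ReLU(),                  # first fully connected layer
            nn.Linear(512, n_actions),                              # one Q-value per action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x))

# A stack of 4 preprocessed 84x84 frames yields one Q-value per action.
q_values = DQNNetwork(n_actions=18)(torch.zeros(1, 4, 84, 84))
print(q_values.shape)  # torch.Size([1, 18])
```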

    [Figure 2, panels a–d: average score per episode (a, b) and average predicted action value Q (c, d), plotted against training epochs.]

    Figure 2 | Training curves tracking the agent's average score and average predicted action-value. a, Each point is the average score achieved per episode after the agent is run with ε-greedy policy (ε = 0.05) for 520 k frames on Space Invaders. b, Average score achieved per episode for Seaquest. c, Average predicted action-value on a held-out set of states on Space Invaders. Each point on the curve is the average of the action-value Q computed over the held-out set of states. Note that Q-values are scaled due to clipping of rewards (see Methods). d, Average predicted action-value on Seaquest. See Supplementary Discussion for details.
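    The evaluation policy in panels a and b is ε-greedy with ε = 0.05, i.e. a random action with probability 0.05 and otherwise the action with the highest predicted Q-value; a minimal sketch (the function name is assumed, not from the paper):

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon=0.05, rng=None):
    """With probability epsilon choose a uniformly random action; otherwise pick
    the action with the highest predicted Q-value (the evaluation policy in Fig. 2)."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

print(epsilon_greedy_action(np.array([0.1, 0.9, 0.3])))  # usually returns 1
```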


    AI [Mnih+15]

    IEEE Robotics & Automation Magazine, March 2016

    regression [2]. With a function approximator, the sampled data from the approximated model can be generated by inappropriate interpolation or extrapolation that improperly updates the policy parameters. In addition, if we aggressively derive the analytical gradient of the approximated model to update the policy, the approximated gradient might be far from the true gradient of the objective function due to the model approximation error. If we consider using these function approximation methods for high-dimensional systems like humanoid robots, this problem becomes more serious due to the difficulty of approximating high-dimensional dynamics models with a limited amount of data sampled from real systems. On the other hand, if the environment is extremely stochastic, a limited amount of previously acquired data might not be able to capture the real environment's properties and could lead to inappropriate policy updates. However, rigid dynamics models, such as a humanoid robot model, do not usually include large stochasticity. Therefore, our approach is suitable for real-robot learning in high-dimensional systems like humanoid robots.

    Moreover, applying RL to actual robot control is difficult, since it usually requires many learning trials that cannot be executed in real environments, and the real system's durability is limited. Previous studies used prior knowledge or properly designed initial trajectories to apply RL to a real robot and improved the robot controller's parameters [1], [4], [10], [19], [32].

    We applied our proposed learning method to our humanoid robot [7] (Figure 13) and show that it can accomplish two different movement-learning tasks without any prior knowledge for the cart-pole swing-up task or with a very simple nominal trajectory for the bask