Slide credit from David Silver

Source: 國立臺灣大學 (National Taiwan University), yvchen/f105-adl/doc/161208_DeepRL2.pdf


Review: Reinforcement Learning


Reinforcement Learning
RL is a general purpose framework for decision making
◦ RL is for an agent with the capacity to act

◦ Each action influences the agent’s future state

◦ Success is measured by a scalar reward signal


Big three: action, state, reward

Agent and Environment


(Figure: the agent-environment interaction loop. At each time step t the agent receives observation o_t and reward r_t from the environment and takes action a_t, e.g., MoveLeft / MoveRight.)
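As a concrete illustration, here is a minimal sketch of that interaction loop in Python (illustrative only; `env` and `agent` are hypothetical objects with a Gym-style reset/step interface and an act method, not anything defined in the slides):

# Minimal agent-environment interaction loop (sketch).
def run_episode(env, agent, max_steps=1000):
    observation = env.reset()                            # initial observation o_0
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(observation)                  # agent picks a_t from o_t
        observation, reward, done, info = env.step(action)  # environment returns o_{t+1}, r_{t+1}
        total_reward += reward                           # success is measured by the scalar reward
        if done:
            break
    return total_reward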

Major Components in an RL Agent
An RL agent may include one or more of these components
◦ Policy: agent’s behavior function

◦ Value function: how good is each state and/or action

◦ Model: agent’s representation of the environment


Reinforcement Learning Approach
Policy-based RL
◦ Search directly for optimal policy

Value-based RL
◦ Estimate the optimal value function

Model-based RL
◦ Build a model of the environment
◦ Plan (e.g. by lookahead) using model


π* is the policy achieving maximum future reward

Q*(s, a) is the maximum value achievable under any policy

RL Agent Taxonomy


Deep Reinforcement Learning
Idea: deep learning for reinforcement learning
◦ Use deep neural networks to represent
• Value function

• Policy

• Model

◦ Optimize loss function by SGD


Value-Based Deep RL
Estimate How Good Each State and/or Action is


Value Function Approximation
Value functions can be represented by a lookup table, but
◦ there are too many states and/or actions to store
◦ it is too slow to learn the value of each entry individually
Instead, values can be estimated with function approximation


Q-Networks
Q-networks represent value functions with weights
◦ generalize from seen states to unseen states
◦ update parameters for function approximation

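To make "represent value functions with weights" concrete, here is a minimal Q-network sketch in PyTorch (my own illustrative architecture and layer sizes, not taken from the slides): it maps a state vector to one Q-value per action.

import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to a Q-value for every action (illustrative sizes)."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),   # one output per action: Q(s, a; theta)
        )

    def forward(self, state):
        return self.net(state)              # shape: (batch, n_actions)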

Q-Learning
Goal: estimate optimal Q-values
◦ Optimal Q-values obey a Bellman equation

◦ Value iteration algorithms solve the Bellman equation


(The term r + γ max_a' Q(s', a') on the right-hand side of the Bellman equation is the learning target.)
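As a concrete (tabular) illustration of the update this slide describes, assuming a NumPy array Q indexed by state and action (my notation, not the slide's):

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q[s, a] toward the Bellman target."""
    target = r + gamma * np.max(Q[s_next])    # learning target: r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (target - Q[s, a])     # value-iteration-style update with step size alpha
    return Q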

Deep Q-Networks (DQN)Represent value function by deep Q-network with weights

Objective is to minimize MSE loss by SGD

Leading to the following Q-learning gradient
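The loss and gradient expressions themselves did not survive the text extraction. In their standard form the objective is L(θ) = E[(r + γ max_a' Q(s', a') − Q(s, a; θ))²]; a PyTorch sketch, reusing the illustrative QNetwork above (`target_net` stands in for whatever network produces the target; in naive Q-learning it is the same network, slide 17 freezes a copy):

import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """MSE between Q(s, a; theta) and the Q-learning target (sketch)."""
    s, a, r, s_next, done = batch                              # tensors for one mini-batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a; theta)
    with torch.no_grad():                                      # the target is treated as a constant
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, target)                            # minimized by SGD w.r.t. theta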


Issue: naïve Q-learning oscillates or diverges using NN due to: 1) correlations between samples, 2) non-stationary targets

Stability Issues with Deep RL
Naive Q-learning oscillates or diverges with neural nets
1. Data is sequential

◦ Successive samples are correlated, non-iid (independent and identically distributed)

2. Policy changes rapidly with slight changes to Q-values
◦ Policy may oscillate

◦ Distribution of data can swing from one extreme to another

3. Scale of rewards and Q-values is unknown
◦ Naive Q-learning gradients can be unstable when backpropagated


Stable Solutions for DQN


DQN provides a stable solution to deep value-based RL
1. Use experience replay

◦ Break correlations in data, bring us back to iid setting

◦ Learn from all past policies

2. Freeze target Q-network
◦ Avoid oscillation

◦ Break correlations between Q-network and target

3. Clip rewards or normalize network adaptively to sensible range
◦ Robust gradients

Stable Solution 1: Experience Replay
To remove correlations, build a dataset from the agent’s experience
◦ Take action a_t according to an 𝜖-greedy policy

◦ Store transition (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory D

◦ Sample random mini-batch of transitions from D

◦ Optimize MSE between Q-network and Q-learning targets


(𝜖-greedy: with small probability 𝜖, take a random action for exploration)
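A minimal replay-memory sketch in Python (illustrative class and method names):

import random
from collections import deque

class ReplayBuffer:
    """Replay memory D (sketch): store transitions, sample random mini-batches."""
    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)        # oldest transitions are dropped once full

    def store(self, s, a, r, s_next, done):
        self.memory.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Sampling uniformly at random breaks the correlation between successive transitions.
        return random.sample(self.memory, batch_size)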

Stable Solution 2: Fixed Target Q-Network
To avoid oscillations, fix parameters used in Q-learning target
◦ Compute Q-learning targets w.r.t. old, fixed parameters
◦ Optimize MSE between Q-network and Q-learning targets
◦ Periodically update fixed parameters

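A minimal sketch of the fixed-target mechanism (assuming PyTorch modules like the QNetwork above; how often to call it is a free choice, e.g. every several thousand steps):

def update_target_network(q_net, target_net):
    """Copy the online Q-network's weights into the frozen target network.
    Between calls, the target network's parameters stay fixed, so the
    Q-learning target does not chase the network being optimized."""
    target_net.load_state_dict(q_net.state_dict())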

Stable Solution 3: Reward / Value Range
To avoid oscillations, control the reward / value range
◦ DQN clips the rewards to [−1, +1]
• Prevents Q-values from becoming too large
• Ensures gradients are well-conditioned

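The clipping itself is a one-liner; a sketch:

import numpy as np

def clip_reward(r):
    """Clip a raw score change to [-1, +1], as DQN does for Atari."""
    return float(np.clip(r, -1.0, 1.0))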

Deep RL in Atari Games


DQN in Atari
Goal: end-to-end learning of values Q(s, a) from pixels

◦ Input: state is stack of raw pixels from last 4 frames

◦ Output: Q(s, a) for all joystick/button positions a

◦ Reward is the score change for that step

DQN Nature Paper [link] [code]
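A sketch of the pixels-to-Q-values network along the lines of the Nature DQN architecture (layer sizes as reported in that paper; frame preprocessing details omitted):

import torch.nn as nn

class AtariQNetwork(nn.Module):
    """Input: stack of the last 4 grayscale 84x84 frames. Output: Q(s, a) for every action."""
    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),          # one Q-value per joystick/button action
        )

    def forward(self, frames):                  # frames: (batch, 4, 84, 84)
        return self.head(self.features(frames))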

Other Improvements: Double DQN
Nature DQN uses the same network both to select and to evaluate the action in the Q-learning target
Double DQN: remove the upward bias caused by max_a Q(s, a)
◦ Current Q-network is used to select actions
◦ Older Q-network is used to evaluate actions

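A sketch of the Double DQN target, i.e. the decoupling of selection and evaluation described above (tensor and function names are mine):

import torch

def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    """Current network selects the next action, older (target) network evaluates it."""
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)        # selection: current Q-network
        q_eval = target_net(s_next).gather(1, a_star).squeeze(1)  # evaluation: older Q-network
        return r + gamma * (1.0 - done) * q_eval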

Other Improvements: Prioritized Replay
Prioritized Replay: weight experience based on surprise
◦ Store experience in priority queue according to DQN error

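A simplified sketch of priority-proportional sampling (the published method also corrects the resulting bias with importance weights, omitted here):

import numpy as np

def sample_prioritized(td_errors, batch_size=32, alpha=0.6, eps=1e-6):
    """Sample transition indices with probability proportional to (|TD error| + eps)^alpha."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    return np.random.choice(len(td_errors), size=batch_size, p=probs)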

Other Improvements: Dueling Network
Dueling Network: split Q-network into two channels
◦ Action-independent value function: estimates how good the state is
◦ Action-dependent advantage function: estimates the additional benefit of each action

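A minimal dueling-head sketch in PyTorch (illustrative sizes; the two channels are recombined with the standard aggregation Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)):

import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Dueling architecture (sketch): Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # action-independent value V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # action-dependent advantage A(s, a)

    def forward(self, state):
        h = self.trunk(state)
        v, adv = self.value(h), self.advantage(h)
        return v + adv - adv.mean(dim=1, keepdim=True)  # recombine the two channels into Q(s, a)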

Policy-Based Deep RL
Estimate How Good an Agent’s Behavior is


Deep Policy Networks
Represent policy by deep network with weights

Objective is to maximize total discounted reward by SGD


(The policy can be stochastic, a ∼ π(a|s), or deterministic, a = π(s).)

Policy Gradient
The gradient of a stochastic policy is given by the policy gradient theorem (see below)

The gradient of a deterministic policy is given by the deterministic policy gradient theorem (see below)


How to deal with continuous actions
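The gradient expressions themselves did not survive the text extraction; in their standard forms (as in Silver's tutorial, with Q denoting the action-value of the current policy; the slide's exact notation may differ) they are:

% Stochastic policy gradient (policy gradient theorem):
\nabla_\theta J(\theta) = \mathbb{E}\big[\, \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s,a) \,\big]

% Deterministic policy gradient, for a = \mu_\theta(s) (handles continuous actions):
\nabla_\theta J(\theta) = \mathbb{E}\big[\, \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s,a)\big|_{a=\mu_\theta(s)} \,\big]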

Actor-Critic (Value-Based + Policy-Based)

Estimate the value function (the critic)

Update policy parameters by SGD (the actor)
◦ Stochastic policy

◦ Deterministic policy

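A minimal one-step actor-critic sketch in PyTorch (a common variant that uses the TD error as the advantage estimate; names and the specific form are illustrative, not taken from the slide):

import torch

def actor_critic_losses(log_prob, value, reward, next_value, gamma=0.99):
    """log_prob = log pi(a|s) from the actor; value/next_value are critic estimates V(s), V(s')."""
    td_target = reward + gamma * next_value.detach()
    td_error = td_target - value                  # critic's one-step TD error
    critic_loss = td_error.pow(2)                 # value-based part: fit the value function
    actor_loss = -log_prob * td_error.detach()    # policy-based part: policy gradient weighted by TD error
    return actor_loss, critic_loss                # both are then minimized by SGD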

Deterministic Deep Actor-Critic
Deep deterministic policy gradient (DDPG) is the continuous analogue of DQN
◦ Experience replay: build dataset from agent’s experience

◦ Critic estimates value of current policy by DQN

◦ Actor updates policy in direction that improves Q


Critic provides loss function for actor
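The "critic provides the loss function for the actor" step, as a short PyTorch sketch (hypothetical `actor` and `critic` modules: the actor maps states to continuous actions, the critic maps state-action pairs to Q-values):

def ddpg_actor_loss(actor, critic, states):
    """Move the deterministic policy in the direction that improves Q."""
    actions = actor(states)                     # a = mu(s)
    return -critic(states, actions).mean()      # minimizing this maximizes the critic's Q estimate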

DDPG in Simulated Physics
Goal: end-to-end learning of control policy from pixels
◦ Input: state is stack of raw pixels from last 4 frames

◦ Output: two separate CNNs for Q and 𝜋

Lillicrap et al., “Continuous control with deep reinforcement learning,” arXiv, 2015.

Model-Based Deep RL
Agent’s Representation of the Environment


Model-Based Deep RL
Goal: learn a transition model of the environment and plan based on the transition model


The objective is to maximize the measured goodness of the model

Model-based deep RL is challenging, and so far has failed in Atari
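A minimal sketch of the plan-by-lookahead idea: random-shooting planning with a learned transition model (`model` and `reward_fn` are hypothetical stand-ins; the compounding-error issue discussed on the next slide shows up in the repeated model calls):

import numpy as np

def plan_by_lookahead(model, reward_fn, state, n_candidates=100, horizon=10, n_actions=4):
    """Roll the learned model forward for several random action sequences and
    return the first action of the best-scoring sequence."""
    best_return, best_first_action = -np.inf, 0
    for _ in range(n_candidates):
        actions = np.random.randint(n_actions, size=horizon)
        s, total = state, 0.0
        for a in actions:
            s = model(s, a)             # predicted next state; errors compound over the trajectory
            total += reward_fn(s)
        if total > best_return:
            best_return, best_first_action = total, int(actions[0])
    return best_first_action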

Issues for Model-Based Deep RL
Compounding errors
◦ Errors in the transition model compound over the trajectory
◦ A long trajectory may result in totally wrong rewards

Deep networks of value/policy can “plan” implicitly
◦ Each layer of network performs arbitrary computational step
◦ n-layer network can “lookahead” n steps


Model-Based Deep RL in Go
Monte-Carlo tree search (MCTS)
◦ MCTS simulates future trajectories

◦ Builds large lookahead search tree with millions of positions

◦ State-of-the-art Go programs use MCTS

Convolutional Networks
◦ 12-layer CNN trained to predict expert moves

◦ Raw CNN (looking at 1 position, no search at all) equals performance of MoGo with a 10^5 position search tree


1st strong Go program

https://deepmind.com/alphago/
Silver et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, 2016.
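For intuition about the search itself, a minimal sketch of the UCB1/UCT selection rule that standard MCTS implementations use to decide which branch of the lookahead tree to expand (standard background, not taken from the slide):

import math

def uct_select(children, c=1.4):
    """children: list of (total_value, visit_count) pairs for one node's moves.
    Returns the index of the move with the best exploration/exploitation score."""
    parent_visits = sum(n for _, n in children) or 1
    def score(child):
        value, visits = child
        if visits == 0:
            return float("inf")         # always try unvisited moves first
        return value / visits + c * math.sqrt(math.log(parent_visits) / visits)
    return max(range(len(children)), key=lambda i: score(children[i]))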

OpenAI Universe
Software platform for measuring and training an AI's general intelligence via the OpenAI gym environment


Concluding Remarks
RL is a general purpose framework for decision making under interactions between agent and environment

An RL agent may include one or more of these components
◦ Policy: agent’s behavior function

◦ Value function: how good is each state and/or action

◦ Model: agent’s representation of the environment

RL problems can be solved by end-to-end deep learning

Reinforcement Learning + Deep Learning = AI


References
Course materials by David Silver: http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html

ICLR 2015 Tutorial: http://www.iclr.cc/lib/exe/fetch.php?media=iclr2015:silver-iclr2015.pdf

ICML 2016 Tutorial: http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf
