
Policy Gradient with Baselines

Alina Vereshchaka

CSE4/510 Reinforcement Learning, Fall 2019

avereshc@buffalo.edu

October 29, 2019

*Slides are adapted from Deep Reinforcement Learning and Control by Katerina Fragkiadaki (Carnegie Mellon) & Policy Gradients by David Silver


Policy Objective Functions

Goal: given policy πθ(s, a) with parameters θ, find best θ

In episodic environments we can use the start value

J1(θ) = Vπθ(s1) = Eπθ[v1]

In continuing environments we can use the average value

JavV(θ) = ∑s dπθ(s) Vπθ(s)

Or the average reward per time-step

JavR(θ) = ∑s dπθ(s) ∑a πθ(s, a) R_s^a

where dπθ(s) is the stationary distribution of the Markov chain induced by πθ, and R_s^a is the expected immediate reward for taking action a in state s
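Whichever objective is chosen, policy-based methods then search for the best θ by gradient ascent on J (a standard remark, added here to connect the objectives to the updates on the following slides):

θt+1 = θt + α ∇θJ(θt), where α is a step size

The rest of the lecture is about estimating ∇θJ(θ) from sampled experience.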

Example: Pong from Pixels

[A sequence of figure-only slides illustrating the Pong-from-Pixels example; the images are not preserved in this extraction.]
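The slides above follow the well-known "Pong from Pixels" setup (Karpathy): a small policy network maps a preprocessed frame to the probability of moving the paddle UP, an action is sampled from that probability, and the network is trained with the policy gradient. A minimal Python/NumPy sketch of the forward pass, with the layer sizes, preprocessing and action codes as illustrative assumptions rather than details taken from the slides:

import numpy as np

H, D = 200, 80 * 80                                    # hidden units, flattened 80x80 input (assumed sizes)
model = {"W1": np.random.randn(H, D) / np.sqrt(D),     # input -> hidden weights
         "W2": np.random.randn(H) / np.sqrt(H)}        # hidden -> single output weight vector

def preprocess(frame):
    """Crop, downsample and binarize a 210x160x3 Atari frame into an 80*80 float vector."""
    frame = frame[35:195]                              # crop to the playing field
    frame = frame[::2, ::2, 0]                         # downsample by 2 and keep one colour channel
    frame = frame.astype(np.float64)
    frame[(frame == 144) | (frame == 109)] = 0         # erase the two background colours
    frame[frame != 0] = 1                              # paddles and ball become 1
    return frame.ravel()

def policy_forward(x):
    """Return P(action = UP) and the hidden activations."""
    h = np.maximum(0, model["W1"] @ x)                 # ReLU hidden layer
    p = 1.0 / (1.0 + np.exp(-(model["W2"] @ h)))       # sigmoid output probability
    return p, h

observation = np.zeros((210, 160, 3), dtype=np.uint8)  # placeholder frame; in practice taken from the emulator
p_up, h = policy_forward(preprocess(observation))
action = 2 if np.random.rand() < p_up else 3           # 2 = UP, 3 = DOWN in the usual Atari action coding

The sampled action's log-probability gradient, weighted by the (discounted) game outcome, is exactly the REINFORCE update discussed on the next slides.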

REINFORCE with Baseline


REINFORCE (Monte-Carlo Policy Gradient)

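The algorithm box for this slide is not preserved in the extraction. As a reminder of what REINFORCE does: sample a complete episode with the current policy, compute the return Gt at every step, and move θ along Gt ∇θ log πθ(at|st). A minimal Python sketch of one update, where env and policy are hypothetical objects with the indicated methods (assumptions for illustration, not the lecture's code):

import numpy as np

def reinforce_update(env, policy, theta, alpha=1e-3, gamma=0.99):
    """One Monte-Carlo policy-gradient (REINFORCE) update from a single sampled episode."""
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:                                   # roll out one full episode with pi_theta
        a = policy.sample(s, theta)
        states.append(s); actions.append(a)
        s, r, done = env.step(a)
        rewards.append(r)

    G, returns = 0.0, [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):           # discounted return-to-go G_t
        G = rewards[t] + gamma * G
        returns[t] = G

    for t in range(len(states)):                      # theta <- theta + alpha * gamma^t * G_t * grad log pi
        theta = theta + alpha * (gamma ** t) * returns[t] \
                      * policy.grad_log_prob(states[t], actions[t], theta)
    return theta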

Policy Gradient

This vanilla policy gradient update has no bias but high variance. Many subsequent algorithms were proposed to reduce the variance while keeping the bias unchanged.

∇θJ(θ) = Eπ[Qπ(s, a)∇θ lnπθ(a|s)]
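In practice the expectation is replaced by a Monte-Carlo estimate over sampled trajectories, with the observed return Gt standing in for Qπ(st, at) (the standard REINFORCE estimator, stated here to make the sampling explicit):

∇θJ(θ) ≈ (1/N) ∑i ∑t Gt(i) ∇θ log πθ(at(i) | st(i))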


Policy Gradient Theorem


Policy Gradient: Problems & Solutions

Vanilla Policy Gradient

Unbiased but very noisy

Requires lots of samples to make it work

Solutions:

Baseline

Temporal Structure

Other (e.g. KL trust region)


Temporal structure

[Derivation slides on exploiting the temporal structure of the trajectory; the content is not preserved in this extraction.]
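The derivation itself is missing from the extraction, but the standard temporal-structure argument is that the action taken at time t cannot affect rewards received before time t, so those terms can be dropped from the estimator without introducing bias. This leaves the reward-to-go form (standard result, stated here rather than copied from the slides):

∇θJ(θ) = E[ ∑t ∇θ log πθ(at|st) ∑t'≥t rt' ]

so each log-probability gradient is weighted only by the reward collected from time t onwards, which further reduces variance.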

Likelihood ratio gradient estimator

Let's analyze the update:

Δθt = α Gt ∇θ log πθ(st, at)

It can be further rewritten as follows:

θt+1 = θt + α Gt ∇θπ(At|St, θ) / π(At|St, θ)

The update is proportional to the product of the return Gt and the gradient of the probability of taking the action actually taken, divided by the probability of taking that action:

(1) The numerator moves θ most in the directions that favor actions that yield the highest return.

(2) The denominator makes the update inversely proportional to the action probability, counteracting the fact that frequently selected actions are at an advantage (the updates will be more often in their direction).
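The rewrite is simply the chain rule applied to the logarithm (spelled out here for completeness):

∇θ log π(At|St, θ) = ∇θπ(At|St, θ) / π(At|St, θ)

so the two forms of the update are identical; the log form is usually preferred because log-probabilities are easy to differentiate for common policy parameterizations (e.g. softmax or Gaussian).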


Policy Gradient with Baseline

Note that in general:

E[b ∇θ log π(At|St)] = E[ ∑a π(a|St) b ∇θ log π(a|St) ]
                     = E[ b ∇θ ∑a π(a|St) ]
                     = E[ b ∇θ 1 ]
                     = 0

Thus the gradient remains unchanged when this additional term is included.

This holds only if b does not depend on the action (though it can depend on the state).

This implies we can subtract a baseline to reduce variance.
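A quick numerical check of this fact for a softmax policy over three actions; the toy Q-values and the particular choice of baseline are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=3)                         # logits of a softmax policy over 3 actions
pi = np.exp(theta) / np.exp(theta).sum()

def grad_log_pi(a):
    """d/dtheta of log softmax(theta)[a] is one_hot(a) - pi."""
    g = -pi.copy()
    g[a] += 1.0
    return g

q = np.array([1.0, 2.0, 3.0])                      # toy action values Q(s, a)
b = q @ pi                                         # action-independent baseline (here E[Q] under pi)

# exact expectation of the baseline term over a ~ pi: it sums to (numerically) zero
print(sum(pi[a] * b * grad_log_pi(a) for a in range(3)))

# sampled gradient estimates with and without the baseline: same mean, lower variance
acts = rng.choice(3, size=5000, p=pi)
no_baseline = np.array([q[a] * grad_log_pi(a) for a in acts])
with_baseline = np.array([(q[a] - b) * grad_log_pi(a) for a in acts])
print(no_baseline.mean(axis=0), with_baseline.mean(axis=0))            # approximately equal
print(no_baseline.var(axis=0).sum(), with_baseline.var(axis=0).sum())  # variance drops with the baseline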


Policy Gradient with Baseline

What if we subtract b from the return?

∇θJ(θ) = E[∇θ log π(At|St) (Qπ(St, At) − b)]

A good baseline is Vπ(St):

∇θJ(θ) = E[∇θ log π(At|St) (Qπ(St, At) − Vπ(St))]
        = E[∇θ log π(At|St) Aπ(St, At)]

Thus we can rewrite the policy gradient using the advantage function Aπ(St, At).
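In practice Aπ(St, At) is not known and must be estimated. Two common estimators, consistent with the generalized advantage estimation reference cited on the last slide (stated here as standard choices, not taken from the slide):

Ât = Gt − V̂(St)  (Monte-Carlo return minus a learned value baseline)

Ât = rt+1 + γ V̂(St+1) − V̂(St)  (the one-step TD error δt)

where V̂ is a learned approximation of Vπ.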


REINFORCE with Baseline

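The algorithm box (REINFORCE with baseline, as in Sutton & Barto) is not preserved in this extraction. A minimal Python sketch of the update it performs, where grad_log_pi, v_hat and grad_v are caller-supplied approximators (hypothetical helpers, not the lecture's code):

def reinforce_with_baseline(episode, theta, w, grad_log_pi, v_hat, grad_v,
                            alpha_theta=1e-3, alpha_w=1e-2, gamma=0.99):
    """One episode of REINFORCE with a learned state-value baseline v_hat(s, w).

    episode: list of (state, action, reward) tuples generated by following pi_theta.
    """
    G = 0.0
    for t in reversed(range(len(episode))):              # iterate backwards so G_t is available at each step
        s, a, r = episode[t]
        G = r + gamma * G                                # discounted return-to-go from time t
        delta = G - v_hat(s, w)                          # how much better the return was than the baseline
        w = w + alpha_w * delta * grad_v(s, w)           # move the baseline towards the observed returns
        theta = theta + alpha_theta * (gamma ** t) * delta * grad_log_pi(s, a, theta)
    return theta, w

Subtracting v_hat(St, w) does not bias the gradient (previous slides) but typically reduces its variance substantially, which is the point of the baseline.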

*Schulman, John, et al. "High-dimensional continuous control using generalized advantage estimation."
