
Policy Gradient with Baselines

Alina Vereshchaka

CSE4/510 Reinforcement Learning, Fall 2019

avereshc@buffalo.edu

October 29, 2019

*Slides are adapted from Deep Reinforcement Learning and Control by Katerina Fragkiadaki (Carnegie Mellon) & Policy Gradients by David Silver


Policy Objective Functions

Goal: given policy πθ(s, a) with parameters θ, find best θ

In episodic environments we can use the start value

J1(θ) = Vπθ(s1) = Eπθ[v1]

In continuing environments we can use the average value

JavV(θ) = ∑s dπθ(s) Vπθ(s)

Or the average reward per time-step

JavR(θ) = ∑s dπθ(s) ∑a πθ(s, a) R_s^a

where dπθ(s) is the stationary distribution of the Markov chain induced by πθ, and R_s^a is the expected immediate reward for taking action a in state s
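Whichever objective is chosen, policy-based methods then search for the best θ by gradient ascent on J (a standard remark, added here to connect the objectives to the updates on the following slides):

θt+1 = θt + α ∇θJ(θt), where α is a step size

The rest of the lecture is about estimating ∇θJ(θ) from sampled experience.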

Example: Pong from Pixels

[A sequence of figure-only slides illustrating the Pong-from-Pixels example; the images are not preserved in this extraction.]
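The slides above follow the well-known "Pong from Pixels" setup (Karpathy): a small policy network maps a preprocessed frame to the probability of moving the paddle UP, an action is sampled from that probability, and the network is trained with the policy gradient. A minimal Python/NumPy sketch of the forward pass, with the layer sizes, preprocessing and action codes as illustrative assumptions rather than details taken from the slides:

import numpy as np

H, D = 200, 80 * 80                                    # hidden units, flattened 80x80 input (assumed sizes)
model = {"W1": np.random.randn(H, D) / np.sqrt(D),     # input -> hidden weights
         "W2": np.random.randn(H) / np.sqrt(H)}        # hidden -> single output weight vector

def preprocess(frame):
    """Crop, downsample and binarize a 210x160x3 Atari frame into an 80*80 float vector."""
    frame = frame[35:195]                              # crop to the playing field
    frame = frame[::2, ::2, 0]                         # downsample by 2 and keep one colour channel
    frame = frame.astype(np.float64)
    frame[(frame == 144) | (frame == 109)] = 0         # erase the two background colours
    frame[frame != 0] = 1                              # paddles and ball become 1
    return frame.ravel()

def policy_forward(x):
    """Return P(action = UP) and the hidden activations."""
    h = np.maximum(0, model["W1"] @ x)                 # ReLU hidden layer
    p = 1.0 / (1.0 + np.exp(-(model["W2"] @ h)))       # sigmoid output probability
    return p, h

observation = np.zeros((210, 160, 3), dtype=np.uint8)  # placeholder frame; in practice taken from the emulator
p_up, h = policy_forward(preprocess(observation))
action = 2 if np.random.rand() < p_up else 3           # 2 = UP, 3 = DOWN in the usual Atari action coding

The sampled action's log-probability gradient, weighted by the (discounted) game outcome, is exactly the REINFORCE update discussed on the next slides.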

REINFORCE with Baseline


REINFORCE (Monte-Carlo Policy Gradient)

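The algorithm box for this slide is not preserved in the extraction. As a reminder of what REINFORCE does: sample a complete episode with the current policy, compute the return Gt at every step, and move θ along Gt ∇θ log πθ(at|st). A minimal Python sketch of one update, where env and policy are hypothetical objects with the indicated methods (assumptions for illustration, not the lecture's code):

import numpy as np

def reinforce_update(env, policy, theta, alpha=1e-3, gamma=0.99):
    """One Monte-Carlo policy-gradient (REINFORCE) update from a single sampled episode."""
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:                                   # roll out one full episode with pi_theta
        a = policy.sample(s, theta)
        states.append(s); actions.append(a)
        s, r, done = env.step(a)
        rewards.append(r)

    G, returns = 0.0, [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):           # discounted return-to-go G_t
        G = rewards[t] + gamma * G
        returns[t] = G

    for t in range(len(states)):                      # theta <- theta + alpha * gamma^t * G_t * grad log pi
        theta = theta + alpha * (gamma ** t) * returns[t] \
                      * policy.grad_log_prob(states[t], actions[t], theta)
    return theta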

Policy Gradient

This vanilla policy gradient update has no bias but high variance. Many subsequent algorithms were proposed to reduce the variance while keeping the bias unchanged.

∇θJ(θ) = Eπ[Qπ(s, a)∇θ lnπθ(a|s)]
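In practice the expectation is replaced by a Monte-Carlo estimate over sampled trajectories, with the observed return Gt standing in for Qπ(st, at) (the standard REINFORCE estimator, stated here to make the sampling explicit):

∇θJ(θ) ≈ (1/N) ∑i ∑t Gt(i) ∇θ log πθ(at(i) | st(i))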


Policy Gradient Theorem


Policy Gradient: Problems & Solutions

Vanilla Policy Gradient

Unbiased but very noisy

Requires lots of samples to make it work

Solutions:

Baseline

Temporal Structure

Other (e.g. KL trust region)


Temporal structure

[Derivation slides on exploiting the temporal structure of the trajectory; the content is not preserved in this extraction.]
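The derivation itself is missing from the extraction, but the standard temporal-structure argument is that the action taken at time t cannot affect rewards received before time t, so those terms can be dropped from the estimator without introducing bias. This leaves the reward-to-go form (standard result, stated here rather than copied from the slides):

∇θJ(θ) = E[ ∑t ∇θ log πθ(at|st) ∑t'≥t rt' ]

so each log-probability gradient is weighted only by the reward collected from time t onwards, which further reduces variance.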

Likelihood ratio gradient estimator

Let's analyze the update:

Δθt = α Gt ∇θ log πθ(st, at)

It can be further rewritten as follows:

θt+1 = θt + α Gt ∇θπ(At|St, θ) / π(At|St, θ)

The update is proportional to the product of the return Gt and the gradient of the probability of taking the action actually taken, divided by the probability of taking that action:

(1) The numerator moves θ most in the directions that favor actions that yield the highest return.

(2) The denominator makes the update inversely proportional to the action probability, counteracting the fact that frequently selected actions are at an advantage (the updates will be more often in their direction).
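The rewrite is simply the chain rule applied to the logarithm (spelled out here for completeness):

∇θ log π(At|St, θ) = ∇θπ(At|St, θ) / π(At|St, θ)

so the two forms of the update are identical; the log form is usually preferred because log-probabilities are easy to differentiate for common policy parameterizations (e.g. softmax or Gaussian).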


Policy Gradient with Baseline

Note that in general:

E[b ∇θ log π(At|St)] = E[ ∑a π(a|St) b ∇θ log π(a|St) ]
                     = E[ b ∇θ ∑a π(a|St) ]
                     = E[ b ∇θ 1 ]
                     = 0

Thus the gradient remains unchanged when this additional term is included.

This holds only if b does not depend on the action (though it can depend on the state).

This implies we can subtract a baseline to reduce variance.
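A quick numerical check of this fact for a softmax policy over three actions; the toy Q-values and the particular choice of baseline are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=3)                         # logits of a softmax policy over 3 actions
pi = np.exp(theta) / np.exp(theta).sum()

def grad_log_pi(a):
    """d/dtheta of log softmax(theta)[a] is one_hot(a) - pi."""
    g = -pi.copy()
    g[a] += 1.0
    return g

q = np.array([1.0, 2.0, 3.0])                      # toy action values Q(s, a)
b = q @ pi                                         # action-independent baseline (here E[Q] under pi)

# exact expectation of the baseline term over a ~ pi: it sums to (numerically) zero
print(sum(pi[a] * b * grad_log_pi(a) for a in range(3)))

# sampled gradient estimates with and without the baseline: same mean, lower variance
acts = rng.choice(3, size=5000, p=pi)
no_baseline = np.array([q[a] * grad_log_pi(a) for a in acts])
with_baseline = np.array([(q[a] - b) * grad_log_pi(a) for a in acts])
print(no_baseline.mean(axis=0), with_baseline.mean(axis=0))            # approximately equal
print(no_baseline.var(axis=0).sum(), with_baseline.var(axis=0).sum())  # variance drops with the baseline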


Policy Gradient with Baseline

What if we subtract b from the return?

∇θJ(θ) = E[∇θ log π(At|St) (Qπ(St, At) − b)]

A good baseline is Vπ(St):

∇θJ(θ) = E[∇θ log π(At|St) (Qπ(St, At) − Vπ(St))]
        = E[∇θ log π(At|St) Aπ(St, At)]

Thus we can rewrite the policy gradient using the advantage function Aπ(St, At).
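In practice Aπ(St, At) is not known and must be estimated. Two common estimators, consistent with the generalized advantage estimation reference cited on the last slide (stated here as standard choices, not taken from the slide):

Ât = Gt − V̂(St)  (Monte-Carlo return minus a learned value baseline)

Ât = rt+1 + γ V̂(St+1) − V̂(St)  (the one-step TD error δt)

where V̂ is a learned approximation of Vπ.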


REINFORCE with Baseline

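The algorithm box (REINFORCE with baseline, as in Sutton & Barto) is not preserved in this extraction. A minimal Python sketch of the update it performs, where grad_log_pi, v_hat and grad_v are caller-supplied approximators (hypothetical helpers, not the lecture's code):

def reinforce_with_baseline(episode, theta, w, grad_log_pi, v_hat, grad_v,
                            alpha_theta=1e-3, alpha_w=1e-2, gamma=0.99):
    """One episode of REINFORCE with a learned state-value baseline v_hat(s, w).

    episode: list of (state, action, reward) tuples generated by following pi_theta.
    """
    G = 0.0
    for t in reversed(range(len(episode))):              # iterate backwards so G_t is available at each step
        s, a, r = episode[t]
        G = r + gamma * G                                # discounted return-to-go from time t
        delta = G - v_hat(s, w)                          # how much better the return was than the baseline
        w = w + alpha_w * delta * grad_v(s, w)           # move the baseline towards the observed returns
        theta = theta + alpha_theta * (gamma ** t) * delta * grad_log_pi(s, a, theta)
    return theta, w

Subtracting v_hat(St, w) does not bias the gradient (previous slides) but typically reduces its variance substantially, which is the point of the baseline.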

*Schulman, John, et al. "High-dimensional continuous control using generalized advantage estimation."
