[DL輪読会 / DL Reading Group] Bridging the Gap Between Value and Policy Based Reinforcement Learning

Page 7:

Q^\pi(s,a) = r(s,a) + \gamma\, \mathbb{E}_\pi\left[ Q^\pi(s',a') \right]

L = \left( r(s,a) + \gamma\, Q_\theta^\pi(s',a') - Q_\theta^\pi(s,a) \right)^2

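A minimal tabular sketch of this on-policy loss (the helper name and toy numbers are mine, not from the slides): the target bootstraps from the next action actually sampled by the current policy, which is what the expectation E_π expresses.

```python
import numpy as np

def sarsa_td_loss(q, s, a, r, s_next, a_next, gamma=0.99):
    """Squared on-policy TD error for a tabular Q estimate.

    The bootstrap uses the next action sampled from the current policy,
    matching the expectation E_pi in the Bellman equation above.
    """
    target = r + gamma * q[s_next, a_next]
    return (target - q[s, a]) ** 2

q = np.zeros((5, 2))                                          # 5 states, 2 actions (toy sizes)
print(sarsa_td_loss(q, s=0, a=1, r=1.0, s_next=3, a_next=0))  # -> 1.0
```
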
Page 10:

Q^\circ(s,a) = r(s,a) + \gamma \max_{a'} Q^\circ(s',a')

L = \left( r(s,a) + \gamma \max_{a'} Q_\theta^\circ(s',a') - Q_\theta^\circ(s,a) \right)^2

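For contrast, the same tabular sketch with the hard-max target (again an illustrative helper, not the authors' code): the only change from the on-policy loss is the max over next actions.

```python
import numpy as np

def q_learning_loss(q, s, a, r, s_next, gamma=0.99):
    """Squared off-policy TD error: the target applies the hard-max Bellman
    backup over next actions instead of following the sampled next action."""
    target = r + gamma * np.max(q[s_next])
    return (target - q[s, a]) ** 2

q = np.array([[0.0, 0.5], [1.0, 2.0]])                  # tiny 2-state, 2-action table
print(q_learning_loss(q, s=0, a=0, r=1.0, s_next=1))    # target = 1 + 0.99 * 2
```
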
Page 13:

Q^\pi(s,a) = r(s,a) + \gamma\, \mathbb{E}_\pi\left[ Q^\pi(s',a') \right]

L = \left( \sum_{i=0}^{n-1} \gamma^i\, r(s_i,a_i) + \gamma^n Q_\theta^\pi(s_n,a_n) - Q_\theta^\pi(s_0,a_0) \right)^2

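A sketch of the n-step on-policy variant under the same tabular assumptions (helper name and numbers are mine): the target sums the discounted rewards along the sampled sub-trajectory and bootstraps from Q only at step n.

```python
import numpy as np

def n_step_loss(q, states, actions, rewards, gamma=0.99):
    """Squared n-step on-policy error on one sampled sub-trajectory.

    states and actions have length n+1, rewards has length n; the target is
    the discounted reward sum plus a bootstrap from Q at step n.
    """
    n = len(rewards)
    ret = sum(gamma**i * rewards[i] for i in range(n))
    target = ret + gamma**n * q[states[n], actions[n]]
    return (target - q[states[0], actions[0]]) ** 2

q = np.zeros((4, 2))
print(n_step_loss(q, states=[0, 1, 2], actions=[0, 1, 0], rewards=[1.0, 0.5]))
```
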
Page 15:

L = \left( r(s,a) + \gamma \max_{a'} Q_\theta^\circ(s',a') - Q_\theta^\circ(s,a) \right)^2

Page 19:

Q^*(s,a) = r(s,a) + \gamma\, \tau \log \sum_{a'} \exp\!\left( Q^*(s',a')/\tau \right)

Page 20:

\tau \log \sum_{a'} \exp\!\left( Q^*(s',a')/\tau \right)

= \tau \log\!\left( \exp\!\left( Q^*(s_M,a_M)/\tau \right) \sum_{a'} \exp\!\left( (Q^*(s',a') - Q^*(s_M,a_M))/\tau \right) \right)

= \max_{a'} Q^*(s',a') + \tau \log \sum_{a'} \exp\!\left( (Q^*(s',a') - Q^*(s_M,a_M))/\tau \right)

where Q^*(s_M,a_M) = \max_{a'} Q^*(s',a'), so the correction term vanishes as \tau \to 0 and the soft backup recovers the hard max.

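A quick numerical check of this soft-max-to-hard-max argument (using scipy's logsumexp, with made-up Q values): the entropy-regularized backup approaches the max as τ → 0.

```python
import numpy as np
from scipy.special import logsumexp

def soft_backup(q_next, tau):
    """Entropy-regularized backup: tau * log sum_a' exp(Q(s',a')/tau)."""
    return tau * logsumexp(q_next / tau)

q_next = np.array([1.0, 2.0, 3.0])         # hypothetical Q*(s', .) values
for tau in [1.0, 0.1, 0.01]:
    print(tau, soft_backup(q_next, tau))   # approaches max(q_next) = 3.0
```
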
Page 21:

V^*(s) = -\tau \log \pi^*(a \mid s) + r(s,a) + \gamma V^*(s')

Page 22:

Setup (one-shot decision): from state s_0 with value v_0, the actions \{a_1,\dots,a_n\} yield rewards r_i and lead to next states \{s_1,\dots,s_n\} with values \{v_1,\dots,v_n\}.

O_{MR}(\pi) = \sum_{i=1}^{n} \pi(a_i)\left( r_i + \gamma v_i^\circ \right)

v_0^\circ = O_{MR}(\pi^\circ) = \max_i \left( r_i + \gamma v_i^\circ \right)

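A small sketch to make the expected-return objective concrete (rewards and values are hypothetical numbers, not from the slides): the greedy one-hot policy attains the max.

```python
import numpy as np

r = np.array([1.0, 0.0, 0.5])              # hypothetical per-action rewards r_i
v = np.array([0.0, 2.0, 1.0])              # hypothetical next-state values v_i
gamma = 0.9

def o_mr(pi):
    """Expected-return objective O_MR(pi) = sum_i pi(a_i) * (r_i + gamma * v_i)."""
    return np.sum(pi * (r + gamma * v))

q = r + gamma * v                          # per-action return
greedy = np.eye(len(q))[np.argmax(q)]      # one-hot policy on the best action
print(o_mr(greedy), q.max())               # greedy attains max_i (r_i + gamma * v_i)
```
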
Page 23:

O_{ENT}(\pi) = \sum_{i=1}^{n} \pi(a_i)\left( r_i + \gamma v_i^* - \tau \log \pi(a_i) \right)

O_{ENT}(\pi) = -\tau \sum_{i=1}^{n} \pi(a_i) \log \frac{\pi(a_i)}{\exp\!\left( (r_i + \gamma v_i^*)/\tau \right)/S} + \tau \log S,
\qquad S = \sum_{i'=1}^{n} \exp\!\left( (r_{i'} + \gamma v_{i'}^*)/\tau \right)

(i.e. O_{ENT}(\pi) = -\tau\, \mathrm{KL}(\pi \,\|\, \pi^*) + \tau \log S, which is maximized by \pi = \pi^*)

\pi^*(a_i) = \frac{\exp\!\left( (r_i + \gamma v_i^*)/\tau \right)}{\sum_{i'=1}^{n} \exp\!\left( (r_{i'} + \gamma v_{i'}^*)/\tau \right)}

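A numerical check under the same hypothetical numbers as above: the Boltzmann (softmax) policy from the slide maximizes O_ENT, and its value equals τ log Σ exp(q_i/τ).

```python
import numpy as np
from scipy.special import softmax, logsumexp

r = np.array([1.0, 0.0, 0.5])              # same hypothetical numbers as above
v = np.array([0.0, 2.0, 1.0])
gamma, tau = 0.9, 0.5
q = r + gamma * v                          # q_i = r_i + gamma * v_i^*

def o_ent(pi, eps=1e-12):
    """Entropy-regularized objective O_ENT(pi)."""
    return np.sum(pi * (q - tau * np.log(pi + eps)))

pi_star = softmax(q / tau)                 # the Boltzmann policy from the slide
print(o_ent(pi_star))                      # = tau * log sum_i exp(q_i / tau)
print(tau * logsumexp(q / tau))
print(o_ent(np.ones(3) / 3))               # any other policy scores lower
```
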
Page 24:

v_0^* = O_{ENT}(\pi^*) = \tau \log \sum_{i=1}^{n} \exp\!\left( (r_i + \gamma v_i^*)/\tau \right)

\pi^*(a_i) = \frac{\exp\!\left( (r_i + \gamma v_i^*)/\tau \right)}{\sum_{i'=1}^{n} \exp\!\left( (r_{i'} + \gamma v_{i'}^*)/\tau \right)}

v_0^* = -\tau \log \pi^*(a_i) + r(s_i, a_i) + \gamma v_i^* \quad \text{(for every action } a_i\text{)}

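The last identity is the path-consistency property; a short check with the same assumed numbers shows it recovers v_0^* from every action, not only the greedy one.

```python
import numpy as np
from scipy.special import softmax, logsumexp

q = np.array([1.0, 1.8, 1.4])              # hypothetical r_i + gamma * v_i^*
tau = 0.5

v0_star = tau * logsumexp(q / tau)         # optimal soft value at s_0
pi_star = softmax(q / tau)

# Path consistency: every action reproduces the same v_0^*.
for i in range(len(q)):
    print(-tau * np.log(pi_star[i]) + q[i])
print(v0_star)
```
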
Page 25:

V^*(s) = -\tau \log \pi^*(a \mid s) + r(s,a) + \gamma V^*(s')

-V^*(s_1) + \gamma^{t-1} V^*(s_t) + R(s_{1:t}) - \tau G(s_{1:t}, \pi^*) = 0

(the multi-step form follows by applying the one-step identity at each step i = 1, ..., t-1, weighting by \gamma^{i-1}, and summing so the intermediate V^* terms telescope)

R(s_{m:n}) = \sum_{i=0}^{n-m-1} \gamma^i\, r(s_{m+i}, a_{m+i}),
\qquad G(s_{m:n}, \pi) = \sum_{i=0}^{n-m-1} \gamma^i \log \pi(a_{m+i} \mid s_{m+i})

Page 26:

C_{\theta,\phi}(s_{1:t}) = -V_\phi(s_1) + \gamma^{t-1} V_\phi(s_t) + R(s_{1:t}) - \tau G(s_{1:t}, \pi_\theta)

\Delta\theta \propto C_{\theta,\phi}(s_{1:t})\, \nabla_\theta G(s_{1:t}, \pi_\theta)

\Delta\phi \propto C_{\theta,\phi}(s_{1:t}) \left( \nabla_\phi V_\phi(s_1) - \gamma^{t-1} \nabla_\phi V_\phi(s_t) \right)

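A hedged PyTorch sketch of one PCL-style update (my own illustration under assumed policy/value callables, not the authors' implementation): minimizing the squared consistency error ½C² reproduces both update directions above, since its gradient with respect to θ is proportional to -C ∇_θ G and with respect to φ is proportional to -C (∇_φ V_φ(s_1) - γ^{t-1} ∇_φ V_φ(s_t)).

```python
import torch

def pcl_step(policy, value, states, actions, rewards, gamma, tau, opt_pi, opt_v):
    """One path-consistency update on a sub-trajectory s_1 .. s_t.

    Assumes: policy(batch_of_states) -> action logits, value(state) -> scalar V,
    states is a (t, obs_dim) tensor, actions and rewards have length t - 1.
    """
    d = len(rewards)                                   # t - 1 steps
    discounts = gamma ** torch.arange(d, dtype=torch.float32)

    logp = torch.log_softmax(policy(states[:-1]), dim=-1)
    logp_a = logp.gather(1, actions.unsqueeze(1)).squeeze(1)

    R = (discounts * rewards).sum()                    # discounted reward sum
    G = (discounts * logp_a).sum()                     # discounted log-prob sum
    C = -value(states[0]) + gamma**d * value(states[-1]) + R - tau * G

    loss = 0.5 * C.pow(2)                              # squared consistency error
    opt_pi.zero_grad(); opt_v.zero_grad()
    loss.backward()                                    # grads w.r.t. theta and phi
    opt_pi.step(); opt_v.step()
    return C.item()
```
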
Page 28:

A_{\theta,\phi}(s_{1:d+1}) = -V_\phi(s_1) + \gamma^{d} V_\phi(s_{d+1}) + R(s_{1:d+1})

\Delta\theta \propto \mathbb{E}_{s_{0:T}}\!\left[ \sum_{i=0}^{T-1} A_{\theta,\phi}(s_{i:i+d})\, \nabla_\theta \log \pi_\theta(a_i \mid s_i) \right]

\Delta\phi \propto \mathbb{E}_{s_{0:T}}\!\left[ \sum_{i=0}^{T-1} A_{\theta,\phi}(s_{i:i+d})\, \nabla_\phi V_\phi(s_i) \right]

C_{\theta,\phi}(s_{1:t}) = -V_\phi(s_1) + \gamma^{t-1} V_\phi(s_t) + R(s_{1:t}) - \tau G(s_{1:t}, \pi_\theta)

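For comparison with the PCL step above, a sketch of the d-step advantage actor-critic update on a single sub-trajectory (same assumed callables; a full implementation typically updates at every step i along s_{0:T}): unlike C, the advantage target contains no log π_θ term.

```python
import torch

def a2c_step(policy, value, states, actions, rewards, gamma, opt_pi, opt_v):
    """d-step advantage actor-critic update on one sub-trajectory (sketch)."""
    d = len(rewards)
    discounts = gamma ** torch.arange(d, dtype=torch.float32)
    R = (discounts * rewards).sum()

    target = (R + gamma**d * value(states[-1])).detach()   # bootstrapped target
    adv = target - value(states[0])                        # d-step advantage A

    logp_a = torch.log_softmax(policy(states[0:1]), dim=-1)[0, actions[0]]
    policy_loss = -adv.detach() * logp_a    # descent step: A * grad_theta log pi
    value_loss = 0.5 * adv.pow(2)           # descent step: A * grad_phi V(s_1)

    opt_pi.zero_grad(); opt_v.zero_grad()
    (policy_loss + value_loss).backward()
    opt_pi.step(); opt_v.step()
    return adv.item()
```
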