Continuous Deep Q-Learning with Model-based Acceleration (ICML 2016). S. Gu, T. Lillicrap, I. Sutskever, S. Levine. Presenter: Hyemin Ahn



Introduction

Yet another improved work on deep reinforcement learning.

It tries to incorporate the advantages of both model-free and model-based reinforcement learning.

Results: Preview

Reinforcement Learning: Overview

Agent: how can we formalize our behavior?


Wow, so scare, such gun, so many bullets, nice suit btw.


MDP (Markov Decision Process)
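The overview slides build on the MDP formalism; as a reminder, a minimal statement in standard notation (not reproduced from the slides) is:

```latex
\[
\mathcal{M} = \big(\mathcal{S},\, \mathcal{A},\, p(s_{t+1} \mid s_t, a_t),\, r(s_t, a_t),\, \gamma\big),
\qquad
Q^{\pi}(s_t, a_t) = \mathbb{E}\!\left[\sum_{k \ge 0} \gamma^{k}\, r(s_{t+k}, a_{t+k}) \;\middle|\; s_t, a_t, \pi\right],
\]
\[
Q^{*}(s, a) = \mathbb{E}_{s' \sim p(\cdot \mid s, a)}\!\left[\, r(s, a) + \gamma \max_{a'} Q^{*}(s', a') \,\right].
\]
```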


Reinforcement Learning: Model-Free?

Continuous Q-Learning with Normalized Advantage Functions
How did the authors learn a parameterized Q-function with deep learning when the state-action domain is continuous?

Value function: they suggest using a neural network that separately outputs a value function term and an advantage term.
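A minimal sketch of that decomposition in PyTorch (module and layer names are my own, not the authors' code): the network outputs V(s), a mean action mu(s), and the entries of a lower-triangular matrix L(s), from which P(s) = L(s)L(s)^T, so that Q(s,a) = V(s) - 1/2 (a - mu(s))^T P(s) (a - mu(s)).

```python
import torch
import torch.nn as nn


class NAFNetwork(nn.Module):
    """Sketch of a Normalized Advantage Function network.

    Q(s, a) = V(s) - 0.5 * (a - mu(s))^T P(s) (a - mu(s)),
    where P(s) = L(s) L(s)^T is positive semi-definite by construction.
    """

    def __init__(self, state_dim, action_dim, hidden=200):
        super().__init__()
        self.action_dim = action_dim
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.value_head = nn.Linear(hidden, 1)            # V(s)
        self.mu_head = nn.Linear(hidden, action_dim)      # mu(s), the greedy action
        self.l_head = nn.Linear(hidden, action_dim * (action_dim + 1) // 2)

    def forward(self, state, action):
        h = self.trunk(state)
        value = self.value_head(h)
        mu = torch.tanh(self.mu_head(h))                  # assumes actions bounded in [-1, 1]

        # Build lower-triangular L(s) with a positive diagonal, then P = L L^T.
        batch = state.shape[0]
        L = torch.zeros(batch, self.action_dim, self.action_dim, device=state.device)
        tril_idx = torch.tril_indices(self.action_dim, self.action_dim)
        L[:, tril_idx[0], tril_idx[1]] = self.l_head(h)
        diag_idx = torch.arange(self.action_dim)
        L[:, diag_idx, diag_idx] = torch.exp(L[:, diag_idx, diag_idx])
        P = L @ L.transpose(1, 2)

        # Quadratic advantage term: non-positive, maximal (zero) at a = mu(s).
        delta = (action - mu).unsqueeze(-1)               # (batch, action_dim, 1)
        advantage = -0.5 * (delta.transpose(1, 2) @ P @ delta).squeeze(-1)
        q_value = value + advantage
        return q_value, value, mu
```

Because the advantage is a negative quadratic in the action, the greedy action is simply mu(s), which is what makes the Q-learning maximization tractable in a continuous action space.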

Of these, the least novel are the value/advantage decomposition of Q(s,a) and the use of locally-adapted linear-Gaussian dynamics.


Trick: assume that we have a target network and an experience container (replay buffer).

But we don't know the target!
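The usual answer (as in DQN/DDPG) is to bootstrap the target from a slowly-updated copy of the network and to sample transitions from the experience container. A minimal sketch, assuming the NAFNetwork above and (s, a, r, s', done) tuples; since the NAF advantage is non-positive and zero at a = mu(s), max_a Q(s', a) = V(s'), so only the target network's value head is needed:

```python
import torch


def compute_td_target(target_net, rewards, next_states, dones, gamma=0.99):
    """TD target y = r + gamma * (1 - done) * V'(s') from the target network."""
    with torch.no_grad():
        # V'(s') does not depend on the action argument, so a dummy action is fine.
        dummy_action = torch.zeros(next_states.shape[0], target_net.action_dim,
                                   device=next_states.device)
        _, next_value, _ = target_net(next_states, dummy_action)
        return rewards + gamma * (1.0 - dones) * next_value.squeeze(-1)


def soft_update(target_net, online_net, tau=0.001):
    """Polyak-average the online weights into the target network after each step."""
    for t_param, o_param in zip(target_net.parameters(), online_net.parameters()):
        t_param.data.mul_(1.0 - tau).add_(tau * o_param.data)
```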

Accelerating Learning with Imagination Rollouts
The sample complexity of model-free algorithms tends to be high when using high-dimensional function approximators. To reduce the sample complexity and accelerate learning, how about using good exploratory behavior from trajectory optimization?
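A rough sketch of the idea (dynamics_model, reward_fn, and policy are hypothetical helpers, not the authors' code): sample real states from the experience container, roll a learned dynamics model forward under the current policy with exploration noise, and append the synthetic "imagined" transitions to the same buffer used for Q-learning.

```python
import torch


def imagination_rollouts(dynamics_model, policy, reward_fn, replay_buffer,
                         start_states, horizon=10, noise_std=0.1):
    """Generate synthetic transitions by rolling a learned model forward.

    dynamics_model(s, a) and reward_fn(s, a) stand in for a fitted model of the
    environment; replay_buffer is assumed to be a simple list-like container.
    """
    states = start_states
    for _ in range(horizon):
        with torch.no_grad():
            actions = policy(states)
            actions = actions + noise_std * torch.randn_like(actions)  # exploration noise
            next_states = dynamics_model(states, actions)  # model prediction, not the real env
            rewards = reward_fn(states, actions)
        for s, a, r, s2 in zip(states, actions, rewards, next_states):
            replay_buffer.append((s, a, r, s2, False))     # synthetic, never terminal here
        states = next_states
    return replay_buffer
```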


Experiment: Results
