Taehoon Kim (carpedm20)
Publications: EMNLP, DataCom, IJBDI
http://carpedm20.github.io
Deep Learning + Reinforcement Learning

Playing Atari with Deep Reinforcement Learning (2013)
https://deepmind.com/research/dqn/

Mastering the game of Go with deep neural networks and tree search (2016)
https://deepmind.com/research/alphago/

ViZDoom: A Doom-based AI Research Platform for Visual Reinforcement Learning (2016)
http://vizdoom.cs.put.edu.pl/
So where does Reinforcement Learning fit?

Machine Learning
- Supervised Learning: e.g. classification
- Unsupervised Learning: e.g. clustering
- Reinforcement Learning
Reinforcement Learning: an Agent interacting with an Environment.

At each step the agent observes the environment's state. In our case the state is the game screen encoded as a grid, where each value marks what occupies that cell:

0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 1 1 3 3 0 0 0
0 1 1 3 3 0 0 0
0 1 1 0 0 0 0 0
-1 -1 -1 -1 -1 -1 -1 -1

The agent picks an action (say Action = 2), the environment moves to a new state and hands back a reward (Reward = 1). Then the loop repeats: observe the new state, pick the next action (say Action = 0), receive the next reward (Reward = 1), and so on.
In short: the agent chooses an action, the action changes the environment's state, and the environment returns a reward. The goal of reinforcement learning: learn to choose the action a_t in state s_t that maximizes the rewards r_t collected over time.
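This loop is the whole interface. A minimal runnable sketch of it in Python (the environment and the reward rule here are made-up stand-ins, not the actual game):

import random

class ToyEnv:
    # Stand-in environment (not the real game): episode ends after 10 steps.
    def reset(self):
        self.t = 0
        return self.t                          # initial state s_0
    def step(self, action):
        self.t += 1
        reward = 1 if action == 2 else 0       # pretend action 2 collects a jelly
        return self.t, reward, self.t >= 10    # s_{t+1}, r_t, episode done

env = ToyEnv()
state = env.reset()
total_reward, done = 0, False
while not done:
    action = random.choice([0, 1, 2])          # agent picks a_t (here: at random)
    state, reward, done = env.step(action)     # environment returns s_{t+1}, r_t
    total_reward += reward                     # the quantity RL tries to maximize
print(total_reward)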
Deep Learning + Reinforcement Learning = ?

AlphaRun
Why is this game hard for an A.I.?

[Slides: a back-of-the-envelope count of game configurations: 30 ... (... 8), 30 ..., 9 ... (2 ...), 7 ..., 4 ..., multiplying out to 5,040 ... combinations (... 1 ... 6 ... = 14 ...). The point: far too many setups to tune or train one by one.]
AlphaRun, the A.I.
The 8 techniques behind the AI:
1. Deep Q-Network (2013)
2. Double Q-Learning (2015)
3. Dueling Network (2015)
4. Prioritized Experience Replay (2015)
5. Model-free Episodic Control (2016)
6. Asynchronous Advantage Actor-Critic (2016)
7. Human Checkpoint Replay (2016)
8. Gradient Descent with Restart (2016)
1. Deep Q-Network

State s, the screen as the agent sees it:

0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 1 1 3 3 0 -1 -1
0 1 1 3 3 -1 -1 -1
0 1 1 0 0 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1

Action = ? The agent must choose from its small set of discrete actions (doing nothing counts too: action 0). Which one is best here, and how would we know?
The Q-function

Q maps a state s and an action a to Q(s, a): the total future reward the agent can expect after taking a in s. Acting is then simple: in state s, compare the actions' Q-values, e.g.

Q(s, a1) = 5,  Q(s, a2) = 0,  Q(s, a3) = 10

and take the action with the largest one. The value itself is the discounted sum of the rewards that follow, e.g. +1 +1 +5 + ... down one path versus +0 +1 + ... or -1 +1 -1 + ... down others. Written recursively, this is the target Q-learning trains toward:

Q(s, a) = r + γ max_a' Q(s', a')

where γ is the discount factor and s' is the next state.
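A toy tabular version of this update makes the equation concrete (the states, actions, and transition below are invented for illustration; the real AlphaRun approximates Q with a neural network instead of a table):

import numpy as np

n_states, n_actions = 5, 3
Q = np.zeros((n_states, n_actions))    # Q-table: Q[s, a]
gamma, lr = 0.99, 0.1                  # discount factor, learning rate

def q_update(s, a, r, s_next):
    # One Q-learning step: move Q[s, a] toward r + gamma * max_a' Q[s', a'].
    target = r + gamma * Q[s_next].max()
    Q[s, a] += lr * (target - Q[s, a])

q_update(s=0, a=2, r=1, s_next=1)      # e.g., sliding collected a jelly (r = 1)
print(Q[0])                            # Q-values of state 0 after the update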
Problem #1: the max in the target systematically overestimates Q-values, and the overestimation compounds as training goes on.
2. Double Q-Learning

Look again at the DQN target:

y = r + γ max_a' Q(s', a'; θ⁻)
  = r + γ Q(s', argmax_a' Q(s', a'; θ⁻); θ⁻)

The same network θ⁻ (the target network) both selects the best next action and evaluates it. Whenever some Q(s', a') is overestimated, even by noise, the max picks exactly that action, so the error flows straight into the target and feeds on itself. Double DQN breaks the loop by letting the online network θ select the action while the target network θ⁻ evaluates it:

DQN:        y = r + γ Q(s', argmax_a' Q(s', a'; θ⁻); θ⁻)
Double DQN: y = r + γ Q(s', argmax_a' Q(s', a'; θ); θ⁻)
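The one-line difference is easiest to see in code. A sketch with numpy arrays standing in for the online and target networks' Q(s', .) outputs (all numbers invented):

import numpy as np

gamma, r = 0.99, 1.0
q_online = np.array([1.0, 5.0, 3.0])   # Q(s', .) from the online network (theta)
q_target = np.array([0.5, 2.0, 4.0])   # Q(s', .) from the target network (theta-)

# DQN: the target network both selects and evaluates the next action,
# so an overestimated value gets used twice.
y_dqn = r + gamma * q_target.max()

# Double DQN: the online network selects, the target network evaluates.
a_star = q_online.argmax()
y_double = r + gamma * q_target[a_star]

print(y_dqn, y_double)                 # 4.96 vs. 2.98 on these numbers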
Double Q-learning should bring the overestimated Q-values back down.

[Plots: estimated Q-values over training. Left: Deep Q-learning. Right: Double Q-learning]

It does: the Q-estimates come down, and the score reached 60!
Good enough? Not yet.

Problem #2: ... (land 1 ... land 3 ...). Some situations are simply worth more or less regardless of what the agent does, yet DQN must still estimate a separate absolute Q-value for every action.
3. Dueling Network

How much does the action actually matter? Compare two situations.

Here, jellies lie ahead and the agent collects +1 +1 +1 ... almost no matter what it does:

0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
3 3 3 3 0 0
3 3 3 3 0 0
0 0 0 0 0 0
-1 -1 -1 -1 -1 -1

Here, obstacles punish the wrong move: the right action earns +1 +1 +1 ..., the wrong one +0 -1 -1 ...:

0 0 0 0 0 0
0 0 0 -1 -1 0
0 0 0 -1 -1 0
-1 -1 0 -1 -1 0
-1 -1 0 -1 -1 0
-1 -1 0 -1 -1 0
-1 -1 -1 -1 -1 -1

A plain DQN must still pin an absolute number on every action: 10? 8? -2? 1? 5? 12? That is a hard question to answer directly.
Now ask two separate questions about the same screen:

0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 1 1 3 3 0 0 0
0 1 1 3 3 0 0 0
0 1 1 0 0 0 0 0
-1 -1 -1 -1 -1 -1 -1 -1

First: how good is this state by itself (10? 20?). Second: relative to doing nothing (0), how much does each action add (+1? +3?, -1? -2?, 0? 1?). Answering these two relative questions is much easier than pinning down the absolute Q-value of every action (14? 32?) in one shot. That is exactly the dueling decomposition:

Q(s, a) = V(s) + A(s, a)

V(s) is the Value of the state, A(s, a) is the Advantage of taking action a in it, and their sum recovers the Q-value.
Deep Q-Network: one stream maps the state straight to the outputs Q(s, a1), Q(s, a2), Q(s, a3).

Dueling Network: the network splits into two streams, one estimating V(s) and one estimating A(s, a), and combines them back into the Q-values.
Three ways to combine the two streams into Q:

Sum:     Q(s, a; θ, α, β) = V(s; θ, β) + A(s, a; θ, α)
Max:     Q(s, a; θ, α, β) = V(s; θ, β) + (A(s, a; θ, α) - max_a' A(s, a'; θ, α))
Average: Q(s, a; θ, α, β) = V(s; θ, β) + (A(s, a; θ, α) - (1/|A|) Σ_a' A(s, a'; θ, α))
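On toy numbers the three aggregators look like this (numpy stand-ins for the two streams; V and A are invented):

import numpy as np

V = 10.0                            # V(s): value of the state
A = np.array([1.0, 3.0, -2.0])      # A(s, a): advantage of each action

q_sum = V + A                       # Sum: V and A are not identifiable
q_max = V + (A - A.max())           # Max: best action's advantage forced to 0
q_avg = V + (A - A.mean())          # Average: advantages forced to zero mean

print(q_sum, q_max, q_avg)          # same argmax in all three cases

All three give the same greedy action; Max and Average just pin down how Q splits between V and A.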
3. Dueling network: results

[Plots: score over training. Left: DQN. Middle: Sum. Right: Max]

The score went from 60 to 100!!!
Recap, the 8 techniques:
1. Deep Q-Network (2013)
2. Double Q-Learning (2015)
3. Dueling Network (2015)
4. Prioritized Experience Replay (2015)
5. Model-free Episodic Control (2016)
6. Asynchronous Advantage Actor-Critic (2016)
7. Human Checkpoint Replay (2016)
8. Gradient Descent with Restart (2016)
Were the 8 techniques the whole story? Not quite. Four more things mattered in practice:
1. Hyperparameter tuning
2. Debugging
3. Pretrained model
4. Ensemble method
1. Hyperparameter tuning

Everything from the optimizer to the reward definition is a knob: 70+ hyperparameters in four groups, covering the Network, the Training method, the Experience memory, and the Environment.
Network:
1. activation_fn: activation function (relu, tanh, elu, leaky)
2. initializer: weight initializer (xavier)
3. regularizer: weight regularizer (None, l1, l2)
4. apply_reg: layers where the regularizer is applied
5. regularizer_scale: scale of the regularizer
6. hidden_dims: dimensions of the network
7. kernel_size: kernel size of the CNN network
8. stride_size: stride size of the CNN network
9. dueling: whether to use a dueling network
10. double_q: whether to use double Q-learning
11. use_batch_norm: whether to use batch normalization
Training method:
1. optimizer: type of optimizer (adam, adagrad, rmsprop)
2. batch_norm: whether to use batch normalization
3. max_step: maximum number of training steps
4. target_q_update_step: # of steps between target network updates
5. learn_start_step: # of steps before training begins
6. learning_rate_start: the maximum value of the learning rate
7. learning_rate_end: the minimum value of the learning rate
8. clip_grad: value for a max gradient
9. clip_delta: value for a max delta
10. clip_reward: value for a max reward
11. learning_rate_restart: whether to use learning rate restarts
Experience memory:
1. history_length: the length of history
2. memory_size: size of the experience memory
3. priority: whether to use prioritized experience memory
4. preload_memory: whether to preload a saved memory
5. preload_batch_size: batch size of the preloaded memory
6. preload_prob: probability of sampling from the preloaded memory
7. alpha: alpha of prioritized experience memory
8. beta_start: initial beta of prioritized experience memory
9. beta_end_t: final beta of prioritized experience memory
Environment:
1. screen_height: # of rows for a grid state
2. screen_height_to_use: actual # of rows for a state
3. screen_width: # of columns for a grid state
4. screen_width_to_use: actual # of columns for a state
5. screen_expr: expression that computes the state
   (c: cookie, p: platform, s: score of cells, p: plain jelly,
    i: plain item, h: heal, m: magic, hi: height, sp: speed, rj: remained jump)
   ex) (c+2*p)-s, i+h+m, [rj,sp], ((c+2*p)-s)/2., i+h+m, [rj,sp,hi]
6. action_repeat: # of repeats of an action (frame skip)
With this many knobs, tuning by hand does not scale. Automate the sweep!
# Queue a grid of experiments: one option dict per (land, cookie) pair;
# list-valued fields are the axes to search over.
options = []
for land in range(3, 7):
    for cookie in cookies:
        option = {
            'hidden_dims': ["[800, 400, 200]", "[1000, 800, 200]"],
            'double_q': [True, False],
            'start_land': land,
            'cookie': cookie,
        }
        options.append(option.copy())
array_run(options, 'double_q_test')
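array_run itself is not shown in the deck; presumably it expands every list-valued field into individual runs. A hypothetical sketch of that expansion (the helper name, its behavior, and the sample values are assumptions):

from itertools import product

def expand(option):
    # Turn list-valued fields into one flat config per combination
    # (hypothetical helper; the deck's array_run is not shown).
    keys = [k for k, v in option.items() if isinstance(v, list)]
    for values in product(*(option[k] for k in keys)):
        config = dict(option)
        config.update(zip(keys, values))
        yield config

option = {'hidden_dims': ["[800, 400, 200]", "[1000, 800, 200]"],
          'double_q': [True, False], 'start_land': 3, 'cookie': 'cookie_a'}
print(len(list(expand(option))))    # 2 x 2 = 4 runs for this one option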
2. Debugging

When training misbehaves, visualize everything: the state history the agent sees, the Q-values, and, for the Dueling Network, V(s) and A(s, a) separately.

One real bug: the Q-value of every action stayed at 0 whenever the state was 0. The cause: TensorFlow's fully_connected applies an activation function by default (nn.relu, see tensorflow/contrib/layers/python/layers/layers.py). On the output layer, the activation function must be None!
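The fix in code (TF 1.x contrib API, as used at the time; the state shape and layer sizes are illustrative):

import tensorflow as tf
from tensorflow.contrib.layers import fully_connected

num_actions = 3
state = tf.placeholder(tf.float32, [None, 84])   # illustrative state input

# Hidden layer: the default activation_fn (tf.nn.relu) is fine here.
hidden = fully_connected(state, 256)

# Output layer: forgetting activation_fn=None leaves the default relu in
# place and silently clips every negative Q-value to 0, the bug above.
q_values = fully_connected(hidden, num_actions, activation_fn=None)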
Also inspect the Q-values for hand-picked states (ex) ...), watch what exploration is doing, and double-check that the reward you wrote is the reward you meant.
3. Pretrained model

Don't start from scratch every time. Reuse the weights of an already-trained model as initialization and fine-tune from there.

Max: 3,586,503 vs. Max: 3,199,324
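A sketch of the mechanic in TF 1.x (the variable and the checkpoint path are made up; the real network has many variables):

import tensorflow as tf

# Rebuild the same graph as the pretrained model, then restore its weights
# instead of starting from random initialization.
w = tf.get_variable('w', shape=[84, 256])
saver = tf.train.Saver()
with tf.Session() as sess:
    saver.restore(sess, 'checkpoints/pretrained.ckpt')  # hypothetical path
    # ...continue training on the new cookie/land from these weights...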
4. Ensemble methods

Combine the players from Experiment #1, #2, #3, #4 into a single stronger one. The obstacle: each trained model is its own set of weights (ex) ...), so several sets of weights have to be loaded side by side in one process.
TensorFlow detail: a Session executes a Graph, and every network you define adds nodes to some graph. A Session is bound to exactly one Graph, so to run several independently trained networks at once, give each network its own Graph with its own Session.
So instead of building the network in the default graph with a single session:

with tf.Session() as sess:
    network = Network()
    sess.run(network.output)

build one Graph and one Session per model:

sessions = []
g = tf.Graph()
with g.as_default():
    network = Network()
    sess = tf.Session(graph=g)
    sessions.append(sess)
sessions[0].run(network.output)
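With one (Graph, Session) pair per model, the ensemble itself is just a loop over sessions. A sketch, assuming each Network exposes a state placeholder and an output tensor of Q-values (both attribute names are assumptions, not the deck's actual code):

import numpy as np

def ensemble_act(sessions, networks, state):
    # Query every model for its Q-values, then act on the mean.
    qs = [sess.run(net.output, {net.state: state})
          for sess, net in zip(sessions, networks)]
    return int(np.mean(qs, axis=0).argmax())  # could also majority-vote argmaxes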
One Session per Graph, and the ensemble runs!

[Closing slides: the finished A.I. (... 360 ...) and closing remarks.]