Taehoon Kim (carpedm20)
Publications: EMNLP, DataCom, IJBDI
http://carpedm20.github.io
Deep Learning + Reinforcement Learning

Playing Atari with Deep Reinforcement Learning (2013)
https://deepmind.com/research/dqn/

Mastering the game of Go with deep neural networks and tree search (2016)
https://deepmind.com/research/alphago/

ViZDoom: A Doom-based AI Research Platform for Visual Reinforcement Learning (2016)
http://vizdoom.cs.put.edu.pl/
So where does Reinforcement Learning fit?

Machine Learning
- Supervised Learning: e.g. classification
- Unsupervised Learning: e.g. clustering
- Reinforcement Learning
Reinforcement Learning: an Agent interacting with an Environment.

At each step the agent observes the environment's state. In our case the state is the game screen encoded as a grid, where each value marks what occupies that cell:

0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 1 1 3 3 0 0 0
0 1 1 3 3 0 0 0
0 1 1 0 0 0 0 0
-1 -1 -1 -1 -1 -1 -1 -1

The agent picks an action (say Action = 2), the environment moves to a new state and hands back a reward (Reward = 1). Then the loop repeats: observe the new state, pick the next action (say Action = 0), receive the next reward (Reward = 1), and so on.
In short: the agent chooses an action, the action changes the environment's state, and the environment returns a reward. The goal of reinforcement learning: learn to choose the action a_t in state s_t that maximizes the rewards r_t collected over time.
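This loop is the whole interface. A minimal runnable sketch of it in Python (the environment and the reward rule here are made-up stand-ins, not the actual game):

import random

class ToyEnv:
    # Stand-in environment (not the real game): episode ends after 10 steps.
    def reset(self):
        self.t = 0
        return self.t                          # initial state s_0
    def step(self, action):
        self.t += 1
        reward = 1 if action == 2 else 0       # pretend action 2 collects a jelly
        return self.t, reward, self.t >= 10    # s_{t+1}, r_t, episode done

env = ToyEnv()
state = env.reset()
total_reward, done = 0, False
while not done:
    action = random.choice([0, 1, 2])          # agent picks a_t (here: at random)
    state, reward, done = env.step(action)     # environment returns s_{t+1}, r_t
    total_reward += reward                     # the quantity RL tries to maximize
print(total_reward)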
Deep Learning + Reinforcement Learning = ?

AlphaRun
Why is this game hard for an A.I.?

[Slides: a back-of-the-envelope count of game configurations: 30 ... (... 8), 30 ..., 9 ... (2 ...), 7 ..., 4 ..., multiplying out to 5,040 ... combinations (... 1 ... 6 ... = 14 ...). The point: far too many setups to tune or train one by one.]
AlphaRun, the A.I.
The 8 techniques behind the AI:
1. Deep Q-Network (2013)
2. Double Q-Learning (2015)
3. Dueling Network (2015)
4. Prioritized Experience Replay (2015)
5. Model-free Episodic Control (2016)
6. Asynchronous Advantage Actor-Critic (2016)
7. Human Checkpoint Replay (2016)
8. Gradient Descent with Restart (2016)
1. Deep Q-Network

State s, the screen as the agent sees it:

0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 1 1 3 3 0 -1 -1
0 1 1 3 3 -1 -1 -1
0 1 1 0 0 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1

Action = ? The agent must choose from its small set of discrete actions (doing nothing counts too: action 0). Which one is best here, and how would we know?
The Q-function

Q maps a state s and an action a to Q(s, a): the total future reward the agent can expect after taking a in s. Acting is then simple: in state s, compare the actions' Q-values, e.g.

Q(s, a1) = 5,  Q(s, a2) = 0,  Q(s, a3) = 10

and take the action with the largest one. The value itself is the discounted sum of the rewards that follow, e.g. +1 +1 +5 + ... down one path versus +0 +1 + ... or -1 +1 -1 + ... down others. Written recursively, this is the target Q-learning trains toward:

Q(s, a) = r + γ max_a' Q(s', a')

where γ is the discount factor and s' is the next state.
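A toy tabular version of this update makes the equation concrete (the states, actions, and transition below are invented for illustration; the real AlphaRun approximates Q with a neural network instead of a table):

import numpy as np

n_states, n_actions = 5, 3
Q = np.zeros((n_states, n_actions))    # Q-table: Q[s, a]
gamma, lr = 0.99, 0.1                  # discount factor, learning rate

def q_update(s, a, r, s_next):
    # One Q-learning step: move Q[s, a] toward r + gamma * max_a' Q[s', a'].
    target = r + gamma * Q[s_next].max()
    Q[s, a] += lr * (target - Q[s, a])

q_update(s=0, a=2, r=1, s_next=1)      # e.g., sliding collected a jelly (r = 1)
print(Q[0])                            # Q-values of state 0 after the update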
Problem #1: the max in the target systematically overestimates Q-values, and the overestimation compounds as training goes on.
2. Double Q-Learning

Look again at the DQN target:

y = r + γ max_a' Q(s', a'; θ⁻)
  = r + γ Q(s', argmax_a' Q(s', a'; θ⁻); θ⁻)

The same network θ⁻ (the target network) both selects the best next action and evaluates it. Whenever some Q(s', a') is overestimated, even by noise, the max picks exactly that action, so the error flows straight into the target and feeds on itself. Double DQN breaks the loop by letting the online network θ select the action while the target network θ⁻ evaluates it:

DQN:        y = r + γ Q(s', argmax_a' Q(s', a'; θ⁻); θ⁻)
Double DQN: y = r + γ Q(s', argmax_a' Q(s', a'; θ); θ⁻)
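The one-line difference is easiest to see in code. A sketch with numpy arrays standing in for the online and target networks' Q(s', .) outputs (all numbers invented):

import numpy as np

gamma, r = 0.99, 1.0
q_online = np.array([1.0, 5.0, 3.0])   # Q(s', .) from the online network (theta)
q_target = np.array([0.5, 2.0, 4.0])   # Q(s', .) from the target network (theta-)

# DQN: the target network both selects and evaluates the next action,
# so an overestimated value gets used twice.
y_dqn = r + gamma * q_target.max()

# Double DQN: the online network selects, the target network evaluates.
a_star = q_online.argmax()
y_double = r + gamma * q_target[a_star]

print(y_dqn, y_double)                 # 4.96 vs. 2.98 on these numbers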
Double Q-learning should bring the overestimated Q-values back down.

[Plots: estimated Q-values over training. Left: Deep Q-learning. Right: Double Q-learning]

It does: the Q-estimates come down, and the score reached 60!
Good enough? Not yet.

Problem #2: ... (land 1 ... land 3 ...). Some situations are simply worth more or less regardless of what the agent does, yet DQN must still estimate a separate absolute Q-value for every action.
3. Dueling Network

How much does the action actually matter? Compare two situations.

Here, jellies lie ahead and the agent collects +1 +1 +1 ... almost no matter what it does:

0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
3 3 3 3 0 0
3 3 3 3 0 0
0 0 0 0 0 0
-1 -1 -1 -1 -1 -1

Here, obstacles punish the wrong move: the right action earns +1 +1 +1 ..., the wrong one +0 -1 -1 ...:

0 0 0 0 0 0
0 0 0 -1 -1 0
0 0 0 -1 -1 0
-1 -1 0 -1 -1 0
-1 -1 0 -1 -1 0
-1 -1 0 -1 -1 0
-1 -1 -1 -1 -1 -1

A plain DQN must still pin an absolute number on every action: 10? 8? -2? 1? 5? 12? That is a hard question to answer directly.
Now ask two separate questions about the same screen:

0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 1 1 3 3 0 0 0
0 1 1 3 3 0 0 0
0 1 1 0 0 0 0 0
-1 -1 -1 -1 -1 -1 -1 -1

First: how good is this state by itself (10? 20?). Second: relative to doing nothing (0), how much does each action add (+1? +3?, -1? -2?, 0? 1?). Answering these two relative questions is much easier than pinning down the absolute Q-value of every action (14? 32?) in one shot. That is exactly the dueling decomposition:

Q(s, a) = V(s) + A(s, a)

V(s) is the Value of the state, A(s, a) is the Advantage of taking action a in it, and their sum recovers the Q-value.
Deep Q-Network: one stream maps the state straight to the outputs Q(s, a1), Q(s, a2), Q(s, a3).

Dueling Network: the network splits into two streams, one estimating V(s) and one estimating A(s, a), and combines them back into the Q-values.
Three ways to combine the two streams into Q:

Sum:     Q(s, a; θ, α, β) = V(s; θ, β) + A(s, a; θ, α)
Max:     Q(s, a; θ, α, β) = V(s; θ, β) + (A(s, a; θ, α) - max_a' A(s, a'; θ, α))
Average: Q(s, a; θ, α, β) = V(s; θ, β) + (A(s, a; θ, α) - (1/|A|) Σ_a' A(s, a'; θ, α))
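On toy numbers the three aggregators look like this (numpy stand-ins for the two streams; V and A are invented):

import numpy as np

V = 10.0                            # V(s): value of the state
A = np.array([1.0, 3.0, -2.0])      # A(s, a): advantage of each action

q_sum = V + A                       # Sum: V and A are not identifiable
q_max = V + (A - A.max())           # Max: best action's advantage forced to 0
q_avg = V + (A - A.mean())          # Average: advantages forced to zero mean

print(q_sum, q_max, q_avg)          # same argmax in all three cases

All three give the same greedy action; Max and Average just pin down how Q splits between V and A.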
3. Dueling network: results

[Plots: score over training. Left: DQN. Middle: Sum. Right: Max]

The score went from 60 to 100!!!
Recap, the 8 techniques:
1. Deep Q-Network (2013)
2. Double Q-Learning (2015)
3. Dueling Network (2015)
4. Prioritized Experience Replay (2015)
5. Model-free Episodic Control (2016)
6. Asynchronous Advantage Actor-Critic (2016)
7. Human Checkpoint Replay (2016)
8. Gradient Descent with Restart (2016)
Were the 8 techniques the whole story? Not quite. Four more things mattered in practice:
1. Hyperparameter tuning
2. Debugging
3. Pretrained model
4. Ensemble method
1. Hyperparameter tuning

Everything from the optimizer to the reward definition is a knob: 70+ hyperparameters in four groups, covering the Network, the Training method, the Experience memory, and the Environment.
Network:
1. activation_fn: activation function (relu, tanh, elu, leaky)
2. initializer: weight initializer (xavier)
3. regularizer: weight regularizer (None, l1, l2)
4. apply_reg: layers where the regularizer is applied
5. regularizer_scale: scale of the regularizer
6. hidden_dims: dimensions of the network
7. kernel_size: kernel size of the CNN network
8. stride_size: stride size of the CNN network
9. dueling: whether to use a dueling network
10. double_q: whether to use double Q-learning
11. use_batch_norm: whether to use batch normalization
Training method:
1. optimizer: type of optimizer (adam, adagrad, rmsprop)
2. batch_norm: whether to use batch normalization
3. max_step: maximum number of training steps
4. target_q_update_step: # of steps between target network updates
5. learn_start_step: # of steps before training begins
6. learning_rate_start: the maximum value of the learning rate
7. learning_rate_end: the minimum value of the learning rate
8. clip_grad: value for a max gradient
9. clip_delta: value for a max delta
10. clip_reward: value for a max reward
11. learning_rate_restart: whether to use learning rate restarts
Experience memory:
1. history_length: the length of history
2. memory_size: size of the experience memory
3. priority: whether to use prioritized experience memory
4. preload_memory: whether to preload a saved memory
5. preload_batch_size: batch size of the preloaded memory
6. preload_prob: probability of sampling from the preloaded memory
7. alpha: alpha of prioritized experience memory
8. beta_start: initial beta of prioritized experience memory
9. beta_end_t: final beta of prioritized experience memory
Environment:
1. screen_height: # of rows for a grid state
2. screen_height_to_use: actual # of rows for a state
3. screen_width: # of columns for a grid state
4. screen_width_to_use: actual # of columns for a state
5. screen_expr: expression that computes the state
   (c: cookie, p: platform, s: score of cells, p: plain jelly,
    i: plain item, h: heal, m: magic, hi: height, sp: speed, rj: remained jump)
   ex) (c+2*p)-s, i+h+m, [rj,sp], ((c+2*p)-s)/2., i+h+m, [rj,sp,hi]
6. action_repeat: # of repeats of an action (frame skip)
With this many knobs, tuning by hand does not scale. Automate the sweep!
# Queue a grid of experiments: one option dict per (land, cookie) pair;
# list-valued fields are the axes to search over.
options = []
for land in range(3, 7):
    for cookie in cookies:
        option = {
            'hidden_dims': ["[800, 400, 200]", "[1000, 800, 200]"],
            'double_q': [True, False],
            'start_land': land,
            'cookie': cookie,
        }
        options.append(option.copy())
array_run(options, 'double_q_test')
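array_run itself is not shown in the deck; presumably it expands every list-valued field into individual runs. A hypothetical sketch of that expansion (the helper name, its behavior, and the sample values are assumptions):

from itertools import product

def expand(option):
    # Turn list-valued fields into one flat config per combination
    # (hypothetical helper; the deck's array_run is not shown).
    keys = [k for k, v in option.items() if isinstance(v, list)]
    for values in product(*(option[k] for k in keys)):
        config = dict(option)
        config.update(zip(keys, values))
        yield config

option = {'hidden_dims': ["[800, 400, 200]", "[1000, 800, 200]"],
          'double_q': [True, False], 'start_land': 3, 'cookie': 'cookie_a'}
print(len(list(expand(option))))    # 2 x 2 = 4 runs for this one option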
2. Debugging

When training misbehaves, visualize everything: the state history the agent sees, the Q-values, and, for the Dueling Network, V(s) and A(s, a) separately.

One real bug: the Q-value of every action stayed at 0 whenever the state was 0. The cause: TensorFlow's fully_connected applies an activation function by default (nn.relu, see tensorflow/contrib/layers/python/layers/layers.py). On the output layer, the activation function must be None!
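The fix in code (TF 1.x contrib API, as used at the time; the state shape and layer sizes are illustrative):

import tensorflow as tf
from tensorflow.contrib.layers import fully_connected

num_actions = 3
state = tf.placeholder(tf.float32, [None, 84])   # illustrative state input

# Hidden layer: the default activation_fn (tf.nn.relu) is fine here.
hidden = fully_connected(state, 256)

# Output layer: forgetting activation_fn=None leaves the default relu in
# place and silently clips every negative Q-value to 0, the bug above.
q_values = fully_connected(hidden, num_actions, activation_fn=None)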
Also inspect the Q-values for hand-picked states (ex) ...), watch what exploration is doing, and double-check that the reward you wrote is the reward you meant.
3. Pretrained model

Don't start from scratch every time. Reuse the weights of an already-trained model as initialization and fine-tune from there.

Max: 3,586,503 vs. Max: 3,199,324
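A sketch of the mechanic in TF 1.x (the variable and the checkpoint path are made up; the real network has many variables):

import tensorflow as tf

# Rebuild the same graph as the pretrained model, then restore its weights
# instead of starting from random initialization.
w = tf.get_variable('w', shape=[84, 256])
saver = tf.train.Saver()
with tf.Session() as sess:
    saver.restore(sess, 'checkpoints/pretrained.ckpt')  # hypothetical path
    # ...continue training on the new cookie/land from these weights...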
4. Ensemble methods

Combine the players from Experiment #1, #2, #3, #4 into a single stronger one. The obstacle: each trained model is its own set of weights (ex) ...), so several sets of weights have to be loaded side by side in one process.
TensorFlow detail: a Session executes a Graph, and every network you define adds nodes to some graph. A Session is bound to exactly one Graph, so to run several independently trained networks at once, give each network its own Graph with its own Session.
So instead of building the network in the default graph with a single session:

with tf.Session() as sess:
    network = Network()
    sess.run(network.output)

build one Graph and one Session per model:

sessions = []
g = tf.Graph()
with g.as_default():
    network = Network()
    sess = tf.Session(graph=g)
    sessions.append(sess)
sessions[0].run(network.output)
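With one (Graph, Session) pair per model, the ensemble itself is just a loop over sessions. A sketch, assuming each Network exposes a state placeholder and an output tensor of Q-values (both attribute names are assumptions, not the deck's actual code):

import numpy as np

def ensemble_act(sessions, networks, state):
    # Query every model for its Q-values, then act on the mean.
    qs = [sess.run(net.output, {net.state: state})
          for sess, net in zip(sessions, networks)]
    return int(np.mean(qs, axis=0).argmax())  # could also majority-vote argmaxes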
One Session per Graph, and the ensemble runs!

[Closing slides: the finished A.I. (... 360 ...) and closing remarks.]