Image sources:
https://newatlas.com/bae-smartskin/33458/
https://www.nasa.gov/ames/feature/go-go-green-wing-mighty-morphing-materials-in-aircraft-design
http://www.biologyreference.com/Mo-Nu/Neuron.html
[Figure: a perceptron with inputs I1 and I2, bias B, weights w1, w2, w3, and output O]

f(x_i, w_i) = Φ(b + Σ_i w_i · x_i)

Φ(x) = 1 if x ≥ 0.5; 0 if x < 0.5
Truth table for P ∧ Q:

P | Q | P ∧ Q
--+---+------
T | T |   T
T | F |   F
F | T |   F
F | F |   F
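A perceptron with the threshold unit above can implement this table. A minimal sketch in plain Python (the weights and bias are illustrative choices, not from the slides):

# Perceptron from the slide: f(x, w) = Φ(b + Σ_i w_i·x_i), step threshold at 0.5.
def phi(x):
    return 1 if x >= 0.5 else 0

def perceptron(xs, ws, b=0.0):
    return phi(b + sum(w * x for w, x in zip(ws, xs)))

# With w1 = w2 = 0.3 and b = 0, only P = Q = 1 clears the 0.5 threshold,
# so the unit reproduces the P ∧ Q truth table.
for p in (1, 0):
    for q in (1, 0):
        print(p, q, perceptron([p, q], [0.3, 0.3]))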
[Figure: the agent–environment loop. At each time step t the agent observes state S_t and reward R_t, takes action A_t, and the environment responds with S_{t+1} and R_{t+1}. (Sutton and Barto)]
A state S_t is Markov if and only if:
P[S_{t+1} | S_t] = P[S_{t+1} | S_1, …, S_t]
• The return G_t is the total discounted reward from time step t, with discount factor γ ∈ [0, 1]:

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = Σ_{k=0}^∞ γ^k R_{t+k+1}
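As a quick check, the return is a one-line computation over a finite reward sequence (the numbers below are illustrative):

# G_t = Σ_k γ^k · R_{t+k+1} for a finite episode.
def discounted_return(rewards, gamma):
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(discounted_return([1, 0, 2], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62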
The state-value function of a policy π:

v_π(s) = 𝔼_π[G_t | S_t = s] = 𝔼_π[R_{t+1} + γ v_π(S_{t+1}) | S_t = s]
The action-value function of a policy π:

q_π(s, a) = 𝔼_π[R_{t+1} + γ q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]
          = ℛ_s^a + γ Σ_{s′∈S} 𝒫_{ss′}^a v_π(s′)
[Backup diagram: the root state s, with value v_π(s), leads via action a to the pair (s, a), with value q_π(s, a); a reward r and a transition to successor state s′, with value v_π(s′), follow.]
[Example 1: from state s, outcomes of value 10, 5, and −3 occur with probabilities P = .5, P = .25, and P = .25]

v(s) = 10 × .5 + 5 × .25 + (−3) × .25 = 5.5

[Example 2: from state s, with P = .5 the agent receives R = 5; with P = .5 it continues to successor states of value 2 (P = .4), 5 (P = .5), and 4.4 (P = .1)]

v(s) = .5 × 5 + .5 × [.4 × 2 + .5 × 5 + .1 × 4.4] = 4.4
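Both are plain one-step expectations and easy to verify (an illustrative check):

# Expected-value backups from the two examples above.
print(10 * .5 + 5 * .25 + -3 * .25)                # 5.5
print(.5 * 5 + .5 * (.4 * 2 + .5 * 5 + .1 * 4.4))  # 4.37, rounded to 4.4 on the slide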
[Backup diagrams for the optimal value functions: at the root s the agent takes the max over actions a; after each pair (s, a) the environment samples reward r and successor s′ with probability p, and the max is taken again over the next actions a′]

[Example: from state s, three actions with rewards R = −1, R = 2, and R = 3 lead to states of value 10, 5, and −3]

v*(s) = max{−1 + 10, +2 + 5, +3 − 3} = 9
A policy π is at least as good as π′ if v_π(s) ≥ v_π′(s) ∀ s ∈ S.
The optimal value function is the best achievable value in every state: v*(s) ≡ max_π v_π(s) ∀ s ∈ S.
Gridworld example: a 4×4 grid whose two opposite corner cells (T) are terminal; the remaining cells are states 1–14.

  T   1   2   3
  4   5   6   7
  8   9  10  11
 12  13  14   T

Every transition yields r_t = −1, and the agent follows the random policy:
π(→|·) = π(↑|·) = π(↓|·) = π(←|·) = .25
Iterative policy evaluation, k = 0 (all state values initialized to zero, including the terminal corners):

  0.00  0.00  0.00  0.00
  0.00  0.00  0.00  0.00
  0.00  0.00  0.00  0.00
  0.00  0.00  0.00  0.00
π: Random Policy, π(→|·) = π(↑|·) = π(↓|·) = π(←|·) = .25, r_t = −1

Backup of state 1 (k = 0 → k = 1):

v_{k=1}(1) = .25 × [−1 + v_{k=0}(2)]   (→)
           + .25 × [−1 + v_{k=0}(1)]   (↑)
           + .25 × [−1 + v_{k=0}(5)]   (↓)
           + .25 × [−1 + v_{k=0}(T)]   (←)
           = −.25 − .25 − .25 − .25 = −1

k = 1:

  0.00 −1.00 −1.00 −1.00
 −1.00 −1.00 −1.00 −1.00
 −1.00 −1.00 −1.00 −1.00
 −1.00 −1.00 −1.00  0.00
Backup of state 7 (k = 0 → k = 1; moving → off the grid leaves the agent in state 7):

v_{k=1}(7) = .25 × [−1 + v_{k=0}(7)]   (→)
           + .25 × [−1 + v_{k=0}(3)]   (↑)
           + .25 × [−1 + v_{k=0}(11)]  (↓)
           + .25 × [−1 + v_{k=0}(6)]   (←)
           = −.25 − .25 − .25 − .25 = −1
Backup of state 1 (k = 1 → k = 2):

v_{k=2}(1) = .25 × [−1 + v_{k=1}(2)]   (→)
           + .25 × [−1 + v_{k=1}(1)]   (↑)
           + .25 × [−1 + v_{k=1}(5)]   (↓)
           + .25 × [−1 + v_{k=1}(T)]   (←)
           = .25 × (−2 − 2 − 2 − 1) = −1.75

k = 2 (the same backup applied to every state; each state adjacent to a terminal gets −1.75):

  0.00 −1.75 −2.00 −2.00
 −1.75 −2.00 −2.00 −2.00
 −2.00 −2.00 −2.00 −1.75
 −2.00 −2.00 −1.75  0.00
Backup of state 7 (k = 1 → k = 2; all four k = 1 neighbor values are −1):

v_{k=2}(7) = .25 × [−1 + v_{k=1}(7)]   (→)
           + .25 × [−1 + v_{k=1}(3)]   (↑)
           + .25 × [−1 + v_{k=1}(11)]  (↓)
           + .25 × [−1 + v_{k=1}(6)]   (←)
           = .25 × (−2 − 2 − 2 − 2) = −2
Backup of state 1 (k = 2 → k = 3):

v_{k=3}(1) = .25 × [−1 + v_{k=2}(2)]   (→)
           + .25 × [−1 + v_{k=2}(1)]   (↑)
           + .25 × [−1 + v_{k=2}(5)]   (↓)
           + .25 × [−1 + v_{k=2}(T)]   (←)
           = .25 × (−3 − 2.75 − 3 − 1) = −2.43

k = 3:

  0.00 −2.43 −2.93 −3.00
 −2.43 −2.93 −3.00 −2.93
 −2.93 −3.00 −2.93 −2.43
 −3.00 −2.93 −2.43  0.00
Backup of state 7 (k = 2 → k = 3):

v_{k=3}(7) = .25 × [−1 + v_{k=2}(7)]   (→)
           + .25 × [−1 + v_{k=2}(3)]   (↑)
           + .25 × [−1 + v_{k=2}(11)]  (↓)
           + .25 × [−1 + v_{k=2}(6)]   (←)
           = .25 × (−3 − 3 − 2.75 − 3) = −2.93
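These sweeps are mechanical and easy to reproduce. A minimal sketch in plain Python (the cell numbering 0–15 and the helper names are mine, not from the slides):

# Iterative policy evaluation on the 4x4 gridworld: random policy, r = -1,
# cells 0 and 15 terminal, moves off the grid leave the state unchanged.
TERMINALS = {0, 15}

def step(s, move):
    r, c = divmod(s, 4)
    dr, dc = {'R': (0, 1), 'L': (0, -1), 'U': (-1, 0), 'D': (1, 0)}[move]
    nr, nc = r + dr, c + dc
    return nr * 4 + nc if 0 <= nr < 4 and 0 <= nc < 4 else s

def sweep(v):
    new_v = list(v)
    for s in range(16):
        if s not in TERMINALS:
            new_v[s] = sum(.25 * (-1 + v[step(s, m)]) for m in 'RLUD')
    return new_v

v = [0.0] * 16
for k in range(1, 4):
    v = sweep(v)
    print('k =', k, [round(x, 2) for x in v])
# Note: the exact k = 3 values are -2.44 and -2.94; the slides truncate to -2.43/-2.93.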
Policy iteration alternates two steps:
• Evaluation: π → v_π (estimate the value of the current policy)
• Improvement: π → greedy(v_π) (make the policy greedy with respect to those values)
Repeating the cycle converges to the optimal policy and value function, π* and V*.
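Continuing the sketch above (same step helper, TERMINALS set, and value list v), the improvement step simply acts greedily with respect to the current estimate:

# Greedy improvement: in each state pick the action maximizing -1 + v[s'].
def greedy(v):
    return {s: max('RLUD', key=lambda m: -1 + v[step(s, m)])
            for s in range(16) if s not in TERMINALS}

print(greedy(v))  # for this small gridworld the greedy policy stabilizes after a few sweeps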
https://github.com/rlcode/reinforcement-learning
• Suitable for medium-sized problems of up to a few million states.
[Figure: backup diagrams for n-step temporal-difference targets, labeled TD(1), TD(2), …]
[Backup diagram: from the pair (s, a), observe reward r and successor state s′, then take the max over the next actions]

Q-learning update:

Q(S, A) ← Q(S, A) + α (R + γ max_{a′} Q(S′, a′) − Q(S, A))
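A minimal tabular Q-learning sketch of this update (illustrative; it assumes a Gym-style environment with the older reset()/step() API and discrete states and actions):

import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    q = defaultdict(float)  # (state, action) -> value, 0 by default
    actions = list(range(env.action_space.n))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behavior policy
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: q[(s, x)])
            s2, r, done, _ = env.step(a)
            # Q(S,A) <- Q(S,A) + alpha * (R + gamma * max_a' Q(S',a') - Q(S,A))
            target = r + gamma * max(q[(s2, x)] for x in actions)
            q[(s, a)] += alpha * (target - q[(s, a)])
            s = s2
    return q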
Rocket Lander Demo: https://github.com/dbatalov/reinforcement-learning
Grid World Demo: https://github.com/rlcode/reinforcement-learning
Check this link for a proof of the theorem:
https://en.wikipedia.org/wiki/Universal_approximation_theorem
(Figure: David Silver)
https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf
• The DQN agent achieves >75% of the human score in 29 out of 49 games
• The DQN agent beats the human score (>100%) in 22 games

Score% = (Agent Score − Random Play Score) / (Human Score − Random Play Score) × 100
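The normalization is a one-liner (with made-up numbers for illustration):

# Human-normalized score, as defined above.
def score_pct(agent, human, random_play):
    return (agent - random_play) / (human - random_play) * 100

print(score_pct(agent=400, human=500, random_play=100))  # 75.0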
https://github.com/apache/incubator-mxnet/tree/master/example/reinforcement-learning/dqn
import mxnet as mx

def dqn_sym_nature(action_num, data=None, name='dqn'):
    """Structure of the Deep Q Network in the Nature 2015 paper,
    "Human-level control through deep reinforcement learning"
    (http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html)."""
    if data is None:
        net = mx.symbol.Variable('data')
    else:
        net = data
    net = mx.symbol.Convolution(data=net, name='conv1', kernel=(8, 8), stride=(4, 4), num_filter=32)
    net = mx.symbol.Activation(data=net, name='relu1', act_type="relu")
    net = mx.symbol.Convolution(data=net, name='conv2', kernel=(4, 4), stride=(2, 2), num_filter=64)
    net = mx.symbol.Activation(data=net, name='relu2', act_type="relu")
    net = mx.symbol.Convolution(data=net, name='conv3', kernel=(3, 3), stride=(1, 1), num_filter=64)
    net = mx.symbol.Activation(data=net, name='relu3', act_type="relu")
    net = mx.symbol.Flatten(data=net)
    net = mx.symbol.FullyConnected(data=net, name='fc4', num_hidden=512)
    net = mx.symbol.Activation(data=net, name='relu4', act_type="relu")
    net = mx.symbol.FullyConnected(data=net, name='fc5', num_hidden=action_num)
    net = mx.symbol.Custom(data=net, name=name, op_type='DQNOutput')
    return net
from mxnet import gluon

# num_action: number of discrete actions, assumed to be defined earlier.
DQN = gluon.nn.Sequential()
with DQN.name_scope():
    # first layer
    DQN.add(gluon.nn.Conv2D(channels=32, kernel_size=8, strides=4, padding=0))
    DQN.add(gluon.nn.BatchNorm(axis=1, momentum=0.1, center=True))
    DQN.add(gluon.nn.Activation('relu'))
    # second layer
    DQN.add(gluon.nn.Conv2D(channels=64, kernel_size=4, strides=2))
    DQN.add(gluon.nn.BatchNorm(axis=1, momentum=0.1, center=True))
    DQN.add(gluon.nn.Activation('relu'))
    # third layer
    DQN.add(gluon.nn.Conv2D(channels=64, kernel_size=3, strides=1))
    DQN.add(gluon.nn.BatchNorm(axis=1, momentum=0.1, center=True))
    DQN.add(gluon.nn.Activation('relu'))
    DQN.add(gluon.nn.Flatten())
    # fourth layer
    DQN.add(gluon.nn.Dense(512, activation='relu'))
    # fifth layer: linear output, one Q-value per action
    # (no ReLU here, since Q-values can be negative)
    DQN.add(gluon.nn.Dense(num_action))
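To sanity-check the network, initialize it and push a dummy batch of stacked 84×84 frames through it (an illustrative snippet; it assumes num_action was set, e.g. num_action = 4, before building DQN):

import mxnet as mx

DQN.initialize(mx.init.Xavier())
frames = mx.nd.random.uniform(shape=(32, 4, 84, 84))  # batch of 32, four stacked 84x84 frames
print(DQN(frames).shape)  # (32, num_action): one Q-value per action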
The fastest, most powerful GPU instances in the cloud:
• Up to eight NVIDIA Tesla V100 GPUs
• 1 PetaFLOPS of computational performance – 14x better than P2
• 300 GB/s GPU-to-GPU communication (NVLink) – 9x better than P2
• 16 GB GPU memory with 900 GB/s peak GPU memory bandwidth
• Get started quickly with easy-to-launch tutorials
• Hassle-free setup and configuration
• Pay only for what you use – no additional charge for the AMI
• Accelerate your model training and deployment
• Support for popular deep learning frameworks
Build, train, and deploy machine learning models at scale:
• End-to-end machine learning platform
• Zero setup
• Flexible model training
• Pay by the second
Lots of companies doing machine learning are unable to unlock their business potential because they lack ML expertise. Amazon ML Lab provides the missing ML expertise through brainstorming, modeling, and teaching: leverage Amazon experts with decades of ML experience with technologies like Amazon Echo, Amazon Alexa, Prime Air, and Amazon Go.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.