SD Study: RNN & LSTM 2016/11/10
Seitaro Shinagawa
1/43
This is a description for people
who already understand simple
neural network architectures
such as feed-forward networks.
2/43
I will introduce LSTM,
how to use it, and tips for Chainer.
3/43
Quoted from わかるLSTM ~ 最近の動向と共に ("Understanding LSTM, with recent trends", http://qiita.com/t_Signull/items/21b82be280b46f467d1b )
1.RNN to LSTM
[Figure: Simple RNN, drawn as Input Layer → Middle (Hidden) Layer → Output Layer, with a recurrent loop on the hidden layer]
4/43
FAQ from LSTM beginner students.
A-san: I hear LSTM is a kind of RNN,
but the LSTM figure looks like a different architecture…
Neural bear: They have the same architecture!
Please follow me!
[Figure: the RNN A-san often sees (x_t → h_t → y_t, with a loop on h_t)
vs. the LSTM A-san often sees (three LSTM blocks in a row,
inputs x_1, x_2, x_3 and outputs y_1, y_2, y_3).
Are these the same or different?]
5/43
Deriving the LSTM figure from the RNN
[Figure: an RNN with input x_t, hidden state h_t, output y_t, and a recurrent loop on h_t]
6/43
Deriving the LSTM figure from the RNN
Unroll on the time scale
[Figure: the same RNN (x_t, h_t, y_t) about to be unrolled over time]
7/43
Deriving the LSTM figure from the RNN
Unroll on the time scale
[Figure: the unrolled RNN, h_0 → h_1 → h_2 → h_3, with inputs x_1, x_2, x_3 and outputs y_1, y_2, y_3]
Oh, I often see this in RNN!
h_t = tanh(W_xh x_t + W_hh h_{t-1})
y_t = sigmoid(W_hy h_t)
So, this figure focuses on the variables and shows their relationships.
8/43
Let's focus on the actual process.
[Figure: two RNN steps, (x_1, h_0) → (y_1, h_1), then (x_2, h_1) → (y_2, h_2)]
Writing out the architecture in detail:
u_t = W_xh x_t + W_hh h_{t-1}
h_t = tanh(u_t)
v_t = W_hy h_t
y_t = sigmoid(v_t)
See the RNN as a large function with
input (x_t, h_{t-1}) that returns (y_t, h_t).
9/43
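The four equations above can be packed into one step function, exactly the "large function" the slide describes. A minimal NumPy sketch (the name `rnn_step` and the weight shapes are illustrative, not from the slides; biases are omitted as in the slide):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    """One simple-RNN step seen as a function: (x_t, h_{t-1}) -> (y_t, h_t)."""
    u_t = W_xh @ x_t + W_hh @ h_prev      # u_t = W_xh x_t + W_hh h_{t-1}
    h_t = np.tanh(u_t)                    # h_t = tanh(u_t)
    v_t = W_hy @ h_t                      # v_t = W_hy h_t
    y_t = 1.0 / (1.0 + np.exp(-v_t))      # y_t = sigmoid(v_t)
    return y_t, h_t
```

Calling it in a loop over t, feeding each returned h_t back in, reproduces the unrolled figure.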
Let's focus on the actual process.
[Figure: the same two RNN steps, now with each step boxed as one function]
u_t = W_xh x_t + W_hh h_{t-1}
h_t = tanh(u_t)
v_t = W_hy h_t
y_t = sigmoid(v_t)
See the RNN as a large function with
input (x_t, h_{t-1}) that returns (y_t, h_t).
10/43
Let's focus on the actual process.
[Figure: one boxed RNN step, (x_1, h_0) → (y_1, h_1), drawn in the LSTM style]
u_t = W_xh x_t + W_hh h_{t-1}
h_t = tanh(u_t)
v_t = W_hy h_t
y_t = sigmoid(v_t)
Oh, this looks the same as the LSTM figure!
11/43
Summary of this section
The LSTM figure is not special!
Yeah. Moreover, the initial hidden state h_0 is often omitted, like below.
[Figure: three RNN blocks in a row, inputs x_1, x_2, x_3 and outputs y_1, y_2, y_3, with no h_0 drawn]
If you view the RNN as an LSTM, strictly speaking you also need to pass the cell value to the LSTM module at the next time step, but that is usually omitted from the figure, too. 12/43
By the way, if you want to see the contents of the LSTM…
z_t = tanh(W_xz x_t + W_hz h_{t-1})
g_{i,t} = σ(W_xi x_t + W_hi h_{t-1})  (input gate)
g_{f,t} = σ(W_xf x_t + W_hf h_{t-1})  (forget gate)
g_{o,t} = σ(W_xo x_t + W_ho h_{t-1})  (output gate)
ẑ_t = z_t ⊙ g_{i,t}
ĉ_{t-1} = c_{t-1} ⊙ g_{f,t}
c_t = ĉ_{t-1} + ẑ_t
h_t = tanh(c_t) ⊙ g_{o,t}
y_t = σ(W_hy h_t)
(where σ(·) = sigmoid(·))
[Figure: one LSTM step taking (x_1, h_0, c_0) and producing (y_1, h_1, c_1)]
13/43
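The equations above translate directly into one cell-update function. A minimal NumPy sketch (the weight-dictionary layout and the name `lstm_step` are my own; biases are omitted as in the slide):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x_t, h_prev, c_prev, W):
    """One LSTM step following the slide's equations. W maps names to matrices."""
    z_t = np.tanh(W['xz'] @ x_t + W['hz'] @ h_prev)   # candidate input z_t
    g_i = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev)   # input gate
    g_f = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev)   # forget gate
    g_o = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev)   # output gate
    c_t = c_prev * g_f + z_t * g_i                    # cell update (the CEC)
    h_t = np.tanh(c_t) * g_o                          # hidden state
    y_t = sigmoid(W['hy'] @ h_t)                      # output
    return y_t, h_t, c_t
```

Note that both h_t and c_t must be threaded through to the next call, which is exactly the extra cell arrow the summary slide mentioned.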
LSTM FAQ
Q. What is the difference between RNN and LSTM?
A. The Constant Error Carousel (CEC, often called the cell) and three gates: input gate, forget gate, output gate.
• Input gate: selects whether to accept the input into the cell
• Forget gate: selects whether to throw away the cell information
• Output gate: selects how much information to pass to the next time step
Q. Why does LSTM avoid the gradient vanishing problem?
1. Backpropagation suffers from repeated multiplication of sigmoid derivatives.
2. The RNN output was affected by the ever-changing hidden states.
3. The LSTM has a cell that stores previous inputs as a sum of weighted
inputs, so it is robust to the current hidden state (of course,
there is a limit to how long a sequence it can remember).
14/43
Quoted from わかるLSTM ~ 最近の動向と共に ("Understanding LSTM, with recent trends", http://qiita.com/t_Signull/items/21b82be280b46f467d1b )
[Figure: the LSTM block]
15/43
Quoted from わかるLSTM ~ 最近の動向と共に ("Understanding LSTM, with recent trends", http://qiita.com/t_Signull/items/21b82be280b46f467d1b )
LSTM with Peephole
This is known as the standard LSTM,
but the LSTM without peepholes is
often used, too.
16/43
Chainer usage
Without peephole (the standard version in Chainer): chainer.links.LSTM
With peephole: chainer.links.StatefulPeepholeLSTM
Stateless○○:
h = init_state()
h = stateless_lstm(h, x1)
h = stateless_lstm(h, x2)
Stateful○○:
stateful_lstm(x1)
stateful_lstm(x2)
"Stateful" means wrapping the hidden state in
the internal state of the function (※)
(※) https://groups.google.com/forum/#!topic/chainer-jp/bJ9IQWtsef4 17/43
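The stateless/stateful contrast above can be sketched without Chainer at all. In this toy illustration (the class, the fixed recurrence inside `stateless_step`, and all names are made up for the example; Chainer's real links hold learned weights), the stateless function makes the caller carry h, while the stateful wrapper hides it:

```python
import numpy as np

def stateless_step(h, x):
    """A stateless recurrent step: the caller passes in and carries the state h."""
    return np.tanh(0.5 * h + x)

class StatefulRNN:
    """Wraps the hidden state in internal state, like Chainer's stateful links."""
    def __init__(self, size):
        self.h = np.zeros(size)

    def __call__(self, x):
        self.h = stateless_step(self.h, x)  # state is updated as a side effect
        return self.h

    def reset_state(self):
        self.h = np.zeros_like(self.h)      # start a new sequence
```

Feeding x1 then x2 to the stateful object gives the same result as threading h by hand through two stateless calls; the stateful style just saves the bookkeeping (and requires an explicit reset between sequences).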
2. LSTM Learning Methods
Full BPTT
Truncated BPTT
(BPTT: Back Propagation Through Time)
Graham Neubig, NLP tutorial 8: recurrent neural networks
http://www.phontron.com/slides/nlp-programming-ja-08-rnn.pdf
18/43
Truncated BPTT by chainer
Quoted from Chainerの使い方と自然言語処理への応用 ("How to use Chainer and its application to natural language processing", http://www.slideshare.net/beam2d/chainer-52369222 )
19/43
Truncated BPTT in Chainer
[Figure: a long chain of LSTM steps with inputs x_1, x_2, …, x_30, … and outputs y_1, y_2, …; backpropagation runs over at most 30 steps (BP until i = 30), then the weights are updated and the history is cut]
20/43
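Truncated BPTT can be illustrated on the simplest possible RNN, the scalar linear recurrence h_t = w·h_{t-1} + x_t, where the gradient of h_T with respect to w through the last k steps has a closed form. This is a toy sketch of the idea, not Chainer code (in Chainer the same cut is typically made by calling `unchain_backward()` on the loss every k steps before updating):

```python
def truncated_grad(xs, w, k):
    """d h_T / d w for the scalar linear RNN h_t = w * h_{t-1} + x_t,
    backpropagating through at most the last k time steps (truncated BPTT)."""
    hs = [0.0]                               # h_0 = 0
    for x in xs:
        hs.append(w * hs[-1] + x)            # forward pass
    T = len(xs)
    g = 0.0
    for t in range(max(1, T - k + 1), T + 1):
        g += hs[t - 1] * w ** (T - t)        # only paths entering at steps t..T
    return g
```

With k equal to the full sequence length this is the exact gradient; smaller k drops the contributions of older steps, trading gradient accuracy for bounded memory and computation.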
Mini-batch calculation with GPU
What should I do if I want to use a GPU
when the sequence lengths are not aligned?
Filling the end of each sequence is standard.
Example (the end-of-sequence symbol is 0):
1 2 0            1 2 0 0 0
1 3 3 2 0   →    1 3 3 2 0
1 4 2 0          1 4 2 0 0
I call this zero padding.
21/43
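A minimal sketch of the padding step (the helper name `pad_batch` is mine), reproducing the example above:

```python
import numpy as np

def pad_batch(seqs, pad=0):
    """Pad variable-length sequences with `pad` so they form one rectangular array."""
    max_len = max(len(s) for s in seqs)
    return np.array([list(s) + [pad] * (max_len - len(s)) for s in seqs])

batch = pad_batch([[1, 2, 0], [1, 3, 3, 2, 0], [1, 4, 2, 0]])
# batch is now:
# [[1 2 0 0 0]
#  [1 3 3 2 0]
#  [1 4 2 0 0]]
```

Every row now has the same length, so the whole mini-batch can be processed as one array per time step on the GPU.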
Mini-batch calculation with GPU
The learned model becomes redundant: it has to learn a "keep outputting 0" rule.
Adding a handcrafted rule can solve this. There are two methods in Chainer:
• chainer.functions.where
• NStepLSTM (v1.16.0 or later)
22/43
chainer.functions.where
[Figure: an LSTM step over the padded mini-batch
1 2 0 0 0
1 3 3 2 0
1 4 2 0 0
The LSTM produces a tentative state (c_tmp, h_tmp) from (c_{t-1}, h_{t-1}) and x_t.
A condition matrix S holds True for rows still inside their sequence
(e.g. False, False, …, False / True, True, …, True / False, False, …, False)
and False for rows that have already ended.]
c_t = F.where(S, c_tmp, c_{t-1})
h_t = F.where(S, h_tmp, h_{t-1})
Where S is True, the new state h_t is taken; where S is False, the previous state h_{t-1} is carried over unchanged.
23/43
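F.where behaves like NumPy's `where`, selecting elementwise between two arrays by a condition. A sketch of the masked state update with plain NumPy (the function name and the row-wise `alive` mask layout are illustrative):

```python
import numpy as np

def masked_update(h_prev, h_tmp, alive):
    """Take the new state h_tmp only for sequences that are still alive,
    carrying h_prev over for finished (padded) rows, like F.where(S, h_tmp, h_prev)."""
    return np.where(alive[:, None], h_tmp, h_prev)  # broadcast mask across units
```

The same call is applied to both c and h, so a padded row's state is frozen at the value it had when its sequence ended.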
NStepLSTM (v1.16.0 or later)
NStepLSTM can do the filling automatically.
There is a bug with cuDNN dropout (※); a fixed version was merged into the master repository on 10/25.
Use the latest version (wait for v1.18.0 or git clone from GitHub):
https://github.com/pfnet/chainer/pull/1804
There is no documentation yet; read the raw script:
https://github.com/pfnet/chainer/blob/master/chainer/functions/connection/n_step_lstm.py
(※) ChainerのNStepLSTMでニコニコ動画のコメント予測 ("Comment prediction for Nico Nico Douga with Chainer's NStepLSTM")
http://www.monthly-hack.com/entry/2016/10/24/200000
So I don't need to bother with F.where?
Hahaha…
24/43
Gradient Clipping can suppress gradient explosion
LSTM can mitigate the gradient vanishing problem, but
RNNs (LSTM included) also suffer from gradient explosion (※).
※ On the difficulty of training recurrent neural networks
http://www.jmlr.org/proceedings/papers/v28/pascanu13.pdf
Proposed in ※: if the norm of the overall gradient exceeds
a threshold, rescale the gradient so its norm equals the threshold.
In Chainer, you can use:
optimizer.add_hook(GradientClipping(threshold))
25/43
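A sketch of clipping by the global norm, as described above (the function name and list-of-arrays interface are my own; Chainer's GradientClipping hook does the equivalent over all parameters):

```python
import numpy as np

def clip_by_global_norm(grads, threshold):
    """Rescale all gradient arrays together if their combined L2 norm
    exceeds the threshold; otherwise leave them untouched."""
    norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    if norm > threshold:
        rate = threshold / norm        # shrink so the new norm equals threshold
        grads = [g * rate for g in grads]
    return grads
```

The key point is that all parameters are scaled by one common factor, so the gradient's direction is preserved and only its length is capped.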
DropOut application to LSTM
DropOut is a strong smoothing method,
but applying dropout just anywhere does not always succeed.
※ Recurrent Dropout without Memory Loss
https://arxiv.org/abs/1603.05118
According to ※, comparing
1. dropout on the recurrent hidden state in the LSTM,
2. dropout on the cell in the LSTM,
3. dropout on the input gate in the LSTM,
the conclusion is that 3. achieved the best performance.
Basically:
• recurrent connections → dropout should not be applied
• forward connections → dropout should be applied
26/43
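The rule of thumb above, sketched in NumPy: dropout is applied to the forward (input) connection only, and the recurrent term is left untouched. This only illustrates the simple forward/recurrent split; the paper's best variant instead masks the input gate inside the LSTM. All names are illustrative:

```python
import numpy as np

def rnn_step_with_dropout(x_t, h_prev, W_xh, W_hh, p, rng, train=True):
    """RNN step with inverted dropout on the input connection only.
    The recurrent path h_{t-1} -> h_t is never dropped."""
    if train:
        mask = (rng.random(x_t.shape) >= p) / (1.0 - p)  # scale kept units by 1/(1-p)
        x_t = x_t * mask
    return np.tanh(W_xh @ x_t + W_hh @ h_prev)
```

At evaluation time (`train=False`) the step reduces to the plain RNN update, with no rescaling needed thanks to the inverted-dropout scaling during training.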
Batch Normalization on LSTM
Batch Normalization?
Scaling the activation (sum of weighted inputs) distribution to N(0, 1).
http://jmlr.org/proceedings/papers/v37/ioffe15.pdf
In theory, BN should be applied over all the data;
in practice, BN is applied per mini-batch.
[Figure: Batch Normalization applied to the activation x]
27/43
Batch Normalization on LSTM
Applying BN to an RNN does not improve performance (※):
• hidden-to-hidden: suffers from gradient explosion due to repeated scaling
• input-to-hidden: makes learning faster, but does not improve performance
※ Batch Normalized Recurrent Neural Networks
https://arxiv.org/abs/1510.01378
Three newly proposed methods (in order of proposal date):
(Weight Normalization) https://arxiv.org/abs/1602.07868
(Recurrent Batch Normalization) https://arxiv.org/abs/1603.09025
Layer Normalization https://arxiv.org/abs/1607.06450
28/43
Difference between Batch Normalization and Layer Normalization
Assume activations a, with a_i^(n) = Σ_j w_ij x_j^(n) and the hidden unit h_i^(n) computed from a_i^(n), arranged as a matrix with one row per example n and one column per unit i = 1 … H.
[Figure: an N × H matrix of activations a_1^(n), a_2^(n), a_3^(n), a_4^(n), …, a_H^(n); Batch Normalization normalizes vertically (over examples), Layer Normalization normalizes horizontally (over units)]
The variance σ becomes larger when gradient explosion happens.
Normalization makes the output more robust (details are in the paper). 29/43
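The vertical/horizontal distinction in NumPy terms, for an activation matrix with one row per example and one column per unit (the function names are mine, and the trainable scale/shift parameters of both methods are omitted):

```python
import numpy as np

def batch_norm(a, eps=1e-5):
    """Normalize each unit over the batch: 'vertically', axis 0."""
    return (a - a.mean(axis=0)) / np.sqrt(a.var(axis=0) + eps)

def layer_norm(a, eps=1e-5):
    """Normalize each example over its units: 'horizontally', axis 1."""
    mu = a.mean(axis=1, keepdims=True)
    var = a.var(axis=1, keepdims=True)
    return (a - mu) / np.sqrt(var + eps)
```

Layer Normalization's statistics depend only on the current example, which is why it carries over to recurrent networks and small batches more naturally than BN.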
Initialization Tips
Exact solutions to the nonlinear dynamics of
learning in deep linear neural networks
https://arxiv.org/abs/1312.6120v3
A Simple Way to Initialize Recurrent Networks
of Rectified Linear Units https://arxiv.org/abs/1504.00941v2
An RNN with ReLU whose recurrent weight matrix is
initialized to the identity is as good as LSTM.
30/43
From “A Simple Way to Initialize Recurrent Networks of Rectified Linear Units” 31/43
MNIST 784 sequence prediction
32/43
Plain RNN:
h_t = tanh(W_xh x_t + W_hh h_{t-1})
y_t = sigmoid(W_hy h_t)
[Figure: one RNN step, (x_1, h_0) → (y_1, h_1)]
IRNN:
h_t = ReLU(W_xh x_t + W_hh h_{t-1})
y_t = ReLU(W_hy h_t)
W_hh is initialized to the identity matrix.
When x = 0, h = ReLU(h), so the hidden state is preserved.
33/43
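The x = 0 observation can be checked directly. A NumPy sketch of the IRNN step with identity initialization (toy sizes, names are mine):

```python
import numpy as np

def irnn_step(x_t, h_prev, W_xh, W_hh):
    """IRNN step: ReLU activation instead of tanh."""
    return np.maximum(0.0, W_xh @ x_t + W_hh @ h_prev)

H = 4
W_hh = np.eye(H)           # identity initialization of the recurrent weights
W_xh = np.zeros((H, 3))    # zero input weights, just to isolate the x = 0 case
h = np.array([0.3, 0.0, 1.2, 0.7])
h_next = irnn_step(np.zeros(3), h, W_xh, W_hh)
# with x = 0, h_next = ReLU(I h) = ReLU(h) = h for non-negative h:
# the state is carried forward unchanged, which is what lets the
# identity-initialized ReLU RNN remember over long spans
```

Once training moves W_hh away from the identity, the state is no longer copied exactly, but the initialization starts learning from this memory-preserving regime.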
Extra materials
34/43
Various RNN models
Encoder-Decoder
Bidirectional LSTM
Attention model
35/43
[Figure: three RNN blocks with initial hidden state h_0, inputs x_1, x_2, x_3 and outputs y_1, y_2, y_3]
The RNN output changes with the initial hidden state h_0.
h_0 is also learnable by BP.
It can be connected to an encoder output → the encoder-decoder model.
Focusing on the initial value of the RNN hidden layer: slice-by-slice pixel generation with an RNN.
[Figure: the original digit, a generation from the learned h_0, and a generation from a random h_0. The first slice is 0 (black), but various sequences appear.]
36/43
Encoder-Decoder model
[Figure: an encoder RNN reads x_1^enc, x_2^enc, x_3^enc; its final state becomes h_0^dec, the initial state of a decoder RNN that emits y_1, y_2, y_3 from x_1, x_2, x_3]
Point: use this when your input and output have different sequence lengths.
h_0^dec is learned by training the encoder and the decoder at the same time.
To improve performance, you can use beam search in the decoder.
37/43
Bidirectional LSTM
[Figure: the same encoder-decoder, but with the encoder fed x_3^enc, x_2^enc, x_1^enc in reverse order]
Very long time dependencies are difficult to learn even if you use
LSTM (LSTM does not fundamentally solve gradient vanishing).
You can improve performance by using an inverted (reversed-input) encoder.
[Figure: the forward encoder says "I remember the latter information!";
the reversed encoder says "I remember the former information!"]
38/43
Attention model
[Figure: an encoder-decoder where the decoder keeps all encoder hidden states h_1^enc, h_2^enc, h_3^enc; at each decoder step t, attention weights α_{1,t}, α_{2,t}, α_{3,t} mix the encoder states into a context vector for the decoder]
Moreover, using the intermediate
hidden states of the encoder
leads to better performance!
39/43
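A minimal sketch of one attention step: score every encoder hidden state against the current decoder state, softmax the scores into the weights α, and mix a context vector. The slides do not fix the scoring function, so dot-product scoring is used here as one common choice; names are illustrative:

```python
import numpy as np

def attention_context(h_dec, enc_states):
    """enc_states: (T, H) stacked encoder hidden states; h_dec: (H,) decoder state.
    Returns the context vector and the attention weights alpha."""
    scores = enc_states @ h_dec              # dot-product score per encoder step, (T,)
    e = np.exp(scores - scores.max())        # stable softmax
    alpha = e / e.sum()                      # attention weights, sum to 1
    context = alpha @ enc_states             # weighted sum of encoder states, (H,)
    return context, alpha
```

The context vector is then fed into the decoder step, so each output can look back at the encoder positions most relevant to it.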
Gated Recurrent Unit (GRU)
A variant of LSTM:
• the cell is removed
• the gates are reduced to two
Despite the lower complexity,
its performance is not bad.
It often appears in MT or SD tasks.
[Figure: an LSTM block for comparison]
40/43
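A standard GRU step as a NumPy sketch. The slides do not give the GRU equations, so this is the usual formulation with an update gate z and a reset gate r; names, shapes, and the omission of biases are illustrative:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, W):
    """One GRU step: no cell, only two gates."""
    z = sigmoid(W['xz'] @ x_t + W['hz'] @ h_prev)            # update gate
    r = sigmoid(W['xr'] @ x_t + W['hr'] @ h_prev)            # reset gate
    h_cand = np.tanh(W['xh'] @ x_t + W['hh'] @ (r * h_prev)) # candidate state
    return (1.0 - z) * h_prev + z * h_cand                   # interpolate old/new
```

Compared with the LSTM step, a single state h is carried instead of (h, c), and the interpolation by z plays the combined role of the input and forget gates.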
Try to split the LSTM and turn it upside down.
1. GRU is to the hidden state what LSTM is to the cell.
2. Share the input gate and the output gate as one update gate.
3. Delete the tanh on the LSTM's cell output.
GRU can be interpreted as a special case of LSTM.
[Figure: GRU vs. LSTM]
41/43
GRU can be interpreted as a special case of LSTM.
1. Try to split the LSTM and turn it upside down.
[Figure: the LSTM, split]
42/43
GRU can be interpreted as a special case of LSTM.
1. Try to split the LSTM and turn it upside down.
2. See the LSTM cell as the GRU hidden state.
3. Share the input gate and the output gate as one update gate.
4. Delete the tanh on the LSTM's cell output.
[Figure: GRU vs. LSTM]
43/43