Training Deep, Very Deep, and Recurrent Neural Networks
Artem Chernodub
AI&Big Data Lab, June 2, 2016, Odessa
Neural Network (1990s)
Deep Neural Network (GoogleNet, 2014)
Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
Classic Feedforward Neural Networks (before 2006).
• Single hidden layer (the Kolmogorov–Cybenko universal approximation theorem as the main hope).
• The vanishing-gradients effect prevents using more layers.
• Fewer than 10K free parameters.
• A feature preprocessing stage is often critical.
Deep Feedforward Neural Networks
• Many (> 1) hidden layers.
• 100K – 100M free parameters.
• The vanishing-gradients problem is beaten!
• No (or less) feature preprocessing.
Deep Learning = Learning of Representations (Features)
The traditional model of pattern recognition (since the late 50's):
fixed/engineered features + trainable classifier:

Hand-crafted Feature Extractor → Trainable Classifier

End-to-end learning / Feature learning / Deep learning:
trainable features + trainable classifier:

Trainable Feature Extractor → Trainable Classifier
ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
Russakovsky, Olga, et al. "Imagenet large scale visual recognition challenge." International Journal of Computer Vision 115.3 (2015): 211-252.
1000 classes
Train: 1.2M images
Test: 150K images
ILSVRC 2012 results (image classification)
# | Team name | Method | Top-5 error, %
1 | SuperVision | AlexNet + extra data | 15.32
2 | SuperVision | AlexNet | 16.42
3 | ISI | SIFT+FV, LBP+FV, GIST+FV | 26.17
5 | ISI | Naive sum of scores from classifiers using each FV | 26.65
7 | OXFORD_VGG | Mixed selection from High-Level SVM scores and Baseline Scores | 26.98
AlexNet (2012): a mega hit
A. Krizhevsky, I. Sutskever, G.E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks // Advances in Neural Information Processing Systems 25 (NIPS 2012).
Deep Face (Facebook)
Y. Taigman, M. Yang, M.A. Ranzato, L. Wolf. DeepFace: Closing the Gap to Human-Level Performance in Face Verification // CVPR 2014.
Model | # of parameters | Accuracy, %
Deep Face Net | 128M | 97.35
Human level | N/A | 97.5

Training data: 4M facial images
Deeper, deeper and deeper

Year | Net's name | Number of layers | Top-5 error, %
2012 | AlexNet | 8 | 15.32
2013 | - | - | -
2014 | VGGNet | 19 | 7.10
2015 | ResNet | 152 | 4.49
Cost of computing
https://en.wikipedia.org/wiki/FLOPS
Year | Cost per GFLOPS in 2013 USD
1997 | $42,000
2003 | $100
2007 | $52
2011 | $1.80
2013 | $0.12
2015 | $0.06
Training Neural Networks + optimization
1) forward propagation pass
z_j = f( Σ_i w_ij^(1) x_i ),
ỹ(k+1) = g( Σ_j w_j^(2) z_j ),
where zj is the postsynaptic value for the j-th hidden neuron, w(1) are the hidden layer’s weights, f() are the hidden layer’s activation functions, w(2) are the output layer’s weights, and g() are the output layer’s activation functions.
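These two equations can be sketched directly in NumPy; the sigmoid f, the identity output g, and all sizes are illustrative assumptions, not anything fixed by the slide:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, W2):
    """z_j = f(sum_i w_ij^(1) x_i),  y = g(sum_j w_j^(2) z_j),
    with f = sigmoid and g = identity (illustrative choices)."""
    z = sigmoid(W1 @ x)   # postsynaptic values of the hidden neurons
    y = W2 @ z            # network output
    return z, y

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(5, 3))   # hidden-layer weights w^(1)
W2 = rng.normal(scale=0.1, size=(1, 5))   # output-layer weights w^(2)
z, y = forward(np.array([0.5, -1.0, 2.0]), W1, W2)
```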
2) backpropagation pass

Local gradients calculation:

δ^OUT = t(k+1) − ỹ(k+1),
δ_j^HID = f′(z_j) w_j^(2) δ^OUT.

Derivatives calculation:

∂E(k)/∂w_j^(2) = δ^OUT z_j,
∂E(k)/∂w_ji^(1) = δ_j^HID x_i.
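A minimal NumPy sketch of one forward/backward pass, assuming a sigmoid hidden layer, a linear output, and the squared error E(k) = ½(t(k+1) − ỹ(k+1))²; names, sizes, and the learning rate are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, t, W1, W2, lr=0.1):
    """One forward + backward pass with the local gradients above:
    delta_OUT = t - y,  delta_HID_j = f'(z_j) * w_j^(2) * delta_OUT,
    followed by the updates Delta_w = lr * delta * input."""
    z = sigmoid(W1 @ x)                              # forward: hidden layer
    y = W2 @ z                                       # forward: output
    delta_out = t - y                                # output local gradient
    delta_hid = z * (1.0 - z) * (W2.T @ delta_out)   # f'(z_j) w_j^(2) delta_OUT
    W2 = W2 + lr * np.outer(delta_out, z)            # from dE/dw^(2)_j ~ delta_OUT z_j
    W1 = W1 + lr * np.outer(delta_hid, x)            # from dE/dw^(1)_ji ~ delta_HID_j x_i
    return W1, W2

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(5, 3))
W2 = rng.normal(scale=0.5, size=(1, 5))
x, t = np.array([0.5, -1.0, 2.0]), np.array([1.0])

def error(W1, W2):
    return 0.5 * float(((t - W2 @ sigmoid(W1 @ x)) ** 2).sum())

e0 = error(W1, W2)
W1, W2 = backprop_step(x, t, W1, W2)
e1 = error(W1, W2)   # one training step lowers the squared error
```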
Bad effect of vanishing (exploding) gradients: two hypotheses
1) increased frequency and severity of bad local minima;
2) pathological curvature, like that of the well-known Rosenbrock function:

f(x, y) = (1 − x)² + 100 (y − x²)²
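The Rosenbrock valley can be probed directly; the step size and iteration count below are arbitrary choices:

```python
def rosenbrock(x, y):
    """f(x, y) = (1 - x)^2 + 100 * (y - x^2)^2, global minimum f(1, 1) = 0."""
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def rosenbrock_grad(x, y):
    dfdx = -2 * (1 - x) - 400 * x * (y - x ** 2)
    dfdy = 200 * (y - x ** 2)
    return dfdx, dfdy

# Plain gradient descent crawls along the curved, flat-bottomed valley.
x, y, lr = -1.0, 1.0, 1e-3
for _ in range(1000):
    gx, gy = rosenbrock_grad(x, y)
    x, y = x - lr * gx, y - lr * gy

final = rosenbrock(x, y)   # improved, yet typically still far from (1, 1)
```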
Bad effect of vanishing (exploding) gradients: a problem
∂E(k)/∂w_ji^(m) = δ_j^(m) z_i^(m−1),
δ_j^(m) = f′(z_j^(m)) Σ_i w_ij^(m+1) δ_i^(m+1),

⇒ ∂E(k)/∂w_ji^(m) → 0 for the early layers (m → 1).
Backpropagation mechanics in vector form
δ(m−1) = δ(m) W_m diag(f′(a(m−1)))

Observations:
• ‖W_m‖ is kept bounded by regularization (weight decay), for robustness;
• ‖diag(f′(a(m−1)))‖:
  – max(f′) = ¼ for sigmoid;
  – max(f′) = 1 for tanh;
  – max(f′) = 1 for ReLU.
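The three max(f′) values can be verified numerically on a grid (the grid itself is an arbitrary choice):

```python
import numpy as np

a = np.linspace(-10.0, 10.0, 100001)   # grid that includes a = 0

s = 1.0 / (1.0 + np.exp(-a))
d_sigmoid = s * (1.0 - s)              # sigmoid': peaks at 1/4 at a = 0
d_tanh = 1.0 - np.tanh(a) ** 2         # tanh': peaks at 1 at a = 0
d_relu = (a >= 0).astype(float)        # ReLU': equals 1 on the right half-line

print(d_sigmoid.max(), d_tanh.max(), d_relu.max())   # 0.25 1.0 1.0
```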
Backpropagation as multiplication of Jacobians
Jacobian of the n-th layer:

J(n) = W_n diag(f′(a(n−1))).

Local gradients as a product of Jacobians:

δ(n−1) = δ(n) J(n),
δ(n−2) = δ(n) J(n) J(n−1),
δ(n−h) = δ(n) J(n) J(n−1) … J(n−h+1).

If ‖J(n)‖ < 1, the gradient vanishes; if ‖J(n)‖ > 1, the gradient probably explodes.
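A quick numerical illustration of this condition: repeatedly multiplying δ(n) by layer Jacobians with small weights shrinks it geometrically, while large weights blow it up. The sigmoid derivatives, Gaussian weights, and all sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_jacobian(width, scale):
    """J(n) = W_n diag(f'(a(n-1))), with Gaussian weights and sigmoid
    derivatives at random pre-activations (illustrative choices)."""
    W = rng.normal(scale=scale, size=(width, width))
    s = 1.0 / (1.0 + np.exp(-rng.normal(size=width)))
    return W @ np.diag(s * (1.0 - s))

def backpropagated_norm(depth, scale, width=64):
    delta = np.ones(width)                            # delta(n) at the top layer
    for _ in range(depth):
        delta = delta @ layer_jacobian(width, scale)  # delta(n-1) = delta(n) J(n)
    return np.linalg.norm(delta)

vanished = backpropagated_norm(20, scale=0.1)    # small ||J||: gradient -> 0
exploded = backpropagated_norm(20, scale=10.0)   # large ||J||: gradient blows up
```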
Nonlinear Activation functions
Andrej Karpathy and Fei-Fei. CS231n: Convolutional Neural Networks for Visual Recognition http://cs231n.github.io/convolutional-networks
Yoshua Bengio, Ian Goodfellow and Aaron Courville. Deep Learning // An MIT Press book in preparation http://www-labs.iro.umontreal.ca/~bengioy/DLbook
ReLU activation function:

f(x) = max(0, x),
f′(x) = 1 if x ≥ 0, 0 if x < 0.
Legendary pretraining
Sparse Autoencoders
Dimensionality reduction
• Use a stacked RBM as a deep autoencoder:
1. Train the RBM with images as both input and output.
2. Limit one layer to a few dimensions.

Information has to pass through the middle layer.

G. E. Hinton and R. R. Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks // Science 313 (2006), p. 504–507.
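The bottleneck idea (though not the RBM pretraining itself) can be sketched with a plain linear autoencoder trained by gradient descent; all dimensions and hyperparameters here are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data that truly lives on a 2-D subspace, embedded in 10 dimensions.
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 10))

# Encoder and decoder around a 2-unit bottleneck: all information must
# squeeze through the 2-dimensional middle layer.
W_enc = rng.normal(scale=0.1, size=(10, 2))
W_dec = rng.normal(scale=0.1, size=(2, 10))

lr = 0.02
for _ in range(2000):
    H = X @ W_enc                             # 2-D codes
    err = H @ W_dec - X                       # reconstruction residual
    gW_dec = H.T @ err / len(X)               # gradients of the mean squared error
    gW_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * gW_dec
    W_enc -= lr * gW_enc

mse = np.mean((X @ W_enc @ W_dec - X) ** 2)   # far below the raw variance of X
```

Because the data is genuinely 2-dimensional, the 2-unit bottleneck loses almost nothing; shrink the data's true dimensionality mismatch and the reconstruction error floor rises.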
How to use the unsupervised pre-training stage / 1

How to use the unsupervised pre-training stage / 2

How to use the unsupervised pre-training stage / 3

How to use the unsupervised pre-training stage / 4
Why a Multilayer Perceptron (a shallow neural network from the 1990s)?
Convolutional Neural Networks
Convolution Layer
Implementation tricks: im2col
K. Chellapilla, S. Puri, P. Simard. High Performance Convolutional Neural Networks for Document Processing // International Workshop on Frontiers in Handwriting Recognition, 2006.
Implementation tricks: im2col for convolution
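A minimal sketch of the trick for a single-channel image with stride 1 and no padding (the correlation form of convolution); names are illustrative:

```python
import numpy as np

def im2col(img, k):
    """Copy every k-by-k patch of a 2-D image into one column, so that
    convolving with any k*k filter becomes a single matrix product (GEMM)."""
    H, W = img.shape
    out_h, out_w = H - k + 1, W - k + 1
    cols = np.empty((k * k, out_h * out_w))
    col = 0
    for i in range(out_h):
        for j in range(out_w):
            cols[:, col] = img[i:i + k, j:j + k].ravel()
            col += 1
    return cols

img = np.arange(16.0).reshape(4, 4)
filt = np.ones((3, 3))
out = (filt.ravel() @ im2col(img, 3)).reshape(2, 2)   # -> [[45. 54.] [81. 90.]]
```

Practical implementations extend this with channels, stride, and padding, trading extra memory for the high arithmetic efficiency of one large GEMM.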
Recurrent Neural Network (SRN)
• Pascanu R., Mikolov T., Bengio Y. On the Difficulty of Training Recurrent Neural Networks // Proc. of ICML 2013.
• Q.V. Le, N. Jaitly, G.E. Hinton. A Simple Way to Initialize Recurrent Networks of Rectified Linear Units (2015).
• M. Arjovsky, A. Shah, Y. Bengio. Unitary Evolution Recurrent Neural Networks (2016).
• Henaff M., Szlam A., LeCun Y. Orthogonal RNNs and Long-Memory Tasks // arXiv preprint arXiv:1602.06662, 2016.
Backpropagation Through Time (BPTT) for SRN
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
A neural network unrolled back through time is a deep neural network with shared weights.
Effect of different initializations for SRN
SRNs were initialized from a zero-mean Gaussian with predefined variance.
Long Short-Term Memory: adding linear connections to state propagation
Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9.8 (1997): 1735-1780.
Long Short-Term Memory (LSTM)
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
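One step of a standard LSTM cell in NumPy. The cell state c is updated additively through the gates, which is exactly the linear connection in state propagation; the stacked parameter layout and sizes are illustrative choices:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. The four gate pre-activations are stacked in
    one matrix product; the cell state is updated additively:
    c = f * c_prev + i * g (the linear path for gradients)."""
    n = h_prev.size
    z = W @ x + U @ h_prev + b
    f = sigmoid(z[0:n])          # forget gate
    i = sigmoid(z[n:2 * n])      # input gate
    o = sigmoid(z[2 * n:3 * n])  # output gate
    g = np.tanh(z[3 * n:4 * n])  # candidate cell values
    c = f * c_prev + i * g       # gated-additive state update
    h = o * np.tanh(c)           # new hidden state
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.normal(scale=0.1, size=(4 * n_hid, n_in))
U = rng.normal(scale=0.1, size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, U, b)
```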
Deep Residual Networks: adding linear connections to the conv nets
He, Kaiming, et al. "Deep Residual Learning for Image Recognition." arXiv preprint arXiv:1512.03385 (2015).
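The residual idea in miniature: y = x + F(x), here with F as two fully-connected layers and a ReLU (an illustrative stand-in; the paper uses convolutional blocks). With F ≡ 0 the block is exactly the identity, so gradients pass through the shortcut undiminished however deep the stack:

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = x + F(x): an identity shortcut around a small two-layer F."""
    Fx = W2 @ np.maximum(0.0, W1 @ x)   # F(x): linear, ReLU, linear
    return x + Fx

x = np.array([1.0, -2.0, 3.0])
# Zero residual weights make the block an exact identity map.
y = residual_block(x, np.zeros((3, 3)), np.zeros((3, 3)))
```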
Deep, big, simple neural nets: no pre-training, simple gradient descent
Ciresan, Dan Claudiu, et al. "Deep, big, simple neural nets for handwritten digit recognition." Neural computation 22.12 (2010): 3207-3220.
Smart initialization
Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." International conference on artificial intelligence and statistics. 2010.
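A sketch of the paper's normalized (uniform) initialization; the layer sizes are arbitrary:

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Normalized initialization: W ~ U[-a, a], a = sqrt(6 / (fan_in + fan_out)),
    chosen so activation and gradient variances stay roughly constant
    from layer to layer."""
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_out, fan_in))

rng = np.random.default_rng(0)
W = glorot_uniform(256, 128, rng)   # variance of W is 2 / (fan_in + fan_out)
```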
Batch Normalization: brute force whitening
Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." arXiv preprint arXiv:1502.03167 (2015).
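The forward transform in training mode can be sketched as follows: whiten each feature over the mini-batch, then rescale and shift with the learnable γ and β (sizes are illustrative):

```python
import numpy as np

def batchnorm_forward(X, gamma, beta, eps=1e-5):
    """Batch normalization in training mode: normalize each feature over
    the mini-batch, then rescale/shift with learnable gamma and beta."""
    mu = X.mean(axis=0)                       # per-feature batch mean
    var = X.var(axis=0)                       # per-feature batch variance
    X_hat = (X - mu) / np.sqrt(var + eps)     # whitened activations
    return gamma * X_hat + beta

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(64, 10))   # a shifted, scaled batch
Y = batchnorm_forward(X, gamma=np.ones(10), beta=np.zeros(10))
# Each feature of Y now has (almost exactly) zero mean and unit variance.
```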
Orthogonal matrices

An orthogonal matrix is a square matrix with real entries whose columns and rows are orthogonal unit vectors, i.e.

AᵀA = A Aᵀ = I,

where I is the identity matrix. An orthogonal matrix is norm-preserving:

‖A B‖ = ‖B‖,

where A is an orthogonal matrix and B is any matrix.
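Both properties are easy to check numerically with a random orthogonal matrix obtained from a QR decomposition (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# A random orthogonal matrix: QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))

ortho = np.allclose(Q.T @ Q, np.eye(8))   # Q^T Q = Q Q^T = I

# Norm preservation: ||Q B|| = ||B|| (Frobenius norm here).
B = rng.normal(size=(8, 5))
norms = (np.linalg.norm(Q @ B), np.linalg.norm(B))
```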
Examples of orthogonal matrices
Backpropagation mechanics, revisited:

δ(m−1) = δ(m) W_m diag(f′(a(m−1)))

Linear case: orthogonality of W is enough!

δ(m−1) = δ(m) W_m

Saxe, Andrew M., James L. McClelland, and Surya Ganguli. "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks." arXiv preprint arXiv:1312.6120 (2013).
Smart orthogonal initialization: orthogonal + whitening
Mishkin, Dmytro, and Jiri Matas. "All you need is a good init." arXiv preprint arXiv:1511.06422 (2015).
Orthogonal Permutation Linear Units (OPLU) / sortout
Rennie, Steven J., Vaibhava Goel, and Samuel Thomas. "Deep order statistic networks." Spoken Language Technology Workshop (SLT), 2014 IEEE. IEEE, 2014.
Chernodub, Artem, and Dimitri Nowicki. "Norm-preserving Orthogonal Permutation Linear Unit Activation Functions (OPLU)." arXiv preprint arXiv:1604.02313 (2016).
δ(m−1) = δ(m) W_m diag(f′(a(m−1)))
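A sketch of the OPLU idea: group the units into pairs and output each pair sorted as (max, min). The output is a permutation of the input, so the activation preserves the norm exactly; the pairing layout is an illustrative choice:

```python
import numpy as np

def oplu(a):
    """Orthogonal Permutation Linear Unit: split the units into pairs and
    output each pair as (max, min). Values are only permuted, never scaled,
    so the activation is exactly norm-preserving."""
    pairs = a.reshape(-1, 2)
    return np.stack([pairs.max(axis=1), pairs.min(axis=1)], axis=1).ravel()

x = np.array([3.0, -1.0, 0.5, 2.0])
y = oplu(x)                                          # -> [3., -1., 2., 0.5]
same_norm = np.linalg.norm(y) == np.linalg.norm(x)   # exact, not approximate
```

Since the activation only permutes values, its Jacobian is an orthogonal permutation matrix, so the diag(f′) factor in the backpropagation product above never shrinks the gradient.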