Training Deep, Very Deep, and Recurrent Neural Networks
Artem Chernodub
AI&Big Data Lab, June 2, 2016, Odessa
Neural Network (1990s)
Deep Neural Network (GoogleNet, 2014)
Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
Classic Feedforward Neural Networks (before 2006).
• Single hidden layer (the Kolmogorov–Cybenko universal approximation theorem as the main hope).
• The vanishing-gradients effect prevents using more layers.
• Fewer than 10K free parameters.
• A feature preprocessing stage is often critical.
Deep Feedforward Neural Networks
• Many (> 1) hidden layers.
• 100K – 100M free parameters.
• The vanishing-gradients problem is beaten!
• No (or less) feature preprocessing.
Deep Learning = Learning of Representations (Features)
The traditional model of pattern recognition (since the late 50's):
fixed/engineered features + trainable classifier:

Hand-crafted Feature Extractor → Trainable Classifier

End-to-end learning / Feature learning / Deep learning:
trainable features + trainable classifier:

Trainable Feature Extractor → Trainable Classifier
ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
Russakovsky, Olga, et al. "Imagenet large scale visual recognition challenge." International Journal of Computer Vision 115.3 (2015): 211-252.
1000 classes
Train: 1.2M images
Test: 150K images
ILSVRC 2012 results (image classification)
# | Team name | Method | Top-5 error, %
1 | SuperVision | AlexNet + extra data | 15.32
2 | SuperVision | AlexNet | 16.42
3 | ISI | SIFT+FV, LBP+FV, GIST+FV | 26.17
5 | ISI | Naive sum of scores from classifiers using each FV | 26.65
7 | OXFORD_VGG | Mixed selection from High-Level SVM scores and Baseline Scores | 26.98
AlexNet (2012): a mega hit
A. Krizhevsky, I. Sutskever, G.E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks // Advances in Neural Information Processing Systems 25 (NIPS 2012).
Deep Face (Facebook)
Y. Taigman, M. Yang, M.A. Ranzato, L. Wolf. DeepFace: Closing the Gap to Human-Level Performance in Face Verification // CVPR 2014.
Model | # of parameters | Accuracy, %
Deep Face Net | 128M | 97.35
Human level | N/A | 97.5

Training data: 4M facial images
Deeper, deeper and deeper

Year | Net's name | Number of layers | Top-5 error, %
2012 | AlexNet | 8 | 15.32
2013 | - | - | -
2014 | VGGNet | 19 | 7.10
2015 | ResNet | 152 | 4.49
Cost of computing
https://en.wikipedia.org/wiki/FLOPS
Year | Cost per GFLOPS in 2013 USD
1997 | $42,000
2003 | $100
2007 | $52
2011 | $1.80
2013 | $0.12
2015 | $0.06
Training Neural Networks + optimization
1) forward propagation pass
z_j = f( Σ_i w_ij^(1) x_i ),
ỹ(k+1) = g( Σ_j w_j^(2) z_j ),
where zj is the postsynaptic value for the j-th hidden neuron, w(1) are the hidden layer’s weights, f() are the hidden layer’s activation functions, w(2) are the output layer’s weights, and g() are the output layer’s activation functions.
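These two equations can be sketched directly in NumPy; the sigmoid f, the identity output g, and all sizes are illustrative assumptions, not anything fixed by the slide:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, W2):
    """z_j = f(sum_i w_ij^(1) x_i),  y = g(sum_j w_j^(2) z_j),
    with f = sigmoid and g = identity (illustrative choices)."""
    z = sigmoid(W1 @ x)   # postsynaptic values of the hidden neurons
    y = W2 @ z            # network output
    return z, y

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(5, 3))   # hidden-layer weights w^(1)
W2 = rng.normal(scale=0.1, size=(1, 5))   # output-layer weights w^(2)
z, y = forward(np.array([0.5, -1.0, 2.0]), W1, W2)
```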
2) backpropagation pass

Local gradients calculation:

δ^OUT = t(k+1) − ỹ(k+1),
δ_j^HID = f′(z_j) w_j^(2) δ^OUT.

Derivatives calculation:

∂E(k)/∂w_j^(2) = δ^OUT z_j,
∂E(k)/∂w_ji^(1) = δ_j^HID x_i.
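A minimal NumPy sketch of one forward/backward pass, assuming a sigmoid hidden layer, a linear output, and the squared error E(k) = ½(t(k+1) − ỹ(k+1))²; names, sizes, and the learning rate are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, t, W1, W2, lr=0.1):
    """One forward + backward pass with the local gradients above:
    delta_OUT = t - y,  delta_HID_j = f'(z_j) * w_j^(2) * delta_OUT,
    followed by the updates Delta_w = lr * delta * input."""
    z = sigmoid(W1 @ x)                              # forward: hidden layer
    y = W2 @ z                                       # forward: output
    delta_out = t - y                                # output local gradient
    delta_hid = z * (1.0 - z) * (W2.T @ delta_out)   # f'(z_j) w_j^(2) delta_OUT
    W2 = W2 + lr * np.outer(delta_out, z)            # from dE/dw^(2)_j ~ delta_OUT z_j
    W1 = W1 + lr * np.outer(delta_hid, x)            # from dE/dw^(1)_ji ~ delta_HID_j x_i
    return W1, W2

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(5, 3))
W2 = rng.normal(scale=0.5, size=(1, 5))
x, t = np.array([0.5, -1.0, 2.0]), np.array([1.0])

def error(W1, W2):
    return 0.5 * float(((t - W2 @ sigmoid(W1 @ x)) ** 2).sum())

e0 = error(W1, W2)
W1, W2 = backprop_step(x, t, W1, W2)
e1 = error(W1, W2)   # one training step lowers the squared error
```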
Bad effect of vanishing (exploding) gradients: two hypotheses
1) increased frequency and severity of bad local minima;
2) pathological curvature, like that of the well-known Rosenbrock function:

f(x, y) = (1 − x)² + 100 (y − x²)²
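The Rosenbrock valley can be probed directly; the step size and iteration count below are arbitrary choices:

```python
def rosenbrock(x, y):
    """f(x, y) = (1 - x)^2 + 100 * (y - x^2)^2, global minimum f(1, 1) = 0."""
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def rosenbrock_grad(x, y):
    dfdx = -2 * (1 - x) - 400 * x * (y - x ** 2)
    dfdy = 200 * (y - x ** 2)
    return dfdx, dfdy

# Plain gradient descent crawls along the curved, flat-bottomed valley.
x, y, lr = -1.0, 1.0, 1e-3
for _ in range(1000):
    gx, gy = rosenbrock_grad(x, y)
    x, y = x - lr * gx, y - lr * gy

final = rosenbrock(x, y)   # improved, yet typically still far from (1, 1)
```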
Bad effect of vanishing (exploding) gradients: a problem
∂E(k)/∂w_ji^(m) = δ_j^(m) z_i^(m−1),
δ_j^(m) = f′(z_j^(m)) Σ_i w_ij^(m+1) δ_i^(m+1),

⇒ ∂E(k)/∂w_ji^(m) → 0 for the early layers (m → 1).
Backpropagation mechanics in vector form
δ(m−1) = δ(m) W_m diag(f′(a(m−1)))

Observations:
• ‖W_m‖ is kept bounded by regularization (weight decay), for robustness;
• ‖diag(f′(a(m−1)))‖:
  – max(f′) = ¼ for sigmoid;
  – max(f′) = 1 for tanh;
  – max(f′) = 1 for ReLU.
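The three max(f′) values can be verified numerically on a grid (the grid itself is an arbitrary choice):

```python
import numpy as np

a = np.linspace(-10.0, 10.0, 100001)   # grid that includes a = 0

s = 1.0 / (1.0 + np.exp(-a))
d_sigmoid = s * (1.0 - s)              # sigmoid': peaks at 1/4 at a = 0
d_tanh = 1.0 - np.tanh(a) ** 2         # tanh': peaks at 1 at a = 0
d_relu = (a >= 0).astype(float)        # ReLU': equals 1 on the right half-line

print(d_sigmoid.max(), d_tanh.max(), d_relu.max())   # 0.25 1.0 1.0
```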
Backpropagation as multiplication of Jacobians
Jacobian of the n-th layer:

J(n) = W_n diag(f′(a(n−1))).

Local gradients as a product of Jacobians:

δ(n−1) = δ(n) J(n),
δ(n−2) = δ(n) J(n) J(n−1),
δ(n−h) = δ(n) J(n) J(n−1) … J(n−h+1).

If ‖J(n)‖ < 1, the gradient vanishes; if ‖J(n)‖ > 1, the gradient probably explodes.
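A quick numerical illustration of this condition: repeatedly multiplying δ(n) by layer Jacobians with small weights shrinks it geometrically, while large weights blow it up. The sigmoid derivatives, Gaussian weights, and all sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_jacobian(width, scale):
    """J(n) = W_n diag(f'(a(n-1))), with Gaussian weights and sigmoid
    derivatives at random pre-activations (illustrative choices)."""
    W = rng.normal(scale=scale, size=(width, width))
    s = 1.0 / (1.0 + np.exp(-rng.normal(size=width)))
    return W @ np.diag(s * (1.0 - s))

def backpropagated_norm(depth, scale, width=64):
    delta = np.ones(width)                            # delta(n) at the top layer
    for _ in range(depth):
        delta = delta @ layer_jacobian(width, scale)  # delta(n-1) = delta(n) J(n)
    return np.linalg.norm(delta)

vanished = backpropagated_norm(20, scale=0.1)    # small ||J||: gradient -> 0
exploded = backpropagated_norm(20, scale=10.0)   # large ||J||: gradient blows up
```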
Nonlinear Activation functions
Andrej Karpathy and Fei-Fei. CS231n: Convolutional Neural Networks for Visual Recognition http://cs231n.github.io/convolutional-networks
Yoshua Bengio, Ian Goodfellow and Aaron Courville. Deep Learning // An MIT Press book in preparation http://www-labs.iro.umontreal.ca/~bengioy/DLbook
ReLU activation function:

f(x) = max(0, x),
f′(x) = 1 if x ≥ 0, 0 if x < 0.
Legendary pretraining
Sparse Autoencoders
Dimensionality reduction
• Use a stacked RBM as a deep autoencoder:
1. Train the RBM with images as both input and output.
2. Limit one layer to a few dimensions.

Information has to pass through the middle layer.

G. E. Hinton and R. R. Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks // Science 313 (2006), p. 504–507.
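The bottleneck idea (though not the RBM pretraining itself) can be sketched with a plain linear autoencoder trained by gradient descent; all dimensions and hyperparameters here are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data that truly lives on a 2-D subspace, embedded in 10 dimensions.
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 10))

# Encoder and decoder around a 2-unit bottleneck: all information must
# squeeze through the 2-dimensional middle layer.
W_enc = rng.normal(scale=0.1, size=(10, 2))
W_dec = rng.normal(scale=0.1, size=(2, 10))

lr = 0.02
for _ in range(2000):
    H = X @ W_enc                             # 2-D codes
    err = H @ W_dec - X                       # reconstruction residual
    gW_dec = H.T @ err / len(X)               # gradients of the mean squared error
    gW_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * gW_dec
    W_enc -= lr * gW_enc

mse = np.mean((X @ W_enc @ W_dec - X) ** 2)   # far below the raw variance of X
```

Because the data is genuinely 2-dimensional, the 2-unit bottleneck loses almost nothing; shrink the data's true dimensionality mismatch and the reconstruction error floor rises.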
How to use the unsupervised pre-training stage / 1

How to use the unsupervised pre-training stage / 2

How to use the unsupervised pre-training stage / 3

How to use the unsupervised pre-training stage / 4
Why a Multilayer Perceptron (a shallow neural network from the 1990s)?
Convolutional Neural Networks
Convolution Layer
Implementation tricks: im2col
K. Chellapilla, S. Puri, P. Simard. High Performance Convolutional Neural Networks for Document Processing // International Workshop on Frontiers in Handwriting Recognition, 2006.
Implementation tricks: im2col for convolution
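A minimal sketch of the trick for a single-channel image with stride 1 and no padding (the correlation form of convolution); names are illustrative:

```python
import numpy as np

def im2col(img, k):
    """Copy every k-by-k patch of a 2-D image into one column, so that
    convolving with any k*k filter becomes a single matrix product (GEMM)."""
    H, W = img.shape
    out_h, out_w = H - k + 1, W - k + 1
    cols = np.empty((k * k, out_h * out_w))
    col = 0
    for i in range(out_h):
        for j in range(out_w):
            cols[:, col] = img[i:i + k, j:j + k].ravel()
            col += 1
    return cols

img = np.arange(16.0).reshape(4, 4)
filt = np.ones((3, 3))
out = (filt.ravel() @ im2col(img, 3)).reshape(2, 2)   # -> [[45. 54.] [81. 90.]]
```

Practical implementations extend this with channels, stride, and padding, trading extra memory for the high arithmetic efficiency of one large GEMM.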
Recurrent Neural Network (SRN)
• Pascanu R., Mikolov T., Bengio Y. On the Difficulty of Training Recurrent Neural Networks // Proc. of ICML 2013.
• Q.V. Le, N. Jaitly, G.E. Hinton. A Simple Way to Initialize Recurrent Networks of Rectified Linear Units (2015).
• M. Arjovsky, A. Shah, Y. Bengio. Unitary Evolution Recurrent Neural Networks (2016).
• Henaff M., Szlam A., LeCun Y. Orthogonal RNNs and Long-Memory Tasks // arXiv preprint arXiv:1602.06662, 2016.
Backpropagation Through Time (BPTT) for SRN
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
A neural network unrolled back through time is a deep neural network with shared weights.
Effect of different initializations for SRN
SRNs were initialized from a zero-mean Gaussian with predefined variance.
Long Short-Term Memory: adding linear connections to state propagation
Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9.8 (1997): 1735-1780.
Long Short-Term Memory (LSTM)
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
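One step of a standard LSTM cell in NumPy. The cell state c is updated additively through the gates, which is exactly the linear connection in state propagation; the stacked parameter layout and sizes are illustrative choices:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. The four gate pre-activations are stacked in
    one matrix product; the cell state is updated additively:
    c = f * c_prev + i * g (the linear path for gradients)."""
    n = h_prev.size
    z = W @ x + U @ h_prev + b
    f = sigmoid(z[0:n])          # forget gate
    i = sigmoid(z[n:2 * n])      # input gate
    o = sigmoid(z[2 * n:3 * n])  # output gate
    g = np.tanh(z[3 * n:4 * n])  # candidate cell values
    c = f * c_prev + i * g       # gated-additive state update
    h = o * np.tanh(c)           # new hidden state
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.normal(scale=0.1, size=(4 * n_hid, n_in))
U = rng.normal(scale=0.1, size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, U, b)
```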
Deep Residual Networks: adding linear connections to the conv nets
He, Kaiming, et al. "Deep Residual Learning for Image Recognition." arXiv preprint arXiv:1512.03385 (2015).
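The residual idea in miniature: y = x + F(x), here with F as two fully-connected layers and a ReLU (an illustrative stand-in; the paper uses convolutional blocks). With F ≡ 0 the block is exactly the identity, so gradients pass through the shortcut undiminished however deep the stack:

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = x + F(x): an identity shortcut around a small two-layer F."""
    Fx = W2 @ np.maximum(0.0, W1 @ x)   # F(x): linear, ReLU, linear
    return x + Fx

x = np.array([1.0, -2.0, 3.0])
# Zero residual weights make the block an exact identity map.
y = residual_block(x, np.zeros((3, 3)), np.zeros((3, 3)))
```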
Deep, big, simple neural nets: no pre-training, simple gradient descent
Ciresan, Dan Claudiu, et al. "Deep, big, simple neural nets for handwritten digit recognition." Neural computation 22.12 (2010): 3207-3220.
Smart initialization
Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." International conference on artificial intelligence and statistics. 2010.
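A sketch of the paper's normalized (uniform) initialization; the layer sizes are arbitrary:

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Normalized initialization: W ~ U[-a, a], a = sqrt(6 / (fan_in + fan_out)),
    chosen so activation and gradient variances stay roughly constant
    from layer to layer."""
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_out, fan_in))

rng = np.random.default_rng(0)
W = glorot_uniform(256, 128, rng)   # variance of W is 2 / (fan_in + fan_out)
```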
Batch Normalization: brute force whitening
Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." arXiv preprint arXiv:1502.03167 (2015).
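The forward transform in training mode can be sketched as follows: whiten each feature over the mini-batch, then rescale and shift with the learnable γ and β (sizes are illustrative):

```python
import numpy as np

def batchnorm_forward(X, gamma, beta, eps=1e-5):
    """Batch normalization in training mode: normalize each feature over
    the mini-batch, then rescale/shift with learnable gamma and beta."""
    mu = X.mean(axis=0)                       # per-feature batch mean
    var = X.var(axis=0)                       # per-feature batch variance
    X_hat = (X - mu) / np.sqrt(var + eps)     # whitened activations
    return gamma * X_hat + beta

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(64, 10))   # a shifted, scaled batch
Y = batchnorm_forward(X, gamma=np.ones(10), beta=np.zeros(10))
# Each feature of Y now has (almost exactly) zero mean and unit variance.
```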
Orthogonal matrices

An orthogonal matrix is a square matrix with real entries whose columns and rows are orthogonal unit vectors, i.e.

AᵀA = A Aᵀ = I,

where I is the identity matrix. An orthogonal matrix is norm-preserving:

‖A B‖ = ‖B‖,

where A is an orthogonal matrix and B is any matrix.
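Both properties are easy to check numerically with a random orthogonal matrix obtained from a QR decomposition (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# A random orthogonal matrix: QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))

ortho = np.allclose(Q.T @ Q, np.eye(8))   # Q^T Q = Q Q^T = I

# Norm preservation: ||Q B|| = ||B|| (Frobenius norm here).
B = rng.normal(size=(8, 5))
norms = (np.linalg.norm(Q @ B), np.linalg.norm(B))
```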
Examples of orthogonal matrices
Backpropagation mechanics, revisited:

δ(m−1) = δ(m) W_m diag(f′(a(m−1)))

Linear case: orthogonality of W is enough!

δ(m−1) = δ(m) W_m

Saxe, Andrew M., James L. McClelland, and Surya Ganguli. "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks." arXiv preprint arXiv:1312.6120 (2013).
Smart orthogonal initialization: orthogonal + whitening
Mishkin, Dmytro, and Jiri Matas. "All you need is a good init." arXiv preprint arXiv:1511.06422 (2015).
Orthogonal Permutation Linear Units (OPLU) / sortout
Rennie, Steven J., Vaibhava Goel, and Samuel Thomas. "Deep order statistic networks." Spoken Language Technology Workshop (SLT), 2014 IEEE. IEEE, 2014.
Chernodub, Artem, and Dimitri Nowicki. "Norm-preserving Orthogonal Permutation Linear Unit Activation Functions (OPLU)." arXiv preprint arXiv:1604.02313 (2016).
δ(m−1) = δ(m) W_m diag(f′(a(m−1)))
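A sketch of the OPLU idea: group the units into pairs and output each pair sorted as (max, min). The output is a permutation of the input, so the activation preserves the norm exactly; the pairing layout is an illustrative choice:

```python
import numpy as np

def oplu(a):
    """Orthogonal Permutation Linear Unit: split the units into pairs and
    output each pair as (max, min). Values are only permuted, never scaled,
    so the activation is exactly norm-preserving."""
    pairs = a.reshape(-1, 2)
    return np.stack([pairs.max(axis=1), pairs.min(axis=1)], axis=1).ravel()

x = np.array([3.0, -1.0, 0.5, 2.0])
y = oplu(x)                                          # -> [3., -1., 2., 0.5]
same_norm = np.linalg.norm(y) == np.linalg.norm(x)   # exact, not approximate
```

Since the activation only permutes values, its Jacobian is an orthogonal permutation matrix, so the diag(f′) factor in the backpropagation product above never shrinks the gradient.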