Understanding Inception & Xception with PR12
Jaejun Yoo
Ph.D. Candidate @KAIST
PR12
10th Sep, 2017
(GoogLeNet)
Today’s contents
GoogLeNet : Inception models
• Going Deeper with Convolutions
• Rethinking the Inception Architecture for Computer Vision
• Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
Today’s contents
Xception model
• Xception: Deep Learning with Depthwise Separable Convolutions
Motivation
Q) What is the best way to improve the performance
of a deep neural network?
A) Bigger size!
(Increase the depth and width of the model)
Yeah, DO IT! WE ARE GOOGLE
Problem
1. Bigger model typically means a larger number of parameters
→ overfitting
2. Increased use of computational resources
→ e.g. quadratic increase of computation when two conv layers are chained
3 × 3 × C → 3 × 3 × C: computation grows as C²
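To make the quadratic growth concrete, here is a minimal sketch that counts multiply-adds for a 3 × 3 convolution whose input and output both have C channels. The function name `conv_macs` and the 28 × 28 feature-map size are assumptions chosen for illustration, not values from the slides.

```python
def conv_macs(height, width, c_in, c_out, kernel=3):
    """Multiply-adds of a stride-1, same-padded kernel x kernel convolution."""
    return height * width * kernel * kernel * c_in * c_out


# Hypothetical 28 x 28 feature map; widen input and output channels together.
base = conv_macs(28, 28, 64, 64)
doubled = conv_macs(28, 28, 128, 128)
ratio = doubled / base  # doubling C multiplies the cost by 2 ** 2
```

Doubling the channel count on both sides of the layer quadruples the cost, which is the C² dependence stated above.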
Solution: Sparsely connected architecture
Sparsely Connected Architecture
1. Mimicking biological system
2. Theoretical underpinnings
→ Arora et al.
Provable Bounds for Learning Some Deep Representations, ICML 2014
“Given samples from a sparsely connected neural network whose each layer is a denoising autoencoder, can the net (and hence its reverse) be learnt in polynomial time with low sample complexity?”
Why Arora et al. is important
To provably solve optimization problems for general neural networks with two or more layers, the algorithms that would be necessary hit some of the biggest open problems in computer science. So, we don't think there's much hope for machine learning researchers to try to find algorithms that are provably optimal for deep networks. This is because the problem is NP-hard, meaning that provably solving it in polynomial time would also solve thousands of open problems that have been open for decades. Indeed, in 1988 J. Stephen Judd shows the following problem to be NP-hard:
Given a general neural network and a set of training examples, does there exist a set of edge weights for the network so that the network produces the correct output for all the training examples?
Judd also shows that the problem remains NP-hard even if it only requires a network to produce the correct output for just two-thirds of the training examples, which implies that even approximately training a neural network is intrinsically difficult in the worst case. In 1993, Blum and Rivest make the news worse: even a simple network with just two layers and three nodes is NP-hard to train!
https://www.oreilly.com/ideas/the-hard-thing-about-deep-learning
Sparsely Connected Architecture
1. Mimicking biological system
2. Theoretical underpinnings
→ Arora et al.
Provable Bounds for Learning Some Deep Representations, ICML 2014
“Given samples from a sparsely connected neural network whose each layer is a denoising autoencoder, can the net (and hence its reverse) be learnt in polynomial time with low sample complexity?”
YES!
Layer-wise learning
Correlation statistics
Videos:
• http://techtalks.tv/talks/provable-bounds-for-learning-some-deep-representations/61118/
• https://www.youtube.com/watch?v=c43pqQE176g
“If the probability distribution of the data-set is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer by layer by analyzing the correlation statistics of the activations of the last layer and clustering neurons with highly correlated outputs.”
Hebbian principle!
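As a toy illustration of "clustering neurons with highly correlated outputs" (units that fire together get wired together), the sketch below groups units whose activations correlate above a threshold. It is not Arora et al.'s algorithm: the helper names `pearson` and `cluster_units`, the 0.9 threshold, the union-find grouping and the toy activations are all assumptions made for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def cluster_units(activations, threshold=0.9):
    """Group units whose outputs correlate above `threshold`.

    activations[s][u] is the output of unit u on sample s.
    Units are merged with a small union-find (single-linkage grouping).
    """
    n_units = len(activations[0])
    columns = [[row[u] for row in activations] for u in range(n_units)]
    parent = list(range(n_units))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for i in range(n_units):
        for j in range(i + 1, n_units):
            if pearson(columns[i], columns[j]) > threshold:
                parent[find(j)] = find(i)

    groups = {}
    for u in range(n_units):
        groups.setdefault(find(u), []).append(u)
    return sorted(groups.values())


# Toy data: units 0 and 1 fire together, as do units 2 and 3.
acts = [
    [1.0, 0.9, 0.0, 0.1],
    [0.0, 0.1, 1.0, 0.9],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]
clusters = cluster_units(acts)
```

In the layer-by-layer picture, each resulting group would become one unit of the next layer, connected only to the members of its group.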
Problem
1. Sparse matrix computation is very inefficient on current hardware
→ dense matrix calculation is extremely efficient
2. Even ConvNets changed back from sparse connections to full
connections to better optimize parallel computing
Is there any intermediate step?
Inception architecture
Main idea
How can the optimal local sparse structure of a convolutional network be found, and how can it be approximated and covered by readily available dense components?
: All we need is to find the optimal local construction and to repeat it spatially
: Arora et al.: layer by layer construction with correlation statistics analysis
Schematic view (naive version)
[Figure: number of filters for each kernel size: 1x1, 3x3, 5x5]
Naive idea (does not work!)
[Figure: Previous layer → 1x1 convolutions, 3x3 convolutions, 5x5 convolutions and 3x3 max pooling in parallel → Filter concatenation]
Inception module
[Figure: the naive module with 1x1 convolutions added before the 3x3 and 5x5 convolutions and after the 3x3 max pooling]
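A rough way to see why the naive version does not scale is to count output channels and multiply-adds per spatial position for both variants. The sketch below does that. The function names and the channel counts (a 192-channel input, 64/128/32 filters, 96 and 16 reduction channels, 32 pooling-projection channels) are assumptions picked for illustration and are not taken from the slides.

```python
def conv_cost(c_in, c_out, kernel):
    """Multiply-adds per output position of a kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out


def naive_module(c_in, n1, n3, n5):
    """Parallel 1x1, 3x3, 5x5 convs plus 3x3 max pooling, then concatenation."""
    cost = conv_cost(c_in, n1, 1) + conv_cost(c_in, n3, 3) + conv_cost(c_in, n5, 5)
    out_channels = n1 + n3 + n5 + c_in  # pooling passes every input channel through
    return out_channels, cost


def inception_module(c_in, n1, r3, n3, r5, n5, pool_proj):
    """Same branches, with 1x1 convs before the 3x3/5x5 convs and after pooling."""
    cost = (
        conv_cost(c_in, n1, 1)
        + conv_cost(c_in, r3, 1) + conv_cost(r3, n3, 3)
        + conv_cost(c_in, r5, 1) + conv_cost(r5, n5, 5)
        + conv_cost(c_in, pool_proj, 1)
    )
    out_channels = n1 + n3 + n5 + pool_proj
    return out_channels, cost


# Illustrative (assumed) channel counts for a 192-channel input.
naive_channels, naive_cost = naive_module(192, 64, 128, 32)
incep_channels, incep_cost = inception_module(192, 64, 96, 128, 16, 32, 32)
```

With these numbers the naive module outputs 416 channels, more than it received, so stacking such modules keeps widening the network. The version with 1x1 reductions outputs 256 channels at less than half the multiply-adds.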
Network in Network (NiN)
[Figure: Conv layer vs. MLPConv layer]
Conv = GLM (generalized linear model)
MLPConv = 1 × 1 Conv
GLM is replaced with a "micro network" structure
1 × 1 Conv
1. Increases the representational power of the neural network
→ in the MLPConv sense
2. Dimension reduction
→ usually the filters are highly correlated
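A 1 × 1 convolution is a linear map across channels applied at every pixel, so it can shrink the channel dimension without touching the spatial layout. The following is a minimal pure-Python sketch of that idea. The function name `conv1x1`, the nested-list layout and the toy weights are assumptions for this example, and bias and non-linearity are left out.

```python
def conv1x1(feature_map, weights):
    """Apply a 1x1 convolution.

    feature_map[y][x] is a list of c_in channel values;
    weights[o] is a list of c_in coefficients for output channel o.
    """
    return [
        [
            [sum(w * v for w, v in zip(row_w, pixel)) for row_w in weights]
            for pixel in row
        ]
        for row in feature_map
    ]


# 2 x 2 feature map with 4 channels, reduced to 2 channels.
fmap = [
    [[1, 2, 3, 4], [0, 1, 0, 1]],
    [[2, 2, 2, 2], [4, 3, 2, 1]],
]
w = [
    [1, 0, 0, 0],  # output channel 0 copies input channel 0
    [0, 1, 1, 1],  # output channel 1 sums input channels 1..3
]
reduced = conv1x1(fmap, w)
```

The spatial size stays 2 × 2 while the channel count drops from 4 to 2, which is the dimension-reduction role the 1 × 1 convolutions play inside the Inception module.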
5 × 5 Conv → 3 × 3 Conv + 3 × 3 Conv
5x5 : 3x3 = 25 : 9 (25/9 ≈ 2.78 times). A single 5x5 conv naturally costs about 2.78 times as much as a single 3x3 conv.
Now compare the cost of mapping between two layers of the same size with a single 5x5 conv versus two stacked 3x3 convs.
5x5xN : (3x3xN) + (3x3xN) = 25 : 9+9 = 25 : 18 (about a 28% reduction)
https://norman3.github.io/papers/docs/google_inception.html
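The arithmetic above can be checked with a short sketch that also confirms two stacked stride-1 3x3 convs cover the same 5x5 receptive field. The function names are hypothetical, and the cost is counted as weights per (input channel, output channel) pair, matching the 25 : 18 comparison.

```python
def stacked_receptive_field(kernels):
    """Receptive field of stride-1 convolutions applied one after another."""
    field = 1
    for k in kernels:
        field += k - 1
    return field


def weights_per_channel_pair(kernels):
    """Weights needed per (input channel, output channel) pair."""
    return sum(k * k for k in kernels)


single = weights_per_channel_pair([5])      # 25
stacked = weights_per_channel_pair([3, 3])  # 9 + 9 = 18
saving = 1 - stacked / single               # fraction of weights saved
```

Both configurations see a 5 × 5 window of the input, but the stacked version needs 18 weights instead of 25, a saving of 28%.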
Xception
Observation 1
• The Inception module tries to explicitly factor the two tasks done by a
single convolution kernel: mapping cross-channel correlations
and mapping spatial correlations
Inception hypothesis
• These two kinds of correlation are sufficiently decoupled that it is
preferable not to map them jointly.
Would it be reasonable to make a much stronger hypothesis than the Inception hypothesis?
Xception
Depthwise separable convolution
Commonly called "separable convolution" in deep learning
frameworks such as TF and Keras; a spatial convolution
performed independently over each channel of an input,
followed by a pointwise convolution, i.e. a 1 × 1 conv
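The two steps of a depthwise separable convolution can be sketched in plain Python as follows. This is an illustrative toy, not how TF or Keras implement it: the function names, the nested-list layout, the valid padding and the example filters are assumptions, and "convolution" here means cross-correlation, the convention deep learning frameworks use.

```python
def depthwise_conv(x, kernels):
    """Spatial convolution applied to each channel independently (valid padding).

    x[c][i][j] is the input; kernels[c][a][b] is the k x k filter for channel c.
    """
    out = []
    for channel, kernel in zip(x, kernels):
        k = len(kernel)
        h, w = len(channel), len(channel[0])
        out.append([
            [
                sum(
                    kernel[a][b] * channel[i + a][j + b]
                    for a in range(k)
                    for b in range(k)
                )
                for j in range(w - k + 1)
            ]
            for i in range(h - k + 1)
        ])
    return out


def pointwise_conv(x, weights):
    """1 x 1 convolution: weights[o][c] mixes input channel c into output o."""
    h, w = len(x[0]), len(x[0][0])
    return [
        [
            [sum(wc * x[c][i][j] for c, wc in enumerate(row)) for j in range(w)]
            for i in range(h)
        ]
        for row in weights
    ]


def separable_conv(x, depthwise_kernels, pointwise_weights):
    """Depthwise step followed by pointwise step."""
    return pointwise_conv(depthwise_conv(x, depthwise_kernels), pointwise_weights)


# Two 3 x 3 input channels, 2 x 2 depthwise filters, one output channel.
x = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
]
dw = [
    [[1, 0], [0, 1]],  # channel 0: sum of the window's main diagonal
    [[1, 1], [1, 1]],  # channel 1: sum of the whole 2 x 2 window
]
pw = [[1, 10]]         # output = channel 0 + 10 * channel 1
y = separable_conv(x, dw, pw)
```

The depthwise step never mixes channels and the pointwise step never looks at neighbouring pixels, so spatial and cross-channel correlations are handled by separate operations.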
Xception
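Xception vs. depthwise separable convolution
Two minor differences between an "extreme" Inception module and a depthwise separable convolution:
1. The order of the operations: a depthwise separable convolution performs the
channel-wise spatial convolution first and the 1 × 1 convolution second, whereas
Inception performs the 1 × 1 convolution first
2. The presence or absence of a non-linearity after the first
operation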
Xception
[Figure: regular convolution, Inception module, depthwise separable convolution]
Inception modules lie in between!
Observation 2
Xception Hypothesis
: The mapping of cross-channel correlations and spatial correlations can be entirely decoupled
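One hedged way to picture "Inception modules lie in between" is to treat the spatial stage as a convolution split into independent channel segments and count its weights. Modelling an Inception module as a grouped convolution with a handful of segments is a simplification made for this sketch. The function name, the channel count of 256 and the choice of 4 segments are assumptions, and the 1 × 1 stage is left out.

```python
def grouped_conv_weights(channels, kernel, groups):
    """Weights of a kernel x kernel convolution split into `groups` channel segments.

    Each segment maps channels/groups inputs to channels/groups outputs.
    """
    if channels % groups:
        raise ValueError("channels must be divisible by groups")
    per_group = channels // groups
    return groups * kernel * kernel * per_group * per_group


C = 256  # hypothetical channel count
regular = grouped_conv_weights(C, 3, groups=1)     # one segment: regular conv
in_between = grouped_conv_weights(C, 3, groups=4)  # a few segments: Inception-like
depthwise = grouped_conv_weights(C, 3, groups=C)   # one segment per channel
```

One segment gives a regular convolution, one segment per channel gives the depthwise step of a depthwise separable convolution, and a few segments land in between.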
Reference
GoogLeNet : Inception models
Blog
• https://norman3.github.io/papers/docs/google_inception.html (Kor)
• https://blog.acolyer.org/2017/03/21/convolution-neural-nets-part-2/ (Eng)
Paper
• Going Deeper with Convolutions
• Rethinking the Inception Architecture for Computer Vision
• Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
Xception models
Paper
• Xception: Deep Learning with Depthwise Separable Convolutions
Reducing the grid size efficiently
https://norman3.github.io/papers/docs/google_inception.html
• What if we pool first?
• Then a representational bottleneck occurs.
• This can be thought of as information loss caused by pooling.
• The example operation is a conv that maps (d, d, k) to (d/2, d/2, 2k) (here d = 35, k = 320).
• Either way, the actual operation counts are:
• pooling + stride-1 conv with 2k filters => 2(d/2)²k² operations
• stride-1 conv with 2k filters + pooling => 2d²k² operations
• That is, the left (pooling first) needs less computation, but a representational bottleneck occurs.
• The right (conv first) loses less information, but by the counts above needs 4 times the computation.
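Plugging d = 35 and k = 320 into the two operation counts gives a quick check of the trade-off. The helper `conv_ops` is a hypothetical name, and it follows the slides' counting convention (grid² × input channels × output channels), ignoring the kernel size.

```python
def conv_ops(grid, c_in, c_out):
    """Operation count in the slides' convention: grid^2 * c_in * c_out."""
    return grid * grid * c_in * c_out


d, k = 35, 320

# Pooling first: the conv with 2k filters runs on the (d/2) x (d/2) grid.
pool_then_conv = conv_ops(d / 2, k, 2 * k)  # 2 (d/2)^2 k^2
# Conv first: the conv with 2k filters runs on the full d x d grid.
conv_then_pool = conv_ops(d, k, 2 * k)      # 2 d^2 k^2

ratio = conv_then_pool / pool_then_conv
```

Under this convention, running the convolution before pooling costs four times as much as pooling first, since the convolution sees four times as many grid positions.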