[PR12] Inception and Xception - Jaejun Yoo

Inception & Xception

Understood together with PR12

Jaejun Yoo

Ph.D. Candidate @KAIST

PR12

10th Sep, 2017

(GoogLeNet)

Today’s contents

GoogLeNet : Inception models

• Going Deeper with Convolutions
• Rethinking the Inception Architecture for Computer Vision
• Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning

https://arxiv.org/abs/1409.4842



Today’s contents

Xception model

• Xception: Deep Learning with Depthwise Separable Convolutions


Motivation

Q) What is the best way to improve the performance of a deep neural network?

A) Bigger size!

(Increase the depth and width of the model)

Yeah, DO IT! WE ARE GOOGLE

Problem

1. Bigger model typically means a larger number of parameters

→ overfitting

2. Increased use of computational resources

→ e.g. quadratic increase of computation

$3 \times 3 \times C \to 3 \times 3 \times C$: $C^2$ computations
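As a rough illustration of the quadratic growth, the sketch below counts the multiplications of a 3 × 3 convolution with C input and C output channels. The helper name `conv_mults` and the 28 × 28 feature-map size are assumptions made for this example only.

```python
def conv_mults(kernel, c_in, c_out, height, width):
    """Multiplications in a stride-1, same-padded convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out * height * width

# Hypothetical layer: 3x3 kernels on a 28x28 feature map.
base = conv_mults(3, 64, 64, 28, 28)       # C = 64
doubled = conv_mults(3, 128, 128, 28, 28)  # C = 128

# Doubling the width C quadruples the cost, i.e. the cost grows with C^2.
ratio = doubled // base
print(base, doubled, ratio)
```

Only the channel count changes between the two calls, so the ratio isolates the effect of widening the layer.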

Solution: Sparsely connected architecture

Sparsely Connected Architecture

1. Mimicking biological system

2. Theoretical underpinnings

→ Arora et al., "Provable Bounds for Learning Some Deep Representations", ICML 2014

“Given samples from a sparsely connected neural network whose each layer is a denoising autoencoder, can the net (and hence its reverse) be learnt in polynomial time with low sample complexity?”

Why Arora et al. is important

To provably solve optimization problems for general neural networks with two or more layers, the algorithms that would be necessary hit some of the biggest open problems in computer science. So, we don't think there's much hope for machine learning researchers to try to find algorithms that are provably optimal for deep networks. This is because the problem is NP-hard, meaning that provably solving it in polynomial time would also solve thousands of open problems that have been open for decades. Indeed, in 1988 J. Stephen Judd shows the following problem to be NP-hard:

Given a general neural network and a set of training examples, does there exist a set of edge weights for the network so that the network produces the correct output for all the training examples?

Judd also shows that the problem remains NP-hard even if it only requires a network to produce the correct output for just two-thirds of the training examples, which implies that even approximately training a neural network is intrinsically difficult in the worst case. In 1993, Blum and Rivest make the news worse: even a simple network with just two layers and three nodes is NP-hard to train!

https://www.oreilly.com/ideas/the-hard-thing-about-deep-learning

Arora et al.'s answer to this question: YES!

Layer-wise learning

Correlation statistics

Videos:

• http://techtalks.tv/talks/provable-bounds-for-learning-some-deep-representations/61118/
• https://www.youtube.com/watch?v=c43pqQE176g

“If the probability distribution of the data-set is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer by layer by analyzing the correlation statistics of the activations of the last layer and clustering neurons with highly correlated outputs.”
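To make the idea of "clustering neurons with highly correlated outputs" concrete, here is a toy sketch. It is not the algorithm of Arora et al.; it only shows a greedy grouping by Pearson correlation. The activation values, the 0.9 threshold, and the function names are all assumptions invented for this illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def cluster_by_correlation(activations, threshold=0.9):
    """Greedily group neurons whose outputs correlate above `threshold`."""
    clusters = []
    for name, values in activations.items():
        for cluster in clusters:
            representative = activations[cluster[0]]
            if pearson(values, representative) >= threshold:
                cluster.append(name)
                break
        else:
            # No existing cluster is similar enough: start a new one.
            clusters.append([name])
    return clusters

# Hypothetical activations of four neurons over five samples.
activations = {
    "n0": [0.1, 0.9, 0.2, 0.8, 0.5],
    "n1": [0.2, 1.0, 0.3, 0.9, 0.6],  # n0 shifted by 0.1
    "n2": [0.9, 0.1, 0.8, 0.2, 0.5],  # mirror image of n0
    "n3": [1.8, 0.2, 1.6, 0.4, 1.0],  # n2 scaled by 2
}
clusters = cluster_by_correlation(activations)
print(clusters)
```

In this toy picture, each resulting group would become one unit of the next layer.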

Hebbian principle!

Problem

1. Sparse matrix computation is very inefficient

→ dense matrix calculation is extremely efficient

2. Even ConvNets changed back from sparse connections to full connections to better optimize parallel computing

Is there any intermediate step?

Inception architecture

Main idea

How can an optimal local sparse structure in a convolutional network be approximated and covered by readily available dense components?

: All we need is to find the optimal local construction and to repeat it spatially

: Arora et al.: layer by layer construction with correlation statistics analysis

In images, correlations tend to be local

Less spread out correlations: cover very local clusters by 1x1 convolutions

More spread out correlations: cover more spread out clusters by 3x3 (and 5x5) convolutions

A heterogeneous set of convolutions: 1x1, 3x3, 5x5

[Figure: number of filters for each kernel size (1x1, 3x3, 5x5)]
Schematic view (naive version)

Previous layer → {1x1 convolutions | 3x3 convolutions | 5x5 convolutions} → Filter concatenation

Naive idea (does not work!)

Previous layer → {1x1 convolutions | 3x3 convolutions | 5x5 convolutions | 3x3 max pooling} → Filter concatenation

Inception module

Previous layer → {1x1 convolutions | 1x1 convolutions → 3x3 convolutions | 1x1 convolutions → 5x5 convolutions | 3x3 max pooling → 1x1 convolutions} → Filter concatenation

Network in Network (NiN)

Conv vs. MLPConv

Conv = GLM (generalized linear model)

MLPConv = 1 × 1 Conv

GLM is replaced with a "micro network" structure
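A 1 × 1 convolution can be read as one small fully connected layer applied to every pixel's channel vector, which is the sense in which MLPConv places a "micro network" at each location. The sketch below is a minimal pure-Python illustration; the tensor layout (height × width × channels as nested lists), the weights, and the function names are assumptions for this example.

```python
def conv1x1(feature_map, weights):
    """Apply a 1x1 convolution.

    feature_map: H x W x C_in nested lists.
    weights: C_out x C_in nested lists, shared by every pixel.
    """
    return [
        [
            [sum(w * x for w, x in zip(row_w, pixel)) for row_w in weights]
            for pixel in row
        ]
        for row in feature_map
    ]

def relu(feature_map):
    """Element-wise ReLU on an H x W x C nested list."""
    return [[[max(0.0, v) for v in pixel] for pixel in row] for row in feature_map]

# Hypothetical 1 x 2 feature map with 3 channels, mapped to 2 channels.
x = [[[1.0, 2.0, 3.0], [0.0, -1.0, 1.0]]]
w = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]
y = relu(conv1x1(x, w))
print(y)
```

Stacking several such layers with non-linearities in between gives a per-pixel multilayer perceptron.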

1 × 1 Conv

1. Increases the representational power of the neural network

→ in the MLPConv sense

2. Dimension reduction

→ usually the filters are highly correlated
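The effect of the 1 × 1 dimension reduction can be estimated by counting weights in a naive module and in a module with 1 × 1 reductions. This is a sketch under stated assumptions: the channel widths below are chosen to resemble an early GoogLeNet block and should be treated as illustrative, and `conv_weights` is a hypothetical helper.

```python
def conv_weights(kernel, c_in, c_out):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out

c_in = 192                      # channels of the previous layer (assumed)
n1, n3, n5 = 64, 128, 32        # output channels of the 1x1, 3x3, 5x5 branches
r3, r5, pool_proj = 96, 16, 32  # widths of the 1x1 reductions / pool projection

# Naive module: every branch reads all c_in channels directly.
naive = (conv_weights(1, c_in, n1)
         + conv_weights(3, c_in, n3)
         + conv_weights(5, c_in, n5))
naive_out = n1 + n3 + n5 + c_in  # max pooling passes all input channels through

# Module with 1x1 reductions before 3x3 / 5x5 and a 1x1 after pooling.
reduced = (conv_weights(1, c_in, n1)
           + conv_weights(1, c_in, r3) + conv_weights(3, r3, n3)
           + conv_weights(1, c_in, r5) + conv_weights(5, r5, n5)
           + conv_weights(1, c_in, pool_proj))
reduced_out = n1 + n3 + n5 + pool_proj

print(naive, naive_out)
print(reduced, reduced_out)
```

With these numbers the reduced module needs well under half the weights. Its output also stays narrower, because the pooling branch no longer passes all input channels through.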

5 × 5 Conv → 3 × 3 Conv + 3 × 3 Conv

5x5 : 3x3 = 25 : 9 (25/9 ≈ 2.78 times). A single 5x5 convolution naturally costs about 2.78 times as much as a single 3x3 convolution.

Now compare the cost of connecting two equally sized layers with a single 5x5 convolution versus two stacked 3x3 convolutions.

5x5xN : (3x3xN) + (3x3xN) = 25 : 9+9 = 25 : 18 (about a 28% reduction)

https://norman3.github.io/papers/docs/google_inception.html
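The arithmetic above can be checked with a short sketch, which also confirms that two stacked 3x3 convolutions cover the same 5x5 receptive field. The function names are hypothetical, and stride 1 is assumed throughout.

```python
def receptive_field(kernels):
    """Receptive field of a stack of stride-1 convolutions."""
    field = 1
    for k in kernels:
        field += k - 1
    return field

def weights_per_channel_pair(kernels):
    """Weights per (input channel, output channel) pair, summed over the stack."""
    return sum(k * k for k in kernels)

single = weights_per_channel_pair([5])      # one 5x5 convolution
stacked = weights_per_channel_pair([3, 3])  # two 3x3 convolutions
saving = 1 - stacked / single

print(receptive_field([5]), receptive_field([3, 3]))
print(single, stacked, round(saving, 2))
```

The stacked version also gets an extra non-linearity between the two layers if an activation is applied after each convolution.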





Xception

Observation 1

• The Inception module tries to explicitly factor two tasks done by a single convolution kernel: mapping cross-channel correlations and mapping spatial correlations

Inception hypothesis

• With the Inception module, these two correlations are sufficiently decoupled.

Would it be reasonable to make a much stronger hypothesis than the Inception hypothesis?

Xception

Depthwise separable convolution

Commonly called “separable convolution” in deep learning frameworks such as TF and Keras; a spatial convolution performed independently over each channel of an input, followed by a pointwise convolution, i.e. a 1 × 1 convolution
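To give a feel for why this factorization is cheap, the sketch below compares weight counts of a regular convolution and a depthwise separable one. The layer sizes are hypothetical, biases are ignored, and a depth multiplier of 1 is assumed.

```python
def regular_conv_weights(kernel, c_in, c_out):
    """Weights of a regular convolution: every filter spans all input channels."""
    return kernel * kernel * c_in * c_out

def separable_conv_weights(kernel, c_in, c_out):
    """Weights of a depthwise separable convolution (depth multiplier 1)."""
    depthwise = kernel * kernel * c_in  # one spatial filter per input channel
    pointwise = c_in * c_out            # 1x1 convolution mixing the channels
    return depthwise + pointwise

# Hypothetical layer: 3x3 kernels, 128 input channels, 256 output channels.
k, c_in, c_out = 3, 128, 256
regular = regular_conv_weights(k, c_in, c_out)
separable = separable_conv_weights(k, c_in, c_out)
print(regular, separable, round(regular / separable, 2))
```

In Keras this operation is available as the `SeparableConv2D` layer.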

Xception

Xception vs. depthwise separable convolution

1. The order of the operations

2. The presence or absence of a non-linearity after the first operation

Xception

Observation 2

Regular convolution ↔ Inception ↔ Depthwise separable convolution

Inception modules lie in between!

Xception Hypothesis

: Make the mapping that entirely decouples the cross-channel correlations and spatial correlations

Xception

[Figure: results on ImageNet and JFT]

Reference

GoogLeNet : Inception models

Blog

• https://norman3.github.io/papers/docs/google_inception.html (Kor)
• https://blog.acolyer.org/2017/03/21/convolution-neural-nets-part-2/ (Eng)

Paper

• Going Deeper with Convolutions
• Rethinking the Inception Architecture for Computer Vision
• Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning

Xception models

Paper

• Xception: Deep Learning with Depthwise Separable Convolutions


https://blog.acolyer.org/2017/03/21/convolution-neural-nets-part-2/





Reducing the grid size efficiently



• What if we pool first?

• Then a representational bottleneck occurs.

• This can be thought of as information loss caused by pooling.

• The example operation is a convolution that transforms a $(d, d, k)$ grid into a $(d/2, d/2, 2k)$ grid (so here $d = 35$, $k = 320$).

• Either way, the actual operation counts are:

• pooling + stride-1 conv with 2k filters ⇒ $2(d/2)^2 k^2$ operations

• stride-1 conv with 2k filters + pooling ⇒ $2d^2k^2$ operations

• That is, the left option needs less computation but creates a representational bottleneck.

• The right option loses less information but needs 4 times the computation.
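The two operation counts can be compared directly by plugging in the example values d = 35 and k = 320. The function names are hypothetical, and the formulas are the ones listed above, with d/2 kept fractional as in the formula.

```python
def pool_then_conv_ops(d, k):
    """Pooling first, then a stride-1 convolution with 2k filters: 2 (d/2)^2 k^2."""
    return 2 * (d / 2) ** 2 * k ** 2

def conv_then_pool_ops(d, k):
    """Stride-1 convolution with 2k filters first, then pooling: 2 d^2 k^2."""
    return 2 * d ** 2 * k ** 2

d, k = 35, 320
cheap = pool_then_conv_ops(d, k)      # less computation, representational bottleneck
expensive = conv_then_pool_ops(d, k)  # less information loss, more computation
ratio = expensive / cheap
print(cheap, expensive, ratio)
```

Because the spatial size enters the count squared, halving d before the convolution divides the cost by four.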

