77
Modeling deep structures for 3D scene understanding Wanli Ouyang (欧阳万里) The Chinese University of Hong Kong The University of Sydney

Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Modeling deep structures for 3D scene understanding

Wanli Ouyang (欧阳万里)

The Chinese University of Hong Kong The University of Sydney

Page 2: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Outline

2

Back-bone model design

Conclusion

Introduction

Structured featuresStructured output

Structured deep learning

Depth Estimation (TPAMI’18)

Scene Graph Generation (ICCV’17, ECCV’18)

Page 3: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Outline

3

Image/video classification(CVPR’18, NIPS’18)

Structured featuresStructured output

Structured deep learning

Depth Estimation (TPAMI’18)

Scene Graph Generation (ICCV’17, ECCV’18)

Back-bone model design

Conclusion

Introduction

Page 4: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Outline

4

Image/video classification(CVPR’18, NIPS’18)

Structured featuresStructured output

Structured deep learning

Depth Estimation (TPAMI’18)

Scene Graph Generation (ICCV’17, ECCV’18)

Back-bone model design

Conclusion

Introduction

Codes available!

Page 5: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Deep learning

MIT Tech ReviewTop 10 Breakthroughs 2013Ranking No. 1

Hinton won ImageNetcompetition Classify 1.2 million images into 1,000 categoriesBeating existing computer vision methods by 20+% Surpassing human performance

Hold records on most of the computer vision problems

Web-scale visual search, self-driving cars, surveillance, multimedia …

Simulate brain activities and employ millions of neurons to fit billions of training samples. Deep neural networks are trained with GPU clusters with tens of thousands of processors

Page 6: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Performance vs practical need

Conventional model

Deep model Very Deep model

Very deep structured learning

Many other applications

Face recognition

Page 7: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

h

x

y

Structure in neurons

• Conventional neural networks

– Neurons in the same layer have no connection

– Neurons in adjacent layers are fully connected, at least within a local region

Structure exists in brain

Page 8: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Structure in data

?

Page 9: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Structure in data

?

?

Correlation

Page 10: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Outline

10

Introduction

Structured featuresStructured output

Structured deep learning

Depth Estimation (TPAMI’18)

Scene Graph Generation (ICCV’17, ECCV’18)

Page 11: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Outline

11

Introduction

Structured output

Structured deep learning

Depth Estimation (TPAMI’18)

Page 12: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Monocular depth estimation

RGB-Input GT Ours

Page 13: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Motivation•Deep structured dense pixel-level prediction:

In single scale with patch-level refinement due to the O(n3)

complexity of closed-form solution

CNN coarse output CRF-modeling InferenceRepresentative works:

• CRF-RNN:

. S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as

recurrent neural networks. In ICCV, 2015.

• Deep convolutional neural field:

. F. Liu, C. Shen, G. Lin, and I. Reid. Learning depth from single monocular images using deep convolutional neural fields. IEEE

TPAMI, 38(10):2024–2039, 2016.

In Discrete Domain

Ours: In Multi-scale with pixel-level dense refinement with O(n) complexity

Page 14: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Approach

Multi-Scale Deep Structured Fusion & Prediction + In Continuous Domain +

Within a Joint CNN-CRF Framework

Dan Xu, Elisa Ricci, Wanli Ouyang, Xiaogang Wang, Nicu Sebe, “Multi-Scale Continuous CRFs as Sequential Deep

Networks for Monocular Depth Estimation”, IEEE Conference on Computer Vision and Pattern Recognition (CVPR

2017, Spotlight Oral, TPAMI18)

Front-End C onvolutional N eural N etw ork

C -M F C -M F C -M F C -M FC -M F

d ?

s1

C -M F C -M F C -M F C -M FC -M F

M ulti-Scale Fusion w ith C ontinuous C R Fs

Side O utputs

s2 s3 s4 s5

...

...

...

...

...

r

r

Page 15: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Message passing using CNN-CRF

• 20 minutes to illustrate CRF …

Page 16: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Results with Different Front-End CNNs on NYUD-V2

Results

Qualitative results on NYUD-V2: significant improvement

over the pretrained front-end CNNs

Code:

Page 17: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Results

Results on KITTI: ours achieved the best performance

compared with the state-of-the-art

Code:

Page 18: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

More effective than the classic multi-scale fusion schemes

RGB-Input GT Ours

Qualitative results on Make3DAchieved the best performance on most of the metrics.

Results

Page 19: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Outline

19

Introduction

Structured featuresStructured output

Structured deep learning

Depth Estimation (TPAMI’18)

Scene Graph Generation (ICCV’17, ECCV’18)

Page 20: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Object detection

Scene graph generation

Mom and her cute baby are brushing teeth Region captioning

Woman

Child

Tooth brush

Tooth brush

Use

Use

Watch

Page 21: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Outline

21

Introduction

Structured featuresStructured output

Structured deep learning

Depth Estimation (TPAMI’18)

Scene Graph Generation (ICCV’17, ECCV’18)

Page 22: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Why structured features?

Contains rich visual information

Page 23: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Woman

Tooth brush

?

Page 24: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Woman

Tooth brush

?Use

Page 25: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Overview our proposed Multi-level Scene Description Network (MSDN)

25

Page 26: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Methodology: Dynamic Graph Construction

26

Page 27: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Methodology: Feature Refining

27

(b) Phrase Updating

(a) Object Updating

(c) Region Updating

Region nodes

phrase nodes

object nodes

Page 28: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Methodology: Object feature updating

28

• Phrase feature merge: Since the features from different phrases have different importance factors for refining objects, we use a gate function to determine weights.

The gate function is defined as:

• Refine object features: For the i-th object, there are two merged features:

Page 29: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Overview our proposed Multi-level Scene Description Network (MSDN)

29

Code:

Page 30: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Quantitative Results

30

Comparison with existing works:• LP: Visual Relationship detection using word embeddings as

language prior (Lu, Cewu, et al., ECCV 2016)• ISGG: Scene graph generation using iterative message

passing (Xu, Danfei, et al. arXiv:1701.02426)

Experiment on object detection & captioning:• FRCNN: Faster R-CNN (Girshick, Ross., ICCV 2015) with the same

number of potential object proposals as used at our MSDN.• Baseline-3-bran.: the baseline model with 3 branches but the

feature refining structure removed.

Page 31: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Qualitative Results

Top-1 region captioning results with detected objects and corresponding relationships are visualized.

Page 32: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

32

Slow inference speed due to large number of phrase proposals

2.45s/img

Page 33: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

33

Factorizable NetAn Efficient Subgraph-based Framework for Scene Graph Generation

Page 34: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

[8] Li, Yikang, et al. “Factorazable Net: An Efficient Subgraph-based framework for Scene Graph Generation .” ECCV 2018.

(person-play-snowboard) (snowboard-under-person)

(person-wear-pants) (pants-on-person)

(person-wear-helmet)

(helmet-on-person)

Object nodes Predicate nodes

Page 35: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

[8] Li, Yikang, et al. “Factorazable Net: An Efficient Subgraph-based framework for Scene Graph Generation .” ECCV 2018.

(Shared interacting regions)Object nodes Predicate nodes Subgraph nodes

Page 36: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Factorizable Net

[8] Li, Yikang, et al. “Factorazable Net: An Efficient Subgraph-based framework for Scene Graph Generation .” ECCV 2018.

(1) Image and RPN

proposals (2) Fully-connected Graph(3) Subgraph-based

Representation

person

wear hold

helmet bat

(4) ROI-pooling and

Feature Preparation

(6) Object and Relation

Recognition

Predicate inference (SRI)(5) Spatial-weighted

Message Passing

(SMP)

Object inference

Object feature vectors

Subgraph feature maps

Page 37: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

SMP: Spatial-weighted Message Passing

[8] Li, Yikang, et al. “Factorazable Net: An Efficient Subgraph-based framework for Scene Graph Generation .” ECCV 2018.

(k × 512 × 1 × 1)

subgraph feature

(1 × 512 × 5 ×5)

merged features

(1 × 512 × 5 × 5)

object features

(k × 512)attention maps

(k × 1 × 5 × 5)

refined features

(1 × 512 × 5 × 5)

SMP : Subgraph Feature Refining

subgraph features

(m × 512 × 5 × 5)

avg-pooled features

(m × 515 × 1 × 1)

object feature

(1 × 512)

merged features

(1 × 512)

refined features

(1 × 512)

attention vector

(m× 1)

SMP: Object Feature Refining

Page 38: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

SMP: subgraph to object

[8] Li, Yikang, et al. “Factorazable Net: An Efficient Subgraph-based framework for Scene Graph Generation .” ECCV 2018.

subgraph features

(m × 512 × 5 × 5)

avg-pooling or attention

(m × 515 × 1 × 1)

object feature

(1 × 512)

merged features

(1 × 512)

refined features

(1 × 512)

attention vector

(m× 1)

Page 39: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

SMP: object to subgraph

[8] Li, Yikang, et al. “Factorazable Net: An Efficient Subgraph-based framework for Scene Graph Generation .” ECCV 2018.

(k × 512 × 5 × 5)

subgraph feature

(1 × 512 × 5 × 5)

object features

(k × 512)attention maps (k ×

1 × 5 × 5)

refined features

(1 × 512 × 5 × 5)

Page 40: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

SRI: Spatial-sensitive Relation Inference

[8] Li, Yikang, et al. “Factorazable Net: An Efficient Subgraph-based framework for Scene Graph Generation .” ECCV 2018.

conv kernel (1 ×512)

Concatenate

subgraph feature

(1 × 512 × 7 × 7)

(1 × 512 × 7 × 7)

(1 × 512 × 7 × 7)

(1 × 1536 × 7 × 7)

Spatial-sensitive Relation Inference (SRI)

predicate

(1 × 512)

Subject feature

(1 × 512)

Object feature

(1 × 512)

Page 41: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Comparison with Existing Methods

[8] Li, Yikang, et al. “Factorazable Net: An Efficient Subgraph-based framework for Scene Graph Generation .” ECCV 2018.

Page 42: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Evaluation on Object Detection

[4] Li, Yikang, et al. “Scene graph generation from objects, phrases and region captions.” ICCV 2017.

[8] Li, Yikang, et al. “Factorazable Net: An Efficient Subgraph-based framework for Scene Graph Generation .” 2018.

[9] Ren, Shaoqing, et al. "Faster r-cnn: Towards real-time object detection with region proposal networks." NIPS 2015.

• FRCNN-64: Faster RCNN with 64 object proposals (experiment settings in [7])

• FRCNN-300: Faster RCNN with 300 object proposals (experiment settings in [9])

• MSDN: Our proposed Multilevel Scene Description Network in [4]

• Ours-w/o-Rel: Adopt the subgraph-based framework but without relationship supervision

• Ours: Our Factorizable Net with 1 SMP (model 5 in Ablation Study)

Page 43: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Does structure only exist for specific task?

Page 44: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Outline

44

Back-bone model design

Introduction

FishNet(NIPS 18)

Optical guided feature (CVPR18)

Structured deep learning

Page 45: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Current CNN Structures

Image Classification: Summarize high-level semantic information of the whole image.

Detection/Segmentation:High-level semantic meaning with high spatial resolution

Called U-Net, Hourglass, or Conv-deconv

Page 46: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Architectures designed for different granularities are

DIVERGING

Page 47: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Unify the advantages of networks for pixel-level, region-level, and image-level tasks

Page 48: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Hourglass for Classification

Poor performance.

So what is the problem?

• Different tasks require different resolutions of feature

Directly applying hourglass for classification?

Features with high-level semantics and high resolution is good

• Down sample high-level features with high resolution

Our design

Page 49: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Hourglass for Classification

1 × 1, 𝑐𝑖𝑛

3 × 3, 𝑐𝑖𝑛

1 × 1, 𝑐𝑜𝑢𝑡

1 × 1, 𝑐𝑜𝑢𝑡

𝑐𝑜𝑢𝑡

𝑆𝑡𝑟𝑖𝑑𝑒 = 2

1 × 1, 𝑐𝑖𝑛

3 × 3, 𝑐𝑖𝑛

1 × 1, 𝑐𝑖𝑛

CLow-level features

𝑐𝑜𝑢𝑡 − 𝑐𝑖𝑛 𝑐𝑜𝑢𝑡

C Concat

𝑢𝑝/𝑑𝑜𝑤𝑛 𝑠𝑎𝑚𝑝𝑙𝑒

The 𝟏 × 𝟏 convolution layer in yellowindicates the Isolated convolution.

• Hourglass may bring more isolated convolutions than ResNet

1 × 1, 𝑐𝑖𝑛

3 × 3, 𝑐𝑖𝑛

1 × 1, 𝑐𝑜𝑢𝑡

𝑐𝑜𝑢𝑡

Normal Res-BlockRes-Block for

down/up samplingOur design

• Different tasks require different resolutions of feature

Page 50: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Observation and design

Solution1. Unify the advantages of

networks for pixel-level, region-level, and image-level tasks.

2. Design a network that does not need isolated convolution

3. Features from varying depths are preserved and refined from each other.

Bharath Hariharan, et al. "Hypercolumns for object segmentation and fine-grained localization." CVPR’15.Newell, Alejandro, Kaiyu Yang, and Jia Deng. "Stacked hourglass networks for human pose estimation." ECCV’16.

Our observation

1. Diverged structures for tasks requiring different resolutions.

2. Isolated Conv blocks the direct back-propagation

3. Features with different depths are not fully explored, or mixedbut not preserved

Page 51: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Difference between mix and preserve and refine

High level

Low level

+ High level

Low level

Mixed features Preserve and refine

M

M

+

+

M Message generation

Page 52: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

FishNet: Overview

224x224 … 56x56 28x28 14x14 7x7 14x14 28x28 56x56 28x28 14x14 7x7 1x1

… … … … … ……

Features in the tail part

Features in the body part

Residual Blocks

Features inthe head part

Concat

Fish Tail Fish Body Fish Head

… … …

Page 53: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

FishNet: Preservation & Refinement

……

…………

……

……

TransferringBlocks T (⋅)

DR Blocks

UR Blocks

Regular Connections

……

FishTail

FishBody

FishHead

M(⋅)

𝑑𝑜𝑤𝑛(⋅)

𝑢𝑝(⋅)

𝑟(⋅)M(⋅)

……

……

Up-sampling and Refinement (UR) Blocks

Down-sampling and Refinement (DR) Blocks

Feature from varying depth refines each other here

Sum up every𝒌 adjacent

channelsConcat

Concat

Nearest neighbor up-sampling

𝟐 × 𝟐 Max-Pooling

From Tail

From Body

From Body

From Head

Page 54: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

FishNet: Performance-ImageNet

22.59%

21.93%(5.92%)

21.55%(5.86%)

21.25%(5. 76%)

23.78%(7.00%)

22.30%(6.20%)

21.69%(5.94%)

21.00%

21.50%

22.00%

22.50%

23.00%

23.50%

24.00%

10 20 30 40 50 60 70

FishNet

ResNet

Parameters, × 106

22.59%

21.93%

21.55%21.25%

23.78%

22.30%

21.69%

21.00%

21.50%

22.00%

22.50%

23.00%

23.50%

24.00%

2 4 6 8 10 12

FishNet

ResNet

FLOP, × 109

To

p-1

Erro

r

Codehttps://github.com/kevin-ssy/FishNet

Page 55: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

FishNet: Performance-ImageNet

22.59%

21.93%(5.92%)

21.55%(5.86%)

21.25%(5. 76%)

22.58%(6.35%)

22.20%(6.20%)

22.15%(6.12%)

21.20%

23.78%(7.00%)

22.30%(6.20%)

21.69%(5.94%)

21.00%

21.50%

22.00%

22.50%

23.00%

23.50%

24.00%

10 20 30 40 50 60 70

FishNet

DenseNet

ResNetTo

p-1

Erro

r

Parameters, × 106

Codehttps://github.com/kevin-ssy/FishNet

Page 56: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

FishNet: Performance on COCO Detection

38.00%

38.50%

39.00%

39.50%

40.00%

40.50%

41.00%

41.50%

42.00%

AP

R-50

RX-50

Fish-150

18.00%

18.20%

18.40%

18.60%

18.80%

19.00%

19.20%

19.40%

19.60%

19.80%

20.00%

AP-small

36.00%

36.50%

37.00%

37.50%

38.00%

38.50%

39.00%

39.50%

40.00%

40.50%

41.00%

AP-medium

47.00%

47.50%

48.00%

48.50%

49.00%

49.50%

50.00%

50.50%

AP-large

Codehttps://github.com/kevin-ssy/FishNet

Page 57: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

FishNet: Performance on COCO Instance Segmentation

34.00%

34.50%

35.00%

35.50%

36.00%

36.50%

37.00%

37.50%

AP

R-50

RX-50

Fish-150

18.00%

18.20%

18.40%

18.60%

18.80%

19.00%

19.20%

19.40%

19.60%

19.80%

20.00%

AP-small

37.00%

37.50%

38.00%

38.50%

39.00%

39.50%

40.00%

40.50%

AP-medium

47.00%

47.50%

48.00%

48.50%

49.00%

49.50%

50.00%

50.50%

AP-large

Codehttps://github.com/kevin-ssy/FishNet

Page 58: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Winning COCO 2018 Instance Segmentation Task

Page 59: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Visualization

Page 60: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Visualization

Page 61: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Visualization

Page 62: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Visualization

Page 63: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Visualization

Page 64: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Codebase

• Comprehensive

RPN Fast/Faster R-CNN

Mask R-CNN FPN

Cascade R-CNN RetinaNet

More … …

• High performance

Better performance

Optimized memory consumption

Faster speed

• Handy to develop

Written with PyTorch

Modular design

GitHub: mmdet√

Page 65: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

FishNet: Advantages

• Better gradient flow to shallow layers

• High-resolution features contain rich low-level and high-level semantics

• Build up correlation among features with different semantic information

– They are preserved and refined from each other

Page 66: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Outline

66

Back-bone model design

Introduction

FishNet(NIPS 18)

Optical guided feature (CVPR18)

Structured deep learning

Page 67: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Action Recognition

• Recognize action from videos

Page 68: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Optical flow in Action Recognition

• Motion is the important information

• Optical flow

– Effective

– Time consuming

Modality Acc. Speed(fps)

RGB 85.5% 680

RGB+Optical Flow 94.0% 14

We need a better motion representation

Page 69: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Optical flow guided feature

Page 70: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Optical flow guided feature

{vx , vy} = optical flow

Coefficient for optical flow:

Optical flow:

Intuitive Inspiration

Page 71: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Optical flow guided feature

Feature flow:

{ } = feature flow

Page 72: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Optical flow guided feature

1 × 1 Conv 1 × 1 Conv

S

C

Sobel Subtract

Concat

OOFF Unit

Fram

et

Fram

et

+ ∆

t

Cla

ssif

ica

tio

nSu

b-n

etw

ork

...

...

ResolutionK*K

ResolutionK/2*K/2

ResolutionK/4*K/4

CO

CO

...

...

CO

Feat

ure

tFe

atu

re t

+ ∆

t

...

...

OCOFF UnitConcat

C C C

Page 73: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Optical Flow Guided Feature (OFF): Experimental results

1. OFF with only RGB inputs is comparable with the other state-of-the-art methods using optical flow as input.

0 50 100 150 200 250

RGB + Optical flow + I3D

RGB + OFF

RGB + OFF + Optical Flow

FPS

92.0 92.5 93.0 93.5 94.0 94.5 95.0 95.5 96.0

RGB + Optical flow + I3D

RGB + OFF

RGB + OFF + Optical Flow

Accuracy (%)Code:

Page 74: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Not only for action recognition

• Also effective for

– Video object detection

– Video compression artifact removal

71.5 72 72.5 73 73.5 74 74.5 75 75.5 76

resnet+rfcn

resnet+rfcn+OFFDetection (mAP)

34.6 34.8 35 35.2 35.4 35.6 35.8 36 36.2

DnCNN

DnCNN+OFFCompression Artifact Removal (PSNR)

q40

Page 75: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Outline

75

Back-bone model design

Introduction

FishNet(NIPS 18)

Optical guided feature (CVPR18)

Structured deep learning

Page 76: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision

Take home message

76

• Structured deep learning is

– effective

– for output, features

– from observation

• End-to-end joint training bridges the gap between structure modeling and feature learning

Page 77: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision