Huawei Noah’s Ark Lab · Yunhe Wang
AI on the Edge — Discussion on the Gap Between Industry and Academia

Source: valser.org/webinar/slide/slides/20200603/模型压缩... · 2020.6.3


  • Huawei Noah’s Ark Lab

    Yunhe Wang

    AI on the Edge — Discussion on the Gap Between Industry and Academia

  • ABOUT ME

    Enthusiasm

    Programmer

    PKUer

    Researcher

    Yunhe Wang · www.wangyunhe.site

    [email protected]

    [Han et al., NIPS 2015]

    [Han et al., ICLR 2016 Best Paper Award]

    • It is surprising to see that over 90% of the pre-trained parameters in AlexNet and VGGNet are redundant.
    • Techniques from visual compression, e.g. quantization and Huffman encoding, transfer successfully to network compression.
    • Compressed networks can match the performance of the original baselines after fine-tuning.
    • However, this does not directly yield a considerable speed-up on mainstream hardware. (A minimal pruning-and-quantization sketch follows this slide.)

    Restrictions for using AI on the edge.

    Deep Model Compression
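The recipe described above (prune redundant weights, cluster the survivors into a small shared codebook, Huffman-encode the result) can be sketched roughly as follows. This is a minimal NumPy illustration rather than the authors' implementation; the sparsity level, codebook size, and the simple 1-D k-means are illustrative assumptions.

```python
import numpy as np

def prune_and_quantize(weights, sparsity=0.9, n_clusters=32):
    """Toy sketch of magnitude pruning followed by weight sharing
    (k-means-style quantization), in the spirit of Deep Compression.
    `sparsity` and `n_clusters` are illustrative choices, not the paper's."""
    flat = weights.flatten()

    # 1. Magnitude pruning: zero out the smallest |w| until `sparsity` is reached.
    threshold = np.quantile(np.abs(flat), sparsity)
    mask = np.abs(flat) > threshold
    pruned = flat * mask

    # 2. Weight sharing: cluster the surviving weights into a small codebook.
    survivors = pruned[mask]
    codebook = np.linspace(survivors.min(), survivors.max(), n_clusters)
    for _ in range(10):  # a few iterations of simple 1-D k-means
        idx = np.argmin(np.abs(survivors[:, None] - codebook[None, :]), axis=1)
        for k in range(n_clusters):
            if np.any(idx == k):
                codebook[k] = survivors[idx == k].mean()

    # 3. Only the sparse positions + per-weight cluster ids need to be stored,
    #    and the ids can then be Huffman-encoded.
    quantized = pruned.copy()
    quantized[mask] = codebook[idx]
    return quantized.reshape(weights.shape), mask.reshape(weights.shape), codebook

w = np.random.randn(256, 256).astype(np.float32)
w_q, mask, codebook = prune_and_quantize(w)
print(f"kept {mask.mean():.1%} of weights, {len(codebook)} shared values")
```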

  • CNNpack: Packing Convolutional Neural Networks in the Frequency Domain (NIPS 2016)

    Statistics of the compressed models:

                                 AlexNet    VGGNet-16    ResNet-50
    rc (compression ratio)       39×        46×          12×
    rs (speed-up ratio)          25×        9.4×         4.4×
    Top-1 error                  41.6%      29.7%        25.2%
    Top-5 error                  19.2%      10.4%        7.8%

    [Figure: CNNpack pipeline. The original filters are projected onto DCT bases, sparsified by l1-shrinkage, quantized by K-means clustering (e.g. 0.499, 0.498, 0.501, 0.502, 0.500 → 0.5), and stored with Huffman encoding and CSR storage; the feature maps of the layer are then recovered as a weighted combination of DCT feature maps computed from the input data.]

    [Bar charts: model size and multiplications before → after CNNpack.
    Memory (MB): AlexNet 232 → 5.9, VGGNet-16 572 → 12.4, ResNet-50 95 → 7.9.
    Multiplications: AlexNet 7e8 → 3e7, VGGNet-16 2e10 → 2.1e9, ResNet-50 3.8e9 → 8.5e8.]
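A rough sketch of the frequency-domain idea behind CNNpack, not its actual implementation: project a filter onto DCT bases, keep only the largest coefficients (the l1-shrinkage and clustering steps are collapsed into simple thresholding here), and reconstruct. The value of `keep_ratio` and the use of SciPy's `dctn`/`idctn` are my assumptions.

```python
import numpy as np
from scipy.fft import dctn, idctn

def compress_filter_dct(filt, keep_ratio=0.1):
    """Toy frequency-domain compression of a single d x d conv filter:
    DCT, keep only the largest coefficients, inverse DCT.
    `keep_ratio` is an illustrative choice, not a value from CNNpack."""
    coeffs = dctn(filt, norm="ortho")            # project onto DCT bases
    k = max(1, int(keep_ratio * coeffs.size))    # number of coefficients to keep
    threshold = np.sort(np.abs(coeffs).ravel())[-k]
    # crude stand-in for l1-shrinkage + clustering: zero out small coefficients
    sparse = np.where(np.abs(coeffs) >= threshold, coeffs, 0.0)
    return idctn(sparse, norm="ortho"), sparse   # reconstructed filter + sparse coefficients

filt = np.random.randn(7, 7)
recon, sparse = compress_filter_dct(filt)
print("nonzero DCT coefficients:", np.count_nonzero(sparse), "/", sparse.size)
```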

  • [Figure: input images are fed to both the teacher network and the student network; a discriminator (the assistant) compares teacher features and student features in the feature space.]

    We suggest developing a teaching assistant (discriminator) network to identify the difference between the features generated by the student and the teacher network:

    \mathcal{L}_{GAN} = \frac{1}{n}\sum_{i=1}^{n} H\!\left(o_S^i, y^i\right) + \lambda\,\frac{1}{n}\sum_{i=1}^{n}\Big[-\log\big(D(z_T^i)\big) + \log\big(1 - D(z_S^i)\big)\Big],

    where H is the cross-entropy between the student's outputs o_S^i and the labels y^i, and D is the discriminator applied to the teacher features z_T^i and the student features z_S^i.

    Adversarial Learning of Portable Student Networks (AAAI 2018)
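A minimal PyTorch-style sketch of the loss above (my paraphrase, not the paper's code). The names `student_logits`, `labels`, `feat_t`, `feat_s`, the weight `lam`, and the discriminator `D` (assumed to end in a sigmoid) are placeholders supplied by the surrounding training loop.

```python
import torch
import torch.nn.functional as F

def adversarial_distillation_loss(student_logits, labels, feat_t, feat_s, D, lam=0.1):
    """Sketch of L_GAN above: cross-entropy on the student's predictions plus an
    adversarial term defined on teacher/student features.
    `lam` is an illustrative weight, not a value from the paper."""
    ce = F.cross_entropy(student_logits, labels)          # (1/n) sum_i H(o_S^i, y^i)
    adv = (-torch.log(D(feat_t) + 1e-8)                   # -log D(z_T^i)
           + torch.log(1 - D(feat_s) + 1e-8)).mean()      # +log(1 - D(z_S^i))
    return ce + lam * adv
```

In a standard GAN-style setup the discriminator itself would be trained with the opposite objective, so it keeps learning to tell teacher features from student features while the student learns to fool it.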

  • Visualization results of different networks trained on the MNIST dataset, where features of a specific category in every sub-figure are represented in the same color: (a) features of the original teacher network; (b) features of the student network learned using the standard back-propagation strategy; (c) features of the student network learned using the proposed method with a teaching assistant.

    (a) accuracy = 99.2% (b) accuracy = 97.2% (c) accuracy = 99.1%

    Adversarial Learning of Portable Student Networks (AAAI 2018)

  • An illustration of the evolution of LeNet on the MNIST dataset. Each dot represents an individual in the population, and the thirty best individuals are shown in each evolutionary iteration. The fitness of the individuals gradually improves with an increasing number of iterations, implying that the network becomes more compact while maintaining the same accuracy.

    [Figure: original filters, filters remaining after evolutionary pruning, and retrained filters.]

    Towards Evolutionary Compression (SIGKDD 2018)
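A toy sketch of how such an evolutionary search over pruned networks might look; the binary-mask encoding, fitness definition, and hyper-parameters are illustrative assumptions, not the paper's setup. `evaluate` is a hypothetical callback that fine-tunes and scores a pruned network.

```python
import numpy as np

def evolve_filter_masks(evaluate, n_filters, pop_size=30, iters=50, lam=0.1):
    """Toy evolutionary search over binary filter masks (1 = keep the filter).
    `evaluate(mask)` is assumed to return the validation accuracy of the pruned,
    briefly fine-tuned network; `lam` trades accuracy against compression."""
    pop = (np.random.rand(pop_size, n_filters) > 0.5).astype(np.uint8)
    for _ in range(iters):
        # Fitness rewards accuracy and penalizes the fraction of filters kept.
        fitness = np.array([evaluate(m) - lam * m.mean() for m in pop])
        parents = pop[np.argsort(fitness)[-pop_size // 2:]]   # keep the best half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[np.random.randint(len(parents), size=2)]
            cut = np.random.randint(1, n_filters)              # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = np.random.rand(n_filters) < 0.02             # small mutation rate
            children.append(np.where(flip, 1 - child, child))
        pop = np.vstack([parents, children])
    best = pop[np.argmax([evaluate(m) - lam * m.mean() for m in pop])]
    return best
```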

  • The two generators in CycleGAN are compressed simultaneously:

    Statistics of compressed generators

    P30 Pro Latency: 6.8s -> 2.1s

    Co-Evolutionary Compression for GANs (ICCV 2019)

    [Figure: co-evolutionary compression. Population A (pruned variants of Generator A) and Population B (pruned variants of Generator B) are evolved in parallel over iterations 1 … T, with a best Gen A and Gen B selected at each iteration. Qualitative comparison columns: Input, Baseline, ThiNet, Ours.]
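A speculative sketch of the co-evolution idea, under the assumption that each population is scored jointly with the current best individual of the other population (e.g. via a cycle-consistency score provided by a hypothetical `eval_pair`); this is not the paper's algorithm.

```python
import numpy as np

def coevolve(eval_pair, n_a, n_b, pop_size=16, iters=20):
    """Toy co-evolution of two pruning masks (one per CycleGAN generator).
    `eval_pair(mask_a, mask_b)` is assumed to return a joint quality score
    of the two pruned generators. All hyper-parameters are illustrative."""
    pop_a = (np.random.rand(pop_size, n_a) > 0.5).astype(np.uint8)
    pop_b = (np.random.rand(pop_size, n_b) > 0.5).astype(np.uint8)
    best_a, best_b = pop_a[0], pop_b[0]
    for _ in range(iters):
        # Each population is scored against the current best of the other one.
        fit_a = [eval_pair(m, best_b) - 0.1 * m.mean() for m in pop_a]
        fit_b = [eval_pair(best_a, m) - 0.1 * m.mean() for m in pop_b]
        best_a = pop_a[int(np.argmax(fit_a))]
        best_b = pop_b[int(np.argmax(fit_b))]
        # Refill each population by mutating its current best individual.
        mutate = lambda m: np.where(np.random.rand(m.size) < 0.05, 1 - m, m)
        pop_a = np.stack([best_a] + [mutate(best_a) for _ in range(pop_size - 1)])
        pop_b = np.stack([best_b] + [mutate(best_b) for _ in range(pop_size - 1)])
    return best_a, best_b
```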

  • [Figure: generative network distillation. Random signals are fed to a generator to produce generated images, which are used to distill the teacher network into the student network.]

    A generator is introduced to approximate the training data (see the sketch below).

    DAFL: Data-Free Learning of Student Networks (ICCV 2019)

    How can we provide a perfect model optimization service on the cloud?

    Privacy-Related AI Applications

    Entertainment APP
    Face ID
    Voice assistant
    Fingerprint

    Original and Generated Face Images

    Accuracy of the data-free student: 98.20% on MNIST, 92.22% on CIFAR-10, 74.47% on CIFAR-100
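A simplified sketch of one data-free distillation step in the spirit of the slide: noise goes into a generator, the frozen teacher labels the generated images, and the student is trained to match it. DAFL additionally trains the generator with its own losses, which are omitted here; `optimizer` is assumed to hold the student's parameters.

```python
import torch
import torch.nn.functional as F

def data_free_distillation_step(generator, teacher, student, optimizer,
                                batch=64, z_dim=100):
    """One toy student update without any real training data.
    The generator is assumed to map (batch, z_dim) noise to images."""
    z = torch.randn(batch, z_dim)              # random signals
    fake_images = generator(z)                 # generated images
    with torch.no_grad():
        t_logits = teacher(fake_images)        # frozen teacher acts as a labeler
    s_logits = student(fake_images)
    # KL divergence between teacher and student predictions on synthetic data.
    loss = F.kl_div(F.log_softmax(s_logits, dim=1),
                    F.softmax(t_logits, dim=1),
                    reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```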

  • AdderNet: Do We Really Need Multiplications in Deep Learning? (CVPR 2020)

    Using additions instead of multiplications in deep learning can significantly reduce the energy consumption and area cost of chips.
    https://media.nips.cc/Conferences/2015/tutorialslides/Dally-NIPS-Tutorial-2015.pdf
    http://eecs.oregonstate.edu/research/vlsi/teaching/ECE471_WIN15/mark_horowitz_ISSCC_2014.pdf
    http://eyeriss.mit.edu/2019_neurips_tutorial.pdf

    [Figure: feature visualization on MNIST — Adder Network vs. Convolutional Network.]

    Feature calculation in adder neural network:

    Feature calculation in convolutional neural network:
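The two formulas these lines refer to appear as images in the original slides; they are reconstructed below from the AdderNet paper as I recall it, so treat the exact notation as an assumption. The adder network replaces cross-correlation with a negative ℓ1-distance between the input patch and the filter.

```latex
% Adder network: negative l1-distance between the input patch and filter F
Y(m,n,t) = -\sum_{i=0}^{d}\sum_{j=0}^{d}\sum_{k=0}^{c_{in}}
           \bigl|\, X(m+i,\, n+j,\, k) - F(i,\, j,\, k,\, t) \,\bigr|

% Convolutional network: cross-correlation between the input patch and filter F
Y(m,n,t) = \sum_{i=0}^{d}\sum_{j=0}^{d}\sum_{k=0}^{c_{in}}
           X(m+i,\, n+j,\, k)\cdot F(i,\, j,\, k,\, t)
```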

    Validations on ImageNet

  • Huawei HDC 2020: Real-time Video Style Transfer

    Inference time: about 630 ms (original model) vs. 60 ms (optimized model)

    Huawei Atlas 200 AI Accelerator Module

    The key approaches used for completing this task:

    1. Model Distillation: remove the optical flow module in the original network

    2. Filter Pruning: reduce the computational complexity of the video generator

    3. Operator Optimization: automatically select suitable operators on the Atlas 200

    https://developer.huaweicloud.com/exhibition/Atlas_neural_style.html

  • Discussions – Edge Computing

    Four reasons to move deep learning workloads from the cloud down onto the device:

    1. Privacy & security: if your data can't leave the premises where it’s captured

    2. Latency: if you need a real-time response, as in a robotics workload or a self-driving car

    3. Reliability: the network link up to the cloud might not always be reliable

    4. Cost: if the channel used to send the data up to the cloud is costly

    [Figure: running a Deep Neural Network on the server/cloud (✓ fast, ✓ large memory, ✓ ample energy resources) vs. on a mobile device (small memory, slow, limited energy resources).]

    Github Link

    Zhihu (知乎)

  • Thank You!

    Contact me: [email protected], [email protected]
    www.wangyunhe.site