Inference Optimization Using TensorRT with Use Cases · 2018. 11. 26.

Jack Han / 한재근 | Solutions Architect | NVIDIA

Inference Optimization Using TensorRT

with Use Cases

TensorRT 4 Adoption

Video

Maps

Image

NLP

Speech

Search

Use Cases

AI Inference is exploding

SPEECH

1 Billion Voice Searches Per Day

Google, Bing, etc.

LIVE VIDEO

1 Billion Videos Watched Per Day

Facebook

RECOMMENDATIONS

1 Trillion Ads/Rankings Per Day

Impressions

Bigger and More Compute-Intensive in Quest for Accuracy

[Figure: three charts of model complexity (GOP * Bandwidth) by year, with x-axes spanning years between 2011 and 2018. Image: ResNet-50, Inception-v2, Inception-v4, AlexNet, GoogleNet (350X). Speech: DeepSpeech, DeepSpeech 2, DeepSpeech 3 (30X). Translation: MoE, OpenNMT, GNMT (10X).]

Cambrian Explosion

Convolutional Networks: Encoder/Decoder, ReLU, BatchNorm, Dropout, Pooling, Concat

Recurrent Networks: LSTM, GRU, CTC, Beam Search, WaveNet, Attention

Generative Adversarial Networks: Speech Enhancement GAN, Coupled GAN, Conditional GAN, MedGAN, 3D-GAN

Reinforcement Learning: DQN, Simulation, DDPG

New Species: Mixture of Experts, Neural Collaborative Filtering, Block Sparse LSTM, Capsule Nets

Inefficiency Limits Innovation

Custom Development: Developers need to reinvent the plumbing for every application

Single Model Only: Some systems are overused while others are underutilized (ASR, NLP, Recommender)

Single Framework Only: Solutions can only support models from one framework

More Accuracy = More Time

Speed/accuracy trade-offs for modern convolutional object detectors, Google Research

Inference Performance

Latency

Efficiency

Throughput

Easy Deploy

Inference Optimization

Lightweight Model: small computation; suitable for lightweight hardware. E.g., MobileNet, SqueezeNet, SEP-Nets

Deep Compression: no network modification; continue the innovation. E.g., quantization and binarization, network pruning and sharing, distillation / factorization

TensorRT: From framework to target

Frameworks -> TensorRT (Optimizer, Runtime) -> GPU platforms: DRIVE PX 2, JETSON TX2, NVIDIA DLA, TESLA T4, TESLA V100

Speed-up

[Figure: ResNet-50 on V100. Throughput (images/sec) versus batch response time (ms) for TensorRT INT8, TensorRT FP16, TensorRT FP32, GPU Native FP32, and CPU Native FP32.]

TensorRT 5

More Layers / Plugin / APIs

Inference Server Integration

TensorRT Workflow

Step 1. Optimize trained model (Model -> Optimize -> Plan)

Step 2. Deploy optimized plan (Plan -> Runtime Engine -> Embedded, Automotive, Data center)

Step-by-Step TensorRT Plan Build

Network Definitions (from the C++/Python Network API or a Model Parser) -> TensorRT Builder -> Engine

https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#support_op

7 Steps to Deployment with TensorRT

Step 1: Convert trained model into TensorRT format

Step 2: Create a model parser

Step 3: Register inputs and outputs

Step 4: Optimize model and create a runtime engine

Step 5: Serialize optimized engine

Step 6: De-serialize engine

Step 7: Perform inference

Custom Layer Support using Plugin Layer

Network Definitions (from the C++/Python Network API or a Model Parser with a Plugin Factory providing Plugin A and Plugin B) -> TensorRT Builder -> Engine

Plugin Layer Procedure

Participants: ICudaEngine, Plugin Factory, Plugin Layer

Phase 1. Initialize: Layer? (missing), Initialize

Phase 2. Inference: enqueue

Phase 3. Store/Load: Serialize, Deserialize
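The three phases above can be mirrored in a small plain-Python sketch. This is not the TensorRT plugin interface: the class names `LeakyReluPlugin` and `PluginFactory`, the `is_plugin`, `create`, and `deserialize` methods, and the choice of a leaky ReLU layer are all hypothetical, chosen only to show the order of calls (initialize, enqueue, then serialize and deserialize).

```python
import struct


class LeakyReluPlugin:
    """Toy stand-in for a custom layer; NOT the TensorRT plugin interface."""

    def __init__(self, negative_slope):
        self.negative_slope = negative_slope
        self.initialized = False

    # Phase 1: one-time setup before the first inference call.
    def initialize(self):
        self.initialized = True

    # Phase 2: per-batch execution (a real plugin would launch a CUDA kernel here).
    def enqueue(self, inputs):
        if not self.initialized:
            raise RuntimeError("initialize() must run before enqueue()")
        return [x if x >= 0.0 else self.negative_slope * x for x in inputs]

    # Phase 3 (store): pack the layer parameters into bytes.
    def serialize(self):
        return struct.pack("<f", self.negative_slope)


class PluginFactory:
    """Hypothetical factory: maps unsupported layer names to plugin classes."""

    def __init__(self):
        self._creators = {"LeakyReLU": LeakyReluPlugin}

    def is_plugin(self, layer_name):
        return layer_name in self._creators

    def create(self, layer_name, **params):
        return self._creators[layer_name](**params)

    # Phase 3 (load): rebuild the plugin from its serialized bytes.
    def deserialize(self, layer_name, blob):
        (negative_slope,) = struct.unpack("<f", blob)
        return self._creators[layer_name](negative_slope)


factory = PluginFactory()
plugin = factory.create("LeakyReLU", negative_slope=0.5)
plugin.initialize()
outputs = plugin.enqueue([-2.0, 0.0, 3.0])

restored = factory.deserialize("LeakyReLU", plugin.serialize())
restored.initialize()
restored_outputs = restored.enqueue([-2.0, 0.0, 3.0])
```

The restored plugin produces the same outputs as the original, which is the property the Store/Load phase has to preserve.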

Supporting Features

Feature             C++   Python (x86 only)   NvCaffeParser   NvUffParser
CNNs                Yes   Yes                 Yes             Yes
RNNs                Yes   Yes                 No              No
INT8 Calibration    Yes   Yes                 N/A             N/A
Asymmetric Padding  Yes   Yes                 No              No

TensorRT API: Define your network using C/C++ or Python

INetworkDefinition* network = builder->createNetwork();

TensorRT RNN Programming

Dump Weights (ckpt -> wts, TF format): /usr/src/tensorrt/samples/common/dumpTFWts.py

Load Weights (TF format)

Convert Weights (TF format -> TRT format): convertRNNWeights, convertRNNBias

Set Weights (TensorRT RNN Layer): setWeightsForGate, setBiasForGate

Get started with sampleCharRNN in documentation

https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#charRNN_sample
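To make the "Convert Weights" step more concrete, here is a minimal plain-Python sketch of splitting a fused LSTM kernel into per-gate input and recurrent matrices, which is the kind of data a per-gate setter needs. It is not the sample's `convertRNNWeights`; the function name `split_lstm_kernel`, the fused layout `[input_size + hidden_size, 4 * hidden_size]`, and the gate order (i, c, f, o) are assumptions modeled on TensorFlow's BasicLSTMCell and should be checked against the actual checkpoint.

```python
def split_lstm_kernel(kernel, input_size, hidden_size,
                      gate_order=("i", "c", "f", "o")):
    """Split a fused LSTM kernel into per-gate (W, R) matrices.

    kernel: list of rows with shape [input_size + hidden_size, 4 * hidden_size].
    Returns {gate: (W, R)} where W has shape [hidden_size, input_size] and
    R has shape [hidden_size, hidden_size], both transposed relative to kernel.
    """
    assert len(kernel) == input_size + hidden_size
    assert all(len(row) == 4 * hidden_size for row in kernel)

    gates = {}
    for g, name in enumerate(gate_order):
        cols = range(g * hidden_size, (g + 1) * hidden_size)
        # Rows [0, input_size) multiply the input x.
        w = [[kernel[r][c] for r in range(input_size)] for c in cols]
        # The remaining rows multiply the previous hidden state h.
        r_mat = [[kernel[r][c]
                  for r in range(input_size, input_size + hidden_size)]
                 for c in cols]
        gates[name] = (w, r_mat)
    return gates


# Toy kernel: input_size=1, hidden_size=2, entry value encodes (row, column).
toy_kernel = [[10 * r + c for c in range(8)] for r in range(3)]
gates = split_lstm_kernel(toy_kernel, input_size=1, hidden_size=2)
w_i, r_i = gates["i"]
w_f, r_f = gates["f"]
```

Each `(W, R)` pair could then be handed to a per-gate setter such as the `setWeightsForGate` call named on the slide, after flattening to whatever memory layout the target layer expects.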

NMT with Transformer Inference

Running 3 bound engines (Encoder, Generator, Shuffle Engine)

[Diagram: NMT pipeline with Encoder, Decoder, and Beam Search stages. Components: Input, Input Setup, Encoder RNN, Decoder RNN, Attention Model, Projection, TopK, Beam Scoring, Beam Shuffle, Batch Reduction, Output.]

* CPU-Only: TensorFlow on SKL 6140 18 core, FP32

GPU: V100, TensorRT 5, FP16; Sorted data, Batch=128, English to German

Runs on CPU

GPU-Accelerated

Support NMT layers such as Gather, Softmax, Batch GEMM and Top K

Modular Network Merge

Deploy highly-optimized language translation apps in production environments

Get started with NMT sample in documentation

https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#nmt_sample
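The beam scoring, TopK, and beam shuffle components named in the NMT diagram can be illustrated with one decoding step in plain Python. This is a sketch under stated assumptions, not the NMT sample's implementation: the function name `beam_search_step` and the toy probabilities are hypothetical, and it handles a single sentence on the CPU.

```python
import heapq
import math


def beam_search_step(beam_scores, log_probs, beam_width):
    """One decoding step for a single sentence.

    beam_scores: accumulated log-probability of each live beam.
    log_probs:   log_probs[b][t] is the log-probability of token t given beam b.
    Returns (new_scores, parent_beams, tokens), best candidate first.
    """
    candidates = []
    for b, (score, row) in enumerate(zip(beam_scores, log_probs)):
        for t, lp in enumerate(row):
            # Beam scoring: extend every live beam with every token.
            candidates.append((score + lp, b, t))

    # TopK: keep the beam_width best extensions across all beams.
    best = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])

    new_scores = [c[0] for c in best]
    # Beam shuffle: which old beam each surviving candidate continues.
    parent_beams = [c[1] for c in best]
    tokens = [c[2] for c in best]
    return new_scores, parent_beams, tokens


# Two live beams, vocabulary of three tokens.
beam_scores = [math.log(0.6), math.log(0.4)]
log_probs = [
    [math.log(0.5), math.log(0.3), math.log(0.2)],
    [math.log(0.1), math.log(0.8), math.log(0.1)],
]
scores, parents, tokens = beam_search_step(beam_scores, log_probs, beam_width=2)
```

In this toy case the best candidate extends the second beam (0.4 * 0.8 = 0.32) and the runner-up extends the first (0.6 * 0.5 = 0.30), so the parent indices come back reordered; the decoder state has to be shuffled the same way before the next step.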

Recommendation Inference

[Diagram: User, Context, and Item 1, Item 2, ..., Item N inputs each pass through an Embedding; the embeddings are joined by a Concat and fed to a Fused MLP Kernel (Fully Connected, Bias, Activation, repeated xN), followed by TopK, which outputs the top scores. Legend: Runs on CPU; GPU-Accelerated.]

* CPU-Only: TensorFlow on Intel Xeon E5-2698 v4 CPU at 2.20 GHz; GPU: V100 with custom networks

Deploy multi-layer perceptron (MLP) based recommendation apps in production

Predict events (click, purchase, interest) accurately based on input items (user, query, observed activity)

Get started with examples on MNIST character recognition and movie recommender system in the documentation

https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#sample_movie
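The recommendation pipeline on this slide (embedding lookup, concat, stacked fully connected + bias + activation layers, then TopK) can be sketched in a few lines of plain Python. Everything concrete here is an assumption for illustration: the function names, the two-dimensional embeddings, the hand-picked weights, and the use of ReLU as the activation. It is not the movie recommender sample.

```python
import heapq


def dense_relu(x, weights, bias):
    """Fully connected + bias + ReLU activation for one input vector."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def score_item(user_vec, item_vec, layers):
    h = user_vec + item_vec  # Concat of the two embeddings
    for weights, bias in layers:
        h = dense_relu(h, weights, bias)
    return h[0]  # the last layer has a single output unit


def recommend(user_id, user_table, item_table, layers, k):
    user_vec = user_table[user_id]  # Embedding lookup
    scores = {item_id: score_item(user_vec, vec, layers)
              for item_id, vec in item_table.items()}
    # TopK: the k highest-scoring items, best first.
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])


# Toy embedding tables and a two-layer MLP (weights chosen by hand).
user_table = {"u1": [1.0, 0.0]}
item_table = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.5, 0.5]}
layers = [
    ([[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, -1.0]], [0.0]),
]
top = recommend("u1", user_table, item_table, layers, k=2)
```

The per-layer loop in `score_item` is the part the slide shows fused into a single MLP kernel on the GPU; here it stays unfused for readability.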

Inferencing with Low precision

[Figure: Peak Performance (TFLOPS / TOPS). P4: Float 5.5, INT8 22. T4: Float 65, INT8 130, INT4 260.]

Post Training Quantization

Minimize information loss between FP32 and INT8 inference on calibration dataset

100 mini-batches are sufficient to determine quantization parameter (~500 samples)
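A minimal sketch of what "determining the quantization parameter" on a calibration set can look like: symmetric INT8 quantization with a saturation threshold picked from calibration samples. Note the swapped-in criterion: this sketch minimizes mean squared round-trip error, which is a simplification chosen for brevity and is not TensorRT's calibrator. The function names, the candidate grid of 64 thresholds, and the sample data are all assumptions.

```python
def quantize(x, threshold):
    """Symmetric INT8 quantization: map [-threshold, threshold] to [-127, 127]."""
    scale = threshold / 127.0
    q = round(x / scale)
    return max(-127, min(127, q))  # saturate values beyond the threshold


def dequantize(q, threshold):
    return q * (threshold / 127.0)


def calibrate(samples, num_candidates=64):
    """Pick the saturation threshold with the lowest round-trip error.

    Assumes at least one sample is nonzero. Uses mean squared error as the
    loss, a stand-in for the information-loss measure named on the slide.
    """
    max_abs = max(abs(s) for s in samples)
    best_threshold, best_error = max_abs, float("inf")
    for i in range(1, num_candidates + 1):
        threshold = max_abs * i / num_candidates
        error = sum((s - dequantize(quantize(s, threshold), threshold)) ** 2
                    for s in samples) / len(samples)
        if error < best_error:
            best_threshold, best_error = threshold, error
    return best_threshold


# Hypothetical calibration activations in [-1, 1].
samples = [0.01 * i for i in range(-100, 101)]
threshold = calibrate(samples)
quantized = [quantize(s, threshold) for s in samples]
```

The chosen threshold fixes the scale (threshold / 127) for that tensor; values beyond the threshold saturate at -127 or 127, which is the trade-off between range and resolution that calibration is balancing.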

New INT8 APIs and Optimizations

Auto Calibration: FP32 Training -> FP32 weights + Calibration Data, O(1000) Images -> Optimize to INT8

Custom Calibration: FP32 Training -> FP32 weights -> Custom Calibration -> FP32 or INT8 weights + Custom Activation Ranges -> Optimize to INT8

Quantization Aware Training: Quantization Aware Training in FP32 -> FP32 or accurate INT8 weights + Custom Activation Ranges -> Optimize to INT8

Maximize throughput at low latency with mixed precision compute in production

Apply INT8 quantization aware training or custom calibration algorithms with new APIs

Control precision per-layer with new APIs

Optimizations for depthwise convolution operations

Get started with the INT8 sample in the documentation

https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#int8_sample

Inference Performance

[Figure: two charts, Throughput and Performance/Watt, at batch sizes 1, 2, 4, and 8, comparing P4 - INT8 with V100 - Mixed. Higher is better.]


WIDELY ADOPTED


NVIDIA INFERENCE MOMENTUM

Image Tagging, Video Analysis, Advertising Impact, Video Captioning, Visual Search, Finding Music, Sports Performance, Customer Service, Visual Search, Industrial Inspection, Voice Recognition, Cybersecurity


“Using GPUs made it possible to enable media understanding of the Twitter platform, by drastically reducing media deep learning model training time and by allowing us to derive real-time understanding of live videos at inference time.”

-Nicolas Koumchatzky, Head of Twitter Cortex


“At Microsoft, we are working hard to deliver the most advanced AI-powered services to our customers. Using NVIDIA GPUs in real-time inference workloads has improved Bing’s advanced search offerings, enabling us to reduce object detection latency for images. We look forward to working with NVIDIA’s next generation inference hardware and software to expand the way people benefit from AI products and services.”

-Jordi Ribas, CVP, Bing and AI Products, Microsoft

“AI is becoming increasingly pervasive, and inference is a critical capability customers need to successfully deploy their AI models, so we’re excited to support NVIDIA’s Turing Tesla T4 GPUs on Google Cloud Platform soon.”

-Chris Kleban, product manager at Google Cloud

SEOUL | NOVEMBER 7 - 8, 2018

www.nvidia.com/ko-kr/ai-conference/