Jack Han / 한재근 | Solutions Architect | NVIDIA
Inference Optimization Using TensorRT
with Use Cases
TensorRT 4 Adoption
Video
Maps
Image
NLP
Speech
Search
Use Cases
AI Inference is exploding
SPEECH
1 Billion Voice Searches Per Day
Google, Bing, etc.
LIVE VIDEO
1 Billion Videos Watched Per Day
RECOMMENDATIONS
1 Trillion Ads/Rankings Per Day
Impressions
Bigger and More Compute-Intensive in Quest for Accuracy
[Charts: model size in GOP * Bandwidth over time]
Image, 2011 to 2017: ResNet-50, Inception-v2, Inception-v4, AlexNet, GoogleNet (350X)
Speech, 2013 to 2018: DeepSpeech, DeepSpeech 2, DeepSpeech 3 (30X)
Translation, 2014 to 2018: MoE, OpenNMT, GNMT (10X)
Cambrian Explosion
Convolutional Networks: ReLu, Encoder/Decoder, BatchNorm, Dropout, Pooling, Concat
Recurrent Networks: GRU, LSTM, CTC, Beam Search, WaveNet, Attention
Generative Adversarial Networks: Speech Enhancement GAN, Coupled GAN, Conditional GAN, MedGAN, 3D-GAN
Reinforcement Learning: DQN, Simulation, DDPG
New Species: Neural Collaborative Filtering, Mixture of Experts, Block Sparse LSTM, Capsule Nets
Inefficiency Limits Innovation
Custom Development
Developers need to reinvent the plumbing for every application
Single Model Only
Some systems are overused while others are underutilized
[Diagram: ASR, NLP, Recommender]
Single Framework Only
Solutions can only support models from one framework
More Accuracy = Time
Speed/accuracy trade-offs for modern convolutional object detectors, Google Research
Inference Performance
Latency, Efficiency, Throughput
Easy Deploy
Inference Optimization
Lightweight Model
Small computation
Suitable for lightweight hardware
E.g.:
MobileNet
SqueezeNet
SEP-Nets
Deep Compression
No network modification
Continued innovation
E.g.:
Quantization and Binarization
Network Pruning and Sharing
Distillation / Factorization
TensorRT: From Framework to Target
[Diagram: FRAMEWORKS -> TensorRT (Optimizer, Runtime) -> GPU PLATFORMS]
GPU platforms: DRIVE PX 2, JETSON TX2, NVIDIA DLA, TESLA T4, TESLA V100
Speed-up
[Chart: ResNet-50 on V100. Throughput (images/sec) vs. Batch Response time (ms). Series: TensorRT INT8, TensorRT FP16, TensorRT FP32, GPU Native FP32, CPU Native FP32]
TensorRT 5
More Layers / Plugin / APIs
Inference Server Integration
TensorRT Workflow
Step 1. Optimize trained model: Model -> Optimize -> Plan
Step 2. Deploy optimized plan: Plan -> Runtime Engine -> Embedded, Automotive, Data center
Step-by-Step TensorRT Plan Build
[Diagram: Network C++/Python API or Model Parser -> Network Definitions -> TensorRT Builder -> Engine]
https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#support_op
7 Steps to Deployment with TensorRT
Step 1: Convert trained model into TensorRT format
Step 2: Create a model parser
Step 3: Register inputs and outputs
Step 4: Optimize model and create a runtime engine
Step 5: Serialize optimized engine
Step 6: De-serialize engine
Step 7: Perform inference
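The steps above can be sketched with the TensorRT 5-era Python API. This is a rough sketch rather than the presenter's code. It assumes Step 1 has already produced a UFF file, and every path, tensor name, and shape passed in is a hypothetical placeholder. It needs TensorRT and a GPU to actually run, and later TensorRT releases renamed or removed several of these calls.

```python
def build_and_serialize(uff_path, input_name, input_shape, output_name,
                        plan_path, max_batch_size=1):
    """Steps 2 to 5: parse a UFF model, build an engine, write the plan."""
    import tensorrt as trt  # lazy import so this file loads without TensorRT

    logger = trt.Logger(trt.Logger.WARNING)
    with trt.Builder(logger) as builder, \
            builder.create_network() as network, \
            trt.UffParser() as parser:
        # Steps 2 and 3: create the parser, register inputs and outputs.
        parser.register_input(input_name, input_shape)
        parser.register_output(output_name)
        parser.parse(uff_path, network)

        # Step 4: optimize the network and create the engine.
        builder.max_batch_size = max_batch_size
        builder.max_workspace_size = 1 << 30  # 1 GiB scratch space (assumed)
        engine = builder.build_cuda_engine(network)
        if engine is None:
            raise RuntimeError("TensorRT could not build an engine")

        # Step 5: serialize the optimized engine to a plan file.
        with open(plan_path, "wb") as f:
            f.write(engine.serialize())


def load_engine(plan_path):
    """Step 6: de-serialize a plan file back into an engine."""
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    with open(plan_path, "rb") as f, trt.Runtime(logger) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

# Step 7 (not shown): create an execution context from the engine with
# engine.create_execution_context() and run it on device buffers.
```

Separating the build from the load mirrors the two-stage workflow: the expensive optimization runs once, and deployment only reads the plan.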
Custom Layer Support Using Plugin Layer
[Diagram: Network C++/Python API or Model Parser with Plugin Factory (Plugin A, Plugin B) -> Network Definitions -> TensorRT Builder -> Engine]
Plugin Layer Procedure
[Diagram: Plugin Layer, ICudaEngine, Plugin Factory]
Phase 1. Initialize: Layer? (missing) -> Initialize
Phase 2. Inference: enqueue
Phase 3. Store/Load: Serialize, Deserialize
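The three phases can be illustrated with a toy, pure-Python stand-in. This is not the TensorRT plugin interface: `ScalePlugin`, `ScalePluginFactory`, and their method signatures are hypothetical and only mimic the order of calls (initialize, enqueue, serialize, then re-creation through a factory).

```python
import struct


class ScalePlugin:
    """Toy custom layer that multiplies every input by a constant factor."""

    def __init__(self, factor):
        self.factor = factor
        self.ready = False

    def initialize(self):
        # Phase 1: acquire resources before the first inference call.
        self.ready = True
        return 0

    def enqueue(self, inputs):
        # Phase 2: do the layer's work for one batch.
        if not self.ready:
            raise RuntimeError("initialize() must run before enqueue()")
        return [x * self.factor for x in inputs]

    def serialize(self):
        # Phase 3 (store): pack the layer parameters into bytes.
        return struct.pack("<f", self.factor)


class ScalePluginFactory:
    """Toy factory that rebuilds a plugin from its serialized bytes."""

    def create_plugin(self, layer_name, serial_data):
        # Phase 3 (load): restore the parameters written by serialize().
        (factor,) = struct.unpack("<f", serial_data)
        return ScalePlugin(factor)


plugin = ScalePlugin(0.5)
plugin.initialize()
out = plugin.enqueue([2.0, 4.0])

blob = plugin.serialize()
restored = ScalePluginFactory().create_plugin("scale", blob)
restored.initialize()
restored_out = restored.enqueue([2.0, 4.0])
```

The factory is what lets a stored engine find the code for a layer the runtime does not know natively.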
Supporting Features
Feature             C++   Python (x86 only)   NvCaffeParser   NvUffParser
CNNs                Yes   Yes                 Yes             Yes
RNNs                Yes   Yes                 No              No
INT8 Calibration    Yes   Yes                 N/A             N/A
Asymmetric Padding  Yes   Yes                 No              No
TensorRT API: Define your network using C/C++ or Python
INetworkDefinition* network = builder->createNetwork();
TensorRT RNN Programming
Dump Weights -> Load Weights -> Convert Weights -> Set Weights
Dump Weights: ckpt -> wts (TF format), with /usr/src/tensorrt/samples/common/dumpTFWts.py
Convert Weights: TF format -> TRT format, with convertRNNWeights, convertRNNBias
Set Weights: TensorRT RNN Layer, with setWeightsForGate, setBiasForGate
Get started with sampleCharRNN in documentation
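The "Convert Weights" stage mostly amounts to re-slicing one fused framework matrix into per-gate matrices. Here is a minimal sketch of that idea. It is not the sample's convertRNNWeights: `split_lstm_kernel` is a hypothetical helper, and both the gate order and the fused [input + hidden, 4 * hidden] layout are assumptions about the TensorFlow LSTM format.

```python
# Assumed column-block order of the fused TensorFlow LSTM kernel.
TF_GATE_ORDER = ("input", "cell", "forget", "output")


def split_lstm_kernel(kernel, input_size, hidden_size):
    """Split a fused kernel into per-gate input (W) and recurrent (R) matrices.

    kernel has (input_size + hidden_size) rows and 4 * hidden_size columns.
    Each returned matrix is transposed to one row per hidden unit.
    """
    assert len(kernel) == input_size + hidden_size
    assert all(len(row) == 4 * hidden_size for row in kernel)

    gates = {}
    for g, name in enumerate(TF_GATE_ORDER):
        cols = range(g * hidden_size, (g + 1) * hidden_size)
        w = [[kernel[r][c] for r in range(input_size)] for c in cols]
        r = [[kernel[input_size + i][c] for i in range(hidden_size)]
             for c in cols]
        gates[name] = {"W": w, "R": r}
    return gates


# Tiny example: 2 inputs, 1 hidden unit, so the kernel is 3 rows by 4 columns.
kernel = [[1, 2, 3, 4],
          [5, 6, 7, 8],
          [9, 10, 11, 12]]
gates = split_lstm_kernel(kernel, input_size=2, hidden_size=1)
```

Each per-gate matrix could then be handed to the RNN layer one gate at a time, which is the role setWeightsForGate plays on the slide.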
NMT with Transformer Inference
Running 3 bound engines (Encoder, Generator, Shuffle Engine)
[Diagram stages: Encoder, Decoder, Beam Search]
[Diagram blocks: Input, Input Setup, Encoder RNN, Decoder RNN, Attention Model, Projection, TopK, Beam Scoring, Beam Shuffle, Batch Reduction, Output]
* CPU-Only: TensorFlow on SKL 6140 18 core, FP32
GPU: V100, TensorRT 5, FP16; Sorted data, Batch=128, English to German
Runs on CPU
GPU-Accelerated
Support NMT layers such as Gather, Softmax, Batch GEMM and Top K
Modular Network Merge
Deploy highly-optimized language translation apps in production environments
Get started with NMT sample in documentation
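The TopK and beam scoring blocks of the decoder can be sketched as a single beam search step. This is a plain-Python illustration, not the NMT sample's implementation; `beam_step` and its data layout are hypothetical.

```python
import heapq
import math


def beam_step(beams, next_log_probs, beam_width):
    """Extend every beam by one token and keep the best beam_width results.

    beams: list of (tokens, score) pairs, score being a summed log probability.
    next_log_probs: one dict per beam mapping next token -> log probability.
    """
    candidates = []
    for (tokens, score), log_probs in zip(beams, next_log_probs):
        for token, log_prob in log_probs.items():
            candidates.append((score + log_prob, tokens + [token]))
    # Top K over all candidates from all beams.
    best = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return [(tokens, score) for score, tokens in best]


beams = [(["a"], math.log(0.6)), (["b"], math.log(0.4))]
next_log_probs = [
    {"x": math.log(0.7), "y": math.log(0.3)},  # continuations of "a"
    {"x": math.log(0.9), "y": math.log(0.1)},  # continuations of "b"
]
result = beam_step(beams, next_log_probs, beam_width=2)
```

Working in log probabilities turns the product of token probabilities into a sum, which keeps scores numerically stable for long sentences.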
Recommendation Inference
[Diagram: User, Context, Item 1, Item 2, ..., Item N -> Embedding -> Concat -> Fused MLP Kernel (Fully Connected, Bias, Activation) xN -> TopK -> Output Top Scores]
Runs on CPU
GPU-Accelerated
* CPU-Only: TensorFlow on Intel Xeon E5-2698 v4 CPU at 2.20 GHz; GPU: V100 with custom networks
Deploy multi-layer perceptron (MLP) based recommendation apps in production
Predict events (click, purchase, interest) accurately based on input items (user, query, observed activity)
Get started with examples on MNIST character recognition and movie recommender system in the documentation
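The recommendation pipeline on this slide (embedding lookup, concat, stacked Fully Connected + Bias + Activation blocks, then TopK) can be sketched in a few lines. This is an illustrative toy with made-up weights, and it uses ReLU as the activation, which is an assumption. The function names are hypothetical and no GPU fusion is implied.

```python
import heapq


def dense_relu(x, weights, bias):
    """One Fully Connected + Bias + Activation block (ReLU assumed)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def score_item(user_vec, item_vec, layers):
    """Concat the two embeddings and run them through the MLP."""
    x = user_vec + item_vec  # Concat
    for weights, bias in layers:
        x = dense_relu(x, weights, bias)
    return x[0]  # the last layer has a single output: the score


def recommend(user_id, user_emb, item_emb, layers, k):
    """Score every item for one user and return the k best item ids."""
    scores = {item: score_item(user_emb[user_id], vec, layers)
              for item, vec in item_emb.items()}
    return heapq.nlargest(k, scores, key=scores.get)


# Made-up embedding tables and a two-layer MLP.
user_emb = {"u1": [1.0, 0.0]}
item_emb = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [0.5, 0.5]}
layers = [
    ([[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, -1.0]], [0.0]),
]
top_items = recommend("u1", user_emb, item_emb, layers, k=2)
```

The "Fused MLP Kernel" on the slide corresponds to running the three operations inside `dense_relu` as one GPU kernel instead of three.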
Inferencing with Low precision
Peak Performance (TFLOPS / TOPS)
P4: Float 5.5, INT8 22
T4: Float 65, INT8 130, INT4 260
Post Training Quantization
Minimize information loss between FP32 and INT8 inference on calibration dataset
100 mini-batches (~500 samples) are sufficient to determine quantization parameter
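How a calibration dataset yields a quantization parameter can be sketched as follows. This is a deliberately simplified stand-in: it picks a symmetric scale from the largest absolute value seen, whereas the slide describes choosing it to minimize information loss between FP32 and INT8. All function names are hypothetical.

```python
def calibrate_scale(calibration_batches):
    """Pick one symmetric INT8 scale from the calibration data (max-abs)."""
    max_abs = max(abs(v) for batch in calibration_batches for v in batch)
    if max_abs == 0:
        raise ValueError("calibration data is all zeros")
    return max_abs / 127.0


def quantize(values, scale):
    """Map FP32 values to INT8 codes, saturating at +/-127."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(codes, scale):
    """Map INT8 codes back to approximate FP32 values."""
    return [c * scale for c in codes]


# Two tiny "mini-batches" of activations stand in for the calibration set.
batches = [[0.5, -1.0], [2.54, 0.1]]
scale = calibrate_scale(batches)

codes = quantize([1.0, -2.54, 5.0], scale)
approx = dequantize(codes, scale)
```

Values beyond the calibrated range, such as 5.0 here, saturate at 127. A representative calibration set matters for that reason.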
New INT8 APIs and Optimizations
Auto Calibration: FP32 Training -> FP32 weights + Calibration Data, O(1000) Images -> Optimize to INT8
Custom Calibration: FP32 Training -> FP32 weights -> Custom Calibration -> FP32 or INT8 weights + Custom Activation Ranges -> Optimize to INT8
Quantization Aware Training: Quantization Aware Training in FP32 -> FP32 or accurate INT8 weights + Custom Activation Ranges -> Optimize to INT8
Maximize throughput at low latency with mixed precision compute in production
Apply INT8 quantization aware training or custom calibration algorithms with new APIs
Control precision per-layer with new APIs
Optimizations for depthwise convolution operation
Get started with the INT8 sample in the documentation
Inference Performance
[Charts: Throughput and Performance/WATT for P4 - INT8 and V100 - Mixed; x-axis: 1, 2, 4, 8. Higher is better.]
WIDELY ADOPTED
NVIDIA INFERENCE MOMENTUM
Image Tagging, Video Analysis, Advertising Impact, Video Captioning, Visual Search,
Finding Music, Sports Performance, Customer Service, Visual Search, Industrial Inspection,
Voice Recognition, Cybersecurity
“Using GPUs made it possible to enable media
understanding of the Twitter platform, by
drastically reducing media deep learning model
training time and by allowing us to derive real-time
understanding of live videos at inference time.”
-Nicolas Koumchatzky, Head of Twitter Cortex
“ At Microsoft, we are working hard to deliver the most advanced AI-powered services to our
customers. Using NVIDIA GPUs in real-time inference workloads has improved Bing’s advanced search
offerings, enabling us to reduce object detection latency for images. We look forward to working
with NVIDIA’s next generation inference hardware and software to expand the way people benefit
from AI products and services.”
-Jordi Ribas, CVP, Bing and AI Products, Microsoft
“ AI is becoming increasingly pervasive, and inference is a critical capability customers need to
successfully deploy their AI models, so we’re excited to support NVIDIA’s Turing Tesla T4 GPUs on
Google Cloud Platform soon.”
-Chris Kleban, product manager at Google Cloud
SEOUL | NOVEMBER 7 - 8, 2018
www.nvidia.com/ko-kr/ai-conference/