High-performance deep network inference on GPUs with TensorRT


High-performance deep network inference on GPUs with TensorRT
Maxim Milakov, NVIDIA

v2

• You will learn:
  • How are GPUs used for DL now?
  • Why do you want to use GPUs for inference?
  • Why do you want to use TensorRT for inference on GPUs?

• This talk is NOT:
  • An intro to DL
  • A set of code samples

NVIDIA: The AI Computing Company

NVIDIA Powering the Deep Learning Ecosystem

DEEP LEARNING FRAMEWORKS: Mocha.jl, ...

COMPUTER VISION: Image Classification, Object Detection

SPEECH AND AUDIO: Voice Recognition

NATURAL LANGUAGE PROCESSING: Language Translation, Recommendation Engines, Sentiment Analysis

NVIDIA DEEP LEARNING SDK

NCCL, cuDNN, cuBLAS, cuSPARSE, TensorRT

ML development and deployment cycle

Training with SGD backpropagation

ImageNet: results for 2010-2014

[Chart: Top-5 error by year: 28% (2010), 26% (2011), 15% (2012), 11% (2013), 7% (2014); the share of teams using GPUs rises sharply over the same period]

Deployment scenarios - Hyperscale

• Input generated and output used at the client device

• Inference is running at data center

• High throughput

• On-the-fly batching
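The "on-the-fly batching" point above means grouping requests that arrive independently from many clients into one batch before running inference, trading a little extra latency for much higher throughput. A minimal, framework-agnostic C++ sketch of the idea follows; the Request type, the BatchQueue class and the 2 ms timeout are illustrative assumptions, not part of TensorRT.

// Collect requests arriving from many clients into one batch, then hand
// the batch to the inference engine. Purely illustrative.
#include <algorithm>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

struct Request { std::vector<float> input; };  // plus some way to return the result

class BatchQueue {
public:
    void push(Request r) {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push_back(std::move(r));
        cv_.notify_one();
    }

    // Wait until either maxBatch requests are queued or the timeout expires,
    // then return whatever has accumulated as one batch.
    std::vector<Request> collectBatch(std::size_t maxBatch,
                                      std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait_for(lock, timeout, [&] { return queue_.size() >= maxBatch; });
        std::size_t n = std::min(maxBatch, queue_.size());
        std::vector<Request> batch(queue_.begin(), queue_.begin() + n);
        queue_.erase(queue_.begin(), queue_.begin() + n);
        return batch;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::deque<Request> queue_;
};

// Serving loop (sketch):
//   auto batch = queue.collectBatch(128, std::chrono::milliseconds(2));
//   if (!batch.empty()) runInference(batch);   // e.g. a TensorRT execution context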

TensorRT for Hyperscale

Image Classification, Object Detection, Image Segmentation


Deployment scenarios - Embedded

• On-device inference

• Small-batch inference

• Low latency

GPU Inference Engine for Automotive

Pedestrian Detection, Lane Tracking, Traffic Sign Recognition


NVIDIA DRIVE PX 2

TensorRT

High-performance deep learning inference for production deployment

Up to 16x More Inference Efficiency

[Chart: Img/sec/watt at batch sizes 1, 8 and 128, CPU-only vs. Tesla M4 + TensorRT; GoogLeNet, CPU is a single-socket Haswell E5-2698 v3 @ 2.3 GHz with HT]

EMBEDDED: Jetson TX1

AUTOMOTIVE: Drive PX

DATA CENTER: Tesla M4

Comparing to DL frameworks

• Particularly effective at small batch-sizes

• Improves perf for complex networks the most

[Chart: GoogLeNet performance, TensorRT vs. DL frameworks; the Jetson TX1 HALF2 column uses fp16]

TensorRT

• Fuse network layers

• Eliminate concatenation layers

• Kernel specialization

• Auto-tuning for target platform

• Select optimal tensor layout

• Batch size tuning

[Diagram: TRAINED NEURAL NETWORK → TensorRT → OPTIMIZED INFERENCE RUNTIME]
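"Select optimal tensor layout" above means the runtime is free to store activations in whichever memory layout yields the fastest kernels on the target GPU. For reference, element addressing under two common dense 4D layouts looks like this (generic formulas, not a TensorRT interface):

#include <cstddef>

// Flat index of element (n, c, h, w) in a dense 4D activation tensor.
// The optimizer picks the layout (and matching kernels) that runs fastest
// on the target GPU; these formulas are generic, not TensorRT-specific.
inline std::size_t idxNCHW(std::size_t n, std::size_t c, std::size_t h, std::size_t w,
                           std::size_t C, std::size_t H, std::size_t W) {
    return ((n * C + c) * H + h) * W + w;   // channels outermost (after batch)
}

inline std::size_t idxNHWC(std::size_t n, std::size_t c, std::size_t h, std::size_t w,
                           std::size_t C, std::size_t H, std::size_t W) {
    return ((n * H + h) * W + w) * C + c;   // channels innermost
}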

Layers supported

• v1 designed for 2D images

• Layers supported:
  • Convolution: 2D
  • Activation: ReLU, tanh and sigmoid
  • Pooling: max and average
  • ElementWise: sum, product or max of two tensors
  • LRN: cross-channel only
  • Fully-connected: with or without bias
  • SoftMax: cross-channel only
  • Deconvolution
• Custom layers possible with a "sandwich" approach for now
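The "sandwich" approach in the last bullet means splitting the network at the unsupported layer: one engine covers the layers before it, another covers the layers after it, and the custom layer runs as user code in between. A rough C++ sketch, assuming the TensorRT 1.x/2.x execute(batchSize, bindings) call and a hypothetical user-supplied runCustomLayer kernel:

#include "NvInfer.h"   // TensorRT headers (nvinfer1 namespace)

// Hypothetical user-provided implementation of the unsupported layer,
// e.g. a hand-written CUDA kernel launch. Not part of TensorRT.
void runCustomLayer(const void* input, void* output, int batchSize);

// "Sandwich" inference: front engine -> custom layer -> back engine.
// buffersFront/buffersBack hold the device bindings of the two engines;
// the exact binding indices depend on how the two sub-networks were defined.
void inferWithCustomLayer(nvinfer1::IExecutionContext& front,
                          nvinfer1::IExecutionContext& back,
                          void* buffersFront[], void* buffersBack[],
                          int batchSize)
{
    front.execute(batchSize, buffersFront);                      // layers before the custom one
    runCustomLayer(buffersFront[1], buffersBack[0], batchSize);  // the unsupported layer
    back.execute(batchSize, buffersBack);                        // layers after it
}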

Optimizations

• Eliminate unused layers

• Vertical layer fusion: Fuse convolution, bias, and ReLU layers to form a single layer

• Horizontal layer fusion: Combine layers with the same source tensor and the same parameters
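To see why vertical fusion pays off, compare a literal conv → bias → ReLU execution (three kernels, three trips through memory for the activations) with a fused version that applies bias and ReLU while the convolution result is still at hand. Below is a plain scalar C++ illustration of the fused form (single channel, "valid" padding); it only illustrates the idea and is not how the GPU kernels are written.

#include <algorithm>
#include <cstddef>
#include <vector>

// Fused "CBR" (convolution + bias + ReLU) over a single-channel 2D input.
// Bias add and ReLU are applied while the accumulator is still in hand,
// so the intermediate convolution output is never written out and re-read.
std::vector<float> fusedConvBiasRelu(const std::vector<float>& in, std::size_t H, std::size_t W,
                                     const std::vector<float>& kernel, std::size_t K, float bias)
{
    const std::size_t outH = H - K + 1, outW = W - K + 1;
    std::vector<float> out(outH * outW);
    for (std::size_t y = 0; y < outH; ++y)
        for (std::size_t x = 0; x < outW; ++x) {
            float acc = 0.f;                                        // convolution
            for (std::size_t ky = 0; ky < K; ++ky)
                for (std::size_t kx = 0; kx < K; ++kx)
                    acc += in[(y + ky) * W + (x + kx)] * kernel[ky * K + kx];
            acc += bias;                                            // fused bias
            out[y * outW + x] = std::max(acc, 0.f);                 // fused ReLU
        }
    return out;
}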

Optimizations: Original network

[Diagram: an Inception-style module; the input feeds 1x1, 3x3 and 5x5 convolution branches (the 3x3 and 5x5 preceded by 1x1 reductions) plus a max pool branch, each convolution followed by bias and ReLU, with concat layers joining the branches into the next input]

Optimizations: Vertical layer fusion

[Diagram: each convolution + bias + ReLU sequence is fused into a single CBR block (1x1, 3x3, 5x5 CBR); concat and max pool are unchanged]

Optimizations: Horizontal layer fusion

[Diagram: the 1x1 CBR blocks that read the same input tensor are combined into one wider 1x1 CBR]

Optimizations: Concat elision

[Diagram: the concat layers are elided; the CBR and max pool outputs feed the next input directly]

TensorRT – two-phase deployment

• Build:
  • Apply optimizations on the network configuration
  • Generate an optimized plan for computing the forward pass
• Deploy:
  • Run the forward pass and output the inference result

[Diagram: the Build phase takes the model file, the network layers and I/O definition, and the max batch size, and produces an optimized plan file; the Deploy phase takes the plan file, the inputs and the batch size, and produces the output]
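In code, the Build and Deploy phases above map roughly onto the TensorRT C++ API of this generation (1.x/2.x). The sketch below assumes a Caffe model; the file names and the "prob" output blob are placeholders, error handling and CUDA buffer allocation are omitted, and exact signatures changed between releases, so treat it as an outline rather than a definitive listing.

// Rough sketch of the two phases with the TensorRT 1.x/2.x C++ API.
#include "NvInfer.h"
#include "NvCaffeParser.h"
#include <cstdio>

using namespace nvinfer1;
using namespace nvcaffeparser1;

// TensorRT expects a logger object from the application.
struct Logger : public ILogger {
    void log(Severity severity, const char* msg) override { std::printf("%s\n", msg); }
} gLogger;

ICudaEngine* buildPhase(int maxBatchSize)
{
    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();

    // Import the trained network (placeholder Caffe files) and mark its output.
    ICaffeParser* parser = createCaffeParser();
    auto* blobs = parser->parse("deploy.prototxt", "net.caffemodel",
                                *network, DataType::kFLOAT);
    network->markOutput(*blobs->find("prob"));

    // The optimizations (fusion, kernel selection, layout, ...) happen here,
    // for the requested max batch size, producing an optimized engine ("plan").
    builder->setMaxBatchSize(maxBatchSize);
    builder->setMaxWorkspaceSize(1 << 20);
    ICudaEngine* engine = builder->buildCudaEngine(*network);

    network->destroy();
    parser->destroy();
    builder->destroy();
    return engine;   // can also be serialized to a plan file for later deployment
}

void deployPhase(ICudaEngine* engine, void* deviceInput, void* deviceOutput, int batchSize)
{
    // At deployment time only the lightweight runtime work remains: bind the
    // input/output device buffers and run the forward pass.
    IExecutionContext* context = engine->createExecutionContext();
    void* bindings[] = { deviceInput, deviceOutput };
    context->execute(batchSize, bindings);
    context->destroy();
}

The point of the split is that everything expensive (parsing, fusion, kernel selection, layout choice) happens once at build time, while the deploy phase only binds buffers and runs the forward pass.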

Pascal GPUs for inference

TESLA P4: Maximum Efficiency for Scale-out Servers

TESLA P40: Highest Throughput for Scale-up Servers

P40/P4 – New "INT8" for Inference

TensorRT v2: int8 accuracy

• Almost the same accuracy as fp32 for major models

• Still working on the procedure to make quantization optimal
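The quantization procedure referred to above maps fp32 values onto 8-bit integers plus a scale factor. How the dynamic range is chosen is exactly the part the slide says is still being tuned; the sketch below only shows the simplest symmetric max-abs scheme, as an assumption of what such a mapping looks like, not TensorRT's actual calibration.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Basic symmetric linear quantization to int8: pick a scale from the
// tensor's maximum absolute value, then round and clamp every element.
struct QuantizedTensor {
    std::vector<int8_t> values;
    float scale;                     // real_value ~= scale * int8_value
};

QuantizedTensor quantizeInt8(const std::vector<float>& x)
{
    float maxAbs = 0.f;
    for (float v : x) maxAbs = std::max(maxAbs, std::fabs(v));

    QuantizedTensor q;
    q.scale = maxAbs > 0.f ? maxAbs / 127.f : 1.f;
    q.values.reserve(x.size());
    for (float v : x) {
        float scaled = std::round(v / q.scale);
        scaled = std::min(127.f, std::max(-127.f, scaled));  // clamp to int8 range
        q.values.push_back(static_cast<int8_t>(scaled));
    }
    return q;
}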

TensorRT v2: int8 performance

• Up to 3x performance
  • No FFT and Winograd yet
  • Constant factors
  • Beneficial on small batches even on the bigger P40

TensorRT v2: more performance

All results are measured, based on GoogLeNet with batch size 128. Xeon uses MKL 2017 GOLD with FP32; the GPUs use a TensorRT internal development version.

Platform               Throughput (img/sec)   Efficiency (img/sec/watt)
E5-2690v4, 14 cores             178                     1.4
M4 (FP32)                       480                    12.3
M40 (FP32)                    1,514                    10.6
P100 (FP16)                   4,121                    27.9
P4 (INT8)                     3,200                    91.1
P40 (INT8)                    6,514                    56.3

P40 for max inference throughput: >35x the CPU in img/sec
P4 for max inference efficiency: >60x the CPU in img/sec/watt

Deep Learning Everywhere

• developer.nvidia.com/tensorrt

• developer.nvidia.com/deep-learning

• developer.nvidia.com/cuda-zone

• mmilakov@nvidia.com

Backup slides

Tesla Products Decoder
