
Maggie Zhang (张雪萌) maggiez@nvidia.com

Accelerate Deep Learning Training at Scale on GPUs

AGENDA

● Introduction

● Why do we need to scale training

● How to achieve scaling

3

DL Training: from single GPU to multi-node

ResNet50 v1.5 training time:

● 2015: 36,000 mins (25 days), 1x K80, CUDA
● 2016: 1,200 mins (20 hours), DGX-1P, NVLink
● 2017: 480 mins (8 hours), DGX-1V, Tensor Core
● 2018: 70 minutes on MLPerf, DGX-2H, NVSwitch
● 2018 (at scale): 6.3 minutes on MLPerf, DGX Cluster
● 2019: 52.7 minutes on MLPerf, DGX-2H, NVSwitch
● 2019 (at scale): 1.33 minutes on MLPerf, DGX SuperPOD

4

The whole stack must be considered

● Compute

● Network

● Storage

● Frameworks & Libraries

● Numerical methods

● Training recipes

5

MLPerf: NVIDIA advancing AI training

Time to train: from 8 hours to 80 seconds

2019 MLPerf ID (in order from top to bottom of chart): ResNet-50: 0.6-30 | Transformer: 0.6-28 | GNMT: 0.6-14 | SSD: 0.6-27 | Mini-Go: 0.6-11 | Mask R-CNN: 0.6-23

6

Largest TensorFlow model at scale
Oak Ridge National Lab scales a TensorFlow climate analytics model up to 27,360 V100 GPUs

Source: https://arxiv.org/pdf/1810.01993.pdf

2018 Gordon Bell Prize Winner

AGENDA

● Introduction

● Why do we need to scale training

● How to achieve scaling

8

Datasets getting larger

● Unlabeled data:
○ Language models: BooksCorpus (800M words), English Wikipedia (2.5B words), WebText (8M documents, 40 GB), C4 (Common Crawl, 745 GB)
○ GANs: unlabeled images and videos
○ Reinforcement learning: unsupervised self-play generates unlimited data

● Labeled data:
○ ImageNet (2012): 1.3M images, 1,000 categories; Open Images (2019): 9M images, 6,000 categories
○ Semi-autonomous vehicles: 0.5-1.1 TB of data for every 8 hours of driving

9

DL models increasing in complexity

● Image Recognition, Speech Recognition, Object Detection (Autonomous Vehicles, Social Tagging, Visual Search): ~26M parameters
● NLP (Q&A, Sentiment, Translation): ~340M parameters
● NLP – Generative Tasks (Chatbots, E-mail auto-completion, Document Summarization): ~1.5Bn parameters

Next-level use-cases require gigantic models

Project Megatron (https://github.com/NVIDIA/Megatron-LM):
● 8.3B parameters
● 8-way Model Parallel
● 64-way Data Parallel
● 24x larger than BERT

AGENDA

● Introduction

● Why do we need to scale training

● How to achieve scaling

11

Scaling == whack-a-mole?

Solve one bottleneck and another one pops up

12

Multi-node infrastructure requirements

System Design + Data Center Management + SW Stack → Multi-Node Success

13

Challenges of multi-node DL training

● Hardware GPU cluster design:
○ Compute: significant CPU to GPU ratio, interconnect with GPU
○ Storage: high speed NFS, multi-tier caching
○ Networking: topology and bandwidth, NVLINK, GPUDirect RDMA

● GPU cluster management:
○ Scheduler: Slurm vs. Kubernetes
○ Container technologies: Docker, Enroot, Singularity, etc.

● Integrated software stack:
○ NVIDIA libraries: CUDA, cuDNN, NCCL
○ DL Framework scale-out optimization
○ Model scale-out implementation & optimization

14

A basic recipe for deep learning scaling

Step 1: Optimize your single GPU model

Step 2: Scale to multiple GPUs on one node

Step 3: Scale to multiple nodes

15

Case study

• BERT model scripts with configurations for convergence, from 8 to 1500 GPUs, multi-node ready:
  https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/
  https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT

• Clone and train your own BERT model on multi-node, or download a pre-trained BERT model from NGC and fine-tune it for your NLP task

Bidirectional Encoder Representations from Transformers

Super Human Question & Answering

NVIDIA Deep Learning Examples have many model scripts with best practices for accuracy and performance

16

• Pre-training on unlabelled data opens up opportunities to use massive amounts of data:
  • BooksCorpus (800 million words)
  • English Wikipedia (2.5 billion words), multi-language Wikipedia
  • WebText (OpenAI, 8M documents, 40 GB of text)

• More data tends to lead to better accuracy

• BERT pre-training is computationally intensive and takes days to train even on the most powerful single node: BERT-Large (330M parameters) takes ~2.5 days to train on a single DGX-2 server with 16 V100 GPUs.

Why multi-node BERT training

17

BERT multi-node pre-training performance

Metric: time to train

DGX-1 (16 GB)

Nodes    GPUs    Time to train (hrs)
1        8       153.6 (6.3 days)
4        32      39.3
16       128     10.4

DGX-2H (32 GB)

Nodes    GPUs    Time to train (hrs)
1        16      58.4 (2.4 days)
4        64      15.4
16       256     3.9
64       1024    1.2

Source: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT#pre-training-loss-results

* Time to train is measured for mixed precision training to loss 1.3 in PyTorch, with the LAMB optimizer
** Gradient accumulation is applied to the DGX-2H 1, 4, and 16 node configurations

18

• Create efficient data pipeline

• Enable mixed precision training

• Enable XLA

• Ensure latest GPU libraries

• Develop model in container to facilitate scaling out

Step 1: Optimize model

19

Step 1: Optimize model

• Use tf.data to create performant input pipelines

• Test I/O bottlenecks with a trivial model

• NVIDIA DALI accelerates image-based input pipelines

Data pipeline
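One way to run the "trivial model" I/O check above is to time how fast batches come out of the tf.data pipeline with no compute attached. A minimal sketch (TF 2.x eager style; build_dataset() is a hypothetical helper standing in for your real input pipeline):

import time
import tensorflow as tf

def benchmark_input_pipeline(dataset, num_batches=100):
    """Pull `num_batches` batches through the pipeline with a no-op consumer."""
    start = time.perf_counter()
    for _ in dataset.take(num_batches):
        pass  # trivial "model": no compute, just drain the pipeline
    elapsed = time.perf_counter() - start
    return num_batches / elapsed  # batches per second

# Usage (build_dataset is a hypothetical helper returning a batched tf.data.Dataset):
# ds = build_dataset(input_files, batch_size=64)
# print("input pipeline: %.1f batches/sec" % benchmark_input_pipeline(ds))

If this rate is barely above your end-to-end training throughput, the input pipeline is the bottleneck.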

20

d = tf.data.Dataset.from_tensor_slices(tf.constant(input_files))
d = d.repeat()
d = d.shuffle(buffer_size=len(input_files))

# `cycle_length` is the number of parallel files that get read.
cycle_length = min(num_cpu_threads, len(input_files))
d = d.apply(
    tf.contrib.data.parallel_interleave(
        tf.data.TFRecordDataset,
        cycle_length=cycle_length))

d = d.shuffle(buffer_size=100)

d = d.apply(
    tf.contrib.data.map_and_batch(
        lambda record: _decode_record(record, name_to_features),
        batch_size=batch_size,
        num_parallel_batches=num_cpu_threads,
        drop_remainder=True if is_training else False))

BERT

TFRecord - fast binary format

Parallel read, map, & batch

Fused map & batch op

Data pipeline

https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/run_pretraining.py

21

Step 1: Optimize model

• 1-line optimizer wrapper:
  opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)

• Up to 3x speedup in training on Tensor Cores, with:
  • Same accuracy
  • No change in hyperparameters
  • ½ memory bandwidth & footprint

• Optimal on Volta and Turing GPUs

Automatic Mixed Precision (AMP)
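A minimal sketch of where the wrapper sits in a TF 1.14+ graph-mode script; the toy layer and loss below are only there to make the example self-contained:

import tensorflow as tf

# Toy graph so the example is self-contained.
x = tf.placeholder(tf.float32, shape=[None, 1024])
labels = tf.placeholder(tf.int32, shape=[None])
logits = tf.layers.dense(x, 10)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))

opt = tf.train.AdamOptimizer(learning_rate=1e-4)
# One line: rewrites the graph to run eligible ops in FP16 on Tensor Cores
# and adds automatic loss scaling.
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
train_op = opt.minimize(loss)

In NGC TensorFlow containers the same rewrite can also be enabled via the TF_ENABLE_AUTO_MIXED_PRECISION=1 environment variable, with no code change at all.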

22

Step 1: Optimize model
Automatic Mixed Precision (AMP)

• Robust speedup across different TensorFlow workloads

• https://arxiv.org/abs/1710.03740

23

Step 1: Optimize model
XLA (Accelerated Linear Algebra)

• TensorFlow XLA can accelerate models with minimal code changes

• XLA optimizes the graph, mostly by fusing compatible kernels

• Set the XLA optimization level:
  https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/run_pretraining.py#L531

config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

System config: Xeon E5-2698 v4 CPU with 256GB system RAM, single V100 Tensor Core GPU 32GB. Tests run using NVIDIA 18.11 TensorFlow container.
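A minimal sketch of how that config line is typically applied in a TF 1.x session-based script; in Estimator-based scripts such as the BERT example, the same ConfigProto is usually passed in via tf.estimator.RunConfig(session_config=config):

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Turn on XLA auto-clustering for the whole graph (same line as above).
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

with tf.Session(config=config) as sess:
    ...  # build and run the training graph under the XLA-enabled config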

24

Step 1: Optimize model

• Latest compatible features and tuning from CUDA toolkit and Deep Learning Libraries (cuDNN, cuBLAS, NCCL)

Latest GPU optimizations

25

Step 1: Optimize model

• NGC containers: fully featured DL containers

• DL frameworks compiled with latest GPU libraries

• Portability of application libraries facilitates multi-node scale-out

Latest GPU optimizations

26

27

• Understand Data Parallel training concepts

• Ensure optimal inter-GPU communication

• Apply high level API for multi-GPU training

Step 2: Scale to multiple GPUs

28

Step 2: Scale to multiple GPUs

• Single GPU

Under the hood

29

Step 2: Scale to multiple GPUs

• Multiple GPUs

• Data parallel training

Under the hood

• Allreduce algorithm

• NCCL: NVIDIA Collective Communications Library
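What the allreduce step computes in data-parallel training, illustrated with plain NumPy for clarity; in practice NCCL performs this reduction directly between GPU memories:

import numpy as np

num_workers = 4
# Each worker computes gradients on its own shard of the global batch.
local_grads = [np.random.randn(10) for _ in range(num_workers)]

# Allreduce: every worker ends up with the same summed (or averaged) result.
summed = np.sum(local_grads, axis=0)
averaged = summed / num_workers

# All replicas apply the identical averaged gradient, so their weights stay in sync.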

30

• Inter-GPU communication:

Step 2: Scale to multiple GPUs
Under the hood

[Chart: effective bandwidth in GB/s]

31

• Full non-blocking bandwidth

Step 2: Scale to multiple GPUs
Under the hood

32

Step 2: Scale to multiple GPUs

• Popular approach to enable multi-GPU/multi-node in TensorFlow/Keras

• Strong NCCL integration

• Sample commands:

• Single-node (4 GPUs):

horovodrun -np 4 -H localhost:4 python train.py

• Multi-node (4 nodes with 4 GPUs each):

horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py

Approach 1: Horovod

33

Step 2: Scale to multiple GPUs

Approach 1: Horovod

import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model...
loss = ...
opt = tf.train.AdamOptimizer(learning_rate=0.01 * hvd.size())

# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)

# Add hook to synchronize initial state
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Make training operation
train_op = opt.minimize(loss)

# Only checkpoint on rank 0
ckpt_dir = "/tmp/train_logs" if hvd.rank() == 0 else None

# Session
with tf.train.MonitoredTrainingSession(checkpoint_dir=ckpt_dir,
                                       config=config, hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training.
        mon_sess.run(train_op)

34

• Recently released native API that also supports Allreduce with NCCL

• Multi-GPU: tf.distribute.MirroredStrategy

• Multi-node: tf.distribute.experimental.MultiWorkerMirroredStrategy

Step 2: Scale to multiple GPUs
Approach 2: tf.distribute.Strategy

Source: https://www.tensorflow.org/guide/distributed_training
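A minimal sketch of the single-node, multi-GPU case with tf.distribute.MirroredStrategy and Keras (TF 2.x); the model and data below are toy stand-ins:

import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()   # one replica per visible GPU
print("replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():                        # variables are created mirrored across GPUs
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

x = np.random.rand(1024, 784).astype("float32")
y = np.random.randint(0, 10, size=(1024,))
model.fit(x, y, batch_size=256, epochs=1)     # gradients are allreduced across replicas

# Multi-node variant (each worker needs TF_CONFIG set):
# strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()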

35

• Adopt optimizer designed for large batch size

• Ensure effective inter-node communication

• Move data close to compute

• Consider full application & system software stack

Step 3: Scale to multiple nodes

36

• Optimizer inspired by LARS
  • Layerwise adaptive learning rate (You et al.)

• Allows training at huge global batch size
  • Originally, BERT+Adam (Devlin et al.): global batch 256
  • BERT+LAMB (You et al.): global batch 64k

• Massive data parallelism

• Lower interconnect pressure with gradient accumulation (sketched below)

Step 3: Scale to multiple nodes
LAMB optimizer
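The gradient accumulation point above, sketched with plain NumPy bookkeeping only: run several micro-batches locally and communicate once per accumulation window, which lowers allreduce frequency at a fixed global batch size.

import numpy as np

accumulation_steps = 4
accumulated = np.zeros(10)

for _ in range(accumulation_steps):
    micro_batch_grad = np.random.randn(10)   # stand-in for one local backward pass
    accumulated += micro_batch_grad

effective_grad = accumulated / accumulation_steps
# Only `effective_grad` is allreduced and applied: one communication per 4 micro-batches.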

37

BERT+LAMB

Robustly scales to large batch sizes

https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/optimization.py

class LAMBOptimizer(tf.train.Optimizer):
  """A LAMB optimizer that includes "correct" L2 weight decay."""

  def __init__(self,
               learning_rate,
               weight_decay_rate=0.0,
               beta_1=0.9,
               beta_2=0.999,
               epsilon=1e-6,
               exclude_from_weight_decay=None,
               name="LAMBOptimizer"):
    """Constructs a LAMBOptimizer."""
    super(LAMBOptimizer, self).__init__(False, name)

    ...

Step 3: Scale to multiple nodes
LAMB optimizer
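An illustrative construction of the class above; the argument values mirror common BERT settings but are written here as assumptions, not the script's exact defaults:

optimizer = LAMBOptimizer(
    learning_rate=1e-4,                          # in the real script this comes from a warmup/decay schedule
    weight_decay_rate=0.01,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-6,
    exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])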

38

• Inter-GPU communication (bigger picture):

Step 3: Scale to multiple nodes
Under the hood

[Chart: effective bandwidth in GB/s]

42

• Tensor Fusion

• Batch tensors together during allreduce

• HOROVOD_FUSION_THRESHOLD=<bytes> HOROVOD_CYCLE_TIME=<ms> horovodrun ...

• Gradient Compression (FP16 Allreduce):

• hvd.DistributedOptimizer(..., compression=hvd.Compression.fp16)

• Reduces network utilization

Step 3: Scale to multiple nodes
Further Horovod optimizations
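A minimal sketch combining the two knobs above; the fusion threshold and cycle time values are illustrative only, not recommended defaults:

# Launch (environment variables control tensor fusion; values are illustrative):
#   HOROVOD_FUSION_THRESHOLD=67108864 HOROVOD_CYCLE_TIME=5 \
#       horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py

import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()
opt = tf.train.AdamOptimizer(learning_rate=0.01 * hvd.size())
# FP16 allreduce halves the gradient traffic on the interconnect.
opt = hvd.DistributedOptimizer(opt, compression=hvd.Compression.fp16)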

43

• DNN datasets are large

• Read-dominated at beginning of each epoch

• Keep data close to compute as much as possible:

• RAM disk, SSDs in RAID 0, Fast network attached storage

Step 3: Scale to multiple nodes
Storage

44

• Integrated software and hardware system for multi-node scaling

• State-of-the-art compute, GPU interconnect, node interconnect, and storage

Step 3: Scale to multiple nodes
Reference architecture: DGX SuperPOD

45

NVIDIA DGX SuperPOD

● 64 DGX-2 systems across 16 racks, with GPFS storage
● Mellanox EDR 100G InfiniBand network with Mellanox Smart Director Switches
● In-network computing acceleration engines
● Fast and efficient storage access with RDMA
● Up to 130 Tb/s switching capacity per switch, ultra-low latency of 300 ns
● Integrated network manager
● Terabit-speed InfiniBand networking per node: 800 Gb/s to the compute backplane switches and 200 Gb/s to the storage backplane switches

White paper: https://www.nvidia.com/en-us/data-center/resources/nvidia-dgx-superpod-reference-architecture/

46

• Deep Learning Model:

• Hyperparameters tuned for multi-node scaling

• Multi-node launcher scripts

• Deep Learning Container:

• Optimized DL frameworks, GPU libraries, and multi-node software

• Host:

• Host OS, GPU driver, IB driver, container runtime engine (docker, enroot)

Step 3: Scale to multiple nodes
Software stack - Application

47

• Slurm: User job scheduling & management

• Enroot: NVIDIA open-source tool to convert traditional container/OS images into unprivileged sandboxes

• Pyxis: NVIDIA open-source plugin integrating Enroot with Slurm

• DeepOps: NVIDIA open-source toolbox for GPU cluster management w/Ansible playbooks

Step 3: Scale to multiple nodes
Software stack - System

[Diagram: login nodes and a DGX Pod of DGX servers with the DGX Base OS; Slurm controller, Enroot | Docker, Pyxis, DCGM, and NGC model containers (PyTorch, TensorFlow from 19.09)]

48

DeepOps leverages Ansible for automated large-scale cluster deployment (see the DeepOps deployment doc).

Deployment with DeepOps

• Bootstrap all nodes
• Prepare the provisioning node
• Provision all nodes
• Deploy Slurm on Slurm nodes
• Deploy DL/ML development tools
• Deploy production AI applications
• Deploy management services

- Build your own GPU cluster following the DGX Pod and DGX SuperPOD reference architectures.
- Clone the DeepOps repo and follow the cluster setup guide. Open a GitHub issue if you hit any problems.

Step 3: Scale to multiple nodes

49

• Scaling requires careful consideration of algorithms and infrastructure at each step

• Optimized single-GPU model

• Efficient & scalable Allreduce library

• GPU interconnect, networking, storage

...

• NVIDIA platform makes scaling DL training easier and more efficient

• Deep Learning Examples with SOTA accuracy and performance

• NVIDIA NGC Container with optimized multi-GPU/multi-node software stack

• Accelerated compute platform designed for performance and scaling

Summary
Scaling is important and we are here to help