
Maggie Zhang (张雪萌) maggiez@nvidia.com

Accelerate Deep Learning Training at Scale on GPUs

AGENDA

● Introduction

● Why do we need to scale training

● How to achieve scaling

3

DL Training: from single GPU to multi-node

ResNet50 v1.5 training time:

● 2015: 36,000 mins (25 days), 1x K80, CUDA
● 2016: 1,200 mins (20 hours), DGX-1P, NVLink
● 2017: 480 mins (8 hours), DGX-1V, Tensor Core
● 2018: 70 minutes on MLPerf, DGX-2H, NVSwitch
● 2018 (at scale): 6.3 minutes on MLPerf, DGX Cluster
● 2019: 52.7 minutes on MLPerf, DGX-2H, NVSwitch
● 2019 (at scale): 1.33 minutes on MLPerf, DGX SuperPOD

4

The whole stack must be considered

● Compute

● Network

● Storage

● Frameworks & Libraries

● Numerical methods

● Training recipes

5

MLPerf: NVIDIA advancing AI training

Time to train: from 8 hours to 80 seconds

2019 MLPerf ID (in order from top to bottom of chart): ResNet-50: 0.6-30 | Transformer: 0.6-28 | GNMT: 0.6-14 | SSD: 0.6-27 | Mini-Go: 0.6-11 | Mask R-CNN: 0.6-23

6

Largest TensorFlow model at scale
Oak Ridge National Lab scales a TensorFlow climate analytics model up to 27,360 V100 GPUs

Source: https://arxiv.org/pdf/1810.01993.pdf

2018 Gordon Bell Prize Winner

AGENDA

● Introduction

● Why do we need to scale training

● How to achieve scaling

8

Datasets getting larger

● Unlabeled data:
○ Language models: BooksCorpus (800M words), English Wikipedia (2.5B words), WebText (8M documents, 40 GB), C4 (Common Crawl, 745 GB)
○ GANs: unlabeled images and videos
○ Reinforcement learning: unsupervised self-play generates unlimited data

● Labeled data:
○ ImageNet (2012): 1.3M images, 1,000 categories; Open Images (2019): 9M images, 6,000 categories
○ Semi-autonomous vehicles: 0.5-1.1 TB of data for every 8 hours of driving

9

DL models increasing in complexity

● Image Recognition, Speech Recognition, Object Detection (Autonomous Vehicles, Social Tagging, Visual Search): ~26M parameters
● NLP (Q&A, Sentiment, Translation): ~340M parameters
● NLP – Generative Tasks (Chatbots, E-mail auto-completion, Document Summarization): ~1.5Bn parameters

Next-level use-cases require gigantic models

Project Megatron (https://github.com/NVIDIA/Megatron-LM):
● 8.3B parameters
● 8-way Model Parallel
● 64-way Data Parallel
● 24x larger than BERT

AGENDA

● Introduction

● Why do we need to scale training

● How to achieve scaling

11

Scaling == whack-a-mole?

Solve one bottleneck and another one pops up

12

Multi-node infrastructure requirements

System Design + Data Center Management + SW Stack → Multi-Node Success

13

Challenges of multi-node DL training

● Hardware GPU cluster design:
○ Compute: significant CPU to GPU ratio, interconnect with GPU
○ Storage: high speed NFS, multi-tier caching
○ Networking: topology and bandwidth, NVLINK, GPUDirect RDMA

● GPU cluster management:
○ Scheduler: Slurm vs. Kubernetes
○ Container technologies: Docker, Enroot, Singularity, etc.

● Integrated software stack:
○ NVIDIA libraries: CUDA, cuDNN, NCCL
○ DL Framework scale-out optimization
○ Model scale-out implementation & optimization

14

A basic recipe for deep learning scaling

Step 1: Optimize your single GPU model

Step 2: Scale to multiple GPUs on one node

Step 3: Scale to multiple nodes

15

Case study

• BERT model scripts with configurations for convergence, from 8 to 1500 GPUs, multi-node ready:
  https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/
  https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT

• Clone and train your own BERT model on multi-node, or download a pre-trained BERT model from NGC and fine-tune it for your NLP task

Bidirectional Encoder Representations from Transformers

Super Human Question & Answering

NVIDIA Deep Learning Examples have many model scripts with best practices for accuracy and performance

16

• Pre-training on unlabelled data opens up opportunities to use massive amounts of data:
  • BooksCorpus (800 million words)
  • English Wikipedia (2.5 billion words), multi-language Wikipedia
  • WebText (OpenAI, 8M documents, 40 GB of text)

• More data tends to lead to better accuracy

• BERT pre-training is computationally intensive and takes days to train even on the most powerful single node: BERT-Large (330M parameters) takes ~2.5 days to train on a single DGX-2 server with 16 V100 GPUs.

Why multi-node BERT training

17

BERT multi-node pre-training performance

Metric: time to train

DGX-1 (16 GB)

Nodes    GPUs    Time to train (hrs)
1        8       153.6 (6.3 days)
4        32      39.3
16       128     10.4

DGX-2H (32 GB)

Nodes    GPUs    Time to train (hrs)
1        16      58.4 (2.4 days)
4        64      15.4
16       256     3.9
64       1024    1.2

Source: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT#pre-training-loss-results

* Time to train is measured for mixed precision training to loss 1.3 in PyTorch, with the LAMB optimizer
** Gradient accumulation is applied to the DGX-2H 1, 4, and 16 node configurations

18

• Create efficient data pipeline

• Enable mixed precision training

• Enable XLA

• Ensure latest GPU libraries

• Develop model in container to facilitate scaling out

Step 1: Optimize model

19

Step 1: Optimize model

• Use tf.data to create performant input pipelines

• Test I/O bottlenecks with a trivial model

• NVIDIA DALI accelerates image-based input pipelines

Data pipeline
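One way to run the "trivial model" I/O check above is to time how fast batches come out of the tf.data pipeline with no compute attached. A minimal sketch (TF 2.x eager style; build_dataset() is a hypothetical helper standing in for your real input pipeline):

import time
import tensorflow as tf

def benchmark_input_pipeline(dataset, num_batches=100):
    """Pull `num_batches` batches through the pipeline with a no-op consumer."""
    start = time.perf_counter()
    for _ in dataset.take(num_batches):
        pass  # trivial "model": no compute, just drain the pipeline
    elapsed = time.perf_counter() - start
    return num_batches / elapsed  # batches per second

# Usage (build_dataset is a hypothetical helper returning a batched tf.data.Dataset):
# ds = build_dataset(input_files, batch_size=64)
# print("input pipeline: %.1f batches/sec" % benchmark_input_pipeline(ds))

If this rate is barely above your end-to-end training throughput, the input pipeline is the bottleneck.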

20

d = tf.data.Dataset.from_tensor_slices(tf.constant(input_files))
d = d.repeat()
d = d.shuffle(buffer_size=len(input_files))

# `cycle_length` is the number of parallel files that get read.
cycle_length = min(num_cpu_threads, len(input_files))
d = d.apply(
    tf.contrib.data.parallel_interleave(
        tf.data.TFRecordDataset,
        cycle_length=cycle_length))

d = d.shuffle(buffer_size=100)

d = d.apply(
    tf.contrib.data.map_and_batch(
        lambda record: _decode_record(record, name_to_features),
        batch_size=batch_size,
        num_parallel_batches=num_cpu_threads,
        drop_remainder=True if is_training else False))

BERT

TFRecord - fast binary format

Parallel read, map, & batch

Fused map & batch op

Data pipeline

https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/run_pretraining.py

21

Step 1: Optimize model

• 1-line optimizer wrapper:
  opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)

• Up to 3x speedup in training on Tensor Cores, with:
  • Same accuracy
  • No change in hyperparameters
  • ½ memory bandwidth & footprint

• Optimal on Volta and Turing GPUs

Automatic Mixed Precision (AMP)
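A minimal sketch of where the wrapper sits in a TF 1.14+ graph-mode script; the toy layer and loss below are only there to make the example self-contained:

import tensorflow as tf

# Toy graph so the example is self-contained.
x = tf.placeholder(tf.float32, shape=[None, 1024])
labels = tf.placeholder(tf.int32, shape=[None])
logits = tf.layers.dense(x, 10)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))

opt = tf.train.AdamOptimizer(learning_rate=1e-4)
# One line: rewrites the graph to run eligible ops in FP16 on Tensor Cores
# and adds automatic loss scaling.
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
train_op = opt.minimize(loss)

In NGC TensorFlow containers the same rewrite can also be enabled via the TF_ENABLE_AUTO_MIXED_PRECISION=1 environment variable, with no code change at all.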

22

Step 1: Optimize model
Automatic Mixed Precision (AMP)

• Robust speedup across different TensorFlow workloads

• https://arxiv.org/abs/1710.03740

23

Step 1: Optimize model
XLA (Accelerated Linear Algebra)

• TensorFlow XLA can accelerate models with minimal code changes

• XLA optimizes the graph, mostly by fusing compatible kernels

• Set the XLA optimization level:
  https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/run_pretraining.py#L531

config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

System config: Xeon E5-2698 v4 CPU with 256GB system RAM, single V100 Tensor Core GPU 32GB. Tests run using NVIDIA 18.11 TensorFlow container.
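A minimal sketch of how that config line is typically applied in a TF 1.x session-based script; in Estimator-based scripts such as the BERT example, the same ConfigProto is usually passed in via tf.estimator.RunConfig(session_config=config):

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Turn on XLA auto-clustering for the whole graph (same line as above).
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

with tf.Session(config=config) as sess:
    ...  # build and run the training graph under the XLA-enabled config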

24

Step 1: Optimize model

• Latest compatible features and tuning from CUDA toolkit and Deep Learning Libraries (cuDNN, cuBLAS, NCCL)

Latest GPU optimizations

25

Step 1: Optimize model

• NGC containers: fully featured DL containers

• DL frameworks compiled with latest GPU libraries

• Portability of application libraries facilitates multi-node scale-out

Latest GPU optimizations

26

27

• Understand Data Parallel training concepts

• Ensure optimal inter-GPU communication

• Apply high level API for multi-GPU training

Step 2: Scale to multiple GPUs

28

Step 2: Scale to multiple GPUs

• Single GPU

Under the hood

29

Step 2: Scale to multiple GPUs

• Multiple GPUs

• Data parallel training

Under the hood

• Allreduce algorithm

• NCCL: NVIDIA Collective Communications Library
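What the allreduce step computes in data-parallel training, illustrated with plain NumPy for clarity; in practice NCCL performs this reduction directly between GPU memories:

import numpy as np

num_workers = 4
# Each worker computes gradients on its own shard of the global batch.
local_grads = [np.random.randn(10) for _ in range(num_workers)]

# Allreduce: every worker ends up with the same summed (or averaged) result.
summed = np.sum(local_grads, axis=0)
averaged = summed / num_workers

# All replicas apply the identical averaged gradient, so their weights stay in sync.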

30

• Inter-GPU communication:

Step 2: Scale to multiple GPUs
Under the hood

[Chart: effective bandwidth in GB/s]

31

• Full non-blocking bandwidth

Step 2: Scale to multiple GPUs
Under the hood

32

Step 2: Scale to multiple GPUs

• Popular approach to enable multi-GPU/multi-node in TensorFlow/Keras

• Strong NCCL integration

• Sample commands:

• Single-node (4 GPUs):

horovodrun -np 4 -H localhost:4 python train.py

• Multi-node (4 nodes with 4 GPUs each):

horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py

Approach 1: Horovod

33

Step 2: Scale to multiple GPUs

Approach 1: Horovod

import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model...
loss = ...
opt = tf.train.AdamOptimizer(learning_rate=0.01 * hvd.size())

# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)

# Add hook to synchronize initial state
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Make training operation
train_op = opt.minimize(loss)

# Only checkpoint on rank 0
ckpt_dir = "/tmp/train_logs" if hvd.rank() == 0 else None

# Session
with tf.train.MonitoredTrainingSession(checkpoint_dir=ckpt_dir,
                                       config=config, hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training.
        mon_sess.run(train_op)

34

• Recently released native API that also supports Allreduce with NCCL

• Multi-GPU: tf.distribute.MirroredStrategy

• Multi-node: tf.distribute.experimental.MultiWorkerMirroredStrategy

Step 2: Scale to multiple GPUs
Approach 2: tf.distribute.Strategy

Source: https://www.tensorflow.org/guide/distributed_training
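A minimal sketch of the single-node, multi-GPU case with tf.distribute.MirroredStrategy and Keras (TF 2.x); the model and data below are toy stand-ins:

import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()   # one replica per visible GPU
print("replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():                        # variables are created mirrored across GPUs
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

x = np.random.rand(1024, 784).astype("float32")
y = np.random.randint(0, 10, size=(1024,))
model.fit(x, y, batch_size=256, epochs=1)     # gradients are allreduced across replicas

# Multi-node variant (each worker needs TF_CONFIG set):
# strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()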

35

• Adopt optimizer designed for large batch size

• Ensure effective inter-node communication

• Move data close to compute

• Consider full application & system software stack

Step 3: Scale to multiple nodes

36

• Optimizer inspired by LARS
  • Layerwise adaptive learning rate (You et al.)

• Allows training at huge global batch size
  • Originally, BERT+Adam (Devlin et al.): global batch 256
  • BERT+LAMB (You et al.): global batch 64k

• Massive data parallelism

• Lower interconnect pressure with gradient accumulation (sketched below)

Step 3: Scale to multiple nodes
LAMB optimizer
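The gradient accumulation point above, sketched with plain NumPy bookkeeping only: run several micro-batches locally and communicate once per accumulation window, which lowers allreduce frequency at a fixed global batch size.

import numpy as np

accumulation_steps = 4
accumulated = np.zeros(10)

for _ in range(accumulation_steps):
    micro_batch_grad = np.random.randn(10)   # stand-in for one local backward pass
    accumulated += micro_batch_grad

effective_grad = accumulated / accumulation_steps
# Only `effective_grad` is allreduced and applied: one communication per 4 micro-batches.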

37

BERT+LAMB

Robustly scales to large batch sizes

https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/optimization.py

class LAMBOptimizer(tf.train.Optimizer):
  """A LAMB optimizer that includes "correct" L2 weight decay."""

  def __init__(self,
               learning_rate,
               weight_decay_rate=0.0,
               beta_1=0.9,
               beta_2=0.999,
               epsilon=1e-6,
               exclude_from_weight_decay=None,
               name="LAMBOptimizer"):
    """Constructs a LAMBOptimizer."""
    super(LAMBOptimizer, self).__init__(False, name)

    ...

Step 3: Scale to multiple nodes
LAMB optimizer
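An illustrative construction of the class above; the argument values mirror common BERT settings but are written here as assumptions, not the script's exact defaults:

optimizer = LAMBOptimizer(
    learning_rate=1e-4,                          # in the real script this comes from a warmup/decay schedule
    weight_decay_rate=0.01,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-6,
    exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])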

38

• Inter-GPU communication (bigger picture):

Step 3: Scale to multiple nodes
Under the hood

[Chart: effective bandwidth in GB/s]

42

• Tensor Fusion

• Batch tensors together during allreduce

• HOROVOD_FUSION_THRESHOLD=<bytes> HOROVOD_CYCLE_TIME=<ms> horovodrun ...

• Gradient Compression (FP16 Allreduce):

• hvd.DistributedOptimizer(..., compression=hvd.Compression.fp16)

• Reduces network utilization

Step 3: Scale to multiple nodes
Further Horovod optimizations
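A minimal sketch combining the two knobs above; the fusion threshold and cycle time values are illustrative only, not recommended defaults:

# Launch (environment variables control tensor fusion; values are illustrative):
#   HOROVOD_FUSION_THRESHOLD=67108864 HOROVOD_CYCLE_TIME=5 \
#       horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py

import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()
opt = tf.train.AdamOptimizer(learning_rate=0.01 * hvd.size())
# FP16 allreduce halves the gradient traffic on the interconnect.
opt = hvd.DistributedOptimizer(opt, compression=hvd.Compression.fp16)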

43

• DNN datasets are large

• Read-dominated at beginning of each epoch

• Keep data close to compute as much as possible:

• RAM disk, SSDs in RAID 0, Fast network attached storage

Step 3: Scale to multiple nodes
Storage

44

• Integrated software and hardware system for multi-node scaling

• State-of-the-art compute, GPU interconnect, node interconnect, and storage

Step 3: Scale to multiple nodes
Reference architecture: DGX SuperPOD

45

NVIDIA DGX SuperPOD

● 64 DGX-2 systems across 16 racks, with GPFS storage
● Mellanox EDR 100G InfiniBand network with Mellanox Smart Director Switches
● In-network computing acceleration engines
● Fast and efficient storage access with RDMA
● Up to 130 Tb/s switching capacity per switch, ultra-low latency of 300 ns
● Integrated network manager
● Terabit-speed InfiniBand networking per node: 800 Gb/s to the compute backplane switches and 200 Gb/s to the storage backplane switches

White paper: https://www.nvidia.com/en-us/data-center/resources/nvidia-dgx-superpod-reference-architecture/

46

• Deep Learning Model:

• Hyperparameters tuned for multi-node scaling

• Multi-node launcher scripts

• Deep Learning Container:

• Optimized DL frameworks, GPU libraries, and multi-node software

• Host:

• Host OS, GPU driver, IB driver, container runtime engine (docker, enroot)

Step 3: Scale to multiple nodes
Software stack - Application

47

• Slurm: User job scheduling & management

• Enroot: NVIDIA open-source tool to convert traditional container/OS images into unprivileged sandboxes

• Pyxis: NVIDIA open-source plugin integrating Enroot with Slurm

• DeepOps: NVIDIA open-source toolbox for GPU cluster management w/Ansible playbooks

Step 3: Scale to multiple nodes
Software stack - System

[Diagram: login nodes and a DGX Pod of DGX servers with the DGX Base OS; Slurm controller, Enroot | Docker, Pyxis, DCGM, and NGC model containers (PyTorch, TensorFlow from 19.09)]

48

DeepOps leverages Ansible for automated large-scale cluster deployment (see the DeepOps deployment doc).

Deployment with DeepOps

• Bootstrap all nodes
• Prepare the provisioning node
• Provision all nodes
• Deploy Slurm on Slurm nodes
• Deploy DL/ML development tools
• Deploy production AI applications
• Deploy management services

- Build your own GPU cluster following the DGX Pod and DGX SuperPOD reference architectures.
- Clone the DeepOps repo and follow the cluster setup guide. Open a GitHub issue if you hit any problems.

Step 3: Scale to multiple nodes

49

• Scaling requires careful consideration of algorithms and infrastructure at each step

• Optimized single-GPU model

• Efficient & scalable Allreduce library

• GPU interconnect, networking, storage

...

• NVIDIA platform makes scaling DL training easier and more efficient

• Deep Learning Examples with SOTA accuracy and performance

• NVIDIA NGC Container with optimized multi-GPU/multi-node software stack

• Accelerated compute platform designed for performance and scaling

Summary
Scaling is important and we are here to help