Introduction to GPU-Accelerated Solutions

유현곤, Senior Manager | NVIDIA Korea

Feb. 2017

GPU COMPUTING HAS REACHED A TIPPING POINT

All Top 10 HPC Applications Accelerated

Gaussian, ANSYS Fluent, GROMACS, Simulia Abaqus, NAMD, WRF, VASP, OpenFOAM, LS-DYNA, AMBER

+425 More Applications

3x GPU Developers: 120,000 (2014) to 400,000 (2016)

100% of DL Frameworks Accelerated

TORCH, THEANO, CAFFE, MATCONVNET, PURINE, MOCHA.JL, MINERVA, MXNET*, BIG SUR, TENSORFLOW, WATSON, CNTK

13x Organizations Engaged with NVIDIA for DL: 1,500 (2014) to 19,400 (2016)

DEEP LEARNING IS VITAL TO HPC

Monitoring Effects of Carbon and Greenhouse Gas Emissions

Reducing Cancer Diagnosis Error Rate by 85%

NASA Frontier Labs / Asteroid Grand Challenge

FUTURE SYSTEM NEEDS TO ACCELERATE COMPUTATIONAL SCIENCE & DATA SCIENCE

Engine for AI Supercomputing

Computational Science

Data Science

NVIDIA HAS THE LEADING ACCELERATED COMPUTING PLATFORM

435 GPU-Accelerated Applications

Gaussian, ANSYS Fluent, GROMACS, Simulia Abaqus, NAMD, WRF, VASP, OpenFOAM, LS-DYNA, AMBER

+425 More HPC Applications

100% of DL Frameworks Accelerated

TORCH, THEANO, CAFFE, MATCONVNET, PURINE, MOCHA.JL, MINERVA, MXNET*, BIG SUR, TENSORFLOW, WATSON, CNTK

# of Developers in 2 Years

3x GPU Developers: 120,000 (2014) to 400,000 (2016)

25x DL Developers: 2,200 (2014) to 55,000 (2016)

NVIDIA IS DEEPLY INVESTED IN AI SUPERCOMPUTING

Pascal: 5 Miracles

Pascal, 16nm FinFET, CoWoS HBM2, NVLink, cuDNN

NVIDIA DGX-1

NVIDIA DGX SATURNV

65x in 3 Years

[Chart: AlexNet Training Performance, 2013 to 2016. Y-axis: speedup, 0x to 70x. Series: K40, K80 + cuDNN1, M40 + cuDNN4, P100 + cuDNN5.]

TESLA FOR SIMULATION

TESLA ACCELERATED COMPUTING

ACCELERATED COMPUTING TOOLKIT

LIBRARIES: cuBLAS, cuDNN, cuSPARSE

DIRECTIVES

LANGUAGES

HOW GPU ACCELERATION WORKS

Application Code

GPU: Compute-Intensive Functions (5% of Code)

+

CPU: Rest of Sequential CPU Code
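The split on this slide can be made concrete with Amdahl's law: if a fraction p of the run time is spent in the offloaded functions and the GPU runs them s times faster, the whole application speeds up by 1 / ((1 - p) + p / s). The following is a minimal sketch under that assumption. The function name `overall_speedup` is hypothetical, and the slide's "5%" refers to lines of code, not run time, so the time fraction has to be measured separately for a real application.

```c
#include <assert.h>

/* Overall application speedup (Amdahl's law) when a fraction
   `time_fraction` (0.0 to 1.0) of the original run time is offloaded
   and runs `kernel_speedup` times faster, while the rest stays
   sequential on the CPU. Both inputs are assumptions supplied by the
   caller, not figures from the deck. */
double overall_speedup(double time_fraction, double kernel_speedup)
{
    assert(time_fraction >= 0.0 && time_fraction <= 1.0);
    assert(kernel_speedup > 0.0);

    double cpu_part = 1.0 - time_fraction;            /* unchanged */
    double gpu_part = time_fraction / kernel_speedup; /* accelerated */
    return 1.0 / (cpu_part + gpu_part);
}
```

For instance, `overall_speedup(0.9, 10.0)` evaluates 1 / (0.1 + 0.09), roughly 5.3x, which shows why the remaining sequential CPU code limits the gain even when the kernels themselves run much faster.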

OPENACC

World’s Only Performance Portable Programming Model for HPC

Simple: Add Simple Compiler Hint

main()
{
    <serial code>
    #pragma acc kernels
    {
        <parallel code>
    }
}

Powerful: Big Performance and Quicker Development

LSDALTON: Simulation of molecular energies

CCSD(T) Module, Alanine-3

Speedup vs CPU: 1.0x (CPU), 11.7x (GPU)

Titan System: AMD CPU vs Tesla K20X

Lines of Code Modified: <100 Lines

# of Weeks Required: 1 Week

Portable: ARM, PEZY, POWER, Sunway, x86 CPU, x86 Xeon Phi, NVIDIA GPU
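The schematic above can be turned into a compilable function. This is a minimal sketch, not code from the deck: the name `vector_add` is hypothetical, and the `copyin`/`copyout` data clauses are one reasonable choice for arrays that are only read or only written. A compiler without OpenACC support ignores the pragma and runs the loop serially on the CPU, which is the portability property the slide describes.

```c
#include <assert.h>

/* Element-wise z = x + y over n floats.
   With an OpenACC compiler the loop is offloaded: x and y are copied
   to the accelerator, z is copied back. Without OpenACC support the
   pragma is ignored and the same source runs on the CPU. */
void vector_add(const float *restrict x, const float *restrict y,
                float *restrict z, int n)
{
    assert(x && y && z && n >= 0);

    #pragma acc kernels copyin(x[0:n], y[0:n]) copyout(z[0:n])
    for (int i = 0; i < n; ++i)
        z[i] = x[i] + y[i];
}
```

The only OpenACC-specific line is the pragma, which matches the "Add Simple Compiler Hint" message: the serial version and the accelerated version are the same source file.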

CUDA

CUDA TOOLKIT 8

Everything you need to accelerate applications

Comprehensive C/C++ development environment

Out of box performance on Pascal

Unified Memory on Pascal enables simple programming with large datasets

New critical path analysis profiling feature quickly identifies system-level bottlenecks

P100 speedup over K80/CUDA 7.5: VASP 3.5x, MILC 1.5x

HPGMG with AMR: 19x (Larger Simulations & More Accurate Results)

https://developer.nvidia.com/cuda-toolkit


CUDA 8 – WHAT’S NEW

P100 Support: New Pascal Architecture, Stacked Memory, NVLINK, FP16 math

Unified Memory: Large Datasets, Demand Paging, New Tuning APIs, Standard C/C++ Allocators

Libraries: New nvGRAPH library, cuBLAS improvements for Deep Learning

Developer Tools: Critical Path Analysis, 2x faster compile time, OpenACC profiling, Debug CUDA Apps on display GPU

nvGRAPH

Accelerated Graph Analytics

nvGRAPH for high performance graph analytics

Deliver results up to 3x faster than CPU-only

Solve graphs with up to 2.5 Billion edges on 1x M40

Accelerates a wide range of graph analytics apps:

PageRank: Search, Recommendation Engines, Social Ad Placement

Single Source Shortest Path: Robotic Path Planning, Power Network Planning, Logistics & Supply Chain Planning

Single Source Widest Path: IP Routing, Chip Design / EDA, Traffic sensitive routing

[Chart: nvGRAPH: 3x Speedup. PageRank on Twitter 1.5B edge dataset. Y-axis: Iterations/s, 0 to 3. Series: 48 Core Xeon E5, nvGRAPH on M40.]

CPU System: 4U server w/ 4x 12-core Xeon E5-2697 CPU, 30M Cache, 2.70 GHz, 512 GB RAM

developer.nvidia.com/nvgraph
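The chart's unit, iterations per second, refers to PageRank power iterations. As a rough illustration of what one such iteration computes, here is a plain CPU reference sketch. It does not use the nvGRAPH API. The `Edge` type, the `pagerank` function and the uniform handling of vertices without out-edges are assumptions made for this example.

```c
#include <assert.h>
#include <stdlib.h>

/* One directed edge src -> dst (hypothetical type for this sketch). */
typedef struct { int src; int dst; } Edge;

/* Runs `iters` PageRank power iterations on a graph with n vertices
   and m edges. rank[] (length n) receives the result and sums to 1.
   `damping` is the usual damping factor, commonly 0.85. Rank held by
   vertices with no out-edges is spread uniformly over all vertices. */
void pagerank(int n, const Edge *edges, int m,
              double damping, int iters, double *rank)
{
    assert(n > 0 && m >= 0 && rank);

    int *outdeg = calloc((size_t)n, sizeof *outdeg);
    double *next = malloc((size_t)n * sizeof *next);
    assert(outdeg && next);

    for (int e = 0; e < m; ++e)
        outdeg[edges[e].src]++;
    for (int v = 0; v < n; ++v)
        rank[v] = 1.0 / n;

    for (int it = 0; it < iters; ++it) {
        /* Rank sitting on vertices that have nowhere to send it. */
        double dangling = 0.0;
        for (int v = 0; v < n; ++v)
            if (outdeg[v] == 0)
                dangling += rank[v];

        double base = (1.0 - damping) / n + damping * dangling / n;
        for (int v = 0; v < n; ++v)
            next[v] = base;

        /* Each vertex shares its rank equally among its out-edges. */
        for (int e = 0; e < m; ++e) {
            int s = edges[e].src;
            next[edges[e].dst] += damping * rank[s] / outdeg[s];
        }
        for (int v = 0; v < n; ++v)
            rank[v] = next[v];
    }

    free(next);
    free(outdeg);
}
```

The inner loop over edges is the part that dominates on a 1.5B-edge graph; it is irregular, memory-bound work, which is the kind of workload the slide positions nvGRAPH for.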

CUDA 8.0 PROFILING

Powerful Profiling with Dependency Analysis

In heterogeneous applications that do significant computation on both CPUs and GPUs, it can be a challenge to locate the best place to spend your optimization effort.

Visual Profiler provides dependency analysis between GPU kernels and CPU CUDA API calls, enabling critical path analysis in your application to help you more profitably target your optimization effort.
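The idea behind critical path analysis can be sketched as a longest-path computation over a dependency graph of CPU and GPU tasks. This is an illustration of the concept only, not how Visual Profiler is implemented. The `Dep` type, the `critical_path` function and the requirement that tasks are numbered in dependency order are assumptions of this sketch.

```c
#include <assert.h>

#define MAX_TASKS 64

/* Task `to` cannot start before task `from` has finished. */
typedef struct { int from; int to; } Dep;

/* Returns the length of the longest chain of dependent tasks and sets
   critical[t] = 1 for every task on that chain, 0 otherwise.
   Assumption: tasks are numbered so that every dependency points from
   a lower index to a higher one (a topological order). */
double critical_path(int n, const double *duration,
                     const Dep *deps, int m, int *critical)
{
    assert(n > 0 && n <= MAX_TASKS);

    double finish[MAX_TASKS]; /* earliest finish time of each task */
    int pred[MAX_TASKS];      /* dependency that delayed it the most */
    int last = 0;             /* task that finishes last overall */

    for (int t = 0; t < n; ++t) {
        double start = 0.0;
        pred[t] = -1;
        for (int e = 0; e < m; ++e) {
            if (deps[e].to != t)
                continue;
            assert(deps[e].from >= 0 && deps[e].from < t);
            if (finish[deps[e].from] > start) {
                start = finish[deps[e].from];
                pred[t] = deps[e].from;
            }
        }
        finish[t] = start + duration[t];
        if (finish[t] > finish[last])
            last = t;
    }

    for (int t = 0; t < n; ++t)
        critical[t] = 0;
    for (int t = last; t != -1; t = pred[t])
        critical[t] = 1;

    return finish[last];
}
```

Tasks that are not flagged as critical have slack: making them faster does not shorten the total run time, which is why the profiler steers optimization effort toward the kernels and API calls on the critical path.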


NVLINK - GPU CLUSTER

Two fully connected quads, connected at corners

160GB/s per GPU bidirectional to Peers

Load/store access to Peer Memory

Full atomics to Peer GPUs

High speed copy engines for bulk data copy

PCIe to/from CPU
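The 160GB/s figure can be checked with simple arithmetic, assuming the P100 link configuration given in the Pascal architecture whitepaper (four NVLink links per GPU, 20 GB/s per link in each direction). These per-link numbers come from that whitepaper, not from this slide.

```latex
4\ \text{links} \times 20\ \mathrm{GB/s\ per\ direction} \times 2\ \text{directions} = 160\ \mathrm{GB/s}
```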

MAXIMIZING BANDWIDTH

MVAPICH2-GDR 2.2b intra-node GPU-to-GPU pt2pt BiBW

[Chart: Bandwidth (MB/s), y-axis 0 to 25000. Series: Staging BW (MB/s), P2P BW (MB/s). Data label: 21.3 GB/s.]

NVLINK TOPOLOGY

NVLink, Pascal architecture whitepaper

IBM POWER NVLINK SYSTEMS

APPROVED DESIGNS FOR OPENPOWER ECOSYSTEM

4 GPU POWER8 FOR PASCAL

[Diagram labels: P100 SXM2, NIC, CAPI]

GPUDIRECT P2P ON PASCAL

Early results, P2P thru NVLink

[Chart: OpenMPI intra-node GPU-to-GPU pt2pt BiBW. Y-axis: Bandwidth (MB/s), 0 to 40000. Series: P100 NVLink, K80@875 PCI-E. Data label: 34.2 GB/s.]

x86 TO POWER MIGRATION

Korean user support cases

CUDA code (Win x86 CPU + CUDA): recompile with the arch option

JCUDA (Win x86 CPU + JCUDA): modify the header file in the jcuda git repository

TensorFlow (Ubuntu x86 CPU): dependencies on protobuf and Bazel

OPENACC

OpenACC Performance Portability: CloverLeaf

“We were extremely impressed that we can run OpenACC on a CPU with no code change and get equivalent performance to our OpenMP/MPI implementation.”

Wayne Gaudin and Oliver Perks, Atomic Weapons Establishment, UK

[Chart: CloverLeaf Hydrodynamics Application, OpenACC Performance Portability. Y-axis: Speedup vs 1 CPU Core.]

Benchmarked Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz, Accelerator: Tesla K80 (dual GPU)

CloverLeaf Performance – Tesla P100 Pascal

[Chart: Speedup vs Single Haswell Core, y-axis 0 to 45. Series: Haswell: OpenMP, Haswell: OpenACC, Tesla K80: OpenACC, Tesla P100: OpenACC. Data labels: 7.2x, 6.6x, 13x, 34x.]

CPU: Intel Xeon E5-2698 v3, 2 sockets, 32 cores, 2.30 GHz, HT disabled

GPU: NVIDIA Tesla K80 (single GPU), NVIDIA Tesla P100 (Single GPU)

OS: CentOS 6.6, Compiler: PGI 16.5

UNIFIED MEMORY

Dramatically Lower Developer Effort

Traditional Developer View: System Memory, GPU Memory

Developer View With Unified Memory: Unified Memory

UNIFIED MEMORY

Traditional Developer View

#include <stdio.h>
#include <stdlib.h>

void use_data(float *z);  /* application-defined consumer of the result */

void foo(FILE *fp, int N) {
    float *x, *y, *z;
    x = (float *)malloc(N*sizeof(float));
    y = (float *)malloc(N*sizeof(float));
    z = (float *)malloc(N*sizeof(float));
    fread(x, sizeof(float), N, fp);
    fread(y, sizeof(float), N, fp);
    /* Explicit data clause: the arrays are copied to and from GPU memory. */
    #pragma acc kernels copy(x[0:N],y[0:N],z[0:N])
    for (int i=0; i<N; ++i)
        z[i] = x[i] + y[i];
    use_data(z);
    free(z); free(y); free(x);
}

Developer View With Unified Memory

/* Same includes and use_data() declaration as above. */
void foo(FILE *fp, int N) {
    float *x, *y, *z;
    x = (float *)malloc(N*sizeof(float));
    y = (float *)malloc(N*sizeof(float));
    z = (float *)malloc(N*sizeof(float));
    fread(x, sizeof(float), N, fp);
    fread(y, sizeof(float), N, fp);
    /* No data clause: Unified Memory moves the arrays as they are accessed. */
    #pragma acc kernels
    for (int i=0; i<N; ++i)
        z[i] = x[i] + y[i];
    use_data(z);
    free(z); free(y); free(x);
}

LBM D2Q37

Lattice Boltzmann Method (LBM), D2Q37 model

Application developed at U Rome Tor Vergata/INFN, U Ferrara/INFN, TU Eindhoven

Reproduces the dynamics of a fluid by simulating virtual particles that collide and propagate

Simulation of large systems requires double precision computation and many GPUs
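The propagate and collide steps named above can be illustrated on a much smaller lattice. The sketch below is a toy one-dimensional, three-velocity (D1Q3) model with a BGK collision operator. It is not taken from the D2Q37 application, and every name and size in it is hypothetical. It only shows the structure of one time step and that both steps conserve mass.

```c
#include <assert.h>

#define NX 64  /* lattice sites (arbitrary size for this sketch) */
#define Q  3   /* D1Q3: populations with velocity 0, +1, -1 */

static const int    vel[Q] = { 0, 1, -1 };
static const double wgt[Q] = { 2.0 / 3.0, 1.0 / 6.0, 1.0 / 6.0 };

/* Fluid at rest with a small density bump in the middle. */
void init_bump(double f[Q][NX])
{
    for (int x = 0; x < NX; ++x) {
        double rho = (x == NX / 2) ? 1.1 : 1.0;
        for (int q = 0; q < Q; ++q)
            f[q][x] = wgt[q] * rho;
    }
}

/* Propagate: every population hops one site along its velocity
   (periodic boundaries). */
void propagate(double src[Q][NX], double dst[Q][NX])
{
    for (int q = 0; q < Q; ++q)
        for (int x = 0; x < NX; ++x)
            dst[q][(x + vel[q] + NX) % NX] = src[q][x];
}

/* Collide: relax each site toward its local equilibrium (BGK) with
   relaxation time tau. Density and momentum per site are preserved. */
void collide(double f[Q][NX], double tau)
{
    assert(tau > 0.5);
    for (int x = 0; x < NX; ++x) {
        double rho = 0.0, mom = 0.0;
        for (int q = 0; q < Q; ++q) {
            rho += f[q][x];
            mom += vel[q] * f[q][x];
        }
        double u = mom / rho;
        for (int q = 0; q < Q; ++q) {
            double cu  = vel[q] * u;
            double feq = wgt[q] * rho
                       * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * u * u);
            f[q][x] += (feq - f[q][x]) / tau;
        }
    }
}

/* Sum of all populations on the lattice. */
double total_mass(double f[Q][NX])
{
    double mass = 0.0;
    for (int q = 0; q < Q; ++q)
        for (int x = 0; x < NX; ++x)
            mass += f[q][x];
    return mass;
}

/* One time step = propagate into tmp, collide there, copy back. */
void run(double f[Q][NX], double tmp[Q][NX], int steps, double tau)
{
    for (int s = 0; s < steps; ++s) {
        propagate(f, tmp);
        collide(tmp, tau);
        for (int q = 0; q < Q; ++q)
            for (int x = 0; x < NX; ++x)
                f[q][x] = tmp[q][x];
    }
}
```

Propagate is pure data movement and collide is pure per-site arithmetic. That split is why the two routines behave so differently when profiled, and why keeping the lattice resident on the GPU between steps matters.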

LBM D2Q37 – COLLIDE ACCELERATED

CPU Profile (480x512) using Unified Memory – 1 MPI rank

Rank  Method      Final (s)  UM+propagate+bc (s)  Initial (s)
0     main        7.69       2.39                 1.89
1     collide     0.52       49.99                17.01
2     lbm         0.41       4.72                 0.06
3     init        0.19       0.19                 0.04
4     printMass   0.15       0.17                 0.01
5     propagate   0.13       2.15                 10.71
6     bc          0.09       0.11                 0.17
7     projection  0.05       0.05                 0.06

Application Reported Solvertime: 0.96 s (bc: 55.74 s, Initial: 27.85 s)

Profiler: Total Time for Process: 9.33 s (bc: 59.86 s, Initial: 30.15 s)

LBM D2Q37 – COLLIDE ACCELERATED

NVVP Timeline (480x512) using Unified Memory – 1 MPI rank

Data stays on GPU while simulation is running

PERFORMANCE PORTABILITY FOR EXASCALE

Optimize Once, Run Everywhere with OpenACC

2015: NVIDIA GPU, AMD GPU, x86 CPU

2016: NVIDIA GPU, AMD GPU, x86 CPU, x86 Xeon Phi, OpenPOWER CPU

2017: NVIDIA GPU, AMD GPU, x86 CPU, x86 Xeon Phi, OpenPOWER CPU, ARM CPU

PGI Roadmaps are subject to change without notice.

Thank you.
