유현곤, General Manager | NVIDIA Korea
Feb. 2017
Introduction to GPU-Accelerated Solutions
100% of DL Frameworks Accelerated
All Top 10 HPC Applications Accelerated
Gaussian
ANSYS Fluent
GROMACS
Simulia Abaqus
NAMD
WRF
VASP
OpenFOAM
LS-DYNA
AMBER
+425 More Applications
3x GPU Developers: 120,000 (2014) to 400,000 (2016)
TORCH
THEANO
CAFFE
MATCONVNET
PURINE
MOCHA.JL
MINERVA
MXNET*
BIG SUR
TENSORFLOW
WATSON
CNTK
13x Organizations Engaged with NVIDIA for DL: 1,500 (2014) to 19,400 (2016)
GPU COMPUTING HAS REACHED A TIPPING POINT
Monitoring Effects of Carbon and Greenhouse Gas Emissions
DEEP LEARNING IS VITAL TO HPC
Reducing Cancer Diagnosis Error Rate by 85%
NASA Frontier Labs / Asteroid Grand Challenge
Engine for AI, Supercomputing
Computational Science, Data Science
FUTURE SYSTEM NEEDS TO ACCELERATE COMPUTATIONAL SCIENCE & DATA SCIENCE
435 GPU-Accelerated Applications: Gaussian, ANSYS Fluent, GROMACS, Simulia Abaqus, NAMD, WRF, VASP, OpenFOAM, LS-DYNA, AMBER, +425 More HPC Applications
100% of DL Frameworks Accelerated: TORCH, THEANO, CAFFE, MATCONVNET, PURINE, MOCHA.JL, MINERVA, MXNET*, BIG SUR, TENSORFLOW, WATSON, CNTK
# of Developers in 2 Years (2014 to 2016):
3x GPU Developers: 120,000 to 400,000
25x DL Developers: 2,200 to 55,000
NVIDIA HAS THE LEADING ACCELERATED COMPUTING PLATFORM
Pascal: 5 Miracles
Pascal
16nm FinFET
CoWoS HBM2
NVLink
cuDNN
NVIDIA DGX-1
NVIDIA DGX SATURNV
[Chart: AlexNet Training Performance, 65x in 3 Years; y-axis: speedup, 0x to 70x; bars: K40 (2013), K80 + cuDNN1 (2014), M40 + cuDNN4 (2015), P100 + cuDNN5 (2016)]
NVIDIA IS DEEPLY INVESTED IN AI SUPERCOMPUTING
NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE.
TESLA FOR SIMULATION
LIBRARIES: cuBLAS, cuDNN, cuSPARSE
DIRECTIVES
LANGUAGES
ACCELERATED COMPUTING TOOLKIT
TESLA ACCELERATED COMPUTING
HOW GPU ACCELERATION WORKS
Application Code
GPU: Compute-Intensive Functions (5% of Code)
+
CPU: Rest of Sequential CPU Code
OPENACC
World’s Only Performance Portable Programming Model for HPC
Simple: Add Simple Compiler Hint
    main()
    {
        <serial code>
        #pragma acc kernels
        {
            <parallel code>
        }
    }
Powerful: Big Performance
LSDALTON: Simulation of molecular energies
CCSD(T) Module, Alanine-3: CPU 1.0x, GPU 11.7x (Speedup vs CPU)
Titan System: AMD CPU vs Tesla K20X
Quicker Development: Lines of Code Modified: <100 Lines; # of Weeks Required: 1 Week
Portable: ARM, PEZY, POWER, Sunway, x86 CPU, x86 Xeon Phi, NVIDIA GPU
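The `#pragma acc kernels` hint on this slide can be made concrete with a small C sketch. The function name `saxpy` and the `copyin`/`copy` data clauses are assumptions chosen for illustration and do not come from the slides. A compiler built without OpenACC support simply ignores the pragma and runs the loop serially, which is one way to read the portability claim.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical example: y[i] = a * x[i] + y[i].
 * The pragma is only a hint. An OpenACC compiler may offload the loop;
 * any other C compiler ignores it and runs the loop on the CPU.
 */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);

    /* x is read-only on the accelerator; y is copied in and back out. */
#pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Calling `saxpy(n, 2.0f, x, y)` gives the same result with or without an accelerator; only the place where the loop runs changes.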
CUDA
CUDA TOOLKIT 8
Comprehensive C/C++ development environment
Out of box performance on Pascal
Unified Memory on Pascal enables simple programming with large datasets
New critical path analysis profiling feature quickly identifies system-level bottlenecks
Everything you need to accelerate applications
developer.nvidia.com/cuda-toolkit
19x: HPGMG with AMR (Larger Simulations & More Accurate Results)
P100 speedup over K80/CUDA 7.5: 3.5x, 1.5x (VASP, MILC)
CUDA 8 – WHAT’S NEW
P100 Support: New Pascal Architecture, Stacked Memory, NVLINK, FP16 math
Unified Memory: Large Datasets, Demand Paging, New Tuning APIs, Standard C/C++ Allocators
Libraries: New nvGRAPH library, cuBLAS improvements for Deep Learning
Developer Tools: Critical Path Analysis, 2x faster compile time, OpenACC profiling, Debug CUDA Apps on display GPU
NVIDIA CONFIDENTIAL. FOR USE UNDER NDA
nvGRAPH
Accelerated Graph Analytics
nvGRAPH for high performance graph analytics
Deliver results up to 3x faster than CPU-only
Solve graphs with up to 2.5 Billion edges on 1x M40
Accelerates a wide range of graph analytics apps:
PageRank: Search, Recommendation Engines, Social Ad Placement
Single Source Shortest Path: Robotic Path Planning, Power Network Planning, Logistics & Supply Chain Planning
Single Source Widest Path: IP Routing, Chip Design / EDA, Traffic sensitive routing
developer.nvidia.com/nvgraph
[Chart: nvGRAPH: 3x Speedup; PageRank on Twitter 1.5B edge dataset; y-axis: Iterations/s, 0 to 3; bars: 48 Core Xeon E5, nvGRAPH on M40]
CPU System: 4U server w/ 4x12-core Xeon E5-2697 CPU, 30M Cache, 2.70 GHz, 512 GB RAM
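The PageRank benchmark here is reported in iterations per second, so it may help to see what one iteration does. The following is a minimal plain-C sketch of PageRank power iteration on an edge list. It is not the nvGRAPH API; the names `edge_t` and `pagerank`, the damping-factor parameter, and the handling of vertices without outgoing edges are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical edge record: one directed edge src -> dst. */
typedef struct { int src; int dst; } edge_t;

/*
 * Run `iters` PageRank power iterations with damping factor d.
 * rank and scratch each hold n doubles; out_degree holds n ints.
 * All three are caller-provided work buffers; rank holds the result.
 */
void pagerank(int n, const edge_t *edges, size_t m, double d, int iters,
              double *rank, double *scratch, int *out_degree)
{
    assert(n > 0 && iters >= 0);

    for (int v = 0; v < n; ++v) { rank[v] = 1.0 / n; out_degree[v] = 0; }
    for (size_t e = 0; e < m; ++e) out_degree[edges[e].src]++;

    for (int it = 0; it < iters; ++it) {
        /* Rank held by vertices with no outgoing edges is spread evenly. */
        double dangling = 0.0;
        for (int v = 0; v < n; ++v)
            if (out_degree[v] == 0) dangling += rank[v];

        for (int v = 0; v < n; ++v)
            scratch[v] = (1.0 - d) / n + d * dangling / n;

        /* Each vertex passes an equal share of its rank along every out-edge. */
        for (size_t e = 0; e < m; ++e)
            scratch[edges[e].dst] +=
                d * rank[edges[e].src] / out_degree[edges[e].src];

        for (int v = 0; v < n; ++v) rank[v] = scratch[v];
    }
}
```

Each pass over the edge list is one iteration in the sense of the chart. The per-edge update is independent work, which is the part a GPU library can spread across many threads.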
CUDA 8.0 PROFILING
Powerful Profiling with Dependency Analysis
In heterogeneous applications that do significant computation on both CPUs and GPUs, it can be a challenge to locate the best place to spend your optimization effort.
Visual Profiler provides dependency analysis between GPU kernels and CPU CUDA API calls, enabling critical path analysis in your application to help you more profitably target your optimization effort.
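To illustrate what critical path analysis means here, the sketch below computes the longest dependency chain through a small set of timed tasks. It is a generic longest-path calculation in plain C, not the Visual Profiler's implementation; `task_t`, `MAX_DEPS`, and `critical_path` are hypothetical names, and the tasks are assumed to be listed so that every dependency appears before the task that waits on it.

```c
#include <assert.h>

#define MAX_DEPS 4

/* Hypothetical task record: a CPU API call or GPU kernel with its duration. */
typedef struct {
    double duration;        /* seconds */
    int    ndeps;           /* number of valid entries in deps */
    int    deps[MAX_DEPS];  /* indices of earlier tasks this one waits for */
} task_t;

/*
 * Fills finish[i] with the earliest finish time of task i and returns
 * the critical-path length, i.e. the latest finish time over all tasks.
 */
double critical_path(const task_t *tasks, int n, double *finish)
{
    double longest = 0.0;

    for (int i = 0; i < n; ++i) {
        double start = 0.0;
        assert(tasks[i].ndeps >= 0 && tasks[i].ndeps <= MAX_DEPS);

        /* A task can start only after its slowest dependency finishes. */
        for (int k = 0; k < tasks[i].ndeps; ++k) {
            int dep = tasks[i].deps[k];
            assert(dep >= 0 && dep < i);
            if (finish[dep] > start) start = finish[dep];
        }

        finish[i] = start + tasks[i].duration;
        if (finish[i] > longest) longest = finish[i];
    }
    return longest;
}
```

Shortening a task that is not on the longest chain leaves the returned value unchanged, which is why the critical path is the profitable place to optimize.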
NVLINK - GPU CLUSTER
Two fully connected quads, connected at corners
160GB/s per GPU bidirectional to Peers
Load/store access to Peer Memory
Full atomics to Peer GPUs
High speed copy engines for bulk data copy
PCIe to/from CPU
MAXIMIZING BANDWIDTH
MVAPICH2-GDR 2.2b intra-node GPU-to-GPU pt2pt BiBW
[Chart: y-axis: Bandwidth (MB/s), 0 to 25000; series: Staging BW (MB/s), P2P BW (MB/s); labeled value: 21.3 GB/s]
NVLINK TOPOLOGY
NVLINK, Pascal architecture whitepaper
IBM POWER NVLINK SYSTEMS
APPROVED DESIGNS FOR OPENPOWER ECOSYSTEM
4 GPU POWER8 FOR PASCAL
[Diagram labels: P100 SXM2, NIC, CAPI]
GPUDIRECT P2P ON PASCAL
early results, P2P thru NVLink
[Chart: OpenMPI intra-node GPU-to-GPU pt2pt BiBW; y-axis: Bandwidth (MB/s), 0 to 40000; series: P100 NVLink, K80@875 PCI-E; labeled value: 34.2 GB/s]
x86 TO POWER MIGRATION
Korean user support cases
CUDA code (Windows x86 CPU + CUDA): recompile with the arch option
JCUDA (Windows x86 CPU + JCUDA): modify the header file in the jcuda git repository
TensorFlow (Ubuntu x86 CPU): resolve dependencies on protobuf and bazel
OPENACC
“We were extremely impressed that we can run OpenACC on a CPU with no code change and get equivalent performance to our OpenMP/MPI implementation.”
Wayne Gaudin and Oliver Perks, Atomic Weapons Establishment, UK
OpenACC Performance Portability: CloverLeaf Hydrodynamics Application
[Chart: CloverLeaf, OpenACC Performance Portability; y-axis: Speedup vs 1 CPU Core]
Benchmarked Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz, Accelerator: Tesla K80 (dual GPU)
CloverLeaf Performance – Tesla P100 Pascal
[Chart: y-axis: Speedup vs Single Haswell Core, 0 to 45; Haswell: OpenMP 7.2x, Haswell: OpenACC 6.6x, Tesla K80: OpenACC 13x, Tesla P100: OpenACC 34x]
CPU: Intel Xeon E5-2698 v3, 2 sockets, 32 cores, 2.30 GHz, HT disabled
GPU: NVIDIA Tesla K80 (single GPU), NVIDIA Tesla P100 (single GPU)
OS: CentOS 6.6, Compiler: PGI 16.5
UNIFIED MEMORY
Traditional Developer View: System Memory, GPU Memory
Developer View With Unified Memory: Unified Memory
Dramatically Lower Developer Effort
UNIFIED MEMORY
Traditional Developer View
    #include <stdio.h>   /* FILE, fread */
    #include <stdlib.h>  /* malloc, free */
    void use_data(float *z);  /* assumed to be defined elsewhere in the application */
    void foo(FILE *fp, int N) {
        float *x, *y, *z;
        x = (float *)malloc(N*sizeof(float));
        y = (float *)malloc(N*sizeof(float));
        z = (float *)malloc(N*sizeof(float));
        fread(x, sizeof(float), N, fp);
        fread(y, sizeof(float), N, fp);
        /* Explicit data clauses: each array is copied to and from GPU memory. */
        #pragma acc kernels copy(x[0:N],y[0:N],z[0:N])
        for (int i=0; i<N; ++i)
            z[i] = x[i] + y[i];
        use_data(z);
        free(z); free(y); free(x);
    }
Developer View With Unified Memory
    /* Same headers and use_data declaration as above. */
    void foo(FILE *fp, int N) {
        float *x, *y, *z;
        x = (float *)malloc(N*sizeof(float));
        y = (float *)malloc(N*sizeof(float));
        z = (float *)malloc(N*sizeof(float));
        fread(x, sizeof(float), N, fp);
        fread(y, sizeof(float), N, fp);
        /* No data clauses: with unified memory the arrays migrate on demand. */
        #pragma acc kernels
        for (int i=0; i<N; ++i)
            z[i] = x[i] + y[i];
        use_data(z);
        free(z); free(y); free(x);
    }
LBM D2Q37
D2Q37 model
Application developed at U Rome Tor Vergata/INFN, U Ferrara/INFN, TU Eindhoven
Reproduces the dynamics of a fluid by simulating virtual particles that collide and propagate
Simulation of large systems requires double precision computation and many GPUs
Lattice Boltzmann Method (LBM)
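The collide and propagate steps named on this slide can be sketched on a much smaller lattice. The code below swaps in a one-dimensional, three-velocity model (D1Q3) with a standard BGK relaxation and periodic boundaries, purely for illustration; it is not the D2Q37 application code, and `NX`, `collide`, `propagate`, and `total_mass` are hypothetical names.

```c
#include <assert.h>

#define NX 16   /* lattice sites (hypothetical size) */
#define NQ 3    /* D1Q3: rest, right-moving, left-moving populations */

static const int    cvel[NQ]   = {0, 1, -1};
static const double weight[NQ] = {2.0 / 3.0, 1.0 / 6.0, 1.0 / 6.0};

/* Collide: relax every population toward its local equilibrium (BGK). */
void collide(double f[NX][NQ], double tau)
{
    assert(tau > 0.5);
    for (int x = 0; x < NX; ++x) {
        double rho = f[x][0] + f[x][1] + f[x][2];   /* local density  */
        double u   = (f[x][1] - f[x][2]) / rho;     /* local velocity */
        for (int q = 0; q < NQ; ++q) {
            double cu  = cvel[q] * u;
            double feq = weight[q] * rho *
                         (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * u * u);
            f[x][q] -= (f[x][q] - feq) / tau;
        }
    }
}

/* Propagate: move each population one site along its velocity (periodic). */
void propagate(double src[NX][NQ], double dst[NX][NQ])
{
    for (int x = 0; x < NX; ++x)
        for (int q = 0; q < NQ; ++q)
            dst[(x + cvel[q] + NX) % NX][q] = src[x][q];
}

/* Sum of all populations; both steps above should leave it unchanged. */
double total_mass(double f[NX][NQ])
{
    double m = 0.0;
    for (int x = 0; x < NX; ++x)
        for (int q = 0; q < NQ; ++q)
            m += f[x][q];
    return m;
}
```

Collide touches only one site at a time and propagate only moves data between neighbors, so both loops are independent per site. That structure is what makes the method a natural fit for directive-based GPU offload.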
LBM D2Q37 – COLLIDE ACCELERATED
CPU Profile (480x512) using Unified Memory – 1 MPI rank
Rank  Method      Time (s) Final  Time (s) UM+propagate+bc  Time (s) Initial
0     main        7.69            2.39                      1.89
1     collide     0.52            49.99                     17.01
2     lbm         0.41            4.72                      0.06
3     init        0.19            0.19                      0.04
4     printMass   0.15            0.17                      0.01
5     propagate   0.13            2.15                      10.71
6     bc          0.09            0.11                      0.17
7     projection  0.05            0.05                      0.06
Application Reported Solvertime: 0.96 s (bc: 55.74 s, Initial: 27.85 s)
Profiler: Total Time for Process: 9.33 s (bc: 59.86 s, Initial: 30.15 s)
LBM D2Q37 – COLLIDE ACCELERATED
NVVP Timeline (480x512) using Unified Memory – 1 MPI rank
Data stays on GPU while
simulation is running
PERFORMANCE PORTABILITY FOR EXASCALE
Optimize Once, Run Everywhere with OpenACC
2015: NVIDIA GPU, AMD GPU, x86 CPU
2016: NVIDIA GPU, AMD GPU, x86 CPU, x86 Xeon Phi, OpenPOWER CPU
2017: NVIDIA GPU, AMD GPU, x86 CPU, x86 Xeon Phi, OpenPOWER CPU, ARM CPU
PGI Roadmaps are subject to change without notice.
Thank you.