
Welcome to the Waitless World

The World's First Commercial Server with Pascal GPUs

IBM “Minsky”

IBM Power Systems


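1. Minsky Feature Overview

Solving existing problems with new technology
• Page Migration Engine + Unified Memory = much simpler development work
• NVLink technology that resolves the P2P problem of existing GPU servers

A truly open architecture: the OpenPOWER platform
• The OpenPOWER Foundation, joined by more than 200 member companies including Google, IBM, NVIDIA, Mellanox, and Samsung Electronics
• A truly open architecture, achieved by opening up the POWER architecture

The latest and best GPU: PASCAL P100
• The only commercial server equipped with PASCAL-architecture GPUs
• Half-precision performance of 21 TFLOPS
• GPU memory bandwidth up to 3x that of the previous generation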

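2. Proposed Equipment Specifications > GPU Server IBM “Minsky”

IBM GPU server, code name “Minsky”

• Combines IBM POWER8 CPUs with NVIDIA P100 GPUs: four P100 cards of the latest Pascal architecture, with NVLink (40+40 GB/sec bidirectional bandwidth) connecting CPU-GPU as well as GPU-GPU
• POWER8 processors with 8 hardware threads per physical core (SMT-8)
• Powerful GPU computing packed into 2U, giving a clear advantage in floor space and power consumption relative to performance

| Item | Value |
| POWER8 processor (3.3GHz 8-core or 2.9GHz 10-core) | 2 (proposed: 3.3GHz 8-core) |
| HDD (1TB 7.2k rpm SATA) | 2 |
| PCIe card (1G, IB, NVMe) | 3 |
| GPU (PASCAL P100) | 4 |
| Total Power Supply AC input | 2223 W |
| Form Factor | 2U |
| Width | 442 mm |
| Height | 86 mm |
| Depth | 822 mm |
| Weight (estimated maximum) | 29.4 kg |

Code name “Minsky”
• The world's first commercial server with Pascal GPUs
• As of 2016, the only commercial server in which GPU-CPU as well as GPU-GPU connections use NVLink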

2. Proposed Equipment Specifications > Minsky System Hardware Details

POWER8 with NVLink (2x)
• 190W
• Integrated NVLink 1.0

Memory DIMM Risers (8x)
• 4 IS DDR4 DIMMs per riser
• Single Centaur per riser
• 32 IS DIMMs total
• 32-1024 GB memory capacity

PCIe slot (3x)
• Gen3 PCIe

NVIDIA GPU
• SXM2 form factor
• NVLink 1.0
• 300 W
• Max of 2 per socket

Power Supplies (2x)
• 1300W
• Common Form Factor Supply

Cooling Fans (4x)
• 80mm counter-rotating fans
• Hot swap

Storage Option (2x)
• 0-2 SATA HDD/SSD
• Tray design for install/removal
• Hot swap

Service Controller Card
• BMC Content

GPU-optimized structure: 2-socket POWER8, four P100 GPUs, 32 DDR4 DIMM slots, 3 PCIe slots

3. Proposed System Advantages > P100 vs. M40 Specification Comparison

| Item | P100 | M40 |
| Architecture | Pascal | Maxwell |
| SMs | 56 | 24 |
| FP32 CUDA Cores / SM | 64 | 128 |
| FP32 CUDA Cores / GPU | 3584 | 3072 |
| FP64 CUDA Cores / SM | 32 | 4 |
| FP64 CUDA Cores / GPU | 1792 | 96 |
| Base Clock | 1328 MHz | 948 MHz |
| GPU Boost Clock | 1480 MHz | 1114 MHz |
| CPU-GPU link | NVLink | PCIe Gen3 |
| Peak FP16 GFLOPS | 21200 | N/A |
| Peak FP32 GFLOPS | 10600 | 6840 |
| Peak FP64 GFLOPS | 5300 | 210 |
| Memory Interface | 4096-bit HBM2 | 384-bit GDDR5 |
| Memory Size | 16 GB | Up to 24 GB |
| L2 Cache Size | 4096 KB | 3072 KB |
| Memory bandwidth | 720 GB/s | 288 GB/s |
| TDP | 300 Watts | 250 Watts |
| Transistors | 15.3 billion | 8 billion |
| Manufacturing Process | 16-nm FinFET | 28-nm |

Key points
• 3x performance from the new half-precision instructions
• More CUDA cores
• Faster clock speed
• 2.5x the memory bandwidth of the M40, thanks to CoWoS HBM2 memory
• 2.5x higher P2P bandwidth thanks to NVLink

FinFET (Fin Field Effect Transistor), CoWoS (Chip-on-Wafer-on-Substrate), HBM2 (High Bandwidth Memory 2)
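The peak GFLOPS rows in the P100 vs. M40 comparison can be reproduced from the core counts and boost clocks in the same table. The sketch below assumes one fused multiply-add (2 FLOPs) per CUDA core per cycle, and that the P100 issues two FP16 operations per FP32 core per cycle; neither assumption is stated on the slide, and the function name is only illustrative.

```python
# Reproduce the peak-throughput rows of the P100 vs. M40 table.
# Assumption (not on the slide): each CUDA core retires one fused
# multiply-add, i.e. 2 floating-point operations, per clock cycle.

def peak_gflops(cores: int, boost_mhz: float, flops_per_cycle: int = 2) -> float:
    """Theoretical peak in GFLOPS at the boost clock."""
    return cores * flops_per_cycle * boost_mhz / 1000.0

# Core counts and boost clocks are taken from the comparison table.
p100_fp32 = peak_gflops(3584, 1480)
p100_fp64 = peak_gflops(1792, 1480)
# Assumption: FP16 runs at twice the FP32 rate on the P100.
p100_fp16 = 2 * p100_fp32
m40_fp32 = peak_gflops(3072, 1114)
m40_fp64 = peak_gflops(96, 1114)

for name, value in [
    ("P100 FP16", p100_fp16),
    ("P100 FP32", p100_fp32),
    ("P100 FP64", p100_fp64),
    ("M40 FP32", m40_fp32),
    ("M40 FP64", m40_fp64),
]:
    print(f"{name}: {value:,.0f} GFLOPS")
```

Under these assumptions the computed values land within about 2% of the rounded figures in the table (21200, 10600, 5300, 6840, 210).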

3. Proposed System Advantages > Five New Technologies of the Pascal Architecture

• 16 nm FinFET
• New FP16 instructions
• NVLink
• CoWoS HBM2
• PME + UM

FinFET (Fin Field Effect Transistor), CoWoS (Chip-on-Wafer-on-Substrate), HBM2 (High Bandwidth Memory 2), PME (Page Migration Engine), UM (Unified Memory)

Source: http://www.nvidia.com/object/gpu-architecture.html#utm_source=shorturl&utm_medium=referrer&utm_campaign=pascal
Source: http://www.nvidia.com/object/tesla-p100.html

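3. Proposed System Advantages > CoWoS and HBM2 GPU Memory

[Figure: cross-section of a P100 HBM2 stack and the GP100 GPU]

• Instead of surrounding the GPU with many memory chips, as in earlier GDDR5 designs, HBM2 stacks several memory dies vertically and connects them with through-silicon vias and microbumps
• The memory stacks connect to the GPU die through a passive silicon interposer
• Memory bandwidth is 180 GB/s per stack with HBM2, compared with 125 GB/s per stack with HBM1
• The P100 carries 4-die HBM2 stacks, for 16 GB in total

3x memory bandwidth: a total of 720 GB/sec from four HBM2 stacks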
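The HBM2 totals quoted above follow from the per-stack figures. The short sketch below checks that arithmetic; the stack count and per-stack bandwidth come from the slide, while the 1 GB (8 Gbit) capacity per die is an assumption introduced here so that four 4-die stacks add up to the stated 16 GB.

```python
# Check the HBM2 totals for the P100 from the per-stack figures.
STACKS = 4                   # HBM2 stacks on the P100 (from the slide)
BW_PER_STACK_GB_S = 180      # HBM2 bandwidth per stack (from the slide)
HBM1_BW_PER_STACK_GB_S = 125 # HBM1 bandwidth per stack (from the slide)
DIES_PER_STACK = 4           # "4-die HBM2 stack" (from the slide)
GB_PER_DIE = 1               # assumption: 8 Gbit dies, not stated on the slide
M40_BW_GB_S = 288            # M40 memory bandwidth (from the comparison table)

total_bw = STACKS * BW_PER_STACK_GB_S
total_capacity = STACKS * DIES_PER_STACK * GB_PER_DIE
per_stack_gain = BW_PER_STACK_GB_S / HBM1_BW_PER_STACK_GB_S
ratio_vs_m40 = total_bw / M40_BW_GB_S

print(f"Total bandwidth: {total_bw} GB/s")
print(f"Total capacity: {total_capacity} GB")
print(f"HBM2 vs HBM1 per stack: {per_stack_gain:.2f}x")
print(f"P100 vs M40 bandwidth: {ratio_vs_m40:.1f}x")
```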

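3. Proposed System Advantages > P100 Memory Support

• Page migration engine
– Supports virtual memory demand paging; 49-bit virtual addressing covers GPU memory as well as the 48-bit CPU address space
– Supports GPU page faulting, handling thousands of simultaneous page faults
– Supports a 2MB page size, improving TLB (Translation Look-Aside Buffer) efficiency for GPU memory

• Unified memory
– The restriction in Kepler and Maxwell that unified memory must fit within GPU memory is removed in Pascal, so the entire system memory can be used as unified memory
– Developers can now concentrate on the computation itself rather than on managing data movement in and out of GPU memory

Simpler development work: Page Migration Engine & Unified Memory overcome the limits of GPU memory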
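To put the address widths and page size above in perspective, the sketch below converts them into sizes. It treats the 16 GB of P100 memory as 16 GiB, and the 4 KiB page is a hypothetical small page size chosen only for comparison; it is not a figure from the slide.

```python
# Sizes implied by the address widths and page size on the slide.
TIB = 2**40

gpu_va_tib = 2**49 // TIB   # 49-bit GPU virtual address space
cpu_va_tib = 2**48 // TIB   # 48-bit CPU address space

HBM2_BYTES = 16 * 2**30     # assumption: 16 GB treated as 16 GiB
pages_2mib = HBM2_BYTES // (2 * 2**20)
# Hypothetical comparison point, not from the slide: 4 KiB pages.
pages_4kib = HBM2_BYTES // (4 * 2**10)

print(f"49-bit space: {gpu_va_tib} TiB, 48-bit space: {cpu_va_tib} TiB")
print(f"Pages needed to map GPU memory: {pages_2mib} at 2 MiB, "
      f"{pages_4kib} at 4 KiB ({pages_4kib // pages_2mib}x more)")
```

Fewer, larger pages mean each TLB entry covers more memory, which is the efficiency gain the slide attributes to 2MB pages.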

3. Proposed System Advantages > NVLink

• GPUs connected through NVLink can directly access not only their local memory but also the memory of other GPUs
• Full support for Pascal atomic operations
• Uses NVIDIA's new High-Speed Signaling interconnect (NVHS)
• Each differential pair runs at 20 Gb/s; eight pairs form a sub-link, and two sub-links (one per direction) carry bidirectional traffic
• One link supports 40 GB/s of bidirectional bandwidth
• The P100 supports four links, for 160 GB/sec in total
• POWER8 also supports four links, so GPU-CPU as well as GPU-GPU connections use NVLink

[Figure: two GPUs, each with graphics memory, and system memory, connected by NVLink at 40+40 GB/s; 2.5x PCIe Gen3]

NVLink: POWER8 is the only CPU that can connect GPU-CPU as well as GPU-GPU over NVLink
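The NVLink figures above build on one another, and a short calculation makes the chain explicit. This sketch assumes each differential pair signals at 20 Gbit/s (gigabits, not gigabytes), which is what makes eight pairs add up to 20 GB/s per direction; the constant names are illustrative.

```python
# Derive the NVLink 1.0 bandwidth figures step by step.
PAIR_RATE_GBIT_S = 20    # assumption: per-pair signaling rate in Gbit/s
PAIRS_PER_SUBLINK = 8    # eight pairs form a one-directional sub-link
SUBLINKS_PER_LINK = 2    # one sub-link per direction
LINKS_PER_P100 = 4       # links supported by the P100

# 8 pairs x 20 Gbit/s = 160 Gbit/s = 20 GB/s in one direction.
sublink_gb_s = PAIR_RATE_GBIT_S * PAIRS_PER_SUBLINK / 8
# Two sub-links give the bidirectional bandwidth of one link.
link_bidir_gb_s = sublink_gb_s * SUBLINKS_PER_LINK
# Four links give the aggregate bidirectional bandwidth of one P100.
p100_total_gb_s = link_bidir_gb_s * LINKS_PER_P100

print(f"Sub-link: {sublink_gb_s:.0f} GB/s per direction")
print(f"Link: {link_bidir_gb_s:.0f} GB/s bidirectional")
print(f"P100 total: {p100_total_gb_s:.0f} GB/s bidirectional")
```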

3. Proposed System Advantages > NVLink between CPU and GPU

• POWER8 is the only platform with NVLink between CPU and GPU
– Test results available today, not future projections, already show more than 2.5x the bandwidth
– The NVLink bus achieves higher efficiency than a PCIe link (82.5% vs 74% of peak)
– No code changes are needed to benefit from NVLink (CUDA 8 and go)

• The platform for developers who need bandwidth for the foreseeable future
– POWER8 with NVLink has been shipping since 2016
– Xeon E5-2600 Series CPUs are expected to stay on PCI-E x16 3.0 through 2017*

[Chart: Unidirectional Device Bandwidth Test. Link bandwidth, ping-pong (GB/sec): Tesla K40, PCI-E: 11.8; Tesla P100, NVLink: 33 (~2.79X). Typical ping-pong PCI-E device bandwidth: ~74% of the theoretical 16 GB/s unidirectional max.]

The NVLink effect: measured bandwidth 2.79x higher than the earlier K40

* http://www.nextplatform.com/2015/05/26/intel-lets-slip-broadwell-skylake-xeon-chip-specs/
* http://wccftech.com/intel-14nm-skylake-ep-10nm-cannonlake-ep-supported-purley-platform-160w-tdp-48-pcie-lanes-6-channel-ddr4/
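The efficiency and speedup figures on this slide can be cross-checked against the chart values. In the sketch below the measured bandwidths and the 16 GB/s PCIe peak come from the slide; the 40 GB/s NVLink peak is an inference from the quoted 82.5% efficiency, not a number the slide states.

```python
# Cross-check the ping-pong bandwidth chart against the quoted ratios.
PCIE3_X16_PEAK_GB_S = 16.0  # theoretical unidirectional max (from the slide)
NVLINK_PEAK_GB_S = 40.0     # assumption: inferred from 33 GB/s at 82.5%

pcie_measured = 11.8        # Tesla K40 over PCI-E (from the chart)
nvlink_measured = 33.0      # Tesla P100 over NVLink (from the chart)

pcie_efficiency = pcie_measured / PCIE3_X16_PEAK_GB_S
nvlink_efficiency = nvlink_measured / NVLINK_PEAK_GB_S
measured_speedup = nvlink_measured / pcie_measured
peak_ratio = NVLINK_PEAK_GB_S / PCIE3_X16_PEAK_GB_S

print(f"PCIe efficiency: {pcie_efficiency:.1%}")
print(f"NVLink efficiency: {nvlink_efficiency:.1%}")
print(f"Measured speedup: {measured_speedup:.3f}x")
print(f"Peak-to-peak ratio: {peak_ratio:.1f}x")
```

The measured speedup exceeds the 2.5x peak-to-peak ratio because NVLink also runs closer to its peak than PCIe does.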

3. Proposed System Advantages > Treating the GPU as a “Full Peer”

• Minsky is a system designed to be both fat and flat
– Designed so that no link becomes a data bottleneck
– GPUs handle system memory the same way the CPU does (up to 1TB of system memory)
– A “fat” pipe between GPUs on the same socket

• A structure that suits common workloads and algorithms
– Burst performance at startup/teardown
– A steady data stream between host and device
– Steady transfers between two GPUs
– Resolves host-device bus transfer problems (caused by insufficient bandwidth)

[Diagram labels: Fabric; IB (x2); CPU with DDR4 (x2); GPU pairs with NVLink (x2); 115 GB/s (x2); 80 GB/s (x2); Unified Memory Space up to 1TB]

P2P problem resolved: NVLink and Unified Memory minimize bottlenecks

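3. Proposed System Advantages > Page Migration Engine and POWER8 NVLink

• The most convenient system for developing and porting new applications
– A unified memory space made convenient by the NVIDIA Page Migration Engine
  • Unified memory: memory addresses span the CPU and GPUs, extending to 1TB or more
  • Hardware-managed transfers: removes the need for explicit data transfers
– Fast data throughput with POWER8 with NVLink
  • Larger memory demands faster CPU-GPU data movement

Difficulties of existing GPU programming (from the figure)
• The required memory space is too large
• Data movement is too complex
• Too much data movement
• GPU data movement requires too much custom coding
• Software UVM functionality is too limited
• Page faulting support is needed

Simpler programming: new technology resolves the difficulties of existing GPU programming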

3. Proposed System Advantages > Performance Gains on Minsky

• Performance gain with NVLink-equipped POWER8: Lattice QCD code runs about 4x faster than on the previous system
• x86 servers with P100: typically about 2.5x faster than before on the same code
• IBM research team: “The application cannot keep up with the bus speed”
– MILC needs refactoring

[Chart: Minsky Performance Increase vs 2x Tesla K80 System: MILC/LQCD. Y axis: Job Throughput (GFLOPS), 0 to 2500. X axis: Lattice Size (32x32x32x128, 32x32x32x256, 32x32x32x512). Series: 2x Tesla K80, 4x Tesla P100. Speedups: ~3.74X, ~3.90X, ~3.97X. Reference line: x86 Platform Speedup, vs CPU, 2x Tesla K80: ~2.5X.]

About 4x the performance: two K80 cards (four GPUs) vs. four P100 cards

3. Proposed System Advantages > POWER8 Machine Learning Support

Easy MLDL installation: “MLDL Distro”, a distribution of the major MLDL frameworks released by OpenPOWER

Source: https://developer.nvidia.com/cuda-release-candidate-download
Source: http://openpowerfoundation.org/blogs/openpower-deep-learning-distribution/

3. Proposed System Advantages > POWER8 TensorFlow Support

    git clone -b r0.8 --recurse-submodules https://github.com/tensorflow/tensorflow.git
    cd tensorflow

Add the following lines to third_party/gpus/crosstool/CROSSTOOL:

    default_toolchain {
      cpu: "ppc"
      toolchain_identifier: "local_linux"
    }

Then configure and build:

    ./configure
    bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

“My team ported TensorFlow and we're working with Google to include support in the source distribution. The next release of the MLDL distro will include TensorFlow.”
Michael Gschwind, PhD
Chief Engineer, Machine Learning and Deep Learning
Fellow, IEEE - Member, IBM Academy of Technology - IBM Master Inventor

TensorFlow support: enabled by a simple tool config change

Source: https://github.com/tensorflow/tensorflow.git
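The CROSSTOOL edit described above could also be scripted so that it is repeatable. The following is a minimal sketch, not part of the original procedure: the helper name `add_ppc_toolchain` is hypothetical, and it assumes that appending the stanza at the end of the file is acceptable and that the substring `cpu: "ppc"` is a sufficient check for an existing entry.

```python
from pathlib import Path

# Stanza quoted on the slide for third_party/gpus/crosstool/CROSSTOOL.
PPC_TOOLCHAIN = (
    'default_toolchain {\n'
    '  cpu: "ppc"\n'
    '  toolchain_identifier: "local_linux"\n'
    '}\n'
)


def add_ppc_toolchain(crosstool: Path) -> bool:
    """Append the ppc stanza unless one is already present.

    Returns True if the file was modified. The presence check is a plain
    substring match, which is an assumption of this sketch.
    """
    text = crosstool.read_text()
    if 'cpu: "ppc"' in text:
        return False
    if text and not text.endswith("\n"):
        text += "\n"
    crosstool.write_text(text + PPC_TOOLCHAIN)
    return True


if __name__ == "__main__":
    # Run from the root of the tensorflow checkout.
    path = Path("third_party/gpus/crosstool/CROSSTOOL")
    if path.exists():
        print("modified" if add_ppc_toolchain(path) else "already present")
    else:
        print(f"{path} not found; run this from the tensorflow directory")
```

Running the helper twice leaves the file unchanged the second time, so it is safe to include in a setup script ahead of `./configure`.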

3. Proposed System Advantages > POWER8 vs. E5-2690 v4 Performance

SPECfp_rate2006

| Vendor / Model | Threads | Cores | Chips | Peak | Peak/Core |
| Dell PowerEdge R730 (Intel Xeon E5-2690 v4, 2.60 GHz) | 56 | 28 | 2 | 888 | 31.7 |
| HP ProLiant DL380 Gen9 (2.60 GHz, Intel Xeon E5-2690 v4) | 56 | 28 | 2 | 952 | 34.0 |
| IBM Power S822LC (2.92 GHz, 20 core, Ubuntu) | 80 | 20 | 2 | 888 | 44.4 |
| IBM Power S824 (3.5 GHz, 24 core, RHEL) | 192 | 24 | 4 | 1130 | 47.1 |

POWER8 performance: demonstrated in the official SPEC floating-point benchmark

Source: http://www.spec.org/cpu2006/results/rfp2006.html
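The Peak/Core column in the SPECfp_rate2006 comparison is simply the peak score divided by the core count. The sketch below recomputes it from the table values; the short system labels are abbreviations introduced here for readability.

```python
# Recompute Peak/Core for the SPECfp_rate2006 comparison.
# Each tuple is (system label, cores, peak score), taken from the table.
RESULTS = [
    ("Dell PowerEdge R730 (E5-2690 v4)", 28, 888),
    ("HP ProLiant DL380 Gen9 (E5-2690 v4)", 28, 952),
    ("IBM Power S822LC", 20, 888),
    ("IBM Power S824", 24, 1130),
]

per_core = {name: round(peak / cores, 1) for name, cores, peak in RESULTS}

for name, value in per_core.items():
    print(f"{name}: {value} per core")

# Per-core advantage of the S822LC over the Dell system with the same peak.
s822lc_vs_dell = per_core["IBM Power S822LC"] / per_core["Dell PowerEdge R730 (E5-2690 v4)"]
print(f"S822LC vs Dell R730, per core: {s822lc_vs_dell:.2f}x")
```

Note that the S822LC matches the Dell system's peak score of 888 with 20 cores instead of 28, which is where the per-core difference comes from.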

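3. Proposed System Advantages > The OpenPOWER Foundation for Open Systems

Purpose of the OpenPOWER Foundation
• Drive broad innovation across the IT industry
• Offer better alternatives that solve the problems of current data center technology
• Energize the ecosystem around POWER technology

Status of the OpenPOWER Foundation
• Started in 2013 with five companies: IBM, Google, Mellanox, NVIDIA, and TYAN
• Grown to more than 200 members as of March 2016
• From Korea, two companies, Samsung Electronics and SK Hynix, participate in the memory area

[Figure: the new POWER8, designed and produced in collaboration with OpenPOWER]

April 2016: Google, an OpenPOWER platinum member, disclosed its POWER-architecture server development and software porting

A truly open system: collaboration among Google, IBM, NVIDIA, Mellanox, and others, made possible by opening the POWER architecture itself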

3. Proposed System Advantages > IBM POWER Processor Roadmap

• POWER5/5+ (130/90 nm, 2004)
• POWER6/6+ (65 nm, 2007): Dual Core, High Frequencies, Virtualization +, Memory Subsystem +, Altivec, Instruction Retry, Dynamic Energy Mgmt, SMT +, Protection Keys
• POWER7/7+ (45/32 nm, 2010): Eight Cores, On-Chip eDRAM, Power-Optimized Cores, Memory Subsystem ++, SMT ++, Reliability +, VSM & VSX, Protection Keys +
• POWER8 (22 nm, 2014): 12 Cores, SMT +++, Reliability ++, FPGA Support, Transactional Memory, PCIe Acceleration, L4 cache
• POWER9 (Future): Extreme Analytics Optimization, Extreme Big Data Optimization, On-chip accelerators

• In-house processor design and manufacturing technology
• A firm roadmap through POWER9

A solid roadmap: the POWER architecture has kept to a steady roadmap for the past 20 years

3. Proposed System Advantages > IBM / Mellanox / NVIDIA Joint Roadmap

[Roadmap diagram, 2014 to 2017, “Road to Exascale” (JDA). Rows as recoverable from the figure, in timeline order:]

• Power Systems: P8 Tuleta, 4U, 2 P8 (+ 2 GPU), PCIe Gen3, CAPI; Firestone, 2U, 2 P8 + 2 GPU, PCIe Gen3, CAPI; HPC Next, 2U, CAPI, NVLink; HPC Future, 2U, Enhanced CAPI, Enhanced NVLink
• Cooling: Air Cooled; Air/Water Cooled
• Programming Model: CUDA 5.5; CUDA 7, OpenMP 4.0; CUDA 8, OpenACC, OpenMP 4.0; CUDA 9, OpenMP 4.x
• Mellanox Interconnect Technology, Adapters: Connect-IB (dual ports); ConnectX-4 (dual ports); ConnectX-5 (dual ports)
• Mellanox Interconnect Technology, Switches: FDR InfiniBand; EDR InfiniBand; HDR InfiniBand
• GPUs: NVIDIA GPU (GK210); NVIDIA GPU (GP100); NVIDIA GPU (GV100)
• CPU Links: PCI-express Gen3; NVLink; Enhanced NVLink
• Chip Technology: CAPI over PCI-express Gen3; Enhanced CAPI over PCI-express Gen4
• CPU: Power8; Power8; Power8; Power Future

3. Proposed System Advantages > NVIDIA - IBM Acceleration Lab Support

Expert technical support: “Team up with IBM, NVIDIA on Advanced Acceleration”

• Advanced Acceleration: for customers already using GPU acceleration; achieve further performance gains with NVLink
• Going to POWER: for customers who have used GPUs only on x86; porting to ppc64 and performance testing
• Going Parallel: for customers with no POWER or GPU experience yet; GPU acceleration and ppc64 porting carried out together

Email for more information: [email protected]

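4. Summary of “Minsky” Proposal Advantages

The powerful combination of POWER8 and Tesla P100
• 3x the performance and 5x the memory bandwidth with Tesla P100
• 2.5x the CPU-GPU bandwidth

Workload performance that x86 cannot deliver, through NVLink bandwidth
• The only platform with NVLink between CPU and GPU
• Performance gains available only from IBM/OpenPOWER
• Performance that is available right now

Easier programming for both existing and new HPC workloads
• Combines the Page Migration Engine with NVLink-equipped POWER8
• Resolves performance bottlenecks for existing workloads as well

A chance to invest in the new Pascal instead of the aging Maxwell
• A platform with Tesla P100 and NVLink can be purchased now

An investment in the future: the best opportunity to invest in the latest PASCAL instead of MAXWELL

Q&A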
