Deep Learning with GPU Optimized Servers: Making Deep Learning More Efficient

James Hao 郝旭光, Server Dept.

Confidential © 2016 Supermicro

HETEROGENEOUS COMPUTING

SERIAL WORKLOADS
• Optimized for low-latency access to cached data sets
• Control logic for out-of-order and speculative execution

PARALLEL WORKLOADS
• Optimized for data-parallel, throughput computation
• Architecture tolerant of memory latency
• More transistors dedicated to computation

GPU APPLICATIONS

GPU Compute enables future applications
• Enriching the user experience via GPU compute
• Delivering heterogeneous, energy-efficient computing
• Allowing developers to unlock the potential of complex applications for consumers

Application areas:
• Defense Intelligence
• Safety and Security
• Computational Finance
• Research and Scientific
• Machine Learning
• Media & Entertainment
• Oil & Gas
• CAD and CAE

X10 GPU Server Portfolio (GPU Optimized)

Categories: Tower, Rack, Deep Learning

Model            GPU:CPU ratio   Form factor
7048GR           4:2             4U
4028GR           8:2             4U
1028GQ           4:2             1U
2028GR           6:2             2U
1028GR           3:2             1U
1018GR/5018GR    2:1             1U
4028GR-TR2       10:2            4U
1028GQ-TXRT      4:2             1U
4028GR-TXRT      8:2             4U
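The GPU:CPU ratios and chassis heights on this slide can be turned into a rough density comparison. The sketch below is only an illustration: the dictionary layout and the function names are assumptions made for this example, and the numbers are copied from the slide as-is.

```python
# GPU count, CPU count and chassis height (in U) transcribed from the slide.
# The (gpus, cpus, units) tuple layout is a convention chosen for this sketch.
PORTFOLIO = {
    "7048GR": (4, 2, 4),
    "4028GR": (8, 2, 4),
    "1028GQ": (4, 2, 1),
    "2028GR": (6, 2, 2),
    "1028GR": (3, 2, 1),
    "1018GR/5018GR": (2, 1, 1),
    "4028GR-TR2": (10, 2, 4),
    "1028GQ-TXRT": (4, 2, 1),
    "4028GR-TXRT": (8, 2, 4),
}


def gpus_per_unit(model):
    """GPUs per U of chassis height for one model."""
    gpus, _cpus, units = PORTFOLIO[model]
    return gpus / units


def gpus_per_cpu(model):
    """GPUs attached per CPU socket for one model."""
    gpus, cpus, _units = PORTFOLIO[model]
    return gpus / cpus


if __name__ == "__main__":
    # List the densest chassis first.
    for model in sorted(PORTFOLIO, key=gpus_per_unit, reverse=True):
        print(f"{model:15s} {gpus_per_unit(model):4.1f} GPUs/U  "
              f"{gpus_per_cpu(model):3.1f} GPUs/CPU")
```

Read this way, the 1U models with four GPUs are the densest per U, while the 4028GR-TR2 attaches the most GPUs to each CPU.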

Machine Learning – Driven By Scale

Platform            Connections    Year
CPU                 1 million      2007
GPU                 10 million     2008
Cloud (Many CPU)    1 billion      2011
HPC (Many GPU)      100 billion    2015

Development cycle: Architecture, Code, Experiment
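To put the scale figures on this slide in perspective, the implied yearly growth in connection count can be worked out directly. This is a minimal sketch using only the four data points shown; the function name `annual_growth` is made up for the example.

```python
# (year, connections) pairs transcribed from the slide.
MILESTONES = [(2007, 1e6), (2008, 1e7), (2011, 1e9), (2015, 1e11)]


def annual_growth(start, end):
    """Compound yearly growth factor between two (year, connections) points."""
    (year0, conn0), (year1, conn1) = start, end
    return (conn1 / conn0) ** (1.0 / (year1 - year0))


if __name__ == "__main__":
    overall = annual_growth(MILESTONES[0], MILESTONES[-1])
    print(f"2007 to 2015: about {overall:.2f}x more connections per year")
    for start, end in zip(MILESTONES, MILESTONES[1:]):
        print(f"{start[0]} to {end[0]}: {annual_growth(start, end):.2f}x per year")
```

Over the whole 2007 to 2015 span the connection count grows by a factor of 100,000, which works out to a little over 4x per year.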

CUSTOMER PAIN POINTS

PROBLEM
Machine Learning / AI applications have large datasets well beyond the capacity of a single GPU.

SOLUTION
Aggregate GPU resources to tackle large dataset computation, in conjunction with high speed connectivity to minimize latency.
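One common way to aggregate GPUs for a dataset that does not fit on one device is to split the samples into per-GPU shards. The sketch below shows only that index arithmetic in plain Python; it is not taken from the slides, and `shard_indices` is a hypothetical helper, not part of any framework.

```python
def shard_indices(num_samples, num_gpus):
    """Split range(num_samples) into num_gpus contiguous, near-equal shards.

    The first (num_samples % num_gpus) shards receive one extra sample so
    that every sample is assigned to exactly one GPU.
    """
    base, extra = divmod(num_samples, num_gpus)
    shards = []
    start = 0
    for rank in range(num_gpus):
        size = base + (1 if rank < extra else 0)
        shards.append(range(start, start + size))
        start += size
    return shards


if __name__ == "__main__":
    # Hypothetical example: 10 samples spread over 4 GPUs.
    for rank, shard in enumerate(shard_indices(10, 4)):
        print(f"GPU {rank}: samples {list(shard)}")
```

Each GPU then works on its own shard, and the high speed connectivity the slide mentions matters when the per-GPU results have to be combined.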

Server Portfolio GPU Peering

Best-in-class technology designed for augmented performance in Machine Learning applications, enabling users to train twice as fast and explore networks twice as large.

1028GQ-TXR (Scalability, 4 GPUs)
• 1U Chassis
• Dual HSW/BDW CPUs
• 16 DDR4 DIMMs
• 2 2.5” HS HDD bays
• 4 Pascal w/ 40GB/s NVLink
• 3/1 x16/x8 PCIe 3.0 slots
• 2 2000W Titanium PWS

4028GR-TR2 (Flexibility, 10 GPUs)
• 4U Chassis
• Dual HSW/BDW CPUs
• 24 DDR4 DIMMs
• 24 2.5” HS HDD bays
• 10 Double-Wide GPUs
• 11/1 x16/x8 PCIe 3.0 slots
• 4 (2+2) 2000W Titanium PWS

4028GR-TXR (HyperScale, 8 GPUs)
• 4U Chassis
• Dual HSW/BDW CPUs
• 24 DDR4 DIMMs
• 16 2.5” HS HDD bays
• 8 Pascal w/ 20GB/s NVLink
• 4/2 x16/x8 PCIe 3.0 slots
• 4 (2+2) 2000W Titanium PWS

New Architecture, More Performance

[Figure: GPU slot layouts of SYS-4028GR-TR(T) and SYS-4028GR-TR(T)2]

FROM    TO      SYS-4028GR-TR(T) (µs)    SYS-4028GR-TR(T)2 (µs)
GPU1    GPU2    6.6                      6.6
GPU2    GPU4    6.7                      6.6
GPU3    GPU9    21.2                     6.7
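The microsecond figures in this table can be compared pair by pair. The short sketch below does that division; the dictionary layout and the name `improvement` are assumptions for the example, and the values are the ones printed on the slide.

```python
# Microsecond figures per GPU pair: (SYS-4028GR-TR(T), SYS-4028GR-TR(T)2),
# transcribed from the table on the slide.
TIME_US = {
    ("GPU1", "GPU2"): (6.6, 6.6),
    ("GPU2", "GPU4"): (6.7, 6.6),
    ("GPU3", "GPU9"): (21.2, 6.7),
}


def improvement(pair):
    """Ratio of the older system's figure to the newer system's figure."""
    old_us, new_us = TIME_US[pair]
    return old_us / new_us


if __name__ == "__main__":
    for (src, dst) in TIME_US:
        print(f"{src} -> {dst}: {improvement((src, dst)):.2f}x")
```

The first two pairs are essentially unchanged, while GPU3 to GPU9 drops from 21.2 to 6.7, a bit more than a 3x reduction.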

PASCAL GPU ARCHITECTURE

NVLink
• Interconnect at 80 GB/s (Speed of CPU Memory)

Stacked 3D Memory
• 4x Higher Bandwidth: 1 TB/s (2.5x Capacity, 4x more Efficient)

Unified Memory
• Lower development effort (Available today in CUDA 6)

[Figure labels: NVLink 80 GB/s; Stacked HBM Memory 1 TB/s; DDR4 Memory 50-75 GB/s; Unified Memory]
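A quick way to compare the bandwidth figures on this slide is to ask how long a fixed-size buffer would take to move at each rate. The sketch below assumes peak bandwidth with no latency or protocol overhead, treats 1 TB/s as 1000 GB/s, and uses a hypothetical 16 GB buffer; none of those assumptions come from the slide.

```python
# Peak bandwidths in GB/s, transcribed from the slide.
# 1 TB/s is taken as 1000 GB/s for this sketch.
BANDWIDTH_GB_S = {
    "NVLink": 80.0,
    "Stacked HBM": 1000.0,
    "DDR4 (low)": 50.0,
    "DDR4 (high)": 75.0,
}


def transfer_ms(size_gb, link):
    """Idealized time in milliseconds to move size_gb at the link's peak rate."""
    return size_gb / BANDWIDTH_GB_S[link] * 1000.0


if __name__ == "__main__":
    size_gb = 16.0  # hypothetical buffer size, chosen only for illustration
    for link in BANDWIDTH_GB_S:
        print(f"{link:12s} {transfer_ms(size_gb, link):7.1f} ms")
```

Real transfers will be slower than these idealized numbers, but the ratios between the links stay the same.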

SYS-1028GQ-TXR / TXRT

PASCAL GPU READY
• Performance – 10 TFLOPs FP32
• NVLink – 5x PCIe
• 3D Memory – 2x Memory Bandwidth

X10 SUPERMICRO ADVANTAGE
● PERFORMANCE: 4x Pascal GPUs in 1U
● NVLINK: 80GB/s High Bandwidth GPU Interconnect
● GPU RDMA: Direct Internode GPU Interconnect
● EFFICIENCY: Titanium-rated Power Supply
● DESIGN: No GPU preheating

ADVANTAGES
• All GPUs are capable of peer-to-peer direct access to all other GPUs’ memory, as well as direct transfer (memcpy) operations via NVLink at high bandwidth
• High performance for collective communications
• PCIe bandwidth fully available for host and/or NIC communication during inter-GPU communication

Unparalleled 1U platform for the most highly parallel applications. No one else can do so much in a 1U! Up to 4 Pascal GPUs with NVLink in 1U, supporting optimized GPU RDMA.
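Peer-to-peer access between all GPUs is what makes collective operations such as all-reduce efficient. The slide does not say which algorithm is used; as an illustration only, the sketch below simulates a ring all-reduce in plain Python, with lists standing in for GPU buffers and list copies standing in for peer-to-peer transfers. The function name and the data are hypothetical.

```python
def ring_allreduce(buffers):
    """Sum equal-length lists across n simulated GPUs with a ring schedule.

    Each buffer is split into n chunks. In every step each GPU sends one
    chunk to its right-hand neighbour only, so all links are busy at once.
    """
    n = len(buffers)
    length = len(buffers[0])
    assert length % n == 0, "buffer length must be divisible by GPU count"
    chunk = length // n
    data = [list(b) for b in buffers]  # work on copies

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Reduce-scatter: after n-1 steps GPU r holds the full sum of
    # chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot all outgoing chunks first to model simultaneous sends.
        sends = [(r, (r - step) % n) for r in range(n)]
        payloads = [data[r][span(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % n
            data[dst][span(c)] = [a + b for a, b in zip(data[dst][span(c)], payload)]

    # All-gather: circulate the finished chunks until every GPU has all of them.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n) for r in range(n)]
        payloads = [data[r][span(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            data[(r + 1) % n][span(c)] = payload

    return data


if __name__ == "__main__":
    result = ring_allreduce([[1, 2, 3, 4], [10, 20, 30, 40]])
    print(result)
```

With n GPUs the schedule takes 2(n-1) steps and each GPU sends only one chunk per step, which is why the pattern benefits from fast direct links between neighbours.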

NVLINK ARCHITECTURE: CUBE MESH

SYS-4028GR-TXRT

● THROUGHPUT: Highest Parallelism with 8x Pascal GPUs
● NVLINK: 80GB/s High Bandwidth GPU Interconnect
● RDMA FABRIC: Lowest latency of data access and transfer
● FLEXIBILITY: Revolutionary Rack Scale Design
● DESIGN: Independent GPU and CPU thermal zones

Key Features:
1. Processor Support: Dual Xeon E5-2600 v4/v3 CPUs (Socket R3); 8 Tesla P100 (Pascal) GPUs (SXM2)
2. Memory Capacity: 24 DIMMs, 3TB ECC DDR4 2400MHz
3. Expansion Slots: 4 PCI-e 3.0 x16 (for RDMA via EDR); 2 PCI-e 3.0 x8
4. I/O ports: 1x VGA, 2x 10G-BaseT LAN, 3x USB 3.0, and 1x IPMI dedicated LAN port
5. Drive Bays: 16 hot-swap 2.5” drive bays (supports 8x NVMe)
6. System Cooling: 8 heavy-duty fans optimized to support 8 GPU cards
7. Power Supply: 4 x 2000W (2+2) Titanium Level efficiency redundant power supplies
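Two of the figures in this spec list lend themselves to a quick sanity check: the memory capacity implies a per-DIMM size, and the 2+2 power supply arrangement implies how much power is available with the spare units held back. The sketch below assumes 1 TB = 1024 GB and reads "2+2" as two active plus two redundant supplies; both readings are assumptions, and the function names are invented for the example.

```python
def per_dimm_gb(total_tb, dimm_slots):
    """Capacity per DIMM in GB needed to reach total_tb (assuming 1 TB = 1024 GB)."""
    return total_tb * 1024 / dimm_slots


def usable_watts(active_supplies, watts_each):
    """Power available from the active supplies, excluding redundant spares."""
    return active_supplies * watts_each


if __name__ == "__main__":
    # 3 TB across 24 DIMM slots, as listed for the SYS-4028GR-TXRT.
    print(f"DIMM size needed: {per_dimm_gb(3, 24):.0f} GB")
    # 4 x 2000W in a 2+2 arrangement: two active, two spare (assumed reading).
    print(f"Power with full redundancy: {usable_watts(2, 2000)} W")
```

Under those assumptions the system needs 128 GB DIMMs to reach 3 TB and has 4000 W available while keeping two supplies in reserve.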

GPU: 1U DP SYS-1028GQ-TR(T)

Motherboard: X10DGQ
Chassis: CSE-118GQETS-R2K03P
• Supports up to 4 double-width GPU cards (including GTX)
• Redundant Platinum Level 2000W power supplies
• No GPU preheat
• Cost-optimized system

Key Features:
1. Processor Support: Dual Xeon E5-2600 v4/v3 CPUs (Socket R3)
2. Memory Capacity: 16 DIMMs, up to 1TB ECC DDR4 2400MHz
3. Expansion Slots: 4 PCI-e 3.0 x16 for double-wide GPU cards; 2 x8 (in x16 slot) LP cards
4. I/O ports: 1x VGA, 2x GbE or 2x 10GBase-T LAN, 2x USB 3.0, and 1x IPMI dedicated LAN port
5. Drive Bays: 2 hot-swap 2.5” drive bays; 4 total 2.5” HDD bays
6. System Cooling: 9 counter-rotating fans with optimal fan speed control
7. Power Supply: 2000W Platinum Level efficiency redundant power supply

Key Applications:
• Oil & Gas
• Research & Scientific
• VDI technology
• Computational Finance

Thank You!