Confidential
© 2016 Supermicro
Server Dpt.
James Hao 郝旭光
Deep Learning with GPU Optimized Servers: Making Deep Learning More Efficient
HETEROGENEOUS COMPUTING
SERIAL WORKLOADS
• Optimized for low-latency access to cached data sets
• Control logic for out-of-order and speculative execution
PARALLEL WORKLOADS
• Optimized for data-parallel, throughput computation
• Architecture tolerant of memory latency
• More transistors dedicated to computation
GPU APPLICATIONS
GPU compute enables future applications
• Enriching the user experience via GPU compute
• Delivering heterogeneous, energy-efficient computing
• Allowing developers to unlock the potential of complex applications for consumers
Application areas
• Defense Intelligence
• Safety and Security
• Computational Finance
• Research and Scientific
• Machine Learning
• Media & Entertainment
• Oil & Gas
• CAD and CAE
X10 GPU Server Portfolio (GPU Optimized)
Categories: Tower, Rack, Deep Learning. Ratio is GPU:CPU.

Model            GPU:CPU   Height
7048GR           4:2       4U
4028GR           8:2       4U
1028GQ           4:2       1U
2028GR           6:2       2U
1028GR           3:2       1U
1018GR/5018GR    2:1       1U
Deep Learning
4028GR-TR2       10:2      4U
1028GQ-TXRT      4:2       1U
4028GR-TXRT      8:2       4U
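The GPU:CPU ratios and chassis heights above can be turned into a GPU density figure (GPUs per rack unit), which is one way to compare the models. The sketch below is only an illustration: the dictionary layout, the `gpus_per_u` helper, and the choice of density as a metric are assumptions made for this example; the numbers are copied from the portfolio list.

```python
# (GPU count, CPU count, chassis height in U), copied from the portfolio list.
portfolio = {
    "7048GR": (4, 2, 4),
    "4028GR": (8, 2, 4),
    "1028GQ": (4, 2, 1),
    "2028GR": (6, 2, 2),
    "1028GR": (3, 2, 1),
    "1018GR/5018GR": (2, 1, 1),
    "4028GR-TR2": (10, 2, 4),
    "1028GQ-TXRT": (4, 2, 1),
    "4028GR-TXRT": (8, 2, 4),
}


def gpus_per_u(gpus, height_u):
    """Hypothetical helper: GPUs per rack unit of chassis height."""
    return gpus / height_u


density = {model: gpus_per_u(g, u) for model, (g, _cpus, u) in portfolio.items()}

# Print the models from densest to least dense.
for model, d in sorted(density.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {d:.1f} GPUs per U")
```

On these figures the 1U 1028GQ models pack the most GPUs per rack unit, while the 4028GR-TR2 offers the most GPUs per system.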
Machine Learning – Driven By Scale
• CPU: 1 million connections (2007)
• GPU: 10 million connections (2008)
• Cloud (many CPU): 1 billion connections (2011)
• HPC (many GPU): 100 billion connections (2015)
Architecture, Code, Experiment
CUSTOMER PAIN POINTS
PROBLEM
Machine Learning / AI applications have large datasets, well beyond the capacity of a single GPU.
SOLUTION
Aggregate GPU resources to tackle large dataset computation, in conjunction with high-speed connectivity to minimize latency.
Server Portfolio: GPU Peering
Best-in-class technology designed for augmented performance in Machine Learning applications, enabling users to train twice as fast and explore networks twice as large.
1028GQ-TXR (Scalability)
• 1U Chassis
• Dual HSW/BDW CPUs
• 16 DDR4 DIMMs
• 2 2.5” HS HDD bays
• 4 Pascal w/ 40GB/s NVLink
• 3/1 x16/x8 PCIe 3.0 slots
• 2 2000W Titanium PWS
4028GR-TR2 (Flexibility)
• 4U Chassis
• Dual HSW/BDW CPUs
• 24 DDR4 DIMMs
• 24 2.5” HS HDD bays
• 10 Double-Wide GPUs
• 11/1 x16/x8 PCIe 3.0 slots
• 4 (2+2) 2000W Titanium PWS
4028GR-TXR (HyperScale)
• 4U Chassis
• Dual HSW/BDW CPUs
• 24 DDR4 DIMMs
• 16 2.5” HS HDD bays
• 8 Pascal w/ 20GB/s NVLink
• 4/2 x16/x8 PCIe 3.0 slots
• 4 (2+2) 2000W Titanium PWS
SYS-4028GR-TR(T) vs. SYS-4028GR-TR(T)2
[Figure: numbered GPU slot layouts of SYS-4028GR-TR(T) and SYS-4028GR-TR(T)2]
FROM    TO      SYS-4028GR-TR(T) (µs)    SYS-4028GR-TR(T)2 (µs)
GPU1    GPU2    6.6                      6.6
GPU2    GPU4    6.7                      6.6
GPU3    GPU9    21.2                     6.7
New Architecture More Performance
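The figures in the table above can be compared directly. The sketch below is only an illustration and assumes the readings are GPU-to-GPU transfer latencies in microseconds; the dictionary layout and the `improvement` helper are hypothetical names introduced for this example, and the values are copied from the table.

```python
# GPU-to-GPU readings in microseconds, copied from the table.
# Keys are (from, to) GPU pairs; values are (TR(T), TR(T)2) readings.
latency_us = {
    ("GPU1", "GPU2"): (6.6, 6.6),
    ("GPU2", "GPU4"): (6.7, 6.6),
    ("GPU3", "GPU9"): (21.2, 6.7),
}


def improvement(old_us, new_us):
    """Hypothetical helper: how many times lower the new reading is."""
    return old_us / new_us


ratios = {pair: improvement(old, new) for pair, (old, new) in latency_us.items()}

for (src, dst), ratio in ratios.items():
    print(f"{src} -> {dst}: {ratio:.2f}x lower on SYS-4028GR-TR(T)2")
```

On these figures only the GPU3 to GPU9 path changes materially, from 21.2 µs to 6.7 µs; the other two pairs are essentially unchanged.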
PASCAL GPU ARCHITECTURE
NVLink
• Interconnect at 80 GB/s (speed of CPU memory)
Stacked 3D Memory
• 4x higher bandwidth: 1 TB/s (2.5x capacity, 4x more efficient)
Unified Memory
• Lower development effort (available today in CUDA 6)
[Figure: Pascal GPU with stacked HBM memory at 1 TB/s, linked by NVLink at 80 GB/s to DDR4 memory at 50-75 GB/s, with unified memory spanning both]
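To put the bandwidth figures on this slide side by side, the sketch below estimates the ideal time to move a buffer at each quoted rate. It is a back-of-the-envelope illustration, not a benchmark: the 16 GB buffer size is a hypothetical value chosen for this example, the 50 GB/s DDR4 figure is the low end of the quoted 50-75 GB/s range, and latency and protocol overhead are ignored.

```python
# Peak bandwidths in GB/s as quoted on the slide.
bandwidth_gb_s = {
    "DDR4 memory (low end of 50-75 GB/s)": 50.0,
    "NVLink": 80.0,
    "Stacked HBM memory": 1000.0,  # 1 TB/s
}


def transfer_time_ms(size_gb, gb_per_s):
    """Ideal transfer time in milliseconds, ignoring latency and overhead."""
    return size_gb / gb_per_s * 1000.0


buffer_gb = 16.0  # hypothetical buffer size, not taken from the slides

times_ms = {
    name: transfer_time_ms(buffer_gb, bw) for name, bw in bandwidth_gb_s.items()
}

for name, ms in times_ms.items():
    print(f"{name}: {ms:.0f} ms")
```

Under these assumptions the estimate scales linearly with buffer size, so the ratios between the links hold for any other size.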
SYS-1028GQ-TXR / TXRT
PASCAL GPU READY
• Performance: 10 TFLOPs FP32
• NVLink: 5x PCIe
• 3D Memory: 2x memory bandwidth
X10 SUPERMICRO ADVANTAGE
● PERFORMANCE: 4x Pascal GPUs in 1U
● NVLINK: 80GB/s High Bandwidth GPU Interconnect
● GPU RDMA: Direct Internode GPU Interconnect
● EFFICIENCY: Titanium-rated Power Supply
● DESIGN: No GPU preheating
ADVANTAGES
• All GPUs are capable of peer-to-peer direct access to all other GPUs’ memory, as well as direct transfer (memcpy) operations via NVLink at high bandwidth
• High performance for collective communications
• PCIe bandwidth fully available for host and/or NIC communication during inter-GPU communication
Unparalleled 1U platform for highly parallel applications. No one else can do so much in 1U! Up to 4 Pascal GPUs with NVLink in 1U, supporting optimized GPU RDMA.
NVLINK ARCHITECTURE: CUBE MESH
SYS-4028GR-TXRT
Processor Support: Dual Xeon E5-2600 v4/v3 CPUs (Socket R3); 8 Tesla P100 (Pascal) GPUs (SXM2)
Memory Capacity: 24 DIMMs, 3TB ECC DDR4 2400MHz
Expansion Slots: 4 PCI-e 3.0 x16 (for RDMA via EDR); 2 PCI-e 3.0 x8
I/O Ports: 1x VGA, 2x 10G-BaseT LAN, 3x USB 3.0, and 1x IPMI dedicated LAN port
Drive Bays: 16 hot-swap 2.5” drive bays (supports 8x NVMe)
System Cooling: 8 heavy-duty fans optimized to support 8 GPU cards
Power Supply: 4 x 2000W (2+2) Titanium Level efficiency redundant power supplies
Key Features:
● THROUGHPUT: Highest Parallelism with 8x Pascal GPUs
● NVLINK: 80GB/s High Bandwidth GPU Interconnect
● RDMA FABRIC: Lowest latency of data access and transfer
● FLEXIBILITY: Revolutionary Rack Scale Design
● DESIGN: Independent GPU and CPU thermal zones
GPU: 1U DP SYS-1028GQ-TR(T)
Processor Support: Dual Xeon E5-2600 v4/v3 CPUs (Socket R3)
Memory Capacity: 16 DIMMs, up to 1TB ECC DDR4 2400MHz
Expansion Slots: 4 PCI-e 3.0 x16 for double-wide GPU cards; 2 x8 (in x16 slot) LP cards
I/O Ports: 1x VGA, 2x GbE or 2x 10GBase-T LAN, 2x USB 3.0, and 1x IPMI dedicated LAN port
Drive Bays: 2 hot-swap 2.5” drive bays; 4 total 2.5” HDD bays
System Cooling: 9 counter-rotating fans with optimal fan speed control
Power Supply: 2000W Platinum Level efficiency redundant power supply
Motherboard: X10DGQ
Chassis: CSE-118GQETS-R2K03P
Key Features:
• Supports up to 4 double-width GPU cards (including GTX)
• Redundant Platinum Level 2000W power supplies
• No GPU preheating
• Cost-optimized system
Key Applications:
• Oil & Gas
• Research & Scientific
• VDI technology
• Computational Finance
Thank You!