Global Marketing Confidential
Life Sciences and Meteorology: High-Performance Computing Solutions and Success Stories
凌巍才 (Ling Weicai), HPC Product Technology Consultant, Dell (China) Co., Ltd.
Contents
• Life-science high-performance computing solutions
  – GPU acceleration
  – High-performance storage
• WRF V3.3 (a meteorology application) tested and tuned on the Dell R720 server
  – gcc compiler
  – Intel compiler
• Success stories
Life-Science HPC GPU Solutions
In the life sciences, many users adopt GPU-accelerated solutions.
CPU + GPU Computing
HPCC GPU Heterogeneous Platform
GPU-Capable Dell Server Options (2012, 12th-Generation Servers)
                       External solutions (PowerEdge C)              Internal solutions
                     C6220+    C6220+    C6145+    C6145+
                     C410x     C410x     C410x     C410x      T620      R720
GPU:Socket ratio      1:1       2:1       1:1       2:1        2:1       1:1
Total system boards    8         4         4         2          1         1
Total HIC              8         4         8         4          0         0
IB capable            Yes       Yes       Yes       Yes        Yes*      Yes
Total GPU             16        16        16        16          4         2
Per-GPU B/W (x)        8         4         8         4          4        16
MSRP (M2075)       $117,000   $86,900  $114,000   $85,250    $19,000   $13,000
Power envelope (est) 5.525 kW 4.118 kW  5.030 kW  3.802 kW
Theoretical GFLOPs    TBD       TBD      9,326     8,932      2,431     1,401
Est. GFLOPs           TBD       TBD      2,891     1,697       TBD       TBD
GFLOPS/Rack U         TBD       TBD       413       339        486       701
$/GFLOPS              TBD       TBD        39        50          8         9
Rack size (U)          7         5         7         5          5         2
GPU/Rack U            2.3       3.2       2.3       3.2        0.8       1.0
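The density and cost rows in the table follow arithmetically from the others: GFLOPS/Rack U divides the GFLOPS figure (estimated where available, theoretical otherwise) by rack units, and $/GFLOPS divides MSRP by GFLOPS. A quick sketch reproducing the populated columns (values copied from the table; round-half-up rounding is assumed):

```python
# Recompute GFLOPS/Rack U and $/GFLOPS from the table's raw rows.
# Values are copied from the table above; round-half-up is assumed.
configs = {
    # name: (GFLOPs used for the ratio, rack U, MSRP in $)
    "C6145 1:1 (est.)": (2891, 7, 114000),
    "C6145 2:1 (est.)": (1697, 5, 85250),
    "T620 (theor.)":    (2431, 5, 19000),
    "R720 (theor.)":    (1401, 2, 13000),
}

def half_up(x: float) -> int:
    """Round half up, matching the table's rounding."""
    return int(x + 0.5)

for name, (gflops, rack_u, msrp) in configs.items():
    print(f"{name}: {half_up(gflops / rack_u)} GFLOPS/U, "
          f"${half_up(msrp / gflops)}/GFLOPS")
```

This reproduces the table's 413/339/486/701 GFLOPS/U and $39/$50/$8/$9 figures.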
GPU Expansion Chassis (External GPU Option): Dell PowerEdge C410x

PCIe expansion chassis connecting 1-8 hosts to 1-16 PCIe devices
• 3U chassis, 19" wide, 143 pounds
• PCI Express modules: 10 front, 6 rear
• PCI form factors: HH/HL and FH/HL
• Up to 225 W per module
• PCIe inputs: 8 PCIe x16 iPASS ports
• PCIe fan-out options: x16 to 1, 2, 3, or 4 slots
• GPUs supported: NVIDIA M1060, M2050, M2070 (TBD)
• Thermals: high-efficiency 92 mm fans; N+1 fan redundancy
• Management: on-board BMC; IPMI 2.0; dedicated management port
• Power supplies: 4 x 1400 W hot-plug, high-efficiency PSUs; N+1 power redundancy
• Services vary by region: IT Consulting, Server and Storage Deployment, Rack Integration (US only), Support Services
Great for: HPC including universities, oil & gas, biomed research, design, simulation, mapping, visualization, rendering, and gaming
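As a sanity check on these specs, the worst-case module load fits within the chassis's redundant power capacity. A small sketch (chassis overhead such as fans and the BMC is ignored, since the spec sheet does not break it out):

```python
# C410x power budget from the specs above: 16 modules at up to 225 W,
# 4 x 1400 W PSUs with N+1 redundancy (so 3 PSUs must carry the load).
# Chassis overhead (fans, BMC) is ignored in this sketch.
MODULE_W, MODULES = 225, 16
PSU_W, PSUS = 1400, 4

worst_case_load = MODULE_W * MODULES        # 3600 W of GPU modules
redundant_capacity = PSU_W * (PSUS - 1)     # 4200 W with one PSU failed

print(worst_case_load, redundant_capacity)  # 3600 4200
```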
PowerEdge C410x PCIe Modules
• Serviceable PCIe module ("taco") capable of supporting any half-height/half-length (HH/HL) or full-height/half-length (FH/HL) card
• FH/FL cards supported with an extended PCIe module
• Future-proofing for next generations of NVIDIA and AMD ATI GPU cards

[Figure callouts: power connector for the GPGPU card; board-to-board connector for x16 Gen PCIe signals and power; GPU card; LED]
PowerEdge C410x Configurations
• Enabling HPC applications to optimize the cost/performance equation off a single x16

Fan-out options:
• 4 GPU per x16: 16 GPU / 5U
• 3 GPU per x16: 12 GPU / 5U
• 2 GPU per x16: 16 GPU / 7U
• 1 GPU per x16: 8 GPU / 7U

[Diagram: each C6100 host connects through an x16 HIC and iPass cable to a PCI switch in the C410x, which fans out to 1-4 GPUs.]
• 7U = (1) C410x + (2) C6100
• 5U = (1) C410x + (1) C6100

GPU/U ratios assume a PowerEdge C6100 host with 4 servers per 2U chassis.
Flexibility of the PowerEdge C410x
• Increases to 8:1 possible with dual x16

[Diagram: each host connects through two x16 HICs and iPass cables to the C410x's PCI switches, which fan out to 2, 4, or 8 GPUs per host.]
PowerEdge C6100 Configurations: "2:1 Sandwich"

Summary
• One Dell C410x (16 GPUs) + two C6100 (8 nodes)
• One x16 slot per node to 2 GPUs
• 16 GPUs total, 8 nodes total (2 GPUs per board)
• 7U total

Details
• Two C6100, 8 system boards
  – 2S Westmere, 12 DIMM slots, QDR IB, up to 6 drives per host
  – Single-port x16 HIC (iPASS)
• Single C410x
  – 16 GPUs (fully populated)
  – PCIe x8 per GPU
• Total space = 7U

Note: this configuration is equivalent to using the C6100 and the NVIDIA S2050, but it is more dense.
PowerEdge C6100 Configurations: "4:1 Sandwich"

Summary
• One Dell C410x (16 GPUs) + one C6100 (4 nodes)
• One x16 slot per node to 4 GPUs
• 16 GPUs total, 4 nodes total (4 GPUs per board)
• 5U total

Details
• One C6100, 4 system boards
  – 2S Westmere, 12 DIMM slots, QDR IB, up to 6 drives per host
  – Single-port x16 HIC (iPASS)
• Single C410x
  – 16 GPUs (fully populated)
  – PCIe x4 per GPU
• Total space = 5U
PowerEdge C6100 Configurations: "8:1 Sandwich" (Possible Future Development)

Summary
• Two Dell C410x (32 GPUs) + one C6100 (4 nodes)
• One x16 slot per node to 8 GPUs
• 32 GPUs total, 4 nodes total (8 GPUs per board)
• 8U total

Details
• One C6100, 4 system boards
  – 2S Westmere, 12 DIMM slots, QDR IB, up to 6 drives per host
  – Single-port x16 HIC (iPASS)
• Two C410x
  – 32 GPUs (fully populated)
  – PCIe x2 per GPU
• Total space = 8U
• See later table for metrics
PowerEdge C6145 Configurations: "8:1 Sandwich"

Summary
• One Dell C410x (16 GPUs) + one C6145 (2 nodes)
• Two to four HIC slots per node, 8 GPUs per node
• 16 GPUs total, 2 nodes total
• 5U total

Details
• One C6145, 2 system boards
  – 4S MagnyCours, 32 DIMM slots, QDR IB, up to 12 drives per host
  – 3 x single-port x16 HIC (iPASS) + 1 x single-port onboard x16 HIC (iPASS)
• One C410x
  – 16 GPUs (fully populated)
  – PCIe x4-x8 per GPU
• Total space = 5U
PowerEdge C6145 Configurations: "16:1 Sandwich"

Summary
• Two Dell C410x (32 GPUs) + one C6145 (2 nodes)
• Four HIC slots per node to 16 GPUs
• 32 GPUs total, 2 nodes total (16 GPUs per board)
• 8U total

Details
• One C6145, 2 system boards
  – 4S MagnyCours, 32 DIMM slots, QDR IB, up to 12 drives per host
  – 3 x single-port x16 HIC (iPASS) + 1 x single-port onboard x16 HIC (iPASS)
• Two C410x
  – 32 GPUs (fully populated)
  – PCIe x4 per GPU
• Total space = 8U
PowerEdge C410x Block Diagram

Host connections (x8) feed level-1 switches (x8), which feed level-2 switches (x4), which fan out to the GPUs (x16).
C410x BMC Console Configuration Interface
GPU Expansion Chassis: Supported Servers
• Dell external GPU solution support
  – A Hardware Interface Card (HIC) in a PCIe slot connects to external GPU(s) in the C410x
  – Dell "slot validates" NVIDIA interface cards to verify power, thermals, etc.

HIC/C410x support matrix:

Server            C410x support   Planned support date
C6100             Yes             Now
C6105             RTS+            Now (BIOS 1.7.1 or later)
C6145             RTS             Now
C1100             Yes             Now
Precision R5500   Yes             Now (disable SSC in BIOS)
R710              Yes             Now
M610x             Yes             Now
R410              Yes             Now
R720              RTS             RTS
R720xd            RTS             RTS
R620              RTS             RTS
C6220             RTS             RTS
Life-Science Application Testing: GPU-HMMER
Dell High Performance Computing
[Chart: GPU-HMMER, CPU vs. GPU. Wall clock (s) vs. HMM length (415, 983, 1419, 2293) for CPU-only and C410x/C6100 with 1 GPU; GPU speedups of 2.9X, 2.8X, 2.7X, and 1.8X.]
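The speedup figures in these benchmark charts are wall-clock ratios: the CPU-only run time divided by the GPU-accelerated run time. A minimal sketch (the timings below are illustrative placeholders, not the measured values behind the charts):

```python
# Speedup as reported in the charts: CPU-only wall clock divided by
# GPU-accelerated wall clock. Timings here are illustrative placeholders.
def speedup(t_cpu_s: float, t_gpu_s: float) -> float:
    return t_cpu_s / t_gpu_s

# e.g. a 2900 s CPU-only run that takes 1000 s with a GPU:
print(f"{speedup(2900.0, 1000.0):.1f}X")  # 2.9X
```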
GPU:Host Scaling: GPU-HMMER
[Chart: GPU-HMMER GPU scaling. Wall clock (s) vs. HMM length (415, 983, 1419, 2293) for C410x/C6100 with 1, 2, and 4 GPUs and an internal 2-x16 (2 GPU) configuration; speedups of 1.8X, 3.6X, 7.2X, and 3.6X.]
GPU:Host Scaling: NAMD
[Chart: NAMD, STMV benchmark. Steps/second: CPU 0.10; C410x/C6100 with 1 GPU 0.47; 2 GPUs 0.82; 4 GPUs 1.52; internal 2-x16 (2 GPUs) 0.95. Speedups of 4.7X, 8.2X, 15.2X, and 9.5X over CPU.]
GPU:Host Scaling: LAMMPS LJ-Cut
[Chart: LAMMPS LJ GPU scaling. Wall clock (s) vs. number of particles (256,000; 500,000; 1,000,188) for C410x/C6100 with 1, 2, and 4 GPUs and an internal 2-x16 (2 GPU) configuration; speedups of 8.5X, 13.5X, 14.4X, and 14.0X.]
Life-Science Storage Solutions
Compute and Data Capacity Growth in the Life Sciences
The Lustre Parallel File System
• Key Lustre components:
  1. Clients (compute nodes)
     – "Users" of the file system where applications run
     – The Dell HPC cluster
  2. Metadata Server (MDS)
     – Holds metadata information
  3. Object Storage Server (OSS)
     – Provides back-end storage for users' files
     – Additional OSS units increase throughput linearly

[Diagram: clients connect to the MDS and to multiple OSS units.]
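The last point, that each added OSS unit increases throughput roughly linearly, can be sketched as a back-of-envelope model (the per-OSS figure below is an assumed placeholder, not a measured number):

```python
# Back-of-envelope model of Lustre OSS scaling: aggregate bandwidth
# grows roughly linearly with OSS count. The per-OSS throughput is an
# assumed placeholder, not a measured figure.
def aggregate_mb_s(n_oss: int, per_oss_mb_s: float = 1000.0) -> float:
    return n_oss * per_oss_mb_s

for n in (1, 2, 4, 8):
    print(f"{n} OSS -> {aggregate_mb_s(n):.0f} MB/s")
```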
InfiniBand (IPoIB) NFS Performance: Sequential Read
• Peaks:
  – NSS Small: 1 node doing IO (fairly level until 4 nodes)
  – NSS Medium: 4 nodes doing IO (not much drop-off)
  – NSS Large: 8 nodes doing IO (good performance over range)

[Chart: NSS IPoIB sequential reads. Throughput (KB/s, scale to 1,600,000) vs. threads/nodes (1, 2, 4, 8, 16, 24, 32) for NSS Small, Medium, and Large.]
InfiniBand (IPoIB) NFS Performance: Sequential Write
• Peaks:
  – NSS Small: 1 node doing IO (steady drop-off to 16 nodes)
  – NSS Medium: 2 nodes doing IO (good performance for up to 8 nodes)
  – NSS Large: 4 nodes doing IO (good performance over range)

[Chart: NSS IPoIB sequential writes. Throughput (KB/s, scale to 1,600,000) vs. threads/nodes (1, 2, 4, 8, 16, 24, 32) for NSS Small, Medium, and Large.]
WRF V3.3 Application Testing and Tuning
Dell Test Environment
• Dell R720
  – CPU: 2x Intel Sandy Bridge E5-2650
  – Memory: 8x 8 GB (64 GB total)
  – Hard disk: 2x 300 GB 15K rpm (RAID 0)
• BIOS settings
  – HT disabled
  – Memory optimized
  – High Performance enabled (max power)
• OS
  – Red Hat Enterprise Linux 6.3

gcc Test Stack
• gcc, gfortran, g++
• zlib 1.2.5
• HDF5 1.8.8
• NetCDF 4
• WRF V3.3
Test Results
• WRF output covers Nov 30, 2011 through Dec 5, 2011; total run time 13h 9m 53s
  – wrf.exe starts at: Sun Apr 29 09:35:36 CST 2012
  – wrf: SUCCESS COMPLETE WRF
  – wrf.exe completed at: Sun Apr 29 22:45:29 CST 2012
Configuration File (settings for x86_64 Linux, gfortran compiler with gcc, smpar)

DMPARALLEL       = 1
OMPCPP           = -D_OPENMP
OMP              = -fopenmp
OMPCC            = -fopenmp
SFC              = gfortran
SCC              = gcc
CCOMP            = gcc
DM_FC            = mpif90 -f90=$(SFC)
DM_CC            = mpicc -cc=$(SCC)
FC               = $(SFC)
CC               = $(SCC) -DFSEEKO64_OK
LD               = $(FC)
RWORDSIZE        = $(NATIVE_RWORDSIZE)
PROMOTION        = # -fdefault-real-8 # uncomment manually
ARCH_LOCAL       = -DNONSTANDARD_SYSTEM_SUBR
CFLAGS_LOCAL     = -w -O3 -c -DLANDREAD_STUB
LDFLAGS_LOCAL    =
CPLUSPLUSLIB     =
ESMF_LDFLAG      = $(CPLUSPLUSLIB)
FCOPTIM          = -O3 -ftree-vectorize -ftree-loop-linear -funroll-loops
FCREDUCEDOPT     = $(FCOPTIM)
FCNOOPT          = -O0
FCDEBUG          = # -g $(FCNOOPT)
FORMAT_FIXED     = -ffixed-form
FORMAT_FREE      = -ffree-form -ffree-line-length-none
FCSUFFIX         =
BYTESWAPIO       = -fconvert=big-endian -frecord-marker=4
FCBASEOPTS_NO_G  = -w $(FORMAT_FREE) $(BYTESWAPIO)
FCBASEOPTS       = $(FCBASEOPTS_NO_G) $(FCDEBUG)
MODULE_SRCH_FLAG =
TRADFLAG         = -traditional
CPP              = /lib/cpp -C -P
AR               = ar
ARFLAGS          = ru
M4               = m4 -G
RANLIB           = ranlib
CC_TOOLS         = $(SCC)
wrf.out (excerpt)

WRF NUMBER OF TILES FROM OMP_GET_MAX_THREADS = 16
WRF TILE 1: IS 1, IE 250, JS 1, JE 10
… (tiles 2-15 each cover IS 1 to IE 250, with JS/JE strips of 9-10 rows) …
WRF TILE 16: IS 1, IE 250, JS 141, JE 150
WRF NUMBER OF TILES = 16
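The tile bounds in this output are consistent with splitting the 150-row J dimension across 16 OpenMP tiles: 150 = 16 x 9 + 6, so six tiles get 10 rows and ten tiles get 9. A quick check (this only verifies the row counts, not WRF's exact assignment order):

```python
# Check that 16 tiles over 150 J-rows yields tiles of 9 or 10 rows.
rows, tiles = 150, 16
base, extra = divmod(rows, tiles)  # base=9 rows each; extra=6 tiles get +1

sizes = [base + 1] * extra + [base] * (tiles - extra)
print(sorted(sizes), sum(sizes))   # six 10-row tiles, ten 9-row tiles, 150 total
```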
System Resource Analysis: CPU (mpstat -P ALL)

Linux 2.6.32-257.el6.x86_64 (r720)   04/29/2012   _x86_64_   (16 CPU)

CPU   %usr   %nice  %sys  %iowait  %irq  %soft  %steal  %guest  %idle
all   85.27  0.00   2.62  0.01     0.00  0.00   0.00    0.00    12.10

(The per-CPU rows 0-15 are nearly uniform: roughly 85% usr, 2.6% sys, 12% idle on every core.)
System Resource Analysis: Memory (free)

                    total      used      free    shared  buffers    cached
Mem:             65895488  32823072  33072416        0    38220  26885024
-/+ buffers/cache:          5899828  59995660
Swap:            66027512         0  66027512
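The "-/+ buffers/cache" line can be reproduced from the Mem: row: free subtracts reclaimable buffers and page cache from "used" (the values below are copied from the output above):

```python
# Reproduce free's "-/+ buffers/cache" line from the Mem: row above.
total, used, free_kb = 65895488, 32823072, 33072416
buffers, cached = 38220, 26885024

app_used = used - buffers - cached     # memory applications actually hold
app_free = free_kb + buffers + cached  # memory reclaimable for applications

print(app_used, app_free)  # 5899828 59995660
```

This matches the -/+ buffers/cache row, and shows that most of "used" here is reclaimable cache rather than WRF's working set.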
System Resource Analysis: IO and Disk

IO (iostat):
Device:   tps     Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
sda       9.01    125.71      2063.47     3096354   50823660
dm-0      0.64    12.63       1.99        311170    49016
dm-1      0.01    0.10        0.00        2576      0
dm-2      258.17  112.05      2061.48     2759698   50774616

Disk (df):
Filesystem                    1K-blocks  Used      Available  Use%  Mounted on
/dev/mapper/vg_r720-lv_root   51606140   5002372   43982328   11%   /
tmpfs                         32947744   88        32947656   1%    /dev/shm
/dev/sda1                     495844     37433     432811     8%    /boot
/dev/mapper/vg_r720-lv_home   458559680  58258760  377007380  14%   /home
Intel Testing
Intel links
• http://software.intel.com/en-us/articles/building-the-wrf-with-intel-compilers-on-linux-and-improving-performance-on-intel-architecture/
• http://software.intel.com/en-us/articles/wrf-and-wps-v311-installation-bkm-with-inter-compilers-and-intelr-mpi/
• http://www.hpcadvisorycouncil.com/pdf/WRF_Best_Practices.pdf
Intel Compilers Flags
Intel Tuning
http://software.intel.com/en-us/articles/performance-hints-for-wrf-on-intel-architecture/
1. Reducing MPI overhead:
   • -genv I_MPI_PIN_DOMAIN omp
   • -genv KMP_AFFINITY=compact
   • -perhost
2. Improving cache and memory bandwidth utilization:
   • numtiles = X
3. Using Intel® Math Kernel Library (MKL) DFT for polar filters:
   • Depending on workload, Intel® MKL DFT may provide up to 3x speedup of simulation speed
4. Speeding up computations by reducing precision:
   • -fp-model fast=2 -no-prec-div -no-prec-sqrt
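Hint 1 translates into the Intel MPI launch line roughly as follows. This is a sketch: the rank count, the -perhost value, and the wrf.exe path are placeholders, not the settings used in the measured runs.

```python
# Assemble an Intel MPI launch line using the tuning flags from hint 1.
# Rank count, -perhost value, and the wrf.exe path are placeholders.
tuning = [
    "-genv", "I_MPI_PIN_DOMAIN", "omp",  # pin each MPI rank to an OpenMP domain
    "-genv", "KMP_AFFINITY=compact",     # pack OpenMP threads onto adjacent cores
    "-perhost", "2",                     # MPI ranks per node (placeholder)
]
cmd = ["mpirun", *tuning, "-n", "4", "./wrf.exe"]
print(" ".join(cmd))
```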
Case Studies
• BGI (华大基因研究院)
• Tsinghua University School of Life Sciences (清华大学生命科学院)
Success References in Life Science
• Domestic (China)
  – Beijing Genome Institute (BGI)
  – Tsinghua University Life Institute
  – Beijing Normal University
  – Jiangsu Taicang Life Institute
  – The 4th Military Medical University
  – …
• International
  – David H. Murdock Research Institute
  – Virginia Bioinformatics Institute
  – University of Florida speeds up memory-intensive gene …
  – UCSF
  – National Center for Supercomputing Applications
  – …
Thank you!