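Innovation and Openness, Joining Points into a Plane: Huawei Helps Build Integrated HPC Infrastructure for the Education Sector
Xie Haibo, [email protected], HPC Solutions Director, Huawei IT Product Line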

Source: hpc.csu.edu.cn/uploads/Img2/20180710/4.pdf · 2018-07-10


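Shandong University public computing platform: the first high-performance cloud computing cluster in China

A public computing platform with 384 TFLOPS of compute capacity, serving the entire university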

02 Huawei's innovation DNA

03 Open HPC solutions

Joining points into a plane, winning together

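About Huawei

Key figures (values lost in extraction): countries/regions, Fortune Global 500 ranking, employees, R&D employees, joint innovation centers, research institutes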

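Sustained investment in research drives innovation

Cumulative R&D investment, 2008 to 2017 (amount lost in extraction; stated in units of 100 million)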

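Continuous innovation makes computing simple

Solutions: SAP HANA appliance, HPC, big data, hyper-converged infrastructure, Azure Stack solution, edge-computing video analytics solution

Acceleration components: ACC, FPGA, NIC, NVMe SSD, Intelligent NIC

FusionServer families (form factors: traditional, modular):
• FusionServer rack servers: rack-optimized, standard 2S to 8S x86, designed for medium and large enterprises
• X FusionServer: high-density servers optimized for large-scale application deployment
• E FusionServer: blade system, converged infrastructure delivering maximum efficiency
• G FusionServer: GPU servers for HPC, video, AI/DL, and other scenarios that need a GPU computing environment

Unique innovations (general-purpose and application-specific): FDM, DEMT

Chip innovation: NC: Hi1503; NIC: Hi1822; storage: Hi1812; BMC: Hi1710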

Chip design capability across compute, storage, networking, and management

Diagram labels: CPUs joined by the Node Controller (NC) interconnect chip; storage/network I/O controller chips (NIC controller, SSD controller); server management chip (BMC).

1, Hi1503 (Node Controller): high-speed interconnect chip for Intel E7 v3/v4 processors; 32S scale-up
2, Hi1812 (SSD controller): PCIe/NVMe SSD storage controller chip; read/write I/O acceleration
3, Hi1822 (NIC controller): programmable network controller chip; DC high-speed flexible interconnect
4, Hi1710 (BMC): BMC management chip; built-in fault diagnosis expert library and patented processing mechanism

DEMT: patented high-energy-efficiency technology

Chart: Power saving of DEMT with different CPU loads. X-axis: CPU load, 10% to 100% in steps of 10%. Data labels in extraction order: 0.5%, 1.4%, 3.8%, 5.9%, 9.0%, 16%, 12.3%, 9.7%, 5.5%, 1.9%. Average: 13.4%.

• Optimized Digital Thermal Sensor (DTS) algorithm for dynamically tuning target temperature, with higher tuning precision
• Power capping: allocates power supply and heat dissipation resources according to actual equipment power
• Proportional-integral-derivative (PID) algorithm based fan speed tuning according to loads and component/ambient temperature
• Fans enhanced with Deep Sleep Technology (DST)
• Low Workload Low Watt (LW2) technology
• Automatic switching between active and standby power supplies improves overall energy efficiency
• High-voltage DC (HVDC) power supply

Dynamic Energy Management Technology (DEMT) slashes overall power consumption by an average of 13.4% without compromising service performance.
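
The slide names PID-based fan speed tuning but gives no detail. As a rough illustration only, the following Python sketch shows a textbook discrete PID loop that maps a component temperature to a fan duty cycle. The class name `FanPID`, the gains, the 70 ℃ target, and the duty-cycle limits are all assumptions made for this sketch; none of them come from Huawei's DEMT implementation.

```python
from dataclasses import dataclass


@dataclass
class FanPID:
    """Textbook discrete PID controller (illustrative, not Huawei's DEMT)."""

    kp: float = 4.0          # proportional gain (assumed)
    ki: float = 0.5          # integral gain (assumed)
    kd: float = 1.0          # derivative gain (assumed)
    target_c: float = 70.0   # hypothetical target temperature in deg C
    min_duty: float = 20.0   # hypothetical minimum fan duty cycle in %
    max_duty: float = 100.0  # maximum fan duty cycle in %
    _integral: float = 0.0
    _prev_error: float = 0.0

    def update(self, temp_c: float, dt: float = 1.0) -> float:
        """Return the fan duty cycle (%) for one temperature sample."""
        error = temp_c - self.target_c
        self._integral += error * dt
        derivative = (error - self._prev_error) / dt
        self._prev_error = error

        raw = (self.min_duty
               + self.kp * error
               + self.ki * self._integral
               + self.kd * derivative)
        duty = max(self.min_duty, min(self.max_duty, raw))

        # Simple anti-windup: undo this step's integration while saturated.
        if duty != raw:
            self._integral -= error * dt
        return duty


if __name__ == "__main__":
    controller = FanPID()
    for reading in (68.0, 72.0, 80.0, 75.0, 70.0):
        print(f"{reading:5.1f} C -> {controller.update(reading):5.1f} % duty")
```

With these illustrative gains, a fresh controller that reads 80 ℃ against the 70 ℃ target requests a 75% duty cycle, while a fresh controller that reads a temperature at or below the target stays at the 20% floor.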


FDM: patented high-reliability technology
Unique Fault Diagnosis & Management (FDM) technology enables comprehensive out-of-band fault information collection and analysis, with diagnosis precision of up to 93%.

Comprehensive hardware component diagnosis: CPU, memory, PCIe, voltage, RAID, fan modules, storage device, power supply, temperature

Key technical features:
• Out-of-band fault diagnosis system
• Fault diagnosis expert library
• 93% CATERR-related fault locating accuracy
• Fault data summary library

Diagnosis flow (diagram labels): fault generation; system runs normally; system crashes; in-band collection; out-of-band collection; information receiving; raw data; parsing; parsed data; diagnosis expert library; pre-warning expert library; output

eSight Server: full-lifecycle data center management software
Automates and smartens up entire-lifecycle server management and maximizes O&M efficiency

Lifecycle stages: Planning, Delivery, O&M, Out-of-Service

• Out-of-band OS deployment shortens service rollout time by 50%: template-based configuration management and batch deployment via out-of-band OS
• Precise out-of-band fault locating, with up to 93% diagnosis effectiveness: comprehensive out-of-band collection and analysis of fault information
• Automated firmware update subscription streamlines the upgrade process: automated subscription, download, and detection of firmware and driver version updates
• Stateless computing shortens configuration recovery time to less than 3 minutes: automated configuration management with no manual intervention required

Huawei HPC server product family

Categories: Dedicated Nodes (Big Memory, I/O Expansion, Multiple Accelerators); High Density Rack Mount Servers; High Performance Blades

Models: 5885, 8100, 1288/2288, KunLun 32S System, 5288, E9000, X6000, 2488, G5500 Heterogeneous Server

Huawei's sustained, strategic technology investment in high-performance computing

Applications: Weather & Ocean, Manufacturing, EDA, Life science, AI, Oil & Gas
Middleware: resource management; job scheduling; Huawei MPI; Huawei compiler & libraries; HPC application characterization, monitoring and tuning; unified portal for HPC workflows
Infrastructure: processor and new fabric (CCIX, Gen-Z); next-generation NAS storage system; interconnection (IB, RoCE, dedicated low-latency technology); HPC cloud with cloud bursting or hybrid cloud system

1, Dedicated processor and fabric for HPC systems
2, NAS system with burst buffer
3, ASIC for ultra-low-latency networking technology; RoCE for specific HPC markets
4, Unique advantages: cloud bursting or hybrid cloud infrastructure
5, Huawei tool chains
6, Huawei MPI, optimized for CPU and networking devices

Huawei's innovation DNA

03 Open HPC solutions

Joining points into a plane, winning together

An open, win-win HPC industry ecosystem

Applications: Weather & Ocean, Manufacturing, EDA, Life science, AI, Oil & Gas
Middleware: resource management; job scheduling; Huawei MPI; remote visualization; HPC application characterization, monitoring and tuning; unified portal for HPC workflows
Infrastructure: processor and new fabric (CCIX, Gen-Z); next-generation NAS storage system; interconnection (IB, RoCE, dedicated low-latency technology); HPC cloud with cloud bursting or hybrid cloud system

1, Partnering with professional ISVs to deliver professional HPC solutions
2, Partnering with commercial ISVs and the community, with deep involvement in application development from the very beginning

Huawei HPC solutions: Transforming HPC

On-premises/private cloud HPC solution; HPC public cloud; fundamental technical innovation

Huawei, Transforming HPC

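Huawei HPC value proposition: application-oriented, accelerating the convergence of traditional HPC and New HPC

Application-oriented: ultimate performance through application-oriented optimization
• Flexible modular architecture
• Diverse, innovative form factors
• Deeply application-optimized hardware acceleration

Extreme efficiency: higher performance in less space and with lower energy consumption
• End-to-end engineering design capability
• Efficient, reliable liquid cooling technology
• Integrated delivery and installation

Adapting to change: a future-oriented converged HPC architecture
• Rapid adoption of emerging technologies
• Multi-purpose HPC systems
• HPC combined with cloud

Workload labels on the slide: SDS, Big Data, Graph, Cloud, AI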

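L1: Huawei end-to-end HPC solution capability

All-In-Cabinet, small HPC (FusionModule500, FusionModule800)
• Cabinet-level deployment; on-site installation takes only 4 hours
• 1 to 6 IT cabinets, supporting 10 to 100 TFlops HPC systems

All-In-Room, medium and large HPC (FusionModule2000, single-row or dual-row)
• Single-row or dual-row deployment with contained cold/hot aisles, for areas under 500 square meters
• 2 to 48 IT cabinets, supporting 100 TFlops to 1 PFlops HPC systems

All-In-Container, containerized HPC (FusionModule1000A)
• Prefabricated and pre-tested in the factory, delivered on site, shortening the deployment cycle by 80%
• 8 IT cabinets, supporting 10 to 100 TFlops HPC systems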
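
The sizing ranges on this slide can be read as a small lookup table. The Python sketch below is one hedged way to encode them; the pairing of form factors with cabinet counts and TFlops ranges reflects a reading of the flattened slide text, and the names `FormFactor`, `FORM_FACTORS`, and `candidates` are hypothetical, not part of any Huawei tool.

```python
from typing import List, NamedTuple


class FormFactor(NamedTuple):
    name: str
    min_tflops: float
    max_tflops: float
    min_cabinets: int
    max_cabinets: int


# Ranges as read from the slide (1 PFlops = 1000 TFlops).
# The pairing of names and ranges is an assumption of this sketch.
FORM_FACTORS = [
    FormFactor("All-In-Cabinet", 10, 100, 1, 6),
    FormFactor("All-In-Room", 100, 1000, 2, 48),
    FormFactor("All-In-Container", 10, 100, 8, 8),
]


def candidates(target_tflops: float) -> List[str]:
    """Return the form factors whose stated range covers the target."""
    return [
        f.name
        for f in FORM_FACTORS
        if f.min_tflops <= target_tflops <= f.max_tflops
    ]


if __name__ == "__main__":
    for target in (50, 100, 500):
        print(target, "TFlops ->", candidates(target))
```

A 50 TFlops target falls in both the cabinet and container ranges, so the final choice would depend on site constraints that the slide does not quantify.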


L1+L2: Huawei HPC full liquid-cooling solution

• CPU, memory, and VRD are cooled directly by water of up to 45 ℃
• Chiller is optional; cooling PUE < 1.1
• Board-level liquid cooling plus cabinet-level air-to-liquid heat exchange
• No need for row air conditioners and water chillers

Internal serrated micro-channel CPU heat sink: optimized heat-dissipation teeth spacing and flow resistance; optimized serrated design; energy efficiency up by 10%

Inter-DIMM water-flow memory cooling board: multi-channel water flow; shorter heat transfer path cuts thermal resistance (65%); fence-style fixture design; area of contact with air cooling (80%)

Hybrid rack design: impact to ambient 0%
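
The slide quotes a cooling PUE below 1.1 without defining the term. Assuming it means a partial PUE that counts only cooling overhead, that is (IT power + cooling power) / IT power, the short Python sketch below shows what the bound implies. The function names and the 500 kW example load are illustrative assumptions.

```python
def partial_pue(it_kw: float, cooling_kw: float) -> float:
    """Partial PUE counting only cooling: (IT + cooling) / IT."""
    if it_kw <= 0:
        raise ValueError("IT power must be positive")
    return (it_kw + cooling_kw) / it_kw


def max_cooling_kw(it_kw: float, pue_limit: float = 1.1) -> float:
    """Largest cooling power that keeps partial PUE at or under the limit."""
    return it_kw * (pue_limit - 1.0)


if __name__ == "__main__":
    it_load = 500.0  # hypothetical IT load in kW
    print(partial_pue(it_load, 40.0))   # 40 kW of cooling on 500 kW of IT
    print(max_cooling_kw(it_load))      # cooling budget at the 1.1 bound
```

Under this assumed definition, a 500 kW IT load leaves a cooling budget of roughly 50 kW at the 1.1 bound, and 40 kW of cooling gives a partial PUE of 1.08.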


Workload-optimized server design

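The fastest scale-out file storage

OceanStor 9000
• Linear scaling from 3 to 288 nodes
• System bandwidth of up to 400 GB/s
• A single file system of up to 100 PB

OceanStor DFS
• File system: Lustre software reference architecture

OceanStor V3
• Converged NAS and SAN architecture
• RAID 2.0 technology ensures data reliability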
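
The OceanStor 9000 figures above (3 to 288 nodes, linear scaling, up to 400 GB/s) allow a back-of-the-envelope sizing estimate. The Python sketch below assumes that the 400 GB/s peak applies to the full 288-node system and that scaling is perfectly linear; both are simplifications for illustration, and the function names are hypothetical.

```python
import math

MIN_NODES = 3
MAX_NODES = 288
PEAK_BANDWIDTH_GB_PER_S = 400.0  # assumed to apply at MAX_NODES


def estimated_bandwidth(nodes: int) -> float:
    """Estimated aggregate bandwidth (GB/s) under ideal linear scaling."""
    if not MIN_NODES <= nodes <= MAX_NODES:
        raise ValueError(f"node count must be in [{MIN_NODES}, {MAX_NODES}]")
    return PEAK_BANDWIDTH_GB_PER_S * nodes / MAX_NODES


def nodes_for_bandwidth(target_gb_per_s: float) -> int:
    """Smallest node count whose estimated bandwidth meets the target."""
    if not 0 < target_gb_per_s <= PEAK_BANDWIDTH_GB_PER_S:
        raise ValueError("target is outside the stated system range")
    needed = math.ceil(target_gb_per_s * MAX_NODES / PEAK_BANDWIDTH_GB_PER_S)
    return max(MIN_NODES, needed)


if __name__ == "__main__":
    print(estimated_bandwidth(72))     # estimate for a quarter-size system
    print(nodes_for_bandwidth(100.0))  # nodes needed for 100 GB/s
```

Under these assumptions each node contributes about 1.39 GB/s, so a 100 GB/s target maps to 72 nodes. Real deployments would need measured per-node throughput rather than this idealized estimate.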


Products and solutions for New HPC

Enabling HPC Cloud (Open Telekom Cloud)
HPC-class hardware:
• InfiniBand fabric
• Hardware accelerators
• Bare metal compute node
Advancing cloud software stack:
• HPC-class storage
• Container support
• Same stack for private & public cloud

Leverage AI
Business-driven innovation:
• Team with customers to identify new business problems
• Creative use of new technology such as AI
• Partner with industry leaders

Unified HPC and Big Data Platform (Big Data Acceleration)
Emerging hardware technology:
• Large-memory compute node
• New storage-class memory
• FPGA and custom accelerator
Advancing big data software:
• Massive streaming data
• Millisecond latency
• Artificial intelligence

No Compromise! HPC Cloud solution based on hybrid cloud

Optimal cloud acceleration and data pre-processing:
• GPU acceleration: Nvidia P100 GPU acceleration
• FPGA data pre-processing
• High-performance network: 100G IB service network, 2 μs low latency
• Bare metal service: bare metal + SDI, shared storage
• VMs with high specifications: 128 vCPU + 4 TB RAM

Hybrid cloud for HPC and big data: 10+ top European research institutes (chart labels: 30%, 67%)

Cloud types: design emulation cloud, scientific computing cloud, energy exploration cloud

HPC Singularity!

Optimal cloud acceleration and data pre-processing:
• GPU acceleration: Nvidia P100 GPU acceleration
• FPGA data pre-processing
• High-performance network: 100G IB service network, 2 μs low latency
• Bare metal service: bare metal + SDI, shared storage
• VMs with high specifications: 128 vCPU + 4 TB RAM

Cloud types: design emulation cloud, scientific computing cloud, energy exploration cloud

Singularity!

Unified HPC + AI converged solution

Stack components (diagram labels):
• Management Portal / GUI
• AI/DL Application; HPC Application
• DL Framework, Library, Tools; MPI, Math Library, Compiler
• Workload Management Software; Cluster Manager
• Huawei ATLAS Resource Management Software Platform
• Container, VM, Bare Metal
• HPC Compute Resources; Accelerator Resource Pools; HPC Storage Resources

Topologies (diagram labels):
• Topo1: Single RC, for AI training*: CPU1, CPU2; two SW, each with four GPUs; one IO
• Topo2: Balanced, for HPC and cloud*: CPU1, CPU2; two SW, each with four GPUs; two IO
• Topo3: High BW, for HPC*: CPU1, CPU2; four SW; eight GPUs; four IO

* Based on a G5500 configured with a single G560 node: supports 8 x NVIDIA Tesla P100/P40 | 1 or 2 x E5-2600 v5 | 24 DDR4 DIMMs | 8 x 3.5-inch SATA HDDs + 6 x NVMe SSDs + 2 x 2.5-inch SSD/SATA/SAS. Topo1 and Topo2 support one-click topology change in BIOS.

Computing: latest-generation Intel Xeon, latest-generation GPU, RDMA NIC

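Rich HPC deployment cases worldwide

Regions on the map: China, Europe, Asia Pacific, North America, Latin America, Central Asia, Africa, Middle East

Stanford University
University of Toronto
Compute Canada
University of Nebraska
University of Tennessee
Digital Domain
GlobalFoundries Singapore
Singapore science and technology research institute
National University of Singapore
Philippine weather bureau, phase 1
Macao weather bureau
Victoria University
University of Queensland
Kendi University (肯迪大学)
University of Tasmania
CASSAC observatory, Chile
Mackenzie University, Brazil
São Paulo State University, Brazil
UNESP, Brazil
Venezuelan national oil company
Mexican water authority
Mexican Ministry of Agriculture
Turkish Academic Network and Information Center (ULAKBIM)
Yildiz Technical University (YTU), Turkey
Istanbul Technical University (ITU), Turkey
Harran University, Turkey
Yeditepe University, Turkey
Turkish national oil company
Saudi MOI
Zimbabwe Ministry of Higher Education, Science and Technology Development
Bibliotheca Alexandrina, Egypt
Daimler Group, Germany
Volkswagen Group, Germany
BMW, Germany
Max Planck Society, Germany
European atomic research institute, Switzerland
Poznań Supercomputing and Networking Center, Poland
Italian atomic energy research institute
CRS4 interdisciplinary research center, Italy
Illumination Entertainment, France
British Antarctic Survey
Newcastle University, UK
Imperial College London, UK
Loughborough University, UK
University of Lübeck, Germany
University of Munich, Germany
University of Warsaw, Poland
Saint Petersburg University, Russia
Technical University of Denmark (DTU)
EPFL (Swiss Federal Institute of Technology in Lausanne), Switzerland
Uppsala University, Sweden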

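HPC project experience across all industries

Manufacturing and automotive; higher education; supercomputing centers and research institutions; oil and gas, media assets, and meteorology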

Huawei's innovation DNA; open HPC solutions; joining points into a plane, winning together

Cryo-EM 3D reconstruction computing platform

ShanghaiTech University electron microscopy HPC cluster

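KunLun supercomputer helps Tsinghua University explore astrophysics
KunLun large-memory computing enables simulation of the large-scale cosmic reionization process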

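Shandong University public computing platform: the first high-performance cloud computing cluster in China

A public computing platform with 384 TFLOPS of compute capacity, serving the entire university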

Copyright © 2017 Huawei Technologies Co., Ltd. All Rights Reserved. The information in this document may contain predictive statements including, without limitation, statements regarding the future financial and operating results, future product portfolio, new technology, etc. There are a number of factors that could cause actual results and developments to differ materially from those expressed or implied in the predictive statements. Therefore, such information is provided for reference purpose only and constitutes neither an offer nor an acceptance. Huawei may change the information at any time without notice.

Thank You.