Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
Supercomputer Fugaku
Toshiyuki Shimizu
FUJITSU LIMITED
Feb. 18th, 2020
Copyright 2020 FUJITSU LIMITED
Outline
◼ Fugaku project overview
◼ Co-design
◼ Approach
◼ Design results
◼ Performance & energy consumption evaluation
◼ Green500
◼ OSS apps
◼ Fugaku priority issues
◼ Summary
Copyright 2020 FUJITSU LIMITED1
Copyright 2020 FUJITSU LIMITED
Supercomputer “Fugaku”, formerly known as Post-K
ApproachFocus
Application performance
Power efficiency
Usability
Co-design w/ application developers and Fujitsu-designed CPU core w/ high memory bandwidth utilizing HBM2
Leading-edge Si-technology, Fujitsu's proven low power & high performance logic design, and power-controlling knobs
Arm®v8-A ISA with Scalable Vector Extension (“SVE”), and Arm standard Linux
2
Fugaku project schedule2011 2012 2013 2014 2015 2016 2017 2018 2020 20212019
Basicdesign
Detailed design &Implementation
Manufacturing,Installationand Tuning
Generaloperation
2022
Copyright 2020 FUJITSU LIMITED
Co-Design w/ apps groupsArchitecture &
sizingSelect apps
Feasibility studyApps review
Fugaku development & delivery
3
Fugaku co-design
◼ Co-design goals◼ Obtain the best performance, 100x apps performance than K computer, within power budget, 30-40MW
• Design applications, compilers, libraries, and hardware
◼ Approach◼ Estimate perf & power using apps info, performance counts of Fujitsu FX100, and cycle base simulator
• Computation time: brief & precise estimation
• Communication time: bandwidth and latency for communication w/ some attributes for communication patterns
• I/O time:
◼ Then, optimize apps/compilers etc. and resolve bottlenecks
◼ Estimation of performance and power◼ Precise performance estimation for primary kernels
• Make & run Fugaku objects on the Fugaku cycle base simulator
◼ Brief performance estimation for other sections• Replace performance counts of FX100 w/ Fugaku params: # of inst. commit/cycle, wait cycles of barrier, inst. fetch, branch, fp exec, data
load/store wait cycles of L1D/L2, etc.
◼ Power estimation• DGEMM execution toggles on the emulator + estimation of memory and interconnect considering utilization + loss of convertors
Copyright 2020 FUJITSU LIMITED4
Co-design iterations around year 2015
◼ Apply each of application kernels
◼ Define/refine a set of architecture parameters
◼ Implement/tune the kernel under the architecture parameters
◼ Evaluate execution time using the estimation tools
◼ Identify hardware bottlenecks and explore design space
◼ Examples of architecture parameters
◼ Frequency, # of arch regs, SIMD width, cache structure & size, # of cores…
◼ Memory, interconnect parameters
◼ Implementation of instructions: i.e. Combined gather…
Copyright 2020 FUJITSU LIMITED5
A64FX CPU
◼ Arm SVE, high performance and high efficiency
◼ DP performance 2.7+ TFLOPS, >90%@DGEMM, (CPU freq=1.8/2.0/2.2 GHz)
◼ Memory BW 1024 GB/s, >80%@STREAM Triad
12x compute cores1x assistant core
A64FX
ISA (Base, extension) Armv8.2-A, SVE
Peak DP performance 2.7+ TFLOPS
SIMD width 512-bit
# of cores 48 + 4
Memory capacity 32 GiB (HBM2 x4)
Memory peak bandwidth 1024 GB/s
PCIe Gen3 16 lanes
High speed interconnect TofuD integrated
PCleController
TofuInterface
C
C
C
C
NOC
HB
M2
HB
M2
HB
M2
HB
M2
CMG CMG
CMG CMG
CMG:Core Memory Group NOC:Network on Chip
Copyright 2020 FUJITSU LIMITED
FUGAKU operating cond.
6
◼ TSMC 7nm FinFET & CoWoS
◼ Broadcom SerDes, HBM I/O, and SRAMs
◼ 8.786 billion transistors
◼ 594 signal pins
A64FX leading-edge Si-technology
Copyright 2020 FUJITSU LIMITED
Core Core Core Core Core Core Core Core Core Core
Core Core Core Core Core
Core Core Core Core
Core Core Core Core Core Core Core
Core Core Core Core Core Core Core Core Core Core Core Core
Core Core Core Core Core Core Core Core Core Core
Core Core Core Core
L2Cache
L2Cache
L2Cache
L2Cache
HB
M2
Interfa
ceH
BM
2 Interfa
ce
HB
M2
inte
rfa
ceH
BM
2 In
terf
ace
PCIe InterfaceTofuD Interface
RIN
G-B
us
7
Fujitsu-designed CPU core w/ High Memory Bandwidth
◼ A64FX out-of-order controls in cores, caches, and memories achieve superior throughput
HBM2 8GiBHBM2 8GiBHBM2 8GiBHBM2 8GiB
CMG
L1D 64KiB, 4way
512-bit wide SIMD 2x FMAs
CoreCore
230+GB/s
115+GB/s
12x Compute Cores + 1x Assistant Core
L2 Cache 8MiB, 16way
256GB/s
115+GB/s
57+GB/s
Core Core
x4
BW and calc. perf. A64FX B/F
DP floating perf. (TFlops) 2.7+ -
L1 data cache (TB/s) 11+ 4
L2 cache (TB/s) 3.6+ 1.3
Memory BW (GB/s) 1024 0.37
Copyright 2020 FUJITSU LIMITED8
A64FX optimized load efficiency for apps performance
◼ 128 bytes/cycle sustained bandwidth even for unaligned SIMD load
◼ “Combined Gather” doubles gather (indirect) load’s data throughput, when target elements are within a“128-byte aligned block” for a pair of two regs, even & odd
L1D cacheRead port0
Read port1
Read data064B/cycle
Read data164B/cycle
Mem.
128B
0 1 2 3 4 5 6 7
0 1 3 2 6 7 45
8B
Regs
flow-1
flow-2
flow-4 flow-3
Maximizes BW to 32 bytes/cyc.Copyright 2020 FUJITSU LIMITED
Suggested through Co-design work w/ app teams
9
A64FX Power Knobs to reduce power consumption
◼ “Power knob” limits units’ activities via user APIs
◼ Performance/Watt can be optimized by utilizing Power knobs
Copyright 2020 FUJITSU LIMITED
L1 inst.cache
BranchPredictor
Decode& Issue
RSE0
RSA
RSE1
RSBR
PGPREXA
EXB
EAGAEXC
EAGBEXD
PFPR
FetchPort
Store Port
L1 datacache
HBM2 Controller
Fetch Issue Dispatch Reg-read Execute Cache and Memory
CSE
Commit
PC
ControlRegisters
L2 cache
HBM2
Write Buffer
Tofu controller
FLA
PPR
FLB
PRX
FLA pipeline only
HBM2 B/W adjust(units of 10%)
Frequency reduction
Decode width: 4 to 2
EXA pipeline only
PCI Controller 52 cores
10
Fugaku system software developed with RIKEN
Fugaku system hardware
Multi-Kernel System: RHEL8 and light-weight kernel (IHK/McKernel)
Fujitsu Technical Computing Suite / RIKEN-developed system software
Fugaku applications
Management software Programming environmentFile system
System managementfor high availability &
power saving operation
Job management for higher system
utilization & power efficiency
LLIONVM-based file I/O
Accelerator for Fugaku
OpenMP, COARRAY, Math. libs.
Open MPI, MPICH(PiP), DTF
XcalableMP, FDPSFEFSLustre-based
distributed file system
Compilers, AI frameworks
Debug & tuning tools, Containers
Fugaku developed w/ RIKEN
Copyright 2020 FUJITSU LIMITED
TensorflowChainerPytorch
DockerKVM
Singularity
FDPS: Framework for Developing Particle SimulatorPiP: Process-in-Process DTF: Data Transfer FrameworkIHK: Interface for Heterogeneous Kernels
11
Outline
◼ Fugaku project overview
◼ Co-design
◼ Approach
◼ Design results
◼ Performance & energy consumption evaluation
◼ Green500
◼ OSS apps
◼ Fugaku priority issues
◼ Summary
Copyright 2020 FUJITSU LIMITED12
Green500, Nov. 2019
◼ A64FX prototype –
◼ Fujitsu A64FX 48C 2GHz
ranked #1 on the list
◼ 768x general purpose A64FX CPU w/o accelerators
• 1.9995 PFLOPS @ HPL, 84.75%
• 16.876 GF/W
• Power quality level 2
Copyright 2020 FUJITSU LIMITED
https://www.top500.org/green500/lists/2019/11/
13
How we achieved
◼ Key for GF/W is {energy efficient HW} x {parallel/exec efficiency}
◼ A64FX is designed for energy efficient
◼ Fujitsu’s proven CPU microarchitecture & 7nm FinFET
◼ SoC design: Tofu interconnect D integrated
◼ CoWoS: 4x HBM2 for main memory integrated
◼ Superior parallel/exec efficiency
◼ Math. libraries are tuned for application efficiency
◼ Comm. libs are also tuned utilizing long experience of Tofu @ K computer
◼ Performance tuning is efficiently done utilizing rich performance analyzer/monitor
Copyright 2020 FUJITSU LIMITED
+ Concerted efforts of co-design
14
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
0 5,000,000 10,000,000 15,000,000 20,000,000
SC19 TOP500 calculation efficiency
◼ A64FX superior efficiency 84.75%with small Nmax
◼ Results of:
◼ Optimized communication and math. libs
◼ Optimization ofoverlapped communication
Copyright 2020 FUJITSU LIMITED
A64FX prototype
Nmax
ChallengeEf
fici
ency
w/ Accw/o Acc
15
A64FX CPU performance evaluation for real apps
◼ Open source software, Realapps on an A64FX @ 2.2GHz
◼ Up to 1.8x faster over the latest x86 processor (24core, 2.9GHz) x2
◼ High memory BW and long SIMD length of A64FX work effectively with these applications
Copyright 2020 FUJITSU LIMITED
OpenFOAM
FrontlSTR
ABINIT
SALMON
SPECFEM3D
WRF
MPAS
FUJITSU A64FX x86 (24core, 2.9GHz) x2
0.5 1.0 1.5
Relative speed up ratio
16
A64FX CPU power efficiency for real apps
◼ Performance / Energy consumption on an A64FX @ 2.2GHz
◼ Up to 3.7x more efficient over the latest x86 processor (24core, 2.9GHz) x2
◼ High efficiency is achieved by energy-conscious design and implementation
Copyright 2020 FUJITSU LIMITED
OpenFOAM
FrontlSTR
ABINIT
SALMON
SPECFEM3D
WRF
MPAS
1.0 2.0 3.0 4.0
Relative power efficiency ratio
FUJITSU A64FX x86 (24core, 2.9GHz) x2
17
Fugaku priority issues and performance prediction
◼ 100x app performance and power budget requirements are met
◼ Some apps utilize power knob to reduce power consumption and achieve high energy efficiency
◼Measuring and optimizing using real HW
Copyright 2020 FUJITSU LIMITED
https://postk-web.r-ccs.riken.jp/perf.html
18
Fugaku project schedule2011 2012 2013 2014 2015 2016 2017 2018 2020 20212019
Basicdesign
Detailed design &Implementation
Manufacturing,Installationand Tuning
Generaloperation
2022
Copyright 2020 FUJITSU LIMITED
Co-Design w/ apps groupsArchitecture &
sizingSelect apps
Feasibility studyApps review
Fugaku development & delivery
Previous estimation w/o HW w/ HW
Apps tuning
19
Current status normalized by the estimation w/o HW
◼ Performance and energy (=power*elapse) of apps
Copyright 2020 FUJITSU LIMITED
A
Esti
mat
ion
w/
HW
/ w
/o H
W
Fast
er t
han
pre
v. e
st.
Low
er e
ner
gy
than
pre
v. e
st.
B C D E FDGEMM Stream
Perf
orm
ance
Ener
gy
Note: evaluation underway
20
Current status normalized by the estimation w/o HW
◼ Performance and energy (=power*elapse) of apps
Copyright 2020 FUJITSU LIMITED
A
Esti
mat
ion
w/
HW
/ w
/o H
W
B C D E FDGEMM Stream
• Performances are match due to precise simulation
• Lower power 8% in logic & 15% in memory use; acceptable variations
Computations were reduced after the previous estimation
Good estimation and optimization: errors are less than 20%
Perf
orm
ance
Ener
gy
Note: evaluation underway
21
Summary
Copyright 2020 FUJITSU LIMITED
https://www.riken.jp/pr/news/2019/20190827_1/
◼ “Fugaku” is co-designed and runs apps at high performance w/ optimal power consumption
◼ Arm HPC ecosystem and apps portfolio are growing by efforts of communities and selections of HW
Fujitsu PRIMEHPC FX1000
◼ Fujitsu Supercomputer PRIMEHPC FX1000 & FX700 based on Fugaku tech.
◼ Cray CS500 using Fujitsu A64FX Arm CPU
FX700
22
Copyright 2020 FUJITSU LIMITED23