27
Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA Kazuhiko Komatsu , S. Momose, Y. Isobe, O. Watanabe, A. Musa, M. Yokokawa, T. Aoyama, M. Sato, H. Kobayashi Tohoku University 14 November, 2018 SC18

Performance Evaluation of a Vector Supercomputer SX …sc18.supercomputing.org/proceedings/tech_paper/tech_paper_files/pap346s5.pdf•A vector supercomputer SX-Aurora TSUBASA •The

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Performance Evaluation of a Vector Supercomputer SX …sc18.supercomputing.org/proceedings/tech_paper/tech_paper_files/pap346s5.pdf•A vector supercomputer SX-Aurora TSUBASA •The

Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA

Kazuhiko Komatsu, S. Momose, Y. Isobe, O. Watanabe,A. Musa, M. Yokokawa, T. Aoyama, M. Sato, H. Kobayashi

Tohoku University14 November, 2018

SC18

Page 2: Performance Evaluation of a Vector Supercomputer SX …sc18.supercomputing.org/proceedings/tech_paper/tech_paper_files/pap346s5.pdf•A vector supercomputer SX-Aurora TSUBASA •The

Outline

• Background• Overview of SX-Aurora TSUBASA• Performance evaluation

• Benchmark performance• Application performance

• Conclusions

14 November, 2018 SC18 2

Page 3: Performance Evaluation of a Vector Supercomputer SX …sc18.supercomputing.org/proceedings/tech_paper/tech_paper_files/pap346s5.pdf•A vector supercomputer SX-Aurora TSUBASA •The

Background

• Supercomputers become important infrastructures• Widely used for scien8fic researches as well as various

industries• Top1 Summit system reaches 143.5 Pflop/s

• Big gap between theore8cal performance and sustained performance◎ Compute-intensive applica8ons stand to benefit from

high peak performance✖Memory-intensive applica8ons are limited by lower

memory performance

14 November, 2018 SC18 3

Memory performance has gained more and more attentions

Page 4: Performance Evaluation of a Vector Supercomputer SX …sc18.supercomputing.org/proceedings/tech_paper/tech_paper_files/pap346s5.pdf•A vector supercomputer SX-Aurora TSUBASA •The

A new vector supercomputerSX-Aurora TSUBASA• Two important concepts of its design

• High usability• High sustained performance

• New memory integration technology• Realize the world’s highest memory bandwidth

• New architecture• Vector host (VH) is attached to vector engines (VEs)

• VE is responsible for executing an entire application• VH is used for processing system calls invoked by the

applications

14 November, 2018 5X86 Linux

VE

Vector processor

VEVH

SC18

Page 5: Performance Evaluation of a Vector Supercomputer SX …sc18.supercomputing.org/proceedings/tech_paper/tech_paper_files/pap346s5.pdf•A vector supercomputer SX-Aurora TSUBASA •The

New execution model

• Conven1onal model • New execution model

6

VH VE

Exe module load Startprocessing

System call(I/O, etc)

Transparent offloadOS function

Finishprocessing

Host GPU

Startprocessing

Kernelexecution

Kernelexecution

Fisnishprocessing SC18

Page 6: Performance Evaluation of a Vector Supercomputer SX …sc18.supercomputing.org/proceedings/tech_paper/tech_paper_files/pap346s5.pdf•A vector supercomputer SX-Aurora TSUBASA •The

Highlights of the execution model

• Two advantages over conventional execution model• Avoid frequent data transfers between VE and VH

• Applications are entirely executed on VE• Only necessary data for system calls need to be transferred

→High sustained performance• No special programming

• Explicit specifications of computation kernels are not necessary

• System calls are transparently offloadedto the VH• Programmers do not need to care system calls

→High usability

14 November, 2018 SC18 7

VH VE

Finishprocessing

OS function

Offload System call(I/O, etc)

Exe module load Startprocessing

Page 7: Performance Evaluation of a Vector Supercomputer SX …sc18.supercomputing.org/proceedings/tech_paper/tech_paper_files/pap346s5.pdf•A vector supercomputer SX-Aurora TSUBASA •The

Specification of SX-Aurora TSUBASA• Memory bandwidth

• 1.22 TB/s world’s highest memory bandwidth• Six HBM2 memory modules integraEon

• 3.0 TB/s LLC bandwidth• LLC is connected to cores via 2D mesh network

• ComputaEonal capability• 2.15 Tflop/[email protected] GHz

• 8 powerful vector cores• 16 nm FINFET process technology• 4.8 billion transistors• 14.96 mm x 33.00 mm

14 November, 2018 SC18Block diagram of a vector processor

Page 8: Performance Evaluation of a Vector Supercomputer SX …sc18.supercomputing.org/proceedings/tech_paper/tech_paper_files/pap346s5.pdf•A vector supercomputer SX-Aurora TSUBASA •The

Outline

• Background• Overview of SX-Aurora TSUBASA• Performance evaluation

• Benchmark performance• Application performance

• Conclusions

14 November, 2018 SC18 9

Page 9: Performance Evaluation of a Vector Supercomputer SX …sc18.supercomputing.org/proceedings/tech_paper/tech_paper_files/pap346s5.pdf•A vector supercomputer SX-Aurora TSUBASA •The

Experimental environments

• SX-Aurora TSUBASA A300-2

• 2x VEs Type 10B

• 1x VH

14 November, 2018 SC18 10

VE Type 10BFrequency 1.4 GHz

Peak FP / core 268.8 Gflop/s

# cores 8

Peak DP Flops / socket 2.15 Tflop/s

Memory BW 1.2 TB/s

Memory capacity 48 GB

Memory config HBM2

VH Intel Xeon Gold 6126Frequency 2.60 GHz / 3.70 GHz (Turbo)

Peak FP / core 83.2 Gflop/s

# cores 12

Peak DP Flops 998.4 Gflop/s

Mem BW 128 GB/s

Mem Capacity 96 GB

Mem config DDR4-2666 DIMM 16GB x 6

X86 Linux

VE

Vector Engines

VEVH

Page 10: Performance Evaluation of a Vector Supercomputer SX …sc18.supercomputing.org/proceedings/tech_paper/tech_paper_files/pap346s5.pdf•A vector supercomputer SX-Aurora TSUBASA •The

Experimental environments cont.

14 November, 2018 SC18 11

Processor SX-AuroraType 10B

Xeon Gold 6126 SX-ACE Tesla V100 Xeon Phi

KNL 7290Frequency 1.4 GHz 2.6 GHz 1.0 GHz 1.245 GHz 1.5 GHz

# of cores 8 12 4 5120 72

DP flop/s(SP flop/s)

2.15 T(4.30 T)

998.4 GF (1996.8 GF) 256 GF 7 TF

(14 TF)3.456 TF

(6.912 TF)

Memory subsystem HBM2 x6 DDR4 x6ch DDR3 x16ch HBM2 x4 MCDRAM

DDR4

Memory BW 1.22 TB/s 128 GB/s 256 GB/s 900 GB/s 450+ GB/s115.2 GB/s

Memory capacity 48 GB 96 GB 64 GB 16 GB 16 GB

96 GB

LLC BW 2.66 TB/s N/A 1.0 TB/s N/A N/A

LLC capacity 16 MB shared

19.25 MB shared 1 MB private 6 MB shared 1 MB shared

by 2 cores

Page 11: Performance Evaluation of a Vector Supercomputer SX …sc18.supercomputing.org/proceedings/tech_paper/tech_paper_files/pap346s5.pdf•A vector supercomputer SX-Aurora TSUBASA •The

Applications used for evaluation

• SGEMM/DGEMM• Matrix-matrix mul;plica;ons to evaluate the Peak flop/s

• Stream benchmark• Simple kernels (copy, scale, add, triad) to measure sustained

memory performance

• Himeno benchmark• Jacobi kernels with a 19-point stencil as a memory-intensive

kernels

• Applica;on kernels• Kernels of prac;cal applica;ons of Tohoku univ in Earthquake,

CFD, Electromagne;c

• Microbenchmark to evaluate the execu;on model• Mixture with vector-friendly Jacobi kernels and I/O kernels

14 November, 2018 SC18 12

Page 12: Performance Evaluation of a Vector Supercomputer SX …sc18.supercomputing.org/proceedings/tech_paper/tech_paper_files/pap346s5.pdf•A vector supercomputer SX-Aurora TSUBASA •The

Overview of application kernels

14 November, 2018 13

Kernels Fields Methods Memory access

Meshsize

CodeB/F

ActualB/F

Land mine Electromagnetic FDTD Sequential 100x750x750 6.22 5.15

Earthquake Seismology

Friction Law Sequential 2047x2047x256 4.00 4.00

Turbulent Flow CFD Navier-Stokes Sequential 512x16384x512 1.91 0.35

Antenna Electromagnetic FDTD Sequential 252755x9x97336 1.73 0.98

Plasma Physics Lax-Wendroff Indirect 20,048,000 1.12 0.075

Turbine CFD LU-SGS Indirect 480x80x80x10 0.96 0.0084

SC18

Page 13: Performance Evaluation of a Vector Supercomputer SX …sc18.supercomputing.org/proceedings/tech_paper/tech_paper_files/pap346s5.pdf•A vector supercomputer SX-Aurora TSUBASA •The

Outline

• Background• Overview of SX-Aurora TSUBASA• Performance evaluation

• Benchmark performance• Application performance

• Conclusions

14 November, 2018 SC18 14

Page 14: Performance Evaluation of a Vector Supercomputer SX …sc18.supercomputing.org/proceedings/tech_paper/tech_paper_files/pap346s5.pdf•A vector supercomputer SX-Aurora TSUBASA •The

SGEMM/DGEMM Performance

14 November, 2018 SC18 15

• High scalability up to 8 threads• High vectorization ratio 99.36%, good vector length 253.8

• High efficiency• Efficiency 97.8~99.2%

0102030405060708090100

0500

10001500200025003000350040004500

1 2 3 4 5 6 7 8

Effic

ienc

y (%

)

GEM

M p

erfo

rman

ce (G

flop/

s)

Number of threads

DGEMM Gflop/s SGEMM Gflop/sDGEMM efficiency SGEMM Efficiency

Page 15: Performance Evaluation of a Vector Supercomputer SX …sc18.supercomputing.org/proceedings/tech_paper/tech_paper_files/pap346s5.pdf•A vector supercomputer SX-Aurora TSUBASA •The

Memory performance(Stream Triad)

• High sustained memory bandwidth of SX-Aurora TSUBASA• Efficiency: Aurora 79%, ACE 83%, Skylake 66%, V100 81%

• Scalability• Saturated even when the number of threads is less than half

14 November, 2018 SC18 16

0

200

400

600

800

1000

1200

SX-AuroraTSUBASA

SX-ACE Skylake Tesla V100 KNL

Stre

am b

andw

idth

(GB/

s)

x 4.72 x 11.7 x 1.37 x 2.23

0

200

400

600

800

1000

1200

1400

0 2 4 6 8 10 12

Stre

am b

andw

idth

(GB/

s)

Number of threads

SX-Aurora… SX-ACE Skylake

Page 16: Performance Evaluation of a Vector Supercomputer SX …sc18.supercomputing.org/proceedings/tech_paper/tech_paper_files/pap346s5.pdf•A vector supercomputer SX-Aurora TSUBASA •The

Himeno (Jacobi) performance

• Higher performance… except GPU• Vector reduction becomes bottleneck due to copy among vector pipes

• Nice thread scalability• 6.9x speedup in 8 threads => 86% parallel efficiency

14 November, 2018 SC18 17

0

50

100

150

200

250

300

350

SX-AuroraTSUBASA

SX-ACE Skylake Tesla V100 KNL

Him

eno

perf

orm

ance

(Gflo

p/s) x 3.4 x 7.96 x 0.93 x 2.1

0

40

80

120

160

200

240

280

320

0 2 4 6 8 10 12

Him

eno

perf

orm

ance

(Gflo

p/s)

Number of threads

SX-Aurora TSUBASA SX-ACE Skylake

x 6.9

Page 17: Performance Evaluation of a Vector Supercomputer SX …sc18.supercomputing.org/proceedings/tech_paper/tech_paper_files/pap346s5.pdf•A vector supercomputer SX-Aurora TSUBASA •The

Application kernel performance

• SX-Aurora TSUBASA could achieve high performance• Plasma, Turbine => Indirect access, memory latency-bound• Antenna => computation-bound to memory BW-bound• Land mine, Earthqauke, Turbulent flow =>memory or LLC BW-

bound

14 November, 2018 SC18 18

0

10

20

30

40

50

60

70

80

Land mine Earthquake Turbulentflow

Antenna Plasma Turbine

Exec

utio

n tim

e (s

ec)

SX-Aurora TSUBASA SX-ACE Skylake

1

4

16

64

256

1024

4096

0.1250.25 0.5 1 2 4 8 16 32 64 128

Atta

inab

le p

erfo

rman

ce (G

flop/

s)

Arithmetic intensity (Flops/Byte)

SX-Aurora TSUBASA SX-ACE

Land mine

Earthquake Antenna

Antenna

Turbulent flow

Plasma

PlasmaTurbine

x 4.3 x 2.9x 3.4

x 9.9

x 2.6 x 3.4

Page 18: Performance Evaluation of a Vector Supercomputer SX …sc18.supercomputing.org/proceedings/tech_paper/tech_paper_files/pap346s5.pdf•A vector supercomputer SX-Aurora TSUBASA •The

Memory bound? or LLC bound?

• Further analysis using 4 types of Bytes/Flop ratio• Memory B/F = (memory BW) / (peak performance)• LLC B/F = (LLC BW) / (peak performance)• Code B/F = (necessary data in Byte) / (# FP operations)• Actual B/F = (# block memory access) * (block size) / (# FP operations)

• Code B/F > Actual B/F * LLC BW / Memory BW => LLC bound• Code B/F < Actual B/F * LLC BW / Memory BW => memory bound

14 November, 2018 SC18 19

B/F ratio Actual < Memory Memory > Actual

Code < LLC Computation-bound Memory BW-bound

Code > LLC LLC BW-bound Memory or LLC bound *

Page 19: Performance Evaluation of a Vector Supercomputer SX …sc18.supercomputing.org/proceedings/tech_paper/tech_paper_files/pap346s5.pdf•A vector supercomputer SX-Aurora TSUBASA •The

Application kernel performance

• Memory B/F = 1.22 TB/s / 2.15 TF= 0.57

• LLC B/F = 2.66 TB/s / 2.15 TF= 1.24

• Land mine (Code 6.22, Actual 5.79) => LLC bound• Earthqauke (Code 6.00, Actual 2.00) => LLC bound• Turbulent flow (Code 1.91, Actual 0.35) => memory BW bound• Antenna (Code 1.73, Actual 0.98) => memory BW bound

14 November, 2018 SC18 20

0

10

20

30

40

50

60

70

80

Land mine Earthquake Turbulentflow

Antenna Plasma Turbine

Exec

utio

n tim

e (s

ec)

SX-Aurora TSUBASA SX-ACE Skylake

x 4.3 x 2.9x 3.4

x 9.9

x 2.6 x 3.4

B/F ratio Actual < Memory Actual > Memory

Code < LLC Computation-bound Memory BW-bound

Code > LLC LLC BW-bound Memory or LLC bound

Page 20: Performance Evaluation of a Vector Supercomputer SX …sc18.supercomputing.org/proceedings/tech_paper/tech_paper_files/pap346s5.pdf•A vector supercomputer SX-Aurora TSUBASA •The

Evaluation of the execution model• (Transparent/Explicit)

Offload from VE to VH• Offload from VH to VE

21

VH VE

SAP VVE

offload

0

5

10

15

20

25

30

35

40

VE execution VH offload VE offload

Exec

utio

n tim

e (s

ec)

Data transfer Jacobi 1 I/O Jacobi 2

VH VE

VAPS VH

offload

VAP

S

VAP

VAP

S

VAP

V

V

SAP

Transparent VE to VH explicit VE to VH VH to VE

SAP

Page 21: Performance Evaluation of a Vector Supercomputer SX …sc18.supercomputing.org/proceedings/tech_paper/tech_paper_files/pap346s5.pdf•A vector supercomputer SX-Aurora TSUBASA •The

Multi-VE performance on A300-8

• Stream VE-level scalability• Almost ideal scalability up to 8 VEs

• Himeno VE-level scalability• Good scalability up to 4VEs• Lack of vector lengths when more than 5VEs

• Problem size is too small

14 November, 2018 SC18 22

0100020003000400050006000700080009000

1 2 3 4 5 6 7 8

Stre

am b

andw

idth

(GB/

s)

Number of VEs

Aurora Ideal

0200400600800

10001200140016001800200022002400

1 2 3 4 5 6 7 8

Him

eno

perf

orm

ance

(Gflo

p/s)

Number of VEs

Aurora Ideal

Page 22: Performance Evaluation of a Vector Supercomputer SX …sc18.supercomputing.org/proceedings/tech_paper/tech_paper_files/pap346s5.pdf•A vector supercomputer SX-Aurora TSUBASA •The

Outline

• Background• Overview of SX-Aurora TSUBASA• Performance evaluation

• Benchmark performance• Application performance

• Conclusions

14 November, 2018 SC18 23

Page 23: Performance Evaluation of a Vector Supercomputer SX …sc18.supercomputing.org/proceedings/tech_paper/tech_paper_files/pap346s5.pdf•A vector supercomputer SX-Aurora TSUBASA •The

Application evaluation

• Applications• Tsunami simulation code

• Mainly consists on ground fluctuation and Tsunami inundation prediction

• Used for a real-time tsunami inundation system• A costal region of Japan(1244x826km) with an 810m resolution

• Direct numerical simulation code of turbulent flows• Incompressible Navier-Stokes equations and a continuum

equation are solved by 3D FFT• MPI parallelization to 4VEs (1~32 processes)• 1283, 2563, 5123 grid points

14 November, 2018 SC18 24

Page 24: Performance Evaluation of a Vector Supercomputer SX …sc18.supercomputing.org/proceedings/tech_paper/tech_paper_files/pap346s5.pdf•A vector supercomputer SX-Aurora TSUBASA •The

Performance of the Tsunami code

• Core performance is x 11.4, x 4.1, x2.4, to KNL, Skylake, ACE.• Socket performance is x 2.5, x 1.9, x 3.5

14 November, 2018 SC18 25

Page 25: Performance Evaluation of a Vector Supercomputer SX …sc18.supercomputing.org/proceedings/tech_paper/tech_paper_files/pap346s5.pdf•A vector supercomputer SX-Aurora TSUBASA •The

Performance of the DNS code

• About 8.14, 6.73, 5.48 speedup by 32cores to 1core in 1283, 2563, 5123 grids• LLC hit ratios of 2563 and 5123 are lower than that of 1283

• 10% is bank conflicts

14 November, 2018 SC18 26

Page 26: Performance Evaluation of a Vector Supercomputer SX …sc18.supercomputing.org/proceedings/tech_paper/tech_paper_files/pap346s5.pdf•A vector supercomputer SX-Aurora TSUBASA •The

Conclusions

• A vector supercomputer SX-Aurora TSUBASA• The highest bandwidth 1.22 TB/s by six HBM2 integration• New execution model to achieve high usability and high

sustained performance• Performance evaluation

• Benchmark programs→ High sustained memory performance→ effectiveness of a new execution model• Application programs→ High sustained performance

• Future work• Further optimizations for SX-Aurora TSUBASA

14 November, 2018 SC18 27

Page 27: Performance Evaluation of a Vector Supercomputer SX …sc18.supercomputing.org/proceedings/tech_paper/tech_paper_files/pap346s5.pdf•A vector supercomputer SX-Aurora TSUBASA •The

Acknowledgements

• People• Hiroyuki Takizawa• Ryusuke Egawa• Souya Fujimoto• Yasuhisa Masaoka

• Projects• The closed beta program for

early access to Aurora• High performance computing

division (jointly-organized with NEC), Cyberscience Center, Tohoku university

14 November, 2018 SC18 28