RZEGORZ WASNIEWSKI ARKO ABIĆ, MACIEJ BESTA, JOOST …sc19.supercomputing.org/proceedings/tech_paper/tech... · 2020. 1. 27. · spcl.inf.ethz.ch @spcl_eth MATRICES Molecular simulations

spcl.inf.ethz.ch

@spcl_eth

GRZEGORZ KWASNIEWSKI, MARKO KABIĆ, MACIEJ BESTA, JOOST VANDEVONDELE, RAFFAELE SOLCÀ, TORSTEN HOEFLER

Red-Blue Pebbling Revisited: Near Optimal Parallel Matrix-Matrix Multiplication

PARALLEL I/O LOWER BOUND

PERFORMANCE VS OTHER LIBRARIES


MATRICES

Molecular simulations: 64 H2O molecules [1]

m = n = 17,408, k = 3,735,552

[Figure: matrices A, B, and C]

[1] Del Ben, Mauro, et al. "Enabling simulation at the fifth rung of DFT: Large scale RPA calculations with excellent time to solution." Computer Physics Communications 187 (2015): 120-129.


MATRICES

Dissipative Quantum Transport Simulations [2]

$O(N_{kz} N_{E} N_{qz} N_{Eph} N_{A} N_{B} N_{3D})$: more than $4 \cdot 10^{11}$ MMMs of size 12x12

[Figure: blocks G1 to G8 regrouped into sliding windows (Gk-2 ... Gk+2, Gk-1 ... Gk+3, Gk ... Gk+4) that are multiplied by D1, D2, D3, D4 to give Σk, Σk+1, Σk+2]

[2] Ziogas, Alexandros Nikolaos, et al. "A Data-Centric Approach to Extreme-Scale Ab initio Dissipative Quantum Transport Simulations." (SC19), Nov. 2019.


m = n = 12, k = 840

47-76% of total runtime

[Figure: matrices A, B, and C]


DISTRIBUTED SYSTEMS

- 4,608 nodes, 27,648 V100 GPUs, 2,414,592 cores
- 40,960 nodes, 10,649,600 cores

MATRICES

- Molecular simulations: m = n = 17,408, k = 3,735,552
- Dissipative Quantum Transport Simulations: m = n = 12, k = 840

[Figure: matrices A, B, and C for both workloads]


[Figure: timeline of the worst-case I/O cost of parallel MMM algorithms (1D, 2D, and 3D decompositions) relative to the lower bound]

- before 1969: naive
- 1969: Cannon's [1]
- 1981: red-blue pebble game [2]
- 1994: PUMMA [3]
- 1995: 3D [5]
- 1997: SUMMA [4]
- 2004: communication lower bounds [6]
- 2011: "2.5D" [7]
- 2013: CARMA [8]
- 2017: [9]
- 2019: COSMA, optimal decomposition in all scenarios

[1] Lynn Elliot Cannon, 1969. A Cellular Computer to Implement the Kalman Filter Algorithm. Ph.D. Dissertation.
[2] Hong Jia-Wei and Hsiang-Tsung Kung, 1981. I/O complexity: The red-blue pebble game. In STOC.
[3] Jaeyoung Choi et al., 1994. PUMMA: Parallel universal matrix multiplication algorithms on distributed memory concurrent computers.
[4] Robert A. Van De Geijn and Jerrell Watts, 1997. SUMMA: Scalable universal matrix multiplication algorithm.
[5] Ramesh Agarwal et al., 1995. A three-dimensional approach to parallel matrix multiplication.
[6] Dror Irony et al., 2004. Communication Lower Bounds for Distributed-memory Matrix Multiplication.
[7] Edgar Solomonik and James Demmel, 2011. Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms.
[8] J. Demmel et al., 2013. Communication-Optimal Parallel Recursive Rectangular Matrix Multiplication.
[9] Tyler Michael Smith and Robert A. van de Geijn, 2017. Pushing the Bounds for Matrix-Matrix Multiplication.


1D

[Figure: 1D decomposition over 8 processors: p0, p1, p2, p3, p4, p5, p6, p7]


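2D

[Figure: 2D decomposition over a 3 x 3 processor grid (p0 p1 p2 / p3 p4 p5 / p6 p7 p8); row groups p0,p1,p2 / p3,p4,p5 / p6,p7,p8 and column groups p0,p3,p6 / p1,p4,p7 / p2,p5,p8 share input blocks]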


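3D

[Figure: 3D decomposition over 8 processors (p0 p1 / p2 p3 / p4 p5 / p6 p7); processor groups p0,p1 / p2,p3 / p4,p5 / p6,p7 as well as p0,p2 / p1,p3 and p4,p6 / p5,p7 share matrix blocks]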


TOP-DOWN

[Figure: the 3D decomposition over p0 to p7 (p0 p1 / p2 p3 / p4 p5 / p6 p7), with p0 annotated with a cache of size S]


BOTTOM-UP

TOP-DOWN

Communication between ranks! Suboptimal shape!

No communication between ranks. Optimal shape.

[Figure: two decompositions over p0 to p7: a 2 x 2 x 2 arrangement (p0 p1 / p2 p3 / p4 p5 / p6 p7, with groups p0,p2 / p1,p3 / p4,p6 / p5,p7) and a 4 x 2 arrangement (p0 p1 p2 p3 / p4 p5 p6 p7, with groups p0,p4 / p1,p5 / p2,p6 / p3,p7)]


RED-BLUE PEBBLE GAME [Hong, Kung. 1981]

S = 5, B = ∞
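The rules of the game can be made concrete with a minimal Python sketch. It assumes the standard formulation: at most S red pebbles (fast memory), unlimited blue pebbles (slow memory), and every load or store counted as one I/O operation. The class `PebbleGame`, the helper `play_dot_product`, and the tiny dot-product DAG are hypothetical and are not taken from the slides.

```python
class PebbleGame:
    """Red-blue pebble game on a DAG given as {vertex: [predecessors]}."""

    def __init__(self, preds, inputs, S):
        self.preds = preds          # only non-input vertices have an entry
        self.S = S                  # maximum number of red pebbles
        self.red = set()            # vertices currently in fast memory
        self.blue = set(inputs)     # inputs start in slow memory
        self.io = 0                 # number of loads plus stores

    def _place_red(self, v):
        if v not in self.red and len(self.red) >= self.S:
            raise RuntimeError("no free red pebble")
        self.red.add(v)

    def load(self, v):
        if v not in self.blue:
            raise RuntimeError("load requires a blue pebble")
        self._place_red(v)
        self.io += 1

    def store(self, v):
        if v not in self.red:
            raise RuntimeError("store requires a red pebble")
        self.blue.add(v)
        self.io += 1

    def compute(self, v):
        # All predecessors must hold red pebbles; inputs cannot be computed.
        if not all(u in self.red for u in self.preds[v]):
            raise RuntimeError("predecessors are not in fast memory")
        self._place_red(v)

    def discard(self, v):
        self.red.discard(v)


def play_dot_product(S):
    """Pebble c = a0*b0 + a1*b1 and return the number of I/O operations."""
    preds = {"p0": ["a0", "b0"], "p1": ["a1", "b1"], "c": ["p0", "p1"]}
    game = PebbleGame(preds, inputs=["a0", "b0", "a1", "b1"], S=S)
    for i in (0, 1):
        game.load(f"a{i}")
        game.load(f"b{i}")
        game.compute(f"p{i}")
        game.discard(f"a{i}")
        game.discard(f"b{i}")
    game.compute("c")
    game.store("c")
    return game.io
```

With S = 5 this schedule needs four loads and one store. With S = 3 the same schedule fails, because computing the second product would require a fourth red pebble.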


RED-BLUE PEBBLE GAME: 2S-PARTITION [Hong, Kung. 1981]

DOMINATOR SET, MINIMUM SET

S = 4, size = 2S = 8

I/O LOWER BOUND!


# loads ≥ 4, # stores ≥ 4

$Q \ge S \cdot (H(2S) - 1)$

where $S$ is the minimal I/O per subset and $H(2S)$ is the number of subsets in the 2S-partition.



OUR WORK: X-PARTITION

S = 4, size = 8

I/O lower bound: 8 – 3 = 5

$Q \ge (X - R(S) + T(S)) \cdot (H(X) - 1)$

where $X$ is the dominator set size, $R(S)$ the maximum reuse, $T(S)$ the minimum store size, and $H(X)$ the number of subsets in the X-partition.
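The two bounds can be evaluated with a few lines of Python. This is only a sketch: the function names are hypothetical, and the number of subsets passed in the example (4) is an arbitrary assumption, since the slides give only the per-subset numbers X = 8 and R = 3 (with T taken to be 0 here so that 8 - 3 = 5).

```python
def hong_kung_bound(S, num_subsets):
    """2S-partition bound: Q >= S * (H(2S) - 1)."""
    return S * (num_subsets - 1)


def x_partition_bound(X, R, T, num_subsets):
    """X-partition bound: Q >= (X - R(S) + T(S)) * (H(X) - 1)."""
    return (X - R + T) * (num_subsets - 1)


# Per-subset I/O from the slide example: X = 8, R = 3, T assumed to be 0.
per_subset_io = 8 - 3 + 0

# Hypothetical partition with 4 subsets.
example_bound = x_partition_bound(X=8, R=3, T=0, num_subsets=4)
```

Algebraically, choosing X = 2S, R = S, and T = 0 turns the X-partition expression into the 2S-partition expression, which is one way to see the new bound as a generalization.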


[Figure: the 3D iteration space of MMM with axes m, n, and k; matrix A and matrix B lie on its faces]

Per step of the schedule:

- $S$ computed elements
- $S$ reused elements
- $\sqrt{S}$ loaded elements (matrix A)
- $\sqrt{S}$ loaded elements (matrix B)


Up to p = 6 processors: 2D decomposition is optimal

Each "pillar": $2\sqrt{S}K$ loads, $S$ stores
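The pillar cost can be checked by counting memory traffic in a small pure-Python sketch. It assumes a tile of C with side a (so S = a * a accumulators stay in fast memory) that is updated by K outer products. The function name `compute_pillar` and the example matrices are hypothetical.

```python
def compute_pillar(A, B):
    """Compute one a x a tile of C = A * B with K outer products.

    A is a x K and B is K x a (lists of lists). Returns the tile together
    with the number of elements loaded and stored.
    """
    a, K = len(A), len(B)
    C = [[0] * a for _ in range(a)]         # S = a * a values kept in fast memory
    loads = 0
    for t in range(K):
        col = [A[i][t] for i in range(a)]   # load a elements of A
        row = B[t]                          # load a elements of B
        loads += len(col) + len(row)
        for i in range(a):
            for j in range(a):
                C[i][j] += col[i] * row[j]  # reuse the accumulator in place
    stores = a * a                          # each element of C is written back once
    return C, loads, stores


A = [[1, 2, 3],
     [4, 5, 6]]
B = [[1, 0],
     [0, 1],
     [1, 1]]
tile, loads, stores = compute_pillar(A, B)
```

Here a = 2 and K = 3, so the count is 2 * a * K = 12 loads and a * a = 4 stores.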

[Figure: decomposition of the 3D iteration space for p = 6 and for p = 12; at p = 12 the three edges of a local domain are annotated in terms of S]

[Figure: matrix A, matrix B, and the 3D iteration space for p = 54, with local domain edge $\sqrt[3]{MNK/p}$]


IMPLEMENTATION OPTIMIZATIONS

Communication-computation overlap

Communication buffer optimization

One-Sided and Two-Sided Communication

Processor grid optimization


Processor grid optimization

Example: $p = 65 = 1 \cdot 5 \cdot 13$

Dropping one processor:
- increases computation per processor by 1.5%
- reduces communication by 36%
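The p = 65 example can be reproduced with a simplified model. The sketch below assumes square matrices (m = n = k) and approximates communication per rank by the total area of the three faces of the local domain. Both the model and the function names are assumptions made for illustration, not COSMA's actual cost function.

```python
def grids(p):
    """All (pm, pn, pk) with pm <= pn <= pk and pm * pn * pk == p."""
    result = []
    for pm in range(1, p + 1):
        if p % pm:
            continue
        q = p // pm
        for pn in range(pm, q + 1):
            if q % pn:
                continue
            pk = q // pn
            if pk >= pn:
                result.append((pm, pn, pk))
    return result


def comm_per_rank(grid, m=1.0, n=1.0, k=1.0):
    """Surface of the local domain: one face each for A, B, and C."""
    pm, pn, pk = grid
    dm, dn, dk = m / pm, n / pn, k / pk
    return dm * dn + dm * dk + dn * dk


def best_grid(p):
    return min(grids(p), key=comm_per_rank)


grid_65 = best_grid(65)    # only factorizations of 65 are available
grid_64 = best_grid(64)    # one processor dropped
reduction = 1 - comm_per_rank(grid_64) / comm_per_rank(grid_65)
extra_compute = 65 / 64 - 1
```

Under these assumptions the best grid for 65 processors is 1 x 5 x 13 and the best grid for 64 is 4 x 4 x 4. The communication drops by roughly 36%, while the work per processor grows by about 1.6%, in line with the numbers on the slide.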


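EVALUATION

Comparison targets:

| LIBRARY | DECOMPOSITION ALGORITHM | THEORETICAL COMPLEXITY |
|---|---|---|
| Intel MKL ScaLAPACK | 2D (SUMMA) | $\frac{k}{\sqrt{p}}(m + n) + \frac{mn}{p}$ |
| Cyclops Tensor Framework (CTF) | 2.5D | $\frac{(k(m + n))^{3/2}}{p\sqrt{S}} + \frac{mnS}{k(m + n)}$ |
| CARMA | Recursive | $2\min\left\{\sqrt{3}\,\frac{mnk}{p\sqrt{S}}, \left(\frac{mnk}{p}\right)^{2/3}\right\} + \left(\frac{mnk}{p}\right)^{2/3}$ |
| COSMA | Bottom-up | $\min\left\{\frac{2mnk}{p\sqrt{S}} + S,\ 3\left(\frac{mnk}{p}\right)^{2/3}\right\}$ |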
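To get a feeling for these expressions, they can be evaluated directly. The sketch below is a plain transcription of the four formulas into Python. The function name `io_costs` and the example parameters (m = n = k = 4096, p = 64, S = 2^20 words) are arbitrary assumptions chosen for illustration and are not benchmark settings from the talk.

```python
import math


def io_costs(m, n, k, p, S):
    """Evaluate the per-processor I/O cost formulas of the four libraries."""
    work = m * n * k / p                 # multiplications per processor
    cube = work ** (2.0 / 3.0)           # face area of a cubic local domain
    return {
        "ScaLAPACK": k * (m + n) / math.sqrt(p) + m * n / p,
        "CTF": (k * (m + n)) ** 1.5 / (p * math.sqrt(S))
               + m * n * S / (k * (m + n)),
        "CARMA": 2 * min(math.sqrt(3) * work / math.sqrt(S), cube) + cube,
        "COSMA": min(2 * work / math.sqrt(S) + S, 3 * cube),
    }


costs = io_costs(m=4096, n=4096, k=4096, p=64, S=2 ** 20)
for name, words in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{name:10s} {words:14.0f} words")
```

For this particular square case the COSMA expression is below the 2D and 2.5D expressions. Other shapes and memory sizes shift the ranking of the remaining libraries, which is why the evaluation covers several matrix shapes and memory regimes.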


Piz Daint Supercomputer (6th in TOP500)


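Total comm. volume per rank [MB] (ScaLAPACK, CTF, CARMA, COSMA) and speedup (min, mean, max):

| shape | benchmark | ScaLAPACK | CTF | CARMA | COSMA | min | mean | max |
|---|---|---|---|---|---|---|---|---|
| [icon] | strong scaling | 203 | 222 | 195 | 107 | 1.07 | 1.94 | 4.81 |
| | limited memory | 816 | 986 | 799 | 424 | 1.23 | 1.71 | 2.99 |
| | extra memory | 303 | 350 | 291 | 151 | 1.14 | 2.03 | 4.73 |
| [icon] | strong scaling | 2636 | 2278 | 659 | 545 | 1.24 | 2 | 6.55 |
| | limited memory | 368 | 541 | 128 | 88 | 1.3 | 2.61 | 8.26 |
| | extra memory | 133 | 152 | 48 | 35 | 1.31 | 2.55 | 6.7 |
| [icon] | strong scaling | 3507 | 2024 | 541 | 410 | 1.31 | 2.22 | 3.22 |
| | limited memory | 989 | 672 | 399 | 194 | 1.42 | 1.7 | 2.27 |
| | extra memory | 122 | 77 | 77 | 29 | 1.35 | 1.76 | 2.8 |
| [icon] | strong scaling | 134 | 68 | 10 | 7 | 1.21 | 4.02 | 12.81 |
| | limited memory | 47 | 101 | 26 | 8 | 1.31 | 2.07 | 3.41 |
| | extra memory | 15 | 15 | 10 | 3 | 1.5 | 2.29 | 3.59 |
| overall | | | | | | 1.07 | 2.17 | 12.81 |

Callouts:
- CTF: p = 128, m = n = k = 12123
- ScaLAPACK: p = 17029, m = n = 113072, k = 512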

[Figure: total communication volume for "largeK" matrices (lower is better)]

[Figure: % of achieved peak performance for "largeK" matrices (higher is better)]

EVALUATION: square matrices

[Figure: total communication volume and % of achieved peak performance for square matrices]


Time distribution of COSMA communication and computation kernels

Only 13%? Only 24%? Only 31%? Only 31%?

- Yes, but… 2x less communication than second best. Would result in: 60% increase in runtime.
- Yes, but… 10x less communication than second best. Would result in: 130% increase in runtime.


PORTABILITY AND USABILITY

[Figure: COSMA between its frontend and backend]

FRONTEND
- YOUR CODE
- C/C++ INTERFACE
- FORTRAN INTERFACE
- SCALAPACK LAYOUT
- COSMA LAYOUT
- CUSTOM LAYOUT

BACKEND
- CPU
- GPU (CUDA, ROCm)


GPU BACKEND: TILED-MM

[Figure: matrices A, B, and C, each split into 4 x 4 tiles (A00 to A33, B00 to B33, C00 to C33)]

[Figure: two GPU streams, each alternating copy and comp phases and ending with one copy back]

- stream 1 (C10): copy A10, B00, comp A10 B00; copy A11, B10, comp A11 B10; copy A12, B20, comp A12 B20; copy A13, B30, comp A13 B30; copy back C10
- stream 2 (C01): copy A00, B01, comp A00 B01; copy A01, B11, comp A01 B11; copy A02, B21, comp A02 B21; copy A03, B31, comp A03 B31; copy back C01

- C copied back only once!
- Copy A and B tiles together

[Figure: streams S1 and S2, each running copy A, copy B, compute A * B, with the time difference between them marked]

Openly available as a standalone library: https://github.com/kabicm/Tiled-MM
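The per-tile operation order shown for the two streams can be written down as a small scheduling sketch. It is plain Python that only generates the list of operations and does not touch a GPU. The round-robin assignment of output tiles to streams and the function names are assumptions for illustration and do not describe the Tiled-MM API.

```python
def tile_ops(i, j, T):
    """Operations needed for output tile C[i][j] on a T x T tile grid."""
    ops = []
    for k in range(T):
        ops.append(("copy", f"A{i}{k}"))            # host to device
        ops.append(("copy", f"B{k}{j}"))            # host to device
        ops.append(("comp", f"A{i}{k}*B{k}{j}"))    # accumulate into C[i][j]
    ops.append(("copyback", f"C{i}{j}"))            # device to host, once per tile
    return ops


def schedule(T, n_streams=2):
    """Distribute output tiles over streams in round-robin order."""
    streams = [[] for _ in range(n_streams)]
    tiles = [(i, j) for i in range(T) for j in range(T)]
    for index, (i, j) in enumerate(tiles):
        streams[index % n_streams].extend(tile_ops(i, j, T))
    return streams


streams = schedule(T=4)
copybacks = sum(op == "copyback" for stream in streams for op, _ in stream)
```

Because the partial products of one output tile are accumulated on the device, each of the 16 tiles of C is copied back exactly once, whereas A and B tiles are copied once per partial product.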


CUSTOM LAYOUT: GRID2GRID

IDEA: relabel the ranks to minimize the communication cost

[Figure: data redistributed from the custom layout to the COSMA layout over ranks P0, P1, P2, P3, shown before and after relabeling the ranks]

Optimal relabeling = maximum weighted perfect matching.
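The matching formulation can be illustrated with a brute-force sketch for four ranks. It assumes a matrix `volume[i][j]` holding the amount of data that rank i already stores and that the target layout assigns to label j. The matrix values and the function name are hypothetical, and trying all permutations is only practical for tiny examples; a polynomial-time assignment solver would be needed at scale.

```python
from itertools import permutations


def best_relabeling(volume):
    """Return perm where rank i takes label perm[i], maximizing local data."""
    n = len(volume)
    return max(
        permutations(range(n)),
        key=lambda perm: sum(volume[i][perm[i]] for i in range(n)),
    )


# Hypothetical overlap between the custom layout (rows) and the COSMA
# layout (columns), in arbitrary units.
volume = [
    [1, 9, 0, 0],
    [8, 2, 0, 0],
    [0, 0, 3, 7],
    [0, 0, 6, 4],
]
perm = best_relabeling(volume)
kept_local = sum(volume[i][perm[i]] for i in range(4))
total = sum(map(sum, volume))
communicated = total - kept_local
```

In this example the identity labeling keeps 10 of 40 units local, while the best relabeling keeps 30, so only 10 units have to be communicated.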


OPTIMIZATIONS IN ACTION

COSMA (ScaLAPACK layout) vs. ScaLAPACK, $P = 16 \times 16$

[Figure: matrices A, B, and C]

COSMA (ScaLAPACK layout):
- Create communicators
- Allocate memory
- Solve Perfect Matching
- Transpose
- Multiply
- Free communicators
- Free memory

ScaLAPACK:
- Multiply

Legacy interface favors ScaLAPACK

[Figure: performance in GFlop/s (vertical axis, 0 to 1600) against square matrix dimension (horizontal axis, 20,000 to 120,000) for ScaLAPACK, COSMA (CPU), and COSMA (GPU)]

- ~2× faster on CPU
- >2.5× faster on GPU


COSMA: Communication Optimal S-partition-based Matrix multiplication Algorithm

- New general method of assessing lower bounds (X-partition)
- Tight sequential and parallel I/O lower bound proofs
- Lowest communication volume and total runtime in ALL scenarios
- Open source implementation available on GitHub: https://github.com/eth-cscs/COSMA/
- Available from Spack package manager
