spcl.inf.ethz.ch
@spcl_eth
GRZEGORZ KWASNIEWSKI, MARKO KABIĆ, MACIEJ BESTA, JOOST VANDEVONDELE, RAFFAELE SOLCÀ, TORSTEN HOEFLER
Red-Blue Pebbling Revisited: Near Optimal Parallel Matrix-Matrix Multiplication
PARALLEL I/O LOWER BOUND
PERFORMANCE VS OTHER LIBRARIES
MATRICES
Molecular simulations: 64 H2O molecules [1]
m = n = 17,408, k = 3,735,552
[Figure: matrices A, B, and C]
[1] Del Ben, Mauro, et al. "Enabling simulation at the fifth rung of DFT: Large scale RPA calculations with excellent time to solution." Computer Physics Communications 187 (2015): 120-129.
MATRICES
Molecular simulations
Dissipative Quantum Transport Simulations [2]
$O(N_{k_z} N_E N_{q_z} N_{E_{ph}} N_A N_B N_{3D}) > 4 \cdot 10^{11}$ 12x12 MMM
[Figure: rows of blocks G (Gk-2 to Gk+4) multiplied by D1 to D4 yield Σk, Σk+1, Σk+2]
[2] Ziogas, Alexandros Nikolaos, et al. "A Data-Centric Approach to Extreme-Scale Ab initio Dissipative Quantum Transport Simulations." (SC19), Nov. 2019.
m = n = 12, k = 840
47-76% of total runtime
[Figure: matrices A, B, and C]
DISTRIBUTED SYSTEMS
4,608 nodes, 27,648 V100 GPUs, 2,414,592 cores
40,960 nodes, 10,649,600 cores
MATRICES
Molecular simulations: m = n = 17,408, k = 3,735,552
Dissipative Quantum Transport Simulations: m = n = 12, k = 840
[Figure: timeline of parallel matrix multiplication algorithms; vertical axis: worst-case I/O cost; horizontal axis: time (< 1969, 1969, 1981, 1994, 1997, 2004, 2011, 2013, 2017, 2019); decompositions grouped as 1D, 2D, 3D; a "lower bound" line is drawn for comparison]
naive
Cannon's [1]
PUMMA [3]
SUMMA [4]
3D [5]
"2.5D" [7]
CARMA [8]
COSMA: Optimal decomposition in all scenarios
[1] Lynn Elliot Cannon, 1969. A Cellular Computer to Implement the Kalman Filter Algorithm. Ph.D. Dissertation.
[2] Hong Jia-Wei and Hsiang-Tsung Kung, 1981. I/O complexity: The red-blue pebble game. In STOC.
[3] Jaeyoung Choi et al., 1994. PUMMA: Parallel universal matrix multiplication algorithms on distributed memory concurrent computers.
[4] Robert A. Van De Geijn and Jerrell Watts, 1997. SUMMA: Scalable universal matrix multiplication algorithm.
[5] Ramesh Agarwal et al., 1995. A three-dimensional approach to parallel matrix multiplication.
[6] Dror Irony et al., 2004. Communication Lower Bounds for Distributed-memory Matrix Multiplication.
[7] Edgar Solomonik and James Demmel, 2011. Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms.
[8] J. Demmel et al., 2013. Communication-Optimal Parallel Recursive Rectangular Matrix Multiplication.
[9] Tyler Michael Smith and Robert A. van de Geijn, 2017. Pushing the Bounds for Matrix-Matrix Multiplication.
1D
[Figure: 1D decomposition of A, B, and C over eight processors p0 to p7]
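2D
[Figure: 2D decomposition over a 3 × 3 processor grid p0 to p8; blocks of one input matrix are shared by the grid rows (p0,p1,p2; p3,p4,p5; p6,p7,p8) and blocks of the other by the grid columns (p0,p3,p6; p1,p4,p7; p2,p5,p8)]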
3D
[Figure: 3D decomposition over eight processors p0 to p7; matrix blocks are shared by processor pairs (p0,p2; p1,p3; p4,p6; p5,p7 on one matrix and p0,p1; p2,p3; p4,p5; p6,p7 on another)]
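The 1D, 2D, and 3D decompositions differ only in how many dimensions of the iteration space are split. A minimal sketch of that idea follows, assuming a row-major rank numbering that need not match the figures; `grid_coords` and `local_block` are hypothetical helper names, and the matrix dimensions are assumed to be divisible by the grid dimensions.

```python
def grid_coords(rank, grid):
    """Map a rank to (i, j, l) coordinates in a pm x pn x pk grid (row-major)."""
    pm, pn, pk = grid
    assert 0 <= rank < pm * pn * pk
    i, rest = divmod(rank, pn * pk)
    j, l = divmod(rest, pk)
    return i, j, l


def local_block(rank, grid, m, n, k):
    """Half-open index ranges of the m x n x k iteration space owned by `rank`.

    Assumes m, n, k are divisible by the corresponding grid dimensions.
    """
    pm, pn, pk = grid
    i, j, l = grid_coords(rank, grid)
    bm, bn, bk = m // pm, n // pn, k // pk
    return ((i * bm, (i + 1) * bm),
            (j * bn, (j + 1) * bn),
            (l * bk, (l + 1) * bk))


# 1D splits one dimension, 2D splits two, 3D splits all three.
for name, grid in [("1D", (1, 8, 1)), ("2D", (2, 4, 1)), ("3D", (2, 2, 2))]:
    blocks = [local_block(r, grid, 8, 8, 8) for r in range(8)]
    print(name, "block of rank 0:", blocks[0])
```

In the 3D case every rank owns only part of the k range, so partial results of C have to be combined across ranks that share the same (i, j) coordinates.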
TOP-DOWN
[Figure: the processor grid is fixed first (p0 to p7, with blocks shared by the pairs p0,p2; p1,p3; p4,p6; p5,p7 and p0,p1; p2,p3; p4,p5; p6,p7)]
p0: cache size: S
Communication between ranks! Suboptimal shape!
BOTTOM-UP
[Figure: p0 to p7 arranged as p0 p1 p2 p3 / p4 p5 p6 p7, with blocks shared by p0,p4; p1,p5; p2,p6; p3,p7 and by p0,p1,p2,p3; p4,p5,p6,p7]
No communication between ranks
Optimal shape
RED-BLUE PEBBLE GAME [Hong, Kung. 1981]
S = 5, B = ∞
[Figure: pebbling of a computation graph, shown step by step]
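The rules of the game can be made concrete with a small move checker. This is a minimal sketch, assuming the standard rules (a load places a red pebble on a vertex that has a blue one, a store places a blue pebble on a vertex that has a red one, a compute needs red pebbles on all predecessors, and at most S red pebbles are in use); the function name `play` and the move encoding are hypothetical.

```python
def play(preds, inputs, outputs, moves, S):
    """Validate a red-blue pebbling and return its I/O cost (loads + stores).

    preds:  dict mapping each vertex to the list of its predecessors.
    moves:  list of (op, vertex), op in {"load", "store", "compute", "discard"}.
    S:      maximum number of red pebbles (fast memory size).
    """
    red, blue = set(), set(inputs)      # inputs start in slow memory (blue)
    io = 0
    for op, v in moves:
        if op == "load":                # blue -> red, costs one I/O
            if v not in blue:
                raise ValueError(f"load {v}: vertex has no blue pebble")
            red.add(v)
            io += 1
        elif op == "store":             # red -> blue, costs one I/O
            if v not in red:
                raise ValueError(f"store {v}: vertex has no red pebble")
            blue.add(v)
            io += 1
        elif op == "compute":           # free, needs all predecessors in red
            if v in inputs or any(u not in red for u in preds[v]):
                raise ValueError(f"compute {v}: predecessors not in fast memory")
            red.add(v)
        elif op == "discard":           # free, releases a red pebble
            red.discard(v)
        else:
            raise ValueError(f"unknown move {op}")
        if len(red) > S:
            raise ValueError("more than S red pebbles in use")
    if not set(outputs) <= blue:
        raise ValueError("not all outputs were stored")
    return io


# Tiny graph: c depends on inputs a and b.
preds = {"a": [], "b": [], "c": ["a", "b"]}
moves = [("load", "a"), ("load", "b"), ("compute", "c"), ("store", "c")]
print(play(preds, ["a", "b"], ["c"], moves, S=3))
```

The I/O cost Q of a schedule is the number of loads plus stores; the lower bounds on the following slides limit how small Q can be over all valid schedules.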
RED-BLUE PEBBLE GAME: 2S-PARTITION [Hong, Kung. 1981]
DOMINATOR SET, MINIMUM SET
S = 4, size = 2S = 8
I/O LOWER BOUND!
Per subset: # loads ≥ 4, # stores ≥ 4
$Q \ge S \cdot (H(2S) - 1)$
$H(2S)$: number of subsets in the 2S-partition
$S$: minimal I/O per subset
OUR WORK: X-PARTITION
DOMINATOR SET, MINIMUM SET
S = 4, SIZE = 8
I/O LOWER BOUND: 8 – 3 = 5
$Q \ge (X - R(S) + T(S)) \cdot (H(X) - 1)$
$H(X)$: number of subsets in the X-partition
$X$: dominator set size
$R(S)$: maximum reuse
$T(S)$: minimum store size
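The two bounds can be compared numerically. The sketch below simply evaluates the X-partition formula; the number of subsets `H` is a made-up value used for both cases purely for illustration (in general it depends on X), and `store=0` for the slide's example is an assumption, since the slide only shows 8 – 3 = 5.

```python
def io_lower_bound(X, reuse, store, num_subsets):
    """Evaluate Q >= (X - R(S) + T(S)) * (H(X) - 1)."""
    return (X - reuse + store) * (num_subsets - 1)


S = 4
H = 6  # hypothetical number of subsets, for illustration only

# With X = 2S, R(S) = S, T(S) = 0 the per-subset term is 2S - S = S,
# which has the same form as the 2S-partition bound S * (H(2S) - 1).
hong_kung = io_lower_bound(X=2 * S, reuse=S, store=0, num_subsets=H)

# Slide example: dominator set size 8, maximum reuse 3, so 8 - 3 = 5 per subset.
x_partition = io_lower_bound(X=8, reuse=3, store=0, num_subsets=H)

print(hong_kung, x_partition)
```

A tighter estimate of the reuse (3 instead of S = 4) raises the per-subset I/O from 4 to 5, which is where the stronger bound comes from.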
[Figure: the 3D iteration space with dimensions m, n, and k; Matrix A and Matrix B lie on the faces that share the k dimension]
[Figure: Matrix A, Matrix B, and the 3D iteration space, with one computation step highlighted]
$S$ computed elements
$S$ reused elements
$\sqrt{S}$ loaded elements (from A)
$\sqrt{S}$ loaded elements (from B)
Up to p = 6 processors: 2D decomposition is optimal
Each “pillar”: $2\sqrt{S}K$ loads, $S$ stores
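The per-pillar count can be turned into a total. This is a sketch under the assumption that each pillar has a √S × √S base in the m-n plane and spans the full height K, so that it loads √S elements of A and √S elements of B per step of k and stores its S results once; `pillar_io` and `total_io` are hypothetical names, and S is assumed to be a perfect square that divides the matrix dimensions evenly.

```python
import math


def pillar_io(S, K):
    """Loads and stores of one pillar with a sqrt(S) x sqrt(S) base and height K."""
    side = math.isqrt(S)
    assert side * side == S, "S is assumed to be a perfect square"
    loads = 2 * side * K   # sqrt(S) elements of A and of B for each of the K steps
    stores = S             # every element of the C tile is written once
    return loads, stores


def total_io(M, N, K, S):
    """Total I/O when the M x N output is tiled into sqrt(S) x sqrt(S) pillars."""
    side = math.isqrt(S)
    assert M % side == 0 and N % side == 0
    pillars = (M // side) * (N // side)
    loads, stores = pillar_io(S, K)
    return pillars * (loads + stores)


print(pillar_io(16, 64), total_io(64, 64, 64, 16))
```

Summed over the MN/S pillars this gives 2MNK/√S + MN, so the communication falls with the square root of the fast memory size.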
[Figure: decompositions of the 3D iteration space (with Matrix A and Matrix B) for p = 6, p = 12, and p = 54; for p = 12 the local domain dimensions are labeled in terms of $S$, and for p = 54 the local domain is labeled with $\sqrt[3]{MNK/p}$]
IMPLEMENTATION OPTIMIZATIONS
Communication-computation overlap
Communication buffer optimization
One-Sided and Two-Sided Communication
Processor grid optimization
Processor grid optimization example: p = 65 = 1 ∙ 5 ∙ 13
Dropping one processor:
- increases computation per processor by 1.5%
- reduces communication by 36%
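The numbers in this example can be roughly reproduced with a simple model. The sketch below is an assumption-laden illustration, not COSMA's actual grid search: it assumes square matrices (m = n = k), enumerates all three-factor grids, and estimates per-rank communication as the sum of the three face areas of the local domain; `grids`, `comm_volume`, and `best_grid` are hypothetical names.

```python
def grids(p):
    """All ordered triples (pm, pn, pk) with pm * pn * pk == p."""
    out = []
    for pm in range(1, p + 1):
        if p % pm:
            continue
        q = p // pm
        for pn in range(1, q + 1):
            if q % pn == 0:
                out.append((pm, pn, q // pn))
    return out


def comm_volume(m, n, k, grid):
    """Assumed cost model: face areas of the local a x b x c domain."""
    pm, pn, pk = grid
    a, b, c = m / pm, n / pn, k / pk
    return a * b + a * c + b * c


def best_grid(m, n, k, p):
    """Grid with the smallest modeled communication volume."""
    return min(grids(p), key=lambda g: comm_volume(m, n, k, g))


N = 1000
g65, g64 = best_grid(N, N, N, 65), best_grid(N, N, N, 64)
extra_compute = 65 / 64 - 1
comm_saving = 1 - comm_volume(N, N, N, g64) / comm_volume(N, N, N, g65)
print(g65, g64, f"{extra_compute:.1%}", f"{comm_saving:.1%}")
```

Under these assumptions 65 ranks can only form a 1 ∙ 5 ∙ 13 grid (up to permutation), while 64 ranks form a 4 ∙ 4 ∙ 4 grid; the model gives about 1.6% more computation per rank and about 36% less communication, in line with the slide.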
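EVALUATION
Comparison targets:
LIBRARY | DECOMPOSITION ALGORITHM | THEORETICAL COMPLEXITY
Intel MKL ScaLAPACK | 2D (SUMMA) | $\frac{k(m+n)}{\sqrt{p}} + \frac{mn}{p}$
Cyclops Tensor Framework (CTF) | 2.5D | $\frac{(k(m+n))^{3/2}}{p\sqrt{S}} + \frac{mnS}{k(m+n)}$
CARMA | Recursive | $2\min\left\{\sqrt{3}\,\frac{mnk}{p\sqrt{S}}, \left(\frac{mnk}{p}\right)^{2/3}\right\} + \left(\frac{mnk}{p}\right)^{2/3}$
COSMA | Bottom-up | $\min\left\{\frac{2mnk}{p\sqrt{S}} + S,\ 3\left(\frac{mnk}{p}\right)^{2/3}\right\}$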
Piz Daint Supercomputer (6th in TOP500)
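EVALUATION
Columns ScaLAPACK, CTF, CARMA, COSMA: total comm. volume per rank [MB]. Columns min, mean, max: speedup. The shape column was graphical; each group of three rows is one matrix shape.
shape | benchmark | ScaLAPACK | CTF | CARMA | COSMA | min | mean | max
[shape 1] | strong scaling | 203 | 222 | 195 | 107 | 1.07 | 1.94 | 4.81
[shape 1] | limited memory | 816 | 986 | 799 | 424 | 1.23 | 1.71 | 2.99
[shape 1] | extra memory | 303 | 350 | 291 | 151 | 1.14 | 2.03 | 4.73
[shape 2] | strong scaling | 2636 | 2278 | 659 | 545 | 1.24 | 2 | 6.55
[shape 2] | limited memory | 368 | 541 | 128 | 88 | 1.3 | 2.61 | 8.26
[shape 2] | extra memory | 133 | 152 | 48 | 35 | 1.31 | 2.55 | 6.7
[shape 3] | strong scaling | 3507 | 2024 | 541 | 410 | 1.31 | 2.22 | 3.22
[shape 3] | limited memory | 989 | 672 | 399 | 194 | 1.42 | 1.7 | 2.27
[shape 3] | extra memory | 122 | 77 | 77 | 29 | 1.35 | 1.76 | 2.8
[shape 4] | strong scaling | 134 | 68 | 10 | 7 | 1.21 | 4.02 | 12.81
[shape 4] | limited memory | 47 | 101 | 26 | 8 | 1.31 | 2.07 | 3.41
[shape 4] | extra memory | 15 | 15 | 10 | 3 | 1.5 | 2.29 | 3.59
overall | | | | | | 1.07 | 2.17 | 12.81
Callouts:
CTF: p=128, m=n=k=12123
ScaLAPACK: p=17029, m=n=113072, k=512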
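As a quick way to read the communication columns, the sketch below computes, for every row, how much more data the best competing library moves per rank than COSMA. The volumes are copied from the table; the ratio itself is a derived figure and is not one of the speedup columns, which report runtime.

```python
# (benchmark, ScaLAPACK, CTF, CARMA, COSMA): comm. volume per rank in MB,
# copied from the table in row order (four shapes, three benchmarks each).
rows = [
    ("strong scaling", 203, 222, 195, 107),
    ("limited memory", 816, 986, 799, 424),
    ("extra memory", 303, 350, 291, 151),
    ("strong scaling", 2636, 2278, 659, 545),
    ("limited memory", 368, 541, 128, 88),
    ("extra memory", 133, 152, 48, 35),
    ("strong scaling", 3507, 2024, 541, 410),
    ("limited memory", 989, 672, 399, 194),
    ("extra memory", 122, 77, 77, 29),
    ("strong scaling", 134, 68, 10, 7),
    ("limited memory", 47, 101, 26, 8),
    ("extra memory", 15, 15, 10, 3),
]


def ratio_to_best_other(row):
    """Smallest competing volume divided by COSMA's volume."""
    _, scalapack, ctf, carma, cosma = row
    return min(scalapack, ctf, carma) / cosma


ratios = [ratio_to_best_other(r) for r in rows]
for (name, *_), ratio in zip(rows, ratios):
    print(f"{name:15s} best other / COSMA = {ratio:.2f}")
```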
EVALUATION
[Figure: Total communication volume for “largeK” matrices (↓ lower = better)]
[Figure: % of achieved peak performance for “largeK” matrices (↑ higher = better)]
EVALUATION – square matrices
[Figure: Total communication volume]
[Figure: % of achieved peak performance]
EVALUATION
Time distribution of COSMA communication and computation kernels
Only 13%? Only 24%? Only 31%? Only 31%?
Yes, but… 2x less communication than second best. Would result in: 60% increase in runtime
Yes, but… 10x less communication than second best. Would result in: 130% increase in runtime
PORTABILITY AND USABILITY
[Figure: COSMA between a frontend and a backend]
FRONTEND: YOUR CODE, C/C++ INTERFACE, FORTRAN INTERFACE, SCALAPACK LAYOUT, COSMA LAYOUT, CUSTOM LAYOUT
BACKEND: CPU, GPU (CUDA, ROCm)
GPU BACKEND: TILED-MM
[Figure: A, B, and C each split into a 4 × 4 grid of tiles, A00 to A33, B00 to B33, C00 to C33]
stream 1: copy A10, B00 | comp A10·B00 | copy A11, B10 | comp A11·B10 | copy A12, B20 | comp A12·B20 | copy A13, B30 | comp A13·B30 | copy back C10
stream 2: copy A00, B01 | comp A00·B01 | copy A01, B11 | comp A01·B11 | copy A02, B21 | comp A02·B21 | copy A03, B31 | comp A03·B31 | copy back C01
…
Copy A and B tiles together
C copied back only once!
[Figure: timelines of streams S1 and S2, each repeating copy A, copy B, compute A ∗ B, shifted against each other by a time difference]
Openly available as a standalone library: https://github.com/kabicm/Tiled-MM
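The tile schedule can be illustrated on the host. The following is a minimal pure-Python sketch of the accumulation pattern only; it is not the Tiled-MM API and uses no GPU streams. `acc` stands in for a device-side buffer for one C tile, the function name `tiled_mm` is hypothetical, and the matrix size is assumed to be a multiple of the tile size.

```python
def tiled_mm(A, B, tile):
    """Multiply square matrices (lists of lists) tile by tile.

    Returns the product and the number of C-tile copy-backs.
    """
    n = len(A)
    assert n % tile == 0, "matrix size is assumed to be a multiple of the tile size"
    C = [[0.0] * n for _ in range(n)]
    copy_backs = 0
    for ti in range(0, n, tile):
        for tj in range(0, n, tile):
            # One accumulator per C tile; it stays "on the device" for all k tiles.
            acc = [[0.0] * tile for _ in range(tile)]
            for tk in range(0, n, tile):
                # The A and B tiles for this step are brought in together,
                # then multiplied and added into the accumulator.
                for i in range(tile):
                    for j in range(tile):
                        s = 0.0
                        for kk in range(tile):
                            s += A[ti + i][tk + kk] * B[tk + kk][tj + j]
                        acc[i][j] += s
            # The finished C tile is copied back exactly once.
            for i in range(tile):
                C[ti + i][tj:tj + tile] = acc[i]
            copy_backs += 1
    return C, copy_backs


n = 4
A = [[i * n + j for j in range(n)] for i in range(n)]
B = [[(i + 2 * j) % 5 for j in range(n)] for i in range(n)]
C, copy_backs = tiled_mm(A, B, tile=2)
print(copy_backs)
```

On a GPU, different C tiles would be assigned to different streams so that the copies of one stream can overlap with the computation of another.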
CUSTOM LAYOUT: GRID2GRID
[Figure: data owned by ranks P0 to P3 in the Custom Layout is redistributed to the COSMA Layout used by COSMA; relabeling the ranks changes which blocks have to move]
IDEA: relabel the ranks to minimize the communication cost
Optimal relabeling = maximum weighted perfect matching.
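The relabeling idea can be sketched for a handful of ranks. The example below solves the matching by brute force over all permutations, which is only practical for very small rank counts and stands in for a proper matching algorithm; the `overlap` matrix is made-up illustration data, where `overlap[i][j]` is the amount of data rank i already holds that label j needs in the target layout, and `best_relabeling` is a hypothetical name.

```python
from itertools import permutations


def best_relabeling(overlap):
    """Return (perm, kept): perm[i] is the new label of rank i, chosen to
    maximize the total amount of data that can stay where it is."""
    n = len(overlap)

    def kept_local(perm):
        return sum(overlap[i][perm[i]] for i in range(n))

    best = max(permutations(range(n)), key=kept_local)
    return best, kept_local(best)


# Hypothetical overlap volumes for four ranks.
overlap = [
    [0, 8, 1, 0],
    [8, 0, 0, 1],
    [1, 0, 0, 8],
    [0, 1, 8, 0],
]
perm, kept = best_relabeling(overlap)
total = sum(map(sum, overlap))
print(perm, "kept local:", kept, "communicated:", total - kept)
```

With the identity labeling none of this data stays local; after relabeling, most of it does, so only the remainder has to be communicated.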
OPTIMIZATIONS IN ACTION
COSMA with SCALAPACK LAYOUT vs. SCALAPACK, P = 16 × 16
[Figure: matrices A, B, and C]
COSMA:
- Create communicators
- Allocate memory
- Solve Perfect Matching
- Transpose
- Multiply
- Free communicators
- Free memory
ScaLAPACK:
- Multiply
Legacy interface favors ScaLAPACK
OPTIMIZATIONS IN ACTION
[Figure: GFlop/s (0 to 1600) vs. Matrix Dimensions (square, 20000 to 120000) for ScaLAPACK, COSMA (CPU), and COSMA (GPU)]
~2× faster on CPU
>2.5× faster on GPU
COSMA: Communication Optimal S-partition-based Matrix multiplication Algorithm
New general method of assessing lower bounds (X-partition)
Tight sequential and parallel I/O lower bound proofs
Lowest communication volume and total runtime in ALL scenarios
Open source implementation available on GitHub: https://github.com/eth-cscs/COSMA/
Available from the Spack package manager