Accelerating GMRES via Mixed Precision. Neil Lindquist, Piotr Luszczek, Jack Dongarra. 12th JLESC Workshop, February 25th, 2020.


Page 1: Accelerating GMRES via Mixed Precision

Accelerating GMRES via Mixed Precision

Neil Lindquist, Piotr Luszczek, Jack Dongarra

12th JLESC Workshop

February 25th, 2020


Page 2: Accelerating GMRES via Mixed Precision

GMRES

• General-purpose sparse linear solver
  • Iterative Krylov solver
• Memory-bound performance
  • Mix single and double precision

Page 3: Accelerating GMRES via Mixed Precision

GMRES Algorithm

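Computing $\mathbf{A}\mathbf{x} = \mathbf{b}$, where $\mathbf{A}^{-1} \approx \mathbf{M}^{-1}$.

$\mathrm{GMRES}_{res}(\mathbf{A}, \mathbf{x}_0, \mathbf{b}, \mathbf{M}^{-1})$
for $k = 0, 1, 2, \ldots$ (restarts)
    $\mathbf{r}_k \leftarrow \mathbf{b} - \mathbf{A}\mathbf{x}_k$
    $\mathbf{z}_k \leftarrow \mathbf{M}^{-1}\mathbf{r}_k$
    $\beta \leftarrow \|\mathbf{z}_k\|_2$
    $\mathbf{V}_{:,0} \leftarrow \mathbf{z}_k / \beta$
    $\mathbf{s} \leftarrow [\beta, 0, 0, \ldots, 0]^T$
    for $j = 0, 1, 2, \ldots$ (iteration count)
        $\mathbf{w} \leftarrow \mathbf{M}^{-1}\mathbf{A}\mathbf{V}_{:,j}$
        $\mathbf{w}, \mathbf{H}_{:,j} \leftarrow \mathrm{orthogonalize}(\mathbf{w}, \mathbf{V}_{:,j})$
        $\mathbf{H}_{j+1,j} \leftarrow \|\mathbf{w}\|_2$
        $\mathbf{V}_{:,j+1} \leftarrow \mathbf{w} / \|\mathbf{w}\|_2$
        $\mathbf{H}_{:,j} \leftarrow \mathbf{G}_0\mathbf{G}_1\cdots\mathbf{G}_{j-1}\mathbf{H}_{:,j}$
        $\mathbf{G}_j \leftarrow \mathrm{rotation\_matrix}(\mathbf{H}_{:,j})$
        $\mathbf{H}_{:,j} \leftarrow \mathbf{G}_j\mathbf{H}_{:,j}$
        $\mathbf{s} \leftarrow \mathbf{G}_j\mathbf{s}$
    $\mathbf{u}_k \leftarrow \mathbf{V}\mathbf{H}^{-1}\mathbf{s}$
    $\mathbf{x}_{k+1} \leftarrow \mathbf{x}_k + \mathbf{u}_k$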
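The restarted, left-preconditioned loop on this slide can be sketched in NumPy as follows. This is a minimal illustration in double precision only, not the authors' Kokkos implementation: it uses dense matrices, modified Gram-Schmidt for the orthogonalization step, and a stopping test on the preconditioned residual estimate. The names `gmres_restarted`, `givens`, `apply_Minv`, `m`, `restarts`, and `tol` are assumptions made for the example.

```python
import numpy as np

def givens(a, b):
    """Return (c, s) such that [[c, s], [-s, c]] @ [a, b] = [hypot(a, b), 0]."""
    r = np.hypot(a, b)
    return (1.0, 0.0) if r == 0.0 else (a / r, b / r)

def gmres_restarted(A, b, x0, apply_Minv=lambda v: v, m=30, restarts=50, tol=1e-10):
    """Restarted GMRES; apply_Minv(v) applies the preconditioner M^{-1} to v."""
    n = b.size
    x = np.array(x0, dtype=np.float64)
    target = tol * np.linalg.norm(apply_Minv(b))   # assumed stopping threshold
    for k in range(restarts):                      # outer loop: restarts
        z = apply_Minv(b - A @ x)                  # preconditioned residual
        beta = np.linalg.norm(z)
        if beta <= target:
            break
        V = np.zeros((n, m + 1))
        H = np.zeros((m + 1, m))
        cs, sn = np.zeros(m), np.zeros(m)
        s = np.zeros(m + 1)
        s[0] = beta
        V[:, 0] = z / beta
        for j in range(m):                         # inner loop: iteration count
            w = apply_Minv(A @ V[:, j])
            for i in range(j + 1):                 # modified Gram-Schmidt
                H[i, j] = V[:, i] @ w
                w = w - H[i, j] * V[:, i]
            H[j + 1, j] = np.linalg.norm(w)
            if H[j + 1, j] > 0.0:
                V[:, j + 1] = w / H[j + 1, j]
            for i in range(j):                     # apply the earlier rotations
                hi, hi1 = H[i, j], H[i + 1, j]
                H[i, j] = cs[i] * hi + sn[i] * hi1
                H[i + 1, j] = -sn[i] * hi + cs[i] * hi1
            cs[j], sn[j] = givens(H[j, j], H[j + 1, j])
            H[j, j] = cs[j] * H[j, j] + sn[j] * H[j + 1, j]
            H[j + 1, j] = 0.0
            s[j + 1] = -sn[j] * s[j]               # |s[j+1]| estimates the residual
            s[j] = cs[j] * s[j]
            if abs(s[j + 1]) <= target:
                break
        used = j + 1
        y = np.linalg.solve(H[:used, :used], s[:used])   # H is upper triangular here
        x += V[:, :used] @ y                             # x_{k+1} = x_k + u_k
    return x

# Usage on a small, well-conditioned test matrix with a scalar Jacobi preconditioner.
rng = np.random.default_rng(0)
n = 50
A = 4.0 * np.eye(n) + 0.5 * rng.standard_normal((n, n)) / np.sqrt(n)
b = rng.standard_normal(n)
x = gmres_restarted(A, b, np.zeros(n), apply_Minv=lambda v: v / np.diag(A))
print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```

The Givens rotations keep the small least-squares problem in triangular form, so the residual estimate is available at every inner iteration without forming the solution.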

Page 4: Accelerating GMRES via Mixed Precision

GMRES Algorithm

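Computing $\mathbf{A}\mathbf{x} = \mathbf{b}$, where $\mathbf{A}^{-1} \approx \mathbf{M}^{-1}$.

$\mathrm{GMRES}_{res}(\mathbf{A}, \mathbf{x}_0, \mathbf{b}, \mathbf{M}^{-1})$
for $k = 0, 1, 2, \ldots$ (restarts)
    Double:
    $\mathbf{r}_k \leftarrow \mathbf{b} - \mathbf{A}\mathbf{x}_k$
    Single:
    $\mathbf{z}_k \leftarrow \mathbf{M}^{-1}\mathbf{r}_k$
    $\beta \leftarrow \|\mathbf{z}_k\|_2$
    $\mathbf{V}_{:,0} \leftarrow \mathbf{z}_k / \beta$
    $\mathbf{s} \leftarrow [\beta, 0, 0, \ldots, 0]^T$
    for $j = 0, 1, 2, \ldots$ (iteration count)
        $\mathbf{w} \leftarrow \mathbf{M}^{-1}\mathbf{A}\mathbf{V}_{:,j}$
        $\mathbf{w}, \mathbf{H}_{:,j} \leftarrow \mathrm{orthogonalize}(\mathbf{w}, \mathbf{V}_{:,j})$
        $\mathbf{H}_{j+1,j} \leftarrow \|\mathbf{w}\|_2$
        $\mathbf{V}_{:,j+1} \leftarrow \mathbf{w} / \|\mathbf{w}\|_2$
        $\mathbf{H}_{:,j} \leftarrow \mathbf{G}_0\mathbf{G}_1\cdots\mathbf{G}_{j-1}\mathbf{H}_{:,j}$
        $\mathbf{G}_j \leftarrow \mathrm{rotation\_matrix}(\mathbf{H}_{:,j})$
        $\mathbf{H}_{:,j} \leftarrow \mathbf{G}_j\mathbf{H}_{:,j}$
        $\mathbf{s} \leftarrow \mathbf{G}_j\mathbf{s}$
    $\mathbf{u}_k \leftarrow \mathbf{V}\mathbf{H}^{-1}\mathbf{s}$
    Double:
    $\mathbf{x}_{k+1} \leftarrow \mathbf{x}_k + \mathbf{u}_k$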

Page 5: Accelerating GMRES via Mixed Precision

GMRES Simplified Algorithm

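$\mathrm{GMRES}_{res}(\mathbf{A}, \mathbf{x}_0, \mathbf{b}, \mathbf{M}^{-1})$
for $k = 0, 1, 2, \ldots$
    Double: $\mathbf{r}_k \leftarrow \mathbf{b} - \mathbf{A}\mathbf{x}_k$
    Single: $\mathbf{u}_k \leftarrow \mathrm{GMRES}_{no\ res}(\mathbf{A}, \mathbf{0}, \mathbf{r}_k, \mathbf{M}^{-1})$
    Double: $\mathbf{x}_{k+1} \leftarrow \mathbf{x}_k + \mathbf{u}_k$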

Page 6: Accelerating GMRES via Mixed Precision

GMRES Simplified Algorithm

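$\mathrm{GMRES}_{res}(\mathbf{A}, \mathbf{x}_0, \mathbf{b}, \mathbf{M}^{-1})$
for $k = 0, 1, 2, \ldots$
    Double: $\mathbf{r}_k \leftarrow \mathbf{b} - \mathbf{A}\mathbf{x}_k$
    Single: $\mathbf{u}_k \leftarrow \mathbf{A}^{-1}\mathbf{r}_k$ (approximate solve, starting from initial guess $\mathbf{0}$)
    Double: $\mathbf{x}_{k+1} \leftarrow \mathbf{x}_k + \mathbf{u}_k$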
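Read this way, the restarted solver has the shape of iterative refinement: residual and update in double, correction solve in single. A minimal NumPy sketch of that structure follows. It assumes a dense test matrix and uses a single-precision direct solve as a stand-in for the single-precision inner GMRES, so it illustrates the precision split only; `refine`, `inner_solve`, and `outer_iters` are names made up for the example.

```python
import numpy as np

def refine(A, b, inner_solve, outer_iters=10):
    """Outer loop in float64; inner_solve works on float32 data."""
    x = np.zeros(b.size, dtype=np.float64)
    for k in range(outer_iters):
        r = b - A @ x                              # double: residual
        u = inner_solve(r.astype(np.float32))      # single: approximate A^{-1} r
        x = x + u.astype(np.float64)               # double: update
    return x

rng = np.random.default_rng(1)
n = 100
A = 4.0 * np.eye(n) + rng.standard_normal((n, n)) / np.sqrt(n)
b = rng.standard_normal(n)
A32 = A.astype(np.float32)                         # single-precision copy of A

x_mixed = refine(A, b, lambda r32: np.linalg.solve(A32, r32))
x_single = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)

def rel_res(x):
    return np.linalg.norm(b - A @ x) / np.linalg.norm(b)

print(rel_res(x_single), rel_res(x_mixed))
```

On a well-conditioned matrix like this one, the single-precision solve alone stalls near single-precision accuracy, while the refined solution keeps improving because each residual is recomputed in double.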

Page 7: Accelerating GMRES via Mixed Precision

Performance

• Target accuracy: $10^{-10} = \dfrac{\|b - Ax\|_2}{\|A\|_F \|x\|_2 + \|b\|_2}$
• Restart strategies:
  I. 100 inner iterations
  II. 100 inner iterations or residual estimate of $10^{-10}$
  III. First: 100 inner iterations or residual estimate of $10^{-6}$
       Then: same number of inner iterations
• 20-core Haswell node with NVIDIA V100 GPU
  • cuSPARSE, cuBLAS, Kokkos
• CSR matrix format
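The three restart strategies can be written as one predicate that decides when an inner cycle ends. This is a sketch under assumptions: the slide does not say how the residual estimate is normalized, so the function simply compares whatever estimate the caller passes in, and the names `should_restart`, `first_cycle_iters`, and `max_inner` are hypothetical.

```python
def should_restart(strategy, j, residual_estimate, first_cycle_iters=None, max_inner=100):
    """Return True when the current inner cycle should end.

    j: inner iterations completed in the current cycle.
    first_cycle_iters: for strategy III, the length of the first cycle once it
    is known; None while the first cycle is still running.
    """
    if strategy == "I":
        return j >= max_inner
    if strategy == "II":
        return j >= max_inner or residual_estimate <= 1e-10
    if strategy == "III":
        if first_cycle_iters is None:              # first cycle
            return j >= max_inner or residual_estimate <= 1e-6
        return j >= first_cycle_iters              # later cycles reuse that length
    raise ValueError(f"unknown strategy: {strategy!r}")
```

For strategy III the caller records the value of `j` at which the first cycle ended and passes it back as `first_cycle_iters` for every later cycle.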

Page 8: Accelerating GMRES via Mixed Precision

Performance – Scalar Jacobi

• Speedups
  • Median time of 3 runs
  • Error bars: mins and maxes
• Geometric mean of speedup
  • MGS: 14%
  • CGSR: 54%
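The summary statistics named here can be computed as in the sketch below. It assumes that a matrix's speedup is the median double-precision time divided by the median mixed-precision time, and that the reported percentage is the geometric mean minus one. The timings are made-up placeholders, not measurements from the talk, and the function names are hypothetical.

```python
import numpy as np

def median_speedup(double_times, mixed_times):
    """Speedup of mixed over double precision from repeated timings."""
    return float(np.median(double_times) / np.median(mixed_times))

def geometric_mean(values):
    """Geometric mean, the usual average for ratios such as speedups."""
    return float(np.exp(np.mean(np.log(values))))

# Placeholder timings in seconds for two hypothetical matrices, 3 runs each.
speedups = [
    median_speedup([1.0, 1.2, 1.1], [0.5, 0.55, 0.6]),
    median_speedup([2.0, 2.1, 1.9], [2.5, 2.4, 2.6]),
]
percent = 100.0 * (geometric_mean(speedups) - 1.0)
print(speedups, percent)
```

The geometric mean treats a 2x speedup and a 2x slowdown as cancelling out, which an arithmetic mean of ratios would not.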

Page 9: Accelerating GMRES via Mixed Precision

Performance – ILU(0)

• Speedups
  • Median time of 3 runs
  • Error bars: mins and maxes
• Geometric mean of speedup
  • MGS: -7%
  • CGSR: -4%

Page 10: Accelerating GMRES via Mixed Precision

Performance – ILU(0) with Jacobi Solves

• ILU(0) w/ 5 Jacobi iterations for each triangular solve
• Speedups
  • Median time of 3 runs
  • Error bars: mins and maxes
• Geometric mean of speedup
  • MGS: 8%
  • CGSR: 14%
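Replacing a triangular solve with a fixed number of Jacobi iterations trades the sequential substitution for a few matrix-vector products, which parallelize well on a GPU. A minimal dense NumPy sketch follows. It assumes the iteration starts from the diagonally scaled right-hand side, which the slide does not specify, and `jacobi_triangular_solve` and `sweeps` are hypothetical names.

```python
import numpy as np

def jacobi_triangular_solve(T, r, sweeps=5):
    """Approximate y with T y = r for triangular T using Jacobi iterations."""
    d = np.diag(T)                  # diagonal of T
    N = T - np.diag(d)              # strictly triangular remainder
    y = r / d                       # assumed starting guess
    for _ in range(sweeps):
        y = (r - N @ y) / d         # y <- D^{-1} (r - N y)
    return y

rng = np.random.default_rng(2)
T = np.tril(rng.standard_normal((4, 4))) + 4.0 * np.eye(4)
r = rng.standard_normal(4)
y = jacobi_triangular_solve(T, r)
print(np.linalg.norm(T @ y - r))
```

Because the remainder `N` is strictly triangular, the iteration is exact after at most as many sweeps as the matrix has rows; for large factors, 5 sweeps give only an approximation whose quality depends on the matrix.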

Page 11: Accelerating GMRES via Mixed Precision

Future Directions

• Choice of low precision
  • Half, Bfloat16
  • Compression
• Distributed systems
• Other Krylov methods
• Applications

Page 12: Accelerating GMRES via Mixed Precision

Conclusions

• When restarted, mixed-precision GMRES often outperforms double-precision GMRES

Page 13: Accelerating GMRES via Mixed Precision

Extra Slides


Page 14: Accelerating GMRES via Mixed Precision

Test Configuration Details

• CUDA 10.2.199, Kokkos 3.1.01, GCC 7.3.0
• https://bitbucket.org/icl/mixed-precision-gmres
  • tag TPDS-perf

Page 15: Accelerating GMRES via Mixed Precision

Publications

• N. Lindquist, P. Luszczek, and J. Dongarra, “Improving the performance of the GMRES method using mixed-precision techniques,” in Driving Scientific and Engineering Discoveries through the Convergence of HPC, Big Data and AI. DOI: 10.1007/978-3-030-63393-6_4
• [Submitted] N. Lindquist, P. Luszczek, and J. Dongarra, “Accelerating restarted GMRES with mixed precision arithmetic,” in Transactions on Parallel and Distributed Systems.

Page 16: Accelerating GMRES via Mixed Precision

Effect on Convergence: Configuration

• ILU(0) preconditioner ($M^{-1}$)
• CSR matrix format
• Custom mixed-precision kernels w/ Kokkos
• 20-core Haswell node
  • 2x Intel® Xeon® E5-2650 v3 processors

Page 17: Accelerating GMRES via Mixed Precision

Effect on Convergence: Configuration

• airfoil_2d from SuiteSparse collection
  • $n = 14{,}214$
  • $nnz = 259{,}688$
  • $\kappa_2 = 1.8 \cdot 10^{6}$
• Error if GMRES stopped: $\dfrac{\|b - Ax\|_2}{\|A\|_F \|x\|_2 + \|b\|_2}$
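The error measure above is a normwise backward error and is cheap to evaluate for any candidate solution. A small NumPy sketch, assuming a dense matrix for brevity (the talk uses CSR); `backward_error` is a hypothetical name.

```python
import numpy as np

def backward_error(A, x, b):
    """||b - A x||_2 / (||A||_F ||x||_2 + ||b||_2)."""
    residual = np.linalg.norm(b - A @ x)
    scale = np.linalg.norm(A, "fro") * np.linalg.norm(x) + np.linalg.norm(b)
    return float(residual / scale)

A = 2.0 * np.eye(2)
b = np.array([2.0, 2.0])
print(backward_error(A, np.array([1.0, 1.0]), b))   # exact solution
print(backward_error(A, np.zeros(2), b))            # zero vector as the guess
```

Scaling by both the matrix and the right-hand side keeps the measure at most 1 for the zero vector and makes it independent of how the system is scaled overall.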

Page 18: Accelerating GMRES via Mixed Precision

Accuracy results

• Modified Gram-Schmidt Orthogonalization (MGS)
• Classical Gram-Schmidt with Reorthogonalization (CGSR)
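The two orthogonalization schemes compared here differ mainly in how they group the work. A minimal NumPy sketch of both follows; it assumes CGSR always performs a second pass, which may differ from the variant used in the talk, and `mgs` and `cgsr` are hypothetical names.

```python
import numpy as np

def mgs(w, V):
    """Modified Gram-Schmidt: remove one basis vector at a time."""
    w = w.copy()
    h = np.zeros(V.shape[1])
    for i in range(V.shape[1]):
        h[i] = V[:, i] @ w          # one dot product per basis vector
        w -= h[i] * V[:, i]
    return w, h

def cgsr(w, V):
    """Classical Gram-Schmidt with one reorthogonalization pass."""
    h = V.T @ w                     # all dot products at once
    w = w - V @ h
    g = V.T @ w                     # second pass removes leftover components
    w = w - V @ g
    return w, h + g

rng = np.random.default_rng(3)
V, _ = np.linalg.qr(rng.standard_normal((20, 5)))   # orthonormal basis
w0 = rng.standard_normal(20)
w_mgs, h_mgs = mgs(w0, V)
w_cgsr, h_cgsr = cgsr(w0, V)
print(np.linalg.norm(V.T @ w_mgs), np.linalg.norm(V.T @ w_cgsr))
```

MGS issues a chain of dependent dot products, while CGSR does roughly twice the arithmetic in two matrix-vector products, a shape that tends to suit GPUs better.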