Parallel Preconditioned Iterative Methods
Kengo Nakajima
Information Technology Center, The University of Tokyo /
Department of Mathematical Informatics, Graduate School of Information Science and Technology
JSIAM Joint "Applied Mathematics Seminar" of the Three Activity Groups
December 26, 2014
Scientific computations based on the finite-difference method, the finite-element method, and similar discretizations ultimately reduce to solving large systems of linear equations whose coefficient matrices are sparse. Krylov subspace iterative methods are widely used for solving such systems.
Preconditioning, which improves the eigenvalue distribution of the coefficient matrix and accelerates convergence, is essential when solving real applications.
This tutorial covers a wide range of topics, from the basics of preconditioned iterative methods to current research directions aimed at large-scale simulations on the latest supercomputers.
http://nkl.cc.u-tokyo.ac.jp/seminars/JSIAM14w.pdf
2
• Sparse Matrices
• Iterative Linear Solvers
  − Preconditioning
  − Parallel Iterative Linear Solvers
  − Multigrid Method
  − Recent Technical Issues
• Example of Parallel MGCG
• Ill-Conditioned Problems
TOC
3
• Introduction to Parallel Iterative Solvers
There are a lot of topics and issues all of which I cannot cover. I just try to talk about my experiences in the area of scientific applications and parallel numerical algorithms, with some general introductions.
Goal
4
Finite-Element Method (FEM)
[Figure: 3×3 mesh of 9 quadrilateral elements (1–9) and 16 nodes (1–16)]
• A widely used method for solving partial differential equations
  – elements (meshes) & nodes (vertices)
• Consider the following 2D heat conduction problem:
  $\frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2} + Q = 0$
  – 16 nodes, 9 quadrilateral elements
  – uniform thermal conductivity (λ = 1)
  – uniform volumetric heat generation (Q = 1)
  – temperature fixed at node 1: T = 0
  – insulated (adiabatic) on the outer boundary
5
Galerkin FEM procedures
[Figure: the same 16-node, 9-element mesh]
• Apply the Galerkin method to each element. In each element, with [N] the shape (interpolation) functions:
  $\int_V [N]^T \left( \frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2} + Q \right) dV = 0$
• Applying Green's theorem (integration by parts) to the partial differential equation leads to the following "weak form":
  $\int_V \left( \frac{\partial [N]^T}{\partial x}\frac{\partial [N]}{\partial x} + \frac{\partial [N]^T}{\partial y}\frac{\partial [N]}{\partial y} \right)\{T\}\, dV - \int_V Q\,[N]^T\, dV = 0$
6
Element Matrix
• Carrying out the integration of the weak form
  $\int_V \left( \frac{\partial [N]^T}{\partial x}\frac{\partial [N]}{\partial x} + \frac{\partial [N]^T}{\partial y}\frac{\partial [N]}{\partial y} \right)\{T\}\, dV - \int_V Q\,[N]^T\, dV = 0$
  in each element yields the element matrix equation for the four nodes A, B, C, D of element e:
  $[k^{(e)}]\{\phi^{(e)}\} = \{f^{(e)}\}$
  $\begin{pmatrix} k^{(e)}_{AA} & k^{(e)}_{AB} & k^{(e)}_{AC} & k^{(e)}_{AD} \\ k^{(e)}_{BA} & k^{(e)}_{BB} & k^{(e)}_{BC} & k^{(e)}_{BD} \\ k^{(e)}_{CA} & k^{(e)}_{CB} & k^{(e)}_{CC} & k^{(e)}_{CD} \\ k^{(e)}_{DA} & k^{(e)}_{DB} & k^{(e)}_{DC} & k^{(e)}_{DD} \end{pmatrix} \begin{pmatrix} \phi^{(e)}_{A} \\ \phi^{(e)}_{B} \\ \phi^{(e)}_{C} \\ \phi^{(e)}_{D} \end{pmatrix} = \begin{pmatrix} f^{(e)}_{A} \\ f^{(e)}_{B} \\ f^{(e)}_{C} \\ f^{(e)}_{D} \end{pmatrix}$
A brief assembly sketch for the 1D analogue is given below.
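As a concrete illustration of the element-matrix/assembly step, the following is a minimal sketch (not taken from the slides) for the 1D analogue d²T/dx² + Q = 0 with linear elements; the 2D quadrilateral case follows the same pattern with 4×4 element matrices. Names such as NE, h, and Q are illustrative assumptions.

#include <stdio.h>

#define NE 9              /* number of elements (assumption)          */
#define NN (NE + 1)       /* number of nodes                          */

int main(void)
{
  double h = 1.0, Q = 1.0;            /* uniform mesh size, heat source  */
  double K[NN][NN] = {{0.0}};         /* dense global matrix (small demo)*/
  double F[NN]     = {0.0};           /* global load vector              */

  /* element matrix of a linear 1D element: (1/h)*[[1,-1],[-1,1]],
     element load vector: Q*h/2*[1,1] (exact for constant Q)            */
  for (int e = 0; e < NE; e++) {
    int n[2] = {e, e + 1};                    /* global node numbers     */
    double ke[2][2] = {{ 1.0/h, -1.0/h},
                       {-1.0/h,  1.0/h}};
    double fe[2]    = { Q*h/2.0, Q*h/2.0 };

    for (int i = 0; i < 2; i++) {             /* add into global system  */
      F[n[i]] += fe[i];
      for (int j = 0; j < 2; j++) K[n[i]][n[j]] += ke[i][j];
    }
  }

  /* fix T = 0 at node 0 (Dirichlet condition), as on the slide          */
  for (int j = 0; j < NN; j++) { K[0][j] = 0.0; K[j][0] = 0.0; }
  K[0][0] = 1.0;  F[0] = 0.0;

  printf("diagonal of assembled K: ");
  for (int i = 0; i < NN; i++) printf("%5.2f ", K[i][i]);
  printf("\n");
  return 0;
}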
7
Global Matrix: each element matrix is added into the global matrix
[Figure: the 16-node, 9-element mesh and the assembled 16×16 system [K]{Φ} = {F}; D marks the diagonal entries and X the nonzero off-diagonal entries coupling neighboring nodes]
8
Global Matrix (continued): [The same assembly figure as the previous slide — each element matrix is added into the 16×16 global system [K]{Φ} = {F}.]
9
Solve the resulting large system of linear equations after applying appropriate boundary conditions (here Φ = 0 at node 1). The coefficient matrix is "sparse" (mostly zeros).
[Figure: the 16×16 matrix [K] and right-hand side {F}; D marks diagonal entries, X nonzero off-diagonals]
10
Computed results
11
2D FDM Mesh (5-point stencil)12
13
Matrices obtained from the finite-element and finite-difference methods
[Figure: the 16×16 sparse matrix pattern from the previous example]
• Sparse matrix: most entries are zero
• Storing every entry of the square matrix as A(i,j) is inefficient for sparse matrices
  – that layout is intended for "dense" matrices
• FEM: the number of nonzero off-diagonal entries per row is at most of the order of a few hundred
  – e.g., with 10^8 unknowns the required storage (in words) is
    • full square matrix: O(10^16)
    • nonzero off-diagonal entries: O(10^10)
• Storing only the nonzero entries is efficient
14
Application to matrix-vector products: store only the (nonzero) off-diagonal entries — a scheme for sparse matrices
Compressed Row Storage (CRS)
[Figure: the 16×16 sparse matrix pattern]
  Diag(i)   diagonal entries (real, i = 1, N)
  Index(i)  cumulative number of off-diagonal entries up to row i (integer, i = 0, N)
  Item(k)   column numbers of the off-diagonal entries (integer, k = 1, Index(N))
  AMat(k)   off-diagonal entries (real, k = 1, Index(N))

{Y}= [A]{X}

do i= 1, N
  Y(i)= Diag(i)*X(i)
  do k= Index(i-1)+1, Index(i)
    Y(i)= Y(i) + AMat(k)*X(Item(k))
  enddo
enddo
15
Matrix-vector product with a dense matrix ⇒ very simple
  $y_i = \sum_{j=1}^{N} a_{ij} x_j \quad (i = 1, \dots, N)$

{Y}= [A]{X}

do i= 1, N
  Y(i)= 0.d0
  do j= 1, N
    Y(i)= Y(i) + A(i,j)*X(j)
  enddo
enddo
Compressed Row Storage (CRS)
[Figure: an 8×8 example sparse matrix with its nonzero values; rows and columns numbered 1–8]
16
Compressed Row Storage (CRS): in a C program, numbering starts from 0
[Figure: for each row 0–7 of the 8×8 example, the off-diagonal values with their 0-based column numbers shown in circles]
N= 8
Diagonal components:
Diag[0]= 1.1  Diag[1]= 3.6  Diag[2]= 5.7  Diag[3]= 9.8
Diag[4]= 11.5 Diag[5]= 12.4 Diag[6]= 23.1 Diag[7]= 51.3
17
Compressed Row Storage (CRS) in C (continued)
[Figure: the same 8×8 example; the diagonal and off-diagonal entries of each row with their 0-based column numbers]
18
Compressed Row Storage (CRS):C
[Figure: the off-diagonal entries of each row stored consecutively; each entry carries its column number (circled) and its serial index 0–24 in the AMat/Item arrays]
  Diag[i]   diagonal entries (real, [N])
  Index[i]  cumulative number of off-diagonal entries per row (integer, [N+1])
  Item[k]   column numbers of the off-diagonal entries (integer, [Index[N]])
  AMat[k]   off-diagonal entries (real, [Index[N]])

{Y}=[A]{X}

for(i=0; i<N; i++){
  Y[i] = Diag[i] * X[i];
  for(k=Index[i]; k<Index[i+1]; k++){
    Y[i] += AMat[k]*X[Item[k]];
  }
}
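The following is a minimal, self-contained example of this storage scheme and of the mat-vec loop above; the 4×4 matrix and its values are illustrative assumptions, not the 8×8 example from the slides.

#include <stdio.h>

int main(void)
{
  /* 4x4 example (values are arbitrary):
       [ 4 -1  0 -2 ]
       [-1  5 -3  0 ]
       [ 0 -3  6  0 ]
       [-2  0  0  7 ]                                                   */
  int    N        = 4;
  double Diag[4]  = { 4.0, 5.0, 6.0, 7.0 };
  int    Index[5] = { 0, 2, 4, 5, 6 };   /* cumulative off-diagonal counts */
  int    Item[6]  = { 1, 3,  0, 2,  1,  0 };   /* column numbers          */
  double AMat[6]  = {-1.0,-2.0,-1.0,-3.0,-3.0,-2.0};
  double X[4]     = { 1.0, 2.0, 3.0, 4.0 };
  double Y[4];

  /* {Y} = [A]{X}, exactly the loop shown on the slide */
  for (int i = 0; i < N; i++) {
    Y[i] = Diag[i] * X[i];
    for (int k = Index[i]; k < Index[i+1]; k++)
      Y[i] += AMat[k] * X[Item[k]];
  }

  for (int i = 0; i < N; i++) printf("Y[%d]= %6.2f\n", i, Y[i]);
  return 0;
}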
19
• Sparse Matrices• Iterative Linear Solvers
− Preconditioning− Parallel Iterative Linear Solvers− Multigrid Method− Recent Technical Issues
• Example of Parallel MGCG
20
Solving large linear systems in scientific computing
• Many scientific and engineering computations eventually reduce to solving a large linear system Ax = b.
  – important, expensive
• Various methods have been proposed, depending on the application:
  – sparse matrices, dense matrices
  – direct methods, iterative methods
• Dense matrices
  – global interactions: BEM, spectral methods, MO, MD (gas/liquid)
• Sparse matrices
  – local interactions: FEM, FDM, MD (solid), BEM with fast multipole expansion
21
Direct methods
• Gaussian elimination, full LU factorization, etc.
  – transformation of the matrix; computation of something equivalent to the inverse
• Advantages
  – stable, applicable to a wide range of applications
  • pivoting
  – applicable to both sparse and dense matrices
• Drawbacks
  – require more memory and computation time than iterative methods
  • O(N³) operations for dense matrices
  – not suitable for large-scale computation
  • O(N²) storage, O(N³) operations
22
What is an iterative method?
Starting from an appropriate initial guess $x^{(0)}$, generate a sequence $x^{(1)}, x^{(2)}, \dots$ that converges to the true solution by repeated computation.

Linear system and initial guess:
  $A x = b, \quad A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}, \quad x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}, \quad b = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix}, \qquad x^{(0)} = \begin{pmatrix} x_1^{(0)} \\ x_2^{(0)} \\ \vdots \\ x_n^{(0)} \end{pmatrix}$
23
Iterative Methods
• Stationary methods
  – quantities other than the solution vector do not change during the iteration
  – SOR, Gauss-Seidel, Jacobi, etc.
  – generally slow
• Nonstationary methods
  – constraints or optimization conditions are added
  – also called Krylov subspace methods, because mappings onto Krylov subspaces are used as a basis
  – CG (Conjugate Gradient)
  – BiCGSTAB (Bi-Conjugate Gradient Stabilized)
  – GMRES (Generalized Minimal Residual)

For $Ax = b$, a stationary method computes
  $x^{(k+1)} = M x^{(k)} + N b$
24
Iterative Methods (continued)
• Advantages
  – less memory and computation than direct methods
  – well suited to parallel computing
• Drawbacks
  – convergence is strongly affected by the application and the boundary conditions
  • may fail to converge (no answer is obtained)
  – preconditioning is important
25
Nonstationary methods: Krylov subspace methods (1/2)
From $Ax = b$, write $x = (I - A)x + b$ and introduce the iteration that generates $x^{(0)}, x^{(1)}, x^{(2)}, \dots, x^{(k)}$:
  $x^{(k)} = (I - A)x^{(k-1)} + b = x^{(k-1)} + (b - A x^{(k-1)}) = x^{(k-1)} + r^{(k-1)}$
where $r^{(k)} = b - A x^{(k)}$ is the residual vector. Hence
  $x^{(k)} = x^{(0)} + \sum_{i=0}^{k-1} r^{(i)}$
and the residual satisfies
  $r^{(k)} = b - A x^{(k)} = b - A x^{(k-1)} - A r^{(k-1)} = r^{(k-1)} - A r^{(k-1)} = (I - A)\, r^{(k-1)}$
26
Nonstationary methods: Krylov subspace methods (2/2)
$z^{(k)}$ is a vector belonging to the k-th Krylov subspace; the question is how to construct a good approximate solution $x_k$ from that subspace:
  $x^{(k)} = x^{(0)} + \sum_{i=0}^{k-1} r^{(i)} = x^{(0)} + \sum_{i=0}^{k-1} (I - A)^i r^{(0)} = x^{(0)} + z^{(k)}$
  $z^{(k)} = \sum_{i=0}^{k-1} (I - A)^i r^{(0)} \in \operatorname{span}\{ r^{(0)}, A r^{(0)}, A^2 r^{(0)}, \dots, A^{k-1} r^{(0)} \}$
27
A representative nonstationary method: the Conjugate Gradient method
• Conjugate Gradient method, "CG" for short
  – the most representative nonstationary iterative method
• For symmetric positive definite (SPD) matrices
  – $\{x\}^T [A] \{x\} > 0$ for any nonzero vector $\{x\}$
  – equivalent to: all diagonal entries > 0, all eigenvalues > 0, all leading principal minors > 0, i.e.
    $\det \begin{pmatrix} a_{11} & \cdots & a_{1k} \\ \vdots & \ddots & \vdots \\ a_{k1} & \cdots & a_{kk} \end{pmatrix} > 0 \quad (k = 1, \dots, n)$
• Algorithm
  – a variant of the steepest descent method
  – $x^{(i)} = x^{(i-1)} + \alpha_i p^{(i)}$
    • $x^{(i)}$: approximate solution, $p^{(i)}$: search direction, $\alpha_i$: scalar
  – with $y$ the exact solution, find the $\{x\}$ that minimizes $\{x-y\}^T [A] \{x-y\}$
  – see the references for details
    • e.g., M. Mori, "Numerical Analysis (2nd ed.)" (Kyoritsu Shuppan)
28
Conjugate Gradient (CG) algorithm

Compute r(0)= b-[A]x(0)
for i= 1, 2, …
  ρ(i-1)= r(i-1)・r(i-1)
  if i=1
    p(1)= r(0)
  else
    β(i-1)= ρ(i-1)/ρ(i-2)
    p(i)= r(i-1) + β(i-1) p(i-1)
  endif
  q(i)= [A]p(i)
  α(i)= ρ(i-1)/(p(i)・q(i))
  x(i)= x(i-1) + α(i)p(i)
  r(i)= r(i-1) - α(i)q(i)
  check convergence |r|
end

• Matrix-vector products
• Dot products
• Scaled vector addition (DAXPY)
x(i): vector, α(i): scalar
29
(Slides 29–32 repeat the same CG algorithm, highlighting in turn the matrix-vector product q(i)= [A]p(i), the dot products, and the DAXPY operation {y}= a{x} + {y} in double precision; x(i) is a vector and α(i) a scalar.)
33
Derivation of the CG algorithm (1/5)
Let $y$ be the exact solution ($Ay = b$). Since $A$ is symmetric, $(Ax, y) = (x, Ay)$ and
  $(x - y,\, A(x - y)) = (x, Ax) - 2(x, Ay) + (y, Ay) = (x, Ax) - 2(x, b) + \text{const.}$
Therefore finding the $x$ that minimizes $(x - y, A(x - y))$ is equivalent to minimizing
  $f(x) = \tfrac{1}{2}(x, Ax) - (b, x)$
For an arbitrary vector $h$,
  $f(x + h) = f(x) + (h,\, Ax - b) + \tfrac{1}{2}(h, Ah)$
34
Expanding $f(x) = \tfrac{1}{2}(x, Ax) - (b, x)$ for an arbitrary vector $h$:
  $f(x + h) = \tfrac{1}{2}(x + h,\, A(x + h)) - (b,\, x + h)$
  $\quad = \tfrac{1}{2}(x, Ax) + (h, Ax) + \tfrac{1}{2}(h, Ah) - (b, x) - (b, h)$
  $\quad = f(x) + (h,\, Ax - b) + \tfrac{1}{2}(h, Ah)$
35
Derivation of the CG algorithm (2/5)
CG starts from an arbitrary $x^{(0)}$ and successively searches for the minimum of $f(x)$. Suppose the k-th approximation $x^{(k)}$ and search direction $p^{(k)}$ have been determined:
  $x^{(k+1)} = x^{(k)} + \alpha_k p^{(k)}$   (1)
  $f(x^{(k+1)}) = \tfrac{1}{2}\alpha_k^2 (p^{(k)}, Ap^{(k)}) - \alpha_k (p^{(k)},\, b - Ax^{(k)}) + f(x^{(k)})$
To minimize $f(x^{(k+1)})$ with respect to $\alpha_k$, set $\partial f / \partial \alpha_k = 0$:
  $\alpha_k = \dfrac{(p^{(k)}, r^{(k)})}{(p^{(k)}, Ap^{(k)})}$, where $r^{(k)} = b - Ax^{(k)}$ is the residual of the k-th approximation.
36
Derivation of the CG algorithm (3/5)
The residual $r^{(k)}$ can also be updated by the recurrence
  $r^{(k+1)} = r^{(k)} - \alpha_k A p^{(k)}$   (2)
  (since $r^{(k+1)} = b - Ax^{(k+1)} = b - Ax^{(k)} - \alpha_k Ap^{(k)}$)
Ideally the exact solution $y$ would be obtained at step k+1 as $y = x^{(k+1)} + \alpha_{k+1} p^{(k+1)}$, but this is not possible when the solution is unknown.
The search directions are generated by the recurrence
  $p^{(k+1)} = r^{(k+1)} + \beta_k p^{(k)}, \qquad p^{(0)} = r^{(0)}$   (3)
37
Derivation of the CG algorithm (4/5)
There is a convenient orthogonality relation:
  $(Ap^{(k)},\, y - x^{(k+1)}) = 0$
Therefore the following holds:
  $(Ap^{(k)},\, y - x^{(k+1)}) = (Ap^{(k)},\, y - x^{(k)} - \alpha_k p^{(k)}) = (p^{(k)},\, b - Ax^{(k)}) - \alpha_k (p^{(k)}, Ap^{(k)}) = (p^{(k)}, r^{(k)}) - \alpha_k (p^{(k)}, Ap^{(k)}) = 0$
which again gives
  $\alpha_k = \dfrac{(p^{(k)}, r^{(k)})}{(p^{(k)}, Ap^{(k)})}$
38
Derivation of the CG algorithm (5/5)
Require $(p^{(k+1)}, Ap^{(k)}) = 0$, i.e. $p^{(k)}$ and $p^{(k+1)}$ are conjugate with respect to $A$   (4):
  $(p^{(k+1)}, Ap^{(k)}) = (r^{(k+1)} + \beta_k p^{(k)},\, Ap^{(k)}) = (r^{(k+1)}, Ap^{(k)}) + \beta_k (p^{(k)}, Ap^{(k)}) = 0$
  $\Rightarrow \beta_k = -\dfrac{(r^{(k+1)}, Ap^{(k)})}{(p^{(k)}, Ap^{(k)})}$
This yields the algorithm:

Compute p(0)= r(0)= b-[A]x(0)
for i= 1, 2, …
  calc. α(i-1)
  x(i)= x(i-1) + α(i-1)p(i-1)
  r(i)= r(i-1) – α(i-1)[A]p(i-1)
  check convergence |r|
  (if not converged) calc. β(i-1)
  p(i)= r(i) + β(i-1) p(i-1)
end

with
  $\alpha_{i-1} = \dfrac{(p^{(i-1)}, r^{(i-1)})}{(p^{(i-1)}, Ap^{(i-1)})}, \qquad \beta_{i-1} = -\dfrac{(r^{(i)}, Ap^{(i-1)})}{(p^{(i-1)}, Ap^{(i-1)})}$
39
CG algorithm
The following conjugacy holds for any pair (i, j):
  $(p^{(i)}, Ap^{(j)}) = 0 \quad (i \ne j)$
The search directions $p^{(k)}$ and residuals $r^{(k)}$ also satisfy
  $(r^{(i)}, r^{(j)}) = 0 \quad (i \ne j), \qquad (p^{(k)}, r^{(k)}) = (r^{(k)}, r^{(k)})$
In N-dimensional space there exist at most N mutually orthogonal, linearly independent residual vectors $r^{(k)}$; therefore the conjugate gradient method converges within N iterations when there are N unknowns.
⇒ In practice rounding errors matter (when the condition number is large).
Top 10 Algorithms in the 20th Century (SIAM), http://www.siam.org/news/news.php?id=637 :
Monte Carlo method, simplex method, Krylov subspace methods, matrix decomposition methods, optimizing Fortran compiler, QR algorithm, quicksort, FFT, integer relation algorithms, FMM (fast multipole method)
40
In practice $\alpha_k$ and $\beta_k$ can be rewritten in somewhat simpler forms:
  $\beta_k = -\dfrac{(r^{(k+1)}, Ap^{(k)})}{(p^{(k)}, Ap^{(k)})} = \dfrac{(r^{(k+1)}, r^{(k+1)})}{(r^{(k)}, r^{(k)})}$
  $\alpha_k = \dfrac{(p^{(k)}, r^{(k)})}{(p^{(k)}, Ap^{(k)})} = \dfrac{(r^{(k)}, r^{(k)})}{(p^{(k)}, Ap^{(k)})}$
using $(p^{(k)}, b - Ax^{(k)}) = (p^{(k)}, r^{(k)}) = (r^{(k)}, r^{(k)})$.
41
Conjugate Gradient (CG) algorithm

Compute r(0)= b-[A]x(0)
for i= 1, 2, …
  ρ(i-1)= r(i-1)・r(i-1)
  if i=1
    p(1)= r(0)
  else
    β(i-1)= ρ(i-1)/ρ(i-2)
    p(i)= r(i-1) + β(i-1) p(i-1)
  endif
  q(i)= [A]p(i)
  α(i)= ρ(i-1)/(p(i)・q(i))
  x(i)= x(i-1) + α(i)p(i)
  r(i)= r(i-1) - α(i)q(i)
  check convergence |r|
end

x(i): vector, α(i): scalar
  $\alpha_i = \dfrac{(r^{(i-1)}, r^{(i-1)})}{(p^{(i)}, Ap^{(i)})} = \dfrac{\rho_{i-1}}{(p^{(i)}, Ap^{(i)})}, \qquad \beta_{i-1} = \dfrac{(r^{(i-1)}, r^{(i-1)})}{(r^{(i-2)}, r^{(i-2)})} = \dfrac{\rho_{i-1}}{\rho_{i-2}}$
42
Program example (CG method) (1/3)

do i= 1, N
  R(i)= B(i)
  do j= 1, N
    R(i)= R(i) - AMAT(i,j)*X(j)
  enddo
enddo

BNRM2= 0.0D0
do i= 1, N
  BNRM2= BNRM2 + B(i)**2
enddo

AMAT(i,j): entry a_ij of A;  B(i): entries of b;  X(i): entries of x
P(i): entries of p;  Q(i): entries of q;  R(i): entries of r
(The CG algorithm of the previous slide is shown alongside for reference.)
43
Program example (CG method) (2/3)

do iter= 1, ITERmax
  RHO= 0.d0
  do i= 1, N
    RHO= RHO + R(i)*R(i)
  enddo

  if ( iter.eq.1 ) then
    do i= 1, N
      P(i)= R(i)
    enddo
  else
    BETA= RHO / RHO1
    do i= 1, N
      P(i)= R(i) + BETA*P(i)
    enddo
  endif

  do i= 1, N
    Q(i)= 0.d0
    do j= 1, N
      Q(i)= Q(i) + AMAT(i,j)*P(j)
    enddo
  enddo
  ...
enddo
44
Program example (CG method) (3/3)

do iter= 1, ITERmax
  ...
  C1= 0.d0
  do i= 1, N
    C1= C1 + P(i)*Q(i)
  enddo
  ALPHA= RHO / C1

  do i= 1, N
    X(i)= X(i) + ALPHA * P(i)
    R(i)= R(i) - ALPHA * Q(i)
  enddo

  DNRM2 = 0.0
  do i= 1, N
    DNRM2= DNRM2 + R(i)**2
  enddo

  RESID= dsqrt(DNRM2/BNRM2)
  if ( RESID.le.EPS ) exit

  RHO1 = RHO
enddo

Convergence check: RESID = ||b - [A]x(k)||_2 / ||b||_2 ≤ EPS.
RHO1 keeps ρ(i-1) so that it can serve as ρ(i-2) in the next iteration.
45
1D heat conduction equation. Governing equation, with uniform thermal conductivity λ = 1:
  $\dfrac{d}{dx}\!\left(\lambda \dfrac{d\phi}{dx}\right) + BF = 0$
Uniform volumetric heat generation BF; boundary conditions:
  $\phi = 0$ at $x = 0$, and $d\phi/dx = 0$ (adiabatic) at $x = x_{max}$.
The discretization places the unknowns at the element (cell) centers.
46
1D heat conduction equation: analytical solution
  $\phi(x) = -\tfrac{1}{2} BF\, x^2 + BF\, x_{max}\, x$
φ = 0 at X = 0; the adiabatic face is at X = X_max, but the temperature is not computed there (unknowns are at cell centers).
With Δx = 1.0 and 50 meshes, X_max = 49.5 and the last cell center lies at X = 49.0. With BF = 1.0 the temperature there is
  $\phi = 49.5 \times 49 - \tfrac{1}{2} \times 49^2 = 2425.5 - 1200.5 = 1225.0$
47
Example (N=50): Jacobi method
 1000 iters, RESID= 5.443248E-01 PHI(N)= 4.724513E+02
 2000 iters, RESID= 3.255667E-01 PHI(N)= 7.746137E+02
 3000 iters, RESID= 1.947372E-01 PHI(N)= 9.555996E+02
...
34000 iters, RESID= 2.347113E-08 PHI(N)= 1.225000E+03
35000 iters, RESID= 1.403923E-08 PHI(N)= 1.225000E+03
35661 iters, RESID= 9.999053E-09 PHI(N)= 1.225000E+03
1 0.000000E+00 0.000000E+002 4.899999E+01 4.900000E+013 9.699999E+01 9.700000E+014 1.440000E+02 1.440000E+025 1.900000E+02 1.900000E+02...
41 1.180000E+03 1.180000E+0342 1.189000E+03 1.189000E+0343 1.197000E+03 1.197000E+0344 1.204000E+03 1.204000E+0345 1.210000E+03 1.210000E+0346 1.215000E+03 1.215000E+0347 1.219000E+03 1.219000E+0348 1.222000E+03 1.222000E+0349 1.224000E+03 1.224000E+0350 1.225000E+03 1.225000E+03
Iteration count, residual, φ at mesh 50
Numerical solution vs. analytical solution
(analytical value at mesh 50: φ = 49.5 × 49 − ½ × 49² = 1225.0)
48
Example (N=50): Gauss-Seidel method
 1000 iters, RESID= 3.303725E-01 PHI(N)= 7.785284E+02
 2000 iters, RESID= 1.182010E-01 PHI(N)= 1.065259E+03
 3000 iters, RESID= 4.229019E-02 PHI(N)= 1.167848E+03
...
16000 iters, RESID= 6.657001E-08 PHI(N)= 1.225000E+03
17000 iters, RESID= 2.381754E-08 PHI(N)= 1.225000E+03
17845 iters, RESID= 9.993196E-09 PHI(N)= 1.225000E+03
1 0.000000E+00 0.000000E+002 4.899999E+01 4.900000E+013 9.699999E+01 9.700000E+014 1.440000E+02 1.440000E+025 1.900000E+02 1.900000E+02...
41 1.180000E+03 1.180000E+0342 1.189000E+03 1.189000E+0343 1.197000E+03 1.197000E+0344 1.204000E+03 1.204000E+0345 1.210000E+03 1.210000E+0346 1.215000E+03 1.215000E+0347 1.219000E+03 1.219000E+0348 1.222000E+03 1.222000E+0349 1.224000E+03 1.224000E+0350 1.225000E+03 1.225000E+03
Iteration count, residual, φ at mesh 50
Numerical solution vs. analytical solution
49
Example (N=50): CG method
49 iters, RESID= 0.000000E-00 PHI(N)= 1.225000E+03
1 0.000000E+00 0.000000E+002 4.899999E+01 4.900000E+013 9.699999E+01 9.700000E+014 1.440000E+02 1.440000E+025 1.900000E+02 1.900000E+02...
41 1.180000E+03 1.180000E+0342 1.189000E+03 1.189000E+0343 1.197000E+03 1.197000E+0344 1.204000E+03 1.204000E+0345 1.210000E+03 1.210000E+0346 1.215000E+03 1.215000E+0347 1.219000E+03 1.219000E+0348 1.222000E+03 1.222000E+0349 1.224000E+03 1.224000E+0350 1.225000E+03 1.225000E+03
Iteration count, residual, φ at mesh 50; numerical vs. analytical solution
(analytical value at mesh 50: 1225.0)
Note that convergence is reached at the 49th iteration (there are 49 unknowns).
50
Iterative Methods
• Advantages
  – less memory and computation than direct methods
  – well suited to parallel computing
• Drawbacks
  – convergence is strongly affected by the application and the boundary conditions
  • may fail to converge (no answer is obtained)
  – preconditioning is important
  • problems with large condition numbers
51
• Sparse Matrices• Iterative Linear Solvers
− Preconditioning− Parallel Iterative Linear Solvers− Multigrid Method− Recent Technical Issues
• Example of Parallel MGCG• Ill-Conditioned Problems
52
Conjugate Gradient algorithm (with auxiliary vector z)
• Matrix-vector products
• Dot products
• Scaled vector addition (DAXPY)
x(i): vector, α(i): scalar

Compute r(0)= b-[A]x(0)
for i= 1, 2, …
  z(i-1)= r(i-1)
  ρ(i-1)= r(i-1)・z(i-1)
  if i=1
    p(1)= z(0)
  else
    β(i-1)= ρ(i-1)/ρ(i-2)
    p(i)= z(i-1) + β(i-1) p(i-1)
  endif
  q(i)= [A]p(i)
  α(i)= ρ(i-1)/(p(i)・q(i))
  x(i)= x(i-1) + α(i)p(i)
  r(i)= r(i-1) - α(i)q(i)
  check convergence |r|
end
53
What is preconditioning?
• Convergence of an iterative method depends on the eigenvalue distribution of the coefficient matrix.
  – The fewer distinct eigenvalues and the closer they are to 1, the faster the convergence (identity matrix).
  – Condition number (for SPD matrices) = ratio of largest to smallest eigenvalue.
  • The closer the condition number is to 1, the easier the convergence.
• Apply a preconditioning matrix [M] that closely resembles the original coefficient matrix [A] to improve the eigenvalue distribution.
  – [M] transforms the original equation [A]{x}={b} into [A']{x'}={b'}, where [A']=[M]⁻¹[A] and {b'}=[M]⁻¹{b}.
  – The goal is for [A']=[M]⁻¹[A] to be close to the identity matrix.
  – The preconditioner may also be applied from the right, [A']=[A][M]⁻¹.
• Preconditioning is used for both dense and sparse matrices, but usually sparse matrices are the target.
54
Preconditioned Conjugate Gradient Method (PCG)

Compute r(0)= b-[A]x(0)
for i= 1, 2, …
  solve [M]z(i-1)= r(i-1)
  ρ(i-1)= r(i-1)・z(i-1)
  if i=1
    p(1)= z(0)
  else
    β(i-1)= ρ(i-1)/ρ(i-2)
    p(i)= z(i-1) + β(i-1) p(i-1)
  endif
  q(i)= [A]p(i)
  α(i)= ρ(i-1)/(p(i)・q(i))
  x(i)= x(i-1) + α(i)p(i)
  r(i)= r(i-1) - α(i)q(i)
  check convergence |r|
end

What actually has to be computed is z = [M]⁻¹ r.
• The ultimate preconditioner is the true inverse: [M]⁻¹ = [A]⁻¹ ([M] = [A]); this requires computing an "approximate inverse".
• Diagonal scaling: [M]⁻¹ = [D]⁻¹ ([M] = [D]) — simple = weak.
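The following is a compact sketch of the PCG loop above with the simplest choice [M] = diag(A), i.e. the diagonal scaling / point-Jacobi preconditioner of the next slide. The dense-matrix storage and the small test problem are illustrative assumptions.

#include <math.h>
#include <stdio.h>

#define N 4

int main(void)
{
  double A[N][N] = {{ 6,-1, 0,-1},{-1, 6,-1, 0},{ 0,-1, 6,-1},{-1, 0,-1, 6}};
  double b[N] = { 4, 4, 4, 4 };
  double x[N] = { 0, 0, 0, 0 };
  double r[N], z[N], p[N], q[N];
  double rho = 0.0, rho1 = 0.0, bnrm2 = 0.0;

  for (int i = 0; i < N; i++) {                 /* r(0) = b - A x(0) */
    r[i] = b[i];
    for (int j = 0; j < N; j++) r[i] -= A[i][j] * x[j];
    bnrm2 += b[i] * b[i];
  }

  for (int iter = 1; iter <= 100; iter++) {
    for (int i = 0; i < N; i++) z[i] = r[i] / A[i][i];  /* solve [M]z = r, M = diag(A) */

    rho = 0.0;                                          /* rho = (r, z) */
    for (int i = 0; i < N; i++) rho += r[i] * z[i];

    if (iter == 1)
      for (int i = 0; i < N; i++) p[i] = z[i];
    else {
      double beta = rho / rho1;
      for (int i = 0; i < N; i++) p[i] = z[i] + beta * p[i];
    }

    double pq = 0.0;                                    /* q = A p, alpha = rho/(p,q) */
    for (int i = 0; i < N; i++) {
      q[i] = 0.0;
      for (int j = 0; j < N; j++) q[i] += A[i][j] * p[j];
      pq += p[i] * q[i];
    }
    double alpha = rho / pq;

    double rnrm2 = 0.0;
    for (int i = 0; i < N; i++) {
      x[i] += alpha * p[i];
      r[i] -= alpha * q[i];
      rnrm2 += r[i] * r[i];
    }
    if (sqrt(rnrm2 / bnrm2) < 1.0e-12) { printf("converged: iter= %d\n", iter); break; }
    rho1 = rho;
  }

  printf("x= %f %f %f %f\n", x[0], x[1], x[2], x[3]);
  return 0;
}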
55
56
Diagonal scaling / point-Jacobi preconditioning
• Use as the preconditioning matrix [M] the matrix consisting only of the diagonal entries of the original matrix:
  $[M] = \operatorname{diag}(D_1, D_2, \dots, D_N)$
  – diagonal scaling, point-Jacobi preconditioning
• In "solve [M]z(i-1) = r(i-1)" the inverse is then obtained trivially.
• Converges for simple problems.
57
ILU(0), IC(0)
• The most widely used preconditioners (for sparse matrices)
  – Incomplete LU factorization
  – Incomplete Cholesky factorization (for symmetric matrices)
• Incomplete direct methods
  – Even if the original matrix is sparse, its inverse is generally not sparse.
  – fill-in
  – ILU(0) / IC(0) keep the same nonzero pattern as the original matrix (no fill-in).
58
LU factorization: full (complete) LU factorization
• A kind of direct method
  – directly computes (something equivalent to) the inverse
  – since the factors, which act like the inverse, can be stored, computation time is saved when only the right-hand side changes
  – fill-in (nonzero values appearing where the original matrix had zeros) occurs during the factorization
• LU factorization
59
"Incomplete" LU factorization
• ILU factorization
  – Incomplete LU factorization
• Restricts the amount of fill-in and uses the result as a preconditioner
  – an incomplete inverse; a slightly weakened direct method
  – when no fill-in is allowed: ILU(0)
60
Solving linear systems by LU factorization
For an n×n matrix A, expressing A as the product below (or the L and U themselves) is called the LU factorization of A:
  $A = LU$
  $\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix} = \begin{pmatrix} 1 & & & \\ l_{21} & 1 & & \\ \vdots & & \ddots & \\ l_{n1} & l_{n2} & \cdots & 1 \end{pmatrix} \begin{pmatrix} u_{11} & u_{12} & \cdots & u_{1n} \\ & u_{22} & \cdots & u_{2n} \\ & & \ddots & \vdots \\ & & & u_{nn} \end{pmatrix}$
L: lower triangular part of matrix A (unit diagonal)
U: upper triangular part of matrix A
61
Matrix representation of a system of linear equations
General form of a system of n linear equations:
  $a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n = b_1$
  $a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n = b_2$
  $\;\vdots$
  $a_{n1}x_1 + a_{n2}x_2 + \cdots + a_{nn}x_n = b_n$
Matrix representation:
  $A x = b, \quad A = (a_{ij}), \; x = (x_1, \dots, x_n)^T, \; b = (b_1, \dots, b_n)^T$
62
Solving Ax = b using the LU factorization
1. Compute the LU factorization A = LU.
2. Solve Ly = b for y (easy!).
3. Solve Ux = y for x (easy!).
This x is the solution of Ax = b, since
  $Ax = LUx = Ly = b$
63
Solving Ly = b: forward substitution
With L unit lower triangular,
  $y_1 = b_1$
  $l_{21}y_1 + y_2 = b_2 \;\Rightarrow\; y_2 = b_2 - l_{21}y_1$
  $\;\vdots$
  $y_n = b_n - \sum_{i=1}^{n-1} l_{ni}\, y_i$
The unknowns are obtained one after another.
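The same recurrence written as a tiny, runnable C program; the 3×3 factor and right-hand side are illustrative assumptions.

#include <stdio.h>

/* Solve L y = b by forward substitution; L is unit lower triangular. */
int main(void)
{
  enum { N = 3 };
  double L[N][N] = {{1,0,0},{2,1,0},{3,2,1}};   /* illustrative example */
  double b[N]    = {1, 4, 10};
  double y[N];

  for (int i = 0; i < N; i++) {                 /* y(1), y(2), ... one after another */
    y[i] = b[i];
    for (int j = 0; j < i; j++) y[i] -= L[i][j] * y[j];
  }

  printf("y = %g %g %g\n", y[0], y[1], y[2]);   /* expected: 1 2 3 */
  return 0;
}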
64
Solving Ux = y: backward substitution
With U upper triangular,
  $x_n = y_n / u_{nn}$
  $x_{n-1} = (y_{n-1} - u_{n-1,n}\,x_n)/u_{n-1,n-1}$
  $\;\vdots$
  $x_1 = \left(y_1 - \sum_{j=2}^{n} u_{1j}\, x_j\right) / u_{11}$
The unknowns are obtained one after another.
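And the backward substitution as a matching, runnable C fragment; again the 3×3 factor and right-hand side are illustrative assumptions.

#include <stdio.h>

/* Solve U x = y by backward substitution; U is upper triangular. */
int main(void)
{
  enum { N = 3 };
  double U[N][N] = {{2,1,1},{0,3,2},{0,0,4}};   /* illustrative example */
  double y[N]    = {7, 12, 12};
  double x[N];

  for (int i = N - 1; i >= 0; i--) {            /* x(n), x(n-1), ... one after another */
    x[i] = y[i];
    for (int j = i + 1; j < N; j++) x[i] -= U[i][j] * x[j];
    x[i] /= U[i][i];
  }

  printf("x = %g %g %g\n", x[0], x[1], x[2]);   /* expected: 1 2 3 */
  return 0;
}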
65
Computing the LU factors
Comparing the entries of A = LU row by row and column by column:
  Row 1:    $u_{11} = a_{11},\; u_{12} = a_{12},\; \dots,\; u_{1n} = a_{1n}$
  Column 1: $l_{i1} u_{11} = a_{i1} \;\Rightarrow\; l_{i1} = a_{i1}/u_{11} \quad (i = 2, \dots, n)$
  Row 2:    $l_{21} u_{1j} + u_{2j} = a_{2j} \;\Rightarrow\; u_{2j} = a_{2j} - l_{21}u_{1j} \quad (j = 2, \dots, n)$
  Column 2: $l_{i2} = (a_{i2} - l_{i1}u_{12})/u_{22} \quad (i = 3, \dots, n)$
and so on: the defining equations are set up in the order row 1, column 1, row 2, column 2, …
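A short sketch of this factorization in C. It is computed here in the standard in-place, right-looking order, which produces the same L and U as the row/column order on the slide; the 4×4 example matrix is an illustrative assumption, and no pivoting is performed (matching the slides, but not robust for general matrices).

#include <stdio.h>

#define N 4

/* In-place Doolittle LU: after the call, a[][] holds U in the upper part
   and the strictly-lower part of L (the unit diagonal of L is not stored). */
static void lu_factorize(double a[N][N])
{
  for (int k = 0; k < N; k++) {
    /* row k of U is complete here: u_kj = a_kj - sum_{m<k} l_km u_mj       */
    for (int i = k + 1; i < N; i++) {
      a[i][k] /= a[k][k];                       /* l_ik = (a_ik - ...)/u_kk  */
      for (int j = k + 1; j < N; j++)
        a[i][j] -= a[i][k] * a[k][j];           /* update the trailing block */
    }
  }
}

int main(void)
{
  double a[N][N] = {{2, 1, 1, 0},
                    {4, 3, 3, 1},
                    {8, 7, 9, 5},
                    {6, 7, 9, 8}};              /* illustrative example      */
  lu_factorize(a);
  for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) printf("%7.3f ", a[i][j]);
    printf("\n");
  }
  return 0;
}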
66
Numerical example
[4×4 worked example A = LU: row 1 of U, column 1 of L, row 2 of U, and column 2 of L are computed in turn from the formulas of the previous slide.]
数値例(続き)
44
3433
242322
14131211
434241
3231
21
00000
0
1010010001
17407822
107624321
uuuuuuuuuu
lllll
lA
第3行
17,38
343424321431
333323321331
uuululuuulul
第3列 37 43334323421341 uululul
第4行(第4列) 21 4444344324421441 uuululul
1行、1列、2行、2列、・ ・ ・の順に求める式を作っていく.
68
Numerical example (continued)
In the end the 4×4 matrix factors as A = LU, with L unit lower triangular and U upper triangular (values shown on the slide).
69
Example: 5-point finite difference
[Figure: 3×4 grid of 12 unknowns; each node is coupled to its north, south, east, and west neighbors]
70
Example: 5-point finite difference
[Figure: the same 12-node grid, showing the 5-point coupling of an interior node]
71
Example: coefficient matrix
[Figure: the 12-node grid] The resulting system [A]{x} = {b} has the right-hand side {b} and the 12×12 coefficient matrix [A] shown below:
0.00
3.00
10.00
11.00
10.00
19.00
20.00
16.00
28.00
42.00
36.00
52.00
6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
-1.00 6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 -1.00 6.00 0.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00
-1.00 0.00 0.00 6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00
0.00 -1.00 0.00 -1.00 6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00
0.00 0.00 -1.00 0.00 -1.00 6.00 0.00 0.00 -1.00 0.00 0.00 0.00
0.00 0.00 0.00 -1.00 0.00 0.00 6.00 -1.00 0.00 -1.00 0.00 0.00
0.00 0.00 0.00 0.00 -1.00 0.00 -1.00 6.00 -1.00 0.00 -1.00 0.00
0.00 0.00 0.00 0.00 0.00 -1.00 0.00 -1.00 6.00 0.00 0.00 -1.00
0.00 0.00 0.00 0.00 0.00 0.00 -1.00 0.00 0.00 6.00 -1.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 -1.00 0.00 -1.00 6.00 -1.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -1.00 0.00 -1.00 6.00
72
Example: solution
The solution is {x} = (1, 2, 3, …, 12)ᵀ; the matrix [A], solution {x}, and right-hand side {b} are shown below:
6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
-1.00 6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 -1.00 6.00 0.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00
-1.00 0.00 0.00 6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00
0.00 -1.00 0.00 -1.00 6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00
0.00 0.00 -1.00 0.00 -1.00 6.00 0.00 0.00 -1.00 0.00 0.00 0.00
0.00 0.00 0.00 -1.00 0.00 0.00 6.00 -1.00 0.00 -1.00 0.00 0.00
0.00 0.00 0.00 0.00 -1.00 0.00 -1.00 6.00 -1.00 0.00 -1.00 0.00
0.00 0.00 0.00 0.00 0.00 -1.00 0.00 -1.00 6.00 0.00 0.00 -1.00
0.00 0.00 0.00 0.00 0.00 0.00 -1.00 0.00 0.00 6.00 -1.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 -1.00 0.00 -1.00 6.00 -1.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -1.00 0.00 -1.00 6.00
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
9.00
10.00
11.00
12.00
0.00
3.00
10.00
11.00
10.00
19.00
20.00
16.00
28.00
42.00
36.00
52.00
73
Fully LU-factored matrix (lu1.f)
Original matrix
LU-factored matrix: [L] and [U] displayed together; the unit diagonal of [L] is omitted
(fill-in has occurred: entries that were originally zero have become nonzero)
6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
-0.17 5.83 -1.00 -0.17 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 -0.17 5.83 -0.03 -0.17 -1.00 0.00 0.00 0.00 0.00 0.00 0.00
-0.17 -0.03 0.00 5.83 -1.03 0.00 -1.00 0.00 0.00 0.00 0.00 0.00
0.00 -0.17 -0.03 -0.18 5.64 -1.03 -0.18 -1.00 0.00 0.00 0.00 0.00
0.00 0.00 -0.17 0.00 -0.18 5.64 -0.03 -0.18 -1.00 0.00 0.00 0.00
0.00 0.00 0.00 -0.17 -0.03 -0.01 5.82 -1.03 -0.01 -1.00 0.00 0.00
0.00 0.00 0.00 0.00 -0.18 -0.03 -0.18 5.63 -1.03 -0.18 -1.00 0.00
0.00 0.00 0.00 0.00 0.00 -0.18 0.00 -0.18 5.63 -0.03 -0.18 -1.00
0.00 0.00 0.00 0.00 0.00 0.00 -0.17 -0.03 -0.01 5.82 -1.03 -0.01
0.00 0.00 0.00 0.00 0.00 0.00 0.00 -0.18 -0.03 -0.18 5.63 -1.03
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -0.18 0.00 -0.18 5.63
6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
-1.00 6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 -1.00 6.00 0.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00
-1.00 0.00 0.00 6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00
0.00 -1.00 0.00 -1.00 6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00
0.00 0.00 -1.00 0.00 -1.00 6.00 0.00 0.00 -1.00 0.00 0.00 0.00
0.00 0.00 0.00 -1.00 0.00 0.00 6.00 -1.00 0.00 -1.00 0.00 0.00
0.00 0.00 0.00 0.00 -1.00 0.00 -1.00 6.00 -1.00 0.00 -1.00 0.00
0.00 0.00 0.00 0.00 0.00 -1.00 0.00 -1.00 6.00 0.00 0.00 -1.00
0.00 0.00 0.00 0.00 0.00 0.00 -1.00 0.00 0.00 6.00 -1.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 -1.00 0.00 -1.00 6.00 -1.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -1.00 0.00 -1.00 6.00
74
Incompletely LU-factored matrix (no fill-in), lu2.f
Incompletely LU-factored matrix (no fill-in): [L] and [U] displayed together; unit diagonal of [L] omitted
Fully LU-factored matrix: [L] and [U] displayed together; unit diagonal of [L] omitted
(fill-in has occurred: entries that were originally zero have become nonzero)
6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
-0.17 5.83 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 -0.17 5.83 0.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00
-0.17 0.00 0.00 5.83 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00
0.00 -0.17 0.00 -0.17 5.66 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00
0.00 0.00 -0.17 0.00 -0.18 5.65 0.00 0.00 -1.00 0.00 0.00 0.00
0.00 0.00 0.00 -0.17 0.00 0.00 5.83 -1.00 0.00 -1.00 0.00 0.00
0.00 0.00 0.00 0.00 -0.18 0.00 -0.17 5.65 -1.00 0.00 -1.00 0.00
0.00 0.00 0.00 0.00 0.00 -0.18 0.00 -0.18 5.65 0.00 0.00 -1.00
0.00 0.00 0.00 0.00 0.00 0.00 -0.17 0.00 0.00 5.83 -1.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 -0.18 0.00 -0.17 5.65 -1.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -0.18 0.00 -0.18 5.65
6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
-0.17 5.83 -1.00 -0.17 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 -0.17 5.83 -0.03 -0.17 -1.00 0.00 0.00 0.00 0.00 0.00 0.00
-0.17 -0.03 0.00 5.83 -1.03 0.00 -1.00 0.00 0.00 0.00 0.00 0.00
0.00 -0.17 -0.03 -0.18 5.64 -1.03 -0.18 -1.00 0.00 0.00 0.00 0.00
0.00 0.00 -0.17 0.00 -0.18 5.64 -0.03 -0.18 -1.00 0.00 0.00 0.00
0.00 0.00 0.00 -0.17 -0.03 -0.01 5.82 -1.03 -0.01 -1.00 0.00 0.00
0.00 0.00 0.00 0.00 -0.18 -0.03 -0.18 5.63 -1.03 -0.18 -1.00 0.00
0.00 0.00 0.00 0.00 0.00 -0.18 0.00 -0.18 5.63 -0.03 -0.18 -1.00
0.00 0.00 0.00 0.00 0.00 0.00 -0.17 -0.03 -0.01 5.82 -1.03 -0.01
0.00 0.00 0.00 0.00 0.00 0.00 0.00 -0.18 -0.03 -0.18 5.63 -1.03
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -0.18 0.00 -0.18 5.63
75
Comparison of the solutions: slightly different
Incomplete LU factorization (lu2.f)
Full LU factorization (lu1.f)
0.92
1.75
2.76
3.79
4.46
5.57
6.66
7.25
8.46
9.66
10.54
11.83
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
9.00
10.00
11.00
12.00
6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
-0.17 5.83 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 -0.17 5.83 0.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00
-0.17 0.00 0.00 5.83 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00
0.00 -0.17 0.00 -0.17 5.66 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00
0.00 0.00 -0.17 0.00 -0.18 5.65 0.00 0.00 -1.00 0.00 0.00 0.00
0.00 0.00 0.00 -0.17 0.00 0.00 5.83 -1.00 0.00 -1.00 0.00 0.00
0.00 0.00 0.00 0.00 -0.18 0.00 -0.17 5.65 -1.00 0.00 -1.00 0.00
0.00 0.00 0.00 0.00 0.00 -0.18 0.00 -0.18 5.65 0.00 0.00 -1.00
0.00 0.00 0.00 0.00 0.00 0.00 -0.17 0.00 0.00 5.83 -1.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 -0.18 0.00 -0.17 5.65 -1.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -0.18 0.00 -0.18 5.65
6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
-0.17 5.83 -1.00 -0.17 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 -0.17 5.83 -0.03 -0.17 -1.00 0.00 0.00 0.00 0.00 0.00 0.00
-0.17 -0.03 0.00 5.83 -1.03 0.00 -1.00 0.00 0.00 0.00 0.00 0.00
0.00 -0.17 -0.03 -0.18 5.64 -1.03 -0.18 -1.00 0.00 0.00 0.00 0.00
0.00 0.00 -0.17 0.00 -0.18 5.64 -0.03 -0.18 -1.00 0.00 0.00 0.00
0.00 0.00 0.00 -0.17 -0.03 -0.01 5.82 -1.03 -0.01 -1.00 0.00 0.00
0.00 0.00 0.00 0.00 -0.18 -0.03 -0.18 5.63 -1.03 -0.18 -1.00 0.00
0.00 0.00 0.00 0.00 0.00 -0.18 0.00 -0.18 5.63 -0.03 -0.18 -1.00
0.00 0.00 0.00 0.00 0.00 0.00 -0.17 -0.03 -0.01 5.82 -1.03 -0.01
0.00 0.00 0.00 0.00 0.00 0.00 0.00 -0.18 -0.03 -0.18 5.63 -1.03
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -0.18 0.00 -0.18 5.63
76
ILU(0), IC(0) preconditioning
• An "incomplete" factorization that ignores fill-in entirely
  – reduces storage and computation
• Solving with it gives an "incomplete" solution, which is nevertheless not far from the true solution
  – problem dependent
• Sparse Matrices• Iterative Linear Solvers
− Preconditioning− Parallel Iterative Linear Solvers− Multigrid Method− Recent Technical Issues
• Example of Parallel MGCG• Ill-Conditioned Problems
77
78
• Both of convergence (robustness) and efficiency (single/parallel) are important
• Global communications needed
  – Mat-Vec (P2P communications, MPI_Isend/Irecv/Waitall): local data structure with HALO; effect of latency
  – Dot-Products (MPI_Allreduce)
  – Preconditioning (up to algorithm)
• Remedy for Robust Parallel ILU Preconditioner
  – Additive Schwarz Domain Decomposition
  – HID (Hierarchical Interface Decomposition, based on global nested dissection) [Henon & Saad 2007], ext. HID [KN 2010]
• Parallel "Direct" Solvers (e.g. SuperLU, MUMPS etc.)
Parallel Iterative Solvers
Local Data Structures for Parallel FEM/FDM using Krylov Iterative Solvers Example: 2D FDM Mesh (5-point stencil)
79
Example: 2D FDM Mesh (5-point stencil)4-regions/domains
80
Example: 2D FDM Mesh (5-point stencil)4-regions/domains
81
Example: 2D FDM Mesh (5-point stencil)meshes at domain boundary need info. neighboring domains
82
Example: 2D FDM Mesh (5-point stencil)meshes at domain boundary need info. neighboring domains
83
Example: 2D FDM Mesh (5-point stencil)comm. using “HALO (overlapped meshes)”
84
85
Red Lacquered Gate in 64 PEs40,624 elements, 54,659 nodes
k-METISLoad Balance= 1.03
edgecut = 7,563
p-METISLoad Balance= 1.00
edgecut = 7,738
Generalized communication table: send
• Send destinations
  – NeibPETot, NeibPE[neib]
• Message size for each destination
  – export_index[neib], neib= 0, NeibPETot-1
• "Boundary point" numbers
  – export_item[k], k= 0, export_index[NeibPETot]-1
• Message sent to each destination
  – SendBuf[k], k= 0, export_index[NeibPETot]-1
C86
Send (MPI_Isend/Irecv/Waitall)
[Figure: SendBuf is divided into contiguous sections for neib#0 … neib#3 by export_index[0..4]; each section has length BUFlength_e]

for (neib=0; neib<NeibPETot; neib++){
  for (k=export_index[neib]; k<export_index[neib+1]; k++){
    kk= export_item[k];
    SendBuf[k]= VAL[kk];
  }
}

for (neib=0; neib<NeibPETot; neib++){
  tag= 0;
  iS_e= export_index[neib];
  iE_e= export_index[neib+1];
  BUFlength_e= iE_e - iS_e;
  ierr= MPI_Isend (&SendBuf[iS_e], BUFlength_e, MPI_DOUBLE, NeibPE[neib], 0,
                   MPI_COMM_WORLD, &ReqSend[neib]);
}

MPI_Waitall(NeibPETot, ReqSend, StatSend);

Filling the send buffer: the export_item entries from export_index[neib] to export_index[neib+1]-1 are sent to the neib-th neighboring domain.
C87
MPI_Isend
• Sends "count" consecutive items from the send buffer "sendbuf", with tag "tag", to process "dest" in the communicator. The contents of the send buffer must not be modified until MPI_Waitall is called.
• MPI_Isend (sendbuf, count, datatype, dest, tag, comm, request)
  – sendbuf   (any type, IN)   starting address of the send buffer
  – count     (integer, IN)    number of items in the message
  – datatype  (integer, IN)    data type of the message
  – dest      (integer, IN)    rank of the destination process
  – tag       (integer, IN)    message tag, used to distinguish kinds of messages; usually 0 is fine; messages are matched by identical tags
  – comm      (MPI_Comm, IN)   communicator
  – request   (MPI_Request, OUT) communication handle, used by MPI_Waitall
                (array; its size is the number of MPI_Isend calls to be synchronized, usually the number of neighboring processes); C usage described later
C88
Generalized communication table: receive
• Receive sources
  – NeibPETot, NeibPE[neib]
• Message size from each source
  – import_index[neib], neib= 0, NeibPETot-1
• "External (halo) point" numbers
  – import_item[k], k= 0, import_index[NeibPETot]-1
• Message received from each source
  – RecvBuf[k], k= 0, import_index[NeibPETot]-1
C89
Receive (MPI_Isend/Irecv/Waitall)
[Figure: RecvBuf is divided into contiguous sections for neib#0 … neib#3 by import_index[0..4]; each section has length BUFlength_i]

for (neib=0; neib<NeibPETot; neib++){
  tag= 0;
  iS_i= import_index[neib];
  iE_i= import_index[neib+1];
  BUFlength_i= iE_i - iS_i;
  ierr= MPI_Irecv (&RecvBuf[iS_i], BUFlength_i, MPI_DOUBLE, NeibPE[neib], 0,
                   MPI_COMM_WORLD, &ReqRecv[neib]);
}

MPI_Waitall(NeibPETot, ReqRecv, StatRecv);

for (neib=0; neib<NeibPETot; neib++){
  for (k=import_index[neib]; k<import_index[neib+1]; k++){
    kk= import_item[k];
    VAL[kk]= RecvBuf[k];
  }
}

Copying from the receive buffer: the import_item entries from import_index[neib] to import_index[neib+1]-1 are received from the neib-th neighboring domain.
C90
MPI_Irecv
• Receives "count" consecutive items into the receive buffer "recvbuf", with tag "tag", from the specified source process in the communicator. The contents of the receive buffer must not be used until MPI_Waitall has been called.
• MPI_Irecv (recvbuf, count, datatype, source, tag, comm, request)
  – recvbuf   (any type, IN)   starting address of the receive buffer
  – count     (integer, IN)    number of items in the message
  – datatype  (integer, IN)    data type of the message
  – source    (integer, IN)    rank of the sending process
  – tag       (integer, IN)    message tag, used to distinguish kinds of messages; usually 0 is fine; messages are matched by identical tags
  – comm      (MPI_Comm, IN)   communicator
  – request   (MPI_Request, OUT) communication handle, used by MPI_Waitall
                (array; its size is the number of MPI_Irecv calls to be synchronized, usually the number of neighboring processes); C usage described later
C91
MPI_Waitall
• Used to synchronize the process when the one-to-one non-blocking communication routines MPI_Isend and MPI_Irecv are used.
• On the send side, the send buffer must not be modified before MPI_Waitall is called; on the receive side, the receive buffer must not be used before MPI_Waitall is called.
• If consistent, MPI_Isend and MPI_Irecv may be synchronized together by a single MPI_Waitall; in that case use the same array of communication handles for MPI_Isend/Irecv.
• It resembles MPI_Barrier but cannot be replaced by it. Depending on the implementation, "request" and "status" may not be updated correctly, and repeated MPI_Isend/Irecv calls may then become slow.
• MPI_Waitall (count, request, status)
  – count   (integer, IN)       number of MPI_Isend / MPI_Irecv calls to be synchronized
  – request (integer, IN/OUT)   communication handles used by MPI_Isend / MPI_Irecv (array size: count)
  – status  (MPI_Status, OUT)   array of status objects
    MPI_STATUS_SIZE: parameter defined in "mpif.h" / "mpi.h"; C usage described later
C92
References: Libraries (mainly for flat MPI)
• Talk by the Next Speaker (Tony Drummond)
• Trilinos
  – http://trilinos.sandia.gov/
• PETSc– http://www.mcs.anl.gov/petsc/
• GeoFEM– http://geofem.tokyo.rist.or.jp/
• ppOpen-HPC – http://ppopenhpc.cc.u-tokyo.ac.jp/
93
Preconditioning for Iterative Solvers
• A critical issue for both robustness and efficiency
• Libraries (e.g. PETSc, Trilinos, ppOpen-HPC) cover only
general ones (e.g. ILU(p))• Selection of preconditioner strongly depends on:
– numerical property of matrix– features of physics, PDE, boundary conditions, mat. property,
size of FEM mesh etc.• sometimes, problem specific preconditioning needed
• “Parallel” preconditioning is really an exciting research area, important for practical computing.
• All of computational scientists, computer scientists, and mathematicians must work hard for that under intensive collaboration
94
• Sparse Matrices• Iterative Linear Solvers
− Preconditioning− Parallel Iterative Linear Solvers− Multigrid Method− Recent Technical Issues
• Example of Parallel MGCG• Ill-Conditioned Problems
95
Around the multigrid in a single slide• Multigrid is a scalable method for solving linear equations. • Relaxation methods (smoother/smoothing operator in MG
world) such as Gauss-Seidel efficiently damp high-frequency error but do not eliminate low-frequency error.
• The multigrid approach was developed in recognition that this low-frequency error can be accurately and efficiently solved on a coarser grid.
• Multigrid method uniformly damps all frequencies of error components with a computational cost that depends only linearly on the problem size (=scalable).– Good for large-scale computations
• Multigrid is also a good preconditioning algorithm for Kryloviterative solvers.
96
Convergence of Gauss-Seidel & SOR
[Plot: residual vs. iteration number] Rapid convergence (high-frequency error: short wavelength)
97
Convergence of Gauss-Seidel & SOR
[Plot: residual vs. iteration number] Slow convergence (low-frequency error: long wavelength)
98
Around the multigrid in a single slide• Multigrid is a scalable method for solving linear equations. • Relaxation methods (smoother/smoothing operator in MG
world) such as Gauss-Seidel efficiently damp high-frequency error but do not eliminate low-frequency error.
• The multigrid approach was developed in recognition that this low-frequency error can be accurately and efficiently solved on a coarser grid.
• Multigrid method uniformly damps all frequencies of error components with a computational cost that depends only linearly on the problem size (=scalable).– Good for large-scale computations
• Multigrid is also a good preconditioning algorithm for Krylov iterative solvers.
99
Multigrid is scalableWeak Scaling: Problem Size/Core Fixed
for 3D Poisson Eqn’s (q)MGCG= Conjugate Gradient with Multigrid Preconditioning
[Plot: iterations vs. DOF (10^6–10^8) for ICCG and MGCG; the MGCG iteration count stays nearly constant]
100
Multigrid is scalableWeak Scaling: Problem Size/Core Fixed
Comp. time of MGCG for weak scaling is constant: => scalable
[Plot: iterations vs. DOF for ICCG and MGCG at 16, 32, 64, and 128 cores; MGCG iterations remain flat]
101
Procedure of Multigrid (1/3)
Multigrid is a scalable method for solving linear equations. Relaxation methods such as Gauss-Seidel efficiently damp high-frequency error but do not eliminate low-frequency error. The multigrid approach was developed in recognition that this low-frequency error can be accurately and efficiently solved on a coarser grid. This concept is explained here in the following simple 2-level method. If we have obtained the following linear system on a fine grid:

  A_F u_F = f

and A_C as the discrete form of the operator on the coarse grid, a simple coarse grid correction can be given by:

  u_F^(i+1) = u_F^(i) + R^T A_C^(-1) R ( f - A_F u_F^(i) )

where R^T is the matrix representation of linear interpolation from the coarse grid to the fine grid (prolongation operator) and R is called the restriction operator. Thus, it is possible to calculate the residual on the fine grid, solve the coarse grid problem, and interpolate the coarse grid solution on the fine grid.
102
Procedure of Multigrid (2/3)
This process can be described as follows:

1. Relax the equations on the fine grid and obtain the result u_F^(i) = S_F(A_F, f). This operator S_F (e.g., Gauss-Seidel) is called the smoothing operator (or smoother).
2. Calculate the residual term on the fine grid by r_F = f - A_F u_F^(i).
3. Restrict the residual term onto the coarse grid by r_C = R r_F.
4. Solve the equation A_C u_C = r_C on the coarse grid; the accuracy of the solution on the coarse grid affects the convergence of the entire multigrid system.
5. Interpolate (or prolong) the coarse grid correction onto the fine grid by u_C^(i) = R^T u_C.
6. Update the solution on the fine grid by u_F^(i+1) = u_F^(i) + u_C^(i).

A minimal code sketch of these six steps follows.
103
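The following is a minimal 1D sketch of one two-grid correction cycle for -u'' = f, assuming Gauss-Seidel both as the smoother and, with many sweeps, as the "coarse grid solver"; it is illustrative only and not the MGCG solver used in the experiments later in the slides.

#include <stdio.h>

#define NF 15                 /* fine-grid interior points (NF odd)      */
#define NC ((NF - 1) / 2)     /* coarse-grid interior points             */

static void gauss_seidel(int n, double h, double u[], const double f[], int sweeps)
{
  for (int s = 0; s < sweeps; s++)
    for (int i = 0; i < n; i++) {
      double ul = (i > 0)     ? u[i-1] : 0.0;
      double ur = (i < n - 1) ? u[i+1] : 0.0;
      u[i] = 0.5 * (h * h * f[i] + ul + ur);
    }
}

static void residual(int n, double h, const double u[], const double f[], double r[])
{
  for (int i = 0; i < n; i++) {
    double ul = (i > 0)     ? u[i-1] : 0.0;
    double ur = (i < n - 1) ? u[i+1] : 0.0;
    r[i] = f[i] - (2.0 * u[i] - ul - ur) / (h * h);
  }
}

int main(void)
{
  double hF = 1.0 / (NF + 1), hC = 2.0 * hF;
  double uF[NF], fF[NF], rF[NF], rC[NC], vC[NC], vF[NF];

  for (int i = 0; i < NF; i++) { uF[i] = 0.0; fF[i] = 1.0; }   /* f = 1 */

  gauss_seidel(NF, hF, uF, fF, 3);              /* 1. pre-smoothing              */
  residual(NF, hF, uF, fF, rF);                 /* 2. fine-grid residual         */

  for (int i = 0; i < NC; i++)                  /* 3. restriction (full weighting)*/
    rC[i] = 0.25 * (rF[2*i] + 2.0 * rF[2*i+1] + rF[2*i+2]);

  for (int i = 0; i < NC; i++) vC[i] = 0.0;     /* 4. "solve" on the coarse grid */
  gauss_seidel(NC, hC, vC, rC, 200);

  for (int i = 0; i < NF; i++) vF[i] = 0.0;     /* 5. prolongation (linear interp.)*/
  for (int i = 0; i < NC; i++) {
    vF[2*i+1] += vC[i];
    vF[2*i]   += 0.5 * vC[i];
    vF[2*i+2] += 0.5 * vC[i];
  }

  for (int i = 0; i < NF; i++) uF[i] += vF[i];  /* 6. correct the fine-grid solution */
  gauss_seidel(NF, hF, uF, fF, 3);              /*    post-smoothing                 */

  residual(NF, hF, uF, fF, rF);
  double rmax = 0.0;
  for (int i = 0; i < NF; i++) { double a = rF[i] < 0 ? -rF[i] : rF[i]; if (a > rmax) rmax = a; }
  printf("max residual after one two-grid cycle = %e\n", rmax);
  return 0;
}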
[Figure: two-level correction between fine and coarse grids]
  L_k w_k = F_k (linear equation, fine level)
  R_k = F_k - L_k w1_k;  v_k = w_k - w1_k,  L_k v_k = R_k
  R_{k-1} = I_k^{k-1} R_k (restriction)
  L_{k-1} v_{k-1} = R_{k-1} (linear equation, coarse level)
  v_k = I_{k-1}^{k} v_{k-1} (prolongation)
  w2_k = w1_k + v_k
where w1_k is the approximate solution, v_k the correction, I_k^{k-1} the restriction operator, I_{k-1}^{k} the prolongation operator, and w2_k the approximate solution obtained by multigrid.
104
Procedure of Multigrid (3/3)105
• Recursive application of this algorithm for 2-level procedure to consecutive systems of coarse-grid equations gives a multigrid V-cycle. If the components of the V-cycle are defined appropriately, the result is a method that uniformly damps all frequencies of error with a computational cost that depends only linearly on the problem size. − In other words, multigrid algorithms are scalable.
• In the V-cycle, starting with the finest grid, all subsequent coarser grids are visited only once. − In the down-cycle, smoothers damp oscillatory error components at different
grid scales. − In the up-cycle, the smooth error components remaining on each grid level
are corrected using the error approximations on the coarser grids. • Alternatively, in a W-cycle, the coarser grids are solved more
rigorously in order to reduce residuals as much as possible before going back to the more expensive finer grids.
105
[Figure: (a) V-cycle and (b) W-cycle schedules between the fine and coarse levels]
106
Multigrid as a Preconditioner
• Multigrid algorithms tend to be problem-specific solutions and less robust than preconditioned Krylov iterative methods such as the IC/ILU methods.
• Fortunately, it is easy to combine the best features of multigrid and Krylov iterative methods into one algorithm
  − multigrid-preconditioned Krylov iterative methods.
• The resulting algorithm is robust, efficient and scalable.
• Multigrid solvers and Krylov iterative solvers preconditioned by multigrid are intrinsically suitable for parallel computing.
Geometric and Algebraic Multigrid108
• One of the most important issues in multigrid is the construction of the coarse grids.
• There are 2 basic multigrid approaches− geometric and algebraic
• In geometric multigrid, the geometry of the problem is used to define the various multigrid components.
• In contrast, algebraic multigrid methods use only the information available in the linear system of equations, such as matrix connectivity.
• Algebraic multigrid method (AMG) is suitable for applications with unstructured grids.
• Many tools for both geometric and algebraic methods on unstructured grids have been developed.
108
“Dark Side” of Multigrid Method109
• Its performance is excellent for well-conditioned simple problems, such as homogeneous Poisson equations.
• But convergence could be worse for ill-conditioned problems.
• Extension of applicability of multigrid method is an active research area.
109
References• Briggs, W.L., Henson, V.E. and McCormick, S.F. (2000)
A Multigrid Tutorial Second Edition, SIAM
• Trottemberg, U., Oosterlee, C. and Schüller, A. (2001) Multigrid, Academic Press
• https://computation.llnl.gov/casc/• Hypre (AMG Library)
– https://computation.llnl.gov/casc/linear_solvers/sls_hypre.html
110
• Sparse Matrices• Iterative Linear Solvers
− Preconditioning− Parallel Iterative Linear Solvers− Multigrid Method− Recent Technical Issues
• Example of Parallel MGCG• Ill-Conditioned Problems
111
Key-Issues for Appl’s/Algorithms towards Post-Peta & Exa Computing
Jack Dongarra (ORNL/U. Tennessee) at ISC 2013
• Hybrid/Heterogeneous Architecture– Multicore + GPU/Manycores (Intel MIC/Xeon Phi)
• Data Movement, Hierarchy of Memory
• Communication/Synchronization Reducing Algorithms• Mixed Precision Computation• Auto-Tuning/Self-Adapting• Fault Resilient Algorithms• Reproducibility of Results
112
113
• Communication overhead becomes significant• Communication-Computation Overlap
– Not so effective for Mat-Vec operations• Communication Avoiding/Reducing Algorithms
• OpenMP/MPI Hybrid Parallel Programming Model– (Next section)
Recent Technical Issues in Parallel Iterative Solvers
114
Communication overhead becomes larger as the node/core count increases.
Weak Scaling: MGCG on T2K Tokyo
[Bar chart: fraction of time spent in communication vs. computation for 64–8,192 cores; the communication share grows with core count]
Comm.-Comp. Overlapping115
Internal Meshes
External (HALO) Meshes
Comm.-Comp. Overlapping
Internal Meshes / External (HALO) Meshes / Internal Meshes on Boundary
Mat-Vec operations:
• Overlap the computation on purely internal meshes with the import of external (halo) meshes.
• Then compute the internal meshes on the boundary.
• Difficult for IC/ILU on Hybrid.
Communication Avoiding/Reducing Algorithms for Sparse Linear Solvers
• Krylov Iterative Method without Preconditioning– Demmel, Hoemmen, Mohiyuddin etc. (UC Berkeley)
• s-step method– Just one P2P communication for each Mat-Vec during s
iterations. Convergence becomes unstable for large s.– matrix powers kernel: Ax, A2x, A3x ...
• additional computations needed
• Communication Avoiding ILU0 (CA-ILU0) [Moufawad & Grigori, 2013]– First attempt to CA preconditioning– Nested dissection reordering for limited geometries (2D FDM)
117
Pipelined CG [Ghysels et al. 2013]
[Plot: time per MPI_Allreduce vs. number of MPI processes (100–100,000) for Flat MPI, HB 4x4, HB 8x2, HB 16x1 — overhead by MPI_Allreduce for the MGCG case]
• Overhead by global collective comm. (e.g. MPI_Allreduce)• Change original Krylov solver so that comm. overhead by
global coll. comm. are hidden by overlapping with other computations (Gropp’s asynch. CG, s-step, pipelined ...)
• “MPI_Iallreduce” in MPI-3 specification
118
Comm. Avoiding Krylov Iterative Methods using “Matrix Powers Kernel”
119
Avoiding Communication in Sparse Matrix Computations. James Demmel, Mark Hoemmen, Marghoob Mohiyuddin, and Katherine Yelick. , 2008 IPDPS
Required Information of Local Meshes for s-step CA computations (2D 5pt.)
120
s=1(original)
s=2 s=3
• Sparse Matrices• Iterative Linear Solvers
− Preconditioning− Parallel Iterative Linear Solvers− Multigrid Method− Recent Technical Issues
• Example of Parallel MGCG• Ill-Conditioned Problems
121
Nakajima, K., Optimization of Serial and Parallel Communications for Parallel Geometric Multigrid Method, Proceedings of the 20th IEEE International Conference for Parallel and Distributed Systems (ICPADS 2014) (Winner of Best Paper Award), Hsin-Chu, Taiwan, 2014
Reference122
• Optimization of Parallel MGCG– Conjugate Gradient Solver with Multigrid Preconditioning– OpenMP/MPI Hybrid Parallel Programming Model– Efficiency & Convergence
• Parallel Multigrid– “Coarse Grid Solver” is important
Efficiency & Convergence− HPCG (High-Performance Conjugate Gradients)
MGCG by Geometric Multigrid
• Communications are expensive– Serial Communications
Data Transfer through Hierarchical Memory: Sparse Matrix Operations– Parallel Communications
Message Passing through Network
Motivation123
• 3D Groundwater Flow via Heterogeneous Porous Media
  − Poisson's equation: ∇·(λ(x,y,z)∇φ) + q = 0
  − Randomly distributed water conductivity
  − Finite-Volume Method on Cubic Voxel Mesh
  − λ = 10^-5 to 10^+5, Average: 1.00
  – MGCG Solver (Geometric)
Parallel MG Solvers: pGW3D-FVM
124
125
• Preconditioned CG Method– Multigrid Preconditioning (MGCG)– IC(0) for Smoothing Operator (Smoother): good for ill-
conditioned problems• Parallel Geometric Multigrid Method
– 8 fine meshes (children) form 1 coarse mesh (parent) in isotropic manner (octree)
– V-cycle– Domain-Decomposition-based: Localized Block-Jacobi,
Overlapped Additive Schwarz Domain Decomposition (ASDD)
Linear Solvers
Fujitsu PRIMEHPC FX10 (Oakleaf-FX)at the U. Tokyo
• SPARC64 Ixfx (4,800 nodes, 76,800 cores)• Commercial version of K computerx• Peak: 1.13 PFLOPS (1.043 PF, 26th, 41th TOP 500 in 2013 June.)• Memory BWTH 398 TB/sec.
126
• 3D Groundwater Flow via Heterogeneous Porous Media− Poisson’s equation− Randomly distributed water conductivity− Finite‐Volume Method on Cubic Voxel Mesh− =10‐5~10+5, Average: 1.00– MGCG Solver (Geometric)
Parallel MG Solvers: pGW3D‐FVM127
∇·(λ(x,y,z)∇φ) + q = 0
Computations on Fujitsu FX10• Fujitsu PRIMEHPC FX10 at U.Tokyo (Oakleaf-FX)
– Commercial version of K – 16 cores/node, flat/uniform access to memory– 4,800 nodes 1.043 PF (48th, TOP 500, 2014 Nov.)
128
• Up to 4,096 nodes (65,536 cores) (Large-Scale HPC Challenge) – Max 17,179,869,184 unknowns– Flat MPI, HB 4x4, HB 8x2, HB 16x1
• Weak Scaling• Strong Scaling
– 1283×8= 16,777,216 unknowns, from 8 to 4,096 nodes
• Network Topology is not specified– 1D
[Figure: one compute node with 16 cores, each with its own L1 cache, sharing an L2 cache and memory]
129
HB M x N
[Figure: the same 16-core node]
M = number of OpenMP threads per MPI process, N = number of MPI processes per node.
130
HB 8 x 2
8 OpenMP threads per MPI process, 2 MPI processes per node.
[Figure: the 16-core node split into two MPI processes of 8 threads each]
Flat MPI vs. Hybrid
Hybrid: hierarchical structure — the cores of a node share memory, with one (or a few) MPI processes per node and OpenMP threads inside.
Flat-MPI: each core is an independent MPI process with its own memory image.
[Figure: cores and memory arranged per node for the two programming models]
131
• Krylov Iterative Solvers– Dot Products– SMVP– DAXPY– Preconditioning
• IC/ILU Factorization, Forward/Backward Substitution– Global Data Dependency– Reordering needed for parallelism ([KN 2003] on the Earth
Simulator, KN@CMCIM-2002)– Multicoloring, RCM, CM-RCM
Reordering for extracting parallelismin each domain (= MPI Process)
132
Parallelization of ICCG

IC factorization:
do i= 1, N
  VAL= D(i)
  do k= indexL(i-1)+1, indexL(i)
    VAL= VAL - (AL(k)**2) * W(itemL(k),DD)
  enddo
  W(i,DD)= 1.d0/VAL
enddo

Forward substitution:
do i= 1, N
  WVAL= W(i,Z)
  do k= indexL(i-1)+1, indexL(i)
    WVAL= WVAL - AL(k) * W(itemL(k),Z)
  enddo
  W(i,Z)= WVAL * W(i,DD)
enddo
(Global) Data Dependency: Writing/reading may occur simultaneously, hard to parallelize
(The same IC factorization and forward substitution loops as above.)
134
OpenMP for SpMV: straightforward — NO data dependency

!$omp parallel do private(ip,i,VAL,k)
do ip= 1, PEsmpTOT
  do i= INDEX(ip-1)+1, INDEX(ip)
    VAL= D(i)*W(i,P)
    do k= indexL(i-1)+1, indexL(i)
      VAL= VAL + AL(k)*W(itemL(k),P)
    enddo
    do k= indexU(i-1)+1, indexU(i)
      VAL= VAL + AU(k)*W(itemU(k),P)
    enddo
    W(i,Q)= VAL
  enddo
enddo
135
Ordering MethodsElements in “same color” are independent: to be parallelized
Talk by Y.Saad’s group in SIAM PP14
[Figures: an 8×8 grid numbered by three orderings —
  RCM (Reverse Cuthill-McKee),
  MC (Multicoloring, 4 colors),
  CM-RCM (Cyclic MC + RCM, 4 colors)]
• MC: good parallel efficiency with a smaller number of colors, bad convergence; better convergence with many colors, but synchronization overhead.
• RCM: good convergence, poor parallel efficiency, synchronization overhead.
• CM-RCM: reasonable convergence & efficiency.
A minimal coloring sketch follows.
136
• 3D Groundwater Flow via Heterogeneous Porous Media
  − Poisson's equation: ∇·(λ(x,y,z)∇φ) + q = 0
  − Randomly distributed water conductivity
  − Finite-Volume Method on Cubic Voxel Mesh
  − λ = 10^-5 to 10^+5, Average: 1.00
  – MGCG Solver
Parallel MG Solvers: pGW3D-FVM
• Storage format of coefficient matrices (serial comm.)
  – CRS (Compressed Row Storage)
  – ELL (Ellpack-Itpack)
• Comm./Synch. Reducing MG (parallel comm.)
  – Coarse Grid Aggregation (CGA)
  – Hierarchical CGA: Communication Reducing CGA
ELL: Fixed Loop-length, Nice for Pre-fetching
[Figure: a small sparse matrix stored in (a) CRS with variable row lengths and in (b) ELL with a fixed row length padded by zeros]
(a) CRS (b) ELL
138
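The following is a minimal sketch of an ELL-style SpMV in C: every row keeps a fixed number of off-diagonal slots, and short rows are padded with zero coefficients so the inner loop has a fixed length (which is what makes prefetching and SIMD easy). The small matrix, NCOL, and array names are illustrative assumptions, not the pGW3D-FVM data structure.

#include <stdio.h>

#define N    4
#define NCOL 3    /* fixed number of off-diagonal slots per row */

int main(void)
{
  double diag[N]       = { 4.0, 5.0, 6.0, 7.0 };
  /* item[i][j], amat[i][j]: j-th off-diagonal slot of row i; padded slots
     carry a zero value, so their column index (0) is harmless.            */
  int    item[N][NCOL] = {{1,3,0},{0,2,0},{1,0,0},{0,0,0}};
  double amat[N][NCOL] = {{-1.0,-2.0,0.0},{-1.0,-3.0,0.0},{-3.0,0.0,0.0},{-2.0,0.0,0.0}};
  double x[N] = {1.0, 2.0, 3.0, 4.0}, y[N];

  for (int i = 0; i < N; i++) {
    double val = diag[i] * x[i];
    for (int j = 0; j < NCOL; j++)        /* fixed loop length */
      val += amat[i][j] * x[item[i][j]];
    y[i] = val;
  }

  for (int i = 0; i < N; i++) printf("y[%d]= %6.2f\n", i, y[i]);
  return 0;
}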
Special Treatment for “Boundary” Meshesconnected to “Halo”
• Distribution of Lower/Upper Non-Zero Off-Diagonal Components
• If we adopt RCM (or CM) reordering ...
• Pure Internal Meshes– L: ~3, U: ~3
• Boundary Meshes– L: ~3, U: ~6
External MeshesInternal Meshes on Boundary
Pure Internal Meshes
x
yz
Pure Internal Meshes
Internal Meshes on Boundary
● Internal (lower)
● Internal (upper)
● External (upper)
139
Original ELL: Backward Substitution
Cache is not well utilized: IAUnew(6,N), AUnew(6,N) for all cells

do icol= NHYP(lev), 1, -1
  if (mod(icol,2).eq.1) then
!$omp parallel do private (ip,icel,j,SW)
    do ip= 1, PEsmpTOT
      do icel= SMPindex(icol-1,ip,lev)+1, SMPindex(icol,ip,lev)
        SW= 0.0d0
        do j= 1, 6
          SW= SW + AUnew(j,icel)*Rmg(IAUnew(j,icel))
        enddo
        Rmg(icel)= Rmg(icel) - SW*DDmg(icel)
      enddo
    enddo
  else
!$omp parallel do private (ip,icel,j,SW)
    do ip= 1, PEsmpTOT
      do icel= SMPindex(icol-1,ip,lev)+1, SMPindex(icol,ip,lev)
        SW= 0.0d0
        do j= 1, 3
          SW= SW + AUnew(j,icel)*Rmg(IAUnew(j,icel))
        enddo
        Rmg(icel)= Rmg(icel) - SW*DDmg(icel)
      enddo
    enddo
  endif
enddo

IAUnew(6,N), AUnew(6,N); the j=1,6 branch handles boundary cells, the j=1,3 branch pure internal cells.
140
Original ELL: Backward Subst.Cache is not well-utilized: IAUnew(6,N), Aunew(6,N)
Pure Internal CellsAUnew(6,N)
Boundary CellsAUnew(6,N)
141
Improved ELL: Backward Subst.Cache is well-utilized, separated: AUnew3/AUnew6Sliced ELL [Monakov et al. 2010] (for SpMV/GPU)
Pure Internal CellsAUnew3(3,N)
Boundary CellsAUnew6(6,N)
142
Improved ELL: Backward Substitution
Cache is well utilized; arrays separated into AUnew3/AUnew6

do icol= NHYP(lev), 1, -1
  if (mod(icol,2).eq.1) then
!$omp parallel do private (ip,icel,j,SW)
    do ip= 1, PEsmpTOT
      do icel= SMPindex(icol-1,ip,lev)+1, SMPindex(icol,ip,lev)
        SW= 0.0d0
        do j= 1, 6
          SW= SW + AUnew6(j,icel)*Rmg(IAUnew6(j,icel))
        enddo
        Rmg(icel)= Rmg(icel) - SW*DDmg(icel)
      enddo
    enddo
  else
!$omp parallel do private (ip,icel,j,SW)
    do ip= 1, PEsmpTOT
      do icel= SMPindex(icol-1,ip,lev)+1, SMPindex(icol,ip,lev)
        SW= 0.0d0
        do j= 1, 3
          SW= SW + AUnew3(j,icel)*Rmg(IAUnew3(j,icel))
        enddo
        Rmg(icel)= Rmg(icel) - SW*DDmg(icel)
      enddo
    enddo
  endif
enddo

IAUnew3(3,N), AUnew3(3,N) for pure internal cells; IAUnew6(6,N), AUnew6(6,N) for boundary cells.
143
Improved ELL: Backward Subst.Cache is well-utilized, separated: AUnew3/AUnew6
144
(The same loops as on the previous slide, with only the even/odd color branches written in the opposite order: IAUnew3(3,N)/AUnew3(3,N) for pure internal cells, IAUnew6(6,N)/AUnew6(6,N) for boundary cells.)
Analyses by the detailed profiler of Fujitsu FX10, single node, Flat MPI, RCM (multigrid part), 64³ cells/core, 1 node
145

              Instruction   L1D miss    L2 miss     SIMD op. ratio   GFLOPS
Original CRS  1.53×10^9     2.32×10^7   1.67×10^7   30.14%           6.05
Original ELL  4.91×10^8     1.67×10^7   1.27×10^7   93.88%           6.99
Improved ELL  4.91×10^8     1.67×10^7   9.14×10^6   93.88%           8.56
Original Approach (restriction)Coarse grid solver at a single core [KN 2010]
146
Level=1
Level=2
Level=m-3
Level=m-2
Level=m-1
Level=mMesh # foreach MPI= 1
Fine
Coarse Coarse grid solver on a single core (further multigrid)
Original Approach (restriction)Coarse grid solver at a single core [KN 2010]
147
Level=1
Level=2
Level=m-3
Level=m-2
Level=m-1
Level=mMesh # foreach MPI= 1
Fine
Coarse Coarse grid solver on a single core (further multigrid)
Communication Overheadat Coarser Levels
Coarse Grid Aggregation (CGA)Coarse Grid Solver is multithreaded [KN 2012]
148
Level=1
Level=2
Level=m-3
Fine
Coarse
Coarse grid solver on a single MPI process (multi-threaded, further multigrid)
• Communication overhead could be reduced
• Coarse grid solver is more expensive than original approach.
• If process number is larger, this effect might be significant
Level=m-2
ResultsCASE Matrix Coarse Grid
C0 CRS Single Core
C1 ELL (original) Single Core
C2 ELL (original) CGA
C3 ELL (new) CGA
C4 ELL (new) hCGA
Class Size
Weak Scaling 643 cells/core 262,144
Strong Scaling 2563 cells 16,777,216
149
150
Results at 4,096 nodes (1.72x1010 DOF)(Fujitsu FX10: Oakleaf‐FX): HB 8x2
lev: switching level to “coarse grid solver”, Opt. Level= 7
■ Parallel■ Serial/Redundant
Fine
Coarse
0.0
5.0
10.0
15.0
20.0
ELL-CGA,lev=6: 51
ELL-CGA,lev=7: 55
ELL-CGA,lev=8: 60
ELL: 65,(NO CGA)
CRS: 66,(NO CGA)
sec.
RestCoarse Grid SolverMPI_AllgatherMPI_Isend/Irecv/Allreduce
C1C2 C0C2 C2
Matrix Coarse Grid
C0 CRS Single Core
C1 ELL (org) Single Core
C2 ELL (org) CGA
C3 ELL (sliced) CGA
151
Weak Scaling: ~4,096 nodesup to 17,179,869,184 meshes (643 meshes/core)
DOWN is GOOD
0.00
5.00
10.00
15.00
20.00
100 1000 10000 100000
sec.
CORE#
HB 8x2:C0HB 8x2:C1HB 8x2:C2HB 8x2:C3
5.0
7.5
10.0
12.5
15.0
100 1000 10000 100000
sec.
CORE#
Flat MPI:C3HB 4x4:C3HB 8x2:C3HB 16x1:C3
Matrix Coarse Grid
C0 CRS Single Core
C1 ELL (org) Single Core
C2 ELL (org) CGA
C3 ELL (sliced) CGA
152
Weak Scaling: C3Results at 4,096 nodes (1.72x1010 DOF)
0.0
5.0
10.0
15.0
Flat MPI:C3:64
HB 4x4:C3:59
HB 8x2:C3:55
HB 16x1:C3:55
sec.
RestCoarse Grid SolverMPI_AllgatherMPI_Isend/Irecv/Allreduce
153
Weak Scaling: C2 (with CGA)Time for Coarse Grid Solver
Efficiency of coarse grid solver for HB 16x1 is x256 of that of flat MPI (1/16 problem size, x16 resource for coarse grid solver)
0.00
1.00
2.00
3.00
4.00
1024 2048 4096 8192 16384 32768 49152 65536
sec.
CORE#
Flat MPI HB 4x4HB 8x2 HB 16x1
Summary so far ...• “Coarse Grid Aggregation (CGA)” is effective for
stabilization of convergence at O(104) cores for MGCG– Smaller number of parallel domains– HB 8x2 is the best at 4,096 nodes– Flat MPI, HB 4x4
• Coarse grid solvers are more expensive, because their number of MPI processes are more than those of HB 8x2 and HB 16x1.
• ELL format is effective !– C0 (CRS) -> C1 (ELL-org.): +20-30%– C2 (ELL-org)-> C3(ELL-new): +20-30%– C0 -> C3: +80-90%
• Coarse Grid Solver – (May be) very expensive for cases with more than O(105) cores – Memory of a single node is not enough– Multiple nodes should be utilized for coarse grid solver 154
Matrix Coarse Grid
C0 CRS Single Core
C1 ELL (org) Single Core
C2 ELL (org) CGA
C3 ELL (sliced) CGA
Hierarchical CGA: Comm. Reducing MGReduced number of MPI processes[KN 2013]
155
Level=1
Level=2
Level=m-3
Level=m-3
Fine
Coarse
Level=m-2
Coarse grid solver on a single MPI process (multi-threaded, further multigrid)
hCGA: Related Work• Not a new idea, but very few implementations.
– Not effective for peta-scale systems (Dr. U.M.Yang (LLNL), developer of Hypre)
• Existing Works: Repartitioning at Coarse Levels– Lin, P.T., Improving multigrid performance for unstructured mesh
drift-diffusion simulations on 147,000 cores, International Journal for Numerical Methods in Engineering 91 (2012) 971-989 (Sandia)
– Sundar, H. et al, Parallel Geometric-Algebraic Multigrid on Unstructured Forests of Octrees, ACM/IEEE Proceedings of the 2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC12) (2012) (UT Austin)
– Flat MPI, Repartitioning if DOF < O(103) on each process
156
hCGA in the present work• Accelerate the coarser grid solver
– using multiple processes instead of a single process in CGA– Only 64 cells on each process of lev=6 in the figure
• Straightforward Approach– MPI_Comm_split, MPI_Gather, MPI_Bcast etc.
157
[Figure (same chart as shown earlier): elapsed-time breakdown (sec., 0-20) at 4,096 nodes into Rest / Coarse Grid Solver / MPI_Allgather / MPI_Isend/Irecv/Allreduce, for ELL-CGA with lev=6/7/8 (51/55/60 iterations) and ELL/CRS without CGA (65/66 iterations)]
158
Weak Scaling: up to 4,096 nodes, 17,179,869,184 meshes (64^3 meshes/core)
DOWN is GOOD
CASE legend: C0 = CRS / Single Core, C1 = ELL (org) / Single Core, C2 = ELL (org) / CGA, C3 = ELL (sliced) / CGA, C4 = ELL (sliced) / hCGA
[Figure: weak scaling, elapsed time (sec., 5-15) vs. core count (10^2-10^5) for Flat MPI:C3, Flat MPI:C4, HB 4x4:C4, HB 8x2:C3, HB 16x1:C3]
[Figure: elapsed time (sec., 0-15) for Flat MPI / HB 4x4 / HB 8x2 / HB 16x1 with C3 and C4 at 512 and 4,096 nodes; the x1.61 improvement is annotated for Flat MPI at 4,096 nodes]
Strong Scaling at 4,096 nodes: 268,435,456 meshes (16^3 meshes/core at 4,096 nodes)
Parallel performance is normalized by Flat MPI/ELL (C3) at 8 nodes (128 cores) = 100%
159
CASE legend: C0 = CRS / Single Core, C1 = ELL (org) / Single Core, C2 = ELL (org) / CGA, C3 = ELL (sliced) / CGA, C4 = ELL (sliced) / hCGA
[Figure: parallel performance (%, 0-120) vs. core count (1,024-65,536) for Flat MPI:C3, Flat MPI:C4, HB 8x2:C3, HB 8x2:C4; the x6.27 improvement is annotated for Flat MPI with hCGA]
Summary
• hCGA is effective, but not so significant (except for flat MPI)
  – Flat MPI: x1.61 for weak scaling, x6.27 for strong scaling at 4,096 nodes of Fujitsu FX10
  – hCGA will be effective for HB 16x1 with more than 2.50x10^5 nodes (= 4.00x10^6 cores) of FX10 (= 60 PFLOPS)
  – The effect of the coarse grid solver is significant for Flat MPI with >10^3 nodes; its communication overhead has been reduced by hCGA
• Future/On-Going Works and Open Problems
  – Improvement of hCGA
    • overhead by MPI_Allreduce etc. -> P2P communication
  – Algorithms
    • CA-Multigrid (for coarser levels), CA-SPAI, pipelined methods
  – Strategy for automatic selection
    • switching level, number of processes for hCGA, optimum color #
    • effects on convergence
  – More flexible ELL for unstructured grids
  – Xeon Phi clusters
    • Hybrid 240(T)x1(P) is not the only choice
160
• Sparse Matrices
• Iterative Linear Solvers
  − Preconditioning
  − Parallel Iterative Linear Solvers
  − Multigrid Method
  − Recent Technical Issues
• Example of Parallel MGCG
• Ill-Conditioned Problems
161
Large-scale Simulations by Parallel FEM Procedures
• Unstructured grids with irregular data structures
• Large-scale sparse matrices
• Preconditioned parallel iterative solvers
• "Real-world" ill-conditioned problems
162
What are ill-conditioned problems?
• There are various kinds of ill-conditioned problems
  – For example, matrices derived from coupled Navier-Stokes equations are ill-conditioned even if the meshes are uniform.
• We have been focusing on 3D solid mechanics applications with:
  – heterogeneity
  – contact boundary conditions
  – BILU/BIC preconditioning
• The ideas can be extended to other fields.
163
Ill-Conditioned Problems: Heterogeneous Fields, Distorted Meshes
164
Contact Problems in Simulations of Earthquake Generation Cycle
165
Preconditioning Methods (of Krylov Iterative Solvers) for Real-World Applications
• are the most critical issues in scientific computing
• are based on
  – global information: condition number, matrix properties, etc.
  – local information: properties of elements (shape, size, ...)
• require knowledge of
  – background physics
  – applications
166
Technical Issues of "Parallel" Preconditioners in FEM
• Block Jacobi-type localized preconditioners
• Simple problems can easily converge with simple preconditioners, with excellent parallel efficiency.
• Difficult (ill-conditioned) problems cannot easily converge
  – The effect of domain decomposition on convergence is significant, especially for ill-conditioned problems.
    • Block Jacobi-type localized preconditioners
    • More domains, more iterations
  – There are some remedies (e.g. deep fill-ins, deep overlapping), but they are not efficient.
  – ASDD does not work well for really ill-conditioned problems.
167
168
Preconditioned Iterative Solver, e.g. CG method (Conjugate Gradient)
Key operations: dot products, matrix-vector multiplication, preconditioning, DAXPY

Compute r(0)= b-[A]x(0)
for i= 1, 2, ...
  solve [M]z(i-1)= r(i-1)
  rho(i-1)= r(i-1)^T z(i-1)
  if i=1
    p(1)= z(0)
  else
    beta(i-1)= rho(i-1)/rho(i-2)
    p(i)= z(i-1) + beta(i-1) p(i-1)
  endif
  q(i)= [A]p(i)
  alpha(i)= rho(i-1)/(p(i)^T q(i))
  x(i)= x(i-1) + alpha(i) p(i)
  r(i)= r(i-1) - alpha(i) q(i)
  check convergence |r|
end
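The same loop, written as a minimal serial Fortran sketch with a dense matrix and a diagonal [M] (point Jacobi) standing in for the ILU/IC preconditioners discussed next; this is illustrative only, not the solver used for the results in this tutorial.

!C  Minimal PCG sketch (assumption): dense [A], diagonal [M]= D as a
!C  stand-in preconditioner; illustrative only.
subroutine pcg_diag (N, A, D, b, x, eps, itmax)
  implicit none
  integer, intent(in)    :: N, itmax
  real(8), intent(in)    :: A(N,N), D(N), b(N), eps
  real(8), intent(inout) :: x(N)
  real(8) :: r(N), z(N), p(N), q(N)
  real(8) :: rho, rho0, alpha, beta, bnrm
  integer :: iter

  r   = b - matmul(A, x)                          ! r(0)= b - [A]x(0)
  bnrm= sqrt(dot_product(b, b))
  rho0= 1.0d0

  do iter= 1, itmax
    z  = r / D                                    ! solve [M]z= r
    rho= dot_product(r, z)
    if (iter.eq.1) then
      p= z
    else
      beta= rho / rho0
      p   = z + beta*p
    endif
    q    = matmul(A, p)                           ! q= [A]p
    alpha= rho / dot_product(p, q)
    x    = x + alpha*p
    r    = r - alpha*q
    if (sqrt(dot_product(r, r)).lt.eps*bnrm) exit ! check |r|
    rho0= rho
  enddo
end subroutine pcg_diag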
ILU: Global Operations (Forward/Backward Substitution) NOT suitable for parallel computing
Localized ILU Preconditioning
Forward substitution:  [L]{ẑ}= {r}
Backward substitution: [U]{z}= {ẑ}
!C
!C +----------------+
!C | {z}= [Minv]{r} |
!C +----------------+
!C===
      do i= 1, N
        W(i,Z)= W(i,R)
      enddo

!C-- forward substitution
      do i= 1, N
        WVAL= W(i,Z)
        do k= indexL(i-1)+1, indexL(i)
          WVAL= WVAL - AL(k) * W(itemL(k),Z)
        enddo
        W(i,Z)= WVAL / D(i)
      enddo

!C-- backward substitution
      do i= N, 1, -1
        SW= 0.0d0
        do k= indexU(i), indexU(i-1)+1, -1
          SW= SW + AU(k) * W(itemU(k),Z)
        enddo
        W(i,Z)= W(i,Z) - SW / D(i)
      enddo
!C===
The effects of external points are ignored in the preconditioner ("Block-Jacobi" localized preconditioning), so it is WEAKER than the original ILU: more PEs, more iterations.
169
Localized ILU Preconditioning
[Figure: matrix [A] and unknowns 1-6 distributed over PE#1-PE#4; only the coefficient blocks local to each PE are considered in the preconditioner, the coupling blocks to other PEs are ignored]
170
171
Overlapped Additive Schwarz Domain Decomposition Method
Stabilization of Localized Preconditioning: ASDD
[Figure: two overlapped subdomains Ω1 and Ω2]

Global operation:  z = M^(-1) r
Local operation:   z_Ω1 = M_Ω1^(-1) r_Ω1,  z_Ω2 = M_Ω2^(-1) r_Ω2
Global nesting correction (repeating -> stable):
z_Ω1^(n+1) = z_Ω1^(n) + M_Ω1^(-1) ( r_Ω1 - M_Ω1 z_Ω1^(n) - M_Γ1 z_Γ1^(n) )
z_Ω2^(n+1) = z_Ω2^(n) + M_Ω2^(-1) ( r_Ω2 - M_Ω2 z_Ω2^(n) - M_Γ2 z_Γ2^(n) )
(z_Γk: values at the overlapped (external) points of Ωk, provided by the neighboring domain)
172
Overlapped Additive Schwarz Domain Decomposition Method
Stabilization of Localized Preconditioning: ASDD, Global Nesting Correction (repeating -> stable)
[Figure: two overlapped subdomains Ω1 and Ω2]

z_Ω1^(n+1) = z_Ω1^(n) + M_Ω1^(-1) ( r_Ω1 - M_Ω1 z_Ω1^(n) - M_Γ1 z_Γ1^(n) )
z_Ω2^(n+1) = z_Ω2^(n) + M_Ω2^(-1) ( r_Ω2 - M_Ω2 z_Ω2^(n) - M_Γ2 z_Γ2^(n) )

Each correction is the solution of the local preconditioner equation
M_Ω1 δz_Ω1 = r~_Ω1,
where r~_Ω1 = r_Ω1 - M_Ω1 z_Ω1^(n) - M_Γ1 z_Γ1^(n) and δz_Ω1 = z_Ω1^(n+1) - z_Ω1^(n).
173
Overlapped Additive Schwarz Domain Decomposition Method
Effect of additive Schwarz domain decomposition for a solid mechanics example with 3x44^3 DOF on Hitachi SR2201; number of ASDD cycles per iteration = 1, ε = 10^-8

        NO Additive Schwarz           WITH Additive Schwarz
PE #   Iter. #   sec.   Speed-up    Iter. #   sec.   Speed-up
   1     204    233.7      -          144    325.6      -
   2     253    143.6     1.63        144    163.1     1.99
   4     259     74.3     3.15        145     82.4     3.95
   8     264     36.8     6.36        146     39.7     8.21
  16     262     17.4    13.52        144     18.7    17.33
  32     268      9.6    24.24        147     10.2    31.80
  64     274      6.6    35.68        150      6.5    50.07
174
Overlapped Additive Schwarz Domain Decomposition Method
Stabilization of Localized Preconditioning: ASDD
Local operation (forward/backward substitution), repeated as the global nesting correction (repeating -> stable):
[Figure: two overlapped subdomains Ω1 and Ω2]

do iterPRE= 1, iterPREmax
  (Ω1) calc. M_Ω1^(-1) ( r_Ω1 - M_Ω1 z_Ω1^(n) - M_Γ1 z_Γ1^(n) ) and update z_Ω1
  (Ω2) calc. M_Ω2^(-1) ( r_Ω2 - M_Ω2 z_Ω2^(n) - M_Γ2 z_Γ2^(n) ) and update z_Ω2
enddo
Technical Issues of "Parallel" Preconditioners for Iterative Solvers
3D solid mechanics; E: Young's modulus (regions with E=10^0 and E=10^3)
• If domain boundaries lie on "stronger" elements, convergence is very bad.
175
Remedies: Domain Decomposition
• Avoid "strong elements"
  – not practical
• Extended depth of overlapped elements
  – Selective Fill-ins, Selective Overlapping [KN 2007]
    • adaptive preconditioning/domain decomposition methods which utilize features of FEM procedures
• PHIDAL/HID (Hierarchical Interface Decomposition) [Henon & Saad 2007]
• Extended HID [KN 2010]
176
Extension of Depth of Overlapping
[Figure: four-subdomain example (PE#0-PE#3) with internal nodes (●), external nodes (●) and overlapped elements (■); the depth of overlapping is extended by including an additional layer of overlapped elements]
Cost for computation and communication may increase
177
HID: Hierarchical Interface Decomposition [Henon & Saad 2007]
• Multilevel domain decomposition
  – extension of nested dissection
• Non-overlapping at each level: connectors, separators
• Suitable for parallel preconditioning methods
[Figure: 2D example with four domains; interior nodes form level-1 connectors (belonging to a single domain: 0, 1, 2, 3), interface nodes form level-2 connectors (two domains: 0,1 / 0,2 / 1,3 / 2,3), and the cross point forms the level-4 connector (0,1,2,3)]
178
Parallel ILU for each Connector at each Level
• The unknowns are reordered according to their level numbers, from the lowest to the highest (a small counting-sort sketch of this reordering follows below).
• The block structure of the reordered matrix leads to natural parallelism if ILU/IC decompositions or forward/backward substitution processes are applied.
[Figure: reordered matrix with diagonal blocks for the level-1 connectors (0, 1, 2, 3), the level-2 connectors (0,1 / 0,2 / 2,3 / 1,3), and the level-4 connector (0,1,2,3)]
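A minimal sketch of this level-wise reordering, assuming each unknown already carries its HID level number; a counting sort keeps unknowns of the same level in their original relative order. Names are illustrative, not the actual implementation.

!C  Minimal sketch (assumption): build new IDs so that all level-1
!C  connectors come first, then level-2, etc.; within a level the
!C  original order is preserved (stable counting sort).
subroutine hid_reorder (N, NLEVmax, level, perm)
  implicit none
  integer, intent(in)  :: N, NLEVmax
  integer, intent(in)  :: level(N)      ! HID level of each unknown
  integer, intent(out) :: perm(N)       ! new ID of original unknown i
  integer :: num(NLEVmax), off(NLEVmax)
  integer :: i, lev

  num= 0
  do i= 1, N
    num(level(i))= num(level(i)) + 1
  enddo

  off(1)= 0
  do lev= 2, NLEVmax
    off(lev)= off(lev-1) + num(lev-1)
  enddo

  num= 0
  do i= 1, N
    lev= level(i)
    num(lev)= num(lev) + 1
    perm(i)= off(lev) + num(lev)
  enddo
end subroutine hid_reorder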
179
Results: 64 cores, Contact Problems, BILU(p)-(depth of overlapping), 3,090,903 DOF, GPBiCG
[Figure: elapsed time (sec., 0-350) and number of iterations (0-1500) for BILU(1), BILU(1+), BILU(2), each with BILU(p)-(0) (Block Jacobi), BILU(p)-(1), BILU(p)-(1+), and BILU(p)-HID]
180
Final goal of my recent work in this area after 2000
• Development of robust and efficient parallel preconditioning methods
• Construction of strategies for the optimum selection of preconditioners, partitioning, and related methods/parameters
• By utilizing both:
  – global information obtained from the derived coefficient matrices
  – very local information, such as information on each mesh in finite-element applications
181
Hetero 3D (1/2)
• Parallel FEM code (Flat MPI)
  – 3D linear elasticity problems in cube geometries with heterogeneity
  – SPD matrices
  – Young's modulus: 10^-6 ~ 10^+6
    • (Emin-Emax) controls the condition number
• Preconditioned iterative solvers
  – GPBiCG [Zhang 1997]
  – BILUT(p,d,t)
• Domain decomposition
  – Localized Block-Jacobi with Extended Overlapping (LBJ)
  – HID/Extended HID
[Figure: cube geometry and boundary conditions — Uz=0 @ z=Zmin, Ux=0 @ x=Xmin, Uy=0 @ y=Ymin, uniform distributed force in the z-direction @ z=Zmax; (Nx-1)x(Ny-1)x(Nz-1) elements, Nx x Ny x Nz nodes]
182
Hetero 3D (2/2)
• Based on the framework for parallel FEM procedures of GeoFEM
  – benchmark developed in the FP3C project under Japan-France collaboration
• Parallel mesh generation
  – fully parallel: each process generates its local mesh and assembles its local matrices
  – total number of vertices in each direction: (Nx, Ny, Nz)
  – number of partitions in each direction: (Px, Py, Pz)
  – total number of MPI processes is PxPyPz
  – each MPI process has (Nx/Px)(Ny/Py)(Nz/Pz) vertices
  – the spatial distribution of Young's modulus is given by an external file, which describes the heterogeneity on a 128^3 cube geometry
    • if Nx (or Ny or Nz) is larger than 128, the 128^3 pattern is repeated periodically in each direction (see the sketch below)
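A one-line sketch of this periodic repetition; Efile and the function name are illustrative assumptions, not the benchmark's actual file-reading code.

!C  Minimal sketch (assumption): Young's modulus at global vertex
!C  (i,j,k) is taken from the 128^3 heterogeneity field by periodic
!C  repetition in each direction.
real(8) function young_at (i, j, k, Efile)
  implicit none
  integer, intent(in) :: i, j, k                 ! global indices (1-based)
  real(8), intent(in) :: Efile(128,128,128)      ! 128^3 field from file
  young_at= Efile(mod(i-1,128)+1, mod(j-1,128)+1, mod(k-1,128)+1)
end function young_at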
183
BILUT(p,d,t)
• Incomplete LU factorization with threshold (ILUT)
• ILUT(p,d,t) [KN 2010]
  – p: maximum fill-level, specified before factorization
  – d, t: dropping tolerances applied before/after factorization
• Procedure (a dropping-step sketch follows below):
  (a) components of the initial matrix A with |Aij| < d (and by location) are dropped, giving the dropped matrix A'
  (b) ILU(p) factorization of A' gives (ILU)'
  (c) components of (ILU)' below the tolerance t are dropped, giving (ILUT)'
• The process (b) can be substituted by other factorization methods or by more powerful direct linear solvers, such as MUMPS, SuperLU, etc.
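A minimal sketch of the dropping steps (a)/(c): entries of one CRS-stored row whose magnitude is below the tolerance are discarded. The absolute-threshold rule and the array names are illustrative assumptions; the actual BILUT also applies location-based criteria, and the diagonal is kept separately as in the CRS layout used in this tutorial.

!C  Minimal sketch (assumption): drop off-diagonal entries of one
!C  CRS-stored row whose magnitude is below the tolerance d (step (a);
!C  step (c) is analogous with tolerance t).
subroutine drop_row (iS, iE, item, amat, d, nkeep, item2, amat2)
  implicit none
  integer, intent(in)  :: iS, iE            ! first/last index of the row
  integer, intent(in)  :: item(*)           ! column IDs
  real(8), intent(in)  :: amat(*)           ! values
  real(8), intent(in)  :: d                 ! dropping tolerance
  integer, intent(out) :: nkeep             ! number of kept entries
  integer, intent(out) :: item2(*)          ! kept column IDs
  real(8), intent(out) :: amat2(*)          ! kept values
  integer :: k

  nkeep= 0
  do k= iS, iE
    if (dabs(amat(k)).ge.d) then            ! keep |Aij| >= d
      nkeep= nkeep + 1
      item2(nkeep)= item(k)
      amat2(nkeep)= amat(k)
    endif
  enddo
end subroutine drop_row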
184
Preliminary Results
• Hardware
  – 16-240 nodes (160-3,840 cores) of Fujitsu PRIMEHPC FX10 (Oakleaf-FX), University of Tokyo
• Problem setting
  – 420x320x240 vertices (3.194x10^7 elements, 9.677x10^7 DOF)
  – strong scaling
  – effect of the thickness of overlapped zones: BILUT(p,d,t)-LBJ-X (X=1,2,3)
• Effect of d is small
• HID is slightly more robust than LBJ
185
BILUT(p,0,0) at 3,840 cores, NO dropping: Effect of Fill-in

Preconditioner        NNZ of [M]    Set-up (sec.)  Solver (sec.)  Total (sec.)  Iterations
BILUT(1,0,0)-LBJ-1    1.920x10^10    1.35            65.2           66.5         1916
BILUT(1,0,0)-LBJ-2    2.519x10^10    2.03            61.8           63.9         1288
BILUT(1,0,0)-LBJ-3    3.197x10^10    2.79            74.0           76.8         1367
BILUT(2,0,0)-LBJ-1    3.351x10^10    3.09            71.8           74.9         1339
BILUT(2,0,0)-LBJ-2    4.394x10^10    4.39            65.2           69.6          939
BILUT(2,0,0)-LBJ-3    5.631x10^10    5.95            83.6           89.6         1006
BILUT(3,0,0)-LBJ-1    6.468x10^10    9.34           105.2          114.6         1192
BILUT(3,0,0)-LBJ-2    8.523x10^10   12.7             98.4          111.1          823
BILUT(3,0,0)-LBJ-3    1.101x10^11   17.3            101.6          118.9          722
BILUT(1,0,0)-HID      1.636x10^10    2.24            60.7           62.9         1472
BILUT(2,0,0)-HID      2.980x10^10    5.04            66.2           71.7         1096

[NNZ] of [A]: 7.174x10^9
186
BILUT(p,0,0) at 3,840 cores, NO dropping: Effect of Overlapping
(Same table as on the previous slide.)
187
BILUT(p,0,0) at 3,840 cores, NO dropping
(Same table as on the previous two slides.)
188
BILUT(p,0,t) at 3,840 cores: Optimum Value of t

Preconditioner                 NNZ of [M]    Set-up (sec.)  Solver (sec.)  Total (sec.)  Iterations
BILUT(1,0,2.75x10^-2)-LBJ-1    7.755x10^9     1.36            45.0           46.3         1916
BILUT(1,0,2.75x10^-2)-LBJ-2    1.019x10^10    2.05            42.0           44.1         1383
BILUT(1,0,2.75x10^-2)-LBJ-3    1.285x10^10    2.81            54.2           57.0         1492
BILUT(2,0,1.00x10^-2)-LBJ-1    1.118x10^10    3.11            39.1           42.2         1422
BILUT(2,0,1.00x10^-2)-LBJ-2    1.487x10^10    4.41            37.1           41.5         1029
BILUT(2,0,1.00x10^-2)-LBJ-3    1.893x10^10    5.99            37.1           43.1          915
BILUT(3,0,2.50x10^-2)-LBJ-1    8.072x10^9     9.35            38.4           47.7         1526
BILUT(3,0,2.50x10^-2)-LBJ-2    1.063x10^10   12.7             35.5           48.3         1149
BILUT(3,0,2.50x10^-2)-LBJ-3    1.342x10^10   17.3             40.9           58.2         1180
BILUT(1,0,2.50x10^-2)-HID      6.850x10^9     2.25            38.5           40.7         1313
BILUT(2,0,1.00x10^-2)-HID      1.030x10^10    5.04            36.1           41.1         1064

[NNZ] of [A]: 7.174x10^9
189
Strong Scaling up to 3,840 cores
Speed-up is computed from the elapsed computation time (set-up + solver), normalized by BILUT(1,0,2.5x10^-2)-HID with 256 cores.
[Figure: speed-up (0-4.00x10^3) vs. core count and parallel performance (%, 70-130) vs. core count for BILUT(1,0,2.50e-2)-HID, BILUT(2,0,1.00e-2)-HID, BILUT(1,0,2.75e-2)-LBJ-2, BILUT(2,0,1.00e-2)-LBJ-2, BILUT(3,0,2.50e-2)-LBJ-2, with the ideal line]
190
Summary
• Hetero 3D
• Generally speaking, HID is more robust than LBJ with extended overlapping
• BILUT(p,d,t)
  – effect of d is not significant
  – [NNZ] of [M] depends on t (not on p)
  – BILUT(3,0,t0) > BILUT(2,0,t0) > BILUT(1,0,t0), although the cost of a single iteration is similar for each method
• Critical/optimum value of t
  – [NNZ] of [M] = [NNZ] of [A]
  – further investigation is needed
191
Future Works
• Theoretical/numerical investigation of the optimum t
  – eigenvalue analysis etc.
  – final goal: automatic selection BEFORE computation
  – (any related work?)
• Further investigation/development of LBJ & HID
• Comparison with other preconditioners/direct solvers
  – (various types of) low-rank approximation methods
• Hetero 3D will be released as a deliverable of the FP3C project soon
  – OpenMP/MPI Hybrid version
    • BILU(0) is already done; factorization is (was) the problem
  – extension to manycore/GPU clusters
192