Parallel Preconditioned Iterative Methods
Kengo Nakajima
Information Technology Center, The University of Tokyo /
Department of Mathematical Informatics, Graduate School of Information Science and Technology
JSIAM Joint "Applied Mathematics Seminar" of the Three Activity Groups
December 26, 2014
Scientific computations based on the finite-difference method, the finite-element method, and similar discretizations ultimately reduce to solving large systems of linear equations whose coefficient matrices are sparse. Krylov subspace iterative methods are widely used for solving such systems.
Preconditioning, which improves the eigenvalue distribution of the coefficient matrix and accelerates convergence, is essential when solving real applications.
This tutorial covers a wide range of topics, from the basics of preconditioned iterative methods to current research directions aimed at large-scale simulations on the latest supercomputers.
http://nkl.cc.u-tokyo.ac.jp/seminars/JSIAM14w.pdf
2
• Sparse Matrices
• Iterative Linear Solvers
  − Preconditioning
  − Parallel Iterative Linear Solvers
  − Multigrid Method
  − Recent Technical Issues
• Example of Parallel MGCG
• Ill-Conditioned Problems
TOC
3
• Introduction to Parallel Iterative Solvers
There are a lot of topics and issues all of which I cannot cover. I just try to talk about my experiences in the area of scientific applications and parallel numerical algorithms, with some general introductions.
Goal
4
Finite-Element Method (FEM)
[Figure: 3×3 mesh of 9 quadrilateral elements (1–9) and 16 nodes (1–16)]
• A widely used method for solving partial differential equations
  – elements (meshes) & nodes (vertices)
• Consider the following 2D heat conduction problem:
  $\frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2} + Q = 0$
  – 16 nodes, 9 quadrilateral elements
  – uniform thermal conductivity (λ = 1)
  – uniform volumetric heat generation (Q = 1)
  – temperature fixed at node 1: T = 0
  – insulated (adiabatic) on the outer boundary
5
Galerkin FEM procedures
[Figure: the same 16-node, 9-element mesh]
• Apply the Galerkin method to each element. In each element, with [N] the shape (interpolation) functions:
  $\int_V [N]^T \left( \frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2} + Q \right) dV = 0$
• Applying Green's theorem (integration by parts) to the partial differential equation leads to the following "weak form":
  $\int_V \left( \frac{\partial [N]^T}{\partial x}\frac{\partial [N]}{\partial x} + \frac{\partial [N]^T}{\partial y}\frac{\partial [N]}{\partial y} \right)\{T\}\, dV - \int_V Q\,[N]^T\, dV = 0$
6
Element Matrix
• Carrying out the integration of the weak form
  $\int_V \left( \frac{\partial [N]^T}{\partial x}\frac{\partial [N]}{\partial x} + \frac{\partial [N]^T}{\partial y}\frac{\partial [N]}{\partial y} \right)\{T\}\, dV - \int_V Q\,[N]^T\, dV = 0$
  in each element yields the element matrix equation for the four nodes A, B, C, D of element e:
  $[k^{(e)}]\{\phi^{(e)}\} = \{f^{(e)}\}$
  $\begin{pmatrix} k^{(e)}_{AA} & k^{(e)}_{AB} & k^{(e)}_{AC} & k^{(e)}_{AD} \\ k^{(e)}_{BA} & k^{(e)}_{BB} & k^{(e)}_{BC} & k^{(e)}_{BD} \\ k^{(e)}_{CA} & k^{(e)}_{CB} & k^{(e)}_{CC} & k^{(e)}_{CD} \\ k^{(e)}_{DA} & k^{(e)}_{DB} & k^{(e)}_{DC} & k^{(e)}_{DD} \end{pmatrix} \begin{pmatrix} \phi^{(e)}_{A} \\ \phi^{(e)}_{B} \\ \phi^{(e)}_{C} \\ \phi^{(e)}_{D} \end{pmatrix} = \begin{pmatrix} f^{(e)}_{A} \\ f^{(e)}_{B} \\ f^{(e)}_{C} \\ f^{(e)}_{D} \end{pmatrix}$
A brief assembly sketch for the 1D analogue is given below.
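As a concrete illustration of the element-matrix/assembly step, the following is a minimal sketch (not taken from the slides) for the 1D analogue d²T/dx² + Q = 0 with linear elements; the 2D quadrilateral case follows the same pattern with 4×4 element matrices. Names such as NE, h, and Q are illustrative assumptions.

#include <stdio.h>

#define NE 9              /* number of elements (assumption)          */
#define NN (NE + 1)       /* number of nodes                          */

int main(void)
{
  double h = 1.0, Q = 1.0;            /* uniform mesh size, heat source  */
  double K[NN][NN] = {{0.0}};         /* dense global matrix (small demo)*/
  double F[NN]     = {0.0};           /* global load vector              */

  /* element matrix of a linear 1D element: (1/h)*[[1,-1],[-1,1]],
     element load vector: Q*h/2*[1,1] (exact for constant Q)            */
  for (int e = 0; e < NE; e++) {
    int n[2] = {e, e + 1};                    /* global node numbers     */
    double ke[2][2] = {{ 1.0/h, -1.0/h},
                       {-1.0/h,  1.0/h}};
    double fe[2]    = { Q*h/2.0, Q*h/2.0 };

    for (int i = 0; i < 2; i++) {             /* add into global system  */
      F[n[i]] += fe[i];
      for (int j = 0; j < 2; j++) K[n[i]][n[j]] += ke[i][j];
    }
  }

  /* fix T = 0 at node 0 (Dirichlet condition), as on the slide          */
  for (int j = 0; j < NN; j++) { K[0][j] = 0.0; K[j][0] = 0.0; }
  K[0][0] = 1.0;  F[0] = 0.0;

  printf("diagonal of assembled K: ");
  for (int i = 0; i < NN; i++) printf("%5.2f ", K[i][i]);
  printf("\n");
  return 0;
}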
7
Global Matrix: each element matrix is added into the global matrix
[Figure: the 16-node, 9-element mesh and the assembled 16×16 system [K]{Φ} = {F}; D marks the diagonal entries and X the nonzero off-diagonal entries coupling neighboring nodes]
8
Global Matrix (continued): [The same assembly figure as the previous slide — each element matrix is added into the 16×16 global system [K]{Φ} = {F}.]
9
Solve the resulting large system of linear equations after applying appropriate boundary conditions (here Φ = 0 at node 1). The coefficient matrix is "sparse" (mostly zeros).
[Figure: the 16×16 matrix [K] and right-hand side {F}; D marks diagonal entries, X nonzero off-diagonals]
10
Computed results
11
2D FDM Mesh (5-point stencil)12
13
Matrices obtained from the finite-element and finite-difference methods
[Figure: the 16×16 sparse matrix pattern from the previous example]
• Sparse matrix: most entries are zero
• Storing every entry of the square matrix as A(i,j) is inefficient for sparse matrices
  – that layout is intended for "dense" matrices
• FEM: the number of nonzero off-diagonal entries per row is at most of the order of a few hundred
  – e.g., with 10^8 unknowns the required storage (in words) is
    • full square matrix: O(10^16)
    • nonzero off-diagonal entries: O(10^10)
• Storing only the nonzero entries is efficient
14
Application to matrix-vector products: store only the (nonzero) off-diagonal entries — a scheme for sparse matrices
Compressed Row Storage (CRS)
[Figure: the 16×16 sparse matrix pattern]
  Diag(i)   diagonal entries (real, i = 1, N)
  Index(i)  cumulative number of off-diagonal entries up to row i (integer, i = 0, N)
  Item(k)   column numbers of the off-diagonal entries (integer, k = 1, Index(N))
  AMat(k)   off-diagonal entries (real, k = 1, Index(N))

{Y}= [A]{X}

do i= 1, N
  Y(i)= Diag(i)*X(i)
  do k= Index(i-1)+1, Index(i)
    Y(i)= Y(i) + AMat(k)*X(Item(k))
  enddo
enddo
15
Matrix-vector product with a dense matrix ⇒ very simple
  $y_i = \sum_{j=1}^{N} a_{ij} x_j \quad (i = 1, \dots, N)$

{Y}= [A]{X}

do i= 1, N
  Y(i)= 0.d0
  do j= 1, N
    Y(i)= Y(i) + A(i,j)*X(j)
  enddo
enddo
Compressed Row Storage (CRS)
[Figure: an 8×8 example sparse matrix with its nonzero values; rows and columns numbered 1–8]
16
Compressed Row Storage (CRS): in a C program, numbering starts from 0
[Figure: for each row 0–7 of the 8×8 example, the off-diagonal values with their 0-based column numbers shown in circles]
N= 8
Diagonal components:
Diag[0]= 1.1  Diag[1]= 3.6  Diag[2]= 5.7  Diag[3]= 9.8
Diag[4]= 11.5 Diag[5]= 12.4 Diag[6]= 23.1 Diag[7]= 51.3
17
Compressed Row Storage (CRS) in C (continued)
[Figure: the same 8×8 example; the diagonal and off-diagonal entries of each row with their 0-based column numbers]
18
Compressed Row Storage (CRS):C
[Figure: the off-diagonal entries of each row stored consecutively; each entry carries its column number (circled) and its serial index 0–24 in the AMat/Item arrays]
  Diag[i]   diagonal entries (real, [N])
  Index[i]  cumulative number of off-diagonal entries per row (integer, [N+1])
  Item[k]   column numbers of the off-diagonal entries (integer, [Index[N]])
  AMat[k]   off-diagonal entries (real, [Index[N]])

{Y}=[A]{X}

for(i=0; i<N; i++){
  Y[i] = Diag[i] * X[i];
  for(k=Index[i]; k<Index[i+1]; k++){
    Y[i] += AMat[k]*X[Item[k]];
  }
}
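The following is a minimal, self-contained example of this storage scheme and of the mat-vec loop above; the 4×4 matrix and its values are illustrative assumptions, not the 8×8 example from the slides.

#include <stdio.h>

int main(void)
{
  /* 4x4 example (values are arbitrary):
       [ 4 -1  0 -2 ]
       [-1  5 -3  0 ]
       [ 0 -3  6  0 ]
       [-2  0  0  7 ]                                                   */
  int    N        = 4;
  double Diag[4]  = { 4.0, 5.0, 6.0, 7.0 };
  int    Index[5] = { 0, 2, 4, 5, 6 };   /* cumulative off-diagonal counts */
  int    Item[6]  = { 1, 3,  0, 2,  1,  0 };   /* column numbers          */
  double AMat[6]  = {-1.0,-2.0,-1.0,-3.0,-3.0,-2.0};
  double X[4]     = { 1.0, 2.0, 3.0, 4.0 };
  double Y[4];

  /* {Y} = [A]{X}, exactly the loop shown on the slide */
  for (int i = 0; i < N; i++) {
    Y[i] = Diag[i] * X[i];
    for (int k = Index[i]; k < Index[i+1]; k++)
      Y[i] += AMat[k] * X[Item[k]];
  }

  for (int i = 0; i < N; i++) printf("Y[%d]= %6.2f\n", i, Y[i]);
  return 0;
}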
19
• Sparse Matrices• Iterative Linear Solvers
− Preconditioning− Parallel Iterative Linear Solvers− Multigrid Method− Recent Technical Issues
• Example of Parallel MGCG
20
Solving large linear systems in scientific computing
• Many scientific and engineering computations eventually reduce to solving a large linear system Ax = b.
  – important, expensive
• Various methods have been proposed, depending on the application:
  – sparse matrices, dense matrices
  – direct methods, iterative methods
• Dense matrices
  – global interactions: BEM, spectral methods, MO, MD (gas/liquid)
• Sparse matrices
  – local interactions: FEM, FDM, MD (solid), BEM with fast multipole expansion
21
Direct methods
• Gaussian elimination, full LU factorization, etc.
  – transformation of the matrix; computation of something equivalent to the inverse
• Advantages
  – stable, applicable to a wide range of applications
  • pivoting
  – applicable to both sparse and dense matrices
• Drawbacks
  – require more memory and computation time than iterative methods
  • O(N³) operations for dense matrices
  – not suitable for large-scale computation
  • O(N²) storage, O(N³) operations
22
What is an iterative method?
Starting from an appropriate initial guess $x^{(0)}$, generate a sequence $x^{(1)}, x^{(2)}, \dots$ that converges to the true solution by repeated computation.

Linear system and initial guess:
  $A x = b, \quad A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}, \quad x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}, \quad b = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix}, \qquad x^{(0)} = \begin{pmatrix} x_1^{(0)} \\ x_2^{(0)} \\ \vdots \\ x_n^{(0)} \end{pmatrix}$
23
Iterative Methods
• Stationary methods
  – quantities other than the solution vector do not change during the iteration
  – SOR, Gauss-Seidel, Jacobi, etc.
  – generally slow
• Nonstationary methods
  – constraints or optimization conditions are added
  – also called Krylov subspace methods, because mappings onto Krylov subspaces are used as a basis
  – CG (Conjugate Gradient)
  – BiCGSTAB (Bi-Conjugate Gradient Stabilized)
  – GMRES (Generalized Minimal Residual)

For $Ax = b$, a stationary method computes
  $x^{(k+1)} = M x^{(k)} + N b$
24
Iterative Methods (continued)
• Advantages
  – less memory and computation than direct methods
  – well suited to parallel computing
• Drawbacks
  – convergence is strongly affected by the application and the boundary conditions
  • may fail to converge (no answer is obtained)
  – preconditioning is important
25
Nonstationary methods: Krylov subspace methods (1/2)
From $Ax = b$, write $x = (I - A)x + b$ and introduce the iteration that generates $x^{(0)}, x^{(1)}, x^{(2)}, \dots, x^{(k)}$:
  $x^{(k)} = (I - A)x^{(k-1)} + b = x^{(k-1)} + (b - A x^{(k-1)}) = x^{(k-1)} + r^{(k-1)}$
where $r^{(k)} = b - A x^{(k)}$ is the residual vector. Hence
  $x^{(k)} = x^{(0)} + \sum_{i=0}^{k-1} r^{(i)}$
and the residual satisfies
  $r^{(k)} = b - A x^{(k)} = b - A x^{(k-1)} - A r^{(k-1)} = r^{(k-1)} - A r^{(k-1)} = (I - A)\, r^{(k-1)}$
26
Nonstationary methods: Krylov subspace methods (2/2)
$z^{(k)}$ is a vector belonging to the k-th Krylov subspace; the question is how to construct a good approximate solution $x_k$ from that subspace:
  $x^{(k)} = x^{(0)} + \sum_{i=0}^{k-1} r^{(i)} = x^{(0)} + \sum_{i=0}^{k-1} (I - A)^i r^{(0)} = x^{(0)} + z^{(k)}$
  $z^{(k)} = \sum_{i=0}^{k-1} (I - A)^i r^{(0)} \in \operatorname{span}\{ r^{(0)}, A r^{(0)}, A^2 r^{(0)}, \dots, A^{k-1} r^{(0)} \}$
27
A representative nonstationary method: the Conjugate Gradient method
• Conjugate Gradient method, "CG" for short
  – the most representative nonstationary iterative method
• For symmetric positive definite (SPD) matrices
  – $\{x\}^T [A] \{x\} > 0$ for any nonzero vector $\{x\}$
  – equivalent to: all diagonal entries > 0, all eigenvalues > 0, all leading principal minors > 0, i.e.
    $\det \begin{pmatrix} a_{11} & \cdots & a_{1k} \\ \vdots & \ddots & \vdots \\ a_{k1} & \cdots & a_{kk} \end{pmatrix} > 0 \quad (k = 1, \dots, n)$
• Algorithm
  – a variant of the steepest descent method
  – $x^{(i)} = x^{(i-1)} + \alpha_i p^{(i)}$
    • $x^{(i)}$: approximate solution, $p^{(i)}$: search direction, $\alpha_i$: scalar
  – with $y$ the exact solution, find the $\{x\}$ that minimizes $\{x-y\}^T [A] \{x-y\}$
  – see the references for details
    • e.g., M. Mori, "Numerical Analysis (2nd ed.)" (Kyoritsu Shuppan)
28
Conjugate Gradient (CG) algorithm

Compute r(0)= b-[A]x(0)
for i= 1, 2, …
  ρ(i-1)= r(i-1)・r(i-1)
  if i=1
    p(1)= r(0)
  else
    β(i-1)= ρ(i-1)/ρ(i-2)
    p(i)= r(i-1) + β(i-1) p(i-1)
  endif
  q(i)= [A]p(i)
  α(i)= ρ(i-1)/(p(i)・q(i))
  x(i)= x(i-1) + α(i)p(i)
  r(i)= r(i-1) - α(i)q(i)
  check convergence |r|
end

• Matrix-vector products
• Dot products
• Scaled vector addition (DAXPY)
x(i): vector, α(i): scalar
29
(Slides 29–32 repeat the same CG algorithm, highlighting in turn the matrix-vector product q(i)= [A]p(i), the dot products, and the DAXPY operation {y}= a{x} + {y} in double precision; x(i) is a vector and α(i) a scalar.)
33
Derivation of the CG algorithm (1/5)
Let $y$ be the exact solution ($Ay = b$). Since $A$ is symmetric, $(Ax, y) = (x, Ay)$ and
  $(x - y,\, A(x - y)) = (x, Ax) - 2(x, Ay) + (y, Ay) = (x, Ax) - 2(x, b) + \text{const.}$
Therefore finding the $x$ that minimizes $(x - y, A(x - y))$ is equivalent to minimizing
  $f(x) = \tfrac{1}{2}(x, Ax) - (b, x)$
For an arbitrary vector $h$,
  $f(x + h) = f(x) + (h,\, Ax - b) + \tfrac{1}{2}(h, Ah)$
34
Expanding $f(x) = \tfrac{1}{2}(x, Ax) - (b, x)$ for an arbitrary vector $h$:
  $f(x + h) = \tfrac{1}{2}(x + h,\, A(x + h)) - (b,\, x + h)$
  $\quad = \tfrac{1}{2}(x, Ax) + (h, Ax) + \tfrac{1}{2}(h, Ah) - (b, x) - (b, h)$
  $\quad = f(x) + (h,\, Ax - b) + \tfrac{1}{2}(h, Ah)$
35
Derivation of the CG algorithm (2/5)
CG starts from an arbitrary $x^{(0)}$ and successively searches for the minimum of $f(x)$. Suppose the k-th approximation $x^{(k)}$ and search direction $p^{(k)}$ have been determined:
  $x^{(k+1)} = x^{(k)} + \alpha_k p^{(k)}$   (1)
  $f(x^{(k+1)}) = \tfrac{1}{2}\alpha_k^2 (p^{(k)}, Ap^{(k)}) - \alpha_k (p^{(k)},\, b - Ax^{(k)}) + f(x^{(k)})$
To minimize $f(x^{(k+1)})$ with respect to $\alpha_k$, set $\partial f / \partial \alpha_k = 0$:
  $\alpha_k = \dfrac{(p^{(k)}, r^{(k)})}{(p^{(k)}, Ap^{(k)})}$, where $r^{(k)} = b - Ax^{(k)}$ is the residual of the k-th approximation.
36
Derivation of the CG algorithm (3/5)
The residual $r^{(k)}$ can also be updated by the recurrence
  $r^{(k+1)} = r^{(k)} - \alpha_k A p^{(k)}$   (2)
  (since $r^{(k+1)} = b - Ax^{(k+1)} = b - Ax^{(k)} - \alpha_k Ap^{(k)}$)
Ideally the exact solution $y$ would be obtained at step k+1 as $y = x^{(k+1)} + \alpha_{k+1} p^{(k+1)}$, but this is not possible when the solution is unknown.
The search directions are generated by the recurrence
  $p^{(k+1)} = r^{(k+1)} + \beta_k p^{(k)}, \qquad p^{(0)} = r^{(0)}$   (3)
37
Derivation of the CG algorithm (4/5)
There is a convenient orthogonality relation:
  $(Ap^{(k)},\, y - x^{(k+1)}) = 0$
Therefore the following holds:
  $(Ap^{(k)},\, y - x^{(k+1)}) = (Ap^{(k)},\, y - x^{(k)} - \alpha_k p^{(k)}) = (p^{(k)},\, b - Ax^{(k)}) - \alpha_k (p^{(k)}, Ap^{(k)}) = (p^{(k)}, r^{(k)}) - \alpha_k (p^{(k)}, Ap^{(k)}) = 0$
which again gives
  $\alpha_k = \dfrac{(p^{(k)}, r^{(k)})}{(p^{(k)}, Ap^{(k)})}$
38
Derivation of the CG algorithm (5/5)
Require $(p^{(k+1)}, Ap^{(k)}) = 0$, i.e. $p^{(k)}$ and $p^{(k+1)}$ are conjugate with respect to $A$   (4):
  $(p^{(k+1)}, Ap^{(k)}) = (r^{(k+1)} + \beta_k p^{(k)},\, Ap^{(k)}) = (r^{(k+1)}, Ap^{(k)}) + \beta_k (p^{(k)}, Ap^{(k)}) = 0$
  $\Rightarrow \beta_k = -\dfrac{(r^{(k+1)}, Ap^{(k)})}{(p^{(k)}, Ap^{(k)})}$
This yields the algorithm:

Compute p(0)= r(0)= b-[A]x(0)
for i= 1, 2, …
  calc. α(i-1)
  x(i)= x(i-1) + α(i-1)p(i-1)
  r(i)= r(i-1) – α(i-1)[A]p(i-1)
  check convergence |r|
  (if not converged) calc. β(i-1)
  p(i)= r(i) + β(i-1) p(i-1)
end

with
  $\alpha_{i-1} = \dfrac{(p^{(i-1)}, r^{(i-1)})}{(p^{(i-1)}, Ap^{(i-1)})}, \qquad \beta_{i-1} = -\dfrac{(r^{(i)}, Ap^{(i-1)})}{(p^{(i-1)}, Ap^{(i-1)})}$
39
CG algorithm
The following conjugacy holds for any pair (i, j):
  $(p^{(i)}, Ap^{(j)}) = 0 \quad (i \ne j)$
The search directions $p^{(k)}$ and residuals $r^{(k)}$ also satisfy
  $(r^{(i)}, r^{(j)}) = 0 \quad (i \ne j), \qquad (p^{(k)}, r^{(k)}) = (r^{(k)}, r^{(k)})$
In N-dimensional space there exist at most N mutually orthogonal, linearly independent residual vectors $r^{(k)}$; therefore the conjugate gradient method converges within N iterations when there are N unknowns.
⇒ In practice rounding errors matter (when the condition number is large).
Top 10 Algorithms in the 20th Century (SIAM), http://www.siam.org/news/news.php?id=637 :
Monte Carlo method, simplex method, Krylov subspace methods, matrix decomposition methods, optimizing Fortran compiler, QR algorithm, quicksort, FFT, integer relation algorithms, FMM (fast multipole method)
40
In practice $\alpha_k$ and $\beta_k$ can be rewritten in somewhat simpler forms:
  $\beta_k = -\dfrac{(r^{(k+1)}, Ap^{(k)})}{(p^{(k)}, Ap^{(k)})} = \dfrac{(r^{(k+1)}, r^{(k+1)})}{(r^{(k)}, r^{(k)})}$
  $\alpha_k = \dfrac{(p^{(k)}, r^{(k)})}{(p^{(k)}, Ap^{(k)})} = \dfrac{(r^{(k)}, r^{(k)})}{(p^{(k)}, Ap^{(k)})}$
using $(p^{(k)}, b - Ax^{(k)}) = (p^{(k)}, r^{(k)}) = (r^{(k)}, r^{(k)})$.
41
Conjugate Gradient (CG) algorithm

Compute r(0)= b-[A]x(0)
for i= 1, 2, …
  ρ(i-1)= r(i-1)・r(i-1)
  if i=1
    p(1)= r(0)
  else
    β(i-1)= ρ(i-1)/ρ(i-2)
    p(i)= r(i-1) + β(i-1) p(i-1)
  endif
  q(i)= [A]p(i)
  α(i)= ρ(i-1)/(p(i)・q(i))
  x(i)= x(i-1) + α(i)p(i)
  r(i)= r(i-1) - α(i)q(i)
  check convergence |r|
end

x(i): vector, α(i): scalar
  $\alpha_i = \dfrac{(r^{(i-1)}, r^{(i-1)})}{(p^{(i)}, Ap^{(i)})} = \dfrac{\rho_{i-1}}{(p^{(i)}, Ap^{(i)})}, \qquad \beta_{i-1} = \dfrac{(r^{(i-1)}, r^{(i-1)})}{(r^{(i-2)}, r^{(i-2)})} = \dfrac{\rho_{i-1}}{\rho_{i-2}}$
42
Program example (CG method) (1/3)

do i= 1, N
  R(i)= B(i)
  do j= 1, N
    R(i)= R(i) - AMAT(i,j)*X(j)
  enddo
enddo

BNRM2= 0.0D0
do i= 1, N
  BNRM2= BNRM2 + B(i)**2
enddo

AMAT(i,j): entry a_ij of A;  B(i): entries of b;  X(i): entries of x
P(i): entries of p;  Q(i): entries of q;  R(i): entries of r
(The CG algorithm of the previous slide is shown alongside for reference.)
43
Program example (CG method) (2/3)

do iter= 1, ITERmax
  RHO= 0.d0
  do i= 1, N
    RHO= RHO + R(i)*R(i)
  enddo

  if ( iter.eq.1 ) then
    do i= 1, N
      P(i)= R(i)
    enddo
  else
    BETA= RHO / RHO1
    do i= 1, N
      P(i)= R(i) + BETA*P(i)
    enddo
  endif

  do i= 1, N
    Q(i)= 0.d0
    do j= 1, N
      Q(i)= Q(i) + AMAT(i,j)*P(j)
    enddo
  enddo
  ...
enddo
44
Program example (CG method) (3/3)

do iter= 1, ITERmax
  ...
  C1= 0.d0
  do i= 1, N
    C1= C1 + P(i)*Q(i)
  enddo
  ALPHA= RHO / C1

  do i= 1, N
    X(i)= X(i) + ALPHA * P(i)
    R(i)= R(i) - ALPHA * Q(i)
  enddo

  DNRM2 = 0.0
  do i= 1, N
    DNRM2= DNRM2 + R(i)**2
  enddo

  RESID= dsqrt(DNRM2/BNRM2)
  if ( RESID.le.EPS ) exit

  RHO1 = RHO
enddo

Convergence check: RESID = ||b - [A]x(k)||_2 / ||b||_2 ≤ EPS.
RHO1 keeps ρ(i-1) so that it can serve as ρ(i-2) in the next iteration.
45
1D heat conduction equation. Governing equation, with uniform thermal conductivity λ = 1:
  $\dfrac{d}{dx}\!\left(\lambda \dfrac{d\phi}{dx}\right) + BF = 0$
Uniform volumetric heat generation BF; boundary conditions:
  $\phi = 0$ at $x = 0$, and $d\phi/dx = 0$ (adiabatic) at $x = x_{max}$.
The discretization places the unknowns at the element (cell) centers.
46
1D heat conduction equation: analytical solution
  $\phi(x) = -\tfrac{1}{2} BF\, x^2 + BF\, x_{max}\, x$
φ = 0 at X = 0; the adiabatic face is at X = X_max, but the temperature is not computed there (unknowns are at cell centers).
With Δx = 1.0 and 50 meshes, X_max = 49.5 and the last cell center lies at X = 49.0. With BF = 1.0 the temperature there is
  $\phi = 49.5 \times 49 - \tfrac{1}{2} \times 49^2 = 2425.5 - 1200.5 = 1225.0$
47
Example (N=50): Jacobi method
 1000 iters, RESID= 5.443248E-01 PHI(N)= 4.724513E+02
 2000 iters, RESID= 3.255667E-01 PHI(N)= 7.746137E+02
 3000 iters, RESID= 1.947372E-01 PHI(N)= 9.555996E+02
...
34000 iters, RESID= 2.347113E-08 PHI(N)= 1.225000E+03
35000 iters, RESID= 1.403923E-08 PHI(N)= 1.225000E+03
35661 iters, RESID= 9.999053E-09 PHI(N)= 1.225000E+03
1 0.000000E+00 0.000000E+002 4.899999E+01 4.900000E+013 9.699999E+01 9.700000E+014 1.440000E+02 1.440000E+025 1.900000E+02 1.900000E+02...
41 1.180000E+03 1.180000E+0342 1.189000E+03 1.189000E+0343 1.197000E+03 1.197000E+0344 1.204000E+03 1.204000E+0345 1.210000E+03 1.210000E+0346 1.215000E+03 1.215000E+0347 1.219000E+03 1.219000E+0348 1.222000E+03 1.222000E+0349 1.224000E+03 1.224000E+0350 1.225000E+03 1.225000E+03
Iteration count, residual, φ at mesh 50
Numerical solution vs. analytical solution
(analytical value at mesh 50: φ = 49.5 × 49 − ½ × 49² = 1225.0)
48
Example (N=50): Gauss-Seidel method
 1000 iters, RESID= 3.303725E-01 PHI(N)= 7.785284E+02
 2000 iters, RESID= 1.182010E-01 PHI(N)= 1.065259E+03
 3000 iters, RESID= 4.229019E-02 PHI(N)= 1.167848E+03
...
16000 iters, RESID= 6.657001E-08 PHI(N)= 1.225000E+03
17000 iters, RESID= 2.381754E-08 PHI(N)= 1.225000E+03
17845 iters, RESID= 9.993196E-09 PHI(N)= 1.225000E+03
1 0.000000E+00 0.000000E+002 4.899999E+01 4.900000E+013 9.699999E+01 9.700000E+014 1.440000E+02 1.440000E+025 1.900000E+02 1.900000E+02...
41 1.180000E+03 1.180000E+0342 1.189000E+03 1.189000E+0343 1.197000E+03 1.197000E+0344 1.204000E+03 1.204000E+0345 1.210000E+03 1.210000E+0346 1.215000E+03 1.215000E+0347 1.219000E+03 1.219000E+0348 1.222000E+03 1.222000E+0349 1.224000E+03 1.224000E+0350 1.225000E+03 1.225000E+03
Iteration count, residual, φ at mesh 50
Numerical solution vs. analytical solution
49
Example (N=50): CG method
49 iters, RESID= 0.000000E-00 PHI(N)= 1.225000E+03
1 0.000000E+00 0.000000E+002 4.899999E+01 4.900000E+013 9.699999E+01 9.700000E+014 1.440000E+02 1.440000E+025 1.900000E+02 1.900000E+02...
41 1.180000E+03 1.180000E+0342 1.189000E+03 1.189000E+0343 1.197000E+03 1.197000E+0344 1.204000E+03 1.204000E+0345 1.210000E+03 1.210000E+0346 1.215000E+03 1.215000E+0347 1.219000E+03 1.219000E+0348 1.222000E+03 1.222000E+0349 1.224000E+03 1.224000E+0350 1.225000E+03 1.225000E+03
Iteration count, residual, φ at mesh 50; numerical vs. analytical solution
(analytical value at mesh 50: 1225.0)
Note that convergence is reached at the 49th iteration (there are 49 unknowns).
50
Iterative Methods
• Advantages
  – less memory and computation than direct methods
  – well suited to parallel computing
• Drawbacks
  – convergence is strongly affected by the application and the boundary conditions
  • may fail to converge (no answer is obtained)
  – preconditioning is important
  • problems with large condition numbers
51
• Sparse Matrices• Iterative Linear Solvers
− Preconditioning− Parallel Iterative Linear Solvers− Multigrid Method− Recent Technical Issues
• Example of Parallel MGCG• Ill-Conditioned Problems
52
Conjugate Gradient algorithm (with auxiliary vector z)
• Matrix-vector products
• Dot products
• Scaled vector addition (DAXPY)
x(i): vector, α(i): scalar

Compute r(0)= b-[A]x(0)
for i= 1, 2, …
  z(i-1)= r(i-1)
  ρ(i-1)= r(i-1)・z(i-1)
  if i=1
    p(1)= z(0)
  else
    β(i-1)= ρ(i-1)/ρ(i-2)
    p(i)= z(i-1) + β(i-1) p(i-1)
  endif
  q(i)= [A]p(i)
  α(i)= ρ(i-1)/(p(i)・q(i))
  x(i)= x(i-1) + α(i)p(i)
  r(i)= r(i-1) - α(i)q(i)
  check convergence |r|
end
53
What is preconditioning?
• Convergence of an iterative method depends on the eigenvalue distribution of the coefficient matrix.
  – The fewer distinct eigenvalues and the closer they are to 1, the faster the convergence (identity matrix).
  – Condition number (for SPD matrices) = ratio of largest to smallest eigenvalue.
  • The closer the condition number is to 1, the easier the convergence.
• Apply a preconditioning matrix [M] that closely resembles the original coefficient matrix [A] to improve the eigenvalue distribution.
  – [M] transforms the original equation [A]{x}={b} into [A']{x'}={b'}, where [A']=[M]⁻¹[A] and {b'}=[M]⁻¹{b}.
  – The goal is for [A']=[M]⁻¹[A] to be close to the identity matrix.
  – The preconditioner may also be applied from the right, [A']=[A][M]⁻¹.
• Preconditioning is used for both dense and sparse matrices, but usually sparse matrices are the target.
54
Preconditioned Conjugate Gradient Method (PCG)

Compute r(0)= b-[A]x(0)
for i= 1, 2, …
  solve [M]z(i-1)= r(i-1)
  ρ(i-1)= r(i-1)・z(i-1)
  if i=1
    p(1)= z(0)
  else
    β(i-1)= ρ(i-1)/ρ(i-2)
    p(i)= z(i-1) + β(i-1) p(i-1)
  endif
  q(i)= [A]p(i)
  α(i)= ρ(i-1)/(p(i)・q(i))
  x(i)= x(i-1) + α(i)p(i)
  r(i)= r(i-1) - α(i)q(i)
  check convergence |r|
end

What actually has to be computed is z = [M]⁻¹ r.
• The ultimate preconditioner is the true inverse: [M]⁻¹ = [A]⁻¹ ([M] = [A]); this requires computing an "approximate inverse".
• Diagonal scaling: [M]⁻¹ = [D]⁻¹ ([M] = [D]) — simple = weak.
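The following is a compact sketch of the PCG loop above with the simplest choice [M] = diag(A), i.e. the diagonal scaling / point-Jacobi preconditioner of the next slide. The dense-matrix storage and the small test problem are illustrative assumptions.

#include <math.h>
#include <stdio.h>

#define N 4

int main(void)
{
  double A[N][N] = {{ 6,-1, 0,-1},{-1, 6,-1, 0},{ 0,-1, 6,-1},{-1, 0,-1, 6}};
  double b[N] = { 4, 4, 4, 4 };
  double x[N] = { 0, 0, 0, 0 };
  double r[N], z[N], p[N], q[N];
  double rho = 0.0, rho1 = 0.0, bnrm2 = 0.0;

  for (int i = 0; i < N; i++) {                 /* r(0) = b - A x(0) */
    r[i] = b[i];
    for (int j = 0; j < N; j++) r[i] -= A[i][j] * x[j];
    bnrm2 += b[i] * b[i];
  }

  for (int iter = 1; iter <= 100; iter++) {
    for (int i = 0; i < N; i++) z[i] = r[i] / A[i][i];  /* solve [M]z = r, M = diag(A) */

    rho = 0.0;                                          /* rho = (r, z) */
    for (int i = 0; i < N; i++) rho += r[i] * z[i];

    if (iter == 1)
      for (int i = 0; i < N; i++) p[i] = z[i];
    else {
      double beta = rho / rho1;
      for (int i = 0; i < N; i++) p[i] = z[i] + beta * p[i];
    }

    double pq = 0.0;                                    /* q = A p, alpha = rho/(p,q) */
    for (int i = 0; i < N; i++) {
      q[i] = 0.0;
      for (int j = 0; j < N; j++) q[i] += A[i][j] * p[j];
      pq += p[i] * q[i];
    }
    double alpha = rho / pq;

    double rnrm2 = 0.0;
    for (int i = 0; i < N; i++) {
      x[i] += alpha * p[i];
      r[i] -= alpha * q[i];
      rnrm2 += r[i] * r[i];
    }
    if (sqrt(rnrm2 / bnrm2) < 1.0e-12) { printf("converged: iter= %d\n", iter); break; }
    rho1 = rho;
  }

  printf("x= %f %f %f %f\n", x[0], x[1], x[2], x[3]);
  return 0;
}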
55
56
Diagonal scaling / point-Jacobi preconditioning
• Use as the preconditioning matrix [M] the matrix consisting only of the diagonal entries of the original matrix:
  $[M] = \operatorname{diag}(D_1, D_2, \dots, D_N)$
  – diagonal scaling, point-Jacobi preconditioning
• In "solve [M]z(i-1) = r(i-1)" the inverse is then obtained trivially.
• Converges for simple problems.
57
ILU(0), IC(0)
• The most widely used preconditioners (for sparse matrices)
  – Incomplete LU factorization
  – Incomplete Cholesky factorization (for symmetric matrices)
• Incomplete direct methods
  – Even if the original matrix is sparse, its inverse is generally not sparse.
  – fill-in
  – ILU(0) / IC(0) keep the same nonzero pattern as the original matrix (no fill-in).
58
LU factorization: full (complete) LU factorization
• A kind of direct method
  – directly computes (something equivalent to) the inverse
  – since the factors, which act like the inverse, can be stored, computation time is saved when only the right-hand side changes
  – fill-in (nonzero values appearing where the original matrix had zeros) occurs during the factorization
• LU factorization
59
"Incomplete" LU factorization
• ILU factorization
  – Incomplete LU factorization
• Restricts the amount of fill-in and uses the result as a preconditioner
  – an incomplete inverse; a slightly weakened direct method
  – when no fill-in is allowed: ILU(0)
60
Solving linear systems by LU factorization
For an n×n matrix A, expressing A as the product below (or the L and U themselves) is called the LU factorization of A:
  $A = LU$
  $\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix} = \begin{pmatrix} 1 & & & \\ l_{21} & 1 & & \\ \vdots & & \ddots & \\ l_{n1} & l_{n2} & \cdots & 1 \end{pmatrix} \begin{pmatrix} u_{11} & u_{12} & \cdots & u_{1n} \\ & u_{22} & \cdots & u_{2n} \\ & & \ddots & \vdots \\ & & & u_{nn} \end{pmatrix}$
L: lower triangular part of matrix A (unit diagonal)
U: upper triangular part of matrix A
61
Matrix representation of a system of linear equations
General form of a system of n linear equations:
  $a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n = b_1$
  $a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n = b_2$
  $\;\vdots$
  $a_{n1}x_1 + a_{n2}x_2 + \cdots + a_{nn}x_n = b_n$
Matrix representation:
  $A x = b, \quad A = (a_{ij}), \; x = (x_1, \dots, x_n)^T, \; b = (b_1, \dots, b_n)^T$
62
Solving Ax = b using the LU factorization
1. Compute the LU factorization A = LU.
2. Solve Ly = b for y (easy!).
3. Solve Ux = y for x (easy!).
This x is the solution of Ax = b, since
  $Ax = LUx = Ly = b$
63
Solving Ly = b: forward substitution
With L unit lower triangular,
  $y_1 = b_1$
  $l_{21}y_1 + y_2 = b_2 \;\Rightarrow\; y_2 = b_2 - l_{21}y_1$
  $\;\vdots$
  $y_n = b_n - \sum_{i=1}^{n-1} l_{ni}\, y_i$
The unknowns are obtained one after another.
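The same recurrence written as a tiny, runnable C program; the 3×3 factor and right-hand side are illustrative assumptions.

#include <stdio.h>

/* Solve L y = b by forward substitution; L is unit lower triangular. */
int main(void)
{
  enum { N = 3 };
  double L[N][N] = {{1,0,0},{2,1,0},{3,2,1}};   /* illustrative example */
  double b[N]    = {1, 4, 10};
  double y[N];

  for (int i = 0; i < N; i++) {                 /* y(1), y(2), ... one after another */
    y[i] = b[i];
    for (int j = 0; j < i; j++) y[i] -= L[i][j] * y[j];
  }

  printf("y = %g %g %g\n", y[0], y[1], y[2]);   /* expected: 1 2 3 */
  return 0;
}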
64
Solving Ux = y: backward substitution
With U upper triangular,
  $x_n = y_n / u_{nn}$
  $x_{n-1} = (y_{n-1} - u_{n-1,n}\,x_n)/u_{n-1,n-1}$
  $\;\vdots$
  $x_1 = \left(y_1 - \sum_{j=2}^{n} u_{1j}\, x_j\right) / u_{11}$
The unknowns are obtained one after another.
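And the backward substitution as a matching, runnable C fragment; again the 3×3 factor and right-hand side are illustrative assumptions.

#include <stdio.h>

/* Solve U x = y by backward substitution; U is upper triangular. */
int main(void)
{
  enum { N = 3 };
  double U[N][N] = {{2,1,1},{0,3,2},{0,0,4}};   /* illustrative example */
  double y[N]    = {7, 12, 12};
  double x[N];

  for (int i = N - 1; i >= 0; i--) {            /* x(n), x(n-1), ... one after another */
    x[i] = y[i];
    for (int j = i + 1; j < N; j++) x[i] -= U[i][j] * x[j];
    x[i] /= U[i][i];
  }

  printf("x = %g %g %g\n", x[0], x[1], x[2]);   /* expected: 1 2 3 */
  return 0;
}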
65
Computing the LU factors
Comparing the entries of A = LU row by row and column by column:
  Row 1:    $u_{11} = a_{11},\; u_{12} = a_{12},\; \dots,\; u_{1n} = a_{1n}$
  Column 1: $l_{i1} u_{11} = a_{i1} \;\Rightarrow\; l_{i1} = a_{i1}/u_{11} \quad (i = 2, \dots, n)$
  Row 2:    $l_{21} u_{1j} + u_{2j} = a_{2j} \;\Rightarrow\; u_{2j} = a_{2j} - l_{21}u_{1j} \quad (j = 2, \dots, n)$
  Column 2: $l_{i2} = (a_{i2} - l_{i1}u_{12})/u_{22} \quad (i = 3, \dots, n)$
and so on: the defining equations are set up in the order row 1, column 1, row 2, column 2, …
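A short sketch of this factorization in C. It is computed here in the standard in-place, right-looking order, which produces the same L and U as the row/column order on the slide; the 4×4 example matrix is an illustrative assumption, and no pivoting is performed (matching the slides, but not robust for general matrices).

#include <stdio.h>

#define N 4

/* In-place Doolittle LU: after the call, a[][] holds U in the upper part
   and the strictly-lower part of L (the unit diagonal of L is not stored). */
static void lu_factorize(double a[N][N])
{
  for (int k = 0; k < N; k++) {
    /* row k of U is complete here: u_kj = a_kj - sum_{m<k} l_km u_mj       */
    for (int i = k + 1; i < N; i++) {
      a[i][k] /= a[k][k];                       /* l_ik = (a_ik - ...)/u_kk  */
      for (int j = k + 1; j < N; j++)
        a[i][j] -= a[i][k] * a[k][j];           /* update the trailing block */
    }
  }
}

int main(void)
{
  double a[N][N] = {{2, 1, 1, 0},
                    {4, 3, 3, 1},
                    {8, 7, 9, 5},
                    {6, 7, 9, 8}};              /* illustrative example      */
  lu_factorize(a);
  for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) printf("%7.3f ", a[i][j]);
    printf("\n");
  }
  return 0;
}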
66
Numerical example
[4×4 worked example A = LU: row 1 of U, column 1 of L, row 2 of U, and column 2 of L are computed in turn from the formulas of the previous slide.]
数値例(続き)
44
3433
242322
14131211
434241
3231
21
00000
0
1010010001
17407822
107624321
uuuuuuuuuu
lllll
lA
第3行
17,38
343424321431
333323321331
uuululuuulul
第3列 37 43334323421341 uululul
第4行(第4列) 21 4444344324421441 uuululul
1行、1列、2行、2列、・ ・ ・の順に求める式を作っていく.
68
Numerical example (continued)
In the end the 4×4 matrix factors as A = LU, with L unit lower triangular and U upper triangular (values shown on the slide).
69
Example: 5-point finite difference
[Figure: 3×4 grid of 12 unknowns; each node is coupled to its north, south, east, and west neighbors]
70
Example: 5-point finite difference
[Figure: the same 12-node grid, showing the 5-point coupling of an interior node]
71
Example: coefficient matrix
[Figure: the 12-node grid] The resulting system [A]{x} = {b} has the right-hand side {b} and the 12×12 coefficient matrix [A] shown below:
0.00
3.00
10.00
11.00
10.00
19.00
20.00
16.00
28.00
42.00
36.00
52.00
6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
-1.00 6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 -1.00 6.00 0.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00
-1.00 0.00 0.00 6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00
0.00 -1.00 0.00 -1.00 6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00
0.00 0.00 -1.00 0.00 -1.00 6.00 0.00 0.00 -1.00 0.00 0.00 0.00
0.00 0.00 0.00 -1.00 0.00 0.00 6.00 -1.00 0.00 -1.00 0.00 0.00
0.00 0.00 0.00 0.00 -1.00 0.00 -1.00 6.00 -1.00 0.00 -1.00 0.00
0.00 0.00 0.00 0.00 0.00 -1.00 0.00 -1.00 6.00 0.00 0.00 -1.00
0.00 0.00 0.00 0.00 0.00 0.00 -1.00 0.00 0.00 6.00 -1.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 -1.00 0.00 -1.00 6.00 -1.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -1.00 0.00 -1.00 6.00
72
Example: solution
The solution is {x} = (1, 2, 3, …, 12)ᵀ; the matrix [A], solution {x}, and right-hand side {b} are shown below:
6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
-1.00 6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 -1.00 6.00 0.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00
-1.00 0.00 0.00 6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00
0.00 -1.00 0.00 -1.00 6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00
0.00 0.00 -1.00 0.00 -1.00 6.00 0.00 0.00 -1.00 0.00 0.00 0.00
0.00 0.00 0.00 -1.00 0.00 0.00 6.00 -1.00 0.00 -1.00 0.00 0.00
0.00 0.00 0.00 0.00 -1.00 0.00 -1.00 6.00 -1.00 0.00 -1.00 0.00
0.00 0.00 0.00 0.00 0.00 -1.00 0.00 -1.00 6.00 0.00 0.00 -1.00
0.00 0.00 0.00 0.00 0.00 0.00 -1.00 0.00 0.00 6.00 -1.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 -1.00 0.00 -1.00 6.00 -1.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -1.00 0.00 -1.00 6.00
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
9.00
10.00
11.00
12.00
0.00
3.00
10.00
11.00
10.00
19.00
20.00
16.00
28.00
42.00
36.00
52.00
73
Fully LU-factored matrix (lu1.f)
Original matrix
LU-factored matrix: [L] and [U] displayed together; the unit diagonal of [L] is omitted
(fill-in has occurred: entries that were originally zero have become nonzero)
6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
-0.17 5.83 -1.00 -0.17 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 -0.17 5.83 -0.03 -0.17 -1.00 0.00 0.00 0.00 0.00 0.00 0.00
-0.17 -0.03 0.00 5.83 -1.03 0.00 -1.00 0.00 0.00 0.00 0.00 0.00
0.00 -0.17 -0.03 -0.18 5.64 -1.03 -0.18 -1.00 0.00 0.00 0.00 0.00
0.00 0.00 -0.17 0.00 -0.18 5.64 -0.03 -0.18 -1.00 0.00 0.00 0.00
0.00 0.00 0.00 -0.17 -0.03 -0.01 5.82 -1.03 -0.01 -1.00 0.00 0.00
0.00 0.00 0.00 0.00 -0.18 -0.03 -0.18 5.63 -1.03 -0.18 -1.00 0.00
0.00 0.00 0.00 0.00 0.00 -0.18 0.00 -0.18 5.63 -0.03 -0.18 -1.00
0.00 0.00 0.00 0.00 0.00 0.00 -0.17 -0.03 -0.01 5.82 -1.03 -0.01
0.00 0.00 0.00 0.00 0.00 0.00 0.00 -0.18 -0.03 -0.18 5.63 -1.03
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -0.18 0.00 -0.18 5.63
6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
-1.00 6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 -1.00 6.00 0.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00
-1.00 0.00 0.00 6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00
0.00 -1.00 0.00 -1.00 6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00
0.00 0.00 -1.00 0.00 -1.00 6.00 0.00 0.00 -1.00 0.00 0.00 0.00
0.00 0.00 0.00 -1.00 0.00 0.00 6.00 -1.00 0.00 -1.00 0.00 0.00
0.00 0.00 0.00 0.00 -1.00 0.00 -1.00 6.00 -1.00 0.00 -1.00 0.00
0.00 0.00 0.00 0.00 0.00 -1.00 0.00 -1.00 6.00 0.00 0.00 -1.00
0.00 0.00 0.00 0.00 0.00 0.00 -1.00 0.00 0.00 6.00 -1.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 -1.00 0.00 -1.00 6.00 -1.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -1.00 0.00 -1.00 6.00
74
Incompletely LU-factored matrix (no fill-in), lu2.f
Incompletely LU-factored matrix (no fill-in): [L] and [U] displayed together; unit diagonal of [L] omitted
Fully LU-factored matrix: [L] and [U] displayed together; unit diagonal of [L] omitted
(fill-in has occurred: entries that were originally zero have become nonzero)
6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
-0.17 5.83 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 -0.17 5.83 0.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00
-0.17 0.00 0.00 5.83 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00
0.00 -0.17 0.00 -0.17 5.66 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00
0.00 0.00 -0.17 0.00 -0.18 5.65 0.00 0.00 -1.00 0.00 0.00 0.00
0.00 0.00 0.00 -0.17 0.00 0.00 5.83 -1.00 0.00 -1.00 0.00 0.00
0.00 0.00 0.00 0.00 -0.18 0.00 -0.17 5.65 -1.00 0.00 -1.00 0.00
0.00 0.00 0.00 0.00 0.00 -0.18 0.00 -0.18 5.65 0.00 0.00 -1.00
0.00 0.00 0.00 0.00 0.00 0.00 -0.17 0.00 0.00 5.83 -1.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 -0.18 0.00 -0.17 5.65 -1.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -0.18 0.00 -0.18 5.65
6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
-0.17 5.83 -1.00 -0.17 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 -0.17 5.83 -0.03 -0.17 -1.00 0.00 0.00 0.00 0.00 0.00 0.00
-0.17 -0.03 0.00 5.83 -1.03 0.00 -1.00 0.00 0.00 0.00 0.00 0.00
0.00 -0.17 -0.03 -0.18 5.64 -1.03 -0.18 -1.00 0.00 0.00 0.00 0.00
0.00 0.00 -0.17 0.00 -0.18 5.64 -0.03 -0.18 -1.00 0.00 0.00 0.00
0.00 0.00 0.00 -0.17 -0.03 -0.01 5.82 -1.03 -0.01 -1.00 0.00 0.00
0.00 0.00 0.00 0.00 -0.18 -0.03 -0.18 5.63 -1.03 -0.18 -1.00 0.00
0.00 0.00 0.00 0.00 0.00 -0.18 0.00 -0.18 5.63 -0.03 -0.18 -1.00
0.00 0.00 0.00 0.00 0.00 0.00 -0.17 -0.03 -0.01 5.82 -1.03 -0.01
0.00 0.00 0.00 0.00 0.00 0.00 0.00 -0.18 -0.03 -0.18 5.63 -1.03
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -0.18 0.00 -0.18 5.63
75
Comparison of the solutions: slightly different
Incomplete LU factorization (lu2.f)
Full LU factorization (lu1.f)
0.92
1.75
2.76
3.79
4.46
5.57
6.66
7.25
8.46
9.66
10.54
11.83
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
9.00
10.00
11.00
12.00
6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
-0.17 5.83 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 -0.17 5.83 0.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00
-0.17 0.00 0.00 5.83 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00
0.00 -0.17 0.00 -0.17 5.66 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00
0.00 0.00 -0.17 0.00 -0.18 5.65 0.00 0.00 -1.00 0.00 0.00 0.00
0.00 0.00 0.00 -0.17 0.00 0.00 5.83 -1.00 0.00 -1.00 0.00 0.00
0.00 0.00 0.00 0.00 -0.18 0.00 -0.17 5.65 -1.00 0.00 -1.00 0.00
0.00 0.00 0.00 0.00 0.00 -0.18 0.00 -0.18 5.65 0.00 0.00 -1.00
0.00 0.00 0.00 0.00 0.00 0.00 -0.17 0.00 0.00 5.83 -1.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 -0.18 0.00 -0.17 5.65 -1.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -0.18 0.00 -0.18 5.65
6.00 -1.00 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
-0.17 5.83 -1.00 -0.17 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 -0.17 5.83 -0.03 -0.17 -1.00 0.00 0.00 0.00 0.00 0.00 0.00
-0.17 -0.03 0.00 5.83 -1.03 0.00 -1.00 0.00 0.00 0.00 0.00 0.00
0.00 -0.17 -0.03 -0.18 5.64 -1.03 -0.18 -1.00 0.00 0.00 0.00 0.00
0.00 0.00 -0.17 0.00 -0.18 5.64 -0.03 -0.18 -1.00 0.00 0.00 0.00
0.00 0.00 0.00 -0.17 -0.03 -0.01 5.82 -1.03 -0.01 -1.00 0.00 0.00
0.00 0.00 0.00 0.00 -0.18 -0.03 -0.18 5.63 -1.03 -0.18 -1.00 0.00
0.00 0.00 0.00 0.00 0.00 -0.18 0.00 -0.18 5.63 -0.03 -0.18 -1.00
0.00 0.00 0.00 0.00 0.00 0.00 -0.17 -0.03 -0.01 5.82 -1.03 -0.01
0.00 0.00 0.00 0.00 0.00 0.00 0.00 -0.18 -0.03 -0.18 5.63 -1.03
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -0.18 0.00 -0.18 5.63
76
ILU(0), IC(0) preconditioning
• An "incomplete" factorization that ignores fill-in entirely
  – reduces storage and computation
• Solving with it gives an "incomplete" solution, which is nevertheless not far from the true solution
  – problem dependent
• Sparse Matrices• Iterative Linear Solvers
− Preconditioning− Parallel Iterative Linear Solvers− Multigrid Method− Recent Technical Issues
• Example of Parallel MGCG• Ill-Conditioned Problems
77
78
• Both of convergence (robustness) and efficiency (single/parallel) are important
• Global communications needed
  – Mat-Vec (P2P communications, MPI_Isend/Irecv/Waitall): local data structure with HALO; effect of latency
  – Dot-Products (MPI_Allreduce)
  – Preconditioning (up to algorithm)
• Remedy for Robust Parallel ILU Preconditioner
  – Additive Schwarz Domain Decomposition
  – HID (Hierarchical Interface Decomposition, based on global nested dissection) [Henon & Saad 2007], ext. HID [KN 2010]
• Parallel "Direct" Solvers (e.g. SuperLU, MUMPS etc.)
Parallel Iterative Solvers
Local Data Structures for Parallel FEM/FDM using Krylov Iterative Solvers Example: 2D FDM Mesh (5-point stencil)
79
Example: 2D FDM Mesh (5-point stencil)4-regions/domains
80
Example: 2D FDM Mesh (5-point stencil)4-regions/domains
81
Example: 2D FDM Mesh (5-point stencil)meshes at domain boundary need info. neighboring domains
82
Example: 2D FDM Mesh (5-point stencil)meshes at domain boundary need info. neighboring domains
83
Example: 2D FDM Mesh (5-point stencil)comm. using “HALO (overlapped meshes)”
84
85
Red Lacquered Gate in 64 PEs40,624 elements, 54,659 nodes
k-METISLoad Balance= 1.03
edgecut = 7,563
p-METISLoad Balance= 1.00
edgecut = 7,738
Generalized communication table: send
• Send destinations
  – NeibPETot, NeibPE[neib]
• Message size for each destination
  – export_index[neib], neib= 0, NeibPETot-1
• "Boundary point" numbers
  – export_item[k], k= 0, export_index[NeibPETot]-1
• Message sent to each destination
  – SendBuf[k], k= 0, export_index[NeibPETot]-1
C86
Send (MPI_Isend/Irecv/Waitall)
[Figure: SendBuf is divided into contiguous sections for neib#0 … neib#3 by export_index[0..4]; each section has length BUFlength_e]

for (neib=0; neib<NeibPETot; neib++){
  for (k=export_index[neib]; k<export_index[neib+1]; k++){
    kk= export_item[k];
    SendBuf[k]= VAL[kk];
  }
}

for (neib=0; neib<NeibPETot; neib++){
  tag= 0;
  iS_e= export_index[neib];
  iE_e= export_index[neib+1];
  BUFlength_e= iE_e - iS_e;
  ierr= MPI_Isend (&SendBuf[iS_e], BUFlength_e, MPI_DOUBLE, NeibPE[neib], 0,
                   MPI_COMM_WORLD, &ReqSend[neib]);
}

MPI_Waitall(NeibPETot, ReqSend, StatSend);

Filling the send buffer: the export_item entries from export_index[neib] to export_index[neib+1]-1 are sent to the neib-th neighboring domain.
C87
MPI_Isend
• Sends "count" consecutive items from the send buffer "sendbuf", with tag "tag", to process "dest" in the communicator. The contents of the send buffer must not be modified until MPI_Waitall is called.
• MPI_Isend (sendbuf, count, datatype, dest, tag, comm, request)
  – sendbuf   (any type, IN)   starting address of the send buffer
  – count     (integer, IN)    number of items in the message
  – datatype  (integer, IN)    data type of the message
  – dest      (integer, IN)    rank of the destination process
  – tag       (integer, IN)    message tag, used to distinguish kinds of messages; usually 0 is fine; messages are matched by identical tags
  – comm      (MPI_Comm, IN)   communicator
  – request   (MPI_Request, OUT) communication handle, used by MPI_Waitall
                (array; its size is the number of MPI_Isend calls to be synchronized, usually the number of neighboring processes); C usage described later
C88
Generalized communication table: receive
• Receive sources
  – NeibPETot, NeibPE[neib]
• Message size from each source
  – import_index[neib], neib= 0, NeibPETot-1
• "External (halo) point" numbers
  – import_item[k], k= 0, import_index[NeibPETot]-1
• Message received from each source
  – RecvBuf[k], k= 0, import_index[NeibPETot]-1
C89
Receive (MPI_Isend/Irecv/Waitall)
[Figure: RecvBuf is divided into contiguous sections for neib#0 … neib#3 by import_index[0..4]; each section has length BUFlength_i]

for (neib=0; neib<NeibPETot; neib++){
  tag= 0;
  iS_i= import_index[neib];
  iE_i= import_index[neib+1];
  BUFlength_i= iE_i - iS_i;
  ierr= MPI_Irecv (&RecvBuf[iS_i], BUFlength_i, MPI_DOUBLE, NeibPE[neib], 0,
                   MPI_COMM_WORLD, &ReqRecv[neib]);
}

MPI_Waitall(NeibPETot, ReqRecv, StatRecv);

for (neib=0; neib<NeibPETot; neib++){
  for (k=import_index[neib]; k<import_index[neib+1]; k++){
    kk= import_item[k];
    VAL[kk]= RecvBuf[k];
  }
}

Copying from the receive buffer: the import_item entries from import_index[neib] to import_index[neib+1]-1 are received from the neib-th neighboring domain.
C90
MPI_Irecv
• Receives "count" consecutive items into the receive buffer "recvbuf", with tag "tag", from the specified source process in the communicator. The contents of the receive buffer must not be used until MPI_Waitall has been called.
• MPI_Irecv (recvbuf, count, datatype, source, tag, comm, request)
  – recvbuf   (any type, IN)   starting address of the receive buffer
  – count     (integer, IN)    number of items in the message
  – datatype  (integer, IN)    data type of the message
  – source    (integer, IN)    rank of the sending process
  – tag       (integer, IN)    message tag, used to distinguish kinds of messages; usually 0 is fine; messages are matched by identical tags
  – comm      (MPI_Comm, IN)   communicator
  – request   (MPI_Request, OUT) communication handle, used by MPI_Waitall
                (array; its size is the number of MPI_Irecv calls to be synchronized, usually the number of neighboring processes); C usage described later
C91
MPI_Waitall
• Used to synchronize the process when the one-to-one non-blocking communication routines MPI_Isend and MPI_Irecv are used.
• On the send side, the send buffer must not be modified before MPI_Waitall is called; on the receive side, the receive buffer must not be used before MPI_Waitall is called.
• If consistent, MPI_Isend and MPI_Irecv may be synchronized together by a single MPI_Waitall; in that case use the same array of communication handles for MPI_Isend/Irecv.
• It resembles MPI_Barrier but cannot be replaced by it. Depending on the implementation, "request" and "status" may not be updated correctly, and repeated MPI_Isend/Irecv calls may then become slow.
• MPI_Waitall (count, request, status)
  – count   (integer, IN)       number of MPI_Isend / MPI_Irecv calls to be synchronized
  – request (integer, IN/OUT)   communication handles used by MPI_Isend / MPI_Irecv (array size: count)
  – status  (MPI_Status, OUT)   array of status objects
    MPI_STATUS_SIZE: parameter defined in "mpif.h" / "mpi.h"; C usage described later
C92
References: Libraries (mainly for flat MPI)
• Talk by the Next Speaker (Tony Drummond)
• Trilinos
  – http://trilinos.sandia.gov/
• PETSc– http://www.mcs.anl.gov/petsc/
• GeoFEM– http://geofem.tokyo.rist.or.jp/
• ppOpen-HPC – http://ppopenhpc.cc.u-tokyo.ac.jp/
93
Preconditioning for Iterative Solvers
• A critical issue for both robustness and efficiency
• Libraries (e.g. PETSc, Trilinos, ppOpen-HPC) cover only
general ones (e.g. ILU(p))• Selection of preconditioner strongly depends on:
– numerical property of matrix– features of physics, PDE, boundary conditions, mat. property,
size of FEM mesh etc.• sometimes, problem specific preconditioning needed
• “Parallel” preconditioning is really an exciting research area, important for practical computing.
• All of computational scientists, computer scientists, and mathematicians must work hard for that under intensive collaboration
94
• Sparse Matrices• Iterative Linear Solvers
− Preconditioning− Parallel Iterative Linear Solvers− Multigrid Method− Recent Technical Issues
• Example of Parallel MGCG• Ill-Conditioned Problems
95
Around the multigrid in a single slide• Multigrid is a scalable method for solving linear equations. • Relaxation methods (smoother/smoothing operator in MG
world) such as Gauss-Seidel efficiently damp high-frequency error but do not eliminate low-frequency error.
• The multigrid approach was developed in recognition that this low-frequency error can be accurately and efficiently solved on a coarser grid.
• Multigrid method uniformly damps all frequencies of error components with a computational cost that depends only linearly on the problem size (=scalable).– Good for large-scale computations
• Multigrid is also a good preconditioning algorithm for Kryloviterative solvers.
96
Convergence of Gauss-Seidel & SOR
[Plot: residual vs. iteration number] Rapid convergence (high-frequency error: short wavelength)
97
Convergence of Gauss-Seidel & SOR
[Plot: residual vs. iteration number] Slow convergence (low-frequency error: long wavelength)
98
Around the multigrid in a single slide• Multigrid is a scalable method for solving linear equations. • Relaxation methods (smoother/smoothing operator in MG
world) such as Gauss-Seidel efficiently damp high-frequency error but do not eliminate low-frequency error.
• The multigrid approach was developed in recognition that this low-frequency error can be accurately and efficiently solved on a coarser grid.
• Multigrid method uniformly damps all frequencies of error components with a computational cost that depends only linearly on the problem size (=scalable).– Good for large-scale computations
• Multigrid is also a good preconditioning algorithm for Krylov iterative solvers.
99
Multigrid is scalableWeak Scaling: Problem Size/Core Fixed
for 3D Poisson Eqn’s (q)MGCG= Conjugate Gradient with Multigrid Preconditioning
[Plot: iterations vs. DOF (10^6–10^8) for ICCG and MGCG; the MGCG iteration count stays nearly constant]
100
Multigrid is scalableWeak Scaling: Problem Size/Core Fixed
Comp. time of MGCG for weak scaling is constant: => scalable
[Plot: iterations vs. DOF for ICCG and MGCG at 16, 32, 64, and 128 cores; MGCG iterations remain flat]
101
Procedure of Multigrid (1/3)
Multigrid is a scalable method for solving linear equations. Relaxation methods such as Gauss-Seidel efficiently damp high-frequency error but do not eliminate low-frequency error. The multigrid approach was developed in recognition that this low-frequency error can be accurately and efficiently solved on a coarser grid. This concept is explained here in the following simple 2-level method. If we have obtained the following linear system on a fine grid:

  A_F u_F = f

and A_C as the discrete form of the operator on the coarse grid, a simple coarse grid correction can be given by:

  u_F^(i+1) = u_F^(i) + R^T A_C^(-1) R ( f - A_F u_F^(i) )

where R^T is the matrix representation of linear interpolation from the coarse grid to the fine grid (prolongation operator) and R is called the restriction operator. Thus, it is possible to calculate the residual on the fine grid, solve the coarse grid problem, and interpolate the coarse grid solution on the fine grid.
102
Procedure of Multigrid (2/3)
This process can be described as follows:

1. Relax the equations on the fine grid and obtain the result u_F^(i) = S_F(A_F, f). This operator S_F (e.g., Gauss-Seidel) is called the smoothing operator (or smoother).
2. Calculate the residual term on the fine grid by r_F = f - A_F u_F^(i).
3. Restrict the residual term onto the coarse grid by r_C = R r_F.
4. Solve the equation A_C u_C = r_C on the coarse grid; the accuracy of the solution on the coarse grid affects the convergence of the entire multigrid system.
5. Interpolate (or prolong) the coarse grid correction onto the fine grid by u_C^(i) = R^T u_C.
6. Update the solution on the fine grid by u_F^(i+1) = u_F^(i) + u_C^(i).

A minimal code sketch of these six steps follows.
103
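The following is a minimal 1D sketch of one two-grid correction cycle for -u'' = f, assuming Gauss-Seidel both as the smoother and, with many sweeps, as the "coarse grid solver"; it is illustrative only and not the MGCG solver used in the experiments later in the slides.

#include <stdio.h>

#define NF 15                 /* fine-grid interior points (NF odd)      */
#define NC ((NF - 1) / 2)     /* coarse-grid interior points             */

static void gauss_seidel(int n, double h, double u[], const double f[], int sweeps)
{
  for (int s = 0; s < sweeps; s++)
    for (int i = 0; i < n; i++) {
      double ul = (i > 0)     ? u[i-1] : 0.0;
      double ur = (i < n - 1) ? u[i+1] : 0.0;
      u[i] = 0.5 * (h * h * f[i] + ul + ur);
    }
}

static void residual(int n, double h, const double u[], const double f[], double r[])
{
  for (int i = 0; i < n; i++) {
    double ul = (i > 0)     ? u[i-1] : 0.0;
    double ur = (i < n - 1) ? u[i+1] : 0.0;
    r[i] = f[i] - (2.0 * u[i] - ul - ur) / (h * h);
  }
}

int main(void)
{
  double hF = 1.0 / (NF + 1), hC = 2.0 * hF;
  double uF[NF], fF[NF], rF[NF], rC[NC], vC[NC], vF[NF];

  for (int i = 0; i < NF; i++) { uF[i] = 0.0; fF[i] = 1.0; }   /* f = 1 */

  gauss_seidel(NF, hF, uF, fF, 3);              /* 1. pre-smoothing              */
  residual(NF, hF, uF, fF, rF);                 /* 2. fine-grid residual         */

  for (int i = 0; i < NC; i++)                  /* 3. restriction (full weighting)*/
    rC[i] = 0.25 * (rF[2*i] + 2.0 * rF[2*i+1] + rF[2*i+2]);

  for (int i = 0; i < NC; i++) vC[i] = 0.0;     /* 4. "solve" on the coarse grid */
  gauss_seidel(NC, hC, vC, rC, 200);

  for (int i = 0; i < NF; i++) vF[i] = 0.0;     /* 5. prolongation (linear interp.)*/
  for (int i = 0; i < NC; i++) {
    vF[2*i+1] += vC[i];
    vF[2*i]   += 0.5 * vC[i];
    vF[2*i+2] += 0.5 * vC[i];
  }

  for (int i = 0; i < NF; i++) uF[i] += vF[i];  /* 6. correct the fine-grid solution */
  gauss_seidel(NF, hF, uF, fF, 3);              /*    post-smoothing                 */

  residual(NF, hF, uF, fF, rF);
  double rmax = 0.0;
  for (int i = 0; i < NF; i++) { double a = rF[i] < 0 ? -rF[i] : rF[i]; if (a > rmax) rmax = a; }
  printf("max residual after one two-grid cycle = %e\n", rmax);
  return 0;
}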
[Figure: two-level correction between fine and coarse grids]
  L_k w_k = F_k (linear equation, fine level)
  R_k = F_k - L_k w1_k;  v_k = w_k - w1_k,  L_k v_k = R_k
  R_{k-1} = I_k^{k-1} R_k (restriction)
  L_{k-1} v_{k-1} = R_{k-1} (linear equation, coarse level)
  v_k = I_{k-1}^{k} v_{k-1} (prolongation)
  w2_k = w1_k + v_k
where w1_k is the approximate solution, v_k the correction, I_k^{k-1} the restriction operator, I_{k-1}^{k} the prolongation operator, and w2_k the approximate solution obtained by multigrid.
104
Procedure of Multigrid (3/3)105
• Recursive application of this algorithm for 2-level procedure to consecutive systems of coarse-grid equations gives a multigrid V-cycle. If the components of the V-cycle are defined appropriately, the result is a method that uniformly damps all frequencies of error with a computational cost that depends only linearly on the problem size. − In other words, multigrid algorithms are scalable.
• In the V-cycle, starting with the finest grid, all subsequent coarser grids are visited only once. − In the down-cycle, smoothers damp oscillatory error components at different
grid scales. − In the up-cycle, the smooth error components remaining on each grid level
are corrected using the error approximations on the coarser grids. • Alternatively, in a W-cycle, the coarser grids are solved more
rigorously in order to reduce residuals as much as possible before going back to the more expensive finer grids.
105
[Figure: (a) V-cycle and (b) W-cycle schedules between the fine and coarse levels]
106
Multigrid as a Preconditioner
• Multigrid algorithms tend to be problem-specific solutions and less robust than preconditioned Krylov iterative methods such as the IC/ILU methods.
• Fortunately, it is easy to combine the best features of multigrid and Krylov iterative methods into one algorithm
  − multigrid-preconditioned Krylov iterative methods.
• The resulting algorithm is robust, efficient and scalable.
• Multigrid solvers and Krylov iterative solvers preconditioned by multigrid are intrinsically suitable for parallel computing.
Geometric and Algebraic Multigrid108
• One of the most important issues in multigrid is the construction of the coarse grids.
• There are 2 basic multigrid approaches− geometric and algebraic
• In geometric multigrid, the geometry of the problem is used to define the various multigrid components.
• In contrast, algebraic multigrid methods use only the information available in the linear system of equations, such as matrix connectivity.
• Algebraic multigrid method (AMG) is suitable for applications with unstructured grids.
• Many tools for both geometric and algebraic methods on unstructured grids have been developed.
108
“Dark Side” of Multigrid Method109
• Its performance is excellent for well-conditioned simple problems, such as homogeneous Poisson equations.
• But convergence could be worse for ill-conditioned problems.
• Extension of applicability of multigrid method is an active research area.
109
References• Briggs, W.L., Henson, V.E. and McCormick, S.F. (2000)
A Multigrid Tutorial Second Edition, SIAM
• Trottemberg, U., Oosterlee, C. and Schüller, A. (2001) Multigrid, Academic Press
• https://computation.llnl.gov/casc/• Hypre (AMG Library)
– https://computation.llnl.gov/casc/linear_solvers/sls_hypre.html
110
• Sparse Matrices• Iterative Linear Solvers
− Preconditioning− Parallel Iterative Linear Solvers− Multigrid Method− Recent Technical Issues
• Example of Parallel MGCG• Ill-Conditioned Problems
111
Key-Issues for Appl’s/Algorithms towards Post-Peta & Exa Computing
Jack Dongarra (ORNL/U. Tennessee) at ISC 2013
• Hybrid/Heterogeneous Architecture– Multicore + GPU/Manycores (Intel MIC/Xeon Phi)
• Data Movement, Hierarchy of Memory
• Communication/Synchronization Reducing Algorithms• Mixed Precision Computation• Auto-Tuning/Self-Adapting• Fault Resilient Algorithms• Reproducibility of Results
112
113
• Communication overhead becomes significant• Communication-Computation Overlap
– Not so effective for Mat-Vec operations• Communication Avoiding/Reducing Algorithms
• OpenMP/MPI Hybrid Parallel Programming Model– (Next section)
Recent Technical Issues in Parallel Iterative Solvers
114
Communication overhead becomes larger as the node/core count increases.
Weak Scaling: MGCG on T2K Tokyo
[Bar chart: fraction of time spent in communication vs. computation for 64–8,192 cores; the communication share grows with core count]
Comm.-Comp. Overlapping115
Internal Meshes
External (HALO) Meshes
Comm.-Comp. Overlapping
Internal Meshes / External (HALO) Meshes / Internal Meshes on Boundary
Mat-Vec operations:
• Overlap the computation on purely internal meshes with the import of external (halo) meshes.
• Then compute the internal meshes on the boundary.
• Difficult for IC/ILU on Hybrid.
Communication Avoiding/Reducing Algorithms for Sparse Linear Solvers
• Krylov Iterative Method without Preconditioning– Demmel, Hoemmen, Mohiyuddin etc. (UC Berkeley)
• s-step method– Just one P2P communication for each Mat-Vec during s
iterations. Convergence becomes unstable for large s.– matrix powers kernel: Ax, A2x, A3x ...
• additional computations needed
• Communication Avoiding ILU0 (CA-ILU0) [Moufawad & Grigori, 2013]– First attempt to CA preconditioning– Nested dissection reordering for limited geometries (2D FDM)
117
Pipelined CG [Ghysels et al. 2013]
[Plot: time per MPI_Allreduce vs. number of MPI processes (100–100,000) for Flat MPI, HB 4x4, HB 8x2, HB 16x1 — overhead by MPI_Allreduce for the MGCG case]
• Overhead by global collective comm. (e.g. MPI_Allreduce)• Change original Krylov solver so that comm. overhead by
global coll. comm. are hidden by overlapping with other computations (Gropp’s asynch. CG, s-step, pipelined ...)
• “MPI_Iallreduce” in MPI-3 specification
118
Comm. Avoiding Krylov Iterative Methods using “Matrix Powers Kernel”
119
Avoiding Communication in Sparse Matrix Computations. James Demmel, Mark Hoemmen, Marghoob Mohiyuddin, and Katherine Yelick. , 2008 IPDPS
Required Information of Local Meshes for s-step CA computations (2D 5pt.)
120
s=1(original)
s=2 s=3
• Sparse Matrices• Iterative Linear Solvers
− Preconditioning− Parallel Iterative Linear Solvers− Multigrid Method− Recent Technical Issues
• Example of Parallel MGCG• Ill-Conditioned Problems
121
Nakajima, K., Optimization of Serial and Parallel Communications for Parallel Geometric Multigrid Method, Proceedings of the 20th IEEE International Conference for Parallel and Distributed Systems (ICPADS 2014) (Winner of Best Paper Award), Hsin-Chu, Taiwan, 2014
Reference122
• Optimization of Parallel MGCG– Conjugate Gradient Solver with Multigrid Preconditioning– OpenMP/MPI Hybrid Parallel Programming Model– Efficiency & Convergence
• Parallel Multigrid– “Coarse Grid Solver” is important
Efficiency & Convergence− HPCG (High-Performance Conjugate Gradients)
MGCG by Geometric Multigrid
• Communications are expensive– Serial Communications
Data Transfer through Hierarchical Memory: Sparse Matrix Operations– Parallel Communications
Message Passing through Network
Motivation123
• 3D Groundwater Flow via Heterogeneous Porous Media
  − Poisson's equation: ∇·(λ(x,y,z)∇φ) + q = 0
  − Randomly distributed water conductivity
  − Finite-Volume Method on Cubic Voxel Mesh
  − λ = 10^-5 to 10^+5, Average: 1.00
  – MGCG Solver (Geometric)
Parallel MG Solvers: pGW3D-FVM
124
125
• Preconditioned CG Method– Multigrid Preconditioning (MGCG)– IC(0) for Smoothing Operator (Smoother): good for ill-
conditioned problems• Parallel Geometric Multigrid Method
– 8 fine meshes (children) form 1 coarse mesh (parent) in isotropic manner (octree)
– V-cycle– Domain-Decomposition-based: Localized Block-Jacobi,
Overlapped Additive Schwarz Domain Decomposition (ASDD)
Linear Solvers
Fujitsu PRIMEHPC FX10 (Oakleaf-FX)at the U. Tokyo
• SPARC64 Ixfx (4,800 nodes, 76,800 cores)• Commercial version of K computerx• Peak: 1.13 PFLOPS (1.043 PF, 26th, 41th TOP 500 in 2013 June.)• Memory BWTH 398 TB/sec.
126
• 3D Groundwater Flow via Heterogeneous Porous Media− Poisson’s equation− Randomly distributed water conductivity− Finite‐Volume Method on Cubic Voxel Mesh− =10‐5~10+5, Average: 1.00– MGCG Solver (Geometric)
Parallel MG Solvers: pGW3D‐FVM127
∇·(λ(x,y,z)∇φ) + q = 0
Computations on Fujitsu FX10• Fujitsu PRIMEHPC FX10 at U.Tokyo (Oakleaf-FX)
– Commercial version of K – 16 cores/node, flat/uniform access to memory– 4,800 nodes 1.043 PF (48th, TOP 500, 2014 Nov.)
128
• Up to 4,096 nodes (65,536 cores) (Large-Scale HPC Challenge) – Max 17,179,869,184 unknowns– Flat MPI, HB 4x4, HB 8x2, HB 16x1
• Weak Scaling• Strong Scaling
– 1283×8= 16,777,216 unknowns, from 8 to 4,096 nodes
• Network Topology is not specified– 1D
[Figure: one compute node with 16 cores, each with its own L1 cache, sharing an L2 cache and memory]
129
HB M x N
[Figure: the same 16-core node]
M = number of OpenMP threads per MPI process, N = number of MPI processes per node.
130
HB 8 x 2
8 OpenMP threads per MPI process, 2 MPI processes per node.
[Figure: the 16-core node split into two MPI processes of 8 threads each]
Flat MPI vs. Hybrid
Hybrid: hierarchical structure — the cores of a node share memory, with one (or a few) MPI processes per node and OpenMP threads inside.
Flat-MPI: each core is an independent MPI process with its own memory image.
[Figure: cores and memory arranged per node for the two programming models]
131
• Krylov Iterative Solvers– Dot Products– SMVP– DAXPY– Preconditioning
• IC/ILU Factorization, Forward/Backward Substitution– Global Data Dependency– Reordering needed for parallelism ([KN 2003] on the Earth
Simulator, KN@CMCIM-2002)– Multicoloring, RCM, CM-RCM
Reordering for extracting parallelismin each domain (= MPI Process)
132
Parallelization of ICCG

IC factorization:
do i= 1, N
  VAL= D(i)
  do k= indexL(i-1)+1, indexL(i)
    VAL= VAL - (AL(k)**2) * W(itemL(k),DD)
  enddo
  W(i,DD)= 1.d0/VAL
enddo

Forward substitution:
do i= 1, N
  WVAL= W(i,Z)
  do k= indexL(i-1)+1, indexL(i)
    WVAL= WVAL - AL(k) * W(itemL(k),Z)
  enddo
  W(i,Z)= WVAL * W(i,DD)
enddo
(Global) Data Dependency: Writing/reading may occur simultaneously, hard to parallelize
(The same IC factorization and forward substitution loops as above.)
134
OpenMP for SpMV: straightforward — NO data dependency

!$omp parallel do private(ip,i,VAL,k)
do ip= 1, PEsmpTOT
  do i= INDEX(ip-1)+1, INDEX(ip)
    VAL= D(i)*W(i,P)
    do k= indexL(i-1)+1, indexL(i)
      VAL= VAL + AL(k)*W(itemL(k),P)
    enddo
    do k= indexU(i-1)+1, indexU(i)
      VAL= VAL + AU(k)*W(itemU(k),P)
    enddo
    W(i,Q)= VAL
  enddo
enddo
135
Ordering MethodsElements in “same color” are independent: to be parallelized
Talk by Y.Saad’s group in SIAM PP14
[Figures: an 8×8 grid numbered by three orderings —
  RCM (Reverse Cuthill-McKee),
  MC (Multicoloring, 4 colors),
  CM-RCM (Cyclic MC + RCM, 4 colors)]
• MC: good parallel efficiency with a smaller number of colors, bad convergence; better convergence with many colors, but synchronization overhead.
• RCM: good convergence, poor parallel efficiency, synchronization overhead.
• CM-RCM: reasonable convergence & efficiency.
A minimal coloring sketch follows.
136
• 3D Groundwater Flow via Heterogeneous Porous Media
  − Poisson's equation: ∇·(λ(x,y,z)∇φ) + q = 0
  − Randomly distributed water conductivity
  − Finite-Volume Method on Cubic Voxel Mesh
  − λ = 10^-5 to 10^+5, Average: 1.00
  – MGCG Solver
Parallel MG Solvers: pGW3D-FVM
• Storage format of coefficient matrices (serial comm.)
  – CRS (Compressed Row Storage)
  – ELL (Ellpack-Itpack)
• Comm./Synch. Reducing MG (parallel comm.)
  – Coarse Grid Aggregation (CGA)
  – Hierarchical CGA: Communication Reducing CGA
ELL: Fixed Loop-length, Nice for Pre-fetching
[Figure: a small sparse matrix stored in (a) CRS with variable row lengths and in (b) ELL with a fixed row length padded by zeros]
(a) CRS (b) ELL
138
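The following is a minimal sketch of an ELL-style SpMV in C: every row keeps a fixed number of off-diagonal slots, and short rows are padded with zero coefficients so the inner loop has a fixed length (which is what makes prefetching and SIMD easy). The small matrix, NCOL, and array names are illustrative assumptions, not the pGW3D-FVM data structure.

#include <stdio.h>

#define N    4
#define NCOL 3    /* fixed number of off-diagonal slots per row */

int main(void)
{
  double diag[N]       = { 4.0, 5.0, 6.0, 7.0 };
  /* item[i][j], amat[i][j]: j-th off-diagonal slot of row i; padded slots
     carry a zero value, so their column index (0) is harmless.            */
  int    item[N][NCOL] = {{1,3,0},{0,2,0},{1,0,0},{0,0,0}};
  double amat[N][NCOL] = {{-1.0,-2.0,0.0},{-1.0,-3.0,0.0},{-3.0,0.0,0.0},{-2.0,0.0,0.0}};
  double x[N] = {1.0, 2.0, 3.0, 4.0}, y[N];

  for (int i = 0; i < N; i++) {
    double val = diag[i] * x[i];
    for (int j = 0; j < NCOL; j++)        /* fixed loop length */
      val += amat[i][j] * x[item[i][j]];
    y[i] = val;
  }

  for (int i = 0; i < N; i++) printf("y[%d]= %6.2f\n", i, y[i]);
  return 0;
}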
Special Treatment for “Boundary” Meshesconnected to “Halo”
• Distribution of Lower/Upper Non-Zero Off-Diagonal Components
• If we adopt RCM (or CM) reordering ...
• Pure Internal Meshes– L: ~3, U: ~3
• Boundary Meshes– L: ~3, U: ~6
External MeshesInternal Meshes on Boundary
Pure Internal Meshes
x
yz
Pure Internal Meshes
Internal Meshes on Boundary
● Internal (lower)
● Internal (upper)
● External (upper)
139
Original ELL: Backward Substitution
Cache is not well utilized: IAUnew(6,N), AUnew(6,N) for all cells

do icol= NHYP(lev), 1, -1
  if (mod(icol,2).eq.1) then
!$omp parallel do private (ip,icel,j,SW)
    do ip= 1, PEsmpTOT
      do icel= SMPindex(icol-1,ip,lev)+1, SMPindex(icol,ip,lev)
        SW= 0.0d0
        do j= 1, 6
          SW= SW + AUnew(j,icel)*Rmg(IAUnew(j,icel))
        enddo
        Rmg(icel)= Rmg(icel) - SW*DDmg(icel)
      enddo
    enddo
  else
!$omp parallel do private (ip,icel,j,SW)
    do ip= 1, PEsmpTOT
      do icel= SMPindex(icol-1,ip,lev)+1, SMPindex(icol,ip,lev)
        SW= 0.0d0
        do j= 1, 3
          SW= SW + AUnew(j,icel)*Rmg(IAUnew(j,icel))
        enddo
        Rmg(icel)= Rmg(icel) - SW*DDmg(icel)
      enddo
    enddo
  endif
enddo

IAUnew(6,N), AUnew(6,N); the j=1,6 branch handles boundary cells, the j=1,3 branch pure internal cells.
140
Original ELL: Backward Subst.Cache is not well-utilized: IAUnew(6,N), Aunew(6,N)
Pure Internal CellsAUnew(6,N)
Boundary CellsAUnew(6,N)
141
Improved ELL: Backward Subst.Cache is well-utilized, separated: AUnew3/AUnew6Sliced ELL [Monakov et al. 2010] (for SpMV/GPU)
Pure Internal CellsAUnew3(3,N)
Boundary CellsAUnew6(6,N)
142
Improved ELL: Backward Substitution
Cache is well utilized; arrays separated into AUnew3/AUnew6

do icol= NHYP(lev), 1, -1
  if (mod(icol,2).eq.1) then
!$omp parallel do private (ip,icel,j,SW)
    do ip= 1, PEsmpTOT
      do icel= SMPindex(icol-1,ip,lev)+1, SMPindex(icol,ip,lev)
        SW= 0.0d0
        do j= 1, 6
          SW= SW + AUnew6(j,icel)*Rmg(IAUnew6(j,icel))
        enddo
        Rmg(icel)= Rmg(icel) - SW*DDmg(icel)
      enddo
    enddo
  else
!$omp parallel do private (ip,icel,j,SW)
    do ip= 1, PEsmpTOT
      do icel= SMPindex(icol-1,ip,lev)+1, SMPindex(icol,ip,lev)
        SW= 0.0d0
        do j= 1, 3
          SW= SW + AUnew3(j,icel)*Rmg(IAUnew3(j,icel))
        enddo
        Rmg(icel)= Rmg(icel) - SW*DDmg(icel)
      enddo
    enddo
  endif
enddo

IAUnew3(3,N), AUnew3(3,N) for pure internal cells; IAUnew6(6,N), AUnew6(6,N) for boundary cells.
143
Improved ELL: Backward Subst.Cache is well-utilized, separated: AUnew3/AUnew6
144
(The same loops as on the previous slide, with only the even/odd color branches written in the opposite order: IAUnew3(3,N)/AUnew3(3,N) for pure internal cells, IAUnew6(6,N)/AUnew6(6,N) for boundary cells.)
Analyses by the detailed profiler of Fujitsu FX10, single node, Flat MPI, RCM (multigrid part), 64³ cells/core, 1 node
145

              Instruction   L1D miss    L2 miss     SIMD op. ratio   GFLOPS
Original CRS  1.53×10^9     2.32×10^7   1.67×10^7   30.14%           6.05
Original ELL  4.91×10^8     1.67×10^7   1.27×10^7   93.88%           6.99
Improved ELL  4.91×10^8     1.67×10^7   9.14×10^6   93.88%           8.56
Original Approach (restriction)Coarse grid solver at a single core [KN 2010]
146
Level=1
Level=2
Level=m-3
Level=m-2
Level=m-1
Level=mMesh # foreach MPI= 1
Fine
Coarse Coarse grid solver on a single core (further multigrid)
Original Approach (restriction)Coarse grid solver at a single core [KN 2010]
147
Level=1
Level=2
Level=m-3
Level=m-2
Level=m-1
Level=mMesh # foreach MPI= 1
Fine
Coarse Coarse grid solver on a single core (further multigrid)
Communication Overheadat Coarser Levels
Coarse Grid Aggregation (CGA)Coarse Grid Solver is multithreaded [KN 2012]
148
Level=1
Level=2
Level=m-3
Fine
Coarse
Coarse grid solver on a single MPI process (multi-threaded, further multigrid)
• Communication overhead could be reduced
• Coarse grid solver is more expensive than original approach.
• If process number is larger, this effect might be significant
Level=m-2
ResultsCASE Matrix Coarse Grid
C0 CRS Single Core
C1 ELL (original) Single Core
C2 ELL (original) CGA
C3 ELL (new) CGA
C4 ELL (new) hCGA
Class Size
Weak Scaling 643 cells/core 262,144
Strong Scaling 2563 cells 16,777,216
149
150
Results at 4,096 nodes (1.72x1010 DOF)(Fujitsu FX10: Oakleaf‐FX): HB 8x2
lev: switching level to “coarse grid solver”, Opt. Level= 7
■ Parallel■ Serial/Redundant
Fine
Coarse
0.0
5.0
10.0
15.0
20.0
ELL-CGA,lev=6: 51
ELL-CGA,lev=7: 55
ELL-CGA,lev=8: 60
ELL: 65,(NO CGA)
CRS: 66,(NO CGA)
sec.
RestCoarse Grid SolverMPI_AllgatherMPI_Isend/Irecv/Allreduce
C1C2 C0C2 C2
Matrix Coarse Grid
C0 CRS Single Core
C1 ELL (org) Single Core
C2 ELL (org) CGA
C3 ELL (sliced) CGA
151
Weak Scaling: ~4,096 nodesup to 17,179,869,184 meshes (643 meshes/core)
DOWN is GOOD
0.00
5.00
10.00
15.00
20.00
100 1000 10000 100000
sec.
CORE#
HB 8x2:C0HB 8x2:C1HB 8x2:C2HB 8x2:C3
5.0
7.5
10.0
12.5
15.0
100 1000 10000 100000
sec.
CORE#
Flat MPI:C3HB 4x4:C3HB 8x2:C3HB 16x1:C3
Matrix Coarse Grid
C0 CRS Single Core
C1 ELL (org) Single Core
C2 ELL (org) CGA
C3 ELL (sliced) CGA
152
Weak Scaling: C3Results at 4,096 nodes (1.72x1010 DOF)
0.0
5.0
10.0
15.0
Flat MPI:C3:64
HB 4x4:C3:59
HB 8x2:C3:55
HB 16x1:C3:55
sec.
RestCoarse Grid SolverMPI_AllgatherMPI_Isend/Irecv/Allreduce
153
Weak Scaling: C2 (with CGA)Time for Coarse Grid Solver
Efficiency of coarse grid solver for HB 16x1 is x256 of that of flat MPI (1/16 problem size, x16 resource for coarse grid solver)
0.00
1.00
2.00
3.00
4.00
1024 2048 4096 8192 16384 32768 49152 65536
sec.
CORE#
Flat MPI HB 4x4HB 8x2 HB 16x1
Summary so far ...• “Coarse Grid Aggregation (CGA)” is effective for
stabilization of convergence at O(104) cores for MGCG– Smaller number of parallel domains– HB 8x2 is the best at 4,096 nodes– Flat MPI, HB 4x4
• Coarse grid solvers are more expensive, because their number of MPI processes are more than those of HB 8x2 and HB 16x1.
• ELL format is effective !– C0 (CRS) -> C1 (ELL-org.): +20-30%– C2 (ELL-org)-> C3(ELL-new): +20-30%– C0 -> C3: +80-90%
• Coarse Grid Solver – (May be) very expensive for cases with more than O(105) cores – Memory of a single node is not enough– Multiple nodes should be utilized for coarse grid solver 154
Matrix Coarse Grid
C0 CRS Single Core
C1 ELL (org) Single Core
C2 ELL (org) CGA
C3 ELL (sliced) CGA
Hierarchical CGA: Comm. Reducing MGReduced number of MPI processes[KN 2013]
155
Level=1
Level=2
Level=m-3
Level=m-3
Fine
Coarse
Level=m-2
Coarse grid solver on a single MPI process (multi-threaded, further multigrid)
hCGA: Related Work• Not a new idea, but very few implementations.
– Not effective for peta-scale systems (Dr. U.M.Yang (LLNL), developer of Hypre)
• Existing Works: Repartitioning at Coarse Levels– Lin, P.T., Improving multigrid performance for unstructured mesh
drift-diffusion simulations on 147,000 cores, International Journal for Numerical Methods in Engineering 91 (2012) 971-989 (Sandia)
– Sundar, H. et al, Parallel Geometric-Algebraic Multigrid on Unstructured Forests of Octrees, ACM/IEEE Proceedings of the 2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC12) (2012) (UT Austin)
– Flat MPI, Repartitioning if DOF < O(103) on each process
156
hCGA in the present work• Accelerate the coarser grid solver
– using multiple processes instead of a single process in CGA– Only 64 cells on each process of lev=6 in the figure
• Straightforward Approach– MPI_Comm_split, MPI_Gather, MPI_Bcast etc.
157
[Figure (same chart as shown earlier): elapsed-time breakdown (sec., 0-20) at 4,096 nodes into Rest / Coarse Grid Solver / MPI_Allgather / MPI_Isend/Irecv/Allreduce, for ELL-CGA with lev=6/7/8 (51/55/60 iterations) and ELL/CRS without CGA (65/66 iterations)]
158
Weak Scaling: up to 4,096 nodes, 17,179,869,184 meshes (64^3 meshes/core)
DOWN is GOOD
CASE legend: C0 = CRS / Single Core, C1 = ELL (org) / Single Core, C2 = ELL (org) / CGA, C3 = ELL (sliced) / CGA, C4 = ELL (sliced) / hCGA
[Figure: weak scaling, elapsed time (sec., 5-15) vs. core count (10^2-10^5) for Flat MPI:C3, Flat MPI:C4, HB 4x4:C4, HB 8x2:C3, HB 16x1:C3]
[Figure: elapsed time (sec., 0-15) for Flat MPI / HB 4x4 / HB 8x2 / HB 16x1 with C3 and C4 at 512 and 4,096 nodes; the x1.61 improvement is annotated for Flat MPI at 4,096 nodes]
Strong Scaling at 4,096 nodes: 268,435,456 meshes (16^3 meshes/core at 4,096 nodes)
Parallel performance is normalized by Flat MPI/ELL (C3) at 8 nodes (128 cores) = 100%
159
CASE legend: C0 = CRS / Single Core, C1 = ELL (org) / Single Core, C2 = ELL (org) / CGA, C3 = ELL (sliced) / CGA, C4 = ELL (sliced) / hCGA
[Figure: parallel performance (%, 0-120) vs. core count (1,024-65,536) for Flat MPI:C3, Flat MPI:C4, HB 8x2:C3, HB 8x2:C4; the x6.27 improvement is annotated for Flat MPI with hCGA]
Summary
• hCGA is effective, but not so significant (except for flat MPI)
  – Flat MPI: x1.61 for weak scaling, x6.27 for strong scaling at 4,096 nodes of Fujitsu FX10
  – hCGA will be effective for HB 16x1 with more than 2.50x10^5 nodes (= 4.00x10^6 cores) of FX10 (= 60 PFLOPS)
  – The effect of the coarse grid solver is significant for Flat MPI with >10^3 nodes; its communication overhead has been reduced by hCGA
• Future/On-Going Works and Open Problems
  – Improvement of hCGA
    • overhead by MPI_Allreduce etc. -> P2P communication
  – Algorithms
    • CA-Multigrid (for coarser levels), CA-SPAI, pipelined methods
  – Strategy for automatic selection
    • switching level, number of processes for hCGA, optimum color #
    • effects on convergence
  – More flexible ELL for unstructured grids
  – Xeon Phi clusters
    • Hybrid 240(T)x1(P) is not the only choice
160
• Sparse Matrices
• Iterative Linear Solvers
  − Preconditioning
  − Parallel Iterative Linear Solvers
  − Multigrid Method
  − Recent Technical Issues
• Example of Parallel MGCG
• Ill-Conditioned Problems
161
Large-scale Simulations by Parallel FEM Procedures
• Unstructured grids with irregular data structures
• Large-scale sparse matrices
• Preconditioned parallel iterative solvers
• "Real-world" ill-conditioned problems
162
What are ill-conditioned problems?
• There are various kinds of ill-conditioned problems
  – For example, matrices derived from coupled Navier-Stokes equations are ill-conditioned even if the meshes are uniform.
• We have been focusing on 3D solid mechanics applications with:
  – heterogeneity
  – contact boundary conditions
  – BILU/BIC preconditioning
• The ideas can be extended to other fields.
163
Ill-Conditioned Problems: Heterogeneous Fields, Distorted Meshes
164
Contact Problems in Simulations of Earthquake Generation Cycle
165
Preconditioning Methods (of Krylov Iterative Solvers) for Real-World Applications
• are the most critical issues in scientific computing
• are based on
  – global information: condition number, matrix properties, etc.
  – local information: properties of elements (shape, size, ...)
• require knowledge of
  – background physics
  – applications
166
Technical Issues of "Parallel" Preconditioners in FEM
• Block Jacobi-type localized preconditioners
• Simple problems can easily converge with simple preconditioners, with excellent parallel efficiency.
• Difficult (ill-conditioned) problems cannot easily converge
  – The effect of domain decomposition on convergence is significant, especially for ill-conditioned problems.
    • Block Jacobi-type localized preconditioners
    • More domains, more iterations
  – There are some remedies (e.g. deep fill-ins, deep overlapping), but they are not efficient.
  – ASDD does not work well for really ill-conditioned problems.
167
168
Preconditioned Iterative Solver, e.g. CG method (Conjugate Gradient)
Key operations: dot products, matrix-vector multiplication, preconditioning, DAXPY

Compute r(0)= b-[A]x(0)
for i= 1, 2, ...
  solve [M]z(i-1)= r(i-1)
  rho(i-1)= r(i-1)^T z(i-1)
  if i=1
    p(1)= z(0)
  else
    beta(i-1)= rho(i-1)/rho(i-2)
    p(i)= z(i-1) + beta(i-1) p(i-1)
  endif
  q(i)= [A]p(i)
  alpha(i)= rho(i-1)/(p(i)^T q(i))
  x(i)= x(i-1) + alpha(i) p(i)
  r(i)= r(i-1) - alpha(i) q(i)
  check convergence |r|
end
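The same loop, written as a minimal serial Fortran sketch with a dense matrix and a diagonal [M] (point Jacobi) standing in for the ILU/IC preconditioners discussed next; this is illustrative only, not the solver used for the results in this tutorial.

!C  Minimal PCG sketch (assumption): dense [A], diagonal [M]= D as a
!C  stand-in preconditioner; illustrative only.
subroutine pcg_diag (N, A, D, b, x, eps, itmax)
  implicit none
  integer, intent(in)    :: N, itmax
  real(8), intent(in)    :: A(N,N), D(N), b(N), eps
  real(8), intent(inout) :: x(N)
  real(8) :: r(N), z(N), p(N), q(N)
  real(8) :: rho, rho0, alpha, beta, bnrm
  integer :: iter

  r   = b - matmul(A, x)                          ! r(0)= b - [A]x(0)
  bnrm= sqrt(dot_product(b, b))
  rho0= 1.0d0

  do iter= 1, itmax
    z  = r / D                                    ! solve [M]z= r
    rho= dot_product(r, z)
    if (iter.eq.1) then
      p= z
    else
      beta= rho / rho0
      p   = z + beta*p
    endif
    q    = matmul(A, p)                           ! q= [A]p
    alpha= rho / dot_product(p, q)
    x    = x + alpha*p
    r    = r - alpha*q
    if (sqrt(dot_product(r, r)).lt.eps*bnrm) exit ! check |r|
    rho0= rho
  enddo
end subroutine pcg_diag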
ILU: Global Operations (Forward/Backward Substitution) NOT suitable for parallel computing
Localized ILU Preconditioning
Forward substitution:  [L]{ẑ}= {r}
Backward substitution: [U]{z}= {ẑ}
!C
!C +----------------+
!C | {z}= [Minv]{r} |
!C +----------------+
!C===
      do i= 1, N
        W(i,Z)= W(i,R)
      enddo

!C-- forward substitution
      do i= 1, N
        WVAL= W(i,Z)
        do k= indexL(i-1)+1, indexL(i)
          WVAL= WVAL - AL(k) * W(itemL(k),Z)
        enddo
        W(i,Z)= WVAL / D(i)
      enddo

!C-- backward substitution
      do i= N, 1, -1
        SW= 0.0d0
        do k= indexU(i), indexU(i-1)+1, -1
          SW= SW + AU(k) * W(itemU(k),Z)
        enddo
        W(i,Z)= W(i,Z) - SW / D(i)
      enddo
!C===
The effects of external points are ignored in the preconditioner ("Block-Jacobi" localized preconditioning), so it is WEAKER than the original ILU: more PEs, more iterations.
169
Localized ILU Preconditioning
[Figure: matrix [A] and unknowns 1-6 distributed over PE#1-PE#4; only the coefficient blocks local to each PE are considered in the preconditioner, the coupling blocks to other PEs are ignored]
170
171
Overlapped Additive Schwarz Domain Decomposition Method
Stabilization of Localized Preconditioning: ASDD
[Figure: two overlapped subdomains Ω1 and Ω2]

Global operation:  z = M^(-1) r
Local operation:   z_Ω1 = M_Ω1^(-1) r_Ω1,  z_Ω2 = M_Ω2^(-1) r_Ω2
Global nesting correction (repeating -> stable):
z_Ω1^(n+1) = z_Ω1^(n) + M_Ω1^(-1) ( r_Ω1 - M_Ω1 z_Ω1^(n) - M_Γ1 z_Γ1^(n) )
z_Ω2^(n+1) = z_Ω2^(n) + M_Ω2^(-1) ( r_Ω2 - M_Ω2 z_Ω2^(n) - M_Γ2 z_Γ2^(n) )
(z_Γk: values at the overlapped (external) points of Ωk, provided by the neighboring domain)
172
Overlapped Additive Schwarz Domain Decomposition Method
Stabilization of Localized Preconditioning: ASDD, Global Nesting Correction (repeating -> stable)
[Figure: two overlapped subdomains Ω1 and Ω2]

z_Ω1^(n+1) = z_Ω1^(n) + M_Ω1^(-1) ( r_Ω1 - M_Ω1 z_Ω1^(n) - M_Γ1 z_Γ1^(n) )
z_Ω2^(n+1) = z_Ω2^(n) + M_Ω2^(-1) ( r_Ω2 - M_Ω2 z_Ω2^(n) - M_Γ2 z_Γ2^(n) )

Each correction is the solution of the local preconditioner equation
M_Ω1 δz_Ω1 = r~_Ω1,
where r~_Ω1 = r_Ω1 - M_Ω1 z_Ω1^(n) - M_Γ1 z_Γ1^(n) and δz_Ω1 = z_Ω1^(n+1) - z_Ω1^(n).
173
Overlapped Additive Schwarz Domain Decomposition Method
Effect of additive Schwarz domain decomposition for a solid mechanics example with 3x44^3 DOF on Hitachi SR2201; number of ASDD cycles per iteration = 1, ε = 10^-8

        NO Additive Schwarz           WITH Additive Schwarz
PE #   Iter. #   sec.   Speed-up    Iter. #   sec.   Speed-up
   1     204    233.7      -          144    325.6      -
   2     253    143.6     1.63        144    163.1     1.99
   4     259     74.3     3.15        145     82.4     3.95
   8     264     36.8     6.36        146     39.7     8.21
  16     262     17.4    13.52        144     18.7    17.33
  32     268      9.6    24.24        147     10.2    31.80
  64     274      6.6    35.68        150      6.5    50.07
174
Overlapped Additive Schwarz Domain Decomposition Method
Stabilization of Localized Preconditioning: ASDD
Local operation (forward/backward substitution), repeated as the global nesting correction (repeating -> stable):
[Figure: two overlapped subdomains Ω1 and Ω2]

do iterPRE= 1, iterPREmax
  (Ω1) calc. M_Ω1^(-1) ( r_Ω1 - M_Ω1 z_Ω1^(n) - M_Γ1 z_Γ1^(n) ) and update z_Ω1
  (Ω2) calc. M_Ω2^(-1) ( r_Ω2 - M_Ω2 z_Ω2^(n) - M_Γ2 z_Γ2^(n) ) and update z_Ω2
enddo
Technical Issues of "Parallel" Preconditioners for Iterative Solvers
3D solid mechanics; E: Young's modulus (regions with E=10^0 and E=10^3)
• If domain boundaries lie on "stronger" elements, convergence is very bad.
175
Remedies: Domain Decomposition
• Avoid "strong elements"
  – not practical
• Extended depth of overlapped elements
  – Selective Fill-ins, Selective Overlapping [KN 2007]
    • adaptive preconditioning/domain decomposition methods which utilize features of FEM procedures
• PHIDAL/HID (Hierarchical Interface Decomposition) [Henon & Saad 2007]
• Extended HID [KN 2010]
176
Extension of Depth of Overlapping
[Figure: four-subdomain example (PE#0-PE#3) with internal nodes (●), external nodes (●) and overlapped elements (■); the depth of overlapping is extended by including an additional layer of overlapped elements]
Cost for computation and communication may increase
177
HID: Hierarchical Interface Decomposition [Henon & Saad 2007]
• Multilevel domain decomposition
  – extension of nested dissection
• Non-overlapping at each level: connectors, separators
• Suitable for parallel preconditioning methods
[Figure: 2D example with four domains; interior nodes form level-1 connectors (belonging to a single domain: 0, 1, 2, 3), interface nodes form level-2 connectors (two domains: 0,1 / 0,2 / 1,3 / 2,3), and the cross point forms the level-4 connector (0,1,2,3)]
178
Parallel ILU for each Connector at each Level
• The unknowns are reordered according to their level numbers, from the lowest to the highest (a small counting-sort sketch of this reordering follows below).
• The block structure of the reordered matrix leads to natural parallelism if ILU/IC decompositions or forward/backward substitution processes are applied.
[Figure: reordered matrix with diagonal blocks for the level-1 connectors (0, 1, 2, 3), the level-2 connectors (0,1 / 0,2 / 2,3 / 1,3), and the level-4 connector (0,1,2,3)]
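A minimal sketch of this level-wise reordering, assuming each unknown already carries its HID level number; a counting sort keeps unknowns of the same level in their original relative order. Names are illustrative, not the actual implementation.

!C  Minimal sketch (assumption): build new IDs so that all level-1
!C  connectors come first, then level-2, etc.; within a level the
!C  original order is preserved (stable counting sort).
subroutine hid_reorder (N, NLEVmax, level, perm)
  implicit none
  integer, intent(in)  :: N, NLEVmax
  integer, intent(in)  :: level(N)      ! HID level of each unknown
  integer, intent(out) :: perm(N)       ! new ID of original unknown i
  integer :: num(NLEVmax), off(NLEVmax)
  integer :: i, lev

  num= 0
  do i= 1, N
    num(level(i))= num(level(i)) + 1
  enddo

  off(1)= 0
  do lev= 2, NLEVmax
    off(lev)= off(lev-1) + num(lev-1)
  enddo

  num= 0
  do i= 1, N
    lev= level(i)
    num(lev)= num(lev) + 1
    perm(i)= off(lev) + num(lev)
  enddo
end subroutine hid_reorder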
179
Results: 64 cores, Contact Problems, BILU(p)-(depth of overlapping), 3,090,903 DOF, GPBiCG
[Figure: elapsed time (sec., 0-350) and number of iterations (0-1500) for BILU(1), BILU(1+), BILU(2), each with BILU(p)-(0) (Block Jacobi), BILU(p)-(1), BILU(p)-(1+), and BILU(p)-HID]
180
Final goal of my recent work in this area after 2000
• Development of robust and efficient parallel preconditioning methods
• Construction of strategies for the optimum selection of preconditioners, partitioning, and related methods/parameters
• By utilizing both:
  – global information obtained from the derived coefficient matrices
  – very local information, such as information on each mesh in finite-element applications
181
Hetero 3D (1/2)
• Parallel FEM code (Flat MPI)
  – 3D linear elasticity problems in cube geometries with heterogeneity
  – SPD matrices
  – Young's modulus: 10^-6 ~ 10^+6
    • (Emin-Emax) controls the condition number
• Preconditioned iterative solvers
  – GPBiCG [Zhang 1997]
  – BILUT(p,d,t)
• Domain decomposition
  – Localized Block-Jacobi with Extended Overlapping (LBJ)
  – HID/Extended HID
[Figure: cube geometry and boundary conditions — Uz=0 @ z=Zmin, Ux=0 @ x=Xmin, Uy=0 @ y=Ymin, uniform distributed force in the z-direction @ z=Zmax; (Nx-1)x(Ny-1)x(Nz-1) elements, Nx x Ny x Nz nodes]
182
Hetero 3D (2/2)
• Based on the framework for parallel FEM procedures of GeoFEM
  – benchmark developed in the FP3C project under Japan-France collaboration
• Parallel mesh generation
  – fully parallel: each process generates its local mesh and assembles its local matrices
  – total number of vertices in each direction: (Nx, Ny, Nz)
  – number of partitions in each direction: (Px, Py, Pz)
  – total number of MPI processes is PxPyPz
  – each MPI process has (Nx/Px)(Ny/Py)(Nz/Pz) vertices
  – the spatial distribution of Young's modulus is given by an external file, which describes the heterogeneity on a 128^3 cube geometry
    • if Nx (or Ny or Nz) is larger than 128, the 128^3 pattern is repeated periodically in each direction (see the sketch below)
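A one-line sketch of this periodic repetition; Efile and the function name are illustrative assumptions, not the benchmark's actual file-reading code.

!C  Minimal sketch (assumption): Young's modulus at global vertex
!C  (i,j,k) is taken from the 128^3 heterogeneity field by periodic
!C  repetition in each direction.
real(8) function young_at (i, j, k, Efile)
  implicit none
  integer, intent(in) :: i, j, k                 ! global indices (1-based)
  real(8), intent(in) :: Efile(128,128,128)      ! 128^3 field from file
  young_at= Efile(mod(i-1,128)+1, mod(j-1,128)+1, mod(k-1,128)+1)
end function young_at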
183
BILUT(p,d,t)
• Incomplete LU factorization with threshold (ILUT)
• ILUT(p,d,t) [KN 2010]
  – p: maximum fill-level, specified before factorization
  – d, t: dropping tolerances applied before/after factorization
• Procedure (a dropping-step sketch follows below):
  (a) components of the initial matrix A with |Aij| < d (and by location) are dropped, giving the dropped matrix A'
  (b) ILU(p) factorization of A' gives (ILU)'
  (c) components of (ILU)' below the tolerance t are dropped, giving (ILUT)'
• The process (b) can be substituted by other factorization methods or by more powerful direct linear solvers, such as MUMPS, SuperLU, etc.
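A minimal sketch of the dropping steps (a)/(c): entries of one CRS-stored row whose magnitude is below the tolerance are discarded. The absolute-threshold rule and the array names are illustrative assumptions; the actual BILUT also applies location-based criteria, and the diagonal is kept separately as in the CRS layout used in this tutorial.

!C  Minimal sketch (assumption): drop off-diagonal entries of one
!C  CRS-stored row whose magnitude is below the tolerance d (step (a);
!C  step (c) is analogous with tolerance t).
subroutine drop_row (iS, iE, item, amat, d, nkeep, item2, amat2)
  implicit none
  integer, intent(in)  :: iS, iE            ! first/last index of the row
  integer, intent(in)  :: item(*)           ! column IDs
  real(8), intent(in)  :: amat(*)           ! values
  real(8), intent(in)  :: d                 ! dropping tolerance
  integer, intent(out) :: nkeep             ! number of kept entries
  integer, intent(out) :: item2(*)          ! kept column IDs
  real(8), intent(out) :: amat2(*)          ! kept values
  integer :: k

  nkeep= 0
  do k= iS, iE
    if (dabs(amat(k)).ge.d) then            ! keep |Aij| >= d
      nkeep= nkeep + 1
      item2(nkeep)= item(k)
      amat2(nkeep)= amat(k)
    endif
  enddo
end subroutine drop_row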
184
Preliminary Results
• Hardware
  – 16-240 nodes (160-3,840 cores) of Fujitsu PRIMEHPC FX10 (Oakleaf-FX), University of Tokyo
• Problem setting
  – 420x320x240 vertices (3.194x10^7 elements, 9.677x10^7 DOF)
  – strong scaling
  – effect of the thickness of overlapped zones: BILUT(p,d,t)-LBJ-X (X=1,2,3)
• Effect of d is small
• HID is slightly more robust than LBJ
185
BILUT(p,0,0) at 3,840 cores, NO dropping: Effect of Fill-in

Preconditioner        NNZ of [M]    Set-up (sec.)  Solver (sec.)  Total (sec.)  Iterations
BILUT(1,0,0)-LBJ-1    1.920x10^10    1.35            65.2           66.5         1916
BILUT(1,0,0)-LBJ-2    2.519x10^10    2.03            61.8           63.9         1288
BILUT(1,0,0)-LBJ-3    3.197x10^10    2.79            74.0           76.8         1367
BILUT(2,0,0)-LBJ-1    3.351x10^10    3.09            71.8           74.9         1339
BILUT(2,0,0)-LBJ-2    4.394x10^10    4.39            65.2           69.6          939
BILUT(2,0,0)-LBJ-3    5.631x10^10    5.95            83.6           89.6         1006
BILUT(3,0,0)-LBJ-1    6.468x10^10    9.34           105.2          114.6         1192
BILUT(3,0,0)-LBJ-2    8.523x10^10   12.7             98.4          111.1          823
BILUT(3,0,0)-LBJ-3    1.101x10^11   17.3            101.6          118.9          722
BILUT(1,0,0)-HID      1.636x10^10    2.24            60.7           62.9         1472
BILUT(2,0,0)-HID      2.980x10^10    5.04            66.2           71.7         1096

[NNZ] of [A]: 7.174x10^9
186
BILUT(p,0,0) at 3,840 cores, NO dropping: Effect of Overlapping
(Same table as on the previous slide.)
187
BILUT(p,0,0) at 3,840 cores, NO dropping
(Same table as on the previous two slides.)
188
BILUT(p,0,t) at 3,840 cores: Optimum Value of t

Preconditioner                 NNZ of [M]    Set-up (sec.)  Solver (sec.)  Total (sec.)  Iterations
BILUT(1,0,2.75x10^-2)-LBJ-1    7.755x10^9     1.36            45.0           46.3         1916
BILUT(1,0,2.75x10^-2)-LBJ-2    1.019x10^10    2.05            42.0           44.1         1383
BILUT(1,0,2.75x10^-2)-LBJ-3    1.285x10^10    2.81            54.2           57.0         1492
BILUT(2,0,1.00x10^-2)-LBJ-1    1.118x10^10    3.11            39.1           42.2         1422
BILUT(2,0,1.00x10^-2)-LBJ-2    1.487x10^10    4.41            37.1           41.5         1029
BILUT(2,0,1.00x10^-2)-LBJ-3    1.893x10^10    5.99            37.1           43.1          915
BILUT(3,0,2.50x10^-2)-LBJ-1    8.072x10^9     9.35            38.4           47.7         1526
BILUT(3,0,2.50x10^-2)-LBJ-2    1.063x10^10   12.7             35.5           48.3         1149
BILUT(3,0,2.50x10^-2)-LBJ-3    1.342x10^10   17.3             40.9           58.2         1180
BILUT(1,0,2.50x10^-2)-HID      6.850x10^9     2.25            38.5           40.7         1313
BILUT(2,0,1.00x10^-2)-HID      1.030x10^10    5.04            36.1           41.1         1064

[NNZ] of [A]: 7.174x10^9
189
Strong Scaling up to 3,840 cores
Speed-up is computed from the elapsed computation time (set-up + solver), normalized by BILUT(1,0,2.5x10^-2)-HID with 256 cores.
[Figure: speed-up (0-4.00x10^3) vs. core count and parallel performance (%, 70-130) vs. core count for BILUT(1,0,2.50e-2)-HID, BILUT(2,0,1.00e-2)-HID, BILUT(1,0,2.75e-2)-LBJ-2, BILUT(2,0,1.00e-2)-LBJ-2, BILUT(3,0,2.50e-2)-LBJ-2, with the ideal line]
190
Summary
• Hetero 3D
• Generally speaking, HID is more robust than LBJ with extended overlapping
• BILUT(p,d,t)
  – effect of d is not significant
  – [NNZ] of [M] depends on t (not on p)
  – BILUT(3,0,t0) > BILUT(2,0,t0) > BILUT(1,0,t0), although the cost of a single iteration is similar for each method
• Critical/optimum value of t
  – [NNZ] of [M] = [NNZ] of [A]
  – further investigation is needed
191
Future Works
• Theoretical/numerical investigation of the optimum t
  – eigenvalue analysis etc.
  – final goal: automatic selection BEFORE computation
  – (any related work?)
• Further investigation/development of LBJ & HID
• Comparison with other preconditioners/direct solvers
  – (various types of) low-rank approximation methods
• Hetero 3D will be released as a deliverable of the FP3C project soon
  – OpenMP/MPI Hybrid version
    • BILU(0) is already done; factorization is (was) the problem
  – extension to manycore/GPU clusters
192