Parallel multi-frontal solver
for isogeometric
finite element methods on GPU
Maciej Paszyński
Department of Computer Science,
AGH University of Science and Technology, Kraków, Poland
email: paszynsk@agh.edu.pl
Collaborators:
Maciej Woźniak
Krzysztof Kuźnik
AGH University of Science and Technology, Kraków, Poland
Victor Calo
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
David Pardo
The University of the Basque Country,
Basque Center for Applied Mathematics
and IKERBASQUE (Basque Foundation of Science), Bilbao, Spain
AGH UNIVERSITY OF SCIENCE AND TECHNOLOGY
DEPARTMENT OF COMPUTER SCIENCE, KRAKOW, POLAND
OUTLINE
1. Introduction
2. Graph grammar model for 1D
• Grammar productions expressing the generation of element elimination tree
• Grammar productions expressing the solver algorithm
3. Numerical results for 1D
• Execution time for p=1,2,3,4,5
• Comparison with MUMPS for p=1,2,3,4,5
4. Generalization to 2D
5. C^{p-1} vs C^0
6. Numerical results for 2D
• Execution time for p=1,2,3
• Comparison with MUMPS for p=1,2,3
7. Conclusions
INTRODUCTION
B-SPLINES BASED FINITE ELEMENT METHOD
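Strong formulation
$$-\frac{d}{dx}\left(A(x)\frac{du}{dx}\right) + B(x)\frac{du}{dx} + C(x)\,u = f(x), \qquad x \in (0,1)$$
$$u(0) = 0, \qquad A(1)\frac{du}{dx}(1) + \beta\,u(1) = \gamma$$
Weak formulation
Find $u \in V = \{u \in H^1(0,1) : u(0) = 0\}$ s.t.
$$b(v,u) = l(v) \quad \forall v \in V = \{v \in H^1(0,1) : v(0) = 0\}$$
$$b(v,u) = \int_0^1 \left( A(x)\frac{dv}{dx}\frac{du}{dx} + B(x)\,v(x)\frac{du}{dx} + C(x)\,v(x)\,u(x) \right) dx + \beta\,v(1)\,u(1)$$
$$l(v) = \int_0^1 f(x)\,v(x)\,dx + \gamma\,v(1)$$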
INTRODUCTION
B-SPLINES BASED FINITE ELEMENT METHOD
Using B-splines as basis functions
$$u(x) \approx \sum_i N_{i,p}(x)\,a_i, \qquad v(x) = N_{j,p}(x)$$
$$\sum_i b\left(N_{j,p}(x), N_{i,p}(x)\right) a_i = l\left(N_{j,p}(x)\right) \quad \forall j$$
[Figure: linear B-splines, contribution of b(N_{2,1}; N_{3,1})]
[Figure: quadratic B-splines, contribution of b(N_{2,2}; N_{3,2})]
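The basis functions N_{i,p} can be evaluated with the standard Cox-de Boor recursion. The sketch below is only an illustration: the function name `bspline_basis`, the zero-based indexing and the example knot vector are assumptions, not taken from the slides.

```python
def bspline_basis(i, p, knots, x):
    """Value of the B-spline N_{i,p}(x) via the Cox-de Boor recursion.

    i is zero-based; knots is a non-decreasing list of knot values.
    """
    if p == 0:
        # Piecewise constant: 1 on the half-open knot span, 0 elsewhere.
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left = right = 0.0
    d1 = knots[i + p] - knots[i]
    if d1 > 0.0:  # skip terms with a zero-length support (repeated knots)
        left = (x - knots[i]) / d1 * bspline_basis(i, p - 1, knots, x)
    d2 = knots[i + p + 1] - knots[i + 1]
    if d2 > 0.0:
        right = (knots[i + p + 1] - x) / d2 * bspline_basis(i + 1, p - 1, knots, x)
    return left + right


# Hypothetical example: quadratic (p = 2) B-splines on four elements of [0, 1].
p = 2
knots = [0.0, 0.0, 0.0, 0.25, 0.5, 0.75, 1.0, 1.0, 1.0]
n_basis = len(knots) - p - 1
values = [bspline_basis(i, p, knots, 0.3) for i in range(n_basis)]
```

At any point of [0, 1) at most p + 1 of these values are nonzero, which is why b(N_{j,p}, N_{i,p}) vanishes unless the two supports overlap.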
GRAPH GRAMMAR
GENERATION OF 1D ELIMINATION TREE
1D elimination tree obtained by executing productions (P1)-(P2)^2-(P2)^2-(P3)^6
GRAPH GRAMMAR PRODUCTIONS AS ATOMIC TASKS
We assign indices to grammar productions in order to localize
the places where the graph grammar productions were fired
(P1)-(P2)1-(P2)2-(P2)3-(P2)4-(P3)1-(P3)2-(P3)3-(P3)4-(P3)5-(P3)6
TRACE THEORY BASED SCHEDULER
Dependency relation for construction of the elimination tree
(P1)D{(P2)1,(P2)2}
(P2)1D{(P2)3,(P2)4}
(P2)3D{(P3)1,(P3)2}
(P2)4D{(P3)3,(P3)4}
(P2)2D{(P3)5,(P3)6}
Alphabet:
A = {(P1) , (P2)1 , (P2)2 , (P2)3 , (P2)4 , (P3)1 , (P3)2 , (P3)3 , (P3)4 , (P3)5 , (P3)6 }
TRACE THEORY BASED SCHEDULER
Dependency graph
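TRACE THEORY BASED SCHEDULER
Scheduling according to Foata Normal Form:
(P1)-(P2)1-(P2)2-(P2)3-(P2)4-(P3)1-(P3)2-(P3)3-(P3)4-(P3)5-(P3)6
[(P1)][(P2)1(P2)2][(P2)3(P2)4(P3)5(P3)6][(P3)1(P3)2(P3)3(P3)4]
Thus, the execution of the solver consists of several steps in which independent
tasks are executed concurrently, separated by synchronization barriers.
Foata Normal Form
$$\left[a^1_1 a^1_2 \cdots a^1_{l_1}\right]\left[a^2_1 a^2_2 \cdots a^2_{l_2}\right]\cdots\left[a^n_1 a^n_2 \cdots a^n_{l_n}\right]$$
$$a^k_i \in A \ \text{(alphabet)}, \qquad k = 1,\ldots,n, \quad i = 1,\ldots,l_k$$
$$(a^k_i, a^k_j) \in I \quad \text{for } i \neq j, \quad \text{where } I = A \times A \setminus D$$
$$\forall\, j = 1,\ldots,l_{k+1} \ \ \exists\, i \in \{1,\ldots,l_k\} : \ (a^k_i, a^{k+1}_j) \in D$$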
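The grouping into Foata classes can be sketched as a longest-path layering of the dependency graph. The code below is a minimal illustration, not the authors' scheduler; the task names such as `P2_1` and the function `foata_classes` are hypothetical, and the dependency relation is the one listed for the construction of the elimination tree.

```python
from collections import defaultdict


def foata_classes(alphabet, depends):
    """Group tasks into Foata classes.

    depends maps a task to the tasks that must run after it. A task's class
    index is one more than the largest class index among its predecessors.
    """
    preds = defaultdict(list)
    for before, afters in depends.items():
        for after in afters:
            preds[after].append(before)

    level = {}

    def level_of(task):
        if task not in level:
            level[task] = 1 + max((level_of(q) for q in preds[task]), default=-1)
        return level[task]

    classes = defaultdict(list)
    for task in alphabet:  # keep the alphabet order inside each class
        classes[level_of(task)].append(task)
    return [classes[k] for k in sorted(classes)]


alphabet = ["P1", "P2_1", "P2_2", "P2_3", "P2_4",
            "P3_1", "P3_2", "P3_3", "P3_4", "P3_5", "P3_6"]
depends = {
    "P1": ["P2_1", "P2_2"],
    "P2_1": ["P2_3", "P2_4"],
    "P2_3": ["P3_1", "P3_2"],
    "P2_4": ["P3_3", "P3_4"],
    "P2_2": ["P3_5", "P3_6"],
}
schedule = foata_classes(alphabet, depends)
```

For this input the four classes coincide with the bracketed groups of the Foata Normal Form given for the elimination tree.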
GRAMMAR BASED NUMERICAL INTEGRATION
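Using Gaussian quadrature, the integration over the domain can be substituted
by a weighted summation over Gauss points
$$b\left(N_{j,p}, N_{i,p}\right) = \int_0^1 \left( A(x)\frac{dN_{j,p}}{dx}\frac{dN_{i,p}}{dx} + B(x)\,N_{j,p}(x)\frac{dN_{i,p}}{dx} + C(x)\,N_{j,p}(x)\,N_{i,p}(x) \right) dx + \beta\,N_{j,p}(1)\,N_{i,p}(1)$$
$$l\left(N_{i,p}\right) = \int_0^1 f(x)\,N_{i,p}(x)\,dx + \gamma\,N_{i,p}(1)$$
$$\int_0^1 \left( A(x)\frac{dN_{i,p}}{dx}\frac{dN_{j,p}}{dx} + B(x)\frac{dN_{i,p}}{dx}N_{j,p}(x) + C(x)\,N_{i,p}(x)\,N_{j,p}(x) \right) dx \approx \sum_l w_l \left( A(x_l)\frac{dN_{i,p}(x_l)}{dx}\frac{dN_{j,p}(x_l)}{dx} + B(x_l)\frac{dN_{i,p}(x_l)}{dx}N_{j,p}(x_l) + C(x_l)\,N_{i,p}(x_l)\,N_{j,p}(x_l) \right)$$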
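As a small illustration of the weighted summation, the sketch below integrates the A, B and C terms of b(N_j, N_i) over a single element for the two linear B-splines that are nonzero there. The two-point Gauss-Legendre rule, the function name `element_matrix` and the constant coefficients are assumptions chosen for the example; the boundary term at x = 1 is left out.

```python
import math

# Two-point Gauss-Legendre rule on the reference interval [-1, 1].
GAUSS_POINTS = (-1.0 / math.sqrt(3.0), 1.0 / math.sqrt(3.0))
GAUSS_WEIGHTS = (1.0, 1.0)


def element_matrix(x_left, x_right, A, B, C):
    """Approximate the integral part of b(N_j, N_i) on [x_left, x_right].

    A, B, C are callables returning the coefficient values at a point.
    Row j is the test function, column i the trial function.
    """
    h = x_right - x_left
    # Linear B-splines restricted to the element, and their derivatives.
    N = (lambda x: (x_right - x) / h, lambda x: (x - x_left) / h)
    dN = (-1.0 / h, 1.0 / h)
    K = [[0.0, 0.0], [0.0, 0.0]]
    for xi, w in zip(GAUSS_POINTS, GAUSS_WEIGHTS):
        x = x_left + 0.5 * h * (xi + 1.0)  # map the Gauss point to the element
        wl = 0.5 * h * w                   # weight scaled by the Jacobian h/2
        for j in range(2):
            for i in range(2):
                K[j][i] += wl * (A(x) * dN[j] * dN[i]
                                 + B(x) * N[j](x) * dN[i]
                                 + C(x) * N[j](x) * N[i](x))
    return K


one = lambda x: 1.0
zero = lambda x: 0.0
stiffness = element_matrix(0.0, 0.5, one, zero, zero)  # only the A term
mass = element_matrix(0.0, 1.0, zero, zero, one)       # only the C term
```

Two Gauss points integrate polynomials up to degree three exactly, which is enough for linear B-splines with constant coefficients; higher p would need more points.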
PROCESS OF THE ELIMINATION
EXPRESSED BY GRAPH GRAMMAR PRODUCTIONS
[Figures, one slide each:]
• Generation of frontal matrices at leaves of the elimination tree expressed as
the execution of graph grammar productions (A1)-(A)^4-(AN)
• Graph grammar productions generating local frontal matrices for left boundary,
interior and right boundary nodes for linear B-splines
• Graph grammar productions merging element frontal matrices at parent level
• Graph grammar production eliminating fully assembled row at parent level
• Graph grammar production for solution at root level
• Graph grammar production for merging element frontal matrices at root level
• Graph grammar production for recursive backward substitution
Expression of the solver execution by graph grammar productions
(A1)-(A)^4-(AN) (generation of frontal matrices at leaves of the elimination tree)
(A2)^3 (merging contributions at parent nodes)
(E2)^3 (elimination of fully assembled nodes)
(A2) – (E2) (merging at parent node followed by elimination)
(Aroot) – (Eroot) (merging at root node followed by full forward elimination)
(BS)^4 (backward substitutions)
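The merge and elimination steps can be illustrated on the smallest case: two neighboring linear elements whose 2x2 frontal matrices share one node. This is a sketch under simplifying assumptions (no right-hand side, hypothetical helper names `merge` and `eliminate_first`, unit element matrices), not the productions themselves.

```python
def merge(left, right):
    """Assemble two 2x2 element frontal matrices that share one node.

    Ordering of the merged 3x3 matrix: [shared node, left end, right end],
    so the fully assembled row comes first.
    """
    M = [[0.0] * 3 for _ in range(3)]
    # Left element: local nodes (left end, shared) -> merged indices (1, 0).
    # Right element: local nodes (shared, right end) -> merged indices (0, 2).
    for local_to_merged, K in (((1, 0), left), ((0, 2), right)):
        for r in range(2):
            for c in range(2):
                M[local_to_merged[r]][local_to_merged[c]] += K[r][c]
    return M


def eliminate_first(M):
    """Eliminate the fully assembled first row; return the Schur complement."""
    n = len(M)
    pivot = M[0][0]
    return [[M[r][c] - M[r][0] * M[0][c] / pivot for c in range(1, n)]
            for r in range(1, n)]


element = [[1.0, -1.0], [-1.0, 1.0]]  # illustrative element frontal matrix
merged = merge(element, element)
schur = eliminate_first(merged)
```

The resulting 2x2 Schur complement plays the role of the frontal matrix passed to the next level of the elimination tree, where the same two steps repeat.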
TRACE THEORY BASED SCHEDULER
Dependency relation for the solver algorithm
{(A1),(A)1}D(A2)1
{(A)2,(A)3}D(A2)2
{(A)4,(AN)}D(A2)3
(A2)1D(E2)1
(A2)2D(E2)2
(A2)3D(E2)3
{(E2)1,(E2)2}D(A2)4
(A2)4D(E2)4
{(E2)3,(E2)4}D(Aroot)
(Aroot)D(Eroot)
(Eroot)D{(BS)1,(BS)2}
(BS)1D{(BS)3,(BS)4}
Alphabet:
A={(A1), (A)1 , (A)2 , (A)3 , (A)4 , (AN), (A2)1 , (A2)2 , (A2)3 , (E2)1 , (E2)2 , (E2)3 , (A2)4 ,
(E2)4 , (Aroot) , (Eroot) , (BS)1 , (BS)2 , (BS)3 , (BS)4 }
TRACE THEORY BASED SCHEDULER
Dependency graph
TRACE THEORY BASED SCHEDULER
Scheduling according to Foata Normal Form:
(A1)-(A)1-(A)2-(A)3-(A)4-(AN)-(A2)1-(A2)2-(A2)3-(E2)1-(E2)2-(E2)3-(A2)4-(E2)4-
(Aroot)-(Eroot)-(BS)1-(BS)2-(BS)3-(BS)4
[(A1)(A)1(A)2(A)3(A)4(AN)][(A2)1(A2)2(A2)3][(E2)1(E2)2(E2)3][(A2)4][(E2)4]
[(Aroot)][(Eroot)][(BS)1(BS)2][(BS)3(BS)4]
Thus, the execution of the solver consists of several steps in which independent
tasks are executed concurrently, separated by synchronization barriers.
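The step-by-step execution with barriers can be sketched as follows. CPU threads stand in for GPU threads here, which is an assumption made purely for illustration; the function `run_schedule`, the task names and the logging task body are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import threading


def run_schedule(classes, run_task, max_workers=4):
    """Run the Foata classes one after another; tasks inside a class run concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for tasks in classes:
            # list(...) blocks until every task of the class has finished,
            # which acts as the synchronization barrier between classes.
            list(pool.map(run_task, tasks))


# Foata classes of the solver algorithm, written with hypothetical task names.
classes = [
    ["A1", "A_1", "A_2", "A_3", "A_4", "AN"],
    ["A2_1", "A2_2", "A2_3"],
    ["E2_1", "E2_2", "E2_3"],
    ["A2_4"],
    ["E2_4"],
    ["Aroot"],
    ["Eroot"],
    ["BS_1", "BS_2"],
    ["BS_3", "BS_4"],
]

log = []
lock = threading.Lock()


def record(task):
    # Placeholder task body: a real solver would run the production here.
    with lock:
        log.append(task)


run_schedule(classes, record)
```

Within a class the order in `log` may vary from run to run, but no task of a later class is ever recorded before a task of an earlier one.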
GRAPH GRAMMAR PRODUCTIONS EXPRESSING
THE SOLVER ALGORITHM
Linear B-splines
GRAPH GRAMMAR PRODUCTIONS EXPRESSING
THE SOLVER ALGORITHM
Quadratic B-splines
1D NUMERICAL RESULTS
NVIDIA Tesla C2070, 6 GB memory, 448 CUDA cores, each with 1.15 GHz clock
[Figures, one slide each: linear, quadratic, cubic and quintic B-splines]
COMPARISON WITH CPU MUMPS SOLVER
GPU: NVIDIA Tesla C2070, 6 GB memory, 448 CUDA cores, each with 1.15 GHz clock
CPU: Intel(R) Core(TM)2 Quad CPU Q9400 with 2.66 GHz clock, 8 GB of memory
GENERATION OF 2D ELIMINATION TREE
[Figures: sequence of seven slides]
2D NUMERICAL INTEGRATION
[Figure]
2D ELIMINATION
[Figures: sequence of three slides]
C^0 COST
• Size of the top dense problem
M = O(N^{0.5})
• Cost of the sequential dense solver
O(M^3)
• Computational cost of sequential C^0 solver
O((N^{0.5})^3) = O(N^{3/2})
• Cost of the parallel dense solver
O(M^2)
• Computational cost of parallel C^0 solver
O((N^{0.5})^2) = O(N)
C^{p-1} COST
• Size of the top dense problem
M = O(p N^{0.5})
• Cost of the sequential dense solver
O(M^3)
• Computational cost of sequential C^{p-1} solver
O((p N^{0.5})^3) = O(p^3 N^{3/2})
• Cost of the parallel dense solver
O(M^2)
• Computational cost of parallel C^{p-1} solver
O((p N^{0.5})^2) = O(p^2 N)
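The estimates above can be turned into a quick numerical comparison. The sketch drops all constants hidden in the O-notation and uses hypothetical function names and sample values of N and p, so only the ratios are meaningful.

```python
def top_front_size(N, p=1):
    """Size M of the top dense problem: N**0.5 for C^0, p * N**0.5 for C^{p-1}."""
    return p * N ** 0.5


def dense_cost(M, parallel=False):
    """Leading-order cost of the dense solve: M**3 sequential, M**2 parallel."""
    return M ** 2 if parallel else M ** 3


N, p = 10_000, 3  # illustrative problem size and B-spline order

c0_seq = dense_cost(top_front_size(N))
c0_par = dense_cost(top_front_size(N), parallel=True)
cp_seq = dense_cost(top_front_size(N, p))
cp_par = dense_cost(top_front_size(N, p), parallel=True)

seq_ratio = cp_seq / c0_seq  # grows like p**3
par_ratio = cp_par / c0_par  # grows like p**2
```

For a fixed N, raising the continuity to C^{p-1} multiplies the sequential estimate by p^3 and the parallel estimate by p^2.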
2D NUMERICAL RESULTS
NVIDIA GeForce GTX 780 with 2304 cores, 3 GB of memory
[Figures, one slide each: linear, quadratic and cubic B-splines]
CONCLUSIONS
• We have developed a formal graph grammar model for B-spline-based
finite element method computations in one and two dimensions.
• The multi-frontal direct solver algorithm has been expressed as basic
indivisible tasks, and the partial order of execution has been expressed
by a dependency graph.
• The tasks can be scheduled as a sequence of sets of independent tasks
based on the coloring of the dependency graph.
• The model allows for an efficient implementation of the generation
and solution of the problem on shared-memory architectures.
• The isogeometric shared-memory C^{p-1}-continuity parallel direct solver
scales like O(p^2 log(N/p)) for one-dimensional problems, and like
O(N p^2) for two-dimensional problems.