Parallel multi-frontal solver for isogeometric finite element … · 2014. 5. 26. · Scheduling...

Preview:

Citation preview

Parallel multi-frontal solver

for isogeometric

finite element methods on GPU

Maciej Paszyński

Department of Computer Science,

AGH University of Science and Technology, Kraków, Poland

email: paszynsk@agh.edu.pl

Collaborators:

Maciej Woźniak

Krzysztof Kuźnik

AGH University of Science and Technology, Kraków, Poland

Victor Calo

King Abdullah University of Science and Technology, Thuwal, Saudi Arabia

David Pardo

The University of the Basque Country,

Basque Center for Applied Mathematics

and IKERBASQUE (Basque Foundation of Science), Bilbao, Spain

AGH UNIVERSITY OF SCIENCE AND TECHNOLOGY

DEPARTMENT OF COMPUTER SCIENCE, KRAKOW, POLAND

OUTLINE

1. Introduction

2. Graph grammar model for 1D

• Grammar productions expressing the generation of element elimination tree

• Grammar productions expressing the solver algorithm

3. Numerical results for 1D

• Execution time for p=1,2,3,4,5

• Comparision with MUMPS for p=1,2,3,4,5

4. Generalization to 2D

5. Cp-1 vs C0

6. Numerical results for 2D

• Execution time for p=1,2,3

• Comparision with MUMPS for p=1,2,3

7. Conclusions

INTRODUCTION

B-SPLINES BASED FINITE ELEMENT METHOD

Strong formulation

d

dxA x

d u

d x

B x

d u

d xC x u f x

u 0 0

A 1 d u 1

d x u 1

Weak formulation

Find u V u H 1 0,1 :u 0 0 s.t.

b v,u l v , v V v H 1 0,1 :v 0 0

b v,u A x d v

d x

d u

d xB x v x

d u

d xC x v x u x

d x

0

1

v 1 u 1

l v v 1

Find u V u H 1 0,1 :u 0 0 s.t.

b v,u l v , v V v H 1 0,1 :v 0 0

INTRODUCTION

B-SPLINES BASED FINITE ELEMENT METHOD

Using B-splines as basis functions

u x N

i,px

i

di

v x Nj ,p

x

b N

j ,px ,Ni,p

x ai l N

j ,px

i

, j

contribution of

b(N2,1;N3,1) Linear B-splines

Find u V u H 1 0,1 :u 0 0 s.t.

b v,u l v , v V v H 1 0,1 :v 0 0

INTRODUCTION

B-SPLINES BASED FINITE ELEMENT METHOD

Using B-splines as basis functions

u x N

i,px

i

di

v x Nj ,p

x

b N

j ,px ,Ni,p

x ai l N

j ,px

i

, j

contribution of

b(N2,2;N3,2) Quadratic B-splines

GRAPH GRAMMAR

GENERATION OF 1D ELIMINATION TREE

1D elimination tree obtained by executing productions (P1)-(P2)2-(P2)2-(P3)6

GRAPH GRAMMAR PRODUCTIONS AS ATOMIC TASKS

We assign indices to grammar productions in order to localize

the places where the graph grammar productions were fired

(P1)-(P2)1-(P2)2-(P2)3-(P2)4-(P3)1-(P3)2-(P3)3-(P3)4-(P3)5-(P3)6

TRACE THEORY BASED SCHEDULER

Dependency relation for construction of the elimination tree

(P1)D{(P2)1,(P2)2}

(P2)1D{(P2)3,(P2)4}

(P2)3D{(P3)1,(P3)2}

(P2)4D{(P3)3,(P3)4}

(P2)2D{(P3)5,(P3)6}

Alphabet:

A = {(P1) , (P2)1 , (P2)2 , (P2)3 , (P2)4 , (P3)1 , (P3)2 , (P3)3 , (P3)4 , (P3)5 , (P3)6 }

TRACE THEORY BASED SCHEDULER

Dependency graph

TRACE THEORY BASED SCHEDULER

Dependency graph

TRACE THEORY BASED SCHEDULER

(P1)-(P2)1-(P2)2-(P2)3-(P2)4- (P3)1-(P3)2-(P3)3-(P3)4-(P3)5-(P3)6

[(P1)][(P2)1(P2)2][(P2)3(P2)4(P3)5(P3)6][(P3)1(P3)2(P3)3(P3)4]

Scheduling according to Foata Normal Form:

Thus, the execution of the solver consists of several steps, where independent

tasks are executed in concurrent, interchanged with the synchronization barriers.

kj

kikk

kj

kik

ki

nl

nnll

Daaljlik

Iaaljik

Aa

aaaaaaaaan

11

2122

221

112

11

,...,1,...,1

,...,1,

............11

i<>j where I=AxA\D

Foata Normal Form

(alphabet)

GRAMMAR BASED NUMERICAL INTEGRATION

using Gaussian quadrature the integration over the domain can be substituted

by a weighted summation over Gauss points

b Nj ,p

x ,Ni ,px

A x d N

j ,px

d x

d Ni ,p

x d x

B x N j ,px

d Ni ,p

x d x

C x N j ,px Ni ,p

x

d x

0

1

Nj ,p

1 Ni ,p1

l N

i,px N

i,p1

A x d N

i ,px

d x

d Nj ,p

x d x

B x d N

i,px

d xN

j ,px C x Ni ,p

x N j ,px

d x

0

1

wl

A xl

d Ni ,p

xl

d x

d Nj ,p

xl

d x B x

l d N

i,px

l d x

Nj ,p

xl C x

l Ni,px

l N j ,px

l

l

GRAMMAR BASED NUMERICAL INTEGRATION

GRAMMAR BASED NUMERICAL INTEGRATION

PROCESS OF THE ELIMINATION

EXPRESSED BY GRAPH GRAMMAR PRODUCTIONS

Generation of frontal matrices at leaves of the eliminaton tree expressed as

the execution of graph grammar productions (A1)-(A)4-(AN)

PROCESS OF THE ELIMINATION

EXPRESSED BY GRAPH GRAMMAR PRODUCTIONS

Graph grammar productions generating local frontal matrices for left boundary,

interior and right boundary nodes for linear B-splines

PROCESS OF THE ELIMINATION

EXPRESSED BY GRAPH GRAMMAR PRODUCTIONS

Graph grammar productions merging element frontal matrices at parent level

PROCESS OF THE ELIMINATION

EXPRESSED BY GRAPH GRAMMAR PRODUCTIONS

Graph grammar production eliminating fully assembled row at parent level

PROCESS OF THE ELIMINATION

EXPRESSED BY GRAPH GRAMMAR PRODUCTIONS

Graph grammar production for solution at root level

Graph grammar production for merging element frontal matrices at root level

PROCESS OF THE ELIMINATION

EXPRESSED BY GRAPH GRAMMAR PRODUCTIONS

Graph grammar production for recursive backward substitution

Expression of the solver execution by graph grammar productions

(A1)-(A)4-(AN) (generation of frontal matrices at leaves of the elimination trees)

(A2)3 (merging contributions at father nodes)

(E2)3 (elimination of fully assembled nodes)

(A2) – (E2) (merging at parent node followed by elimination)

(Aroot) – (Eroot) (merging at root node followed by full forward elimination)

(BS)4 (backward substitutions)

PROCESS OF THE ELIMINATION

EXPRESSED BY GRAPH GRAMMAR PRODUCTIONS

TRACE THEORY BASED SCHEDULER

Dependency relation for the solver algorithm

{(A1),(A)1}D(A2)1

{(A)2,(A)3}D(A2)2

{(A)4,(AN)}D(A2)3

(A2)1D(E2)1

(A2)2D(E2)2

(A2)3D(E2)3

{(E2)1,(E2)2}D(A2)4

(A2)4D(E2)4

{(E2)3(E2)4}D(Aroot)

(Aroot)D(Eroot)

(Eroot)D{(BS)1,(BS)2

(BS)1D{(BS)3,(BS)4}

Alphabet:

A={(A1), (A)1 , (A)2 , (A)3 , (A)4 , (AN), (A2)1 , (A2)2 , (A2)3 , (E2)1 , (E2)2 , (E2)3 , (A2)4 ,

(E2)4 , (Aroot) , (Eroot) , (BS)1 , (BS)2 , (BS)3 , (BS)4 }

TRACE THEORY BASED SCHEDULER

Dependency graph

TRACE THEORY BASED SCHEDULER

Dependency graph

TRACE THEORY BASED SCHEDULER

Scheduling according to Foata Normal Form:

(A1)-(A)1-(A)2-(A)3-(A)4- (AN)-(A2)1-(A2)2- (A2)3-(E2)1-(E2)2-(E2)3- (A2)4- (E2)4-

(Aroot)-(Eroot)-(BS)1-(BS)2-(BS)3-(BS)4

[(A1)(A)1(A)2(A)3(A)4(AN)][(A2)1(A2)2(A2)3][(E2)1(E2)2(E2)3] [(A2)4][(E2)4]

[(Eroot)][(Aroot)][(Eroot)][(BS)1(BS)2][(BS)3(BS)4]

Thus, the execution of the solver consists of several steps, where independent

tasks are executed in concurrent, interchanged with the synchronization barriers.

kj

kikk

kj

kik

ki

nl

nnll

Daaljlik

Iaaljik

Aa

aaaaaaaaan

11

2122

221

112

11

,...,1,...,1

,...,1,

............11

Foata Normal Form

(alphabet)

GRAPH GRAMMAR PRODUCTIONS EXPRESSING

THE SOLVER ALGORITHM

Linear B-splines

GRAPH GRAMMAR PRODUCTIONS EXPRESSING

THE SOLVER ALGORITHM

Quadratic B-splines

1D NUMERICAL RESULTS

LINEAR B-SPLINES

NVidia Tesla c2070, 6GB memory, 448 CUDA cores, each one with 1.15GHz clock

1D NUMERICAL RESULTS

QUADRATIC B-SPLINES

NVidia Tesla c2070, 6GB memory, 448 CUDA cores, each one with 1.15GHz clock

1D NUMERICAL RESULTS

CUBIC B-SPLINES

NVidia Tesla c2070, 6GB memory, 448 CUDA cores, each one with 1.15GHz clock

1D NUMERICAL RESULTS

QINTIC B-SPLINES

NVidia Tesla c2070, 6GB memory, 448 CUDA cores, each one with 1.15GHz clock

COMPARISON WITH CPU MUMPS SOLVER

NVidia Tesla c2070, 6GB memory, 448 CUDA cores, each one with 1.15GHz clock

Intel(R) Core(TM)2 Quad CPU Q9400 with 2.66GHz clock, 8GB of memory

GENERATION OF 2D ELIMINATION TREE

GENERATION OF 2D ELIMINATION TREE

GENERATION OF 2D ELIMINATION TREE

GENERATION OF 2D ELIMINATION TREE

GENERATION OF 2D ELIMINATION TREE

GENERATION OF 2D ELIMINATION TREE

GENERATION OF 2D ELIMINATION TREE

2D NUMERICAL INTEGRATION

2D ELIMINATION

2D ELIMINATION

2D ELIMINATION

C0 COST

• Size of the top dense problem

O(N0.5)

• Cost of the sequential dense solver

O(M3)

• Computational cost of sequential

C0 solver

O((N0.5)3)=O(N3/2)

• Cost of the parallel dense solver

O(M2)

• Computational cost of sequential

C0 solver

O((N0.5)2)=O(N)

Cp-1 COST

• Size of the top dense problem

O(pN0.5)

• Cost of the sequential dense solver

O(M3)

• Computational cost of sequential

C0 solver

O((pN0.5)3)=O(p3N3/2)

• Cost of the parallel dense solver

O(N2)

• Computational cost of sequential

C0 solver

O((pN0.5)2)=O(p2N)

2D NUMERICAL RESULTS

LINEAR B-SPLINES

GeForce GTX 780 with 2304 cores, 3GB of memory

2D NUMERICAL RESULTS

QUADRATIC B-SPLINES

GeForce GTX 780 with 2304 cores, 3GB of memory

2D NUMERICAL RESULTS

CUBIC B-SPLINES

GeForce GTX 780 with 2304 cores, 3GB of memory

CONCLUSIONS

We have developed the formal graph grammar model for B-splines based

finite element method computations in one and two dimensions.

The multi-frontal direct solver algorithm has been expressed by basic

undividable tasks, and the partial order of execution has been expressed

by a dependency graph

The tasks can be scheduled in a sequence of sets with independent tasks

based on the coloring of the dependency graph

The model allows for an efficient implementation of the generation

and solution of the problem in shared memory architecture

The isogeometric shared memory Cp-1 continuity parallel direct solver

scales like O(p2log(N/p)) for one dimensional problems, and like

O(Np2) for two dimensional problems