3D Parallel FEM (IV)
OpenMP + Hybrid Parallel Programming Model

Kengo Nakajima
Information Technology Center

Programming for Parallel Computing (616-2057)
Seminar on Advanced Computing (616-4009)
Hybrid Parallel Programming Model

• Message Passing (e.g. MPI) + Multithreading (e.g. OpenMP, CUDA, OpenCL, OpenACC etc.)
• On the K computer and FX10, hybrid parallel programming is recommended
  – MPI + automatic parallelization by Fujitsu's compiler
  – Personally, I do not like to call this "hybrid" !!!
• Expectations for Hybrid
  – Number of MPI processes (and sub-domains) to be reduced
  – O(10^8-10^9)-way MPI might not scale on exascale systems
  – Easily extended to heterogeneous architectures
    • CPU+GPU, CPU+manycore (e.g. Intel MIC/Xeon Phi)
    • MPI+X: OpenMP, OpenACC, CUDA, OpenCL
Flat MPI vs. Hybrid

• Hybrid: hierarchical structure (threads within each MPI process share memory)
• Flat MPI: each core runs an independent MPI process

[Figure: Flat MPI assigns one MPI process to every core; Hybrid groups the cores that share a memory into one MPI process with multiple threads]
HB M×N

• M: number of OpenMP threads per single MPI process
• N: number of MPI processes per single node

[Figure: a 16-core node (per-core L1 caches, shared L2, memory) split between threads and processes according to "HB M×N"]
Size (and number) of local data changes according to the parallel programming model.
Example: 6 nodes, 96 cores

Flat MPI (96 processes):   128 192 64 / 8 12 1 / pcube
HB 4×4   (24 processes):   128 192 64 / 4  6 1 / pcube
HB 16×1  ( 6 processes):   128 192 64 / 2  3 1 / pcube
Batch Script (1/2)
Env. Var.: OMP_NUM_THREADS

Hybrid 16×1:

#!/bin/sh
#PJM -L "node=6"
#PJM -L "elapse=00:05:00"
#PJM -j
#PJM -L "rscgrp=lecture2"
#PJM -g "gt82"
#PJM -o "test.lst"
#PJM --mpi "proc=6"
export OMP_NUM_THREADS=16
mpiexec ./sol
rm wk.*

Flat MPI:

#!/bin/sh
#PJM -L "node=6"
#PJM -L "elapse=00:05:00"
#PJM -j
#PJM -L "rscgrp=lecture2"
#PJM -g "gt82"
#PJM -o "test.lst"
#PJM --mpi "proc=96"
mpiexec ./sol
rm wk.*

Hybrid 8×2:

#!/bin/sh
#PJM -L "node=6"
#PJM -L "elapse=00:05:00"
#PJM -j
#PJM -L "rscgrp=lecture2"
#PJM -g "gt82"
#PJM -o "test.lst"
#PJM --mpi "proc=12"
export OMP_NUM_THREADS=8
mpiexec ./sol
rm wk.*
Batch Script (2/2)
Env. Var.: OMP_NUM_THREADS

Hybrid 4×4:

#!/bin/sh
#PJM -L "node=6"
#PJM -L "elapse=00:05:00"
#PJM -j
#PJM -L "rscgrp=lecture2"
#PJM -g "gt82"
#PJM -o "test.lst"
#PJM --mpi "proc=24"
export OMP_NUM_THREADS=4
mpiexec ./sol
rm wk.*
OMP-1
Background

• Multicore/manycore processors
  – Low power consumption; various types of programming models
• OpenMP
  – Directive based, (seems to be) easy
  – Many books
• Data dependency (no classes this year)
  – Conflict of reading from / writing to memory
  – Appropriate reordering of data is needed for "consistent" parallel computing
  – NO detailed information in OpenMP books: very complicated
• OpenMP/MPI hybrid parallel programming model for multicore/manycore clusters
SMP

[Figure: eight CPUs connected to a single MEMORY]

• SMP
  – Symmetric Multiprocessor
  – Multiple CPUs (cores) share a single memory space
What is OpenMP?  http://www.openmp.org

• An API for multi-platform shared-memory parallel programming in C/C++ and Fortran
  – Current version: 4.0
• Background
  – Merger of Cray and SGI in 1996
  – ASCI project (DOE) started
  – C/C++ version and Fortran version had been separately developed until ver. 2.5
• Fork-Join parallel execution model
• Users have to specify everything by directives.
  – Nothing happens if there are no directives.
Fork-Join Parallel Execution Model

[Figure: the master thread forks a team of threads at PARALLEL, the team executes concurrently, and the threads join back into the master thread at END PARALLEL; this fork-join cycle repeats for each parallel region]
Number of Threads

• OMP_NUM_THREADS
  – How to change?
    • bash (.bashrc):  export OMP_NUM_THREADS=8
    • csh (.cshrc):    setenv OMP_NUM_THREADS 8
Information about OpenMP

• OpenMP Architecture Review Board (ARB)
  – http://www.openmp.org
• References
  – Chandra, R. et al., "Parallel Programming in OpenMP" (Morgan Kaufmann)
  – Quinn, M.J., "Parallel Programming in C with MPI and OpenMP" (McGraw-Hill)
  – Mattson, T.G. et al., "Patterns for Parallel Programming" (Addison-Wesley)
  – Ushijima, "Parallel Programming and Numerical Computation with OpenMP" (Maruzen, in Japanese)
  – Chapman, B. et al., "Using OpenMP" (MIT Press) — the newest!
• Japanese version of the OpenMP 3.0 spec. (Fujitsu etc.)
  – http://www.openmp.org/mp-documents/OpenMP30spec-ja
Features of OpenMP

• Directives
  – Loops right after the directives are parallelized.
  – If the compiler does not support OpenMP, directives are treated as just comments.
OpenMP/Directives: Array Operations

Simple substitution:

!$omp parallel do
do i= 1, NP
  W(i,1)= 0.d0
  W(i,2)= 0.d0
enddo
!$omp end parallel do

Dot products:

!$omp parallel do private(iS,iE,i)
!$omp& reduction(+:RHO)
do ip= 1, PEsmpTOT
  iS= STACKmcG(ip-1) + 1
  iE= STACKmcG(ip  )
  do i= iS, iE
    RHO= RHO + W(i,R)*W(i,Z)
  enddo
enddo
!$omp end parallel do

DAXPY:

!$omp parallel do
do i= 1, NP
  Y(i)= ALPHA*X(i) + Y(i)
enddo
!$omp end parallel do
OpenMP/Directives: Matrix/Vector Products

!$omp parallel do private(ip,iS,iE,i,j)
do ip= 1, PEsmpTOT
  iS= STACKmcG(ip-1) + 1
  iE= STACKmcG(ip  )
  do i= iS, iE
    W(i,Q)= D(i)*W(i,P)
    do j= 1, INL(i)
      W(i,Q)= W(i,Q) + W(IAL(j,i),P)
    enddo
    do j= 1, INU(i)
      W(i,Q)= W(i,Q) + W(IAU(j,i),P)
    enddo
  enddo
enddo
!$omp end parallel do
Features of OpenMP

• Directives
  – Loops right after the directives are parallelized.
  – If the compiler does not support OpenMP, directives are treated as just comments.
• Nothing happens without explicit directives
  – Different from "automatic parallelization/vectorization"
  – Something wrong may happen through improper usage
  – Data configuration, reordering etc. are done under the users' responsibility
• "Threads" are created according to the number of cores on the node
  – Thread: corresponds to "process" in MPI
  – Generally, "# threads = # cores"; Xeon Phi supports 4 threads per core (hyper-multithreading)
Memory Contention

[Figure: eight CPUs sharing a single MEMORY]

• During a complicated process, multiple threads may simultaneously try to update the data at the same address in memory.
  – e.g.: multiple cores update a single component of an array.
  – This situation is possible.
  – Answers may change compared to serial cases with a single core (thread).
Memory Contention (cont.)

[Figure: eight CPUs sharing a single MEMORY]

• In this lecture, such cases are avoided by reordering etc.
  – In OpenMP, users are responsible for such issues (e.g. proper data configuration, reordering etc.)
• Generally speaking, performance per core decreases as the number of used cores (thread number) increases.
  – Memory access performance: STREAM
Features of OpenMP (cont.)

• "!$omp parallel do" - "!$omp end parallel do"
• Global (shared) variables, private variables
  – Default: global (shared)
  – Dot products: reduction

W(:,:), R, Z, PEsmpTOT: global (shared)

!$omp parallel do private(iS,iE,i)
!$omp& reduction(+:RHO)
do ip= 1, PEsmpTOT
  iS= STACKmcG(ip-1) + 1
  iE= STACKmcG(ip  )
  do i= iS, iE
    RHO= RHO + W(i,R)*W(i,Z)
  enddo
enddo
!$omp end parallel do
FORTRAN & C

Fortran:

use omp_lib
...
!$omp parallel do shared(n,x,y) private(i)
do i= 1, n
  x(i)= x(i) + y(i)
enddo
!$omp end parallel do

C:

#include <omp.h>
...
#pragma omp parallel for default(none) shared(n,x,y) private(i)
for (i=0; i<n; i++)
  x[i] += y[i];
In this class...

• There are many capabilities of OpenMP.
• In this class, only several functions are shown for parallelization of the parallel FEM code.
First things to be done (after OpenMP 3.0)

• Fortran:  use omp_lib
• C:        #include <omp.h>
OpenMP Directives (Fortran)

• NO distinction between upper and lower cases.
• sentinel
  – Fortran: !$OMP, C$OMP, *$OMP
    • !$OMP only for free format
  – Continuation lines (same rule as that of the Fortran compiler is applied)

  sentinel directive_name [clause[[,] clause]…]

• Examples for !$OMP PARALLEL DO SHARED(A,B,C):

!$OMP PARALLEL DO
!$OMP+SHARED (A,B,C)

!$OMP PARALLEL DO &
!$OMP SHARED (A,B,C)
OpenMP Directives (C)

• "\" for continuation lines
• Only lower case (except names of variables)

#pragma omp directive_name [clause[[,] clause]…]

#pragma omp parallel for shared (a,b,c)
PARALLEL DO

• Parallelizes DO/for loops
• Examples of "clause"
  – PRIVATE(list)
  – SHARED(list)
  – DEFAULT(PRIVATE|SHARED|NONE)
  – REDUCTION({operation|intrinsic}: list)

!$OMP PARALLEL DO [clause[[,] clause] … ]
(do_loop)
!$OMP END PARALLEL DO

#pragma omp parallel for [clause[[,] clause] … ]
(for_loop)
REDUCTION

• Similar to "MPI_Reduce"
• Operator
  – +, *, -, .AND., .OR., .EQV., .NEQV.
• Intrinsic
  – MAX, MIN, IAND, IOR, IEOR

REDUCTION ({operator|intrinsic}: list)
reduction ({operator|intrinsic}: list)
Example-1: A Simple Loop

!$OMP PARALLEL DO
do i= 1, N
  B(i)= (A(i) + B(i)) * 0.50
enddo
!$OMP END PARALLEL DO

• The default status of loop variables ("i" in this case) is private. Therefore, explicit declaration is not needed.
• "END PARALLEL DO" is not required.
  – In C, there is no "end parallel for" counterpart.
Example-1: REDUCTION

!$OMP PARALLEL DO DEFAULT(PRIVATE) REDUCTION(+:A,B)
do i= 1, N
  call WORK (Alocal, Blocal)
  A= A + Alocal
  B= B + Blocal
enddo
!$OMP END PARALLEL DO

• "END PARALLEL DO" is not required.
Functions which can be used with OpenMP

Name                                           Function
int    omp_get_num_threads(void)               Total thread #
int    omp_get_thread_num(void)                Thread ID
double omp_get_wtime(void)                     = MPI_Wtime
void   omp_set_num_threads(int num_threads)    Setting thread #
       call omp_set_num_threads(num_threads)
OpenMP for Dot Products

Serial:

VAL= 0.d0
do i= 1, N
  VAL= VAL + W(i,R) * W(i,Z)
enddo

With OpenMP (directives are just inserted):

VAL= 0.d0
!$OMP PARALLEL DO PRIVATE(i) REDUCTION(+:VAL)
do i= 1, N
  VAL= VAL + W(i,R) * W(i,Z)
enddo
!$OMP END PARALLEL DO
OpenMP for Dot Products: Multiple Loops

VAL= 0.d0
!$OMP PARALLEL DO PRIVATE(ip,i) REDUCTION(+:VAL)
do ip= 1, PEsmpTOT
  do i= index(ip-1)+1, index(ip)
    VAL= VAL + W(i,R) * W(i,Z)
  enddo
enddo
!$OMP END PARALLEL DO

PEsmpTOT: number of threads. An additional array INDEX(:) is needed.
Efficiency is not necessarily good, but users can specify the thread for each component of data. Directives are just inserted.
OpenMP for Dot Products: Multiple Loops (cont.)

e.g.: N=100, PEsmpTOT=4

INDEX(0)=   0
INDEX(1)=  25
INDEX(2)=  50
INDEX(3)=  75
INDEX(4)= 100

Each block ip handles components i = INDEX(ip-1)+1 through INDEX(ip).
NOT good for GPUs.
Files on Oakleaf-FX
>$ cd <$O-TOP>
>$ cp /home/z30088/class_eps/F/omp.tar .
>$ cp /home/z30088/class_eps/C/omp.tar .
>$ tar xvf omp.tar
>$ cd omp
<$O-omp>
Files on Oakleaf-FX
>$ cd <$O-omp>
>$ frtpx -Kfast,openmp test.f
>$ fccpx -Kfast,openmp test.c
>$ pjsub go.sh
Running the Job: go.sh

#!/bin/sh
#PJM -L "node=1"
#PJM -L "elapse=00:10:00"
#PJM -L "rscgrp=lecture"
#PJM -g "gt71"
#PJM -j
#PJM -o "t0-08.lst"
export OMP_NUM_THREADS=8    # number of threads/process
./a.out < INPUT.DAT

INPUT.DAT:  N nopt
  N:    problem size (vector length)
  nopt: first touch (0: no, 1: yes)
test.f / test.c

• DAXPY
• Dot products
• Effect of OpenMP directives
• Effect of first-touch data placement
test.f (1/3): Initialization

use omp_lib
implicit REAL*8 (A-H,O-Z)
real(kind=8), dimension(:), allocatable:: X, Y
real(kind=8), dimension(:), allocatable:: Z1, Z2
real(kind=8), dimension(:), allocatable:: Z3, Z4, Z5
integer, dimension(0:2) :: INDEX

!C
!C +------+
!C | INIT |
!C +------+
!C===
      write (*,*) 'N, nopt?'
      read (*,*) N, nopt
      allocate (X(N), Y(N), Z1(N), Z2(N), Z3(N), Z4(N), Z5(N))

      if (nopt.eq.0) then
        X = 1.d0
        Y = 1.d0
        Z1= 0.d0
        Z2= 0.d0
        Z3= 0.d0
        Z4= 0.d0
        Z5= 0.d0
       else
!$omp parallel do private (i)
        do i= 1, N
          X (i)= 0.d0
          Y (i)= 0.d0
          Z1(i)= 0.d0
          Z2(i)= 0.d0
          Z3(i)= 0.d0
          Z4(i)= 0.d0
          Z5(i)= 0.d0
        enddo
!$omp end parallel do
      endif
      ALPHA= 1.d0
!C===

N: problem size; nopt: option.
nopt=0: NO first touch.  nopt≠0: with first touch.
nopt=0 (NO first touch): the initialization loop is not parallelized.
nopt≠0 (with first touch): the initialization is done in parallel.
test.f (2/3): DAXPY

!C
!C +-------+
!C | DAXPY |
!C +-------+
!C===
      S2time= omp_get_wtime()
!$omp parallel do private (i)
      do i= 1, N
        Z1(i)= ALPHA*X(i) + Y(i)
        Z2(i)= ALPHA*X(i) + Y(i)
        Z3(i)= ALPHA*X(i) + Y(i)
        Z4(i)= ALPHA*X(i) + Y(i)
        Z5(i)= ALPHA*X(i) + Y(i)
      enddo
!$omp end parallel do
      E2time= omp_get_wtime()

      write (*,'(/a)') '# DAXPY'
      write (*,'( a, 1pe16.6)') ' omp-1 ', E2time - S2time
!C===
test.f (3/3): Dot Products

!C
!C +-----+
!C | DOT |
!C +-----+
!C===
      V1= 0.d0
      V2= 0.d0
      V3= 0.d0
      V4= 0.d0
      V5= 0.d0
      S2time= omp_get_wtime()
!$omp parallel do private(i) reduction (+:V1,V2,V3,V4,V5)
      do i= 1, N
        V1= V1 + X(i)*(Y(i)+1.d0)
        V2= V2 + X(i)*(Y(i)+2.d0)
        V3= V3 + X(i)*(Y(i)+3.d0)
        V4= V4 + X(i)*(Y(i)+4.d0)
        V5= V5 + X(i)*(Y(i)+5.d0)
      enddo
!$omp end parallel do
      E2time= omp_get_wtime()

      write (*,'(/a)') '# DOT'
      write (*,'( a, 1pe16.6)') ' omp-1 ', E2time - S2time
!C===
      stop
      end
test.c: Dummy Pragma needed

#pragma omp parallel
{}

if (nopt==0) {
  for(i=0; i<N; i++) {
    X[i]  = 1.0;
    Y[i]  = 1.0;
    Z1[i] = 0.0;
    Z2[i] = 0.0;
    Z3[i] = 0.0;
    Z4[i] = 0.0;
    Z5[i] = 0.0;
  }
} else {
#pragma omp parallel for private(i)
  for(i=0; i<N; i++) {
    X[i]  = 1.0;
    Y[i]  = 1.0;
    Z1[i] = 0.0;
    Z2[i] = 0.0;
    Z3[i] = 0.0;
    Z4[i] = 0.0;
    Z5[i] = 0.0;
  }
}

A dummy pragma is needed before the actual computation (for the Fujitsu C compiler); this problem may be fixed in the future. Threads for OpenMP are generated during the first parallel loop, and this procedure is expensive. In Fortran, this dummy loop is implicitly inserted by the compiler.
DAXPY: Effect of First Touch
FX10, N=60,000,000; T2K, N=10,000,000

• T2K: the effect of first touch is large.
• NOT scalable: memory contention, synchronization overhead.

[Figure: elapsed time (sec.) vs. thread # (1-16) on FX10 and T2K, with (YES) and without (NO) first touch]
First Touch Data Placement
"Patterns for Parallel Programming", Mattson, T.G. et al.

To reduce memory traffic in the system, it is important to keep the data close to the PEs that will work with the data (e.g. NUMA control). On NUMA computers, this corresponds to making sure the pages of memory are allocated and "owned" by the PEs that will be working with the data contained in the page.

The most common NUMA page-placement algorithm is the "first touch" algorithm, in which the PE first referencing a region of memory will have the page holding that memory assigned to it.

A very common technique in OpenMP programs is to initialize data in parallel using the same loop schedule as will be used later in the computations.
[Figure: node architectures — T2K/Tokyo (four 4-core sockets, per-core L1/L2, shared L3 and memory per socket), Cray XE6 "Hopper" (similar NUMA layout), Fujitsu FX10 "Oakleaf-FX" (16 cores with L1, shared L2, single memory)]
NUMA Architecture
Non-Uniform Memory Access

• Data should be on the local memory.
  – Local memory: the memory of the CPU socket where the core is located.

[Figure: multi-socket NUMA node; each socket's cores (with L1/L2/L3) attach to their own memory]
NUMA Architecture
Non-Uniform Memory Access (cont.)

• If data are on remote memory, access is not efficient.

[Figure: a core accessing another socket's memory across the inter-socket link]
NUMA Architecture
Non-Uniform Memory Access (cont.)

• "First-touch data placement" can keep data on local memory.
• "Parallel initialization" of arrays is effective.

[Figure: after parallel first-touch initialization, each socket's threads work on data in their own memory]
(test.f (1/3) initialization, shown again: nopt=0, NO first touch, the loop is not parallelized; nopt≠0, with first touch, initialization is done in parallel.)
Report P1 (1/2)

• Apply multi-threading by OpenMP to the parallel FEM code using MPI
  – CG solver (solver_CG, solver_SR)
  – Matrix assembling (mat_ass_main, mat_ass_bc)
• Hybrid parallel programming model
• Evaluate the effects of
  – problem size, parallel programming model, thread #
Report P1 (2/2)

• Deadline: 17:00, October 11th (Sat), 2014
  – Send files via e-mail at nakajima(at)cc.u-tokyo.ac.jp
• Report
  – Cover page: name, ID, and problem ID (P1)
  – Less than 20 pages including figures and tables (A4)
    • Strategy
    • Structure of the program
    • Numerical experiments, performance analysis
    • Remarks
  – Source list of the program
• Grade
  – A might be given if "CG" and "MAT_ASS" are done.
  – B is the highest grade if "MAT_ASS" is NOT done.
FORTRAN (solver_CG)

!$omp parallel do private(i)
      do i= 1, N
        X(i)    = X(i)    + ALPHA * WW(i,P)
        WW(i,R) = WW(i,R) - ALPHA * WW(i,Q)
      enddo

      DNRM20= 0.d0
!$omp parallel do private(i) reduction (+:DNRM20)
      do i= 1, N
        DNRM20= DNRM20 + WW(i,R)**2
      enddo

!$omp parallel do private(j,k,i,WVAL)
      do j= 1, N
        WVAL= D(j)*WW(j,P)
        do k= index(j-1)+1, index(j)
          i= item(k)
          WVAL= WVAL + AMAT(k)*WW(i,P)
        enddo
        WW(j,Q)= WVAL
      enddo
C (solver_CG)

#pragma omp parallel for private (i)
  for(i=0;i<N;i++){
    X [i]   += ALPHA *WW[P][i];
    WW[R][i]+= -ALPHA *WW[Q][i];
  }

  DNRM20= 0.e0;
#pragma omp parallel for private (i) reduction (+:DNRM20)
  for(i=0;i<N;i++){
    DNRM20+= WW[R][i]*WW[R][i];
  }

#pragma omp parallel for private (j,i,k,WVAL)
  for( j=0;j<N;j++){
    WVAL= D[j] * WW[P][j];
    for(k=indexLU[j];k<indexLU[j+1];k++){
      i= itemLU[k];
      WVAL+= AMAT[k] * WW[P][i];
    }
    WW[Q][j]= WVAL;
  }
solver_SR (send)

C:

for( neib=1;neib<=NEIBPETOT;neib++){
  istart= EXPORT_INDEX[neib-1];
  inum  = EXPORT_INDEX[neib]-istart;
#pragma omp parallel for private (k,ii)
  for( k=istart;k<istart+inum;k++){
    ii= EXPORT_ITEM[k];
    WS[k]= X[ii-1];
  }
  MPI_Isend(&WS[istart],inum,MPI_DOUBLE,
            NEIBPE[neib-1],0,MPI_COMM_WORLD,&req1[neib-1]);
}

Fortran:

do neib= 1, NEIBPETOT
  istart= EXPORT_INDEX(neib-1)
  inum  = EXPORT_INDEX(neib  ) - istart
!$omp parallel do private(k,ii)
  do k= istart+1, istart+inum
    ii   = EXPORT_ITEM(k)
    WS(k)= X(ii)
  enddo
  call MPI_Isend (WS(istart+1), inum, MPI_DOUBLE_PRECISION, &
 &                NEIBPE(neib), 0, MPI_COMM_WORLD, req1(neib), &
 &                ierr)
enddo
How to apply multi-threading

• CG solver
  – Just insert OpenMP directives.
  – ILU/IC preconditioning is much more difficult.
• MAT_ASS (mat_ass_main, mat_ass_bc)
  – Data dependency
  – Avoid accumulating contributions of multiple elements to a single node simultaneously (in parallel):
    • results may be changed
    • deadlock may occur
  – Coloring
    • Elements in a same color do not share a node.
    • Parallel operations are possible for elements in each color.
    • In this case, we need only 8 colors for 3D problems (4 colors for 2D problems).
    • The coloring part is very expensive: its parallelization is difficult.
Multi-Threading: Mat_Ass

Parallel operations are possible for elements in the same color (they are independent).
Coloring (1/2)

allocate (ELMCOLORindex(0:NP))     ! number of elements in each color
allocate (ELMCOLORitem (ICELTOT))  ! element ID renumbered according to "color"

if (allocated (IWKX)) deallocate (IWKX)
allocate (IWKX(0:NP,3))
IWKX= 0
icou= 0

do icol= 1, NP
  do i= 1, NP
    IWKX(i,1)= 0
  enddo
  do icel= 1, ICELTOT
    if (IWKX(icel,2).eq.0) then
      in1= ICELNOD(icel,1)
      in2= ICELNOD(icel,2)
      in3= ICELNOD(icel,3)
      in4= ICELNOD(icel,4)
      in5= ICELNOD(icel,5)
      in6= ICELNOD(icel,6)
      in7= ICELNOD(icel,7)
      in8= ICELNOD(icel,8)

      ip1= IWKX(in1,1)
      ip2= IWKX(in2,1)
      ip3= IWKX(in3,1)
      ip4= IWKX(in4,1)
      ip5= IWKX(in5,1)
      ip6= IWKX(in6,1)
      ip7= IWKX(in7,1)
      ip8= IWKX(in8,1)
Coloring (2/2)

      isum= ip1 + ip2 + ip3 + ip4 + ip5 + ip6 + ip7 + ip8
      if (isum.eq.0) then              ! none of the nodes is accessed in same color
        icou= icou + 1
        IWKX(icol,3)= icou             ! (current) number of elements in each color
        IWKX(icel,2)= icol
        ELMCOLORitem(icou)= icel       ! ID of icou-th element = icel
        IWKX(in1,1)= 1                 ! these nodes on the same element can not be
        IWKX(in2,1)= 1                 ! accessed in same color
        IWKX(in3,1)= 1
        IWKX(in4,1)= 1
        IWKX(in5,1)= 1
        IWKX(in6,1)= 1
        IWKX(in7,1)= 1
        IWKX(in8,1)= 1
        if (icou.eq.ICELTOT) goto 100  ! until all elements are colored
      endif
    endif
  enddo
enddo

100 continue
ELMCOLORtot= icol                      ! number of colors
IWKX(0          ,3)= 0
IWKX(ELMCOLORtot,3)= ICELTOT

do icol= 0, ELMCOLORtot
  ELMCOLORindex(icol)= IWKX(icol,3)
enddo
Multi-Threaded Matrix Assembling Procedure

do icol= 1, ELMCOLORtot
!$omp parallel do private (icel0,icel)                                &
!$omp&            private (in1,in2,in3,in4,in5,in6,in7,in8)           &
!$omp&            private (nodLOCAL,ie,je,ip,jp,kk,iiS,iiE,k)         &
!$omp&            private (DETJ,PNX,PNY,PNZ,QVC,QV0,COEFij,coef,SHi)  &
!$omp&            private (PNXi,PNYi,PNZi,PNXj,PNYj,PNZj,ipn,jpn,kpn) &
!$omp&            private (X1,X2,X3,X4,X5,X6,X7,X8)                   &
!$omp&            private (Y1,Y2,Y3,Y4,Y5,Y6,Y7,Y8)                   &
!$omp&            private (Z1,Z2,Z3,Z4,Z5,Z6,Z7,Z8,COND0)
  do icel0= ELMCOLORindex(icol-1)+1, ELMCOLORindex(icol)
    icel= ELMCOLORitem(icel0)
    in1= ICELNOD(icel,1)
    in2= ICELNOD(icel,2)
    in3= ICELNOD(icel,3)
    in4= ICELNOD(icel,4)
    in5= ICELNOD(icel,5)
    in6= ICELNOD(icel,6)
    in7= ICELNOD(icel,7)
    in8= ICELNOD(icel,8)
    ...
  enddo
enddo
Results (1/2)
512×384×256 = 50,331,648 nodes; 12 nodes, 192 cores; 64^3 = 262,144 nodes/core

           ndx,ndy,ndz (#MPI proc.)   Iter's   sec.
Flat MPI       8  6  4  (192)          1240    73.9
HB  1×16       8  6  4  (192)          1240    73.6   compiled with -Kopenmp, OMP_NUM_THREADS=1
HB  2×8        4  6  4  ( 96)          1240    78.8
HB  4×4        4  3  4  ( 48)          1240    80.3
HB  8×2        4  3  2  ( 24)          1240    81.1
HB 16×1        2  3  2  ( 12)          1240    81.9

512 384 256
ndx ndy ndz
pcube
Results (2/2)
512×384×256 = 50,331,648 nodes; 12 nodes, 192 cores; 64^3 = 262,144 nodes/core

OMP_NUM_THREADS         sec.    Speed-Up
 1                    1056.2      1.00
 2                     592.5      1.78
 4                     289.8      3.64
 8                     148.1      7.13
12                     103.6     10.1
16                      81.9     12.90
Flat MPI, 1 proc./node  1082.4     -

512 384 256
2 3 2
pcube
Flat MPI vs. Hybrid

• Depends on applications, problem size, HW etc.
• Flat MPI is generally better for sparse linear solvers, if the number of computing nodes is not so large.
  – Memory contention
• Hybrid becomes better, if the number of computing nodes is larger.
  – Fewer MPI processes
• Flat MPI is not realistic for Intel Xeon Phi/MIC with 240 threads/node.
  – Each MPI process requires a certain amount of memory.

[Figure: a 16-core node with per-core L1 caches, shared L2 and memory]
omp parallel (do)

• "omp parallel" - "omp end parallel" generates/eliminates threads at every call: fork-join.
• This could be overhead for multiple loops.
• omp parallel + omp do / omp for:

!$omp parallel ...
!$omp do
      do i= 1, N
      ...
!$omp do
      do i= 1, N
      ...
!$omp end parallel      ! required

#pragma omp parallel ...
#pragma omp for
{
...
#pragma omp for
{
...
Tuning

• List of messages by compiler (compile list)
• Precision PA visualization function (Excel)

List of Messages by Compiler (Compile List)

• -Qt
  – List of messages by compiler (compile list)
  – *.lst
  – Fortran only
    • In C, "-Qt" is not available.
    • Please use "-Nsrc"; the list is displayed on screen.
F90         = mpifrtpx
F90OPTFLAGS = -Kfast,openmp -Qt
F90FLAGS    = $(F90OPTFLAGS)

The current version of the C/C++ compiler can produce a list of messages:
Fortran/C/C++
  -Nlst=p   standard optimization information (default)
  -Nlst=t   detailed optimization information

Fortran ONLY
  -Nlst=a   attribute information of names
  -Nlst=d   structure information of derived types
  -Nlst=i   program listing of included files and list of include file names
  -Nlst=m   source program output expressing the status of automatic parallelization with OpenMP directives
  -Nlst=x   cross-reference information of names and statement numbers

Info in *.lst
SIMD Information

Automatic Parallelization

Example of the List
101 1 !C
102 1 !C +----------------+
103 1 !C | {z}= [Minv]{r} |
104 1 !C +----------------+
105 1 !C===
106 1
107 1 !$omp parallel do private(ip,i)
108 2 p do ip= 1, PEsmpTOT
<<< Loop-information Start >>>
<<< [OPTIMIZATION]
<<< SIMD
<<< SOFTWARE PIPELINING
<<< Loop-information End >>>
109 3 p 8v do i = SMPindexG(ip-1)+1, SMPindexG(ip)
110 3 p 8v W(i,Z)= W(i,R)
111 3 p 8v enddo
112 2 p enddo
113 1 !$omp end parallel do
114 1
115 1 Stime= omp_get_wtime()
116 1 call fapp_start ("precond", 1, 1)
117 2 do ic= 1, NCOLORtot
118 2 !$omp parallel do private(ip,ip1,i,WVAL,k)
119 3 p do ip= 1, PEsmpTOT
120 3 p ip1= (ic-1)*PEsmpTOT + ip
121 4 p do i= SMPindex(ip1-1)+1, SMPindex(ip1)
122 4 p WVAL= W(i,Z)
<<< Loop-information Start >>>
<<< [OPTIMIZATION]
<<< SIMD
<<< SOFTWARE PIPELINING
<<< Loop-information End >>>
123 5 p 4v do k= indexL(i-1)+1, indexL(i)
  124     5   p   4v                WVAL= WVAL - AL(k) * W(itemL(k),Z)
125 5 p 4v enddo
126 4 p W(i,Z)= WVAL * W(i,DD)
127 4 p enddo
128 3 p enddo
129 2 !$omp end parallel do
130 2 enddo
OMP-3
3.5 Precision PA Visualization Function (Excel) (1/3): Inserting Calls, Compile & Run

      call start_collection ("SpMV")

!$omp parallel do private(ip,i,VAL,k)
      do ip= 1, PEsmpTOT
        do i= SMPindex((ip-1)*NCOLORtot)+1, SMPindex(ip*NCOLORtot)
          VAL= D(i)*W(i,P)
          do k= 1, 3
            VAL= VAL + AL(k,i)*W(itemL(k,i),P)
          enddo
          do k= 1, 3
            VAL= VAL + AU(k,i)*W(itemU(k,i),P)
          enddo
          W(i,Q)= VAL
        enddo
      enddo
!$omp end parallel do

      call stop_collection ("SpMV")
3.5 Precision PA Visualization Function (Excel) (2/3): Collecting Performance Data: 7 executions
Directories: pa1~pa7, -Hpa=1~7

#!/bin/sh
#PJM -L "node=1"
#PJM -L "elapse=00:05:00"
#PJM -L "rscgrp=lecture"
#PJM -g "gt71"
#PJM -j
#PJM -o "3.lst"
#PJM --mpi "proc=1"
export OMP_NUM_THREADS=16

fapp -C -d pa1 -Ihwm -Hpa=1 ./sol-r3k
3.5 Precision PA Visualization Function (Excel) (3/3): Performance Analysis: Transformation + Excel

fapppx -A -d pa1 -o output_prof_1.csv -tcsv -Hpa
fapppx -A -d pa2 -o output_prof_2.csv -tcsv -Hpa
fapppx -A -d pa3 -o output_prof_3.csv -tcsv -Hpa
fapppx -A -d pa4 -o output_prof_4.csv -tcsv -Hpa
fapppx -A -d pa5 -o output_prof_5.csv -tcsv -Hpa
fapppx -A -d pa6 -o output_prof_6.csv -tcsv -Hpa
fapppx -A -d pa7 -o output_prof_7.csv -tcsv -Hpa
Fujitsu FX10: CASE-1, CM-RCM(2)
src0: CRS, Coalesced
L1 demand miss: 25.6%, memory throughput: 41.8 GB/sec.
Forward/Backward Substitution

[Figure: per-thread (Thread 0-15) breakdown of elapsed time (sec.) into wait categories: integer-load memory access wait, floating-point-load memory access wait, store wait, integer-load cache access wait, floating-point-load cache access wait, integer operation wait, floating-point operation wait, branch instruction wait, instruction fetch wait, barrier synchronization wait, uOP commit, other waits, 1-/2-3-/4-instruction commit]
Fujitsu FX10: CASE-2, CM-RCM(2)
reorder0: CRS, Sequential
L1 demand miss: 25.6%, memory throughput: 41.8 GB/sec.

[Figure: per-thread (Thread 0-15) breakdown of elapsed time (sec.), same wait categories as CASE-1]
Fujitsu FX10: CASE-3, CM-RCM(2)
ELL, Sequential
L1 demand miss: 5.4%, memory throughput: 64.0 GB/sec.

[Figure: per-thread (Thread 0-15) breakdown of elapsed time (sec.), same wait categories as CASE-1]