3D Parallel FEM (IV)
OpenMP + Hybrid Parallel Programming Model

Kengo Nakajima
Information Technology Center

Programming for Parallel Computing (616-2057)
Seminar on Advanced Computing (616-4009)
Hybrid Parallel Programming Model

• Message Passing (e.g. MPI) + Multithreading (e.g. OpenMP, CUDA, OpenCL, OpenACC etc.)
• On the K computer and FX10, hybrid parallel programming is recommended
  – MPI + automatic parallelization by Fujitsu's compiler
  – Personally, I do not like to call this "hybrid" !!!
• Expectations for Hybrid
  – Number of MPI processes (and sub-domains) to be reduced
  – O(10^8-10^9)-way MPI might not scale on exascale systems
  – Easily extended to heterogeneous architectures
    • CPU+GPU, CPU+manycore (e.g. Intel MIC/Xeon Phi)
    • MPI+X: OpenMP, OpenACC, CUDA, OpenCL
Flat MPI vs. Hybrid

• Hybrid: hierarchical structure (threads within each MPI process share memory)
• Flat MPI: each core runs an independent MPI process

[Figure: Flat MPI assigns one MPI process to every core; Hybrid groups the cores that share a memory into one MPI process with multiple threads]
HB M×N

• M: number of OpenMP threads per single MPI process
• N: number of MPI processes per single node

[Figure: a 16-core node (per-core L1 caches, shared L2, memory) split between threads and processes according to "HB M×N"]
Size (and number) of local data changes according to the parallel programming model.
Example: 6 nodes, 96 cores

Flat MPI (96 processes):   128 192 64 / 8 12 1 / pcube
HB 4×4   (24 processes):   128 192 64 / 4  6 1 / pcube
HB 16×1  ( 6 processes):   128 192 64 / 2  3 1 / pcube
Batch Script (1/2)
Env. Var.: OMP_NUM_THREADS

Hybrid 16×1:

#!/bin/sh
#PJM -L "node=6"
#PJM -L "elapse=00:05:00"
#PJM -j
#PJM -L "rscgrp=lecture2"
#PJM -g "gt82"
#PJM -o "test.lst"
#PJM --mpi "proc=6"
export OMP_NUM_THREADS=16
mpiexec ./sol
rm wk.*

Flat MPI:

#!/bin/sh
#PJM -L "node=6"
#PJM -L "elapse=00:05:00"
#PJM -j
#PJM -L "rscgrp=lecture2"
#PJM -g "gt82"
#PJM -o "test.lst"
#PJM --mpi "proc=96"
mpiexec ./sol
rm wk.*

Hybrid 8×2:

#!/bin/sh
#PJM -L "node=6"
#PJM -L "elapse=00:05:00"
#PJM -j
#PJM -L "rscgrp=lecture2"
#PJM -g "gt82"
#PJM -o "test.lst"
#PJM --mpi "proc=12"
export OMP_NUM_THREADS=8
mpiexec ./sol
rm wk.*
Batch Script (2/2)
Env. Var.: OMP_NUM_THREADS

Hybrid 4×4:

#!/bin/sh
#PJM -L "node=6"
#PJM -L "elapse=00:05:00"
#PJM -j
#PJM -L "rscgrp=lecture2"
#PJM -g "gt82"
#PJM -o "test.lst"
#PJM --mpi "proc=24"
export OMP_NUM_THREADS=4
mpiexec ./sol
rm wk.*
OMP-1
Background

• Multicore/manycore processors
  – Low power consumption; various types of programming models
• OpenMP
  – Directive based, (seems to be) easy
  – Many books
• Data dependency (no classes this year)
  – Conflict of reading from / writing to memory
  – Appropriate reordering of data is needed for "consistent" parallel computing
  – NO detailed information in OpenMP books: very complicated
• OpenMP/MPI hybrid parallel programming model for multicore/manycore clusters
SMP

[Figure: eight CPUs connected to a single MEMORY]

• SMP
  – Symmetric Multiprocessor
  – Multiple CPUs (cores) share a single memory space
What is OpenMP?  http://www.openmp.org

• An API for multi-platform shared-memory parallel programming in C/C++ and Fortran
  – Current version: 4.0
• Background
  – Merger of Cray and SGI in 1996
  – ASCI project (DOE) started
  – C/C++ version and Fortran version had been separately developed until ver. 2.5
• Fork-Join parallel execution model
• Users have to specify everything by directives.
  – Nothing happens if there are no directives.
Fork-Join Parallel Execution Model

[Figure: the master thread forks a team of threads at PARALLEL, the team executes concurrently, and the threads join back into the master thread at END PARALLEL; this fork-join cycle repeats for each parallel region]
Number of Threads

• OMP_NUM_THREADS
  – How to change?
    • bash (.bashrc):  export OMP_NUM_THREADS=8
    • csh (.cshrc):    setenv OMP_NUM_THREADS 8
Information about OpenMP

• OpenMP Architecture Review Board (ARB)
  – http://www.openmp.org
• References
  – Chandra, R. et al., "Parallel Programming in OpenMP" (Morgan Kaufmann)
  – Quinn, M.J., "Parallel Programming in C with MPI and OpenMP" (McGraw-Hill)
  – Mattson, T.G. et al., "Patterns for Parallel Programming" (Addison-Wesley)
  – Ushijima, "Parallel Programming and Numerical Computation with OpenMP" (Maruzen, in Japanese)
  – Chapman, B. et al., "Using OpenMP" (MIT Press) — the newest!
• Japanese version of the OpenMP 3.0 spec. (Fujitsu etc.)
  – http://www.openmp.org/mp-documents/OpenMP30spec-ja
Features of OpenMP

• Directives
  – Loops right after the directives are parallelized.
  – If the compiler does not support OpenMP, directives are treated as just comments.
OpenMP/Directives: Array Operations

Simple substitution:

!$omp parallel do
do i= 1, NP
  W(i,1)= 0.d0
  W(i,2)= 0.d0
enddo
!$omp end parallel do

Dot products:

!$omp parallel do private(iS,iE,i)
!$omp& reduction(+:RHO)
do ip= 1, PEsmpTOT
  iS= STACKmcG(ip-1) + 1
  iE= STACKmcG(ip  )
  do i= iS, iE
    RHO= RHO + W(i,R)*W(i,Z)
  enddo
enddo
!$omp end parallel do

DAXPY:

!$omp parallel do
do i= 1, NP
  Y(i)= ALPHA*X(i) + Y(i)
enddo
!$omp end parallel do
OpenMP/Directives: Matrix/Vector Products

!$omp parallel do private(ip,iS,iE,i,j)
do ip= 1, PEsmpTOT
  iS= STACKmcG(ip-1) + 1
  iE= STACKmcG(ip  )
  do i= iS, iE
    W(i,Q)= D(i)*W(i,P)
    do j= 1, INL(i)
      W(i,Q)= W(i,Q) + W(IAL(j,i),P)
    enddo
    do j= 1, INU(i)
      W(i,Q)= W(i,Q) + W(IAU(j,i),P)
    enddo
  enddo
enddo
!$omp end parallel do
Features of OpenMP

• Directives
  – Loops right after the directives are parallelized.
  – If the compiler does not support OpenMP, directives are treated as just comments.
• Nothing happens without explicit directives
  – Different from "automatic parallelization/vectorization"
  – Something wrong may happen through improper usage
  – Data configuration, reordering etc. are done under the users' responsibility
• "Threads" are created according to the number of cores on the node
  – Thread: corresponds to "process" in MPI
  – Generally, "# threads = # cores"; Xeon Phi supports 4 threads per core (hyper-multithreading)
Memory Contention

[Figure: eight CPUs sharing a single MEMORY]

• During a complicated process, multiple threads may simultaneously try to update the data at the same address in memory.
  – e.g.: multiple cores update a single component of an array.
  – This situation is possible.
  – Answers may change compared to serial cases with a single core (thread).
Memory Contention (cont.)

[Figure: eight CPUs sharing a single MEMORY]

• In this lecture, such cases are avoided by reordering etc.
  – In OpenMP, users are responsible for such issues (e.g. proper data configuration, reordering etc.)
• Generally speaking, performance per core decreases as the number of used cores (thread number) increases.
  – Memory access performance: STREAM
Features of OpenMP (cont.)

• "!$omp parallel do" - "!$omp end parallel do"
• Global (shared) variables, private variables
  – Default: global (shared)
  – Dot products: reduction

W(:,:), R, Z, PEsmpTOT: global (shared)

!$omp parallel do private(iS,iE,i)
!$omp& reduction(+:RHO)
do ip= 1, PEsmpTOT
  iS= STACKmcG(ip-1) + 1
  iE= STACKmcG(ip  )
  do i= iS, iE
    RHO= RHO + W(i,R)*W(i,Z)
  enddo
enddo
!$omp end parallel do
FORTRAN & C

Fortran:

use omp_lib
...
!$omp parallel do shared(n,x,y) private(i)
do i= 1, n
  x(i)= x(i) + y(i)
enddo
!$omp end parallel do

C:

#include <omp.h>
...
#pragma omp parallel for default(none) shared(n,x,y) private(i)
for (i=0; i<n; i++)
  x[i] += y[i];
In this class...

• There are many capabilities of OpenMP.
• In this class, only several functions are shown for parallelization of the parallel FEM code.
First things to be done (after OpenMP 3.0)

• Fortran:  use omp_lib
• C:        #include <omp.h>
OpenMP Directives (Fortran)

• NO distinction between upper and lower cases.
• sentinel
  – Fortran: !$OMP, C$OMP, *$OMP
    • !$OMP only for free format
  – Continuation lines (same rule as that of the Fortran compiler is applied)

  sentinel directive_name [clause[[,] clause]…]

• Examples for !$OMP PARALLEL DO SHARED(A,B,C):

!$OMP PARALLEL DO
!$OMP+SHARED (A,B,C)

!$OMP PARALLEL DO &
!$OMP SHARED (A,B,C)
OpenMP Directives (C)

• "\" for continuation lines
• Only lower case (except names of variables)

#pragma omp directive_name [clause[[,] clause]…]

#pragma omp parallel for shared (a,b,c)
PARALLEL DO

• Parallelizes DO/for loops
• Examples of "clause"
  – PRIVATE(list)
  – SHARED(list)
  – DEFAULT(PRIVATE|SHARED|NONE)
  – REDUCTION({operation|intrinsic}: list)

!$OMP PARALLEL DO [clause[[,] clause] … ]
(do_loop)
!$OMP END PARALLEL DO

#pragma omp parallel for [clause[[,] clause] … ]
(for_loop)
REDUCTION

• Similar to "MPI_Reduce"
• Operator
  – +, *, -, .AND., .OR., .EQV., .NEQV.
• Intrinsic
  – MAX, MIN, IAND, IOR, IEOR

REDUCTION ({operator|intrinsic}: list)
reduction ({operator|intrinsic}: list)
Example-1: A Simple Loop

!$OMP PARALLEL DO
do i= 1, N
  B(i)= (A(i) + B(i)) * 0.50
enddo
!$OMP END PARALLEL DO

• The default status of loop variables ("i" in this case) is private. Therefore, explicit declaration is not needed.
• "END PARALLEL DO" is not required.
  – In C, there is no "end parallel for" counterpart.
Example-1: REDUCTION

!$OMP PARALLEL DO DEFAULT(PRIVATE) REDUCTION(+:A,B)
do i= 1, N
  call WORK (Alocal, Blocal)
  A= A + Alocal
  B= B + Blocal
enddo
!$OMP END PARALLEL DO

• "END PARALLEL DO" is not required.
Functions which can be used with OpenMP

Name                                           Function
int    omp_get_num_threads(void)               Total thread #
int    omp_get_thread_num(void)                Thread ID
double omp_get_wtime(void)                     = MPI_Wtime
void   omp_set_num_threads(int num_threads)    Setting thread #
       call omp_set_num_threads(num_threads)
OpenMP for Dot Products

Serial:

VAL= 0.d0
do i= 1, N
  VAL= VAL + W(i,R) * W(i,Z)
enddo

With OpenMP (directives are just inserted):

VAL= 0.d0
!$OMP PARALLEL DO PRIVATE(i) REDUCTION(+:VAL)
do i= 1, N
  VAL= VAL + W(i,R) * W(i,Z)
enddo
!$OMP END PARALLEL DO
OpenMP for Dot Products: Multiple Loops

VAL= 0.d0
!$OMP PARALLEL DO PRIVATE(ip,i) REDUCTION(+:VAL)
do ip= 1, PEsmpTOT
  do i= index(ip-1)+1, index(ip)
    VAL= VAL + W(i,R) * W(i,Z)
  enddo
enddo
!$OMP END PARALLEL DO

PEsmpTOT: number of threads. An additional array INDEX(:) is needed.
Efficiency is not necessarily good, but users can specify the thread for each component of data. Directives are just inserted.
OpenMP for Dot Products: Multiple Loops (cont.)

e.g.: N=100, PEsmpTOT=4

INDEX(0)=   0
INDEX(1)=  25
INDEX(2)=  50
INDEX(3)=  75
INDEX(4)= 100

Each block ip handles components i = INDEX(ip-1)+1 through INDEX(ip).
NOT good for GPUs.
Files on Oakleaf-FX
>$ cd <$O-TOP>
>$ cp /home/z30088/class_eps/F/omp.tar .
>$ cp /home/z30088/class_eps/C/omp.tar .
>$ tar xvf omp.tar
>$ cd omp
<$O-omp>
Files on Oakleaf-FX
>$ cd <$O-omp>
>$ frtpx -Kfast,openmp test.f
>$ fccpx -Kfast,openmp test.c
>$ pjsub go.sh
Running the Job: go.sh

#!/bin/sh
#PJM -L "node=1"
#PJM -L "elapse=00:10:00"
#PJM -L "rscgrp=lecture"
#PJM -g "gt71"
#PJM -j
#PJM -o "t0-08.lst"
export OMP_NUM_THREADS=8    # number of threads/process
./a.out < INPUT.DAT

INPUT.DAT:  N nopt
  N:    problem size (vector length)
  nopt: first touch (0: no, 1: yes)
test.f / test.c

• DAXPY
• Dot products
• Effect of OpenMP directives
• Effect of first-touch data placement
test.f (1/3): Initialization

use omp_lib
implicit REAL*8 (A-H,O-Z)
real(kind=8), dimension(:), allocatable:: X, Y
real(kind=8), dimension(:), allocatable:: Z1, Z2
real(kind=8), dimension(:), allocatable:: Z3, Z4, Z5
integer, dimension(0:2) :: INDEX

!C
!C +------+
!C | INIT |
!C +------+
!C===
      write (*,*) 'N, nopt?'
      read (*,*) N, nopt
      allocate (X(N), Y(N), Z1(N), Z2(N), Z3(N), Z4(N), Z5(N))

      if (nopt.eq.0) then
        X = 1.d0
        Y = 1.d0
        Z1= 0.d0
        Z2= 0.d0
        Z3= 0.d0
        Z4= 0.d0
        Z5= 0.d0
       else
!$omp parallel do private (i)
        do i= 1, N
          X (i)= 0.d0
          Y (i)= 0.d0
          Z1(i)= 0.d0
          Z2(i)= 0.d0
          Z3(i)= 0.d0
          Z4(i)= 0.d0
          Z5(i)= 0.d0
        enddo
!$omp end parallel do
      endif
      ALPHA= 1.d0
!C===

N: problem size; nopt: option.
nopt=0: NO first touch.  nopt≠0: with first touch.
nopt=0 (NO first touch): the initialization loop is not parallelized.
nopt≠0 (with first touch): the initialization is done in parallel.
test.f (2/3): DAXPY

!C
!C +-------+
!C | DAXPY |
!C +-------+
!C===
      S2time= omp_get_wtime()
!$omp parallel do private (i)
      do i= 1, N
        Z1(i)= ALPHA*X(i) + Y(i)
        Z2(i)= ALPHA*X(i) + Y(i)
        Z3(i)= ALPHA*X(i) + Y(i)
        Z4(i)= ALPHA*X(i) + Y(i)
        Z5(i)= ALPHA*X(i) + Y(i)
      enddo
!$omp end parallel do
      E2time= omp_get_wtime()

      write (*,'(/a)') '# DAXPY'
      write (*,'( a, 1pe16.6)') ' omp-1 ', E2time - S2time
!C===
test.f (3/3): Dot Products

!C
!C +-----+
!C | DOT |
!C +-----+
!C===
      V1= 0.d0
      V2= 0.d0
      V3= 0.d0
      V4= 0.d0
      V5= 0.d0
      S2time= omp_get_wtime()
!$omp parallel do private(i) reduction (+:V1,V2,V3,V4,V5)
      do i= 1, N
        V1= V1 + X(i)*(Y(i)+1.d0)
        V2= V2 + X(i)*(Y(i)+2.d0)
        V3= V3 + X(i)*(Y(i)+3.d0)
        V4= V4 + X(i)*(Y(i)+4.d0)
        V5= V5 + X(i)*(Y(i)+5.d0)
      enddo
!$omp end parallel do
      E2time= omp_get_wtime()

      write (*,'(/a)') '# DOT'
      write (*,'( a, 1pe16.6)') ' omp-1 ', E2time - S2time
!C===
      stop
      end
test.c: Dummy Pragma needed

#pragma omp parallel
{}

if (nopt==0) {
  for(i=0; i<N; i++) {
    X[i]  = 1.0;
    Y[i]  = 1.0;
    Z1[i] = 0.0;
    Z2[i] = 0.0;
    Z3[i] = 0.0;
    Z4[i] = 0.0;
    Z5[i] = 0.0;
  }
} else {
#pragma omp parallel for private(i)
  for(i=0; i<N; i++) {
    X[i]  = 1.0;
    Y[i]  = 1.0;
    Z1[i] = 0.0;
    Z2[i] = 0.0;
    Z3[i] = 0.0;
    Z4[i] = 0.0;
    Z5[i] = 0.0;
  }
}

A dummy pragma is needed before the actual computation (for the Fujitsu C compiler); this problem may be fixed in the future. Threads for OpenMP are generated during the first parallel loop, and this procedure is expensive. In Fortran, this dummy loop is implicitly inserted by the compiler.
DAXPY: Effect of First Touch
FX10, N=60,000,000; T2K, N=10,000,000

• T2K: the effect of first touch is large.
• NOT scalable: memory contention, synchronization overhead.

[Figure: elapsed time (sec.) vs. thread # (1-16) on FX10 and T2K, with (YES) and without (NO) first touch]
First Touch Data Placement
"Patterns for Parallel Programming", Mattson, T.G. et al.

To reduce memory traffic in the system, it is important to keep the data close to the PEs that will work with the data (e.g. NUMA control). On NUMA computers, this corresponds to making sure the pages of memory are allocated and "owned" by the PEs that will be working with the data contained in the page.

The most common NUMA page-placement algorithm is the "first touch" algorithm, in which the PE first referencing a region of memory will have the page holding that memory assigned to it.

A very common technique in OpenMP programs is to initialize data in parallel using the same loop schedule as will be used later in the computations.
[Figure: node architectures — T2K/Tokyo (four 4-core sockets, per-core L1/L2, shared L3 and memory per socket), Cray XE6 "Hopper" (similar NUMA layout), Fujitsu FX10 "Oakleaf-FX" (16 cores with L1, shared L2, single memory)]
NUMA Architecture
Non-Uniform Memory Access

• Data should be on the local memory.
  – Local memory: the memory of the CPU socket where the core is located.

[Figure: multi-socket NUMA node; each socket's cores (with L1/L2/L3) attach to their own memory]
NUMA Architecture
Non-Uniform Memory Access (cont.)

• If data are on remote memory, access is not efficient.

[Figure: a core accessing another socket's memory across the inter-socket link]
NUMA Architecture
Non-Uniform Memory Access (cont.)

• "First-touch data placement" can keep data on local memory.
• "Parallel initialization" of arrays is effective.

[Figure: after parallel first-touch initialization, each socket's threads work on data in their own memory]
(test.f (1/3) initialization, shown again: nopt=0, NO first touch, the loop is not parallelized; nopt≠0, with first touch, initialization is done in parallel.)
Report P1 (1/2)

• Apply multi-threading by OpenMP to the parallel FEM code using MPI
  – CG solver (solver_CG, solver_SR)
  – Matrix assembling (mat_ass_main, mat_ass_bc)
• Hybrid parallel programming model
• Evaluate the effects of
  – problem size, parallel programming model, thread #
Report P1 (2/2)

• Deadline: 17:00, October 11th (Sat), 2014
  – Send files via e-mail at nakajima(at)cc.u-tokyo.ac.jp
• Report
  – Cover page: name, ID, and problem ID (P1)
  – Less than 20 pages including figures and tables (A4)
    • Strategy
    • Structure of the program
    • Numerical experiments, performance analysis
    • Remarks
  – Source list of the program
• Grade
  – A might be given if "CG" and "MAT_ASS" are done.
  – B is the highest grade if "MAT_ASS" is NOT done.
FORTRAN (solver_CG)

!$omp parallel do private(i)
      do i= 1, N
        X(i)    = X(i)    + ALPHA * WW(i,P)
        WW(i,R) = WW(i,R) - ALPHA * WW(i,Q)
      enddo

      DNRM20= 0.d0
!$omp parallel do private(i) reduction (+:DNRM20)
      do i= 1, N
        DNRM20= DNRM20 + WW(i,R)**2
      enddo

!$omp parallel do private(j,k,i,WVAL)
      do j= 1, N
        WVAL= D(j)*WW(j,P)
        do k= index(j-1)+1, index(j)
          i= item(k)
          WVAL= WVAL + AMAT(k)*WW(i,P)
        enddo
        WW(j,Q)= WVAL
      enddo
C (solver_CG)

#pragma omp parallel for private (i)
  for(i=0;i<N;i++){
    X [i]   += ALPHA *WW[P][i];
    WW[R][i]+= -ALPHA *WW[Q][i];
  }

  DNRM20= 0.e0;
#pragma omp parallel for private (i) reduction (+:DNRM20)
  for(i=0;i<N;i++){
    DNRM20+= WW[R][i]*WW[R][i];
  }

#pragma omp parallel for private (j,i,k,WVAL)
  for( j=0;j<N;j++){
    WVAL= D[j] * WW[P][j];
    for(k=indexLU[j];k<indexLU[j+1];k++){
      i= itemLU[k];
      WVAL+= AMAT[k] * WW[P][i];
    }
    WW[Q][j]= WVAL;
  }
solver_SR (send)

C:

for( neib=1;neib<=NEIBPETOT;neib++){
  istart= EXPORT_INDEX[neib-1];
  inum  = EXPORT_INDEX[neib]-istart;
#pragma omp parallel for private (k,ii)
  for( k=istart;k<istart+inum;k++){
    ii= EXPORT_ITEM[k];
    WS[k]= X[ii-1];
  }
  MPI_Isend(&WS[istart],inum,MPI_DOUBLE,
            NEIBPE[neib-1],0,MPI_COMM_WORLD,&req1[neib-1]);
}

Fortran:

do neib= 1, NEIBPETOT
  istart= EXPORT_INDEX(neib-1)
  inum  = EXPORT_INDEX(neib  ) - istart
!$omp parallel do private(k,ii)
  do k= istart+1, istart+inum
    ii   = EXPORT_ITEM(k)
    WS(k)= X(ii)
  enddo
  call MPI_Isend (WS(istart+1), inum, MPI_DOUBLE_PRECISION, &
 &                NEIBPE(neib), 0, MPI_COMM_WORLD, req1(neib), &
 &                ierr)
enddo
How to apply multi-threading

• CG solver
  – Just insert OpenMP directives.
  – ILU/IC preconditioning is much more difficult.
• MAT_ASS (mat_ass_main, mat_ass_bc)
  – Data dependency
  – Avoid accumulating contributions of multiple elements to a single node simultaneously (in parallel):
    • results may be changed
    • deadlock may occur
  – Coloring
    • Elements in a same color do not share a node.
    • Parallel operations are possible for elements in each color.
    • In this case, we need only 8 colors for 3D problems (4 colors for 2D problems).
    • The coloring part is very expensive: its parallelization is difficult.
Multi-Threading: Mat_Ass

Parallel operations are possible for elements in the same color (they are independent).
Coloring (1/2)

allocate (ELMCOLORindex(0:NP))     ! number of elements in each color
allocate (ELMCOLORitem (ICELTOT))  ! element ID renumbered according to "color"

if (allocated (IWKX)) deallocate (IWKX)
allocate (IWKX(0:NP,3))
IWKX= 0
icou= 0

do icol= 1, NP
  do i= 1, NP
    IWKX(i,1)= 0
  enddo
  do icel= 1, ICELTOT
    if (IWKX(icel,2).eq.0) then
      in1= ICELNOD(icel,1)
      in2= ICELNOD(icel,2)
      in3= ICELNOD(icel,3)
      in4= ICELNOD(icel,4)
      in5= ICELNOD(icel,5)
      in6= ICELNOD(icel,6)
      in7= ICELNOD(icel,7)
      in8= ICELNOD(icel,8)

      ip1= IWKX(in1,1)
      ip2= IWKX(in2,1)
      ip3= IWKX(in3,1)
      ip4= IWKX(in4,1)
      ip5= IWKX(in5,1)
      ip6= IWKX(in6,1)
      ip7= IWKX(in7,1)
      ip8= IWKX(in8,1)
Coloring (2/2)

      isum= ip1 + ip2 + ip3 + ip4 + ip5 + ip6 + ip7 + ip8
      if (isum.eq.0) then              ! none of the nodes is accessed in same color
        icou= icou + 1
        IWKX(icol,3)= icou             ! (current) number of elements in each color
        IWKX(icel,2)= icol
        ELMCOLORitem(icou)= icel       ! ID of icou-th element = icel
        IWKX(in1,1)= 1                 ! these nodes on the same element can not be
        IWKX(in2,1)= 1                 ! accessed in same color
        IWKX(in3,1)= 1
        IWKX(in4,1)= 1
        IWKX(in5,1)= 1
        IWKX(in6,1)= 1
        IWKX(in7,1)= 1
        IWKX(in8,1)= 1
        if (icou.eq.ICELTOT) goto 100  ! until all elements are colored
      endif
    endif
  enddo
enddo

100 continue
ELMCOLORtot= icol                      ! number of colors
IWKX(0          ,3)= 0
IWKX(ELMCOLORtot,3)= ICELTOT

do icol= 0, ELMCOLORtot
  ELMCOLORindex(icol)= IWKX(icol,3)
enddo
Multi-Threaded Matrix Assembling Procedure

do icol= 1, ELMCOLORtot
!$omp parallel do private (icel0,icel)                                &
!$omp&            private (in1,in2,in3,in4,in5,in6,in7,in8)           &
!$omp&            private (nodLOCAL,ie,je,ip,jp,kk,iiS,iiE,k)         &
!$omp&            private (DETJ,PNX,PNY,PNZ,QVC,QV0,COEFij,coef,SHi)  &
!$omp&            private (PNXi,PNYi,PNZi,PNXj,PNYj,PNZj,ipn,jpn,kpn) &
!$omp&            private (X1,X2,X3,X4,X5,X6,X7,X8)                   &
!$omp&            private (Y1,Y2,Y3,Y4,Y5,Y6,Y7,Y8)                   &
!$omp&            private (Z1,Z2,Z3,Z4,Z5,Z6,Z7,Z8,COND0)
  do icel0= ELMCOLORindex(icol-1)+1, ELMCOLORindex(icol)
    icel= ELMCOLORitem(icel0)
    in1= ICELNOD(icel,1)
    in2= ICELNOD(icel,2)
    in3= ICELNOD(icel,3)
    in4= ICELNOD(icel,4)
    in5= ICELNOD(icel,5)
    in6= ICELNOD(icel,6)
    in7= ICELNOD(icel,7)
    in8= ICELNOD(icel,8)
    ...
  enddo
enddo
Results (1/2)
512×384×256 = 50,331,648 nodes; 12 nodes, 192 cores; 64^3 = 262,144 nodes/core

           ndx,ndy,ndz (#MPI proc.)   Iter's   sec.
Flat MPI       8  6  4  (192)          1240    73.9
HB  1×16       8  6  4  (192)          1240    73.6   compiled with -Kopenmp, OMP_NUM_THREADS=1
HB  2×8        4  6  4  ( 96)          1240    78.8
HB  4×4        4  3  4  ( 48)          1240    80.3
HB  8×2        4  3  2  ( 24)          1240    81.1
HB 16×1        2  3  2  ( 12)          1240    81.9

512 384 256
ndx ndy ndz
pcube
Results (2/2)
512×384×256 = 50,331,648 nodes; 12 nodes, 192 cores; 64^3 = 262,144 nodes/core

OMP_NUM_THREADS         sec.    Speed-Up
 1                    1056.2      1.00
 2                     592.5      1.78
 4                     289.8      3.64
 8                     148.1      7.13
12                     103.6     10.1
16                      81.9     12.90
Flat MPI, 1 proc./node  1082.4     -

512 384 256
2 3 2
pcube
Flat MPI vs. Hybrid

• Depends on applications, problem size, HW etc.
• Flat MPI is generally better for sparse linear solvers, if the number of computing nodes is not so large.
  – Memory contention
• Hybrid becomes better, if the number of computing nodes is larger.
  – Fewer MPI processes
• Flat MPI is not realistic for Intel Xeon Phi/MIC with 240 threads/node.
  – Each MPI process requires a certain amount of memory.

[Figure: a 16-core node with per-core L1 caches, shared L2 and memory]
omp parallel (do)

• "omp parallel" - "omp end parallel" generates/eliminates threads at every call: fork-join.
• This could be overhead for multiple loops.
• omp parallel + omp do / omp for:

!$omp parallel ...
!$omp do
      do i= 1, N
      ...
!$omp do
      do i= 1, N
      ...
!$omp end parallel      ! required

#pragma omp parallel ...
#pragma omp for
{
...
#pragma omp for
{
...
Tuning

• List of messages by compiler (compile list)
• Precision PA visualization function (Excel)

List of Messages by Compiler (Compile List)

• -Qt
  – List of messages by compiler (compile list)
  – *.lst
  – Fortran only
    • In C, "-Qt" is not available.
    • Please use "-Nsrc"; the list is displayed on screen.
F90         = mpifrtpx
F90OPTFLAGS = -Kfast,openmp -Qt
F90FLAGS    = $(F90OPTFLAGS)

The current version of the C/C++ compiler can produce a list of messages:
Fortran/C/C++
  -Nlst=p   standard optimization information (default)
  -Nlst=t   detailed optimization information

Fortran ONLY
  -Nlst=a   attribute information of names
  -Nlst=d   structure information of derived types
  -Nlst=i   program listing of included files and list of include file names
  -Nlst=m   source program output expressing the status of automatic parallelization with OpenMP directives
  -Nlst=x   cross-reference information of names and statement numbers

Info in *.lst
SIMD Information

Automatic Parallelization

Example of the List
101 1 !C
102 1 !C +----------------+
103 1 !C | {z}= [Minv]{r} |
104 1 !C +----------------+
105 1 !C===
106 1
107 1 !$omp parallel do private(ip,i)
108 2 p do ip= 1, PEsmpTOT
<<< Loop-information Start >>>
<<< [OPTIMIZATION]
<<< SIMD
<<< SOFTWARE PIPELINING
<<< Loop-information End >>>
109 3 p 8v do i = SMPindexG(ip-1)+1, SMPindexG(ip)
110 3 p 8v W(i,Z)= W(i,R)
111 3 p 8v enddo
112 2 p enddo
113 1 !$omp end parallel do
114 1
115 1 Stime= omp_get_wtime()
116 1 call fapp_start ("precond", 1, 1)
117 2 do ic= 1, NCOLORtot
118 2 !$omp parallel do private(ip,ip1,i,WVAL,k)
119 3 p do ip= 1, PEsmpTOT
120 3 p ip1= (ic-1)*PEsmpTOT + ip
121 4 p do i= SMPindex(ip1-1)+1, SMPindex(ip1)
122 4 p WVAL= W(i,Z)
<<< Loop-information Start >>>
<<< [OPTIMIZATION]
<<< SIMD
<<< SOFTWARE PIPELINING
<<< Loop-information End >>>
123 5 p 4v do k= indexL(i-1)+1, indexL(i)
  124     5   p   4v                WVAL= WVAL - AL(k) * W(itemL(k),Z)
125 5 p 4v enddo
126 4 p W(i,Z)= WVAL * W(i,DD)
127 4 p enddo
128 3 p enddo
129 2 !$omp end parallel do
130 2 enddo
OMP-3
3.5 Precision PA Visualization Function (Excel) (1/3): Inserting Calls, Compile & Run

      call start_collection ("SpMV")

!$omp parallel do private(ip,i,VAL,k)
      do ip= 1, PEsmpTOT
        do i= SMPindex((ip-1)*NCOLORtot)+1, SMPindex(ip*NCOLORtot)
          VAL= D(i)*W(i,P)
          do k= 1, 3
            VAL= VAL + AL(k,i)*W(itemL(k,i),P)
          enddo
          do k= 1, 3
            VAL= VAL + AU(k,i)*W(itemU(k,i),P)
          enddo
          W(i,Q)= VAL
        enddo
      enddo
!$omp end parallel do

      call stop_collection ("SpMV")
3.5 Precision PA Visualization Function (Excel) (2/3): Collecting Performance Data: 7 executions
Directories: pa1~pa7, -Hpa=1~7

#!/bin/sh
#PJM -L "node=1"
#PJM -L "elapse=00:05:00"
#PJM -L "rscgrp=lecture"
#PJM -g "gt71"
#PJM -j
#PJM -o "3.lst"
#PJM --mpi "proc=1"
export OMP_NUM_THREADS=16

fapp -C -d pa1 -Ihwm -Hpa=1 ./sol-r3k
3.5 Precision PA Visualization Function (Excel) (3/3): Performance Analysis: Transformation + Excel

fapppx -A -d pa1 -o output_prof_1.csv -tcsv -Hpa
fapppx -A -d pa2 -o output_prof_2.csv -tcsv -Hpa
fapppx -A -d pa3 -o output_prof_3.csv -tcsv -Hpa
fapppx -A -d pa4 -o output_prof_4.csv -tcsv -Hpa
fapppx -A -d pa5 -o output_prof_5.csv -tcsv -Hpa
fapppx -A -d pa6 -o output_prof_6.csv -tcsv -Hpa
fapppx -A -d pa7 -o output_prof_7.csv -tcsv -Hpa
Fujitsu FX10: CASE-1, CM-RCM(2)
src0: CRS, Coalesced
L1 demand miss: 25.6%, memory throughput: 41.8 GB/sec.
Forward/Backward Substitution

[Figure: per-thread (Thread 0-15) breakdown of elapsed time (sec.) into wait categories: integer-load memory access wait, floating-point-load memory access wait, store wait, integer-load cache access wait, floating-point-load cache access wait, integer operation wait, floating-point operation wait, branch instruction wait, instruction fetch wait, barrier synchronization wait, uOP commit, other waits, 1-/2-3-/4-instruction commit]
Fujitsu FX10: CASE-2, CM-RCM(2)
reorder0: CRS, Sequential
L1 demand miss: 25.6%, memory throughput: 41.8 GB/sec.

[Figure: per-thread (Thread 0-15) breakdown of elapsed time (sec.), same wait categories as CASE-1]
Fujitsu FX10: CASE-3, CM-RCM(2)
ELL, Sequential
L1 demand miss: 5.4%, memory throughput: 64.0 GB/sec.

[Figure: per-thread (Thread 0-15) breakdown of elapsed time (sec.), same wait categories as CASE-1]