

Page 1: CT Advanced OpenMP

Advanced Topics of OpenMP Programming

Christian Terboven, [email protected]

Center for Computing and Communication, RWTH Aachen University

PPCES 2010, March 24th, RWTH Aachen University


Advanced Topics of OpenMP Programming 24.03.2010 – C. Terboven

Agenda

o Repetition + Additional OpenMP 2.5 Topics

o Tools for OpenMP programming

o OpenMP 3.0 and Tasks

o Example and Case Study
– Fibonacci w/ Tasks
– Parallel Sparse MatVecMult

o OpenMP and the Hardware Architecture

o Summary of second part

[Repetition | Tools for OpenMP | OpenMP 3.0 & Tasks | Example + Case Study | OpenMP & Architecture | Summary]


Summary of first part
o OpenMP is a parallel programming model for Shared-Memory machines. That is, all threads have access to a shared main memory. In addition to that, each thread may have private data.
o The parallelism has to be expressed explicitly by the programmer. The base construct is a Parallel Region: a Team of threads is provided by the runtime system.
o Using the available Worksharing constructs, the work can be distributed among the threads of a team; influencing the scheduling is possible.
o To control the parallelization, thread exclusion and synchronization constructs can be used.


The ordered construct
o Allows a structured block within a parallel loop to be executed in sequential order
o In addition, an ordered clause has to be added to the Worksharing loop directive in which this construct occurs
o Can be used e.g. to enforce ordering on printing of data
o May help to determine whether there is a data race

C/C++

#pragma omp ordered
... structured block ...


OpenMP Environment Variables
o OMP_NUM_THREADS: Controls how many threads will be used to execute the program.
o OMP_SCHEDULE: If the schedule type runtime is specified in a schedule clause, the value of this environment variable will be used.
o OMP_DYNAMIC: The OpenMP runtime is allowed to smartly guess how many threads might deliver the best performance. If you want full control, set this variable to false.
o OMP_NESTED: Most OpenMP implementations require this to be set to true in order to enable nested Parallel Regions. Remember: Nesting Worksharing constructs is not possible.
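A typical shell setup using these variables might look as follows (the program name is hypothetical):

```shell
# request 4 threads and dynamic scheduling with chunk size 100
export OMP_NUM_THREADS=4
export OMP_SCHEDULE="dynamic,100"   # picked up by schedule(runtime)
export OMP_DYNAMIC=false            # full control over the thread count
export OMP_NESTED=true              # enable nested Parallel Regions
./my_openmp_program                 # hypothetical program name
```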


OpenMP API: Locks
o OpenMP provides a set of low-level locking routines, similar to semaphores:
– void omp_func_lock(omp_lock_t *lck), with func:
• init / init_nest: Initialize the lock variable
• destroy / destroy_nest: Remove the lock variable association
• set / set_nest: Set the lock, wait until the lock is acquired
• test / test_nest: Try to set the lock, but return if the lock could not be acquired
• unset / unset_nest: Unset the lock
– The argument is the address of an instance of omp_lock_t type
– Simple lock: May not be locked if already in a locked state
– Nested lock: May be locked multiple times by the same thread


Memory Model of OpenMP
o OpenMP: Shared-Memory model
– All threads share a common address space (shared memory)
– Threads can have private data (explicit user control)
– Fork-Join execution model
o Weak memory model
– Temporary View: Memory consistency is guaranteed only after certain points, namely implicit and explicit flushes:
o Any OpenMP barrier includes a flush
o Entry to and exit from critical regions includes a flush
o Entry to and exit from lock routines (OpenMP API) includes a flush


The flush directive

o Enforces shared data to be consistent (but be cautious!)
– If a thread has updated some variables, their values will be flushed to memory, and thus be accessible to other threads
– If a thread has not updated a value, the construct will ensure that any local copy gets the latest value from memory
o BUT: Do not use this for thread synchronization
– Compiler optimizations might get in your way
– Rather use OpenMP lock functions for thread synchronization

C/C++

#pragma omp flush [(list)]


Book recommendation
Using OpenMP: Portable Shared Memory Parallel Programming
Barbara Chapman, Gabriele Jost, Ruud van der Pas
o ISBN-10: 0262533022
o ISBN-13: 978-0262533027
o MIT Press, Cambridge, MA


Race Condition
o Data Race: The typical multi-threaded programming error occurs when:
– Two or more threads of a single process access the same memory location concurrently in between two synchronization points, and
– At least one of these accesses modifies the location, and
– The accesses are not protected by locks or critical regions.
o In many cases private clauses or barriers are missing
o Non-deterministic occurrence: The sequence of the execution of parallel loop iterations is non-deterministic and may change from run to run, for example
o Hard to find using a traditional debugger; instead use
– Sun Thread Analyzer
– Intel Thread Checker


Program with a Data Race

#pragma omp parallel
{
   [...]
   /* compute stencil, residual and update */
   #pragma omp for
   for (j=1; j<m-1; j++)
      for (i=1; i<n-1; i++)
      {
         resid = (ax * (UOLD(j,i-1) + UOLD(j,i+1))
                + ay * (UOLD(j-1,i) + UOLD(j+1,i))
                + b * UOLD(j,i) - F(j,i)) / b;
         U(j,i) = UOLD(j,i) - omega * resid;
         error = error + resid*resid;
      }
} /* end of parallel region */
printf("error: %f", error);

There are two OpenMP errors in this code, let's find them ...


Usage: Sun Thread Analyzer
o Compile with the Sun Compilers on Linux or Solaris:
– Add compiler / linker switch: -xinstrument=datarace
o Execute the program (with multiple threads) with collect:
– export OMP_NUM_THREADS=2
– collect -r on program arguments
o The verification is done only for the given dataset
– The tool traces all memory accesses → runtime and memory requirements might explode!
– Thus, use the smallest (and still meaningful) dataset
o View result: tha test.1.er


Analysis Result: Sun Thread Analyzer (1/2)
o Three Data Races are found
o Switch to "Dual Source" to examine the races


Analysis Result: Sun Thread Analyzer (2/2)
o resid
o error
The tool gives you the line of the Data Race – the user has to find the variable!


Is this program now correct? (1/2)

#pragma omp parallel
{
   [...]
   /* compute stencil, residual and update */
   #pragma omp for private(i, resid, error)
   for (j=1; j<m-1; j++)
      for (i=1; i<n-1; i++)
      {
         resid = (ax * (UOLD(j,i-1) + UOLD(j,i+1))
                + ay * (UOLD(j-1,i) + UOLD(j+1,i))
                + b * UOLD(j,i) - F(j,i)) / b;
         U(j,i) = UOLD(j,i) - omega * resid;
         error = error + resid*resid;
      }
} /* end of parallel region */


Is this program now correct? (2/2)
o Thread Analyzer tells you so…
o It contains no Data Races anymore, but error has to be reduced!

#pragma omp for private(i, resid) \
                reduction(+:error)
for (j=1; j<m-1; j++)
   for (i=1; i<n-1; i++)
   {
      […]
      error = error + resid*resid;
   }
/* end of parallel region */


Our advice

Never put an OpenMP code into production ...
... without using Thread Analyzer or Thread Checker!


How to parallelize a While-loop?
o How would you parallelize this code?

typedef list<double> dList;
dList myList;
/* fill myList with tons of items */

dList::iterator it = myList.begin();
while (it != myList.end())
{
   *it = processListItem(*it);
   it++;
}

o One possibility: Create a fixed-sized array containing all list items and run a parallel loop over this array.
Concept: Inspector / Executor


How to parallelize a While-loop!
o Or: Use Tasking in OpenMP 3.0

#pragma omp parallel
{
   #pragma omp single
   {
      dList::iterator it = myList.begin();
      while (it != myList.end())
      {
         #pragma omp task
         {
            *it = processListItem(*it);
         }
         it++;
      }
   }
}

o All while-loop iterations are independent from each other!


Biggest change in OpenMP 3.0: Task Parallelism
o Tasks allow the parallelization of irregular problems, e.g.
– unbounded loops
– recursive algorithms
– Producer / Consumer patterns
– and more …
o Task: A work unit whose execution may be deferred
– Can also be executed immediately
o Tasks are composed of
– Code to execute
– Data environment
– Internal control variables (ICV)


Tasks in OpenMP: Overview
o Tasks are executed by the threads of the Team
o The data environment of a Task is constructed at creation time
o A Task can be tied to a thread – only that thread may execute it – or untied
o Tasks are either implicit or explicit
o Implicit tasks: The thread encountering a Parallel construct
– creates as many implicit Tasks as there are threads in the Team
– Each thread executes one implicit Task
– Implicit Tasks are tied
→ Different description than in 2.5, but equivalent semantics!


The task directive
o Each encountering thread creates a new Task
– Code and data are packaged up
– Tasks can be nested
• into another Task directive
• into a Worksharing construct
o Data scoping clauses:
• shared(list)
• private(list)
• firstprivate(list)
• default(shared | none)

C/C++
#pragma omp task [clause [[,] clause] ... ]
... structured block ...

Schedule clauses:
• untied
Other clauses:
• if(expr)


Task synchronization (1/2)
o At an OpenMP barrier (implicit or explicit)
– All tasks created by any thread of the current Team are guaranteed to be completed at barrier exit
o Task barrier: taskwait
– The encountering Task suspends until its child tasks are complete
• Only direct children, not descendants!

C/C++
#pragma omp taskwait


Task synchronization (2/2)
o Simple example of Task synchronization in OpenMP 3.0:

#pragma omp parallel num_threads(np)
{
   #pragma omp task
   function_A();        // np Tasks created here, one for each thread

   #pragma omp barrier  // all Tasks guaranteed to be completed here

   #pragma omp single
   {
      #pragma omp task
      function_B();     // 1 Task created here
   }                    // B-Task guaranteed to be completed here
}


Tasks in OpenMP: Data Scoping
o Some rules from Parallel Regions apply:
– Static and global variables are shared
– Automatic storage (local) variables are private
o If no default clause is given:
– Orphaned Task variables are firstprivate by default!
– Non-orphaned Task variables inherit the shared attribute!
→ Variables are firstprivate unless shared in the enclosing context
o So far no verification tool is available to check Tasking programs for correctness!


Tasks in OpenMP: Scheduling
o Default: Tasks are tied to the thread that first executes them → not necessarily the creator. Scheduling constraints:
– Only the thread a Task is tied to can execute it
– A Task can only be suspended at a suspension point
• Task creation, Task finish, taskwait, barrier
– If a Task is not suspended in a barrier, the executing thread can only switch to a direct descendant of all Tasks tied to the thread
o Tasks created with the untied clause are never tied
– No scheduling restrictions, e.g. they can be suspended at any point
– But: more freedom for the implementation, e.g. load balancing


Tasks in OpenMP: if clause
o If the expression of an if clause on a Task evaluates to false
– The encountering Task is suspended
– The new Task is executed immediately
– The parent Task resumes when the new Task finishes
→ Used for optimization, e.g. to avoid the creation of small tasks
o If the expression of an if clause on a Parallel Region evaluates to false
– The Parallel Region is executed by a Team of just one thread
→ Used for optimization, e.g. to avoid going parallel
o In both cases the OpenMP data scoping rules still apply!


Task pitfalls (1/3)
o It is the user's responsibility to ensure data is alive:

// within a Parallel Region
void foo() {
   int a[LARGE_N];
   #pragma omp task
   {
      bar1(a);
   }
   #pragma omp task
   {
      bar2(a);
   }
}

If not shared: the parent Task may have exited foo() by the time bar1() or bar2() accesses a: a is a variable of automatic storage duration and thus is disposed of when foo() is exited.
Variable a has to be shared in order to prevent a copy into the tasks (default firstprivate).


Task pitfalls (2/3)
o It is the user's responsibility to ensure data is alive:

// within a Parallel Region
void foo() {
   int a[LARGE_N];
   #pragma omp task shared(a)
   {
      bar1(a);
   }
   #pragma omp task shared(a)
   {
      bar2(a);
   }
   #pragma omp taskwait  // wait for all Tasks created on this level
}


Task pitfalls (3/3)
o Examine your code thoroughly before using untied Tasks:

int dummy;
#pragma omp threadprivate(dummy)

void foo() { dummy = …; }
void bar() { … = dummy; }

#pragma omp task untied
{
   foo();
   bar();   // the Task could switch to a different thread
}           //   between foo() and bar()


Other news in OpenMP 3.0 (1/5)
o Static schedule guarantees

#pragma omp for schedule(static) nowait
for (i = 1; i < N; i++)
   a[i] = …

#pragma omp for schedule(static)
for (i = 1; i < N; i++)
   c[i] = a[i] + …

Allowed in OpenMP 3.0 if and only if:
- the number of iterations is the same
- the chunk is the same (or not specified)


Other news in OpenMP 3.0 (2/5)
o Loop collapsing

#pragma omp for collapse(2)
for (i = 1; i < N; i++)
   for (j = 1; j < M; j++)
      for (k = 1; k < K; k++)
         foo(i, j, k);

The iteration space of the i-loop and the j-loop is collapsed into a single one, if the loops are perfectly nested and form a rectangular iteration space.


Other news in OpenMP 3.0 (3/5)
o New variable types allowed in for-Worksharing

#pragma omp for
for (unsigned int i = 0; i < N; i++)
   foo(i);

vector<int> v;
vector<int>::iterator it;
#pragma omp for
for (it = v.begin(); it < v.end(); it++)
   foo(it);

Legal in OpenMP 3.0:
- Unsigned integer types
- Pointer types
- Random access iterators (C++)


Other news in OpenMP 3.0 (4/5)
o Improvements in the API for Nested Parallelism:
– How many nested Parallel Regions?
• int omp_get_level()
– How many nested Parallel Regions are active?
• int omp_get_active_level()
– Which thread-id was my ancestor, at a given level?
• int omp_get_ancestor_thread_num(int level)
– How many threads were in my ancestor's team, at a given level?
• int omp_get_team_size(int level)
o This is now well-defined in OpenMP 3.0:

omp_set_num_threads(3);
#pragma omp parallel
{
   omp_set_num_threads(omp_get_thread_num() + 2);
   #pragma omp parallel
   {
      foo();
   }
}


Other news in OpenMP 3.0 (5/5)
o Improved definition of environment interaction
– Env. Var. OMP_MAX_ACTIVE_LEVELS + API functions
• Controls the maximum number of active parallel regions
– Env. Var. OMP_THREAD_LIMIT + API functions
• Controls the maximum number of OpenMP threads
– Env. Var. OMP_STACKSIZE
• Controls the stack size of child threads
– Env. Var. OMP_WAIT_POLICY
• Controls the thread idle policy:
– active: good for dedicated systems
– passive: good for shared systems (e.g. in batch mode)


Recursive approach to compute Fibonacci

int main(int argc, char* argv[])
{
   [...]
   fib(input);
   [...]
}

int fib(int n) {
   if (n < 2) return n;
   int x = fib(n - 1);
   int y = fib(n - 2);
   return x + y;
}

o On the following slides we will discuss three approaches to parallelize this recursive code with Tasking.


First version parallelized with Tasking (omp-v1)

int main(int argc, char* argv[])
{
   [...]
   #pragma omp parallel
   {
      #pragma omp single
      {
         fib(input);
      }
   }
   [...]
}

int fib(int n) {
   if (n < 2) return n;
   int x, y;
   #pragma omp task shared(x)
   {
      x = fib(n - 1);
   }
   #pragma omp task shared(y)
   {
      y = fib(n - 2);
   }
   #pragma omp taskwait
   return x + y;
}

o Only one Task / Thread enters fib() from main(); it is responsible for creating the two initial work tasks
o The taskwait is required, as otherwise x and y would be lost


Scalability measurements (1/3)
[Chart: speedup of Fibonacci with Tasks over 1, 2, 4, 8 threads; series: optimal, omp-v1]
o Overhead of task creation prevents better scalability!


Improved parallelization with Tasking (omp-v2)

int main(int argc, char* argv[])
{
   [...]
   #pragma omp parallel
   {
      #pragma omp single
      {
         fib(input);
      }
   }
   [...]
}

int fib(int n) {
   if (n < 2) return n;
   int x, y;
   #pragma omp task shared(x) if(n > 30)
   {
      x = fib(n - 1);
   }
   #pragma omp task shared(y) if(n > 30)
   {
      y = fib(n - 2);
   }
   #pragma omp taskwait
   return x + y;
}

o Improvement: Don't create yet another task once a certain (small enough) n is reached


Scalability measurements (2/3)
[Chart: speedup of Fibonacci with Tasks over 1, 2, 4, 8 threads; series: optimal, omp-v1, omp-v2]
o Speedup is ok, but we still have some overhead when running with 4 or 8 threads


Improved parallelization with Tasking (omp-v3)

int main(int argc, char* argv[])
{
   [...]
   #pragma omp parallel
   {
      #pragma omp single
      {
         fib(input);
      }
   }
   [...]
}

int fib(int n) {
   if (n < 2) return n;
   if (n <= 30)
      return serfib(n);   // serial version
   int x, y;
   #pragma omp task shared(x)
   {
      x = fib(n - 1);
   }
   #pragma omp task shared(y)
   {
      y = fib(n - 2);
   }
   #pragma omp taskwait
   return x + y;
}

o Improvement: Skip the OpenMP overhead once a certain n is reached (no issue w/ production compilers)


Scalability measurements (3/3)
[Chart: speedup of Fibonacci with Tasks over 1, 2, 4, 8 threads; series: optimal, omp-v1, omp-v2, omp-v3]
o Everything is ok now


Repetition: Performance Aspects
o Performance Measurements
– Runtime (real wall time, user cpu time, system time)
– FLOPS: number of floating point operations (per sec)
– Speedup: performance gain relative to one core/thread
– Efficiency: Speedup relative to the theoretical maximum
o Performance Impacts
– Load Imbalance
– Data Locality on cc-NUMA architectures
– Memory Bandwidth (consumption per thread)
– Cache Effects


Case Study: Sparse Matrix-Vector-Multiplication
[Figure: Beijing Botanical Garden – left: original building; bottom-right: lattice model; bottom-left: matrix shape]
(image copyright: Beijing Botanical Garden and University of Florida Sparse Matrix Collection)


Experiment Setup and Hardware
o Intel Xeon 5450 (2x4 cores) – 3.0 GHz; 12 MB Cache
– L2 cache shared between 2 cores
– flat memory architecture (FSB)
o AMD Opteron 875 (4x2 cores) – 2.2 GHz; 8 MB Cache
– L2 cache not shared
– ccNUMA architecture (HT links)
o Matrices in CRS format
– "large" means 75 MB >> caches
→ SMXV is an important kernel in numerical codes (GMRES, …)
[Diagram: memory (M) and core (C) topology of the two machines]


Performance with Static Row Distribution (1)
o compact: threads packed tightly; scatter: threads distributed
[Chart: static row distribution – mflops over 1 to 8 threads; series: AMD Opteron, Intel Xeon compact, Intel Xeon scatter; scatter placement is up to 25% better than compact; callouts mark ca. 1200 mflops and ca. 850 mflops]


Performance Analysis of (1)
o Load Balancing: Static data distribution is not optimal!
o Data Locality on cc-NUMA architectures: Static data distribution is not optimal!
o Memory Bandwidth: Compact thread placement is not optimal on Intel Xeon processors!
[Diagram: two threads (T) packed on neighboring cores (C) share one path to memory (M)]


Performance with Dynamic Row Distribution (2)
o compact: threads packed tightly; scatter: threads distributed
[Chart: dynamic row distribution – mflops over 1 to 8 threads; series: Intel Xeon scatter, Intel Xeon compact, AMD Opteron; callouts mark ca. 985 mflops and ca. 660 mflops]


Performance Analysis of (2)
o Why does the Xeon deliver better performance with the dynamic row distribution, but the Opteron gets worse?
→ Data Locality. The Opteron is a cc-NUMA architecture; the threads are distributed over the cores of all sockets, but the data is not!
→ Solution: Parallel initialization of the matrix to employ the first-touch mechanism of operating systems: data is placed near to where it is first accessed.
[Diagram: placement of the Data relative to memory (M) and cores (C) for the static and the dynamic distribution]


Performance with Pre-Calculated Row Distribution (3)
[Chart: static row distribution, sorted matrix – mflops over 1 to 8 threads; AMD Opteron reaches ca. 2000 mflops, Intel Xeon ca. 1000 mflops]


Final Performance Analysis
o By exploiting data locality the AMD Opteron is able to reach about 2000 mflops!
o The Intel Xeon performance stagnates at about 1000 mflops, since the memory bandwidth of the front-side bus is already exhausted when using four threads:
[Diagram: four threads (T) share the front-side bus path to the Data in memory (M)]
o If the matrix were smaller and fit into the cache, the result would look different…


Comparing Processors and Boxes…

Metric \ Server    | SF V40z                  | FSC RX200 S4
Processor Chip     | AMD Opteron 875, 2.2 GHz | Intel Xeon 5450, 3.00 GHz
# sockets          | 4                        | 2
# cores            | 8 (dual-core)            | 8 (quad-core)
# threads          | 8                        | 8
Accumulated L2 $   | 8 MB                     | 16 MB
L2 $ Strategy      | Separate per core        | Shared by 2 cores
Technology         | 90 nm                    | 45 nm
Peak Performance   | 35.2 GFLOPS              | 96 GFLOPS
Dimension          | 3 units                  | 1 unit

Note: Here we compare machines of different ages – which can be seen as unfair! For example, newer Opteron-based machines provide similar settings in 1 unit…


Page 58: CT Advanced OpenMP


Measuring Memory Bandwidth

o Do not look at the CPU performance only – the memory subsystem's performance is crucial for your HPC application!

long long *x, *xstart, *xend, mask;
for (x = xstart; x < xend; x++)
    *x ^= mask;

o Each loop iteration: one load + one store

o We ran this kernel with multiple threads, each working on private data, using OpenMP (large memory footprint >> L2)

o Explicit processor binding to control the thread placement
– Linux: taskset command
– Solaris: SUN_MP_PROCBIND environment variable



Page 59: CT Advanced OpenMP


Selected results: 1 thread


[Topology diagrams: memories (M) and cores (C) of both machines]

2x Clovertown, 2.66 GHz – 1 thread: 3.970 GB/s
4x Opteron 875, 2.2 GHz – 1 thread: 3.998 GB/s


Page 60: CT Advanced OpenMP


Memory Bandwidth: Dual-Socket Quad-Core Clovertown

[Diagrams: varying thread placements across the eight cores (C) and the single memory (M)]

1 thread: 3.970 GB/s
2 threads: 3.998 GB/s
2 threads: 6.871 GB/s
2 threads: 4.661 GB/s
8 threads: 8.006 GB/s


Page 61: CT Advanced OpenMP


Memory Bandwidth: Quad-Socket Dual-Core Opteron

[Diagrams: varying thread placements across four sockets, each with its own memory (M) and two cores (C)]

1 thread: 3.998 GB/s
2 threads: 4.674 GB/s
2 threads: 8.210 GB/s
2 threads: 4.335 GB/s
8 threads: 18.470 GB/s

ccNUMA!

Page 62: CT Advanced OpenMP


What does that mean for my application?

o Ouch, these boxes all behave differently…

o Let's look at a Sparse Matrix Vector multiplication:

                               SF V40z (Opteron)   FSC RX200 S4 (Xeon)
Sparse MatVec (small) GFLOPS   2.17                9.34
Sparse MatVec (large) GFLOPS   1.47                0.91

o Which architecture is suited best depends on:
– Can your application profit from big caches?
– Can your application profit from shared / separate caches?
– Can your application profit from a high clock rate?
– Is your application memory bound anyway?
– … and more factors …


Page 63: CT Advanced OpenMP


Agenda

o Repetition + Additional OpenMP 2.5 Topics

o Tools for OpenMP programming

o OpenMP 3.0 and Tasks

o Example and Case Study
– Fibonacci w/ Tasks
– Parallel Sparse MatVecMult

o OpenMP and the Hardware Architecture

o Summary of second part



Page 64: CT Advanced OpenMP


Summary of second part

o Tool support is a special strength of OpenMP (compared to other multi-threading paradigms): always use a verification tool before putting your parallel code into production!

o The Tasking concept of OpenMP 3.0 makes OpenMP applicable to a much broader range of applications than 2.5.

o Multicore architectures are manifold.

o In order to achieve the full performance potential, programs have to go parallel and respect the (memory) architecture.



Page 65: CT Advanced OpenMP


The End

Thank you for your attention!
