
Week 3 in the OpenHPI course on parallel programming concepts is all about threads and tasks. Find the whole course at http://bit.ly/1l3uD4h.


Page 1: OpenHPI - Parallel Programming Concepts - Week 3

Parallel Programming Concepts OpenHPI Course Week 3 : Shared Memory Parallelism - Programming Unit 3.1: Threads

Dr. Peter Tröger + Teaching Team

Page 2: OpenHPI - Parallel Programming Concepts - Week 3

Week 2 → Week 3

■  Week 2
   □  Parallelism and Concurrency
      ◊  Parallel programming is concurrent programming
      ◊  Concurrent software can leverage parallel hardware
   □  Concurrency Problems - race conditions, deadlocks, livelocks, …
   □  Critical Sections - progress, semaphores, mutexes, …
   □  Monitor Concept - condition variables, wait(), notify(), …
   □  Advanced Concurrency - spinlocks, reader / writer locks, …

■  This week provides a walkthrough of parallel programming approaches for shared memory systems
   □  Not complete, not exhaustive

■  Consider the coding exercises and additional readings


Page 3: OpenHPI - Parallel Programming Concepts - Week 3

Parallel Programming for Shared Memory

■  Processes
   □  Concurrent processes with dedicated memory
   □  Process management by the operating system
   □  Support for explicit memory sharing

■  Light-weight processes (LWP) / threads
   □  Concurrent threads with shared process memory
   □  Thread scheduling by the operating system or a library
   □  Support for thread-local storage, if needed

■  Tasks
   □  Concurrent tasks with shared process memory
   □  Typically the operating system is not involved; a task library maps tasks dynamically to threads
   □  Support for private variables per task


Page 4: OpenHPI - Parallel Programming Concepts - Week 3

Parallel Programming for Shared Memory

■  Different programming models for concurrency in shared memory

■  Processes and threads are mapped to processing elements (cores)

■  Process- and thread-based programming is typically covered in operating system lectures

[Figure: Concurrent processes with explicitly shared memory between their address spaces; concurrent threads sharing the memory of one process; concurrent tasks mapped onto the threads of one process]

Page 5: OpenHPI - Parallel Programming Concepts - Week 3

POSIX Threads (PThreads)

■  Part of the POSIX specification for operating system APIs

■  Implemented by all Unix-compatible systems (Linux, MacOS X, Solaris, …)

■  Functionality (a reader / writer lock and barrier sketch follows below)
   □  Thread lifecycle management
   □  Mutex-based synchronization
   □  Synchronization based on condition variables
   □  Synchronization based on reader / writer locks
   □  Optional support for barriers
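The reader / writer locks and barriers from this list do not appear in the later code examples, so here is a minimal hedged C sketch, assuming a POSIX system with the optional barrier feature, linked with -lpthread (the thread functions are illustrative):

#include <pthread.h>
#include <stdio.h>

pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;
pthread_barrier_t barrier;       /* optional POSIX feature */
int shared_counter = 0;

void *reader(void *arg) {
    pthread_rwlock_rdlock(&rwlock);      /* many readers may hold this */
    printf("reader sees %d\n", shared_counter);
    pthread_rwlock_unlock(&rwlock);
    pthread_barrier_wait(&barrier);      /* wait for all participants */
    return NULL;
}

void *writer(void *arg) {
    pthread_rwlock_wrlock(&rwlock);      /* exclusive access */
    ++shared_counter;
    pthread_rwlock_unlock(&rwlock);
    pthread_barrier_wait(&barrier);
    return NULL;
}

int main(void) {
    pthread_t t[3];
    pthread_barrier_init(&barrier, NULL, 3);   /* three participants */
    pthread_create(&t[0], NULL, writer, NULL);
    pthread_create(&t[1], NULL, reader, NULL);
    pthread_create(&t[2], NULL, reader, NULL);
    for (int i = 0; i < 3; ++i) pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}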

■  The semaphore API is a separate POSIX specification (sem_ prefix)
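A minimal sketch of that separate semaphore API, assuming an unnamed semaphore on Linux (MacOS X only supports named semaphores via sem_open()):

#include <semaphore.h>

sem_t sem;

void semaphore_demo(void)
{
    sem_init(&sem, 0, 1);   /* binary semaphore, not shared across processes */
    sem_wait(&sem);         /* P operation: enter the critical section */
    /* ... critical section ... */
    sem_post(&sem);         /* V operation: leave the critical section */
    sem_destroy(&sem);
}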


[Figure: Concurrent threads sharing the memory of one process]

Page 6: OpenHPI - Parallel Programming Concepts - Week 3

/*************************************************************************
 AUTHOR: Blaise Barney
 *************************************************************************/
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_THREADS 5

void *PrintHello(void *threadid)
{
    long tid = (long)threadid;
    printf("Hello World! It's me, thread #%ld!\n", tid);
    pthread_exit(NULL);
}

int main(int argc, char *argv[])
{
    pthread_t threads[NUM_THREADS];
    int rc;
    long t;
    for (t = 0; t < NUM_THREADS; t++) {
        printf("In main: creating thread %ld\n", t);
        rc = pthread_create(&threads[t], NULL, PrintHello, (void *)t);
        if (rc) {
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(-1);
        }
    }
    /* Last thing that main() should do */
    pthread_exit(NULL);
}

Page 7: OpenHPI - Parallel Programming Concepts - Week 3

POSIX Threads

■  pthread_create()
   □  Runs a given function as a concurrent activity
   □  The operating system scheduler decides upon parallel execution

■  pthread_join()
   □  Blocks the caller until the specific thread terminates
   □  Allows determining the exit code given to pthread_exit()

■  pthread_exit()
   □  Called implicitly on function return
   □  Does not release any resources; cleanup handlers are supported


[Figure: The main thread starts Worker Thread 1 with pthread_create() and waits in pthread_join(); the worker terminates with pthread_exit()]


Page 8: OpenHPI - Parallel Programming Concepts - Week 3

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define NUM_THREADS 4

void *BusyWork(void *t)
{
    int i;
    long tid = (long)t;
    double result = 0.0;
    printf("Thread %ld starting...\n", tid);
    for (i = 0; i < 1000000; i++) {
        result = result + sin(i) * tan(i);
    }
    printf("Thread %ld done. Result = %e\n", tid, result);
    pthread_exit((void *)t);
}

int main(int argc, char *argv[])
{
    pthread_t thread[NUM_THREADS];
    pthread_attr_t attr;
    int rc;
    long t;
    void *status;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
    for (t = 0; t < NUM_THREADS; t++) {
        printf("Main: creating thread %ld\n", t);
        rc = pthread_create(&thread[t], &attr, BusyWork, (void *)t);
        if (rc) {
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(-1);
        }
    }
    pthread_attr_destroy(&attr);
    for (t = 0; t < NUM_THREADS; t++) {
        rc = pthread_join(thread[t], &status);
        if (rc) {
            printf("ERROR; return code from pthread_join() is %d\n", rc);
            exit(-1);
        }
        printf("Main: join with thread %ld, status %ld\n", t, (long)status);
    }
    printf("Main: program completed. Exiting.\n");
    pthread_exit(NULL);
}

Page 9: OpenHPI - Parallel Programming Concepts - Week 3

POSIX Threads

■  The API supports (at least) the mutex and condition variable concepts
   □  Thread synchronization and critical section protection
   □  A condition variable sketch follows below

■  pthread_mutex_init()
   □  Initializes a new mutex, which is unlocked by default
   □  The mutex is a resource of the surrounding process

■  pthread_mutex_lock() and pthread_mutex_trylock()
   □  Lock the mutex for the calling thread
   □  Block / do not block if the mutex is already locked

■  pthread_mutex_unlock()
   □  Releases the mutex
   □  The operating system decides which other thread is woken up
   □  Focus on speed of operation, no deadlock prevention
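The condition variable calls named in the first bullet are not used in the code example on the next slide; here is a minimal hedged C sketch of the usual wait-loop pattern (variable names are illustrative):

#include <pthread.h>

pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
int ready = 0;   /* the condition, protected by m */

void wait_for_ready(void) {
    pthread_mutex_lock(&m);
    while (!ready)                    /* loop guards against spurious wakeups */
        pthread_cond_wait(&cv, &m);   /* atomically unlocks m and blocks */
    /* ... consume the state while holding m ... */
    pthread_mutex_unlock(&m);
}

void signal_ready(void) {
    pthread_mutex_lock(&m);
    ready = 1;
    pthread_cond_signal(&cv);         /* wake one waiting thread */
    pthread_mutex_unlock(&m);
}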


Page 10: OpenHPI - Parallel Programming Concepts - Week 3

#include <pthread.h>
#include <unistd.h>
#include <stdio.h>

#define NUMTHREADS 5

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
int sharedData = 0;
int sharedData2 = 0;

void *theThread(void *parm)
{
    printf("Thread attempts to lock mutex\n");
    pthread_mutex_lock(&mutex);
    printf("Thread got the mutex lock\n");
    ++sharedData;
    --sharedData2;
    printf("Thread unlocks mutex\n");
    pthread_mutex_unlock(&mutex);
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t thread[NUMTHREADS];
    int i;
    printf("Main thread attempts to lock mutex\n");
    pthread_mutex_lock(&mutex);
    printf("Main thread got the mutex lock\n");
    for (i = 0; i < NUMTHREADS; ++i) {   /* create NUMTHREADS threads */
        pthread_create(&thread[i], NULL, theThread, NULL);
    }
    printf("Wait a bit until we are 'done' with the shared data\n");
    sleep(3);
    printf("Main thread unlocks mutex\n");
    pthread_mutex_unlock(&mutex);
    for (i = 0; i < NUMTHREADS; ++i) {
        pthread_join(thread[i], NULL);
    }
    pthread_mutex_destroy(&mutex);
    return 0;
}

Page 11: OpenHPI - Parallel Programming Concepts - Week 3

POSIX API vs. Windows API

11

POSIX                       Windows
pthread_create()            CreateThread()
pthread_exit()              ExitThread()
pthread_cancel()            TerminateThread()
pthread_mutex_init()        CreateMutex()
pthread_mutex_lock()        WaitForSingleObject()
pthread_mutex_trylock()     WaitForSingleObject(hMutex, 0)
Condition variables         Auto-reset events
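For illustration, a hedged C fragment of the non-blocking pattern that the trylock row expresses (the mutex m is assumed to be initialized):

if (pthread_mutex_trylock(&m) == 0) {
    /* acquired without blocking; corresponds to
       WaitForSingleObject(hMutex, 0) returning WAIT_OBJECT_0 */
    /* ... critical section ... */
    pthread_mutex_unlock(&m);
} else {
    /* mutex already locked: do other work instead of waiting */
}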


Page 12: OpenHPI - Parallel Programming Concepts - Week 3

Java, .NET, C++, PHP, …

■  High-level languages also offer threading support

■  Interpreted / JIT-compiled languages
   □  Rich APIs for thread management and shared data structures
   □  Examples: the Java Runnable interface, .NET System.Threading
   □  Today mostly a 1:1 mapping of high-level threads to operating system threads (native threads)

■  Threads as part of the language definition (C++11) vs. threads as operating system functionality (PThreads)


#include <thread>
#include <iostream>

void write_message(std::string const& message) {
    std::cout << message;
}

int main() {
    std::thread t(write_message, "hello world from std::thread\n");
    write_message("hello world from main\n");
    t.join();
}

Page 13: OpenHPI - Parallel Programming Concepts - Week 3

Parallel Programming Concepts OpenHPI Course Week 3 : Shared Memory Parallelism - Programming Unit 3.2: Tasks with OpenMP

Dr. Peter Tröger + Teaching Team

Page 14: OpenHPI - Parallel Programming Concepts - Week 3

Parallel Programming for Shared Memory

■  Different programming models for concurrency with shared memory

■  Processes and threads are mapped to processing elements (cores)

■  The task model supports more fine-grained parallelization than native threads

[Figure: Concurrent processes, concurrent threads, and concurrent tasks, as in Unit 3.1]

Page 15: OpenHPI - Parallel Programming Concepts - Week 3

OpenMP

■  Language extension for C/C++ and Fortran
   □  Versions for other languages are available

■  Combination of compiler support and a run-time library
   □  Special compiler instructions in the code ("pragmas")
   □  The developer expresses the intended parallelization
   □  The result is a binary that relies on OpenMP functionality

■  The OpenMP library is responsible for thread management
   □  Transparent for application code
   □  Additional configuration with environment variables
      ◊  OMP_NUM_THREADS: upper limit for the threads being used
      ◊  OMP_SCHEDULE: scheduling type for parallel activities

■  State-of-the-art for portable parallel programming in C


Page 16: OpenHPI - Parallel Programming Concepts - Week 3

OpenMP

■  Programming with the fork-join model
   □  The master thread forks into the declared tasks
   □  The runtime environment may run them in parallel, based on a dynamic mapping to threads from a pool
   □  Worker tasks meet at a barrier before finalization (join)

[Figure: fork-join model - Wikipedia]

Page 17: OpenHPI - Parallel Programming Concepts - Week 3

OpenMP

■  Parallel region
   □  Parallel tasks are defined in a dedicated code block, marked by #pragma omp parallel
   □  The block should have only one entry and one exit point
   □  Implicit barrier at the beginning and end of the block

[Figure: parallel region - Wikipedia]

Page 18: OpenHPI - Parallel Programming Concepts - Week 3

Parallel Region

■  The thread encountering the region generates implicit tasks

■  Task execution may suspend at a scheduling point:
   □  At implicit barrier regions or barrier primitives
   □  At task / taskwait constructs
   □  At the end of a task region


#include <omp.h>
#include <stdio.h>

int main(int argc, char *const argv[])
{
    #pragma omp parallel
    printf("Hello from thread %d, nthreads %d\n",
           omp_get_thread_num(), omp_get_num_threads());
    return 0;
}

>> gcc -fopenmp -o omp omp.c


Page 19: OpenHPI - Parallel Programming Concepts - Week 3

Work Sharing

■  Possibilities for creating tasks inside a parallel region (a sketch of sections and task follows below)
   □  omp sections - defines code blocks dividable among threads
      ◊  Implicit barrier at the end
   □  omp for - automatically divides loop iterations into tasks
      ◊  Implicit barrier at the end
   □  omp single / master - denotes a task executed only by the first arriving thread, or by the master thread
      ◊  Implicit barrier at the end (for single; master has none)
      ◊  Intended for critical sections
   □  omp task - explicitly defines a task

■  Clause combinations are possible: #pragma omp parallel for
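A hedged sketch of the two constructs not shown elsewhere in this week's examples, omp sections and omp task (assuming GCC with -fopenmp; the helper functions are illustrative):

#include <omp.h>

void prepare_input(void);   /* illustrative helpers, not from the slides */
void render_output(void);
void process(int i);

void work_sharing_demo(void)
{
    #pragma omp parallel
    {
        #pragma omp sections    /* each section may run on a different thread */
        {
            #pragma omp section
            prepare_input();
            #pragma omp section
            render_output();
        }                       /* implicit barrier at the end of sections */

        #pragma omp single      /* only the first arriving thread runs this */
        {
            for (int i = 0; i < 8; i++) {
                #pragma omp task firstprivate(i)   /* explicitly defined task */
                process(i);
            }
        }                       /* implicit barrier: all tasks are done here */
    }
}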


Page 20: OpenHPI - Parallel Programming Concepts - Week 3

Loop Parallelization

■  omp for: parallel execution of iteration chunks

■  Implications for exception handling, break-out calls, and the continue primitive

■  The mapping of threads to iteration-chunk tasks is controlled by the schedule clause

■  Large chunks are good for caching and overhead avoidance

■  Small chunks are good for load balancing


#include <math.h>

void compute(int n, float *a, float *b, float *c, float *y, float *z)
{
    int i;
    #pragma omp parallel
    {
        #pragma omp for schedule(static)   /* implicit barrier protects z below */
        for (i = 0; i < n; i++) {
            c[i] = (a[i] + b[i]) / 2.0;
            z[i] = sqrt(c[i]);
        }
        #pragma omp for schedule(static) nowait   /* no later sharing, nowait is safe */
        for (i = 1; i < n; i++) {
            y[i] = z[i-1] + a[i];   /* cross-iteration read, hence the barrier above */
        }
    }
}

#pragma omp parallel for private(value)
for (i = 0; i < n; i++) {
    value = some_complex_function(a[i]);
    #pragma omp critical
    sum = sum + value;
}


Page 21: OpenHPI - Parallel Programming Concepts - Week 3

Loop Parallelization

■  schedule(static, [chunk])
   □  Contiguous ranges of iterations (chunks) of equal size
   □  Low overhead, round-robin assignment to threads, static scheduling
   □  Default is one chunk per thread

■  schedule(dynamic, [chunk])
   □  Threads grab iterations or chunks
   □  Higher overhead, but good for unbalanced work loads

■  schedule(guided, [chunk])
   □  Dynamic schedule with shrinking ranges per step
   □  Starts with a large block, until the minimum chunk size is reached
   □  Good for computations with increasing iteration length (e.g. a prime sieve test)
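A hedged usage sketch of the three clauses (the chunk sizes and the helper work() are illustrative):

void work(int i);   /* illustrative helper, not from the slides */

void schedule_demo(int n)
{
    /* equal-sized contiguous chunks, assigned round-robin up front */
    #pragma omp parallel for schedule(static, 100)
    for (int i = 0; i < n; i++) work(i);

    /* idle threads grab the next chunk of 4 iterations at runtime */
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < n; i++) work(i);

    /* chunk size starts large and shrinks down to a minimum of 4 */
    #pragma omp parallel for schedule(guided, 4)
    for (int i = 0; i < n; i++) work(i);
}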


Page 22: OpenHPI - Parallel Programming Concepts - Week 3

Data Sharing

■  shared variable: the name provides access in all tasks
   □  Only a tag for the runtime, no critical section enforcement

■  private variable: clones the variable in each task
   □  Results in one data copy per task
   □  firstprivate: initialized with the last value before the region
   □  lastprivate: carries the result of the last loop cycle or of the lexically last section directive
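A minimal hedged sketch of firstprivate and lastprivate, which the example below does not use (variable names are illustrative):

void data_sharing_demo(void)
{
    int x = 42;
    int last = 0;
    /* every thread starts with its own x initialized to 42 (firstprivate);
       after the loop, last carries the value of the final iteration (lastprivate) */
    #pragma omp parallel for firstprivate(x) lastprivate(last)
    for (int i = 0; i < 100; i++) {
        last = x + i;
    }
    /* here last == 42 + 99 == 141 */
}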


#include <omp.h>
#include <stdio.h>

int main(void)
{
    int nthreads, tid;
    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        printf("Hello World from thread = %d\n", tid);
        if (tid == 0) {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    }
    return 0;
}


Page 23: OpenHPI - Parallel Programming Concepts - Week 3

Memory Model

■  OpenMP considers the memory wall problem
   □  Memory latency is hidden by deferring read / write operations
   □  The task view on shared memory is not always consistent

■  Example: keeping a loop variable in a register for efficiency
   □  Makes the loop variable a thread-specific variable
   □  Tasks in other threads see outdated values in memory

■  But: sometimes a consistent view is demanded
   □  flush operation - finalizes all read / write operations and makes the view on shared memory consistent
   □  Implicit flush on different occasions, such as barriers
   □  Complicated issue; check the documentation
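A minimal hedged sketch of the flush directive in the classic flag-passing pattern (names are illustrative; a modern OpenMP version would combine flush with atomic accesses to avoid the data race on flag):

int data = 0;
int flag = 0;

void producer(void)
{
    data = 42;
    #pragma omp flush(data)      /* commit data to memory before the flag */
    flag = 1;
    #pragma omp flush(flag)
}

void consumer(void)
{
    int ready = 0;
    while (!ready) {
        #pragma omp flush(flag)  /* force a fresh read of flag */
        ready = flag;
    }
    #pragma omp flush(data)      /* make the producer's data visible */
    /* ... use data ... */
}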


Page 24: OpenHPI - Parallel Programming Concepts - Week 3

Task Scheduling

■  Classical task scheduling with a central queue
   □  All worker threads fetch tasks from a central queue
   □  Scalability issue with increasing thread (and core) count

■  Work stealing in OpenMP (and other libraries)
   □  One task queue per thread
   □  An idling thread "steals" tasks from another thread
   □  Independent from thread scheduling
   □  Only mutual synchronization
   □  No central queue

[Figure: Work stealing - each thread pushes new tasks to its own task queue and takes its next task from there; an idle thread steals tasks from another thread's queue]

Page 25: OpenHPI - Parallel Programming Concepts - Week 3

Parallel Programming Concepts OpenHPI Course Week 3 : Shared Memory Parallelism - Programming Unit 3.3: Beyond OpenMP

Dr. Peter Tröger + Teaching Team

Page 26: OpenHPI - Parallel Programming Concepts - Week 3

Cilk

■  The C language combined with several new keywords
   □  A true language extension, instead of new compiler pragmas
   □  Developed at MIT since 1994
   □  Initial commercial version Cilk++ with C / C++ support

■  Since 2010 offered by Intel as Cilk Plus
   □  Official language specification to foster other implementations
   □  Support for Windows, Linux, and MacOS X

■  Basic concept of serialization
   □  Cilk keywords may be replaced by empty operations
   □  This leads to non-concurrent code
   □  The promise is that the code semantics remain the same


Page 27: OpenHPI - Parallel Programming Concepts - Week 3

Intel Cilk Plus

■  Three keywords express potential parallelism

■  cilk_spawn: asynchronous function call
   □  The runtime decides; spawning is not mandated

■  cilk_sync: wait until all spawned calls are completed
   □  Barrier for cilk_spawn activity

■  cilk_for: allows loop iterations to be performed in parallel
   □  The runtime decides; parallelization is not mandated


cilk_for (int i = 0; i < 8; ++i) {
    do_work(i);
}

for (int i = 0; i < 8; ++i) {
    cilk_spawn do_work(i);
}
cilk_sync;

/* Serialization: with the keywords removed, the sequential version remains valid */
for (int i = 0; i < 8; ++i) {
    do_work(i);
}

Page 28: OpenHPI - Parallel Programming Concepts - Week 3

Intel Cilk Plus

■  Cilk supports the high-level expression of array operations
   □  Gives the runtime a chance to parallelize the work
   □  Intended for SIMD-style operations without any ordering constraints

■  New operator [:]
   □  Specifies data parallelism on an array
   □  array-expression[lower-bound : length : stride]
   □  Multi-dimensional sections are supported: a[:][:]

■  Short-hand description for complex loops
   □  A[:] = 5;        is equivalent to:  for (i = 0; i < 10; i++) A[i] = 5;
   □  A[0:n] = 5;
   □  A[0:5:2] = 5;    is equivalent to:  for (i = 0; i < 10; i += 2) A[i] = 5;
   □  A[:] = B[:];
   □  A[:] = B[:] + 5;
   □  D[:] = A[:] + B[:];
   □  func(A[:]);


Page 29: OpenHPI - Parallel Programming Concepts - Week 3

Intel Threading Building Blocks (TBB)

■  Portable C++ library, a toolkit for different operating systems

■  Also available as an open source version

■  Complements basic OpenMP / Cilk features
   □  Loop parallelization, synchronization, explicit tasks

■  High-level concurrent containers, recursion support
   □  hash map, queue, vector, set

■  High-level parallel operations
   □  Prefix scan, sorting, data-flow pipelining, reduction, …
   □  Scalable library implementation, based on tasks

■  Unfair scheduling approach
   □  Considers data locality, optimizes for cache utilization

■  Comparable: Microsoft C++ Concurrency Runtime


Page 30: OpenHPI - Parallel Programming Concepts - Week 3

class FibTask : public task {
public:
    const long n;
    long* const sum;
    FibTask(long n_, long* sum_) : n(n_), sum(sum_) {}
    task* execute() {   // Overrides task::execute
        long x, y;
        FibTask& a = *new(allocate_child()) FibTask(n - 1, &x);
        FibTask& b = *new(allocate_child()) FibTask(n - 2, &y);
        // Set ref_count to "two children plus one for wait".
        set_ref_count(3);
        // Start b running.
        spawn(b);
        // Start a running and wait for all children (a and b).
        spawn_and_wait_for_all(a);
        // Do the sum.
        *sum = x + y;
        return NULL;
    }
};

[intel.com]

fib(n) = fib(n-1) + fib(n-2)

Page 31: OpenHPI - Parallel Programming Concepts - Week 3

#include "tbb/compat/thread"
#include "tbb/tbb_allocator.h"   // zero_allocator defined here
#include "tbb/atomic.h"
#include "tbb/concurrent_vector.h"
using namespace tbb;

typedef concurrent_vector<atomic<Foo*>, zero_allocator<atomic<Foo*> > > FooVector;

Foo* WaitForElement(const FooVector& v, size_t i) {
    // Wait for ith element to be allocated
    while (i >= v.size())
        std::this_thread::yield();
    // Wait for ith element to be constructed
    while (v[i] == NULL)
        std::this_thread::yield();
    return v[i];
}

[intel.com]

Waiting for an element

Page 32: OpenHPI - Parallel Programming Concepts - Week 3

Task-Based Concurrency in Java

■  Major concepts introduced with Java 5

■  Abstraction of task management with Executors
   □  java.util.concurrent.Executor
   □  An implementing object provides an execute() method
   □  Can execute submitted Runnable tasks
   □  No assumption on where the task runs; typically in a managed thread pool
   □  ThreadPoolExecutor is provided by the class library

■  java.util.concurrent.ExecutorService
   □  Additional submit() function, which returns a Future object
   □  Methods for submitting large collections of Callables


Page 33: OpenHPI - Parallel Programming Concepts - Week 3

High-Level Concurrency

[Figure: High-level concurrency APIs - Microsoft Parallel Patterns Library and java.util.concurrent]

Page 34: OpenHPI - Parallel Programming Concepts - Week 3

Functional Programming

■  A contrary paradigm to imperative programming
   □  A program is a large set of functions
   □  All these functions just map input to output
   □  Execution is treated as a collection of function evaluations

■  Foundations in the lambda calculus (1930s) and Lisp (late 1950s)

■  Side-effect-free computation when functions have no local state
   □  A function result depends only on the input, not on shared data
   □  The order of function evaluation becomes irrelevant
   □  Automated parallelization can freely schedule the work
   □  Race conditions become less probable

■  Trend to add functional programming paradigms to imperative languages (anonymous functions, filter, map, …)


Page 35: OpenHPI - Parallel Programming Concepts - Week 3


Imperative to Functional

alert("get the lobster");
PutInPot("lobster");
PutInPot("water");

alert("get the chicken");
BoomBoom("chicken");
BoomBoom("coconut");

Optimize:

function Cook(i1, i2, f) {
    alert("get the " + i1);
    f(i1);
    f(i2);
}

Cook("lobster", "water", PutInPot);
Cook("chicken", "coconut", BoomBoom);

http://www.joelonsoftware.com/items/2006/08/01.html

•  Higher-order functions
•  Functions as argument or return value
•  Execution of the parameter function may be transparently parallelized

Page 36: OpenHPI - Parallel Programming Concepts - Week 3


Imperative to Functional

alert("get the lobster");
PutInPot("lobster");
PutInPot("water");

alert("get the chicken");
BoomBoom("chicken");
BoomBoom("coconut");

Optimize:

function Cook(i1, i2, f) {
    alert("get the " + i1);
    f(i1);
    f(i2);
}

Cook("lobster", "water",
    function(x) { alert("pot " + x); });
Cook("chicken", "coconut",
    function(x) { alert("boom " + x); });

http://www.joelonsoftware.com/items/2006/08/01.html

•  Anonymous functions
•  Also called lambda functions or function literals
•  A convenient tool when higher-order functions are supported

Page 37: OpenHPI - Parallel Programming Concepts - Week 3

Functional Programming

■  Higher-order functions: functions as argument or return value

■  Pure functions: no memory or I/O side effects
   □  Constant result with side-effect-free parameters
   □  All functions (with available input) can run in parallel

■  Meanwhile many functional languages are (again) available
   □  JVM-based: Clojure, Scala (parts of it), …
   □  Common Lisp, Erlang, F#, Haskell, ML, OCaml, Scheme, …

■  Functional constructs (map, reduce, filter, folding, iterators, immutable variables, …) exist in all popular languages (see week 6)

■  A perfect foundation for implicit parallelism
   □  Instead of spawning tasks / threads, let the runtime decide
   □  Demands discipline to produce pure functions


Page 38: OpenHPI - Parallel Programming Concepts - Week 3

Parallel Programming Concepts OpenHPI Course Week 3 : Shared Memory Parallelism - Programming Unit 3.4: Scala

Dr. Peter Tröger + Teaching Team

Page 39: OpenHPI - Parallel Programming Concepts - Week 3

Scala – “Scalable Language”

■  An example of combining OO and functional programming
   □  Expressions, statements, and blocks as in Java
   □  Every value is an object; every operation is a method call
   □  Functions as a first-class concept
   □  The programmer chooses the concurrency syntax
   □  Task-based parallelism is supported by the language

■  Compiles to JVM (or .NET) byte code

■  Most language constructs are library functions

■  Interacts with the class library of the runtime environment

■  Twitter moved some parts from Ruby to Scala in 2009


object HelloWorld extends App {
  println("Hello, world!")
}


Page 40: OpenHPI - Parallel Programming Concepts - Week 3

Scala Basics

■  All data types are objects; all operations are methods

■  Operator / infix notation
   □  7.5 - 1.5
   □  "hello" + "world"

■  Object notation
   □  ("hello").+("world")

■  Implicit conversions, some given by default
   □  ("hello") * 5
   □  0.until(3), i.e. 0 until 3
   □  (1 to 4).foreach(println)

■  Type inference
   □  var name = "Foo"

■  Immutable variables with val
   □  val name = "Scala"


Page 41: OpenHPI - Parallel Programming Concepts - Week 3

Functions in Scala

■  Functions as first-class values

■  They can be passed as parameters or used as results

■  () is the return value for procedures

def sumUpRange(f: Int => Int, a: Int, b: Int): Int =
  if (a > b) 0 else f(a) + sumUpRange(f, a + 1, b)

def id(x: Int): Int = x
def sumUpIntRange(a: Int, b: Int): Int = sumUpRange(id, a, b)

def square(x: Int): Int = x * x
def sumSquareR(a: Int, b: Int): Int = sumUpRange(square, a, b)

■  Anonymous functions, type deduction

def sumUpSquares(a: Int, b: Int): Int = sumUpRange(x => x * x, a, b)


Page 42: OpenHPI - Parallel Programming Concepts - Week 3

Example: Quicksort

■  Recursive implementation of Quicksort

■  Similar to other imperative languages

■  swap as a procedure with the empty result ()

■  Functions in functions

■  Read-only value definition with val

def sort(xs: Array[Int]) {
  def swap(i: Int, j: Int) {
    val t = xs(i); xs(i) = xs(j); xs(j) = t; ()
  }
  def sort_recursive(l: Int, r: Int) {
    val pivot = xs((l + r) / 2)
    var i = l; var j = r
    while (i <= j) {
      while (xs(i) < pivot) i += 1
      while (xs(j) > pivot) j -= 1
      if (i <= j) {
        swap(i, j); i += 1; j -= 1
      }
    }
    if (l < j) sort_recursive(l, j)
    if (i < r) sort_recursive(i, r)
  }
  sort_recursive(0, xs.length - 1)
}


Page 43: OpenHPI - Parallel Programming Concepts - Week 3

Example: Quicksort

■  Functional style (same complexity, higher memory consumption)
   □  Returns an empty / single-element array as already sorted
   □  Partitions array elements according to the pivot element
   □  The higher-order function filter takes a predicate function ("pivot > x") as argument and applies it for filtering
   □  Sorts the sub-arrays with the predefined sort function


def sort(xs: Array[Int]): Array[Int] = {
  if (xs.length <= 1)
    xs
  else {
    val pivot = xs(xs.length / 2)
    Array.concat(sort(xs filter (pivot >)),
                 xs filter (pivot ==),
                 sort(xs filter (pivot <)))
  }
}

Page 44: OpenHPI - Parallel Programming Concepts - Week 3

Concurrent Programming with Scala

■  The implicit superclass is scala.AnyRef, which provides the typical monitor functions:
   scala> classOf[AnyRef].getMethods.foreach(println)
   def wait()
   def wait(msec: Long)
   def notify()
   def notifyAll()

■  synchronized function; the argument expression forms the critical section
   def synchronized[A](e: => A): A

■  Synchronized variable with put, blocking get, and unset
   val v = new scala.concurrent.SyncVar()

■  Futures, reader / writer locks, semaphores, ...
   val x = future(someLengthyComputation)
   anotherLengthyComputation
   val y = f(x()) + g(x())

■  Explicit parallelism through spawn(expr) and the Actor concept (see week 5)


Page 45: OpenHPI - Parallel Programming Concepts - Week 3

Parallel Collections

scala> var sum = 0
sum: Int = 0
scala> val list = (1 to 1000).toList.par
list: scala.collection.parallel.immutable.ParSeq[Int] = ParVector(1, 2, 3,…
scala> list.foreach(sum += _); sum
res01: Int = 467766

scala> var sum = 0
sum: Int = 0
scala> list.foreach(sum += _); sum
res02: Int = 457073

scala> var sum = 0
sum: Int = 0
scala> list.foreach(sum += _); sum
res03: Int = 468520


■  foreach on a parallel collection data type is automatically parallelized

■  Rich support for different data structures

■  Example: parallel collections can still lead to race conditions
   □  The "+=" operator reads and writes the same variable, which is why the three runs above produce three different sums

Page 46: OpenHPI - Parallel Programming Concepts - Week 3

Parallel Programming Concepts OpenHPI Course Week 3 : Shared Memory Parallelism - Programming Unit 3.5: Partitioned Global Address Space

Dr. Peter Tröger + Teaching Team

Page 47: OpenHPI - Parallel Programming Concepts - Week 3

Memory Model

■  Concurrency as a first-class language citizen
   □  Demands a memory model for the language
      ◊  When is a written value visible?
      ◊  Example: the OpenMP flush directive
   □  Leads to 'promises' about the memory access behavior
   □  Beyond them, the compiler and ILP can optimize the code

■  A proper memory model brings predictable concurrency

■  Examples
   □  X86 processor machine code specification (native code)
   □  C++11 language specification (native code)
   □  C# language specification
   □  Java memory model specification in JSR-133

■  Is this enough?


Page 48: OpenHPI - Parallel Programming Concepts - Week 3

NUMA

■  Eight cores on 2 sockets in an SMP system

■  Memory controllers and the chip interconnect realize a single memory address space for the software

[Figure: Two sockets, each with four cores and private L1 / L2 caches, a shared L3 cache, and a memory controller with local RAM; the sockets are coupled by a chip interconnect]

Page 49: OpenHPI - Parallel Programming Concepts - Week 3

PGAS Languages

■  Non-uniform memory architectures (NUMA) became the default

■  But: the understanding of memory in programming is flat
   □  All variables are equal in access time
   □  Considering the memory hierarchy is low-level coding (e.g. cache-aware programming)

■  Partitioned global address space (PGAS) approach
   □  Driven by the high-performance computing community
   □  A modern approach for large-scale NUMA
   □  Explicit notion of a memory partition per processor
      ◊  Data is designated as local (near) or global (possibly far)
      ◊  The programmer is aware of NUMA nodes
   □  Performance optimization for deep memory hierarchies


Page 50: OpenHPI - Parallel Programming Concepts - Week 3

PGAS Languages and Libraries

■  PGAS languages
   □  Unified Parallel C (ANSI C)
   □  Co-Array Fortran / Fortress (F90)
   □  Titanium (Java), Chapel (Cray), X10 (IBM), …
   □  All research; no wide-spread solution at industry level

■  Core data management functionality can be re-used as a library
   □  Global Arrays (GA) Toolkit
   □  Global-Address Space Networking (GASNet)
      ◊  Used by many PGAS languages - UPC, Co-Array Fortran, Titanium, Chapel
   □  Aggregate Remote Memory Copy Interface (ARMCI)
   □  Kernel Lattice Parallelism (KeLP)


Page 51: OpenHPI - Parallel Programming Concepts - Week 3

X10

■  Parallel object-oriented PGAS language by IBM
   □  A Java derivate; compiles to C++ or pure Java code
   □  Different binaries can interact through a common runtime
   □  Transport: shared memory, TCP/IP, MPI, CUDA, …
   □  Linux, MacOS X, Windows, AIX; X86, Power
   □  Full developer support with an Eclipse environment

■  Fork-join execution model

■  One application instance runs at a fixed number of places
   □  Each place represents a NUMA node
   □  Distinguishes between place-local and global data
   □  The main() method runs automatically at place 0
   □  Each place has a private copy of static variables


Page 52: OpenHPI - Parallel Programming Concepts - Week 3

X10

[Figure: APGAS in X10 - places 0 … N, each with activities and a local heap; place-shifting operations at(p) S and at(p) e; distributed heap with GlobalRef[T] and PlaceLocalHandle[T]; task parallelism with async S and finish S; concurrency control within a place with when(c) S and atomic S]

■  Parallel tasks, each operating in one place of the PGAS
   □  Direct variable access only in the local place

■  The implementation strategy is flexible
   □  One operating system process per place, managing a thread pool
   □  Work-stealing scheduler

[IBM]

Page 53: OpenHPI - Parallel Programming Concepts - Week 3

X10

■  async S
   □  Creates a new task that executes S and returns immediately
   □  S may reference all variables in the enclosing block
   □  The runtime chooses a (NUMA) place for execution

■  finish S
   □  Executes S and waits for all transitively spawned tasks (barrier)

[IBM]

Page 54: OpenHPI - Parallel Programming Concepts - Week 3

Example

public class Fib {
    public static def fib(n:int) {
        if (n <= 2) return 1;
        val f1:int;
        val f2:int;
        finish {
            async { f1 = fib(n-1); }
            f2 = fib(n-2);
        }
        return f1 + f2;
    }
    public static def main(args:Array[String](1)) {
        val n = (args.size > 0) ? int.parse(args(0)) : 10;
        Console.OUT.println("Computing Fib("+n+")");
        val f = fib(n);
        Console.OUT.println("Fib("+n+") = "+f);
    }
}


Page 55: OpenHPI - Parallel Programming Concepts - Week 3

X10

■  atomic S
   □  Executes S atomically, with respect to all other atomic blocks
   □  S must access only local data

■  when(c) S
   □  Suspends the current task until c holds, then executes S atomically

Page 56: OpenHPI - Parallel Programming Concepts - Week 3

class Buffer[T]{T isref, T haszero} {
    protected var date:T = null;
    public def send(v:T){v != null} {
        when (date == null) { date = v; }
    }
    public def receive() {
        when (date != null) {
            val v = date;
            date = null;
            return v;
        }
    }
}

class Account {
    public var value:Int;
    def transfer(src: Account, v:Int) {
        atomic {
            src.value -= v;
            this.value += v;
        }
    }
}

Page 57: OpenHPI - Parallel Programming Concepts - Week 3

Tasks Can Move

■  at(p) S - Executes statement S at place p, blocking the current task

■  at(p) e - Evaluates expression e at place p and returns the result

■  at(p) async S
   □  Creates a new task at p to run S and returns immediately

Page 58: OpenHPI - Parallel Programming Concepts - Week 3

Summary: Week 3

■  A short overview of shared memory parallel programming ideas

■  Different levels of abstraction
   □  Process model, thread model, task model

■  Threads for concurrency and parallelization
   □  Standardized POSIX interface
   □  Java / .NET concurrency functionality

■  Tasks for concurrency and parallelization
   □  OpenMP for C / C++, Java, .NET, Cilk, …

■  Functional language constructs for implicit parallelism

■  PGAS languages for NUMA optimization


Specialized languages help the programmer to achieve speedup. What about accordingly specialized parallel hardware?
