Parallel code for multicore systems
An overview of programming models
13.10.2011 1 Multicore Briefing - parallel programming models
Overview
There are just too many models, and remember that the compiler will not help you.
- Threading models for multicore processors: POSIX threads, Intel Threading Building Blocks, OpenMP
- Threading models for GPGPUs: CUDA, OpenCL
- Parallel programming for distributed memory: MPI

Overall goal: Exploit the parallelism built into the hardware!
POSIX threads
Why threads for parallel programs?
- Thread == lightweight process: an independent instruction stream
- In simulation we usually run one thread per (virtual or physical) core, but more is possible
- New processes are expensive to create (via fork()); threads share all the data of a process, so they are cheap
- Inter-process communication is slow and cumbersome; shared memory between threads provides an easy way to communicate and synchronize
- A threading model puts threads to use by making them accessible to the programmer, either explicitly or wrapped in some parallel paradigm
POSIX threads example: Matrix-vector multiply with 100 threads
#include <pthread.h>

static double a[100][100], b[100], c[100];

static void *mult(void *cp);

int main(int argc, char *argv[])
{
    pthread_t tids[100];
    ...
    for (int i = 0; i < 100; i++)
        pthread_create(tids + i, NULL, mult, (void *)(c + i));
    for (int i = 0; i < 100; i++)
        pthread_join(tids[i], NULL);
    ...
}

/* each thread computes one row of the matrix-vector product;
   the row index is recovered from the pointer into c */
static void *mult(void *cp)
{
    int i = (double *)cp - c;
    double sum = 0;
    for (int j = 0; j < 100; j++)
        sum += a[i][j] * b[j];
    c[i] = sum;
    return NULL;
}
Adapted from material by J. Kleinder
There are no shared resources here! (not really: a, b and c are shared by all threads, but each element of c is written by exactly one thread)
POSIX threads pros and cons
Pros:
- Most basic threading interface
- Straightforward, manageable API
- Dynamic creation and destruction of threads
- Reasonable synchronization primitives
- Full execution control

Cons:
- Most basic threading interface
- Higher-level functions (reductions, synchronization, work distribution, task queueing) must be done by hand
- Only available with a C API
- Only available on (near-)POSIX-compliant OSs
- The compiler has no clue about threads
Intel Threading Building Blocks (TBB)
Intel Threading Building Blocks (TBB)
- Introduced by Intel in 2006
- C++ threading library
- Uses POSIX threads under the hood
- Programmer works with tasks rather than threads
- Task-stealing model
- Parallel C++ containers
- Commercial and open-source variants exist
A simple parallel loop in TBB: Apply Foo() to every element of an array
#include "tbb/tbb.h"
using namespace tbb;

class ApplyFoo {
    float *const my_a;
public:
    void operator()( const blocked_range<size_t>& r ) const {
        float *a = my_a;
        for( size_t i = r.begin(); i != r.end(); ++i )
            Foo(a[i]);
    }
    ApplyFoo( float a[] ) : my_a(a) {}
};

void ParallelApplyFoo( float a[], size_t n ) {
    parallel_for( blocked_range<size_t>(0,n), ApplyFoo(a) );
}
Adapted from the Intel TBB tutorial
TBB pros and cons
Pros:
- High-level programming model
- Task concept is often more natural for real-world problems than the thread concept
- Built-in parallel (thread-safe) containers
- Built-in work distribution (configurable, but not too finely)
- Available for Linux, Windows, MacOS

Cons:
- C++ only
- Mapping of threads to resources (cores) is not part of the model
- Number-of-threads concept only vaguely implemented
- Dynamic work sharing and task stealing introduce variability, difficult to optimize under ccNUMA constraints
- The compiler has no clue about threads
OpenMP
Parallel Programming with OpenMP
- Easy and portable parallel programming of shared-memory computers: OpenMP
- Standardized set of compiler directives & library functions: http://www.openmp.org/
- Fortran, C and C++ interfaces
- Supported by most/all commercial compilers, GNU starting with 4.2
- Few free tools are available
- An OpenMP program can be written to compile and execute on a single-processor machine just by ignoring the directives
Shared Memory Model used by OpenMP

[Figure: several threads (T), each with its own private data, all accessing a globally shared memory]

- Threads access globally shared memory
- Data can be shared or private
  - shared data is available to all threads (in principle)
  - private data is available only to the thread that owns it
- Central concept of OpenMP programming: threads
OpenMP Program Execution: Fork and Join

- Program start: only the master thread runs
- Parallel region: a team of worker threads is generated (fork)
- Threads synchronize when leaving the parallel region (join)
- Only the master executes the sequential part; worker threads usually sleep
- Task and data distribution via directives
- Usually optimal: one thread per core
Example: Numerical integration in OpenMP

Compute pi = integral of 4/(1+x*x) over [0,1] by the midpoint rule.

// function to integrate
double f(double x)
{
  return 4.0/(1.0+x*x);
}

Sequential execution:

w=1.0/n; sum=0.0;
for(i=1; i<=n; i++)
  sum += f((i-0.5)*w);
pi = w*sum;

Concurrent execution by a team of threads, with worksharing among threads:

...
pi=0.0; w=1.0/n;
#pragma omp parallel private(x,sum)
{
  sum=0.0;
  #pragma omp for
  for(i=1; i<=n; i++) {
    x = (i-0.5)*w;
    sum += f(x);
  }
  #pragma omp critical
  pi = pi + w*sum;
}
OpenMP pros and cons
Pros:
- High-level programming model
- Available for Fortran, C, C++
- Ideal for data parallelism, some support for task parallelism
- Built-in work distribution
- Directive concept is part of the language
- Good support for incremental parallelization

Cons:
- Mapping of threads to resources (cores) is not part of the model
- OpenMP parallelization may interfere with compiler optimization
- Parallel data structures are not part of the model
- Only limited synchronization facilities
- Model revolves around the parallel region concept
CUDA
NVIDIA CUDA
- Compute Unified Device Architecture
- Hardware architecture and software environment
- Convenient programming model for using NVIDIA GPUs as general-purpose compute devices
- Implements the Single Instruction Multiple Threads (SIMT) approach
- Programming model:
  - Accelerator style: main program runs on the host CPU, kernels are offloaded to the GPU
  - Unified binary for host + device
  - Supports multiple GPUs
  - Data transfer to/from the device is explicit
  - Kernel execution may be asynchronous to CPU code
  - Latest devices (Fermi) allow multiple concurrent kernels
[Figure: host connected to GPU #1 and GPU #2 via PCIe link]
A simple CUDA example: Host code
// allocate memory on host
h_A = (float *)malloc(DATA_SZ);
h_C = (float *)malloc(DATA_SZ);
h_C_GPU = (float *)malloc(RESULT_SZ);

// allocate memory on CUDA device
cudaMalloc((void **)&d_A, DATA_SZ);
cudaMalloc((void **)&d_C, RESULT_SZ);

// copy data to GPU memory for further processing
cudaMemcpy(d_A, h_A, DATA_SZ, cudaMemcpyHostToDevice);
cudaMemcpy(d_C, h_C, RESULT_SZ, cudaMemcpyHostToDevice);
cudaThreadSynchronize();

// kernel call with <<<grid, block>>> launch configuration (names illustrative)
do_work_on_gpu<<<numBlocks, threadsPerBlock>>>(d_C, d_A, DATA_N);
cudaThreadSynchronize();

// copy result back to host
cudaMemcpy(h_C_GPU, d_C, RESULT_SZ, cudaMemcpyDeviceToHost);
A simple CUDA example: CUDA kernel
__global__ void do_work_on_gpu(float *d_C, float *d_A, int elementN)
{
    // grid-stride loop: each thread processes elements
    // spaced blockDim.x * gridDim.x apart
    for (int pos = blockIdx.x * blockDim.x + threadIdx.x;
         pos < elementN;
         pos += blockDim.x * gridDim.x) {
        d_C[pos] = 5.0f * d_A[pos];
    }
    __syncthreads();
}
CUDA pros and cons
Pros:
- Relatively straightforward programming model
- Low-level programming, explicit data management
- Compatible with many NVIDIA GPUs; code usually runs without changes
- Available for C, but wrappers for many languages exist, including scripting languages
- Directive-based compiler extensions available (e.g., PGI)
- Potential for overlapping GPU computation with CPU tasks

Cons:
- Restricted to NVIDIA GPUs
  - No support for multicore processors
  - No support for AMD GPUs
- Low-level programming, explicit data management
- Powerful tools are just beginning to emerge
- Largely manual work distribution
- Not an open standard
OpenCL
OpenCL
- Open Computing Language
- Open standard
- Convenient programming model for using any kind of accelerator: GPGPUs, multicore CPUs, ...
- Programming model similar to CUDA, but more flexible
- Pure kernel code often portable from CUDA without major changes
A simple OpenCL example: Host code
// Get platform (Platform is NVIDIA Corp or Intel Corp or AMD Corp)
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);

// Get devices
std::vector<cl::Device> devices;
platforms.front().getDevices(DEVTOQUERY, &devices);

// Build context and command queue
cl::Context context(devices);
cl::CommandQueue cmdQ(context, devices[0]);

// Read kernel source and compile just in time
cl::Program::Sources sourceCode;
source_str = (char *)malloc(MAX_SOURCE_SIZE);
source_size = fread(source_str, 1, MAX_SOURCE_SIZE, fp);
sourceCode.push_back(std::make_pair(source_str, source_size));
cl::Program program = cl::Program(context, sourceCode);
program.build(devices);
cl::Kernel kernel(program, "VectorCopy");

// Allocate buffer
cl::Buffer D_A(context, CL_MEM_READ_WRITE, sizeof(REAL)*Vectorlength);

// Copy data
cmdQ.enqueueWriteBuffer(D_A, true, 0, sizeof(REAL)*Vectorlength, &H_A[0]);

// Bind parameters to kernel
cl::KernelFunctor kernel_func = kernel.bind(cmdQ, cl::NDRange(Globalsize), cl::NDRange(Workgroupsize));

// Call kernel
event = kernel_func(D_A, D_B, D_C, scalar, Vectorlength, i);
OpenCL pros and cons
Pros:
- Relatively straightforward programming model
- Low-level programming, explicit data management
- Available for NVIDIA and AMD GPUs, and multicore CPUs
- Potential for overlapping GPU computation with CPU tasks
- CUDA kernel code largely re-usable
- Some support for modern SIMD instruction sets

Cons:
- Available for C(99)/C++ only
- Just-in-time kernel compilation
- Low-level programming, explicit data management
- Powerful tools are just beginning to emerge
- Largely manual work distribution, but more flexible than CUDA
- Best performance on all architectures requires specialized code for each
MPI
The Message Passing Interface
The message passing paradigm: A programming model
Distributed-memory architecture:
- Each process(or) can only access its dedicated address space; there is no global shared address space
- Data exchange and communication between processes is done by explicitly passing messages through a communication network

Message passing library:
- Should be flexible, efficient and portable
- Hides communication hardware and software layers from the application programmer
The message passing paradigm
- Widely accepted standard in HPC / numerical simulation: the Message Passing Interface (MPI)
- See http://www.mpi-forum.org for documents
- Many free and commercial implementations: Intel MPI, Open MPI, MVAPICH, ...
- Process-based approach: all variables are local!
- Same program on each processor/machine (SPMD)
- No restriction of the general message passing model, because processes can be distinguished by their rank (see later)
- The program is written in a sequential language (Fortran/C/C++)
- Data exchange between processes: send/receive messages via MPI library calls
- This is usually the most tedious, but also the most flexible, way of parallelization
MPI in a nutshell: Parallel execution

- Processes run throughout program execution: all variables are local
- Startup phase:
  - launch tasks
  - establish a communication context (communicator) among all tasks
- Point-to-point data transfer:
  - between pairs of tasks
  - may be blocking or non-blocking
  - explicit synchronization is needed for non-blocking transfers
- Collective communication:
  - between all tasks or a subgroup of tasks
  - presently blocking-only
  - reductions, scatter/gather operations
  - efficiency of the library call
- Clean shutdown
MPI in a nutshell: Hello World!

program hello
  use mpi
  implicit none
  integer rank, size, ierror

  call MPI_INIT(ierror)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)

  write(*,*) 'Hello World! I am ',rank,' of ',size

  call MPI_FINALIZE(ierror)
end program

Possible output with 4 processes (the order is nondeterministic):

Hello World! I am 3 of 4
Hello World! I am 1 of 4
Hello World! I am 0 of 4
Hello World! I am 2 of 4
MPI in a nutshell: Transmitting a message

MPI requires the following information:
- Which processor is sending the message
- Where the data is on the sending processor
- What kind of data is being sent
- How much data there is
- Which processor(s) are receiving the message
- Where the data should be left on the receiving processor
- How much data the receiving processor is prepared to accept

Sender and receiver must pass their information to MPI separately; this holds for point-to-point communication.
MPI pros and cons
Pros:
- Suitable for distributed-memory and shared-memory machines
- Supports massive parallelism
- Well supported, many free and commercial implementations
- Tremendous code base, huge experience in the field
- Standard supports Fortran and C; wrappers for other languages exist, including scripting languages
- Hybrid MPI+X models are supported: X ∈ {OpenMP, CUDA, OpenCL, TBB, ...}

Cons:
- Execution environment is crucial to set up
- Huge standard (500+ functions) with many obscure bits and pieces
- Incremental parallelization is next to impossible; most sequential code needs serious restructuring
- Performance properties are sometimes hard to understand, and also implementation-dependent
Recommended