Parallel Computing on GPUs – Christian Kehl, 01.01.2011

GPU Computing


Page 1: GPU Computing

Parallel Computing on GPUs

Christian Kehl, 01.01.2011

Page 2: GPU Computing

Overview

• Basics of Parallel Computing
• Brief History of SIMD vs. MIMD Architectures
• OpenCL
• Common Application Domain
• Monte-Carlo Study of a Spring-Mass System using OpenCL and OpenMP

Page 3: GPU Computing

Basics of Parallel Computing

Ref.: René Fink, „Untersuchungen zur Parallelverarbeitung mit wissenschaftlich-technischen Berechnungsumgebungen“ (studies on parallel processing with scientific-technical computing environments), PhD thesis, University of Rostock, 2007

Page 4: GPU Computing

Basics of Parallel Computing

Page 5: GPU Computing

Overview

• Basics of Parallel Computing
• Brief History of SIMD vs. MIMD Architectures
• OpenCL
• Common Application Domain
• Monte-Carlo Study of a Spring-Mass System using OpenCL and OpenMP

Page 6: GPU Computing

Brief History of SIMD vs. MIMD Architectures

Page 7: GPU Computing

Brief History of SIMD vs. MIMD Architectures

Page 8: GPU Computing

Brief History of SIMD vs. MIMD Architectures

Page 9: GPU Computing

Brief History of SIMD vs. MIMD Architectures

• 2004 – programmable GPU cores via shader technology
• 2007 – CUDA (Compute Unified Device Architecture) Release 1.0
• December 2008 – first Open Compute Language spec
• March 2009 – uniform shaders, first beta releases of OpenCL
• August 2009 – release and implementation of OpenCL 1.0

Page 10: GPU Computing

Brief History of SIMD vs. MIMD Architectures

• SIMD technologies in GPUs:
  – vector processing (ILLIAC IV)
  – mathematical operation units (ILLIAC IV)
  – pipelining (CRAY-1)
  – local memory caching (CRAY-1)
  – atomic instructions (CRAY-1)
  – synchronized instruction execution and memory access (MASPAR)

Page 11: GPU Computing

Overview

• Basics of Parallel Computing
• Brief History of SIMD vs. MIMD Architectures
• OpenCL
• Common Application Domain
• Monte-Carlo Study of a Spring-Mass System using OpenCL and OpenMP

Page 12: GPU Computing

OpenCL – Platform Model

• One Host + one or more Compute Devices
• Each Compute Device is composed of one or more Compute Units
• Each Compute Unit is further divided into one or more Processing Elements

Page 13: GPU Computing

OpenCL – Kernel Execution

• Total number of work-items = Gx * Gy
• Size of each work-group = Sx * Sy
• Global ID can be computed from work-group ID and local ID

Page 14: GPU Computing

OpenCL – Memory Management

Page 15: GPU Computing

OpenCL – Memory Management

Page 16: GPU Computing

OpenCL – Memory Model

• Address spaces:
  – Private – private to a work-item
  – Local – local to a work-group
  – Global – accessible by all work-items in all work-groups
  – Constant – read-only global space

Page 17: GPU Computing

OpenCL – Programming Language

• Every GPU computing technology is natively written in C/C++ (host)
• Host-code bindings to several other languages exist (Fortran, Java, C#, Ruby)
• Device code is written exclusively in standard C plus extensions

Page 18: GPU Computing

OpenCL – Language Restrictions

• Pointers to functions are not allowed
• Pointers to pointers are allowed within a kernel, but not as an argument
• Bit-fields are not supported
• Variable-length arrays and structures are not supported
• Recursion is not supported
• Writes to a pointer of types less than 32 bit are not supported
• Double types are not supported, but reserved
• 3D image writes are not supported
• Some restrictions are addressed through extensions

Page 19: GPU Computing

Overview

• Basics of Parallel Computing
• Brief History of SIMD vs. MIMD Architectures
• OpenCL
• Common Application Domain
• Monte-Carlo Study of a Spring-Mass System using OpenCL and OpenMP

Page 20: GPU Computing

Common Application Domain

• Multimedia data and tasks are best suited for SIMD processing
• Multimedia data: sequential byte streams; each byte independent
• Image processing is particularly well suited for GPUs
• Original GPU task: "compute <several FLOP> for every pixel of the screen" (computer graphics)
• The same task applies to images, only the FLOPs differ

Page 21: GPU Computing

Common Application Domain – Image Processing

• Possible features realizable on the GPU:
  – contrast and luminance configuration
  – gamma scaling
  – (pixel-by-pixel) histogram scaling
  – convolution filtering
  – edge highlighting
  – negative image / image inversion
  – …

Page 22: GPU Computing

Image Processing – Inversion

• Simple example: inversion
• Implementation and use of a framework for switching between different GPGPU technologies
• Creation of a command queue for each GPU
• Reading the GPU kernel from a kernel file on the fly
• Creation of buffers for the input and output image
• Memory copy of the input image data to global GPU memory
• Setting of kernel arguments and kernel execution
• Memory copy of the GPU output buffer data to a new image

Page 23: GPU Computing

Image Processing – Inversion

Evaluated and confirmed minimum speedup, G80 GPU (OpenCL) vs. 8-core CPU (OpenMP): 4 : 1

Page 24: GPU Computing

Overview

• Basics of Parallel Computing
• Brief History of SIMD vs. MIMD Architectures
• OpenCL
• Common Application Domain
• Monte-Carlo Study of a Spring-Mass System using OpenCL and OpenMP

Page 25: GPU Computing

MC Study of a SMS using OpenCL and OpenMP

• Task
• Modelling
• Euler as simple ODE solver
• Existing MIMD Solutions
• An SIMD Approach
• OpenMP
• Result Plots
• Speed-Up Study
• Parallelization Conclusions
• Résumé

Page 26: GPU Computing

Task

• Spring-mass system defined by a differential equation
• The behavior of the system must be simulated over varying damping values
• Therefore: numerical solution in t, t ∈ [0.0; 2] s, for a step size h = 1/1000
• Analysis of computation time and speed-up for different compute architectures

Page 27: GPU Computing

Task

Based on Simulation News Europe (SNE) CP2:
• 1000 simulation iterations over the simulation horizon with generated damping values (Monte-Carlo study)
• consecutive averaging for s(t)
• t ∈ [0; 2] s; h = 0.01 → 200 steps

Page 28: GPU Computing

Task

Too lightweight on present architectures → modification:
• 5000 Monte-Carlo iterations
• h = 0.001 → 2000 steps

Aim of the analysis: knowledge about the spring behavior for different damping values (trajectory array)

Page 29: GPU Computing

Task

• Simple spring-mass system:
  d … damping constant
  c … spring constant
• Movement equation derived from Newton's 2nd axiom
• Modelling needed → „Massenfreischnitt“ (free-body cut of the mass):
  – the mass is moved
  – force balancing equation

Page 30: GPU Computing

MC Study of a SMS using OpenCL and OpenMP

• Task
• Modelling
• Euler as simple ODE solver
• Existing MIMD Solutions
• An SIMD Approach
• OpenMP
• Result Plots
• Speed-Up Study
• Parallelization Conclusions
• Résumé

Page 31: GPU Computing

Modelling

• Numerical integration based on the 2nd-order differential equation
• A DE of order n → n DEs of 1st order

Newton's 2nd axiom (force balance of inertia, damping, and spring force):

  F_T + F_D + F_C = 0
  m·s''(t) + d·s'(t) + c·s(t) = 0
  s''(t) = -(d/m)·s'(t) - (c/m)·s(t)

with
  s(t)  … position
  s'(t)  = v(t)
  s''(t) = a(t)

Page 32: GPU Computing

Modelling

• Transformation by substitution: s1(t) = s(t), s2(t) = s'(t)

  s1'(t) = s2(t)
  s2'(t) = s''(t) = -(d/m)·s2(t) - (c/m)·s1(t)

Given by SNE CP2:
  c = 9000; m = 450 kg
  t_start = 0 s; t_end = 2 s

Start values:
  s(0) = 0 m
  s'(0) = v(0) = 0.1 m/s

• Random damping parameter d for interval limits [800; 1200]
• 5000 iterations

Page 33: GPU Computing

MC Study of a SMS using OpenCL and OpenMP

• Task
• Modelling
• Euler as simple ODE solver
• Existing MIMD Solutions
• An SIMD Approach
• OpenMP
• Result Plots
• Speed-Up Study
• Parallelization Conclusions
• Résumé

Page 34: GPU Computing

Euler as simple ODE solver

• Numerical integration by the explicit Euler method

Wanted: trajectory s(t) → ODE system!
Start values: t0, s0 and s'0

Solution:
  s(t0) = s0
  s(t1) = s(t0 + h) = s0 + h·f(t0, s0)
  s(t2) = s1 + h·f(t1, s1)
  s(t3) = s2 + h·f(t2, s2)
  …

Use for the spring-mass problem:
  s1' = s2(t)
  s2' = -(d/m)·s2(t) - (c/m)·s1(t)

Iterate over all steps:
  s1(t + h) = s1 + h·s1'
  s2(t + h) = s2 + h·s2'

Page 35: GPU Computing

MC Study of a SMS using OpenCL and OpenMP

• Task
• Modelling
• Euler as simple ODE solver
• Existing MIMD Solutions
• An SIMD Approach
• OpenMP
• Result Plots
• Speed-Up Study
• Parallelization Conclusions
• Résumé

Page 36: GPU Computing

Existing MIMD Solutions

Page 37: GPU Computing

Existing MIMD Solutions

• The approach cannot be applied to GPU architectures
• MIMD requirements:
  – each PE has its own instruction flow
  – each PE can access RAM individually
• GPU architecture → SIMD:
  – each PE computes the same instruction at the same time
  – each PE has to be at the same instruction when accessing RAM

Therefore: development of an SIMD approach

Page 38: GPU Computing

MC Study of a SMS using OpenCL and OpenMP

• Task
• Modelling
• Euler as simple ODE solver
• Existing MIMD Solutions
• An SIMD Approach
• OpenMP
• Result Plots
• Speed-Up Study
• Parallelization Conclusions
• Résumé

Page 39: GPU Computing

An SIMD Approach

• S.P./R.F.:
  – simultaneous execution of the sequential simulation with varying d parameter on spatially distributed PEs
  – averaging depending on trajectories
• C.K.:
  – simultaneous computation with all d parameters for time t_n, iterative repetition until t_end
  – averaging depending on steps

Page 40: GPU Computing

An SIMD Approach

Page 41: GPU Computing

MC Study of a SMS using OpenCL and OpenMP

• Task
• Modelling
• Euler as simple ODE solver
• Existing MIMD Solutions
• An SIMD Approach
• OpenMP
• Result Plots
• Speed-Up Study
• Parallelization Conclusions
• Résumé

Page 42: GPU Computing

OpenMP

• Parallelization technology based on the shared-memory principle
• Synchronization hidden from the developer
• Thread management controllable
• On System-V-based OSes: parallelization by process forking
• On Windows-based OSes: parallelization by WinThread creation (AMD study / Intel tech paper)

Page 43: GPU Computing

OpenMP

• In C/C++: pragma-based preprocessor directives
• In C# represented by ParallelLoops
• More than just parallelizing loops (AMD tech report)
• Literature:
  – AMD/Intel tech papers
  – Thomas Rauber, „Parallele Programmierung“
  – Barbara Chapman, „Using OpenMP: Portable Shared Memory Parallel Programming“

Page 44: GPU Computing

MC Study of a SMS using OpenCL and OpenMP

• Task
• Modelling
• Euler as simple ODE solver
• Existing MIMD Solutions
• An SIMD Approach
• OpenMP
• Result Plots
• Speed-Up Study
• Parallelization Conclusions
• Résumé

Page 45: GPU Computing


Result Plot

resulting trajectory for all technologies

Page 46: GPU Computing

MC Study of a SMS using OpenCL and OpenMP

• Task
• Modelling
• Euler as simple ODE solver
• Existing MIMD Solutions
• An SIMD Approach
• OpenMP
• Result Plots
• Speed-Up Study
• Parallelization Conclusions
• Résumé

Page 47: GPU Computing

Speed-Up Study

# Cores | MIMD Single   | MIMD OpenMP | SIMD Single   | SIMD OpenMP | SIMD OpenCL
1       | 1.0 (T=56.53) | 1.0         | 0.9 (T=64.63) | 0.9         | 0.4 (T=144.6)
2       | X             | 1.8         | X             | 1.4         | X
4       | X             | 3.5         | X             | 2.0         | X
8       | X             | 5.7         | X             | 1.7         | X
16      | X             | 5.1         | X             | 0.5         | X
dyn/std | 1.0           | 5.7         | 0.9           | 1.7         | 0.4

OpenMP: own study, comparison CPU/GPU
SIMD Single: presented SIMD approach on the CPU
SIMD OpenMP: presented SIMD approach parallelized on the CPU
SIMD OpenCL: control of the number of executing units not possible, therefore only one value

Page 48: GPU Computing

Speed-Up Study

(Plot; curves: SIMD OpenCL, SIMD single, MIMD single, SIMD OpenMP, MIMD OpenMP)

Page 49: GPU Computing

MC Study of a SMS using OpenCL and OpenMP

• Task
• Modelling
• Euler as simple ODE solver
• Existing MIMD Solutions
• An SIMD Approach
• OpenMP
• Result Plots
• Speed-Up Study
• Parallelization Conclusions
• Résumé

Page 50: GPU Computing

Parallelization Conclusions

• The problem is unsuited for SIMD parallelization
• On-GPU reduction is too time-expensive, therefore:
  – Euler computation on the GPU
  – average computation on the CPU
• Most time-intensive operation: memory copy between GPU and main memory
• For more complex problems or a different ODE solver procedure the speed-up behavior can change

Page 51: GPU Computing

Parallelization Conclusions

• The MIMD approach of S.P./R.F. is efficient for SNE CP2
• OpenMP realization possible (and done) for both the MIMD and the SIMD approach
• The OpenMP MIMD realization achieves almost linear speedup
• Setting more threads than physically available PEs leads to significant thread overhead
• With dynamic assignment, OpenMP automatically matches the number of threads to the physically available PEs

Page 52: GPU Computing

MC Study of a SMS using OpenCL and OpenMP

• Task
• Modelling
• Euler as simple ODE solver
• Existing MIMD Solutions
• An SIMD Approach
• OpenMP
• Result Plots
• Speed-Up Study
• Parallelization Conclusions
• Résumé

Page 53: GPU Computing

Résumé

• The task can be solved on CPUs and GPUs
• GPU computing requires new approaches and algorithm porting
• Although GPUs have a massive number of parallel operating cores, a speed-up is not possible for every application domain

Page 54: GPU Computing

Résumé

• Advantages of GPU computing:
  – very fast and scalable for suited problems (e.g. multimedia)
  – cheap HPC technology in comparison to scientific supercomputers
  – energy-efficient
  – massive computing power in a small form factor
• Disadvantages of GPU computing:
  – limited instruction set
  – strictly SIMD
  – SIMD algorithm development is hard
  – no execution supervision (e.g. segmentation/page fault)

Page 55: GPU Computing

Overview

• Basics of Parallel Computing
• Brief History of SIMD vs. MIMD Architectures
• OpenCL
• Common Application Domain
• Monte-Carlo Study of a Spring-Mass System using OpenCL and OpenMP