Parallel Computing on GPUs – Christian Kehl, 01.01.2011

GPU Computing


Page 1: GPU Computing

Parallel Computing on GPUs

Christian Kehl, 01.01.2011

Page 2: GPU Computing

Overview

• Basics of Parallel Computing
• Brief History of SIMD vs. MIMD Architectures
• OpenCL
• Common Application Domain
• Monte-Carlo Study of a Spring-Mass System using OpenCL and OpenMP

Page 3: GPU Computing

Basics of Parallel Computing

Ref.: René Fink, „Untersuchungen zur Parallelverarbeitung mit wissenschaftlich-technischen Berechnungsumgebungen“ (studies on parallel processing with scientific-technical computing environments), PhD thesis, University of Rostock, 2007

Page 4: GPU Computing

Basics of Parallel Computing

Page 5: GPU Computing

Overview

• Basics of Parallel Computing
• Brief History of SIMD vs. MIMD Architectures
• OpenCL
• Common Application Domain
• Monte-Carlo Study of a Spring-Mass System using OpenCL and OpenMP

Page 6: GPU Computing

Brief History of SIMD vs. MIMD Architectures

Page 7: GPU Computing

Brief History of SIMD vs. MIMD Architectures

Page 8: GPU Computing

Brief History of SIMD vs. MIMD Architectures

Page 9: GPU Computing

Brief History of SIMD vs. MIMD Architectures

• 2004 – programmable GPU cores via shader technology
• 2007 – CUDA (Compute Unified Device Architecture) Release 1.0
• December 2008 – first Open Compute Language spec
• March 2009 – uniform shaders, first beta releases of OpenCL
• August 2009 – release and implementation of OpenCL 1.0

Page 10: GPU Computing

Brief History of SIMD vs. MIMD Architectures

• SIMD technologies in GPUs:
  – vector processing (ILLIAC IV)
  – mathematical operation units (ILLIAC IV)
  – pipelining (CRAY-1)
  – local memory caching (CRAY-1)
  – atomic instructions (CRAY-1)
  – synchronized instruction execution and memory access (MASPAR)

Page 11: GPU Computing

Overview

• Basics of Parallel Computing
• Brief History of SIMD vs. MIMD Architectures
• OpenCL
• Common Application Domain
• Monte-Carlo Study of a Spring-Mass System using OpenCL and OpenMP

Page 12: GPU Computing

OpenCL – Platform Model

• One Host + one or more Compute Devices
• Each Compute Device is composed of one or more Compute Units
• Each Compute Unit is further divided into one or more Processing Elements

Page 13: GPU Computing

OpenCL – Kernel Execution

• Total number of work-items = Gx * Gy
• Size of each work-group = Sx * Sy
• Global ID can be computed from work-group ID and local ID

Page 14: GPU Computing

OpenCL – Memory Management

Page 15: GPU Computing

OpenCL – Memory Management

Page 16: GPU Computing

OpenCL – Memory Model

• Address spaces:
  – Private – private to a work-item
  – Local – local to a work-group
  – Global – accessible by all work-items in all work-groups
  – Constant – read-only global space

Page 17: GPU Computing

OpenCL – Programming Language

• Every GPU computing technology is natively written in C/C++ (host)
• Host-code bindings to several other languages exist (Fortran, Java, C#, Ruby)
• Device code is written exclusively in standard C plus extensions

Page 18: GPU Computing

OpenCL – Language Restrictions

• Pointers to functions are not allowed
• Pointers to pointers are allowed within a kernel, but not as an argument
• Bit-fields are not supported
• Variable-length arrays and structures are not supported
• Recursion is not supported
• Writes to a pointer of types less than 32 bit are not supported
• Double types are not supported, but reserved
• 3D image writes are not supported
• Some restrictions are addressed through extensions

Page 19: GPU Computing

Overview

• Basics of Parallel Computing
• Brief History of SIMD vs. MIMD Architectures
• OpenCL
• Common Application Domain
• Monte-Carlo Study of a Spring-Mass System using OpenCL and OpenMP

Page 20: GPU Computing

Common Application Domain

• Multimedia data and tasks are best suited for SIMD processing
• Multimedia data: sequential byte streams; each byte independent
• Image processing is particularly well suited for GPUs
• Original GPU task: "compute <several FLOP> for every pixel of the screen" (computer graphics)
• The same task applies to images, only the FLOPs differ

Page 21: GPU Computing

Common Application Domain – Image Processing

• Possible features realizable on the GPU:
  – contrast and luminance configuration
  – gamma scaling
  – (pixel-by-pixel) histogram scaling
  – convolution filtering
  – edge highlighting
  – negative image / image inversion
  – …

Page 22: GPU Computing

Image Processing – Inversion

• Simple example: inversion
• Implementation and use of a framework for switching between different GPGPU technologies
• Creation of a command queue for each GPU
• Reading the GPU kernel from a kernel file on the fly
• Creation of buffers for the input and output image
• Memory copy of the input image data to global GPU memory
• Setting of kernel arguments and kernel execution
• Memory copy of the GPU output buffer data to a new image

Page 23: GPU Computing

Image Processing – Inversion

Evaluated and confirmed minimum speedup, G80 GPU (OpenCL) vs. 8-core CPU (OpenMP): 4 : 1

Page 24: GPU Computing

Overview

• Basics of Parallel Computing
• Brief History of SIMD vs. MIMD Architectures
• OpenCL
• Common Application Domain
• Monte-Carlo Study of a Spring-Mass System using OpenCL and OpenMP

Page 25: GPU Computing

MC Study of a SMS using OpenCL and OpenMP

• Task
• Modelling
• Euler as simple ODE solver
• Existing MIMD Solutions
• An SIMD Approach
• OpenMP
• Result Plots
• Speed-Up Study
• Parallelization Conclusions
• Résumé

Page 26: GPU Computing

Task

• Spring-mass system defined by a differential equation
• The behavior of the system must be simulated over varying damping values
• Therefore: numerical solution in t, t ∈ [0.0; 2] s, for a step size h = 1/1000
• Analysis of computation time and speed-up for different compute architectures

Page 27: GPU Computing

Task

Based on Simulation News Europe (SNE) CP2:
• 1000 simulation iterations over the simulation horizon with generated damping values (Monte-Carlo study)
• consecutive averaging for s(t)
• t ∈ [0; 2] s; h = 0.01 → 200 steps

Page 28: GPU Computing

Task

Too lightweight on present architectures → modification:
• 5000 Monte-Carlo iterations
• h = 0.001 → 2000 steps

Aim of the analysis: knowledge about the spring behavior for different damping values (trajectory array)

Page 29: GPU Computing

Task

• Simple spring-mass system:
  d … damping constant
  c … spring constant
• Movement equation derived from Newton's 2nd axiom
• Modelling needed → „Massenfreischnitt“ (free-body cut of the mass):
  – the mass is moved
  – force balancing equation

Page 30: GPU Computing

MC Study of a SMS using OpenCL and OpenMP

• Task
• Modelling
• Euler as simple ODE solver
• Existing MIMD Solutions
• An SIMD Approach
• OpenMP
• Result Plots
• Speed-Up Study
• Parallelization Conclusions
• Résumé

Page 31: GPU Computing

Modelling

• Numerical integration based on the 2nd-order differential equation
• A DE of order n → n DEs of 1st order

Newton's 2nd axiom (force balance of inertia, damping, and spring force):

  F_T + F_D + F_C = 0
  m·s''(t) + d·s'(t) + c·s(t) = 0
  s''(t) = -(d/m)·s'(t) - (c/m)·s(t)

with
  s(t)  … position
  s'(t)  = v(t)
  s''(t) = a(t)

Page 32: GPU Computing

Modelling

• Transformation by substitution: s1(t) = s(t), s2(t) = s'(t)

  s1'(t) = s2(t)
  s2'(t) = s''(t) = -(d/m)·s2(t) - (c/m)·s1(t)

Given by SNE CP2:
  c = 9000; m = 450 kg
  t_start = 0 s; t_end = 2 s

Start values:
  s(0) = 0 m
  s'(0) = v(0) = 0.1 m/s

• Random damping parameter d for interval limits [800; 1200]
• 5000 iterations

Page 33: GPU Computing

MC Study of a SMS using OpenCL and OpenMP

• Task
• Modelling
• Euler as simple ODE solver
• Existing MIMD Solutions
• An SIMD Approach
• OpenMP
• Result Plots
• Speed-Up Study
• Parallelization Conclusions
• Résumé

Page 34: GPU Computing

Euler as simple ODE solver

• Numerical integration by the explicit Euler method

Wanted: trajectory s(t) → ODE system!
Start values: t0, s0 and s'0

Solution:
  s(t0) = s0
  s(t1) = s(t0 + h) = s0 + h·f(t0, s0)
  s(t2) = s1 + h·f(t1, s1)
  s(t3) = s2 + h·f(t2, s2)
  …

Use for the spring-mass problem:
  s1' = s2(t)
  s2' = -(d/m)·s2(t) - (c/m)·s1(t)

Iterate over all steps:
  s1(t + h) = s1 + h·s1'
  s2(t + h) = s2 + h·s2'

Page 35: GPU Computing

MC Study of a SMS using OpenCL and OpenMP

• Task
• Modelling
• Euler as simple ODE solver
• Existing MIMD Solutions
• An SIMD Approach
• OpenMP
• Result Plots
• Speed-Up Study
• Parallelization Conclusions
• Résumé

Page 36: GPU Computing

Existing MIMD Solutions

Page 37: GPU Computing

Existing MIMD Solutions

• The approach cannot be applied to GPU architectures
• MIMD requirements:
  – each PE has its own instruction flow
  – each PE can access RAM individually
• GPU architecture → SIMD:
  – each PE computes the same instruction at the same time
  – each PE has to be at the same instruction when accessing RAM

Therefore: development of an SIMD approach

Page 38: GPU Computing

MC Study of a SMS using OpenCL and OpenMP

• Task
• Modelling
• Euler as simple ODE solver
• Existing MIMD Solutions
• An SIMD Approach
• OpenMP
• Result Plots
• Speed-Up Study
• Parallelization Conclusions
• Résumé

Page 39: GPU Computing

An SIMD Approach

• S.P./R.F.:
  – simultaneous execution of the sequential simulation with varying d parameter on spatially distributed PEs
  – averaging depending on trajectories
• C.K.:
  – simultaneous computation with all d parameters for time t_n, iterative repetition until t_end
  – averaging depending on steps

Page 40: GPU Computing

An SIMD Approach

Page 41: GPU Computing

MC Study of a SMS using OpenCL and OpenMP

• Task
• Modelling
• Euler as simple ODE solver
• Existing MIMD Solutions
• An SIMD Approach
• OpenMP
• Result Plots
• Speed-Up Study
• Parallelization Conclusions
• Résumé

Page 42: GPU Computing

OpenMP

• Parallelization technology based on the shared-memory principle
• Synchronization hidden from the developer
• Thread management controllable
• On System-V-based OSes: parallelization by process forking
• On Windows-based OSes: parallelization by WinThread creation (AMD study / Intel tech paper)

Page 43: GPU Computing

OpenMP

• In C/C++: pragma-based preprocessor directives
• In C# represented by ParallelLoops
• More than just parallelizing loops (AMD tech report)
• Literature:
  – AMD/Intel tech papers
  – Thomas Rauber, „Parallele Programmierung“
  – Barbara Chapman, „Using OpenMP: Portable Shared Memory Parallel Programming“

Page 44: GPU Computing

MC Study of a SMS using OpenCL and OpenMP

• Task
• Modelling
• Euler as simple ODE solver
• Existing MIMD Solutions
• An SIMD Approach
• OpenMP
• Result Plots
• Speed-Up Study
• Parallelization Conclusions
• Résumé

Page 45: GPU Computing


Result Plot

resulting trajectory for all technologies

Page 46: GPU Computing

MC Study of a SMS using OpenCL and OpenMP

• Task
• Modelling
• Euler as simple ODE solver
• Existing MIMD Solutions
• An SIMD Approach
• OpenMP
• Result Plots
• Speed-Up Study
• Parallelization Conclusions
• Résumé

Page 47: GPU Computing

Speed-Up Study

# Cores | MIMD Single   | MIMD OpenMP | SIMD Single   | SIMD OpenMP | SIMD OpenCL
1       | 1.0 (T=56.53) | 1.0         | 0.9 (T=64.63) | 0.9         | 0.4 (T=144.6)
2       | X             | 1.8         | X             | 1.4         | X
4       | X             | 3.5         | X             | 2.0         | X
8       | X             | 5.7         | X             | 1.7         | X
16      | X             | 5.1         | X             | 0.5         | X
dyn/std | 1.0           | 5.7         | 0.9           | 1.7         | 0.4

OpenMP: own study, comparison CPU/GPU
SIMD Single: presented SIMD approach on the CPU
SIMD OpenMP: presented SIMD approach parallelized on the CPU
SIMD OpenCL: control of the number of executing units not possible, therefore only one value

Page 48: GPU Computing

Speed-Up Study

(Plot; curves: SIMD OpenCL, SIMD single, MIMD single, SIMD OpenMP, MIMD OpenMP)

Page 49: GPU Computing

MC Study of a SMS using OpenCL and OpenMP

• Task
• Modelling
• Euler as simple ODE solver
• Existing MIMD Solutions
• An SIMD Approach
• OpenMP
• Result Plots
• Speed-Up Study
• Parallelization Conclusions
• Résumé

Page 50: GPU Computing

Parallelization Conclusions

• The problem is unsuited for SIMD parallelization
• On-GPU reduction is too time-expensive, therefore:
  – Euler computation on the GPU
  – average computation on the CPU
• Most time-intensive operation: memory copy between GPU and main memory
• For more complex problems or a different ODE solver procedure the speed-up behavior can change

Page 51: GPU Computing

Parallelization Conclusions

• The MIMD approach of S.P./R.F. is efficient for SNE CP2
• OpenMP realization possible (and done) for both the MIMD and the SIMD approach
• The OpenMP MIMD realization achieves almost linear speedup
• Setting more threads than physically available PEs leads to significant thread overhead
• With dynamic assignment, OpenMP automatically matches the number of threads to the physically available PEs

Page 52: GPU Computing

MC Study of a SMS using OpenCL and OpenMP

• Task
• Modelling
• Euler as simple ODE solver
• Existing MIMD Solutions
• An SIMD Approach
• OpenMP
• Result Plots
• Speed-Up Study
• Parallelization Conclusions
• Résumé

Page 53: GPU Computing

Résumé

• The task can be solved on CPUs and GPUs
• GPU computing requires new approaches and algorithm porting
• Although GPUs have a massive number of parallel operating cores, a speed-up is not possible for every application domain

Page 54: GPU Computing

Résumé

• Advantages of GPU computing:
  – very fast and scalable for suited problems (e.g. multimedia)
  – cheap HPC technology in comparison to scientific supercomputers
  – energy-efficient
  – massive computing power in a small form factor
• Disadvantages of GPU computing:
  – limited instruction set
  – strictly SIMD
  – SIMD algorithm development is hard
  – no execution supervision (e.g. segmentation/page fault)

Page 55: GPU Computing

Overview

• Basics of Parallel Computing
• Brief History of SIMD vs. MIMD Architectures
• OpenCL
• Common Application Domain
• Monte-Carlo Study of a Spring-Mass System using OpenCL and OpenMP