GPU Computing


Parallel Computing on GPUs

Christian Kehl, 01.01.2011

Overview

• Basics of Parallel Computing
• Brief History of SIMD vs. MIMD Architectures
• OpenCL
• Common Application Domain
• Monte Carlo Study of a Spring-Mass System using OpenCL and OpenMP

Basics of Parallel Computing

Ref.: René Fink, „Untersuchungen zur Parallelverarbeitung mit wissenschaftlich-technischen Berechnungsumgebungen", Dissertation, University of Rostock, 2007


Brief History of SIMD vs. MIMD Architectures

• 2004 – programmable GPU cores via shader technology
• 2007 – CUDA (Compute Unified Device Architecture) Release 1.0
• December 2008 – first Open Compute Language specification
• March 2009 – uniform shaders, first beta releases of OpenCL
• August 2009 – release and implementation of OpenCL 1.0

Brief History of SIMD vs. MIMD Architectures

• SIMD technologies in GPUs:
  – vector processing (ILLIAC IV)
  – mathematical operation units (ILLIAC IV)
  – pipelining (CRAY-1)
  – local memory caching (CRAY-1)
  – atomic instructions (CRAY-1)
  – synchronized instruction execution and memory access (MASPAR)


OpenCL – Platform Model

• One host plus one or more compute devices
• Each compute device is composed of one or more compute units
• Each compute unit is further divided into one or more processing elements
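A hedged host-side C sketch of this hierarchy (not from the slides): it queries the devices of one platform and prints their compute units; error handling is omitted for brevity.

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id devices[8];
    cl_uint num_devices;

    clGetPlatformIDs(1, &platform, NULL);              /* the host selects a platform */
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &num_devices);

    for (cl_uint i = 0; i < num_devices; ++i) {
        cl_uint units;
        clGetDeviceInfo(devices[i], CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(units), &units, NULL);  /* compute units per device */
        printf("device %u: %u compute units\n", i, units);
    }
    return 0;
}
```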

OpenCL – Kernel Execution

• Total number of work-items = Gx * Gy
• Size of each work-group = Sx * Sy
• The global ID can be computed from the work-group ID and the local ID
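A minimal device-side sketch of that relation, shown for one dimension (the slide states it for a 2D range of Gx * Gy work-items); the kernel name is illustrative.

```c
/* Per dimension, the OpenCL C index built-ins obey:
 *   global ID = work-group ID * work-group size + local ID */
__kernel void show_ids(__global int *out)
{
    size_t gid = get_group_id(0) * get_local_size(0) + get_local_id(0);
    out[gid] = (int)get_global_id(0);   /* same value as gid */
}
```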

OpenCL – Memory Management

OpenCL – Memory Model

• Address spaces:
  – Private – private to a work-item
  – Local – local to a work-group
  – Global – accessible by all work-items in all work-groups
  – Constant – read-only global space
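A hedged kernel sketch illustrating the four address spaces; the kernel name and arguments are illustrative, not from the slides.

```c
__kernel void scale(__global float *data,     /* global: visible to all work-items   */
                    __constant float *coeff,  /* constant: read-only global space    */
                    __local float *tile)      /* local: shared within one work-group */
{
    float x = data[get_global_id(0)];         /* private: registers of one work-item */
    tile[get_local_id(0)] = x;
    barrier(CLK_LOCAL_MEM_FENCE);             /* synchronize the work-group          */
    data[get_global_id(0)] = tile[get_local_id(0)] * coeff[0];
}
```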

OpenCL – Programming Language

• Every GPU computing technology is natively programmed in C/C++ (host)
• Host-code bindings to several other languages exist (Fortran, Java, C#, Ruby)
• Device code is written exclusively in standard C plus extensions

OpenCL – Language Restrictions

• Pointers to functions not allowed
• Pointers to pointers allowed within a kernel, but not as an argument
• Bit-fields not supported
• Variable-length arrays and structures not supported
• Recursion not supported
• Writes to a pointer of types less than 32 bit not supported
• Double types not supported, but reserved
• 3D image writes not supported
• Some restrictions are addressed through extensions
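For example, the reserved double type can be turned on where the cl_khr_fp64 extension is available; a minimal sketch (kernel name illustrative):

```c
/* Enables doubles via an extension, as the last bullet hints;
 * requires a device that actually exposes cl_khr_fp64. */
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void axpy(__global double *y, __global const double *x, const double a)
{
    size_t i = get_global_id(0);
    y[i] += a * x[i];
}
```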


Common Application Domain

• Multimedia data and tasks are best suited for SIMD processing
• Multimedia data – sequential byte streams; each byte independent
• Image processing is particularly well suited for GPUs
• Original GPU task: "compute <several FLOP> for every pixel of the screen" (computer graphics)
• Same task for images, only the FLOPs are different

Common Application Domain – Image Processing

• Possible features realizable on the GPU:
  – contrast and luminance configuration
  – gamma scaling
  – (pixel-by-pixel) histogram scaling
  – convolution filtering
  – edge highlighting
  – negative image / image inversion
  – …

Image Processing – Inversion

• Simple example: inversion
• Implementation and use of a framework for switching between different GPGPU technologies
• Creation of a command queue for each GPU
• Reading the GPU kernel from a kernel file on the fly
• Creation of buffers for the input and output image
• Memory copy of the input image data to global GPU memory
• Setting of kernel arguments and kernel execution
• Memory copy of the GPU output buffer data to a new image
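The slides do not show the kernel source; a minimal sketch of an inversion kernel, assuming an 8-bit RGBA image with one work-item per pixel (format and names are assumptions):

```c
__kernel void invert(__global const uchar4 *src, __global uchar4 *dst)
{
    size_t i = get_global_id(0);
    uchar4 p = src[i];
    /* negative image: invert the color channels, keep alpha */
    dst[i] = (uchar4)(255 - p.x, 255 - p.y, 255 - p.z, p.w);
}
```

The host-side steps listed above (buffer creation, memory copy to global GPU memory, setting kernel arguments, enqueueing, copy-back) surround a kernel of this shape.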

Image Processing – Inversion

Evaluated and confirmed minimum speed-up of 4 : 1 – G80 GPU (OpenCL) vs. 8-core CPU (OpenMP)


Monte Carlo Study of a Spring-Mass System using OpenCL and OpenMP

• Task
• Modelling
• Euler as simple ODE solver
• Existing MIMD solutions
• A SIMD approach
• OpenMP
• Result plots
• Speed-up study
• Parallelization conclusions
• Résumé

Task

• Spring-mass system defined by a differential equation
• Behavior of the system must be simulated over varying damping values
• Therefore: numerical solution in t, t ∈ [0.0, 2.0] s, for a step size h = 1/1000
• Analysis of computation time and speed-up for different compute architectures

Task

Based on Simulation News Europe (SNE) CP2:
• 1000 simulation iterations over the simulation horizon with generated damping values (Monte Carlo study)
• consecutive averaging for s(t)
• t ∈ [0, 2] s; h = 0.01 → 200 steps

Task

Too lightweight on present architectures → modification:
• 5000 iterations with Monte Carlo
• h = 0.001 → 2000 steps

Aim of the analysis: knowledge about the spring behavior for different damping values (trajectory array)

Task

• Simple spring-mass system
  – d … damping constant
  – c … spring constant
• Movement equation derived from Newton's 2nd axiom: m·s''(t) + d·s'(t) + c·s(t) = 0
• Modelling needed → „Massenfreischnitt" (free-body cut of the mass)
  – mass is moved
  – force-balance equation


Modelling

• Numerical integration based on the 2nd-order differential equation
• A DE of order n ↔ n DEs of 1st order

Newton's 2nd axiom: FT + FD + FC = 0

m·s''(t) + d·s'(t) + c·s(t) = 0
s''(t) = -(d/m)·s'(t) - (c/m)·s(t)

with s(t) … position, s'(t) = v(t) … velocity, s''(t) = a(t) … acceleration

Modelling

• Transformation by substitution:

s1(t) := s(t), s2(t) := s'(t)
s1'(t) = s2(t)
s2'(t) = s''(t) = -(d/m)·s2(t) - (c/m)·s1(t)

• Given by SNE CP2: m = 450 kg; c = 9000
• Start values: s(0) = 0 m; s'(0) = v(0) = 0.1 m/s
• tstart = 0 s; tend = 2 s
• Random damping parameter d for interval limits [800; 1200]; 5000 iterations


Euler as simple ODE solver

• Numerical integration by the explicit Euler method

Problem: trajectory s(t)?
→ ODE system with start values t0, s0 and s0'

Solution:
s(t0) = s0
s(t1) = s(t0 + h) = s(t0) + h·f(t0, s0)
s(t2) = s(t1) + h·f(t1, s1)
s(t3) = s(t2) + h·f(t2, s2)
…

Use for the spring-mass problem:
s1' = s2(t)
s2' = -(d/m)·s2(t) - (c/m)·s1(t)
s1(t+h) = s1(t) + h·s1'
s2(t+h) = s2(t) + h·s2'
iterate over all steps
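A sequential C reference of this scheme, using the values from the Modelling slides (m = 450 kg, c = 9000, h = 0.001, 2000 steps, s(0) = 0, v(0) = 0.1); the fixed d = 1000 stands in for one Monte Carlo sample from [800; 1200].

```c
#include <stdio.h>

int main(void)
{
    const double m = 450.0, c = 9000.0, d = 1000.0;  /* one sample damping value */
    const double h = 0.001;                          /* step size, 2000 steps to t = 2 s */
    double s1 = 0.0, s2 = 0.1;                       /* s(0) = 0 m, v(0) = 0.1 m/s */

    for (int step = 0; step < 2000; ++step) {
        double ds1 = s2;                             /* s1' = s2 */
        double ds2 = -(d / m) * s2 - (c / m) * s1;   /* s2' = -(d/m)s2 - (c/m)s1 */
        s1 += h * ds1;                               /* explicit Euler update */
        s2 += h * ds2;
    }
    printf("s(2.0) = %f\n", s1);
    return 0;
}
```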



Existing MIMD Solutions

• The approach cannot be applied to GPU architectures
• MIMD requirements:
  – each PE has its own instruction flow
  – each PE can access RAM individually
• GPU architecture → SIMD:
  – each PE computes the same instruction at the same time
  – each PE has to be at the same instruction for accessing RAM
• Therefore: development of a SIMD approach


A SIMD Approach

• S.P./R.F.:
  – simultaneous execution of the sequential simulation with varying d-parameter on spatially distributed PEs
  – averaging over complete trajectories
• C.K. (see the kernel sketch below):
  – simultaneous computation with all d-parameters for time tn, iterated until tend
  – averaging per step

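A hedged OpenCL sketch of the C.K. per-step approach: each work-item advances the state for one damping value by a single Euler step; all names are illustrative, not from the slides.

```c
__kernel void euler_step(__global float *s1, __global float *s2,
                         __global const float *d,   /* 5000 damping samples */
                         const float c, const float m, const float h)
{
    size_t i = get_global_id(0);
    float ds1 = s2[i];                                  /* s1' = s2 */
    float ds2 = -(d[i] / m) * s2[i] - (c / m) * s1[i];  /* s2' = -(d/m)s2 - (c/m)s1 */
    s1[i] += h * ds1;                                   /* explicit Euler update */
    s2[i] += h * ds2;
}
```

The host would enqueue this kernel once per time step until tend; consistent with the conclusions later in the deck, the per-step averaging over the 5000 trajectories is then done on the CPU.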


OpenMP

• Parallelization technology based on the shared-memory principle
• Synchronization hidden from the developer
• Thread management controllable
• On System-V-based OSs:
  – parallelization by process forking
• On Windows-based OSs:
  – parallelization by WinThread creation (AMD study / Intel tech paper)

OpenMP

• In C/C++: pragma-based preprocessor directives (see the sketch below)
• In C#: represented by Parallel Loops
• More than just parallelizing loops (AMD tech report)
• Literature:
  – AMD/Intel tech papers
  – Thomas Rauber, „Parallele Programmierung"
  – Barbara Chapman, „Using OpenMP: Portable Shared Memory Parallel Programming"
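A minimal sketch of the pragma-based style referenced above: the Monte Carlo loop over damping values, parallelized MIMD-style so each thread runs whole sequential simulations (values taken from the Task/Modelling slides; helper names are illustrative).

```c
#include <stdio.h>
#include <omp.h>

/* One sequential Euler run for a single damping value d (see earlier sketch). */
static double simulate(double d)
{
    const double m = 450.0, c = 9000.0, h = 0.001;
    double s1 = 0.0, s2 = 0.1;
    for (int step = 0; step < 2000; ++step) {
        double ds1 = s2, ds2 = -(d / m) * s2 - (c / m) * s1;
        s1 += h * ds1;
        s2 += h * ds2;
    }
    return s1;
}

int main(void)
{
    enum { N = 5000 };                 /* 5000 Monte Carlo iterations */
    static double result[N];

    /* Iterations are distributed over the CPU threads by the runtime. */
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        result[i] = simulate(800.0 + 400.0 * i / (N - 1));  /* d in [800, 1200] */

    printf("s(2) for mid-range d: %f\n", result[N / 2]);
    return 0;
}
```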


Result Plot

[Plot: resulting trajectory s(t), identical for all technologies]


Speed-Up Study

# Cores | MIMD Single    | MIMD OpenMP | SIMD Single    | SIMD OpenMP | SIMD OpenCL
1       | 1.0 (T=56.53)  | 1.0         | 0.9 (T=64.63)  | 0.9         | 0.4 (T=144.6)
2       | X              | 1.8         | X              | 1.4         | X
4       | X              | 3.5         | X              | 2.0         | X
8       | X              | 5.7         | X              | 1.7         | X
16      | X              | 5.1         | X              | 0.5         | X
dyn/std | 1.0            | 5.7         | 0.9            | 1.7         | 0.4

OpenMP – own study – comparison CPU/GPU
SIMD Single: presented SIMD approach on the CPU
SIMD OpenMP: presented SIMD approach parallelized on the CPU
SIMD OpenCL: control of the number of executing units not possible, therefore only one value

Speed-Up Study

[Plot: speed-up curves for SIMD OpenCL, SIMD single, MIMD single, SIMD OpenMP, MIMD OpenMP]


Parallelization Conclusions

• Problem unsuited for SIMD parallelization
• On-GPU reduction too time-expensive; therefore:
  – Euler computation on the GPU
  – average computation on the CPU
• Most time-intensive operation: MemCopy between GPU and main memory
• For more complex problems or a different ODE solver procedure the speed-up behavior can change

Parallelization Conclusions

• MIMD approach of S.P./R.F. efficient for SNE CP2
• OpenMP realization possible (and done) for the MIMD and the SIMD approach
• OpenMP MIMD realization: almost linear speed-up
• Setting more threads than physically available PEs leads to significant thread overhead
• For dynamic assignment, OpenMP automatically matches the number of threads to the physically available PEs


Résumé

• The task can be solved on CPUs and on GPUs
• GPU computing requires new approaches and algorithm porting
• Although GPUs have a massive number of parallel operating cores, a speed-up is not possible for every application domain

Résumé

• Advantages of GPU computing:
  – very fast and scalable for suited problems (e.g. multimedia)
  – cheap HPC technology in comparison to scientific supercomputers
  – energy-efficient
  – massive computing power in a small size
• Disadvantages of GPU computing:
  – limited instruction set
  – strictly SIMD
  – SIMD algorithm development is hard
  – no execution supervision (e.g. segmentation/page fault)
