comp422-2011-Lecture21-CUDA



Programming GPUs with CUDA

COMP 422, Lecture 21, 12 April 2011

John Mellor-Crummey

Department of Computer Science

Rice University

[email protected]


    Why GPUs?

Two major trends:

GPU performance is pulling away from traditional processors

~10x memory bandwidth & floating point ops

availability of general (non-graphics) programming interfaces

GPU in every PC and workstation: massive volume, potentially broad impact

Figure Credit: NVIDIA CUDA Compute Unified Device Architecture Programming Guide 2.0


    NVidia Tesla GPU

Figure Credit: http://images.nvidia.com/products/tesla_c870/Tesla_C870_F_med.png

A similar Tesla S870 server is in badlands.rcsg.rice.edu (installed March 2008)

                              Tesla (G80)     Tesla2 (GT200)
    CUDA Cores                128             240
    Processor Clock           1.69 GHz        1.47 GHz
    Floating Point Precision  IEEE 754 SP     IEEE 754 DP
    Dedicated Memory          512 MB          1 GB GDDR3
    Memory Clock              1.1 GHz         1.2 GHz
    Memory Interface Width    256-bit         512-bit
    Memory Bandwidth          70.4 GB/s       159 GB/s


    GPGPU?

General Purpose computation using GPUs: applications beyond 3D graphics

typically, data-intensive science and engineering applications

Data-intensive algorithms leverage GPU attributes

large data arrays, streaming throughput

fine-grain SIMD parallelism

low-latency floating point computation



    GPGPU Programming of Yesteryear

    Stream-based programming model

Express algorithms in terms of graphics operations: use GPU pixel shaders as general-purpose SP floating point units

Directly exploit

pixel shaders

vertex shaders

video memory

Threads interact through off-chip video memory

Example: GPUSort (Govindaraju, Manocha; 2005)

Figure Credits: Dongho Kim, School of Media, Soongsil University


    Fragment from GPUSort

// invert the other half of the bitonic array and merge
glBegin(GL_QUADS);
for (int start = 0; start < ...


    CUDA

CUDA = Compute Unified Device Architecture

Software platform for parallel computing on Nvidia GPUs

introduced in 2006

Nvidia's repositioning of GPUs as versatile compute devices

C plus a few simple extensions

write a program for one thread

instantiate it for many parallel threads

familiar language; simple data-parallel extensions

CUDA is a scalable parallel programming model

runs on any number of processors without recompiling

Slide credit: Patrick LeGresley, NVidia


    Tesla GPU Architecture Abstraction

NVidia GeForce 8 architecture: 128 CUDA cores (AKA programmable pixel shaders)

8 thread processor clusters (TPC)

2 streaming multiprocessors (SM) per TPC

8 streaming processors (SP) per SM

Figure Credit: http://www.nvidia.com/docs/IO/55972/220401_Reprint.pdf


    GPU Comparison Summary

Figure Credit: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf


    Why CUDA?

Business rationale:

opportunity for Nvidia to sell more chips

extend the demand from graphics into HPC

insurance against an uncertain future for discrete GPUs

both Intel and AMD aim to integrate GPUs on future microprocessors

Technical rationale:

hides GPU architecture behind the programming API

programmers never write directly to the metal

insulates programmers from details of GPU hardware

enables Nvidia to change the GPU architecture completely, transparently

preserves investment in CUDA programs

simplifies the programming of multithreaded hardware

CUDA automatically manages threads



    CUDA Design Goals

    Support heterogeneous parallel programming (CPU + GPU)

    Scale to hundreds of cores, thousands of parallel threads

Enable programmer to focus on parallel algorithms, not GPU characteristics, programming language, work scheduling, ...



    CUDA Software Stack for Heterogeneous Computing

Figure Credit: NVIDIA CUDA Compute Unified Device Architecture Programming Guide 1.1


    Key CUDA Abstractions

    Hierarchy of concurrent threads

    Lightweight synchronization primitives

    Shared memory model for cooperating threads



    Hierarchy of Concurrent Threads

Parallel kernels composed of many threads

all threads execute the same sequential program

use parallel threads rather than sequential loops

Threads are grouped into thread blocks

threads in a block can sync and share memory

Blocks are grouped into grids

threads and blocks have unique IDs

threadIdx: 1D, 2D, or 3D; blockIdx: 1D or 2D

simplifies addressing when processing multidimensional data (see the sketch below)

Slide credit: Patrick LeGresley, NVidia
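Multidimensional IDs simplify that addressing. As a minimal illustrative sketch (the kernel name, layout, and launch parameters are assumptions, not taken from the slides), a kernel can use 2D block and thread indices to address a row-major matrix:

// Hypothetical kernel: each thread scales one element of an M x N
// row-major matrix; blockIdx/threadIdx give the thread's 2D position.
__global__ void scale2d(float *a, int M, int N, float s)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N)
        a[row * N + col] *= s;
}

// host side: launch a 2D grid of 2D blocks (16x16 threads per block);
// d_a is a device pointer to the matrix
dim3 block(16, 16);
dim3 grid((N + 15) / 16, (M + 15) / 16);
scale2d<<<grid, block>>>(d_a, M, N, 2.0f);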


    CUDA Programming Example


Computing y = ax + y with a serial loop (host code):

void saxpy_serial(int n, float alpha, float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}

// invoke serial saxpy kernel
saxpy_serial(n, 2.0, x, y);

Computing y = ax + y in parallel using CUDA (device code):

__global__
void saxpy_parallel(int n, float alpha, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = alpha * x[i] + y[i];
}

Host code:

// invoke parallel saxpy kernel (256 threads per block)
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);


    Synchronization and Coordination

Threads within a block may synchronize with barriers:

... step 1 ...
__syncthreads();
... step 2 ...

Blocks can coordinate via atomic memory operations

e.g. increment a shared queue pointer with atomicInc() (a minimal sketch follows below)

Implicit barrier between kernels launched by the host:

vec_minus(a, b, c)

vec_dot(c, c)
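A minimal sketch of inter-block coordination through an atomic counter; the variable and kernel names here are illustrative, while atomicInc itself is a standard CUDA intrinsic:

__device__ unsigned int queue_head = 0;   // shared work-queue pointer in global memory

__global__ void consume(int *work, int nitems)
{
    // each thread atomically claims the next item index;
    // atomicInc returns the old value and wraps to 0 only past the second argument
    unsigned int my_item = atomicInc(&queue_head, 0xffffffffu);
    if (my_item < (unsigned int)nitems) {
        // ... process work[my_item] ...
    }
}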



    CPU vs. GPGPU vs. CUDA

    Comparing the abstract models


    Figure Credit: http://www.nvidia.com/docs/IO/55972/220401_Reprint.pdf


    CUDA Memory Model

Figure credits: Patrick LeGresley, NVidia


    Memory Model (Continued)

Figure credit: Patrick LeGresley, NVidia


    Memory Access Latencies

Register: dedicated HW, single cycle

Shared Memory: dedicated HW, single cycle

Local Memory: DRAM, no cache, *slow*

Global Memory: DRAM, no cache, *slow*

Constant Memory: DRAM, cached, 1s-10s-100s of cycles, depending on cache locality

Texture Memory: DRAM, cached, 1s-10s-100s of cycles, depending on cache locality

Instruction Memory (invisible): DRAM, cached



    Minimal Extensions to C

Declaration specifiers to indicate where things live

functions:

__global__ void KernelFunc(...);   // kernel callable from host

must return void

__device__ float DeviceFunc(...);  // function callable on device

no recursion

no static variables within function

__host__ float HostFunc();         // only callable on host

variables (next slide)

Extend function invocation syntax for parallel kernel launch:

KernelFunc<<<500, 128>>>(...);     // 500 blocks, 128 threads each

Built-in variables for thread identification in kernels:

dim3 threadIdx; dim3 blockIdx; dim3 blockDim;



    Invoking a Kernel Function

Call a kernel function with an execution configuration

Any call to a kernel function is asynchronous; explicit synchronization is needed to block

cudaThreadSynchronize() forces the runtime to wait until all preceding device tasks have finished

Within a kernel, declare dynamically sized shared memory as
extern __shared__ int shared[];
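For example, a sketch (kernel name and sizes are illustrative) of a launch that supplies the execution configuration, a dynamic shared-memory size, and an explicit wait:

// launch 'reduce' on nblocks blocks of 256 threads, giving each block
// 256 * sizeof(int) bytes of dynamically allocated shared memory
reduce<<<nblocks, 256, 256 * sizeof(int)>>>(d_in, d_out, n);

// kernel launches return immediately; block the host until the GPU finishes
cudaThreadSynchronize();

Inside the kernel, that dynamic allocation is visible through the extern __shared__ array declared above.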



    CUDA Variable Declarations

__device__ is optional when used with __local__, __shared__, or __constant__

Automatic variables without any qualifier reside in a register

except arrays, which reside in local memory

Pointers can only point to memory allocated or declared in global memory:

allocated on the host and passed to the kernel: __global__ void KernelFunc(float *ptr)

address obtained for a global variable: float *ptr = &GlobalVar;
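A few illustrative declarations (all names are hypothetical) showing where each kind of variable lives:

__constant__ float coeff[16];        // constant memory, visible to all threads
__device__   float global_result;    // global memory, lifetime of the application

__global__ void kern(float *ptr)     // ptr must point to global memory
{
    __shared__ float tile[128];      // per-block shared memory (block size <= 128 assumed)
    int idx = threadIdx.x;           // automatic scalar: lives in a register
    float vals[4];                   // automatic array: lives in local memory

    vals[0] = coeff[0];
    tile[idx] = ptr[idx] + vals[0];
    __syncthreads();
    if (idx == 0) global_result = tile[0];
}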


    Using Per Block Shared Memory

Variables shared across a block:

__shared__ int *begin, *end;

Scratchpad memory:

__shared__ int scratch[blocksize];

scratch[threadIdx.x] = begin[threadIdx.x];

// compute on scratch values

begin[threadIdx.x] = scratch[threadIdx.x];

Communicating values between threads:

scratch[threadIdx.x] = begin[threadIdx.x];

__syncthreads();

int left = scratch[threadIdx.x - 1];

(A complete kernel built from these fragments is sketched below.)
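A minimal sketch (hypothetical kernel, with the block size fixed at compile time and blockDim.x assumed equal to BLOCKSIZE) that assembles these fragments into a complete adjacent-difference kernel:

#define BLOCKSIZE 256

__global__ void adj_diff(int *out, const int *in, int n)
{
    __shared__ int scratch[BLOCKSIZE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        scratch[threadIdx.x] = in[i];    // stage this block's values in shared memory
    __syncthreads();                     // all loads complete before any thread reads

    // each thread (except the first in the block) reads its left neighbor's value
    if (i < n && threadIdx.x > 0)
        out[i] = scratch[threadIdx.x] - scratch[threadIdx.x - 1];
}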



    Features Available in GPU Code

Special variables for thread identification in kernels:

dim3 threadIdx; dim3 blockIdx; dim3 blockDim;

Intrinsics that expose specific operations in kernel code:

__syncthreads();   // barrier synchronization

Standard math library operations:

exponentiation, truncation and rounding, trigonometric functions, min/max/abs, log, quotient/remainder, etc.

Atomic memory operations:

atomicAdd, atomicMin, atomicAnd, atomicCAS, etc.



    Runtime Support

Memory management for pointers to GPU memory:

cudaMalloc(), cudaFree()

Copying between host and device, and device to device:

cudaMemcpy(), cudaMemcpy2D()
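A minimal host-side sketch using these calls (h_x, d_x, and n are illustrative names):

float *d_x;                                             // pointer to GPU memory
size_t bytes = n * sizeof(float);

cudaMalloc((void **)&d_x, bytes);                       // allocate on the device
cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);    // host -> device
// ... launch kernels that read/write d_x ...
cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);    // device -> host
cudaFree(d_x);                                          // release device memory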



    More Complete Example: Vector Addition


    ...

    kernel code
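The kernel code itself is not reproduced in this transcription; a minimal sketch of a typical CUDA vector-addition kernel (not necessarily the slide's exact code):

__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
    if (i < n)
        c[i] = a[i] + b[i];
}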


    Vector Addition Host Code
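The host code is likewise not reproduced here; a minimal sketch of typical CUDA host code driving the vecAdd kernel above (sizes and names are illustrative):

int n = 1 << 20;
size_t bytes = n * sizeof(float);
float *h_a, *h_b, *h_c, *d_a, *d_b, *d_c;

h_a = (float *)malloc(bytes);  h_b = (float *)malloc(bytes);  h_c = (float *)malloc(bytes);
// ... initialize h_a and h_b ...

cudaMalloc((void **)&d_a, bytes);
cudaMalloc((void **)&d_b, bytes);
cudaMalloc((void **)&d_c, bytes);

cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

int nblocks = (n + 255) / 256;
vecAdd<<<nblocks, 256>>>(d_a, d_b, d_c, n);          // launch the kernel

cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost); // fetch the result

cudaFree(d_a);  cudaFree(d_b);  cudaFree(d_c);
free(h_a);  free(h_b);  free(h_c);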



    Extended C Summary



    Compiling CUDA
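The compilation details are not reproduced here. As a rough sketch, CUDA source files (.cu) are compiled with nvcc, which splits host and device code between the host C compiler and the CUDA toolchain; typical invocations (file names are illustrative) look like:

nvcc -o vecadd vecadd.cu                       # compile host + device code into one executable
nvcc --ptxas-options=-v -o vecadd vecadd.cu    # also report per-kernel register / shared memory usage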



    Ideal CUDA programs

High intrinsic parallelism

e.g. per-element operations

Minimal communication (if any) between threads

limited synchronization

High ratio of arithmetic to memory operations

Few control flow statements

SIMD execution: divergent paths among threads in a block may be serialized (costly)

the compiler may replace conditional instructions by predicated operations to reduce divergence (an illustrative example follows below)
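As an illustration (kernel names and data are assumptions, and the launch is assumed to cover the array exactly), a branch on the thread index sends threads of the same warp down different paths, which the hardware serializes, while a branch on the block index keeps each warp on a single path:

__global__ void divergent(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)      // threads in the same warp take different paths:
        x[i] = x[i] * 2.0f;        //   both paths execute, with half the threads
    else                           //   masked off each time (costly)
        x[i] = x[i] + 1.0f;
}

__global__ void uniform(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (blockIdx.x % 2 == 0)       // all threads in a block take the same path:
        x[i] = x[i] * 2.0f;        //   no intra-warp divergence
    else
        x[i] = x[i] + 1.0f;
}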



    CUDA Matrix Multiply: Host Code
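The host code on this slide is not reproduced; a minimal sketch of host code for a square matrix multiply C = A * B, launching the matMul kernel sketched under the next heading (names are illustrative, h_A/h_B/h_C are host arrays, and width is assumed to be a multiple of the 16x16 block size):

int width = 1024;                                   // square matrices, width x width
size_t bytes = width * width * sizeof(float);
float *d_A, *d_B, *d_C;

cudaMalloc((void **)&d_A, bytes);
cudaMalloc((void **)&d_B, bytes);
cudaMalloc((void **)&d_C, bytes);

cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

dim3 block(16, 16);                                 // one thread per element of C
dim3 grid(width / 16, width / 16);
matMul<<<grid, block>>>(d_A, d_B, d_C, width);

cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
cudaFree(d_A);  cudaFree(d_B);  cudaFree(d_C);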



    CUDA Matrix Multiply: Device Code
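The device code is likewise not reproduced; a naive sketch in which each thread computes one element of C (not the shared-memory-tiled version that the optimization discussion argues for), assuming width is a multiple of the block dimensions as in the host sketch above:

__global__ void matMul(const float *A, const float *B, float *C, int width)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    float sum = 0.0f;
    for (int k = 0; k < width; k++)       // dot product of a row of A and a column of B
        sum += A[row * width + k] * B[k * width + col];
    C[row * width + col] = sum;
}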



    Optimization Considerations

Kernel optimizations:

make use of shared memory

minimize use of divergent control flow

SIMD execution must follow all paths taken within a thread group

use intrinsic instructions when possible

exploit the hardware support behind them

CPU/GPU interaction:

maximize PCIe throughput

use asynchronous memory copies

Key resource considerations for Tesla GPUs:

max 512 threads per block

up to 8 blocks per SM

8K registers per SM (16K for Tesla2)

16 KB shared mem per SM

16 KB local mem per thread

64 KB of constant mem

Use the compiler option maxregisters= to limit the number of registers used per thread


    GPU Application Domains



    CUDA Alternative: OpenCL

Emerging framework for writing programs that execute on heterogeneous platforms, including CPUs, GPUs, etc.

supports both task and data parallelism

based on a subset of ISO C99 with extensions for parallelism

numerics based on the IEEE 754 floating point standard

interoperates efficiently with graphics APIs, e.g. OpenGL

OpenCL is managed by the non-profit Khronos Group

Initial specification approved for public release Dec. 8, 2008

specification 1.0.33 released Feb 4, 2009



    OpenCL Kernel Example: 1D FFT



    OpenCL Host Program: 1D FFT



    Device Programming Abstractions

CPU:

single-threaded, serial instruction stream

superscalar: manage pipelines for multiple functional units

SIMD short vector operations: 3-4 operations per cycle

data in cache or memory

GPGPU:

use GPU pixel shaders as general purpose processors

operate on data in video memory

threads interact with each other through off-chip memory

CUDA:

automatically manages threads

divides the data set into smaller chunks stored in on-chip memory

reduces need to access off-chip memory, improving performance

multiple threads can share each chunk



    References

Patrick LeGresley. High Performance Computing with CUDA, Stanford University Colloquium, October 2008, http://www.stanford.edu/dept/ICME/docs/seminars/LeGresley-2008-10-27.pdf

Vivek Sarkar. Introduction to General-Purpose Computation on GPUs (GPGPUs), COMP 635, September 2007.

Rob Farber. CUDA, Supercomputing for the Masses, Parts 1-11, Dr. Dobb's Portal, http://www.ddj.com/architect/207200659, April 2008 - March 2009.

Tom Halfhill. Parallel Processing with CUDA, Microprocessor Report, January 2008.

N. Govindaraju et al. A cache-efficient sorting algorithm for database and data mining computations using graphics processors. http://gamma.cs.unc.edu/SORT/gpusort.pdf

http://defectivecompass.wordpress.com/2006/06/25/learning-from-gpusort

    http://en.wikipedia.org/wiki/OpenCL

    http://www.khronos.org/opencl

    http://www.khronos.org/files/opencl-quick-reference-card.pdf