Programming GPUs with CUDA

COMP 422, Lecture 21, 12 April 2011

John Mellor-Crummey
Department of Computer Science
Rice University
Why GPUs?
Two major trends:
- GPU performance is pulling away from traditional processors: ~10x memory bandwidth & floating point ops
- availability of general (non-graphics) programming interfaces

GPU in every PC and workstation: massive volume, potentially broad impact

Figure Credit: NVIDIA CUDA Compute Unified Device Architecture Programming Guide 2.0
NVidia Tesla GPU
Figure Credit: http://images.nvidia.com/products/tesla_c870/Tesla_C870_F_med.png

A similar Tesla S870 server is in badlands.rcsg.rice.edu (installed March 2008)

                           Tesla (G80)    Tesla2 (GT200)
CUDA Cores                 128            240
Processor Clock            1.69 GHz       1.47 GHz
Floating Point Precision   IEEE 754 SP    IEEE 754 DP
Dedicated Memory           512 MB GDDR3   1 GB GDDR3
Memory Clock               1.1 GHz        1.2 GHz
Memory Interface Width     256-bit        512-bit
Memory Bandwidth           70.4 GB/s      159 GB/s
GPGPU?
General-purpose computation using GPUs: applications beyond 3D graphics
- typically, data-intensive science and engineering applications

Data-intensive algorithms leverage GPU attributes:
- large data arrays, streaming throughput
- fine-grain SIMD parallelism
- low-latency floating point computation
GPGPU Programming of Yesteryear
Stream-based programming model

Express algorithms in terms of graphics operations:
- use GPU pixel shaders as general-purpose SP floating point units

Directly exploit:
- pixel shaders
- vertex shaders
- video memory

Threads interact through off-chip video memory

Example: GPUSort (Govindaraju, Manocha; 2005)

Figure Credits: Dongho Kim, School of Media, Soongsil University
Fragment from GPUSort
// invert the other half of the bitonic array and merge
glBegin(GL_QUADS);
for (int start = 0; start < ...
CUDA
CUDA = Compute Unified Device Architecture
- software platform for parallel computing on NVidia GPUs
- introduced in 2006
- NVidia's repositioning of GPUs as versatile compute devices

C plus a few simple extensions:
- write a program for one thread
- instantiate it for many parallel threads
- familiar language; simple data-parallel extensions

CUDA is a scalable parallel programming model:
- runs on any number of processors without recompiling

Slide credit: Patrick LeGresley, NVidia
Tesla GPU Architecture Abstraction
NVidia GeForce 8 architecture: 128 CUDA cores (AKA programmable pixel shaders)
- 8 thread processor clusters (TPC)
- 2 streaming multiprocessors (SM) per TPC
- 8 streaming processors (SP) per SM

Figure Credit: http://www.nvidia.com/docs/IO/55972/220401_Reprint.pdf
GPU Comparison Summary
Figure Credit: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Why CUDA?
Business rationale:
- opportunity for NVidia to sell more chips
- extend the demand from graphics into HPC
- insurance against an uncertain future for discrete GPUs
  - both Intel and AMD aim to integrate GPUs on future microprocessors

Technical rationale:
- hides GPU architecture behind the programming API
  - programmers never write directly "to the metal"
  - insulates programmers from details of GPU hardware
  - enables NVidia to change GPU architecture completely and transparently
  - preserves investment in CUDA programs
- simplifies the programming of multithreaded hardware
  - CUDA automatically manages threads
CUDA Design Goals
Support heterogeneous parallel programming (CPU + GPU)

Scale to hundreds of cores, thousands of parallel threads

Enable the programmer to focus on parallel algorithms, not GPU characteristics, programming language details, work scheduling, ...
CUDA Software Stack for Heterogeneous Computing
Figure Credit: NVIDIA CUDA Compute Unified Device Architecture Programming Guide 1.1
Key CUDA Abstractions
Hierarchy of concurrent threads
Lightweight synchronization primitives
Shared memory model for cooperating threads
Hierarchy of Concurrent Threads
Parallel kernels are composed of many threads:
- all threads execute the same sequential program
- use parallel threads rather than sequential loops

Threads are grouped into thread blocks:
- threads in a block can sync and share memory

Blocks are grouped into grids; threads and blocks have unique IDs:
- threadIdx: 1D, 2D, or 3D
- blockIdx: 1D or 2D
- simplifies addressing when processing multidimensional data

Slide credit: Patrick LeGresley, NVidia
CUDA Programming Example
Computing y = ax + y with a serial loop (host code):

void saxpy_serial(int n, float alpha, float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}

// invoke serial saxpy kernel
saxpy_serial(n, 2.0, x, y);

Computing y = ax + y in parallel using CUDA (device code):

__global__
void saxpy_parallel(int n, float alpha, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = alpha * x[i] + y[i];
}

Host code:

// invoke parallel saxpy kernel (256 threads per block)
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
Synchronization and Coordination
Threads within a block may synchronize with barriers:

... step 1 ...
__syncthreads();
... step 2 ...

Blocks can coordinate via atomic memory operations:
- e.g., increment a shared queue pointer with atomicInc()

Implicit barrier between kernels launched by the host:

vec_minus<<<nblocks, blksize>>>(a, b, c);
vec_dot<<<nblocks, blksize>>>(c, c);
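As a minimal sketch of this coordination pattern (the kernel and the names d_queue and d_tail are illustrative assumptions, not from the original slides), each thread below claims a unique slot in a global output queue with atomicInc():

__global__ void enqueue_positive(const float *in, int n,
                                 float *d_queue, unsigned int *d_tail) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0.0f) {
        // atomicInc returns the old value (wrapping past the second
        // argument), handing each thread a unique slot in the queue
        unsigned int slot = atomicInc(d_tail, 0xFFFFFFFFu);
        d_queue[slot] = in[i];
    }
}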
CPU vs. GPGPU vs. CUDA
Comparing the abstract models
Figure Credit: http://www.nvidia.com/docs/IO/55972/220401_Reprint.pdf
CUDA Memory Model
Figure credits: Patrick LeGresley, NVidia
Memory Model (Continued)
Figure credit: Patrick LeGresley, NVidia
Memory Access Latencies
- Register: dedicated HW, single cycle
- Shared memory: dedicated HW, single cycle
- Local memory: DRAM, no cache, *slow*
- Global memory: DRAM, no cache, *slow*
- Constant memory: DRAM, cached, 1...10s...100s of cycles, depending on cache locality
- Texture memory: DRAM, cached, 1...10s...100s of cycles, depending on cache locality
- Instruction memory (invisible): DRAM, cached
Minimal Extensions to C
Declaration specifiers to indicate where things live

Functions:
__global__ void KernelFunc(...);  // kernel callable from host
                                  // must return void
__device__ float DeviceFunc(...); // function callable on device
                                  // no recursion
                                  // no static variables within function
__host__ float HostFunc();        // only callable on host

Variables: see the next slide

Extended function invocation syntax for parallel kernel launch:
KernelFunc<<<500, 128>>>(...);  // 500 blocks, 128 threads each

Built-in variables for thread identification in kernels:
dim3 threadIdx; dim3 blockIdx; dim3 blockDim;
Invoking a Kernel Function
Call a kernel function with an execution configuration, e.g.:
KernelFunc<<<DimGrid, DimBlock, SharedMemBytes>>>(...);

Any call to a kernel function is asynchronous; explicit synchronization is needed to block:
- cudaThreadSynchronize() forces the runtime to wait until all preceding device tasks have finished

Within a kernel, declare dynamically allocated shared memory as:
extern __shared__ int shared[];
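A minimal sketch tying these together (the kernel and its block-local reverse are assumptions; it presumes the array length is a multiple of the block size). The third field of the execution configuration sizes the extern shared array:

// Each block reverses its chunk in dynamically allocated shared memory.
__global__ void reverse_chunk(int *data) {
    extern __shared__ int tile[];          // size fixed at launch time
    int t = threadIdx.x;
    tile[t] = data[blockIdx.x * blockDim.x + t];
    __syncthreads();                       // wait until the tile is loaded
    data[blockIdx.x * blockDim.x + t] = tile[blockDim.x - 1 - t];
}

// Host: launch with 256 ints of shared memory per block,
// then block until the device is idle.
reverse_chunk<<<nblocks, 256, 256 * sizeof(int)>>>(d_data);
cudaThreadSynchronize();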
CUDA Variable Declarations
__device__ is optional when used with __local__, __shared__, or __constant__

Automatic variables without any qualifier reside in a register
- except arrays, which reside in local memory

Pointers can only point to memory allocated or declared in global memory:
- allocated on the host and passed to the kernel:
  __global__ void KernelFunc(float *ptr)
- address obtained for a global variable: float *ptr = &GlobalVar;
Using Per Block Shared Memory
Variables shared across a block:
__shared__ int *begin, *end;

Scratchpad memory:
__shared__ int scratch[blocksize];
scratch[threadIdx.x] = begin[threadIdx.x];
// compute on scratch values
begin[threadIdx.x] = scratch[threadIdx.x];

Communicating values between threads:
scratch[threadIdx.x] = begin[threadIdx.x];
__syncthreads();
int left = scratch[threadIdx.x - 1];
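Putting the fragments above together, a minimal complete kernel (its name, the shift operation, and the guard for thread 0 are assumptions added here):

#define BLOCKSIZE 256

// Each thread stages one element in on-chip scratch memory, then
// reads its left neighbor's value after the barrier.
__global__ void shift_right(int *begin, int n) {
    __shared__ int scratch[BLOCKSIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) scratch[threadIdx.x] = begin[i];
    __syncthreads();           // all stores complete before any reads
    // thread 0 has no left neighbor within the block; guard it explicitly
    if (i < n && threadIdx.x > 0)
        begin[i] = scratch[threadIdx.x - 1];
}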
Features Available in GPU Code
Special variables for thread identification in kernels:
dim3 threadIdx; dim3 blockIdx; dim3 blockDim;

Intrinsics that expose specific operations in kernel code:
__syncthreads();  // barrier synchronization

Standard math library operations:
- exponentiation, truncation and rounding, trigonometric functions, min/max/abs, log, quotient/remainder, etc.

Atomic memory operations:
- atomicAdd, atomicMin, atomicAnd, atomicCAS, etc.
Runtime Support
Memory management for pointers to GPU memory:
- cudaMalloc(), cudaFree()

Copying between host and device, and device to device:
- cudaMemcpy(), cudaMemcpy2D()
More Complete Example: Vector Addition
Kernel code:
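The kernel appears only as an image in the original; a representative sketch (function and variable names are assumptions):

// Each thread computes one element of the result vector.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard against a partial final block
        c[i] = a[i] + b[i];
}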
Vector Addition Host Code
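The host code also appears only as an image; a minimal sketch using the runtime calls from the Runtime Support slide (sizes and names are assumptions):

#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // allocate and initialize host arrays
    float *a = (float *)malloc(bytes), *b = (float *)malloc(bytes);
    float *c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    // allocate device memory and copy the inputs host-to-device
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

    // one thread per element, 256 threads per block
    int nblocks = (n + 255) / 256;
    vecAdd<<<nblocks, 256>>>(d_a, d_b, d_c, n);

    // copying the result back implicitly waits for the kernel
    cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(a); free(b); free(c);
    return 0;
}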
Extended C Summary
Compiling CUDA
Ideal CUDA programs
High intrinsic parallelism, e.g., per-element operations

Minimal communication (if any) between threads; limited synchronization

High ratio of arithmetic to memory operations

Few control flow statements
- SIMD execution: divergent paths among threads in a block may be serialized (costly)
- the compiler may replace conditional instructions with predicated operations to reduce divergence, as in the sketch below
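A minimal illustration (the kernel is an assumption): a short, balanced conditional like this is a natural candidate for predication rather than a branch:

__global__ void clamp_negatives(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // threads of a warp that take different sides of a branch run
        // serially; a conditional assignment this small is typically
        // compiled to a single predicated instruction instead
        x[i] = (x[i] < 0.0f) ? 0.0f : x[i];
    }
}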
CUDA Matrix Multiply: Host Code
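The listing appears only as an image in the original; a minimal sketch of a host driver for a square matrix multiply (names, the 16x16 block shape, and the assumption that width divides evenly are illustrative):

void matMulOnDevice(const float *A, const float *B, float *C, int width) {
    size_t bytes = width * width * sizeof(float);
    float *d_A, *d_B, *d_C;

    // allocate device matrices and copy the inputs over
    cudaMalloc((void **)&d_A, bytes);
    cudaMalloc((void **)&d_B, bytes);
    cudaMalloc((void **)&d_C, bytes);
    cudaMemcpy(d_A, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, bytes, cudaMemcpyHostToDevice);

    // one thread per output element, in 16x16 blocks
    dim3 dimBlock(16, 16);
    dim3 dimGrid(width / 16, width / 16);
    matMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, width);

    cudaMemcpy(C, d_C, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}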
CUDA Matrix Multiply: Device Code
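A matching sketch of the kernel in the straightforward one-thread-per-element style (a tiled version would stage sub-blocks of A and B in __shared__ memory, following the scratchpad pattern shown earlier):

// Each thread computes one element of C = A * B for square
// width x width matrices stored in row-major order.
__global__ void matMulKernel(const float *A, const float *B,
                             float *C, int width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < width && col < width) {
        float sum = 0.0f;
        for (int k = 0; k < width; k++)
            sum += A[row * width + k] * B[k * width + col];
        C[row * width + col] = sum;
    }
}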
Optimization Considerations
Kernel optimizations:
- make use of shared memory
- minimize use of divergent control flow
  - SIMD execution must follow all paths taken within a thread group
- use intrinsic instructions when possible
  - exploit the hardware support behind them

CPU/GPU interaction:
- maximize PCIe throughput
- use asynchronous memory copies (see the sketch below)

Key resource considerations for Tesla GPUs:
- max 512 threads per block
- up to 8 blocks per SM
- 8K registers per SM (16K for Tesla2)
- 16 KB shared memory per SM
- 16 KB local memory per thread
- 64 KB of constant memory

Use the compiler option -maxrregcount=N to limit the number of registers used per thread.
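The asynchronous-copy sketch referenced above (the kernel and sizes are assumptions; cudaMemcpyAsync requires page-locked host memory):

#include <cuda_runtime.h>

__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main(void) {
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_buf, *d_buf;
    cudaStream_t stream;

    // async copies require page-locked (pinned) host memory
    cudaMallocHost((void **)&h_buf, bytes);
    cudaMalloc((void **)&d_buf, bytes);
    cudaStreamCreate(&stream);
    for (int i = 0; i < n; i++) h_buf[i] = 1.0f;

    // the copies and the kernel go to the same stream: ordered with
    // respect to each other, but the host thread does not block
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
    cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);  // wait only when the result is needed
    cudaFreeHost(h_buf); cudaFree(d_buf); cudaStreamDestroy(stream);
    return 0;
}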
GPU Application Domains
CUDA Alternative: OpenCL
Emerging framework for writing programs that execute on heterogeneous platforms, including CPUs, GPUs, etc.
- supports both task and data parallelism
- based on a subset of ISO C99 with extensions for parallelism
- numerics based on the IEEE 754 floating point standard
- interoperates efficiently with graphics APIs, e.g., OpenGL

OpenCL is managed by the non-profit Khronos Group

Initial specification approved for public release Dec. 8, 2008
- specification 1.0.33 released Feb 4, 2009
OpenCL Kernel Example: 1D FFT
OpenCL Host Program: 1D FFT
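The FFT kernel and host listings appear only as images in the original. As a stand-in showing the same kernel-plus-host structure in miniature, here is a vector addition written against the OpenCL 1.0 C API (all names are illustrative; error checking is omitted):

#include <CL/cl.h>
#include <stdio.h>

// OpenCL C kernel source, compiled at runtime by the host program
static const char *src =
    "__kernel void vadd(__global const float *a,  \n"
    "                   __global const float *b,  \n"
    "                   __global float *c) {      \n"
    "    int i = get_global_id(0);                \n"
    "    c[i] = a[i] + b[i];                      \n"
    "}                                            \n";

int main(void) {
    size_t n = 1024, bytes = n * sizeof(float);
    float a[1024], b[1024], c[1024];
    for (size_t i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    // boilerplate: pick a device, build a context, queue, and program
    cl_platform_id plat;  cl_device_id dev;  cl_int err;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vadd", &err);

    // device buffers, initialized from the host arrays at creation
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               bytes, a, &err);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               bytes, b, &err);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, &err);

    // bind arguments and launch one work-item per element
    clSetKernelArg(k, 0, sizeof(cl_mem), &da);
    clSetKernelArg(k, 1, sizeof(cl_mem), &db);
    clSetKernelArg(k, 2, sizeof(cl_mem), &dc);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);

    // blocking read copies the result back to the host
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, bytes, c, 0, NULL, NULL);
    printf("c[0] = %f\n", c[0]);
    return 0;
}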
Device Programming Abstractions
CPU:
- single-threaded, serial instruction stream
- superscalar: manages pipelines for multiple functional units
- SIMD short vector operations: 3-4 operations per cycle
- data in cache or memory

GPGPU:
- use GPU pixel shaders as general-purpose processors
- operate on data in video memory
- threads interact with each other through off-chip memory

CUDA:
- automatically manages threads
- divides the data set into smaller chunks stored in on-chip memory
  - reduces the need to access off-chip memory, improving performance
- multiple threads can share each chunk
References
Patrick LeGresley. High Performance Computing with CUDA, Stanford University Colloquium, October 2008. http://www.stanford.edu/dept/ICME/docs/seminars/LeGresley-2008-10-27.pdf

Vivek Sarkar. Introduction to General-Purpose Computation on GPUs (GPGPUs), COMP 635, September 2007.

Rob Farber. CUDA, Supercomputing for the Masses, Parts 1-11, Dr. Dobb's Portal, http://www.ddj.com/architect/207200659, April 2008-March 2009.

Tom Halfhill. Parallel Processing with CUDA, Microprocessor Report, January 2008.

N. Govindaraju et al. A cache-efficient sorting algorithm for database and data mining computations using graphics processors. http://gamma.cs.unc.edu/SORT/gpusort.pdf

http://defectivecompass.wordpress.com/2006/06/25/learning-from-gpusort

http://en.wikipedia.org/wiki/OpenCL

http://www.khronos.org/opencl

http://www.khronos.org/files/opencl-quick-reference-card.pdf