comp422-2011-Lecture21-CUDA



Programming GPUs with CUDA

COMP 422, Lecture 21, 12 April 2011

John Mellor-Crummey

Department of Computer Science

Rice University

[email protected]


    Why GPUs?

Two major trends:

GPU performance is pulling away from traditional processors

~10x memory bandwidth & floating point ops

availability of general (non-graphics) programming interfaces

GPU in every PC and workstation: massive volume, potentially broad impact

Figure Credit: NVIDIA CUDA Compute Unified Device Architecture Programming Guide 2.0


    NVidia Tesla GPU

Figure Credit: http://images.nvidia.com/products/tesla_c870/Tesla_C870_F_med.png

A similar Tesla S870 server is in badlands.rcsg.rice.edu (installed March 2008)

                              Tesla (G80)     Tesla2 (GT200)
    CUDA Cores                128             240
    Processor Clock           1.69 GHz        1.47 GHz
    Floating Point Precision  IEEE 754 SP     IEEE 754 DP
    Dedicated Memory          512 MB          1 GB GDDR3
    Memory Clock              1.1 GHz         1.2 GHz
    Memory Interface Width    256-bit         512-bit
    Memory Bandwidth          70.4 GB/s       159 GB/s


    GPGPU?

General Purpose computation using GPUs: applications beyond 3D graphics

typically, data-intensive science and engineering applications

Data-intensive algorithms leverage GPU attributes

large data arrays, streaming throughput

fine-grain SIMD parallelism

low-latency floating point computation



    GPGPU Programming of Yesteryear

    Stream-based programming model

Express algorithms in terms of graphics operations: use GPU pixel shaders as general-purpose SP floating point units

Directly exploit

pixel shaders

vertex shaders

video memory

Threads interact through off-chip video memory

Example: GPUSort (Govindaraju, Manocha; 2005)

Figure Credits: Dongho Kim, School of Media, Soongsil University


    Fragment from GPUSort

// invert the other half of the bitonic array and merge
glBegin(GL_QUADS);
for (int start = 0; start < ...


    CUDA

CUDA = Compute Unified Device Architecture

Software platform for parallel computing on Nvidia GPUs

introduced in 2006

Nvidia's repositioning of GPUs as versatile compute devices

C plus a few simple extensions

write a program for one thread

instantiate it for many parallel threads

familiar language; simple data-parallel extensions

CUDA is a scalable parallel programming model

runs on any number of processors without recompiling

Slide credit: Patrick LeGresley, NVidia


    Tesla GPU Architecture Abstraction

NVidia GeForce 8 architecture: 128 CUDA cores (AKA programmable pixel shaders)

8 thread processor clusters (TPC)

2 streaming multiprocessors (SM) per TPC

8 streaming processors (SP) per SM

Figure Credit: http://www.nvidia.com/docs/IO/55972/220401_Reprint.pdf


    GPU Comparison Summary

Figure Credit: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf


    Why CUDA?

Business rationale:

opportunity for Nvidia to sell more chips

extend the demand from graphics into HPC

insurance against an uncertain future for discrete GPUs

both Intel and AMD aim to integrate GPUs on future microprocessors

Technical rationale:

hides GPU architecture behind the programming API

programmers never write directly to the metal

insulates programmers from details of GPU hardware

enables Nvidia to change the GPU architecture completely, transparently

preserves investment in CUDA programs

simplifies the programming of multithreaded hardware

CUDA automatically manages threads



    CUDA Design Goals

    Support heterogeneous parallel programming (CPU + GPU)

    Scale to hundreds of cores, thousands of parallel threads

Enable programmer to focus on parallel algorithms, not GPU characteristics, programming language, work scheduling, ...



    CUDA Software Stack for Heterogeneous Computing

Figure Credit: NVIDIA CUDA Compute Unified Device Architecture Programming Guide 1.1


    Key CUDA Abstractions

    Hierarchy of concurrent threads

    Lightweight synchronization primitives

    Shared memory model for cooperating threads



    Hierarchy of Concurrent Threads

Parallel kernels composed of many threads

all threads execute the same sequential program

use parallel threads rather than sequential loops

Threads are grouped into thread blocks

threads in a block can sync and share memory

Blocks are grouped into grids

threads and blocks have unique IDs

threadIdx: 1D, 2D, or 3D; blockIdx: 1D or 2D

simplifies addressing when processing multidimensional data (see the sketch below)

Slide credit: Patrick LeGresley, NVidia
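Multidimensional IDs simplify that addressing. As a minimal illustrative sketch (the kernel name, layout, and launch parameters are assumptions, not taken from the slides), a kernel can use 2D block and thread indices to address a row-major matrix:

// Hypothetical kernel: each thread scales one element of an M x N
// row-major matrix; blockIdx/threadIdx give the thread's 2D position.
__global__ void scale2d(float *a, int M, int N, float s)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N)
        a[row * N + col] *= s;
}

// host side: launch a 2D grid of 2D blocks (16x16 threads per block);
// d_a is a device pointer to the matrix
dim3 block(16, 16);
dim3 grid((N + 15) / 16, (M + 15) / 16);
scale2d<<<grid, block>>>(d_a, M, N, 2.0f);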


    CUDA Programming Example


Computing y = ax + y with a serial loop (host code):

void saxpy_serial(int n, float alpha, float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}

// invoke serial saxpy kernel
saxpy_serial(n, 2.0, x, y);

Computing y = ax + y in parallel using CUDA (device code):

__global__
void saxpy_parallel(int n, float alpha, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = alpha * x[i] + y[i];
}

Host code:

// invoke parallel saxpy kernel (256 threads per block)
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);


    Synchronization and Coordination

Threads within a block may synchronize with barriers:

... step 1 ...
__syncthreads();
... step 2 ...

Blocks can coordinate via atomic memory operations

e.g. increment a shared queue pointer with atomicInc() (a minimal sketch follows below)

Implicit barrier between kernels launched by the host:

vec_minus(a, b, c)

vec_dot(c, c)
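A minimal sketch of inter-block coordination through an atomic counter; the variable and kernel names here are illustrative, while atomicInc itself is a standard CUDA intrinsic:

__device__ unsigned int queue_head = 0;   // shared work-queue pointer in global memory

__global__ void consume(int *work, int nitems)
{
    // each thread atomically claims the next item index;
    // atomicInc returns the old value and wraps to 0 only past the second argument
    unsigned int my_item = atomicInc(&queue_head, 0xffffffffu);
    if (my_item < (unsigned int)nitems) {
        // ... process work[my_item] ...
    }
}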



    CPU vs. GPGPU vs. CUDA

    Comparing the abstract models


    Figure Credit: http://www.nvidia.com/docs/IO/55972/220401_Reprint.pdf


    CUDA Memory Model

Figure credits: Patrick LeGresley, NVidia


    Memory Model (Continued)

Figure credit: Patrick LeGresley, NVidia


    Memory Access Latencies

Register: dedicated HW, single cycle

Shared Memory: dedicated HW, single cycle

Local Memory: DRAM, no cache, *slow*

Global Memory: DRAM, no cache, *slow*

Constant Memory: DRAM, cached, 1s-10s-100s of cycles, depending on cache locality

Texture Memory: DRAM, cached, 1s-10s-100s of cycles, depending on cache locality

Instruction Memory (invisible): DRAM, cached



    Minimal Extensions to C

Declaration specifiers to indicate where things live

functions:

__global__ void KernelFunc(...);   // kernel callable from host

must return void

__device__ float DeviceFunc(...);  // function callable on device

no recursion

no static variables within function

__host__ float HostFunc();         // only callable on host

variables (next slide)

Extend function invocation syntax for parallel kernel launch:

KernelFunc<<<500, 128>>>(...);     // 500 blocks, 128 threads each

Built-in variables for thread identification in kernels:

dim3 threadIdx; dim3 blockIdx; dim3 blockDim;



    Invoking a Kernel Function

Call a kernel function with an execution configuration

Any call to a kernel function is asynchronous; explicit synchronization is needed to block

cudaThreadSynchronize() forces the runtime to wait until all preceding device tasks have finished

Within a kernel, declare dynamically sized shared memory as
extern __shared__ int shared[];
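For example, a sketch (kernel name and sizes are illustrative) of a launch that supplies the execution configuration, a dynamic shared-memory size, and an explicit wait:

// launch 'reduce' on nblocks blocks of 256 threads, giving each block
// 256 * sizeof(int) bytes of dynamically allocated shared memory
reduce<<<nblocks, 256, 256 * sizeof(int)>>>(d_in, d_out, n);

// kernel launches return immediately; block the host until the GPU finishes
cudaThreadSynchronize();

Inside the kernel, that dynamic allocation is visible through the extern __shared__ array declared above.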



    CUDA Variable Declarations

__device__ is optional when used with __local__, __shared__, or __constant__

Automatic variables without any qualifier reside in a register

except arrays, which reside in local memory

Pointers can only point to memory allocated or declared in global memory:

allocated on the host and passed to the kernel: __global__ void KernelFunc(float *ptr)

address obtained for a global variable: float *ptr = &GlobalVar;
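A few illustrative declarations (all names are hypothetical) showing where each kind of variable lives:

__constant__ float coeff[16];        // constant memory, visible to all threads
__device__   float global_result;    // global memory, lifetime of the application

__global__ void kern(float *ptr)     // ptr must point to global memory
{
    __shared__ float tile[128];      // per-block shared memory (block size <= 128 assumed)
    int idx = threadIdx.x;           // automatic scalar: lives in a register
    float vals[4];                   // automatic array: lives in local memory

    vals[0] = coeff[0];
    tile[idx] = ptr[idx] + vals[0];
    __syncthreads();
    if (idx == 0) global_result = tile[0];
}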


    Using Per Block Shared Memory

Variables shared across a block:

__shared__ int *begin, *end;

Scratchpad memory:

__shared__ int scratch[blocksize];

scratch[threadIdx.x] = begin[threadIdx.x];

// compute on scratch values

begin[threadIdx.x] = scratch[threadIdx.x];

Communicating values between threads:

scratch[threadIdx.x] = begin[threadIdx.x];

__syncthreads();

int left = scratch[threadIdx.x - 1];

(A complete kernel built from these fragments is sketched below.)
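A minimal sketch (hypothetical kernel, with the block size fixed at compile time and blockDim.x assumed equal to BLOCKSIZE) that assembles these fragments into a complete adjacent-difference kernel:

#define BLOCKSIZE 256

__global__ void adj_diff(int *out, const int *in, int n)
{
    __shared__ int scratch[BLOCKSIZE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        scratch[threadIdx.x] = in[i];    // stage this block's values in shared memory
    __syncthreads();                     // all loads complete before any thread reads

    // each thread (except the first in the block) reads its left neighbor's value
    if (i < n && threadIdx.x > 0)
        out[i] = scratch[threadIdx.x] - scratch[threadIdx.x - 1];
}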



    Features Available in GPU Code

Special variables for thread identification in kernels:

dim3 threadIdx; dim3 blockIdx; dim3 blockDim;

Intrinsics that expose specific operations in kernel code:

__syncthreads();   // barrier synchronization

Standard math library operations:

exponentiation, truncation and rounding, trigonometric functions, min/max/abs, log, quotient/remainder, etc.

Atomic memory operations:

atomicAdd, atomicMin, atomicAnd, atomicCAS, etc.



    Runtime Support

Memory management for pointers to GPU memory:

cudaMalloc(), cudaFree()

Copying between host and device, and device to device:

cudaMemcpy(), cudaMemcpy2D()
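A minimal host-side sketch using these calls (h_x, d_x, and n are illustrative names):

float *d_x;                                             // pointer to GPU memory
size_t bytes = n * sizeof(float);

cudaMalloc((void **)&d_x, bytes);                       // allocate on the device
cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);    // host -> device
// ... launch kernels that read/write d_x ...
cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);    // device -> host
cudaFree(d_x);                                          // release device memory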



    More Complete Example: Vector Addition


    ...

    kernel code
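The kernel code itself is not reproduced in this transcription; a minimal sketch of a typical CUDA vector-addition kernel (not necessarily the slide's exact code):

__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
    if (i < n)
        c[i] = a[i] + b[i];
}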


    Vector Addition Host Code
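The host code is likewise not reproduced here; a minimal sketch of typical CUDA host code driving the vecAdd kernel above (sizes and names are illustrative):

int n = 1 << 20;
size_t bytes = n * sizeof(float);
float *h_a, *h_b, *h_c, *d_a, *d_b, *d_c;

h_a = (float *)malloc(bytes);  h_b = (float *)malloc(bytes);  h_c = (float *)malloc(bytes);
// ... initialize h_a and h_b ...

cudaMalloc((void **)&d_a, bytes);
cudaMalloc((void **)&d_b, bytes);
cudaMalloc((void **)&d_c, bytes);

cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

int nblocks = (n + 255) / 256;
vecAdd<<<nblocks, 256>>>(d_a, d_b, d_c, n);          // launch the kernel

cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost); // fetch the result

cudaFree(d_a);  cudaFree(d_b);  cudaFree(d_c);
free(h_a);  free(h_b);  free(h_c);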



    Extended C Summary



    Compiling CUDA
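The compilation details are not reproduced here. As a rough sketch, CUDA source files (.cu) are compiled with nvcc, which splits host and device code between the host C compiler and the CUDA toolchain; typical invocations (file names are illustrative) look like:

nvcc -o vecadd vecadd.cu                       # compile host + device code into one executable
nvcc --ptxas-options=-v -o vecadd vecadd.cu    # also report per-kernel register / shared memory usage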



    Ideal CUDA programs

High intrinsic parallelism

e.g. per-element operations

Minimal communication (if any) between threads

limited synchronization

High ratio of arithmetic to memory operations

Few control flow statements

SIMD execution: divergent paths among threads in a block may be serialized (costly)

the compiler may replace conditional instructions by predicated operations to reduce divergence (an illustrative example follows below)
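As an illustration (kernel names and data are assumptions, and the launch is assumed to cover the array exactly), a branch on the thread index sends threads of the same warp down different paths, which the hardware serializes, while a branch on the block index keeps each warp on a single path:

__global__ void divergent(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)      // threads in the same warp take different paths:
        x[i] = x[i] * 2.0f;        //   both paths execute, with half the threads
    else                           //   masked off each time (costly)
        x[i] = x[i] + 1.0f;
}

__global__ void uniform(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (blockIdx.x % 2 == 0)       // all threads in a block take the same path:
        x[i] = x[i] * 2.0f;        //   no intra-warp divergence
    else
        x[i] = x[i] + 1.0f;
}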



    CUDA Matrix Multiply: Host Code
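The host code on this slide is not reproduced; a minimal sketch of host code for a square matrix multiply C = A * B, launching the matMul kernel sketched under the next heading (names are illustrative, h_A/h_B/h_C are host arrays, and width is assumed to be a multiple of the 16x16 block size):

int width = 1024;                                   // square matrices, width x width
size_t bytes = width * width * sizeof(float);
float *d_A, *d_B, *d_C;

cudaMalloc((void **)&d_A, bytes);
cudaMalloc((void **)&d_B, bytes);
cudaMalloc((void **)&d_C, bytes);

cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

dim3 block(16, 16);                                 // one thread per element of C
dim3 grid(width / 16, width / 16);
matMul<<<grid, block>>>(d_A, d_B, d_C, width);

cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
cudaFree(d_A);  cudaFree(d_B);  cudaFree(d_C);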



    CUDA Matrix Multiply: Device Code
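The device code is likewise not reproduced; a naive sketch in which each thread computes one element of C (not the shared-memory-tiled version that the optimization discussion argues for), assuming width is a multiple of the block dimensions as in the host sketch above:

__global__ void matMul(const float *A, const float *B, float *C, int width)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    float sum = 0.0f;
    for (int k = 0; k < width; k++)       // dot product of a row of A and a column of B
        sum += A[row * width + k] * B[k * width + col];
    C[row * width + col] = sum;
}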



    Optimization Considerations

Kernel optimizations:

make use of shared memory

minimize use of divergent control flow

SIMD execution must follow all paths taken within a thread group

use intrinsic instructions when possible

exploit the hardware support behind them

CPU/GPU interaction:

maximize PCIe throughput

use asynchronous memory copies

Key resource considerations for Tesla GPUs:

max 512 threads per block

up to 8 blocks per SM

8K registers per SM (16K for Tesla2)

16 KB shared mem per SM

16 KB local mem per thread

64 KB of constant mem

Use the compiler option maxregisters= to limit the number of registers used per thread


    GPU Application Domains



    CUDA Alternative: OpenCL

Emerging framework for writing programs that execute on heterogeneous platforms, including CPUs, GPUs, etc.

supports both task and data parallelism

based on a subset of ISO C99 with extensions for parallelism

numerics based on the IEEE 754 floating point standard

interoperates efficiently with graphics APIs, e.g. OpenGL

OpenCL is managed by the non-profit Khronos Group

Initial specification approved for public release Dec. 8, 2008

specification 1.0.33 released Feb 4, 2009



    OpenCL Kernel Example: 1D FFT



    OpenCL Host Program: 1D FFT



    Device Programming Abstractions

CPU:

single-threaded, serial instruction stream

superscalar: manage pipelines for multiple functional units

SIMD short vector operations: 3-4 operations per cycle

data in cache or memory

GPGPU:

use GPU pixel shaders as general purpose processors

operate on data in video memory

threads interact with each other through off-chip memory

CUDA:

automatically manages threads

divides the data set into smaller chunks stored in on-chip memory

reduces need to access off-chip memory, improving performance

multiple threads can share each chunk



    References

Patrick LeGresley. High Performance Computing with CUDA, Stanford University Colloquium, October 2008, http://www.stanford.edu/dept/ICME/docs/seminars/LeGresley-2008-10-27.pdf

Vivek Sarkar. Introduction to General-Purpose Computation on GPUs (GPGPUs), COMP 635, September 2007.

Rob Farber. CUDA, Supercomputing for the Masses, Parts 1-11, Dr. Dobb's Portal, http://www.ddj.com/architect/207200659, April 2008 - March 2009.

Tom Halfhill. Parallel Processing with CUDA, Microprocessor Report, January 2008.

N. Govindaraju et al. A cache-efficient sorting algorithm for database and data mining computations using graphics processors. http://gamma.cs.unc.edu/SORT/gpusort.pdf

http://defectivecompass.wordpress.com/2006/06/25/learning-from-gpusort

    http://en.wikipedia.org/wiki/OpenCL

    http://www.khronos.org/opencl

    http://www.khronos.org/files/opencl-quick-reference-card.pdf