Presentation CUDA



    HIGH PERFORMANCE COMPUTING

    ON GPU


    Graphics Processing Units

    A graphics processing unit, or GPU, is a specialized microprocessor that offloads
    and accelerates 3D or 2D graphics rendering.

    The highly parallel structure of modern GPUs makes them more effective than
    general-purpose CPUs for data-parallel workloads.

    NVIDIA's Tesla architecture exposes the computational horsepower of NVIDIA GPUs.

    The GPU is specialized for compute-intensive, highly parallel computation and is
    designed such that more transistors are devoted to data processing rather than
    data caching and flow control.


    Physical Memory Layout of NVIDIA GPUs

    The device has its own global memory, which all the cores (thread processors) can
    access. There are N multiprocessors with M cores each. Cores share an instruction
    unit with the other cores in their multiprocessor. Each core has its own local
    memory (residing in DRAM) and a separate register set, and all M cores of a
    multiprocessor share an on-chip memory called shared memory. The host can write
    to the global memory but not to the shared memory.
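
    As a minimal sketch (assuming the standard CUDA runtime API and a single device),
    the layout described above can be queried at run time:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);                 // properties of device 0
        printf("Multiprocessors (N): %d\n", prop.multiProcessorCount);
        printf("Shared memory per block: %zu bytes\n", (size_t)prop.sharedMemPerBlock);
        printf("Registers per block: %d\n", prop.regsPerBlock);
        printf("Global memory: %zu bytes\n", (size_t)prop.totalGlobalMem);
        return 0;
    }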


    TESLA C1060

    The NVIDIA Tesla C1060 uses the 10-series NVIDIA architecture and has 30
    multiprocessors, each with 8 cores, a double-precision unit, and on-chip
    shared memory.


    What is CUDA?

    CUDA is a scalable parallel programming model and a software environment for
    parallel computing.

    Minimal extensions to familiar C/C++ environment

    Heterogeneous serial-parallel programming model


    Kernels and Threads

    Important definition: parallel portions of an application are executed on the
    device as kernels. One kernel is executed at a time, and all of its parallel
    threads execute the same kernel.

    Some devices with high computational power can execute more than one kernel
    concurrently.


    More about threads

    A CUDA kernel is executed by an array of threads.

    All threads run the same code.

    Each thread has an ID that it uses to compute memory addresses and make control
    decisions.

    Computation of memory addresses and control decisions will be discussed later.
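
    A minimal sketch of a thread using its ID to compute a memory address and make a
    control decision (the kernel name and the scaling operation are illustrative
    assumptions, not from the slides):

    __global__ void scale(float *data, float factor, int n) {
        // Each thread derives a unique global index from its block and thread IDs.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)              // control decision based on the thread's ID
            data[i] *= factor;  // memory address computed from the thread's ID
    }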


    THREAD BATCHING

    A kernel launches a grid of thread blocks.

    Threads within a block cooperate via shared memory.

    Threads within a block can synchronize (thread cooperation).

    Threads in different blocks cannot cooperate.


    MEMORY ACCESS


    EXECUTION MODEL


    CUDA C and Compilation

    CUDA C provides a simple path for users familiar with the C programming language
    to easily write programs for execution by the device. It consists of a minimal set
    of extensions to the C language and a runtime library.

    CUDA provides the nvcc compiler, which splits the CUDA code into PTX code (used at
    run time, runs on the GPU) and standard C code (passed to the standard C compiler
    at compile time, runs on the CPU).


    Managing memory

    The GPU's memory can only be managed by the CPU, and the CPU has access only
    to the global memory.

    The following memory operations apply only to the global memory (not to the
    local or shared memory).

    Allocate/free memory:

    cudaMalloc(void **pointer, size_t nbytes)            // allocates nbytes of device memory
    cudaMemset(void *pointer, int value, size_t count)   // sets count bytes to value
    cudaFree(void *pointer)                              // frees memory allocated by cudaMalloc

    Host <-> Device data transfer:

    cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction)
    // Transfers nbytes of data from src to dst; direction specifies the source and
    // destination memory types.
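
    A minimal host-side sketch tying these calls together (the array size and names
    are illustrative assumptions):

    int n = 1024;
    size_t nbytes = n * sizeof(float);
    float *a_h = (float *)malloc(nbytes);       // host array
    float *a_d = NULL;                          // device pointer
    cudaMalloc((void **)&a_d, nbytes);          // allocate global memory on the device
    cudaMemset(a_d, 0, nbytes);                 // zero the device array
    cudaMemcpy(a_d, a_h, nbytes, cudaMemcpyHostToDevice);   // host -> device
    cudaMemcpy(a_h, a_d, nbytes, cudaMemcpyDeviceToHost);   // device -> host
    cudaFree(a_d);                              // release the device allocation
    free(a_h);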


    CUDA Function Qualifiers

    __global__
    Function called from the host and executed on the device.
    Must return void.
    E.g., kernels.

    __device__
    Function called from the device and executed on the device.
    Cannot be called from host code.

    __host__
    Function called from the host and executed on the host (the default).
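
    A minimal sketch of the three qualifiers in use (the function names and the
    squaring operation are illustrative assumptions):

    __device__ float square(float x) {              // callable only from device code
        return x * x;
    }

    __global__ void square_all(float *v, int n) {   // kernel: called from host, runs on device
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] = square(v[i]);
    }

    __host__ void fill(float *v, int n) {           // ordinary host function (__host__ is the default)
        for (int i = 0; i < n; ++i) v[i] = (float)i;
    }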


    Kernel Calls and Unique Thread Index

    Kernels are called by the modified syntax:

    kernel<<<dG, dB>>>(arg_list);

    Here dim3 is a vector type with x, y, z as its members. We can initialise dim3
    objects with the constructor:

    For a 1D grid: dim3 dG(var_x, 1, 1) or dim3 dG(var)
    For a 2D grid: dim3 dG(var_x, var_y, 1) or dim3 dG(var1, var2)

    Similarly for blocks:
    For a 1D block: dim3 dB(var_x, 1, 1) or dim3 dB(var)
    For a 2D block: dim3 dB(var_x, var_y, 1) or dim3 dB(var_x, var_y)
    For a 3D block: dim3 dB(var_x, var_y, var_z)
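
    A minimal launch sketch using these constructors (the sizes, the device pointer
    data_d, and the scale kernel from the earlier sketch are illustrative assumptions):

    int n = 1 << 20;                          // one million elements
    dim3 dB(256);                             // 1D block of 256 threads
    dim3 dG((n + dB.x - 1) / dB.x);           // enough 1D blocks to cover all n elements
    scale<<<dG, dB>>>(data_d, 2.0f, n);       // kernel<<<grid, block>>>(arguments)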


    Thread Synchronization

    Host synchronization:

    void cudaThreadSynchronize();
    Blocks until all preceding CUDA calls have completed.

    Device synchronization:

    void __syncthreads();
    Synchronizes all the threads within a block.

    There is no way to synchronize threads outside the block.

    The programmer should be careful to avoid RAW/WAW/WAR hazards.
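
    A minimal sketch of __syncthreads() guarding a shared-memory exchange (the
    reversal kernel is an illustrative assumption; it assumes blockDim.x <= 256):

    __global__ void reverse_block(float *d) {
        __shared__ float s[256];                  // one element per thread in the block
        int t = threadIdx.x;
        s[t] = d[blockIdx.x * blockDim.x + t];    // stage data in shared memory
        __syncthreads();                          // barrier: all writes to s finish before any read
        d[blockIdx.x * blockDim.x + t] = s[blockDim.x - 1 - t];   // read a value written by another thread
    }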


    Heteroprogramming and Synchronization

    // copy data from host to device

    cudaMemcpy(a_d, a_h, numBytes, cudaMemcpyHostToDevice);

    // execute the kernel (the launch is asynchronous: control returns to the CPU immediately)
    inc_gpu<<<grid, block>>>(a_d, N);

    // run independent CPU code while the kernel executes
    run_cpu_stuff();

    // copy data from device back to host (waits for the kernel to finish)
    cudaMemcpy(a_h, a_d, numBytes, cudaMemcpyDeviceToHost);


    Error Reporting

    Example:

    cudaThreadSynchronize();
    Kernel_Launch<<<grid, block>>>(arg_list);
    cudaThreadSynchronize();
    printf("%s\n", cudaGetErrorString(cudaGetLastError()));

    All CUDA calls return an error code, but some calls are asynchronous, so the
    programmer should synchronize to keep the checks meaningful.
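
    A minimal checking pattern built from these calls (the CHECK macro name is an
    illustrative assumption, not part of the CUDA API):

    #define CHECK(call)                                                  \
        do {                                                             \
            cudaError_t err = (call);                                    \
            if (err != cudaSuccess)                                      \
                printf("CUDA error: %s\n", cudaGetErrorString(err));     \
        } while (0)

    CHECK(cudaMemcpy(a_d, a_h, numBytes, cudaMemcpyHostToDevice));
    Kernel_Launch<<<grid, block>>>(arg_list);
    CHECK(cudaThreadSynchronize());   // surfaces any error from the asynchronous kernel launch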


    Hardware Implementation

    The CUDA architecture is built around a scalable array of multithreaded
    multiprocessors. When a CUDA program on the host CPU invokes a kernel grid,
    the blocks of the grid are enumerated and distributed to multiprocessors with
    available execution capacity. The threads of a thread block execute concurrently
    on one multiprocessor, and multiple thread blocks can execute concurrently on one
    multiprocessor. As thread blocks terminate, new blocks are launched on the
    vacated multiprocessors. This makes the framework scalable.

    A multiprocessor is designed to execute hundreds of threads concurrently. To
    manage such a large number of threads, it employs a unique architecture called
    SIMT (Single-Instruction, Multiple-Thread). When a multiprocessor is given one or
    more thread blocks to execute, it partitions them into warps. A warp executes one
    common instruction at a time, so full efficiency is realized when all 32 threads
    of a warp agree on their execution path.


    PERFORMANCE OPTIMIZATION

    Performance optimization revolves around three basic strategies:

    Maximizing parallel execution
    Optimizing memory usage
    Optimizing instruction usage to achieve maximum instruction throughput


    Maximizing parallel execution

    Amdahl's law states that the maximum speed-up S of a program is

    S = 1 / ((1 - P) + P/N)

    where P is the fraction of the total serial execution time taken by the portion
    of code that can be parallelized and N is the number of processors over which the
    parallel portion of the code runs. The larger N is (that is, the greater the
    number of processors), the smaller the P/N fraction.

    It can be simpler to view N as a very large number, which essentially transforms
    the equation into

    S = 1 / (1 - P)

    Now, if 3/4 of a program is parallelized, the maximum speed-up over serial code is
    1 / (1 - 3/4) = 4. So our aim is to increase P by increasing the fraction of
    parallel code.


    Optimizing memory transfers

    To run kernels, data values must be transferred from the host to the device along
    the PCI Express (PCIe) bus. It is important to minimize data transfer between the
    host and the device, even if that means running kernels on the GPU that do not
    demonstrate any speed-up.


    Device <-> Device transfer

    CUDA provides a function for device-to-device data transfer which can only be
    called from host code.

    The call to cudaMemcpy() is asynchronous, but the next kernel won't start until
    the memory transfer is complete. What if there is a large amount of memory to
    transfer? The GPU cores will be idle.

    To increase performance we can allot the job of copying N bytes of data to B
    blocks, each running k threads in parallel. For best performance, N = k * B.
    (E.g., it takes 4.5 times less time if we allot the job of copying 1 MB of data
    to around 1k threads.)
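
    A minimal sketch of such a copy kernel, launched so that N = k * B (the kernel
    name and the choice k = 256 are illustrative assumptions):

    __global__ void copy_bytes(char *dst, const char *src) {
        // With B blocks of k threads and N = k * B bytes, each thread copies exactly one byte.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        dst[i] = src[i];
    }

    // Launch: N bytes copied by B = N / k blocks of k threads each (assumes k divides N).
    int k = 256;
    copy_bytes<<<N / k, k>>>(dst_d, src_d);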


    Shared Memory

    Each multiprocessor has 16 KB of shared memory associated with it.

    It provides thread cooperation within a block of threads:
    Sharing of memory accesses
    Avoiding redundant computations

    Because it is on-chip, shared memory is much faster than local and global memory.


    Coalesced Access to Global Memory

    Global memory can be viewed in terms of aligned segments of 16 and 32 words.

    (Figure: coalesced access in which all threads but one access the corresponding
    word in a segment.)

    Choosing thread block sizes as multiples of 16 facilitates memory accesses by
    half warps that are aligned to segments. But a warp size is 32, so there should
    be a minimum of 32 threads.

    (Figure: misaligned sequential addresses that fall within two 128-byte segments.)
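
    A minimal sketch contrasting a coalesced and a strided access pattern (the kernel
    names are illustrative assumptions):

    __global__ void coalesced(float *out, const float *in) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];        // consecutive threads touch consecutive words: one segment per half warp
    }

    __global__ void strided(float *out, const float *in, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        out[i] = in[i];        // consecutive threads touch words stride apart: many segments per half warp
    }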


    Optimizing Instruction Usage

    A warp executes one common instruction at a time, so full efficiency is realized
    when all 32 threads of a warp agree on their execution path. Any flow control
    instruction (if, switch, do, for, while) can significantly affect the instruction
    throughput by causing threads of the same warp to diverge to different execution
    paths. If this happens, the different execution paths must be serialized,
    increasing the total number of instructions executed for this warp.
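
    A minimal sketch of divergent versus warp-uniform branching (the kernels and the
    even/odd condition are illustrative assumptions):

    __global__ void divergent(float *v) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i % 2 == 0)                // even and odd threads of the same warp take different paths: serialized
            v[i] *= 2.0f;
        else
            v[i] *= 0.5f;
    }

    __global__ void uniform(float *v) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ((i / warpSize) % 2 == 0)   // the branch granularity is a whole warp: no divergence
            v[i] *= 2.0f;
        else
            v[i] *= 0.5f;
    }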


    Parallelizing w.r.t. pixels

    If the processing of the pixels is independent.
    E.g., conversion from RGB to grey, conversion from one format to another.

    char *host_img_rgb = (char *)malloc(3 * height * width * sizeof(char));
    char *host_img_grey = (char *)malloc(height * width * sizeof(char)); // allocating in Host

    char *dev_img_rgb, *dev_img_grey; // device pointers

    cudaMalloc((void **)&dev_img_rgb, 3 * width * height * sizeof(char));
    cudaMalloc((void **)&dev_img_grey, width * height * sizeof(char)); // allocating in Device

    // read image into the HOST memory
    // copy that RGB image into the Device memory
    cudaMemcpy(dev_img_rgb, host_img_rgb, 3 * width * height * sizeof(char), cudaMemcpyHostToDevice);

    Kernel<<<grid, block>>>(dev_img_rgb, dev_img_grey);

    // copy back to host memory
    cudaMemcpy(host_img_grey, dev_img_grey, width * height * sizeof(char), cudaMemcpyDeviceToHost);
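
    The slides do not show the kernel itself; a minimal sketch of what it could look
    like, assuming one thread per pixel and a simple average for the grey value:

    __global__ void Kernel(const char *rgb, char *grey) {
        // One thread per pixel; assumes the launch covers exactly width * height threads.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned char r = rgb[3 * i], g = rgb[3 * i + 1], b = rgb[3 * i + 2];
        grey[i] = (char)((r + g + b) / 3);   // unweighted average; other weightings are possible
    }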


    Visualizing the kernel execution

    (Figure: every block contains 256 threads; each block reads from the RGB array,
    writes to the grey array, and the blocks execute in parallel.)

    Choosing 256 as the number of threads per block encourages coalesced memory access.


    Improvements

    char *host_img_rgb = (char *)malloc(3 * height * width * sizeof(char));
    char *host_img_grey = (char *)malloc(height * width * sizeof(char)); // allocating in Host
    char *dev_img_rgb, *dev_img_grey; // device pointers

    cudaMalloc((void **)&dev_img_rgb, 3 * width * height * sizeof(char)); // allocating in Device

    // read image into the HOST memory
    // copy that RGB image into the Device memory
    cudaMemcpy(dev_img_rgb, host_img_rgb, 3 * width * height * sizeof(char), cudaMemcpyHostToDevice);

    cudaMalloc((void **)&dev_img_grey, width * height * sizeof(char));

    Kernel<<<grid, block>>>(dev_img_rgb, dev_img_grey);

    cudaMemcpy(host_img_grey, dev_img_grey, width * height * sizeof(char), cudaMemcpyDeviceToHost);


    IMPROVEMENT IN ALLOCATION

    cudaMalloc((void **)&dev_img_rgb, 3 * width * height * sizeof(char));
    cudaMalloc((void **)&dev_img_grey, width * height * sizeof(char));

    Better way:

    cudaMalloc((void **)&temp_dev_point, 4 * width * height * sizeof(char));
    dev_img_rgb = temp_dev_point;
    dev_img_grey = temp_dev_point + (3 * width * height);

    For example, allocating 6000 bytes in one call takes about 12 times less time
    than allocating 4 arrays of 1500 bytes each.


    Problems in data transfer and execution

    cudaMemcpy(dev_img_rgb, host_img_rgb, 3 * width * height * sizeof(char), cudaMemcpyHostToDevice);
    Kernel<<<grid, block>>>(dev_img_rgb, dev_img_grey);

    The kernel has to wait for the data transfer, so the cores are idle.
    Moreover, the Host->Device transfer is slow.


    Page Locked Memory

    CUDA allows the programmer to allocate page-locked host memory.

    The data transfer rate between page-locked host memory and device memory is high.

    It allows asynchronous data transfer.
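
    A minimal sketch of allocating page-locked host memory with cudaMallocHost (the
    buffer name and size are illustrative assumptions):

    char *host_img_rgb;
    cudaMallocHost((void **)&host_img_rgb, 3 * width * height * sizeof(char));   // pinned allocation
    // use host_img_rgb as the source or destination of cudaMemcpyAsync transfers
    cudaFreeHost(host_img_rgb);   // pinned memory is released with cudaFreeHost, not free()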


    Concurrency: Use of Streams

    // Creating streams
    cudaStream_t stream[height];
    for (int i = 0; i < height; ++i)
        cudaStreamCreate(&stream[i]);

    // Specifying the sequence of host-to-device transfers (one image row per stream).
    for (int i = 0; i < height; ++i)
        cudaMemcpyAsync(dev_img_rgb + (i * 3 * width), host_img_rgb + (i * 3 * width), 3 * width * sizeof(char), cudaMemcpyHostToDevice, stream[i]);

    // Specifying the sequence of kernel launches.
    for (int i = 0; i < height; ++i)
        Kernel<<<grid, block, 0, stream[i]>>>(dev_img_rgb + i * 3 * width, dev_img_grey + i * width);

    // Specifying the sequence of device-to-host transfers.
    for (int i = 0; i < height; ++i)
        cudaMemcpyAsync(host_img_grey + (i * width), dev_img_grey + (i * width), width * sizeof(char), cudaMemcpyDeviceToHost, stream[i]);


    Comparison of timelines for non-concurrent and concurrent execution

    (Figure: Host->Device transfer, kernel execution, and Device->Host transfer run
    strictly one after another in the non-concurrent case, and overlap across streams
    in the concurrent case.)


    Parallelizing nested loops

    E.g., parallelizing w.r.t. the pixels in a patch.

    for(int i=0;i
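
    A sketch of what the serial version might look like, assuming it loops over every
    patch and then over every pixel within the patch (this nesting and the helper
    process_pixel are assumptions, not from the slide):

    for (int py = 0; py < height / patch_height; py++)        // patch row
        for (int px = 0; px < width / patch_width; px++)      // patch column
            for (int i = 0; i < patch_height; i++)            // pixel row inside the patch
                for (int j = 0; j < patch_width; j++)         // pixel column inside the patch
                    process_pixel(py * patch_height + i, px * patch_width + j);   // hypothetical per-pixel work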


    // We launch a 2-D grid
    dim3 grid(width / patch_width, height / patch_height);
    // with 2-D blocks
    dim3 block(patch_width, patch_height);
    // launch the kernel
    Kernel_name<<<grid, block>>>(arg_list);

    How to find the index of the patch inside the grid?
    blockIdx.y * gridDim.x + blockIdx.x

    How to find the index of the pixel inside the block?
    threadIdx.y * blockDim.x + threadIdx.x
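
    A minimal kernel sketch putting the two index formulas together to address a
    pixel (the kernel name, the row-major image layout, and the per-pixel write are
    illustrative assumptions):

    __global__ void per_patch(char *grey) {
        int patch = blockIdx.y * gridDim.x + blockIdx.x;        // index of the patch inside the grid
        int pixel = threadIdx.y * blockDim.x + threadIdx.x;     // index of the pixel inside the block
        // Global coordinates of this thread's pixel; image width equals gridDim.x * blockDim.x.
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        grey[y * gridDim.x * blockDim.x + x] = (char)(patch + pixel);   // hypothetical per-pixel work
    }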


    How to choose the best configuration arguments?

    CUDA provides an occupancy calculator as an Excel file.

    Occupancy is the ratio of the number of active warps per multiprocessor to the
    maximum number of possible active warps.