Presentation CUDA



    HIGH PERFORMANCE COMPUTING

    ON GPU


    Graphics Processing Units

    A graphics processing unit, or GPU, is a specialized microprocessor that offloads
    and accelerates 3D or 2D graphics rendering.

    The highly parallel structure of modern GPUs makes them more effective than
    general-purpose CPUs for data-parallel workloads.

    NVIDIA's Tesla architecture exposes the computational horsepower of NVIDIA GPUs.

    The GPU is specialized for compute-intensive, highly parallel computation and is
    designed such that more transistors are devoted to data processing rather than
    data caching and flow control.


    Physical Memory Layout of NVIDIA GPUs

    The device has its own global memory, which all the cores (thread processors) can
    access. There are N multiprocessors with M cores each. Cores share an instruction
    unit with the other cores in their multiprocessor. Each core has its own local
    memory (residing in DRAM) and a separate register set, and all M cores of a
    multiprocessor share an on-chip memory called shared memory. The host can write
    to the global memory but not to the shared memory.
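
    As a minimal sketch (assuming the standard CUDA runtime API and a single device),
    the layout described above can be queried at run time:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);                 // properties of device 0
        printf("Multiprocessors (N): %d\n", prop.multiProcessorCount);
        printf("Shared memory per block: %zu bytes\n", (size_t)prop.sharedMemPerBlock);
        printf("Registers per block: %d\n", prop.regsPerBlock);
        printf("Global memory: %zu bytes\n", (size_t)prop.totalGlobalMem);
        return 0;
    }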


    TESLA C1060

    The NVIDIA Tesla C1060 uses the 10-series NVIDIA architecture and has 30
    multiprocessors, each with 8 cores, a double-precision unit, and on-chip
    shared memory.


    What is CUDA?

    CUDA is a scalable parallel programming model and a software environment for
    parallel computing.

    Minimal extensions to familiar C/C++ environment

    Heterogeneous serial-parallel programming model


    Kernels and Threads

    Important definition: parallel portions of an application are executed on the
    device as kernels. One kernel is executed at a time, and all of its parallel
    threads execute the same kernel.

    Some devices with high computational power can execute more than one kernel
    concurrently.


    More about threads

    A CUDA kernel is executed by an array of threads.

    All threads run the same code.

    Each thread has an ID that it uses to compute memory addresses and make control
    decisions.

    Computation of memory addresses and control decisions will be discussed later.
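
    A minimal sketch of a thread using its ID to compute a memory address and make a
    control decision (the kernel name and the scaling operation are illustrative
    assumptions, not from the slides):

    __global__ void scale(float *data, float factor, int n) {
        // Each thread derives a unique global index from its block and thread IDs.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)              // control decision based on the thread's ID
            data[i] *= factor;  // memory address computed from the thread's ID
    }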


    THREAD BATCHING

    A kernel launches a grid of thread blocks.

    Threads within a block cooperate via shared memory.

    Threads within a block can synchronize (thread cooperation).

    Threads in different blocks cannot cooperate.


    MEMORY ACCESS


    EXECUTION MODEL


    CUDA C and Compilation

    CUDA C provides a simple path for users familiar with the C programming language
    to easily write programs for execution by the device. It consists of a minimal set
    of extensions to the C language and a runtime library.

    CUDA provides the nvcc compiler, which splits the CUDA code into PTX code (used at
    run time, runs on the GPU) and standard C code (passed to the standard C compiler
    at compile time, runs on the CPU).


    Managing memory

    The GPU's memory can only be managed by the CPU, and the CPU has access only
    to the global memory.

    The following memory operations apply only to the global memory (not to the
    local or shared memory).

    Allocate/free memory:

    cudaMalloc(void **pointer, size_t nbytes)            // allocates nbytes of device memory
    cudaMemset(void *pointer, int value, size_t count)   // sets count bytes to value
    cudaFree(void *pointer)                              // frees memory allocated by cudaMalloc

    Host <-> Device data transfer:

    cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction)
    // Transfers nbytes of data from src to dst; direction specifies the source and
    // destination memory types.
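
    A minimal host-side sketch tying these calls together (the array size and names
    are illustrative assumptions):

    int n = 1024;
    size_t nbytes = n * sizeof(float);
    float *a_h = (float *)malloc(nbytes);       // host array
    float *a_d = NULL;                          // device pointer
    cudaMalloc((void **)&a_d, nbytes);          // allocate global memory on the device
    cudaMemset(a_d, 0, nbytes);                 // zero the device array
    cudaMemcpy(a_d, a_h, nbytes, cudaMemcpyHostToDevice);   // host -> device
    cudaMemcpy(a_h, a_d, nbytes, cudaMemcpyDeviceToHost);   // device -> host
    cudaFree(a_d);                              // release the device allocation
    free(a_h);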


    CUDA Function Qualifiers

    __global__
    Function called from the host and executed on the device.
    Must return void.
    E.g., kernels.

    __device__
    Function called from the device and executed on the device.
    Cannot be called from host code.

    __host__
    Function called from the host and executed on the host (the default).
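
    A minimal sketch of the three qualifiers in use (the function names and the
    squaring operation are illustrative assumptions):

    __device__ float square(float x) {              // callable only from device code
        return x * x;
    }

    __global__ void square_all(float *v, int n) {   // kernel: called from host, runs on device
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] = square(v[i]);
    }

    __host__ void fill(float *v, int n) {           // ordinary host function (__host__ is the default)
        for (int i = 0; i < n; ++i) v[i] = (float)i;
    }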


    Kernel Calls and Unique Thread Index

    Kernels are called by the modified syntax:

    kernel<<<dG, dB>>>(arg_list);

    Here dim3 is a vector type with x, y, z as its members. We can initialise dim3
    objects with the constructor:

    For a 1D grid: dim3 dG(var_x, 1, 1) or dim3 dG(var)
    For a 2D grid: dim3 dG(var_x, var_y, 1) or dim3 dG(var1, var2)

    Similarly for blocks:
    For a 1D block: dim3 dB(var_x, 1, 1) or dim3 dB(var)
    For a 2D block: dim3 dB(var_x, var_y, 1) or dim3 dB(var_x, var_y)
    For a 3D block: dim3 dB(var_x, var_y, var_z)
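
    A minimal launch sketch using these constructors (the sizes, the device pointer
    data_d, and the scale kernel from the earlier sketch are illustrative assumptions):

    int n = 1 << 20;                          // one million elements
    dim3 dB(256);                             // 1D block of 256 threads
    dim3 dG((n + dB.x - 1) / dB.x);           // enough 1D blocks to cover all n elements
    scale<<<dG, dB>>>(data_d, 2.0f, n);       // kernel<<<grid, block>>>(arguments)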


    Thread Synchronization

    Host synchronization:

    void cudaThreadSynchronize();
    Blocks until all preceding CUDA calls have completed.

    Device synchronization:

    void __syncthreads();
    Synchronizes all the threads within a block.

    There is no way to synchronize threads outside the block.

    The programmer should be careful to avoid RAW/WAW/WAR hazards.
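
    A minimal sketch of __syncthreads() guarding a shared-memory exchange (the
    reversal kernel is an illustrative assumption; it assumes blockDim.x <= 256):

    __global__ void reverse_block(float *d) {
        __shared__ float s[256];                  // one element per thread in the block
        int t = threadIdx.x;
        s[t] = d[blockIdx.x * blockDim.x + t];    // stage data in shared memory
        __syncthreads();                          // barrier: all writes to s finish before any read
        d[blockIdx.x * blockDim.x + t] = s[blockDim.x - 1 - t];   // read a value written by another thread
    }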


    Heteroprogramming and Synchronization

    // copy data from host to device

    cudaMemcpy(a_d, a_h, numBytes, cudaMemcpyHostToDevice);

    // execute the kernel (the launch is asynchronous: control returns to the CPU immediately)
    inc_gpu<<<grid, block>>>(a_d, N);

    // run independent CPU code while the kernel executes
    run_cpu_stuff();

    // copy data from device back to host (waits for the kernel to finish)
    cudaMemcpy(a_h, a_d, numBytes, cudaMemcpyDeviceToHost);


    Error Reporting

    Example:

    cudaThreadSynchronize();
    Kernel_Launch<<<grid, block>>>(arg_list);
    cudaThreadSynchronize();
    printf("%s\n", cudaGetErrorString(cudaGetLastError()));

    All CUDA calls return an error code, but some calls are asynchronous, so the
    programmer should synchronize to keep the checks meaningful.
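
    A minimal checking pattern built from these calls (the CHECK macro name is an
    illustrative assumption, not part of the CUDA API):

    #define CHECK(call)                                                  \
        do {                                                             \
            cudaError_t err = (call);                                    \
            if (err != cudaSuccess)                                      \
                printf("CUDA error: %s\n", cudaGetErrorString(err));     \
        } while (0)

    CHECK(cudaMemcpy(a_d, a_h, numBytes, cudaMemcpyHostToDevice));
    Kernel_Launch<<<grid, block>>>(arg_list);
    CHECK(cudaThreadSynchronize());   // surfaces any error from the asynchronous kernel launch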


    Hardware Implementation

    The CUDA architecture is built around a scalable array of multithreaded
    multiprocessors. When a CUDA program on the host CPU invokes a kernel grid,
    the blocks of the grid are enumerated and distributed to multiprocessors with
    available execution capacity. The threads of a thread block execute concurrently
    on one multiprocessor, and multiple thread blocks can execute concurrently on one
    multiprocessor. As thread blocks terminate, new blocks are launched on the
    vacated multiprocessors. This makes the framework scalable.

    A multiprocessor is designed to execute hundreds of threads concurrently. To
    manage such a large number of threads, it employs a unique architecture called
    SIMT (Single-Instruction, Multiple-Thread). When a multiprocessor is given one or
    more thread blocks to execute, it partitions them into warps. A warp executes one
    common instruction at a time, so full efficiency is realized when all 32 threads
    of a warp agree on their execution path.


    PERFORMANCE OPTIMIZATION

    Performance optimization revolves around three basic strategies:

    Maximizing parallel execution
    Optimizing memory usage
    Optimizing instruction usage to achieve maximum instruction throughput


    Maximizing parallel execution

    Amdahl's law states that the maximum speed-up S of a program is

    S = 1 / ((1 - P) + P/N)

    where P is the fraction of the total serial execution time taken by the portion
    of code that can be parallelized and N is the number of processors over which the
    parallel portion of the code runs. The larger N is (that is, the greater the
    number of processors), the smaller the P/N fraction.

    It can be simpler to view N as a very large number, which essentially transforms
    the equation into

    S = 1 / (1 - P)

    Now, if 3/4 of a program is parallelized, the maximum speed-up over serial code is
    1 / (1 - 3/4) = 4. So our aim is to increase P by increasing the fraction of
    parallel code.


    Optimizing memory transfers

    To run kernels, data values must be transferred from the host to the device along
    the PCI Express (PCIe) bus. It is important to minimize data transfer between the
    host and the device, even if that means running kernels on the GPU that do not
    demonstrate any speed-up.


    Device <-> Device transfer

    CUDA provides a function for device-to-device data transfer which can only be
    called from host code.

    The call to cudaMemcpy() is asynchronous, but the next kernel won't start until
    the memory transfer is complete. What if there is a large amount of memory to
    transfer? The GPU cores will be idle.

    To increase performance we can allot the job of copying N bytes of data to B
    blocks, each running k threads in parallel. For best performance, N = k * B.
    (E.g., it takes 4.5 times less time if we allot the job of copying 1 MB of data
    to around 1k threads.)
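
    A minimal sketch of such a copy kernel, launched so that N = k * B (the kernel
    name and the choice k = 256 are illustrative assumptions):

    __global__ void copy_bytes(char *dst, const char *src) {
        // With B blocks of k threads and N = k * B bytes, each thread copies exactly one byte.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        dst[i] = src[i];
    }

    // Launch: N bytes copied by B = N / k blocks of k threads each (assumes k divides N).
    int k = 256;
    copy_bytes<<<N / k, k>>>(dst_d, src_d);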


    Shared Memory

    Each multiprocessor has 16 KB of shared memory associated with it.

    It provides thread cooperation within a block of threads:
    Sharing of memory accesses
    Avoiding redundant computations

    Because it is on-chip, shared memory is much faster than local and global memory.


    Coalesced Access to Global Memory

    Global memory can be viewed in terms of aligned segments of 16 and 32 words.

    (Figure: coalesced access in which all threads but one access the corresponding
    word in a segment.)

    Choosing thread block sizes as multiples of 16 facilitates memory accesses by
    half warps that are aligned to segments. But a warp size is 32, so there should
    be a minimum of 32 threads.

    (Figure: misaligned sequential addresses that fall within two 128-byte segments.)
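
    A minimal sketch contrasting a coalesced and a strided access pattern (the kernel
    names are illustrative assumptions):

    __global__ void coalesced(float *out, const float *in) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];        // consecutive threads touch consecutive words: one segment per half warp
    }

    __global__ void strided(float *out, const float *in, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        out[i] = in[i];        // consecutive threads touch words stride apart: many segments per half warp
    }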


    Optimizing Instruction Usage

    A warp executes one common instruction at a time, so full efficiency is realized
    when all 32 threads of a warp agree on their execution path. Any flow control
    instruction (if, switch, do, for, while) can significantly affect the instruction
    throughput by causing threads of the same warp to diverge to different execution
    paths. If this happens, the different execution paths must be serialized,
    increasing the total number of instructions executed for this warp.
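
    A minimal sketch of divergent versus warp-uniform branching (the kernels and the
    even/odd condition are illustrative assumptions):

    __global__ void divergent(float *v) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i % 2 == 0)                // even and odd threads of the same warp take different paths: serialized
            v[i] *= 2.0f;
        else
            v[i] *= 0.5f;
    }

    __global__ void uniform(float *v) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ((i / warpSize) % 2 == 0)   // the branch granularity is a whole warp: no divergence
            v[i] *= 2.0f;
        else
            v[i] *= 0.5f;
    }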


    Parallelizing w.r.t. pixels

    If the processing of the pixels is independent.
    E.g., conversion from RGB to grey, conversion from one format to another.

    char *host_img_rgb = (char *)malloc(3 * height * width * sizeof(char));
    char *host_img_grey = (char *)malloc(height * width * sizeof(char)); // allocating in Host

    char *dev_img_rgb, *dev_img_grey; // device pointers

    cudaMalloc((void **)&dev_img_rgb, 3 * width * height * sizeof(char));
    cudaMalloc((void **)&dev_img_grey, width * height * sizeof(char)); // allocating in Device

    // read image into the HOST memory
    // copy that RGB image into the Device memory
    cudaMemcpy(dev_img_rgb, host_img_rgb, 3 * width * height * sizeof(char), cudaMemcpyHostToDevice);

    Kernel<<<grid, block>>>(dev_img_rgb, dev_img_grey);

    // copy back to host memory
    cudaMemcpy(host_img_grey, dev_img_grey, width * height * sizeof(char), cudaMemcpyDeviceToHost);
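
    The slides do not show the kernel itself; a minimal sketch of what it could look
    like, assuming one thread per pixel and a simple average for the grey value:

    __global__ void Kernel(const char *rgb, char *grey) {
        // One thread per pixel; assumes the launch covers exactly width * height threads.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned char r = rgb[3 * i], g = rgb[3 * i + 1], b = rgb[3 * i + 2];
        grey[i] = (char)((r + g + b) / 3);   // unweighted average; other weightings are possible
    }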


    Visualizing the kernel execution

    (Figure: every block contains 256 threads; each block reads from the RGB array,
    writes to the grey array, and the blocks execute in parallel.)

    Choosing 256 as the number of threads per block encourages coalesced memory access.


    Improvements

    char *host_img_rgb = (char *)malloc(3 * height * width * sizeof(char));
    char *host_img_grey = (char *)malloc(height * width * sizeof(char)); // allocating in Host
    char *dev_img_rgb, *dev_img_grey; // device pointers

    cudaMalloc((void **)&dev_img_rgb, 3 * width * height * sizeof(char)); // allocating in Device

    // read image into the HOST memory
    // copy that RGB image into the Device memory
    cudaMemcpy(dev_img_rgb, host_img_rgb, 3 * width * height * sizeof(char), cudaMemcpyHostToDevice);

    cudaMalloc((void **)&dev_img_grey, width * height * sizeof(char));

    Kernel<<<grid, block>>>(dev_img_rgb, dev_img_grey);

    cudaMemcpy(host_img_grey, dev_img_grey, width * height * sizeof(char), cudaMemcpyDeviceToHost);


    IMPROVEMENT IN ALLOCATION

    cudaMalloc((void **)&dev_img_rgb, 3 * width * height * sizeof(char));
    cudaMalloc((void **)&dev_img_grey, width * height * sizeof(char));

    Better way:

    cudaMalloc((void **)&temp_dev_point, 4 * width * height * sizeof(char));
    dev_img_rgb = temp_dev_point;
    dev_img_grey = temp_dev_point + (3 * width * height);

    For example, allocating 6000 bytes in one call takes about 12 times less time
    than allocating 4 arrays of 1500 bytes each.


    Problems in data transfer and execution

    cudaMemcpy(dev_img_rgb, host_img_rgb, 3 * width * height * sizeof(char), cudaMemcpyHostToDevice);
    Kernel<<<grid, block>>>(dev_img_rgb, dev_img_grey);

    The kernel has to wait for the data transfer, so the cores are idle.
    Moreover, the Host->Device transfer is slow.


    Page Locked Memory

    CUDA allows the programmer to allocate page-locked host memory.

    The data transfer rate between page-locked host memory and device memory is high.

    It allows asynchronous data transfer.
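
    A minimal sketch of allocating page-locked host memory with cudaMallocHost (the
    buffer name and size are illustrative assumptions):

    char *host_img_rgb;
    cudaMallocHost((void **)&host_img_rgb, 3 * width * height * sizeof(char));   // pinned allocation
    // use host_img_rgb as the source or destination of cudaMemcpyAsync transfers
    cudaFreeHost(host_img_rgb);   // pinned memory is released with cudaFreeHost, not free()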


    Concurrency: Use of Streams

    // Creating streams
    cudaStream_t stream[height];
    for (int i = 0; i < height; ++i)
        cudaStreamCreate(&stream[i]);

    // Specifying the sequence of host-to-device transfers (one image row per stream).
    for (int i = 0; i < height; ++i)
        cudaMemcpyAsync(dev_img_rgb + (i * 3 * width), host_img_rgb + (i * 3 * width), 3 * width * sizeof(char), cudaMemcpyHostToDevice, stream[i]);

    // Specifying the sequence of kernel launches.
    for (int i = 0; i < height; ++i)
        Kernel<<<grid, block, 0, stream[i]>>>(dev_img_rgb + i * 3 * width, dev_img_grey + i * width);

    // Specifying the sequence of device-to-host transfers.
    for (int i = 0; i < height; ++i)
        cudaMemcpyAsync(host_img_grey + (i * width), dev_img_grey + (i * width), width * sizeof(char), cudaMemcpyDeviceToHost, stream[i]);


    Comparison of timelines for non-concurrent and concurrent execution

    (Figure: Host->Device transfer, kernel execution, and Device->Host transfer run
    strictly one after another in the non-concurrent case, and overlap across streams
    in the concurrent case.)


    Parallelizing nested loops

    E.g., parallelizing w.r.t. the pixels in a patch.

    for(int i=0;i
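
    A sketch of what the serial version might look like, assuming it loops over every
    patch and then over every pixel within the patch (this nesting and the helper
    process_pixel are assumptions, not from the slide):

    for (int py = 0; py < height / patch_height; py++)        // patch row
        for (int px = 0; px < width / patch_width; px++)      // patch column
            for (int i = 0; i < patch_height; i++)            // pixel row inside the patch
                for (int j = 0; j < patch_width; j++)         // pixel column inside the patch
                    process_pixel(py * patch_height + i, px * patch_width + j);   // hypothetical per-pixel work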


    // We launch a 2-D grid
    dim3 grid(width / patch_width, height / patch_height);
    // with 2-D blocks
    dim3 block(patch_width, patch_height);
    // launch the kernel
    Kernel_name<<<grid, block>>>(arg_list);

    How to find the index of the patch inside the grid?
    blockIdx.y * gridDim.x + blockIdx.x

    How to find the index of the pixel inside the block?
    threadIdx.y * blockDim.x + threadIdx.x
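
    A minimal kernel sketch putting the two index formulas together to address a
    pixel (the kernel name, the row-major image layout, and the per-pixel write are
    illustrative assumptions):

    __global__ void per_patch(char *grey) {
        int patch = blockIdx.y * gridDim.x + blockIdx.x;        // index of the patch inside the grid
        int pixel = threadIdx.y * blockDim.x + threadIdx.x;     // index of the pixel inside the block
        // Global coordinates of this thread's pixel; image width equals gridDim.x * blockDim.x.
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        grey[y * gridDim.x * blockDim.x + x] = (char)(patch + pixel);   // hypothetical per-pixel work
    }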


    How to choose the best configuration arguments?

    CUDA provides an occupancy calculator as an Excel file.

    Occupancy is the ratio of the number of active warps per multiprocessor to the
    maximum number of possible active warps.