Brief of GPU & CUDA
Chun-Yuan Lin
112/04/21
About Myself
Name: Chun-Yuan Lin (林俊淵) (1977)
Education: Ph.D., Dept. CS, FCU Univ.
Experience: Postdoctoral Fellow, Institute of Molecular and Cellular Biology; Postdoctoral Fellow, Dept. CS, NTHU Univ.
Research: Parallel and Distributed Processing, Parallel and Distributed Programming Languages, Algorithm Analysis, Information Retrieval, Genomics, Proteomics, and Bioinformatics
Contact: [email protected]
http://sslab.cs.nthu.edu.tw/~cylin/
My family
Please let me know who you are
Introduction to Shared Memory Programming
Types of Parallel Computers
Two principal types:
Shared memory multiprocessor
Distributed memory multicomputer
Another type: Distributed shared memory multiprocessor
Shared Memory Multiprocessor System
Natural way to extend the single processor model - have multiple processors connected to multiple memory modules, such that each processor can access any memory module:
[Diagram: processors connected through an interconnection network to memory modules, forming one address space]
Simplistic view of a small shared memory multiprocessor
Examples: dual Pentiums, quad Pentiums
[Diagram: processors connected to shared memory over a bus]
Programming Shared Memory Multiprocessors
Threads - the programmer decomposes the program into individual parallel sequences (threads), each able to access variables declared outside the threads.
Example: Pthreads
Sequential programming language with preprocessor compiler directives to declare shared variables and specify parallelism.
Example: OpenMP - industry standard - needs an OpenMP compiler
Flynn's Classifications
Flynn (1966) created a classification for computers based upon instruction streams and data streams:
Single instruction stream-single data stream (SISD) computer
Single processor computer - a single stream of instructions is generated from the program. Instructions operate upon a single stream of data items.
Multiple Instruction Stream-Multiple Data Stream (MIMD) Computer
General-purpose multiprocessor system - each processor has a separate program and one instruction stream is generated from each program for each processor. Each instruction operates upon different data.
Both the shared memory and the message-passing multiprocessors so far described are in the MIMD classification.
Single Instruction Stream-Multiple Data Stream (SIMD) Computer
A specially designed computer - a single instruction stream from a single program, but multiple data streams exist. Instructions from program broadcast to more than one processor. Each processor executes same instruction in synchronism, but using different data.
Developed because a number of important applications mostly operate upon arrays of data.
Multiple Program Multiple Data (MPMD) Structure
Within the MIMD classification, each processor will have its own program to execute:
[Diagram: two programs, each generating its own instruction stream for its own processor and operating on its own data]
Single Program Multiple Data (SPMD) Structure
A single source program is written, and each processor executes its personal copy of this program, although independently and not in synchronism.
Source program can be constructed so that parts of the program are executed by certain computers and not others depending upon the identity of the computer.
The MIMD category includes a wide class of computers. For this reason, in 1988, E. E. Johnson proposed a further classification of such machines based on their memory structure (global or distributed) and the mechanism used for communication/synchronization (shared variables or message passing).
Programming with Shared Memory
We outline the methods of programming systems that have shared memory, including the use of processes, threads, parallel programming languages, and sequential languages with compiler directives and library routines:
the standard UNIX process "fork-join" model
the IEEE thread standard Pthreads
OpenMP, a widely accepted industry standard for parallel programming on a shared memory multiprocessor
Shared memory multiprocessor system
Any memory location can be accessible by any of the processors. (dual- and quad-Pentium systems, cost-effective)
A single address space exists, meaning that each memory location is given a unique address within a single range of addresses.
Generally, shared memory programming is more convenient, although it does require access to shared data to be controlled by the programmer (using critical sections, etc.)
Threads
The process created with UNIX fork is a "heavyweight" process; it is a completely separate program with its own variables, stack, and personal memory allocation. Heavyweight processes are expensive in time and memory space.
A much more efficient mechanism is one in which independent concurrent sequences are defined within a process, so-called threads.
The threads all share the same memory space and global variables of the process and are much less expensive in time and memory space than the processes themselves.
Each thread needs its own stack and also stores information regarding registers but shares the code and other parts.
Creation of a thread can take three orders of magnitude less time than process creation.
In addition, a thread will immediately have access to shared global variables.
Equally important, threads can be synchronized much more efficiently than processes.
Multithreading also helps alleviate the long latency of message-passing.
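A minimal sketch of these ideas in C with Pthreads (compile with gcc demo.c -lpthread; all names are illustrative): the main thread sets a global variable, creates four threads that can read it immediately, and joins them.

    #include <pthread.h>
    #include <stdio.h>

    int shared_data = 0;                  /* global: visible to every thread */

    void *worker(void *arg) {
        int id = *(int *)arg;
        printf("thread %d sees shared_data = %d\n", id, shared_data);
        return NULL;
    }

    int main(void) {
        pthread_t threads[4];
        int ids[4];
        shared_data = 42;                 /* written by the main thread */
        for (int i = 0; i < 4; i++) {
            ids[i] = i;
            pthread_create(&threads[i], NULL, worker, &ids[i]);
        }
        for (int i = 0; i < 4; i++)
            pthread_join(threads[i], NULL);   /* wait for all threads to finish */
        return 0;
    }

No shared memory system call is needed for shared_data; a thread sees it simply because it lives in the process's single address space.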
Differences between a process and threads
Creating Shared Data
The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor.
If UNIX heavyweight processes are to share data, additional shared memory system calls are necessary. Typically, each process has its own virtual address space within the virtual memory management system.
It is not necessary to create shared data items explicitly when using threads. Variables declared at the top of the main program (main thread) are global and are available to all threads.
Accessing Shared Data
Accessing shared data needs careful control (for both processes and threads); writing shared data is where problems arise.
Consider two processes, each of which is to add one to a shared data item, x. It is necessary for the contents of the location x to be read, x + 1 computed, and the result written back to the location:
With x = 10, the answer should be 12, but we may get 11.
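A minimal sketch of this race with two Pthreads (the structure is illustrative): both threads may read x = 10 before either writes back, so one increment is lost.

    #include <pthread.h>
    #include <stdio.h>

    int x = 10;                 /* shared data item */

    void *add_one(void *arg) {
        int tmp = x;            /* read the location x */
        tmp = tmp + 1;          /* compute x + 1 */
        x = tmp;                /* write back: may overwrite the other thread's update */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, add_one, NULL);
        pthread_create(&t2, NULL, add_one, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("x = %d (should be 12, may be 11)\n", x);
        return 0;
    }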
Critical Section
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and arrange that only one such critical section is executed at a time (when a process reaches a critical section, it may have to wait…).
This mechanism is known as mutual exclusion.
This concept also appears in operating systems.
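A sketch of one common realization of mutual exclusion: guarding the update with a Pthreads mutex, so only one thread executes the critical section at a time (the lock name is illustrative).

    #include <pthread.h>

    int x = 10;
    pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;

    void *add_one_safely(void *arg) {
        pthread_mutex_lock(&x_lock);     /* enter the critical section */
        x = x + 1;                       /* read, add, write back as one exclusive step */
        pthread_mutex_unlock(&x_lock);   /* leave the critical section */
        return NULL;
    }

With the lock in place, the two-thread example above always yields 12.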
Deadlock
Can occur with two processes when each requires a resource held by the other.
Barriers
Process/thread synchronization is often needed in shared memory programs.
Pthreads do not have a native barrier, so barriers have to be hand-coded using a condition variable and mutex.
A global counter variable is incremented each time a thread reaches the barrier, and all the threads are released when the counter has reached a defined number of threads. The threads are released by the last thread reaching the barrier using a broadcast signal.
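A sketch of such a hand-coded barrier (names are illustrative; the generation counter guards against spurious wakeups when the barrier is reused):

    #include <pthread.h>

    static pthread_mutex_t barrier_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  barrier_cond = PTHREAD_COND_INITIALIZER;
    static int arrived = 0;          /* global counter of threads at the barrier */
    static int generation = 0;       /* incremented each time the barrier opens */

    void barrier(int nthreads) {
        pthread_mutex_lock(&barrier_lock);
        int my_gen = generation;
        if (++arrived == nthreads) {                /* last thread to arrive */
            arrived = 0;                            /* reset for reuse */
            generation++;
            pthread_cond_broadcast(&barrier_cond);  /* release all waiters */
        } else {
            while (my_gen == generation)            /* wait until released */
                pthread_cond_wait(&barrier_cond, &barrier_lock);
        }
        pthread_mutex_unlock(&barrier_lock);
    }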
Language Constructs for Parallelism
Shared Data
Shared memory variables might be declared as shared with, say,
shared int x; shared int *p;
par Construct
For specifying concurrent statements:
par {
   S1;
   S2;
   ..
   Sn;
}
The keyword par indicates that statements in the body are to be executed concurrently. This is instruction-level parallelism.
Multiple concurrent processes or threads could be specified by listing the routines that are to be executed concurrently.
par {
   proc1;
   proc2;
   ..
   procn;
}
forall Construct
To start multiple similar processes together:
forall (i = 0; i < n; i++) {
   S1;
   S2;
   ..
   Sm;
}
which generates n processes each consisting of the statements forming the body of the for loop, S1, S2, …, Sm. Each process uses a different value of i.
Example
forall (i = 0; i < 5; i++)
   a[i] = 0;
clears a[0], a[1], a[2], a[3], and a[4] to zero concurrently.
OpenMP
An accepted standard developed in the late 1990s by a group of industry specialists.
Consists of a small set of compiler directives, augmented with a small set of library routines and environment variables using the base language Fortran and C/C++.
The compiler directives can specify such things as the par and forall operations described previously.
Several OpenMP compilers available.
For C/C++, the OpenMP directives are contained in #pragma statements. The OpenMP #pragma statements have the format:
#pragma omp directive_name ...
where omp is an OpenMP keyword.
May be additional parameters (clauses) after the directive name for different options.
Some directives require code to be specified in a structured block (a statement or statements) that follows the directive, and then the directive and structured block form a "construct".
Parallel Directive
#pragma omp parallel
structured_block
creates multiple threads, each one executing the specified structured_block, either a single statement or a compound statement created with { ... } with a single entry point and a single exit point.
There is an implicit barrier at the end of the construct. The directive corresponds to the forall construct.
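A minimal sketch of the directive in use (compile with gcc -fopenmp; omp_get_thread_num() and omp_get_num_threads() are standard OpenMP library routines):

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        #pragma omp parallel
        {
            printf("hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }   /* implicit barrier at the end of the construct */
        return 0;
    }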
Number of threads in a team
Established by either:
1. a num_threads clause after the parallel directive, or
2. the omp_set_num_threads() library routine being previously called, or
3. the environment variable OMP_NUM_THREADS being defined,
in the order given, or is system dependent if none of the above.
Number of threads available can also be altered automatically to achieve best use of system resources by a “dynamic adjustment” mechanism.
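A sketch of the first two mechanisms, showing the precedence order given above (the clause overrides the library routine, which overrides the environment variable):

    #include <omp.h>

    void demo(void) {
        omp_set_num_threads(8);                 /* 2. library routine */

        #pragma omp parallel num_threads(4)     /* 1. clause wins: team of 4 */
        { /* ... */ }

        #pragma omp parallel                    /* team of 8, from the routine */
        { /* ... */ }
    }
    /* 3. otherwise: OMP_NUM_THREADS=16 ./a.out */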
Work-Sharing
Three constructs in this classification:
sections
for
single
In all cases, there is an implicit barrier at the end of the construct unless a nowait clause is included.
Note that these constructs do not start a new team of threads. That is done by an enclosing parallel construct.
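A sketch showing all three work-sharing constructs inside one parallel region (the work itself is illustrative):

    #include <omp.h>
    #include <stdio.h>

    void work_share(int n, float *a) {
        #pragma omp parallel
        {
            #pragma omp for               /* iterations divided among the team */
            for (int i = 0; i < n; i++)
                a[i] = 0.0f;
            /* implicit barrier here */

            #pragma omp sections          /* each section runs on one thread */
            {
                #pragma omp section
                printf("section A\n");
                #pragma omp section
                printf("section B\n");
            }

            #pragma omp single nowait     /* executed by one thread only; nowait skips the barrier */
            printf("done by one thread\n");
        }
    }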
Shared Memory Programming
Performance Issues
Shared Data in Systems with Caches
All modern computer systems have cache memory, high-speed memory closely attached to each processor for holding recently referenced data and code.
Cache coherence protocols
Update policy - copies of data in all caches are updated at the time one copy is altered.
Invalidate policy - when one copy of data is altered, the same data in any other cache is invalidated (by resetting a valid bit in the cache). These copies are only updated when the associated processor makes a reference to it.
False Sharing
Different parts of a block may be required by different processors, but not the same bytes. If one processor writes to one part of the block, copies of the complete block in other caches must be updated or invalidated even though the actual data is not shared.
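A sketch of the standard remedy: pad per-processor data so items updated by different processors fall in different cache blocks (a 64-byte line is assumed; the names are illustrative):

    #define CACHE_LINE 64

    struct padded_counter {
        long value;
        char pad[CACHE_LINE - sizeof(long)];  /* keep neighbors in other lines */
    };

    struct padded_counter counts[8];   /* one per thread: no false sharing */
    /* versus: long counts[8];  adjacent elements share a cache line, so a write
       by one processor invalidates or updates the line in every other cache */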
Critical Sections Serializing Code
High performance programs should have as few critical sections as possible, because their use can serialize the code.
Suppose, all processes happen to come to their critical section together.
They will execute their critical sections one after the other.
In that situation, the execution time becomes almost that of a single processor.
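A sketch of the point using OpenMP's critical directive (partial_work() is a hypothetical per-iteration computation): keeping the real work outside the critical section leaves only the tiny shared update serialized.

    float partial_work(int i);            /* hypothetical parallel computation */

    float sum(int n) {
        float total = 0.0f;
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            float v = partial_work(i);    /* runs in parallel, outside the lock */
            #pragma omp critical
            total += v;                   /* only this update is serialized */
        }
        return total;
    }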
What is a GPU?
Graphics Processing Units
The Challenge
Render infinitely complex scenes
And extremely high resolution
In 1/60th of one second
Luxo Jr. 1985 took 2-3 hours per frame to render on a Cray-1 supercomputer
Today we can easily render that in 1/30th of one second
Over 300,000x faster. Still not even close to where we need to be… but look how far we've come!
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL1, University of Illinois, Urbana-Champaign
PC/DirectX Shader Model Timeline
[Timeline, 1998-2004, pairing DirectX versions with NVIDIA hardware:]
DirectX 5: Riva 128
DirectX 6 (Multitexturing): Riva TNT
DirectX 7 (T&L, TextureStageState): GeForce 256
DirectX 8 (SM 1.x): GeForce 3, Cg
DirectX 9 (SM 2.0): GeForceFX
DirectX 9.0c (SM 3.0): GeForce 6
Example games: Half-Life, Quake 3, Giants, Halo, Far Cry, UE3
Why Massively Parallel Processor
A quiet revolution and potential build-up
Calculation: 367 GFLOPS vs. 32 GFLOPS
Memory Bandwidth: 86.4 GB/s vs. 8.4 GB/s
Until last year, programmed through graphics API
GPU in every PC and workstation - massive volume and potential impact
GeForce 8800
16 highly threaded SMs, >128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to CPU
[Diagram: the host feeds an input assembler and thread execution manager; an array of SMs with parallel data caches and texture units connects through load/store paths to global memory]
G80 Characteristics
367 GFLOPS peak performance (25-50 times current high-end microprocessors)
265 GFLOPS sustained for apps such as VMD
Massively parallel, 128 cores, 90W
Massively threaded, sustains 1000s of threads per app
30-100 times speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics
"I think they're right on the money, but the huge performance differential (currently 3 GPUs ~= 300 SGI Altix Itanium2s) will invite close scrutiny so I have to be careful what I say publically until I triple check those numbers."
- John Stone, VMD group, Physics UIUC
Objective
To understand the major factors that dictate performance when using a GPU as a compute accelerator for the CPU
The feeds and speeds of the traditional CPU world
The feeds and speeds when employing a GPU
To form a solid knowledge base for performance programming in modern GPUs
Knowing yesterday, today, and tomorrow
The PC world is becoming flatter
Outsourcing of computation is becoming easier…
Future Apps Reflect a Concurrent World
Exciting applications in the future mass computing market have traditionally been considered "supercomputing applications"
Molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products
These "super-apps" represent and model the physical, concurrent world
Various granularities of parallelism exist, but…
the programming model must not hinder parallel implementation
data delivery needs careful management
Stretching from Both Ends for the Meat
New GPUs cover the massively parallel parts of applications better than CPUs
Attempts to grow current CPU architectures "out" or domain-specific architectures "in" lack success
Using a strong combination on apps is a compelling idea: CUDA
[Diagram: traditional applications under current architecture coverage, new applications under domain-specific architecture coverage, with obstacles between the two]
Bandwidth - Gravity of Modern Computer Systems
The bandwidth between key components ultimately dictates system performance
Especially true for massively parallel systems processing massive amounts of data
Tricks like buffering, reordering, and caching can temporarily defy the rules in some cases
Ultimately, the performance falls back to what the "speeds and feeds" dictate
Classic PC architecture
Northbridge connects 3 components that must communicate at high speed: CPU, DRAM, video
Video also needs to have 1st-class access to DRAM
Previous NVIDIA cards are connected to AGP, up to 2 GB/s transfers
Southbridge serves as a concentrator for slower I/O devices
[Diagram: CPU, DRAM, and video connected through the core logic chipset]
PCI Bus Specification
Connected to the southbridge
Originally 33 MHz, 32-bit wide, 132 MB/second peak transfer rate
More recently 66 MHz, 64-bit, 512 MB/second peak
Upstream bandwidth remains slow for devices (256 MB/s peak)
Shared bus with arbitration
Winner of arbitration becomes bus master and can connect to CPU or DRAM through the southbridge and northbridge
An Example of Physical Reality Behind CUDA
[Diagram: the CPU (host) connected through the northbridge to the GPU with its local DRAM (the device)]
Northbridge handles "primary" PCIe to video/GPU and DRAM. PCIe x16 bandwidth at 8 GB/s (4 GB/s each direction)
Graphics Processing Unit
Parallel Computing on a GPU
NVIDIA GPU Computing Architecture
Via a separate HW interface
In laptops, desktops, workstations, servers
G80 to G200
8-series GPUs deliver 50 to 200 GFLOPS on compiled parallel C applications
Programmable in C with CUDA tools
Multithreaded SPMD model uses application data parallelism and thread parallelism
Tesla C870, Tesla D870, Tesla S870
Tesla C1060: 1 TFLOPS
NVIDIA® Tesla™ S1070: 4 teraflop 1U system
What is GPGPU?
General Purpose computation using a GPU in applications other than 3D graphics
GPU accelerates the critical path of the application
Data parallel algorithms leverage GPU attributes
Large data arrays, streaming throughput
Fine-grain SIMD parallelism
Low-latency floating point (FP) computation
Applications - see GPGPU.org
Game effects (FX) physics, image processing
Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting
DirectX 5 / OpenGL 1.0 and Before
Hardwired pipeline
Inputs are DIFFUSE, FOG, TEXTURE
Operations are SELECT, MUL, ADD, BLEND
Blended with FOG:
RESULT = (1.0-FOG)*COLOR + FOG*FOGCOLOR
Example hardware: RIVA 128, Voodoo 1, Reality Engine, Infinite Reality
No "ops", "stages", programs, or recirculation
The 3D Graphics Pipeline
[Diagram: on the host, the Application and Scene Management stages; on the GPU, Geometry, Rasterization, Pixel Processing, and ROP/FBI/Display, backed by frame buffer memory]
The GeForce Graphics Pipeline
[Diagram: the host feeds Vertex Control (with a vertex cache), then VS/T&L, Triangle Setup, Raster, Shader (with a texture cache), ROP, and FBI, which reads and writes frame buffer memory]
Traditional Graphics Pipeline
Unified Shader Pipeline (G80~)
Feeding the GPU
GPU accepts a sequence of commands and data
Vertex positions, colors, and other shader parameters
Texture map images
Commands like "draw triangles with the following vertices until you get a command to stop drawing triangles"
Application pushes data using Direct3D or OpenGL
GPU can pull commands and data from system memory or from its local memory
CUDA
"Compute Unified Device Architecture"
General purpose programming model
GPU = dedicated super-threaded, massively data parallel co-processor
Targeted software stack
Compute oriented drivers, language, and tools
Driver for loading computation programs into the GPU
Standalone driver - optimized for computation
Interface designed for compute - graphics-free API
Guaranteed maximum download & readback speeds
Explicit GPU memory management
GeForce-8 Series HW Overview
[Diagram: the Streaming Processor Array is built from Texture Processor Clusters (TPCs); each TPC contains a texture unit (TEX) and two Streaming Multiprocessors (SMs); each SM has instruction fetch/dispatch, instruction L1 and data L1 caches, shared memory, 8 Streaming Processors (SPs), and 2 Special Function Units (SFUs)]
CUDA Programming Model: A Highly Multithreaded Coprocessor
The GPU is viewed as a compute device that:
Is a coprocessor to the CPU or host
Has its own DRAM (device memory)
Runs many threads in parallel
Data-parallel portions of an application are executed on the device as kernels which run in parallel on many threads
Differences between GPU and CPU threads:
GPU threads are extremely lightweight
Very little creation overhead
GPU needs 1000s of threads for full efficiency
Multi-core CPU needs only a few
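A minimal sketch of this model in CUDA C (names are illustrative): a data-parallel kernel run by one lightweight thread per element, launched from the host with enough blocks to cover the data.

    __global__ void add_one(float *d_a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
        if (i < n)
            d_a[i] += 1.0f;                   /* each thread handles one element */
    }

    void run(float *d_a, int n) {
        int threads = 256;                          /* threads per block */
        int blocks = (n + threads - 1) / threads;   /* enough blocks to cover n */
        add_one<<<blocks, threads>>>(d_a, n);       /* launch 1000s of threads */
    }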
Thread Batching: Grids and Blocks
A kernel is executed as a grid of thread blocks
All threads share data memory space
A thread block is a batch of threads that can cooperate with each other by:
Synchronizing their execution
Efficiently sharing data through a low latency shared memory
Two threads from two different blocks cannot cooperate
[Diagram: the host launches Kernel 1 as Grid 1, a 3x2 array of blocks, and Kernel 2 as Grid 2 on the device; Block (1, 1) is expanded into a 5x3 array of threads. Courtesy: NVIDIA]
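A sketch of launching the layout drawn above (a 3x2 grid of 5x3-thread blocks) and of how each thread derives its own element index; the kernel and parameter names are illustrative.

    __global__ void kernel(float *d_data, int width) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   /* column index */
        int y = blockIdx.y * blockDim.y + threadIdx.y;   /* row index */
        d_data[y * width + x] = 0.0f;     /* each thread touches one element */
    }

    void launch(float *d_data, int width) {
        dim3 block(5, 3);                 /* 5x3 threads per block */
        dim3 grid(3, 2);                  /* 3x2 blocks per grid */
        kernel<<<grid, block>>>(d_data, width);
    }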
CUDA Device Memory Space Overview
Each thread can:
R/W per-thread registers
R/W per-thread local memory
R/W per-block shared memory
R/W per-grid global memory
Read only per-grid constant memory
Read only per-grid texture memory
[Diagram: the device grid contains blocks; each block has its own shared memory and threads with per-thread registers and local memory; global, constant, and texture memories are shared across the grid and accessible from the host]
The host can R/W global, constant, and texture memories
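A sketch tying these spaces together (256 threads per block assumed; names are illustrative): each block stages data from global memory into its shared memory, synchronizes, then writes back.

    __global__ void reverse_in_block(float *d_global) {
        __shared__ float s[256];          /* per-block shared memory */
        int t = threadIdx.x;              /* t lives in a per-thread register */
        int base = blockIdx.x * blockDim.x;
        s[t] = d_global[base + t];        /* global -> shared */
        __syncthreads();                  /* all threads in the block wait here */
        d_global[base + t] = s[blockDim.x - 1 - t];   /* shared -> global */
    }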
Global, Constant, and Texture Memories (Long Latency Accesses)
Global memory
Main means of communicating R/W data between host and device
Contents visible to all threads
Texture and constant memories
Constants initialized by host
Contents visible to all threads
SIMT (Single Instruction, Multiple Threads)
[Diagram: execution alternates between serial code on the CPU and parallel kernels on the GPU]
What is Behind such an Evolution?
The GPU is specialized for compute-intensive, highly data parallel computation (exactly what graphics rendering is about)
So, more transistors can be devoted to data processing rather than data caching and flow control
[Diagram: the CPU devotes large areas to control logic and cache with a few ALUs; the GPU devotes most of its area to many ALUs, with little control logic and cache; both are backed by DRAM]
CPU vs. GPU
Resources
CUDA Zone: http://www.nvidia.com.tw/object/cuda_home_tw.html#
CUDA Course: http://www.nvidia.com.tw/object/cuda_university_courses_tw.html